
Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding


LQCA performs coreference resolution and context rewriting for long documents so that downstream reasoning (QA, multi‑hop inference, retrieval) can operate over unambiguous entity mentions.

This README follows the repository layout and run flow: install.sh → test_main.sh → test_predict.sh. Results are saved under result/, and logs are written to the repo root or result/.

Figure: Framework of LQCA.


📁 Repository Layout

LQCA/
├─ __pycache__/
├─ GPTFactory/
├─ data/
├─ result/
├─ pron.txt
├─ config.py
├─ main.py
├─ predict.py
├─ requirements.txt
├─ install.sh
├─ test_main.sh
├─ test_predict.sh
├─ slide.py
└─ README.md

🚀 Quick Start (Four Steps)

Step 1: Download the model and the project skeleton

  • Get the lightweight coreference model maverick-mes-ontonotes.
  • Place the model inside the project (recommended: LQCA/model/maverick-mes-ontonotes/), or set its path in config.py.

Example layout:

LQCA/model/maverick-mes-ontonotes/
├─ config.json
├─ tokenizer.json
└─ ...
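
If the model lives elsewhere, point config.py at it. Below is a minimal sketch of such a setting; the variable name COREF_MODEL_PATH and the environment-variable override are assumptions, so check config.py for the actual option.

# config.py (illustrative excerpt)
import os

# Local path to the coreference model; override via environment variable
# if the model is stored outside the repository.
COREF_MODEL_PATH = os.environ.get(
    "LQCA_COREF_MODEL_PATH",
    "model/maverick-mes-ontonotes",
)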

Step 2: Install the environment

bash install.sh
conda activate test_lqca

The default Python version is 3.9. If you use a different version, update install.sh and requirements.txt accordingly.

Step 3: Generate coref‑resolved contexts

bash test_main.sh

This script calls main.py to run mention detection → coreference clustering → context rewriting. The rewritten texts are saved to result/.
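For intuition, here is a minimal, self-contained sketch of the rewriting idea: given coreference clusters as character spans, pronominal mentions are replaced with their cluster's representative mention. The cluster format, pronoun set, and replacement policy shown here are illustrative assumptions; the actual pipeline in main.py (and the mention list in pron.txt) may differ.

# Illustrative sketch: replace pronominal mentions with the cluster's representative.
# Assumes each cluster is a list of (start, end) character spans over `text`,
# and that the first span is a reasonable representative mention.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"}

def rewrite_with_clusters(text: str, clusters: list[list[tuple[int, int]]]) -> str:
    replacements = []
    for cluster in clusters:
        rep_start, rep_end = cluster[0]
        representative = text[rep_start:rep_end]
        for start, end in cluster[1:]:
            if text[start:end].lower() in PRONOUNS:
                replacements.append((start, end, representative))
    # Apply replacements from right to left so earlier offsets stay valid.
    for start, end, rep in sorted(replacements, reverse=True):
        text = text[:start] + rep + text[end:]
    return text

# Example:
# rewrite_with_clusters("Alice met Bob. She gave him a book.",
#                       [[(0, 5), (15, 18)], [(10, 13), (24, 27)]])
# -> "Alice met Bob. Alice gave Bob a book."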

Step 4: Run evaluation

bash test_predict.sh

This script calls predict.py for evaluation (e.g., QA F1 / EM) on the rewritten outputs and writes results to result/.

🧩 Data Setup

  • Put datasets under data/, e.g. data/hotpotqa.jsonl, data/2wikimqa.jsonl.
  • Minimal JSONL schema:
{"id": "sample-0001", "input": "<long context>", "question": "<optional>", "answer": "<optional>"}

input is sufficient for rewriting. If question/answer are provided, the evaluator will use them.
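
A minimal loader for this schema could look like the sketch below; field names follow the example above, and the actual loading code in main.py may differ.

import json

def load_jsonl(path: str) -> list[dict]:
    # One JSON object per line; 'question'/'answer' are optional.
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            assert "id" in record and "input" in record, "each record needs 'id' and 'input'"
            records.append(record)
    return records

# Example: records = load_jsonl("data/hotpotqa.jsonl")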


🔍 Evaluation Overview

  • Text‑match F1 (QA style): normalize by lowercasing, removing punctuation/articles/extra spaces; compute token F1; use max‑over‑gold when multiple references exist (see the sketch after this list).

  • Rewrite diagnostics:

    • Mention span coverage
    • Pronoun reduction rate
    • Overlap resolution statistics
    • Long‑context buckets (length / #entities / cross‑sentence distance)

🧪 Direct CLI (optional)

If you prefer to bypass the scripts:

# Coref rewriting only
python main.py \
  --api_key {your_api_key} \
  --utilized_model {your_chosen_model} \
  --coref_model_path model/maverick-mes-ontonotes \
  --threshold_value 0.85 \
  --dataset_loc {dataset_loc} \
  --coref_dataset_loc {coref_dataset_loc}

# Evaluate on the rewritten results
python predict.py \
  --api_key {your_api_key} \
  --utilized_model {your_chosen_model}

Argument names above are illustrative; use the actual flags supported by main.py / predict.py / config.py (consistent with the provided scripts).


🗓 Changelog

  • 2025/08/18: Public prototype v0.1.0 — coref rewriting, max‑cover span merging, QA metrics & logging.

🔗 Citation

If this project is useful to your research or product, please cite:

@inproceedings{liubridging,
  title={Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding},
  author={Liu, Yanming and Peng, Xinyue and Cao, Jiannan and Bo, Shi and Shen, Yanxin and Du, Tianyu and Cheng, Sheng and Wang, Xun and Yin, Jianwei and Zhang, Xuhong},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}

🙏 Acknowledgements

Thanks to the open‑source community for resources in coreference and long‑context modeling (OntoNotes, PyTorch, HuggingFace, HotpotQA, 2WikiMultiHopQA, etc.). PRs improving span merging, rewrite policies, and evaluation coverage are welcome.
