LQCA performs coreference resolution and context rewriting for long documents so that downstream reasoning (QA, multi‑hop inference, retrieval) can operate over unambiguous entity mentions.
This README matches your local layout and run flow:
`install.sh` → `test_main.sh` → `test_predict.sh`. Results are saved under `result/`, and logs can be found in the repo root or `result/`.
```
LQCA/
├─ __pycache__/
├─ GPTFactory/
├─ data/
├─ result/
├─ pron.txt
├─ config.py
├─ main.py
├─ predict.py
├─ requirements.txt
├─ install.sh
├─ test_main.sh
├─ test_predict.sh
├─ slide.py
└─ README.md
```
- Get the lightweight coreference model `maverick-mes-ontonotes`.
- Place the LQCA project and the model under your local path (recommended: `LQCA/model/maverick-mes-ontonotes/`), or set the path in `config.py` (see the sketch below).
Example layout:
```
LQCA/model/maverick-mes-ontonotes/
├─ config.json
├─ tokenizer.json
└─ ...
```
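If the model lives somewhere else, you can point `config.py` at it instead. The variable name below is hypothetical; check `config.py` for the actual setting:

```python
# config.py (illustrative; the real setting name may differ)
COREF_MODEL_PATH = "model/maverick-mes-ontonotes"
```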
```bash
bash install.sh
conda activate test_lqca
```

Default Python is 3.9 (your local `py39`). If you change versions, update `install.sh`/`requirements.txt` accordingly.
```bash
bash test_main.sh
```

This script calls `main.py` to run mention detection → coreference clustering → context rewriting. The rewritten texts are saved to `result/`.
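For intuition, here is a rough sketch of the rewriting idea, assuming the coreference model has already produced clusters of character-offset mention spans. The function name and span format are illustrative, not the actual `main.py` interface:

```python
def rewrite_with_representative_mentions(text, clusters):
    """Replace non-representative mentions in each coreference cluster with the
    cluster's representative mention (here: simply the longest surface form).

    `clusters` is a list of clusters; each cluster is a list of (start, end)
    character spans into `text`. The span format is an assumption for this sketch.
    """
    replacements = []
    for cluster in clusters:
        mentions = [(s, e, text[s:e]) for s, e in cluster]
        representative = max(mentions, key=lambda m: len(m[2]))[2]
        for s, e, surface in mentions:
            if surface != representative:
                replacements.append((s, e, representative))
    # Apply edits right-to-left so earlier character offsets stay valid.
    for s, e, rep in sorted(replacements, reverse=True):
        text = text[:s] + rep + text[e:]
    return text

doc = "Marie Curie won two Nobel Prizes. She shared the first with Pierre Curie."
print(rewrite_with_representative_mentions(doc, [[(0, 11), (34, 37)]]))
# -> "Marie Curie won two Nobel Prizes. Marie Curie shared the first with Pierre Curie."
```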
```bash
bash test_predict.sh
```

This script calls `predict.py` for evaluation (e.g., QA F1 / EM) on the rewritten outputs and writes results to `result/`.
- Put datasets under `data/`, e.g. `data/hotpotqa.jsonl`, `data/2wikimqa.jsonl`.
- Minimal JSONL schema:

```json
{"id": "sample-0001", "input": "<long context>", "question": "<optional>", "answer": "<optional>"}
```

`input` is sufficient for rewriting. If `question`/`answer` are provided, the evaluator will use them.
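As an example, a record matching this schema can be written from Python like so (the output filename is just a placeholder):

```python
import json

record = {
    "id": "sample-0001",
    "input": "Marie Curie won two Nobel Prizes. She shared the first with Pierre Curie.",
    "question": "Who shared the first Nobel Prize with Marie Curie?",
    "answer": "Pierre Curie",
}

# One JSON object per line, as expected for files under data/.
with open("data/my_dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```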
- Text‑match F1 (QA style): normalize by lowercasing, removing punctuation/articles/extra spaces; compute token F1; use max‑over‑gold when multiple references exist (see the sketch after this list).
- Rewrite diagnostics:
  - Mention span coverage
  - Pronoun reduction rate
  - Overlap resolution statistics
  - Long‑context buckets (length / #entities / cross‑sentence distance)
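A minimal sketch of the QA‑style F1 and the pronoun reduction rate follows. This is not the exact `predict.py` implementation; the pronoun set is assumed to be loaded from `pron.txt`, one entry per line:

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation, articles, extra spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, gold):
    """Token-level F1 between a prediction and a single gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def qa_f1(prediction, golds):
    """Max-over-gold F1 when several reference answers exist."""
    return max(token_f1(prediction, g) for g in golds)

def pronoun_reduction_rate(original, rewritten, pronouns):
    """Fraction of pronoun tokens removed by rewriting.
    `pronouns` is a set of lowercase pronouns, e.g. loaded from pron.txt."""
    count = lambda t: sum(tok in pronouns for tok in normalize(t).split())
    before, after = count(original), count(rewritten)
    return 0.0 if before == 0 else (before - after) / before

print(qa_f1("Pierre Curie", ["Pierre Curie", "Pierre"]))  # 1.0
```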
If you prefer to bypass the scripts:
```bash
# Coref rewriting only
python main.py \
  --api_key {your_api_key} \
  --utilized_model {your_chosen_model} \
  --coref_model_path model/maverick-mes-ontonotes \
  --threshold_value 0.85 \
  --dataset_loc {dataset_loc} \
  --coref_dataset_loc {coref_dataset_loc}
```
```bash
# Evaluate on the rewritten results
python predict.py \
  --api_key {your_api_key} \
  --utilized_model {your_chosen_model}
```

Argument names above are illustrative; use the actual flags supported by your `main.py` / `predict.py` / `config.py` (consistent with your scripts).
- 2025/08/18: Public prototype v0.1.0 — coref rewriting, max‑cover span merging, QA metrics & logging.
If this project is useful to your research or product, please cite:
```bibtex
@inproceedings{liubridging,
  title={Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding},
  author={Liu, Yanming and Peng, Xinyue and Cao, Jiannan and Bo, Shi and Shen, Yanxin and Du, Tianyu and Cheng, Sheng and Wang, Xun and Yin, Jianwei and Zhang, Xuhong},
  booktitle={The Thirteenth International Conference on Learning Representations}
}
```

Thanks to the open‑source community for resources in coreference and long‑context modeling (OntoNotes, PyTorch, HuggingFace, HotpotQA, 2WikiMultiHopQA, etc.). PRs improving span merging, rewrite policies, and evaluation coverage are welcome.
