This project implements three text summarization methods to generate comprehensive summaries of the Dodd-Frank Wall Street Reform and Consumer Protection Act. It is designed for senior analysts and experts who require high-quality, technical summaries with minimal loss of context and detail.
- Generate high-quality summaries with minimal loss of context and detail
- Ensure conciseness while preserving maximum detail
- Maintain readability and natural flow
- Utilize technical language appropriate for senior analysts
- Hierarchical Summarization - Structures summaries by titles and sections, maintaining the original document organization
- Hybrid Extractive-Abstractive Summarization - Combines relevance ranking with abstractive summarization for efficiency
- Agglomerative Clustering Summarization - Uses clustering techniques to group related content before summarization
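
The hybrid method listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: the function names `rank_chunks` and `hybrid_summarize` are hypothetical, the relevance ranking is a plain bag-of-words cosine similarity swapped in for whatever ranking the notebook uses, and the abstractive step is a placeholder callable where an LLM call would normally go.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; a deliberately simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_chunks(chunks: List[str], query: str, top_n: int) -> List[str]:
    """Extractive step: keep the top_n chunks most similar to the query.

    Similarity is cosine similarity over raw term counts (an assumption;
    an embedding-based ranking would slot in the same way).
    """
    query_counts = Counter(tokenize(query))
    query_norm = math.sqrt(sum(v * v for v in query_counts.values()))

    def cosine(chunk: str) -> float:
        counts = Counter(tokenize(chunk))
        dot = sum(counts[t] * query_counts[t] for t in query_counts)
        norm = math.sqrt(sum(v * v for v in counts.values())) * query_norm
        return dot / norm if norm else 0.0

    ranked = sorted(range(len(chunks)), key=lambda i: cosine(chunks[i]), reverse=True)
    # Restore document order so the abstractive step sees a coherent sequence.
    return [chunks[i] for i in sorted(ranked[:top_n])]


def hybrid_summarize(
    chunks: List[str],
    query: str,
    top_n: int,
    abstractive: Callable[[str], str],
) -> str:
    """Rank chunks extractively, then pass only the survivors to the abstractive step."""
    selected = rank_chunks(chunks, query, top_n)
    return abstractive("\n\n".join(selected))
```

Because only `top_n` chunks reach the abstractive step, the number of tokens sent to the language model is bounded regardless of document length, which is where the efficiency of this approach comes from.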
- Readability Assessment: Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index
- Technical Complexity: Weighted scoring based on financial and legal terminology
- Coverage Analysis: BERTScore evaluation for precision, recall, and F1 scores
- Conciseness Measurement: Word count analysis
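
The three readability scores named above follow standard published formulas. The sketch below computes them with the standard library only, under two loud assumptions: syllables are estimated with a crude vowel-group heuristic, and sentences are split on `.`, `!` and `?` (so abbreviations such as "Sec." will inflate the sentence count). The function names are hypothetical; the notebook may well use a dedicated library instead.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog and word count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog treats words of three or more syllables as "complex".
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "gunning_fog": 0.4 * (words_per_sentence + 100 * complex_words / len(words)),
        "word_count": len(words),
    }
```

The `word_count` entry doubles as the conciseness measurement: dividing a summary's word count by the source's word count gives a simple compression ratio.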
- Python 3.8 or higher
- Azure OpenAI API access
- Git
- Clone the repository

      git clone <your-repo-url>
      cd summarizing_dodd_frank_project

- Create a virtual environment

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies

      pip install -r requirements.txt

- Set up environment variables

      cp .env.template .env
      # Edit .env with your Azure OpenAI credentials

- Download the Dodd-Frank Act PDF
  - Place the PDF file in the `data/` directory as `DODD_FRANK.pdf`
  - You can download it from: Congress.gov
- Open the Jupyter notebook

      jupyter notebook "summarize_dodd (1).ipynb"

- Execute cells in order to:
  - Clean and preprocess the PDF text
  - Run different summarization methods
  - Evaluate summary quality
  - Generate comparison visualizations
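
As an illustration of the cleaning and preprocessing step, the sketch below shows the kind of normalization that text extracted from a PDF typically needs. It is an assumption about what such a step might look like, not a copy of the notebook's code: `clean_pdf_text` is a hypothetical name, and the rules (rejoining hyphenated line breaks, dropping bare page-number lines, collapsing whitespace) are generic heuristics.

```python
import re


def clean_pdf_text(raw: str) -> str:
    """Normalize text extracted from a PDF while keeping paragraph breaks."""
    # Rejoin words split across a line break: "estab-\nlishes" -> "establishes".
    text = re.sub(r"(\w)-\n([a-z])", r"\1\2", raw)
    # Keep the hyphen when the next line starts with a capital,
    # which usually signals a real compound: "Dodd-\nFrank" -> "Dodd-Frank".
    text = re.sub(r"(\w)-\n([A-Z])", r"\1-\2", text)
    # Drop lines that contain nothing but a page number.
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # Treat blank lines as paragraph boundaries; collapse all other whitespace.
    paragraphs = re.split(r"\n\s*\n", text)
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in paragraphs if p)
```

The page-number rule will also remove any legitimate line that consists solely of digits, so it is worth spot-checking the output against the source PDF.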
summarizing_dodd_frank_project/
├── data/
│   └── DODD_FRANK.pdf            # Source PDF document
├── generated_summary_examples/   # Example outputs from different methods
├── .env.template                 # Environment variables template
├── .gitignore                    # Git ignore rules
├── requirements.txt              # Python dependencies
├── README.md                     # This file
└── summarize_dodd (1).ipynb      # Main analysis notebook
Create a .env file with the following variables:
AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-02-15-preview
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=chat

- Summary Length: Adjust the `top_n` parameter in the hybrid method
- Clustering: Modify `num_clusters` in agglomerative clustering
- Technical Terms: Update the `technical_terms` dictionary for complexity scoring
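
To make the role of the `technical_terms` dictionary concrete, here is a minimal sketch of a weighted complexity score. The terms, the weights, the function name `technical_complexity`, and the "weighted hits per 100 words" normalization are all assumptions chosen for illustration; the notebook's actual dictionary and scoring formula may differ.

```python
import re
from typing import Dict

# Hypothetical terms and weights, for illustration only.
technical_terms: Dict[str, float] = {
    "derivative": 2.0,
    "swap": 2.0,
    "systemic risk": 3.0,
    "orderly liquidation": 3.0,
    "proprietary trading": 3.0,
}


def technical_complexity(text: str, terms: Dict[str, float]) -> float:
    """Return weighted term occurrences per 100 words of text."""
    lowered = text.lower()
    total_words = len(re.findall(r"[a-z0-9']+", lowered))
    if total_words == 0:
        return 0.0

    weighted_hits = 0.0
    for term, weight in terms.items():
        # Whole-word match only, so "swap" does not match "swaps" or "swapped".
        pattern = r"\b" + re.escape(term) + r"\b"
        weighted_hits += weight * len(re.findall(pattern, lowered))

    return 100.0 * weighted_hits / total_words
```

Adding a term or raising its weight increases the score of any summary that uses it, so the dictionary effectively encodes which vocabulary counts as "technical" for the target audience.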
The project includes comprehensive evaluation metrics comparing different summarization approaches:
- Hierarchical Method: Best structure preservation, highest detail retention
- Hybrid Method: Most efficient, good balance of speed and quality
- Agglomerative Clustering: Good thematic organization, moderate efficiency
- Never commit API keys to version control
- Use environment variables or `.env` files for sensitive configuration
- The `.gitignore` file excludes sensitive files and temporary outputs

- Dodd-Frank Act source: Congress.gov
- LangChain framework for LLM integration
- Azure OpenAI for language model access
- Various Python libraries for NLP and machine learning
For questions or issues, please open an issue in the GitHub repository.
Note: This project is designed for educational and research purposes. Ensure compliance with all applicable terms of service when using external APIs and data sources.