Archdiner/summarizing_dodd_frank_project

Dodd-Frank Act Summarization Project

This project implements three text summarization methods and applies them to the Dodd-Frank Wall Street Reform and Consumer Protection Act. It is aimed at senior analysts and domain experts who need technical summaries that lose as little context and detail as possible.

🎯 Project Goals

  1. Preserve the context and detail of the source text
  2. Keep summaries concise without sacrificing that detail
  3. Maintain readability and natural flow
  4. Use technical language appropriate for senior analysts

📋 Features

Summarization Methods Implemented

  1. Hierarchical Summarization - Structures summaries by titles and sections, maintaining the original document organization
  2. Hybrid Extractive-Abstractive Summarization - Combines relevance ranking with abstractive summarization for efficiency
  3. Agglomerative Clustering Summarization - Uses clustering techniques to group related content before summarization
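
As a rough illustration of the extractive stage in the hybrid method, the sketch below ranks text chunks against a query with TF-IDF cosine similarity and keeps the `top_n` best. The function names, the TF-IDF scoring, and the sample query are assumptions made for illustration; the notebook's actual relevance ranking may differ. In a hybrid pipeline, only the selected chunks would then be sent to the language model for abstractive summarization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def rank_chunks(chunks, query, top_n=3):
    """Return the top_n chunks most similar to the query, in document order."""
    docs = [Counter(tokenize(c)) for c in chunks]
    n_docs = len(docs)
    # Document frequency: the number of chunks in which each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    def vectorize(counts):
        # Terms unseen in the corpus get weight 0.
        return {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vectorize(Counter(tokenize(query)))
    scores = [cosine(vectorize(d), q_vec) for d in docs]
    best = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:top_n]
    # Restore document order so the downstream summary follows the source.
    return [chunks[i] for i in sorted(best)]
```

Returning the selected chunks in document order rather than score order is a design choice: it keeps the abstractive step from reordering provisions of the Act.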

Evaluation Metrics

  • Readability Assessment: Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index
  • Technical Complexity: Weighted scoring based on financial and legal terminology
  • Coverage Analysis: BERTScore evaluation for precision, recall, and F1 scores
  • Conciseness Measurement: Word count analysis
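
To make the readability metrics concrete, here is a minimal sketch that computes the three scores from their standard formulas, plus the word count used for conciseness. The syllable counter is a crude vowel-group heuristic and the function names are hypothetical; the notebook may rely on a dedicated library instead, so treat the exact numbers as approximate.

```python
import re


def count_syllables(word):
    """Approximate syllables as vowel groups, with a silent-'e' adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog, and word count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = [count_syllables(w) for w in words]
    wps = len(words) / len(sentences)          # words per sentence
    spw = sum(syllables) / len(words)          # syllables per word
    # Gunning Fog counts words of three or more syllables as "complex".
    complex_share = sum(1 for s in syllables if s >= 3) / len(words)
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "gunning_fog": 0.4 * (wps + 100 * complex_share),
        "word_count": len(words),
    }
```

Note that sentence splitting on `.`, `!`, and `?` will miscount abbreviations such as "SEC." in legal text, which is one reason a production pipeline would likely use a proper sentence tokenizer.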

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • Azure OpenAI API access
  • Git

Installation

  1. Clone the repository

    git clone <your-repo-url>
    cd summarizing_dodd_frank_project

  2. Create a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies

    pip install -r requirements.txt

  4. Set up environment variables

    cp .env.template .env
    # Edit .env with your Azure OpenAI credentials

  5. Download the Dodd-Frank Act PDF

    • Place the PDF file in the data/ directory as DODD_FRANK.pdf
    • You can download it from: Congress.gov

Running the Project

  1. Open the Jupyter notebook

    jupyter notebook "summarize_dodd (1).ipynb"

  2. Execute cells in order to:

    • Clean and preprocess the PDF text
    • Run different summarization methods
    • Evaluate summary quality
    • Generate comparison visualizations
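
The cleaning step is not specified in detail here, so the following is only a sketch of what preprocessing extracted PDF text often involves: dropping bare page numbers, rejoining words hyphenated across line breaks, and unwrapping soft line breaks. The function name `clean_pdf_text` and the specific rules are assumptions, not the notebook's actual implementation.

```python
import re


def clean_pdf_text(raw):
    """Normalize text extracted from a PDF (hypothetical cleaning rules)."""
    text = raw.replace("\r\n", "\n")
    # Drop lines that contain only a page number.
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # Rejoin words split across a line break: "moni-\ntor" -> "monitor".
    # Caveat: this also merges genuine hyphenated compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat single newlines as soft wraps; blank lines remain paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces around newlines and cap paragraph breaks at one blank line.
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

For example, `clean_pdf_text("systemic\nrisk")` yields `"systemic risk"`, while a blank line between two paragraphs is preserved.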

📁 Project Structure

summarizing_dodd_frank_project/
├── data/
│   └── DODD_FRANK.pdf          # Source PDF document
├── generated_summary_examples/  # Example outputs from different methods
├── .env.template               # Environment variables template
├── .gitignore                  # Git ignore rules
├── requirements.txt           # Python dependencies
├── README.md                  # This file
└── summarize_dodd (1).ipynb  # Main analysis notebook

🔧 Configuration

Environment Variables

Create a .env file with the following variables:

AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-02-15-preview
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=chat
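
A small check like the one below can catch a missing or empty variable before any API call is made. It is a sketch only: `load_azure_config` is a hypothetical helper, and it assumes the values from `.env` have already been loaded into the process environment (for instance with a package such as python-dotenv, which this README does not confirm is used).

```python
import os

REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
)


def load_azure_config(env=None):
    """Return the four settings as a dict, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        # Report variable names only; never echo secret values in errors.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name].strip() for name in REQUIRED_VARS}
```

Calling `load_azure_config()` at the top of the notebook fails fast with a list of the variable names that still need to be set.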

Customization Options

  • Summary Length: Adjust top_n parameter in hybrid method
  • Clustering: Modify num_clusters in agglomerative clustering
  • Technical Terms: Update the technical_terms dictionary for complexity scoring
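
To show how a `technical_terms` dictionary could drive the complexity score, the sketch below computes weighted term hits per 100 words. The terms, their weights, the plural handling, and the per-100-words normalization are all assumptions for illustration; the notebook's dictionary and scoring formula may be different.

```python
import re

# Hypothetical terms and weights; replace with the project's own dictionary.
technical_terms = {
    "derivative": 2.0,
    "systemic risk": 3.0,
    "proprietary trading": 3.0,
    "swap": 1.5,
}


def technical_complexity(text, terms=technical_terms):
    """Return weighted technical-term hits per 100 words."""
    lowered = text.lower()
    n_words = len(re.findall(r"[a-z]+", lowered))
    if n_words == 0:
        return 0.0
    weighted_hits = 0.0
    for term, weight in terms.items():
        # Match whole words or phrases, allowing a simple plural "s".
        pattern = r"\b" + re.escape(term) + r"s?\b"
        weighted_hits += weight * len(re.findall(pattern, lowered))
    return 100.0 * weighted_hits / n_words
```

Normalizing by length keeps a long summary from scoring higher merely because it has more room for terminology.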

📊 Results and Evaluation

The notebook compares the three summarization approaches using the evaluation metrics listed above. Qualitatively:

  • Hierarchical Method: Best structure preservation, highest detail retention
  • Hybrid Method: Most efficient, good balance of speed and quality
  • Agglomerative Clustering: Good thematic organization, moderate efficiency

🔒 Security Notes

  • Never commit API keys to version control
  • Use environment variables or .env files for sensitive configuration
  • The .gitignore file excludes sensitive files and temporary outputs

🙏 Acknowledgments

  • Dodd-Frank Act source: Congress.gov
  • LangChain framework for LLM integration
  • Azure OpenAI for language model access
  • Various Python libraries for NLP and machine learning

📞 Support

For questions or issues, please open an issue in the GitHub repository.


Note: This project is designed for educational and research purposes. Ensure compliance with all applicable terms of service when using external APIs and data sources.
