Advanced Document Analyzer

A Streamlit web application for extracting text from documents, performing analysis, generating AI-powered summaries, and answering questions.

Features

- Multi-format Support: Accepts PDF, DOCX, and TXT file uploads.
- Text Extraction and Splitting: Extracts text from uploaded files and splits it into paragraphs or chunks.
- Semantic Search: Ranks document chunks against a query using TF-IDF vectorization and cosine similarity.
- AI Summarization: Generates summaries using the Ollama API.
- Question Answering: Answers user queries based on document content using the Ollama API.
- Text-to-Speech: Converts text to speech using gTTS.
- Document History: Keeps track of recently uploaded documents.
- Document Statistics: Calculates and displays document statistics.
- Document Comparison: Compares two documents for similarity.
- Export Features: Exports activity logs as CSV and analysis results as ZIP archives.
- Customizable Settings: Provides configurable options such as summary length, search result highlighting, and dark mode.
- Robust Logging: Logs user actions and any errors encountered.
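
To illustrate the Semantic Search feature, here is a minimal sketch of TF-IDF ranking with cosine similarity, written with the standard library only. It is not the code in docsummarizer.py: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `search`) and the smoothed IDF formula are assumptions chosen for illustration, and the application may well use a library vectorizer instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(chunks):
    """Return one sparse TF-IDF vector (a dict) per chunk, plus the IDF table."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    n_docs = len(tokenized)
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption): terms found in every chunk keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=3):
    """Return up to top_k (chunk, score) pairs with a positive score, best first."""
    vectors, idf = tfidf_vectors(chunks)
    query_counts = Counter(tokenize(query))
    # Query terms that never occur in the document cannot match and are dropped.
    query_vec = {t: c * idf[t] for t, c in query_counts.items() if t in idf}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunks[i], score) for score, i in scored[:top_k] if score > 0]
```

Calling `search("cat mat", chunks)` returns the chunks that share terms with the query, best match first. The same `cosine` function applied to the vectors of two whole documents gives a rough similarity score of the kind the Document Comparison feature describes.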

Installation

Clone the repository:
```
git clone <repository_url>
cd <repository_directory>
```

Create a virtual environment (recommended):
```
python -m venv venv
# Activate it with one of the following:
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
```

Install the required packages:
```
pip install -r requirements.txt
```
Set up Ollama:
- Ensure that you have Ollama installed and running.
- Verify that the Ollama API is accessible at http://localhost:11434.
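
One way to verify the Ollama endpoint before launching the app is a short reachability check. This is a sketch using only the standard library: the function name `ollama_is_reachable` is hypothetical, and it treats any HTTP response from the base URL as "reachable" without checking which models are installed.

```python
import urllib.error
import urllib.request


def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if any HTTP response comes back from base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even though it returned an error status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar problems.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_reachable())
```

If this prints `False`, start Ollama and try again before running the app.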

Usage

Run the Streamlit app:
```
streamlit run docsummarizer.py
```
Upload Documents:
- Use the file uploader to upload your PDF, DOCX, or TXT files.
Explore Features:
- View extracted text.
- Search within the document.
- Generate summaries and ask questions using the AI model.
- Listen to the text using the text-to-speech functionality.
- Compare documents if desired.
- View and export activity logs.
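
For readers curious how summary generation can talk to Ollama, the sketch below posts a prompt to Ollama's `/api/generate` endpoint with streaming turned off and reads the `response` field of the JSON reply. The model name `llama3`, the prompt wording, and the helper names `build_summary_request` and `summarize` are assumptions for illustration, not necessarily what docsummarizer.py does.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_summary_request(text, model="llama3", max_words=150):
    """Build the POST request for a non-streaming summary (model name is an assumption)."""
    prompt = (
        f"Summarize the following document in at most {max_words} words:\n\n{text}"
    )
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(text, model="llama3", max_words=150, timeout=120):
    """Send the request to a running Ollama server and return the generated text."""
    request = build_summary_request(text, model, max_words)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

Question answering can follow the same pattern with a different prompt, for example by prepending the most relevant document chunks to the user's question.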

Configuration

Adjust settings such as summary length, search result highlighting, and dark mode in the sidebar.

Logging

The application logs user actions and errors to document_analyzer.log.
You can disable logging in the sidebar settings.
The activity log can be exported as CSV for analysis.
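
As a rough idea of what a CSV export involves, the sketch below turns a list of activity entries into CSV text. The column names (`timestamp`, `action`, `detail`) and the function name `activity_log_to_csv` are hypothetical; the actual columns depend on how the application records its activity log.

```python
import csv
import io

# Hypothetical column layout; the real activity log may use different fields.
FIELDS = ["timestamp", "action", "detail"]


def activity_log_to_csv(entries):
    """Serialize a list of dict entries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for entry in entries:
        # Missing fields become empty cells; unknown fields are left out.
        writer.writerow({field: entry.get(field, "") for field in FIELDS})
    return buffer.getvalue()
```

In a Streamlit app, the returned string could be passed as the `data` argument of `st.download_button` to offer the file for download.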

Troubleshooting

Ollama API Connection Issues: Ensure Ollama is running and accessible at http://localhost:11434.
File Extraction Errors: Check the file format and ensure it's a valid PDF, DOCX, or TXT file.
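
When an upload fails to extract, it can help to check what the file actually contains rather than trusting its extension. The sketch below is one heuristic, not part of the application: the function name `sniff_document_type` is hypothetical, and it assumes a PDF begins with `%PDF-`, a DOCX is a ZIP archive containing `word/document.xml`, and a TXT file decodes as UTF-8.

```python
import io
import zipfile


def sniff_document_type(data):
    """Guess 'pdf', 'docx', or 'txt' from raw bytes; return None if nothing fits."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # ZIP container: treat it as DOCX only if the main Word part is present.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                if "word/document.xml" in archive.namelist():
                    return "docx"
        except zipfile.BadZipFile:
            return None
        return None
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "txt"
```

A result of `None`, or a type that differs from the file's extension, suggests the file is corrupted or mislabeled. Text files saved in encodings other than UTF-8 would also return `None` under this heuristic.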

License

This project is licensed under the MIT License.