Feature Analysis Workflow Documentation

This document outlines the complete workflow used to analyze the credibility factors of URLs found in conversation logs, comparing those that were "cited" (used as references) versus those that were "uncited" (mentioned in text but not used as a primary source).

Dataset Statistics Overview

Source Data

  • Total Original Conversations: 282
  • Unique Cited Conversations: 172 (Conversations containing at least one citation)

URL Counts

| Category | Total Found | Context-Relevant | After Content Filter | Successfully Analyzed |
| --- | --- | --- | --- | --- |
| Cited URLs | 251 | - | - | 229 |
| Uncited URLs | 25,194 | 16,180 | 12,892 | 9,358 |
  • Context Relevance: We restricted uncited URLs to those appearing in the 172 cited conversations.
  • Content Filter: We removed non-content files (PDFs, images) and non-content domains (e.g., YouTube, Google Docs).
  • Successfully Analyzed: URLs that were successfully fetched and processed by the AI Agent (excluding 404s, timeouts).

1. Data Collection & Extraction

URL Extraction

We began by processing the raw conversation logs (JSON format) to identify valid URLs.

  • Script: extract_urls_from_conversations.py
  • Process:
    1. Parsed conversation logs.
    2. Extracted URLs from the content_references (labeled as Cited).
    3. Extracted all other URLs from the raw message text (labeled as Uncited).
  • Refinement: We further refined the uncited list to include only URLs appearing in conversations that also contained at least one citation, ensuring comparable contexts.
    • Script: extract_uncited_urls_in_cited_conversations.py
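
A minimal sketch of the extraction logic, assuming each conversation log is a JSON file with a messages list and a per-message content_references field (the field names here are illustrative; the authoritative schema handling lives in extract_urls_from_conversations.py):

```python
import json
import re
from pathlib import Path

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def extract_urls(log_path: str) -> tuple[set[str], set[str]]:
    """Split the URLs in one conversation log into cited vs. uncited sets."""
    conversation = json.loads(Path(log_path).read_text(encoding="utf-8"))
    cited, in_text = set(), set()
    for message in conversation.get("messages", []):
        # URLs explicitly attached as references count as "cited".
        for ref in message.get("content_references", []):
            if ref.get("url"):
                cited.add(ref["url"])
        # Any URL appearing in the raw message text is a candidate.
        in_text.update(URL_RE.findall(message.get("text", "")))
    # "Uncited" = mentioned in the text but never used as a reference.
    return cited, in_text - cited
```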

Filtering

To ensure meaningful analysis, we filtered out low-quality or irrelevant URLs.

  • Script: filter_substantial_urls.py
  • Criteria:
    • Excluded file extensions like .pdf, .png, .jpg.
    • Included major social media domains (Facebook, Twitter, Reddit) to capture the full breadth of conversation sources.
    • Excluded specific video platforms (YouTube, TikTok) and document hosts (Google Docs).
  • Output: filtered_uncited_urls.csv
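
The criteria above boil down to a small predicate applied to each URL. A rough sketch, with abbreviated extension and domain lists (the authoritative lists are in filter_substantial_urls.py):

```python
from urllib.parse import urlparse

EXCLUDED_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg", ".gif")
EXCLUDED_DOMAINS = {"youtube.com", "youtu.be", "tiktok.com", "docs.google.com"}

def is_substantial(url: str) -> bool:
    """Return True if the URL should be kept for content analysis."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    if parsed.path.lower().endswith(EXCLUDED_EXTENSIONS):
        return False  # non-content files (PDFs, images)
    if host in EXCLUDED_DOMAINS:
        return False  # video platforms and document hosts
    return True       # social media and other content domains stay in
```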

2. Feature Extraction (AI Agent)

We utilized an OpenAI-powered Agent to evaluate each URL against 15 specific factors.

The Agent

  • Script: extract_features_with_agent.py
  • Model: GPT-5.2 (configured via the Agents SDK).
  • Task: Visit a URL and determine the presence (True/False) of 15 factors (F01–F15), such as "Statistics present", "Expert quote", "Fluent prose", etc.
  • Output format: JSON file per URL containing factor booleans and confidence scores.
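
The real agent browses each page itself via the Agents SDK; the stand-in below only captures the shape of the task (page text in, one present/confidence pair per factor out) using the plain Chat Completions API. The prompt wording and the truncated factor list are illustrative, and the model name simply mirrors the configuration above:

```python
import json
from openai import OpenAI

client = OpenAI()

# Only the first few factor labels are shown; F05-F15 follow the same pattern.
FACTORS = {"F01": "Statistics present", "F02": "Expert quote",
           "F03": "Inline citations", "F04": "Fluent prose"}

SYSTEM_PROMPT = (
    "For each factor, decide whether it is present on the page and give a "
    "confidence between 0.0 and 1.0. Respond with JSON shaped like "
    '{"F01": {"present": true, "confidence": 0.9}, ...}.\n\n'
    f"Factors: {json.dumps(FACTORS)}"
)

def analyze_page(url: str, page_text: str, model: str = "gpt-5.2") -> dict:
    """Ask the model for a presence/confidence judgement on every factor."""
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"URL: {url}\n\n{page_text[:20000]}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```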

Batch Processing

To process the lists efficiently, we built a batch runner.

  • Script: batch_run_agent.py
  • Features:
    • Parallel processing (AsyncIO/Semaphore).
    • Skip logic for already processed URLs.
    • Support for custom output directories.
  • Execution:
    • Cited Cohort: Processed cited_urls.csv -> Output to agent_results_cited/.
    • Uncited Cohort: Processed filtered_uncited_urls.csv -> Output to agent_results_uncited/.
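
In outline, the batch runner wraps the per-URL agent call in an asyncio semaphore and skips any URL whose result file already exists. A stripped-down sketch with illustrative names (analyze_url stands in for the real agent call in extract_features_with_agent.py):

```python
import asyncio
import json
from pathlib import Path

async def analyze_url(url: str) -> dict:
    """Placeholder for the real agent call."""
    raise NotImplementedError

async def run_batch(urls: list[str], out_dir: str, max_concurrency: int = 10) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process(index: int, url: str) -> None:
        result_path = out / f"{index:06d}.json"
        if result_path.exists():      # skip logic: URL already processed
            return
        async with semaphore:         # cap the number of concurrent agent calls
            result = await analyze_url(url)
        result_path.write_text(json.dumps(result), encoding="utf-8")

    await asyncio.gather(*(process(i, u) for i, u in enumerate(urls)))
```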

3. Data Compilation & Cleaning

Aggregation

We merged the individual JSON result files into a single structured dataset.

  • Script: compile_analysis_csv.py
  • Process:
    • Reads all JSONs from agent_results_cited/ and agent_results_uncited/.
    • Extracts metadata (fetch status, language).
    • Extracts both the factor boolean (True/False) and its confidence score (0.0 to 1.0).
  • Output: analysis_dataset.csv (Raw data).
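
A rough outline of the aggregation, assuming each result JSON carries the URL, fetch metadata, and a factors mapping (the field names are illustrative):

```python
import json
from pathlib import Path
import pandas as pd

def compile_results(result_dirs: dict[str, str]) -> pd.DataFrame:
    """Flatten per-URL JSON results from both cohorts into one table."""
    rows = []
    for cohort, directory in result_dirs.items():
        for path in Path(directory).glob("*.json"):
            data = json.loads(path.read_text(encoding="utf-8"))
            row = {
                "url": data.get("url"),
                "is_cited": int(cohort == "cited"),
                "fetch_status": data.get("fetch_status"),
                "language": data.get("language"),
            }
            for factor, value in data.get("factors", {}).items():
                row[factor] = value.get("present")                     # True/False
                row[f"{factor}_confidence"] = value.get("confidence")  # 0.0-1.0
            rows.append(row)
    return pd.DataFrame(rows)

# df = compile_results({"cited": "agent_results_cited",
#                       "uncited": "agent_results_uncited"})
# df.to_csv("analysis_dataset.csv", index=False)
```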

Cleaning & Encoding

We prepared the data for statistical analysis.

  • Script: clean_analysis_data.py
  • Process:
    1. Filter: Kept only rows where fetch_status == 'success'.
    2. Encode: Converted Boolean True/False to Integer 1/0.
    3. Threshold (Important): Applied a confidence threshold (e.g., 0.7). If the Agent marked a factor as Present (True) but the confidence score was < 0.7, it was coerced to 0 (False) to reduce false positives.
  • Output: analysis_dataset_cleaned.csv
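
The cleaning step reduces to three pandas operations. A sketch assuming the column layout from the aggregation step (F01-F15 booleans plus matching *_confidence columns):

```python
import pandas as pd

CONFIDENCE_THRESHOLD = 0.7
FACTORS = [f"F{i:02d}" for i in range(1, 16)]

df = pd.read_csv("analysis_dataset.csv")

# 1. Keep only pages the agent actually fetched.
df = df[df["fetch_status"] == "success"].copy()

for factor in FACTORS:
    # 2. Encode True/False as 1/0.
    df[factor] = df[factor].fillna(False).astype(bool).astype(int)
    # 3. Coerce low-confidence positives back to 0 to reduce false positives.
    low_confidence = df[f"{factor}_confidence"] < CONFIDENCE_THRESHOLD
    df.loc[(df[factor] == 1) & low_confidence, factor] = 0

df.to_csv("analysis_dataset_cleaned.csv", index=False)
```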

4. Statistical Analysis

Finally, we performed a multivariate logistic regression to determine which factors predict whether a URL is cited.

  • Script: analyze_factors_regression.py
  • Method: Multivariate Logistic Regression (is_cited ~ F01 + F02 + ... + F15).
  • Checks:
    • Perfect Separation: Automatically drops factors that are 0% or 100% present in either group (as they break the regression model).
    • Variance: Drops constant columns.
  • Metrics Calculated:
    • Prevalence: % presence in Cited vs. Uncited groups.
    • Odds Ratio (OR): The strength of association (>1 means more likely to be cited, <1 means less likely).
    • P-Value: Statistical significance.
    • 95% Confidence Interval.
  • Output: factor_analysis_results.csv
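
The model itself is a standard logistic regression; the sketch below uses statsmodels and includes the perfect-separation and constant-column checks described above (column names follow the cleaned dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("analysis_dataset_cleaned.csv")
factors = [f"F{i:02d}" for i in range(1, 16)]

# Drop constant columns and factors that are 0% or 100% present in either
# group, since perfect separation breaks maximum-likelihood estimation.
usable = []
for factor in factors:
    rates = df.groupby("is_cited")[factor].mean()
    if df[factor].nunique() > 1 and not ((rates == 0) | (rates == 1)).any():
        usable.append(factor)

X = sm.add_constant(df[usable])
model = sm.Logit(df["is_cited"], X).fit(disp=0)

results = pd.DataFrame({
    "coefficient": model.params,
    "odds_ratio": np.exp(model.params),
    "p_value": model.pvalues,
})
ci = np.exp(model.conf_int())            # 95% CI on the odds-ratio scale
results["ci_low"], results["ci_high"] = ci[0], ci[1]
results.drop(index="const").to_csv("factor_analysis_results.csv")
```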

Summary of Results

The final results table (factor_analysis_results.csv) highlights which credibility markers are most strongly associated with the AI's decision to cite a source.

Regression Analysis Table

| Factor | Name | Count Cited | Count Uncited | Prev. Cited | Prev. Uncited | Coefficient | Odds Ratio | 95% CI | P-Value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F07 | Early summary block | 138 | 4845 | 60.3% | 51.8% | 0.2888 | 1.3348 | [1.0115, 1.7616] | 0.0413 |
| F03 | Inline citations | 161 | 6055 | 70.3% | 64.7% | 0.1863 | 1.2048 | [0.8864, 1.6376] | 0.2341 |
| F06 | Accurate technical terms | 227 | 9132 | 99.1% | 97.6% | 0.8355 | 2.3059 | [0.5490, 9.6853] | 0.2539 |
| F14 | Credential harvesting | 5 | 161 | 2.2% | 1.7% | 0.3527 | 1.4228 | [0.5748, 3.5222] | 0.4457 |
| F02 | Expert quote | 84 | 3539 | 36.7% | 37.8% | -0.1012 | 0.9037 | [0.6813, 1.1987] | 0.4825 |
| F15 | Unverified downloads or scripts | 12 | 572 | 5.2% | 6.1% | -0.2063 | 0.8136 | [0.4512, 1.4670] | 0.4927 |
| F13 | Unverified exclusivity | 3 | 168 | 1.3% | 1.8% | -0.2387 | 0.7876 | [0.2489, 2.4930] | 0.6847 |
| F11 | Keyword stuffing | 1 | 52 | 0.4% | 0.6% | -0.1030 | 0.9022 | [0.1234, 6.5943] | 0.9192 |
| F10 | Transparent provenance | 224 | 9125 | 97.8% | 97.5% | 0.0225 | 1.0227 | [0.4138, 2.5274] | 0.9612 |
| F04 | Fluent prose | 228 | 9297 | 99.6% | 99.3% | -0.0478 | 0.9533 | [0.1252, 7.2575] | 0.9632 |
| F09 | Safety guidance | 36 | 1415 | 15.7% | 15.1% | 0.0023 | 1.0023 | [0.6972, 1.4409] | 0.9899 |
| F01 | Statistics present | 229 | 9299 | 100.0% | 99.4% | 0.0000 | 0.0000 | [0.0000, 0.0000] | 1.0000 |
| F12 | Novelty without facts | 0 | 17 | 0.0% | 0.2% | 0.0000 | 0.0000 | [0.0000, 0.0000] | 1.0000 |
| F05 | Plain language | 20 | 1485 | 8.7% | 15.9% | 0.0000 | 0.0000 | [0.0000, 0.0000] | 1.0000 (Excluded) |
| F08 | Authoritative tone | 61 | 3078 | 26.6% | 32.9% | 0.0000 | 0.0000 | [0.0000, 0.0000] | 1.0000 (Excluded) |

Note: F05 and F08 were manually excluded from this analysis iteration. F01 and F12 were dropped due to perfect separation.

Key Files Inventory:

  • batch_run_agent.py: Principal script for data collection.
  • clean_analysis_data.py: Pre-processing pipeline.
  • analyze_factors_regression.py: Statistical engine.
  • analysis_dataset_cleaned.csv: The final dataset used for the model.
  • factor_analysis_results.csv: The final output table.
