gov2-ro/tempo-ins-dump

ins.gov2.ro — downloads, normalizes, and provides a UI/navigator for the data published by the National Institute of Statistics (Institutul Național de Statistică), via TEMPO-Online (statistici.insse.ro)

INS Tempo Online but make it nice.

Screenshots: home page · dataset page

📋 Backlog · 📅 Activity History · 🤖 LLMs.txt


Running

| What | Command | URL |
| --- | --- | --- |
| Main app (FastAPI + DuckDB) | `uvicorn app.main:app --reload --port 8080` | http://localhost:8080 |
| Static UI explorers | `python -m http.server 8000` | http://localhost:8000/ui/dataset-navigator.html |
| StatExplorer (alt) | `uvicorn explorer.main:app --reload --port 8081` | http://localhost:8081 |
| DuckDB browser | `python duckdb-browser.py` | http://localhost:5000 |

Activate venv first: source ~/devbox/envs/240826/bin/activate

Web Application (app/)

FastAPI backend with DuckDB + Parquet, Vanilla JS + ECharts frontend.

app/
  main.py             — FastAPI entry, mounts API routers + static files
  config.py           — DB_PATH, PARQUET_DIR (v3), MAX_DATA_ROWS=50000
  db.py               — DuckDB cursor-per-request (concurrency-safe)
  routers/            — /api/categories, /api/datasets, /api/datasets/{id}/data, /api/datasets/{id}/download, /sdmx/

SDMX 2.1 REST API

The app exposes a minimal SDMX 2.1 REST API (agency INS) compatible with sdmxthon and the SDMX Dashboard Generator:

| Endpoint | Description |
| --- | --- |
| `GET /sdmx/2.1/data/INS,{flow}/{key}` | GenericData XML; dot-notation key with `+` as OR separator; supports `startPeriod`, `endPeriod`, `lastNObservations` query params |
| `GET /sdmx/2.1/datastructure/INS/{flow}/1.0` | DataStructure Definition (DSD) XML with codelists |
| `GET /sdmx/2.1/dataflow/INS/{flow}/1.0` | Dataflow definition XML |

Example: /sdmx/2.1/data/INS,ACC102B/.. returns all observations for dataset ACC102B.
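The dot-notation key convention can be illustrated with a small helper (hypothetical, not a function in this repo): values within a dimension are joined with `+` (OR), dimensions are joined with `.`, and an empty value list leaves the slot blank, i.e. a wildcard.

```python
def build_sdmx_key(dimensions):
    """Build an SDMX 2.1 REST data key from per-dimension value lists.

    Values within a dimension are joined with '+' (OR); dimension slots
    are joined with '.'; an empty list leaves the slot blank (wildcard).
    """
    return ".".join("+".join(values) for values in dimensions)

# Three wildcarded dimensions -> '..', as in /sdmx/2.1/data/INS,ACC102B/..
print(build_sdmx_key([[], [], []]))             # ..
print(build_sdmx_key([["1", "2"], ["TOTAL"]]))  # 1+2.TOTAL
```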

app/ (continued):
  services/           — chart_config, chart_selector, query_builder
  static/js/          — dataset-page, chart-factory, chart-geo, chart-demographic, filter-panel
  static/css/         — dataset.css, datasets.css, main.css
  static/geo/         — romania-counties/regions/macroregions.geojson

Pipeline Scripts

Sequential data pipeline — run in order. All scripts accept --lang ro|en (default: ro).

| Script | Output | Description |
| --- | --- | --- |
| `1-fetch-context.py` | `data/1-indexes/{lang}/context.csv` | Fetches the category/context hierarchy from the TEMPO API |
| `2-fetch-matrices.py` | `data/1-indexes/{lang}/matrices.csv` | Fetches the dataset list from the TEMPO API |
| `3-fetch-metas.py` | `data/2-metas/{lang}/{id}.json` | Downloads JSON metadata for each dataset |
| `4-build-meta-index.py` | `data/1-indexes/{lang}/matrices-list.csv` | Builds a summary index from the metadata JSONs |
| `5-varstats-db.py` | `data/3-db/{lang}/tempo-indexes.db` | Creates a SQLite DB from metadata |
| `6-fetch-csv.py` | `data/4-datasets/{lang}/` | Downloads raw CSV data files from the TEMPO API |
| `7-data-compactor.py` | `data/5-compact-datasets/{lang}/` | Replaces text labels with numeric IDs in CSVs |
| `8-setup-duckdb-schema.py` | `data/tempo_metadata.duckdb` | Creates the DuckDB schema (contexts, matrices, dimensions) |
| `9-csv-to-parquet.py` | `data/parquet/ro/` | Converts compacted CSVs to Parquet |
| `10-import-metadata.py` | DuckDB tables | Imports all metadata into DuckDB |
| `10-classify-dimensions.py` | `dimension_options_parsed`, `matrix_profiles` | Parses/classifies dimension options, detects archetypes |
| `10-sdmx-export.py` | `data/6-sdmx-csv/ro/` | Converts compacted CSVs to SDMX-CSV 2.0 |
| `11-build-sdmx-codes.py` | DuckDB code mapping tables | Builds SDMX code mappings in DuckDB |
| `11-coverage-profiler.py` | `dataset_coverage` DuckDB table | Analyzes data completeness per dataset |
| `12-parquet-to-sdmx.py` | `data/parquet-v3/ro/` | Transforms parquet-v2 to SDMX-native parquet-v3 |
| `12-split-datasets.py` | `data/parquet-v3/ro/` | Splits inconsistent datasets into clean sub-datasets |
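The compaction step (`7-data-compactor.py`) boils down to a dictionary substitution over CSV cells; a minimal sketch (the label map and rows here are made up — the real script derives the mapping from each dataset's metadata JSON):

```python
def compact_rows(rows, label_to_id):
    """Replace text labels with their numeric IDs; unmapped cells pass through."""
    for row in rows:
        yield [label_to_id.get(cell, cell) for cell in row]

labels = {"Total": "1", "Urban": "2", "Rural": "3"}  # illustrative mapping
raw = [["Total", "2021", "19053815"],
       ["Urban", "2021", "10287591"]]
print(list(compact_rows(raw, labels)))
# [['1', '2021', '19053815'], ['2', '2021', '10287591']]
```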

Incremental Update (update-pipeline.py)

Orchestrates incremental updates from the INS news feed — processes only datasets updated since the last run.

python update-pipeline.py                        # auto-incremental (since last run)
python update-pipeline.py --refetch-news         # re-fetch news CSV first
python update-pipeline.py --since 01.03.2026     # explicit date filter
python update-pipeline.py --all                  # ignore last run, process all news
python update-pipeline.py --matrix ACC101B       # single dataset
python update-pipeline.py --force                # re-download CSVs + parquets
python update-pipeline.py --force-meta           # re-fetch metadata only (date sync, no CSV re-download)
python update-pipeline.py --dry-run              # preview without executing

Per-matrix steps: fetch metadata JSON → download CSV → convert to parquet → SDMX transform → split if needed → view profile. After all matrices: rebuild meta index, sync ultima_actualizare dates to DuckDB. Saves data/logs/last-pipeline-run.txt on completion — used as auto --since on next run.
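The auto-`--since` filtering can be sketched as follows (assumptions: news-feed dates use the same `dd.mm.yyyy` format as the `--since` flag, and the field names here are hypothetical):

```python
from datetime import datetime

def parse_ins_date(s):
    """Parse the dd.mm.yyyy date format used by the --since flag."""
    return datetime.strptime(s.strip(), "%d.%m.%Y")

def select_updated(news_items, since):
    """Keep only news items dated on or after `since` (incremental mode)."""
    return [item for item in news_items
            if parse_ins_date(item["date"]) >= since]

since = parse_ins_date("01.03.2026")  # e.g. read from last-pipeline-run.txt
news = [{"matrix": "ACC101B", "date": "15.03.2026"},
        {"matrix": "ACC102B", "date": "10.02.2026"}]
print([n["matrix"] for n in select_updated(news, since)])  # ['ACC101B']
```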

Other root-level scripts

| Script | Description |
| --- | --- |
| `generate_view_profiles.py` | Generates per-dataset JSON view profiles → `data/corpus/view-profiles/` |
| `build-geo-regions.py` | Dissolves county GeoJSON into regions + macroregions |
| `split_rules.py` | Split rules engine — classifies datasets needing structural splits |
| `detect_trends.py` | Detects trends, YoY growth, seasonality → `dataset_trends` DuckDB table |
| `duckdb_config.py` | Central config: paths for all DuckDB/Parquet processing |
| `duckdb-browser.py` | Flask browser for exploring DuckDB + Parquet data |
| `build-dataset-metadata.py` | Scans CSVs, calculates stats → `ui/data/dataset-metadata.json` |
| `get-news.py` | Scrapes INS news/press releases → `data/insse_news.csv` |
| `test_chart_selector.py` | Tests the chart selection engine across all datasets |

utils/

| Script | Description |
| --- | --- |
| `14-parquet-to-ids.py` | Converts Parquet from text labels → numeric IDs → `data/parquet-v2/ro/` |
| `13-slim-samples-to-markdown.py` | Converts slim-sample CSVs to markdown for LLM analysis |
| `12-csv-headers-index.py` | Extracts headers from all CSVs → `data/2-metas/csv-headers-index.csv` |
| `11-slim-samples.py` | Samples up to 100 rows per dataset → `data/4-datasets-slim-samples/` |
| `build-dimension-index.py` | Builds a searchable SQLite index from the metadata JSONs |
| `build-enhanced-navigator-index.py` | Builds optimized SQLite + JSON indexes for the dataset navigator UI |
| `build-static-index.py` | Generates static JSON indexes for the client-side explorer |
| `query-dimensions.py` | CLI query tool for the dimension index SQLite DB |
| `query-duckdb.py` | Query helper for DuckDB + Parquet |
| `explore-data.py` | Exploration script showing DuckDB + Parquet integration patterns |
| `export-db-to-json.py` | Exports the dimension index SQLite → `ui/data/dimension_index.json` |
| `check-meta-consistency.py` | Validates consistency between metadata directories |
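The slim-sampling idea (`11-slim-samples.py`) is essentially "header plus the first N rows"; a minimal sketch over an iterable of parsed CSV rows (the actual script's sampling strategy may differ):

```python
from itertools import islice

def slim_sample(rows, n=100):
    """Return the header row plus at most n data rows."""
    it = iter(rows)
    header = next(it)
    return header, list(islice(it, n))

# Illustrative data: a header plus 250 data rows, sampled down to 100
rows = [["an", "valoare"]] + [[str(2000 + i), str(i)] for i in range(250)]
header, sample = slim_sample(rows, n=100)
print(header, len(sample))  # ['an', 'valoare'] 100
```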

profiling/

| Script | Description |
| --- | --- |
| `data_profiler.py` | Main profiler: validates CSV structure, classifies column types, generates reports |
| `variable_classifier.py` | Classifies variable labels using a CSV ruleset |
| `unit_classifier.py` | Classifies unit-of-measure labels semantically |
| `validation_rules.py` | Modular validation framework (column names, data content, file structure) |
| `build_indexes.py` | Builds keyword/theme indexes from datasets → `data/indexes/` |
| `tool-list-headers.py` | Extracts CSV headers → `data/2-csv-cols/ro/` |
| `tool-sample-csvs.py` | Creates sampled CSVs (first/mid/last 5 rows) → `data/datasets-samples/ro/` |
| `tool-word-frequency.py` | Romanian word-frequency analysis of dataset titles |
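A title word-frequency pass along the lines of `tool-word-frequency.py` reduces to tokenizing and counting; a sketch with a tiny illustrative stop list (the real script's tokenization and stop words are not shown here):

```python
import re
from collections import Counter

STOP = {"pe", "si", "de", "la", "in", "dupa"}  # tiny illustrative stop list

def title_word_freq(titles):
    """Count lowercase word frequencies across dataset titles."""
    words = (w for title in titles
             for w in re.findall(r"\w+", title.lower())
             if w not in STOP)
    return Counter(words)

freq = title_word_freq(["Populatia dupa domiciliu", "Populatia rezidenta"])
print(freq["populatia"], freq["domiciliu"])  # 2 1
```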

Data

data/
  1-indexes/{lang}/        context.csv, matrices.csv
  2-metas/{lang}/          {dataset-id}.json — metadata per dataset
  3-db/{lang}/             tempo-indexes.db (SQLite, legacy)
  4-datasets/{lang}/       raw CSVs from TEMPO API
  4-datasets-slim-samples/ 50/ and 100/ row samples for LLM analysis
  5-compact-datasets/      CSVs with numeric IDs instead of labels
  6-sdmx-csv/ro/           SDMX-CSV 2.0 output
  corpus/
    parquet/             Canonical SDMX parquet files — 3,632 files ← used by app
    metadata.duckdb      Main DuckDB metadata (16 tables)
    view-profiles/       Per-dataset JSON view profiles — ~3,800 files
  parquet-v2/ro/           Parquet v2 (numeric IDs) — legacy fallback
  meta/                    Reference data (judet CSVs, SIRUTA)
  logs/                    Pipeline execution logs

Historical data snapshots in data-old/, data-25-1/, data-2026/.

Deployment

  • Dockerfile + fly.toml — Fly.io deployment (shared-cpu-1x, 512MB, Amsterdam region)
  • scripts/prepare-deploy-data.sh — Stages parquet-v3 + v2 fallbacks + DuckDB + view-profiles into deploy-data/
  • deploy/oracle/ — Oracle Cloud deployment (systemd + nginx)
  • deploy/hf-spaces/ — Hugging Face Spaces deployment

Roadmap

Done

  • Fetch index, download CSVs, compact data
  • Import into DuckDB + Parquet (v1 → v2 → v3)
  • Fetch English labels (bilingual support)
  • FastAPI backend (app/) with DuckDB + Parquet
  • Chart framework (archetypes: geo_time, demographic, time_residence, time_series)
  • Choropleth map, demographic grouped bar, line charts
  • Dataset splitting (by county, age groups, multi-UM)
  • View profiles (3,883 JSON configs)
  • SDMX code mappings + parquet-v3 conversion
  • Deployment setup (Fly.io, Oracle, HF Spaces)
  • Define charting rules + JSON chart profiles per dataset
  • Filter datasets by dimension, context info, permalinks
  • 30k cells limit handling in fetch

Current

  • UI polish — responsive layout, chart label truncation
  • URL state persistence (filters, period, chart type in URL)
  • Monthly → yearly aggregation toggle

Later

  • SDMX generic UI framework (multi-source: Eurostat, OECD)
  • NL2SQL natural language queries
  • Notebook-ready exports, publish to Kaggle
  • Basic stats/charts per locality (normalize to population)
  • Static site migration (DuckDB-WASM, Cloudflare Pages)

see also: ui/readme.md

Notes

Kill the process listening on port 5050:

lsof -i :5050 | grep -v COMMAND | awk '{print $2}' | xargs kill -9 2>/dev/null && echo "Killed process on port 5050" || echo "No process found on port 5050"

Attention! Nomenclatures that show only the "Total" option are automatically populated with other options only if, in the preceding nomenclature, "Total" is deselected and a single other option is chosen instead.

Important! All units of measure are selected by default to prevent empty results when incompatible requests are combined (for example, maize production in litres could otherwise be requested; or a value in ROL after the 2005 redenomination, etc.).


docs/ Summary

Core Architecture

  • DUCKDB_SPECS.md — Schema design for the DuckDB + Parquet hybrid: tables, file structure, query examples. The canonical DB spec.
  • DUCKDB_GUIDE.md — Practical query patterns and performance tips for DuckDB + Parquet usage.
  • classify-dimensions.md — Spec for dimension classification/normalization: semantic types, parsing rules, archetype detection.

Application Specs

  • app-spec.md — Full v1 spec: FastAPI + DuckDB backend, ECharts frontend, all phases from data prep to UI polish.
  • app-spec-v2.md — v2 redesign spec. Discovery-first approach leveraging enriched metadata (tags, trends, relationships, chart recommendations).
  • chart-framework-spec.md — Generic chart selection engine: 15 chart types, dimension role assignment, filter system.

Data & Profiling

  • data analysis.md — Framework for systematic profiling, classification, and dashboard generation.
  • data notes.md — Raw notes on data quirks, dimension patterns, edge cases.
  • TODO_COMPACTION.md — Known issues with CSV compaction and label normalization fixes.
  • PROFILING_AND_EXPLORER.md — Guide to the CSV profiler and Explorer UI: output paths, validation flags, API endpoints.

UI / Legacy

Deployment

Agents Pipeline (docs/agents/)

  • README.md — Overview of the 6-agent enrichment pipeline (produces value_profiles, coverage, trends, tags, relationships, chart_recs).
  • pipeline.md — Orchestration: 3 phases, 7 agents, execution order and verification steps.
  • phase1-value-profiler.md — Agent 1A: min/max/mean/percentiles/distribution per dataset.
  • phase1-coverage-profiler.md — Agent 1B: time/geo coverage, fill rate, sparsity.
  • phase1-trend-detector.md — Agent 1C: trend direction, YoY growth, seasonality, breakpoints.
  • phase2-topic-tagger.md — Agent 2A: bilingual semantic tags from context + dataset names.
  • phase2-dimension-overlap.md — Agent 2B: related-dataset discovery via dimension fingerprints.
  • phase2-chart-recommender.md — Agent 2C: data-driven chart type recommendations.
  • phase3-ia-designer.md — Agent 3A: generates the full information architecture spec from enriched metadata.

Obsolete - archived

StatExplorer (explorer/)

Alternative Tableau-inspired explorer with i18n and component-based JS architecture.

explorer/
  main.py             — FastAPI ("StatExplorer")
  services/           — chart_selector, query_builder, translations
  static/js/charts/   — bar, geo, line, heatmap, pyramid, bubble, small-multiples, table
  static/js/components/ — ChartCanvas, DatasetPicker, FilterBar, LeftSidebar, TopNav
  static/js/lib/      — api, i18n, utils

Moved to docs/misc-ideas/explorer


Static site exploration

see docs/misc-ideas/static-site, STATIC-CONVERSION-SUMMARY.md, STATIC-VS-PYTHON-COMPARISON.md

About

Interface for the datasets published by the National Institute of Statistics (Institutul Național de Statistică), via TEMPO-Online
