
Add OPD ingestion pipeline #326

@jonesrussell


Summary

Add the Ojibwe People's Dictionary (OPD) as a first-class ingestion source in NorthCloud. The scraping, schema design, and validation have been completed in waaseyaa/sandbox-ojibwe.

Context

  • 22,197 entry URLs discovered from OPD browse index
  • 21,358 entries scraped and validated (99.2% of the 21,538 processed; the other 659 discovered URLs appear in neither the entries nor the failures count)
  • 180 failures logged (missing lemma on variant pages; under investigation)
  • Full NorthCloud ingestion schema designed (JSON Schema draft-07)
  • Pydantic model validated against sample dataset

Dataset Location

  • Repo: waaseyaa/sandbox-ojibwe
  • Entries: data/all_entries.jsonl (21,358 validated entries)
  • Failures: data/failures.jsonl (180 entries)
  • Discovered URLs: data/entry_urls.json (22,197 URLs)
  • Schemas: specs/canonical-dictionary-schema.json, specs/opd-source-metadata-schema.json
  • Examples: specs/examples.json
  • Full spec: specs/ingestion-spec.md
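
A minimal sketch of how the JSONL files above could be streamed one record at a time, assuming each non-blank line holds one JSON object. The helper name iter_jsonl is hypothetical and not part of the sandbox repo.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_jsonl(path: str) -> Iterator[Tuple[int, dict]]:
    """Yield (line_number, record) for each non-blank line of a JSONL file.

    Hypothetical helper: streams the file so a large dataset never has to
    be held in memory all at once.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            yield line_number, json.loads(line)
```

Usage would look like `for line_number, entry in iter_jsonl("data/all_entries.jsonl"): ...`, with the line number kept so a bad record can be reported by position.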

Schema

The canonical dictionary schema supports any structured language reference source. OPD-specific fields live in a source_metadata envelope. Key fields:

  • lemma, word_class, word_class_normalized
  • definitions[] (with language codes)
  • inflections (raw + parsed forms + stem)
  • examples[] (Ojibwe text + English translation pairs)
  • word_family[] (related entries with relationship labels)
  • media[] (audio with speaker metadata + copyright)
  • attribution, license (CC BY-NC-SA 4.0), consent (defaults false)
  • raw_html + content_hash (SHA-256 for change detection)
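
To make the shape of these fields concrete, here is a rough sketch using standard-library dataclasses rather than the actual Pydantic model. The individual consent flag names, the choice of raw_html as the hash input, and the placeholder values are all assumptions; only the field names listed above, the license string, and the false defaults come from this issue.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Consent:
    # Flag names are assumptions; the issue only states that consent defaults to false.
    public_display: bool = False
    ai_training: bool = False
    derivative_works: bool = False


@dataclass
class DictionaryEntry:
    lemma: str
    word_class: str
    raw_html: str
    definitions: list = field(default_factory=list)
    # OPD-specific fields travel in this envelope, outside the canonical fields.
    source_metadata: dict = field(default_factory=dict)
    license: str = "CC BY-NC-SA 4.0"
    consent: Consent = field(default_factory=Consent)

    @property
    def content_hash(self) -> str:
        # Assumption: the hash covers the UTF-8 bytes of the raw HTML.
        return hashlib.sha256(self.raw_html.encode("utf-8")).hexdigest()


# Placeholder values for illustration only, not taken from the dataset.
entry = DictionaryEntry(
    lemma="example-lemma",
    word_class="example-class",
    raw_html="<div>example-lemma</div>",
    definitions=[{"language": "en", "text": "example definition"}],
)
```

Deriving content_hash from raw_html instead of storing it independently keeps the two from drifting apart, though the real schema may store both as plain fields.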

Implementation Steps

  1. Create sources/opd/ module in ingestion service
  2. Implement transformer: sandbox Pydantic model → canonical NorthCloud schema
  3. Implement validator: JSON Schema draft-07 validation
  4. Implement content hash diffing for monthly re-scrape updates
  5. Add HTML fixtures + unit tests (see test plan in specs/ingestion-spec.md)
  6. Integration test: ingest 100 random entries end-to-end
  7. Bulk import of 21,358 entries from all_entries.jsonl
  8. Expose /dictionary/entries, /dictionary/words/{id}, /dictionary/search API endpoints
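
Step 2 could be sketched roughly as follows, assuming the sandbox record is a flat dict and that any key outside the canonical field list belongs in the source_metadata envelope. The function name, the opd_url key used in the usage note, and the required-field check are hypothetical; the check is a simple stand-in, not the JSON Schema draft-07 validation from step 3.

```python
CANONICAL_FIELDS = {
    "lemma", "word_class", "word_class_normalized", "definitions",
    "inflections", "examples", "word_family", "media",
    "attribution", "license", "consent", "raw_html", "content_hash",
}
# Assumption: these two fields are mandatory; the real schema decides this.
REQUIRED_FIELDS = ("lemma", "word_class")


def to_canonical(sandbox_record: dict) -> dict:
    """Split a sandbox record into canonical fields plus a source_metadata envelope."""
    canonical = {k: v for k, v in sandbox_record.items() if k in CANONICAL_FIELDS}
    extras = {k: v for k, v in sandbox_record.items() if k not in CANONICAL_FIELDS}
    canonical["source_metadata"] = {"source": "opd", **extras}

    # Stand-in for schema validation: reject records missing required fields.
    missing = [name for name in REQUIRED_FIELDS if not canonical.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return canonical
```

For example, a record with lemma, word_class, and an opd_url key would come back with opd_url moved under source_metadata, while a record without a lemma would raise ValueError, which matches the failure mode reported for variant pages.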

License & Consent

  • OPD content is CC BY-NC-SA 4.0
  • All consent flags default to false (no public display, no AI training, no derivative works)
  • Attribution required on every use: "Ojibwe People's Dictionary, University of Minnesota"
  • No OPD content in commercial tiers without explicit authorization
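
One way the default-deny rule above might be enforced at render time is sketched below. The consent key names (public_display and so on) and both function names are assumptions; the attribution string is the one quoted above.

```python
from typing import Optional

ATTRIBUTION = "Ojibwe People's Dictionary, University of Minnesota"


def is_use_permitted(entry: dict, use: str) -> bool:
    """Return True only when the named consent flag is explicitly True.

    A missing consent block or a missing flag counts as refusal, which
    mirrors the "defaults to false" rule.
    """
    return entry.get("consent", {}).get(use, False) is True


def render_public(entry: dict) -> Optional[str]:
    """Return a display string with attribution, or None if display is not consented."""
    if not is_use_permitted(entry, "public_display"):
        return None
    return f"{entry['lemma']} ({ATTRIBUTION})"
```

Comparing with `is True` rather than relying on truthiness means a stray string such as "false" in imported data cannot be mistaken for consent.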

Update Strategy

  • Monthly re-scrape recommended
  • SHA-256 content hash comparison for change detection
  • Field-level diffs stored in data/diffs/v{date}/
  • Deleted entries flagged for review, not auto-removed
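
The comparison described above might look like the following sketch. It assumes each scrape can be reduced to a mapping of entry ID to SHA-256 hash and that entries are flat dicts; both function names and the result keys are hypothetical.

```python
def classify_changes(previous: dict, current: dict) -> dict:
    """Compare two {entry_id: content_hash} snapshots from consecutive scrapes."""
    prev_ids, cur_ids = set(previous), set(current)
    return {
        "added": sorted(cur_ids - prev_ids),
        "changed": sorted(i for i in prev_ids & cur_ids if previous[i] != current[i]),
        # Entries missing from the new scrape are only flagged, never auto-removed.
        "flagged_for_review": sorted(prev_ids - cur_ids),
    }


def field_diff(old: dict, new: dict) -> dict:
    """Return {field: {"old": ..., "new": ...}} for every top-level field that differs."""
    keys = sorted(set(old) | set(new))
    return {
        key: {"old": old.get(key), "new": new.get(key)}
        for key in keys
        if old.get(key) != new.get(key)
    }
```

The hash comparison is the cheap first pass; field_diff would run only on the IDs reported as changed, and its output is the kind of record that could be written under data/diffs/v{date}/.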
