Summary
Add the Ojibwe People's Dictionary (OPD) as a first-class ingestion source in NorthCloud. The scraping, schema design, and validation have been completed in waaseyaa/sandbox-ojibwe.
Context
- 22,197 entry URLs discovered from OPD browse index
- 21,358 entries scraped and validated (99.2% success rate, measured against the 21,538 entries that were either validated or logged as failures)
- 180 failures logged (missing lemma on variant pages — under investigation)
- Full NorthCloud ingestion schema designed (JSON Schema draft-07)
- Pydantic model validated against sample dataset
Dataset Location
- Repo: waaseyaa/sandbox-ojibwe
- Entries: data/all_entries.jsonl (21,358 validated entries)
- Failures: data/failures.jsonl (180 entries)
- Discovered URLs: data/entry_urls.json (22,197 URLs)
- Schemas: specs/canonical-dictionary-schema.json, specs/opd-source-metadata-schema.json
- Examples: specs/examples.json
- Full spec: specs/ingestion-spec.md
Schema
The canonical dictionary schema supports any structured language reference source. OPD-specific fields live in a source_metadata envelope. Key fields:
- lemma, word_class, word_class_normalized
- definitions[] (with language codes)
- inflections (raw + parsed forms + stem)
- examples[] (Ojibwe text + English translation pairs)
- word_family[] (related entries with relationship labels)
- media[] (audio with speaker metadata + copyright)
- attribution, license (CC BY-NC-SA 4.0), consent (defaults false)
- raw_html + content_hash (SHA-256 for change detection)
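To make the shape of the key fields concrete, here is a minimal sketch using standard-library dataclasses rather than the actual Pydantic model. The field names come from the list above; the field types, the three consent flag names, and the choice to hash raw_html are assumptions for illustration, not taken from the specs.

```python
from dataclasses import dataclass, field
from typing import Any
import hashlib


@dataclass
class Consent:
    # Hypothetical flag names; the issue only states that consent defaults to false.
    public_display: bool = False
    ai_training: bool = False
    derivative_works: bool = False


@dataclass
class DictionaryEntry:
    lemma: str
    word_class: str
    word_class_normalized: str
    raw_html: str
    # Container types below are assumed, not copied from the JSON Schema.
    definitions: list[dict[str, str]] = field(default_factory=list)
    inflections: dict[str, Any] = field(default_factory=dict)
    examples: list[dict[str, str]] = field(default_factory=list)
    word_family: list[dict[str, str]] = field(default_factory=list)
    media: list[dict[str, Any]] = field(default_factory=list)
    attribution: str = "Ojibwe People's Dictionary, University of Minnesota"
    license: str = "CC BY-NC-SA 4.0"
    consent: Consent = field(default_factory=Consent)
    # OPD-specific fields live in this envelope.
    source_metadata: dict[str, Any] = field(default_factory=dict)
    content_hash: str = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: the SHA-256 hash is taken over the raw HTML of the entry page.
        self.content_hash = hashlib.sha256(self.raw_html.encode("utf-8")).hexdigest()


# Placeholder values only; this is not a real OPD entry.
entry = DictionaryEntry(
    lemma="placeholder-lemma",
    word_class="na",
    word_class_normalized="noun",
    raw_html="<div>placeholder</div>",
)
```

An entry built this way carries the license, attribution, and all-false consent defaults without the caller having to set them.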
Implementation Steps
- Create sources/opd/ module in ingestion service
- Implement transformer: sandbox Pydantic model → canonical NorthCloud schema
- Implement validator: JSON Schema draft-07 validation
- Implement content hash diffing for monthly re-scrape updates
- Add HTML fixtures + unit tests (see test plan in specs/ingestion-spec.md)
- Integration test: ingest 100 random entries end-to-end
- Bulk import of 21,358 entries from all_entries.jsonl
- Expose /dictionary/entries, /dictionary/words/{id}, /dictionary/search API endpoints
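The bulk import step could start from something like the sketch below, which streams JSONL records and separates importable entries from rejects. The required-field check is only a stand-in for the JSON Schema draft-07 validator; REQUIRED_FIELDS, the function names, and the sample records are hypothetical.

```python
import io
import json
from typing import Iterable, Iterator

# Assumed minimal set; the authoritative list lives in the draft-07 schema.
REQUIRED_FIELDS = ("lemma", "word_class", "definitions")


def iter_jsonl(lines: Iterable[str]) -> Iterator[tuple[int, dict]]:
    """Yield (line_number, record) for each non-blank JSONL line."""
    for number, line in enumerate(lines, start=1):
        if line.strip():
            yield number, json.loads(line)


def partition_entries(lines: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Split records into importable entries and rejects with a reason."""
    accepted, rejected = [], []
    for number, record in iter_jsonl(lines):
        # A field counts as missing when it is absent or empty.
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            rejected.append({"line": number, "missing": missing})
        else:
            accepted.append(record)
    return accepted, rejected


# Placeholder records standing in for data/all_entries.jsonl.
sample = io.StringIO(
    '{"lemma": "placeholder-a", "word_class": "na", '
    '"definitions": [{"lang": "en", "text": "placeholder"}]}\n'
    '{"lemma": "", "word_class": "vai", "definitions": []}\n'
)
accepted, rejected = partition_entries(sample)
print(len(accepted), rejected)
# prints: 1 [{'line': 2, 'missing': ['lemma', 'definitions']}]
```

Because the function accepts any iterable of lines, the same code works on an open file handle, so the full dataset never has to be held in memory as raw text. A malformed JSON line raises json.JSONDecodeError here; a real importer would probably log it and continue.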
License & Consent
- OPD content is CC BY-NC-SA 4.0
- All consent flags default to false (no public display, no AI training, no derivative works)
- Attribution required on every use: "Ojibwe People's Dictionary, University of Minnesota"
- No OPD content in commercial tiers without explicit authorization
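One way to enforce these rules in code is a default-deny gate, sketched below. The flag names, the entry dict layout, and the rendered credit format are assumptions; only the attribution string, the license name, and the false defaults come from the list above.

```python
ATTRIBUTION = "Ojibwe People's Dictionary, University of Minnesota"
LICENSE = "CC BY-NC-SA 4.0"

# Hypothetical flag names mirroring the three uses named in the issue.
DEFAULT_CONSENT = {
    "public_display": False,
    "ai_training": False,
    "derivative_works": False,
}


def is_permitted(entry: dict, use: str) -> bool:
    """Allow a use only when its consent flag is explicitly True."""
    if use not in DEFAULT_CONSENT:
        raise ValueError(f"unknown use: {use!r}")
    # Missing flags fall back to the all-false defaults.
    consent = {**DEFAULT_CONSENT, **entry.get("consent", {})}
    return consent[use] is True


def render_for_display(entry: dict) -> str:
    """Return a display string with attribution, or refuse without consent."""
    if not is_permitted(entry, "public_display"):
        raise PermissionError("public display is not consented for this entry")
    # Illustrative credit format; the issue does not prescribe one.
    return f"{entry['lemma']} (Source: {ATTRIBUTION}, {LICENSE})"
```

Checking `is True` rather than truthiness means a stray string such as "yes" in the data does not silently grant consent.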
Update Strategy
- Monthly re-scrape recommended
- SHA-256 content hash comparison for change detection
- Field-level diffs stored in data/diffs/v{date}/
- Deleted entries flagged for review, not auto-removed
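The update strategy above can be sketched as a comparison of two scrape snapshots. This assumes snapshots are dicts keyed by entry URL and that each entry carries a content_hash computed over its raw HTML; the function names and report keys are hypothetical.

```python
import hashlib


def content_hash(raw_html: str) -> str:
    """SHA-256 hex digest of the raw HTML (assumed hashing input)."""
    return hashlib.sha256(raw_html.encode("utf-8")).hexdigest()


def classify_changes(previous: dict[str, dict], current: dict[str, dict]) -> dict[str, list[str]]:
    """Compare two snapshots keyed by entry URL using their content hashes."""
    report: dict[str, list[str]] = {
        "added": [], "changed": [], "unchanged": [], "review_deleted": [],
    }
    for url, entry in current.items():
        if url not in previous:
            report["added"].append(url)
        elif previous[url]["content_hash"] != entry["content_hash"]:
            report["changed"].append(url)
        else:
            report["unchanged"].append(url)
    # Entries missing from the new scrape are flagged for review, never removed here.
    report["review_deleted"] = [url for url in previous if url not in current]
    return report


def field_diff(old: dict, new: dict) -> dict[str, dict]:
    """Field-level diff: every key whose value differs between two versions."""
    keys = sorted(set(old) | set(new))
    return {
        key: {"old": old.get(key), "new": new.get(key)}
        for key in keys
        if old.get(key) != new.get(key)
    }
```

Only URLs in the "changed" bucket need the more expensive field_diff pass, which keeps a monthly run cheap when most pages are unchanged.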