
keithporcaro/deedscrape


Orange County Deed Scraper

Install

Clone this repo, then run npm install or pnpm install.

Scrape

node poolScrape.mjs <START_BOOK> <END_BOOK> <START_PAGE> <CONCURRENCY>

e.g., node poolScrape.mjs 44 48 1 3

START_BOOK - number of the book you'd like to start with
END_BOOK - number of the book you'd like to end with
START_PAGE - page to start with. Defaults to 1
CONCURRENCY - maximum number of books this will scrape at a time. Defaults to 3
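As a rough illustration of how these four arguments and their defaults fit together, here is a minimal sketch of an argument parser. The function name parseScrapeArgs and the validation rules (positive integers, END_BOOK not smaller than START_BOOK) are assumptions made for this example; poolScrape.mjs may handle its arguments differently.

```javascript
// Hypothetical helper: parseScrapeArgs is not part of this repository.
// argv is the list of arguments after the script name,
// e.g. process.argv.slice(2) in Node.
function parseScrapeArgs(argv) {
  // START_PAGE and CONCURRENCY fall back to the documented defaults.
  const [startBook, endBook, startPage = "1", concurrency = "3"] = argv;

  const parsed = {
    startBook: Number.parseInt(startBook, 10),
    endBook: Number.parseInt(endBook, 10),
    startPage: Number.parseInt(startPage, 10),
    concurrency: Number.parseInt(concurrency, 10),
  };

  // Assumption: every value must be a positive integer.
  // Missing arguments parse to NaN and are rejected here.
  for (const [name, value] of Object.entries(parsed)) {
    if (!Number.isInteger(value) || value < 1) {
      throw new Error(`Invalid ${name}: expected a positive integer`);
    }
  }
  if (parsed.endBook < parsed.startBook) {
    throw new Error("END_BOOK must not be smaller than START_BOOK");
  }
  return parsed;
}
```

With this sketch, parseScrapeArgs(["44", "48"]) returns { startBook: 44, endBook: 48, startPage: 1, concurrency: 3 }.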

If you start seeing 404 errors in the console, try restarting or scaling back the concurrency.
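To show what a concurrency cap like CONCURRENCY means in practice, here is a minimal sketch of a limited task runner. It is not taken from poolScrape.mjs, and runWithLimit is a hypothetical name. It only illustrates one common way to keep at most a fixed number of async jobs in flight, so that a lower limit means fewer simultaneous requests.

```javascript
// Hypothetical helper: runs async tasks with at most `limit` in flight.
// `tasks` is an array of functions that each return a promise.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker claims the next unstarted task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  // Start `limit` workers (or fewer if there are fewer tasks).
  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  // Results are returned in task order, not completion order.
  return results;
}
```

In a scraper, each task would be a function that downloads one book, and limit would be the CONCURRENCY value.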

Books will download to book_number/booknumber_pagenumber.pdf. The scraper will auto-skip docs that have been downloaded already.
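The path layout and skip behavior can be sketched as follows. This is an illustration, not the scraper's own code: pdfPath and pagesToDownload are hypothetical names, and the sketch assumes book and page numbers are written without zero-padding. The existence check is passed in as a function so the logic can be tried without touching the disk.

```javascript
// Builds the path described above: book_number/booknumber_pagenumber.pdf
// Assumption: numbers are not zero-padded.
function pdfPath(book, page) {
  return `${book}/${book}_${page}.pdf`;
}

// Hypothetical helper: returns the pages that still need downloading.
// `exists` takes a path and returns true if the file is already present;
// in a Node script it could be fs.existsSync.
function pagesToDownload(book, pages, exists) {
  return pages.filter((page) => !exists(pdfPath(book, page)));
}
```

For example, if 44/44_1.pdf and 44/44_3.pdf already exist, pagesToDownload(44, [1, 2, 3, 4], exists) returns [2, 4].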

Transcriptions

node generate.mjs <book_number> <concurrency>

This is my approach to generating transcripts. It needs a Gemini API key in transcribe.mjs and generates a text file for each individual page. Default concurrency is set to 100.

Use node mergetranscript.mjs <book_number> to combine the text files into a single file.
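One detail worth getting right when combining per-page files is the ordering: a plain alphabetical sort puts page 10 before page 2. The sketch below shows a numeric sort and merge. It is not the code in mergetranscript.mjs. The function names are hypothetical, and the file naming pattern <book>_<page>.txt is an assumption, since this README does not state how the text files are named.

```javascript
// Assumption: page transcripts are named <book>_<page>.txt.
// Adjust the pattern if the real files are named differently.
const PAGE_PATTERN = /_(\d+)\.txt$/;

// Hypothetical helper: keeps only page files and sorts them by page number,
// so 44_2.txt comes before 44_10.txt.
function sortPageFiles(fileNames) {
  return fileNames
    .filter((name) => PAGE_PATTERN.test(name))
    .sort(
      (a, b) =>
        Number(PAGE_PATTERN.exec(a)[1]) - Number(PAGE_PATTERN.exec(b)[1])
    );
}

// Hypothetical helper: joins page texts in page order.
// `readText` takes a file name and returns its contents as a string;
// in a Node script it could wrap fs.readFileSync(name, "utf8").
function mergeTranscript(fileNames, readText) {
  return sortPageFiles(fileNames)
    .map((name) => readText(name))
    .join("\n\n");
}
```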
