Clone this repo and then:
npm install or pnpm install
node poolScrape.mjs <START_BOOK> <END_BOOK> <START_PAGE> <CONCURRENCY>
e.g., node poolScrape.mjs 44 48 1 3
START_BOOK - number of the book you'd like to start with
END_BOOK - number of the book you'd like to end with
START_PAGE - page to start from; defaults to 1
CONCURRENCY - maximum number of books scraped at a time; defaults to 3
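The argument handling above can be sketched as follows. This is illustrative only (the real parsing lives in poolScrape.mjs and may differ); the function name and shape are assumptions.

```javascript
// Sketch of mapping CLI arguments to values with the documented defaults.
// parseArgs is a hypothetical helper, not the actual code in poolScrape.mjs.
function parseArgs(argv) {
  const [startBook, endBook, startPage, concurrency] = argv;
  return {
    startBook: Number(startBook),
    endBook: Number(endBook),
    startPage: startPage !== undefined ? Number(startPage) : 1, // START_PAGE defaults to 1
    concurrency: concurrency !== undefined ? Number(concurrency) : 3, // CONCURRENCY defaults to 3
  };
}
```

For example, `node poolScrape.mjs 44 48` would scrape books 44 through 48 starting at page 1 with 3 books at a time.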
If you start seeing 404 errors in the console, try restarting or scaling back the concurrency.
Books download to book_number/booknumber_pagenumber.pdf. The scraper automatically skips pages that have already been downloaded.
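The skip-if-downloaded behavior likely amounts to an existence check on the target path. A minimal sketch, assuming the path layout described above (the helper names here are hypothetical):

```javascript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Build the expected output path for a page: book_number/booknumber_pagenumber.pdf.
function pdfPath(book, page, root = ".") {
  return join(root, String(book), `${book}_${page}.pdf`);
}

// A page can be skipped when its PDF already exists on disk.
function alreadyDownloaded(book, page, root = ".") {
  return existsSync(pdfPath(book, page, root));
}
```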
node generate.mjs <book_number> <concurrency>
Includes my approach for generating transcripts. Requires a Gemini API key in transcribe.mjs. This generates one text file per page. Concurrency defaults to 100.
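Transcribing pages at a concurrency of 100 implies some form of worker pool. A minimal sketch of such a pool, assuming each page is handled by an async worker (the actual pooling in generate.mjs may be implemented differently, and `runPool` is a hypothetical name):

```javascript
// Run `worker` over `items` with at most `limit` in flight at once.
async function runPool(items, worker, limit = 100) {
  const results = new Array(items.length);
  let next = 0;
  // Each "lane" pulls the next unclaimed index until items are exhausted.
  async function lane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, lane)
  );
  return results;
}
```

In generate.mjs the worker would be whatever calls the Gemini API for one page and writes the resulting text file.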
Use node mergetranscript.mjs <book_number> to combine the per-page text files into a single file.
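The merge step boils down to reading the per-page text files in page order and concatenating them. One subtlety is sorting numerically rather than lexically, so page 10 comes after page 9. A sketch under that assumption (mergetranscript.mjs itself may differ; these function names are illustrative):

```javascript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Sort page files numerically so "44_10.txt" follows "44_9.txt",
// assuming filenames end in _<page>.txt.
function sortPages(files) {
  const pageNum = (f) => Number(f.match(/_(\d+)\.txt$/)?.[1] ?? 0);
  return [...files].sort((a, b) => pageNum(a) - pageNum(b));
}

// Concatenate one book's per-page transcripts into a single file.
function mergeTranscripts(dir, outFile) {
  const pages = sortPages(readdirSync(dir).filter((f) => f.endsWith(".txt")));
  const merged = pages.map((f) => readFileSync(join(dir, f), "utf8")).join("\n");
  writeFileSync(outFile, merged);
  return pages;
}
```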