Clone this repo and then:
npm install or pnpm install
node poolScrape.mjs <START_BOOK> <END_BOOK> <START_PAGE> <CONCURRENCY>
e.g., node poolScrape.mjs 44 48 1 3
START_BOOK - number of the book you'd like to start with
END_BOOK - number of the book you'd like to end with
START_PAGE - page to start from; defaults to 1
CONCURRENCY - maximum number of books scraped at a time; defaults to 3
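The argument handling above can be sketched as follows. This is illustrative only (the real parsing lives in poolScrape.mjs and may differ); the function name and shape are assumptions.

```javascript
// Sketch of mapping CLI arguments to values with the documented defaults.
// parseArgs is a hypothetical helper, not the actual code in poolScrape.mjs.
function parseArgs(argv) {
  const [startBook, endBook, startPage, concurrency] = argv;
  return {
    startBook: Number(startBook),
    endBook: Number(endBook),
    startPage: startPage !== undefined ? Number(startPage) : 1, // START_PAGE defaults to 1
    concurrency: concurrency !== undefined ? Number(concurrency) : 3, // CONCURRENCY defaults to 3
  };
}
```

For example, `node poolScrape.mjs 44 48` would scrape books 44 through 48 starting at page 1 with 3 books at a time.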
If you start seeing 404 errors in the console, try restarting or scaling back the concurrency.
Books download to book_number/booknumber_pagenumber.pdf. The scraper automatically skips pages that have already been downloaded.
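The skip-if-downloaded behavior likely amounts to an existence check on the target path. A minimal sketch, assuming the path layout described above (the helper names here are hypothetical):

```javascript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Build the expected output path for a page: book_number/booknumber_pagenumber.pdf.
function pdfPath(book, page, root = ".") {
  return join(root, String(book), `${book}_${page}.pdf`);
}

// A page can be skipped when its PDF already exists on disk.
function alreadyDownloaded(book, page, root = ".") {
  return existsSync(pdfPath(book, page, root));
}
```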
node generate.mjs <book_number> <concurrency>
Includes my approach for generating transcripts. Requires a Gemini API key in transcribe.mjs. This generates one text file per page. Concurrency defaults to 100.
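Transcribing pages at a concurrency of 100 implies some form of worker pool. A minimal sketch of such a pool, assuming each page is handled by an async worker (the actual pooling in generate.mjs may be implemented differently, and `runPool` is a hypothetical name):

```javascript
// Run `worker` over `items` with at most `limit` in flight at once.
async function runPool(items, worker, limit = 100) {
  const results = new Array(items.length);
  let next = 0;
  // Each "lane" pulls the next unclaimed index until items are exhausted.
  async function lane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, lane)
  );
  return results;
}
```

In generate.mjs the worker would be whatever calls the Gemini API for one page and writes the resulting text file.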
Use node mergetranscript.mjs <book_number> to combine the per-page text files into a single file.
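The merge step boils down to reading the per-page text files in page order and concatenating them. One subtlety is sorting numerically rather than lexically, so page 10 comes after page 9. A sketch under that assumption (mergetranscript.mjs itself may differ; these function names are illustrative):

```javascript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Sort page files numerically so "44_10.txt" follows "44_9.txt",
// assuming filenames end in _<page>.txt.
function sortPages(files) {
  const pageNum = (f) => Number(f.match(/_(\d+)\.txt$/)?.[1] ?? 0);
  return [...files].sort((a, b) => pageNum(a) - pageNum(b));
}

// Concatenate one book's per-page transcripts into a single file.
function mergeTranscripts(dir, outFile) {
  const pages = sortPages(readdirSync(dir).filter((f) => f.endsWith(".txt")));
  const merged = pages.map((f) => readFileSync(join(dir, f), "utf8")).join("\n");
  writeFileSync(outFile, merged);
  return pages;
}
```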