# Survival Library
Various scripts for scraping and parsing survivallibrary.com
Keep in mind this was meant to be a quick-and-dirty project, so things were kind of hot-glued together as I went along.
## Requirements
- Node.js + npm for `parse_html_pages.js`
  - I was using `v16.13.2` (LTS) at the time of writing.
  - Remember to run `npm install` before attempting to run `node parse_html_pages.js` (see the usage example after this list).
- `pdfinfo` via `poppler-utils`
  - Used by one of the Bash scripts to validate the downloaded PDF files (a sketch of the idea follows this list).
- Bash for the various scripts
  - The Bash scripts were used on a Debian 10 (Buster) machine, which ships with Bash by default. In theory they should also work on Windows (e.g. via Git Bash), but due to requirement #2 they might not behave as expected.
- `curl`, which downloads all the pages (see the sketch after this list).
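For the Node part, the typical invocation looks like this (run from the repository root, where `package.json` lives):

```bash
# Install the dependencies declared in package.json, then run the parser.
npm install
node parse_html_pages.js
```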
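As a rough illustration of the `pdfinfo` check: a broken download can be detected by the tool's exit status. This is only a sketch of the idea behind `validate_pdfs.sh`; the `pages` directory and the `broken_pdfs.txt` output path are assumptions based on the file names in this repo.

```bash
#!/usr/bin/env bash
# Sketch: pdfinfo exits non-zero when it cannot parse a PDF,
# so any file that fails the probe gets logged as broken.
# Assumes downloaded PDFs live under ./pages/ (an assumption).
find pages -type f -name '*.pdf' | while read -r pdf; do
    if ! pdfinfo "$pdf" > /dev/null 2>&1; then
        echo "$pdf" >> broken_pdfs.txt
    fi
done
```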
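Similarly, the `curl` step boils down to fetching every URL in a list. A minimal sketch, assuming `survivorlibrary_pages.txt` holds one URL per line (the actual format of that file may differ):

```bash
# Sketch: download each listed page, keeping the remote file name.
while read -r url; do
    curl -sSL -O "$url"
done < survivorlibrary_pages.txt
```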