# Survival Library
Various scripts for scraping and parsing survivallibrary.com
Keep in mind this was meant to be a quick-and-dirty project, so things were kind of hot-glued together as I went along.
## Requirements
- Node.js + npm for `parse_html_pages.js`
  - I was using `v16.13.2` (LTS) at the time of writing.
  - Remember to run `npm install` before attempting to run `node parse_html_pages.js` (see the usage example after this list).
- `pdfinfo` via `poppler-utils`
  - Used by one of the Bash scripts to validate the downloaded PDF files (a sketch of the idea follows this list).
- Bash for the various scripts
  - The Bash scripts were used on a Debian 10 (Buster) machine, which ships with Bash by default. In theory they should also work on Windows (e.g. via Git Bash), but due to requirement #2 they might not behave as expected.
- `curl`, which downloads all the pages (see the sketch after this list).
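For the Node part, the typical invocation looks like this (run from the repository root, where `package.json` lives):

```bash
# Install the dependencies declared in package.json, then run the parser.
npm install
node parse_html_pages.js
```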
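As a rough illustration of the `pdfinfo` check: a broken download can be detected by the tool's exit status. This is only a sketch of the idea behind `validate_pdfs.sh`; the `pages` directory and the `broken_pdfs.txt` output path are assumptions based on the file names in this repo.

```bash
#!/usr/bin/env bash
# Sketch: pdfinfo exits non-zero when it cannot parse a PDF,
# so any file that fails the probe gets logged as broken.
# Assumes downloaded PDFs live under ./pages/ (an assumption).
find pages -type f -name '*.pdf' | while read -r pdf; do
    if ! pdfinfo "$pdf" > /dev/null 2>&1; then
        echo "$pdf" >> broken_pdfs.txt
    fi
done
```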
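Similarly, the `curl` step boils down to fetching every URL in a list. A minimal sketch, assuming `survivorlibrary_pages.txt` holds one URL per line (the actual format of that file may differ):

```bash
# Sketch: download each listed page, keeping the remote file name.
while read -r url; do
    curl -sSL -O "$url"
done < survivorlibrary_pages.txt
```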