
Survival Library

Various scripts for scraping and parsing survivallibrary.com

Keep in mind that this was meant to be a quick-and-dirty project, so things were kind of hot-glued together as I went along.

Requirements

  1. Node.js + npm for parse_html_pages.js
    • I was using v16.13.2 (LTS) at the time of writing.
    • Remember to run npm install before attempting to run node parse_html_pages.js (a rough setup/run sequence is sketched after this list).
  2. pdfinfo via poppler-utils
    • Used by one of the Bash scripts to validate the downloaded PDF files (see the pdfinfo sketch after this list).
  3. Bash for the various scripts
    • The Bash scripts were run on a Debian 10 (Buster) machine, where Bash is available by default. In theory they should also work on Windows (e.g. via Git Bash), but because of requirement #2 they might not work as expected there.
  4. curl, which is used to download all the pages.
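
To give a rough idea of how the pieces fit together, a typical run might look like the sketch below. This is only an illustration based on the script names; the exact order, arguments, and output locations are assumptions, so check each script before running it.

```bash
# Install the Node.js dependencies needed by parse_html_pages.js
npm install

# Fetch the library pages with curl (the get_* scripts wrap this)
./get_pages_with_pdfs.sh

# Parse the downloaded HTML pages and extract the PDF links
node parse_html_pages.js

# Validate the downloaded PDFs and remove the broken ones
./validate_pdfs.sh
./broken_pdf_remove.sh
```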
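
For requirement #2: pdfinfo exits with a non-zero status when it cannot parse a document, which is what makes it handy for spotting broken downloads. A minimal validation loop along those lines (not necessarily identical to what validate_pdfs.sh actually does) could look like this:

```bash
#!/usr/bin/env bash
# Record every PDF in the current directory that pdfinfo cannot read.
# Writing to broken_pdfs.txt mirrors the file shipped in this repository,
# but the real script may use a different location or format.
for f in *.pdf; do
    if ! pdfinfo "$f" > /dev/null 2>&1; then
        echo "$f" >> broken_pdfs.txt
    fi
done
```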