Survivor Library

Various scripts for scraping and parsing survivorlibrary.com

Keep in mind this was meant to be a quick-and-dirty project, so things were kind of hot-glued together as I went along.

Requirements

  1. Node.js + npm for parse_html_pages.js
    • I was using v16.13.2 (LTS) at the time of writing.
    • Remember to run npm install before running node parse_html_pages.js (see the commands right after this list).
  2. pdfinfo via poppler-utils
    • Used by one of the Bash scripts to validate the downloaded PDF files
  3. Bash for the various scripts
    • The Bash scripts were run on a Debian 10 (Buster) machine, which ships with Bash by default. They should theoretically work on Windows (e.g. via Git Bash), but due to requirement #2 they might not work as expected.
  4. curl - Used to actually download all the PDFs via pdfUrls.sh.
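
For reference, the Node.js setup mentioned in requirement #1 comes down to these two commands:

    # Install the dependencies listed in package.json, then run the parser
    npm install
    node parse_html_pages.js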

Order of scripts

  1. Browser: get_page_urls_browser.js
    1. Add the collected URLs to the file survivorlibrary_pages.txt
  2. Bash: get_pages_with_pdfs.sh
    1. This one will take a while, since it downloads the HTML of all the category pages and dumps it into the pages/ directory (a rough sketch of this is included after this list).
  3. Node: parse_html_pages.js
    1. Generates files such as pdfUrls.sh and folderLink.sh
  4. Bash: pdfUrls.sh
    1. Downloads all the PDFs into the specified directory (a conceptual sketch is included after this list).
    2. Since downloads are not done in parallel, it's going to take a while.
      1. This was intentional, to avoid any rate limits/blocks.
      2. When I ran it on my server, it took about 2 days to complete. That's 13k files downloaded one by one, so I hope you have patience.
  5. Bash: validate_pdfs.sh
    1. Take the list of broken PDFs from validate.log and add them to a new file: broken_pdfs.txt
      1. Place this file in the same folder as all the unsorted PDFs.
      2. Some text editor skills might be necessary to extract a clean list of filenames, as the log format is something like: [2022-02-01 01:47:03] filename.pdf is broken. (A sed one-liner for this is sketched after this list.)
  6. Bash: broken_pdf_remove.sh
    1. Place this script in the same directory as the unsorted PDFs and the broken_pdfs.txt file.
    2. Run it once as just bash broken_pdf_remove.sh. If it looks good, run bash broken_pdf_remove.sh delete (see the sketch after this list).
  7. Bash: folderLink.sh
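
As a rough illustration of step 2 (not the actual get_pages_with_pdfs.sh), fetching the category pages listed in survivorlibrary_pages.txt could look like this, assuming one URL per line:

    #!/usr/bin/env bash
    # Illustration of step 2: fetch the HTML of every category page listed
    # in survivorlibrary_pages.txt and store it under pages/.
    mkdir -p pages
    while IFS= read -r url; do
        [ -z "$url" ] && continue              # skip empty lines
        name="$(basename "$url").html"         # derive a file name from the URL
        echo "Fetching $url"
        curl -sSL "$url" -o "pages/$name"
        sleep 1                                # be gentle with the server
    done < survivorlibrary_pages.txt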
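
The generated pdfUrls.sh is not reproduced here; conceptually, the sequential download in step 4 amounts to something like the following, where pdf_urls.txt is a hypothetical one-URL-per-line list:

    #!/usr/bin/env bash
    # Conceptual sketch of step 4: download every PDF one at a time.
    # pdf_urls.txt is a hypothetical list; the real pdfUrls.sh is generated
    # by parse_html_pages.js.
    outdir="${1:-.}"                           # target directory, defaults to the current one
    mkdir -p "$outdir"
    while IFS= read -r url; do
        [ -z "$url" ] && continue
        file="$(basename "$url")"
        [ -f "$outdir/$file" ] && continue     # skip existing files so the run can be resumed
        echo "Downloading $file"
        curl -sSL "$url" -o "$outdir/$file"
        sleep 2                                # deliberately sequential and throttled to avoid rate limits/blocks
    done < pdf_urls.txt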
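
For step 5, the actual validate_pdfs.sh is in the repo; the sketch below only shows how pdfinfo can be used for this kind of check, and how a clean list of filenames could be pulled out of validate.log with sed instead of a text editor (the log format is taken from the example above; the real format may differ slightly):

    #!/usr/bin/env bash
    # Sketch of a pdfinfo-based check: log every PDF in the current
    # directory that pdfinfo cannot parse.
    for f in *.pdf; do
        if ! pdfinfo "$f" > /dev/null 2>&1; then
            echo "[$(date '+%Y-%m-%d %H:%M:%S')] $f is broken" >> validate.log
        fi
    done

    # Turn lines like "[2022-02-01 01:47:03] filename.pdf is broken"
    # into a plain list of filenames.
    sed -E 's/^\[[^]]+\] (.*) is broken$/\1/' validate.log > broken_pdfs.txt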
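
Finally, the dry-run/delete behaviour described in step 6 could be implemented roughly like this (a sketch, not necessarily the real broken_pdf_remove.sh):

    #!/usr/bin/env bash
    # Sketch of step 6: read broken_pdfs.txt (one filename per line) and
    # remove those files. Without an argument it only prints what it would
    # delete; with "delete" as the first argument it actually removes them.
    while IFS= read -r file; do
        [ -z "$file" ] && continue
        if [ "$1" = "delete" ]; then
            rm -v -- "$file"
        else
            echo "Would delete: $file"
        fi
    done < broken_pdfs.txt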