# Survivor Library
Various scripts for scraping and parsing survivorlibrary.com
Keep in mind it was meant to be a quick-and-dirty project, so things were kind of hotglued together as I went along.
## Requirements
- Node.js + npm for `parse_html_pages.js`
  - I was using `v16.13.2` (LTS) at the time of writing.
  - Remember to run `npm install` before attempting to run `node parse_html_pages.js`
- `pdfinfo` via `poppler-utils`
  - Used by one of the Bash scripts to validate the downloaded PDF files
- Bash for the various scripts
  - Bash scripts were used on a Debian 10 (Buster) machine, which has it by default. Theoretically they should work on Windows (e.g. via Git Bash), but due to requirement #2 they might not work as expected.
- `curl`
  - Used to actually download all the pages via `pdfUrls.sh`.
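On Debian 10, getting the requirements above in place looks roughly like this (a sketch; package names are the standard Debian ones, and the packaged Node.js is older than v16, so you may prefer installing Node via nvm or NodeSource):

```bash
# pdfinfo ships with poppler-utils; curl and bash are usually present already
sudo apt-get update
sudo apt-get install -y poppler-utils curl nodejs npm

# Inside the repository, pull in the Node dependencies for parse_html_pages.js
npm install
```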
## Order of scripts
- Browser: `get_page_urls_browser.js`
  - Add the URLs into the file `survivorlibrary_pages.txt`
- Bash: `get_pages_with_pdfs.sh`
  - This one will take a while, since it downloads the HTML of all the category pages and dumps it into the `pages/` directory.
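The core of that step is roughly a loop like the following (a sketch only, assuming one URL per line in `survivorlibrary_pages.txt` and using the last URL segment as the output filename; the real script may name files and pace requests differently):

```bash
#!/usr/bin/env bash
# Sketch: fetch each category page listed in survivorlibrary_pages.txt into pages/
mkdir -p pages

while IFS= read -r url; do
    [ -z "$url" ] && continue
    # Derive a filename from the last path segment of the URL (assumption about naming)
    name="$(basename "${url%/}").html"
    echo "Fetching $url -> pages/$name"
    curl -sS -L -o "pages/$name" "$url"
    sleep 1  # be polite to the server
done < survivorlibrary_pages.txt
```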
- Node: `parse_html_pages.js`
  - Generates files such as `pdfUrls.sh` and `folderLink.sh`
- Bash: `pdfUrls.sh`
  - Will download all the PDFs into the currently specified directory.
  - Since downloads are not done in parallel, it's going to take a while.
    - This was intentional, to avoid any rate limits/blocks.
  - When I ran it on my server, it took about 2 days to complete. It's 13k files individually downloaded, so I hope you have patience.
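Since `pdfUrls.sh` is generated by `parse_html_pages.js`, its exact contents depend on the scraped pages, but the idea is a plain sequential list of `curl` downloads, something like this (URLs and filenames here are hypothetical, for illustration only):

```bash
#!/usr/bin/env bash
# Generated file (sketch): one curl call per PDF, run sequentially on purpose
# so the site isn't hammered with parallel requests.
curl -sS -L -o "some_book.pdf" "http://www.survivorlibrary.com/library/some_book.pdf"
curl -sS -L -o "another_book.pdf" "http://www.survivorlibrary.com/library/another_book.pdf"
# ... roughly 13k more lines like the above
```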
- Bash: `validate_pdfs.sh`
  - Take the list of PDFs from `validate.log` and add them to a new file: `broken_pdfs.txt`
    - Place this file into the same folder as all the unsorted PDFs.
    - Some text editor skills might be necessary to extract a clean list of filenames, as the log format is something like: `[2022-02-01 01:47:03] filename.pdf is broken`.
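The validation boils down to running `pdfinfo` against each file and logging the ones it can't parse; the filename extraction can also be done with `sed` instead of a text editor. A rough sketch, assuming the log format shown above:

```bash
#!/usr/bin/env bash
# Sketch: flag PDFs that pdfinfo cannot parse, then pull the filenames out of the log.
for f in *.pdf; do
    if ! pdfinfo "$f" > /dev/null 2>&1; then
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] $f is broken" >> validate.log
    fi
done

# Turn "[2022-02-01 01:47:03] filename.pdf is broken" into a plain list of filenames
sed -E 's/^\[[^]]+\] (.*) is broken$/\1/' validate.log > broken_pdfs.txt
```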
- Bash: `broken_pdf_remove.sh`
  - Place this in the same directory as the unsorted PDFs and the `broken_pdfs.txt` file.
  - Run it once as just `bash broken_pdf_remove.sh`. If it looks good, run `bash broken_pdf_remove.sh delete`.
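The dry-run/`delete` behaviour described above could look roughly like this (a sketch of the idea, not the actual script):

```bash
#!/usr/bin/env bash
# Sketch: without arguments, only print what would be removed; with "delete", actually remove.
mode="${1:-dry-run}"

while IFS= read -r pdf; do
    [ -z "$pdf" ] && continue
    if [ ! -f "$pdf" ]; then
        echo "Skipping $pdf (not found)"
        continue
    fi
    if [ "$mode" = "delete" ]; then
        rm -v -- "$pdf"
    else
        echo "Would delete: $pdf"
    fi
done < broken_pdfs.txt
```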
- Bash: `folderLink.sh`