Add quick guide to README
parent 2645ba7e5c
commit a578850d37
25
README.md
@@ -13,5 +13,26 @@ Keep in mind it was meant to be a quick-and-dirty project, so things were kind o
     - Used by one of the Bash scripts to validate the downloaded PDF files
 3. Bash for the various scripts
     - Bash scripts were used on a Debian 10 (Buster) machine, which has it by default. Theoretically they should work on Windows (e.g. via Git Bash), but due to requirement #2 they might not work as expected.
-4. `curl` - which downloads all the pages.
-5.
+4. `curl` - Used to actually download all the pages via `pdfUrls.sh`.
+
+## Order of scripts
+
+1. Browser: `get_page_urls_browser.js`
+    1. Add URLs into file `survivallibrary_pages.txt`
+2. Bash: `get_pages_with_pdfs.sh`
+    1. This one will take a while, since it downloads the HTML of all the category pages and dumps it into the `pages/` directory.
+3. Node: `parse_html_pages.js`
+    1. Generates files such as `pdfUrls.sh` and `folderLink.sh`
+4. Bash: `pdfUrls.sh`
+    1. Will download all the PDFs into the currently specified directory.
+    2. Since downloads are *not done* in parallel, it's going to take a while.
+        1. This was intentional to avoid any rate limits/blocks.
+        2. When I ran it on my server, it took about 2 days to complete. It's 13k files individually downloaded, so I hope you have patience.
+5. Bash: `validate_pdfs.sh`
+    1. Take the list of PDFs from `validate.log` and add them to a new file: `broken_pdfs.txt`
+        1. Place this file into the same folder as all the unsorted PDFs.
+        2. Some text editor skills might be necessary to extract a clean list of filenames, as the log format is something like: `[2022-02-01 01:47:03] filename.pdf is broken`.
+6. Bash: `broken_pdf_remove.sh`
+    1. Place this in the same directory as the unsorted PDFs and the `broken_pdfs.txt` file.
+    2. Run it once as just `bash broken_pdf_remove.sh`. If it looks good, run `bash broken_pdf_remove.sh delete`.
+7. Bash: `folderLink.sh`
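
Step 5 of the guide mentions that some text editor work may be needed to turn `validate.log` into a clean `broken_pdfs.txt`. As a possible alternative, here is a minimal sketch that does the extraction with `sed`. It assumes every relevant log line follows the quoted format, `[timestamp] filename.pdf is broken`. The function name `extract_broken_pdfs` is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): turn validate.log lines
# into a plain list of filenames, one per line.
# Assumed log format, taken from the README:
#   [2022-02-01 01:47:03] filename.pdf is broken
extract_broken_pdfs() {
    # -n suppresses default output; the trailing "p" prints only lines where
    # the substitution matched, so unrelated log lines are dropped.
    # The pattern strips the leading "[timestamp] " and the trailing
    # " is broken", keeping whatever is in between (spaces included).
    sed -n -E 's/^\[[^]]*\] (.*) is broken$/\1/p' "$@"
}

# Usage (reads the named file, or stdin if no file is given):
#   extract_broken_pdfs validate.log > broken_pdfs.txt
```

If the real log lines differ from the format quoted in the README, the pattern would need adjusting.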
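
Step 6 of the guide describes a two-pass workflow: a first run to preview, then a second run with `delete` to actually remove the files. The contents of `broken_pdf_remove.sh` are not shown in this commit, so the following is only a sketch of how such a dry-run-first pattern could look. The function name `remove_listed_files` and the output messages are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a dry-run-first removal helper. This is NOT the
# repository's broken_pdf_remove.sh; it only illustrates the pattern.
# Arguments:
#   $1  list file with one filename per line (e.g. broken_pdfs.txt)
#   $2  optional mode; files are removed only when it is exactly "delete"
remove_listed_files() {
    local list_file="$1" mode="${2:-}"
    local name
    # "|| [ -n "$name" ]" also handles a last line without a trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue            # skip blank lines
        if [ ! -f "$name" ]; then
            echo "missing: $name"
        elif [ "$mode" = "delete" ]; then
            # "--" guards against filenames that begin with a dash.
            rm -- "$name" && echo "deleted: $name"
        else
            echo "would delete: $name"
        fi
    done < "$list_file"
}

# Usage, from the directory holding the PDFs and broken_pdfs.txt:
#   remove_listed_files broken_pdfs.txt          # preview only
#   remove_listed_files broken_pdfs.txt delete   # actually remove
```

Making the preview the default means a mistyped or forgotten argument never deletes anything.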