mirror of https://github.com/mikf/gallery-dl.git synced 2024-11-24 19:52:32 +01:00

Merge branch 'master' into support-webtoonxyz

Ion Chary 2023-07-29 21:03:06 -07:00
commit 6f98527111
150 changed files with 5446 additions and 2222 deletions


@ -20,19 +20,36 @@ jobs:
steps: steps:
- uses: actions/checkout@v3 - uses: actions/checkout@v3
- name: Check file permissions
run: |
if [[ "$(find ./gallery_dl -type f -not -perm 644)" ]]; then exit 1; fi
- name: Set up Python ${{ matrix.python-version }} - name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4 uses: actions/setup-python@v4
with: with:
python-version: ${{ matrix.python-version }} python-version: ${{ matrix.python-version }}
- name: Install dependencies - name: Install dependencies
env:
PYV: ${{ matrix.python-version }}
run: | run: |
pip install -r requirements.txt pip install -r requirements.txt
pip install "flake8<4" "importlib-metadata<5" pip install "flake8<4" "importlib-metadata<5"
pip install youtube-dl pip install youtube-dl
if [[ "$PYV" != "3.4" && "$PYV" != "3.5" ]]; then pip install yt-dlp; fi
- name: Install yt-dlp
run: |
case "${{ matrix.python-version }}" in
3.4|3.5)
# don't install yt-dlp
;;
3.6)
# install from PyPI
pip install yt-dlp
;;
*)
# install from master
pip install https://github.com/yt-dlp/yt-dlp/archive/refs/heads/master.tar.gz
;;
esac
- name: Lint with flake8 - name: Lint with flake8
run: | run: |


@ -1,5 +1,177 @@
# Changelog # Changelog
## 1.25.8 - 2023-07-15
### Changes
- update default User-Agent header to Firefox 115 ESR
### Additions
- [gfycat] support `@me` user ([#3770](https://github.com/mikf/gallery-dl/issues/3770), [#4271](https://github.com/mikf/gallery-dl/issues/4271))
- [gfycat] implement login support ([#3770](https://github.com/mikf/gallery-dl/issues/3770), [#4271](https://github.com/mikf/gallery-dl/issues/4271))
- [reddit] notify users about registering an OAuth application ([#4292](https://github.com/mikf/gallery-dl/issues/4292))
- [twitter] add `ratelimit` option ([#4251](https://github.com/mikf/gallery-dl/issues/4251))
- [twitter] use `TweetResultByRestId` endpoint that allows accessing single Tweets without login ([#4250](https://github.com/mikf/gallery-dl/issues/4250))
### Fixes
- [bunkr] use `.la` TLD for `media-files12` servers ([#4147](https://github.com/mikf/gallery-dl/issues/4147), [#4276](https://github.com/mikf/gallery-dl/issues/4276))
- [erome] ignore duplicate album IDs
- [fantia] send `X-Requested-With` header ([#4273](https://github.com/mikf/gallery-dl/issues/4273))
- [gelbooru_v01] fix `source` metadata ([#4302](https://github.com/mikf/gallery-dl/issues/4302), [#4303](https://github.com/mikf/gallery-dl/issues/4303))
- [gelbooru_v01] update `vidyart` domain
- [jpgfish] update domain to `jpeg.pet`
- [mangaread] fix `tags` metadata extraction
- [naverwebtoon] fix `comic` metadata extraction
- [newgrounds] extract & pass auth token during login ([#4268](https://github.com/mikf/gallery-dl/issues/4268))
- [paheal] fix extraction ([#4262](https://github.com/mikf/gallery-dl/issues/4262), [#4293](https://github.com/mikf/gallery-dl/issues/4293))
- [paheal] unescape `source`
- [philomena] fix `--range` ([#4288](https://github.com/mikf/gallery-dl/issues/4288))
- [philomena] handle `429 Too Many Requests` errors ([#4288](https://github.com/mikf/gallery-dl/issues/4288))
- [pornhub] set `accessAgeDisclaimerPH` cookie ([#4301](https://github.com/mikf/gallery-dl/issues/4301))
- [reddit] use 0.6s delay between API requests ([#4292](https://github.com/mikf/gallery-dl/issues/4292))
- [seiga] set `skip_fetish_warning` cookie ([#4242](https://github.com/mikf/gallery-dl/issues/4242))
- [slideshare] fix extraction
- [twitter] fix `following` extractor not getting all users ([#4287](https://github.com/mikf/gallery-dl/issues/4287))
- [twitter] use GraphQL search endpoint by default ([#4264](https://github.com/mikf/gallery-dl/issues/4264))
- [twitter] do not treat missing `TimelineAddEntries` instruction as fatal ([#4278](https://github.com/mikf/gallery-dl/issues/4278))
- [weibo] fix cursor based pagination
- [wikifeet] fix `tag` extraction ([#4289](https://github.com/mikf/gallery-dl/issues/4289), [#4291](https://github.com/mikf/gallery-dl/issues/4291))
### Removals
- [bcy] remove module
- [lineblog] remove module
## 1.25.7 - 2023-07-02
### Additions
- [flickr] add 'exif' option
- [flickr] add 'metadata' option ([#4227](https://github.com/mikf/gallery-dl/issues/4227))
- [mangapark] add 'source' option ([#3969](https://github.com/mikf/gallery-dl/issues/3969))
- [twitter] extend 'conversations' option ([#4211](https://github.com/mikf/gallery-dl/issues/4211))
### Fixes
- [furaffinity] improve 'description' HTML ([#4224](https://github.com/mikf/gallery-dl/issues/4224))
- [gelbooru_v01] fix '--range' ([#4167](https://github.com/mikf/gallery-dl/issues/4167))
- [hentaifox] fix titles containing '@' ([#4201](https://github.com/mikf/gallery-dl/issues/4201))
- [mangapark] update to v5 ([#3969](https://github.com/mikf/gallery-dl/issues/3969))
- [piczel] update API server address ([#4244](https://github.com/mikf/gallery-dl/issues/4244))
- [poipiku] improve error detection ([#4206](https://github.com/mikf/gallery-dl/issues/4206))
- [sankaku] improve warnings for unavailable posts
- [senmanga] ensure download URLs have a scheme ([#4235](https://github.com/mikf/gallery-dl/issues/4235))
## 1.25.6 - 2023-06-17
### Additions
- [blogger] download files from `lh*.googleusercontent.com` ([#4070](https://github.com/mikf/gallery-dl/issues/4070))
- [fantia] extract `plan` metadata ([#2477](https://github.com/mikf/gallery-dl/issues/2477))
- [fantia] emit warning for non-visible content sections ([#4128](https://github.com/mikf/gallery-dl/issues/4128))
- [furaffinity] extract `favorite_id` metadata ([#4133](https://github.com/mikf/gallery-dl/issues/4133))
- [jschan] add generic extractors for jschan image boards ([#3447](https://github.com/mikf/gallery-dl/issues/3447))
- [kemonoparty] support `.su` TLDs ([#4139](https://github.com/mikf/gallery-dl/issues/4139))
- [pixiv:novel] add `novel-bookmark` extractor ([#4111](https://github.com/mikf/gallery-dl/issues/4111))
- [pixiv:novel] add `full-series` option ([#4111](https://github.com/mikf/gallery-dl/issues/4111))
- [postimage] add gallery support, update image extractor ([#3115](https://github.com/mikf/gallery-dl/issues/3115), [#4134](https://github.com/mikf/gallery-dl/issues/4134))
- [redgifs] support galleries ([#4021](https://github.com/mikf/gallery-dl/issues/4021))
- [twitter] extract `conversation_id` metadata ([#3839](https://github.com/mikf/gallery-dl/issues/3839))
- [vipergirls] add login support ([#4166](https://github.com/mikf/gallery-dl/issues/4166))
- [vipergirls] use API endpoints ([#4166](https://github.com/mikf/gallery-dl/issues/4166))
- [formatter] implement `H` conversion ([#4164](https://github.com/mikf/gallery-dl/issues/4164))
### Fixes
- [acidimg] fix extraction ([#4136](https://github.com/mikf/gallery-dl/issues/4136))
- [bunkr] update domain to bunkrr.su ([#4159](https://github.com/mikf/gallery-dl/issues/4159), [#4189](https://github.com/mikf/gallery-dl/issues/4189))
- [bunkr] fix video downloads
- [fanbox] prevent exception due to missing embeds ([#4088](https://github.com/mikf/gallery-dl/issues/4088))
- [instagram] fix retrieving `/tagged` posts ([#4122](https://github.com/mikf/gallery-dl/issues/4122))
- [jpgfish] update domain to `jpg.pet` ([#4138](https://github.com/mikf/gallery-dl/issues/4138))
- [pixiv:novel] fix error with embeds extraction ([#4175](https://github.com/mikf/gallery-dl/issues/4175))
- [pornhub] improve redirect handling ([#4188](https://github.com/mikf/gallery-dl/issues/4188))
- [reddit] fix crash due to empty `crosspost_parent_lists` ([#4120](https://github.com/mikf/gallery-dl/issues/4120), [#4172](https://github.com/mikf/gallery-dl/issues/4172))
- [redgifs] update `search` URL pattern ([#4115](https://github.com/mikf/gallery-dl/issues/4115), [#4185](https://github.com/mikf/gallery-dl/issues/4185))
- [senmanga] fix and update ([#4160](https://github.com/mikf/gallery-dl/issues/4160))
- [twitter] use GraphQL API search endpoint ([#3942](https://github.com/mikf/gallery-dl/issues/3942))
- [wallhaven] improve HTTP error handling ([#4192](https://github.com/mikf/gallery-dl/issues/4192))
- [weibo] prevent fatal exception due to missing video data ([#4150](https://github.com/mikf/gallery-dl/issues/4150))
- [weibo] fix `.json` extension for some videos
## 1.25.5 - 2023-05-27
### Additions
- [8muses] add `parts` metadata field ([#3329](https://github.com/mikf/gallery-dl/issues/3329))
- [danbooru] add `date` metadata field ([#4047](https://github.com/mikf/gallery-dl/issues/4047))
- [e621] add `date` metadata field ([#4047](https://github.com/mikf/gallery-dl/issues/4047))
- [gofile] add basic password support ([#4056](https://github.com/mikf/gallery-dl/issues/4056))
- [imagechest] implement API support ([#4065](https://github.com/mikf/gallery-dl/issues/4065))
- [instagram] add `order-files` option ([#3993](https://github.com/mikf/gallery-dl/issues/3993), [#4017](https://github.com/mikf/gallery-dl/issues/4017))
- [instagram] add `order-posts` option ([#3993](https://github.com/mikf/gallery-dl/issues/3993), [#4017](https://github.com/mikf/gallery-dl/issues/4017))
- [instagram] add `metadata` option ([#3107](https://github.com/mikf/gallery-dl/issues/3107))
- [jpgfish] add `jpg.fishing` extractors ([#2657](https://github.com/mikf/gallery-dl/issues/2657), [#2719](https://github.com/mikf/gallery-dl/issues/2719))
- [lensdump] add `lensdump.com` extractors ([#2078](https://github.com/mikf/gallery-dl/issues/2078), [#4104](https://github.com/mikf/gallery-dl/issues/4104))
- [mangaread] add `mangaread.org` extractors ([#2425](https://github.com/mikf/gallery-dl/issues/2425), [#2781](https://github.com/mikf/gallery-dl/issues/2781))
- [misskey] add `favorite` extractor ([#3950](https://github.com/mikf/gallery-dl/issues/3950))
- [pixiv] add `novel` support ([#1241](https://github.com/mikf/gallery-dl/issues/1241), [#4044](https://github.com/mikf/gallery-dl/issues/4044))
- [reddit] support cross-posted media ([#887](https://github.com/mikf/gallery-dl/issues/887), [#3586](https://github.com/mikf/gallery-dl/issues/3586), [#3976](https://github.com/mikf/gallery-dl/issues/3976))
- [postprocessor:exec] support tilde expansion for `command`
- [formatter] support slicing strings as bytes ([#4087](https://github.com/mikf/gallery-dl/issues/4087))
### Fixes
- [8muses] fix value of `album[url]` ([#3329](https://github.com/mikf/gallery-dl/issues/3329))
- [danbooru] refactor pagination logic ([#4002](https://github.com/mikf/gallery-dl/issues/4002))
- [fanbox] skip invalid posts ([#4088](https://github.com/mikf/gallery-dl/issues/4088))
- [gofile] automatically fetch `website-token`
- [kemonoparty] fix kemono and coomer logins sharing the same cache ([#4098](https://github.com/mikf/gallery-dl/issues/4098))
- [newgrounds] add default delay between requests ([#4046](https://github.com/mikf/gallery-dl/issues/4046))
- [nsfwalbum] detect placeholder images
- [poipiku] extract full `descriptions` ([#4066](https://github.com/mikf/gallery-dl/issues/4066))
- [tcbscans] update domain to `tcbscans.com` ([#4080](https://github.com/mikf/gallery-dl/issues/4080))
- [twitter] extract TwitPic URLs in text ([#3792](https://github.com/mikf/gallery-dl/issues/3792), [#3796](https://github.com/mikf/gallery-dl/issues/3796))
- [weibo] require numeric IDs to have length >= 10 ([#4059](https://github.com/mikf/gallery-dl/issues/4059))
- [ytdl] fix crash due to removed `no_color` attribute
- [cookies] improve logging behavior ([#4050](https://github.com/mikf/gallery-dl/issues/4050))
## 1.25.4 - 2023-05-07
### Additions
- [4chanarchives] add `thread` and `board` extractors ([#4012](https://github.com/mikf/gallery-dl/issues/4012))
- [foolfuuka] add `archive.palanq.win`
- [imgur] add `favorite-folder` extractor ([#4016](https://github.com/mikf/gallery-dl/issues/4016))
- [mangadex] add `status` and `tags` metadata ([#4031](https://github.com/mikf/gallery-dl/issues/4031))
- allow selecting a domain with `--cookies-from-browser`
- add `--cookies-export` command-line option
- add `-C` as short option for `--cookies`
- include exception type in config error messages
### Fixes
- [exhentai] update sadpanda check
- [imagechest] load all images when a "Load More" button is present ([#4028](https://github.com/mikf/gallery-dl/issues/4028))
- [imgur] fix bug causing some images/albums from user profiles and favorites to be ignored
- [pinterest] update endpoint for related board pins
- [pinterest] fix `pin.it` extractor
- [ytdl] fix yt-dlp `--xff/--geo-bypass` tests ([#3989](https://github.com/mikf/gallery-dl/issues/3989))
### Removals
- [420chan] remove module
- [foolfuuka] remove `archive.alice.al` and `tokyochronos.net`
- [foolslide] remove `sensescans.com`
- [nana] remove module
## 1.25.3 - 2023-04-30
### Additions
- [imagefap] extract `description` and `categories` metadata ([#3905](https://github.com/mikf/gallery-dl/issues/3905))
- [imxto] add `gallery` extractor ([#1289](https://github.com/mikf/gallery-dl/issues/1289))
- [itchio] add `game` extractor ([#3923](https://github.com/mikf/gallery-dl/issues/3923))
- [nitter] extract user IDs from encoded banner URLs
- [pixiv] allow sorting search results by popularity ([#3970](https://github.com/mikf/gallery-dl/issues/3970))
- [reddit] match `preview.redd.it` URLs ([#3935](https://github.com/mikf/gallery-dl/issues/3935))
- [sankaku] support post URLs with MD5 hashes ([#3952](https://github.com/mikf/gallery-dl/issues/3952))
- [shimmie2] add generic extractors for Shimmie2 sites ([#3734](https://github.com/mikf/gallery-dl/issues/3734), [#943](https://github.com/mikf/gallery-dl/issues/943))
- [tumblr] add `day` extractor ([#3951](https://github.com/mikf/gallery-dl/issues/3951))
- [twitter] support `profile-conversation` entries ([#3938](https://github.com/mikf/gallery-dl/issues/3938))
- [vipergirls] add `thread` and `post` extractors ([#3812](https://github.com/mikf/gallery-dl/issues/3812), [#2720](https://github.com/mikf/gallery-dl/issues/2720), [#731](https://github.com/mikf/gallery-dl/issues/731))
- [downloader:http] add `consume-content` option ([#3748](https://github.com/mikf/gallery-dl/issues/3748))
### Fixes
- [2chen] update domain to sturdychan.help
- [behance] fix extraction ([#3980](https://github.com/mikf/gallery-dl/issues/3980))
- [deviantart] retry downloads with private token ([#3941](https://github.com/mikf/gallery-dl/issues/3941))
- [imagefap] fix empty `tags` metadata
- [manganelo] support arbitrary minor version separators ([#3972](https://github.com/mikf/gallery-dl/issues/3972))
- [nozomi] fix file URLs ([#3925](https://github.com/mikf/gallery-dl/issues/3925))
- [oauth] catch exceptions from `webbrowser.get()` ([#3947](https://github.com/mikf/gallery-dl/issues/3947))
- [pixiv] fix `pixivision` extraction
- [reddit] ignore `id-max` value `"zik0zj"`/`2147483647` ([#3939](https://github.com/mikf/gallery-dl/issues/3939), [#3862](https://github.com/mikf/gallery-dl/issues/3862), [#3697](https://github.com/mikf/gallery-dl/issues/3697), [#3606](https://github.com/mikf/gallery-dl/issues/3606), [#3546](https://github.com/mikf/gallery-dl/issues/3546), [#3521](https://github.com/mikf/gallery-dl/issues/3521), [#3412](https://github.com/mikf/gallery-dl/issues/3412))
- [sankaku] sanitize `date:` tags ([#1790](https://github.com/mikf/gallery-dl/issues/1790))
- [tumblr] fix and update pagination logic ([#2191](https://github.com/mikf/gallery-dl/issues/2191))
- [twitter] fix `user` metadata when downloading quoted Tweets ([#3922](https://github.com/mikf/gallery-dl/issues/3922))
- [ytdl] fix crash due to `--geo-bypass` deprecation ([#3975](https://github.com/mikf/gallery-dl/issues/3975))
- [postprocessor:metadata] support putting keys in quotes
- include more optional dependencies in executables ([#3907](https://github.com/mikf/gallery-dl/issues/3907))
## 1.25.2 - 2023-04-15 ## 1.25.2 - 2023-04-15
### Additions ### Additions
- [deviantart] add `public` option - [deviantart] add `public` option


@ -72,9 +72,9 @@ Standalone Executable
Prebuilt executable files with a Python interpreter and Prebuilt executable files with a Python interpreter and
required Python packages included are available for required Python packages included are available for
- `Windows <https://github.com/mikf/gallery-dl/releases/download/v1.25.2/gallery-dl.exe>`__ - `Windows <https://github.com/mikf/gallery-dl/releases/download/v1.25.8/gallery-dl.exe>`__
(Requires `Microsoft Visual C++ Redistributable Package (x86) <https://aka.ms/vs/17/release/vc_redist.x86.exe>`__) (Requires `Microsoft Visual C++ Redistributable Package (x86) <https://aka.ms/vs/17/release/vc_redist.x86.exe>`__)
- `Linux <https://github.com/mikf/gallery-dl/releases/download/v1.25.2/gallery-dl.bin>`__ - `Linux <https://github.com/mikf/gallery-dl/releases/download/v1.25.8/gallery-dl.bin>`__
Nightly Builds Nightly Builds
@ -123,6 +123,15 @@ For macOS or Linux users using Homebrew:
brew install gallery-dl brew install gallery-dl
MacPorts
--------
For macOS users with MacPorts:
.. code:: bash
sudo port install gallery-dl
Usage Usage
===== =====


@ -382,6 +382,7 @@ Description
* ``e621`` (*) * ``e621`` (*)
* ``e926`` (*) * ``e926`` (*)
* ``exhentai`` * ``exhentai``
* ``gfycat``
* ``idolcomplex`` * ``idolcomplex``
* ``imgbb`` * ``imgbb``
* ``inkbunny`` * ``inkbunny``
@ -395,6 +396,7 @@ Description
* ``tapas`` * ``tapas``
* ``tsumino`` * ``tsumino``
* ``twitter`` * ``twitter``
* ``vipergirls``
* ``zerochan`` * ``zerochan``
These values can also be specified via the These values can also be specified via the
@ -440,30 +442,35 @@ Description
"isAdult" : "1" "isAdult" : "1"
} }
* A ``list`` with up to 4 entries specifying a browser profile. * A ``list`` with up to 5 entries specifying a browser profile.
* The first entry is the browser name * The first entry is the browser name
* The optional second entry is a profile name or an absolute path to a profile directory * The optional second entry is a profile name or an absolute path to a profile directory
* The optional third entry is the keyring to retrieve passwords for decrypting cookies from * The optional third entry is the keyring to retrieve passwords for decrypting cookies from
* The optional fourth entry is a (Firefox) container name (``"none"`` for only cookies with no container) * The optional fourth entry is a (Firefox) container name (``"none"`` for only cookies with no container)
* The optional fifth entry is the domain to extract cookies for. Prefix it with a dot ``.`` to include cookies for subdomains. Has no effect when also specifying a container.
.. code:: json .. code:: json
["firefox"] ["firefox"]
["firefox", null, null, "Personal"] ["firefox", null, null, "Personal"]
["chromium", "Private", "kwallet"] ["chromium", "Private", "kwallet", null, ".twitter.com"]
extractor.*.cookies-update extractor.*.cookies-update
-------------------------- --------------------------
Type Type
``bool`` * ``bool``
* |Path|_
Default Default
``true`` ``true``
Description Description
If `extractor.*.cookies`_ specifies the |Path|_ of a cookies.txt Export session cookies in cookies.txt format.
file and it can be opened and parsed without errors,
update its contents with cookies received during data extraction. * If this is a |Path|_, write cookies to the given file path.
* If this is ``true`` and `extractor.*.cookies`_ specifies the |Path|_
of a valid cookies.txt file, update its contents.
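For example, a minimal config sketch (the export path below is only a placeholder) that loads Firefox cookies for a single domain and writes the resulting session cookies back to a file:

.. code:: json

    {
        "extractor": {
            "cookies": ["firefox", null, null, null, ".twitter.com"],
            "cookies-update": "~/gallery-dl/twitter-cookies.txt"
        }
    }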
extractor.*.proxy extractor.*.proxy
@ -519,7 +526,7 @@ extractor.*.user-agent
Type Type
``string`` ``string``
Default Default
``"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0"`` ``"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0"``
Description Description
User-Agent header value to be used for HTTP requests. User-Agent header value to be used for HTTP requests.
@ -1151,7 +1158,7 @@ Description
Note: This requires 1 additional HTTP request per 200-post batch. Note: This requires 1 additional HTTP request per 200-post batch.
extractor.{Danbooru].threshold extractor.[Danbooru].threshold
------------------------------ ------------------------------
Type Type
* ``string`` * ``string``
@ -1535,6 +1542,39 @@ Description
from `linking your Flickr account to gallery-dl <OAuth_>`__. from `linking your Flickr account to gallery-dl <OAuth_>`__.
extractor.flickr.exif
---------------------
Type
``bool``
Default
``false``
Description
Fetch `exif` and `camera` metadata for each photo.
Note: This requires 1 additional API call per photo.
extractor.flickr.metadata
-------------------------
Type
* ``bool``
* ``string``
* ``list`` of ``strings``
Default
``false``
Example
* ``license,last_update,machine_tags``
* ``["license", "last_update", "machine_tags"]``
Description
Extract additional metadata
(license, date_taken, original_format, last_update, geo, machine_tags, o_dims)
It is possible to specify a custom list of metadata includes.
See `the extras parameter <https://www.flickr.com/services/api/flickr.people.getPhotos.html>`__
in `Flickr API docs <https://www.flickr.com/services/api/>`__
for possible field names.
extractor.flickr.videos extractor.flickr.videos
----------------------- -----------------------
Type Type
@ -1651,7 +1691,11 @@ Default
``["mp4", "webm", "mobile", "gif"]`` ``["mp4", "webm", "mobile", "gif"]``
Description Description
List of names of the preferred animation format, which can be List of names of the preferred animation format, which can be
``"mp4"``, ``"webm"``, ``"mobile"``, ``"gif"``, or ``"webp"``. ``"mp4"``,
``"webm"``,
``"mobile"``,
``"gif"``, or
``"webp"``.
If a selected format is not available, the next one in the list will be If a selected format is not available, the next one in the list will be
tried until an available format is found. tried until an available format is found.
@ -1677,15 +1721,14 @@ extractor.gofile.website-token
------------------------------ ------------------------------
Type Type
``string`` ``string``
Default
``"12345"``
Description Description
API token value used during API requests. API token value used during API requests.
A not up-to-date value will result in ``401 Unauthorized`` errors. An invalid or not up-to-date value
will result in ``401 Unauthorized`` errors.
Setting this value to ``null`` will do an extra HTTP request to fetch Keeping this option unset will use an extra HTTP request
the current value used by gofile. to attempt to fetch the current value used by gofile.
extractor.gofile.recursive extractor.gofile.recursive
@ -1733,6 +1776,21 @@ Description
but is most likely going to fail with ``403 Forbidden`` errors. but is most likely going to fail with ``403 Forbidden`` errors.
extractor.imagechest.access-token
---------------------------------
Type
``string``
Description
Your personal Image Chest access token.
These tokens allow using the API instead of having to scrape HTML pages,
providing more detailed metadata.
(``date``, ``description``, etc)
See https://imgchest.com/docs/api/1.0/general/authorization
for instructions on how to generate such a token.
extractor.imgur.client-id extractor.imgur.client-id
------------------------- -------------------------
Type Type
@ -1808,6 +1866,55 @@ Description
It is possible to use ``"all"`` instead of listing all values separately. It is possible to use ``"all"`` instead of listing all values separately.
extractor.instagram.metadata
----------------------------
Type
``bool``
Default
``false``
Description
Provide extended ``user`` metadata even when referring to a user by ID,
e.g. ``instagram.com/id:12345678``.
Note: This metadata is always available when referring to a user by name,
e.g. ``instagram.com/USERNAME``.
extractor.instagram.order-files
-------------------------------
Type
``string``
Default
``"asc"``
Description
Controls the order in which files of each post are returned.
* ``"asc"``: Same order as displayed in a post
* ``"desc"``: Reverse order as displayed in a post
* ``"reverse"``: Same as ``"desc"``
Note: This option does *not* affect ``{num}``.
To enumerate files in reverse order, use ``count - num + 1``.
extractor.instagram.order-posts
-------------------------------
Type
``string``
Default
``"asc"``
Description
Controls the order in which posts are returned.
* ``"asc"``: Same order as displayed
* ``"desc"``: Reverse order as displayed
* ``"id"`` or ``"id_asc"``: Ascending order by ID
* ``"id_desc"``: Descending order by ID
* ``"reverse"``: Same as ``"desc"``
Note: This option only affects ``highlights``.
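A config sketch combining both ordering options, with values taken from the lists above:

.. code:: json

    {
        "extractor": {
            "instagram": {
                "order-files": "desc",
                "order-posts": "id_desc"
            }
        }
    }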
extractor.instagram.previews extractor.instagram.previews
---------------------------- ----------------------------
Type Type
@ -1979,18 +2086,21 @@ Example
Description Description
Additional query parameters to send when fetching manga chapters. Additional query parameters to send when fetching manga chapters.
(See `/manga/{id}/feed <https://api.mangadex.org/docs.html#operation/get-manga-id-feed>`_ (See `/manga/{id}/feed <https://api.mangadex.org/docs/swagger.html#/Manga/get-manga-id-feed>`__
and `/user/follows/manga/feed <https://api.mangadex.org/docs.html#operation/get-user-follows-manga-feed>`_) and `/user/follows/manga/feed <https://api.mangadex.org/docs/swagger.html#/Feed/get-user-follows-manga-feed>`__)
extractor.mangadex.lang extractor.mangadex.lang
----------------------- -----------------------
Type Type
``string`` * ``string``
* ``list`` of ``strings``
Example Example
``"en"`` * ``"en"``
* ``"fr,it"``
* ``["fr", "it"]``
Description Description
`ISO 639-1 <https://en.wikipedia.org/wiki/ISO_639-1>`__ language code `ISO 639-1 <https://en.wikipedia.org/wiki/ISO_639-1>`__ language codes
to filter chapters by. to filter chapters by.
@ -2004,6 +2114,24 @@ Description
List of acceptable content ratings for returned chapters. List of acceptable content ratings for returned chapters.
extractor.mangapark.source
--------------------------
Type
* ``string``
* ``integer``
Example
* ``"koala:en"``
* ``15150116``
Description
Select chapter source and language for a manga.
| The general syntax is ``"<source name>:<ISO 639-1 language code>"``.
| Both are optional, meaning ``"koala"``, ``"koala:"``, ``":en"``,
or even just ``":"`` are possible as well.
Specifying the numeric ``ID`` of a source is also supported.
extractor.[mastodon].access-token extractor.[mastodon].access-token
--------------------------------- ---------------------------------
Type Type
@ -2050,8 +2178,16 @@ Description
Also emit metadata for text-only posts without media content. Also emit metadata for text-only posts without media content.
extractor.[misskey].access-token
--------------------------------
Type
``string``
Description
Your access token, necessary to fetch favorited notes.
extractor.[misskey].renotes extractor.[misskey].renotes
---------------------------- ---------------------------
Type Type
``bool`` ``bool``
Default Default
@ -2061,7 +2197,7 @@ Description
extractor.[misskey].replies extractor.[misskey].replies
---------------------------- ---------------------------
Type Type
``bool`` ``bool``
Default Default
@ -2070,17 +2206,6 @@ Description
Fetch media from replies to other notes. Fetch media from replies to other notes.
extractor.nana.favkey
---------------------
Type
``string``
Default
``null``
Description
Your `Nana Favorite Key <https://nana.my.id/tutorial>`__,
used to access your favorite archives.
extractor.newgrounds.flash extractor.newgrounds.flash
-------------------------- --------------------------
Type Type
@ -2341,7 +2466,12 @@ Description
when processing a user profile. when processing a user profile.
Possible values are Possible values are
``"artworks"``, ``"avatar"``, ``"background"``, ``"favorite"``. ``"artworks"``,
``"avatar"``,
``"background"``,
``"favorite"``,
``"novel-user"``,
``"novel-bookmark"``.
It is possible to use ``"all"`` instead of listing all values separately. It is possible to use ``"all"`` instead of listing all values separately.
@ -2357,6 +2487,27 @@ Description
`gppt <https://github.com/eggplants/get-pixivpy-token>`__. `gppt <https://github.com/eggplants/get-pixivpy-token>`__.
extractor.pixiv.embeds
----------------------
Type
``bool``
Default
``false``
Description
Download images embedded in novels.
extractor.pixiv.novel.full-series
---------------------------------
Type
``bool``
Default
``false``
Description
When downloading a novel being part of a series,
download all novels of that series.
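A sketch enabling both novel-related options, assuming the dotted option names map onto nested objects as usual:

.. code:: json

    {
        "extractor": {
            "pixiv": {
                "embeds": true,
                "novel": {
                    "full-series": true
                }
            }
        }
    }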
extractor.pixiv.metadata extractor.pixiv.metadata
------------------------ ------------------------
Type Type
@ -2602,7 +2753,12 @@ Default
``["hd", "sd", "gif"]`` ``["hd", "sd", "gif"]``
Description Description
List of names of the preferred animation format, which can be List of names of the preferred animation format, which can be
``"hd"``, ``"sd"``, `"gif"``, `"vthumbnail"``, `"thumbnail"``, or ``"poster"``. ``"hd"``,
``"sd"``,
``"gif"``,
``"thumbnail"``,
``"vthumbnail"``, or
``"poster"``.
If a selected format is not available, the next one in the list will be If a selected format is not available, the next one in the list will be
tried until an available format is found. tried until an available format is found.
@ -2901,15 +3057,19 @@ Description
extractor.twitter.conversations extractor.twitter.conversations
------------------------------- -------------------------------
Type Type
``bool`` * ``bool``
* ``string``
Default Default
``false`` ``false``
Description Description
For input URLs pointing to a single Tweet, For input URLs pointing to a single Tweet,
e.g. `https://twitter.com/i/web/status/<TweetID>`, e.g. `https://twitter.com/i/web/status/<TweetID>`,
fetch media from all Tweets and replies in this `conversation fetch media from all Tweets and replies in this `conversation
<https://help.twitter.com/en/using-twitter/twitter-conversations>`__ <https://help.twitter.com/en/using-twitter/twitter-conversations>`__.
or thread.
If this option is equal to ``"accessible"``,
only download from conversation Tweets
if the given initial Tweet is accessible.
extractor.twitter.csrf extractor.twitter.csrf
@ -2945,6 +3105,32 @@ Description
`syndication <extractor.twitter.syndication_>`__ API. `syndication <extractor.twitter.syndication_>`__ API.
extractor.twitter.include
-------------------------
Type
* ``string``
* ``list`` of ``strings``
Default
``"timeline"``
Example
* ``"avatar,background,media"``
* ``["avatar", "background", "media"]``
Description
A (comma-separated) list of subcategories to include
when processing a user profile.
Possible values are
``"avatar"``,
``"background"``,
``"timeline"``,
``"tweets"``,
``"media"``,
``"replies"``,
``"likes"``.
It is possible to use ``"all"`` instead of listing all values separately.
extractor.twitter.transform extractor.twitter.transform
--------------------------- ---------------------------
Type Type
@ -2955,6 +3141,20 @@ Description
Transform Tweet and User metadata into a simpler, uniform format. Transform Tweet and User metadata into a simpler, uniform format.
extractor.twitter.tweet-endpoint
--------------------------------
Type
``string``
Default
``"auto"``
Description
Selects the API endpoint used to retrieve single Tweets.
* ``"restid"``: ``/TweetResultByRestId`` - accessible to guest users
* ``"detail"``: ``/TweetDetail`` - more stable
* ``"auto"``: ``"detail"`` when logged in, ``"restid"`` otherwise
extractor.twitter.size extractor.twitter.size
---------------------- ----------------------
Type Type
@ -3027,6 +3227,19 @@ Description
a quoted (original) Tweet when it sees the Tweet which quotes it. a quoted (original) Tweet when it sees the Tweet which quotes it.
extractor.twitter.ratelimit
---------------------------
Type
``string``
Default
``"wait"``
Description
Selects how to handle exceeding the API rate limit.
* ``"abort"``: Raise an error and stop extraction
* ``"wait"``: Wait until rate limit reset
extractor.twitter.replies extractor.twitter.replies
------------------------- -------------------------
Type Type
@ -3067,8 +3280,8 @@ Type
Default Default
``"auto"`` ``"auto"``
Description Description
Controls the strategy / tweet source used for user URLs Controls the strategy / tweet source used for timeline URLs
(``https://twitter.com/USER``). (``https://twitter.com/USER/timeline``).
* ``"tweets"``: `/tweets <https://twitter.com/USER/tweets>`__ timeline + search * ``"tweets"``: `/tweets <https://twitter.com/USER/tweets>`__ timeline + search
* ``"media"``: `/media <https://twitter.com/USER/media>`__ timeline + search * ``"media"``: `/media <https://twitter.com/USER/media>`__ timeline + search
@ -3637,6 +3850,25 @@ Description
contains JPEG/JFIF data. contains JPEG/JFIF data.
downloader.http.consume-content
-------------------------------
Type
``bool``
Default
``false``
Description
Controls the behavior when an HTTP response is considered
unsuccessful.
If the value is ``true``, consume the response body. This
avoids closing the connection and therefore improves connection
reuse.
If the value is ``false``, immediately close the connection
without reading the response. This can be useful if the server
is known to send large bodies for error responses.
downloader.http.chunk-size downloader.http.chunk-size
-------------------------- --------------------------
Type Type
@ -4497,7 +4729,7 @@ Default
Description Description
Name of the metadata field whose value should be used. Name of the metadata field whose value should be used.
This value must either be a UNIX timestamp or a This value must be either a UNIX timestamp or a
|datetime|_ object. |datetime|_ object.
Note: This option gets ignored if `mtime.value`_ is set. Note: This option gets ignored if `mtime.value`_ is set.
@ -4515,10 +4747,54 @@ Example
Description Description
A `format string`_ whose value should be used. A `format string`_ whose value should be used.
The resulting value must either be a UNIX timestamp or a The resulting value must be either a UNIX timestamp or a
|datetime|_ object. |datetime|_ object.
python.archive
--------------
Type
|Path|_
Description
File to store IDs of called Python functions in,
similar to `extractor.*.archive`_.
``archive-format``, ``archive-prefix``, and ``archive-pragma`` options,
akin to
`extractor.*.archive-format`_,
`extractor.*.archive-prefix`_, and
`extractor.*.archive-pragma`_, are supported as well.
python.event
------------
Type
``string``
Default
``"file"``
Description
The event for which `python.function`_ gets called.
See `metadata.event`_ for a list of available events.
python.function
---------------
Type
``string``
Example
* ``"my_module:generate_text"``
* ``"~/.local/share/gdl-utils.py:resize"``
Description
The Python function to call.
This function gets specified as ``<module>:<function name>``
and gets called with the current metadata dict as argument.
``module`` is either an importable Python module name
or the |Path|_ to a `.py` file.
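A hypothetical postprocessor entry tying the ``python.*`` options together; the script path reuses the example above and the archive path is a placeholder:

.. code:: json

    {
        "postprocessors": [
            {
                "name": "python",
                "function": "~/.local/share/gdl-utils.py:resize",
                "event": "file",
                "archive": "~/gallery-dl/python-calls.sqlite3"
            }
        ]
    }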
ugoira.extension ugoira.extension
---------------- ----------------
Type Type
@ -4836,17 +5112,6 @@ Description
used for (urllib3) warnings. used for (urllib3) warnings.
pyopenssl
---------
Type
``bool``
Default
``false``
Description
Use `pyOpenSSL <https://www.pyopenssl.org/en/stable/>`__-backed
SSL-support.
API Tokens & IDs API Tokens & IDs
================ ================
@ -4912,6 +5177,10 @@ How To
``user-agent`` and replace ``<application name>`` and ``<username>`` ``user-agent`` and replace ``<application name>`` and ``<username>``
accordingly (see Reddit's accordingly (see Reddit's
`API access rules <https://github.com/reddit/reddit/wiki/API>`__) `API access rules <https://github.com/reddit/reddit/wiki/API>`__)
* clear your `cache <cache.file_>`__ to delete any remaining
``access-token`` entries. (``gallery-dl --clear-cache reddit``)
* get a `refresh-token <extractor.reddit.refresh-token_>`__ for the
new ``client-id`` (``gallery-dl oauth:reddit``)
extractor.smugmug.api-key & .api-secret extractor.smugmug.api-key & .api-secret
@ -5123,6 +5392,8 @@ Description
Write metadata to separate files Write metadata to separate files
``mtime`` ``mtime``
Set file modification time according to its metadata Set file modification time according to its metadata
``python``
Call Python functions
``ugoira`` ``ugoira``
Convert Pixiv Ugoira to WebM using `FFmpeg <https://www.ffmpeg.org/>`__ Convert Pixiv Ugoira to WebM using `FFmpeg <https://www.ffmpeg.org/>`__
``zip`` ``zip``


@ -11,14 +11,16 @@ Field names select the metadata value to use in a replacement field.
While simple names are usually enough, more complex forms like accessing values by attribute, element index, or slicing are also supported. While simple names are usually enough, more complex forms like accessing values by attribute, element index, or slicing are also supported.
| | Example | Result | | | Example | Result |
| -------------------- | ----------------- | ---------------------- | | -------------------- | ------------------- | ---------------------- |
| Name | `{title}` | `Hello World` | | Name | `{title}` | `Hello World` |
| Element Index | `{title[6]}` | `W` | | Element Index | `{title[6]}` | `W` |
| Slicing | `{title[3:8]}` | `lo Wo` | | Slicing | `{title[3:8]}` | `lo Wo` |
| Alternatives | `{empty\|title}` | `Hello World` | | Slicing (Bytes) | `{title_ja[b3:18]}` | `ロー・ワー` |
| Element Access | `{user[name]}` | `John Doe` | | Alternatives | `{empty\|title}` | `Hello World` |
| Attribute Access | `{extractor.url}` | `https://example.org/` | | Attribute Access | `{extractor.url}` | `https://example.org/` |
| Element Access | `{user[name]}` | `John Doe` |
| | `{user['name']}` | `John Doe` |
All of these methods can be combined as needed. All of these methods can be combined as needed.
For example `{title[24]|empty|extractor.url[15:-1]}` would result in `.org`. For example `{title[24]|empty|extractor.url[15:-1]}` would result in `.org`.
@ -92,6 +94,18 @@ Conversion specifiers allow to *convert* the value to a different form or type.
<td><code>{created!d}</code></td> <td><code>{created!d}</code></td>
<td><code>2010-01-01 00:00:00</code></td> <td><code>2010-01-01 00:00:00</code></td>
</tr> </tr>
<tr>
<td align="center"><code>U</code></td>
<td>Convert HTML entities</td>
<td><code>{html!U}</code></td>
<td><code>&lt;p&gt;foo &amp; bar&lt;/p&gt;</code></td>
</tr>
<tr>
<td align="center"><code>H</code></td>
<td>Convert HTML entities &amp; remove HTML tags</td>
<td><code>{html!H}</code></td>
<td><code>foo &amp; bar</code></td>
</tr>
<tr> <tr>
<td align="center"><code>s</code></td> <td align="center"><code>s</code></td>
<td>Convert value to <a href="https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str" rel="nofollow"><code>str</code></a></td> <td>Convert value to <a href="https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str" rel="nofollow"><code>str</code></a></td>
@ -150,6 +164,12 @@ Format specifiers can be used for advanced formatting by using the options provi
<td><code>{foo:[1:-1]}</code></td> <td><code>{foo:[1:-1]}</code></td>
<td><code>oo&nbsp;Ba</code></td> <td><code>oo&nbsp;Ba</code></td>
</tr> </tr>
<tr>
<td><code>[b&lt;start&gt;:&lt;stop&gt;]</code></td>
<td>Same as above, but applies to the <a href="https://docs.python.org/3/library/stdtypes.html#bytes"><code>bytes()</code></a> representation of a string in <a href="https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding">filesystem encoding</a></td>
<td><code>{foo_ja:[b3:-1]}</code></td>
<td><code>ー・バ</code></td>
</tr>
<tr> <tr>
<td rowspan="2"><code>L&lt;maxlen&gt;/&lt;repl&gt;/</code></td> <td rowspan="2"><code>L&lt;maxlen&gt;/&lt;repl&gt;/</code></td>
<td rowspan="2">Replaces the entire output with <code>&lt;repl&gt;</code> if its length exceeds <code>&lt;maxlen&gt;</code></td> <td rowspan="2">Replaces the entire output with <code>&lt;repl&gt;</code> if its length exceeds <code>&lt;maxlen&gt;</code></td>
@ -193,7 +213,9 @@ Format specifiers can be used for advanced formatting by using the options provi
</tbody> </tbody>
</table> </table>
All special format specifiers (`?`, `L`, `J`, `R`, `D`, `O`) can be chained and combined with one another, but must always come before any standard format specifiers: All special format specifiers (`?`, `L`, `J`, `R`, `D`, `O`, etc)
can be chained and combined with one another,
but must always appear before any standard format specifiers:
For example `{foo:?//RF/B/Ro/e/> 10}` -> `   Bee Bar` For example `{foo:?//RF/B/Ro/e/> 10}` -> `   Bee Bar`
- `?//` - Tests if `foo` has a value - `?//` - Tests if `foo` has a value
@ -244,7 +266,7 @@ Replacement field names that are available in all format strings.
## Special Type Format Strings ## Special Type Format Strings
Starting a format string with '\f<Type> ' allows to set a different format string type than the default. Available ones are: Starting a format string with `\f<Type> ` allows to set a different format string type than the default. Available ones are:
<table> <table>
<thead> <thead>
@ -285,13 +307,3 @@ Starting a format string with '\f<Type> ' allows to set a different format strin
</tr> </tr>
</tbody> </tbody>
</table> </table>
> **Note:**
>
> `\f` is the [Form Feed](https://en.wikipedia.org/w/index.php?title=Page_break&oldid=1027475805#Form_feed)
> character. (ASCII code 12 or 0xc)
>
> Writing it as `\f` is native to JSON, but will *not* get interpreted
> as such by most shells. To use this character there:
> * hold `Ctrl`, then press `v` followed by `l`, resulting in `^L` or
> * use `echo` or `printf` (e.g. `gallery-dl -f "$(echo -ne \\fM) my_module:generate_text"`)


@ -10,7 +10,7 @@
"proxy": null, "proxy": null,
"skip": true, "skip": true,
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0", "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
"retries": 4, "retries": 4,
"timeout": 30.0, "timeout": 30.0,
"verify": true, "verify": true,
@ -108,8 +108,10 @@
}, },
"flickr": "flickr":
{ {
"videos": true, "exif": false,
"size-max": null "metadata": false,
"size-max": null,
"videos": true
}, },
"furaffinity": "furaffinity":
{ {
@ -129,7 +131,7 @@
}, },
"gofile": { "gofile": {
"api-token": null, "api-token": null,
"website-token": "12345" "website-token": null
}, },
"hentaifoundry": "hentaifoundry":
{ {
@ -146,6 +148,9 @@
"password": null, "password": null,
"sleep-request": 5.0 "sleep-request": 5.0
}, },
"imagechest": {
"access-token": null
},
"imgbb": "imgbb":
{ {
"username": null, "username": null,
@ -166,6 +171,9 @@
"api": "rest", "api": "rest",
"cookies": null, "cookies": null,
"include": "posts", "include": "posts",
"order-files": "asc",
"order-posts": "asc",
"previews": false,
"sleep-request": [6.0, 12.0], "sleep-request": [6.0, 12.0],
"videos": true "videos": true
}, },
@ -190,6 +198,7 @@
"password": null "password": null
}, },
"misskey": { "misskey": {
"access-token": null,
"renotes": false, "renotes": false,
"replies": true "replies": true
}, },
@ -201,10 +210,6 @@
"format": "original", "format": "original",
"include": "art" "include": "art"
}, },
"nana":
{
"favkey": null
},
"nijie": "nijie":
{ {
"username": null, "username": null,
@ -243,6 +248,7 @@
{ {
"refresh-token": null, "refresh-token": null,
"include": "artworks", "include": "artworks",
"embeds": false,
"metadata": false, "metadata": false,
"metadata-bookmark": false, "metadata-bookmark": false,
"tags": "japanese", "tags": "japanese",
@ -255,6 +261,9 @@
}, },
"reddit": "reddit":
{ {
"client-id": null,
"user-agent": null,
"refresh-token": null,
"comments": 0, "comments": 0,
"morecomments": false, "morecomments": false,
"date-min": 0, "date-min": 0,

View File

@ -18,12 +18,6 @@
--user-agent UA User-Agent request header --user-agent UA User-Agent request header
--clear-cache MODULE Delete cached login sessions, cookies, etc. for --clear-cache MODULE Delete cached login sessions, cookies, etc. for
MODULE (ALL to delete everything) MODULE (ALL to delete everything)
--cookies FILE File to load additional cookies from
--cookies-from-browser BROWSER[+KEYRING][:PROFILE][::CONTAINER]
Name of the browser to load cookies from, with
optional keyring name prefixed with '+', profile
prefixed with ':', and container prefixed with
'::' ('none' for no container)
## Output Options: ## Output Options:
-q, --quiet Activate quiet mode -q, --quiet Activate quiet mode
@ -84,6 +78,16 @@
-p, --password PASS Password belonging to the given username -p, --password PASS Password belonging to the given username
--netrc Enable .netrc authentication data --netrc Enable .netrc authentication data
## Cookie Options:
-C, --cookies FILE File to load additional cookies from
--cookies-export FILE Export session cookies to FILE
--cookies-from-browser BROWSER[/DOMAIN][+KEYRING][:PROFILE][::CONTAINER]
Name of the browser to load cookies from, with
optional domain prefixed with '/', keyring name
prefixed with '+', profile prefixed with ':',
and container prefixed with '::' ('none' for no
container)
## Selection Options: ## Selection Options:
--download-archive FILE Record all downloaded or skipped files in FILE --download-archive FILE Record all downloaded or skipped files in FILE
and skip downloading any file already in it and skip downloading any file already in it


@ -32,14 +32,14 @@ Consider all sites to be NSFW unless otherwise known.
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
<td>420chan</td> <td>4chan</td>
<td>https://420chan.org/</td> <td>https://www.4chan.org/</td>
<td>Boards, Threads</td> <td>Boards, Threads</td>
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
<td>4chan</td> <td>4chanarchives</td>
<td>https://www.4chan.org/</td> <td>https://4chanarchives.com/</td>
<td>Boards, Threads</td> <td>Boards, Threads</td>
<td></td> <td></td>
</tr> </tr>
@ -111,7 +111,7 @@ Consider all sites to be NSFW unless otherwise known.
</tr> </tr>
<tr> <tr>
<td>Bunkr</td> <td>Bunkr</td>
<td>https://bunkr.la/</td> <td>https://bunkrr.su/</td>
<td>Albums</td> <td>Albums</td>
<td></td> <td></td>
</tr> </tr>
@ -251,7 +251,7 @@ Consider all sites to be NSFW unless otherwise known.
<td>Gfycat</td> <td>Gfycat</td>
<td>https://gfycat.com/</td> <td>https://gfycat.com/</td>
<td>Collections, individual Images, Search Results, User Profiles</td> <td>Collections, individual Images, Search Results, User Profiles</td>
<td></td> <td>Supported</td>
</tr> </tr>
<tr> <tr>
<td>Gofile</td> <td>Gofile</td>
@ -394,7 +394,7 @@ Consider all sites to be NSFW unless otherwise known.
<tr> <tr>
<td>imgur</td> <td>imgur</td>
<td>https://imgur.com/</td> <td>https://imgur.com/</td>
<td>Albums, Favorites, Galleries, individual Images, Search Results, Subreddits, Tag Searches, User Profiles</td> <td>Albums, Favorites, Favorites Folders, Galleries, individual Images, Search Results, Subreddits, Tag Searches, User Profiles</td>
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
@ -427,6 +427,18 @@ Consider all sites to be NSFW unless otherwise known.
<td>Galleries, individual Images</td> <td>Galleries, individual Images</td>
<td></td> <td></td>
</tr> </tr>
<tr>
<td>itch.io</td>
<td>https://itch.io/</td>
<td>Games</td>
<td></td>
</tr>
<tr>
<td>JPG Fish</td>
<td>https://jpeg.pet/</td>
<td>Albums, individual Images, User Profiles</td>
<td></td>
</tr>
<tr> <tr>
<td>Keenspot</td> <td>Keenspot</td>
<td>http://www.keenspot.com/</td> <td>http://www.keenspot.com/</td>
@ -451,6 +463,12 @@ Consider all sites to be NSFW unless otherwise known.
<td>Chapters, Manga</td> <td>Chapters, Manga</td>
<td></td> <td></td>
</tr> </tr>
<tr>
<td>Lensdump</td>
<td>https://lensdump.com/</td>
<td>Albums, individual Images</td>
<td></td>
</tr>
<tr> <tr>
<td>Lexica</td> <td>Lexica</td>
<td>https://lexica.art/</td> <td>https://lexica.art/</td>
@ -463,12 +481,6 @@ Consider all sites to be NSFW unless otherwise known.
<td>Galleries</td> <td>Galleries</td>
<td></td> <td></td>
</tr> </tr>
<tr>
<td>LINE BLOG</td>
<td>https://www.lineblog.me/</td>
<td>Blogs, Posts</td>
<td></td>
</tr>
<tr> <tr>
<td>livedoor Blog</td> <td>livedoor Blog</td>
<td>http://blog.livedoor.jp/</td> <td>http://blog.livedoor.jp/</td>
@ -523,6 +535,12 @@ Consider all sites to be NSFW unless otherwise known.
<td>Chapters, Manga</td> <td>Chapters, Manga</td>
<td></td> <td></td>
</tr> </tr>
<tr>
<td>MangaRead</td>
<td>https://mangaread.org/</td>
<td>Chapters, Manga</td>
<td></td>
</tr>
<tr> <tr>
<td>MangaSee</td> <td>MangaSee</td>
<td>https://mangasee123.com/</td> <td>https://mangasee123.com/</td>
@ -535,24 +553,12 @@ Consider all sites to be NSFW unless otherwise known.
<td>Albums, Channels</td> <td>Albums, Channels</td>
<td>Supported</td> <td>Supported</td>
</tr> </tr>
<tr>
<td>meme.museum</td>
<td>https://meme.museum/</td>
<td>Posts, Tag Searches</td>
<td></td>
</tr>
<tr> <tr>
<td>My Hentai Gallery</td> <td>My Hentai Gallery</td>
<td>https://myhentaigallery.com/</td> <td>https://myhentaigallery.com/</td>
<td>Galleries</td> <td>Galleries</td>
<td></td> <td></td>
</tr> </tr>
<tr>
<td>Nana</td>
<td>https://nana.my.id/</td>
<td>Galleries, Favorites, Search Results</td>
<td></td>
</tr>
<tr> <tr>
<td>Naver</td> <td>Naver</td>
<td>https://blog.naver.com/</td> <td>https://blog.naver.com/</td>
@ -652,7 +658,7 @@ Consider all sites to be NSFW unless otherwise known.
<tr> <tr>
<td>Pixiv</td> <td>Pixiv</td>
<td>https://www.pixiv.net/</td> <td>https://www.pixiv.net/</td>
<td>Artworks, Avatars, Backgrounds, Favorites, Follows, pixiv.me Links, pixivision, Rankings, Search Results, Series, Sketch, User Profiles, individual Images</td> <td>Artworks, Avatars, Backgrounds, Favorites, Follows, pixiv.me Links, Novels, Novel Bookmarks, Novel Series, pixivision, Rankings, Search Results, Series, Sketch, User Profiles, individual Images</td>
<td><a href="https://github.com/mikf/gallery-dl#oauth">OAuth</a></td> <td><a href="https://github.com/mikf/gallery-dl#oauth">OAuth</a></td>
</tr> </tr>
<tr> <tr>
@ -700,7 +706,7 @@ Consider all sites to be NSFW unless otherwise known.
<tr> <tr>
<td>Postimg</td> <td>Postimg</td>
<td>https://postimages.org/</td> <td>https://postimages.org/</td>
<td>individual Images</td> <td>Galleries, individual Images</td>
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
@ -724,7 +730,7 @@ Consider all sites to be NSFW unless otherwise known.
<tr> <tr>
<td>RedGIFs</td> <td>RedGIFs</td>
<td>https://redgifs.com/</td> <td>https://redgifs.com/</td>
<td>Collections, individual Images, Search Results, User Profiles</td> <td>Collections, individual Images, Niches, Search Results, User Profiles</td>
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
@ -819,7 +825,7 @@ Consider all sites to be NSFW unless otherwise known.
</tr> </tr>
<tr> <tr>
<td>TCB Scans</td> <td>TCB Scans</td>
<td>https://onepiecechapters.com/</td> <td>https://tcbscans.com/</td>
<td>Chapters, Manga</td> <td>Chapters, Manga</td>
<td></td> <td></td>
</tr> </tr>
@ -844,7 +850,7 @@ Consider all sites to be NSFW unless otherwise known.
<tr> <tr>
<td>Tumblr</td> <td>Tumblr</td>
<td>https://www.tumblr.com/</td> <td>https://www.tumblr.com/</td>
<td>Likes, Posts, Tag Searches, User Profiles</td> <td>Days, Likes, Posts, Tag Searches, User Profiles</td>
<td><a href="https://github.com/mikf/gallery-dl#oauth">OAuth</a></td> <td><a href="https://github.com/mikf/gallery-dl#oauth">OAuth</a></td>
</tr> </tr>
<tr> <tr>
@ -868,7 +874,7 @@ Consider all sites to be NSFW unless otherwise known.
<tr> <tr>
<td>Twitter</td> <td>Twitter</td>
<td>https://twitter.com/</td> <td>https://twitter.com/</td>
<td>Avatars, Backgrounds, Bookmarks, Events, Hashtags, individual Images, Likes, Lists, List Members, Media Timelines, Search Results, Timelines, Tweets</td> <td>Avatars, Backgrounds, Bookmarks, Events, Hashtags, individual Images, Likes, Lists, List Members, Media Timelines, Search Results, Timelines, Tweets, User Profiles</td>
<td>Supported</td> <td>Supported</td>
</tr> </tr>
<tr> <tr>
@ -887,7 +893,7 @@ Consider all sites to be NSFW unless otherwise known.
<td>Vipergirls</td> <td>Vipergirls</td>
<td>https://vipergirls.to/</td> <td>https://vipergirls.to/</td>
<td>Posts, Threads</td> <td>Posts, Threads</td>
<td></td> <td>Supported</td>
</tr> </tr>
<tr> <tr>
<td>Vipr</td> <td>Vipr</td>
@ -989,7 +995,7 @@ Consider all sites to be NSFW unless otherwise known.
<td>Zerochan</td> <td>Zerochan</td>
<td>https://www.zerochan.net/</td> <td>https://www.zerochan.net/</td>
<td>individual Images, Tag Searches</td> <td>individual Images, Tag Searches</td>
<td></td> <td>Supported</td>
</tr> </tr>
<tr> <tr>
<td>かべうち</td> <td>かべうち</td>
@ -1003,12 +1009,6 @@ Consider all sites to be NSFW unless otherwise known.
<td>Posts, Tag Searches</td> <td>Posts, Tag Searches</td>
<td></td> <td></td>
</tr> </tr>
<tr>
<td>半次元</td>
<td>https://bcy.net/</td>
<td>Posts, User Profiles</td>
<td></td>
</tr>
<tr> <tr>
<td colspan="4"><strong>Danbooru Instances</strong></td> <td colspan="4"><strong>Danbooru Instances</strong></td>
@ -1031,6 +1031,12 @@ Consider all sites to be NSFW unless otherwise known.
<td>Pools, Popular Images, Posts, Tag Searches</td> <td>Pools, Popular Images, Posts, Tag Searches</td>
<td>Supported</td> <td>Supported</td>
</tr> </tr>
<tr>
<td>Booruvar</td>
<td>https://booru.borvar.art/</td>
<td>Pools, Popular Images, Posts, Tag Searches</td>
<td></td>
</tr>
<tr> <tr>
<td colspan="4"><strong>e621 Instances</strong></td> <td colspan="4"><strong>e621 Instances</strong></td>
@ -1047,6 +1053,12 @@ Consider all sites to be NSFW unless otherwise known.
<td>Favorites, Pools, Popular Images, Posts, Tag Searches</td> <td>Favorites, Pools, Popular Images, Posts, Tag Searches</td>
<td>Supported</td> <td>Supported</td>
</tr> </tr>
<tr>
<td>e6AI</td>
<td>https://e6ai.net/</td>
<td>Favorites, Pools, Popular Images, Posts, Tag Searches</td>
<td></td>
</tr>
<tr> <tr>
<td colspan="4"><strong>Gelbooru Beta 0.1.11</strong></td> <td colspan="4"><strong>Gelbooru Beta 0.1.11</strong></td>
@ -1076,8 +1088,8 @@ Consider all sites to be NSFW unless otherwise known.
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
<td>/v/idyart</td> <td>/v/idyart2</td>
<td>https://vidyart.booru.org/</td> <td>https://vidyart2.booru.org/</td>
<td>Favorites, Posts, Tag Searches</td> <td>Favorites, Posts, Tag Searches</td>
<td></td> <td></td>
</tr> </tr>
@ -1116,6 +1128,16 @@ Consider all sites to be NSFW unless otherwise known.
<td></td> <td></td>
</tr> </tr>
<tr>
<td colspan="4"><strong>jschan Imageboards</strong></td>
</tr>
<tr>
<td>94chan</td>
<td>https://94chan.org/</td>
<td>Boards, Threads</td>
<td></td>
</tr>
<tr> <tr>
<td colspan="4"><strong>LynxChan Imageboards</strong></td> <td colspan="4"><strong>LynxChan Imageboards</strong></td>
</tr> </tr>
@ -1144,19 +1166,19 @@ Consider all sites to be NSFW unless otherwise known.
<tr> <tr>
<td>Misskey.io</td> <td>Misskey.io</td>
<td>https://misskey.io/</td> <td>https://misskey.io/</td>
<td>Images from Notes, User Profiles</td> <td>Favorites, Images from Notes, User Profiles</td>
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
<td>Lesbian.energy</td> <td>Lesbian.energy</td>
<td>https://lesbian.energy/</td> <td>https://lesbian.energy/</td>
<td>Images from Notes, User Profiles</td> <td>Favorites, Images from Notes, User Profiles</td>
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
<td>Sushi.ski</td> <td>Sushi.ski</td>
<td>https://sushi.ski/</td> <td>https://sushi.ski/</td>
<td>Images from Notes, User Profiles</td> <td>Favorites, Images from Notes, User Profiles</td>
<td></td> <td></td>
</tr> </tr>
@ -1266,6 +1288,40 @@ Consider all sites to be NSFW unless otherwise known.
<td></td> <td></td>
</tr> </tr>
<tr>
<td colspan="4"><strong>Shimmie2 Instances</strong></td>
</tr>
<tr>
<td>meme.museum</td>
<td>https://meme.museum/</td>
<td>Posts, Tag Searches</td>
<td></td>
</tr>
<tr>
<td>Loudbooru</td>
<td>https://loudbooru.com/</td>
<td>Posts, Tag Searches</td>
<td></td>
</tr>
<tr>
<td>Giantessbooru</td>
<td>https://giantessbooru.com/</td>
<td>Posts, Tag Searches</td>
<td></td>
</tr>
<tr>
<td>Tentaclerape</td>
<td>https://tentaclerape.net/</td>
<td>Posts, Tag Searches</td>
<td></td>
</tr>
<tr>
<td>Cavemanon</td>
<td>https://booru.cavemanon.xyz/</td>
<td>Posts, Tag Searches</td>
<td></td>
</tr>
<tr> <tr>
<td colspan="4"><strong>szurubooru Instances</strong></td> <td colspan="4"><strong>szurubooru Instances</strong></td>
</tr> </tr>
@ -1388,14 +1444,8 @@ Consider all sites to be NSFW unless otherwise known.
<td></td> <td></td>
</tr> </tr>
<tr> <tr>
<td>Rozen Arcana</td> <td>Palanq</td>
<td>https://archive.alice.al/</td> <td>https://archive.palanq.win/</td>
<td>Boards, Galleries, Search Results, Threads</td>
<td></td>
</tr>
<tr>
<td>TokyoChronos</td>
<td>https://www.tokyochronos.net/</td>
<td>Boards, Galleries, Search Results, Threads</td> <td>Boards, Galleries, Search Results, Threads</td>
<td></td> <td></td>
</tr> </tr>
@ -1421,12 +1471,6 @@ Consider all sites to be NSFW unless otherwise known.
<td>Chapters, Manga</td> <td>Chapters, Manga</td>
<td></td> <td></td>
</tr> </tr>
<tr>
<td>Sense-Scans</td>
<td>https://sensescans.com/reader/</td>
<td>Chapters, Manga</td>
<td></td>
</tr>
<tr> <tr>
<td colspan="4"><strong>Mastodon Instances</strong></td> <td colspan="4"><strong>Mastodon Instances</strong></td>
View File
@ -70,12 +70,14 @@ def main():
if args.cookies_from_browser: if args.cookies_from_browser:
browser, _, profile = args.cookies_from_browser.partition(":") browser, _, profile = args.cookies_from_browser.partition(":")
browser, _, keyring = browser.partition("+") browser, _, keyring = browser.partition("+")
browser, _, domain = browser.partition("/")
if profile.startswith(":"): if profile.startswith(":"):
container = profile[1:] container = profile[1:]
profile = None profile = None
else: else:
profile, _, container = profile.partition("::") profile, _, container = profile.partition("::")
config.set((), "cookies", (browser, profile, keyring, container)) config.set((), "cookies", (
browser, profile, keyring, container, domain))
if args.options_pp: if args.options_pp:
config.set((), "postprocessor-options", args.options_pp) config.set((), "postprocessor-options", args.options_pp)
for opts in args.options: for opts in args.options:
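With the added partition("/") step, the --cookies-from-browser argument effectively has the shape NAME[/DOMAIN][+KEYRING][:PROFILE[::CONTAINER]]. A minimal standalone sketch of the parsing above, using a made-up specification string for illustration:

    def parse_browser_spec(spec):
        # same partition order as in main(): ":" first, then "+", then "/"
        browser, _, profile = spec.partition(":")
        browser, _, keyring = browser.partition("+")
        browser, _, domain = browser.partition("/")
        if profile.startswith(":"):
            container = profile[1:]
            profile = None
        else:
            profile, _, container = profile.partition("::")
        return browser, profile, keyring, container, domain

    # hypothetical specification string
    print(parse_browser_spec("firefox/example.org+kwallet:default::work"))
    # -> ('firefox', 'default', 'kwallet', 'work', 'example.org')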
View File
@ -102,7 +102,8 @@ def load(files=None, strict=False, load=util.json_loads):
log.error(exc) log.error(exc)
sys.exit(1) sys.exit(1)
except Exception as exc: except Exception as exc:
log.warning("Could not parse '%s': %s", path, exc) log.error("%s when loading '%s': %s",
exc.__class__.__name__, path, exc)
if strict: if strict:
sys.exit(2) sys.exit(2)
else: else:
@ -118,7 +119,7 @@ def clear():
_config.clear() _config.clear()
def get(path, key, default=None, *, conf=_config): def get(path, key, default=None, conf=_config):
"""Get the value of property 'key' or a default value""" """Get the value of property 'key' or a default value"""
try: try:
for p in path: for p in path:
@ -128,7 +129,7 @@ def get(path, key, default=None, *, conf=_config):
return default return default
def interpolate(path, key, default=None, *, conf=_config): def interpolate(path, key, default=None, conf=_config):
"""Interpolate the value of 'key'""" """Interpolate the value of 'key'"""
if key in conf: if key in conf:
return conf[key] return conf[key]
@ -142,7 +143,7 @@ def interpolate(path, key, default=None, *, conf=_config):
return default return default
def interpolate_common(common, paths, key, default=None, *, conf=_config): def interpolate_common(common, paths, key, default=None, conf=_config):
"""Interpolate the value of 'key' """Interpolate the value of 'key'
using multiple 'paths' along a 'common' ancestor using multiple 'paths' along a 'common' ancestor
""" """
@ -174,7 +175,7 @@ def interpolate_common(common, paths, key, default=None, *, conf=_config):
return default return default
def accumulate(path, key, *, conf=_config): def accumulate(path, key, conf=_config):
"""Accumulate the values of 'key' along 'path'""" """Accumulate the values of 'key' along 'path'"""
result = [] result = []
try: try:
@ -193,7 +194,7 @@ def accumulate(path, key, *, conf=_config):
return result return result
def set(path, key, value, *, conf=_config): def set(path, key, value, conf=_config):
"""Set the value of property 'key' for this session""" """Set the value of property 'key' for this session"""
for p in path: for p in path:
try: try:
@ -203,7 +204,7 @@ def set(path, key, value, *, conf=_config):
conf[key] = value conf[key] = value
def setdefault(path, key, value, *, conf=_config): def setdefault(path, key, value, conf=_config):
"""Set the value of property 'key' if it doesn't exist""" """Set the value of property 'key' if it doesn't exist"""
for p in path: for p in path:
try: try:
@ -213,7 +214,7 @@ def setdefault(path, key, value, *, conf=_config):
return conf.setdefault(key, value) return conf.setdefault(key, value)
def unset(path, key, *, conf=_config): def unset(path, key, conf=_config):
"""Unset the value of property 'key'""" """Unset the value of property 'key'"""
try: try:
for p in path: for p in path:
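These config helpers all walk a tuple of nested-dict keys ('path') before reading or writing 'key'; the change in this hunk only turns the keyword-only 'conf' parameter into a regular positional one. A simplified standalone model of the get/set pair (not the real module, just the access pattern it implements):

    _config = {}

    def set(path, key, value, conf=_config):
        # walk 'path', creating intermediate dicts as needed
        for p in path:
            conf = conf.setdefault(p, {})
        conf[key] = value

    def get(path, key, default=None, conf=_config):
        # walk 'path'; fall back to 'default' on any missing level
        try:
            for p in path:
                conf = conf[p]
            return conf[key]
        except (KeyError, TypeError):
            return default

    set(("extractor", "twitter"), "retries", 3)
    print(get(("extractor", "twitter"), "retries"))    # 3
    print(get(("extractor", "flickr"), "retries", 1))  # 1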
View File
@ -20,7 +20,6 @@ import struct
import subprocess import subprocess
import sys import sys
import tempfile import tempfile
from datetime import datetime, timedelta, timezone
from hashlib import pbkdf2_hmac from hashlib import pbkdf2_hmac
from http.cookiejar import Cookie from http.cookiejar import Cookie
from . import aes, text, util from . import aes, text, util
@ -34,19 +33,19 @@ logger = logging.getLogger("cookies")
def load_cookies(cookiejar, browser_specification): def load_cookies(cookiejar, browser_specification):
browser_name, profile, keyring, container = \ browser_name, profile, keyring, container, domain = \
_parse_browser_specification(*browser_specification) _parse_browser_specification(*browser_specification)
if browser_name == "firefox": if browser_name == "firefox":
load_cookies_firefox(cookiejar, profile, container) load_cookies_firefox(cookiejar, profile, container, domain)
elif browser_name == "safari": elif browser_name == "safari":
load_cookies_safari(cookiejar, profile) load_cookies_safari(cookiejar, profile, domain)
elif browser_name in SUPPORTED_BROWSERS_CHROMIUM: elif browser_name in SUPPORTED_BROWSERS_CHROMIUM:
load_cookies_chrome(cookiejar, browser_name, profile, keyring) load_cookies_chrome(cookiejar, browser_name, profile, keyring, domain)
else: else:
raise ValueError("unknown browser '{}'".format(browser_name)) raise ValueError("unknown browser '{}'".format(browser_name))
def load_cookies_firefox(cookiejar, profile=None, container=None): def load_cookies_firefox(cookiejar, profile=None, container=None, domain=None):
path, container_id = _firefox_cookies_database(profile, container) path, container_id = _firefox_cookies_database(profile, container)
with DatabaseCopy(path) as db: with DatabaseCopy(path) as db:
@ -60,6 +59,13 @@ def load_cookies_firefox(cookiejar, profile=None, container=None):
sql += " WHERE originAttributes LIKE ? OR originAttributes LIKE ?" sql += " WHERE originAttributes LIKE ? OR originAttributes LIKE ?"
uid = "%userContextId={}".format(container_id) uid = "%userContextId={}".format(container_id)
parameters = (uid, uid + "&%") parameters = (uid, uid + "&%")
elif domain:
if domain[0] == ".":
sql += " WHERE host == ? OR host LIKE ?"
parameters = (domain[1:], "%" + domain)
else:
sql += " WHERE host == ? OR host == ?"
parameters = (domain, "." + domain)
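The same leading-dot convention is used for the Chrome host_key filter and the Safari record filter further down: a domain given as ".example.org" also matches subdomains, while "example.org" matches only the bare host and its dotted variant. A small illustration of the two parameter sets built here (domain values made up):

    def firefox_domain_filter(domain):
        # mirrors the branch above
        if domain[0] == ".":
            return (" WHERE host == ? OR host LIKE ?",
                    (domain[1:], "%" + domain))
        return (" WHERE host == ? OR host == ?",
                (domain, "." + domain))

    print(firefox_domain_filter(".example.org"))
    # (' WHERE host == ? OR host LIKE ?', ('example.org', '%.example.org'))
    print(firefox_domain_filter("example.org"))
    # (' WHERE host == ? OR host == ?', ('example.org', '.example.org'))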
set_cookie = cookiejar.set_cookie set_cookie = cookiejar.set_cookie
for name, value, domain, path, secure, expires in db.execute( for name, value, domain, path, secure, expires in db.execute(
@ -69,9 +75,10 @@ def load_cookies_firefox(cookiejar, profile=None, container=None):
domain, bool(domain), domain.startswith("."), domain, bool(domain), domain.startswith("."),
path, bool(path), secure, expires, False, None, None, {}, path, bool(path), secure, expires, False, None, None, {},
)) ))
_log_info("Extracted %s cookies from Firefox", len(cookiejar))
def load_cookies_safari(cookiejar, profile=None): def load_cookies_safari(cookiejar, profile=None, domain=None):
"""Ref.: https://github.com/libyal/dtformats/blob """Ref.: https://github.com/libyal/dtformats/blob
/main/documentation/Safari%20Cookies.asciidoc /main/documentation/Safari%20Cookies.asciidoc
- This data appears to be out of date - This data appears to be out of date
@ -87,27 +94,40 @@ def load_cookies_safari(cookiejar, profile=None):
_safari_parse_cookies_page(p.read_bytes(page_size), cookiejar) _safari_parse_cookies_page(p.read_bytes(page_size), cookiejar)
def load_cookies_chrome(cookiejar, browser_name, profile, keyring): def load_cookies_chrome(cookiejar, browser_name, profile=None,
keyring=None, domain=None):
config = _get_chromium_based_browser_settings(browser_name) config = _get_chromium_based_browser_settings(browser_name)
path = _chrome_cookies_database(profile, config) path = _chrome_cookies_database(profile, config)
logger.debug("Extracting cookies from %s", path) _log_debug("Extracting cookies from %s", path)
with DatabaseCopy(path) as db: with DatabaseCopy(path) as db:
db.text_factory = bytes db.text_factory = bytes
decryptor = get_cookie_decryptor( decryptor = get_cookie_decryptor(
config["directory"], config["keyring"], keyring=keyring) config["directory"], config["keyring"], keyring)
if domain:
if domain[0] == ".":
condition = " WHERE host_key == ? OR host_key LIKE ?"
parameters = (domain[1:], "%" + domain)
else:
condition = " WHERE host_key == ? OR host_key == ?"
parameters = (domain, "." + domain)
else:
condition = ""
parameters = ()
try: try:
rows = db.execute( rows = db.execute(
"SELECT host_key, name, value, encrypted_value, path, " "SELECT host_key, name, value, encrypted_value, path, "
"expires_utc, is_secure FROM cookies") "expires_utc, is_secure FROM cookies" + condition, parameters)
except sqlite3.OperationalError: except sqlite3.OperationalError:
rows = db.execute( rows = db.execute(
"SELECT host_key, name, value, encrypted_value, path, " "SELECT host_key, name, value, encrypted_value, path, "
"expires_utc, secure FROM cookies") "expires_utc, secure FROM cookies" + condition, parameters)
set_cookie = cookiejar.set_cookie set_cookie = cookiejar.set_cookie
failed_cookies = unencrypted_cookies = 0 failed_cookies = 0
unencrypted_cookies = 0
for domain, name, value, enc_value, path, expires, secure in rows: for domain, name, value, enc_value, path, expires, secure in rows:
@ -135,11 +155,11 @@ def load_cookies_chrome(cookiejar, browser_name, profile, keyring):
else: else:
failed_message = "" failed_message = ""
logger.info("Extracted %s cookies from %s%s", _log_info("Extracted %s cookies from %s%s",
len(cookiejar), browser_name, failed_message) len(cookiejar), browser_name.capitalize(), failed_message)
counts = decryptor.cookie_counts.copy() counts = decryptor.cookie_counts
counts["unencrypted"] = unencrypted_cookies counts["unencrypted"] = unencrypted_cookies
logger.debug("cookie version breakdown: %s", counts) _log_debug("Cookie version breakdown: %s", counts)
# -------------------------------------------------------------------- # --------------------------------------------------------------------
@ -157,11 +177,11 @@ def _firefox_cookies_database(profile=None, container=None):
if path is None: if path is None:
raise FileNotFoundError("Unable to find Firefox cookies database in " raise FileNotFoundError("Unable to find Firefox cookies database in "
"{}".format(search_root)) "{}".format(search_root))
logger.debug("Extracting cookies from %s", path) _log_debug("Extracting cookies from %s", path)
if container == "none": if container == "none":
container_id = False container_id = False
logger.debug("Only loading cookies not belonging to any container") _log_debug("Only loading cookies not belonging to any container")
elif container: elif container:
containers_path = os.path.join( containers_path = os.path.join(
@ -171,8 +191,8 @@ def _firefox_cookies_database(profile=None, container=None):
with open(containers_path) as file: with open(containers_path) as file:
identities = util.json_loads(file.read())["identities"] identities = util.json_loads(file.read())["identities"]
except OSError: except OSError:
logger.error("Unable to read Firefox container database at %s", _log_error("Unable to read Firefox container database at '%s'",
containers_path) containers_path)
raise raise
except KeyError: except KeyError:
identities = () identities = ()
@ -183,10 +203,10 @@ def _firefox_cookies_database(profile=None, container=None):
container_id = context["userContextId"] container_id = context["userContextId"]
break break
else: else:
raise ValueError("Unable to find Firefox container {}".format( raise ValueError("Unable to find Firefox container '{}'".format(
container)) container))
logger.debug("Only loading cookies from container '%s' (ID %s)", _log_debug("Only loading cookies from container '%s' (ID %s)",
container, container_id) container, container_id)
else: else:
container_id = None container_id = None
@ -209,7 +229,7 @@ def _safari_cookies_database():
path = os.path.expanduser("~/Library/Cookies/Cookies.binarycookies") path = os.path.expanduser("~/Library/Cookies/Cookies.binarycookies")
return open(path, "rb") return open(path, "rb")
except FileNotFoundError: except FileNotFoundError:
logger.debug("Trying secondary cookie location") _log_debug("Trying secondary cookie location")
path = os.path.expanduser("~/Library/Containers/com.apple.Safari/Data" path = os.path.expanduser("~/Library/Containers/com.apple.Safari/Data"
"/Library/Cookies/Cookies.binarycookies") "/Library/Cookies/Cookies.binarycookies")
return open(path, "rb") return open(path, "rb")
@ -224,13 +244,13 @@ def _safari_parse_cookies_header(data):
return page_sizes, p.cursor return page_sizes, p.cursor
def _safari_parse_cookies_page(data, jar): def _safari_parse_cookies_page(data, cookiejar, domain=None):
p = DataParser(data) p = DataParser(data)
p.expect_bytes(b"\x00\x00\x01\x00", "page signature") p.expect_bytes(b"\x00\x00\x01\x00", "page signature")
number_of_cookies = p.read_uint() number_of_cookies = p.read_uint()
record_offsets = [p.read_uint() for _ in range(number_of_cookies)] record_offsets = [p.read_uint() for _ in range(number_of_cookies)]
if number_of_cookies == 0: if number_of_cookies == 0:
logger.debug("a cookies page of size %s has no cookies", len(data)) _log_debug("Cookies page of size %s has no cookies", len(data))
return return
p.skip_to(record_offsets[0], "unknown page header field") p.skip_to(record_offsets[0], "unknown page header field")
@ -238,12 +258,12 @@ def _safari_parse_cookies_page(data, jar):
for i, record_offset in enumerate(record_offsets): for i, record_offset in enumerate(record_offsets):
p.skip_to(record_offset, "space between records") p.skip_to(record_offset, "space between records")
record_length = _safari_parse_cookies_record( record_length = _safari_parse_cookies_record(
data[record_offset:], jar) data[record_offset:], cookiejar, domain)
p.read_bytes(record_length) p.read_bytes(record_length)
p.skip_to_end("space in between pages") p.skip_to_end("space in between pages")
def _safari_parse_cookies_record(data, cookiejar): def _safari_parse_cookies_record(data, cookiejar, host=None):
p = DataParser(data) p = DataParser(data)
record_size = p.read_uint() record_size = p.read_uint()
p.skip(4, "unknown record field 1") p.skip(4, "unknown record field 1")
@ -262,6 +282,14 @@ def _safari_parse_cookies_record(data, cookiejar):
p.skip_to(domain_offset) p.skip_to(domain_offset)
domain = p.read_cstring() domain = p.read_cstring()
if host:
if host[0] == ".":
if host[1:] != domain and not domain.endswith(host):
return record_size
else:
if host != domain and ("." + host) != domain:
return record_size
p.skip_to(name_offset) p.skip_to(name_offset)
name = p.read_cstring() name = p.read_cstring()
@ -271,8 +299,7 @@ def _safari_parse_cookies_record(data, cookiejar):
p.skip_to(value_offset) p.skip_to(value_offset)
value = p.read_cstring() value = p.read_cstring()
except UnicodeDecodeError: except UnicodeDecodeError:
logger.warning("failed to parse Safari cookie " _log_warning("Failed to parse Safari cookie")
"because UTF-8 decoding failed")
return record_size return record_size
p.skip_to(record_size, "space at the end of the record") p.skip_to(record_size, "space at the end of the record")
@ -300,7 +327,7 @@ def _chrome_cookies_database(profile, config):
elif config["profiles"]: elif config["profiles"]:
search_root = os.path.join(config["directory"], profile) search_root = os.path.join(config["directory"], profile)
else: else:
logger.warning("%s does not support profiles", config["browser"]) _log_warning("%s does not support profiles", config["browser"])
search_root = config["directory"] search_root = config["directory"]
path = _find_most_recently_used_file(search_root, "Cookies") path = _find_most_recently_used_file(search_root, "Cookies")
@ -412,18 +439,17 @@ class ChromeCookieDecryptor:
raise NotImplementedError("Must be implemented by sub classes") raise NotImplementedError("Must be implemented by sub classes")
def get_cookie_decryptor(browser_root, browser_keyring_name, *, keyring=None): def get_cookie_decryptor(browser_root, browser_keyring_name, keyring=None):
if sys.platform in ("win32", "cygwin"): if sys.platform in ("win32", "cygwin"):
return WindowsChromeCookieDecryptor(browser_root) return WindowsChromeCookieDecryptor(browser_root)
elif sys.platform == "darwin": elif sys.platform == "darwin":
return MacChromeCookieDecryptor(browser_keyring_name) return MacChromeCookieDecryptor(browser_keyring_name)
else: else:
return LinuxChromeCookieDecryptor( return LinuxChromeCookieDecryptor(browser_keyring_name, keyring)
browser_keyring_name, keyring=keyring)
class LinuxChromeCookieDecryptor(ChromeCookieDecryptor): class LinuxChromeCookieDecryptor(ChromeCookieDecryptor):
def __init__(self, browser_keyring_name, *, keyring=None): def __init__(self, browser_keyring_name, keyring=None):
self._v10_key = self.derive_key(b"peanuts") self._v10_key = self.derive_key(b"peanuts")
password = _get_linux_keyring_password(browser_keyring_name, keyring) password = _get_linux_keyring_password(browser_keyring_name, keyring)
self._v11_key = None if password is None else self.derive_key(password) self._v11_key = None if password is None else self.derive_key(password)
@ -452,7 +478,7 @@ class LinuxChromeCookieDecryptor(ChromeCookieDecryptor):
elif version == b"v11": elif version == b"v11":
self._cookie_counts["v11"] += 1 self._cookie_counts["v11"] += 1
if self._v11_key is None: if self._v11_key is None:
logger.warning("cannot decrypt v11 cookies: no key found") _log_warning("Unable to decrypt v11 cookies: no key found")
return None return None
return _decrypt_aes_cbc(ciphertext, self._v11_key) return _decrypt_aes_cbc(ciphertext, self._v11_key)
@ -486,7 +512,7 @@ class MacChromeCookieDecryptor(ChromeCookieDecryptor):
if version == b"v10": if version == b"v10":
self._cookie_counts["v10"] += 1 self._cookie_counts["v10"] += 1
if self._v10_key is None: if self._v10_key is None:
logger.warning("cannot decrypt v10 cookies: no key found") _log_warning("Unable to decrypt v10 cookies: no key found")
return None return None
return _decrypt_aes_cbc(ciphertext, self._v10_key) return _decrypt_aes_cbc(ciphertext, self._v10_key)
@ -516,7 +542,7 @@ class WindowsChromeCookieDecryptor(ChromeCookieDecryptor):
if version == b"v10": if version == b"v10":
self._cookie_counts["v10"] += 1 self._cookie_counts["v10"] += 1
if self._v10_key is None: if self._v10_key is None:
logger.warning("cannot decrypt v10 cookies: no key found") _log_warning("Unable to decrypt v10 cookies: no key found")
return None return None
# https://chromium.googlesource.com/chromium/src/+/refs/heads # https://chromium.googlesource.com/chromium/src/+/refs/heads
@ -554,7 +580,7 @@ def _choose_linux_keyring():
SelectBackend SelectBackend
""" """
desktop_environment = _get_linux_desktop_environment(os.environ) desktop_environment = _get_linux_desktop_environment(os.environ)
logger.debug("Detected desktop environment: %s", desktop_environment) _log_debug("Detected desktop environment: %s", desktop_environment)
if desktop_environment == DE_KDE: if desktop_environment == DE_KDE:
return KEYRING_KWALLET return KEYRING_KWALLET
if desktop_environment == DE_OTHER: if desktop_environment == DE_OTHER:
@ -582,23 +608,23 @@ def _get_kwallet_network_wallet():
) )
if proc.returncode != 0: if proc.returncode != 0:
logger.warning("failed to read NetworkWallet") _log_warning("Failed to read NetworkWallet")
return default_wallet return default_wallet
else: else:
network_wallet = stdout.decode().strip() network_wallet = stdout.decode().strip()
logger.debug("NetworkWallet = '%s'", network_wallet) _log_debug("NetworkWallet = '%s'", network_wallet)
return network_wallet return network_wallet
except Exception as exc: except Exception as exc:
logger.warning("exception while obtaining NetworkWallet (%s: %s)", _log_warning("Error while obtaining NetworkWallet (%s: %s)",
exc.__class__.__name__, exc) exc.__class__.__name__, exc)
return default_wallet return default_wallet
def _get_kwallet_password(browser_keyring_name): def _get_kwallet_password(browser_keyring_name):
logger.debug("using kwallet-query to obtain password from kwallet") _log_debug("Using kwallet-query to obtain password from kwallet")
if shutil.which("kwallet-query") is None: if shutil.which("kwallet-query") is None:
logger.error( _log_error(
"kwallet-query command not found. KWallet and kwallet-query " "kwallet-query command not found. KWallet and kwallet-query "
"must be installed to read from KWallet. kwallet-query should be " "must be installed to read from KWallet. kwallet-query should be "
"included in the kwallet package for your distribution") "included in the kwallet package for your distribution")
@ -615,14 +641,14 @@ def _get_kwallet_password(browser_keyring_name):
) )
if proc.returncode != 0: if proc.returncode != 0:
logger.error("kwallet-query failed with return code {}. " _log_error("kwallet-query failed with return code {}. "
"Please consult the kwallet-query man page " "Please consult the kwallet-query man page "
"for details".format(proc.returncode)) "for details".format(proc.returncode))
return b"" return b""
if stdout.lower().startswith(b"failed to read"): if stdout.lower().startswith(b"failed to read"):
logger.debug("Failed to read password from kwallet. " _log_debug("Failed to read password from kwallet. "
"Using empty string instead") "Using empty string instead")
# This sometimes occurs in KDE because chrome does not check # This sometimes occurs in KDE because chrome does not check
# hasEntry and instead just tries to read the value (which # hasEntry and instead just tries to read the value (which
# kwallet returns "") whereas kwallet-query checks hasEntry. # kwallet returns "") whereas kwallet-query checks hasEntry.
@ -633,13 +659,12 @@ def _get_kwallet_password(browser_keyring_name):
# random password and store it, but that doesn't matter here. # random password and store it, but that doesn't matter here.
return b"" return b""
else: else:
logger.debug("password found")
if stdout[-1:] == b"\n": if stdout[-1:] == b"\n":
stdout = stdout[:-1] stdout = stdout[:-1]
return stdout return stdout
except Exception as exc: except Exception as exc:
logger.warning("exception running kwallet-query (%s: %s)", _log_warning("Error when running kwallet-query (%s: %s)",
exc.__class__.__name__, exc) exc.__class__.__name__, exc)
return b"" return b""
@ -647,7 +672,7 @@ def _get_gnome_keyring_password(browser_keyring_name):
try: try:
import secretstorage import secretstorage
except ImportError: except ImportError:
logger.error("secretstorage not available") _log_error("'secretstorage' Python package not available")
return b"" return b""
# Gnome keyring does not seem to organise keys in the same way as KWallet, # Gnome keyring does not seem to organise keys in the same way as KWallet,
@ -662,7 +687,7 @@ def _get_gnome_keyring_password(browser_keyring_name):
if item.get_label() == label: if item.get_label() == label:
return item.get_secret() return item.get_secret()
else: else:
logger.error("failed to read from keyring") _log_error("Failed to read from GNOME keyring")
return b"" return b""
@ -676,7 +701,7 @@ def _get_linux_keyring_password(browser_keyring_name, keyring):
if not keyring: if not keyring:
keyring = _choose_linux_keyring() keyring = _choose_linux_keyring()
logger.debug("Chosen keyring: %s", keyring) _log_debug("Chosen keyring: %s", keyring)
if keyring == KEYRING_KWALLET: if keyring == KEYRING_KWALLET:
return _get_kwallet_password(browser_keyring_name) return _get_kwallet_password(browser_keyring_name)
@ -690,8 +715,8 @@ def _get_linux_keyring_password(browser_keyring_name, keyring):
def _get_mac_keyring_password(browser_keyring_name): def _get_mac_keyring_password(browser_keyring_name):
logger.debug("using find-generic-password to obtain " _log_debug("Using find-generic-password to obtain "
"password from OSX keychain") "password from OSX keychain")
try: try:
proc, stdout = Popen_communicate( proc, stdout = Popen_communicate(
"security", "find-generic-password", "security", "find-generic-password",
@ -704,28 +729,28 @@ def _get_mac_keyring_password(browser_keyring_name):
stdout = stdout[:-1] stdout = stdout[:-1]
return stdout return stdout
except Exception as exc: except Exception as exc:
logger.warning("exception running find-generic-password (%s: %s)", _log_warning("Error when using find-generic-password (%s: %s)",
exc.__class__.__name__, exc) exc.__class__.__name__, exc)
return None return None
def _get_windows_v10_key(browser_root): def _get_windows_v10_key(browser_root):
path = _find_most_recently_used_file(browser_root, "Local State") path = _find_most_recently_used_file(browser_root, "Local State")
if path is None: if path is None:
logger.error("could not find local state file") _log_error("Unable to find Local State file")
return None return None
logger.debug("Found local state file at '%s'", path) _log_debug("Found Local State file at '%s'", path)
with open(path, encoding="utf-8") as file: with open(path, encoding="utf-8") as file:
data = util.json_loads(file.read()) data = util.json_loads(file.read())
try: try:
base64_key = data["os_crypt"]["encrypted_key"] base64_key = data["os_crypt"]["encrypted_key"]
except KeyError: except KeyError:
logger.error("no encrypted key in Local State") _log_error("Unable to find encrypted key in Local State")
return None return None
encrypted_key = binascii.a2b_base64(base64_key) encrypted_key = binascii.a2b_base64(base64_key)
prefix = b"DPAPI" prefix = b"DPAPI"
if not encrypted_key.startswith(prefix): if not encrypted_key.startswith(prefix):
logger.error("invalid key") _log_error("Invalid Local State key")
return None return None
return _decrypt_windows_dpapi(encrypted_key[len(prefix):]) return _decrypt_windows_dpapi(encrypted_key[len(prefix):])
@ -777,10 +802,10 @@ class DataParser:
def skip(self, num_bytes, description="unknown"): def skip(self, num_bytes, description="unknown"):
if num_bytes > 0: if num_bytes > 0:
logger.debug("skipping {} bytes ({}): {!r}".format( _log_debug("Skipping {} bytes ({}): {!r}".format(
num_bytes, description, self.read_bytes(num_bytes))) num_bytes, description, self.read_bytes(num_bytes)))
elif num_bytes < 0: elif num_bytes < 0:
raise ParserError("invalid skip of {} bytes".format(num_bytes)) raise ParserError("Invalid skip of {} bytes".format(num_bytes))
def skip_to(self, offset, description="unknown"): def skip_to(self, offset, description="unknown"):
self.skip(offset - self.cursor, description) self.skip(offset - self.cursor, description)
@ -893,8 +918,8 @@ def _get_linux_desktop_environment(env):
def _mac_absolute_time_to_posix(timestamp): def _mac_absolute_time_to_posix(timestamp):
return int((datetime(2001, 1, 1, 0, 0, tzinfo=timezone.utc) + # 978307200 is timestamp of 2001-01-01 00:00:00
timedelta(seconds=timestamp)).timestamp()) return 978307200 + int(timestamp)
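978307200 is the number of seconds between the Unix epoch (1970-01-01) and the Mac/Safari cookie epoch (2001-01-01), so the datetime arithmetic can be replaced by a single addition. A quick check, with an arbitrary example timestamp:

    from datetime import datetime, timezone

    mac_epoch = datetime(2001, 1, 1, tzinfo=timezone.utc)
    assert int(mac_epoch.timestamp()) == 978307200

    def mac_absolute_time_to_posix(timestamp):
        return 978307200 + int(timestamp)

    # 700000000 is an arbitrary value chosen for illustration
    print(datetime.fromtimestamp(mac_absolute_time_to_posix(700000000),
                                 timezone.utc))
    # 2023-03-08 20:26:40+00:00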
def pbkdf2_sha1(password, salt, iterations, key_length): def pbkdf2_sha1(password, salt, iterations, key_length):
@ -902,31 +927,25 @@ def pbkdf2_sha1(password, salt, iterations, key_length):
def _decrypt_aes_cbc(ciphertext, key, initialization_vector=b" " * 16): def _decrypt_aes_cbc(ciphertext, key, initialization_vector=b" " * 16):
plaintext = aes.unpad_pkcs7(
aes.aes_cbc_decrypt_bytes(ciphertext, key, initialization_vector))
try: try:
return plaintext.decode() return aes.unpad_pkcs7(aes.aes_cbc_decrypt_bytes(
ciphertext, key, initialization_vector)).decode()
except UnicodeDecodeError: except UnicodeDecodeError:
logger.warning("failed to decrypt cookie (AES-CBC) because UTF-8 " _log_warning("Failed to decrypt cookie (AES-CBC Unicode)")
"decoding failed. Possibly the key is wrong?") except ValueError:
return None _log_warning("Failed to decrypt cookie (AES-CBC)")
return None
def _decrypt_aes_gcm(ciphertext, key, nonce, authentication_tag): def _decrypt_aes_gcm(ciphertext, key, nonce, authentication_tag):
try: try:
plaintext = aes.aes_gcm_decrypt_and_verify_bytes( return aes.aes_gcm_decrypt_and_verify_bytes(
ciphertext, key, authentication_tag, nonce) ciphertext, key, authentication_tag, nonce).decode()
except ValueError:
logger.warning("failed to decrypt cookie (AES-GCM) because MAC check "
"failed. Possibly the key is wrong?")
return None
try:
return plaintext.decode()
except UnicodeDecodeError: except UnicodeDecodeError:
logger.warning("failed to decrypt cookie (AES-GCM) because UTF-8 " _log_warning("Failed to decrypt cookie (AES-GCM Unicode)")
"decoding failed. Possibly the key is wrong?") except ValueError:
return None _log_warning("Failed to decrypt cookie (AES-GCM MAC)")
return None
def _decrypt_windows_dpapi(ciphertext): def _decrypt_windows_dpapi(ciphertext):
@ -954,7 +973,7 @@ def _decrypt_windows_dpapi(ciphertext):
ctypes.byref(blob_out) # pDataOut ctypes.byref(blob_out) # pDataOut
) )
if not ret: if not ret:
logger.warning("failed to decrypt with DPAPI") _log_warning("Failed to decrypt cookie (DPAPI)")
return None return None
result = ctypes.string_at(blob_out.pbData, blob_out.cbData) result = ctypes.string_at(blob_out.pbData, blob_out.cbData)
@ -979,12 +998,29 @@ def _is_path(value):
def _parse_browser_specification( def _parse_browser_specification(
browser, profile=None, keyring=None, container=None): browser, profile=None, keyring=None, container=None, domain=None):
browser = browser.lower() browser = browser.lower()
if browser not in SUPPORTED_BROWSERS: if browser not in SUPPORTED_BROWSERS:
raise ValueError("unsupported browser '{}'".format(browser)) raise ValueError("Unsupported browser '{}'".format(browser))
if keyring and keyring not in SUPPORTED_KEYRINGS: if keyring and keyring not in SUPPORTED_KEYRINGS:
raise ValueError("unsupported keyring '{}'".format(keyring)) raise ValueError("Unsupported keyring '{}'".format(keyring))
if profile and _is_path(profile): if profile and _is_path(profile):
profile = os.path.expanduser(profile) profile = os.path.expanduser(profile)
return browser, profile, keyring, container return browser, profile, keyring, container, domain
_log_cache = set()
_log_debug = logger.debug
_log_info = logger.info
def _log_warning(msg, *args):
if msg not in _log_cache:
_log_cache.add(msg)
logger.warning(msg, *args)
def _log_error(msg, *args):
if msg not in _log_cache:
_log_cache.add(msg)
logger.error(msg, *args)
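The new module-level helpers de-duplicate log output: each distinct format string is emitted at most once per run (the cache is keyed on the format string, not on the interpolated arguments). A self-contained illustration:

    import logging

    logging.basicConfig(level=logging.WARNING)
    logger = logging.getLogger("cookies")

    _log_cache = set()

    def _log_warning(msg, *args):
        # emit each distinct warning message only once per run
        if msg not in _log_cache:
            _log_cache.add(msg)
            logger.warning(msg, *args)

    for _ in range(3):
        _log_warning("Failed to decrypt cookie (AES-GCM MAC)")
    # the warning above is printed exactly once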
View File
@ -44,6 +44,12 @@ class HttpDownloader(DownloaderBase):
self.mtime = self.config("mtime", True) self.mtime = self.config("mtime", True)
self.rate = self.config("rate") self.rate = self.config("rate")
if not self.config("consume-content", False):
# this resets the underlying TCP connection, and therefore
# if the program makes another request to the same domain,
# a new connection (either TLS or plain TCP) must be made
self.release_conn = lambda resp: resp.close()
if self.retries < 0: if self.retries < 0:
self.retries = float("inf") self.retries = float("inf")
if self.minsize: if self.minsize:
@ -106,7 +112,7 @@ class HttpDownloader(DownloaderBase):
while True: while True:
if tries: if tries:
if response: if response:
response.close() self.release_conn(response)
response = None response = None
self.log.warning("%s (%s/%s)", msg, tries, self.retries+1) self.log.warning("%s (%s/%s)", msg, tries, self.retries+1)
if tries > self.retries: if tries > self.retries:
@ -165,18 +171,24 @@ class HttpDownloader(DownloaderBase):
retry = kwdict.get("_http_retry") retry = kwdict.get("_http_retry")
if retry and retry(response): if retry and retry(response):
continue continue
self.release_conn(response)
self.log.warning(msg) self.log.warning(msg)
return False return False
# check for invalid responses # check for invalid responses
validate = kwdict.get("_http_validate") validate = kwdict.get("_http_validate")
if validate and self.validate: if validate and self.validate:
result = validate(response) try:
result = validate(response)
except Exception:
self.release_conn(response)
raise
if isinstance(result, str): if isinstance(result, str):
url = result url = result
tries -= 1 tries -= 1
continue continue
if not result: if not result:
self.release_conn(response)
self.log.warning("Invalid response") self.log.warning("Invalid response")
return False return False
@ -184,11 +196,13 @@ class HttpDownloader(DownloaderBase):
size = text.parse_int(size, None) size = text.parse_int(size, None)
if size is not None: if size is not None:
if self.minsize and size < self.minsize: if self.minsize and size < self.minsize:
self.release_conn(response)
self.log.warning( self.log.warning(
"File size smaller than allowed minimum (%s < %s)", "File size smaller than allowed minimum (%s < %s)",
size, self.minsize) size, self.minsize)
return False return False
if self.maxsize and size > self.maxsize: if self.maxsize and size > self.maxsize:
self.release_conn(response)
self.log.warning( self.log.warning(
"File size larger than allowed maximum (%s > %s)", "File size larger than allowed maximum (%s > %s)",
size, self.maxsize) size, self.maxsize)
@ -280,6 +294,18 @@ class HttpDownloader(DownloaderBase):
return True return True
def release_conn(self, response):
"""Release connection back to pool by consuming response body"""
try:
for _ in response.iter_content(self.chunk_size):
pass
except (RequestException, SSLError, OpenSSLError) as exc:
print()
self.log.debug(
"Unable to consume response body (%s: %s); "
"closing the connection anyway", exc.__class__.__name__, exc)
response.close()
@staticmethod @staticmethod
def receive(fp, content, bytes_total, bytes_start): def receive(fp, content, bytes_total, bytes_start):
write = fp.write write = fp.write
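The new release_conn hook controls what happens to a partially read response: by default it is replaced by a plain close() (see the consume-content check in __init__ above), which drops the underlying TCP/TLS connection, while with consume-content enabled the remaining body is drained first so close() can hand an intact connection back to the urllib3 pool. The option is read as "consume-content" inside HttpDownloader, i.e. downloader.http.consume-content in a config file (inferred from the code above). A simplified standalone model:

    import requests
    from requests.exceptions import RequestException

    def release_conn(response, consume_content, chunk_size=32768):
        # With consume_content the body is drained first, so close()
        # returns an intact connection to the pool; otherwise close()
        # discards the connection.
        if consume_content:
            try:
                for _ in response.iter_content(chunk_size):
                    pass
            except RequestException:
                pass
        response.close()

    # hypothetical usage; URL chosen for illustration only
    with requests.Session() as session:
        resp = session.get("https://example.org/", stream=True)
        release_conn(resp, consume_content=True)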
View File
@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2015-2020 Mike Fährmann # Copyright 2015-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -17,12 +17,10 @@ class _3dbooruBase():
basecategory = "booru" basecategory = "booru"
root = "http://behoimi.org" root = "http://behoimi.org"
def __init__(self, match): def _init(self):
super().__init__(match) headers = self.session.headers
self.session.headers.update({ headers["Referer"] = "http://behoimi.org/post/show/"
"Referer": "http://behoimi.org/post/show/", headers["Accept-Encoding"] = "identity"
"Accept-Encoding": "identity",
})
class _3dbooruTagExtractor(_3dbooruBase, moebooru.MoebooruTagExtractor): class _3dbooruTagExtractor(_3dbooruBase, moebooru.MoebooruTagExtractor):
View File
@ -1,76 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2021 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://420chan.org/"""
from .common import Extractor, Message
class _420chanThreadExtractor(Extractor):
"""Extractor for 420chan threads"""
category = "420chan"
subcategory = "thread"
directory_fmt = ("{category}", "{board}", "{thread} {title}")
archive_fmt = "{board}_{thread}_{filename}"
pattern = r"(?:https?://)?boards\.420chan\.org/([^/?#]+)/thread/(\d+)"
test = ("https://boards.420chan.org/ani/thread/33251/chow-chows", {
"pattern": r"https://boards\.420chan\.org/ani/src/\d+\.jpg",
"content": "b07c803b0da78de159709da923e54e883c100934",
"count": 2,
})
def __init__(self, match):
Extractor.__init__(self, match)
self.board, self.thread = match.groups()
def items(self):
url = "https://api.420chan.org/{}/res/{}.json".format(
self.board, self.thread)
posts = self.request(url).json()["posts"]
data = {
"board" : self.board,
"thread": self.thread,
"title" : posts[0].get("sub") or posts[0]["com"][:50],
}
yield Message.Directory, data
for post in posts:
if "filename" in post:
post.update(data)
post["extension"] = post["ext"][1:]
url = "https://boards.420chan.org/{}/src/{}{}".format(
post["board"], post["filename"], post["ext"])
yield Message.Url, url, post
class _420chanBoardExtractor(Extractor):
"""Extractor for 420chan boards"""
category = "420chan"
subcategory = "board"
pattern = r"(?:https?://)?boards\.420chan\.org/([^/?#]+)/\d*$"
test = ("https://boards.420chan.org/po/", {
"pattern": _420chanThreadExtractor.pattern,
"count": ">= 100",
})
def __init__(self, match):
Extractor.__init__(self, match)
self.board = match.group(1)
def items(self):
url = "https://api.420chan.org/{}/threads.json".format(self.board)
threads = self.request(url).json()
for page in threads:
for thread in page["threads"]:
url = "https://boards.420chan.org/{}/thread/{}/".format(
self.board, thread["no"])
thread["page"] = page["page"]
thread["_extractor"] = _420chanThreadExtractor
yield Message.Queue, url, thread
View File
@ -0,0 +1,139 @@
# -*- coding: utf-8 -*-
# Copyright 2023 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://4chanarchives.com/"""
from .common import Extractor, Message
from .. import text
class _4chanarchivesThreadExtractor(Extractor):
"""Extractor for threads on 4chanarchives.com"""
category = "4chanarchives"
subcategory = "thread"
root = "https://4chanarchives.com"
directory_fmt = ("{category}", "{board}", "{thread} - {title}")
filename_fmt = "{no}-{filename}.{extension}"
archive_fmt = "{board}_{thread}_{no}"
pattern = r"(?:https?://)?4chanarchives\.com/board/([^/?#]+)/thread/(\d+)"
test = (
("https://4chanarchives.com/board/c/thread/2707110", {
"pattern": r"https://i\.imgur\.com/(0wLGseE|qbByWDc)\.jpg",
"count": 2,
"keyword": {
"board": "c",
"com": str,
"name": "Anonymous",
"no": int,
"thread": "2707110",
"time": r"re:2016-07-1\d \d\d:\d\d:\d\d",
"title": "Ren Kagami from 'Oyako Neburi'",
},
}),
)
def __init__(self, match):
Extractor.__init__(self, match)
self.board, self.thread = match.groups()
def items(self):
url = "{}/board/{}/thread/{}".format(
self.root, self.board, self.thread)
page = self.request(url).text
data = self.metadata(page)
posts = self.posts(page)
if not data["title"]:
data["title"] = text.unescape(text.remove_html(
posts[0]["com"]))[:50]
for post in posts:
post.update(data)
yield Message.Directory, post
if "url" in post:
yield Message.Url, post["url"], post
def metadata(self, page):
return {
"board" : self.board,
"thread" : self.thread,
"title" : text.unescape(text.extr(
page, 'property="og:title" content="', '"')),
}
def posts(self, page):
"""Build a list of all post objects"""
return [self.parse(html) for html in text.extract_iter(
page, 'id="pc', '</blockquote>')]
def parse(self, html):
"""Build post object by extracting data from an HTML post"""
post = self._extract_post(html)
if ">File: <" in html:
self._extract_file(html, post)
post["extension"] = post["url"].rpartition(".")[2]
return post
@staticmethod
def _extract_post(html):
extr = text.extract_from(html)
return {
"no" : text.parse_int(extr('', '"')),
"name": extr('class="name">', '<'),
"time": extr('class="dateTime postNum" >', '<').rstrip(),
"com" : text.unescape(
html[html.find('<blockquote'):].partition(">")[2]),
}
@staticmethod
def _extract_file(html, post):
extr = text.extract_from(html, html.index(">File: <"))
post["url"] = extr('href="', '"')
post["filename"] = text.unquote(extr(">", "<").rpartition(".")[0])
post["fsize"] = extr("(", ", ")
post["w"] = text.parse_int(extr("", "x"))
post["h"] = text.parse_int(extr("", ")"))
class _4chanarchivesBoardExtractor(Extractor):
"""Extractor for boards on 4chanarchives.com"""
category = "4chanarchives"
subcategory = "board"
root = "https://4chanarchives.com"
pattern = r"(?:https?://)?4chanarchives\.com/board/([^/?#]+)(?:/(\d+))?/?$"
test = (
("https://4chanarchives.com/board/c/", {
"pattern": _4chanarchivesThreadExtractor.pattern,
"range": "1-40",
"count": 40,
}),
("https://4chanarchives.com/board/c"),
("https://4chanarchives.com/board/c/10"),
)
def __init__(self, match):
Extractor.__init__(self, match)
self.board, self.page = match.groups()
def items(self):
data = {"_extractor": _4chanarchivesThreadExtractor}
pnum = text.parse_int(self.page, 1)
needle = '''<span class="postNum desktop">
<span><a href="'''
while True:
url = "{}/board/{}/{}".format(self.root, self.board, pnum)
page = self.request(url).text
thread = None
for thread in text.extract_iter(page, needle, '"'):
yield Message.Queue, thread, data
if thread is None:
return
pnum += 1
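Both new extractors rely purely on marker-based string extraction (text.extract_from / text.extract_iter) rather than an HTML parser. A rough standalone sketch of the same idea, run against a handcrafted fragment that is far simpler than the real 4chanarchives markup:

    html = ('id="pc2707110" ... <span class="name">Anonymous</span>'
            '<span class="dateTime postNum" >2016-07-11 09:41:57 </span>'
            '<blockquote class="postMessage">Ren Kagami</blockquote>')

    def extract(txt, begin, end, pos=0):
        # minimal stand-in for gallery_dl.text.extr: substring between
        # 'begin' and 'end', or "" if either marker is missing
        try:
            first = txt.index(begin, pos) + len(begin)
            return txt[first:txt.index(end, first)]
        except ValueError:
            return ""

    post = {
        "no"  : int(extract(html, 'id="pc', '"')),
        "name": extract(html, 'class="name">', '<'),
        "time": extract(html, 'class="dateTime postNum" >', '<').rstrip(),
        "com" : extract(html, "<blockquote", "</blockquote>").partition(">")[2],
    }
    print(post)
    # {'no': 2707110, 'name': 'Anonymous',
    #  'time': '2016-07-11 09:41:57', 'com': 'Ren Kagami'}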
View File
@ -21,10 +21,9 @@ class _500pxExtractor(Extractor):
filename_fmt = "{id}_{name}.{extension}" filename_fmt = "{id}_{name}.{extension}"
archive_fmt = "{id}" archive_fmt = "{id}"
root = "https://500px.com" root = "https://500px.com"
cookiedomain = ".500px.com" cookies_domain = ".500px.com"
def __init__(self, match): def _init(self):
Extractor.__init__(self, match)
self.session.headers["Referer"] = self.root + "/" self.session.headers["Referer"] = self.root + "/"
def items(self): def items(self):
@ -73,7 +72,7 @@ class _500pxExtractor(Extractor):
def _request_api(self, url, params): def _request_api(self, url, params):
headers = { headers = {
"Origin": self.root, "Origin": self.root,
"x-csrf-token": self.session.cookies.get( "x-csrf-token": self.cookies.get(
"x-csrf-token", domain=".500px.com"), "x-csrf-token", domain=".500px.com"),
} }
return self.request(url, headers=headers, params=params).json() return self.request(url, headers=headers, params=params).json()
@ -81,7 +80,7 @@ class _500pxExtractor(Extractor):
def _request_graphql(self, opname, variables): def _request_graphql(self, opname, variables):
url = "https://api.500px.com/graphql" url = "https://api.500px.com/graphql"
headers = { headers = {
"x-csrf-token": self.session.cookies.get( "x-csrf-token": self.cookies.get(
"x-csrf-token", domain=".500px.com"), "x-csrf-token", domain=".500px.com"),
} }
data = { data = {
View File
@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2022 Mike Fährmann # Copyright 2022-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -27,7 +27,7 @@ class _8chanExtractor(Extractor):
Extractor.__init__(self, match) Extractor.__init__(self, match)
@memcache() @memcache()
def _prepare_cookies(self): def cookies_prepare(self):
# fetch captcha cookies # fetch captcha cookies
# (necessary to download without getting interrupted) # (necessary to download without getting interrupted)
now = datetime.utcnow() now = datetime.utcnow()
@ -39,14 +39,14 @@ class _8chanExtractor(Extractor):
# - remove 'expires' timestamp # - remove 'expires' timestamp
# - move 'captchaexpiration' value forward by 1 month) # - move 'captchaexpiration' value forward by 1 month)
domain = self.root.rpartition("/")[2] domain = self.root.rpartition("/")[2]
for cookie in self.session.cookies: for cookie in self.cookies:
if cookie.domain.endswith(domain): if cookie.domain.endswith(domain):
cookie.expires = None cookie.expires = None
if cookie.name == "captchaexpiration": if cookie.name == "captchaexpiration":
cookie.value = (now + timedelta(30, 300)).strftime( cookie.value = (now + timedelta(30, 300)).strftime(
"%a, %d %b %Y %H:%M:%S GMT") "%a, %d %b %Y %H:%M:%S GMT")
return self.session.cookies return self.cookies
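For reference, timedelta(30, 300) is 30 days plus a 300-second margin, and the strftime pattern is the standard HTTP date format. A worked example with an arbitrarily fixed starting date:

    from datetime import datetime, timedelta

    now = datetime(2023, 7, 29, 12, 0, 0)  # fixed value for the example
    print((now + timedelta(30, 300)).strftime("%a, %d %b %Y %H:%M:%S GMT"))
    # Mon, 28 Aug 2023 12:05:00 GMT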
class _8chanThreadExtractor(_8chanExtractor): class _8chanThreadExtractor(_8chanExtractor):
@ -113,7 +113,7 @@ class _8chanThreadExtractor(_8chanExtractor):
thread["_http_headers"] = {"Referer": url + "html"} thread["_http_headers"] = {"Referer": url + "html"}
try: try:
self.session.cookies = self._prepare_cookies() self.cookies = self.cookies_prepare()
except Exception as exc: except Exception as exc:
self.log.debug("Failed to fetch captcha cookies: %s: %s", self.log.debug("Failed to fetch captcha cookies: %s: %s",
exc.__class__.__name__, exc, exc_info=True) exc.__class__.__name__, exc, exc_info=True)
@ -150,6 +150,8 @@ class _8chanBoardExtractor(_8chanExtractor):
def __init__(self, match): def __init__(self, match):
_8chanExtractor.__init__(self, match) _8chanExtractor.__init__(self, match)
_, self.board, self.page = match.groups() _, self.board, self.page = match.groups()
def _init(self):
self.session.headers["Referer"] = self.root + "/" self.session.headers["Referer"] = self.root + "/"
def items(self): def items(self):
View File
@ -35,8 +35,10 @@ class _8musesAlbumExtractor(Extractor):
"id" : 10467, "id" : 10467,
"title" : "Liar", "title" : "Liar",
"path" : "Fakku Comics/mogg/Liar", "path" : "Fakku Comics/mogg/Liar",
"parts" : ["Fakku Comics", "mogg", "Liar"],
"private": False, "private": False,
"url" : str, "url" : "https://comics.8muses.com/comics"
"/album/Fakku-Comics/mogg/Liar",
"parent" : 10464, "parent" : 10464,
"views" : int, "views" : int,
"likes" : int, "likes" : int,
@ -118,9 +120,10 @@ class _8musesAlbumExtractor(Extractor):
return { return {
"id" : album["id"], "id" : album["id"],
"path" : album["path"], "path" : album["path"],
"parts" : album["path"].split("/"),
"title" : album["name"], "title" : album["name"],
"private": album["isPrivate"], "private": album["isPrivate"],
"url" : self.root + album["permalink"], "url" : self.root + "/comics/album/" + album["permalink"],
"parent" : text.parse_int(album["parentId"]), "parent" : text.parse_int(album["parentId"]),
"views" : text.parse_int(album["numberViews"]), "views" : text.parse_int(album["numberViews"]),
"likes" : text.parse_int(album["numberLikes"]), "likes" : text.parse_int(album["numberLikes"]),
View File
@ -14,8 +14,8 @@ modules = [
"2chen", "2chen",
"35photo", "35photo",
"3dbooru", "3dbooru",
"420chan",
"4chan", "4chan",
"4chanarchives",
"500px", "500px",
"8chan", "8chan",
"8muses", "8muses",
@ -24,7 +24,6 @@ modules = [
"artstation", "artstation",
"aryion", "aryion",
"bbc", "bbc",
"bcy",
"behance", "behance",
"blogger", "blogger",
"bunkr", "bunkr",
@ -74,14 +73,17 @@ modules = [
"instagram", "instagram",
"issuu", "issuu",
"itaku", "itaku",
"itchio",
"jpgfish",
"jschan",
"kabeuchi", "kabeuchi",
"keenspot", "keenspot",
"kemonoparty", "kemonoparty",
"khinsider", "khinsider",
"komikcast", "komikcast",
"lensdump",
"lexica", "lexica",
"lightroom", "lightroom",
"lineblog",
"livedoor", "livedoor",
"luscious", "luscious",
"lynxchan", "lynxchan",
@ -91,13 +93,12 @@ modules = [
"mangakakalot", "mangakakalot",
"manganelo", "manganelo",
"mangapark", "mangapark",
"mangaread",
"mangasee", "mangasee",
"mangoxo", "mangoxo",
"mememuseum",
"misskey", "misskey",
"myhentaigallery", "myhentaigallery",
"myportfolio", "myportfolio",
"nana",
"naver", "naver",
"naverwebtoon", "naverwebtoon",
"newgrounds", "newgrounds",
@ -133,6 +134,7 @@ modules = [
"seiga", "seiga",
"senmanga", "senmanga",
"sexcom", "sexcom",
"shimmie2",
"simplyhentai", "simplyhentai",
"skeb", "skeb",
"slickpic", "slickpic",
View File
@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2018-2022 Mike Fährmann # Copyright 2018-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -27,12 +27,12 @@ class ArtstationExtractor(Extractor):
def __init__(self, match): def __init__(self, match):
Extractor.__init__(self, match) Extractor.__init__(self, match)
self.user = match.group(1) or match.group(2) self.user = match.group(1) or match.group(2)
self.external = self.config("external", False)
def items(self): def items(self):
data = self.metadata() data = self.metadata()
projects = self.projects() projects = self.projects()
external = self.config("external", False)
max_posts = self.config("max-posts") max_posts = self.config("max-posts")
if max_posts: if max_posts:
projects = itertools.islice(projects, max_posts) projects = itertools.islice(projects, max_posts)
@ -45,7 +45,7 @@ class ArtstationExtractor(Extractor):
asset["num"] = num asset["num"] = num
yield Message.Directory, asset yield Message.Directory, asset
if adict["has_embedded_player"] and self.external: if adict["has_embedded_player"] and external:
player = adict["player_embedded"] player = adict["player_embedded"]
url = (text.extr(player, 'src="', '"') or url = (text.extr(player, 'src="', '"') or
text.extr(player, "src='", "'")) text.extr(player, "src='", "'"))
View File
@ -23,8 +23,8 @@ class AryionExtractor(Extractor):
directory_fmt = ("{category}", "{user!l}", "{path:J - }") directory_fmt = ("{category}", "{user!l}", "{path:J - }")
filename_fmt = "{id} {title}.{extension}" filename_fmt = "{id} {title}.{extension}"
archive_fmt = "{id}" archive_fmt = "{id}"
cookiedomain = ".aryion.com" cookies_domain = ".aryion.com"
cookienames = ("phpbb3_rl7a3_sid",) cookies_names = ("phpbb3_rl7a3_sid",)
root = "https://aryion.com" root = "https://aryion.com"
def __init__(self, match): def __init__(self, match):
@ -33,11 +33,12 @@ class AryionExtractor(Extractor):
self.recursive = True self.recursive = True
def login(self): def login(self):
if self._check_cookies(self.cookienames): if self.cookies_check(self.cookies_names):
return return
username, password = self._get_auth_info() username, password = self._get_auth_info()
if username: if username:
self._update_cookies(self._login_impl(username, password)) self.cookies_update(self._login_impl(username, password))
@cache(maxage=14*24*3600, keyarg=1) @cache(maxage=14*24*3600, keyarg=1)
def _login_impl(self, username, password): def _login_impl(self, username, password):
@ -53,7 +54,7 @@ class AryionExtractor(Extractor):
response = self.request(url, method="POST", data=data) response = self.request(url, method="POST", data=data)
if b"You have been successfully logged in." not in response.content: if b"You have been successfully logged in." not in response.content:
raise exception.AuthenticationError() raise exception.AuthenticationError()
return {c: response.cookies[c] for c in self.cookienames} return {c: response.cookies[c] for c in self.cookies_names}
def items(self): def items(self):
self.login() self.login()
@ -188,9 +189,11 @@ class AryionGalleryExtractor(AryionExtractor):
def __init__(self, match): def __init__(self, match):
AryionExtractor.__init__(self, match) AryionExtractor.__init__(self, match)
self.recursive = self.config("recursive", True)
self.offset = 0 self.offset = 0
def _init(self):
self.recursive = self.config("recursive", True)
def skip(self, num): def skip(self, num):
if self.recursive: if self.recursive:
return 0 return 0
@ -216,9 +219,11 @@ class AryionTagExtractor(AryionExtractor):
"count": ">= 5", "count": ">= 5",
}) })
def metadata(self): def _init(self):
self.params = text.parse_query(self.user) self.params = text.parse_query(self.user)
self.user = None self.user = None
def metadata(self):
return {"search_tags": self.params.get("tag")} return {"search_tags": self.params.get("tag")}
def posts(self): def posts(self):
View File
@ -1,206 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2020-2023 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://bcy.net/"""
from .common import Extractor, Message
from .. import text, util, exception
import re
class BcyExtractor(Extractor):
"""Base class for bcy extractors"""
category = "bcy"
directory_fmt = ("{category}", "{user[id]} {user[name]}")
filename_fmt = "{post[id]} {id}.{extension}"
archive_fmt = "{post[id]}_{id}"
root = "https://bcy.net"
def __init__(self, match):
Extractor.__init__(self, match)
self.item_id = match.group(1)
self.session.headers["Referer"] = self.root + "/"
def items(self):
sub = re.compile(r"^https?://p\d+-bcy"
r"(?:-sign\.bcyimg\.com|\.byteimg\.com/img)"
r"/banciyuan").sub
iroot = "https://img-bcy-qn.pstatp.com"
noop = self.config("noop")
for post in self.posts():
if not post["image_list"]:
continue
multi = None
tags = post.get("post_tags") or ()
data = {
"user": {
"id" : post["uid"],
"name" : post["uname"],
"avatar" : sub(iroot, post["avatar"].partition("~")[0]),
},
"post": {
"id" : text.parse_int(post["item_id"]),
"tags" : [t["tag_name"] for t in tags],
"date" : text.parse_timestamp(post["ctime"]),
"parody" : post["work"],
"content": post["plain"],
"likes" : post["like_count"],
"shares" : post["share_count"],
"replies": post["reply_count"],
},
}
yield Message.Directory, data
for data["num"], image in enumerate(post["image_list"], 1):
data["id"] = image["mid"]
data["width"] = image["w"]
data["height"] = image["h"]
url = image["path"].partition("~")[0]
text.nameext_from_url(url, data)
# full-resolution image without watermark
if data["extension"]:
if not url.startswith(iroot):
url = sub(iroot, url)
data["filter"] = ""
yield Message.Url, url, data
# watermarked image & low quality noop filter
else:
if multi is None:
multi = self._data_from_post(
post["item_id"])["post_data"]["multi"]
image = multi[data["num"] - 1]
if image["origin"]:
data["filter"] = "watermark"
yield Message.Url, image["origin"], data
if noop:
data["extension"] = ""
data["filter"] = "noop"
yield Message.Url, image["original_path"], data
def posts(self):
"""Returns an iterable with all relevant 'post' objects"""
def _data_from_post(self, post_id):
url = "{}/item/detail/{}".format(self.root, post_id)
page = self.request(url, notfound="post").text
data = (text.extr(page, 'JSON.parse("', '");')
.replace('\\\\u002F', '/')
.replace('\\"', '"'))
try:
return util.json_loads(data)["detail"]
except ValueError:
return util.json_loads(data.replace('\\"', '"'))["detail"]
class BcyUserExtractor(BcyExtractor):
"""Extractor for user timelines"""
subcategory = "user"
pattern = r"(?:https?://)?bcy\.net/u/(\d+)"
test = (
("https://bcy.net/u/1933712", {
"pattern": r"https://img-bcy-qn.pstatp.com/\w+/\d+/post/\w+/.+jpg",
"count": ">= 20",
}),
("https://bcy.net/u/109282764041", {
"pattern": r"https://p\d-bcy-sign\.bcyimg\.com/banciyuan/[0-9a-f]+"
r"~tplv-bcyx-yuan-logo-v1:.+\.image",
"range": "1-25",
"count": 25,
}),
)
def posts(self):
url = self.root + "/apiv3/user/selfPosts"
params = {"uid": self.item_id, "since": None}
while True:
data = self.request(url, params=params).json()
try:
items = data["data"]["items"]
except KeyError:
return
if not items:
return
for item in items:
yield item["item_detail"]
params["since"] = item["since"]
class BcyPostExtractor(BcyExtractor):
"""Extractor for individual posts"""
subcategory = "post"
pattern = r"(?:https?://)?bcy\.net/item/detail/(\d+)"
test = (
("https://bcy.net/item/detail/6355835481002893070", {
"url": "301202375e61fd6e0e2e35de6c3ac9f74885dec3",
"count": 1,
"keyword": {
"user": {
"id" : 1933712,
"name" : "wukloo",
"avatar" : "re:https://img-bcy-qn.pstatp.com/Public/",
},
"post": {
"id" : 6355835481002893070,
"tags" : list,
"date" : "dt:2016-11-22 08:47:46",
"parody" : "东方PROJECT",
"content": "re:根据微博的建议稍微做了点修改",
"likes" : int,
"shares" : int,
"replies": int,
},
"id": 8330182,
"num": 1,
"width" : 3000,
"height": 1687,
"filename": "712e0780b09011e696f973c3d1568337",
"extension": "jpg",
},
}),
# only watermarked images available
("https://bcy.net/item/detail/6950136331708144648", {
"pattern": r"https://p\d-bcy-sign\.bcyimg\.com/banciyuan/[0-9a-f]+"
r"~tplv-bcyx-yuan-logo-v1:.+\.image",
"count": 10,
"keyword": {"filter": "watermark"},
}),
# deleted
("https://bcy.net/item/detail/6780546160802143237", {
"exception": exception.NotFoundError,
"count": 0,
}),
# only visible to logged in users
("https://bcy.net/item/detail/6747523535150783495", {
"count": 0,
}),
# JSON decode error (#3321)
("https://bcy.net/item/detail/7166939271872388110", {
"count": 0,
}),
)
def posts(self):
try:
data = self._data_from_post(self.item_id)
except KeyError:
return ()
post = data["post_data"]
post["image_list"] = post["multi"]
post["plain"] = text.parse_unicode_escapes(post["plain"])
post.update(data["detail_user"])
return (post,)
View File
@ -81,10 +81,13 @@ class BehanceGalleryExtractor(BehanceExtractor):
("https://www.behance.net/gallery/88276087/Audi-R8-RWD", { ("https://www.behance.net/gallery/88276087/Audi-R8-RWD", {
"count": 20, "count": 20,
"url": "6bebff0d37f85349f9ad28bd8b76fd66627c1e2f", "url": "6bebff0d37f85349f9ad28bd8b76fd66627c1e2f",
"pattern": r"https://mir-s3-cdn-cf\.behance\.net/project_modules"
r"/source/[0-9a-f]+.[0-9a-f]+\.jpg"
}), }),
# 'video' modules (#1282) # 'video' modules (#1282)
("https://www.behance.net/gallery/101185577/COLCCI", { ("https://www.behance.net/gallery/101185577/COLCCI", {
"pattern": r"ytdl:https://cdn-prod-ccv\.adobe\.com/", "pattern": r"https://cdn-prod-ccv\.adobe\.com/\w+"
r"/rend/\w+_720\.mp4\?",
"count": 3, "count": 3,
}), }),
) )
@ -129,26 +132,35 @@ class BehanceGalleryExtractor(BehanceExtractor):
append = result.append append = result.append
for module in data["modules"]: for module in data["modules"]:
mtype = module["type"] mtype = module["__typename"]
if mtype == "image": if mtype == "ImageModule":
url = module["sizes"]["original"] url = module["imageSizes"]["size_original"]["url"]
append((url, module)) append((url, module))
elif mtype == "video": elif mtype == "VideoModule":
page = self.request(module["src"]).text renditions = module["videoData"]["renditions"]
url = text.extr(page, '<source src="', '"') try:
if text.ext_from_url(url) == "m3u8": url = [
url = "ytdl:" + url r["url"] for r in renditions
if text.ext_from_url(r["url"]) != "m3u8"
][-1]
except Exception as exc:
self.log.debug("%s: %s", exc.__class__.__name__, exc)
url = "ytdl:" + renditions[-1]["url"]
append((url, module)) append((url, module))
elif mtype == "media_collection": elif mtype == "MediaCollectionModule":
for component in module["components"]: for component in module["components"]:
url = component["sizes"]["source"] for size in component["imageSizes"].values():
append((url, module)) if size:
parts = size["url"].split("/")
parts[4] = "source"
append(("/".join(parts), module))
break
elif mtype == "embed": elif mtype == "EmbedModule":
embed = module.get("original_embed") or module.get("embed") embed = module.get("originalEmbed") or module.get("fluidEmbed")
if embed: if embed:
append(("ytdl:" + text.extr(embed, 'src="', '"'), module)) append(("ytdl:" + text.extr(embed, 'src="', '"'), module))

View File

@ -28,12 +28,13 @@ class BloggerExtractor(Extractor):
def __init__(self, match): def __init__(self, match):
Extractor.__init__(self, match) Extractor.__init__(self, match)
self.videos = self.config("videos", True)
self.blog = match.group(1) or match.group(2) self.blog = match.group(1) or match.group(2)
def _init(self):
self.api = BloggerAPI(self) self.api = BloggerAPI(self)
self.videos = self.config("videos", True)
def items(self): def items(self):
blog = self.api.blog_by_url("http://" + self.blog) blog = self.api.blog_by_url("http://" + self.blog)
blog["pages"] = blog["pages"]["totalItems"] blog["pages"] = blog["pages"]["totalItems"]
blog["posts"] = blog["posts"]["totalItems"] blog["posts"] = blog["posts"]["totalItems"]
@ -44,6 +45,7 @@ class BloggerExtractor(Extractor):
findall_image = re.compile( findall_image = re.compile(
r'src="(https?://(?:' r'src="(https?://(?:'
r'blogger\.googleusercontent\.com/img|' r'blogger\.googleusercontent\.com/img|'
r'lh\d+\.googleusercontent\.com/|'
r'\d+\.bp\.blogspot\.com)/[^"]+)').findall r'\d+\.bp\.blogspot\.com)/[^"]+)').findall
findall_video = re.compile( findall_video = re.compile(
r'src="(https?://www\.blogger\.com/video\.g\?token=[^"]+)').findall r'src="(https?://www\.blogger\.com/video\.g\?token=[^"]+)').findall

View File

@ -6,19 +6,19 @@
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation. # published by the Free Software Foundation.
"""Extractors for https://bunkr.la/""" """Extractors for https://bunkrr.su/"""
from .lolisafe import LolisafeAlbumExtractor from .lolisafe import LolisafeAlbumExtractor
from .. import text from .. import text
class BunkrAlbumExtractor(LolisafeAlbumExtractor): class BunkrAlbumExtractor(LolisafeAlbumExtractor):
"""Extractor for bunkr.la albums""" """Extractor for bunkrr.su albums"""
category = "bunkr" category = "bunkr"
root = "https://bunkr.la" root = "https://bunkrr.su"
pattern = r"(?:https?://)?(?:app\.)?bunkr\.(?:la|[sr]u|is|to)/a/([^/?#]+)" pattern = r"(?:https?://)?(?:app\.)?bunkr+\.(?:la|[sr]u|is|to)/a/([^/?#]+)"
test = ( test = (
("https://bunkr.la/a/Lktg9Keq", { ("https://bunkrr.su/a/Lktg9Keq", {
"pattern": r"https://cdn\.bunkr\.ru/test-テスト-\"&>-QjgneIQv\.png", "pattern": r"https://cdn\.bunkr\.ru/test-テスト-\"&>-QjgneIQv\.png",
"content": "0c8768055e4e20e7c7259608b67799171b691140", "content": "0c8768055e4e20e7c7259608b67799171b691140",
"keyword": { "keyword": {
@ -52,6 +52,12 @@ class BunkrAlbumExtractor(LolisafeAlbumExtractor):
"num": int, "num": int,
}, },
}), }),
# cdn12 .ru TLD (#4147)
("https://bunkrr.su/a/j1G29CnD", {
"pattern": r"https://(cdn12.bunkr.ru|media-files12.bunkr.la)/\w+",
"count": 8,
}),
("https://bunkrr.su/a/Lktg9Keq"),
("https://bunkr.la/a/Lktg9Keq"), ("https://bunkr.la/a/Lktg9Keq"),
("https://bunkr.su/a/Lktg9Keq"), ("https://bunkr.su/a/Lktg9Keq"),
("https://bunkr.ru/a/Lktg9Keq"), ("https://bunkr.ru/a/Lktg9Keq"),
@ -70,7 +76,7 @@ class BunkrAlbumExtractor(LolisafeAlbumExtractor):
cdn = None cdn = None
files = [] files = []
append = files.append append = files.append
headers = {"Referer": self.root.replace("://", "://stream.", 1) + "/"} headers = {"Referer": self.root + "/"}
pos = page.index('class="grid-images') pos = page.index('class="grid-images')
for url in text.extract_iter(page, '<a href="', '"', pos): for url in text.extract_iter(page, '<a href="', '"', pos):
@ -86,10 +92,12 @@ class BunkrAlbumExtractor(LolisafeAlbumExtractor):
url = text.unescape(url) url = text.unescape(url)
if url.endswith((".mp4", ".m4v", ".mov", ".webm", ".mkv", ".ts", if url.endswith((".mp4", ".m4v", ".mov", ".webm", ".mkv", ".ts",
".zip", ".rar", ".7z")): ".zip", ".rar", ".7z")):
append({"file": url.replace("://cdn", "://media-files", 1), if url.startswith("https://cdn12."):
"_http_headers": headers}) url = ("https://media-files12.bunkr.la" +
else: url[url.find("/", 14):])
append({"file": url}) else:
url = url.replace("://cdn", "://media-files", 1)
append({"file": url, "_http_headers": headers})
return files, { return files, {
"album_id" : self.album_id, "album_id" : self.album_id,

View File

@ -32,11 +32,10 @@ class Extractor():
directory_fmt = ("{category}",) directory_fmt = ("{category}",)
filename_fmt = "{filename}.{extension}" filename_fmt = "{filename}.{extension}"
archive_fmt = "" archive_fmt = ""
cookiedomain = "" cookies_domain = ""
browser = None browser = None
root = "" root = ""
test = None test = None
finalize = None
request_interval = 0.0 request_interval = 0.0
request_interval_min = 0.0 request_interval_min = 0.0
request_timestamp = 0.0 request_timestamp = 0.0
@ -45,32 +44,9 @@ class Extractor():
def __init__(self, match): def __init__(self, match):
self.log = logging.getLogger(self.category) self.log = logging.getLogger(self.category)
self.url = match.string self.url = match.string
if self.basecategory:
self.config = self._config_shared
self.config_accumulate = self._config_shared_accumulate
self._cfgpath = ("extractor", self.category, self.subcategory) self._cfgpath = ("extractor", self.category, self.subcategory)
self._parentdir = "" self._parentdir = ""
self._write_pages = self.config("write-pages", False)
self._retry_codes = self.config("retry-codes")
self._retries = self.config("retries", 4)
self._timeout = self.config("timeout", 30)
self._verify = self.config("verify", True)
self._proxies = util.build_proxy_map(self.config("proxy"), self.log)
self._interval = util.build_duration_func(
self.config("sleep-request", self.request_interval),
self.request_interval_min,
)
if self._retries < 0:
self._retries = float("inf")
if not self._retry_codes:
self._retry_codes = ()
self._init_session()
self._init_cookies()
@classmethod @classmethod
def from_url(cls, url): def from_url(cls, url):
if isinstance(cls.pattern, str): if isinstance(cls.pattern, str):
@ -79,8 +55,19 @@ class Extractor():
return cls(match) if match else None return cls(match) if match else None
def __iter__(self): def __iter__(self):
self.initialize()
return self.items() return self.items()
def initialize(self):
self._init_options()
self._init_session()
self._init_cookies()
self._init()
self.initialize = util.noop
def finalize(self):
pass
def items(self): def items(self):
yield Message.Version, 1 yield Message.Version, 1
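The new initialize()/finalize() hooks implement one-shot lazy setup: the first __iter__() call runs the session/cookie/option setup, then replaces initialize with a no-op (util.noop in the real code). A minimal self-contained sketch of the pattern:

class LazyInit:
    def initialize(self):
        self._setup()                    # expensive one-time work
        self.initialize = lambda: None   # subsequent calls do nothing

    def _setup(self):
        self.ready = True                # stands in for session/cookies/options

    def __iter__(self):
        self.initialize()
        return iter(())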
@ -90,23 +77,44 @@ class Extractor():
def config(self, key, default=None): def config(self, key, default=None):
return config.interpolate(self._cfgpath, key, default) return config.interpolate(self._cfgpath, key, default)
def config_deprecated(self, key, deprecated, default=None,
sentinel=util.SENTINEL, history=set()):
value = self.config(deprecated, sentinel)
if value is not sentinel:
if deprecated not in history:
history.add(deprecated)
self.log.warning("'%s' is deprecated. Use '%s' instead.",
deprecated, key)
default = value
value = self.config(key, sentinel)
if value is not sentinel:
return value
return default
def config_accumulate(self, key): def config_accumulate(self, key):
return config.accumulate(self._cfgpath, key) return config.accumulate(self._cfgpath, key)
def _config_shared(self, key, default=None): def _config_shared(self, key, default=None):
return config.interpolate_common(("extractor",), ( return config.interpolate_common(
(self.category, self.subcategory), ("extractor",), self._cfgpath, key, default)
(self.basecategory, self.subcategory),
), key, default)
def _config_shared_accumulate(self, key): def _config_shared_accumulate(self, key):
values = config.accumulate(self._cfgpath, key) first = True
conf = config.get(("extractor",), self.basecategory) extr = ("extractor",)
if conf:
values[:0] = config.accumulate((self.subcategory,), key, conf=conf) for path in self._cfgpath:
if first:
first = False
values = config.accumulate(extr + path, key)
else:
conf = config.get(extr, path[0])
if conf:
values[:0] = config.accumulate(
(self.subcategory,), key, conf=conf)
return values return values
def request(self, url, *, method="GET", session=None, def request(self, url, method="GET", session=None,
retries=None, retry_codes=None, encoding=None, retries=None, retry_codes=None, encoding=None,
fatal=True, notfound=None, **kwargs): fatal=True, notfound=None, **kwargs):
if session is None: if session is None:
@ -180,7 +188,7 @@ class Extractor():
raise exception.HttpError(msg, response) raise exception.HttpError(msg, response)
def wait(self, *, seconds=None, until=None, adjust=1.0, def wait(self, seconds=None, until=None, adjust=1.0,
reason="rate limit reset"): reason="rate limit reset"):
now = time.time() now = time.time()
@ -230,6 +238,26 @@ class Extractor():
return username, password return username, password
def _init(self):
pass
def _init_options(self):
self._write_pages = self.config("write-pages", False)
self._retry_codes = self.config("retry-codes")
self._retries = self.config("retries", 4)
self._timeout = self.config("timeout", 30)
self._verify = self.config("verify", True)
self._proxies = util.build_proxy_map(self.config("proxy"), self.log)
self._interval = util.build_duration_func(
self.config("sleep-request", self.request_interval),
self.request_interval_min,
)
if self._retries < 0:
self._retries = float("inf")
if not self._retry_codes:
self._retry_codes = ()
def _init_session(self): def _init_session(self):
self.session = session = requests.Session() self.session = session = requests.Session()
headers = session.headers headers = session.headers
@ -271,7 +299,7 @@ class Extractor():
useragent = self.config("user-agent") useragent = self.config("user-agent")
if useragent is None: if useragent is None:
useragent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; " useragent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
"rv:102.0) Gecko/20100101 Firefox/102.0") "rv:115.0) Gecko/20100101 Firefox/115.0")
elif useragent == "browser": elif useragent == "browser":
useragent = _browser_useragent() useragent = _browser_useragent()
headers["User-Agent"] = useragent headers["User-Agent"] = useragent
@ -315,26 +343,26 @@ class Extractor():
def _init_cookies(self): def _init_cookies(self):
"""Populate the session's cookiejar""" """Populate the session's cookiejar"""
self._cookiefile = None self.cookies = self.session.cookies
self._cookiejar = self.session.cookies self.cookies_file = None
if self.cookiedomain is None: if self.cookies_domain is None:
return return
cookies = self.config("cookies") cookies = self.config("cookies")
if cookies: if cookies:
if isinstance(cookies, dict): if isinstance(cookies, dict):
self._update_cookies_dict(cookies, self.cookiedomain) self.cookies_update_dict(cookies, self.cookies_domain)
elif isinstance(cookies, str): elif isinstance(cookies, str):
cookiefile = util.expand_path(cookies) path = util.expand_path(cookies)
try: try:
with open(cookiefile) as fp: with open(path) as fp:
util.cookiestxt_load(fp, self._cookiejar) util.cookiestxt_load(fp, self.cookies)
except Exception as exc: except Exception as exc:
self.log.warning("cookies: %s", exc) self.log.warning("cookies: %s", exc)
else: else:
self.log.debug("Loading cookies from '%s'", cookies) self.log.debug("Loading cookies from '%s'", cookies)
self._cookiefile = cookiefile self.cookies_file = path
elif isinstance(cookies, (list, tuple)): elif isinstance(cookies, (list, tuple)):
key = tuple(cookies) key = tuple(cookies)
@ -342,7 +370,7 @@ class Extractor():
if cookiejar is None: if cookiejar is None:
from ..cookies import load_cookies from ..cookies import load_cookies
cookiejar = self._cookiejar.__class__() cookiejar = self.cookies.__class__()
try: try:
load_cookies(cookiejar, cookies) load_cookies(cookiejar, cookies)
except Exception as exc: except Exception as exc:
@ -352,9 +380,9 @@ class Extractor():
else: else:
self.log.debug("Using cached cookies from %s", key) self.log.debug("Using cached cookies from %s", key)
setcookie = self._cookiejar.set_cookie set_cookie = self.cookies.set_cookie
for cookie in cookiejar: for cookie in cookiejar:
setcookie(cookie) set_cookie(cookie)
else: else:
self.log.warning( self.log.warning(
@ -362,46 +390,56 @@ class Extractor():
"option, got '%s' (%s)", "option, got '%s' (%s)",
cookies.__class__.__name__, cookies) cookies.__class__.__name__, cookies)
def _store_cookies(self): def cookies_store(self):
"""Store the session's cookiejar in a cookies.txt file""" """Store the session's cookies in a cookies.txt file"""
if self._cookiefile and self.config("cookies-update", True): export = self.config("cookies-update", True)
try: if not export:
with open(self._cookiefile, "w") as fp: return
util.cookiestxt_store(fp, self._cookiejar)
except OSError as exc:
self.log.warning("cookies: %s", exc)
def _update_cookies(self, cookies, *, domain=""): if isinstance(export, str):
path = util.expand_path(export)
else:
path = self.cookies_file
if not path:
return
try:
with open(path, "w") as fp:
util.cookiestxt_store(fp, self.cookies)
except OSError as exc:
self.log.warning("cookies: %s", exc)
def cookies_update(self, cookies, domain=""):
"""Update the session's cookiejar with 'cookies'""" """Update the session's cookiejar with 'cookies'"""
if isinstance(cookies, dict): if isinstance(cookies, dict):
self._update_cookies_dict(cookies, domain or self.cookiedomain) self.cookies_update_dict(cookies, domain or self.cookies_domain)
else: else:
setcookie = self._cookiejar.set_cookie set_cookie = self.cookies.set_cookie
try: try:
cookies = iter(cookies) cookies = iter(cookies)
except TypeError: except TypeError:
setcookie(cookies) set_cookie(cookies)
else: else:
for cookie in cookies: for cookie in cookies:
setcookie(cookie) set_cookie(cookie)
def _update_cookies_dict(self, cookiedict, domain): def cookies_update_dict(self, cookiedict, domain):
"""Update cookiejar with name-value pairs from a dict""" """Update cookiejar with name-value pairs from a dict"""
setcookie = self._cookiejar.set set_cookie = self.cookies.set
for name, value in cookiedict.items(): for name, value in cookiedict.items():
setcookie(name, value, domain=domain) set_cookie(name, value, domain=domain)
def _check_cookies(self, cookienames, *, domain=None): def cookies_check(self, cookies_names, domain=None):
"""Check if all 'cookienames' are in the session's cookiejar""" """Check if all 'cookies_names' are in the session's cookiejar"""
if not self._cookiejar: if not self.cookies:
return False return False
if domain is None: if domain is None:
domain = self.cookiedomain domain = self.cookies_domain
names = set(cookienames) names = set(cookies_names)
now = time.time() now = time.time()
for cookie in self._cookiejar: for cookie in self.cookies:
if cookie.name in names and ( if cookie.name in names and (
not domain or cookie.domain == domain): not domain or cookie.domain == domain):
@ -425,9 +463,16 @@ class Extractor():
return False return False
def _prepare_ddosguard_cookies(self): def _prepare_ddosguard_cookies(self):
if not self._cookiejar.get("__ddg2", domain=self.cookiedomain): if not self.cookies.get("__ddg2", domain=self.cookies_domain):
self._cookiejar.set( self.cookies.set(
"__ddg2", util.generate_token(), domain=self.cookiedomain) "__ddg2", util.generate_token(), domain=self.cookies_domain)
def _cache(self, func, maxage, keyarg=None):
# return cache.DatabaseCacheDecorator(func, maxage, keyarg)
return cache.DatabaseCacheDecorator(func, keyarg, maxage)
def _cache_memory(self, func, maxage=None, keyarg=None):
return cache.Memcache()
def _get_date_min_max(self, dmin=None, dmax=None): def _get_date_min_max(self, dmin=None, dmax=None):
"""Retrieve and parse 'date-min' and 'date-max' config values""" """Retrieve and parse 'date-min' and 'date-max' config values"""
@ -530,7 +575,13 @@ class GalleryExtractor(Extractor):
def items(self): def items(self):
self.login() self.login()
page = self.request(self.gallery_url, notfound=self.subcategory).text
if self.gallery_url:
page = self.request(
self.gallery_url, notfound=self.subcategory).text
else:
page = None
data = self.metadata(page) data = self.metadata(page)
imgs = self.images(page) imgs = self.images(page)
@ -623,6 +674,8 @@ class AsynchronousMixin():
"""Run info extraction in a separate thread""" """Run info extraction in a separate thread"""
def __iter__(self): def __iter__(self):
self.initialize()
messages = queue.Queue(5) messages = queue.Queue(5)
thread = threading.Thread( thread = threading.Thread(
target=self.async_items, target=self.async_items,
@ -774,8 +827,8 @@ _browser_cookies = {}
HTTP_HEADERS = { HTTP_HEADERS = {
"firefox": ( "firefox": (
("User-Agent", "Mozilla/5.0 ({}; rv:102.0) " ("User-Agent", "Mozilla/5.0 ({}; rv:115.0) "
"Gecko/20100101 Firefox/102.0"), "Gecko/20100101 Firefox/115.0"),
("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9," ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,"
"image/avif,image/webp,*/*;q=0.8"), "image/avif,image/webp,*/*;q=0.8"),
("Accept-Language", "en-US,en;q=0.5"), ("Accept-Language", "en-US,en;q=0.5"),
@ -866,13 +919,3 @@ if action:
except Exception: except Exception:
pass pass
del action del action
# Undo automatic pyOpenSSL injection by requests
pyopenssl = config.get((), "pyopenssl", False)
if not pyopenssl:
try:
from requests.packages.urllib3.contrib import pyopenssl # noqa
pyopenssl.extract_from_urllib3()
except ImportError:
pass
del pyopenssl

View File

@ -22,8 +22,7 @@ class DanbooruExtractor(BaseExtractor):
per_page = 200 per_page = 200
request_interval = 1.0 request_interval = 1.0
def __init__(self, match): def _init(self):
BaseExtractor.__init__(self, match)
self.ugoira = self.config("ugoira", False) self.ugoira = self.config("ugoira", False)
self.external = self.config("external", False) self.external = self.config("external", False)
self.includes = False self.includes = False
@ -70,6 +69,8 @@ class DanbooruExtractor(BaseExtractor):
continue continue
text.nameext_from_url(url, post) text.nameext_from_url(url, post)
post["date"] = text.parse_datetime(
post["created_at"], "%Y-%m-%dT%H:%M:%S.%f%z")
if post["extension"] == "zip": if post["extension"] == "zip":
if self.ugoira: if self.ugoira:
@ -92,42 +93,47 @@ class DanbooruExtractor(BaseExtractor):
def posts(self): def posts(self):
return () return ()
def _pagination(self, endpoint, params, pages=False): def _pagination(self, endpoint, params, prefix=None):
url = self.root + endpoint url = self.root + endpoint
params["limit"] = self.per_page params["limit"] = self.per_page
params["page"] = self.page_start params["page"] = self.page_start
first = True
while True: while True:
posts = self.request(url, params=params).json() posts = self.request(url, params=params).json()
if "posts" in posts: if isinstance(posts, dict):
posts = posts["posts"] posts = posts["posts"]
if self.includes and posts: if posts:
if not pages and "only" not in params: if self.includes:
params["page"] = "b{}".format(posts[0]["id"] + 1) params_meta = {
params["only"] = self.includes "only" : self.includes,
data = { "limit": len(posts),
meta["id"]: meta "tags" : "id:" + ",".join(str(p["id"]) for p in posts),
for meta in self.request(url, params=params).json() }
} data = {
for post in posts: meta["id"]: meta
post.update(data[post["id"]]) for meta in self.request(
params["only"] = None url, params=params_meta).json()
}
for post in posts:
post.update(data[post["id"]])
yield from posts if prefix == "a" and not first:
posts.reverse()
yield from posts
if len(posts) < self.threshold: if len(posts) < self.threshold:
return return
if pages: if prefix:
params["page"] = "{}{}".format(prefix, posts[-1]["id"])
elif params["page"]:
params["page"] += 1 params["page"] += 1
else: else:
for post in reversed(posts): params["page"] = 2
if "id" in post: first = False
params["page"] = "b{}".format(post["id"])
break
else:
return
def _ugoira_frames(self, post): def _ugoira_frames(self, post):
data = self.request("{}/posts/{}.json?only=media_metadata".format( data = self.request("{}/posts/{}.json?only=media_metadata".format(
@ -153,7 +159,11 @@ BASE_PATTERN = DanbooruExtractor.update({
"aibooru": { "aibooru": {
"root": None, "root": None,
"pattern": r"(?:safe.)?aibooru\.online", "pattern": r"(?:safe.)?aibooru\.online",
} },
"booruvar": {
"root": "https://booru.borvar.art",
"pattern": r"booru\.borvar\.art",
},
}) })
@ -181,7 +191,12 @@ class DanbooruTagExtractor(DanbooruExtractor):
"count": 12, "count": 12,
}), }),
("https://aibooru.online/posts?tags=center_frills&z=1", { ("https://aibooru.online/posts?tags=center_frills&z=1", {
"pattern": r"https://aibooru\.online/data/original" "pattern": r"https://cdn\.aibooru\.online/original"
r"/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{32}\.\w+",
"count": ">= 3",
}),
("https://booru.borvar.art/posts?tags=chibi&z=1", {
"pattern": r"https://booru\.borvar\.art/data/original"
r"/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{32}\.\w+", r"/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{32}\.\w+",
"count": ">= 3", "count": ">= 3",
}), }),
@ -200,7 +215,21 @@ class DanbooruTagExtractor(DanbooruExtractor):
return {"search_tags": self.tags} return {"search_tags": self.tags}
def posts(self): def posts(self):
return self._pagination("/posts.json", {"tags": self.tags}) prefix = "b"
for tag in self.tags.split():
if tag.startswith("order:"):
if tag == "order:id" or tag == "order:id_asc":
prefix = "a"
elif tag == "order:id_desc":
prefix = "b"
else:
prefix = None
elif tag.startswith(
("id:", "md5", "ordfav:", "ordfavgroup:", "ordpool:")):
prefix = None
break
return self._pagination("/posts.json", {"tags": self.tags}, prefix)
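The prefix computed here selects Danbooru's id-based page cursor: roughly, "b<id>" continues below the last id seen (descending), "a<id>" above it (ascending), and None falls back to plain page numbers, which is what _pagination() does with it. A hedged sketch of advancing such a cursor between requests:

def advance_cursor(params, posts, prefix):
    if prefix:
        # id cursor: next request starts below ('b') or above ('a') the last id
        params["page"] = "{}{}".format(prefix, posts[-1]["id"])
    elif isinstance(params.get("page"), int):
        params["page"] += 1              # plain numeric paging
    else:
        params["page"] = 2

For example, advance_cursor({"tags": "cat"}, [{"id": 100}], "b") sets params["page"] to "b100".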
class DanbooruPoolExtractor(DanbooruExtractor): class DanbooruPoolExtractor(DanbooruExtractor):
@ -217,6 +246,10 @@ class DanbooruPoolExtractor(DanbooruExtractor):
"url": "902549ffcdb00fe033c3f63e12bc3cb95c5fd8d5", "url": "902549ffcdb00fe033c3f63e12bc3cb95c5fd8d5",
"count": 6, "count": 6,
}), }),
("https://booru.borvar.art/pools/2", {
"url": "77fa3559a3fc919f72611f4e3dd0f919d19d3e0d",
"count": 4,
}),
("https://aibooru.online/pools/1"), ("https://aibooru.online/pools/1"),
("https://danbooru.donmai.us/pool/show/7659"), ("https://danbooru.donmai.us/pool/show/7659"),
) )
@ -234,7 +267,7 @@ class DanbooruPoolExtractor(DanbooruExtractor):
def posts(self): def posts(self):
params = {"tags": "pool:" + self.pool_id} params = {"tags": "pool:" + self.pool_id}
return self._pagination("/posts.json", params) return self._pagination("/posts.json", params, "b")
class DanbooruPostExtractor(DanbooruExtractor): class DanbooruPostExtractor(DanbooruExtractor):
@ -245,6 +278,7 @@ class DanbooruPostExtractor(DanbooruExtractor):
test = ( test = (
("https://danbooru.donmai.us/posts/294929", { ("https://danbooru.donmai.us/posts/294929", {
"content": "5e255713cbf0a8e0801dc423563c34d896bb9229", "content": "5e255713cbf0a8e0801dc423563c34d896bb9229",
"keyword": {"date": "dt:2008-08-12 04:46:05"},
}), }),
("https://danbooru.donmai.us/posts/3613024", { ("https://danbooru.donmai.us/posts/3613024", {
"pattern": r"https?://.+\.zip$", "pattern": r"https?://.+\.zip$",
@ -256,6 +290,9 @@ class DanbooruPostExtractor(DanbooruExtractor):
("https://aibooru.online/posts/1", { ("https://aibooru.online/posts/1", {
"content": "54d548743cd67799a62c77cbae97cfa0fec1b7e9", "content": "54d548743cd67799a62c77cbae97cfa0fec1b7e9",
}), }),
("https://booru.borvar.art/posts/1487", {
"content": "91273ac1ea413a12be468841e2b5804656a50bff",
}),
("https://danbooru.donmai.us/post/show/294929"), ("https://danbooru.donmai.us/post/show/294929"),
) )
@ -287,6 +324,7 @@ class DanbooruPopularExtractor(DanbooruExtractor):
}), }),
("https://booru.allthefallen.moe/explore/posts/popular"), ("https://booru.allthefallen.moe/explore/posts/popular"),
("https://aibooru.online/explore/posts/popular"), ("https://aibooru.online/explore/posts/popular"),
("https://booru.borvar.art/explore/posts/popular"),
) )
def __init__(self, match): def __init__(self, match):
@ -307,7 +345,4 @@ class DanbooruPopularExtractor(DanbooruExtractor):
return {"date": date, "scale": scale} return {"date": date, "scale": scale}
def posts(self): def posts(self):
if self.page_start is None: return self._pagination("/explore/posts/popular.json", self.params)
self.page_start = 1
return self._pagination(
"/explore/posts/popular.json", self.params, True)

View File

@ -32,20 +32,24 @@ class DeviantartExtractor(Extractor):
root = "https://www.deviantart.com" root = "https://www.deviantart.com"
directory_fmt = ("{category}", "{username}") directory_fmt = ("{category}", "{username}")
filename_fmt = "{category}_{index}_{title}.{extension}" filename_fmt = "{category}_{index}_{title}.{extension}"
cookiedomain = None cookies_domain = None
cookienames = ("auth", "auth_secure", "userinfo") cookies_names = ("auth", "auth_secure", "userinfo")
_last_request = 0 _last_request = 0
def __init__(self, match): def __init__(self, match):
Extractor.__init__(self, match) Extractor.__init__(self, match)
self.user = match.group(1) or match.group(2)
def _init(self):
self.flat = self.config("flat", True) self.flat = self.config("flat", True)
self.extra = self.config("extra", False) self.extra = self.config("extra", False)
self.original = self.config("original", True) self.original = self.config("original", True)
self.comments = self.config("comments", False) self.comments = self.config("comments", False)
self.user = match.group(1) or match.group(2)
self.api = DeviantartOAuthAPI(self)
self.group = False self.group = False
self.offset = 0 self.offset = 0
self.api = None self._premium_cache = {}
unwatch = self.config("auto-unwatch") unwatch = self.config("auto-unwatch")
if unwatch: if unwatch:
@ -60,27 +64,28 @@ class DeviantartExtractor(Extractor):
self._update_content = self._update_content_image self._update_content = self._update_content_image
self.original = True self.original = True
self._premium_cache = {} journals = self.config("journals", "html")
self.commit_journal = { if journals == "html":
"html": self._commit_journal_html, self.commit_journal = self._commit_journal_html
"text": self._commit_journal_text, elif journals == "text":
}.get(self.config("journals", "html")) self.commit_journal = self._commit_journal_text
else:
self.commit_journal = None
def skip(self, num): def skip(self, num):
self.offset += num self.offset += num
return num return num
def login(self): def login(self):
if not self._check_cookies(self.cookienames): if self.cookies_check(self.cookies_names):
username, password = self._get_auth_info() return True
if not username:
return False username, password = self._get_auth_info()
self._update_cookies(_login_impl(self, username, password)) if username:
return True self.cookies_update(_login_impl(self, username, password))
return True
def items(self): def items(self):
self.api = DeviantartOAuthAPI(self)
if self.user and self.config("group", True): if self.user and self.config("group", True):
profile = self.api.user_profile(self.user) profile = self.api.user_profile(self.user)
self.group = not profile self.group = not profile
@ -448,6 +453,9 @@ class DeviantartUserExtractor(DeviantartExtractor):
("https://shimoda7.deviantart.com/"), ("https://shimoda7.deviantart.com/"),
) )
def initialize(self):
pass
def items(self): def items(self):
base = "{}/{}/".format(self.root, self.user) base = "{}/{}/".format(self.root, self.user)
return self._dispatch_extractors(( return self._dispatch_extractors((
@ -1105,11 +1113,14 @@ class DeviantartDeviationExtractor(DeviantartExtractor):
match.group(4) or match.group(5) or id_from_base36(match.group(6)) match.group(4) or match.group(5) or id_from_base36(match.group(6))
def deviations(self): def deviations(self):
url = "{}/{}/{}/{}".format( if self.user:
self.root, self.user or "u", self.type or "art", self.deviation_id) url = "{}/{}/{}/{}".format(
self.root, self.user, self.type or "art", self.deviation_id)
else:
url = "{}/view/{}/".format(self.root, self.deviation_id)
uuid = text.extract(self._limited_request(url).text, uuid = text.extr(self._limited_request(url).text,
'"deviationUuid\\":\\"', '\\')[0] '"deviationUuid\\":\\"', '\\')
if not uuid: if not uuid:
raise exception.NotFoundError("deviation") raise exception.NotFoundError("deviation")
return (self.api.deviation(uuid),) return (self.api.deviation(uuid),)
@ -1120,7 +1131,7 @@ class DeviantartScrapsExtractor(DeviantartExtractor):
subcategory = "scraps" subcategory = "scraps"
directory_fmt = ("{category}", "{username}", "Scraps") directory_fmt = ("{category}", "{username}", "Scraps")
archive_fmt = "s_{_username}_{index}.{extension}" archive_fmt = "s_{_username}_{index}.{extension}"
cookiedomain = ".deviantart.com" cookies_domain = ".deviantart.com"
pattern = BASE_PATTERN + r"/gallery/(?:\?catpath=)?scraps\b" pattern = BASE_PATTERN + r"/gallery/(?:\?catpath=)?scraps\b"
test = ( test = (
("https://www.deviantart.com/shimoda7/gallery/scraps", { ("https://www.deviantart.com/shimoda7/gallery/scraps", {
@ -1143,7 +1154,7 @@ class DeviantartSearchExtractor(DeviantartExtractor):
subcategory = "search" subcategory = "search"
directory_fmt = ("{category}", "Search", "{search_tags}") directory_fmt = ("{category}", "Search", "{search_tags}")
archive_fmt = "Q_{search_tags}_{index}.{extension}" archive_fmt = "Q_{search_tags}_{index}.{extension}"
cookiedomain = ".deviantart.com" cookies_domain = ".deviantart.com"
pattern = (r"(?:https?://)?www\.deviantart\.com" pattern = (r"(?:https?://)?www\.deviantart\.com"
r"/search(?:/deviations)?/?\?([^#]+)") r"/search(?:/deviations)?/?\?([^#]+)")
test = ( test = (
@ -1202,7 +1213,7 @@ class DeviantartGallerySearchExtractor(DeviantartExtractor):
"""Extractor for deviantart gallery searches""" """Extractor for deviantart gallery searches"""
subcategory = "gallery-search" subcategory = "gallery-search"
archive_fmt = "g_{_username}_{index}.{extension}" archive_fmt = "g_{_username}_{index}.{extension}"
cookiedomain = ".deviantart.com" cookies_domain = ".deviantart.com"
pattern = BASE_PATTERN + r"/gallery/?\?(q=[^#]+)" pattern = BASE_PATTERN + r"/gallery/?\?(q=[^#]+)"
test = ( test = (
("https://www.deviantart.com/shimoda7/gallery?q=memory", { ("https://www.deviantart.com/shimoda7/gallery?q=memory", {
@ -1417,7 +1428,14 @@ class DeviantartOAuthAPI():
"""Get the original file download (if allowed)""" """Get the original file download (if allowed)"""
endpoint = "/deviation/download/" + deviation_id endpoint = "/deviation/download/" + deviation_id
params = {"mature_content": self.mature} params = {"mature_content": self.mature}
return self._call(endpoint, params=params, public=public)
try:
return self._call(
endpoint, params=params, public=public, log=False)
except Exception:
if not self.refresh_token_key:
raise
return self._call(endpoint, params=params, public=False)
def deviation_metadata(self, deviations): def deviation_metadata(self, deviations):
""" Fetch deviation metadata for a set of deviations""" """ Fetch deviation metadata for a set of deviations"""
@ -1518,7 +1536,7 @@ class DeviantartOAuthAPI():
refresh_token_key, data["refresh_token"]) refresh_token_key, data["refresh_token"])
return "Bearer " + data["access_token"] return "Bearer " + data["access_token"]
def _call(self, endpoint, fatal=True, public=None, **kwargs): def _call(self, endpoint, fatal=True, log=True, public=None, **kwargs):
"""Call an API endpoint""" """Call an API endpoint"""
url = "https://www.deviantart.com/api/v1/oauth2" + endpoint url = "https://www.deviantart.com/api/v1/oauth2" + endpoint
kwargs["fatal"] = None kwargs["fatal"] = None
@ -1563,7 +1581,8 @@ class DeviantartOAuthAPI():
"cs/configuration.rst#extractordeviantartclient-id" "cs/configuration.rst#extractordeviantartclient-id"
"--client-secret") "--client-secret")
else: else:
self.log.error(msg) if log:
self.log.error(msg)
return data return data
def _pagination(self, endpoint, params, def _pagination(self, endpoint, params,
@ -1571,15 +1590,14 @@ class DeviantartOAuthAPI():
warn = True warn = True
if public is None: if public is None:
public = self.public public = self.public
elif not public:
self.public = False
while True: while True:
data = self._call(endpoint, params=params, public=public) data = self._call(endpoint, params=params, public=public)
if key not in data: try:
results = data[key]
except KeyError:
self.log.error("Unexpected API response: %s", data) self.log.error("Unexpected API response: %s", data)
return return
results = data[key]
if unpack: if unpack:
results = [item["journal"] for item in results results = [item["journal"] for item in results
@ -1588,7 +1606,7 @@ class DeviantartOAuthAPI():
if public and len(results) < params["limit"]: if public and len(results) < params["limit"]:
if self.refresh_token_key: if self.refresh_token_key:
self.log.debug("Switching to private access token") self.log.debug("Switching to private access token")
self.public = public = False public = False
continue continue
elif data["has_more"] and warn: elif data["has_more"] and warn:
warn = False warn = False
@ -1859,7 +1877,7 @@ def _login_impl(extr, username, password):
return { return {
cookie.name: cookie.value cookie.name: cookie.value
for cookie in extr.session.cookies for cookie in extr.cookies
} }

View File

@ -57,6 +57,8 @@ class E621Extractor(danbooru.DanbooruExtractor):
post["filename"] = file["md5"] post["filename"] = file["md5"]
post["extension"] = file["ext"] post["extension"] = file["ext"]
post["date"] = text.parse_datetime(
post["created_at"], "%Y-%m-%dT%H:%M:%S.%f%z")
post.update(data) post.update(data)
yield Message.Directory, post yield Message.Directory, post
@ -72,6 +74,10 @@ BASE_PATTERN = E621Extractor.update({
"root": "https://e926.net", "root": "https://e926.net",
"pattern": r"e926\.net", "pattern": r"e926\.net",
}, },
"e6ai": {
"root": "https://e6ai.net",
"pattern": r"e6ai\.net",
},
}) })
@ -92,6 +98,10 @@ class E621TagExtractor(E621Extractor, danbooru.DanbooruTagExtractor):
}), }),
("https://e926.net/post/index/1/anry"), ("https://e926.net/post/index/1/anry"),
("https://e926.net/post?tags=anry"), ("https://e926.net/post?tags=anry"),
("https://e6ai.net/posts?tags=anry"),
("https://e6ai.net/post/index/1/anry"),
("https://e6ai.net/post?tags=anry"),
) )
@ -110,6 +120,11 @@ class E621PoolExtractor(E621Extractor, danbooru.DanbooruPoolExtractor):
"content": "91abe5d5334425d9787811d7f06d34c77974cd22", "content": "91abe5d5334425d9787811d7f06d34c77974cd22",
}), }),
("https://e926.net/pool/show/73"), ("https://e926.net/pool/show/73"),
("https://e6ai.net/pools/3", {
"url": "a6d1ad67a3fa9b9f73731d34d5f6f26f7e85855f",
}),
("https://e6ai.net/pool/show/3"),
) )
def posts(self): def posts(self):
@ -140,6 +155,7 @@ class E621PostExtractor(E621Extractor, danbooru.DanbooruPostExtractor):
("https://e621.net/posts/535", { ("https://e621.net/posts/535", {
"url": "f7f78b44c9b88f8f09caac080adc8d6d9fdaa529", "url": "f7f78b44c9b88f8f09caac080adc8d6d9fdaa529",
"content": "66f46e96a893fba8e694c4e049b23c2acc9af462", "content": "66f46e96a893fba8e694c4e049b23c2acc9af462",
"keyword": {"date": "dt:2007-02-17 19:02:32"},
}), }),
("https://e621.net/posts/3181052", { ("https://e621.net/posts/3181052", {
"options": (("metadata", "notes,pools"),), "options": (("metadata", "notes,pools"),),
@ -189,6 +205,12 @@ class E621PostExtractor(E621Extractor, danbooru.DanbooruPostExtractor):
"content": "66f46e96a893fba8e694c4e049b23c2acc9af462", "content": "66f46e96a893fba8e694c4e049b23c2acc9af462",
}), }),
("https://e926.net/post/show/535"), ("https://e926.net/post/show/535"),
("https://e6ai.net/posts/23", {
"url": "3c85a806b3d9eec861948af421fe0e8ad6b8f881",
"content": "a05a484e4eb64637d56d751c02e659b4bc8ea5d5",
}),
("https://e6ai.net/post/show/23"),
) )
def posts(self): def posts(self):
@ -213,12 +235,12 @@ class E621PopularExtractor(E621Extractor, danbooru.DanbooruPopularExtractor):
"pattern": r"https://static\d.e926.net/data/../../[0-9a-f]+", "pattern": r"https://static\d.e926.net/data/../../[0-9a-f]+",
"count": ">= 70", "count": ">= 70",
}), }),
("https://e6ai.net/explore/posts/popular"),
) )
def posts(self): def posts(self):
if self.page_start is None: return self._pagination("/popular.json", self.params)
self.page_start = 1
return self._pagination("/popular.json", self.params, True)
class E621FavoriteExtractor(E621Extractor): class E621FavoriteExtractor(E621Extractor):
@ -239,6 +261,8 @@ class E621FavoriteExtractor(E621Extractor):
"pattern": r"https://static\d.e926.net/data/../../[0-9a-f]+", "pattern": r"https://static\d.e926.net/data/../../[0-9a-f]+",
"count": "> 260", "count": "> 260",
}), }),
("https://e6ai.net/favorites"),
) )
def __init__(self, match): def __init__(self, match):
@ -249,6 +273,4 @@ class E621FavoriteExtractor(E621Extractor):
return {"user_id": self.query.get("user_id", "")} return {"user_id": self.query.get("user_id", "")}
def posts(self): def posts(self):
if self.page_start is None: return self._pagination("/favorites.json", self.query)
self.page_start = 1
return self._pagination("/favorites.json", self.query, True)

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2021-2022 Mike Fährmann # Copyright 2021-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -65,7 +65,7 @@ class EromeExtractor(Extractor):
def request(self, url, **kwargs): def request(self, url, **kwargs):
if self.__cookies: if self.__cookies:
self.__cookies = False self.__cookies = False
self.session.cookies.update(_cookie_cache()) self.cookies.update(_cookie_cache())
for _ in range(5): for _ in range(5):
response = Extractor.request(self, url, **kwargs) response = Extractor.request(self, url, **kwargs)
@ -80,7 +80,7 @@ class EromeExtractor(Extractor):
for params["page"] in itertools.count(1): for params["page"] in itertools.count(1):
page = self.request(url, params=params).text page = self.request(url, params=params).text
album_ids = EromeAlbumExtractor.pattern.findall(page) album_ids = EromeAlbumExtractor.pattern.findall(page)[::2]
yield from album_ids yield from album_ids
if len(album_ids) < 36: if len(album_ids) < 36:

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2014-2022 Mike Fährmann # Copyright 2014-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -21,28 +21,31 @@ class ExhentaiExtractor(Extractor):
"""Base class for exhentai extractors""" """Base class for exhentai extractors"""
category = "exhentai" category = "exhentai"
directory_fmt = ("{category}", "{gid} {title[:247]}") directory_fmt = ("{category}", "{gid} {title[:247]}")
filename_fmt = ( filename_fmt = "{gid}_{num:>04}_{image_token}_{filename}.{extension}"
"{gid}_{num:>04}_{image_token}_{filename}.{extension}")
archive_fmt = "{gid}_{num}" archive_fmt = "{gid}_{num}"
cookienames = ("ipb_member_id", "ipb_pass_hash") cookies_domain = ".exhentai.org"
cookiedomain = ".exhentai.org" cookies_names = ("ipb_member_id", "ipb_pass_hash")
root = "https://exhentai.org" root = "https://exhentai.org"
request_interval = 5.0 request_interval = 5.0
LIMIT = False LIMIT = False
def __init__(self, match): def __init__(self, match):
# allow calling 'self.config()' before 'Extractor.__init__()' Extractor.__init__(self, match)
self._cfgpath = ("extractor", self.category, self.subcategory) self.version = match.group(1)
version = match.group(1) def initialize(self):
domain = self.config("domain", "auto") domain = self.config("domain", "auto")
if domain == "auto": if domain == "auto":
domain = ("ex" if version == "ex" else "e-") + "hentai.org" domain = ("ex" if self.version == "ex" else "e-") + "hentai.org"
self.root = "https://" + domain self.root = "https://" + domain
self.cookiedomain = "." + domain self.cookies_domain = "." + domain
Extractor.__init__(self, match) Extractor.initialize(self)
if self.version != "ex":
self.cookies.set("nw", "1", domain=self.cookies_domain)
self.session.headers["Referer"] = self.root + "/"
self.original = self.config("original", True) self.original = self.config("original", True)
limits = self.config("limits", False) limits = self.config("limits", False)
@ -52,14 +55,10 @@ class ExhentaiExtractor(Extractor):
else: else:
self.limits = False self.limits = False
self.session.headers["Referer"] = self.root + "/" def request(self, url, **kwargs):
if version != "ex": response = Extractor.request(self, url, **kwargs)
self.session.cookies.set("nw", "1", domain=self.cookiedomain) if response.history and response.headers.get("Content-Length") == "0":
self.log.info("blank page")
def request(self, *args, **kwargs):
response = Extractor.request(self, *args, **kwargs)
if self._is_sadpanda(response):
self.log.info("sadpanda.jpg")
raise exception.AuthorizationError() raise exception.AuthorizationError()
return response return response
@ -67,17 +66,20 @@ class ExhentaiExtractor(Extractor):
"""Login and set necessary cookies""" """Login and set necessary cookies"""
if self.LIMIT: if self.LIMIT:
raise exception.StopExtraction("Image limit reached!") raise exception.StopExtraction("Image limit reached!")
if self._check_cookies(self.cookienames):
if self.cookies_check(self.cookies_names):
return return
username, password = self._get_auth_info() username, password = self._get_auth_info()
if username: if username:
self._update_cookies(self._login_impl(username, password)) return self.cookies_update(self._login_impl(username, password))
else:
self.log.info("no username given; using e-hentai.org") self.log.info("no username given; using e-hentai.org")
self.root = "https://e-hentai.org" self.root = "https://e-hentai.org"
self.original = False self.cookies_domain = ".e-hentai.org"
self.limits = False self.cookies.set("nw", "1", domain=self.cookies_domain)
self.session.cookies["nw"] = "1" self.original = False
self.limits = False
@cache(maxage=90*24*3600, keyarg=1) @cache(maxage=90*24*3600, keyarg=1)
def _login_impl(self, username, password): def _login_impl(self, username, password):
@ -98,15 +100,7 @@ class ExhentaiExtractor(Extractor):
response = self.request(url, method="POST", headers=headers, data=data) response = self.request(url, method="POST", headers=headers, data=data)
if b"You are now logged in as:" not in response.content: if b"You are now logged in as:" not in response.content:
raise exception.AuthenticationError() raise exception.AuthenticationError()
return {c: response.cookies[c] for c in self.cookienames} return {c: response.cookies[c] for c in self.cookies_names}
@staticmethod
def _is_sadpanda(response):
"""Return True if the response object contains a sad panda"""
return (
response.headers.get("Content-Length") == "9615" and
"sadpanda.jpg" in response.headers.get("Content-Disposition", "")
)
class ExhentaiGalleryExtractor(ExhentaiExtractor): class ExhentaiGalleryExtractor(ExhentaiExtractor):
@ -180,6 +174,7 @@ class ExhentaiGalleryExtractor(ExhentaiExtractor):
self.image_token = match.group(4) self.image_token = match.group(4)
self.image_num = text.parse_int(match.group(6), 1) self.image_num = text.parse_int(match.group(6), 1)
def _init(self):
source = self.config("source") source = self.config("source")
if source == "hitomi": if source == "hitomi":
self.items = self._items_hitomi self.items = self._items_hitomi
@ -399,8 +394,9 @@ class ExhentaiGalleryExtractor(ExhentaiExtractor):
url = "https://e-hentai.org/home.php" url = "https://e-hentai.org/home.php"
cookies = { cookies = {
cookie.name: cookie.value cookie.name: cookie.value
for cookie in self.session.cookies for cookie in self.cookies
if cookie.domain == self.cookiedomain and cookie.name != "igneous" if cookie.domain == self.cookies_domain and
cookie.name != "igneous"
} }
page = self.request(url, cookies=cookies).text page = self.request(url, cookies=cookies).text

View File

@ -6,9 +6,9 @@
"""Extractors for https://www.fanbox.cc/""" """Extractors for https://www.fanbox.cc/"""
import re
from .common import Extractor, Message from .common import Extractor, Message
from .. import text from .. import text
import re
BASE_PATTERN = ( BASE_PATTERN = (
@ -27,14 +27,12 @@ class FanboxExtractor(Extractor):
archive_fmt = "{id}_{num}" archive_fmt = "{id}_{num}"
_warning = True _warning = True
def __init__(self, match): def _init(self):
Extractor.__init__(self, match)
self.embeds = self.config("embeds", True) self.embeds = self.config("embeds", True)
def items(self): def items(self):
if self._warning: if self._warning:
if not self._check_cookies(("FANBOXSESSID",)): if not self.cookies_check(("FANBOXSESSID",)):
self.log.warning("no 'FANBOXSESSID' cookie set") self.log.warning("no 'FANBOXSESSID' cookie set")
FanboxExtractor._warning = False FanboxExtractor._warning = False
@ -52,8 +50,11 @@ class FanboxExtractor(Extractor):
url = text.ensure_http_scheme(url) url = text.ensure_http_scheme(url)
body = self.request(url, headers=headers).json()["body"] body = self.request(url, headers=headers).json()["body"]
for item in body["items"]: for item in body["items"]:
yield self._get_post_data(item["id"]) try:
yield self._get_post_data(item["id"])
except Exception as exc:
self.log.warning("Skipping post %s (%s: %s)",
item["id"], exc.__class__.__name__, exc)
url = body["nextUrl"] url = body["nextUrl"]
def _get_post_data(self, post_id): def _get_post_data(self, post_id):
@ -211,9 +212,15 @@ class FanboxExtractor(Extractor):
# to a proper Fanbox URL # to a proper Fanbox URL
url = "https://www.pixiv.net/fanbox/"+content_id url = "https://www.pixiv.net/fanbox/"+content_id
# resolve redirect # resolve redirect
response = self.request(url, method="HEAD", allow_redirects=False) try:
url = response.headers["Location"] url = self.request(url, method="HEAD",
final_post["_extractor"] = FanboxPostExtractor allow_redirects=False).headers["location"]
except Exception as exc:
url = None
self.log.warning("Unable to extract fanbox embed %s (%s: %s)",
content_id, exc.__class__.__name__, exc)
else:
final_post["_extractor"] = FanboxPostExtractor
elif provider == "twitter": elif provider == "twitter":
url = "https://twitter.com/_/status/"+content_id url = "https://twitter.com/_/status/"+content_id
elif provider == "google_forms": elif provider == "google_forms":
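Resolving a pixiv short link by issuing a HEAD request and reading the redirect target, as done above for fanbox embeds, is a generic technique. A small self-contained sketch using requests directly (error handling reduced to returning None):

import requests

def resolve_redirect(url):
    # don't follow the redirect; just report where it points
    try:
        response = requests.head(url, allow_redirects=False, timeout=30)
        return response.headers["location"]
    except Exception:
        return None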

View File

@ -23,30 +23,54 @@ class FantiaExtractor(Extractor):
self.headers = { self.headers = {
"Accept" : "application/json, text/plain, */*", "Accept" : "application/json, text/plain, */*",
"Referer": self.root, "Referer": self.root,
"X-Requested-With": "XMLHttpRequest",
}
_empty_plan = {
"id" : 0,
"price": 0,
"limit": 0,
"name" : "",
"description": "",
"thumb": self.root + "/images/fallback/plan/thumb_default.png",
} }
if self._warning: if self._warning:
if not self._check_cookies(("_session_id",)): if not self.cookies_check(("_session_id",)):
self.log.warning("no '_session_id' cookie set") self.log.warning("no '_session_id' cookie set")
FantiaExtractor._warning = False FantiaExtractor._warning = False
for post_id in self.posts(): for post_id in self.posts():
full_response, post = self._get_post_data(post_id) post = self._get_post_data(post_id)
yield Message.Directory, post
post["num"] = 0 post["num"] = 0
for url, url_data in self._get_urls_from_post(full_response, post):
post["num"] += 1 for content in self._get_post_contents(post):
fname = url_data["content_filename"] or url post["content_category"] = content["category"]
text.nameext_from_url(fname, url_data) post["content_title"] = content["title"]
url_data["file_url"] = url post["content_filename"] = content.get("filename", "")
yield Message.Url, url, url_data post["content_id"] = content["id"]
post["plan"] = content["plan"] or _empty_plan
yield Message.Directory, post
if content["visible_status"] != "visible":
self.log.warning(
"Unable to download '%s' files from "
"%s#post-content-id-%s", content["visible_status"],
post["post_url"], content["id"])
for url in self._get_content_urls(post, content):
text.nameext_from_url(
post["content_filename"] or url, post)
post["file_url"] = url
post["num"] += 1
yield Message.Url, url, post
def posts(self): def posts(self):
"""Return post IDs""" """Return post IDs"""
def _pagination(self, url): def _pagination(self, url):
params = {"page": 1} params = {"page": 1}
headers = self.headers headers = self.headers.copy()
del headers["X-Requested-With"]
while True: while True:
page = self.request(url, params=params, headers=headers).text page = self.request(url, params=params, headers=headers).text
@ -71,7 +95,7 @@ class FantiaExtractor(Extractor):
"""Fetch and process post data""" """Fetch and process post data"""
url = self.root+"/api/v1/posts/"+post_id url = self.root+"/api/v1/posts/"+post_id
resp = self.request(url, headers=self.headers).json()["post"] resp = self.request(url, headers=self.headers).json()["post"]
post = { return {
"post_id": resp["id"], "post_id": resp["id"],
"post_url": self.root + "/posts/" + str(resp["id"]), "post_url": self.root + "/posts/" + str(resp["id"]),
"post_title": resp["title"], "post_title": resp["title"],
@ -85,55 +109,65 @@ class FantiaExtractor(Extractor):
"fanclub_user_name": resp["fanclub"]["user"]["name"], "fanclub_user_name": resp["fanclub"]["user"]["name"],
"fanclub_name": resp["fanclub"]["name"], "fanclub_name": resp["fanclub"]["name"],
"fanclub_url": self.root+"/fanclubs/"+str(resp["fanclub"]["id"]), "fanclub_url": self.root+"/fanclubs/"+str(resp["fanclub"]["id"]),
"tags": resp["tags"] "tags": resp["tags"],
"_data": resp,
} }
return resp, post
def _get_urls_from_post(self, resp, post): def _get_post_contents(self, post):
contents = post["_data"]["post_contents"]
try:
url = post["_data"]["thumb"]["original"]
except Exception:
pass
else:
contents.insert(0, {
"id": "thumb",
"title": "thumb",
"category": "thumb",
"download_uri": url,
"visible_status": "visible",
"plan": None,
})
return contents
def _get_content_urls(self, post, content):
"""Extract individual URL data from the response""" """Extract individual URL data from the response"""
if "thumb" in resp and resp["thumb"] and "original" in resp["thumb"]: if "comment" in content:
post["content_filename"] = "" post["content_comment"] = content["comment"]
post["content_category"] = "thumb"
post["file_id"] = "thumb"
yield resp["thumb"]["original"], post
for content in resp["post_contents"]: if "post_content_photos" in content:
post["content_category"] = content["category"] for photo in content["post_content_photos"]:
post["content_title"] = content["title"] post["file_id"] = photo["id"]
post["content_filename"] = content.get("filename", "") yield photo["url"]["original"]
post["content_id"] = content["id"]
if "comment" in content: if "download_uri" in content:
post["content_comment"] = content["comment"] post["file_id"] = content["id"]
url = content["download_uri"]
if url[0] == "/":
url = self.root + url
yield url
if "post_content_photos" in content: if content["category"] == "blog" and "comment" in content:
for photo in content["post_content_photos"]: comment_json = util.json_loads(content["comment"])
post["file_id"] = photo["id"] ops = comment_json.get("ops") or ()
yield photo["url"]["original"], post
if "download_uri" in content: # collect blogpost text first
post["file_id"] = content["id"] blog_text = ""
yield self.root+"/"+content["download_uri"], post for op in ops:
insert = op.get("insert")
if isinstance(insert, str):
blog_text += insert
post["blogpost_text"] = blog_text
if content["category"] == "blog" and "comment" in content: # collect images
comment_json = util.json_loads(content["comment"]) for op in ops:
ops = comment_json.get("ops", ()) insert = op.get("insert")
if isinstance(insert, dict) and "fantiaImage" in insert:
# collect blogpost text first img = insert["fantiaImage"]
blog_text = "" post["file_id"] = img["id"]
for op in ops: yield self.root + img["original_url"]
insert = op.get("insert")
if isinstance(insert, str):
blog_text += insert
post["blogpost_text"] = blog_text
# collect images
for op in ops:
insert = op.get("insert")
if isinstance(insert, dict) and "fantiaImage" in insert:
img = insert["fantiaImage"]
post["file_id"] = img["id"]
yield "https://fantia.jp" + img["original_url"], post
class FantiaCreatorExtractor(FantiaExtractor): class FantiaCreatorExtractor(FantiaExtractor):

View File

@@ -20,12 +20,16 @@ class FlickrExtractor(Extractor):
     filename_fmt = "{category}_{id}.{extension}"
     directory_fmt = ("{category}", "{user[username]}")
     archive_fmt = "{id}"
-    cookiedomain = None
+    cookies_domain = None
+    request_interval = (1.0, 2.0)
+    request_interval_min = 0.2

     def __init__(self, match):
         Extractor.__init__(self, match)
-        self.api = FlickrAPI(self)
         self.item_id = match.group(1)
+
+    def _init(self):
+        self.api = FlickrAPI(self)
         self.user = None

     def items(self):

@@ -106,6 +110,8 @@ class FlickrImageExtractor(FlickrExtractor):
     def items(self):
         photo = self.api.photos_getInfo(self.item_id)

+        if self.api.exif:
+            photo.update(self.api.photos_getExif(self.item_id))
+
         if photo["media"] == "video" and self.api.videos:
             self.api._extract_video(photo)

@@ -287,8 +293,8 @@ class FlickrAPI(oauth.OAuth1API):
     """
     API_URL = "https://api.flickr.com/services/rest/"
-    API_KEY = "ac4fd7aa98585b9eee1ba761c209de68"
-    API_SECRET = "3adb0f568dc68393"
+    API_KEY = "f8f78d1a40debf471f0b22fa2d00525f"
+    API_SECRET = "4f9dae1113e45556"
     FORMATS = [
         ("o" , "Original"    , None),
         ("6k", "X-Large 6K"  , 6144),

@@ -323,6 +329,7 @@ class FlickrAPI(oauth.OAuth1API):
     def __init__(self, extractor):
         oauth.OAuth1API.__init__(self, extractor)
+        self.exif = extractor.config("exif", False)
         self.videos = extractor.config("videos", True)
         self.maxsize = extractor.config("size-max")
         if isinstance(self.maxsize, str):

@@ -367,6 +374,11 @@ class FlickrAPI(oauth.OAuth1API):
         params = {"user_id": user_id}
         return self._pagination("people.getPhotos", params)

+    def photos_getExif(self, photo_id):
+        """Retrieves a list of EXIF/TIFF/GPS tags for a given photo."""
+        params = {"photo_id": photo_id}
+        return self._call("photos.getExif", params)["photo"]
+
     def photos_getInfo(self, photo_id):
         """Get information about a photo."""
         params = {"photo_id": photo_id}

@@ -451,9 +463,19 @@ class FlickrAPI(oauth.OAuth1API):
         return data

     def _pagination(self, method, params, key="photos"):
-        params["extras"] = ("description,date_upload,tags,views,media,"
-                            "path_alias,owner_name,")
-        params["extras"] += ",".join("url_" + fmt[0] for fmt in self.formats)
+        extras = ("description,date_upload,tags,views,media,"
+                  "path_alias,owner_name,")
+        includes = self.extractor.config("metadata")
+        if includes:
+            if isinstance(includes, (list, tuple)):
+                includes = ",".join(includes)
+            elif not isinstance(includes, str):
+                includes = ("license,date_taken,original_format,last_update,"
+                            "geo,machine_tags,o_dims")
+            extras = extras + includes + ","
+        extras += ",".join("url_" + fmt[0] for fmt in self.formats)
+        params["extras"] = extras
         params["page"] = 1

         while True:

@@ -478,6 +500,9 @@ class FlickrAPI(oauth.OAuth1API):
             photo["views"] = text.parse_int(photo["views"])
             photo["date"] = text.parse_timestamp(photo["dateupload"])
             photo["tags"] = photo["tags"].split()
+
+            if self.exif:
+                photo.update(self.photos_getExif(photo["id"]))
             photo["id"] = text.parse_int(photo["id"])

             if "owner" in photo:
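Editor's note: the reworked `_pagination()` above folds the new `metadata` option into Flickr's `extras` request parameter. Below is a minimal standalone sketch of that composition step; the base string, default include list, and format codes are taken from the hunk above, while the helper name and example values are illustrative only.

```python
# Standalone sketch of folding a "metadata" config value into the Flickr
# "extras" parameter, mirroring the hunk above. Not gallery-dl code.

BASE_EXTRAS = ("description,date_upload,tags,views,media,"
               "path_alias,owner_name,")
DEFAULT_INCLUDES = ("license,date_taken,original_format,last_update,"
                    "geo,machine_tags,o_dims")
FORMAT_CODES = ("o", "6k", "5k")  # abbreviated; the real FORMATS list is longer

def build_extras(includes):
    extras = BASE_EXTRAS
    if includes:
        if isinstance(includes, (list, tuple)):
            includes = ",".join(includes)   # ["license", "geo"] -> "license,geo"
        elif not isinstance(includes, str):
            includes = DEFAULT_INCLUDES     # e.g. "metadata": true
        extras = extras + includes + ","
    return extras + ",".join("url_" + code for code in FORMAT_CODES)

print(build_extras(["license", "geo"]))  # explicit list of extra fields
print(build_extras(True))                # falls back to the default include set
```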

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2019-2022 Mike Fährmann # Copyright 2019-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -22,10 +22,12 @@ class FoolfuukaExtractor(BaseExtractor):
def __init__(self, match): def __init__(self, match):
BaseExtractor.__init__(self, match) BaseExtractor.__init__(self, match)
self.session.headers["Referer"] = self.root
if self.category == "b4k": if self.category == "b4k":
self.remote = self._remote_direct self.remote = self._remote_direct
def _init(self):
self.session.headers["Referer"] = self.root + "/"
def items(self): def items(self):
yield Message.Directory, self.metadata() yield Message.Directory, self.metadata()
for post in self.posts(): for post in self.posts():
@@ -88,13 +90,9 @@ BASE_PATTERN = FoolfuukaExtractor.update({
         "root": "https://boards.fireden.net",
         "pattern": r"boards\.fireden\.net",
     },
-    "rozenarcana": {
-        "root": "https://archive.alice.al",
-        "pattern": r"(?:archive\.)?alice\.al",
-    },
-    "tokyochronos": {
-        "root": "https://www.tokyochronos.net",
-        "pattern": r"(?:www\.)?tokyochronos\.net",
+    "palanq": {
+        "root": "https://archive.palanq.win",
+        "pattern": r"archive\.palanq\.win",
     },
     "rbt": {
         "root": "https://rbt.asia",
@ -137,11 +135,8 @@ class FoolfuukaThreadExtractor(FoolfuukaExtractor):
("https://boards.fireden.net/sci/thread/11264294/", { ("https://boards.fireden.net/sci/thread/11264294/", {
"url": "61cab625c95584a12a30049d054931d64f8d20aa", "url": "61cab625c95584a12a30049d054931d64f8d20aa",
}), }),
("https://archive.alice.al/c/thread/2849220/", { ("https://archive.palanq.win/c/thread/4209598/", {
"url": "632e2c8de05de6b3847685f4bf1b4e5c6c9e0ed5", "url": "1f9b5570d228f1f2991c827a6631030bc0e5933c",
}),
("https://www.tokyochronos.net/a/thread/241664141/", {
"url": "ae03852cf44e3dcfce5be70274cb1828e1dbb7d6",
}), }),
("https://rbt.asia/g/thread/61487650/", { ("https://rbt.asia/g/thread/61487650/", {
"url": "fadd274b25150a1bdf03a40c58db320fa3b617c4", "url": "fadd274b25150a1bdf03a40c58db320fa3b617c4",
@ -187,8 +182,7 @@ class FoolfuukaBoardExtractor(FoolfuukaExtractor):
("https://arch.b4k.co/meta/"), ("https://arch.b4k.co/meta/"),
("https://desuarchive.org/a/"), ("https://desuarchive.org/a/"),
("https://boards.fireden.net/sci/"), ("https://boards.fireden.net/sci/"),
("https://archive.alice.al/c/"), ("https://archive.palanq.win/c/"),
("https://www.tokyochronos.net/a/"),
("https://rbt.asia/g/"), ("https://rbt.asia/g/"),
("https://thebarchive.com/b/"), ("https://thebarchive.com/b/"),
) )
@ -231,8 +225,7 @@ class FoolfuukaSearchExtractor(FoolfuukaExtractor):
("https://archiveofsins.com/_/search/text/test/"), ("https://archiveofsins.com/_/search/text/test/"),
("https://desuarchive.org/_/search/text/test/"), ("https://desuarchive.org/_/search/text/test/"),
("https://boards.fireden.net/_/search/text/test/"), ("https://boards.fireden.net/_/search/text/test/"),
("https://archive.alice.al/_/search/text/test/"), ("https://archive.palanq.win/_/search/text/test/"),
("https://www.tokyochronos.net/_/search/text/test/"),
("https://rbt.asia/_/search/text/test/"), ("https://rbt.asia/_/search/text/test/"),
("https://thebarchive.com/_/search/text/test/"), ("https://thebarchive.com/_/search/text/test/"),
) )
@ -297,8 +290,7 @@ class FoolfuukaGalleryExtractor(FoolfuukaExtractor):
("https://arch.b4k.co/meta/gallery/"), ("https://arch.b4k.co/meta/gallery/"),
("https://desuarchive.org/a/gallery/5"), ("https://desuarchive.org/a/gallery/5"),
("https://boards.fireden.net/sci/gallery/6"), ("https://boards.fireden.net/sci/gallery/6"),
("https://archive.alice.al/c/gallery/7"), ("https://archive.palanq.win/c/gallery"),
("https://www.tokyochronos.net/a/gallery/7"),
("https://rbt.asia/g/gallery/8"), ("https://rbt.asia/g/gallery/8"),
("https://thebarchive.com/b/gallery/9"), ("https://thebarchive.com/b/gallery/9"),
) )

View File

@ -42,11 +42,6 @@ BASE_PATTERN = FoolslideExtractor.update({
"root": "https://read.powermanga.org", "root": "https://read.powermanga.org",
"pattern": r"read(?:er)?\.powermanga\.org", "pattern": r"read(?:er)?\.powermanga\.org",
}, },
"sensescans": {
"root": "https://sensescans.com/reader",
"pattern": r"(?:(?:www\.)?sensescans\.com/reader"
r"|reader\.sensescans\.com)",
},
}) })
@ -64,11 +59,6 @@ class FoolslideChapterExtractor(FoolslideExtractor):
"url": "854c5817f8f767e1bccd05fa9d58ffb5a4b09384", "url": "854c5817f8f767e1bccd05fa9d58ffb5a4b09384",
"keyword": "a60c42f2634b7387899299d411ff494ed0ad6dbe", "keyword": "a60c42f2634b7387899299d411ff494ed0ad6dbe",
}), }),
("https://sensescans.com/reader/read/ao_no_orchestra/en/0/26/", {
"url": "bbd428dc578f5055e9f86ad635b510386cd317cd",
"keyword": "083ef6f8831c84127fe4096fa340a249be9d1424",
}),
("https://reader.sensescans.com/read/ao_no_orchestra/en/0/26/"),
) )
def items(self): def items(self):
@ -129,9 +119,6 @@ class FoolslideMangaExtractor(FoolslideExtractor):
"volume": int, "volume": int,
}, },
}), }),
("https://sensescans.com/reader/series/yotsubato/", {
"count": ">= 3",
}),
) )
def items(self): def items(self):

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2020-2022 Mike Fährmann # Copyright 2020-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@@ -20,13 +20,16 @@ class FuraffinityExtractor(Extractor):
     directory_fmt = ("{category}", "{user!l}")
     filename_fmt = "{id}{title:? //}.{extension}"
     archive_fmt = "{id}"
-    cookiedomain = ".furaffinity.net"
+    cookies_domain = ".furaffinity.net"
+    cookies_names = ("a", "b")
     root = "https://www.furaffinity.net"
     _warning = True

     def __init__(self, match):
         Extractor.__init__(self, match)
         self.user = match.group(1)
+
+    def _init(self):
         self.offset = 0

         if self.config("descriptions") == "html":

@@ -39,9 +42,8 @@ class FuraffinityExtractor(Extractor):
         self._new_layout = None

     def items(self):
         if self._warning:
-            if not self._check_cookies(("a", "b")):
+            if not self.cookies_check(self.cookies_names):
                 self.log.warning("no 'a' and 'b' session cookies set")
             FuraffinityExtractor._warning = False
@ -98,7 +100,9 @@ class FuraffinityExtractor(Extractor):
'class="tags-row">', '</section>')) 'class="tags-row">', '</section>'))
data["title"] = text.unescape(extr("<h2><p>", "</p></h2>")) data["title"] = text.unescape(extr("<h2><p>", "</p></h2>"))
data["artist"] = extr("<strong>", "<") data["artist"] = extr("<strong>", "<")
data["_description"] = extr('class="section-body">', '</div>') data["_description"] = extr(
'class="submission-description user-submitted-links">',
' </div>')
data["views"] = pi(rh(extr('class="views">', '</span>'))) data["views"] = pi(rh(extr('class="views">', '</span>')))
data["favorites"] = pi(rh(extr('class="favorites">', '</span>'))) data["favorites"] = pi(rh(extr('class="favorites">', '</span>')))
data["comments"] = pi(rh(extr('class="comments">', '</span>'))) data["comments"] = pi(rh(extr('class="comments">', '</span>')))
@ -125,7 +129,9 @@ class FuraffinityExtractor(Extractor):
data["tags"] = text.split_html(extr( data["tags"] = text.split_html(extr(
'id="keywords">', '</div>'))[::2] 'id="keywords">', '</div>'))[::2]
data["rating"] = extr('<img alt="', ' ') data["rating"] = extr('<img alt="', ' ')
data["_description"] = extr("</table>", "</table>") data["_description"] = extr(
'<td valign="top" align="left" width="70%" class="alt1" '
'style="padding:8px">', ' </td>')
data["artist_url"] = data["artist"].replace("_", "").lower() data["artist_url"] = data["artist"].replace("_", "").lower()
data["user"] = self.user or data["artist_url"] data["user"] = self.user or data["artist_url"]
@ -159,7 +165,13 @@ class FuraffinityExtractor(Extractor):
while path: while path:
page = self.request(self.root + path).text page = self.request(self.root + path).text
yield from text.extract_iter(page, 'id="sid-', '"') extr = text.extract_from(page)
while True:
post_id = extr('id="sid-', '"')
if not post_id:
break
self._favorite_id = text.parse_int(extr('data-fav-id="', '"'))
yield post_id
path = text.extr(page, 'right" href="', '"') path = text.extr(page, 'right" href="', '"')
def _pagination_search(self, query): def _pagination_search(self, query):
@ -241,6 +253,7 @@ class FuraffinityFavoriteExtractor(FuraffinityExtractor):
test = ("https://www.furaffinity.net/favorites/mirlinthloth/", { test = ("https://www.furaffinity.net/favorites/mirlinthloth/", {
"pattern": r"https://d\d?\.f(uraffinity|acdn)\.net" "pattern": r"https://d\d?\.f(uraffinity|acdn)\.net"
r"/art/[^/]+/\d+/\d+.\w+\.\w+", r"/art/[^/]+/\d+/\d+.\w+\.\w+",
"keyword": {"favorite_id": int},
"range": "45-50", "range": "45-50",
"count": 6, "count": 6,
}) })
@ -248,6 +261,12 @@ class FuraffinityFavoriteExtractor(FuraffinityExtractor):
def posts(self): def posts(self):
return self._pagination_favorites() return self._pagination_favorites()
def _parse_post(self, post_id):
post = FuraffinityExtractor._parse_post(self, post_id)
if post:
post["favorite_id"] = self._favorite_id
return post
class FuraffinitySearchExtractor(FuraffinityExtractor): class FuraffinitySearchExtractor(FuraffinityExtractor):
"""Extractor for furaffinity search results""" """Extractor for furaffinity search results"""
@ -354,7 +373,7 @@ class FuraffinityPostExtractor(FuraffinityExtractor):
class FuraffinityUserExtractor(FuraffinityExtractor): class FuraffinityUserExtractor(FuraffinityExtractor):
"""Extractor for furaffinity user profiles""" """Extractor for furaffinity user profiles"""
subcategory = "user" subcategory = "user"
cookiedomain = None cookies_domain = None
pattern = BASE_PATTERN + r"/user/([^/?#]+)" pattern = BASE_PATTERN + r"/user/([^/?#]+)"
test = ( test = (
("https://www.furaffinity.net/user/mirlinthloth/", { ("https://www.furaffinity.net/user/mirlinthloth/", {
@ -367,6 +386,9 @@ class FuraffinityUserExtractor(FuraffinityExtractor):
}), }),
) )
def initialize(self):
pass
def items(self): def items(self):
base = "{}/{{}}/{}/".format(self.root, self.user) base = "{}/{{}}/{}/".format(self.root, self.user)
return self._dispatch_extractors(( return self._dispatch_extractors((

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2021-2022 Mike Fährmann # Copyright 2021-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@@ -19,29 +19,32 @@ class GelbooruV01Extractor(booru.BooruExtractor):
     def _parse_post(self, post_id):
         url = "{}/index.php?page=post&s=view&id={}".format(
             self.root, post_id)
-        page = self.request(url).text
-        post = text.extract_all(page, (
-            ("created_at", 'Posted: ', ' <'),
-            ("uploader"  , 'By: ', ' <'),
-            ("width"     , 'Size: ', 'x'),
-            ("height"    , '', ' <'),
-            ("source"    , 'Source: <a href="', '"'),
-            ("rating"    , 'Rating: ', '<'),
-            ("score"     , 'Score: ', ' <'),
-            ("file_url"  , '<img alt="img" src="', '"'),
-            ("tags"      , 'id="tags" name="tags" cols="40" rows="5">', '<'),
-        ))[0]
+        extr = text.extract_from(self.request(url).text)
+
+        post = {
+            "id"        : post_id,
+            "created_at": extr('Posted: ', ' <'),
+            "uploader"  : extr('By: ', ' <'),
+            "width"     : extr('Size: ', 'x'),
+            "height"    : extr('', ' <'),
+            "source"    : extr('Source: ', ' <'),
+            "rating"    : (extr('Rating: ', '<') or "?")[0].lower(),
+            "score"     : extr('Score: ', ' <'),
+            "file_url"  : extr('<img alt="img" src="', '"'),
+            "tags"      : text.unescape(extr(
+                'id="tags" name="tags" cols="40" rows="5">', '<')),
+        }

-        post["id"] = post_id
         post["md5"] = post["file_url"].rpartition("/")[2].partition(".")[0]
-        post["rating"] = (post["rating"] or "?")[0].lower()
-        post["tags"] = text.unescape(post["tags"])
         post["date"] = text.parse_datetime(
             post["created_at"], "%Y-%m-%d %H:%M:%S")

         return post

+    def skip(self, num):
+        self.page_start += num
+        return num
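Editor's note: the rewrite above swaps one `text.extract_all()` call for `text.extract_from()`, which returns a callable that scans the page left to right, each call resuming where the previous one stopped; that is why the dict fields must be listed in page order. A rough standalone re-implementation of that idea, for readers unfamiliar with gallery-dl's `text` helpers (simplified; the real helper lives in `gallery_dl.text` and differs in detail):

```python
# Rough sketch of the sequential-extraction idea behind text.extract_from():
# each call searches only past the previous match. Illustration only.

def extract_from(txt):
    state = {"pos": 0}

    def extr(begin, end, default=""):
        try:
            start = txt.index(begin, state["pos"]) + len(begin)
            stop = txt.index(end, start)
        except ValueError:
            return default
        state["pos"] = stop + len(end)
        return txt[start:stop]

    return extr

page = "Posted: 2023-07-01 12:00:00 <br>By: alice <br>Size: 800x600 <br>"
extr = extract_from(page)
print(extr("Posted: ", " <"))   # 2023-07-01 12:00:00
print(extr("By: ", " <"))       # alice
print(extr("Size: ", "x"))      # 800
```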
     def _pagination(self, url, begin, end):
         pid = self.page_start

@@ -75,9 +78,9 @@ BASE_PATTERN = GelbooruV01Extractor.update({
         "root": "https://drawfriends.booru.org",
         "pattern": r"drawfriends\.booru\.org",
     },
-    "vidyart": {
-        "root": "https://vidyart.booru.org",
-        "pattern": r"vidyart\.booru\.org",
+    "vidyart2": {
+        "root": "https://vidyart2.booru.org",
+        "pattern": r"vidyart2\.booru\.org",
     },
 })
@ -103,7 +106,7 @@ class GelbooruV01TagExtractor(GelbooruV01Extractor):
"count": 25, "count": 25,
}), }),
("https://drawfriends.booru.org/index.php?page=post&s=list&tags=all"), ("https://drawfriends.booru.org/index.php?page=post&s=list&tags=all"),
("https://vidyart.booru.org/index.php?page=post&s=list&tags=all"), ("https://vidyart2.booru.org/index.php?page=post&s=list&tags=all"),
) )
def __init__(self, match): def __init__(self, match):
@ -138,7 +141,7 @@ class GelbooruV01FavoriteExtractor(GelbooruV01Extractor):
"count": 4, "count": 4,
}), }),
("https://drawfriends.booru.org/index.php?page=favorites&s=view&id=1"), ("https://drawfriends.booru.org/index.php?page=favorites&s=view&id=1"),
("https://vidyart.booru.org/index.php?page=favorites&s=view&id=1"), ("https://vidyart2.booru.org/index.php?page=favorites&s=view&id=1"),
) )
def __init__(self, match): def __init__(self, match):
@ -182,7 +185,7 @@ class GelbooruV01PostExtractor(GelbooruV01Extractor):
"md5": "2aaa0438d58fc7baa75a53b4a9621bb89a9d3fdb", "md5": "2aaa0438d58fc7baa75a53b4a9621bb89a9d3fdb",
"rating": "s", "rating": "s",
"score": str, "score": str,
"source": None, "source": "",
"tags": "blush dress green_eyes green_hair hatsune_miku " "tags": "blush dress green_eyes green_hair hatsune_miku "
"long_hair twintails vocaloid", "long_hair twintails vocaloid",
"uploader": "Honochi31", "uploader": "Honochi31",
@ -190,7 +193,7 @@ class GelbooruV01PostExtractor(GelbooruV01Extractor):
}, },
}), }),
("https://drawfriends.booru.org/index.php?page=post&s=view&id=107474"), ("https://drawfriends.booru.org/index.php?page=post&s=view&id=107474"),
("https://vidyart.booru.org/index.php?page=post&s=view&id=383111"), ("https://vidyart2.booru.org/index.php?page=post&s=view&id=39168"),
) )
def __init__(self, match): def __init__(self, match):

View File

@@ -19,8 +19,7 @@ import re
 class GelbooruV02Extractor(booru.BooruExtractor):
     basecategory = "gelbooru_v02"

-    def __init__(self, match):
-        booru.BooruExtractor.__init__(self, match)
+    def _init(self):
         self.api_key = self.config("api-key")
         self.user_id = self.config("user-id")
View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2017-2022 Mike Fährmann # Copyright 2017-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -10,6 +10,7 @@
from .common import Extractor, Message from .common import Extractor, Message
from .. import text, exception from .. import text, exception
from ..cache import cache
class GfycatExtractor(Extractor): class GfycatExtractor(Extractor):
@ -23,6 +24,7 @@ class GfycatExtractor(Extractor):
Extractor.__init__(self, match) Extractor.__init__(self, match)
self.key = match.group(1).lower() self.key = match.group(1).lower()
def _init(self):
formats = self.config("format") formats = self.config("format")
if formats is None: if formats is None:
formats = ("mp4", "webm", "mobile", "gif") formats = ("mp4", "webm", "mobile", "gif")
@ -80,6 +82,8 @@ class GfycatUserExtractor(GfycatExtractor):
}) })
def gfycats(self): def gfycats(self):
if self.key == "me":
return GfycatAPI(self).me()
return GfycatAPI(self).user(self.key) return GfycatAPI(self).user(self.key)
@@ -219,15 +223,8 @@ class GfycatAPI():
     def __init__(self, extractor):
         self.extractor = extractor
-
-    def gfycat(self, gfycat_id):
-        endpoint = "/v1/gfycats/" + gfycat_id
-        return self._call(endpoint)["gfyItem"]
-
-    def user(self, user):
-        endpoint = "/v1/users/{}/gfycats".format(user.lower())
-        params = {"count": 100}
-        return self._pagination(endpoint, params)
+        self.headers = {}
+        self.username, self.password = extractor._get_auth_info()

     def collection(self, user, collection):
         endpoint = "/v1/users/{}/collections/{}/gfycats".format(

@@ -240,14 +237,64 @@ class GfycatAPI():
         params = {"count": 100}
         return self._pagination(endpoint, params, "gfyCollections")

+    def gfycat(self, gfycat_id):
+        endpoint = "/v1/gfycats/" + gfycat_id
+        return self._call(endpoint)["gfyItem"]
+
+    def me(self):
+        endpoint = "/v1/me/gfycats"
+        params = {"count": 100}
+        return self._pagination(endpoint, params)
+
     def search(self, query):
         endpoint = "/v1/gfycats/search"
         params = {"search_text": query, "count": 150}
         return self._pagination(endpoint, params)

+    def user(self, user):
+        endpoint = "/v1/users/{}/gfycats".format(user.lower())
+        params = {"count": 100}
+        return self._pagination(endpoint, params)
+
def authenticate(self):
self.headers["Authorization"] = \
self._authenticate_impl(self.username, self.password)
@cache(maxage=3600, keyarg=1)
def _authenticate_impl(self, username, password):
self.extractor.log.info("Logging in as %s", username)
url = "https://weblogin.gfycat.com/oauth/webtoken"
headers = {"Origin": "https://gfycat.com"}
data = {
"access_key": "Anr96uuqt9EdamSCwK4txKPjMsf2"
"M95Rfa5FLLhPFucu8H5HTzeutyAa",
}
response = self.extractor.request(
url, method="POST", headers=headers, json=data).json()
url = "https://weblogin.gfycat.com/oauth/weblogin"
headers["authorization"] = "Bearer " + response["access_token"]
data = {
"grant_type": "password",
"username" : username,
"password" : password,
}
response = self.extractor.request(
url, method="POST", headers=headers, json=data, fatal=None).json()
if "errorMessage" in response:
raise exception.AuthenticationError(
response["errorMessage"]["description"])
return "Bearer " + response["access_token"]
     def _call(self, endpoint, params=None):
+        if self.username:
+            self.authenticate()
+
         url = self.API_ROOT + endpoint
-        return self.extractor.request(url, params=params).json()
+        return self.extractor.request(
+            url, params=params, headers=self.headers).json()
def _pagination(self, endpoint, params, key="gfycats"): def _pagination(self, endpoint, params, key="gfycats"):
while True: while True:
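Editor's note: the `@cache(maxage=3600, keyarg=1)` decorator on `_authenticate_impl()` above memoizes the returned Bearer token per username for an hour, so repeated `_call()` invocations reuse one login. A toy illustration of that keyed-memoization pattern follows; gallery-dl's real `cache` decorator is persistent and more featureful, and this sketch only mimics the shape of its behavior.

```python
# Toy sketch of memoizing an expensive call by one of its arguments,
# similar in spirit to gallery-dl's @cache(maxage=3600, keyarg=1).
import functools
import time

def cache(maxage, keyarg):
    def decorator(func):
        store = {}

        @functools.wraps(func)
        def wrapper(*args):
            key = args[keyarg]
            value, expires = store.get(key, (None, 0.0))
            if expires < time.time():
                value = func(*args)
                store[key] = (value, time.time() + maxage)
            return value
        return wrapper
    return decorator

@cache(maxage=3600, keyarg=1)
def authenticate(api, username):
    print("logging in as", username)   # stands in for the network round trip
    return "Bearer token-for-" + username

print(authenticate(None, "alice"))  # performs the "login"
print(authenticate(None, "alice"))  # served from the cache
```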

View File

@@ -6,7 +6,8 @@
 from .common import Extractor, Message
 from .. import text, exception
-from ..cache import memcache
+from ..cache import cache, memcache
+import hashlib


 class GofileFolderExtractor(Extractor):

@@ -66,19 +67,18 @@ class GofileFolderExtractor(Extractor):
     def items(self):
         recursive = self.config("recursive")
+        password = self.config("password")

         token = self.config("api-token")
         if not token:
             token = self._create_account()
-        self.session.cookies.set("accountToken", token, domain=".gofile.io")
+        self.cookies.set("accountToken", token, domain=".gofile.io")
         self.api_token = token

-        token = self.config("website-token", "12345")
-        if not token:
-            token = self._get_website_token()
-        self.website_token = token
+        self.website_token = (self.config("website-token") or
+                              self._get_website_token())

-        folder = self._get_content(self.content_id)
+        folder = self._get_content(self.content_id, password)
         yield Message.Directory, folder

         num = 0

@@ -109,17 +109,20 @@ class GofileFolderExtractor(Extractor):
         self.log.debug("Creating temporary account")
         return self._api_request("createAccount")["token"]

-    @memcache()
+    @cache(maxage=86400)
     def _get_website_token(self):
         self.log.debug("Fetching website token")
-        page = self.request(self.root + "/contents/files.html").text
-        return text.extract(page, "websiteToken:", ",")[0].strip("\" ")
+        page = self.request(self.root + "/dist/js/alljs.js").text
+        return text.extr(page, 'fetchData.websiteToken = "', '"')

-    def _get_content(self, content_id):
+    def _get_content(self, content_id, password=None):
+        if password is not None:
+            password = hashlib.sha256(password.encode()).hexdigest()
         return self._api_request("getContent", {
             "contentId"   : content_id,
             "token"       : self.api_token,
             "websiteToken": self.website_token,
+            "password"    : password,
         })

     def _api_request(self, endpoint, params=None):
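Editor's note: the new `password` handling above sends the SHA-256 hex digest of the folder password rather than the plain text. A quick check of what ends up in the `getContent` payload; the password and the placeholder tokens here are made up for illustration.

```python
# The folder password is sent as its SHA-256 hex digest, as in _get_content()
# above. All values below are placeholders.
import hashlib

password = "hunter2"
params = {
    "contentId"   : "Wtn2b3",           # placeholder content id
    "token"       : "<account token>",  # placeholder API token
    "websiteToken": "<website token>",  # placeholder website token
    "password"    : hashlib.sha256(password.encode()).hexdigest(),
}
print(params["password"])  # 64-character hex digest
```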

View File

@@ -57,7 +57,9 @@ class HentaicosplaysGalleryExtractor(GalleryExtractor):
         self.root = text.ensure_http_scheme(root)
         url = "{}/story/{}/".format(self.root, self.slug)
         GalleryExtractor.__init__(self, match, url)
-        self.session.headers["Referer"] = url
+
+    def _init(self):
+        self.session.headers["Referer"] = self.gallery_url

     def metadata(self, page):
         title = text.extr(page, "<title>", "</title>")

View File

@@ -20,7 +20,7 @@ class HentaifoundryExtractor(Extractor):
     directory_fmt = ("{category}", "{user}")
     filename_fmt = "{category}_{index}_{title}.{extension}"
     archive_fmt = "{index}"
-    cookiedomain = "www.hentai-foundry.com"
+    cookies_domain = "www.hentai-foundry.com"
     root = "https://www.hentai-foundry.com"
     per_page = 25

@@ -123,14 +123,14 @@ class HentaifoundryExtractor(Extractor):
     def _init_site_filters(self):
         """Set site-internal filters to show all images"""
-        if self.session.cookies.get("PHPSESSID", domain=self.cookiedomain):
+        if self.cookies.get("PHPSESSID", domain=self.cookies_domain):
             return

         url = self.root + "/?enterAgree=1"
         self.request(url, method="HEAD")

-        csrf_token = self.session.cookies.get(
-            "YII_CSRF_TOKEN", domain=self.cookiedomain)
+        csrf_token = self.cookies.get(
+            "YII_CSRF_TOKEN", domain=self.cookies_domain)
         if not csrf_token:
             self.log.warning("Unable to update site content filters")
             return
@ -170,6 +170,9 @@ class HentaifoundryUserExtractor(HentaifoundryExtractor):
pattern = BASE_PATTERN + r"/user/([^/?#]+)/profile" pattern = BASE_PATTERN + r"/user/([^/?#]+)/profile"
test = ("https://www.hentai-foundry.com/user/Tenpura/profile",) test = ("https://www.hentai-foundry.com/user/Tenpura/profile",)
def initialize(self):
pass
def items(self): def items(self):
root = self.root root = self.root
user = "/user/" + self.user user = "/user/" + self.user

View File

@ -45,6 +45,15 @@ class HentaifoxGalleryExtractor(HentaifoxBase, GalleryExtractor):
"type": "doujinshi", "type": "doujinshi",
}, },
}), }),
# email-protected title (#4201)
("https://hentaifox.com/gallery/35261/", {
"keyword": {
"gallery_id": 35261,
"title": "ManageM@ster!",
"artist": ["haritama hiroki"],
"group": ["studio n.ball"],
},
}),
) )
def __init__(self, match): def __init__(self, match):
@@ -65,13 +74,14 @@ class HentaifoxGalleryExtractor(HentaifoxBase, GalleryExtractor):
         return {
             "gallery_id": text.parse_int(self.gallery_id),
-            "title"     : text.unescape(extr("<h1>", "</h1>")),
             "parody"    : split(extr(">Parodies:"  , "</ul>")),
             "characters": split(extr(">Characters:", "</ul>")),
             "tags"      : split(extr(">Tags:"      , "</ul>")),
             "artist"    : split(extr(">Artists:"   , "</ul>")),
             "group"     : split(extr(">Groups:"    , "</ul>")),
             "type"      : text.remove_html(extr(">Category:", "<span")),
+            "title"     : text.unescape(extr(
+                'id="gallery_title" value="', '"')),
             "language"  : "English",
             "lang"      : "en",
         }

View File

@@ -153,7 +153,7 @@ class HiperdexMangaExtractor(HiperdexBase, MangaExtractor):
             "Accept": "*/*",
             "X-Requested-With": "XMLHttpRequest",
             "Origin": self.root,
-            "Referer": self.manga_url,
+            "Referer": "https://" + text.quote(self.manga_url[8:]),
         }

         html = self.request(url, method="POST", headers=headers).text

View File

@@ -66,12 +66,13 @@ class HitomiGalleryExtractor(GalleryExtractor):
     )

     def __init__(self, match):
-        gid = match.group(1)
-        url = "https://ltn.hitomi.la/galleries/{}.js".format(gid)
+        self.gid = match.group(1)
+        url = "https://ltn.hitomi.la/galleries/{}.js".format(self.gid)
         GalleryExtractor.__init__(self, match, url)
         self.info = None
+
+    def _init(self):
         self.session.headers["Referer"] = "{}/reader/{}.html".format(
-            self.root, gid)
+            self.root, self.gid)

     def metadata(self, page):
         self.info = info = util.json_loads(page.partition("=")[2])
View File

@@ -21,9 +21,8 @@ class HotleakExtractor(Extractor):
     archive_fmt = "{type}_{creator}_{id}"
     root = "https://hotleak.vip"

-    def __init__(self, match):
-        Extractor.__init__(self, match)
-        self.session.headers["Referer"] = self.root
+    def _init(self):
+        self.session.headers["Referer"] = self.root + "/"

     def items(self):
         for post in self.posts():
View File

@@ -19,9 +19,9 @@ import re
 class IdolcomplexExtractor(SankakuExtractor):
     """Base class for idolcomplex extractors"""
     category = "idolcomplex"
-    cookienames = ("login", "pass_hash")
-    cookiedomain = "idol.sankakucomplex.com"
-    root = "https://" + cookiedomain
+    cookies_domain = "idol.sankakucomplex.com"
+    cookies_names = ("login", "pass_hash")
+    root = "https://" + cookies_domain
     request_interval = 5.0

     def __init__(self, match):

@@ -29,6 +29,8 @@ class IdolcomplexExtractor(SankakuExtractor):
         self.logged_in = True
         self.start_page = 1
         self.start_post = 0
+
+    def _init(self):
         self.extags = self.config("tags", False)

     def items(self):

@@ -51,14 +53,14 @@ class IdolcomplexExtractor(SankakuExtractor):
         """Return an iterable containing all relevant post ids"""

     def login(self):
-        if self._check_cookies(self.cookienames):
+        if self.cookies_check(self.cookies_names):
             return

         username, password = self._get_auth_info()
         if username:
-            cookies = self._login_impl(username, password)
-            self._update_cookies(cookies)
-        else:
-            self.logged_in = False
+            return self.cookies_update(self._login_impl(username, password))
+
+        self.logged_in = False

     @cache(maxage=90*24*3600, keyarg=1)
     def _login_impl(self, username, password):

@@ -76,7 +78,7 @@ class IdolcomplexExtractor(SankakuExtractor):
         if not response.history or response.url != self.root + "/user/home":
             raise exception.AuthenticationError()

         cookies = response.history[0].cookies
-        return {c: cookies[c] for c in self.cookienames}
+        return {c: cookies[c] for c in self.cookies_names}

     def _parse_post(self, post_id):
"""Extract metadata of a single post""" """Extract metadata of a single post"""

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2014-2022 Mike Fährmann # Copyright 2014-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -21,7 +21,9 @@ class ImagebamExtractor(Extractor):
def __init__(self, match): def __init__(self, match):
Extractor.__init__(self, match) Extractor.__init__(self, match)
self.path = match.group(1) self.path = match.group(1)
self.session.cookies.set("nsfw_inter", "1", domain="www.imagebam.com")
def _init(self):
self.cookies.set("nsfw_inter", "1", domain="www.imagebam.com")
def _parse_image_page(self, path): def _parse_image_page(self, path):
page = self.request(self.root + path).text page = self.request(self.root + path).text

View File

@ -31,6 +31,15 @@ class ImagechestGalleryExtractor(GalleryExtractor):
"content": "076959e65be30249a2c651fbe6090dc30ba85193", "content": "076959e65be30249a2c651fbe6090dc30ba85193",
"count": 3 "count": 3
}), }),
# "Load More Files" button (#4028)
("https://imgchest.com/p/9p4n3q2z7nq", {
"pattern": r"https://cdn\.imgchest\.com/files/\w+\.(jpg|png)",
"url": "f5674e8ba79d336193c9f698708d9dcc10e78cc7",
"count": 52,
}),
("https://imgchest.com/p/xxxxxxxxxxx", {
"exception": exception.NotFoundError,
}),
) )
def __init__(self, match): def __init__(self, match):
@ -38,6 +47,14 @@ class ImagechestGalleryExtractor(GalleryExtractor):
url = self.root + "/p/" + self.gallery_id url = self.root + "/p/" + self.gallery_id
GalleryExtractor.__init__(self, match, url) GalleryExtractor.__init__(self, match, url)
def _init(self):
access_token = self.config("access-token")
if access_token:
self.api = ImagechestAPI(self, access_token)
self.gallery_url = None
self.metadata = self._metadata_api
self.images = self._images_api
def metadata(self, page): def metadata(self, page):
if "Sorry, but the page you requested could not be found." in page: if "Sorry, but the page you requested could not be found." in page:
raise exception.NotFoundError("gallery") raise exception.NotFoundError("gallery")
@@ -49,7 +66,84 @@ class ImagechestGalleryExtractor(GalleryExtractor):
         }

     def images(self, page):
+        if " More Files</button>" in page:
+            url = "{}/p/{}/loadAll".format(self.root, self.gallery_id)
+            headers = {
+                "X-Requested-With": "XMLHttpRequest",
+                "Origin"          : self.root,
+                "Referer"         : self.gallery_url,
+            }
+            csrf_token = text.extr(page, 'name="csrf-token" content="', '"')
+            data = {"_token": csrf_token}
+            page += self.request(
+                url, method="POST", headers=headers, data=data).text
+
         return [
             (url, None)
             for url in text.extract_iter(page, 'data-url="', '"')
         ]
def _metadata_api(self, page):
post = self.api.post(self.gallery_id)
post["date"] = text.parse_datetime(
post["created"], "%Y-%m-%dT%H:%M:%S.%fZ")
for img in post["images"]:
img["date"] = text.parse_datetime(
img["created"], "%Y-%m-%dT%H:%M:%S.%fZ")
post["gallery_id"] = self.gallery_id
post.pop("image_count", None)
self._image_list = post.pop("images")
return post
def _images_api(self, page):
return [
(img["link"], img)
for img in self._image_list
]
class ImagechestAPI():
"""Interface for the Image Chest API
https://imgchest.com/docs/api/1.0/general/overview
"""
root = "https://api.imgchest.com"
def __init__(self, extractor, access_token):
self.extractor = extractor
self.headers = {"Authorization": "Bearer " + access_token}
def file(self, file_id):
endpoint = "/v1/file/" + file_id
return self._call(endpoint)
def post(self, post_id):
endpoint = "/v1/post/" + post_id
return self._call(endpoint)
def user(self, username):
endpoint = "/v1/user/" + username
return self._call(endpoint)
def _call(self, endpoint):
url = self.root + endpoint
while True:
response = self.extractor.request(
url, headers=self.headers, fatal=None, allow_redirects=False)
if response.status_code < 300:
return response.json()["data"]
elif response.status_code < 400:
raise exception.AuthenticationError("Invalid API access token")
elif response.status_code == 429:
self.extractor.wait(seconds=600)
else:
self.extractor.log.debug(response.text)
raise exception.StopExtraction("API request failed")
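Editor's note: the ImagechestAPI wrapper above talks to the documented Image Chest REST endpoints with a personal access token supplied via the extractor's `access-token` option. For reference, a minimal standalone request against the same `post` endpoint, assuming the `requests` library; the token is a placeholder and the post id is taken from the test URLs above.

```python
# Minimal standalone call against the Image Chest API endpoint used by
# ImagechestAPI.post() above. Token is a placeholder.
import requests

ACCESS_TOKEN = "<your imgchest access token>"
POST_ID = "9p4n3q2z7nq"   # example id from the test cases above

response = requests.get(
    "https://api.imgchest.com/v1/post/" + POST_ID,
    headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    timeout=30,
)
response.raise_for_status()
post = response.json()["data"]          # _call() unwraps the "data" key the same way
print(post.get("title"), len(post["images"]))
```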

View File

@@ -23,9 +23,8 @@ class ImagefapExtractor(Extractor):
     archive_fmt = "{gallery_id}_{image_id}"
     request_interval = (2.0, 4.0)

-    def __init__(self, match):
-        Extractor.__init__(self, match)
-        self.session.headers["Referer"] = self.root
+    def _init(self):
+        self.session.headers["Referer"] = self.root + "/"

     def request(self, url, **kwargs):
         response = Extractor.request(self, url, **kwargs)

@@ -283,7 +282,7 @@ class ImagefapFolderExtractor(ImagefapExtractor):
                 yield gid, extr("<b>", "<")
                 cnt += 1

-            if cnt < 25:
+            if cnt < 20:
                 break
             params["page"] += 1

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2016-2022 Mike Fährmann # Copyright 2016-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@@ -19,23 +19,23 @@ class ImagehostImageExtractor(Extractor):
     basecategory = "imagehost"
     subcategory = "image"
     archive_fmt = "{token}"
-    https = True
-    params = None
-    cookies = None
-    encoding = None
+    _https = True
+    _params = None
+    _cookies = None
+    _encoding = None

     def __init__(self, match):
         Extractor.__init__(self, match)
         self.page_url = "http{}://{}".format(
-            "s" if self.https else "", match.group(1))
+            "s" if self._https else "", match.group(1))
         self.token = match.group(2)

-        if self.params == "simple":
-            self.params = {
+        if self._params == "simple":
+            self._params = {
                 "imgContinue": "Continue+to+image+...+",
             }
-        elif self.params == "complex":
-            self.params = {
+        elif self._params == "complex":
+            self._params = {
                 "op": "view",
                 "id": self.token,
                 "pre": "1",

@@ -46,16 +46,16 @@ class ImagehostImageExtractor(Extractor):
     def items(self):
         page = self.request(
             self.page_url,
-            method=("POST" if self.params else "GET"),
-            data=self.params,
-            cookies=self.cookies,
-            encoding=self.encoding,
+            method=("POST" if self._params else "GET"),
+            data=self._params,
+            cookies=self._cookies,
+            encoding=self._encoding,
         ).text

         url, filename = self.get_info(page)
         data = text.nameext_from_url(filename, {"token": self.token})
         data.update(self.metadata(page))
-        if self.https and url.startswith("http:"):
+        if self._https and url.startswith("http:"):
             url = "https:" + url[5:]

         yield Message.Directory, data
@ -102,8 +102,8 @@ class ImxtoImageExtractor(ImagehostImageExtractor):
"exception": exception.NotFoundError, "exception": exception.NotFoundError,
}), }),
) )
params = "simple" _params = "simple"
encoding = "utf-8" _encoding = "utf-8"
def __init__(self, match): def __init__(self, match):
ImagehostImageExtractor.__init__(self, match) ImagehostImageExtractor.__init__(self, match)
@ -153,8 +153,9 @@ class ImxtoGalleryExtractor(ImagehostImageExtractor):
"_extractor": ImxtoImageExtractor, "_extractor": ImxtoImageExtractor,
"title": text.unescape(title.partition(">")[2]).strip(), "title": text.unescape(title.partition(">")[2]).strip(),
} }
for url in text.extract_iter(page, '<a href="', '"', pos):
yield Message.Queue, url, data for url in text.extract_iter(page, "<a href=", " ", pos):
yield Message.Queue, url.strip("\"'"), data
class AcidimgImageExtractor(ImagehostImageExtractor): class AcidimgImageExtractor(ImagehostImageExtractor):
@ -163,17 +164,23 @@ class AcidimgImageExtractor(ImagehostImageExtractor):
pattern = r"(?:https?://)?((?:www\.)?acidimg\.cc/img-([a-z0-9]+)\.html)" pattern = r"(?:https?://)?((?:www\.)?acidimg\.cc/img-([a-z0-9]+)\.html)"
test = ("https://acidimg.cc/img-5acb6b9de4640.html", { test = ("https://acidimg.cc/img-5acb6b9de4640.html", {
"url": "f132a630006e8d84f52d59555191ed82b3b64c04", "url": "f132a630006e8d84f52d59555191ed82b3b64c04",
"keyword": "a8bb9ab8b2f6844071945d31f8c6e04724051f37", "keyword": "135347ab4345002fc013863c0d9419ba32d98f78",
"content": "0c8768055e4e20e7c7259608b67799171b691140", "content": "0c8768055e4e20e7c7259608b67799171b691140",
}) })
params = "simple" _params = "simple"
encoding = "utf-8" _encoding = "utf-8"
def get_info(self, page): def get_info(self, page):
url, pos = text.extract(page, "<img class='centred' src='", "'") url, pos = text.extract(page, "<img class='centred' src='", "'")
if not url: if not url:
raise exception.NotFoundError("image") url, pos = text.extract(page, '<img class="centred" src="', '"')
filename, pos = text.extract(page, " alt='", "'", pos) if not url:
raise exception.NotFoundError("image")
filename, pos = text.extract(page, "alt='", "'", pos)
if not filename:
filename, pos = text.extract(page, 'alt="', '"', pos)
return url, (filename + splitext(url)[1]) if filename else url return url, (filename + splitext(url)[1]) if filename else url
@ -225,7 +232,7 @@ class ImagetwistImageExtractor(ImagehostImageExtractor):
@property @property
@memcache(maxage=3*3600) @memcache(maxage=3*3600)
def cookies(self): def _cookies(self):
return self.request(self.page_url).cookies return self.request(self.page_url).cookies
def get_info(self, page): def get_info(self, page):
@ -263,7 +270,7 @@ class PixhostImageExtractor(ImagehostImageExtractor):
"keyword": "3bad6d59db42a5ebbd7842c2307e1c3ebd35e6b0", "keyword": "3bad6d59db42a5ebbd7842c2307e1c3ebd35e6b0",
"content": "0c8768055e4e20e7c7259608b67799171b691140", "content": "0c8768055e4e20e7c7259608b67799171b691140",
}) })
cookies = {"pixhostads": "1", "pixhosttest": "1"} _cookies = {"pixhostads": "1", "pixhosttest": "1"}
def get_info(self, page): def get_info(self, page):
url , pos = text.extract(page, "class=\"image-img\" src=\"", "\"") url , pos = text.extract(page, "class=\"image-img\" src=\"", "\"")
@@ -294,19 +301,38 @@ class PostimgImageExtractor(ImagehostImageExtractor):
     """Extractor for single images from postimages.org"""
     category = "postimg"
     pattern = (r"(?:https?://)?((?:www\.)?(?:postimg|pixxxels)\.(?:cc|org)"
-               r"/(?:image/)?([^/?#]+)/?)")
+               r"/(?!gallery/)(?:image/)?([^/?#]+)/?)")
     test = ("https://postimg.cc/Wtn2b3hC", {
-        "url": "0794cfda9b8951a8ac3aa692472484200254ab86",
+        "url": "72f3c8b1d6c6601a20ad58f35635494b4891a99e",
         "keyword": "2d05808d04e4e83e33200db83521af06e3147a84",
         "content": "cfaa8def53ed1a575e0c665c9d6d8cf2aac7a0ee",
     })

     def get_info(self, page):
-        url     , pos = text.extract(page, 'id="main-image" src="', '"')
+        pos = page.index(' id="download"')
+        url     , pos = text.rextract(page, ' href="', '"', pos)
         filename, pos = text.extract(page, 'class="imagename">', '<', pos)
         return url, text.unescape(filename)
class PostimgGalleryExtractor(ImagehostImageExtractor):
"""Extractor for images galleries from postimages.org"""
category = "postimg"
subcategory = "gallery"
pattern = (r"(?:https?://)?((?:www\.)?(?:postimg|pixxxels)\.(?:cc|org)"
r"/(?:gallery/)([^/?#]+)/?)")
test = ("https://postimg.cc/gallery/wxpDLgX", {
"pattern": PostimgImageExtractor.pattern,
"count": 22,
})
def items(self):
page = self.request(self.page_url).text
data = {"_extractor": PostimgImageExtractor}
for url in text.extract_iter(page, ' class="thumb"><a href="', '"'):
yield Message.Queue, url, data
class TurboimagehostImageExtractor(ImagehostImageExtractor): class TurboimagehostImageExtractor(ImagehostImageExtractor):
"""Extractor for single images from www.turboimagehost.com""" """Extractor for single images from www.turboimagehost.com"""
category = "turboimagehost" category = "turboimagehost"
@ -315,7 +341,7 @@ class TurboimagehostImageExtractor(ImagehostImageExtractor):
test = ("https://www.turboimagehost.com/p/39078423/test--.png.html", { test = ("https://www.turboimagehost.com/p/39078423/test--.png.html", {
"url": "b94de43612318771ced924cb5085976f13b3b90e", "url": "b94de43612318771ced924cb5085976f13b3b90e",
"keyword": "704757ca8825f51cec516ec44c1e627c1f2058ca", "keyword": "704757ca8825f51cec516ec44c1e627c1f2058ca",
"content": "0c8768055e4e20e7c7259608b67799171b691140", "content": "f38b54b17cd7462e687b58d83f00fca88b1b105a",
}) })
def get_info(self, page): def get_info(self, page):
@ -346,8 +372,8 @@ class ImgclickImageExtractor(ImagehostImageExtractor):
"keyword": "6895256143eab955622fc149aa367777a8815ba3", "keyword": "6895256143eab955622fc149aa367777a8815ba3",
"content": "0c8768055e4e20e7c7259608b67799171b691140", "content": "0c8768055e4e20e7c7259608b67799171b691140",
}) })
https = False _https = False
params = "complex" _params = "complex"
def get_info(self, page): def get_info(self, page):
url , pos = text.extract(page, '<br><img src="', '"') url , pos = text.extract(page, '<br><img src="', '"')

View File

@ -62,7 +62,7 @@ class ImgbbExtractor(Extractor):
def login(self): def login(self):
username, password = self._get_auth_info() username, password = self._get_auth_info()
if username: if username:
self._update_cookies(self._login_impl(username, password)) self.cookies_update(self._login_impl(username, password))
@cache(maxage=360*24*3600, keyarg=1) @cache(maxage=360*24*3600, keyarg=1)
def _login_impl(self, username, password): def _login_impl(self, username, password):
@ -82,7 +82,7 @@ class ImgbbExtractor(Extractor):
if not response.history: if not response.history:
raise exception.AuthenticationError() raise exception.AuthenticationError()
return self.session.cookies return self.cookies
def _pagination(self, page, endpoint, params): def _pagination(self, page, endpoint, params):
data = None data = None

View File

@@ -22,8 +22,10 @@ class ImgurExtractor(Extractor):
     def __init__(self, match):
         Extractor.__init__(self, match)
-        self.api = ImgurAPI(self)
         self.key = match.group(1)
+
+    def _init(self):
+        self.api = ImgurAPI(self)
         self.mp4 = self.config("mp4", True)

     def _prepare(self, image):

@@ -47,8 +49,13 @@ class ImgurExtractor(Extractor):
         image_ex = ImgurImageExtractor

         for item in items:
-            item["_extractor"] = album_ex if item["is_album"] else image_ex
-            yield Message.Queue, item["link"], item
+            if item["is_album"]:
+                url = "https://imgur.com/a/" + item["id"]
+                item["_extractor"] = album_ex
+            else:
+                url = "https://imgur.com/" + item["id"]
+                item["_extractor"] = image_ex
+            yield Message.Queue, url, item
class ImgurImageExtractor(ImgurExtractor): class ImgurImageExtractor(ImgurExtractor):
@ -272,7 +279,7 @@ class ImgurUserExtractor(ImgurExtractor):
("https://imgur.com/user/Miguenzo", { ("https://imgur.com/user/Miguenzo", {
"range": "1-100", "range": "1-100",
"count": 100, "count": 100,
"pattern": r"https?://(i.imgur.com|imgur.com/a)/[\w.]+", "pattern": r"https://imgur\.com(/a)?/\w+$",
}), }),
("https://imgur.com/user/Miguenzo/posts"), ("https://imgur.com/user/Miguenzo/posts"),
("https://imgur.com/user/Miguenzo/submitted"), ("https://imgur.com/user/Miguenzo/submitted"),
@ -285,17 +292,41 @@ class ImgurUserExtractor(ImgurExtractor):
class ImgurFavoriteExtractor(ImgurExtractor): class ImgurFavoriteExtractor(ImgurExtractor):
"""Extractor for a user's favorites""" """Extractor for a user's favorites"""
subcategory = "favorite" subcategory = "favorite"
pattern = BASE_PATTERN + r"/user/([^/?#]+)/favorites" pattern = BASE_PATTERN + r"/user/([^/?#]+)/favorites/?$"
test = ("https://imgur.com/user/Miguenzo/favorites", { test = ("https://imgur.com/user/Miguenzo/favorites", {
"range": "1-100", "range": "1-100",
"count": 100, "count": 100,
"pattern": r"https?://(i.imgur.com|imgur.com/a)/[\w.]+", "pattern": r"https://imgur\.com(/a)?/\w+$",
}) })
def items(self): def items(self):
return self._items_queue(self.api.account_favorites(self.key)) return self._items_queue(self.api.account_favorites(self.key))
class ImgurFavoriteFolderExtractor(ImgurExtractor):
"""Extractor for a user's favorites folder"""
subcategory = "favorite-folder"
pattern = BASE_PATTERN + r"/user/([^/?#]+)/favorites/folder/(\d+)"
test = (
("https://imgur.com/user/mikf1/favorites/folder/11896757/public", {
"pattern": r"https://imgur\.com(/a)?/\w+$",
"count": 3,
}),
("https://imgur.com/user/mikf1/favorites/folder/11896741/private", {
"pattern": r"https://imgur\.com(/a)?/\w+$",
"count": 5,
}),
)
def __init__(self, match):
ImgurExtractor.__init__(self, match)
self.folder_id = match.group(2)
def items(self):
return self._items_queue(self.api.account_favorites_folder(
self.key, self.folder_id))
class ImgurSubredditExtractor(ImgurExtractor): class ImgurSubredditExtractor(ImgurExtractor):
"""Extractor for a subreddits's imgur links""" """Extractor for a subreddits's imgur links"""
subcategory = "subreddit" subcategory = "subreddit"
@ -303,7 +334,7 @@ class ImgurSubredditExtractor(ImgurExtractor):
test = ("https://imgur.com/r/pics", { test = ("https://imgur.com/r/pics", {
"range": "1-100", "range": "1-100",
"count": 100, "count": 100,
"pattern": r"https?://(i.imgur.com|imgur.com/a)/[\w.]+", "pattern": r"https://imgur\.com(/a)?/\w+$",
}) })
def items(self): def items(self):
@ -317,7 +348,7 @@ class ImgurTagExtractor(ImgurExtractor):
test = ("https://imgur.com/t/animals", { test = ("https://imgur.com/t/animals", {
"range": "1-100", "range": "1-100",
"count": 100, "count": 100,
"pattern": r"https?://(i.imgur.com|imgur.com/a)/[\w.]+", "pattern": r"https://imgur\.com(/a)?/\w+$",
}) })
def items(self): def items(self):
@ -331,7 +362,7 @@ class ImgurSearchExtractor(ImgurExtractor):
test = ("https://imgur.com/search?q=cute+cat", { test = ("https://imgur.com/search?q=cute+cat", {
"range": "1-100", "range": "1-100",
"count": 100, "count": 100,
"pattern": r"https?://(i.imgur.com|imgur.com/a)/[\w.]+", "pattern": r"https://imgur\.com(/a)?/\w+$",
}) })
def items(self): def items(self):
@@ -346,15 +377,18 @@ class ImgurAPI():
     """

     def __init__(self, extractor):
         self.extractor = extractor
-        self.headers = {
-            "Authorization": "Client-ID " + (
-                extractor.config("client-id") or "546c25a59c58ad7"),
-        }
+        self.client_id = extractor.config("client-id") or "546c25a59c58ad7"
+        self.headers = {"Authorization": "Client-ID " + self.client_id}

     def account_favorites(self, account):
         endpoint = "/3/account/{}/gallery_favorites".format(account)
         return self._pagination(endpoint)

+    def account_favorites_folder(self, account, folder_id):
+        endpoint = "/3/account/{}/folders/{}/favorites".format(
+            account, folder_id)
+        return self._pagination_v2(endpoint)
+
     def gallery_search(self, query):
         endpoint = "/3/gallery/search"
         params = {"q": query}

@@ -386,12 +420,12 @@ class ImgurAPI():
         endpoint = "/post/v1/posts/" + gallery_hash
         return self._call(endpoint)

-    def _call(self, endpoint, params=None):
+    def _call(self, endpoint, params=None, headers=None):
         while True:
             try:
                 return self.extractor.request(
                     "https://api.imgur.com" + endpoint,
-                    params=params, headers=self.headers,
+                    params=params, headers=(headers or self.headers),
                 ).json()
             except exception.HttpError as exc:
                 if exc.status not in (403, 429) or \

@@ -410,3 +444,23 @@ class ImgurAPI():
                 return
             yield from data
             num += 1
def _pagination_v2(self, endpoint, params=None, key=None):
if params is None:
params = {}
params["client_id"] = self.client_id
params["page"] = 0
params["sort"] = "newest"
headers = {
"Referer": "https://imgur.com/",
"Origin": "https://imgur.com",
}
while True:
data = self._call(endpoint, params, headers)["data"]
if not data:
return
yield from data
params["page"] += 1
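Editor's note: unlike the older `_pagination()`, the favorites-folder endpoint above pages with an explicit `page` counter (and a `client_id` query parameter plus imgur.com Referer/Origin headers instead of the Authorization header). A compact, self-contained sketch of consuming such a page-numbered endpoint; the fake `fetch_page()` stands in for the real API call and is not part of gallery-dl.

```python
# Sketch of the paging pattern used by _pagination_v2() above: request
# page 0, 1, 2, ... until the endpoint returns an empty "data" list.

def fetch_page(page):
    # stand-in for the real API call; pretends the data runs out after page 2
    return [{"id": "item%d" % page}] if page < 3 else []

def paginate():
    page = 0
    while True:
        data = fetch_page(page)
        if not data:
            return
        yield from data
        page += 1

print(list(paginate()))   # three fake items, one per non-empty page
```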

View File

@ -24,8 +24,7 @@ class InkbunnyExtractor(Extractor):
archive_fmt = "{file_id}" archive_fmt = "{file_id}"
root = "https://inkbunny.net" root = "https://inkbunny.net"
def __init__(self, match): def _init(self):
Extractor.__init__(self, match)
self.api = InkbunnyAPI(self) self.api = InkbunnyAPI(self)
def items(self): def items(self):

View File

@@ -27,34 +27,41 @@ class InstagramExtractor(Extractor):
     filename_fmt = "{sidecar_media_id:?/_/}{media_id}.{extension}"
     archive_fmt = "{media_id}"
     root = "https://www.instagram.com"
-    cookiedomain = ".instagram.com"
-    cookienames = ("sessionid",)
+    cookies_domain = ".instagram.com"
+    cookies_names = ("sessionid",)
     request_interval = (6.0, 12.0)

     def __init__(self, match):
         Extractor.__init__(self, match)
         self.item = match.group(1)
-        self.api = None
+
+    def _init(self):
         self.www_claim = "0"
         self.csrf_token = util.generate_token()
+        self._logged_in = True
         self._find_tags = re.compile(r"#\w+").findall
-        self._logged_in = True
         self._cursor = None
         self._user = None

-    def items(self):
-        self.login()
+        self.cookies.set(
+            "csrftoken", self.csrf_token, domain=self.cookies_domain)

         if self.config("api") == "graphql":
             self.api = InstagramGraphqlAPI(self)
         else:
             self.api = InstagramRestAPI(self)

+    def items(self):
+        self.login()
+
         data = self.metadata()
         videos = self.config("videos", True)
         previews = self.config("previews", False)
         video_headers = {"User-Agent": "Mozilla/5.0"}

+        order = self.config("order-files")
+        reverse = order[0] in ("r", "d") if order else False
+
         for post in self.posts():
             if "__typename" in post:

@@ -71,6 +78,8 @@ class InstagramExtractor(Extractor):
             if "date" in post:
                 del post["date"]

+            if reverse:
+                files.reverse()
+
             for file in files:
                 file.update(post)

@@ -126,14 +135,14 @@ class InstagramExtractor(Extractor):
         return response

     def login(self):
-        if not self._check_cookies(self.cookienames):
-            username, password = self._get_auth_info()
-            if username:
-                self._update_cookies(_login_impl(self, username, password))
-            else:
-                self._logged_in = False
-        self.session.cookies.set(
-            "csrftoken", self.csrf_token, domain=self.cookiedomain)
+        if self.cookies_check(self.cookies_names):
+            return
+
+        username, password = self._get_auth_info()
+        if username:
+            return self.cookies_update(_login_impl(self, username, password))
+
+        self._logged_in = False
def _parse_post_rest(self, post): def _parse_post_rest(self, post):
if "items" in post: # story or highlight if "items" in post: # story or highlight
@ -393,6 +402,12 @@ class InstagramUserExtractor(InstagramExtractor):
("https://www.instagram.com/id:25025320/"), ("https://www.instagram.com/id:25025320/"),
) )
def initialize(self):
pass
def finalize(self):
pass
def items(self): def items(self):
base = "{}/{}/".format(self.root, self.item) base = "{}/{}/".format(self.root, self.item)
stories = "{}/stories/{}/".format(self.root, self.item) stories = "{}/stories/{}/".format(self.root, self.item)
@ -756,10 +771,20 @@ class InstagramRestAPI():
endpoint = "/v1/guides/guide/{}/".format(guide_id) endpoint = "/v1/guides/guide/{}/".format(guide_id)
return self._pagination_guides(endpoint) return self._pagination_guides(endpoint)
def highlights_media(self, user_id): def highlights_media(self, user_id, chunk_size=5):
chunk_size = 5
reel_ids = [hl["id"] for hl in self.highlights_tray(user_id)] reel_ids = [hl["id"] for hl in self.highlights_tray(user_id)]
order = self.extractor.config("order-posts")
if order:
if order in ("desc", "reverse"):
reel_ids.reverse()
elif order in ("id", "id_asc"):
reel_ids.sort(key=lambda r: int(r[10:]))
elif order == "id_desc":
reel_ids.sort(key=lambda r: int(r[10:]), reverse=True)
elif order != "asc":
self.extractor.log.warning("Unknown posts order '%s'", order)
for offset in range(0, len(reel_ids), chunk_size): for offset in range(0, len(reel_ids), chunk_size):
yield from self.reels_media( yield from self.reels_media(
reel_ids[offset : offset+chunk_size]) reel_ids[offset : offset+chunk_size])
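A minimal sketch of the 'order-posts' handling added above, assuming highlight reel IDs of the form "highlight:<number>" (so r[10:] strips the 10-character "highlight:" prefix before numeric sorting); the values below are illustrative only:

reel_ids = ["highlight:301", "highlight:17", "highlight:205"]
order = "id_desc"  # hypothetical config value

if order in ("desc", "reverse"):
    reel_ids.reverse()
elif order in ("id", "id_asc"):
    reel_ids.sort(key=lambda r: int(r[10:]))
elif order == "id_desc":
    reel_ids.sort(key=lambda r: int(r[10:]), reverse=True)

print(reel_ids)  # ['highlight:301', 'highlight:205', 'highlight:17']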
@ -799,13 +824,17 @@ class InstagramRestAPI():
params = {"username": screen_name} params = {"username": screen_name}
return self._call(endpoint, params=params)["data"]["user"] return self._call(endpoint, params=params)["data"]["user"]
@memcache(keyarg=1)
def user_by_id(self, user_id): def user_by_id(self, user_id):
endpoint = "/v1/users/{}/info/".format(user_id) endpoint = "/v1/users/{}/info/".format(user_id)
return self._call(endpoint)["user"] return self._call(endpoint)["user"]
def user_id(self, screen_name, check_private=True): def user_id(self, screen_name, check_private=True):
if screen_name.startswith("id:"): if screen_name.startswith("id:"):
if self.extractor.config("metadata"):
self.extractor._user = self.user_by_id(screen_name[3:])
return screen_name[3:] return screen_name[3:]
user = self.user_by_name(screen_name) user = self.user_by_name(screen_name)
if user is None: if user is None:
raise exception.AuthorizationError( raise exception.AuthorizationError(
@ -845,7 +874,7 @@ class InstagramRestAPI():
def user_tagged(self, user_id): def user_tagged(self, user_id):
endpoint = "/v1/usertags/{}/feed/".format(user_id) endpoint = "/v1/usertags/{}/feed/".format(user_id)
params = {"count": 50} params = {"count": 20}
return self._pagination(endpoint, params) return self._pagination(endpoint, params)
def _call(self, endpoint, **kwargs): def _call(self, endpoint, **kwargs):

View File

@ -26,8 +26,10 @@ class ItakuExtractor(Extractor):
def __init__(self, match): def __init__(self, match):
Extractor.__init__(self, match) Extractor.__init__(self, match)
self.api = ItakuAPI(self)
self.item = match.group(1) self.item = match.group(1)
def _init(self):
self.api = ItakuAPI(self)
self.videos = self.config("videos", True) self.videos = self.config("videos", True)
def items(self): def items(self):

View File

@ -0,0 +1,82 @@
# -*- coding: utf-8 -*-
# Copyright 2023 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://itch.io/"""
from .common import Extractor, Message
from .. import text
class ItchioGameExtractor(Extractor):
"""Extractor for itch.io games"""
category = "itchio"
subcategory = "game"
root = "https://itch.io"
directory_fmt = ("{category}", "{user[name]}")
filename_fmt = "{game[title]} ({id}).{extension}"
archive_fmt = "{id}"
pattern = r"(?:https?://)?(\w+).itch\.io/([\w-]+)"
test = (
("https://sirtartarus.itch.io/a-craft-of-mine", {
"pattern": r"https://\w+\.ssl\.hwcdn\.net/upload2"
r"/game/1983311/7723751\?",
"count": 1,
"keyword": {
"extension": "",
"filename": "7723751",
"game": {
"id": 1983311,
"noun": "game",
"title": "A Craft Of Mine",
"url": "https://sirtartarus.itch.io/a-craft-of-mine",
},
"user": {
"id": 4060052,
"name": "SirTartarus",
"url": "https://sirtartarus.itch.io",
},
},
}),
)
def __init__(self, match):
self.user, self.slug = match.groups()
Extractor.__init__(self, match)
def items(self):
game_url = "https://{}.itch.io/{}".format(self.user, self.slug)
page = self.request(game_url).text
params = {
"source": "view_game",
"as_props": "1",
"after_download_lightbox": "true",
}
headers = {
"Referer": game_url,
"X-Requested-With": "XMLHttpRequest",
"Origin": "https://{}.itch.io".format(self.user),
}
data = {
"csrf_token": text.unquote(self.cookies["itchio_token"]),
}
for upload_id in text.extract_iter(page, 'data-upload_id="', '"'):
file_url = "{}/file/{}".format(game_url, upload_id)
info = self.request(file_url, method="POST", params=params,
headers=headers, data=data).json()
game = info["lightbox"]["game"]
user = info["lightbox"]["user"]
game["url"] = game_url
user.pop("follow_button", None)
game = {"game": game, "user": user, "id": upload_id}
url = info["url"]
yield Message.Directory, game
yield Message.Url, url, text.nameext_from_url(url, game)
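A standalone sketch of the same upload-id lookup outside gallery-dl, assuming the endpoint behaviour shown above (POST <game_url>/file/<upload_id> with the csrf token taken from the "itchio_token" cookie) and a requests session that has already fetched the game page:

import requests
from urllib.parse import unquote

def itch_file_url(session, game_url, upload_id):
    # the session must already hold the "itchio_token" cookie from the game page
    response = session.post(
        "{}/file/{}".format(game_url, upload_id),
        params={"source": "view_game", "as_props": "1",
                "after_download_lightbox": "true"},
        headers={"Referer": game_url,
                 "X-Requested-With": "XMLHttpRequest"},
        data={"csrf_token": unquote(session.cookies.get("itchio_token", ""))},
    )
    return response.json()["url"]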

View File

@ -0,0 +1,151 @@
# -*- coding: utf-8 -*-
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://jpeg.pet/"""
from .common import Extractor, Message
from .. import text
BASE_PATTERN = r"(?:https?://)?jpe?g\.(?:pet|fish(?:ing)?|church)"
class JpgfishExtractor(Extractor):
"""Base class for jpgfish extractors"""
category = "jpgfish"
root = "https://jpeg.pet"
directory_fmt = ("{category}", "{user}", "{album}",)
archive_fmt = "{id}"
def _pagination(self, url):
while url:
page = self.request(url).text
for item in text.extract_iter(
page, '<div class="list-item-image ', 'image-container'):
yield text.extract(item, '<a href="', '"')[0]
url = text.extract(
page, '<a data-pagination="next" href="', '" ><')[0]
class JpgfishImageExtractor(JpgfishExtractor):
"""Extractor for jpgfish Images"""
subcategory = "image"
pattern = BASE_PATTERN + r"/img/((?:[^/?#]+\.)?(\w+))"
test = (
("https://jpeg.pet/img/funnymeme.LecXGS", {
"pattern": r"https://simp3\.jpg\.church/images/funnymeme\.jpg",
"content": "098e5e9b17ad634358426e0ffd1c93871474d13c",
"keyword": {
"album": "",
"extension": "jpg",
"filename": "funnymeme",
"id": "LecXGS",
"url": "https://simp3.jpg.church/images/funnymeme.jpg",
"user": "exearco",
},
}),
("https://jpg.church/img/auCruA", {
"pattern": r"https://simp2\.jpg\.church/hannahowo_00457\.jpg",
"keyword": {"album": "401-500"},
}),
("https://jpg.pet/img/funnymeme.LecXGS"),
("https://jpg.fishing/img/funnymeme.LecXGS"),
("https://jpg.fish/img/funnymeme.LecXGS"),
("https://jpg.church/img/funnymeme.LecXGS"),
)
def __init__(self, match):
JpgfishExtractor.__init__(self, match)
self.path, self.image_id = match.groups()
def items(self):
url = "{}/img/{}".format(self.root, self.path)
extr = text.extract_from(self.request(url).text)
image = {
"id" : self.image_id,
"url" : extr('<meta property="og:image" content="', '"'),
"album": text.extract(extr(
"Added to <a", "/a>"), ">", "<")[0] or "",
"user" : extr('username: "', '"'),
}
text.nameext_from_url(image["url"], image)
yield Message.Directory, image
yield Message.Url, image["url"], image
class JpgfishAlbumExtractor(JpgfishExtractor):
"""Extractor for jpgfish Albums"""
subcategory = "album"
pattern = BASE_PATTERN + r"/a(?:lbum)?/([^/?#]+)(/sub)?"
test = (
("https://jpeg.pet/album/CDilP/?sort=date_desc&page=1", {
"count": 2,
}),
("https://jpg.fishing/a/gunggingnsk.N9OOI", {
"count": 114,
}),
("https://jpg.fish/a/101-200.aNJ6A/", {
"count": 100,
}),
("https://jpg.church/a/hannahowo.aNTdH/sub", {
"count": 606,
}),
("https://jpg.pet/album/CDilP/?sort=date_desc&page=1"),
)
def __init__(self, match):
JpgfishExtractor.__init__(self, match)
self.album, self.sub_albums = match.groups()
def items(self):
url = "{}/a/{}".format(self.root, self.album)
data = {"_extractor": JpgfishImageExtractor}
if self.sub_albums:
albums = self._pagination(url + "/sub")
else:
albums = (url,)
for album in albums:
for image in self._pagination(album):
yield Message.Queue, image, data
class JpgfishUserExtractor(JpgfishExtractor):
"""Extractor for jpgfish Users"""
subcategory = "user"
pattern = BASE_PATTERN + r"/(?!img|a(?:lbum)?)([^/?#]+)(/albums)?"
test = (
("https://jpeg.pet/exearco", {
"count": 3,
}),
("https://jpg.church/exearco/albums", {
"count": 1,
}),
("https://jpg.pet/exearco"),
("https://jpg.fishing/exearco"),
("https://jpg.fish/exearco"),
("https://jpg.church/exearco"),
)
def __init__(self, match):
JpgfishExtractor.__init__(self, match)
self.user, self.albums = match.groups()
def items(self):
url = "{}/{}".format(self.root, self.user)
if self.albums:
url += "/albums"
data = {"_extractor": JpgfishAlbumExtractor}
else:
data = {"_extractor": JpgfishImageExtractor}
for url in self._pagination(url):
yield Message.Queue, url, data

View File

@ -0,0 +1,94 @@
# -*- coding: utf-8 -*-
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for jschan Imageboards"""
from .common import BaseExtractor, Message
from .. import text
import itertools
class JschanExtractor(BaseExtractor):
basecategory = "jschan"
BASE_PATTERN = JschanExtractor.update({
"94chan": {
"root": "https://94chan.org",
"pattern": r"94chan\.org"
}
})
class JschanThreadExtractor(JschanExtractor):
"""Extractor for jschan threads"""
subcategory = "thread"
directory_fmt = ("{category}", "{board}",
"{threadId} {subject|nomarkup[:50]}")
filename_fmt = "{postId}{num:?-//} {filename}.{extension}"
archive_fmt = "{board}_{postId}_{num}"
pattern = BASE_PATTERN + r"/([^/?#]+)/thread/(\d+)\.html"
test = (
("https://94chan.org/art/thread/25.html", {
"pattern": r"https://94chan.org/file/[0-9a-f]{64}(\.\w+)?",
"count": ">= 15"
})
)
def __init__(self, match):
JschanExtractor.__init__(self, match)
index = match.lastindex
self.board = match.group(index-1)
self.thread = match.group(index)
def items(self):
url = "{}/{}/thread/{}.json".format(
self.root, self.board, self.thread)
thread = self.request(url).json()
thread["threadId"] = thread["postId"]
posts = thread.pop("replies", ())
yield Message.Directory, thread
for post in itertools.chain((thread,), posts):
files = post.pop("files", ())
if files:
thread.update(post)
thread["count"] = len(files)
for num, file in enumerate(files):
url = self.root + "/file/" + file["filename"]
file.update(thread)
file["num"] = num
file["siteFilename"] = file["filename"]
text.nameext_from_url(file["originalFilename"], file)
yield Message.Url, url, file
class JschanBoardExtractor(JschanExtractor):
"""Extractor for jschan boards"""
subcategory = "board"
pattern = (BASE_PATTERN + r"/([^/?#]+)"
r"(?:/index\.html|/catalog\.html|/\d+\.html|/?$)")
test = (
("https://94chan.org/art/", {
"pattern": JschanThreadExtractor.pattern,
"count": ">= 30"
}),
("https://94chan.org/art/2.html"),
("https://94chan.org/art/catalog.html"),
("https://94chan.org/art/index.html"),
)
def __init__(self, match):
JschanExtractor.__init__(self, match)
self.board = match.group(match.lastindex)
def items(self):
url = "{}/{}/catalog.json".format(self.root, self.board)
for thread in self.request(url).json():
url = "{}/{}/thread/{}.html".format(
self.root, self.board, thread["postId"])
thread["_extractor"] = JschanThreadExtractor
yield Message.Queue, url, thread
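A minimal standalone sketch of the jschan JSON endpoints used above (a thread page at /<board>/thread/<id>.html has a .json twin, and every file is served from /file/<filename>); requests-based and not part of gallery-dl:

import itertools
import requests

def jschan_file_urls(root, board, thread_id):
    url = "{}/{}/thread/{}.json".format(root, board, thread_id)
    thread = requests.get(url).json()
    posts = itertools.chain((thread,), thread.get("replies", ()))
    return [root + "/file/" + file["filename"]
            for post in posts for file in post.get("files", ())]

# e.g. jschan_file_urls("https://94chan.org", "art", 25)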

View File

@ -14,7 +14,7 @@ from ..cache import cache
import itertools import itertools
import re import re
BASE_PATTERN = r"(?:https?://)?(?:www\.|beta\.)?(kemono|coomer)\.party" BASE_PATTERN = r"(?:https?://)?(?:www\.|beta\.)?(kemono|coomer)\.(party|su)"
USER_PATTERN = BASE_PATTERN + r"/([^/?#]+)/user/([^/?#]+)" USER_PATTERN = BASE_PATTERN + r"/([^/?#]+)/user/([^/?#]+)"
HASH_PATTERN = r"/[0-9a-f]{2}/[0-9a-f]{2}/([0-9a-f]{64})" HASH_PATTERN = r"/[0-9a-f]{2}/[0-9a-f]{2}/([0-9a-f]{64})"
@ -26,22 +26,24 @@ class KemonopartyExtractor(Extractor):
directory_fmt = ("{category}", "{service}", "{user}") directory_fmt = ("{category}", "{service}", "{user}")
filename_fmt = "{id}_{title}_{num:>02}_{filename[:180]}.{extension}" filename_fmt = "{id}_{title}_{num:>02}_{filename[:180]}.{extension}"
archive_fmt = "{service}_{user}_{id}_{num}" archive_fmt = "{service}_{user}_{id}_{num}"
cookiedomain = ".kemono.party" cookies_domain = ".kemono.party"
def __init__(self, match): def __init__(self, match):
if match.group(1) == "coomer": domain = match.group(1)
self.category = "coomerparty" tld = match.group(2)
self.cookiedomain = ".coomer.party" self.category = domain + "party"
self.root = text.root_from_url(match.group(0)) self.root = text.root_from_url(match.group(0))
self.cookies_domain = ".{}.{}".format(domain, tld)
Extractor.__init__(self, match) Extractor.__init__(self, match)
def _init(self):
self.session.headers["Referer"] = self.root + "/" self.session.headers["Referer"] = self.root + "/"
self._prepare_ddosguard_cookies()
self._find_inline = re.compile(
r'src="(?:https?://(?:kemono|coomer)\.(?:party|su))?(/inline/[^"]+'
r'|/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{64}\.[^"]+)').findall
def items(self): def items(self):
self._prepare_ddosguard_cookies()
self._find_inline = re.compile(
r'src="(?:https?://(?:kemono|coomer)\.party)?(/inline/[^"]+'
r'|/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{64}\.[^"]+)').findall
find_hash = re.compile(HASH_PATTERN).match find_hash = re.compile(HASH_PATTERN).match
generators = self._build_file_generators(self.config("files")) generators = self._build_file_generators(self.config("files"))
duplicates = self.config("duplicates") duplicates = self.config("duplicates")
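A quick demo of the updated inline-image pattern above (the HTML snippet is illustrative):

import re

find_inline = re.compile(
    r'src="(?:https?://(?:kemono|coomer)\.(?:party|su))?(/inline/[^"]+'
    r'|/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{64}\.[^"]+)').findall

html = '<img src="https://kemono.su/inline/example.png">'
print(find_inline(html))  # ['/inline/example.png']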
@ -125,10 +127,12 @@ class KemonopartyExtractor(Extractor):
def login(self): def login(self):
username, password = self._get_auth_info() username, password = self._get_auth_info()
if username: if username:
self._update_cookies(self._login_impl(username, password)) self.cookies_update(self._login_impl(
(username, self.cookies_domain), password))
@cache(maxage=28*24*3600, keyarg=1) @cache(maxage=28*24*3600, keyarg=1)
def _login_impl(self, username, password): def _login_impl(self, username, password):
username = username[0]
self.log.info("Logging in as %s", username) self.log.info("Logging in as %s", username)
url = self.root + "/account/login" url = self.root + "/account/login"
@ -222,11 +226,12 @@ class KemonopartyUserExtractor(KemonopartyExtractor):
"options": (("max-posts", 25),), "options": (("max-posts", 25),),
"count": "< 100", "count": "< 100",
}), }),
("https://kemono.su/subscribestar/user/alcorart"),
("https://kemono.party/subscribestar/user/alcorart"), ("https://kemono.party/subscribestar/user/alcorart"),
) )
def __init__(self, match): def __init__(self, match):
_, service, user_id, offset = match.groups() _, _, service, user_id, offset = match.groups()
self.subcategory = service self.subcategory = service
KemonopartyExtractor.__init__(self, match) KemonopartyExtractor.__init__(self, match)
self.api_url = "{}/api/{}/user/{}".format(self.root, service, user_id) self.api_url = "{}/api/{}/user/{}".format(self.root, service, user_id)
@ -327,13 +332,14 @@ class KemonopartyPostExtractor(KemonopartyExtractor):
r"f51c10adc9dabd86e92bd52339f298b9\.txt", r"f51c10adc9dabd86e92bd52339f298b9\.txt",
"content": "da39a3ee5e6b4b0d3255bfef95601890afd80709", # empty "content": "da39a3ee5e6b4b0d3255bfef95601890afd80709", # empty
}), }),
("https://kemono.su/subscribestar/user/alcorart/post/184330"),
("https://kemono.party/subscribestar/user/alcorart/post/184330"), ("https://kemono.party/subscribestar/user/alcorart/post/184330"),
("https://www.kemono.party/subscribestar/user/alcorart/post/184330"), ("https://www.kemono.party/subscribestar/user/alcorart/post/184330"),
("https://beta.kemono.party/subscribestar/user/alcorart/post/184330"), ("https://beta.kemono.party/subscribestar/user/alcorart/post/184330"),
) )
def __init__(self, match): def __init__(self, match):
_, service, user_id, post_id = match.groups() _, _, service, user_id, post_id = match.groups()
self.subcategory = service self.subcategory = service
KemonopartyExtractor.__init__(self, match) KemonopartyExtractor.__init__(self, match)
self.api_url = "{}/api/{}/user/{}/post/{}".format( self.api_url = "{}/api/{}/user/{}/post/{}".format(
@ -359,9 +365,9 @@ class KemonopartyDiscordExtractor(KemonopartyExtractor):
"count": 4, "count": 4,
"keyword": {"channel_name": "finish-work"}, "keyword": {"channel_name": "finish-work"},
}), }),
(("https://kemono.party/discord" (("https://kemono.su/discord"
"/server/256559665620451329/channel/462437519519383555#"), { "/server/256559665620451329/channel/462437519519383555#"), {
"pattern": r"https://kemono\.party/data/(" "pattern": r"https://kemono\.su/data/("
r"e3/77/e377e3525164559484ace2e64425b0cec1db08.*\.png|" r"e3/77/e377e3525164559484ace2e64425b0cec1db08.*\.png|"
r"51/45/51453640a5e0a4d23fbf57fb85390f9c5ec154.*\.gif)", r"51/45/51453640a5e0a4d23fbf57fb85390f9c5ec154.*\.gif)",
"keyword": {"hash": "re:e377e3525164559484ace2e64425b0cec1db08" "keyword": {"hash": "re:e377e3525164559484ace2e64425b0cec1db08"
@ -380,7 +386,7 @@ class KemonopartyDiscordExtractor(KemonopartyExtractor):
def __init__(self, match): def __init__(self, match):
KemonopartyExtractor.__init__(self, match) KemonopartyExtractor.__init__(self, match)
_, self.server, self.channel, self.channel_name = match.groups() _, _, self.server, self.channel, self.channel_name = match.groups()
def items(self): def items(self):
self._prepare_ddosguard_cookies() self._prepare_ddosguard_cookies()
@ -455,14 +461,20 @@ class KemonopartyDiscordExtractor(KemonopartyExtractor):
class KemonopartyDiscordServerExtractor(KemonopartyExtractor): class KemonopartyDiscordServerExtractor(KemonopartyExtractor):
subcategory = "discord-server" subcategory = "discord-server"
pattern = BASE_PATTERN + r"/discord/server/(\d+)$" pattern = BASE_PATTERN + r"/discord/server/(\d+)$"
test = ("https://kemono.party/discord/server/488668827274444803", { test = (
"pattern": KemonopartyDiscordExtractor.pattern, ("https://kemono.party/discord/server/488668827274444803", {
"count": 13, "pattern": KemonopartyDiscordExtractor.pattern,
}) "count": 13,
}),
("https://kemono.su/discord/server/488668827274444803", {
"pattern": KemonopartyDiscordExtractor.pattern,
"count": 13,
}),
)
def __init__(self, match): def __init__(self, match):
KemonopartyExtractor.__init__(self, match) KemonopartyExtractor.__init__(self, match)
self.server = match.group(2) self.server = match.group(3)
def items(self): def items(self):
url = "{}/api/discord/channels/lookup?q={}".format( url = "{}/api/discord/channels/lookup?q={}".format(
@ -491,11 +503,16 @@ class KemonopartyFavoriteExtractor(KemonopartyExtractor):
"url": "ecfccf5f0d50b8d14caa7bbdcf071de5c1e5b90f", "url": "ecfccf5f0d50b8d14caa7bbdcf071de5c1e5b90f",
"count": 3, "count": 3,
}), }),
("https://kemono.su/favorites?type=post", {
"pattern": KemonopartyPostExtractor.pattern,
"url": "4be8e84cb384a907a8e7997baaf6287b451783b5",
"count": 3,
}),
) )
def __init__(self, match): def __init__(self, match):
KemonopartyExtractor.__init__(self, match) KemonopartyExtractor.__init__(self, match)
self.favorites = (text.parse_query(match.group(2)).get("type") or self.favorites = (text.parse_query(match.group(3)).get("type") or
self.config("favorites") or self.config("favorites") or
"artist") "artist")

View File

@ -0,0 +1,161 @@
# -*- coding: utf-8 -*-
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://lensdump.com/"""
from .common import GalleryExtractor, Extractor, Message
from .. import text, util
BASE_PATTERN = r"(?:https?://)?lensdump\.com"
class LensdumpBase():
"""Base class for lensdump extractors"""
category = "lensdump"
root = "https://lensdump.com"
def nodes(self, page=None):
if page is None:
page = self.request(self.url).text
# go through all pages starting from the oldest
page_url = text.urljoin(self.root, text.extr(
text.extr(page, ' id="list-most-oldest-link"', '>'),
'href="', '"'))
while page_url is not None:
if page_url == self.url:
current_page = page
else:
current_page = self.request(page_url).text
for node in text.extract_iter(
current_page, ' class="list-item ', '>'):
yield node
# find url of next page
page_url = text.extr(
text.extr(current_page, ' data-pagination="next"', '>'),
'href="', '"')
if page_url is not None and len(page_url) > 0:
page_url = text.urljoin(self.root, page_url)
else:
page_url = None
class LensdumpAlbumExtractor(LensdumpBase, GalleryExtractor):
subcategory = "album"
pattern = BASE_PATTERN + r"/(?:((?!\w+/albums|a/|i/)\w+)|a/(\w+))"
test = (
("https://lensdump.com/a/1IhJr", {
"pattern": r"https://[abcd]\.l3n\.co/i/tq\w{4}\.png",
"keyword": {
"extension": "png",
"name": str,
"num": int,
"title": str,
"url": str,
"width": int,
},
}),
)
def __init__(self, match):
GalleryExtractor.__init__(self, match, match.string)
self.gallery_id = match.group(1) or match.group(2)
def metadata(self, page):
return {
"gallery_id": self.gallery_id,
"title": text.unescape(text.extr(
page, 'property="og:title" content="', '"').strip())
}
def images(self, page):
for node in self.nodes(page):
# get urls and filenames of images in current page
json_data = util.json_loads(text.unquote(
text.extr(node, "data-object='", "'") or
text.extr(node, 'data-object="', '"')))
image_id = json_data.get('name')
image_url = json_data.get('url')
image_title = json_data.get('title')
if image_title is not None:
image_title = text.unescape(image_title)
yield (image_url, {
'id': image_id,
'url': image_url,
'title': image_title,
'name': json_data.get('filename'),
'filename': image_id,
'extension': json_data.get('extension'),
'height': text.parse_int(json_data.get('height')),
'width': text.parse_int(json_data.get('width')),
})
class LensdumpAlbumsExtractor(LensdumpBase, Extractor):
"""Extractor for album list from lensdump.com"""
subcategory = "albums"
pattern = BASE_PATTERN + r"/\w+/albums"
test = ("https://lensdump.com/vstar925/albums",)
def items(self):
for node in self.nodes():
album_url = text.urljoin(self.root, text.extr(
node, 'data-url-short="', '"'))
yield Message.Queue, album_url, {
"_extractor": LensdumpAlbumExtractor}
class LensdumpImageExtractor(LensdumpBase, Extractor):
"""Extractor for individual images on lensdump.com"""
subcategory = "image"
filename_fmt = "{category}_{id}{title:?_//}.{extension}"
directory_fmt = ("{category}",)
archive_fmt = "{id}"
pattern = BASE_PATTERN + r"/i/(\w+)"
test = (
("https://lensdump.com/i/tyoAyM", {
"pattern": r"https://c\.l3n\.co/i/tyoAyM\.webp",
"content": "1aa749ed2c0cf679ec8e1df60068edaf3875de46",
"keyword": {
"date": "dt:2022-08-01 08:24:28",
"extension": "webp",
"filename": "tyoAyM",
"height": 400,
"id": "tyoAyM",
"title": "MYOBI clovis bookcaseset",
"url": "https://c.l3n.co/i/tyoAyM.webp",
"width": 620,
},
}),
)
def __init__(self, match):
Extractor.__init__(self, match)
self.key = match.group(1)
def items(self):
url = "{}/i/{}".format(self.root, self.key)
extr = text.extract_from(self.request(url).text)
data = {
"id" : self.key,
"title" : text.unescape(extr(
'property="og:title" content="', '"')),
"url" : extr(
'property="og:image" content="', '"'),
"width" : text.parse_int(extr(
'property="image:width" content="', '"')),
"height": text.parse_int(extr(
'property="image:height" content="', '"')),
"date" : text.parse_datetime(extr(
'<span title="', '"'), "%Y-%m-%d %H:%M:%S"),
}
text.nameext_from_url(data["url"], data)
yield Message.Directory, data
yield Message.Url, data["url"], data

View File

@ -1,73 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2019-2020 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://www.lineblog.me/"""
from .livedoor import LivedoorBlogExtractor, LivedoorPostExtractor
from .. import text
class LineblogBase():
"""Base class for lineblog extractors"""
category = "lineblog"
root = "https://lineblog.me"
def _images(self, post):
imgs = []
body = post.pop("body")
for num, img in enumerate(text.extract_iter(body, "<img ", ">"), 1):
src = text.extr(img, 'src="', '"')
alt = text.extr(img, 'alt="', '"')
if not src:
continue
if src.startswith("https://obs.line-scdn.") and src.count("/") > 3:
src = src.rpartition("/")[0]
imgs.append(text.nameext_from_url(alt or src, {
"url" : src,
"num" : num,
"hash": src.rpartition("/")[2],
"post": post,
}))
return imgs
class LineblogBlogExtractor(LineblogBase, LivedoorBlogExtractor):
"""Extractor for a user's blog on lineblog.me"""
pattern = r"(?:https?://)?lineblog\.me/(\w+)/?(?:$|[?#])"
test = ("https://lineblog.me/mamoru_miyano/", {
"range": "1-20",
"count": 20,
"pattern": r"https://obs.line-scdn.net/[\w-]+$",
"keyword": {
"post": {
"categories" : tuple,
"date" : "type:datetime",
"description": str,
"id" : int,
"tags" : list,
"title" : str,
"user" : "mamoru_miyano"
},
"filename": str,
"hash" : r"re:\w{32,}",
"num" : int,
},
})
class LineblogPostExtractor(LineblogBase, LivedoorPostExtractor):
"""Extractor for blog posts on lineblog.me"""
pattern = r"(?:https?://)?lineblog\.me/(\w+)/archives/(\d+)"
test = ("https://lineblog.me/mamoru_miyano/archives/1919150.html", {
"url": "24afeb4044c554f80c374b52bf8109c6f1c0c757",
"keyword": "76a38e2c0074926bd3362f66f9fc0e6c41591dcb",
})

View File

@ -46,9 +46,10 @@ class LolisafeAlbumExtractor(LolisafeExtractor):
LolisafeExtractor.__init__(self, match) LolisafeExtractor.__init__(self, match)
self.album_id = match.group(match.lastindex) self.album_id = match.group(match.lastindex)
def _init(self):
domain = self.config("domain") domain = self.config("domain")
if domain == "auto": if domain == "auto":
self.root = text.root_from_url(match.group(0)) self.root = text.root_from_url(self.url)
elif domain: elif domain:
self.root = text.ensure_http_scheme(domain) self.root = text.ensure_http_scheme(domain)

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2016-2022 Mike Fährmann # Copyright 2016-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -15,7 +15,7 @@ from .. import text, exception
class LusciousExtractor(Extractor): class LusciousExtractor(Extractor):
"""Base class for luscious extractors""" """Base class for luscious extractors"""
category = "luscious" category = "luscious"
cookiedomain = ".luscious.net" cookies_domain = ".luscious.net"
root = "https://members.luscious.net" root = "https://members.luscious.net"
def _graphql(self, op, variables, query): def _graphql(self, op, variables, query):
@ -118,6 +118,8 @@ class LusciousAlbumExtractor(LusciousExtractor):
def __init__(self, match): def __init__(self, match):
LusciousExtractor.__init__(self, match) LusciousExtractor.__init__(self, match)
self.album_id = match.group(1) self.album_id = match.group(1)
def _init(self):
self.gif = self.config("gif", False) self.gif = self.config("gif", False)
def items(self): def items(self):

View File

@ -30,9 +30,11 @@ class MangadexExtractor(Extractor):
def __init__(self, match): def __init__(self, match):
Extractor.__init__(self, match) Extractor.__init__(self, match)
self.uuid = match.group(1)
def _init(self):
self.session.headers["User-Agent"] = util.USERAGENT self.session.headers["User-Agent"] = util.USERAGENT
self.api = MangadexAPI(self) self.api = MangadexAPI(self)
self.uuid = match.group(1)
def items(self): def items(self):
for chapter in self.chapters(): for chapter in self.chapters():
@ -85,6 +87,10 @@ class MangadexExtractor(Extractor):
data["group"] = [group["attributes"]["name"] data["group"] = [group["attributes"]["name"]
for group in relationships["scanlation_group"]] for group in relationships["scanlation_group"]]
data["status"] = mattributes["status"]
data["tags"] = [tag["attributes"]["name"]["en"]
for tag in mattributes["tags"]]
return data return data
@ -94,13 +100,13 @@ class MangadexChapterExtractor(MangadexExtractor):
pattern = BASE_PATTERN + r"/chapter/([0-9a-f-]+)" pattern = BASE_PATTERN + r"/chapter/([0-9a-f-]+)"
test = ( test = (
("https://mangadex.org/chapter/f946ac53-0b71-4b5d-aeb2-7931b13c4aaa", { ("https://mangadex.org/chapter/f946ac53-0b71-4b5d-aeb2-7931b13c4aaa", {
"keyword": "86fb262cf767dac6d965cd904ad499adba466404", "keyword": "e86128a79ebe7201b648f1caa828496a2878dc8f",
# "content": "50383a4c15124682057b197d40261641a98db514", # "content": "50383a4c15124682057b197d40261641a98db514",
}), }),
# oneshot # oneshot
("https://mangadex.org/chapter/61a88817-9c29-4281-bdf1-77b3c1be9831", { ("https://mangadex.org/chapter/61a88817-9c29-4281-bdf1-77b3c1be9831", {
"count": 64, "count": 64,
"keyword": "6abcbe1e24eeb1049dc931958853cd767ee483fb", "keyword": "d11ed057a919854696853362be35fc0ba7dded4c",
}), }),
# MANGA Plus (#1154) # MANGA Plus (#1154)
("https://mangadex.org/chapter/74149a55-e7c4-44ea-8a37-98e879c1096f", { ("https://mangadex.org/chapter/74149a55-e7c4-44ea-8a37-98e879c1096f", {
@ -144,6 +150,7 @@ class MangadexMangaExtractor(MangadexExtractor):
pattern = BASE_PATTERN + r"/(?:title|manga)/(?!feed$)([0-9a-f-]+)" pattern = BASE_PATTERN + r"/(?:title|manga)/(?!feed$)([0-9a-f-]+)"
test = ( test = (
("https://mangadex.org/title/f90c4398-8aad-4f51-8a1f-024ca09fdcbc", { ("https://mangadex.org/title/f90c4398-8aad-4f51-8a1f-024ca09fdcbc", {
"count": ">= 5",
"keyword": { "keyword": {
"manga" : "Souten no Koumori", "manga" : "Souten no Koumori",
"manga_id": "f90c4398-8aad-4f51-8a1f-024ca09fdcbc", "manga_id": "f90c4398-8aad-4f51-8a1f-024ca09fdcbc",
@ -157,6 +164,19 @@ class MangadexMangaExtractor(MangadexExtractor):
"language": str, "language": str,
"artist" : ["Arakawa Hiromu"], "artist" : ["Arakawa Hiromu"],
"author" : ["Arakawa Hiromu"], "author" : ["Arakawa Hiromu"],
"status" : "completed",
"tags" : ["Oneshot", "Historical", "Action",
"Martial Arts", "Drama", "Tragedy"],
},
}),
# multiple values for 'lang' (#4093)
("https://mangadex.org/title/f90c4398-8aad-4f51-8a1f-024ca09fdcbc", {
"options": (("lang", "fr,it"),),
"count": 2,
"keyword": {
"manga" : "Souten no Koumori",
"lang" : "re:fr|it",
"language": "re:French|Italian",
}, },
}), }),
("https://mangadex.cc/manga/d0c88e3b-ea64-4e07-9841-c1d2ac982f4a/", { ("https://mangadex.cc/manga/d0c88e3b-ea64-4e07-9841-c1d2ac982f4a/", {
@ -186,13 +206,16 @@ class MangadexFeedExtractor(MangadexExtractor):
class MangadexAPI(): class MangadexAPI():
"""Interface for the MangaDex API v5""" """Interface for the MangaDex API v5
https://api.mangadex.org/docs/
"""
def __init__(self, extr): def __init__(self, extr):
self.extractor = extr self.extractor = extr
self.headers = {} self.headers = {}
self.username, self.password = self.extractor._get_auth_info() self.username, self.password = extr._get_auth_info()
if not self.username: if not self.username:
self.authenticate = util.noop self.authenticate = util.noop
@ -278,9 +301,13 @@ class MangadexAPI():
if ratings is None: if ratings is None:
ratings = ("safe", "suggestive", "erotica", "pornographic") ratings = ("safe", "suggestive", "erotica", "pornographic")
lang = config("lang")
if isinstance(lang, str) and "," in lang:
lang = lang.split(",")
params["contentRating[]"] = ratings params["contentRating[]"] = ratings
params["translatedLanguage[]"] = lang
params["includes[]"] = ("scanlation_group",) params["includes[]"] = ("scanlation_group",)
params["translatedLanguage[]"] = config("lang")
params["offset"] = 0 params["offset"] = 0
api_params = config("api-parameters") api_params = config("api-parameters")

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2017-2022 Mike Fährmann # Copyright 2017-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -33,6 +33,8 @@ class MangafoxChapterExtractor(ChapterExtractor):
base, self.cstr, self.volume, self.chapter, self.minor = match.groups() base, self.cstr, self.volume, self.chapter, self.minor = match.groups()
self.urlbase = self.root + base self.urlbase = self.root + base
ChapterExtractor.__init__(self, match, self.urlbase + "/1.html") ChapterExtractor.__init__(self, match, self.urlbase + "/1.html")
def _init(self):
self.session.headers["Referer"] = self.root + "/" self.session.headers["Referer"] = self.root + "/"
def metadata(self, page): def metadata(self, page):

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2015-2022 Mike Fährmann # Copyright 2015-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -42,6 +42,8 @@ class MangahereChapterExtractor(MangahereBase, ChapterExtractor):
self.part, self.volume, self.chapter = match.groups() self.part, self.volume, self.chapter = match.groups()
url = self.url_fmt.format(self.part, 1) url = self.url_fmt.format(self.part, 1)
ChapterExtractor.__init__(self, match, url) ChapterExtractor.__init__(self, match, url)
def _init(self):
self.session.headers["Referer"] = self.root_mobile + "/" self.session.headers["Referer"] = self.root_mobile + "/"
def metadata(self, page): def metadata(self, page):
@ -112,9 +114,8 @@ class MangahereMangaExtractor(MangahereBase, MangaExtractor):
("https://m.mangahere.co/manga/aria/"), ("https://m.mangahere.co/manga/aria/"),
) )
def __init__(self, match): def _init(self):
MangaExtractor.__init__(self, match) self.cookies.set("isAdult", "1", domain="www.mangahere.cc")
self.session.cookies.set("isAdult", "1", domain="www.mangahere.cc")
def chapters(self, page): def chapters(self, page):
results = [] results = []

View File

@ -1,7 +1,7 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2020 Jake Mannens # Copyright 2020 Jake Mannens
# Copyright 2021-2022 Mike Fährmann # Copyright 2021-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -39,7 +39,9 @@ class MangakakalotChapterExtractor(MangakakalotBase, ChapterExtractor):
def __init__(self, match): def __init__(self, match):
self.path = match.group(1) self.path = match.group(1)
ChapterExtractor.__init__(self, match, self.root + self.path) ChapterExtractor.__init__(self, match, self.root + self.path)
self.session.headers['Referer'] = self.root
def _init(self):
self.session.headers['Referer'] = self.root + "/"
def metadata(self, page): def metadata(self, page):
_ , pos = text.extract(page, '<span itemprop="title">', '<') _ , pos = text.extract(page, '<span itemprop="title">', '<')

View File

@ -16,21 +16,28 @@ BASE_PATTERN = r"(?:https?://)?((?:chap|read|www\.|m\.)?mangan(?:at|el)o\.com)"
class ManganeloBase(): class ManganeloBase():
category = "manganelo" category = "manganelo"
root = "https://chapmanganato.com" root = "https://chapmanganato.com"
_match_chapter = None
def __init__(self, match): def __init__(self, match):
domain, path = match.groups() domain, path = match.groups()
super().__init__(match, "https://" + domain + path) super().__init__(match, "https://" + domain + path)
self.session.headers['Referer'] = self.root
self._match_chapter = re.compile( def _init(self):
r"(?:[Vv]ol\.?\s*(\d+)\s?)?" self.session.headers['Referer'] = self.root + "/"
r"[Cc]hapter\s*([^:]+)"
r"(?::\s*(.+))?").match if self._match_chapter is None:
ManganeloBase._match_chapter = re.compile(
r"(?:[Vv]ol\.?\s*(\d+)\s?)?"
r"[Cc]hapter\s*(\d+)([^:]*)"
r"(?::\s*(.+))?").match
def _parse_chapter(self, info, manga, author, date=None): def _parse_chapter(self, info, manga, author, date=None):
match = self._match_chapter(info) match = self._match_chapter(info)
volume, chapter, title = match.groups() if match else ("", "", info) if match:
chapter, sep, minor = chapter.partition(".") volume, chapter, minor, title = match.groups()
else:
volume = chapter = minor = ""
title = info
return { return {
"manga" : manga, "manga" : manga,
@ -39,7 +46,7 @@ class ManganeloBase():
"title" : text.unescape(title) if title else "", "title" : text.unescape(title) if title else "",
"volume" : text.parse_int(volume), "volume" : text.parse_int(volume),
"chapter" : text.parse_int(chapter), "chapter" : text.parse_int(chapter),
"chapter_minor": sep + minor, "chapter_minor": minor,
"lang" : "en", "lang" : "en",
"language" : "English", "language" : "English",
} }
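A quick demo of the chapter-title pattern introduced above (the input string is illustrative):

import re

match_chapter = re.compile(
    r"(?:[Vv]ol\.?\s*(\d+)\s?)?"
    r"[Cc]hapter\s*(\d+)([^:]*)"
    r"(?::\s*(.+))?").match

print(match_chapter("Vol.2 Chapter 8-1: Title").groups())
# ('2', '8', '-1', 'Title')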
@ -61,6 +68,10 @@ class ManganeloChapterExtractor(ManganeloBase, ChapterExtractor):
"keyword": "06e01fa9b3fc9b5b954c0d4a98f0153b40922ded", "keyword": "06e01fa9b3fc9b5b954c0d4a98f0153b40922ded",
"count": 45, "count": 45,
}), }),
("https://chapmanganato.com/manga-no991297/chapter-8", {
"keyword": {"chapter": 8, "chapter_minor": "-1"},
"count": 20,
}),
("https://readmanganato.com/manga-gn983696/chapter-23"), ("https://readmanganato.com/manga-gn983696/chapter-23"),
("https://manganelo.com/chapter/gamers/chapter_15"), ("https://manganelo.com/chapter/gamers/chapter_15"),
("https://manganelo.com/chapter/gq921227/chapter_23"), ("https://manganelo.com/chapter/gq921227/chapter_23"),

View File

@ -8,155 +8,464 @@
"""Extractors for https://mangapark.net/""" """Extractors for https://mangapark.net/"""
from .common import ChapterExtractor, MangaExtractor from .common import ChapterExtractor, Extractor, Message
from .. import text, util, exception from .. import text, util, exception
import re import re
BASE_PATTERN = r"(?:https?://)?(?:www\.)?mangapark\.(?:net|com|org|io|me)"
class MangaparkBase(): class MangaparkBase():
"""Base class for mangapark extractors""" """Base class for mangapark extractors"""
category = "mangapark" category = "mangapark"
root_fmt = "https://v2.mangapark.{}" _match_title = None
browser = "firefox"
@staticmethod def _parse_chapter_title(self, title):
def parse_chapter_path(path, data): if not self._match_title:
"""Get volume/chapter information from url-path of a chapter""" MangaparkBase._match_title = re.compile(
data["volume"], data["chapter_minor"] = 0, "" r"(?i)"
for part in path.split("/")[1:]: r"(?:vol(?:\.|ume)?\s*(\d+)\s*)?"
key, value = part[0], part[1:] r"ch(?:\.|apter)?\s*(\d+)([^\s:]*)"
if key == "c": r"(?:\s*:\s*(.*))?"
chapter, dot, minor = value.partition(".") ).match
data["chapter"] = text.parse_int(chapter) match = self._match_title(title)
data["chapter_minor"] = dot + minor return match.groups() if match else (0, 0, "", "")
elif key == "i":
data["chapter_id"] = text.parse_int(value)
elif key == "v":
data["volume"] = text.parse_int(value)
elif key == "s":
data["stream"] = text.parse_int(value)
elif key == "e":
data["chapter_minor"] = "v" + value
@staticmethod
def parse_chapter_title(title, data):
match = re.search(r"(?i)(?:vol(?:ume)?[ .]*(\d+) )?"
r"ch(?:apter)?[ .]*(\d+)(\.\w+)?", title)
if match:
vol, ch, data["chapter_minor"] = match.groups()
data["volume"] = text.parse_int(vol)
data["chapter"] = text.parse_int(ch)
class MangaparkChapterExtractor(MangaparkBase, ChapterExtractor): class MangaparkChapterExtractor(MangaparkBase, ChapterExtractor):
"""Extractor for manga-chapters from mangapark.net""" """Extractor for manga-chapters from mangapark.net"""
pattern = (r"(?:https?://)?(?:www\.|v2\.)?mangapark\.(me|net|com)" pattern = BASE_PATTERN + r"/title/[^/?#]+/(\d+)"
r"/manga/([^?#]+/i\d+)")
test = ( test = (
("https://mangapark.net/manga/gosu/i811653/c055/1", { ("https://mangapark.net/title/114972-aria/6710214-en-ch.60.2", {
"count": 50, "count": 70,
"keyword": "db1ed9af4f972756a25dbfa5af69a8f155b043ff", "pattern": r"https://[\w-]+\.mpcdn\.org/comic/2002/e67"
r"/61e29278a583b9227964076e/\d+_\d+_\d+_\d+\.jpeg"
r"\?acc=[^&#]+&exp=\d+",
"keyword": {
"artist": [],
"author": ["Amano Kozue"],
"chapter": 60,
"chapter_id": 6710214,
"chapter_minor": ".2",
"count": 70,
"date": "dt:2022-01-15 09:25:03",
"extension": "jpeg",
"filename": str,
"genre": ["adventure", "comedy", "drama", "sci_fi",
"shounen", "slice_of_life"],
"lang": "en",
"language": "English",
"manga": "Aria",
"manga_id": 114972,
"page": int,
"source": "Koala",
"title": "Special Navigation - Aquaria Ii",
"volume": 12,
},
}), }),
(("https://mangapark.net/manga" ("https://mangapark.com/title/114972-aria/6710214-en-ch.60.2"),
"/ad-astra-per-aspera-hata-kenjirou/i662051/c001.2/1"), { ("https://mangapark.org/title/114972-aria/6710214-en-ch.60.2"),
"count": 40, ("https://mangapark.io/title/114972-aria/6710214-en-ch.60.2"),
"keyword": "2bb3a8f426383ea13f17ff5582f3070d096d30ac", ("https://mangapark.me/title/114972-aria/6710214-en-ch.60.2"),
}),
(("https://mangapark.net/manga"
"/gekkan-shoujo-nozaki-kun/i2067426/v7/c70/1"), {
"count": 15,
"keyword": "edc14993c4752cee3a76e09b2f024d40d854bfd1",
}),
("https://mangapark.me/manga/gosu/i811615/c55/1"),
("https://mangapark.com/manga/gosu/i811615/c55/1"),
) )
def __init__(self, match): def __init__(self, match):
tld, self.path = match.groups() self.root = text.root_from_url(match.group(0))
self.root = self.root_fmt.format(tld) url = "{}/title/_/{}".format(self.root, match.group(1))
url = "{}/manga/{}?zoom=2".format(self.root, self.path)
ChapterExtractor.__init__(self, match, url) ChapterExtractor.__init__(self, match, url)
def metadata(self, page): def metadata(self, page):
data = text.extract_all(page, ( data = util.json_loads(text.extr(
("manga_id" , "var _manga_id = '", "'"), page, 'id="__NEXT_DATA__" type="application/json">', '<'))
("chapter_id", "var _book_id = '", "'"), chapter = (data["props"]["pageProps"]["dehydratedState"]
("stream" , "var _stream = '", "'"), ["queries"][0]["state"]["data"]["data"])
("path" , "var _book_link = '", "'"), manga = chapter["comicNode"]["data"]
("manga" , "<h2>", "</h2>"), source = chapter["sourceNode"]["data"]
("title" , "</a>", "<"),
), values={"lang": "en", "language": "English"})[0]
if not data["path"]: self._urls = chapter["imageSet"]["httpLis"]
raise exception.NotFoundError("chapter") self._params = chapter["imageSet"]["wordLis"]
vol, ch, minor, title = self._parse_chapter_title(chapter["dname"])
self.parse_chapter_path(data["path"], data) return {
if "chapter" not in data: "manga" : manga["name"],
self.parse_chapter_title(data["title"], data) "manga_id" : manga["id"],
"artist" : source["artists"],
data["manga"], _, data["type"] = data["manga"].rpartition(" ") "author" : source["authors"],
data["manga"] = text.unescape(data["manga"]) "genre" : source["genres"],
data["title"] = data["title"].partition(": ")[2] "volume" : text.parse_int(vol),
for key in ("manga_id", "chapter_id", "stream"): "chapter" : text.parse_int(ch),
data[key] = text.parse_int(data[key]) "chapter_minor": minor,
"chapter_id": chapter["id"],
return data "title" : chapter["title"] or title or "",
"lang" : chapter["lang"],
"language" : util.code_to_language(chapter["lang"]),
"source" : source["srcTitle"],
"source_id" : source["id"],
"date" : text.parse_timestamp(chapter["dateCreate"] // 1000),
}
def images(self, page): def images(self, page):
data = util.json_loads(text.extr(page, "var _load_pages =", ";"))
return [ return [
(text.urljoin(self.root, item["u"]), { (url + "?" + params, None)
"width": text.parse_int(item["w"]), for url, params in zip(self._urls, self._params)
"height": text.parse_int(item["h"]),
})
for item in data
] ]
class MangaparkMangaExtractor(MangaparkBase, MangaExtractor): class MangaparkMangaExtractor(MangaparkBase, Extractor):
"""Extractor for manga from mangapark.net""" """Extractor for manga from mangapark.net"""
chapterclass = MangaparkChapterExtractor subcategory = "manga"
pattern = (r"(?:https?://)?(?:www\.|v2\.)?mangapark\.(me|net|com)" pattern = BASE_PATTERN + r"/title/(\d+)(?:-[^/?#]*)?/?$"
r"(/manga/[^/?#]+)/?$")
test = ( test = (
("https://mangapark.net/manga/aria", { ("https://mangapark.net/title/114972-aria", {
"url": "51c6d82aed5c3c78e0d3f980b09a998e6a2a83ee", "count": 141,
"keyword": "cabc60cf2efa82749d27ac92c495945961e4b73c", "pattern": MangaparkChapterExtractor.pattern,
"keyword": {
"chapter": int,
"chapter_id": int,
"chapter_minor": str,
"date": "type:datetime",
"lang": "en",
"language": "English",
"manga_id": 114972,
"source": "re:Horse|Koala",
"source_id": int,
"title": str,
"volume": int,
},
}), }),
("https://mangapark.me/manga/aria"), # 'source' option
("https://mangapark.com/manga/aria"), ("https://mangapark.net/title/114972-aria", {
"options": (("source", "koala"),),
"count": 70,
"pattern": MangaparkChapterExtractor.pattern,
"keyword": {
"source": "Koala",
"source_id": 15150116,
},
}),
("https://mangapark.com/title/114972-"),
("https://mangapark.com/title/114972"),
("https://mangapark.com/title/114972-aria"),
("https://mangapark.org/title/114972-aria"),
("https://mangapark.io/title/114972-aria"),
("https://mangapark.me/title/114972-aria"),
) )
def __init__(self, match): def __init__(self, match):
self.root = self.root_fmt.format(match.group(1)) self.root = text.root_from_url(match.group(0))
MangaExtractor.__init__(self, match, self.root + match.group(2)) self.manga_id = int(match.group(1))
Extractor.__init__(self, match)
def chapters(self, page): def items(self):
results = [] for chapter in self.chapters():
data = {"lang": "en", "language": "English"} chapter = chapter["data"]
data["manga"] = text.unescape( url = self.root + chapter["urlPath"]
text.extr(page, '<title>', ' Manga - '))
for stream in page.split('<div id="stream_')[1:]: vol, ch, minor, title = self._parse_chapter_title(chapter["dname"])
data["stream"] = text.parse_int(text.extr(stream, '', '"')) data = {
"manga_id" : self.manga_id,
"volume" : text.parse_int(vol),
"chapter" : text.parse_int(ch),
"chapter_minor": minor,
"chapter_id": chapter["id"],
"title" : chapter["title"] or title or "",
"lang" : chapter["lang"],
"language" : util.code_to_language(chapter["lang"]),
"source" : chapter["srcTitle"],
"source_id" : chapter["sourceId"],
"date" : text.parse_timestamp(
chapter["dateCreate"] // 1000),
"_extractor": MangaparkChapterExtractor,
}
yield Message.Queue, url, data
for chapter in text.extract_iter(stream, '<li ', '</li>'): def chapters(self):
path , pos = text.extract(chapter, 'href="', '"') source = self.config("source")
title1, pos = text.extract(chapter, '>', '<', pos) if not source:
title2, pos = text.extract(chapter, '>: </span>', '<', pos) return self.chapters_all()
count , pos = text.extract(chapter, ' of ', ' ', pos)
self.parse_chapter_path(path[8:], data) source_id = self._select_source(source)
if "chapter" not in data: self.log.debug("Requesting chapters for source_id %s", source_id)
self.parse_chapter_title(title1, data) return self.chapters_source(source_id)
if title2: def chapters_all(self):
data["title"] = title2.strip() pnum = 0
else: variables = {
data["title"] = title1.partition(":")[2].strip() "select": {
"comicId": self.manga_id,
"range" : None,
"isAsc" : not self.config("chapter-reverse"),
}
}
data["count"] = text.parse_int(count) while True:
results.append((self.root + path, data.copy())) data = self._request_graphql(
data.pop("chapter", None) "get_content_comicChapterRangeList", variables)
return results for item in data["items"]:
yield from item["chapterNodes"]
if not pnum:
pager = data["pager"]
pnum += 1
try:
variables["select"]["range"] = pager[pnum]
except IndexError:
return
def chapters_source(self, source_id):
variables = {
"sourceId": source_id,
}
chapters = self._request_graphql(
"get_content_source_chapterList", variables)
if self.config("chapter-reverse"):
chapters.reverse()
return chapters
def _select_source(self, source):
if isinstance(source, int):
return source
group, _, lang = source.partition(":")
group = group.lower()
variables = {
"comicId" : self.manga_id,
"dbStatuss" : ["normal"],
"haveChapter": True,
}
for item in self._request_graphql(
"get_content_comic_sources", variables):
data = item["data"]
if (not group or data["srcTitle"].lower() == group) and (
not lang or data["lang"] == lang):
return data["id"]
raise exception.StopExtraction(
"'%s' does not match any available source", source)
def _request_graphql(self, opname, variables):
url = self.root + "/apo/"
data = {
"query" : QUERIES[opname],
"variables" : util.json_dumps(variables),
"operationName": opname,
}
return self.request(
url, method="POST", json=data).json()["data"][opname]
QUERIES = {
"get_content_comicChapterRangeList": """
query get_content_comicChapterRangeList($select: Content_ComicChapterRangeList_Select) {
get_content_comicChapterRangeList(
select: $select
) {
reqRange{x y}
missing
pager {x y}
items{
serial
chapterNodes {
id
data {
id
sourceId
dbStatus
isNormal
isHidden
isDeleted
isFinal
dateCreate
datePublic
dateModify
lang
volume
serial
dname
title
urlPath
srcTitle srcColor
count_images
stat_count_post_child
stat_count_post_reply
stat_count_views_login
stat_count_views_guest
userId
userNode {
id
data {
id
name
uniq
avatarUrl
urlPath
verified
deleted
banned
dateCreate
dateOnline
stat_count_chapters_normal
stat_count_chapters_others
is_adm is_mod is_vip is_upr
}
}
disqusId
}
sser_read
}
}
}
}
""",
"get_content_source_chapterList": """
query get_content_source_chapterList($sourceId: Int!) {
get_content_source_chapterList(
sourceId: $sourceId
) {
id
data {
id
sourceId
dbStatus
isNormal
isHidden
isDeleted
isFinal
dateCreate
datePublic
dateModify
lang
volume
serial
dname
title
urlPath
srcTitle srcColor
count_images
stat_count_post_child
stat_count_post_reply
stat_count_views_login
stat_count_views_guest
userId
userNode {
id
data {
id
name
uniq
avatarUrl
urlPath
verified
deleted
banned
dateCreate
dateOnline
stat_count_chapters_normal
stat_count_chapters_others
is_adm is_mod is_vip is_upr
}
}
disqusId
}
}
}
""",
"get_content_comic_sources": """
query get_content_comic_sources($comicId: Int!, $dbStatuss: [String] = [], $userId: Int, $haveChapter: Boolean, $sortFor: String) {
get_content_comic_sources(
comicId: $comicId
dbStatuss: $dbStatuss
userId: $userId
haveChapter: $haveChapter
sortFor: $sortFor
) {
id
data{
id
dbStatus
isNormal
isHidden
isDeleted
lang name altNames authors artists
release
genres summary{code} extraInfo{code}
urlCover600
urlCover300
urlCoverOri
srcTitle srcColor
chapterCount
chapterNode_last {
id
data {
dateCreate datePublic dateModify
volume serial
dname title
urlPath
userNode {
id data {uniq name}
}
}
}
}
}
}
""",
}
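A standalone sketch of the /apo/ GraphQL call pattern implemented by _request_graphql above (payload shape and endpoint as shown; the query text comes from the QUERIES table):

import json
import requests

def mangapark_graphql(root, opname, query, variables):
    payload = {
        "query": query,
        "variables": json.dumps(variables),
        "operationName": opname,
    }
    response = requests.post(root + "/apo/", json=payload)
    return response.json()["data"][opname]

# e.g. mangapark_graphql("https://mangapark.net",
#                        "get_content_source_chapterList",
#                        QUERIES["get_content_source_chapterList"],
#                        {"sourceId": 15150116})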

View File

@ -0,0 +1,192 @@
# -*- coding: utf-8 -*-
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://mangaread.org/"""
from .common import ChapterExtractor, MangaExtractor
from .. import text, exception
import re
class MangareadBase():
"""Base class for Mangaread extractors"""
category = "mangaread"
root = "https://www.mangaread.org"
@staticmethod
def parse_chapter_string(chapter_string, data):
match = re.match(
r"(?:(.+)\s*-\s*)?[Cc]hapter\s*(\d+)(\.\d+)?(?:\s*-\s*(.+))?",
text.unescape(chapter_string).strip())
manga, chapter, minor, title = match.groups()
manga = manga.strip() if manga else ""
data["manga"] = data.pop("manga", manga)
data["chapter"] = text.parse_int(chapter)
data["chapter_minor"] = minor or ""
data["title"] = title or ""
data["lang"] = "en"
data["language"] = "English"
class MangareadChapterExtractor(MangareadBase, ChapterExtractor):
"""Extractor for manga-chapters from mangaread.org"""
pattern = (r"(?:https?://)?(?:www\.)?mangaread\.org"
r"(/manga/[^/?#]+/[^/?#]+)")
test = (
("https://www.mangaread.org/manga/one-piece/chapter-1053-3/", {
"pattern": (r"https://www\.mangaread\.org/wp-content/uploads"
r"/WP-manga/data/manga_[^/]+/[^/]+/[^.]+\.\w+"),
"count": 11,
"keyword": {
"manga" : "One Piece",
"title" : "",
"chapter" : 1053,
"chapter_minor": ".3",
"tags" : ["Oda Eiichiro"],
"lang" : "en",
"language": "English",
}
}),
("https://www.mangaread.org/manga/one-piece/chapter-1000000/", {
"exception": exception.NotFoundError,
}),
(("https://www.mangaread.org"
"/manga/kanan-sama-wa-akumade-choroi/chapter-10/"), {
"pattern": (r"https://www\.mangaread\.org/wp-content/uploads"
r"/WP-manga/data/manga_[^/]+/[^/]+/[^.]+\.\w+"),
"count": 9,
"keyword": {
"manga" : "Kanan-sama wa Akumade Choroi",
"title" : "",
"chapter" : 10,
"chapter_minor": "",
"tags" : list,
"lang" : "en",
"language": "English",
}
}),
# 'Chapter146.5'
# ^^ no whitespace
("https://www.mangaread.org/manga/above-all-gods/chapter146-5/", {
"pattern": (r"https://www\.mangaread\.org/wp-content/uploads"
r"/WP-manga/data/manga_[^/]+/[^/]+/[^.]+\.\w+"),
"count": 6,
"keyword": {
"manga" : "Above All Gods",
"title" : "",
"chapter" : 146,
"chapter_minor": ".5",
"tags" : list,
"lang" : "en",
"language": "English",
}
}),
)
def metadata(self, page):
tags = text.extr(page, 'class="wp-manga-tags-list">', '</div>')
data = {"tags": list(text.split_html(tags)[::2])}
info = text.extr(page, '<h1 id="chapter-heading">', "</h1>")
if not info:
raise exception.NotFoundError("chapter")
self.parse_chapter_string(info, data)
return data
def images(self, page):
page = text.extr(
page, '<div class="reading-content">', '<div class="entry-header')
return [
(url.strip(), None)
for url in text.extract_iter(page, 'data-src="', '"')
]
class MangareadMangaExtractor(MangareadBase, MangaExtractor):
"""Extractor for manga from mangaread.org"""
chapterclass = MangareadChapterExtractor
pattern = r"(?:https?://)?(?:www\.)?mangaread\.org(/manga/[^/?#]+)/?$"
test = (
("https://www.mangaread.org/manga/kanan-sama-wa-akumade-choroi", {
"pattern": (r"https://www\.mangaread\.org/manga"
r"/kanan-sama-wa-akumade-choroi"
r"/chapter-\d+(-.+)?/"),
"count" : ">= 13",
"keyword": {
"manga" : "Kanan-sama wa Akumade Choroi",
"author" : ["nonco"],
"artist" : ["nonco"],
"type" : "Manga",
"genres" : ["Comedy", "Romance", "Shounen", "Supernatural"],
"rating" : float,
"release": 2022,
"status" : "OnGoing",
"lang" : "en",
"language" : "English",
"manga_alt" : list,
"description": str,
}
}),
("https://www.mangaread.org/manga/one-piece", {
"pattern": (r"https://www\.mangaread\.org/manga"
r"/one-piece/chapter-\d+(-.+)?/"),
"count" : ">= 1066",
"keyword": {
"manga" : "One Piece",
"author" : ["Oda Eiichiro"],
"artist" : ["Oda Eiichiro"],
"type" : "Manga",
"genres" : list,
"rating" : float,
"release": 1997,
"status" : "OnGoing",
"lang" : "en",
"language" : "English",
"manga_alt" : ["One Piece"],
"description": str,
}
}),
("https://www.mangaread.org/manga/doesnotexist", {
"exception": exception.HttpError,
}),
)
def chapters(self, page):
if 'class="error404' in page:
raise exception.NotFoundError("manga")
data = self.metadata(page)
result = []
for chapter in text.extract_iter(
page, '<li class="wp-manga-chapter', "</li>"):
url , pos = text.extract(chapter, '<a href="', '"')
info, _ = text.extract(chapter, ">", "</a>", pos)
self.parse_chapter_string(info, data)
result.append((url, data.copy()))
return result
def metadata(self, page):
extr = text.extract_from(text.extr(
page, 'class="summary_content">', 'class="manga-action"'))
return {
"manga" : text.extr(page, "<h1>", "</h1>").strip(),
"description": text.unescape(text.remove_html(text.extract(
page, ">", "</div>", page.index("summary__content"))[0])),
"rating" : text.parse_float(
extr('total_votes">', "</span>").strip()),
"manga_alt" : text.remove_html(
extr("Alternative </h5>\n</div>", "</div>")).split("; "),
"author" : list(text.extract_iter(
extr('class="author-content">', "</div>"), '"tag">', "</a>")),
"artist" : list(text.extract_iter(
extr('class="artist-content">', "</div>"), '"tag">', "</a>")),
"genres" : list(text.extract_iter(
extr('class="genres-content">', "</div>"), '"tag">', "</a>")),
"type" : text.remove_html(
extr("Type </h5>\n</div>", "</div>")),
"release" : text.parse_int(text.remove_html(
extr("Release </h5>\n</div>", "</div>"))),
"status" : text.remove_html(
extr("Status </h5>\n</div>", "</div>")),
}

View File

@ -90,10 +90,12 @@ class MangaseeChapterExtractor(MangaseeBase, ChapterExtractor):
self.category = "mangalife" self.category = "mangalife"
self.root = "https://manga4life.com" self.root = "https://manga4life.com"
ChapterExtractor.__init__(self, match, self.root + match.group(2)) ChapterExtractor.__init__(self, match, self.root + match.group(2))
def _init(self):
self.session.headers["Referer"] = self.gallery_url self.session.headers["Referer"] = self.gallery_url
domain = self.root.rpartition("/")[2] domain = self.root.rpartition("/")[2]
cookies = self.session.cookies cookies = self.cookies
if not cookies.get("PHPSESSID", domain=domain): if not cookies.get("PHPSESSID", domain=domain):
cookies.set("PHPSESSID", util.generate_token(13), domain=domain) cookies.set("PHPSESSID", util.generate_token(13), domain=domain)

View File

@@ -19,14 +19,14 @@ class MangoxoExtractor(Extractor):
"""Base class for mangoxo extractors"""
category = "mangoxo"
root = "https://www.mangoxo.com"
-cookiedomain = "www.mangoxo.com"
-cookienames = ("SESSION",)
+cookies_domain = "www.mangoxo.com"
+cookies_names = ("SESSION",)
_warning = True
def login(self):
username, password = self._get_auth_info()
if username:
-self._update_cookies(self._login_impl(username, password))
+self.cookies_update(self._login_impl(username, password))
elif MangoxoExtractor._warning:
MangoxoExtractor._warning = False
self.log.warning("Unauthenticated users cannot see "
@@ -51,7 +51,7 @@ class MangoxoExtractor(Extractor):
data = response.json()
if str(data.get("result")) != "1":
raise exception.AuthenticationError(data.get("msg"))
-return {"SESSION": self.session.cookies.get("SESSION")}
+return {"SESSION": self.cookies.get("SESSION")}
@staticmethod
def _sign_by_md5(username, password, token):


@@ -19,12 +19,14 @@ class MastodonExtractor(BaseExtractor):
directory_fmt = ("mastodon", "{instance}", "{account[username]}")
filename_fmt = "{category}_{id}_{media[id]}.{extension}"
archive_fmt = "{media[id]}"
-cookiedomain = None
+cookies_domain = None
def __init__(self, match):
BaseExtractor.__init__(self, match)
-self.instance = self.root.partition("://")[2]
self.item = match.group(match.lastindex)
+def _init(self):
+self.instance = self.root.partition("://")[2]
self.reblogs = self.config("reblogs", False)
self.replies = self.config("replies", True)


@@ -1,120 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2022 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://meme.museum/"""
from .common import Extractor, Message
from .. import text
class MememuseumExtractor(Extractor):
"""Base class for meme.museum extractors"""
basecategory = "booru"
category = "mememuseum"
filename_fmt = "{category}_{id}_{md5}.{extension}"
archive_fmt = "{id}"
root = "https://meme.museum"
def items(self):
data = self.metadata()
for post in self.posts():
url = post["file_url"]
for key in ("id", "width", "height"):
post[key] = text.parse_int(post[key])
post["tags"] = text.unquote(post["tags"])
post.update(data)
yield Message.Directory, post
yield Message.Url, url, text.nameext_from_url(url, post)
def metadata(self):
"""Return general metadata"""
return ()
def posts(self):
"""Return an iterable containing data of all relevant posts"""
return ()
class MememuseumTagExtractor(MememuseumExtractor):
"""Extractor for images from meme.museum by search-tags"""
subcategory = "tag"
directory_fmt = ("{category}", "{search_tags}")
pattern = r"(?:https?://)?meme\.museum/post/list/([^/?#]+)"
test = ("https://meme.museum/post/list/animated/1", {
"pattern": r"https://meme\.museum/_images/\w+/\d+%20-%20",
"count": ">= 30"
})
per_page = 25
def __init__(self, match):
MememuseumExtractor.__init__(self, match)
self.tags = text.unquote(match.group(1))
def metadata(self):
return {"search_tags": self.tags}
def posts(self):
pnum = 1
while True:
url = "{}/post/list/{}/{}".format(self.root, self.tags, pnum)
extr = text.extract_from(self.request(url).text)
while True:
mime = extr("data-mime='", "'")
if not mime:
break
pid = extr("data-post-id='", "'")
tags, dimensions, size = extr("title='", "'").split(" // ")
md5 = extr("/_thumbs/", "/")
width, _, height = dimensions.partition("x")
yield {
"file_url": "{}/_images/{}/{}%20-%20{}.{}".format(
self.root, md5, pid, text.quote(tags),
mime.rpartition("/")[2]),
"id": pid, "md5": md5, "tags": tags,
"width": width, "height": height,
"size": text.parse_bytes(size[:-1]),
}
if not extr(">Next<", ">"):
return
pnum += 1
class MememuseumPostExtractor(MememuseumExtractor):
"""Extractor for single images from meme.museum"""
subcategory = "post"
pattern = r"(?:https?://)?meme\.museum/post/view/(\d+)"
test = ("https://meme.museum/post/view/10243", {
"pattern": r"https://meme\.museum/_images/105febebcd5ca791ee332adc4997"
r"1f78/10243%20-%20g%20beard%20open_source%20richard_stallm"
r"an%20stallman%20tagme%20text\.jpg",
"keyword": "3c8009251480cf17248c08b2b194dc0c4d59580e",
"content": "45565f3f141fc960a8ae1168b80e718a494c52d2",
})
def __init__(self, match):
MememuseumExtractor.__init__(self, match)
self.post_id = match.group(1)
def posts(self):
url = "{}/post/view/{}".format(self.root, self.post_id)
extr = text.extract_from(self.request(url).text)
return ({
"id" : self.post_id,
"tags" : extr(": ", "<"),
"md5" : extr("/_thumbs/", "/"),
"file_url": self.root + extr("id='main_image' src='", "'"),
"width" : extr("data-width=", " ").strip("'\""),
"height" : extr("data-height=", " ").strip("'\""),
"size" : 0,
},)


@@ -7,7 +7,7 @@
"""Extractors for Misskey instances"""
from .common import BaseExtractor, Message
-from .. import text
+from .. import text, exception
class MisskeyExtractor(BaseExtractor):
@@ -19,14 +19,18 @@ class MisskeyExtractor(BaseExtractor):
def __init__(self, match):
BaseExtractor.__init__(self, match)
+self.item = match.group(match.lastindex)
+def _init(self):
self.api = MisskeyAPI(self)
self.instance = self.root.rpartition("://")[2]
-self.item = match.group(match.lastindex)
self.renotes = self.config("renotes", False)
self.replies = self.config("replies", True)
def items(self):
for note in self.notes():
+if "note" in note:
+note = note["note"]
files = note.pop("files") or []
renote = note.get("renote")
if renote:
@@ -68,7 +72,7 @@ BASE_PATTERN = MisskeyExtractor.update({
},
"lesbian.energy": {
"root": "https://lesbian.energy",
-"pattern": r"lesbian\.energy"
+"pattern": r"lesbian\.energy",
},
"sushi.ski": {
"root": "https://sushi.ski",
@@ -152,6 +156,21 @@ class MisskeyNoteExtractor(MisskeyExtractor):
return (self.api.notes_show(self.item),)
class MisskeyFavoriteExtractor(MisskeyExtractor):
"""Extractor for favorited notes"""
subcategory = "favorite"
pattern = BASE_PATTERN + r"/(?:my|api/i)/favorites"
test = (
("https://misskey.io/my/favorites"),
("https://misskey.io/api/i/favorites"),
("https://lesbian.energy/my/favorites"),
("https://sushi.ski/my/favorites"),
)
def notes(self):
return self.api.i_favorites()
class MisskeyAPI():
"""Interface for Misskey API
@@ -164,6 +183,7 @@ class MisskeyAPI():
self.root = extractor.root
self.extractor = extractor
self.headers = {"Content-Type": "application/json"}
+self.access_token = extractor.config("access-token")
def user_id_by_username(self, username):
endpoint = "/users/show"
@@ -187,6 +207,13 @@ class MisskeyAPI():
data = {"noteId": note_id}
return self._call(endpoint, data)
+def i_favorites(self):
+endpoint = "/i/favorites"
+if not self.access_token:
+raise exception.AuthenticationError()
+data = {"i": self.access_token}
+return self._pagination(endpoint, data)
def _call(self, endpoint, data):
url = self.root + "/api" + endpoint
return self.extractor.request(


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
-# Copyright 2020-2022 Mike Fährmann
+# Copyright 2020-2023 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
@@ -166,7 +166,7 @@ class MoebooruTagExtractor(MoebooruExtractor):
subcategory = "tag"
directory_fmt = ("{category}", "{search_tags}")
archive_fmt = "t_{search_tags}_{id}"
-pattern = BASE_PATTERN + r"/post\?(?:[^&#]*&)*tags=([^&#]+)"
+pattern = BASE_PATTERN + r"/post\?(?:[^&#]*&)*tags=([^&#]*)"
test = (
("https://yande.re/post?tags=ouzoku+armor", {
"content": "59201811c728096b2d95ce6896fd0009235fe683",
@@ -174,6 +174,8 @@ class MoebooruTagExtractor(MoebooruExtractor):
("https://konachan.com/post?tags=patata", {
"content": "838cfb815e31f48160855435655ddf7bfc4ecb8d",
}),
+# empty 'tags' (#4354)
+("https://konachan.com/post?tags="),
("https://konachan.net/post?tags=patata"),
("https://www.sakugabooru.com/post?tags=nichijou"),
("https://lolibooru.moe/post?tags=ruu_%28tksymkw%29"),


@@ -38,7 +38,9 @@ class MyhentaigalleryGalleryExtractor(GalleryExtractor):
self.gallery_id = match.group(1)
url = "{}/gallery/thumbnails/{}".format(self.root, self.gallery_id)
GalleryExtractor.__init__(self, match, url)
-self.session.headers["Referer"] = url
+def _init(self):
+self.session.headers["Referer"] = self.gallery_url
def metadata(self, page):
extr = text.extract_from(page)


@@ -1,12 +1,12 @@
# -*- coding: utf-8 -*-
-# Copyright 2018-2022 Mike Fährmann
+# Copyright 2018-2023 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
-"""Extract images from https://www.myportfolio.com/"""
+"""Extractors for https://www.myportfolio.com/"""
from .common import Extractor, Message
from .. import text, exception
@@ -21,7 +21,7 @@ class MyportfolioGalleryExtractor(Extractor):
archive_fmt = "{user}_{filename}"
pattern = (r"(?:myportfolio:(?:https?://)?([^/]+)|"
r"(?:https?://)?([\w-]+\.myportfolio\.com))"
-r"(/[^/?&#]+)?")
+r"(/[^/?#]+)?")
test = (
("https://andrewling.myportfolio.com/volvo-xc-90-hybrid", {
"url": "acea0690c76db0e5cf267648cefd86e921bc3499",


@@ -1,118 +0,0 @@
# -*- coding: utf-8 -*-
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
"""Extractors for https://nana.my.id/"""
from .common import GalleryExtractor, Extractor, Message
from .. import text, util, exception
class NanaGalleryExtractor(GalleryExtractor):
"""Extractor for image galleries from nana.my.id"""
category = "nana"
directory_fmt = ("{category}", "{title}")
pattern = r"(?:https?://)?nana\.my\.id/reader/([^/?#]+)"
test = (
(("https://nana.my.id/reader/"
"059f7de55a4297413bfbd432ce7d6e724dd42bae"), {
"pattern": r"https://nana\.my\.id/reader/"
r"\w+/image/page\?path=.*\.\w+",
"keyword": {
"title" : "Everybody Loves Shion",
"artist": "fuzui",
"tags" : list,
"count" : 29,
},
}),
(("https://nana.my.id/reader/"
"77c8712b67013e427923573379f5bafcc0c72e46"), {
"pattern": r"https://nana\.my\.id/reader/"
r"\w+/image/page\?path=.*\.\w+",
"keyword": {
"title" : "Lovey-Dovey With an Otaku-Friendly Gyaru",
"artist": "Sueyuu",
"tags" : ["Sueyuu"],
"count" : 58,
},
}),
)
def __init__(self, match):
self.gallery_id = match.group(1)
url = "https://nana.my.id/reader/" + self.gallery_id
GalleryExtractor.__init__(self, match, url)
def metadata(self, page):
title = text.unescape(
text.extr(page, '</a>&nbsp; ', '</div>'))
artist = text.unescape(text.extr(
page, '<title>', '</title>'))[len(title):-10]
tags = text.extr(page, 'Reader.tags = "', '"')
return {
"gallery_id": self.gallery_id,
"title" : title,
"artist" : artist[4:] if artist.startswith(" by ") else "",
"tags" : tags.split(", ") if tags else (),
"lang" : "en",
"language" : "English",
}
def images(self, page):
data = util.json_loads(text.extr(page, "Reader.pages = ", ".pages"))
return [
("https://nana.my.id" + image, None)
for image in data["pages"]
]
class NanaSearchExtractor(Extractor):
"""Extractor for nana search results"""
category = "nana"
subcategory = "search"
pattern = r"(?:https?://)?nana\.my\.id(?:/?\?([^#]+))"
test = (
('https://nana.my.id/?q=+"elf"&sort=desc', {
"pattern": NanaGalleryExtractor.pattern,
"range": "1-100",
"count": 100,
}),
("https://nana.my.id/?q=favorites%3A", {
"pattern": NanaGalleryExtractor.pattern,
"count": ">= 2",
}),
)
def __init__(self, match):
Extractor.__init__(self, match)
self.params = text.parse_query(match.group(1))
self.params["p"] = text.parse_int(self.params.get("p"), 1)
self.params["q"] = self.params.get("q") or ""
def items(self):
if "favorites:" in self.params["q"]:
favkey = self.config("favkey")
if not favkey:
raise exception.AuthenticationError(
"'Favorite key' not provided. "
"Please see 'https://nana.my.id/tutorial'")
self.session.cookies.set("favkey", favkey, domain="nana.my.id")
data = {"_extractor": NanaGalleryExtractor}
while True:
try:
page = self.request(
"https://nana.my.id", params=self.params).text
except exception.HttpError:
return
for gallery in text.extract_iter(
page, '<div class="id3">', '</div>'):
url = "https://nana.my.id" + text.extr(
gallery, '<a href="', '"')
yield Message.Queue, url, data
self.params["p"] += 1


@@ -91,7 +91,7 @@ class NaverwebtoonEpisodeExtractor(NaverwebtoonBase, GalleryExtractor):
return {
"title_id": self.title_id,
"episode" : self.episode,
-"comic" : extr("titleName: '", "'"),
+"comic" : extr('titleName: "', '"'),
"tags" : [t.strip() for t in text.extract_iter(
extr("tagList: [", "}],"), '"tagName":"', '"')],
"title" : extr('"subtitle":"', '"'),


@@ -21,13 +21,16 @@ class NewgroundsExtractor(Extractor):
filename_fmt = "{category}_{_index}_{title}.{extension}"
archive_fmt = "{_type}{_index}"
root = "https://www.newgrounds.com"
-cookiedomain = ".newgrounds.com"
-cookienames = ("NG_GG_username", "vmk1du5I8m")
+cookies_domain = ".newgrounds.com"
+cookies_names = ("NG_GG_username", "vmk1du5I8m")
+request_interval = 1.0
def __init__(self, match):
Extractor.__init__(self, match)
self.user = match.group(1)
self.user_root = "https://{}.newgrounds.com".format(self.user)
+def _init(self):
self.flash = self.config("flash", True)
fmt = self.config("format", "original")
@@ -71,11 +74,12 @@ class NewgroundsExtractor(Extractor):
"""Return general metadata"""
def login(self):
-if self._check_cookies(self.cookienames):
+if self.cookies_check(self.cookies_names):
return
username, password = self._get_auth_info()
if username:
-self._update_cookies(self._login_impl(username, password))
+self.cookies_update(self._login_impl(username, password))
@cache(maxage=360*24*3600, keyarg=1)
def _login_impl(self, username, password):
@@ -84,16 +88,17 @@ class NewgroundsExtractor(Extractor):
url = self.root + "/passport/"
response = self.request(url)
if response.history and response.url.endswith("/social"):
-return self.session.cookies
+return self.cookies
+page = response.text
headers = {"Origin": self.root, "Referer": url}
-url = text.urljoin(self.root, text.extr(
-response.text, 'action="', '"'))
+url = text.urljoin(self.root, text.extr(page, 'action="', '"'))
data = {
"username": username,
"password": password,
"remember": "1",
"login" : "1",
+"auth" : text.extr(page, 'name="auth" value="', '"'),
}
response = self.request(url, method="POST", headers=headers, data=data)
@@ -103,7 +108,7 @@ class NewgroundsExtractor(Extractor):
return {
cookie.name: cookie.value
for cookie in response.history[0].cookies
-if cookie.expires and cookie.domain == self.cookiedomain
+if cookie.expires and cookie.domain == self.cookies_domain
}
def extract_post(self, post_url):
@@ -514,6 +519,9 @@ class NewgroundsUserExtractor(NewgroundsExtractor):
}),
)
+def initialize(self):
+pass
def items(self):
base = self.user_root + "/"
return self._dispatch_extractors((


@@ -21,19 +21,20 @@ class NijieExtractor(AsynchronousMixin, BaseExtractor):
archive_fmt = "{image_id}_{num}"
def __init__(self, match):
-self._init_category(match)
-self.cookiedomain = "." + self.root.rpartition("/")[2]
-self.cookienames = (self.category + "_tok",)
+BaseExtractor.__init__(self, match)
+self.user_id = text.parse_int(match.group(match.lastindex))
+def initialize(self):
+self.cookies_domain = "." + self.root.rpartition("/")[2]
+self.cookies_names = (self.category + "_tok",)
+BaseExtractor.initialize(self)
+self.session.headers["Referer"] = self.root + "/"
+self.user_name = None
if self.category == "horne":
self._extract_data = self._extract_data_horne
-BaseExtractor.__init__(self, match)
-self.user_id = text.parse_int(match.group(match.lastindex))
-self.user_name = None
-self.session.headers["Referer"] = self.root + "/"
def items(self):
self.login()
@@ -121,10 +122,11 @@ class NijieExtractor(AsynchronousMixin, BaseExtractor):
return text.unescape(text.extr(page, "<br />", "<"))
def login(self):
-"""Login and obtain session cookies"""
-if not self._check_cookies(self.cookienames):
-username, password = self._get_auth_info()
-self._update_cookies(self._login_impl(username, password))
+if self.cookies_check(self.cookies_names):
+return
+username, password = self._get_auth_info()
+self.cookies_update(self._login_impl(username, password))
@cache(maxage=90*24*3600, keyarg=1)
def _login_impl(self, username, password):
@@ -139,7 +141,7 @@ class NijieExtractor(AsynchronousMixin, BaseExtractor):
response = self.request(url, method="POST", data=data)
if "/login.php" in response.text:
raise exception.AuthenticationError()
-return self.session.cookies
+return self.cookies
def _pagination(self, path):
url = "{}/{}.php".format(self.root, path)
@@ -172,13 +174,16 @@ BASE_PATTERN = NijieExtractor.update({
class NijieUserExtractor(NijieExtractor):
"""Extractor for nijie user profiles"""
subcategory = "user"
-cookiedomain = None
+cookies_domain = None
pattern = BASE_PATTERN + r"/members\.php\?id=(\d+)"
test = (
("https://nijie.info/members.php?id=44"),
("https://horne.red/members.php?id=58000"),
)
+def initialize(self):
+pass
def items(self):
fmt = "{}/{{}}.php?id={}".format(self.root, self.user_id).format
return self._dispatch_extractors((


@@ -21,7 +21,7 @@ class NitterExtractor(BaseExtractor):
archive_fmt = "{tweet_id}_{num}"
def __init__(self, match):
-self.cookiedomain = self.root.partition("://")[2]
+self.cookies_domain = self.root.partition("://")[2]
BaseExtractor.__init__(self, match)
lastindex = match.lastindex
@@ -35,7 +35,7 @@ class NitterExtractor(BaseExtractor):
if videos:
ytdl = (videos == "ytdl")
videos = True
-self._cookiejar.set("hlsPlayback", "on", domain=self.cookiedomain)
+self.cookies.set("hlsPlayback", "on", domain=self.cookies_domain)
for tweet in self.tweets():
@@ -162,7 +162,11 @@ class NitterExtractor(BaseExtractor):
banner = extr('class="profile-banner"><a href="', '"')
try:
-uid = banner.split("%2F")[4]
+if "/enc/" in banner:
+uid = binascii.a2b_base64(banner.rpartition(
+"/")[2]).decode().split("/")[4]
+else:
+uid = banner.split("%2F")[4]
except Exception:
uid = 0
@@ -302,7 +306,10 @@ class NitterTweetsExtractor(NitterExtractor):
r"/media%2FCGMNYZvW0AIVoom\.jpg",
"range": "1",
}),
-("https://nitter.1d4.us/supernaturepics"),
+("https://nitter.1d4.us/supernaturepics", {
+"range": "1",
+"keyword": {"user": {"id": "2976459548"}},
+}),
("https://nitter.kavin.rocks/id:2976459548"),
("https://nitter.unixfox.eu/supernaturepics"),
)


@@ -75,7 +75,8 @@ class NsfwalbumAlbumExtractor(GalleryExtractor):
@staticmethod
def _validate_response(response):
-return not response.request.url.endswith("/no_image.jpg")
+return not response.request.url.endswith(
+("/no_image.jpg", "/placeholder.png"))
@staticmethod
def _annihilate(value, base=6):


@@ -28,6 +28,8 @@ class OAuthBase(Extractor):
def __init__(self, match):
Extractor.__init__(self, match)
self.client = None
+def _init(self):
self.cache = config.get(("extractor", self.category), "cache", True)
def oauth_config(self, key, default=None):
@@ -71,8 +73,11 @@ class OAuthBase(Extractor):
browser = self.config("browser", True)
if browser:
-import webbrowser
-browser = webbrowser.get()
+try:
+import webbrowser
+browser = webbrowser.get()
+except Exception:
+browser = None
if browser and browser.open(url):
name = getattr(browser, "name", "Browser")
@@ -131,7 +136,7 @@ class OAuthBase(Extractor):
def _oauth2_authorization_code_grant(
self, client_id, client_secret, default_id, default_secret,
-auth_url, token_url, *, scope="read", duration="permanent",
+auth_url, token_url, scope="read", duration="permanent",
key="refresh_token", auth=True, cache=None, instance=None):
"""Perform an OAuth2 authorization code grant"""

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2018-2022 Mike Fährmann # Copyright 2018-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -14,14 +14,14 @@ from .. import text
class PahealExtractor(Extractor): class PahealExtractor(Extractor):
"""Base class for paheal extractors""" """Base class for paheal extractors"""
basecategory = "booru" basecategory = "shimmie2"
category = "paheal" category = "paheal"
filename_fmt = "{category}_{id}_{md5}.{extension}" filename_fmt = "{category}_{id}_{md5}.{extension}"
archive_fmt = "{id}" archive_fmt = "{id}"
root = "https://rule34.paheal.net" root = "https://rule34.paheal.net"
def items(self): def items(self):
self.session.cookies.set( self.cookies.set(
"ui-tnc-agreed", "true", domain="rule34.paheal.net") "ui-tnc-agreed", "true", domain="rule34.paheal.net")
data = self.get_metadata() data = self.get_metadata()
@ -55,8 +55,8 @@ class PahealExtractor(Extractor):
"class='username' href='/user/", "'")), "class='username' href='/user/", "'")),
"date" : text.parse_datetime( "date" : text.parse_datetime(
extr("datetime='", "'"), "%Y-%m-%dT%H:%M:%S%z"), extr("datetime='", "'"), "%Y-%m-%dT%H:%M:%S%z"),
"source" : text.extract( "source" : text.unescape(text.extr(
extr(">Source&nbsp;Link<", "</td>"), "href='", "'")[0], extr(">Source&nbsp;Link<", "</td>"), "href='", "'")),
} }
dimensions, size, ext = extr("Info</th><td>", ">").split(" // ") dimensions, size, ext = extr("Info</th><td>", ">").split(" // ")
@ -74,16 +74,41 @@ class PahealTagExtractor(PahealExtractor):
directory_fmt = ("{category}", "{search_tags}") directory_fmt = ("{category}", "{search_tags}")
pattern = (r"(?:https?://)?(?:rule34|rule63|cosplay)\.paheal\.net" pattern = (r"(?:https?://)?(?:rule34|rule63|cosplay)\.paheal\.net"
r"/post/list/([^/?#]+)") r"/post/list/([^/?#]+)")
test = ("https://rule34.paheal.net/post/list/Ayane_Suzuki/1", { test = (
"pattern": r"https://[^.]+\.paheal\.net/_images/\w+/\d+%20-%20", ("https://rule34.paheal.net/post/list/Ayane_Suzuki/1", {
"count": ">= 15" "pattern": r"https://[^.]+\.paheal\.net/_images/\w+/\d+%20-%20",
}) "count": ">= 15"
}),
("https://rule34.paheal.net/post/list/Ayane_Suzuki/1", {
"range": "1",
"options": (("metadata", True),),
"keyword": {
"date": "dt:2018-01-07 07:04:05",
"duration": 0.0,
"extension": "jpg",
"filename": "2446128 - Ayane_Suzuki Idolmaster "
"idolmaster_dearly_stars Zanzi",
"height": 768,
"id": 2446128,
"md5": "b0ceda9d860df1d15b60293a7eb465c1",
"search_tags": "Ayane_Suzuki",
"size": 205312,
"source": "https://www.pixiv.net/member_illust.php"
"?mode=medium&illust_id=19957280",
"tags": "Ayane_Suzuki Idolmaster "
"idolmaster_dearly_stars Zanzi",
"uploader": "XXXname",
"width": 1024,
},
}),
)
per_page = 70
def __init__(self, match):
PahealExtractor.__init__(self, match)
self.tags = text.unquote(match.group(1))
+def _init(self):
if self.config("metadata"):
self._extract_data = self._extract_data_ex
@@ -96,8 +121,9 @@ class PahealTagExtractor(PahealExtractor):
url = "{}/post/list/{}/{}".format(self.root, self.tags, pnum)
page = self.request(url).text
+pos = page.find("id='image-list'")
for post in text.extract_iter(
-page, '<img id="thumb_', 'Only</a>'):
+page, "<img id='thumb_", "Only</a>", pos):
yield self._extract_data(post)
if ">Next<" not in page:
@@ -106,10 +132,10 @@ class PahealTagExtractor(PahealExtractor):
@staticmethod
def _extract_data(post):
-pid , pos = text.extract(post, '', '"')
-data, pos = text.extract(post, 'title="', '"', pos)
-md5 , pos = text.extract(post, '/_thumbs/', '/', pos)
-url , pos = text.extract(post, '<a href="', '"', pos)
+pid , pos = text.extract(post, "", "'")
+data, pos = text.extract(post, "title='", "'", pos)
+md5 , pos = text.extract(post, "/_thumbs/", "/", pos)
+url , pos = text.extract(post, "<a href='", "'", pos)
tags, data, date = data.split("\n")
dimensions, size, ext = data.split(" // ")
@@ -126,7 +152,7 @@ class PahealTagExtractor(PahealExtractor):
}
def _extract_data_ex(self, post):
-pid = post[:post.index('"')]
+pid = post[:post.index("'")]
return self._extract_post(pid)
@@ -139,19 +165,19 @@ class PahealPostExtractor(PahealExtractor):
("https://rule34.paheal.net/post/view/481609", {
"pattern": r"https://tulip\.paheal\.net/_images"
r"/bbdc1c33410c2cdce7556c7990be26b7/481609%20-%20"
-r"Azumanga_Daioh%20Osaka%20Vuvuzela%20inanimate\.jpg",
+r"Azumanga_Daioh%20inanimate%20Osaka%20Vuvuzela\.jpg",
"content": "7b924bcf150b352ac75c9d281d061e174c851a11",
"keyword": {
"date": "dt:2010-06-17 15:40:23",
"extension": "jpg",
"file_url": "re:https://tulip.paheal.net/_images/bbdc1c33410c",
-"filename": "481609 - Azumanga_Daioh Osaka Vuvuzela inanimate",
+"filename": "481609 - Azumanga_Daioh inanimate Osaka Vuvuzela",
"height": 660,
"id": 481609,
"md5": "bbdc1c33410c2cdce7556c7990be26b7",
"size": 157389,
-"source": None,
+"source": "",
-"tags": "Azumanga_Daioh Osaka Vuvuzela inanimate",
+"tags": "Azumanga_Daioh inanimate Osaka Vuvuzela",
"uploader": "CaptainButtface",
"width": 614,
},
@@ -163,7 +189,7 @@ class PahealPostExtractor(PahealExtractor):
"md5": "b39edfe455a0381110c710d6ed2ef57d",
"size": 758989,
"source": "http://www.furaffinity.net/view/4057821/",
-"tags": "Vuvuzela inanimate thelost-dragon",
+"tags": "inanimate thelost-dragon Vuvuzela",
"uploader": "leacheate_soup",
"width": 1200,
},
@@ -171,8 +197,8 @@ class PahealPostExtractor(PahealExtractor):
# video
("https://rule34.paheal.net/post/view/3864982", {
"pattern": r"https://[\w]+\.paheal\.net/_images/7629fc0ff77e32637d"
-r"de5bf4f992b2cb/3864982%20-%20Metal_Gear%20Metal_Gear_"
-r"Solid_V%20Quiet%20Vg_erotica%20animated%20webm\.webm",
+r"de5bf4f992b2cb/3864982%20-%20animated%20Metal_Gear%20"
+r"Metal_Gear_Solid_V%20Quiet%20Vg_erotica%20webm\.webm",
"keyword": {
"date": "dt:2020-09-06 01:59:03",
"duration": 30.0,
@@ -183,8 +209,8 @@ class PahealPostExtractor(PahealExtractor):
"size": 18454938,
"source": "https://twitter.com/VG_Worklog"
"/status/1302407696294055936",
-"tags": "Metal_Gear Metal_Gear_Solid_V Quiet "
-"Vg_erotica animated webm",
+"tags": "animated Metal_Gear Metal_Gear_Solid_V "
+"Quiet Vg_erotica webm",
"uploader": "justausername",
"width": 1768,
},


@@ -19,7 +19,7 @@ class PatreonExtractor(Extractor):
"""Base class for patreon extractors"""
category = "patreon"
root = "https://www.patreon.com"
-cookiedomain = ".patreon.com"
+cookies_domain = ".patreon.com"
directory_fmt = ("{category}", "{creator[full_name]}")
filename_fmt = "{id}_{title}_{num:>02}.{extension}"
archive_fmt = "{id}_{num}"
@@ -28,11 +28,11 @@ class PatreonExtractor(Extractor):
_warning = True
def items(self):
if self._warning:
-if not self._check_cookies(("session_id",)):
+if not self.cookies_check(("session_id",)):
self.log.warning("no 'session_id' cookie set")
PatreonExtractor._warning = False
generators = self._build_file_generators(self.config("files"))
for post in self.posts():


@@ -19,39 +19,18 @@ class PhilomenaExtractor(BooruExtractor):
filename_fmt = "{filename}.{extension}"
archive_fmt = "{id}"
request_interval = 1.0
+page_start = 1
per_page = 50
+def _init(self):
+self.api = PhilomenaAPI(self)
_file_url = operator.itemgetter("view_url")
@staticmethod
def _prepare(post):
post["date"] = text.parse_datetime(post["created_at"])
def _pagination(self, url, params):
params["page"] = 1
params["per_page"] = self.per_page
api_key = self.config("api-key")
if api_key:
params["key"] = api_key
filter_id = self.config("filter")
if filter_id:
params["filter_id"] = filter_id
elif not api_key:
try:
params["filter_id"] = INSTANCES[self.category]["filter_id"]
except (KeyError, TypeError):
params["filter_id"] = "2"
while True:
data = self.request(url, params=params).json()
yield from data["images"]
if len(data["images"]) < self.per_page:
return
params["page"] += 1
INSTANCES = {
"derpibooru": {
@@ -146,8 +125,7 @@ class PhilomenaPostExtractor(PhilomenaExtractor):
self.image_id = match.group(match.lastindex)
def posts(self):
-url = self.root + "/api/v1/json/images/" + self.image_id
-return (self.request(url).json()["image"],)
+return (self.api.image(self.image_id),)
class PhilomenaSearchExtractor(PhilomenaExtractor):
@@ -201,8 +179,7 @@ class PhilomenaSearchExtractor(PhilomenaExtractor):
return {"search_tags": self.params.get("q", "")}
def posts(self):
-url = self.root + "/api/v1/json/search/images"
-return self._pagination(url, self.params)
+return self.api.search(self.params)
class PhilomenaGalleryExtractor(PhilomenaExtractor):
@@ -239,15 +216,81 @@ class PhilomenaGalleryExtractor(PhilomenaExtractor):
self.gallery_id = match.group(match.lastindex)
def metadata(self):
-url = self.root + "/api/v1/json/search/galleries"
-params = {"q": "id:" + self.gallery_id}
-galleries = self.request(url, params=params).json()["galleries"]
-if not galleries:
+try:
+return {"gallery": self.api.gallery(self.gallery_id)}
+except IndexError:
raise exception.NotFoundError("gallery")
-return {"gallery": galleries[0]}
def posts(self):
gallery_id = "gallery_id:" + self.gallery_id
-url = self.root + "/api/v1/json/search/images"
params = {"sd": "desc", "sf": gallery_id, "q": gallery_id}
-return self._pagination(url, params)
+return self.api.search(params)
class PhilomenaAPI():
"""Interface for the Philomena API
https://www.derpibooru.org/pages/api
"""
def __init__(self, extractor):
self.extractor = extractor
self.root = extractor.root + "/api"
def gallery(self, gallery_id):
endpoint = "/v1/json/search/galleries"
params = {"q": "id:" + gallery_id}
return self._call(endpoint, params)["galleries"][0]
def image(self, image_id):
endpoint = "/v1/json/images/" + image_id
return self._call(endpoint)["image"]
def search(self, params):
endpoint = "/v1/json/search/images"
return self._pagination(endpoint, params)
def _call(self, endpoint, params=None):
url = self.root + endpoint
while True:
response = self.extractor.request(url, params=params, fatal=None)
if response.status_code < 400:
return response.json()
if response.status_code == 429:
self.extractor.wait(seconds=600)
continue
# error
self.extractor.log.debug(response.content)
raise exception.StopExtraction(
"%s %s", response.status_code, response.reason)
def _pagination(self, endpoint, params):
extr = self.extractor
api_key = extr.config("api-key")
if api_key:
params["key"] = api_key
filter_id = extr.config("filter")
if filter_id:
params["filter_id"] = filter_id
elif not api_key:
try:
params["filter_id"] = INSTANCES[extr.category]["filter_id"]
except (KeyError, TypeError):
params["filter_id"] = "2"
params["page"] = extr.page_start
params["per_page"] = extr.per_page
while True:
data = self._call(endpoint, params)
yield from data["images"]
if len(data["images"]) < extr.per_page:
return
params["page"] += 1


@@ -48,9 +48,10 @@ class PhotobucketAlbumExtractor(Extractor):
)
def __init__(self, match):
-Extractor.__init__(self, match)
-self.album_path = ""
self.root = "https://" + match.group(1)
+Extractor.__init__(self, match)
+def _init(self):
self.session.headers["Referer"] = self.url
def items(self):
@@ -129,6 +130,8 @@ class PhotobucketImageExtractor(Extractor):
Extractor.__init__(self, match)
self.user = match.group(1) or match.group(3)
self.media_id = match.group(2)
+def _init(self):
self.session.headers["Referer"] = self.url
def items(self):


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
-# Copyright 2018-2022 Mike Fährmann
+# Copyright 2018-2023 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
@@ -19,7 +19,7 @@ class PiczelExtractor(Extractor):
filename_fmt = "{category}_{id}_{title}_{num:>02}.{extension}"
archive_fmt = "{id}_{num}"
root = "https://piczel.tv"
-api_root = "https://tombstone.piczel.tv"
+api_root = root
def items(self):
for post in self.posts():


@@ -24,7 +24,7 @@ class PillowfortExtractor(Extractor):
filename_fmt = ("{post_id} {title|original_post[title]:?/ /}"
"{num:>02}.{extension}")
archive_fmt = "{id}"
-cookiedomain = "www.pillowfort.social"
+cookies_domain = "www.pillowfort.social"
def __init__(self, match):
Extractor.__init__(self, match)
@@ -82,15 +82,14 @@ class PillowfortExtractor(Extractor):
yield msgtype, url, post
def login(self):
-cget = self.session.cookies.get
-if cget("_Pf_new_session", domain=self.cookiedomain) \
-or cget("remember_user_token", domain=self.cookiedomain):
+if self.cookies.get("_Pf_new_session", domain=self.cookies_domain):
+return
+if self.cookies.get("remember_user_token", domain=self.cookies_domain):
return
username, password = self._get_auth_info()
if username:
-cookies = self._login_impl(username, password)
-self._update_cookies(cookies)
+self.cookies_update(self._login_impl(username, password))
@cache(maxage=14*24*3600, keyarg=1)
def _login_impl(self, username, password):


@@ -23,12 +23,10 @@ class PinterestExtractor(Extractor):
archive_fmt = "{id}{media_id}"
root = "https://www.pinterest.com"
-def __init__(self, match):
-Extractor.__init__(self, match)
+def _init(self):
domain = self.config("domain")
if not domain or domain == "auto" :
-self.root = text.root_from_url(match.group(0))
+self.root = text.root_from_url(self.url)
else:
self.root = text.ensure_http_scheme(domain)
@@ -112,7 +110,7 @@ class PinterestExtractor(Extractor):
class PinterestPinExtractor(PinterestExtractor):
"""Extractor for images from a single pin from pinterest.com"""
subcategory = "pin"
-pattern = BASE_PATTERN + r"/pin/([^/?#&]+)(?!.*#related$)"
+pattern = BASE_PATTERN + r"/pin/([^/?#]+)(?!.*#related$)"
test = (
("https://www.pinterest.com/pin/858146903966145189/", {
"url": "afb3c26719e3a530bb0e871c480882a801a4e8a5",
@@ -121,7 +119,7 @@ class PinterestPinExtractor(PinterestExtractor):
}),
# video pin (#1189)
("https://www.pinterest.com/pin/422564377542934214/", {
-"pattern": r"https://v\.pinimg\.com/videos/mc/hls/d7/22/ff"
+"pattern": r"https://v\d*\.pinimg\.com/videos/mc/hls/d7/22/ff"
r"/d722ff00ab2352981b89974b37909de8.m3u8",
}),
("https://www.pinterest.com/pin/858146903966145188/", {
@@ -147,8 +145,8 @@ class PinterestBoardExtractor(PinterestExtractor):
subcategory = "board"
directory_fmt = ("{category}", "{board[owner][username]}", "{board[name]}")
archive_fmt = "{board[id]}_{id}"
-pattern = (BASE_PATTERN + r"/(?!pin/)([^/?#&]+)"
-"/(?!_saved|_created|pins/)([^/?#&]+)/?$")
+pattern = (BASE_PATTERN + r"/(?!pin/)([^/?#]+)"
+"/(?!_saved|_created|pins/)([^/?#]+)/?$")
test = (
("https://www.pinterest.com/g1952849/test-/", {
"pattern": r"https://i\.pinimg\.com/originals/",
@@ -198,7 +196,7 @@ class PinterestBoardExtractor(PinterestExtractor):
class PinterestUserExtractor(PinterestExtractor):
"""Extractor for a user's boards"""
subcategory = "user"
-pattern = BASE_PATTERN + r"/(?!pin/)([^/?#&]+)(?:/_saved)?/?$"
+pattern = BASE_PATTERN + r"/(?!pin/)([^/?#]+)(?:/_saved)?/?$"
test = (
("https://www.pinterest.com/g1952849/", {
"pattern": PinterestBoardExtractor.pattern,
@@ -223,7 +221,7 @@ class PinterestAllpinsExtractor(PinterestExtractor):
"""Extractor for a user's 'All Pins' feed"""
subcategory = "allpins"
directory_fmt = ("{category}", "{user}")
-pattern = BASE_PATTERN + r"/(?!pin/)([^/?#&]+)/pins/?$"
+pattern = BASE_PATTERN + r"/(?!pin/)([^/?#]+)/pins/?$"
test = ("https://www.pinterest.com/g1952849/pins/", {
"pattern": r"https://i\.pinimg\.com/originals/[0-9a-f]{2}"
r"/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{32}\.\w{3}",
@@ -245,10 +243,10 @@ class PinterestCreatedExtractor(PinterestExtractor):
"""Extractor for a user's created pins"""
subcategory = "created"
directory_fmt = ("{category}", "{user}")
-pattern = BASE_PATTERN + r"/(?!pin/)([^/?#&]+)/_created/?$"
+pattern = BASE_PATTERN + r"/(?!pin/)([^/?#]+)/_created/?$"
test = ("https://www.pinterest.de/digitalmomblog/_created/", {
"pattern": r"https://i\.pinimg\.com/originals/[0-9a-f]{2}"
-r"/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{32}\.jpg",
+r"/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]{32}\.(jpg|png)",
"count": 10,
"range": "1-10",
})
@@ -270,7 +268,7 @@ class PinterestSectionExtractor(PinterestExtractor):
directory_fmt = ("{category}", "{board[owner][username]}",
"{board[name]}", "{section[title]}")
archive_fmt = "{board[id]}_{id}"
-pattern = BASE_PATTERN + r"/(?!pin/)([^/?#&]+)/([^/?#&]+)/([^/?#&]+)"
+pattern = BASE_PATTERN + r"/(?!pin/)([^/?#]+)/([^/?#]+)/([^/?#]+)"
test = ("https://www.pinterest.com/g1952849/stuff/section", {
"count": 2,
})
@@ -321,7 +319,7 @@ class PinterestRelatedPinExtractor(PinterestPinExtractor):
"""Extractor for related pins of another pin from pinterest.com"""
subcategory = "related-pin"
directory_fmt = ("{category}", "related {original_pin[id]}")
-pattern = BASE_PATTERN + r"/pin/([^/?#&]+).*#related$"
+pattern = BASE_PATTERN + r"/pin/([^/?#]+).*#related$"
test = ("https://www.pinterest.com/pin/858146903966145189/#related", {
"range": "31-70",
"count": 40,
@@ -340,7 +338,7 @@ class PinterestRelatedBoardExtractor(PinterestBoardExtractor):
subcategory = "related-board"
directory_fmt = ("{category}", "{board[owner][username]}",
"{board[name]}", "related")
-pattern = BASE_PATTERN + r"/(?!pin/)([^/?#&]+)/([^/?#&]+)/?#related$"
+pattern = BASE_PATTERN + r"/(?!pin/)([^/?#]+)/([^/?#]+)/?#related$"
test = ("https://www.pinterest.com/g1952849/test-/#related", {
"range": "31-70",
"count": 40,
@@ -348,13 +346,13 @@ class PinterestRelatedBoardExtractor(PinterestBoardExtractor):
})
def pins(self):
-return self.api.board_related(self.board["id"])
+return self.api.board_content_recommendation(self.board["id"])
class PinterestPinitExtractor(PinterestExtractor):
"""Extractor for images from a pin.it URL"""
subcategory = "pinit"
-pattern = r"(?:https?://)?pin\.it/([^/?#&]+)"
+pattern = r"(?:https?://)?pin\.it/([^/?#]+)"
test = (
("https://pin.it/Hvt8hgT", {
@@ -370,7 +368,7 @@ class PinterestPinitExtractor(PinterestExtractor):
self.shortened_id = match.group(1)
def items(self):
-url = "https://api.pinterest.com/url_shortener/{}/redirect".format(
+url = "https://api.pinterest.com/url_shortener/{}/redirect/".format(
self.shortened_id)
response = self.request(url, method="HEAD", allow_redirects=False)
location = response.headers.get("Location")
@@ -458,10 +456,10 @@ class PinterestAPI():
options = {"section_id": section_id}
return self._pagination("BoardSectionPins", options)
-def board_related(self, board_id):
+def board_content_recommendation(self, board_id):
"""Yield related pins of a specific board"""
-options = {"board_id": board_id, "add_vase": True}
+options = {"id": board_id, "type": "board", "add_vase": True}
-return self._pagination("BoardRelatedPixieFeed", options)
+return self._pagination("BoardContentRecommendation", options)
def user_pins(self, user):
"""Yield all pins from 'user'"""


@@ -15,6 +15,9 @@ from datetime import datetime, timedelta
import itertools
import hashlib
+BASE_PATTERN = r"(?:https?://)?(?:www\.|touch\.)?pixiv\.net"
+USER_PATTERN = BASE_PATTERN + r"/(?:en/)?users/(\d+)"
class PixivExtractor(Extractor):
"""Base class for pixiv extractors"""
@@ -23,10 +26,9 @@ class PixivExtractor(Extractor):
directory_fmt = ("{category}", "{user[id]} {user[account]}")
filename_fmt = "{id}_p{num}.{extension}"
archive_fmt = "{id}{suffix}.{extension}"
-cookiedomain = None
+cookies_domain = None
-def __init__(self, match):
-Extractor.__init__(self, match)
+def _init(self):
self.api = PixivAppAPI(self)
self.load_ugoira = self.config("ugoira", True)
self.max_posts = self.config("max-posts", 0)
@@ -44,6 +46,8 @@ class PixivExtractor(Extractor):
def transform_tags(work):
work["tags"] = [tag["name"] for tag in work["tags"]]
+url_sanity = ("https://s.pximg.net/common/images"
+"/limit_sanity_level_360.png")
ratings = {0: "General", 1: "R-18", 2: "R-18G"}
meta_user = self.config("metadata")
meta_bookmark = self.config("metadata-bookmark")
@@ -99,6 +103,10 @@ class PixivExtractor(Extractor):
elif work["page_count"] == 1:
url = meta_single_page["original_image_url"]
+if url == url_sanity:
+self.log.debug("Skipping 'sanity_level' warning (%s)",
+work["id"])
+continue
work["date_url"] = self._date_from_url(url)
yield Message.Url, url, text.nameext_from_url(url, work)
@@ -150,7 +158,7 @@ class PixivExtractor(Extractor):
class PixivUserExtractor(PixivExtractor):
"""Extractor for a pixiv user profile"""
subcategory = "user"
-pattern = (r"(?:https?://)?(?:www\.|touch\.)?pixiv\.net/(?:"
+pattern = (BASE_PATTERN + r"/(?:"
r"(?:en/)?u(?:sers)?/|member\.php\?id=|(?:mypage\.php)?#id="
r")(\d+)(?:$|[?#])")
test = (
@@ -165,20 +173,25 @@ class PixivUserExtractor(PixivExtractor):
PixivExtractor.__init__(self, match)
self.user_id = match.group(1)
+def initialize(self):
+pass
def items(self):
base = "{}/users/{}/".format(self.root, self.user_id)
return self._dispatch_extractors((
(PixivAvatarExtractor , base + "avatar"),
-(PixivBackgroundExtractor, base + "background"),
+(PixivBackgroundExtractor , base + "background"),
(PixivArtworksExtractor , base + "artworks"),
(PixivFavoriteExtractor , base + "bookmarks/artworks"),
+(PixivNovelBookmarkExtractor, base + "bookmarks/novels"),
+(PixivNovelUserExtractor , base + "novels"),
), ("artworks",))
class PixivArtworksExtractor(PixivExtractor):
"""Extractor for artworks of a pixiv user"""
subcategory = "artworks"
-pattern = (r"(?:https?://)?(?:www\.|touch\.)?pixiv\.net/(?:"
+pattern = (BASE_PATTERN + r"/(?:"
r"(?:en/)?users/(\d+)/(?:artworks|illustrations|manga)"
r"(?:/([^/?#]+))?/?(?:$|[?#])"
r"|member_illust\.php\?id=(\d+)(?:&([^#]+))?)")
@@ -239,8 +252,7 @@ class PixivAvatarExtractor(PixivExtractor):
subcategory = "avatar"
filename_fmt = "avatar{date:?_//%Y-%m-%d}.{extension}"
archive_fmt = "avatar_{user[id]}_{date}"
-pattern = (r"(?:https?://)?(?:www\.)?pixiv\.net"
-r"/(?:en/)?users/(\d+)/avatar")
+pattern = USER_PATTERN + r"/avatar"
test = ("https://www.pixiv.net/en/users/173530/avatar", {
"content": "4e57544480cc2036ea9608103e8f024fa737fe66",
})
@@ -260,8 +272,7 @@ class PixivBackgroundExtractor(PixivExtractor):
subcategory = "background"
filename_fmt = "background{date:?_//%Y-%m-%d}.{extension}"
archive_fmt = "background_{user[id]}_{date}"
-pattern = (r"(?:https?://)?(?:www\.)?pixiv\.net"
-r"/(?:en/)?users/(\d+)/background")
+pattern = USER_PATTERN + "/background"
test = ("https://www.pixiv.net/en/users/194921/background", {
"pattern": r"https://i\.pximg\.net/background/img/2021/01/30/16/12/02"
r"/194921_af1f71e557a42f499213d4b9eaccc0f8\.jpg",
@@ -375,12 +386,12 @@ class PixivWorkExtractor(PixivExtractor):
class PixivFavoriteExtractor(PixivExtractor):
-"""Extractor for all favorites/bookmarks of a pixiv-user"""
+"""Extractor for all favorites/bookmarks of a pixiv user"""
subcategory = "favorite"
directory_fmt = ("{category}", "bookmarks",
"{user_bookmark[id]} {user_bookmark[account]}")
archive_fmt = "f_{user_bookmark[id]}_{id}{num}.{extension}"
-pattern = (r"(?:https?://)?(?:www\.|touch\.)?pixiv\.net/(?:(?:en/)?"
+pattern = (BASE_PATTERN + r"/(?:(?:en/)?"
r"users/(\d+)/(bookmarks/artworks|following)(?:/([^/?#]+))?"
r"|bookmark\.php)(?:\?([^#]*))?")
test = (
@@ -483,8 +494,7 @@ class PixivRankingExtractor(PixivExtractor):
archive_fmt = "r_{ranking[mode]}_{ranking[date]}_{id}{num}.{extension}"
directory_fmt = ("{category}", "rankings",
"{ranking[mode]}", "{ranking[date]}")
-pattern = (r"(?:https?://)?(?:www\.|touch\.)?pixiv\.net"
-r"/ranking\.php(?:\?([^#]*))?")
+pattern = BASE_PATTERN + r"/ranking\.php(?:\?([^#]*))?"
test = (
("https://www.pixiv.net/ranking.php?mode=daily&date=20170818"),
("https://www.pixiv.net/ranking.php"),
@@ -549,8 +559,7 @@ class PixivSearchExtractor(PixivExtractor):
subcategory = "search"
archive_fmt = "s_{search[word]}_{id}{num}.{extension}"
directory_fmt = ("{category}", "search", "{search[word]}")
-pattern = (r"(?:https?://)?(?:www\.|touch\.)?pixiv\.net"
-r"/(?:(?:en/)?tags/([^/?#]+)(?:/[^/?#]+)?/?"
+pattern = (BASE_PATTERN + r"/(?:(?:en/)?tags/([^/?#]+)(?:/[^/?#]+)?/?"
r"|search\.php)(?:\?([^#]+))?")
test = (
("https://www.pixiv.net/en/tags/Original", {
@@ -596,6 +605,9 @@ class PixivSearchExtractor(PixivExtractor):
sort_map = {
"date": "date_asc",
"date_d": "date_desc",
+"popular_d": "popular_desc",
+"popular_male_d": "popular_male_desc",
+"popular_female_d": "popular_female_desc",
}
try:
self.sort = sort = sort_map[sort]
@@ -630,8 +642,7 @@ class PixivFollowExtractor(PixivExtractor):
subcategory = "follow"
archive_fmt = "F_{user_follow[id]}_{id}{num}.{extension}"
directory_fmt = ("{category}", "following")
-pattern = (r"(?:https?://)?(?:www\.|touch\.)?pixiv\.net"
-r"/bookmark_new_illust\.php")
+pattern = BASE_PATTERN + r"/bookmark_new_illust\.php"
test = (
("https://www.pixiv.net/bookmark_new_illust.php"),
("https://touch.pixiv.net/bookmark_new_illust.php"),
@@ -670,7 +681,7 @@ class PixivPixivisionExtractor(PixivExtractor):
def works(self):
return (
-self.api.illust_detail(illust_id)
+self.api.illust_detail(illust_id.partition("?")[0])
for illust_id in util.unique_sequence(text.extract_iter(
self.page, '<a href="https://www.pixiv.net/en/artworks/', '"'))
)
@@ -693,8 +704,7 @@ class PixivSeriesExtractor(PixivExtractor):
directory_fmt = ("{category}", "{user[id]} {user[account]}",
"{series[id]} {series[title]}") "{series[id]} {series[title]}")
filename_fmt = "{num_series:>03}_{id}_p{num}.{extension}" filename_fmt = "{num_series:>03}_{id}_p{num}.{extension}"
pattern = (r"(?:https?://)?(?:www\.)?pixiv\.net" pattern = BASE_PATTERN + r"/user/(\d+)/series/(\d+)"
r"/user/(\d+)/series/(\d+)")
test = ("https://www.pixiv.net/user/10509347/series/21859", { test = ("https://www.pixiv.net/user/10509347/series/21859", {
"range": "1-10", "range": "1-10",
"count": 10, "count": 10,
@ -747,6 +757,220 @@ class PixivSeriesExtractor(PixivExtractor):
params["p"] += 1 params["p"] += 1
class PixivNovelExtractor(PixivExtractor):
"""Extractor for pixiv novels"""
subcategory = "novel"
request_interval = 1.0
pattern = BASE_PATTERN + r"/n(?:ovel/show\.php\?id=|/)(\d+)"
test = (
("https://www.pixiv.net/novel/show.php?id=19612040", {
"count": 1,
"content": "8c818474153cbd2f221ee08766e1d634c821d8b4",
"keyword": {
"caption": r"re:「無能な名無し」と呼ばれ虐げられて育った鈴\(すず\)は、",
"comment_access_control": 0,
"create_date": "2023-04-02T15:18:58+09:00",
"date": "dt:2023-04-02 06:18:58",
"id": 19612040,
"is_bookmarked": False,
"is_muted": False,
"is_mypixiv_only": False,
"is_original": True,
"is_x_restricted": False,
"novel_ai_type": 1,
"page_count": 1,
"rating": "General",
"restrict": 0,
"series": {
"id": 10278364,
"title": "龍の贄嫁〜無能な名無しと虐げられていましたが、"
"どうやら異母妹に霊力を搾取されていたようです〜",
},
"tags": ["和風ファンタジー", "溺愛", "神様", "ヤンデレ", "執着",
"異能", "ざまぁ", "学園", "神嫁"],
"text_length": 5974,
"title": "異母妹から「無能な名無し」と虐げられていた私、"
"どうやら異母妹に霊力を搾取されていたようです(1)",
"user": {
"account": "yukinaga_chifuyu",
"id": 77055466,
},
"visible": True,
"x_restrict": 0,
},
}),
# embeds
("https://www.pixiv.net/novel/show.php?id=16422450", {
"options": (("embeds", True),),
"count": 3,
}),
# full series
("https://www.pixiv.net/novel/show.php?id=19612040", {
"options": (("full-series", True),),
"count": 4,
}),
# short URL
("https://www.pixiv.net/n/19612040"),
)
def __init__(self, match):
PixivExtractor.__init__(self, match)
self.novel_id = match.group(1)
def items(self):
tags = self.config("tags", "japanese")
if tags == "original":
transform_tags = None
elif tags == "translated":
def transform_tags(work):
work["tags"] = list(dict.fromkeys(
tag["translated_name"] or tag["name"]
for tag in work["tags"]))
else:
def transform_tags(work):
work["tags"] = [tag["name"] for tag in work["tags"]]
ratings = {0: "General", 1: "R-18", 2: "R-18G"}
meta_user = self.config("metadata")
meta_bookmark = self.config("metadata-bookmark")
embeds = self.config("embeds")
if embeds:
headers = {
"User-Agent" : "Mozilla/5.0",
"App-OS" : None,
"App-OS-Version": None,
"App-Version" : None,
"Referer" : self.root + "/",
"Authorization" : None,
}
novels = self.novels()
if self.max_posts:
novels = itertools.islice(novels, self.max_posts)
for novel in novels:
if meta_user:
novel.update(self.api.user_detail(novel["user"]["id"]))
if meta_bookmark and novel["is_bookmarked"]:
detail = self.api.novel_bookmark_detail(novel["id"])
novel["tags_bookmark"] = [tag["name"] for tag in detail["tags"]
if tag["is_registered"]]
if transform_tags:
transform_tags(novel)
novel["num"] = 0
novel["date"] = text.parse_datetime(novel["create_date"])
novel["rating"] = ratings.get(novel["x_restrict"])
novel["suffix"] = ""
yield Message.Directory, novel
novel["extension"] = "txt"
content = self.api.novel_text(novel["id"])["novel_text"]
yield Message.Url, "text:" + content, novel
if embeds:
desktop = False
illusts = {}
for marker in text.extract_iter(content, "[", "]"):
if marker.startswith("[jumpuri:If you would like to "):
desktop = True
elif marker.startswith("pixivimage:"):
illusts[marker[11:].partition("-")[0]] = None
if desktop:
novel_id = str(novel["id"])
url = "{}/novel/show.php?id={}".format(
self.root, novel_id)
data = util.json_loads(text.extr(
self.request(url, headers=headers).text,
"id=\"meta-preload-data\" content='", "'"))
for image in (data["novel"][novel_id]
["textEmbeddedImages"]).values():
url = image.pop("urls")["original"]
novel.update(image)
novel["date_url"] = self._date_from_url(url)
novel["num"] += 1
novel["suffix"] = "_p{:02}".format(novel["num"])
text.nameext_from_url(url, novel)
yield Message.Url, url, novel
if illusts:
novel["_extractor"] = PixivWorkExtractor
novel["date_url"] = None
for illust_id in illusts:
novel["num"] += 1
novel["suffix"] = "_p{:02}".format(novel["num"])
url = "{}/artworks/{}".format(self.root, illust_id)
yield Message.Queue, url, novel
def novels(self):
novel = self.api.novel_detail(self.novel_id)
if self.config("full-series") and novel["series"]:
self.subcategory = PixivNovelSeriesExtractor.subcategory
return self.api.novel_series(novel["series"]["id"])
return (novel,)
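
The embeds branch above finds `[pixivimage:…]` markers in the novel text and later queues the referenced artworks. A minimal, self-contained sketch of that marker parsing, not part of the diff (standard library only; the sample text and IDs are invented, and the real extractor additionally deduplicates the IDs):

import re

# Hypothetical novel text with two embedded illustration markers
# (the IDs are made up for illustration).
sample = "prologue ... [pixivimage:108177125] ... [pixivimage:108177126-2] ... epilogue"

# Same idea as the embeds handling above: take the text between "[" and "]",
# keep only "pixivimage:" markers, and drop an optional "-<page>" suffix.
illust_ids = []
for marker in re.findall(r"\[([^\]]+)\]", sample):
    if marker.startswith("pixivimage:"):
        illust_ids.append(marker[len("pixivimage:"):].partition("-")[0])

print(illust_ids)  # ['108177125', '108177126']
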
class PixivNovelUserExtractor(PixivNovelExtractor):
"""Extractor for pixiv users' novels"""
subcategory = "novel-user"
pattern = USER_PATTERN + r"/novels"
test = ("https://www.pixiv.net/en/users/77055466/novels", {
"pattern": "^text:",
"range": "1-5",
"count": 5,
})
def novels(self):
return self.api.user_novels(self.novel_id)
class PixivNovelSeriesExtractor(PixivNovelExtractor):
"""Extractor for pixiv novel series"""
subcategory = "novel-series"
pattern = BASE_PATTERN + r"/novel/series/(\d+)"
test = ("https://www.pixiv.net/novel/series/10278364", {
"count": 4,
"content": "b06abed001b3f6ccfb1579699e9a238b46d38ea2",
})
def novels(self):
return self.api.novel_series(self.novel_id)
class PixivNovelBookmarkExtractor(PixivNovelExtractor):
"""Extractor for bookmarked pixiv novels"""
subcategory = "novel-bookmark"
pattern = (USER_PATTERN + r"/bookmarks/novels"
r"(?:/([^/?#]+))?(?:/?\?([^#]+))?")
test = (
("https://www.pixiv.net/en/users/77055466/bookmarks/novels", {
"count": 1,
"content": "7194e8faa876b2b536f185ee271a2b6e46c69089",
}),
("https://www.pixiv.net/en/users/11/bookmarks/novels/TAG?rest=hide"),
)
def __init__(self, match):
PixivNovelExtractor.__init__(self, match)
self.user_id, self.tag, self.query = match.groups()
def novels(self):
if self.tag:
tag = text.unquote(self.tag)
else:
tag = None
if text.parse_query(self.query).get("rest") == "hide":
restrict = "private"
else:
restrict = "public"
return self.api.user_bookmarks_novel(self.user_id, tag, restrict)
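
The novel extractors above read several options via `self.config()`: `tags` (default "japanese"), `embeds`, `full-series`, `metadata`, and `metadata-bookmark`. A hedged sketch of a matching configuration fragment, written as a plain Python dict and not part of the diff; the key names come from the code above, while the values are illustrative choices rather than documented defaults:

# Illustrative "extractor.pixiv" options for the novel extractors above.
# Only "tags" has a visible default in the code ("japanese"); the other
# values here are example choices, not defaults.
pixiv_novel_options = {
    "tags": "translated",        # "japanese" (default), "translated", or "original"
    "embeds": True,              # also download images embedded in the novel text
    "full-series": True,         # expand a single novel URL to its whole series
    "metadata": False,           # fetch extended user metadata per novel
    "metadata-bookmark": False,  # add "tags_bookmark" for bookmarked novels
}
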
class PixivSketchExtractor(Extractor): class PixivSketchExtractor(Extractor):
"""Extractor for user pages on sketch.pixiv.net""" """Extractor for user pages on sketch.pixiv.net"""
category = "pixiv" category = "pixiv"
@ -755,7 +979,7 @@ class PixivSketchExtractor(Extractor):
filename_fmt = "{post_id} {id}.{extension}" filename_fmt = "{post_id} {id}.{extension}"
archive_fmt = "S{user[id]}_{id}" archive_fmt = "S{user[id]}_{id}"
root = "https://sketch.pixiv.net" root = "https://sketch.pixiv.net"
cookiedomain = ".pixiv.net" cookies_domain = ".pixiv.net"
pattern = r"(?:https?://)?sketch\.pixiv\.net/@([^/?#]+)" pattern = r"(?:https?://)?sketch\.pixiv\.net/@([^/?#]+)"
test = ("https://sketch.pixiv.net/@nicoby", { test = ("https://sketch.pixiv.net/@nicoby", {
"pattern": r"https://img\-sketch\.pixiv\.net/uploads/medium" "pattern": r"https://img\-sketch\.pixiv\.net/uploads/medium"
@ -904,6 +1128,23 @@ class PixivAppAPI():
params = {"illust_id": illust_id} params = {"illust_id": illust_id}
return self._pagination("/v2/illust/related", params) return self._pagination("/v2/illust/related", params)
def novel_bookmark_detail(self, novel_id):
params = {"novel_id": novel_id}
return self._call(
"/v2/novel/bookmark/detail", params)["bookmark_detail"]
def novel_detail(self, novel_id):
params = {"novel_id": novel_id}
return self._call("/v2/novel/detail", params)["novel"]
def novel_series(self, series_id):
params = {"series_id": series_id}
return self._pagination("/v1/novel/series", params, "novels")
def novel_text(self, novel_id):
params = {"novel_id": novel_id}
return self._call("/v1/novel/text", params)
def search_illust(self, word, sort=None, target=None, duration=None, def search_illust(self, word, sort=None, target=None, duration=None,
date_start=None, date_end=None): date_start=None, date_end=None):
params = {"word": word, "search_target": target, params = {"word": word, "search_target": target,
@ -916,6 +1157,11 @@ class PixivAppAPI():
params = {"user_id": user_id, "tag": tag, "restrict": restrict} params = {"user_id": user_id, "tag": tag, "restrict": restrict}
return self._pagination("/v1/user/bookmarks/illust", params) return self._pagination("/v1/user/bookmarks/illust", params)
def user_bookmarks_novel(self, user_id, tag=None, restrict="public"):
"""Return novels bookmarked by a user"""
params = {"user_id": user_id, "tag": tag, "restrict": restrict}
return self._pagination("/v1/user/bookmarks/novel", params, "novels")
def user_bookmark_tags_illust(self, user_id, restrict="public"): def user_bookmark_tags_illust(self, user_id, restrict="public"):
"""Return bookmark tags defined by a user""" """Return bookmark tags defined by a user"""
params = {"user_id": user_id, "restrict": restrict} params = {"user_id": user_id, "restrict": restrict}
@ -935,6 +1181,10 @@ class PixivAppAPI():
params = {"user_id": user_id} params = {"user_id": user_id}
return self._pagination("/v1/user/illusts", params) return self._pagination("/v1/user/illusts", params)
def user_novels(self, user_id):
params = {"user_id": user_id}
return self._pagination("/v1/user/novels", params, "novels")
def ugoira_metadata(self, illust_id): def ugoira_metadata(self, illust_id):
params = {"illust_id": illust_id} params = {"illust_id": illust_id}
return self._call("/v1/ugoira/metadata", params)["ugoira_metadata"] return self._call("/v1/ugoira/metadata", params)["ugoira_metadata"]
@ -41,7 +41,7 @@ class PoipikuExtractor(Extractor):
"user_name" : text.unescape(extr( "user_name" : text.unescape(extr(
'<h2 class="UserInfoUserName">', '</').rpartition(">")[2]), '<h2 class="UserInfoUserName">', '</').rpartition(">")[2]),
"description": text.unescape(extr( "description": text.unescape(extr(
'class="IllustItemDesc" >', '<')), 'class="IllustItemDesc" >', '</h1>')),
"_http_headers": {"Referer": post_url}, "_http_headers": {"Referer": post_url},
} }
@ -76,11 +76,12 @@ class PoipikuExtractor(Extractor):
"MD" : "0", "MD" : "0",
"TWF": "-1", "TWF": "-1",
} }
page = self.request( resp = self.request(
url, method="POST", headers=headers, data=data).json()["html"] url, method="POST", headers=headers, data=data).json()
if page.startswith(("You need to", "Password is incorrect")): page = resp["html"]
self.log.warning("'%s'", page) if (resp.get("result_num") or 0) < 0:
self.log.warning("'%s'", page.replace("<br/>", " "))
for thumb in text.extract_iter( for thumb in text.extract_iter(
page, 'class="IllustItemThumbImg" src="', '"'): page, 'class="IllustItemThumbImg" src="', '"'):
@ -172,7 +173,9 @@ class PoipikuPostExtractor(PoipikuExtractor):
"count": 3, "count": 3,
"keyword": { "keyword": {
"count": "3", "count": "3",
"description": "ORANGE OASISボスネタバレ", "description": "ORANGE OASISボスネタバレ<br />曲も大好き<br />"
"2枚目以降はほとんど見えなかった1枚目背景"
"のヒエログリフ小ネタです𓀀",
"num": int, "num": int,
"post_category": "SPOILER", "post_category": "SPOILER",
"post_id": "5776587", "post_id": "5776587",
@ -1,6 +1,6 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright 2019-2021 Mike Fährmann # Copyright 2019-2023 Mike Fährmann
# #
# This program is free software; you can redistribute it and/or modify # This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as # it under the terms of the GNU General Public License version 2 as
@ -11,7 +11,6 @@
from .common import Extractor, Message from .common import Extractor, Message
from .. import text, exception from .. import text, exception
BASE_PATTERN = r"(?:https?://)?(?:[\w-]+\.)?pornhub\.com" BASE_PATTERN = r"(?:https?://)?(?:[\w-]+\.)?pornhub\.com"
@ -59,6 +58,9 @@ class PornhubGalleryExtractor(PornhubExtractor):
self._first = None self._first = None
def items(self): def items(self):
self.cookies.set(
"accessAgeDisclaimerPH", "1", domain=".pornhub.com")
data = self.metadata() data = self.metadata()
yield Message.Directory, data yield Message.Directory, data
for num, image in enumerate(self.images(), 1): for num, image in enumerate(self.images(), 1):
@ -109,7 +111,7 @@ class PornhubGalleryExtractor(PornhubExtractor):
"views" : text.parse_int(img["times_viewed"]), "views" : text.parse_int(img["times_viewed"]),
"score" : text.parse_int(img["vote_percent"]), "score" : text.parse_int(img["vote_percent"]),
} }
key = img["next"] key = str(img["next"])
if key == end: if key == end:
return return
@ -146,10 +148,20 @@ class PornhubUserExtractor(PornhubExtractor):
data = {"_extractor": PornhubGalleryExtractor} data = {"_extractor": PornhubGalleryExtractor}
while True: while True:
page = self.request( response = self.request(
url, method="POST", headers=headers, params=params).text url, method="POST", headers=headers, params=params,
if not page: allow_redirects=False)
return
for gid in text.extract_iter(page, 'id="albumphoto', '"'): if 300 <= response.status_code < 400:
url = "{}{}/photos/{}/ajax".format(
self.root, response.headers["location"],
self.cat or "public")
continue
gid = None
for gid in text.extract_iter(response.text, 'id="albumphoto', '"'):
yield Message.Queue, self.root + "/album/" + gid, data yield Message.Queue, self.root + "/album/" + gid, data
if gid is None:
return
params["page"] += 1 params["page"] += 1
@ -23,7 +23,9 @@ class PornpicsExtractor(Extractor):
def __init__(self, match): def __init__(self, match):
super().__init__(match) super().__init__(match)
self.item = match.group(1) self.item = match.group(1)
self.session.headers["Referer"] = self.root
def _init(self):
self.session.headers["Referer"] = self.root + "/"
def items(self): def items(self):
for gallery in self.galleries(): for gallery in self.galleries():
@ -22,18 +22,21 @@ class ReactorExtractor(BaseExtractor):
def __init__(self, match): def __init__(self, match):
BaseExtractor.__init__(self, match) BaseExtractor.__init__(self, match)
url = text.ensure_http_scheme(match.group(0), "http://") url = text.ensure_http_scheme(match.group(0), "http://")
pos = url.index("/", 10) pos = url.index("/", 10)
self.root = url[:pos]
self.root, self.path = url[:pos], url[pos:] self.path = url[pos:]
self.session.headers["Referer"] = self.root
self.gif = self.config("gif", False)
if self.category == "reactor": if self.category == "reactor":
# set category based on domain name # set category based on domain name
netloc = urllib.parse.urlsplit(self.root).netloc netloc = urllib.parse.urlsplit(self.root).netloc
self.category = netloc.rpartition(".")[0] self.category = netloc.rpartition(".")[0]
def _init(self):
self.session.headers["Referer"] = self.root
self.gif = self.config("gif", False)
def items(self): def items(self):
data = self.metadata() data = self.metadata()
yield Message.Directory, data yield Message.Directory, data
@ -57,8 +57,10 @@ class ReadcomiconlineIssueExtractor(ReadcomiconlineBase, ChapterExtractor):
def __init__(self, match): def __init__(self, match):
ChapterExtractor.__init__(self, match) ChapterExtractor.__init__(self, match)
self.params = match.group(2)
params = text.parse_query(match.group(2)) def _init(self):
params = text.parse_query(self.params)
quality = self.config("quality") quality = self.config("quality")
if quality is None or quality == "auto": if quality is None or quality == "auto":
Some files were not shown because too many files have changed in this diff.