* [instagram] Add extractor for instagram.com user profiles and pages
The extractor scrapes `instagram.com/<user>' timelines and
`instagram.com/p/<shortcode>' by mimicking the behaviour of a web
browser and extracting the sharedData JSON of the single pages.
Please note that this mean that for user timelines we also do an
extra request to the `instagram.com/p/<shortcode>' page but this
permit to have consistent (and all) information about the media
fetched.
The MD5 logic used for X-Instagram-GIS was documented in
<https://stackoverflow.com/questions/49786980/>
* [instagram] Test for keywords, not url for GraphImage and GraphSidecar
URLs returned by instagram seems not stable so avoid testing for
them and instead test for keyword returned.
* [instagram] Improve test of InstagramProfilepageExtractor
Also check the count of media returned.
* [instagram] Several cleanup and improvements
- Change description, subcategories to generate a better description in
docs/supportedsite.rst
- Remove not needed InstagramExtractor.__init__()
- Use text.parse_int() instead of directly using int() (the former is more
robust)
- Use self.request().json() instead of using json.loads() the
self.request().text()
- Add `pattern:' to check the URLs where we do not have a stable URLs.
It seems that only the subdomain is not stable.
Thanks to @mikf!
While a filename might not be a real 'hash', or comparable to what
tumbler usually provides, it is still better than an empty string.
At least as long as "alternatives" in format strings aren't implemented.
The "default" downloader options (rate, retries, timeout, verify) are
mapped to corresponding youtube-dl options.
downloader.ytdl.logging tells the downloader to pass youtube-dl's output
to a Logger object.
downloader.ytdl.raw-options allows to pass arbitrary options to the
YoutubeDL constructor.
from: 5xx HTTP Error: Reason
to : 5xx: Reason
The "HTTP Error" part was in there to emulate Request's error messages
from response.raise_for_status(), but it reads a lot better without.
In addition to 'abort' and 'exit', it is now possible to specify
'abort:N' and 'exit:N' (where N is any integer) as value for 'skip'
to abort/exit after consecutively skipping N downloads.
The first login will still use username and password, but everything
afterwards will use the refresh_token obtained from that.
This will prevent pixiv from sending a "New login to pixiv" email every
time a new access_token is requested.
The functionality of --(chapter-)filter and --(chapter-)range are now
also exposed as the following config-file options:
- extractor.*.image-filter
- extractor.*.image-range
- extractor.*.chapter-filter
- extractor.*.chapter-range
TODO: update configuration.rst
This change introduces 'extractor.*.retries/timeout/verify' options
as a general way to set these values for all HTTP requests.
'downloader.http.retries/timeout/verify' is a way to override these
options for file downloads only and will fall back to 'extractor.*.…*
values if they haven't been explicitly set.
Also: downloader classes now take an extractor object as first argument
instead of a requests.session.
URLs starting with 'ytdl:' will now be handled by youtube-dl.
There is probably a lot to fix and improve, but the basic use case
works.
TODO:
- format selection and ytdl options in general
- better filename/path handling
- ytdl support for "unsupported URLs"
- ...
Enabling this option will detect videos in tweets and output them as
"unsupported" URLs, so that these can then be downloaded with youtube-dl
There are a lot of improvements to be made to the current
implementation, but it works and does what it is supposed to, even if
inefficient as can be ...
and some smaller changes ...
'user' is the name of the account an image is listed at and
'artist' is now the name of the account who created the image.
For example "https://www.hentai-foundry.com/user/Tenpura/faves/pictures"
- 'user': Tenpura
- 'artist' of the only image: LewdBrush
- rename "deleted" to "same-blog"
- change test for deleted original post to test if
original post owner has the same UUID (full blog name) as the one
being downloaded from
- add 'blog[uuid]' metadata to allow comparison with
'reblogged_from_uuid'
Setting 'reblogs' to "deleted" will check if the parent post of a
reblog has been deleted and download its media content if that is the
case, otherwise it will be skipped.
This is a rather costly operation (1 API request per reblogged post)
and should therefore be used with care.
Each post-processor config dict now supports a list of extractor
categories for which it should/shouldn't be active for.
For example:
"postprocessors": [
{"name": "classify",
"whitelist": ["tumblr", "deviantart"],
...
}
]
A format string now gets parsed only once instead of re-parsing it each
time it is applied to a set of data.
The initial parsing causes directory path creation to be at about 2x
slower than before, since each format string there is used only once,
but building a filename, the more common operation, is at least 2x
faster. The "directory slowness" cancels at about 5 filenames and
everything above that is significantly faster.
http://subapics.com/ got discontinued and replaced by http://ngomik.in/.
ngomik.in is still displaying a link to the "old site" showing a big
"Account Suspended" sign.
For example "https://twitter.com/PicturesEarth/media".
They are different from normal timelines in that they do not contain
any (re)tweets from other users and feature all media the user ever
posted, including responses to other tweets.
- rename User- to TimelineExtractor
- rename 'userid' to 'user_id' to conform to the other ..._id values
- adjust archive_fmt to deal with retweets
- emulate browser behavior for API calls
- cache manga API results
- add artist, author and date fields to chapter metadata
- remove Manga-/ChapterExtractor inheritance
- minor code simplifications and improvements