mirror of https://github.com/mikf/gallery-dl.git synced 2024-11-23 03:02:50 +01:00

27 Commits

Author SHA1 Message Date
Mike Fährmann
3194bcbccc
[blogger] remove 'micmicidol.club' 2024-10-10 14:23:58 +02:00
Wiiplay123
6eb62f2140
Combine lh*(-**).googleusercontent.com URL regex into one line.
Co-authored-by: Mike Fährmann <mike_faehrmann@web.de>
2024-01-20 15:53:11 -06:00
Wiiplay123
a6fed628dd
[blogger] Fix lh*.googleusercontent.com forward slash bug, add support for lh*-**.googleusercontent.com
Some URLs use "lh(number)-(locale).googleusercontent.com" format, so I added support for those.

Also, "lh(number).googleusercontent.com" formats were broken because the regex was looking for a second forward slash.

Examples:
lh7.googleusercontent.com
lh7-us.googleusercontent.com
2024-01-20 15:07:52 -06:00
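
The two hostname forms named in the commit above can be matched with a single expression. This is a minimal sketch under the assumption that the locale suffix is a short run of lowercase letters; `HOST_RE` and `is_image_host` are hypothetical names, and the pattern actually used by the blogger extractor may differ.

```python
import re

# Hypothetical pattern (not the extractor's actual regex):
# "lh" + digits, an optional "-<locale>" suffix, then the fixed domain.
HOST_RE = re.compile(r"^lh\d+(?:-[a-z]+)?\.googleusercontent\.com$")


def is_image_host(host):
    """Return True if 'host' looks like a googleusercontent image host."""
    return HOST_RE.match(host) is not None
```

Both example hosts from the commit message, `lh7.googleusercontent.com` and `lh7-us.googleusercontent.com`, satisfy this check.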
Mike Fährmann
e17a48fe56
[blogger] inherit from BaseExtractor
- support www.micmicidol.club (#4759)
2023-11-21 16:52:25 +01:00
Mike Fährmann
27ec653991
fix bug in test_init and update example URLs 2023-09-14 13:27:03 +02:00
Mike Fährmann
a453335a9f
remove test results in extractor modules
and add generic example URLs
2023-09-11 16:30:55 +02:00
Mike Fährmann
a383eca7f6
decouple extractor initialization
Introduce an 'initialize()' function that does the actual init
(session, cookies, config options) and can be called separately from
the constructor __init__().

This makes it possible, for example, to adjust config access inside a Job
before initialization runs, rather than after most of it has already
happened when calling 'extractor.find()'.
2023-07-25 22:16:16 +02:00
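
The split between construction and initialization described above can be illustrated with a toy class. This is only a sketch: the `Extractor` class, its attributes, and the `config` dictionary are hypothetical stand-ins, not gallery-dl's actual classes or signatures.

```python
class Extractor:
    """Toy stand-in for an extractor base class (hypothetical)."""

    def __init__(self, url):
        # Cheap: only remember the input; no session, cookies, or config yet.
        self.url = url
        self.initialized = False
        self.options = None

    def initialize(self, config):
        # The actual setup happens here, so a caller can still adjust
        # 'config' between construction and this call.
        self.options = dict(config)
        self.initialized = True


extr = Extractor("https://example.org/")
config = {"retries": 4}
config["retries"] = 1      # adjusted after construction, before init
extr.initialize(config)
```

Because `__init__()` reads no configuration, the adjusted value is the one `initialize()` sees.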
Mike Fährmann
0ad59c92b1
[blogger] download files from 'lh*.googleusercontent.com' (4070) 2023-05-28 19:58:20 +02:00
enduser420
bbb1e34c34
[blogger] update sub regex 2023-04-03 12:43:58 +05:30
Mike Fährmann
dd884b02ee
replace json.loads with direct calls to JSONDecoder.decode 2023-02-09 15:22:00 +01:00
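
A rough illustration of the idea in the commit above: build one decoder and reuse its bound `decode` method instead of calling `json.loads()` each time. The name `json_loads` is an assumption made for this sketch and may not match what the project uses.

```python
import json

# Create a decoder once and keep a reference to its bound method;
# every call then goes straight to JSONDecoder.decode().
json_loads = json.JSONDecoder().decode

data = json_loads('{"id": 1, "tags": ["a", "b"]}')
```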
Mike Fährmann
b0cb4a1b9c
replace 'text.extract()' with 'text.extr()' where possible 2022-11-05 01:14:09 +01:00
Mike Fährmann
d699310fdf
[blogger] add 'label' or 'query' metadata fields (#2930)
for '/search/label/…' or '/search?q=…' URLs
2022-09-20 11:37:39 +02:00
Mike Fährmann
eef50c1f28
[blogger] split 'search' extractor (#2930) 2022-09-19 21:01:21 +02:00
Mike Fährmann
5038893cdd
[blogger] emit metadata for posts without files (#2789) 2022-07-29 13:38:39 +02:00
Mike Fährmann
c6a9bab019
update extractor test results 2022-07-12 15:49:22 +02:00
Mike Fährmann
698f35215e
[blogger] support new image domain (fixes #2204) 2022-01-20 23:13:07 +01:00
Vrihub
96fcff182c
generic extractor (#735)
* Generic extractor, see issue #683

* Fix failed test_names test, no subcategory needed

* Prefix directory_fmt with "generic"

* Relax regex (would break some urls)

* Flake8 compliance

* pattern: don't require a scheme

This fixes a bug when we force the generic extractor on urls without a
scheme (that are allowed by all other extractors).

* Fix using g: and r: on urls without http(s) scheme

Almost all extractors accept urls without an initial http(s) scheme.

Many extractors also allow for generic subdomains in their "pattern"
variable; some of them implement this with the regex character class
"[^.]+" (everything but a dot).

This leads to a problem when the extractor is given a url starting
with g: or r: (to force using the generic or recursive extractor)
and without the http(s) scheme: e.g. with "r:foobar.tumblr.com"
the "r:" is wrongly considered part of the subdomain.

This commit fixes the bug, replacing the too generic "[^.]+" with the
more specific "[\w-]+" (letters, digits and "-", the only characters
allowed in domain names), which is already used by some extractors.

* Relax imageurl_pattern_ext: allow relative urls

* First round of small suggested changes

* Support image urls starting with "//"

* self.baseurl: remove trailing slash

* Relax regexp (didn't catch some image urls)

* Some fixes and cleanup

* Fix domain pattern; option to enable extractor

Fixed the domain section for "pattern", to pass "test_add" and
"test_add_module" tests.
Added the "enabled" configuration option (default False) to enable the
generic extractor. Using "g(eneric):URL" forces using the extractor.
2021-12-29 22:39:29 +01:00
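
The subdomain issue explained in the commit above ("r:" being treated as part of the subdomain) can be reproduced with two simplified patterns. These are hypothetical patterns modeled on the description, not the ones from any actual extractor.

```python
import re

# Simplified, hypothetical patterns: optional scheme, a subdomain group,
# then a fixed domain. Only the subdomain character class differs.
loose = re.compile(r"(?:https?://)?([^.]+)\.tumblr\.com")
strict = re.compile(r"(?:https?://)?([\w-]+)\.tumblr\.com")

url = "r:foobar.tumblr.com"

# "[^.]+" accepts ":" and therefore swallows the "r:" prefix.
loose_match = loose.match(url)

# "[\w-]+" rejects ":", so the prefixed URL no longer matches at all.
strict_match = strict.match(url)

# A URL without the prefix still matches the stricter pattern.
plain_match = strict.match("foobar.tumblr.com")
```

Here `loose_match.group(1)` is `"r:foobar"`, `strict_match` is `None`, and `plain_match.group(1)` is `"foobar"`.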
Mike Fährmann
bd08ee2859
remove most 'yield Message.Version' statements
only leave them in oauth.py as noop results
2021-08-16 03:10:48 +02:00
Mike Fährmann
968d3e8465
remove '&' from URL patterns
'/?&#' -> '/?#' and '?&#' -> '?#'

According to https://www.ietf.org/rfc/rfc3986.txt, URLs are
"organized hierarchically" by using "the slash ("/"), question
mark ("?"), and number sign ("#") characters to delimit components"
2020-10-22 23:31:25 +02:00
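
The effect of dropping '&' from such a character class can be seen on a single path segment. The patterns and the example path below are made up for illustration and do not come from a specific extractor.

```python
import re

# Hypothetical patterns capturing one path segment after "/user/".
old = re.compile(r"/user/([^/?&#]+)")   # '&' also ends the segment
new = re.compile(r"/user/([^/?#]+)")    # only '/', '?', '#' end it

path = "/user/tom&jerry?page=2"

old_name = old.search(path).group(1)
new_name = new.search(path).group(1)
```

With the old class the capture stops at '&' and yields `"tom"`; with the new class it runs up to the '?' delimiter and yields `"tom&jerry"`.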
Mike Fährmann
6491db3eaf
[blogger] handle URLs with specified width/height (closes #1061)
get highest quality for images with
/wXXX-hXXX/ instead of the usual /sXXX/
2020-10-15 15:14:18 +02:00
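
One way to handle both size-segment forms mentioned in the commit above is a single substitution. This sketch assumes that a `/s0/` segment requests the original-size image; that value, the names `SIZE_RE` and `original_size`, and the example URLs are assumptions for illustration and not necessarily what the blogger extractor does.

```python
import re

# Hypothetical rewrite: match a "/sNNN/" or "/wNNN-hNNN/" path segment.
SIZE_RE = re.compile(r"/(?:s\d+|w\d+-h\d+)/")


def original_size(url):
    """Replace the first size segment with '/s0/' (assumed: original size)."""
    return SIZE_RE.sub("/s0/", url, count=1)


a = original_size("https://example.com/-abc/s1600/img.jpg")
b = original_size("https://example.com/-abc/w640-h480/img.jpg")
```

Both calls produce `https://example.com/-abc/s0/img.jpg`; a URL without a size segment is returned unchanged.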
Mike Fährmann
2b88c90f6f
[blogger] add search extractor (#925) 2020-08-06 19:43:39 +02:00
Mike Fährmann
aa64149583
[blogger] support searching posts by labels (closes #925) 2020-08-04 22:49:37 +02:00
Mike Fährmann
453f3bc519
[blogger] improve error messages for missing posts/blogs (#903) 2020-07-22 23:51:48 +02:00
Mike Fährmann
d6a480682f
update test results 2020-05-02 21:13:00 +02:00
Mike Fährmann
4e361b3008
add tests for specific datetime values 2020-02-23 16:48:30 +01:00
Mike Fährmann
6703b8a86b
[blogger] implement video extraction (closes #587) 2020-01-24 23:37:23 +01:00
Mike Fährmann
109718a5e3
[blogger] add blog and post extractors (closes #364) 2019-10-26 14:15:55 +02:00