Mike Fährmann
05255f5be0
add 'default' argument to 'text.extr()'
2022-11-09 11:00:32 +01:00
Mike Fährmann
eb33e6cf2d
add 'text.extr()'
...
a stripped-down version of text.extract() that
- always returns a string (like 'extract_from')
- only returns a string
- does not deal with 'pos' arguments
- is ~20% faster
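A minimal sketch of what such a helper might look like, assuming it takes the same (txt, begin, end) arguments as text.extract(). This is an illustration of the idea, not the actual implementation:

```python
def extr(txt, begin, end):
    """Return the text between 'begin' and 'end', or "" if not found."""
    try:
        # str.index() raises ValueError when a marker is missing
        first = txt.index(begin) + len(begin)
        return txt[first:txt.index(end, first)]
    except ValueError:
        return ""


print(extr("<b>bold</b>", "<b>", "</b>"))
```

Because there is no 'pos' to track, the result can be used directly as a string without unpacking a tuple.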
2022-11-04 21:37:36 +01:00
Mike Fährmann
67bad04dda
[formatter] add 'g' conversion to sluGify a string (#2410)
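For illustration, slugifying a string could be sketched as follows. The exact rules (lowercasing, dropping punctuation, collapsing whitespace and hyphens) are an assumption here and may differ from what the 'g' conversion does:

```python
import re


def slugify(value):
    """Turn 'value' into a lowercase, hyphen-separated slug (assumed rules)."""
    # drop everything except word characters, whitespace, and hyphens
    value = re.sub(r"[^\w\s-]", "", str(value).lower())
    # collapse runs of whitespace and hyphens into a single hyphen
    return re.sub(r"[-\s]+", "-", value).strip("-_")


print(slugify("Hello, World!"))
```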
2022-08-26 17:57:17 +02:00
Mike Fährmann
bddcec49f1
implement 'text.root_from_url()'
...
use domain from input URL for kemono
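A rough sketch of extracting the root (scheme plus domain) from a URL. The 'scheme' default and the handling of scheme-less input are assumptions made for this example:

```python
def root_from_url(url, scheme="https://"):
    """Return scheme and domain of 'url', e.g. 'https://example.org'."""
    if url.startswith(("https://", "http://")):
        # keep the scheme that the input URL already uses
        proto, _, url = url.partition("://")
        scheme = proto + "://"
    # everything before the first '/' is the domain
    return scheme + url.partition("/")[0]


print(root_from_url("https://example.org/user/123"))
```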
2022-03-01 03:09:57 +01:00
Mike Fährmann
bc0e853d30
combine KeyError & IndexError to common base class LookupError
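Both KeyError and IndexError derive from LookupError in Python, so a single except clause covers dict and sequence lookups alike. A small example, with 'get_item' as a hypothetical helper name:

```python
def get_item(container, key, default=None):
    """Return container[key], or 'default' if the lookup fails."""
    try:
        return container[key]
    except LookupError:  # covers both KeyError and IndexError
        return default


print(get_item({"a": 1}, "b", "missing"))
print(get_item([1, 2], 5, "missing"))
```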
2022-02-11 00:42:49 +01:00
Mike Fährmann
bc868e7bb8
consider apparently long extensions as part of the filename
...
(#1516)
2021-05-02 21:15:50 +02:00
Mike Fährmann
387fe415d5
unescape items in text.split_html()
2021-03-29 02:12:29 +02:00
Mike Fährmann
78fd63b8f0
remove 'text.clean_xml()'
...
was not used anywhere
2021-03-28 04:05:16 +02:00
Mike Fährmann
8553b218d9
replace calls to 'os.path.splitext()' with 'str.rpartition()'
...
Makes the functions that used it more than twice as fast
and we can get rid of an import as well.
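The two approaches side by side, as a short illustration (the file name is made up for the example):

```python
import os.path

filename = "archive.tar.gz"

# os.path.splitext() keeps the dot as part of the extension
root, ext = os.path.splitext(filename)

# str.rpartition() splits at the last "." and needs no import
name, _, extension = filename.rpartition(".")

print(root, ext)
print(name, extension)
```

Note that the two are not drop-in equivalents: for a name without any ".", rpartition() returns an empty first element and puts the whole string in the last one, so callers have to handle that case themselves.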
2021-03-28 04:01:27 +02:00
Mike Fährmann
a09f42f6b3
improve filename_from_url() performance
...
Manually extracting the part between the last '/' and '?' instead of
relying on the standard library's 'urllib.parse.urlsplit()' makes this
function roughly 4 times as fast.
urlsplit() : 3.64 secs per 1.000.000 iterations
partition(): 0.87 secs per 1.000.000 iterations
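The manual extraction can be sketched like this. It is a simplified illustration of the partition-based approach, not necessarily the exact code from this commit:

```python
def filename_from_url(url):
    """Return the part of 'url' between the last '/' and the '?'."""
    # cut off the query string first, then take what follows the last '/'
    return url.partition("?")[0].rpartition("/")[2]


print(filename_from_url("https://example.org/path/filename.ext?foo=bar"))
```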
2020-10-23 00:14:06 +02:00
Mike Fährmann
37d71f6e09
strip microseconds in text.parse_datetime()
2020-06-17 21:40:16 +02:00
Mike Fährmann
6294e2c540
add 'text.ensure_http_scheme()'
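One possible shape for such a helper, assuming it prepends a default scheme to URLs that lack one (the 'scheme' parameter and the stripping of leading '/' and ':' are assumptions):

```python
def ensure_http_scheme(url, scheme="https://"):
    """Prepend 'scheme' to 'url' unless it already starts with http(s)."""
    if url and not url.startswith(("https://", "http://")):
        # also handles protocol-relative URLs such as '//example.org'
        return scheme + url.lstrip("/:")
    return url


print(ensure_http_scheme("//example.org/image.jpg"))
```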
2020-05-19 22:32:53 +02:00
Mike Fährmann
a0f4c295c0
add optional 'utcoffset' argument to 'parse_datetime()'
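A hedged sketch of how an optional 'utcoffset' could work, assuming the offset is given in hours and is subtracted to obtain UTC. The default format string is also an assumption:

```python
from datetime import datetime, timedelta


def parse_datetime(date_string, fmt="%Y-%m-%dT%H:%M:%S", utcoffset=0):
    """Parse 'date_string' and return a naive datetime shifted to UTC."""
    d = datetime.strptime(date_string, fmt)
    # a local time at UTC+2 is two hours ahead of UTC, so subtract
    return d - timedelta(hours=utcoffset)


print(parse_datetime("2020-04-11T02:05:00", utcoffset=2))
```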
2020-04-11 02:05:00 +02:00
Mike Fährmann
f6c5edb76b
pre-compile regex pattern for remove_html() and split_html()
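Pre-compiling means the pattern is built once at import time instead of being looked up in the 're' module cache on every call. A simplified sketch, where the pattern and the function bodies are assumptions:

```python
import re

# compiled once at import time and shared by both functions
_HTML_TAGS = re.compile("<[^>]+>")


def remove_html(txt):
    """Replace HTML tags with spaces and normalize whitespace."""
    return " ".join(_HTML_TAGS.sub(" ", txt).split())


def split_html(txt):
    """Split 'txt' at HTML tags and drop empty pieces."""
    return [x.strip() for x in _HTML_TAGS.split(txt)
            if x and not x.isspace()]


print(remove_html("<p>Hello <b>World</b></p>"))
print(split_html("<p>Hello</p><p>World</p>"))
```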
2020-03-13 23:31:54 +01:00
Mike Fährmann
b1bea8aaeb
add 'restrict-filenames' option (#348)
2019-07-23 17:41:24 +02:00
Mike Fährmann
1740086d8a
add 'repl' and 'sep' arguments to text.replace_html()
2019-07-17 14:48:24 +02:00
Mike Fährmann
b171befa87
implement 'parse_unicode_escapes()'
2019-06-16 21:47:24 +02:00
Mike Fährmann
2b1999476e
implement 'text.rextract()'
2019-05-28 21:03:41 +02:00
Mike Fährmann
2316e0ed3d
fix strptime workaround from b0e85a4
...
Don't return a modified version of 'date_time' if strptime fails.
2019-05-25 23:22:26 +02:00
Mike Fährmann
b0e85a42e3
apply workaround from 4736912
in parse_datetime() itself
2019-05-09 21:53:17 +02:00
Mike Fährmann
d09864b581
implement text.parse_datetime()
2019-05-08 15:43:59 +02:00
Mike Fährmann
6264a46212
use 'utcfromtimestamp()'
...
'fromtimestamp()' converts its results to the local timezone and causes
problems when running tests on a different machine.
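The difference can be shown with a fixed timestamp. Since 'utcfromtimestamp()' is deprecated as of Python 3.12, this sketch uses the timezone-aware call and strips the tzinfo, which gives the same naive UTC value:

```python
from datetime import datetime, timezone

ts = 31536000  # 365 days after the Unix epoch

# depends on the timezone of the machine running the code
local = datetime.fromtimestamp(ts)

# identical on every machine: naive datetime in UTC
utc = datetime.fromtimestamp(ts, timezone.utc).replace(tzinfo=None)

print(utc)
```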
2019-04-21 16:22:53 +02:00
Mike Fährmann
d670de0344
implement 'text.parse_timestamp()'
2019-04-21 15:28:27 +02:00
Mike Fährmann
21a7e395a7
implement convenience wrapper for text.extract functionality
2019-04-19 22:30:11 +02:00
Mike Fährmann
8f249f1d54
improve text.extract_iter() performance
...
by roughly 40% through
- inlining code
- pre-calculating reused values
- entering a try-except block only once
2019-04-18 23:37:17 +02:00
Mike Fährmann
5530871b5a
change results of text.nameext_from_url()
...
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)
Example: "https://example.org/path/filename.ext"
before:
- filename : filename.ext
- name : filename
- extension: ext
now:
- filename : filename
- extension: ext
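The new result layout, sketched with a simplified function. The optional 'data' argument and the handling of names without an extension are assumptions:

```python
def nameext_from_url(url, data=None):
    """Store 'filename' and 'extension' of 'url' in 'data' and return it."""
    if data is None:
        data = {}
    # part between the last '/' and the query string
    filename = url.partition("?")[0].rpartition("/")[2]
    name, _, ext = filename.rpartition(".")
    if name:
        data["filename"] = name
        data["extension"] = ext.lower()
    else:
        # no usable '.' separator: treat everything as the filename
        data["filename"] = filename
        data["extension"] = ""
    return data


print(nameext_from_url("https://example.org/path/filename.ext"))
```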
2019-02-14 16:07:17 +01:00
Mike Fährmann
e1d3e9a926
add 'ext_from_url' to text.py
2019-01-31 12:23:25 +01:00
Mike Fährmann
2d2953a5bf
add 'text.parse_float()' + cleanup in text.py
2019-01-29 16:46:21 +01:00
Mike Fährmann
ae9a37a528
implement text.split_html()
2018-05-27 15:00:41 +02:00
Mike Fährmann
cc36f88586
rename safe_int to parse_int; move parse_* to text module
2018-04-20 14:53:21 +02:00
Mike Fährmann
4ffa94f634
remove 'shorten_path()' and 'shorten_filename()'
2018-04-15 18:44:13 +02:00
Mike Fährmann
27eab4e467
rewrite text tests and improve functions
...
- test more edge cases
- consistently return an empty string for invalid arguments
- remove the ungreedy-flag in 'remove_html()'
2018-04-15 18:13:46 +02:00
Mike Fährmann
e3f2bd4087
add tests for 'text.clean_xml()' and improve it
2018-04-14 22:07:01 +02:00
Mike Fährmann
6d8b191ea7
improve 'parse_query()' and add tests
...
- another irrelevant micro-optimization!
- use urllib.parse.parse_qsl directly instead of parse_qs, which
just packs the results of parse_qsl in a different data structure
- reduced memory requirements since no additional dict and lists are
created
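A sketch of a parse_qsl-based version that builds a flat dict directly. Keeping the first value for a repeated key is an assumption made for this example:

```python
from urllib.parse import parse_qsl


def parse_query(qs):
    """Parse a query string into a flat dict of str-to-str pairs."""
    result = {}
    # parse_qsl() yields (key, value) tuples without building
    # the intermediate dict of lists that parse_qs() returns
    for key, value in parse_qsl(qs):
        if key not in result:
            result[key] = value
    return result


print(parse_query("a=1&b=2&a=3"))
```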
2018-04-13 19:21:32 +02:00
Mike Fährmann
731ffd4986
improve text.filename_from_url() performance
...
- urlsplit() is faster than urlparse()
- rpartition() is faster than rindex() + slicing
- new version is 2.3 times as fast
2018-02-18 16:50:07 +01:00
Mike Fährmann
f7cdfd4c25
add a simplified version of 'parse_qs'
...
This version only returns a dict of plain string to string key-value
pairs and ignores multiple values for the same query variable.
2017-08-24 20:55:58 +02:00
Mike Fährmann
e5f79ae839
[deviantart] add support for all media types
...
- this includes
- images
- videos
- flash-animations
- journals
- also renamed some of the extractors
- User -> Gallery
- Image -> Deviation
2017-05-10 16:45:45 +02:00
Mike Fährmann
ed94d9b92d
fix/improve various things
2017-03-17 09:39:46 +01:00
Mike Fährmann
619c74159a
[seiga] fix file extension and xml parsing
...
- The file extension of the first image had been used for all further
images
- API responses can contain invalid characters, which cause the XML
parser to fail (http://seiga.nicovideo.jp/user/illust/26377934
contains several \x08 characters)
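One way to work around such responses is to strip control characters that XML 1.0 does not allow before parsing. The helper name, the character set, and the sample data below are assumptions for illustration:

```python
import re
import xml.etree.ElementTree as ET

# control characters that are not allowed in XML 1.0 documents
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_xml(xmldata, repl=""):
    """Replace characters that would make the XML parser fail."""
    return _INVALID_XML_CHARS.sub(repl, xmldata)


raw = "<image><title>foo\x08bar</title></image>"
root = ET.fromstring(clean_xml(raw))
print(root.findtext("title"))
```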
2017-03-14 09:09:04 +01:00
Mike Fährmann
4f123b8513
code adjustments according to pep8
2017-01-30 19:40:15 +01:00
Mike Fährmann
8780abcc77
fix a small spelling error
2017-01-10 14:24:58 +01:00
Mike Fährmann
00074a71d7
several changes to make travis build work
...
- fixed html.unescape not being available on Python3.3
- removed inconsistent test result
- added username/password pairs for authenticating extractors
2017-01-10 13:41:00 +01:00
Mike Fährmann
91c446805b
replace platform.system() with os.name
2016-10-25 15:44:36 +02:00
Mike Fährmann
8a49a28d13
replace deprecated 'unescape' method
2016-02-18 15:54:58 +01:00
Mike Fährmann
99b4fbb081
implement text.extract_iter
2015-11-28 01:46:34 +01:00
Mike Fährmann
7fd284a705
always provide lowercase file extensions
2015-11-16 17:40:05 +01:00
Mike Fährmann
ca523b9f64
add helper method to text module
2015-11-16 03:46:43 +01:00
Mike Fährmann
d0bebd9ce3
allow adding values to existing dict
2015-11-03 00:05:18 +01:00
Mike Fährmann
629133a27a
document text.extract
2015-11-02 15:52:26 +01:00
Mike Fährmann
692d0c95cc
reimplement text.extract_all
2015-11-02 15:51:32 +01:00