1
0
mirror of https://github.com/gorhill/uBlock.git synced 2024-11-17 16:02:33 +01:00
Commit Graph

60 Commits

Author SHA1 Message Date
Raymond Hill
11c56ab540
Minor fine-tuning of URL tokenizer 2019-10-31 11:15:00 -04:00
Raymond Hill
a69b301d81
Fine-tune new bidi-trie code
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/761
2019-10-29 10:26:34 -04:00
Raymond Hill
b0cbc47d9a
Add WASM versions for some bidi-trie methods
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/761

Changes related to above issue made it possible to
create WASM versions of methods used in the bidi-trie.
In this commit, WASM versions for startsWith(), indexOf()
and lastIndexOf() have been implemented.
2019-10-26 13:13:53 -04:00
Raymond Hill
7971b22385
Expand bidi-trie usage in static network filtering engine
Related issues:
- https://github.com/uBlockOrigin/uBlock-issues/issues/761
- https://github.com/uBlockOrigin/uBlock-issues/issues/528

The previous bidi-trie code could only hold filters which
are plain pattern, i.e. no wildcard characters, and which
had no origin option (`domain=`), right and/or left anchor,
and no `csp=` option.

Example of filters that could be moved into a bidi-trie
data structure:

    &ad_box_
    /w/d/capu.php?z=$script,third-party
    ||liveonlinetv247.com/images/muvixx-150x50-watch-now-in-hd-play-btn.gif

Examples of filters that could NOT be moved to a bidi-trie:

    -adap.$domain=~l-adap.org
    /tsc.php?*&ses=
    ||ibsrv.net/*forumsponsor$domain=[...]
    @@||imgspice.com/jquery.cookie.js|$script
    ||view.atdmt.com^*/iview/$third-party
    ||postimg.cc/image/$csp=[...]

Ideally the filters above should be able to be moved to a
bidi-trie since they are basically plain patterns, or at
least partially moved to a bidi-trie when there is only a
single wildcard (i.e. made of two plain patterns).

Also, there were two distinct bidi-tries in which
plain-pattern filters can be moved to: one for patterns
without hostname anchoring and another one for patterns
with hostname-anchoring. This was required because the
hostname-anchored patterns have an extra condition which
is outside the bidi-trie knowledge.

This commit expands the number of filters which can be
stored in the bidi-trie, and also remove the need to
use two distinct bidi-tries.

- Added ability to associate a pattern with an integer
  in the bidi-trie [1].
    - The bidi-trie match code passes this externally
      provided integer when calling an externally
      provided method used for testing extra conditions
      that may be present for a plain pattern found to
      be matching in the bidi-trie.

- Decomposed existing filters into smaller logical units:
    - FilterPlainLeftAnchored =>
        FilterPatternPlain +
        FilterAnchorLeft
    - FilterPlainRightAnchored =>
        FilterPatternPlain +
        FilterAnchorRight
    - FilterExactMatch =>
        FilterPatternPlain +
        FilterAnchorLeft +
        FilterAnchorRight
    - FilterPlainHnAnchored =>
        FilterPatternPlain +
        FilterAnchorHn
    - FilterWildcard1 =>
        FilterPatternPlain + [
          FilterPatternLeft or
          FilterPatternRight
        ]
    - FilterWildcard1HnAnchored =>
        FilterPatternPlain + [
          FilterPatternLeft or
          FilterPatternRight
        ] +
        FilterAnchorHn
    - FilterGenericHnAnchored =>
        FilterPatternGeneric +
        FilterAnchorHn
    - FilterGenericHnAndRightAnchored =>
        FilterPatternGeneric +
        FilterAnchorRight +
        FilterAnchorHn
    - FilterOriginMixedSet =>
        FilterOriginMissSet +
        FilterOriginHitSet
    - Instances of FilterOrigin[...], FilterDataHolder
      can also be added to a composite filter to
      represent `domain=` and `csp=` options.

- Added a new filter class, FilterComposite, for
  filters which are a combination of two or more
  logical units. A FilterComposite instance is a
  match when *all* filters composing it are a
  match.

Since filters are now encoded into combination of
smaller units, it becomes possible to extract the
FilterPatternPlain component and store it in the
bidi-trie, and use the integer as a handle for the
remaining extra conditions, if any.

Since a single pattern in the bidi-trie may be a
component for different filters, the associated
integer points to a sequence of extra conditions,
and a match occurs as soon as one of the extra
conditions (which may itself be a sequence of
conditions) is fulfilled.

Decomposing filters which are currently single
instance into sequences of smaller logical filters
means increasing the storage and CPU overhead when
evaluating such filters. The CPU overhead is
compensated by the fact that more filters can now
moved into the bidi-trie, where the first match is
efficiently evaluated. The extra conditions have to
be evaluated if and only if there is a match in the
bidi-trie.

The storage overhead is compensated by the
bidi-trie's intrinsic nature of merging similar
patterns.

Furthermore, the storage overhead is reduced by no
longer using JavaScript array to store collection
of filters (which is what FilterComposite is):
the same technique used in [2] is imported to store
sequences of filters.

A sequence of filters is a sequence of integer pairs
where the first integer is an index to an actual
filter instance stored in a global array of filters
(`filterUnits`), while the second integer is an index
to the next pair in the sequence -- which means all
sequences of filters are encoded in one single array
of integers (`filterSequences` => Uint32Array). As
a result, a sequence of filters can be represented by
one single integer -- an index to the first pair --
regardless of the number of filters in the sequence.

This representation is further leveraged to replace
the use of JavaScript array in FilterBucket [3],
which used a JavaScript array to store collection
of filters. Doing so means there is no more need for
FilterPair [4], which purpose was to be a lightweight
representation when there was only two filters in a
collection.

As a result of the above changes, the map of `token`
(integer)  => filter instance (object) used to
associate tokens to filters or collections of filters
is replaced with a more efficient map of `token`
(integer) to filter unit index (integer) to lookup a
filter object from the global `filterUnits` array.

Another consequence of using one single global
array to store all filter instances means we can reuse
existing instances when a logical filter instance is
parameter-less, which is the case for FilterAnchorLeft,
FilterAnchorRight, FilterAnchorHn, the index to these
single instances is reused where needed.

`urlTokenizer` now stores the character codes of the
scanned URL into a bidi-trie buffer, for reuse when
string matching methods are called.

New method: `tokenHistogram()`, used to generate
histograms of occurrences of token extracted from URLs
in built-in benchmark. The top results of the "miss"
histogram are used as "bad tokens", i.e. tokens to
avoid if possible when compiling filter lists.

All plain pattern strings are now stored in the
bidi-trie memory buffer, regardless of whether they
will be used in the trie proper or not.

Three methods have been added to the bidi-trie to test
stored string against the URL which is also stored in
then bidi-trie.

FilterParser is now instanciated on demand and
released when no longer used.

***

[1] 135a45a878/src/js/strie.js (L120)
[2] e94024d350
[3] 135a45a878/src/js/static-net-filtering.js (L1630)
[4] 135a45a878/src/js/static-net-filtering.js (L1566)
2019-10-21 08:15:58 -04:00
Raymond Hill
f2340bef3c
Fix bad returned value in case of empty URL
Though I do no expect the empty URL case
to ever occur, having the tokenizer return
the wrong value if it ever occur could cause
uBO to malfunction.
2019-10-17 17:23:05 -04:00
Raymond Hill
f117c280d0
Fix minor bugs spotted during code review 2019-10-14 09:03:51 -04:00
Raymond Hill
23c4c80136
Add support for elemhide (through specifichide)
Related documentation:
- https://help.eyeo.com/en/adblockplus/how-to-write-filters#element-hiding

Related feedback/discussion:
- https://www.reddit.com/r/uBlockOrigin/comments/d6vxzj/

The `elemhide` filter option as per ABP semantic is
now supported. Previously uBO would consider `elemhide`
to be an alias of `generichide`.

The support of `elemhide` is through the convenient
conversion of `elemhide` option into existing
`generichide` option and new `specifichide` option.

The purpose of the new `specifichide` filter option
is to disable all specific cosmetic filters, i.e.
those who target a specific site.

Additionally, for convenience purpose, the filter
options `generichide`, `specifichide` and `elemhide`
can be aliased using the shorter forms `ghide`,
`shide` and `ehide` respectively.
2019-09-21 11:30:38 -04:00
Raymond Hill
b5c1efc7f5
Informal code review toward ES6 2019-09-11 08:08:30 -04:00
Raymond Hill
bcf5ac1fee
Add advanced setting to control logger popup type
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/663

The advanced setting `loggerPopupType` has been added, to
control the type of window to be used when the logger is
launched as a separate window.

The default value is `popup`, it can be changed to any of
the values documented at:

https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/windows/CreateType
2019-09-06 11:41:07 -04:00
Raymond Hill
708e5004e8
Fix badly computed output size in µBlock.base64.encode()
This bug could cause losing 1 to 3 bytes of information
dropped from various internal buffers at encoding time.

Possibly related feedback:
- https://www.reddit.com/r/uBlockOrigin/comments/cs1y26/
2019-08-22 09:17:19 -04:00
Raymond Hill
7ff750eaf6
Reflect blocking mode in badge color of toolbar icon
Related feedback:
- https://www.reddit.com/r/uBlockOrigin/comments/cmh910/

Additionally, the `3p` rule has been made distinct from
`3p-script`/`3p-frame` for the purpose of
"Relax blocking mode" command.

The badge color will hint at the current blocking mode.
There are four colors for the four following blocking
modes:
- JavaScript wholly disabled
- All 3rd parties blocked
- 3rd-party scripts and frames blocked
- None of the above

The default badge color will be used when JavaScript is not
wholly disabled and when there are no rules for `3p`,
`3p-script` or `3p-frame`.

A new advanced setting has been added to let the user choose
the badge colors for the various blocking modes,
`blockingProfileColors`. The value *must* be a sequence of
4 valid CSS color values that match 6 hexadecimal digits
prefixed with`#` -- anything else will be ignored.
2019-08-10 10:57:24 -04:00
Raymond Hill
0ca44b847c
Avoid duplicated strings in filterOrigin w/ new approach
The new approach is simpler and should benefit selfie
serialization/unserialization.

This renders stringDeduplicater obsolete -- it has been
removed.
2019-05-17 10:13:58 -04:00
Raymond Hill
c4f9ae706a
Fix alternate code path introduced in 295f08da97 (oops) 2019-04-28 14:18:09 -04:00
Raymond Hill
295f08da97
Implement code path for when TextDecoder() is not available
The primary purpose is to unbreak
https://github.com/cliqz-oss/adblocker/tree/master/bench/comparison
2019-04-28 14:07:21 -04:00
Raymond Hill
ac58b8e688
Make token hashes fit within a 32-bit integer
The staticNetFilteringEngine uses token hashes to store/lookup
filters into Map objects.

Before this commit, the tokens were encoded into token hashes
as JS numbers (not exceeding MAX_SAFE_INTEGER) using at most
the 8 first characters of the token.

With this commit, token hashes are now restricted to fit
into 32-bit integers, and are derived from at most the 7 first
characters. This improves filter look-up performance as per
built-in benchmark().
2019-04-28 10:15:15 -04:00
Raymond Hill
96dce22218
Increase resolution of known-token lookup table
Related commit:
- 69a43e07c4

Using 32 bits of token hash rather than just the 16 lower
bits does help discard more unknown tokens.

Using the default filter lists, the known-token lookup
table is populated by 12,276 entries, out of 65,536, thus
making the case that theoretically there is a lot of
possible tokens which can be discarded.

In practice, running the built-in
staticNetFilteringEngine.benchmark() with default filter
lists, I find that 1,518,929 tokens were skipped out of
4,441,891 extracted tokens, or 34%.
2019-04-27 08:18:01 -04:00
Raymond Hill
69a43e07c4
Ignore unknown tokens in urlTokenizer.getTokens()
Given that all tokens extracted from one single URL are potentially
iterated multiple times in a single URL-matching cycle, it pays to
ignore extracted tokens which are known to not be used anywhere in
the static filtering engine.

The gain in processing a single network request in the static
filtering engine can become especially high when dealing with
long and random-looking URLs, which URLs have a high likelihood
of containing a majority of tokens which are known to not be in
use.
2019-04-26 17:14:00 -04:00
Raymond Hill
a52b07ff6e
Make userResourcesLocation able to support multiple URLs
The URLs must be space-separated.

Reminders:
- The additional resources will be updated at the same time
  the built-in resource file is updated
- Purging the cache of 'uBlock filters' will also purge the
  cache of the built-in resource file -- and hence force a
  reload of the user's custom resources if any

Related issues:
- https://github.com/gorhill/uBlock/issues/3307
- https://github.com/uBlockOrigin/uAssets/issues/5184#issuecomment-475875189

Addtionally:
- Opportunitically promisified assets.fetchText()
- Fixed https://github.com/gorhill/uBlock/issues/3586
2019-04-20 17:16:49 -04:00
Raymond Hill
fa83744b58
Use a sequence of base 64 numbers to encode array buffers
The purpose of using a custom base128 encoder is to
convert array buffers into strings, to allow a direct
string-to-array buffer conversion at load time:

  string => array buffer

Whereas a JSON array would require an extra step:

  JSON array as string => JS array => array buffer

Turns out that the current use of a custom base128 encoding
results in a significantly larger selfie storage usage when
converting array buffers into strings.

Speculation: possibly the browser convert the strings to
save into JSON strings internally. Since the custom base128
encoder is likely to cause the resulting string to contain
a lot of unprintable ASCII characters, these will need to
be escaped when converted to JSON -- escaped characters
occupy more space than non-escaped ones.

Using a sequence of base 64 numbers means only printable
will be present in the output string, hence no escaping
necessary. I have observed significant reduction in
storage usage for selfie purpose.
2019-04-20 09:06:54 -04:00
Raymond Hill
3f3a1543ea
Add HNTrie-based filter classes to store origin-only filters
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/528#issuecomment-484408622

Following STrie-related work in above issue, I noticed that a large
number of filters in EasyList were filters which only had to match
against the document origin. For instance, among just the top 10
most populous buckets, there were four such buckets with over
hundreds of entries each:

- bits: 72, token: "http", 146 entries
- bits: 72, token: "https", 139 entries
- bits: 88, token: "http", 122 entries
- bits: 88, token: "https", 118 entries

These filters in these buckets have to be matched against all
the network requests.

In order to leverage HNTrie for these filters[1], they are now handled
in a special way so as to ensure they all end up in a single HNTrie
(per bucket), which means that instead of scanning hundreds of entries
per URL, there is now a single scan per bucket per URL for these
apply-everywhere filters.

Now, any filter which fulfill ALL the following condition will be
processed in a special manner internally:

- Is of the form `|https://` or `|http://` or `*`; and
- Does have a `domain=` option; and
- Does not have a negated domain in its `domain=` option; and
- Does not have `csp=` option; and
- Does not have a `redirect=` option

If a filter does not fulfill ALL the conditions above, no change
in behavior.

A filter which matches ALL of the above will be processed in a special
manner:

- The `domain=` option will be decomposed so as to create as many
  distinct filter as there is distinct value in the `domain=` option
- This also apply to the `badfilter` version of the filter, which
  means it now become possible to `badfilter` only one of the
  distinct filter without having to `badfilter` all of them.
- The logger will always report these special filters with only a
  single hostname in the `domain=` option.

***

[1] HNTrie is currently WASM-ed on Firefox.
2019-04-19 16:33:46 -04:00
Raymond Hill
a594b3f3d1
Add µBlock.staticNetFilteringEngine.bucketHistogram() as investigative dev tool
Additionally, lower the treshold of trieability to 4 for FilterPlainPrefix1.
2019-04-15 11:45:33 -04:00
Raymond Hill
008370e4b9
Fix https://github.com/uBlockOrigin/uBlock-issues/issues/461
uBO will fallback using a JSON string when trying to encode an array
buffer in Chromium version 59 and earlier.
2019-03-16 09:00:31 -04:00
Raymond Hill
928ab91ab8
Add support to benchmark the dynamic filtering pane
From uBO's dev console, type:
- `µBlock.sessionFirewall.benchmark();`

Keep in mind that it's the temporary ruleset being benchmarked.
2019-02-19 10:46:33 -05:00
Raymond Hill
ed7e34fb07
Refactor selfie generation into a more flexible persistence mechanism
The motivation is to address the higher peak memory usage at launch
time with 3rd-gen HNTrie when a selfie was present.

The selfie generation prior to this change was to collect all
filtering data into a single data structure, and then to serialize
that whole structure at once into storage (using JSON.stringify).

However, HNTrie serialization requires that a large UintArray32 be
converted into a plain JS array, which itslef would be indirectly
converted into a JSON string. This was the main reason why peak
memory usage would be higher at launch from selfie, since the JSON
string would need to be wholly unserialized into JS objects, which
themselves would need to be converted into more specialized data
structures (like that Uint32Array one).

The solution to lower peak memory usage at launch is to refactor
selfie generation to allow a more piecemeal approach: each filtering
component is given the ability to serialize itself rather than to be
forced to be embedded in the master selfie. With this approach, the
HNTrie buffer can now serialize to its own storage by converting the
buffer data directly into a string which can be directly sent to
storage. This avoiding expensive intermediate steps such as
converting into a JS array and then to a JSON string.

As part of the refactoring, there was also opportunistic code
upgrade to ES6 and Promise (eventually all of uBO's code will be
proper ES6).

Additionally, the polyfill to bring getBytesInUse() to Firefox has
been revisited to replace the rather expensive previous
implementation with an implementation with virtually no overhead.
2019-02-14 13:33:55 -05:00
Raymond Hill
261ef8c510
Add support for procedural :not to HTML filtering
Related issue: <https://github.com/gorhill/uBlock/issues/3683>

Additionally, improve compile-time error reporting in the logger
2018-12-15 10:46:17 -05:00
Raymond Hill
5b7a3c9983
fix https://github.com/uBlockOrigin/uBlock-issues/issues/256; add regex support in logger filter field 2018-12-14 11:01:21 -05:00
Raymond Hill
cabb0d36b6
fix https://github.com/gorhill/uBlock/issues/3371 2018-10-23 14:01:08 -03:00
Raymond Hill
777144b036
fix https://github.com/uBlockOrigin/uBlock-issues/issues/200 2018-09-03 16:15:51 -04:00
Raymond Hill
8f1b4b52fd
fix #3606 2018-08-09 11:31:25 -04:00
Raymond Hill
7766786b2c
code review: reuse last decomposed hostname (hit rate = 75%) 2018-06-03 13:27:42 -04:00
Raymond Hill
2c843f6e69
code review: chromium 45 supports arrow functions = start using them 2018-06-01 11:49:48 -04:00
Raymond Hill
798f8dab9d
reduce baseline memory at selfie-load time 2018-06-01 07:54:31 -04:00
Raymond Hill
a9f68fe02f
Fix #3069, and consequently #3374, #3378.
A new filtering class has been created: "static extended filtering".
This new class is an umbrella class for more specialized filtering
engines:
- Cosmetic filtering
- Scriptlet filtering
- HTML filtering

HTML filtering is available only on platforms which support modifying
the response body on the fly, so only Firefox 57+ at the moment.

With the ability to modify the response body, HTML filtering has
been introduced: removing elements from the DOM before the source
data has been parsed by the browser.

A consequence of HTML filtering ability is to bring back script tag
filtering feature.
2017-12-28 13:49:02 -05:00
Raymond Hill
4ab63e70fe
code review: avoid Array.splice/unshift
The array size stays the same, items are just moved around.
2017-12-22 09:37:26 -05:00
Raymond Hill
607968de7f
code review: cache most-recently-used pre-filled scriptlets 2017-12-21 17:05:25 -05:00
gorhill
386e8bee9c
fix #3210 2017-11-09 12:53:05 -05:00
gorhill
6112a68faf
fix #2984 2017-10-21 13:43:46 -04:00
gorhill
9a4681d4e1
fix #2656 2017-05-27 14:31:46 -04:00
gorhill
aae97b8535
fix badfilter option; performance work
- badfilter option was no longer working following last refactoring
  changes.
- performance work:
    - reduce duplication of large strings.
    - new lighter FilterBucket to use when only 2 filters: FilterPair.
2017-05-26 20:00:21 -04:00
gorhill
8d2319e011
fix "purge all" button not disabled when there is nothing left to purge 2017-05-26 08:31:19 -04:00
gorhill
f3e6057e07
fix #2598: refactor to address the cause rather than the symptoms 2017-05-25 17:46:59 -04:00
gorhill
fd03683045
minor code review: it makes no difference, I just prefer no indent there 2017-05-20 16:32:42 -04:00
gorhill
acf7562b0f
minor code review 2017-05-19 20:22:26 -04:00
gorhill
fcf43d972e
tentatively fix issue reported in #2612 re. FFox 24.8.1 2017-05-19 10:12:55 -04:00
gorhill
a222e23e49
fix #2630 2017-05-19 08:45:19 -04:00
gorhill
0232382695
refactor static network filtering, add support for csp injection 2017-05-12 10:35:11 -04:00
gorhill
a4e20ae3ad
new filter option: "badfilter" (see https://github.com/uBlockOrigin/uAssets/issues/192) 2017-03-11 13:55:47 -05:00
gorhill
0b4f31bd8a fix #2344 2017-01-27 13:44:52 -05:00
gorhill
da163bbe4b fix #1641 2016-10-13 13:25:57 -04:00
gorhill
b105010f34 minor code review 2016-10-11 11:53:28 -04:00