mirror of https://github.com/gorhill/uBlock.git synced 2024-09-29 22:27:12 +02:00

32 Commits

Author SHA1 Message Date
Raymond Hill
3c67d2b89f
Add support for entity-matching in domain= filter option
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1008

This commit adds support for entity-matching in the filter
option `domain=`. Example:

    pattern$domain=google.*

The `*` above is meant to match any suffix from the Public
Suffix List. The semantics are exactly the same as the
already existing entity-matching support in static
extended filtering:

- https://github.com/gorhill/uBlock/wiki/Static-filter-syntax#entity
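
As an illustration only, entity-matching can be sketched as below.
This is not uBO's code: `publicSuffixes` is a tiny hypothetical
stand-in for the Public Suffix List, `matchesEntity` is a made-up
helper, and the sketch assumes subdomains of a matching hostname
also match.

```javascript
// Hypothetical, tiny stand-in for the Public Suffix List.
const publicSuffixes = new Set([ 'com', 'ca', 'co.uk', 'com.au' ]);

// Returns true if `hostname` is the entity name (the part before
// `.*`), optionally preceded by subdomain labels, followed by a
// public suffix.
const matchesEntity = (hostname, entity) => {
    if ( entity.endsWith('.*') === false ) { return false; }
    const name = entity.slice(0, -2);      // 'google.*' => 'google'
    const labels = hostname.split('.');
    // Try every split point: labels from `i` on must form a public
    // suffix, labels before `i` must end with the entity name.
    for ( let i = 1; i < labels.length; i++ ) {
        const suffix = labels.slice(i).join('.');
        if ( publicSuffixes.has(suffix) === false ) { continue; }
        const prefix = labels.slice(0, i).join('.');
        if ( prefix === name || prefix.endsWith('.' + name) ) {
            return true;
        }
    }
    return false;
};
```

With this sketch, `google.*` matches `google.com` and
`www.google.co.uk`, but not `google.evil.com` or `notgoogle.com`.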

Additionally, in this commit:

Fix cases where "just-origin" filters of the form `|http*://`
were erroneously normalized to `|http://`. The proper
normalization of `|http*://` is `*`.

Add support for storing hostname strings in the character
buffer of a hntrie container. As of commit time, there are
5,544 instances of FilterOriginHit and 732 instances of
FilterOriginMiss, filters which each require storing/matching
a single hostname string. Those strings are now stored in the
character buffer of the already existing origin-related
hntrie container. (The same approach is used for plain
patterns which are not part of a bidi-trie.)
2020-05-24 10:46:16 -04:00
Raymond Hill
4fa5c6b88e
Fix uselessly allocating one extra WASM page
spotted as a result of stepping in the code. The issue
is that a uBP "page size" might differ from a WASM
page size, which is always 65536 bytes.
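
A minimal sketch of page arithmetic that avoids allocating an
extra page at exact multiples of the WASM page size. The helper
names are hypothetical and this is not the code changed by the
commit; it only illustrates rounding up when, and only when, there
is a partial page.

```javascript
// A WASM page is 65536 bytes, per the WebAssembly specification.
const WASM_PAGE_SIZE = 65536;

// Number of WASM pages needed to hold `byteLength` bytes. A naive
// `Math.floor(byteLength / WASM_PAGE_SIZE) + 1` would allocate one
// page too many whenever `byteLength` is an exact multiple.
const pagesNeeded = byteLength =>
    Math.ceil(byteLength / WASM_PAGE_SIZE);

// Number of pages to add to a memory which currently holds
// `currentPages` pages; never negative.
const pagesToGrow = (currentPages, byteLength) =>
    Math.max(pagesNeeded(byteLength) - currentPages, 0);
```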
2020-05-15 12:03:05 -04:00
Raymond Hill
3621792f16
Rework/remove remnant of code dependent on localStorage
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/899
2020-02-23 12:18:45 -05:00
Raymond Hill
15470bcbdc
Ensure disableWebAssembly setting is loaded before use
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/899

WASM modules are now loaded on demand rather than at
script evaluation time.
2020-02-22 13:36:22 -05:00
Raymond Hill
a69b301d81
Fine-tune new bidi-trie code
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/761
2019-10-29 10:26:34 -04:00
Raymond Hill
7971b22385
Expand bidi-trie usage in static network filtering engine
Related issues:
- https://github.com/uBlockOrigin/uBlock-issues/issues/761
- https://github.com/uBlockOrigin/uBlock-issues/issues/528

The previous bidi-trie code could only hold filters which
are plain patterns, i.e. no wildcard characters, and which
had no origin option (`domain=`), no right and/or left
anchor, and no `csp=` option.

Examples of filters that could be moved into a bidi-trie
data structure:

    &ad_box_
    /w/d/capu.php?z=$script,third-party
    ||liveonlinetv247.com/images/muvixx-150x50-watch-now-in-hd-play-btn.gif

Examples of filters that could NOT be moved to a bidi-trie:

    -adap.$domain=~l-adap.org
    /tsc.php?*&ses=
    ||ibsrv.net/*forumsponsor$domain=[...]
    @@||imgspice.com/jquery.cookie.js|$script
    ||view.atdmt.com^*/iview/$third-party
    ||postimg.cc/image/$csp=[...]

Ideally the filters above should be able to be moved to a
bidi-trie since they are basically plain patterns, or at
least partially moved to a bidi-trie when there is only a
single wildcard (i.e. made of two plain patterns).

Also, there were two distinct bidi-tries into which
plain-pattern filters could be moved: one for patterns
without hostname anchoring and another one for patterns
with hostname anchoring. This was required because the
hostname-anchored patterns have an extra condition which
is outside the bidi-trie's knowledge.

This commit expands the number of filters which can be
stored in the bidi-trie, and also removes the need to
use two distinct bidi-tries.

- Added ability to associate a pattern with an integer
  in the bidi-trie [1].
    - The bidi-trie match code passes this externally
      provided integer when calling an externally
      provided method used for testing extra conditions
      that may be present for a plain pattern found to
      be matching in the bidi-trie.

- Decomposed existing filters into smaller logical units:
    - FilterPlainLeftAnchored =>
        FilterPatternPlain +
        FilterAnchorLeft
    - FilterPlainRightAnchored =>
        FilterPatternPlain +
        FilterAnchorRight
    - FilterExactMatch =>
        FilterPatternPlain +
        FilterAnchorLeft +
        FilterAnchorRight
    - FilterPlainHnAnchored =>
        FilterPatternPlain +
        FilterAnchorHn
    - FilterWildcard1 =>
        FilterPatternPlain + [
          FilterPatternLeft or
          FilterPatternRight
        ]
    - FilterWildcard1HnAnchored =>
        FilterPatternPlain + [
          FilterPatternLeft or
          FilterPatternRight
        ] +
        FilterAnchorHn
    - FilterGenericHnAnchored =>
        FilterPatternGeneric +
        FilterAnchorHn
    - FilterGenericHnAndRightAnchored =>
        FilterPatternGeneric +
        FilterAnchorRight +
        FilterAnchorHn
    - FilterOriginMixedSet =>
        FilterOriginMissSet +
        FilterOriginHitSet
    - Instances of FilterOrigin[...], FilterDataHolder
      can also be added to a composite filter to
      represent `domain=` and `csp=` options.

- Added a new filter class, FilterComposite, for
  filters which are a combination of two or more
  logical units. A FilterComposite instance is a
  match when *all* filters composing it are a
  match.

Since filters are now encoded into combinations of
smaller units, it becomes possible to extract the
FilterPatternPlain component and store it in the
bidi-trie, and use the integer as a handle for the
remaining extra conditions, if any.

Since a single pattern in the bidi-trie may be a
component for different filters, the associated
integer points to a sequence of extra conditions,
and a match occurs as soon as one of the extra
conditions (which may itself be a sequence of
conditions) is fulfilled.

Decomposing filters which are currently single
instances into sequences of smaller logical filters
means increasing the storage and CPU overhead when
evaluating such filters. The CPU overhead is
compensated by the fact that more filters can now
be moved into the bidi-trie, where the first match is
efficiently evaluated. The extra conditions have to
be evaluated if and only if there is a match in the
bidi-trie.

The storage overhead is compensated by the
bidi-trie's intrinsic nature of merging similar
patterns.

Furthermore, the storage overhead is reduced by no
longer using JavaScript arrays to store collections
of filters (which is what FilterComposite is):
the same technique used in [2] is imported to store
sequences of filters.

A sequence of filters is a sequence of integer pairs
where the first integer is an index to an actual
filter instance stored in a global array of filters
(`filterUnits`), while the second integer is an index
to the next pair in the sequence -- which means all
sequences of filters are encoded in one single array
of integers (`filterSequences` => Uint32Array). As
a result, a sequence of filters can be represented by
one single integer -- an index to the first pair --
regardless of the number of filters in the sequence.
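
The encoding described above can be sketched roughly as follows.
Only the names `filterUnits` and `filterSequences` come from this
commit message; the helpers, the reserved index 0 and the growth
strategy are assumptions made for the sake of a self-contained
example, not the actual implementation.

```javascript
// Global store of filter instances. Index 0 is reserved here
// (an assumption) so that 0 can never be a valid unit index.
const filterUnits = [ null ];

// All sequences live in one integer array, as consecutive
// [unit index, index of next pair] pairs. The pair at index 0 is
// reserved (an assumption) so that 0 means "end of sequence".
let filterSequences = new Uint32Array(16);
let filterSequenceWritePtr = 2;

// Store a filter instance, return its index in `filterUnits`.
const addUnit = unit => filterUnits.push(unit) - 1;

// Prepend `unitIndex` to the sequence starting at pair `next`
// (0 = empty sequence); returns the index of the new first pair.
const addToSequence = (unitIndex, next = 0) => {
    if ( filterSequenceWritePtr + 2 > filterSequences.length ) {
        const grown = new Uint32Array(filterSequences.length * 2);
        grown.set(filterSequences);
        filterSequences = grown;
    }
    const i = filterSequenceWritePtr;
    filterSequences[i+0] = unitIndex;
    filterSequences[i+1] = next;
    filterSequenceWritePtr += 2;
    return i;
};

// Walk the sequence starting at pair `i`: it is a match when
// *all* the filter units in the sequence match.
const sequenceMatchesAll = (i, url) => {
    while ( i !== 0 ) {
        const unit = filterUnits[filterSequences[i+0]];
        if ( unit.match(url) === false ) { return false; }
        i = filterSequences[i+1];
    }
    return true;
};
```

Whatever its length, a sequence is referred to by the single
integer returned by the last call to `addToSequence()`.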

This representation is further leveraged to replace
the use of JavaScript arrays in FilterBucket [3],
which used a JavaScript array to store collections
of filters. Doing so means there is no more need for
FilterPair [4], whose purpose was to be a lightweight
representation when there were only two filters in a
collection.

As a result of the above changes, the map of `token`
(integer) => filter instance (object) used to
associate tokens to filters or collections of filters
is replaced with a more efficient map of `token`
(integer) => filter unit index (integer), used to look
up a filter object from the global `filterUnits` array.

Another consequence of using one single global
array to store all filter instances is that we can reuse
existing instances when a logical filter instance is
parameter-less, which is the case for FilterAnchorLeft,
FilterAnchorRight and FilterAnchorHn: the index to these
single instances is reused where needed.

`urlTokenizer` now stores the character codes of the
scanned URL into a bidi-trie buffer, for reuse when
string matching methods are called.

New method: `tokenHistogram()`, used to generate
histograms of occurrences of tokens extracted from URLs
in the built-in benchmark. The top results of the "miss"
histogram are used as "bad tokens", i.e. tokens to
avoid if possible when compiling filter lists.
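
A rough sketch of what building such a histogram involves. This
is not the `tokenHistogram()` method itself:
`sketchTokenHistogram` is a hypothetical helper, the token
definition (runs of `[0-9a-z%]`) is an assumption, and the split
between "hit" and "miss" URLs is left out.

```javascript
// Count how often each token occurs across a list of URLs, and
// return [token, count] pairs sorted by descending count.
const sketchTokenHistogram = urls => {
    const counts = new Map();
    for ( const url of urls ) {
        const tokens = url.toLowerCase().match(/[0-9a-z%]+/g) || [];
        for ( const token of tokens ) {
            counts.set(token, (counts.get(token) || 0) + 1);
        }
    }
    return Array.from(counts).sort((a, b) => b[1] - a[1]);
};
```

Tokens at the top of such a histogram occur in most URLs, so
using them to look up filters discriminates poorly.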

All plain pattern strings are now stored in the
bidi-trie memory buffer, regardless of whether they
will be used in the trie proper or not.

Three methods have been added to the bidi-trie to test
a stored string against the URL, which is also stored in
the bidi-trie.

FilterParser is now instantiated on demand and
released when no longer used.

***

[1] 135a45a878/src/js/strie.js (L120)
[2] e94024d350
[3] 135a45a878/src/js/static-net-filtering.js (L1630)
[4] 135a45a878/src/js/static-net-filtering.js (L1566)
2019-10-21 08:15:58 -04:00
Raymond Hill
be2a950541
Code review of HNTrie/staticNetFilteringEngine
- Remove HNTrieContainer class from global context by
  storing it as a property of µBlock.

- Use block scope to isolate HNTrie-related constants
  from global context.

- Prevent filters which are pure IP address from
  being stored in an HNTrie instance -- as this
  could cause false positives.
2019-06-19 10:00:19 -04:00
Raymond Hill
39e2a03edb
Fix comment 2019-05-14 09:31:51 -04:00
Raymond Hill
3692bb4ada
Add HNTrieRef.dump() and STrieRef.dump() as dev tool
To be used at the console, as an investigation tool for
development purpose.

Using it to verify the content of the largest
FilterHostnameDict instance, I spotted an all-uppercase
hostname in the HNTrieRef instance:

µBlock.staticNetFilteringEngine.categories.get(0).get(0x10000000).dict.dump();

Thus the changes to static-net-filtering.js are to fix
the erroneous insertion of filters with uppercase
characters. The single instance found was a hostname entry
in Malware Domain List (TRIANGLESERVICESLTD dot COM).
2019-05-06 11:12:39 -04:00
Raymond Hill
42bf659695
Revert "Order HNTrie nodes alphabetically to allow for early bailout"
This reverts commit f5f9e05071.
2019-04-30 07:00:52 -04:00
Raymond Hill
f5f9e05071
Order HNTrie nodes alphabetically to allow for early bailout
This commit implements the alphabetical ordering of HNTrie
nodes, so as to make it possible to bail out early at
HNTrie.matches() time.

Contrary to what I expected, there is no performance gain
observed to HNTrie.matches() as per benchmarks -- I find
the results perplexing.

Because of this I will revert this commit immediately.
The purpose of this commit is to record the changes so
that I can bring them back to life in the future whenever
I want to investigate further.
2019-04-30 06:47:54 -04:00
Raymond Hill
adabb56dc9
Do not store impossible to match filters in HNTrie
Consider the two following filters:

    example.com
    www.example.com

This commit makes it so that if the first filter is
already present in a given HNTrie, the second filter
will not be stored, since HNTrie will _always_
return the first filter as a match whenever the
hostname to match is example.com or any subdomain
of example.com.
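
The idea can be illustrated with a Set-based stand-in. This is
not the trie code: `HostnameSet` is a hypothetical class, and
unlike a trie it has to walk the parent domains explicitly to
detect a pointless entry.

```javascript
// Hypothetical Set-based stand-in for an HNTrie instance.
class HostnameSet {
    constructor() {
        this.hostnames = new Set();
    }

    // True if `hostname` or any of its parent domains is stored.
    matches(hostname) {
        for (;;) {
            if ( this.hostnames.has(hostname) ) { return true; }
            const pos = hostname.indexOf('.');
            if ( pos === -1 ) { return false; }
            hostname = hostname.slice(pos + 1);
        }
    }

    // Returns false, and stores nothing, when an already stored
    // hostname would always be returned as the match instead.
    add(hostname) {
        if ( this.matches(hostname) ) { return false; }
        this.hostnames.add(hostname);
        return true;
    }
}
```

Adding `example.com` then `www.example.com` stores only the
first. The sketch does not handle the reverse order, where the
broader hostname is added last.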

The detection of such pointless filters is
virtually free when adding a hostname to an HNTrie
instance (given how data is stored in the trie), so
in practice no overhead is incurred to detect such
pointless filters.

The ability to ignore impossible to match filters
in HNTrie instances will _especially_ benefit those
using large hosts files.

Examples of how this helps using real configurations:

- Default lists:
  444 filters out of 100,382 were ignored as a result
  of this commit.

- Default lists + "Energized Ultimate Protection":
  283,669 filters out of 903,235 were ignored as a
  result of this commit.

Side note: There was no measurable difference between
the two configurations above in the performance of
the matching algorithm as reported by the built-in
benchmark tool.
2019-04-29 13:15:16 -04:00
Raymond Hill
e0d2285da0
Convert HNTrie code to ES6 class 2019-04-25 19:38:07 -04:00
Raymond Hill
155abfba18
Cache and reuse result of HNTrieRef.matches() when possible
Due to how web pages typically load secondary resources and due
to how HNTrieContainer instances are used in uBO, there is a
great likelihood that the result of a previous call to
HNTrieRef.matches() can be reused in a subsequent call.
This has been confirmed by instrumenting HNTrieRef.matches().

Since uBO uses distinct HNTrieContainer instances to either
match against the request or the origin hostnames, this
means a high likelihood of repeated calls to HNTrieRef.matches()
with the same hostname as argument, hence a performance gain
when caching the argument+result -- although HNTrie.matches()
is fast, comparing two short strings is even faster when this
allows skipping HNTrie.matches() altogether.
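
A minimal sketch of the caching idea, assuming a single-entry
cache keyed on the last argument. `CachedMatcher` and `matchFn`
are hypothetical names; this is not the HNTrieRef code.

```javascript
// Wraps a matching function and remembers the last argument and
// its result, so that repeated calls with the same hostname skip
// the underlying lookup.
class CachedMatcher {
    constructor(matchFn) {
        this.matchFn = matchFn;
        this.lastNeedle = undefined;
        this.lastResult = false;
    }

    matches(needle) {
        if ( needle !== this.lastNeedle ) {
            this.lastNeedle = needle;
            this.lastResult = this.matchFn(needle);
        }
        return this.lastResult;
    }

    // Must be called whenever the underlying set of hostnames
    // changes, otherwise a stale result could be returned.
    reset() {
        this.lastNeedle = undefined;
        this.lastResult = false;
    }
}
```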
2019-04-25 18:36:03 -04:00
Raymond Hill
fa83744b58
Use a sequence of base 64 numbers to encode array buffers
The purpose of using a custom base128 encoder is to
convert array buffers into strings, to allow a direct
string-to-array buffer conversion at load time:

  string => array buffer

Whereas a JSON array would require an extra step:

  JSON array as string => JS array => array buffer

Turns out that the current use of a custom base128 encoding
results in a significantly larger selfie storage usage when
converting array buffers into strings.

Speculation: possibly the browser converts the strings to
save into JSON strings internally. Since the custom base128
encoder is likely to cause the resulting string to contain
a lot of unprintable ASCII characters, these will need to
be escaped when converted to JSON -- escaped characters
occupy more space than non-escaped ones.

Using a sequence of base 64 numbers means only printable
characters will be present in the output string, hence no escaping
necessary. I have observed significant reduction in
storage usage for selfie purpose.
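
To illustrate the principle, here is a sketch which encodes each
32-bit value as a fixed run of six base 64 digits. The alphabet,
the fixed width and the helper names are all assumptions; the
encoder in this commit may well use a different layout.

```javascript
// 64 printable characters, none of which needs escaping in JSON.
const DIGITS =
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Encode each 32-bit value as six base 64 digits, least
// significant digit first (6 x 6 = 36 bits, enough for 32).
const encodeUint32Array = arr => {
    let out = '';
    for ( let v of arr ) {
        for ( let i = 0; i < 6; i++ ) {
            out += DIGITS[v & 0x3F];
            v >>>= 6;
        }
    }
    return out;
};

// Direct string => array buffer conversion, no JSON involved.
const decodeUint32Array = str => {
    const arr = new Uint32Array(str.length / 6);
    for ( let i = 0; i < arr.length; i++ ) {
        let v = 0;
        // Multiplication rather than shifts, to stay clear of
        // signed 32-bit overflow.
        for ( let j = 5; j >= 0; j-- ) {
            v = v * 64 + DIGITS.indexOf(str[i * 6 + j]);
        }
        arr[i] = v;
    }
    return arr;
};
```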
2019-04-20 09:06:54 -04:00
Raymond Hill
008370e4b9
Fix https://github.com/uBlockOrigin/uBlock-issues/issues/461
uBO will fall back to using a JSON string when trying to encode an array
buffer in Chromium version 59 and earlier.
2019-03-16 09:00:31 -04:00
Raymond Hill
ed7e34fb07
Refactor selfie generation into a more flexible persistence mechanism
The motivation is to address the higher peak memory usage at launch
time with 3rd-gen HNTrie when a selfie was present.

The selfie generation prior to this change was to collect all
filtering data into a single data structure, and then to serialize
that whole structure at once into storage (using JSON.stringify).

However, HNTrie serialization requires that a large Uint32Array be
converted into a plain JS array, which itself would be indirectly
converted into a JSON string. This was the main reason why peak
memory usage would be higher at launch from selfie, since the JSON
string would need to be wholly unserialized into JS objects, which
themselves would need to be converted into more specialized data
structures (like that Uint32Array one).

The solution to lower peak memory usage at launch is to refactor
selfie generation to allow a more piecemeal approach: each filtering
component is given the ability to serialize itself rather than to be
forced to be embedded in the master selfie. With this approach, the
HNTrie buffer can now serialize to its own storage by converting the
buffer data directly into a string which can be directly sent to
storage. This avoids expensive intermediate steps such as
converting into a JS array and then to a JSON string.

As part of the refactoring, there was also opportunistic code
upgrade to ES6 and Promise (eventually all of uBO's code will be
proper ES6).

Additionally, the polyfill to bring getBytesInUse() to Firefox has
been revisited to replace the rather expensive previous
implementation with an implementation with virtually no overhead.
2019-02-14 13:33:55 -05:00
Raymond Hill
fc03782985
Ensure that WASM module was actually loaded 2019-02-01 09:09:51 -05:00
Raymond Hill
69c87c5117
Fix Promise chain of WASM module load operations
The Promise chain was not properly designed for WASM module
loading. This became apparent when removing WASM modules
from Opera build[1].

The problem was that errors thrown by fetch() -- used to
load WASM modules -- were not properly handled.

[1] Opera refuses updating uBO if there are unrecognized file
types in the package, and `.wasm`/`.wat` files are not
recognized by Opera uploader.
2019-02-01 08:20:43 -05:00
Raymond Hill
1b6fea16da
3rd-gen hntrie, suitable for large set of hostnames 2018-12-04 13:02:09 -05:00
Raymond Hill
2a91a685ce
code review: fix handling of too long needles 2018-11-19 14:04:26 -05:00
Raymond Hill
2189f020df
add new advanced setting to disable use of WASM for dev purpose 2018-11-16 10:19:06 -05:00
Raymond Hill
19b7cbca55
minor review of hntrie code 2018-11-06 13:38:37 -02:00
Raymond Hill
a42513aa2f
minor code review 2018-11-04 19:26:02 -02:00
Raymond Hill
95899a0d1d
be explicit about where the related wasm file is fetched 2018-11-04 18:52:25 -02:00
Raymond Hill
d7d544cda0
Squashed commit of the following:
commit 7c6cacc59b27660fabacb55d668ef099b222a9e6
Author: Raymond Hill <rhill@raymondhill.net>
Date:   Sat Nov 3 08:52:51 2018 -0300

    code review: finalize support for wasm-based hntrie

commit 8596ed80e3bdac2c36e3c860b51e7189f6bc8487
Merge: cbe1f2e 000eb82
Author: Raymond Hill <rhill@raymondhill.net>
Date:   Sat Nov 3 08:41:40 2018 -0300

    Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm

commit cbe1f2e2f38484d42af3204ec7f1b5decd30f99e
Merge: 270fc7f dbb7e80
Author: Raymond Hill <rhill@raymondhill.net>
Date:   Fri Nov 2 17:43:20 2018 -0300

    Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm

commit 270fc7f9b3b73d79e6355522c1a42ce782fe7e5c
Merge: d2a89cf d693d4f
Author: Raymond Hill <rhill@raymondhill.net>
Date:   Fri Nov 2 16:21:08 2018 -0300

    Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm

commit d2a89cf28f0816ffd4617c2c7b4ccfcdcc30e1b4
Merge: d7afc78 649f82f
Author: Raymond Hill <rhill@raymondhill.net>
Date:   Fri Nov 2 14:54:58 2018 -0300

    Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm

commit d7afc78b5f5675d7d34c5a1d0ec3099a77caef49
Author: Raymond Hill <rhill@raymondhill.net>
Date:   Fri Nov 2 13:56:11 2018 -0300

    finalize wasm-based hntrie implementation

commit e7b9e043cf36ad055791713e34eb0322dec84627
Author: Raymond Hill <rhill@raymondhill.net>
Date:   Fri Nov 2 08:14:02 2018 -0300

    add first-pass implementation of wasm version of hntrie

commit 1015cb34624f3ef73ace58b58fe4e03dfc59897f
Author: Raymond Hill <rhill@raymondhill.net>
Date:   Wed Oct 31 17:16:47 2018 -0300

    back up draft work toward experimenting with wasm hntries
2018-11-03 08:58:46 -03:00
gorhill
e83ffde5af
code review for #3328 2017-12-08 07:07:05 -05:00
gorhill
c7e8b65b6c
fix #3328 2017-12-08 00:33:02 -05:00
gorhill
4d20950dfa
save investigative work for the future re. wasm 2017-11-05 12:33:46 -05:00
gorhill
da605f53a6
code review: avoid pointless test for single-char cells 2017-11-05 06:45:43 -05:00
gorhill
22c460d52f
just edit comments 2017-11-03 08:36:16 -04:00
gorhill
5928996f2a
address #3193 2017-11-02 15:49:11 -04:00