Whereas before the string segment was encoded as:
LL OOOOOOOOOOOO
where L are the upper 8 bits and used to encode the length
of the segment, and O are the lower 24 bits and used to
encode the offset of the string data in the character
buffer, the new code encode as follow:
OOOOOOOOOOOO LL
And furthermore the most significant bit of the length
LL is now used to mark whether the current string segment
is a label boundary.
This means a cell can't reference a segment longer then
127 characters. To work around this limitation for when a
segment is longer than 127 characters (a rare occurrence),
the algorithm will simply split the segment into multiple
adjacent cells.
As a result, there is no longer a need to encode
"boundariness" into special cells, which simplifies
both the storing and matching algorithms.
Additionally, added minimal documentation for the NPM
package on how to import and use HNTrieContainer as a
standalone API.
The erroneous test does not seem to interfere
with the proper functioning of the trie, due
to the fact that nodes are never split without
a OR node or boundary node being present.
The issue was found when undertaking a rewrite
of the algorithm to avoid having to create
boundary nodes.
This fix the case of the following filter:
trk*.vidible.tv
Not matching:
https://trk.vidible.tv/trk/.vidible.tv
The wildcard is supposed to match any number of
characters, including zero characters. The issue
is that the code was not matching zero characters.
This is due to an incorrect comparison in
BidiTrieContainer.indexOf(), causing the code to
bail out before testing for the zero character
condition.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/882
Related commits:
- https://github.com/gorhill/uBlock/commit/a95ef16e064a
- https://github.com/gorhill/uBlock/commit/7971b223855d
Leading wildcards before valid token characters need to
be kept in order to respect the semantic of the filter.
A leading wildcard in such case changes the semantic of
a filter, i.e. two following filters are semantically
different:
example/abc
*example/abc
As a result, µBlock.BidiTrieContainer.indexOf() is now
able to deal with a needle of length zero -- which is
what happens in FilterPatternLeft(Ex) with filter
patterns starting with `*` (or `^*`) and followed by
valid token characters (0-9, a-z and %).
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/761
Changes related to above issue made it possible to
create WASM versions of methods used in the bidi-trie.
In this commit, WASM versions for startsWith(), indexOf()
and lastIndexOf() have been implemented.
Consider the two following filters:
example.com
www.example.com
This commit make it so that if the first filter is
already present in a given HNTrie, the second filter
will not be stored, since HNTrie will _always_
return the first filter as a match whenever the
hostname to match is example.com or any subdomain
of example.com.
The detection of such pointless filters is
virtually free when adding a hostname to an HNTrie
instance (given how data is stored in the trie), so
in practice no overhead is incurred to detect such
pointless filters.
The ability to ignore impossible to match filters
in HNTrie instances will _especially_ benefit those
using large hosts files.
Examples of how this helps using real configurations:
- Default lists:
444 filters out of 100,382 were ignored as a result
of this commit.
- Default lists + "Energized Ultimate Protection":
283,669 filters out of 903,235 were ignored as a
result of this commit.
Side note: There was no measurable difference between
the two configurations above in the performance of
the matching algorithm as reported by the built-in
benchmark tool.
commit 7c6cacc59b27660fabacb55d668ef099b222a9e6
Author: Raymond Hill <rhill@raymondhill.net>
Date: Sat Nov 3 08:52:51 2018 -0300
code review: finalize support for wasm-based hntrie
commit 8596ed80e3bdac2c36e3c860b51e7189f6bc8487
Merge: cbe1f2e 000eb82
Author: Raymond Hill <rhill@raymondhill.net>
Date: Sat Nov 3 08:41:40 2018 -0300
Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm
commit cbe1f2e2f38484d42af3204ec7f1b5decd30f99e
Merge: 270fc7f dbb7e80
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 17:43:20 2018 -0300
Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm
commit 270fc7f9b3b73d79e6355522c1a42ce782fe7e5c
Merge: d2a89cf d693d4f
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 16:21:08 2018 -0300
Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm
commit d2a89cf28f0816ffd4617c2c7b4ccfcdcc30e1b4
Merge: d7afc78 649f82f
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 14:54:58 2018 -0300
Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm
commit d7afc78b5f5675d7d34c5a1d0ec3099a77caef49
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 13:56:11 2018 -0300
finalize wasm-based hntrie implementation
commit e7b9e043cf36ad055791713e34eb0322dec84627
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 08:14:02 2018 -0300
add first-pass implementation of wasm version of hntrie
commit 1015cb34624f3ef73ace58b58fe4e03dfc59897f
Author: Raymond Hill <rhill@raymondhill.net>
Date: Wed Oct 31 17:16:47 2018 -0300
back up draft work toward experimenting with wasm hntries