Turns out the various benchmarks show no benefits when compiling
filters whose pattern contains a single wildcard character into
specialized classes which threat the pattern as two sub-patterns,
and actually there is a slight improvement in performance as per
benchamrks when treating these patterns as generic ones.
This also fixes the following related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1207
The original motivation is to further speed up launch time
for either non-selfie-based and selfie-based initialization
of the static network filtering engine (SNFE).
As a result of the refactoring:
Filters are no longer instance-based, they are sequence-of-
integer-based. This eliminates the need to create instances
of filters at launch, and consequently eliminates all the
calls to class constructors, the resulting churning of memory,
and so forth.
All the properties defining filter instances are now as much
as possible 32-bit integer-based, and these are allocated in a
single module-scoped typed array -- this eliminates the need
to allocate memory for every filter being instantiated.
Not all filter properties can be represented as a 32-bit
integer, and in this case a filter class can allocate slots
into another module-scoped array of references.
As a result, this eliminates a lot of memory allocations when
the SNFE is populated with filters, and this makes the saving
and loading of selfie more straightforward, as the operation
is reduced to saving/loading two arrays, one of 32-bit
integers, and the other, much smaller, an array JSON-able
values.
All filter classes now only contain static methods, and all
of these methods are called with an index to the specific
filter data in the module-scoped array of 32-bit integers.
The filter sequences (used to avoid the use of JS arrays) are
also allocated in the single module-scoped array of 32-bit
integers -- they used to be stored in their own dedicated
array.
Additionally, some filters are now loaded more in a deferred
way, so as reduce uBO's time-to-readiness -- the outcome of
this still needs to be evaluated, time-to-readiness is
especially a concern in Firefox for Android or less powerful
computers.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1730
A new filter unit, FilterNotType, is introduced to enforce
negated filter type options.
Before this commit, there was no actual negated types in the
static network filtering engine, as a negated type was internally
converted to non-negated types at compile time. As a result,
the logger would never output a matching filter with its original
negated type options.
This commit no longer causes an internal conversion to take place
at compile time, but explicitly enforce negated types at match time,
and as a result the logger will from now on output matching filter
with their original negated type options.
Name: modifyWebextFlavor
Value: A list of space-separated tokens to be added/removed from the
computed default webext flavor.
The primary purpose is to give filter list authors the ability to
test mobile flavor on desktop computers. Though mobile versions of
web pages can be emulated using browser dev tools, it's not
possible to do so for uBO itself.
By using `+mobile` as a value for this setting will force uBO
to act as if it's being executed on a mobile device.
Important: this setting is best used in a dedicated browser
profile, as this affects how filter lists are compiled. So best
to set it in a new browser profile, then force all filter lists
to be recompiled, and use the profile in the future when there
is a need to test the specific webext flavor.
When matching a network request in the static network filtering
engine ("snfe"), these are the possible outcomes, from most
to least likely:
- No block
- Block
- Unblock ("exception" filter overriding the block)
- Block-important ("important" filter override the unblock)
Hence why the matching in the snfe always check for a match in
the "block" realm, and the "unblock" realm would be checked
if and only if there was a match in the "block" realm.
However the "block-important" realm was always matched against
first, and when a match in that realm was found, there would
be no need to check in other realms since nothing can override
the "important" option. The problem with this approach though
is that matches in the "block-important" realm are most
unlikely, which means pointless work being done for vast
majority of network requests.
This commit makes it so that the "block-important" realm is
matched against ONLY when there is a matched "unblock" filter.
The result is a measurable improvement in the snfe-related
benchmarks (though given the numbers involved, end users won't
perceive a difference).
Somewhat related discussion which was the motivation to look
more into this:
https://github.com/cliqz-oss/adblocker/discussions/2170#discussioncomment-1168125
Whereas before the string segment was encoded as:
LL OOOOOOOOOOOO
where L are the upper 8 bits and used to encode the length
of the segment, and O are the lower 24 bits and used to
encode the offset of the string data in the character
buffer, the new code encode as follow:
OOOOOOOOOOOO LL
And furthermore the most significant bit of the length
LL is now used to mark whether the current string segment
is a label boundary.
This means a cell can't reference a segment longer then
127 characters. To work around this limitation for when a
segment is longer than 127 characters (a rare occurrence),
the algorithm will simply split the segment into multiple
adjacent cells.
As a result, there is no longer a need to encode
"boundariness" into special cells, which simplifies
both the storing and matching algorithms.
Additionally, added minimal documentation for the NPM
package on how to import and use HNTrieContainer as a
standalone API.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1664
The various filtering engine benchmarking functions are best
isolated in their own file since they have specific
dependencies that should not be suffered by the filtering
engines.
Additionally, moved decomposeHostname() into uri-utils.js
as it's a hostname-related function required by many
filtering engine cores -- this allows to further reduce
or outright remove dependency on `µb`.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1664
The changes are enough to fulfill the related issue.
A new platform has been added in order to allow for building
a NodeJS package. From the root of the project:
./tools/make-nodejs
This will create new uBlock0.nodejs directory in the
./dist/build directory, which is a valid NodeJS package.
From the root of the package, you can try:
node test
This will instantiate a static network filtering engine,
populated by easylist and easyprivacy, which can be used
to match network requests by filling the appropriate
filtering context object.
The test.js file contains code which is typical example
of usage of the package.
Limitations: the NodeJS package can't execute the WASM
versions of the code since the WASM module requires the
use of fetch(), which is not available in NodeJS.
This is a first pass at modularizing the codebase, and
while at it a number of opportunistic small rewrites
have also been made.
This commit requires the minimum supported version for
Chromium and Firefox be raised to 61 and 60 respectively.
This advanced setting is not really needed, as the
same can be accomplished with a broad exception
filter such as `#@#+js()`.
Related feedback:
- f5b453fae3 (commitcomment-49499082)
Ever since the `redirect` code was refactored:
157cef6034
This advanced setting is no longer needed, as the same
can be accomplished with a plain network filter:
@@*$redirect-rule
The syntax to remove response header is a special case
of HTML filtering, whereas the response headers are
targeted, rather than the response body:
example.com##^responseheader(header-name)
Where `header-name` is the name of the header to
remove, and must always be lowercase.
The removal of response headers can only be applied to
document resources, i.e. main- or sub-frames.
Only a limited set of headers can be targeted for
removal:
location
refresh
report-to
set-cookie
This limitation is to ensure that uBO never lowers the
security profile of web pages, i.e. we wouldn't want to
remove `content-security-policy`.
Given that the header removal occurs at onHeaderReceived
time, this new ability works for all browsers.
The motivation for this new filtering ability is instance
of website using a `refresh` header to redirect a visitor
to an undesirable destination after a few seconds.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1513
Prior to this commit, the ability to enable/disable the
uncloaking of canonical names was only available to advanced
users. This commit make it so that the setting can be
toggled from the _Settings_ pane.
The setting is enabled by default. The documentation should
be clear that the setting should not be disabled unless it
actually solves serious network issues, for example:
https://bugzilla.mozilla.org/show_bug.cgi?id=1694404
Also, as a result, the advanced setting `cnameUncloak` is no
longer available from within the advanced settings editor.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1480
Forward compatiblity was broken due to `externalLists`
being converted into an Array from a string, i.e.
downgrading to uBO 1.32.4 was completely breaking uBO.
This commit restores `externalLists` as a string which
is what older versions of uBO expect.
A new property `importedLists` has been created to
hold the imported lists as an array, while
`externalLists` will be kept around for a while until
it is completely removed in some future.
The managed `userSettings` entry is an array of entries,
where each entry is a name/value pair encoded into an array
of strings.
The first item in the entry array is the name of a setting,
and the second item is the stringified value for the
setting.
This is a more convenient way for administrators to set
specific user settings. The settings set through
`userSettings` policy will always be set at uBO launch
time.
This should improve usability of uBO's hard-mode
and "relax blocking mode" operations. This is the
new default behavior.
The previous behavior of forcing a reload of the
page can be re-enabled by simply setting the `3p`
bit of the advanced setting `blockingProfiles`
to 1.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1365
This commit adds the compiled magic version number to the
compiled data itself, and consequently this allows uBO
to no longer require that any given compiled list with a
mismatched format to be detected and discarded at launch
time.
Given this change, uBO no longer needs to rely on the
deletion of cached data at launch time to ensure it
won't use no longer valid compiled lists.
Related discussions:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1356#issuecomment-732411286
- https://github.com/AdguardTeam/CoreLibs/issues/1384
Changes:
Negation character is `~` (instead of `!`).
Drop special anchor character `|` -- leading `|`
will be supported until no such filter is present
in uBO's own filter lists. For example, instance
of `queryprune=|ad` will have to be replaced with
`queryprune=/^ad/` (or `queryprune=ad` if the name
of the parameter to remove is exactly `ad`).
Align semantic with that of AdGuard's `removeparam=`,
except that specifying multiple `|`-separated names
is not supported.
`match-case`
------------
Related issue:
- https://github.com/uBlockOrigin/uAssets/issues/8280#issuecomment-735245452
The new filter option `match-case` can be used only for
regex-based filters. Using `match-case` with any other
sort of filters will cause uBO to discard the filter.
`redirect=`
-----------
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1366
`redirect=` filters with unresolvable resource token at
runtime will be discarded.
Additionally, the implicit priority is now set to 1
(was 0). The idea is to allow custom `redirect=` filters
to be used strictly as fallback `redirect=` filters in case
another `redirect=` filter is not picked up.
For example, one might create a `redirect=click2load.html:0`
filter, to be taken if and only if the blocked resource is
not already being redirected by another "official" filter
in one of the enabled filter lists.
New filter options
==================
Strict partyness: `1P`, `3P`
----------------------------
The current options 1p/3p are meant to "weakly" match partyness, i.e. a
network request is considered 1st-party to its context as long as both the
context and the request share the same base domain.
The new partyness options are meant to check for strict partyness, i.e. a
network request will be considered 1st-party if and only if both the context
and the request share the same hostname.
For examples:
- context: `www.example.org`
- request: `www.example.org`
- `1p`: yes, `1P`: yes
- `3p`: no, `3P`: no
- context: `www.example.org`
- request: `subdomain.example.org`
- `1p`: yes, `1P`: no
- `3p`: no, `3P`: yes
- context: `www.example.org`
- request: `www.example.com`
- `1p`: no, `1P`: no
- `3p`: yes, `3P`: yes
The strict partyness options will be visually emphasized in the editor so as
to prevent mistakenly using `1P` or `3P` where weak partyness is meant to be
used.
Filter on response headers: `header=`
-------------------------------------
Currently experimental and under evaluation. Disabled by default, enable by
toggling `filterOnHeaders` to `true` in advanced settings.
Ability to filter network requests according to whether a specific response
header is present and whether it matches or does not match a specific value.
For example:
*$1p,3P,script,header=via:1\.1\s+google
The above filter is meant to block network requests which fullfill all the
following conditions:
- is weakly 1st-party to the context
- is not strictly 1st-party to the context
- is of type `script`
- has a response HTTP header named `via`, which value matches the regular
expression `1\.1\s+google`.
The matches are always performed in a case-insensitive manner.
The header value is assumed to be a literal regular expression, except for
the following special characters:
- to anchor to start of string, use leading `|`, not `^`
- to anchor to end of string, use trailing `|`, not `$`
- to invert the test, use a leading `!`
To block a network request if it merely contains a specific HTTP header is
just a matter of specifying the header name without a header value:
*$1p,3P,script,header=via
Generic exception filters can be used to disable specific block `header=`
filters, i.e. `@@*$1p,3P,script,header` will override the block `header=`
filters given as example above.
Dynamic filtering's `allow` rules override block `headers=` filters.
Important: It is key that filter authors use as many narrowing filter options
as possible when using the `header=` option, and the `header=` option should
be used ONLY when other filter options are not sufficient.
More documentation justifying the purpose of `header=` option will be
provided eventually if ever it is decided to move it from experimental to
stable status.
To be decided: to restrict usage of this filter option to only uBO's own
filter lists or "My filters".
Changes
=======
Fine tuning `queryprune=`
-------------------------
The following changes have been implemented:
The special value `*` (i.e. `queryprune=*`) means "remove all query
parameters".
If the `queryprune=` value is made only of alphanumeric characters
(including `_`), the value will be internally converted to regex equivalent
`^value=`. This ensures a better future compatibility with AdGuard's
`removeparam=`.
If the `queryprune=` value starts with `!`, the test will be inverted. This
can be used to remove all query parameters EXCEPT those who match the
specified value.
Other
-----
The legacy code to test for spurious CSP reports has been removed. This
is no longer an issue ever since uBO redirects to local resources through
web accessible resources.
Notes
=====
The following new and recently added filter options are not compatible with
Chromium's manifest v3 changes:
- `queryprune=`
- `1P`
- `3P`
- `header=`
Performance-related work.
There is a fair number of filters which can't be tokenized
in uBO's own filter lists. Majority of those filters also
declare a `domain=` option, examples:
*$script,redirect-rule=noopjs,domain=...
*$script,3p,domain=...,denyallow=...
*$frame,3p,domain=...
Such filters can be found in uBO's asset viewer using the
following search expression:
/^\*?\$[^\n]*?domain=/
Some filter buckets will contain many of those filters, for
instance one of the bucket holding untokenizable `redirect=`
filters has over 170 entries, which must be all visited when
collating all matching `redirect=` filters.
When a bucket contains many such filters, I found that it's
worth to extract all the non-negated hostname values from
`domain=` options into a single hntrie and perform a pre-test
at match() time to find out whether the current origin of a
network request matches any one of the collected hostnames,
so as to avoid iterating through all the filters.
Since there is rarely a match() for vast majority of network
requests with `domain=` option, this pre-test saves a good
amount of work, and this is measurable with the built-in
benchmark.
This commit moves the parsing, compiling and enforcement
of the `redirect=` and `redirect-rule=` network filter
options into the static network filtering engine as
modifier options -- just like `csp=` and `queryprune=`.
This solves the two following issues:
- https://github.com/gorhill/uBlock/issues/3590
- https://github.com/uBlockOrigin/uBlock-issues/issues/1008#issuecomment-716164214
Additionally, `redirect=` option is not longer afflicted
by static network filtering syntax quirks, `redirect=`
filters can be used with any other static filtering
modifier options, can be excepted using `@@` and can be
badfilter-ed.
Since more than one `redirect=` directives could be found
to apply to a single network request, the concept of
redirect priority is introduced.
By default, `redirect=` directives have an implicit
priority of 0. Filter authors can declare an explicit
priority by appending `:[integer]` to the token of the
`redirect=` option, for example:
||example.com/*.js$1p,script,redirect=noopjs:100
The priority dictates which redirect token out of many
will be ultimately used. Cases of multiple `redirect=`
directives applying to a single blocked network request
are expected to be rather unlikely.
Explicit redirect priority should be used if and only if
there is a case of redirect ambiguity to solve.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/760
The purpose of this new network filter option is to remove
query parameters form the URL of network requests.
The name `queryprune` has been picked over `querystrip`
since the purpose of the option is to remove some
parameters from the URL rather than all parameters.
`queryprune` is a modifier option (like `csp`) in that it
does not cause a network request to be blocked but rather
modified before being emitted.
`queryprune` must be assigned a value, which value will
determine which parameters from a query string will be
removed. The syntax for the value is that of regular
expression *except* for the following rules:
- do not wrap the regex directive between `/`
- do not use regex special values `^` and `$`
- do not use literal comma character in the value,
though you can use hex-encoded version, `\x2c`
- to match the start of a query parameter, prepend `|`
- to match the end of a query parameter, append `|`
`queryprune` regex-like values will be tested against each
key-value parameter pair as `[key]=[value]` string. This
way you can prune according to either the key, the value,
or both.
This commit introduces the concept of modifier filter
options, which as of now are:
- `csp=`
- `queryprune=`
They both work in similar way when used with `important`
option or when used in exception filters. Modifier
options can apply to any network requests, hence the
logger reports the type of the network requests, and no
longer use the modifier as the type, i.e. `csp` filters
are no longer reported as requests of type `csp`.
Though modifier options can apply to any network requests,
for the time being the `csp=` modifier option still apply
only to top or embedded (frame) documents, just as before.
In some future we may want to apply `csp=` directives to
network requests of type script, to control the behavior
of service workers for example.
A new built-in filter expression has been added to the
logger: "modified", which allow to see all the network
requests which were modified before being emitted. The
translation work for this new option will be available
in a future commit.
Related feedback:
- https://github.com/uBlockOrigin/uBlock-issues/issues/401#issuecomment-703075797
Name: `uiTheme`
Default: `unset`
Values:
- `unset`: uBO will pick the theme according to
browser's `prefers-color-scheme`
- `light`: force light scheme
- `dark`: force dark theme
This advanced setting is not to be documented yet as
it has not been decided this is a long term solution.
This commit adds concept of "really bad list" to the
badlists infrastructure. Really bad lists won't be
fetched from a remote server, while plain bad list
will be fetched but won't be compiled.
A really bad list is denoted by the `nofetch` token
following the URL.
Really bad lists can cause more serious issues such
as causing undue launch delays because the remote
server where a really bad list is hosted fails to
respond properly and times out.
Such an example of really bad list is hpHosts which
original server no longer exist.
Many filter lists are known to cause serious filtering
issues in uBO and are not meant to be used in uBO.
Unfortunately, unwitting users keep importing these
filter lists and as a result this ends up causing
filtering issues for which the resolution is always
to remove the incompatible filter list.
Example of inconpatible filter lists:
- Reek's Anti-Adblock Killer
- AdBlock Warning Removal List
- ABP anti-circumvention filter list
uBO will use the following resource to know
which filter lists are incompatible:
- https://github.com/uBlockOrigin/uAssets/blob/master/filters/badlists.txt
Incompatible filter lists can still be imported into
uBO, useful for asset-viewing purpose, but their content
will be discarded at compile time.
Cloud storage is a limited resource, and thus it
makes sense to support data compression before
sending the data to cloud storage.
A new hidden setting allows to toggle on
cloud storage compression:
name: cloudStorageCompression
default: false
By default, this hidden setting is `false`, and a
user must set it to `true` to enable compression
of cloud storage items.
This hidden setting will eventually be toggled
to `true` by default, when there is good confidence
a majority of users are using a version of uBO
which can properly handle compressed cloud storage
items.
A cursory assessment shows that compressed items
are roughly 40-50% smaller in size.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1008
This commit adds support entity-matching in the filter
option `domain=`. Example:
pattern$domain=google.*
The `*` above is meant to match any suffix from the Public
Suffix List. The semantic is exactly the same as the
already existing entity-matching support in static
extended filtering:
- https://github.com/gorhill/uBlock/wiki/Static-filter-syntax#entity
Additionally, in this commit:
Fix cases where "just-origin" filters of the form `|http*://`
were erroneously normalized to `|http://`. The proper
normalization of `|http*://` is `*`.
Add support to store hostname strings into the character
buffer of a hntrie container. As of commit time, there are
5,544 instances of FilterOriginHit, and 732 instances of
FilterOriginMiss, which filters require storing/matching a
single hostname string. Those strings are now stored in the
character buffer of the already existing origin-related
hntrie container. (The same approach is used for plain
patterns which are not part of a bidi-trie.)
There have been too many examples out there of users
opting-in to "I am an advanced user" and yet still misusing
dynamic filtering by creating _allow_ rules where _noop_
rules should be used.
Creating _allow_ rules has serious consequences as these
override blocking static filters and can potentially
disable other advanced filtering ability such as
HTML filtering and scriptlet injection -- often used
to deal with anti-blocker mechanisms.
The ability to point-and-click to create _allow_ rules
from the popup panel is no longer allowed by default.
An new advanced setting has been added to enable
the ability to create _allow_ rules from the popup
panel, `popupPanelGodMode`, which default to `false`.
Set to `true` to restore ability to set _allow_ rules
from popup panel.
Since the creation of _allow_ rules is especially useful
to filter list authors, to diagnose and narrow down site
breakage as a result of problematic blocking filter,
the creation of _allow_ rules will still be available
when the advanced setting `filterAuthorMode` is `true`.
This change is probably going to be problematic to all
those users who were misusing dynamic filtering by
creating _allow_ rules instead of _noop_ rules -- but
the breakage is going to bring their misusing to their
attention, a positive outcome.
Default to `unset`.
To allow users to bypass uBO's default CSS styles in
case they are causing issues to specific users. It is
the responsibility of the user to ensure the value of
`uiStyles` contains valid CSS property declarations.
uBO will assign the value to `document.body.style.cssText`.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/1044
For example, in the case of the issue above, one could
set `uiStyles` to `font-family: sans-serif` to force uBO
to the system font for its user interface.
FilterPlainHostname, an atomic filter unit, has been
removed and is being replaced with a composite filter
made of a pattern filter and a filter which test
hostname boundaries.
Doing so enables filters formerly being represented
by FilterPlainHostname to be now represented as a
plain pattern, and thus to be potentially stored in
a bidi-trie.
Comparing the new filter histogram with the previous
one:
FilterPatternPlain 24612 26432 1820
FilterComposite 17656 17125 -531
FilterPlainTrie Content 12977 13519 542
FilterPlainHostname 2904 0 -2904
FilterBucket 2121 1961 -160
FilterPlainTrie 1418 1578 160
Which means:
- An extra 542 patterns could be stored in bidi-tries
- There are 531 less composite filters needed
- An extra 160 buckets could be aggregated into 160
bidi-trie
Memory-wise, it's a marginal gain (as per Chromium's
Javascript VM instance figure) -- i.e. not worth
talking about). CPU-wise, no measurable difference.
The benefit is that I consider this conceptually
simplifies slightly the static network filtering
code base.