
Content Categorizer

Foxy categorizes websites by their contents. Content category labels (like Porn, Hate, Gambling, etc.) are attached to HTTP transactions and may be used by content filters to block access to objectionable content.

Categorization is straightforward and fully controlled by the user, who can define any number of content categories. A content category may be defined by either or both of a domain list and a word list.

Domain list based categorization is simple and predictable. When a page's domain is listed under one of the defined categories, the website is considered to belong to that category and no further processing is required.

If the category cannot be deduced from the domain, the web page's vocabulary is analyzed. If the page scores enough points under one of the categories, it is considered to belong to that category. This method is more flexible because it can handle websites that are not yet known.

Configuration File

The content category configuration file defines a set of content categories. It's a plain text file. The name of the file is specified by the category_file parameter of the main configuration file. Each category is represented by a block of lines in a name = value format. Categories are separated by one or more empty lines.
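To illustrate the block structure described above, here is a minimal Python sketch of a reader for such a file. It is not Foxy's own parser: the function name parse_categories and the dictionary layout are hypothetical, min_score is assumed to be an integer, and any comment syntax the real file may support is not handled.

```python
def parse_categories(text):
    """Parse 'name = value' lines into a list of category dicts (illustrative only)."""
    categories = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Empty lines only separate categories; they carry no data.
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'name = value', got: {line!r}")
        name, value = name.strip(), value.strip()
        if name == "category":
            # A 'category = name' line starts a new category.
            current = {"category": value, "min_score": 10, "domains": [], "words": []}
            categories.append(current)
        elif current is None:
            raise ValueError(f"{name!r} appears before any category line")
        elif name == "min_score":
            current["min_score"] = int(value)  # assumption: integer scores
        elif name == "domain":
            current["domains"].append(value)
        elif name == "word":
            current["words"].append(value)
        else:
            raise ValueError(f"unknown parameter: {name!r}")
    return categories
```

Each returned dictionary holds one category with its min_score (10 unless overridden), its domain entries, and its word templates in file order.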

Parameters

category = category_name
min_score = number
domain = domain_name
word = word_template

category

A unique content category name, e.g. Porn. This parameter is required; a category = name line starts a new category.

Format:

category = category name

min_score

Specifies the sensitivity of vocabulary analysis when identifying this category. The greater the number, the less sensitive the detection is (that is, more points have to be scored before the category is positively identified). The default value is 10 (at least 10 points must be scored for a positive category recognition). The scoring rules are outlined below.

Format:

min_score = number

domain

Format:

domain = domain_name

One or more domain parameters define a set of domains belonging to the category defined by the last category parameter. All resources from those domains automatically fall under this category without the need for vocabulary analysis.

A domain entry matches the domain itself and all of its subdomains, e.g. google.com matches google.com, www.google.com, images.google.com, etc. The same domain must not be listed under more than one category, but a domain and its subdomains may be listed under different categories. When several entries match, Foxy selects the longest match.
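The subdomain and longest-match behavior can be sketched in Python as follows. This is only an illustration under stated assumptions: match_domain and the lookup table are hypothetical names, the category labels in the usage example are made up, and wildcard entries are not handled.

```python
def match_domain(host, domain_to_category):
    """Return the category of the longest listed domain that equals `host`
    or is a parent domain of it, or None if nothing matches."""
    labels = host.lower().rstrip(".").split(".")
    # Try the full host first, then drop the leftmost label each time,
    # so the first hit is the longest (most specific) entry.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in domain_to_category:
            return domain_to_category[candidate]
    return None
```

With a table such as {"google.com": "Search", "images.google.com": "Images"}, the host www.google.com resolves to Search, while images.google.com and any host below it resolve to Images because that entry is the longer match.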

Entries in the content category configuration file are plain domain names; paths are not allowed. They may, however, contain wildcards, for example:

category = Adult
domain = playboy.com
domain = *sex*
domain = *porn*
domain = *adult*
...
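The exact wildcard semantics are not spelled out here. One plausible reading is shell-style globbing against the host name, which the following Python sketch illustrates with the standard fnmatch module. The helper name wildcard_matches is hypothetical, and the glob interpretation is an assumption rather than a description of Foxy's matcher.

```python
from fnmatch import fnmatchcase


def wildcard_matches(host, pattern):
    """True if `host` matches a glob-style domain pattern such as '*sex*'.
    Assumed semantics: '*' stands for any run of characters, case is ignored."""
    return fnmatchcase(host.lower(), pattern.lower())
```

Under this reading, *sex* matches any host containing those three letters, so a host such as essex.example would match as well. Broad patterns of this kind are worth testing against sites you do not intend to block.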

word

Format:

word = word_template

Each word parameter defines a word to match. A word may be followed by special characters (flags) that define the word's weight and other properties. Without any flags a word scores one point. Only unique matches are counted (but there is a very small bonus for the total number of matches).

Word templates:

word Matches word and its main forms (English only).
word* Matches words starting with word. For example, porn* will match all words that start with porn, e.g. porn, porno, pornography, etc. “*” may be followed by “~” or “!” to specify weak or strong match respectively.
Please note that “*” is allowed only at the end of the word (e.g. templates like *abc or ab*c are illegal).
word. Exact match. Match word only (no word forms).

A word template may be followed by “!” or “~” to specify a strong or weak match respectively:

word! Strong match. Scores 2 points.
word~ Weak match. Scores only half a point. Only the first N weak words score anything at all, where N is the number of unique normal and strong matches. The idea here is that some words are fairly neutral by themselves, but in a context with stronger words they should increase the score (e.g. words like babe, sex, nude, etc. when filtering adult content).
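The flag and scoring rules above can be worked through with a short Python sketch. It is an approximation, not Foxy's scorer: parse_template and score are hypothetical names, the input is assumed to be the templates that already matched the page, and the small bonus for the total number of matches is left out because its size is not given.

```python
def parse_template(template):
    """Split a word template into (stem, kind, points)."""
    points = 1.0                                   # no flag: one point
    if template.endswith("!"):
        template, points = template[:-1], 2.0      # strong match
    elif template.endswith("~"):
        template, points = template[:-1], 0.5      # weak match
    if template.endswith("*"):
        return template[:-1], "prefix", points     # word*: words starting with stem
    if template.endswith("."):
        return template[:-1], "exact", points      # word.: no word forms
    return template, "forms", points               # word: the word and its main forms


def score(matched_templates):
    """Score the unique templates that matched a page."""
    points = [parse_template(t)[2] for t in set(matched_templates)]
    strong = sum(1 for p in points if p == 2.0)
    normal = sum(1 for p in points if p == 1.0)
    weak = sum(1 for p in points if p == 0.5)
    # Only the first N weak words count, N = unique normal + strong matches.
    counted_weak = min(weak, normal + strong)
    return 2.0 * strong + 1.0 * normal + 0.5 * counted_weak
```

For example, a page matching porn*! together with nude~, babe~ and sex~ scores 2.5: two points for the strong match plus half a point for the single weak word that is allowed to count. A page matching only weak words scores 0.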

Notes:

How it works

  1. When Foxy sees an HTTP request, it checks if the domain part of the requested URL appears under one of the content categories. If it does, a content category label is attached to this transaction. At this point some content filters may block the page.
  2. If the domain does not match any of the defined content categories, the server half of the page download starts (server to proxy). The first 20K bytes or so of the page are analyzed using the categorized word lists. If at least one category scores at least its own min_score points, the page is considered to fall under that category. If more than one category matches, the category with the highest score is selected. At this point some content filters may block the page (the client part of the transfer has not yet started).
  3. In case of a category match (either domain or word list based), the match is stored in the category cache. Upon a later request for one of the recently matched domains, no analysis is required (and content filters may immediately block the page).
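The three steps can be summarized in a compact Python sketch. All names here (pick_by_score, categorize, analyze_page, the dictionaries) are hypothetical, the domain lookup is reduced to an exact match for brevity, and the cache is a plain dictionary with no expiry, so treat it as an outline of the flow rather than Foxy's implementation.

```python
def pick_by_score(scores, min_scores):
    """Return the highest-scoring category that reaches its own min_score, else None."""
    qualified = {c: s for c, s in scores.items() if s >= min_scores.get(c, 10)}
    if not qualified:
        return None
    return max(qualified, key=qualified.get)


def categorize(host, domain_categories, cache, analyze_page, min_scores):
    """Categorize one request; `analyze_page(host)` returns {category: score}."""
    # Step 1: domain list lookup (exact match only in this sketch).
    category = domain_categories.get(host)
    if category is None:
        # Step 3 (reuse): a recently matched domain needs no analysis.
        if host in cache:
            return cache[host]
        # Step 2: vocabulary analysis of the beginning of the page.
        category = pick_by_score(analyze_page(host), min_scores)
    if category is not None:
        # Step 3 (store): remember the match for later requests.
        cache[host] = category
    return category
```

If the analysis returns, say, Gambling 12 and Sports 15 with both thresholds at 10, both categories qualify and Sports is selected because it has the higher score.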

Notes:

Hints

General approach

...

Counter-categories

Some categories may have intersecting vocabularies. E.g. both Gambling and Sports may use words like championship, world, series, game, etc. Say you wish to block access to gambling websites but don't really care about sports. I would still recommend carefully defining the Sports category with the sole purpose of outweighing the Gambling category on pages dedicated to sports. For Sports, use both words it shares with gambling and words specific to sports (like NHL, athlete, coach). When visiting pages dedicated to sports, the Gambling category may score enough points to be positively identified. But the Sports category will typically score more vocabulary points (or match by domain), so the page will be categorized as Sports and the blocking filter will not misfire. Another example of counter-categories is something like Porn and Sex Education, the sole purpose of the latter being to introduce (domain based) exceptions for the former. See the default categories.cfg for an example.

Limitations