Foxy categorizes websites by their contents. Content category labels (like Porn, Hate, Gambling, etc.) are attached to HTTP transactions and may be used by content filters to block access to objectionable content.
Categorization is straightforward and fully controlled by the user, who can define any number of content categories. A content category may be defined by either or both of: a list of domains and a vocabulary (a list of weighted words).
Domain-list-based categorization is simple and predictable. When a page's domain is listed under one of the defined categories, the website is considered to belong to that category and no further processing is required.
If the category cannot be deduced from the domain, the web page's vocabulary is analyzed. If the page scores enough points under one of the categories, it is considered to belong to that category. This method is more flexible since it can handle previously unknown websites.
The content category configuration file defines a set of content categories. It's a plain text file. The name of the file is specified by the category_file parameter of the main configuration file. Each category is represented by a block of lines in a name = value format. Categories are separated by one or more empty lines.
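The block layout described above can be sketched as a small parser. This is only an illustration of the name = value format, not Foxy's own parser: the function name parse_categories, the sample domains and the error handling are assumptions, and the sketch does not handle comments or any parameters beyond category, min_score, domain and word.

```python
def parse_categories(text):
    """Parse category blocks made of 'name = value' lines (illustrative sketch)."""
    categories = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty lines only separate categories
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'name = value', got: {line!r}")
        name, value = name.strip(), value.strip()
        if name == "category":
            # A 'category = name' line starts a new category; min_score defaults to 10.
            current = {"category": value, "min_score": 10.0, "domain": [], "word": []}
            categories.append(current)
        elif current is None:
            raise ValueError("parameter found before the first category line")
        elif name == "min_score":
            current["min_score"] = float(value)
        elif name in ("domain", "word"):
            current[name].append(value)
        else:
            raise ValueError(f"unknown parameter: {name!r}")
    return categories


# Hypothetical sample file contents (the domains are placeholders).
SAMPLE = """\
category = Gambling
min_score = 12
domain = casino.example
word = poker!
word = bet*

category = Sports
domain = sports.example
word = athlete
word = coach
"""

cats = parse_categories(SAMPLE)
for cat in cats:
    print(cat["category"], cat["min_score"], cat["domain"], cat["word"])
```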
A unique content category name, e.g. Porn. This parameter is required; a category = name line starts a new category.
Format:
category = category name
Specifies the sensitivity of vocabulary analysis when identifying this category. The greater the number, the less sensitive the detection (that is, more points must be scored before the category is positively identified). The default value is 10 (at least 10 points must be scored for a positive category recognition). The scoring rules are outlined below.
Format:
min_score = number
Format:
domain = domain_name
One or more domain parameters define a set of domains belonging to the category defined by the last category parameter. All resources from those domains automatically fall under this category without the need for vocabulary analysis.
A domain entry matches the domain itself and all its subdomains, e.g. google.com will match google.com, www.google.com, images.google.com, etc. The same domain must not be listed under more than one category, but a domain and its subdomains may be listed under different categories. In such cases Foxy selects the longest match.
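The longest-match rule can be illustrated with a short lookup sketch. It assumes plain domain names only (no wildcards), and the function name match_domain and the category names in the table are hypothetical; it is not taken from Foxy's source.

```python
def match_domain(host, domain_to_category):
    """Return the category of the longest listed domain that equals host
    or is a parent domain of it, or None if nothing matches."""
    labels = host.lower().rstrip(".").split(".")
    # Try the full host first, then progressively shorter parent domains,
    # so the first hit is also the longest match.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in domain_to_category:
            return domain_to_category[candidate]
    return None


# Hypothetical table: a domain and one of its subdomains in different categories.
table = {"google.com": "Search", "images.google.com": "Images"}

print(match_domain("www.google.com", table))       # Search
print(match_domain("images.google.com", table))    # Images
print(match_domain("a.images.google.com", table))  # Images
print(match_domain("notgoogle.com", table))        # None
```

Matching whole labels rather than raw string suffixes keeps notgoogle.com from being treated as a subdomain of google.com.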
Domains in the content category configuration file are plain domain names; paths are not allowed. They may, however, contain wildcards, for example:
category = Adult
domain = playboy.com
domain = *sex*
domain = *porn*
domain = *adult*
...
Format:
word = word_template
Each word parameter defines a word to match. A word may be followed by special characters (flags) that define the word's weight and other properties. Without any flags, a word scores one point. Only unique matches are counted (although there is a very small bonus for the total number of matches).
Word templates:
| word | Matches word and its main forms (English only). |
| word* | Matches words starting with word. For example, porn* will match all words that start with porn, e.g. porn, porno, pornography, etc. “*” may be followed by “~” or “!” to specify a weak or strong match respectively. Please note that “*” is allowed only at the end of the word (e.g. templates like *abc or ab*c are illegal). |
| word. | Exact match. Matches word only (no word forms). |
A word template may be followed by “!” or “~” to specify a strong or weak match respectively:
| word! | Strong match. Scores 2 points. |
| word~ | Weak match. Scores only half a point. Only the first N weak words score anything at all, where N is the number of unique normal and strong matches. The idea is that some words are fairly neutral by themselves, but in the context of stronger words they should increase the score (e.g. words like babe, sex, nude, etc. when filtering adult content). |
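The template and weight rules above can be sketched as follows. Several details are assumptions rather than documented behavior: each template is counted at most once, the small bonus for the total number of matches is left out because its size is not stated, and "main forms" is approximated by a crude suffix list (FORM_SUFFIXES). All function names are hypothetical.

```python
import re


def parse_template(template):
    """Split a word template into (stem, kind, weight)."""
    weight = 1.0                      # no flag: normal match, one point
    if template.endswith("!"):
        template, weight = template[:-1], 2.0   # strong match
    elif template.endswith("~"):
        template, weight = template[:-1], 0.5   # weak match
    if template.endswith("*"):
        return template[:-1].lower(), "prefix", weight
    if template.endswith("."):
        return template[:-1].lower(), "exact", weight
    return template.lower(), "forms", weight


# Assumption: a crude stand-in for "main forms"; the real rules are not given here.
FORM_SUFFIXES = ("", "s", "es", "ed", "ing")


def template_matches(stem, kind, page_words):
    if kind == "prefix":
        return any(word.startswith(stem) for word in page_words)
    if kind == "exact":
        return stem in page_words
    return any(stem + suffix in page_words for suffix in FORM_SUFFIXES)


def score_page(text, templates):
    """Score a page against one category's word templates."""
    page_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    points = 0.0          # points from normal and strong matches
    full_matches = 0      # N: number of unique normal and strong matches
    weak_matches = 0
    for template in templates:
        stem, kind, weight = parse_template(template)
        if not template_matches(stem, kind, page_words):
            continue
        if weight == 0.5:
            weak_matches += 1
        else:
            full_matches += 1
            points += weight
    # Only the first N weak matches score, half a point each.
    return points + 0.5 * min(weak_matches, full_matches)


templates = ["poker!", "casino", "bet*", "game~", "series~", "world~"]
print(score_page("World Series of Poker: the casino game", templates))  # 4.0
print(score_page("world series game", templates))                       # 0.0
```

In the first call, poker! and casino give 3 points and allow two of the three weak matches to count, adding 1 point. In the second call there are no normal or strong matches, so the weak words score nothing.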
Notes:
...
Some categories may have overlapping vocabularies. For example, both Gambling and Sports may use words like championship, world, series, game, etc. Say you wish to block access to gambling websites but don't really care about sports. I would still recommend carefully defining the Sports category with the sole purpose of outweighing the Gambling category on pages dedicated to sports. For Sports, use both the words it shares with gambling and words specific to sports (like NHL, athlete, coach). On a page dedicated to sports, the Gambling category may score enough points to be positively identified, but the Sports category will typically score more vocabulary points (or match by domain), so the page will be categorized as Sports and the blocking filter will not misfire. Another example of counter-categories is Porn and Sex Education, where the sole purpose of the latter is to introduce (domain-based) exceptions for the former. See the default categories.cfg for an example.
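The counter-category idea can be made concrete with a small selection sketch. It assumes that a domain match decides the category outright and that, otherwise, the highest-scoring category among those reaching their min_score wins. The tie-breaking behavior (first listed wins) and the function name choose_category are assumptions, and the scores are made-up numbers for illustration.

```python
def choose_category(scores, min_scores, domain_category=None):
    """Pick a category from a domain match or from per-category vocabulary scores."""
    if domain_category is not None:
        return domain_category  # domain match: no vocabulary analysis needed
    # Keep only categories that reached their threshold (default 10).
    qualified = {
        name: score
        for name, score in scores.items()
        if score >= min_scores.get(name, 10)
    }
    if not qualified:
        return None
    # Assumption: the highest score wins; on a tie, the first listed category.
    return max(qualified, key=qualified.get)


min_scores = {"Gambling": 10, "Sports": 10}

# Without a Sports category, a sports page can be misidentified as Gambling.
print(choose_category({"Gambling": 11.5}, min_scores))                  # Gambling

# With Sports defined as a counter-category, it outweighs Gambling.
print(choose_category({"Gambling": 11.5, "Sports": 18.0}, min_scores))  # Sports
```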