Guidance for Blacklisting and Watching

Blacklisting or watching a keyword or a website address causes the spam detection bot SmokeDetector to trigger an alert whenever that keyword or address appears in a post. In other words, it tells SmokeDetector that any post containing this expression is spam, or at least suspicious.

The Website Blacklist

The website blacklist is a list of websites associated with known spam. A link to any of them automatically raises suspicion when posted anywhere on Stack Exchange.

Blacklisting a website makes SmokeDetector report every post that is posted or modified with a link to the website (formatted as a link or otherwise) in its text.

The website blacklist is maintained in the SmokeDetector GitHub repository, specifically in the file blacklisted_websites.txt.

The Keyword Blacklist

The keyword blacklist is a list of phrases which are frequently seen in spam and rarely seen outside of spam posts.

Blacklisting a “keyword” causes any post which matches it to be reported as probable spam by SmokeDetector. A keyword can actually be a regular expression matching a phrase, or a more complex expression with alternatives; for example, find (true )?love matches either “find love” or “find true love”. Matches are not reported in the middle of a word: the keyword expression “dog” does not match “doggone” or “endogenous”.
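The whole-word behaviour described above can be approximated with a short sketch. This is not SmokeDetector's actual code: the helper name compile_keyword is hypothetical, and wrapping the expression in \b word boundaries with Python's standard re module is an assumption about how such matching could be done.

```python
import re


def compile_keyword(expression: str) -> re.Pattern:
    """Hypothetical helper: wrap a keyword expression in word boundaries.

    Assumption: \\b is used here only to illustrate "no matches in the
    middle of a word"; SmokeDetector's real logic may differ.
    """
    return re.compile(r"\b(?:" + expression + r")\b", re.IGNORECASE)


love = compile_keyword(r"find (true )?love")
dog = compile_keyword(r"dog")

print(bool(love.search("Find love today")))        # True
print(bool(love.search("find true love online")))  # True
print(bool(dog.search("doggone")))                 # False
print(bool(dog.search("endogenous")))              # False
print(bool(dog.search("my dog ate it")))           # True
```

Trying an expression locally like this can be a quick sanity check before proposing it, though it is no substitute for testing it against SmokeDetector itself.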

The keyword blacklist is maintained in the SmokeDetector GitHub repository, specifically in the file bad_keywords.txt.

Watch Expressions

“Watching” an expression causes SmokeDetector to report matching posts just as it does for a blacklisted expression, but the rule weight is kept low to prevent matches from triggering autoflagging. Posts which match only watched expressions and no other rules are reported only in Charcoal, not in other chat rooms. That means you can use !!/watch to try out different patterns experimentally and get an idea of what sorts of posts match a particular expression.

The list of watched expressions is maintained in the SmokeDetector GitHub repository, specifically in the file watched_keywords.txt. The format is slightly different from the other similar files; each entry is a tab-delimited record which includes a date stamp (expressed as Unix epoch, i.e. seconds since midnight Jan 1 1970 UTC) and the user name of the person who added the expression.
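As an illustration of the record format, here is a minimal parsing sketch. The field order (epoch seconds, then user name, then expression) is an assumption, as are the names WatchEntry and parse_watch_line and the sample record; check the actual file before relying on it.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class WatchEntry(NamedTuple):
    added_at: datetime   # when the expression was added (UTC)
    added_by: str        # user name of the person who added it
    expression: str      # the watched regular expression


def parse_watch_line(line: str) -> WatchEntry:
    """Parse one tab-delimited record.

    Assumed field order: epoch seconds, user name, expression.
    """
    timestamp, user, expression = line.rstrip("\n").split("\t", 2)
    added_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return WatchEntry(added_at, user, expression)


# Made-up sample record, for illustration only.
entry = parse_watch_line("1500000000\texample_user\tfind (true )?love\n")
print(entry.added_at.isoformat())  # 2017-07-14T02:40:00+00:00
print(entry.added_by)              # example_user
print(entry.expression)            # find (true )?love
```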

Rules for Blacklisting and Watching

We have established the following rules for watching and blacklisting.

Ongoing Campaign

The blacklisting guidance relaxes the criteria for blacklisting a website when it is promoted in a spam post which we identify as being part of an “ongoing campaign”. This means that the spam incident is substantially similar to a number of other recent spam posts which already fulfill the stricter blacklisting criteria. In practice, this helps us trigger blacklisting early for sites which are clearly part of an ongoing promotion, where we can be reasonably sure that the only purpose of the site is to have a different URL than the other sites used in the campaign. (This is called “snowshoe spamming”: the tactic is to spread your footprint across many sites so as to evade trivial duplicate detection.)

How to Blacklist or Watch Something

Before you blacklist or watch an expression, check that it isn’t already covered by one of the existing patterns.

You can test this by using the !!/test <string to test> command (or !!/test-a <string to test> to test as an answer).

Everyone with SmokeDetector privileges (if you don’t have those and would like them, read up on how to get them) can blacklist a website. If you don’t have code privileges, the addition will need to be approved by someone who does. Additions to the blacklist must be valid regular expressions (regex). In practice, for largely exact matches (like entries in the website blacklist), this means making sure that special characters (like .) are escaped. (Example: thisisspam\.com)
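To see why escaping matters, consider the following sketch. It uses Python's standard re module as a stand-in, which is an assumption about the regex engine; the helper names to_blacklist_pattern and is_valid_regex are hypothetical and not part of SmokeDetector.

```python
import re


def to_blacklist_pattern(domain: str) -> str:
    """Hypothetical helper: escape regex metacharacters such as "."."""
    return re.escape(domain)


def is_valid_regex(pattern: str) -> bool:
    """Return True if the pattern compiles as a regular expression."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


pattern = to_blacklist_pattern("thisisspam.com")
print(pattern)  # thisisspam\.com

# The escaped dot matches only a literal dot.
print(bool(re.search(pattern, "visit thisisspam.com now")))  # True
print(bool(re.search(pattern, "visit thisisspamXcom now")))  # False

# An unescaped dot matches any character, so this also matches.
print(bool(re.search("thisisspam.com", "visit thisisspamXcom now")))  # True

# An unbalanced parenthesis is not a valid regular expression.
print(is_valid_regex("find (true ?love"))  # False
```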

There are two methods to add a website to a watch list or blacklist:

If you’re blacklisting or watching a complex regex intended to match many different variations, it probably belongs in the pattern-matching section of findspam.py. You’ll need to propose a change to the file on GitHub for this; ask for help if you’re unsure what to do.