Google Unveils Major Spam Filter Overhaul

John Lister's picture

Google says it has significantly upgraded Gmail's spam filter to overcome a common scam tactic. It now uses AI to detect images and symbols that aren't technically text characters but that humans can still read as text.

It also says the new system will reduce the number of false positives: legitimate emails mistakenly flagged as spam. That's certainly felt like an increasing problem over the past year or so.

The scammer tactic tackled with the update is called adversarial text manipulation. It exploits the fact that a key part of spam filtering involves analyzing the text in an email and looking for patterns and signals associated with unsolicited messages.

Fake Characters Deceive

These signals can include text that appears to be written by a machine (for example, when a sender generates thousands of different emails to see which wording is most persuasive). They can also include badly translated or poorly written text. One theory is that some spam senders deliberately make their scam "obvious" to most people to filter out the skeptical and leave only the people most likely to fall for it.

The problem is that adversarial text manipulation uses code and images of special characters that resemble letters closely enough for people to read them but that carry no "meaning" for the spam filter. The message therefore slips past the filter and straight into your inbox.

For example, the word "Microsoft" might be written as "Microsof+", but in a much more convincing way using graphics.
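
To see why this works against simple text matching, here is a minimal Python sketch built around the "Microsof+" example. It is a toy illustration only, not how Gmail or RETVec works: the blocklist, the lookalike table and the function names are all hypothetical, and a hand-made table like this would miss most real-world substitutions.

```python
import unicodedata

# Hypothetical blocklist of words a very simple filter might watch for.
BLOCKLIST = {"microsoft"}

# Hypothetical, hand-made table of lookalike characters (assumption:
# real attacks use far more substitutions than this).
LOOKALIKES = {
    "+": "t",
    "0": "o",
    "1": "l",
    "\u0456": "i",  # Cyrillic small letter Byelorussian-Ukrainian i
    "\u043e": "o",  # Cyrillic small letter o
    "\u0441": "c",  # Cyrillic small letter es
}


def naive_flag(text):
    """Flag text only if a blocklisted word appears letter for letter."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def normalize(text):
    """Fold compatibility forms, lowercase, then swap known lookalikes."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(LOOKALIKES.get(ch, ch) for ch in folded)


def normalized_flag(text):
    """Flag text after mapping lookalike characters back to plain letters."""
    cleaned = normalize(text)
    return any(word in cleaned for word in BLOCKLIST)


message = "Your Microsof+ account needs attention"
print(naive_flag(message))       # the exact-match check misses the word
print(normalized_flag(message))  # the normalized check catches it
```

The exact-match check never sees the word "microsoft", so the message passes. Mapping lookalikes back to ordinary letters catches this case, but only for substitutions somebody has already listed.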

The new system is called RETVec, standing for Resilient & Efficient Text Vectorizer. Previously, spam filters tried to combat this tactic using optical character recognition (OCR), the same technology used when scanning printed documents and converting them to text. (Source: techspot.com)

Context Is Key

Instead, RETVec uses image similarity as a starting point, then uses context to try to figure out the most likely meaning of each character. For example, it might individually rank consecutive characters as possibly being a "t", an "h" and an "e". Putting this information together means it can be more confident that the word is indeed "the".
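
The idea of combining per-character guesses with context can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not RETVec's actual method: the candidate scores, the tiny word list and the function names are all made up for the example.

```python
from math import prod

# Hypothetical scores for each character position: how strongly each
# glyph resembles a given letter (values invented for illustration).
candidates = [
    {"t": 0.6, "f": 0.4},
    {"b": 0.55, "h": 0.45},
    {"e": 0.7, "c": 0.3},
]

# Hypothetical vocabulary standing in for "context".
VOCAB = ["the", "tie", "fee"]


def greedy_guess(candidates):
    """Pick the top-scoring letter at each position, ignoring context."""
    return "".join(max(pos, key=pos.get) for pos in candidates)


def score_word(word, candidates):
    """Multiply the per-position scores of the letters in a known word."""
    if len(word) != len(candidates):
        return 0.0
    return prod(pos.get(ch, 0.0) for pos, ch in zip(candidates, word))


def best_word(candidates, vocab):
    """Return the vocabulary word that best fits all positions together."""
    return max(vocab, key=lambda word: score_word(word, candidates))


print(greedy_guess(candidates))      # letter-by-letter guess
print(best_word(candidates, VOCAB))  # guess constrained to known words
```

Taken one character at a time, the second glyph looks slightly more like a "b", so the letter-by-letter guess is "tbe". Once the guess has to be a known word, "the" is the only candidate that fits every position.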

Perhaps the biggest difference is that the new technology significantly reduces the number of steps needed to identify a letter. Previous spam filters used millions of parameters, whereas RETVec uses around 200,000. That requires fewer computing resources, making it more practical to use. (Source: arstechnica.com)

What's Your Opinion?

Have you spotted this type of spam? Have you noticed any changes in spam filtering recently? Are false positives a bigger problem than missed spam these days?
