Exploiting Redundancy in Natural Language

Christoph Karlberger, Günther Bayler, Christopher Kruegel, and Engin Kirda
{christoph,gmb,chris,ek}

Today’s attacks against Bayesian spam filters attempt to keep the content of spam mails visible to humans, but obscured to filters. A common technique is to fool filters by appending additional words to a spam mail. Because these words appear very rarely in spam mails, filters are inclined to classify the mail as legitimate.

The idea we present in this paper leverages the fact that natural language typically contains synonyms. Synonyms are different words that describe similar terms and concepts. Such words often have significantly different spam probabilities. Thus, an attacker might be able to penetrate Bayesian filters by replacing suspicious words by innocuous terms with the same meaning. A precondition for the success of such an attack is that Bayesian spam filters of different users assign similar spam probabilities to similar tokens. We first examine whether this precondition is met; afterwards, we measure the effectiveness of an automated substitution attack by creating a test set of spam messages that are tested against SpamAssassin [...]

The purpose of a spam filter is to decide whether an incoming message is legitimate (i.e., ham) or unsolicited (i.e., spam). There are many different types of filter systems, including:

Word lists: Simple and complex lists of words that are [...]

Black lists and white lists: These lists contain known IP addresses of spam and non-spam senders.

Message digests: These systems summarize mails into pseudo-unique values. Repeated sightings of the same digest are symptomatic of a spam mail.

Probabilistic systems: Systems such as Bayesian filters are used to learn word frequencies that are associated with both spam and non-spam messages [11].

Since Bayesian filters do not have a fixed set of rules to classify incoming messages, they have to be trained with known spam and ham messages before they are able to classify messages. The training of a Bayesian spam filter occurs in three steps: first, each message is stripped of any transfer encodings. The decoded message is then split into single tokens, which are the words that make up the message. Last, for each token, a record in the token database is updated that maintains two counts: the number of spam messages and the number of ham messages in which that token has been observed so far. Besides that, the token database also keeps track of the total number of spam and ham messages that have been used [...]

Once a Bayesian spam filter has created a token database, messages can be analyzed. Analogous to the training phase, the message is first decoded and split into single tokens. For each token, a spam probability is calculated based on the number of spam and ham messages that have contained this token as well as the total number of spam and ham messages that have been used to train the Bayesian spam filter. The following formula is [...]

In this formula, nspam and nham are the total numbers of spam and ham tokens, whereas nspam(token) and nham(token) denote how many times a token appeared in a spam or ham mail, respectively. Note that there are alternative ways to calculate this probability; an overview can be found in [22]. Next, Bayes theorem
is used to calculate the spam probability of the whole message by combining the spam probabilities of the single tokens. Finally, the message is classified as ham or spam, typically by comparing its combined spam probability [...]

The goal of attacks against Bayesian spam filters is to let spam mails be identified as ham mails. Currently existing attacks aim to achieve this by adding words to the spam mails. The objective is that these additional words are used in the classification of the mail in the same way as the original words, thereby tampering with the classification process and reducing the overall spam probability.

When the additional words are randomly chosen from a larger set of words, for example, a dictionary, this is called a random word attack (“word salad”). The objective is that the spam probabilities of the words added to the spam message should compensate for the original tokens’ high spam probabilities in the calculation of the whole message’s combined spam probability. There is some controversy about the effectiveness of such an attack: Several authors have found random word attacks ineffective against Bayesian spam filters [9, 13, 22], because many random dictionary words are infrequently used in legitimate messages and, therefore, tend to have either neutral or high spam probabilities for Bayesian spam filters. An improvement of the random word attack is to add words to the spam mail that are often used in legitimate messages. This is called a common word attack. The idea is that often-used words should have lower spam probabilities than randomly chosen words for Bayesian filters, thus being better suited for an attack. The number of additional words that are needed for this attack to work varies between 50 [20] and 1,000 [13].

Finally, Lowd and Meek improved the common word attack by adding words that are common in the language but uncommon in spam [13]. They called their attack the frequency ratio attack. Lowd and Meek calculated that about 150 frequency-ratio words suffice to make a spam [...]

Another common approach to circumvent Bayesian spam filters is to overlay the text of a message on images that are embedded in HTML. The content of the mail is visible to the user, but it is unrecognizable to text-based filters, which usually ignore images in their analysis of [...]

Most attacks described previously have one thing in common: they add words to spam mails. From the spammer’s point of view, the disadvantage of adding words to spam messages is that blocks of additional words are indicators of spam, and algorithms that are able to detect these additional words, such as Zdziarski’s Bayesian Noise Reduction algorithm [21], foil attacks.

In this paper, we explore an alternative approach: instead of adding known good words to compensate for the bad words in the spam mail, one could exploit redundancies in the language and substitute words with a high spam probability by synonyms with a lower spam probability. The idea of a computer-aided substitution attack was first hinted at by Bowers [2]. Bowers showed that by manually replacing suspicious words, the spam probability of a message can be lowered. However, a completely manual substitution process is clearly impractical for an attacker. In this work, we investigate the feasibility of an automated substitution attack and evaluate its success [...]

For a successful substitution attack, it is necessary that Bayesian spam filters at different sites (and for different users) judge words sufficiently similarly. Otherwise, the attacker would not know which words are considered suspicious by the victims’ spam filters and, therefore, should be substituted. In addition, it would be unknown which synonyms could be used for the substitution, since it would be equally unknown which words receive a low or neutral spam probability by the victims’ [...]

The spam probability of a word is determined by (a) the number of appearances of this word in spam mails, (b) the number of appearances of this word in ham mails, (c) the total number of spam mails, and (d) the total number of ham mails the Bayesian filter has classified. If the spam mails and the ham mails of users of Bayesian spam filters are sufficiently similar, then it is reasonable to assume that words are classified similarly enough for [...]

Are the mails used for training Bayesian spam filters of different users the same? This is clearly not the case. But many spam filters are set up for more than one user; that is, they use broader samples of spam and ham mail for training. Even more important, however, is that we can assume that many users receive very similar spam mails. After all, the idea of spam is the wide dissemination of particular messages. Thus, it is reasonable to assume that many users receive similar spam messages, and as a result, their filters assign high spam probabilities to the same (or very similar) sets of words.

Another aspect to consider in this regard is how fast messages, in particular spam messages, mutate. When message content changes too quickly, the classification of the words in the messages would change too. As a result, the effectiveness of the attack decreases, because the adversary does not know which words to replace, and which synonyms to choose. In order to determine whether spam messages change slowly enough to allow a substitution attack, we examined three different spam archives. We extracted messages received in the year 2006, divided them by the month they were received, and created lists of the most frequently used tokens for each month. We then measured the overlap of these lists by comparing them. The goal was to determine how many of the 100 most frequently used tokens of one month appear among the 100 most frequently used tokens of another month. The results in Table 1 show that the majority of the 100 most frequently used tokens in one month’s spam messages appear among the top 100 tokens of another [...]

Manual inspection of the most frequently used tokens showed that the lower a token ranks in this list, the smaller the difference between its number of appearances and that of the next most frequently used token. Some tokens that are at the end of the list of the 100 most frequently used tokens of one month do not appear on another month’s top 100 list and, therefore, lower the overlap. However, many of these tokens are not completely missing in the spam corpus of that other month, but are only slightly too infrequent to appear in the list of the 100 most frequently used tokens. If these borderline tokens were counted too, the overlap would be higher; to estimate the overlap including the borderline tokens, we also measured how many of the 100 most frequently used tokens of a month appear among the 200 most frequently used tokens of another month. The results for this type of comparison are shown in the right half [...]

Table 1: Overlap of most frequently used tokens in three different spam archives for 2006 (see Section 4).

Our results demonstrate that many of the terms used in spam messages do not change over the course of a year, which is an indication that the spam probabilities Bayesian filters assign to these terms do not change too much either. These findings are confirmed by related studies: Sullivan [18] examined “almost 2,500 spam messages sampled from 8 different domains over a period of 2.5 years” and found that spam is relatively time-stable. Pu and Webb [16] studied the evolution of spam by examining “over 1.4 million spam messages that were collected from SpamArchive between January 2003 and January 2006.” They focused their study on a trend analysis of spam construction techniques, and found that these changes occur slowly over the course of several months, confirming Sullivan’s claim.

As mentioned previously, the goal of the substitution attack is to reduce the overall spam score of a mail by automatically replacing words with high spam probability by words with low spam probability. This is done in several [...]

1. All words with a very high spam probability are [...]

2. For every such word, a thesaurus is queried to find a set of words with similar meaning, but with a lower [...]

3. If a set of suitable synonyms is found, the spam word is replaced with one of the possible candidates [...]

Identify words with high spam probability. [...] that raise the spam probability of a message need to be automatically replaced by words that have a lower spam probability. To this end, the spam probability of each word in a message has to be determined. For this, we query the Bayes token database of a spam filter. More precisely, we trained SpamAssassin with about 18,000 spam mails from Bruce Guenter’s Spam Archive [10] and about 12,000 ham mails from the SpamAssassin and Enron ham archives [6] to prepare SpamAssassin’s Bayes filter with a large and comprehensive training corpus. Then, for each word of a message, SpamAssassin was consulted to derive the spam probability for this word. We chose the SpamAssassin mail filter [17] for this task because it is widely used and achieves good results [...]

Based on the Bayesian spam probabilities for each word, the decision is made whether this word needs to be replaced. To this end, a substitution threshold is defined. If a word with a spam probability higher than that threshold is found, it is replaced with a synonym. In Section 6, we show results of experiments using different values for the substitution threshold.

[...] a spam probability above the threshold is found, WordNet [15] is queried for alternatives. WordNet is a lexical database that groups verbs, nouns, adjectives, and adverbs into sets of cognitive synonyms (called synsets). That is, each synset represents a concept, and it contains a set of words with a sense that names this concept. For example, the word “car” is classified as a noun and is contained in five synsets, where each of these sets represents a different meaning of the word “car.” One of these sets is described as “a motor vehicle with four wheels” and contains other words such as “automobile” or “motorcar.” Another synset is described as a “cabin for transporting people”, containing the word “elevator car.” For every synset, WordNet provides links to hypernym synsets, which are sets of words whose meaning
encompasses that of other words. That is, a hypernym is more generic than a given word. For example, “motor vehicle” is a hypernym of “car” [15].

Whenever WordNet is queried for alternatives for a particular word, the tool not only requires the search word itself, but also additional information that describes the role of this word in the sentence (such as whether this word is a noun, a verb, or an adjective). The reason is that WordNet distinguishes between synsets for verbs, nouns, adjectives and adverbs, and it is necessary to specify what kind of synsets the result should contain. For example, if a word that is used in the query is a noun, it would not make sense to look up synsets of verbs. Obviously, many words possess more than a single role. For example, the word “jump” can be used either as a verb or as a noun. The natural language processing (NLP) tool that we use to perform the recognition of the roles of words in a sentence is contained in the LingPipe NLP package [12]. This tool relies on the context of a word to discover its role in a sentence and assigns a tag to each word that describes its part-of-speech role [4].

Once the role of a word is discovered, WordNet can provide all synsets that contain the search word in its proper role. In addition, WordNet is also queried for all direct hypernym synsets of these sets, because hypernyms can also act as synonyms and, therefore, expand the search space for suitable replacement words.

Unfortunately, the role of a word is not sufficient to select the proper synset. The reason is that one must select the synset that contains those words that are semantically closest to the original term. As mentioned above, the noun “car” could be replaced by the term “automobile”, but also by the word “cabin.” To choose the synset that contains words with semantics closest to the original term, SenseLearner [14], a “word sense disambiguation” tool, is employed. This tool analyzes the mail text together with the previously calculated part-of-speech tags to determine the synset that is semantically closest to the original search word. [...] determining a single synset, only words from this synset are considered as candidates for substitution. Otherwise, all synsets returned by WordNet are considered (although the substitution is less likely to be accurate).

The easiest strategy is to select that word among the candidates whose spam probability is the lowest. Another strategy is to randomly choose a word from the resulting synset(s). The latter approach aims to create diversity in the substitution process for a large set of mails. If a word is always replaced by the same word, the spam probability of this word would rise every time the mail is classified as spam. Variability in the substitution could slow down this process. Thus, we can select between minimum or random as replacement strategies.

[...] a spam probability lower than that of the original word is found and the spam probability is very high, it is possible to exchange a single letter of the word with another character that resembles this letter (e.g., “i” with “1”, “a” with “@”). This is an implementation of a trick from John Graham-Cumming’s site “The Spammer’s Compendium” [8] to conceal words from spam filters. Another threshold, called the exchange threshold, has to be defined that specifies for which words (and their corresponding spam probabilities) this obfuscation process [...]

We evaluated our substitution attack against three popular mail filters: SpamAssassin 3.1.4 [17], DSPAM 3.8.0 [5], and Gmail [7]. For our experiments, we randomly chose 100 spam messages from Bruce Guenter’s SPAM archive [10] for the month of May 2007.

In a first step, the header lines of each mail were removed, except for the subject line. This was done for two reasons. First, the header (except for the subject line) is not altered in a substitution attack, because it contains no words that can be replaced by synonyms. Second, retaining the header of the original spam mail would influence the result, because certain lines of the header could cause a spam filter to classify a message differently. In the next step, each HTML mail was stripped of its markup tags and converted into plain text. Then, we manually corrected words that were extended with additional characters or were altered in similar ways to escape spam filter detection (e.g., Hou=se – House). Finally, the resulting messages were processed by our prototype that implements the proposed substitution attack. In total, five different test sets were created, using different settings for the substitution and exchange thresholds as well as different substitution policies (as described in Section 5):

Test Set A: no substitution, original header removed.
Test Set B: [...] threshold: 95%; minimum replacement strategy.
Test Set C: [...] threshold: 100%; minimum replacement strategy.
Test Set D: [...] threshold: 100%; random replacement strategy.
Test Set E: [...] threshold: 100%; minimum replacement strategy.

A threshold of 100% means that no character exchange or word substitution is performed. Test set A consists of the original messages for which no substitution was performed. Test sets B, C and D use an aggressive substitution policy, whereas test set E aims to preserve more of the original text. Test sets C and D use the same threshold settings, but apply a different replacement strategy. The difference between test sets B and C is that certain words in test set B are obfuscated.

Using our five test sets, SpamAssassin and DSPAM were locally run to classify all mails in each set. In addition, all mails were sent to a newly created Gmail account to determine which of them Gmail would recognize as spam. SpamAssassin was used with its default configuration (where the threshold for classifying a mail as spam is 5). However, note that we disabled SpamAssassin’s ability to learn new spam tokens from analyzed mails. This was done to prevent changes in the results that depend on the order in which the tests were executed. Furthermore, SpamAssassin was not allowed to add network addresses to its whitelist. DSPAM was used in its standard configuration, with the exception that it was not allowed to use whitelists either. Whitelisting is disabled to ensure that filters would never incorrectly let a mail pass as ham without first invoking the Bayesian [...]

The results of the experiments are listed in Table 2. For each tested spam filter, the numbers show the mails that are incorrectly classified as ham (i.e., the mails that successfully penetrated the filter). At first glance, the effectiveness of the substitution attack does not seem to be significant, especially for SpamAssassin and Gmail. Closer examination of the results, however, revealed that the overall effectiveness of the attack is limited because SpamAssassin uses Bayesian analysis only as one component in its classification process. For example, SpamAssassin uses “block lists” that contain URLs that are associated with spam mails. In our test set, many mails do contain such links, and in some cases, a mail received more than 10 points for a single URL. In this case, the spam threshold of 5 was immediately exceeded, and the mail is tagged as spam regardless of the result that the Bayesian classifier delivers.

Figure 1: SpamAssassin: Bayesian spam scores.

To gain a deeper understanding of the effect of our attack on the Bayesian classifier of SpamAssassin, we examined the Bayesian spam score that is computed by SpamAssassin for the mails before (test set A) and after the most effective substitution attack (test set B). The results are shown in Figure 1. Note that the spam scores that are assigned to a mail by SpamAssassin are fixed values that range from -2.599 to 3.5. A negative score means that the content of the mail is regarded as ham, whereas a positive score implies that the mail is spam. Values around 0 are neutral and leave the classification of the mail to other mechanisms. In the figure, it can be seen that for the original test set A, only 10% of all mails had the lowest score of -2.599, while 30% received the highest spam score of 3.5. After the substitution attack (with test set B), 25% of all mails achieved a score of -2.599, while only 2% received 3.5 points. Also, the number of mails that were assigned a neutral spam score increased. This clearly shows the significant effect of the substitution attack on the Bayesian classification.

Table 2: Number of test spam messages not recognized by filters.

This claim is further confirmed when analyzing the results for DSPAM shown in Table 2. DSPAM is much more dependent on the results derived by the Bayesian filter when detecting spam, and thus, the number of spam mails that passed the filter could be more than doubled after the substitution process. To pass filters such as SpamAssassin (and probably also Gmail), the attacker also has to take into account other factors besides the
content (i.e., text) of his mail. For example, by frequently changing the URLs that point to the spammer’s sites (or by hosting these sites on compromised machines), one could evade SpamAssassin’s block list. In this case, the substitution attack is only one building block of a successful [...]

[...] evaluating the effectiveness of a substitution attack, we also assessed the number of different versions that can be created from a single spam mail. For this, we analyzed the number of words for which substitution was attempted, as well as the number of possible synonyms for each word. When a substitution threshold of 60% was used, the system attempted to replace on average 36 words per mail. For these, an average of 1.92 synonyms were available, and in 23% of the cases, not a single synonym could be found. For a substitution threshold of 80%, 19 substitution attempts were made on average, with 1.65 available synonyms (and no synonym in 29% of the cases). Using a random replacement strategy, we also found that there are on average 992 variations of one mail.

The substitution attack is effective in reducing the spam score calculated by Bayesian filters. However, the attack also has occasional problems. One issue is that it is not always possible to find suitable synonyms for particular words. This is especially relevant for brand names and proper names such as “Viagra.” In this case, one has to resort to obfuscation by replacing certain characters. Unfortunately for the attacker, spam filters are quite robust to simple character substitution. This can be observed when one compares the results for test set B (with obfuscation) with test set C (without obfuscation) in Table 2. Also, newly created words can be learned by spam filters, which counters the obfuscation or even raises the spam score of a mail [22]. Another problem for automated substitution is posed by spelling errors in spam mails, which make it impossible to find the misspelled words in the thesaurus.

Another issue is that automated word substitutions are not always perfect. Natural language processing is a difficult task, and our tools are not always able to identify the correct role or semantics of a word. For example, WordNet yields “nexus” as replacement for “link.” Other examples are “locomote” for “go” or “stymie” for “embarrass.” We have invested significant effort to select precise replacements, but, unsurprisingly, the system fails sometimes. Moreover, the bad grammar used in many spam mails makes correct semantic analysis even more challenging. To mitigate this limitation, one could consider a setup in which the substitution system produces different versions of a particular spam mail that all have low spam probabilities. Then, a human can pick those alternatives that sound reasonable, and use only those for spamming. An example of a mail before and after word substitution is shown in Appendix A.

Spam mails are a serious concern to and a major annoyance for many Internet users. Bayesian spam filters are an important element in the fight against spam mail, and such filters are now integrated into popular mail clients such as Mozilla Thunderbird or Microsoft Outlook. Obviously, spammers have been working on adapting their techniques to bypass Bayesian filters. For example, a common technique for disguising spam is appending additional words to mails, with the hope of reducing the calculated spam probability. The effectiveness of such evasion efforts, however, varies, and Bayesian filters are [...]

In this paper, we present a novel, automated technique to penetrate Bayesian spam filters by replacing words with high spam probability with synonyms that have a lower spam probability. Our technique attacks the core idea behind Bayesian filters, which identify spam by assigning spam probability values to individual words. Our experiments demonstrate that automated substitution attacks are feasible in practice, and that Bayesian filters are vulnerable. Hence, it is important for service providers and mail clients to make use of a combination of techniques to fight spam such as URL-blocking, blacklisting, [...]

This work was supported by the Austrian Science Foundation (FWF) under grant P18157, the FIT-IT project Pathfinder, and the Secure Business Austria competence [...]

This example shows the content of a mail before and after the substitution process. It can be seen that most words are substituted by a reasonable replacement, although [...]

Mail before substitution:

Subject: Take twice as long to eat half as much

I know it is the HOODIA that has made me lose weight. Now I am so confident I think I will try to do it a few more times and see where it gets me. I love the fact that I am getting weight loss results without any bad side effects like the other products that have stimulants in them. So I just had to write and give you my testimonial to say I am happy I gained my body back and since losing weight, I am ready to become more active and attractive than I have ever been. Thanks So Much, Patricia Strate - Currently 137 lbs

Order online securely from our website
(A sample is available at no cost to you)
pls click the remove link at our website, and enter your [...]

Mail after substitution:

Subject: Take twice as long to eat half as much

I know it is the HOODIA that has made me drop off weight. Instantly I am so confident I think I will try to do it a few more times and see where it gets me. I love the fact that I am getting weight passing results without any bad side effects like the other merchandises that have stimulants in them. So I just had to write and give you my testimony to say I am happy I derived my body back and since losing weight, I am quick to become more active and attractive than I have ever been. Thanks [...] Patricia Strate - Currently 137 pounds

Order online securely from our internet site
(A sample is usable at no cost to you)
pls click the remove link at our internet site, and enter [...]

[1] ARADHYE, H. B., MYERS, G. K., AND HERSON, J. A. Image analysis for efficient categorization of image-based spam e-mail. In Eighth International Conference on Document Analysis and Recognition (2005).
[2] [...] org/writings/bayesReport.html, February 2003.
[3] CipherTrust SpamArchive. ftp://mirrors.blueyonder.
[4] CUTTING, D., KUPIEC, J., AND PEDERSEN, J. A practical part-of-speech tagger. In Third Conference on Applied Natural Language Processing (1992), Xerox Palo Alto Research Center.
[7] GOOGLE. Gmail.
[8] GRAHAM-CUMMING, J. The spammers’ compendium. http:
[9] GRAHAM-CUMMING, J. How to beat an adaptive spam filter. In [...]
[10] [...]UENTER, B. Bruce Guenter’s SPAM Archive. http://www.
[11] KRAWETZ, N. Anti-Spam Solutions and Security. http://, 2004.
[12] LingPipe 2.4.0.
[13] LOWD, D., AND MEEK, C. Good word attacks on statistical spam filters. In Conference on Email and Anti-Spam (2005).
[15] PRINCETON. Wordnet 2.1. http://wordnet.princeton.
[16] PU, C., AND WEBB, S. Observed trends in spam construction techniques: A case study of spam evolution. In Third Conference on Email and Anti-Spam (CEAS) (2006), p. 104.
[17] SpamAssassin.
[18] SULLIVAN, T. The more things change: Volatility and stability in spam features. In MIT Spam Conference (2004).
[20] WITTEL, G., AND WU, F. Attacking statistical spam filters. In First Conference on Email and Anti-Spam (CEAS) (July 2004).
[21] ZDZIARSKI, J. Bayesian noise reduction: Contextual symmetry logic utilizing pattern consistency analysis. In MIT Spam Conference (2005).
[22] ZDZIARSKI, J. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. No Starch Press, 2005.
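
As a companion to the description of Bayesian filtering in the introduction, the following is a minimal sketch of a per-token spam probability and of a naive Bayes combination of token probabilities. The ratio form used here is a commonly used estimate and is an assumption, since the formula itself is not preserved in the text above; the function names, the clamping bounds, and the neutral value for unseen tokens are likewise illustrative and do not describe SpamAssassin’s or DSPAM’s implementation.

```python
from math import prod


def token_spam_probability(n_spam_token, n_ham_token, n_spam, n_ham):
    """Estimate the spam probability of one token (assumed ratio form).

    n_spam_token / n_ham_token: appearances of the token in spam / ham.
    n_spam / n_ham: totals used to normalise the two counts.
    """
    spam_rate = n_spam_token / n_spam
    ham_rate = n_ham_token / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: treat it as neutral (assumption)
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so that a single token can never decide the result alone.
    return min(0.99, max(0.01, p))


def combined_spam_probability(token_probs):
    """Combine token probabilities with Bayes' theorem, assuming
    independent tokens and equal priors for spam and ham."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)
```

For instance, a token seen in 30 of 100 spam mails and 10 of 100 ham mails gets a probability of about 0.75 under this estimate, and two such tokens combine to about 0.9.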

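
The substitution steps of Section 5 (identify suspicious words, look up synonyms, replace, and fall back to character exchange) can be sketched as follows. This is a toy illustration under stated assumptions: the probability table and the synonym table are hypothetical stand-ins for the SpamAssassin token database and for the WordNet, LingPipe and SenseLearner pipeline, and all names and default thresholds are chosen for illustration only.

```python
import random

# Hypothetical toy data; a real attack would query a trained Bayes
# token database and a thesaurus instead.
SPAM_PROB = {"buy": 0.95, "purchase": 0.40, "acquire": 0.30,
             "viagra": 0.99, "now": 0.50}
SYNONYMS = {"buy": ["purchase", "acquire"]}
LOOKALIKES = {"i": "1", "a": "@"}


def obfuscate(word):
    """Exchange the first letter that has a look-alike character."""
    for i, ch in enumerate(word):
        if ch in LOOKALIKES:
            return word[:i] + LOOKALIKES[ch] + word[i + 1:]
    return word


def substitute(words, sub_threshold=0.6, exch_threshold=0.95,
               strategy="minimum", rng=None):
    """Replace words whose spam probability exceeds sub_threshold."""
    rng = rng or random.Random()
    result = []
    for word in words:
        p = SPAM_PROB.get(word, 0.5)  # unknown words count as neutral
        if p <= sub_threshold:
            result.append(word)
            continue
        # Only synonyms with a lower spam probability are candidates.
        candidates = [s for s in SYNONYMS.get(word, [])
                      if SPAM_PROB.get(s, 0.5) < p]
        if candidates and strategy == "minimum":
            result.append(min(candidates,
                              key=lambda s: SPAM_PROB.get(s, 0.5)))
        elif candidates:
            result.append(rng.choice(candidates))  # "random" strategy
        elif p > exch_threshold:
            result.append(obfuscate(word))  # no synonym: exchange a letter
        else:
            result.append(word)
    return result
```

With these toy tables, `substitute(["buy", "viagra", "now"])` replaces "buy" by its lowest-probability synonym and obfuscates "viagra", while setting `exch_threshold=1.0` disables the character exchange, mirroring the meaning of a 100% threshold in the evaluation.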

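
The month-to-month comparison of Section 4 (how many of the N most frequent tokens of one month appear among the M most frequent tokens of another) can be sketched as below. Whitespace tokenization and lower-casing are simplifying assumptions, and the function names are illustrative; the text does not specify how the archives were tokenized.

```python
from collections import Counter


def top_tokens(messages, n):
    """Return the n most frequent tokens over a list of message texts."""
    counts = Counter(tok for msg in messages for tok in msg.lower().split())
    return [tok for tok, _ in counts.most_common(n)]


def overlap(month_a, month_b, n=100, m=100):
    """Count how many of month_a's top-n tokens are in month_b's top-m."""
    top_a = top_tokens(month_a, n)
    top_b = set(top_tokens(month_b, m))
    return sum(1 for tok in top_a if tok in top_b)
```

Calling `overlap(a, b, 100, 100)` corresponds to the strict top-100 comparison, and `overlap(a, b, 100, 200)` to the relaxed comparison that also counts borderline tokens.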
