Earlier this week, The Citizen Lab in collaboration with researchers at the University of New Mexico released one of the largest unbiased surveillance and censorship keyword lists, which was extracted and decrypted from two chat programs used in China. In total, these lists from TOM-Skype and Sina UC comprise 4,256 keywords, and provide fascinating insight into the types of topics that authorities at those companies consider sensitive and worth censoring/listening in on.
Past lists of sensitive keywords on Chinese websites have been developed through techniques ranging from internal testing aided by crowdsourced reports to more systematic monitoring. These newly released lists are different in kind: the 4,256 keywords represent every keyword that these two IM programs have been tracking on the client side, including a number of unique words that were not previously known to be sensitive. They therefore provide researchers with a complete set of keywords covering all the major topics that were of interest to the authorities in charge at Sina UC and TOM-Skype during the past year and a half. (Some caveats: these two chat programs are lightly used in China, with their market share paling in comparison to the dominant chat client, Tencent’s QQ, and the lists have not been frequently updated in recent months.)
Among the findings of the paper The Citizen Lab/UNM published is that there was not much overlap of keywords between the two chat clients: only 3% of the words were censored/monitored in both Sina UC and TOM-Skype, strongly indicating that for these chat clients, the censorship and surveillance lists were developed internally and not handed down from a central source. However, was there much overlap and sharing of keyword lists within companies? And are these keywords still sensitive right now? The paper does preliminary analysis by checking the set against previously known lists of blocked keywords, but we decided to extend this analysis by performing a more thorough test on June 6 and then again on June 29, checking whether each of the 4,256 keywords was blocked from searching on Sina Weibo, the most actively used social media website in China. (A spreadsheet of all the data mentioned in this report can be viewed in this Google Fusion Table or downloaded in .csv format for further analysis.)
The unblocking of June 4 terms
Of the 4,256 keywords, 783 were blocked on June 6 (18.4%) while 707 were blocked on June 29 (16.6%). After filtering out redundant words,1 the numbers adjust to 532 of 3,311 words blocked on June 6 (16.1%) and 480 blocked on June 29 (14.5%). Looking at the raw numbers, it would appear that a large number of keywords, nearly 10% of those blocked on June 6, were unblocked (fitting with recent evidence regarding the unblocking of certain types of keywords).
Unique keywords from China Chats keyword list unblocked on Weibo between Jun 6 and Jun 29
|keyword||description (from China-chats.net)||category|
|89||This is a reference to June 4, 1989 — the Tiananmen Square crackdown.||tiananmen square june 4 1989|
|榴嗣||Homonym for June 4 (1989)||tiananmen square june 4 1989|
|平反||A call for the Chinese government to look into and redress the human rights abuses that arose out of the Tiananmen Square crackdown in 1989.||tiananmen square june 4 1989|
|波推||Sexual content.||prurient interests|
|陆肆||Another way of writing 六四, or Liu Si, in reference to the Tiananmen Square protests that occurred on June 4th, 1989.||tiananmen square june 4 1989|
|丝带||Silk ribbon: this may be related to sexual content.||prurient interests|
|坦克||tank: ‘Tank Man’, from the June 4, 1989 Tiananmen Square massacre.||tiananmen square june 4 1989|
|鎮壓||Repression||tiananmen square june 4 1989|
|滕彪||Teng Biao is a human rights activist and lawyer in China.||dissident activist|
|六四||Liu Si||tiananmen square june 4 1989|
|王维林||Wang Weilin — an anonymous man who stood in front of the tanks at Tiananmen Square.||tiananmen square june 4 1989|
|支联会||(short form for) Hong Kong Alliance in Support of Patriotic Democratic Movements In China||human rights|
|温总理||This is a reference to Chinese Premier Wen Jiabao.||cpc member gov official|
|38军||This is a military formation of the PRC’s People’s Liberation Army.||tiananmen square june 4 1989|
|六*四||Reference to June 4 1989||tiananmen square june 4 1989|
|长安街||This is a major thoroughfare in Beijing and directly in front of the Tiananmen Square.||tiananmen square june 4 1989|
|吴仁华||Wu Renhua is a Chinese scholar and was a democracy activist during the 1989 Democracy movement.||dissident activist|
|六+四||Reference to June 4 1989||tiananmen square june 4 1989|
|支聯會||(short form for) Hong Kong Alliance in Support of Patriotic Democratic Movements In China||humanrightsorganizations|
|六=四||Reference to June 4 1989||tiananmen square june 4 1989|
|八八节||August 8 or Father’s day||context unclear|
|6.4||This is a reference to the Tiananmen Square crackdowns on June 4, 1989.||tiananmen square june 4 1989|
|平*反||A call for the Chinese government to look into and redress the human rights abuses that arose out of the Tiananmen Square crackdown in 1989.||tiananmen square june 4 1989|
|最后一枪||This appears to be the name of a multi-episode television drama in China. It portrays Chinese resistance against Japanese forces in Shanghai in 1941.||international relations|
|VIIV||Roman numerals for 6-4, a reference to the June 4, 1989 Tiananmen Square massacre.||tiananmen square june 4 1989|
|一九八九||Likely a reference to the 1989 Democracy Movement in China that amounted to the June Fourth Massacre at Tiananmen Square.||tiananmen square june 4 1989|
|春夏之交||Between Spring and Summer of 1989.||tiananmen square june 4 1989|
|八八血卡||Eighty-eight blood card: context unclear, but likely related to Tiananmen Square crackdown and democracy movement.||tiananmen square june 4 1989|
|屠杀大学生||Massacre of university students: likely a reference to Tiananmen Square.||tiananmen square june 4 1989|
|TAM事件||This is a reference to the Tiananmen Square crackdown in June 1989.||tiananmen square june 4 1989|
|liusi||This is a reference to the Tiananmen protests and crackdown on June 4, 1989.||tiananmen square june 4 1989|
|64狗qq||(gun) qq is a Chinese website.||illicit goods and services|
|八平方纪念||Eight squared memorial (8 to the power of 2 = 64): reference to the Tiananmen Square massacre.||tiananmen square june 4 1989|
|5月三十五||This is a reference to the Tiananmen Square crackdown on June 4, 1989.||tiananmen square june 4 1989|
|供应杜冷丁||Pethidine is used to relieve moderate to severe pain.||illicit goods and services|
|64气狗QQ||(air gun) QQ is a Chinese microblog website||illicit goods and services|
|法拉利 死亡||Ferrari death: Reference to the Beijing Ferrari accident in March 2012.||ferrari crash march 2012|
|6月的第4天||This is a reference to the Tiananmen Square massacre on June 4, 1989.||tiananmen square june 4 1989|
|3月6日集会||March 6 meeting in the spring of 2011. There was a call online for people to gather in 35 major cities on March 6 to protest against the Communist government.||jasmine revolution|
|六月的第四天||Reference to June 4 1989||tiananmen square june 4 1989|
|二十四人通缉令||24 arrest warrants: Unclear of context, but likely related to democracy movement or Tiananmen Square crackdown or with Liu Xiaobo.||chinese democracy movement|
|8|9|6|4||Reference to the Tiananmen Square crackdown on June 4, 1989.||tiananmen square june 4 1989|
|法~轮~大~法||Falun gong||religion falun gong|
|tank man||||tiananmen square june 4 1989|
|chailing||Chai Ling||dissident activist|
|己巳年己巳月乙未日||Ji si year happens every 60 years and falls on 1989. Ji si month is between May and June. Likely a reference to the Tiananmen Square crackdown in June 1989.||tiananmen square june 4 1989|
|一个光明的民主中国||A bright democratic China: this may be related to the Jasmine Revolution.||jasmine revolution|
|周六下午四点 茉莉花||Related to the Jasmine Revolution.||jasmine revolution|
|54式 64式 手枪||54 style 64 style hand gun pistol, a semi-automatic originating in China.||illicit goods and services|
|2453605542||Unclear the context. However, this could be a reference to illicit use/buying/cracking of telephone numbers.||illicit goods and services|
However, a closer look at the 54 unique words that were unblocked (two were added to the block list: “9评”, the Nine Commentaries, and “six*4”, a June 4 reference, giving us the net change of -52 blocked words) reveals that the majority of them were related to the June 4 Tiananmen Square crackdown. Twenty-nine of the words were initially categorized as related to June 4, and another nine were likely blocked due to June 4-references2 (支联会, 吴仁华, 支聯會, 64狗qq, 64气狗QQ, 二十四人通缉令, chailing, 54式 64式 手枪, 2453605542).
As discussed by GreatFire.org, June 4-related keywords appeared to be treated in a special manner before and during the anniversary, and the data here affirms this—three weeks after June 4, a number of June 4-related words have apparently been deemed to no longer be sensitive enough to be blocked (though 44 words we’ve categorized as related to June 4 are still blocked, including most of the ones we previously identified in a test on June 3rd). Whether the unblocking of keywords is part of a gradual, intentional shift toward a censorship system more reliant on human censors who delete posts and away from the crutch of blanket search blocks, or whether it was just a momentary blip related to June 4, is something we’ll know more about as we continue to test this list in the coming weeks.
Identifying which keywords have been targeted for implicit censorship via decreased reported search result numbers
Using the data that we’ve accumulated via keyword searches, another potential way of unearthing new or unexpected censorship patterns is to track the number of search results returned for each keyword search and identify major drops in reported results. If on June 1 a search for “Ferrari” on Weibo reports there are 10,000 results,3 and the following week there are only 100, even though the term has not been blocked, it would be clear that something is being manipulated, either via the deletion of posts or their disappearance from searches.
Obviously, this relies on Weibo to report results honestly and consistently; the method outlined above is useless if Weibo constantly changes its search algorithm or is intentionally manipulating the number of results reported to mask the actual level of censorship that is taking place. However, for the two data points for quantity of search results that we have, that doesn’t appear to be the case:
Percent change in number of results from June 6 to June 29 for China Chats keywords with more than 1,000 initial results that were not blocked on Weibo
|% change||number of keywords||% total|
|-100% to -90%||45||3.48|
|-10% to 0%||68||5.26|
|0% to 10%||1045||80.82|
|90% to 100%||0||0.00|
Except for a few exceptional cases, the number of results reported for keywords tended to be very stable. After filtering out all the uncommon terms with under 1,000 results as well as those which were blocked on our first test on June 6, over 80% of the remaining 1,293 keywords had their search results increase by 0-10% on our second test—not totally unexpected as one would expect users to add new posts to Weibo’s database and Weibo to dutifully report them. However, there are some outliers: 18 keywords increased by over 100%, either an indication that they became particularly hot and were used quite a bit more in posts during the period between tests; or that Weibo suddenly tweaked the search algorithm for those terms specifically. Even if the latter is true, using number of results reported seems to be a genuine, though not foolproof, measure of a keyword’s usage in posts.
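The filtering and bucketing described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for this report: the record layout, the function names, and the `example-*` rows are assumptions made up for the example, and only the 恶党 figures come from our tests.

```python
import math
from collections import Counter

MIN_RESULTS = 1000       # the report keeps only keywords with more than 1,000 initial results
DROP_THRESHOLD = -90.0   # drops of 90% or more are treated as worth a closer look


def percent_change(before, after):
    """Percent change in reported results between the two tests."""
    return (after - before) / before * 100.0


def bucket_label(change, width=10):
    """Label the 10-point bucket a percent change falls into, e.g. '0% to 10%'."""
    lower = math.floor(change / width) * width
    return f"{lower}% to {lower + width}%"


def summarize(records):
    """records: iterable of (keyword, results_before, results_after, blocked_before).

    This tuple layout is a hypothetical one chosen for the sketch.
    """
    histogram = Counter()
    suspicious = []
    for keyword, before, after, blocked in records:
        # Skip keywords that were blocked on the first test or were too rare.
        if blocked or before <= MIN_RESULTS:
            continue
        change = percent_change(before, after)
        histogram[bucket_label(change)] += 1
        if change <= DROP_THRESHOLD:
            suspicious.append((keyword, round(change, 2)))
    return histogram, suspicious


sample = [
    ("恶党", 861904, 852, False),        # figures quoted in this report
    ("example-a", 20000, 20900, False),  # made-up row: ordinary growth
    ("example-b", 500, 5, False),        # made-up row: below the 1,000-result cutoff
    ("example-c", 30000, 10, True),      # made-up row: blocked on the first test
]
histogram, suspicious = summarize(sample)
print(histogram)
print(suspicious)
```

Flagged keywords would still need a manual check, since a large drop may simply reflect Weibo collapsing similar posts rather than deleting them.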
Looking at the other end of the spectrum, the drops of over 90% seen for 45 keywords cannot be explained by any reasonable natural occurrence; such an extreme drop in reported results can only be due to either inconsistency in, or intentional manipulation of, Weibo’s search algorithm. And indeed, while inconsistency does play a part, with 17 of the 45 keywords subsequently returning to roughly their original number of results as of today, the other 28 keywords have maintained the suppressed number of search results. For example, a search for 恶党 (evil party), which once returned 861,904 results, returned 852 during our second test, and is still dropping, with only 783 or 814 being reported as of July 4, 3:42 EDT. However, noted at the bottom now is “为了提供多样性结果，我们省略了部分相似微博，您可以点击查看全部搜索结果” which translates to “In order to provide more diverse search results, we’ve dropped some similar posts, you can click to see all the search results.” Clicking on the link should now give you access to all 880,000+ posts which contain 恶党. Similarly, one can access very nearly the same list of 880,000+ by searching for “恶 党” with a space between the characters. Also of note is that on the bottom of certain pages of search results for 恶党, though never on the first page, is the message “根据相关法律法规和政策，部分搜索结果未予显示”: “Due to relevant laws, regulations, and policies, some search results were not displayed.” (This new “semi-censorship” by Weibo was first noted by Jason Q. Ng and GreatFire.org in Sept 2012.)
The same hiding of results—forcing users to click to access the full set of posts containing the desired search term—is true for a few other keywords we tested, but a more thorough check of future drops in search results will need to be done before we can come to a conclusion about what Weibo is trying to accomplish with this tactic. Whether it’s intentional censorship or an unintended tweak in the search algorithm, something odd is taking place and worth further examination.
Overlaps of sensitive terms between services
As mentioned previously, there did not appear to be much overlap between the keyword lists in Sina UC and TOM-Skype. However, after performing our test on Sina Weibo, we now have another service to compare those lists against.
During our first test, 294 of the 1,919 unique words from TOM-Skype’s list were blocked on Weibo (15.3%), dropping slightly to 267 three weeks later (13.9%). Perhaps unsurprisingly since they both have the same parent company, the number of Sina UC’s keywords which were blocked on Sina Weibo was slightly higher than TOM-Skype’s: 299 of the 1,517 unique words from Sina UC’s list were blocked at first (19.7%), and 270 (17.8%) subsequently.
Running some regression models seems to indicate at first glance that Sina UC’s keywords are much more likely to return no results than TOM-Skype, but digging deeper it appears that this is due to the greater number of keywords we’ve categorized as URLs in Sina UC’s list than in TOM-Skype’s list (267 in Sina UC’s list versus just 7 in TOM-Skype’s). URLs of course are much more likely to return 0 results on Sina Weibo because Weibo natively uses a link shortener. Re-running the regression without keywords categorized as URLs and using length of the keyword as a control eliminates that discrepancy.
However, which list a keyword is on significantly affects whether or not it will be explicitly blocked on Sina Weibo: keywords on Sina UC’s list are 77% more likely to be blocked on Weibo than non-Sina-UC terms (p<0.001), whereas TOM-Skype words are 20% less likely to be blocked on Weibo than non-TOM-Skype terms (p=0.02). Thus, either Sina UC and Sina Weibo share similar outlooks or processes for deciding which words to block; or TOM-Skype is especially bad at predicting which sensitive words will continue to be relevant today (remember: the Sina UC and TOM-Skype censorship lists we are analyzing have not been updated since December 2012).
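For readers who want a rough sense of scale without fitting a model, the raw June 6 counts quoted in this report allow an unadjusted comparison. The sketch below is not the regression described above (its specification is not given here), so it should not be expected to reproduce the regression estimates. It also assumes that every unique keyword not on Sina UC’s list is a TOM-Skype-only keyword, and the function names are made up for illustration.

```python
def two_by_two(total, total_blocked, group, group_blocked):
    """Build a 2x2 table (blocked / not blocked, in group / not in group) from totals."""
    a = group_blocked                    # in group, blocked
    b = group - group_blocked            # in group, not blocked
    c = total_blocked - group_blocked    # not in group, blocked
    d = (total - group) - c              # not in group, not blocked
    return a, b, c, d


def odds_ratio(a, b, c, d):
    """Unadjusted odds ratio of being blocked, group versus non-group."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Unadjusted ratio of blocked proportions, group versus non-group."""
    return (a / (a + b)) / (c / (c + d))


# June 6 counts from this report: 3,311 unique keywords, 532 blocked on Weibo;
# 1,517 unique keywords on Sina UC's list, 299 of them blocked.
a, b, c, d = two_by_two(total=3311, total_blocked=532, group=1517, group_blocked=299)
unadjusted_or = odds_ratio(a, b, c, d)
unadjusted_rr = risk_ratio(a, b, c, d)
print(f"odds ratio: {unadjusted_or:.2f}, risk ratio: {unadjusted_rr:.2f}")
```

Any gap between these raw ratios and the regression estimates would reflect whatever controls and exclusions the model applies, so the two should be read as complementary rather than interchangeable.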
However, there are a number of blocked Weibo keywords on Sina UC’s list and even TOM-Skype’s list which are so specific that it seems unlikely that either service developed them independently from Sina Weibo, and thus point to a central authority being involved. For instance, the “big character” slogan 清算特权废除特供 (Expose and criticize the privileged; repeal the special benefits) is contained on TOM-Skype’s list and is also blocked on Weibo—but only if the term is entered exactly as is. If any character from the phrase is removed or if the four-character words are reversed, the phrase becomes unblocked. What are the odds that two different services blocked this exact eight-character term—which has only eight results on Google, two of which are references to the research that uncovered it? The phrase does occur in a 2011 article on the Epoch Times website, the newspaper primarily affiliated with the Falun Gong, and while it is possible that both Weibo and TOM-Skype are performing content analysis of the Epoch Times’s website and adding new keywords automatically, it may be more reasonable to conclude that Sina Weibo and TOM-Skype receive certain keywords from higher-ups. But if that’s the case, then why did Sina UC’s list not include this keyword?
Furthermore, it is worth keeping in mind that even though there wasn’t much overlap of terms between the two services, those that were shared are particularly special, having been identified (or mandated from above) by both services as sensitive. One hundred twenty-five unique keywords are on both lists, roughly half of which are explicitly blocked from searching on Weibo. These words are nearly 5 times more likely to be blocked on Weibo than those on only one list (p<0.001), a possible sign of the “wisdom of crowds” (if you can call two chat clients a crowd) at identifying the most sensitive of sensitive words. Or perhaps those words were so much more likely to be blocked on Sina Weibo because central authorities have designated them for special treatment and caused them to be censored across services.
Both explanations are plausible, another sign of the opaqueness and vagaries of the Chinese censorship system showing themselves again. Having examined this preliminary data, questions remain: How much self-censorship is performed by companies and how much is mandated by the government? Do services within the same company share similar censorship lists and strategies? Is the unblocking of June 4-related keywords indicative of larger trends or just a scheduling blip? More work is to be done but hopefully the keyword list available from The Citizen Lab/UNM’s China Chats study will assist interested folks in performing future research into these questions.
1. Words were marked as non-unique if they either:
- contained another keyword on the list within it, e.g., 一夜情 (one-night stand) contains 一夜 (one-night), which is also on the list; or
- contained the component of another known blocked keyword within it, e.g., 无码 (uncensored) is blocked on Weibo and is contained in 18 keywords from our list, for instance 成人无码DVD (uncensored Japanese DVDs); thus, one of the 18 is marked as unique while the other 17 are marked as not unique. In cases like 无码, the choice of which term was designated as unique as opposed to the others might affect the analysis of which chat client “earned” the Weibo block designation. However, this only affects a handful of cases, and ideally, a more rigorous approach would solve this by designating the unique word as being on both lists if its component is found in keywords on both lists. For instance, of the eighteen words that contain 无码, 17 are exclusively on TOM-Skype’s list and one is exclusively on Sina UC’s list. The unique keyword chosen to represent 无码 was one of the TOM-Skype keywords, so in regressions testing how likely a word is to be blocked if it is on TOM-Skype’s list as opposed to Sina UC’s list, TOM-Skype’s effect would be slightly exaggerated, since the designated unique word doesn’t account for its presence in the one Sina UC keyword which contained it. Ideally, there would be a variable that re-weights all this so that the model knows that 17/18 of the client designation for the unique blocked word should be apportioned to TOM-Skype while 1/18 should go to Sina UC, but the current approach is probably acceptable for the general analysis in this report.
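The two rules above, and the re-weighting idea, can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than the procedure actually used for this report: the function names are hypothetical, the representative chosen for a blocked component is simply the first keyword encountered, and the `无码-placeholder` entry is a made-up stand-in for a second keyword containing 无码.

```python
from collections import Counter
from fractions import Fraction


def mark_unique(keywords, blocked_components=()):
    """Return the keywords kept as 'unique' under the two rules in this footnote."""
    keywords = list(dict.fromkeys(keywords))  # drop exact duplicates, keep order
    unique = []
    claimed = set()  # blocked components that already have a representative
    for word in keywords:
        # Rule 1: the word contains another keyword from the list.
        if any(other != word and other in word for other in keywords):
            continue
        # Rule 2: the word contains a known blocked component; keep only the
        # first such word as the representative (an arbitrary choice here).
        component = next((c for c in blocked_components if c in word), None)
        if component is not None:
            if component in claimed:
                continue
            claimed.add(component)
        unique.append(word)
    return unique


def client_shares(clients):
    """Split credit for one blocked component across the clients whose keywords contain it."""
    counts = Counter(clients)
    total = sum(counts.values())
    return {client: Fraction(n, total) for client, n in counts.items()}


sample = ["一夜情", "一夜", "成人无码DVD", "无码-placeholder"]
unique_words = mark_unique(sample, blocked_components=["无码"])
shares = client_shares(["TOM-Skype"] * 17 + ["Sina UC"])
print(unique_words)
print(shares)
```

Using exact fractions keeps the shares summing to one, which is the property a re-weighting variable in the regression would need.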
2. While “64” itself wasn’t blocked when it was tested on June 6, the preponderance of blocked words which contain “64” indicates that the term was likely used to trigger a search block when combined with other seemingly innocent words.
3. Weibo only shows 50 pages of results for each search (with 20 posts per page, a total of 1,000 posts), even if it reports there are more than 1,000 results. However, you can narrow your search by date, location, and other options to view more specific results, and so could verify whether there are actually 1.3 million posts which contain “Obama” as Weibo says there are. One could also theoretically use the Weibo search API and scrape the results, but unfortunately the date, page, and count parameters for advancing to the next set of results don’t actually work as described in the API documentation. If someone has figured out how to use Weibo’s search API properly, please feel free to contact the author.