Visualizing Changes in Censorship: Summarizing two months of Sina Weibo keyword monitoring with two interactive charts

Thanks to the excellent work being done by researchers and journalists at China Digital Times, GreatFire.org, and many others, there has never been more information about what is being censored online in China. Less discussed, however, are instances when the censors withdraw keywords or topics from their censorship watchlists. Relaxing certain restrictions may not represent greater freedom of expression online if it is followed by even more sophisticated forms of censorship, but it is still worth monitoring what does get changed, even if only for hints at the political motives behind such changes (for instance, the unblocking of Bo Xilai’s name in the run-up to his corruption trial).

The Citizen Lab’s China Chats project was one model attempt at identifying changes over time in censorship and surveillance in two Chinese chat clients. Using the set of 4,256 sensitive keywords extracted and decrypted from those chat clients in that project, we have been testing changes in search keyword censorship on Sina Weibo these past two months. We performed four such tests, each roughly three weeks apart, recording Weibo’s response when we searched for each keyword. The chart below (click through to open an interactive version of it) breaks down the 3,311 unique words* into three categories:

not blocked, meaning they returned search results;
explicitly blocked, meaning they returned a clear message stating that censorship was taking place (“According to relevant laws, regulations and policies, search results for [the blocked keyword] can not be displayed.”);
no results, meaning that there could genuinely be no results or that the term is being implicitly censored.

click through to open a link to an interactive version of the chart
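As a rough illustration of the three categories, here is a minimal Python sketch of how a single search response might be classified. The names `classify`, `Status`, and `BLOCK_NOTICE_MARKER` are hypothetical, and the marker string is simply the English rendering of the notice quoted above, standing in for whatever text the live page returns; this is not the code used for the tests.

```python
from enum import Enum

# Assumption: the English rendering of the block notice stands in for the
# actual text shown on the search page.
BLOCK_NOTICE_MARKER = "According to relevant laws, regulations and policies"


class Status(Enum):
    NOT_BLOCKED = "not blocked"
    EXPLICITLY_BLOCKED = "explicitly blocked"
    NO_RESULTS = "no results"


def classify(page_text: str, result_count: int) -> Status:
    """Map one search response to one of the three categories."""
    # An explicit notice wins regardless of any result count.
    if BLOCK_NOTICE_MARKER in page_text:
        return Status.EXPLICITLY_BLOCKED
    if result_count > 0:
        return Status.NOT_BLOCKED
    # Zero results without a notice: genuinely empty or implicitly censored.
    return Status.NO_RESULTS
```

Checking for the notice first means a keyword is never counted as "no results" when the page states outright that results are being withheld.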

As noted in a previous post, a number of words were unblocked between our first test on June 6 and our second test on June 29, most having to do with the June 4/Tiananmen Square anniversary. The trend continued in our third test on July 23, with an additional 46 keywords being essentially unblocked (going from explicitly censored to having search results). You can click through to the interactive chart to see the words that were unblocked in each test. Our fourth test showed this unblocking of explicitly censored keywords stabilizing, but surprisingly also showed a marked increase in words returning no results, which may indicate implicit censorship.

However, a preliminary re-test of those words shows that a large number of them have since returned to having results. The jump may merely have reflected a temporary change in the search algorithm for handling certain words, or perhaps a shake-up in the list of words earmarked for “no results” before Sina reverted most of them, though not all. For instance, 批发K粉 (“wholesale ketamine”) was explicitly censored on the first two tests before returning 159,194 results on our third test. On our fourth test, the term was one of the many that switched from having results to having none, which is still the case as of today, making it a likely case of implicit censorship. Thus, no conclusions about the fourth test’s data on potentially implicitly censored keywords should be drawn until a follow-up test is completed.

click through to open a link to an interactive version of the chart

The above chart shows a more granular view of the changes between each test: keywords that moved from one of the three categories in the first chart to a different category were identified as having changed block status. This second chart breaks down the five** ways in which keywords changed block status from one test to the next. Again, clicking through to the interactive version allows you to see which keywords are in each category. Now we can see which keywords were unblocked from one test to another as well as the few that were added to the block list. For instance, try clicking the red partition on the middle bar. You’ll see three terms, including, not surprisingly, the legal scholar and rights activist 许志永 (Xu Zhiyong), who went from having over 16,000 results on the first test to being explicitly blocked on the third after he was arrested on July 16.
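For readers who want to reproduce this kind of breakdown, here is a minimal Python sketch of counting status changes between two consecutive tests. It assumes each test has been reduced to a dictionary mapping keyword to category; the function name `count_transitions` and the sample data are hypothetical placeholders, not results from our tests.

```python
from collections import Counter
from itertools import permutations

CATEGORIES = ("not blocked", "explicitly blocked", "no results")

# Every ordered pair of distinct categories is one possible pathway: 3 * 2 = 6.
ALL_PATHWAYS = list(permutations(CATEGORIES, 2))


def count_transitions(before: dict, after: dict) -> Counter:
    """Count keywords whose category differs between two consecutive tests."""
    changes = Counter()
    for keyword, old_status in before.items():
        new_status = after.get(keyword)
        # Skip keywords missing from the later test or with unchanged status.
        if new_status is not None and new_status != old_status:
            changes[(old_status, new_status)] += 1
    return changes


# Made-up placeholder data, not results from the tests described here.
test_a = {"kw1": "explicitly blocked", "kw2": "not blocked", "kw3": "no results"}
test_b = {"kw1": "not blocked", "kw2": "not blocked", "kw3": "not blocked"}

transitions = count_transitions(test_a, test_b)
# Pathways that never occurred in this pair of tests.
unobserved = [p for p in ALL_PATHWAYS if p not in transitions]
```

Listing the pathways that never occur makes it easy to spot which of the six possible changes a given pair of tests does not contain.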

As noted above, the third bar, which compares the third and fourth tests, appears anomalous because of the much higher proportion of words that went from having results to no results, a change which may or may not be purely temporary. The results-to-no-results category and its reverse are further inflated because most of the keywords in these two categories went from only a handful of results to none, or from none to a few; both cases could plausibly be explained by organic reasons rather than outright censorship. If we keep only keywords that returned more than 100 results before or after the change in block status, we are left with three, four, and fourteen keywords which went from more than 100 results to zero in the three respective comparisons of tests, and three, five, and six keywords which went from zero results to over a hundred. Thus, definitive statements about changes in block status involving keywords that go to or from zero results are harder to make, because in theory such changes could happen without any censorship (people could delete or make posts containing that keyword).
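The 100-result filter can be sketched in a few lines of Python. This assumes each test is stored as a dictionary mapping keyword to result count; the function name `large_swings` and the placeholder entries are hypothetical, while the 批发K粉 counts are the ones reported above for the third and fourth tests.

```python
def large_swings(before: dict, after: dict, threshold: int = 100):
    """Return keywords that went from > threshold results to zero, and the reverse."""
    shared = before.keys() & after.keys()
    vanished = sorted(k for k in shared if before[k] > threshold and after[k] == 0)
    appeared = sorted(k for k in shared if before[k] == 0 and after[k] > threshold)
    return vanished, appeared


# 批发K粉 had 159,194 results on the third test and none on the fourth;
# the other two entries are invented placeholders.
third = {"批发K粉": 159194, "placeholder_a": 3, "placeholder_b": 0}
fourth = {"批发K粉": 0, "placeholder_a": 0, "placeholder_b": 250}

vanished, appeared = large_swings(third, fourth)
```

Here `placeholder_a` drops from three results to none and is excluded, which is the kind of small swing that could plausibly be organic.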

However, we can say with more confidence that the list of keywords being explicitly blocked appears to have stabilized in our most recent tests.*** Further testing will reveal whether there are any other shake-ups to the Sina Weibo system equivalent to the one that took place after June 4.

* You can read more about the rough methodology for classifying unique and non-unique keywords in footnotes 1 and 2 to this previous Citizen Lab blog post: Using the China Chats surveillance/censorship keyword list. However, as the data from these tests show, filtering out all redundant keywords to isolate the unique keywords that actually trigger the censorship is rather difficult without knowing the list of censored keywords in advance (which would make this whole exercise pointless). Thus, the lists of keywords in these charts contain at least a few entries that share the same trigger word. For the most part, though, the filtering seems to have done an acceptable job of weeding out the major trigger words shared by many keywords.

** There are technically six possible pathways, but in our tests no keywords went from implicitly blocked (no results) to explicitly blocked.

*** A fifth test run through a Chinese VPN on Aug 20 showed no difference in the list of words explicitly blocked compared to the fourth test run on Aug 17 from here in Canada.