The Citizen Lab, together with Prof. Jedidiah Crandall and Jeffrey Knockel of the Department of Computer Science at the University of New Mexico, is proud to announce the release of our paper “Chat Program Censorship and Surveillance in China: Tracking TOM-Skype and Sina UC” in the July 2013 edition of First Monday.
In this study we examine the implementation of censorship and surveillance in two instant messaging (IM) clients used in China, each maintained by a different Chinese company: TOM-Skype and Sina UC. For a period of more than a year and a half, we downloaded and decrypted the censorship and surveillance keyword lists used by the client software of both programs.
We obtained the keyword list URLs and encryption keys by reverse engineering the software binaries of the clients. In the TOM-Skype client, keyword lists are used to trigger censorship and/or surveillance of user chats, while in Sina UC the keyword lists trigger only censorship. This data affords a rare opportunity to analyze the contents of, and updates to, complete and unbiased keyword lists used for both censorship and surveillance.
This dataset represents the largest unbiased surveillance and censorship keyword list ever available to the research community. The data offers insights into China’s information control regime but also raises further questions on how industry enforcement works in practice in China, the social and political implications of these IM clients’ censorship and surveillance operations, and how surveillance and censorship features of these IM clients compare to those of other Internet services in China.
All of the data and visualizations from this study are available on a dedicated website, https://china-chats.net, which is designed to aid the analysis of this unique dataset and to encourage researchers to further explore the data and the open questions emerging from it.
Key Findings
Variance in Keyword Lists Between Clients
Comparing the keyword lists of the two clients reveals very little overlap. The full dataset of 88 combined lists contains 4,256 unique keywords, of which only 138 terms (3.2%) were shared between TOM-Skype and Sina UC. This lack of overlap suggests that no common keyword list was provided to these companies by government authorities. Instead, companies may be given general guidelines from authorities on what types of content to target, while retaining some degree of flexibility in how they implement these directives.
Explore the keywords here.
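The overlap figure above is a set computation: pool each client's lists into one set of unique keywords, then compare the intersection with the union. The following is a minimal sketch of that arithmetic, not the code used in the study. The function name `overlap_stats` and the toy keyword lists are hypothetical and are not drawn from the dataset; only the counts 138 and 4,256 come from the text above.

```python
def overlap_stats(lists_a, lists_b):
    """Return (total unique keywords, shared keywords, shared percent)
    for two groups of keyword lists, each given as an iterable of sets."""
    unique_a = set().union(*lists_a)   # all keywords seen in client A's lists
    unique_b = set().union(*lists_b)   # all keywords seen in client B's lists
    combined = unique_a | unique_b     # unique keywords across both clients
    shared = unique_a & unique_b       # keywords present in both clients
    percent = 100.0 * len(shared) / len(combined) if combined else 0.0
    return len(combined), len(shared), percent


# Toy placeholder data (hypothetical, NOT from the study's dataset).
client_a_lists = [{"alpha", "beta"}, {"beta", "gamma"}]
client_b_lists = [{"gamma", "delta"}, {"epsilon"}]

total, shared, percent = overlap_stats(client_a_lists, client_b_lists)
print(total, shared, round(percent, 1))

# Checking the percentage reported above from its two counts.
reported_percent = round(100.0 * 138 / 4256, 1)
print(reported_percent)
```

The sketch assumes the percentage is taken over the combined set of unique keywords, which is consistent with 138 out of 4,256 rounding to 3.2%.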
Highly targeted and overly broad keyword content
Targeted keywords include highly specific information such as instructions and locations related to Jasmine rallies, names of dissidents, and neologisms used by Chinese users to discuss sensitive issues. The targeted nature of these keywords raises concerns regarding the ultimate impact of censorship and surveillance on users discussing such sensitive issues and social mobilizations. Other keywords were very generic (e.g. “Chinese people” “华人”, and “Internet” “互联网”), which raises concerns about overly broad surveillance of users.
Explore the keywords here.
Keyword list changes affecting censorship and surveillance functions
Significant changes to keyword lists in both clients affected the implementation of censorship and surveillance functions. The most recent update to the censorship lists for the latest versions of TOM-Skype reduced these lists to a single keyword, effectively eliminating censorship on these versions of the client. However, these versions still maintain active surveillance-only lists, and earlier versions of the client (3.6–4.2) retain active censorship lists, which means that the latest versions of TOM-Skype analyzed in our study focus on keyword surveillance.
Similarly, on September 17, 2012, four of the five Sina UC lists were reduced to a single keyword. The remaining keyword list is used to censor the username a user may select, meaning that censorship of incoming and outgoing messages appears to be effectively eliminated in Sina UC. It is possible, however, that Sina UC has implemented surveillance features on the server side that we are unable to detect with our reverse engineering methods.
Explore the list changes here.
Surveillance and censorship in reaction to sensitive events
We identified current events referenced in the dataset that occurred within our data collection time frame and correlated keyword list updates with the timeline of the events. Across the selected cases we observed inconsistent patterns. In some cases keyword updates were implemented within a single day of a sensitive event. In other cases, updates were applied weeks or months after the event took place, potentially indicating the censors only responded after an issue developed sufficient political salience.
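The timing comparison described above can be illustrated by measuring the gap between an event and the first keyword list update that follows it. The following is a rough sketch under stated assumptions, not the study's method. The function names, the lag buckets, and all dates are hypothetical placeholders. A real analysis would also need to confirm that the update actually added keywords related to the event.

```python
from datetime import date


def update_lag_days(event_date, update_dates):
    """Days from an event to the first list update on or after it,
    or None if no update followed the event."""
    later = [d for d in update_dates if d >= event_date]
    return (min(later) - event_date).days if later else None


def classify_lag(lag_days):
    """Bucket a lag into coarse categories (thresholds are illustrative)."""
    if lag_days is None:
        return "no update observed"
    if lag_days <= 1:
        return "within a day"
    if lag_days < 7:
        return "within a week"
    return "weeks or months later"


# Hypothetical dates for illustration only (NOT taken from the dataset).
updates = [date(2012, 3, 2), date(2012, 4, 20), date(2012, 9, 17)]
fast_event = date(2012, 3, 1)
slow_event = date(2012, 5, 1)

for event in (fast_event, slow_event):
    lag = update_lag_days(event, updates)
    print(event.isoformat(), lag, classify_lag(lag))
```

Taking only the first update after each event keeps the sketch simple, at the cost of treating any unrelated update as a possible response.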
Explore correlations to current events here.