During The Citizen Lab’s 2014 Summer Institute on Monitoring Internet Openness and Rights, a researcher expressed a desire to test for keyword censorship on a Chinese online service. With limited resources and time, however, the researcher had difficulty determining which keywords to use for these tests.
Lists of censored and sensitive Chinese keywords already exist on the web, including sets produced by The Citizen Lab itself, so in one sense this problem has long been “solved.” However, keyword lists can be overly inclusive and contain too many terms for time-strapped researchers to use efficiently. Furthermore, pulling out only the most sensitive keywords to test (e.g., Falun Gong, June 4, and Xi Jinping) while also ensuring that the keywords cover a broad range of topics can be quite difficult, even for researchers familiar with the Chinese online sphere. For those without experience in Chinese or in Internet censorship, it is a daunting task. Researchers must first locate these scattered lists and then winnow them down to something more usable.
To assist researchers faced with these problems, we have collected 13 lists of sensitive Chinese keywords and aggregated them into a single, sortable, and shareable CSV file (see a Google Docs sample, sorted by the number of lists each keyword appears on). This file, along with a description of the 13 lists and their sources, is located in a GitHub repository that will be updated as new Chinese keyword lists are identified.
The 13 lists contain 9,054 unique keywords in Chinese, English, pinyin, or a combination of the three. The oldest list dates to 2004 (the leaked Tencent QQ blacklist), and the most recent was produced in November 2014 by Citizen Lab collaborator Jeffrey Knockel (University of New Mexico), who extracted 910 keywords from Sina Show. The keyword 魏京生, the name of human rights activist Wei Jingsheng, was found on every list, and four keywords were on twelve of the thirteen lists: 柴玲 (the June 4 student leader Chai Ling), 六四 (64, referring to June 4), 美国之音 (Voice of America), and 太子党 (princelings, referring to the children of government officials).
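The "number of lists each keyword appears on" figure can be illustrated with a minimal sketch. This is not the script used to build the repository's CSV; the function name `count_list_membership`, the list labels, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def count_list_membership(lists):
    """Map each keyword to the number of distinct lists it appears on.

    `lists` is a dict of {list_name: iterable of keywords}. The names and
    structure are hypothetical, not the repository's actual layout.
    """
    counts = Counter()
    for keywords in lists.values():
        # Deduplicate within a list so a repeated keyword counts once per list.
        unique = {k.strip() for k in keywords if k.strip()}
        counts.update(unique)
    return counts


# Toy data for illustration only; these are not the real lists.
sample_lists = {
    "list_a": ["魏京生", "六四", "柴玲"],
    "list_b": ["魏京生", "六四", "六四"],
    "list_c": ["魏京生", "美国之音"],
}

counts = count_list_membership(sample_lists)

# Rank by number of lists (descending), then by keyword for a stable order.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
for keyword, n in ranked:
    print(keyword, n)
```

Deduplicating within each list before counting keeps a keyword that is repeated in one source from inflating its cross-list total.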
The CSV file contains machine translations from Google and human translations or notes for most of the keywords. Many keywords also include theme and category variables, carried over from sources that had previously tagged their own lists.
Currently, there are three different versions:
- all.csv: All the keywords and all available data/variables, plus 3,987 popular keywords (3,803 of them non-sensitive) that can be used as possible controls when searching. These popular/non-sensitive keywords were taken from the titles of the top 1000 most viewed articles on Wikipedia China in April 2013 (995 after a few Wikipedia meta-pages were removed) and from the titles of articles that generated more than 10 combined views on August 1 during 0:00-1:00 and 12:00-13:00.
- no-dummy-vars-for-categories-and-themes.csv: All the keywords, without the dummy variables for each of the themes and categories that were tagged by The Citizen Lab. Category and theme information is instead stored in catch-all “category” and “theme” variables (columns).
- no-dummy-vars-for-categories-and-themes_only-sensitive-words.csv: Same as above, except with the non-sensitive words also removed.

Once you have downloaded a file, you can also sort it by keyword length and by the number of lists each keyword appears on.
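As a rough sketch of that kind of sorting, the snippet below orders rows by list count (highest first) and then by keyword length (shortest first). The column names `keyword` and `list_count` are assumptions made for illustration; check the header row of the file you download and adjust them. The list counts in the sample rows follow the figures given above for those three keywords.

```python
import csv
import io


def load_rows(fileobj):
    """Read a CSV with a header row into a list of dicts."""
    return list(csv.DictReader(fileobj))


def sort_rows(rows, count_col="list_count", keyword_col="keyword"):
    """Sort by list count (descending), then keyword length (ascending).

    The default column names are hypothetical; pass the real ones
    from the downloaded file's header.
    """
    return sorted(rows, key=lambda r: (-int(r[count_col]), len(r[keyword_col])))


# In-memory stand-in for a downloaded file, using assumed column names.
sample = io.StringIO(
    "keyword,list_count\n"
    "美国之音,12\n"
    "魏京生,13\n"
    "六四,12\n"
)

rows = sort_rows(load_rows(sample))
top = [r["keyword"] for r in rows]
print(top)
```

To run this against a real download, replace `sample` with something like `open("all.csv", encoding="utf-8", newline="")`, assuming the file is UTF-8 encoded.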
The thirteen lists this collection combines are:
| Creator/source | Tested on/found from | Number of keywords | Year | Method + source |
|---|---|---|---|---|
| University of New Mexico / The Citizen Lab | Sina UC | 1,818 | 2013 | reverse engineered from the client; analysis here; download link |
| University of New Mexico / The Citizen Lab | Tom-Skype | 2,574 | 2013 | reverse engineered from the client; analysis here; download link |
| The Citizen Lab | LINE | 673 | 2014 | reverse engineered from the client; analysis here; download link |
| Jason Q. Ng (Blocked on Weibo) | Sina Weibo | 839 | 2013 | running Wikipedia China article titles through Sina Weibo search; more analysis and book |
| Xia Chu | Great Firewall | 669 | 2014 | HTTP request scans of Wikipedia China articles to see if they’d trigger a GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages) |
| China Digital Times | Sina Weibo | 2,448 | 2014 | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT’s Grass Mud Horse Lexicon e-book; download link |
| GreatFire.org | Wikipedia | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link |
| Google/ATGFW.org | Google/Great Firewall | 456 | 2012 | ATGFW.org and GreatFire.org reverse engineered the keywords Google was using to warn users of censorship while using their service in China; download link |
| Jeffrey Knockel (University of New Mexico) | Sina Show | 910 | 2014 | extracted list from Sina Show app; download link |
| Unknown | 163.com | 376 | 2008 | archived by Nart Villeneuve; circulated on 163.com, a Chinese web portal; download link |
| Omnitalk BBS users? | Tencent QQ | 863 | 2004 | archived by Nart Villeneuve; extracted from Tencent QQ app; more info and analysis from CDT; download link |
| Jed Crandall et al. / “ConceptDoppler” | Great Firewall | 669 | 2008 | archived by Nart Villeneuve; “HTTP keyword filtering by Internet routers”; website; paper; download link |
| Unknown | a “blog provider” | 844 | 2005 | archived by Nart Villeneuve; according to Villeneuve: “This is a keyword list from a blog provider in China.”; download link |
Please follow the GitHub repository for future updates.
We encourage others to incorporate their own lists into the project. If you know of a list that we have missed, or if you have produced one of your own, contact us in the comments, through @jasonqng’s Twitter, or via GitHub.