During The Citizen Lab’s 2014 Summer Institute on Monitoring Internet Openness and Rights, a researcher expressed a desire to test for keyword censorship on a Chinese online service. With limited resources and time, however, the researcher had difficulty determining which keywords to use for these tests.

Because lists of censored and sensitive Chinese keywords already exist on the web, including sets produced by The Citizen Lab itself, this problem might appear to have long been “solved.” However, keyword lists can be overly inclusive and contain too many terms for time-strapped researchers to use efficiently. Furthermore, pulling out only the most sensitive keywords to test (e.g., Falun Gong, June 4, and Xi Jinping) while also ensuring that the keywords cover a broad range of topics can be quite difficult, even for researchers familiar with the Chinese online sphere. For those without experience in Chinese or in Internet censorship, it is a daunting task. Researchers must first locate these scattered lists and then winnow them down to something more usable.

To assist researchers faced with these problems, we have collected 13 lists of sensitive Chinese keywords and aggregated them into a single, sortable, and shareable CSV file (see a Google Docs sample, sorted by the number of lists each keyword appears on). This file, along with a description of the 13 lists and their sources/origins, is located in a GitHub repository that will be updated as new Chinese keyword lists are identified.

The 13 lists contain 9,054 unique keywords, including those in Chinese, English, pinyin, or a combination of the three. The earliest list dates to 2004 (the leaked Tencent QQ blacklist), and the most recent was produced in November 2014 by Citizen Lab collaborator Jeffrey Knockel (University of New Mexico), who extracted 910 keywords from Sina Show. The keyword 魏京生, the name of human rights activist Wei Jingsheng, was found on every list. Four other keywords were on twelve of the thirteen lists: 柴玲 (the June 4 student leader Chai Ling), 六四 (64, referring to June 4), 美国之音 (Voice of America), and 太子党 (princelings, referring to the children of government officials).
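The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used to build the actual file: the list names (`list_a`, etc.), the sample contents, the function names, and the output column names are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def aggregate_keyword_lists(lists):
    """Map each unique keyword to the sorted names of the lists containing it."""
    sources = defaultdict(set)
    for list_name, keywords in lists.items():
        for keyword in keywords:
            cleaned = keyword.strip()  # ignore stray whitespace around keywords
            if cleaned:
                sources[cleaned].add(list_name)
    return {kw: sorted(names) for kw, names in sources.items()}


def to_csv(aggregated):
    """Render rows sorted by number of lists (descending), then by keyword."""
    rows = sorted(aggregated.items(), key=lambda item: (-len(item[1]), item[0]))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["keyword", "list_count", "lists"])  # hypothetical header
    for keyword, names in rows:
        writer.writerow([keyword, len(names), ";".join(names)])
    return buffer.getvalue()


# Hypothetical toy input: three small lists instead of the thirteen real ones.
sample_lists = {
    "list_a": ["魏京生", "六四", "美国之音"],
    "list_b": ["魏京生", "六四 ", "太子党"],
    "list_c": ["魏京生", "柴玲"],
}

aggregated = aggregate_keyword_lists(sample_lists)
print(to_csv(aggregated))
```

Sorting by the number of lists a keyword appears on gives a rough proxy for how widely it is treated as sensitive, which is the ordering used in the sample spreadsheet mentioned above.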

The CSV file contains machine translations from Google and human translations/notes for most of the keywords. Many keywords also have theme and category variables, because several of the sources had already tagged their keyword lists.
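For a researcher who wants a short test set that still spans several topics, one possible approach is to take the most widely listed keyword from each category. The sketch below assumes a CSV with `keyword`, `list_count`, and `category` columns; those column names, the category labels, and the `example-term` row are hypothetical and may not match the actual file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data. The list counts for the five Chinese keywords follow
# the text above; the column names and category labels are illustrative only.
SAMPLE_CSV = """keyword,list_count,category
魏京生,13,people
柴玲,12,people
六四,12,events
美国之音,12,media
太子党,12,politics
example-term,1,media
"""


def top_keywords_per_category(csv_text, per_category=1, min_lists=2):
    """Pick the most widely listed keywords in each category."""
    by_category = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        count = int(row["list_count"])
        if count >= min_lists:  # drop keywords that appear on too few lists
            by_category[row["category"]].append((count, row["keyword"]))

    selection = {}
    for category, entries in by_category.items():
        # Highest list count first; ties broken by keyword for stable output.
        entries.sort(key=lambda entry: (-entry[0], entry[1]))
        selection[category] = [kw for _, kw in entries[:per_category]]
    return selection


selection = top_keywords_per_category(SAMPLE_CSV)
for category, keywords in selection.items():
    print(category, keywords)
```

Raising `per_category` widens the test set, while raising `min_lists` restricts it to keywords that more sources agree on.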

Currently, there are three different versions:

The thirteen lists this collection combines are:

| Creator/source | Tested on/found from | Number of keywords | Year | Method + source |
| --- | --- | --- | --- | --- |
| University of New Mexico / The Citizen Lab | Sina UC | 1,818 | 2013 | reverse engineered from the client; analysis here; download link |
| University of New Mexico / The Citizen Lab | Tom-Skype | 2,574 | 2013 | reverse engineered from the client; analysis here; download link |
| The Citizen Lab | LINE | 673 | 2014 | reverse engineered from the client; analysis here; download link |
| Jason Q. Ng (Blocked on Weibo) | Sina Weibo | 839 | 2013 | running Wikipedia China article titles through Sina Weibo search; more analysis and book |
| Xia Chu | Great Firewall | 669 | 2014 | HTTP request scans of Wikipedia China articles to see if they’d trigger a GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages) |
| China Digital Times | Sina Weibo | 2,448 | 2014 | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT’s Grass Mud Horse Lexicon e-book; download link |
| GreatFire.org | Wikipedia | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link |
| Google/ATGFW.org | Google/Great Firewall | 456 | 2012 | ATGFW.org and GreatFire.org reverse engineered the keywords Google was using to warn users of censorship while using their service in China; download link |
| Jeffrey Knockel (University of New Mexico) | Sina Show | 910 | 2014 | extracted list from Sina Show app; download link |
| Unknown | 163.com | 376 | 2008 | archived by Nart Villeneuve; circulated on 163.com, a Chinese web portal; download link |
| Omnitalk BBS users? | Tencent QQ | 863 | 2004 | archived by Nart Villeneuve; extracted from Tencent QQ app; more info and analysis from CDT; download link |
| Jed Crandall et al / “ConceptDoppler” | Great Firewall | 669 | 2008 | archived by Nart Villeneuve; “HTTP keyword filtering by Internet routers”; website; paper; download link |
| Unknown | a “blog provider” | 844 | 2005 | archived by Nart Villeneuve; according to Villeneuve: “This is a keyword list from a blog provider in China.”; download link |

Please follow the GitHub repository for future updates.

We encourage others to incorporate their own lists into the project. If you know of a list that we have missed, or if you have produced one of your own, contact us in the comments, through @jasonqng on Twitter, or via GitHub.