Online censorship in China has been well documented in popular applications like WeChat and Weibo. But how does censorship affect individual programmers? A new paper by the Citizen Lab investigates how Chinese censorship reaches independent developers and reveals that, while developers include censorship lists in open source projects, these blacklists have little apparent similarity to one another, raising several questions about their origins.

Founded in 2008, GitHub is a popular open source software development and sharing platform used by programmers all over the world. By scraping GitHub code repositories, Citizen Lab researchers found over 1,000 Chinese blacklists comprising over 200,000 unique keywords, representing the largest dataset of Chinese blacklisted keywords to date.

“GitHub has become the dominant social platform for sharing and developing open source code,” says author Jeffrey Knockel. “As software becomes more complex, having access to open source code becomes increasingly important.” He suggests this is why China has not blocked access to GitHub, despite the often adversarial relationship between GitHub and the Chinese government. GitHub has, however, been the subject of targeted attacks in the past.

“Previous research has focused on how censorship is implemented in company products,” says Knockel. “This research shows us that individual developers are feeling the same pressures or obligations to censor as those working for companies.”

The blacklisted keywords reflected a variety of taboo topics, including those related to prurient interests, Falun Gong references, political movements, government criticism, and political leaders. Many lists contained over 1,000 words, making it unlikely that individual developers compiled these lists on their own. Yet the lists differ so much from one another that they do not appear to come from a common source.

“We found little overlap between the lists, despite most being very long, raising questions as to where all of these large but disparate lists originate.”
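To make the idea of "overlap" between lists concrete, here is a minimal sketch of one common way to measure it: the Jaccard similarity, which divides the number of keywords two lists share by the number of keywords appearing in either. This is an illustration only, not the method used in the paper; the function names and the placeholder lists are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared keywords divided by all keywords found in either list."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty lists: treat as no overlap
    return len(a & b) / len(a | b)


def pairwise_overlap(lists):
    """Jaccard similarity for every pair of named keyword lists."""
    return {
        (x, y): jaccard(lists[x], lists[y])
        for x, y in combinations(sorted(lists), 2)
    }


# Hypothetical placeholder lists, not data from the study.
example_lists = {
    "project_a": ["alpha", "beta", "gamma", "delta"],
    "project_b": ["gamma", "delta", "epsilon", "zeta"],
    "project_c": ["eta", "theta"],
}

scores = pairwise_overlap(example_lists)

# Count of distinct keywords across all lists combined.
unique_keywords = set().union(*example_lists.values())

for pair, score in scores.items():
    print(pair, round(score, 3))
print("unique keywords:", len(unique_keywords))
```

A score near 0 for most pairs would correspond to the "little overlap" the researchers describe, while scores near 1 would point toward a shared source.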

Additionally, it remains an open question why developers include these lists in their projects. The researchers suggest several possibilities to explain their presence:

“The developers may be concerned that they themselves or others using or deploying their projects may be held liable for content shared on their projects in the same manner that commercial companies are known to be controlled. Developers may be accustomed to this requirement and see it as necessary for their project to gain users and traction in the Chinese market. It may also be that they believe their application should be censored and share the political concerns that motivate the Chinese government.”

To test these theories, the authors suggest interviewing the developers about their motivations.

The findings of this new paper are being presented at the First Workshop on NLP for Internet Freedom in Santa Fe, New Mexico.
