Uncovering “Private” Misinformation

by Ashkan Kazemi (PhD student, Michigan AI)

While fake news detection is an already difficult task for machines (and humans!), what happens when large unmoderated private groups become filled with misinformation? 

Even if there was a perfect algorithm that could automatically detect fake news (which doesn’t exist), it still would require access to the user content to determine its veracity. Although social media platforms usually have access to scan user content to detect malicious purposes, this is simply not the case for end-to-end encrypted social media. 

As a PhD candidate in the AI Lab interested in conducting research at the intersection of NLP and misinformation, I (Ashkan Kazemi) wanted to understand more about the dynamics of fake news on closed messaging platforms.

There is currently no easy way to discover misleading and malicious content on WhatsApp and other end-to-end encrypted platforms at scale, given that all conversations are accessible only to users partaking in the conversations. In a recent paper published in Harvard Kennedy School Journal of Misinformation Review with collaborators at Meedan, MIT and University of Duisburg-Essen, we conducted a case-study of the 2019 Indian General Elections to analyze the usefulness of a crowd-sourced “tipline” through which users can submit content (“tips”) that they want fact-checked. A tipline is an account on social media operated by a fact-checking organization. Social media users are invited to send potential misinformation to the tipline to (a) see if there is a fact-check or (b) share it as a ‘tip’ that the fact-checking organization might investigate.

Using state-of-the-art AI technology for matching similar texts and images, we compared content sent to the election tipline to the content collected from a large-scale crawl of public WhatsApp groups (these are WhatsApp groups where the link to join is shared openly), ShareChat (a popular image sharing platform in India similar to Instagram), and fact-checks published during the same time in order to understand the overlap between these sources.

Our results show the effectiveness of tiplines in content discovery for fact-checking on encrypted platforms. We show that:

  • A majority of the viral content spreading on WhatsApp public groups and on ShareChat was shared on the WhatsApp tipline first, which is important as early identification of misinformation is an essential element of an effective fact-checking pipeline given how quickly rumors can spread (Vosoughi et al., 2018).
  • The tipline covers a significant portion of popular content: 67% of images and 23% of text messages shared more than 100 times in public WhatsApp groups appeared on the tipline.
  • Compared to content from popular fact-checking organizations, the messages sent to tiplines cover a much higher proportion of WhatsApp public group messages. While misinformation often flows between platforms (Resende et al., 2019), this suggests that tiplines can capture unique content within WhatsApp that is not surfaced by fact-checking efforts relying on platforms without end-to-end encryption.

The authors use text and image “embedding models” which are used to map texts and images to mathematical vectors that facilitate processing such complex data at scale. For instance, the authors used PDQ hashing to construct a summary mosaic of about 35,000 images submitted to the tipline. As we move from top left to the bottom right, we see representative images of text turn into image+text combinations across the diagonal, and images in the bottom right are mostly of people. As the visual summary reflects, pictures of newspapers or images with text on them are the most dominant image type submitted to the tiplines under study, constituting over 40% of the content, followed by memes which make up roughly 35% of the content. 

An exciting aspect of this project for me was to develop and use cutting edge technology for non-English language data, as the dataset under study contained text in Hindi, Tamil, Bengali and a number of other Indic languages. Such lower resource languages are often overlooked in the AI and NLP technology development cycles, and working with such data often brings nuanced challenges that make for intriguing research questions. Text and image embedding models were a crucial part of our study, enabling us to take a closer look at the content shared on the Indian election tipline to better understand the gravity of misinformation on private WhatsApp groups. 

Overall, our findings demonstrate tiplines can be an effective privacy-preserving, opt-in solution to identify potentially misleading information for fact-checking on WhatsApp and other end-to-end encrypted platforms.

More about the author: https://ashkankazemi.ir/

(Editor: Jung Min Lee, PhD student, Michigan AI)