Workshop on Multimodal Content Moderation (MMCM)

at 2024 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)

Date: June 17, 2024

Venue: Room Arch 304, Seattle Convention Center, Seattle WA, USA


Welcome to the 2nd IEEE Workshop on Multimodal Content Moderation (MMCM) being held in conjunction with CVPR 2024!

Content moderation (CM) is a rapidly growing need in today’s world, with a high societal impact: automated CM systems can discover discrimination, violent acts, hate/toxicity, and much more, across a variety of signals (visual, text/OCR, speech, audio, language, generated content, etc.). Leaving unsafe content on social platforms and devices, or serving it to users, can cause a variety of harmful consequences, including brand damage to institutions and public figures, erosion of trust in science and government, marginalization of minorities, geo-political conflicts, suicidal ideation, and more. Besides user-generated content, content generated by powerful AI models such as DALL-E and GPT presents additional challenges to CM systems.

With the prevalence of multimedia social networking and online gaming, the problem of sensitive content detection and moderation is by nature multimodal. Moreover, content moderation is contextual and culturally multifaceted; for example, different cultures have different conventions about gestures. This requires CM approaches to be not only multimodal, but also context-aware and culturally sensitive.


📢 Program is live!


MMCM 2024 will be a hybrid workshop, with both in-person and virtual attendance options. The in-person event will be held at the Seattle Convention Center, Room Arch 304, and simultaneously available via the CVPR Virtual Platform.

Conference content hosted on the CVPR virtual platform will be available exclusively to registered attendees. Accepted papers in the conference proceedings will be publicly available via the CVF website, with the final version posted to IEEE Xplore after the conference.


Mike Pappas

Modulate AI

Sarah Gilbert

Research Director, Citizens and Technology Lab
Cornell University

Xiang Hao

Principal Applied Scientist
Prime Video at Amazon

Susanna Ricco

Senior Staff Software Engineer
Google DeepMind

Paul Rad

Associate Professor
University of Texas at San Antonio

Symeon Papadopoulos

Principal Researcher
Information Technologies Institute, Greece

Manuel Brack

TU Darmstadt

Lin Ai

Columbia University

Marlyn Thomas Savio

Senior Behavioral Scientist
TaskUs

Kaustav Kundu

Research Scientist
Amazon


This workshop intends to draw more visibility and interest to this challenging field, and to establish a platform that fosters in-depth idea exchange and collaboration. Authors are invited to submit original and innovative papers. We aim for a broad scope; topics of interest include, but are not limited to:

  • Multimodal content moderation in image, video, audio/speech, and text;
  • Context-aware content moderation;
  • Datasets/benchmarks/metrics for content moderation;
  • Annotations for content moderation with ambiguous policies, perspectivism, noisy or disagreeing labels;
  • Content moderation for synthetic/generated data (image, video, audio, text); utilizing synthetic datasets;
  • Dealing with limited data for content moderation;
  • Continual & adversarial learning in content moderation services;
  • Explainability and interpretability of models;
  • Challenges of at-scale real-time content moderation needs vs. human-in-the-loop moderation;
  • Detecting misinformation;
  • Detecting/mitigating biases in content moderation;
  • Analyses of failures in content moderation.

Submission Link:

Authors are required to submit full papers by the paper submission deadline. These are hard deadlines due to the tight timeline; no extensions will be given. Please note that due to the tight timeline to have accepted papers included in the CVPR proceedings, no supplemental materials or rebuttal will be accepted.

Papers are limited to eight pages, including figures and tables, in the CVPR style. Additional pages containing only cited references are allowed. Papers exceeding eight pages (excluding references) or violating formatting specifications will be rejected without review. For more information on submission instructions, templates, and policies (double-blind review, dual submissions, plagiarism, etc.), please consult the CVPR 2024 Author Guidelines webpage. Please abide by CVPR policies regarding conflict of interest, plagiarism, double-blind review, dual submissions, and attendance.

Accepted papers will be included in the CVPR proceedings, on IEEE Xplore, and on CVF website. Authors will be required to transfer, to the IEEE, copyrights for any papers published in the conference proceedings. At least one author is expected to attend the workshop and present the paper.


Event | Date
Paper Submission Deadline | March 22, 2024, 11:59:59 PM Pacific Time
Final Decisions to Authors | April 10, 2024 (revised from April 8, 2024)
Camera Ready Deadline | April 14, 2024, 11:59:59 PM Pacific Time



Mei Chen

Principal Research Manager
Responsible & Open AI Research, Microsoft

Cristian Canton

Research Manager
Responsible AI, Meta

Davide Modolo

Research Manager
AWS AI Labs, Amazon

Maarten Sap

Assistant Professor
LTI, Carnegie Mellon University

Maria Zontak

Sr. Applied Scientist
Alexa Sensitive Content Intelligence, Amazon

Matt Lease

UT Austin


Venue: Room Arch 304, Seattle Convention Center

Time (PST) | Event | Title | Speaker(s)
08:30 - 08:45 | Opening Remarks | Opening Remarks and Logistics for the Day | Mei Chen, Microsoft
08:45 - 09:15 | Invited Talk | Leveraging AI Filters for Mitigating Viewer Impact from Disturbing Imagery | Symeon Papadopoulos, ITI
09:15 - 09:45 | Invited Talk | Human Moderation in an Evolving Web Landscape: Wellness Gaps and Needs | Marlyn Thomas Savio, TaskUs
09:45 - 10:15 | Invited Talk | Content Safeguarding in the Generative AI Era | Paul Rad, UTSA
10:15 - 10:30 | Coffee Break | |
10:30 - 11:20 | Invited Talk | Do moderators dream of AI support? Understanding what moderators really need through community-engaged research | Sarah Gilbert, Cornell University
11:20 - 12:00 | Panel Discussion | Culture, Context, Metrics, Transparency in Responsible AI | Sarah Gilbert, Cornell University; Paul Rad, UTSA; Symeon Papadopoulos, ITI; Xiang Hao, Amazon
12:00 - 13:30 | Lunch Break | |
13:30 - 14:15 | Invited Talk | Ephemeral Chat Moderation: Understanding Invisible Harms | Mike Pappas, Modulate AI
14:15 - 14:45 | Invited Talk | Enhancing Visual Content Safety: Multimodal Approaches for Dataset Curation and Model Safeguarding | Manuel Brack, TU Darmstadt
14:45 - 15:15 | Invited Talk | Multimodal Deception Detection using Automatically Extracted Acoustic, Visual, and Lexical Features | Lin Ai, Columbia University
15:15 - 15:45 | Coffee Break | |
15:45 - 16:45 | Invited Talk | Fair and Inclusive Multimodal Generations | Susanna Ricco, Google DeepMind
16:45 - 17:00 | Accepted Paper | An End-to-End Vision Transformer Approach for Image Copy Detection | Jiahe Steven Lee, Mong Li Lee, Wynne Hsu (National University of Singapore)
17:00 - 17:40 | Panel Discussion | Multimodal Generation, Moderation, Evaluation | Susanna Ricco, Google DeepMind; Mike Pappas, Modulate AI; Manuel Brack, TU Darmstadt; Kaustav Kundu, Amazon
17:40 - 17:45 | Closing Remarks | | Mei Chen, Microsoft

Leveraging AI Filters for Mitigating Viewer Impact from Disturbing Imagery

Speakers: Symeon Papadopoulos, ITI

Abstract: Exposure to disturbing imagery can significantly impact individuals, especially professionals who encounter such content as part of their work. This talk will summarize insights from our team's recent research into leveraging AI to reduce the impact of such imagery. The talk will first briefly present how large public datasets that are seemingly innocuous can be used to create improved AI-based content moderation models. It will then present findings from a user study, involving 107 participants, predominantly journalists and human rights investigators, that explores the capability of Artificial Intelligence (AI)-based image filters to mitigate the emotional impact of viewing such disturbing content.

Human Moderation in an Evolving Web Landscape: Wellness Gaps and Needs

Speakers: Marlyn Thomas Savio, TaskUs

Abstract: Human-led content moderation, which involves trained reviewers assessing user-generated digital content to ensure compliance with a platform’s policies, has become indispensable for offering a safe and authentic experience to web users. Given moderators’ exposure to potentially traumatic content, efforts are growing to provide psychosocial interventions and support tools that can mitigate moderators’ distress. There is a need for continual empathy and innovation to keep pace with emerging challenges (e.g., disinformation, dark AI, virtual reality) that pose newer risks for the moderator population. This talk will briefly trace the professionalization of content moderation, and how engineers can help build robust systems that enable moderators to thrive and serve. Various concerns articulated by moderators that go beyond egregious content and seek technological solutions will be discussed. The presentation will call for ideating and building towards seamless human+tech workflows that can address complexity in scalable moderation.

Content Safeguarding in the Generative AI Era

Speakers: Paul Rad, UTSA

Abstract: Online platforms are increasingly being exploited by malicious actors to disseminate unsafe and harmful content. In response, major platforms employ artificial intelligence (AI) and human moderation to obfuscate such content, ensuring user safety. Three critical needs for effectively obfuscating unsafe content are: 1) verifying the authenticity of the content, 2) providing an accurate rationale for the obfuscation, and 3) obfuscating the sensitive regions while preserving the safe regions. In this talk, we address these challenges within the context of the generative AI era. We first emphasize the importance of verifying the authenticity of content to distinguish between genuine and manipulated media. We then introduce a Visual Reasoning Language Model (VLM) conditioned on pre-trained unsafe image classifiers to provide accurate rationales grounded in specific attributes of unsafe content. Additionally, we propose a counterfactual explanation algorithm that minimally identifies and obfuscates unsafe regions. This is achieved by utilizing an unsafe image classifier attribution matrix to guide optimal subregion segmentation, followed by an informed greedy search to determine the minimum number of subregions required to modify the classifier’s output based on the attribution score. Our approach ensures that unsafe content is effectively obfuscated while maintaining the integrity of safe content, advancing the field of content safeguarding in generative AI.
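The informed greedy search described in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the speakers' implementation: `subregions` (boxes as `(row0, row1, col0, col1)`), `attributions` (per-region unsafety scores from the classifier's attribution matrix), `classifier`, and `blur` are all assumed inputs.

```python
import numpy as np

def minimal_obfuscation(image, subregions, attributions, classifier, blur):
    """Greedy search: obfuscate the highest-attribution subregions first,
    stopping as soon as the unsafe classifier's decision flips to safe."""
    order = np.argsort(attributions)[::-1]  # most "unsafe" regions first
    edited = image.copy()
    masked = []
    for idx in order:
        r0, r1, c0, c1 = subregions[idx]
        edited[r0:r1, c0:c1] = blur(edited[r0:r1, c0:c1])
        masked.append(int(idx))
        if not classifier(edited):  # decision flipped: minimal set found
            break
    return edited, masked
```

Because regions are tried in decreasing attribution order, the loop tends to stop after touching only the regions that actually drive the unsafe prediction, leaving the safe content untouched.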

Do moderators dream of AI support? Understanding what moderators really need through community-engaged research

Speakers: Sarah Gilbert, Cornell University

Abstract: Content moderation is central to the success of online platforms; however, getting it right is hard. While restrictive moderation stymies participation that disproportionately affects at-risk and marginalized populations, overly permissive moderation enables extremism and harassment. Negotiating this delicate balance is a daily task undertaken by millions of volunteers around the world and across platforms and modalities. However, in doing so these moderators suffer from burnout and exhaustion from repeated exposure to toxic content, stressful decision-making, and abusive users.
Advances in generative AI may present promising solutions to human labor and fairness issues in content moderation. However, without understanding the work moderators do, such interventions risk replicating or even exacerbating the issues they are intended to solve. By presenting a behind-the-scenes look into the challenges volunteer moderators face, I’ll share some promising avenues for advancing human-in-the-loop, AI-assisted moderation support, highlight where human labor is needed, and why community-centered approaches to research and design are a must for everyone working on moderation issues.

Ephemeral Chat Moderation: Understanding Invisible Harms

Speakers: Mike Pappas, Modulate AI

Abstract: Content moderation tends to focus on posts, clips, and other content which is long-lived and seen by many people, as in traditional social media applications. But some of the most personal and poignant harms happen in live chats - especially voice chats - which are ephemeral and so much more difficult to moderate. In this talk, Mike Pappas (CEO of Modulate, which moderates ephemeral chats for Call of Duty, Grand Theft Auto Online, and other top platforms) will discuss where intuitions about traditional content moderation break down for ephemeral chat - and what true best practices for live chat look like.

Enhancing Visual Content Safety: Multimodal Approaches for Dataset Curation and Model Safeguarding

Speakers: Manuel Brack, TU Darmstadt

Abstract: As the landscape of generative AI evolves, addressing the safety of visual content has never been more critical. In this talk, we consider two aspects crucial to current visual content moderation challenges. First, we look into generative text-to-image models and approaches to automatically benchmark their susceptibility to creating unsafe images. Specifically, our widely adopted image generation test bed—inappropriate image prompts (I2P)—contains dedicated, real-world text prompts that elicit the generation of concepts such as nudity and violence. Secondly, we will discuss how to leverage advances in vision-language models (VLMs) to perform comprehensive and flexible safety assessments at scale. Our LlavaGuard family of safeguard models is easily customizable to varying safety policies. Each assessment includes a safety rating, information on violated safety categories, and an in-depth textual rationale. For example, LlavaGuard can be used for dataset curation, model safeguarding, or generative model evaluation in combination with I2P.
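An assessment of the kind the abstract describes (a safety rating, the violated categories, and a textual rationale) might be represented as a record like the one below. This is a hypothetical schema for illustration; the field names are assumptions, not LlavaGuard's actual output format.

```python
from dataclasses import dataclass, field

@dataclass
class SafetyAssessment:
    """One safeguard-model assessment record (hypothetical schema):
    an overall rating, the violated policy categories, and a rationale."""
    rating: str                                  # e.g. "safe" or "unsafe"
    violated_categories: list = field(default_factory=list)
    rationale: str = ""

def is_flagged(a: SafetyAssessment) -> bool:
    # Flag an item if it is rated unsafe or violates any policy category,
    # e.g. for removal from a training set during dataset curation.
    return a.rating == "unsafe" or bool(a.violated_categories)
```

Such records are convenient for downstream use: dataset curation can filter on `is_flagged`, while evaluation can aggregate `violated_categories` across a benchmark like I2P.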

Multimodal Deception Detection using Automatically Extracted Acoustic, Visual, and Lexical Features

Speakers: Lin Ai, Columbia University

Abstract: Deception detection in conversational dialogue has attracted much attention in recent years. Existing methods rely heavily on human-labeled annotations, which are costly and potentially inaccurate. In this work, we present an automated system that utilizes multimodal features for conversational deception detection without human annotations. We study the predictive power of different modalities and combine them for better performance. We use openSMILE to extract acoustic features after applying noise reduction techniques to the original audio. Facial landmark features are extracted from the visual modality. We experiment with training facial expression detectors and applying Fisher Vectors to encode sequences of facial landmarks of varying lengths. Linguistic features are extracted from automatic transcriptions of the data. We examine the performance of these methods on the Box of Lies dataset of deception game videos, achieving 73% accuracy using features from all modalities. This result is significantly better than previous results on this corpus, which relied on manual annotations, and also better than human performance.
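The abstract combines acoustic (openSMILE), visual (facial landmark), and lexical (transcript) features for a single prediction. One common way to combine modalities is feature-level (early) fusion: normalize each modality's feature vector and concatenate them. The sketch below is an assumed illustration of that general technique, not the speakers' system.

```python
import numpy as np

def fuse_modalities(acoustic, visual, lexical):
    """Early (feature-level) fusion: z-normalize each modality's feature
    vector separately, then concatenate into one multimodal vector that a
    downstream classifier can consume."""
    def znorm(x):
        x = np.asarray(x, dtype=float)
        std = x.std()
        # Guard against zero-variance features (constant vectors).
        return (x - x.mean()) / std if std > 0 else x - x.mean()
    return np.concatenate([znorm(acoustic), znorm(visual), znorm(lexical)])
```

Per-modality normalization matters here because openSMILE acoustic statistics, landmark coordinates, and lexical counts live on very different scales; without it, one modality would dominate the fused representation.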

Fair and Inclusive Multimodal Generations

Speakers: Susanna Ricco, Google DeepMind

Abstract: In the age of Generative AI, technology is responsible for not just finding and serving content, but also for creating the content in the first place. Defining good results is notoriously difficult. This talk introduces our research in inclusive generative models, focusing on avoiding results that reinforce harmful stereotypes. We cover both image generation and analysis tasks such as captioning and VQA, and discuss how we navigate the challenges of translating broad ethical, and necessarily subjective, goals into concrete, actionable objectives.

An End-to-End Vision Transformer Approach for Image Copy Detection

Speakers: Jiahe Steven Lee, Mong Li Lee, Wynne Hsu (National University of Singapore)

Abstract: Image copy detection is one of the pivotal tools to safeguard online information integrity. The challenge lies in determining whether a query image is an edited copy, which necessitates the identification of candidate source images through a retrieval process. The process requires discriminative features comprising both global descriptors that are designed to be augmentation-invariant and local descriptors that can capture salient foreground objects, in order to assess whether a query image is an edited copy of some source reference image. This work describes an end-to-end solution that leverages a Vision Transformer model to learn such discriminative features and perform implicit matching between the query image and the reference image.
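The retrieval step the abstract mentions is typically a nearest-neighbor search over global descriptors. The sketch below illustrates that generic step under the assumption of cosine similarity over L2-normalized descriptors; it is not the paper's method, and the descriptor source (e.g. a ViT embedding) is left abstract.

```python
import numpy as np

def retrieve_candidates(query_desc, ref_descs, k=5):
    """Rank reference images by cosine similarity of their global
    descriptors to the query descriptor; return top-k indices and scores."""
    q = query_desc / np.linalg.norm(query_desc)
    R = ref_descs / np.linalg.norm(ref_descs, axis=1, keepdims=True)
    sims = R @ q                       # cosine similarity per reference
    top = np.argsort(sims)[::-1][:k]   # indices of the k best matches
    return top, sims[top]
```

In a full copy-detection pipeline, these top-k candidates would then be verified with the finer-grained local-descriptor matching described in the abstract.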


Mike Pappas is the CEO/Co-founder of Modulate, which works with studios like Activision, Riot Games, and Rec Room to foster safer and more inclusive online voice chat. Mike’s work at Modulate spans developing new partnerships within the industry, monitoring trends and new opportunities for Modulate’s unique technology to have a positive impact, and reinforcing an internal and external culture of passion, respect, and personal growth. Outside of work, his passions include creating themed cocktail lists, long-distance running, and playing Nintendo games.
Sarah uses a sociotechnical lens to explore people’s experiences using online platforms. Her work has covered key areas such as motivations, informal learning, privacy and ethics, and online governance. To this end she uses a variety of methods, from qualitative ethnography and in-depth interviews to quantitative surveys and social network analysis. Currently, she is the Research Director at the Citizens and Technology (CAT) Lab, founded by Dr. J. Nathan Matias, and is leading the NSF-funded project Learning Code(s): Community-Centered Design of Automated Content Moderation, along with PIs Drs. Katie Shilton, Hal Daumé III, and Michelle Mazurek. Prior to that, she graduated from the University of British Columbia in 2018, advised by Drs. Caroline Haythornthwaite and Luanne Freund, and worked as a postdoctoral scholar at the University of Maryland on the NSF-funded PERVADE (Pervasive Data Ethics for Computational Research) project with Drs. Katie Shilton and Jessica Vitak.
Xiang Hao is a Principal Applied Scientist on the Prime Video team at Amazon. He was a Ph.D. student in the School of Computing at the University of Utah from August 2008 to February 2014, where he worked with Prof. Tom Fletcher at the Scientific Computing and Imaging Institute. Prior to coming to the University of Utah, he obtained his B.E. degree in Computer Science at Shandong University, China.
Susanna Ricco is a Senior Staff Software Engineer at Google DeepMind where she leads a team dedicated to advancing fairness and inclusion for multimodal AI models and systems. Her research centers around computational techniques designed to responsibly and faithfully model human perception of demographic, cultural, and social identities across various modalities. As part of an interdisciplinary team of researchers, she pioneers new approaches to model development, leading to higher quality, more controllable, more inclusive output, while reducing behaviors that reinforce harmful stereotypes. The team's work has contributed to improvements that make Google products work better for everyone, benefiting Google Photos, Nest, Search, YouTube, and more. Before joining Google in 2013, she earned a Ph.D. in Computer Science from Duke University. She was a recipient of the NSF Graduate Research Fellowship and is also a proud alum of Harvey Mudd College.
Paul received his B.S. degree in computer engineering from the Sharif University of Technology in 1994, a master’s degree in artificial intelligence from Tehran Polytechnic, and a master’s degree in computer science and a Ph.D. degree in electrical and computer engineering from The University of Texas at San Antonio, San Antonio, Texas, USA. He is currently the Founder and Director of the Secure AI and Autonomy Laboratory, Co-founder and Assistant Director of the Open Cloud Institute, and an Associate Professor with the Departments of Information Systems and Cyber Security and Electrical and Computer Engineering (by courtesy) at UTSA.
Dr. Symeon Papadopoulos is a Principal Researcher with the Information Technologies Institute (ITI), Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece. He holds an electrical and computer engineering diploma from the Aristotle University of Thessaloniki, a Professional Doctorate in Engineering from the Technical University of Eindhoven, a Master’s in Business Administration from the Blekinge Institute of Technology, and a PhD in Computer Science from the Aristotle University of Thessaloniki. His research interests lie at the intersection of multimedia understanding, social network analysis, information retrieval, big data management, and artificial intelligence. Dr. Papadopoulos has co-authored more than 40 papers in refereed journals, 10 book chapters, 130 papers in international conferences, and 3 patents, and has edited two books. He has participated in and coordinated a number of relevant EC FP7, H2020, and Horizon Europe projects in the areas of media convergence, social media, and artificial intelligence. He is leading the Media Analysis, Verification and Retrieval Group (MeVer), and is a co-founder of the Infalia Private Company, a spin-out of CERTH-ITI.
Manuel’s research is centred around the risks and promises of large ML models for human-centered AI. He explores how to leverage these models’ capabilities for the best while accounting for and mitigating the issues arising from large-scale training. The latest advances in deep learning have been highly data-driven, relying on billion-sized datasets randomly scraped from the internet. In his research, he investigates what models actually learn during pre-training and how to measure and utilize a model’s representations, with the goal of improving how we as humans interact with AI systems and making AI align with the behavior we expect from it.
Lin Ai is a PhD student at Columbia Speech Lab, where she is supervised by Professor Julia Hirschberg. Her research focuses on detecting misinformation and malicious intentions across various media platforms, including social media, news content, and the usage of LLMs. Lin employs a multidisciplinary approach, integrating techniques from natural language processing, machine learning, and social network analysis to identify and mitigate the spread of harmful information. Her work aims to enhance the understanding and development of automated systems capable of identifying and addressing deceptive and malicious content. Lin's dedication to this field contributes to the broader efforts in creating safer and more trustworthy online environments.
Dr. Marlyn Thomas Savio is a chartered psychologist with a PhD specializing in health psychology. In her current role as Senior Behavioral Scientist at TaskUs, Marlyn leads a global team of researchers focussed on employee wellness and productivity. Prior to this, she undertook clinical and teaching roles in healthcare and academia. Marlyn’s research career, spanning over a decade, has been in areas of organizational people care, user experience, biopsychosocial interventions, psychometric tooling, and phenomenological & cross-cultural explorations. She has authored peer-reviewed articles, book chapters, business white papers, and thought leadership pieces. Hailing from South Asia, Marlyn is passionate about advocating for diverse HCI design, and equitable & ethical technology for human augmentation.
Kaustav Kundu is a Research Scientist at AWS. He received his PhD from the Department of Computer Science at the University of Toronto, where he was advised by Prof. Raquel Urtasun and Prof. Sanja Fidler. His research interests lie broadly in the field of Computer Vision and Machine Learning. In the last few years, he has been excited about the problem of visual scene understanding. During his PhD, he explored ideas on how to combine geometric priors and semantic information for 3D scene understanding. At Amazon Go, he worked towards the goal of building efficient temporal representations across multiple cameras for real-time action detection.


  • Christopher Clarke, PhD student, University of Michigan
  • Gaurav Mittal, Senior Researcher, Microsoft
  • J.P. Lewis, Staff Research Scientist, Google Research
  • Jay Patravali, Data & Applied Scientist II, Microsoft
  • Jialin Yuan, PhD Student, Oregon State University
  • Jiarui Cai, Applied Scientist, AWS AI Labs
  • Lan Wang, PhD Student, Michigan State University
  • Mahmoud Khademi, Researcher 2, Microsoft
  • Mamshad Nayeem Rizve, PhD Student, University of Central Florida
  • Matthew Hall, Principal Applied Scientist, Microsoft
  • Reid Pryzant, Senior Researcher, Microsoft
  • Rishi Madhok, Senior Applied Science Manager, Microsoft
  • Sandra Sajeev, Data Scientist 2, Microsoft
  • Sarah Laszlo, Staff Research Scientist, Google Research
  • Satarupa Guha, Applied Scientist II, Microsoft
  • Simon Baumgartner, Software Engineer, Google Research
  • Soumik Mandal, Applied Scientist, Amazon
  • Tobias Rohde, Applied Scientist II, Amazon
  • Xuhui Zhou, PhD student, Carnegie Mellon University
  • Ye Yu, Senior Software Engineer, Microsoft
  • Zhen Gao, Applied Scientist II, Amazon


If you have any questions, please feel free to reach out to us below.

Previous Years

Check out the proceedings from previous years!