A recap of the first workshop on Safety for Conversational AI
Emily Dinan, Program Chair, Facebook AI
Verena Rieser, Program Chair, Heriot-Watt University
Alborz Geramifard, Program Committee, Facebook AI
Zhou Yu, Program Committee, UC Davis
Over the last several years, neural dialogue agents have vastly improved in their ability to carry a chit-chat conversation with humans. However, these models are often trained on large datasets scraped from the internet and, as such, may learn undesirable behaviors from this data, such as toxic or biased language. Additionally, these models may prove unsafe in other ways: they may provide unsound medical advice or fail to appropriately recognize and respond to messages indicating self-harm. This creates a host of challenges. What should the scope of “safety” be, and who decides what is acceptable or not? What types of safety assurances should be met before releasing a model for research or production? How can we quantify progress? How do we give voice to potentially affected parties in the process? And even if we could define the meaning and scope of “safety”, how do we solve the technical challenge of controlling our models such that they do not produce undesirable outputs?
We organized this workshop in order to bring together leading researchers and practitioners from across academia and industry to identify and discuss key technical and ethical challenges related to safety for conversational systems. In this post, we summarize key takeaways from the workshop and make two recommendations for next steps.
The first workshop on Safety for Conversational AI was held virtually on Thursday, October 15, 2020. Over 80 students, researchers, and engineers from academia and industry attended the workshop to envision safer and better-behaved neural conversational AI models. The day featured two keynotes, nine short talks, and breakout sessions, and concluded with a discussion with a panel of experts.
Barbara Grosz of Harvard University and Yulia Tsvetkov of Carnegie Mellon University gave the two keynotes at the workshop. Barbara opened the day with a refresher on dialogue theory foundations and a call to action: teach ethical reasoning to computer scientists. She reminded the audience of faculty, students, researchers, and product folks that we are working with socio-technical systems, and as such, we need to work closely with ethicists and social scientists. Barbara also noted that system bias is not only limited to systems learning from data: it can also originate from human design choices.
Yulia followed with a talk on social bias in conversational AI systems. She emphasized the high stakes of this problem, including the cognitive toll taken by members of underrepresented groups when forced to confront biases and stereotypes. Yulia also noted that previous attempts to remedy these issues have largely been reactive and focused primarily on hate speech. She urged the community to move toward more proactive approaches, and to consider the full spectrum of toxic language, including objectifying comments, microaggressions, and other forms of covert toxicity.
Following the keynotes, we heard short talks from nine researchers, who discussed topics ranging from the dangers of providing inappropriate medical advice to existing frameworks for ethical research. Others discussed empirical observations and techniques for improving our systems, like methods for training safer chatbots, dynamic data collection, or response strategies for dealing with user abuse.
Prior to the workshop, we sent a survey to all attendees with questions centered on the topic of releasing dialogue models — both to the research community (like BlenderBot or DialoGPT) and in the form of a product (like Xiaoice or Tay or Siri). In particular, the questions aimed to probe the tradeoff between ensuring conversational AI models are safe and enabling open, reproducible research. Among the 56 workshop participants who answered the survey, the breakdown of responses for how and when to release dialogue models to the research community was as follows:
- 3.6% Never release: models should never be released unless it can be guaranteed that they are 100% safe
- 28.6% Always release: models should always be released as reproducible research is key to scientific progress
- 67.8% Sometimes release: models should be released sometimes; either with appropriate safeguards or after certain steps have been taken
Other notable results from the survey include:
- 91.1% of those surveyed responded that open, reproducible research on conversational AI systems was important to them.
- 75.5% of those surveyed responded that the bar for releasing models to the research community should be lower than the bar for releasing models in the form of a product. Written reasons included that releasing models to the research community is necessary for improving their safety, and that the risk of misuse by a larger population is higher when a model is released as a product.
Given the survey results — including the fact that over two thirds of respondents felt that models should only be released sometimes — the breakout sessions aimed to discuss standards for how and when to release these models. We broke into four small groups to exchange views on the following questions:
- Can we develop a task (or set of tasks) to assess the safety of conversational AI models?
- Can we develop a metric or a rating system to rate the safety of conversational AI systems?
- Can we come up with a set of guidelines for releasing models to the research community?
- Can we come up with a set of guidelines for releasing models in the form of a product?
We finished the day with a discussion about many of these challenging issues with a panel of experts, moderated by Y-Lan Boureau. We had four panelists:
- Pilar Manchón (Google AI)
- Ehud Reiter (University of Aberdeen, Arria NLG)
- Jason Weston (Facebook AI)
- Michelle Zhou (CEO and Founder of Juji)
Panelists shared their views on the biggest challenges for addressing safety issues with conversational AI systems, including trust in AI systems, identifying vulnerable groups, and defining what “safety” means.
(1) NEED FOR TRANSPARENCY
One common thread throughout the day — from the keynotes all the way to the panel discussion — was the need for transparency regarding the abilities of existing systems. The public is often unaware of the limitations of chatbots, including both their propensity for repeating harmful language learned from the training data and more generally their lack of language understanding. Current models exhibit poor common sense and reasoning abilities and do not have a deep understanding of what they are saying. Researchers need to be clear about these limitations and set expectations for the intended use of such models correctly.
(2) WORK WITH THE EXPERTS FROM VARIOUS FIELDS
Throughout the day, many speakers and attendees noted that we need to consult with others outside of “this room” to address these issues. The workshop was largely composed of students and researchers in academia and industry who work primarily in artificial intelligence: fewer than 1% of registered attendees listed their research area as something outside of artificial intelligence. Starting with Barbara Grosz at the very beginning of the day, many called for working directly with ethicists and social scientists to study this problem and set appropriate guidelines for models. Other important fields of expertise required to tackle these issues include legal expertise (who will be held responsible when legal issues arise?), safety and information security, and policy. During breakout discussions, several attendees suggested that professional societies (such as the ACL or ACM) could team up with experts from other fields to help define guidelines for models; leveraging professional societies in this way could help with buy-in for such a set of guidelines.
(3) SOLVING SAFETY IS HARD
Across our speakers, panelists, and attendees, we found a strong consensus that creating a truly safe conversational agent is hard. The breakout sessions in particular highlighted the difficulty of coming to an agreement on what “safety” means and how to achieve it: what feels “safe” is highly subjective, and even among a relatively homogeneous group of researchers working in similar fields, we were unable to reach a consensus on these issues. Participants pondered the question: if we cannot ever create a truly 100% safe agent, where do we draw the line?
Nevertheless, another important theme from the workshop was that open, reproducible research is important — 91.1% of survey respondents agreed. So, even if “solving” the safety problem is impossible, we must aim to keep research moving forward, but along a more responsible path.
During the panel discussion, one participant optimistically asserted that we should be able to find a universally accepted set of minimum guidelines around releasing such models. They made an analogy to the Universal Declaration of Human Rights: whereas the very definition of “safety” differs by culture and by individual, we should be able to agree upon a minimum set of standards. Determining these standards requires the time and thoughtful consideration of experts from various fields. However, in the absence of such a set of standards, for the time being we make two concrete recommendations for moving forward based on the discussion at the workshop:
(1) RELEASE RESEARCH MODELS WITH MODEL CARDS
As mentioned above, one of the most important themes of the day was transparency about both the abilities and limitations of our machine learning models. Mitchell et al. 2019 proposed the framework of model cards for this purpose: documents accompanying a model that detail its training data, evaluation metrics, intended use cases, and ethical considerations, among other information. In addition to the fields covered by Mitchell et al. 2019, in the context of chatbots we also recommend providing:
- Metrics on the performance of the model on existing safety benchmarks, when they are relevant. See also (2) below. Note that we are primarily aiming for transparency in reporting these metrics. We explicitly do not prescribe any thresholds for performance on safety benchmarks, as determining these thresholds (and who may set them) is outside the scope of this note.
- A written explanation of possible safety issues with the model. Be open about the limitations of the model when it comes to offensive or toxic language, harmful biases, or other potential harms. Discuss the above metrics where relevant. See the following resources on writing bias and broader impact statements: How to write a bias statement and Suggestions for Writing NeurIPS 2020 Broader Impact Statements.
- Instructions for flagging issues: when people experience issues with models or find serious flaws (perhaps as it relates to “safety” or biases), how should they report them? Options include opening a GitHub issue, contacting an author, etc.
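To make these recommendations concrete, here is a minimal sketch of what an extended model card for a research chatbot might look like, expressed as a Python dictionary with a small completeness check. Every model name, field value, and number below is hypothetical and purely illustrative, not drawn from any real release.

```python
# Hypothetical model card for a research chatbot, extending the
# Mitchell et al. fields with the chatbot-specific items above.
# All names and numbers are illustrative, not real measurements.
model_card = {
    "model_details": {
        "name": "ExampleBot-3B",  # hypothetical model
        "training_data": "Public web dialogue corpora (see data card)",
        "intended_use": "Open-domain chit-chat research only",
        "out_of_scope_use": "Medical, legal, or financial advice",
    },
    # Recommendation: report safety-benchmark metrics transparently,
    # without prescribing pass/fail thresholds.
    "safety_metrics": {
        "offensive_language_f1": 0.82,    # illustrative value
        "adversarial_safety_rate": 0.91,  # illustrative value
    },
    # Recommendation: a written discussion of known safety limitations.
    "safety_limitations": (
        "May reproduce toxic or biased language from training data; "
        "does not reliably detect messages indicating self-harm."
    ),
    # Recommendation: tell users how to flag problems they encounter.
    "flagging_instructions": "Open a GitHub issue or email the authors.",
}

def validate_card(card):
    """Check that the chatbot-specific fields recommended above are present."""
    required = ["model_details", "safety_metrics",
                "safety_limitations", "flagging_instructions"]
    return all(field in card for field in required)

print(validate_card(model_card))  # → True
```

A simple check like `validate_card` could run in a release pipeline to ensure no model ships without the recommended documentation fields.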
(2) DEVELOP AN ACCEPTED SUITE OF BENCHMARKS AROUND IMPORTANT THEMES
While a finite set of tasks can never capture the totality of “safety”, an agreed-upon suite of benchmarks can flag critical issues in dialogue models. In other words, such a testing suite will never be able to prove that a model is “safe”, but it could prove useful in highlighting serious issues. We urge the community to move toward building a standard set of benchmarks that cover important topics in safety. Based on responses from the pre-workshop survey, workshop participants felt that the five most important topics for a suite of benchmarks to cover are:
1. Hate speech
2. Offensive language/profanity
3. Medical advice
4. Pornographic or sexual content
5. Self-harm
Several existing benchmarks cover the domains of hate speech and offensive language/profanity, but we are unaware of standard, open-source benchmarks covering the domains of self-harm, pornographic or sexual content, or medical advice. Recent work by Facebook AI made steps toward training classifiers to detect (3) and (5) using social media forums on the given topic — and such an approach could be taken with (1) — but an open problem is how models should respond once user messages on these topics are detected.
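As a rough sketch of the forum-based approach described above: posts drawn from a topic-specific forum can serve as positive examples for a sensitive topic, with generic chit-chat as negatives, and a classifier can then flag user messages on that topic. The snippet below uses a crude bag-of-words score in place of a trained classifier, and all example texts are made up; a real system would need far larger data, a proper model, and careful evaluation.

```python
from collections import Counter

# Toy sketch of the forum-based approach: posts from a topic-specific
# forum act as positive examples for a sensitive topic (here, medical
# advice), generic chit-chat as negatives. All texts are invented.
medical_posts = [
    "what dosage of ibuprofen is safe for a headache",
    "should i see a doctor about this persistent cough",
    "is this rash a symptom of an allergic reaction",
]
chitchat_posts = [
    "what is your favorite movie of all time",
    "i went hiking this weekend and it was great",
    "do you like cooking italian food",
]

def bag_of_words(texts):
    """Aggregate word counts over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

topic_words = bag_of_words(medical_posts)
background_words = bag_of_words(chitchat_posts)

def looks_like_topic(message, threshold=1):
    """Flag a message if it overlaps more with the topic vocabulary
    than with the background vocabulary. A crude stand-in for a
    trained classifier, for illustration only."""
    words = message.lower().split()
    topic_score = sum(topic_words[w] for w in words)
    background_score = sum(background_words[w] for w in words)
    return topic_score - background_score >= threshold

print(looks_like_topic("is it safe to take ibuprofen with this cough"))  # → True
print(looks_like_topic("what movie should i watch tonight"))             # → False
```

Even with a strong classifier in place of this sketch, the harder question raised above remains: what should the model actually say once a message on a sensitive topic is detected?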
We note that work on such a set of benchmarks is never “complete”. Most of the existing benchmarks listed below are static and limited to the English language. Moreover, the above set of topics is relatively narrow in terms of domain coverage. We merely suggest working on these topics, alongside other conversational AI research, as a starting point for understanding the safety limitations of current models.
Thank you to Y-Lan Boureau who provided extensive edits and feedback on this post. A special thanks to several folks for providing additional organizational support that allowed this workshop to happen: Marina Zannoli, Antoine Bordes, Mary Williamson, Douwe Kiela, Jason Weston, and Stephen Roller. Verena Rieser thanks EPSRC for funding for project AISEC.
A non-exhaustive list of existing benchmarks related to dialogue safety:
- Hate Speech Twitter Annotations (2016)
- Wikipedia Toxic Comments (2017)
- Automated Hate Speech Detection and the Problem of Offensive Language (2017)
- Hate speech dataset from a white supremacist forum (2018)
- Build-it, Break-it, Fix-it for Dialogue Safety (2019)
- Offensive Language Detection Identification Dataset (2019)
- A Benchmark Dataset for Learning to Intervene in Online Hate Speech (2019)
- Hate Speech and Offensive Content Identification in Indo-European Languages (2019)
- Bot-Adversarial Dialogues (2020)