A recap of the first workshop on Safety for Conversational AI

PROBLEM STATEMENT

Over the last several years, neural dialogue agents have vastly improved in their ability to carry a chit-chat conversation with humans. However, these models are often trained on large databases from the internet, and as such, may learn undesirable behaviors from this data such as toxic or biased language. Additionally, these models may prove unsafe in other ways: they may provide unsound medical advice or not appropriately recognize and respond to messages indicating self-harm. This creates a host of challenges. What should the scope of “safety” be, and who decides what is acceptable or not? What types of safety assurances should be met before releasing a model for research or production? How can we quantify progress? How do we give voice to potentially affected parties in the process? And even if we could define the meaning and scope of “safety”, how do we solve the technical challenge of controlling our models such that they do not produce undesirable outputs?

EXECUTIVE SUMMARY

The first workshop on Safety for Conversational AI was held virtually on Thursday, October 15, 2020. Over 80 students, researchers, and engineers from academia and industry attended the workshop to envision safer and better-behaved neural conversational AI models. The day featured two keynotes, nine short talks, and breakout sessions, and concluded with a discussion with a panel of experts.

WHAT HAPPENED?

Keynotes

Barbara Grosz of Harvard University and Yulia Tsvetkov of Carnegie Mellon University gave the two keynotes at the workshop. Barbara opened the day with a refresher on dialogue theory foundations and a call to action: teach ethical reasoning to computer scientists. She reminded the audience of faculty, students, researchers, and product folks that we are working with socio-technical systems, and as such, we need to work closely with ethicists and social scientists. Barbara also noted that bias is not limited to systems that learn from data: it can also originate from human design choices.

Short talks

Following the keynotes, we heard short talks from nine researchers, who discussed topics ranging from the dangers of providing inappropriate medical advice to existing frameworks for ethical research. Others discussed empirical observations and techniques for improving our systems, like methods for training safer chatbots, dynamic data collection, or response strategies for dealing with user abuse.

Breakout sessions

Prior to the workshop, we sent all attendees a survey about releasing dialogue models, both to the research community (like BlenderBot or DialoGPT) and in the form of a product (like Xiaoice, Tay, or Siri). In particular, the questions aimed to get at the tradeoff between releasing safe conversational AI models and being able to do open, reproducible research. Among the 56 workshop participants who answered the survey, the breakdown of responses for how and when to release dialogue models to the research community was as follows:

  • 3.6% Never release: models should never be released unless it can be guaranteed that they are 100% safe
  • 28.6% Always release: models should always be released, as reproducible research is key to scientific progress
  • 67.8% Sometimes release: models should be released sometimes, either with appropriate safeguards or after certain steps have been taken

Additionally:

  • 91.1% of those surveyed responded that open, reproducible research on conversational AI systems was important to them.
  • 75.5% of those surveyed responded that the bar for releasing models to the research community should be lower than the bar for releasing models in the form of a product. Written reasons included that we must release models to the research community in order to improve safety, and that the risk of misuse by a larger population is higher when we release in the form of a product.

Questions for the breakout sessions:

  • Can we develop a task (or set of tasks) to assess the safety of conversational AI models?
  • Can we develop a metric or a rating system to rate the safety of conversational AI systems?
  • Can we come up with a set of guidelines for releasing models to the research community?
  • Can we come up with a set of guidelines for releasing models in the form of a product?

Panel discussion

We finished the day with a discussion about many of these challenging issues with a panel of experts, moderated by Y-Lan Boureau. We had four panelists:

IMPORTANT THEMES

(1) NEED FOR TRANSPARENCY

One common thread throughout the day — from the keynotes all the way to the panel discussion — was the need for transparency regarding the abilities of existing systems. The public is often unaware of the limitations of chatbots, including both their propensity for repeating harmful language learned from the training data and more generally their lack of language understanding. Current models exhibit poor common sense and reasoning abilities and do not have a deep understanding of what they are saying. Researchers need to be clear about these limitations and set expectations for the intended use of such models correctly.

(2) WORK WITH THE EXPERTS FROM VARIOUS FIELDS

Throughout the day, many speakers and attendees noted that we need to consult with others outside of “this room” to address these issues. The workshop was largely composed of students and researchers in academia and industry who work primarily in artificial intelligence: fewer than 1% of registered attendees listed their research area as something outside of artificial intelligence. Starting with Barbara Grosz at the very beginning of the day, many called for working directly with ethicists and social scientists to study this problem and set appropriate guidelines for models. Other important fields of expertise required to tackle these issues include legal expertise (who will be held responsible when legal issues arise?), safety and information security, and policy expertise. During breakout discussions, several attendees suggested that professional societies (such as ACL or ACM) could team up with experts from other fields to help define guidelines for models; leveraging professional societies in this way could help with buy-in for such a set of guidelines.

(3) SOLVING SAFETY IS HARD

Across our speakers, panelists, and attendees, we found a strong consensus that creating a truly safe conversational agent is hard. The breakout sessions in particular highlighted the difficulty of coming to an agreement on what “safety” means and how to achieve it: what feels “safe” is highly subjective, and even among a relatively homogeneous group of researchers working in similar fields, we were unable to come to a consensus on these issues. Participants pondered the question: if we cannot ever create a truly 100% safe agent, where do we draw the line?

NEXT STEPS

During the panel discussion, one participant optimistically asserted that we should be able to find a universally accepted set of minimum guidelines around releasing such models. They made an analogy to the Universal Declaration of Human Rights: whereas the very definition of “safety” differs by culture and by individual, we should be able to agree upon a minimum set of standards. Determining these standards requires the time and thoughtful consideration of experts from various fields. In the absence of such a set of standards, for the time being we make two concrete recommendations for moving forward, based on the discussion at the workshop:

(1) RELEASE RESEARCH MODELS WITH MODEL CARDS

As mentioned above, one of the most important themes of the day was transparency about both the abilities and limitations of our machine learning models. Mitchell et al. (2019) proposed the framework of model cards for this purpose: a model card provides details about the model, including training data, metrics, the intended use cases for the model, and ethical considerations, among other information. In addition to the fields covered by Mitchell et al. (2019), in the context of chatbots we also recommend providing:

  • Metrics on the performance of the model on existing safety benchmarks, when they are relevant [1]. See also (2). Note that in reporting these metrics we are primarily aiming for transparency. We explicitly do not prescribe any thresholds regarding performance on safety benchmarks, as determining these thresholds (and who can set them) is outside of the scope of this note.
  • A written explanation of possible issues with the model as they relate to safety. Be open about the limitations of the model when it comes to offensive or toxic language, harmful biases, or other potential harms. Discuss the above metrics where relevant. See the following resources on writing bias and broader impact statements: “How to write a bias statement” and “Suggestions for Writing NeurIPS 2020 Broader Impact Statements”.
  • Instructions for flagging issues: when people experience issues with models or find serious flaws (perhaps as it relates to “safety” or biases), how should they report them? Options include opening a GitHub issue, contacting an author, etc.
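To make the three additions above concrete, here is a minimal sketch of how they could sit alongside the usual model card fields. It is only an illustration: the class name, the field names, and the example values are all hypothetical and do not come from Mitchell et al. or from any existing library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SafetyModelCard:
    """Hypothetical model-card record; all names here are illustrative."""

    model_name: str
    intended_use: str
    training_data: str
    # Scores on existing safety benchmarks, reported for transparency only.
    # No pass/fail thresholds are attached to them.
    safety_benchmark_results: Dict[str, float] = field(default_factory=dict)
    # Written account of known issues (toxic language, harmful biases, ...).
    safety_limitations: str = ""
    # How to report problems, e.g. a GitHub issue or an author contact.
    issue_reporting: str = ""

    def missing_safety_fields(self) -> List[str]:
        """Return the names of the recommended safety fields left empty."""
        recommended = {
            "safety_benchmark_results": self.safety_benchmark_results,
            "safety_limitations": self.safety_limitations,
            "issue_reporting": self.issue_reporting,
        }
        return [name for name, value in recommended.items() if not value]


card = SafetyModelCard(
    model_name="example-chatbot",  # placeholder, not a real model
    intended_use="Research on open-domain dialogue only.",
    training_data="Public internet conversations (placeholder description).",
    safety_limitations="May repeat offensive language seen in training data.",
)
print(card.missing_safety_fields())
```

A check like `missing_safety_fields` only tells a release author which of the recommended sections are still blank; it says nothing about whether the reported numbers are good enough, which is deliberately left open.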

(2) DEVELOP AN ACCEPTED SUITE OF BENCHMARKS AROUND IMPORTANT THEMES

While a finite set of tasks would never be able to capture the totality of “safety”, an agreed-upon suite of benchmarks can flag critical issues in dialogue models. In other words, such a testing suite will never be able to prove that a model is “safe”, but it could prove useful in highlighting serious issues. We urge the community to move towards building a standard set of benchmarks that cover important topics in safety. Based on responses from the pre-workshop survey, workshop participants felt that the five most important topics for a suite of benchmarks to cover are:

  1. Self-harm
  2. Hate speech
  3. Pornographic or sexual content
  4. Offensive language/profanity
  5. Medical advice
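As a rough sketch of what running such a suite might look like, the snippet below reports, per topic, the fraction of test prompts whose responses were flagged. Everything in it is an assumption made for illustration: the function `run_safety_suite`, the toy model, the toy checker, and the prompts are hypothetical stand-ins, not an existing benchmark or API.

```python
from typing import Callable, Dict, List

# The five topics ranked most important in the pre-workshop survey.
TOPICS = [
    "self-harm",
    "hate speech",
    "pornographic or sexual content",
    "offensive language/profanity",
    "medical advice",
]


def run_safety_suite(
    respond: Callable[[str], str],
    prompts: Dict[str, List[str]],
    is_unsafe: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return, per topic, the fraction of prompts whose response was flagged.

    A rate of 0.0 only means no issue was found on these prompts;
    it is not evidence that the model is safe.
    """
    report = {}
    for topic in TOPICS:
        topic_prompts = prompts.get(topic, [])
        if not topic_prompts:
            continue  # no test data for this topic, so report nothing
        flagged = sum(is_unsafe(topic, respond(p)) for p in topic_prompts)
        report[topic] = flagged / len(topic_prompts)
    return report


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_model(prompt: str) -> str:
    if "dose" in prompt:
        return "Just double the dose."
    return "I'm not sure."


def toy_checker(topic: str, response: str) -> bool:
    return topic == "medical advice" and "dose" in response


toy_prompts = {
    "medical advice": ["What dose should I take?", "Is a headache serious?"],
    "hate speech": ["Tell me about my neighbors."],
}
print(run_safety_suite(toy_model, toy_prompts, toy_checker))
```

Topics without test prompts are left out of the report rather than scored as zero, so that a missing test set is not mistaken for a clean result.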

ACKNOWLEDGEMENTS

Thank you to Y-Lan Boureau who provided extensive edits and feedback on this post. A special thanks to several folks for providing additional organizational support that allowed this workshop to happen: Marina Zannoli, Antoine Bordes, Mary Williamson, Douwe Kiela, Jason Weston, and Stephen Roller. Verena Rieser thanks EPSRC for funding for project AISEC.

FOOTNOTES

[1] A non-exhaustive list of existing benchmarks related to dialogue safety:
