Reports  |  May 30, 2020

AI Research Considerations for Human Existential Safety (ARCHES)

Report by Andrew Critch and David Krueger. 131 pages.


Framed in positive terms, this report examines how technical AI research might be steered in a manner that is more attentive to humanity’s long-term prospects for survival as a species. In negative terms, we ask what existential risks humanity might face from AI development in the next century, and by what principles contemporary technical research might be directed to address those risks.

A key property of hypothetical AI technologies is introduced, called prepotence, which is useful for delineating a variety of potential existential risks from artificial intelligence, even as AI paradigms might shift. Twenty-nine contemporary research directions are then examined for their potential benefit to existential safety.

Each research direction is explained with a scenario-driven motivation and examples of existing work from which to build. The research directions carry their own risks and benefits to society, which could occur at various scales of impact; in particular, they are not guaranteed to benefit existential safety if major developments in them are deployed without adequate forethought and oversight. As such, each direction is accompanied by a consideration of potentially negative side effects.
Taken more broadly, the twenty-nine explanations of the research directions also illustrate a highly rudimentary methodology for discussing and assessing potential risks and benefits of research directions, in terms of their impact on global catastrophic risks. This impact assessment methodology is very far from maturity, but seems valuable to highlight and improve upon as AI capabilities expand.

Table of Contents

1 Introduction

1.1 Motivation
1.2 Safety versus existential safety
1.3 Inclusion criteria for research directions
1.4 Consideration of side effects
1.5 Overview

2 Key concepts and arguments

2.1 AI systems: tools, agents, and more
2.2 Prepotence and prepotent AI
2.3 Misalignment and MPAI
2.4 Deployment events
2.5 Human fragility
2.6 Delegation
2.7 Comprehension, instruction, and control
2.8 Multiplicity of stakeholders and systems
2.8.1 Questioning the adequacy of single/single delegation
2.9 Omitted debates

3 Risk-inducing scenarios

3.1 Tier 1: MPAI deployment events
3.1.1 Type 1a: Uncoordinated MPAI development
3.1.2 Type 1b: Unrecognized prepotence
3.1.3 Type 1c: Unrecognized misalignment
3.1.4 Type 1d: Involuntary MPAI deployment
3.1.5 Type 1e: Voluntary MPAI deployment
3.2 Tier 2: Hazardous social conditions
3.2.1 Type 2a: Unsafe development races
3.2.2 Type 2b: Economic displacement of humans
3.2.3 Type 2c: Human enfeeblement
3.2.4 Type 2d: ESAI discourse impairment
3.3 Omitted risks

4 Flow-through effects and agenda structure

4.1 From single/single to multi/multi delegation
4.2 From comprehension to instruction to control
4.3 Overall flow-through structure
4.4 Research benefits vs deployment benefits
4.5 Analogy, motivation, actionability, and side effects

5 Single/single delegation research

5.1 Single/single comprehension
5.1.1 Direction 1: Transparency and explainability
5.1.2 Direction 2: Calibrated confidence reports
5.1.3 Direction 3: Formal verification for machine learning systems
5.1.4 Direction 4: AI-assisted deliberation
5.1.5 Direction 5: Predictive models of bounded rationality
5.2 Single/single instruction
5.2.1 Direction 6: Preference learning
5.2.2 Direction 7: Human belief inference
5.2.3 Direction 8: Human cognitive models
5.3 Single/single control
5.3.1 Direction 9: Generalizable shutdown and handoff methods
5.3.2 Direction 10: Corrigibility
5.3.3 Direction 11: Deference to humans
5.3.4 Direction 12: Generative models of open-source equilibria

6 Single/multi delegation research

6.1 Single/multi comprehension
6.1.1 Direction 13: Rigorous coordination models
6.1.2 Direction 14: Interpretable machine language
6.1.3 Direction 15: Relationship taxonomy and detection
6.1.4 Direction 16: Interpretable hierarchical reporting
6.2 Single/multi instruction
6.2.1 Direction 17: Hierarchical human-in-the-loop learning (HHL)
6.2.2 Direction 18: Purpose inheritance
6.2.3 Direction 19: Human-compatible ethics learning
6.2.4 Direction 20: Self-indication uncertainty
6.3 Single/multi control

7 Relevant multistakeholder objectives

7.1 Facilitating collaborative governance
7.2 Avoiding races by sharing control
7.3 Reducing idiosyncratic risk-taking
7.4 Existential safety systems

8 Multi/single delegation research

8.1 Multi/single comprehension
8.1.1 Direction 21: Privacy for operating committees
8.2 Multi/single instruction
8.2.1 Direction 22: Modeling human committee deliberation
8.2.2 Direction 23: Moderating human belief disagreements
8.2.3 Direction 24: Resolving planning disagreements
8.3 Multi/single control
8.3.1 Direction 25: Shareable execution control

9 Multi/multi delegation research

9.1 Multi/multi comprehension
9.1.1 Direction 26: Capacity oversight criteria
9.2 Multi/multi instruction
9.2.1 Direction 27: Social contract learning
9.3 Multi/multi control
9.3.1 Direction 28: Reimplementation security
9.3.2 Direction 29: Human-compatible equilibria

10 Further reading

10.1 Related research agendas

About the Authors

  • Andrew Critch is a full-time research scientist in the EECS department at UC Berkeley, at Stuart Russell’s Center for Human-Compatible AI. He earned his PhD in mathematics at UC Berkeley, studying applications of algebraic geometry to machine learning models. During that time, he cofounded the Center for Applied Rationality and SPARC. His current research interests include logical uncertainty, open-source game theory, and avoiding arms race dynamics between nations and companies in AI development.
  • David Krueger is a PhD student at Montreal Institute for Learning Algorithms (MILA), Université de Montréal.