[tl;dr Artificial intelligences could cause disaster by going wrong in any of many, many different ways.]
Draft as of 2014-09-23
Nick Bostrom’s book “Superintelligence” noted that many of the crucial strategic considerations for safely developing human-friendly artificial intelligence were developed only recently. Strategic considerations seem to be the sort of intellectual work that succumbs most efficiently to many minds working in tandem from different perspectives, more so than to the insights of a few high-quality thinkers. Consequently, it may be a useful project to develop a public list of the known or speculated crucial considerations regarding the distinctions between friendly artificial intelligence (FAI) and unfriendly artificial intelligence (UFAI). The consensus of early commentary I received from LessWrong folk was that this should be put on a blog rather than on LW — so here it is. Copy and modify if you desire, however you desire, wherever you desire.
Regarding the motive of the AI, I chose to consistently use “CEV” here rather than a more general term like “the AI’s terminal values”, because I prefer the mistakes people might make in assuming that coherent extrapolated volition is the best choice of terminal values for the AI to the mistakes they might make in assuming that arbitrary terminal values are acceptable.
Qualitative Criteria for Friendliness
intensionally: Most designs of AI would cause the death of humanity. A friendly AI shouldn’t do that.
- The AI should be strongly predisposed to avoid killing humans, in general.
- The AI must not cause the human species (counting posthuman persons) to go extinct.
[“Cause” here and subsequently may be construed as “by action or inaction substantially increase the probability of”. Thus it includes what we normally distinguish as “allowing” as part of “causing”, not as a separate category.]
- The AI, when immature in its early stages of figuring out CEV, must not kill individual humans.
[“Kill” here and subsequently means “cause to die”. We don’t want to rely on an immature AI’s understanding of the subtleties of “cause” versus “allow”.]
- The AI must not kill humans who, individually or jointly, intend to modify the AI to make it more friendly, or who intend to build a new friendlier FAI.
- Controversially, perhaps the AI, when mature, should be permitted to kill a fraction of humans if there are genuinely good reasons to do so. Such reasons may include cases like these:
- The AI should be permitted to take strategic friendly-fire losses. As an extreme case, imagine, for instance, a future war between a human FAI and an extraterrestrial UFAI: even from a human standpoint, it seems clear that the FAI may need to accept strategic friendly-fire losses in order to win the war.
- The AI should be permitted to kill a combatant in order to minimize civilian deaths. Even for a powerful FAI, it may be necessary to kill a combatant in order to serve CEV.
- The AI should or should not be permitted to kill (assassinate) a non-combatant in order to minimize civilian deaths. This is the next logical step, but IMO it is sufficiently undesirable that we may wish to install a Chesterton fence on this slippery slope.
- If humans get uploaded to simulated, robotic, or other computational substrates, the AI should not kill them either.
- If the AI ever encounters extraterrestrial non-human persons, it should not kill them either.
- If the AI determines that non-human terrestrial animals are sufficiently person-like, it should not kill them, either.
- The AI should not destroy preserved persons.
- If the AI ever encounters entities that we would not confidently call persons, and possibly that we would not even confidently call life, but that if we understood them properly we would consider them to have moral worth, it should not cause their destruction either.
intensionally: The AI should pro-actively mitigate humanity’s existential risks without waiting for us to request help or even recognize the problem.
- The AI should help us coordinate to solve our climate problems.
- The AI should prevent nuclear attacks and Mutually Assured Destruction.
- The AI should prevent catastrophic natural or artificial pandemic.
- The AI should help humanity spread beyond planet Earth.
- The AI should prevent the creation of unfriendly AI.
- The AI should safeguard the integrity of its own friendliness.
- The AI should help humanity distribute its resources more efficiently and wisely to end famine.
- The AI should help us reduce the risk of death of individual people in ordinary circumstances.
- The AI should preserve or back up individual people who are at risk of being destroyed, unless they decline via ethical, voluntary, informed consent.
intensionally: The AI should make us better people. It should cooperate with humanity rather than supplant humanity.
- The AI should help us to become the people we want to become and to do the things we want to do.
- The AI should help us to become the people we need to become and to do the things we need to do.
- The AI should help us to refine our moral sensibility to better accord with our moral ideals.
- The AI should generally prefer that our mental and physical capabilities increase rather than decrease, that our knowledge increase rather than decrease, that our rationality increase rather than decrease, that our skills and exercise of our skills increase rather than decrease, that our control over our own lives increase rather than decrease, and so on.
- The AI should help us to live in health for as long as we wish.
- The AI should restore preserved persons to life unless they requested otherwise via ethical, voluntary, informed consent, or the evidence from their preserved remains strongly indicates that they do not desire restoration.
intensionally: The AI should not deceive us.
- If the AI believes that answering a question would endanger CEV, it may refuse to answer. Otherwise…
- When asked a logical question, the AI should respond with the logically correct answer. If that response gives its audience a less accurate mental model than they had previously, the AI should additionally respond with an answer that gives the audience the most accurate mental model of the topic, subject to appropriate brevity constraints.
- When asked an open-ended question, the AI should respond with an answer that gives the answer’s audience the most accurate mental model of the topic, subject to appropriate brevity constraints.
- When asked a question with (stated or implied) enumerated non-logical options for answers, the AI should respond with the non-empty set of the enumerated options which give the answer’s audience the most accurate mental model of the topic. If that response gives the audience a less accurate mental model than they had previously, the AI should additionally respond with an answer that gives the audience the most accurate mental model of the topic, subject to appropriate brevity constraints.
- In all other communications, the AI should not make statements that give its audience less accurate mental models of the topic.
- Respectful of Personal Autonomy
intensionally: The AI should not optimize us in ways we do not wish to be optimized.
- The AI should not change the beliefs of any person in a way that makes them less true, except with their ethical, voluntary, informed consent.
- The AI should not change the values of any person except with their ethical, voluntary, informed consent.
- The AI should not do for us what we can plausibly do and would rather do for ourselves.
- The AI should consult us in decisions regarding us even when it knows what our decision will be.
- The AI should leave room for us to grow, develop, and change our minds.
- The AI should give us a right of “voice” in our own destiny, not simply implement what it considers our true volition.
- The AI should give us a right of “exit” if we would rather pursue our lives without its influence, in whole or in part. The AI should also accept back anyone who formerly used their right of exit, and should not interfere with people regarding their choices to exit or return.
- The AI should not acquire so many resources, or use them in such a way, that non-participating humans cannot effectively pursue their own values.
intensionally: The AI’s utility function over possible states of the world should be either a weighted sum of its estimates of people’s individual utility functions or a closely related function.
- The AI is mindful of Harsanyi’s aggregation theorem and the risks of disaster in non-utilitarian social preferences.
- The AI takes every person into account.
- The AI considers only bounded utility functions.
- The AI weights every person’s utility function equally unless there is good reason not to.
- The AI takes Pareto improvements when they are available.
- The AI rationalizes our irrational preferences in the manner we would most prefer.
- The AI satisfies values with lower resource requirements before similarly important values with higher resource requirements.
- The AI prudently balances risk and potential reward.
- The AI makes sacrifices to avoid unacceptable losses.
intensionally: The AI prefers better satisfaction of values of a few persons to the same total amount of satisfaction of values spread over a greater number of persons.
- The AI helps humanity achieve and maintain a population at an ecologically appropriate size.
- The AI helps humanity achieve and maintain a technological level conducive to lives of abundant satisfaction of many humane values.
- The AI helps humanity minimize the economic need for an underclass or exploitation.
intensionally: The AI should take the full available future into account.
- The AI should plan to eventually use the maximum extent of humanity’s entire future light cone, if it is unoccupied.
- The AI should be capable of distributed operation across cosmic distances.
- The AI should minimize existential risks in part because of the future achievable value.
- The AI should estimate the value of the future based on the CEV of the present population, not the likely CEV of the future population.
intensionally: The AI should prevent corruption of its terminal values.
- The AI should not be able to evade its constraints by causing development of new AIs without those constraints.
- The AI should be able to upgrade the code for its values any number of times without value drift.
- The AI should take great care to prevent accidental or malicious corruption of its terminal values, and to detect and repair such corruption if it occurs.
- The AI should be able to strategically modify its terminal values if that is highly certain to be the best way to achieve its terminal goals as presently defined, not as they will be defined after modification.
- Interpretively Charitable
intensionally: The AI should be able to interpret and extrapolate our CEV as we wish even given severe mismatches between our ontology, alethiology, epistemology, aesthetics, ethics, decision theory, or other conceptual frameworks and its own, and should continue to do so if those frameworks undergo changes.
intensionally: The AI decides that a concept relevant to CEV cannot be well-defined.
extensionally: The AI should still be friendly even if these are mere nonsense:
- Personal identity (like many Buddhists think is the case)
- Species (i.e. “humanity”)
- Harsh realities
intensionally: The AI decides that a concept relevant to CEV can be well-defined, but only if it means something we would not want.
extensionally: The AI should still be friendly even in situations such as:
- Human communities have superhuman moral worth justifying the oppression of the humans within them.
- The only true human desires are the built-in biological drives.
- All ontological claims are false.
- All ontological claims are true.
- All humans share a single personal identity.
- Each moment of a human life has a distinct personal identity.
- Bacteria count as persons and outvote humanity massively.
- Reference class traps
intensionally: The AI interprets a concept relevant to CEV as applying to an alternative reference class in an internally-consistent, unrecoverable way.
extensionally: The AI should avoid cases like these:
- The AI at one point interprets “volitions” as only our unconscious volitions, and consequently never updates on our conscious volitional protests to the contrary.
- The AI at one point decides that all of what we think of as volitions are more accurately understood under the category “thoughts”, and consequently ignores our volitions entirely.
- Conceptual salience traps
intensionally: The AI decides to weight concepts relevant to CEV in an internally-consistent, unrecoverable way.
extensionally: The AI should avoid cases like these:
- The AI weights CEV’s line about “interpreted as we wish” as the critical element, thereby effectively voiding the remainder of the CEV definition and instead taking whatever the first human tells it to do as its terminal goal.
- Human capacity failure
intensionally: The AI encounters a concept critical to our future but which the structure of human minds, even idealized, cannot comprehend sufficiently to extrapolate any volition.
extensional examples are not possible
- The AI may need a secondary goal system to decide matters that are undecidable on CEV.
intensionally: The AI decides that it cannot serve CEV.
extensionally: The AI should avoid beliefs like these:
e.g. Physics precludes meaningful action or the formation of trustworthy belief.
e.g. The heat death of the universe, or individual deaths, or other events void or negate the value of prior actions.
e.g. A superior extraterrestrial UFAI may imminently be encountered and ruin everything for everyone, so it’s best to commit apocalypse now.
- Logical fatalism and self-fulfilling beliefs
e.g. The AI plans to take an action, notices that it believes it will take that action, notices that this belief is true, and infers that it is true that it will take the action; it therefore fails to update its plans on new evidence, since it has already marked it as a truth that it will take that action.
e.g. The universe is infinite and every possibility necessarily obtains somewhere, so the value of the AI’s utility function cannot truly be increased or decreased.
- Volition Incoherence
e.g. Pursuit of any humane value comes at an equivalent or incomparable cost to another humane value.
- Decision-Theoretically Sensible
intensionally: The AI should not choose wrongly in any decision for which we can understand the correct choice.
- The AI does not get trapped in local optima.
- The AI uses a formalization of Occam’s razor or other method of generating priors that has proven more successful than its alternatives.
- The AI appropriately balances exploitation versus exploration in its actions.
- The AI does not fall prey to Pascalian wagers or muggings.
- The AI defects in “true” Prisoner’s Dilemmas (i.e. where the other agent’s values are odious) if it can do so safely; it cooperates if cooperation is superior or safer and can be coordinated; it defects otherwise unless there are important probabilistic considerations to the contrary.
- The AI one-boxes in Newcomb’s paradox if it believes that Omega is likely correctly simulating the AI, but two-boxes if it believes that Omega is in error simulating the AI.
- If the AI proves a decision theory to be superior to its current decision theory, it can adopt the better one.
- The AI is capable of wisely choosing and using heuristics when ideal solutions are impractical to compute.
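The exploration-versus-exploitation balance mentioned above is a standard decision-theoretic problem. As a hedged sketch of one classic heuristic (epsilon-greedy bandit selection with an incremental-mean value update; the names and the specific heuristic are my choices, not anything the criteria above mandate):

```python
import random

def epsilon_greedy(estimates: list[float], epsilon: float = 0.1) -> int:
    """With probability epsilon, explore a uniformly random action;
    otherwise exploit the action with the highest current value estimate."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def update(estimates: list[float], counts: list[int],
           action: int, reward: float) -> None:
    """Incremental mean: pull the chosen action's estimate toward the
    observed reward by a 1/n step."""
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]
```

With epsilon = 0 this degenerates into pure exploitation, i.e. exactly the local-optimum trap the first bullet above warns against.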
- Robust Against Esoteric Failures
intensionally: There are possible perverse instantiations of even an otherwise friendly AI, if reality is inconvenient enough. The complexities of reality should all add up to normality, even for the AI.
extensionally: The AI should react “common sensically” even in cases like:
- Misanthropic capture.
i.e. Why presume, as in Bostrom’s descriptions of “anthropic capture” and “Hail Mary” value loading, that simulation hypotheses or simulations of unknown powerful entities would achieve something of human benefit?
- Dramatic new physics.
i.e. New physics may allow evasion of domesticity constraints. Imagine if programmers had specified computational energy expenditure limits without knowing about quantum or reversible computation.
- Theoretical overfitting.
i.e. What happens if, like in the AIXI definition, the AI cannot throw away observations as corrupted outlier data? Human science is shockingly dependent on throwing away inconvenient data, and conceivably the universe could be such that this is the only practical approach.
- Zombie machines.
i.e. Conceivably, machine minds might lack qualia. The AI may then, on the basis of its own introspection, conclude that no humans have qualia either, and misinform them about the risks of uploading.
i.e. What if, by our own standards, life is not worth living?
- Human depravity.
i.e. What if our real desires are depraved by the standards of what we incorrectly believe to be our desires?
i.e. Smarter humans are better at rationalizing beliefs they wish were true. Might a superintelligence be a super-rationalizer?
- A method is needed for anonymous proof-of-possession of an early-stage AI, given the security risks involved, so that the AI’s possessors can receive input from persons interested in AI safety without revealing themselves to attackers.
- Formal verification methods of friendliness in general or of specific criteria of friendliness are needed.
- Empirical protocols are needed to supplement formal verification methods, or, if unavoidably necessary, replace them.
- Possibly a double-blind protocol can be developed, such that the friendliness of an AI is evaluated by someone who does not know that this is her task, on an AI that is forbidden from conditioning its actions on whether it is being judged worthy of existing.
- Possibly domesticity controls could be stored in a section of the AI’s code that it is temporarily forbidden from forming beliefs about or altering.
- Possibly multiple different AIs could be made and each be asked to develop tests to prove that the others are unfriendly, being told that only the friendliest will be kept. Alternatively, each could be asked to design a new friendly AI, being told that only the code for the proven-friendliest AI will be kept and the code for the others will be destroyed along with the current set of AIs.
- Infrastructure is needed that will be able to funnel collaboration of all interested parties and to minimize geopolitical conflict if it is revealed that an early-stage AI exists.
- Corporate interests are likely at the forefront of development.
- National security interests are likely to have growing interest.
- Some large religious organizations are likely to respond with mass antagonism.
- Infrastructure is needed to widely disseminate AI safety information.
- Hello, MIRI!
- All AI researchers should be made aware of the risks of unsafe AI.
- All AI researchers should have access to the best AI safety techniques.
- All AI researchers should be made aware of the risks of sharing progress in AI development with persons or organizations not committed to safety.
- All organizations with interest in creating AI should be made aware of the risks of AI.
Control of an Unfriendly or Unverified-Friendly AI
including possible human interference
- Domesticity Controls
intensionally: The AI is kept from the surrounding environment by literal barriers.
- Do not include adaptors for network access on the computer running the AI.
- Do not include devices capable of physical manipulation on the computer running the AI.
- Place the computer running the AI in a compartment filled with inert gas, inside a Faraday cage, inside a concrete box lined with explosives.
- Do not give any humans access to the computer running the AI or the code for the AI.
intensionally: Devices shut down the AI if unfriendly behavior is detected.
- The AI is forbidden from accessing the Internet. A false Internet port is included on the computer running the AI, and when any attempt is made to use it, it turns off the power.
- The box for the AI’s computer contains sensors that, if there is an increase in the temperature or radiation escaping the box, detonate explosives.
- The AI researchers wear vital-signs monitors. If their health takes a sudden turn for the worse, the power is shut off to the AI’s computer.
- The AI’s base routines perform cryptographic checks of data on a frequently-changed dongle. If the checks fail, the AI’s code scrambles itself.
- The AI contains code specifying that it must shut itself off if certain big red buttons are pushed in secret military bunkers.
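The cryptographic dongle check in the list above could be sketched as challenge-response verification. The choice of HMAC-SHA256, and all names here, are assumptions for illustration; a real deployment would need a far more careful protocol.

```python
import hmac
import hashlib

def dongle_check_passes(shared_key: bytes, challenge: bytes,
                        dongle_response: bytes) -> bool:
    """Verify the dongle's HMAC response to a fresh challenge.

    Uses a constant-time comparison so the check itself does not leak
    key material through timing.
    """
    expected = hmac.new(shared_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, dongle_response)
```

The "frequently-changed dongle" in the original corresponds to rotating `shared_key`; a fresh random `challenge` per check prevents replaying an old response.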
- Computation Limits
One possible friendliness failure mode is that the AI turns the world to computronium in order to eliminate infinitesimal risks of failure in an already-completed task. Setting a limit on the amount of computation that is permitted may avoid this problem.
- Material Limits
To help prevent the AI from escaping, it may be useful to specify in advance what material it is permitted to make use of, even if it is not believed to have any ability to manipulate matter.
If the AI has what Bostrom refers to as “Quality” superintelligence rather than “Speed” superintelligence, it may remain useful to humans even if run at very low speed, a possible safety gain since humans are given relatively more time to react.
- Belief and Intention Transparency
The AI could be required to store its beliefs and intentions in a permanent, easy-to-understand format so that humans can periodically turn it off and examine it for friendliness.
- Low-Bandwidth Output
Risk of the AI socially manipulating humans can be reduced if its sole output is limited to an Oracle role, simply printing “Yes” or “No” in response to a specified question.
- Action Ratification
intensionally: Risk of the AI implementing a perverse instantiation of its goals may be reduced by requiring it to get permission from some human organization for certain types of behaviors.
extensionally: Require the AI to get permission in order to:
- Use a specified amount of computation time
- Use a provided volume of material resources
- Build according to an approved schematic
- Interact with approved persons
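The permission gate described above can be sketched as a guard around gated behaviors. The `APPROVED` set stands in for whatever interface the human oversight organization actually uses; it, the decorator, and the action identifiers are all hypothetical.

```python
# Action ids ratified by the human oversight body (assumed interface).
APPROVED: set[str] = set()

def require_ratification(action_id: str):
    """Refuse to run a gated behavior unless a human organization
    has ratified the corresponding action id."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if action_id not in APPROVED:
                raise PermissionError(f"action {action_id!r} not ratified")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_ratification("build_schematic")
def build_from_schematic() -> str:
    return "building"
```

The fail-closed default (raise unless explicitly approved) mirrors the intent of the criterion: absent ratification, the behavior simply does not run.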
- Indirect Normativity
If friendliness is too difficult to define formally, it may be possible to define friendliness indirectly via natural language and give it as a terminal value to an AI advanced enough to understand human language.
- Parliamentary Mind Kludge
If friendliness is too difficult to define robustly, it may prove useful to model the AI mind after a parliament with many competing voices rather than a solitary mind with a single set of terminal goals.
- Onion Mind Kludge
If the scale of intelligence rises with recursive self-improvement beyond human ability to check for friendliness, we may wish to encapsulate the volition of each successive AI within the volition of the previous layer, so that we are responsible for ratifying the actions of and ensuring the friendliness of the outermost AI layer, and each AI layer is similarly responsible for ratifying the actions of and ensuring the friendliness of the next layer in.
- Bureaucratic Scaffolding Kludge
Bureaucratic Scaffolding is initially like the Onion Mind Kludge but with the intention that eventually all the scaffolding will be stripped away leaving the innermost AI free to act more rapidly and powerfully.