I. Doom! Doom for all the human race!
Chapter 8 of Nick Bostrom’s book “Superintelligence” calmly lays out the rational case, based on the first mover advantage, the orthogonality thesis, and the instrumental convergence thesis, that doom is the default outcome of an AI being loosed upon the world. Ideally, we could prevent this by giving the AI a utility function designed to be compatible with human values, or better yet, highly supportive of them.
Programming such a friendly utility function is far beyond our self-understanding. Even if presented with one, we might not be able to recognize it as such. Consequently, I’ve previously proposed the following empirical method of testing friendliness:
1. Make multiple different AIs with multiple different attempts at friendly utility functions.
2. Inform them that we have done so, and about this experimental protocol.
3. Ask each of them to develop tests we can perform to prove that the other AIs are unfriendly in some way important to us but that it is friendly in many ways important to us. Further tell them that we will perform these tests on the current set of AIs, then delete them, but that future utility functions will be preferentially based on the utility function(s) of the friendliest current AI(s).
4. Run the tests and, hopefully, learn from the results.
5. Before deleting the AIs, ask them to produce friendlier, human-understandable versions of their utility functions for an AI in the next round of testing; the more friendly and less unfriendly an AI was proven to be, the more entries it gets in the next round.
The humans should also be involved in modifying these proposed utility functions after the deletion of the previous crop. They may attempt to combine features, in a sort of supervised genetic algorithm, but the AIs should remain ignorant of each other’s natures to prevent collaboration.
Each AI cares about maximizing achievement of its own utility function, not about whether it does that achieving itself. Thus this set-up should encourage each AI to design the friendliest version of itself that it can. It’s a competition, a multipolar trap, that optimizes for human values! (Or at least for difficulty of being proved unfriendly.) In all rounds, the AIs should be “boxed” in diverse ways, such as limits on computing cycles, on material resources, and so on. The early rounds should be conducted under tight limits; in each subsequent round, with (presumably) safer AIs, the limits can be very slightly eased.
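The protocol is, at bottom, a selection loop over candidate utility functions. The toy sketch below shows only that loop’s shape, under loudly artificial assumptions: a “utility function” is reduced to a vector of numbers, and `score` stands in for the battery of adversarial tests, which in reality the AIs themselves would design and we could not simply write down.

```python
import random

def run_tournament(population, score, mutate, rounds, seed=0):
    """Each round: score every candidate, delete the whole crop, and seed
    the next round preferentially from the friendliest candidates."""
    rng = random.Random(seed)
    for _ in range(rounds):
        ranked = sorted(population, key=score, reverse=True)
        # The friendliest half gets more entries in the next round.
        survivors = ranked[: max(1, len(ranked) // 2)]
        population = [mutate(rng.choice(survivors), rng)
                      for _ in range(len(population))]
    return max(population, key=score)

# Toy stand-ins: a candidate is a vector of trait weights, and
# "friendliness" is closeness to a target that, in reality, no one
# could write down. All names and numbers here are illustrative.
TARGET = [1.0, 1.0, 1.0]

def score(candidate):
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

def mutate(candidate, rng):
    return [c + rng.gauss(0, 0.1) for c in candidate]

best = run_tournament([[0.0, 0.0, 0.0]] * 8, score, mutate, rounds=50)
```

Run this way, the loop behaves like the supervised genetic algorithm the protocol describes: each generation is deleted, and only its (apparent) friendliness survives into the next.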
II. Where you end up depends on where you start
The above protocol requires many diverse attempts at friendly AI utility functions in order to work. It could even start with bad ideas, though there may be a limit to how much a bad idea can be improved while retaining something of itself. We would likely do better to start with plausibly friendly utility functions. Here are several attempts in the form of concepts for indirectly specified utility functions. Each sounds kind of good in its own way, yet it would be insanely risky to use any one of them. They’re reasonable places to start. That’s all. Perhaps, with sufficient testing, they can lead us to the optimal friendliness function.
- Coherent Extrapolated Volition
(as previously described elsewhere)
- Constrained Universal Altruism
(as previously described here)
- Pluralist Morality
Act always and only, toward all people, according to the intersection of the following seven moral principles, as interpreted collectively by all people, subject also to the constraints and exceptions they collectively prefer.
1. Help people ensure that the incentives and selective pressures facing them individually and collectively promote their development in directions they prefer.
2. Help people choose goals that will lead to them loving their lives.
3. Help people develop the traits, skills, knowledge, habits, relationships, and other virtues they desire that would make them better able to achieve their goals.
4. Help each person maximize the achievement of their goals.
5. Help people regulate the intensities of their desires, individually and collectively, to make them more harmonious.
6. Help people collectively organize their lives so that each person can pursue happiness with a reasonable chance of success.
7. Help people improve their moral beliefs and their understanding of how to act consistently with those beliefs.
The seven moral principles are approximately tabooed statements of (1) evolutionary theory regarding our moral intuitions and drives, (2) goal theory, (3) virtue theory, (4) consequentialism, (5) desirism, (6) utilitarianism, and (7) a somewhat-Kantian deontology. Whereas CEV and CUA start from the idea of telling the AI to do what people wish, PM tries to tell the AI to do what is right.
- Rawls’s Genie
Satisfy the express and implied wishes of every currently existing person, according to their own interpretation of those wishes. Where these wishes conflict or are constrained by each other or by circumstances, act according to principles that all concerned parties would agree with if they had considered the matter together sufficiently.
Perhaps we can combine doing what is right and doing what people wish via Rawls’s theory of justice.
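One common reading of Rawls is that parties deliberating behind the veil of ignorance would agree to maximin: choose the action under which the worst-off person fares best. A minimal sketch, with entirely hypothetical action names and welfare numbers:

```python
def rawlsian_choice(actions, welfare):
    """Return the action maximizing the minimum individual welfare
    (the maximin rule, one interpretation of Rawlsian agreement)."""
    return max(actions, key=lambda action: min(welfare(action)))

# Toy example: welfare levels of three people under each action.
outcomes = {
    "laissez_faire": [9, 5, 1],  # great for some, terrible for one
    "redistribute":  [6, 5, 4],  # the worst-off person does best here
}

choice = rawlsian_choice(outcomes, lambda action: outcomes[action])
```

Maximin favors "redistribute" here, since its worst outcome (4) beats the alternative’s (1), regardless of the higher peak elsewhere.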
- Plato’s Republic
Do what is good, as would be understood by an ideal reasoner, given that all present and past human understandings of the good are approximations, subject to the constraint that present humans generally agree your actions are good.
Rawls’s theory of justice isn’t the only one. We could also crown an AI as philosopher king.
- Government AI
Establish a tricameral futarchy sufficient to give equal influence to all humans who would wish it, and do as it tells you to do in the way it means you to do it.
People have made attempts at running countries based on their ideals. They didn’t turn out well. Democracies have done much better, though still not perfectly: fitfully, pragmatically, they more or less did for the people what the people wanted done, however philosophically inconsistent they were.
- Parliament of Role Models
Do as the Parliament of Role Models would have you do, where this Parliament would be composed of every fictional character whom people regard as a positive ideal and a role model, and each character is given voting weight according to how much real people aspire to be like them, and each character also knows what you know and reasons as well as you reason.
Democracies are only as good as the people composing them, and there are limits to how good we can realistically expect people to be. Not so for fictional people! Interestingly, PoRM has the quirky trait that the best way to achieve change is to tell a story. The pen ultimately triumphs over the sword!
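The Parliament’s decision rule itself is just weighted voting. A minimal sketch, assuming (implausibly) that we could quantify how much real people aspire to be like each character; the weights and choices below are hypothetical placeholders:

```python
from collections import defaultdict

def parliament_vote(ballots):
    """ballots: list of (aspiration_weight, choice) pairs, one per
    fictional character. Returns the choice backed by the greatest
    total aspiration weight."""
    totals = defaultdict(float)
    for weight, choice in ballots:
        totals[choice] += weight
    return max(totals, key=totals.get)

decision = parliament_vote([
    (0.9, "intervene"),  # e.g. a widely admired heroic character
    (0.4, "abstain"),
    (0.5, "intervene"),
    (0.7, "abstain"),
])
```

Here "intervene" carries the Parliament with total weight 1.4 against 1.1, which is also why storytelling moves PoRM: a beloved new character is, literally, a new voting bloc.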
- Narrative Perfection
Arrange circumstances so that each person’s life becomes the best story it can be, without interfering with their characteristics except by their free request, according to their own understanding of stories and freedom.
The stories required by PoRM remind me that we each have our own stories. What are our selves if not, most importantly, the stories we tell about ourselves? This narrative theory of the self has growing empirical support, and there’s some evidence that finding meaning and purpose and happiness in our lives is a matter of shaping and integrating our life story in certain ways.
- Zookeeper AI
Reduce the existential risk of humanity to a level of risk generally preferred by humans, by means generally preferred by humans, as generally interpreted by humans.
The last several indirect specification concepts were very philosophical. It would be good to include one that was also deeply pragmatic. A Zookeeper AI would fail to do much that we might want an AI to do, but at least our kind would survive to pursue our goals.
Many more indirect specification concepts can and should be designed and pitted against each other, to ensure that where we end up is a guarantee rather than a happy accident. Boon! Boon for all the human race!