Friday, July 11, 2014

Reward Prediction Error Signals are Meta-Representational

“Research on humans and other animals has produced an impressive body of converging evidence that midbrain dopamine neurons produce a reward prediction error signal (RPE) that is causally involved in choice behaviour […]. RPEs are found in humans, primates, rodents and perhaps even insects […]. This paper argues that RPEs carry metarepresentational contents.” (Shea 2014.)

Introduction


Scientifically interesting questions about common currencies fall largely into three groups:
  1. About whether the behaviour of some type of agent (including people) is consistent with a single preference or value ordering. (Related questions here concern what kinds and degrees of order there are, and what theories - especially biological and economic ones - make sense of the order.)
  2. About whether the processes that produce the behaviour of some type of agent (including people) consult or ‘consume’ a representation of values or preferences.
  3. Finally, about the relationships between answers to the questions in groups (1) and (2).

Questions in the second group are about the existence and nature of internal, cognitive, representations of value. If there are comprehensive internal value representations, then they are, in terms discussed elsewhere on this blog, instances of proximal common currencies. The point of a proximal common currency is that one way, at least, to produce behaviour that is sensitive to the values (whether understood as reward, utility, pleasure,…) of the results of actions, is to represent those values internally. Then the behaviour selection process can consult the value representation, and the sensitivity of behaviour to value is both made possible and explained.

I recently ran into an elegant and interesting paper by Nicholas Shea, at the time at Oxford but now back at King’s College, London. The paper, like it says on the box, argues that reward prediction error signals are meta-representational. On this blog so far, I’ve mostly left discussion of representation as a topic in a black box, and focused on the arguments about whether or not there is, or must be, a represented common currency in order to explain order in behaviour. But that’s not because I think representation - especially of values - is an uninteresting topic.

Representation has, of course, an enduring fascination for the philosophy of cognitive science. Back in the olden days this fascination was partly expressed in big picture yelling about whether cognition in general was essentially representational, or was possible without representations (e.g. Rodney Brooks’ famous 1991 paper Intelligence without Representation [PDF]).

There’s also a tradition of more interesting work - to my mind, anyway - focused on developing detailed theories of various kinds of representation, and on working out how they might apply to far more specific cognitive phenomena and cognitive mechanisms. Shea’s paper is a fine example of that latter approach. It reminds me a bit of Dan Lloyd’s wonderful book Simple Minds (1989), in focusing on simple systems that, because they can be exhaustively described, can explicitly be shown to instantiate representations by the lights of a specific theory.

The topic of value, or reward, representation cries out for serious philosophical treatment. Neuroscientists explicitly and regularly make claims about what this or that brain network, or brain activity, represents, but rarely develop or invoke specific theories of representation while they do so. Most of the work goes on experimental design and execution aimed at showing suitably determinate relationships between the brain network or activity and something else that can be independently measured. A theory of representation can help work out how to relate the discoveries from neuroscience - in this case including neuroeconomics - to theories from other relevant areas, and to theories at other scales of granularity including, perhaps and eventually, whole agent theories about the rational explanation of actions in terms of desires or preferences (and beliefs).

So here’s a sketch of the highlights - given my rather specific interests - of Shea’s argument for the thesis that reward prediction error signals (RPEs) are meta-representational. To turn this into something a bit more specific and comprehensible, we need to get clear on what some of those notions mean. (All quotations are, unless otherwise noted, from Shea’s paper.)

Metarepresentation


A representation says something about something. For example, a price tag on some object says of it that it can be bought for some sum of money.

Sometimes what is represented is itself a representation. Only some of these representations are meta-representations, though. For example, I might point to a price tag and say of it that the tag is printed with a dot-matrix printer. Then I’m saying something about the form of a representation, but not the content. If, instead, I said of the very same price tag that “that’s more than I was planning on spending”, or “that’s less than it was last week”, I’d be saying something about the content of the representation. In this sort of case I’m metarepresenting.

The kinds of representation, and metarepresentation, we’re concerned with here are non-conceptual. This means that they don’t require concept possession. If that sounds perplexing, cling to this: What’s important for present purposes about a non-conceptual representation is that it has correctness, or satisfaction, conditions. So we establish that something is a representation by specifying these conditions. In the case of metarepresentations, Shea adopts the following criterion. Consider a putative metarepresentation (‘M’):

“ […] M’s having a correctness condition or satisfaction condition that concerns the content of another representation is taken to be a sufficient condition for M to be a metarepresentation. That is a reasonably stringent test. It is not enough that M concerns another representation. A representational property must figure in M’s correctness condition or satisfaction condition.”

Reward Prediction Errors (RPEs)


Very generally and informally, a prediction error is a quantity used in a class of ‘temporal difference’ learning algorithms. It is the difference between an expected or predicted value and an actual value, and is used to modify the expectations in future cases. The expectations are averages, and the details of the modification vary from case to case, but the general idea is intuitively clear: if the expected value was too low, raise it a little; if too high, lower it a little. In the case of reward prediction error, or RPE, the specific thing predicted is reward. But the fact that it’s reward that is predicted isn’t an intrinsic property of the algorithm; it depends on how an implementation of the algorithm is hooked up (to the world, and to other bits of an information processing system).
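
To make that concrete, here’s a minimal sketch in Python (not Shea’s model; the learning rate and the toy stream of outcomes are just illustrative assumptions) of an expectation being nudged by prediction errors:

    expected = 0.0   # current estimate of the value of some outcome
    alpha = 0.1      # learning rate: how far each error moves the estimate (illustrative)

    for actual in [1.0, 1.0, 0.0, 1.0]:          # a toy stream of observed outcomes
        prediction_error = actual - expected     # positive: better than expected; negative: worse
        expected += alpha * prediction_error     # raise or lower the expectation a little
        print(f"error {prediction_error:+.2f} -> new expectation {expected:.2f}")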

As Shea notes, temporal difference learning and prediction errors came out of computer science, and to begin with weren’t intended to say much about brains or animals:

“But then Wolfram Schultz and colleagues discovered that midbrain dopamine neurons broadcast an RPE signal […]. Relative to a background tonic level of firing, there is a transitory phasic increase when an unpredicted reward is delivered, and a phasic decrease in firing when a predicted reward is not delivered. This finding brought the computational modelling rapidly back into contact with real psychology, galvanised the cognitive neuroscience of decision-making and launched the science of neuroeconomics.”

To get an idea of what Schultz and colleagues discovered, consider the figure below, adapted from Schultz, Dayan & Montague (1997). I think it’s fair now to call it famous - most people with anything more than passing knowledge of neuroeconomics will recognise it and be able to explain it. Apologies if you’re already familiar with it:

Figure from Schultz, Dayan & Montague (1997)

In each of the three panels, time is represented horizontally, with earlier times to the left of later ones. The rest of each panel represents the activity of some dopamine neurons, with dots representing spikes, aggregated into a bar graph along the top. ‘CS’ is a point in time at which a conditioned stimulus is presented to an awake monkey (bottom two plots only). ‘R’ is a point at which a reward (such as a squirt of fruit juice into the mouth) is delivered to the monkey (top two plots only). The CS isn’t rewarding in itself (it’s typically a sound, or a flash of light). But R is rewarding.

The data in all three plots is from a subject familiar with CS preceding R by around one second.

In the top plot, there is no CS, but R is still delivered. In that case there is a flare-up in neural activity shortly after R. In the middle plot the presentation of CS is followed by an increase in dopamine neuron activity, but those neurons show no response to the subsequent delivery of R. In the bottom plot, CS is the same and receives the same neural response, but R is absent, and in that case there’s a drop in activity when R was expected.

These experiments were decisive against the notion that dopamine is a ‘pleasure’ molecule, or otherwise primarily linked to experienced or occurrent reward. (That was a popular view in science for a while, and is still sometimes asserted in night clubs.) If dopamine were simply linked to reward, then it wouldn’t be associated with unexpected but unrewarding CS events, and it would be associated with rewards even when they were expected.

The point is not that dopamine is unrelated to reward; rather, it’s involved in reward prediction. More specifically, the consensus now is that it carries a reward prediction error, signalling when there’s more or less reward (or reward cue) than expected at any time. That’s why there’s more of it for unexpected R and unexpected CS, and less of it for the unexpected absence of R.
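
To see how a temporal difference learner reproduces that pattern, here’s a toy simulation in Python. It’s a sketch rather than the model from Shea’s paper or from Schultz and colleagues; the trial timing, learning rate and discount factor are illustrative assumptions, and the value weights are indexed by time since the CS, so nothing before the cue can carry a prediction.

    import numpy as np

    T = 25                      # time steps per trial (illustrative)
    t_cs, t_reward = 5, 15      # when the cue and the reward arrive (illustrative)
    alpha, gamma = 0.2, 1.0     # learning rate and discount factor (illustrative)

    # Value weights indexed by 'time since CS': before the cue there is nothing
    # for a prediction to hang on, so pre-CS value is fixed at zero.
    w = np.zeros(T)

    def value(t):
        return w[t - t_cs] if t >= t_cs else 0.0

    def run_trial(reward_delivered=True, learn=True):
        """Run one trial and return the prediction-error trace delta[t]."""
        delta = np.zeros(T)
        for t in range(1, T):
            r = 1.0 if (t == t_reward and reward_delivered) else 0.0
            # RPE on arriving at time t: actual reward plus the new prediction,
            # minus what was predicted a moment earlier.
            delta[t] = r + gamma * value(t) - value(t - 1)
            if learn and t - 1 >= t_cs:
                w[t - 1 - t_cs] += alpha * delta[t]   # revise the expectation
        return delta

    for _ in range(200):      # training: the CS reliably precedes the reward
        run_trial()

    trained = run_trial(learn=False)
    omitted = run_trial(reward_delivered=False, learn=False)
    print(f"RPE at CS after training:     {trained[t_cs]:+.2f}")      # positive burst
    print(f"RPE at reward after training: {trained[t_reward]:+.2f}")  # roughly zero
    print(f"RPE when reward is omitted:   {omitted[t_reward]:+.2f}")  # negative dip

The printed pattern corresponds to the three panels: a burst at the cue (which itself can’t be predicted), little or no response to a fully predicted reward, and a dip when a predicted reward fails to arrive.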

There’s been much more detailed and specific research into the neural implementation of prediction errors and reward prediction errors since the work just described. There are also some outstanding questions and controversies over some issues of detail. Nonetheless, the general point that some brains (including human ones) implement temporal difference learning, and that reward learning involves RPEs is very widely accepted. As Shea puts it: “… the current state of the art is as strong a scientific consensus as a philosopher could possibly hope for.”

Shea’s paper (sections 2 and 3) gives a really clear and useful account of reward prediction errors, and a detailed explanation of a simplified model in which reward prediction error is immediate (as opposed to delayed, for example if rewards come after a series of actions). In the model that Shea describes, the predicted rewards are associated with actions. This is important. The monkey subjects in the experiment described above didn’t have to do anything. Not all rewards are contingent on action, and many important experiments about reward expectancy and learning don’t require subjects to make choices. The same class of learning algorithms can, though, be applied to action selection cases. Then the expected reward is contingent on the actually selected action, and any prediction error modifies the reward expectations for that action.
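
Here’s a minimal sketch of that action-selection case (again a sketch rather than the specific model in Shea’s paper; the two actions, their reward probabilities and the epsilon-greedy choice rule are illustrative assumptions):

    import random

    q = {"lever_A": 0.0, "lever_B": 0.0}         # expected reward for each action
    true_p = {"lever_A": 0.8, "lever_B": 0.2}    # hidden reward probabilities (assumed)
    alpha, epsilon = 0.1, 0.1                    # learning rate, exploration rate

    for trial in range(2000):
        # Choose an action: usually the best current estimate, occasionally explore.
        if random.random() < epsilon:
            action = random.choice(list(q))
        else:
            action = max(q, key=q.get)
        reward = 1.0 if random.random() < true_p[action] else 0.0
        rpe = reward - q[action]     # prediction error for the chosen action only
        q[action] += alpha * rpe     # revise that action's reward expectation

    print(q)   # each estimate drifts towards that action's actual reward rate

Because the prediction error is computed against the expectation for the action actually taken, only that action’s expectation gets revised on a given trial, which is the kind of credit assignment the immediate-reward model needs.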

Putting the bits together


Now that we have an idea of what metarepresentation is, and also what a reward prediction error is, it shouldn’t be difficult to see how they relate.

The reward prediction error is metarepresentational in the sense that it represents something about the content of one or more other representations. It says whether the actual incoming reward is greater than, smaller than, or equal to what was expected. Put slightly differently, it says whether the expectation was correct, or too low, or too high. Either way, the RPE is about a relationship between the contents of some other representations (of expected reward, and actual reward).

Here’s Shea:

“[…] RPEs have metarepresentational contents. They have both indicative and imperative contents (they are so-called pushmi-pullyus). The indicative content is that the content of another representation—the agent’s (first-order) representation of the reward that will be delivered on average for performing a given action—differs from the current feedback, and by how much. The imperative content instructs that it be revised upwards or downwards proportionately.”

So RPEs non-conceptually represent something about the content of other representations, and the RPEs are processed as instructions by other cognitive systems, in ways that lead to modifying the content of the first-order representations that the RPEs are about.
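
One way to picture that division of labour (a schematic sketch only; the names and data structure are made up for illustration, not taken from the paper) is to separate the RPE’s indicative content, namely which first-order reward expectation it concerns and by how much it was off, from the consumer process that treats it as an instruction to revise that expectation:

    from dataclasses import dataclass

    @dataclass
    class RPE:
        about: str     # which first-order reward expectation this signal concerns
        error: float   # indicative content: how far off it was, and in which direction

    expectations = {"press_lever": 0.5}   # first-order representations of expected reward

    def compute_rpe(action, actual_reward):
        """Producer: compare the first-order content with the incoming feedback."""
        return RPE(about=action, error=actual_reward - expectations[action])

    def consume(rpe, alpha=0.1):
        """Consumer: act on the imperative content by revising the expectation the RPE is about."""
        expectations[rpe.about] += alpha * rpe.error

    signal = compute_rpe("press_lever", actual_reward=1.0)
    consume(signal)
    print(expectations)   # the expectation the RPE was about has been nudged upwards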

This all seems spot on to me. In fact, once an appropriate account of metarepresentation, and a careful description of RPEs are laid out side by side, as Shea does in the first few sections of his paper, the conclusion he is urging seems very difficult to resist. (Later stages of Shea’s paper are partly devoted to considering alternative approaches and deflationary strategies. I’m not going to attempt an account of those sections here. The whole paper is, though, well worth careful attention.)

So what?


I think anyone interested in making philosophical sense of neuroeconomics should read Shea’s paper. I’m going to close with a few specific remarks about how it’s relevant to my preoccupation with common currencies.

When scientists (and some philosophers) refer to a proximal common currency, they often focus on the structure of the common currency as a scale. For example, Levy and Glimcher (discussed in an earlier posting) say, when discussing how the things we choose between vary along many different dimensions and don’t always have the same dimensions in common:

“What we need to do is to take into consideration many different attributes of each option (like color, size, taste, health benefits, our metabolic state, etc.), assess the value of each of the attributes, and combine all of these attributes into one coherent value representation that allows comparison with any other possible option. What we need, at least in principle, is a single common currency of valuation for comparing options of many different kinds” (Levy & Glimcher 2012: p1027).

A fairly strong claim is being made here: The brain must encode or represent options (including actions) that might differ in a wide range of modalities on a single unidimensional scale. And a lot of work goes into describing properties of the scale, such as whether in this or that brain process it represents expected or relative expected utility, or what resolution it has.

But we should also be thinking about the representational structure of the value scale itself. This is an interesting topic even if the scale doesn’t constitute a completely common, or completely consistent, currency. (Even weakened or approximate theses about proximal common currencies are theses about value representation.) The value representation can’t just be a scale; it has - somehow - to be indexed to representations of what it is about. Just as RPEs are indexed to specific expectations, so the values have to be indexed to actions, and perceptual cues, and other information.

Speaking very speculatively, it seems unlikely that a represented value scale would be anything like a simple ordered list. It might be tempting to envisage an image of a great big ruler with outcomes inscribed next to their corresponding number of ‘hedons’, or ‘utiles’. But such an image is hardly credible. The list would be prohibitively large if it actually stored expected values for all discriminable quantities of all consumption types (one beer, two beers, three beers; one dollar, two dollars, three dollars, let alone the sips and cents). It’s more likely, then, that a value representation is encoded in a mixture of procedural and model-like ways that allow reward expectancies to be generated in response to specific option sets, including by drawing on our capacities to simulate and imagine. But in that sort of case the details of credit assignment, and of what to change in response to RPEs, would be more complicated. Perhaps surprisingly, it’s clearer that RPEs are metarepresentational than it is what the representational structure of the reward expectancies themselves is.

(ASIDE: Some, including Andy Clark, have recently supported the view that the whole brain is in the business of using prediction errors to refine expectations, ultimately aiming for a state where nothing is surprising. In that case many prediction errors would be about processes other than reward, including perception and motor control. The thought that reducing prediction error is in some sense the aim of cognition, rather than a means, leads to some rather odd worrying about why creatures with brains don’t just seek out and stay in dark rooms. (In dark rooms there are no surprises, and so no prediction errors.) I hope to write about the ‘dark room problem’ here in the future. For more, including commentaries, see ‘Whatever Next?’ (may be behind a paywall). Commentaries continued at the Open Access Frontiers in Theoretical and Philosophical Psychology, including my own commentary.)

References

Brooks, R. A. (1991). Intelligence without representation. Artificial Intelligence, 47: 139–159.
Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3): 181–204.
Levy, D. J. & Glimcher, P. W. (2012). The root of all value: A neural common currency for choice. Current Opinion in Neurobiology, 22(6): 1027–1038.
Lloyd, D. (1989). Simple Minds. Cambridge, MA: MIT Press.
Schultz, W., Dayan, P. & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306): 1593–1599.
Shea, N. (2014). Reward Prediction Error Signals are Meta-Representational. Noûs, 48(2): 314–341. DOI: 10.1111/j.1468-0068.2012.00863.x [LINK]
