Visual Expectations

Expectation suppression and the Bayesian brain hypothesis


What we see is deeply influenced by what we expect to see. One classic example of this is the Kanizsa triangle, in which people see an illusory white triangle defined by a collection of black shapes. Another example – if you live in a city – is hearing a siren and expecting to see an ambulance. Personally, I have had the experience, while hiking in the Rocky Mountains, of getting hungry and mistaking many a red flower for a ripe salmonberry, or of kayaking in the Everglades and mistaking a log for a crocodile.

Upon reflection, most people could probably think of ways in which their expectations influence what they perceive. However, the ability to recognize such a seemingly simple fact depends on a particular way of seeing the world – a world in which, rather than seeing objects directly, we see objects in a way that is colored by our prior expectations. In other words, our minds construct the world that we perceive.

Why, though, would we construct a model of the world rather than perceive it directly? The answer might be energy. Capturing a pixel-perfect image of the world is expensive. Consider the security camera. Both humans and security cameras observe their environment visually, but the human is additionally using other senses, subjectively experiencing those observations, and taking actions in the world. And yet a typical security camera consumes about the same amount of power as a waking human brain. This efficiency, it could be argued, relies on the human brain’s use of generative world models. Rather than seeing every frame of vision afresh, we first form an idea of what we expect to see, and update that idea only when new and relevant information arrives. In terms of energy usage, constructive perception seems to be far more efficient than direct, frame-by-frame perception.

But the theoretical efficiency of constructed perception does not, by itself, show that it is achievable. Fortunately, there are many statistical and machine learning methods that implement precisely this kind of computation. These are the Bayesian models, which form a prior hypothesis about a situation and update their probability estimates as new evidence comes in. This notion relates directly to the way that expectations affect human perception. For example, a Bayesian observer walking through the Everglades would assign a high prior probability to encountering a crocodile in that environment – so a log might easily be interpreted as evidence of a crocodile.
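
To make this concrete, here is a minimal sketch of such an update in Python. The priors and likelihoods are invented numbers, chosen only to illustrate how a strong prior can pull an ambiguous observation toward the expected interpretation:

```python
# Invented numbers for illustration: a strong prior ("crocodiles are
# common here") can pull a noisy observation ("long dark shape in the
# water") toward the expected interpretation.

def posterior_crocodile(p_croc_prior, p_shape_given_croc, p_shape_given_log):
    """P(crocodile | long dark shape) via Bayes' rule."""
    p_log_prior = 1.0 - p_croc_prior
    evidence = (p_shape_given_croc * p_croc_prior
                + p_shape_given_log * p_log_prior)
    return p_shape_given_croc * p_croc_prior / evidence

# The shape itself is slightly more consistent with a log...
likelihoods = dict(p_shape_given_croc=0.6, p_shape_given_log=0.8)

# ...but the prior depends on where you are.
print(posterior_crocodile(p_croc_prior=0.30, **likelihoods))  # ~0.24 in the Everglades
print(posterior_crocodile(p_croc_prior=0.02, **likelihoods))  # ~0.015 in a mountain lake
```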

So it seems that constructed Bayesian perception, often also called predictive perception, is not only energy-efficient but also achievable. Thus, it might be that the broad goal of human perception (and maybe even human cognition in general) is to update predictions about the world generated by an internal world model. If so, we should be able to uncover the underlying algorithm that makes such predictions, both to understand how it works at a neuronal level and to understand how it might go awry.

Perhaps because visual perception is one of the easiest cognitive tasks to study, it is also one of the best understood. With the broad framework of a predictive brain in mind, the open questions concern how the predictions of a world model are combined with observations from the senses, at both the computational and the neuronal level.

Sometimes expectations are encoded directly in the neural hardware. For example, perhaps because cardinally (vertically or horizontally) oriented objects are more common than slanted objects, neurons responding to cardinal orientations are also more common. In a similar way, top-down feedback connections from, say, the lateral occipital cortex to the visual cortex help to constrain low-level feature predictions using higher-level object context [1].

However, not all expectations are encoded in neuronal structure. Many associations happen at a timescale too short for cortical changes. Some studies have suggested that the hippocampus can encode some of these short-term associations between stimuli in real time.

Much of the encoding of perceptual expectations might be a mix of structural and functional changes. An important example is self-motion. When we make a movement, the motor regions of the brain send signals to the sensory regions, perhaps to inhibit the neuronal response to our own motion. After all, there is not much point in devoting cognitive resources to visually noticing my own walking when I already know I am walking.

So expectations seem to be quite a ubiquitous form of communication in the brain. But what do they do? Almost always, the expectation of a particular stimulus leads to a decreased neuronal response to that stimulus, a phenomenon called expectation suppression. For example, in terms of total neuronal activity, your brain would likely respond less to a car driving down a road than to a shark flopping down the same road.

While the phenomenon itself has been observed in both humans and monkeys across a wide variety of settings, the purpose of expectation suppression is still unclear. One explanation is redundancy reduction: there is little point in encoding the umpteenth car driving down the road, so the car-detecting neurons lower their firing rate. In this case, the shark-detecting neurons would keep firing at unaffected rates (depending, of course, on how shark-like the car looks). Another explanation is signal sharpening: there is little point in continuing to entertain sharks as a possibility, so the firing of shark-detecting neurons (and other irrelevant neurons) is suppressed, making the car-detecting signal appear relatively stronger.
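
The difference between the two accounts is easiest to see in a toy example. In the sketch below, the two "channels" are hypothetical feature detectors and the gain values are made up; the only point is that both accounts predict less overall activity for an expected stimulus while disagreeing about which neurons carry the reduction:

```python
import numpy as np

# Two hypothetical feature channels responding to a car stimulus.
response_to_car = np.array([1.0, 0.1])   # [car channel, shark channel]

# Redundancy reduction: turn down the gain on the *expected* channel.
redundancy_reduced = response_to_car * np.array([0.5, 1.0])

# Sharpening: suppress the channels *inconsistent* with the expectation.
sharpened = response_to_car * np.array([1.0, 0.2])

# Both produce less total activity than the raw response (1.1)...
print(redundancy_reduced.sum(), sharpened.sum())   # 0.6 and 1.02
# ...but they disagree about which neurons are quieted.
print(redundancy_reduced)   # [0.5 0.1 ]  -> car neurons dampened
print(sharpened)            # [1.  0.02]  -> shark neurons dampened
```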

Yet another explanation of expectation suppression, which combines the two already given, comes from predictive coding. On a typical predictive-coding account, there are two kinds of neuronal groups at play here: error units and prediction units. Within the error units, higher-level neurons "explain away" (that is, subtract away) the expected signals from the bottom-up input, so a strong signal in these units corresponds to a large prediction error and more surprise. Within the prediction units, higher-level neurons suppress the signal from irrelevant neurons; these might as well be called the "confirmation bias" neurons, because their function is mainly to distill the evidence that supports the existing expectation. In the end, comparing the output of the error units with that of the prediction units should yield either an updated prediction or a decision to seek further evidence.
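
As a rough illustration of this division of labor, here is a bare-bones sketch in the spirit of classic predictive-coding models. The weights, dimensions, and learning rate are arbitrary choices for the example, not anything derived from the brain; the point is only that the error units carry a small signal when the input matches the model's prediction and a large one when it does not:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))    # maps two hidden causes to four input features

def infer(sensory_input, n_steps=200, lr=0.1):
    """Settle the prediction units so that they explain the input."""
    prediction_units = np.zeros(2)                       # higher-level hypothesis
    for _ in range(n_steps):
        predicted_input = W @ prediction_units
        error_units = sensory_input - predicted_input    # "explain away" the prediction
        prediction_units += lr * (W.T @ error_units)     # update to reduce remaining error
    return prediction_units, error_units

expected_input = W @ np.array([1.0, 0.0])   # an input the model can explain
surprising_input = rng.normal(size=4)       # an input it largely cannot

for x in (expected_input, surprising_input):
    _, err = infer(x)
    print(round(float(np.abs(err).sum()), 3))   # residual error: small only for the expected input
```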

While the theory of predictive coding seems to offer a complete explanation for expectation suppression and makes concrete predictions about neuronal structure and function, it has yet to be tested properly in neuroimaging studies, owing to a lack of sufficient spatial and temporal resolution. Functional magnetic resonance imaging, or fMRI, has relatively high spatial resolution (down to about 0.5 mm for 7 Tesla machines). However, because it relies on blood-oxygenation changes that follow neural activity, its measurements lag by roughly 1-5 seconds. Electroencephalography, or EEG, meanwhile, has high enough temporal resolution but lacks spatial resolution, especially because electric fields disperse as they travel through tissue. In their review paper, de Lange and coauthors suggest laminar profiling with ultra-high-field (i.e. 7 Tesla) fMRI – that is, expose people to certain types of stimuli, like the Kanizsa triangle, and see which layers of the cortex light up. In the case of the Kanizsa triangle, one would hope to see activity in the cortical layers that receive top-down feedback rather than in the middle, input-receiving layers, because the illusory triangle is a top-down prediction [2].

In addition to expectations, attention is another cognitive factor that biases our perception. It has long been known that attention can improve performance on visual tasks. But does attention also change the subjective experience of perception? To answer this question, Carrasco and colleagues performed a series of experiments over a decade [3] [4]. Their psychophysical paradigm worked as follows. First, participants fixed their gaze on the center of a digital screen. Then, a dot was flashed on one side of the screen, cuing the participant’s attention toward that side even though their gaze remained fixated on the center. Finally, two Gabor patches – essentially circles filled with parallel stripes – of various orientations and contrast levels were presented, one on each side. If the cued attention affected subjective experience, participants should be more likely to call the two patches equal, even if the cued patch had a slightly lower contrast level.
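
To see how such a paradigm can reveal a shift in appearance, here is a small simulation of the comparative judgment. The attentional boost and noise level are made-up parameters rather than values from the studies; the idea is simply that boosting the apparent contrast of the cued patch pushes the point of subjective equality below a physical match:

```python
import numpy as np

rng = np.random.default_rng(1)

def frac_cued_seen_higher(cued_contrast, standard_contrast=0.5,
                          attention_boost=0.06, noise=0.05, n_trials=20000):
    """Fraction of simulated trials on which the cued patch appears higher in contrast."""
    perceived_cued = cued_contrast + attention_boost + rng.normal(0, noise, n_trials)
    perceived_standard = standard_contrast + rng.normal(0, noise, n_trials)
    return float(np.mean(perceived_cued > perceived_standard))

for contrast in (0.42, 0.44, 0.46, 0.50):
    print(contrast, frac_cued_seen_higher(contrast))

# The cued patch reaches subjective equality (chosen ~50% of the time)
# near a contrast of 0.44: physically dimmer than the standard, yet
# perceived as its equal.
```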

Indeed, this is what Carrasco and colleagues repeatedly observed, across multiple variations of the experiment. In these tests of subjective equality, participants systematically overestimated the contrast of the Gabor patches to which their attention had been cued. To address concerns that the effect stemmed from response bias or intra-modal interference, the result was corroborated by a cross-modal electrophysiological study, which cued participants using sound and measured their brain responses using EEG [5]. In addition to replicating the effect of attention on perceived contrast, the authors observed a sharp burst of activity in the visual cortex about 100 ms after the stimuli were presented, and this activity correlated positively with the perceived contrast of the cued Gabor patch. A response at 100 ms would be hard to explain away as response bias, which should be much slower. Moreover, the use of an auditory cue made it unlikely that visual interference explained the original effect. Carrasco and colleagues have since replicated the effect along many other dimensions, including spatial frequency, object size, and color saturation.

Notably, Carrasco’s studies focus on covert spatial attention — “covert” means the eyes are not focused on the subject of attention. Thus, these study results can only be explained by the cognitive, top-down effects of attention and not by physical differences between the fovea and the periphery.

If we return to the model of a Bayesian brain, where does attention fit in? Overt attention, in which the image enters the focus of vision, can be directly thought of as an increase in the certainty of the evidence, corresponding to the likelihood term in Bayesian inference. Covert attention, meanwhile, is harder to pin down in Bayesian terms, as it doesn’t change the physical precision of the retinal signal. Still, selectively processing a certain region of peripheral vision with more resources is likely to result in higher precision in that area of visual processing relative to other areas of the periphery. In other words, covert attention may function similarly to overt attention in modulating likelihood, but at a higher level of visual processing.
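
One way to make this precision story concrete is with a Gaussian toy model, sketched below. The prior, observation, and precision values are invented; the only point is that raising the precision of the sensory evidence, as attention plausibly does, pulls the posterior toward that evidence and away from the prior:

```python
def gaussian_posterior(prior_mean, prior_precision, observation, obs_precision):
    """Combine a Gaussian prior and likelihood; precision = 1 / variance."""
    post_precision = prior_precision + obs_precision
    post_mean = (prior_precision * prior_mean
                 + obs_precision * observation) / post_precision
    return post_mean, post_precision

prior_mean, prior_precision = 0.30, 10.0   # expect fairly low contrast
observation = 0.60                          # the contrast actually presented

# Unattended: the evidence is no more precise than the prior.
print(gaussian_posterior(prior_mean, prior_precision, observation, 10.0))   # mean 0.45
# Attended: the same evidence, weighted by a higher precision.
print(gaussian_posterior(prior_mean, prior_precision, observation, 40.0))   # mean 0.54
```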

References

[1] De Lange, F. P., Heilbron, M., & Kok, P. (2018). How do expectations shape perception? Trends in Cognitive Sciences, 22(9), 764-779.

[2] Lawrence, S. J. D., Formisano, E., Muckli, L., & de Lange, F. P. (2019). Laminar fMRI: Applications for cognitive neuroscience. NeuroImage, 197, 785-791. https://doi.org/10.1016/j.neuroimage.2017.07.004

[3] Carrasco, M., & Barbot, A. (2019). Spatial attention alters visual appearance. Current Opinion in Psychology, 29, 56-64.

[4] Carrasco, M. (2018). How visual spatial attention alters perception. Cognitive Processing, 19, 77-88.

[5] Störmer, V. S., McDonald, J. J., & Hillyard, S. A. (2009). Cross-modal cueing of attention alters appearance and early cortical processing of visual stimuli. Proceedings of the National Academy of Sciences, 106(52), 22456-22461. https://doi.org/10.1073/pnas.0907573106