Cognitive Biases Frequently Encountered by the Practicing Data Scientist

Marc Harper

One of the more difficult-to-teach aspects of data science is communicating with other professionals, especially non-technical ones. Learning to recognize cognitive biases and to respond appropriately is a crucial skill.

Confirmation Bias

Our first bias is so ubiquitous it gets its own section. Confirmation bias -- the tendency to favor information that confirms one's preexisting beliefs -- is rampant throughout data science and consulting. The bias can occur very early in the life of a contract or product: often one is hired or contracted specifically to provide an outside or new confirmation of the beliefs of one or more stakeholders. It can also occur once the final analysis is done, if the stakeholder does not want to believe your results when they contradict expectations.

You've almost certainly heard this idea before: people hear what they want to hear. Indeed, "people in general are twice as likely to select information that supported their own point of view as to consider an opposing idea". Similarly, "you cannot reason someone out of something he or she was not reasoned into". For many, preexisting beliefs trump evidence.

This is a challenging bias to overcome. Many people do not like to challenge their own beliefs and may react negatively when someone else challenges them. In fact, some become even more certain of their beliefs when presented with contradictory evidence! (This is called the backfire effect and is closely related to belief revision.)


Avoiding stakeholders primarily interested in confirmations of their beliefs is possible. Early in negotiations, ask many questions about the nature of the project and the main questions to be addressed, whether there are any prevailing beliefs in the company or organization, and whether there is any evidence that supports the existing hypotheses. Ask what will happen should you prove or disprove the hypothesis. Often people will simply tell you that they are looking for a confirmatory result.

This is also something that product managers and sales staff should look out for, especially if there is a large implementation cost for bringing on a new client. Sometimes customers are just looking to prove a point internally and have no intention of long-term use.

Closely related is status quo bias: the tendency to prefer that things stay relatively the same -- a possible source of confirmation bias. This is particularly irksome when it originates from a position of power. It's often said that "science advances one funeral at a time": Max Planck famously remarked that "A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it." There is evidence of this phenomenon.

Watch out that you (as a data scientist) do not fall prey to expectation bias: the tendency for experimenters to believe, certify, and publish data that agree with their expectations for the outcome of an experiment, and to disbelieve, discard, or downgrade the corresponding weightings for data that appear to conflict with those expectations. As a data scientist you need to avoid confirmation bias in your workflows and thoughts. Insights should be data-driven. Of course it's fine to use domain-specific knowledge as a guide, but ultimately the data has to support the conclusion convincingly.

There are also "group-level" biases. As a data scientist or consultant you may experience two seemingly contradictory effects:

  • Your opinion is valued as an "outside perspective"
  • Your opinion is devalued due to Ingroup bias: the tendency for people to give preferential treatment to others they perceive to be members of their own groups.

In both cases there is a flavor of confirmation bias at play -- either your outside/expert opinion "objectively" confirms an existing belief or your opinion is rejected because it does not conform to the group belief.

Statistical Biases

Often, as the resident data scientist, you will have a much better grasp of statistics than your primary stakeholders. Decisions are frequently based on very small amounts of data, and it's useful to be able to recognize statistical biases as they appear in the decision-making process.

  • Insensitivity to Sample Size: People without training in statistics often do not understand the need to collect enough data for statistical significance. I've personally heard many statements along the lines of "it's only three data points but hopefully it is representative" (note the expectation bias as well). That's not how any of this works. The damage compounds with sampling bias, and it all goes back to a poor understanding of statistical experiment design. (A small simulation after this list shows how unstable three-point estimates are.)

  • On the other hand, you may encounter the Illusion of Validity: the belief that further acquired information generates additional relevant data for predictions, even when it evidently does not. More data isn't always useful, and many statistical models yield only marginal returns once a large amount of data is already in hand. Sometimes clients and stakeholders want to collect more data rather than accept an undesirable conclusion.

  • Survivorship Bias: focusing on success stories or exceptional cases rather than the distribution of outcomes. When combined with insensitivity to sample size this can be devastating. In a production process the average case and the variability should be the focus, not the random outlier that lasts twice as long or has an extreme value of some property.


  • Illegitimate Pattern Extrapolation: I've often seen the shape of data extrapolated far outside the range of the test data, where other failure modes or confounding effects may come into play. This often leads to poor predictions and design failures. (A sketch of this failure also appears below.)
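To make the sample-size point concrete, here is a minimal simulation (hypothetical numbers, assuming only numpy) comparing how far three-point averages and hundred-point averages stray from the true mean:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical process: true mean 100, standard deviation 15.
    true_mean, true_sd = 100.0, 15.0
    trials = 10_000

    # Estimate the mean from samples of size 3 and of size 100, many times over.
    means_n3 = rng.normal(true_mean, true_sd, (trials, 3)).mean(axis=1)
    means_n100 = rng.normal(true_mean, true_sd, (trials, 100)).mean(axis=1)

    # The spread of the estimate shrinks roughly as 1 / sqrt(n).
    print("spread of 3-point estimates:  ", round(means_n3.std(), 2))    # ~8.7
    print("spread of 100-point estimates:", round(means_n100.std(), 2))  # ~1.5

A three-point average routinely lands far from the true value, which is exactly what "hopefully it is representative" glosses over.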

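And a sketch of illegitimate extrapolation with made-up data (again assuming only numpy): a response that is actually quadratic looks nearly linear over a narrow test range, and a line fit to that range predicts badly far outside it.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical true response: quadratic, so it eventually bends back down.
    def true_response(x):
        return 2.0 * x - 0.05 * x ** 2

    # Test data was only collected for x in [0, 10], where the curve looks linear.
    x_test = np.linspace(0, 10, 20)
    y_test = true_response(x_test) + rng.normal(0, 0.2, x_test.size)

    # Fit a line to the test range, then extrapolate far outside it.
    slope, intercept = np.polyfit(x_test, y_test, 1)
    x_new = 40.0
    print("linear extrapolation at x=40:", round(slope * x_new + intercept, 1))  # ~60
    print("actual response at x=40:     ", round(true_response(x_new), 1))       # 0.0

In range, the linear fit looks excellent; out of range, the quadratic term dominates and the prediction misses by a huge margin.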

Close to my home there are many street signs advertising a local college, touting its accolades:

  • 3rd in Rhodes Scholars!
  • Top 15 in engineering!

and many more. Presumably we're supposed to conclude that this is a good school because it ranks highly in many categories. But many schools could likely find a list of similar accolades by random chance if thousands of such comparisons were carried out. Of course the advertisements do not include any information on how many potential accolades the school dredged through, or how different this set of outcomes is compared to other institutions.
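Here is a rough simulation of this kind of accolade dredging, with made-up numbers and assuming only numpy: give every school purely random scores across many ranking categories and see how many can still claim a top-15 finish in something.

    import numpy as np

    rng = np.random.default_rng(2)

    n_schools, n_categories, top_k = 500, 200, 15

    # Purely random "performance": no school is genuinely better at anything.
    scores = rng.random((n_schools, n_categories))

    # Rank schools within each category (rank 0 = highest score = "best").
    ranks = (-scores).argsort(axis=0).argsort(axis=0)

    # Fraction of schools with at least one top-k finish to advertise.
    has_accolade = (ranks < top_k).any(axis=1)
    print(f"{has_accolade.mean():.0%} of schools can claim a top-{top_k} ranking")

With 200 categories and a top-15 cutoff, essentially every school ends up with something to put on a sign, even though all of the "performance" here is pure noise.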

Decision Making Biases

As a data scientist you have a lot of influence over the decision-making process by providing crucial information to stakeholders. You also have many chances to observe collective decision making and the many biases that occur. Here's a sample.

  • Appeals to authority: often a particular individual's input will be over-weighted due to authority rather than legitimacy or expertise.
  • Belief bias: An effect where someone's evaluation of the logical strength of an argument is biased by the believability of the conclusion. It's important to focus on what the data reveals. Sometimes the results are counter-intuitive.
  • Reactive devaluation: devaluing proposals only because they purportedly originated with an adversary. It can be hard for some to admit that those they do not like can have good ideas.
  • Availability heuristic: the tendency to overestimate the likelihood of events with greater "availability" in memory, which can be influenced by how recent the memories are or how unusual or emotionally charged they may be. This often manifests as favoring data that is more relevant or more visible rather than a representative random sample.
  • Availability cascade: A self-reinforcing process in which a collective belief gains more and more plausibility through its increasing repetition in public discourse (or "repeat something long enough and it will become true"). Data and statistical insights often challenge deep-seated beliefs in a group. I once inquired why a particular parameter value was chosen and was told that a previous study had shown it to be the "sweet spot" for some outcome. When I found the original data, no correlation existed between the parameter and the outcome! The supposedly valid sweet spot was a myth that came to life through repetition within the company after a misinterpretation of the original data. (A quick check of this kind is sketched after this list.)
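When a claim like that "sweet spot" surfaces, a quick look at the original data usually settles it. A minimal sketch, assuming the historical study is available as a CSV with hypothetical columns named parameter and outcome:

    import pandas as pd
    from scipy import stats

    # Hypothetical file and column names; substitute the real study's data.
    df = pd.read_csv("original_study.csv")

    # Test for a linear relationship between the tuned parameter and the outcome.
    r, p_value = stats.pearsonr(df["parameter"], df["outcome"])
    print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")

    # An r near zero with a large p-value means the data do not support the
    # claimed relationship between the parameter and the outcome.

In the case above, the correlation was effectively zero; the belief had been kept alive only by repetition.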

Another one I've often encountered, especially at startups, is Optimism bias: the tendency to be over-optimistic, overestimating favorable and pleasing outcomes (see also wishful thinking). On the one hand it's important to stay positive at a startup since there are so many unknowns, but when the evidence says something isn't working, organizations need to adapt.

How to Deal with Biases

How should one handle any of these biases when encountered in the wild? There are a few useful techniques:

  • Focus on real data. When someone makes a wild or unsubstantiated claim, demand to see the supporting data. Often it will not exist.
  • Look for contradicting cases or known counterexamples and explain how the claim cannot be generally true.
  • Emphasize the existence of data that supports a different hypothesis.
  • Look for confounding or incorrect assumptions and challenge the premise of a claim. If the claim is based on a previous study, examine the experimental design for flaws and suggest a new experiment.

Sometimes there is simply nothing you can do, and things won't get better until the source of the problem (a misguided stakeholder, for example) leaves the organization.

Take a look at the list of cognitive biases on Wikipedia -- can you see others that you've encountered? Don't fall prey to the bias blind spot just because you've read this article!