Psychology

Everybody Lies

by Seth Stephens-Davidowitz

13 min read

5 key ideas

March 24, 2026

People lie to doctors, pollsters, and themselves—but not to Google. Search data exposes the shocking gap between public behavior and private truth, revealing…

In Brief

Everybody Lies (May ) reveals that internet search data exposes what people truly think — unfiltered by social pressure — in ways surveys and polls never can. Drawing on Big Data analysis, it shows how to interpret this hidden signal, avoid false correlations, and use aggregate data responsibly to understand human behavior at scale.

Key Ideas

Survey answers hide actual behavior

Treat any survey about sensitive behavior — sexual habits, prejudice, financial choices, parenting — as directionally biased toward the socially acceptable answer. The gap between what people report and what they do is systematic and large, not random noise.

Audience presence distorts all reports

Ask of any data source: does the person submitting have an audience? Social media posts and survey responses are performances; private search queries are goals. The closer you get to a goal-driven, audience-free data source, the closer you get to truth.

Multiple comparisons breed false positives

When evaluating a Big Data correlation claim, ask how many variables were tested. If the answer is 'many,' the result may be Coin 391 — a false positive produced by testing enough variables until one correlated by chance. Demand an out-of-sample test before acting on the finding.

Natural experiments establish real causation

Natural experiments — situations where a sharp cutoff, lottery, or exogenous shock randomly assigns people to different conditions — can establish causation without randomized trials. When policymakers cite observational comparisons, ask whether a natural experiment exists that could settle the question more cleanly.

Aggregate trends guide; individual predictions fail

Aggregate search data can legitimately guide resource allocation (suicide prevention ads in a city with rising searches, officers redeployed to protect a threatened mosque). Individual-level prediction is statistically treacherous: even a tenfold increase in individual risk leaves most flagged people innocent, making pre-emptive action both ethically and numerically unjustifiable.

Who Should Read This

Science-curious readers interested in Behavioral Psychology and Social Psychology who want to go beyond the headlines.

Everybody Lies

By Seth Stephens-Davidowitz, foreword by Steven Pinker

11 min read

Why does it matter? Because every data source we've used to understand human nature has the same fatal flaw baked in.

You probably think you have a reasonable picture of who people are. Imperfect, sure — polls miss things — but roughly right. Ask Americans how many condoms they use: women give you one number, men give you a significantly higher one, and retail sales data gives you a third that makes both impossible. The math doesn't work. Someone isn't being straight with the researchers. And it's not one demographic or one type of question. It's everyone. Systematically. On almost every topic that carries any social weight.

This book is about the first instrument in human history that catches us telling the truth. Not a survey, not a lab, not a focus group. A search bar. What people type there, alone, unjudged, no one watching, turns out to reveal a picture of human nature that makes everything we thought we knew look curated, performed, and wrong.

Everyone Lies to Surveys — and Not Randomly

Surveys don't fail randomly. They fail in the same direction, every time, for everyone.

The condom numbers aren't an anomaly. Every survey touching something sensitive shows the same skew — each group tilts exactly where social expectation rewards it.

Answering a question — even on paper, even anonymously — is a social act. People perform. A man who admits he hasn't had sex in a year isn't answering a survey; he's confessing something he'd rather not confess. A woman who reports high frequency isn't reporting data; she's navigating a script about what women are supposed to do. The distortion isn't random noise that averages out. It's systematic, directional pressure applied to every sensitive question the same way: toward whatever the respondent believes will sound better.

In practice, surveys are reliable precisely where they don't need to be — preferences for vanilla versus chocolate — and unreliable precisely where they matter most. Anything touching sex, race, money, health habits, or social behavior gets quietly filtered through the question every respondent unconsciously answers first: What would a decent person say here?

The Search Bar Is the Only Confessional That Doesn't Perform Back at You

In 2012, a graduate student in economics sat down with Google Trends and typed in a word he expected almost no one to search. The word was "nigger." He braced for a rarity. Instead, the search volume matched "migraine," "economist," and "Lakers." One in five of those searches was paired with "jokes."

The researcher was Seth Stephens-Davidowitz, and what he'd found cracked open years of assumptions about where racism lived. He had long trusted surveys, which told him explicit racism had faded to a fringe phenomenon. But surveys ask people to represent themselves. Google doesn't. When you open a search bar, you have one thing in mind: getting what you came for. No one grades your answer. No one sees the question. The social pressure that distorts every survey response (the invisible audience asking what would a decent person say here?) simply isn't present.

What emerged from the search data was a map of racism that looked nothing like the one surveys had drawn. Conventional wisdom placed modern prejudice primarily in the South, primarily among Republicans. But the counties with the highest rates of racist searches clustered in upstate New York, western Pennsylvania, eastern Ohio, and industrial Michigan, alongside West Virginia and Mississippi. The geography was East versus West, not South versus North. Racist searches were no higher in Republican-heavy areas than in Democratic ones.

Stephens-Davidowitz sat on this map for four years. It had one striking application: in areas with the most racist searches, Barack Obama underperformed John Kerry (the white Democrat who ran in 2004) by a margin that held even after controlling for education, church attendance, age, and gun ownership. The pattern appeared for no other Democrat. Only for Obama. The search data suggested explicit racism had cost him roughly four percentage points nationally (a finding five journals rejected because reviewers couldn't believe that many vicious racists existed).

Then 2016 arrived. Nate Silver, trying to explain why Donald Trump's primary support was strongest in the Northeast and industrial Midwest and weakest out West, tested unemployment, immigration rates, religiosity, opposition to Obama. The single best predictor of Trump's support, county by county, turned out to be the racism map Stephens-Davidowitz had built from search queries four years before Trump ran for anything.

The surveys had missed all of it. Racists weren't hiding in unusual corners. They were everywhere surveys looked. They just weren't telling surveys the truth. They were telling the truth to Google, because Google was the only one who could help them find what they came for.

The act of searching is instrumental: you type what you mean because accuracy is the only way to get the result you want. The pattern isn't unique to racism. Surveys put the gay male population at 2–3%, but roughly 5% of male porn searches are for gay porn, and that number barely shifts between Mississippi and Rhode Island. Aggregate millions of those unguarded moments and you don't get a picture of how people want to be seen. You get a picture of what they actually think, want, and fear. That picture, Stephens-Davidowitz found, is darker than almost anyone had been willing to admit.

The Horse Expert Who Won by Ignoring Everything the Experts Used

The same instinct — look where no one else is looking — drove a former banker to spend twenty years trying to predict horse performance from the inside out.

In August 2013, a man in suspenders and a hearing aid stood in a barn at the Saratoga Springs auction, looking at a horse nobody wanted. Horse No. 85 had a scratched ankle, a mediocre pedigree, and nothing in his appearance to recommend him. Sixty-two other horses would sell for more that day, two of them for over a million dollars. No. 85 went for $300,000, bought under a pseudonym by his own owner, Ahmed Zayat, after his data team delivered an unusual message: don't sell this horse. Sell your house if you have to. Do not sell this horse.

The man behind that message was Jeff Seder. He'd walked away from banking in his twenties after staring at a mural of an open field and realizing he was in the wrong life. He spent the next twenty years trying to become the best predictor of horse racing performance in the world. For most of those years, he failed. He built the first large dataset on horse nostril size. Nostril size predicted nothing. He gave horses EKGs. Heart rhythm predicted nothing. He once grabbed a shovel to measure the volume of a horse's excrement, on the theory that losing weight before a race matters. That didn't correlate either.

Then he built a portable ultrasound and started looking at internal organs. The left ventricle, the chamber that drives oxygenated blood to the muscles, turned out to matter enormously. The bigger it was relative to the horse's body, the better the horse ran. It was the single most predictive variable he'd ever found, and no one else was measuring it.

Horse No. 85's left ventricle sat in the 99.61st percentile. Every other key organ was also well above average, which ruled out illness as the cause. Seder had seen thousands of horses over two decades. He told Zayat this might be a one-in-a-million animal.

Eighteen months later, American Pharoah became the first Triple Crown winner in 37 years.

The experts Seder beat weren't running bad statistics. They were running good statistics on the wrong data. Pedigree, the standard signal every serious horseman could recite from memory, explains only a fraction of racing success. Seder ignored it entirely. He spent twenty years searching for a signal that actually predicted performance, not one tradition had certified as important. When he found it, it was sitting inside the horse's chest, invisible to anyone who hadn't thought to look.

That's the move the Big Data revolution actually makes. Not running better models on the same variables everyone else uses. Finding the variable no one thought to measure, and then measuring it until you understand what it means.

You Can Prove Causation Without a Lab — If You Know What to Look For

How do you prove that something causes something else without running a controlled experiment?

Most of the questions that actually matter can't be randomized. You can't randomly assign children to elite schools or withhold advertising from some markets to test whether it works. Randomized trials work beautifully when you can pull them off. But most of the time, you can't.

What you can do is look for places where life already ran the experiment. The calendar, the box office, the blunt cutoff on an admissions test — these create treatment and control groups without any researcher's involvement. Economists call these natural experiments, and the results often contradict both intuition and lab studies.

Take violent movies and crime. Every psychology experiment had found the same thing: people exposed to violent films reported more anger afterward. The reasonable inference was that hit violent movies, seen by millions on the same weekend, must be raising real-world violence. Gordon Dahl and Stefano DellaVigna tested this against FBI hourly crime data, box-office numbers, and violence ratings covering a decade. Some weekends, the most popular film was something like Hannibal. Others, Runaway Bride. They compared assault rates.

Weekends with the popular violent movie had lower crime. The drop started before the movie even began. The mechanism is embarrassingly obvious once you see it: violent films attract young, aggressive men who would otherwise be at bars. Theaters serve no alcohol. A man in a seat watching Hannibal is a man who is not drinking at a pool hall at midnight. Crime stayed lower until 6 a.m. because the alcohol-fueled kind takes hours to climb back. The lab experiments had measured something real: violent films do make people angrier. But that effect was swamped by where aggressive men were spending their evening.

That's what a natural experiment catches that a lab study misses: the full causal chain, not just one link in it. Dahl and DellaVigna randomized nothing. They found an experiment the world had already run.

The same logic overturned one of education's most durable beliefs. Researchers compared students who scored just above the admission cutoff for Stuyvesant High School, one of America's most selective public high schools, to students who scored just below it. The design works because students on either side of a test cutoff are nearly identical — one group happened to score a point higher on one particular day. AP scores, SAT scores, college prestige: indistinguishable. The paper's title said it: "Elite Illusion." Better students attend Stuyvesant; Stuyvesant doesn't make students better.
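
The cutoff comparison described above can be sketched on synthetic data. This is a minimal illustration, not the method of the "Elite Illusion" paper: the score scale, noise levels, the `CUTOFF` value, and the function names are all assumptions made up for the example. Later outcomes here depend only on a student's underlying ability, so any gap between admitted and rejected students comes from who gets in, not from the school.

```python
import random
from statistics import mean

CUTOFF = 500.0  # hypothetical admission threshold on an invented score scale

def simulate_students(n=20000, school_effect=0.0, seed=2):
    """Synthetic (entrance_score, later_score) pairs; every number is invented."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        ability = rng.gauss(500, 50)
        entrance = ability + rng.gauss(0, 20)  # noisy score on one test day
        admitted = entrance >= CUTOFF
        later = ability + rng.gauss(0, 20) + (school_effect if admitted else 0.0)
        students.append((entrance, later))
    return students

def naive_gap(students):
    """Admitted minus rejected, ignoring that the two groups differ in ability."""
    admitted = [later for entrance, later in students if entrance >= CUTOFF]
    rejected = [later for entrance, later in students if entrance < CUTOFF]
    return mean(admitted) - mean(rejected)

def gap_at_cutoff(students, bandwidth=5.0):
    """Compare only students within `bandwidth` points of the cutoff."""
    above = [later for entrance, later in students
             if CUTOFF <= entrance < CUTOFF + bandwidth]
    below = [later for entrance, later in students
             if CUTOFF - bandwidth <= entrance < CUTOFF]
    return mean(above) - mean(below)

if __name__ == "__main__":
    students = simulate_students()  # the school adds nothing in this run
    print("naive gap:    ", round(naive_gap(students), 1))
    print("gap at cutoff:", round(gap_at_cutoff(students), 1))
```

With `school_effect=0.0` the naive comparison still shows a large gap, while the near-cutoff comparison shrinks it to a few points. The small remainder comes from the bandwidth being wider than zero; a fuller analysis would fit a line on each side of the cutoff instead of comparing raw means.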

The More Variables You Test, the Easier It Is to Trick Yourself

Imagine you want to predict the stock market using a lucky coin — one found through rigorous testing. You label a thousand coins, flip each one every morning for two years, and record whether it landed heads and whether the S&P 500 went up or down. After 504 trading days, you dig through the numbers. And there it is: Coin 391. It called market moves correctly 70.3% of the time. The relationship is statistically significant. Coin 391 is your edge.

Except it isn't. With a thousand variables chasing just 504 observations, some variable will hit a lucky streak by pure chance. Reduce it to a hundred coins and the effect mostly disappears. Run the test over twenty years and Coin 391 regresses to chance. You haven't found a signal. You've found the coin that got lucky in a very short window.
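
The coin story can be run as a small simulation. This is a sketch under stated assumptions: the market is modeled as a fair coin flip each day, and the function name `best_coin_accuracy` is invented for the example. It reports how accurate the luckiest coin looks when every coin is pure noise.

```python
import random

def best_coin_accuracy(n_coins, n_days, seed=0):
    """Hit rate of the luckiest coin when coins and market are both random."""
    rng = random.Random(seed)
    # One random bit per trading day: 1 = market up, 0 = market down.
    market = rng.getrandbits(n_days)
    best = 0.0
    for _ in range(n_coins):
        coin = rng.getrandbits(n_days)          # one flip per day
        misses = bin(coin ^ market).count("1")  # days where coin and market disagree
        best = max(best, 1 - misses / n_days)
    return best

if __name__ == "__main__":
    # (coins tested, trading days observed)
    for n_coins, n_days in [(1, 504), (1000, 504), (1000, 5040)]:
        acc = best_coin_accuracy(n_coins, n_days)
        print(f"{n_coins:>5} coins, {n_days:>5} days: best accuracy {acc:.3f}")
```

Testing a thousand coins over 504 days reliably produces a "winner" several points above 50%, and stretching the window to twenty years pulls the winner back toward chance. The size of the lucky streak in this toy setup need not match the 70.3% in the story; the point is only that the best of many random variables always looks better than random.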

Call it the curse of dimensionality, and it's not hypothetical. Researchers at Indiana and Manchester Universities tested moods expressed in tweets (happiness, anger, calmness, kindness) against several time lags, looking for which combination predicted Dow Jones moves. Their data covered only a few months. They found one: calmness in tweets predicted market rises six days later. A hedge fund launched on this finding. It shut down one month later.

The uncomfortable part is that the search for weird, non-obvious signals — the instinct that found American Pharoah's left ventricle — is the same instinct that finds Coin 391. Seder spent twenty years testing nostril size, heart rhythm, excrement volume, and other measurements nobody else was taking. Most predicted nothing. The left ventricle was real. But the only way to find the real signal was to cast the net wide enough to catch a lot of noise alongside it. More variables mean more chances to discover something true. They also mean more chances to fool yourself. The discipline isn't in being less creative about what to measure; it's in testing every discovery against new data before you bet real money on it.
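
Testing a discovery against new data can be sketched the same way. In this hypothetical setup (fair-coin market, invented function name `discover_then_validate`), the best coin is chosen on one 504-day window and then scored on a second window it has never seen.

```python
import random

def discover_then_validate(n_coins=1000, n_days=504, seed=1):
    """Pick the best coin on window 1, then score that same coin on window 2."""
    rng = random.Random(seed)
    total = 2 * n_days
    market = rng.getrandbits(total)  # one bit per day, both windows
    coins = [rng.getrandbits(total) for _ in range(n_coins)]
    window_mask = (1 << n_days) - 1

    def accuracy(coin, first_day):
        # Keep only the n_days bits starting at first_day, then count disagreements.
        diff = ((coin ^ market) >> first_day) & window_mask
        return 1 - bin(diff).count("1") / n_days

    # Discovery: the coin that looks best in-sample (days 0 .. n_days-1).
    winner = max(coins, key=lambda c: accuracy(c, 0))
    in_sample = accuracy(winner, 0)
    # Validation: the same coin on days it was not selected on.
    out_of_sample = accuracy(winner, n_days)
    return in_sample, out_of_sample

if __name__ == "__main__":
    in_acc, out_acc = discover_then_validate()
    print(f"in-sample {in_acc:.3f}, out-of-sample {out_acc:.3f}")
```

A coin that was merely lucky has no reason to stay lucky, so its out-of-sample accuracy usually falls back near 50%. A real signal, like the left ventricle, would be expected to keep most of its edge on the second window.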

The Free Steak Dinner Is Not Generosity — It's Mathematically Timed Extraction

A casino manager approaches a regular at the slot machines. She's lost about $2,800 tonight. He puts on a warm smile: "I see you're having a rough day. Why don't you take your wife to the steakhouse — on us?" It sounds like warmth. It is revenue management.

Harrah's calls this pain-point modeling. Their customer analytics platform, which they call Terabyte, builds a profile on each regular gambler (age, zip code, visit patterns, win/loss sequences), then finds that gambler's statistical twins among thousands of past customers to determine how much loss those people absorbed before going dark for weeks. That figure is the extraction ceiling. At $2,800, the free dinner costs less than losing her for a month. At $3,100, the window has already closed.
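
The "statistical twins" step reads like a nearest-neighbor lookup, which can be sketched as follows. Nothing here reflects Harrah's actual system: the two features, the distance scaling, the function name `pain_point`, and every row of `past_customers` are invented for illustration.

```python
from statistics import median

# Hypothetical history: (age, visits_per_month, loss_in_usd_before_a_long_absence)
past_customers = [
    (62, 8, 2900), (58, 7, 3100), (65, 9, 3000),
    (34, 2, 900), (29, 1, 700), (41, 3, 1200),
]

def pain_point(age, visits, history, k=3):
    """Median loss at which the k most similar past customers stopped coming."""
    def distance(row):
        # Assumed scaling: ten years of age counts like one visit per month.
        return ((row[0] - age) / 10) ** 2 + (row[1] - visits) ** 2
    twins = sorted(history, key=distance)[:k]
    return median(loss for _, _, loss in twins)

if __name__ == "__main__":
    print(pain_point(60, 8, past_customers))
```

For a 60-year-old who visits eight times a month, the three closest rows are the older, frequent visitors, so the estimate is the median of their losses. An intervention like the free dinner would then be timed below that figure.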

Every technique in this book cuts both ways. Behavioral data is honest because people act out what they actually want rather than performing for a researcher, and that honesty is exactly what makes it useful to whoever gets there first. You can deploy it to find which neighborhoods need more suicide prevention resources, or to find the precise dollar amount before a customer is too demoralized to return.

The individual prediction problem is where it breaks down. Every month, roughly 3.5 million Americans make suicide-related searches; fewer than 4,000 die by suicide that month. Aggregate signals work: a city-level spike in "kill Muslims" searches is a reason to put more officers near the local mosque. Individual signals don't: the same search is not a reason to interrogate each of the thousand people who typed it.
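
The arithmetic behind that gap is short enough to write out. The two figures come from the paragraph above; the assumption that every death was preceded by a search is deliberately generous and is not a claim from the book, so the result is an upper bound. The function name is invented for the example.

```python
searchers_per_month = 3_500_000  # from the text: people making suicide-related searches
deaths_per_month = 4_000         # from the text, used as a ceiling ("fewer than 4,000")

def max_share_true_positive(flagged, actual_cases):
    """Upper bound on the share of flagged people who are actual cases.

    Assumes, generously, that every actual case is among the flagged.
    """
    return min(actual_cases, flagged) / flagged

share = max_share_true_positive(searchers_per_month, deaths_per_month)
print(f"At most {share:.2%} of searchers; at least {1 - share:.2%} are false alarms")
# prints: At most 0.11% of searchers; at least 99.89% are false alarms
```

Even under the most favorable assumption, acting on the individual search would be wrong for more than 99 of every 100 people flagged, which is why the signal is useful for a city and not for a person.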

The same tool. Wildly different returns depending on whether you're protecting a population or targeting an individual. Worth knowing which one you're dealing with.

Fewer Than 3% of You Are Still Reading — and That's the Whole Point

Here's a number: fewer than 3% of readers finish a book like this one. Stephens-Davidowitz knows this because he checked the Kindle highlight data — and, following the data even here, he stops writing and goes to get a beer. The joke is earned. The whole argument has been that the confessional nobody thought was a confessional (the search bar, the late-night query, the thing you typed because you needed help and didn't care who was watching) is the most honest record of human inner life ever assembled. Darker than we'd like. Stranger than we'd admit to a pollster. And, if you know how to read it, genuinely useful in ways that surveys never were. You now know how to read it.

Notable Quotes

“into Google Trends. Call me naïve. But given how toxic the word is, I fully expected this to be a low-volume search. Boy, was I wrong. In the United States, the word”

“—was included in roughly the same number of searches as the word”

“I wondered if searches for rap lyrics were skewing the results? Nope. The word used in rap songs is almost always”

Frequently Asked Questions

What is Everybody Lies about?: Everybody Lies reveals that internet search data exposes what people truly think, unfiltered by social pressure, in ways surveys and polls never can. The book draws on Big Data analysis to show how to interpret this hidden signal, avoid false correlations, and use aggregate data responsibly to understand human behavior at scale. Stephens-Davidowitz argues that search queries reflect genuine private goals, giving insight into behavior, intentions, and beliefs that traditional research methods fail to capture.
Why are surveys unreliable for sensitive behavior?: Surveys about sensitive behavior — sexual habits, prejudice, financial choices, parenting — are directionally biased toward socially acceptable answers. The gap between what people report and what they do is systematic and large, not random noise. Crucially, social media posts and survey responses are performances; private search queries are goals. Understanding this distinction between audience-facing behavior and audience-free intent reveals why traditional survey methods systematically misrepresent the truth on sensitive topics, missing the genuine human desires that search data captures.
How do you avoid false correlations in Big Data?: When evaluating any Big Data correlation claim, ask how many variables were tested. If many were tested, the result may be Coin 391, a false positive produced by testing enough variables until one correlated by chance. Before acting on any finding, demand an out-of-sample test to verify that the correlation holds on new data. The more variables tested, the greater the chance of finding a spurious correlation instead of a genuine relationship.
Should Big Data be used to predict individual behavior?: Aggregate search data can legitimately guide resource allocation — for example, suicide prevention ads in cities with rising searches, or officers redeployed to protect a threatened mosque. However, individual-level prediction is statistically treacherous. Even a tenfold increase in individual risk leaves most flagged people innocent, making pre-emptive action both ethically and numerically unjustifiable. The book emphasizes using aggregate population patterns responsibly while recognizing that attempting to scale from broader population trends to individual-level predictions introduces dangerous false positives and serious ethical complications.