A few weeks ago, Ben Recht finished his excellent blog series on Paul Meehl's Philosophy of Psychology Lectures. The lectures and his posts led me to think a lot about what constitutes a theory, how one measures the verisimilitude of a theory, and how we evaluate and evolve theories. Why? I think we at times lack the language to discuss disagreements about theories and how we update our theories in response to evidence. This blog series aims to expose more people to Meehl's framework for understanding theories and to add some new terminology of my own.
In this blog post, I want to expand on three things:
First, a working definition of the verisimilitude of a theory as the information it gives us about future events we care about at time horizons we care about.
Second, an explanation of the fallacy in Popperian logic: treating implication as the gold standard of theorizing, when the gold standard is actually perfect information.
Finally, an expansion of the idea of the Lakatosian defense, where we add necessary assumptions to a theory to explain negative results, into what I dub the Lakatosian offense, where we use evidence to remove assumptions whose necessity we were unsure of.
Trying to define verisimilitude
One thing I love about Meehl is his expertise in using formalisms to illuminate. In modern CS papers, formalisms are often poorly justified excuses for using a specific set of math tools that gives the result the paper's author wanted. Meehl's formalisms and equations are not for crunching numbers: they are there to put elegant structure to vague ideas. If speaking clarifies thought and writing clarifies speech, then a clear formalism clarifies the vagaries of writing. I want to follow in these footsteps and take a stab at formalizing verisimilitude, the "realness" or "power" of a theory. Here is a first attempt at describing it.
Math symbols warning: forgive me for Substack's lack of inline LaTeX.
Part 1: Defining a model of the world

Imagine a world divided into discrete, individual timesteps[1], where at each timestep any number of events from a set of possible events $\mathcal{V}$ can happen. Call the subset of events that happen at time $i$ $E_i \subseteq \mathcal{V}$. Call the set of possible timelines (of past and future world states) $X = X_0 \times X_1 \times X_2 \times \dots$, where $X_i$ is the set of all possible states at time $i$. A theory $T$ defines a probability measure $P_T$ over these timelines, i.e. a probability space[2] $(X, \mathcal{F}, P_T)$, as well as the resulting conditional probabilities. In other words, a theory tells us how likely events are at certain timesteps, potentially conditioned on other events.
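As a toy illustration (my own hypothetical example, not one from Meehl), a theory about weather and commuting might assign conditional probabilities such as

$$P_T(\text{traffic jam} \in E_i \mid \text{rain} \in E_{i-1}) = 0.8, \qquad P_T(\text{traffic jam} \in E_i \mid \text{rain} \notin E_{i-1}) = 0.2$$

This toy theory will reappear in the examples below.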
Part 2: Defining a criterion for verisimilitude
A good theory will give us good information about specific future events we care about, at a specific time delay (or set of time delays) into the future. We define a weighting function (aka a "how much we care" score) $g(V, d) \geq 0$, which assigns a weight to each subset of events $V \subseteq \mathcal{V}$ and each prediction delay $d$.

Note that $d$ here is not a point in time, but the distance between two points in time. In other words, it's a measure of how early our theory is able to make a prediction.
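As a concrete (and entirely hypothetical) continuation of the toy weather example above, a $g$ that only cares about predicting traffic jams one timestep ahead would be

$$g(V, d) = \begin{cases} 1 & \text{if } V = \{\text{traffic jam}\} \text{ and } d = 1,\\ 0 & \text{otherwise.}\end{cases}$$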
Part 3: Defining a measure for how verisimilitudinous a theory is
One way we can define how well a theory meets this criterion is as follows (a toy computation follows the notes below). Define the quality measure $Q$ as

$$Q(T) = \frac{1}{N}\sum_{i=1}^{N} \sum_{d=1}^{i} \sum_{V \subseteq \mathcal{V}} \big(2\cdot\mathbb{1}[V \subseteq E_i] - 1\big)\, g(V, d)\, \ln P_T\big(V \subseteq E_i \mid E_0, \dots, E_{i-d}\big)$$

Where:

$\mathbb{1}[V \subseteq E_i]$ is an indicator function that equals 1 if the event did happen and 0 otherwise. Multiplying by 2 and subtracting 1 makes it either 1 or -1 instead.

$g(V,d)$ is the weighting function defined above,

$P_T(V \subseteq E_i \mid E_0, \dots, E_{i-d})$ is the conditional probability our theory assigns to the event in question given information at least $d$ timesteps in the past. Given that this equation is illustrative and not practical, the natural log is arbitrary, but it is added for easy comparison with some existing concepts in probability (e.g. entropy, information), and

The summations ask us to average over (1) all possible timesteps, and add up over (2) all potential delays, from the moment before back to timestep 0, and (3) all possible subsets of events.
Some notes:
This definition only linearly penalizes false negatives on the 'correct' event and false positives on the other events. With a few more symbols, you could define an arbitrary loss function over given false positive and false negative rates. These are not necessary to understand the core idea, and are left as an exercise to the reader[3].
Our definition of a theory's quality is invariant to where in the timeline you are. A good theory should be evergreen, but again, nuance could be added here at the cost of simplicity.
If $g(V,d)$ is 1 for a specific event $V^*$ at delay $d^*$ and 0 elsewhere, then the difference in quality between theories $T$ and $T'$, $Q(T) - Q(T')$, is how many more nats (bits, if you swap the natural log for a base-2 log) of information $T$ gives about $(V^*, d^*)$ than $T'$ does.
This definition only focuses on the "realness" of a theory (verisimilitude), and not other considerations like "simplicity".
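To make the formula concrete, here is a minimal Python sketch of the computation, reusing the hypothetical rain/traffic-jam theory and weighting function from above. The event names, probabilities, and observed timeline are all assumptions invented for illustration, not anything from Meehl or from the rest of this post.

```python
import math
from itertools import chain, combinations

# Toy universe of events (hypothetical, invented for this illustration).
EVENTS = frozenset({"rain", "traffic_jam"})

def powerset(s):
    """All subsets of s, as frozensets."""
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def theory_prob(V, i, d, history):
    """P_T(every event in V happens at time i | events up to time i - d), for the toy
    theory: rain at the previous timestep makes a traffic jam likely."""
    p_rain = 0.3
    p_jam = 0.8 if (d == 1 and "rain" in history[i - d]) else 0.2
    marginal = {"rain": p_rain, "traffic_jam": p_jam}
    p = 1.0
    for event in V:  # assume independence across events
        p *= marginal[event]
    return max(p, 1e-9)  # avoid log(0)

def g(V, d):
    """'How much we care' score: only predicting traffic jams one step ahead matters."""
    return 1.0 if (V == frozenset({"traffic_jam"}) and d == 1) else 0.0

def quality(timeline):
    """Q(T): average over timesteps, summing over delays and subsets of events."""
    N = len(timeline)
    total = 0.0
    for i in range(1, N):            # timesteps with at least one step of history
        E_i = timeline[i]
        for d in range(1, i + 1):    # delays back to timestep 0
            for V in powerset(EVENTS):
                sign = 1 if V <= E_i else -1  # 2 * indicator - 1
                total += sign * g(V, d) * math.log(theory_prob(V, i, d, timeline))
    return total / N

# A made-up observed timeline: the set of events that actually happened at each timestep.
timeline = [frozenset(), frozenset({"rain"}), frozenset({"traffic_jam"}), frozenset()]
print(quality(timeline))  # higher (less negative) is better
```

Under the weighting function above, the sum collapses to a single term per timestep: the log-probability the theory assigned, one step ahead, to a traffic jam, rewarded when a jam actually happened and penalized when it did not.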
Back to normal text
If you skipped the above math, it can be summarized as follows:
A good theory gives information about future events
What's the point of all this math? The definition may seem needlessly complicated for the simple ideas it tries to express, but I wish to state it because it illuminates two key ideas I want to develop in the rest of this post:
The problem with Popperian logic is that the platonic ideal of a theory is not to define a relationship P → Q but to define a relationship P ←→ Q. Though showing P → Q is often all we can aspire to, the best theories not only predict when something will happen but also when it won't, or at least give us information about both (see the short sketch just after this list).
The Lakatosian defense is not a one-way street, but a battleground for refining a theory to give more information. One can either perform a Lakatosian defense (adding implicit assumptions to the theory to account for places where it doesn't work) or what I will dub the Lakatosian offense (removing potential assumptions through additional experiments).
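To spell out the first point in the probabilistic language used above (a rough sketch of my own, not anything from Popper or Meehl): a deterministic implication pins down only one conditional probability, while the biconditional pins down both.

$$P \rightarrow Q:\quad \Pr(Q \mid P) = 1, \quad \Pr(Q \mid \neg P) \text{ unconstrained}$$

$$P \leftrightarrow Q:\quad \Pr(Q \mid P) = 1, \quad \Pr(Q \mid \neg P) = 0$$

A theory that only establishes the implication tells us nothing when $P$ is absent; the biconditional gives information about both occurrence and non-occurrence, which is exactly what the quality measure rewards.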
What Popper got wrong
To summarize the great work of a polymath scientist: the main idea of Karl Popper's Conjectures and Refutations is that what distinguishes "pseudoscience" from "science" is that theories in pseudoscience can always be true no matter the evidence, while theories in science can be proven false. A psychoanalyst can explain everything, but a physicist can get a negative result that invalidates their theory. What makes a scientific theory great, then, is that it proposes falsifiable results which are then confirmed by later experiments. It tells you something you don't already know about the future, something you can test[4]. What Popper opposes is inductive reasoning: a good theory cannot be built just by explaining existing observations. It must make an unexpected prediction about the future, and that prediction must then be confirmed.
The standard critique of Popper, which Meehl highlights, is that falsification is not the only way we learn about the world. When Fleming discovered penicillin by staring at a petri dish that should have been covered in bacteria but was not, he was performing inductive reasoning. Popper responded in his time with the notion that all good reasoning is implicitly falsification. Who is right?

When I first heard this critique, I sided with Popper. Fleming had a theory about the world (bacteria colonize petri dishes left outside) that turned out to be false. What convinced me that I was wrong was realizing that, while Fleming did find an assumption about the world to be false, he learned about the world without any experiment that set out to falsify a theory. His model of the world, and his theories about it, improved before a single one of the many experiments that led to the isolation of penicillin was performed.[5]
The reason that Meehl is correct is the point I made above: the platonic ideal for a theory that explains an event we care about is one that explains not just when it happens but also when it doesn't. Implication is not the gold standard of theory. Perfect information is. Going back to the definition of quality, the highest-quality theory is one that minimizes both the false negative and false positive rates for the events we care about. Inductive logic can improve our understanding by decreasing false negatives, while deductive logic can remove false positives.
The reason that "damn strange coincidences", "unlikely experiments", and the like are so important for proving the quality of a theory is that the more surprising a result, the more information it gives us.
The only reason that causal tests (such as controlled experiments) are more valuable than observations is that causal experiments allow for better control of confounders, and thus more refined predictions. I will expand on this idea more in my next section, about the...
Lakatosian Offense
One of Meehl's big ideas is that of the Lakatosian defense. In the Popperian model, we iterate on a theory through attempted falsification until we manage to break it. If a theory survives falsification, it becomes more believable, and if it does not, it is replaced with new theories.
In reality, this is not the case. Often when finding contradictory evidence we don't reject our theories, but refine them. The Lakatosian defense is an analysis of that process. Ben Recht has a short summary here, with the original paper being here. In short, we divide the theory into one explicit (also known as the core) part and several implicit parts, including assumptions you made when creating the experiment and the particulars of experimental conditions. In practice, when a theory which we think of as good fails, we search for failure in the assumptions and experimental conditions first. If this fails, we try to create a weaker but more accurate theory that requires more assumptions or more requirements for the experimental conditions.
The example Meehl gives is that one group of researchers found that rats could solve fairly complex mazes, while another group did not. Through an adversarial collaboration between the two groups, they realized that how the rats were treated had a huge impact on their resulting ability. It turns out that well-loved rats were far more likely to solve the maze.
The point I want to make is that this street goes both ways. When we have weak initial results, we are often unable to eliminate confounders, and the experimental assumptions tend to be generous. Through RCTs, meta-analyses, and replications, we are able to perform the Lakatosian offense: removing assumptions from the theory through further experiments. Let's think about this in the context of dying.
Example theories: Why do Americans die younger than citizens of other high income countries?
It is well established that Americans live shorter lives than our peers in other high-income countries, despite the US spending more on health care, both per capita and as a percentage of GDP, than those countries do. There are many reasons people believe this is the case:
The American healthcare system
High rates of obesity
Low rates of physical activity
Drug use (in the past smoking, now fentanyl)
Vehicle accidents
Gun deaths
And more!
How much each of these factors contributes to the problem is a vital question for American public policy. Ideally, we need a working model of how each of these factors contributes to population lifespan, so that downstream debates can work off of it. One attempt to do so was "Explaining Divergent Levels of Longevity in High-Income Countries", a report by the National Research Council in 2011. The report examines the effects of obesity, physical activity, smoking, social networks, health care, hormone therapy (for women undergoing menopause), and inequality. Unfortunately, it's very hard to put numbers to the health effects of each of these factors, but the report goes through the various literatures and advances a few potential explanations:
All of these items can increase all-cause mortality. For some areas, however, the literature fails to find a significant difference between the US and other wealthy countries. For example, while poor social networks and high wealth inequality lead to higher mortality, the US did not appear, in the report's analysis of the literature, to have a measurably more socially isolated culture, and its wealthier and more educated populace still had lower life expectancy across education brackets. Hormone therapy for postmenopausal women was also an ineffective explanation.
Smoking and obesity are two of the most well-studied phenomena. In the mid-1900s Americans smoked a lot relative to other countries, and America is famously overweight. American women were especially heavy smokers, and the report estimates that this smoking explains a large part (over a year) of the health gap for American women. Obesity likely accounted for roughly another third of the gap (just under a year), varying from 20-35% depending on measurement.
The remaining factors (health care systems, physical activity, racial/geographic inequality) probably have significant effects, but we lack meaningful data to make more concrete statements.
All statements about causes of death at a national population level should come with a sea's worth of salt, because these things are incredibly hard to estimate and the literature disagrees constantly.
There are multiple theories one might believe in after reading this report. Here are some:
Roughly half of the 2011 life expectancy gap was due to obesity and smoking, with the rest explained by differences in physical activity, geographic/racial inequities, and health care systems; smoking was especially important for women. This theory makes no predictions about the future, and thus is of little value.
Excepting changes in {physical activity, geographic/racial inequities, health care systems}, we can predict changes in life expectancy from past smoking behavior and obesity rates. From this model, we can expect the health gap to continue to grow as the health effects of rising obesity materialize. Because lung cancer is mostly caused by smoking, and relatively more women smoked in America than in other countries in the past, all else being equal we expect rates of lung cancer to decrease and the female-male survival gap to expand relative to other countries. This theory makes testable predictions about the future, and is potentially of value.
How has this second prediction held up over the past 13 years? Quite well! The overall gap in mortality between America and other wealthy countries only increased (partially due to Covid), and the female-male life expectancy gap increased by a year. Lung cancer rates decreased drastically. (If you want to calculate how well calibrated various predictions of these relationships were, be my guest). Note that none of this is a falsifiable experiment, but it still corroborates the theory.
Understanding Lakatosian Offense/Defense through imaginary Ozempic worlds.
What would a Lakatosian defense of the theory look like? Imagine a world where it turns out that obesity was only a good proxy for longevity effects because it correlated with decreased physical exertion, which was far more important. Drugs like Ozempic break this relationship in this hypothetical world, and obesity rates cease to correlate strongly with life expectancy. A Lakatosian defense would be to say that the theory is true assuming Ozempic is not available.
Note that we can also reject the theory outright, and say that an alternate theory (say, that physical exertion and smoking rates predict longevity) is better. Which we would choose would depend on which we believed better predicted the future. In this world obesity data might still make better predictions because it is easier to collect than data for physical activity.
Alternatively, say that Ozempic improves health outcomes so much that geographic disparities vanish[6]. Geographic disparities are already highly correlated with obesity, and perhaps the other positive health benefits attributed to the drug shrink the gap until it becomes negligible. A Lakatosian offense would be to say that the theory is true regardless of changes to geographic inequity.
Conclusion
This post took far longer and became far longer than anticipated, so let me try to wrap it up quickly and explain what else I want to do in this series. There are three core ideas in this post:
The Verisimilitude of a theory is how well it predicts events we care about at a time horizon we care about given available information.
Popper's demand for falsification fails because the goal of a theory is not pure logical deduction but perfect information.
In the refinement of theories, we use the Lakatosian defense and offense to edit the non-core parts of theories.
The high-level goal of this series is to describe, through these three ideas, how one forms, refines, and measures 'causal' theories of the world with potentially non-causal experiments. To that end, two future posts are planned.
The first is a discussion of measuring how well the world conforms to our theories, comparing Meehl's notion of the Spielraum with the information-theoretic definition seen above.
The second will be a discussion (and refinement) of two "theories" that I have a lot of experience with: the theory of scaling laws in machine learning and the theory of solving climate change through clean energy abundance.
See you then!
1. We use i and not t to avoid confusion with theories.
2. A probability space is just a tuple of (a) all indivisible events, (b) the events over which we define a probability, and (c) the function we use to measure probability.
3. ALWAYS WANTED TO SAY THAT
4. Note that science is much more than the development of scientific theory. My favorite paper from 2022, Cramming, developed no great theory about language models. It merely carefully tested many configurations for how one might train a small LM and did a thorough analysis of the results. It helped validate certain theories (e.g. the superiority of GLU layers in transformer architectures across scales) but was not focused on developing one.
5. The other problem with Popperian logic is that falsification does not mean the death of a theory. See the discovery of Neptune due to perturbations in the motion of Uranus.
6. Unicorns also exist in this world.