Recently, Nature Genetics published a paper by Polderman et al. entitled “Meta-analysis of the heritability of human traits based on fifty years of twin studies.” It has already garnered a lot of attention: blog posts and various news sites are trumpeting its conclusions and putting their own spins on the results. For reasons I’ll briefly discuss below, I found the paper to be very strange, but my main purpose in this post isn’t to criticize it. Rather, I wish to briefly explain, in broad terms, what “heritability” actually means, and perhaps more importantly, what it doesn’t mean. The twin studies analyzed by Polderman et al. attempted to estimate the heritability of various human traits. While one can, for reasons I’ll note briefly below, raise concerns about how accurately heritability can be estimated from twin studies in humans, the more important point is that heritability doesn’t measure what most people think it measures. Twin studies may be answering some questions, but they aren’t answering, and can’t answer, the questions that many people think they are answering.
I. An odd study
Before getting into heritability, the heart of the matter, I want to reflect briefly on a few features that make the new study somewhat puzzling. The first features I focus on here are related to the study’s design: a meta-analysis of twin studies. The second set of features is a bit more general.
A. What are meta-analyses for?
Meta-analyses, as the term implies, analyze already published studies: they combine the results of a number of individual studies focused on similar experimental outcomes in order to generate some overall measure of the results. There are two sorts of situations in which researchers regularly run meta-analyses: a) where previous studies on the issue at hand have reached conflicting conclusions, and b) where previous studies record results that fail to reach statistical significance (or are just barely significant). In both sorts of cases, the hope is that an analysis of all the studies published so far will generate results that are more robust than any individual study.
In the case of the Polderman et al. paper, however, neither purpose applies. It is well established that twin studies estimating heritability of traits in humans generally find a substantial heritable component for essentially all traits measured, and the statistical significance of these findings is not in doubt (although the methodology that generated those findings, and therefore the validity of the findings themselves, is certainly questionable). Performing a large-scale meta-analysis of this sort (over 2500 studies) is a herculean effort, but in this case, the reason why it was done is not explained.
B. What is the point of the conclusion?
One finding that the authors chose to mention in their abstract is that “across all traits the reported heritability is 49%.” This fact got picked up by many of the aforementioned blogs and media sites (“Nature versus Nurture a draw” is a typical headline; this, for reasons discussed below, is wrong on a number of levels). But the heritability of the least heritable traits measured was only around 5%, and the heritability of the most heritable traits over 80%. What information is gained by averaging the heritabilities of traits as disparate as “adult height,” “structure of the eye-ball,” “cognition,” and “social values”?
For comparison, imagine a similar meta-analysis of drug effectiveness that reported the average effectiveness of every randomized clinical trial of a drug over the past 50 years. Of what possible use would it be to know the average effectiveness of various drugs, designed to treat different diseases or disorders? All the different trials share is a vague area of interest (“drug effectiveness”) and a basic methodology (“randomized clinical trials”). That isn’t enough to support the coherence of a meta-analysis.
Certainly, no one is suggesting that the same genes underwrite the heritability of these different traits, nor that these wildly different traits are linked in any interesting ways. While the authors found interesting ways to visualize the data they gathered, for what purpose, in the end, the data was gathered and analyzed remains rather opaque.
C. What gets studied
It would be a mistake to suppose, as many readers of the review have, that about half of the variation in all human traits is genetic (a misreading the authors are at least in part responsible for). There are a number of human traits that are not studied in this way, because the results of such a study are a foregone conclusion. Certainly, the language you speak is a “trait,” and certainly it matters to who you are. But your primary language is determined more or less entirely by the household you grow up in. Once we limit our population to those who grew up speaking a particular language within a particular society, there is no doubt room for heritability to emerge in the acquisition of second languages, for example. But no one would be foolish enough to try to find the heritability of speaking, say, Japanese, taking a world-wide population as the relevant one. Similarly, while some food preferences are no doubt heritable, the cuisine you grow up with, again, surely plays a fundamental role in what foods you regard as “normal,” what you eat regularly, etc.; and the foods we regard as normal and eat regularly are surely important traits.
The traits chosen for study matter, as do the traits not chosen. All one can say of a meta-analysis of all the studies done is that, if done well, it really does include all the studies done. The studies not done, of course, can’t be part of the analysis. But there are good reasons to think that the results for an arbitrarily chosen “trait” may not much resemble the results for the traits that have historically been selected for study via these methods.
D. The “missing” heritability problem
The authors mention the “missing” heritability problem, but do not appear to take it seriously. For many traits studied, especially psychological traits, large-scale searches for the genes associated with the heritable variation have found at best only a very few genes, each with a tiny effect size, that together explain almost none of the variation that is supposed to be associated with genetic differences. This is in contrast to studies on (some) straightforward physical traits, like height, where genes associated with significantly more of the variation that is supposed to be associated with genetic differences have been found (but still, in no case has all, or even more than a bare majority, of the variation supposed to be associated with genetic differences been accounted for).
As these searches for candidate genes (GWAS – Genome Wide Association Studies) become more powerful, the maximum plausible effect size of each of the genes not found goes down; assuming that the genes interact additively – that is, that the effect of each gene on the trait in question is broadly independent of the other genes’ influences – the number of genes necessary to explain the variance goes up. For some heavily studied traits, the minimum number of genes necessary to explain the variance associated with genetic differences, assuming that the heritability estimates are accurate, is now approaching the thousands. It is, to say the least, somewhat implausible that hundreds of human psychological traits are each influenced by thousands of independent genes, each with a tiny additive effect.
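The arithmetic behind that squeeze can be sketched in a few lines. This is only an illustration under the additive assumption described above: the function name and the specific caps are hypothetical, not figures from any GWAS. If no single undetected variant can explain more than some small share of the trait variance, the heritable share divided by that cap gives a lower bound on how many variants are needed.

```python
import math


def min_additive_variants(heritable_share: float, max_share_per_variant: float) -> int:
    """Fewest independent additive variants needed to account for
    `heritable_share` of the trait variance when no single variant
    explains more than `max_share_per_variant` of it."""
    if not 0 < max_share_per_variant <= 1:
        raise ValueError("max_share_per_variant must be in (0, 1]")
    if not 0 <= heritable_share <= 1:
        raise ValueError("heritable_share must be in [0, 1]")
    # Variances of independent additive effects simply add, so the count
    # is the ratio rounded up. The small tolerance guards against
    # floating-point noise in the division.
    return math.ceil(heritable_share / max_share_per_variant - 1e-9)


# Hypothetical caps: each more powerful search rules out larger effects.
for cap in (0.01, 0.001, 0.0002):
    needed = min_additive_variants(0.5, cap)
    print(f"cap per variant {cap:.4f} -> at least {needed} variants")
```

As the cap shrinks, the lower bound grows in inverse proportion, which is the pattern the paragraph above describes.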
Nevertheless, the authors confidently assert that for about 2/3 of traits studied, the pattern of variation is “consistent” with additive genetic variation, and that this “implies that, for the majority of complex traits, causal genetic variants can be detected using a simple additive genetic model.” And yet, causal genetic variants have not been detected using such models; serious searches using such models have failed, time and time again, to find causal genetic variants associated with more than a trivial fraction of the total heritability!
In “The mystery of missing heritability,” a paper that the authors of the meta-analysis in fact cite, the authors (Zuk, Hechter, Sunyaev and Lander) argue that heritability estimates may be inflated by interaction effects, and that the necessary sample sizes to detect these interactions are unreasonably high (so high that even a meta-analysis of the sort under discussion now would not detect the effects that they are suggesting might be responsible for “phantom” heritability). Now, Zuk and his co-authors may well be wrong – their suggestion might prove to be unworkable, or it may simply turn out to be the case that the world really is very strange and most of the missing heritability really is due to thousands of genes with tiny additive effects. But it is at least rather odd to have cited Zuk et al. and to have utterly ignored the most important conclusion that they drew!
II. Heritability: an overview
Heritability is a technical notion, and the implications of the way that it is defined are sometimes counterintuitive. Put simply, heritability is the proportion of the phenotypic variation in a trait of interest, measured in a given studied population and in a given environment, that is statistically co-varying with genetic differences (however measured) among individuals in the same population.
Note first that heritability is not a measure of “how genetic” a trait is. For heritability to make any sense at all as a statistic, the trait in question must vary in the population in question. So for humans, the heritability of “head number” is undefined: there is (almost) no variation in head number for living humans (there are vanishingly few conjoined twins that may count as exceptions; note that two-headedness is rather more common, albeit still rare, in snakes, for example). Similarly, the ability to speak Jarawan among the Jarawan population also has an undefined heritability, because virtually all Jarawans speak the language. Since heritability is a measure of what is associated with variation in the trait, and not a measure of what causes the trait, the heritability of finger number in humans is essentially zero, and the vast majority of variation in finger number is environmental (traumatic amputations are the primary cause!).
Heritability can be calculated, in principle, by partitioning the variance (the average squared deviation from the mean value) of the trait in question in the following way:
$$V_P = V_G + V_E + V_{G \times E} + V_e$$
where $V_P$ is the total variance in the population, $V_G$ is the variance associated with genetic differences in the population, $V_E$ is the variance associated with shared environmental variation within the population, $V_{G \times E}$ is the variance associated with gene by environment interactions, and $V_e$ is everything else, which includes the variance associated with so-called “unique” environments, developmental ‘noise,’ independent epigenetic effects, and errors. $V_G$ is generally broken down into two parts: the variance associated with additive genetic variation ($V_{G\text{-additive}}$), and the variance associated with non-additive genetic variation ($V_{G\text{-nonadditive}}$, sometimes written as $V_{G \times G}$ to suggest that it is associated with gene-gene interactions).
Given these definitions, Broad-Sense Heritability, $H^2$, is $V_G / V_P$ (the variance associated with genetic differences divided by the total population variance) and Narrow-Sense Heritability, $h^2$, is $V_{G\text{-additive}} / V_P$ (the variance associated only with the genetic differences that are statistically related to genes that behave in an additive manner, divided by the total population variance).
[Further, “C” is defined as the proportion of the variance associated with shared environments: $V_E / V_P$. “E,” somewhat confusingly, is defined as the proportion of the variance associated with non-shared environments: $V_e / V_P$.]
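To make the definitions concrete, here is a minimal sketch that turns a set of variance components into the four ratios just defined. The class name, field names, and the numbers are hypothetical and chosen only so that the arithmetic is easy to follow; in practice the components are not directly observed and have to be estimated.

```python
from dataclasses import dataclass


@dataclass
class VarianceComponents:
    """Variance components of a trait (all values are illustrative)."""
    additive: float      # variance tied to additive genetic variation
    nonadditive: float   # variance tied to gene-gene interactions
    shared_env: float    # variance tied to shared environments (V_E)
    gxe: float           # variance tied to gene-by-environment interactions
    residual: float      # everything else, including "unique" environments (V_e)

    @property
    def genetic(self) -> float:
        # V_G is the sum of its additive and non-additive parts.
        return self.additive + self.nonadditive

    @property
    def total(self) -> float:
        # V_P is the sum of all components.
        return self.genetic + self.shared_env + self.gxe + self.residual

    def broad_sense(self) -> float:
        return self.genetic / self.total       # H^2 = V_G / V_P

    def narrow_sense(self) -> float:
        return self.additive / self.total      # h^2 = V_G-additive / V_P

    def shared_env_share(self) -> float:
        return self.shared_env / self.total    # "C" = V_E / V_P

    def nonshared_share(self) -> float:
        return self.residual / self.total      # "E" = V_e / V_P


vc = VarianceComponents(additive=40, nonadditive=10, shared_env=20, gxe=5, residual=25)
print(vc.broad_sense(), vc.narrow_sense(), vc.shared_env_share(), vc.nonshared_share())
```

With these made-up components the total variance is 100, so the broad-sense figure is one half and the narrow-sense figure is smaller, because the non-additive genetic variance counts toward the former but not the latter.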
Narrow-Sense Heritability is a fantastically important measure, if one is interested in breeding for a particular trait, or in the likely response of a trait to selection more generally. But the coarser Broad-Sense Heritability is the measure usually associated with studies of heritability in humans.
III. Heritability and twin studies
In order to estimate heritability, one needs to be able to distinguish variation associated with environmental differences from variation associated with genetic differences. For non-human animals raised under controlled experimental conditions, this can be achieved by distributing organisms with known genetic variations into known developmental environments. But for humans, people that share similar genes (e.g., family members) also tend to share similar environments.
The basic trick exploited by the sorts of twin-studies reviewed in Polderman et al., then, is to compare monozygotic (aka “identical”) twins (“MZ twins”) to dizygotic (aka “fraternal”) same-sex twins (“DZ twins”). The assumption is that the family environment experienced by MZ and DZ twins will be relevantly similar, and therefore that the difference between how similar MZ and DZ twins are can be used to estimate how heritable the trait is.
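In its simplest textbook form, the comparison reduces to a little arithmetic on twin correlations. The sketch below uses what is often called Falconer's formula; that name, the companion formulas for the shared and non-shared environment shares, and the correlations plugged in are not taken from Polderman et al. and are assumptions added here for illustration. It also presumes purely additive genetic effects and equally similar environments for MZ and DZ pairs.

```python
def falconer_estimates(r_mz: float, r_dz: float) -> tuple[float, float, float]:
    """Classical twin-study estimates from trait correlations.

    r_mz: correlation between MZ co-twins
    r_dz: correlation between same-sex DZ co-twins
    Returns (heritability, shared-environment share, non-shared share).
    """
    heritability = 2 * (r_mz - r_dz)   # double the MZ-DZ difference
    shared_env = 2 * r_dz - r_mz       # similarity not explained by genes
    nonshared = 1 - r_mz               # what makes even MZ twins differ
    return heritability, shared_env, nonshared


# Hypothetical correlations, not values from any study.
h2, c2, e2 = falconer_estimates(r_mz=0.8, r_dz=0.5)
print(f"heritability={h2:.2f} shared={c2:.2f} non-shared={e2:.2f}")
```

The three shares always sum to one by construction, so any violation of the underlying assumptions simply gets absorbed into one of them rather than flagged.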
There are a number of quibbles one can raise regarding the reliability of the estimates generated this way. MZ twins usually share a placenta; DZ twins never do. It is not clear that this makes a difference, but it might. MZ twins are sometimes, and perhaps usually, treated differently than DZ twins – dressed more similarly, assumed by both their parents and the other people that they meet to be more similar in a variety of ways, etc., which may also have a measurable effect.
Aside from these issues, there are other concerns regarding the proper interpretation of twin-studies. Should we regard the estimate of heritability (derived from doubling the difference between how similar MZ twins are to each other and how similar DZ twins are to each other) as an estimate of the broad-sense heritability, the narrow-sense heritability, something between the two, or something else entirely? Should we expect DZ twins to share none of the gene-gene interactions that influence the trait, or some fraction of those shared by MZ twins? It is sometimes said that DZ twins share half the genes shared by MZ twins; this is true in a sense, but it is of course also possible (in fact, it seems rather likely) that the alternative alleles available from each parent are not representative of the alleles available in the population as a whole. Again, whether these kinds of issues make a difference is fiercely debated.
The upshot of this is that people who think that twin-studies have serious limitations compared to the methods used in research on non-human animals and plants will be unimpressed with a meta-analysis of twin-studies. Several thousand studies, all with the exact same methodological problems and limitations, are no better than one such study.
IV. Heritability as a local measure
Heritability is, famously, a local measure; the heritability of a trait is relative to a particular population and a particular range of developmental environments. Recall that heritability is the fraction of the variance in a trait associated with genetic variation. So anything that changes either the total variance in a trait or the fraction of that variance associated with genetic variation will change the heritability.
One way of increasing heritability is to reduce the variation in the environment for that population; at the extreme, a trait that is mildly heritable under ‘ordinary’ conditions can be made entirely heritable by reducing the environmental variation to near zero. If the environment doesn’t vary in the relevant way, all phenotypic variation associated with environmental variation will be eliminated, and the only variation left will be that associated with genetic differences.
Similarly, increasing the range of environments considered will often reduce heritability; if environments are added that are associated with differences in the trait, the heritability will decrease, as the overall variance is increased.
Changes in heritability due to changes in the population considered are also possible; if we reduce the amount of genetic variation in a population, the variation associated with the genetic component will decrease, and more of the variation left will be associated with whatever environmental variation exists. Similarly, if we increase the genetic variation that is associated with differences in the trait, the total variation will increase, and the total fraction associated with genetic variation will also increase.
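A toy calculation shows how these shifts play out. It is a deliberately simplified sketch with only two components, genetic and environmental variance, and with made-up values; interaction and residual terms are ignored.

```python
def heritability(v_genetic: float, v_environment: float) -> float:
    """Share of total variance associated with genetic variation."""
    total = v_genetic + v_environment
    if total == 0:
        # With no variation in the trait at all, the ratio is undefined.
        raise ValueError("heritability is undefined when the trait does not vary")
    return v_genetic / total


# Shrinking the environmental variance pushes heritability toward 1.
for v_env in (90.0, 40.0, 10.0, 0.0):
    print(f"V_G=60, V_env={v_env:>4}: heritability={heritability(60.0, v_env):.2f}")

# Shrinking the genetic variance in the population pulls it back down.
for v_gen in (60.0, 30.0, 5.0):
    print(f"V_G={v_gen:>4}, V_env=40: heritability={heritability(v_gen, 40.0):.2f}")
```

Nothing about how the trait develops changes between these rows; only the amount of variation present in the population or its environments does.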
Polderman et al.’s meta-analysis reports a heritability for “cognition” of around .57; consider the narrower part of this category, “performance on standard IQ tests.” Assume, for the sake of argument, that the heritability of performance on standard IQ tests in contemporary U.S. populations is really around .6. (Contemporary estimates range from around .3 to around .8, give or take; there are good reasons to be suspicious of these estimates, some of which are mentioned below. But this is merely meant to be an illustrative example.) IQ tests are designed to yield a mean of 100 and a standard deviation of 15. The total population variance is therefore 225, of which 135 (60%) is associated with genetic variation.
But it is well established that IQ test-taking performance increased steadily over time in many countries, the U.S. included. Reasonable estimates put the average IQ in the 1940s, as measured by 1990s tests, at perhaps 70 or 80. If we take 80 as our estimate, the adjusted standard deviation would be around 12 (note that there are complications here – scores did not improve across all tests equally, nor across all segments of the population equally; leave concerns about these issues aside for the moment – again, this is just an example).
Assume, again for the sake of argument, that the heritability of IQ test-taking performance was .6 in 1940 as well (our evidence for this claim is substantially weaker than for the same claim in the 1990s, and even there, as noted below, the number is at best an odd average of many different measured heritabilities in different subpopulations within the US, but again, this is meant to be an illustrative example – please just play along!).
What would happen if we combined an equal number of individuals from our 1940 population with our 1990 population, and thought of them as a single larger population? Using 1990s tests as our standard, the first thing to note is that we no longer have a normal curve, but a curve with two ‘humps’ (one at 80 and one at 100). The second thing to note is that the standard deviation has increased (to around 16.9: the pooled variance is the average of the two within-cohort variances, (225 + 144)/2 = 184.5, plus the variance between the cohort means, 100, for a total of 284.5). The third thing to note is that there is a new ‘shared environment’ factor that is responsible for a significant portion of the variance: whether the person is from the 1940s or the 1990s explains a significant chunk of the variance in our total population!
In fact, the shared environment associated with the year in this example will be associated with about 35% of the total variance: $\tfrac{1}{2}\left[(100-90)^2 + (80-90)^2\right] / 284.5 = 100 / 284.5 \approx 0.35$. Once the variance associated with year is partitioned out, 60% of the remaining variance will be associated with genetic variation; heritability has therefore been reduced to around .39: $0.6 \times (284.5 - 100) / 284.5 \approx 0.39$.
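The pooling calculation can be checked with a short script. It is a sketch under the example's own assumptions: equal-sized cohorts, the same within-cohort heritability in each, and a between-cohort difference that is entirely environmental. The function and key names are hypothetical.

```python
def pooled_heritability(groups: list[tuple[float, float]], h2_within: float) -> dict[str, float]:
    """Pool equal-sized groups, each given as (mean, standard deviation).

    Assumes the same within-group heritability everywhere and treats the
    difference between group means as purely environmental.
    """
    n = len(groups)
    grand_mean = sum(mean for mean, _ in groups) / n
    # Average of the within-group variances.
    within = sum(sd ** 2 for _, sd in groups) / n
    # Variance of the group means around the grand mean.
    between = sum((mean - grand_mean) ** 2 for mean, _ in groups) / n
    total = within + between
    return {
        "sd": total ** 0.5,
        "between_share": between / total,
        "heritability": h2_within * within / total,
    }


# 1990 cohort: mean 100, SD 15; 1940 cohort on 1990s norms: mean 80, SD 12.
result = pooled_heritability([(100.0, 15.0), (80.0, 12.0)], h2_within=0.6)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The output lands close to the figures quoted above: a pooled standard deviation a little under 17, roughly a third of the variance tied to cohort, and a heritability just under .4, even though nothing about either cohort has changed.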
So what, in the end, is the “actual” heritability of IQ? The question makes no sense; heritability, as a measure, is always and only relative to particular populations at particular times in particular places. This problem is not merely hypothetical. Turkheimer et al. (2003) found that in the US populations that were studied, in relatively poor families, most of the variation in IQ test-taking ability (about 60%) was associated with the shared familial environment, and almost none with genetic variation (the rest was associated with “unique” environments); in relatively affluent families, the reverse held, with most of the variation in IQ test-taking ability (70%) being associated with genetic variation, and almost none associated with shared familial environment (these findings have been relatively robust in the US context). What, then, is the heritability of IQ in these populations in the U.S.? Should we take the average? Should we adjust for the frequency of the tested SES’s in the population? Does the question even make sense?
Another, slightly more fanciful example, adapted very loosely from one of James Flynn’s, may drive the point further home.
Imagine a population in which no one plays basketball. It isn’t a sport anyone is familiar with. Imagine that I then test every young adult in this population for basketball playing ability (after explaining the rules, etc.). Much of the variation will likely be broadly genetic (e.g., height will make a huge difference, and within populations today, variation in height is mostly associated with genetic variation; note as well that between populations, however, variations in height can be largely environmental). If I compare MZ to DZ twins, I’ll very likely find that the heritability of basketball playing ability is quite high in this population.
Now imagine that I take, randomly, half the kids from that population, and train them extensively in basketball, whether they like it or not (note that this would be very odd, and also rather mean – forcing people with no particular interest in or talent for basketball to practice for hours a day, to do all sorts of sport-specific strength training, etc.). If I now consider them part of the same population, and measure the heritability of basketball playing ability across the population, heritability will be very low – the vast majority of the difference in ability will come down to whether the people were in the highly trained group or not. (Note that within each sub-group – within the trained and within the untrained – there will very likely be genetic variation associated with differences in abilities, but when we consider the overall population, the differences between the trained and the untrained subgroups will swamp everything else.)
Now, what about a society that cares enough about basketball that to be ‘good’ (good enough to play on a school team, etc.) you have to be really good, because there is a lot of competition, because lots of people try out? Everyone plays a little when they are very young, because it is an important sport that everyone is interested in. And when young, most of the small differences in ability will be down to odd little differences, some of which will likely be heritable within that population: differences in body type, reaction speed, perhaps interest! But then, small differences in abilities and interest will get magnified. Those who start out not very good and not very interested are unlikely to pursue it much, very unlikely to get specialized training, etc. Those who start out with some ‘natural’ talent and some interest are likely to be recognized, rewarded, and eventually highly trained.
So, in such a society, any small differences in ability and interest that are related to genetic variation will be greatly magnified. But here is the wrinkle. Any of those differences that are related, however distantly, to those early differences in heritable traits, will show up in an analysis as “genetic.” MZ twins will tend to share the same training regime (or lack thereof) rather more often than DZ twins, because they will tend, more often, to share those small variations that make them either more or less likely than average to pursue basketball. But on one plausible view, what’s doing most of the work in creating differences in abilities is training and practice – not ‘genes’! The trait will be highly heritable, but differences in ability will be mostly down to environments – environments selected (in part self-selected, and in part imposed by others) at least in part because of genetic endowments.
So, in this population, is basketball playing ability mostly genetic, or mostly environmental? The question makes no sense – or rather, depending on how one interprets it, one can defend either answer, or neither, equally well.
It is worth, at this point, stressing another oddity of twin studies. When “shared environment” is spoken of in twin-studies, it means explicitly family environment. Other environments are not considered. So a twin-study would either identify the training regime (or lack of it) noted above as genetic – part of heritability – or if a portion of it was not associated with any genetic differences, it would be chalked up to “unique” environmental effects. But there is nothing “unique,” in the usual sense of the word, about basketball training, either in our hypothetical example or in the real world. Unique simply means, in this context, not part of the familial environment that is shared equally by MZ and DZ twins – nothing more.
V. The Norm of Reaction: another approach to variation
Theodosius Dobzhansky, one of the founders of modern genetics, argued that the correct way of thinking about the relationship between an organism’s genotype and its phenotype was through the lens of the reaction norm. The partial reaction norm for a particular trait and a particular genotype is the way that that trait develops, given that genotype, over a defined range of environments.
The following example is from a paper by Pigliucci and Marlow. The graph shows the reaction norms for bolting time as a function of exposure to increasing lengths of the growing season in 16 populations of Arabidopsis thaliana (a small plant, in the mustard family, that has become one of the model organisms used in biological research).
Note that some populations have average genotypes that are broadly unresponsive to season length, and in others, season length matters enormously to bolting time. Note that some populations have a shorter bolting time than others given one season length, but a longer time than others given a different season length.
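The crossing pattern is easy to reproduce with a toy model. The sketch below is purely illustrative: the two straight-line reaction norms, their parameters, and the arbitrary units are invented for the purpose and are not drawn from the Pigliucci and Marlow data, and real reaction norms need not be linear.

```python
from typing import Callable


def linear_reaction_norm(intercept: float, slope: float) -> Callable[[float], float]:
    """Return a function mapping an environmental value to a phenotype."""
    def phenotype(environment: float) -> float:
        return intercept + slope * environment
    return phenotype


# Two hypothetical genotypes: one ignores season length, one responds strongly.
unresponsive = linear_reaction_norm(intercept=30.0, slope=0.0)
responsive = linear_reaction_norm(intercept=50.0, slope=-2.0)

for season_length in (5.0, 10.0, 15.0):
    flat_time = unresponsive(season_length)
    steep_time = responsive(season_length)
    if flat_time < steep_time:
        earlier = "unresponsive genotype bolts earlier"
    elif steep_time < flat_time:
        earlier = "responsive genotype bolts earlier"
    else:
        earlier = "both bolt at the same time"
    print(f"season length {season_length:>4}: {flat_time:.0f} vs {steep_time:.0f} -> {earlier}")
```

Which genotype counts as the "early" one depends entirely on the environment in which the comparison is made, so a ranking of genotypes measured in one environment says nothing reliable about their ranking in another.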
Genotype doesn’t determine the phenotype; rather, we can think of it as determining how the organisms will develop given a particular developmental environment. But of course, while it is possible to compare how one trait develops, given a particular set of genotypes, over a few different environments, it is impossible to determine the complete norm of reaction for even a single genotype (a complete norm of reaction would have an essentially infinite number of dimensions – one for each way in which the environment can vary!).
As Dobzhansky put it:
The norm of reaction of a genotype is at best only incompletely known. Complete knowledge of a norm of reaction would require placing the carriers of a given genotype in all possible environments, and observing the phenotypes that develop. This is a practical impossibility. The existing variety of environments is immense, and new environments are constantly produced. Invention of a new drug, a new diet, a new type of housing, a new educational system, a new political regime introduces new environments. (Evolution, Genetics, and Man 1955 pp. 74-75).
From the perspective of reaction norms, questions about “nature” and “nurture” are ill-formed. Rather, the right questions to ask are more of the form “how does this genotype respond to changes in this environmental variable?” and “how does the response of this gene to changes in the environment depend on or vary with the rest of the organism’s genes?” For some traits, the norm of reaction will be basically flat against most reasonable developmental environments – as noted above, there are essentially no genetic variations that, in any reasonable range of environments, regularly produce a living human with other than a single head. Most human genotypes, in most developmental environments, produce humans with 10 digits on their hands – again, for most developmental environments regularly encountered, a reaction norm plotting the number of fingers against the environmental variation will be flat. For other traits, the trait will develop differently in different environments.
The only way to determine the reaction norm of a trait, given a particular genotype, for a particular range of environments, is to raise genetically identical clones in the variety of different environments in which one is interested. Partial norms of reaction (like the one by Pigliucci and Marlow, above), which look at the response of particular genetic variants averaged against varying genetic backgrounds, need not use clones, but still require that organisms can be sorted into developmental environments based on their genetic endowments. Needless to say, it is (ethically and practically) impossible to generate reaction norms for human traits!
VI. Some final thoughts
So, is it fair to say, as many commentators on the Polderman et al. study have, that the “nature / nurture” debate is over, and that it is about 50/50? Hardly. In part, this is because heritability studies are simply ill-designed to answer ‘nature/nurture’ questions. Is our ability to read and write mostly “nature” or mostly “nurture”? The question, as stated, is too ambiguous to answer. Humans are the only animals we know of that can read and write, and our ability to perform those tasks clearly has something to do with our nature – with the kinds of creatures that we are. But particular environmental contexts are necessary for those skills to develop – “nurture.”
In the end, heritability doesn’t tell us much that we ought to want to know. It doesn’t tell us whether a trait will be easy or hard to change. It doesn’t tell us what developmental resources are necessary for the trait to develop normally, nor how changes in those resources will change the development of the trait in question. It is simply, at best, a snap-shot of how much of the variation in the trait that there happens to be now is statistically associated with the genetic variation there is in this population, in this range of environments, with this particular distribution of genotypes into those environments. This does not make it meaningless, or useless. But it does put severe limits on what can be deduced from it.
Jonathan M. Kaplan is a philosopher at Oregon State University. His main areas of interest are the philosophy of biology and political philosophy.
Lewontin, Richard C. “Annotation: the analysis of variance and the analysis of causes.” American Journal of Human Genetics 26.3 (1974): 400.
Downes, Stephen M., “Heritability,” The Stanford Encyclopedia of Philosophy (Summer 2015 Edition), Edward N. Zalta (ed.).
Meta-analysis of the heritability of human traits based on fifty years of twin studies, by T.J.C. Polderman et al., Nature Genetics, 18 May 2015.
See: Eric Turkheimer, 2000. “Three Laws of Behavior Genetics and What They Mean.” Current Directions in Psychological Science. 9(5): 160-164.
Socioeconomic status modifies heritability of IQ in young children, by E. Turkheimer et al., Psychological Science, November 2003.
The Flynn Effect: Modernity Made Us Smarter, by S. Mirsky, Scientific American, 20 August 2012.
Differentiation for flowering time and phenotypic integration in Arabidopsis thaliana in response to season length and vernalization, by Pigliucci and Marlow, Oecologia (2001) 127:501–508.