Words and Genes

This weekend, people in the United States set off numerous explosive devices to celebrate 239 years of independence from the United Kingdom. Since this separation, the versions of English spoken in the US and the UK have diverged considerably, but still remain (mostly) intelligible. In contrast, North and South Korea, which have only been separated for 70 years, have been more strictly isolated from one another, and as a result the versions of Korean spoken in the two countries have diverged dramatically:

differences [of mutual unintelligibility] now extend to one third of the words spoken on the streets of Seoul and Pyongyang, and up to two thirds in business and official settings.

A friend of mine who is an actual linguist drew my attention to an app that is designed to translate between North and South Korean, an acute problem for people who have defected from North to South.

As Darwin and many others have noted, and as I’ve written about here, such language change bears many striking similarities with biological evolution. These similarities are interesting in their own right, and may be helpful for thinking about the long-running debate in evolutionary biology about whether natural selection acts mainly on genes, individuals, or groups.

A language, like English, or German, or French, is like a biological species. Both languages and species are made up of populations of individuals. Languages and species both have boundaries. In biology, the boundary is sex: a species is defined as a population of individuals that naturally mate and produce fertile offspring with one another. This concept is a pretty good rule of thumb, but turns out to be violated frequently in nature. Oak trees are notoriously promiscuous with oak trees from other species. Of course they mate simply by wafting their sperm into the air (tucked inside pollen grains) so they aren’t the choosiest of breeders. But even among mammals, hybrids frequently occur in nature.

In languages, the boundary is not sex but mutual intelligibility. French is considered a different language from English because, as Steve Martin says, “those French have a different word for everything!” But just as with the biological species concept, this is a useful rule of thumb, rather than an absolute rule. Speakers of closely related languages, like Danish and Swedish, can learn to understand one another with some ease.

Species have subspecies, and languages have dialects. Both are closely related to geography and geographic isolation. Languages like English and Chinese contain “dialects” that may be as mutually unintelligible as pairs of “proper” languages. Because the distinction between a language and a dialect is to some extent political, a common saying among linguists is “A language is a dialect with an army and a navy.”

Similarly, the distinction between species and subspecies in biology is somewhat arbitrary. Baboons, for example, are a widespread group of monkeys, occurring through most of Africa, with one species (Hamadryas baboons) extending their range across the Red Sea into Saudi Arabia and Yemen. Even though baboons are among the most intensively studied nonhuman primates, no one seems to agree about how many different baboon species there are, and what they should be called. Some people consider all baboons to be subspecies of Papio hamadryas. Other people distinguish ten or more species: Guinea baboon, Hamadryas baboon, Heuglin’s baboon, Anubis baboon, “typical” yellow baboons, Ibean yellow baboon, Kinda baboon, grey-footed baboon, Transvaal chacma baboon, and Cape & desert chacma baboon.

Jolly, C. J. (2001). A proper study for mankind: Analogies from the papionin monkeys and their implications for human evolution. Yearbook of Physical Anthropology, Vol 44. C. Ruff. New York, Wiley-Liss, Inc. 44: 177-204.

Interbreeding occurs among all these different kinds of baboons where their ranges overlap, so from the point of view of the traditional biological species concept, they all belong to a single species. But calling them all subspecies of Papio hamadryas seems odd because Hamadryas baboons are the most distinctive baboons of all: the males have showy capes and tufted tails, and their societies have an unusual multi-level structure quite different from the usual troops of “savanna baboons.” Moreover, as more studies are conducted of other baboons, it has become clear that each of these species (or subspecies) differs from others. For example, Guinea baboons turn out to have a social system quite similar to that of Hamadryas baboons.

So languages are similar to species, and dialects are similar to subspecies. These categories refer to populations. Within populations, individuals vary greatly, both in their language use and in their genes.

Each individual speaker of a language has her own set of words and rules: an idiolect. My idiolect may be very similar to yours, or quite different, depending on our shared vocabulary, which may include technical terms specific to our work, and idiosyncratic speech habits (which my wife complains I have in abundance).

An idiolect is similar to an individual’s genome. Each individual is unique, but at the same time, each individual speaker of a language shares a broad set of words and rules with other speakers of that language (otherwise they wouldn’t be able to communicate – and wouldn’t be considered speakers of the same language).

Continuing the analogy down to the next level, words are similar to genes. Words and genes are both combinatorial, in that they consist of sequences of smaller units combined to make larger units: syllables and letters in words, codons and nucleotides in genes.

Words are made up of syllables. Some words are made of a single syllable, such as “dog,” “cat,” and “fish.” Longer words can be made by combining syllables: “dogfish,” “catfish.”

Similarly, genes are made up of a series of codons. Unlike syllables, which can be spelled with anywhere from one to six or more letters (“a,” “-ed,” “-ing,” “ouch,” “queue,” “smooch”), codons are always spelled with three letters.

Spelling is easier in genetics than in linguistics, because while languages may use dozens of letters (e.g., 26 in English), all genes are spelled with only 4 letters: G, A, T, and C. These letters stand for the nucleotides Guanine, Adenine, Thymine, and Cytosine.

Words are generally much shorter than genes, however. Words usually have only a few syllables, whereas genes can contain hundreds or thousands of codons.

Each 3-letter codon is translated into an amino acid; these amino acids are in turn connected up together like cars in a train to make proteins. The whole business of making proteins is very complicated, and is perhaps roughly analogous with the translation of mental representations of words into physical phenomena, such as sounds produced by the vocal tract, or signs made with the hands in sign language, or words written on a page or typed on a screen.
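
For readers who like to see such things spelled out, here is a minimal Python sketch of the train-car idea: cut a gene into three-letter codons, look each one up, and chain the resulting amino acids together. The lookup table is only a tiny excerpt of the standard genetic code, and the sequence `toy_gene` and the function names are made up for illustration. Real protein synthesis involves far more machinery than a dictionary lookup.

```python
# Tiny excerpt of the standard genetic code (DNA alphabet).
# One-letter amino acid symbols; "*" marks a stop codon.
CODON_TO_AMINO_ACID = {
    "ATG": "M",  # methionine, the usual start signal
    "TCA": "S",  # serine
    "TGG": "W",  # tryptophan
    "GGA": "G",  # glycine
    "TAA": "*",  # stop
}


def split_into_codons(gene):
    """Cut a DNA string into consecutive three-letter codons."""
    usable_length = len(gene) - len(gene) % 3  # ignore any trailing partial codon
    return [gene[i:i + 3] for i in range(0, usable_length, 3)]


def translate(gene):
    """Chain amino acids together, one per codon, until a stop codon."""
    protein = []
    for codon in split_into_codons(gene):
        amino_acid = CODON_TO_AMINO_ACID[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_gene = "ATGTCATGGGGATAA"  # hypothetical sequence, not a real gene
print(split_into_codons(toy_gene))  # ['ATG', 'TCA', 'TGG', 'GGA', 'TAA']
print(translate(toy_gene))          # MSWG
```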

| Linguistic level | Linguistic example | Genetic example | Genetic level |
|---|---|---|---|
| Letter | A, B, C, D, E, F, G… | A, C, G, T | Nucleotide |
| Syllable | Dog, cat, in-, un-, -ness | CAT, TAG, GAT, GCG | Codon |
| Word | Dog, cat, catness, undoglike | hemoglobin, melanin, lactase, amylase | Gene |
| Idiolect | My particular speech | My particular genes | Genome |
| Dialect | Upper Midwest American English | Homo sapiens sapiens | Subspecies |
| Language | English | Homo sapiens | Species |
| Family | Germanic, Indo-European | Hominins, Primates | Clade |

In addition to being combinatorial, words and genes resemble one another in that they are both products of descent with modification. This is the phrase that Darwin preferred to “evolution,” and is really more precise about what happens in evolution. The descent part means that words and genes both have histories and family trees. The modification part means that both words and genes gradually change over time, across generations.

Both words and genes can undergo small changes, “mutations,” in how they are spelled. Genes can change by as little as one nucleotide. Many such mutations are “silent,” that is, they don’t affect the amino acid sequence made by the gene, because the genetic code is redundant: there are only 20 amino acids, but 64 possible codons. So some amino acids can be spelled several different ways. The amino acid serine, for example, can be spelled TCA, TCC, TCG, or TCT.
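
The arithmetic behind this redundancy can be checked with a short Python sketch. It builds the standard genetic code from a compact 64-character string (bases ordered T, C, A, G, following the common NCBI layout for the standard code), then counts codons and amino acids and tests whether a one-letter change is silent. The helper names `synonymous_codons` and `is_silent` are my own, introduced just for this example. Note that in the standard code serine has two further spellings, AGC and AGT, beyond the four TC- codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one amino acid symbol per codon, in TCAG order.
# "*" marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last position fastest: TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(amino_acid):
    """All codons that spell the given amino acid (one-letter symbol)."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def is_silent(codon_before, codon_after):
    """True if the mutation leaves the encoded amino acid unchanged."""
    return CODON_TABLE[codon_before] == CODON_TABLE[codon_after]


print(len(CODON_TABLE))                        # 64 codons (4 ** 3)
print(len(set(CODON_TABLE.values()) - {"*"}))  # 20 amino acids
print(synonymous_codons("S"))                  # ['AGC', 'AGT', 'TCA', 'TCC', 'TCG', 'TCT']
print(is_silent("TCA", "TCG"))                 # True: serine either way
print(is_silent("TCA", "CCA"))                 # False: serine becomes proline
```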

Mutations can have a wide range of effects, from not changing gene function at all, to wrecking the gene entirely. Some mutations result in slight improvements.

Words also undergo mutations. Talking about mutations in words is a little tricky in that the letters we use to spell them have an imprecise relationship to the way they are actually pronounced. In linguistics, the actual speech sounds that make up words are called phones.

Thus, “water” is spelled the same way in Dutch and English, but is pronounced slightly differently. Even within English, “water” is pronounced differently by different speakers and dialects. In the American Midwest, “water” is pronounced something like “wah-dur,” whereas in some dialects in England it is pronounced more like “wah-tuh.”

But both words and genes are robust to these small changes. They still work when altered just a little bit – which allows them to evolve.

Words and genes both accumulate small changes over time. These changes tend to cluster geographically. People who live near one another for many generations tend to speak the same language and dialect, and also tend to have more similar genes than people who live further away.

So what does all this have to do with the argument in biology about levels of selection?

In 1976, Richard Dawkins drew attention to the gene’s eye view of biology with his book, The Selfish Gene. Prior to this book, a widespread view in biology was that genes are something organisms use to accomplish certain goals. The heart pumps blood, the kidneys filter blood, and the genes store information and transmit it to the next generation. Dawkins, popularizing work by G. C. Williams and W. D. Hamilton and others, turned this view on its head: organisms are “survival machines” that genes use to make more copies of themselves. Dawkins argued that genes are ruthlessly selfish, because only those genes that succeed in getting copied are transmitted to the next generation.

Many people have objected to this view of evolution. Stephen Jay Gould, for example, argued that natural selection acts on individuals, rather than genes. Biologists including Edward O. Wilson and David Sloan Wilson have argued that selection acts on multiple levels: genes, individuals, groups, perhaps even species. The debate continues with passionate advocates on each side.

In some ways, I think the debate is entirely sterile. Many people on both sides of the debate seem willing to agree that group selection is mathematically equivalent to kin selection. What really seems to feed the passion in this debate is the connotations that people attach to the idea of genes as “selfish” entities. Many people seem to have the impression that group selection is somehow kinder, gentler, and politically more left-leaning than gene-level selection. This view puzzles me, since plausible mechanisms of group selection are often quite nasty, such as intergroup hostility and warfare.

Genes are exotic entities, only recently discovered, and not fully understood even by professional biologists. Words, however, are familiar things that we all use all the time. So linguistic evolution might be easier to grasp for many of us than genetic evolution.

From a Darwinian perspective, a word is selfish in exactly the same way that a gene is. In both cases, versions that succeed in making more copies of themselves are the ones that persist over time.
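
To make the copying logic concrete, here is a toy Python sketch of two competing variants, which could equally be two versions of a gene or two words for the same thing. The copying rates (1.1 versus 1.0), the starting share, and the function names are all invented for illustration. The model is deterministic and ignores drift, fashion, and everything else that makes real populations messy.

```python
def next_share(share, copy_rate_a, copy_rate_b):
    """Share of variant A after one round of copying."""
    copies_a = share * copy_rate_a
    copies_b = (1.0 - share) * copy_rate_b
    return copies_a / (copies_a + copies_b)


def simulate(initial_share, copy_rate_a, copy_rate_b, rounds):
    """Track variant A's share of the population over repeated copying."""
    shares = [initial_share]
    for _ in range(rounds):
        shares.append(next_share(shares[-1], copy_rate_a, copy_rate_b))
    return shares


# Variant A starts rare (1%) but gets copied 10% more often than variant B.
history = simulate(initial_share=0.01, copy_rate_a=1.1, copy_rate_b=1.0, rounds=100)
print(f"start: {history[0]:.2f}, after 100 rounds: {history[-1]:.2f}")
```

Under these made-up numbers, the rare variant ends up as nearly the whole population. Nothing in the model knows or cares whether the variants are genes or words, only that one gets copied slightly more often than the other.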

Genes get themselves copied through reproduction. In species with sexual reproduction, they depend on their host finding a mate and (if there is any parental care) successfully rearing the resulting offspring.

Words get themselves copied in various ways. Vertical transmission is like biological reproduction, in that words are passed down from parent to child. Words differ from genes in that they are also easily passed among unrelated individuals: horizontal transmission. Horizontal transmission of genes does occur, especially in bacteria, but it is less common in complex multicellular creatures like ourselves.

Words vary among speakers, just as genes vary among individuals. Common words are shared by nearly every member of a language community, but there is still variation, among regions, social groups, interest groups, and individuals.

My idiolect, like my genome, is an ephemeral collection of words and rules. It will vanish when I die (apart from whatever words I have left behind in books and such – but even those will represent only a fraction of my actual idiolect, and will show the influence of co-authors, editors and such). My genome will also vanish when I die (parts of it will live on in my children, but all mixed up with my wife’s genes).

Words, however, have longer histories – as do genes. The word I use for H2O, “water,” comes from ancient roots. We see its cousins in words such as “Wasser” in German and the more distant cousin “voda” in Russian (whence the word “vodka,” “my dear little water”), and “uisge” in Scottish Gaelic (and its distilled descendant word in English, “whiskey”).

In language evolution, selection occurs mainly at the level of words. It is individual words that accumulate changes in their sounds and meanings. Words exist in a constant competition with other words for space in each individual’s vocabulary. Words come and go, depending on fashion, technology, and random drift.

The analogy is of course far from perfect and shouldn’t be pushed too far.

In fact, some linguists don’t like this analogy at all. In a 2014 blog post, linguist Asya Pereltsvaig complains:

words are not “just like genes” in that they are easily borrowed from language to language, even across family boundaries, are subject to conscious choice, and are not subject to natural selection.

The first point is true – sort of. The sort of genetic transmission that we are most familiar with is vertical (parent to offspring) rather than horizontal (from one unrelated individual to another). However, horizontal gene transfer turns out to be more important than people used to think.

Bacteria swap genes fairly frequently, such as when they share genes for resistance to antibiotics (like Deadheads swapping tapes of old Grateful Dead shows).

And according to one recent study, some 8% of the human genome originated in retroviruses.

So actually words and genes are quite similar in this respect. Most genes, and most words, come from your parents, but some genes and words come from elsewhere, sometimes even quite unrelated sources.

The other claim, that words are not “subject to natural selection,” is also debatable. Pereltsvaig focuses attention on the fact that word form is arbitrary.

As was noted by the “father of modern linguistics” Ferdinand de Saussure, the association of sound and meaning of a word is largely random: the sound of house is neither more appropriate to the concept nor better for the “survival of the fittest” than maison (French), dom (Russian), bayit (Hebrew), or iglu (Inuktitut)

It is true that the particular form of a word is basically arbitrary. But it is not true that selection has nothing to do with word form. Over time, long words that are frequently used get shortened. In both French and English, the invention of two-wheeled human-powered transportation required an accompanying new word (“vélocipède,” bicycle), which was subsequently shortened in both languages (“vélo,” bike). French teenagers commonly use “gars” for boy instead of the longer “garçon.” Words that are hard for native speakers to pronounce get changed to make pronunciation easier. Words whose meanings are transparently obvious to native speakers may generally catch on better than words whose meaning is opaque. For example, the term “earworm” (to describe a catchy tune that gets stuck in your ear) has a better chance of being understood, used, and catching on among English speakers than the original German word that it is translated from, “Ohrwurm.”

Pereltsvaig also claims, “words provide no adaptive advantage to people(s) who have them.”

But I disagree with this as well. Words are crucial to survival and reproductive success in human societies. Someone unable to use words at all would have tremendous difficulty holding a job or finding a mate. Using words well is essential, not just for those who write for a living, but also for anyone who talks with other people.

In some cases, correct understanding of a word could make the difference between life and death. One time at Gombe, an American colleague of mine thought she saw a hippo swimming in Lake Tanganyika. She grew alarmed, as several people were swimming nearby. Hippos may seem harmless, but they are enormous, terrifying beasts with huge sharp tusks (usually hidden inside their vast mouths). Hippos are often said to kill more people in Africa each year than crocodiles. To warn the swimmers, she shouted “Kifaru! Kifaru!” The Swahili-speaking swimmers just looked at her with a puzzled expression – since kifaru means rhinoceros in Swahili, and there was no risk whatsoever that there would be a rhino in the water. It turned out there wasn’t a kiboko (hippo) either, but if there had been, this linguistic mistake could have proved deadly.

The particular words we use tell others about our social status, our level of education, our sense of humor and style, and many other aspects that directly affect our reproductive success. Blurting out the wrong set of words can cost a person dearly (see, for example, James Watson, Tim Hunt, Donald Trump).

Looking at evolution from a gene’s eye view provides insights that simply aren’t available from other perspectives. Many aspects of biology don’t make any sense at all except from a gene’s eye view. The very existence of sex, for example. If selection occurred at the level of individuals, we should see individuals mainly making exact copies of themselves (clones). This sometimes happens in plants and animals, and is the norm for bacteria, but the widespread occurrence of sexual reproduction is very difficult to explain, unless evolution is mainly about the replication of genes.

At the same time, words and genes both exist within incredibly complex systems in which the influence of any one word or gene may not be obvious. Just as each word contributes a tiny bit to each individual’s language output, each gene contributes a bit to each individual’s biological output. The total number of words that an individual speaker knows is estimated to be around 20,000 – 35,000. Coincidentally, this happens to be quite similar to the number of protein-coding genes in the human genome (around 20,000-25,000). Thus, in most cases, any single word or gene is likely to have only a small and subtle influence on an individual’s survival and reproductive success.

Words combine in complex ways to produce phrases, sentences, and longer things like poems, songs, articles and books. Gene products interact in complicated ways to produce living bodies and regulate the expression of other genes.

Both words and genes only make sense within the context of the complex system in which they exist. The French word “entrée” means something eaten at the start of a meal in French (the “entry” into the meal). When English speakers borrowed this word, they rather confusingly used it to refer to the main course of a meal. Similarly, the “meaning” of genes depends on the context in which the genes occur. In animals with red blood cells, the genes for hemoglobin make proteins that carry oxygen. But what would happen if these genes were inserted into a bloodless organism, such as a bacterium? Probably just an accumulation of protein that has no use whatsoever for the bacterium. (This may seem like a weirdly pointless experiment, but it has actually been done to produce and study mammoth hemoglobin.)

Additionally, just because words and genes are both “selfish,” in the sense that those that are better at getting copied are the ones that become most common in a given population, does not mean that they have to promote behavior that is selfish. Animals engage in all sorts of altruistic behavior, much of which is presumably the result of genes promoting altruism – that is in fact the central topic of The Selfish Gene.

For example, individual words might be “selfish,” in the sense that words with features that promote being copied get copied more often. But the words themselves don’t necessarily promote selfish behavior. For example, “Do unto others as you would have done unto you” is a combination of words that has been extremely successful in getting itself copied. Perhaps the great majority of the hundreds of millions of people who speak English have some version of this phrase stored in their memories; and other languages transmit equivalent versions of this phrase. The phrase is good at getting copied, but it advocates cooperative behavior, rather than selfish behavior. This is precisely why the phrase has been so successful. People who make an effort to follow the idea encoded in this phrase are likely more successful at navigating the complexities of village and urban life than people who are mean-spirited and selfish. “Selfish” words, like “selfish” genes, often promote cooperative behavior.

Language evolution and biological evolution both result from the accumulation of small changes at fundamental levels: words and genes. Words are “selfish” in exactly the same way as genes. Words and genes that have attributes that increase their likelihood of being reproduced become more common in the population. But neither words nor genes have goals, or minds, or emotions, or feelings of being selfish, altruistic, or anything else. They are just bits of information that happen to exist within copying systems. And just because these bits of information can be described as selfish doesn’t mean that they invariably code for selfish behavior.