How do we know what ancient Egyptian sounded like, or Old English? Linguistics gives us the tools to reconstruct lost languages from the words we speak today. Here's how it's done.
The most important method of linguistic reconstruction is known as the comparative method. It entails pretty much exactly what you would expect - it looks for related languages and then looks for orderly patterns of changing sounds to figure out a common starting point, which takes the form of a proto-language. The most famous example of this is probably Proto-Indo-European, the reconstructed language that is the common ancestor of the (deep breath) Albanian, Armenian, Baltic, Celtic, Germanic (which includes English), Greek, Indo-Iranian, Italic, and Slavic branches of the Indo-European family, not to mention a handful more that have since become extinct.
We won't go into the specific sound changes that allowed linguists to reconstruct the Proto-Indo-European word for "dog" as *ḱwṓn, but we can examine the basic processes that go into language reconstruction and understand how living languages can still carry with them the "fossils" of languages that haven't been spoken for millennia.
The first step in the comparative method is to look for words that have the same meaning and similar sounds across many different languages - these are known as cognates. If a group of languages all use more or less the same-sounding word to mean something, there's a good chance that they're all related. That's a rather imprecise definition, of course - what we're really looking for is a group of cognates that all have the same basic phonemic structure, so that it's possible to work backwards and see the various sound changes that would take the earlier, common word from the proto-language and make it evolve into the different cognates.
As an example, we can look at the different words for "head" in the Romance languages of Italian, Spanish, and Portuguese: "capo", "cabo", and "cabu." Though the internal sounds are slightly different, we can see they all have identical structures. Just working from this tiny data set, we might guess that the ancestor of these three languages - proto-Romance, if you will - had a word for "head" that occupied a common point between these different sounds. We might guess that the "p" sound in Italian was once a "b" like in the others, and that the "u" in Portuguese was once an "o" like in Italian and Spanish. So then, just going by majority rules, the ancestor word is "cabo" or something similar.
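The "majority rules" guess described above can be sketched in a few lines of Python. This is only a toy illustration of that guess, not a real reconstruction procedure: it treats each letter as one sound, it assumes the three words already line up position by position, and the helper name majority_rules is made up for this example.

```python
from collections import Counter

# The three words for "head" quoted in the text. They happen to be the
# same length, so they can be compared position by position without a
# proper alignment step (an assumption of this sketch).
cognates = {"Italian": "capo", "Spanish": "cabo", "Portuguese": "cabu"}


def majority_rules(words):
    """Return the most common letter in each position across the words."""
    columns = zip(*words)  # one tuple of letters per position
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)


guess = majority_rules(cognates.values())
print(guess)
```

With only three words, a single odd one out in any position is simply outvoted, which is why so small a data set gives the method nothing to check itself against.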
That's not terrible reasoning, but it happens to be completely wrong. After all, we know full well what the ancestor language of Italian, Spanish, and Portuguese is - it's Latin, and the Latin word for head is "caput". This speaks to an important point about the comparative method: it only works when you have enough data to actually see consistent, orderly patterns. (It also helps to know some of the basic rules of sound change, but that's really another post altogether.) Three words isn't going to cut it.
This may seem like a ridiculous example, but there's a very real danger in paying too much attention to isolated cognates. For instance, the English word "taboo" sounds almost identical to "tapu", a word in several Polynesian languages that means much the same thing. Is English secretly related to the Polynesian languages? Of course not - English just happened to import that word from the Polynesian language Tongan, but someone unfamiliar with its etymology might mistake this simple borrowing for an actual genetic relationship. And there are plenty of basic words that coincidentally have similar sounds and meanings across unrelated languages - the Mayan word for "mess", for instance, is "mes", and a lot of words for "mother" sound very similar in lots of geographically separated tongues.
As I said before, having a big enough data set to see clear patterns is crucial to language reconstruction. Let's say, for instance, that you want to look for a common ancestor of English and Latin, a process that ultimately led linguists to the reconstruction of Proto-Indo-European. While English is obviously far less closely related to Latin than are any of the Romance languages, there are enough cognates to suggest that the two are at least distantly related.
Next, linguists look for what are known as "correspondence sets." An obvious - but frequently wrong - first guess is that a sound in one language will show up as the same sound in its relative. For instance, we might expect that words beginning with "d" in English will also start with "d" in Latin. And, again, if you only look at a couple of examples, there might seem to be some truth to this - there's the English "day" and the Latin "dies", for example, and then there are the rather Satanic examples "devil" and "diabolus." Seems like a good match, right?
But that quickly falls apart when you look at the larger vocabularies of the two languages (particularly when you learn that both of those words for devil were actually borrowed from Greek). Instead, it's actually the English "t" that regularly lines up with the Latin "d", at least when we're talking about the beginnings of words. There are plenty of examples of this: "ten"/"decem", "two"/"duo", "tongue"/"dingua" (the Old Latin form of "lingua"), and so on. This regular correspondence between the two sets is very strong evidence for a genetic relationship between English and Latin.
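Tallying a correspondence set is mostly bookkeeping, and a rough Python sketch makes the idea concrete. It uses only the word pairs quoted in this article, compares first letters as a stand-in for first sounds, and marks the Greek loans by hand. The function name initial_correspondences and the borrowed flag are assumptions of this example, not standard tools.

```python
from collections import Counter

# (English, Latin, borrowed?) triples quoted in the text. The flag is set
# by hand here; deciding what counts as a borrowing is the hard part.
word_pairs = [
    ("ten", "decem", False),
    ("two", "duo", False),
    ("tongue", "dingua", False),
    ("day", "dies", False),
    ("devil", "diabolus", True),  # both words were borrowed from Greek
]


def initial_correspondences(pairs, skip_borrowed=True):
    """Count how often each English initial letter pairs with each Latin one."""
    counts = Counter()
    for english, latin, borrowed in pairs:
        if skip_borrowed and borrowed:
            continue
        counts[(english[0], latin[0])] += 1
    return counts


counts = initial_correspondences(word_pairs)
for (eng, lat), n in counts.most_common():
    print(f"English {eng}- : Latin {lat}-  ({n} pairs)")
```

Even in this tiny sample the t-/d- pairing outnumbers the d-/d- one, and a real analysis would run the same kind of tally over hundreds of words before trusting the pattern.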
Once you have these correspondence sets in hand, the trick is to look for the common form that preceded both. In this particular case, we're looking at the phonemes d- and t-, and we would guess that the proto-phoneme is *t-. Explaining the precise reasoning behind that would, as I said, really require its own separate post, but the basic gist is this: "t" is what's known as a voiceless phoneme, whereas "d" is its voiced equivalent, and it's far, far more common for sounds to develop from voiceless to voiced than the other way round.
And yet even that happens to be wrong in this case, although you wouldn't have any way to know that from just looking at Latin and English. Here our small data sets aren't the issue - you could have the entire vocabularies of those two tongues and still have no idea that the Proto-Indo-European form behind a pair like "two"/"duo" happens to begin with the cluster *dw-. The only way to discover that is by incorporating their distant relative Armenian into the analysis: Armenian shows its own regular sound change from that *dw- (its word for "two" is "erku"), and that change lines up with what we see in the other Indo-European languages.
As you can probably guess from these examples, this is not an exact science - nothing so intimately connected to the human experience ever could be. The entire comparative method is based around the extremely strict assumption that "sound laws have no exceptions." That's true to some extent, but only if you're willing to instantly reclassify the myriad exceptions you encounter as just more laws that are never broken. You'll very quickly have a very thick book of sound laws, and it's an open question when this becomes more of a hindrance than a help.
There are plenty of things that can mess up the orderly progression of languages. Anyone familiar with modern English should know about the borrowing of words from one language to another. As with "taboo", English freely imports foreign words for which it has no direct equivalent, while somewhat more conservative languages can make use of English's often giddy willingness to just make up a word to fill a linguistic hole - perhaps most famously with the French concept of "le weekend."
Sometimes these problematic influences are more subtle. There's a German term, "Sprachbund", that describes an area whose languages share many common features even though none of them are actually related. In these cases, geographically close but genetically unrelated languages interact and mix some features, sometimes to the extent that an enterprising linguist could construct an entire false proto-language out of these apparent connections. The best example of this is probably found in East and Southeast Asia, where the unrelated languages of Chinese, Korean, Japanese, and Vietnamese were initially grouped together because so many of their features had converged.
And, sometimes, stuff just happens by complete random chance. The Latin word for "miracle", "miraculum", should have become "miraglo" in Spanish. Instead, the "r" and "l" flipped and it became "milagro", and there's no great underlying reason for this beyond the vagaries of chance. With enough data, linguistics can cut through these changes and take us back to proto-languages that were never recorded in history books, languages like Proto-Indo-European that exist today only as the result of backbreaking analysis of dozens upon dozens of languages. The fact that linguists can actually do this at all is something of a miracle, regardless of how you arrange its consonants.