A tools-focussed talk: from raw linguistic data to reconstructed language trees
Today we hosted a talk by Gereon Kaiping from Leiden University. The talk went through the pipeline his group uses to go from linguistic data collected in the field to reconstructed phylogenetic trees of languages produced using mostly off-the-shelf tools. There was a lively discussion with an audience of linguists, bioinformaticians, and statisticians.
Gereon has kindly made his slides available, which can be viewed below, or downloaded.
Talk tomorrow 25/4 on phylogenetics tools for historical linguistics
Tomorrow afternoon we are hosting a talk by Gereon Kaiping, who we met at a recent workshop. All are welcome; details below.
Time and location: Department of Statistics on Tuesday 25th April at 4.00 pm – 5.00 pm in the Small Lecture Theatre (LG.03).
Speaker: Gereon Kaiping, University of Leiden
Title: Some Assembly Required: From sounds to histories in 8 steps using mostly off-the-shelf tools.
Abstract: Phylogenetic methods are gaining traction in linguistics, but have so far been quite inaccessible to linguists:
The core tools doing the tree construction – whether they be heuristic or Bayesian – often come from bioinformatics, and their inputs (e.g. Nexus files) and outputs (e.g. Newick trees without explicit reconstruction) conform to biological, not linguistic standards – or they are written ad hoc for a specific dataset. However, this situation is changing: In this talk, I will present a collection of tools, most of which are published elsewhere, that together go the full way from linguistic fieldwork via public cross-linguistic linked databases and Bayesian inference tools to plots of phylogenetic trees with ancestral state reconstruction. I will describe both emerging standards in quantitative historical linguistics that make this process easier, and specific challenges that arose in the construction of this tool chain. The talk will conclude with the discussion of some results from the reconstructed word-meaning correspondences in the Lesser Sunda region of Indonesia, and how they feed back into improving our data and understanding of the local language history.
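For readers who have not met the Newick format mentioned in the abstract: it writes a tree as nested brackets with optional branch lengths. The following is our own minimal sketch of a parser, not part of the speaker’s tool chain. The function names parse_newick and leaf_names and the example tree are made up for illustration, and the sketch ignores the quoting and comments that real Newick files may contain.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Handles node names and branch lengths only (no quoting, no comments).
    """
    text = text.strip()
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        # Optional node name (empty for unnamed internal nodes).
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":".
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def leaf_names(node):
    """Return the names of all leaves below `node`, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


# Hypothetical three-language tree, purely for illustration.
tree = parse_newick("((Dutch:0.2,German:0.3):0.1,English:0.4);")
print(leaf_names(tree))
```

Note that the internal nodes carry no name or state here, which is the "without explicit reconstruction" point made in the abstract: the format records the shape of the tree, not what the ancestral languages looked like.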
End of the phylogenetic methods in historical linguistics workshop
Sadly the workshop is over, and we are preparing to return to sunny Oxford! We enjoyed two final talks today, which we summarise below. We have also written up summaries of Tuesday’s talks, and Wednesday’s talks.
Causal inference of evolutionary networks – Johannes Dellert, University of Tübingen
This speaker began by discussing the difficulties with building up phylogenetic networks. Most phylogenetic methods (on languages as well as in biological contexts) are based on trees, but these trees imply a greater independence than we know to be realistic – they usually fail to capture language contact and influence, which can be a major driver of similarity between languages (separate from inheritance). Methods which do utilise networks are usually either visualisations of other kinds of data (where nodes don’t correspond to languages, for instance), or are restricted to narrow sub-classes of network structure which are not often powerful enough to capture the kinds of relationships that one would like to capture.
To address this, the speaker presented a project based on causal inference, building a network of causal relationships from observational data alone. Correlation does not imply causation – but by starting from a connected network of correlations, it’s possible to delete edges in such a way that the remaining structure is a set of causal relationships explaining the observed correlations. The results were mostly very good, and went beyond any previously available method or tool for such analysis. There are some artefacts: for example, in a group of languages that had all been influenced by German, the one language with particularly heavy German influence appeared to have influenced the others, rather than all of them being influenced by German directly. Overall, though, it seems like a very promising project, with an inspiringly creative approach to a very difficult problem.
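To give a flavour of the edge-deletion idea, here is a toy sketch in the style of constraint-based causal discovery (the skeleton phase of the PC algorithm). This is a stand-in we chose for illustration, not the speaker’s method, which was not described in enough detail for us to reproduce. The variables A, B and C, the threshold of 0.1, and the simplification of conditioning on any single third variable are all our assumptions.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def skeleton(data, threshold=0.1):
    """Start from a fully connected graph and delete every edge whose
    correlation vanishes, either marginally or given one other variable."""
    names = sorted(data)
    edges = {frozenset(pair) for pair in itertools.combinations(names, 2)}
    for a, b in itertools.combinations(names, 2):
        if abs(pearson(data[a], data[b])) < threshold:
            edges.discard(frozenset((a, b)))
            continue
        for c in names:
            if c in (a, b):
                continue
            if abs(partial_corr(data[a], data[b], data[c])) < threshold:
                edges.discard(frozenset((a, b)))
                break
    return edges


# Toy data from a known causal chain A -> B -> C.
rng = random.Random(0)
A = [rng.gauss(0, 1) for _ in range(5000)]
B = [a + rng.gauss(0, 1) for a in A]
C = [b + rng.gauss(0, 1) for b in B]
data = {"A": A, "B": B, "C": C}

print(sorted(sorted(edge) for edge in skeleton(data)))
```

All three variables are correlated with each other, but the correlation between A and C disappears once B is taken into account, so that edge is deleted and only the chain remains. Orienting the surviving edges is a separate step that this sketch leaves out.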
Simulating lexical evolution with semantic shifts – Gereon Kaiping (*) and Johann-Mattis List (^), University of Leiden (*), Max Planck Institute for the Science of Human History (^)
This talk began with a discussion of some of the problems with current quantitative methods in historical linguistics. A major such problem is the lack of proper data on historical language change, leading to a trend towards models not being properly validated and tested. There is also not much simulation done to test methods, and most existing simulations tend to be very simple. This project aims to develop a more realistic model of language change, under which simulations might be done which could lead to better validation and testing of other quantitative historical linguistic methods. The model also accounts for semantic drift and replacement, in contrast to most previous methods, which only consider cognates that correspond to the same concept.
This built on concepts from Saussure about the form and meaning of words being ‘two sides of the same coin’. The model sees a language as a bipartite graph between a network of concepts and a vector of words. The evolution of the model involves updating the weighting of edges between the concepts and the words, corresponding to the changing set of vocabulary and meanings of words, over a phylogenetic tree. This draws on game theoretic ideas. They also presented some validation and parameterisation of their models based on available data sets. Their software is open source and available online: https://github.com/anaphory/simuling
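The data structure is easy to picture in code. The sketch below is our own toy version, not taken from simuling: a language is a dictionary of weighted edges between concepts and words, and one evolutionary step either extends a word to a neighbouring concept (a semantic shift) or weakens an existing edge. The concept network NEIGHBOURS, the made-up words, the 50/50 update rule and the function names are all assumptions for illustration; the real model’s update rule is more sophisticated.

```python
import random

# Hypothetical concept network: which meanings are adjacent to which.
NEIGHBOURS = {
    "hand": ["arm"],
    "arm": ["hand", "branch"],
    "branch": ["arm", "tree"],
    "tree": ["branch"],
}


def step(weights, rng):
    """Apply one random change to the concept-word edge weights in place."""
    if not weights:
        return  # nothing left to change
    concept, word = rng.choice(sorted(weights))
    if rng.random() < 0.5:
        # Semantic shift: the word gains weight for a neighbouring concept.
        target = rng.choice(NEIGHBOURS[concept])
        weights[(target, word)] = weights.get((target, word), 0) + 1
    else:
        # Loss: the word becomes less used for this concept.
        weights[(concept, word)] -= 1
        if weights[(concept, word)] == 0:
            del weights[(concept, word)]


def evolve(weights, rng, n_steps):
    """Return a daughter language after n_steps changes; parent is untouched."""
    daughter = dict(weights)
    for _ in range(n_steps):
        step(daughter, rng)
    return daughter


rng = random.Random(42)
parent = {("hand", "manu"): 3, ("tree", "arbo"): 3}
# One split in the tree: two daughters evolve independently from the parent.
daughter_a = evolve(parent, rng, 20)
daughter_b = evolve(parent, rng, 20)
print(daughter_a)
print(daughter_b)
```

Running evolve along every branch of a phylogenetic tree gives simulated languages whose true history is known, which is what makes this kind of model useful for testing inference methods.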
Another day of phylogenetic methods in linguistics
Today we enjoyed six talks at the workshop in Tübingen, which we summarise below. We have also summarised yesterday’s talks. Update: also Thursday’s talks!
Further evidence for punctuated language evolution – Gerhard Jäger, University of Tübingen
This talk discussed the concept of punctuated evolution – that is, evolution where the most active phase of change happens just after speciation takes place. In biology this has been suggested as an explanation for the relatively few ‘intermediate stage’ fossils that are found – it seems that it’s often the case that a species arises, quickly evolves into a relatively stable state, and stays fairly unchanged for some time. It has been suggested that the same phenomenon might occur in language change (e.g. by Dixon in 1997).
Two methods had been reproduced: one from Atkinson et al (2008) which works on manually labelled lists of cognate pairs, and one from Holman and Wichmann (2016) which uses language distance (without needing labelled cognate data). Overall the study’s results seemed to suggest that punctuated evolution may indeed be taking place to some extent in language change.
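One common way to look for a punctuational signal, familiar from comparative biology, is to ask whether tips of a tree that sit below more branching events have also accumulated more change, i.e. whether root-to-tip path length correlates with the number of nodes along the path. The sketch below shows that statistic on a made-up four-language tree. Whether the reproductions in the talk used exactly this statistic is our assumption, and the tree, branch lengths and function names are invented for illustration.

```python
import math


def tip_paths(node, depth=0, length=0.0):
    """Return (name, nodes_on_path, path_length) for every tip below `node`.

    A node is a tuple (name, branch_length, children).
    """
    name, branch, children = node
    length += branch
    if not children:
        return [(name, depth, length)]
    paths = []
    for child in children:
        paths.extend(tip_paths(child, depth + 1, length))
    return paths


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical tree in which every split adds an extra 0.5 units of change.
tree = ("root", 0.0, [
    ("A", 1.0, []),
    ("n1", 0.5, [
        ("B", 1.0, []),
        ("n2", 0.5, [
            ("C", 1.0, []),
            ("D", 1.0, []),
        ]),
    ]),
])

paths = tip_paths(tree)
node_counts = [nodes for _, nodes, _ in paths]
path_lengths = [length for _, _, length in paths]
print(paths)
print(pearson(node_counts, path_lengths))
```

In this toy tree the relationship is exact by construction. On real data the correlation is noisy, and a proper analysis also has to account for the shared history of the tips and for biases in how the tree itself was inferred.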
Building histories of Slavic on parallel texts – Ruprecht von Waldenfels, University of Zurich
This talk was quite different from most others at the workshop in two main ways: it examined a language family history which is known in some detail already, and the methods revolved around the use of parallel texts rather than word lists or other data.
Taking texts which have been translated into all of the languages considered, the study looked at different language features individually, finding different connections between languages. Since the history of the language family is fairly well known, the speaker was able to explain the nature and history of these different relationships for individual language features. This seemed a step forward for Slavic language studies, confirming much more manual work with much automated analysis. It also sent a strong message to those in the audience, that many methods in use (e.g. based on language similarity) may induce a history, but in fact there can be many histories behind the relationships between languages in any particular group.
Reconstructing language ancestry by performing word prediction – Peter Dekker, University of Amsterdam
This talk described a project based on the use of recurrent neural networks with an encoder-decoder structure to detect cognates, in a supervised machine learning framework. This process has some analogy to problems in machine translation, where neural network approaches have been applied with some success, and this project draws on some of the progress in that field to solve problems here. The neural network is trained on pairs of words corresponding to the same concept in different languages. Since this method avoids relying on manual labelling of cognate and non-cognate pairs, the goal is to take in all pairs but only really learn from the cognate ones, which is achieved by the design of the loss function. Overall it seemed like the project was reaching a baseline of success in line with existing models, and that it had promising scope for tweaks to improve its performance further.
Sound change phylogeny in Uralic family trees and networks – Jyri J. F. Lehtinen, University of Helsinki
This talk began by acknowledging some of the criticisms that have been aimed at the use of phylogenetic methods in historical linguistics in general by linguists. A primary such complaint was the concept of “garbage in, garbage out”. The speaker described a study which involved a very careful process of data selection. The study looked at shared innovations in Uralic languages by looking at reconstructed protoforms and attested forms of words – taking only the words with the most reliable, stable, and regular reconstructed protoforms known from the literature (taking care to avoid including data which has been superseded or isn’t considered reliable).
The study focussed on phonological data, and compared the results of phylogeny reconstruction with this data to other studies using lexical data, as well as trees constructed from qualitative approaches. The results seemed very positive for the approach.
Deep learning and historical linguistics: two case studies – Taraka Rama, University of Tübingen
As well as a high-level introduction to neural networks, this talk discussed the use of neural networks for two linguistics applications: cognate identification and dialect classification. For cognate identification, convolutional neural networks are used. This doesn’t require explicit character alignment, and the network was designed with a structure which allows word relatedness and language relatedness, which both inform cognate inference, to be simultaneously learned. The results were positive, even with relatively small data sizes.
For dialect classification, an unsupervised learning approach was taken. Autoencoders were trained on large numbers of words encoded in IPA format, without the need of explicit manual alignment and cognacy judgments. The approach produced some interesting and good-looking maps of dialect distribution in a few different countries.
Tracking modern human population history from linguistic and cranial phenotype – Hugo Reyes-Centeno, University of Tübingen
This talk took a very creative approach to address a conjecture first raised by Darwin – essentially over how much human genealogy can tell us about language genealogy. To examine this relationship, the study made use of the “serial founder effect”, which essentially says that there is less genetic diversity seen as the population moves further from its starting point (as each time a new population is established, it is drawn from only some fraction of the previous, larger population, so the gene pool of the new population is based on a subset of the original gene pool). The study investigated whether there’s a relationship between linguistic diversity and genetic diversity.
Properties of cranial bone fragments were used as a phenotypical proxy for genotypic data. By comparing skull fragments from various regions with language diversity from those regions, the relationship between them was studied. They also controlled for the effects of geography, in terms of distance from the widely-accepted origin population of humanity in Africa. Overall there was not a significant statistical signal of a relationship, but the speaker discussed further aspects of the serial founder effect which could be investigated to get a more detailed picture.
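The serial founder effect mentioned in this talk is simple to simulate. The sketch below is our own toy illustration and has nothing to do with the speaker’s data: the function name serial_founder, the population sizes and the number of migration steps are arbitrary assumptions. At each step a small founder group is drawn from the current population and grows back to full size, and we count how many distinct variants survive.

```python
import random


def serial_founder(n_variants=200, pop_size=1000, founders=50,
                   n_steps=8, seed=1):
    """Return the number of distinct variants after each founding event."""
    rng = random.Random(seed)
    # Starting population: each individual carries one of n_variants variants.
    population = [rng.randrange(n_variants) for _ in range(pop_size)]
    diversity = [len(set(population))]
    for _ in range(n_steps):
        # A small group leaves to found the next population...
        founder_group = rng.sample(population, founders)
        # ...which grows back to full size by copying the founders.
        population = [rng.choice(founder_group) for _ in range(pop_size)]
        diversity.append(len(set(population)))
    return diversity


diversity = serial_founder()
print(diversity)
```

Because each new population can only contain variants carried by its founders, the count can never go up from one step to the next, and it drops sharply at the first bottleneck. The study’s question was whether linguistic diversity shows a similar pattern with distance from Africa.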
Tuesday in Tübingen
Today we (Jotun and Jimi) attended the second day of the Phylogenetic Methods in Historical Linguistics workshop in Tübingen. We heard several interesting talks on a range of topics, including one from Jotun himself. Here’s a summary of what we heard about today.
Update: we also have written summaries of Wednesday’s talks and Thursday’s talks.
Introduction to comparative biology and statistical alignment – Jotun Hein, University of Oxford
Jotun’s talk introduced some fundamental ideas in comparative biology, going into some detail of basic statistical alignment techniques, and summarising some potential applications to linguistics via projects done in the Hein group and at Oxford more widely.
He discussed the significance of the different possible structures that can be used to model data in these settings. Michael Golden’s work on protein structure evolution was an example of the use of more complex data structures than models which operate on plain strings or sequences. Jotun also mentioned Jimi Cullen’s work on context-dependent sequence evolution and Luke Kelly’s impressive work with Geoff Nicholls on lateral transfer in stochastic Dollo models.
There was a widely enthusiastic reaction to Jotun’s mention of his brief work with Markus Gerstel on transforming English grammar into German through the use of treebank data – this idea could be worth reviving!
What can shared structures tell us about linguistic similarity? – Andrea Fischer, Universität des Saarlandes
This talk was about the use of ideas from information theory to measure similarity between languages. By finding sequential correspondences between parts of similar words in different languages, it’s possible to compress the list of words by defining rules determining the relationships between letters or groups of letters in words in different languages. Using MDL (minimum description length), it’s possible to get an idea of how close two languages are, in a measure roughly inspired by how easy it might be for a speaker of one language to read text in the other language. The process also infers correspondences between parts of words, which in certain circumstances correspond strongly to known (or sensible) linguistic correspondences.
This seems like a rich work developed in a powerfully modular way, making use of a variety of techniques and ideas from information theory, making it quite extensible and modifiable.
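The intuition that similar languages compress well together can be tried out with an off-the-shelf compressor. The sketch below uses the normalised compression distance with zlib, which is a much cruder technique than the MDL model from the talk and is named here plainly as a substitute. The word lists are short illustrative samples we typed in, and the function names are our own.

```python
import zlib


def compressed_size(text):
    """Number of bytes zlib needs for the UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(a, b):
    """Normalised compression distance: near 0 for near-identical texts,
    near 1 for texts that share nothing the compressor can exploit."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


# Illustrative word lists (same meanings, three languages).
german = "hand fuss wasser feuer stein haus nacht milch mutter vater bruder schwester"
dutch = "hand voet water vuur steen huis nacht melk moeder vader broer zuster"
finnish = "käsi jalka vesi tuli kivi talo yö maito äiti isä veli sisko"

for name, other in [("German", german), ("Dutch", dutch), ("Finnish", finnish)]:
    print(f"German vs {name}: {ncd(german, other):.2f}")
```

On lists this short the compressor’s fixed overhead dominates, so the numbers should be treated as illustrative only. A general-purpose compressor also only finds repeated strings, whereas the approach in the talk learns correspondences between different letters or letter groups, which is where the linguistic interest lies.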
Genes, speech, and language – Dan Dediu, Max Planck Institute for Psycholinguistics
This speaker strove to convince us of the interesting connections between genetics and language. The first example he gave was to do with hearing and deafness. He pointed out several specific genetic mutations which (sometimes surprisingly) lead to deafness. One such mutation has led to higher rates of deafness in some communities where it’s very common, and this in turn has led to the development of sign languages in those communities. As the speaker pointed out, this is a compelling language example of genetic-cultural co-evolution.
He went on to describe his work on identifying links between anatomical vocal variation and language variation. This was focussed on a study of languages which include certain types of “click” sounds, based on the hypothesis that such languages are more likely to develop in communities where people typically have a very small alveolar ridge. He has conducted experiments that seem to indicate that this does indeed make it easier to produce these sounds, and that it does seem to be a common feature of the mouths of people from some areas where click-languages arose. He gave the impression that this is the tip of the iceberg, and that there’s a lot more work to be done before any major conclusions can be drawn, but it seems like a promising line of study.
Constructing language phylogenies on different kinds of data – Stephan Eekman, University of Amsterdam
The speaker went through some phylogenies of North Germanic languages that he had generated using standard tools, by varying the type of input data. He used lexical data (which has been most widely used for inferring language phylogenies in the past), phonological data, and morphosyntactic data, as well as running his models on a dataset combining all of these types. He used the standard Swadesh word list, as well as a word list he had constructed himself of vocabulary relating to domestic animals (apparently inspired by a sheep he saw out the window while working on the project!).
He was surprised to find that the best performance by far seemed to come from the models run on the domestic animals data. He discussed some aspects of this type of vocabulary which might have contributed to these promising results. Overall the talk raised things to think about when designing future studies more than delivering any major conclusions of its own.
Applying three evolutionary models to linguistics – Andrew Meade, University of Reading
This wide-ranging talk covered three main topics. The first of these was heterogeneous rates of change on a phylogenetic tree, with examples of animal size change in biology and language change with migration in linguistics. The second topic was on phonemic phylogenetic influence, and applying the concept of concerted evolution to identify regular sound changes in language change. The third was applying population genetics methods to word use – exploring how it comes to be that there are many widely-used words for “sofa” but essentially just one for “axle”.
This talk was a strong demonstration of the potential of these methods from comparative biology when applied to linguistics, and suggested that the speaker has generated quite a wide-reaching body of work with these approaches.