LEAFTOP Language Explorer

What is this?

You are looking at the first draft of an idea Greg Baker had as he was creating the LEAFTOP dataset. The LEAFTOP dataset is a set of around 300 nouns extracted automatically from Bible translations. (The real number is actually higher than this, but this is all that Greg's code can do at the moment.) For each extracted noun, it is possible to calculate a confidence score: how much more likely this word is to be a good translation than the next nearest possibility.
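To make the confidence score a little more concrete, here is a minimal sketch of one plausible reading of it: the score of the best candidate translation divided by the score of the runner-up. This is an assumption for illustration only. The function name `confidence`, the ratio formula and the candidate scores are all made up here and are not taken from Greg's code.

```python
def confidence(candidate_scores):
    """Ratio of the best candidate's score to the runner-up's score.

    ASSUMPTION: the ratio is only one way to express "how much more likely
    than the next nearest possibility"; the real LEAFTOP calculation may differ.
    candidate_scores maps each candidate word to a positive score.
    """
    ranked = sorted(candidate_scores.values(), reverse=True)
    if len(ranked) < 2:
        # With no competing candidate there is nothing to compare against.
        return float("inf")
    best, runner_up = ranked[0], ranked[1]
    if runner_up <= 0:
        raise ValueError("scores must be positive for a ratio to make sense")
    return best / runner_up


# Hypothetical candidates for one noun in one translation (invented numbers).
example_candidates = {"candidate_1": 0.75, "candidate_2": 0.25, "candidate_3": 0.125}
example_confidence = confidence(example_candidates)
```

A score close to 1 would mean the two best candidates are nearly tied, and a large score would mean the best candidate stands well clear of the rest.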

Taking that a bit further, we can then do a Spearman correlation on the confidence scores between two languages. Very similar languages will pose similar problems for translators (or be similarly easy on some words and concepts) so languages that are similar should have high Spearman correlation scores. Languages that are highly dissimilar should have low Spearman correlation scores.
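As a rough illustration of the correlation step, the sketch below computes a Spearman correlation by ranking each language's confidence scores and then taking the Pearson correlation of the ranks. The function names, the restriction to nouns present in both languages, and the scores themselves are assumptions made for this example, not details of the explorer's actual code.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Undefined when one side is constant; returning 0.0 is a choice made here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def spearman(conf_a, conf_b):
    """Spearman correlation over the nouns that both languages have scores for."""
    shared = sorted(set(conf_a) & set(conf_b))
    xs = [conf_a[noun] for noun in shared]
    ys = [conf_b[noun] for noun in shared]
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented confidence scores for three hypothetical languages.
lang_a = {"water": 5.0, "bread": 3.0, "stone": 1.5, "fish": 2.0}
lang_b = {"water": 9.0, "bread": 4.0, "stone": 1.1, "fish": 2.5}
lang_c = {"water": 1.0, "bread": 2.0, "stone": 4.0, "fish": 3.0}

similar = spearman(lang_a, lang_b)     # same ordering of nouns
dissimilar = spearman(lang_a, lang_c)  # reversed ordering of nouns
```

Because Spearman only looks at rank order, `lang_a` and `lang_b` count as perfectly correlated even though their raw scores differ. What matters is which nouns were easy and which were hard, not the size of the numbers.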

Since it is really boring just looking at correlation scores, Greg decided to make an interactive explorer. Each language is connected to the 8 languages with which it has the highest correlation. Each of those 8 is then connected to the 4 languages it is most correlated with, and each of those gets 2. But since these sets aren't exclusive, the same languages can be picked again and again, so a tight bundle can end up containing not many languages.
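The 8, then 4, then 2 expansion can be sketched as a short breadth-first walk over a table of correlations. This is only an illustration under stated assumptions: `top_neighbours`, `build_edges`, the nested-dictionary layout and the toy correlation values are all hypothetical, and the example uses a smaller fan-out of 2 then 1 so that it fits four made-up languages.

```python
def top_neighbours(corr, lang, k):
    """The k languages most correlated with lang (ties broken by name)."""
    others = [(-score, other) for other, score in corr[lang].items() if other != lang]
    others.sort()
    return [other for _, other in others[:k]]


def build_edges(corr, root, fanout=(8, 4, 2)):
    """Connect root to its top fanout[0] languages, each of those to their
    top fanout[1], and so on. Edges are undirected and stored in a set, so a
    link picked more than once appears only once.
    """
    edges = set()
    frontier = [root]
    for k in fanout:
        next_frontier = []
        for lang in frontier:
            for other in top_neighbours(corr, lang, k):
                edges.add(frozenset((lang, other)))
                next_frontier.append(other)
        frontier = next_frontier
    return edges


# Toy correlations between four hypothetical languages (invented numbers).
corr = {
    "A": {"B": 0.9, "C": 0.8, "D": 0.1},
    "B": {"A": 0.9, "C": 0.7, "D": 0.2},
    "C": {"A": 0.8, "B": 0.7, "D": 0.3},
    "D": {"A": 0.1, "B": 0.2, "C": 0.3},
}

edges = build_edges(corr, "A", fanout=(2, 1))
languages_shown = set().union(*edges)
```

In this toy case A picks B and C, and both B and C pick A straight back, so the whole graph is just three languages and two links. That is the tight bundle effect in miniature.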

What can I do with it?

You can waste time while you are supposed to be working on your linguistics PhD. You can try to disprove the Nostratic hypothesis by finding links between Quechuan and Russian. You can see if this data matches up with your latest theory about connections between African languages. You can suggest feature improvements to Greg (gregory.baker2 is the username, and the domain is hdr.mq.edu.au). You can cite this as part of your research so that we can make this explorer respectable enough to become something you give to your undergraduate students.

But this is obviously wrong! Everyone knows that language X and language Y aren't related

No doubt this will happen, particularly where language X is spoken in a country that was first evangelised by speakers of language Y. A data set of nouns from the Bible is going to contain a lot of loan words from language Y, precisely for the Bible-specific vocabulary that language X didn't previously have. This will skew X and Y together very strongly.

But the fun thing to observe is how often it is right even though this program has not been trained on any data other than Bible translations. There's no model here that was given a family tree of Indo-European languages: it has just figured these relationships out by itself.

I'm convinced. I want to play with it now

Pick a language from the list below. On the following page you can drag languages around or click on them. Have fun!