Wednesday, 13 March 2013

Mapping languages on the internet

[Update: as I noted on many of my Google Fusion Table posts, the data are still on Google Drive for you to view, but the new version of GFT no longer offers a polygon or heatmap option, only geocoding by country centroid. I am not sure why, but this, this, this and another example, posted as iframes rather than scripts, preserved the old GFT maps.]

Let's explore global language distribution from Worldmapper, then language usage on the internet as seen from Wikipedia. This was inspired by an article in The Economist as well as data I previously collected and posted on Google Drive to explore various topics on Google Fusion Tables.

Let's first take the map of world languages, which shows their fabulous variety. Hover the mouse over a country to see the number of languages spoken there (DR Congo and Ivory Coast don't map because various databases name them differently, for example spelling out Democratic Republic or using the French Côte d'Ivoire):

Then let's look at Wikipedia's table of the languages used to post, edit and visit its crowdsourced information. The Economist says that whilst it's not a comprehensive sample, interesting conclusions can be drawn from the usage statistics.


Now, in order to merge these datasets, some simplifying assumptions had to be made. Only the most populous language, the first entry in each Worldmapper list, was assigned to each country (that is a gross oversimplification of language distribution). That language was then used as a key to merge with the Wikipedia article language tables. As that merge assigned the full number of articles in each language to every country sharing that language, I simply divided the article count by the number of those countries (assuming an even distribution of wiki authorship across countries is also a gross oversimplification). See earlier on sportstradereligion how normalisation greatly reduces the wild variations of raw numbers.
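For readers who want to reproduce the merge outside Google Fusion Tables, here is a minimal Python sketch of the two steps described above: join countries to article counts on their assigned language, then share each language's count evenly among the countries that carry it. The country names, language names and article counts are hypothetical placeholders, not the real Worldmapper or Wikipedia figures, and the function name `split_articles_evenly` is my own invention for illustration.

```python
from collections import defaultdict

# Hypothetical placeholder rows, NOT the real Worldmapper or Wikipedia data.
# Each country carries only its single most populous language.
country_language = {
    "Country A": "Language X",
    "Country B": "Language X",
    "Country C": "Language Y",
}
articles_by_language = {"Language X": 1000, "Language Y": 300}


def split_articles_evenly(country_language, articles_by_language):
    """Join countries to article counts on language, then share each count evenly."""
    # Count how many countries were assigned each language.
    countries_per_language = defaultdict(int)
    for language in country_language.values():
        countries_per_language[language] += 1

    per_country = {}
    for country, language in country_language.items():
        total = articles_by_language.get(language)
        if total is None:
            # No matching article row for this language: leave the country unmapped.
            continue
        # Even split: assumes authorship is spread equally across countries.
        per_country[country] = total / countries_per_language[language]
    return per_country


shares = split_articles_evenly(country_language, articles_by_language)
for country, share in sorted(shares.items()):
    print(country, share)
```

Dividing by the number of countries keeps the per-language total intact: the shares for one language add back up to its article count, so nothing is double counted on the map.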

Whilst the total Wikipedia articles are overwhelmingly in English:

The distribution of authoring languages by population tells a different story: 

The Economist suggests Wikipedia points to a real resurgence of linguistic awareness and pride by the sheer proportion of articles posted when factored against population... This appears patently true on this map! And please go ahead and visit my Google Fusion Table site to explore it a bit more... 

As a final note, consider the unevenness of the language distribution on Wikipedia: