Is it possible to identify your gender based purely on the text of the messages you write?
What about determining the age and gender of hundreds of thousands of anonymous people, just from the millions of messages received from them? This was the peculiar question that Texperts, the UK’s premier answering service, hoped to answer when they contacted ThinkTank Maths.
The Texperts (now KGB Answers) answer any question imaginable, from “What is the meaning of life?” to “Do sheep shrink in the rain?”. With the rise of the smartphone, Texperts were faced with the necessity of re-inventing their business model. To achieve this, they realised they needed demographic information (such as gender or age) on their customers, and that it would have to be extracted from their own huge, unstructured repository of millions of SMS conversations.
Texperts initially turned to several Natural Language Processing experts, but the results were unsatisfactory. These experts consistently treated the problem as a variant of spam filtering, assigning scores to keywords or applying Bayesian classification. It emerged, however, that such methods were unable to deal with the high dimensionality and enormous size of the dataset.
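
For readers unfamiliar with the spam-filter style of approach mentioned above, the following is a minimal sketch of a naive Bayes text classifier with add-one smoothing. It is purely illustrative: the toy messages, the neutral labels `group_a` and `group_b`, and the function names `train` and `predict` are all assumptions made for this example, and it is not the code used by the NLP experts or by ThinkTank Maths.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (message text, label). Not from the Texperts dataset.
TOY_SAMPLES = [
    ("what time is the match tonight", "group_a"),
    ("who won the match", "group_a"),
    ("best recipe for soup", "group_b"),
    ("how long to cook soup", "group_b"),
]


def train(samples):
    """Count words per label; these counts are the whole 'model'."""
    word_counts = defaultdict(Counter)  # label -> word -> occurrences
    label_counts = Counter()            # label -> number of messages
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TOY_SAMPLES)
print(predict(model, "match tonight"))
print(predict(model, "cook soup"))
```

Every distinct word becomes its own dimension in such a model, which hints at why the vocabulary of millions of informal SMS messages leads to the high dimensionality described above.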