All You Ever Wanted to Know about Text Mining
Did you know that the pages you like on Facebook and the bulk emails you send out could actually be content that is valued by a company?
Image Credit http://www.canstockphoto.com
If you’ve heard anything about big data or analytics, you have surely come across the terms “structured” and “unstructured” data. While structured data typically lives in neatly organized databases, unstructured data, which can be textual or non-textual, is as endless as the horizon. While it’s perfectly plausible that business value can be derived from the analysis of conventional reports, not many people realize that a lot of businesses these days mine sites like Facebook, Twitter and LinkedIn to derive value from text analytics.
Text analytics, or text mining, is the analysis of data available to us in day-to-day spoken or written language. It’s amazing how much of the data we generate can actually be used in text mining: Word documents, PowerPoint slides, chat messages, emails. The patterns and trends gathered from this unstructured data form the output – which of course is relevant and of value to the person who has had the data analyzed. While this is a layman’s understanding of text analytics, here’s a look at some popular text mining tasks:
• Text categorization: The start of the text analytics process; assigning pre-defined categories to free text.
• Text clustering: Helps filter documents quickly by grouping documents or texts into subsets (clusters).
• Sentiment analysis: Also referred to as Opinion Mining, this method is used to extract subjective information from content. Just as the term suggests, it has to do with emotion, sentiment – basically, understanding a subject’s emotional response in a context.
• Concept/entity extraction: Identifying the concepts and named entities (people, places, organizations) that a stretch of text refers to, even when they are expressed loosely. Pulling a clear concept out of a bunch of loosely strung words: that, just as “loosely,” defines this process.
• Entity relation modeling: Here’s another process where logic matters. It identifies and models the logical relationships between the entities found in text, for example, which person works for which company.
• Document summarization: Extraction of relevant information from various texts written about the same topic.
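To make one of these tasks concrete, here is a deliberately tiny sketch of sentiment analysis. The positive and negative word lists are made up for illustration; real sentiment analysis relies on much richer lexicons and statistical or machine-learned models, not a hand-rolled word count like this.

```python
# Toy sentiment scorer: counts hits against small, made-up
# positive/negative word lists. For illustration only.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "poor", "angry"}

def sentiment(text):
    words = text.lower().split()
    # Net score: positive hits minus negative hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this excellent product"))  # positive
print(sentiment("Poor service, I hate waiting"))   # negative
```

Even this toy version shows the core idea: mapping free text onto a subjective judgment a machine can tally.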
So there’s some jargon! The application of these text mining techniques to gain further business insight or solve business problems is, essentially, the definition of text analytics. The challenges are many, but they all stem from the fact that the vast amount of unstructured text available has to be processed before it becomes analyzable. And that happens only through analytical methods (such as those described above), natural language processing, statistical modeling and machine learning techniques.
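That processing step usually begins with very basic normalization: lowercasing the text, splitting it into words, and counting terms. A minimal sketch using only Python’s standard library (the sample sentence is invented):

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep only alphabetic runs: a minimal
    # normalization step before any real analysis.
    return re.findall(r"[a-z]+", text.lower())

doc = "Text analytics turns unstructured text into structured, countable data."
counts = Counter(tokenize(doc))
print(counts["text"])  # 2
```

Real pipelines add stemming, stop-word removal and much more, but turning raw prose into countable units is the common starting point.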
Before we actually move on to understanding a few ways to extract data from large quantities of text, it’s essential to know that text analytics isn’t something new. It’s being talked about a lot now, and its scope is far larger than it was a decade ago. In fact, the history of text analytics traces back to World War II, when governments first developed “content analysis.” Basically, numeric codes would be assigned to various concepts and ideas that featured in all kinds of content – magazines, records, documents. The sum of these numeric codes could be used to track the frequency, popularity and development of those ideas. There’s a lot that happens after a topic is assigned a numerical code, but that is the basic first step in how text analytics is actually done.
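That wartime coding scheme is simple enough to sketch in a few lines. The code book and documents below are invented for illustration; the point is only the mechanism of mapping concepts to numbers and tallying them.

```python
from collections import Counter

# Hypothetical "code book" in the spirit of wartime content
# analysis: each concept of interest gets a numeric code.
CODE_BOOK = {"economy": 1, "war": 2, "peace": 3}

def code_document(text, code_book):
    # Replace each recognized concept word with its numeric code.
    tokens = text.lower().split()
    return [code_book[t] for t in tokens if t in code_book]

docs = ["peace talks follow the war",
        "the economy suffers in war"]
codes = [c for d in docs for c in code_document(d, CODE_BOOK)]
freq = Counter(codes)
print(freq[2])  # "war" (code 2) appears twice across the collection
```

Summing and comparing those counts over time is exactly how the frequency and development of ideas was tracked.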
Text analytics software is an emergent technology that can translate words and phrases in unstructured data into numerical values – just as in the times of the World Wars. This can be later linked with structured data in a database and analyzed with traditional data mining techniques. It is perfectly possible to derive useful insights from text analytics, although many would still say that it is impossible to discount human intelligence in the entire process.
One of the most popular ways to extract information from large texts is to formulate a query. Without being too technical, it suffices to say that the answers to a query are largely dependent upon the understanding of the issue at hand, the clarity of the question and the way it is asked. Another, slightly more technical, way to extract information is through transformations. Transformations are the corollary to the query system: some assumptions are made about the basic content, and some conditions are entered into the software so that the content reveals implications that are not otherwise apparent. Of course, you could always have a well-trained human mind doing the research. But that would be subject to human opinion, bias, error and misinterpretation. Even so, some amount of human logic needs to be applied to the entire exercise of text analytics.
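At its simplest, a query is just a set of terms matched against a collection of documents. This toy example (documents and IDs are invented) returns every document containing all the query terms:

```python
# Minimal keyword query over a toy document collection:
# return the IDs of documents containing every query term.
docs = {
    "d1": "quarterly sales grew in europe",
    "d2": "sales fell after the product recall",
    "d3": "new product launch planned for europe",
}

def query(terms, docs):
    wanted = {t.lower() for t in terms}
    return [doc_id for doc_id, text in docs.items()
            if wanted <= set(text.lower().split())]

print(query(["sales"], docs))               # ['d1', 'd2']
print(query(["product", "europe"], docs))   # ['d3']
```

Notice how the results hinge entirely on how the question is phrased: asking for “sales” and asking for “sales” plus “europe” carve out very different answers, which is exactly the point made above about the clarity of the query.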
With the explosion of free text available to the world, many businesses are searching for automated ways to analyze large volumes of textual data. Popular text analytics tools such as R, Python, SAS and RapidMiner all offer a variety of capabilities for accurately analyzing and reporting on textual data. But the fact remains that the approximations and results produced by these software systems are just the beginning of what text analytics has to offer. What closes the loop is the application of subtle human intelligence to derive business value.