Text Mining - How to Transform Unstructured Data into Insight

Article
Abstract - Plastic letter tiles from a word game, falling from the sky on a black background

Text mining is a natural language processing technique that helps analysts generate powerful insights by finding meaningful patterns and tendencies in unstructured data.

Some of the main objectives text mining can be used for include:

  • Information extraction
  • Summarization and categorization
  • Sentiment analysis
  • Visualizing and clustering

Other techniques can be applied beyond these core business objectives, but for most businesses in their day-to-day operations it’s these that will lead to the largest breadth of insights.

What is unstructured data?

First we must understand what unstructured data is, and how we can shape and mold it to fit our business purposes. Unstructured data is everything that touches your business and can be measured. Sometimes, it may already include structure - receipts, payroll, consumer opinion polls, for example.

More often than not the text data you want to use for analysis will not have structure.

Think about things like a giant comment thread on social media, customer surveys with short answer boxes, emails from customers, reviews on third party websites... or anything else your business might care to keep tabs on.

In short, unstructured data is just about everything we can find out in the world. While the examples above limit our analysis to only textual mediums, a dynamic 21st century enterprise cannot afford to keep such blinders on.

Theme cloud of themes associated with data silos using Talkwalker's Quick Search.

Talkwalker's Quick Search applies structure to these unstructured datasets in seconds.

Talkwalker is the best-in-breed consumer intelligence platform to measure and visualize customer experience, text mining, and unstructured data analytics. Talkwalker helps enterprises broaden their horizons by including a vast array of mediums including but not limited to video, images, and audio via podcasts.

Create structured data from unstructured data

Of course boundaries are needed, and the first step to generating powerful insights is to give structure to big, unstructured datasets. This means converting the dataset from whatever medium it exists in its original form (video, audio, image, text) into text.

Talkwalker does this in a variety of ways including by identifying logos in images and videos, but also objects (person, dog, tie, cars, etc.) and scenes (airplane, bar, mountain etc.). And also in audio through speech-to-text conversion. This allows for a broad variety of text documents to incorporate into your data analysis.

When using Talkwalker this process happens behind the scenes and is seamless, appearing before you in seconds. The other alternative is to learn programming languages like R and Python and integrate them into your own machine learning algorithm.

Visualizing unstructured data

Once you tame your dataset and it’s fit for text analysis, you can set about the important business of mining your text and applying natural language processing techniques to it to glean insights about your business, your competitors, or your customer base.

Summarization is one natural language processing and visualization technique you can apply. This helps you to understand the scope of your dataset and can immediately speed up your time to insight from text analysis: if using a customer email dataset for example - how many customer email complaints did you receive this month? Is that number more or less than from last month?

Text mining summarization can be used to display things like total results, or secondary metrics. The image shows the total results and results plotted over time.

Summarizing the dataset can include measuring total instances of it occurring, a secondary metric occurring like social media engagements, or plotting results over time.

Categorization is another natural language processing technique that can accomplish similar goals. Sticking with the email dataset for example, what amount of emails included an email signature when they wrote to you? From this, can you learn what is the number one title or occupation your customers have? How about the third or fourth most likely occupation?

Categorization in text mining can help display and of course categorize different metrics, shown here is occupations. The most common occupation in this dataset is authors.

Categorization can help enterprises understand who their customers are.

Clustering is the most visually striking machine learning technique (in my humble opinion), and again the goal here is to generate quick insight. By clustering similarly themed data points in a visualization (1, below), the analyst is able to quickly see and exploit connections between different themes and ideas that may not have any similarity at first glance.

Clustering, visualiized here as Conversation Clusters can help generate quick insight by showing degrees of separation and display how virality works on social media.

Clustering conversations around green mobility in the United States allows us to draw some interesting connections.

1: Using a natural language processing technique known as n-grams that finds the most common and likely similarly nearby words - it’s the same magic that your phone or Google use to predict your next SMS or search.

Building a data lake data lake for your enterprise: A data lake is a central compository where all the data that could be useful and should be mined for insights by your data scientists and researchers lives. Within it you can apply text mining algorithms and other forms of deep learning algorithms i.e., artificial intelligence and rule based queries to your data models.

Text mining for sentiment analysis

Sentiment analysis for text mining is one of the most powerful techniques to apply to natural language processing. With sentiment analysis applied to structured text data, enterprises can finally answer -- at scale and on a detailed level -- the big question they always want to know: Do people like this?

Now sentiment analysis is tricky on a technical level, but it’s also subjective, making it trickier still. For a brand, a phrase like “that’s sick” could be very positive. Or absolutely not.

That nuance and judgement is difficult to incorporate into natural language processing. We compiled a list of some of the best tools to use for sentiment analysis.

Unstructured data analytics tools

We also made a full list of text analytics tools that you should peruse. It also contains a useful glossary explaining in greater detail some of these text mining concepts like natural language processing and text mining.

Here are three tools I recommend for unstructured data analysis.

Talkwalker Analytics | Open platform for unstructured data visualization

Talkwalker is an open platform, which means that we support just about every variety of dataset and file source imaginable. This allows the data analyst to cross-compare against internal and external, owned, and third-party data sources.

This Conversation Cluster™ is visualizing Gen Z's thoughts and fears about the pandemic and vaccines.

Clustering and visualizing the important topics of conversation in society is easy, and interesting, with the right toolset.

Orange | Open source data mining

This free tool allows for text mining, data mining, visualizations, and more. It contains a lot of features and filters, and also includes visualization support.

Orange open-source data mining platform

Open source machine learning and data visualization.

RapidMiner | Data mining framework

RapidMiner is open source, and provides users with a graphic user interface that includes text processing, web mining, reporting, sentiment analysis, series processing, and more.

RapidMiner data mining framework

RapidMiner offers text processing, web mining, reporting, series processing, plus more.

TEXminer | Free text mining tool

TEXminer is another free and open source tool to support you going from unstructured to structured data analysis. This data analysis tool supports ASCII and PDF formats, and also includes language identification. It also supports co-occurrence analysis, central expressions, and analysis of letter frequency.

TexMiner text mining tools

Free text mining tool TEXminer

Takeaway

Wrapping things up, text mining can help apply structure to unstructured datasets. It’s our job as analysts and insights jockeys to help create the boundaries of what we search for so as to be able to give context and meaning to what we find. It’s important to arm yourself with the best tools available. Don’t just focus on impact, focus on connections.

Want more like this?

Want more like this?

Insight delivered to your inbox

Keep up to date with our free email. Hand picked whitepapers and posts from our blog, as well as exclusive videos and webinar invitations keep our Users one step ahead.

By clicking 'SIGN UP', you agree to our Terms of Use and Privacy Policy

side image splash

By clicking 'SIGN UP', you agree to our Terms of Use and Privacy Policy