The purpose of this article is to explain what semantic analysis is, what it means in the context of machine learning and data science, and why it’s important to marketers. But chances are, you knew some of that before you even read this sentence. “Semantic analysis” is right there in the title, and you know this publication targets marketers, not linguists. You might also have noticed that I work for a company that specializes in machine learning technology and that there are some computer-y sounding headings a little farther down.
You used the contextual clues surrounding the words and phrases on this page to better understand the implied or practical meaning of the content of this article. That’s semantic analysis (SA). As humans, we do this really efficiently and almost unconsciously. We filter all the context surrounding a word/phrase/object/scenario, pull out the relevant pieces, compare them against our past experiences, and use them to deepen our understanding of the content at hand.
Machines have historically sucked at this because they lacked that filter: the ability to determine what is relevant and why. Advances in machine intelligence and Natural Language Processing (NLP) have changed that. Through advanced algorithms, powerful computers, and a lot of practice, machines are getting much better at deep semantic analysis.
Machine-driven semantic analysis has a number of real-world applications. It helps:
- extract relevant and useful information from large bodies of unstructured data
- find an answer to a question without having to ask a human
- discover the meaning of colloquial speech in online posts
- uncover specific meanings of words used in foreign languages mixed with our own
Before we get into some practical examples of why that matters to you as a marketing professional, let’s take a brief look at the history of text analysis (aka text mining) in marketing.
In the beginning, there was Textual Analysis (and it was… not good)
In the early days of AdTech, people wrote programs that could scrape huge amounts of data and look for words and phrases that recurred frequently. (Remember word clouds?) The implication was that frequency was a signal of importance. Even if we overlook that erroneous assumption for a minute, there are still glaring gaps. Someone has to look at those results and determine why a word recurs so often and what it means to them. That is very difficult to do with words taken out of context, especially when words can have so many different meanings and connotations:
- whip (Cool Whip, bullwhip, whip-smart, ghost ride the whip)
- jaguar (more on that example below)
- run, take, break, apple, crane, date, foil (the list goes on)
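To make the gap concrete, here is a minimal sketch of that frequency-counting approach. The sample posts are invented for illustration, and `word_counts` is a hypothetical helper, not any particular vendor's method.

```python
import re
from collections import Counter

# Hypothetical sample posts, invented for illustration only.
posts = [
    "Cool Whip on pie is the best",
    "He cracked the whip at the rodeo",
    "She is whip-smart and funny",
    "Ghost ride the whip all night",
]

def word_counts(texts):
    """Count how often each lowercase word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters only, so "whip-smart" yields "whip" and "smart".
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

counts = word_counts(posts)
print(counts["whip"], counts["the"])
```

The counter reports "whip" as often as it reports "the," and it has no idea that the four "whips" are a dessert topping, a rodeo tool, a compliment, and a car. Frequency alone tells you a word showed up, not what it meant.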
And then there was tagging…
Tagging was essentially an attempt to use a human’s nuanced understanding of content to create a system that a machine could propagate on a large scale. We choose some words (taken out of context!) that we hope will convey some meaning to a reader. The errors pile up fast—redundant tags, misspelled tags, inconsistently applied tags, over-tagging—and get multiplied by every person using the system. As systems began to improve, at least we saw people actually using search behavior to guide tag taxonomies, but we’re still only guessing at how an individual user will conceptualize or search for a piece of content.
(We are not saying that you shouldn’t tag your content. Tags are an important component of semantic understanding, and they serve other purposes too; see our post on Open Graph Tags. Just have an authoritative, data-driven taxonomy for your tags or at least a defined set of rules.)
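As a rough sketch of what "a defined set of rules" could look like in practice, the snippet below maps free-form tags onto a small controlled vocabulary and sets aside anything it doesn't recognize. The `TAXONOMY` contents and the `normalize_tags` helper are assumptions made up for this example.

```python
# Hypothetical controlled vocabulary: canonical tag -> accepted variants.
TAXONOMY = {
    "semantic-analysis": {"semantic analysis", "semantics", "semantic-analysis"},
    "machine-learning": {"machine learning", "ml", "machine-learning"},
}

def normalize_tags(raw_tags):
    """Map free-form tags to canonical ones; return (accepted, rejected)."""
    lookup = {
        variant: canonical
        for canonical, variants in TAXONOMY.items()
        for variant in variants
    }
    accepted, rejected = [], []
    for tag in raw_tags:
        # Lowercase and collapse repeated whitespace before looking up.
        key = " ".join(tag.lower().split())
        canonical = lookup.get(key)
        if canonical is None:
            rejected.append(tag)          # misspelled or off-taxonomy tag
        elif canonical not in accepted:
            accepted.append(canonical)    # drop redundant duplicates
    return accepted, rejected

accepted, rejected = normalize_tags(
    ["ML", "Machine  Learning", "semantics", "machne learning"]
)
print(accepted, rejected)
```

Redundant and inconsistently formatted tags collapse into one canonical tag, and the misspelled one gets flagged for a human instead of quietly polluting the system.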
Sentiment Analysis makes a splash
As social media and user-generated content took over the web, marketers got hungry to mine this massive data set for meaning, but discovered a new challenge: knowing if someone is talking about a given topic or brand is less important than knowing how they are feeling and talking about you. A number of social analytics platforms began offering “hot or cold” analyses of topics and brands. While this seems like a nuanced understanding of language, it is really just a layering of explicit understanding (e.g. if the word “sucks” appears alongside my brand, and I know that sucks = negative, then I can infer that what’s being said about my brand is negative). This is still the computer equivalent of rote learning, and we’re never going to get SkyNet to become sentient that way.
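Here is a minimal sketch of that kind of "hot or cold" scoring, assuming a tiny hand-made word list. The `LEXICON` entries, the brand name "Acme," and the `lexicon_sentiment` helper are all hypothetical; real platforms use far larger lists and more rules.

```python
import re

# Hypothetical mini-lexicon: word -> sentiment score.
LEXICON = {"sucks": -1, "terrible": -1, "love": 1, "great": 1}

def lexicon_sentiment(text, brand):
    """Label a post that mentions `brand`; return None if the brand is absent."""
    words = re.findall(r"[a-z']+", text.lower())
    if brand.lower() not in words:
        return None
    # Add up the scores of every known word; unknown words count as 0.
    score = sum(LEXICON.get(word, 0) for word in words)
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"

print(lexicon_sentiment("The new Acme vacuum really sucks up dirt", "Acme"))
```

That post is praise for a vacuum cleaner, but the lookup labels it negative because "sucks" is on the list. That is the rote-learning problem in a nutshell.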
Enter Semantic Analysis
Here’s where we have to do a bit of hand waving, because the science behind true SA is not something you can really elucidate in a 1,000-word article. (If you would like to read 17,000 MORE words on Semantic Analysis and Natural Language Processing, this is a good piece.) Semantic analysis is not about teaching the machines; it’s about getting them to learn. From a data processing point of view, semantics are “tokens” that provide context to language. They provide clues not only to the meaning of words, but to their relationships with other words and other tokens. The goal, as it is for any good reader, is to look beyond the words on the page to see the meaning.
Successful SA requires that a program look at capital-m-massive data sets, and at that scale, it has to be making a lot of (correct) assumptions for itself. It’s about taking things that a computer can easily glean from data by looking at frequency, proximity, and many, many other factors, and using them to make meaningful cognitive leaps. For example, a computer can see patterns that tell it these things:
- “dalmatian” and “dog” are semantically related.
- “dalmatian” and “spotted” are more closely related than “dog” and “spotted.”
- “dalmatian” is more frequently capitalized than other nouns.
- “spotted” can mean “seen” or “dotted.”
To achieve the goal—true semantic understanding—the computer would have to make the connection that a Dalmatian is a spotted breed of dog.
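The first two patterns in that list are the kind of thing a program can glean just by counting which words show up together. Below is a toy sketch using a six-sentence corpus invented for this example; `cooccurrence_rate` is a hypothetical helper, and real systems work with vastly more data and more sophisticated measures.

```python
# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "the dalmatian is a spotted dog",
    "a spotted dalmatian puppy played outside",
    "the dog barked at the mailman",
    "my dog loves the park",
    "we spotted a deer in the park",
    "the dalmatian dog breed is famous",
]

# Treat each sentence as a set of words.
sentences = [set(sentence.split()) for sentence in corpus]

def cooccurrence_rate(a, b):
    """Share of the sentences containing `a` that also contain `b`."""
    with_a = [words for words in sentences if a in words]
    if not with_a:
        return 0.0
    return sum(1 for words in with_a if b in words) / len(with_a)

print(cooccurrence_rate("dalmatian", "spotted"))
print(cooccurrence_rate("dog", "spotted"))
```

In this toy data, "spotted" turns up in a larger share of the "dalmatian" sentences than the "dog" sentences. Notice what the counts can't do, though: they don't distinguish "spotted a deer" (seen) from "spotted dog" (dotted), and nothing here states that a Dalmatian is a breed of dog. That leap is the hard part.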
Why is Semantic Analysis so important for delivering relevant content?
Why do we care if a computer knows that a Dalmatian is a spotted dog? If it knows that, then when it sees someone looking for “spotted dog,” it can know to connect them to content containing “Dalmatian Puppies.” (Settle down, Cruella… it’s easier said than done.) Now multiply that across millions of users and tens of millions of interactions, and you have a hint of where the value lies.
In order to make sure content is relevant to the user, you need two basic components: an understanding of the user and an understanding of the content. Fundamentally, the problem with establishing relationships between pieces of content is that most “scraping” or data capture technology simply doesn’t understand the language within a document very well. There MAY be very simplistic levels of machine learning involved, but they rely heavily on provided tags and a cursory understanding of the individual words on the page, which leaves a lot of room for improvement.
Let’s look at another example:
If you search for the term “jaguar,” your search will return results for:
- A luxury car
- A large feline predator
- A football team
- An operating system
- And others that might surprise you
The goal of SA is to pair you with the “jaguar” content you’re actually looking for, and it will take a two-pronged approach to achieve that goal:
- Find contextual clues in your past or real-time behavior (Did your search include the word “sedan?” Did you search for “zoo” recently?).
- Look at all the content at its disposal where “jaguar” or related words occur to determine whether that other content will be the best match for your search. (“Leopard” also occurs frequently with “OS,” but not with “car.” “Panther” also occurs frequently with “Jaguar” and “NFL.”)
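The two prongs above can be sketched, very roughly, as a word-overlap score. The `SENSES` profiles below are hand-written assumptions standing in for what a real system would learn from content, and `disambiguate` is a hypothetical helper, not a production technique.

```python
# Hypothetical sense profiles: words assumed to co-occur with each meaning.
# A real system would learn these from content rather than hard-code them.
SENSES = {
    "car": {"sedan", "luxury", "dealer", "engine"},
    "animal": {"zoo", "feline", "predator", "jungle"},
    "football team": {"nfl", "jacksonville", "touchdown", "quarterback"},
    "operating system": {"os", "mac", "leopard", "apple"},
}

def disambiguate(query, recent_searches=()):
    """Pick the sense whose profile overlaps most with the user's context."""
    # Prong one: gather clues from the current query and past searches.
    context = set(query.lower().split())
    for past in recent_searches:
        context.update(past.lower().split())
    # Prong two: compare those clues with words known to surround each sense.
    scores = {sense: len(profile & context) for sense, profile in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("jaguar sedan"))
print(disambiguate("jaguar", recent_searches=["zoo tickets"]))
print(disambiguate("jaguar"))
```

With no clues at all, the sketch returns `None` rather than guessing, which is roughly the position a keyword-only system is in every time.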
How many connections it can make and how well it can understand the relationships between those connections determine the relevance of your experience. And, ultimately, relevance is both the goal and the unit of measure when it comes to Semantic Analysis. If we can understand the content and the user behavior at a deep, semantic level, we can deliver more relevant content and thereby create a more resonant user experience.