It indicates that most of the talking happens between Michael, Dwight, Jim and Pam; I also selected Dwight to show how you can inspect the results for a single person. I set the width of the edges proportional to the number of lines spoken between two characters, and the size of each node to represent that character's overall line count. visNetwork is a great tool and very easy to use!
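Since the interactive chart can't be embedded here, here is a minimal visNetwork sketch of how such a network can be built (this is not the original code; the nodes_df / edges_df data frames and their 'lines' columns are assumptions):

```r
library(dplyr)
library(visNetwork)

# Assumed inputs: 'nodes_df' has one row per character with a total line count,
# 'edges_df' has one row per speaker pair with the number of lines exchanged.
nodes <- nodes_df %>%
  transmute(id = name,
            label = name,
            value = lines)        # node size scales with overall line count

edges <- edges_df %>%
  transmute(from = from,
            to = to,
            value = lines)        # edge width scales with lines spoken between the pair

visNetwork(nodes, edges) %>%
  visIgraphLayout() %>%                                          # reproducible layout
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)   # select e.g. Dwight from a dropdown
```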
We now know that the top words include names of other people and occasionally some tokens that hint at a person's identity (e.g. 'Mike' spoken by Darryl, 'Tuna' spoken by Andy or 'Vance' spoken by Phyllis). Let's turn to finding the most common phrases that people use.
A phrase is more than just one word. You can analyze any number of words that follow each other; that is why the methodology is called n-gram analysis. If n happens to be 2, we call them bigrams, and that is what we'll be doing now.
Here's the code to do that.
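The original code isn't reproduced here, but a minimal tidytext sketch of the steps described below might look like this (the 'lines_df' data frame and its 'name' and 'text' columns are assumptions):

```r
library(dplyr)
library(tidyr)
library(tidytext)

bigrams <- lines_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%      # two-word tokens
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%   # split the bigram
  filter(!word1 %in% stop_words$word,                           # drop bigrams where either
         !word2 %in% stop_words$word) %>%                       # token is a stopword
  unite(bigram, word1, word2, sep = " ") %>%                    # glue the tokens back together
  count(name, bigram, sort = TRUE)                              # top bigrams by person
```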
We can leverage the same unnest_tokens function, but this time we call the output column 'bigram', we use the 'text' column as input, we set the token argument to 'ngrams' instead of the default 'words', and we set the n parameter to 2, indicating that we want bigrams.
Sticking to the 'My name is John' example: while tokenization (token = 'words') created the vector ['My', 'name', 'is', 'John'], the bigram method results in ['My name', 'name is', 'is John']. We then separate this new column into two, taking the first and second token of each bigram. Next, we're only interested in bigrams where neither token is a stopword, so that both carry some information; we therefore filter stopwords out of both token columns. As a last step, we unite the two columns by gluing them together with a space. And we're done, we have 'bigramized' the textual data. Here are the results, by person:
Top bigrams (phrases) used by top people
Now this is something! If you're familiar with the show, you clearly see that bigrams are very much capable of identifying people. Who else would use 'nard dog' and 'broccoli rob' other than Andy? Who would be talking about 'business school' and 'mifflin infinity' other than Ryan? Dwight clearly likes the phrases 'regional manager' and 'assistant regional'. If we had used trigrams (3 words making up a phrase), we'd surely see 'assistant regional manager' in Dwight's list.
Bigrams are much more capable of identifying a certain person than simple tokens. However, there's a method that is even more trustworthy than n-grams: it's called tf-idf, and it is the subject of the next topic.
C. tf-idf — finding most personal / unique words by person
I sort of hinted at what this algorithm is capable of, but let me quickly explain how it does that. The tf part of tf-idf stands for Term Frequency, while idf means Inverse Document Frequency.
The first part is straightforward: it takes words and ranks them by their absolute count within each document (Michael is a 'document' here, or at least his vocabulary is; tf finds Michael's top words). It's basic tokenization plus a count aggregation.
IDF is where the magic happens. It looks at how widely a word appears across all documents (in this case the vocabularies of people): if a word is found in most of the documents, it is common; if it appears only in a few particular documents, it is rare.
tf-idf then combines (multiplies) the term frequency within a given document with the inverse document frequency, and determines for each word whether it is unique to a certain document. For example, Michael may have said 'rabies' lots of times while others barely mentioned the word, so the tf-idf algorithm will determine that 'rabies' is a unique word in Michael's vocabulary.
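For reference, this is the standard formula (the convention used by tidytext's bind_tf_idf), where N is the number of documents, i.e. people:

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\left|\{d : t \in d\}\right|}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)
```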
Having understood the basics of the algo, let’s apply it to the data and see what it found. First, the code, then the explanation.
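The code itself isn't shown here; a minimal tidytext sketch of the pipeline described below might look like this (the 'lines_df' data frame and its columns are assumptions, as is the use of the textstem package for lemmatization):

```r
library(dplyr)
library(tidytext)
library(textstem)

office_tf_idf <- lines_df %>%
  unnest_tokens(word, text) %>%                    # string data -> one word per row
  mutate(word = lemmatize_words(word)) %>%         # 'studying', 'studied' -> 'study'
  count(name, word, sort = TRUE) %>%               # word counts per person
  bind_tf_idf(word, name, n) %>%                   # adds tf, idf and tf_idf columns
  group_by(name) %>%
  slice_max(tf_idf, n = 8, with_ties = FALSE) %>%  # keep the top 8 unique words per person
  ungroup()
```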
It seems there’s a lot going on here, but it’s quite simple.
- I unnest the string data into tokens, i.e. simple words.
- I apply lemmatization this time. This is a procedure that tries to get words back to their 'normal', 'root' forms. It may take the words ['studying', 'studies', 'studied', 'study'] and after lemmatization all of them become ['study'], as all derive from that word. This way I drop information about the 'structure' of the words, but gain information about which 'root' words were used the most.
- I then count the words by person, an input tf-idf needs to determine a word's idf (the count function is a group_by and a summarize(count) function in one).
- Then comes the most important part: the bind_tf_idf function from tidytext. It doesn't get much easier than that. It takes the document column (name of the person), the token column (words after unnesting and lemmatization) and the count column (how many times the word occurred in the given document) and runs the tf-idf formula.
- As a last step, I decided to visualize only the top 8 unique words per person, due to ties.

tf-idf on top 12 people
So let's see. The most unique words have been identified. Really, every one of these people can be recognized by a real Office fan just by looking at their tf-idf words.
- Andy: tuna, nard, treble, Jessica
- Angela: Sprinkles, 'parum pum' from Little Drummer Boy
- Darryl: beanie from Justine, Mike
- Dwight: deputy, Mose, sheriff, farm, etc.
Such a simple tool capable of such great results.
After getting familiar with vocabularies, let's start focusing on the 'other big thing' people usually associate NLP with: sentiment analysis.
There are numerous ways of running sentiment analysis / sentiment scoring on textual data. Some possible methods are:
- Categorical sentiment by words (e.g. positive / negative classes from the Bing lexicon, or emotion classes like anger, joy, trust and anticipation from the NRC lexicon)
- Numerical scoring of sentiment (AFINN lexicon: 'beautiful', 'amazing' +3; 'troubled', 'inconvenience' -2)
- Sentiment scoring run at the n-gram / sentence level, with an algorithm determining the overall sentiment of a sentence between -1 and +1, where -1 is all negative, +1 is all positive and 0 is neutral / non-classifiable
Of the above methods I’ll leverage 3:
- I'll classify tokens into positive and negative, count them by person and create a list for each person of their most used positive and negative words.
- I'll apply the AFINN lexicon to score each word, then multiply that sentiment score by the frequency of the word and create a list of the words that contribute the most positivity or negativity to each person's vocabulary.
- I'll run sentiment scoring at the sentence level (per line spoken by a character) and compare the results to token-level sentiment aggregation with AFINN scores.

A. Running sentiment analysis using categorical classification
This is just a simple intro step. I take all words, classify them into positive and negative categories, count each word, and determine the most used positive and negative words by person. It's really just a warm-up exercise.
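A sketch of this warm-up with tidytext's Bing lexicon (the 'words_df' data frame, one stopword-filtered token per row with a 'name' column, is an assumption; this is not the original code):

```r
library(dplyr)
library(tidytext)

bing_top_words <- words_df %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # label each word positive / negative
  count(name, sentiment, word, sort = TRUE) %>%         # counts per person and class
  group_by(name, sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%           # top 10 words per person per class
  ungroup()
```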
Most frequently used positive and negative words by character
What we see here is that, comparing the counts of the top positive and negative words, most people use their most frequent positive words more often than their most frequent negative ones. Angela is perhaps the exception: her top positive word ('fine') occurs almost as many times (~22) as her top negative word ('bad'). This in itself is not really representative of personalities. For that, I'll be using the AFINN lexicon.
B. Numeric sentiment scoring using AFINN
This time, instead of simply categorizing words, I'll be assigning numbers representing positiveness and negativeness to each word. Then I'll multiply the sentiment scores by the counts of the words, creating a 'contribution' factor: how much positivity or negativity a word contributed to a person's vocabulary. For example, the word 'disgusting' has a score of -3, while the word 'pretty' is scored at +1. This means it takes three 'pretty's to balance out one 'disgusting'.
Before showing results, let me show you the code to do that.
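The original code isn't reproduced here; a minimal sketch of the steps listed below could look like this (the 'lines_df' data frame and the 'custom_stopwords' and 'character_names' vectors are assumptions, and depending on the tidytext / textdata version the AFINN column is called 'value' or 'score'):

```r
library(dplyr)
library(tidytext)
library(textstem)

contribution <- lines_df %>%
  unnest_tokens(word, text) %>%                          # lines -> words, keeping 'name'
  mutate(word = lemmatize_words(word)) %>%               # 'studying' -> 'study'
  anti_join(stop_words, by = "word") %>%                 # drop standard stopwords
  filter(!word %in% custom_stopwords,                    # manually created list (assumed)
         !word %in% character_names) %>%                 # names of the characters (assumed)
  count(name, word) %>%                                  # (1) name, (2) word, (3) count
  inner_join(get_sentiments("afinn"), by = "word") %>%   # inner join keeps only scored words
  mutate(contribution = value * n)                       # score x count
```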
Here’s what’s going on there:
- We unnest the lines into words by person (keeping the information about who spoke the line).
- We apply lemmatization to get words back to their root form ('studying' → 'study').
- We get rid of stopwords from two lexicons, a manually created list and the list of the characters' names (we cannot score sentiment on names; they would be classified as neutral and distort the mean and median statistics).
- We apply the count function, which first groups the data by name and word, then summarizes it with a count aggregation. We now have 3 columns: (1) name, (2) word and (3) count.
- We join the AFINN lexicon to our data by the 'word' column, so each word now has a score on AFINN's -5 to +5 scale (we use an inner join, so only matches are kept).
- As a last step, we multiply the score by the count to get the contribution factor.
Let’s check the visual results.
Contributed sentiment by people
Now all we need to look for is where the red bars are longest to find the people who contribute the most negativity to their conversations. Angela, Darryl and Dwight seem to be the ones where the average length of the red (negative) bars comes closest to that of the green (positive) ones.
C. Sentiment between people
There are a couple more things I realized I should do with sentiments. One of them is to check who's nicest and meanest to whom in the series. For this I'll use a similar approach to my 'conversation network': I'll determine who spoke each line to whom, run sentiment scoring grouped by the 'from' and 'to' columns, and visualize the results!
Sentiment between people (bar chart)
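The scoring behind this chart can be sketched at the token level roughly like this (the 'directed_lines' data frame with 'from', 'to' and 'text' columns is an assumption; the original may have used a different scorer):

```r
library(dplyr)
library(tidytext)

pair_sentiment <- directed_lines %>%
  unnest_tokens(word, text) %>%                           # one word per row, keeping from/to
  inner_join(get_sentiments("afinn"), by = "word") %>%    # numeric score per word
  group_by(from, to) %>%
  summarise(score = sum(value), .groups = "drop") %>%     # total sentiment per speaker pair
  arrange(desc(score))
```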
Let's take Angela: she's nicest to Dwight and meanest to Oscar. The bar chart is easily interpretable, but a network will make this look a lot nicer. Again, I cannot paste an HTML element here, so check out two screenshots of the otherwise interactive networks.
Sentiment network (visNetwork)
How much nicer! There's one more thing I can do to make it look better. I don't necessarily want to visualize all relationships, but focus on the most positive and negative ones. But how to decide which 'edges' to keep in the network? Let's run a distribution analysis of the scores by person, drop the values around the mean / median and only visualize the 'extreme' relationships.
Here's what the above chart means. Take Jim: his sentiment scores with other people range from around 15 to 100, with high extremes. Most scores tend to be between -10 and +30, so I'll keep everything outside of this range to work with the extremes. The visNetwork containing the most 'extreme' relationships is the following:
Sentiment network with edges representing measure of positivity / negativity
We can sort of see (again, sorry, this would be an interactive network) that the green edge is widest between Jim and Pam, meaning the sentiment between them is the most positive of all. To take another example from the network, Oscar and Angela are mutually, considerably negative toward each other.
D. Seasonal sentiment trend
As a last step of the sentiment analysis, let me quickly run a by-episode and by-season sentiment trend for all characters to see if we can follow their happiness / sadness.
First, the episode-level overview.
The above chart offers no information whatsoever. The average sentiments by episode (calculated using the sentiment_by function of the sentimentr package, where sentiment is scored per line between -1 and 1) are too volatile to offer any insights. Let's check the seasonal data.
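That per-episode averaging can be sketched like this (the 'lines_df' data frame with 'name', 'season', 'episode' and 'text' columns is an assumption; this is not the original code):

```r
library(sentimentr)

# ave_sentiment in the result is the average line-level score per person, season and episode
episode_sentiment <- with(
  lines_df,
  sentiment_by(get_sentences(text), by = list(name, season, episode))
)
```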
This is somewhat more interpretable, but no clear trend can be extracted. Maybe Andy's firing is hinted at, but there is no clear relationship between the seasonal average sentiment scores and the happiness / sadness of the characters.
What we can also look at, though, is how average sentiment developed over time between the two main rivals of The Office: Jim and Dwight.
I ran two algorithms on this, first AFINN at the token level, then sentimentr at the line / sentence level, and the results are quite similar.
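Both approaches can be sketched roughly as follows (the 'jim_dwight' data frame, holding the lines the two speak to each other with 'season' and 'text' columns, is an assumption):

```r
library(dplyr)
library(tidytext)
library(sentimentr)

# 1. AFINN, token level: summed word scores per season
afinn_trend <- jim_dwight %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(season) %>%
  summarise(afinn_score = sum(value), .groups = "drop")

# 2. sentimentr, sentence level: average line sentiment per season
sentimentr_trend <- with(
  jim_dwight,
  sentiment_by(get_sentences(text), by = list(season))
)
```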
Both suggest the pair's relationship got better towards the end, which is in line with the story, as the rivalry stopped and a friendship began. This is arguable though; the sentiment trend is difficult to model here.
Before finishing up, let me show you another typical NLP job: topic modeling with LDA (Latent Dirichlet Allocation). To me it's quite similar in spirit to how an unsupervised machine learning algorithm, k-means clustering, does its job. The methodology differs a lot, but the results are similar: in the end, clustering finds data points that somewhat 'belong together' and form a similar but unlabeled group, while LDA's output is a list of words that 'belong together' and make up an unlabeled topic.
I'm not getting into the details of how the algorithm actually runs, but I'll give you an example. We've been working with the top 12 people (by line count). What we can use LDA for in this case is to find people with similar vocabularies. That is, we have (force) LDA create 12 clusters / topics and spit out the probability of a certain person being part of a given topic.
Let’s do just that.
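The original three lines of code aren't reproduced here; a sketch of them, using tidytext and topicmodels, might look like this (the 'word_counts' data frame with 'name', 'word' and 'n' columns and the seed value are assumptions):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

office_dtm <- word_counts %>%
  cast_dtm(document = name, term = word, value = n)    # document-term matrix

office_lda <- LDA(office_dtm, k = 12,                  # force 12 topics
                  control = list(seed = 1234))         # fixed seed for reproducibility

person_topics <- tidy(office_lda, matrix = "gamma")    # gamma: probability of topic per document
```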
Here's what happens in the code above:
- The input data has 3 columns: (1) the name of the person who speaks the line (usually referred to as the document), (2) the word column (after tokenization) and (3) a column indicating how many times the word occurred in the given document.
- We need to create a document-term matrix for the LDA to run on.
- We set the number of clusters / topics to 12 and set a random seed to make our work reproducible.
- Once the LDA algorithm has finished, we can extract 2 kinds of probabilities: (1) betas, the probability of a word being part of a topic, and (2) gammas, the probability of a topic being part of a document. Here we extract the gammas, as we want the probability of a topic (vocabulary) being part of (spoken by) a document (person).
Here is what we get after visualization
LDA vocabularies of 12 people
Most people have one particular vocabulary they use; however, Dwight and Michael each seem to make up two vocabularies based on their choice of words. Oscar and Angela share a cluster, meaning they have similar-sounding vocabularies (they're both accountants), and it's interesting to see that while Jim and Pam have their own respective topics, they also share one (topic #8), which may be their personal (out of the office) life: their family, daughter, wedding planning, and so on.
This is far from perfect, and LDA does not guarantee actual topics like 'finance' or 'IT'; the topics need to be named by the analyst after some creative, but possibly subjective, thinking.
In this blogpost I have touched upon the following NLP topics:
- Tokenization, bigramization and tf-idf to extract words, phrases and unique tokens from textual data
- Sentiment analysis with categorical and numerical outcomes, and how they can be used to show the sentiment contributed to a text
- Minimal LDA to 'cluster' similar-sounding people together, or at least to extract the likelihood of one person sharing a topic with another
I showcased all the above NLP methods on The Office transcripts, and as someone who's quite familiar with the show, I can honestly say that some of these methods, easy as they are to use, produce awesome findings. Regarding tf-idf and n-grams, there's no doubt they're capable of doing wonders on any textual data. Even sentiment scoring seemed to hold up: sentence-level aggregation and trend analysis are difficult, but token-level comparison is promising. Regarding LDA, let's just 'proceed with caution'.
Overall this was a great way for me to try these methods out and see how they work on 'live' data. With questions regarding my code, visit my GitHub page (https://github.com/kristofrabay) and contact me there.