This tutorial shows how to extract key terms from documents and (sub-)collections with TF-IDF and with the log-likelihood statistic based on a reference corpus.
As in the previous tutorial, we read the CSV data file containing the State of the Union addresses and preprocess the corpus object with a sequence of tm_map functions.
This time, we also apply stemming to the corpus. Stemming reduces (potentially) inflected word forms to their word stems, unifying similar semantic forms into the same text representation. For instance, ‘presidents’ becomes ‘president’ and ‘singing’ becomes ‘sing’.
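To get a feeling for what the stemmer does to individual tokens, we can call the wordStem function of the SnowballC package directly. This is purely illustrative; the pipeline below applies stemDocument to the whole corpus, and the exact output may depend on the SnowballC version:
# Illustration only: stem a few example tokens with SnowballC
# (stems need not be dictionary words, e.g. "recoveri")
library(SnowballC)
wordStem(c("singing", "cities", "recovery"), language = "en")
## e.g. "sing" "citi" "recoveri"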
Finally, we create a Document-Term-Matrix.
options(stringsAsFactors = FALSE)
library(tm)
textdata <- read.csv("data/sotu.csv", sep = ";", encoding = "UTF-8")
english_stopwords <- readLines("resources/stopwords_en.txt", encoding = "UTF-8")
# Create corpus object
m <- list(ID = "id", content = "text", DateTimeStamp = "date")
myReader <- readTabular(mapping = m)
corpus <- Corpus(DataframeSource(textdata), readerControl = list(reader = myReader))
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, english_stopwords)
# Stemming
require("SnowballC")
corpus <- tm_map(corpus, stemDocument, language = "en")
corpus <- tm_map(corpus, stripWhitespace)
# View first document
substr(as.character(corpus[[1]]), 0, 250)
# Create DTM
DTM <- DocumentTermMatrix(corpus)
A widely used method to weight terms according to their semantic contribution to a document is term frequency–inverse document frequency (TF-IDF). The idea is that the more often a term occurs in a document, the more it contributes to that document's content; at the same time, the more documents a term occurs in, the less informative it is for any single document. The product of both measures is the resulting weight.
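Expressed as a formula (using the base-2 logarithm, as in the code below):

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log_2 \frac{N}{n_t}$$

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and n_t is the number of documents in which t occurs.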
Let us compute TF-IDF weights for all terms in the first speech of Barack Obama.
require(slam)
# Compute IDF: log2(N / n_i)
number_of_docs <- nrow(DTM)
term_in_docs <- col_sums(DTM > 0)
idf <- log2(number_of_docs / term_in_docs)
# Compute TF
first_obama_speech <- which(textdata$president == "Barack Obama")[1]
tf <- as.vector(DTM[first_obama_speech, ])
# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(DTM)
The last operation appends the column names again to the resulting term weight vector. If we now sort the TF-IDF weights in decreasing order, we get the most important terms for the Obama speech according to this weight.
sort(tf_idf, decreasing = T)[1:20]
## -start dont job tonight lend recoveri layoff
## 31.40700 27.64286 27.21418 24.28682 23.18042 23.06548 20.55525
## colleg ensur crisi budget quitters long-term iraq
## 20.18859 18.22675 18.06525 16.66461 15.70350 14.96518 13.83773
## clean mortgage auto global weve deficit
## 13.78296 13.70350 13.58946 13.40912 13.40912 13.29039
If we had relied on raw term frequency alone, we would have obtained a list of stop words and other very frequent but uninformative terms as the most important ones. By re-weighting with the inverse document frequency, we instead see a heavy focus on economic terms in this first speech. By the way, the tm package provides a convenient function for computing TF-IDF weights of a given DTM: weightTfIdf(DTM).
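For instance, the following computes a TF-IDF weighting directly from the DTM (note that, unlike our manual computation, weightTfIdf by default normalizes the term frequency by the document length, so absolute values will differ):
# Built-in TF-IDF weighting from the tm package
tfidf_DTM <- weightTfIdf(DTM)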
We now use a more sophisticated method based on a comparison corpus and the log-likelihood statistic.
targetDTM <- DocumentTermMatrix(corpus)
termCountsTarget <- as.vector(targetDTM[first_obama_speech, ])
names(termCountsTarget) <- colnames(targetDTM)
# Just keep counts greater than zero
termCountsTarget <- termCountsTarget[termCountsTarget > 0]
In termCountsTarget we now have the term frequencies for the first Obama speech again.
As a comparison corpus, we select a corpus from the Leipzig Corpora Collection (http://corpora.uni-leipzig.de): 30,000 randomly selected sentences from the English Wikipedia of 2010. CAUTION: The preprocessing of the comparison corpus must be identical to the preprocessing of the target corpus to achieve meaningful results!
lines <- readLines("resources/eng_wikipedia_2010_30K-sentences.txt", encoding = "UTF-8")
comparisonCorpus <- Corpus(VectorSource(lines))
comparisonCorpus <- tm_map(comparisonCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)
comparisonCorpus <- tm_map(comparisonCorpus, removeNumbers)
comparisonCorpus <- tm_map(comparisonCorpus, content_transformer(tolower))
comparisonCorpus <- tm_map(comparisonCorpus, removeWords, english_stopwords)
comparisonCorpus <- tm_map(comparisonCorpus, stemDocument, language = "en")
comparisonCorpus <- tm_map(comparisonCorpus, stripWhitespace)
From the comparison corpus, we also create a count of all terms.
comparisonDTM <- DocumentTermMatrix(comparisonCorpus)
termCountsComparison <- col_sums(comparisonDTM)
In termCountsComparison we now have the frequencies of all terms occurring in the comparison corpus.
Let us now calculate the log-likelihood ratio test by comparing frequencies of a term in both corpora, taking the size of both corpora into account. First for a single term:
# Loglikelihood for a single term
term <- "care"
# Determine variables
a <- termCountsTarget[term]
b <- termCountsComparison[term]
c <- sum(termCountsTarget)
d <- sum(termCountsComparison)
# Compute log likelihood test
# Expected frequencies of the term in both corpora, assuming
# its relative frequency were the same in both
Expected1 <- c * (a + b) / (c + d)
Expected2 <- d * (a + b) / (c + d)
t1 <- a * log(a / Expected1)
t2 <- b * log(b / Expected2)
logLikelihood <- 2 * (t1 + t2)
print(logLikelihood)
## care
## 68.73001
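Written out, the statistic computed above compares the observed counts a and b of the term in the target and comparison corpus against their expected counts under the assumption that the term is equally frequent in both (c and d are the total token counts of the two corpora):

$$E_1 = \frac{c\,(a+b)}{c+d}, \qquad E_2 = \frac{d\,(a+b)}{c+d}, \qquad \mathrm{LL} = 2\left(a \ln\frac{a}{E_1} + b \ln\frac{b}{E_2}\right)$$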
The LL value indicates whether the term occurs significantly more or less frequently in the target counts than we would expect from the observation in the comparison counts. Commonly used significance thresholds for the LL value correspond to the critical values of the chi-squared distribution with one degree of freedom:
3.84 for p < 0.05
6.63 for p < 0.01
10.83 for p < 0.001
15.13 for p < 0.0001
With R it is easy to calculate the LL value for all terms at once. This is possible because many computing operations in R can be applied not only to individual values, but to entire vectors and matrices. For example, a / 2 results in a single value if a is a single number; if a is a vector, the result is also a vector, in which every value is divided by 2.
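A minimal illustration:
a <- 10
a / 2          # 5
a <- c(10, 20, 30)
a / 2          # 5 10 15 -- the division is applied element-wise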
ATTENTION: A comparison of term occurrences between two documents/corpora is actually only meaningful if the term occurs in both units. Since, however, we also want to include terms which are not contained in the comparison corpus (the termCountsComparison vector contains 0 values for these terms), we apply a small correction during the test: whenever a count is zero, we add 1 inside the logarithm. This avoids uninterpretable NaN values which would otherwise result from applying the log function to 0. Alternatively, the test could be performed only on terms that actually occur in both corpora.
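To see why such a correction is needed, consider what R returns for zero counts:
log(0)       # -Inf
0 * log(0)   # NaN -- this is what an uncorrected zero count would produce
log(0 + 1)   # 0  -- with the correction, such a term simply contributes nothing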
First, let’s have a look into the set of terms only occurring in the target document, but not in the comparison corpus.
# Use a set operation to get terms occurring only in the target document
uniqueTerms <- setdiff(names(termCountsTarget), names(termCountsComparison))
# Have a look into a random selection of terms unique in the target corpus
sample(uniqueTerms, 20)
## [1] "banks" "resources" "war-era" "-start" "theyll"
## [6] "commun" "generation" "priorities" "running" "-teach"
## [11] "paycheck" "peels" "sacrifice" "committed" "federal"
## [16] "gains" "website" "minneapolis" "peril" "training"
Now we calculate the statistics the same way as above, but with vectors. Since there might be terms in the target counts which we did not observe in the comparison corpus, we need to make both vocabularies match. For this, we append the unique target terms as zero counts to the comparison frequency vector. Moreover, we use a little trick to handle zero counts in a or b when computing t1 or t2: if a count is zero, the log function would produce a NaN value, which we want to avoid. In this case the expression a == 0 (resp. b == 0) adds 1 inside the log, which yields 0 after applying the log function.
# Create vector of zeros to append to comparison counts
zeroCounts <- rep(0, length(uniqueTerms))
names(zeroCounts) <- uniqueTerms
termCountsComparison <- c(termCountsComparison, zeroCounts)
# Get list of terms to compare from intersection of target and comparison vocabulary
termsToCompare <- intersect(names(termCountsTarget), names(termCountsComparison))
# Calculate statistics (same as above, but now with vectors!)
a <- termCountsTarget[termsToCompare]
b <- termCountsComparison[termsToCompare]
c <- sum(termCountsTarget)
d <- sum(termCountsComparison)
Expected1 <- c * (a + b) / (c + d)
Expected2 <- d * (a + b) / (c + d)
# (a == 0) / (b == 0) add 1 inside the log for zero counts, so that log(1) = 0
t1 <- a * log((a / Expected1) + (a == 0))
t2 <- b * log((b / Expected2) + (b == 0))
logLikelihood <- 2 * (t1 + t2)
# Compare relative frequencies to indicate over/underuse
relA <- a / c
relB <- b / d
# underused terms are multiplied by -1
logLikelihood[relA < relB] <- logLikelihood[relA < relB] * -1
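As a quick sanity check, the vectorised computation should reproduce the single-term result for "care" from above:
logLikelihood["care"]   # approximately 68.73, as in the single-term computation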
Let's take a look at the results: the 25 most overused and the 25 most underused terms, and then the top 100 LL terms alongside their raw frequencies. We also see that terms with comparatively low frequencies are identified by the LL test as statistically significant compared to the reference corpus.
# top terms (overuse in targetCorpus compared to comparisonCorpus)
sort(logLikelihood, decreasing=TRUE)[1:25]
## american economi health tonight job budget care
## 127.26575 99.12254 86.54447 83.80928 78.68057 72.11434 68.73001
## america crisi recoveri lend deficit cost plan
## 67.41958 64.94608 64.94608 61.72404 56.99759 56.62186 53.20404
## reform afford congress invest tax energi -start
## 50.15442 47.46049 47.21880 45.13974 39.99798 38.76727 38.46741
## educ dont long-term dollar
## 37.84352 37.54112 35.80695 35.05019
# bottom terms (underuse in targetCorpus compared to comparisonCorpus)
sort(logLikelihood, decreasing=FALSE)[1:25]
## game citi develop oper point general
## -4.3425856 -4.3167595 -2.9149729 -1.9059638 -1.8728417 -1.7416719
## design larg earli record run book
## -1.6340510 -1.6340510 -1.5596799 -1.4031554 -1.3621047 -1.3213579
## includ left area person number product
## -1.2478227 -1.2408068 -1.2179743 -1.1812503 -1.1290878 -0.9047823
## produc control local show success leav
## -0.8682676 -0.8592053 -0.8411617 -0.8174538 -0.7266182 -0.6336620
## age
## -0.5605558
llTop100 <- sort(logLikelihood, decreasing=TRUE)[1:100]
frqTop100 <- termCountsTarget[names(llTop100)]
frqLLcomparison <- data.frame(llTop100, frqTop100)
View(frqLLcomparison)
# Number of significantly overused terms (p < 0.01)
sum(logLikelihood > 6.63)
## [1] 329
The method extracted 329 key terms from the first Obama speech.
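If you prefer a stricter significance level, compare against the corresponding critical value instead, e.g. 10.83 for p < 0.001:
# Number of significantly overused terms (p < 0.001)
sum(logLikelihood > 10.83)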
Finally, we visualize the 50 most significant terms as a word cloud. This can be done simply with the wordcloud function from the package of the same name. In addition to the words and their weights (here we use the LL values), we override the default scaling and color parameters. Feel free to try different parameters to modify the word cloud rendering.
require(wordcloud)
top50 <- sort(logLikelihood, decreasing = TRUE)[1:50]
wordcloud(names(top50), top50, max.words = 50, scale = c(3, .9), colors = brewer.pal(8, "Dark2"), random.order = F)
Key term extraction cannot only be done for single documents, but also for entire (sub-)corpora. Depending on the comparison corpus, the results may vary. Instead of comparing a single document to a Wikipedia corpus, we now compare collections of speeches of a single president to the speeches of all other presidents.
For this, we iterate over all different president names using a for-loop. Within the loop, we use a logical vector (Boolean TRUE/FALSE values) to split the DTM into two sub-matrices: the rows of the currently selected president and the rows of all other presidents. From these matrices, the counts of target and comparison frequencies are created. The statistical computation of the log-likelihood measure from above has been moved into the function calculateLogLikelihood, which we load with the source command at the beginning of the block. The function takes both frequency vectors as input parameters and outputs an LL value for each term of the target vector.
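The contents of calculateLogLikelihood.R are not reproduced here; a minimal sketch of such a function, simply wrapping the vectorised computation from above (the file shipped with the tutorial may differ in its details), could look like this:
# Hypothetical sketch -- the actual sourced implementation may differ
calculateLogLikelihood <- function(termCountsTarget, termCountsComparison) {
  # Align vocabularies: add target-only terms as zero counts to the comparison vector
  uniqueTerms <- setdiff(names(termCountsTarget), names(termCountsComparison))
  zeroCounts <- rep(0, length(uniqueTerms))
  names(zeroCounts) <- uniqueTerms
  termCountsComparison <- c(termCountsComparison, zeroCounts)
  termsToCompare <- intersect(names(termCountsTarget), names(termCountsComparison))
  a <- termCountsTarget[termsToCompare]
  b <- termCountsComparison[termsToCompare]
  c <- sum(termCountsTarget)
  d <- sum(termCountsComparison)
  Expected1 <- c * (a + b) / (c + d)
  Expected2 <- d * (a + b) / (c + d)
  t1 <- a * log((a / Expected1) + (a == 0))
  t2 <- b * log((b / Expected2) + (b == 0))
  logLikelihood <- 2 * (t1 + t2)
  # A negative sign marks terms underused in the target counts
  logLikelihood[(a / c) < (b / d)] <- -logLikelihood[(a / c) < (b / d)]
  logLikelihood
}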
Results of the LL key term extraction are again visualized as a word cloud. Instead of plotting the word cloud in RStudio, this time we write the visualization as a PDF file to disk, into the wordclouds folder. After the for-loop has completed, the folder should contain 42 word cloud PDFs, one for each president.
source("calculateLogLikelihood.R")
presidents <- unique(textdata$president)
for (president in presidents) {
  cat("Extracting terms for president", president, "\n")
  selector_logical_idx <- textdata$president == president
  presidentDTM <- targetDTM[selector_logical_idx, ]
  termCountsTarget <- col_sums(presidentDTM)
  otherDTM <- targetDTM[!selector_logical_idx, ]
  termCountsComparison <- col_sums(otherDTM)
  loglik_terms <- calculateLogLikelihood(termCountsTarget, termCountsComparison)
  top50 <- sort(loglik_terms, decreasing = TRUE)[1:50]
  fileName <- paste0("wordclouds/", president, ".pdf")
  pdf(fileName, width = 9, height = 7)
  wordcloud(names(top50), top50, max.words = 50, scale = c(3, .9), colors = brewer.pal(8, "Dark2"), random.order = F)
  dev.off()
}
For comparison, the following output lists top terms ranked by raw frequency (frq), by TF-IDF weight (tfidf), and by log-likelihood (ll):
## word.frq frq word.tfidf tfidf word.ll ll
## 1 state 7574 program 1366.6042 congress 2946.2685
## 2 govern 7209 tonight 1163.5595 govern 2698.1090
## 3 year 5925 job 1044.7222 state 1844.8106
## 4 nation 5427 mexico 979.7642 nation 1654.3927
## 5 congress 4987 budget 828.1032 unit 1229.1613
## 6 unit 4562 bank 823.9534 countri 1189.2364
## 7 countri 3930 territori 818.3297 law 1034.0600
## 8 peopl 3703 gold 760.6911 great 1000.0192
## 9 great 3628 billion 747.1415 recommend 924.3473
## 10 american 3341 cent 745.8005 american 898.3763
2017, Andreas Niekler and Gregor Wiedemann. GPLv3. tm4ss.github.io