Highly Similar Texts

May 17, 2018 11 min read

A while ago I was building a database of newspaper articles retrieved from LexisNexis for a research project in which I was working as a research assistant. At some point we noticed that we seemed to have a lot of duplicates in our database. I had already removed the duplicates with R so we were really surprised that those are still in there. However, after some investigation, I found that there are indeed small differences between the articles we had identified manually as duplicates in our data. This post describes a strategy we came up with to effectively find those highly similar articles. For convenience, I implemented a version of that solution in the function lnt_similarity of my LexisNexisTools package.

The problem is that R will not consider two texts duplicates when only a single character, such as a comma, whitespace or quotation mark is different. That was therefore our first guess. But we found more than that. In some cases, subtle edits had been made. A different word with the same meaning, the title or position of a person added or removed from the article or a new sentence added to the end of an article. Strangely, many of those articles were published on the same date and in the same newspaper.

For some instances, the differences could be explained by taking a look at the editions: Many newspapers in the UK have localised editions and the Scottish edition of the Daily Mail might have an extra sentence about how a story affected Scotland or a comment from a Scottish politician was added somewhere in the article.

However, this poses a problem to any valid media analysis: These articles would be wrongfully counted several times in the sample. The introduced bias would mean that there are more articles from newspapers that add several different editions to the LexisNexis database. An additional problem was that while the information on editions was available for some outlets, it seemed that LexisNexis did not provide it for others. So we needed a computational way to find the duplicates.

Available solution 1: the accurate one

Comparing chunks of text is more complicated than we initially thought. The most accurate solution available would be the so-called Levensthein or edit distance.

Consider this example that I pulled off Wikipedia:

kitten → sitten (substitution of “s” for “k”)
sitten → sittin (substitution of “i” for “e”)
sittin → sitting (insertion of “g” at the end).

The distance between “kitten” and “sitting” is three, because three edits need to be made to turn the former into the latter. This is implemented in base R:

adist("kitten", "sitting")

##      [,1]
## [1,]    3

One problem with this is that arguably “kitten” and “sitting” are two very different character string whereas two newspaper articles of a few hundred words where the edit distance is just 3 are very similar. An easy solution is to calculate the relative distance instead: I chose to divide the distance by the length of the longer word, which has the advantage that the result will be between 0 (for identical texts) or 1 (for texts that are completely different).

compare <- c("kitten", "sitting")
adist(compare[1], compare[2]) / max(nchar(compare))

##           [,1]
## [1,] 0.4285714

The relative distance of 0.43 means that the texts are very different, which is intuitively right.

However, there is a second problem when using the edit distance: Although Vladimir Levenshtein published his paper about the algorithm to calculate the distance more than 50 years ago, it is still extremely computational demanding, even for modern machines.

There are two factors which make the computation more demanding: firstly, the demand for calculation scales up extremely quickly when comparing more than one text:

compare <- c("kitten", "sitting", "knitting", "omitting")
adist(compare) / max(nchar(compare))

##       [,1]  [,2]  [,3] [,4]
## [1,] 0.000 0.375 0.375 0.50
## [2,] 0.375 0.000 0.250 0.25
## [3,] 0.375 0.250 0.000 0.25
## [4,] 0.500 0.250 0.250 0.00

For the four words above, 12 calculations are necessary, since each text needs to be compared to each other text in the sample (i.e. n*(n-1) calculation with n = number of texts to compare).

Secondly, comparing longer texts is computationally more expensive. Consider the following: I measure time and resources needed to calculate the difference between two words (“kitten” and “sitting”) and between two chunks of Lorem Ipsum, 651 and 650 characters long.

compare <- c("kitten", "sitting")
compare2 <- c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam lacinia elementum sapien, eget aliquet ex finibus ut. Suspendisse eget molestie ligula. Etiam quis purus sed urna lacinia lobortis. Suspendisse tempor purus at nibh ultricies pellentesque. Cras sed blandit risus. Duis sit amet felis magna. Quisque consequat libero id turpis dapibus, non venenatis elit sodales. Nam fermentum, sapien nec vehicula blandit, nulla lectus fringilla ligula, eget tempus eros justo vel ante. Ut quis urna id nunc maximus accumsan vitae non leo. Phasellus vestibulum felis id erat euismod consequat. Nam ac metus vitae nunc volutpat luctus sit amet quis sapien.",
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam lacinia elementum sapien, eget aliquet ex finibus ut. Suspendisse eget molestie ligula. Etiam quis purus sed urna lacinia lobortis. Suspendisse tempor purus at nibh ultricies pellentesque. Cras sed blandit risus. Duis sit amet felis magna. Quisque consequat libero id turpis dapibus, non venenatis elit sodales. Nam fermentum, sapien nec vehicula blandit, nulla lectus fringilla ligula, eget tempus eros justo vel ante. Ut quis urna id nunc maximus accumsan vitae non leo. Phasellus vestibulum felis id erat euismod consequat. Nam ac metus vitae nunc volutpat luctus sit amet quis sapien"
)
benchmark <- bench::mark(
  short <- adist(compare) / max(nchar(compare)),
  long <- adist(compare2) / max(nchar(compare2)),
  check = FALSE
)
dplyr::select(benchmark, median, mem_alloc)

## # A tibble: 2 x 2
##     median mem_alloc
##   <bch:tm> <bch:byt>
## 1   16.4µs        0B
## 2   12.4ms    10.4KB

To get a feel for this, the mean time for the second calculation is 12.4ms that is milliseconds. In the same unit as the first calculation, that would be 12,365 microseconds. This is an 753% increase on average. The allocated memory can’t even be calculated since the first value is too small to measure while the second calculation uses quite a bit of memory. Some testing revealed that around 3500 articles would already bring my 32GB RAM machine to its knees. This makes it almost impossible to compare hundreds or thousands of newspaper articles.

Available solution 2: the efficient one

While the edit distance is the most accurate method for comparing texts, there are others which are much more efficient. Instead of comparing character for character, it is also possible to chop up each text into its elements and then compare which elements are shared by two or more texts.

I use the excellent quanteda package to turn the texts into a matrix (referred to as document-feature matrix or dfm):

library(quanteda)
dfm <- dfm(c(compare, compare2))
as.matrix(dfm)

	kitten	sitting	lorem	ipsum	dolor	sit	amet	,	consectetur	adipiscing
text1	1	0	0	0	0	0	0	0	0	0
text2	0	1	0	0	0	0	0	0	0	0
text3	0	0	1	1	1	3	3	6	1	1
text4	0	0	1	1	1	3	3	6	1	1

I’m only displaying the first 10 columns of the dfm here. Basically every column is one word and then every row is one text. The number in the cell tells us how often a word occurred in a text. Comparing the matrices of two texts is a much easier task for a computer. Since the data is in a matrix form, cosine similarity can be calculated (other similarity measures are available in quanteda but the cosine similarity is, in my opinion, the easiest to read and interpret).

textstat_simil(dfm, method = "cosine")

## textstat_simil object; method = "cosine"
##       text1 text2 text3 text4
## text1     1     0     0     0
## text2     0     1     0     0
## text3     0     0 1.000 0.999
## text4     0     0 0.999 1.000

What we see here is that of the four texts, only text 3 and 4 are similar (more than 99% similar), the other two (“kitten” and “sitting”) are considered completely different from the others.

However, consider another example:

compare3 <- c(
  "Dog bites man",
  "Man bites dog"
)

dfm2 <- dfm(compare3)
textstat_simil(dfm2, method = "cosine")

## textstat_simil object; method = "cosine"
##       text1 text2
## text1     1     1
## text2     1     1

The similarity between “Dog bites man” and “Man bites dog” is 100%. But the meaning of the two sentences is arguably very different! This mismatch between intuition and results is caused by textstat_simil() ignoring word order and dfm() getting rid of capital letters by the default—which is the main reason behind the performance gain! This makes textstat_simil() a fast alternative but also relatively inaccurate when comparing text with similar vocabulary but different meaning.

Divide and conquer

However, the two methods combined can cancel each other’s shortcomings out. Let’s check this on all three examples I used so far, combined:

texts <- c(compare, compare2, compare3)
dfm3 <- dfm(texts)
simil <- textstat_simil(dfm3, method = "cosine")
simil

## textstat_simil object; method = "cosine"
##       text1 text2 text3 text4 text5 text6
## text1     1     0     0     0     0     0
## text2     0     1     0     0     0     0
## text3     0     0 1.000 0.999     0     0
## text4     0     0 0.999 1.000     0     0
## text5     0     0     0     0     1     1
## text6     0     0     0     0     1     1

As you can see again, text three and four as well as text five and six are deemed very similar by textstat_simil(). We can use this information to do some more in-depth testing without making the task overly expensive by only calculating the relative distance on the texts already flagged as highly similar. I do some data wrangling here to get the distance matrix into a presentable form:

library(dplyr)
simil_clean <- simil %>%
  as.matrix() %>%
  reshape2::melt() %>%
  mutate_if(is.factor, as.character) %>% 
  filter(value > 0.95) %>% 
  filter(Var1 != Var2) %>% 
  mutate(comb = purrr::map_chr(seq_along(Var1), function(i) {
    paste(sort(c(Var1[i], Var2[i])), collapse = " ")
  })) %>% 
  distinct(comb, .keep_all = TRUE) %>% 
  mutate(Orig_ID = as.integer(gsub("text", "", Var1)),
         Dup_ID = as.integer(gsub("text", "", Var2))) %>% 
  mutate(Orig_text = texts[Orig_ID],
         Dup_text = texts[Dup_ID]) %>% 
  select(-Var1, -Var2, -comb) %>% 
  mutate(rel_edit = purrr::map_dbl (seq_along(Orig_ID), function(i) {
    compare <- c(.data$Orig_text[i], .data$Dup_text[i])
    as.numeric(adist(compare)[2] / max(nchar(compare)))
  })) 
simil_clean %>% 
  select(-Orig_text, -Dup_text)

value	Orig_ID	Dup_ID	rel_edit
0.9990959	4	3	0.0015361
1.0000000	6	5	0.4615385

Now this is a little confusing, so let me explain: ID 3 and 4 are the Lorem Ipsum texts, 5 and 6 are “Dog bites man” and “Man bites dog”. These texts end up in pairs, as their similarity value is above a threshold I used (0.95). rel_edit is the relative edit distance. The more similar two texts are, the lower this second value—the exact opposite of the similarity measure. So the very low distance between the two Lorem Ipsum texts and the high distance between “Dog bites man” and “Man bites dog” indicate that the first pair is extremely similar while the latter pair consists of pretty different texts. In a newspaper media analysis, it would probably make a lot of sense to remove either text 4 or 3. Texts 5 and 6, however, should probably stay in the database as they are, in fact, quite different.

Implementation in LexisNexisTools

Luckily for you, all this is already conveniently implemented in LexisNexisTools. Since the package is made for newspaper data, the function to find similar texts assumes that articles need to be published on the same day to consider them similar. To get the similarity between texts anyway, just provide a date vector with the same length as the text object:

LexisNexisTools::lnt_similarity(texts, rep("2018-05-19", length(texts)))

Date	ID_original	text_original	ID_duplicate	text_duplicate	Similarity	rel_dist
2018-05-19	3	Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam lacinia elementum sapien, eget aliquet ex finibus ut. Suspendisse eget molestie ligula. Etiam quis purus sed urna lacinia lobortis. Suspendisse tempor purus at nibh ultricies pellentesque. Cras sed blandit risus. Duis sit amet felis magna. Quisque consequat libero id turpis dapibus, non venenatis elit sodales. Nam fermentum, sapien nec vehicula blandit, nulla lectus fringilla ligula, eget tempus eros justo vel ante. Ut quis urna id nunc maximus accumsan vitae non leo. Phasellus vestibulum felis id erat euismod consequat. Nam ac metus vitae nunc volutpat luctus sit amet quis sapien.	4	Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam lacinia elementum sapien, eget aliquet ex finibus ut. Suspendisse eget molestie ligula. Etiam quis purus sed urna lacinia lobortis. Suspendisse tempor purus at nibh ultricies pellentesque. Cras sed blandit risus. Duis sit amet felis magna. Quisque consequat libero id turpis dapibus, non venenatis elit sodales. Nam fermentum, sapien nec vehicula blandit, nulla lectus fringilla ligula, eget tempus eros justo vel ante. Ut quis urna id nunc maximus accumsan vitae non leo. Phasellus vestibulum felis id erat euismod consequat. Nam ac metus vitae nunc volutpat luctus sit amet quis sapien	1	0.0015361
2018-05-19	5	Dog bites man	6	Man bites dog	1	0.4615385

As you can see, the result is the same, but the code a lot easier.

R media anlysis