Detect and remove signature in forum text messages using R - regex

I've a collection of text messages scraped from a forum into a data frame. Here's a reproducible example:
example.df <- data.frame(author=c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy"),
message=c("Hello World! Mikey Mouse",
"Quack Quack! Donald Duck",
"I was born in 1928. Mikey Mouse",
"Quack Quack! Daisy Duck",
"The quick fox jump over Minnie Mouse",
"Quack Quack! Daisy Duck"))
My idea is to find the longest common suffix found on every message for the same author for all those who have written more than on message. For all others, well, I'll find a regex way that gracefully degradates.
I found the bioconductor package RLibstree that looks promising, thanks to the function getLongestCommonSubstring, but I don't know how to group the function to all the messages from the same author.

I think I'd convert to a list in the following format and use the stringdist package to find common sentences and remove any above a certain threshold of similarity for all sentences used by an author. outer may be of use here as well:
## load packages in this order
library(stringi)
library(magrittr)
example.df[["message"]] %>%
stringi::stri_split_regex(., "(?<=[.?!]{1,5})\\s+") %>%
split(example.df[["author"]])
## $Daisy
## $Daisy[[1]]
## [1] "Quack Quack!" "Daisy Duck"
##
## $Daisy[[2]]
## [1] "Quack Quack!" "Daisy Duck"
##
##
## $Donald
## $Donald[[1]]
## [1] "Quack Quack!" "Donald Duck"
##
##
## $Mikey
## $Mikey[[1]]
## [1] "Hello World!" "Mikey Mouse"
##
## $Mikey[[2]]
## [1] "I was born in 1928." "Mikey Mouse"
##
##
## $Minnie
## $Minnie[[1]]
## [1] "The quick fox jump over Minnie Mouse"

I don't know how to group the function to all the messages from the
same author.
Perhaps tapply is what you are looking for.
> tapply(as.character(example.df$message), example.df$author, function(x) x)
$Daisy
[1] "Quack Quack! Daisy Duck" "Quack Quack! Daisy Duck"
$Donald
[1] "Quack Quack! Donald Duck"
$Mikey
[1] "Hello World! Mikey Mouse" "I was born in 1928. Mikey Mouse"
$Minnie
[1] "The quick fox jump over Minnie Mouse"
You can use your own function in place of function(x) x, of course.

Here is an implementation that uses no additional libraries.
example.df <- data.frame(author=c("Mikey", "Donald", "Mikey",
"Daisy", "Minnie", "Daisy"),
message=c("Hello World! Mikey Mouse",
"Quack Quack! Donald Duck",
"I was born in 1928. Mikey Mouse",
"Quack Quack! Daisy Duck",
"The quick fox jump over Minnie Mouse",
"Quack Quack! Daisy Duck"))
signlen = function(am) # determine signature length of an author's messages
{
if (length(am) <= 1) return(0) # return if not more than 1 message
# turn the messages into reversed vectors of single characters
# in order to conveniently access the suffixes from index 1 on
am = lapply(strsplit(as.character(am), ''), rev)
# find the longest common suffix in the messages
longest_common = .Machine$integer.max
for (m in 2:length(am))
{
i = 1
max_length = min(length(am[[m]]), length(am[[m-1]]), longest_common)
while (i <= max_length && am[[m]][i] == am[[m-1]][i]) i = i+1
longest_common = i-1
if (longest_common == 0) return(0) # shortcut: need not look further
}
return(longest_common)
}
# determine signature length of every author's messages
signature_length = tapply(example.df$message, example.df$author, signlen)
#> signature_length
# Daisy Donald Mikey Minnie
# 23 0 12 0
# determine resulting length "to" of messages with signatures removed
to = nchar(as.character(example.df$message))-signature_length[example.df$author]
#> to
# Mikey Donald Mikey Daisy Minnie Daisy
# 12 24 19 0 36 0
# remove the signatures by replacing messages with resulting substring
example.df$message = substr(example.df$message, 1, to)
#> example.df
# author message
#1 Mikey Hello World!
#2 Donald Quack Quack! Donald Duck
#3 Mikey I was born in 1928.
#4 Daisy
#5 Minnie The quick fox jump over Minnie Mouse
#6 Daisy

Related

Regex in R -- extracting sub-string based on two start/stop words

I have a character (text) column:
tweets <- c(
"Drinking a Bud Light by #Budweiser # Joe's Crab Shack http://www.joes.com",
"Drinking a Sam Adams Winter Ale by #SamAdams # Growler Stop http://www.growlerstop.com",
"Drinking a Coco Loco by #NoDaBrewing # The Corner Pub http://www.cornerpub.com"
)
As you can see, assume the tweets have a standard structure:
"Drinking a [name of beer] by #[name of brewery] # [name of bar, notice whitespace] http://"
I want to use regular expressions (and substr()?) to create three new columns:
Name of the beer
Name of the brewery
Name of the bar (note that it could have white space, so needs to go to "http:")
One step further - how do I control for some Tweets that do not have the same structure?
It's ugly:
setNames(nm=c('beer','brewery','bar'),as.data.frame(do.call(rbind,
regmatches(tweets,regexec('^Drinking an? (.*) by #(.*) # (.*) http://.*$',tweets))
)[,-1L]));
## beer brewery bar
## 1 Bud Light Budweiser Joe's Crab Shack
## 2 Sam Adams Winter Ale SamAdams Growler Stop
## 3 Coco Loco NoDaBrewing The Corner Pub
See regexec() and regmatches().
do.call(rbind,strsplit(gsub('.*\\ba\\b(.*) by #(.*) #(.*) http.*','\\1|\\2|\\3',tweets),'\\|'))
# [,1] [,2] [,3]
#[1,] " Bud Light" "Budweiser" " Joe's Crab Shack"
#[2,] " Sam Adams Winter Ale" "SamAdams" " Growler Stop"
#[3,] " Coco Loco" "NoDaBrewing" " The Corner Pub"

Extracting pattern from the nested list in R using regex

I have following sorted list (lst) of time periods and I want to split the periods into specific dates and then extract maximum time period without altering order of the list.
$`1`
[1] "01.12.2015 - 21.12.2015"
$`2`
[1] "22.12.2015 - 05.01.2016"
$`3`
[1] "14.09.2015 - 12.10.2015" "29.09.2015 - 26.10.2015"
Therefore, after adjustment list should look like this:
$`1`
[1] "01.12.2015" "21.12.2015"
$`2`
[1] "22.12.2015" "05.01.2016"
$`3`
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
In order to do so, I began with splitting the list:
lst_split <- str_split(lst, pattern = " - ")
which leads to the following:
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "c(\"14.09.2015" "12.10.2015\", \"29.09.2015" "26.10.2015\")"
Then, I tried to extract the pattern:
lapply(lst_split, function(x) str_extract(pattern = c("\\d+\\.\\d+\\.\\d+"),x))
but my output is missing one date (29.09.2015)
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "26.10.2015"
Does anyone have an idea how I could make it work and maybe propose more efficient solution? Thank you in advance.
Thanks to comments of #WiktorStribiżew and #akrun it is enough to use str_extract_all.
In this example:
> str_extract_all(lst,"\\d+\\.\\d+\\.\\d+")
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
1) Use strsplit, flatten each component using unlist, convert the dates to "Date" class and then use range to get the maximum time span. No packages are used.
> lapply(lst, function(x) range(as.Date(unlist(strsplit(x, " - ")), "%d.%m.%Y")))
$`1`
[1] "2015-12-01" "2015-12-21"
$`2`
[1] "2015-12-22" "2016-01-05"
$`3`
[1] "2015-09-14" "2015-10-26"
2) This variation using a magrittr pipeline also works:
library(magrittr)
lapply(lst, function(x)
x %>%
strsplit(" - ") %>%
unlist %>%
as.Date("%d.%m.%Y") %>%
range
)
Note: The input lst in reproducible form is:
lst <- structure(list(`1` = "01.12.2015 - 21.12.2015", `2` = "22.12.2015 - 05.01.2016",
`3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015"
)), .Names = c("1", "2", "3"))

Easy way to extract text between defined set of strings in R

I have some text with defined labels and need to split the text according to the labels.
For example given the text with labels set {A, B, C..}
text <- c("A: how are you B: hello sir C: bye bye")
text2 <- c("USER COMMENTS: TEST PROC: Refer manual. SOLUTION: fix BIAS32 user:param", "TEST PROC: install spare unit. USER COMMENTS: hello sir SOLUTION: tighten bolt 12","TEST PROC: bye bye.")
I need to extract text "how are you", "hello sir" , etc.. corresponding to labels A, B, etc.
There is no specific order of the labels, certain labels could be missing and labels can be phrases (not just characters)
This is what I have so far to extract text corresponding to label A:
gsub("(.*A.*:)(.*)(B.*|C.*)","\\2",text,perl=TRUE)
But this does not work in so many cases!
I am looking for a solution where I can define a vector of labels such as
labels <- c("USER COMMENTS", "TEST PROC", "SOLUTION") # this is a big list!
and extract the text corresponding to these labels as below
USER COMMENTS are "", "hello sir"
TEST PROC are "Refer manual.", "install spare unit.","bye bye."
SOLUTION are "fix BIAS32 user:param", "tighten bolt 12"
etc..
I think I might have a solution based on Sharath's comment.
First, there's strsplit(), which can split a vector based on regex. In your case you could use:
labels2<-paste(labels,collapse="|")
[1] "USER COMMENTS|TEST PROC|SOLUTION"
If you apply strsplit on that:
splittedtext<-strsplit(text2,labels2)
[[1]]
[1] "" ": "
[3] ": Refer manual. " ": fix BIAS32 user:param"
[[2]]
[1] "" ": install spare unit. " ": hello sir "
[4] ": tighten bolt 12"
[[3]]
[1] "" ": bye bye."
Pretty much what you want, right? You could do some refining by adding ": " to the end of every index, and the first element is gibberish. So taking care of the latter:
splittedtext<-lapply(splittedtext,"[",-1)
That generates the problem that you must figure out to which label a comment applies. For that you could use regexpr() function in R.
pos=sapply(labels,regexpr,text2)
USER COMMENTS TEST PROC SOLUTION
[1,] 1 16 41
[2,] 32 1 57
[3,] -1 1 -1
Each cell represents the position in which said label [column] appear on string [row]. -1 denote that it does not appear on this string.
Now switch, -1 for NA, and rank the remaining numbers. That will give to you which string snippet represents that label.
pos=ifelse(pos==-1,NA,pos) #switch -1 for NA
pos=t(apply(pos,1,rank,na.last="keep"))
USER COMMENTS TEST PROC SOLUTION
[1,] 1 2 3
[2,] 2 1 3
[3,] NA 1 NA
Now it's just matching.

How to extract sentences containing specific person names using R

I am using R to extract sentences containing specific person names from texts and here is a sample paragraph:
Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.
In this short paragraph, there are several person names such as:
Johann Reuchlin, Melanchthon, Johann Eck. With the help of openNLP package, three person names Martin Luther, Paul and Melanchthon can be correctly extracted and recognized. Then I have two questions:
How could I extract sentences containing these names?
As the output of named entity recognizer is not so promising, if I add "[[ ]]" to each name such as [[Johann Reuchlin]], [[Melanchthon]], how could I extract sentences containing these name expressions [[A]], [[B]] ...?
Using `strsplit` and `grep`, first I set made an object `para` which was your paragraph.
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Or a little cleaner:
sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
If you are looking for the sentences that each person is in as separate returns then:
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)
[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Edit 3: To add each persons name, do something simple such as:
foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
EDIT 4:
And if you wanted to find sentences that had multiple people/places/things (words), then just add an argument for those two such as:
toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")
and change perl to TRUE:
foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = T)])}
> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
EDIT 5: Answering your other question:
Given:
sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
Will give you the words inside the double brackets.
> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen" "Wittenberg" "Martin Luther" "Johann Reuchlin"
Here's a considerably simpler method using two packages quanteda and stringi:
sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
sentList <- split(sents, list(namesFound))
sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
##
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
##
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."

How would I turn a multivalue string into a usable frequency table in R?

I have a field in a data frame called plugins_Apache_module
it contains strings like:
c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
"mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
"mod_ssl/2.2.9")
I need a frequency table on the modules, and also their versions.
What is the best way to do this in R? As being rather new in R, I've seen strsplit, gsub, some chatrooms also suggested I use the qdap package.
Ideally I would want the string transformed into a dataframe with a column for every mod, if the module is there, then the version goes in that particular field. How would I accomplish such a transform?
What dataframe format would be suggested if I want top-level frequencies - say mod_ssl (all versions) as well as relational options (mod_perl is very often used with mod_ssl).
I'm not too sure how to handle such variable length data when pushing into a dataframe for processing. Any advice is welcome.
I consider the right answer to look like:
mod_perl mod_python mod_ssl mod_auth_passthrough mod_bwlimited
1.99_16 3.1.3 2.0.52
2.2.23 2.1 1.4
2.2.9
So basically the first bit becomes a column and the version(s) that follows become a row entry
st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52", "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23", "mod_ssl/2.2.9")
scan(text=st, what="", sep=",")
Read 7 items
[1] "mod_perl/1.99_16" "mod_python/3.1.3" "mod_ssl/2.0.52"
[4] "mod_auth_passthrough/2.1" "mod_bwlimited/1.4" "mod_ssl/2.2.23"
[7] "mod_ssl/2.2.9"
strsplit( scan(text=st, what="", sep=","), "/")
Read 7 items
[[1]]
[1] "mod_perl" "1.99_16"
[[2]]
[1] "mod_python" "3.1.3"
[[3]]
[1] "mod_ssl" "2.0.52"
[[4]]
[1] "mod_auth_passthrough" "2.1"
[[5]]
[1] "mod_bwlimited" "1.4"
[[6]]
[1] "mod_ssl" "2.2.23"
[[7]]
[1] "mod_ssl" "2.2.9"
table( sapply(strsplit( scan(text=st, what="", sep=","), "/"), "[",1) )
#----------------
Read 7 items
mod_auth_passthrough mod_bwlimited mod_perl mod_python
1 1 1 1
mod_ssl
3
table( scan(text=st, what="", sep=",") )
#-----------
Read 7 items
mod_auth_passthrough/2.1 mod_bwlimited/1.4 mod_perl/1.99_16
1 1 1
mod_python/3.1.3 mod_ssl/2.0.52 mod_ssl/2.2.23
1 1 1
mod_ssl/2.2.9
1
You ask for at minimum two different things. Adding desired output greatly helped. I'm not sure if what you ask for is what you really want but you asked and it seemed like a fun problem. Ok here's how I would approach this using qdap (this requires qdap version 1.1.0 though):
## load qdap
library(qdap)
## your data
x <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
"mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
"mod_ssl/2.2.9")
## strsplit on commas and slashes
dat <- unlist(lapply(x, strsplit, ",|/"), recursive=FALSE)
## make just a list of mods per row
mods <- lapply(dat, "[", c(TRUE, FALSE))
## make a string of versions
ver <- unlist(lapply(dat, "[", c(FALSE, TRUE)))
## make a lookup key and split it into lists
key <- data.frame(mod = unlist(mods), ver, row = rep(seq_along(mods),
sapply(mods, length)))
key2 <- split(key[, 1:2], key$row)
## make it into freq. counts
freqs <- mtabulate(mods)
## rename assign freq table to vers in case you want freqs ans replace 0 with NA
vers <- freqs
vers[vers==0] <- NA
## loop through and fill the ones in each row using an env. lookup (%l%)
for(i in seq_len(nrow(vers))) {
x <- vers[i, !is.na(vers[i, ]), drop = FALSE]
vers[i, !is.na(vers[i, ])] <- colnames(x) %l% key2[[i]]
}
## Don't print the NAs
print(vers, na.print = "")
## mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1 1.99_16 3.1.3 2.0.52
## 2 2.1 1.4 2.2.23
## 3 2.2.9
## the frequency counts per mods
freqs
## mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1 0 0 1 1 1
## 2 1 1 0 0 1
## 3 0 0 0 0 1