How to extract sentences containing specific person names using R - regex

I am using R to extract sentences containing specific person names from texts and here is a sample paragraph:
Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.
In this short paragraph, there are several person names such as:
Johann Reuchlin, Melanchthon, Johann Eck. With the help of openNLP package, three person names Martin Luther, Paul and Melanchthon can be correctly extracted and recognized. Then I have two questions:
How could I extract sentences containing these names?
As the output of named entity recognizer is not so promising, if I add "[[ ]]" to each name such as [[Johann Reuchlin]], [[Melanchthon]], how could I extract sentences containing these name expressions [[A]], [[B]] ...?

Using `strsplit` and `grep`, first I set made an object `para` which was your paragraph.
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Or a little cleaner:
sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
If you are looking for the sentences that each person is in as separate returns then:
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)
[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Edit 3: To add each persons name, do something simple such as:
foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
EDIT 4:
And if you wanted to find sentences that had multiple people/places/things (words), then just add an argument for those two such as:
toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")
and change perl to TRUE:
foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = T)])}
> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
EDIT 5: Answering your other question:
Given:
sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
Will give you the words inside the double brackets.
> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen" "Wittenberg" "Martin Luther" "Johann Reuchlin"

Here's a considerably simpler method using two packages quanteda and stringi:
sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
sentList <- split(sents, list(namesFound))
sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
##
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
##
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."

Related

Regex in R -- extracting sub-string based on two start/stop words

I have a character (text) column:
tweets <- c(
"Drinking a Bud Light by #Budweiser # Joe's Crab Shack http://www.joes.com",
"Drinking a Sam Adams Winter Ale by #SamAdams # Growler Stop http://www.growlerstop.com",
"Drinking a Coco Loco by #NoDaBrewing # The Corner Pub http://www.cornerpub.com"
)
As you can see, assume the tweets have a standard structure:
"Drinking a [name of beer] by #[name of brewery] # [name of bar, notice whitespace] http://"
I want to use regular expressions (and substr()?) to create three new columns:
Name of the beer
Name of the brewery
Name of the bar (note that it could have white space, so needs to go to "http:")
One step further - how do I control for some Tweets that do not have the same structure?
It's ugly:
setNames(nm=c('beer','brewery','bar'),as.data.frame(do.call(rbind,
regmatches(tweets,regexec('^Drinking an? (.*) by #(.*) # (.*) http://.*$',tweets))
)[,-1L]));
## beer brewery bar
## 1 Bud Light Budweiser Joe's Crab Shack
## 2 Sam Adams Winter Ale SamAdams Growler Stop
## 3 Coco Loco NoDaBrewing The Corner Pub
See regexec() and regmatches().
do.call(rbind,strsplit(gsub('.*\\ba\\b(.*) by #(.*) #(.*) http.*','\\1|\\2|\\3',tweets),'\\|'))
# [,1] [,2] [,3]
#[1,] " Bud Light" "Budweiser" " Joe's Crab Shack"
#[2,] " Sam Adams Winter Ale" "SamAdams" " Growler Stop"
#[3,] " Coco Loco" "NoDaBrewing" " The Corner Pub"

Filling in data frame column using regular expressions (?)

Ok, so I have a data frame of web forum comments. Each row has a cell containing an ID which is part of the link to that comment's parent comment. The rows contain the full permalink to the comment, of which the ID is the varying part.
I'd like to add a column that shows the user name attached to that parent comment. I'm assuming I'll need to use some regular expression function, which I find mystifying at this point.
In workflow terms, I need to find the row whose URL contains the parent comment ID, grab the user name from that row. Here's a toy example:
toy <- rbind(c("yes?", "john", "www.website.com/4908", "3214", NA), c("don't think so", "mary", "www.website.com/3958", "4908", NA))
toy <- as.data.frame(toy)
colnames(toy) <- c("comment", "user", "URL", "parent", "parent_user")
comment user URL parent parent_user
1 yes? john www.website.com/4908 3214 <NA>
2 don't think so mary www.website.com/3958 4908 <NA>
which needs to become:
comment user URL parent parent_user
1 yes? john www.website.com/4908 3214 <NA>
2 don't think so mary www.website.com/3958 4908 john
Some values in this column will be NA, since they're top level comments. So something like,
dataframe$parent_user <- dataframe['the row where parent
ID i is found in the URL column', 'the user name column in that row']
Thanks!!
Another option, using the basename function from base R, which "removes all of the path up to and including the last path separator (if any)"
toy$user[match(toy$parent, basename(as.character(toy$URL)))]
#1] <NA> john
#Levels: john mary
Here is a vectorized option with stri_extract and match
library(stringi)
toy$parent_user <- toy$user[match(toy$parent,stri_extract(toy$URL,
regex=paste(toy$parent, collapse="|")))]
toy
# comment user URL parent parent_user
#1 yes? john www.website.com/4908 3214 <NA>
#2 don't think so mary www.website.com/3958 4908 john
Or as #jazzurro mentioned, a faster option would be using stri_extract with data.table and fmatch
library(data.table)
library(fastmatch)
setDT(toy)[, parent_user := user[fmatch(parent,
stri_extract_last_regex(str=URL, pattern = "\\d+"))]]
Or a base R option would be
with(toy, user[match(parent, sub("\\D+", "", URL))])
#[1] <NA> john
#Levels: john mary
nchar('with(toy, user[match(parent, sub("\\D+", "", URL))])')
#[1] 51
nchar('toy$user[match(toy$parent, basename(as.character(toy$URL)))]')
#[1] 60
Perhaps not the prettiest way to do it, but an option:
toy$parent_user <- sapply(toy$parent,
function(x){p <- toy[x == sub('[^0-9]*', '', toy$URL), 'user'];
ifelse(length(p) > 0, as.character(p), NA)})
toy
# comment user URL parent parent_user
# 1 yes? john www.website.com/4908 3214 <NA>
# 2 don't think so mary www.website.com/3958 4908 john
The second line is really just to deal with cases lacking matches.

Easy way to extract text between defined set of strings in R

I have some text with defined labels and need to split the text according to the labels.
For example given the text with labels set {A, B, C..}
text <- c("A: how are you B: hello sir C: bye bye")
text2 <- c("USER COMMENTS: TEST PROC: Refer manual. SOLUTION: fix BIAS32 user:param", "TEST PROC: install spare unit. USER COMMENTS: hello sir SOLUTION: tighten bolt 12","TEST PROC: bye bye.")
I need to extract text "how are you", "hello sir" , etc.. corresponding to labels A, B, etc.
There is no specific order of the labels, certain labels could be missing and labels can be phrases (not just characters)
This is what I have so far to extract text corresponding to label A:
gsub("(.*A.*:)(.*)(B.*|C.*)","\\2",text,perl=TRUE)
But this does not work in so many cases!
I am looking for a solution where I can define a vector of labels such as
labels <- c("USER COMMENTS", "TEST PROC", "SOLUTION") # this is a big list!
and extract the text corresponding to these labels as below
USER COMMENTS are "", "hello sir"
TEST PROC are "Refer manual.", "install spare unit.","bye bye."
SOLUTION are "fix BIAS32 user:param", "tighten bolt 12"
etc..
I think I might have a solution based on Sharath's comment.
First, there's strsplit(), which can split a vector based on regex. In your case you could use:
labels2<-paste(labels,collapse="|")
[1] "USER COMMENTS|TEST PROC|SOLUTION"
If you apply strsplit on that:
splittedtext<-strsplit(text2,labels2)
[[1]]
[1] "" ": "
[3] ": Refer manual. " ": fix BIAS32 user:param"
[[2]]
[1] "" ": install spare unit. " ": hello sir "
[4] ": tighten bolt 12"
[[3]]
[1] "" ": bye bye."
Pretty much what you want, right? You could do some refining by adding ": " to the end of every index, and the first element is gibberish. So taking care of the latter:
splittedtext<-lapply(splittedtext,"[",-1)
That generates the problem that you must figure out to which label a comment applies. For that you could use regexpr() function in R.
pos=sapply(labels,regexpr,text2)
USER COMMENTS TEST PROC SOLUTION
[1,] 1 16 41
[2,] 32 1 57
[3,] -1 1 -1
Each cell represents the position in which said label [column] appear on string [row]. -1 denote that it does not appear on this string.
Now switch, -1 for NA, and rank the remaining numbers. That will give to you which string snippet represents that label.
pos=ifelse(pos==-1,NA,pos) #switch -1 for NA
pos=t(apply(pos,1,rank,na.last="keep"))
USER COMMENTS TEST PROC SOLUTION
[1,] 1 2 3
[2,] 2 1 3
[3,] NA 1 NA
Now it's just matching.

Extracting everything after first two words in R

I am trying to extract all the info, using a regular expression in R, after the first number and first word of an entry in a data frame.
For example:
Header =
c("2006 Volvo XC70",
"2012 Ford Econoline Cargo Van E-250 Commercial",
"2012 Nissan Frontier",
"2012 Kia Soul 5dr Wagon Automatic")
I want to write a pattern that will grab Volvo XC70, or Econoline Cargo Van E-250 Commercial (everything after the year and make) from an entry in my "header" column so that I may run the function on my data frame and create a new "model" column. I can't figure out a pattern that will allow me to skip the first string of integers, then a space, then the first string of characters, and then a space, and then grab everything proceeding.
Any help would be appreciated. Thanks!
Just use sub.
sub("^\\d+\\s+\\w+\\s+", "", df$x)
Example:
x <- "2012 Ford Econoline Cargo Van E-250 Commercial"
sub("^\\d+\\s+\\w+\\s+", "", x)
# [1] "Econoline Cargo Van E-250 Commercial"
For this task, I would fetch a basic list using the XML package:
library(XML)
doc <- xmlParse('http://www.fueleconomy.gov/ws/rest/ympg/shared/menu/make')
Now that we fetched the XML data we can create a vector with the car makes:
mk <- xpathSApply(doc, '//value', xmlValue)
Finally, I'll compile the pattern and play around with sprintf and sub:
df$Makes <- sub(sprintf('\\d+ (?:%s) ', paste(mk, collapse='|')), '', df$Header)
Output:
## Header
# 1 2006 Volvo XC70
# 2 2012 Ford Econoline Cargo Van E-250 Commercial
# 3 2012 Nissan Frontier
# 4 2012 Kia Soul 5dr Wagon Automatic
## Makes
# 1 XC70
# 2 Econoline Cargo Van E-250 Commercial
# 3 Frontier
# 4 Soul 5dr Wagon Automatic

Detect and remove signature in forum text messages using R

I've a collection of text messages scraped from a forum into a data frame. Here's a reproducible example:
example.df <- data.frame(author=c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy"),
message=c("Hello World! Mikey Mouse",
"Quack Quack! Donald Duck",
"I was born in 1928. Mikey Mouse",
"Quack Quack! Daisy Duck",
"The quick fox jump over Minnie Mouse",
"Quack Quack! Daisy Duck"))
My idea is to find the longest common suffix found on every message for the same author for all those who have written more than on message. For all others, well, I'll find a regex way that gracefully degradates.
I found the bioconductor package RLibstree that looks promising, thanks to the function getLongestCommonSubstring, but I don't know how to group the function to all the messages from the same author.
I think I'd convert to a list in the following format and use the stringdist package to find common sentences and remove any above a certain threshold of similarity for all sentences used by an author. outer may be of use here as well:
## load packages in this order
library(stringi)
library(magrittr)
example.df[["message"]] %>%
stringi::stri_split_regex(., "(?<=[.?!]{1,5})\\s+") %>%
split(example.df[["author"]])
## $Daisy
## $Daisy[[1]]
## [1] "Quack Quack!" "Daisy Duck"
##
## $Daisy[[2]]
## [1] "Quack Quack!" "Daisy Duck"
##
##
## $Donald
## $Donald[[1]]
## [1] "Quack Quack!" "Donald Duck"
##
##
## $Mikey
## $Mikey[[1]]
## [1] "Hello World!" "Mikey Mouse"
##
## $Mikey[[2]]
## [1] "I was born in 1928." "Mikey Mouse"
##
##
## $Minnie
## $Minnie[[1]]
## [1] "The quick fox jump over Minnie Mouse"
I don't know how to group the function to all the messages from the
same author.
Perhaps tapply is what you are looking for.
> tapply(as.character(example.df$message), example.df$author, function(x) x)
$Daisy
[1] "Quack Quack! Daisy Duck" "Quack Quack! Daisy Duck"
$Donald
[1] "Quack Quack! Donald Duck"
$Mikey
[1] "Hello World! Mikey Mouse" "I was born in 1928. Mikey Mouse"
$Minnie
[1] "The quick fox jump over Minnie Mouse"
You can use your own function in place of function(x) x, of course.
Here is an implementation that uses no additional libraries.
example.df <- data.frame(author=c("Mikey", "Donald", "Mikey",
"Daisy", "Minnie", "Daisy"),
message=c("Hello World! Mikey Mouse",
"Quack Quack! Donald Duck",
"I was born in 1928. Mikey Mouse",
"Quack Quack! Daisy Duck",
"The quick fox jump over Minnie Mouse",
"Quack Quack! Daisy Duck"))
signlen = function(am) # determine signature length of an author's messages
{
if (length(am) <= 1) return(0) # return if not more than 1 message
# turn the messages into reversed vectors of single characters
# in order to conveniently access the suffixes from index 1 on
am = lapply(strsplit(as.character(am), ''), rev)
# find the longest common suffix in the messages
longest_common = .Machine$integer.max
for (m in 2:length(am))
{
i = 1
max_length = min(length(am[[m]]), length(am[[m-1]]), longest_common)
while (i <= max_length && am[[m]][i] == am[[m-1]][i]) i = i+1
longest_common = i-1
if (longest_common == 0) return(0) # shortcut: need not look further
}
return(longest_common)
}
# determine signature length of every author's messages
signature_length = tapply(example.df$message, example.df$author, signlen)
#> signature_length
# Daisy Donald Mikey Minnie
# 23 0 12 0
# determine resulting length "to" of messages with signatures removed
to = nchar(as.character(example.df$message))-signature_length[example.df$author]
#> to
# Mikey Donald Mikey Daisy Minnie Daisy
# 12 24 19 0 36 0
# remove the signatures by replacing messages with resulting substring
example.df$message = substr(example.df$message, 1, to)
#> example.df
# author message
#1 Mikey Hello World!
#2 Donald Quack Quack! Donald Duck
#3 Mikey I was born in 1928.
#4 Daisy
#5 Minnie The quick fox jump over Minnie Mouse
#6 Daisy