In R, how do I match markdown list - regex

I'm trying to match the following ordered and unordered lists and extract the bullet/list point.
library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)
What I would like to do is:
Recognize that it's a list programatically
Parse each into just the text of the list item
The result would be
some_str_fun(example,pattern) # or multiples
"Bullet 1" "Bullet 2" "Bullet 3"
"Bullet 1" "Bullet 2" "Bullet 3"
"This is a test 1" "This is a test with some *formatting*"
"This is a test with different _formatting_"
I've been playing with the following patterns, and str_extract/match but can't seem to find something completely functional
[*]+\\s(.*?)[\n]* # for * Bullet X\n
[1-9]+[.]\\s(.*?)[\n]* # for 1. Bullet X\n
I've tried a bunch of different iterations on these patters but can't quite seem to get what I'm looking for.

You can use strapply from the gsubfn package to match the entire pattern.
library(gsubfn)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)
strapply(examples, '(?:\\*|\\d+\\.) *([^\n]+)', c, simplify = c)
# [1] "Bullet 1"
# [2] "Bullet 2"
# [3] "Bullet 3"
# [4] "Bullet 1"
# [5] "Bullet 2"
# [6] "Bullet 3"
# [7] "This is a test 1"
# [8] "This is a test with some *formatting*"
# [9] "This is a test with different _formatting_"

This is a bit of a different approach, but if you render the markdown to HTML you can use some existing extraction methods to do what you want:
library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)
extract_md_list <- function(md_text) {
require(rvest)
require(rmarkdown)
fil_md <- tempfile()
fil_html <- tempfile()
writeLines(md_text, con=fil_md)
render(fil_md, output_format="html_document", output_file=fil_html, quiet=TRUE)
pg <- html(fil_html)
ret <- html_nodes(pg, "li") %>% html_text()
# cleanup
unlink(fil_md)
unlink(fil_html)
return(ret)
}
extract_md_list(examples)
## [1] "Bullet 1"
## [2] "Bullet 2"
## [3] "Bullet 3"
## [4] "Bullet 1"
## [5] "Bullet 2"
## [6] "Bullet 3"
## [7] "This is a test 1"
## [8] "This is a test with some formatting"
## [9] "This is a test with different formatting"

Here is another option. You can wrap in unlist if desired:
str_extract_all(examples, "[^*1-9\n ]\\w+( ?[\\w*]+)*")
# or
#str_extract_all(examples, "[^*1-9\n ]\\w+( ?[a-zA-Z0-9_*]+)*")
#[[1]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[2]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[3]]
#[1] "This is a test 1"
#[2] "This is a test with some *formatting*"
#[3] "This is a test with different _formatting_"
There are several other options, particularly if you're not concerned about getting it all in a single regex or single line of code. Here's one more approach. The regex is simpler, but you end up with "", which requires the additional line:
splits <- unlist(str_split(examples, "\n|\\d+\\. |\\* "))
splits[splits != ""]
#[1] "Bullet 1"
#[2] "Bullet 2"
#[3] "Bullet 3"
#[4] "Bullet 1"
#[5] "Bullet 2"
#[6] "Bullet 3"
#[7] "This is a test 1"
#[8] "This is a test with some *formatting*"
#[9] "This is a test with different _formatting_"

Related

Extracting pattern from the nested list in R using regex

I have following sorted list (lst) of time periods and I want to split the periods into specific dates and then extract maximum time period without altering order of the list.
$`1`
[1] "01.12.2015 - 21.12.2015"
$`2`
[1] "22.12.2015 - 05.01.2016"
$`3`
[1] "14.09.2015 - 12.10.2015" "29.09.2015 - 26.10.2015"
Therefore, after adjustment list should look like this:
$`1`
[1] "01.12.2015" "21.12.2015"
$`2`
[1] "22.12.2015" "05.01.2016"
$`3`
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
In order to do so, I began with splitting the list:
lst_split <- str_split(lst, pattern = " - ")
which leads to the following:
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "c(\"14.09.2015" "12.10.2015\", \"29.09.2015" "26.10.2015\")"
Then, I tried to extract the pattern:
lapply(lst_split, function(x) str_extract(pattern = c("\\d+\\.\\d+\\.\\d+"),x))
but my output is missing one date (29.09.2015)
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "26.10.2015"
Does anyone have an idea how I could make it work and maybe propose more efficient solution? Thank you in advance.
Thanks to comments of #WiktorStribiżew and #akrun it is enough to use str_extract_all.
In this example:
> str_extract_all(lst,"\\d+\\.\\d+\\.\\d+")
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
1) Use strsplit, flatten each component using unlist, convert the dates to "Date" class and then use range to get the maximum time span. No packages are used.
> lapply(lst, function(x) range(as.Date(unlist(strsplit(x, " - ")), "%d.%m.%Y")))
$`1`
[1] "2015-12-01" "2015-12-21"
$`2`
[1] "2015-12-22" "2016-01-05"
$`3`
[1] "2015-09-14" "2015-10-26"
2) This variation using a magrittr pipeline also works:
library(magrittr)
lapply(lst, function(x)
x %>%
strsplit(" - ") %>%
unlist %>%
as.Date("%d.%m.%Y") %>%
range
)
Note: The input lst in reproducible form is:
lst <- structure(list(`1` = "01.12.2015 - 21.12.2015", `2` = "22.12.2015 - 05.01.2016",
`3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015"
)), .Names = c("1", "2", "3"))

Easy way to extract text between defined set of strings in R

I have some text with defined labels and need to split the text according to the labels.
For example given the text with labels set {A, B, C..}
text <- c("A: how are you B: hello sir C: bye bye")
text2 <- c("USER COMMENTS: TEST PROC: Refer manual. SOLUTION: fix BIAS32 user:param", "TEST PROC: install spare unit. USER COMMENTS: hello sir SOLUTION: tighten bolt 12","TEST PROC: bye bye.")
I need to extract text "how are you", "hello sir" , etc.. corresponding to labels A, B, etc.
There is no specific order of the labels, certain labels could be missing and labels can be phrases (not just characters)
This is what I have so far to extract text corresponding to label A:
gsub("(.*A.*:)(.*)(B.*|C.*)","\\2",text,perl=TRUE)
But this does not work in so many cases!
I am looking for a solution where I can define a vector of labels such as
labels <- c("USER COMMENTS", "TEST PROC", "SOLUTION") # this is a big list!
and extract the text corresponding to these labels as below
USER COMMENTS are "", "hello sir"
TEST PROC are "Refer manual.", "install spare unit.","bye bye."
SOLUTION are "fix BIAS32 user:param", "tighten bolt 12"
etc..
I think I might have a solution based on Sharath's comment.
First, there's strsplit(), which can split a vector based on regex. In your case you could use:
labels2<-paste(labels,collapse="|")
[1] "USER COMMENTS|TEST PROC|SOLUTION"
If you apply strsplit on that:
splittedtext<-strsplit(text2,labels2)
[[1]]
[1] "" ": "
[3] ": Refer manual. " ": fix BIAS32 user:param"
[[2]]
[1] "" ": install spare unit. " ": hello sir "
[4] ": tighten bolt 12"
[[3]]
[1] "" ": bye bye."
Pretty much what you want, right? You could do some refining by adding ": " to the end of every index, and the first element is gibberish. So taking care of the latter:
splittedtext<-lapply(splittedtext,"[",-1)
That generates the problem that you must figure out to which label a comment applies. For that you could use regexpr() function in R.
pos=sapply(labels,regexpr,text2)
USER COMMENTS TEST PROC SOLUTION
[1,] 1 16 41
[2,] 32 1 57
[3,] -1 1 -1
Each cell represents the position in which said label [column] appear on string [row]. -1 denote that it does not appear on this string.
Now switch, -1 for NA, and rank the remaining numbers. That will give to you which string snippet represents that label.
pos=ifelse(pos==-1,NA,pos) #switch -1 for NA
pos=t(apply(pos,1,rank,na.last="keep"))
USER COMMENTS TEST PROC SOLUTION
[1,] 1 2 3
[2,] 2 1 3
[3,] NA 1 NA
Now it's just matching.

How to extract sentences containing specific person names using R

I am using R to extract sentences containing specific person names from texts and here is a sample paragraph:
Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.
In this short paragraph, there are several person names such as:
Johann Reuchlin, Melanchthon, Johann Eck. With the help of openNLP package, three person names Martin Luther, Paul and Melanchthon can be correctly extracted and recognized. Then I have two questions:
How could I extract sentences containing these names?
As the output of named entity recognizer is not so promising, if I add "[[ ]]" to each name such as [[Johann Reuchlin]], [[Melanchthon]], how could I extract sentences containing these name expressions [[A]], [[B]] ...?
Using `strsplit` and `grep`, first I set made an object `para` which was your paragraph.
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Or a little cleaner:
sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
If you are looking for the sentences that each person is in as separate returns then:
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)
[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
Edit 3: To add each persons name, do something simple such as:
foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
EDIT 4:
And if you wanted to find sentences that had multiple people/places/things (words), then just add an argument for those two such as:
toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")
and change perl to TRUE:
foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = T)])}
> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[[2]]
[1] "Paul"
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[[3]]
[1] "Melanchthon"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
EDIT 5: Answering your other question:
Given:
sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
Will give you the words inside the double brackets.
> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen" "Wittenberg" "Martin Luther" "Johann Reuchlin"
Here's a considerably simpler method using two packages quanteda and stringi:
sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
sentList <- split(sents, list(namesFound))
sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
##
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
##
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."

Split String into Columns by two character markers

All, I have searched around and can't find the answer on how to do this. I am relatively new to R and have not used regular expresions before but bascially I have some data put into a field like this:
"#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes "
I basically want to be able to split the string up into different elements, so each category after the # has it own column.
I tried using the tidyr package and initially tried:
string %>% separate(Description, into = c("Route","Details","Result","License No",
"Vehicle Desciption"),
sep = "\n#", remove =F, extra = "drop")
But realised I only wanted the data after the "-". I tried inserting a "-" in the code but it didn't work. Does anyone know how I can split the string ideally between the "-" and the "#".
Many thanks
In one line:
> gsub("^\\s+|\\s+$","",gsub(".*?[-]","",unlist(strsplit(str,"#"))))
[1] "" "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
Or separate for better understanding:
Break string by "#":
a = unlist(strsplit(str,"#"))
Remove what is before the "-"
b = gsub(".*?[-]","",a)
Remove leading and trailing spaces:
gsub("^\\s+|\\s+$","",b)
You could do the following:
strsplit(x, ' *#[^-]+- *')[[1]][2:6]
# [1] "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
To supply the column names you desire, I suppose you could do something like:
mat <- matrix(strsplit(x, ' *#[^-]+- *')[[1]][2:6], ncol=5, byrow=T)
colnames(mat) <- c('Route', 'CAT', 'Details', 'Result', 'Vehicle Description')
# Route CAT Details Result Vehicle Description
# [1,] "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
Using str_extract from stringr
library(stringr)
str_extract_all(str1, '(?<=-\\s)\\w+(?:\\s*\\w+){0,}')[[1]]
#[1] "6" "PARKING" "Parking issues" "MOVED ON"
#[5] "Mercedes"
str_extract_all(str2, '(?<=-\\s)\\w+(?:\\s*\\w+){0,}')[[1]]
#[1] "6" "PARKING"
#[3] "Parking issues" "MOVED ON"
#[5] "Mercedes" "Parking issues are present"
#[7] "MOVED ON" "Mercedes"
data
str1 <- "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes "
str2 <- "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes #Details - Parking issues are present#Result - MOVED ON #Vehicle Type - Mercedes "

Detect and remove signature in forum text messages using R

I've a collection of text messages scraped from a forum into a data frame. Here's a reproducible example:
example.df <- data.frame(author=c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy"),
message=c("Hello World! Mikey Mouse",
"Quack Quack! Donald Duck",
"I was born in 1928. Mikey Mouse",
"Quack Quack! Daisy Duck",
"The quick fox jump over Minnie Mouse",
"Quack Quack! Daisy Duck"))
My idea is to find the longest common suffix found on every message for the same author for all those who have written more than on message. For all others, well, I'll find a regex way that gracefully degradates.
I found the bioconductor package RLibstree that looks promising, thanks to the function getLongestCommonSubstring, but I don't know how to group the function to all the messages from the same author.
I think I'd convert to a list in the following format and use the stringdist package to find common sentences and remove any above a certain threshold of similarity for all sentences used by an author. outer may be of use here as well:
## load packages in this order
library(stringi)
library(magrittr)
example.df[["message"]] %>%
stringi::stri_split_regex(., "(?<=[.?!]{1,5})\\s+") %>%
split(example.df[["author"]])
## $Daisy
## $Daisy[[1]]
## [1] "Quack Quack!" "Daisy Duck"
##
## $Daisy[[2]]
## [1] "Quack Quack!" "Daisy Duck"
##
##
## $Donald
## $Donald[[1]]
## [1] "Quack Quack!" "Donald Duck"
##
##
## $Mikey
## $Mikey[[1]]
## [1] "Hello World!" "Mikey Mouse"
##
## $Mikey[[2]]
## [1] "I was born in 1928." "Mikey Mouse"
##
##
## $Minnie
## $Minnie[[1]]
## [1] "The quick fox jump over Minnie Mouse"
I don't know how to group the function to all the messages from the
same author.
Perhaps tapply is what you are looking for.
> tapply(as.character(example.df$message), example.df$author, function(x) x)
$Daisy
[1] "Quack Quack! Daisy Duck" "Quack Quack! Daisy Duck"
$Donald
[1] "Quack Quack! Donald Duck"
$Mikey
[1] "Hello World! Mikey Mouse" "I was born in 1928. Mikey Mouse"
$Minnie
[1] "The quick fox jump over Minnie Mouse"
You can use your own function in place of function(x) x, of course.
Here is an implementation that uses no additional libraries.
example.df <- data.frame(author=c("Mikey", "Donald", "Mikey",
"Daisy", "Minnie", "Daisy"),
message=c("Hello World! Mikey Mouse",
"Quack Quack! Donald Duck",
"I was born in 1928. Mikey Mouse",
"Quack Quack! Daisy Duck",
"The quick fox jump over Minnie Mouse",
"Quack Quack! Daisy Duck"))
signlen = function(am) # determine signature length of an author's messages
{
if (length(am) <= 1) return(0) # return if not more than 1 message
# turn the messages into reversed vectors of single characters
# in order to conveniently access the suffixes from index 1 on
am = lapply(strsplit(as.character(am), ''), rev)
# find the longest common suffix in the messages
longest_common = .Machine$integer.max
for (m in 2:length(am))
{
i = 1
max_length = min(length(am[[m]]), length(am[[m-1]]), longest_common)
while (i <= max_length && am[[m]][i] == am[[m-1]][i]) i = i+1
longest_common = i-1
if (longest_common == 0) return(0) # shortcut: need not look further
}
return(longest_common)
}
# determine signature length of every author's messages
signature_length = tapply(example.df$message, example.df$author, signlen)
#> signature_length
# Daisy Donald Mikey Minnie
# 23 0 12 0
# determine resulting length "to" of messages with signatures removed
to = nchar(as.character(example.df$message))-signature_length[example.df$author]
#> to
# Mikey Donald Mikey Daisy Minnie Daisy
# 12 24 19 0 36 0
# remove the signatures by replacing messages with resulting substring
example.df$message = substr(example.df$message, 1, to)
#> example.df
# author message
#1 Mikey Hello World!
#2 Donald Quack Quack! Donald Duck
#3 Mikey I was born in 1928.
#4 Daisy
#5 Minnie The quick fox jump over Minnie Mouse
#6 Daisy