Easy way to extract text between defined set of strings in R - regex

I have some text with defined labels and need to split the text according to the labels.
For example, given text with the label set {A, B, C, ...}:
text <- c("A: how are you B: hello sir C: bye bye")
text2 <- c("USER COMMENTS: TEST PROC: Refer manual. SOLUTION: fix BIAS32 user:param", "TEST PROC: install spare unit. USER COMMENTS: hello sir SOLUTION: tighten bolt 12","TEST PROC: bye bye.")
I need to extract the text "how are you", "hello sir", etc. corresponding to labels A, B, etc.
The labels appear in no specific order, some labels may be missing, and labels can be phrases (not just single characters).
This is what I have so far to extract text corresponding to label A:
gsub("(.*A.*:)(.*)(B.*|C.*)","\\2",text,perl=TRUE)
But this fails in many cases, for example when the labels come in a different order or one of them is missing.
I am looking for a solution where I can define a vector of labels such as
labels <- c("USER COMMENTS", "TEST PROC", "SOLUTION") # this is a big list!
and extract the text corresponding to these labels as below
USER COMMENTS are "", "hello sir"
TEST PROC are "Refer manual.", "install spare unit.","bye bye."
SOLUTION are "fix BIAS32 user:param", "tighten bolt 12"
etc..

I think I might have a solution based on Sharath's comment.
First, there's strsplit(), which can split a vector based on regex. In your case you could use:
labels2<-paste(labels,collapse="|")
[1] "USER COMMENTS|TEST PROC|SOLUTION"
If you apply strsplit on that:
splittedtext<-strsplit(text2,labels2)
[[1]]
[1] "" ": "
[3] ": Refer manual. " ": fix BIAS32 user:param"
[[2]]
[1] "" ": install spare unit. " ": hello sir "
[4] ": tighten bolt 12"
[[3]]
[1] "" ": bye bye."
Pretty much what you want, right? You could refine it by adding ": " to the end of every label in the pattern, so that the separator is removed as well. The first element of each result is just an empty string (each text starts with a label), so drop it:
splittedtext<-lapply(splittedtext,"[",-1)
This leaves the problem of working out which label each piece belongs to. For that you can use the regexpr() function:
pos=sapply(labels,regexpr,text2)
     USER COMMENTS TEST PROC SOLUTION
[1,]             1        16       41
[2,]            32         1       57
[3,]            -1         1       -1
Each cell gives the position at which that label [column] appears in that string [row]; -1 means the label does not appear in the string.
Now replace -1 with NA and rank the remaining numbers within each row. The rank tells you which piece of the split string belongs to that label.
pos=ifelse(pos==-1,NA,pos) #switch -1 for NA
pos=t(apply(pos,1,rank,na.last="keep"))
     USER COMMENTS TEST PROC SOLUTION
[1,]             1         2        3
[2,]             2         1        3
[3,]            NA         1       NA
Now it's just matching.
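One way the final matching step might look, as a minimal sketch rather than the answerer's exact code. It assumes every label is followed by ": " in the text, that each label appears at most once per string, and that no string ends with an empty label (strsplit() drops a trailing empty piece). The names splitter, pieces, ord and result are made up for illustration.

```r
text2 <- c("USER COMMENTS: TEST PROC: Refer manual. SOLUTION: fix BIAS32 user:param",
           "TEST PROC: install spare unit. USER COMMENTS: hello sir SOLUTION: tighten bolt 12",
           "TEST PROC: bye bye.")
labels <- c("USER COMMENTS", "TEST PROC", "SOLUTION")

# split on "<label>:" plus whitespace, so the separator is consumed with the label
splitter <- paste0("(", paste(labels, collapse = "|"), "):\\s*")
pieces <- lapply(strsplit(text2, splitter), function(p) trimws(p[-1]))

# position of each label in each string; -1 (not found) becomes NA
pos <- sapply(labels, function(l) as.integer(regexpr(l, text2, fixed = TRUE)))
pos <- matrix(pos, nrow = length(text2), dimnames = list(NULL, labels))
pos[pos == -1] <- NA

# rank the positions within each string: rank k means "the k-th piece"
ord <- t(apply(pos, 1, rank, na.last = "keep"))
dimnames(ord) <- dimnames(pos)

# look up the k-th piece for every (string, label) pair; NA if the label is absent
result <- sapply(labels, function(l) {
  mapply(function(p, k) if (is.na(k)) NA_character_ else p[k],
         pieces, ord[, l])
})
result <- matrix(result, nrow = length(text2), dimnames = list(NULL, labels))
```

With this, result[, "TEST PROC"] holds "Refer manual.", "install spare unit." and "bye bye.", and a missing label shows up as NA rather than being silently dropped.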

Related

Regex in R -- extracting sub-string based on two start/stop words

I have a character (text) column:
tweets <- c(
"Drinking a Bud Light by #Budweiser # Joe's Crab Shack http://www.joes.com",
"Drinking a Sam Adams Winter Ale by #SamAdams # Growler Stop http://www.growlerstop.com",
"Drinking a Coco Loco by #NoDaBrewing # The Corner Pub http://www.cornerpub.com"
)
As you can see, assume the tweets have a standard structure:
"Drinking a [name of beer] by #[name of brewery] # [name of bar, notice whitespace] http://"
I want to use regular expressions (and substr()?) to create three new columns:
Name of the beer
Name of the brewery
Name of the bar (note that it could have white space, so needs to go to "http:")
One step further - how do I control for some Tweets that do not have the same structure?
It's ugly:
m <- regmatches(tweets, regexec('^Drinking an? (.*) by #(.*) # (.*) http://.*$', tweets))
# drop the full-match column, keep the three capture groups
setNames(as.data.frame(do.call(rbind, m)[, -1L]), c('beer', 'brewery', 'bar'))
##                   beer     brewery              bar
## 1            Bud Light   Budweiser Joe's Crab Shack
## 2 Sam Adams Winter Ale    SamAdams     Growler Stop
## 3            Coco Loco NoDaBrewing   The Corner Pub
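See regexec() and regmatches().
Alternatively, with gsub() and strsplit() (the captured groups keep their leading spaces, which you can strip afterwards):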
do.call(rbind,strsplit(gsub('.*\\ba\\b(.*) by #(.*) #(.*) http.*','\\1|\\2|\\3',tweets),'\\|'))
# [,1] [,2] [,3]
#[1,] " Bud Light" "Budweiser" " Joe's Crab Shack"
#[2,] " Sam Adams Winter Ale" "SamAdams" " Growler Stop"
#[3,] " Coco Loco" "NoDaBrewing" " The Corner Pub"
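On the follow-up question about tweets that do not have the same structure: regexec() returns an empty match for a string that does not fit the pattern, so one option is to fill those rows with NA before binding. This is only a sketch; the second tweet below is invented for illustration, and the names pattern, rows and out are hypothetical.

```r
tweets <- c(
  "Drinking a Bud Light by #Budweiser # Joe's Crab Shack http://www.joes.com",
  "Just a random tweet with no structure"   # made-up non-matching example
)

pattern <- "^Drinking an? (.*) by #(.*) # (.*) http://.*$"
m <- regmatches(tweets, regexec(pattern, tweets))

# a full match has 4 elements (whole match + 3 groups); anything else becomes NA
rows <- lapply(m, function(x) {
  if (length(x) == 4) x[-1] else rep(NA_character_, 3)
})

out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(out) <- c("beer", "brewery", "bar")
```

Rows that come back as all NA can then be inspected separately or matched against a second, looser pattern.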

Split String into Columns by two character markers

All, I have searched around and can't find the answer on how to do this. I am relatively new to R and have not used regular expressions before, but basically I have some data put into a field like this:
"#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes "
I basically want to be able to split the string up into different elements, so each category after the # has its own column.
I tried using the tidyr package and initially tried:
string %>% separate(Description, into = c("Route","Details","Result","License No",
"Vehicle Desciption"),
sep = "\n#", remove =F, extra = "drop")
But realised I only wanted the data after the "-". I tried inserting a "-" in the code but it didn't work. Does anyone know how I can split the string ideally between the "-" and the "#".
Many thanks
In one line (assuming the input string is stored in str):
gsub("^\\s+|\\s+$","",gsub(".*?[-]","",unlist(strsplit(str,"#"))))
[1] "" "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
Or separate for better understanding:
Break string by "#":
a = unlist(strsplit(str,"#"))
Remove what is before the "-"
b = gsub(".*?[-]","",a)
Remove leading and trailing spaces:
gsub("^\\s+|\\s+$","",b)
You could do the following:
strsplit(x, ' *#[^-]+- *')[[1]][2:6]
# [1] "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
To supply the column names you desire, I suppose you could do something like:
mat <- matrix(strsplit(x, ' *#[^-]+- *')[[1]][2:6], ncol=5, byrow=TRUE)
colnames(mat) <- c('Route', 'CAT', 'Details', 'Result', 'Vehicle Description')
# Route CAT Details Result Vehicle Description
# [1,] "6" "PARKING" "Parking issues" "MOVED ON" "Mercedes"
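If you would rather not type the column names by hand, they can be taken from the labels in the string itself. A minimal sketch, assuming every field has the form "#label - value" and that labels contain no "-"; the names fields, keys, values, parsed and row are illustrative only.

```r
x <- "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes "

# one element per "#label - value" field; drop the empty piece before the first "#"
fields <- strsplit(x, "#", fixed = TRUE)[[1]]
fields <- fields[nzchar(fields)]

# label = text before the first "-", value = text after it
keys   <- trimws(sub("-.*$", "", fields))
values <- trimws(sub("^[^-]*-", "", fields))

# named vector, then a one-row data frame that keeps names such as "Vehicle Type"
parsed <- setNames(values, keys)
row <- data.frame(as.list(parsed), check.names = FALSE, stringsAsFactors = FALSE)
```

check.names = FALSE keeps "Vehicle Type" as a column name instead of turning it into a syntactic name.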
Using str_extract_all() from stringr:
library(stringr)
str_extract_all(str1, '(?<=-\\s)\\w+(?:\\s*\\w+){0,}')[[1]]
#[1] "6" "PARKING" "Parking issues" "MOVED ON"
#[5] "Mercedes"
str_extract_all(str2, '(?<=-\\s)\\w+(?:\\s*\\w+){0,}')[[1]]
#[1] "6" "PARKING"
#[3] "Parking issues" "MOVED ON"
#[5] "Mercedes" "Parking issues are present"
#[7] "MOVED ON" "Mercedes"
data
str1 <- "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes "
str2 <- "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes #Details - Parking issues are present#Result - MOVED ON #Vehicle Type - Mercedes "
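The same lookbehind idea can be sketched in base R with gregexpr() and regmatches(), in case stringr is not available. This assumes the values themselves never contain "#"; vals is a made-up name.

```r
str1 <- "#Route - 6 #Category - PARKING #Details - Parking issues#Result - MOVED ON #Vehicle Type - Mercedes "

# take everything after "- " up to (but not including) the next "#"
vals <- regmatches(str1, gregexpr("(?<=-\\s)[^#]*", str1, perl = TRUE))[[1]]

# values followed by " #" carry a trailing space, so trim them
vals <- trimws(vals)
```

perl = TRUE is needed here because the default regex engine in base R does not support lookbehind.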

How to use separate() properly?

I have some difficulties to extract an ID in the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores, and my approach fails. Is there a way to extract that ID?
I think you should use extract() instead of separate(). You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a digit, so I'm capturing everything from the first digit up to the next _, and then everything after it:
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
Try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
If you want to get the name as well, a simple solution would be (which assumes that the name is always after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this (note that paste() joins the remaining parts with spaces instead of underscores):
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
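If the underscores inside the name should be preserved, a single regular expression with two capture groups is another option. A minimal base R sketch, assuming the ID is exactly the text between the first and second underscore; only two of the strings are repeated here, and parts, ids and names_out are illustrative names.

```r
a <- c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
       "NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")

# group 1: text between the first and second "_"; group 2: everything after
parts <- regmatches(a, regexec("^[^_]*_([^_]*)_(.*)$", a))

# element 1 is the whole match, 2 and 3 are the groups; non-matches give NA
ids       <- vapply(parts, `[`, character(1), 2)
names_out <- vapply(parts, `[`, character(1), 3)
```

Here ids holds the two IDs and names_out holds "THOMAS_MYR" and "Sarah_Lu_Van_Gar/Thomas" with their underscores intact.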

Remove defined strings from sentences in dataframe

I need to remove defined strings from sentences in data frame:
sent1 = data.frame(Sentences=c("bad printer for the money wireless setup was surprisingly easy",
"love my samsung galaxy tabinch gb whitethis is the first"), user = c(1,2))
Sentences User
bad printer for the money wireless setup was surprisingly easy 1
love my samsung galaxy tabinch gb whitethis is the first 2
Defined strings for excluding, e.g.:
stop_words <- c("bad", "money", "love", "is", "the")
I was wondering about something like this:
library(stringr)
words1 <- (str_split(unlist(sent1$Sentences)," "))
ddd = which(words1[[1]] %in% stop_words)
words1[[1]][-ddd]
But I need it for all items in the list. Then I need to have output table in the same structure as input table sent1, but without defined strings.
I would very much appreciate any help or advice.
You can combine the stop words and create a regex pattern. Therefore, you only need a single gsub command.
# create regex pattern
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")
# [1] "\\b(?:bad|money|love|is|the)\\b ?"
# remove stop words
res <- gsub(pattern, "", sent1$Sentences)
# [1] "printer for wireless setup was surprisingly easy"
# [2] "my samsung galaxy tabinch gb whitethis first"
# store result in a data frame
data.frame(Sentences = res)
# Sentences
# 1 printer for wireless setup was surprisingly easy
# 2 my samsung galaxy tabinch gb whitethis first
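To keep the output in the same structure as sent1 (including the user column), the cleaned text can be written back into a copy of the data frame. This is a sketch under a slightly different design than the answer above: the pattern does not try to eat the following space, and whitespace is normalised in a second step, which also handles a stop word at the end of a sentence. sent2 and cleaned are made-up names.

```r
sent1 <- data.frame(
  Sentences = c("bad printer for the money wireless setup was surprisingly easy",
                "love my samsung galaxy tabinch gb whitethis is the first"),
  user = c(1, 2),
  stringsAsFactors = FALSE
)
stop_words <- c("bad", "money", "love", "is", "the")

# whole-word match for any stop word
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b")

# remove the words, then collapse repeated spaces and trim the ends
cleaned <- gsub(pattern, "", sent1$Sentences, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", cleaned))

# same columns as sent1, with the Sentences column replaced
sent2 <- transform(sent1, Sentences = cleaned)
```

If the stop words can contain regex metacharacters they would need escaping first; the five words used here do not.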

How would I turn a multivalue string into a usable frequency table in R?

I have a field in a data frame called plugins_Apache_module
it contains strings like:
c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
"mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
"mod_ssl/2.2.9")
I need a frequency table on the modules, and also their versions.
What is the best way to do this in R? Being rather new to R, I've seen strsplit and gsub, and some chatrooms also suggested I use the qdap package.
Ideally I would want the string transformed into a dataframe with a column for every mod, if the module is there, then the version goes in that particular field. How would I accomplish such a transform?
What dataframe format would be suggested if I want top-level frequencies - say mod_ssl (all versions) as well as relational options (mod_perl is very often used with mod_ssl).
I'm not too sure how to handle such variable length data when pushing into a dataframe for processing. Any advice is welcome.
I consider the right answer to look like:
mod_perl  mod_python  mod_ssl  mod_auth_passthrough  mod_bwlimited
1.99_16   3.1.3       2.0.52
                      2.2.23   2.1                   1.4
                      2.2.9
So basically the module name becomes a column and the version that follows becomes the row entry.
st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52", "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23", "mod_ssl/2.2.9")
scan(text=st, what="", sep=",")
Read 7 items
[1] "mod_perl/1.99_16" "mod_python/3.1.3" "mod_ssl/2.0.52"
[4] "mod_auth_passthrough/2.1" "mod_bwlimited/1.4" "mod_ssl/2.2.23"
[7] "mod_ssl/2.2.9"
strsplit( scan(text=st, what="", sep=","), "/")
Read 7 items
[[1]]
[1] "mod_perl" "1.99_16"
[[2]]
[1] "mod_python" "3.1.3"
[[3]]
[1] "mod_ssl" "2.0.52"
[[4]]
[1] "mod_auth_passthrough" "2.1"
[[5]]
[1] "mod_bwlimited" "1.4"
[[6]]
[1] "mod_ssl" "2.2.23"
[[7]]
[1] "mod_ssl" "2.2.9"
table( sapply(strsplit( scan(text=st, what="", sep=","), "/"), "[",1) )
#----------------
Read 7 items
mod_auth_passthrough mod_bwlimited mod_perl mod_python
1 1 1 1
mod_ssl
3
table( scan(text=st, what="", sep=",") )
#-----------
Read 7 items
mod_auth_passthrough/2.1 mod_bwlimited/1.4 mod_perl/1.99_16
1 1 1
mod_python/3.1.3 mod_ssl/2.0.52 mod_ssl/2.2.23
1 1 1
mod_ssl/2.2.9
1
You ask for at minimum two different things. Adding desired output greatly helped. I'm not sure if what you ask for is what you really want but you asked and it seemed like a fun problem. Ok here's how I would approach this using qdap (this requires qdap version 1.1.0 though):
## load qdap
library(qdap)
## your data
x <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
"mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
"mod_ssl/2.2.9")
## strsplit on commas and slashes
dat <- unlist(lapply(x, strsplit, ",|/"), recursive=FALSE)
## make just a list of mods per row
mods <- lapply(dat, "[", c(TRUE, FALSE))
## make a string of versions
ver <- unlist(lapply(dat, "[", c(FALSE, TRUE)))
## make a lookup key and split it into lists
key <- data.frame(mod = unlist(mods), ver, row = rep(seq_along(mods),
sapply(mods, length)))
key2 <- split(key[, 1:2], key$row)
## make it into freq. counts
freqs <- mtabulate(mods)
## copy the freq table to vers (in case you still want freqs) and replace 0 with NA
vers <- freqs
vers[vers==0] <- NA
## loop through and fill the ones in each row using an env. lookup (%l%)
for(i in seq_len(nrow(vers))) {
    x <- vers[i, !is.na(vers[i, ]), drop = FALSE]
    vers[i, !is.na(vers[i, ])] <- colnames(x) %l% key2[[i]]
}
## Don't print the NAs
print(vers, na.print = "")
##   mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1                                     1.99_16      3.1.3  2.0.52
## 2                  2.1           1.4                      2.2.23
## 3                                                          2.2.9
## the frequency counts per mods
freqs
##   mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1                    0             0        1          1       1
## 2                    1             1        0          0       1
## 3                    0             0        0          0       1
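For comparison, roughly the same tables can be sketched in base R without qdap. The idea is to build a long data frame first (one row per module occurrence) and derive the wide table, the top-level counts and a co-occurrence matrix from it. All object names (items, flat, long, wide, counts, inc, co) are made up for this sketch.

```r
st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
        "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
        "mod_ssl/2.2.9")

# long format: one row per (input row, module, version)
items <- strsplit(st, ",", fixed = TRUE)
flat  <- unlist(items)
long  <- data.frame(row = rep(seq_along(items), lengths(items)),
                    mod = sub("/.*$", "", flat),
                    ver = sub("^[^/]*/", "", flat),
                    stringsAsFactors = FALSE)

# wide format: one column per module, version in the cell, NA where absent
wide <- tapply(long$ver, list(long$row, long$mod), paste, collapse = ";")

# top-level frequencies (all versions of a module counted together)
counts <- table(long$mod)

# co-occurrence: in how many rows do two modules appear together?
inc <- 1 * (unclass(table(long$row, long$mod)) > 0)
co  <- crossprod(inc)
```

The long format is usually the easier one to keep around for variable-length data like this, since both the per-module counts and the "mod_perl is often used with mod_ssl" kind of question fall out of it with table() and crossprod().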