Extracting text after "?" - regex

I have a string
x <- "Name of the Student? Michael Sneider"
I want to extract "Michael Sneider" out of it.
I have used:
str_extract_all(x,"[a-z]+")
str_extract_all(x, "\\?[a-z]+")
But neither extracts the name.

I think this should help. Note that "?" has to be escaped, because str_locate treats its pattern as a regex, and the result keeps the leading space:
substr(x, str_locate(x, "\\?")[1, 1] + 1, nchar(x))
# [1] " Michael Sneider"

Try this:
sub('.*\\?(.*)','\\1',x)

x <- "Name of the Student? Michael Sneider"
sub(pattern = ".+?\\?" , x , replacement = '' )
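Note that both sub() approaches above keep the leading space that follows the "?"; if that matters, wrapping the call in trimws() is a minimal fix (a sketch):
trimws(sub('.*\\?(.*)', '\\1', x))
# [1] "Michael Sneider"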

To take advantage of the loose wording of the question, we can go WAY overboard and use natural language processing to extract all names from the string:
library(openNLP)
library(NLP)
# you'll also have to install the models with the next line, if you haven't already
# install.packages('openNLPmodels.en', repos = 'http://datacube.wu.ac.at/', type = 'source')
s <- as.String(x) # convert x to NLP package's String object
# make annotators
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
entity_annotator <- Maxent_Entity_Annotator()
# call sentence and word annotators
s_annotated <- annotate(s, list(sent_token_annotator, word_token_annotator))
# call entity annotator (which defaults to "person") and subset the string
s[entity_annotator(s, s_annotated)]
## Michael Sneider
Overkill? Probably. But interesting, and not actually all that hard to implement, really.

str_match is more helpful in this situation
str_match(x, ".*\\?\\s(.*)")[, 2]
#[1] "Michael Sneider"

How to match a specific string using regular expressions in R

I am trying to extract some financial data using regular expressions in R.
I have used a RegEx tester, http://regexr.com/, to make a regular expression that SHOULD capture the information I need - the problem is just that it doesn't...
I have extracted data from this URL: http://finance.yahoo.com/q/cp?s=%5EOMXC20+Components
I want to match the company names (DANSKE.CO, DSV.CO etc.) and I have created following regular expression which matches it on regexr.com:
.q\?s=(\S*\\)
But it doesn't work in R. Can someone help me figure out how to go about this?
Instead of messing around with regular expressions I would use XPath for something like fetching HTML content:
library("XML")
f <- tempfile()
download.file("https://finance.yahoo.com/q/cp?s=^OMXC20+Components", f)
doc <- htmlParse(f)
xpathSApply(doc, "//b/a", xmlValue)
# [1] "CARL-B.CO" "CHR.CO" "COLO-B.CO" "DANSKE.CO" "DSV.CO"
# [6] "FLS.CO" "GEN.CO" "GN.CO" "ISS.CO" "JYSK.CO"
# [11] "MAERSK-A.CO" "MAERSK-B.CO" "NDA-DKK.CO" "NOVO-B.CO" "NZYM-B.CO"
# [16] "PNDORA.CO" "TDC.CO" "TRYG.CO" "VWS.CO" "WDH.CO"
Does this help? If not, post back, and I'll provide another suggestion.
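A similar approach with the rvest package, as a sketch (the //b/a structure is simply carried over from the XPath answer above; the page layout may have changed since this was written):
library(rvest)
page <- read_html("https://finance.yahoo.com/q/cp?s=^OMXC20+Components")
html_text(html_nodes(page, xpath = "//b/a"))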
library(XML)

stocks <- c("AXP", "BA", "CAT", "CSCO")

for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")

  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))

  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}

# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)

# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]

# SAVE TO CSV
write.table(stockdata, "C:/Users/rshuell001/Desktop/MyData.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)

# REMOVE TEMP OBJECTS
rm(df, stockdatalist)

Search and replace multiple strings in list of strings: improve R code

I am looking for a simplified solution to the following problem in R: I have a list of names that are separated by commas; however, some of the names also contain commas themselves. In order to separate the names, I would like to first replace the comma-containing names with comma-free versions and then split on the comma. My problem is that I have around 26,000 strings with several names in each, and a list of around 130 names that contain commas. I have written a nested foreach loop (to use multiple cores and speed things up) and it works, but it is horribly slow. Is there a quicker way to search the strings and replace the relevant names? Here is my sample code:
List_of_names <- as.data.frame(c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike", "Digital, Mike, John, Sr", "Svenja, Sven"))
Comma_names <- as.data.frame(c("Franz, Jr.", "Nice, LLC", "John, Sr"))
colnames(Comma_names) <- "name"
Comma_names$replace_names <- gsub(",", "", Comma_names[, "name"])

library(doParallel)
library(foreach)
cl <- makeCluster(4)    # Create cluster with desired number of cores
registerDoParallel(cl)  # Register cluster

names_new <- foreach(i = 1:nrow(List_of_names), .errorhandling = "pass", .packages = c("foreach")) %dopar% {
  name_2 <- List_of_names[i, ]
  foreach(j = 1:nrow(Comma_names), .combine = rbind, .errorhandling = "pass") %do% {
    if (length(grep(Comma_names[j, 1], name_2)) > 0) {
      name_2 <- gsub(Comma_names[j, 1], Comma_names[j, 2], name_2)
    }
  }
  name_2
}
In addition, the result of the foreach loop is a list, and if I try to save that list or use it to replace the column in my original data frame, it takes forever. How can I change my code to make it faster?
Thank you to everyone who reads this and is able to help!
Principle
You can use a combination of Reduce and stri_replace_all from the stringi package.
Code
library(stringi)
Comma_names <- structure(list(name = c("Franz, Jr.", "Nice, LLC", "John, Sr"),
replace_names = c("Franz Jr.", "Nice LLC", "John Sr")),
.Names = c("name", "replace_names"),
row.names = c(NA, -3L), class = "data.frame")
List_of_names <- structure(list(name = c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
"Digital, Mike, John, Sr", "Svenja, Sven")),
.Names = "name",
row.names = c(NA, -3L), class = "data.frame")
wrapper <- function(str, ind) stri_replace_all(str, Comma_names$replace_names[ind],
fixed = Comma_names$name[ind])
ind <- 1:NROW(Comma_names)
Reduce(wrapper, ind, init = List_of_names$name)
# [1] "Fred, Heiko, Franz Jr., Nice LLC, Meike"
# [2] "Digital, Mike, John Sr"
# [3] "Svenja, Sven"
Explanation
stri_replace_all is a fast function that replaces all occurrences in a string. With Reduce you apply a function repeatedly, each time to the result of the previous call. So we first apply wrapper to the column with all the names, replacing the string in the first row of Comma_names. The result is fed to wrapper again, this time replacing all occurrences of the second row, and so on. This code should run reasonably fast and you do not need to parallelize. I would be curious to hear your feedback on the execution time.
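To see what the Reduce() call does, here is the same replacement chain unrolled by hand (a sketch using the three Comma_names rows from above):
step1 <- stri_replace_all(List_of_names$name, "Franz Jr.", fixed = "Franz, Jr.")
step2 <- stri_replace_all(step1, "Nice LLC", fixed = "Nice, LLC")
step3 <- stri_replace_all(step2, "John Sr", fixed = "John, Sr")
step3  # identical to Reduce(wrapper, ind, init = List_of_names$name)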
Benchmark
Just a little benchmark with 3 million lines:
List_of_names <- List_of_names[rep(1:NROW(List_of_names), 1e6), , drop = FALSE]
system.time(invisible(Reduce(wrapper, ind, init = List_of_names$name)))
# user system elapsed
# 1.95 0.00 1.96

R match expression multiple times in the same line

I am working with a set of Tweets (very original, I know) in R and would like to extract the text after each @ sign and after each # sign and put them into separate variables. For example:
This is a test tweet using #twitter. @johnsmith @joesmith.
Ideally I would like it to create new variables in the data frame containing twitter, johnsmith, joesmith, etc.
Currently I am using
data$at <- str_match(data$tweet_text, "\\s@\\w+")
data$hash <- str_match(data$tweet_text, "\\s#\\w+")
Which obviously gives me only the first occurrence of each in a new variable. Any suggestions?
strsplit and grep will work:
x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
grep("@|#", unlist(x), value = TRUE)
#[1] "#twitter."  "@johnsmith" "@joesmith."
If you only want to keep the words, with no @, #, or trailing period:
out <- grep("@|#", unlist(x), value = TRUE)
gsub("@|#|\\.", "", out)
[1] "twitter" "johnsmith" "joesmith"
UPDATE Putting the results in a list:
my_list <-NULL
x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
my_list$hash <- c(my_list$hash, gsub("@|#|\\.", "", grep("#", unlist(x), value = TRUE)))
my_list$at <- c(my_list$at, gsub("@|#|\\.", "", grep("@", unlist(x), value = TRUE)))
x <- strsplit("2nd tweet using #second. @jillsmith @joansmith.", " ")
my_list$hash <- c(my_list$hash, gsub("@|#|\\.", "", grep("#", unlist(x), value = TRUE)))
my_list$at <- c(my_list$at, gsub("@|#|\\.", "", grep("@", unlist(x), value = TRUE)))
my_list
$hash
[1] "twitter" "second"
$at
[1] "johnsmith" "joesmith" "jillsmith" "joansmith"
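Since the question already uses stringr, a str_extract_all() sketch is another option: it returns every match (not just the first), as a list that can be stored column-wise in the data frame:
library(stringr)
tweet <- "This is a test tweet using #twitter. @johnsmith @joesmith."
str_extract_all(tweet, "@\\w+")[[1]]
# [1] "@johnsmith" "@joesmith"
str_extract_all(tweet, "#\\w+")[[1]]
# [1] "#twitter"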

Extract words that meet a length condition from string

I have a patent data set, and when I import the IPC-class information into R I get a string containing a variable amount of whitespace and a set of numbers I don't need. The following are the IPC codes corresponding to a patent file:
b <- "F24J 2/05 20060101AFI20150224BHEP F24J 2/46 20060101ALI20150224BHEP "
I would like to remove all the whitespace and those long alphanumeric strings and just keep the data I am interested in, obtaining a data frame like this:
m <- data.frame(matrix(c("F24J 2/05", "F24J 2/46"), byrow = TRUE, nrow = 1, ncol = 2))
m
I am trying with gsub, since I know that the long string will always have a length considerably longer than the data I am interested in:
x = gsub("\\b[a-zA-Z0-9]{8,}\\b", "", b)
x
But I get stuck when I try to clean this object further in order to get the data frame I want. I would really appreciate it if someone could help me.
Thank you very much in advance.
You can use str_extract_all from the stringr package, provided you know the pattern you are looking for:
library(stringr)
str_extract_all(b, "[A-Z]\\d{2}[A-Z] *\\d/\\d{2}")[[1]]
#[1] "F24J 2/05" "F24J 2/46"
Option 1: select all the noise data and remove it with a substitution:
/\s+|\w{5,}/g
(Spaces and 'long' words)
https://regex101.com/r/lG4dC4/1
Option 2: select all the short words (max length 4):
/\b\S{4}\b/g
https://regex101.com/r/fZ8mH5/1
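In R, Option 2 could look like this (a sketch; note the doubled backslashes R string literals need, and the assumption that class codes and subclass numbers always alternate):
m <- regmatches(b, gregexpr("\\b\\S{4}\\b", b, perl = TRUE))[[1]]
m
# [1] "F24J" "2/05" "F24J" "2/46"
paste(m[c(TRUE, FALSE)], m[c(FALSE, TRUE)])  # pair each code with its number
# [1] "F24J 2/05" "F24J 2/46"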
or…
library(stringi)
library(readr)
read_fwf(paste0(stri_match_all_regex(b, "[[:alnum:][:punct:][:blank:]]{50}")[[1]][,1], collapse="\n"),
fwf_widths(c(7, 12, 31)))[,1:2]
## X1 X2
## 1 F24J 2/05
## 2 F24J 2/46
(this makes the assumption - from only seeing 2 'records' - that each 'record' is 50 characters long)
Here's an approach to make the matrix using qdapRegex (I maintain this package) + magrittr's pipeline:
library(qdapRegex); library(magrittr)
b %>%
rm_white_multiple() %>%
rm_default(pattern="F[0-9A-Z]+\\s\\d{1,2}/\\d{1,2}", extract=TRUE) %>%
unlist() %>%
strsplit("\\s") %>%
do.call(rbind, .)
## [,1] [,2]
## [1,] "F24J" "2/05"
## [2,] "F24J" "2/46"

How to use separate() properly?

I am having some difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores... and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a number, so I'm capturing everything from the first number until the next _, and then everything after that.
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part", i.e. your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
If you want to get the Name as well, a simple solution would be this (which assumes that the Name is always everything after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
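For completeness, the same ID can be pulled out with a single base-R sub() call (a sketch that, like the strsplit answer, assumes the ID sits between the first and second underscore):
sub("^[^_]*_([^_]+)_.*", "\\1", a)
# returns the same eight IDs as above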