R match expression multiple times in the same line - regex

I am working with a set of tweets (very original, I know) in R and would like to extract the text after each @ sign and after each # sign and put them into separate variables. For example:
This is a test tweet using #twitter. @johnsmith @joesmith.
Ideally I would like it to create new variables in the data frame that hold twitter, johnsmith, joesmith, etc.
Currently I am using
data$at <- str_match(data$tweet_text,"\\s@\\w+")
data$hash <- str_match(data$tweet_text,"\\s#\\w+")
Which obviously gives me only the first occurrence of each in a new variable. Any suggestions?

strsplit and grep will work:
x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
grep("@|#", unlist(x), value=TRUE)
#[1] "#twitter." "@johnsmith" "@joesmith."
If you only want to keep the words, with no @, # or .:
out <- grep("@|#", unlist(x), value=TRUE)
gsub("@|#|\\.", "", out)
#[1] "twitter" "johnsmith" "joesmith"
UPDATE: Putting the results in a list:
my_list <- NULL
x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
my_list$hash <- c(my_list$hash, gsub("@|#|\\.", "", grep("#", unlist(x), value=TRUE)))
my_list$at <- c(my_list$at, gsub("@|#|\\.", "", grep("@", unlist(x), value=TRUE)))
x <- strsplit("2nd tweet using #second. @jillsmith @joansmith.", " ")
my_list$hash <- c(my_list$hash, gsub("@|#|\\.", "", grep("#", unlist(x), value=TRUE)))
my_list$at <- c(my_list$at, gsub("@|#|\\.", "", grep("@", unlist(x), value=TRUE)))
my_list
$hash
[1] "twitter" "second"
$at
[1] "johnsmith" "joesmith" "jillsmith" "joansmith"
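If you would rather not split on spaces, here is a minimal base-R sketch that pulls every match out of each tweet with gregexpr and regmatches. The helper name extract_tags and the tweets vector are made up for illustration and are not part of the answer above.

```r
# Hypothetical sample data: one element per tweet
tweets <- c("This is a test tweet using #twitter. @johnsmith @joesmith.",
            "2nd tweet using #second. @jillsmith @joansmith.")

# extract_tags is an illustrative helper, not a library function.
# It returns a list with one character vector of tags per tweet.
extract_tags <- function(text, prefix) {
  pattern <- paste0(prefix, "\\w+")
  matches <- regmatches(text, gregexpr(pattern, text))
  # drop the leading prefix character from every match
  lapply(matches, function(m) sub(prefix, "", m, fixed = TRUE))
}

hash <- extract_tags(tweets, "#")
at <- extract_tags(tweets, "@")

# Collapse to one string per tweet if a plain data frame column is wanted
hash_col <- vapply(hash, paste, character(1), collapse = " ")
at_col <- vapply(at, paste, character(1), collapse = " ")
```

Because \\w+ stops at punctuation, the trailing full stop never ends up in the match, so no separate clean-up step is needed.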

How to match a specific string using regular expressions in R

I am trying to extract some financial data using regular expressions in R.
I have used a RegEx tester, http://regexr.com/, to make a regular expression that SHOULD capture the information I need - the problem is just that it doesn't...
I have extracted data from this URL: http://finance.yahoo.com/q/cp?s=%5EOMXC20+Components
I want to match the company names (DANSKE.CO, DSV.CO etc.) and I have created the following regular expression, which matches them on regexr.com:
.q\?s=(\S*\\)
But it doesn't work in R. Can someone help me figure out how to go about this?
Instead of messing around with regular expressions I would use XPath for something like fetching HTML content:
library("XML")
f <- tempfile()
download.file("https://finance.yahoo.com/q/cp?s=^OMXC20+Components", f)
doc <- htmlParse(f)
xpathSApply(doc, "//b/a", xmlValue)
# [1] "CARL-B.CO" "CHR.CO" "COLO-B.CO" "DANSKE.CO" "DSV.CO"
# [6] "FLS.CO" "GEN.CO" "GN.CO" "ISS.CO" "JYSK.CO"
# [11] "MAERSK-A.CO" "MAERSK-B.CO" "NDA-DKK.CO" "NOVO-B.CO" "NZYM-B.CO"
# [16] "PNDORA.CO" "TDC.CO" "TRYG.CO" "VWS.CO" "WDH.CO"
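If you do want to stay with regular expressions, one likely reason a pattern that works on regexr.com fails in R is that every backslash has to be doubled inside an R string literal. A minimal sketch, assuming the page contains links of the form q?s=SYMBOL (the html string below is a made-up stand-in, not the real page source):

```r
# Hypothetical fragment standing in for the downloaded page
html <- '<a href="/q?s=DANSKE.CO">DANSKE.CO</a> <a href="/q?s=DSV.CO">DSV.CO</a>'

# "\\?" in the R string is the regex \? (a literal question mark);
# [^\"&]+ takes everything up to the closing quote or the next URL parameter
hits <- regmatches(html, gregexpr("q\\?s=[^\"&]+", html))[[1]]

# strip the fixed "q?s=" prefix to leave only the symbols
symbols <- sub("q?s=", "", hits, fixed = TRUE)
symbols
```

For anything beyond a quick check, parsing the HTML as shown above is usually less fragile than matching on the raw source.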
Does this help? If not, post back, and I'll provide another suggestion.
library(XML)
stocks <- c("AXP","BA","CAT","CSCO")
for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")
  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))
  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}
# COMBINE ALL STOCK DATA
stockdatalist <- mget(stocks)
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
# (parentheses matter: 1:ncol(x)-1 parses as (1:ncol(x)) - 1)
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]
# SAVE TO CSV
write.table(stockdata, "C:/Users/rshuell001/Desktop/MyData.csv", sep=",",
row.names=FALSE, col.names=FALSE)
# REMOVE TEMP OBJECTS
rm(df, stockdatalist)
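The assign/get bookkeeping can also be avoided by keeping the per-stock tables in a list. The sketch below shows only that combining step; fetch_table is a hypothetical stand-in that returns placeholder values so the example runs offline, and it is not the scraping code above.

```r
stocks <- c("AXP", "BA")

# Hypothetical stand-in for the download-and-parse step; the values are
# placeholders, not real market data
fetch_table <- function(s) {
  data.frame(metric = c("m1", "m2"), value = c("0", "0"),
             stringsAsFactors = FALSE)
}

# Build one data frame per stock, tagging each with its symbol
tables <- lapply(stocks, function(s) {
  df <- fetch_table(s)
  df$stock <- s
  # move the identifier to the first column
  df[, c("stock", setdiff(names(df), "stock"))]
})

# Stack all per-stock tables into a single data frame
stockdata <- do.call(rbind, tables)
```

Nothing is written into the global environment per stock, so there is nothing to clean up with rm() afterwards.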

How to use separate() properly?

I am having difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a digit, so I'm capturing everything from the first digit up to the next _, and then everything after it:
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
#                                     ID                    NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac              THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98                   THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee                    ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397                JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b                  BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4                  DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4              WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
Try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
If you want to get the name as well, a simple solution (which assumes that the name is always everything after the second underscore, and which joins the pieces with spaces) would be:
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
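If the underscores inside the name should be kept, a plain sub() with a capture group is another option. This is a sketch under the same assumption as above, namely that the ID never contains an underscore; it uses only two of the strings from the question to stay short.

```r
a <- c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
       "NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")

# [^_]* means "any run of characters that are not underscores", so the
# capture group is exactly the text between the first and second underscore
id <- sub("^[^_]*_([^_]*)_.*$", "\\1", a)

# Dropping the first two underscore-delimited fields leaves the name intact
name <- sub("^[^_]*_[^_]*_", "", a)

data.frame(ID = id, NAME = name, stringsAsFactors = FALSE)
```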

Extracting text after "?"

I have a string
x <- "Name of the Student? Michael Sneider"
I want to extract "Michael Sneider" out of it.
I have used:
str_extract_all(x,"[a-z]+")
str_extract_all(data,"\\?[a-z]+")
But can't extract the name.
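I think this should help (note that "?" is a regex metacharacter, so it has to be matched literally with fixed(), and trimws() drops the leading space):
trimws(substr(x, str_locate(x, fixed("?"))[, "end"] + 1, nchar(x)))
Try this:
sub('.*\\?\\s*(.*)','\\1',x)
Another option:
x <- "Name of the Student? Michael Sneider"
sub(pattern = ".+?\\?\\s*" , x , replacement = '' )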
To take advantage of the loose wording of the question, we can go WAY overboard and use natural language processing to extract all names from the string:
library(openNLP)
library(NLP)
# you'll also have to install the models with the next line, if you haven't already
# install.packages('openNLPmodels.en', repos = 'http://datacube.wu.ac.at/', type = 'source')
s <- as.String(x) # convert x to NLP package's String object
# make annotators
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
entity_annotator <- Maxent_Entity_Annotator()
# call sentence and word annotators
s_annotated <- annotate(s, list(sent_token_annotator, word_token_annotator))
# call entity annotator (which defaults to "person") and subset the string
s[entity_annotator(s, s_annotated)]
## Michael Sneider
Overkill? Probably. But interesting, and not actually all that hard to implement, really.
str_match is more helpful in this situation
str_match(x, ".*\\?\\s(.*)")[, 2]
#[1] "Michael Sneider"
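For a whole vector of strings, some of which may not contain a question mark at all, a base-R variant along these lines may be handy. It is only a sketch; the second element of x and the name after_qm are made up for illustration.

```r
x <- c("Name of the Student? Michael Sneider", "No question mark here")

# Remove everything up to and including the first "?", then trim whitespace.
# Strings without a "?" become NA instead of being returned unchanged.
after_qm <- ifelse(grepl("?", x, fixed = TRUE),
                   trimws(sub("^[^?]*\\?", "", x)),
                   NA_character_)
after_qm
```

Inside a character class the question mark needs no escaping, which is why the pattern uses [^?] but \\? outside of it.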

Remove defined strings from sentences in dataframe

I need to remove defined strings from sentences in data frame:
sent1 = data.frame(Sentences=c("bad printer for the money wireless setup was surprisingly easy",
"love my samsung galaxy tabinch gb whitethis is the first"), user = c(1,2))
Sentences User
bad printer for the money wireless setup was surprisingly easy 1
love my samsung galaxy tabinch gb whitethis is the first 2
Defined strings for excluding, e.g.:
stop_words <- c("bad", "money", "love", "is", "the")
I was wondering about something like this:
library(stringr)
words1 <- (str_split(unlist(sent1$Sentences)," "))
ddd = which(words1[[1]] %in% stop_words)
words1[[1]][-ddd]
But I need this for all items in the list. I then need an output table with the same structure as the input table sent1, but without the defined strings.
I would really appreciate any help or advice.
You can combine the stop words and create a regex pattern. Therefore, you only need a single gsub command.
# create regex pattern
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")
# [1] "\\b(?:bad|money|love|is|the)\\b ?"
# remove stop words
res <- gsub(pattern, "", sent1$Sentences)
# [1] "printer for wireless setup was surprisingly easy"
# [2] "my samsung galaxy tabinch gb whitethis first"
# store result in a data frame
data.frame(Sentences = res)
# Sentences
# 1 printer for wireless setup was surprisingly easy
# 2 my samsung galaxy tabinch gb whitethis first
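To keep the other columns of the input table, the same idea can be applied in place to a copy of the data frame. A minimal sketch: the helper name remove_stop_words is made up here, and perl = TRUE is passed explicitly so that the (?: ) group is handled by the PCRE engine.

```r
sent1 <- data.frame(
  Sentences = c("bad printer for the money wireless setup was surprisingly easy",
                "love my samsung galaxy tabinch gb whitethis is the first"),
  User = c(1, 2),
  stringsAsFactors = FALSE
)
stop_words <- c("bad", "money", "love", "is", "the")

# Illustrative helper: delete whole-word matches, then tidy up the spacing
remove_stop_words <- function(text, words) {
  pattern <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b")
  stripped <- gsub(pattern, "", text, perl = TRUE)
  # collapse the runs of spaces left behind and trim both ends
  trimws(gsub("\\s+", " ", stripped))
}

cleaned <- sent1
cleaned$Sentences <- remove_stop_words(sent1$Sentences, stop_words)
cleaned
```

Only the Sentences column is rewritten, so User and the column order stay exactly as in sent1.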

Parsing out a line in R to pick different objects

I have this line:
system<-c("System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00")
I need to parse this line and pick out smt, lcpu, mem, mpsize and ent into different objects.
For example, I am doing this to pick out smt, but it picks the whole line. Any ideas what I am doing wrong here?
smt<-sub('^.* smt=([[:digit:]])', '\\1', system)
smt should hold the number 4 in this case.
I would use strsplit a couple times, and type.convert:
parse.config <- function(x) {
  clean <- sub("System configuration: ", "", x)
  pairs <- strsplit(clean, " ")[[1]]
  items <- strsplit(pairs, "=")
  keys <- sapply(items, `[`, 1)
  values <- sapply(items, `[`, 2)
  values <- lapply(values, type.convert, as.is = TRUE)
  setNames(values, keys)
}
config <- parse.config(system)
# $type
# [1] "Shared"
#
# $mode
# [1] "Uncapped"
#
# $smt
# [1] 4
#
# $lcpu
# [1] 96
#
# $mem
# [1] "393216MB"
#
# $psize
# [1] 64
#
# $ent
# [1] 16
The output is a list so you can access any of the parsed items, for example:
config$smt
# [1] 4
Using strapplyc in the gsubfn package, the following creates a list L whose names are the left-hand sides, such as smt, and whose values are the right-hand sides.
library(gsubfn)
LHS <- strapplyc( system, "(\\w+)=" )[[1]]
RHS <- strapplyc( system, "=(\\w+)" )[[1]]
L <- setNames( as.list(RHS), LHS )
For example we can now get smt like this (and similarly for the other left hand sides):
> L$smt
[1] "4"
UPDATE: Simplified.
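Add .* to the end of your matching expression and you'll get "4". sub only replaces the part of the string that the pattern matched, so without the trailing .* everything after the digit is left in place.
sub('^.* smt=([[:digit:]]+).*', '\\1', system)
Note the + I added after [[:digit:]]; you will want it in case the value has more than one digit.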
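To see the effect of the trailing .* in isolation, here is a small comparison on a shortened, made-up version of the line (config_line, partial and full are illustrative names):

```r
config_line <- "type=Shared smt=4 lcpu=96"

# Without a trailing .*, only the text up to and including "smt=4" is
# replaced by the captured digit; the rest of the line is left untouched
partial <- sub("^.* smt=([[:digit:]]+)", "\\1", config_line)

# With a trailing .*, the pattern consumes the whole line, so the
# replacement is just the captured group
full <- sub("^.* smt=([[:digit:]]+).*", "\\1", config_line)

partial
full
```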
You could also approach this by splitting on spaces and then finding the matches:
splits <- unlist(strsplit(system, ' '))
sub('smt=', '', grep('smt=', splits, value=TRUE))
# [1] "4"
or wrapping it in a function:
matchfun <- function(string, to_match, splitter=' ') {
  splits <- unlist(strsplit(string, splitter))
  sub(to_match, '', grep(to_match, splits, value=TRUE))
}
matchfun(system, 'smt=')
# [1] "4"
Well, I'm voting for @GaborGrothendieck's, but am offering this as a more pedestrian alternative:
inp <- c("System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00")
inparsed <- read.table(text=inp, stringsAsFactors=FALSE)
vals <- unlist(inparsed)[grep("\\=", unlist(inparsed))]
vals
# V3 V4 V5 V6 V7 V8 V9
# type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00
vals[grep("smt|lcpu|mem|mpsize|ent", vals)]
V5 V6 V7 V9
"smt=4" "lcpu=96" "mem=393216MB" "ent=16.00"
I would note that choosing the name 'system' for a variable seems most unwise in light of the system function's existence.
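Putting these points together, here is one more base-R sketch that avoids the name system and collects all key=value pairs into a named character vector. The names config_line, pairs, kv and values are illustrative choices, not taken from any of the answers.

```r
config_line <- "System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00"

# Find every key=value token: a run of letters, digits or underscores,
# an equals sign, then everything up to the next space
pairs <- regmatches(config_line,
                    gregexpr("[[:alnum:]_]+=[^ ]+", config_line))[[1]]

# Split each token at "=" and build a named character vector
kv <- strsplit(pairs, "=", fixed = TRUE)
values <- setNames(vapply(kv, `[`, character(1), 2),
                   vapply(kv, `[`, character(1), 1))

# Convert individual entries only where a number is actually wanted
smt <- as.integer(values[["smt"]])
lcpu <- as.integer(values[["lcpu"]])
ent <- as.numeric(values[["ent"]])
mem <- values[["mem"]]  # stays a string because of the "MB" suffix
```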