Extract URL parameters and values in R - regex

We want to extract parameters and values from a given URL like
http://www.exemple.com/?a=1&b=2&c=3#def
Using xml2::url_parse we were able to parse a URL into its component pieces. However, we still want to divide the query into its elements using gsub with a matching regular expression:
([^?&=#]+)=([^&#]*)
Desired output
a=1
b=2
c=3

Use the urltools package to parse URLs.
> u <- "http://www.exemple.com/?a=1&b=2&c=3#def"
> strsplit(urltools::parameters(u), "&")[[1L]]
[1] "a=1" "b=2" "c=3"
> urltools::param_get(u, "b")
  b
1 2

We can try
library(stringr)
matrix(str_extract_all(str1, "[a-z](?=\\=)|(?<=\\=)\\d+")[[1]], ncol=2, byrow=TRUE)
Or if we need the = also
str_extract_all(str1, "[a-z]=\\d+")[[1]]
#[1] "a=1" "b=2" "c=3"
data
str1 <- "http://www.exemple.com/?a=1&b=2&c=3#def"
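The regular expression from the question can also be applied directly in base R, without extra packages. The following is a minimal sketch, assuming that no parameter value itself contains an "=" character; the variable names pairs, kv and params are only illustrative.

```r
u <- "http://www.exemple.com/?a=1&b=2&c=3#def"
pattern <- "([^?&=#]+)=([^&#]*)"

# gregexpr() locates every match; regmatches() extracts the matched text
pairs <- regmatches(u, gregexpr(pattern, u))[[1]]
pairs
#[1] "a=1" "b=2" "c=3"

# Split each pair on the literal "=" and build a named character vector
# (assumes exactly one "=" per pair)
kv <- do.call(rbind, strsplit(pairs, "=", fixed = TRUE))
params <- setNames(kv[, 2], kv[, 1])
params[["b"]]
#[1] "2"
```

Keeping the result as a named vector makes it easy to look up a single parameter by name, much like urltools::param_get does above.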

Related

grepl() and lapply to fill missing values

I have the following data as an example:
fruit.region <- data.frame(full =c("US red apple","bombay Asia mango","gold kiwi New Zealand"), name = c("apple", "mango", "kiwi"), country = c("US","Asia","New Zealand"), type = c("red","bombay","gold"))
I would like R to be able to look at other items in the "full" (name) column that don't have values for "name", "country" and "type" and see if they match other items. For instance, if full had a 4th row with "bombay US mango" it would be able to identify that the country should read US, bombay should be under type and mango should be under name.
This is what I have so far, which merely identifies (logically) where the items match:
new.entry <- c("bombay US mango")
split.new.entry <- strsplit(new.entry, " ")
lapply(split.new.entry, function(x){
check = grepl(x, fruit.region, ignore.case=TRUE)
print(check)
})
I'm at a bit of a standstill. I've read through a number of regex posts and the R help pages on grepl, but I haven't found a good solution. What I have doesn't fully identify a logical "match" vector, so I can't subset with it or use an if statement to concatenate the different elements. Ideally, I'd like to be able to replace these elements in data.table form, as my fruit.region will actually be a data table. Does anyone have any suggestions on the best approach?
You can use the str_detect function from the stringr library. This gives a list, ready to rbind:
library(stringr)
addnewrow <- function(newfruit){
z<-lapply(fruit.region[,2:4], function(x) x[str_detect(newfruit, x)])  # match against the function argument, not the global new.entry
z$full <- newfruit
z
}
addnewrow(new.entry)
$name
[1] "mango"
$country
[1] "US"
$type
[1] "bombay"
$full
[1] "bombay US mango"
The next step would depend on your desired outcome - if you only want to add one, try:
rbind(fruit.region, addnewrow(new.entry))
If you have a lot:
z <- do.call(rbind, lapply(c(new.entry, new.entry), addnewrow))
rbind(fruit.region, z)
NB make sure your columns are character first:
fruit.region[] <- lapply(fruit.region, as.character)
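The same idea can be sketched in base R. This is only an illustration under a few assumptions: matching is case-sensitive and literal (fixed = TRUE), the first matching value wins if several match, and NA is used when nothing matches. The helper name match_known is hypothetical, not part of any package.

```r
fruit.region <- data.frame(
  full    = c("US red apple", "bombay Asia mango", "gold kiwi New Zealand"),
  name    = c("apple", "mango", "kiwi"),
  country = c("US", "Asia", "New Zealand"),
  type    = c("red", "bombay", "gold"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first known value that occurs in `entry`,
# or NA if none of them does
match_known <- function(entry, known) {
  hit <- known[vapply(known, grepl, logical(1), x = entry, fixed = TRUE)]
  if (length(hit) == 0) NA_character_ else hit[1]
}

new.entry <- "bombay US mango"

# Look the new entry up in each of the three descriptive columns
fields <- lapply(fruit.region[c("name", "country", "type")],
                 match_known, entry = new.entry)
new.row <- data.frame(full = new.entry, fields, stringsAsFactors = FALSE)
result <- rbind(fruit.region, new.row)
result[4, ]
```

Because multi-word values such as "New Zealand" are matched as whole literals, this avoids splitting the new entry on spaces first.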

regex for matching column names on a matrix

I'll have two strings of the form
"Initestimate" or "L#estimate", with # being a 1 or 2 digit number
"Nameestimate", with Name being the name of the actual symbol. In the example below, the name of our symbol is "6JU4"
And I have a matrix containing, among other things, columns containing "InitSymbol" and "L#Symbol". I want to return the column name of the column where the first row holds the substring before "estimate".
I'm using stringr. Right now I have it coded with a bunch of calls to str_sub, but it's really sloppy and I want to clean it up and do it right.
example code:
> examplemat <- matrix(c("RYU4","6JU4","6EU4",1,2,3),ncol=6)
> colnames(examplemat) <- c("InitSymb","L1Symb","L2Symb","RYU4estimate","6JU4estimate","6EU4estimate")
> examplemat
     InitSymb L1Symb L2Symb RYU4estimate 6JU4estimate 6EU4estimate
[1,] "RYU4"   "6JU4" "6EU4" "1"          "2"          "3"
> searchStr <- "L1estimate"
So with answer being the answer I'm looking for, I want to be able to input examplemat[,answer] so I can extract the data column (in this case, "2")
I don't really know how to do regex, but I think the answer looks something like
examplemat[,paste0(**some regex function**("[(Init)|(L[:digit:]+)]",searchStr),"estimate")]
What function goes there, and is my regex right?
Maybe you can try:
library(stringr)
Extr <- str_extract(searchStr, '^[A-Za-z]\\d+')
Extr
[1] "L1"
#If the searchStr is `Initestimate`
#Extr <- str_extract(searchStr, '^[A-Za-z]{4}')
pat1 <- paste0("(?<=",Extr,").*")
# regex() is used here because the older perl() wrapper was deprecated in stringr 1.0.0
indx1 <-examplemat[,str_detect(colnames(examplemat),regex(pat1))]
pat2 <- paste0("(?<=",indx1,").*")
examplemat[,str_detect(colnames(examplemat), regex(pat2))]
#6JU4estimate
# "2"
#For searchStr using Initestimate;
#examplemat[,str_detect(colnames(examplemat), regex(pat2))]
#RYU4estimate
# "1"
The question is a bit confusing, so I'm not quite sure my interpretation is correct.
First, extract the part of the string "coolSymb" that precedes "Symb".
Second, detect which column names contain "cool" and return their location (column index) with which().
Finally, extract the value using simple matrix indexing.
library(stringr)
a = str_split("coolSymb", "Symb")[[1]][1]
b = which(str_detect(colnames(examplemat), a))
examplemat[1, b]
Hope this helps.
won782's use of str_split inspired me to find an answer that works, although I still want to know how to do this by matching the prefix instead of excluding the suffix, so I'll accept an answer that does that.
Here's the step-by-step
> str_split("L1estimate","estimate")[[1]][1]
[1] "L1"
replace the above step with one that gets {L1} instead of getting {not estimate} for bonus points
> paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")
[1] "L1Symb"
> examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")]
L1Symb
[1,] "6JU4"
> paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")
[1] "6JU4estimate"
> examplemat[,paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")]
6JU4estimate
[1,] "2"
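On the "bonus" part, matching the prefix rather than removing the suffix: one possible sketch in base R is to capture the prefix with a group and keep only that group. The function name lookup_estimate is hypothetical, and the pattern assumes the search string is exactly "Init" or "L" plus one or two digits, followed by "estimate". A search string that doesn't fit the pattern is left unchanged by sub(), so the column lookup would then fail with a subscript error.

```r
examplemat <- matrix(c("RYU4", "6JU4", "6EU4", 1, 2, 3), ncol = 6)
colnames(examplemat) <- c("InitSymb", "L1Symb", "L2Symb",
                          "RYU4estimate", "6JU4estimate", "6EU4estimate")

# Hypothetical helper: searchStr -> prefix -> symbol -> estimate column
lookup_estimate <- function(mat, searchStr) {
  # Keep only the captured prefix ("Init" or "L" followed by 1-2 digits)
  prefix <- sub("^(Init|L[0-9]{1,2})estimate$", "\\1", searchStr)
  # The first row of the "<prefix>Symb" column holds the symbol name
  symbol <- mat[1, paste0(prefix, "Symb")]
  mat[, paste0(symbol, "estimate")]
}

lookup_estimate(examplemat, "L1estimate")
lookup_estimate(examplemat, "Initestimate")
```

With the example matrix, the first call should give the "6JU4estimate" value and the second the "RYU4estimate" value.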

How to replace strings containing 'greater than' and 'less than'

I am trying to replace strings containing > and < with R
datanames<-names(data)
datanames
## [1] BbMx>2.5 BbAv>2.5 BbMx<2.5 BbAv<2.5
datanames<-gsub("[>]","gt",datanames)
datanames<-gsub("[<]","lt",datanames)
datanames<-gsub("[.]","",datanames)
datanames
## [1] BbMx25 BbAv25 BbMx251 BbAv251
What am I doing wrong?
UPDATE: For some reason, R doesn't read the same characters that are in the CSV. When I open the CSV in LibreOffice I see
"BbMx>2.5" "BbAv>2.5" "BbMx<2.5" "BbAv<2.5"
but once R reads the CSV, these strings turn into
"BbMx.2.5" "BbAv.2.5" "BbMx.2.5.1" "BbAv.2.5.1"
If you just do
x <- c("BbMx>2.5","BbAv>2.5","BbMx<2.5","BbAv<2.5")
x <- gsub("[>]","gt",x)
x <- gsub("[<]","lt",x)
x <- gsub("[.]","",x)
You should get
"BbMxgt25" "BbAvgt25" "BbMxlt25" "BbAvlt25"
as expected. The problem is that the input from names(data) isn't what you think it is.
R has rules about valid column names in data.frames. By default, read.table/read.csv run make.names on the header values to produce unique, valid names. This involves replacing invalid characters (such as > and <) with periods and adding numeric suffixes to ensure uniqueness.
To disable the auto-renaming, you can set check.names=FALSE in the read.table/read.csv call and do the renaming yourself.
So if you have
x<-c("BbMx>2.5", "BbAv>2.5", "BbMx<2.5","BbAv<2.5" )
Then
make.names(x, unique=T)
# [1] "BbMx.2.5" "BbAv.2.5" "BbMx.2.5.1" "BbAv.2.5.1"
So ultimately this had nothing to do with gsub. This was really about how R transforms raw data into data.frames.
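To see the effect of check.names end to end, here is a small sketch that reads an inline CSV instead of a file. The CSV content and the variable names csv_text, renamed, raw and clean are made up for illustration.

```r
# Inline stand-in for the CSV file: one header row, one data row
csv_text <- "BbMx>2.5,BbAv>2.5,BbMx<2.5,BbAv<2.5\n1,2,3,4\n"

# Default behaviour: the header is passed through make.names()
renamed <- read.csv(text = csv_text)
names(renamed)
# [1] "BbMx.2.5"   "BbAv.2.5"   "BbMx.2.5.1" "BbAv.2.5.1"

# With check.names = FALSE the original header survives
raw <- read.csv(text = csv_text, check.names = FALSE)
names(raw)
# [1] "BbMx>2.5" "BbAv>2.5" "BbMx<2.5" "BbAv<2.5"

# Now the gsub() calls from the question behave as intended
clean <- gsub(">", "gt", names(raw), fixed = TRUE)
clean <- gsub("<", "lt", clean, fixed = TRUE)
clean <- gsub(".", "", clean, fixed = TRUE)
clean
# [1] "BbMxgt25" "BbAvgt25" "BbMxlt25" "BbAvlt25"
```

Using fixed = TRUE treats each pattern as a literal string, so the period needs no escaping.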
I know @MrFlick has provided an answer already, but just to comment on how you are using gsub: the < and > characters have no special meaning in a regular expression, so you do not need to place them inside a character class [ ]. You can use them as literals.
And you can cascade your gsub functions together here.
datanames <- gsub('>', 'gt', gsub('<', 'lt', gsub('\\.', '', datanames)))

Assign names to variable using regular expression in R

So I have a bunch of variables in my workspace. I want to assign a subset of them to a new variable, so I can easily run functions on this subset:
workspace:
...
group10
group40
location40
test
desired assignment:
groupList <- list(group10,group40, ...)
intended regular expression:
^group[0-9]+
Any ideas?
ls accepts a pattern argument:
group10 <- group40 <- location40 <- test <- NA
mysub <- ls(pattern="^group[0-9]+")
mysub
#[1] "group10" "group40"
You can use lapply to loop over the vector of variable names and get their values:
groupList <- lapply(mysub, get)
or, in one line
groupList <- lapply(ls(pattern="^group[0-9]+"), get)
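If you also want the list elements to carry the variable names, mget is an alternative to lapply with get. The sketch below uses a scratch environment (ws, a name chosen here for illustration) so that it doesn't touch your workspace; at the interactive prompt you could drop the envir arguments. The trailing $ in the pattern is an added assumption that the names end right after the digits.

```r
# Scratch environment standing in for the workspace
ws <- new.env()
assign("group10", 1:3, envir = ws)
assign("group40", 4:6, envir = ws)
assign("location40", "x", envir = ws)
assign("test", NA, envir = ws)

# ls() filters the names by pattern; mget() fetches them as a named list
group_names <- ls(envir = ws, pattern = "^group[0-9]+$")
groupList <- mget(group_names, envir = ws)
names(groupList)
#[1] "group10" "group40"
```

A named list lets you write groupList$group10 later instead of relying on position.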

Regular Expressions in R - compare one column to another

I currently have a dataset which has two columns that I'd like to compare. In one column, I have a string that I'd like to search for (let's call it column A). In a second column (let's call it column B) are some more strings.
The problem is that both columns have varying contents, so the pattern being searched for in the regular expression is likely to change from one row to another. Normally, when I'm searching a column for a particular string, I use something like this:
df$output <- NA
df$output[grep("TARGET_STRING", df$column_B)] <- "STRING_FOUND"
However, now I'm trying to do this:
df$output[grep(df$column_A, df$column_B)] <- "STRING_FOUND"
Unfortunately, this gives an error:
argument 'pattern' has length > 1 and
only the first element will be used
I've tried various methods to fix this, and can't seem to find a simple solution, and I'm sure there must be one. I can see why it's throwing an error (I think), but I'm not sure how to solve it. What do I need to do to get the regular expression working?
Edit: Here's the testing data.frame I've been using to explore it:
column_A <- c("A", "A", "B", "B")
column_B <- c("A", "zzz", "B", "zzz")
greptest <- data.frame(column_A, column_B)
greptest$output<-NA
greptest$output[grep(greptest$column_A, greptest$column_B)] <- "STRING_FOUND"
You can write a function that wraps grepl and then use apply:
grepFun <- function(rw){
grepl(rw[1],rw[2],fixed=TRUE)
}
xx <- apply(greptest,1,grepFun)
greptest$output[xx] <- "STRING_FOUND"
You've already accepted my answer, but I thought I'd provide another, somewhat more efficient version using ddply from the plyr package:
grepFun1 <- function(x){
ind <- grepl(x$column_A[1],x$column_B,fixed=TRUE)
x$output <- NA
x$output[ind] <- "STRING_FOUND"
x
}
library(plyr)  # provides ddply() and .()
ddply(greptest,.(column_A),.fun=grepFun1)
This version will likely be faster if you have lots of repetition in the values for column_A.
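Another way to pair each pattern with its own row, sketched here in base R, is mapply, which calls grepl once per row. The assumptions are that the patterns should be matched literally (fixed = TRUE) and that the columns are character vectors rather than factors.

```r
greptest <- data.frame(
  column_A = c("A", "A", "B", "B"),
  column_B = c("A", "zzz", "B", "zzz"),
  stringsAsFactors = FALSE
)

# grepl(pattern, x) is called once per row: column_A[i] against column_B[i]
found <- mapply(grepl, greptest$column_A, greptest$column_B,
                MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

greptest$output <- ifelse(found, "STRING_FOUND", NA_character_)
greptest
```

Unlike ==, this also flags rows where column_A is only a substring of column_B.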
I'm not sure what your expected result is, but here's my code:
> grep(greptest[1,"column_B"], greptest$column_A)
[1] 1 2
> grep(greptest[2,"column_B"], greptest$column_A)
integer(0)
> grep(greptest[3,"column_B"], greptest$column_A)
[1] 3 4
> grep(greptest[4,"column_B"], greptest$column_A)
integer(0)
> cbind(column_A,column_B,column_A==column_B)
column_A column_B
[1,] "A" "A" "TRUE"
[2,] "A" "zzz" "FALSE"
[3,] "B" "B" "TRUE"
[4,] "B" "zzz" "FALSE"
I switched A and B in the grep code, because otherwise you only get one hit per grep. If you'd like to search for all of the elements, you have to loop through them (or use a loop equivalent).
If you'd just like to compare row by row, a simple == suffices.