Assign names to variable using regular expression in R - regex

So I have a bunch of variables in my workspace. I want to assign a subset of them to a new variable, so I can easily run functions on this subset:
workspace:
...
group10
group40
location40
test
desired assignment:
groupList <- list(group10,group40, ...)
intended regular expression:
^group[0-9]+
Any ideas?

ls accepts a pattern argument:
group10 <- group40 <- location40 <- test <- NA
mysub <- ls(pattern="^group[0-9]+")
mysub
#[1] "group10" "group40"
You can use lapply to loop over the list of variable names and get their values
groupList <- lapply(mysub, get)
or, in one line
groupList <- lapply(ls(pattern="^group[0-9]+"), get)

Related

Extract URL parameters and values in R

We want to extract parameters and values from a given URL like
http://www.exemple.com/?a=1&b=2&c=3#def
Using xml2::url_parse we were able to Parse a url into its component pieces. However we still want to devide the query into elements using gsub matching regular expression:
([^?&=#]+)=([^&#]*)
Desired output
a=1
b=2
c=3
Use urltools package to parse URLs.
> u <- "http://www.exemple.com/?a=1&b=2&c=3#def"
> strsplit(urltools::parameters(u), "&")[[1L]]
[1] "a=1" "b=2" "c=3"
> urltools::param_get(u, "b")
b
1 2
We can try
library(stringr)
matrix(str_extract_all(str1, "[a-z](?=\\=)|(?<=\\=)\\d+")[[1]], ncol=2, byrow=TRUE)
Or if we need the = also
str_extract_all(str1, "[a-z]=\\d+")[[1]]
#[1] "a=1" "b=2" "c=3"
data
str1 <- "http://www.exemple.com/?a=1&b=2&c=3#def"

grepl() and lapply to fill missing values

I have the following data as an example:
fruit.region <- data.frame(full =c("US red apple","bombay Asia mango","gold kiwi New Zealand"), name = c("apple", "mango", "kiwi"), country = c("US","Asia","New Zealand"), type = c("red","bombay","gold"))
I would like R to be able to look at other items in the "full" (name) column that don't have values for "name", "country" and "type" and see if they match other items. For instance, if full had a 4th row with "bombay US mango" it would be able to identify that the country should read US, bombay should be under type and mango should be under name.
This is what I have so far, which merely identifies (logically) where the items match:
new.entry <- c("bombay US mango")
split.new.entry <- strsplit(new.entry, " ")
lapply(split.new.entry, function(x){
check = grepl(x, fruit.region, ignore.case=TRUE)
print(check)
})
I'm at a bit of a standstill..I've read through a number of regex posts and the r help guides on grepl but am not able to find a great solution. What I have doesn't fully identify a logical "match" vector so I'm unable to subset and use an if statement to concatenate on different elements. Ideally, I'd like to be able to replace these elements in data.table form as my fruit.region will actually be in a data table. Does anyone have any suggestions on the best approach?
Using the str_detect function from the stringr library. This gives a list, ready to rbind:
library(stringr)
addnewrow <- function(newfruit){
z<-lapply(fruit.region[,2:4], function(x) x[str_detect(new.entry, x)])
z$full <- newfruit
z
}
addnewrow(new.entry)
$name
[1] "mango"
$country
[1] "US"
$type
[1] "bombay"
$full
[1] "bombay US mango"
The next step would depend on your desired outcome - if you only want to add one, try:
rbind(fruit.region, addnewrow(new.entry))
If you have a lot:
z <- do.call(rbind, lapply(c(new.entry, new.entry), addnewrow))
rbind(fruit.region, z)
NB make sure your columns are character first:
fruit.region[] <- lapply(fruit.region, as.character)

Recursively download zip files from webpage (Windows)

Is it possible to download all zip files from a webpage without specifying the individual links one at a time.
I would like to download all monthly account zip files from http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html.
I am using Windows 8.1, R3.1.1. I do not have wget on the PC so can't use a recursive call.
Alternative:
As a workaround i have tried downloading the webpage text itself. I would then like to extract the name of each zip file which i can then pass to download.file in a loop. However, i am struggling with extracting the name.
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
temp <- tempfile()
download.file(pth,temp)
dat <- readLines(temp)
unlink(temp)
g <- dat[grepl("accounts_monthly", tolower(dat))]
g contains character strings with the file names, amongst other characters.
g
[1] " <li>Accounts_Monthly_Data-September2013.zip (775Mb)</li>"
[2] " <li>Accounts_Monthly_Data-October2013.zip (622Mb)</li>"
I would like to extract the name of the files Accounts_Monthly_Data-September2013.zip and so on, but my regex is quite terrible (see for yourself)
gsub(".*\\>(\\w+\\.zip)\\s+", "\\1", g)
data
g <- c(" <li>Accounts_Monthly_Data-September2013.zip (775Mb)</li>",
" <li>Accounts_Monthly_Data-October2013.zip (622Mb)</li>"
)
Use the XML package:
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
library(XML)
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)
mapply(download.file, url = fileURLS, destfile = myfiles)
"//a[contains(text(),'Accounts_Monthly_Data')]" is an XPATH expression. It instructs the XML package to select all nodes that are anchors( a ) containing text "Accounts_Monthly_Data". This results is a list of nodes. The fun = xmlAttrs argument then tells the XML package to pass these nodes to the xmlAttrs function. This function strips the attributes from xml nodes. The anchor only have one attribute in this case the href which is what we are looking for.

Creating a customized objects function (in R) that uses regular expression

Why does not the following custom objects function work?
objects0 <- function(find_term)
{
objects(pattern=glob2rx(paste0("*",find_term,"*")))
}
txt1 <- 100
tt <- 200
> objects0('txt')
character(0)
But when I write
objects(pattern=glob2rx(paste0("*",'txt',"*")))
it works just fine.
You need to specify the environment where to look for objects.
Add parameter envir=parent.frame() to objects call:
objects0 <- function(find_term)objects(pattern=glob2rx(paste0("*",find_term,"*")), envir=parent.frame())
Maybe a better way is to add envir=globalenv() to ensure that the search is done in the global environment always.

Regular Expressions in R - compare one column to another

I currently have a dataset which has two columns that I'd like to compare. In one column, I have a string that I'd like to search for (let's call it column A). In a second column (let's call it column B) are some more strings.
The problem is that both columns have varying contents, so the pattern being searched for in the regular expression is likely to change from one row to another. Normally, when I'm searching a column for a particular string, I use something like this:
df$output <- NA
df$output[grep("TARGET_STRING", df$column_B)] <- "STRING_FOUND"
However, now that I'm trying to do this:
df$output[grep(df$column_A, df$column_B)] <- "STRING_FOUND"
Unfortunately, this gives an error:
argument 'pattern' has length > 1 and
only the first element will be used
I've tried various methods to fix this, and can't seem to find a simple solution, and I'm sure there must be one. I can see why it's throwing an error (I think), but I'm not sure how to solve it. What do I need to do to get the regular expression working?
Edit: Here's the testing data.frame I've been using to explore it:
column_A <- c("A", "A", "B", "B")
column_B <- c("A", "zzz", "B", "zzz")
greptest <- data.frame(column_A, column_B)
greptest$output<-NA
greptest$output[grep(greptest$column_A, greptest$column_B)] <- "STRING_FOUND"
You can write a function that wraps grepl and then use apply:
grepFun <- function(rw){
grepl(rw[1],rw[2],fixed=TRUE)
}
xx <- apply(greptest,1,grepFun)
greptest$output[xx] <- "STRING_FOUND"
You've already excepted my answer, but I thought I'd provide another, somewhat more efficient version using ddply:
grepFun1 <- function(x){
ind <- grepl(x$column_A[1],x$column_B,fixed=TRUE)
x$output <- NA
x$output[ind] <- "STRING_FOUND"
x
}
ddply(greptest,.(column_A),.fun=grepFun1)
This version will likely be faster if you have lots of repetition in the values for column_A.
I'm not sure what your expected result is, but here's my code:
> grep(greptest[1,"column_A"], greptest$column_B)
[1] 1 2
> grep(greptest[2,"column_A"], greptest$column_B)
integer(0)
> grep(greptest[3,"column_A"], greptest$column_B)
[1] 3 4
> grep(greptest[4,"column_A"], greptest$column_B)
integer(0)
> cbind(column_A,column_B,column_A==column_B)
column_A column_B
[1,] "A" "A" "TRUE"
[2,] "A" "zzz" "FALSE"
[3,] "B" "B" "TRUE"
[4,] "B" "zzz" "FALSE"
I switched A and B in the grep code, because otherwise you only get one hit per grep. You have to loop through elements, if you'd like to search for all of them (or use a loop equivalent).
If you'd like just to compare row by row, then a simple == suffices.