How do you subset a data frame based on grep? [duplicate] - regex

I am having trouble subsetting my data. I want to subset the data on column x, keeping only the rows whose first 3 characters are G45.
My data frame:
x <- c("G448", "G459", "G479", "G406")
y <- c(1:4)
My.Data <- data.frame(x, y)
I have tried:
subset (My.Data, x=="G45*")
But I am unsure how to use wildcards. I have also tried grep() to find the indices:
grep ("G45*", My.Data$x)
but it returns all 4 rows, rather than just those beginning with G45, probably because I am unsure how to use wildcards.

It's pretty straightforward using [ to extract. First, the reason your grep call matched every row is that * in a regular expression is not a wildcard: it means "zero or more of the preceding item", so G45* matches G4 followed by any number of 5s, which every value contains. Anchor the pattern with ^ instead.
grep will give you the position at which it matched your search pattern (unless you use value = TRUE).
grep("^G45", My.Data$x)
# [1] 2
Since you're searching within the values of a single column, that actually corresponds to the row index. So, use that with [ (where you would use My.Data[rows, cols] to get specific rows and columns).
My.Data[grep("^G45", My.Data$x), ]
#      x y
# 2 G459 2
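As an aside, value = TRUE returns the matching values themselves instead of their positions:
grep("^G45", My.Data$x, value = TRUE)
# [1] "G459"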
The help-page for subset shows how you can use grep and grepl with subset if you prefer using this function over [. Here's an example.
subset(My.Data, grepl("^G45", My.Data$x))
#      x y
# 2 G459 2
As of R 3.3, there's now also the startsWith function, which you can again use with subset (or with any of the other approaches above). According to the help page for the function, it's considerably faster than using substring or grepl.
subset(My.Data, startsWith(as.character(x), "G45"))
#      x y
# 2 G459 2

You may also use the stringr package together with dplyr:
library(dplyr)
library(stringr)
My.Data %>% filter(str_detect(x, '^G45'))
Note the use of '^' (starts with) in the pattern; it is needed here to obtain the results you want.

Related

Use lapply on a subset of list elements and return list of same length as original in R

I want to apply a regex operation to a subset of list elements (which are character strings) using lapply, and return a list of the same length as the original. The list elements are long strings (derived from reading in long text files and collapsing paragraphs into a single string). The regex operation is valid only for a subset of the list elements/strings. I want the non-subsetted list elements (character strings) to be returned in their original state.
The regex operation is str_extract from the stringr package, i.e. I want to extract a substring from a longer string. I subset the list elements based on a regex pattern in the filename.
An example with simplified data:
library(stringr)
texts <- as.list(c("abcdefghijkl", "mnopqrstuvwxyz", "ghijklmnopqrs", "uvwxyzabcdef"))
filenames <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")
names(texts) <- filenames
regexp <- "abcdef"
I know in advance to which strings I want to apply the regex operation, and hence I want to subset these strings. That is, I don't want to run the regex over all elements in the list, as doing so will return some invalid results (which is not apparent in this simplified example).
I've made a few naive efforts, e.g.:
x <- lapply(texts[str_detect(names(texts), "1997")], str_extract, regexp)
> x
$AB1997R.txt
[1] "abcdef"
$DC1997S.txt
[1] "abcdef"
which returns a reduced-length list containing just the substrings found.
But the results I want to get are:
> x
$AB1997R.txt
[1] "abcdef"
$BG2000S.txt
[1] "mnopqrstuvwxyz"
$MN1999R.txt
[1] "ghijklmnopqrs"
$DC1997S.txt
[1] "abcdef"
where the strings not containing the regex pattern are returned in their original state.
I have informed myself about stringr, lapply and llply (in the plyr package), but many operations are illustrated using dataframes as examples, not lists, and don't involve regex operations on character strings. I can achieve my goal using a for loop, but I'm trying to get away from that, as is generally advised, and get better at using the apply-class of functions.
You can use the subset-assignment operator [<-:
x <- texts
is1997 <- str_detect(names(texts), "1997")
x[is1997] <- lapply(texts[is1997], str_extract, regexp)
x
# $AB1997R.txt
# [1] "abcdef"
#
# $BG2000S.txt
# [1] "mnopqrstuvwxyz"
#
# $MN1999R.txt
# [1] "ghijklmnopqrs"
#
# $DC1997S.txt
# [1] "abcdef"
#
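The same idea also fits on one line with base R's replace(), which returns a modified copy of the list rather than assigning in place (a small sketch reusing is1997 and regexp from above):
x <- replace(texts, is1997, lapply(texts[is1997], str_extract, regexp))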
You can try sub
sub(paste0('.*(', regexp, ').*'), '\\1', texts)
# AB1997R.txt BG2000S.txt MN1999R.txt DC1997S.txt
# "abcdef" "mnopqrstuvwxyz" "ghijklmnopqrs" "abcdef"
Also, if you need to match only the names of 'texts' containing 1997, we can use grep:
indx <- grep('1997', names(texts))
texts[indx] <- sub(paste0('.*(', regexp, ').*'), '\\1', texts[indx])
as.list(texts)

filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. My code, however, returns a list of numbers instead of the data frame column names (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (Thanks to user227710):
I now get column names, but I get every column in my dataset, not just those that contain stat.mineBlock.minecraft. and stone as I was trying to get.
To return the column names you need to set value=TRUE as an additional argument of grep. The default in grep is value=FALSE, which gives you the indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
Here is a solution in dplyr:
library(dplyr)
your_df %>%
  select(starts_with("stat.mineBlock.minecraft"))
The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
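For instance, a sketch with matches() using a regex for the names in the question (the pattern itself is an assumption, as above):
your_df %>%
  select(matches("stat\\.mineBlock\\.minecraft\\..*stone"))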
My answer is based on this SO post. As per the regex, you were very close.
The [...] brackets create a character class matching a single character from the defined set, and that is the main reason your pattern was not working. Also, perl=TRUE is generally safer to use with regex in R.
So, here is my sample code:
df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
  "stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)

Correct wrongly formatted dates

I have some incorrectly formatted dates mixed in with correctly formatted ones, looking something like this:
df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))
How can I convert the incorrect format between the existing correctly formatted dates?
I'm able to remove the leading dashes, but I also need to remove the trailing -01 or -1, so that the corrected values are:
desired <- c("1.1.11","1.11.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05")
What I'm struggling with is the -01 part, since removing those would also remove part of the correctly formatted dates.
EDIT: The format is mm.dd.yy
Here is a pretty simple solution using sub ...
sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
# [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
Just remove all the non-word characters at the start, or a -01 or -1 at the end that is not preceded by - and two digits:
> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=T)
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
[6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
A simple regexp will solve these kinds of problems pretty well:
> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
[1] "--1.1.11-01" "--1.11.12-1" "--1.1.13-01" "--1.1.14-01" "--1.10.10-01" "-1.10.11-01" "---1.10.12-01"
[8] "2010-03-31" "2010-04-01" "2010-04-05"
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12" "2010-03-31" "2010-04-01"
[10] "2010-04-05"
Note that I made it a character vector instead of data.frame.
The solution itself just matches one pattern or the other and drops the rest by replacing the whole string with the captured subpattern.
I observe here that an illegal suffix (-01 or -1) exists only when the date has a -1 or --1 prefix.
You could first take all such values into a vector: "--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01".
Now you can check the prefix: if it is -1 or --1, mark that entry so the -01 suffix is removed as well.
Given the input pattern above, I feel that this strategy would work.
Please let me know if it does.
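A minimal sketch of this strategy, reusing the vector x from the previous answer:
bad <- grepl("^-+1\\.", x)         # entries with a -1 / --1 style prefix
x[bad] <- sub("^-+", "", x[bad])   # drop the leading dashes
x[bad] <- sub("-0?1$", "", x[bad]) # drop the illegal -01 / -1 suffix
x                                  # same result as the answers above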

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
                variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
How should I do this?
Edit Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...
Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")
You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for letters followed by another word boundary. Since _ counts as a word character, no such position exists inside reed_cost, so the third and fourth elements do not match and are left unchanged.
Another approach with base R:
cbind(X["value"],
      setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
                                      function(x)
                                        if (length(x) == 1) c("", x)
                                        else x))),
               c("Policy-cost", "Reed")))
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
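As a further note, more recent versions of tidyr let separate() pad missing pieces directly through its fill argument, which avoids the pre-processing step altogether (a sketch, assuming your tidyr is new enough to have fill):
library(tidyr)
# fill = "left" pads the missing left-hand piece with NA instead of warning
separate(X, variable, c("Policy-cost", "Reed"), sep = "_", fill = "left")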

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions, or must I do something as I would in MS Excel using the MID and FIND functions? E.g. in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this, where the regular expression matches a space followed by any sequence of characters, and sub replaces that with a zero-length string:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative, if you wanted the two words in separate columns of a data frame, is as follows. Here as.is = TRUE makes the columns character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
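stringr also has word(), which reads nicely for this and gives the same result as the str_split_fixed call above:
word(x, 1)
# [1] "USDZAR" "R157"   "SPX"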
If you're like me, and regexps will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect #Wojciech's comment.
The regex would be to search for:
\x20.*
(\x20 is the escape for a literal space character) and replace with an empty string.
If you want to know whether it's faster, just time it.
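For example, a quick sketch with system.time on an inflated copy (y is just a test vector; results will vary by machine):
y <- rep(x, 1e5)
system.time(sub(" .*", "", y))
system.time(str_split_fixed(y, " ", n = 2)[, 1])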