Filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. However, my code returns a list of numbers instead of the column names (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (Thanks to user227710):
I now get column names, but I get every column in my dataset, not just those that contain stat.mineBlock.minecraft. and stone, which is what I was trying to get.

To return the column names you need to set value=TRUE as an additional argument of grep. The default is value=FALSE, so grep gives you the indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
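As a minimal sketch of the difference (the column names below are made up for illustration and are not from the question's data):

```r
# Hypothetical column names, standing in for colnames(ingame)
cols <- c("stat.mineBlock.minecraft.stone", "stat.jump", "player.name")

# Default (value = FALSE): integer positions of the matching elements
idx <- grep("^stat\\.", cols)

# value = TRUE: the matching elements themselves
hits <- grep("^stat\\.", cols, value = TRUE)

# Indexing the original vector with the positions gives the same names
same <- identical(cols[idx], hits)
```

The indices are handy when you want to subset the data frame itself, e.g. ingame[, idx], while value=TRUE is convenient when you only need the names.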

Here is a solution in dplyr:
library(dplyr)
your_df %>%
select(starts_with("stat.mineBlock.minecraft"))
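Note that starts_with() only checks the prefix, so it does not require "stone" to appear in the name. The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.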
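A small sketch of both helpers, assuming dplyr is installed. The data frame and the third column name are hypothetical, and the regex is one possible way to require both the prefix and "stone":

```r
library(dplyr)

# Hypothetical data frame; check.names = FALSE keeps the names exactly as typed
your_df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.jump" = 3,
  check.names = FALSE
)

# starts_with() checks only the literal prefix, so the water column is kept too
by_prefix <- select(your_df, starts_with("stat.mineBlock.minecraft"))

# matches() takes a regex, so "stone" can be required as well
by_regex <- select(your_df, matches("^stat\\.mineBlock\\.minecraft\\..*stone"))
```

names(by_regex) then gives the column names if you only need those rather than the data.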

My answer is based on this SO post. As per the regex, you were very close.
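The main problem is that [] creates a character class, which matches a single character from the listed set, so [stone] matches one of s, t, o, n or e rather than the word "stone". The dots also need to be escaped (\\. inside an R string) to match a literal dot. Also, I prefer to pass perl=T, which switches R to the PCRE regex engine.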
So, here is my sample code:
df <- data.frame(
"stat.mineBlock.minecraft.123456stone" = 1,
"stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
"stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
"stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)
See IDEONE demo
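To see why the original pattern matched far too much, here is a rough comparison on three made-up column names (the third name is hypothetical, chosen because it has nothing to do with stone):

```r
# Hypothetical column names for illustration
cols <- c(
  "stat.mineBlock.minecraft.123456stone",
  "stat.mineBlock.minecraft.DFHFFBSBwater2",
  "stat.craftItem.minecraft.stick"
)

# The question's pattern: each [...] matches ONE character from the set, so it
# only asks for a listed first character followed, after optional
# alphanumerics, by one of s, t, o, n, e
class_pattern <- "^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?"
loose <- grepl(class_pattern, cols, ignore.case = TRUE)

# Literal text with escaped dots, then the word "stone" somewhere after it
literal_pattern <- "^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*stone"
strict <- grepl(literal_pattern, cols, ignore.case = TRUE)
```

Here loose is TRUE for all three names (each starts with "s" followed by "t"), while strict is TRUE only for the first one.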

Related

how do you subset data frame based on grep [duplicate]

I am having trouble subsetting my data. I want to subset the data on column x, keeping the rows where the first 3 characters are G45.
My data frame:
x <- c("G448", "G459", "G479", "G406")
y <- c(1:4)
My.Data <- data.frame (x,y)
I have tried:
subset (My.Data, x=="G45*")
But I am unsure how to use wildcards. I have also tried grep() to find the indices:
grep ("G45*", My.Data$x)
but it returns all 4 rows, rather than just those beginning G45, probably also as I am unsure how to use wildcards.
It's pretty straightforward using [ to extract:
grep will give you the position in which it matched your search pattern (unless you use value = TRUE).
grep("^G45", My.Data$x)
# [1] 2
Since you're searching within the values of a single column, that actually corresponds to the row index. So, use that with [ (where you would use My.Data[rows, cols] to get specific rows and columns).
My.Data[grep("^G45", My.Data$x), ]
# x y
# 2 G459 2
The help-page for subset shows how you can use grep and grepl with subset if you prefer using this function over [. Here's an example.
subset(My.Data, grepl("^G45", My.Data$x))
# x y
# 2 G459 2
As of R 3.3, there's now also the startsWith function, which you can again use with subset (or with any of the other approaches above). According to the help page for the function, it's considerably faster than using substring or grepl.
subset(My.Data, startsWith(as.character(x), "G45"))
# x y
# 2 G459 2
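As for why grep("G45*", ...) returned all four rows: in a regex, * is a quantifier rather than a shell-style wildcard. The following sketch, using the x vector from the question, compares the approaches side by side; glob2rx() is shown as one option if you prefer to think in wildcards:

```r
x <- c("G448", "G459", "G479", "G406")

# In a regex, "5*" means "zero or more 5s", so "G45*" matches anything containing "G4"
all_rows <- grepl("G45*", x)

# Anchored regex: only values that start with G45
by_regex <- grepl("^G45", x)

# Same test without a regex (startsWith needs R >= 3.3.0)
by_prefix <- startsWith(x, "G45")

# glob2rx() translates a shell-style wildcard into a regex
by_glob <- grepl(glob2rx("G45*"), x)
```

Any of the last three logical vectors can be used directly as the row index, e.g. My.Data[by_regex, ].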
You may also use the stringr package
library(dplyr)
library(stringr)
My.Data %>% filter(str_detect(x, '^G45'))
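With this particular data you could also leave out the '^' (starts with) and get the same result, since G45 does not occur anywhere else in the values. Keep the anchor if the match must be at the start of the string.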

PostgreSQL - How do I extract the first occurrence of a substring in a string using a regular expression pattern?

I am trying to extract a substring from a text column using a regular expression, but in some cases, there are multiple instances of that substring in the string.
In those cases, I am finding that the query does not return the first occurrence of the substring. Does anyone know what I am doing wrong?
For example:
If I have this data:
create table data1
(full_text text, name text);
insert into data1 (full_text)
values ('I 56, donkey, moon, I 92')
I am using
UPDATE data1
SET name = substring(full_text from '%#"I ([0-9]{1,3})#"%' for '#')
and I want to get 'I 56' not 'I 92'
You can use regexp_matches() instead:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1];
As no additional flag is passed, regexp_matches() only returns the first match - but it returns an array so you need to pick the first (and only) element from the result (that's the [1] part)
It is probably a good idea to limit the update to only rows that would match the regex in the first place:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1]
where full_text ~ 'I [0-9]{1,3}'
Try the following expression. It will return the first occurrence:
SUBSTRING(full_text, 'I [0-9]{1,3}')
In PostgreSQL 10+ you can use regexp_match():
select regexp_match('I 56, donkey, moon, I 92', 'I [0-9]{1,3}');
Quote from documentation:
In most cases regexp_matches() should be used with the g flag, since
if you only want the first match, it's easier and more efficient to
use regexp_match(). However, regexp_match() only exists in PostgreSQL
version 10 and up. When working in older versions, a common trick is
to place a regexp_matches() call in a sub-select...

subset doesn't recognize regex

I just cannot get this working. I want to subset all the rows containing "mail". I use this:
Email <- subset(Total_Content, source == ".*mail.*")
I have rows like these:
"snt152.mail.live.com",
"mailing.serviciosmovistar.com",
"blu179.mail.live.com"
But when using View(Email)
I just get an empty data.frame (I only see the columns). I don't need to "escape" any metacharacter, because I need the "." to mean "any character" and the "*" to mean "0 or more times", right? Thanks.
Well, no, it doesn't - it's not meant to. You're not passing it a regular expression to be evaluated against each row, you're just passing it a character string; it doesn't know that . and * are regex characters because it's not performing a regex search. It's returning all rows where source is the literal string .*mail.* - which in this case is 0 rows.
What you probably want to be doing (I'm assuming this is a data.frame, here) is:
Email <- Total_Content[grepl(x = Total_Content$source, pattern = ".*mail.*"),]
grepl produces a set of boolean values of whether each entry in Total_Content$source matched the pattern. Total_Content[boolean_vector,] limits to those rows of Total_Content where the equivalent boolean is TRUE.
Why not use subset with a logical regex function?
Email <- subset(Total_Content, grepl(".*mail.*", source) )
The subset function evaluates the expressions given in its 'subset' (row targets) and 'select' (column targets) arguments inside the data frame, which is why source can be used here without the Total_Content$ prefix.
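A quick way to convince yourself of the difference between == and grepl, using a made-up stand-in for Total_Content (the third row is hypothetical):

```r
# Hypothetical stand-in for Total_Content
Total_Content <- data.frame(
  source = c("snt152.mail.live.com", "mailing.serviciosmovistar.com", "www.example.org"),
  stringsAsFactors = FALSE
)

# == compares whole strings literally, so nothing equals ".*mail.*"
literal_hits <- sum(Total_Content$source == ".*mail.*")

# grepl() does the regex search; "mail" alone is enough, since an
# unanchored pattern already matches anywhere in the string
Email <- subset(Total_Content, grepl("mail", source))
```

The leading and trailing .* in the original pattern are harmless but unnecessary, so "mail" and ".*mail.*" select the same rows.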

R: replacing special character in multiple columns of a data frame

I am trying to replace the German special character "ö" in a dataframe with "oe". The character occurs in multiple columns, so I would like to do this in one go without having to specify individual columns.
Here is a small example of the data frame
data <- data.frame(a=c("aö","ab","ac"),b=c("bö","bb","ab"),c=c("öc","öb","acö"))
I tried :
data[data=="ö"]<-"oe"
but this did not work since I would need to work with regular expressions here. However when I try :
data[grepl("ö",data)]<-"oe"
I do not get what I want.
The dataframe at the end should look like:
> data
a b c
1 aoe boe oec
2 ab bb oeb
3 ac ab acoe
>
The file is a csv that I import with read.csv. However, there seems to be no option in the import statement to fix this.
How do I get the desired outcome?
Here's one way to do it:
data <- apply(data,2,function(x) gsub("ö",'oe',x))
Explanation:
Your grepl doesn't work because grepl only returns TRUE/FALSE flags saying where the regex matches; called on a whole data frame, it gives one flag per column rather than one per cell. The assignment then replaces not just the character you want replaced but everything that was flagged. To replace part of a string, you need sub (if you want to replace just once in each string) or gsub (if you want all occurrences replaced). To apply that to every column you loop over the columns using apply.
If you want to return a data frame, you can use:
data.frame(lapply(data, gsub, pattern = "ö", replacement = "oe"))
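The two variants differ in what they return. Here is a sketch on the question's data, plus a third idiom (assigning into data[]) that is my own addition rather than part of the answers above:

```r
# "\u00f6" is the escape for "ö", used here to keep the snippet encoding-independent
data <- data.frame(
  a = c("a\u00f6", "ab", "ac"),
  b = c("b\u00f6", "bb", "ab"),
  c = c("\u00f6c", "\u00f6b", "ac\u00f6"),
  stringsAsFactors = FALSE
)

# apply() coerces to a matrix, so the result is a character matrix
as_matrix <- apply(data, 2, function(x) gsub("\u00f6", "oe", x))

# Assigning into data[] replaces the columns but keeps the data frame
data[] <- lapply(data, gsub, pattern = "\u00f6", replacement = "oe")
```

This assumes every column holds text; with mixed column types you would probably want to restrict the replacement to the character columns so that numeric columns are not turned into strings.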

r gsub and regex, obtaining y*_x* from y*_x*_xxxx.csv

General situation: I am currently trying to name dataframes inside a list according to the csv files they have been retrieved from, and I found that using gsub and regex is the way to go. Unfortunately, I can’t produce exactly what I need, just sort of.
I would be very grateful for some hints from someone more experienced; maybe there is a reasonable R regex cheat sheet?
The files are named like r2_m1_enzyme.csv, and the script should use the first 5 characters to name the corresponding dataframe r2_m1, and so on…
# generates a list of dataframes, to mimic a lapply(f,read.csv) output:
data <- list(data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)))
# this mimics file names obtained by list.files() function
f <-c("r1_m1_enzyme.csv","r2_m1_enzyme.csv","r1_m2_enzyme.csv","r2_m2_enzyme.csv")
# this should name the data frames according to the csv file they have been derived from
names(data) <- gsub("r*_m*_.*","\\1", f)
but it doesn't work as expected... they are named r2_m1_enzyme.csv instead of the desired r2_m1, although .* should stop it?
If I do:
names(data) <- gsub("r*_.*","\\1", f)
I do get r1, r2, r3 ... but I am missing my second index.
The question: So my question is, what regex would allow me to obtain the strings “r1_m1”, “r2_m1”, “r1_m2”, ... from strings that are named r*_m*_xyz.csv
Search history: R regex use * for only one character, Gsub regex replacement, R using parts of filename to name dataframe, R regex cheat sheet,...
If your names are always five characters long you could use substr:
substr(f, 1, 5)
If you want to use gsub you have to group your expression (via ( and )) because \\1 refers to the first group and inserts its content, e.g.:
gsub("^(r[0-9]+_m[0-9]+).*", "\\1", f)
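Putting it together with the question's sample objects, a minimal sketch could look like this. The regmatches() variant at the end is an alternative I am adding for comparison, not something from the answer above:

```r
f <- c("r1_m1_enzyme.csv", "r2_m1_enzyme.csv", "r1_m2_enzyme.csv", "r2_m2_enzyme.csv")
data <- list(data.frame(c(1, 2)), data.frame(c(1, 2)), data.frame(c(1, 2)), data.frame(c(1, 2)))

# "r*_m*_" means: optional r's, "_", optional m's, then "_" right away.
# The digit after "m" breaks that, so nothing matches and nothing is replaced
unchanged <- gsub("r*_m*_.*", "\\1", f)

# With a capture group, \\1 is the text matched inside the parentheses
names(data) <- sub("^(r[0-9]+_m[0-9]+).*", "\\1", f)

# Alternative: extract the matching prefix directly instead of replacing the rest
prefixes <- regmatches(f, regexpr("^r[0-9]+_m[0-9]+", f))
```

Unlike substr(f, 1, 5), the regex versions keep working if an index grows to two digits, e.g. r10_m2_enzyme.csv.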