R! remove element from list which start from specific letters - list

I create a list of files:
folder_GLDAS=dir(foldery[numeryfolderow],pattern="_OBC.asc",recursive=F,full.names=T)
Unfortunately there is one additional object which i would like to remove (file name begin with "NOWY" - NOWYevirainf_OBC.asc).
How can I find index of this element on list to remove it by typing:
folder_GLDAS<=folder_GLDAS[-to_remove] ??

Filter by using a regular expression.
folder_GLDAS <- folder_GLDAS[!grepl("^NOWY", folder_GLDAS)]
(You can also swap grepl for str_detect in stringr.)

Assuming that your list is one-dimensional, something like this should work:
*folder_GLDAS<-*folder_GLDAS[substr(*folder_GLDAS,1,4)!='NOWY']

You can actually make a (rather complex) PERL regex pattern that matches all names that end in "_OBC.asc" but DO NOT start with "NOWY": "^(?!NOWY).*_OBC\\.asc$"
Unfortunately the PERL syntax is not recognized by dir. But you could do it with grep like this:
folder_GLDAS <- dir(foldery[numeryfolderow],recursive=F,full.names=T)
folder_GLDAS <- grep(folder_GLDAS, pattern="^(?!NOWY).*_OBC\\.asc$", perl=T, value=T)
Also note that the "." in "_OBC.asc" needs to be escaped - otherwise you'll match for example "_OBCXasc" as well).

Related

REGEX - Remove Unwanted Text

I have a list of Items example (files in a folder), each item in the list is in its own string.
in the example the X--Y-- Have incrementing Digits.
my program has the filenames in a list eg : ["file1.txt", "file2.txt"]
item 1:
"X1Y2 alehandro alex.txt"
item 2:
"X1Y3 james file of files.txt"
so for each string i want to keep only the first Part the "X1Y2" parts for each file so I need to remove all the extra text on the filename.
I just want a regex expression on how to do this, I still do struggle with regex.
I need to pass this through a, replace with "" algorithm,
(using microsoft powertoys-rename to do this..
Alternatives in powershell also welcome.
any advice would be appreciated
I Want output to be the following
["X1Y2.txt","X2Y3.txt","X4Y3.txt"]
with the unwanted extra text removed.
A general solution using re.sub along with a list comprehension might be:
files = ["X1Y2 alehandro alex.txt", "X1Y3 james file of files.txt"]
output = [re.sub(r'(\S+).*\.(\w+)$', r'\1.\2', f) for f in files]
print(output) # ['X1Y2.txt', 'X1Y3.txt']

Match return substring between two substrings using regexp

I have a list of records that are character vectors. Here's an example:
'1mil_0,1_1_1_lb200_ks_drivers_sorted.csv'
'1mil_0_1_lb100_ks_drivers_sorted.csv'
'1mil_1_1_lb2_100_100_ks_drivers_sorted.csv'
'1mil_1_1_lb100_ks_drivers_sorted.csv'
From these names I would like to extract whatever's between the two substrings 1mil_ and _ks_drivers_sorted.csv.
So in this case the output would be:
0,1_1_1_lb200
0_1_lb100
1_1_lb2_100_100
1_1_lb100
I'm using MATLAB so I thought to use regexp to do this, but I can't understand what kind of regular expression would be correct.
Or are there some other ways to do this without using regexp?
Let the data be:
x = {'1mil_0,1_1_1_lb200_ks_drivers_sorted.csv'
'1mil_0_1_lb100_ks_drivers_sorted.csv'
'1mil_1_1_lb2_100_100_ks_drivers_sorted.csv'
'1mil_1_1_lb100_ks_drivers_sorted.csv'};
You can use lookbehind and lookahead to find the two limiting substrings, and match everything in between:
result = cellfun(#(c) regexp(c, '(?<=1mil_).*(?=_ks_drivers_sorted\.csv)', 'match'), x);
Or, since the regular expression only produces one match, the following simpler alternative can be used (thanks #excaza for noticing):
result = regexp(x, '(?<=1mil_).*(?=_ks_drivers_sorted\.csv)', 'match', 'once');
In your example, either of the above gives
result =
4×1 cell array
'0,1_1_1_lb200'
'0_1_lb100'
'1_1_lb2_100_100'
'1_1_lb100'
For me the easy way to do this is just use espace or nothing to replace what you don't need in your string, and the rest is what you need.
If is a list, you can use a loop to do this.
Exemple to replace "1mil_" with "" and "_ks_drivers_sorted.csv" with ""
newChr = strrep(chr,'1mil_','')
newChr = strrep(chr,'_ks_drivers_sorted.csv','')

Extract info inside parenthesis in R

I have some rows, some have parenthesis and some don't. Like ABC(DEF) and ABC. I want to extract info from parenthesis:
ABC(DEF) -> DEF
ABC -> NA
I wrote
gsub(".*\\((.*)\\).*", "\\1",X).
It works good for ABC(DEF), but output "ABC" when there is not parenthesis.
If you do not want to get ABC when using sub with your regex, you need to add an alternative that would match all the non-empty string and remove it.
X <- c("ABC(DEF)", "ABC")
sub(".*(?:\\((.*)\\)).*|.*", "\\1",X)
^^^
See the IDEONE demo.
Note you do not have to use gsub, you only need one replacement to be performed, so a sub will do.
Also, a stringr str_match would also be handy for this task:
str_match(X, "\\((.*)\\)")
or
str_match(X, "\\(([^()]*)\\)")
Using string_extract() will work.
library(stringr)
df$`new column` <- str_extract(df$`existing column`, "(?<=\\().+?(?=\\))")
This creates a new column of any text inside parentheses of an existing column. If there is no parentheses in the column, it will fill in NA.
The inspiration for my answer comes from this answer on the original question about this topic

List files in R that do NOT match a pattern

R has a function to list files in a directory, which is list.files(). It comes with the optional parameter pattern= to list only files that match the pattern.
Files in directory data:
File1.csv File2.csv new_File1.csv new_File2.csv
list.files(path="data", pattern="new_")
results in [1] "new_File1.csv" "new_File2.csv".
But how can I invert the search, i.e. list only File1.csv and File2.csv?
I belive you will have to do it yourself, as list.files does not support Perl regex (so you couldn't do something like pattern=^(?!new_)).
i.e. list all files then filter them with grep:
grep(list.files(path="data"), pattern='new_', invert=TRUE, value=TRUE)
The grep(...) does the pattern matching; invert=TRUE inverts the match; value=TRUE returns the values of the matches (i.e. the filenames) rather than the indices of the matches.
I think that the simplest (and probably fastest if you include programmer time) approach is to run list.files 2 times, once to list all the files, then the second time with the pattern of files that you do not want, then use the setdiff function to find those file names that are not in the group that you want to exclude.
Complementing #Greg Snow answer:
library("here")
path <- here("Data", "Folder", "Subfolder")
trees_to_dfs <- list.files(path, pattern = ".csv")
unwanted <- list.files(path, pattern = "all.csv")
trees_to_dfs <- base::setdiff(trees_to_dfs, unwanted)

r gsub and regex, obating y*_x* from y*_x*_xxxx.csv

General situation: I am currently trying to name dataframes inside a list in accordance to the csv files they have been retrieved from, I found that using gsub and regex is the way to go. Unfortunately, I can’t produce exactly what I need, just sort of.
I would be very grateful for some hints from someone more experienced, maybe there is a reasonable R regex cheat cheet ?
File are named r2_m1_enzyme.csv, the script should use the first 4 characters to name the corresponding dataframe r2_m1, and so on…
# generates a list of dataframes, to mimic a lapply(f,read.csv) output:
data <- list(data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)))
# this mimics file names obtained by list.files() function
f <-c("r1_m1_enzyme.csv","r2_m1_enzyme.csv","r1_m2_enzyme.csv","r2_m2_enzyme.csv")
# this should name the data frames according to the csv file they have been derived from
names(data) <- gsub("r*_m*_.*","\\1", f)
but it doesnt work as expected... they are named r2_m1_enzyme.csv instead of the desired r2_m1, although .* should stop it?
If I do:
names(data) <- gsub("r*_.*","\\1", f)
I do get r1, r2, r3 ... but I am missing my second index.
The question: So my questions is, what regex expression would allow me to obtain strings “r1_m1”, “r2_m1”, “r1_m2”, ... from strings that are are named r*_m*_xyz.csv
Search history: R regex use * for only one character, Gsub regex replacement, R ussing parts of filename to name dataframe, R regex cheat sheet,...
If your names are always five characters long you could use substr:
substr(f, 1, 5)
If you want to use gsub you have to group your expression (via ( and )) because \\1 refers to the first group and insert its content, e.g.:
gsub("^(r[0-9]+_m[0-9]+).*", "\\1", f)