List files in R that do NOT match a pattern - regex

R has a function to list files in a directory, which is list.files(). It comes with the optional parameter pattern= to list only files that match the pattern.
Files in directory data:
File1.csv File2.csv new_File1.csv new_File2.csv
list.files(path="data", pattern="new_")
results in [1] "new_File1.csv" "new_File2.csv".
But how can I invert the search, i.e. list only File1.csv and File2.csv?

I believe you will have to do it yourself, as list.files does not support Perl regex (so you couldn't do something like pattern="^(?!new_)").
i.e. list all files then filter them with grep:
grep(list.files(path="data"), pattern='new_', invert=TRUE, value=TRUE)
The grep(...) does the pattern matching; invert=TRUE inverts the match; value=TRUE returns the values of the matches (i.e. the filenames) rather than their indices. Note that pattern is passed by name here: grep()'s first positional argument is the pattern, not the vector being searched, so the call above relies on naming it explicitly.

I think the simplest (and probably fastest, if you include programmer time) approach is to run list.files twice: once to list all the files, and a second time with the pattern of files that you do not want. Then use the setdiff function to find the file names that are not in the group you want to exclude.
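A minimal sketch of that two-call approach, using a temporary directory populated with the example file names from the question:

```r
# Create a throwaway directory containing the example files
d <- file.path(tempdir(), "data")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("File1.csv", "File2.csv",
                           "new_File1.csv", "new_File2.csv")))

all_files <- list.files(path = d)                    # everything
unwanted  <- list.files(path = d, pattern = "^new_") # files to exclude
wanted    <- setdiff(all_files, unwanted)
wanted
# [1] "File1.csv" "File2.csv"
```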

Complementing Greg Snow's answer:
library("here")
path <- here("Data", "Folder", "Subfolder")
trees_to_dfs <- list.files(path, pattern = "\\.csv$")  # escape the dot, anchor at the end
unwanted <- list.files(path, pattern = "all\\.csv$")
trees_to_dfs <- base::setdiff(trees_to_dfs, unwanted)

Related

REGEX - Remove Unwanted Text

I have a list of items, for example files in a folder; each item in the list is its own string.
In the example, the X--Y-- parts have incrementing digits.
My program has the filenames in a list, e.g. ["file1.txt", "file2.txt"]
item 1:
"X1Y2 alehandro alex.txt"
item 2:
"X1Y3 james file of files.txt"
For each string I want to keep only the first part, the "X1Y2" part, so I need to remove all the extra text from the filename.
I just want a regex expression for how to do this; I still struggle with regex.
I need to pass this through a replace-with-"" algorithm
(using Microsoft PowerToys PowerRename to do this).
Alternatives in PowerShell are also welcome.
Any advice would be appreciated.
I want the output to be the following:
["X1Y2.txt","X2Y3.txt","X4Y3.txt"]
with the unwanted extra text removed.
A general solution using re.sub along with a list comprehension might be:
import re

files = ["X1Y2 alehandro alex.txt", "X1Y3 james file of files.txt"]
output = [re.sub(r'(\S+).*\.(\w+)$', r'\1.\2', f) for f in files]
print(output) # ['X1Y2.txt', 'X1Y3.txt']

Exact pattern match in R

I am reading files from a folder using list.files, but I want only specific files to be read. I have files like the ones below.
D420000900100hour.1-4-2001.31-12-2001
D420000700600hour8.1-1-2001.31-12-2004
D420000500150hour.1-1-2001.31-12-2004
Notice that here I have "hour" and "hour8". I want to list only the files containing exactly "hour".
files <- list.files(pattern = "hour")
With this piece of code, however, it returns files with both "hour" and "hour8". I am trying to use ^ and $, but they don't seem to work with pattern.
How do I do this?
Based on the example, we can change the pattern argument to "hour" followed by a literal . (escaped, since an unescaped . matches any character):
list.files(pattern = "hour\\.")
Or 'hour' followed by any character that is not a number:
list.files(pattern = "hour[^0-9]")
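Both patterns can be checked against the example names without touching the filesystem, using grepl on a character vector (a quick sketch):

```r
files <- c("D420000900100hour.1-4-2001.31-12-2001",
           "D420000700600hour8.1-1-2001.31-12-2004",
           "D420000500150hour.1-1-2001.31-12-2004")

# "hour" followed by a literal dot
files[grepl("hour\\.", files)]
# [1] "D420000900100hour.1-4-2001.31-12-2001"
# [2] "D420000500150hour.1-1-2001.31-12-2004"

# "hour" followed by any non-digit gives the same result here
files[grepl("hour[^0-9]", files)]
```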

R gsub and regex, obtaining y*_x* from y*_x*_xxxx.csv

General situation: I am currently trying to name dataframes inside a list according to the csv files they were retrieved from, and I found that using gsub and regex is the way to go. Unfortunately, I can't produce exactly what I need, only something close.
I would be very grateful for some hints from someone more experienced. Maybe there is a reasonable R regex cheat sheet?
Files are named r2_m1_enzyme.csv; the script should use the first five characters to name the corresponding dataframe r2_m1, and so on…
# generates a list of dataframes, to mimic a lapply(f,read.csv) output:
data <- list(data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)))
# this mimics file names obtained by list.files() function
f <-c("r1_m1_enzyme.csv","r2_m1_enzyme.csv","r1_m2_enzyme.csv","r2_m2_enzyme.csv")
# this should name the data frames according to the csv file they have been derived from
names(data) <- gsub("r*_m*_.*","\\1", f)
But it doesn't work as expected: the dataframes are named r2_m1_enzyme.csv instead of the desired r2_m1, even though I expected .* to catch the rest.
If I do:
names(data) <- gsub("r*_.*","\\1", f)
I do get r1, r2, r3 ... but I am missing my second index.
The question: So my question is, what regex expression would allow me to obtain the strings "r1_m1", "r2_m1", "r1_m2", ... from strings named r*_m*_xyz.csv?
Search history: R regex use * for only one character, Gsub regex replacement, R using parts of filename to name dataframe, R regex cheat sheet, ...
If your names are always five characters long you could use substr:
substr(f, 1, 5)
If you want to use gsub you have to group your expression (via ( and )), because \\1 refers to the first group and inserts its content, e.g.:
gsub("^(r[0-9]+_m[0-9]+).*", "\\1", f)
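Applied to the file names from the question, the grouped pattern yields exactly the r*_m* prefixes, which can then be assigned as the list names:

```r
f <- c("r1_m1_enzyme.csv", "r2_m1_enzyme.csv",
       "r1_m2_enzyme.csv", "r2_m2_enzyme.csv")

prefixes <- gsub("^(r[0-9]+_m[0-9]+).*", "\\1", f)
prefixes
# [1] "r1_m1" "r2_m1" "r1_m2" "r2_m2"

# Mimic the list of dataframes from the question and name it
data <- list(data.frame(x = 1:2), data.frame(x = 1:2),
             data.frame(x = 1:2), data.frame(x = 1:2))
names(data) <- prefixes
```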

Extracting text in R

I am trying to extract a variable-length substring of text using R. I have several character strings such as the following:
"\"/Users/Nel/Documents/Project/Data/dataset.csv\""
I need to extract the file path from each such character. In this case what I am trying to get is:
path1 <- "/Users/Nel/Documents/Project/Data/dataset.csv"
I am able to use the substring function:
path1 <- substr("\"/Users/Nel/Documents/Project/Data/dataset.csv\"", 3, 46)
with the indices hard-coded to get what I want in this particular instance. However, this particular path is one of many, and I need to be able to find these indices on the fly. I believe the grep() function could work, but I can't figure out the relevant regular expressions. Thanks.
It seems like you are just trying to remove some hard-coded quotation marks.
Try gsub:
x
# [1] "\"/Users/Nel/Documents/Project/Data/dataset.csv\""
gsub('\"',"",x)
# [1] "/Users/Nel/Documents/Project/Data/dataset.csv"
## or
# gsub('["]', "", x)
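Since the stray quotes are always the first and last characters, the substr indices can also be computed on the fly with nchar instead of being hard-coded (a small sketch):

```r
x <- "\"/Users/Nel/Documents/Project/Data/dataset.csv\""

# Drop the first and last character, whatever the path length
path1 <- substr(x, 2, nchar(x) - 1)
path1
# [1] "/Users/Nel/Documents/Project/Data/dataset.csv"
```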

R: remove element from a list which starts with specific letters

I create a list of files:
folder_GLDAS=dir(foldery[numeryfolderow],pattern="_OBC.asc",recursive=F,full.names=T)
Unfortunately there is one additional object which I would like to remove (the file name begins with "NOWY": NOWYevirainf_OBC.asc).
How can I find the index of this element in the list so I can remove it by typing:
folder_GLDAS <- folder_GLDAS[-to_remove] ?
Filter by using a regular expression.
folder_GLDAS <- folder_GLDAS[!grepl("^NOWY", folder_GLDAS)]
(You can also swap grepl for str_detect from stringr.)
Assuming that your list is one-dimensional, something like this should work:
folder_GLDAS <- folder_GLDAS[substr(folder_GLDAS, 1, 4) != 'NOWY']
You can actually write a (rather complex) Perl regex pattern that matches all names that end in "_OBC.asc" but do NOT start with "NOWY": "^(?!NOWY).*_OBC\\.asc$"
Unfortunately, Perl regex syntax is not recognized by dir's pattern argument, but you can do it with grep like this:
folder_GLDAS <- dir(foldery[numeryfolderow], recursive=FALSE, full.names=TRUE)
# Match against the bare file names: with full.names=TRUE the ^ anchor
# would otherwise apply to the start of the whole path, not the name
folder_GLDAS <- folder_GLDAS[grepl("^(?!NOWY).*_OBC\\.asc$", basename(folder_GLDAS), perl=TRUE)]
Also note that the "." in "_OBC.asc" needs to be escaped; otherwise you'll match, for example, "_OBCXasc" as well.
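A self-contained sketch of that two-step approach, using a temporary folder in place of foldery[numeryfolderow] (the file names below are assumptions for illustration); bare names are listed so the ^ anchor applies to the file name itself:

```r
# Create a throwaway folder with some example files
d <- file.path(tempdir(), "gldas")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("evirainf_OBC.asc", "soil_OBC.asc",
                           "NOWYevirainf_OBC.asc")))

# List everything, then keep names ending in "_OBC.asc"
# that do not start with "NOWY"
folder_GLDAS <- dir(d, recursive = FALSE, full.names = FALSE)
folder_GLDAS <- grep(folder_GLDAS, pattern = "^(?!NOWY).*_OBC\\.asc$",
                     perl = TRUE, value = TRUE)
folder_GLDAS
# [1] "evirainf_OBC.asc" "soil_OBC.asc"
```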