Finding multiple files from different folders using regular expressions

I'm trying to load multiple .txt files in R, from different folders.
I have problems writing the path and pattern using regular expressions.
My path has this structure:
'/Users/folderA/folderB/folderC/folderD/01_01_2012/folderE/file.txt'
So, the path is almost the same, except that the folder with the date name always changes.
I have tried to load it like this:
filesToProcess <- list.files(path = "/Users/folderA/folderB/folderC/folderD/",
pattern = "*_*_*/folderE/*.txt")
But this doesn't seem to work.
Could someone please help me write this with regular expressions?
Thanks a lot!

The key here is to use argument recursive=TRUE so that you can search inside the folders that are in the original directory:
filesToProcess <- list.files(path = "/Users/folderA/folderB/folderC/folderD",
                             pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
The pattern is matched against the file names only; it cannot refer to the names of the folders (see ?list.files). That is why you need a second step to narrow the result down to the specific folders you want. Note the argument full.names = TRUE in the previous call, which keeps the path of each file (NB: also drop the trailing / from the path argument, or else it ends up doubled in the output, which can cause an error when you try to load the files).
filesToProcess[grep("folderE", filesToProcess)]
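The two steps above can be tried end to end on a throwaway directory tree. This is only a sketch: the tree under tempdir(), the folder name folderD_demo and the extra notes.csv file are made up for illustration, and the path regex assumes date folders of the form dd_mm_yyyy, as in the question.

```r
# Build a small directory tree that mimics the layout in the question
# (all names below are hypothetical).
root <- file.path(tempdir(), "folderD_demo")
unlink(root, recursive = TRUE)
dirs <- file.path(root, c("01_01_2012/folderE",
                          "02_01_2012/folderE",
                          "02_01_2012/other"))
for (d in dirs) dir.create(d, recursive = TRUE)
invisible(file.create(file.path(dirs, "file.txt")))
invisible(file.create(file.path(dirs[1], "notes.csv")))

# Step 1: recursive listing; the pattern only sees the file names.
allTxt <- list.files(path = root, pattern = "\\.txt$",
                     recursive = TRUE, full.names = TRUE)

# Step 2: filter on the full path, where the folder names are visible.
wanted <- grepl("/[0-9]{2}_[0-9]{2}_[0-9]{4}/folderE/[^/]+\\.txt$", allTxt)
filesToProcess <- allTxt[wanted]
```

Filtering on the whole path in the second step also rejects a folderE that does not sit directly under a date folder, which a plain grep for "folderE" would let through.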
A final note:
Your regular expression was flawed anyway: in a regular expression, * is a quantifier, not a wildcard. It means
The preceding item will be matched zero or more times.
What you wanted was . (or .* for a run of any characters): see ?regexp
The period . matches any single character.
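A quick way to see the difference is to test candidate patterns on a few folder names with grepl. The names in the vector below are invented for the check, and the stricter second pattern assumes the dd_mm_yyyy layout from the question.

```r
# Hypothetical folder names to test the patterns against.
folders <- c("01_01_2012", "archive", "a_b_c")

# ".*" is the regex counterpart of the glob "*": any run of characters.
loose <- grepl("^.*_.*_.*$", folders)

# A stricter pattern that only accepts two digits, two digits, four digits.
strict <- grepl("^[0-9]{2}_[0-9]{2}_[0-9]{4}$", folders)

loose   # TRUE FALSE TRUE
strict  # TRUE FALSE FALSE
```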

Although the subject refers to regular expressions it seems from the example that you really want to use globs. In that case try:
Sys.glob("/Users/folderA/folderB/folderC/folderD/*_*_*/folderE/*.txt")
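As a rough check of the glob approach, the sketch below builds a small tree under tempdir() and runs the same kind of wildcard path against it. The folder names glob_demo and misc are assumptions made for the example. The last line uses glob2rx from the utils package, which shows the regular expression a given glob corresponds to.

```r
# Hypothetical tree: one date-named folder and one folder that should not match.
root <- file.path(tempdir(), "glob_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "01_01_2012", "folderE"), recursive = TRUE)
dir.create(file.path(root, "misc", "folderE"), recursive = TRUE)
invisible(file.create(file.path(root, "01_01_2012", "folderE", "file.txt")))
invisible(file.create(file.path(root, "misc", "folderE", "file.txt")))

# Wildcards are allowed in every path component, including the date folder.
hits <- Sys.glob(file.path(root, "*_*_*", "folderE", "*.txt"))
dateFolder <- basename(dirname(dirname(hits)))

# The regex equivalent of a glob, for use with list.files(pattern = ...).
rx <- glob2rx("*.txt")
```

Here only the file under 01_01_2012 is returned, because misc contains no underscores.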

Related

Do not include certain source files

I have a folder containing all the log files; the filenames are colour-red, colour-green, colour-blue, colour-yellow, etc. I am writing the SPL to include all the files except one, e.g. colour-white.
I know that * performs a wildcard search and that [^c] excludes the specific character in the brackets, but I don't know how to combine them to exclude a certain word. I am also not sure whether the same regex rules apply in Splunk.
source= "log/colour-*"
source= "log/colour-[^w]"
The desired result of the query is to retrieve all the files except colour-white.
Maybe some filters can be applied to retrieve the desired result, but so far the filters I know are for the file contents, not the file names.
You can also use something like this in your search query:
source!="log/colour-white"
The difference between != and NOT is discussed on Splunk Answers and is worth checking before you choose: != only matches events in which the field exists, while NOT also returns events that lack the field.
The search command (the implicit command before the first |) does not support regex. To exclude something, use NOT.
(source = "log/colour-*" NOT source = "log/colour-white")

how to make sense of expression logic in ssis

I am working on an SSIS project that unzips a folder which, when extracted, contains multiple text files in the same directory, using a ForEachLoop Container.
Each file will have a different name.
I have two variables of which variable 2 has an expression
Variable 1
name = zipfileName
Value= sample.zip
variable 2
name = FileName
value = *.*
Expression = REPLACE(@[User::ZipFileName],".zip",".txt")
I need clarification concerning the expression part
My thinking is that this expression means the .zip extension in the name of the zip file is replaced with .txt when it is extracted. Is that right? I would also like to know how it dynamically changes the file names at runtime, seeing as there are multiple files.
Thanks
From what I can see, the expression replaces .zip with .txt in the value of [User::ZipFileName].
If the value of [User::ZipFileName] is somefile.zip
the output would be:
somefile.txt

pattern matching a filename in R

This is probably real simple, but I can't seem to figure out how to do it.
I have an application in R (Shiny) where a user uploads to the application a *.zip file that contains all the components of an ESRI shapefile. I unpack these files into their own directory. This folder then, may or may not, contain a *.shp.xml file. At some point in my R code, I need to find the exact name of the *.shp file that has been unpacked, and distinguish it from the *.shp.xml file. How do I write the expression that will do that? I was thinking to use list.files, but I am unsure how to write the rest of the expression.
thanks!
In R regex patterns, "$" has a special meaning: it anchors the match to the end of the character element. The dots also need to be escaped with \\, so:
shpfils <- list.files(path, pattern="\\.shp$")
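The effect of the escaped dot and the $ anchor can be checked without touching the file system, since list.files applies its pattern the same way grep does. The file names below are made up to resemble an unpacked shapefile.

```r
# Hypothetical contents of the unpacked folder.
files <- c("parcels.shp", "parcels.shp.xml", "parcels.shx", "parcels.dbf")

# Anchored: the name must end in ".shp".
onlyShp <- grep("\\.shp$", files, value = TRUE)

# Unanchored: ".shp" may appear anywhere, so the .shp.xml file matches too.
anyShp <- grep("\\.shp", files, value = TRUE)

onlyShp  # "parcels.shp"
anyShp   # "parcels.shp" "parcels.shp.xml"
```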
This should isolate your file -
Sys.glob("*shp")
as compared to
Sys.glob("*shp*")
which should give both the files
or
Sys.glob("*shp.xml")
which should give the .shp.xml file

Regex for converting file path to package/namespace

Given the following file path:
/Users/Lawrence/MyProject/some/very/interesting/Code.scala
I would like to generate the following using a single regex replace (the root can be a constant):
some.very.interesting
This is for the purpose of generating a snippet for Sublime Text which can automatically insert the correct package/namespace header for my scala/java classes :)
Sublime Text uses the following syntax for their regex replace patterns (aka 'substitutions'):
{input/regex/replace/flags}
Hence why an iterative approach cannot be taken - it has to be done in one pass! Also, substitutions cannot be nested :(
If you know the maximum number of nested folders, you can specify that in your regex.
For 1 to 3 nested folders
Regex:/Users/Lawrence/MyProject/(\w+)/?(\w+)?/?(\w+)?/[^/]+$
Replace:$1.$2.$3
For 1 to 5 nested folders
Regex:/Users/Lawrence/MyProject/(\w+)/?(\w+)?/?(\w+)?/?(\w+)?/?(\w+)?/[^/]+$
Replace:$1.$2.$3.$4.$5
Given the constraints, this is the only thing you can do.
Input
/Users/Lawrence/MyProject/some/very/interesting/Code.scala
Regex
^/Users/Lawrence/MyProject/([^/]+)/([^/]+)/([^/]+)/Code\.scala$
or
^/[^/]+/[^/]+/[^/]+/([^/]+)/([^/]+)/([^/]+)/[^/]+$
Replace
\1.\2.\3
Update
This gets you closer, but not exactly it:
Regex
(^/Users/Lawrence/MyProject/|/Code\.scala$|/)
Replacement
.
Output would be:
.some.very.interesting.
Without multiple replacements in a single line and without recursive back references it's going to be hard.
You might have to do a second replacement, replacing something like this with an empty string (if you can):
(^\.|\.$)
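The two-pass idea can be checked outside Sublime Text. The sketch below replays both replacements with gsub in R, using perl = TRUE as an approximation; it is an assumption that Sublime's substitution engine treats these particular patterns the same way.

```r
path <- "/Users/Lawrence/MyProject/some/very/interesting/Code.scala"

# Pass 1: turn the root prefix, the file name and every remaining slash into a dot.
pass1 <- gsub("(^/Users/Lawrence/MyProject/|/Code\\.scala$|/)", ".", path,
              perl = TRUE)

# Pass 2: strip the leading and trailing dot left over from pass 1.
pass2 <- gsub("(^\\.|\\.$)", "", pass1, perl = TRUE)

pass1  # ".some.very.interesting."
pass2  # "some.very.interesting"
```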

Regex return file name, remove path and file extension

I have a data.frame that contains a text column of file names. I would like to return the file name without the path or the file extension. Typically, my file names have been numbered, but they don't have to be. For example:
df<-data.frame(data=c("a","b"),fileNames=c("C:/a/bb/ccc/NAME1.ext","C:/a/bb/ccc/d D2/name2.ext"))
I would like to return the equivalent of
df<-data.frame(data=c("a","b"),fileNames=c("NAME","name"))
but I cannot figure out the slick regular expression to do this with gsub. For example, I can get rid of the extension with (provided the file name ends with a number):
gsub('([0-9]).ext','',df[,"fileNames"])
Though I've been trying various patterns (by reading the regex help files and similar solutions on this site), I can't get a regex to return the text between the last "/" and the first ".". Any thoughts or forwards to similar questions are much appreciated!
The best I have gotten is:
gsub('*[[:graph:]_]/|*[[:graph:]_].ext','',df[,"fileNames"])
But this 1) doesn't get rid of all the leading path characters and 2) is dependent on a specific file extension.
Perhaps this will get you closer to your solution:
library(tools)
basename(file_path_sans_ext(df$fileNames))
# [1] "NAME1" "name2"
The file_path_sans_ext function is from the "tools" package, which ships with every standard R installation, and it extracts the path up to (but not including) the extension. The basename function then gets rid of the path information.
Or, to take from file_path_sans_ext and modify it a bit, you can try:
sub("(.*\\/)([^.]+)(\\.[[:alnum:]]+$)", "\\2", df$fileNames)
# [1] "NAME1" "name2"
Here, I've "captured" all three parts of the "fileNames" variables, so if you wanted just the file paths, you would change "\\2" to "\\1", and if you wanted just the file extensions, you would change it to "\\3".
First of all, to get rid of the "leading path", you can use basename. To remove the extension, you can use sub similar to your description in your question:
filenames <- sub("\\.[[:alnum:]]+$", "", basename(as.character(df$fileNames)))
Note that you should use sub instead of gsub here, because the file extension can only occur once for each filename. Also, you should use \\. which matches a dot instead of . which matches any symbol. Finally, you should append $ to the pattern to make sure you are removing the extension only if it is at the end of the filename.
Edit: the function file_path_sans_ext suggested in Ananda Mahto's solution works via sub("([^.]+)\\.[[:alnum:]]+$", "\\1", x), i.e. instead of removing the extension as above, it retains the non-extension part of the filename. I can't see any specific advantage or disadvantage of either method in the OP's case.
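For reference, here is a small sketch that runs both approaches on the data from the question and confirms they agree. The stringsAsFactors = FALSE argument and the final step that strips trailing digits are additions for illustration; the digit step assumes the numbering is always at the end of the name, as in the question's desired output.

```r
df <- data.frame(data = c("a", "b"),
                 fileNames = c("C:/a/bb/ccc/NAME1.ext",
                               "C:/a/bb/ccc/d D2/name2.ext"),
                 stringsAsFactors = FALSE)

# Approach 1: helper from the tools package, then drop the path.
viaTools <- basename(tools::file_path_sans_ext(df$fileNames))

# Approach 2: drop the path first, then remove the extension with sub.
viaSub <- sub("\\.[[:alnum:]]+$", "", basename(df$fileNames))

# Optional extra step: also remove a trailing number, giving "NAME" and "name".
noDigits <- sub("[0-9]+$", "", viaSub)

viaTools  # "NAME1" "name2"
noDigits  # "NAME" "name"
```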