pattern matching a filename in R - regex

This is probably real simple, but I can't seem to figure out how to do it.
I have an application in R (Shiny) where a user uploads a *.zip file that contains all the components of an ESRI shapefile. I unpack these files into their own directory. This folder may or may not contain a *.shp.xml file. At some point in my R code, I need to find the exact name of the *.shp file that has been unpacked and distinguish it from the *.shp.xml file. How do I write the expression that will do that? I was thinking of using list.files, but I am unsure how to write the rest of the expression.
thanks!

In R regex patterns, "$" has special meaning as the end of a character element, and literal dots need to be escaped with \\, so:
shpfils <- list.files(path, pattern="\\.shp$")
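A minimal sketch of the list.files approach, assuming a throwaway directory and made-up file names (parcels.* is purely hypothetical) standing in for the unpacked upload:

```r
# Hypothetical stand-in for the directory the zip was unpacked into.
shp_dir <- file.path(tempdir(), "shp_demo")
dir.create(shp_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  shp_dir, c("parcels.shp", "parcels.shp.xml", "parcels.dbf", "parcels.shx")
)))

# "\\." matches a literal dot; "$" anchors the match at the end of the name,
# so "parcels.shp.xml" is not returned.
shp_files <- list.files(shp_dir, pattern = "\\.shp$", full.names = TRUE)
basename(shp_files)
# [1] "parcels.shp"
```

If uploads may carry upper-case extensions such as .SHP, list.files also accepts ignore.case = TRUE.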

This should isolate your file -
Sys.glob("*shp")
as compared to
Sys.glob("*shp*")
which should give both the files
or
Sys.glob("*shp.xml")
which should give the .shp.xml file

Related

give sudo permission to log files on different paths like /a/b1/c.log and /a/b2/d.log etc. files

I need a nice column for the Centrify tool that includes all the log files under the different folders, for example:
/oradata1/oracle/admin/A/scripts/rman_logs/*.log
/oracle/oracle/admin/B/scripts/rman_logs/*.log
/oradata2/admin/C/scripts/logs/*.log
I used this, but after the * character the user can see all logs:
/ora(data(1|2)|cle)/oracle|admin/admin/*/scripts/rman_logs
Which expression must I use?
If I understand your question correctly, you want only .log files. You can use a positive lookahead to assert that it is indeed a log file (contains .log at the end of the filename), and match the filename whatever it is (.*).
Then it's really easy:
(?=.*\.log(?:$|\s)).*
Of course, you can also add specific folders if you wish to restrict the matches, and the positive lookahead will still do its work, e.g.
(?=.*\.log(?:$|\s)).*/scripts/.*
EDIT: As per your comment, you only need those folders, so you just specify their file paths as alternatives and add [^.\s\/]*\.log at the end. So:
(?:\/oradata1\/oracle\/admin\/A\/scripts\/rman_logs\/|\/oracle\/oracle\/admin\/B\/scripts\/rman_logs\/|\/oradata2\/admin\/C\/scripts\/logs\/)[^\s.\/]*\.log
You may shorten the regex by combining file path elements but, in my opinion, that is not necessary; you might as well specify each file path individually if they don't overlap too much.
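One way to sanity-check the alternation pattern before putting it into Centrify is to try it on a few sample paths. The sketch below uses R's PCRE mode (perl = TRUE) as a stand-in, which is an assumption: whether Centrify's matcher accepts the same syntax is not confirmed here. The ^ and $ anchors and the sample file names are additions for the test, and the slashes are left unescaped because PCRE does not require \/.

```r
# The three allowed directories from the question, followed by a file name
# made of characters that are not whitespace, dot, or slash, then ".log".
log_pattern <- paste0(
  "^(?:/oradata1/oracle/admin/A/scripts/rman_logs/",
  "|/oracle/oracle/admin/B/scripts/rman_logs/",
  "|/oradata2/admin/C/scripts/logs/)",
  "[^\\s./]*\\.log$"
)

# Hypothetical sample paths: the first three should be allowed.
paths <- c(
  "/oradata1/oracle/admin/A/scripts/rman_logs/backup.log",
  "/oracle/oracle/admin/B/scripts/rman_logs/full.log",
  "/oradata2/admin/C/scripts/logs/arch.log",
  "/oradata1/oracle/admin/A/scripts/rman_logs/old/backup.log",  # subfolder
  "/oradata1/oracle/admin/A/scripts/rman_logs/backup.txt",      # not a .log
  "/oradata2/admin/D/scripts/logs/arch.log"                     # other folder
)

allowed <- grepl(log_pattern, paths, perl = TRUE)
allowed
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

Because the class [^\s./] excludes dots, a name such as backup.2024.log would also be rejected; widen the class if such names need to match.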
I have found a global expression.
This is not a good way, but it works and saves me a lot of work. The main files are under ....../scripts/rman_logs/ on all servers, so I use this approach.
I can produce these lines and make them a command group for users, so this works well:
tail /*/*/*/*/scripts/rman_logs/*.log
tail /*/*/*/scripts/rman_logs/*.log
Thanks for your help.

Regular expression replace filenames

I have a large XML file with many references to different file names, all PDF files. I want to replace all the different file names with one specific file name. I am using Notepad++.
For example:
cat.pdf
dog.pdf
bird.pdf
Replace all these with whale.pdf.
I have googled, searched, tried and failed for so long right now, and I cannot make it work. I don't know what I am doing wrong.
If you specifically intend to match several names you can do that in this way:
(cat|dog|bird)\.pdf\b
You can try
\w+\.pdf\b
Replace with whale.pdf.
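Both patterns can be tried outside Notepad++ first. The sketch below runs them through R's gsub with perl = TRUE, on the assumption that \w and \b behave the same way in Notepad++'s regex engine; the XML fragment, including fish.pdf, is made up for illustration.

```r
# Hypothetical XML fragment with four PDF references.
xml_text <- '<a href="cat.pdf"/><a href="dog.pdf"/><a href="bird.pdf"/><a href="fish.pdf"/>'

# Only the listed names are replaced; fish.pdf is left alone.
listed_only <- gsub("(cat|dog|bird)\\.pdf\\b", "whale.pdf", xml_text, perl = TRUE)
listed_only
# [1] "<a href=\"whale.pdf\"/><a href=\"whale.pdf\"/><a href=\"whale.pdf\"/><a href=\"fish.pdf\"/>"

# Any run of word characters followed by .pdf is replaced.
any_pdf <- gsub("\\w+\\.pdf\\b", "whale.pdf", xml_text, perl = TRUE)
any_pdf
# [1] "<a href=\"whale.pdf\"/><a href=\"whale.pdf\"/><a href=\"whale.pdf\"/><a href=\"whale.pdf\"/>"
```

Note that \w+ stops at hyphens and spaces, so a name like my-cat.pdf would become my-whale.pdf; a class such as [\w-]+ may be needed for such names.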

Regex return file name, remove path and file extension

I have a data.frame that contains a text column of file names. I would like to return the file name without the path or the file extension. Typically, my file names have been numbered, but they don't have to be. For example:
df<-data.frame(data=c("a","b"),fileNames=c("C:/a/bb/ccc/NAME1.ext","C:/a/bb/ccc/d D2/name2.ext"))
I would like to return the equivalent of
df<-data.frame(data=c("a","b"),fileNames=c("NAME","name"))
but I cannot figure out the slick regular expression to do this with gsub. For example, I can get rid of the extension with (provided the file name ends with a number):
gsub('([0-9]).ext','',df[,"fileNames"])
Though I've been trying various patterns (by reading the regex help files and similar solutions on this site), I can't get a regex to return the text between the last "/" and the first ".". Any thoughts or forwards to similar questions are much appreciated!
The best I have gotten is:
gsub('*[[:graph:]_]/|*[[:graph:]_].ext','',df[,"fileNames"])
But this 1) doesn't get rid of all the leading path characters and 2) is dependent on a specific file extension.
Perhaps this will get you closer to your solution:
library(tools)
basename(file_path_sans_ext(df$fileNames))
# [1] "NAME1" "name2"
The file_path_sans_ext function is from the "tools" package (which I believe usually comes with R), and that will extract the path up to (but not including) the extension. The basename function will then get rid of your path information.
Or, to take from file_path_sans_ext and modify it a bit, you can try:
sub("(.*\\/)([^.]+)(\\.[[:alnum:]]+$)", "\\2", df$fileNames)
# [1] "NAME1" "name2"
Here, I've "captured" all three parts of the "fileNames" variables, so if you wanted just the file paths, you would change "\\2" to "\\1", and if you wanted just the file extensions, you would change it to "\\3".
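To make the three capture groups concrete, here is a small sketch that applies the same pattern to the paths from the question and pulls out each group in turn. The variable names are illustrative only.

```r
paths <- c("C:/a/bb/ccc/NAME1.ext", "C:/a/bb/ccc/d D2/name2.ext")

# Group 1: everything up to the last "/", group 2: the name without dots,
# group 3: the final extension including its dot.
pattern <- "(.*\\/)([^.]+)(\\.[[:alnum:]]+$)"

dir_part  <- sub(pattern, "\\1", paths)
name_part <- sub(pattern, "\\2", paths)
ext_part  <- sub(pattern, "\\3", paths)

dir_part
# [1] "C:/a/bb/ccc/"       "C:/a/bb/ccc/d D2/"
name_part
# [1] "NAME1" "name2"
ext_part
# [1] ".ext" ".ext"
```

Since sub returns its input unchanged when the pattern does not match, a path without a "/" or without an extension comes back as-is under this pattern.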
First of all, to get rid of the "leading path", you can use basename. To remove the extension, you can use sub similar to your description in your question:
filenames <- sub("\\.[[:alnum:]]+$", "", basename(as.character(df$fileNames)))
Note that you should use sub instead of gsub here, because the file extension can only occur once for each filename. Also, you should use \\. which matches a dot instead of . which matches any symbol. Finally, you should append $ to the pattern to make sure you are removing the extension only if it is at the end of the filename.
Edit: the function file_path_sans_ext suggested in Ananda Mahto's solution works via sub("([^.]+)\\.[[:alnum:]]+$", "\\1", x), i.e. instead of removing the extension as above, the non-extension part of the filename is retained. I can't see any specific advantage or disadvantage of either method in the OP's case.

Finding multiple files from different folders using regular expressions

I'm trying to load multiple .txt files in R, from different folders.
I have problems writing the path and pattern using regular expressions.
My path has this structure:
'/Users/folderA/folderB/folderC/folderD/01_01_2012/folderE/file.txt'
So, the path is almost the same, except that the folder with the date name always changes.
I have tried to load it like this:
filesToProcess <- list.files(path = "/Users/folderA/folderB/folderC/folderD/",
pattern = "*_*_*/folderE/*.txt")
But this doesn't seem to work.
Could someone please help me writing down this with regular expressions?
Thanks a lot!
The key here is to use argument recursive=TRUE so that you can search inside the folders that are in the original directory:
filesToProcess <- list.files(path = "/Users/folderA/folderB/folderC/folderD",
pattern = "txt", recursive = TRUE, full.names = TRUE)
The pattern has to correspond to the name of the files; it can't refer to the name of the folders (see ?list.files). That's why you need a second step to narrow the result down to the specific folders you want. Note the use of the argument full.names=TRUE in the previous call, which allows us to keep the path of each file (NB: you also have to drop the final / of the path argument, or else it ends up doubled in the output and leads to an error when you try to load the files).
filesToProcess[grep("folderE", filesToProcess)]
A final note:
Your regular expression was flawed anyway: * means
"The preceding item will be matched zero or more times."
What you wanted was . (see ?regexp):
"The period . matches any single character."
Although the subject refers to regular expressions it seems from the example that you really want to use globs. In that case try:
Sys.glob("/Users/folderA/folderB/folderC/folderD/*_*_*/folderE/*.txt")
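Both approaches can be compared on a small scratch tree. The sketch below builds one under tempdir(); the date folders, the extra "other" folder and the file names are made up, and the stricter pattern "\\.txt$" is used in place of "txt" as a design choice.

```r
# Hypothetical layout: <root>/<date>/folderE/file.txt plus one decoy file.
root <- file.path(tempdir(), "folderD")
for (d in c("01_01_2012", "02_01_2012")) {
  dir.create(file.path(root, d, "folderE"), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, d, "folderE", "file.txt"))
}
dir.create(file.path(root, "01_01_2012", "other"), showWarnings = FALSE)
invisible(file.create(file.path(root, "01_01_2012", "other", "notes.txt")))

# Approach 1: recursive listing, then keep only paths inside a folderE.
all_txt <- list.files(root, pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
in_folderE <- all_txt[grepl("/folderE/", all_txt, fixed = TRUE)]

# Approach 2: a glob, where * matches within one path component.
globbed <- Sys.glob(file.path(root, "*_*_*", "folderE", "*.txt"))

length(all_txt)     # 3, including the decoy notes.txt
length(in_folderE)  # 2
length(globbed)     # 2
```

The resulting vector can then be passed to lapply(in_folderE, read.table) or a similar reader, depending on the file format.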

eclipse file name pattern search with regular expression

I have a lot of java files:
Foo01.java
Foo02.java
Foo03.java
Foo04.java
Foo05.java
Foo01Bar.java
Foo02Bar.java
Foo03Bar.java
Foo04Bar.java
Foo05Bar.java
And I need to replace an expression in, and only in, the FooXX.java classes.
Using CTRL + H in Eclipse, in the file name pattern, I tried Foo(\d\d).java, but it does not work. If I write Foo*.java, every FooXXBar.java also appears, which I don't want.
What's the way to do it?
I don't think eclipse has the capability to do full regular expressions on file names. As far as I know you can use * to match any string and ? to match any single character for a file. As a result if your file list is similar to the above you can search for:
Foo??.java
For more complex file searches you probably need to use a combination of the unix/windows command line tools (depending on your OS choice).
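The difference between the two wildcards can be illustrated outside Eclipse. The sketch below uses R's utils::glob2rx to translate each glob into a regex and applies it to a few of the file names from the question; it assumes Eclipse treats * and ? the way the answer describes and is not a test of Eclipse itself.

```r
java_files <- c("Foo01.java", "Foo02.java", "Foo01Bar.java", "Foo02Bar.java")

# glob2rx turns a wildcard pattern into an anchored regular expression.
two_chars <- utils::glob2rx("Foo??.java")  # ? = exactly one character
any_chars <- utils::glob2rx("Foo*.java")   # * = any run of characters

java_files[grepl(two_chars, java_files)]
# [1] "Foo01.java" "Foo02.java"
java_files[grepl(any_chars, java_files)]
# [1] "Foo01.java"    "Foo02.java"    "Foo01Bar.java" "Foo02Bar.java"
```

Note that ? matches any character, not only digits, so Foo??.java would also pick up a file such as FooAB.java if one existed.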