Exact pattern match in R - regex

I am reading files from a folder using list.files, but I want to read only specific files. I have files like the ones below.
D420000900100hour.1-4-2001.31-12-2001
D420000700600hour8.1-1-2001.31-12-2004
D420000500150hour.1-1-2001.31-12-2004
Notice that I have both "hour" and "hour8" here. I want to list only the files containing exactly "hour".
files <- list.files(pattern = "hour")
With this piece of code, however, it returns files with both "hour" and "hour8". I tried using ^ and $, but they don't seem to work with "pattern".
How do I do this?

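Anchors do work in pattern, but ^ and $ refer to the start and end of the whole file name, so "^hour$" would only match a file named exactly "hour". Based on the example, we can instead change the pattern argument to "hour" followed by a literal dot: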
list.files(pattern = "hour\\.")
Or "hour" followed by any character that is not a digit:
list.files(pattern = "hour[^0-9]")
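To see why both patterns work, they can be tried against the three sample names. This is a minimal sketch in Python, used here only as a convenient regex sandbox (the helper name matching is made up for illustration); it assumes, as with list.files, that the pattern may match anywhere in the name.

```python
import re

# Sample file names copied from the question.
names = [
    "D420000900100hour.1-4-2001.31-12-2001",
    "D420000700600hour8.1-1-2001.31-12-2004",
    "D420000500150hour.1-1-2001.31-12-2004",
]

def matching(pattern, candidates):
    """Return the candidates that contain a match for pattern (unanchored search)."""
    rx = re.compile(pattern)
    return [c for c in candidates if rx.search(c)]

plain = matching(r"hour", names)            # matches "hour" and "hour8" alike
dotted = matching(r"hour\.", names)         # "hour" followed by a literal dot
non_digit = matching(r"hour[^0-9]", names)  # "hour" followed by a non-digit

print(len(plain), len(dotted), len(non_digit))
```

Note that "hour[^0-9]" needs at least one character after "hour", so it would not match a name that ends in "hour".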

Related

REGEX - Remove Unwanted Text

I have a list of items (for example, files in a folder); each item in the list is its own string.
In the example, the digits in X--Y-- increment.
My program has the filenames in a list, e.g. ["file1.txt", "file2.txt"]
Item 1:
"X1Y2 alehandro alex.txt"
Item 2:
"X1Y3 james file of files.txt"
For each string I want to keep only the first part (the "X1Y2" part) and the extension, so I need to remove all the extra text from the filename.
I just want a regex for this; I still struggle with regex.
I need to pass it through a "replace with empty string" step (I am using Microsoft PowerToys PowerRename to do this).
Alternatives in PowerShell are also welcome.
Any advice would be appreciated.
I want the output to be the following:
["X1Y2.txt","X2Y3.txt","X4Y3.txt"]
with the unwanted extra text removed.
A general solution using re.sub along with a list comprehension might be:
import re
files = ["X1Y2 alehandro alex.txt", "X1Y3 james file of files.txt"]
output = [re.sub(r'(\S+).*\.(\w+)$', r'\1.\2', f) for f in files]
print(output) # ['X1Y2.txt', 'X1Y3.txt']

Matching string between nth occurrence of character in python with RegEx

I'm working with a tar.gz file that contains txt files, and I am trying to extract the filename from each TarInfo object, whose member.name property looks like this:
aclImdb/test/neg/1026_2.txt
aclImdb/test/neg/1027_5.txt
...
aclImdb/test/neg/1030_4.txt
I've written the following code which prints the string test/neg/1268_2
import re
import tarfile

regex = r'\/((?:[^/]*/).*?)\.'
with tarfile.open("C:\\Users\\Orestis\\Desktop\\aclImdb_v1.tar.gz") as archive:
    for member in archive.getmembers():
        if member.isreg():  # regular files only, skip directories
            m = re.findall(regex, member.name)
            print(m)
How should I modify the regex to extract only the 1268_2 part of the filenames? Effectively I want to extract the string after the 3rd occurrence of "/" and before the 1st occurrence of ".".
You could hardcode this:
.*?\/.*?\/.*?\/(.*?)\.
More elegant is something along the lines of this:
(.*?\/){3}(.*?)\.
You can simply change the 3 to suit your pattern. (Note that the group you want is the second one: $2 in replacement syntax, or m.group(2) on a Python match object.)
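As a minimal sketch of the repeated-group pattern in Python (the function name after_nth_slash and its parameter n are made up for illustration), using re.match so that counting starts at the beginning of the name:

```python
import re

def after_nth_slash(name, n=3):
    """Return the text after the n-th "/" and before the next ".", or None if there is no match."""
    # Group 1 is the repeated "segment plus slash"; group 2 is the part we want.
    pattern = r"(.*?/){%d}(.*?)\." % n
    m = re.match(pattern, name)
    return m.group(2) if m else None

members = [
    "aclImdb/test/neg/1026_2.txt",
    "aclImdb/test/neg/1027_5.txt",
    "aclImdb/test/neg/1030_4.txt",
]
ids = [after_nth_slash(name) for name in members]
print(ids)
```

With re.findall the same pattern returns one tuple per match (one entry per group), which is why picking group 2 from a match object is more convenient here.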

List files in R that do NOT match a pattern

R has a function to list files in a directory, which is list.files(). It comes with the optional parameter pattern= to list only files that match the pattern.
Files in directory data:
File1.csv File2.csv new_File1.csv new_File2.csv
list.files(path="data", pattern="new_")
results in [1] "new_File1.csv" "new_File2.csv".
But how can I invert the search, i.e. list only File1.csv and File2.csv?
I believe you will have to do it yourself, as list.files does not support Perl regex (so you couldn't do something like pattern=^(?!new_)).
That is, list all the files and then filter them with grep:
grep(list.files(path="data"), pattern='new_', invert=TRUE, value=TRUE)
The grep(...) does the pattern matching; invert=TRUE inverts the match; value=TRUE returns the values of the matches (i.e. the filenames) rather than the indices of the matches.
I think the simplest approach (and probably the fastest, if you include programmer time) is to run list.files twice: once to list all the files, and once with the pattern for the files you do not want. Then use the setdiff function to keep the file names that are not in the group you want to exclude.
Complementing @Greg Snow's answer:
library("here")
path <- here("Data", "Folder", "Subfolder")
trees_to_dfs <- list.files(path, pattern = "\\.csv$")  # escape the dot and anchor at the end of the name
unwanted <- list.files(path, pattern = "all\\.csv$")   # files to leave out
trees_to_dfs <- base::setdiff(trees_to_dfs, unwanted)
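The two approaches above (invert the match, or take a set difference) are not specific to R. As a rough sketch in Python, with the file names given as a plain list rather than read from disk and both helper names invented for illustration:

```python
import re

def exclude_matching(names, pattern):
    """Keep the names that do NOT contain a match for pattern (an inverted filter)."""
    rx = re.compile(pattern)
    return [n for n in names if not rx.search(n)]

def set_difference(all_names, unwanted):
    """Keep the names from all_names that are not in unwanted, preserving order."""
    drop = set(unwanted)
    return [n for n in all_names if n not in drop]

# File names from the question's "data" directory.
names = ["File1.csv", "File2.csv", "new_File1.csv", "new_File2.csv"]

kept = exclude_matching(names, r"new_")
unwanted = [n for n in names if re.search(r"new_", n)]
kept_again = set_difference(names, unwanted)
print(kept)
```

Both routes give the same result; the inverted filter needs only one pass over the names.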

selecting files.csv in R

I need to select all the .csv files in a folder whose names contain only non-numerical characters.
I use the following code, but it selects only 9 of the 13 files with the chosen pattern. Is it right?
I select files like Berlin.csv
filenames <- list.files(pattern="[:alpha:].csv", full.names=TRUE)
ldf <- lapply(filenames, read.csv, header = FALSE)
length(ldf)
ldf
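You want something like the following. POSIX classes such as [:alpha:] are only recognised inside a bracket expression (hence the double brackets), and the dot has to be escaped to match a literal ".":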
list.files(pattern = "^[[:alpha:]]+\\.csv")
That pattern will match any CSV that starts with and contains only alphabetical characters. But, if you want to allow filenames with other non-alphabetic characters (e.g., spaces, punctuation), use something like this:
list.files(pattern = "^[^[:digit:]]+\\.csv")
That will just exclude any filenames that have a digit in them. (Note the two different meanings of ^ when used inside and outside of a character class.)
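The behaviour of the two patterns can be checked on a few names. The sketch below is in Python and rests on some assumptions: Python's re module has no POSIX classes, so [A-Za-z] stands in for [[:alpha:]] and [0-9] for [[:digit:]] (ASCII only), and the file names other than Berlin.csv are hypothetical.

```python
import re

# ASCII stand-ins (assumption): [A-Za-z] for [[:alpha:]], [0-9] for [[:digit:]].
LETTERS_ONLY = re.compile(r"^[A-Za-z]+\.csv")
NO_DIGITS = re.compile(r"^[^0-9]+\.csv")

# Berlin.csv is from the question; the other names are made up for illustration.
names = ["Berlin.csv", "Berlin2.csv", "New York.csv", "data_01.csv"]

letters_only = [n for n in names if LETTERS_ONLY.search(n)]
no_digits = [n for n in names if NO_DIGITS.search(n)]

print(letters_only)
print(no_digits)
```

Inside the brackets, ^ negates the class; at the start of the pattern, it anchors the match to the beginning of the name.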

Regular Expression for Ignoring Files that Begin and End with a Sequence

I am using Sublime Text 2 and trying to filter out any files that do not begin with a string sequence or end with a string sequence.
Here are some samples with my desired outcome:
AAA.123.ZZZ = TRUE
AAA.MY.SPECIAL.FILE.ZZZ = TRUE
ABC.123.ZZZ = FALSE
AAA.123.XYZ = FALSE
/SUBFOLDERNAME = FALSE
FILE NAME WITH WHITESPACE.TXT = FALSE
I am using the following expression, but many files are getting by the filter:
^(?!AAA\..*\.ZZZ$)[\w\.-]+$
I want to include this regular expression in the Sublime Text 2 SFTP configuration under the "ignore_regexes" section.
I realize this is a double negative (using an ignore list with an inverse match), but I want to be able to replace AAA and ZZZ so that only files that begin with AAA. and end with .ZZZ are included by Sublime SFTP.
I don't know if you can find something simpler, but the following appears to work:
^(?!AAA\.).*|.*(?<!\.ZZZ)$
as illustrated in http://rubular.com/r/3yUXh0TOfE
Or, if you need to avoid the negative lookbehind, you can use:
^(?!AAA\.).*|.*(?!\.ZZZ).{4}$
as illustrated in http://rubular.com/r/VUd3yAQTzl
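The patterns can also be checked against the sample names from the question. This is only a sketch in Python, assuming the ignore patterns are applied as an unanchored search against each name (how Sublime SFTP actually applies "ignore_regexes" is not confirmed here); the constant and function names are made up for illustration.

```python
import re

# Sample names from the question, in the same order.
samples = [
    "AAA.123.ZZZ",
    "AAA.MY.SPECIAL.FILE.ZZZ",
    "ABC.123.ZZZ",
    "AAA.123.XYZ",
    "/SUBFOLDERNAME",
    "FILE NAME WITH WHITESPACE.TXT",
]

ASKER = re.compile(r"^(?!AAA\..*\.ZZZ$)[\w\.-]+$")
WITH_LOOKBEHIND = re.compile(r"^(?!AAA\.).*|.*(?<!\.ZZZ)$")
WITHOUT_LOOKBEHIND = re.compile(r"^(?!AAA\.).*|.*(?!\.ZZZ).{4}$")

def included(ignore_rx, names):
    """Names that survive the ignore filter, i.e. that the pattern does NOT match."""
    return [n for n in names if not ignore_rx.search(n)]

asker_result = included(ASKER, samples)
lookbehind_result = included(WITH_LOOKBEHIND, samples)
no_lookbehind_result = included(WITHOUT_LOOKBEHIND, samples)

print(asker_result)
print(lookbehind_result)
print(no_lookbehind_result)
```

The original pattern lets names containing "/" or whitespace through because [\w\.-]+ cannot match those characters, so the ignore rule never fires for them.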