selecting files.csv in R - regex

I need to select all the .csv files in a folder whose names contain only non-numeric characters, such as Berlin.csv.
I use the following code, but it selects only 9 of the 13 files that fit the pattern. Is it right?
filenames <- list.files(pattern="[:alpha:].csv", full.names=TRUE)
ldf <- lapply(filenames, read.csv, header = FALSE)
length(ldf)
ldf

You want something like:
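list.files(pattern = "^[[:alpha:]]+\\.csv$")
That pattern matches any CSV file whose name consists only of alphabetic characters. If you want to allow filenames with other non-digit characters (e.g., spaces, punctuation), use something like this:
list.files(pattern = "^[^[:digit:]]+\\.csv$")
That just excludes any filename that has a digit in it. (Note the two different meanings of ^: outside a character class it anchors the match to the start of the name, while as the first character inside a character class it negates the class.) Your original pattern misbehaves because [:alpha:] is only a class name inside a bracket expression; written on its own, it is treated as the set of literal characters :, a, l, p and h.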
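For comparison, the same two filters can be sketched in Python. This is only an illustration: the filenames list is invented, and because Python's re module has no POSIX classes, [A-Za-z] and [0-9] stand in for [[:alpha:]] and [[:digit:]] (an ASCII-only assumption).

```python
import re

# Hypothetical directory listing, invented for illustration.
filenames = ["Berlin.csv", "Paris2.csv", "New York.csv", "2019.csv", "Rome.csv.bak"]

# Name made of letters only, then a literal ".csv" at the very end.
letters_only = re.compile(r"^[A-Za-z]+\.csv$")
# Name with no digit anywhere, then a literal ".csv" at the very end.
no_digits = re.compile(r"^[^0-9]+\.csv$")

only_letters = [f for f in filenames if letters_only.search(f)]
without_digits = [f for f in filenames if no_digits.search(f)]

print(only_letters)
print(without_digits)
```

The second filter is the looser of the two: it also accepts names containing spaces or punctuation, as long as no digit appears.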

Related

Matching string between nth occurrence of character in python with RegEx

I'm working with a tar.gz file that contains txt files, and I'm trying to extract the filename from each TarInfo object, whose member.name property looks like this:
aclImdb/test/neg/1026_2.txt
aclImdb/test/neg/1027_5.txt
...
aclImdb/test/neg/1030_4.txt
I've written the following code which prints the string test/neg/1268_2
import re
import tarfile

regex = r'/((?:[^/]*/).*?)\.'
with tarfile.open("C:\\Users\\Orestis\\Desktop\\aclImdb_v1.tar.gz") as archive:
    for member in archive.getmembers():
        if member.isreg():
            m = re.findall(regex, member.name)
            print(m)
How should I modify the regex to extract only the 1268_2 part of the filenames? Effectively I want to extract the string after the 3rd occurrence of "/" and before the 1st occurrence of ".".
You could hardcode this:
.*?\/.*?\/.*?\/(.*?)\.
More elegant is something along the lines of this:
(.*?\/){3}(.*?)\.
You can simply change the 3 to suit your pattern. (Note that the group you'll want is $2)
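As a quick check, here is a small Python sketch of the counted-repetition pattern applied to two of the member names from the question. The list names and the variable stems are made up for the example, and the repeated group is written as non-capturing so that the wanted part becomes group 1 instead of group 2.

```python
import re

# Two member names taken from the question.
names = ["aclImdb/test/neg/1026_2.txt", "aclImdb/test/neg/1027_5.txt"]

# Skip three "anything up to a slash" segments, then capture lazily up to the first dot.
pattern = re.compile(r"(?:.*?/){3}(.*?)\.")

stems = [m.group(1) for m in map(pattern.match, names) if m]
print(stems)

# With the capturing form from the answer, the same text is in group 2.
capturing = re.match(r"(.*?/){3}(.*?)\.", names[0])
print(capturing.group(2))
```

Using match and a group number avoids the list of tuples that re.findall returns when a pattern has more than one capturing group.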

simple regex to matching multiple word with spaces/multiple space or no spaces and special characters

I have a string that is delimited by a comma.
The first 3 fields are static.
Fields 4-20 are dynamic and can contain any string even if it has special characters but cannot be empty.
Field 21 is static
Field 22 is dynamic and can contain any string even if it has special characters.
Fields 23,24 are static.
I need to make sure the string matches the above criteria and is a match, but am wondering on how to make fields 4-20 have the option of containing the special characters and not be blank. (Total of 17 between 4-20)
If I remove the requirement of the special characters this seems to work:
Field1\,Field2\,Field3\,+([\w\s\,]+)F21/C\,[\w\s\,]+(F/23\,)(Field24)
with this string
Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24
Is there a way to accomplish this with fields 4-20 having special characters and not being empty like "" or " " or am I pushing it too far?
I know I can parse it through c# but I'm experimenting with Regex and it seems pretty powerful.
Thanks
I did not fully understand the problem, but I think this is what you want in the end:
s1,s2,s3,([^ ,]+,){17}s21,[^ ,]+,s23,s24
Replace each sX with the relevant static field.
Example:
https://regex101.com/r/EaAPKH/1
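One thing to watch is that [^ ,]+ rejects any field containing a space, while the sample string in the question has the field "6f 1". A possible variant, sketched here in Python rather than C#, requires at least one non-whitespace character per field but allows spaces around it. The static values come from the question's sample string; the variable names and the choice to let field 22 be empty are assumptions.

```python
import re

# One dynamic field: anything without commas, containing at least one non-blank character.
NON_BLANK_FIELD = r"[^,]*[^\s,][^,]*"

pattern = re.compile(
    r"Field1,Field2,Field3,"
    + r"(?:" + NON_BLANK_FIELD + r",){17}"  # fields 4-20
    + r"F21/C,"                             # field 21 (static)
    + r"[^,]*,"                             # field 22 (assumed: may be empty)
    + r"F/23,Field24"                       # fields 23-24 (static)
)

sample = ("Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,"
          "f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24")

ok = pattern.fullmatch(sample) is not None
blank = pattern.fullmatch(sample.replace(",f5,", ", ,")) is not None
empty = pattern.fullmatch(sample.replace(",f5,", ",,")) is not None
special = pattern.fullmatch(sample.replace(",f5,", ",f#5 & co!,")) is not None
print(ok, blank, empty, special)
```

fullmatch anchors the pattern at both ends, so extra fields before or after the expected 24 are rejected as well.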

Exact pattern match in r

I am reading files from a folder using list.files, but I want to read only specific files. I have files like the ones below.
D420000900100hour.1-4-2001.31-12-2001
D420000700600hour8.1-1-2001.31-12-2004
D420000500150hour.1-1-2001.31-12-2004
Notice that I have both "hour" and "hour8". I want to list only the files containing exactly "hour".
files <- list.files(pattern = "hour")
With this piece of code, however, it returns files with both "hour" and "hour8". I tried using ^ and $, but they don't seem to work with pattern.
How do I do this?
Based on the example, we can change the pattern argument to hour followed by .
list.files(pattern = "hour\\.")
Or 'hour' followed by any character that is not a number
list.files(pattern = "hour[^0-9]")
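The same idea can be illustrated in Python on the three filenames from the question. This sketch uses a negative lookahead, hour(?!\d), which is a Python-side choice and not necessarily a drop-in pattern for list.files. Unlike hour[^0-9], it also matches when "hour" is the last thing in the name.

```python
import re

# Filenames copied from the question.
names = [
    "D420000900100hour.1-4-2001.31-12-2001",
    "D420000700600hour8.1-1-2001.31-12-2004",
    "D420000500150hour.1-1-2001.31-12-2004",
]

# "hour" not immediately followed by a digit.
exact_hour = re.compile(r"hour(?!\d)")

matches = [n for n in names if exact_hour.search(n)]
print(matches)
```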

Match anything except character unless it's followed by some other character

I've got this odd string:
firstName:Paul Henry,retired:true,message:A, B & more,title:mr
which needs to be split into its <key>:<value> pairs. Unfortunately, key/value pairs are separated by , which itself can be part of the value. Hence, a simple string-split at , would not produce the correct result.
Keys contain only word characters and values can contain :.
What I need (I think) is something like
\w*:match-anything-but-comma-unless-comma-is-followed-by-space
What kind of works is
\w*:[\w ?!&%,]*(?![^,])
but of course I wouldn't want to explicitly list all characters in the character class (just listed a few for this example).
If you want to split on a comma, unless the comma is followed by a space, why not just:
,(?=\S)
Not sure what language you are using, but in C# the line might look like:
splitArray = Regex.Split(subjectString, @",(?=\S)");
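For readers not using C#, a rough Python equivalent of the lookahead split is sketched below, applied to the string from the question. The variable names are invented, and splitting each pair on the first colon only is an added assumption, based on the statement that values can contain ":".

```python
import re

subject = "firstName:Paul Henry,retired:true,message:A, B & more,title:mr"

# Split on a comma only when the next character is not whitespace.
parts = re.split(r",(?=\S)", subject)

# Split each pair on the first colon only, so values may contain ":".
pairs = dict(part.split(":", 1) for part in parts)

print(parts)
print(pairs["message"])
```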
You are trying to do something complicated with a regular expression that would be simple (and easy to understand) with a little code. That's usually a mistake. Just write a little code.
In your case, you want to split the input on commas. If you get a chunk that doesn't contain a colon, you want to treat it as part of the previous chunk. So just write that. For example, in Python, I'd do it like this:
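text = "firstName:Paul Henry,retired:true,message:A, B & more,title:mr"
chunks = text.split(',')
associations = []
for chunk in chunks:
    if ':' in chunk:
        associations.append(chunk)
    else:
        # No colon: this chunk is the continuation of the previous value
        associations[-1] += ',' + chunk
# Split on the first colon only, since values may contain ':'
pairs = dict(association.split(':', 1) for association in associations)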

parse text with Matlab

I have a text file (output from an old program) that I'd like to clean. Here's an example of the file contents.
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
|||||||||||||||||
|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08|3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
||||||||||||asasasas
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
*|comment
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
I'd like it to look like:
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
*|comment
*|comment
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
*|comment||||||||||||||||
*|comment|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08||3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
*|comment
*|comment
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
The data are divided into 'packages' distinguished by the first letter of each line (P, N, L, S). Each package must be preceded by at least two dedicated lines starting with *|, which are read as comments. Blank lines between packages with different letters are filled with *|. Where the lines between two different letters do not begin with *|, such comment lines must be added. Blank lines and 'random' characters between lines with the same letter are removed.
Perhaps it is clearer in the example files.
How do I manipulate the text? Thank you in advance for the help.
Use fileread to get your file into MATLAB.
text = fileread('my file to clean.txt');
Split the resulting character string into lines. (The newline characters depend on your operating system, so the pattern below accepts both \r\n and \n.)
lines = regexp(text, '\r?\n', 'split');
It isn't entirely clear exactly how you want the file cleaned, but these things might get you started.
% Replace blank lines with comment string
blanks = cellfun(@isempty, lines);
comment = '*|comment';
lines(blanks) = cellstr(repmat(comment, sum(blanks), 1))
% Prepend comment string to lines that start with a pipe
lines = regexprep(lines, '^\|', '\*\|comment\|')
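For readers more comfortable outside MATLAB, the same two starting steps can be sketched in Python. The function name clean_lines and the short sample text are invented for illustration, and the sketch covers only these two steps: it does not reorder lines or drop junk between identical letters.

```python
import re

COMMENT = "*|comment"

def clean_lines(text):
    """Replace blank lines with a comment and prefix lines that start with a pipe."""
    cleaned = []
    for line in re.split(r"\r?\n", text):
        if line == "":
            # Blank line becomes a bare comment line.
            cleaned.append(COMMENT)
        elif line.startswith("|"):
            # Leading '|' is replaced by '*|comment|'.
            cleaned.append(COMMENT + "|" + line[1:])
        else:
            cleaned.append(line)
    return cleaned

# Short invented sample in the style of the file from the question.
sample = "N|I|01|H01N01\r\n\r\n||||asasasas\r\nS|SH04|SEZIONE04"
result = clean_lines(sample)
print("\n".join(result))
```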
You'll need to know your way around regular expressions. There's a good guide to them at regular-expressions.info.