regular expression is chopping off last character of filename - regex

Anyone know why this is happening:
Filename: 031_Lobby.jpg
RegExp: (\d+)_(.*)[^_e|_i]\.jpg
Replacement: \1_\2_i.jpg
That produces this:
031_Lobb_i.jpg
For some reason it's chopping the last character from the second backreference (the "y" in "Lobby"). It doesn't do that when I remove the [^_e|_i], so I must be doing something wrong that's related to that.
Thanks!

You force it to chop off the last character with this part of your regex:
[^_e|_i]
Which translates as: Any single character except "_", "e", "|", "i".
The "y" in "Lobby" matches this criterion.
You mean "not _e" and "not _i", obviously, but that's not the way to express it. This would be right:
(\d+)_(.+)(?<!_[ei])\.jpg
Note that the dot needs to be escaped in regular expressions.
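To see the difference concretely, here is a minimal sketch in Python. The question does not say which tool runs the replacement, so using Python's re module is an assumption, and the variable names are illustrative. It applies the original pattern and the lookbehind pattern to the filename from the question.

```python
import re

filename = "031_Lobby.jpg"
replacement = r"\1_\2_i.jpg"

# Original pattern: [^_e|_i] is a character class, so it consumes exactly
# one character (here the "y") that is then missing from group 2.
chopped = re.sub(r"(\d+)_(.*)[^_e|_i]\.jpg", replacement, filename)

# Lookbehind pattern: (?<!_[ei]) consumes nothing; it only asserts that the
# two characters before ".jpg" are not "_e" or "_i".
fixed_pattern = r"(\d+)_(.+)(?<!_[ei])\.jpg"
fixed = re.sub(fixed_pattern, replacement, filename)

# A name that already ends in "_i" does not match, so it is left unchanged.
untouched = re.sub(fixed_pattern, replacement, "031_Lobby_i.jpg")

print(chopped, fixed, untouched)
```

The lookbehind works here because both excluded suffixes have the same length, which Python's re requires for lookbehind assertions.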

It is removing the "y" because [^_e|_i] matches the y, and the .* matches everything before the y.

You're forcing it to have a last character different from _e and _i. You should use this instead (note the last *):
(\d+)_(.*)[^_e|_i]*.jpg

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written the following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, because for the above example:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df"))) # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df"))) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start to the end of the string
[...] one of the given characters
...{3} three repetitions of the preceding element
(...)* zero or more repetitions of the group in the parentheses
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,defx,df")
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
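A short Python 3 sketch of the point above: re.match pins only the start of the string, so the end has to be pinned explicitly. The helper name is_valid is hypothetical, and re.fullmatch is used here as one alternative to a trailing "$".

```python
import re

# Original pattern: succeeds as soon as one "xxx," triple is found at the
# start, regardless of what follows.
loose = re.match(r"([abc][abc][abc],)+", "abc,defx,df")

def is_valid(data):
    """Return True if data consists only of comma-separated triples of a, b, c."""
    # fullmatch requires the whole string to match the pattern.
    return re.fullmatch(r"([abc]{3},)*[abc]{3}", data) is not None

results = {s: is_valid(s) for s in ("abc,bca,cbb", "bcb", "abc,defx,df", "abc,")}
print(loose.group(0), results)
```

The last test string shows that a trailing comma is rejected, because the final triple is matched without one.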
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
import re

data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
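So the regex to check whether the string matches the pattern will be
data_string = "abc,bca,df"
# Anchor the end with $ and require a comma only between triples,
# so that "abc,bca,df" is rejected instead of matching just "abc,bca,"
match = re.match(r'^[abc]{3}(,[abc]{3})*$', data_string)
if match:
    print("data string is correct")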
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters of [abc] followed by a comma, one or more times, and does not care what follows the match. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'), repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b, and c:
import re

for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...)-like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group.
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See the classic issue described in the "re.findall behaves weird" post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.
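The re.findall effect mentioned above can be seen with a small sketch. The sample text is made up for illustration; only the standard re module is used.

```python
import re

text = "abc,bca cab"

# Non-capturing group: findall returns the whole of each match.
whole_matches = re.findall(r"(?:[abc]{3},)*[abc]{3}", text)

# Capturing group: findall returns only what group 1 captured, which is the
# last repetition of the group, or an empty string if it did not take part.
captured_only = re.findall(r"([abc]{3},)*[abc]{3}", text)

print(whole_matches)
print(captured_only)
```

Both patterns match the same substrings; only the returned values differ, which is why the non-capturing form is usually the safer choice when the group exists purely for repetition.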

R digit-expression and unlist doesn't work

So I've bought a book on R and automated data collection, and one of the first examples is leaving me baffled.
I have a table with a date-column consisting of numbers looking like this "2001-". According to the tutorial, the line below will remove the "-" from the dates by singling out the first four digits:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]4$"))
When I run this command, "yend_clean" is simply set to "character (empty)".
If I remove the "4$", I get all of the dates split into atoms so that the list that originally looked like this "1992", "2003" now looks like this "1", "9" etc.
So I suspect that something around the "4$" is the problem. I can't find any documentation on this that helps me figure out the correct solution.
Was hoping someone in here could point me in the right direction.
This is a regular expression question. Your regular expression is wrong. Use:
unlist(str_extract_all("2003-", "^[[:digit:]]{4}"))
or equivalently
sub("^(\\d{4}).*", "\\1", "2003-")
or if really all you want is to remove the "-"
sub("-", "", "2003-")
Repetition in regular expressions is controlled by the {} quantifier. You were missing that. Additionally, $ means match the end of the string, so your expression translates as:
match any single digit, followed by a 4, followed by the end of the string
When you remove the "4", then the pattern becomes "match any single digit", which is exactly what happens (i.e. you get each digit matched separately).
The pattern I propose says instead:
match the beginning of the string (^), followed by a digit repeated four times.
The sub variation is a very common technique where we create a pattern that matches what we want to keep in parentheses, and then everything else outside of the parentheses (.* matches anything, any number of times). We then replace the entire match with just the piece in the parens (\\1 means the first sub-expression in parentheses). \\d is equivalent to [[:digit:]].
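The same patterns can also be tried in Python, whose re module treats \d, {4}, ^ and $ the same way for this input. This is only an illustration in a different language from the R code above, and the variable names are made up.

```python
import re

value = "2003-"

# A digit, then a literal "4", then end of string: nothing in "2003-" fits.
digit_then_4 = re.findall(r"\d4$", value)

# A single digit: every digit is matched on its own.
single_digits = re.findall(r"\d", value)

# Four digits at the start of the string.
first_four = re.findall(r"^\d{4}", value)

# Capture the part to keep, match everything else, replace with the capture.
kept = re.sub(r"^(\d{4}).*", r"\1", value)

print(digit_then_4, single_digits, first_four, kept)
```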
A good website to learn about regex
A visualization tool to see how specific regular expressions match strings
If you mean the book Automated Data Collection with R, the code could be like this:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]{4}[-]$"))
yend_clean <- unlist(str_extract_all(yend_clean, "^[[:digit:]]{4}"))
This assumes that you have a string such as "1993–2007, 2010-" and you want to get the last given year, which is "2010". The first line, which means four digits followed by a dash at the end of the string, returns "2010-", and the second line returns "2010".

Regex.Split string on each literal (included in result)

string s = "123wWdf4d556e";
after splitting result should be:
"123", "w", "W", "d", "f", "4", "d", "556", "e"
The logic is: split to each integer number, and single char.
I have tried something like this, but it doesn't work. An explanation would be nice, so I can understand why it didn't work. :)
string[] result = Regex.Split(s, "\w+(?=[a-zA-Z]");
Edit: edited the above result.
Use a look-behind:
string[] result = Regex.Split(s, "(?<=[a-zA-Z])");
Yours doesn't work because you are trying to split on word characters, and in the course of the split such characters will be removed from the result. Think about it like this: When you split a CSV-string on a comma, are the commas preserved in the result? The same kind of thing is happening in your attempt.
Using an assertion, like you were trying and what I am displaying, works because it's akin to splitting on the void next to the character you are seeking. This is because assertions are "zero-width"--they don't consume anything. So the pattern above basically says, "split on the void that comes after an alphabetic character."
Per your edit, you can use the same concept, but expand on it a tad:
string[] result = Regex.Split(s, @"(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)|(?<=[a-zA-Z])(?=[a-zA-Z])");
You use alternation ( | ) to set up the variants of what you want to split on: integer followed by a letter ( (?<=\d)(?=[a-zA-Z]) ); letter followed by an integer ( (?<=[a-zA-Z])(?=\d) ); any two consecutive letters ( (?<=[a-zA-Z])(?=[a-zA-Z]) ). Each variant uses assertions to split on the voids between the target characters. Using a combination of lookbehind and lookahead permits you to split on this exact void.
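For comparison, the same zero-width split can be sketched in Python. This assumes Python 3.7 or later, where re.split is able to split on empty matches; the variable names are illustrative. A match-based alternative with re.findall is included because it does not depend on that behaviour.

```python
import re

s = "123wWdf4d556e"

# Split on the "void" between digit/letter, letter/digit and letter/letter.
boundary = r"(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)|(?<=[a-zA-Z])(?=[a-zA-Z])"
parts_split = re.split(boundary, s)

# Alternative: describe the pieces themselves (a run of digits or one letter).
parts_match = re.findall(r"\d+|[a-zA-Z]", s)

print(parts_split)
print(parts_match)
```

Describing the pieces is often easier to read than describing the boundaries, although both approaches give the same list for this input.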
Update: This works for a match, not a split.
The following regex will work if you use the 'ignore case' flag (i).
\d*[a-z]
Explanation
Your regex captured all words greedily up to when it was followed by a letter. It did not capture the letter since you used a lookahead.
My regex captures all digits (if any) and the first letter following the digit. You can see it in action on www.debuggex.com. Note that the f is captured, which you omitted from your expected result. I assume this was a mistake.

Perl regex | Match second from the right

I'm trying to parse an OID and extract the #18 but I am unsure on how to write it to count Right to Left using a dot as a delimiter:
1.3.6.1.2.1.31.1.1.1.18.10035
This regex will grab the last value
my $ifindex = ($_=~ /^.*[.]([^.]*)$/);
I haven't found a way to tweak it to get the value I need yet.
How about:
my $str = "1.3.6.1.2.1.31.1.1.1.18.10035";
say ((split(/\./, $str))[-2]);
output:
18
If the format is always the same (ie. always second from right) then you can either use:-
m/(\d+)\.\d+$/;
..and the answer will end up in: $1
Or a different approach would be to split the string into an array on the dots and examine the penultimate value in the array.
What you need is simpler:
my $ifindex;
if (/(\d+)\.\d+$/)
{
    $ifindex = $1;
}
A couple of comments:
You don't need to match the entire string, only the part you care about. Thus, no need to anchor to the beginning with ^ and use .*. Anchor to the end only.
[.] is a character class, intended for matching groups of characters. e.g., [abc] will match either a, b, or c. It should be avoided when matching a single character; just match that character instead. In this case you do need to escape it, since it is a special character: \..
I have assumed based on your example that all of the terms have to be numbers. Hence, I used \d+ for the terms.
my ($ifindex) = ($_ =~ /^.*[.]([^.]*)[.][^.]*$/);
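The two approaches suggested in the answers above (splitting on the dots, or anchoring a short pattern to the end of the string) can be sketched in Python as follows. This is an illustration in a different language from the Perl in the thread, and the variable names are made up.

```python
import re

oid = "1.3.6.1.2.1.31.1.1.1.18.10035"

# Split on the dots and take the second element from the right.
by_split = oid.split(".")[-2]

# Anchor only to the end: digits, a literal dot, then the last run of digits.
m = re.search(r"(\d+)\.\d+$", oid)
by_regex = m.group(1) if m else None

print(by_split, by_regex)
```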

how do the regular expressions * and ? metacharacter work?

Hi I'm going through regular expressions but I'm confused about metacharacters, particularly '*' and '?'.
'*' is supposed to match the preceding character 0 or more times.
For example, 'ta*k' supposedly matches 'tak' and 'tk'.
But I wouldn't have thought this to be true at all - here's my reasoning:
for tak:
regexp: I need a 't'
string: I have 't'
regexp: okay, your next character needs to be an 'a'
string: yes it is
regexp: okay, keep giving me characters until your character isn't an 'a'
string: okay. I've just given you 'k'
regexp: okay, your next character needs to be a 'k'
string: I don't have any more characters left!
regexp: fail
for tk:
regexp: I need a 't'
string: I have 't'
regexp: okay, your next character needs to be an 'a'
string: no, it's a 'k'
regexp: fail
Can someone clarify for me why 'ta*k' matches 'tak' and 'tk'?
* does not mean to match a character zero or more times, but an atom zero or more times. A single character is an atom, but so is any grouping.
And * means zero or more. When the regex cursor has "swallowed" the t, the positions are:
in the regex: t|a*k
in the string: t|ak
The regex engine then tries to eat as many a's as possible. Here there is one. After it has swallowed it, the positions are:
in the regex: ta*|k
in the string: ta|k
Then the k is swallowed:
in the regex: ta*k|
in the string: tak|
End of regex, match. Note that the string may have other characters behind, the regex engine doesn't care: it has a match.
In the case where the string is tk, before a* the positions are:
in the regex: t|a*k
in the string: t|k
But * can match an empty set of as, therefore a* is satisfied! Which means the positions then become:
in the regex: ta*|k
in the string: t|k
Rinse, repeat. Now, let's take taak as an input and ta?k as a regex: this will fail, but let's see how...
# before first character
regex: |ta?k
input: |taak
# t
regex: t|a?k
input: t|aak
# a?
regex: ta?|k
input: ta|ak
# k? Oops! No...
regex: |ta?k
input: t|aak
# t? Oops! No...
regex: |ta?k
input: ta|ak
# t? Oops! No...
regex: |ta?k
input: taa|k
# t? Oops! No...
regex: |ta?k
input: taak|
# t? Oops! No... And nothing to read anymore
# FAIL
Which is why it is VERY important to make regexes fail FAST.
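The retries in the trace above can be loosely reproduced in Python: the match method of a compiled pattern attempts a match starting at exactly one position, which mirrors the engine restarting at each character. This is a sketch of the idea, not of the engine's internals.

```python
import re

pattern = re.compile(r"ta?k")
text = "taak"

# Try every possible starting position, as the engine does before giving up.
attempts = [(start, pattern.match(text, start) is not None)
            for start in range(len(text) + 1)]

# search performs the same scan internally and reports the overall result.
overall = pattern.search(text)

print(attempts)
print(overall)
```

Every starting position fails, so the search as a whole fails only after all of them have been tried.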
Because a* means "zero or more instances of a".
When "it" asks for all characters that aren't "a", once it has one, it (roughly) pushes it back into the input stream. (Or it peeks ahead, or it just keeps it, etc.)
First sequence: here's your first non-"a", I'll hold on to that. You need a "k" next, that's what I have.
Second sequence: the next character doesn't need to be an "a"--it may be zero or more "a". In this case it's none. I'll hold on to that non-"a". You need a "k"? I got your "k" right here still.
You are one character ahead:
regexp: okay, keep giving me characters until your character isn't an 'a'
string: next character is not an 'a'
regexp: okay, your next character needs to be a 'k'
string: next char is a 'k'
So it works. Note that 'a*' means "0 or more occurrences of 'a'", and not "1 or more occurrences of 'a'". For the latter one there's the '+' sign, like in 'a+'.
ta*k means one 't', followed by 0 or more 'a's, followed by one 'k'. So 0 'a' characters would make 'tk' a possible match.
If you want "1 or more" instead of "0 or more", use the + instead of *. That is, ta+k will match 'tak' but not 'tk'.
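The difference between * and + described above can be checked with a few lines of Python. The word list is just the examples from this thread plus 'taak'.

```python
import re

words = ("tk", "tak", "taak")

# "*" allows zero occurrences of 'a'; "+" requires at least one.
star = {w: bool(re.fullmatch(r"ta*k", w)) for w in words}
plus = {w: bool(re.fullmatch(r"ta+k", w)) for w in words}

# For "tk", the a* part matches the empty string between 't' and 'k'.
empty_run = re.fullmatch(r"t(a*)k", "tk").group(1)

print(star)
print(plus)
print(repr(empty_run))
```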
Let me know if there's anything I didn't explain.
By the way, RegEx doesn't always go left to right. The engine often backtracks, peeks ahead and studies the input. It's really complicated, which is why it's so powerful. If you look at sites such as this one, they sometimes explain what the engine is doing. I recommend their tutorials because that's where I learned about RegEx!
The fundamental thing to remember is that a regular expression is a convenient shorthand for typing out a set of strings. a{1,5} is simply shorthand for the set of strings (a, aa, aaa, aaaa, aaaaa). a* is shorthand for ([empty], a, aa, aaa, ...).
Thus, in effect, when you feed a regular expression to a search algorithm, you are telling it the list of strings to search for.
Consequently, when you feed ta*k to your search algorithm, you are actually feeding it the set of strings (tk, tak, taak, taaak, taaaak, ...).
So, yes, it is useful to understand how the search algorithm will work, so that you can offer the most efficient regular expression, but don't let the tail wag the dog.
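The "set of strings" view can be made concrete with a small Python sketch that writes out the sets by hand and checks them against the patterns. The infinite set for ta*k is truncated to its first five members here, which is a simplification for illustration.

```python
import re

# a{1,5} is shorthand for exactly these five strings.
finite_set = ["a" * n for n in range(1, 6)]
all_match = all(re.fullmatch(r"a{1,5}", s) for s in finite_set)
too_long = re.fullmatch(r"a{1,5}", "a" * 6)

# ta*k stands for tk, tak, taak, ... (only the first five are listed).
tak_family = ["t" + "a" * n + "k" for n in range(5)]
family_matches = all(re.fullmatch(r"ta*k", s) for s in tak_family)

print(finite_set, all_match, too_long)
print(tak_family, family_matches)
```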