Problem with a regex to extract information from a filename

I have to extract information from filenames.
There may be a name, which can contain a-z , 0-9 , - , _
There may be a separator, which can be _ , . , / , \ , * , # or a space
There is always a number before the extension
There is always an extension
For now, here is where I am:
^((?<name>[a-z0-9-_]+)(?<separator>[_\.#\/\\*\- ])){0,1}(?<number>\d+)\.(?<extension>[a-z]{3})$
I have to match all of these:
tot0_tutu_00001.tif
tot0.0001.tif
tot0#00001.tif
tot0/0001.tif
tot0\00001.tif
tot0*0001.tif
00001.tif
tot0-tutu_0001.tif
tot0-tutu-00001.tif
tot0-tutu 000001.tif
tot0-tutu000001.tif
That regex works for all cases except the last one:
tot0-tutu000001.tif
I cannot figure out how to solve this.
Here is a sandbox
https://regex101.com/r/wZP6RI/1

The filenames do not start with a separator, so you could make the whole name part optional. If it is present, make sure it starts with a-z0-9 and optionally match the separator.
For the digits part, you can use a negative lookbehind to start matching digits where there is no digit directly before it.
^(?<name>(?:[a-z0-9]+(?:[-_][a-z0-9]+)*(?<separator>[_\.#\/\\*\- ]?))?)(?<!\d)(?<number>\d+)\.(?<extension>[a-z]{3})$
^ Start of string
(?<name> Named group name
(?: Non capture group
[a-z0-9]+ Match 1+ occurrences of a range a-z or 0-9
(?:[-_][a-z0-9]+)* Optionally repeat the previous with either - or _ prepended
(?<separator> Named group separator
[_\.#\/\\*\- ]? Optionally match any of the listed chars
) Close group separator
)? Close non capture group and make it optional
) Close group name
(?<!\d)(?<number>\d+) Named group number, match 1+ digits, asserting no digit directly to the left
\.(?<extension>[a-z]{3}) Match . and named group extension matching 3 times a char in range a-z
$ End of string
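As a quick check outside regex101, here is a minimal Python sketch of the pattern above. Python's re module spells named groups (?P<name>...), so the pattern is adapted accordingly; FILENAME_RE, SAMPLES and parse_filename are names made up for this illustration.

```python
import re

# Same pattern as above, with the named groups rewritten in Python's
# (?P<name>...) syntax. The names below are illustrative only.
FILENAME_RE = re.compile(
    r"^(?P<name>(?:[a-z0-9]+(?:[-_][a-z0-9]+)*(?P<separator>[_\.#\/\\*\- ]?))?)"
    r"(?<!\d)(?P<number>\d+)\.(?P<extension>[a-z]{3})$"
)

# The filenames listed in the question.
SAMPLES = [
    "tot0_tutu_00001.tif",
    "tot0.0001.tif",
    "tot0#00001.tif",
    "tot0/0001.tif",
    r"tot0\00001.tif",
    "tot0*0001.tif",
    "00001.tif",
    "tot0-tutu_0001.tif",
    "tot0-tutu-00001.tif",
    "tot0-tutu 000001.tif",
    "tot0-tutu000001.tif",
]


def parse_filename(filename):
    """Return the named groups as a dict, or None if there is no match."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, "->", parse_filename(sample))
```

Note that, as the pattern is written, the separator group is nested inside the name group, so name also contains the separator character when one is present (for example tot0. for tot0.0001.tif).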

According to the criteria you stated, it feels like it could be simplified:
^[a-z0-9_-]*[_.\/\\*# ]?\d+\.[a-z]{3}$
See https://regex101.com/r/z5eRwM/1.
Feel free to regroup as you need.
Side note: be careful with - in character classes. Unescaped, it needs to be either at the very beginning or at the very end to be treated as the literal - char (and not as the range separator, as in [a-z]):
[-az]: matches -, a and z
[a-z]: matches chars between a and z (- excluded)
[az-]: matches a, z and -
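The three cases can be checked with a short Python sketch. The sample text and the chars_matched helper are made up for this illustration; the escaped form [a\-z] is included as a fourth case.

```python
import re

TEXT = "a-z b"  # illustrative sample text


def chars_matched(char_class, text=TEXT):
    """Return every single character of text matched by the character class."""
    return re.findall(char_class, text)


if __name__ == "__main__":
    print(chars_matched(r"[-az]"))   # hyphen first: literal -
    print(chars_matched(r"[a-z]"))   # hyphen in the middle: range a to z
    print(chars_matched(r"[az-]"))   # hyphen last: literal -
    print(chars_matched(r"[a\-z]"))  # escaped hyphen: literal -
```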

Related

How to conditionally expect particular characters if a prior regex matched?

I want to expect some characters only if a prior part of the regex matched. If not, no characters (an empty string) are expected.
For instance, if a string from the group (A10, B32, C56, D65) (a kind of enumeration) appears after the first four characters, then a "_" followed by a 3-digit number like 123 is expected. If no element of that group appears, no further string is expected.
My first attempt was this but the ELSE branch does not work:
^XXX_(?<DT>A12|B43|D14)(?(DT)(_\d{1,3})|)\.ZZZ$
XXX_A12_123.ZZZ --> match
XXX_A11.ZZZ --> match
XXX_A12_abc.ZZZ --> no match
XXX_A23_123.ZZZ --> no match
These are examples of filenames. If the filename contains a string of the mentioned group, like A12 or C56, then I expect this element to be followed by an underscore and 1 to 3 digits. If the filename does not contain a string of that group (no character, or a character sequence different from the strings in the group), then I don't want to see the underscore followed by 1 to 3 digits.
For instance, I could extend the regex to
^XXX_(?<DT>A12|B43|D14)_\d{5}(?(DT)(_\d{1,3})|)_someMoreChars\.ZZZ$
...and then I want these filenames to be valid:
XXX_A12_12345_123_wellDone.ZZZ
XXX_Q21_00000_wellDone.ZZZ
XXX_Q21_00000_456_wellDone.ZZZ
...but this is invalid:
XXX_A12_12345_wellDone.ZZZ
How can I make the ELSE branch of the conditional statement work?
In the end I intend to have two groups like
Group A: (A11, B32, D76, R33)
Group B: (A23, C56, H78, T99)
If an element of group A occurs in the filename then I expect to find _\d{1,3} in the filename.
If an element of group B occurs in the filename then the _\d{1,3} shall be optional (it may or may not occur in the filename).
I ended up with these regexes:
^XXX_(?:(?<DT>A12|B43|D14))?(?(DT)(_\d{5}_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).*\.ZZZ$
^XXX_(?:(?<DT>A12|B43|D14))?_\d{5}(?(DT)(_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).+\.ZZZ$
Since I have to use this regex in the OpenAPI @Pattern annotation, I run into this error:
Conditionals are not supported in this regex dialect.
As @The fourth bird suggested, alternation seems to do the trick:
XXX_((((A12|B43|D14)_\d{5}_\d{1,3}))|((?:(A10|B10|C20)((?:_\d{5}_\d{3})|(?:_\d{3}))))).*\.ZZZ$
The else branch is the part after the |, but if you also want to match the 2nd example, the if clause will not work, as the pattern has already required one of A12|B43|D14 to match.
The named capture group is not optional, so the if clause will always be true.
What you can do instead is use an alternation to match either the enumeration part followed by an underscore and 1 to 3 digits, or an uppercase char and 2 digits.
^XXX_(?:(?<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
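A minimal Python sketch of the alternation, run against the four example filenames from the question. Python writes the named group as (?P<DT>...). ALTERNATION_RE, STRICT_RE and is_valid are names invented here. STRICT_RE is not part of the answer: it is a hypothetical variant that adds a negative lookahead, because as far as I can tell the plain alternation also accepts XXX_A12.ZZZ through its second branch.

```python
import re

# The alternation from the answer, in Python's named-group syntax.
ALTERNATION_RE = re.compile(
    r"^XXX_(?:(?P<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$"
)

# Hypothetical stricter variant (an assumption, not from the answer):
# the lookahead keeps A12, B43 and D14 out of the second alternative.
STRICT_RE = re.compile(
    r"^XXX_(?:(?P<DT>A12|B43|D14)_\d{1,3}|(?!A12|B43|D14)[A-Z]\d{2})\.ZZZ$"
)


def is_valid(filename, pattern=ALTERNATION_RE):
    """True if the whole filename matches the given pattern."""
    return pattern.match(filename) is not None


if __name__ == "__main__":
    for name in ["XXX_A12_123.ZZZ", "XXX_A11.ZZZ",
                 "XXX_A12_abc.ZZZ", "XXX_A23_123.ZZZ", "XXX_A12.ZZZ"]:
        print(name, is_valid(name), is_valid(name, STRICT_RE))
```

Because neither pattern uses a conditional, this form may be a better fit for dialects that reject (?(DT)...), though that depends on the engine in question.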
If you want to make use of the if/else clause, you can make the named capture group optional, and then check whether group DT exists.
^XXX_(?<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
For the updated question:
^XXX_(?<DT>A12|B43|D14)?(?(DT)(?:_\d{5})?_\d{3}(?!\d)|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d))).*\.ZZZ$
The pattern matches:
^ Start of string
XXX_ Match literally
(?<DT>A12|B43|D14)? Optional named group DT, match one of the alternatives
(?(DT) If we have group DT
(?:_\d{5})? Optionally match _ and 5 digits
_\d{3}(?!\d) Match _ and 3 digits
| Or
(?! Negative lookahead, assert not to the right
A12|B43|D14| Match one of the alternatives, or
[A-Z]\d{2}_\d{3}(?!\d) Match 1 char A-Z, 2 digits _ 3 digits not followed by a digit
) Close lookahead
) Close if clause
.* Match the rest of the line
\.ZZZ Match . and ZZZ
$ End of string
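Python's re module also supports the (?(name)yes|no) conditional, so the pattern above can be tried against the filenames from the updated question. This is only a sketch: the named group is rewritten as (?P<DT>...), and UPDATED_RE and is_valid are made-up names.

```python
import re

# The conditional pattern from above, in Python's named-group syntax.
UPDATED_RE = re.compile(
    r"^XXX_(?P<DT>A12|B43|D14)?"
    r"(?(DT)(?:_\d{5})?_\d{3}(?!\d)"
    r"|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d)))"
    r".*\.ZZZ$"
)


def is_valid(filename):
    """True if the filename matches the conditional pattern."""
    return UPDATED_RE.match(filename) is not None


if __name__ == "__main__":
    for name in ["XXX_A12_12345_123_wellDone.ZZZ",
                 "XXX_Q21_00000_wellDone.ZZZ",
                 "XXX_Q21_00000_456_wellDone.ZZZ",
                 "XXX_A12_12345_wellDone.ZZZ"]:
        print(name, is_valid(name))
```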

How to convert a camelCased variable to lowercase with underscores in Notepad++ or IntelliJ using regular expressions

I have to rename the toString output variables in several hundred files with many occurrences in each. In the most efficient way possible, how could I parse this text:
.append(", myVariable=").append(myVariable)
.append(", myOtherVariable=").append(myOtherVariable)
.append(", mylowervariable=").append(myLowerVariable) // note the left is already lowercase
.append(", myVarWithURL=").append(myVarWithURL);
and it becomes:
.append(", my_variable=").append(myVariable)
.append(", my_other_variable=").append(myOtherVariable)
.append(", mylowervariable=").append(myLowerVariable) // note the left is already lowercase
.append(", my_var_with_url=").append(myVarWithURL);
The ones on the right are to remain unchanged, while the ones to the left of the equals sign are to be changed, if they contain uppercase characters.
These will be of arbitrary lengths with a varying number of upper case letters. I was thinking I had to do some sort of lookahead but could not get the replacement value to work correctly.
I have the flexibility of being able to do this in IntelliJ or Notepad++, so I can easily perform the \l \L operators to make a replacement value lowercase.
This was my thought process:
in: myLongCamelCasedVariable
re: ([a-z]+)([A-Z]{1})([a-z]+) // repeat grouping for capturing
group 1 group 2 group 3 group 4
my + [ L + ong ] + [ C + amel ] + [ C + ased ] + [ V + ariable ]
Is it possible to use a regular expression to effectively capture the various groups of 'text' in the larger text string, and 'loop' over that and apply the output?
Out: $1_\l$2 .... etc
Now I am just stuck
You may use
Find What: (?:\G(?!\A)|",\h*)\K(\b|[a-z]+)([A-Z]+)(?=\w*=")
Replace With: $1_\L$2
Match case: True
Details:
(?:\G(?!\A)|",\h*) - start matching from the end of the previous successful match (\G(?!\A)) or (|) a ", and zero or more horizontal whitespaces (",\h*)
\K - remove the text matched so far from the match memory buffer
(\b|[a-z]+) - Group 1: word boundary or one or more lowercase letters
([A-Z]+) - Group 2: one or more uppercase letters
(?=\w*=") - immediately to the right, there must be zero or more word chars followed by =".
The replacement is $1_\L$2: Group 1, _, and then lowercased Group 2 value.
You could match sequences of an uppercase char followed by optional uppercase chars and then optional lowercase chars.
In the replacement use _ followed by the lowercased match \L$0
Find what:
(?>,\h+[a-z]+|\G(?!^))\K[A-Z][A-Z]*[a-z]*
(?> Atomic group
,\h+[a-z]+ Match a comma, 1 or more horizontal whitespace chars and 1 or more lowercase chars
| Or
\G(?!^) Assert the current position at the end of the previous match but not at the start of the string (so the first part of the alternation has to match first)
) Close atomic group
\K Forget what is matched so far
[A-Z][A-Z]*[a-z]* Match an uppercase char followed by optional upper and lowercase chars
Replace with:
_\L$0
Without using \K you can use 2 capture groups.
(?>(, [a-z]+)|\G(?!^))([A-Z][A-Z]*[a-z]*)
In the replacement use $1_\L$2
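Outside an editor, the same rename can be sketched in Python. Python's re module has no \G, \K or \h, so this swaps in a different technique: one regex finds the label inside the string literal, and a callback converts it. The function names and the assumption that every label has the exact form ", name=" are mine, not from the answers.

```python
import re

# Label inside the string literal: preceded by '", ' and followed by '="'.
# Assumes exactly one space after the comma, as in the sample lines.
LABEL_RE = re.compile(r'(?<=", )\w+(?==")')

# A run of uppercase letters that directly follows a lowercase letter.
UPPER_RUN_RE = re.compile(r"(?<=[a-z])[A-Z]+")


def camel_to_snake(name):
    """myVarWithURL -> my_var_with_url; all-lowercase names stay as they are."""
    return UPPER_RUN_RE.sub(lambda m: "_" + m.group(0).lower(), name)


def rewrite_line(line):
    """Convert only the label left of the equals sign; leave the rest alone."""
    return LABEL_RE.sub(lambda m: camel_to_snake(m.group(0)), line)


if __name__ == "__main__":
    for line in ['.append(", myVariable=").append(myVariable)',
                 '.append(", myOtherVariable=").append(myOtherVariable)',
                 '.append(", mylowervariable=").append(myLowerVariable)',
                 '.append(", myVarWithURL=").append(myVarWithURL);']:
        print(rewrite_line(line))
```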

Comma separated prefix list with commas inside

I'm trying to match a comma-separated list of prefixed values, where the values themselves can also contain a comma.
So far I have only managed to match the occurrences that don't contain a ,.
Sample String (With NL for visualization - original string doesn't have NL):
field01=Value 1,
field02=Value 2,
field03=<xml value>,
field04=127.0.0.1,
field05=User-Agent: curl/7.28.0\r\nHost: example.org\r\nAccept: */*,
field06=Location, Resource,
field07={Item 1},{Item 2}
My current, unoptimized regex looks like this:
(?'fields'(field[0-9]{2,3})=?([\s\w\d_<>.:="*?\-\/\\(){}<>'#]+))([^,](?&fields))*
Does anyone have a clue how to solve this?
EDIT:
The first pattern is close to my expected result.
This is an anonymized full example of the string:
asm01=Predictable Resource Location,Information Leakage,asm02=N/A,asm04=Uncategorized,asm08=2021-02-15 09:18:16,asm09=127.0.0.1,asm10=443,asm11=N/A,asm15=,asm16=DE,asm17=User-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n,asm18=/Common/_www.example.com_live_v1,asm20=127.0.0.1,asm22=,asm27=HEAD,asm34=/Common/_www.example.com_live_v1,asm35=HTTPS,asm39=blocked,asm41=0,asm42=3,asm43=0,asm44=Error,asm46=200000028,200100015,asm47=Unix hidden (dot-file) access,.htaccess access,asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures},asm50=40622,asm52=200000028,asm53=Unix hidden (dot-file) access,asm54={Unix/Linux Signatures},asm55=,asm61=,asm62=,asm63=8985143867830069446,asm64=example-waf.example.com,asm65=/.htaccess,asm67=Attack signature detected,asm68=<?xml version='1.0' encoding='UTF-8'?><BAD_MSG><violation_masks><block>13020008202d8a-f803000000000000</block><alarm>417020008202f8a-f803000000000000</alarm><learn>13000008202f8a-f800000000000000</learn><staging>200000-0</staging></violation_masks><request-violations><violation><viol_index>42</viol_index><viol_name>VIOL_ATTACK_SIGNATURE</viol_name><context>request</context><sig_data><sig_id>200000028</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>2</length></kw_data></sig_data><sig_data><sig_id>200000028</sig_id><blocking_mask>4</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>3</length></kw_data></sig_data><sig_data><sig_id>200100015</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>1</offset><length>9</length></kw_data></sig_data></violation></request-violations></BAD_MSG>,asm69=5,asm71=/Common/_dev.example.com_SSL,asm75=127.0.0.1,asm100=,asm101=HEAD /.htaccess HTTP/1.1\r\nUser-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n#015
The pattern does not work because the fields group matches the literal string field.
You are trying to repeat the named group fields, but the example string does not contain the string field.
Note that [^,] matches any char except a comma. You can also omit the capture group inside the named group fields, as it already is a group, and \d is redundant in the character class because \w also matches digits.
With 2 capture groups:
\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)
\b A word boundary
(asm[0-9]+) Capture group 1, match asm and 1+ digits
= Match literally
(.*?) Capture group 2, match any char, as few as possible
(?= Positive lookahead, assert what is at the right is
,asm[0-9]+= Match ,asm followed by 1+ digits and =
| Or
$ Assert the end of the string
) Close lookahead
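A minimal Python sketch of this pattern, run on a shortened sample assembled from fields of the string in the question. FIELD_RE, SAMPLE and parse_fields are illustrative names.

```python
import re

# Key in group 1, value in group 2. The lazy value stops right before the
# next ",asmNN=" or at the end of the string.
FIELD_RE = re.compile(r"\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)")

# Shortened sample built from fields of the string in the question.
SAMPLE = (
    "asm01=Predictable Resource Location,Information Leakage,"
    "asm02=N/A,asm15=,asm46=200000028,200100015,"
    "asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures}"
)


def parse_fields(line):
    """Return a dict mapping each asmNN key to its (possibly empty) value."""
    return dict(FIELD_RE.findall(line))


if __name__ == "__main__":
    for key, value in parse_fields(SAMPLE).items():
        print(key, "=", repr(value))
```

Values that contain commas stay intact, and empty values such as asm15 come back as empty strings.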
A simple solution would be (see regexr.com/5mg1b):
/((asm\d{2,3})=(.*?))(?=,asm|$)/g
Match groupings will be:
group #1 - asm01=Predictable Resource Location,Information Leakage
group #2 - asm01
group #3 - Predictable Resource Location,Information Leakage
Conditions:
This will match everything including empty values
The key here is to make sure that each match is delimited by either a comma and your field descriptor, or an end of string. A look ahead will be handy here: (?=,asm|$).

Regex to Match Words and Numbers with Repeating Sequences (FOO-123 / FOO-456 /...etc)

https://regexr.com/539me
I have a changelog that I need to look like this:
- [FOO-123] This is a change from one project
- [FOO-567 / FOO-890] This has two changes from one project
- [BAR-123 / BAZ-456 / BANG-1234 ] This has three changes from three different projects
I was satisfied with my current regex, but then I tested it further, and it still matches when I make typos, such as putting the A from BAR into FOO to make FOA, or leaving out a /:
- [FOB-1234] hello
- [BAG-1234] how
- [FOO-1234 FOO-5678] are
- [FOA-1234 / BARG-1234 / BZF-1234] you?
How can I make sure the top examples always match but the bottom ones never do?
Regex I've currently created:
/-\s\[[(FOO|BAR|BAZ|BANG)-\d{\s}{/}{\s}+]*]\s.+/g
You could match one of the alternatives and use an optionally repeating group prepended with a space, forward slash and space.
^-\s\[(?:FOO|BAR|BAZ|BANG)-\d+(?: / (?:FOO|BAR|BAZ|BANG)-\d+)*\] .+$
That will match
^ Start of string
-\s\[ Match -, a whitespace char and [
(?:FOO|BAR|BAZ|BANG) Match any of the alternatives
-\d+ Match - and 1+ digits
(?: Non capture group
/ (?:FOO|BAR|BAZ|BANG)-\d+ Match space, /, space, one of the alternatives, then - and 1+ digits
)* Close group and repeat 0+ times
\] .+ Match ], space and 1+ occurrences of any char except a newline.
$ End of string
Note that you have to remove the [ and ] around the group of alternatives, or else they would turn it into a character class.
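A minimal Python sketch of this pattern. To avoid repeating the alternatives, the regex is assembled from a list of project keys; PROJECT_KEYS, CHANGELOG_RE and is_valid_entry are names chosen for this illustration.

```python
import re

# Project keys from the question. Building the alternation from a list is a
# convenience of this sketch, not something the answer prescribes.
PROJECT_KEYS = ["FOO", "BAR", "BAZ", "BANG"]

keys = "|".join(PROJECT_KEYS)
ticket = rf"(?:{keys})-\d+"  # one ticket id, e.g. FOO-123

# "- [", one ticket, optional " / ticket" repeats, "] " and a description.
CHANGELOG_RE = re.compile(rf"^-\s\[{ticket}(?: / {ticket})*\] .+$")


def is_valid_entry(line):
    """True if the changelog line has the expected ticket prefix."""
    return CHANGELOG_RE.match(line) is not None


if __name__ == "__main__":
    for line in ["- [FOO-123] This is a change from one project",
                 "- [FOO-567 / FOO-890] This has two changes from one project",
                 "- [FOB-1234] hello",
                 "- [FOO-1234 FOO-5678] are",
                 "- [FOA-1234 / BARG-1234 / BZF-1234] you?"]:
        print(is_valid_entry(line), line)
```

The pattern is strict about whitespace: an entry with a space before the closing bracket, such as [BANG-1234 ], is rejected as well.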

I need a regex to validate a name that can be 1, 2, or 3 words

In this example I try to validate for a city name. It works if I enter San Louis Obispo but not if I enter Boulder Creek or Boulder. I thought ? was supposed to make a block optional.
if (!/^[a-zA-Z'-]+\s[a-zA-Z'-]*\s([a-zA-Z']*)?$/.test(field)){
return "Enter City only a-z A-Z .\' allowed and not over 20 characters.\n";
}
I think the spaces are the problem (\s). You made the second and third words optional (by using * instead of +), but not the spaces. The question mark is only applied to the third word because of the parentheses.
The issue with your regex is that, in English, it says to match a word that must be followed by a space, optionally followed by another word, then a second required space, and then optionally another word. So a single word would not match, but a word followed by two spaces would. Likewise, two words with a trailing space would match, but neither case would match without the trailing spaces.
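This behaviour can be reproduced with a short Python sketch of the regex from the question (the character classes and \s behave the same way here as in JavaScript for these inputs). ORIGINAL_RE and original_accepts are made-up names.

```python
import re

# The regex from the question, unchanged.
ORIGINAL_RE = re.compile(r"^[a-zA-Z'-]+\s[a-zA-Z'-]*\s([a-zA-Z']*)?$")


def original_accepts(city):
    """True if the original regex accepts the given input."""
    return ORIGINAL_RE.search(city) is not None


if __name__ == "__main__":
    for city in ["San Louis Obispo", "Boulder Creek", "Boulder",
                 "Boulder  ", "Boulder Creek "]:
        print(repr(city), original_accepts(city))
```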
To fix your exact regex, wrap everything from the second word to the end of the pattern in another group (a non-capturing group, using (?: instead of just () and make this group optional with ?. Also, move the \s's inside the optional groups.
Try this:
^[a-zA-Z'-]+(?:\s[a-zA-Z'-]+(?:\s[a-zA-Z']+)?)?$
Regex explained:
^ # beginning of line
[a-zA-Z'-]+ # first matching word
(?: # start of second-matching word
\s[a-zA-Z'-]+ # space followed by matching word
(?: # start of third-matching word
\s[a-zA-Z']+ # space followed by matching word
)? # third-matching word is optional
)? # second-matching word is optional
$ # end of line
Alternatively, you can try the following regex:
^([a-zA-Z'-]+(?:\s[a-zA-Z'-]+){0,2})$
This will match 1 through 3 words, or "cities", in a given line with the ability to adjust the range of words without having to further-duplicate the matching set for each new word.
Regex explained:
^( # start of line & matching group
[a-zA-Z'-]+ # required first matching word
(?: # start a non-capturing group (it has to match, but is not returned as an individual group)
\s # sub-group required to start with a space
[a-zA-Z'-]+ # sub-group matching word
){0,2} # sub-group can match 0 -> 2 times
)$ # end of matching group & line
So, if you want to add the ability to match more than 3 words, you can change the 2 in the {0,2} range above to be the number of words you want to match minus 1 (i.e. if you want to match 4 words, you'll set it to {0,3}).
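The "number of words minus 1" rule can be wrapped in a small helper. This is a Python sketch under the assumption that the word pattern stays the same as above; city_pattern and is_city are hypothetical names.

```python
import re


def city_pattern(max_words):
    """Compile the regex above for names of 1 up to max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    # The repeated group covers every word after the first, hence max_words - 1.
    return re.compile(
        r"^([a-zA-Z'-]+(?:\s[a-zA-Z'-]+){0,%d})$" % (max_words - 1)
    )


def is_city(name, max_words=3):
    """True if name consists of 1 to max_words words of allowed characters."""
    return city_pattern(max_words).match(name) is not None


if __name__ == "__main__":
    for name in ["San Louis Obispo", "Boulder Creek", "Boulder",
                 "Boulder ", "Salt Lake City Utah"]:
        print(repr(name), is_city(name), is_city(name, max_words=4))
```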