How to write a regular expression for my string? [duplicate] - regex

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
I want to write a regular expression for QR8.4_Z4J25 in a shell script. How can I do it?
Is this correct?
[QR][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]

It's obviously wrong because it'll only match Q8.4_Z4J25 or R8.4_Z4J25, but not QR8.4_Z4J25
A bracket matches any one character specified, so you'd like to write:
[Q][R][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]
You don't need to use brackets for a single character, though, so it can be simplified to
QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]
Be sure to escape the dot if it's outside of a bracket because it would otherwise match any single character.
In case you want to match QR9.1_8A9YK as well, you should change it to
QR[0-9]\.[0-9]_[A-Z0-9]\{5\}
If you're using Extended Regular Expressions, usually by supplying the -E option to the tool you're using, then you shouldn't escape the braces:
QR[0-9]\.[0-9]_[A-Z0-9]{5}
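The difference between the patterns above can be checked quickly. This is only a sketch in Python rather than a shell tool (Python's re module writes braces unescaped, like -E), and the variable names are made up for illustration:

```python
import re

target = "QR8.4_Z4J25"

# [QR] is a character class: it matches exactly ONE character, Q or R.
one_of = re.compile(r"[QR][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]")
# QR outside brackets matches the two literal characters in order.
literal = re.compile(r"QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]")
# Looser tail: any five uppercase letters or digits after the underscore.
loose = re.compile(r"QR[0-9]\.[0-9]_[A-Z0-9]{5}")

print(one_of.fullmatch(target) is not None)           # False
print(one_of.fullmatch("Q8.4_Z4J25") is not None)     # True
print(literal.fullmatch(target) is not None)          # True
print(literal.fullmatch("QR9.1_8A9YK") is not None)   # False
print(loose.fullmatch("QR9.1_8A9YK") is not None)     # True
```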

Square brackets in regular expressions denote a collection of characters.
[MX_5] will match one character that is M, X, _ or 5.
[0-9] will match one character that is between 0 and 9.
[a-z] will match one character that is between lowercase a and z.
Notice the pattern? The square brackets match a single character. In order to match multiple characters they need to be followed by a + or * or {} to denote how many of those characters it should match.
However, in your case, you just want to match the actual letters QR in that order, so simply don't use square brackets.
QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]
The same goes for characters like the underscore which are always in the same place. Note that the . was escaped with a \ because it has a special meaning in regex.
Going back to matching multiple characters with square brackets, if the order of the last 5 characters doesn't matter, you can further reduce your expression using a single square bracket and a {} to match all your trailing characters after the underscore.
QR[0-9]\.[0-9]_[A-Z0-9]{5}

Regex - Why don't these two expressions produce the same result? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I'm currently using this website to create some regular expressions for a programming language I want to build, at the moment I'm just setting up an expression for identifiers.
In my language, identifiers are expressed like most languages:
They cannot begin with a digit, or special character other than an underscore
After the first character they can contain alphanumeric and underscore characters
Given those rules I've come up with the following expression by myself:
^\D\w+$
Obviously, it doesn't account for special characters, however the following expression does (which I didn't make myself):
^(?!\d)\w+$
Why does the second expression account special characters? Shouldn't they be producing the same results?
I will explain why the second regex works.
The second regex uses a negative lookahead. After matching the start of the string, the engine checks whether the next character is a digit, but it does not consume it. This is important: if the next character is not a digit, the engine then tries to match that same character with \w, which fails if the character is a symbol. If the character is a digit, the negative lookahead fails and nothing is matched.
\D on the other hand, will match the character if it is not a digit, and \w will match whatever comes after that. That means all symbols are accepted.
This ^(?!\d)\w+$ means a string consisting of word characters [a-zA-Z0-9_] that doesn't start with a digit.
This ^\D\w+$ means a non-digit character followed by at least one character from [a-zA-Z0-9_] set.
So #ab01 is matched by ^\D\w+$, while ^(?!\d)\w+$ rejects it.
(?!\d)\w+ means "match a run of word characters that does not start with a digit". Wrapped in ^ and $, it requires the whole string to consist of word characters, with the first one not being a digit, which is obviously not the same as ^\D\w+$. ^(?!\d).\w+$ (note the single "." after the lookahead) would behave the same as ^\D\w+$, apart from \D also accepting a newline as the first character.
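To see where the two expressions from the question disagree, here is a small sketch in Python. The sample strings and variable names are invented for illustration:

```python
import re

non_digit_first = re.compile(r"^\D\w+$")      # any non-digit, then word characters
lookahead_first = re.compile(r"^(?!\d)\w+$")  # word characters only, first not a digit

samples = ["abc1", "_tmp", "#ab01", "1abc", "x"]
results = {
    s: (bool(non_digit_first.search(s)), bool(lookahead_first.search(s)))
    for s in samples
}
for s, pair in results.items():
    print(s, pair)
# abc1 (True, True)
# _tmp (True, True)
# #ab01 (True, False)
# 1abc (False, False)
# x (False, True)
```

The last row shows a second difference: \D\w+ needs at least two characters, while the lookahead consumes nothing, so a one-character identifier still matches.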

Regular Expressions particulars for VSCode Syntax Highlighting [duplicate]

This question already has answers here:
Reference - What does this regex mean?
(1 answer)
What does ?! mean?
(3 answers)
Closed 5 years ago.
I'm trying to write a syntax highlighter for VSCode, which uses the TextMate format. I've got an entry for one-line comments, copied from an example, and it works fine, but I'd like to extend/modify it.
"linecomment": {
"name": "comment",
"match": "(%)(?!(\\[=*\\[|\\]=*\\])).*$\n?",
"captures": {
"1": {
"name": "comment"
}
}
},
The problem is, the regular expressions used here are not documented anywhere that I can find. I understand basic Grep and the theory behind regular expressions, but I have no idea what is going on in ?!(\\[=*\\[|\\]=*\\])).*$\n?. In particular, I don't know which characters are in the regex language, and which are being matched.
Can somebody explain to me:
1. Which regular expression format is used here, and where is it documented?
2. What does the given regex mean, and what are its parts?
I don't know the answer to (1), but the answer to (2) is as follows:
Firstly, if you've only used grep and not other flavours of regex, you should know that there are some syntax differences. In most flavours, for example, \+ is a literal + and + is the quantifier; in grep's default (basic) syntax, + is literal and \+ is the quantifier. And there are other characters where the meaning of \ is reversed in this way.
Secondly, the string literal isn't the same as the string itself, because of backslash-escaping. The string literal looks like this:
"(%)(?!(\\[=*\\[|\\]=*\\])).*$\n?"
while the string itself looks like this:
(%)(?!(\[=*\[|\]=*\])).*$
?
(with a newline character near the end).
Let's look at the following subexpression:
\[=*\[|\]=*\]
At first I thought this was a character class, delimited by \[ and \]. But (a) I don't know of any flavour of regex where backslash-escaped square brackets are character class delimiters and unescaped ones are literal square brackets, rather than vice versa; (b) why would someone write a character class with repeated characters?; (c) there's no obvious reason why the first \] would be a literal ] and the second one would end the character class. So it looks like \[ and \] are literal square brackets.
| means "or" in regexes. It is a low-precedence operator. So this subexpression means either \[=*\[ or \]=*\]. In other words, it matches strings such as [[, [=[, [======[, etc, as well as ]], ]=], etc.
(?!...) is a zero-width assertion. It is a negative lookahead: it matches at any point in the string where the positive lookahead (?=...) would not match. In general, if the regex A matches the string a and C matches the string c, then the regex A(?!B)C matches the string ac, unless the regex B matches at the position where c begins. In other words, the match fails if the string is something like %]==].
.* matches any number of characters. (0 is a number). (I assume this doesn't match newlines.) $ is another zero-width assertion: it can only match at the end of the line. Actually, it's not needed in this case - the .* subexpression is greedy and will match all non-newline characters, so the end of the .* match is guaranteed to be the end of the line. That is, unless there's some edge case I'm not aware of involving carriage returns or some even more exotic line terminating character.
Finally, \n? will match the newline character itself, if it exists (? is a quantifier). If this is the last line of the string then there may not be a newline; in that case the regex match would fail without the ?.
Putting it all together: The regex will match from a % until the end of the line, including the newline character if it exists, unless the string it's trying to match starts with %[[ or %]==] or something similar.
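The two layers (JSON string literal versus the regex itself) can be separated with a short sketch. This uses Python's json and re modules as a stand-in for the editor's regex engine, which is an assumption on my part; the constructs in this pattern behave the same way in Python, but the sample lines and variable names are made up:

```python
import json
import re

# The "match" value exactly as written in the grammar file (a JSON string literal).
json_literal = r'"(%)(?!(\\[=*\\[|\\]=*\\])).*$\n?"'

# Decoding the JSON gives the regex the engine actually sees:
# each \\ becomes a single backslash and \n becomes a real newline.
pattern_text = json.loads(json_literal)
line_comment = re.compile(pattern_text)

for line in ["% plain comment\n", "%[==[ long bracket", "%]==] closing bracket"]:
    m = line_comment.match(line)
    print(repr(line), "->", repr(m.group(0)) if m else None)
```

Only the first line matches, and the match includes the trailing newline; the other two are rejected by the negative lookahead.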

What is the meaning of this regular expression? ['`?!\"-/] [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
What is the meaning of this regular expression?
['`?!\"-/]
Why does it match parentheses?
I am using Java for development.
In your regex
['`?!\"-/]
The quantity "-/ is being interpreted as a range of values, just as A-Z would mean taking every letter between A and Z. It turns out, by reading the basic ASCII table, that parentheses lie within this range, so your pattern is including them.
One trick you can use here with dash is to place it at the end:
['`?!\"/-]
^^^^ this will not be interpreted as a range
Because you didn't escape the dash -. The dash, inside a character class [] denotes a range of characters. In this case from " to /. And parentheses are between those, in ASCII.
The dash needs to be escaped \-, if it's not the first or last character, inside a character class, when you want it to be matched as a literal.
You need to escape the -, otherwise parentheses will match as well.
It seems that "-/ is read as a range that includes the parentheses, just like [A-C] matches the ASCII characters from A to C.
Use the following instead:
[\'`?!\"\-/]
It will match any of the following characters in a string:
'`?!"-/
Check in the regex101
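The range can be listed directly. The sketch below is in Python rather than Java (the character class behaves the same way for this case), and the variable names are invented for illustration:

```python
import re

# Every character in the ASCII range from '"' (34) to '/' (47), inclusive.
in_range = "".join(chr(c) for c in range(ord('"'), ord("/") + 1))
print(in_range)  # "#$%&'()*+,-./

# The regex itself contains a plain " (the backslash in \" belongs to the Java string).
ranged = re.compile(r"""['`?!"-/]""")  # "-/ is read as a range
fixed = re.compile(r"""['`?!"/-]""")   # trailing - is a literal dash

print(bool(ranged.fullmatch("(")))  # True
print(bool(fixed.fullmatch("(")))   # False
print(bool(fixed.fullmatch("-")))   # True
```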

regex + selecting file endings

Here is my regex.
I am trying to capture the files *08.tgz, *09.tgz, and *01.tgz.
This is what I have, but it also captures *10.tgz, due to the 09:
.*\/*[09|8|1].tgz
I know I can do .*\/*[9|8|1].tgz and this will only capture *08.tgz, *09.tgz, and *01.tgz, but what I want to understand is why the 0 captures the 10.tgz file.
data
./backup_public_html_20160308.tgz
./backup_public_html_20160301.tgz
./backup_public_html_20160302.tgz
./backup_public_html_20160306.tgz
./backup_public_html_20160304.tgz
./backup_public_html_20160303.tgz
./backup_public_html_20160307.tgz
./backup_public_html_20160305.tgz
./backup_public_html_20160309.tgz
./backup_public_html_20160310.tgz
[09|8|1] is a character class, trying to match any one of the characters included - so it will match either 0 or 9 or 8 or 1 or |
You might be looking for 0[189] matching 0 followed by either 1 or 8 or 9
I would be explicit and use
.*\/*(08|09|01).tgz
Let's look at this part of your regex where the actual matching of number is taking place.
[09|8|1] says: match one character that is
0, or
9, or
|, or
8, or
1
Now you are thinking it's matching 10.tgz, but it's actually matching 0.tgz.
And when you change it to [9|8|1], it says: match one character that is
9, or
|, or
8, or
1
Now 0.tgz won't match.
You are misusing the character class as a group. Your regex .*\/*[09|8|1].tgz matches zero or more characters other than a newline (with .*), as many as possible (since * is a greedy quantifier), followed by zero or more / symbols, then one symbol from the character class [09|8|1] (that is, either 0, 9, |, 8, or 1), followed by any character but a newline (since an unescaped . matches any character but a newline), and then tgz.
For more details on how character classes work, see Character classes or Character Sets:
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey.
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*]. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
To capture the files *08.tgz, *09.tgz, and *01.tgz, use
.*0[981]\.tgz
OR
^.*0[981]\.tgz$
See the regex demo. The ^ is a start of string anchor and $ is an end of string anchor, and thus, the ^.*0[981]\.tgz$ pattern will require a full string match.
NOTE: To match a literal . you need to escape it or place it (yes) into a character class, as . loses its special meaning inside it and just denotes a literal dot there.
You've confused character class and an alternation.
Try this:
.*0(9|8|1)\.tgz
Or more simply:
.*0[981]\.tgz
Note that I also repaired other parts of your regex: the stray \/* is gone and the dot before tgz is escaped.
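Running both patterns over the file list from the question shows the difference. This is a sketch in Python (the question does not say which tool is used), with variable names chosen for illustration:

```python
import re

files = [
    "./backup_public_html_20160308.tgz",
    "./backup_public_html_20160301.tgz",
    "./backup_public_html_20160302.tgz",
    "./backup_public_html_20160306.tgz",
    "./backup_public_html_20160304.tgz",
    "./backup_public_html_20160303.tgz",
    "./backup_public_html_20160307.tgz",
    "./backup_public_html_20160305.tgz",
    "./backup_public_html_20160309.tgz",
    "./backup_public_html_20160310.tgz",
]

original = re.compile(r".*\/*[09|8|1].tgz")  # class: one of 0, 9, |, 8, 1
fixed = re.compile(r".*0[981]\.tgz")         # literal 0, then one of 9, 8, 1

# Keep only the last two digits of each matching file name.
original_hits = [f[-6:-4] for f in files if original.fullmatch(f)]
fixed_hits = [f[-6:-4] for f in files if fixed.fullmatch(f)]

print(original_hits)  # ['08', '01', '09', '10']
print(fixed_hits)     # ['08', '01', '09']
```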

Regular expressions, can I exclude pairs of characters?

How do you exclude pairs of characters from a regular expression?
I am trying to get a regular expression that will have 5 alphanumeric characters followed by
anything except "XX" and "AD", followed by XX.
So
D22D0ACXX
will match, but the following two will not match
D22D0ADXX
D22D0XXXX.
My first attempt was :
([A-Z0-9]{5}[^(?AD)|(?XX)]XX)
But this treats the character classes part [^(?AD)|(?XX)] as one character, so I end up with the last 8 characters, not all 9.
Can I exclude pairs of characters without getting into back references?
I need to capture the whole group, hence the outer parenthesis. The negative lookahead suggestions don't seem to do this.
Use negative lookahead:
([A-Z0-9]{5}(?!(AD|XX)XX).{4})
Don't treat it as a character class, instead, think of it as an alternation with a negative lookahead, e.g:
([A-Z0-9]{5}(?!(AD|XX)XX))
Then, if you need the tail, include it after the lookahead, e.g.:
([A-Z0-9]{5}(?!(AD|XX)XX)[A-Z0-9]{4})
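As a check that the outer group really captures all nine characters, here is a sketch in Python using the last pattern above against the three strings from the question. The variable names are invented, and fullmatch is my choice for whole-string matching:

```python
import re

# Five alphanumerics, then four more, provided the four are not AD or XX followed by XX.
pattern = re.compile(r"([A-Z0-9]{5}(?!(AD|XX)XX)[A-Z0-9]{4})")

samples = ["D22D0ACXX", "D22D0ADXX", "D22D0XXXX"]
captured = {}
for s in samples:
    m = pattern.fullmatch(s)
    # Group 1 is the outer parenthesis; the lookahead itself consumes nothing.
    captured[s] = m.group(1) if m else None

print(captured)
# {'D22D0ACXX': 'D22D0ACXX', 'D22D0ADXX': None, 'D22D0XXXX': None}
```

Note that this pattern only rules out AD and XX in front of XX; it does not by itself require the last two characters to be XX.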