This question already has answers here:
Reference - What does this regex mean?
(1 answer)
What does ?! mean?
(3 answers)
Closed 5 years ago.
I'm trying to write a syntax highlighter for VSCode, which uses the TextMate format. I've got an entry for one-line comments, copied from an example, and it works fine, but I'd like to extend/modify it.
"linecomment": {
    "name": "comment",
    "match": "(%)(?!(\\[=*\\[|\\]=*\\])).*$\n?",
    "captures": {
        "1": {
            "name": "comment"
        }
    }
},
The problem is, the regular expressions used here are not documented anywhere that I can find. I understand basic Grep and the theory behind regular expressions, but I have no idea what is going on in ?!(\\[=*\\[|\\]=*\\])).*$\n?. In particular, I don't know which characters are in the regex language, and which are being matched.
Can somebody explain to me:
1. Which regular expression format is used here, and where is it documented?
2. What does the given regex mean, and what are its parts?
I don't know the answer to (1), but the answer to (2) is as follows:
Firstly, if you've only used grep and not other flavours of regex, you should know that there are some syntax differences. In most flavours, for example, \+ is a literal + and + is the quantifier; in grep's default basic syntax (at least in GNU grep), + is literal and \+ is the quantifier. There are other characters where the meaning of \ is reversed in this way.
Secondly, the string literal isn't the same as the string itself, because of backslash-escaping. The string literal looks like this:
"(%)(?!(\\[=*\\[|\\]=*\\])).*$\n?"
while the string itself looks like this:
(%)(?!(\[=*\[|\]=*\])).*$
?
(with a newline character near the end).
Let's look at the following subexpression:
\[=*\[|\]=*\]
At first I thought this was a character class, delimited by \[ and \]. But (a) I don't know of any flavour of regex where backslash-escaped square brackets are character class delimiters and unescaped ones are literal square brackets, rather than vice versa; (b) why would someone write a character class with repeated characters?; (c) there's no obvious reason why the first \] would be a literal ] and the second one would end the character class. So it looks like \[ and \] are literal square brackets.
| means "or" in regexes. It is a low-precedence operator. So this subexpression means either \[=*\[ or \]=*\]. In other words, it matches strings such as [[, [=[, [======[, etc, as well as ]], ]=], etc.
(?!...) is a zero-width assertion. It is a negative lookahead: it succeeds at any point in the string where the positive lookahead (?=...) would fail. In general, if the regex A matches the string a and C matches the string c, then the regex A(?!B)C matches the string ac, unless B matches starting at the position right after a (that is, at the beginning of c). In other words, the match fails if the string is something like %]==].
.* matches any number of characters. (0 is a number). (I assume this doesn't match newlines.) $ is another zero-width assertion: it can only match at the end of the line. Actually, it's not needed in this case - the .* subexpression is greedy and will match all non-newline characters, so the end of the .* match is guaranteed to be the end of the line. That is, unless there's some edge case I'm not aware of involving carriage returns or some even more exotic line terminating character.
Finally, \n? will match the newline character itself, if it exists (? is a quantifier). If this is the last line of the string then there may not be a newline; in that case the regex match would fail without the ?.
Putting it all together: The regex will match from a % until the end of the line, including the newline character if it exists, unless the string it's trying to match starts with %[[ or %]==] or something similar.
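The summary above can be checked with a small sketch. It uses Python's re module as a stand-in, which is an assumption: TextMate grammars run on a different engine, although the constructs used in this pattern behave the same way in both as far as I know. The helper name comment_match and the sample lines are made up for illustration.

```python
import re

# The pattern after JSON unescaping: "\\[" in the grammar file becomes "\[" here.
LINE_COMMENT = re.compile(r'(%)(?!(\[=*\[|\]=*\])).*$\n?')

def comment_match(line):
    """Return the text matched at the start of `line`, or None (hypothetical helper)."""
    m = LINE_COMMENT.match(line)
    return m.group(0) if m else None

samples = [
    "% plain comment\n",     # ordinary comment, newline included in the match
    "%[[ opens a block",     # rejected by the negative lookahead
    "%]==] closes a block",  # rejected by the negative lookahead
    "%[ single bracket",     # only one bracket, so the lookahead does not reject it
]
for s in samples:
    print(repr(s), "->", repr(comment_match(s)))
```

The two bracketed samples return None because the lookahead finds [[ or ]==] right after the %.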
I'm currently using this website to create some regular expressions for a programming language I want to build. At the moment I'm just setting up an expression for identifiers.
In my language, identifiers are expressed like most languages:
They cannot begin with a digit, or special character other than an underscore
After the first character they can contain alphanumeric and underscore characters
Given those rules I've come up with the following expression by myself:
^\D\w+$
Obviously, it doesn't account for special characters, however the following expression does (which I didn't make myself):
^(?!\d)\w+$
Why does the second expression account for special characters? Shouldn't they be producing the same results?
I will explain why the second regex works.
The second regex uses a lookahead. After matching the start of the string, the engine checks whether the next character is a digit, but it does not consume it. This is important: if the next character is not a digit, the engine then uses \w to match that same character, which fails if the character is a symbol. If it is a digit, the negative lookahead fails and nothing is matched.
\D on the other hand, will match the character if it is not a digit, and \w will match whatever comes after that. That means all symbols are accepted.
^(?!\d)\w+$ matches a string consisting of word characters [a-zA-Z0-9_] that doesn't start with a digit.
^\D\w+$ matches a non-digit character followed by at least one character from the [a-zA-Z0-9_] set.
So #ab01 is matched by ^\D\w+$, while ^(?!\d)\w+$ rejects it.
(?!\d)\w+ means "match a run of word characters that does not begin with a digit". A lookahead inspects the character at the current position, so ^(?!\d)\w+$ is not the same as ^\w+$: the latter accepts 1abc and the former does not. It is also not the same as ^\D\w+$. ^(?!\d).\w+$ (note the "." in the middle) would behave the same as ^\D\w+$, apart from how . treats newlines.
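To see the difference between the two expressions from the question, here is a minimal sketch in Python. The re.ASCII flag is my own choice so that \w means [a-zA-Z0-9_] as described above, and the names accepts, NEGATED_CLASS and LOOKAHEAD are made up for illustration.

```python
import re

# re.ASCII restricts \w and \d to ASCII, matching the [a-zA-Z0-9_] description.
NEGATED_CLASS = re.compile(r'^\D\w+$', re.ASCII)
LOOKAHEAD = re.compile(r'^(?!\d)\w+$', re.ASCII)

def accepts(pattern, text):
    """True if the compiled pattern matches the whole candidate identifier."""
    return pattern.search(text) is not None

for candidate in ["abc1", "_x", "#ab01", "1abc", "x"]:
    print(candidate, accepts(NEGATED_CLASS, candidate), accepts(LOOKAHEAD, candidate))
```

Two cases separate the patterns: #ab01 is accepted only by ^\D\w+$, because \D consumes the #, and the one-character identifier x is accepted only by ^(?!\d)\w+$, because \D\w+ needs at least two characters.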
I want to write a regular expression for QR8.4_Z4J25 in a shell script. How can I do it?
Is this correct?
[QR][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]
It's wrong because [QR] matches a single character, so the pattern as a whole will only match Q8.4_Z4J25 or R8.4_Z4J25, but not QR8.4_Z4J25.
A bracket matches any one character specified, so you'd like to write:
[Q][R][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]
You don't need to use brackets for a single character, though, so it can be simplified to
QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]
Be sure to escape the dot if it's outside of a bracket because it would otherwise match any single character.
In case you want to match QR9.1_8A9YK as well, you should change it to
QR[0-9]\.[0-9]_[A-Z0-9]\{5\}
If you're using Extended Regular Expressions, usually by supplying the option -E to the tool you're using, then you shouldn't escape the braces:
QR[0-9]\.[0-9]_[A-Z0-9]{5}
Square brackets in regular expressions denote a collection of characters.
[MX_5] will match one character that is M, X, _ or 5.
[0-9] will match one character that is between 0 and 9.
[a-z] will match one character that is between lowercase a and z.
Notice the pattern? The square brackets match a single character. In order to match multiple characters they need to be followed by a + or * or {} to denote how many of those characters it should match.
However, in your case, you just want to match the actual letters QR in that order, so simply don't use square brackets.
QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]
The same goes for characters like the underscore which are always in the same place. Note that the . was escaped with a \ because it has a special meaning in regex.
Going back to matching multiple characters with square brackets, if the order of the last 5 characters doesn't matter, you can further reduce your expression using a single square bracket and a {} to match all your trailing characters after the underscore.
QR[0-9]\.[0-9]_[A-Z0-9]{5}
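A quick way to compare the patterns discussed above is to try them on a few inputs. This sketch uses Python rather than a shell tool, which is my own substitution, and it uses fullmatch so that each pattern must cover the whole string; an unanchored search such as grep would also accept a match inside a longer line.

```python
import re

ORIGINAL = re.compile(r'[QR][0-9][.][0-9][_][A-Z][0-9][A-Z][0-9][0-9]')
STRICT = re.compile(r'QR[0-9]\.[0-9]_[A-Z][0-9][A-Z][0-9][0-9]')
LOOSE = re.compile(r'QR[0-9]\.[0-9]_[A-Z0-9]{5}')

def whole(pattern, text):
    """True if the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None

for text in ["QR8.4_Z4J25", "Q8.4_Z4J25", "QR9.1_8A9YK", "QR8x4_Z4J25"]:
    print(text, whole(ORIGINAL, text), whole(STRICT, text), whole(LOOSE, text))
```

QR8x4_Z4J25 is rejected by all three because the dot is either escaped or inside brackets, so it only matches a literal period.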
Let's say I have a string in which I wanted to parse from an opening double-quote to a closing double-quote:
asdf"pass\"word"asdf
I was lucky enough to discover that the following PCRE would match from the opening double-quote to the closing double-quote while ignoring the escaped double-quote in the middle (to properly parse the logical unit):
".*?(?:(?!\\").)"
Match:
"pass\"word"
However, I have no idea why this PCRE matches the opening and closing double-quote properly.
I know the following:
" = literal double-quote
.*? = lazy matching of zero or more of any character
(?: = opening of non-capturing group
(?!\\") = asserts that a literal \" cannot be matched at this position
. = single character
) = closing of non-capturing group
" = literal double-quote
It appears that a single character and a negative lookahead are a part of the same logical group. To me, this means the PCRE is saying "Match from a double-quote to zero or more of any character as long as there is no \" right after the character, then match one more character and one single double quote."
However, according to that logic the PCRE would not match the string at all.
Could someone help me wrap my head around this?
It's easier to understand if you change the non-capture group to be a capture group.
Lazy matching generally moves forward one character at a time (vs. greedy matching, which takes everything it can and then gives back what it must). But it moves forward as far as needed to satisfy the rest of the pattern. Here that means letting .*? match everything up to and including the r, then letting the negative lookahead plus . match the d.
Update: you asked in comment:
how come it matches up to the r at all? shouldn't the negative
lookahead prevent it from getting past the \" in the string? thanks
for helping me understand, by the way
No, because it is not the negative lookahead stuff that is matching it. That is why I suggested you change the non-captured group into a captured group, so that you can see it is .*? that matches the \", not (?:(?!\\").)
.*? has the potential to match the entire string, and the regex engine uses that to satisfy the requirement to match the rest of the pattern.
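The suggestion to turn the group into a capture group can be tried directly. This is a minimal sketch in Python rather than PCRE, which is my own substitution; I have also put .*? in a capture group of its own so that both parts are visible.

```python
import re

text = r'asdf"pass\"word"asdf'  # raw string: the backslash is kept as written

# Group 1 is the lazy .*?, group 2 is the single character guarded by the lookahead.
m = re.search(r'"(.*?)((?!\\").)"', text)

print(m.group(0))  # the whole match, quotes included
print(m.group(1))  # what .*? matched
print(m.group(2))  # what the lookahead-guarded . matched
```

Group 1 contains the escaped quote, and group 2 is only the final d, which shows that the lookahead never had to step over the \" itself.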
Update 2:
It is effectively the same as doing this: ".*?[^\\]" which is probably a lot easier to wrap your head around.
A (slightly) better pattern would be to use a negative lookbehind like so: ".*?(?<!\\)" because it will allow for an empty string "" to be matched (a valid match in many contexts), but negative lookbehinds aren't supported in all engines/languages (from your tags, PCRE supports them, but I don't think you can really do this in bash except with e.g. grep -P '[pattern]', which uses the PCRE library instead of the POSIX syntax).
Nothing to add to Crayon Violent's explanation, only a little disambiguation and some ways to match substrings enclosed between double quotes (possibly with quotes escaped by a backslash inside).
First, it seems that in your question you use the acronym "PCRE" (Perl Compatible Regular Expressions) in place of the word "pattern". PCRE is the name of a particular regex engine (and by extension, somewhat imprecisely, of its syntax), whereas a pattern is the regular expression that describes a set of strings, whatever regex engine is used.
With Bash:
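A='asdf"pass\"word"asdf'
# opening quote, then non-quote/non-backslash characters or backslash escapes, then closing quote
pattern='"(([^"\\]|\\.)*)"'
[[ $A =~ $pattern ]]
# group 1 is the content between the quotes
echo "${BASH_REMATCH[1]}"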
You can use this pattern too: pattern='"(([^"\\]+|\\.)*)"'
With a PCRE regex engine, you can use the first pattern, but it's better to rewrite it in a more efficient way:
"([^"\\]*+(?:\\.[^"\\]*)*+)"
Note that these three patterns don't need any lookaround. They are able to deal with any number of consecutive backslashes: "abc\\\"def" (a literal backslash and an escaped quote), "abcdef\\\\" (two literal backslashes, the quote is not escaped).
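The lookaround-free approach can be sketched in Python as well. This is an illustration rather than the PCRE version: it uses the plain alternation form without possessive quantifiers, because those are not available in every Python release, and the helper name first_quoted is made up.

```python
import re

# Opening quote, then any run of "ordinary character" or "backslash + any character",
# then the closing quote. Group 1 is the content between the quotes.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

def first_quoted(text):
    """Return the content of the first double-quoted substring, or None."""
    m = QUOTED.search(text)
    return m.group(1) if m else None

print(first_quoted(r'asdf"pass\"word"asdf'))
print(first_quoted(r'x "abc\\\"def" y'))
print(first_quoted(r'"abcdef\\\\" tail'))
```

Because a backslash is always consumed together with the character after it, an even number of backslashes before a quote leaves that quote free to close the string.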
I have this regular expression
([A-Z], )*
which should match something like
test, (with a space after the comma)
How do I change the regex so that if there are any characters after the space then it doesn't match?
For example if I had:
test, test
I'm looking to do something similar to
([A-Z], ~[A-Z])*
Cheers
Use the following regular expression:
^[A-Za-z]*, $
Explanation:
^ matches the start of the string.
[A-Za-z]* matches 0 or more letters (upper or lower case) -- replace * with + to require 1 or more letters.
, matches a comma followed by a space.
$ matches the end of the string, so if there's anything after the comma and space then the match will fail.
As has been mentioned, you should specify which language you're using when you ask a Regex question, since there are many different varieties that have their own idiosyncrasies.
^([A-Z]+, )?$
The difference between mine and Donut's is that his will match , and fail for the empty string, while mine will match the empty string and fail for ,. His also accepts both upper and lower case; with mine you'll have to enable case-insensitivity in the options of your regex function, but it follows your example.
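The differences between the two suggested patterns can be checked with a short sketch. Python is my own choice here, since the question does not say which language is in use, and the names ANY_CASE and OPTIONAL_UPPER are made up.

```python
import re

ANY_CASE = re.compile(r'^[A-Za-z]*, $')
OPTIONAL_UPPER = re.compile(r'^([A-Z]+, )?$')
# Same pattern with case-insensitivity switched on through the regex options.
OPTIONAL_ANY_CASE = re.compile(r'^([A-Z]+, )?$', re.IGNORECASE)

for text in ["test, ", "TEST, ", "test, test", ", ", ""]:
    print(repr(text),
          bool(ANY_CASE.search(text)),
          bool(OPTIONAL_UPPER.search(text)),
          bool(OPTIONAL_ANY_CASE.search(text)))
```

All three reject "test, test", because $ requires the string to end right after the comma and space.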
I am not sure which regex engine/language you are using, but there is often something like a negated character class, e.g. [^a-z], meaning "any character other than a lowercase letter".
While searching for something, I found an answered question on this site. Two of the answers contain
/([^.]*)\.(.*)/
in their code.
The question is located at Find & replace jquery. I'm a newbie in JavaScript, so I wonder, what does it mean? Thanks.
/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*) The parentheses group a part of the pattern together and capture the text it matches. The square brackets denote a "character group", meaning that any character inside this group is accepted in its place. However, this group is negated by the first character being ^, which reverses its meaning. Since the only character beside the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the square brackets can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new parenthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as needed.
In summary, the expression finds any sequence of characters that are not periods, followed by a single period, followed by any sequence of characters.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.
It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.
Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : match all characters except . (period)
[2] \. : match a period
[3] (.*) : match any characters
so it becomes
[1] Match all characters which are not . (period) [2] till you find a . (period), then [3] match all remaining characters.
Anything except a dot, followed by a dot, followed by anything.
You can test regexes on regexpal
It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.
That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".
It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).
This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar
This regular expression generates two matching expressions that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contain at least one dot. It will match "." contrary to some of the other answers, but the retrieved groups will be empty.
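The examples above can be reproduced with a short sketch. It uses Python instead of JavaScript, which is my own substitution; the pattern itself is unchanged, and the helper name split_at_first_dot is made up.

```python
import re

SPLIT_AT_FIRST_DOT = re.compile(r'([^.]*)\.(.*)')

def split_at_first_dot(text):
    """Return (before, after) around the first dot, or None if there is no dot."""
    m = SPLIT_AT_FIRST_DOT.search(text)
    return m.groups() if m else None

for text in ["file.ext", "foo...bar", ".foobar", ".", "nodot"]:
    print(repr(text), "->", split_at_first_dot(text))
```

An input with no dot at all gives None, while "." on its own matches with two empty groups.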
IMO /.*\..*/g would match the same text (though without the capture groups).
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]
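The overall matches are the same, but the capture groups are where the two forms differ. The sketch below is in Python rather than JavaScript, and the second pattern, (.*)\.(.*), is a hypothetical variant I added by putting parentheses around the greedy form; it is not taken from any of the answers.

```python
import re

text = 'I am test. Food is good.'

negated = re.search(r'([^.]*)\.(.*)', text)  # first group stops at the first dot
greedy = re.search(r'(.*)\.(.*)', text)      # first group runs up to the last dot

print(negated.group(0) == greedy.group(0))
print(negated.groups())
print(greedy.groups())
```

With [^.]* the first group cannot cross a period, so the split happens at the first dot; with the greedy .* the engine backtracks from the end of the string, so the split happens at the last one.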
the . character matches any character except the line break characters \r or \n.
the ^ at the start of square brackets negates the character class (in this case, a class containing only the dot)
the * means "zero or more times"
the parentheses group and capture,
the \ allows you to match a special character (like the dot or the star)
so this ([^.]*) means any character other than a literal dot, repeated zero or more times (inside square brackets the dot loses its special meaning).
this (.*) part means any string of characters zero or more times (except the line breaks)
and the \. means a real dot
so the whole thing would match zero or more non-dot characters followed by a dot followed by any number of characters.
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html
It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.