How does the following regex work? - regex

Let's say I have a string in which I wanted to parse from an opening double-quote to a closing double-quote:
asdf"pass\"word"asdf
I was lucky enough to discover that the following PCRE would match from the opening double-quote to the closing double-quote while ignoring the escaped double-quote in the middle (to properly parse the logical unit):
".*?(?:(?!\\").)"
Match:
"pass\"word"
However, I have no idea why this PCRE matches the opening and closing double-quote properly.
I know the following:
" = literal double-quote
.*? = lazy matching of zero or more of any character
(?: = opening of non-capturing group
(?!\\") = asserts that it's impossible to match a literal \" at this position
. = single character
) = closing of non-capturing group
" = literal double-quote
It appears that a single character and a negative lookahead are a part of the same logical group. To me, this means the PCRE is saying "Match from a double-quote to zero or more of any character as long as there is no \" right after the character, then match one more character and one single double quote."
However, according to that logic the PCRE would not match the string at all.
Could someone help me wrap my head around this?

It's easier to understand if you change the non-capture group to be a capture group.
Lazy matching generally moves forward one character at a time (vs. greedy matching, which takes everything it can and then gives back what it must). But it keeps moving forward until the rest of the pattern can be satisfied. Here that means letting the .*? match everything up to the r, then letting the negative lookahead + . match the d, followed by the closing double-quote.
Update: you asked in comment:
how come it matches up to the r at all? shouldn't the negative
lookahead prevent it from getting passed the \" in the string? thanks
for helpin me understand, by the way
No, because it is not the negative lookahead stuff that is matching it. That is why I suggested you change the non-captured group into a captured group, so that you can see it is .*? that matches the \", not (?:(?!\\").)
.*? has the potential to match the entire string, and the regex engine uses that to satisfy the requirement to match the rest of the pattern.
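To make that split visible, here is a minimal JavaScript sketch (JavaScript is used only because its lookahead syntax behaves the same way for this pattern; the variable names are illustrative). It turns both the lazy part and the lookahead group into capture groups:

```javascript
// The sample string from the question: asdf"pass\"word"asdf
const input = String.raw`asdf"pass\"word"asdf`;

// Same pattern as in the question, but with both parts captured
// so we can see which part consumed which characters.
const re = /"(.*?)((?!\\").)"/;
const m = input.match(re);

console.log(m[0]); // "pass\"word"   (the whole match)
console.log(m[1]); // pass\"wor      (consumed by the lazy .*?)
console.log(m[2]); // d              (consumed by (?!\\"). )
```

The escaped quote ends up inside group 1, so the lookahead never gets a chance to reject it.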
Update 2:
It is effectively the same as doing this: ".*?[^\\]" which is probably a lot easier to wrap your head around.
A (slightly) better pattern would be to use a negative lookbehind like so: ".*?(?<!\\)" because it will allow for an empty string "" to be matched (a valid match in many contexts), but negative lookbehinds aren't supported in all engines/languages (from your tags, pcre supports it, but I don't think you can really do this in bash except e.g. grep -P '[pattern]' .. which basically runs it through perl).
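A small JavaScript sketch of the difference between the two variants, assuming an engine that supports lookbehind (ES2018 or later, e.g. a current Node.js); the variable names are made up for illustration:

```javascript
const withClass = /".*?[^\\]"/;        // needs at least one character between the quotes
const withLookbehind = /".*?(?<!\\)"/; // only requires that the closing quote is not preceded by a backslash

const sample = String.raw`asdf"pass\"word"asdf`;
const emptyQuotes = 'say ""';

console.log(sample.match(withClass)[0]);           // "pass\"word"
console.log(sample.match(withLookbehind)[0]);      // "pass\"word"
console.log(emptyQuotes.match(withClass));         // null
console.log(emptyQuotes.match(withLookbehind)[0]); // ""
```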

Nothing to add to Crayon Violent's explanation, only a little disambiguation and some ways to match substrings enclosed between double quotes (possibly containing quotes escaped by a backslash).
First, it seems that in your question you use the acronym "PCRE" (Perl Compatible Regular Expressions), which is the name of a particular regex engine (and by extension, somewhat imprecisely, of its syntax), in place of the word "pattern", which is the regular expression that describes a set of strings (whatever the regex engine used).
With Bash:
A='asdf"pass\"word"asdf'
pattern='"(([^"\\]|\\.)*)"'
[[ $A =~ $pattern ]]
echo ${BASH_REMATCH[1]}
You can use this pattern too: pattern='"(([^"\\]+|\\.)*)"'
With a PCRE regex engine, you can use the first pattern, but it's better to rewrite it in a more efficient way (an "unrolled" version with possessive quantifiers):
"([^"\\]*+(?:\\.[^"\\]*)*+)"
Note that these three patterns don't need any lookaround. They are able to deal with any number of consecutive backslashes: "abc\\\"def" (a literal backslash and an escaped quote), "abcdef\\\\" (two literal backslashes, the quote is not escaped).
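As a rough check of the backslash handling, here is a JavaScript sketch. JavaScript has no possessive quantifiers, so it uses the first (Bash) pattern with the inner group made non-capturing; the sample strings and variable names are only illustrative:

```javascript
// "atom" approach: either a character that is not a quote or backslash,
// or a backslash followed by any character.
const quoted = /"((?:[^"\\]|\\.)*)"/;

const samples = [
  String.raw`asdf"pass\"word"asdf`, // escaped quote in the middle
  String.raw`"abc\\\"def"`,         // literal backslash + escaped quote
  String.raw`"abcdef\\\\" tail`,    // two literal backslashes, then the closing quote
];

const contents = samples.map((s) => s.match(quoted)[1]);
contents.forEach((c) => console.log(c));
// pass\"word
// abc\\\"def
// abcdef\\\\
```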

Related

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm.
You flipped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use for example one of:
Regex: \b([a-zA-Z0-9]+)-
Explanation: \b is a word boundary to prevent a partial word match; the value is in capture group 1.
Regex: \b[a-zA-Z0-9]+(?=-)
Explanation: Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-).
To get t3rm in sh0rt-t3rm you might use for example one of:
Regex: -([a-zA-Z0-9]+)\b
Explanation: The other way around, with a leading -; get the value from capture group 1.
Regex: -\K[a-zA-Z0-9]+\b
Explanation: Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
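For a quick way to try these outside Perl, here is a JavaScript sketch of the same ideas. JavaScript has no \K, so the last line uses a lookbehind (?<=-) as a stand-in; that substitution and the variable names are assumptions of this sketch, not part of the patterns above:

```javascript
const word = 'sh0rt-t3rm';

// First part: capture group, or a lookahead so the hyphen is not part of the match.
const first = word.match(/\b([a-zA-Z0-9]+)-/)[1];
const firstLookahead = word.match(/\b[a-zA-Z0-9]+(?=-)/)[0];

// Second part: capture group, or a lookbehind in place of \K.
const second = word.match(/-([a-zA-Z0-9]+)\b/)[1];
const secondLookbehind = word.match(/(?<=-)[a-zA-Z0-9]+\b/)[0];

console.log(first, firstLookahead, second, secondLookbehind);
// sh0rt sh0rt t3rm t3rm
```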
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side so that the regex runs in list context, because that's when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
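The effect of the greedy .* is easy to see side by side. This is a JavaScript sketch of the same two patterns (the input string is a made-up example with two hyphens):

```javascript
const s = 'long-sh0rt-t3rm';

// Without a leading .*, the match starts at the first hyphen.
const afterFirst = s.match(/-(.+)/)[1];

// With a greedy .*, the engine backtracks only as far as the last hyphen.
const afterLast = s.match(/.*-(.+)/)[1];

console.log(afterFirst); // sh0rt-t3rm
console.log(afterLast);  // t3rm
```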
There are of course many other variations on what the data may look like, other than just being one hyphenated word. In particular, if it's a part of a text, you may want to include word boundaries:
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.

Match asterisk followed by space in PCRE

I'm just having trouble figuring out how to regex properly. What I need is to match an asterisk followed by a space followed by any amount of characters that aren't \n. (Similar to reddit list formatting)
Example:
* Test
* Test2
* Test3
The closest I got was this, but it wasn't working.
/^[*][ ](.*?)/s
Can anyone familiar with PCRE help me.
You should not use a lazy dot pattern at the end of the regex, because it will never consume a single char: the regex engine first tries to skip it, and since there is nothing left to match after it, .*? settles for the empty string.
Use the greedy dot pattern:
^\* (.*)
See the regex demo
Other notes: you may use \h to match any horizontal whitespace instead of the regular space in the pattern. To make ^ match at the start of each line, use the m modifier. Only use the s modifier if you need . to match any chars including a newline (and carriage return, depending on the PCRE verbs that are active).
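A short JavaScript sketch of the lazy versus greedy behaviour described above. JavaScript has no \h, so a plain space is used, and the m flag plays the role of the m modifier; the variable names are illustrative:

```javascript
const text = '* Test\n* Test2\n* Test3';

// Lazy dot at the end of the pattern: the group is always empty.
const lazy = text.match(/^\* (.*?)/m)[1];

// Greedy dot: the group runs to the end of each line
// (without the s flag, . does not match a newline).
const items = [...text.matchAll(/^\* (.*)/gm)].map((m) => m[1]);

console.log(JSON.stringify(lazy));  // ""
console.log(JSON.stringify(items)); // ["Test","Test2","Test3"]
```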

How to add quotes for strings and null values between commas by using regex

Maybe it's a trivial question, but I have a problem with it. I have the following string:
,a1a,1a1,11,,aaa,,,a,84.34,"",ssd
I want to achieve following effect by using regex:
"","a1a","1a1",11,"","aaa","","","a",84.34,"","ssd"
So I want everything between commas to be surrounded by quotes, except integers and floating point numbers. How can I do this using regex?
(*SKIP)(*F) Magic
In the demo, have a look at the replacements at the bottom.
This is a great task for preg_replace, because PCRE (the regex engine used by PHP) has a beautiful feature to skip certain content.
You can do it in one step with this lovely regex (see demo):
((?<=^|,)\d+(?:\.\d+)?(?:(?=,)|$)(*SKIP)(*F)|(?<=^|,)[^,]*(?:(?=,)|$))
Explanation
The outside parentheses capture everything to Group 1.
There are two parts to the regex, on each side of the | OR
The left side of the alternation | uses \d+(?:\.\d+)? to match these floats and integers you don't want. We use the lookbehind (?<=^|,) to make sure there is a comma behind (or the beginning of the string), and the (?:(?=,)|$) to check that what follows is a comma or the end of the string. After matching, we deliberately fail, after which the engine skips to the next position in the string.
The right side uses [^,]* to match anything that is not a comma, including an empty string, and we know it is the right content because it was not matched by the expression on the left. Again, we use lookarounds to check our position.
The replacement string '"\1"' embeds our match into double quotes.
How to use it in code:
$regex = "~((?<=^|,)\d+(?:\.\d+)?(?:(?=,)|$)(*SKIP)(*F)|(?<=^|,)[^,]*(?:(?=,)|$))~";
$replaced = preg_replace($regex,'"\1"',$string);
Here's another variant:
$regex = '/(?<![^,])(?!"[^"]*")(?![-+]?[0-9]*\.?[0-9]+\b)[^,]*+(?![^,])/';
$result = preg_replace($regex, '"$0"', $subject);
In somewhat more readable form:
(?<![^,])
(?!
"[^"]*"
|
[-+]?[0-9]*\.?[0-9]+\b
)
[^,]*+
(?![^,])
The major points of interest are:
The negative lookbehind (?<![^,]) to match the leading delimiter (or absence thereof). You can read it as if there's a character before this position, it must not be non-comma. It isn't always possible to use this idiom, but I like it because it feels less clumsy than the more common (?<=^|,), and it doesn't waste a capturing group like the (^|,) idiom.
The negative lookahead (?![^,]) similarly acts as the ending anchor.
In the lookahead to prevent it matching already-quoted fields, I'm assuming I don't have to worry about escaped quotes. Those are easy enough to deal with, but first you need to know whether it uses backslashes ("a\"b\"c") or quotes ("a""b""c") to escape them.
The negative lookahead to prevent it matching a number uses a regex from RegexBuddy's library, and it's the loosest of several such regexes. If you need something more precise, it's available.
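The (?<![^,]) and (?![^,]) delimiter idiom can also be tried in JavaScript. This is only a sketch under stated assumptions: JavaScript has neither possessive quantifiers nor (*SKIP)(*F), so here the regex just isolates each field and a callback (a swapped-in technique, not the approach of either answer) decides whether to quote it. The helper names isNumber and isQuoted are hypothetical:

```javascript
const line = ',a1a,1a1,11,,aaa,,,a,84.34,"",ssd';

// A field: not preceded by a non-comma, not followed by a non-comma.
const field = /(?<![^,])[^,]*(?![^,])/g;

// Hypothetical helpers: leave numbers and already-quoted fields alone.
const isNumber = (f) => /^[-+]?[0-9]*\.?[0-9]+$/.test(f);
const isQuoted = (f) => /^"[^"]*"$/.test(f);

const quotedLine = line.replace(field, (f) =>
  isNumber(f) || isQuoted(f) ? f : `"${f}"`
);

console.log(quotedLine);
// "","a1a","1a1",11,"","aaa","","","a",84.34,"","ssd"
```

Like the lookahead in the pattern above, isQuoted assumes there are no escaped quotes inside a quoted field.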

Nested regex lookahead and lookbehind

I am having problems with the nested '+'/'-' lookahead/lookbehind in regex.
Let's say that I want to change the '*' in a string with '%' and let's say that '\' escapes the next character. (Turning a regex to sql like command ^^).
So the string
'*test*' should be changed to '%test%',
'\\*test\\*' -> '\\%test\\%', but
'\*test\*' and '\\\*test\\\*' should stay the same.
I tried:
(?<!\\)(?=\\\\)*\* but this doesn't work
(?<!\\)((?=\\\\)*\*) ...
(?<!\\(?=\\\\)*)\* ...
(?=(?<!\\)(?=\\\\)*)\* ...
What is the correct regex that will match the '*'s in examples given above?
What is the difference between (?<!\\(?=\\\\)*)\* and (?=(?<!\\)(?=\\\\)*)\* or if these are essentially wrong the difference between regex that have such a visual construction?
To find an unescaped character, you would look for a character that is preceded by an even number of (or zero) escape characters. This is relatively straightforward.
(?<=(?<!\\)(?:\\\\)*)\* # this is explained in Tim Pietzcker's answer
Unfortunately, many regex engines do not support variable-length look-behind, so we have to substitute with look-ahead:
(?=(?<!\\)(?:\\\\)*\*)(\\*)\* # also look at ridgerunner's improved version
Replace this with the contents of group 1 and a % sign.
Explanation
(?= # start look-ahead
(?<!\\) # a position not preceded by a backslash (via look-behind)
(?:\\\\)* # an even number of backslashes (don't capture them)
\* # a star
) # end look-ahead. If found,
( # start group 1
\\* # match any number of backslashes in front of the star
) # end group 1
\* # match the star itself
The look-ahead makes sure only even numbers of backslashes are taken into account. Anyway, there is no way around matching them into a group, since the look-ahead does not advance the position in the string.
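Here is a JavaScript sketch that runs this pattern against the four examples from the question. It assumes an engine with lookbehind support (ES2018 or later); the function name toLike is made up for illustration:

```javascript
// Look-ahead for "even number of backslashes, then a star",
// then capture the backslashes and match the star itself.
const unescapedStar = /(?=(?<!\\)(?:\\\\)*\*)(\\*)\*/g;
const toLike = (s) => s.replace(unescapedStar, '$1%');

console.log(toLike('*test*'));                   // %test%
console.log(toLike(String.raw`\\*test\\*`));     // \\%test\\%
console.log(toLike(String.raw`\*test\*`));       // \*test\*     (unchanged)
console.log(toLike(String.raw`\\\*test\\\*`));   // \\\*test\\\* (unchanged)
```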
Ok, since Tim decided to not update his regex with my suggested mods (and Tomalak's answer is not as streamlined), here is my recommended solution:
Replace: ((?<!\\)(?:\\\\)*)\* with $1%
Here it is in the form of a commented PHP snippet:
// Replace all non-escaped asterisks with "%".
$re = '% # Match non-escaped asterisks.
( # $1: Any/all preceding escaped backslashes.
(?<!\\\\) # At a position not preceded by a backslash,
(?:\\\\\\\\)* # Match zero or more escaped backslashes.
) # End $1: Any preceding escaped backslashes.
\* # Unescaped literal asterisk.
%x';
$text = preg_replace($re, '$1%', $text);
Addendum: Non-lookaround JavaScript Solution
The above solution requires lookbehind, which JavaScript did not support when this was written (lookbehind was added to the language in ES2018). The following JavaScript solution does not use lookbehind:
text = text.replace(/(\\[\S\s])|\*/g,
function(m0, m1) {
return m1 ? m1 : '%';
});
This solution replaces each instance of backslash-anything with itself, and each instance of * asterisk with a % percent sign.
Edit 2011-10-24: Fixed Javascript version to correctly handle cases such as: **text**. (Thanks to Alan Moore for pointing out the error in previous version.)
Others have shown how this can be done with a lookbehind, but I'd like to make a case for not using lookarounds at all. Consider this solution (demo here):
s/\G([^*\\]*(?:\\.[^*\\]*)*)\*/$1%/g;
The bulk of the regex, [^*\\]*(?:\\.[^*\\]*)*, is an example of Friedl's "unrolled loop" idiom. It consumes as many as it can of individual characters other than asterisk or backslash, or pairs of characters consisting of a backslash followed by anything. That allows it to avoid consuming unescaped asterisks, no matter how many escaped backslashes (or other characters) precede them.
The \G anchors each match to the position where the previous match ended, or to the beginning of the input if this is the first match attempt. This prevents the regex engine from simply skipping over escaped backslashes and matching the unescaped asterisks anyway. So, each iteration of the /g controlled match consumes everything up to the next unescaped asterisk, capturing all but the asterisk in group #1. Then that's plugged back in and the * is replaced with %.
I think this is at least as readable as the lookaround approaches, and easier to understand. It does require support for \G, so it won't work in JavaScript or Python, but it works just fine in Perl.
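JavaScript has no \G, but combining the sticky flag y with g gives a similar "each match must start where the previous one ended" behaviour in String.prototype.replace. The following is a sketch under that assumption, not a drop-in equivalent of the Perl substitution; the function name convert is illustrative:

```javascript
// Unrolled loop: runs of ordinary characters, interleaved with backslash-escaped pairs,
// up to the next unescaped asterisk. The y flag anchors each match at lastIndex.
const upToStar = /([^*\\]*(?:\\.[^*\\]*)*)\*/gy;
const convert = (s) => s.replace(upToStar, '$1%');

console.log(convert('*test*'));                 // %test%
console.log(convert(String.raw`\\*test\\*`));   // \\%test\\%
console.log(convert(String.raw`\*test\*`));     // \*test\*     (unchanged)
console.log(convert(String.raw`\\\*test\\\*`)); // \\\*test\\\* (unchanged)
```

As soon as no match is possible at the current position, the replacement loop stops, which is what keeps the engine from skipping ahead into the middle of an escape sequence.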
So you essentially want to match * only if it's preceded by an even number of backslashes (or, in other words, if it isn't escaped)? Then you don't need lookahead at all since you're only looking back, aren't you?
Search for
(?<=(?<!\\)(?:\\\\)*)\*
and replace with %.
Explanation:
(?<= # Assert that it's possible to match before the current position...
(?<!\\) # (unless there are more backslashes before that)
(?:\\\\)* # an even number of backslashes
) # End of lookbehind
\* # Then match an asterisk
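Engines that allow variable-length lookbehind can run this pattern as is. A minimal JavaScript sketch, assuming an ES2018-capable engine such as a current Node.js (the function name toPercent is made up):

```javascript
// Match a star only if it is preceded by an even number (possibly zero) of backslashes.
const star = /(?<=(?<!\\)(?:\\\\)*)\*/g;
const toPercent = (s) => s.replace(star, '%');

console.log(toPercent('*test*'));                 // %test%
console.log(toPercent(String.raw`\\*test\\*`));   // \\%test\\%
console.log(toPercent(String.raw`\*test\*`));     // \*test\*     (unchanged)
console.log(toPercent(String.raw`\\\*test\\\*`)); // \\\*test\\\* (unchanged)
```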
The problem of detecting escaped backslashes in regex has fascinated me for a while, and it wasn't until recently that I realized I was completely overcomplicating it. There are a couple of things that make it simpler, and as far as I can tell nobody here has noticed them yet:
Backslashes escape any character after them, not just other backslashes. So (\\.)* will eat an entire chain of escaped characters, whether they're backslashes or not. You don't have to worry about even- or odd-numbered slashes; just check for a solitary \ at the beginning or end of the chain (ridgerunner's JavaScript solution does take advantage of this).
Lookarounds aren't the only way to make sure you start with the first backslash in a chain. You can just look for a non-backslash character (or the start of the string).
The result is a short, simple pattern that doesn't need lookarounds or callbacks, and it's shorter than anything else I see so far.
/(?<!\\)(\\.)*\*/g
And the replacement string:
"$1%"
This works in .NET, which allows lookbehinds, and it should work for you in Perl. It's possible to do it in JavaScript, but without lookbehinds or the \G anchor, I can't see a way to do it in a one-liner. Ridgerunner's callback should work, as will a loop:
var regx = /(^|[^\\])(\\.)*\*/g;
while (input.match(regx)) {
input = input.replace(regx, '$1$2%');
}
There are a lot of names here I recognize from other regex questions, and I know some of you are smarter than me. If I've made a mistake, please say so.

RegEx with strange behaviour: matching String with back reference to allow escaping and single and double quotes

Matching a string that allows escaping is not that difficult.
Look here: http://ad.hominem.org/log/2005/05/quoted_strings.php.
For the sake of simplicity I chose the approach where a string is divided into two "atoms": either a character that is "not a quote or backslash", or a backslash followed by any character.
"(([^"\\]|\\.)*)"
The obvious improvement now is to allow different quotes and use a backreference.
(["'])((\\.|[^\1\\])*?)\1
Also multiple backslashes are interpreted correctly.
Now to the part, where it gets weird: I have to parse some variables like this (note the missing backslash in the first variable value):
test = 'foo'bar'
var = 'lol'
int = 7
So I wrote quite an expression. I found out that the following part of it does not work as expected (only difference to the above expression is the appended "([\r\n]+)"):
(["'])((\\.|[^\1\\])*?)\1([\r\n]+)
Despite the missing backslash, 'foo'bar' is matched. I used RegExr by gskinner for this (online tool) but PHP (PCRE) has the same behaviour.
To fix this, you can hardcode the quote by replacing the backreferences with '. Then it works as expected.
Does this mean the backreference does not actually work in this case? And what does this have to do with the linebreak characters, given that it worked without them?
You can't use a backreference inside a character class; \1 will be interpreted as octal 1 in this case (at least in some regex engines, I don't know if this is universally true).
So instead try the following:
(["'])(?:\\.|(?!\1).)*\1(?:[\r\n]+)
or, as a verbose regex:
(["']) # match a quote
(?: # either match...
\\. # an escaped character
| # or
(?!\1). # any character except the previously matched quote
)* # any number of times
\1 # then match the previously matched quote again
(?:[\r\n]+) # plus one or more linebreak characters.
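The difference can be reproduced in JavaScript, where (without the u flag) \1 inside a character class is likewise read as an octal escape rather than a backreference. This is a sketch with illustrative variable names, using the malformed line from the question:

```javascript
const src = "test = 'foo'bar'\nvar = 'lol'\n";

// [^\1\\] excludes the character U+0001 and the backslash, not the opening quote,
// so the lazy group is free to run across the inner quote.
const withClassBackref = /(["'])((\\.|[^\1\\])*?)\1([\r\n]+)/;

// (?!\1). really means "any character except the quote captured in group 1".
const withLookahead = /(["'])(?:\\.|(?!\1).)*\1(?:[\r\n]+)/;

console.log(JSON.stringify(src.match(withClassBackref)[0])); // "'foo'bar'\n"
console.log(JSON.stringify(src.match(withLookahead)[0]));    // "'bar'\n"
```

The lookahead version can no longer cross the unescaped quote, so the only match it finds on that line starts at the second quote.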
Edit: Removed some unnecessary parentheses and changed some into non-capturing parentheses.
Your regex insists on finding at least one line break character (carriage return or newline) after the matched string - why? What if it's the last line of your file? Or if there is a comment or whitespace after the string? You probably should drop that part completely.
Also note that you don't have to make the * lazy for this to work - the regex can't cross an unescaped quote character - and that you don't have to check for backslashes in the second part of the alternation since all backslashes have already been scooped up by the first part of the alternation (?:\\.|(?!\1).). That's why this part has to be first.