Pattern backreference to an optional capturing subexpression - regex

In an attempt to use Bash's built-in regular expression matching to parse the following types of strings, which are to be converted to Perl substitution expressions (quotes are not part of data)
'~#A#B#'
#^ ^ ^-- Replacement string.
#| +---- Pattern string.
#+------ Regular expression indicator (no need to escape strings A and B),
# which is only allowed if strings A and B are surrounded with ##.
# Strings A and B may not contain #, but are allowed to have ~.
'#A#B#'
#^------ When regex indicator is missing, strings A and B will be escaped.
'A#B'
# Simplified form of '#A#B#', i. e. without the enclosing ##.
# Still none of the strings A and B is allowed to contain # at any position,
# but can have ~, so leading ~ should be treated as part of string A.
I tried the following pattern (again, without quotes):
'^((~)?(#))?([^#]+)#([^#]+)\3$'
That is, it declares the leading ~# optional (and ~ in it even more optional), then captures parts A and B, and requires the trailing # to be present only if it was present in the leader. The leading # is captured for backreference matching only — it is not needed elsewhere, while ~ is captured to be inspected by script afterwards.
However, that pattern only works as expected with the most complete types of input data:
'~#A#B#'
'#A#B#'
but not for
'A#B'
I. e., whenever the leading part is missing, \3 fails to match. But if \3 is replaced with .*, the match succeeds and it can be seen that ${BASH_REMATCH[3]} is an empty string. This is something that I do not understand, provided that unset variables are treated as empty strings in Bash. How do I match a backreference with optional content then?
As a workaround, I could write an alternative pattern
'^(~?)#([^#]+)#([^#]+)#$|^([^#]+)#([^#]+)$'
but it results in distinct capture groups for each possible case, which makes the code less intuitive.
Important note. As @anubhava mentioned in his comment, backreference matching may not be available in some Bash builds (perhaps it is a matter of build options rather than of version number, or even of some external library). This question is of course targeted at those Bash environments that support such functionality.

There are two ways to deal with this problem:
Instead of making the group optional (in other words, allowing it to not match at all), make it mandatory but match the empty string. In other words, change constructs like (#)? to (#?).
Use a conditional to match the backreference \3 only if group 3 matched. To do this, change \3 to (?(3)#|).
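To make the difference between the two options concrete, here is a minimal sketch in Python rather than Bash. That choice is an assumption made purely for illustration, since Python's re module supports both backreferences and conditionals. The patterns are simplified to one field with an optional # on each side, and all names are hypothetical.

```python
import re

# Group 1 is optional: when it does not participate, \1 cannot match at all.
OPTIONAL_GROUP = re.compile(r'^(#)?([^#]+)\1$')

# Option 1: group 1 always participates but may capture the empty string.
EMPTY_GROUP = re.compile(r'^(#?)([^#]+)\1$')

# Option 2: require the trailing # only if group 1 participated.
# (Python lets the empty "else" branch of the conditional be omitted.)
CONDITIONAL = re.compile(r'^(#)?([^#]+)(?(1)#)$')

def matches(pattern, text):
    """Return True if the compiled pattern matches the text."""
    return pattern.match(text) is not None
```

With these definitions, matches(OPTIONAL_GROUP, 'A') is False, while both EMPTY_GROUP and CONDITIONAL accept 'A' and '#A#' and reject the unbalanced '#A'.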
Generally, the first option is preferable because it is more readable. Bash's regular expressions also don't seem to support conditional constructs, so we need to make option 1 work. This is difficult because of the additional condition that ~ is only allowed if # is also present. If bash supported lookaheads, we could do something like ((~)(?=#))?(#?). But since it doesn't, we need to get creative. I've come up with the following pattern:
^((~(#))|(#?))([^#]+)#([^#]+)(\3|\4)$
Demo.
The idea is to make use of the alternation operator | to handle two different cases: Either the text starts with ~#, or it doesn't. ((~(#))|(#?)) captures ~# in group 2 and # in group 3 if possible, but if there's no ~ then it just captures # (if present) in group 4. Then we can use (\3|\4) at the end to match the closing #, if there was an opening one (remember, group 3 captured # if the text started with ~#, and group 4 captured # or the empty string if the text did not start with ~#).
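The group numbering described above can be checked with a short sketch. It uses Python instead of Bash (an assumption for illustration only; Python's re module also refuses to match a backreference to a group that did not participate), and the function name parse_spec is hypothetical.

```python
import re

# The pattern from the answer, unchanged.
SPEC = re.compile(r'^((~(#))|(#?))([^#]+)#([^#]+)(\3|\4)$')

def parse_spec(text):
    """Return (is_regex, A, B), or None if the text is not a valid spec."""
    m = SPEC.match(text)
    if m is None:
        return None
    # Group 2 participates only when the text starts with "~#".
    is_regex = m.group(2) is not None
    return is_regex, m.group(5), m.group(6)
```

Under these assumptions, parse_spec('~#A#B#') gives (True, 'A', 'B'), both '#A#B#' and 'A#B' give (False, 'A', 'B'), a leading ~ without # stays part of string A, and an unbalanced input such as '#A#B' is rejected.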

Related

Regular Expression for poorly defined key value pairs

I am using regular expressions to parse text files that look like the following:
<diagnostics> data=filenames/sometimes with/spaces\filename with or without spaces.dat start=0 end=90 overload=2 offset=871
<region> data=another file.filetype <diagnostics> replay=true
I would like to find all data names by scanning individual lines. If there were no spaces in the folder or filenames I could match against data= and then scan until a space with pattern:
data=([^ \n]*)
I might scan until a .xxxx filename is found, but in theory periods can be part of the folder or partial filenames. The actual pattern is to scan until data= is found and then keep going until end of line or until either one of the following: <, unknownTagNoSpaces=.
<stuff> data=(folder one/folder\value I want.whatever) (unknownTagNoSpaces)=
<stuff> replay=false data=(value I want followed by newline.xxx)
data=(folder/value I want.hhhh) <something>
So the regular expression might be to stop:
data=[^/\n|=|</]*
and this almost works except in the case of the equals sign = I have to omit the word (no spaces) and space before the equals sign as well so data=value.docx otherkey=something removes otherkey from the match.
Is this possible with regular expressions? I think the answer might be no.
I hope I understood what you want, so here is my try:
data=((?:(?> *[^ \n<=]+)(?!=))*)
It uses atomic groups; I hope your regex engine supports them.
Explanation:
data=((?:(?> *[^ \n<=]+)(?!=))*)  whole regex
data=(                         )  match 'data=' and the stuff behind it as first capture group
      (?:                    )*   repeat as long as the contained stuff is valid
         (?>           )          atomic group: treat as one part, do not break apart, "tokenize"
             *                    match all spaces here (has some nice effect explained later)
              [^ \n<=]+           match (at least one) symbol that is not a space, newline, '<' or '='
                        (?!=)     ensure there is no equal sign
The atomic group captures preceding whitespace and all valid symbols thus stopping at spaces.
Since spaces are captured beforehand, there will be no trailing whitespace. However, leading whitespace must be matched (but can be excluded from the capture group) because the 'data=' prefix is also part of the match.
The atomic group magic happens when the '=' is encountered. It is not allowed in the atomic group, and if it is found right behind the group, the entire group will be discarded.
In this case the group consists of the attribute's name and the spaces in front of it.
Example on regex101
I thought about a solution without atomic groups:
data=((?: *(?![^ ]+=)[^< ]+)*)
Explanation:
data=((?: *(?![^ ]+=)[^< ]+)*)  whole regex
data=(                       )  match 'data=' and the stuff behind it as first capture group
      (?:                  )*   repeat as long as the contained stuff is valid
          *                     match all spaces here
           (?![^ ]+=)           check that no "attribute" (no-space followed by '=') comes next
                     [^< ]+     match all the valid symbols
Before each chunk of text, this regex checks that the chunk is not immediately followed by '=' and only then matches it.
Example on regex101
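As a quick check of the lookahead-only variant, here is a minimal Python sketch (Python is an assumption here, since the question does not name an engine; the function name extract_data is hypothetical). It applies the pattern to single lines, as the question describes.

```python
import re

# The pattern without atomic groups, unchanged from the answer.
DATA = re.compile(r'data=((?: *(?![^ ]+=)[^< ]+)*)')

def extract_data(line):
    """Return the value following 'data=' on one line, or None if absent."""
    m = DATA.search(line)
    return m.group(1) if m else None
```

For the second sample line, extract_data('<region> data=another file.filetype <diagnostics> replay=true') returns 'another file.filetype'. Note that this variant does not exclude newlines in its character class, so it is meant to be applied line by line.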

Perform substitution on regex results, but only on a given condition

First of all, let me please clarify that I know absolutely nothing about regular expressions, but I need to write a "Tagger Script" for MusicBrainz Picard so that it doesn't mess with the way I format certain aspects of my tracks' titles.
Here's what I need to do:
- Find all sub-strings inside parentheses
- Then, for those matches that meet a given criterion and those matches only, change the parentheses to brackets
For example, consider this string:
DJ Fresh - Louder (Sian Evans) (Flux Pavilion & Doctor P Remix)
It needs to be changed like so:
DJ Fresh - Louder (Sian Evans) [Flux Pavilion & Doctor P Remix]
The condition is that if the string within the parentheses contains the sub-string "dj" or "mix" or "version" or "inch", etc... then the parentheses surrounding it need to be changed to brackets.
So, the question is:
Is it possible to create a single regular expression that can perform this operation?
Thank you very much in advance.
Assuming there are no nested brackets, you can use the following regex to search for the text:
(?i)\((?=[^()]*(?:dj|mix|version|inch))([^()]+)\)
Note that the regex is case-insensitive, due to (?i) in front - make it case-sensitive by removing it.
Check the syntax of your language to see if you can use r prefix, e.g. r'literal_string', to specify literal string.
And use the following as replacement:
[$1]
You can include more keywords by adding them to the (?:dj|mix|version|inch) part, with each keyword separated by |. If a keyword contains (, ), [, ], |, ., +, ?, *, ^, $, \, {, } you need to escape those characters (I'm 99% sure the list is exhaustive). An easier way to think about it is: if the keyword contains only spaces and alphanumeric characters (but note that the number of spaces is strict), you can add it to the regex without causing side effects.
Dissecting the regex:
(?i): Case-insensitive mode
\(: ( is special character in regex, need to escape it by prepending \.
(?=[^()]*(?:dj|mix|version|inch)): Positive look-ahead (?=pattern):
[^()]*: I need to check that the text is within bracket, not outside or in some other bracket, so I use a negated character class [^characters] to avoid matching () and spill outside the current bracket. The assumption I made also comes into play a bit here.
(?:dj|mix|version|inch): A list of keywords, in a non-capturing group (?:pattern). | means alternation.
([^()]+): The assumption about no nested bracket makes it easier to match all the characters inside the bracket. The text is captured for later replacement, since (pattern) is capturing group, as opposed to (?:pattern).
\): ) is special character in regex, need to escape it by prepending \.
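The search and replace described above can be tried out with a small Python sketch. Python is an assumption here (Picard's own scripting syntax may differ), the function name to_brackets is hypothetical, and Python writes the replacement as [\1] rather than [$1].

```python
import re

# The search pattern from the answer; (?i) makes it case-insensitive.
KEYWORDS = re.compile(r'(?i)\((?=[^()]*(?:dj|mix|version|inch))([^()]+)\)')

def to_brackets(title):
    """Turn (...) into [...] when the parenthesised text contains a keyword."""
    return KEYWORDS.sub(r'[\1]', title)
```

Applied to the title from the question, this leaves "(Sian Evans)" alone and rewrites "(Flux Pavilion & Doctor P Remix)" with brackets, because only the latter contains "mix".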

I can't find a regex for this Regular Expression

I am not very good with regular expressions, and I just have a simple question here.
I have a list of links in this way:
http://domain.com/andrei/sometext
http://domain2.com/someothertext/sometextyouknow/whoknows
http://domain341.com/text/thisisit/haha
I just want two regular expressions, to take this out:
http://domain.com/andrei/
http://domain2.com/someothertext/
http://domain341.com/text/
This is the first regex that I need, and I need another regex only to take out the domain, but I guess I'll figure that out if somebody could tell me the regex to take out only what I wrote.
This is what you (most likely) need:
[a-z]+://([^/ ]+)(?:/[^/ ]*/?)?
Here's how it works:
[a-z]+ part is for protocol name (this means, "1 or more letters" - it will match http/https/file/ftp/gopher/foo/whatever protocol, but if you want to match only "http" you can write it explicitly)
:// is literally what it says ;)
[^/ ]+ is one or more non-slash, non-space characters. It can be "a", a fully qualified domain name, or an IP address; whatever.
(?:/[^/ ]*/?)? - this one is more complicated. The ? in the end means that this whole thing in parentheses may or may not be there (it is optional). ?: immediately inside parentheses means do not reuse this sub-pattern (it is not assigned a number and cannot be re-used later by that number). [^/ ]* means 0 or more non-slash non-space characters, and the question mark after the trailing slash, again, states that the slash is optional.
Overall, this ensures matches for things like this:
http://foo/bar/baz/something -> http://foo/bar/
http://hello.world.example.com/ -> http://hello.world.example.com/
http://foo.net -> http://foo.net
ftp://ftp.mozilla.org/pub -> ftp://ftp.mozilla.org/pub
NOTE #1: I did not use escaping for forward slashes intentionally to make the expression more readable, so make sure you use some other character as a delimiter, OR escape all the appearances of / - use \/ instead.
NOTE #2: Add i modifier if you want the expression to be case-insensitive (a-z will not match caps), and g modifier if you want to make multiple matches in one big block of text.
In the matches, subpattern 0 will be the whole matched thing, and subpattern 1 - only hostname
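A minimal Python sketch of how subpattern 0 and subpattern 1 come out of this expression (Python is an assumption, as the question does not name a language, and the function name split_url is hypothetical):

```python
import re

# The expression from the answer; no delimiter is needed in Python,
# so the forward slashes can stay unescaped.
URL_PREFIX = re.compile(r'[a-z]+://([^/ ]+)(?:/[^/ ]*/?)?')

def split_url(url):
    """Return (prefix up to the first path segment, hostname), or None."""
    m = URL_PREFIX.search(url)
    if m is None:
        return None
    return m.group(0), m.group(1)
```

For example, split_url('http://domain.com/andrei/sometext') returns ('http://domain.com/andrei/', 'domain.com'), which covers both tasks from the question with one expression.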
This is probably what you are looking for:
([a-zA-Z]+://([\w.]*)/(?:.*?/)?)
You have all the match in the group 1 and just the domain in the group 2. No need for 2 regular expressions. :)
Use regex https?:\/\/[^\/]+\/[^\/]+/(.*) for your first task - replace $1 with empty string ''.
Use regex https?:\/\/([^\/]+) for your second task - a match $1 is the domain name.

RegEx with strange behaviour: matching String with back reference to allow escaping and single and double quotes

Matching a string that allows escaping is not that difficult.
Look here: http://ad.hominem.org/log/2005/05/quoted_strings.php.
For the sake of simplicity I chose the approach where a string is divided into two "atoms": either a character that is "not a quote or backslash" or a backslash followed by any character.
"(([^"\\]|\\.)*)"
The obvious improvement now is to allow different quotes and use a backreference.
(["'])((\\.|[^\1\\])*?)\1
Also multiple backslashes are interpreted correctly.
Now to the part, where it gets weird: I have to parse some variables like this (note the missing backslash in the first variable value):
test = 'foo'bar'
var = 'lol'
int = 7
So I wrote quite an expression. I found out that the following part of it does not work as expected (only difference to the above expression is the appended "([\r\n]+)"):
(["'])((\\.|[^\1\\])*?)\1([\r\n]+)
Despite the missing backslash, 'foo'bar' is matched. I used RegExr by gskinner for this (online tool) but PHP (PCRE) has the same behaviour.
To fix this, you can hardcode the quote by replacing the backreferences with '. Then it works as expected.
Does this mean the backreference does actually not work in this case? And what does this have to do with the linebreak characters, it worked without it?
You can't use a backreference inside a character class; \1 will be interpreted as octal 1 in this case (at least in some regex engines, I don't know if this is universally true).
So instead try the following:
(["'])(?:\\.|(?!\1).)*\1(?:[\r\n]+)
or, as a verbose regex:
(["']) # match a quote
(?: # either match...
\\. # an escaped character
| # or
(?!\1). # any character except the previously matched quote
)* # any number of times
\1 # then match the previously matched quote again
(?:[\r\n]+) # plus one or more linebreak characters.
Edit: Removed some unnecessary parentheses and changed some into non-capturing parentheses.
Your regex insists on finding at least one line break character after the matched string - why? What if it's the last line of your file? Or if there is a comment or whitespace after the string? You probably should drop that part completely.
Also note that you don't have to make the * lazy for this to work - the regex can't cross an unescaped quote character - and that you don't have to check for backslashes in the second part of the alternation since all backslashes have already been scooped up by the first part of the alternation (?:\\.|(?!\1).). That's why this part has to be first.
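The difference between the two expressions can be reproduced with a short Python sketch. Python is an assumption here (the question used RegExr and PCRE); its re module also treats \1 inside a character class as the character with octal code 1. The trailing line break part is left out, as suggested above, and the helper name is_quoted is hypothetical.

```python
import re

# Original attempt: inside [...] the \1 is an octal escape, not a
# backreference, so the class happily accepts quote characters too.
BROKEN = re.compile(r'''(["'])((\\.|[^\1\\])*?)\1''')

# Suggested fix: a negative lookahead keeps the opening quote out.
FIXED = re.compile(r'''(["'])(?:\\.|(?!\1).)*\1''')

def is_quoted(pattern, text):
    """Return True if the whole text is one quoted string per the pattern."""
    return pattern.fullmatch(text) is not None
```

With the value from the question, is_quoted(BROKEN, "'foo'bar'") is True while is_quoted(FIXED, "'foo'bar'") is False; the fixed pattern still accepts the properly escaped form and a single quote inside double quotes.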

Matching quote contents

I am trying to remove quotes from a string. Example:
"hello", how 'are "you" today'
returns
hello, how are "you" today
I am using php preg_replace.
I've got a couple of solutions at the moment:
(\'|")(.*)\1
Problem with this is it matches all characters (including quotes) in the middle, so the result ($2) is
hello", how 'are "you today'
Backreferences cannot be used in character classes, so I can't use something like
(\'|")([^\1\r\n]*)\1
to not match the first backreference in the middle.
Second solution:
(\'[^\']*\'|"[^"]*")
Problem is, this includes the quotes in the back reference so doesn't actually do anything at all. The result ($1):
"hello", how 'are "you" today'
Instead of:
(\'[^\']*\'|"[^"]*")
Simply write:
\'([^\']*)\'|"([^"]*)"
\______/ \_____/
1 2
Now one of the groups will match the quoted content.
In most flavors, when a group that failed to match is referred to in a replacement string, the empty string gets substituted in, so you can simply replace with $1$2 and one will be the successful capture (depending on the alternate) and the other will substitute in the empty string.
Here's a PHP implementation (as seen on ideone.com):
$text = <<<EOT
"hello", how 'are "you" today'
EOT;
print preg_replace(
'/\'([^\']*)\'|"([^"]*)"/',
'$1$2',
$text
);
# hello, how are "you" today
A closer look
Let's use 1 and 2 for the quotes (for clarity). Whitespaces will also be added (for clarity).
Before, you have, as your second solution, this pattern:
( 1[^1]*1 | 2[^2]*2 )
\_______________________/
capture whole thing
content and quotes
As you correctly pointed out, this matches a pair of quotes correctly (assuming that you can't escape quotes), but it doesn't capture the content part.
This may not be a problem depending on context (e.g. you can simply trim one character from the beginning and end to get the content), but at the same time, it's also not that hard to fix the problem: simply capture the content from the two possibilities separately.
1([^1]*)1 | 2([^2]*)2
\_____/ \_____/
capture contents from
each alternate separately
Now either group 1 or group 2 will capture the content, depending on which alternate was matched. As a "bonus", you can check which quote was used, i.e. if group 1 succeeded, then 1 was used.
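The same idea can be sketched in Python alongside the PHP version above (Python is an assumption for illustration, and the function name strip_quotes is hypothetical). Instead of relying on how a replacement string treats a group that did not participate, this sketch picks the successful group explicitly in a callback.

```python
import re

# One capturing group per alternate, as in the answer.
QUOTED = re.compile(r'''\'([^\']*)\'|"([^"]*)"''')

def strip_quotes(text):
    """Remove the outer quotes of each quoted run, keeping the content."""
    def content(m):
        # Exactly one of the two groups participated in the match.
        return m.group(1) if m.group(1) is not None else m.group(2)
    return QUOTED.sub(content, text)
```

Checking m.group(1) against None is also how the "bonus" mentioned above would be used: it tells you whether single quotes were the ones matched.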
Appendix
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
(…) is used for grouping. (pattern) is a capturing group and creates a backreference. (?:pattern) is non-capturing.
References
regular-expressions.info/Brackets for capturing, Alternation, Character class, Repetition
Regarding:
Backreferences cannot be used in character classes, so I can't use something like
(\'|")([^\1\r\n]*)\1
(\'|")(((?!(\1|\r|\n)).)*)\1
(where (?!...) is a negative lookahead for ...) should work.
I don't know whether this solves your main problem, but it does solve the "match a character iff it doesn't match a backref" part.
Edit:
Missed a parenthesis, fixed.
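A minimal Python sketch of this lookahead variant (Python is an assumption, since the question uses PHP's preg_replace; the function name strip_quotes_lookahead is hypothetical). Group 2 holds the content, so it is used as the replacement.

```python
import re

# The pattern from this answer: each character of the content must not
# be the opening quote (backreference \1) or a line break.
QUOTED = re.compile(r"""(\'|")(((?!(\1|\r|\n)).)*)\1""")

def strip_quotes_lookahead(text):
    """Replace each quoted run with its content (group 2)."""
    return QUOTED.sub(r'\2', text)
```

On the example from the question, this returns hello, how are "you" today, because the inner double quotes differ from the opening single quote and are therefore allowed inside the content.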
You cannot do this with a regular expression. This requires an internal state to keep track of (among other things)
Whether or not a previous quote of a certain type has been encountered
Whether or not the "outer" level of quotes is the current level
Whether an "inner" set of quotes has been descended into, and if so, where that set of quotes begins in the string
This requires a grammar-aware parser to do correctly. A regular expression engine does not keep state because it is a finite state automaton, which only operates on the current input regardless of previous circumstances.
It's the same reason you cannot reliably match sets of nested parentheses or XML elements.