more robust regular expression lookaround


This is the input string: $table_prefix = 'wp5t3s1tc_'; which is part of a larger config file.
I want to match anything between the single quotes.
The expression I have working is (?<=\$table_prefix(\s{2}=\s\'))(.*)?(?=\'), which is brittle because the lookbehind hard-codes the exact whitespace on either side of the =. If the config file changes to use a different number of spaces around the =, the expression stops working.
I am thinking it should look more like (?<=\$table_prefix(\s*\=\s*\'))(.*)?(?=\') but that of course does not work.
Could someone briefly explain a more elegant way of doing this match?

Here's a possible solution using grep. It is not very elegant, but it should be robust if you are concerned about variable spaces around the =.
Since variable-length lookbehind assertions are not allowed in grep -P (PCRE), AFAIK, the only thing I can think of is to perform the extraction in two stages:
grep -oP '(?<=\$table_prefix).*(?='"'"')' file_name | grep -oP '(?<='"'"').*'
I'm basically capturing all the spaces around the = first, along with 'wp5t3s1tc_, and then extracting everything after the '. The weird '"'"' is to escape the single quote character.
Or you could use sed instead of the second grep:
grep -oP '(?<=\$table_prefix).*(?='"'"')' file_name | sed 's/ *= *'"'"'//'

You don't need lookaround at all, as long as you are guaranteed that the ' character won't appear in the sequence you are trying to match. You can use a greedy match on a negated character class, which consumes as many characters as possible but can never run past the next ' character.
To extract only the subsequence inside the single quotes, use a named group (or an unnamed group if your engine does not support named ones; in that case you access the group by its index instead of by name).
This regular expression does what you seek:
\$table_prefix\s*=\s*'(?<match>[^']*)';
Check with http://rubular.com/
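As a rough illustration of the capture-group approach, here is a minimal Python sketch. Python's re module is only a stand-in for whatever engine you use, and it spells named groups (?P<name>...) rather than (?<name>...). The helper name extract_table_prefix and the sample lines are made up for this example, and the value is assumed never to contain a single quote.

```python
import re

# Whitespace around "=" may vary; the value is assumed not to contain "'".
TABLE_PREFIX_RE = re.compile(r"\$table_prefix\s*=\s*'(?P<match>[^']*)';")


def extract_table_prefix(config_text):
    """Return the text between the single quotes, or None if not found."""
    found = TABLE_PREFIX_RE.search(config_text)
    return found.group("match") if found else None


print(extract_table_prefix("$table_prefix = 'wp5t3s1tc_';"))
print(extract_table_prefix("$table_prefix   ='wp5t3s1tc_';"))
```

Because the whitespace is matched outside any lookbehind, the fixed-length restriction never comes into play.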

Related

Regular expression to match string in line between single ":" field delimiters and exclude them, when the string also contains "::" field delimiters

Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user@ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user@ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.

Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
See a regex demo.
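For engines without \K, the same lookarounds can be combined with a capture group. The following Python sketch is one way to do that; the function name find_netmask is invented for the example, and Python's re module is used only as a stand-in for grep -P.

```python
import re

# Same lookarounds as the grep pattern; the capture group replaces \K.
NETMASK_RE = re.compile(r"(?<!:):(\d{1,3}(?:\.\d{1,3}){3})(?=:(?!:))")


def find_netmask(text):
    """Return the first dotted quad that sits between two single colons."""
    found = NETMASK_RE.search(text)
    return found.group(1) if found else None


line = "ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off"
print(find_netmask(line))  # 255.255.254.0
```

The gateway 10.0.20.1 is skipped because the colon in front of it is itself preceded by a colon.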

Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
View Demo here
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0

If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by:
"What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters."
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.

Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
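The field-counting idea can also be sketched for a whole kernel command line, which is what the question ultimately wants to parse. This is only an illustration in Python: the function name netmask_from_cmdline and the sample command line are assumptions, not taken from the question.

```python
def netmask_from_cmdline(cmdline):
    """Return the 4th colon-separated field of the ip= parameter, or None."""
    for token in cmdline.split():  # parameters are separated by whitespace
        if token.startswith("ip="):
            fields = token.split(":")  # empty fields become empty strings
            # fields[0] is "ip=<client-ip>", so index 3 is the 4th field,
            # the same one that awk -F: '{print $4}' and cut -f4 select.
            return fields[3] if len(fields) > 3 else None
    return None


# Hypothetical command line, for illustration only.
sample = ("BOOT_IMAGE=/vmlinuz ro "
          "ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off quiet")
print(netmask_from_cmdline(sample))  # 255.255.254.0
```

Splitting on every colon keeps empty fields in place, so the double colons do not shift the position of the netmask.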

Regex for finding comma between words

What's the regular expression for finding all instances of a comma, without a trailing space, between words? e.g. "someword,otherword"?
I'm using this pattern in Eclipse's search tool:
([^,\n\s']),([^,\n\s\)\]'])
which works perfectly, but when I use this same pattern with grep like:
grep -nHIirE -- ([^,\n\s']),([^,\n\s\)\]'])
it finds nothing. What am I doing wrong?

Use word boundaries \b:
echo "abc,def ghi, jkl" | grep '\b,\b'
(finds the first comma, but not the second)

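Passing ([^,\n\s']),([^,\n\s\)\]']) to grep unquoted means the shell interprets the parentheses, quotes and backslashes before grep ever sees them, so grep does not receive the regex you intended. You need to quote the pattern; because it contains single quotes itself, double quotes are the simpler choice: "([^,\n\s']),([^,\n\s\)\]'])"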

Using grep this pattern should work:
grep -nHIir -- "[^,'[:space:]],[^],)[:space:]']"
Use [:space:] to match any whitespace including newlines
Importantly don't escape ] inside a negated character class. Just place it at first position.
Avoid unnecessary grouping
You don't even need extended regex flavor for this
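To check what the bracket expressions accept and reject, here is a small Python sketch of an equivalent pattern. It is an approximation: Python has no [:space:] class, so \s is used instead, and ] is escaped inside the class. The function name commas_without_space is made up for the example.

```python
import re

# A comma whose left neighbour is not a comma, quote or whitespace, and whose
# right neighbour is not "]", ",", ")", a quote or whitespace.
COMMA_NO_SPACE_RE = re.compile(r"[^,'\s],[^\],)\s']")


def commas_without_space(text):
    """Return each comma together with its two neighbouring characters."""
    return [m.group(0) for m in COMMA_NO_SPACE_RE.finditer(text)]


print(commas_without_space("abc,def ghi, jkl"))  # ['c,d']
```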

How do I write a SED regex to extract a string delimited by another string?

I am using GNU sed version 4.2.1 and I am trying to write a non-greedy SED regex to extract a string that is delimited by two other strings. This is easy when the delimiting strings are single-character:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how do I write a regex for SED that matches a string that is delimited by two other strings?

You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, this is not possible in Sed. In fact, I don't think there is any way in Sed of doing this match. The only other way to do this is to use advanced features like look-ahead, which Sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.
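The lazy quantifier behaves the same way in other Perl-style engines. As a hedged illustration, this Python sketch applies the equivalent substitution; the function name strip_triple_braces is invented, and the braces are escaped only to keep the pattern unambiguous.

```python
import re


def strip_triple_braces(text):
    """Remove each {{{ ... }}} pair, keeping the text between them."""
    # .*? stops at the first }}} after the opening {{{ (lazy match).
    return re.sub(r"\{\{\{(.*?)\}\}\}", r"\1", text)


line = "{{{cannot match {this broken} example}}} {{{can match this example}}}"
print(strip_triple_braces(line))
```

Both groups on the line are unwrapped independently, and the single braces inside the first group are left alone.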

With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.

sed: Can my pattern contain an "is not" character? How do I say "is not X"?

How do I say "is not" a certain character in sed?

[^x]
This is a character class that accepts any character except x.

For those not satisfied with the selected answer as per johnny's comment.
'su[^x]' will match 'sum' and 'sun' but not 'su'.
You can tell sed to not match lines with x using the syntax below:
sed '/x/! s/su//' file
See kkeller's answer for another example.

There are two possible interpretations of your question. Like others have already pointed out, [^x] matches a single character which is not x. But an empty string also isn't x, so perhaps you are looking for [^x]\|^$.
Neither of these answers extend to multi-character sequences, which is usually what people are looking for. You could painstakingly build something like
[^s]\|s\($\|[^t]\|t\($\|[^r]\)\)
to compose a regular expression which doesn't match str, but a much more straightforward solution in sed is to delete any line which does match str, then keep the rest;
sed '/str/d' file
Perl 5 introduced a much richer regex engine, which has since become the de facto standard in Java, PHP, Python, etc. Because Perl helpfully supports a subset of sed syntax, you could probably convert a simple sed script to Perl in order to use a feature from this extended regex dialect, such as negative assertions:
perl -pe 's/(?:(?!str).)+/not/' file
will replace a string which is not str with not. The (?:...) is a non-capturing group (unlike in many sed dialects, an unescaped parenthesis is a metacharacter in Perl) and (?!str) is a negative assertion; the text immediately after this position in the string mustn't be str in order for the regex to match. The + repeats this pattern until it fails to match. Notice how the assertion needs to be true at every position in the match, so we match one character at a time with . (newbies often get this wrong, and erroneously only assert at e.g. the beginning of a longer pattern, which could however match str somewhere within, leading to a "leak").
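The difference between asserting at every position and asserting only once can be seen in a short experiment. This Python sketch uses the same two shapes of pattern; the names TEMPERED and LEAKY and the sample text are made up for the illustration.

```python
import re

TEMPERED = re.compile(r"(?:(?!str).)+")  # assertion checked before every char
LEAKY = re.compile(r"(?!str).+")         # assertion checked only at the start

text = "abcstrdef"

print(TEMPERED.match(text).group(0))       # abc
print(LEAKY.match(text).group(0))          # abcstrdef
print(TEMPERED.sub("not", text, count=1))  # notstrdef
```

The leaky pattern only verifies that the match does not begin with str, so it happily runs over the str in the middle.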

From my own experience, and the post below supports this, sed has no direct way to negate a whole substring; a leading "^" only negates a bracket expression such as [^x]. If you check the post below, you'll see some workarounds.
Sed regex and substring negation

In addition to all the provided answers, you can negate a character class in sed using the notation [^[:C_CLASS:]]. For example, [^[:blank:]] will match anything which is not considered a blank character (space or tab).

What regular expression can remove duplicate items from a string?

Given a string of identifiers separated by :, is it possible to construct a regular expression to extract the unique identifiers into another string, also separated by :?
How is it possible to achieve this using a regular expression? I have tried s/(:[^:])(.*)\1/$1$2/g with no luck, because the (.*) is greedy and skips to the last match of $1.
Example: a:b:c:d:c:c:x:c:c:e:e:f should give a:b:c:d:x:e:f
Note: I am coding in perl, but I would very much appreciate using a regex for this.

In .NET which supports infinite repetition inside lookbehind, you could search for
(?<=\b\1:.*)\b(\w+):?
and replace all matches with the empty string.
Perl (at least Perl 5) only supports fixed-length lookbehinds, so you can try the following (using lookahead, with a subtly different result):
\b(\w+):(?=.*\b\1:?)
If you replace that with the empty string, all previous repetitions of a duplicate entry will be removed; the last one will remain. So instead of
a:b:c:d:x:e:f
you would get
a:b:d:x:c:e:f
If that is OK, you can use
$subject =~ s/\b(\w+):(?=.*\b\1:?)//g;
Explanation:
First regex:
(?<=\b\1:.*): Check if you can match the contents of backreference no. 1, followed by a colon, somewhere before in the string.
\b(\w+):?: Match an identifier (from a word boundary to the next :), optionally followed by a colon.
Second regex:
\b(\w+):: Match an identifier and a colon.
(?=.*\b\1:?): Then check whether you can match the same identifier, optionally followed by a colon, somewhere ahead in the string.
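The lookahead variant can be tried outside Perl as well. Below is a minimal Python sketch of the second regex; the function name drop_earlier_duplicates is an assumption for the example. It also assumes that no identifier is a prefix of another, since \b\1:? would otherwise match the start of a longer identifier.

```python
import re


def drop_earlier_duplicates(ids):
    """Remove every identifier that occurs again later in the string."""
    # The lookahead succeeds when the captured identifier reappears further
    # right, so only the last occurrence of each identifier survives.
    return re.sub(r"\b(\w+):(?=.*\b\1:?)", "", ids)


print(drop_earlier_duplicates("a:b:c:d:c:c:x:c:c:e:e:f"))  # a:b:d:x:c:e:f
```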

Check out: http://www.regular-expressions.info/duplicatelines.html
Always a useful site when thinking about any regular expression.

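use feature 'say';
$str = q!a:b:c:d:c:c:x:c:c:e:e:f!;
# Repeat until no substitution happens; each pass drops a later repeat of an earlier ":id".
1 while ($str =~ s/(:[^:]+)(.*?)\1/$1$2/g);
say $str;
Output:
a:b:c:d:x:e:f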

Here's an awk version; no regex needed.
$ echo "a:b:c:d:c:c:x:c:c:e:e:f" | awk -F":" '{for(i=1;i<=NF;i++)if(!($i in a)){a[$i];printf "%s%s",sep,$i;sep=":"}print ""}'
a:b:c:d:x:e:f
Split the fields on ":", go through the split fields, and store each element in an array. If an element is already in the array, skip it; otherwise print it, preceded by ":" for every element after the first. You can translate this easily into Perl code.
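The same split-and-remember idea is short in most languages. As an illustration only, this Python sketch keeps the first occurrence of each identifier and preserves order; the function name unique_ids is made up, and it relies on dict preserving insertion order (Python 3.7 or later).

```python
def unique_ids(ids, sep=":"):
    """Keep the first occurrence of each identifier, preserving order."""
    # dict.fromkeys drops repeated keys while remembering first-seen order.
    return sep.join(dict.fromkeys(ids.split(sep)))


print(unique_ids("a:b:c:d:c:c:x:c:c:e:e:f"))  # a:b:c:d:x:e:f
```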

If the identifiers are sorted, you may be able to do it using lookahead/lookbehind. If they aren't, then this is beyond the computational power of a regex. Now, just because it's impossible with formal regex doesn't mean it's impossible if you use some perl specific regex feature, but if you want to keep your regexes portable you need to describe this string in a language that supports variables.
