Regexp and grouping

Regexp and grouping - regex

I have an regex to match string of the form x=y. Ie name assigned a value. The value can optionally be quoted and both name and value conform to \w+
My regex is
\w+=\w+|"\w+"|'\w+'
There can be multiple of these assignments on one line, but here I ran into problems. For some reason when I enclose this regex in (?:) it won't match. See test case below
use Test::More;
my $re1 = qr/^\w+=\w+|"\w+"|'\w+'$/p;
my $re2 = qr/^(?:\w+=\w+|"\w+"|'\w+')$/p;
ok('xy="abc"' =~ $re1);
say "PREMATCH ${^PREMATCH}";
say "MATCH ${^MATCH}";
say "POSTMATCH ${^POSTMATCH}";
ok('xy="abc"' =~ $re2);
done_testing;
Output is
ok 1
PREMATCH xy=
MATCH "abc"
POSTMATCH
not ok 2
# Failed test at ./test.pl line 20.
1..2
# Looks like you failed 1 test of 2.
I don't understand why the first matches and the second not. And I also don't understand why the first one matches only the part after the equal sign.

You are having an issue with your alternation. It is taking the entire part of the regex before the first pipe as one option. In other words,
/^\w+=\w+|"\w+"|'\w+'$/
is parsed into three possibilities to match
^\w+=\w+
"\w+"
or
'\w+'$
To fix this you have 2 choices (that I see). First expand each of those choices to what you really want:
/^\w+=\w+|^\w+="\w+"|^\w+='\w+'$/
The second is to cluster the alternation:
/^\w+=(?:\w+|"\w+"|'\w+')$/

Your
^\w+=\w+|"\w+"|'\w+'$
is equivalent to
(?:^\w+=\w+)|(?:"\w+")|(?:'\w+'$)
where it matches the ^ followed by whitespace OR quotation marks around the word OR a single-quote around words that occur at the end of the string.
Your
^(?:\w+=\w+|"\w+"|'\w+')$
Requires that ALL of those within the group start at the beginning of the line (due to the ^ outside of the group), then the various test, and then ALL of those groups must complete at the end of the string (due to the $ outside of the group).
The simplest fix is to simply move both the ^ and $ into to the group:
(?:^\w+=\w+|"\w+"|'\w+'$)

Related

Match all elements with n occurrences

I want to select the same element with exact n occurrences.
Match letters that repeats exact 3 times in this String: "aaaaabbbcccccccccdddee"
this should return "bbb" and "ddd"
If I define what I should match like "b{3}" or "d{3}", this would be easier, but I want to match all elements
I've tried and the closest I came up is this regex: (.)\1{2}(?!\1)
Which returns "aaa", "bbb", "ccc", "ddd"
And I can't add negative lookbehind, because of "non-fixed width" (?<!\1)

One possibility is to use a regex that looks for a character which is not followed by itself (or beginning of line), followed by three identical characters, followed by another character which is not the same as the second three i.e.
(?:(.)(?!\1)|^)((.)\3{2})(?!\3)
Demo on regex101
The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.
This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:
(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))
Again, the match is captured in group 2.
Demo on regex101

You could match what you don't want to keep, which is 4 or more times the same character.
Then use an alternation to capture what you want to keep, which is 3 times the same character.
The desired matches are in capture group 2.
(.)\1{3,}|((.)\3\3)
(.) Capture group 1, match a single character
\1{3,} Repeat the same char in group 1, 3 or more times
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character followed by 2 backreferences matching 2 times the same character as in group 3
) Close group 2
Regex demo

This gets sticky because you cannot put a back reference inside a negative character set, so we'll use a lookbehind followed by a negative lookahead like this:
(?<=(.))((?!\1).)\2\2(?!\2))
This says find a character but don't include it in the match. Then look ahead to be certain the next character is different. Next consume it into capture group 2 and be certain that the next two characters match it, and the one after does not match.
Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:
(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)
This handles all cases.
EDIT
I found a way to handle matches at the beginning of the string:
(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)
Much nicer and more compact, and does not require looking in capture groups to get the answer.

If your environment permits the use of (*SKIP)(*FAIL), you can manage to return a lean set of matches by consuming substrings of four or more consecutive duplicate characters then discard them. In the alternation, match the desired 3 consecutive duplicated characters.
PHP Code: (Demo)
$string = 'aaaaabbbcccccccccdddee';
var_export(
preg_match_all(
'/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
$string,
$m
)
? $m[0]
: 'no matches'
);
Output:
array (
0 => 'bbb',
1 => 'ddd',
)
This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).
This pattern is efficient because it never needs to look backward and by consuming the 4 or more consecutive duplicates, it can rule-out long substrings quickly.

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g
IN
OUT
OK THT PHP This is it 06222021
This is it
NO MTM PYT Get this content 111111
Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space Charachter
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY

You don't need 4 groups, you can use a single group 1 to be in the replacement and match 6-8 digits for the last part instead of only 6.
Note that this \w{0,2} will also match an empty string, you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 times word characters with a quantifier and a whitespace in between
(.*?) Capture group 1 match any char as least as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Regex demo
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it

My approach does not work with groups and does use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 ocurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(#"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value;

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way i can split the name and the version using regex. I want to use the results to make two columns 'Name' and 'Version' in excel.
So i want the results from regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when i do it for a large text file, the results from the two regex are not in the same order. So is there any way to get the results in the way I have mentioned above?

Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar

Not quite sure what you meant for the problem in large file, and I believe the two regex you showed are doing opposite as what you said: first one should get you the name and second one should give you version.
Anyway, here is the assumption I have to guess what may make sense to you:
"Name" may follow by - or _, followed by version string.
"Version" string is something preceded by - or _, with some digit, followed by a dot or underscore, followed by some digit, and then any string.
If these assumption make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 is will be the name, Group 2 will be the Version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^ start of line
(.+?) "Name" part, using reluctant match of
at least 1 character
(?: )? Optional group of "Version String", which
consists of:
[-_] - or _
( ) Followed by the "Version" , which is
\d+ at least 1 digit,
[._] then 1 dot or underscore,
\d+ then at least 1 digit,
.* then any string
$ end of line

Match 3 and 4 delimiters and between them; not less not more

I have a command-line program that its first argument ( = argv[ 1 ] ) is a regex pattern.
./program 's/one-or-more/anything/gi/digit-digit'
So I need a regex to check if the entered input from user is correct or not. This regex can be solve easily but since I use c++ library and std::regex_match and this function by default puts begin and end assertion (^ and $) at the given string, so the nan-greedy quantifier is ignored.
Let me clarify the subject. If I want to match /anything/ then I can use /.*?/ but std::regex_match considers this pattern as ^/.*?/$ and therefore if the user enters: /anything/anything/anyhting/ the std::regex_match still returns true whereas the input-pattern is not correct. The std::regex_match only returns true or false and the expected pattern form the user can only be a text according to the pattern. Since the pattern is various, here, I can not provide you all possibilities, but I give you some example.
Should be match
/.//
s/.//
/.//g
/.//i
/././gi
/one-or-more/anything/
/one-or-more/anything/g/3
/one-or-more/anything/i
/one-or-more/anything/gi/99
s/one-or-more/anything/g/4
s/one-or-more/anything/i
s/one-or-more/anything/gi/54
and anything look like this pattern
Rules:
delimiters are /|##
s letter at the beginning and g, i and 2 digits at the end are optional
std::regex_match function returns true if the entire target character sequence can be match, otherwise return false
between first and second delimiter can be one-or-more +
between second and third delimiter can be zero-or-more *
between third and fourth can be g or i
At least 3 delimiter should be match /.// not less so /./ should not be match
ECMAScript 262 is allowed for the pattern
NOTE
May you would need to see may question about std::regex_match:
std::regex_match and lazy quantifier with strange
behavior
I no need any C++ code, I just need a pattern.
Do not try d?([/|##]).+?\1.*?\1[gi]?[gi]?\1?d?\d?\d?. It fails.
My attempt so far: ^(?!s?([/|##]).+?\1.*?\1.*?\1)s?([/|##]).+?\2.*?\2[gi]?[gi]?\d?\d?$
If you are willing to try, you should put ^ and $ around your pattern
If you need more details please comment me, and I will update the question.
Thanks.

You could use this regular expression:
^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$
See regex101.com
Note how this also rejects these cases:
///anything/
/./anything/gg
/./anything/ii
/./anything/i/12
How it works:
Some explanation of the parts that are different:
((?!\1).): this will match any character that is not the delimiter. This way you are sure you can keep track of the exact number of delimiters used. You can this way also prevent that the first character after the first delimiter, is again that delimiter, which should not be allowed.
(gi?|ig): matches any of the valid modifier combinations, except a sole i, which is treated separately. So this also excludes gg and ii as valid character sequences.
(\1\d\d?)?: optionally allows for an extra delimiter (after a g modifier -- see previous) to be added with one or two digits following it.
( |i)?: for the case there is no g modifier present, but just the i or none: then no digits are allowed to follow.

This is a tricky one, but I took the challenge - here is what I have ended up with:
^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$
Pattern breakdown:
^ matches start of string
s? matches an optional 's' character
([\/|##]) matches the delimeter characters and captures as group 1
(?:(?!\1).)+ matches anything other than the delimiter character one or more times (uses negative lookahead to make sure that the character isn't the delimiter matched in group 1)
\1 matches the delimiter character captured in group 1
(?:(?!\1).)* matches anything other than the delimiter character zero or more times
\1 matches the delimiter character captured in group 1
(?: starts a new group
i matches the i character
| or
(?:gi?|ig) matches either g, gi, or ig
(\1\d{1,2})? followed by an optional extra delimiter and 0-9 once or twice
)? closes group and makes it optional
$ matches end of string
I have used non capturing groups throughout - these are groups that start ?:

Regular Expression to Extract Text Bounded by '/'

I need to a regular expression to extract names from a GEDCOM file. The format is:
Fred Joseph /Smith/
Where the text bounded by the / is the surname and the Fred Joseph are the forenames. The complication is that the surname could be at any place in the text or may not be there at all. I need something that will extract the surname and capture everything else as the forenames.
This is as far as I have got and I have tried making groups optional with the ? qualifier but to no avail:
As you can see it has several problems: If the surname is missing nothing gets captured, the forename(s) sometimes have leading and trailing spaces, and I have 3 capture groups when I'd really like 2. Even better would be if the capture group for the surname didn't include the '/' characters.
Any help would be much appreciated.

For your last line, I'm not sure there is a way to join the group 1 with group 3 into a single group.
Here is my proposed solution. It doesn't capture spaces around forenames.
^(?:\h*([a-z\h]+\b)\h*)?(?:\/([a-z\h]+)\/)?(?:\h*([a-z\h]+\b)\h*)?$
To correctly match the names, care to use the insensitive flag, and if you test all lines at once, use multiline flag.
See the demo
Explanation
^ start of the line
(?:\h*([a-z\h]+\b)\h*)? first non-capturing group that matches 0 or 1 time:
\h* 0 or more horizontal spaces
([a-z\h]+\b) captures in a group letters and spaces, but stops at the end of the last word
\h* matches the possible remaining spaces without capturing
(?:\/([a-z\h]+)\/)? second non-capturing group that matches 0 or 1 time a name in a capturing group surrounded by slashes
(?:\h*([a-z\h]+\b)\h*)? third non-capturing group doing the same as first one, capturing the names in a third group.
$ end of the line

For your requirements
([A-z a-z /])+\w*
Sample

Hope this helps
(.\*?)\\/(.\*?)\\/(.\*)

Try this: ^([^/]*)(/[^/]+/)?([^/]*)$
This matches the following:
^ start of string (or with multiline modifier start of line)
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 1
(/[^/\n]+/)? a single / followed by one or more non / or new line characters, then a single '/' character - this is captured as group 2, and is optional
([^/\n]*) anything other than / or new line zero or more times - this is captured as group 3
$ end of string (or with multiline modifier end of line)
You can see in action with your example text here: https://regex101.com/r/9kmKpy/1
To not capture the slashes you can add a non capturing group by adding ?: to the second set of brackets, and then adding another pair between the slashes:
^([^\/\n]*)(?:\/([^\/\n]+)\/)?([^\/\n]*)$
https://regex101.com/r/9kmKpy/2

I am not sure I follow what language is being used to extract the data, but based on what you have so far, you simply need to add '?':
(.*)(\/?.*\/?)(.*)
Not that this does not give you groupings for EACH name as some solutions will have multiple names in a single group
Edit:
Extending on Niitaku solution and looking at having each individual name in its own group, you could use:
^\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*(?:\/?([a-z]+)\/?)\s*$
As explained though, if using a language like ruby it would simply be:
ruby -pe '$_ = $_.scan(/\w+/)' file

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regexp and grouping - regex

Related

Match all elements with n occurrences

Regular Expression: Find a specific group within other groups in VB.Net

Regex for text file

Match 3 and 4 delimiters and between them; not less not more

Regular Expression to Extract Text Bounded by '/'

Categories

Resources