Regex match sequence more than once - regex

How come for something that simple I can't find an answer after looking one hour in the internet?
I have this sentence:
HeLLo woRLd HOw are YoU
I want to capture all groups that consist of two following capital letters
[A-Z]{2}
The regex above works but capture only LL (the first two capital letters) while I want LL in one group and in the other groups also RL HO

Most regular expression engines expose some way to make your expression global. This means that your expression will applied multiple times. This global flag is usually denoted with the /g marker at the end of your expression. This is your regular expression without the /g flag, while this is what happens when you apply said flag.
Different languages expose such functionality differently, in C# for instance, this is done through the Regex.Matches syntax. In Java, you use while(matcher.find()), which keeps providing sub strings which match the pattern provided.
EDIT: I am not a Python person, but judging from the example available here, you could do something like so:
it = re.finditer(r"[A-Z]{2}", "HeLLo woRLd HOw are YoU")
for match in it:
print "'{g}' was found between the indices {s}".format(g=match.group(), s=match.span())

You can not have multiple groups in this case, but you can have multiple matches. Add the global flag to your regex and use a method to match the regex.
For javscript, it would be /[A-Z]{2}/g.
The method most probably returns an Array of matches, and you can use index to access them.

Related

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
Trying to extract repeating patterns from a string
string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string, matches the second and third correctly. The problem is the third and fourth only match the Dxxx and the last string. I need each of the strings between the '|' to be matched.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
See regex tester.
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will you then give three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)

Regex global search without using the global flag

I'm using software that only allows a single line regular expression for filtering and it doesn't allow the global modifier to capture all patterns in the string. Currently, my expression is only returning the first instance.
Is there another way to capture all instances of the pattern in the string?
Expression: (captures hi-res jpg urls)
\{\"hiRes\"\:\"([A-Za-z0-9%\/_:.-]+)\"\,\"thumb
String:
'colorImages': { 'initial': [{"hiRes":"http://sub.website.com/images/I/81OJ6qwKxyL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/41NQRigTUdL._SS40_.jpg","large":"http://sub.website.com/images/I/41NQRigTUdL.jpg","main":{"http://sub.website.com/images/I/81OJ6qwKxyL._SY355_.jpg":[272,355],"http://sub.website.com/images/I/81OJ6qwKxyL._SY450_.jpg":[345,450],"http://sub.website.com/images/I/81OJ6qwKxyL._SY550_.jpg":[422,550],"http://sub.website.com/images/I/81OJ6qwKxyL._SY606_.jpg":[465,606],"http://sub.website.com/images/I/81OJ6qwKxyL._SY679_.jpg":[521,679]},"variant":"MAIN"},{"hiRes":"http://sub.website.com/images/I/71oHZNvsLbL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/31lHNGD-ZDL._SS40_.jpg","large":"http://sub.website.com/images/I/31lHNGD-ZDL.jpg","main":{"http://sub.website.com/images/I/71oHZNvsLbL._SY355_.jpg":[197,355],"http://sub.website.com/images/I/71oHZNvsLbL._SY450_.jpg":[249,450],"http://sub.website.com/images/I/71oHZNvsLbL._SY550_.jpg":[305,550],"http://sub.website.com/images/I/71oHZNvsLbL._SY606_.jpg":[336,606],"http://sub.website.com/images/I/71oHZNvsLbL._SY679_.jpg":[376,679]},"variant":"PT01"},{"hiRes":"http://sub.website.com/images/I/91VCJAcIPEL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/51G1gCkOFzL._SS40_.jpg","large":"http://sub.website.com/images/I/51G1gCkOFzL.jpg","main":{"http://sub.website.com/images/I/91VCJAcIPEL._SX355_.jpg":[355,341],"http://sub.website.com/images/I/91VCJAcIPEL._SX450_.jpg":[450,433],"http://sub.website.com/images/I/91VCJAcIPEL._SX425_.jpg":[425,409],"http://sub.website.com/images/I/91VCJAcIPEL._SX466_.jpg":[466,448],"http://sub.website.com/images/I/91VCJAcIPEL._SX522_.jpg":[522,502]},"variant":"PT02"},{"hiRes":"http://sub.website.com/images/I/912B68GN4aL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/51elravQx6L._SS40_.jpg","large":"http://sub.websi
An interesting question. In my understanding, the global flag cannot be "emulated" with other Regex syntax features.
One could try to emulate the global flag by a Regex repetition. You could expand your Regex so that it would match all appearances of "hiRes":... in a repetition loop. But then, you would see that although several URLs would be matched because of the loop, only the last appearance would be captured.
Switching on the global flag does more than just "continue looking". It switches on collecting more than one capture in an array. Having just a Regex loop does not do the same.
I'd like to show two examples what this means. To test the examples, use e.g. https://regex101.com/.
Here is a simple example, first with the global flag:
Given text: a i b i c i
Regex: /(i)/g
Result: array of three strings, [0]="i" Pos.2, [1]="i" Pos.6, [2]="i Pos.10"
Now without the global flag. To match more, we must add a repetition to the Regex that embraces several "i", and a condition that ignores text between two "i". Like this:
Given text: a i b i c i
Regex: /(?:(i)[^i]*)+/
Result: array of one string, [0]="i" Pos.10
This seems puzzling first, but it is correct. The Regex matches from position 2 until 10. And from that match, it captures the last "i" at position 10. So the repetition in the Regex causes not several captures but a longer matching. This is very different from what the global flag does.
To be precise, this behavior is called "greedy". It tries to match as much as possible. With the "U" flag or with certain quantifiers, you can make the Regex "ungreedy". In that case in the example above, your "ungreedily" captured "i" will be that of position 2.
As a more complex example, just enhance your initial Regex. It must ignore text from the URL until the next "hiRes", and a repetition be put around. Here it is:
/\{(?:"hiRes":"([A-Za-z0-9%\/_:.-]+)"(?:[^"]|"(?!hiRes))*)+/
The second part means: match as many as possible that is not a quota, or that is a quota not followed by hiRes. Like this, this syntax will dig until the begin of the next "hiRes". And then the repetition comes in and it starts over with "hiRes".
Try it out. It will capture only the last URL in your text.
Finally, this tutorial is very comprehensive: http://www.regular-expressions.info/

Lua pattern parentheses and 0 or 1 occurrence

I'm trying to match a string against a pattern, but there's one thing I haven't managed to figure out. In a regex I'd do this:
Strings:
en
eng
engl
engli
englis
english
Pattern:
^en(g(l(i(s(h?)?)?)?)?)?$
I want all strings to be a match.
In Lua pattern matching I can't get this to work.
Even a simpler example like this won't work:
Strings:
fly
flying
Pattern:
^fly(ing)?$
Does anybody know how to do this?
You can't make match-groups optional (or repeat them) using Lua's quantifiers ?, *, + and -.
In the pattern (%d+)?, the question mark "looses" its special meaning and will simply match the literal ? as you can see by executing the following lines of code:
text = "a?"
first_match = text:match("((%w+)?)")
print(first_match)
which will print:
a?
AFAIK, the closest you can come in Lua would be to use the pattern:
^eng?l?i?s?h?$
which (of course) matches string like "enh", "enls", ... as well.
In Lua, the parentheses are only used for capturing. They don't create atoms.
The closest you can get to the patterns you want is:
'^flyi?n?g?$'
'^en?g?l?i?s?h?$'
If you need the full power of a regular expression engine, there are bindings to common engines available for Lua. There's also LPeg, a library for creating PEGs, which comes with a regular expression engine as an example (not sure how powerful it is).

(vim) regex: masking text with help of pattern

Am i correct to understand, that the definition
:range s[ubstitute]/pattern/string/cgiI
suggests that in the string part indeed only strings are to be used, that is patterns not allowed? What i would like to do is do replacement of say any N symbols at position M with X*N symbols, so i would have liked to use something like this:
:%s/^\(.\{10}\).\{28}/\1X\{28}/g
Which does not work because \{28} is interpreted literally.
Is writing the 28 XXXXX...X in the replace part the only possibility?
You can use expressions in the replacement part via \=. You have to access the match via submatch(), and join it together with the static string, which you can generate via repeat():
:%s/^\(.\{10}\).\{28}/\=submatch(1) . repeat('X',28)/g
The only regex constructs allowed in the replacement part are numbered groups: \1 \2 \3 etc. The repeating construct {28} is not valid there, though it's a clever idea. You'll have to use 28 X's.
Another alternative is using a expression in the replacement part:
:%s/^\(.\{10}\).\{28}/\=submatch(1).repeat("X",28)/g
The first matched group is obtained with submatch(1). For more information see :h sub-replace-expression.

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile('(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character" - alphanumeric characters. This is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
(?! is useful in practice. Although strictly speaking, looking ahead is not a regular expression as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if ($line =~ !/Andrea/);