RegEx capturing group in Elixir - regex

I want to know how this Elixir regex works.
Regex.run(~r{(*UTF)([^\w])+}, "dd!!%%%")
When I execute this regex, the output is
["!!%%%", "%"]
I'm not able to understand why the last % is repeated after matching the regex.

I'm not able to understand why the last % is repeated after matching the regex.
It looks like you meant to write the pattern:
([^\w]+)
rather than something like:
([^\w])([^\w])...([^\w])
The first one gives the expected results:
iex(1)> Regex.run(~r{(*UTF)([^\w]+)}, "dd!!%%%")
["!!%%%", "!!%%%"]
which is a list containing the whole match followed by what matched the capture groups. The second one produces:
iex(9)> Regex.run(~r{(*UTF)([^\w])([^\w])([^\w])}, "dd!!%%%")
["!!%", "!", "!", "%"]
which follows the same logic.
However, your pattern does not follow the logic of the second example with the repeated capture groups. According to regular-expressions.info:
[a] repeated capturing group will capture only the last iteration
So, at least this is known behavior.
It looks like because you explicitly specified only one capture group:
([^\w])
...only one capture group is created.
The capture group matches one character, and the value of the capture group is repeatedly overwritten with the new match as the regex traverses the string according to the + quantifier. When the end of the string is reached, the capture group contains only the last match.
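The overwrite behaviour is not specific to Elixir. Here is a minimal sketch using Python's `re` module (an assumption made for illustration: Python instead of Elixir, and without the `(*UTF)` verb, which Python does not use):

```python
import re

text = "dd!!%%%"

# Quantifier outside the group: the single group is re-captured on every
# iteration, so only the last character survives in group 1.
repeated = re.search(r"([^\w])+", text)

# Quantifier inside the group: the group captures the whole run.
whole = re.search(r"([^\w]+)", text)

print(repeated.group(0), repeated.group(1))  # !!%%% %
print(whole.group(0), whole.group(1))        # !!%%% !!%%%
```

In both cases group 0 is the full match; only the content of group 1 differs.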

If you wish to only return !!%%% as your full match, without the group 1, this might work:
Regex.run(~r{(*UTF)[^\w]+}, "dd!!%%%")

Related

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?
Your regular expression was almost right:
(^\d+\w?)\s([\w\s]+).\s([\w\s]+)\s([A-Z]{2})\s(\d{5})
What I changed are the second and third groups: in both you used a group inside a group, ((\w*\s?)+), where a character class inside a group, ([\w\s]+), matches the same text and gives you the proper group content. (Inside a character class, * and ? are literal characters, not quantifiers, so they are left out of the class.)
With your previous syntax, the inner group would be able to match an empty substring, since both quantifiers allow for a zero-length match (* is 0 to unlimited matches and ? is zero or one match). Since this group was repeated one or more times with the +, the last occurrence would match an empty string and only keep that.
For this you'll need to use a non-capturing group, which is of the form (?:regex), where you currently see your "null results". This gives you the regex:
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: ([^s]+) (?:[^s]+)
The first group is captured into "Group 1" and the second one is not captured at all.
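To make the difference concrete, here is a small sketch in Python's `re` module. The sample text and the `\w+` patterns are assumptions chosen for illustration, not taken from the question:

```python
import re

text = "hello world"

# Two capturing groups: both words are reported as groups.
both = re.match(r"(\w+) (\w+)", text)

# Second group made non-capturing: it still has to match,
# but it no longer shows up in the group list.
first_only = re.match(r"(\w+) (?:\w+)", text)

print(both.groups())        # ('hello', 'world')
print(first_only.groups())  # ('hello',)
print(first_only.group(0))  # hello world
```

The full match is the same in both cases; only the number of reported groups changes.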
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
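As a quick check of the comma-based pattern, this sketch runs it against the sample address from the question using Python's `re` module. The variable names are made up for the example:

```python
import re

# Comma-based pattern from the answer, unchanged.
ADDRESS = re.compile(
    r"^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$"
)

m = ADDRESS.match("5173B 63rd Ave NE, Lake Forest Park WA 98155")

# Five groups, one per address component, with no empty extras.
number, street, city, state, zip_code = m.groups()
print(number, "|", street, "|", city, "|", state, "|", zip_code)
# 5173B | 63rd Ave NE | Lake Forest Park | WA | 98155
```

It only works when the comma is present, as noted above.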
Or take the part before the comma that ends on 2 or more uppercase characters, and then match optional non word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$

Python2 re match repeating patterns doesn't behave as expected

I was trying to extract URLs from messy text data using regular expressions. I used to match [\w.]+[a-zA-Z]{2,4}, which behaved as I expected: find consecutive alphanumeric characters and dots, ending with 2 to 4 letters like com/net/gov. It wasn't perfect but sufficed for my use.
Now I want to improve the pattern a bit: I want to find alphanumeric characters FOLLOWED BY ONE dot, repeat that pattern multiple times, then end with 2 to 4 letters. This would exclude things like "abc....com". However, this time the result really confused me:
test = 'www.1f23123.asda.com'
re.findall(r'(\w+\.){1,}[a-zA-Z]{2,4}', test)
and the result was ['asda.']
Could someone explain to me what goes wrong here?
You are printing the captured group. Try adding ?: to make it a non-capturing group, so that findall returns the whole match:
import re
test = 'www.1f23123.asda.com'
match = re.findall(r'(?:\w+\.){1,}[a-zA-Z]{2,4}', test)
print(match)  # parenthesized form works in both Python 2 and 3
Your regex repeats a capturing group where you need to capture a repeated group, so only the last iteration is captured. You will need:
((?:\w+\.){1,})[a-zA-Z]{2,4}
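The three variants can be compared side by side. A short sketch (the variable names are made up for the example); note that `re.findall` returns the group content whenever the pattern contains a capturing group, and the whole match otherwise:

```python
import re

test = 'www.1f23123.asda.com'

# Repeated capturing group: findall reports only the last iteration.
repeated = re.findall(r'(\w+\.){1,}[a-zA-Z]{2,4}', test)

# Non-capturing group: no groups at all, so findall reports the whole match.
non_capturing = re.findall(r'(?:\w+\.){1,}[a-zA-Z]{2,4}', test)

# Capturing group around the repetition: findall reports every iteration
# together, but not the trailing 2-4 letters, which sit outside the group.
outer = re.findall(r'((?:\w+\.){1,})[a-zA-Z]{2,4}', test)

print(repeated)       # ['asda.']
print(non_capturing)  # ['www.1f23123.asda.com']
print(outer)          # ['www.1f23123.asda.']
```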

Parsing multiple groups from a regular expression

I am having a problem parsing some fields with the following regular expression, which I uploaded to rubular. The string that I am parsing is a special header from the banner of an FTP server. The line I need to process is:
special:pTXT1TOCAPTURE^:mTXT2TOCAPTURE^:uTXT3TOCAPTURE^
I thought that (?i)^special(:[pmu](.*?)\^)?* would do the trick; unfortunately this only gives me the last match, and I am not sure why, as I am lazily trying to capture each group. Also note that I should be able to capture an empty string as well, e.g. if the match string contains :u^
Match result:
special:pTXT1TOMATCH^:mTXT2TOMATCH^:uTXT3TOMATCH^
Match groups:
:uTXT3TOMATCH^
TXT3TOMATCH
The idea is that the line must start with the text 'special' followed by up to 3 capture groups, each introduced by p, m or u and matched lazily up to the next ^ symbol. I need to capture the text indicated above: basically I need to find TXT1TOCAPTURE, TXT2TOCAPTURE, and TXT3TOCAPTURE. There should be at least one of these three capture groups.
Thanks in advance
You have two problems with your regex: one is syntactic and one is conceptual.
Syntactic:
There is no ?* quantifier in PCRE, but Ruby treats it the same as *, which is a greedy quantifier. Applied to a capturing group, it keeps only the last iteration.
Conceptual:
Using a lazy quantifier .*? doesn't give you continuous matches; it stops as soon as the engine is satisfied. Even with the g modifier on, a next match will never occur, because there is no ^special at the position where the last match ended.
The solution is to use the \G token, which starts matching at the end of the previous match:
(?:special|(?!\A)\G):([pmu][^^]*\^)
You might want to use the \G anchor:
(?:(?:^special:)|\G(?!\A)\^:)[pmu]([^^]+)
See it working on rubular.com.
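For engines without \G (Python's built-in `re` module is one), a two-step approach gives a similar result: validate the whole line first, then collect the fields. This is a different technique from the \G patterns above, and the helper name `parse_special` is hypothetical. A minimal sketch:

```python
import re

# One field: ':' + one of p/m/u + any text up to the next '^' (may be empty).
FIELD = r":([pmu])([^^]*)\^"

def parse_special(line):
    """Return a list of (letter, text) pairs, or None if the line is invalid."""
    # Step 1: the whole line must be 'special' followed by one or more fields.
    if not re.fullmatch(r"special(?:%s)+" % FIELD, line, flags=re.IGNORECASE):
        return None
    # Step 2: collect every field; findall returns one tuple per field.
    return re.findall(FIELD, line, flags=re.IGNORECASE)

fields = parse_special("special:pTXT1TOCAPTURE^:mTXT2TOCAPTURE^:uTXT3TOCAPTURE^")
print(fields)
# [('p', 'TXT1TOCAPTURE'), ('m', 'TXT2TOCAPTURE'), ('u', 'TXT3TOCAPTURE')]
print(parse_special("special:u^"))  # [('u', '')]
```

This sketch does not enforce the "up to 3 fields" limit from the question; that would need a bounded quantifier such as {1,3} in step 1.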

Is there a way to match Regex based on previous capture group, not captured previously?

Okay, so the task is that there is a string that can either look like post, or post put or even get put post. All of these must be matched. Preferably deviations like [space]post, or get[space] should not be matched.
Currently I came up with this
^(post|put|delete|get)(( )(post|put|delete|get))*$
However I'm not satisfied with it, because I had to specify (post|put|delete|get) twice. It also matches duplications like post post.
I'd like to somehow use a backreference(?) to the first group so that I don't have to specify the same condition twice.
However, backreference \1 would help me only match post post, for example, and that's the opposite of what I want. I'd like to match a word in the first capture group that was NOT previously found in the string.
Is this even possible? I've been looking through SO questions, but my Google-fu is eluding me.
If you are using a PCRE-based regex engine, you may use subroutine calls like (?n) to recurse the subpatterns.
^(post|put|delete|get)( (?!\1)(?1))*$
                              ^^^^
Expression details:
^ - start of string
(post|put|delete|get) - Group 1 matching one of the alternatives as literal substrings
( (?!\1)(?1))* - zero or more sequences of:
- a space
(?!\1) - a negative lookahead that fails the match if the text after the current location is identical to the one captured into Group 1 due to backreference \1
(?1) - a subroutine call to the first capture group (i.e. it uses the same pattern used in Group 1)
$ - end of string
UPDATE
In order to avoid matching strings like get post post, you also need to add a negative lookahead into Group 1, so that the subroutine call is aware that we do not want to match a value that appears again later in the string.
^((post|put|delete|get)(?!.*\2))( (?1))*$
The difference is that we capture the alternations into Group 2 and add the negative lookahead (?!.*\2) to disallow any occurrences of the word we captured further in the string. The ( (?1))* remains intact: now, the subroutine recurses the whole Capture Group 1 subpattern with the lookahead.
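Subroutine calls such as (?1) are a PCRE feature; Python's built-in `re` module does not support them. As a rough adaptation (not the pattern above, and the names `UNIQUE_METHODS` and `is_valid` are hypothetical), the alternation can be written once inside a repeated non-capturing group, with the same "not seen again later" lookahead:

```python
import re

# Each iteration matches one word, rejects it if the same word occurs again
# later in the string, then requires either a single space followed by a
# non-space character or the end of the string.
UNIQUE_METHODS = re.compile(
    r"^(?:(post|put|delete|get)(?!.*\b\1\b)(?: (?=\S)|$))+$"
)

def is_valid(s):
    return UNIQUE_METHODS.match(s) is not None

for candidate in ["post", "post put", "get put post",
                  "post post", "get post post", " post", "get "]:
    print(repr(candidate), is_valid(candidate))
```

The first three candidates are accepted; the duplicates and the strings with a leading or trailing space are rejected.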

How to make sure the second group with alternative patterns matches the first group?

I have the following regular expression and I want the matching alternative of the first group to be the matching alternative of the second group.
(?i)^([a-z]+|\d+)-([a-z]+|\d+)$
Basically, if [a-z] matches in the first group I want only that pattern to match in the second group, and if \d matches in the first group I want only that pattern to match in the second group.
I tried an expanded regular expression, ([a-z]+)-([a-z]+)|(\d+)-(\d+), but that gave me 4 groups, either 1,2 or 3,4, with one set populated and the other set null.
I want to make it where there is always just groups 1,2 so I don't have to test to see which groups actually match.
Given the following input:
10-15
XX-ZZ
5-A
a-1000
10-15 should match
XX-ZZ should match
5-A should not match
a-1000 should not match
Use a Conditional
This is your regex:
/(?i)^([a-z]+|\d+)-([a-z]+|\d+)$/
Please see the following:
/(?i)^(?:([a-z]+)|\d+)-(?(1)[a-z]+|\d+)$/
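Python's `re` module also supports the (?(1)yes|no) conditional, so the pattern can be tried there against the four sample inputs. A small sketch (the names `CONDITIONAL` and `results` are made up for the example):

```python
import re

# Conditional pattern from the answer, without the surrounding slashes.
# Group 1 is set only when the first part is alphabetic; the conditional
# then demands letters after the dash, otherwise digits.
CONDITIONAL = re.compile(r"(?i)^(?:([a-z]+)|\d+)-(?(1)[a-z]+|\d+)$")

results = {s: bool(CONDITIONAL.match(s))
           for s in ["10-15", "XX-ZZ", "5-A", "a-1000"]}
print(results)
# {'10-15': True, 'XX-ZZ': True, '5-A': False, 'a-1000': False}
```

Note that this pattern validates the input but has a single capturing group, so it does not by itself return both halves as groups 1 and 2.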
Unless I am missing something obvious, I think the following regex should work for you:
(?im)^(?:([a-z]+-[a-z]+)|(\d+-\d+))$
Here is an answer that doesn't rely on conditional references:
(?i)^(?=[a-z]+-[a-z]+$|\d+-\d+$)([a-z\d]+)-([a-z\d]+)$
This is what it was used for.
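The lookahead version always fills groups 1 and 2, which is what the question asked for. A short sketch in Python's `re` module, run on the sample inputs (the names `SAME_KIND` and `parsed` are made up for the example):

```python
import re

# The lookahead checks that both sides are the same kind (letters or digits);
# the two groups after it then capture each side.
SAME_KIND = re.compile(
    r"(?i)^(?=[a-z]+-[a-z]+$|\d+-\d+$)([a-z\d]+)-([a-z\d]+)$"
)

parsed = {}
for s in ["10-15", "XX-ZZ", "5-A", "a-1000"]:
    m = SAME_KIND.match(s)
    parsed[s] = m.groups() if m else None

print(parsed)
# {'10-15': ('10', '15'), 'XX-ZZ': ('XX', 'ZZ'), '5-A': None, 'a-1000': None}
```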