Help with regex - email address matching - regex

I have the following regex, which is supposed to match email addresses:
[a-z0-9!#$%&'*+\\-/=?^_`{|}~][a-z0-9!#$%&'*+\\-/=?^_`{|}~.]{0,63}#[a-z0-9][a-z0-9\\-]*[a-z0-9](\\.[a-z0-9][a-z0-9\\-]*[a-z0-9])+$.
I have the following code in AS3:
var mails:Array = str.toLowerCase().match(pattern);
(pattern is RegExp with the mentioned regular expression).
I retrieve two results, when str is gaga#example.com:
gaga#example.com
.com
Why?

.com was captured by the last part of the regex (\\.[a-z0-9][a-z0-9\\-]*[a-z0-9]).
Regular expressions capture substrings matched by portions of the pattern that are enclosed in () for later use.
For example, the regex 0x([0-9a-fA-F]+) will match a hexadecimal number of the form 0x9F34 and capture the hex portion (9F34) in a separate group.
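To make the capture concrete, here is a minimal sketch in Python rather than AS3 (the sample string and variable names are made up for illustration). It assumes the character class is followed by +, so that all the hex digits end up in the group:

```python
import re

# group(0) is the whole match; group(1) is what the first (...) captured.
m = re.search(r"0x([0-9a-fA-F]+)", "value = 0x9F34;")

whole = m.group(0)   # the full match, including the 0x prefix
digits = m.group(1)  # only the part inside the parentheses

print(whole, digits)
```

The same split between "whole match" and "captured group" is what produces the two entries in the AS3 array.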

I'm not sure about your regex, there is a good tutorial about email validation here.
To me this reads:
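[a-z0-9!#$%&'*+\\-/=?^_`{|}~] # a single character from the chosen set
[a-z0-9!#$%&'*+\\-/=?^_`{|}~.]{0,63} # 0 to 63 characters from the same set, with the addition of .
# # the literal # that separates the local part from the domain
[a-z0-9] # a single alphanumeric
[a-z0-9\\-]* # any number of alphanumerics, with the addition of -
[a-z0-9] # a single alphanumeric
(\\.[a-z0-9][a-z0-9\\-]*[a-z0-9])+ # one or more times: a dot followed by a label of the same form
$ # end of line
. # any character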
As to why you get two substrings in your array: match() returns the whole match followed by the contents of each capturing group - see the docs

gaga#example.com is the match of the whole regular expression and .com is the last match of the first group ((\\.[a-z0-9][a-z0-9\\-]*[a-z0-9])).

([a-z0-9!#$%&'*+\\-/=?^_`{|}~][a-z0-9!#$%&'*+\\-/=?^_`{|}~.]{0,63}#[a-z0-9\\-]*[a-z0-9]+\\.([a-z0-9\\-]*[a-z0-9]))+$
This seems to work as expected (tested in Regex Tester). Last capturing group removed.

To add to what others have said:
There are two results because it matches both the whole email address, and the last group surrounded by parentheses.
If you don't want a group to be captured you can add ?: to the beginning of the group. Look in the AS documentation for non-capturing groups:
http://www.adobe.com/livedocs/flash/9.0/main/wwhelp/wwhimpl/js/html/wwhelp.htm?href=00000118.html#wp129703
"A noncapturing group is one that is used for grouping only; it is not "collected," and it does not match numbered backreferences. Use (?: and ) to define noncapturing groups, as follows:
var pattern = /(?:com|org|net)/;"
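As a rough check of that advice, here is a sketch in Python (not AS3). The helper as_array is a made-up name that imitates the array AS3's match() returns without the g flag: the whole match first, then each captured group. The pattern is the one from the question, split into two pieces only for readability.

```python
import re

# Pieces of the question's pattern (the split into LOCAL/LABEL is only for readability).
LOCAL = r"[a-z0-9!#$%&'*+\-/=?^_`{|}~][a-z0-9!#$%&'*+\-/=?^_`{|}~.]{0,63}"
LABEL = r"[a-z0-9][a-z0-9\-]*[a-z0-9]"

capturing = re.compile(LOCAL + "#" + LABEL + r"(\." + LABEL + ")+$")
non_capturing = re.compile(LOCAL + "#" + LABEL + r"(?:\." + LABEL + ")+$")

def as_array(pattern, text):
    """Hypothetical helper: whole match followed by the captured groups."""
    m = pattern.search(text.lower())
    return [] if m is None else [m.group(0), *m.groups()]

with_group = as_array(capturing, "gaga#example.com")
without_group = as_array(non_capturing, "gaga#example.com")

print(with_group)
print(without_group)
```

With the (?: ... ) version only the full address remains, since there is no capturing group left to report.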

Related

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?
Your regex expression was almost good:
(^\d+\w?)\s([\w\s]+).\s([\w\s]+)\s([A-Z]{2})\s(\d{5})
What I changed are the second and third groups: in both you used a group inside a group ((\w*\s?)+), where a class inside a group (([\w\s]+)) matches the same things and gives you the proper group content. Note that * and ? are literal characters inside a character class, not quantifiers, so they are left out of the class.
With your previous syntax, the inner group would be able to match an empty substring, since both quantifiers allow for a zero-length match (* is 0 to unlimited matches and ? is zero or one match). Since this group was repeated one or more times with the +, the last occurrence would match an empty string and only keep that.
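A small sketch of this effect in Python, using only the street part of the sample address (the variable names are made up). The nested version reports an extra inner group that holds only what its last repetition matched; making the inner group non-capturing leaves just the outer one:

```python
import re

street = "63rd Ave NE"

# Capturing group repeated inside another capturing group, as in the question.
nested = re.match(r"((\w*\s?)+)", street)

# Same shape, but the repeated inner group is non-capturing.
flat = re.match(r"((?:\w*\s?)+)", street)

print(nested.groups())  # outer group, then whatever the inner group matched last
print(flat.groups())    # outer group only
```

The outer group is the same in both cases; only the leftover inner group disappears.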
For this you'll need to use a non-capturing group, which is of the form (?:regex), where you currently see your "null results". This gives you the regex:
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: in ([^s]+) (?:[^s]+), the first group is captured into "Group 1" and the second one is not captured at all.
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Or take the part before the comma that ends on 2 or more uppercase characters, and then match optional non word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
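Both patterns can be tried quickly in Python. This is only a sketch against the single sample address from the question (the names by_comma and by_uppercase are made up); other address formats may well need a different pattern.

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

# Variant 1: rely on the comma, using a negated character class for the street.
by_comma = re.compile(r"^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$")

# Variant 2: the street ends on a word of 2 or more uppercase characters.
by_uppercase = re.compile(r"^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$")

groups_comma = by_comma.match(address).groups()
groups_upper = by_uppercase.match(address).groups()

print(groups_comma)
print(groups_upper)
```

For this input both variants produce the same five groups: house number, street, city, state and zip code.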

How can I remove something from the middle of a string with regex?

I have strings which look like this:
/xxxxx/xxxxx-xxxx-xxxx-338200.html
With my regex:
(?<=-)(\d+)(?=\.html)
It matches just the numbers before .html.
Is it possible to write a regex that matches everything that surrounds the numbers (matches the .html part and the part before the numbers)?
In your current pattern you already use a capturing group. In that case you might also match what comes before and after instead of using the lookarounds:
-(\d+)\.html
To get what comes before and after the digits, you could use 2 capturing groups:
^(.*-)\d+(\.html)$
In the replacement use the 2 groups.
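As an illustration, assuming Python's re.sub is an acceptable stand-in for whatever tool does the replacement, the two groups can be referenced as \1 and \2:

```python
import re

path = "/xxxxx/xxxxx-xxxx-xxxx-338200.html"

# Group 1 is everything up to and including the last "-" before the digits,
# group 2 is the ".html" suffix; the digits themselves are matched but not kept.
pattern = re.compile(r"^(.*-)\d+(\.html)$")

before, after = pattern.match(path).groups()
stripped = pattern.sub(r"\1\2", path)

print(before, after)
print(stripped)
```

Other engines spell the backreferences differently (for example $1$2), so the replacement string is the part most likely to need adapting.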
This should do the job:
.*-\d+\.html
Explanation: .* matches anything up to the last -, \d+ matches the sequence of digits that follows it, and \.html matches the literal .html (where \. represents the character .).
To capture groups, just do (.*-)(\d+)(\.html). This will put everything before the number in a group, the number in another group and everything after the number in another group.

Regex capture all dot characters excluding the dot character in front of '.com' in an email

Looking to do the following with regex using this email: tes.t+yolo#gmail.com
capture all . characters excluding the . that's in front of .com
In addition:
capture everything from + up to but not including the # character
The final result would look like: test#gmail.com
As I understand your approach was to remove the matched parts from the string in order to receive the final result. You would then need to do this in two steps (with two regex) - remove the dots, and remove the part between + and #.
Regex to capture all . except before .com: \.(?!com$)
Regex to capture everything from + up to not including the # character: \+[^#]*
You can use these together as a single regex expression: \+[^#]*|\.(?!com$)
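A quick way to check the combined expression is to replace every match with an empty string. This sketch uses Python's re.sub, which is an assumption since the question does not name a language:

```python
import re

email = "tes.t+yolo#gmail.com"

# Left alternative: "+" and everything up to (not including) "#".
# Right alternative: any "." that is not immediately followed by "com" at the end.
cleaned = re.sub(r"\+[^#]*|\.(?!com$)", "", email)

print(cleaned)
```

Note that the lookahead only protects a final .com; a different top-level domain would need its own alternative.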
Bonus
Alternatively, another approach would be to tackle this through groups, e.g:
^([^\.]+)(.)([^+]+)([^#]+)(\S+)$
You can then build the final result by combining several groups together.
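For the group-based approach, a sketch in Python (again an assumed language) could look like this. Which groups to keep is my reading of the pattern: group 1 is the text before the first dot, group 3 the text up to +, and group 5 everything from # onward.

```python
import re

email = "tes.t+yolo#gmail.com"

m = re.match(r"^([^\.]+)(.)([^+]+)([^#]+)(\S+)$", email)

# Groups 2 (the dot) and 4 (the "+..." part) are the pieces to drop.
parts = m.groups()
rebuilt = m.group(1) + m.group(3) + m.group(5)

print(parts)
print(rebuilt)
```

This only handles a single dot before the +, so the substitution approach is likely the more general of the two.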

python regex non-capture group handling

(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
used to match string
123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23
How come this works in regexr.com but Python 3.5.1 can't find a match
r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'
can match up to
123 FEX-1-80 Online N2K-C2248TP
but the second hyphen - in group(4) is not matched
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
Just a comment, not really an answer but for the sake of clarity I have put it as an answer.
Being relatively new to regular expressions, one should use the verbose mode. With this, your expression becomes much much more readable:
(1[0-9]{2})\s+ # three digits, the first one needs to be 1
(\w+(?:-\w+)+)\s+ # a word character (wc), followed by - and wcs
(\w+)\s+ # another word
(\w+(?:-\w+)+)\s+ # same expression as above
(\w+) # another word
Also, check if your (second and fourth) expression could be rewritten as [\w-]+ - it is not the same as yours and will match other substrings, but try to avoid nested parentheses in general.
Concerning your question, the second string cannot be matched as you made all of your expressions mandatory (and group 5 is missing in the second example, so it will fail).
See a demo on regex101.com.
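In Python the verbose layout above can be used directly by passing re.VERBOSE, which ignores whitespace in the pattern and treats # as the start of a comment. A minimal sketch with the input string from the question:

```python
import re

pattern = re.compile(r"""
    (1[0-9]{2})\s+       # three digits, the first one being 1
    (\w+(?:-\w+)+)\s+    # word characters followed by one or more -word parts
    (\w+)\s+             # another word
    (\w+(?:-\w+)+)\s+    # same shape as the second group
    (\w+)                # another word
""", re.VERBOSE)

line = "123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23"
fields = pattern.match(line).groups()

print(fields)
```

Because whitespace is ignored in this mode, a literal space in the pattern would have to be written as \s, [ ] or an escaped space.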
This regular expression matches the full input string:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
This one doesn't:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))
The latter is missing a + after the last non-capturing group, and it's missing the \s+(\w+) at the end that matches the SSDFDFWFw23r23 at the end of the input string.
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
I'm not sure I follow. A non-capturing group is really just there to group a part of a regular expression.
(?:-\w+) or just -\w+ will both match a hyphen (-) followed by one or more "word" characters (\w+). It doesn't matter whether that regular expression is in a non-capturing group or not. If you want to match repetitions of that pattern, you can use the + modifier after the non-capturing group, e.g. (?:-\w+)+. That pattern will match a string like -foo-bar-baz.
So the reason your second regular expression doesn't match the repeated pattern is because it's lacking the + modifier.
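The difference the + makes can be seen in a short Python sketch (variable names are made up; the strings are the ones used in this thread):

```python
import re

# Without a quantifier the group matches exactly one "-word" part.
once = re.match(r"(?:-\w+)", "-foo-bar-baz").group(0)

# With + it matches one or more consecutive "-word" parts.
repeated = re.match(r"(?:-\w+)+", "-foo-bar-baz").group(0)

# The shortened pattern from the question, run against the full input line.
short = r"(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))"
line = "123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23"
group4 = re.match(short, line).group(4)

print(once, repeated, group4)
```

The shortened pattern still matches because re.match does not require the match to reach the end of the string; it simply stops after the first hyphenated part of group 4.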

perl style regex to match nth item in a list

Trying to match the third item in this list:
/text word1, word2, some_other_word, word_4
I tried using this perl style regex to no avail:
([^, ]*, ){$m}([^, ]*),
I want to match ONLY the third word, nothing before or after, and no commas or whitespace. I need it to be a regex, this is not in a program but UltraEdit for a word file.
What can I use to match some_other_word (Or anything third in the list.)
Based on some input by the community members I made the following change to make the logic of the regex pattern clearer.
/^(?:(?:.(?<!,))+,){2}\s*(\w+).*/x
Explanation
/^ # 1.- Match start of line.
(?:(?:.(?<!,))+ # 2.- Match but don't capture a sequence of characters not containing a comma ...
,) # 3.- followed by a comma
{2} # 4.- (exactly two times)
\s* # 5.- Match any optional space
(\w+) # 6.- Match and capture a sequence of the characters represented by \w, at least one character long.
.* # 7.- Match anything after that if necessary.
/x
This is the one suggested previously.
/(?:\w+,?\s*){3}(\w+)/
Try group 1 of this regex:
^(?:.*?,){2}\s*(.*?)\s*(,|$)
See a live demo using your sample, plus an edge case, input showing capture in group 1.
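The question is about UltraEdit rather than a program, but the pattern is easy to sanity-check in Python. In this sketch nth_item is a made-up helper that generalizes the {2} to n - 1 skipped items:

```python
import re

def nth_item(line, n):
    """Hypothetical helper: return the n-th (1-based) comma-separated item, or None."""
    # Skip n - 1 "anything up to a comma" chunks, then capture the next item in group 1.
    pattern = r"^(?:.*?,){%d}\s*(.*?)\s*(,|$)" % (n - 1)
    m = re.match(pattern, line)
    return m.group(1) if m else None

line = "/text word1, word2, some_other_word, word_4"

third = nth_item(line, 3)
fourth = nth_item(line, 4)
missing = nth_item(line, 5)

print(third, fourth, missing)
```

The (,|$) at the end is what lets the last item in the list match even though no comma follows it.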
A regular expression can't return just one of the matches here, because your string has more than one occurrence of the same pattern and regular expressions have no selective-return option. So you can pick whichever item you want from the returned array.
,\s?([^,]+)
See it in action; the group captured by the 2nd match is what you need.
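In a language with a "find all" function this approach amounts to indexing into the list of matches. A sketch in Python, assuming the sample line from the question:

```python
import re

line = "/text word1, word2, some_other_word, word_4"

# Each match starts at a comma; group 1 is the item that follows it.
# The first item has no comma in front of it, so it is not in the list.
items = re.findall(r",\s?([^,]+)", line)
wanted = items[1]  # second match, i.e. the third item of the list

print(items)
print(wanted)
```

This relies on picking the match by position afterwards, which an editor's search box may not offer.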