Help to compose regular expression - regex

I have the following string: user1 fam <user#example.com>, user2 fam <user2#example.com>, ...
How can I get the mail addresses from this string with a regular expression? I need the output to be a list of mail addresses:
user#example.com
user2#example.com
I tried:
<.*>
But the output includes the < >:
<user#example.com>
<user2#example.com>
Thank you.
P.S. Thank you #xanatos for the comment; I use Erlang.

As the others have said, but to make it faster:
<([^>]*)>
In this way the regex won't have to backtrack (with the other regexes suggested, the regex will match the whole string and then begin to roll back to find a >).
I'll add that, for historical reasons, there are small differences between the . and, for example, [\s\S]. Both match any character, except that the . does not match \n while [\s\S] does. So by using [^>] you are also matching the \n, but this shouldn't be a problem for what you are doing. http://www.regular-expressions.info/dot.html
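A quick way to see the newline difference is to try the three forms on a string that contains a line break. This is only a sketch in Python's re module (an assumption; the question is about Erlang, whose re module is PCRE-based and may differ in details), and the sample text is made up:

```python
import re

text = "<first\nsecond>"

# By default the dot does not match a newline, so this finds nothing.
dot_match = re.search(r"<(.*)>", text)

# [\s\S] and the negated class [^>] both match a newline.
any_char_match = re.search(r"<([\s\S]*)>", text)
negated_match = re.search(r"<([^>]*)>", text)

# re.DOTALL makes the dot match a newline as well.
dotall_match = re.search(r"<(.*)>", text, re.DOTALL)

print(dot_match)                      # None
print(repr(any_char_match.group(1)))  # 'first\nsecond'
print(repr(negated_match.group(1)))   # 'first\nsecond'
print(repr(dotall_match.group(1)))    # 'first\nsecond'
```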
Just to be complete, because it's a problem that often happens, there is another variant:
<((?:(?!>).)*)>
(you can substitute the . with [\s\S] if you want, or use the SingleLine option if your language supports it, to make the . behave in a different way). The point here is that the "stop" expression can be longer than one character. Instead of (?!>) you could have inserted (?!%%) and it would have stopped at %%. BUT I'm not sure this variant works with Erlang (I hadn't noticed the new tag... It wasn't there when I originally read the question and I'm not an Erlang programmer... And it seems at least two Erlang programmers have different opinions on the matter :-) )
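To illustrate the multi-character "stop" expression, here is a small sketch in Python (an assumption, not Erlang). The << opening delimiter and the sample text are invented for the example; only the (?:(?!%%).)* part comes from the answer above:

```python
import re

text = "start <<alpha % beta%% tail <<gamma%%"

# (?:(?!%%).)* consumes one character at a time, but only while "%%" does not
# start at the current position, so a single "%" is allowed inside the match.
pattern = re.compile(r"<<((?:(?!%%).)*)%%")

found = pattern.findall(text)
print(found)  # ['alpha % beta', 'gamma']
```

A negated class such as [^%]* would have stopped at the lone % in "alpha % beta", which is why the lookahead form is needed when the terminator is longer than one character.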

You need to use the ungreedy option so that it only matches the individual bracket pairs, and global so that you get all the matches.
You also need {capture, all_but_first, list} so that you get the actual values (list can also be binary if you prefer binary results). all_but_first tells re not to return the whole match (which would include the <>), just the group.
Result:
1> S.
"user1 fam <user#example.com>, user2 fam <user2#example.com>, "
2> re:run(S, "<(.+)>", [ungreedy, global, {capture, all_but_first, list}]).
{match,[["user#example.com"],["user2#example.com"]]}

Use groups. See your regex engine's documentation for more details.
>>> re.findall('<(.*?)>', 'user1 fam <user#example.com>, user2 fam <user2#example.com>, ...')
['user#example.com', 'user2#example.com']

Keep it simple and use <([^>]*)> which is about as fast as it can get and works for most versions of regular expressions. This is faster as it never has to backtrack while using <(.*?)> will cause backtracking.

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret the whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including #Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
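Since the question mentions taking the original regex as an argument in Python, the lookahead idea could be wrapped up roughly like this. It is a sketch only: prefilter is a hypothetical helper name, and it uses a non-capturing group (?:...) instead of plain parentheses so the group numbering of the original pattern is left alone:

```python
import re


def prefilter(original, prefix):
    """Build a pattern that matches what `original` matches, but only if
    the match starts with the literal string `prefix`."""
    # (?=...) asserts the prefix without consuming it; (?:...) keeps any
    # top-level | in the original pattern from escaping the lookahead.
    return re.compile("(?=" + re.escape(prefix) + ")(?:" + original + ")")


words_pattern = prefilter("cat|car|bat", "ca")
word_hits = [w for w in ["cat", "car", "bat"] if words_pattern.fullmatch(w)]
print(word_hits)  # ['cat', 'car']

four_pattern = prefilter("[a-z]{4}", "ca")
four_hits = [w for w in ["cart", "bark", "cab", "carts"] if four_pattern.fullmatch(w)]
print(four_hits)  # ['cart']
```

Because the lookahead consumes nothing, the [a-z]{4} case still matches exactly four letters, not six.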
Ok, thanks to #Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See #Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Regex to match text between two strings, including the strings

I'm trying to fix some conflicts in a git merge. There are a lot of <<<<<<< HEAD and ======= blocks that I want to be able to find and replace with an empty string in a lot of files.
I found this regex pattern that correctly matches everything between the two strings, but it leaves out the beginning and ending strings, and I want to be able to match them also.
(?s)(?<=<<<<<<< HEAD).*?(?=\=\=\=\=\=\=\=)
So, match <<<<<<< HEAD, ======= and everything between them to do a search/replace.
Can anyone help me out? I would be running this on files where I'm certain I don't want anything between those strings. I guess that's also why I didn't try a "use theirs" flag when doing the merge, because I need to see the files first.
Just leave out the look-arounds mentioned by Xufox
(?s)(<<<<<<< HEAD)(.*?)(\=\=\=\=\=\=\=)
The .*? is wrapped in parentheses so you can reference it in the replacement: \1 is the opening marker, \2 is everything in between, and \3 is the closing marker (but the syntax can vary).
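As a rough illustration in Python (an assumption; the question does not say which tool runs the search/replace), the pattern above removes the HEAD side of a conflict, markers included. The sample text is made up:

```python
import re

conflicted = (
    "line before\n"
    "<<<<<<< HEAD\n"
    "our change\n"
    "=======\n"
    "their change\n"
    ">>>>>>> branch\n"
    "line after\n"
)

# (?s) lets the dot cross line breaks; .*? stops at the first ======= it finds.
pattern = re.compile(r"(?s)(<<<<<<< HEAD)(.*?)(=======)")

# Group 2 is the text between the two markers.
between = pattern.search(conflicted).group(2)

# Replacing the whole match with an empty string drops both markers and
# everything between them; the newline after ======= is left behind.
cleaned = pattern.sub("", conflicted)

print(repr(between))
print(cleaned)
```

Note that the >>>>>>> line is not touched by this pattern, so it would still need a second pass.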
I think you might be asking the wrong question here. The best way to actually handle merge conflicts is with a merge tool. You should look into something like meld. And specifically setting git merge tool to use that. Manual merges are not fun...
Use a pretty ui to analyze the merge instead
You want to match on the opening and closing tags of the conflict sections?
Parentheses are primarily used for capturing groups. If you write it like so:
(<<<<<<< HEAD)(.*\s)+(\s*=======)
it will create three capturing groups (four if you count group 0, the whole match) whose contents you can access.
Tested: http://regexr.com/

Why /^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i does not work as expected

I have this regex for email validation (assume only x#y.com, abc#defghi.org, something#anotherhting.edu are valid)
/^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i
But #abc.edu and abc#xyz.eduorg are both valid as to the regex above. Can anyone explain why that is?
My approach:
there should be at least one character or number before #
then there comes #
there should be at least one character or number after # and before .
the string should end with either edu, com, or org.
Try this
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
and it should become clear - you need to group those alternatives, otherwise you can match any string that has 'edu' in it, or any string that ends with org. To put it another way, your version matches any of these patterns
^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)
(edu)
(org)$
It's worth pointing out that the original poster is using this as a regex learning exercise. This would be a terrible regex for actual production use! It's a thorny problem - see Using a regular expression to validate an email address for a lot more depth.
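The precedence problem is easy to reproduce. Below is a small sketch in Python (an assumption; the /.../i syntax in the question suggests JavaScript or similar) that runs both versions over the addresses mentioned in the question:

```python
import re

# Original: the | splits the WHOLE pattern into three alternatives.
broken = re.compile(r"^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$", re.IGNORECASE)

# Fixed: the alternatives are grouped, and the second class gets its +.
fixed = re.compile(r"^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$", re.IGNORECASE)

samples = ["x#y.com", "abc#defghi.org", "#abc.edu", "abc#xyz.eduorg"]

broken_hits = [s for s in samples if broken.search(s)]
fixed_hits = [s for s in samples if fixed.search(s)]

print(broken_hits)  # all four samples
print(fixed_hits)   # ['x#y.com', 'abc#defghi.org']
```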
Your grouping parentheses are incorrect:
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
Can also just use one case as you're using the i modifier:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
N.B. you were also missing a + from the second set, I assume this was just a typo...
What you have written is the equivalent of matching something that:
begins with [a-zA-Z0-9]+#[a-zA-Z0-9]\.com
or contains edu
or ends with org
What you were looking for was:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
Your regex looks ok.
I guess you are using a find function instead of a match function.
Without knowing what you use it is a bit difficult to say, but in Python you would write
import re
pattern = re.compile(r'^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$')
pattern.match('#abc.edu')   # returns None; use this to validate an input
pattern.search('#abc.edu')  # matches, finds the edu
Try this:
[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)+$
You forgot the + quantifier, which you need if you want to catch any combination of (com|edu|org).
Update: as far as I can see, you missed the + after the second [a-zA-Z0-9] too.

RegEx: Match Mr. Ms. etc in a "Title" Database field

I need to build a RegEx expression which gets its text strings from the Title field of my Database. I.e. the complete strings being searched are: Mr. or Ms. or Dr. or Sr. etc.
Unfortunately this field was a free field and anything could be written into it. e.g.: M. ; A ; CFO etc.
The expression needs to match on everything except: Mr. ; Ms. ; Dr. ; Sr. (NOTE: The list is a bit longer but for simplicity I keep it short.)
WHAT I HAVE TRIED SO FAR:
This is what I am using successfully on another field:
^(?!(VIP)$).* (This will match every string except "VIP")
I rewrote that expression to look like this:
^(?!(Mr.|Ms.|Dr.|Sr.)$).*
Unfortunately this did not work. I assume this is because the "." (dot) is a reserved symbol in RegEx and needs special handling.
I also tried:
^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*
But no luck there either.
I looked around in the forum and tested some other solutions but could not find any which works for me.
I would like to know how I can build my expression so that it searches the complete (short) string and matches everything except "Mr." etc. Any help is appreciated!
Note: My Question might seem unusual and seems to have many open ends and possible errors. However the rest of my application is handling those open ends. Please trust me with this.
If you want your string simply to not start with one of those prefixes, then do this:
^(?!([MDS]r|Ms)\.).*$
The above simply ensures that the beginning of the string (^) is not followed by one of your listed prefixes. (You shouldn't even need the .*$ but this is in case you're using some engine that requires a complete match.)
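For what it's worth, here is how that first pattern behaves on a few sample values, sketched in Python (an assumption; the question does not name the engine). The sample list is invented:

```python
import re

# Reject any string that STARTS with Mr., Dr., Sr. or Ms.
not_prefixed = re.compile(r"^(?!([MDS]r|Ms)\.).*$")

values = ["Mr.", "Dr. Smith", "M.", "CFO", "Ms."]
kept = [v for v in values if not_prefixed.match(v)]

print(kept)  # ['M.', 'CFO']
```

"Dr. Smith" is rejected as well, because the lookahead only checks how the string begins, not whether the prefix is the whole string.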
If you want your string to not have those prefixes anywhere, then do:
^(.(?!([MDS]r|Ms)\.))*$
The above ensures that every character (.) is not followed by one of your listed prefixes, to the end (so the $ is necessary in this one).
I just read that your list of prefixes may be longer, so let me expand for you to add:
^(.(?!(Mr|Ms|Dr|Sr)\.))*$
If you mean that the string must not consist entirely of one of the prefixes, then just do this:
^(?!(Mr|Ms|Dr|Sr)\.$).*$
And if you want to make the dot optional:
^(?!(Mr|Ms|Dr|Sr)\.?$).*$
With | we can list any number of prefix patterns to match against the string.
// Dots are escaped so they match a literal "."; [A-Za-z]+ matches the rest of the name.
var pattern = /^(Mrs\.|Mr\.|Ms\.|Dr\.|Er\.).?[A-Za-z]+$/;
var str = "Mrs.Panchal";
console.log(str.match(pattern));
this may do it
/(?!.*?(?:^|\W)(?:(?:Dr|Mr|Mrs|Ms|Sr|Jr)\.?|Miss|Phd|\+|&)(?:\W|$))^.*$/i
from that page I mentioned
Rather than trying to construct a regex that matches anything except Mr., Ms., etc., it would be easier (if your application allows it) to write a regex that matches only those strings:
/^(Mr|Ms|Dr|Sr)\.$/
and just swap the logic for handling matching vs non-matching strings.
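The "swap the logic" suggestion might look like this in Python (an assumption about the host language; is_free_text is a hypothetical name and the sample values come from the question plus one extra):

```python
import re

# Matches ONLY the accepted titles, nothing else.
TITLE = re.compile(r"(Mr|Ms|Dr|Sr)\.")


def is_free_text(title):
    """True when the field is anything other than one of the accepted titles."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return TITLE.fullmatch(title) is None


values = ["Mr.", "Ms.", "M.", "A", "CFO", "Dr.", "Mrs."]
free_text = [v for v in values if is_free_text(v)]

print(free_text)  # ['M.', 'A', 'CFO', 'Mrs.']
```

Extending the list of titles then only means adding alternatives to one simple pattern, with no lookahead involved.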
re.sub(r'^([MmDdSs][RSrs]{1,2}|[Mm]iss)\.{0,1} ','',name)

Exponential Regex Problem

Can someone help me rewrite this regex to be non-exponential?
I'm using perl to parse email data. I want to extract email addresses from the data. Here is a shortened version of the regex that I've been using:
my $email_address = qr/(?:[^\s#<>,":;\[\]\(\)\\]+?|"[^\"]+?")#/i
For simplicity I've removed the later domain part of the regex. (It isn't causing any problems.)
This will find an RFC compliant email address that either contains non-email meta chars OR a "quoted" string followed by #. Using the OR '|' part of the regex with the two different multicharacter patterns creates an exponential problem.
The problem appears when I unleash this on a line of data that is several thousand characters long.
$ wc line7.txt
1 221 497819 line7.txt
(I'm sorry but I cannot provide input data at this time, I may be able to mock some up later.)
Much like rewriting (a*b*)* to (a|b)*, I need to rewrite this regex.
Splitting it into two separate regexes would solve my problem, but it creates more work in code changes than I am willing to perform at this point.
The eventual target machine is on a Hadoop cluster. So I would like to avoid CPAN modules that don't come with Hadoop's version of perl. (I'll have to check if Email::Find can even be used.) This is a problem I encountered at work.
Have you considered the CPAN modules Email::Valid and Email::Find?
Unless this is for your own fun or education, you almost certainly shouldn't be trying to write your own email address matching regex. See Mastering Regular Expressions by Jeffrey Friedl if you want to know what such a thing actually looks like. (Hint: it's 6,598 bytes long.)
qr/(?:(?>[^\s#<>,":;\[\]\(\)\\]+)|"[^\"]{0,62}")#/i
The (?>expression) part prevents backtracking. It should be safe because there can be no overlap between the non-quoted part and the quoted part.
I removed the lazy repeats +? because the parts of the alternation already look for the # and " respectively. Phrases could be a large source of backtracking, so I looked at the Wikipedia article, which states that the local part (before the #) can be only 64 characters long; subtracting the two quotes yields {0,62} (if ""# is not valid, then change it to {1,62}). I do not intend for this to be a completely functional email parser; that is your job. I simply provide help for the catastrophic backtracking. Best of luck!
Non-greedy matches are expensive as I understand it, if you are not careful. It may do lots and lots of backtracking. http://blog.stevenlevithan.com/archives/greedy-lazy-performance
One trick I often use is to destructively pull bits of the data out once I figure out it cannot hold any data. Another trick is to do a non-backtrack match (\#{1}+ or the like) if there is something which might signal to you that there is absolutely an email address which you need to parse around there.
In your specific example, perhaps you can limit the number of characters that can be in an email address? Instead of + on the left-hand-side of the #, use {1,80}
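A sketch of that length cap, written in Python rather than Perl (an assumption made only to keep the example short; the character class is copied from the question and the sample line is invented). The {1,80} bound comes from this answer and {0,62} from the earlier one:

```python
import re

# Bounded repetition: from any starting position the engine can try at most
# 80 (or 62) lengths before giving up, instead of scanning a whole long run.
local_part = re.compile(r'(?:[^\s#<>,":;\[\]()\\]{1,80}|"[^"]{0,62}")#')

line = 'From: "John Doe" <john.doe#example.com>, "quoted name"#example.org'
found = [m.group(0) for m in local_part.finditer(line)]

print(found)  # ['john.doe#', '"quoted name"#']
```

This does not remove backtracking, it only puts a ceiling on how much work a single failed attempt can cost.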
Just changing the +? to + should do it; the ? says to prefer matching as few times as possible, which is not at all what you want.
Either I'm mis-seeing something, or your problem is in the part of the regex you aren't showing us. Or there's some difference between what you are showing and what you are actually trying. In any case, you may try changing the +? to ++ or enclosing the whole (?:...)# in (?> ... ).
Is there a + before the # in your actual regex? If so, just changing the (?: to (?> and making that + be ++ would be a very good idea.
If many lines do not contain an E-mail address, how about a quick pre-test before applying the RE:
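use List::Util qw(max);

my $ix = index( $line, '#' );    # position of the first #, or -1 if there is none
if ( $ix > 0 )
{ # test E-mail address here
. . .
# and another wild idea you could try to cut down the lengths of the strings actually parsed:
my $maxLength = 100;    # maximum supported E-mail address length (up to the #)
my $start = max( $ix - $maxLength, 0 );
# the window ends at the # itself, so the trailing # in the RE can still match
if ( substr( $line, $start, $ix - $start + 1 ) =~ /YourRE/ ) { . . . }
}
(Yes, > 0 rather than >= 0: a line starting with a # cannot be an E-mail address.)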
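The same pre-test and windowing idea, sketched in Python for comparison (an assumption; the question is about Perl). find_local_parts, MAX_LENGTH and the sample line are hypothetical, and the pattern is the shortened one from the question with \Z added so that it must end at the # that was just located:

```python
import re

# Shortened pattern from the question, anchored to the end of the window.
LOCAL_PART_AT_END = re.compile(r'(?:[^\s#<>,":;\[\]()\\]+|"[^"]+")#\Z')
MAX_LENGTH = 100  # hypothetical cap on the local part, as in the snippet above


def find_local_parts(line):
    """Return every local part (with its trailing #) found in line."""
    found = []
    ix = line.find("#")
    while ix != -1:
        if ix > 0:  # a # at position 0 has no local part in front of it
            start = max(ix - MAX_LENGTH, 0)
            # Only the short window before the # is handed to the regex.
            match = LOCAL_PART_AT_END.search(line[start:ix + 1])
            if match:
                found.append(match.group(0))
        ix = line.find("#", ix + 1)
    return found


line = "x" * 5000 + " alice#example.com, " + "y" * 5000 + ' "bob b"#example.org'
print(find_local_parts(line))  # ['alice#', '"bob b"#']
```

The regex never sees more than MAX_LENGTH + 1 characters at a time, so the cost per # is bounded regardless of how long the line is.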