Exponential Regex Problem - regex

Can someone help me rewrite this regex to be non-exponential?
I'm using perl to parse email data. I want to extract email addresses from the data. Here is a shortened version of the regex that I've been using:
my $email_address = qr/(?:[^\s#<>,":;\[\]\(\)\\]+?|"[^\"]+?")#/i
For simplicity I've removed the later domain part of the regex. (It isn't causing any problems.)
This will find an RFC-compliant email address whose local part is either a run of characters that are not email metacharacters OR a "quoted" string, followed by #. Using the OR '|' part of the regex with the two different multicharacter patterns creates an exponential problem.
The problem shows up when I unleash this on a line of data that is several thousand characters long.
$ wc line7.txt
1 221 497819 line7.txt
(I'm sorry but I cannot provide input data at this time, I may be able to mock some up later.)
Much like rewriting (a*b*)* to (a|b)*, I need to rewrite this regex.
Splitting it into two separate regexes would solve my problem, but it creates more work in code changes than I am willing to take on at this point.
The eventual target machine is on a Hadoop cluster. So I would like to avoid CPAN modules that don't come with Hadoop's version of perl. (I'll have to check if Email::Find can even be used.) This is a problem I encountered at work.

Have you considered the CPAN modules Email::Valid and Email::Find?
Unless this is for your own fun or education, you almost certainly shouldn't be trying to write your own email address matching regex. See Mastering Regular Expressions by Jeffrey Friedl if you want to know what such a thing actually looks like. (Hint: it's 6,598 bytes long.)

qr/(?:(?>[^\s#<>,":;\[\]\(\)\\])+|"[^\"]{0,62}")#/i
The (?>expression) part prevents backtracking. It should be safe because there can be no overlap between the non-quoted part and the quoted part.
I removed the lazy repeats +? because the parts of the alternation already look for the # and " respectively. Phrases could be a large source of backtracking, so I looked at the Wikipedia article, which states that the local part (before the #) can be only 64 characters long; subtracting the two quotes yields {0,62}. (If ""# is not valid, change it to {1,62}.) I do not intend this to be a completely functional email parser. That is your job. I simply provide help with the catastrophic backtracking. Best of luck!
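For anyone who wants to try the bounded-repeat idea outside Perl, here is a minimal Python sketch. It is an assumption-laden illustration, not the answer's exact regex: it skips the atomic group (older Python versions do not support (?>...)) and bounds both branches instead, using {1,64} for the unquoted part and {0,62} for the quoted part. The names LOCAL_PART and find_local_parts are made up for this example.

```python
import re

# Hypothetical sketch: both alternatives are length-bounded, so the work done
# at any starting position is capped no matter how long the line is.
LOCAL_PART = re.compile(r'(?:[^\s#<>,":;\[\]()\\]{1,64}|"[^"]{0,62}")#')

def find_local_parts(line):
    """Return every local part (including the trailing #) found in line."""
    return [m.group(0) for m in LOCAL_PART.finditer(line)]

print(find_local_parts('foo bar user#example.com, "quoted name"#example.org'))
```

With the sample line above this prints ['user#', '"quoted name"#']. A run longer than 64 characters before the # is matched only on its last 64 characters.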

As I understand it, non-greedy matches are expensive if you are not careful: they can do lots and lots of backtracking. http://blog.stevenlevithan.com/archives/greedy-lazy-performance
One trick I often use is to destructively pull bits of the data out once I figure out it cannot hold any data. Another trick is to do a non-backtrack match (\#{1}+ or the like) if there is something which might signal to you that there is absolutely an email address which you need to parse around there.
In your specific example, perhaps you can limit the number of characters that can be in an email address? Instead of + on the left-hand-side of the #, use {1,80}

Just changing the +? to + should do it; the ? says to prefer matching as few times as possible, which is not at all what you want.
Either I'm mis-seeing something, or your problem is in the part of the regex you aren't showing us. Or there's some difference between what you are showing and what you are actually trying. In any case, you may try changing the +? to ++ or enclosing the whole (?:...)# in (?> ... ).
Is there a + before the # in your actual regex? If so, just changing the (?: to (?> and making that + be ++ would be a very good idea.

If many lines do not contain an E-mail address, how about a quick pre-test before applying the RE:
my $maxLength = 100; # maximum supported e-mail address length (up to the #)
if ( ( my $ix = index( $line, '#' ) ) > 0 ) # extra parentheses: assign first, then compare
{ # test for an e-mail address here
. . .
# another wild idea you could try, to cut down the length of the string actually parsed:
my $start = $ix > $maxLength ? $ix - $maxLength : 0;
if ( substr( $line, $start, $ix - $start + 1 ) =~ /YourRE/ ) { . . . } # the window ends at the #
}
(Yes, > 0 rather than >= 0: a line starting with a # cannot be an e-mail address.)
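The same pre-test idea can be sketched in Python. This is only an illustration under my own assumptions: the window size of 100, the names LOCAL_AT_END and local_parts, and the choice to anchor the regex at the end of the window are not from the question. The point is that the regex only ever sees a short slice in front of each #, so the cost per # is bounded regardless of line length.

```python
import re

# Hypothetical pattern: the local part must end exactly where the window ends,
# i.e. immediately before the # that was located with str.find().
LOCAL_AT_END = re.compile(r'(?:[^\s#<>,":;\[\]()\\]+|"[^"]+")$')

def local_parts(line, max_len=100):
    """Collect local parts by inspecting at most max_len characters before each #."""
    found = []
    ix = line.find('#', 1)  # start at 1: a # in column 0 has no local part
    while ix != -1:
        window = line[max(ix - max_len, 0):ix]
        m = LOCAL_AT_END.search(window)
        if m:
            found.append(m.group(0))
        ix = line.find('#', ix + 1)
    return found

print(local_parts('x' * 1000 + ' user#example.com and "a b"#example.org'))
```

For the sample line this prints ['user', '"a b"'], even though the line starts with a thousand filler characters.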

Related

Extracting data using regex from bank feed

I am looking to extract some text from a raw credit card feed for a workflow. I have gotten almost where I want to but am struggling with the final piece of information I'm trying to extract.
An example of the raw feed is:
LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE
I am looking to extract this from the above:
(ICGROUP,INC.MELBOURNE)June5UNITEDSTATESDOLLARAUD(50.07)includesconversioncommissionof
with the brackets representing the two groups I am after. The consistent parts across all instances of what I'm trying to extract are:
DIGITS (TEXT) DATE TEXT AMOUNT includesconversioncommissionof
I have been able to use the regex:
([A-Z][a-z]\d)[A-Z]AUD(\d\,?\d+?.\d*)includesconversioncommissionofAUD
to get the date and the amount. I am struggling to find a way to also get the text ICGROUP,INC.MELBOURNE, as in the example above.
I have tried putting \d\d(.*) before the above regex but that doesn't work for some reason.
Would appreciate if anyone is able to help with what I'm after!
The closest I think we can get (PCRE) is something like:
/
[\d,.]+ # a currency value to bookend
(.+?) # capture everything in-between
[A-Z][a-z]+\d+ # a month followed by a day, e.g. "June5"
.+? # everything in-between
([\d,.]+) # capture a currency value
includesconversioncommissionof # our magic token to bookend
/x
The technique here is to pit greedy expressions against non-greedy expressions in a very deliberate way. Let me know if you have any questions about it. I would be extremely hesitant to put this in production—or even trust its output as an ad-hoc pass—without rigorous testing!
I'm using the pattern [\d,.] for currency, but you can replace that with something more sophisticated, especially if you expect weird formats and currency symbols. The biggest potential pitfall here is if the ICGROUP,INC.MELBOURNE token might start with a number. Then you'll definitely need a more sophisticated currency pattern!
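To check the greedy-versus-lazy pattern above against the sample feed, here is a small Python sketch. Python's re.VERBOSE plays the role of the /x flag; the names FEED_ENTRY, extract_merchant_and_amount and SAMPLE are made up for the example, and only the one sample line from the question has been tried.

```python
import re

# The same pattern as above, written with Python's verbose flag.
FEED_ENTRY = re.compile(r"""
    [\d,.]+                          # a currency value to bookend
    (.+?)                            # capture everything in between
    [A-Z][a-z]+\d+                   # a month followed by a day, e.g. "June5"
    .+?                              # everything in between
    ([\d,.]+)                        # capture a currency value
    includesconversioncommissionof   # the token that bookends the entry
""", re.VERBOSE)

def extract_merchant_and_amount(feed):
    """Return (merchant, amount) for the first converted entry, or None."""
    m = FEED_ENTRY.search(feed)
    return m.groups() if m else None

SAMPLE = ("LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNE"
          "June5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionof"
          "AUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE")

print(extract_merchant_and_amount(SAMPLE))
```

On the sample this prints ('ICGROUP,INC.MELBOURNE', '50.07'). It works here because the merchant name contains no lowercase letters, so "June5" is the first place the month-and-day pattern can match.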
Here's what I've got (in php).
$string = "LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE";
$cleaned = preg_replace("/^(LEO'SFINEFOOD&WINEHARTWELL)([A-Za-z]{3,9})(\.|\d)*/", "", $string);
echo $cleaned;
what it returns is: ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE
Which you can then use and run your own little regex on.
Explanation:
The [A-Za-z]{3,9} is used to remove the month, which may be 3-9 characters long. Then the (\.|\d)* removes the digits and dots. I'm thinking that we could parse the month/date better using your regex to extract that June 5 part, but from the example given it shouldn't be necessary.
However, it would be much more helpful if you could provide at least 3 examples, optimally 5, so we can get a good feel of the pattern. Otherwise this is the best I can do with what you've given.

Why /^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i does not work as expected

I have this regex for email validation (assume only x#y.com, abc#defghi.org, something#anotherhting.edu are valid)
/^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i
But #abc.edu and abc#xyz.eduorg are both valid as to the regex above. Can anyone explain why that is?
My approach:
there should be at least one character or number before #
then there comes #
there should be at least one character or number after # and before .
the string should end with either edu, com, or org.
Try this
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
and it should become clear - you need to group those alternatives, otherwise you can match any string that has 'edu' in it, or any string that ends with org. To put it another way, your version matches any of these patterns
^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)
(edu)
(org)$
It's worth pointing out that the original poster is using this as a regex learning exercise. This would be a terrible regex for actual production use! It's a thorny problem - see Using a regular expression to validate an email address for a lot more depth.
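To see the precedence issue concretely, here is a short Python sketch that runs both versions against the inputs from the question. Python is just my choice for the demo, and the names UNGROUPED, GROUPED and accepted are invented for it.

```python
import re

# The question's regex: the | splits the WHOLE pattern into three alternatives.
UNGROUPED = re.compile(r'^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$', re.IGNORECASE)
# The corrected regex: the alternation is confined to the parenthesised group.
GROUPED = re.compile(r'^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$', re.IGNORECASE)

def accepted(pattern, text):
    """True if pattern matches anywhere in text (anchors still apply)."""
    return pattern.search(text) is not None

for text in ('#abc.edu', 'abc#xyz.eduorg', 'abc#defghi.org'):
    print(text, accepted(UNGROUPED, text), accepted(GROUPED, text))
```

The ungrouped pattern accepts all three strings, because a bare "edu" anywhere or a trailing "org" is enough. The grouped pattern accepts only abc#defghi.org.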
Your grouping parentheses are incorrect:
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
Can also just use one case as you're using the i modifier:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
N.B. you were also missing a + from the second set, I assume this was just a typo...
What you have written is the equivalent of matching something that:
Begins with [a-zA-Z0-9]+#[a-zA-Z0-9]\.com
contains edu
or ends with org
What you were looking for was:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
Your regex looks ok.
I guess you are using a find function instead of a match function.
Without knowing what you use it is a bit difficult to say, but in Python you would write
import re
pattern = re.compile(r'^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$')
pattern.match('#abc.edu') # returns None (no match at the start); use this to validate an input
pattern.search('#abc.edu') # matches, finds the edu
Try this:
[a-zA-Z0-9]+#[a-zA-Z0-9]+.(com|edu|org)+$
You forgot the + quantifier, which you need if you want to catch any combination of (com|edu|org).
Update: as I see it, you also missed the + after the second [a-zA-Z0-9].

RegEx: Match Mr. Ms. etc in a "Title" Database field

I need to build a RegEx expression which gets its text strings from the Title field of my Database. I.e. the complete strings being searched are: Mr. or Ms. or Dr. or Sr. etc.
Unfortunately this field was a free field and anything could be written into it. e.g.: M. ; A ; CFO etc.
The expression needs to match on everything except: Mr. ; Ms. ; Dr. ; Sr. (NOTE: The list is a bit longer but for simplicity I keep it short.)
WHAT I HAVE TRIED SO FAR:
This is what I am using successfully on another field:
^(?!(VIP)$).* (This will match every string except "VIP")
I rewrote that expression to look like this:
^(?!(Mr.|Ms.|Dr.|Sr.)$).*
Unfortunately this did not work. I assume this is because the "." (dot) is a reserved symbol in RegEx and needs special handling.
I also tried:
^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*
But no luck as well.
I looked around in the forum and tested some other solutions but could not find any which works for me.
I would like to know how I can build my formula to search the complete (short) string and matches everything except "Mr." etc. Any help is appreciated!
Note: My Question might seem unusual and seems to have many open ends and possible errors. However the rest of my application is handling those open ends. Please trust me with this.
If you want your string simply to not start with one of those prefixes, then do this:
^(?!([MDS]r|Ms)\.).*$
The above simply ensures that the beginning of the string (^) is not followed by one of your listed prefixes. (You shouldn't even need the .*$ but this is in case you're using some engine that requires a complete match.)
If you want your string to not have those prefixes anywhere, then do:
^(.(?!([MDS]r|Ms)\.))*$
The above ensures that every character (.) is not followed by one of your listed prefixes, to the end (so the $ is necessary in this one).
I just read that your list of prefixes may be longer, so let me expand for you to add:
^(.(?!(Mr|Ms|Dr|Sr)\.))*$
You say the string consists entirely of one of the prefixes? Then just do this:
^(?!(Mr|Ms|Dr|Sr)\.$).*$
And if you want to make the dot optional:
^(?!(Mr|Ms|Dr|Sr)\.?$).*$
With | we can list any number of prefix patterns to match against the string.
var pattern = /^(Mrs\.|Mr\.|Ms\.|Dr\.|Er\.).?[A-Za-z]+$/;
var str = "Mrs.Panchal";
console.log(str.match(pattern));
this may do it
/(?!.*?(?:^|\W)(?:(?:Dr|Mr|Mrs|Ms|Sr|Jr)\.?|Miss|Phd|\+|&)(?:\W|$))^.*$/i
from that page I mentioned
Rather than trying to construct a regex that matches anything except Mr., Ms., etc., it would be easier (if your application allows it) to write a regex that matches only those strings:
/^(Mr|Ms|Dr|Sr)\.$/
and just swap the logic for handling matching vs non-matching strings.
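As a rough illustration of swapping the logic, here is a Python sketch that matches only the known titles and negates the result, next to the negative-lookahead version from the question for comparison. The short title list and the names KNOWN_TITLE, NOT_KNOWN_TITLE and is_unexpected_title are assumptions made for the example.

```python
import re

# Positive pattern: the whole string must be exactly one known title.
KNOWN_TITLE = re.compile(r'(Mr|Ms|Dr|Sr)\.')
# The question's negative-lookahead form, for comparison.
NOT_KNOWN_TITLE = re.compile(r'^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*')

def is_unexpected_title(title):
    """True for anything that is not exactly one of the known titles."""
    return KNOWN_TITLE.fullmatch(title) is None

for title in ('Mr.', 'Dr.', 'M.', 'A', 'CFO', 'Mr'):
    print(repr(title), is_unexpected_title(title),
          NOT_KNOWN_TITLE.match(title) is not None)
```

Both columns agree for these inputs: 'Mr.' and 'Dr.' are reported as expected titles, everything else as unexpected. The positive pattern is easier to extend when the list of titles grows.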
re.sub(r'^([MmDdSs][RSrs]{1,2}|[Mm]iss)\.{0,1} ','',name)

Help to compose regular expression

I have the following string: user1 fam <user#example.com>, user2 fam <user2#example.com>, ...
How can I get the mail addresses from this string with a regular expression? I need a list of mail addresses as output:
user#example.com
user2#example.com
I tried:
<.*>
But its output includes the < >:
<user#example.com>
<user2#example.com>
Thank you.
p.s. Thank you #xanatos for comment, I use Erlang
As the other have said, but to make it faster:
<([^>]*)>
In this way the Regex won't have to backtrack (with the other Regexes suggested, the Regex will match all the string and then will begin to rollback to find a >)
I'll add that, for historical reasons, there are small differences between the . and, for example, [\s\S]. Both catch all characters, except that the first one (.) doesn't catch the \n. So by using [^>] you are also catching the \n, but this shouldn't be a problem for what you are doing. http://www.regular-expressions.info/dot.html
Just to be complete, because it's a problem that often happens, there is another variant:
<((?:(?!>).)*)>
(You can substitute the . with [\s\S] if you want, or use the SingleLine option if your language supports it, to make the . behave in a different way.) The point here is that the "stop" expression can be longer than one character. Instead of (?!>) you could have inserted (?!%%) and it would have stopped at %%. BUT I'm not sure this variant works with Erlang. (I hadn't noticed the new tag. It wasn't there when I originally read the question and I'm not an Erlang programmer. And it seems at least two Erlang programmers have different opinions on the matter :-) )
You need to use the ungreedy option so that it only matches the individual bracket pairs,
and global so that you get all the matches.
You also need {capture, all_but_first, list} so that you get the actual values (list can also be binary if you prefer binary results). all_but_first tells re not to return the whole match (which would include the <>), just the group.
Result:
1> S.
"user1 fam <user#example.com>, user2 fam <user2#example.com>, "
2> re:run(S, "<(.+)>", [ungreedy, global, {capture, all_but_first, list}]).
{match,[["user#example.com"],["user2#example.com"]]}
Use groups. See your regex engine's documentation for more details.
>>> re.findall('<(.*?)>', 'user1 fam <user#example.com>, user2 fam <user2#example.com>, ...')
['user#example.com', 'user2#example.com']
Keep it simple and use <([^>]*)> which is about as fast as it can get and works for most versions of regular expressions. This is faster as it never has to backtrack while using <(.*?)> will cause backtracking.
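For comparison, a quick Python sketch of the two capture-group patterns discussed in this thread. The names TEXT, lazy and negated are only for the example, and no timing claim is made here; it just shows that both return the same addresses for this input.

```python
import re

TEXT = 'user1 fam <user#example.com>, user2 fam <user2#example.com>, ...'

# Lazy dot: grows one character at a time until a > follows.
lazy = re.findall(r'<(.*?)>', TEXT)
# Negated class: consumes everything that is not > in a single pass.
negated = re.findall(r'<([^>]*)>', TEXT)

print(lazy)
print(negated)
```

Both calls print ['user#example.com', 'user2#example.com'].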

Adding http:// to all links without a protocol

I use VB.NET and would like to add http:// to all links that don't already start with http://, https://, ftp:// and so on.
"I want to add http here Google,
but not here Google."
It was easy when I just had the links, but I can't find a good solution for an entire string containing multiple links. I guess RegEx is the way to go, but I wouldn't even know where to start.
I can find the RegEx myself, it's the parsing and prepending I'm having problems with. Could anyone give me an example with Regex.Replace() in C# or VB.NET?
Any help appreciated!
Quote RFC 1738:
"Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http")."
Excellent! A regex to match:
/^[a-zA-Z0-9+.-]+:\/\//
If that matches your href string, continue on. If not, prepend "http://". Remaining sanity checks are yours unless you ask for specific details. Do note the other commenters' thoughts about relative links.
EDIT: I'm starting to suspect that you've asked the wrong question... that you perhaps don't have anything that splits the text up into the individual tokens you need to handle it. See Looking for C# HTML parser
EDIT: As a blind try at ignoring all and just attacking the text, using case insensitive matching,
/(<a +href *= *")(.*?)(" *>)/
If the second back-reference matches /^[a-zA-Z0-9+.-]+:\/\//, do nothing. If it does not match, replace it with
$1 + "http://" + $2 + $3
This isn't C# syntax, but it should translate across without too much effort.
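Since the steps above are language-neutral, here is how they might look as a Python sketch with a replacement callback; the same shape should carry over to Regex.Replace with a MatchEvaluator in .NET. The names SCHEME, HREF and add_default_scheme are invented for the example, and it carries the same caveat as the answer: it attacks the text blindly and will also rewrite relative links.

```python
import re

# Scheme test from RFC 1738: letters, digits, "+", "." and "-" followed by ://
SCHEME = re.compile(r'^[a-zA-Z0-9+.-]+://')
# Blind match on anchor tags: (prefix)(href value)(suffix)
HREF = re.compile(r'(<a +href *= *")(.*?)(" *>)', re.IGNORECASE)

def add_default_scheme(html):
    """Prepend http:// to every href value that has no scheme yet."""
    def fix(match):
        prefix, url, suffix = match.groups()
        if SCHEME.match(url):
            return match.group(0)  # already has a scheme: leave untouched
        return prefix + 'http://' + url + suffix
    return HREF.sub(fix, html)

print(add_default_scheme(
    'I want to add http here <a href="www.google.com">Google</a>, '
    'but not here <a href="https://www.google.com">Google</a>.'))
```

Only the first link gains the http:// prefix; the https:// link is left as it is.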
In PHP (should translate somewhat easily)
$text = preg_replace('/href="(?![a-z][a-z0-9+.-]*:\/\/)([^"]*)"/i', 'href="http://$1"', $text); // only hrefs without a scheme are rewritten
C#
result = new Regex("(href=\")(?!(?:https?|ftp)://)", RegexOptions.IgnoreCase).Replace(input, "${1}http://"); // the lookahead skips hrefs that already have a protocol
If you aren't concerned with potentially messing up local links, and you can always guarantee that the strings will be fully qualified domain names, then you can simply use the contains method:
Dim myUrl as string = "someUrlString".ToLower()
If Not myUrl.Contains("http://") AndAlso Not myUrl.Contains("https://") AndAlso Not myUrl.Contains("ftp://") Then
'Execute your logic to prepend the proper protocol
myUrl = "http://" & myUrl
End If
Keep in mind this leaves a lot of holes: it does not check which protocol should be added, or whether the URL is relative.
Edit: I chose specifically not to offer a RegEx solution since this is a simple check and RegEx is a little heavy for it (IMO).