simplify a regex to reduce recursion

I currently have a regex like this:
/^From: ((?!\n\n).)*\nSubject:.+/msu
with the point of matching a block that looks like this:
From: John Smith
Cc: Jane Smith
Subject: cat videos
(i.e., where the headers form a contiguous block) but not if a blank line breaks up the block, like this:
From: John Smith

Subject: cat videos
However, the PHP script that uses this regex sometimes segfaults. I was able to mitigate the segfaults by setting pcre.recursion_limit to a lower number (I used 8000), but it occurs to me that what I'm trying to do should be doable without a great deal of recursion. Am I using a horribly inefficient method of catching the \n\n?

This is just a terrible use for a single regex. In addition to the performance problems you're having, it will fail on straightforward cases such as messages where the "Subject:" line appears before "From:". If you want to parse an RFC 822 email header, then you really should be parsing it.
Find the empty line terminator of the header. Join lines beginning with whitespace to the previous line (i.e. replace newline-followed-by-whitespace with a space). Split each line at the first colon and snip leading and trailing whitespace from each side.
Or find an appropriate library to do that for you.
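The steps above can be sketched in a few lines. This is a minimal illustration in Python rather than PHP, and parse_headers is a hypothetical helper name, not part of any library. It assumes the message is already in memory as a string and does not handle encoded words or malformed headers.

```python
import re

def parse_headers(message):
    """Return the header fields of a message as (name, value) pairs."""
    # Normalize line endings so the blank-line search is uniform.
    normalized = message.replace("\r\n", "\n")
    # The header block ends at the first empty line.
    header_block = normalized.split("\n\n", 1)[0]
    # Unfold continuation lines: newline followed by whitespace becomes a space.
    unfolded = re.sub(r"\n[ \t]+", " ", header_block)
    headers = []
    for line in unfolded.split("\n"):
        # Split at the first colon only; values may contain colons.
        name, sep, value = line.partition(":")
        if sep:
            headers.append((name.strip(), value.strip()))
    return headers
```

In Python specifically, the standard library's email package (for example email.message_from_string) already does this kind of parsing, which is the "appropriate library" route.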

You should not rely on regex to parse mail messages reliably. It is better to use a PHP Mime Mail Parser for this task. With Mime Mail Parser the code is as simple as:
require_once('MimeMailParser.class.php');
$path = 'path/to/mail.txt';
$Parser = new MimeMailParser();
$Parser->setPath($path);
$to = $Parser->getHeader('to');
$from = $Parser->getHeader('from');
$subject = $Parser->getHeader('subject');
$textBody = $Parser->getMessageBody('text');
$htmlBody = $Parser->getMessageBody('html');

I would simply use "not a newline":
/^From:[^\n]*\nSubject:.+/msu
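One caveat: [^\n]*\n covers only the From: line itself, so the pattern as written requires Subject: to be the very next line and would not match the Cc: example from the question. A possible variant, sketched here in Python and offered as an assumption rather than as the answerer's pattern, skips whole non-empty lines instead of testing a lookahead at every character. It also stops at the end of the Subject: line, unlike the original .+ under the s flag. The same pattern shape should carry over to PCRE, though that is untested here.

```python
import re

# Hypothetical variant: allow any number of non-empty lines between
# From: and Subject:. A blank line cannot match [^\n]+, so it breaks the block.
HEADER_BLOCK = re.compile(
    r"^From:[^\n]*\n(?:[^\n]+\n)*?Subject:[^\n]*",
    re.MULTILINE,
)

def find_block(text):
    """Return the contiguous From...Subject block, or None if there is none."""
    match = HEADER_BLOCK.search(text)
    return match.group(0) if match else None
```

Each intermediate line is consumed by one simple repetition, so the engine does far less per-character work than with ((?!\n\n).)*.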

Related

Preg_match_all with nested matches

I'm developing a template system and running into some issues.
The plan is to create HTML documents with [#tags] in them.
I could just use str_replace (I can loop through all possible replacements), but I want to push this a little further ;-)
I want to allow nested tags, and allow parameters with each tag:
[#title|You are looking at article [#articlenumber] [#articlename]]
I would like to get the following results with preg_match_all:
[0] title|You are looking at article [#articlenumber] [#articlename]
[1] articlenumber
[2] articlename
My script will split the | for parameters.
The output from my script will be something like:
<div class='myTitle'>You are looking at article 001 MyProduct</div>
The problem I'm having is that I'm not experienced with regex. All my patterns return almost what I want, but they have problems with the nested parameters.
\[#(.*?)\]
stops at the ] from articlenumber.
\[#(.*?)(((?R)|.)*?)\]
is closer, but it doesn't catch articlenumber: https://regex101.com/r/UvH7zi/1
Hope someone can help me out! Thanks in advance!

You cannot do this with Python's standard regular expressions. You are looking for a feature similar to the "balancing groups" available in the .NET regex engine, which allow nested matches.
Take a look at pyparsing, which supports nested expressions:
import pyparsing as pp
text = '{They {mean to {win}} Wimbledon}'
print(pp.nestedExpr(opener='{', closer='}').parseString(text))
The output is:
[['They', ['mean', 'to', ['win']], 'Wimbledon']]
Unfortunately, this does not work very well with your example; I think you need a better grammar.
You can experiment with a QuotedString definition, but even that is not ideal:
import pyparsing as pp
single_value = pp.QuotedString(quoteChar="'", endQuoteChar="'")
parser = pp.nestedExpr(opener="[", closer="]",
                       content=single_value,
                       ignoreExpr=None)
example = "['#title|You are looking at article' ['#articlenumber'] ['#articlename']]"
print(parser.parseString(example, parseAll=True))

I'm typing this on my phone so there might be some mistakes, but what you want can be quite easily achieved by incorporating a lookahead into your expression:
(?=\\[(#(?:\\[(?1)\\]|.)*)\\])
Edit: Yup, it works, here you go: https://regex101.com/r/UvH7zi/4
Because (?=) consumes no characters, the pattern looks for and captures the contents of all "[#*]" substrings in the subject, recursively checking that the contents themselves contain balanced groups, if any.

here is my code:
#\w+\|[\w\s]+\[#(\w+)]\s+\[#(\w+)]
https://regex101.com/r/UvH7zi/3

For now I've created a parser:
- Get all opening tags and put their strpos in an array
- Loop through all start positions of the opening tags
- Look for the next closing tag; if it comes before the next opening tag, then the tag is complete
- If the closing tag comes after an opening tag, skip that one and look for the next (and keep checking for opening tags in between)
That way I could find all complete tags and replace them.
But that took about 50 lines of code and multiple loops, so one preg_match would be nicer ;-)
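For comparison, the same idea can be written as a single pass with a stack, which avoids the repeated lookups. This is a rough sketch in Python rather than PHP, and find_tags is a hypothetical name. It assumes every ] closes the most recently opened [# tag, so a plain [ ... ] pair inside a tag would confuse it.

```python
def find_tags(text):
    """Return the contents of all [#...] tags, outermost first."""
    stack = []   # start offsets of tags that are still open
    found = []   # (start offset, tag content) for every completed tag
    i = 0
    while i < len(text):
        if text.startswith("[#", i):
            stack.append(i)
            i += 2
        elif text[i] == "]" and stack:
            start = stack.pop()
            # Content excludes the leading "[#" and the closing "]".
            found.append((start, text[start + 2:i]))
            i += 1
        else:
            i += 1
    # Inner tags complete first; sort by start offset to get document order.
    return [content for _, content in sorted(found)]
```

For the example tag from the question this yields the full title tag first, then articlenumber and articlename, matching the desired result list.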

Parsing whitespace-oriented conf file with Regex

I'm trying to parse a gitolite.conf file, which is a whitespace-oriented conf file with a few regexes. The worst problem is that some options might appear anywhere:
@staff = dilbert alice # line 1
@projects = foo bar # line 2
repo @projects baz # line 3
RW+ = @staff # line 4
- master = ashok # line 5
RW = ashok # line 6
R = wally # line 7
config hooks.emailprefix = '[%GL_REPO] ' # line 8
Check the "master" attribute. Some repos have them, others do not. It's a real pain.

This answer assumes a goal of extracting key/value pairs into capturing groups, where key consists of contiguous non-whitespace before = and value includes everything after = but before #, trimmed of leading/trailing whitespace.
Basic version
([^\s]+)\s*=\s*((?:\s*[^\s#]+)*)
More advanced version
The regex above doesn't handle quoted strings very well (e.g. prefix = ' Quoted with # and leading/trailing whitespace '). Regex isn't great at this kind of thing but simple cases can be handled as follows:
([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))
Here's the demo if you need to see what is captured and play around with it more: Debuggex Demo
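As a quick check of what the more advanced pattern captures, here is a small Python sketch. parse_line is a hypothetical helper, and the sample lines are simplified from the question.

```python
import re

# The "more advanced" pattern from above, unchanged.
KEY_VALUE = re.compile(
    r"""([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))"""
)

def parse_line(line):
    """Return (key, value) for a 'key = value' line, or None if there is no '='."""
    match = KEY_VALUE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Note that quoted values are returned with their quotes, and that for a line such as "- master = ashok" the key is only the token directly before the = sign (master), so the leading - has to be picked up separately.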

First, you should know that this isn't entirely possible with Regex. Regex is a great tool for parsing regular languages (including some types of configuration files), but as soon as you get into "Well, this line is actually a header line and we need all lines under it, and some lines might have this token, and others might not", it gets quite messy. I'm not saying it's impossible, but you're going to waste a lot of time debugging your Regex pattern instead of just writing a parser in whatever language you're using this with.
Second, if you're going to ask a question about Regex, it is always helpful to say what you want out of the expression. Do you want to tokenize everything, do you only want the configuration keys, do you only want the comments?
That being said, I took my best guess, here's an expression to get you started:
^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))
With this expression, please apply the g and m flags (global and multiline). In PCRE, this would look like:
/^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))/gm
There are two capture groups, one is whatever is before the = sign, and the other is whatever is after. If there is no = sign, the first capture group contains everything. Anything after "#" is ignored.
Here's a fiddle to demonstrate: http://www.rexfiddle.net/eQexbZU

How to put .com at the end of email addressed by regex?

Example
I received an email list from my friends, but some people typed their address in full (xxx@example.com) and some typed it without the .com (xxx@xxx). I want to bring them all into the same format. How can I do this when editing the file in vi?
In my emaillist.txt:
foo@gmail
bar@hotmail.com
bas@gmail
qux@abc.com
mike@abc
john@email
My try:
I tried a simple regex like this to catch the pattern xxx@xxx:
:%s/\(\w*@\w*\)/\0.com/g
or
:%s/\(\w*@\w*[^.com]\)/\0.com/g
But the problem is that this regex also matches xxx@example.com,
and the result after running the command above looks like this:
foo@gmail.com
bar@hotmail.com.com
bas@gmail.com
qux@abc.com.com
mike@abc.com
john@email.com
My expected result after the substitution is:
foo@gmail.com
bar@hotmail.com
bas@gmail.com
qux@abc.com
mike@abc.com
john@email.com
How to use regex in this situation?

You can use this command:
%s/^.*\(\.com\)\@<!$/\0\.com/g
The search pattern matches each line not ending with .com (I just copy-pasted the recipe from "Vim: Find any line NOT ending in "WORD"") and replaces it with itself with .com added.
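The same "only lines not already ending in .com" logic can be tried outside Vim. The sketch below uses Python, with a negative lookahead at the start of the line instead of Vim's lookbehind at the end. add_missing_com is a hypothetical name, and one address per line is assumed.

```python
import re

def add_missing_com(text):
    """Append .com to every non-empty line that does not already end in .com."""
    # (?!.*\.com$) fails at the start of any line that already ends in .com,
    # so those lines are left untouched.
    return re.sub(r"^(?!.*\.com$)(.+)$", r"\1.com", text, flags=re.MULTILINE)
```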

Addresses that already end in .com need no further replacement, so replace only the ones that don't, like this:
/(.*)(?!\.com)\n/\.com/msi (I assumed each address is on its own line.)
Please don't downvote, I tried to explain.

Pig: extracting email details from raw text using REGEX

I am trying to extract email addresses from raw text using Pig.
Here's the sample data:
Sample data for email abc.123@gmail.com
Sample data for email xyz@abc.com
I am trying a regex approach, with a regular expression I took from: http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
Here's the script:
A = Load '----' using PigStorage as (value: chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(value, '^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z]{2,})$')) AS (f1: chararray)
dump B;
After dumping the output to the terminal, I get blank output:
()
()
Is there any problem in script syntax?
Please share some links also regarding regular expression writing, it would be very much helpful.
Your help is appreciated, thank you.

For the following input data
abc.123@gmail.com
xyz@abc.com
the output of your code is
.123 .com
.com
So there are a couple of problems in your code:
You need to add parentheses around the whole regex to capture the complete email address. The code should then work if there is only one token (word or email address) on each line.
If each input line can be a sentence, you first have to tokenize it and then apply the regex match to each token.
The regex works only on a single token and not on a whole line because "^" marks the beginning of the string and "$" marks the end, so the match succeeds only when the entire line is an email address, which means you can have only one token per line.
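The tokenize-then-match idea can be illustrated outside Pig. The following Python sketch splits each line on whitespace and keeps the tokens that match in full. The EMAIL pattern is a simplified stand-in, not the one from the question, and extract_emails is a hypothetical name.

```python
import re

# Simplified, illustrative email pattern (an assumption, not a validator).
EMAIL = re.compile(r"[\w.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")

def extract_emails(line):
    """Return the whitespace-separated tokens of a line that are email addresses."""
    # fullmatch plays the role of ^...$: the whole token must match.
    return [token for token in line.split() if EMAIL.fullmatch(token)]
```

Because every group in the pattern is non-capturing, the match itself is the complete address, which corresponds to wrapping the whole regex in one capturing group.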

Selecting URLs using RegExp but ignoring them when surrounded by double quotes

I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
Below is a sample of a different replacement that I already have working. The language is PHP:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any new lines "\n" with a double break "<br/><br/>" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.

I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part, from (? to *$), is what makes it work for me. I found this as an answer in "java Regex - split but ignore text inside quotes?" by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.
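The trick is that a position lies outside quotes exactly when an even number of double quotes follows it. Here is a small Python sketch of that lookahead, combined with a simplified URL pattern. Both the URL pattern and the name find_unquoted_urls are illustrative assumptions, and the text is assumed to contain only balanced double quotes.

```python
import re

# Succeeds only if the rest of the text contains an even number of quotes.
NOT_QUOTED = r'(?=(?:[^"]*"[^"]*")*[^"]*$)'
# Simplified URL shape: optional scheme, dotted host, optional port and path.
URL = r'(?:https?://)?(?:\w+\.)+\w{2,}(?::[0-9]+)?(?:/\w+)*'
UNQUOTED_URL = re.compile(NOT_QUOTED + URL)

def find_unquoted_urls(text):
    """Return URLs in text that are not inside double quotes."""
    # No capturing groups are used, so findall returns the whole matches.
    return UNQUOTED_URL.findall(text)
```

The lookahead rescans the remainder of the text at every candidate position, so this approach can get slow on long inputs.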

Add ^ and $ to your regex:
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
Please note that you might need to escape the slashes after http (meaning https?\:\/\/).
Update:
If you want it to be case sensitive, use [a-z] rather than \w. \w matches upper- and lowercase letters, digits, and the underscore, so be careful when using it.
