Why is this regular expression faster? - regex

I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
    if (rTest.IsMatch(formattedString))
        return rTest.Replace(formattedString, string.Empty);
    else
        return formattedString;
}
I'm new to regular expressions, and it was suggested that I use this:
static Regex rText = new Regex(@"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
    new Regex(@"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster and reduced the impact on my text rendering. Can someone explain to a regexp newbie why? :)

Do you really want to run the regexp twice? Without having checked (bad me), I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
    return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...
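As an illustration of the single-pass idea, here is a rough Python sketch. It is not the asker's C# code: Python's re module has no \e escape, so \x1b stands in for it, and the names ANSI_STRICT, ANSI_TOLERANT and strip_formatting are made up for this example. The tolerant pattern is a slightly tightened variant of the second regex in the question (it always requires the ESC character, so it never matches an empty string).

```python
import re

# Strict: only complete colour/format sequences such as ESC[1;32m.
ANSI_STRICT = re.compile(r"\x1b\[[\d;]+m")

# Tolerant: also consumes sequences that were cut short
# (a bare ESC, "ESC[", or "ESC[1;3" with no final letter).
ANSI_TOLERANT = re.compile(r"\x1b(?:\[[\d;]*[mz]?)?")


def strip_formatting(text, pattern=ANSI_TOLERANT):
    # sub() returns the input unchanged when nothing matches,
    # so a separate "is there a match?" pass is not needed.
    return pattern.sub("", text)
```

Calling strip_formatting on a string without escape codes simply returns it unchanged, which is why the extra IsMatch-style check buys nothing.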

The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? gives you lazy quantifying. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.

I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]

Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)

Related

Regex expression to recognize XdY+Z OR XdY

I've been trying to develop a program that will be used for DMing in an MMORPG but I'm having trouble parsing for the actual regex expression I need.
To quote myself from another thread on a less active forum:
I've officially taken over the DiceRoller addon from years and years ago and I've reworked it a lot since I've taken it over and done a lot of testing in game. While I haven't uploaded anything yet, I've been struggling on a piece of regex expression that is currently crucial to the design of the addon.
Some background: the newest iteration of the DiceRoller addon makes it so you can type "!XdY" (where X is the number of dice, Y is the dice value) into raid chat and the DM who has the addon will go through some logic in the addon (random number lua protocol) and then spit out an input after adding up the dice.
It is as follows:
local count, size = string.match(message, "^!(%d+)[dD](%d+)$")
Now the functionality I need it to do is parse for both "!XdY" OR "XdY+Z", but it seems as if I can't get close to "XdY+Z" no matter which regex expression I use since I need it to do both expressions. I can provide more source code context if necessary.
This is the closest I've ever gotten:
http://i.imgur.com/eMhPHQB.png
and this is with the regex expression:
local count, size, modifier = string.match(message, "^!(%d+)[dD](%d+)+?(%d+)$")
As you can see, with the modifier it will work just fine. However, remove the modifier and the regex expression still thinks that it is "XdY+Z", so with "1d20" it thinks it is "1d2+0". It will think 1d200 is "1d20+0", etc. I've tried moving around the optional character "?", but it just causes the expression to not work at all. If I do !1d2 it doesn't work. It's almost as if the optional character NEEDS to be there?
Thanks for the help ahead of time, I've always struggled with regex.
local function dice(input)
    local count, size, modifier = input:match"^!(%d+)[dD](%d+)%+?(%d*)$"
    if count then
        return tonumber(count), tonumber(size), tonumber("0"..modifier)
    end
end
for _, input in ipairs{"!1d6", "!1d24", "!1d200", "!1d2+4", "!1d20+24"} do
    print(input, dice(input))
end
Output:
!1d6 1 6 0
!1d24 1 24 0
!1d200 1 200 0
!1d2+4 1 2 4
!1d20+24 1 20 24
Lua patterns are very limited. You would need something like ^!(%d+)[dD](%d+)(?:+(%d+))?$, but that isn't supported: (?:+(%d+))? uses a non-capturing group and a quantifier on a group, and Lua patterns support neither.
Consider using a regex library like this one, which lets you use PCRE, the regex engine used by PHP and one of the most complete engines. But that would be overkill if you only need it for this one regex. You could do it in code instead; it wouldn't be hard for a simple task like this.
While Lua patterns are not powerful enough to parse this with one expression (as they don't support optional groups), there is an easy option to handle it with two expressions:
-- check the longer expression first (%+ is a literal plus sign)
local count, size, modifier = string.match(message, "^!(%d+)[dD](%d+)%+(%d+)$")
if not count then
    count, size = string.match(message, "^!(%d+)[dD](%d+)$")
end
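For comparison, this is roughly what the optional-group version mentioned above looks like in an engine that does support quantified groups. It is a sketch in Python rather than Lua, so it cannot be dropped into the addon; the names DICE and parse_roll are invented for the example, and a missing modifier is assumed to mean 0.

```python
import re

# "(?:\+(\d+))?" makes the whole "+Z" part optional as a unit.
DICE = re.compile(r"^!(\d+)[dD](\d+)(?:\+(\d+))?$")


def parse_roll(message):
    """Return (count, size, modifier), or None if the message is not a roll."""
    m = DICE.match(message)
    if m is None:
        return None
    count, size, modifier = m.groups()
    # The third group is None when no "+Z" was given.
    return int(count), int(size), int(modifier or 0)
```

Because the plus sign and the digits after it are optional together, "!1d20" can no longer be split into "1d2" plus a modifier of 0.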

Specific Regex Failing on Neko and Native

So I'm working on some cleanup in HaxeFlixel, and I need to validate a CSV map, so I'm using a regex to check that it's OK (don't mention the ending commas; I know that's not valid CSV, but I want to allow it). I think I have a decent regex for doing that, and it seems to work well on Flash, but C++ crashes, and Neko gives me this error: An error occured while running pcre_exec....
Here is my regex. I'm sorry it's long, but I have no idea where the problem is...
^(([ ]*-?[0-9]+[ ]*,?)+\r?\n?)+$
if anyone knows what might be going on I'd appreciate it,
Thanks,
Nico
ps. there are probably errors in my regex for checking CSV, but I can figure those out, it's kind of enjoyable. I'd rather just know what specifically could be causing this :)
edit: ah, I've just noticed this doesn't happen on all strings; once I narrow it down to which strings, I will post one... As for what I'm checking for, it's basically just to make sure there's no weird XML header or any non-integer value in the map file. Basically, it should validate this:
1,1,1,1
1,1,1,1
1,1,1,1
or this:
1,1,1,1,
1,1,1,1,
1,1,1,1,
but not:
xml blahh blahh>
1,m,1,1
1,1,b,1
1,1,1,1
xml>
(and yes I know thats not valid xml;))
edit: it gets stranger:
so I'm trying to determine what strings crash it, and while this still wouldn't explain a normal map crashing, it's definitely weird, and has the same result:
what happens is:
this will fail a .match() test, but not crash:
a
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
while this will crash the program:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,*a*,1,1,1,1,1,1,1,1,1,1,1,1,1
To be honest, you wrote one of the worst regexps I have ever seen. It actually looks like it was written specifically to be as slow as possible. I write this not to offend you, but to express how much you need to learn to write regexps (hint: writing your own regexp engine is a good exercise).
Going to your problem, I guess it just runs out of memory (it is extremely memory intensive). I am not sure why it happens only on PCRE targets (both the neko and cpp targets use PCRE), but I guess it is about memory limits per regexp run in PCRE, or some heuristics in other targets that correct such miswritten regexps.
I'd suggest something along the lines of
~/^(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n)*(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n?)$/
There, "~/" and last "/" are haxe regexp markers.
I haven't tested it extensively, just a run on your samples, but it should do the job (probably with a bit of tweaking).
Also, just as a hint, I'd suggest splitting the file into lines before running any regexps. It will lower memory usage (you will only need to hold part of your text in memory) and simplify your regexp.
I'd also note that since you will need to parse the CSV anyhow (for any properly formed input, which I guess is most of your data), it might be much faster to do all the tests while actually parsing.
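The split-into-lines suggestion could look something like the following. This is only a sketch in Python, not Haxe, and ROW and is_valid_map are names invented here. The per-line pattern makes every repeated group start with a comma, so there is only one way for the engine to carve up a given line.

```python
import re

# One row: an integer, then zero or more ",integer" groups, then an
# optional trailing comma. Spaces are allowed around each number.
ROW = re.compile(r" *-?[0-9]+ *(?:, *-?[0-9]+ *)*,?")


def is_valid_map(text):
    """True if every line of text is a row of comma-separated integers."""
    lines = text.splitlines()  # handles both \n and \r\n
    return bool(lines) and all(ROW.fullmatch(line) for line in lines)
```

A bad cell now fails on its own short line instead of forcing the engine to retry every possible grouping of the whole file.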
Edit: the answer to question "why it eats so much memory"
Well, it is not a short topic, and that's why I proposed to you to write your own regexp engine. There are some differences in implementations, but generally imagine regexp engine works like that:
1. parses your regular expression and builds a graph of all possible states (a state is basically a symbol value and a number of links to other symbols which can follow it).
2. sets up a list of read pointer and state pointer pairs, the current state list, consisting of the regexp initial state and a pointer to the first letter of the matched string
3. sets up the read pointer to the first symbol of the symbol string
4. sets up the state pointer to the initial state of the regexp
5. takes one pair from the current state list and stores it as the current state and current read pointer
6. reads the symbol under the current read pointer
7. matches it with the symbols in the states which the current state has links to, and makes a list of the states that matched.
8. if there is a final regexp state in this list, goes to 12
9. for each item in this list, adds a pair of the next read pointer (which is current+1) and the item to the current state list
10. if the current state list is empty, returns false, as the string didn't match the regexp
11. goes to 6
12. here it is, in a final state of the matched regexp, returns true, the string matches the regexp.
Of course, there are some differences between regexp engines, and some of them eliminate some problems afaik. And of course they also have pseudosymbols and groupings, they need to store the positions the regexp and its groups matched, and they have lookahead, lookbehind, and group references, which makes it a bit (quite a humble measure) more complex and forces them to use somewhat more complex data structures, but the main idea is the same. So, here we are, and your problem is clearly seen from the algorithm. The less specific you are about what you want to match, and the more chances there are for the engine to match the same substring along different paths in the state graph, the more memory and processor time it will consume, exponentially.
Try to model how regexp engine matches regexp (a+a+)+b on strings aaaaaab, ab, aa, aaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (Don't try the last one, it would take hours or days to compute on a modern PC.)
Also, it is worth noting that some regexp engines do things in a slightly different way so they can handle these situations properly, but there are always ways to make a regexp extremely slow.
And another thing to note is that I may have been wrong about the exact memory problem. In this case it may be the processor too, and before that it may be engine limits on memory/processor kicking in, not exactly the system starving of memory.
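To get a feel for the growth without waiting hours, one can time the (a+a+)+b example on short inputs. The sketch below uses Python's re module, which is a backtracking engine; the helper name seconds_to_fail and the choice of lengths are arbitrary, and the actual timings depend on the machine, so none are quoted here.

```python
import re
import time

NESTED = re.compile(r"(a+a+)+b")


def seconds_to_fail(n):
    """Time one failed match attempt against a string of n 'a' characters."""
    subject = "a" * n
    start = time.perf_counter()
    result = NESTED.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # there is no 'b', so the match must fail
    return elapsed


if __name__ == "__main__":
    # Keep n small: each extra character multiplies the work.
    for n in range(10, 21, 2):
        print(n, f"{seconds_to_fail(n):.4f}s")
```

The subject strings contain no b at all, so every attempt is a failure that the engine only discovers after trying the different ways of splitting the run of a's between the two a+ parts and the outer group.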

Conditional RegExp Replace - if reference is found, then write something else

Two cases
1. Key<A, M> desc = newKey();
2. Property<B, N> type = newKey("type", B.bar);
The RegExp and replace
find: (?:Key|Property)<(.*), (.*)> (.*) = newKey\((.*)\);
rep.: Foo<C$1, $2> $3 = pl.nP("$3", $2.class); // ($4)
The Result
1. Foo<CA, M> desc = pl.nP("desc", M.class); //
2. Foo<CB, N> type = pl.nP("type", N.class); // ("type", B.bar)
The Problem:
Now I want to avoid the empty comment at the line 1.
Is there a way to write the $4 and the stuff around it only if $4
isn't empty?
You could remove empty comments afterwards with another regular expression.
EDIT
Another solution would be to deal with the special case separately (... = newKey\(\)).
Perhaps you could automate this process with a simple script, if the tedium of repetitive typing becomes too great(eg. when dealing with multiple conditionals).
As far as I know, there isn't any 'intelligence' built into the replace field in Sublime Text; all you can do is to assemble the captured pieces to your liking.
While skimming through a few Google search results yesterday, I found an article about conditional patterns in Perl, but nothing pertaining to the problem at hand.
For the sake of full disclosure, I should say that I am in no sense an expert in the field, so I could be wrong. I do however have some experience with the Python API for
Sublime Text. It might be possible to implement this functionality yourself, if it doesn't already exist within the plethora of extensions available.
I'm sorry if this sounds like a very long-winded 'uh uh', but I'll be on the lookout for a general solution.
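If the replacement is run from a script instead of the editor's replace field, the condition becomes easy to express, because the replacement can be a function. A rough Python sketch, using the find pattern from the question unchanged; the names PATTERN and rewrite are invented for the example:

```python
import re

# Same find pattern as in the question.
PATTERN = re.compile(r"(?:Key|Property)<(.*), (.*)> (.*) = newKey\((.*)\);")


def rewrite(line):
    def repl(m):
        first, second, name, args = m.groups()
        out = f'Foo<C{first}, {second}> {name} = pl.nP("{name}", {second}.class);'
        if args:  # only add the comment when newKey(...) had arguments
            out += f" // ({args})"
        return out

    return PATTERN.sub(repl, line)
```

With this, case 1 comes out without the trailing empty comment, while case 2 keeps its original arguments in the comment.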

Find/Replace string that doesn't contain quotes

I have inherited a rather large/ugly php codebase (language is unimportant, this is a generic vim question) , where nothing is quoted properly (old php doesn't mind, but new php versions throw warnings).
I'd like to turn $something[somekey] into $something['somekey'], but only if the key is not already quoted and doesn't contain the character $.
I was trying to build a regular expression to quote the keys, but I just can't seem to get it to cooperate.
This is what I have so far. It doesn't work, but maybe it will help explain my question better, and show that I have actually tried.
:%s/\v\$(.{-})\[(['"$]@<!.{-})\]/$\1['\2']/
My goal is to have something like this:
$something[somekey] = $something['somekey']
$somethingelse[someotherthing] = $somethingelse['someotherthing']
$another['key'] = $another['key'] (is ignored)
$yetanother["keykey"] = $yetanother["keykey"] (is ignored)
$derp[$herp] = $derp[$herp] (is ignored)
$array[3] = $array[3] (is ignored)
These can appear anywhere in the text, even several on the same line, and even touching each other like $something[key]$something[key2], which I would like to be replaced with $something['key']$something['key2']
Another problem: there seem to be random JavaScript arrays in some files, which also have [] square brackets. So the regex needs to check that the match starts with $ and has text before the brackets.
I'm probably asking for the impossible, but any help on this would be great before I go insane editing each file one by one manually.
EDIT: forgot that keys can be numeric, and shouldn't be quoted.
I tried the following, which processed everything from your question correctly:
:%s/\[\(\I\i*\)\]/['\1']/g
Or, with optional white space inside the brackets:
:%s/\[\s*\(\I\i*\)\s*\]/['\1']/g
And also checking for $identifier before the brackets:
:%s/\(\$\i\+\)\[\s*\(\I\i*\)\s*\]/\1['\2']/g
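For anyone who would rather run this as a script over many files, the last substitution translates fairly directly. The sketch below is Python, and it approximates Vim's \I and \i classes with [A-Za-z_] and \w, which is an assumption about what counts as an identifier character; UNQUOTED_KEY and quote_keys are made-up names.

```python
import re

# $identifier, then [ bareword ] where the bareword starts with a letter
# or underscore (so numeric keys, quoted keys and $variables are skipped).
UNQUOTED_KEY = re.compile(r"(\$\w+)\[\s*([A-Za-z_]\w*)\s*\]")


def quote_keys(source):
    return UNQUOTED_KEY.sub(r"\1['\2']", source)
```

Requiring the leading $identifier is what keeps plain JavaScript array accesses such as items[index] untouched.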

Regular expression to search for Gadaffi [closed]

I'm trying to search for the word Gadaffi, which can be spelled in many different ways. What's the best regular expression to search for this?
This is a list of 30 variants:
Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi
My best attempt so far is:
\b[KG]h?add?af?fi$\b
But I still seem to be missing some variants. Any suggestions?
Easy... (Qadaffi|Khadafy|Qadafi|...)... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.
Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.
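Building the alternation from the list is a couple of lines in most languages. Here is a sketch in Python; VARIANTS is the list from the question, while NAME and find_names are names invented for the example, and the \b word boundaries are an addition that the answer above does not mention.

```python
import re

VARIANTS = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi
Ghadafi Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi
Khadaffy Khadafy Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi
Qathafi Quathafi Qudhafi Kad'afi
""".split()

# re.escape guards against any variant containing a regex metacharacter.
NAME = re.compile(r"\b(?:" + "|".join(re.escape(v) for v in VARIANTS) + r")\b")


def find_names(text):
    """Return every listed spelling that occurs in text, in order."""
    return NAME.findall(text)
```

Adding a newly discovered spelling is then a one-word change to the list, with no risk of accidentally widening the pattern.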
\b[KGQ]h?add?h?af?fi\b
Arabic transcription is (Wiki says) "Qaḏḏāfī", so maybe adding a Q. And one H ("Gadhafi", as the article (see below) mentions).
Btw, why is there a $ at the end of the regex?
Btw, nice article on the topic:
Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?.
EDIT
To match all the names in the article you've mentioned later, this should match them all. Let's just hope it won't match a lot of other stuff :D
\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
One interesting thing to note from your list of potential spellings is that there's only 3 Soundex values for the contained list (if you ignore the outlier 'Kazzafi')
G310, K310, Q310
Now, there are false positives in there ('Godby' also is G310), but by combining the limited metaphone hits as well, you can eliminate them.
<?php
$soundexMatch = array('G310','K310','Q310');
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');
$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";
$wordArray = preg_split('/[\s,.;-]+/', $text);
$matches = array();
foreach ($wordArray as $item) {
    // Keep a word only if both its Soundex and its Metaphone codes are listed.
    $rate = in_array(soundex($item), $soundexMatch) + in_array(metaphone($item), $metaphoneMatch);
    if ($rate > 1) {
        $matches[] = $item;
    }
}
$pattern = implode("|", $matches);
$text = preg_replace("/($pattern)/", "<b>$1</b>", $text);
echo $text;
?>
A few tweaks, and let's say some Cyrillic transliteration, and you'll have a fairly robust solution.
Using CPAN module Regexp::Assemble:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';  # say() is not enabled by default
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add($_) for qw(Gadaffi Gadafi Gadafy Gaddafi Gaddafy
                    Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
                    Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi
                    Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
                    Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi
                    Qadthafi Qathafi Quathafi Qudhafi Kad'afi);
say $ra->re;
This produces the following regular expression:
(?-xism:(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi))
I think you're over complicating things here. The correct regex is as simple as:
\u0627\u0644\u0642\u0630\u0627\u0641\u064a
It matches the concatenation of the seven Arabic Unicode code points that forms the word القذافي (i.e. Gadaffi).
If you want to avoid matching things that no-one has used (ie avoid tending towards ".+") your best approach would be to create a regular expression that's just all the alternatives (eg. (Qadafi|Kadafi|...)) then compile that to a DFA, and then convert the DFA back into a regular expression. Assuming a moderately sensible implementation that would give you a "compressed" regular expression that's guaranteed not to contain unexpected variants.
If you've got a concrete listing of all 30 possibilities, just concatenate them all together with a bunch of "ors". Then you can be sure that it only matches the exact things you've listed, and no more. Your RE engine will probably be able to optimize it further, and, well, with 30 choices, even if it doesn't, it's still not a big deal. Trying to fiddle around with manually turning it into a "clever" RE can't possibly turn out better and may turn out worse.
(G|Gh|K|Kh|Q|Qh|Q|Qu)(a|au|e|u)(dh|zz|th|d|dd)(dh|th|a|ha|)(\x27|)(a|)(ff|f)(i|y)
Certainly not the most optimized version, split on syllables to maximize matches while trying to make sure we don't get false positives.
Well since you are matching small words why don't you try a similarity search engine with the Levenshtein distance? You can allow at most k insertions or deletions. This way you can change the distance function to other things that work better for your specific problem. There are many functions available in the simMetrics library.
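A minimal sketch of the edit-distance idea, written out by hand in Python rather than taken from the simMetrics library mentioned above. The function names and the default threshold k=2 are assumptions made for illustration; this version also counts substitutions, which is the usual Levenshtein definition.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def is_near(word, known, k=2):
    """True if word is within k edits of any spelling in known."""
    return any(levenshtein(word.lower(), name.lower()) <= k for name in known)
```

A small k keeps the match tight around the known spellings; raising it catches more variants at the cost of more false positives.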
A possible alternative is the online tool for generating regular expressions from examples: http://regex.inginf.units.it.
Give it a chance!
Why not do a mixed approach? Something between a list of all possibilities and a complicated Regex that matches far too much.
Regex is about pattern matching, and I can't see a pattern for all variants in the list. Trying to find one will also match things like "Gazzafy" or "Quud'haffi", which are most probably not variants in use and definitely not on the list.
But I can see patterns for some of the variants, and so I ended up with this:
\b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b
At the beginning I list the ones where I can't see a pattern, then followed by some variants where there are patterns.
See it here on www.rubular.com
I know this is an old question, but...
Neither of these two regexes is the prettiest, but they are optimized and both match ALL the variations in the original post.
"Little Beauty" #1
(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)
"Little Beauty" #2
(?:(?:Gh|[GK])adaff|(?:(?:Gh|[GKQ])ad|(?:Ghe|(?:[GK]h|[GKQ])a)dd|(?:Gadd|(?:[GKQ]a|Q(?:adh|u))d|(?:Qad|(?:Qu|[GQ])a)t)h|Ka(?:zz|d'))af)i|(?:Khadaff|(?:(?:Kh|G)ad|Gh?add)af)y
Rest in Peace, Muammar.
Just an addendum: you should add "Gheddafi" as alternate spelling. So the RE should be
\b[KG]h?[ae]dd?af?fi$\b
[GQK][ahu]+[dtez]+\'?[adhz]+f{1,2}(i|y)
In parts:
[GQK]
[ahu]+
[dtez]+
\'?
[adhz]+
f{1,2}(i|y)
Note: Just wanted to give a shot at this.
What else starts with Q, G, or K, has a d, z or t in the middle, and ends in "fi" the people actually search for?
/\b[GQK].+[dzt].+fi\b/i
Done.
>>> print re.search(a, "Gadasadasfiasdas") != None
False
>>> print re.search(a, "Gadasadasfi") != None
True
>>> print re.search(a, "Qa'dafi") != None
True
Interesting that I'm getting downvoted. Can someone leave some false positives in the comments?