Can this regex be made memory efficient - regex

I get an xml as plain unformatted text blob. I have to make some replacements and I use regex find and replace.
For example:
<MeasureValue><Text value="StartCalibration" /></MeasureValue>
has to be converted to
<MeasureValue type="Text" value="StartCalibration"/>
The regex I wrote was
<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>
And the replacement part was:
<MeasureValue type="$2" value="$3"/>
Here a link showing the same.
The issue is that in a file having 370 such occurrences, I get out of memory error. I have heard of the so called greedy regex patterns and wondering if this can be the case plaguing me. If this is already memory efficient then I will leave it as it is and try to increase the server memory. I have to process thousands of such documents.
EDIT: This is part of script for Logstash from Elasticsearch. As per documentation, Elasticsearch uses Apache Lucene internally to parse regular expressions. Not sure if that helps.

As a rule of thumb, specificity is positively correlated with efficiency in regex.
So, know your data and build something to surgically match it.
The more specific you build your regex, like literally writing down the pattern (and usually ending up with a freak regex), the fewer resources it will take due to the fewer "possibilities" it can match in your data.
To be more precise, imagine we are trying to match a string
2014-08-26 app[web.1]: 50.0.134.125
Approaches such as
(.*) (.*) (.*)
leaves it too open and prone to match with MANY different patterns and combinations, and thus takes a LOT more to process its infinite possibilities. check here https://regex101.com/r/GvmPOC/1
On the other han you could spend a little more time building a more elaborated expression such as:
^[0-9]{4}\-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]\: [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$`
and I agree, it is horrible but much more precise. It won't waste your precious resources finding unnecessary stuff. check here https://regex101.com/r/quz7fo/1
Another thing to have in mind is: operators such as * or + do a scan operation, which depending on the size of your string, might take some time. Also, whenever is possible, specifying the anchors ^$ also help the script not to try to find too many matches within the same string.
Bringing it to your reality...
If we have to use regex.
The million-dollar question is, how can we turn your regex into something more precise?
Since there is no limit to tag name lengths in XML... there is no way to make it utterly specific :(
We could try to specify what characters to match and avoid . and \w. So substitute it to something more like a-zA-Z is preferrable. Also making use of negative classes [^] would help to narrow down the range of possibilities.
Avoiding * and ? and try to put a quantifier {} (although I don't know your data to make this decision). And as I stated above, there is no limit in XML for this.
I didn't understand precisely the function of the ? in your code, so removing it is something less to process.
Ended up with something like
<(([a-zA-Z]+) value="([^"]*)"[^<>]*)>
Not many changes though. You can try to measure it to see if there was any improvement.
But perhaps the best approach is not to use regex at all :(
I don't know the language you are working with, but if it is getting complicated with the processing time, I would suggest you to not use regex and try some alternative.
If there is a slight possibility to use a xml parser it would be preferable.
https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
Sorry if it wasn't as conclusive as you might have expected, but the field for working on it is likewise really open.

Related

Why don't regex engines ensure all required characters are in the string?

For example, look at this email validating regex:
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*#([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$. If you look carefully, there are three parts: stuff, the # character, and more stuff. So the regex requires an email address to have an #, thus the string aaaaaaaaaaaaaaaaaaaaaa! will not match.
Yet most regex engines will catastrophically backtrack given this combination. (PCRE, which powers Regex101, is smarter than most, but other regex/string combinations can cause catastrophic backtracking.)
Without needing to know much about Big O, I can tell that combinatorial things are exponential, while searching is linear. So why don't regex engines ensure the string contains required characters (so they can quit early)?
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Coming back after getting more experience, I know that regexes are declarative, meaning the execution plan is determined by the computer, not the programmer. Optimization is one of the ways that regex engines differ the most.
While PCRE and Perl have challenged the declarative status-quo with the introduction of backtracking control verbs, it is other engines, without the verbs, which are most likely to catastrophically backtrack.
I think you're taking this the wrong way, really:
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Well, if you write a regex, your regex engine will need to follow that program you've written.
If you write a complex program, then there's nothing the engine can do about that; this regex explicitly specifies that you'll need to match "stuff" first, before looking for the #.
Now, not being too involved in writing compilers, I agree, in this case, it might be possible to first identify all the "static" elements, which here are only said #, and look for them. Sadly, in the general case, this won't really help you, because there might either be more than one static element or the none at all…
If you cared about speed, you'd actually just first search for the # with plain linear search, and then do your regex thing after you've found one.
Regexes were never meant to be as fast as linear search engines, because they were rather meant to be much, much more powerful.
So, not only are you taking the wrong person to the judge (the regex engine rather than the regex, which is a program with a complexity), you're also blaming the victim for the crime (you want to harvest the speed of just looking for the # character, but still use a regex).
by the way, don't validate email addresses with regexes. It's the wrong tool:
http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

Simplifying my Eclipse regex

So, I'm fairly new to regex. That being said, I'm looking for help. I've got this regex to do what I want, but this is as simple as I can make it with my current understanding.
(\w+\s*?\w+\s*?\-*?/*?\w+)\s*?(\(\w+\))
What this needs to match are the following configurations of strings:
word
word word
word-word
word/word
word word/word
word word/LL
word word (word)
word-word word/word
I kind of feel like I'm abusing *? but I saw an example that used that and it seemed to do what I needed. I've also seen that just * will do the same? Or just ?? Anyway there it is. Thanks in advance!
Also, the grouping is there because this regex is actually significantly longer with other groups. Please keep that in mind. I need the group to still work with others (4 in total).
EDIT: Sorry everyone. I'm actually trying to convert text being copy pasted from a pdf into python syntax using the built in find/replace (using regex) in the Eclipse IDE. That's why I didn't specify what I was using. I thought it was just plain ol' regex.
Also, my current regex works perfectly. What I'm asking for here is a lesson on simplicity (and the * and ? better explained). I just felt my current regex was long and ugly.
? after other RegEx quantifiers makes them reluctant. Meaning that they will match input only when the remainder of the RegEx has not been able to match.
The reluctant ? is superfluous when the set of characters it applies to has no common character with the following set. For example in:
[0-9]*?[A-Z]
there is no way [A-Z] will match unless all previous [0-9]s have been matched. Then why make [0-9]* reluctant? On the contrary, make it greedy by removing the ?.
[0-9]*[A-Z]
There is a second case where ? is abused. For example, you know that certain text contains, say, a colon followed an uppercase word. There are no other possible occurrences of a colon.
.*?:[A-Z]+
would do the job. Hoevever,
[^:]*:[A-Z]+
represents better the fact that a colon will always initiate what you want to match. In this case, we "created" the first condition (of character commonality) by realizing that, in fact, there never was need for one. IOW that we never needed a .* matching also :s, but just [^:]*.
I'm reluctant to use the reluctant operator because sometimes it tends to obscure patterns instead of clarify them and also because performance implications, both thanks to the fact that it increases the level of backtracking enormously (and without a reason).
Applying these principles,
(\w+\s*\w+\s*\-*/*\w+)\s*(\(\w+\))
seems a better option. Also, at some point you use \-*/*. It's hard to know what you really want without as many counter-examples as (positive) examples (and this is extremely important while developing and testing any RegEx!), but do you really want to accept perhaps many -s followed by perhaps many /s? My impression is that what you are looking for is one - or one / or one space. [ \-/] would do much better. Or perhaps \s*[\-/]?\s* if you want to accept multiple spaces, even before and/or after the [\-/]
(\w+\s*\w+\s*[\-/]?\s*\w+)\s*(\(\w+\))
See the Java documentation on Regular Expressions to find out more.
p.s.w.g was correct in pointing out that (.*) is the simplest form of what I needed. The other 3 grouping of my regular expression are specific enough that this works. Thank you p.s.w.g.
PS still don't know why I was down-voted

Can a regular expression be tested to see if it reduces to .*

I'm developing an application where users enter a regular expression as a filter criterion, however I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is, if I just use if (expression == ".*"), then this could be easily sidestepped by entering something such as .*.*.
Does anyone know of a test that could take a piece of regex and see if is essentially .* but in a slightly more elaborate form?
My thoughts are:
I could see if the expression is one or more repetitions of .*, (i.e. if it matches (\.\*)+ (quotations/escapes may not be entirely accurate, but you get the idea). The problem with this is that there may be other forms of writing a global match (e.g. with $ and ^) that are too exhaustive to even think of upfront, let along test.
I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
Thoughts, anyone?
(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)
Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions
You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
If that seems like too much work, then you can use a quick and dirty probabilistic test. Just Generated some random strings to see if they match the user's regex. If they are match, you have a pretty good indication that the regex is overly broad.
There are many, many possibilities to achieve something equivalent to .*. e.g. just put any class of characters and the counter part into a class or a alternation and it will match anything.
So, I think with a regular expression its not possible to test another regular expression for equivalence to .*.
These are some examples that would match the same than .* (they will additionally match the newline characters)
/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
So I assume your idea 2 would be a lot easier to achieve.
Thanks everyone,
I did miss the testing for equivalence entry on the wikipedia, which was interesting.
My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!
I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.
Now to decide what type of strings to generate in order to run the tests....
Kind regards,
Russ.

TCL string match vs regexps

Is it right that we should avoid using regexp as it is slow. Instead we should use string operations. Are there cases that both can be used but regexp is better?
You should use the appropriate tool for the job. That means, you should not avoid regex, you should use it when it is necessary.
If you are just searching for a fixed sequence of characters, use string operations.
If you are searching for a pattern, then use regular expressions.
Example
Search for the word "Foo". use string operations it will also find
"Foobar", is this OK? NO, well then maybe search for "Foo ", but then
it will not find "Foo," and "Foo."
With regex no problem, you can match for a word boundary /\mFoo\M/ and
this regex will not be slow.
I think this negative image comes from special problems like catastrophic backtracking.
There has been a recent example (catastrophic-backtracking-shouldnt-be-happening-on-this-regex) where this behaviour was unexpected.
Conclusion
A regex has to be well designed, if it isn't then the performance can be catastrophic. But the same can also happen to your normal code if you use a bad algorithm.
For a small job it should nearly never be a problem to use a regex, if your task is bigger and has to be repeated often, do a benchmark.
From my own experience, I am analyzing really big text files (some hundred MB) and use regexes to find the rows I am interested in and I don't experience performance problems because of regex.
Here an interesting read about code optimization
Regular expressions (REs) are a marvelous hammer. They can solve some problems elegantly, and many more with brute force, but it won't be pretty. And some problems can be solved with REs if you hit them enough, but there are much better solutions available (for example, things that are a good fit for string map)
string match - or globbing - can be thought of as a simplified version of regular expressions. The glob pattern will usually be shorter than the equivalent regular expression (character classes are an exception - ERs support them, with globs you need to spell them out). I don't know offhand how the performance differs; I'd expect string match to be slightly faster on equivalent patterns because of the simpler logic, but time is much more reliable than expectations.
For a specific case where REs are easier to use, extracting a substring contextually vs. by simple character position is a good example. Or for matching one of several alternatives.
My rule of thumb is to use the simplest thing that works. If that's string match, then great. If it seems like the pattern is too complex for that, go to a regexp and be happy you have the choice.
The best advice I can give, and the advice I use myself is, use regular expressions only when a simpler solution won't work.
If you can use simple string matching, or use glob patterns, use them. It's only when those cannot work that you should be using regular expressions.
To address your specific question I would say that, no, there is no time when you can use either but that regular expressions are the better choice. Maybe there's an edge case I'm not thinking of, but generally speaking, simpler solutions are always better.
I don't know about Tcl in particular, but generally it can be said that if you're looking for exact text matches (e. g. find all lines that start with #define) then string operations are faster. But if you're looking for patterns (e. g. all lines that contain a word that starts with c and ends with t) then regular expressions are the right tool for this (\bc\w*t\b would be a good regex for this - compare this to the program logic you'd need if you had to write this yourself.
And even if regex is slower in a case like this, chances are high that it won't matter in terms of execution speed, but it'll matter a lot when changes to the matching logic are required (oh, now we need to look for a word that starts with c and ends with t but contains at least two as and no x --> \bc(?=\w*a\w*a)(?!\w*x)\w*t\b).
A place where most regex engines don't want to go is recursion (matching nested tags, nested parentheses and all that). That's where parsers enter the picture.
Regular expression matching is a kind of string operation. While it's not as fast as some of the more basic operations, it is enormously more capable too. It's also more difficult to use, especially if you don't already know the basic syntax of REs, but that's not a reason to avoid them. However, replacing a regular expression with a collection of basic string operations can just lead to the program getting enormously longer: sometimes, you simply need complex manipulations.
Tcl does a number of things to make RE operations more efficient. Notably, it detects particularly simple REs and converts them into glob-like matches (as in string match) which are faster but much less powerful, and it does a number of things to cache the compiled form of REs so that matching has less overhead. It also uses an automata-theoretic matching engine that has fewer surprises during match time (at a cost of more time to compile the RE in the first place).
In short, don't avoid them. Use them where appropriate. (And time if you're in doubt about speed.)
regexp aka regular expressions are used to match many different strings and can be very complex or even to validate a specific input.
string match only allows wildcards such as * and ? and basic character grouping with [] as in regexp.
You can read about it here: http://www.tcl.tk/man/tcl8.5/TclCmd/string.htm#M40
A basic guide what regexp can do also with some examples are explained here: http://www.regular-expressions.info/
So in short: If you don't need regexp or even don't know much about it, i recommand you to not use it. If you just want to compare two strings for their equality use string equal.

does regex comparisons consume lots of resources?

i dunno, but will your machine suffer great slowdown if you use a very complex regex?
like for example the famous email validation module proposed just recently? which can be found here RFC822
update: sorry i had to ask this question in a hurry anyway i posted the link to the email regex i was talking about
It highly depends on the individual regex: features like look-behind or look-ahead can get very expensive, while simple regular expressions are fine for most situations.
Tutorials on http://www.regular-expressions.info/ offer performance advice, so that can be a good start.
Regexes are usually implemented as one of two algorithms (NFA or DFA) that correspond to two different FSMs. Different languages and even different versions of the same language may have a different type of regex. Naturally, some regexes work faster in one and some work faster in the other. If it's really critical, you might want to find what type of regex FSM is implemented.
I'm no expert here. I got all this from reading Mastering Regular Expressions by Jeffrey E. F. Friedl. You might want to look that up.
Depends also on how well you optimise your query, and knowing the internal working of regex.
Using the negated character class, for example, saves the cost of having the engine backtracking characters (i.e. /<[^>]+>/ instead of /<.+?>/)(*).Trivial in small matches, but saves a lot of cycles when you have to match inside a big chunk of text.
And there are many other ways to save resources in regex operations, so performance can vary wildly.
example taken from http://www.regular-expressions.info/repeat.html
You might be interested by articles like: Regular Expression Matching Can Be Simple And Fast or Understanding Regular Expressions.
It is, alas, easy to write inefficient REs, which can match quite quickly on success but can look for hours if no match is found, because the engine stupidly try a long match on every position of a long string!
There are a few recipes for this, like anchoring whenever it is possible, avoiding greediness if possible, etc.
Note that the giant e-mail expression isn't recent, and not necessarily slow: a short, simple expression can be slower than a more convoluted one!
Note also that in some situations (like e-mail, precisely), it can be more efficient (and maintainable!) to use a mix of regexes and code to handle cases, like splitting at #, handling different cases (first part starts with " or not, second part is IP address or domain, etc.).
Regexes are not the ultimate tool able to do everything, but it is a very useful tool well worth to master!
It depends on your regexp engine. As explained here (Regular Expression Matching Can Be Simple And Fast) there may be some important difference in the performance depending on the implementation.
You can't talk about regexes in general any more than you can talk about code in general.
Regular expressions are little programs on their own. Just as any given program may be fast or slow, any given regex may be fast or slow.
One thing to remember, however, is that the regular expression handler is is very well optimized to do its job and run the regex quickly.
I once made a program that analyzed a lot of text (a big code base, >300k lines). First I used regex but when I switched to using regular string functions it got a lot faster, like taking 40% of the time of the regex version. So while of course it depends, my thing got a lot faster.
Once I had written a greedy - accidentally, of course :-) - a multi-line regex and had it search/replace on 10 * 200 GB of text files. It was damn slow... So it depends what you write, and what you check.
Depends on the complexity of the expression and the language the expression is used with.
In JavaScript; you have to optimize everything. In C#; not so much.