Match even numbers containing at least one "5" digit - regex

I have to find regular expression defining any even number, with at least one "5" digit inside. I've been thinking about:
(0+1+2+3+4+5+6+7+8+9)* 5 (0+1+2+3+4+5+6+7+8+9)* (0+2+4+6+8)
Will this work? Is there a way to make it shorter?
Well, I'm not sure if this is the right website to post this question...but the tag exists :P

Your regexp looks good. I don't think you can make it shorter (if we are speaking about theoretical regexps; real programming languages have shortcuts like [0-9] or \d for any digit).
As pointed out by others your regexp will also match numbers starting from any number of zeroes. If you don't want this, you will of course try to match the first digit with (1|2|3|4|5|6|7|8|9), but now you've got a special case: what if the first digit is 5? So you'll have to add more branches.
Your regexp is fine, but I think that making it match only numbers without leading zeroes is a good exercise and you should try it anyway, even if your task doesn't require this.
Here is a general advice. Problems requiring you to create regular expressions are best wil help of Test-driven development. That is, before trying to write the regexp you create a set of tests for it. It has a number of benefits:
If you've written many tests and you regexp passed them, you'll be almost sure you've got it write.
Quality of your tests will be way higher if you write them before creating regexps.
Seeing your freshly handcrafted regexp immediately pass your set of thorough tests makes you happy, trust me =).

Related

Why don't regex engines ensure all required characters are in the string?

For example, look at this email validating regex:
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*#([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$. If you look carefully, there are three parts: stuff, the # character, and more stuff. So the regex requires an email address to have an #, thus the string aaaaaaaaaaaaaaaaaaaaaa! will not match.
Yet most regex engines will catastrophically backtrack given this combination. (PCRE, which powers Regex101, is smarter than most, but other regex/string combinations can cause catastrophic backtracking.)
Without needing to know much about Big O, I can tell that combinatorial things are exponential, while searching is linear. So why don't regex engines ensure the string contains required characters (so they can quit early)?
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Coming back after getting more experience, I know that regexes are declarative, meaning the execution plan is determined by the computer, not the programmer. Optimization is one of the ways that regex engines differ the most.
While PCRE and Perl have challenged the declarative status-quo with the introduction of backtracking control verbs, it is other engines, without the verbs, which are most likely to catastrophically backtrack.
I think you're taking this the wrong way, really:
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Well, if you write a regex, your regex engine will need to follow that program you've written.
If you write a complex program, then there's nothing the engine can do about that; this regex explicitly specifies that you'll need to match "stuff" first, before looking for the #.
Now, not being too involved in writing compilers, I agree, in this case, it might be possible to first identify all the "static" elements, which here are only said #, and look for them. Sadly, in the general case, this won't really help you, because there might either be more than one static element or the none at all…
If you cared about speed, you'd actually just first search for the # with plain linear search, and then do your regex thing after you've found one.
Regexes were never meant to be as fast as linear search engines, because they were rather meant to be much, much more powerful.
So, not only are you taking the wrong person to the judge (the regex engine rather than the regex, which is a program with a complexity), you're also blaming the victim for the crime (you want to harvest the speed of just looking for the # character, but still use a regex).
by the way, don't validate email addresses with regexes. It's the wrong tool:
http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

Simplifying my Eclipse regex

So, I'm fairly new to regex. That being said, I'm looking for help. I've got this regex to do what I want, but this is as simple as I can make it with my current understanding.
(\w+\s*?\w+\s*?\-*?/*?\w+)\s*?(\(\w+\))
What this needs to match are the following configurations of strings:
word
word word
word-word
word/word
word word/word
word word/LL
word word (word)
word-word word/word
I kind of feel like I'm abusing *? but I saw an example that used that and it seemed to do what I needed. I've also seen that just * will do the same? Or just ?? Anyway there it is. Thanks in advance!
Also, the grouping is there because this regex is actually significantly longer with other groups. Please keep that in mind. I need the group to still work with others (4 in total).
EDIT: Sorry everyone. I'm actually trying to convert text being copy pasted from a pdf into python syntax using the built in find/replace (using regex) in the Eclipse IDE. That's why I didn't specify what I was using. I thought it was just plain ol' regex.
Also, my current regex works perfectly. What I'm asking for here is a lesson on simplicity (and the * and ? better explained). I just felt my current regex was long and ugly.
? after other RegEx quantifiers makes them reluctant. Meaning that they will match input only when the remainder of the RegEx has not been able to match.
The reluctant ? is superfluous when the set of characters it applies to has no common character with the following set. For example in:
[0-9]*?[A-Z]
there is no way [A-Z] will match unless all previous [0-9]s have been matched. Then why make [0-9]* reluctant? On the contrary, make it greedy by removing the ?.
[0-9]*[A-Z]
There is a second case where ? is abused. For example, you know that certain text contains, say, a colon followed an uppercase word. There are no other possible occurrences of a colon.
.*?:[A-Z]+
would do the job. Hoevever,
[^:]*:[A-Z]+
represents better the fact that a colon will always initiate what you want to match. In this case, we "created" the first condition (of character commonality) by realizing that, in fact, there never was need for one. IOW that we never needed a .* matching also :s, but just [^:]*.
I'm reluctant to use the reluctant operator because sometimes it tends to obscure patterns instead of clarify them and also because performance implications, both thanks to the fact that it increases the level of backtracking enormously (and without a reason).
Applying these principles,
(\w+\s*\w+\s*\-*/*\w+)\s*(\(\w+\))
seems a better option. Also, at some point you use \-*/*. It's hard to know what you really want without as many counter-examples as (positive) examples (and this is extremely important while developing and testing any RegEx!), but do you really want to accept perhaps many -s followed by perhaps many /s? My impression is that what you are looking for is one - or one / or one space. [ \-/] would do much better. Or perhaps \s*[\-/]?\s* if you want to accept multiple spaces, even before and/or after the [\-/]
(\w+\s*\w+\s*[\-/]?\s*\w+)\s*(\(\w+\))
See the Java documentation on Regular Expressions to find out more.
p.s.w.g was correct in pointing out that (.*) is the simplest form of what I needed. The other 3 grouping of my regular expression are specific enough that this works. Thank you p.s.w.g.
PS still don't know why I was down-voted

Can a regular expression be tested to see if it reduces to .*

I'm developing an application where users enter a regular expression as a filter criterion, however I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is, if I just use if (expression == ".*"), then this could be easily sidestepped by entering something such as .*.*.
Does anyone know of a test that could take a piece of regex and see if is essentially .* but in a slightly more elaborate form?
My thoughts are:
I could see if the expression is one or more repetitions of .*, (i.e. if it matches (\.\*)+ (quotations/escapes may not be entirely accurate, but you get the idea). The problem with this is that there may be other forms of writing a global match (e.g. with $ and ^) that are too exhaustive to even think of upfront, let along test.
I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
Thoughts, anyone?
(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)
Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions
You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
If that seems like too much work, then you can use a quick and dirty probabilistic test. Just Generated some random strings to see if they match the user's regex. If they are match, you have a pretty good indication that the regex is overly broad.
There are many, many possibilities to achieve something equivalent to .*. e.g. just put any class of characters and the counter part into a class or a alternation and it will match anything.
So, I think with a regular expression its not possible to test another regular expression for equivalence to .*.
These are some examples that would match the same than .* (they will additionally match the newline characters)
/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
So I assume your idea 2 would be a lot easier to achieve.
Thanks everyone,
I did miss the testing for equivalence entry on the wikipedia, which was interesting.
My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!
I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.
Now to decide what type of strings to generate in order to run the tests....
Kind regards,
Russ.

How to generate random strings that match a given regexp?

Duplicate:
Random string that matches a regexp
No, it isn't. I'm looking for an easy and universal method, one that I could actually implement. That's far more difficult than randomly generating passwords.
I want to create an application that takes a regular expression, and shows 10 randomly generated strings that match that expression. It's supposed to help people better understand their regexps, and to decide i.e. if they're secure enough for validation purposes. Does anyone know of an easy way to do that?
One obvious solution would be to write (or steal) a regexp parser, but that seems really over my head.
I repeat, I'm looking for an easy and universal way to do that.
Edit: Brute force approach is out of the question. Assuming the random strings would just be [a-z0-9]{10} and 1 million iterations per second, it would take 65 years to iterate trough the space of all 10-char strings.
Parse your regular expression into a DFA, then traverse your DFA randomly until you end up in an accepting state, outputting a character for each transition. Each walk will yield a new string that matches the expression.
This doesn't work for "regular" expressions that aren't really regular, though, such as expressions with backreferences. It depends on what kind of expression you're after.
Take a look at Perl's String::Random.
One rather ugly solution that may or may not be practical is to leverage an existing regex diagnostics option. Some regex libraries have the ability to figure out where the regex failed to match. In this case, you could use what is in effect a form of brute force, but using one character at a time and trying to get longer (and further-matching) strings until you got a full match. This is a very ugly solution. However, unlike a standard brute force solution, it failure on a string like ab will also tell you whether there exists a string ab.* which will match (if not, stop and try ac. If so, try a longer string). This is probably not feasible with all regex libraries.
On the bright side, this kind of solution is probably pretty cool from a teaching perspective. In practice it's probably similar in effect to a dfa solution, but without the requirement to think about dfas.
Note that you won't want to use random strings with this technique. However, you can use random characters to start with if you keep track of what you've tested in a tree, so the effect is the same.
if your only criteria are that your method is easy and universal, then there ain't nothing easier or more universal than brute force. :)
for (i = 0; i < 10; ++i) {
do {
var str = generateRandomString();
} while (!myRegex.match(str));
myListOfGoodStrings.push(str);
}
Of course, this is a very silly way to do things and mostly was meant as a joke.
I think your best bet would be to try writing your own very basic parser, teaching it just the things which you're expecting to encounter (eg: letter and number ranges, repeating/optional characters... don't worry about look-behinds etc)
The universality criterion is impossible. Given the regular expression "^To be, or not to be -- that is the question:$", there will not be ten unique random strings that match.
For non-degenerate cases:
moonshadow's link to Perl's String::Random is the answer. A Perl program that reads a RegEx from stdin and writes the output from ten invocations of String::Random to stdout is trivial. Compile it to either a Windows or Unix exe with Perl2exe and invoke it from PHP, Python, or whatever.
Also see Random Text generator based on regex

Constructing regex

I use regex buddy which takes in a regex and then gives out the meaning of it from which one gets what it could be doing? On similar lines is it possible to have some engine which takes natural language input describing about the pattern one needs to match/replace and gives out the correct(almost correct) regex for that description?
e.g. Match the whole word 'dio' in some file
So regex for that could be : <dio>
or
\bdio\b
-AD.
P.S. = I think few guys here might think this as a 'subjective' 'not-related-to-programming' question, but i just need to ask this question nonetheless. For myself. - Thanks.
This would be complicated to program, because you need a natural language parser able to derive meaning. Unless you limit it to a strict subset -- in which case, you're reinventing an expression language, and you'll eventually wind up back at regular expressions -- only with bigger symbols. so what's the gain?
Regexes were developed for a reason -- they're the simplest, most accurate representation possible.
There is a Symbolix Regular Expression Builder package for Emacs, but looking at it, I think that regular expressions are easier to work with.
Short answer: no, not until artificial intelligence improves A LOT.
If you wrote something like this, you'd have a very limited syntax. For someone to know "Match the whole word 'dio' in some file", they would basically need to have significant knowledge of regular expressions. At that point, just use regular expressions.
For non-technical users, this will never work unless you limit it to basic "find this phrase" or, maybe, "find lines starting/ending with ??". They're never going to come up with something like this:
Find lines containing a less-than symbol followed by the string 'img' followed by one or more groupings of: some whitespace followed by one or more letters followed by either a double-quoted string or a single-quoted string, and those groupings are followed by any length of whitespace then a slash and a greater-than sign.
That's my attempt at a plain-language version of this relatively simple regex:
/<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*/>/i
Yeah, I agree with you that it is subjective. But I will answer your question because I think that you have asked a wrong question.
The answer is "YES". Almost anything can be coded and this would be a rather simple application to code. Will it work perfectly? No, it wouldn't because natural language is quite complex to parse and interpret. But it is possible to write such an engine with some constraints.
Generating a regex via the use of a natural language processor is quite possible. Prolog is supposed to be a good language choice for this kind of problem. In practice, however, what you'd be doing, in effect, is designing your own input language which provides a regex as output. If your goal is to produce regexs for a specific task, this might in fact be useful. Perhaps the task you are doing tends to require certain formulations that are doable but not built into regular expressions. Though whether this will be more effective than just creating the regexs one at a time depends on your project. Usually this is probably not the case, since your own language is not going to be as well-known or as well-documented as regex. If your goal is to produce a replacement for regex whose output will be parsed as a regex, I think you're asking a lot. Not to say people haven't done the same sort of thing before (e.g. the C++ language as an 'improvement' that runs, originally, on C++).
try the open source mac application Ruby Regexp Machine, at http://www.rubyregexp.sf.net. It is written in ruby, so you can use some of the code even if you are not on mac. You can describe a lot of simple regular expresions in an easy english grammar. As a disclosure, i did make this tool.