Can a regular expression be tested to see if it reduces to .* - regex

I'm developing an application where users enter a regular expression as a filter criterion, however I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is, if I just use if (expression == ".*"), then this could be easily sidestepped by entering something such as .*.*.
Does anyone know of a test that could take a piece of regex and see if is essentially .* but in a slightly more elaborate form?
My thoughts are:
I could see if the expression is one or more repetitions of .*, (i.e. if it matches (\.\*)+ (quotations/escapes may not be entirely accurate, but you get the idea). The problem with this is that there may be other forms of writing a global match (e.g. with $ and ^) that are too exhaustive to even think of upfront, let along test.
I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
Thoughts, anyone?
(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)

Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions
You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
If that seems like too much work, then you can use a quick and dirty probabilistic test. Just Generated some random strings to see if they match the user's regex. If they are match, you have a pretty good indication that the regex is overly broad.

There are many, many possibilities to achieve something equivalent to .*. e.g. just put any class of characters and the counter part into a class or a alternation and it will match anything.
So, I think with a regular expression its not possible to test another regular expression for equivalence to .*.
These are some examples that would match the same than .* (they will additionally match the newline characters)
/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
So I assume your idea 2 would be a lot easier to achieve.

Thanks everyone,
I did miss the testing for equivalence entry on the wikipedia, which was interesting.
My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!
I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.
Now to decide what type of strings to generate in order to run the tests....
Kind regards,
Russ.

Related

regex / heuristic to detect repetitive words e.g. "gfgfgfgf" "dadadada" "sdsdsd"

How can we search for repetitive patterns in word(s) using regex in order to detect "junk" or dummy words such as "gfgfgfgfg" and similar, but not limit creative words like "aweeesssoome", "omggg" etc.
Examples:
In the case of "gfgfgfgfg" regex search / detection / result should be positive ("gf" base pattern detected, which ultimately constructs the entire word, mind the "hanging" final character "g")
In the case of word "aweesooomee" it should return false, as no repetitive pattern is used to construct the entire word.
re possible duplicate mark by rsjaffe:
Question Detect repetitions in string has a generic and not so "smart" solution I am looking for. As explained above, the solution / variation I'm currently using considerably reduces false positives detection. Simple test in the link I've posted on regex101.com can serve as a proof and see why it does not satisfy my requirements.
Additional explanation:
Above method detects repetitions from neighboring words, as well, and limits creative ("valid") words, which is not a desirable effect.
Examples:
"this is" -- detects "is" as repetitions in 2 separate words ("is is" pattern match).
"awesoooommeee" -- detects repetitions of single letters like "o", "m" and "e".
Searching for this solution proved to be a bit hard to find, so I'm forced to ask the question.
First, a bit of a background story:
I run a blog
I have a post about reCaptcha
Sometimes (every week or so) someone tries to be funny and posts spam comments in the similar form to this:
gfgfgfgf
sdsdsdsds
dadadada
You get the idea. Are they testing an automated reCaptcha bypass systems as a proof of concept or just trying to be funny, I don't know and I don't really care (most probably a mix of both).
(edit) Interestingly enough, no other posts are affected by this type of spam comments.
However, thinking about this, it should be relatively simple and easy to detect patterns in the (mostly) single words which those comments have (99%) and prevent those comments from posting. Sounds simple?
But, it must also be good enough to avoid false positives.
If, for example, a comment has single repetitive word like above, then it's definitely a spam.
If, on the other hand, it just has a typo in the middle of the normal sentence, it should pass.
Now, I can already 'hear' comments below why not use Akismet. Or solution X. Or solution Y. Why not external comments system like Disqus or Facebook comments. Because, I can't. It must be in-house. And I wish to be simple. I already have some things that prevent a lot of junk, but for this particular case they all fail.
Solution(s) that I have tested so far:
This is one regex example that is a variant of this answer here, but it's not perfect:
(.+\w)(?=\1+)/gu
see live regex101 example
Problems with it is that in examples below it will pass most of the time, but it will trigger false positives, too:
correct/proper detection:
123123123123
daddaddaddad
sadsadasad
sadsadsad
121212121
sasasasasas
sdsdsdsds
dsdsdsdsd
ffffffff
blahblah
ioiooioioioi
popopopopop
Hi I dont think this is a spam.
improper/incorrect detection (false positive):
I loooovve this. It's awesooooommeee!
Now, this is tricky. The filter does exactly what it was instructed to do, however, "ooovv" and "oooommeee" patterns are not exactly repetitive in the same sense like the first ones listed above ("gfgfgfgf" etc.). Filter detects "oo" pattern repetition. Yes, correct, but not exactly what I want to target.
Does anyone have an idea how can I improve this regex detection to be smarter a bit?
Thanks!
I finally solved it! And with a single regex line :)
Searching for regex detect repetitive string I found the required clues.
This is the question: Matching on repeated substrings in a regex and the particular answer that inspired me to find a solution.
The solution is to use capturing groups and backreference in a slightly modified regex from above original answer in order to include both letters and numbers:
^([a-z0-9]{2,}).*(\1)$/gumi
Example: https://regex101.com/r/xG40cL/1
Another variation of above solution is to include single characters, so that both words with even and odd number of characters (even and odd symmetry) will also be matched (e.g. "ooo", "iii" etc.):
^([a-z0-9]{1,}).*(\1)$/gumi
Example: https://regex101.com/r/m9aqNk/1
It is still not perfect, but definitely better and closer to ideal case.
Sorry everyone for being such a pain, as now I understand the proper terminology I was seeking regarding regex (it's called backreference).

Simplifying my Eclipse regex

So, I'm fairly new to regex. That being said, I'm looking for help. I've got this regex to do what I want, but this is as simple as I can make it with my current understanding.
(\w+\s*?\w+\s*?\-*?/*?\w+)\s*?(\(\w+\))
What this needs to match are the following configurations of strings:
word
word word
word-word
word/word
word word/word
word word/LL
word word (word)
word-word word/word
I kind of feel like I'm abusing *? but I saw an example that used that and it seemed to do what I needed. I've also seen that just * will do the same? Or just ?? Anyway there it is. Thanks in advance!
Also, the grouping is there because this regex is actually significantly longer with other groups. Please keep that in mind. I need the group to still work with others (4 in total).
EDIT: Sorry everyone. I'm actually trying to convert text being copy pasted from a pdf into python syntax using the built in find/replace (using regex) in the Eclipse IDE. That's why I didn't specify what I was using. I thought it was just plain ol' regex.
Also, my current regex works perfectly. What I'm asking for here is a lesson on simplicity (and the * and ? better explained). I just felt my current regex was long and ugly.
? after other RegEx quantifiers makes them reluctant. Meaning that they will match input only when the remainder of the RegEx has not been able to match.
The reluctant ? is superfluous when the set of characters it applies to has no common character with the following set. For example in:
[0-9]*?[A-Z]
there is no way [A-Z] will match unless all previous [0-9]s have been matched. Then why make [0-9]* reluctant? On the contrary, make it greedy by removing the ?.
[0-9]*[A-Z]
There is a second case where ? is abused. For example, you know that certain text contains, say, a colon followed an uppercase word. There are no other possible occurrences of a colon.
.*?:[A-Z]+
would do the job. Hoevever,
[^:]*:[A-Z]+
represents better the fact that a colon will always initiate what you want to match. In this case, we "created" the first condition (of character commonality) by realizing that, in fact, there never was need for one. IOW that we never needed a .* matching also :s, but just [^:]*.
I'm reluctant to use the reluctant operator because sometimes it tends to obscure patterns instead of clarify them and also because performance implications, both thanks to the fact that it increases the level of backtracking enormously (and without a reason).
Applying these principles,
(\w+\s*\w+\s*\-*/*\w+)\s*(\(\w+\))
seems a better option. Also, at some point you use \-*/*. It's hard to know what you really want without as many counter-examples as (positive) examples (and this is extremely important while developing and testing any RegEx!), but do you really want to accept perhaps many -s followed by perhaps many /s? My impression is that what you are looking for is one - or one / or one space. [ \-/] would do much better. Or perhaps \s*[\-/]?\s* if you want to accept multiple spaces, even before and/or after the [\-/]
(\w+\s*\w+\s*[\-/]?\s*\w+)\s*(\(\w+\))
See the Java documentation on Regular Expressions to find out more.
p.s.w.g was correct in pointing out that (.*) is the simplest form of what I needed. The other 3 grouping of my regular expression are specific enough that this works. Thank you p.s.w.g.
PS still don't know why I was down-voted

does regex comparisons consume lots of resources?

i dunno, but will your machine suffer great slowdown if you use a very complex regex?
like for example the famous email validation module proposed just recently? which can be found here RFC822
update: sorry i had to ask this question in a hurry anyway i posted the link to the email regex i was talking about
It highly depends on the individual regex: features like look-behind or look-ahead can get very expensive, while simple regular expressions are fine for most situations.
Tutorials on http://www.regular-expressions.info/ offer performance advice, so that can be a good start.
Regexes are usually implemented as one of two algorithms (NFA or DFA) that correspond to two different FSMs. Different languages and even different versions of the same language may have a different type of regex. Naturally, some regexes work faster in one and some work faster in the other. If it's really critical, you might want to find what type of regex FSM is implemented.
I'm no expert here. I got all this from reading Mastering Regular Expressions by Jeffrey E. F. Friedl. You might want to look that up.
Depends also on how well you optimise your query, and knowing the internal working of regex.
Using the negated character class, for example, saves the cost of having the engine backtracking characters (i.e. /<[^>]+>/ instead of /<.+?>/)(*).Trivial in small matches, but saves a lot of cycles when you have to match inside a big chunk of text.
And there are many other ways to save resources in regex operations, so performance can vary wildly.
example taken from http://www.regular-expressions.info/repeat.html
You might be interested by articles like: Regular Expression Matching Can Be Simple And Fast or Understanding Regular Expressions.
It is, alas, easy to write inefficient REs, which can match quite quickly on success but can look for hours if no match is found, because the engine stupidly try a long match on every position of a long string!
There are a few recipes for this, like anchoring whenever it is possible, avoiding greediness if possible, etc.
Note that the giant e-mail expression isn't recent, and not necessarily slow: a short, simple expression can be slower than a more convoluted one!
Note also that in some situations (like e-mail, precisely), it can be more efficient (and maintainable!) to use a mix of regexes and code to handle cases, like splitting at #, handling different cases (first part starts with " or not, second part is IP address or domain, etc.).
Regexes are not the ultimate tool able to do everything, but it is a very useful tool well worth to master!
It depends on your regexp engine. As explained here (Regular Expression Matching Can Be Simple And Fast) there may be some important difference in the performance depending on the implementation.
You can't talk about regexes in general any more than you can talk about code in general.
Regular expressions are little programs on their own. Just as any given program may be fast or slow, any given regex may be fast or slow.
One thing to remember, however, is that the regular expression handler is is very well optimized to do its job and run the regex quickly.
I once made a program that analyzed a lot of text (a big code base, >300k lines). First I used regex but when I switched to using regular string functions it got a lot faster, like taking 40% of the time of the regex version. So while of course it depends, my thing got a lot faster.
Once I had written a greedy - accidentally, of course :-) - a multi-line regex and had it search/replace on 10 * 200 GB of text files. It was damn slow... So it depends what you write, and what you check.
Depends on the complexity of the expression and the language the expression is used with.
In JavaScript; you have to optimize everything. In C#; not so much.

I don’t get regular expressions

I don’t understand or see the need for regular expressions.
Can some explain them in simple terms and provide some basic examples where they could be useful, or even critical.
Use them where you need to use/manipulate patterns. For instance, suppose you need to recognise the following pattern:
Any letter, A-Z, either upper or lower case, 5 or 6 times
3 digits
a single letter a-z (definitely lower case)
(Things like this crop up for zip code, credit card, social security number validation etc.)
That's not really hard to write in code - but it becomes harder as the pattern becomes more complicated. With a regular expression, you describe the pattern (rather than the code to validate it) and let the regex engine do the work for you.
The pattern here would be something like
[A-Za-z]{5,6}[0-9]{3}[a-z]
(There are other ways of expressing it too.) Grouping constructs make it easy to match a whole pattern and grab (or replace) different bits of it, too.
A few downsides though:
Regexes can become complicated and hard to read quite quickly. Document thoroughly!
There are variations in behaviour between different regex engines
The complexity can be hard to judge if you're not an expert (which I'm certainly not!); there are "gotchas" which can make the patterns really slow against particular input, and these gotchas aren't obvious at all
Some people overuse regular expressions massively (and some underuse them, of course). The worst example I've seen was where someone asked (on a C# group) how to check whether a string was length 3 - this is clearly a job for using String.Length, but someone seriously suggested matching a regex. Madness. (They also got the regex wrong, which kinda proves the point.)
Regexes use backslashes to escape various things (e.g. use . to mean "a dot" rather than just "any character". In many languages the backslash itself needs escaping.
What regular expressions are used for:
Regular expressions is a language in itself that allows you to perform complex validation of string inputs. I.e. you pass it a string and it will return true or false if it is a match or not.
How regular expressions are used:
Form validation, determine if what the user entered is of the format you want
Finding the position of a certain pattern in a block of text
Search and replace where the search term is a regex and what to replace is a normal string.
Some regular expression language features:
Alternation: allows you to select one thing or another. Example match only yes or no.
yes|no
Grouping: You can define scope and have precedence using parentheses. For example match 3 color shades.
gr(a|e)y|black|white
Quantification: You can quantify how much of something you want. ? means 1 or 0, * means 0 or more. + means at least one. Example: Accept a binary string that is not empty:
(0|1)+
Why regular expressions?
Regular expressions make it easy to match strings, it can often replace several dozen lines of source code with a simple small regular expression string.
Not for all types of matching:
To understand how something is useful, you should also understand how it is not useful. Regular expressions are bad for certain tasks for example when you need to guarantee that a string has an equal number of parentheses.
Available in just about all languages:
Regular expressions are available in just about any programming language.
Formal language:
Any regular expression can be converted to a deterministic finite state machine. And in this same way you can figure out how to make source code that will validate your regular expression.
Example:
[hc]+at
matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at"
Source, further reading
They look a bit cryptic but they provide a very powerful tool for finding patterns in text. Anything from href tags in HTML pages to validating email addresses.
And they can be processed into a very efficient data structure (FSA) that finds matches very fast.
They are a bit tricky, but extremely powerful and worth learning. The web is full of tutorial and examples, start for example from here and look at the examples here.
If I could direct the OP to some of the answers/comments on one of my own questions: How important is knowing Regexs?
Regular expressions are a very concise way to specify most pattern-matching and -replacement problems, and regexp engines can be very highly optimized.
If you wanted to do the same job as even a relatively simple regexp, you'd have to write a lot of code, which probably would contain a number of bugs, be hard to understand and perform badly.
Whereas doing the same with a regexp is much shorter, almost certainly performs as well as is technically possible, and is easier to understand to anyone familiar with regexpes (though it should be commented in either case)
The email example is actually a bad example for regular expressions. Regexes can be used, but the resulting expression (for example this one which doesn't handle "John Doe " style addresses) is hugely complicated - take a look at the email address specification and you'll see why...
However regexes are very useful in a host of other situations, extracting ip addresses from text, tags from html etc. Finding all versioned files would be another example. Something along the lines of:
my_versioned_file_(\d{4}-\d{2}-\d{2}).txt
will match any filenames of the format my_versioned_file_2009-02-26.txt and pull out the date as a captured group (the part wrapped in "()") for you to further analyse.
No regexes are not necessary, but they can save a world of time in writing a hand rolled parser for something a regex can easily achieve.
Whenever you've got some pattern to find in a lot of textual data or if you want to check that a string is in a certain format.
For example an email address...
The code for checking for an at symbol and the presence of a valid domain will look quite big where you could just use a regular expression and have an answer in 2 lines of code.
Regex r = new Regex("<An Email Address Regex>");
bool isValidEmail = r.IsMatch(MyInput);
Other examples would be for checking numbers are in the correct format before parsing them into integers etc.
Jon and Sqook gave a fine explanation and definition of Regular Expressions, and for simple problems it is pretty understandable, but if you use it for complex problems regular expressions can be a &$#( (at least for me ;-))
I use Expresso a lot to help me build complex regular expression code.
http://www.ultrapico.com/Expresso.htm
It has a build in library with expressions you can use, a design mode where you can build your code and a test mode where you can test and validate the code. It helped me build and understand complex expressions better!
Goodluck!
Some practical real world usages:
Finding abstract classes that extend JUnit's TestCase:
abstract\s+class\s+\w+\s+extends\s+TestCase
This is useful for finding test cases that cannot be instantiated and will need excluding from an ant build script that runs test cases. You cannot search for regular text because you don't know the class names in advance. hence the \w+ (At least one word character).
Finding running bash or bourne shell scripts:
ps -e | grep -e " sh| bash"
this is useful if you want to kill them all or something, if you did a search for just sh you'd not get the bash ones and have to run the command again for bash scripts. Again, more serviceable than perfect, but nearly no regex you write on the fly will be.
It's not perfect, but most regexes won't be, or they'll take so long to write they're not worth it. The ones you perfect are the ones you commit as part of some sort of validation or built application.
Example of critical use is JavaScript:
If you need to do search or replace on a string, the only matching you can do is a regular expression. It's in the JavaScript API on those string methods...
Personally, I mostly use regular expressions only when I need some advanced matching in some automated find/replace in a text editor (TextPad or Visual Studio). The most powerful feature in my view is the ability to match a pattern that can be inserted in the replace.
To give you some examples:
Email Address
Password requires at least 1 alphabet and 1 digit
How can you acheive these requirements?
The best way is to use regular expression.
Read the following links to learn more:
How To: Use Regular Expressions to Constrain Input in ASP.NET
http://msdn.microsoft.com/en-us/library/ms998267.aspx

Constructing regex

I use regex buddy which takes in a regex and then gives out the meaning of it from which one gets what it could be doing? On similar lines is it possible to have some engine which takes natural language input describing about the pattern one needs to match/replace and gives out the correct(almost correct) regex for that description?
e.g. Match the whole word 'dio' in some file
So regex for that could be : <dio>
or
\bdio\b
-AD.
P.S. = I think few guys here might think this as a 'subjective' 'not-related-to-programming' question, but i just need to ask this question nonetheless. For myself. - Thanks.
This would be complicated to program, because you need a natural language parser able to derive meaning. Unless you limit it to a strict subset -- in which case, you're reinventing an expression language, and you'll eventually wind up back at regular expressions -- only with bigger symbols. so what's the gain?
Regexes were developed for a reason -- they're the simplest, most accurate representation possible.
There is a Symbolix Regular Expression Builder package for Emacs, but looking at it, I think that regular expressions are easier to work with.
Short answer: no, not until artificial intelligence improves A LOT.
If you wrote something like this, you'd have a very limited syntax. For someone to know "Match the whole word 'dio' in some file", they would basically need to have significant knowledge of regular expressions. At that point, just use regular expressions.
For non-technical users, this will never work unless you limit it to basic "find this phrase" or, maybe, "find lines starting/ending with ??". They're never going to come up with something like this:
Find lines containing a less-than symbol followed by the string 'img' followed by one or more groupings of: some whitespace followed by one or more letters followed by either a double-quoted string or a single-quoted string, and those groupings are followed by any length of whitespace then a slash and a greater-than sign.
That's my attempt at a plain-language version of this relatively simple regex:
/<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*/>/i
Yeah, I agree with you that it is subjective. But I will answer your question because I think that you have asked a wrong question.
The answer is "YES". Almost anything can be coded and this would be a rather simple application to code. Will it work perfectly? No, it wouldn't because natural language is quite complex to parse and interpret. But it is possible to write such an engine with some constraints.
Generating a regex via the use of a natural language processor is quite possible. Prolog is supposed to be a good language choice for this kind of problem. In practice, however, what you'd be doing, in effect, is designing your own input language which provides a regex as output. If your goal is to produce regexs for a specific task, this might in fact be useful. Perhaps the task you are doing tends to require certain formulations that are doable but not built into regular expressions. Though whether this will be more effective than just creating the regexs one at a time depends on your project. Usually this is probably not the case, since your own language is not going to be as well-known or as well-documented as regex. If your goal is to produce a replacement for regex whose output will be parsed as a regex, I think you're asking a lot. Not to say people haven't done the same sort of thing before (e.g. the C++ language as an 'improvement' that runs, originally, on C++).
try the open source mac application Ruby Regexp Machine, at http://www.rubyregexp.sf.net. It is written in ruby, so you can use some of the code even if you are not on mac. You can describe a lot of simple regular expresions in an easy english grammar. As a disclosure, i did make this tool.