Regex / heuristic to detect repetitive words, e.g. "gfgfgfgf", "dadadada", "sdsdsd"

How can we use regex to search for repetitive patterns within words in order to detect "junk" or dummy words such as "gfgfgfgfg", while not flagging creative words like "aweeesssoome", "omggg", etc.?
Examples:
In the case of "gfgfgfgfg" regex search / detection / result should be positive ("gf" base pattern detected, which ultimately constructs the entire word, mind the "hanging" final character "g")
In the case of word "aweesooomee" it should return false, as no repetitive pattern is used to construct the entire word.
re possible duplicate mark by rsjaffe:
Question Detect repetitions in string has a generic and not so "smart" solution I am looking for. As explained above, the solution / variation I'm currently using considerably reduces false positives detection. Simple test in the link I've posted on regex101.com can serve as a proof and see why it does not satisfy my requirements.
Additional explanation:
The above method also detects repetitions across neighboring words and flags creative ("valid") words, which is not a desirable effect.
Examples:
"this is" -- detects "is" as repetitions in 2 separate words ("is is" pattern match).
"awesoooommeee" -- detects repetitions of single letters like "o", "m" and "e".
Searching for this solution proved to be a bit hard to find, so I'm forced to ask the question.
First, a bit of a background story:
I run a blog
I have a post about reCaptcha
Sometimes (every week or so) someone tries to be funny and posts spam comments in a form similar to this:
gfgfgfgf
sdsdsdsds
dadadada
You get the idea. Whether they are testing an automated reCaptcha bypass as a proof of concept or just trying to be funny, I don't know and I don't really care (most probably a mix of both).
(edit) Interestingly enough, no other posts are affected by this type of spam comment.
However, thinking about this, it should be relatively simple to detect patterns in the (mostly, 99% of the time) single words these comments contain and prevent them from being posted. Sounds simple?
But, it must also be good enough to avoid false positives.
If, for example, a comment consists of a single repetitive word like those above, then it's definitely spam.
If, on the other hand, it just has a typo in the middle of the normal sentence, it should pass.
Now, I can already 'hear' the comments below: why not use Akismet, or solution X, or solution Y, or an external comment system like Disqus or Facebook comments? Because I can't. It must be in-house, and I want it to be simple. I already have some things that prevent a lot of junk, but for this particular case they all fail.
Solution(s) that I have tested so far:
This is one regex example that is a variant of this answer here, but it's not perfect:
(.+\w)(?=\1+)/gu
see live regex101 example
The problem is that on the examples below it works most of the time, but it also triggers false positives:
correct/proper detection:
123123123123
daddaddaddad
sadsadasad
sadsadsad
121212121
sasasasasas
sdsdsdsds
dsdsdsdsd
ffffffff
blahblah
ioiooioioioi
popopopopop
Hi I dont think this is a spam.
improper/incorrect detection (false positive):
I loooovve this. It's awesooooommeee!
Now, this is tricky. The filter does exactly what it was instructed to do; however, the "ooovv" and "oooommeee" patterns are not repetitive in the same sense as the first ones listed above ("gfgfgfgf" etc.). The filter detects the repeated "oo" pattern. Technically correct, but not exactly what I want to target.
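For anyone who wants to reproduce this, here is a minimal sketch, using Python's re module purely as a stand-in for whatever engine actually runs the filter (the pattern itself is the one above):

import re

# The lookahead-based filter: a group of 2+ word characters immediately followed by itself.
junk = re.compile(r'(.+\w)(?=\1+)', re.UNICODE)

for text in ["gfgfgfgf", "sdsdsdsds", "I loooovve this. It's awesooooommeee!"]:
    m = junk.search(text)
    print(repr(text), "->", m.group(1) if m else None)

# 'gfgfgfgf'  -> 'gfgf'   (flagged: a repeated base is found)
# 'sdsdsdsds' -> 'sdsd'   (flagged)
# "I loooovve this. It's awesooooommeee!" -> 'oo'   (false positive via the repeated "oo")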
Does anyone have an idea how I can make this regex detection a bit smarter?
Thanks!

I finally solved it! And with a single regex line :)
Searching for "regex detect repetitive string" I found the required clues.
This is the question: Matching on repeated substrings in a regex, and the particular answer there is what inspired me to find a solution.
The solution is to use a capturing group and a backreference in a slightly modified version of the regex from the original answer, in order to include both letters and numbers:
^([a-z0-9]{2,}).*(\1)$/gumi
Example: https://regex101.com/r/xG40cL/1
Another variation of the above solution also allows single-character groups, so that words with both even and odd symmetry (an even or odd number of characters, e.g. "ooo", "iii") will be matched:
^([a-z0-9]{1,}).*(\1)$/gumi
Example: https://regex101.com/r/m9aqNk/1
It is still not perfect, but it is definitely better and closer to the ideal case.
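To make the behaviour concrete, here is a minimal sketch of the final approach, again using Python's re purely to illustrate the PCRE-style pattern shown on regex101 (the i and m flags become re options):

import re

# The accepted pattern: a group of 2+ letters/digits at the start that reappears at the end.
junk = re.compile(r'^([a-z0-9]{2,}).*(\1)$', re.IGNORECASE | re.MULTILINE)

words = ["gfgfgfgfg", "sdsdsdsds", "blahblah", "aweesooomee", "awesoooommeee", "entertainment"]
for word in words:
    print(word, "->", bool(junk.search(word)))

# gfgfgfgfg     -> True   (flagged as junk)
# sdsdsdsds     -> True
# blahblah      -> True
# aweesooomee   -> False  (creative word passes)
# awesoooommeee -> False
# entertainment -> True   (residual false positive: the prefix "ent" equals the suffix)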
Sorry everyone for being such a pain; I now understand the proper regex terminology for what I was seeking (it's called a backreference).

Related

Can this regex be made memory efficient

I get XML as a plain, unformatted text blob. I have to make some replacements, and I use regex find and replace.
For example:
<MeasureValue><Text value="StartCalibration" /></MeasureValue>
has to be converted to
<MeasureValue type="Text" value="StartCalibration"/>
The regex I wrote was
<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>
And the replacement part was:
<MeasureValue type="$2" value="$3"/>
Here is a link showing the same.
The issue is that on a file with 370 such occurrences, I get an out-of-memory error. I have heard of so-called greedy regex patterns and am wondering if that is what is plaguing me. If this is already memory-efficient, then I will leave it as it is and try to increase the server memory. I have to process thousands of such documents.
EDIT: This is part of a Logstash script from Elasticsearch. Per the documentation, Elasticsearch uses Apache Lucene internally to parse regular expressions. Not sure if that helps.
As a rule of thumb, specificity is positively correlated with efficiency in regex.
So, know your data and build something to surgically match it.
The more specifically you build your regex, literally writing down the pattern (and usually ending up with a freakishly long regex), the fewer resources it will take, because there are fewer "possibilities" it can match in your data.
To be more precise, imagine we are trying to match a string
2014-08-26 app[web.1]: 50.0.134.125
Approaches such as
(.*) (.*) (.*)
leave it too open and prone to match MANY different combinations, and thus take a LOT more work to process all the possibilities. Check here: https://regex101.com/r/GvmPOC/1
On the other hand, you could spend a little more time building a more elaborate expression such as:
^[0-9]{4}\-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]\: [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$
And I agree, it is horrible, but it is much more precise. It won't waste your precious resources finding unnecessary stuff. Check here: https://regex101.com/r/quz7fo/1
Another thing to keep in mind: operators such as * or + perform a scan, which, depending on the size of your string, might take some time. Also, whenever possible, specifying the anchors ^ and $ helps the engine avoid trying to find too many matches within the same string.
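To make the permissiveness gap visible, here is a small sketch in Python (used only for illustration; the unnecessary backslash escapes from the pattern above are tidied, but it is otherwise the same expression):

import re

loose  = re.compile(r'(.*) (.*) (.*)')
strict = re.compile(r'^[0-9]{4}-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]: '
                    r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$')

samples = [
    "2014-08-26 app[web.1]: 50.0.134.125",  # the line we actually want
    "foo bar baz",                          # garbage that happens to contain two spaces
    "not a log line at all",                # more garbage
]

for s in samples:
    print(repr(s), "loose:", bool(loose.search(s)), "strict:", bool(strict.search(s)))

# The loose pattern happily "matches" anything containing two spaces,
# while the strict one only accepts the exact shape of the log line.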
Bringing it to your reality...
If we have to use regex.
The million-dollar question is, how can we turn your regex into something more precise?
Since there is no limit to tag name lengths in XML... there is no way to make it utterly specific :(
We could try to specify which characters to match and avoid . and \w; substituting something more specific like [a-zA-Z] is preferable. Making use of negated classes [^...] also helps narrow down the range of possibilities.
Avoid * and ? and try to use a bounded quantifier {} instead (although I don't know your data well enough to make this decision). And as I stated above, XML imposes no length limit here.
I didn't understand precisely what the ? in your code was for, so removing it is one less thing to process.
Ended up with something like
<(([a-zA-Z]+) value="([^"]*)"[^<>]*)>
Not many changes though. You can try to measure it to see if there was any improvement.
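If it helps to see the rewrite end to end, here is a small sketch of the substitution in Python (treat it purely as an illustration; the Logstash/Lucene engine may differ in details, and the tightened inner pattern is simply placed back between the literal MeasureValue tags of the original expression):

import re

blob = ('<MeasureValue><Text value="StartCalibration" /></MeasureValue>'
        '<MeasureValue><Number value="42" /></MeasureValue>')

# Tightened pattern: explicit character classes instead of \w, .*? and the lone ?
pattern = re.compile(r'<MeasureValue><(([a-zA-Z]+) value="([^"]*)"[^<>]*)></MeasureValue>')

result = pattern.sub(r'<MeasureValue type="\2" value="\3"/>', blob)
print(result)
# <MeasureValue type="Text" value="StartCalibration"/><MeasureValue type="Number" value="42"/>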
But perhaps the best approach is not to use regex at all :(
I don't know which language you are working with, but if processing time is becoming a problem, I would suggest not using regex and trying an alternative.
If there is any possibility of using an XML parser, that would be preferable.
https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
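For completeness, a rough sketch of the parser route using Python's standard library (assuming the blob can be wrapped in a dummy root element so that ElementTree will accept the fragment; this is an illustration, not a drop-in Logstash solution):

import xml.etree.ElementTree as ET

def convert(blob: str) -> str:
    # Wrap the fragment in a dummy root so the parser accepts it.
    root = ET.fromstring(f"<root>{blob}</root>")
    for mv in root.iter("MeasureValue"):
        child = mv[0]  # assumes exactly one child, e.g. <Text value="StartCalibration"/>
        mv.set("type", child.tag)
        mv.set("value", child.get("value", ""))
        mv.remove(child)
    # Serialize the children back without the dummy root.
    return "".join(ET.tostring(el, encoding="unicode") for el in root)

print(convert('<MeasureValue><Text value="StartCalibration" /></MeasureValue>'))
# <MeasureValue type="Text" value="StartCalibration" />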
Sorry if this isn't as conclusive as you might have expected, but the problem is quite open-ended.

Why don't regex engines ensure all required characters are in the string?

For example, look at this email validating regex:
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*#([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$. If you look carefully, there are three parts: stuff, the # character, and more stuff. So the regex requires an email address to contain a #; thus the string aaaaaaaaaaaaaaaaaaaaaa! will not match.
Yet most regex engines will catastrophically backtrack given this combination. (PCRE, which powers Regex101, is smarter than most, but other regex/string combinations can cause catastrophic backtracking.)
Without needing to know much about Big O, I can tell that combinatorial things are exponential, while searching is linear. So why don't regex engines ensure the string contains required characters (so they can quit early)?
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Coming back after getting more experience, I know that regexes are declarative, meaning the execution plan is determined by the computer, not the programmer. Optimization is one of the ways that regex engines differ the most.
While PCRE and Perl have challenged the declarative status quo with the introduction of backtracking control verbs, it is the other engines, without those verbs, that are most likely to backtrack catastrophically.
I think you're taking this the wrong way, really:
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Well, if you write a regex, your regex engine will need to follow that program you've written.
If you write a complex program, then there's nothing the engine can do about that; this regex explicitly specifies that you'll need to match "stuff" first, before looking for the #.
Now, not being too involved in writing compilers, I agree that in this case it might be possible to first identify all the "static" elements, which here is only the aforementioned #, and look for them. Sadly, in the general case this won't really help you, because there might be more than one static element, or none at all…
If you cared about speed, you'd actually just first search for the # with plain linear search, and then do your regex thing after you've found one.
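In code, that pre-filter is just a cheap linear containment check before the expensive match; a minimal sketch in Python, reusing the pattern from the question (keeping the # character exactly as written there):

import re

EMAIL = re.compile(
    r'^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*#([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$'
)

def looks_like_email(s: str) -> bool:
    # Linear scan first: if the required '#' isn't there at all,
    # never hand the string to the backtracking engine.
    if '#' not in s:
        return False
    return EMAIL.match(s) is not None

print(looks_like_email("aaaaaaaaaaaaaaaaaaaaaa!"))  # False, returned without running the regex
print(looks_like_email("john.doe#example.com"))     # True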
Regexes were never meant to be as fast as linear search engines, because they were rather meant to be much, much more powerful.
So, not only are you hauling the wrong party before the judge (the regex engine, rather than the regex, which is a program with its own complexity), you're also blaming the victim for the crime (you want the speed of just looking for the # character, but you still want to use a regex).
By the way, don't validate email addresses with regexes. It's the wrong tool:
http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

Simplifying my Eclipse regex

So, I'm fairly new to regex. That being said, I'm looking for help. I've got this regex to do what I want, but this is as simple as I can make it with my current understanding.
(\w+\s*?\w+\s*?\-*?/*?\w+)\s*?(\(\w+\))
What this needs to match are the following configurations of strings:
word
word word
word-word
word/word
word word/word
word word/LL
word word (word)
word-word word/word
I kind of feel like I'm abusing *? but I saw an example that used that and it seemed to do what I needed. I've also seen that just * will do the same? Or just ?? Anyway there it is. Thanks in advance!
Also, the grouping is there because this regex is actually significantly longer with other groups. Please keep that in mind. I need the group to still work with others (4 in total).
EDIT: Sorry everyone. I'm actually trying to convert text copy-pasted from a PDF into Python syntax using the built-in find/replace (with regex) in the Eclipse IDE. That's why I didn't specify what I was using; I thought it was just plain ol' regex.
Also, my current regex works perfectly. What I'm asking for here is a lesson on simplicity (and a better explanation of * and ?). I just felt my current regex was long and ugly.
A ? after another regex quantifier makes it reluctant, meaning it will consume more input only when the remainder of the regex has not been able to match.
The reluctant ? is superfluous when the set of characters it applies to has no character in common with the following set. For example, in:
[0-9]*?[A-Z]
there is no way [A-Z] will match unless all previous [0-9]s have been matched. Then why make [0-9]* reluctant? On the contrary, make it greedy by removing the ?.
[0-9]*[A-Z]
There is a second case where ? is abused. For example, you know that certain text contains, say, a colon followed by an uppercase word, and there are no other possible occurrences of a colon.
.*?:[A-Z]+
would do the job. However,
[^:]*:[A-Z]+
better represents the fact that a colon always initiates what you want to match. In this case, we "created" the first condition (no character in common) by realizing that, in fact, there never was a need for .* to also match colons; [^:]* is all we needed.
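To confirm that the greedy rewrites match exactly the same text, here is a tiny check (Python is used purely for illustration; the sample string is made up):

import re

text = "abc123X and then note:TODO later"

# Reluctant vs greedy where the adjacent character sets share nothing:
print(re.search(r'[0-9]*?[A-Z]', text).group())   # 123X
print(re.search(r'[0-9]*[A-Z]', text).group())    # 123X  (identical)

# Lazy dot-star before a required colon vs the explicit negated class:
print(re.search(r'.*?:[A-Z]+', text).group())     # abc123X and then note:TODO
print(re.search(r'[^:]*:[A-Z]+', text).group())   # abc123X and then note:TODO  (identical)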
I'm reluctant to use the reluctant operator because it sometimes obscures patterns instead of clarifying them, and also because of its performance implications: it can increase the amount of backtracking enormously (and without reason).
Applying these principles,
(\w+\s*\w+\s*\-*/*\w+)\s*(\(\w+\))
seems a better option. Also, at one point you use \-*/*. It's hard to know what you really want without as many counter-examples as positive examples (and this is extremely important when developing and testing any regex!), but do you really want to accept possibly many -s followed by possibly many /s? My impression is that what you are looking for is one -, one /, or one space; [ \-/] would do much better. Or perhaps \s*[\-/]?\s* if you want to accept multiple spaces, even before and/or after the [\-/]:
(\w+\s*\w+\s*[\-/]?\s*\w+)\s*(\(\w+\))
See the Java documentation on Regular Expressions to find out more.
p.s.w.g was correct in pointing out that (.*) is the simplest form of what I needed. The other 3 groups of my regular expression are specific enough that this works. Thank you, p.s.w.g.
PS: I still don't know why I was down-voted.

Can a regular expression be tested to see if it reduces to .*

I'm developing an application where users enter a regular expression as a filter criterion; however, I do not want people to be able to (easily) enter .* (i.e. match anything). The problem is, if I just use if (expression == ".*"), this can easily be sidestepped by entering something such as .*.*.
Does anyone know of a test that could take a piece of regex and see if is essentially .* but in a slightly more elaborate form?
My thoughts are:
I could check whether the expression is one or more repetitions of .*, i.e. whether it matches (\.\*)+ (the quoting/escaping may not be entirely accurate, but you get the idea). The problem with this is that there may be other ways of writing a global match (e.g. with $ and ^) that are too numerous to even think of up front, let alone test.
I could test a few randomly generated strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
Thoughts, anyone?
(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)
Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions
You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
If that seems like too much work, then you can use a quick and dirty probabilistic test: just generate some random strings and see if they match the user's regex. If they all match, you have a pretty good indication that the regex is overly broad.
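A minimal sketch of that probabilistic test (the question mentions Java, but Python is used here purely for illustration; the probe length, alphabet, and fullmatch semantics are arbitrary choices you would adapt to how your filter is actually applied):

import random
import re
import string

def probably_matches_everything(user_pattern: str, trials: int = 100) -> bool:
    # Heuristic: if a user-supplied filter matches a whole batch of random strings,
    # it is very likely equivalent to .* (or close enough to be useless as a filter).
    rx = re.compile(user_pattern)
    alphabet = string.ascii_letters + string.digits + " .,-"
    for _ in range(trials):
        probe = "".join(random.choices(alphabet, k=random.randint(0, 40)))
        if rx.fullmatch(probe) is None:
            return False  # at least one miss: the filter is doing some real work
    return True

print(probably_matches_everything(r'.*.*'))        # True  (a disguised match-anything)
print(probably_matches_everything(r'(a|[^a])*'))   # True  (another disguise)
print(probably_matches_everything(r'error: \d+'))  # False (with overwhelming probability)

# Caveat: the choice of probe strings matters; newline-free probes, for example,
# cannot distinguish .* from [\s\S]*.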
There are many, many ways to achieve something equivalent to .*, e.g. just put any character class and its counterpart into a class or an alternation and it will match anything.
So I think it is not possible to use a regular expression to test another regular expression for equivalence to .*.
These are some examples that would match the same as .* (they will additionally match newline characters):
/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
So I assume your idea 2 would be a lot easier to achieve.
Thanks everyone,
I did miss the testing for equivalence entry on the wikipedia, which was interesting.
My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!
I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.
Now to decide what type of strings to generate in order to run the tests....
Kind regards,
Russ.

Improving Perl regex performance by adding +

I have some regexes in a Perl script that are correct but slow. I am considering trying to improve performance by adding extra + operators (i.e. *+ instead of * and ++ instead of +) to disable backtracking. I tried replacing all of them and the regexes stopped working... so much for the simple solution. How do I know where I can add them without breaking the regex?
If the regexes stopped working, you either aren't using a version of Perl that supports them, or you actually do need backtracking in those cases.
Identify sections of the regex that won't ever need backtracking (that is, that if asked to match starting at a given point, there will never be more than one length you might want them to match), and surround them with (?> ). This has the same effect as ++/*+ and is supported even pre-5.10.
Note that restricting backtracking is often not "optimization", since it changes what will and will not be matched. The idea is that you use it to better describe what you actually want matched. Borrowing from the article linked in the OP's answer, something like ^(.*?,){11}P (the twelfth comma-separated field starts with P) is not just inefficient, it is incorrect, since backtracking will cause it to match even when only a field after the twelfth starts with P. By correcting it to ^(?>.*?,){11}P you restrict it to matching the correct number of leading fields. (In this trivial case, ^([^,]*,){11}P also does the job, but if you add support for escaped or quoted commas within fields using alternation, (?> becomes the easier choice.)
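The same comparison is easy to try outside Perl; here is a sketch in Python 3.11+, whose re module also understands atomic groups (?>...) (older versions would need the third-party regex module). The sample line is made up:

import re  # atomic groups require Python 3.11+ (or the third-party 'regex' module)

# 15 comma-separated fields; the 12th is "l", and only the 14th starts with "P".
line = "a,b,c,d,e,f,g,h,i,j,k,l,m,Pig,n"

backtracking = re.compile(r'^(.*?,){11}P')    # lazy group can be re-expanded across commas
atomic       = re.compile(r'^(?>.*?,){11}P')  # each field, once matched, is locked in

print(bool(backtracking.match(line)))  # True  -- wrong: it found a "P" in a later field
print(bool(atomic.match(line)))        # False -- correct: the twelfth field doesn't start with P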
Hmmm... once I posted the question, looking at the "Related" column led me to this, which has some pretty good ideas: http://www.regular-expressions.info/catastrophic.html