ignore repeating characters - regex

I am trying to make a swearing prevention system, so far I have ignored the whitespace (with "\s*") and I've ignored the case("(?i)"). How would I ignore repeated characters ? e.g heeeello.

There is no flag that you can turn on to simply ignore any duplicate characters. However, you can use the 'one or more' quantifier (+) to match one or more occurrence of any character, character class, or group. For example the pattern he+l+o will match all of the following:
helo
heelo
hello
heeeello

Assuming you want a general solution to remove repeated characters, you'll want to replace (.)\1 with \1 repeatedly as long as it succeeds.

Use + to catch as many repetition of a sequence in () as there are. e+ will catch all the e's in heeeeello.
Check out rubular.com, very simple way to learn, practice and test regex.

You need to capture a single character then check for any repetition of it with using a backreference to the lately captured group:
(.)\1+
If string is matched then it has repetition.
Live demo

The problem is harder than you think. Let's assume that you want to match "no fewer than this number of characters" for each word in your dictionary. Then you would have to create a dictionary of regexes with a + after each character…
Initial dictionary:
boom
smurf
tree
cannibals
Process the dictionary with a text editor:
sed -e 's/\(.\)/\1\+/g' dictionary.txt > regex.txt
This puts a + between all characters:
b+o+o+m+
s+m+u+r+f+
t+r+e+e+
c+a+n+n+i+b+a+l+s+
And now you can match your "repeated" words:
bom : no match
smuuurf : match
trees : no match
canibals : no match
cannnibalssss : match
You might want to add "word boundaries" - so that smurfette doesn't get caught by smurf. This would mean adding \b before and after each expression ("word boundary").
Note - it's not enough to remove all duplicate letters from both the dictionary, and the words to be matched - otherwise you risk banning "pop" because you had "poop" on your list (and how would you know to stop when "pooop" had reached exactly two characters). This is why I prefer this solution over some of the others that recommend stripping repeats.

Related

How to pick files based on a certain word in a string in NIFI using file filter regex [duplicate]

How do I create a regular expression to match a word at the beginning of a string?
We are looking to match stop at the beginning of a string and anything can follow it.
For example, the expression should match:
stop
stop random
stopping
If you wish to match only lines beginning with stop, use
^stop
If you wish to match lines beginning with the word stop followed by a space:
^stop\s
Or, if you wish to match lines beginning with the word stop, but followed by either a space or any other non-word character you can use (your regex flavor permitting)
^stop\W
On the other hand, what follows matches a word at the beginning of a string on most regex flavors (in these flavors \w matches the opposite of \W)
^\w
If your flavor does not have the \w shortcut, you can use
^[a-zA-Z0-9]+
Be wary that this second idiom will only match letters and numbers, no symbol whatsoever.
Check your regex flavor manual to know what shortcuts are allowed and what exactly do they match (and how do they deal with Unicode).
Try this:
/^stop.*$/
Explanation:
/ charachters delimit the regular expression (i.e. they are not part of the Regex per se)
^ means match at the beginning of the line
. followed by * means match any character (.), any number of times (*)
$ means to the end of the line
If you would like to enforce that stop be followed by a whitespace, you could modify the RegEx like so:
/^stop\s+.*$/
\s means any whitespace character
+ following the \s means there has to be at least one whitespace character following after the stop word
Note: Also keep in mind that the RegEx above requires that the stop word be followed by a space! So it wouldn't match a line that only contains: stop
If you want to match anything after a word, stop, and not only at the start of the line, you may use: \bstop.*\b - word followed by a line.
Or if you want to match the word in the string, use \bstop[a-zA-Z]* - only the words starting with stop.
Or the start of lines with stop - ^stop[a-zA-Z]* for the word only - first word only.
The whole line ^stop.* - first line of the string only.
And if you want to match every string starting with stop, including newlines, use: /^stop.*/s - multiline string starting with stop.
Like #SharadHolani said. This won't match every word beginning with "stop"
. Only if it's at the beginning of a line like "stop going".
#Waxo gave the right answer:
This one is slightly better, if you want to match any word beginning with "stop" and containing nothing but letters from A to Z.
\bstop[a-zA-Z]*\b
This would match all
stop (1)
stop random (2)
stopping (3)
want to stop (4)
please stop (5)
But
/^stop[a-zA-Z]*/
would only match (1) until (3), but not (4) & (5)
If you want to match anything that starts with "stop" including "stop going", "stop" and "stopping" use:
^stop
If you want to match the word stop followed by anything as in "stop going", "stop this", but not "stopped" and not "stopping" use:
^stop\W
/stop([a-zA-Z])+/
Will match any stop word (stop, stopped, stopping, etc)
However, if you just want to match "stop" at the start of a string
/^stop/
will do :D
If you want the word to start with "stop", you can use the following pattern.
"^stop.*"
This will match words starting with stop followed by anything.
/^stop*$/i
i - in case it is case sensitive.
I'd advise against a simple regular expression approach to this problem. There are too many words that are substrings of other unrelated words, and you'll probably drive yourself crazy trying to overadapt the simpler solutions already provided.
You'll want at least a naive stemming algorithm (try the Porter stemmer; there's available, free code in most languages) to process text first. Keep this processed text and the preprocessed text in two separate space-split arrays. Make sure each non-alphabetical character also gets its own index in this array. Whatever list of words you're filtering, stem them also.
The next step would be to find the array indices which match to your list of stemmed 'stop' words. Remove those from the unprocessed array, and then rejoin on spaces.
This is only slightly more complicated, but will be much more reliable an approach. If you've got any doubts on the value of a more NLP-oriented approach, you might want to do some research into clbuttic mistakes.
can you try this:
https://regex101.com/r/P3qfKG/1
reg = /stop(\w+| [^ ]+|$)/gm
it will select both stop and start with stop and next word;

Allowing words picked up in regex in certain cases only

I have a regex expression to look for people just sticking "N/A" or similar into a form field.
^(?!(\b(N/A|NA|n/a|na|Yes|yes|YES|No|no|NO)\b))
Probably not the most elegant I am sure. However I cannot for the life of me get it to allow the above words if followed by something.
So if someone just types "yes" then I want it to fail the regex check. But if someone types "yes, I have blah blah etc etc" I want it to pass.
The expression I have allows the word to be used as long as it isn't the first word in the sentence. I just want to disallow the listed words as the ONLY words in the field.
Any ideas?
Thanks
You may remove the first \b (it is redundant between the start of string and a word char) and replace the second one with $ (end of string):
^(?!(?:N/A|NA|n/a|na|Yes|yes|YES|No|no|NO)$)
See the regex demo
With a case insensitive option, you may reduce the pattern to
^(?!(?:n/?a|yes|no)$)
See another regex demo
Details
^ - start of string, then...
(?!(?:n/?a|yes|no)$) - a location in string that is not immediately followed with n/?a (na, n/a), yes or no that are followed with the end of string.
In human words, only the start of string is matched if the whole string is not equal to the alternatives inside the alternation group.
The easiest way would be to match all the forbidden strings exactly and invert the result.
Try ^(n/?a|yes|no)$ with a case-insensitive option and invert the result.
^ matches the beginning of the string. $ matches the end of the string.
When you don't have a case-insensitive option, use ^([nN]/?[aA]|[yY][eE][sS]|[nN][oO])$.

Name validation - Adding a check to this regex to stop entering just identical characters

I'm trying to add another feature to a regex which is trying to validate names (first or last).
At the moment it looks like this:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$)([a-z][a-z'-]{1,})$/i
https://regex101.com/r/pQ1tP2/1
The idea is to do the following
Don't allow just adding a title like Mr, Mrs etc
Ensure the first character is a letter
Ensure subsequent characters are either letters, hyphens or apostrophes
Minimum of two characters
I have managed to get this far (shockingly I find regex so confusing lol).
It matches things like O'Brian or Anne-Marie etc and is doing a pretty good job.
My next additions I've struggled with though! trying to add additional features to the regex to not match on the following:
Just entering the same characters i.e. aaa bbbbb etc
Thanks :)
I'd add another negative lookahead alternative matching against ^(.)\1*$, that is, any character, repetead until the end of the string.
Included as is in your regex, it would make that :
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$|^(.)\1*$)([a-z][a-z'-]{1,})$/i
However, I would probably simplify your negative lookahead as follows :
/^(?!(mr|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$/i
The modifications are as follow :
We're evaluating the lookahead at the start of the string, as indicated by the ^ preceding it : no need to repeat that we match the start of the string in its clauses
Each alternative match the end of the string. We can put the alternatives in a group, which will be followed by the end-of-string anchor
We have created a new group, which we have to take into account in our back-reference : to reference the same group, it now must address \2 rather than \1. An alternative in certain regex flavours would have been to use a non-capturing group (?:...)

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD
See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
Not sure I understand your needs but, how about:
[A-E]{2,}
This is the same as yours but shorter.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and store it in group 1
\1 is a backreference that repeats group 1
+ assume that are one or more repetition
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you wnat to match case insensitive:
^(?i)(?:([A-Z])\1+)+$

regex not working as it should

I'm trying to catch up on regex and I have made one as below;
^(.){1};(\d){4};(\d){8};[A,K]{1};(\d){7,8};(\d){8};[A-Z ]{1,};[ ,\d]{1};(\d){8};(\d){1};(\d){1}; $
and the sample is;
ä;1234;00126434;K;11821111;00000000;SOME TEXT ; 0;00000000;0;0;
As far as I've read
. is all chars, \d is digits, {n} and variations indicates n time and depending on variation, more repetitions.
What could be the problem?
A few suggestions/observations:
You can remove all {1}s, they don't do anything.
[A,K] means "A, , or K". If you want to match any letter between A and K, use [A-K].
You should place the capturing group around the repetitions: (\d{7,8}) captures a 7-8 digit number; (\d){7,8} will only capture the last digit.
[ ,\d]{1} fails on your regex because there are two characters (space and 0) at that point in the string.
you might need to remove the space before the final $, unless there actually is a space in your string after the last semicolon.
Here's a version that matches (and captures each element in a separate group):
^(.);(\d{4});(\d{8});([A-K]);(\d{7,8});(\d{8});([A-Z ]+);([ ,\d]+);(\d{8});(\d);(\d); *$
See it in action on regex101.com.
Please, don't abuse regexps for everything.
Your format is a CSV format, just split at ; and the validate the individual parts properly. This is perfectly valid, usually similarly efficient, and easier to debug.
With regexp, make sure you properly escape (i.e. double escape!). In most programming languages, \ is a reserved character in strings, and you will need to use \\ to get the desired effect.
Try this:
^(.){1};(\d){4};(\d){8};[A-K]{1};(\d){7,8};(\d){8};[A-Z ]{1,};[ \d]{2};(\d){8};(\d){1};(\d){1};$
Here what was happening in your regex
^(.){1};(\d){4};(\d){8};[A,K]{1};(\d){7,8};(\d){8};[A-Z ]{1,};[ ,\d]{1};(\d){8};(\d){1};(\d){1}; $
You have extra space before $ at the end.
To specify range use - and not comma, Your range should be [A-K].
In [ ,\d] range You have restricted it to 1 character {1} it should be {2} one for
space and 1 for digit.
Additional: You don't need to specify {1} as it will match one preceding token by default
If yours does not work, you can try this one :
^(.){1};(\d){4};(\d){8};[A,K]{1};(\d){7,8};(\d){8};[A-Z ]{1,};( \d){1};(\d){8};(\d){1};(\d){1};$