How to check that a string is a palindrome using regular expressions? - regex

That was an interview question that I was unable to answer:
How to check that a string is a palindrome using regular expressions?
p.s. There is already a question "How to check if the given string is palindrome?" and it gives a lot of answers in different languages, but no answer that uses regular expressions.

The answer to this question is that "it is impossible". More specifically, the interviewer is wondering if you paid attention in your computational theory class.
In your computational theory class you learned about finite state machines. A finite state machine is composed of nodes and edges. Each edge is annotated with a letter from a finite alphabet. One or more nodes are special "accepting" nodes and one node is the "start" node. As each letter is read from a given word we traverse the given edge in the machine. If we end up in an accepting state then we say that the machine "accepts" that word.
A regular expression can always be translated into an equivalent finite state machine. That is, one that accepts and rejects the same words as the regular expression (in the real world, some regexp languages allow for arbitrary functions, these don't count).
It is impossible to build a finite state machine that accepts all palindromes. The proof relies on the fact that we can easily build a string that requires an arbitrarily large number of nodes, namely the string
a^x b a^x (e.g., aba, aabaa, aaabaaa, aaaabaaaa, ...)
where a^x means the letter a repeated x times. This requires at least x nodes because, after seeing the 'b', we have to count back x times to make sure the string is a palindrome. Since x can be arbitrarily large, no fixed, finite number of nodes is enough.
Finally, getting back to the original question, you could tell the interviewer that you can write a regular expression that accepts all palindromes that are shorter than some fixed, finite length. If there is ever a real-world application that requires identifying palindromes then it will almost certainly not include arbitrarily long ones, thus this answer would show that you can differentiate theoretical impossibilities from real-world applications. Still, the actual regexp would be quite long, much longer than the equivalent 4-line program (easy exercise for the reader: write a program that identifies palindromes).
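For comparison with the regex approaches, the "easy exercise" has a very short answer in most languages. A minimal Python sketch (the function name is only illustrative) that compares the string with its reverse:

```python
def is_palindrome(s: str) -> bool:
    """Return True when s reads the same forwards and backwards."""
    # s[::-1] is the reversed string; no regex is involved.
    return s == s[::-1]
```

Note that this treats the empty string and single characters as palindromes, which is one of several possible definitions.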

While the PCRE engine does support recursive regular expressions (see the answer by Peter Krauss), you cannot use a regex on the ICU engine (as used, for example, by Apple) to achieve this without extra code. You'll need to do something like the following, which detects any palindrome but does require a loop (because regular expressions can't count):
$a = "teststring";
while(length $a > 1)
{
$a =~ /(.)(.*)(.)/;
die "Not a palindrome: $a" unless $1 eq $3;
$a = $2;
}
print "Palindrome";

It's not possible. Palindromes aren't defined by a regular language. (See, I DID learn something in computational theory)

With Perl regex:
/^((.)(?1)\2|.?)$/
Though, as many have pointed out, this can't be considered a regular expression if you want to be strict. Regular expressions do not support recursion.

Here's one to detect 4-letter palindromes (e.g.: deed), for any type of character:
\(.\)\(.\)\2\1
Here's one to detect 5-letter palindromes (e.g.: radar), checking for letters only:
\([a-z]\)\([a-z]\)[a-z]\2\1
So it seems we need a different regex for each possible word length.
This post on a Python mailing list includes some details as to why (Finite State Automata and pumping lemma).

Depending on how confident you are, I'd give this answer:
I wouldn't do it with a regular expression. It's not an appropriate use of regular expressions.

Yes, you can do it in .Net!
(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))
You can check it here! It's a wonderful post!

StackOverflow is full of answers like "Regular expressions? nope, they don't support it. They can't support it.".
The truth is that regular expressions have nothing to do with regular grammars anymore. Modern regular expressions offer features such as recursion and balancing groups, and the availability of their implementations is ever growing (see Ruby examples here, for instance). In my opinion, hanging onto the old belief that regular expressions in our field are anything but a programming concept is just counterproductive. Instead of hating them for a word choice that is no longer the most appropriate, it is time for us to accept things and move on.
Here's a quote from Larry Wall, the creator of Perl itself:
(…) generally having to do with what we call “regular expressions”, which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I’m not going to try to fight linguistic necessity here. I will, however, generally call them “regexes” (or “regexen”, when I’m in an Anglo-Saxon mood).
And here's a blog post by one of PHP's core developers:
As the article was quite long, here a summary of the main points:
The “regular expressions” used by programmers have very little in common with the original notion of regularity in the context of formal language theory.
Regular expressions (at least PCRE) can match all context-free languages. As such they can also match well-formed HTML and pretty much all other programming languages.
Regular expressions can match at least some context-sensitive languages.
Matching of regular expressions is NP-complete. As such you can solve any other NP problem using regular expressions.
That being said, you can match palindromes with regexes using this:
^(?'letter'[a-z])+[a-z]?(?:\k'letter'(?'-letter'))+(?(letter)(?!))$
...which obviously has nothing to do with regular grammars.
More info here: http://www.regular-expressions.info/balancing.html

As a few have already said, there's no single regexp that'll detect a general palindrome out of the box, but if you want to detect palindromes up to a certain length, you can use something like
(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1
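The same idea can be generated for any bound instead of typed by hand. The following Python sketch builds such a pattern with named groups (to sidestep multi-digit backreferences); `bounded_palindrome_pattern` and `is_short_palindrome` are hypothetical helper names, and the whole string is checked with `fullmatch` rather than searched:

```python
import re

def bounded_palindrome_pattern(max_half: int) -> re.Pattern:
    """Build a pattern for palindromes of up to 2 * max_half + 1 characters."""
    # One optional single-character capture per position of the first half.
    opens = "".join(f"(?P<c{i}>.?)" for i in range(1, max_half + 1))
    # The same captures, referenced in reverse order for the second half.
    closes = "".join(f"(?P=c{i})" for i in range(max_half, 0, -1))
    # The optional middle character allows odd lengths.
    return re.compile(opens + ".?" + closes, re.DOTALL)

# Five captures: the generated pattern has the same shape as the one above.
PATTERN = bounded_palindrome_pattern(5)

def is_short_palindrome(s: str) -> bool:
    """True if s is a palindrome of at most 11 characters."""
    return PATTERN.fullmatch(s) is not None
```

Any palindrome longer than the bound is rejected, which is the price of staying within what a plain backreference pattern can express.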

You can also do it without using recursion:
\A(?:(.)(?=.*?((?(2)\1\2|\1))\z))*?.?\2\z
to allow a single character:
\A(?:(?:(.)(?=.*?((?(2)\1\2|\1))\z))*?.?\2|.)\z
Works with Perl, PCRE
demo
For Java:
\A(?:(.)(?=.*?(\1\2\z|(?<!(?=\2\z).{0,1000})\1\z)))*?.?\2\z
demo

It can be done in Perl now. Using recursive reference:
if($istr =~ /^((\w)(?1)\g{-1}|\w?)$/){
print $istr," is palindrome\n";
}
Adapted from an example near the end of http://perldoc.perl.org/perlretut.html

In Ruby you can use named capture groups, so something like this will work:
def palindrome?(string)
$1 if string =~ /\A(?<p>| \w | (?: (?<l>\w) \g<p> \k<l+0> ))\z/x
end
try it, it works...
1.9.2p290 :017 > palindrome?("racecar")
=> "racecar"
1.9.2p290 :018 > palindrome?("kayak")
=> "kayak"
1.9.2p290 :019 > palindrome?("woahitworks!")
=> nil

Recursive Regular Expressions can do it!
Here is a simple, self-evident pattern to detect a string that contains a palindrome:
(\w)(?:(?R)|\w?)\1
At rexegg.com/regex-recursion the tutorial explains how it works.
It works with any regex engine that supports (?R) recursion (PCRE, for instance); here is an example adapted from the same source (link) as a proof of concept, using PHP:
$subjects=['dont','o','oo','kook','book','paper','kayak','okonoko','aaaaa','bbbb'];
$pattern='/(\w)(?:(?R)|\w?)\1/';
foreach ($subjects as $sub) {
echo $sub." ".str_repeat('-',15-strlen($sub))."-> ";
if (preg_match($pattern,$sub,$m))
echo $m[0].(($m[0]==$sub)? "! a palindrome!\n": "\n");
else
echo "sorry, no match\n";
}
outputs
dont ------------> sorry, no match
o ---------------> sorry, no match
oo --------------> oo! a palindrome!
kook ------------> kook! a palindrome!
book ------------> oo
paper -----------> pap
kayak -----------> kayak! a palindrome!
okonoko ---------> okonoko! a palindrome!
aaaaa -----------> aaaaa! a palindrome!
bbbb ------------> bbb
Comparing
The regular expression ^((\w)(?:(?1)|\w?)\2)$ does the same job, but answers yes/no for the whole string instead of "contains". PS: it uses a definition where "o" is not a palindrome and the hyphenated "able-elba" is not a palindrome, but "ableelba" is. Call this definition1. Call the definition under which "o" and "able-elba" are palindromes definition2.
Comparing with other "palindrome regexes":
^((.)(?:(?1)|.?)\2)$ the base regex above without the \w restriction, so it accepts "able-elba".
^((.)(?1)?\2|.)$ (@LilDevil) uses definition2 (accepts "o" and "able-elba", so it also differs in the recognition of the "aaaaa" and "bbbb" strings).
^((.)(?1)\2|.?)$ (@Markus) does not detect "kook" or "bbbb".
^((.)(?1)*\2|.?)$ (@Csaba) uses definition2.
NOTE: to compare, you can add more words to $subjects and a line for each compared regex:
if (preg_match('/^((.)(?:(?1)|.?)\2)$/',$sub)) echo " ...reg_base($sub)!\n";
if (preg_match('/^((.)(?1)?\2|.)$/',$sub)) echo " ...reg2($sub)!\n";
if (preg_match('/^((.)(?1)\2|.?)$/',$sub)) echo " ...reg3($sub)!\n";
if (preg_match('/^((.)(?1)*\2|.?)$/',$sub)) echo " ...reg4($sub)!\n";

Here's my answer to Regex Golf's 5th level (A man, a plan). It works for up to 7 characters with the browser's Regexp (I'm using Chrome 36.0.1985.143).
^(.)(.)(?:(.).?\3?)?\2\1$
Here's one for up to 9 characters
^(.)(.)(?:(.)(?:(.).?\4?)?\3?)?\2\1$
To increase the max number of characters it'd work for, you'd repeatedly replace .? with (?:(.).?\n?)?.

It's actually easier to do it with string manipulation rather than regular expressions:
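static boolean isPalindrome(String s1)
{
    // Java: reverse via StringBuilder and compare contents, not references
    String s2 = new StringBuilder(s1).reverse().toString();
    return s2.equals(s1);
}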
I realize this doesn't really answer the interview question, but you could use it to show how you know a better way of doing a task, and you aren't the typical "person with a hammer, who sees every problem as a nail."

Regarding the PCRE expression (from MizardX):
/^((.)(?1)\2|.?)$/
Have you tested it? On my PHP 5.3 under Win XP Pro it fails on: aaaba
Actually, I modified the expression slightly, to read:
/^((.)(?1)*\2|.?)$/
I think what is happening is that while the outer pair of characters are anchored, the remaining inner ones are not. This is not quite the whole answer because while it incorrectly passes on "aaaba" and "aabaacaa", it does fail correctly on "aabaaca".
I wonder whether there is a fix for this, and also,
does the Perl example (by JF Sebastian / Zsolt) pass my tests correctly?
Csaba Gabor from Vienna

/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/
This is valid for the Oniguruma engine (which is used in Ruby).
Taken from Pragmatic Bookshelf.

In Perl (see also Zsolt Botykai's answer):
use feature 'say';
$re = qr/
    .              # single letter is a palindrome
    |
    (.)            # first letter
    (??{ $re })??  # apply recursively (not interpolated yet)
    \1             # last letter
/x;
while(<>) {
    chomp;
    say if /^$re$/; # print palindromes
}

As pointed out by ZCHudson, determining whether something is a palindrome cannot be done with a usual regexp, as the set of palindromes is not a regular language.
I totally disagree with Airsource Ltd when he says that "it's not possible" is not the kind of answer the interviewer is looking for. In my interviews, I come to this kind of question when I face a good candidate, to check whether he can find the right argument when we propose that he do something wrong. I do not want to hire someone who will try to do something the wrong way when he knows a better one.

something you can do with perl: http://www.perlmonks.org/?node_id=577368

I would explain to the interviewer that the language consisting of palindromes is not a regular language but instead context-free.
The regular expression that would match all palindromes would be infinite. Instead I would suggest he either restrict himself to a maximum size of palindromes to accept, or, if all palindromes are needed, use at minimum some type of NPDA (nondeterministic pushdown automaton), or just use the simple string reversal/equals technique.

The best you can do with regexes, before you run out of capture groups:
/(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1/
This will match all palindromes up to 19 characters in length.
Programmatically solving for all lengths is trivial:
str == str.reverse

I don't have the rep to comment inline yet, but the regex provided by MizardX, and modified by Csaba, can be modified further to make it work in PCRE. The only failure I have found is the single-char string, but I can test for that separately.
/^((.)(?1)?\2|.)$/
If you can make it fail on any other strings, please comment.

#!/usr/bin/perl
use strict;
use warnings;
print "Enter your string: ";
chomp(my $str = <STDIN>);
# Number of characters in each mirrored half.
my $half = int(length($str) / 2);
# Build a pattern such as ^(.)(.).?\g{2}\g{1}$ for this length.
my $r = "(.)" x $half;
$r .= ".?";
for (my $i = $half; $i > 0; $i--) {
    $r .= "\\g{$i}";
}
# Match against the generated pattern, not a hard-coded one.
if ($str =~ /^$r$/) {
    print "$str is a palindrome\n";
}
else {
    print "$str not a palindrome\n";
}

From automata theory, it's impossible to match a palindrome of arbitrary length (because that requires an unbounded amount of memory). But IT IS POSSIBLE to match palindromes of a fixed length.
That is, it's possible to write a regex that matches all palindromes of length <= 5 or <= 6 etc., but not one for length >= 5 etc., where the upper bound is open.

In Ruby you can use \b(?'word'(?'letter'[a-z])\g'word'\k'letter+0'|[a-z])\b to match palindrome words such as a, dad, radar, racecar, and redivider. Note that this regex only matches palindrome words that are an odd number of letters long.
Let's see how this regex matches radar. The word boundary \b matches at the start of the string. The regex engine enters the capturing group "word". [a-z] matches r which is then stored in the stack for the capturing group "letter" at recursion level zero. Now the regex engine enters the first recursion of the group "word". (?'letter'[a-z]) matches and captures a at recursion level one. The regex enters the second recursion of the group "word". (?'letter'[a-z]) captures d at recursion level two. During the next two recursions, the group captures a and r at levels three and four. The fifth recursion fails because there are no characters left in the string for [a-z] to match. The regex engine must backtrack.
The regex engine must now try the second alternative inside the group "word". The second [a-z] in the regex matches the final r in the string. The engine now exits from a successful recursion, going one level back up to the third recursion.
After matching \g'word' the engine reaches \k'letter+0'. The backreference fails because the regex engine has already reached the end of the subject string. So it backtracks once more. The second alternative now matches the a. The regex engine exits from the third recursion.
The regex engine has again matched \g'word' and needs to attempt the backreference again. The backreference specifies +0 or the present level of recursion, which is 2. At this level, the capturing group matched d. The backreference fails because the next character in the string is r. Backtracking again, the second alternative matches d.
Now, \k'letter+0' matches the second a in the string. That's because the regex engine has arrived back at the first recursion during which the capturing group matched the first a. The regex engine exits the first recursion.
The regex engine is now back outside all recursion. At this level, the capturing group stored r. The backreference can now match the final r in the string. Since the engine is not inside any recursion any more, it proceeds with the remainder of the regex after the group. \b matches at the end of the string. The end of the regex is reached and radar is returned as the overall match.
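The two alternatives of the "word" group correspond to a recursive definition: a word is either a letter, a word, and the same letter again, or a single letter. As a rough illustration only, here is that definition written as a plain Python function rather than a regex; the function name and the use of `string.ascii_lowercase` for `[a-z]` are assumptions made for the sketch, and it checks a whole string instead of searching between word boundaries:

```python
import string

LETTERS = set(string.ascii_lowercase)  # stands in for [a-z]

def is_odd_palindrome_word(s: str) -> bool:
    """Mirror the 'word' group: letter + word + same letter, or one letter."""
    if not s or s[0] not in LETTERS:
        return False
    # First alternative: a letter, a nested word, then the same letter again.
    if len(s) >= 3 and s[-1] == s[0] and is_odd_palindrome_word(s[1:-1]):
        return True
    # Second alternative: a single letter.
    return len(s) == 1
```

Like the pattern, this accepts only palindromes with an odd number of letters, because the innermost level must always be a single letter.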

Here is PL/SQL code which tells whether a given string is a palindrome, using regular expressions:
create or replace procedure palin_test(palin in varchar2) is
tmp varchar2(100);
i number := 0;
BEGIN
tmp := palin;
for i in 1 .. length(palin)/2 loop
if length(tmp) > 1 then
if regexp_like(tmp,'^(^.).*(\1)$') = true then
tmp := substr(palin,i+1,length(tmp)-2);
else
dbms_output.put_line('not a palindrome');
exit;
end if;
end if;
if i >= length(palin)/2 then
dbms_output.put_line('Yes ! it is a palindrome');
end if;
end loop;
end palin_test;

my $pal='malayalam';
while($pal=~/^(.)(.*)\1$/){ # strip a matching first and last character
    $pal=$2;
}
if ($pal=~/^.?$/){ # a single letter or nothing is left
    print"palindrome\n";
}
else{
    print"not palindrome\n";
}

This regex will detect palindromes up to 22 characters ignoring spaces, tabs, commas, and quotes.
\b(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*\11?[ \t,'"]*\10|\10?)[ \t,'"]*\9|\9?)[ \t,'"]*\8|\8?)[ \t,'"]*\7|\7?)[ \t,'"]*\6|\6?)[ \t,'"]*\5|\5?)[ \t,'"]*\4|\4?)[ \t,'"]*\3|\3?)[ \t,'"]*\2|\2?))?[ \t,'"]*\1\b
Play with it here: https://regexr.com/4tmui
I wrote an explanation of how I got that here: https://medium.com/analytics-vidhya/coding-the-impossible-palindrome-detector-with-a-regular-expressions-cd76bc23b89b

A slight refinement of Airsource Ltd's method, in pseudocode:
WHILE string.length > 1
    IF /^(.)(.*)\1$/ matches string
        string = \2
    ELSE
        REJECT
ACCEPT
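A runnable version of this loop might look as follows in Python. This is a sketch under the assumption that the pattern must cover the whole string (hence `fullmatch`), so that the captured character really is the first and the last one; the names are illustrative:

```python
import re

# First character, anything in between, then the first character again.
PEEL = re.compile(r"(.)(.*)\1", re.DOTALL)

def is_palindrome_by_peeling(s: str) -> bool:
    """Strip matching outer characters until at most one character remains."""
    while len(s) > 1:
        m = PEEL.fullmatch(s)
        if m is None:
            return False  # the outer characters differ: REJECT
        s = m.group(2)    # continue with the inner part
    return True           # zero or one character left: ACCEPT
```

The regex only compares one pair of characters per iteration; the counting that a regular expression cannot do is handled by the loop.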

Related

Regular expression which will match if there is no repetition

I would like to construct a regular expression which will match a password if no character or group of characters repeats 4 or more times in a row.
I have come up with a regex which will match if there is a character or group of characters repeating 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way to match only if the string doesn't contain the repetitions? I tried the approach suggested in Regular expression to match a line that doesn't contain a word? as I thought some combination of positive/negative lookaheads would do it. But I haven't found a working example yet.
By repetition I mean any number of characters anywhere in the string
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a is here 4 times but it is not problem, I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
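To see how the unnamed variant behaves on the examples from the question, here is a small Python sketch. The function name is hypothetical, and the pattern is used unchanged with Python's `re` module, which is an assumption about portability rather than something stated in the answer:

```python
import re

# At every position, refuse to continue if a block of letters/digits
# starting there is immediately repeated three or more further times.
NO_FOURFOLD_BLOCK = re.compile(r"^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$")

def has_no_fourfold_repeat(password: str) -> bool:
    """True if no block of characters occurs four or more times in a row."""
    return NO_FOURFOLD_BLOCK.search(password) is not None

for candidate in ("aaaaxbc", "abababab", "x14aaaabc", "abcaxaxaz"):
    print(candidate, has_no_fourfold_repeat(candidate))
```

Only the last candidate should be accepted, since its four occurrences of a are not consecutive repetitions of one block.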
Nota Bene: this solution doesn't answer the question exactly; it does more than the stated need.
-----
In Python:
import re
pat = r'(?:(.)(?!.*?\1.*?\1.*?\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
          ':1*1-3=4#5',
          ':1*1-1=4#5!6',
          ':1*1-1=1#',
          ':1*2-a=14#a~7&1{g}1'):
    m = regx.match(s)
    if m:
        print(m.group())
    else:
        print('--No match--')
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
It will give a lot of work to the regex engine because the principle of the pattern is that, for each character of the string it runs through, it must verify that the current character isn't found three other times in the remaining sequence of characters that follow it.
But it works, apparently.

Regular expression matching any subset of a given set?

Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.
Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.
I also have a guess that it may be theoretically impossible, because during parsing the string we will need to remember which character we have already met before, and as far as I know regular expressions can check out only right-linear languages.
Any help will be appreciated. Thanks in advance.
This doesn't really qualify for the language-agnostic tag, but...
^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$
see a demo on ideone.com
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.
Can't think how to do it with a single regex, but this is one way to do it with n regexes: (I will use 1 2 ... m n etc. for your a's)
^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$
If all the above match, your string is a subset of 12..mn.
How this works: each line requires the string to consist exactly of:
any number of characters drawn from the set, except a particular one
perhaps a particular one
any number of characters drawn from the set, except a particular one
If this passes when every element in turn is considered as a particular one, we know:
there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements
as required.
for completeness I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
foreach char a in {a1, ..., an}
    hit[a] = false
foreach char c in string
    if c not in {a1, ..., an} => fail
    if hit[c] => fail
    hit[c] = true
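The traversal translates almost line by line into Python. In this sketch a set of already-seen characters plays the role of the hit table; the function and parameter names are illustrative only:

```python
def is_subset_once(s: str, allowed: str) -> bool:
    """True if every character of s is allowed and none occurs twice."""
    allowed_set = set(allowed)
    seen = set()
    for c in s:
        if c not in allowed_set:
            return False  # character outside the allowed set
        if c in seen:
            return False  # allowed, but already encountered
        seen.add(c)
    return True
```

This runs in a single pass over the string and needs no regex features at all.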
Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
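For readers who prefer Python, the same pattern appears to carry over to the `re` module unchanged. The sketch below builds it for an arbitrary, non-empty set of allowed characters; `subset_pattern` is a hypothetical helper, `re.DOTALL` is one way of handling the newline concern, and `fullmatch` replaces the explicit ^ and $ anchors:

```python
import re

def subset_pattern(allowed: str) -> re.Pattern:
    """Pattern for strings using each allowed character at most once."""
    # Escape every character so that ], ^ and - are safe inside the class.
    char_class = "[" + "".join(re.escape(c) for c in allowed) + "]"
    # Each block: one allowed character that does not occur again later.
    return re.compile(rf"(?:({char_class})(?!.*\1))*", re.DOTALL)

ABC = subset_pattern("abc")

for word in ("ba", "pabc", "abac", "a", "cc", "cba", "abcd", "abbbbc", ""):
    verdict = "matches" if ABC.fullmatch(word) else "does not match"
    print(repr(word), verdict)
```

The lookahead is re-evaluated for every character, so the cost grows roughly with the square of the string length; for short inputs that is unlikely to matter.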
Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':
my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}

How to use a REGEX pattern to remove a specific word "THE" only if at beginning of text string?

I have a text input field for titles of various things and, to help minimize false negatives on search results (internal search is not the best), I need a REGEX pattern which looks at the first four characters of the input string and removes the word "the" (and the space after it) if it is there at the beginning only.
For example, if we are talking about the names of bands and someone enters The Rolling Stones, what I need is for the entry to say only Rolling Stones.
Can a regex be used to automatically strip these 4 characters?
Applying the regex
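^(?:\s*the\s+)?(.*)$
will match any string, and capture it in backreference no. 1, unless it starts with the word "the" (optionally preceded by whitespace and followed by at least one whitespace character, so that a title like "They Might Be Giants" is left alone), in which case backref no. 1 will contain whatever follows.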
You need to set the case-insensitive option in your regex engine for this to work.
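As an illustration, here is how such a replacement could look in Python. This is a sketch, not the answerer's code: it removes the prefix with `re.sub` instead of capturing the rest, the function name is hypothetical, and it requires at least one whitespace character after the word so that titles such as "They Might Be Giants" are not truncated:

```python
import re

# Optional leading whitespace, the word "the", then at least one space.
LEADING_THE = re.compile(r"^\s*the\s+", re.IGNORECASE)

def strip_leading_the(title: str) -> str:
    """Remove a leading 'the ' (any letter case) from a title."""
    return LEADING_THE.sub("", title, count=1)
```

For example, `strip_leading_the("The Rolling Stones")` gives `"Rolling Stones"`, while a title that merely starts with the letters t-h-e is returned unchanged.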
You can use the ^ anchor to match a pattern at the beginning of a line; however, for what you are doing, a regex can be considered overkill.
A lot of languages support string manipulation, which is a more suitable choice. Here is an example to demonstrate in Python:
>>> def func(n):
...     n = n[4:len(n)] if n[0:4] == "The " else n
...     return n
...
>>> func("The Rolling Stones")
'Rolling Stones'
>>> func("They Might Be Giants")
'They Might Be Giants'
As you don't specify a language, here is a solution in Perl:
my $str = "The Rolling Stones";
$str =~ s/^the //i;
say $str; # Rolling Stones

How can I check if every substring of four zeros is followed by at least four ones using regular expressions?

How can I write a regular expression in which, whenever there is 0000, there must be 1111 right after it? For example:
00101011000011111111001111010 -> correct
0000110 -> incorrect
11110 -> correct
thanks for any help
If you are using Perl, you can use a zero-width negative-lookahead assertion:
#!/usr/bin/perl
use strict; use warnings;
my @strings = qw(
    00101011000011111111001111010
    00001111000011
    0000110
    11110
);
my $re = qr/0000(?!1111)/;
for my $s ( @strings ) {
    my $result = $s =~ $re ? 'incorrect' : 'correct';
    print "$s -> $result\n";
}
The pattern matches if there is a string of 0000 not followed by at least four 1s. So, a match indicates an incorrect string.
Output:
C:\Temp> s
00101011000011111111001111010 -> correct
00001111000011 -> incorrect
0000110 -> incorrect
11110 -> correct
While some languages' alleged "regular expressions" actually implement something quite different from (generally a superset of) what are called regular expressions in computer science (including e.g. pushdown automata or even arbitrary code execution within "regexes"), to answer in actual regex terms is, I think, best done as follows:
regular expressions are in general a good way to answer many questions of the form "is there any spot in the text in which the following pattern occurs" (with limitations on the pattern, of course -- for example, balancing of nested parentheses is beyond regexes' power, although of course it may well not be beyond the power of arbitrary supersets of regexes). "Does the whole text match this pattern" is obviously a special case of the question "does any spot in the text match this pattern", given the possibility to have special markers meaning "start of text" and "end of text" (typically ^ and $ in typical regex-pattern syntax).
However, "can you check that NO spot in the text matches this pattern" is not a question which regex matching can directly answer... but adding (outside the regex) the logical operation not obviously solves the problem in practice, because "check that no spot matches" is clearly the same as "tell me if any spot matches" followed by "transform success into failure and vice versa" (the latter being the logical not part). This is the key insight in Sinan's answer, beyond the specific use of Perl's negative-lookahead (which is really just a shortcut, not an extension of regex power per se).
If your favorite language for using regexes in doesn't have negative lookahead but does have the {<number>} "count shortcut", parentheses, and the vertical bar "or" operation:
00001{0,3}([^1]|$)
i.e., "four 0s followed by zero to three 1s followed by either a non-1 character or end-of-text" is exactly a pattern such that, if the text matches it anywhere, violates your constraint (IOW, it can be seen as a slight expansion of the negative-lookahead shortcut syntax). Add a logical-not (again, in whatever language you prefer), and there you are!
There are several approaches that you can take here, and I will list some of them.
Checking that for all cases, the requirement is always met
In this approach, we simply look for 0000(1111)? and find all matches. Since ? is greedy, it will match the 1111 if possible, so we simply check that each match is 00001111. If it's only 0000 then we say that the input is not valid. If we didn't find any match that is only 0000 (perhaps because there's no match at all to begin with), then we say it's valid.
In pseudocode:
FUNCTION isValid(s:String) : boolean
FOR EVERY match /0000(1111)?/ FOUND ON s
IF match IS NOT "00001111" THEN
RETURN false
RETURN true
Checking that there is a case where the requirement isn't met (then oppose)
In this approach, we're using regex to try to find a violation instead. Thus, a successful match means we say the input is not valid. If there's no match, then there's no violation, so we say the input is valid (this is what is meant by "check-then-oppose").
In pseudocode
isValid := NOT (/violationPattern/ FOUND ON s)
Lookahead option
If your flavor supports it, negative lookahead is the most natural way to express this pattern. Simply look for 0000(?!1111).
No lookahead option
If your flavor doesn't support negative lookahead, you can still use this approach. Now the pattern becomes 00001{0,3}(0|$). That is, we try to match 0000, followed by 1{0,3} (that is, between 0-3 1), followed by either 0 or the end of string anchor $.
Fully spelled out option
This is equivalent to the previous option, but instead of using repetition and alternation syntax, you explicitly spell out what the violations are. They are
00000|000010|0000110|00001110|
0000$|00001$|000011$|0000111$
Checking that there ISN'T a case where the requirement isn't met
This relies on negative lookahead; it's simply taking the previous approach to the next level. Instead of:
isValid := NOT (/violationPattern/ FOUND ON s)
we can bring the NOT into the regex using negative lookahead as follows:
isValid := (/^(?!.*violationPattern)/ FOUND ON s)
That is, anchoring ourselves at the beginning of the string, we negatively assert that we can match .*violationPattern. The .* allows us to "search" for the violationPattern as far ahead as necessary.
Attachments
Here are the patterns showcased on rubular:
Approach 1: Matching only 0000 means invalid
0000(?:1111)?
Approach 2: Match means invalid
0000(?!1111)
00001{0,3}(?:0|$)
00000|000010|0000110|00001110|0000$|00001$|000011$|0000111$
Approach 3: Match means valid
^(?!.*0000(?!1111)).*$
The input used is (annotated to show which ones are valid):
+ 00101011000011111111001111010
- 000011110000
- 0000110
+ 11110
- 00000
- 00001
- 000011
- 0000111
+ 00001111
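To cross-check the three approaches, the patterns above can be run against the annotated inputs. The following Python sketch does that with the standard `re` module; the constant and function names are illustrative, and the expected verdicts are taken from the + and - marks in the list:

```python
import re

RUN = re.compile(r"0000(?:1111)?")                   # approach 1
VIOLATION_LOOKAHEAD = re.compile(r"0000(?!1111)")    # approach 2, lookahead
VIOLATION_PLAIN = re.compile(r"00001{0,3}(?:0|$)")   # approach 2, no lookahead
WHOLE_VALID = re.compile(r"^(?!.*0000(?!1111)).*$")  # approach 3

def verdicts(s):
    """Validity of s according to each approach, in the order listed above."""
    return [
        all(m.group() == "00001111" for m in RUN.finditer(s)),
        VIOLATION_LOOKAHEAD.search(s) is None,
        VIOLATION_PLAIN.search(s) is None,
        WHOLE_VALID.search(s) is not None,
    ]

SAMPLES = {
    "00101011000011111111001111010": True,
    "000011110000": False,
    "0000110": False,
    "11110": True,
    "00000": False,
    "00001": False,
    "000011": False,
    "0000111": False,
    "00001111": True,
}

for text, expected in SAMPLES.items():
    # Every approach should agree with the annotation.
    assert verdicts(text) == [expected] * 4, text
```

The no-lookahead variant assumes the input contains only 0s and 1s and no trailing newline, since `$` in Python also matches just before a final newline.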
References
regular-expressions.info/Lookarounds and Flavor comparison

regex to match a maximum of 4 spaces

I have a regular expression to match a person's name.
So far I have ^([a-zA-Z\'\s]+)$ but I'd like to add a check to allow for a maximum of 4 spaces. How do I amend it to do this?
Edit: what I meant was 4 spaces anywhere in the string
Don't attempt to regex validate a name. People are allowed to call themselves what ever they like. This can include ANY character. Just because you live somewhere that only uses English doesn't mean that all the people who use your system will have English names. We have even had to make the name field in our system Unicode. It is the only Unicode type in the database.
If you care, we actually split the name at " " and store each name part as a separate record, but we have some very specific requirements that mean this is a good idea.
PS. My step mum has 5 spaces in her name.
^ # Start of string
(?!\S*(?:\s\S*){5}) # Negative look-ahead for five spaces.
([a-zA-Z\'\s]+)$ # Original regex
Or in one line:
^(?!(?:\S*\s){5})([a-zA-Z\'\s]+)$
If there are five or more spaces in the string, five will be matched by the negative lookahead, and the whole match will fail. If there are four or fewer, the original regex will be matched.
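A quick way to try the one-line version is a small Python sketch like the one below. The function name is hypothetical, and the character class is the one from the question, so the earlier caveat about real-world names still applies:

```python
import re

# Reject as soon as five whitespace characters can be found from the start;
# otherwise apply the original letters/apostrophe/whitespace check.
NAME = re.compile(r"^(?!(?:\S*\s){5})([a-zA-Z'\s]+)$")

def is_acceptable_name(name: str) -> bool:
    """True if name has at most four spaces and only the allowed characters."""
    return NAME.match(name) is not None
```

With this sketch, a name with four spaces passes and one with five is rejected, independent of where the spaces occur.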
Screw the regex.
Using a regex here seems to be creating a problem for a solution instead of just solving a problem.
This task should be 'easy' for even a novice programmer, and the novel idea of regex has polluted our minds!.
1: Get Input
2: Trim White Space
3: If this makes sense, trim out any 'bad' characters.
4: Use the "split" utility provided by your language to break it into words
5: Return the first 5 Words.
ROCKET SCIENCE.
replies
What do you mean, screw the regex? You're obviously a VB programmer.
Regex is the most efficient way to work with strings. Learn them.
No. PHP, toyed a bit with Ruby, now going manically into Perl.
There are some things (like this case) where the regex-based alternative is computationally and logically far too complex for the task.
I've parsed entire PHP source files with regex, so I'm not exactly a novice in their use.
But there are many cases, such as this, where you're employing a logging company to prune your rose bush.
I could do all steps 2 to 5 with regex of course, but they would be simple and atomic regex, with no weird backtracking syntax or potential for recursive searching.
The steps 1 to 5 I list above have a known scope, known range of input, and there's no ambiguity to how it functions. As to your regex, the fact you have to get contributions of others to write something so simple is proving the point.
I see somebody marked my post as offensive, I am somewhat unhappy I can't mark this fact as offensive to me. ;)
Proof Of Pudding:
sub getNames {
    my @args = @_;
    my $text = shift @args;
    my $num  = shift @args;
    # Trim whitespace from head/end
    $text =~ s/^\s*//;
    $text =~ s/\s*$//;
    # Trim bad characters (??)
    $text =~ s/[^a-zA-Z\'\s]//g;
    # Tokenise by space
    my @words = split( /\s+/, $text );
    # Return 0..n
    return @words[ 0 .. $num - 1 ];
} ## end sub getNames
print join ",", getNames " Hello world this is a good test", 5;
>> Hello,world,this,is,a
If how that works is ambiguous to anybody, I'll be glad to explain it. Note that I'm still doing it with regexps. In other languages I would have used their native "trim" functions where possible.
Bollocks -->
I first tried this approach. This is your brain on regex. Kids, don't do regex.
This might be a good start
/([^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+
(\s[^\s]+|)
|)
|)
|)
)/
( Linebroken for clarity )
/([^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+(\s[^\s]+|)|)|)|))/
( Actual )
I've used [^\s]+ here instead of your A-Z combo for succinctness, but the point here is the nested optional groups
ie:
(Hello( this( is( example))))
(Hello( this( is( example( two)))))
(Hello( this( is( better( example))))) three
(Hello( this( is())))
(Hello( this()))
(Hello())
( Note: this, while being convoluted, has the benefit that it will match each name into its own group )
If you want readable code:
$word = '[^\s]+';
$regex = "/($word(\s$word(\s$word(\s$word(\s$word|)|)|)|)|)/";
( it anchors around the (capture|) mantra of "get this, or get nothing" )
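For anyone who wants to poke at the nested optional groups, here is a rough sketch in Python rather than PHP. It assumes the five-word, linebroken form of the pattern; the helper name name_groups is hypothetical. Because the groups are nested, each one captures from its own word through to the end of the match, and groups that were never reached come back as None.

```python
import re

WORD = r"[^\s]+"
# Same shape as the linebroken pattern: each level is "(\s word ... |)".
NESTED = re.compile(
    rf"({WORD}(\s{WORD}(\s{WORD}(\s{WORD}(\s{WORD}|)|)|)|))"
)

def name_groups(text):
    """Hypothetical helper: return the five capture groups, or None."""
    m = NESTED.match(text)
    return m.groups() if m else None

# Six words in, only the first five are matched.
print(name_groups("Hello this is example two three"))
# Two words in: the third group matches empty, the rest never participate.
print(name_groups("Hello this"))
```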
@Sir Psycho: Be careful about your assumptions here. What about hyphenated names? Dotted names (e.g. Brian R. Bondy) and so on?
Here's the answer that you're most likely looking for:
^[a-zA-Z']+(\s[a-zA-Z']+){0,4}$
That says (in English): "From start to finish, match one or more letters, there can also be a space followed by another 'name' up to four times."
BTW: Why do you want them to have apostrophes anywhere in the name?
^([a-zA-Z']+\s){0,4}[a-zA-Z']+$
This assumes you want 4 spaces inside this string (i.e. you have trimmed it)
Edit: If you want 4 spaces anywhere I'd recommend not using regex - you'd be better off using a substr_count (or the equivalent in your language).
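The counting approach is a one-liner in most languages. A minimal sketch in Python, where str.count plays the role of PHP's substr_count (the function name at_most_four_spaces is made up for illustration):

```python
def at_most_four_spaces(name):
    """Hypothetical helper: True if the string has at most 4 space characters."""
    # Counts only the literal space character, anywhere in the string.
    return name.count(" ") <= 4

assert at_most_four_spaces("Brian R. Bondy")
assert at_most_four_spaces(" a b c ")          # leading/trailing spaces count too
assert not at_most_four_spaces("a b c d e f")  # five spaces
```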
I also agree with pipTheGeek that there are so many different ways of writing names that you're probably best off trusting the user to get their name right (although I have found that a lot of people don't bother using capital letters on ecommerce checkouts).
Match multiple whitespace followed by two characters at the end of the line.
Related problem ----
From a string, remove trailing 2 characters preceded by multiple white spaces... For example, if the column contains this string -
" 'This is a long string with 2 chars at the end AB "
then, AB should be removed while retaining the sentence.
Solution ----
select 'This is a long string with 2 chars at the end AB' as "C1",
       regexp_replace('This is a long string with 2 chars at the end AB',
                      '([[:space:]][a-zA-Z][a-zA-Z])*$') as "C2" from dual;
Output ----
C1
This is a long string with 2 chars at the end AB
C2
This is a long string with 2 chars at the end
Analysis ----
The regular expression specifies: match and replace zero or more occurrences (*) of a space ([:space:]) followed by a combination of two characters ([a-zA-Z][a-zA-Z]) at the end of the line.
Hope this is useful.
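For comparison, the same kind of trailing-code removal can be sketched outside Oracle. The Python below is an assumption-laden variant, not a translation of the query above: it requires one or more whitespace characters before the two letters and tolerates trailing whitespace after them. The name strip_trailing_code is hypothetical.

```python
import re

# One or more whitespace chars, exactly two letters, optional trailing
# whitespace, anchored at the end of the string.
TRAILING_CODE = re.compile(r"\s+[a-zA-Z]{2}\s*$")

def strip_trailing_code(text):
    """Hypothetical helper: drop a trailing two-letter code, if present."""
    return TRAILING_CODE.sub("", text)

print(strip_trailing_code("This is a long string with 2 chars at the end   AB"))
```

Be aware that any sentence that happens to end in a two-letter word (for example "where to go") would lose that word too.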