Regular expression matching any subset of a given set? - regex

Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.
Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.
I also suspect that it may be theoretically impossible, because while parsing the string we would need to remember which characters we have already met, and as far as I know regular expressions can recognize only right-linear languages.
Any help will be appreciated. Thanks in advance.

This doesn't really qualify for the language-agnostic tag, but...
^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$
see a demo on ideone.com
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.

I can't think how to do it with a single regex, but this is one way to do it with n regexes (I will use 1 2 ... m n etc. for your a's):
^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$
If all of the above match, your string is a subset of 12..mn.
How this works: each line requires the string to consist exactly of:
any number of characters drawn from the set, except a particular one
perhaps a particular one
any number of characters drawn from the set, except a particular one
If this passes when every element in turn is considered as a particular one, we know:
there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements
as required.
for completeness I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
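The n-regexes idea above can be sketched in Python as follows. This is only an illustration under assumptions: the helper names `build_checks` and `is_subset` are made up here, the allowed characters are assumed to be distinct, and `fullmatch` plays the role of the `^...$` anchors.

```python
import re

def build_checks(allowed):
    """Build one regex per allowed character (hypothetical helper).

    Each regex says: any number of the *other* allowed characters,
    then this character at most once, then the others again.
    """
    checks = []
    for ch in allowed:
        others = "".join(re.escape(c) for c in allowed if c != ch)
        # An empty character class is not valid syntax, so the
        # "[others]*" part is dropped when there are no other characters.
        rest = "[" + others + "]*" if others else ""
        checks.append(re.compile(rest + "(?:" + re.escape(ch) + ")?" + rest))
    return checks

def is_subset(s, allowed):
    """True if every per-character regex matches the whole string."""
    return all(rx.fullmatch(s) is not None for rx in build_checks(allowed))

if __name__ == "__main__":
    for candidate in ["ba", "abac", "abcd", "cba", ""]:
        print(repr(candidate), is_subset(candidate, "abc"))
```

Each check rejects either a foreign character or a second occurrence of its own character, so the conjunction of all n checks gives the required property.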

Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
foreach char a in {a1, ..., an}
    hit[a] = false
foreach char c in string
    if c not in {a1, ..., an} => fail
    if hit[c] => fail
    hit[c] = true
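A minimal runnable version of the traversal above, written in Python as a sketch. The function name `uses_each_at_most_once` is an assumption for illustration; a set of seen characters stands in for the `hit` table.

```python
def uses_each_at_most_once(s, allowed):
    """Return True if s contains only allowed characters, each at most once."""
    allowed = set(allowed)
    seen = set()  # plays the role of the hit[] table
    for c in s:
        if c not in allowed:
            return False  # character outside the allowed set
        if c in seen:
            return False  # allowed, but already seen once
        seen.add(c)
    return True

if __name__ == "__main__":
    for candidate in ["ba", "pabc", "abac", "", "cba"]:
        print(repr(candidate), uses_each_at_most_once(candidate, "abc"))
```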

Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
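For comparison, the same pattern appears to work unchanged in Python's `re` module, since the backreference is used only after its group has been defined. The sketch below is an illustration, not part of the original answer: the name `has_no_repeats` is made up, `fullmatch` replaces the `^` and `$` anchors, and `re.DOTALL` is one way of making `.` cross newlines.

```python
import re

# One block = one character from the set that does not occur again later.
NO_REPEATS = re.compile(r"(?:([abc])(?!.*\1))*", re.DOTALL)

def has_no_repeats(s):
    """True if s consists only of a, b, c with no character repeated."""
    return NO_REPEATS.fullmatch(s) is not None

if __name__ == "__main__":
    for candidate in ["ba", "pabc", "abac", "a", "cc", "cba", "abcd", "abbbbc", ""]:
        verdict = "matches" if has_no_repeats(candidate) else "does not match"
        print(repr(candidate), verdict)
```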
Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':
my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}

Related

Regex match pair occurrences of a specific character

I've been trying to make a regex that satisfies these conditions:
The word consists of the characters a and b
The number of b characters must be even (consecutive or not)
So for example:
abb -> accepted
abab -> accepted
aaaa -> rejected
baaab -> accepted
So far I have this: ([a]*)(((b){2}){1,})
As you can see, I know very little about the matter. This checks for pairs, but it still accepts words with an odd number of b's.
You could use this regex to check for some number of as with an even number of bs:
^(?:a*ba*ba*)+$
This looks for 1 or more occurrences of 2 bs surrounded by some number (which may be 0) as.
Demo on regex101
Note this will match bb (or bbbb, bbbbbb etc.). If you don't want to do that, the easiest way is to add a positive lookahead for an a:
^(?=b*a)(?:a*ba*ba*)+$
Demo on regex101
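As a quick check of the two patterns above, here is a small Python sketch. The constant and function names are made up for illustration; the regexes themselves are copied from the answer.

```python
import re

# Even (non-zero) number of b's, any number of a's.
EVEN_BS = re.compile(r"^(?:a*ba*ba*)+$")
# Same, but the lookahead additionally requires at least one a.
EVEN_BS_WITH_A = re.compile(r"^(?=b*a)(?:a*ba*ba*)+$")

def accepted(word, pattern=EVEN_BS):
    """True if the whole word matches the given pattern."""
    return pattern.match(word) is not None

if __name__ == "__main__":
    for word in ["abb", "abab", "aaaa", "baaab", "bb"]:
        print(word, accepted(word), accepted(word, EVEN_BS_WITH_A))
```

The only difference between the two shows up on inputs such as bb, which the first pattern accepts and the second rejects.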
Checking an Array of Characters Against Two Conditionals
While you could do this using regular expressions, it would be simpler to solve it by applying some conditional checks against your two rules against an Array of characters created with String#chars. For example, using Ruby 3.1.2:
# Rule 1: string contains only the letters `a` and `b`
# Rule 2: the number of `b` characters in the word is even
#
# @return [Boolean] whether the word matches *both* rules
def word_matches_rules word
  char_array = word.chars
  char_array.uniq.sort == %w[a b] and char_array.count("b").even?
end
words = %w[abb abab aaaa baaab]
words.map { |word| [word, word_matches_rules(word)] }.to_h
#=> {"abb"=>true, "abab"=>true, "aaaa"=>false, "baaab"=>true}
Regular expressions are very useful, but string operations are generally faster and easier to conceptualize. This approach also allows you to add more rules or verify intermediate steps without adding a lot of complexity.
There are probably a number of ways this could be simplified further, such as using a Set or methods like Array#& or Array#-. However, my goal with this answer was to make the code (and the encoded rules you're trying to apply) easier to read, modify, and extend rather than to make the code as minimalist as possible.

TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
    puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
If I specify "number n1" in the script's regexp, it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Currently it always prints the final line (numbers 56,4,5,5:) as output. How can I resolve this issue?
Thanks,
Kumar
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark behind the first star makes the match non-greedy.
There is a newline character behind the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
Documentation: regular expression syntax, regexp
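The greedy versus non-greedy difference can also be seen outside Tcl. The sketch below uses Python's `re` module purely as an illustration. Python's engine is not Tcl's: greediness here is decided per quantifier, and `.` needs `re.DOTALL` to cross newlines, so the patterns are adapted rather than copied, and `[^\n]*` is used to stop the capture at the end of the line. The names `TEST`, `greedy_capture` and `numbers_after` are assumptions for the example.

```python
import re

TEST = """
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
"""

def greedy_capture(text=TEST):
    """Greedy .* runs to the LAST 'numbers' line after 'number n1'."""
    m = re.search(r"number n1.*(numbers [^\n]*)", text, re.DOTALL)
    return m.group(1) if m else None

def numbers_after(label, text=TEST):
    """Non-greedy .*? stops at the FIRST 'numbers' line after the label."""
    pattern = r"number " + re.escape(label) + r"\n.*?(numbers [^\n]*)"
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None

if __name__ == "__main__":
    print("greedy:    ", greedy_capture())
    print("non-greedy:", numbers_after("n1"))
    print("n2:        ", numbers_after("n2"))
```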
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)

Check if string is subset of a bunch of characters? (RegEx)?

I have a little problem, I have 8 characters, for example "a b c d a e f g", and a list of words, for example:
mom, dad, bad, fag, abac
How can I check if I can or cannot compose these words with the letters I have?
In my example, I can compose bad, abac and fag, but I cannot compose dad (I don't have two Ds) or mom (I have neither M nor O).
I'm pretty sure it can be done using a regex, but a solution using some Perl functions would be helpful too.
Thanks in advance guys! :)
This is done most simply by forming a regular expression from the word that is to be tested.
The program below sorts the list of available characters and forms a string by concatenating them. Then each candidate word is split into characters, sorted, and rejoined with the regex term .* as separator. So, for instance, abac will be converted to a.*a.*b.*c.
Then the validity of the word is determined by testing the string of available characters against the derived regex.
use strict;
use warnings;
my @chars = qw/ a b c d a e f g /;
my $chars = join '', sort @chars;
for my $word (qw/ mom dad bad fag abac /) {
  my $re = join '.*', sort $word =~ /./g;
  print "$word is ", $chars =~ /$re/ ? 'valid' : 'NOT valid', "\n";
}
output
mom is NOT valid
dad is NOT valid
bad is valid
fag is valid
abac is valid
This is to demonstrate the possibility rather than to endorse the regex method. Please consider other, saner solutions.
First step, you need to count the number of characters available.
Then construct your regex as such (this is not Perl code!):
Start with start of input anchor, this matches the start of the string (a single word from the list):
^
Append as many of these as the number of unique characters:
(?!(?:[^<char>]*+<char>){<count + 1>})
Example: (?!(?:[^a]*+a){3}) if the number of a is 2.
I used an advanced regex construct here called zero-width negative look-ahead (?!pattern). It will not consume text, and it will try its best to check that nothing ahead in the string matches the pattern specified (?:[^a]*+a){3}. Basically, the idea is that I check that I cannot find 3 'a' ahead in the string. If I really can't find 3 instances of 'a', it means that the string can only contain 2 or less 'a'.
Note that I use *+, which is 0 or more quantifier, possessively. This is to avoid unnecessary backtracking.
Put the characters that can appear within []:
[<unique_chars_in_list>]+
Example: For a b c d a e f g, this will become [abcdefg]+. This part will actually consume the string, and make sure the string only contains characters in the list.
End with end of input anchor, which matches the end of the string:
$
So for your example, the regex will be:
^(?!(?:[^a]*+a){3})(?!(?:[^b]*+b){2})(?!(?:[^c]*+c){2})(?!(?:[^d]*+d){2})(?!(?:[^e]*+e){2})(?!(?:[^f]*+f){2})(?!(?:[^g]*+g){2})[abcdefg]+$
You must also specify i flag for case-insensitive matching.
Note that this only considers words made of letters of the English alphabet (a-z). Spaces and hyphens are not (yet) considered here.
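The construction steps above could be automated along these lines. This is a Python sketch under stated assumptions: the helper name `build_word_regex` is hypothetical, the available characters are assumed to be plain letters (so no escaping is done), and the plain `*` quantifier is used in place of the possessive `*+`, which older Python versions do not accept; that only affects backtracking, not which words match.

```python
import re
from collections import Counter

def build_word_regex(available):
    """Build the lookahead-based regex from the available letters."""
    counts = Counter(available)
    # One negative lookahead per distinct letter: the word must not
    # contain (count + 1) occurrences of that letter.
    lookaheads = "".join(
        "(?!(?:[^{c}]*{c}){{{n}}})".format(c=c, n=n + 1)
        for c, n in sorted(counts.items())
    )
    letters = "".join(sorted(counts))
    # The character class consumes the word and rejects foreign letters.
    return re.compile("^" + lookaheads + "[" + letters + "]+$")

if __name__ == "__main__":
    regex = build_word_regex("abcdaefg")
    print(regex.pattern)
    for word in ["mom", "dad", "bad", "fag", "abac"]:
        print(word, "valid" if regex.match(word) else "NOT valid")
```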
How about sorting both strings into alphabetical order then for the string you want to check insert .*
between each letter like so:
'aabcdefg' =~ m/a.*b.*d.*/
True
'aabcdefg' =~ m/m.*m.*u.*/
False
'aabcdefg' =~ m/a.*d.*d.*/
False
Some pseudocode:
Sort the available characters into alphabetical order
for each word:
    Sort the characters of the word into alphabetical order
    For each character of the word, search forwards through the available characters to find a matching character. Note that this search never goes back to the start of the available characters: matched characters are consumed.
Or even better, use frequency counts of characters.
For your available characters, construct a map from each character to the occurrence count of that character.
Do the same for each candidate word and compare against the available map: if the word map contains a mapping for a character where the available map does not, or the mapped value is larger in the word map than in the available map, then the word cannot be constructed using the available characters.
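The frequency-count comparison can be sketched in a few lines of Python using `collections.Counter`. The function name `can_compose` is made up for this illustration.

```python
from collections import Counter

def can_compose(word, available):
    """True if every letter of word is available often enough."""
    need = Counter(word)
    have = Counter(available)
    # A Counter returns 0 for missing keys, so letters that are not
    # available at all fail the comparison as well.
    return all(have[ch] >= n for ch, n in need.items())

if __name__ == "__main__":
    letters = "abcdaefg"
    for word in ["mom", "dad", "bad", "fag", "abac"]:
        print(word, can_compose(word, letters))
```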
Here's a really simple script that would be rather easy to generalize:
#!/usr/bin/env perl
use strict;
use warnings;
sub check_word {
  my $word = shift;
  my %chars;
  $chars{$_}++ for @_;
  $chars{$_}-- or return for split //, $word;
  return 1;
}
print check_word( 'cab', qw/a b c/ ) ? "Good" : "Bad";
And of course the performance of this function could be greatly enhanced if the letters list is going to be the same every time. Actually for eight characters, copying the hash vs building a new one each time is probably the same speed.
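pseudocode:
string[] chars = { "a", "b", "c" }
foreach word in words
{
    bool possible = true
    string[] remaining = copy of chars
    foreach char in word.chars
    {
        // removeOne removes one occurrence and returns false if none is left,
        // so each available letter can be used only once
        possible = possible && remaining.removeOne(char)
    }
}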

How can I check if every substring of four zeros is followed by at least four ones using regular expressions?

How can I write a regular expression in which,
whenever there is 0000, there should be 1111 after it? For example:
00101011000011111111001111010 -> correct
0000110 -> incorrect
11110 -> correct
Thanks for any help.
If you are using Perl, you can use a zero-width negative-lookahead assertion:
#!/usr/bin/perl
use strict; use warnings;
my @strings = qw(
    00101011000011111111001111010
    00001111000011
    0000110
    11110
);
my $re = qr/0000(?!1111)/;
for my $s ( @strings ) {
    my $result = $s =~ $re ? 'incorrect' : 'correct';
    print "$s -> $result\n";
}
The pattern matches if there is a string of 0000 not followed by at least four 1s. So, a match indicates an incorrect string.
Output:
C:\Temp> s
00101011000011111111001111010 -> correct
00001111000011 -> incorrect
0000110 -> incorrect
11110 -> correct
Some languages' alleged "regular expressions" actually implement something quite different from (generally a superset of) what are called regular expressions in computer science, including e.g. the power of pushdown automata or even arbitrary code execution within "regexes". Still, I think the question is best answered in actual regex terms, as follows:
regular expressions are in general a good way to answer many questions of the form "is there any spot in the text in which the following pattern occurs" (with limitations on the pattern, of course -- for example, balancing of nested parentheses is beyond regexes' power, although of course it may well not be beyond the power of arbitrary supersets of regexes). "Does the whole text match this pattern" is obviously a special case of the question "does any spot in the text match this pattern", given the possibility to have special markers meaning "start of text" and "end of text" (typically ^ and $ in typical regex-pattern syntax).
However, "can you check that NO spot in the text matches this pattern" is not a question which regex matching can directly answer... but adding (outside the regex) the logical operation not obviously solves the problem in practice, because "check that no spot matches" is clearly the same as "tell me if any spot matches" followed by "transform success into failure and vice versa" (the latter being the logical not part). This is the key insight in Sinan's answer, beyond the specific use of Perl's negative lookahead (which is really just a shortcut, not an extension of regex power per se).
If your favorite language for using regexes in doesn't have negative lookahead but does have the {<number>} "count shortcut", parentheses, and the vertical bar "or" operation:
00001{0,3}([^1]|$)
i.e., "four 0s followed by zero to three 1s followed by either a non-1 character or end-of-text" is exactly a pattern such that any text matching it anywhere violates your constraint (IOW, it can be seen as a slight expansion of the negative-lookahead shortcut syntax). Add a logical-not (again, in whatever language you prefer), and there you are!
There are several approaches that you can take here, and I will list some of them.
Checking that for all cases, the requirement is always met
In this approach, we simply look for 0000(1111)? and find all matches. Since ? is greedy, it will match the 1111 if possible, so we simply check that each match is 00001111. If it's only 0000 then we say that the input is not valid. If we didn't find any match that is only 0000 (perhaps because there's no match at all to begin with), then we say it's valid.
In pseudocode:
FUNCTION isValid(s:String) : boolean
    FOR EVERY match /0000(1111)?/ FOUND ON s
        IF match IS NOT "00001111" THEN
            RETURN false
    RETURN true
Checking that there is a case where the requirement isn't met (then oppose)
In this approach, we use the regex to try to find a violation instead. Thus, a successful match means the input is not valid. If there's no match, then there's no violation, so we say the input is valid (this is what is meant by "check-then-oppose").
In pseudocode
isValid := NOT (/violationPattern/ FOUND ON s)
Lookahead option
If your flavor supports it, negative lookahead is the most natural way to express this pattern. Simply look for 0000(?!1111).
No lookahead option
If your flavor doesn't support negative lookahead, you can still use this approach. Now the pattern becomes 00001{0,3}(0|$). That is, we try to match 0000, followed by 1{0,3} (that is, between zero and three 1s), followed by either 0 or the end of string anchor $.
Fully spelled out option
This is equivalent to the previous option, but instead of using repetition and alternation syntax, you explicitly spell out what the violations are. They are
00000|000010|0000110|00001110|
0000$|00001$|000011$|0000111$
Checking that there ISN'T a case where the requirement isn't met
This relies on negative lookahead; it's simply taking the previous approach to the next level. Instead of:
isValid := NOT (/violationPattern/ FOUND ON s)
we can bring the NOT into the regex using negative lookahead as follows:
isValid := (/^(?!.*violationPattern)/ FOUND ON s)
That is, anchoring ourself at the beginning of the string, we negatively assert that we can match .*violationPattern. The .* allows us to "search" for the violationPattern as far ahead as necessary.
Attachments
Here are the patterns showcased on rubular:
Approach 1: Matching only 0000 means invalid
0000(?:1111)?
Approach 2: Match means invalid
0000(?!1111)
00001{0,3}(?:0|$)
00000|000010|0000110|00001110|0000$|00001$|000011$|0000111$
Approach 3: Match means valid
^(?!.*0000(?!1111)).*$
The input used is (annotated to show which ones are valid):
+ 00101011000011111111001111010
- 000011110000
- 0000110
+ 11110
- 00000
- 00001
- 000011
- 0000111
+ 00001111
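To see that the approaches agree on the inputs above, here is a small Python sketch. The function names are made up for illustration; the patterns are the ones listed under the three approaches.

```python
import re

def valid_by_inspecting_matches(s):
    """Approach 1: every match of 0000(1111)? must be the full 00001111."""
    return all(m.group(0) == "00001111"
               for m in re.finditer(r"0000(?:1111)?", s))

def valid_by_lookahead(s):
    """Approach 2 with negative lookahead: a match means a violation."""
    return re.search(r"0000(?!1111)", s) is None

def valid_without_lookahead(s):
    """Approach 2 without lookahead: a match means a violation."""
    return re.search(r"00001{0,3}(?:0|$)", s) is None

def valid_by_negated_search(s):
    """Approach 3: the negation lives inside the regex; a match means valid."""
    return re.search(r"^(?!.*0000(?!1111)).*$", s) is not None

SAMPLES = {
    "00101011000011111111001111010": True,
    "000011110000": False,
    "0000110": False,
    "11110": True,
    "00000": False,
    "00001": False,
    "000011": False,
    "0000111": False,
    "00001111": True,
}

if __name__ == "__main__":
    checks = [valid_by_inspecting_matches, valid_by_lookahead,
              valid_without_lookahead, valid_by_negated_search]
    for s, expected in SAMPLES.items():
        print(s, expected, [check(s) for check in checks])
```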
References
regular-expressions.info/Lookarounds and Flavor comparison

How to check that a string is a palindrome using regular expressions?

That was an interview question that I was unable to answer:
How to check that a string is a palindrome using regular expressions?
p.s. There is already a question "How to check if the given string is palindrome?" and it gives a lot of answers in different languages, but no answer that uses regular expressions.
The answer to this question is that "it is impossible". More specifically, the interviewer is wondering if you paid attention in your computational theory class.
In your computational theory class you learned about finite state machines. A finite state machine is composed of nodes and edges. Each edge is annotated with a letter from a finite alphabet. One or more nodes are special "accepting" nodes and one node is the "start" node. As each letter is read from a given word we traverse the given edge in the machine. If we end up in an accepting state then we say that the machine "accepts" that word.
A regular expression can always be translated into an equivalent finite state machine. That is, one that accepts and rejects the same words as the regular expression (in the real world, some regexp languages allow for arbitrary functions, these don't count).
It is impossible to build a finite state machine that accepts all palindromes. The proof relies on the fact that we can easily build a string that requires an arbitrarily large number of nodes, namely the string
a^x b a^x (eg., aba, aabaa, aaabaaa, aaaabaaaa, ....)
where a^x is a repeated x times. This requires at least x nodes because, after seeing the 'b' we have to count back x times to make sure it is a palindrome.
Finally, getting back to the original question, you could tell the interviewer that you can write a regular expression that accepts all palindromes that are smaller than some finite fixed length. If there is ever a real-world application that requires identifying palindromes then it will almost certainly not include arbitrarily long ones, thus this answer would show that you can differentiate theoretical impossibilities from real-world applications. Still, the actual regexp would be quite long, much longer than equivalent 4-line program (easy exercise for the reader: write a program that identifies palindromes).
While the PCRE engine does support recursive regular expressions (see the answer by Peter Krauss), you cannot use a regex on the ICU engine (as used, for example, by Apple) to achieve this without extra code. You'll need to do something like this:
This detects any palindrome, but does require a loop (which will be required because regular expressions can't count).
$a = "teststring";
while(length $a > 1)
{
    $a =~ /(.)(.*)(.)/;
    die "Not a palindrome: $a" unless $1 eq $3;
    $a = $2;
}
print "Palindrome";
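The same peel-the-ends loop can be written in Python; this is a sketch for comparison only, with a made-up function name. The regex handles one pair of outer characters per iteration, and the loop supplies the counting that the regex cannot do.

```python
import re

# First character, everything in between, last character.
ENDS = re.compile(r"(.)(.*)(.)", re.DOTALL)

def is_palindrome_by_peeling(s):
    """Compare the outer characters, then repeat on the inner part."""
    while len(s) > 1:
        m = ENDS.fullmatch(s)  # always matches when len(s) >= 2
        if m.group(1) != m.group(3):
            return False
        s = m.group(2)
    return True

if __name__ == "__main__":
    for candidate in ["teststring", "racecar", "abba", "a", ""]:
        print(repr(candidate), is_palindrome_by_peeling(candidate))
```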
It's not possible. Palindromes aren't defined by a regular language. (See, I DID learn something in computational theory)
With Perl regex:
/^((.)(?1)\2|.?)$/
Though, as many have pointed out, this can't be considered a regular expression if you want to be strict. Regular expressions do not support recursion.
Here's one to detect 4-letter palindromes (e.g.: deed), for any type of character:
\(.\)\(.\)\2\1
Here's one to detect 5-letter palindromes (e.g.: radar), checking for letters only:
\([a-z]\)\([a-z]\)[a-z]\2\1
So it seems we need a different regex for each possible word length.
This post on a Python mailing list includes some details as to why (Finite State Automata and pumping lemma).
Depending on how confident you are, I'd give this answer:
I wouldn't do it with a regular expression. It's not an appropriate use of regular expressions.
Yes, you can do it in .Net!
(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))
You can check it here! It's a wonderful post!
StackOverflow is full of answers like "Regular expressions? nope, they don't support it. They can't support it.".
The truth is that regular expressions have nothing to do with regular grammars anymore. Modern regular expressions feature functions such as recursion and balancing groups, and the availability of their implementations is ever growing (see Ruby examples here, for instance). In my opinion, hanging onto old belief that regular expressions in our field are anything but a programming concept is just counterproductive. Instead of hating them for the word choice that is no longer the most appropriate, it is time for us to accept things and move on.
Here's a quote from Larry Wall, the creator of Perl itself:
(…) generally having to do with what we call “regular expressions”, which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I’m not going to try to fight linguistic necessity here. I will, however, generally call them “regexes” (or “regexen”, when I’m in an Anglo-Saxon mood).
And here's a blog post by one of PHP's core developers:
As the article was quite long, here a summary of the main points:
The “regular expressions” used by programmers have very little in common with the original notion of regularity in the context of formal language theory.
Regular expressions (at least PCRE) can match all context-free languages. As such they can also match well-formed HTML and pretty much all other programming languages.
Regular expressions can match at least some context-sensitive languages.
Matching of regular expressions is NP-complete. As such you can solve any other NP problem using regular expressions.
That being said, you can match palindromes with regexes using this:
^(?'letter'[a-z])+[a-z]?(?:\k'letter'(?'-letter'))+(?(letter)(?!))$
...which obviously has nothing to do with regular grammars.
More info here: http://www.regular-expressions.info/balancing.html
As a few have already said, there's no single regexp that'll detect a general palindrome out of the box, but if you want to detect palindromes up to a certain length, you can use something like
(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1
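A bounded-length pattern of this shape can be generated for any chosen limit. The Python sketch below is an illustration under assumptions: the helper name `bounded_palindrome_regex` and its `max_half` parameter are made up, and `fullmatch` is used so that the whole string has to be a palindrome rather than merely contain one.

```python
import re

def bounded_palindrome_regex(max_half=5):
    """Regex for palindromes of at most 2 * max_half + 1 characters.

    max_half optional capture groups, an optional middle character,
    then the backreferences in reverse order. Keep max_half below 100
    so that the numbered backreferences stay unambiguous.
    """
    groups = "(.?)" * max_half
    backrefs = "".join("\\%d" % i for i in range(max_half, 0, -1))
    return re.compile(groups + ".?" + backrefs, re.DOTALL)

def is_bounded_palindrome(s, max_half=5):
    return bounded_palindrome_regex(max_half).fullmatch(s) is not None

if __name__ == "__main__":
    print(bounded_palindrome_regex(5).pattern)
    for candidate in ["racecar", "abba", "abca", "abcdeffedcba"]:
        print(candidate, is_bounded_palindrome(candidate))
```

With the default of five groups the last example is rejected only because it is twelve characters long; raising `max_half` to 6 accepts it, which illustrates why a separate, longer pattern is needed for every increase in length.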
You can also do it without using recursion:
\A(?:(.)(?=.*?((?(2)\1\2|\1))\z))*?.?\2\z
to allow a single character:
\A(?:(?:(.)(?=.*?((?(2)\1\2|\1))\z))*?.?\2|.)\z
Works with Perl, PCRE
demo
For Java:
\A(?:(.)(?=.*?(\1\2\z|(?<!(?=\2\z).{0,1000})\1\z)))*?.?\2\z
demo
It can be done in Perl now. Using recursive reference:
if($istr =~ /^((\w)(?1)\g{-1}|\w?)$/){
    print $istr," is palindrome\n";
}
Adapted from an example near the end of http://perldoc.perl.org/perlretut.html
In Ruby you can use named capture groups, so something like this will work:
def palindrome?(string)
  $1 if string =~ /\A(?<p>| \w | (?: (?<l>\w) \g<p> \k<l+0> ))\z/x
end
try it, it works...
1.9.2p290 :017 > palindrome?("racecar")
=> "racecar"
1.9.2p290 :018 > palindrome?("kayak")
=> "kayak"
1.9.2p290 :019 > palindrome?("woahitworks!")
=> nil
Recursive Regular Expressions can do it!
Here is a simple, self-evident pattern to detect a string that contains a palindrome:
(\w)(?:(?R)|\w?)\1
The tutorial at rexegg.com/regex-recursion explains how it works.
It should work in any language whose regex engine supports recursion; here is an example adapted from the same source as a proof of concept, using PHP:
$subjects=['dont','o','oo','kook','book','paper','kayak','okonoko','aaaaa','bbbb'];
$pattern='/(\w)(?:(?R)|\w?)\1/';
foreach ($subjects as $sub) {
    echo $sub." ".str_repeat('-',15-strlen($sub))."-> ";
    if (preg_match($pattern,$sub,$m))
        echo $m[0].(($m[0]==$sub)? "! a palindrome!\n": "\n");
    else
        echo "sorry, no match\n";
}
outputs
dont ------------> sorry, no match
o ---------------> sorry, no match
oo --------------> oo! a palindrome!
kook ------------> kook! a palindrome!
book ------------> oo
paper -----------> pap
kayak -----------> kayak! a palindrome!
okonoko ---------> okonoko! a palindrome!
aaaaa -----------> aaaaa! a palindrome!
bbbb ------------> bbb
Comparing
The regular expression ^((\w)(?:(?1)|\w?)\2)$ does the same job, but as a yes/no test instead of "contains". PS: it uses a definition where "o" is not a palindrome and the hyphenated form "able-elba" is not a palindrome, but "ableelba" is. Call this definition1. When "o" and "able-elba" count as palindromes, call it definition2.
Comparing with other "palindrome regexes":
^((.)(?:(?1)|.?)\2)$ the base regex above without the \w restriction, accepting "able-elba".
^((.)(?1)?\2|.)$ (@LilDevil) uses definition2 (accepts "o" and "able-elba", and also differs in the recognition of the "aaaaa" and "bbbb" strings).
^((.)(?1)\2|.?)$ (@Markus) detects neither "kook" nor "bbbb".
^((.)(?1)*\2|.?)$ (@Csaba) uses definition2.
NOTE: to compare, you can add more words to $subjects and a line for each compared regex:
if (preg_match('/^((.)(?:(?1)|.?)\2)$/',$sub)) echo " ...reg_base($sub)!\n";
if (preg_match('/^((.)(?1)?\2|.)$/',$sub)) echo " ...reg2($sub)!\n";
if (preg_match('/^((.)(?1)\2|.?)$/',$sub)) echo " ...reg3($sub)!\n";
if (preg_match('/^((.)(?1)*\2|.?)$/',$sub)) echo " ...reg4($sub)!\n";
Here's my answer to Regex Golf's 5th level (A man, a plan). It works for up to 7 characters with the browser's Regexp (I'm using Chrome 36.0.1985.143).
^(.)(.)(?:(.).?\3?)?\2\1$
Here's one for up to 9 characters
^(.)(.)(?:(.)(?:(.).?\4?)?\3?)?\2\1$
To increase the max number of characters it'd work for, you'd repeatedly replace .? with (?:(.).?\n?)?.
It's actually easier to do it with string manipulation rather than regular expressions:
bool isPalindrome(String s1)
{
    String s2 = s1.reverse;
    return s2 == s1;
}
I realize this doesn't really answer the interview question, but you could use it to show how you know a better way of doing a task, and you aren't the typical "person with a hammer, who sees every problem as a nail."
Regarding the PCRE expression (from MizardX):
/^((.)(?1)\2|.?)$/
Have you tested it? On my PHP 5.3 under Win XP Pro it fails on: aaaba
Actually, I modified the expression slightly, to read:
/^((.)(?1)*\2|.?)$/
I think what is happening is that while the outer pair of characters is anchored, the remaining inner ones are not. This is not quite the whole answer because while it incorrectly passes on "aaaba" and "aabaacaa", it does fail correctly on "aabaaca".
I wonder whether there is a fix for this, and also:
Does the Perl example (by JF Sebastian / Zsolt) pass my tests correctly?
Csaba Gabor from Vienna
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/
It is valid for the Oniguruma engine (which is used in Ruby).
Taken from Pragmatic Bookshelf.
In Perl (see also Zsolt Botykai's answer):
$re = qr/
    .                 # single letter is a palindrome
    |
    (.)               # first letter
    (??{ $re })??     # apply recursively (not interpolated yet)
    \1                # last letter
/x;
while(<>) {
    chomp;
    say if /^$re$/; # print palindromes
}
As pointed out by ZCHudson, determining whether something is a palindrome cannot be done with a usual regexp, as the set of palindromes is not a regular language.
I totally disagree with Airsource Ltd when he says that "it's not possible" is not the kind of answer the interviewer is looking for. During my interviews, I come to this kind of question when I face a good candidate, to check whether he can find the right argument when we propose that he do something wrong. I do not want to hire someone who will try to do something the wrong way if he knows a better one.
something you can do with perl: http://www.perlmonks.org/?node_id=577368
I would explain to the interviewer that the language consisting of palindromes is not a regular language but instead context-free.
A regular expression that matched all palindromes would have to be infinite. Instead I would suggest that he either restrict himself to a maximum size of palindrome to accept or, if all palindromes are needed, use at minimum some type of NPDA (nondeterministic pushdown automaton), or just use the simple string reversal/equals technique.
The best you can do with regexes, before you run out of capture groups:
/^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$/
This will match all palindromes up to 19 characters in length.
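A pattern of this shape can be generated for any number of groups. The Python sketch below is one way to do that; the function name `bounded_palindrome_regex` and its `groups` parameter are illustrative, not from the answer. It relies on Python's `re` accepting numbered backreferences, and it uses `fullmatch`, so the whole string must match. With `groups=9` it accepts palindromes of up to 2 * 9 + 1 = 19 characters.

```python
import re


def bounded_palindrome_regex(groups: int) -> "re.Pattern[str]":
    """Build a pattern for palindromes of at most 2 * groups + 1 characters."""
    # One optional single-character capture per position in the first half.
    first_half = "(.?)" * groups
    # Optional middle character for odd lengths.
    middle = ".?"
    # Backreferences in reverse order: \groups ... \2 \1
    second_half = "".join("\\" + str(i) for i in range(groups, 0, -1))
    return re.compile(first_half + middle + second_half)


if __name__ == "__main__":
    pattern = bounded_palindrome_regex(9)
    for s in ["racecar", "abba", "aaaba", "a" * 20]:
        print(repr(s), pattern.fullmatch(s) is not None)
```

Anything longer than the bound is rejected even if it is a palindrome, which is the limitation the answer describes.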
Programmatically solving for all lengths is trivial:
str == str.reverse ? true : false
I don't have the rep to comment inline yet, but the regex provided by MizardX, and modified by Csaba, can be modified further to make it work in PCRE. The only failure I have found is the single-char string, but I can test for that separately.
/^((.)(?1)?\2|.)$/
If you can make it fail on any other strings, please comment.
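#!/usr/bin/perl
use strict;
use warnings;
print "Enter your string: ";
chomp(my $str = scalar(<STDIN>));
# number of characters to capture in the first half
my $half = int(length($str) / 2);
if ( length($str) > 0 ) {
my $r = "";
# one capture group per character in the first half
foreach (1 .. $half) {
$r .= "(.)";
}
# optional middle character for odd lengths
$r .= ".?";
# backreferences in reverse order: \g{half} ... \g{1}
for ( my $i = $half; $i > 0; $i-- ) {
$r .= "\\g{$i}";
}
# match against the generated pattern, anchored at both ends
if ( $str =~ /^$r$/ ){
print "$str is a palindrome\n";
}
else {
print "$str not a palindrome\n";
}
exit(0);
}
print "$str not a palindrome\n";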
From automata theory, it is impossible to match palindromes of arbitrary length with a regular expression (because that would require an unbounded amount of memory). But IT IS POSSIBLE to match palindromes of a fixed length.
That is, it is possible to write a regex that matches all palindromes of length <= 5, or <= 6, and so on, but not one for length >= 5, where the upper bound is open.
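To illustrate the bounded case, the Python sketch below builds a pattern with no backreferences and no recursion for palindromes up to a given length. The function name `palindrome_pattern` and the two-letter default alphabet are assumptions made for the example. The pattern grows exponentially with the length bound, which is the price of encoding the "memory" in the expression itself.

```python
import re


def palindrome_pattern(max_len: int, alphabet: str = "ab") -> str:
    """Return a backreference-free pattern for palindromes of length <= max_len."""
    if max_len <= 0:
        return ""  # only the empty string
    chars = [re.escape(c) for c in alphabet]
    # Palindromes of length 0 or 1.
    options = [""] + chars
    if max_len >= 2:
        # Longer palindromes: the same character on both sides of a shorter one.
        inner = palindrome_pattern(max_len - 2, alphabet)
        options += [f"{c}(?:{inner}){c}" for c in chars]
    return "|".join(options)


if __name__ == "__main__":
    pattern = palindrome_pattern(5)
    print(len(pattern))
    for s in ["abbba", "ababa", "abab", "abbbba"]:
        print(s, re.fullmatch(pattern, s) is not None)
```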
In Ruby you can use \b(?'word'(?'letter'[a-z])\g'word'\k'letter+0'|[a-z])\b to match palindrome words such as a, dad, radar, racecar, and redivider. PS: this regex only matches palindrome words that have an odd number of letters.
Let's see how this regex matches radar. The word boundary \b matches at the start of the string. The regex engine enters the capturing group "word". [a-z] matches r which is then stored in the stack for the capturing group "letter" at recursion level zero. Now the regex engine enters the first recursion of the group "word". (?'letter'[a-z]) matches and captures a at recursion level one. The regex enters the second recursion of the group "word". (?'letter'[a-z]) captures d at recursion level two. During the next two recursions, the group captures a and r at levels three and four. The fifth recursion fails because there are no characters left in the string for [a-z] to match. The regex engine must backtrack.
The regex engine must now try the second alternative inside the group "word". The second [a-z] in the regex matches the final r in the string. The engine now exits from a successful recursion, going one level back up to the third recursion.
After matching \g'word' the engine reaches \k'letter+0'. The backreference fails because the regex engine has already reached the end of the subject string. So it backtracks once more. The second alternative now matches the a. The regex engine exits from the third recursion.
The regex engine has again matched \g'word' and needs to attempt the backreference again. The backreference specifies +0 or the present level of recursion, which is 2. At this level, the capturing group matched d. The backreference fails because the next character in the string is r. Backtracking again, the second alternative matches d.
Now, \k'letter+0' matches the second a in the string. That's because the regex engine has arrived back at the first recursion during which the capturing group matched the first a. The regex engine exits the first recursion.
The regex engine is now back outside all recursion. At this level, the capturing group stored r. The backreference can now match the final r in the string. Since the engine is not inside any recursion any more, it proceeds with the remainder of the regex after the group. \b matches at the end of the string. The end of the regex is reached and radar is returned as the overall match.
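The walk-through can be mirrored in ordinary code. The Python sketch below is a hand-written backtracking matcher for the group "word", not the engine's actual implementation; the names `match_word` and `is_odd_palindrome` are made up for the example. Each call corresponds to one recursion level, the local variable `letter` plays the role of the capture at that level, and the generator yields every position where "word" could end so that the caller can backtrack.

```python
from typing import Iterator


def match_word(s: str, pos: int) -> Iterator[int]:
    """Yield each end index at which the group 'word' can match starting at pos."""
    if pos < len(s) and "a" <= s[pos] <= "z":
        letter = s[pos]  # the capture for this recursion level
        # First alternative: letter, recursive 'word', the same letter again.
        for end in match_word(s, pos + 1):
            if end < len(s) and s[end] == letter:
                yield end + 1
        # Second alternative: a single letter.
        yield pos + 1


def is_odd_palindrome(s: str) -> bool:
    """True if 'word' can match the whole string (odd-length palindromes only)."""
    return any(end == len(s) for end in match_word(s, 0))


if __name__ == "__main__":
    for word in ["radar", "dad", "a", "abba", "radio"]:
        print(word, is_odd_palindrome(word))
```

For radar, the innermost call fails at the end of the string, the two deepest levels fall back to the single-letter alternative, and the outer levels then succeed with their stored a and r, just as described above.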
Here is PL/SQL code that tells whether a given string is a palindrome, using regular expressions:
create or replace procedure palin_test(palin in varchar2) is
tmp varchar2(100);
i number := 0;
BEGIN
tmp := palin;
for i in 1 .. length(palin)/2 loop
if length(tmp) > 1 then
if regexp_like(tmp,'^(^.).*(\1)$') = true then
tmp := substr(palin,i+1,length(tmp)-2);
else
dbms_output.put_line('not a palindrome');
exit;
end if;
end if;
if i >= length(palin)/2 then
dbms_output.put_line('Yes ! it is a palindrome');
end if;
end loop;
end palin_test;
my $pal='malayalam';
while($pal=~/^(.)(.*)\1$/){ #strip a matching first and last character
$pal=$2;
}
if ($pal=~/^.?$/i){ #matches single letter or no letter
print"palindrome\n";
}
else{
print"not palindrome\n";
}
This regex will detect palindromes up to 22 characters ignoring spaces, tabs, commas, and quotes.
\b(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*\11?[ \t,'"]*\10|\10?)[ \t,'"]*\9|\9?)[ \t,'"]*\8|\8?)[ \t,'"]*\7|\7?)[ \t,'"]*\6|\6?)[ \t,'"]*\5|\5?)[ \t,'"]*\4|\4?)[ \t,'"]*\3|\3?)[ \t,'"]*\2|\2?))?[ \t,'"]*\1\b
Play with it here: https://regexr.com/4tmui
I wrote an explanation of how I got that here: https://medium.com/analytics-vidhya/coding-the-impossible-palindrome-detector-with-a-regular-expressions-cd76bc23b89b
A slight refinement of Airsource Ltd's method, in pseudocode:
WHILE string.length > 1
IF /^(.)(.*)\1$/ matches string
string = \2
ELSE
REJECT
ACCEPT
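A runnable version of this loop might look like the following Python sketch. The function name `is_palindrome_by_peeling` is illustrative. It uses `fullmatch` so that the captured character has to be the first and last character of the whole string, and `re.DOTALL` so that newlines are treated like any other character.

```python
import re

# First character, anything in between, then the same character again.
OUTER_PAIR = re.compile(r"(.)(.*)\1", re.DOTALL)


def is_palindrome_by_peeling(s: str) -> bool:
    """Strip matching outer characters until at most one character remains."""
    while len(s) > 1:
        m = OUTER_PAIR.fullmatch(s)
        if m is None:
            return False  # REJECT: first and last characters differ
        s = m.group(2)  # keep only the inner part
    return True  # ACCEPT: zero or one character left


if __name__ == "__main__":
    for word in ["malayalam", "aaaba", "aabaacaa", "abba"]:
        print(word, is_palindrome_by_peeling(word))
```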