Difference between ? and * in regular expressions - match same input? - regex

I am not able to understand the practical difference between ? and * in regular expressions. I know that ? means to check if previous character/group is present 0 or 1 times and * means to check if the previous character/group is present 0 or more times.
But this code
while(<>) {
chomp($_);
if(/hello?/) {
print "metch $_ \n";
}
else {
print "naot metch $_ \n";
}
}
gives the same output for both hello? and hello*. The external file that is given to this Perl program contains
hello
helloooo
hell
And the output is
metch hello
metch helloooo
metch hell
for both hello? and hello*. I am not able to understand the exact difference between ? and *

In Perl (and unlike Java), the m//-match operator is not anchored by default.
As such, all of the input is trivially matched by both /hello?/ and /hello*/. That is, these will match any string that contains "hell" anywhere (both quantifiers make the "o" optional).
Compare with /^hello?$/ and /^hello*$/, respectively. Since these employ anchors the former will not match "helloo" (as at most one "o" is allowed) while the latter will.
Under Regexp Quote-like Operators:
m/PATTERN/ searches [anywhere in] a string for a pattern match, and in scalar context returns true if it succeeds, false if it fails.
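The same effect can be reproduced outside Perl. Here is a minimal sketch in Python (not the asker's language), assuming that re.search stands in for the unanchored m// and re.fullmatch stands in for the anchored /^...$/ form; the list name lines is just illustrative.

```python
import re

lines = ["hello", "helloooo", "hell"]

# re.search looks for the pattern anywhere in the string (like Perl's m//).
unanchored_q = [s for s in lines if re.search(r"hello?", s)]
unanchored_star = [s for s in lines if re.search(r"hello*", s)]

# re.fullmatch requires the whole string to match (like adding ^ and $ here).
anchored_q = [s for s in lines if re.fullmatch(r"hello?", s)]
anchored_star = [s for s in lines if re.fullmatch(r"hello*", s)]

print(unanchored_q)     # ['hello', 'helloooo', 'hell']
print(unanchored_star)  # ['hello', 'helloooo', 'hell']
print(anchored_q)       # ['hello', 'hell']
print(anchored_star)    # ['hello', 'helloooo', 'hell']
```

Only the anchored ? form rejects "helloooo", which is the one place where the two quantifiers behave differently on this input.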

What is confusing you is that, without anchors like ^ and $, a regex pattern match checks only whether the pattern appears anywhere in the target string.
If you add something to the pattern after the hello, like
if (/hello?, Ashwin/) { ... }
Then the strings
hello, Ashwin
and
hell, Ashwin
will match, but
helloooo, Ashwin
will not, because there are too many o characters between hell and the comma ,.
However, if you use a star * instead, like
if (/hello*, Ashwin/) { ... }
then all three strings will match.

? means the preceding item is optional. * means the preceding item is optional and may also repeat any number of times.
i.e.
hello? matches hell, hello
hello* matches hell, hello, helloo, hellooo, ....
But not using either ^ or $ means these matches can occur anywhere in the string.

Here's an example I came up with that makes it quite clear:
What if you wanted to only match up to tens of people and your data was like below:
2 people. 20 people. 200 people. 2000 people.
Only ? would be useful in that case, whereas * would incorrectly capture larger numbers.
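To make that concrete, here is a small Python sketch. The exact patterns are my own guess at what the example implies (a leading \b so the match cannot start in the middle of a number, then one digit plus an optional or repeated second digit); they are not taken from the answer.

```python
import re

text = "2 people. 20 people. 200 people. 2000 people."

# \d\d? allows one or two digits: the second digit is optional.
up_to_tens = re.findall(r"\b\d\d? people", text)

# \d\d* allows one digit followed by any number of further digits.
any_size = re.findall(r"\b\d\d* people", text)

print(up_to_tens)  # ['2 people', '20 people']
print(any_size)    # ['2 people', '20 people', '200 people', '2000 people']
```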

Related

How can I match only integers in Perl?

So I have an array that goes like this:
my @nums = (1,2,12,24,48,120,360);
I want to check if there is an element that is not an integer inside that array without using loop. It goes like this:
if(grep(!/[^0-9]|\^$/,@nums)){
die "Numbers are not in correct format.";
}else{
#Do something
}
Basically, the format should not be like this (Empty string is acceptable):
1A
A2
#A
#
#######
More examples:
1,2,3,A3 = Unacceptable
1,2,###,2 = unacceptable
1,2,3A,4 = Unacceptable
1, ,3,4=Acceptable
1,2,3,360 = acceptable
I know that there is another way by using look like a number. But I can't use that for some reason (outside of my control/setup reasons). That's why I used the regex method.
My question is, even though the numbers are not in the correct format (A60 for example), the condition always returns false. Basically, it ignores the incorrect format.
You say in the comments that you don't want to use modules because you can't install them, but there are many core modules that should come with Perl (although some systems screw this up).
zdim's answer in the comments is to look for anything that is not 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. That's the negated character class [^0-9]. A grep in scalar context returns the number of items that match:
my $found_non_ints = grep { /[^0-9]/ } @items;
Instead of that, I'd go back to the non-negated character class and match string that only has zero or more digits. To do this, anchor the pattern to the absolute start and end of the string:
my $found_non_ints = grep { ! /\A[0-9]*\z/ } @items;
But, this doesn't really match integers. It matches positive whole numbers (and zero). If you want to match negative numbers as well, allow an optional - at the start of the string:
my $found_non_ints = grep { ! /\A-?[0-9]*\z/ } @items;
That - would be a problem in the negated character class.
Also, you don't want the $ anchor here: that allows a possible newline to match at the end, and that's a non-digit (the \Z is the same for the end of the string). Also, the meaning of $ can change based on the setting of the /m flag, which might be set with default regex flags.
Here's a short program with your sample data. Note that you need to decide how to split up the list; does whitespace matter? I decided to remove whitespace around the comma:
#!perl
use v5.10;
while( <DATA> ) {
chomp;
my $found_non_ints = grep { ! /\A[0-9]*\z/ } split /\s*,\s*/;
say "$_ => Found $found_non_ints non-ints";
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,,3,4
1,2,3,360
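For comparison, the same anchoring idea can be sketched in Python. This is only an illustration under a few assumptions: re.fullmatch plays the role of \A...\z, [0-9] is used instead of \d because Python's \d also accepts non-ASCII digits, and the helper names split_fields and count_non_ints are made up for this example.

```python
import re

def split_fields(line):
    """Split on commas, dropping whitespace around each comma and at both ends."""
    return re.split(r"\s*,\s*", line.strip())

def count_non_ints(items):
    """Count items that are not made up solely of ASCII digits (empty is allowed)."""
    # fullmatch anchors the pattern at the absolute start and end of the string.
    return sum(1 for item in items if not re.fullmatch(r"[0-9]*", item))

for line in ["1,2,3,A3", "1,2,###,2", "1, ,3,4", "1,2,3,360"]:
    print(line, "=>", count_non_ints(split_fields(line)))

# "$" also matches just before a trailing newline; fullmatch does not allow it.
print(bool(re.search(r"^[0-9]*$", "12\n")))   # True
print(bool(re.fullmatch(r"[0-9]*", "12\n")))  # False
```

The last two lines show the trailing-newline pitfall of $ mentioned above, which Python shares with Perl.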
The solution proposed in the question gets close, except that the logic got reversed and there is an error in a regex pattern. One way for it:
if ( grep { /[^0-9] | ^$/x } @nums ) { say 'not all integers' }
Regex explanation
[] is a character class: it matches any one of the characters listed inside (so [abc] matches either of a, b, or c) -- but when it starts with a ^ it matches any character not listed; so [^abc] matches any char not being either of a, b, or c. The pattern 0-9 inside a character class specifies all digits in that range (and we can also use a-z and A-Z)
So [^0-9] matches any character that is not a digit
Then that is or-ed by | with a ^$: ^ matches the beginning of the string and $ matches the end of it. So ^$ matches a string without anything in it -- an empty string! We need to account for that, since [^0-9] doesn't match it while an array element can be an empty string. (It can also be an undef but from my understanding that is not possible with actual data, and a regex on undef would draw a warning.)
Note that $ allows for a newline as well, and that ^ and $ may change their meaning if /m modifier is in use, matching on linefeeds inside a string. However, in all these cases we'd be matching a non-digit, which is precisely the point here
/x modifier makes it disregard literal spaces inside so we can space things out for easier reading. (It also allows for newlines and comments with #, so complex patterns can be organized and documented very nicely)
So that's all -- the regex tries to match anything that shouldn't be in an integer (assumed to be strictly positive in OP's data).
If it matches any such, in any one of the array elements, then grep returns a list which isn't empty (but has at least one element) and that is "true" under if. So we caught a non-integer and we go into if's block to deal with that.
A little aside: we can also declare and populate an array right inside the if condition, to catch all those non-integers:
if ( my @non_ints = grep { /[^0-9] | ^$/x } @nums ) {
say 'Non-integers: ', join ' ', map { "|$_|" } @non_ints;
}
This also reads more nicely, telling by the array name what we're after in that complicated condition: "non_ints." I put || around each item in print to be able to see an empty string.†
Now, when you put an exclamation mark in front of that regex, it reverses the true/false return from the regex and our code goes haywire. So drop that !.
The other error is in escaping the ^ by having \^. This would match a literal ^ character, robbing ^ of its special meaning as a pattern in regex, explained above. So drop that \.
One other way is in using an extremely useful List::Util library, which is "core" (so it is normally installed with Perl, even though that can get messed up).
Among a number of essential functions it gives us any, and with it we have
use List::Util qw(any);
if ( any { /[^0-9]|^$/ } @nums ) { say 'not all integers' }
I like any firstly because the name of the function includes at least a part of the needed logic, making code that much clearer and easier to comprehend: is there any element of @nums for which the code in the block is true? So any element which contains a non-digit? Precisely what is needed here.
Then, another advantage is that any will quit as soon as it finds one match, while grep continues through the whole list. But this efficiency advantage shows only on very large arrays or a lot of repeated checks. Also, on the other hand sometimes we want to count all instances.
I'd also like to point out some of any's siblings: none and notall. These names themselves also capture a good deal of logic, making otherwise possibly convoluted code that much clearer. Browse through this library to get accustomed to what is in there.
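The any-versus-grep distinction carries over to other languages. Below is a minimal Python sketch of it, assuming Python's built-in any() as the analogue of List::Util's any and a list comprehension as the analogue of grep in list context; the names NON_INT, has_non_int and non_ints are made up for illustration.

```python
import re

# Same idea as /[^0-9]|^$/: a non-digit anywhere, or an empty string.
NON_INT = re.compile(r"[^0-9]|^$")

def has_non_int(nums):
    # any() stops at the first element for which the test is true.
    return any(NON_INT.search(n) for n in nums)

def non_ints(nums):
    # A list comprehension visits every element and collects all offenders.
    return [n for n in nums if NON_INT.search(n)]

print(has_non_int(["1", "2", "A3"]))  # True
print(non_ints(["1", "", "3A"]))      # ['', '3A']
```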
† A program with your test data
use warnings;
use strict;
use feature 'say';
while (<DATA>) {
chomp;
my @nums = split /\s*,\s*/;
say "Data: @nums";
if ( my @non_ints = grep { /[^0-9] | ^$/x } @nums ) {
say 'Non-ints: ', join ' ', map { "|$_|" } @non_ints;
}
say '---';
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,2,3,360

regex for n characters or at least m characters

This should be a pretty simple regex question but I couldn't find any answers anywhere. How would one make a regex, which matches on either ONLY 2 characters, or at least 4 characters. Here is my current method of doing it (ignore the regex itself, that's besides the point):
[A-Za-z0_9_]{2}|[A-Za-z0_9_]{4,}
However, this method takes twice the time (and is approximately 0.3s slower for me on a 400 line file), so I was wondering if there was a better way to do it?
Optimize the beginning, and anchor it.
^[A-Za-z0-9_]{2}(?:|[A-Za-z0-9_]{2,})$
(Also, you did say to ignore the regex itself, but I guessed you probably wanted 0-9, not 0_9)
EDIT Hm, I was sure I read that you want to match lines. Remove the anchors (^$) if you want to match inside the line as well. If you do match full lines only, anchors will speed you up (well, the front anchor ^ will, at least).
Your solution looks pretty good. As an alternative you can try something like this:
[A-Za-z0-9_]{2}(?:[A-Za-z0-9_]{2,})?
Btw, I think you want hyphen instead of underscore between 0 and 9, don't you?
The solution you present is correct.
If you're trying to optimize the routine, and the number of strings of 2 or more matching characters is much smaller than the number of strings that are not, consider accepting all strings of length 2 or greater, then tossing those of length 3. This may boost performance by only running the regex once; the second check need not even be a regular expression, since checking a string length is usually an extremely fast operation.
As always, you really need to run tests on real-world data to verify if this would give you a speed increase.
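A rough Python sketch of that two-step check follows. It assumes whole-string matching and the character class [A-Za-z0-9_] from the answers above; the function name is_two_or_at_least_four is hypothetical.

```python
import re

TWO_OR_MORE = re.compile(r"[A-Za-z0-9_]{2,}")

def is_two_or_at_least_four(s):
    # One regex pass accepts any run of 2 or more allowed characters...
    if not TWO_OR_MORE.fullmatch(s):
        return False
    # ...then a cheap length check discards the length-3 case.
    return len(s) != 3

for word in ["a", "ab", "abc", "abcd", "ab-d"]:
    print(word, is_two_or_at_least_four(word))
```

Whether this is actually faster than a single alternation depends on the data and the engine, so it is worth timing both on real input.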
So basically you want to match words of length either 2 or 2+2+N, N>=0:
([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9]+)?)
working example:
#!/usr/bin/perl
while (<STDIN>)
{
    chomp;
    # two characters, optionally followed by at least two more
    my @matches = ($_=~/([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9]+)?)/g);
    for my $m (@matches) {
        print "match: $m\n";
    }
}
input file:
cat in.txt
ab abc bcad a as asdfa
aboioioi i i abc bcad a as asdfa
output:
perl t.pl <in.txt
match: ab
match: ab
match: bcad
match: as
match: asdfa
match: aboioioi
match: ab
match: bcad
match: as
match: asdfa

Regular expression matching any subset of a given set?

Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.
Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.
I also have a guess that it may be theoretically impossible, because during parsing the string we will need to remember which character we have already met before, and as far as I know regular expressions can check out only right-linear languages.
Any help will be appreciated. Thanks in advance.
This doesn't really qualify for the language-agnostic tag, but...
^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$
see a demo on ideone.com
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.
Can't think how to do it with a single regex, but this is one way to do it with n regexes (I will use 1 2 ... m n etc. for your a's):
^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$
If all the above match, your string is a subset of 12..mn.
How this works: each line requires the string to consist exactly of:
any number of characters drawn from the set, except a particular one
perhaps a particular one
any number of characters drawn from the set, except a particular one
If this passes when every element in turn is considered as a particular one, we know:
there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements
as required.
for completeness I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
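The n-regex idea can be sketched in Python by generating one pattern per allowed character. This is a minimal sketch assuming the allowed characters are distinct single characters; the function name is_subset_by_n_regexes is made up, and re.escape is used so that characters such as ] or - are safe inside the character class.

```python
import re

def is_subset_by_n_regexes(s, allowed):
    """Check s against one regex per allowed character (allowed: distinct chars)."""
    if not allowed:
        return s == ""
    for ch in allowed:
        others = "".join(c for c in allowed if c != ch)
        # Class of every allowed character except ch; empty if ch is the only one.
        rest = f"[{re.escape(others)}]*" if others else ""
        # others*, then ch at most once, then others* again.
        pattern = rest + re.escape(ch) + "?" + rest
        if not re.fullmatch(pattern, s):
            return False
    return True

for word in ["ba", "abac", "cba", "abcd", ""]:
    print(repr(word), is_subset_by_n_regexes(word, "abc"))
```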
Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
foreach char a in {a1, ..., an}
    hit[a] = false
foreach char c in string
    if c not in {a1, ..., an} => fail
    if hit[c] => fail
    hit[c] = true
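A runnable version of that traversal might look like this in Python. It is a sketch only: a set replaces the hit table, and the function name is_subset_string is hypothetical.

```python
def is_subset_string(s, allowed):
    """True if s uses only characters from allowed, each at most once."""
    seen = set()
    for c in s:
        if c not in allowed:  # character outside the allowed set
            return False
        if c in seen:         # allowed, but already used once
            return False
        seen.add(c)
    return True

print(is_subset_string("cba", "abc"))   # True
print(is_subset_string("abac", "abc"))  # False
```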
Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':
my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}
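Python's re module also accepts a backreference inside a negative lookahead, so the same two patterns can be tried there. This is a sketch rather than a port: re.fullmatch replaces the ^...$ anchors, \Z is used as the end-of-string anchor inside the lookahead, and the names SUBSET, PERMUTATION, is_subset and is_permutation are made up.

```python
import re

# Each repetition matches one character from the set, then asserts that the
# same character does not occur again later in the string.
SUBSET = re.compile(r"(?:([abc])(?!.*\1))*", re.DOTALL)

# The lookahead ANDs the subset test with "exactly three characters of abc".
PERMUTATION = re.compile(r"(?=(?:([abc])(?!.*\1))*\Z)[abc]{3}", re.DOTALL)

def is_subset(s):
    return SUBSET.fullmatch(s) is not None

def is_permutation(s):
    return PERMUTATION.fullmatch(s) is not None

for word in ["ba", "pabc", "abac", "a", "cc", "cba", "abcd", "abbbbc", ""]:
    print(repr(word), is_subset(word), is_permutation(word))
```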

What is wrong with the below regular expression(c#3.0)

Consider the below
Case 1: [Success]
Input : X(P)~AK,X(MV)~AK
Replace with: AP
Output: X(P)~AP,X(MV)~AP
Case 2: [Failure]
Input: X(P)~$B,X(MV)~$B
Replace with: C$
Output: X(P)~C$,X(MV)~C$
Actual Output: X(P)~C$B,X(MV)~C$B
I am using the below REGEXP
@"~(\w*[A-Z$%])"
This works fine for case 1 but failed for the second.
Need help
I am using C#3.0
Thanks
It's unclear what exactly your matching requirements are, but changing the regex to @"~(\w*[A-Z$%]+)" should do the trick. (For the examples given, just plain @"~([A-Z$%]+)" should work too.)
It looks like you want something like this:
public static String replaceWith(String input, String repl) {
return Regex.Replace(
input,
@"(?<=~)[A-Z$%]+",
repl
);
}
The (?<=…) is what is called a lookbehind. It's used to assert that to the left there's a tilde, but that tilde is not part of the match.
Now we can test it as follows (as seen on ideone.com):
Console.WriteLine(replaceWith(
"X(P)~AK,X(MV)~AK", "AP"
));
// X(P)~AP,X(MV)~AP
Console.WriteLine(replaceWith(
"X(P)~$B,X(MV)~$B", "C$"
));
// X(P)~C$,X(MV)~C$
Console.WriteLine(replaceWith(
"X(P)~THIS,X(MV)~THAT", "$$$$"
));
// X(P)~$$,X(MV)~$$
Note the last example: $ is a special symbol in substitutions and can have special meanings. $$ actually gets you one dollar sign.
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
Your expression (being greedy) replaces the first string that starts with zero or more word characters and ends in [A-Z$%] after a '~' with your substitution.
In the first case you have ~AK, so \w*[A-Z$%] evaluates to 'AK', matching \w* -> A, and [A-Z$%] -> K
In the second case you have ~$B, so \w*[A-Z$%] evaluates to '$', matching \w* -> nothing, and [A-Z$%] -> $
I think the important thing is that \w is optional (zero or more), but the [A-Z$%] is mandatory. This is why the second case gives '$' not '$B' as the matched part.
Since I don't know what you're trying to achieve I cannot tell you how to fix your expression.
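The behaviour described above is not specific to .NET and can be reproduced with Python's re module. The sketch below is illustrative only: the function names replace_original and replace_repeated are made up, and the replacement is passed through a lambda so that it is inserted literally (Python's replacement syntax differs from .NET's, where $ is special).

```python
import re

def replace_original(text, repl):
    # \w* can match nothing, and [A-Z$%] then consumes exactly one character.
    return re.sub(r"~(\w*[A-Z$%])", lambda m: "~" + repl, text)

def replace_repeated(text, repl):
    # Repeating the class consumes the whole token after the tilde.
    return re.sub(r"(?<=~)[A-Z$%]+", lambda m: repl, text)

print(replace_original("X(P)~AK,X(MV)~AK", "AP"))  # X(P)~AP,X(MV)~AP
print(replace_original("X(P)~$B,X(MV)~$B", "C$"))  # X(P)~C$B,X(MV)~C$B
print(replace_repeated("X(P)~$B,X(MV)~$B", "C$"))  # X(P)~C$,X(MV)~C$
```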

How to check that a string is a palindrome using regular expressions?

That was an interview question that I was unable to answer:
How to check that a string is a palindrome using regular expressions?
p.s. There is already a question "How to check if the given string is palindrome?" and it gives a lot of answers in different languages, but no answer that uses regular expressions.
The answer to this question is that "it is impossible". More specifically, the interviewer is wondering if you paid attention in your computational theory class.
In your computational theory class you learned about finite state machines. A finite state machine is composed of nodes and edges. Each edge is annotated with a letter from a finite alphabet. One or more nodes are special "accepting" nodes and one node is the "start" node. As each letter is read from a given word we traverse the given edge in the machine. If we end up in an accepting state then we say that the machine "accepts" that word.
A regular expression can always be translated into an equivalent finite state machine. That is, one that accepts and rejects the same words as the regular expression (in the real world, some regexp languages allow for arbitrary functions, these don't count).
It is impossible to build a finite state machine that accepts all palindromes. The proof relies on the fact that we can easily build a string that requires an arbitrarily large number of nodes, namely the string
a^x b a^x (eg., aba, aabaa, aaabaaa, aaaabaaaa, ....)
where a^x is a repeated x times. This requires at least x nodes because, after seeing the 'b' we have to count back x times to make sure it is a palindrome.
Finally, getting back to the original question, you could tell the interviewer that you can write a regular expression that accepts all palindromes that are smaller than some finite fixed length. If there is ever a real-world application that requires identifying palindromes then it will almost certainly not include arbitrarily long ones, thus this answer would show that you can differentiate theoretical impossibilities from real-world applications. Still, the actual regexp would be quite long, much longer than the equivalent 4-line program (easy exercise for the reader: write a program that identifies palindromes).
While the PCRE engine does support recursive regular expressions (see the answer by Peter Krauss), you cannot use a regex on the ICU engine (as used, for example, by Apple) to achieve this without extra code. You'll need to do something like this:
This detects any palindrome, but does require a loop (which will be required because regular expressions can't count).
$a = "teststring";
while(length $a > 1)
{
$a =~ /(.)(.*)(.)/;
die "Not a palindrome: $a" unless $1 eq $3;
$a = $2;
}
print "Palindrome";
It's not possible. Palindromes aren't defined by a regular language. (See, I DID learn something in computational theory)
With Perl regex:
/^((.)(?1)\2|.?)$/
Though, as many have pointed out, this can't be considered a regular expression if you want to be strict. Regular expressions do not support recursion.
Here's one to detect 4-letter palindromes (e.g.: deed), for any type of character:
\(.\)\(.\)\2\1
Here's one to detect 5-letter palindromes (e.g.: radar), checking for letters only:
\([a-z]\)\([a-z]\)[a-z]\2\1
So it seems we need a different regex for each possible word length.
This post on a Python mailing list includes some details as to why (Finite State Automata and pumping lemma).
Depending on how confident you are, I'd give this answer:
I wouldn't do it with a regular
expression. It's not an appropriate
use of regular expressions.
Yes, you can do it in .Net!
(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))
You can check it here! It's a wonderful post!
StackOverflow is full of answers like "Regular expressions? nope, they don't support it. They can't support it.".
The truth is that regular expressions have nothing to do with regular grammars anymore. Modern regular expressions feature functions such as recursion and balancing groups, and the availability of their implementations is ever growing (see Ruby examples here, for instance). In my opinion, hanging onto old belief that regular expressions in our field are anything but a programming concept is just counterproductive. Instead of hating them for the word choice that is no longer the most appropriate, it is time for us to accept things and move on.
Here's a quote from Larry Wall, the creator of Perl itself:
(…) generally having to do with what we call “regular expressions”, which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I’m not going to try to fight linguistic necessity here. I will, however, generally call them “regexes” (or “regexen”, when I’m in an Anglo-Saxon mood).
And here's a blog post by one of PHP's core developers:
As the article was quite long, here a summary of the main points:
The “regular expressions” used by programmers have very little in common with the original notion of regularity in the context of formal language theory.
Regular expressions (at least PCRE) can match all context-free languages. As such they can also match well-formed HTML and pretty much all other programming languages.
Regular expressions can match at least some context-sensitive languages.
Matching of regular expressions is NP-complete. As such you can solve any other NP problem using regular expressions.
That being said, you can match palindromes with regexes using this:
^(?'letter'[a-z])+[a-z]?(?:\k'letter'(?'-letter'))+(?(letter)(?!))$
...which obviously has nothing to do with regular grammars.
More info here: http://www.regular-expressions.info/balancing.html
As a few have already said, there's no single regexp that'll detect a general palindrome out of the box, but if you want to detect palindromes up to a certain length, you can use something like
(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1
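Such a bounded pattern can also be generated rather than typed by hand. Here is a minimal Python sketch, assuming whole-string matching via fullmatch; the function name bounded_palindrome_regex and its max_half parameter are hypothetical, and the cap at 99 groups reflects the limit on numbered backreferences in Python's re.

```python
import re

def bounded_palindrome_regex(max_half):
    """Build a pattern for palindromes of length <= 2 * max_half + 1."""
    if not 1 <= max_half <= 99:
        raise ValueError("numbered backreferences in re only go from 1 to 99")
    # Optional first-half characters, each captured in its own group.
    groups = "(.?)" * max_half
    # Backreferences in reverse order mirror the first half.
    backrefs = "".join(f"\\{i}" for i in range(max_half, 0, -1))
    # The lone .? in the middle is the centre of an odd-length palindrome.
    return re.compile(groups + ".?" + backrefs, re.DOTALL)

five = bounded_palindrome_regex(5)
print(five.pattern)
for word in ["racecar", "deed", "abca", "abcdefedcba", "abcdefgfedcba"]:
    print(word, bool(five.fullmatch(word)))
```

With max_half=5 this builds the same pattern as above, so it accepts palindromes of up to 11 characters and rejects the 13-character one.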
You can also do it without using recursion:
\A(?:(.)(?=.*?((?(2)\1\2|\1))\z))*?.?\2\z
to allow a single character:
\A(?:(?:(.)(?=.*?((?(2)\1\2|\1))\z))*?.?\2|.)\z
Works with Perl, PCRE
demo
For Java:
\A(?:(.)(?=.*?(\1\2\z|(?<!(?=\2\z).{0,1000})\1\z)))*?.?\2\z
demo
It can be done in Perl now. Using recursive reference:
if($istr =~ /^((\w)(?1)\g{-1}|\w?)$/){
print $istr," is palindrome\n";
}
modified based on the near last part http://perldoc.perl.org/perlretut.html
In ruby you can use named capture groups. so something like this will work -
def palindrome?(string)
$1 if string =~ /\A(?<p>| \w | (?: (?<l>\w) \g<p> \k<l+0> ))\z/x
end
try it, it works...
1.9.2p290 :017 > palindrome?("racecar")
=> "racecar"
1.9.2p290 :018 > palindrome?("kayak")
=> "kayak"
1.9.2p290 :019 > palindrome?("woahitworks!")
=> nil
Recursive Regular Expressions can do it!
A simple, self-evident pattern to detect a string that contains a palindrome:
(\w)(?:(?R)|\w?)\1
At rexegg.com/regex-recursion the tutorial explains how it works.
It works in any language whose regex engine supports recursion with (?R), such as PCRE. Here is an example adapted from the same source (link) as proof-of-concept, using PHP:
$subjects=['dont','o','oo','kook','book','paper','kayak','okonoko','aaaaa','bbbb'];
$pattern='/(\w)(?:(?R)|\w?)\1/';
foreach ($subjects as $sub) {
echo $sub." ".str_repeat('-',15-strlen($sub))."-> ";
if (preg_match($pattern,$sub,$m))
echo $m[0].(($m[0]==$sub)? "! a palindrome!\n": "\n");
else
echo "sorry, no match\n";
}
outputs
dont ------------> sorry, no match
o ---------------> sorry, no match
oo --------------> oo! a palindrome!
kook ------------> kook! a palindrome!
book ------------> oo
paper -----------> pap
kayak -----------> kayak! a palindrome!
okonoko ---------> okonoko! a palindrome!
aaaaa -----------> aaaaa! a palindrome!
bbbb ------------> bbb
Comparing
The regular expression ^((\w)(?:(?1)|\w?)\2)$ does the same job, but as a yes/no test instead of "contains". PS: it uses a definition where "o" is not a palindrome and the hyphenated form "able-elba" is not a palindrome, but "ableelba" is. Call this definition1. Call the definition under which "o" and "able-elba" are palindromes definition2.
Comparing with another "palindrome regexes",
^((.)(?:(?1)|.?)\2)$ the base-regex above without \w restriction, accepting "able-elba".
^((.)(?1)?\2|.)$ (@LilDevil) Uses definition2 (accepts "o" and "able-elba", and so also differs in the recognition of the "aaaaa" and "bbbb" strings).
^((.)(?1)\2|.?)$ (@Markus) Does not detect "kook" or "bbbb".
^((.)(?1)*\2|.?)$ (@Csaba) Uses definition2.
NOTE: to compare you can add more words at $subjects and a line for each compared regex,
if (preg_match('/^((.)(?:(?1)|.?)\2)$/',$sub)) echo " ...reg_base($sub)!\n";
if (preg_match('/^((.)(?1)?\2|.)$/',$sub)) echo " ...reg2($sub)!\n";
if (preg_match('/^((.)(?1)\2|.?)$/',$sub)) echo " ...reg3($sub)!\n";
if (preg_match('/^((.)(?1)*\2|.?)$/',$sub)) echo " ...reg4($sub)!\n";
Here's my answer to Regex Golf's 5th level (A man, a plan). It works for up to 7 characters with the browser's Regexp (I'm using Chrome 36.0.1985.143).
^(.)(.)(?:(.).?\3?)?\2\1$
Here's one for up to 9 characters
^(.)(.)(?:(.)(?:(.).?\4?)?\3?)?\2\1$
To increase the max number of characters it'd work for, you'd repeatedly replace .? with (?:(.).?\n?)?.
It's actually easier to do it with string manipulation rather than regular expressions:
bool isPalindrome(String s1)
{
String s2 = s1.reverse;
return s2 == s1;
}
I realize this doesn't really answer the interview question, but you could use it to show how you know a better way of doing a task, and you aren't the typical "person with a hammer, who sees every problem as a nail."
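The snippet above is pseudocode; a runnable equivalent is short in most languages. For instance, a minimal Python version (the function name is_palindrome is just illustrative):

```python
def is_palindrome(s: str) -> bool:
    # Slicing with a step of -1 yields the reversed string.
    return s == s[::-1]

print(is_palindrome("racecar"))  # True
print(is_palindrome("regex"))    # False
```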
Regarding the PCRE expression (from MizardX):
/^((.)(?1)\2|.?)$/
Have you tested it? On my PHP 5.3 under Win XP Pro it fails on: aaaba
Actually, I modified the expression slightly, to read:
/^((.)(?1)*\2|.?)$/
I think what is happening is that while the outer pair of characters are anchored, the remaining inner ones are not. This is not quite the whole answer because while it incorrectly passes on "aaaba" and "aabaacaa", it does fail correctly on "aabaaca".
I wonder whether there is a fixup for this, and also,
Does the Perl example (by JF Sebastian / Zsolt) pass my tests correctly?
Csaba Gabor from Vienna
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/
it is valid for Oniguruma engine (which is used in Ruby)
took from Pragmatic Bookshelf
In Perl (see also Zsolt Botykai's answer):
$re = qr/
. # single letter is a palindrome
|
(.) # first letter
(??{ $re })?? # apply recursively (not interpolated yet)
\1 # last letter
/x;
while(<>) {
chomp;
say if /^$re$/; # print palindromes
}
As pointed out by ZCHudson, determining whether something is a palindrome cannot be done with a usual regexp, as the set of palindromes is not a regular language.
I totally disagree with Airsource Ltd when he says that "it's not possible" is not the kind of answer the interviewer is looking for. When I interview, I ask this kind of question when I face a good candidate, to check whether he can find the right argument when we propose that he do something the wrong way. I do not want to hire someone who will try to do something the wrong way if he knows a better one.
something you can do with perl: http://www.perlmonks.org/?node_id=577368
I would explain to the interviewer that the language consisting of palindromes is not a regular language but instead context-free.
The regular expression that would match all palindromes would be infinite. Instead I would suggest he restrict himself to either a maximum size of palindromes to accept; or if all palindromes are needed use at minimum some type of NPDA (nondeterministic pushdown automaton), or just use the simple string reversal/equals technique.
The best you can do with regexes, before you run out of capture groups:
/(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1/
This will match all palindromes up to 19 characters in length.
Programmatically solving for all lengths is trivial:
str == str.reverse ? true : false
I don't have the rep to comment inline yet, but the regex provided by MizardX, and modified by Csaba, can be modified further to make it work in PCRE. The only failure I have found is the single-char string, but I can test for that separately.
/^((.)(?1)?\2|.)$/
If you can make it fail on any other strings, please comment.
#!/usr/bin/perl
use strict;
use warnings;
print "Enter your string: ";
chomp(my $str = <STDIN>);
# One capture group per character in the first half of the string.
my $half = int(length($str) / 2);
my $r = "(.)" x $half;
# Optional middle character, present when the length is odd.
$r .= ".?";
# Backreferences in reverse order mirror the first half.
$r .= "\\g{$_}" for reverse 1 .. $half;
# Match against the generated pattern, anchored at both ends.
if ( $str =~ /\A$r\z/s ) {
    print "$str is a palindrome\n";
}
else {
    print "$str is not a palindrome\n";
}
From automata theory, it's impossible to match a palindrome of arbitrary length (because that requires an unbounded amount of memory). But IT IS POSSIBLE to match palindromes of a fixed length.
That is, it's possible to write a regex that matches all palindromes of length <= 5 or <= 6 etc., but not one for lengths >= 5 etc., where the upper bound is open-ended.
In Ruby you can use \b(?'word'(?'letter'[a-z])\g'word'\k'letter+0'|[a-z])\b to match palindrome words such as a, dad, radar, racecar, and redivider. ps : this regex only matches palindrome words that are an odd number of letters long.
Let's see how this regex matches radar. The word boundary \b matches at the start of the string. The regex engine enters the capturing group "word". [a-z] matches r which is then stored in the stack for the capturing group "letter" at recursion level zero. Now the regex engine enters the first recursion of the group "word". (?'letter'[a-z]) matches and captures a at recursion level one. The regex enters the second recursion of the group "word". (?'letter'[a-z]) captures d at recursion level two. During the next two recursions, the group captures a and r at levels three and four. The fifth recursion fails because there are no characters left in the string for [a-z] to match. The regex engine must backtrack.
The regex engine must now try the second alternative inside the group "word". The second [a-z] in the regex matches the final r in the string. The engine now exits from a successful recursion, going one level back up to the third recursion.
After matching \g'word' the engine reaches \k'letter+0'. The backreference fails because the regex engine has already reached the end of the subject string. So it backtracks once more. The second alternative now matches the a. The regex engine exits from the third recursion.
The regex engine has again matched \g'word' and needs to attempt the backreference again. The backreference specifies +0 or the present level of recursion, which is 2. At this level, the capturing group matched d. The backreference fails because the next character in the string is r. Backtracking again, the second alternative matches d.
Now, \k'letter+0' matches the second a in the string. That's because the regex engine has arrived back at the first recursion during which the capturing group matched the first a. The regex engine exits the first recursion.
The regex engine is now back outside all recursion. At this level, the capturing group stored r. The backreference can now match the final r in the string. Since the engine is not inside any recursion any more, it proceeds with the remainder of the regex after the group. \b matches at the end of the string. The end of the regex is reached and radar is returned as the overall match.
here is PL/SQL code which tells whether given string is palindrome or not using regular expressions:
create or replace procedure palin_test(palin in varchar2) is
tmp varchar2(100);
i number := 0;
BEGIN
tmp := palin;
for i in 1 .. length(palin)/2 loop
if length(tmp) > 1 then
if regexp_like(tmp,'^(^.).*(\1)$') = true then
tmp := substr(palin,i+1,length(tmp)-2);
else
dbms_output.put_line('not a palindrome');
exit;
end if;
end if;
if i >= length(palin)/2 then
dbms_output.put_line('Yes ! it is a palindrome');
end if;
end loop;
end palin_test;
my $pal='malayalam';
while($pal=~/^(.)(.*)\1$/s){ #strip a matching first and last character (anchored, so "abab" is not accepted)
$pal=$2;
}
if ($pal=~/^.?$/i){ #matches single letter or no letter
print"palindrome\n";
}
else{
print"not palindrome\n";
}
This regex will detect palindromes up to 22 characters ignoring spaces, tabs, commas, and quotes.
\b(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*(?:(\w)[ \t,'"]*\11?[ \t,'"]*\10|\10?)[ \t,'"]*\9|\9?)[ \t,'"]*\8|\8?)[ \t,'"]*\7|\7?)[ \t,'"]*\6|\6?)[ \t,'"]*\5|\5?)[ \t,'"]*\4|\4?)[ \t,'"]*\3|\3?)[ \t,'"]*\2|\2?))?[ \t,'"]*\1\b
Play with it here: https://regexr.com/4tmui
I wrote an explanation of how I got that here: https://medium.com/analytics-vidhya/coding-the-impossible-palindrome-detector-with-a-regular-expressions-cd76bc23b89b
A slight refinement of Airsource Ltd's method, in pseudocode:
WHILE string.length > 1
    IF /^(.)(.*)\1$/ matches string
        string = \2
    ELSE
        REJECT
ACCEPT
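A runnable take on this loop, sketched in Python. It assumes the match must cover the whole string (fullmatch here), since an unanchored match would wrongly accept a string like "abab"; the names OUTER_PAIR and is_palindrome_by_peeling are made up for illustration.

```python
import re

# First character, anything in between, then the first character again.
OUTER_PAIR = re.compile(r"(.)(.*)\1", re.DOTALL)

def is_palindrome_by_peeling(s):
    while len(s) > 1:
        m = OUTER_PAIR.fullmatch(s)
        if m is None:
            return False   # outer characters differ: REJECT
        s = m.group(2)     # keep only the inner part and repeat
    return True            # zero or one character left: ACCEPT

for word in ["malayalam", "abba", "abab", "a", ""]:
    print(repr(word), is_palindrome_by_peeling(word))
```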