Checking whether a given string starts with A-Z in Perl (regex)

I had a programming interview a few days ago and was asked to write a piece of Perl code with the functionality described in the title. After a while, I came up with the following solution:
sub startWithUppercaseLetter {
    return @_[0] =~ m/^[A-Z]/;
}
The interviewer seemed unhappy with this solution. Can anybody give a better one? Thanks.

I would write
sub starts_with_capital {
shift =~ /^[A-Z]/;
}
Your own solution doesn't survive use warnings, giving
Scalar value @_[0] better written as $_[0]
and it is bad practice to use upper case letters in local identifiers.

I don't think the function as written quite fits the title: the match expression returns either an empty value or a true value, and it isn't clear which of those the problem definition wants returned. The interviewer may also have imagined having to type this long function name over and over just to check whether something starts with a capital.
So many ways to do it in Perl.
return $_[0] if $_[0] =~ /^[A-Z]/;
return;
The m is not really needed: you only care about the start of the string, not about newlines and so on, because all that matters is the first character. Your way can return an empty match and otherwise works the same. For interviews, make it readable, or provide two examples: longhand as above and then shorthand.
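For comparison, here is a minimal sketch of the same check in Python. It is only an illustration, not part of the original answers, and the function name simply mirrors the Perl one above. `re.match` only matches at the start of the string, so no `^` is needed, and wrapping the result in `bool(...)` makes the return value an explicit true/false rather than a match object or `None`.

```python
import re

def starts_with_capital(text):
    """Return True if text begins with an ASCII capital letter A-Z."""
    # re.match only tries to match at the start of the string.
    return bool(re.match(r"[A-Z]", text))

print(starts_with_capital("Hello"))  # True
print(starts_with_capital("hello"))  # False
print(starts_with_capital(""))       # False
```

Returning an explicit boolean sidesteps the "empty match or match" ambiguity raised above.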

Related

Regex that matches a list of comma separated items in any order

I have three "Clue texts" that say:
SomeClue=someText
AnotherClue=somethingElse
YetAnotherClue=moreText
I need to parse a string and see if it contains exactly these 3 texts, separated by a comma. No Clue Text contains any comma.
The problem is, they can be in any order and they must be the only clues in the string.
Matches:
SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText
SomeClue=someText,YetAnotherClue=moreText,AnotherClue=somethingElse
AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse
Non-Matches:
SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText,
SomeClue=someText,YetAnotherClue=moreText,,AnotherClue=somethingElse
,AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse,UselessText
YetAnotherClue=moreText,SomeClue=someText,AnotherClue=somethingElse,AClueThatIDontWant=wrongwrongwrong
Putting together what I found on other posts, I have:
(?=.*SomeClue=someText($|,))(?=.*AnotherClue=somethingElse($|,))(?=.*YetAnotherClue=moreText($|,))
This works as far as Clues and their order are concerned.
Unfortunately, I can't find a way to avoid adding a comma and then some stupid text at the end.
My real case has somewhat more complicated Clue Texts, because each of them is a small regex, but I am pretty sure once I know how to handle commas, the rest will be easy.
I think you'd be better off with a stronger tool than regexes (and I genuinely love regular expressions). Regexes aren't good with needing supplementary memory, which is what you have here: you need exactly these 3, but they can come in any order.
In principle, you could write a regex for each of the 6 permutations. But that would never scale. You ought to use something with parsing power.
I suggest writing a verification function in your favorite scripting language, made up of underlying string functions.
In basic Python, you could do (for instance)
ref = set(['SomeClue=someText', 'AnotherClue=somethingElse', 'YetAnotherClue=moreText'])
def ismatch(myline):
    splt = myline.split(',')
    return ref == set(splt)
You can tweak that as necessary, of course. Note that this nearly-complete solution is not really longer, and much more readable, than any regex would be.
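One tweak that may be worth making: a set comparison ignores duplicates, so a line that repeats one clue would still pass. The sketch below compares sorted lists instead. The names `REF` and `is_exact_match` are made up for this illustration, and it assumes, as the question states, that no clue text contains a comma.

```python
# Expected clues, sorted once so the comparison is order-independent.
REF = sorted([
    "SomeClue=someText",
    "AnotherClue=somethingElse",
    "YetAnotherClue=moreText",
])

def is_exact_match(line):
    """True if line holds exactly the expected clues, once each, in any order."""
    # Stray commas produce empty strings, which make the lists differ.
    return sorted(line.split(",")) == REF

print(is_exact_match("AnotherClue=somethingElse,SomeClue=someText,YetAnotherClue=moreText"))   # True
print(is_exact_match("SomeClue=someText,AnotherClue=somethingElse,YetAnotherClue=moreText,"))  # False
```

If each clue is itself a small regex, as in the real case described in the question, the same idea still applies: split on commas, then check that every piece matches exactly one of the clue patterns and that every pattern is used once.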

Finding match between optional tokens?

For the strings:
text::handle:e@ma.il::text
text::chat_identifier:chat0123456789&text
I have the current regex:
m/(handle:|chat_identifier:)(.+?)(:{2}|&)/
And I am currently using $2 in order to obtain the value I want (in the first string e@ma.il and in the second, chat0123456789).
Is there a better/faster/simpler way to solve this problem, though?
Whether it's "better" or not depends on the context, but you could take this approach: split the string on ":" and take the fourth element of the resulting list. That's arguably more readable than the regex and more robust if the third field can be something other than "handle" or "chat_identifier".
I think the speed would be very similar for either approach, and probably for almost any implementation in Perl. I'd want to show that speed was critical for this step before worrying about it...
For a regex solution, this one is slightly simpler and doesn't need to backtrack:
m/(handle|chat_identifier):([^:&]+)/
Note the slight difference: yours allows single colons within the value, mine doesn't (it stops at the first colon encountered). If that is not a problem, you can use my variant. Or as I mentioned in a comment, split at : and use the fourth element in the result.
An equivalent version that does only stop at double colons is this:
m/(handle|chat_identifier):((?:(?!::|&).)+)/
Not so beautiful, but it still avoids backtracking (the lookahead might make it slower, though... you will need to profile that, if speed matters at all).
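To make the difference between the two patterns concrete, here is a small Python sketch (Python's `re` accepts both patterns unchanged). The helper name `extract` and the second input line, whose value contains a single colon, are made up for this illustration.

```python
import re

# Stops at the first ':' or '&'.
SIMPLE = re.compile(r"(handle|chat_identifier):([^:&]+)")
# Stops only at '::' or '&', so single colons inside the value survive.
TEMPERED = re.compile(r"(handle|chat_identifier):((?:(?!::|&).)+)")

def extract(pattern, line):
    """Return the captured value (group 2), or None if nothing matches."""
    m = pattern.search(line)
    return m.group(2) if m else None

line1 = "text::chat_identifier:chat0123456789&text"
line2 = "text::handle:user:name::text"  # hypothetical value with a single colon

print(extract(SIMPLE, line1))    # chat0123456789
print(extract(SIMPLE, line2))    # user
print(extract(TEMPERED, line2))  # user:name
```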
Looks like you have a lot of good solutions here already. The split method seems like the simplest. But depending on your requirements, you could also use a more generic regex that breaks the string into its basic pieces. It will work for datatypes and property names other than those in your examples.
([^:]+)::([^:]+):([^:&]+)(?:::|&)\1
The capture groups are as follows:
Group 1: The datatype. (The keyword "text" from your examples.)
Group 2: The property name. (The keywords "handle" and "chat_identifier" from your examples.)
Group 3: The property value.
If the values you want are always in the same position and it's safe to split on : and &, then perhaps the following will work for you:
use Modern::Perl;
say +( split /[:&]+/ )[2] for <DATA>;
__DATA__
text::handle:e@ma.il::text
text::chat_identifier:chat0123456789&text
Output:
e@ma.il
chat0123456789

Lua string.match uses irregular regular expressions?

I'm curious why this doesn't work, and need to know why/how to work around it; I'm trying to detect whether some input is a question, I'm pretty sure string.match is what I need, but:
print(string.match("how much wood?", "(how|who|what|where|why|when).*\\?"))
returns nil. I'm pretty sure Lua's string.match uses regular expressions to find matches in a string, as I've used wildcards (.) before with success, but maybe I don't understand all the mechanics? Does Lua require special delimiters in its string functions? I've tested my regular expression here, so if Lua used regular regular expressions, it seems like the above code would return "how much wood?".
Can any of you tell me what I'm doing wrong, what I mean to do, or point me to a good reference where I can get comprehensive information about how Lua's string manipulation functions utilize regular expressions?
Lua doesn't use regex. Lua uses Patterns, which look similar but match different input.
.* will also consume the last ? of the input, so it fails on \\?. The question mark should be excluded. Special characters are escaped with %.
"how[^?]*%?"
As Omri Barel said, there's no alternation operator. You probably need to use multiple patterns, one for each alternative word at the beginning of the sentence. Or you could use a library that supports regex like expressions.
According to the manual, patterns don't support alternation.
So while "how.*" works, "(how|what).*" doesn't.
And kapep is right about the question mark being swallowed by the .*.
There's a related question: Lua pattern matching vs. regular expressions.
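The "one pattern per alternative" workaround suggested above can be sketched as a simple loop. The sketch is in Python purely for illustration (Python's `re` does support alternation, so it would not be needed there); in Lua the same loop would presumably call `string.match` with one pattern per word, such as `"how[^?]*%?"`. The names `QUESTION_WORDS` and `find_question` are made up for this example.

```python
import re

QUESTION_WORDS = ["how", "who", "what", "where", "why", "when"]

def find_question(text):
    """Return the question at the start of text, or None if there is none."""
    for word in QUESTION_WORDS:
        # One pattern per alternative: the word, anything except '?', then '?'.
        m = re.match(word + r"[^?]*\?", text)
        if m:
            return m.group(0)
    return None

print(find_question("how much wood?"))  # how much wood?
print(find_question("much wood"))       # None
```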
As others have already answered, this is because patterns in Lua are different from the regexes in other languages. If you have not yet managed to get a good pattern that does all the work, you can try this simple function:
local function capture_answer(text)
    local text = text:lower()
    local pattern = '([how]?[who]?[what]?[where]?[why]?[when]?[would]?.+%?)'
    for capture in string.gmatch(text, pattern) do
        return capture
    end
end
print(capture_answer("how much wood?"))
Output: how much wood?
That function will also help you if you want to find a question in a larger text string
Ex.
print(capture_answer("Who is the best football player in the world?\nWho are your best friends?\nWho is that strange guy over there?\nWhy do we need a nanny?\nWhy are they always late?\nWhy does he complain all the time?\nHow do you cook lasagna?\nHow does he know the answer?\nHow can I learn English quickly?"))
Output:
who is the best football player in the world?
who are your best friends?
who is that strange guy over there?
why do we need a nanny?
why are they always late?
why does he complain all the time?
how do you cook lasagna?
how does he know the answer?
how can i learn english quickly?

How can I extract a substring from a string using regular expressions?

Let us say that I have a string "ABCDEF34GHIJKL". How would I extract the number (in this case 34) from the string using regular expressions?
I know little about the regular expressions, and while I would love to learn all there is to know about it, time constraints have forced me to simply find out how this specific example would work.
Any information would be greatly appreciated!
Thanks
This is a very language specific question but you didn't specify a language. Based on previous questions you've asked though I'm going to assume you meant this to be a C# language question.
For this scenario just write up a regex for a number and apply it to the input.
// A verbatim string (@"...") keeps the backslash in \d from being treated as a C# escape.
var match = Regex.Match(input, @"\d+");
if (match.Success) {
    var number = match.Value;
}
It depends on the language, but you want to match with an expression like ([0-9]+). That will find (the first group of) digits in the string. If your regexp engine expects to start matching at the start of the string, you will need to use .*?([0-9]+) instead.
I agree with calmh: ([0-9]+) is the main thing you need to worry about. However, you may want to note that in a lot of languages you'll need to use back references (usually \\1 or \1) to get the value. For example:
"ABCDEF34GHIJKL".sub(/^.*?([0-9]+).*$/, "\\1")
A better solution in Ruby however would be the following and would also match multiple numbers in the string.
"ABCDEF34GHIJ1001KL".scan(/[0-9]+/) { |m|
  puts m
}
# Outputs:
# 34
# 1001
Most languages have some sort of similar methods. There are some examples of various languages here http://www.regular-expressions.info/tools.html as well as some good examples of back references being used.
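As one more illustration of such "similar methods", here is a minimal Python sketch using the standard `re` module. The function names are made up for this example: `re.search` finds the first run of digits anywhere in the string, and `re.findall` returns all of them.

```python
import re

def first_number(text):
    """Return the first run of digits in text as a string, or None."""
    m = re.search(r"[0-9]+", text)
    return m.group(0) if m else None

def all_numbers(text):
    """Return every run of digits in text, in order."""
    return re.findall(r"[0-9]+", text)

print(first_number("ABCDEF34GHIJKL"))     # 34
print(all_numbers("ABCDEF34GHIJ1001KL"))  # ['34', '1001']
```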

In "aa67bc54c9", is there any way to print "aa" 67 times, "bc" 54 times and so on, using regular expressions?

I was asked this question in an interview for an internship, and the first solution I suggested was to try and use a regular expression (I usually am a little stumped in interviews). Something like this
(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)
I thought it would match the strings and store them in the variable "str" and the numbers in the variable "n". How, I was not sure of.
So it matches strings of type "a1b2c3", but a problem here is that it also matches strings of type "a1b". Could anyone suggest a solution to deal with this problem?
Also, is there any other regular expression that could solve this problem?
Do you know why "regular expressions" are called "regular"? :-)
That would take too long to explain in full, so I'll just outline it. To match a pattern (i.e. decide whether a given string is "valid" or "invalid"), a theoretical computer scientist would use a finite state automaton. That's an abstract machine that has a finite number of states; on each tick it reads a character from the input and jumps to another state. The rule for where to jump from a particular state when a particular character is read is fixed. Some states are marked as "OK" and some as "FAIL", so that by examining the state of the machine you can check whether your text is "valid" (e.g. a valid e-mail address).
For example, this machine only accepts "nice" as its "valid" word (a pic from Wikipedia):
A set of "valid" words such a machine theoretically can distinguish from invalid is called "regular language". Not every set is a regular language: for example, finite state automata are incapable of checking whether parentheses in string are balanced.
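A machine of this kind can be sketched in a few lines of Python. The state numbers and the names `TRANSITIONS`, `ACCEPTING` and `accepts` are labels chosen for this illustration, not taken from the Wikipedia picture; the point is only that the machine's whole "memory" is the single current state.

```python
# Transition table: (current state, character read) -> next state.
TRANSITIONS = {
    (0, "n"): 1,
    (1, "i"): 2,
    (2, "c"): 3,
    (3, "e"): 4,
}
ACCEPTING = {4}  # the only "OK" state

def accepts(word):
    """Run the automaton over word; True only if it ends in an OK state."""
    state = 0
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: the word is invalid
            return False
    return state in ACCEPTING

print(accepts("nice"))   # True
print(accepts("nicer"))  # False
```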
But constructing state machines was a complex task, compared to the complexity of defining what "valid" is. So the mathematicians (mainly S. Kleene) noted that every regular language could be described with a "regular expression". They had *s and |s and were the prototypes of what we know as regexps now.
What does it have to do with the problem? The problem in subject is essentially non-regular. It can't be expressed with anything that works like a finite automaton.
The essence is that a solution needs a memory cell capable of holding an arbitrary number (the repetition count in your case). Finite automata and classical regular expressions cannot do this.
However, modern regexps are more expressive and are said to be able to check balanced parentheses! But this may serve as a good example that you shouldn't use regexps for tasks they don't suit. Let alone that it contains code snippets; this makes the expression far from being "regular".
Answering the initial question: you can't solve your problem using anything "regular" only. However, regexps could aid you in solving this problem, as in tster's answer.
Perhaps, I should look closer to tster's answer (do a "+1" there, please!) and show why it's not the "regular expression" solution. One may think that it is, it just contains print statement (not essential) and a loop--and loop concept is compatible with finite state automaton expressive power. But there is one more elusive thing:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1
x # <--- this one
$2;
}
The task of reading a string and a number and printing that string the given number of times, where the number is an arbitrary integer, cannot be done on a finite state machine without additional memory. You use a memory cell to keep that number, decrease it, and check whether it is still greater than zero. But this number may be arbitrarily big, which contradicts the finite memory available to a finite state machine.
However, there's nothing wrong with classical pattern /([abc]*){5}/ that matches something "regular" repeated fixed number of times. We essentially have states that correspond to "matched pattern once", "matched pattern twice" ... "matched pattern 5 times". There's finite number of them, and that's the gist of the difference.
how about:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1 x $2;
}
Answering your question directly:
No, regular expressions match text and don't print anything, so there is no way to do it solely using regular expressions.
The regular expression you gave will match one string/number pair; you can then print that repeatedly using an appropriate mechanism. The Perl solution from @tster is about as compact as it gets. (It doesn't use the names that you applied in your regex; I'm pretty sure that doesn't matter.)
The remaining details depend on your implementation language.
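As a sketch of what those remaining details might look like in Python, the named-group regex from the question can be used with `re.finditer`, with the repetition done by ordinary string multiplication outside the regex. The function name `expand` and the up-front `re.fullmatch` check are choices made for this illustration; the check rejects inputs such as "a1b" that are not made up entirely of string/number pairs.

```python
import re

# The regex from the question: letters captured as "str", digits as "n".
PAIR = re.compile(r"(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)")

def expand(text):
    """Repeat each letter group by the number that follows it."""
    # Validate first: the whole input must consist of letter/number pairs.
    if not re.fullmatch(r"(?:[a-zA-Z]+[0-9]+)+", text):
        raise ValueError("input is not a sequence of string/number pairs")
    # The regex only finds the pairs; the repetition is plain Python.
    return "".join(m.group("str") * int(m.group("n")) for m in PAIR.finditer(text))

print(expand("a2bc3"))  # aabcbcbc
```

Calling `expand("a1b")` raises `ValueError`, which addresses the partial-match concern raised in the question.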
Nope, this is your basic 'trick question' - no matter how you answer it that answer is wrong unless you have exactly the answer the interviewer was trained to parrot. See the workup of the issue given by Pavel Shved - note that all invocations have 'not' as a common condition, the tool just keeps sliding: Even when it changes state there is no counter in that state
I have a rather advanced book by Kenneth C Louden who is a college prof on the matter, in which it is stated that the issue at hand is codified as "Regex's can't count." The obvious answer to the question seems to me at the moment to be using the lookahead feature of Regex's ...
It probably depends on what build of what brand of regex the interviewer is using, which probably depends on the flight dynamics of golf balls.
Nice answers so far. Regular expressions alone are generally thought of as a way to match patterns, not generate output in the manner you mentioned.
Having said that, there is a way to use regex as part of the solution. @Jonathan Leffler made a good point in his comment to tster's reply: "... maybe you need a better regex library in your language."
Depending on your language of choice and the library available, it is possible to pull this off. Using C# and .NET, for example, this could be achieved via the Regex.Replace method. However, the solution is not 100% regex since it still relies on other classes and methods (StringBuilder, String.Join, and Enumerable.Repeat) as shown below:
string input = "aa67bc54c9";
string pattern = @"([a-z]+)(\d+)";
string result = Regex.Replace(input, pattern, m =>
    // can be achieved using StringBuilder or String.Join/Enumerable.Repeat
    // don't use both
    //new StringBuilder().Insert(0, m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToString()
    String.Join("", Enumerable.Repeat(m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToArray())
        + Environment.NewLine // comment out to prevent line breaks
);
Console.WriteLine(result);
A clearer solution would be to identify the matches, loop over them, and append them using a StringBuilder rather than relying on Regex.Replace. Other languages may have compact idioms for string multiplication that don't rely on other library classes.
To answer the interview question, I would reply with, "it's possible, however the solution would not be a stand-alone 100% regex approach and would rely on other language features and/or libraries to handle the generation aspect of the question since the regex alone is helpful in matching patterns, not generating them."
And based on the other responses here you could beef up that answer further if needed.