Pattern matches a hyphen too


I have a piece of Perl code (pattern matching) like this,
$var = "<AT>this is an at command</AT>";
if ($var =~ /<AT>([\s\w]*)<\/AT>/i)
{
print "Matched in AT command\n";
print "$var\n\n";
}
It works fine if the content between the tags contains no hyphen. It does not work if a hyphen appears in the string between the tags, like this: <AT>this is an at-command</AT>.
Can anyone fix this regex so that it also matches when a hyphen is present?
Please help.
Senthil

On character class
Your pattern contains this subpattern:
[\s\w]*
The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.
\s is the shorthand for the whitespace character class; \w for the word character class. Neither contains the hyphen.
The * is the zero-or-more repetition specifier.
Now you should understand why this pattern does not match a hyphen: it matches zero or more characters, each of which is either a whitespace or a word character. If you want to match a hyphen, you can include it in the character class.
[\s\w-]*
If you also want to include the period, question mark, and exclamation mark, for example, then you can simply add them in as well:
[\s\w.!?-]*
Special note on hyphen
BE CAUTIOUS when including the hyphen in a character class. Inside a character class it is a regex metacharacter that defines a character range. For example,
[a-z]
matches any one character in the range between 'a' and 'z', inclusive. By contrast,
[az-]
matches one of exactly 3 characters: 'a', 'z', and '-'. When you put - as the last element in a character class, it becomes a literal hyphen instead of a range definition. You can also put it as the first element, or escape it (by preceding it with a backslash, which is the way you escape all other regex metacharacters too).
That is, the following 3 character classes are identical:
[az-] [-az] [a\-z]
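The equivalence is easy to check by running the three classes against the same inputs. The following is only a sketch: the helper name count_accepting and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Three spellings of the same class: the characters 'a', 'z' and '-'.
my @classes = (qr/^[az-]+$/, qr/^[-az]+$/, qr/^[a\-z]+$/);

# Hypothetical helper: how many of the three classes accept $text?
sub count_accepting {
    my ($text) = @_;
    return scalar grep { $text =~ $_ } @classes;
}

for my $text ('a-z', 'za-', 'abc') {
    printf "%-4s accepted by %d of 3 classes\n", $text, count_accepting($text);
}

# The class from the question, extended with a trailing hyphen.
my $var = "<AT>this is an at-command</AT>";
print "Matched: $1\n" if $var =~ /<AT>([\s\w-]*)<\/AT>/i;
```

If the classes really are equivalent, every input is accepted by either all three or none of them.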
Related questions
Regex: why doesn't [01-12] range work as expected?

You can just add a hyphen in the char class as:
if ($var =~ /<AT>([\s\w-]*)<\/AT>/i)
Also since your regex has a / in it you can use a different delimiter, this way you can avoid escaping /:
if ($var =~ m{<AT>([\s\w-]*)</AT>}i)

Use \S instead of \w.
if ($var =~ /<AT>([\s\S]*)<\/AT>/i) {

If you want to have everything between <AT> and </AT> you can use
if ($var =~ /<AT>((?:(?!<\/AT>).)*)<\/AT>/i)
And it's ungreedy: the lookahead stops the match at the first closing tag.

You need to add more characters to your class, like [\s\w-]* (as codaddict told you).
Moreover, you may want to use a lookahead to match the end of your command ("I want to match that only if it is followed by the ending statement"), like:
if ($var =~ /<AT>([^<]*)(?=<\/AT>)/i)
[^<] stands for "any character (including the hyphen) except <".
You could even add a lookbehind:
if ($var =~ /(?<=<AT>)([^<]*)(?=<\/AT>)/i)
For more complex things (since you seem to want a little parser), you should look at the theory of grammars and at lex/yacc.
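As a rough illustration of the [^<]* approach, the sketch below pulls the text out of every <AT>...</AT> pair in a string. The helper name extract_at and the sample input are assumptions made for this example, and the sketch is not meant to replace a real parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of every <AT>...</AT> pair in $input.
sub extract_at {
    my ($input) = @_;
    # [^<]* cannot run past the next '<', so each match stops at its own
    # closing tag. With /g in list context, all captures are returned.
    return $input =~ m{<AT>([^<]*)</AT>}gi;
}

my @commands = extract_at("<AT>at-command one</AT> junk <at>second!</at>");
print "$_\n" for @commands;
```

Using m{...} as the delimiter avoids escaping the / in the closing tag.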

Related

Perl regex to replace a underscore or forward slash with a dash

While there are several regex examples here showing the many variations, I simply want to use a regex in Perl to search two different strings, one starting with an underscore (_) and the other with a forward slash (/), and replace that character with a hyphen (-), keeping the rest of the string. I am escaping the characters with backslashes, but the output is incorrect.
Input     Output
_APPLE    -APPLE
/APPLE    -APPLE
Here is my code:
$string1 =~ s/\_\/APPLE/-APPLE
$string2 =~ s/\/\/APPLE/-APPLE

The code has an extra (escaped) / and would match strings with _/ (and // in the second case). That is not in your data, which has either _ or /, not both.
Also, there is no need to escape the _, nor the / if it is not the delimiter.
To match any one of a few characters, the cleanest and most efficient tool is the character class
$string =~ s{[_/](\w+)}{-$1};
The alternation also works here
$string =~ s{(?:_|/)(\w+)}{-$1};
but it is more suitable when possibilities to match have more characters (word|another).
There are quite a few assumptions here, given how little is specified in the question. For one, \w also matches digits and _ along with letters. If you clarify the requirements I'll edit as needed.
I assume that the missing closing delimiter, needed for the code to compile, is a typo in posting.
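A small runnable sketch of the character-class version, applied to the two inputs from the question. The helper name dash_prefix is made up for the example, and it works on a copy so the original string is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the first '_' or '/' that is followed by
# word characters with '-', keeping those word characters.
sub dash_prefix {
    my ($string) = @_;
    (my $copy = $string) =~ s{[_/](\w+)}{-$1};
    return $copy;
}

print dash_prefix($_), "\n" for '_APPLE', '/APPLE';
```

With s{...}{...} as delimiters, neither the _ nor the / needs a backslash.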

perl regex to accept exactly 1 not alphanumeric (\W) character

The title explains the question itself.
More specifically, I need to write a regex that accepts a "question", something like: "how are you today?". So the last character must be a "?".
I tried something like this:
m/[^a-zA-Z0-9\W{1}]/
but it accepts any input with one or more \W characters.

The regex you gave in your question does not do what you think it does.
m/[^a-zA-Z0-9\W{1}]/
This will match any single character that is not a-z, A-Z, 0-9, a non-word character (\W), {, 1, or }. Inside a character class, {1} is not a quantifier; its characters are taken literally. The ^ inside the square brackets negates the content of the character class. It's not the beginning of the line if it's in there!
If you need to validate any input that has a question mark at the end, all you need is the question mark and the end-of-line metacharacter.
/\?$/
The ? is a metacharacter itself, so you need to escape it with a backslash (\).
If you want to match a whole sentence with the questionmark at the end, think of what kinds of characters could be in the sentence. It will not only be \w probably.
Play around with your input and your regex on http://regex101.com/, that will make it easier because it explains what's going on.
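As a quick check of the /\?$/ idea, here is a minimal sketch. The helper name is_question and the sample sentences are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text ends with a question mark.
# Note that $ also matches just before a trailing newline.
sub is_question {
    my ($text) = @_;
    return $text =~ /\?$/ ? 1 : 0;
}

for my $text ('how are you today?', 'fine, thanks.', 'really? no') {
    printf "%-20s => %s\n", $text, is_question($text) ? 'question' : 'not a question';
}
```

This only checks the last character; it says nothing about what precedes the question mark.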

accept a "question", something like:"how are you today?"
How about:
$string =~ /^(?:[a-z0-9]+\s*)+\?$/i;

This may work:
if( $question =~ m!([\w\s]+)\?$! ) {
print "question text: $1\n";
}
The regex looks for \w and \s (spaces, tabs, ...), which you often have in a text, before the question mark at the last position.

Try this. I assume you want to match any characters preceding the ?, so this should work for you:
m/.+\?$/
The . matches any character of the string (except a newline), and + repeats it one or more times.
The \ removes the special meaning of the ? (match the preceding character 0 or 1 times), and $ then anchors the ? at the end of the string.

Substitution with \s does not work as expected

I wrote a regex to collapse runs of more than one space in a string. The code is simple:
my $string = 'A string has more than 1 space';
$string =~ s/\s+/\s/g;
But the result is something bad: 'Asstringshassmoresthans1sspace'. It replaces every single space with an 's' character.
A workaround is to use ' ' instead of \s in the replacement. So the substitution becomes:
$string =~ s/\s+/ /g;
Why doesn't the regex with \s work?

\s is only a metacharacter in a regular expression (and it matches more than just a space, for example tabs, linebreak and form feed characters), not in a replacement string. Use a simple space (as you already did) if you want to replace all whitespace by a single space:
$string =~ s/\s+/ /g;
If you only want to affect actual space characters, use
$string =~ s/ {2,}/ /g;
(no need to replace single spaces with themselves).
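The two substitutions can be compared side by side. This is a sketch under assumptions: the helper names collapse_whitespace and collapse_spaces, and the sample string with a tab in it, are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of whitespace into one space.
sub collapse_whitespace {
    my ($text) = @_;
    # Bind with =~ ; a plain = would run s/// against $_ instead and
    # assign its return value.
    (my $copy = $text) =~ s/\s+/ /g;
    return $copy;
}

# Hypothetical helper: collapse only runs of two or more literal spaces.
sub collapse_spaces {
    my ($text) = @_;
    (my $copy = $text) =~ s/ {2,}/ /g;
    return $copy;
}

my $string = "A  string\thas   more than 1    space";
print collapse_whitespace($string), "\n";
print collapse_spaces($string), "\n";
```

The second helper leaves the tab in place, because / {2,}/ only matches literal spaces.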

The answer to your question is that \s is a character class, not a literal character. Just as \w represents alphanumeric characters, it cannot be used to print an alphanumeric character (except w, which it will print, but that's beside the point).
What I would do, if I wanted to preserve the type of whitespace matched, would be:
s/\s\K\s*//g
The \K (keep) escape sequence will keep the initial whitespace character from being removed, but all subsequent whitespace will be removed. If you do not care about preserving the type of whitespace, the solution already given by Tim is the way to go, i.e.:
s/\s+/ /g
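To see what \K does here, the sketch below runs the substitution on a string with mixed whitespace. It assumes Perl 5.10 or later (where \K is available); the helper name squeeze_keep_first is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: in each run of whitespace, keep the first
# character (whatever kind it is) and delete the rest of the run.
sub squeeze_keep_first {
    my ($text) = @_;
    (my $copy = $text) =~ s/\s\K\s*//g;
    return $copy;
}

# Run 1 is space, tab, space; run 2 is two newlines.
my $squeezed = squeeze_keep_first("a \t b\n\nc");
print $squeezed eq "a b\nc" ? "first whitespace kept\n" : "unexpected result\n";
```

The first run keeps its leading space and the second keeps its leading newline, so the kind of whitespace is preserved.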

\s stands for matching any whitespace. For ASCII text it is roughly equivalent to this:
[\ \t\r\n\f]
When you replace with $string =~ s/\s+/\s/g;, you are replacing one or more whitespace characters with the letter s. Here's a link for reference: http://perldoc.perl.org/perlrequick.html

Why doesn't the regex with \s work?
Your regex with \s does work. What doesn't work is your replacement string. And, of course, as others have pointed out, it shouldn't.
People get confused about the substitution operator (s/.../.../). Often I find people think of the whole operator as "a regex". But it's not, it's an operator that takes two arguments (or operands).
The first operand (between the first and second delimiters) is interpreted as a regex. The second operand (between the second and third delimiters) is interpreted as a double-quoted string (of course, the /e option changes that slightly).
So a substitution operation looks like this:
s/REGEX/REPLACEMENT STRING/
The regex recognises special characters like ^ and + and \s. The replacement string doesn't.
If people stopped misunderstanding how the substitution operator is made up, they might stop expecting regex features to work outside of regular expressions :-)
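The split between the two operands can be seen in a small example. This is only a sketch; the helper name swap_pair and the input format are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "first,second" into "second<TAB>first".
sub swap_pair {
    my ($text) = @_;
    # Left operand: a regex, where \w and + are metacharacters.
    # Right operand: a double-quoted string, where $1 and $2 are
    # interpolated and \t is the usual string escape for a tab.
    (my $copy = $text) =~ s/(\w+),(\w+)/$2\t$1/;
    return $copy;
}

my $swapped = swap_pair('first,second');
print $swapped eq "second\tfirst" ? "swapped with a tab\n" : "unexpected result\n";
```

\t works in the replacement because it is a string escape, whereas \s is a regex-only construct with no meaning in a double-quoted string.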

What do these qr{} regular expressions mean?

What do these mean?
qr{^\Q$1\E[a-zA-Z0-9_\-]*\Q$2\E$}i
qr{^[a-zA-Z0-9_\-]*\Q$1\E$}i
If $pattern is a Perl regular expression, what is $identity in the code below?
$identity =~ $pattern;

When the RHS of =~ isn't m//, s/// or tr///, a match operator (m//) is implied.
$identity =~ $pattern;
is the same as
$identity =~ /$pattern/;
It matches the pattern or pre-compiled regex $pattern (qr//) against the value of $identity.

The binding operator =~ applies a regex to a string variable. This is documented in perldoc perlop
The \Q ... \E escape sequence is a way to quote meta characters (also documented in perlop). It allows for variable interpolation, though, which is why you can use it here with $1 and $2. However, using those variables inside a regex is somewhat iffy, because they themselves are defined during the use of a capture inside a regex.
The character class bracket [ ... ] defines a range of characters which it will match. The quantifier that follows it * means that particular bracket must match zero or more times. The dashes denote ranges, such as a-z meaning "from a through z". The escaped dash \- means a literal dash.
The ^ and $ (the dollar sign at the end) denote anchors, beginning and end of string respectively. The modifier i at the end means the match is case insensitive.
In your example, $identity is a variable that presumably contains a string (or whatever it contains will be converted to a string).

The perlre documentation is your friend here. Search it for unfamiliar regex constructs.
A detailed explanation is below, but it is so hairy that I wonder whether using a module such as Text::Balanced would be a superior approach.
The first pattern matches possibly empty delimited strings, and the delimiters are in $1 and $2, which we do not know until runtime. Say $1 is ( and $2 is ), then the first pattern matches strings of the form
()
(a)
(9)
(abcABC_012-)
and so on …
The second pattern matches terminated strings, where the terminator is in $1—also not known until runtime. Assuming the terminator is ], then the second pattern matches strings of the form
]
a]
Aa9a_9]
Using \Q...\E around a pattern removes any special regex meaning from the characters inside, as documented in perlop:
For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and #). For example, the following matches:
'\s\t' =~ /\Q\s\t/
Because $ or # trigger interpolation, you'll need to use something like /\Quser\E\#\Qhost/ to match them literally.
The patterns in your question do want to trigger interpolation but do not want any regex metacharacters to have special meaning, as with parentheses and square brackets above that are meant to match literally.
Other parts:
Square brackets delimit a character class. For example, [a-zA-Z0-9_\-] matches any single character that is upper- or lowercase A through Z (but with no accents or other extras), zero through nine, underscore, or hyphen. Note that the hyphen is escaped at the end to emphasize that it matches a literal hyphen rather than specifying part of a range.
The * quantifier means match zero or more of the preceding subpattern. In the examples from your question, the star repeats character classes.
The patterns are bracketed with ^ and $, which means an entire string must match rather than some substring to succeed.
The i at the end, after the closing curly brace, is a regex switch that makes the pattern case-insensitive. As TLP helpfully points out in the comment below, this makes the delimiters or terminators match without regard to case if they contain letters.
The expression $identity =~ $pattern tests whether the compiled regex stored in $pattern (created with $pattern = qr{...}) matches the text in $identity. As written above, it is likely being evaluated for its side effect of storing capture groups in $1, $2, etc. This is a red flag. Never use $1 and friends unconditionally but instead write
if ($identity =~ $pattern) {
print $1, "\n"; # for example
}
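To experiment with the first pattern without relying on $1 and $2 from an earlier match, the delimiters can be passed in as ordinary variables. The sketch below does that; the helper name delimited_pattern and the test strings are assumptions, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: build the first pattern from two literal delimiters.
# \Q...\E quotes any regex metacharacters the delimiters contain.
sub delimited_pattern {
    my ($open, $close) = @_;
    return qr{^\Q$open\E[a-zA-Z0-9_\-]*\Q$close\E$}i;
}

my $pattern = delimited_pattern('(', ')');
for my $identity ('()', '(abcABC_012-)', '(a b)', 'abc') {
    if ($identity =~ $pattern) {
        print "'$identity' matches\n";
    }
    else {
        print "'$identity' does not match\n";
    }
}
```

Without \Q...\E, a delimiter such as ( would be read as the start of a group rather than as a literal parenthesis.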

meaning of a regexp if ($_ =~ /-\n/)

I am a beginner at Perl scripting.
I know the hyphen (-) is used to specify a range.
But what if it appears at the beginning of the expression?
Example:
if ($_ =~ /-\n/)
//do something
How to interpret the above code?
"if the parameter is equal to a range of newline" ?
(No, that is weird understanding :-/)
Please help.

Outside of [], - simply means a literal "-" as far as I know; it only indicates a range within a [] block.
Here is a more complete answer I found
How to match hyphens with Regular Expression? (look at the second answer)
So the expression should match a - followed by a newline, that is, a line ending with -

The pattern will match hyphens "-" followed by a newline \n.
The hyphen is treated as a range operator inside character classes, as explained in perldoc perlrequick:
The special character '-' acts as a range operator within character
classes, so that the unwieldy [0123456789] and [abc...xyz] become
the svelte [0-9] and [a-z] :
/item[0-9]/; # matches 'item0' or ... or 'item9'
/[0-9a-fA-F]/; # matches a hexadecimal digit
If '-' is the first or last character in a character class, it is
treated as an ordinary character.

This means:
The condition is true if the string contains a hyphen immediately followed by a newline character, no matter where this pair of characters is located inside the string.
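A short sketch makes the behaviour concrete. The helper name has_hyphen_newline and the sample strings are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains '-' immediately followed
# by a newline, anywhere in the string.
sub has_hyphen_newline {
    my ($text) = @_;
    return $text =~ /-\n/ ? 1 : 0;
}

my %samples = (
    "exam-\n"    => 'hyphen right before the newline',
    "a-b\n"      => 'hyphen in the middle of the line',
    "x-\ny"      => 'pair in the middle of the string',
    "no newline-" => 'hyphen at the end, no newline',
);

for my $text (sort keys %samples) {
    printf "%-35s => %d\n", $samples{$text}, has_hyphen_newline($text);
}
```

A typical use would be spotting lines that end in a hyphen while reading a file, since each line read with <> normally still carries its trailing newline.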
