meaning of a regexp if ($_ =~ /-\n/) - regex

I am a beginner of perl scripting.
I know hyphen (-) is used to specify the range.
But what if it is mentioned in the beginning of the expression?
Example:
if ($_ =~ /-\n/)
//do something
How to interpret the above code?
"if the parameter is equal to a range of newline" ?
(No, that is weird understanding :-/)
Please help.

Outside of [] - means "-" as far as I know, it only indicates a range within a [] block.
Here is a more complete answer I found
How to match hyphens with Regular Expression? (look at the second answer)
So the expression should match a - followed by a newline or line ending with -

The pattern will match hyphens "-" followed by a newline \n.
The hyphen is treated as a range operator inside character classes, as explained in perldoc perlrequick:
The special character '-' acts as a range operator within character
classes, so that the unwieldy [0123456789] and [abc...xyz] become
the svelte [0-9] and [a-z] :
/item[0-9]/; # matches 'item0' or ... or 'item9'
/[0-9a-fA-F]/; # matches a hexadecimal digit
If '-' is the first or last character in a character class, it is
treated as an ordinary character.

This means:
If there is a hyphen immediately followed by a newline-character, no matter where this pair of characters is located inside the string.

Related

What is the meaning of this line in perl?

$line =~ s/^<(\w+)=\"(.*?)\">//;
What is the meaning of this line in perl?
The s/.../.../ is the substitution operator. It matches its first operand, which is a regular expression and replaces it with its second operand.
By default, the substitution operator works on a string stored in $_. But your code uses the binding operator (=~) to make it work on $line instead.
The two operands to the substitution operator are the bits delimited by the / characters (there are more advanced versions of these delimiters, but we'll ignore them for now). So the first operand is ^<(\w+)=\"(.*?)\"> and the second operand is an empty string (because there is nothing between the second and third / characters).
So your code says:
Examine the variable $line
Look for a section of the string which matches ^<(\w+)=\"(.*?)\">
Replace that part of the string with an empty string
All that is left now is for us to untangle the regular expression and see what that matchs.
^ - matches the start of the string
< - matches a literal < character
(...) - means capture this bit of the match and store it in $1
\w+ - matches one or more "word characters" (where a word character is a letter, a digit or an underscore)
= - matches a literal = character
\" - matches a literal " character (the \ is unnecessary here)
(...) - means capture this bit of the match and store it in $2
.*? - matches zero or more instances of any character
\" - matches a literal " character (once again, the \ is unnecessary here)
> - matches a literal >
So, all in all, this looks like a slightly broken attempt to match XML or HTML. It matches tags of the form <foo="bar"> (which isn't valid XML or HTML) and replaces them with an empty string.
It's searching for an XML tag at the start of a string, and substituting it with nothing (i.e. removing it).
For example, in the input:
<hello="world">example
The regex will match <hello="world">, and substitute it with nothing - so the final result is just:
example
In general, this is something that you shouldn't do with regex. There are a dozen different ways you could create false negatives here, that don't get stripped from the string.
But if this is a "quick and dirty" script, where you don't need to worry about all possible edge cases, then it may be OK to use.

RegEx with Pipes and IPs not working

The RegEx:
^([0-9\.]+)\Q|\E([^\Q|\E])\Q|\E
does not match the string:
1203730263.912|12.66.18.0|
Why?
From PHP docs,
\Q and \E can be used to ignore regexp metacharacters in the pattern.
For example:
\w+\Q.$.\E$ will match one or more word characters, followed by literals .$. and anchored at the end of the string.
And your regex should be,
^([0-9\.]+)\Q|\E([^\Q|\E]*)\Q|\E
OR
^([0-9\.]+)\Q|\E([^\Q|\E]+)\Q|\E
You forget to add + after [^\Q|\E]. Without +, it matches single character.
DEMO
Explanation:
^ Starting point.
([0-9\.]+) Captures digits or dot one or more times.
\Q|\E In PCRE, \Q and \E are referred to as Begin sequence. Which treats any character literally when it's included in that block. So | symbol in that block tells the regex engine to match a literal |.
([^\Q|\E]+) Captures any character not of | one or more times.
\Q|\E Matches a literal pipe symbol.
The accepted answer seems somewhat incorrect so I wanted to address this for future readers.
If you did not already know, using \Q and \E ensures that any character between \Q ... \E will be matched literally, not interpreted as a metacharacter by the regular expression engine.
First and most important, \Q and \E is NOT usable within a bracketed character class [].
[^\Q|\E] # Incorrect
[^|] # Correct
Secondly, you do not follow that class with a quantifier. Using this, the correct syntax would be:
^([0-9.]+)\Q|\E([^|]+)\Q|\E
Although, it is much simpler to write this out as:
^([0-9.]+)\|([^|]+)\|

regular expression what's the meaning of this regular expression s#^.*/##s

what is the meaning of s#^.*/##s
because i know that in the pattern '.' denotes that it can represent random letter except the \n.
then '.* 'should represent the random quantity number of random letter .
but in the book it said that this would be delete all the unix type of path.
My question is that, does it means I could substitute random quantity number of random letter by space?
s -> subsitution
# -> pattern delimiter
^.* -> all chars 0 or more times from the begining
/ -> literal /
## -> replace by nothing (2 delimiters)
s -> single line mode ( the dot can match newline)
Substitutions conventionally use the / character as a delimiter (s/this/that/), but you can use other punctuation characters if it's more convenient. In this case, # is used because the regexp itself contains a / character; if / were used as the delimiter, any / in the pattern would have to be escaped as \/. (# is not the character I would have chosen, but it's perfectly valid.)
^ matches the beginning of the string (or line; see below)
.*/ matches any sequence of characters up to and including a / character. Since * is greedy, it will match all characters up to an including the last / character; any precedng / characters are "eaten" by the .*. (The final / is not, because if .* matched all / characters the final / would fail to match.)
The trailing s modifier treats the string as a single line, i.e., causes . to match any character including a newline. See the m and s modifiers in perldoc perlre for more information.
So this:
s#^.*/##s
replaces everything from the beginning of the string ($_ in this case, since that's the default) up to the last / character by nothing.
If there are no / characters in $_, the match fails and the substitution does nothing.
This might be used to replace all directory components of an absolute or relative path name, for example changing /home/username/dir/file.txt to file.txt.
It will delete all characters, including line breaks because of the s modifier, in a string until the last slash included.
Please excuse a little pedantry. But I keep seeing this and I think it's important to get it right.
s#^.*/##s is not a regular expression.
^.* is a regular expression.
s/// is the substitution operator.
The substitution operator takes two arguments. The first is a regular expression. The second is a replacement string.
The substitution operator (like many other quote-like operators in Perl) allows you you change the delimiter character that you use.
So s### is also a substitution operator (just using # instead of /).
s#^.*/## means "find the text that matches the regular expression ^.*/ and replace it with an empty string. And the s on the end is a option which changes the regex so that the . matches "\n" as well as all other characters.

What do these qr{} regular expressions mean?

What do these mean?
qr{^\Q$1\E[a-zA-Z0-9_\-]*\Q$2\E$}i
qr{^[a-zA-Z0-9_\-]*\Q$1\E$}i
If $pattern is a Perl regular expression, what is $identity in the code below?
$identity =~ $pattern;
When the RHS of =~ isn't m//, s/// or tr///, a match operator (m//) is implied.
$identity =~ $pattern;
is the same as
$identity =~ /$pattern/;
It matches the pattern or pre-compiled regex $pattern (qr//) against the value of $identity.
The binding operator =~ applies a regex to a string variable. This is documented in perldoc perlop
The \Q ... \E escape sequence is a way to quote meta characters (also documented in perlop). It allows for variable interpolation, though, which is why you can use it here with $1 and $2. However, using those variables inside a regex is somewhat iffy, because they themselves are defined during the use of a capture inside a regex.
The character class bracket [ ... ] defines a range of characters which it will match. The quantifier that follows it * means that particular bracket must match zero or more times. The dashes denote ranges, such as a-z meaning "from a through z". The escaped dash \- means a literal dash.
The ^ and $ (the dollar sign at the end) denotes anchors, beginning and end of string respectively. The modifier i at the end means the match is case insensitive.
In your example, $identity is a variable that presumably contains a string (or whatever it contains will be converted to a string).
The perlre documentation is your friend here. Search it for unfamiliar regex constructs.
A detailed explanation is below, but it is so hairy that I wonder whether using a module such as Text::Balanced would be a superior approach.
The first pattern matches possibly empty delimited strings, and the delimiters are in $1 and $2, which we do not know until runtime. Say $1 is ( and $2 is ), then the first pattern matches strings of the form
()
(a)
(9)
(abcABC_012-)
and so on …
The second pattern matches terminated strings, where the terminator is in $1—also not known until runtime. Assuming the terminator is ], then the second pattern matches strings of the form
]
a]
Aa9a_9]
Using \Q...\E around a pattern removes any special regex meaning from the characters inside, as documented in perlop:
For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and #). For example, the following matches:
'\s\t' =~ /\Q\s\t/
Because $ or # trigger interpolation, you'll need to use something like /\Quser\E\#\Qhost/ to match them literally.
The patterns in your question do want to trigger interpolation but do not want any regex metacharacters to have special meaning, as with parentheses and square brackets above that are meant to match literally.
Other parts:
Circumscribed brackets delimit a character class. For example, [a-zA-Z0-9_\-] matches any single character that is upper- or lowercase A through Z (but with no accents or other extras), zero through nine, underscore, or hyphen. Note that the hyphen is escaped at the end to emphasize that it matches a literal hyphen rather and does not specify part of a range.
The * quantifier means match zero or more of the preceding subpattern. In the examples from your question, the star repeats character classes.
The patterns are bracketed with ^ and $, which means an entire string must match rather than some substring to succeed.
The i at the end, after the closing curly brace, is a regex switch that makes the pattern case-insensitive. As TLP helpfully points out in the comment below, this makes the delimiters or terminators match without regard to case if they contain letters.
The expression $identity =~ $pattern tests whether the compiled regex stored in $pattern (created with $pattern = qr{...}) matches the text in $identity. As written above, it is likely being evaluated for its side effect of storing capture groups in $1, $2, etc. This is a red flag. Never use $1 and friends unconditionally but instead write
if ($identity =~ $pattern) {
print $1, "\n"; # for example
}

Pattern matches an hyphen too

I have a piece of Perl code (pattern matching) like this,
$var = "<AT>this is an at command</AT>";
if ($var =~ /<AT>([\s\w]*)<\/AT>/i)
{
print "Matched in AT command\n";
print "$var\n\n";
}
It works fine, if the content inbetween tags are without an Hyphen. It is not working if a hyphen is inserted between the string present inbetween tags like this... <AT>this is an at-command</AT>.
Can any one fix this regex to match even if hyphen is also inserted ??
help me pls
Senthil
On character class
Your pattern contains this subpattern:
[\s\w]*
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
\s is the shorthand for whitespace character class; \w for word character class. Neither contains the hyphen.
The * is the zero-or-more repetition specifier.
Now you should understand why this pattern does not match a hyphen: it matches zero-or-more of characters that is either a whitespace or a word character. If you want to match a hyphen, then you can include it into the character class.
[\s\w-]*
If you also want to include the period, question mark, and exclamation mark, for example, then you can simply add them in as well:
[\s\w.!?-]*
Special note on hyphen
BE CAUTIOUS when including the hyphen in a character class. It is used as a regex metacharacter in character class definition to define character range. For example,
[a-z]
matches one of any character the range between 'a' and 'z', inclusive. By contrast,
[az-]
matches one of exactly 3 characters, 'a', 'z', and '-'. When you put - as the last element in a character class, it becomes a literal hyphen instead of range definition. You can also put it as the first element, or escape it (by preceding with backslash, which is the way you escape all other regex metacharacters too).
That is, the following 3 character class are identical:
[az-] [-az] [a\-z]
Related questions
Regex: why doesn't [01-12] range work as expected?
You can just add a hyphen in the char class as:
if ($var =~ /<AT>([\s\w-]*)<\/AT>/i)
Also since your regex has a / in it you can use a different delimiter, this way you can avoid escaping /:
if ($var =~m{<AT>([\s\w-]*)</AT>}i)
Use \S instead of \w.
if ($var =~ /<AT>([\s\S]*)<\/AT>/i) {
If you want to have everything between and you can use
if ($var =~ /<AT>((?:(?!<AT>).)*)<\/AT>/i)
And it's ungreedy.
You need to add more characters to your class like [\s\w-]* (as codaddict told you).
Moreover, you should maybe use a lookahead to match the end of your command ("I want to match that only if it is followed by the ending statement") like :
if ($var =~ /<AT>([^<]*)(?=<\/AT>)/i)
[^<] stands for "any character (including hyphen) except "<".
You could even add a lookbehind :
if ($var =~ (?<=/<AT>)([^<]*)(?=<\/AT>)/i)
For more complexe things (since you seem to want a little parser), you should look at the theory of grammar and at lex/yacc.