Perl regular expression in Perl/Curl script - regex

I'm not sure how this works or what it means:
my ($value) = ($out =~ /currentvalue[^>]*>([^<]+)/);
It's part of a curl/Perl script that fetches www.example.com and finds <span id="currentvalue"> GETS THIS VALUE </span> in the page's HTML.
What exactly does the [^>]*>([^<]+) part of the regex do? Does it specify that it's looking for span id=".."?
Where can I learn more about constructs like [^>]*>([^<]+)?

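/.../ aka m/.../ is the match operator. It checks whether its operand (on the LHS of =~) matches the regular expression within the literal. In list context, as in my ($value) = (...), a successful match returns the captured groups, so $value receives the text captured by the parentheses. Operators are documented in perlop. (Go down to "m/PATTERN/".) Regular expressions are documented in perlre.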
As for the regular expression used here,
$ perl -MYAPE::Regex::Explain \
-e'print YAPE::Regex::Explain->new($ARGV[0])->explain' \
'currentvalue[^>]*>([^<]+)'
The regular expression:
(?-imsx:currentvalue[^>]*>([^<]+))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  currentvalue           'currentvalue'
----------------------------------------------------------------------
  [^>]*                  any character except: '>' (0 or more times
                         (matching the most amount possible))
----------------------------------------------------------------------
  >                      '>'
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    [^<]+                any character except: '<' (1 or more
                         times (matching the most amount
                         possible))
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

This is a plain vanilla Perl regexp. See this tutorial
/               # Start of regexp
currentvalue    # Matches the string 'currentvalue'
[^>]*           # Matches 0 or more characters that are not '>'
>               # Matches '>'
(               # Starts a capture group; its match is stored in Perl's built-in variable $1
[^<]+           # Matches 1 or more characters that are not '<'
)               # End of group $1
/               # End of regexp
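To see the capture in action, here is a minimal sketch. The HTML fragment in $out is made up for illustration (the real script would fill $out from curl), and the trimming step is an optional extra that is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment standing in for the page fetched by curl.
my $out = '<p>Rate: <span id="currentvalue"> 42.17 </span></p>';

# List context: a successful match returns the captured groups,
# so $value receives what ([^<]+) matched.
my ($value) = ($out =~ /currentvalue[^>]*>([^<]+)/);

# The capture keeps the surrounding whitespace; trim a copy if needed.
(my $trimmed = $value) =~ s/^\s+|\s+$//g;

print "raw:     '$value'\n";
print "trimmed: '$trimmed'\n";
```

Note that the regex never mentions span or id: it only looks for the text currentvalue, skips to the next >, and captures everything up to the next <.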

Related

Regular expression to escape n # except 2 #

I'm trying to override Parsedown's markup to only allow <h2> headings.
What regex would escape all heading types except <h2>?
#Heading -> \#Heading
##Heading -> ##Heading
###Heading -> \###Heading
####Heading -> \####Heading
#####Heading -> \#####Heading
######Heading -> \######Heading
You can use this regex
^(?!##\w)(?=#)
Regex Demo
Regex Breakdown
^         # Start of string (start of each line with the multiline flag)
(?!       # Negative lookahead: the match fails if what follows matches the enclosed pattern
##\w      # Two # followed by a word character
)
(?=       # Positive lookahead
#         # Assert that the next character is a #
)
NOTE
\w denotes any character from [A-Za-z0-9_].
[..] denotes a character class: it matches any single character (not a string) listed inside it.
Use lookaheads to match headings, except those that start with a double hash:
^(?!##\w)(?=#+)
Description
^((?:#|#{3,})[^#])
Replace with: \$1
This regular expression matches, at the start of a line:
- exactly one hash, or
- three or more hashes,
followed by one character that is not a hash.
Example
Live Demo
https://regex101.com/r/kE4oK6/1
Sample text
#Heading
##Heading
###Heading
####Heading
#####Heading
######Heading
Sample Matches
\#Heading
##Heading
\###Heading
\####Heading
\#####Heading
\######Heading
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
^                        the beginning of a "line"
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
  (?:                    group, but do not capture:
----------------------------------------------------------------------
    #                    '#'
----------------------------------------------------------------------
   |                     OR
----------------------------------------------------------------------
    #{3,}                '#' (at least 3 times (matching the
                         most amount possible))
----------------------------------------------------------------------
  )                      end of grouping
----------------------------------------------------------------------
  [^#]                   any character except: '#'
----------------------------------------------------------------------
)                        end of \1
----------------------------------------------------------------------
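For anyone applying this from Perl rather than PHP, here is a rough sketch of both approaches from the answers above. It assumes the Markdown is held in a single string; the variable names are only illustrative. The lookahead regex matches a zero-width position, so the substitution simply inserts a backslash there, and the /m flag makes ^ match at the start of every line.

```perl
use strict;
use warnings;

# The six sample headings from the question, one per line.
my $text = join "\n", '#Heading', '##Heading', '###Heading',
                      '####Heading', '#####Heading', '######Heading';

# Approach 1: zero-width match at the start of every heading line that is
# not "##" followed by a word character; insert a backslash at that spot.
(my $escaped = $text) =~ s/^(?!##\w)(?=#)/\\/mg;

# Approach 2: capture one hash, or three or more, plus the next character,
# and put a backslash in front of the captured text.
(my $escaped2 = $text) =~ s/^((?:#|#{3,})[^#])/\\$1/mg;

print "$escaped\n";
print $escaped eq $escaped2 ? "both approaches agree\n" : "approaches differ\n";
```

On this sample input both substitutions leave only the ##Heading line unescaped.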

How to express this in regular expression?

I need to capture all Upper and Lower case character strings other than "FOO" and "BAR". How to do this?
I tried [^(^FOO$)(^BAR$)] but it doesn't work.
Update:
Actually I'm using this in a context, I am concatenating it with another regex
["(\w)+": _this_regex_ ]
For example ["abc":FOO] shouldn't be matched
All other types say ["abc":BAZ] should match
You want a negative look ahead:
\["(\w+)"\s*:\s*(?!FOO\b|BAR\b)(\w+)]
The (\w+) are capturing groups: they store the key/value pairs in variables (I guess that's what you want to do?)
(?!...) is a negative lookahead: it will cause the regex to fail if what's inside matches.
\b is a word boundary: here it makes the lookahead match (and so fail the regex) only if FOO is followed by a non-word character, so ["foo": FOOLISH] will be accepted by the regex.
\s is shorthand for all types of whitespace (spaces, tabs, newlines etc.)
Demo: http://regex101.com/r/fM3uZ7
What you tried, [^...], is a negated character class: it matches any single character that is not listed inside the brackets. Keep in mind that inside a character class only ], \, ^ and - are special characters (so $ means a literal $, and so on).
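A quick way to check the behaviour described above is to run the regex against a few inputs. This is only a sketch: the test strings (other than the two from the question and the FOOLISH case) and the %result hash are made up for illustration.

```perl
use strict;
use warnings;

# Regex from the answer: key in group 1, value in group 2,
# with FOO and BAR rejected when they are the whole value.
my $re = qr/\["(\w+)"\s*:\s*(?!FOO\b|BAR\b)(\w+)]/;

my %result;
for my $input ('["abc":FOO]', '["abc": BAR]', '["abc":BAZ]', '["foo": FOOLISH]') {
    if ($input =~ $re) {
        $result{$input} = "$1 => $2";
    }
    else {
        $result{$input} = 'no match';
    }
    print "$input  $result{$input}\n";
}
```

The \b inside the lookahead is what lets FOOLISH through while FOO on its own is rejected.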
Have a try with:
(?i)^(?!.*foo)(?!.*bar)
In action within perl script:
use strict;
use warnings;
use feature 'say';    # say is not enabled by default

my $re = qr~(?i)^(?!.*foo)(?!.*bar)~;
while (<DATA>) {
    chomp;
    say(/$re/ ? "OK : $_" : "KO : $_");
}
__DATA__
["abc":FOO]
["abc":BAZ]
output:
KO : ["abc":FOO]
OK : ["abc":BAZ]
Explanation:
The regular expression:
(?i)^(?!.*foo)(?!.*bar)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?i)                     set flags for this block (case-
                         insensitive) (with ^ and $ matching
                         normally) (with . not matching \n)
                         (matching whitespace and # normally)
----------------------------------------------------------------------
^                        the beginning of the string
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
  .*                     any character except \n (0 or more times
                         (matching the most amount possible))
----------------------------------------------------------------------
  foo                    'foo'
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
  .*                     any character except \n (0 or more times
                         (matching the most amount possible))
----------------------------------------------------------------------
  bar                    'bar'
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------

Perl regular expression explanation

I have regular expression like this:
s/<(?:[^>'"]|(['"]).?\1)*>//gs
and I don't know what exactly does it mean.
The regex looks like it is intended to remove HTML tags from the input.
It matches text beginning with < and ending with >, containing non->/non-quote characters or quoted strings (which may contain >). But it appears to have an error:
The .? says that quotes may contain 0 or 1 character; it was probably intended to be .*? (0 or more characters). And to prevent backtracking from doing things like making the . match a quote in some odd cases, the (?: ... ) grouping should be made atomic, i.e. (?> ... ) instead of (?: ... ).
This tool can explain the details: http://rick.measham.id.au/paste/explain.pl?regex=%3C%28%3F%3A[^%3E%27%22]|%28[%27%22]%29.%3F\1%29*%3E
NODE                     EXPLANATION
--------------------------------------------------------------------------------
<                        '<'
--------------------------------------------------------------------------------
(?:                      group, but do not capture (0 or more times
                         (matching the most amount possible)):
--------------------------------------------------------------------------------
  [^>'"]                 any character except: '>', ''', '"'
--------------------------------------------------------------------------------
 |                       OR
--------------------------------------------------------------------------------
  (                      group and capture to \1:
--------------------------------------------------------------------------------
    ['"]                 any character of: ''', '"'
--------------------------------------------------------------------------------
  )                      end of \1
--------------------------------------------------------------------------------
  .?                     any character except \n (optional
                         (matching the most amount possible))
--------------------------------------------------------------------------------
  \1                     what was matched by capture \1
--------------------------------------------------------------------------------
)*                       end of grouping
--------------------------------------------------------------------------------
>                        '>'
So it tries to remove HTML tags as ysth also mentions.
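The suspected .? error is easy to demonstrate. The sketch below runs the regex as posted and the variant suggested above (.*? plus an atomic group) on a made-up HTML snippet whose attribute value contains a >; the snippet and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input: the href value contains '>' inside quotes.
my $html = q{<a href="x>y" title='t'>link</a> text};

# Regex as posted: a quoted string may hold at most one character (.?),
# so the opening tag cannot be matched and is left in place.
(my $original = $html) =~ s/<(?:[^>'"]|(['"]).?\1)*>//gs;

# Suggested fix: .*? allows quoted strings of any length, and (?> ... )
# makes each alternative atomic.
(my $fixed = $html) =~ s/<(?>[^>'"]|(['"]).*?\1)*>//gs;

print "original: $original\n";
print "fixed:    $fixed\n";
```

Even the fixed version is only a rough tag stripper; a real HTML parser is the safer choice for anything beyond simple input.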

I can't find proper regexp

I have the following file(like this scheme, but much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Every line contains 2 words (e.g. LSE ZTX), with optional spaces and/or tabs at the beginning and at the end, and always some in between.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to say that there could be either spaces or tabs (or both mixed, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that parentheses are not required when using the /g modifier. Use an array instead if you want to capture all existing matches.
Since there are always two words, you don't need to match the entire line, so the simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
See it here on Regexr, you find the first code of a row in group 1 and the second in group 2. \s is a whitespace character, it includes e.g. spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
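To illustrate the \p{L} remark, here is a small sketch with a made-up line of Cyrillic codes. The letters are written as \x{...} escapes so the example does not depend on the source file's encoding; the data itself is purely hypothetical.

```perl
use strict;
use warnings;

# Hypothetical line with two three-letter Cyrillic codes.
my $line = "  \x{41B}\x{421}\x{415} \t \x{426}\x{422}\x{425} ";

# [A-Z] cannot match these letters, but \p{L} matches any Unicode letter.
# In list context a failed match returns an empty list.
my @ascii   = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
my @unicode = $line =~ /^\s*(\p{L}+)\s+(\p{L}+)/;

binmode STDOUT, ':encoding(UTF-8)';
print 'ASCII class captured ', scalar(@ascii), " groups\n";
print "Unicode class captured: @unicode\n";
```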
\s also matches tabs, so your regex could look like this:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
The first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever suits your needs.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                      the beginning of the string
----------------------------------------------------------------------
  \s*                    whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    [A-Z]+               any character of: 'A' to 'Z' (1 or more
                         times (matching the most amount
                         possible))
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
  \s+                    whitespace (\n, \r, \t, \f, and " ") (1 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
  (                      group and capture to \2:
----------------------------------------------------------------------
    [A-Z]+               any character of: 'A' to 'Z' (1 or more
                         times (matching the most amount
                         possible))
----------------------------------------------------------------------
  )                      end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
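Putting the regex explained above to work on the sample data could look like the sketch below. The @lines array stands in for reading the file line by line, and the mix of leading/trailing spaces and tabs is made up to exercise the optional whitespace.

```perl
use strict;
use warnings;

# Stand-in for the lines of the file, with mixed spaces and tabs.
my @lines = ("  LSE \t ZTX", "SWX ZURN  ", "\tLSE\tZYT", " NYSE   CGI\n");

my @pairs;
for my $line (@lines) {
    # In list context the match returns ($1, $2); the list assignment
    # is false when the line does not match at all.
    if (my ($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/) {
        push @pairs, [$code1, $code2];
    }
}

print "$_->[0] $_->[1]\n" for @pairs;
```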
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
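In Perl, the "Multiline" option corresponds to the /m flag, and named groups are read from the %+ hash (this needs Perl 5.10 or later). A minimal sketch, assuming the whole file has been slurped into one string; the sample text and the @rows array are illustrative:

```perl
use strict;
use warnings;

# Hypothetical file content held in a single string.
my $text = "  LSE ZTX\nSWX\tZURN  \n LSE  ZYT\n";

my @rows;
# /m lets ^ and $ match at every line; /g walks through all matches.
while ($text =~ /^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$/mg) {
    # %+ holds the named captures of the most recent successful match.
    push @rows, [ $+{word1}, $+{word2} ];
}

print join(', ', @$_), "\n" for @rows;
```

One caveat: \s also matches newlines, so on a line with only one word this pattern could pair it with the first word of the next line.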
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$
What this does
^             // Matches the beginning of a string
\s*           // Matches a whitespace character (space, tab) zero or more times
([A-Z]{3,4})  // Matches any letter A-Z either 3 or 4 times and captures to $1
\s+           // Then matches at least one whitespace character
([A-Z]{3,4})  // Matches any letter A-Z either 3 or 4 times and captures to $2
$             // Matches the end of a string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
    my ( $word1, $word2 ) = split;
    print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
Assuming the leading whitespace is what identifies the lines with the codes you want, split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with whitespace, followed by two words that are each followed by whitespace, and nothing else.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some whitespace) x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
This will match all your codes
/[A-Z]+/

cryptic perl expression

I find the following statement in a perl (actually PDL) program:
/\/([\w]+)$/i;
Can someone decode this for me, an apprentice in perl programming?
Sure, I'll explain it from the inside out:
\w - matches a single character that can be used in a word (alphanumeric, plus '_')
[...] - matches a single character from within the brackets
[\w] - matches a single character that can be used in a word (kinda redundant here)
+ - matches the previous element, repeating as many times as possible, but it must appear at least once.
[\w]+ - matches a group of word characters, many times over. This will find a word.
(...) - grouping. remember this set of characters for later.
([\w]+) - match a word, and remember it for later
$ - end-of-line. match something at the end of a line
([\w]+)$ - match the last word on a line, and remember it for later
\/ - a single slash character '/'. it must be escaped by backslash, because slash is special.
\/([\w]+)$ - match the last word on a line, after a slash '/', and remember the word for later. This is probably grabbing the directory/file name from a path.
/.../ - match syntax
/.../i - i means case-insensitive.
All together now:
/\/([\w]+)$/i; - match the last word on a line and remember it for later; the word must come after a slash. Basically, grab the filename from an absolute path. The case insensitive part is irrelevant, \w will already match both cases.
More details about Perl regex here: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
And as JRFerguson pointed out, YAPE::Regex::Explain is useful for tokenizing regex, and explaining the pieces.
You will find the YAPE::Regex::Explain module worth installing.
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this script is named 'rexplain' do:
$ ./rexplain '/\/([\w]+)$/i'
...to obtain:
The regular expression:
(?-imsx:/\/([\w]+)$/i)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  /                      '/'
----------------------------------------------------------------------
  \/                     '/'
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    [\w]+                any character of: word characters (a-z,
                         A-Z, 0-9, _) (1 or more times (matching
                         the most amount possible))
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
  $                      before an optional \n, and the end of the
                         string
----------------------------------------------------------------------
  /i                     '/i'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
UPDATE:
See also: https://stackoverflow.com/a/12359682/1015385 . As noted there and in the module's documentation:
There is no support for regular expression syntax added after Perl version 5.6, particularly any
constructs added in 5.10.
/\/([\w]+)$/i;
It is a regex, and if it is a complete statement, it is applied to the $_ variable, like so:
$_ =~ /\/([\w]+)$/i;
It looks for a slash \/, followed by an alphanumeric string \w+, followed by end of line $. It also captures () the alphanumeric string, which ends up in the variable $1. The /i on the end makes it case-insensitive, which has no effect in this case.
While it doesn't help "explain" a regex, once you have a test case, Damian's new Regexp::Debugger is a cool utility to watch what actually occurs during the matching. Install it and then do rxrx at the command line to start the debugger, then type in /\/([\w]+)$/ and '/r' (for example), and finally m to start the matching. You can then step through the debugger by hitting enter repeatedly. Really cool!
This matches $_ against a slash followed by one or more word characters at the end of the string (case-insensitively) and stores those characters in $1
$_ value   | $1 value
------------------------------
"/abcdes"  | "abcdes"
"foo/bar2" | "bar2"
"foobar"   | undef    # no slash so doesn't match
The Online Regex Analyzer deserves a mention. Here's a link to explain what your regex means, and pasted here for the record.
Sequence: match all of the followings in order
   / (slash)
                                                   --+
   Repeat                                            | (in GroupNumber:1)
      AnyCharIn[ WordCharacter] one or more times    |
                                                   --+
   EndOfLine
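To try the statement out, the sketch below runs it against a few values of $_. The inputs beyond those in the table above (the one with a trailing newline and the one with a file extension) and the %captured hash are additions for illustration.

```perl
use strict;
use warnings;

my %captured;
for ('/abcdes', 'foo/bar2', 'foobar', "dir/name\n", '/usr/lib/file.txt') {
    # With no =~, the match operator tests $_; $1 is only valid on success.
    $captured{$_} = /\/([\w]+)$/i ? $1 : undef;
}

for my $path (sort keys %captured) {
    (my $shown = $path) =~ s/\n/\\n/;    # make the newline visible
    printf "%-20s %s\n", $shown,
        defined $captured{$path} ? $captured{$path} : '(no match)';
}
```

The last input shows a limit of the pattern: a dot is not a word character, so a name with an extension such as file.txt does not match.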