Save Matched Perl Regex as Variable - regex

I have a simple Perl regex that I need to save as a variable.
If I print it:
print($html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g);
It prints what I want to save, but when trying to save it with:
$link = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g;
I get back a '1' as the value of $link. I assume this is because it found '1' match. But how do I save the content of the match instead?

Note the /g to get all matches. Those can't possibly be put into a scalar. You need an array.
my @links = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g;
If you just want the first match:
my ($link) = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/;
Note the parens (and the lack of now-useless /g). You need them to call m// in list context.
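A minimal sketch of the difference, assuming a made-up one-line $html_data (the asker's real page content is not shown). The same match is evaluated once in scalar context and once in list context:

```perl
use strict;
use warnings;

# Hypothetical sample input; the real $html_data comes from the asker's page.
my $html_data = '<iframe id="pdfDocument" src="/docs/report.pdf"></iframe>';

# Scalar context: the match only reports whether the pattern matched.
my $matched = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/;

# List context (note the parentheses around $link): the captured text is returned.
my ($link) = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/;

print "matched: $matched\n";   # matched: 1
print "link: $link\n";         # link: /docs/report.
```

The capture stops just before the literal pdf in the pattern, which is why the trailing dot is kept and the extension is not.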

The matched subexpressions of a pattern are saved in the variables $1, $2, etc. You can also get the entire matched text with $&, but on Perl versions before 5.20 using it anywhere in a program slows down every regex match, so it is best avoided there.
The distinction in behavior here, by the way, is the result of scalar vs. list context; you should get to know them, how they differ, and how they affect the behavior of various Perl expressions.

From the perlfunc docs:
print LIST
Prints a string or a list of strings.
So print supplies list context to m//, and in list context m// returns the list of captured groups. With /g it returns the captures from every match, or the matched strings themselves if the pattern has no capture groups.
By contrast, $link = m// is a scalar assignment, so the match is evaluated in scalar context.
There m// only reports whether it matched: 1 for success, or a false value (the empty string) for failure.

I just wrote code like this. It may help. It's basically like yours except mine has a few more parentheses.
my $path = `ls -l -d ~/`;
#print "\n path is $path";
($user) = ($path=~/\.*\s+(\w+)\susers/);
So yours might look something like this, if you're trying to store the whole thing. I'm not sure, but you can use mine as an example. I am storing whatever is in (\w+):
($link) = ($html_data =~ (m/<iframe id="pdfDocument" src=.(.*)pdf/g));

Related

Perl is returning hash when I am trying to find the characters after a searched-for character

I want to search for a given character in a string and return the character after it.
Based on a post here, I tried writing
my $string = 'v' . '2';
my $char = $string =~ 'v'.{0,1};
print $char;
but this returns 1 and a hash (last time I ran it, the exact output was 1HASH(0x11823a498)). Does anyone know why it returns a hash instead of the character?
Return a character after a specific pattern (a character here)
my $string = 'example';
my $pattern = qr(e);
my ($ret) = $string =~ /$pattern(.)/; #--> 'x'
This matches the first occurrence of $pattern in the $string, and captures and returns the next character, x. (The example doesn't handle the case when there may not be a character following, like for the other e; it would simply fail to match so $ret would stay undef.)
I use qr operator to form a pattern but a normal string would do just as well here.
The regex match operator returns different things in scalar and list contexts: in the scalar context it is true/false for whether it matched, while in the list context it returns matches. See perlretut
So you need that match to be in list context, and a common way to provide that is to put the variable being assigned to in parentheses.
The first problem with the example in the question is that the =~ operator binds more tightly than the . operator, so the example is effectively
my $char = ( ($string =~ 'v') . {0,1} );
So there's first the regex match, which succeeds and returns 1 (since it is in the scalar context, imposed by the . operator) and then there is a hash-reference {0,1} which is concatenated to that 1. So $char gets assigned the 1 concatenated with a stringification for a hashref, which is a string HASH(0x...) (in the parens is a hex stringification of an address).
Next, the needed . in the pattern isn't there; perhaps it got confused with the concatenation operator?
Then, the capturing parentheses are absent, though they are needed for the intended subpattern.
Finally, the match is in scalar context, as mentioned, which would only yield true/false.
Altogether, that would need to be
my ($char) = $string =~ ( q{v} . q{(.)} );
But I'd like to add: while Perl has very fluid semantics, I'd recommend not building regex patterns on the fly like that. I'd also recommend actually using delimiters in the match operator, for clarity (even though you mostly don't have to).
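The points above can be put side by side in a short sketch. The variable names $flag, $char and $mixed are illustrative only, and the pattern is written with ordinary delimiters rather than built from strings:

```perl
use strict;
use warnings;

my $string = 'v' . '2';

# Scalar context: the match only reports success.
my $flag = $string =~ /v(.)/;

# List context: the content of the capture group is returned.
my ($char) = $string =~ /v(.)/;

# Roughly what the question's code did: a match result
# concatenated with a stringified hash reference.
my $mixed = ($string =~ /v/) . {0, 1};

print "$flag $char\n";   # 1 2
print "$mixed\n";        # 1HASH(0x...), the address varies per run
```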

Perl split string based on forward slash

I am new to Perl, so this is basic question. I have a string as shown below. I am interested in taking date out of it, so thinking of splitting it using slash
my $path = "/bla/bla/bla/20160306";
my $date = (split(/\//,$path))[3];#ideally 3 is date position in array after split
print $date;
However, I don't see the expected output, but instead I see 5 getting printed.
Since the path starts with the pattern / itself, split returns a list with an empty string first (to the left of the first /); one element more. Thus the posted code miscounts by one and returns the one before last element (subdirectory) in the path, not the date.
If date is always the last thing in the string you can pick the last element
my $date = (split '/', $path)[-1];
where I've used '' for delimiters so as not to have to escape /. (This, however, may confuse, since the separator pattern is a regex and // conveys that, while '' may appear to merely quote a string.)
This can also be done with regex
my @parts = $path =~ m{([^/]+)}g;
With this there can be no initial empty string. Or, the last part can be picked out of the full list as above, with ($path =~ m{...}g)[-1], but if you indeed only need the last bit then extract it directly
my ($last_part) = $path =~ m{.*/(.*)};
Here the "greedy" .* matches everything in the string up to the last instance of the next subpattern (/ here), thus getting us to the last part of the path, which is then captured. The regex match operator returns its matches only when it is in the list context so parens on the left are needed.
Which brings us to the fact that you are parsing a path, and there are libraries dedicated to that.
For splitting a path into its components one tool is splitdir from File::Spec
use File::Spec;
my @parts = File::Spec->splitdir($path);
If the path starts with a / we'll again get an empty string for the first element (by design, see docs). That can then be removed, if present:
shift @parts if $parts[0] eq '';
Again, the last element alone can be had like in the other examples.
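As a rough side-by-side of the three approaches from this answer, here is a sketch using the path from the question. The variable names $by_split, $by_regex and $by_spec are made up for the comparison:

```perl
use strict;
use warnings;
use File::Spec;

my $path = "/bla/bla/bla/20160306";

# split: the leading / produces an empty first field, so index from the end.
my @fields   = split m{/}, $path;
my $by_split = $fields[-1];

# Regex: greedy .* runs up to the last /, and the rest is captured.
my ($by_regex) = $path =~ m{.*/(.*)};

# File::Spec: drop the empty first element, if present, then take the last one.
my @parts = File::Spec->splitdir($path);
shift @parts if @parts && $parts[0] eq '';
my $by_spec = $parts[-1];

print "$by_split $by_regex $by_spec\n";   # 20160306 20160306 20160306
```

The extra @parts check before shift only guards against an empty path; it is not part of the original answer.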
Simply bind it to the end:
(\d+)$
# look for digits at the end of the string
See a demo on regex101.com. The capturing group is only for clarification though not really needed in this case.
In Perl this would be (I am a PHP/Python guy, so bear with me when it is ugly)
my $path = "/bla/bla/bla/20160306";
$path =~ /(\d+)$/;
print $1;
See a demo on ideone.com.
Try this
Use a lookahead to do it. Splitting on a lookahead keeps the / at the start of each piece, so afterwards use a substitution to remove the leading slash.
my $path = "/a/b/c/20160306";
my $date = (split(/(?=\/)/,$path))[3];
$date=~s/^\///;
print $date;
Or else use pattern matching with a capture group:
my $path = "/a/b/c/20160306";
my @data = $path =~ m/\/(\w+)/g;
print $data[3];

return data type of Perl regex matching groups

I've got a question about the handling/ return data type of a regex matching multiple groups.
Consider this line:
($pre, $middle, $post) = $text =~ /(.*)Telefon:(.+)(Fax:.*)/;
It assigns the values of matched parts of $text to $pre, $middle and $post as a list, I suppose!
So I would like to check before the number of returned matches. Since the returned data type is a list, i assume that the following works:
if (scalar ($text =~ /(.*)Telefon:(.+)(Fax:.*)/) == 3) { do something }
The problem seems to be that
(scalar ($text =~ /(.*)Telefon:(.+)(Fax:.*)/)
returns 1, although the following works as expected (returning the value 3):
my @arr = $text =~ /(.*)Telefon:(.+)(Fax:.*)/;
scalar @arr
There seems to be some Perl magic going on. What can I do to get the expected value without assigning a value (@arr) in between?
In Perl, a function or operator can return different things in scalar context compared to list context. In fact, even subs you write yourself can do this; see the wantarray keyword.
When a regex match is evaluated in scalar context it returns 1 for a match or a false value (the empty string) for no match. That differs from what it returns in list context, which is the list of capture groups.
When you assign to an array first, the regex is evaluated in list context. Then you take the scalar value of the array, which gives the length.
In any case I suspect you are not going to get the result you want. If the regex matches you will always get a list of size 3, even if some of the capture groups matched only an empty string. A group that did not take part in the match at all (for example, one inside an unused alternative) comes back as undef, which you can check for. If the regex did not match then you get an empty list back.
We have to supply list context for the match – e.g. by assigning to a list:
() = $text =~ /.../
Yes, the empty list works. We can then use this list assignment in a scalar context, e.g.
3 == (() = $text =~ /.../)
You can think of ()= as a “count of” pseudo-operator.
The behavior of many Perl operators and builtins differs depending on context. If in doubt, read the documentation (although this specific section refers you to other parts of the docs).
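A small sketch of the ()= idiom, assuming a made-up sample $text (the asker's actual data is not shown):

```perl
use strict;
use warnings;

# Hypothetical sample text containing both markers.
my $text = 'Name: X Telefon: 123 Fax: 456';

# A list assignment evaluated in scalar context yields the number of
# elements on its right-hand side, here the number of captures returned.
my $count = () = $text =~ /(.*)Telefon:(.+)(Fax:.*)/;

# A failed match returns an empty list, so the count is 0.
my $none = () = 'no match here' =~ /(.*)Telefon:(.+)(Fax:.*)/;

print "$count $none\n";   # 3 0
```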

Sensethising domains

So I'm trying to put all numbered domains into one element of a hash doing this:
### Domanis ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; #this regex get the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with number
my $foo = $1;
$foo = "OTHER";
$domain{$foo}++;
}
else {
$domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently prints this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two numbers of 1 to 3 digits each, separated by a dot
But if you only need to separate out the domains that are not numbered, you could just check whether the last character of the name is a letter, because TLDs cannot contain numbers, so you would do something like
/[a-zA-Z]$/ -- if this returns true, it is not a numbered domain (provided you've stripped spaces and newlines; note that \w would not work here, since it also matches digits)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+\.[a-zA-Z]+)$/)
{
#the domain is assigned to the variable $domain
} else {
#it is a number domain
}
Take what it currently puts, and use the regex:
/\d+\.\d+/
If it matches this, then it's a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This is saying that your string starts with one or more letter d followed by a period followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. This is the sub-pattern to /\d+\.\d+/, so both regular expressions will match the same thing.
The
$dom =~ /(\w+\.\w+)$/;
matches at the end of $dom (because of the $ anchor): one or more letters, digits, or underscores, then a period, then one or more of the same. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This sets $foo to whatever the previous match captured, but then immediately resets $foo to OTHER and increments $domain{OTHER}.
We need a sample of your initial data, and maybe the actual routine that prints your output.
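Pending that, here is one possible sketch of the counting the question seems to be after. It assumes the names arrive one per element in a hypothetical @names array (a subset of the question's sample data) and that "numbered" means the last two labels are both all digits:

```perl
use strict;
use warnings;

# A few of the sample names from the question; @names is a hypothetical container.
my @names = qw(
    something.something.72.154
    something.something.72.155
    something.something.cox.net
    something.something.com.br
    something.something.cox.net
);

my %domain;
for my $name (@names) {
    # Take the last two dot-separated labels; skip names that have none.
    next unless $name =~ /(\w+\.\w+)$/;
    my $last_two = $1;

    # Lump purely numeric pairs together under one key.
    my $key = $last_two =~ /^\d+\.\d+$/ ? 'OTHER' : $last_two;
    $domain{$key}++;
}

print "$_ $domain{$_}\n" for sort keys %domain;
```

With this input the hash ends up with three keys: OTHER (count 2), com.br (count 1) and cox.net (count 2).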

how do you match two strings in two different variables using regular expressions?

$a='program';
$b='programming';
if ($b=~ /[$a]/){print "true";}
this is not working
thanks every one i was a little confused
The [] in a regex denotes a character class, which matches any one of the characters listed inside it.
Your regex is equivalent to:
$b=~ /[program]/
which returns true as character p is found in $b.
Note that if you print the bareword true (without quotes) to see whether the match happened, nothing will show up, because Perl parses print true; as printing $_ to a filehandle named true. Quote the string instead.
But if you wanted to see if one string is present inside another you have to drop the [..] as:
if ($b=~ /$a/) { print "true";}
If $a contains any regex metacharacters, the match above may fail or match the wrong thing. To fix that, place the variable between \Q and \E so that any metacharacters in it are escaped:
if ($b=~ /\Q$a\E/) { print "true";}
Assuming either variable may come from external input, please quote the variables inside the regex:
if ($b=~ /\Q$a\E/){print "true";}
You then won't get burned when the pattern you'll be looking for will contain "reserved characters" like any of -[]{}().
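To illustrate why the quoting matters, here is a small sketch with made-up inputs. The names $needle and $haystack are hypothetical, chosen partly to avoid $a and $b, which sort also uses:

```perl
use strict;
use warnings;

my $needle   = 'a.c';   # hypothetical input containing a metacharacter
my $haystack = 'abc';   # does not contain the literal text "a.c"

# Unquoted: the . in $needle is a regex wildcard, so 'abc' matches.
my $unquoted = $haystack =~ /$needle/ ? 1 : 0;

# Quoted: \Q...\E escapes the ., so only a literal "a.c" would match.
my $quoted = $haystack =~ /\Q$needle\E/ ? 1 : 0;

print "$unquoted $quoted\n";   # 1 0
```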
(Apart from the missing semicolons:) Why do you put $a in square brackets? That makes it a set of possible characters. Try:
$b =~ /\Q${a}\E/
Update
To answer your remarks regarding = and =~:
=~ is the matching operator, and specifies the variable to which you are applying the regex ($b) in your example above. If you omit =~, then Perl will automatically use an implied $_ =~.
In list context, a match returns a list of the captured groups, which you usually assign to a list, as in ($match1, $match2) = $b =~ /.../;. If, on the other hand, you assign the result to a scalar, the match is evaluated in scalar context and the scalar simply receives a true or false value indicating whether the pattern matched.
So if you write $b = /\Q$a\E/, you'll end up with $b = $_ =~ /\Q$a\E/.
$a='program';
$b='programming';
if ( $b =~ /\Q$a\E/) {
print "match found\n";
}
If you're just looking for whether one string is contained within another and don't need to use any character classes, quantifiers, etc., then there's really no need to fire up the regex engine to do an exact literal match. Consider using index instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'program';
my $string = 'programming';
if (index($string, $target) > -1) {
print "target is in string\n";
}