I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                       Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
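A minimal sketch of the difference described in Note 1, using the sentence from the question. The variable name $matched is only illustrative.

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar assignment: the match runs in scalar context and yields true/false.
my $matched = ( $words =~ /anywhere\s+(\S+)/ );

# List assignment: the match runs in list context and yields the captures.
my ($w_after) = ( $words =~ /anywhere\s+(\S+)/ );

print "scalar context: $matched\n";    # 1, meaning "the match succeeded"
print "list context:   $w_after\n";    # except
```

The parentheses around $w_after are what switch the assignment from scalar to list context.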
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
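A short sketch of how this pattern could be tried out, assuming Perl v5.22 or later is available. The helper name word_after_anywhere and the second sample sentence are illustrative assumptions, not part of the answer above.

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} needs Perl v5.22 or later

# Hypothetical helper: return the word that follows "anywhere", or undef.
sub word_after_anywhere {
    my ($text) = @_;
    my ($word) = $text =~ /anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/;
    return $word;
}

say word_after_anywhere("Today i'm not going anywhere except to office.");
say word_after_anywhere("You can go anywhere I'll go.");
```

The second call is there to show the apostrophe case: the word break rules should keep "I'll" together as one word, which plain \b would split at the apostrophe.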
First, you have to write parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding operator to improve readability, but it is not necessary because =~ has pretty high precedence.
Use the word character class \w:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note that I'm using the x flag so that whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as an equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables, like (my $a, our $UGLY_GLOBAL, local $_, @_).
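To illustrate why \w+ can be a better capture than \S+, here is a small comparison. The sample sentence is made up for the purpose of the example.

```perl
use strict;
use warnings;

my $words = "I would go anywhere else, honestly.";

# \S+ grabs every non-space character, including the trailing comma.
my ($with_punct) = $words =~ /anywhere\s+(\S+)/;

# \W+ skips the separator and \w+ stops at the first non-word character.
my ($word_only) = $words =~ / \b anywhere \W+ (\w+) \b /x;

print "$with_punct\n";    # else,
print "$word_only\n";     # else
```

One trade-off to keep in mind: \w does not match an apostrophe, so a word such as "i'm" would be cut off after the "i".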
This regex will match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ captures the word between the two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);
I am new to perl. Can anyone explain the meaning of the following line of code:
my ($H,$M,$S) = $date =~ m{^([0-9]{2}):([0-9]{2}):([0-9]{2})}
I assume that after the execution of this line $H, $M and $S will have the values extracted from $date. Can anyone explain to get a better understanding?
It tries to match the contents of the $date variable, with a regex:
^([0-9]{2}):([0-9]{2}):([0-9]{2})
The regex basically means: at the start of the string there should be three two-digit numbers separated by colons. Each of the three two-digit numbers is enclosed in a capture group.
Finally, the matches of the three groups are assigned to the lexical variables $H, $M and $S.
For example if
$date = "10:37:21 2016.01.02";
then
$H = "10";
$M = "37";
$S = "21";
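If $date does not start with a time, the match returns an empty list and all three variables stay undefined, so it is usually worth testing the assignment. A minimal sketch, in which the helper name parse_time is an assumption made for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (H, M, S) on success and an empty list otherwise.
sub parse_time {
    my ($date) = @_;
    return $date =~ m{^([0-9]{2}):([0-9]{2}):([0-9]{2})};
}

for my $date ( "10:37:21 2016.01.02", "no time here" ) {
    # A list assignment used as a condition is true when the right-hand
    # side produced at least one element.
    if ( my ( $H, $M, $S ) = parse_time($date) ) {
        print "H=$H M=$M S=$S\n";
    }
    else {
        print "'$date' does not start with HH:MM:SS\n";
    }
}
```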
Can anyone explain to get a better understanding?
You need to start to be aware of two things:
list context
scalar context
The match operator, m//, will provide different results depending on what's on the left hand side of your = sign. Check this out:
use strict;
use warnings;
use 5.020;
my $result = "abc" =~ m/a(.)(.)/;
say $result; #=> 1
my @results = "abc" =~ m/a(.)(.)/;
for my $result (@results) {
say $result;
};
--output:--
b
c
A $variable can only store one thing, so when there is a $variable on the left hand side of the = sign, the $variable looks over to the match operator, m//, on the right hand side of the = sign and calls out, "Hey, I can only store one thing over here, just give me one thing, please!" The match operator responds by returning 1, for true, if there was a match; or the empty string, for false, if there wasn't a match.
On the other hand, when an @variable is on the left hand side of the = sign, the array looks over to the m// operator and calls out, "Hey, I can store a bunch of things over here, so give me a bunch of stuff, please!" The match operator responds by returning what matched the capture groups in the regex if there was a match; if there wasn't a match, the match operator returns ().
In the first case, the $variable is said to provide scalar context for the match operator. In the second case, the @variable is said to provide list context for the match operator. Don't let those terms scare you. You know what they mean now.
Next, when you write this:
my ($H,$M,$S) =
You are creating several variables on the left hand side of the = sign. In unison, they call out to the match operator on the other side of the = sign, "Hey, there are many of us over here, give us the bunch of stuff, please!" That particular my syntax provides a list context for the match operator which is on the right hand side of the = sign:
my ($group1, $group2) = "abc" =~ m/a(.)(.)/;
say $group1; #=> b
say $group2; #=> c
Note that if the delimiters you use for the match operator are m/.../, then you don't have to write the leading m, so typically you will see the example above written as:
my ($group1, $group2) = "abc" =~ /a(.)(.)/;
When you use other delimiters, such as braces like you did: m{...}, then you have to write the leading m.
You can use a simpler regex, which is easier to understand, to do what you want:
\d{2} #\d means a digit, {2} means twice,
#so this matches two consecutive digits
Here's how you can use that regex:
#Just blindly use all three of these in every program:
use strict;
use warnings;
use 5.020;
my $date = "10:37:21 2016.01.02";
my ($H,$M,$S) = $date =~ /\d{2}/g; #g => global, Find all matches in the string
say $H; #say() is the same as print() with a newline at the end
say $M;
say $S;
--output:--
10
37
21
The regex starts at the beginning of the string and looks for two consecutive digits and finds 10, so that is a match; then the regex jumps over the : and finds 37, so that is a match; then the regex jumps over the : and finds 21, so that is a match; etc., etc.
When you assign all the matches to three variables, the first three matches are assigned to the three variables, and the rest of the matches are discarded.
I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger")
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there will also be instances where a comma is immediately followed by a space and then another word, as that is the correct way of using commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
^ ^
| |
note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}
print join("\n",@matches),"\n";
Though you can probably construct a single regex, a combination of regexes, split, grep and map looks decent:
my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;
Going from right to left:
Split the line on spaces (split)
Leave only elements having no comma at either end but having one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
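The three steps above can be sketched as a complete script. The wrapper name comma_lists and the sample sentence are assumptions made for this example; the split/grep/map chain itself is the one from the answer, with the input passed explicitly instead of through $_.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the split/grep/map chain described above.
sub comma_lists {
    my ($text) = @_;
    return map  { split /,/ }                       # 3. break each list into items
           grep { !/^,/ && !/,$/ && /,/ }           # 2. keep tokens with inner commas only
           split ' ', $text;                        # 1. split the line on whitespace
}

my @array = comma_lists(
    "Saturn is a ringed planet, however, see cat_dog,horse,rabbit,chicken-pig here"
);
print join( "|", @array ), "\n";    # cat_dog|horse|rabbit|chicken-pig
```

Tokens such as "planet," are dropped by the grep because they end in a comma, which is how ordinary prose is kept out of the result.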
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my @matches = map {
    # Test the match itself, so that $1 is only read after a successful match.
    if ($_ =~ /$regex/) {
        [split ',', $1];
    }
    else {
        []
    }
} @strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push (@arr, $&);
}
The regular expression matches three cases:
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig
I have an array of strings, some of which contain the character '-'. I want to be able to search for it and for those strings that contain it I wish to delete all characters to the right of it.
So for example if I have:
$string1 = 'home - London';
$string2 = 'office';
$string3 = 'friend-Manchester';
or something as such, then the affected strings would become:
$string1 = 'home';
$string3 = 'friend';
I don't know if the white-space before the '-' would be included in the string afterwards (I don't want it as I will be comparing strings at a later point, although if it doesn't affect string comparisons then it doesn't matter).
I do know that I can search and replace specific strings/characters using something like:
$string1 =~ s/-//
or
$string1 =~ tr/-//
but I'm not very familiar with regular expressions in Perl so I'm not 100% sure of these. I've looked around and couldn't see anything to do with 'to the right of' in regex. Help appreciated!
You can delete anything after a hyphen - with this substitution:
s/-.*$//s
However, you will want to remove the whitespace prior to the hyphen and thus do
s/\s* - .* $//xs
The $ anchors the regex at the end of the string and the /s flag allows the dot to match newlines as well. While the $ is superfluous, it might add clarity.
Your substitution would just have removed the first -, and your transliteration would not have changed the string at all: without the /d flag, tr/-// only counts the hyphens.
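A small sketch that applies the substitution to the strings from the question and shows what tr/-// reports. The helper name trim_after_hyphen is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the first hyphen, any whitespace before it,
# and everything after it. Works on a copy, so the argument is untouched.
sub trim_after_hyphen {
    my ($s) = @_;
    $s =~ s/\s*-.*//s;
    return $s;
}

my @strings = ( 'home - London', 'office', 'friend-Manchester' );

for my $s (@strings) {
    my $hyphens = ( $s =~ tr/-// );    # counts hyphens, changes nothing
    my $trimmed = trim_after_hyphen($s);
    print "'$s' has $hyphens hyphen(s) -> '$trimmed'\n";
}
```

Because \s* is part of the match, 'home - London' becomes 'home' with no trailing space, which keeps later string comparisons simple.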
Your regular expressions are just searching for the dash, so that's all they replace. You want to search for the dash, and anything after it.
$string =~ s/-.*//;
. represents any character (except a newline, unless the /s flag is used), * means search for that character 0 or more times, and match as many as possible (i.e. to the end of the line if possible)
You can also search for an optional space before it.
$string =~ s/\s?-.*//;
(\s matches any whitespace character, not just a space, and \s? allows at most one of them)
Using plain substr() and index() is possible as well.
my @strings = ("we are - so cool",
"lonely",
"friend-Manchester",
"home - london",
"home-new york",
"home with-childeren-first episode");
local $/ = " ";
foreach (@strings) {
$_ = substr($_,0,index($_,'-')) if (index($_,'-') != -1);
chomp;
}
The other answers are good. However, in light of what you said:
...if it doesn't affect string comparisons then it doesn't matter
You don't need a separate step for this at all. Suppose you want to compare $string with another variable, $search_string. The following expression will check for an exact match, except that it ignores anything $string has after a dash:
if ($string =~ /^$search_string(\s*-|$)/) { print "Strings matched"; }
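A hedged sketch of the same comparison wrapped in a helper. The name same_before_dash is an assumption, and \Q...\E has been added here so that a search string containing regex metacharacters is still compared literally; the one-liner above does not do that.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $string equals $search_string once an
# optional dash suffix (and whitespace before it) is ignored.
sub same_before_dash {
    my ( $string, $search_string ) = @_;
    return $string =~ /^\Q$search_string\E(?:\s*-|$)/ ? 1 : 0;
}

print same_before_dash( 'home - London', 'home' ),   "\n";    # 1
print same_before_dash( 'homework',      'home' ),   "\n";    # 0
print same_before_dash( 'office',        'office' ), "\n";    # 1
```

The alternation after the search string is what prevents 'home' from matching 'homework': the text must be followed either by a dash or by the end of the string.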
#Using Regex:
my @strings =
("we are - so cool",
"lonely",
"friend - Manchester",
"home - london",
"home - new york",
"home with-childeren-first episode"
);
foreach (@strings) {
$_ =~ s/-\s*[a-zA-Z ]+\s*//g;
print "NEW: ".$_."\n";
}
$a='program';
$b='programming';
if ($b=~ /[$a]/){print "true";}
this is not working
thanks every one i was a little confused
The [] in a regex denotes a character class, which matches any one of the characters listed inside it.
Your regex is equivalent to:
$b=~ /[program]/
which returns true as character p is found in $b.
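To make the difference concrete, here is a small comparison of the character class and the literal pattern. The variable name $a_word and the candidate strings are chosen only for this example.

```perl
use strict;
use warnings;

my $a_word = 'program';

for my $candidate ( 'programming', 'map', 'sky' ) {
    # [$a_word] becomes [program]: any ONE of the letters p, r, o, g, a, m.
    my $class = $candidate =~ /[$a_word]/ ? 'yes' : 'no';

    # /$a_word/ looks for the whole word "program".
    my $literal = $candidate =~ /$a_word/ ? 'yes' : 'no';

    print "$candidate: class=$class literal=$literal\n";
}
```

'map' passes the character class test because it contains m, a and p, even though it does not contain the word "program" at all.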
To see whether the match happened you are printing true; if true is written without quotes, nothing will be shown. Try printing a quoted string instead.
But if you want to see whether one string is present inside another, you have to drop the [..]:
if ($b=~ /$a/) { print 'true';}
If variable $a contains any regex metacharacters, the above match can fail. To fix that, place the variable between \Q and \E so that any metacharacters in it are escaped:
if ($b=~ /\Q$a\E/) { print 'true';}
Assuming either variable may come from external input, please quote the variables inside the regex:
if ($b=~ /\Q$a\E/){print "true";}
You then won't get burned when the pattern you'll be looking for will contain "reserved characters" like any of -[]{}().
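A minimal sketch of what the quoting changes, assuming a search string that contains a dot. The helper name contains_literal is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: literal substring test using a quoted pattern.
sub contains_literal {
    my ( $haystack, $needle ) = @_;
    return $haystack =~ /\Q$needle\E/ ? 1 : 0;
}

my $needle = 'a.c';

# Unquoted, the dot is a metacharacter and matches the "b" in "abc".
print "unquoted: ", ( 'abc' =~ /$needle/ ? 1 : 0 ), "\n";          # 1
print "quoted:   ", contains_literal( 'abc',   $needle ), "\n";    # 0
print "quoted:   ", contains_literal( 'x a.c', $needle ), "\n";    # 1
```

Without the quoting, a search string such as "(" would not merely give a wrong answer: the pattern would fail to compile.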
(Apart from the missing semicolons:) Why do you put $a in square brackets? This makes it a list of possible characters. Try:
$b =~ /\Q${a}\E/
Update
To answer your remarks regarding = and =~:
=~ is the matching operator, and specifies the variable to which you are applying the regex ($b) in your example above. If you omit =~, then Perl will automatically use an implied $_ =~.
In list context, a successful match returns a list of the captured substrings, so you usually assign it to a list, such as in ($match1, $match2) = $b =~ /.../;. If, on the other hand, you assign the result to a scalar, then the scalar only tells you whether the match succeeded: 1 for true, the empty string for false.
So if you write $b = /\Q$a\E/, you'll end up with $b = $_ =~ /\Q$a\E/.
$a='program';
$b='programming';
if ( $b =~ /\Q$a\E/) {
print "match found\n";
}
If you're just looking for whether one string is contained within another and don't need to use any character classes, quantifiers, etc., then there's really no need to fire up the regex engine to do an exact literal match. Consider using index instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'program';
my $string = 'programming';
if (index($string, $target) > -1) {
print "target is in string\n";
}