i am new to perl and using regex. I have a subroutine that is suppose to show that we are looking for a pattern that has one of more dots".". The next line is to use this subroutine to check if there are dots in some patterns. However I find that the if block in my if/else statement is not evaluated, it only evaluates the else block.
my $hostname1 = "hh.uu";
my $hostname2 = "yyhu";
sub isDotted {
return 1 if #_ =~ /\./g; return 0;
}
if (isDotted($hostname1) != isDotted($hostname2))
{
print "Must be equal\n";
}
else {
print "Perfect good to go\n";
}
This code prints out "Perfect good to go" and the fact that $hostname2 does not have a dot "." in it means it should print out "must be equal" but it does not evaluate the if block.
*Update: Sorry I have made the changes, this code should compile
Perl passes arguments to subroutines in a fixed-name array (#_). So you have to extract your argument from that array. By using the bare array name, you are probably in "scalar context" and perl evaluates arrays in scalar context to be their size, 1 in this case. And 1 is never going to match /\./. The way to refer to the first element of array #_ is $_[0]. The /g flag is also superfluous. So try changing that line to this:
return 1 if $_[0] =~ /\./; return 0;
Lots wrong with this code.
1) It does not compile. Use perl -c script_name to check if it compiles correctly.
2) For anything more that a one line script add
use strict;
use warnings;
to the top of the script.
3) Parameters are passed into a subroutine in #_ not $_
4) = ~ is two separate operators you want the =~ operator
5) hostname1 and hostname2 are not the same as $hostname1 and $hostname2, you should be using the latter to pass the values into your subrouteen.
Related
I am trying to match a function or mixin used in an SCSS string so I may remove it but I am having a bit of trouble.
For those unfamiliar with SCSS this is an example of the things I am trying to match (from bootstrap 4).
#mixin _assert-ascending($map, $map-name) {
$prev-key: null;
$prev-num: null;
#each $key, $num in $map {
#if $prev-num == null {
// Do nothing
} #else if not comparable($prev-num, $num) {
#warn "Potentially invalid value for #{$map-name}: This map must be in ascending order, but key '#{$key}' has value #{$num} whose unit makes it incomparable to #{$prev-num}, the value of the previous key '#{$prev-key}' !";
} #else if $prev-num >= $num {
#warn "Invalid value for #{$map-name}: This map must be in ascending order, but key '#{$key}' has value #{$num} which isn't greater than #{$prev-num}, the value of the previous key '#{$prev-key}' !";
}
$prev-key: $key;
$prev-num: $num;
}
}
And a small function:
#function str-replace($string, $search, $replace: "") {
$index: str-index($string, $search);
#if $index {
#return str-slice($string, 1, $index - 1) + $replace + str-replace(str-slice($string, $index + str-length($search)), $search, $replace);
}
#return $string;
}
So far I have the following regex:
#(function|mixin)\s?[[:print:]]+\n?([^\}]+)
However it only matches to the first } that it finds which makes it fail, this is because it needs to find the last occurance of the closing curly brace.
My thoughts are that a regex capable of matching a function definition could be adapted but I can't find a good one using my Google foo!
Thanks in advance!
I would not recommend to use a regex for that, since a regex is not able to handle recursion, what you might need in that case.
For Instance:
#mixin test {
body {
}
}
Includes two »levels« of scope here ({{ }}), so your regex should be able to to count brackets as they open and close, to match the end of the mixin or function. But that is not possible with a regex.
This regex
/#mixin(.|\s)*\}/gm
will match the whole mixin, but if the input is like that:
#mixin foo { … }
body { … }
It will match everything up to the last } what includes the style definition for the body. That is because the regex cannot know which } closes the mixin.
Have a look at this answer, it explains more or less the same thing but based on matching html elements.
Instead you should use a parser, to parse the whole Stylesheet into syntax tree, than remove unneeded functions and than write it to string again.
In fact, like #philipp said, regex can't replace syntax analysis like compilers do.
But here is a sed command which is a little ugly but could make the trick :
sed -r -e ':a' -e 'N' -e '$!ba' -e 's/\n//g' -e 's/}\s*#(function|mixin)/}\n#\1/g' -e 's/^#(function|mixin)\s*str-replace(\s|\()+.*}$//gm' <your file>
-e ':a' -e 'N' -e '$!ba' -e 's/\n//g' : Read all file in a loop and remove the new line (See https://stackoverflow.com/a/1252191/7990687 for more information)
-e 's/}\s*#(function|mixin)/}\n#\1/g' : Make each #mixin or #function statement the start of a new line, and the preceding } the last character of the previous line
's/^#(function|mixin)\s*str-replace(\s|\()+.*}$//gm' : Remove the line corresponding to the #function str-replace or #mixin str-replace declaration
But it will result in an output that will loose indentation, so you will have to reindent it after that.
I tried it on a file where I copy/paste multiple times the sample code you provided, so you will have to try it on your file because there could be cases where the regex will match more element than wanted. If it is the case, provide us a test file to try to resolve these issues.
After much headache here is the answer to my question!
The source needs to be split line by line and read, maintining a count of the open / closed braces to determine when the index is 0.
$pattern = '/(?<remove>#(function|mixin)\s?[\w-]+[$,:"\'()\s\w\d]+)/';
$subject = file_get_contents('vendor/twbs/bootstrap/scss/_variables.scss'); // just a regular SCSS file containing what I've already posted.
$lines = explode("\n",$subject);
$total_lines = count($lines);
foreach($lines as $line_no=>$line) {
if(preg_match($pattern,$line,$matches)) {
$match = $matches['remove'];
$counter = 0;
$open_braces = $closed_braces = 0;
for($i=$line_no;$i<$total_lines;$i++) {
$current = $lines[$i];
$open_braces = substr_count($current,"{");
$closed_braces = substr_count($current,"}");
$counter += ($open_braces - $closed_braces);
if($counter==0) {
$start = $line_no;
$end = $i;
foreach(range($start,$end) as $a) {
unset($lines[$a]);
} // end foreach(range)
break; // break out of this if!
} // end for loop
} // end preg_match
} // endforeach
And we have a $lines array without any functions or mixins.
There is probably a more elegant way to do this but I don't have the time or the willing to write an AST parser for SCSS
This can be quite easily adapted into making a hacked one however!
So here is what I'm doing. This is for homework, and I know I can't come on here and get you guys to do my homework for me but I'm stuck. We have to use perl (First time ever using it so forgive my stupidity) to make a function $starts_with that takes a parameter $str0 and $prefix. if $str0 starts with $prefix. then the function returns true. if it doesn't then it isn't pretty simple. We have to use regular expressions because that is the whole point of the exercise so here is my code
sub starts_with
{
$str0 = $_[0];
$prefix = $_[1];
if($prefix =~ /^($str0)/)
{
print $str0."\n";
print m/^(prefix)/."\n";
$startsWith = "Y"
}
if ($startsWith eq "Y")
{
print $str0." starts with ".$prefix."\n";
}
else
{
print $str0." does not start with ".$prefix."\n";
}
}
I'm almost ashamed to put this up here because I have no Idea what I'm doing yet. But I am trying to learn. I don't know how to do true false in perl thats why I have the $startsWith variable. you can fix that if you want. the part I need to fix is the line
if(str0 =~ /^($prefix)/)
I also need to find out how to refer to the first letter in str0...I think
A couple points without giving away the answer:
1) Arguments to functions are passed in a special variable called #_, which is what you are accessing when you say $_[0] and $_[1], but can be written much more concisely by assigned the argument list (#_) to your variables in list context
sub starts_with {
my ($str0, $prefix) = #_;
...
}
2) This statement: if($prefix =~ /^($str0)/) tests the exact opposite condition you are trying to prove. It says does the prefix start with the value of the variable $str0. What you really want to test is if $str0 starts with $prefix.
It might also be using to prefix your pattern with m flag, m/PATTERN which means match this pattern.
3) You don't have a return statement in your function, (As #M42 points out) the result of the last expression is returned; that expression being print will return true. You probably want to return true or false explicity.
See if you can use this to get started.
What I would do :
use Modern::Perl; # or use strict; use warnings; use feature qw/say/;
sub starts_with {
# better use #_, the default array instead of just elements of them
# ...like $_[0]
my ($str, $pref) = #_;
# very short expression, the pattern matching return a boolean.
# \Q\E is there to treat the prefix as-is (no metacharacters)
return $str =~ /^\Q$pref\E/;
}
# using our function
if (starts_with("foobar", "f")) {
say "TRUE";
}
else {
say "FALSE";
}
Golfing it a bit...
sub starts_with { $_[0] =~ /^\Q$_[1]/ }
Don't hand that version in though :-)
For example I have a string:
MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|
and I want to catch this: |TrxId=475665|
after TrxId= it could be any numbers and any amount of them, so regex should catch as well:
|TrxId=111333| and |TrxId=0000011112222| and |TrxId=123|
TrxId=(\d+)
That would give a group (1) with the TrxId.
PS: Use global modifier.
The regex should look somewhat like this:
TrxId=[0-9]+
It will match TrxId= followed by at least one digit.
An example solution in Python:
In [107]: data = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|'
In [108]: m = re.search(r'\|TrxId=(\d+)\|', data)
In [109]: m.group(0)
Out[109]: '|TrxId=475665|'
In [110]: m.group(1)
Out[110]: '475665'
/MsgNam\=.*?\|(TrxId\=\d+)\|.*/
for example in perl:
$a = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100111|";
$a =~ /MsgNam\=.*?\|(TrxId\=\d+)\|.*/;
print $1;
will print TrxId=475665
You know what your delimiters look like, so you don't need a regex, you need to split. Here's an implementation in Perl.
use strict;
use warnings;
my $input = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|";
my #first_array = split(/\|/,$input); #splitting $input on "|"
#Now, since the last character of $input is "|", the last element
#of this array is undef (ie the Perl equivalent of null)
#So, filter that out.
#first_array = grep{defined}#first_array;
#Also filter out elements that do not have an equals sign appearing.
#first_array = grep{/=/}#first_array;
#Now, put these elements into an associative array:
my %assoc_array;
foreach(#first_array)
{
if(/^([^=]+)=(.+)$/)
{
$assoc_array{$1} = $2;
}
else
{
#Something weird may be happening...
#we may have an element starting with "=" for example.
#Do what you want: throw a warning, die, silently move on, etc.
}
}
if(exists $assoc_array{TrxId})
{
print "|TrxId=" . $assoc_array{TrxId} . "|\n";
}
else
{
print "Sorry, TrxId not found!\n";
}
The code above yields the expected output:
|TrxId=475665|
Now, obviously this is more complex than some of the other answers, but it's also a bit more robust in that it allows you to search for more keys as well.
This approach does have a potential issue if your keys appear more than once. In that case, it's easy enough to modify the code above to collect an array reference of values for each key.
I have an Ispell list of english words (nearly 50 000 words), my homework in Perl is to get quickly (like under one minute) list of all strings, that are substrings of some other word. I have tried solution with two foreach cycles comparing all words, but even with some optimalizations, its still too slow. I think, that right solution could be some clever use of regular expressions on array of words. Do you know how to solve this problem quicky (in Perl)?
I have found fast solution, which can find some all these substrings in about 15 seconds on my computer, using just one thread. Basically, for each word, I have created array of every possible substrings (eliminating substrings which differs only in "s" or "'s" endings):
#take word and return list of all valid substrings
sub split_to_all_valid_subwords {
my $word = $_[0];
my #split_list;
my ($i, $j);
for ($i = 0; $i < length($word); ++$i){
for ($j = 1; $j <= length($word) - $i; ++$j){
unless
(
($j == length($word)) or
($word =~ m/s$/ and $i == 0 and $j == length($word) - 1) or
($word =~ m/\'s$/ and $i == 0 and $j == length($word) - 2)
)
{
push(#split_list, substr($word, $i, $j));
}
}
}
return #split_list;
}
Then I just create list of all candidates for substrings and make intersection with words:
my #substring_candidates;
foreach my $word (#words) {
push( #substring_candidates, split_to_all_valid_subwords($word));
}
#make intersection between substring candidates and words
my %substring_candidates=map{$_ =>1} #substring_candidates;
my %words=map{$_=>1} #words;
my #substrings = grep( $substring_candidates{$_}, #words );
Now in substrings I have array of all words, that are substrings of some other words.
Perl regular expressions will optimize patterns like foo|bar|baz into an Aho-Corasick match - up to a certain limit of total compiled regex length. Your 50000 words will probably exceed that length, but could be broken into smaller groups. (Indeed, you probably want to break them up by length and only check words of length N for containing words of length 1 through N-1.)
Alternatively, you could just implement Aho-Corasick in your perl code - that's kind of fun to do.
update
Ondra supplied a beautiful solution in his answer; I leave my post here as an example of overthinking a problem and failed optimisation techniques.
My worst case kicks in for a word that doesn't match any other word in the input. In that case, it goes quadratic. The OPT_PRESORT was a try to advert the worst case for most words. The OPT_CONSECUTIVE was a linear-complexity filter that reduced the total number of items in the main part of the algorithm, but it is just a constant factor when considering the complexity. However, it is still useful with Ondras algorithm and saves a few seconds, as building his split list is more expensive than comparing two consecutive words.
I updated the code below to select ondras algorithm as a possible optimisation. Paired with zero threads and the presort optimisation, it yields maximum performance.
I would like to share a solution I coded. Given an input file, it outputs all those words that are a substring of any other word in the same input file. Therefore, it computes the opposite of ysth's ideas, but I took the idea of optimisation #2 from his answer. There are the following three main optimisations that can be deactivated if required.
Multithreading
The questions "Is word A in list L? Is word B in L?" can be easily parallelised.
Pre-sorting all the words for their length
I create an array that points to the list of all words that are longer than a certain length, for every possible length. For long words, this can cut down the number of possible words dramatically, but it trades quite a lot of space, as one word of length n appears in all lists from length 1 to length n.
Testing consecutive words
In my /usr/share/dict/words, most consecutive lines look quite similar:
Abby
Abby's
for example. As every word that would match the first word also matches the second one, I immediately add the first word to the list of matching words, and only keep the second word for further testing. This saved about 30% of words in my test cases. Because I do that before optimisation No 2, this also saves a lot of space. Another trade-off is that the output will not be sorted.
The script itself is ~120 lines long; I explain each sub before showing it.
head
This is just a standard script header for multithreading. Oh, and you need perl 5.10 or better to run this. The configuration constants define the optimisation behaviour. Add the number of processors of your machine in that field. The OPT_MAX variable can take the number of words you want to process, however this is evaluated after the optimisations have taken place, so the easy words will already have been caught by the OPT_CONSECUTIVE optimisation. Adding anything there will make the script seemingly slower. $|++ makes sure that the status updates are shown immediately. I exit after the main is executed.
#!/usr/bin/perl
use strict; use warnings; use feature qw(say); use threads;
$|=1;
use constant PROCESSORS => 0; # (false, n) number of threads
use constant OPT_MAX => 0; # (false, n) number of words to check
use constant OPT_PRESORT => 0; # (true / false) sorts words by length
use constant OPT_CONSECUTIVE => 1; # (true / false) prefilter data while loading
use constant OPT_ONDRA => 1; # select the awesome Ondra algorithm
use constant BLABBER_AT => 10; # (false, n) print progress at n percent
die q(The optimisations Ondra and Presort are mutually exclusive.)
if OPT_PRESORT and OPT_ONDRA;
exit main();
main
Encapsulates the main logic, and does multi-threading. The output of n words will be matched will be considerably smaller than the number of input words, if the input was sorted. After I have selected all matched words, I print them to STDOUT. All status updates etc. are printed to STDERR, so that they don't interfere with the output.
sub main {
my #matching; # the matching words.
my #words = load_words(\#matching); # the words to be searched
say STDERR 0+#words . " words to be matched";
my $prepared_words = prepare_words(#words);
# do the matching, possibly multithreading
if (PROCESSORS) {
my #threads =
map {threads->new(
\&test_range,
$prepared_words,
#words[$$_[0] .. $$_[1]] )
} divide(PROCESSORS, OPT_MAX || 0+#words);
push #matching, $_->join for #threads;
} else {
push #matching, test_range(
$prepared_words,
#words[0 .. (OPT_MAX || 0+#words)-1]);
}
say STDERR 0+#matching . " words matched";
say for #matching; # print out the matching words.
0;
}
load_words
This reads all the words from the input files which were supplied as command line arguments. Here the OPT_CONSECUTIVE optimisation takes place. The $last word is either put into the list of matching words, or into the list of words to be matched later. The -1 != index($a, $b) decides if the word $b is a substring of word $a.
sub load_words {
my $matching = shift;
my #words;
if (OPT_CONSECUTIVE) {
my $last;
while (<>) {
chomp;
if (defined $last) {
push #{-1 != index($_, $last) ? $matching : \#words}, $last;
}
$last = $_;
}
push #words, $last // ();
} else {
#words = map {chomp; $_} <>;
}
#words;
}
prepare_words
This "blows up" the input words, sorting them after their length into each slot, that has the words of larger or equal length. Therefore, slot 1 will contain all words. If this optimisation is deselected, it is a no-op and passes the input list right through.
sub prepare_words {
if (OPT_ONDRA) {
my $ondra_split = sub { # evil: using $_ as implicit argument
my #split_list;
for my $i (0 .. length $_) {
for my $j (1 .. length($_) - ($i || 1)) {
push #split_list, substr $_, $i, $j;
}
}
#split_list;
};
return +{map {$_ => 1} map &$ondra_split(), #_};
} elsif (OPT_PRESORT) {
my #prepared = ([]);
for my $w (#_) {
push #{$prepared[$_]}, $w for 1 .. length $w;
}
return \#prepared;
} else {
return [#_];
}
}
test
This tests if the word $w is a substring in any of the other words. $wbl points to the data structure that was created by the previous sub: Either a flat list of words, or the words sorted by length. The appropriate algorithm is then selected. Nearly all of the running time is spent in this loop. Using index is much faster than using a regex.
sub test {
my ($w, $wbl) = #_;
my $l = length $w;
if (OPT_PRESORT) {
for my $try (#{$$wbl[$l + 1]}) {
return 1 if -1 != index $try, $w;
}
} else {
for my $try (#$wbl) {
return 1 if $w ne $try and -1 != index $try, $w;
}
}
return 0;
}
divide
This just encapsulates an algorithm that guarantees a fair distribution of $items items into $parcels buckets. It outputs the bounds of a range of items.
sub divide {
my ($parcels, $items) = #_;
say STDERR "dividing $items items into $parcels parcels.";
my ($min_size, $rest) = (int($items / $parcels), $items % $parcels);
my #distributions =
map [
$_ * $min_size + ($_ < $rest ? $_ : $rest),
($_ + 1) * $min_size + ($_ < $rest ? $_ : $rest - 1)
], 0 .. $parcels - 1;
say STDERR "range division: #$_" for #distributions;
return #distributions;
}
test_range
This calls test for each word in the input list, and is the sub that is multithreaded. grep selects all those elements in the input list where the code (given as first argument) return true. It also regulary outputs a status message like thread 2 at 10% which makes waiting for completition much easier. This is a psychological optimisation ;-).
sub test_range {
my $wbl = shift;
if (BLABBER_AT) {
my $range = #_;
my $step = int($range / 100 * BLABBER_AT) || 1;
my $i = 0;
return
grep {
if (0 == ++$i % $step) {
printf STDERR "... thread %d at %2d%%\n",
threads->tid,
$i / $step * BLABBER_AT;
}
OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)
} #_;
} else {
return grep {OPT_ONDRA ? $wbl->{$_} : test($_, $wbl)} #_;
}
}
invocation
Using bash, I invoked the script like
$ time (head -n 1000 /usr/share/dict/words | perl script.pl >/dev/null)
Where 1000 is the number of lines I wanted to input, dict/words was the word list I used, and /dev/null is the place I want to store the output list, in this case, throwing the output away. If the whole file should be read, it can be passed as an argument, like
$ perl script.pl input-file >output-file
time just tells us how long the script ran. Using 2 slow processors and 50000 words, it executed in just over two minutes in my case, which is actually quite good.
update: more like 6–7 seconds now, with the Ondra + Presort optimisation, and no threading.
further optimisations
update: overcome by better algorithm. This section is no longer completely valid.
The multithreading is awful. It allocates quite some memory and isn't exactly fast. This isn't suprising considering the amount of data. I considered using a Thread::Queue, but that thing is slow like $#*! and therefore is a complete no-go.
If the inner loop in test was coded in a lower-level language, some performance might be gained, as the index built-in wouldn't have to be called. If you can code C, take a look at the Inline::C module. If the whole script were coded in a lower language, array access would also be faster. A language like Java would also make the multithreading less painful (and less expensive).
This is a really basic regex question but since I can't seem to figure out why the match is failing in certain circumstances I figured I'd post it to see if anyone else can point out what I'm missing.
I'm trying to pull out the 2 sets of digits from strings of the form:
12309123098_102938120938120938
1321312_103810312032123
123123123_10983094854905490
38293827_1293120938129308
I'm using the following code to process each string:
if($string && $string =~ /^(\d)+_(\d)+$/) {
if(IsInteger($1) && IsInteger($2)) { print "success ('$1','$2')"; }
else { print "fail"; }
}
Where the IsInterger() function is as follows:
sub IsInteger {
my $integer = shift;
if($integer && $integer =~ /^\d+$/) { return 1; }
return;
}
This function seems to work most of the time but fails on the following for some reason:
1287123437_1268098784380
1287123437_1267589971660
Any ideas on why these fail while others succeed? Thanks in advance for your help!
This is an add-on to the answers from unicornaddict and ZyX: what are you trying to match?
If you're trying to match the sequences left and right of '_', unicorn addict is correct and your regex needs to be ^(\d+)_(\d+)$. Also, you can get rid of the first qualifier and the 'IsIntrger()` function altogether - you already know it's an integer - it matched (\d+)
if ($string =~ /^(\d+)_(\d+)$/) {
print "success ('$1','$2')";
} else {
print "fail\n";
}
If you're trying to match the last digit in each and wondering why it's failing, it's the first check in IsInteger() ( if($intger && ). It's redundant anyway (you know it's an integer) and fails on 0 because, as ZyX notes - it evaluates to false.
Same thing applies though:
if ($string =~ /^(\d)+_(\d)+$/) {
print "success ('$1','$2')";
} else {
print "fail\n";
}
This will output success ('8','8') given the input 12309123098_102938120938120938
Because you have 0 at the end of the second string, (\d)+ puts only the last match in the $N variable, string "0" is equivalent to false.
When in doubt, check what your regex is actually capturing.
use strict;
use warnings;
my #data = (
'1321312_103810312032123',
'123123123_10983094854905490',
);
for my $s (#data){
print "\$1=$1 \$2=$2\n" if $s =~ /^(\d)+_(\d)+$/;
# Output:
# $1=2 $2=3
# $1=3 $2=0
}
You probably intended the second of these two approaches.
(\d)+ # Repeat a regex group 1+ times,
# capturing only the last instance.
(\d+) # Capture 1+ digits.
In addition, both in your main loop and in IsInteger (which seems unnecessary, given the initial regex in the main loop), you are testing for truth rather than something more specific, such as defined or length. Zero, for example, is a valid integer but false.
Shouldn't + be included in the grouping:
^(\d+)_(\d+)$ instead of ^(\d)+_(\d)+$
Many people have commented on your regex, but the problem you had in your IsInteger (which you really don't need for your example). You checked for "truth" when you really want to check for defined:
sub IsInteger {
my $integer = shift;
if( defined $integer && $integer =~ /^\d+$/) { return 1; }
return;
}
You don't need most of the infrastructure in that subroutine though:
sub IsInteger {
defined $_[0] && $_[0] =~ /^\d+$/
}