How to use Perl to perform substitutions based on calculations? - regex

I'm trying to write a Perl script to search through a text file, find all decimal numbers, and modify them by some scaling factor. So far, I have succeeded in extracting the numbers with regular expressions:
open(INPUT, $inputPath) or die "$inputPath cannot be opened.";
while ($thisLine = <INPUT>) {
while ($thisLine =~ m/(-*\d+\.\d+)/g) {
if(defined($1)) {
$new = $scalingFactor*$1;
print $new."\n";
}
}
}
close (INPUT);
However, I haven't yet figured out how to reinsert the new values into the file. I tried using s/(-*\d+.\d+)/$scalingFactor*$1/g for substitution, but of course this inserted the string representation of $scalingFactor instead of evaluating the expression.
I'm a perl newbie, so any help would be greatly appreciated. Thanks in advance,
-Dan
Edit: Solution (based on Roman's Reply)
while ($thisLine = <INPUT>) {
$thisLine =~ s/(-*\d+\.\d+)/$scalingFactor*$1/ge;
print OUTPUT $thisLine;
}
Alternatively, Sean's solution also worked great for me. Thanks all!

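s/(-*\d+\.\d+)/$scalingFactor*$1/ge
(note the e modifier at the end: it makes Perl evaluate the replacement as an expression instead of treating it as a string)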
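A minimal sketch of the /e modifier in isolation may help. The helper name scale_decimals and the sample line are made up for illustration; only the substitution itself comes from the answer above (with -? in place of -*).

```perl
use strict;
use warnings;

# Hypothetical helper: multiply every decimal number in $line by $factor.
sub scale_decimals {
    my ($line, $factor) = @_;
    # /e evaluates the replacement as Perl code; /g repeats it for every match.
    $line =~ s/(-?\d+\.\d+)/$factor * $1/ge;
    return $line;
}

print scale_decimals("width 1.5 height -2.0\n", 2);   # prints "width 3 height -4"
```

Without /e the same substitution would insert the literal text of the replacement (the value of $factor, an asterisk, and the matched number) rather than the product.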

Here's a self-contained subroutine that'll do the job. It uses the special variable $^I, which activates Perl's in-place editing feature. (See the "perlvar" man page for more information on $^I, and the "perlrun" man page for information about the -i command-line switch, which turns on in-place editing.)
use strict; # Always.
sub scale_numbers_in_file_by_factor {
my ($path, $scaling_factor) = @_;
local @ARGV = ($path);
local $^I = '.bak';
while (<>) {
s/ ( -? \d+ \. \d+ ) / $scaling_factor * $1 /gex;
print;
}
}
scale_numbers_in_file_by_factor('my-file.txt', .1);
A backup file will be made by appending '.bak' to the original filename. Change the '.bak' to '' above if you don't want a backup.
You might want to tweak your number-recognizing regular expression. As written, it will not match numbers without a trailing decimal point and at least one digit. I think you also want -? to match an optional minus sign, not -*, which will match any number of minus signs. Performing arithmetic on a string with more than one leading minus sign will almost certainly not do what you want.
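One possible tweak, sketched under the assumption that integers should be scaled as well: make the fractional part optional and allow at most one minus sign. The helper name scale_numbers is hypothetical and not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: scales integers and decimals alike.
sub scale_numbers {
    my ($text, $factor) = @_;
    # -?            at most one leading minus sign
    # \d+           integer part
    # (?: \. \d+ )? optional fractional part
    $text =~ s/ ( -? \d+ (?: \. \d+ )? ) / $factor * $1 /gex;
    return $text;
}

print scale_numbers("10 and -2.5\n", 2);   # prints "20 and -5"
```

Note that this pattern also treats a hyphen directly in front of a digit as a minus sign, so text such as part-2 would be affected; whether that matters depends on the input file.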

Related

Get the second string of the URI with Perl regex

I need to get the second part of the URI, the possible URI are:
/api/application/v1/method
/web/application/v1/method
I can get "application" using:
([^\/api]\w*)
and
([^\/web]\w*)
But I know this is not the best approach; what would be a good way?
Thanks!
Edit: thank you all for the input. The goal was to set the second part of the URI into a header in Apache with rewrite rules.
A general regex (Perl or PCRE syntax) solution would be:
^/[^/]+/([^/]+)
Each section is delimited with /, so just capture as many non-/ characters as there are.
This is preferable to non-greedy regexes because it does not need to backtrack, and allows for whatever else the sections may contain, which can easily contain non-word characters such as - that won't be matched by \w.
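As a rough illustration in Perl, the negated character class can be wrapped in a small helper. The name second_segment is an assumption made for this sketch, and m{...} delimiters are used only to avoid escaping the slashes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the second path segment, or undef if there is none.
sub second_segment {
    my ($path) = @_;
    # [^/]+ consumes one whole segment at a time, so no backtracking is needed.
    my ($segment) = $path =~ m{^/[^/]+/([^/]+)};
    return $segment;
}

print second_segment('/api/application/v1/method'), "\n";   # prints "application"
```

Because the class excludes only /, a segment such as my-app is captured whole, which \w would not manage.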
There are many ways to do this, and I'm not sure which is best, but it could be as simple as:
\/(.+?)\/(.+?)\/.*
where the desired output is in the second capturing group, $2.
Example
#!/usr/bin/perl -w
use strict;
use warnings;
use feature qw( say );
main();
sub main{
my $string = '/api/application/v1/method
/web/application/v1/method';
my $pattern = '\/(.+?)\/(.+?)\/.*';
my $match = replace($pattern, '$2', $string);
say $match , " is a match 💚💚💚 ";
}
sub replace {
my ($pattern, $replacement, $string) = @_;
$string =~s/$pattern/$replacement/gee;
return $string;
}
Output
application
application is a match 💚💚💚
Advice
zdim advises that:
A legitimate approach, notes:
(1) there is no need for the trailing .*
(2) Need /|$ (not just /), in case the path finishes without / (to
terminate the non-greedy pattern at the end of string, if there is no
/)
(3) note though that /ee can be vulnerable (even just to errors),
since the second evaluation (e) will run code if the first evaluation
results in code. And it may be difficult to ensure that that is always
done under full control. More to the point, for this purpose there is
no reason to run a substitution --- just match and capture is enough.
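Following that advice, a plain match-and-capture version of the non-greedy pattern might look like the sketch below. The helper name second_dir is hypothetical, and \z is used here for the end-of-string alternative in place of $.

```perl
use strict;
use warnings;

# Hypothetical helper: non-greedy capture of the second path component.
sub second_dir {
    my ($path) = @_;
    # (?:/|\z) ends the lazy capture at the next slash or at the end of the
    # string, so a path with no further slash still matches.
    my ($dir) = $path =~ m{^/.+?/(.+?)(?:/|\z)};
    return $dir;
}

print second_dir('/api/application'), "\n";             # prints "application"
print second_dir('/web/application/v1/method'), "\n";   # prints "application"
```

No substitution and no /ee are involved, so nothing from the input is ever evaluated as code.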
With all the regex, explicitly asked for, I'd like to bring up other approaches.
These also parse only a (URI style) path, like the regex ones, and return the second directory.
The most basic and efficient one, just split the string on /
my $dir = ( split /\//, $path )[2];
The split returns '' first (before the first /) thus we need the third element. (Note that we can use an alternate delimiter for the separator pattern, it being regex: split m{/}, $path.)
Use appropriate modules, for example URI
use URI;
my $dir = ( URI->new($path)->path_segments )[2];
or Mojo::Path
use Mojo::Path;
my $dir = Mojo::Path->new($path)->parts->[1];
What to use depends on details of what you do -- if you've got any other work with URLs and web then you clearly want modules for that; otherwise they may (or may not) be an overkill.
I've benchmarked these for a sanity check of what one is paying with modules.
The split either beats regex by up to 10-15% (the regex using negated character class and the one based on non-greedy .+? come around the same), or is about the same with them. They are faster than Mojo by about 30%, and only URI lags seriously, by a factor of 5 behind Mojo.
That's for paths typical for real-life URLs, with a handful of short components. With only two very long strings (10k chars), Mojo::Path (surprisingly for me) is a factor of six ahead of split (!), which is ahead of character-class regex by more than an order of magnitude.
The negated-character-class regex for such long strings beats the non-greedy (.+?) one by a factor of 3, good to know in its own right.
In all this the URI and Mojo objects were created once, ahead of time.
Benchmark code. I'd like to note that the details of these timings are far less important than the structure and quality of code.
use warnings;
use strict;
use feature 'say';
use URI;
use Mojo::Path;
use Benchmark qw(cmpthese);
my $runfor = shift // 3; #/
#my $path = '/' . 'a' x 10_000 . '/' . 'X' x 10_000;
my $path = q(/api/app/v1/method);
my $uri = URI->new($path);
my $mojo = Mojo::Path->new($path);
sub neg_cc {
my ($dir) = $path =~ m{ [^/]+ / ([^/]+) }x; return $dir; #/
}
sub non_greedy {
my ($dir) = $path =~ m{ .+? / (.+?) (?:/|$) }x; return $dir; #/
}
sub URI_path {
my $dir = ( $uri->path_segments )[2]; return $dir;
}
sub Mojo_path {
my $dir = $mojo->parts->[1]; return $dir;
}
sub just_split {
my $dir = ( split /\//, $path )[2]; return $dir;
}
cmpthese( -$runfor, {
neg_cc => sub { neg_cc($path) },
non_greedy => sub { non_greedy($path) },
just_split => sub { just_split($path) },
URI_path => sub { URI_path($path) },
Mojo_path => sub { Mojo_path($path) },
});
With a (10-second) run this prints, on a laptop with v5.16
                Rate URI_path Mojo_path non_greedy neg_cc just_split
URI_path    146731/s       --      -82%       -87%   -87%       -89%
Mojo_path   834297/s     469%        --       -24%   -28%       -36%
non_greedy 1098243/s     648%       32%         --    -5%       -16%
neg_cc     1158137/s     689%       39%         5%     --       -11%
just_split 1308227/s     792%       57%        19%    13%         --
One should keep in mind that the overhead of the function-call is very large for such a simple job, and in spite of Benchmark's work these numbers are probably best taken as a cursory guide.
Your pattern ([^\/api]\w*) consists of a capturing group and a negated character class that first matches one character that is not /, a, p or i.
After that 0+ times a word char will be matched. The pattern could for example only match a single char which is not listed in the character class.
What you might do is use a capturing group and match \w+
^/(?:api|web)/(\w+)/v1/method
Explanation
^ Start of string
(?:api|web) Non capturing group with alternation. Match either api or web
(\w+) Capturing group 1, match 1+ word chars
/v1/method Match literally as in your example data.

Sensethising domains

So I'm trying to put all numbered domains into one element of a hash doing this:
### Domains ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; # this regex gets the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with number
my $foo = $1;
$foo = "OTHER";
$domain{$foo}++;
}
else {
$domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently prints this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two groups of 1-3 digits separated by a dot
But if you only need to separate out the domains that are not numbered, you could just check whether the last character of the domain is a letter, because ordinary TLDs do not contain digits. You would do something like
/[a-z]$/i -- if this returns true, it is not a numbered domain (provided you've stripped spaces and newlines)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+\.[A-Za-z]+)$/)
{
#the domain is assigned to the variable $domain
} else {
#it is a number domain
}
Take what it currently puts, and use the regex:
/\d+\.\d+/
If it matches this, then it's a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This is saying that your string starts with one or more letter d followed by a period followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. Any string that matches /\d+\.\d+/ also contains this shorter pattern, so both regular expressions succeed on the same strings.
The
$dom =~ /(\w+\.\w+)$/;
matches only at the end of the string $dom (because of the $ anchor): one or more letters, digits, or underscores, then a period, then one or more letters, digits, or underscores. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This is setting $foo to whatever $dom is matching, but then immediately resets $foo to OTHER, and increments $domain{OTHER}.
We need a sample of your initial data, and maybe the actual routine that prints your output.
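Pulling the suggestions above together, here is a sketch of how the counting might work. It assumes that each name ends in two dot-separated labels and that a purely numeric ending should be counted under OTHER; the helper name count_domains and the four sample names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: tallies hosts by their last two labels,
# lumping purely numeric endings (e.g. "72.154") under OTHER.
sub count_domains {
    my @names = @_;
    my %domain;
    for my $name (@names) {
        # Take the last two dot-separated labels; skip names that have none.
        my ($tail) = $name =~ /(\w+\.\w+)$/ or next;
        # \d (not d) and anchors on both ends: the whole tail must be numeric.
        my $key = $tail =~ /^\d+\.\d+$/ ? 'OTHER' : $tail;
        $domain{$key}++;
    }
    return \%domain;
}

my $counts = count_domains(qw(
    something.something.72.154
    something.something.73.194
    something.something.cox.net
    something.something.com.br
));
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

The key is chosen before the hash is touched, so there is no need for the intermediate $foo variable from the question.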

regular expression url rewrite based on folder

I need to be able to take /calendar/MyCalendar.ics where MyCalendar.ics could be about anything with an ICS extension and rewrite it to /feeds/ics/ics_classic.asp?MyCalendar.ics
Thanks
Regular expressions are meant for searching/matching text. Usually you use a regex to tell a text manipulation tool what to search for, and then use a tool-specific syntax to tell it what to replace the match with.
Regex syntax uses round brackets to define capture groups inside the whole search pattern. Many search and replace tools use capture groups to define which part of the match to replace.
We can take the Java Pattern and Matcher classes as example. To complete your task with the Java Matcher you can use the following code:
Pattern p = Pattern.compile("/calendar/(.*\\.(?i)ics)");
Matcher m = p.matcher(url);
String rewritenUrl = "";
if(m.matches()){
rewritenUrl = "/feeds/ics/ics_classic.asp?" + url.substring( m.start(1), m.end(1));
}
This will find the requested pattern but will only take the first regex group for creating the new string.
Here is a link to regex replacement information in (imho) a very good regex information site: http://www.regular-expressions.info/refreplace.html
C:\x>perl foo.pl
Before: a=/calendar/MyCalendar.ics
After: a=/feeds/ics/ics_classic.asp?MyCalendar.ics
...or how about this way?
(regex kind of seems like overkill for this problem)
b=/calendar/MyCalendar.ics
index=9
c=MyCalendar.ics (might want to add check for ending with '.ics')
d=/feeds/ics/ics_classic.asp?MyCalendar.ics
Here's the code:
C:\x>type foo.pl
my $a = "/calendar/MyCalendar.ics";
print "Before: a=$a\n";
my $match = (
$a =~ s|^.*/([^/]+)\.ics$|/feeds/ics/ics_classic.asp?$1.ics|i
);
if( ! $match ) {
die "Expected path/filename.ics instead of \"$a\"";
}
print "After: a=$a\n";
print "\n";
print "...or how about this way?\n";
print "(regex kind of seems like overkill for this problem)\n";
my $b = "/calendar/MyCalendar.ics";
my $index = rindex( $b, "/" ); #find last path delim.
my $c = substr( $b, $index+1 );
print "b=$b\n";
print "index=$index\n";
print "c=$c (might want to add check for ending with '.ics')\n";
my $d = "/feeds/ics/ics_classic.asp?" . $c;
print "d=$d\n";
C:\x>
General thoughts:
If you do solve this with a regex, a semi-tricky bit is making sure your capture group (the parens) excludes the path separator.
Some things to consider:
Are your paths separators always forward-slashes?
Regex seems like overkill for this; the simplest thing I can think of is to get the index of your last path separator and do simple string manipulation (2nd part of sample program).
Libraries often have routines for parsing paths. In Java I'd look at the java.io.File object, for example, specifically
getName()
Returns the name of the file or directory denoted by
this abstract pathname. This is just the last name in
the pathname's name sequence

Getting just the file name from full path

I need to get from a full file path just the name of the file. I've tried to use:
$out_fname =~ s/[\/\w+\/]+//;
but it also "eats up" parts of the file name.
example:
for a file:
/bla/bla/folder/file.part.1.file,
it returned:
.part.1.file
You can do:
use File::Basename;
my $path = "/bla/bla/folder/file.part.1.file";
my $filename = basename($path);
Besides File::Basename, there's also Path::Class, which can be handy for more complex operations, particularly when dealing with directories, or cross-platform/filesystem operations. It's probably overkill in this case, but might be worth knowing about.
use Path::Class;
my $file = file( "/bla/bla/folder/file.part.1.file" );
my $filename = $file->basename;
I agree with the other answers, but just wanted to explain the mistake in your pattern. Regex is tricky, but worth it to learn well.
The square brackets define a character class. In your case, it will match a forward slash, a word character (from the \w), the + character, or the forward slash again (this is redundant). Then you are saying to match 1 or more of those. There are multiple strings that could match. It will match at the earliest starting character, so the first /. Then it will grab as much as possible.
This is not what you intended clearly. For example, if you had a . in one of your directory names, you would stop there. /blah.foo/bar/x.y.z would return .foo/bar/x.y.z.
The way to think of this is that you want to match all characters up to and including the final /.
All characters then slash: /.*\//
But to be safer, add a caret at front to make sure it starts there: /^.*\//
And to allow forward and backslashes, make a class for that: /^.*[\/\\]/ (i.e. elusive's answer).
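A short sketch of that final pattern in use, assuming the path should be left untouched and the file name returned separately. The helper name strip_dir is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the part of $path after the last / or \.
sub strip_dir {
    my ($path) = @_;
    # Greedy .* runs to the LAST separator, so dots in directory names are harmless.
    (my $name = $path) =~ s{^.*[/\\]}{};
    return $name;
}

print strip_dir('/bla/bla/folder/file.part.1.file'), "\n";   # prints "file.part.1.file"
print strip_dir('/blah.foo/bar/x.y.z'), "\n";                # prints "x.y.z"
```

A path with no separator at all comes back unchanged, because the substitution simply fails to match.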
A really good reference is Learning Perl. There are about 3 really good regex chapters. They are applicable to non-Perl regex users as well.
Using split on the directory separator is another alternative. This has the same caveats as using a regex (i.e. with filenames it's better to use a module where someone else has already thought about edge cases, portability, different filesystems, etc, and so you don't need matching on both back- and forward-slashes), but useful as another general technique where you have a string with a repeated separator.
my $file = "/bla/bla/folder/file.part.1.file";
my @parts = split /\//, $file;
my $filename = $parts[-1];
This is exactly what I would expect it to retain in the given substitution. You are saying replace the longest string of slashes and word characters with nothing. So it grabs all the characters up until the first character you didn't specify and deletes them.
It's doing what you are asking it to do. I join with others in saying use File::Basename for what you are trying to do.
But here is the quickest way to do the same thing:
my $fname = substr( $out_fname, rindex( $out_fname, '/' ) + 1 );
Here, it says find the last occurrence of '/' in the string and give me the text starting one after that position. I'm not anti-regex by any stretch, but it's a simple expression of what you actually want to do. I've had to do stuff like this for so long, I wrote a last_after sub:
sub last_after {
my ( $string, $delim ) = @_;
# Declare $ln here so it is still in scope for the substr call below.
my $ln = length( $delim // '' );
return $string // '' unless length( $string // '' ) and $ln;
my $ri = rindex( $string, $delim );
return $ri == -1 ? $string : substr( $string, $ri + $ln );
}
I also needed to pull just the last field from a bunch of path names. This worked for me:
grep -o '[^/]*$' inputfile > outputfile
What about this:
$out_fname =~ s/^.*[\/\\]//;
It should remove everything in front of your filename.

Regex line by line: How to match triple quotes but not double quotes

I need to check to see if a string of many words / letters / etc, contains only 1 set of triple double-quotes (i.e. """), but can also contain single double-quotes (") and double double-quotes (""), using a regex. Haven't had much success thus far.
A regex with negative lookahead can do it:
(?!.*"{3}.*"{3}).*"{3}.*
I tried it with these lines of java code:
String good = "hello \"\"\" hello \"\" hello ";
String bad = "hello \"\"\" hello \"\"\" hello ";
String regex = "(?!.*\"{3}.*\"{3}).*\"{3}.*";
System.out.println( good.matches( regex ) );
System.out.println( bad.matches( regex ) );
...with output:
true
false
Try using the number of occurrences operator to match exactly three double-quotes.
\"{3}
["]{3}
[\"]{3}
I've quickly checked using http://www.regextester.com/, seems to work fine.
How you correctly compile the regex in your language of choice may vary, though!
Depends on your language, but you should only need to match for three double quotes (e.g., /\"{3}/) and then count the matches to see if there is exactly one.
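In Perl, for instance, the count-the-matches idea could be sketched as follows. The helper name has_one_triple_quote is hypothetical; the counting idiom (assigning the match list to an empty list in scalar context) is standard Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains exactly one run of three double quotes.
sub has_one_triple_quote {
    my ($text) = @_;
    # In list context /g returns every match; "= () =" turns that into a count.
    my $count = () = $text =~ /"{3}/g;
    return $count == 1;
}

print has_one_triple_quote('hello """ hello "" hello')  ? "ok\n" : "not ok\n";   # prints "ok"
print has_one_triple_quote('hello """ hello """ hello') ? "ok\n" : "not ok\n";   # prints "not ok"
```

One caveat of this sketch: matches do not overlap, so a run of four or five quotes counts as one triple and a run of six counts as two. Whether that is acceptable depends on how such runs should be treated.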
There are probably plenty of ways to do this, but a simple one is to merely look for multiple occurrences of triple quotes then invert the regular expression. Here's an example from Perl:
use strict;
use warnings;
my $match = 'hello """ hello "" hello';
my $no_match = 'hello """ hello """ hello';
my $regex = '[\"]{3}.*?[\"]{3}';
if ($match !~ /$regex/) {
print "Matched as it should!\n";
}
if ($no_match !~ /$regex/) {
print "You shouldn't see this!\n";
}
Which outputs:
Matched as it should!
Basically, you are telling it to find the thing you DON'T want, then inverting the truth. Hope that makes sense. I can help you convert the example to another language if you need it.
This may be a good start for you.
^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$
See it in action at regex101.com.