Perl, regex, extract data from a line - regex

I'm trying to extract part of a line with Perl:
use strict;
use warnings;
# Set path for my.txt and extract datadir
my @myfile = "C:\backups\MySQL\my.txt";
my @datadir = "";
open READMYFILE, @myfile or die "Error, my.txt not found.\n";
while (<READMYFILE>) {
# Read file and extract DataDir path
if (/C:\backups/gi) {
push @datadir, $_;
}
}
# ensure the path was found
print @datadir . " \n";
Basically, I'm first trying to set the location of the my.txt file. Next I'm trying to read it and pull out part of a line with a regex. The error I'm getting is:
Unrecognized escape \m passed through
at 1130.pl line 17.
I took a look at "How can I grab multiple lines after a matching line in Perl?" to get an idea of how to read a file and match a line within it, however I'm not 100% sure I'm doing this right or in the best way. I also seem to produce the error:
Error, my.txt not found.
But the file does exist in the folder C:\backups\MySQL\

When Perl sees the string "C:\backups\MySQL\my.txt" it tries to parse any escape sequences, such as \n. But when it sees \m in \my.txt, it's an unrecognized escape sequence, hence the error.
One way to fix this is to properly escape your backslashes: "C:\\backups\\MySQL\\my.txt". Another way to fix this is to use single quotes instead of double quotes: 'C:\backups\MySQL\my.txt'. Yet another way is to use the q() construct: q(C:\backups\MySQL\my.txt).
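To check that the three spellings above really produce the same string, here is a minimal sketch. The variable names are only for illustration; the path is the one from the question.

```perl
use strict;
use warnings;

# Three ways to write the same Windows path (path taken from the question).
my $escaped = "C:\\backups\\MySQL\\my.txt";   # double quotes, backslashes doubled
my $single  = 'C:\backups\MySQL\my.txt';      # single quotes, backslashes kept as typed
my $q_form  = q(C:\backups\MySQL\my.txt);     # q() behaves like single quotes

# All three hold exactly the same characters.
my $all_equal = ($escaped eq $single && $single eq $q_form) ? 1 : 0;
print "all equal: $all_equal\n";
print "$single\n";
```

Single quotes are usually the least error-prone choice for a fixed Windows path, since nothing in the path needs doubling.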

Since there are several problems I'll put comments on the changes I've made in the code below.
use strict;
use warnings;
# For pretty dumping of arrays and what not.
use Data::Dumper;
# Use single quotes so you don't have to worry about escaping '\'s.
# Use a scalar ($) instead of an array (@) for storing the string.
my $myfile = 'C:\backups\MySQL\my.txt';
# No need to initialize the array.
my @datadir;
# I believe using a scalar is preferred for file handles.
# $! will contain the error if we couldn't open the file.
open(my $readmyfile, '<', $myfile) or die "error opening: $!";
while (<$readmyfile>) {
# You must escape '\'s by doubling them.
# If you are just testing to see if the line contains 'c:\backups' you do not
# need /g for the regex. /g is for repeating matches
if (/C:\\backups/i) {
push(@datadir, $_);
}
}
# Data::Dumper would be better for dumping the array for debugging.
# Dumper wants a reference to the array.
print Dumper(\@datadir);
Update:
If you're referring to the output from Data::Dumper, it's just there for a pretty representation of the array. If you need a specifically formatted output you'll have to code it. A start would be:
print "$_\n" for (@datadir);

Use forward slashes instead of backslashes.

Shouldn't you be using $myfile instead of @myfile? The latter gives you an array, and since open uses it in scalar context, it evaluates to the number of elements it holds (so it's actually trying to open a "file" called 1 instead of the actual filename).

The file is not being found because you are passing an array to open when it's expecting a scalar, so I'd guess that the array is being evaluated in a scalar context instead of as a list so you're actually telling perl to try opening the file named '1' instead of your 'my.txt' file.
Try something like this instead:
my $a = 'filename';
open FH, $a or die "Error, could not open $a: $!";
...
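A quick way to see what an array turns into in scalar context is the sketch below. It assumes, as the answer above suggests, that open evaluates its filename argument in scalar context; the variable names are illustrative.

```perl
use strict;
use warnings;

# An array holding one element, as in the question's code.
my @myfile = ('C:\backups\MySQL\my.txt');

# In scalar context an array yields its element count, not its contents.
my $in_scalar_context = scalar(@myfile);
print "scalar context gives: $in_scalar_context\n";

# Taking the first element explicitly gives the path back.
my $first_element = $myfile[0];
print "first element is: $first_element\n";
```

Declaring the filename as a plain scalar avoids the problem entirely.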

As other people have said, part of the issue is using " " rather than ' ' type of quoting.
I try always to use ' ' unless I know I need to include an escape or interpolate a variable.
Here are a number of pitfalls
use 5.10.0 ;
use warnings ;
say "file is c:\mydir" ;
say "please pay $100 ";
say "on VMS the system directory is sys$system" ;
say "see you @5 ";
With double quotes
Unrecognized escape \m passed through at (eval 1) line 2.
Possible unintended interpolation of @5 in string at (eval 1) line 5.
file is c:mydir
Use of uninitialized value $100 in concatenation (.) or string at (eval 1) line 3.
please pay
Use of uninitialized value $system in concatenation (.) or string at (eval 1) line 4.
on VMS the system directory is sys
see you
With single quotes
file is c:\mydir
please pay $100
on VMS the system directory is sys$system
see you @5

Related

Why isn't this regex executing?

I'm attempting to convert my personal wiki from Foswiki to Markdown files and then to a JAMstack deployment. Foswiki uses flat files and stores metadata in the following format:
%META:TOPICINFO{author="TeotiNathaniel" comment="reprev" date="1571215308" format="1.1" reprev="13" version="14"}%
I want to use a git repo for versioning and will worry about linking that to article metadata later. At this point I simply want to convert these blocks to something that looks like this:
---
author: Teoti Nathaniel
revdate: 1539108277
---
After a bit of tweaking I have constructed the following regex:
author\=\['"\]\(\\w\+\)\['"\]\(\?\:\.\*\)date\=\['"\]\(\\w\+\)\['"\]
According to regex101 this works and my two capture groups contain the desired results. Attempting to actually run it:
perl -0777 -pe 's/author\=\['"\]\(\\w\+\)\['"\]\(\?\:\.\*\)date\=\['"\]\(\\w\+\)\['"\]/author: $1\nrevdate: $2/gms' somefile.txt
gets me only this:
>
My previous attempt (which breaks if the details aren't in a specific order) looked like this and executed correctly:
perl -0777 -pe 's/%META:TOPICINFO\{author="(.*)"\ date="(.*)"\ format="(.*)"\ (.*)\}\%/author:$1 \nrevdate:$2/gms' somefile.txt
I think that this is an escape character problem but can't figure it out. I even went and found this tool to make sure that they are correct.
Brute-forcing my way to understanding here is feeling both inefficient and frustrating, so I'm asking the community for help.
The first major problem is that you're trying to use a single quote (') in the program, when the program is being passed to the shell in single quotes.
Escape any instance of ' in the program by using '\''. You could also use \x27 if the quote happens to be in a double-quoted string literal or regex literal (as is the case for every instance in your program).
perl -0777pe's/author=['\''"].../.../gs'
perl -0777pe's/author=[\x27"].../.../gs'
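As a rough illustration of the \x27 approach, here is a small script using the sample line from the question. The pattern is a simplified stand-in, not the asker's exact regex, and the variable names are made up.

```perl
use strict;
use warnings;

# Sample line in the Foswiki metadata format from the question.
my $line = '%META:TOPICINFO{author="TeotiNathaniel" comment="reprev" date="1571215308" format="1.1" reprev="13" version="14"}%';

# [\x27"] matches either a single quote (hex 27) or a double quote,
# without a literal ' ever appearing in the program text.
my ($author, $date);
if ($line =~ /author=[\x27"](\w+)[\x27"].*?date=[\x27"](\w+)[\x27"]/s) {
    ($author, $date) = ($1, $2);
}

print "author: $author\nrevdate: $date\n";
```

Because the pattern contains no literal single quote, the same regex can be pasted into a one-liner wrapped in shell single quotes without further escaping.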
I would try to break it down into a clean data structure and then process it. By separating the data processing from the printing, you can modify it to add extra data later. It also makes it far more readable. Please see the example below:
#!/usr/bin/env perl
use strict;
use warnings;
## yaml to print the data, not required for operation
use YAML::XS qw(Dump);
my $yaml;
my @lines = '%META:TOPICINFO{author="TeotiNathaniel" comment="reprev" date="1571215308" format="1.1" reprev="13" version="14"}%';
for my $str (@lines )
{
### split line into component parts
my ( $type , $subject , $data ) = $str =~ /\%(.*?):(.*?)\{(.*)\}\%/;
## break data in {} into a hash
my %info = map( split(/=/), split(/\s+/, $data) );
## strip quotes if any exist
s/^"(.*)"$/$1/ for values %info;
#add to data structure
$yaml->{$type}{$subject} = \%info;
}
## yaml to print the data, not required for operation
print Dump($yaml);
## loop data and print
for my $t (keys %{ $yaml } ) {
for my $s (keys %{ $yaml->{$t} } ) {
print "-----------\n";
print "author: ".$yaml->{$t}{$s}{"author"}."\n";
print "date: ".$yaml->{$t}{$s}{"date"}."\n";
}
}
Ok, I kept fooling around with it by reducing the execution to a single term and expanding. I soon got to here:
$ perl -0777 -pe 's/author=['\"]\(\\w\+\)['"](?:.*)date=\['\"\]\(\\w\+\)\['\"\]/author\: \$1\\nrevdate\: \$2/gms' somefile.txt
Unmatched [ in regex; marked by <-- HERE in m/author=["](\w+)["](?:.*)date=\["](\w+)[ <-- HERE \"\]/ at -e line 1.
This eventually got me to here:
perl -0777 -pe 's/author=['\"]\(\\w\+\)['"](?:.*)date=['\"]\(\\w\+\)['\"]/\nauthor\ $1\nrevdate\:$2\n/gms' somefile.txt
Which produces a messy output but works. (Note: the output is a proof of concept; this can now be used within a Python script to programmatically generate Markdown metadata.)
Thanks for being my rubber duckie, StackOverflow. Hopefully this is useful to someone, somewhere, somewhen.

Perl regex extracting a match using braces

I tested the following code
#! /usr/bin/perl
use strict;
use English;
#this code extracts the current scripts filename
#by removing the path from the filepath
my $Script_Name = $PROGRAM_NAME;
${Script_Name} =~ s/^.*\\//; #windows path
#${Script_Name} =~ s/^.*\///; #Unix based path
print $Script_Name;
and I don't understand why these braces extract the match without using a /r modifier. Can anyone explain why and how this works or point me to some documentation?
You're getting a little confused!
The braces make no difference. ${Script_Name} is identical to $Script_Name.
Your code first copies the entire path to the script file from $PROGRAM_NAME to $Script_Name.
Then the substitution removes everything up to and including the last backslash, leaving just the file name.
The /r modifier would be used if you wanted to modify one string and put the result of the modification into another, so you could write your code in one step as
$Script_Name = $PROGRAM_NAME =~ s/^.*\\//r;
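Here is a small side-by-side sketch of the two styles. The path is a made-up stand-in for $PROGRAM_NAME, and /r needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical Windows-style script path standing in for $PROGRAM_NAME.
my $path = 'C:\scripts\tools\report.pl';

# In-place: copy first, then let s/// modify the copy.
my $in_place = $path;
$in_place =~ s/^.*\\//;

# Non-destructive: /r returns the modified string and leaves $path alone.
my $returned = $path =~ s/^.*\\//r;

print "in place: $in_place\n";
print "with /r:  $returned\n";
print "original: $path\n";
```

Both approaches end up with just the file name; the only difference is whether the variable on the left of =~ gets changed.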

Perl regex substitution not working with global modifier

I have code that looks like the following:
s/(["\'])(?:\\?+.)*?\1/(my $x = $&) =~ s|^(["\'])(.*src=)([\'"])\/|$1$2$3$1.\\$baseUrl.$1\/|g;$x/ge
Ignoring the last bit (and only leaving the part where the problems occur) the code becomes:
s/(["\'])(?:\\?+.)*?\1/replace-text-here/g
I have tried using both, but I still get the same problem, which is that even though I am using the g modifier, this regex only matches and replaces the first occurrence. If this is a Perl bug, I don't know, but I was using a regex that matches everything between two quotes, and also handles escaped quotes, and I was following this blog post. In my eyes, that regex should match everything between the two quotes, then replace it, then try and find another instance of this pattern, because of the g modifier.
For a bit of background information, I am not using any version declarations, and strict and warnings are turned on, yet no warnings have shown up. My script reads an entire file into a scalar (including newlines) then the regex operates directly on that scalar. It does seem to work on each line individually - just not multiple times on one line. Perl version 5.14.2, running on Cygwin 64-bit. It could be that Cygwin (or the Perl port) is messing something up, but I doubt it.
I also tried another example from that blog post, with atomic groups and possessive quantifiers replaced with equivalent code but without those features, but this problem still plagued me.
Examples:
<?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?>
Should become (with the shortened regex):
<?php echo ($watched_dir->getExistsFlag())?replace-text-here:replace-text-here?>
Yet it only becomes:
<?php echo ($watched_dir->getExistsFlag())?replace-text-here:"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?>
<?php echo ($sub->getTarget() != "")?"target=\"".$sub->getTarget()."\"":""; ?>
Should become:
<?php echo ($sub->getTarget() != replace-text-here)?replace-text-here.$sub->getTarget().replace-text-here:replace-text-here; ?>
And as above, only the first occurrence is changed.
(And yes, I do realise that this will spark into some sort of - don't use regex for parsing HTML/PHP. But in this case I think that regex is more appropriate, as I am not looking for context, I am looking for a string (anything within quotes) and performing an operation on that string - which is regex.)
And just a note - these regexes are running in an eval function, and the actual regex is encoded in a single quoted string (which is why the single quotes are escaped). I will try any presented solution directly though to rule out my bad programming.
EDIT: As requested, a short script that presents the problems:
#!/usr/bin/perl -w
use strict;
my $data = "this is the first line, where nothing much happens
but on the second line \"we suddenly have some double quotes\"
and on the third line there are 'single quotes'
but the fourth line has \"double quotes\" AND 'single quotes', but also another \"double quote\"
the fifth line has the interesting one - \"double quoted string 'with embedded singles' AND \\\"escaped doubles\\\"\"
and the sixth is just to say - we need a new line at the end to simulate a properly structured file
";
my $regex = 's/(["\'])(?:\\?+.)*?\1/replaced!/g';
my $regex2 = 's/([\'"]).*?\1/replaced2!/g';
print $data."\n";
$_ = $data; # to make the regex operate on $_, as per the original script
eval($regex);
print $_."\n";
$_ = $data;
eval($regex2);
print $_; # just an example of an eval, but without the fancy possessive quantifiers
This produces the following output for me:
this is the first line, where nothing much happens
but on the second line "we suddenly have some double quotes"
and on the third line there are 'single quotes'
but the fourth line has "double quotes" AND 'single quotes', but also another "double quote"
the fifth line has the interesting one - "double quoted string 'with embedded singles' AND \"escaped doubles\""
and the sixth is just to say - we need a new line at the end to simulate a properly structured file
this is the first line, where nothing much happens
but on the second line "we suddenly have some double quotes"
and on the third line there are 'single quotes'
but the fourth line has "double quotes" AND 'single quotes', but also another "double quote"
the fifth line has the interesting one - "double quoted string 'with embedded singles' AND \"escaped doubles\replaced!
and the sixth is just to say - we need a new line at the end to simulate a properly structured file
this is the first line, where nothing much happens
but on the second line replaced2!
and on the third line there are replaced2!
but the fourth line has replaced2! AND replaced2!, but also another replaced2!
the fifth line has the interesting one - replaced2!escaped doubles\replaced2!
and the sixth is just to say - we need a new line at the end to simulate a properly structured file
Even within single-quotes, \\ gets processed as \, so this:
my $regex = 's/(["\'])(?:\\?+.)*?\1/replaced!/g';
sets $regex to this:
s/(["'])(?:\?+.)*?\1/replaced!/g
which requires each character in the quoted-string to be preceded by one or more literal question-marks (\?+). Since you don't have lots of question-marks, this effectively means that you're requiring the string to be empty, either "" or ''.
The minimal fix is to add more backslashes:
my $regex = 's/(["\'])(?:\\\\?+.)*?\\1/replaced!/g';
but you really might want to rethink your approach. Do you really need to save the whole regex-replacement command as a string and run it via eval?
Update: this:
my $regex = 's/(["\'])(?:\\?+.)*?\1/replaced!/g';
should be:
my $regex = 's/(["\'])(?:\\\\?+.)*?\1/replaced!/g';
since those single quotes there in the assignment turn \\ into \ and you want the regex to end up with \\.
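To see the effect of single quotes on backslashes directly, you can measure the strings. This is just a sketch with made-up variable names, using small fragments rather than the full regex from the question.

```perl
use strict;
use warnings;

# Inside single quotes, \\ collapses to one backslash and \' to a quote;
# every other backslash is kept as written.
my $two_typed  = '\\?';     # typed: two backslashes and a question mark
my $four_typed = '\\\\?';   # typed: four backslashes and a question mark

printf "%-8s has length %d\n", $two_typed,  length($two_typed);
printf "%-8s has length %d\n", $four_typed, length($four_typed);

# As a regex, \? means a literal question mark, while \\? means an optional backslash.
my $literal_qmark = ('a?b' =~ /^a$two_typed/)     ? 1 : 0;
my $optional_bs   = ('a\b' =~ /^a${four_typed}b/) ? 1 : 0;
print "matches literal '?': $literal_qmark\n";
print "matches optional backslash: $optional_bs\n";
```

Printing the string before handing it to eval is a cheap way to confirm that the regex Perl will actually compile is the one you meant to write.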
Please boil your problem down to a short script that demonstrates the problem (including input, bad output, eval and all). Taking what you do show and trying it:
use strict;
use warnings;
my $input = <<'END';
<?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?>
END
(my $output = $input) =~ s/(["\'])(?:\\?+.)*?\1/replace-text-here/g;
print $input,"becomes\n",$output;
produces for me:
<?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?>
becomes
<?php echo ($watched_dir->getExistsFlag())?replace-text-here:replace-text-here?>
as I would expect. What does it do for you?

How to trim the file modification value from SVN log output with PowerShell

I have an SVN log being captured in PowerShell which I am then trying to modify and strip off everything except the file URL. The problem I am having is getting a regex to remove everything before the file URL. My entry is matched as:
M /trunk/project/application/myFile.cs
There are two spaces at the beginning which originally I was trying to replace with a Regex but that did not seem to work, so I use a trim and end up with:
M /trunk/project/application/myFile.cs
Now I want to get rid of the File status indicator so I have a regular expression like:
$entry = $entry.Replace("^[ADMR]\s+","")
Where $entry is the matched file URL but this doesn't seem to do anything, even removing the caret to just look for the value and space did not do anything. I know that $entry is a string, I originally thought Replace was not working as $entry was not a string, but running Get-Member during the script shows I have a string type. Is there something special about the svn file indicator or is the regex somehow off?
Given your example string:
$entry = 'M /trunk/project/application/myFile.cs'
$fileURL = ($entry -split ' /')[1]
Your regex doesn't work because string.Replace just does a literal string replacement and doesn't know about regexes. You'd probably want [Regex]::Replace or just the -replace operator.
But when using SVN with PowerShell, I'd always go with the XML format. SVN allows a --xml option to all commands which then will output XML (albeit invalid if it dies in between).
E.g.:
$x = [xml](svn log -l 3 --verbose --xml)
$x.log.logentry|%{$_.paths}|%{$_.path}|%{$_.'#text'}
will give you all paths.
But if you need a regex:
$entry -replace '^.*?\s+'
which will remove everything up to (and including) the first sequence of spaces which has the added benefit that you don't need to remember what characters may appear there, too.

Perl: Grabbing the nth and mth delimited words from each line in a file

Because of the more tedious way of adding hosts to be monitored in Nagios (it requires defining a host object, as opposed to the previous program which only required the IP and hostname), I figured it'd be best to automate this, and it'd be a great time to learn Perl, because all I know at the moment is C/C++ and Java.
The file I read from looks like this:
xxx.xxx.xxx.xxx hostname #comments. i.dont. care. about
All I want are the first 2 bunches of characters. These are obviously space delimited, but for the sake of generality, it might as well be anything. To make it more general, why not the first and third, or fourth and tenth? Surely there must be some regex action involved, but I'll leave that tag off for the moment, just in case.
The one-liner is great, if you're not writing more Perl to handle the result.
More generally though, in the context of a larger Perl program, you would either write a custom regular expression, for example:
if($line =~ m/(\S+)\s+(\S+)/) {
$ip = $1;
$hostname = $2;
}
... or you would use the split operator.
my @arr = split(/ /, $line);
$ip = $arr[0];
$hostname = $arr[1];
Either way, add logic to check for invalid input.
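As one possible shape for that validation, here is a sketch built on split. The helper name parse_host_line and the sample lines are hypothetical, and it assumes that # starts a comment as in the file shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return (ip, hostname) from a line, or an empty list
# when the line does not contain at least two fields.
sub parse_host_line {
    my ($line) = @_;
    $line =~ s/#.*//;                        # drop trailing comments
    my ($ip, $hostname) = split ' ', $line;  # first two whitespace-separated fields
    return unless defined $ip && defined $hostname;
    return ($ip, $hostname);
}

for my $line ("192.0.2.10 web01 # main web server\n", "# only a comment\n", "orphan\n") {
    if (my ($ip, $hostname) = parse_host_line($line)) {
        print "ip=$ip hostname=$hostname\n";
    }
    else {
        print "skipped invalid line\n";
    }
}
```

Returning an empty list for bad input lets the caller test the assignment itself, so no separate error flag is needed.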
Let's turn this into code golf! Based on David's excellent answer, here's mine:
perl -ane 'print "@F[0,1]\n";'
Edit: A real golf submission would look more like this (shaving off five strokes):
perl -ape '$_="@F[0,1]
"'
but that's less readable for this question's purposes. :-P
Here's a general solution (if we step away from code-golfing a bit).
#!/usr/bin/perl -n
chomp; # strip newline (in case next line doesn't strip it)
s/#.*//; # strip comments
next unless /\S/; # don't process line if it has nothing (left)
@fields = (split)[0,1]; # split line, and get wanted fields
print join(' ', @fields), "\n";
Normally split splits by whitespace. If that's not what you want (e.g., parsing /etc/passwd), you can pass a delimiter as a regex:
@fields = (split /:/)[0,2,4..6];
Of course, if you're parsing colon-delimited files, chances are also good that such files don't have comments and you don't have to strip them.
A simple one-liner is
perl -nae 'print "$F[0] $F[1]\n";'
you can change the delimiter with -F
David Nehme said:
perl -nae 'print "$F[0] $F[1]\n";'
which uses the -a switch. I had to look that one up:
-a turns on autosplit mode when used with a -n or -p. An implicit split
command to the @F array is done as the first thing inside the implicit
while loop produced by the -n or -p.
you learn something every day. -n causes each line to be passed to
LINE:
while (<>) {
... # your program goes here
}
And finally -e is a way to directly enter a single line of a program. You can have more than one -e. Most of this was a rip of the perlrun(1) manpage.
Since ray asked, I thought I'd rewrite my whole program without using Perl's implicitness (except the use of <ARGV>; that's hard to write out by hand). This will probably make Python people happier (braces notwithstanding :-P):
while (my $line = <ARGV>) {
chomp $line;
$line =~ s/#.*//;
next unless $line =~ /\S/;
@fields = (split ' ', $line)[0,1];
print join(' ', @fields), "\n";
}
Is there anything I missed? Hopefully not. The ARGV filehandle is special. It causes each named file on the command line to be read, unless none are specified, in which case it reads standard input.
Edit: Oh, I forgot. split ' ' is magical too, unlike split / /. The latter just matches a space. The former matches any amount of any whitespace. This magical behaviour is used by default if no pattern is specified for split. (Some would say, but what about /\s+/? ' ' and /\s+/ are similar, except for how whitespace at the beginning of a line is treated. So ' ' really is magical.)
The moral of the story is, Perl is great if you like lots of magical behaviour. If you don't have a bar of it, use Python. :-P
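The difference between split ' ', split / / and split /\s+/ described above is easy to see on a line with leading whitespace. This is a small sketch with made-up sample data.

```perl
use strict;
use warnings;

# A sample line with leading whitespace and runs of spaces (made-up data).
my $line = "  10.0.0.1   myhost  # comment";

my @magic  = split ' ',   $line;   # skips leading whitespace, splits on runs
my @spaces = split / /,   $line;   # splits on every single space
my @regex  = split /\s+/, $line;   # splits on runs, but keeps a leading empty field

printf "split ' '   : %d fields, first is '%s'\n", scalar(@magic),  $magic[0];
printf "split / /   : %d fields, first is '%s'\n", scalar(@spaces), $spaces[0];
printf "split /\\s+/ : %d fields, first is '%s'\n", scalar(@regex),  $regex[0];
```

Only the ' ' form puts the IP address in the first field here; the other two start with an empty string because of the leading spaces.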
To Find Nth to Mth Character In Line No. L --- Example For Finding Label
@echo off
REM Next line = Set command value to a file OR Just Choose Your File By Skipping The Line
vol E: > %temp%\justtmp.txt
REM vol E: = find the volume label of drive E
REM Next line chooses which line to read: more +0 starts at line 1
for /f "usebackq delims=" %%a in (`more +0 %temp%\justtmp.txt`) DO (set findstringline=%%a& goto :nextstep)
:nextstep
REM Next line extracts a substring: skip the first 22 characters, then take up to 40 characters
set result=%findstringline:~22,40%
echo %result%
pause
exit /b
Save as find label.cmd
The Result Will Be Your Drive E Label
Enjoy