Perl: Regular Expression Matching

Say I have a fixed variable:
$f_variable = "hello.exe";
and I want to search through a file line by line and find the path that contains this word, for example:
/Desktop/Downloads/hello.exe
or
/Desktop/Downloads/hello_qwdqd.exe
Let's say that the file extension can be either exe or ex,
and I wrote this line of code:
if ($line =~ m/(\/$f_variable.*\.[exe]+)
Obviously it won't work, because the line of code above is actually:
if ($line =~ m/(\/hello.exe.*\.[exe]+)
which will not match anything.
So my question is what changes should I make in order to match and capture the whole path properly without changing the value of $f_variable?

What you wrote isn't even a complete regex. Though you've not said clearly what's in the file (Are the lines just paths?), you probably want something like:
my $pattern = $f_variable;
$pattern =~ s/\.exe?$//;
if ($line =~ m{(/\S*\Q$pattern\E[^/]*\.exe?)}) {
print "$1\n";
}
This removes the .ex or .exe suffix from the file name to get the base name, then captures the first string that starts with a slash, contains the base name (preceded by any non-space characters and followed by any non-slash characters), and ends in .ex or .exe. The \Q...\E pair escapes any regex metacharacters in the base name.
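As a minimal sketch of the approach described above, the two steps can be wrapped in a small helper. The name find_path and the sample lines are assumptions made for illustration; they are not part of the original answer.

```perl
use strict;
use warnings;

# find_path is a hypothetical helper, not part of the original answer.
# It returns the first path in $line whose file name contains the base
# name of $name and ends in .ex or .exe, or undef when nothing matches.
sub find_path {
    my ($line, $name) = @_;
    (my $base = $name) =~ s/\.exe?$//;    # "hello.exe" -> "hello"
    return $line =~ m{(/\S*\Q$base\E[^/]*\.exe?)} ? $1 : undef;
}

# Sample lines taken from the question.
for my $line ('/Desktop/Downloads/hello.exe', '/Desktop/Downloads/hello_qwdqd.ex') {
    my $path = find_path($line, 'hello.exe');
    print defined $path ? "$path\n" : "no match\n";
}
```

Because $f_variable is copied before the suffix is stripped, its value stays unchanged, which is what the question asked for.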

Related

Search and replace a special character in perl

I want to search for a character and replace it with a string. First, I search for ':' and replace it with 'to'. Next, I want to search for '$' and replace it with 'END'. This is the code that I've tried. It works for the first character but not for the second. I tried using a backslash to escape the special character '$', but it still did not work. What else can I do?
$string = "[9:8],
if ($string =~ /^.*:+/){
$stringreplaced =~ s/:/to/g;
}
elsif ($string =~ /^.*\$+/){
$stringreplaced =~ s/\$/END/g;
}

First of all, the code you posted doesn't even compile, yet you say it actually ran. Only post code that you've run.
Second, you're matching against the wrong string. You're checking if $string contains the character, but you replace the characters in $stringreplaced. ALWAYS use use strict; use warnings;. This would have caught this error.
Third, you only check if the character (: or $) is on the first line. This is because . doesn't match line feeds without /s.
Finally, you only check whether the string contains $ when it doesn't contain :, because you used elsif.
The following is all you need:
$string =~ s/:/to/g;
$string =~ s/\$/END/g;
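A minimal runnable sketch of those two substitutions under strict and warnings follows. The wrapper name replace_specials and the sample input are assumptions, since the string in the question is truncated.

```perl
use strict;
use warnings;

# replace_specials is a hypothetical wrapper around the two substitutions.
sub replace_specials {
    my ($string) = @_;
    $string =~ s/:/to/g;      # every ':' becomes 'to'
    $string =~ s/\$/END/g;    # '\$' is a literal dollar sign, not the end-of-line anchor
    return $string;
}

# Assumed sample input; single quotes keep '$' from being interpolated.
print replace_specials('[9:8], [7:$]'), "\n";    # prints "[9to8], [7toEND]"
```

No preliminary match test is needed: a substitution that finds nothing simply leaves the string unchanged.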

Resolve Perl error: "Use of uninitialized value"

To clarify the following post, we have an automation requirement to send shipping information to an online platform so users can track their orders. We receive a daily .csv file through email, we have to extract the unique Shopify order reference from a field (last 10 digits of a field), save the amended .csv file and upload to an FTP site so tracking references can be matched to the specific order.
A previous colleague wrote an application in Perl to handle this, however it has not worked and I have no experience with Perl at all!
The program is called by a "Watcher" monitoring for files, the code for this is as follows:
use strict;
use warnings;
use Datatools::Watcher;
my $hotfolder = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT';
my $process = '"C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl"';
my @backup = ('\\gen-svr-01\users\DATA\MW\DMO_Report_IO\ARCHIVE');
watcher($hotfolder,$process,\@backup);
The main code (PERL PROGRAM) is:
use strict;
use warnings;
use File::Copy;
use Datatools::Watcher;
my $output = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\OUTPUT';
my $desthotfolder = '\\gen-svr-01\users\DATA\MW\Data_TO_MWS_FTP_TEST';
my $shopifyPos = 0;
my $shopifyNew = "";
my $header = 1;
my $inputfile = $ARGV[0];
my ($path,$file,$extention) = $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
my $outputfilename = $file . "_FORMATTED" . $extention;
$outputfilename =~ s/.~#~//;
my $outputfile = "$output\\$outputfilename";
open (INPUT, $inputfile) or die "Could not open input file: $inputfile\n";
open (OUTPUT, ">$outputfile") or die "Could not open output file: $outputfile\n";
while (my $record = <INPUT>){
chomp $record;
my @field = parse_csv($record);
if ($header == 1){
print OUTPUT $record . "\n";
$header = 0;
next;
} else {
$shopifyNew = substr $field[$shopifyPos], -10;
splice (@field, 0, 1, $shopifyNew);
print OUTPUT join(',',@field) . "\n";
next;
}
}
close INPUT;
close OUTPUT;
my $destfile = "$desthotfolder\\$outputfilename";
move $outputfile, $destfile or die "Could not move output file: $outputfile\nto: $destfile\n";
print "\nProcessing complete\n";
sub parse_csv {
my ($shift) = @_;
my $text = $shift; # record containing comma-separated values
my @new = ();
push(@new, $+) while $text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text, -1,1) eq ',';
return @new; # list of values that were comma-separated
}
When the program runs, the "Watcher" details the following:
File Seen, Processing File \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csvUse of uninitialized value $file in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Use of uninitialized value $extention in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Processing complete
Line 47 refers to the following code:
my $outputfilename = $file . "_FORMATTED" . $extention;
In the output folder, there is a file with the name "_FORMATTED" (no file extensions)
I have looked for a solution, and from my limited understanding I don't think the variables: file and extension are being defined, but I have no idea how to correct!

It would help to know which is line 47 in this code. I assume it's this line:
my $outputfilename = $file . "_FORMATTED" . $extention;
So, at this point, $file and $extention are both uninitialised. They are both supposed to be initialised in the previous line:
my ($path,$file,$extention) =
$inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
So it seems that your $inputfile doesn't match the regex. This leaves us with two options:
$inputfile isn't being set at all (which would mean it isn't being passed to the program).
$inputfile isn't in the correct format to match the regex.
To work out which of the problems we have here, add the following validation lines before the line which tries to set $file and $extention:
die "No input file given\n" unless $inputfile;
die "Input file name ($inputfile) is the wrong format\n"
unless $inputfile =~ / \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
Update: From recent updates to your question, I can see that you are running the program and passing it the filename \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv.
Let's take a closer look at your regex.
m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms
The /x option at the end means that the regex compiler ignores any literal whitespace in the pattern. So we can do the same. Let's break down what the individual parts are trying to match:
\A : matches the start of the string
(.+\/) : matches anything up to and including the last / in your string. It captures the matched substring into $1. This is what is stored in $path in your code. It's the directory that your file is in.
(.+\d\d\d\d) : This matches one or more of any character followed by four digits. This is stored in $2 and in your code it ends up in $file. It's the main part of the filename.
.+ : Matches one or more characters. Any characters. Your code does nothing with these characters.
([.]\w{3}) : Matches a dot followed by three "word" characters (basically alphanumerics). This is captured into $3 and ends up in your $extention variable.
\z : Matches the end of the string.
Putting all that together, you have a regex that looks for filenames and splits them into three parts - the path, the name and the extension. The only complication is that the filename section needs to contain four consecutive digits. And your filename is OrderTracking - which doesn't contain those required digits. So the regex doesn't match and your variables don't get set.
When this program was written, it was assumed that the filenames would contain four digits. The files that you are trying to process do not contain digits, so the program fails.
We can't suggest how you fix this. You need to speak to the people who supply your input files and find out why they have started to send you files with a different name format. Once you know that, you can decide on the best approach to work round the problem.
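To see the breakdown in action, here is a minimal sketch that runs the regex from the question against two names. The wrapper split_name and the file name Orders_20230101_v1.csv are assumptions chosen only to show a name that does contain four consecutive digits; the real naming scheme is not known.

```perl
use strict;
use warnings;

# split_name is a hypothetical wrapper around the regex from the question.
# It returns (path, file, extension), or an empty list when there is no match.
sub split_name {
    my ($inputfile) = @_;
    return $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
}

# The first name is an assumed example with four digits; the second has none.
for my $name ('C:/in/Orders_20230101_v1.csv', 'C:/in/OrderTracking.csv') {
    my ($path, $file, $ext) = split_name($name);
    print defined $file
        ? "$name: path=$path file=$file ext=$ext\n"
        : "$name: no match\n";
}
```

Checking the result before using $file and $extention, as in the die lines suggested earlier, turns the silent "_FORMATTED" output file into an immediate, readable error.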

How to get all matching occurrences in perl regex

I have a file of the following:
Question:What color is the sky?
Explanation:The sky reflects the ocean.
Question:Why did the chicken cross the road?
Explanation:He was hungry.
What I'm trying to obtain is a list of ("What color is the sky?", "Why did the chicken cross the road?")
I'm trying to use perl regex to parse this file, but with no luck.
I have the entire contents of my file in a string called $file, and this is what I'm trying
my @questions = ($file =~ /Question:(.*)\n/g);
But this always just returns the entire $file string to me.

Your (.*) is greedy, so it matches everything up to the \n. If that returns more than you expect, it is probably a result of how you are reading the string.
You can add a ? after the quantifier to make the match non-greedy.
So try
my @questions = ($file =~ /Question:(.*?\?)/g);
Notice that I escaped the question mark as \?, so the regex matches up to and including the literal question mark.
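As a minimal sketch, here is that match run against the sample data from the question. The way $file is built here is an assumption; in the question it is read from a file.

```perl
use strict;
use warnings;

# Sample data from the question, assembled as one string for illustration.
my $file = "Question:What color is the sky?\n"
         . "Explanation:The sky reflects the ocean.\n"
         . "Question:Why did the chicken cross the road?\n"
         . "Explanation:He was hungry.\n";

# In list context, a /g match returns the capture from every match.
my @questions = $file =~ /Question:(.*?\?)/g;

print "$_\n" for @questions;
```

The array on the left-hand side is what puts the match in list context; assigning to a scalar instead would only report whether a match was found.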

Putting the whole file in a variable will use too much memory if the file is large. A better way is to process the file line by line.
For example, you could do something like
my @questions;
while (<>) {
chomp;
if (m/Question:(.*)/) {
push @questions, $1;
}
}
Some explanations:
I/O Operators of perlop:
Input from <> comes either from standard input, or from each file listed on the command line.
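The same line-by-line idea also works with an explicit filehandle instead of <>. This is a minimal sketch; the helper name collect_questions is hypothetical, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# collect_questions is a hypothetical helper that reads any filehandle
# line by line, so only one line is held in memory at a time.
sub collect_questions {
    my ($fh) = @_;
    my @questions;
    while (my $line = <$fh>) {
        chomp $line;
        push @questions, $1 if $line =~ /^Question:(.*)/;
    }
    return @questions;
}

# An in-memory filehandle stands in for a real file here.
my $text = "Question:What color is the sky?\nExplanation:The sky reflects the ocean.\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print "$_\n" for collect_questions($fh);
close $fh;
```

Passing the handle as an argument keeps the parsing separate from where the data comes from, so the same sub can be pointed at a file, STDIN, or a test string.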

Parenthesis text capturing (Perl RegEx)

I'm back with a follow-up to this question.
Let's assume I have the text
====Example 1====
Some text that I want to get that
may include line breaks
or special ~!@#$%^&*() characters
====Example 2====
Some more text that I don't want to get.
and use $output = ($text =~ ====Example 1====\s*(.*?)\s*====); to try and get everything from "====Example 1====" to the four equal signs right before "Example 2".
Based on what I've seen on this site, regexpal.com, and by running it myself, Perl finds and matches the text, but $output remains null or is assigned "1". I'm pretty sure that I'm doing something wrong with the capturing parenthesis, but I can't figure out what. Any help would be appreciated.
My full code is:
$text = "====Example 1====\n
Some text that I want to get this text\n
may include line breaks\n
or special ~!@#$%^&*() characters\n
\n
====Example 2====]\n
Some more filler text that I don't want to get.";
my ($output) = $text =~ /====Example 1====\s*(.*?)\s*====/;
die "un-defined" unless defined $output;
print $output;

Try with parentheses to force list context, and use /s when matching so . can also match newlines,
my ($output) = $text =~ / /s;
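A minimal sketch of why the context matters, using a shortened sample text (the text and the variable name $flag are assumptions for illustration): the same match yields 1 in scalar context and the captured text in list context.

```perl
use strict;
use warnings;

# Shortened, assumed sample text with the same delimiters as the question.
my $text = "====Example 1====\nwanted line one\nwanted line two\n====Example 2====\nunwanted\n";

# Scalar context: the match returns true (1) or false, not the capture.
my $flag = $text =~ /====Example 1====\s*(.*?)\s*====/s;

# List context: the parentheses on the left make the match return its captures.
my ($output) = $text =~ /====Example 1====\s*(.*?)\s*====/s;

print "scalar context: $flag\n";
print "list context:\n$output\n";
```

This accounts for the "1" reported in the question: parentheses around the right-hand side do not change the context, only a list on the left-hand side does.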

Two things.
Apply the /s flag to the regex so that . also matches newlines, since the text you want to capture spans multiple lines.
Switch your parentheses to be around $output instead of around the ($text =~ regex);.
Example:
($output) = $text =~ /====Example\s1====\s*(.*?)\s*====/s;
For example, putting it into a script like:
#!/usr/bin/env perl
$text="
====Example 1====
Some text that I want to get that
may include line breaks
or special ~!@#$%^&*() characters
====Example 2====
Some more text that I don't want to get.
";
print "full text:","\n";
&hr;
print "$text","\n";
&hr;
($output) = $text =~ /====Example\s1====\s*(.*?)\s*====/s;
print "desired output of regex:","\n";
&hr;
print "$output","\n";
&hr;
sub hr {
print "-" x 80, "\n";
}
This gives you output like:
bash$ perl test.pl
--------------------------------------------------------------------------------
full text:
--------------------------------------------------------------------------------
====Example 1====
Some text that I want to get that
may include line breaks
or special ~!@#0^&*() characters
====Example 2====
Some more text that I don't want to get.
--------------------------------------------------------------------------------
desired output of regex:
--------------------------------------------------------------------------------
Some text that I want to get that
may include line breaks
or special ~!@#0^&*() characters
--------------------------------------------------------------------------------

Removing newline character from a string in Perl

I have a string that is read from a text file on Ubuntu Linux, and I am trying to remove the newline character from its end.
I tried everything I could think of. With s/\n|\r/-/ (to check whether it finds and replaces any newline characters), the substitution happens, but the output still moves to the next line when I print it. Moreover, when I use chomp or chop, the string is completely deleted. I could not find any other solution. How can I fix this problem?
use strict;
use warnings;
use v5.12;
use utf8;
use encoding "utf-8";
open(MYINPUTFILE, "<:encoding(UTF-8)", "file.txt");
my @strings;
my @fileNames;
my @erroredFileNames;
my $delimiter;
my $extensions;
my $id;
my $surname;
my $name;
while (<MYINPUTFILE>)
{
my ($line) = $_;
my ($line2) = $_;
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
#chop($line2);
$line2 =~ s/^\n+//;
print $line2 . " WRONG FORMAT!\n";
}
else {
#print "INSERTED:".$13."\n";
my($id) = $13;
my($name) = $2;
print $name . "\t" . $id . "\n";
unshift(@fileNames, $line2);
unshift(@strings, $line2 =~ /[^\W_]+/g);
}
}
close(MYINPUTFILE);

The correct way to remove Unicode linebreak graphemes, including CRLF pairs, is using the \R regex metacharacter, introduced in v5.10.
The use encoding pragma is strongly deprecated. You should either use the use open pragma, or use an encoding in the mode argument on 3-arg open, or use binmode.
use v5.10; # minimal Perl version for \R support
use utf8; # source is in UTF-8
use warnings qw(FATAL utf8); # encoding errors raise exceptions
use open qw(:utf8 :std); # default open mode, `backticks`, and std{in,out,err} are in UTF-8
while (<>) {
s/\R\z//;
...
}
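As a minimal sketch of the \R approach on its own, the substitution can be wrapped in a helper and tried on different line endings. The name strip_eol and the sample strings are assumptions for illustration.

```perl
use v5.10;    # \R needs Perl 5.10 or later
use strict;
use warnings;

# strip_eol is a hypothetical helper: it removes one trailing line break,
# whether that is LF, CRLF, or another Unicode line break.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\R\z//;
    return $line;
}

# Brackets make any leftover line-ending characters visible in the output.
for my $raw ("unix\n", "windows\r\n", "no newline") {
    printf "[%s]\n", strip_eol($raw);
}
```

Unlike chomp, which removes only the value of $/, \R treats a CRLF pair as a single line break, so a file written on Windows is handled the same way as one written on Linux.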

You are probably experiencing a line ending from a Windows file causing issues. For example, a string such as "foo bar\n", would actually be "foo bar\r\n". When using chomp on Ubuntu, you would be removing whatever is contained in the variable $/, which would be "\n". So, what remains is "foo bar\r".
This is a subtle, but very common error. For example, if you print "foo bar\r" and add a newline, you would not notice the error:
my $var = "foo bar\r\n";
chomp $var;
print "$var\n"; # Remove and put back newline
But when you concatenate the string with another string, the second string overwrites the first on screen, because \r moves the cursor back to the beginning of the line. For example:
print "$var: WRONG\n";
It would effectively be "foo bar\r: WRONG\n", but the text after \r would cause the following text to wrap back on top of the first part:
foo bar\r # \r resets position
: WRONG\n # Second line prints and overwrites
This is more obvious when the first line is longer than the second. For example, try the following:
perl -we 'print "foo bar\rbaz\n"'
And you will get the output:
baz bar
The solution is to remove the bad line endings. You can do this with the dos2unix command, or directly in Perl with:
$line =~ s/[\r\n]+$//;
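A minimal sketch comparing chomp with the substitution above on a Windows-style line. The helper name strip_crlf is hypothetical, and the carriage return is printed as a visible \r so the difference does not depend on the terminal.

```perl
use strict;
use warnings;

# strip_crlf is a hypothetical helper wrapping the substitution above.
sub strip_crlf {
    my ($line) = @_;
    $line =~ s/[\r\n]+$//;
    return $line;
}

my $var = "foo bar\r\n";

chomp(my $chomped = $var);         # removes only "\n" (the default $/)
my $cleaned = strip_crlf($var);    # removes both "\r" and "\n"

# Show the carriage return explicitly instead of sending it to the terminal.
(my $visible = $chomped) =~ s/\r/\\r/g;
print "chomp leaves:        [$visible]\n";
print "substitution leaves: [$cleaned]\n";
```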
Also, be aware that your other code is somewhat horrific. What, for example, do you think $13 contains? It would be the string captured by the 13th pair of parentheses in your previous regular expression. I'm fairly sure that value will always be undefined, because you do not have 13 capture groups.
You declare two sets of $id and $name: one at the top of the script and one inside the loop. This is very poor practice, IMO. Only declare variables within the scope where they are needed, and never just bunch all your declarations at the top of your script, unless you explicitly want them to be global to the file.
Why use $line and $line2 when they have the same value? Just use $line.
And seriously, what is up with this:
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
That looks like an attempt to obfuscate, no offence. Three nested negations and a bunch of unnecessary parentheses?
First off, since it is an if-else, just swap the branches around and invert the test. Second, the double negation in [^\W_] is rather confusing. If ASCII letters and digits are all you need, why not just use [A-Za-z0-9]? You can split this up to make it easier to parse:
if ($line =~ /^(.+)(\.docx)\s*$/) {
my $pre = $1; # everything before the extension
my $ext = $2; # the ".docx" extension itself
...
}

You can wipe the linebreaks with something like this:
$line =~ s/[\n\r]//g;
When you do that though, you'll need to change the regex in your if statement to not look for them. I also don't think you want a /g in your if. You really shouldn't have a $line2 either.
I also wouldn't do this type of thing:
print $line2." WRONG FORMAT!\n";
You can do
print "$line2 WRONG FORMAT!\n";
... instead. Also, print accepts a list, so instead of concatenating your strings, you can just use commas.

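You can do something like this (note the /d modifier; without it, tr only counts the characters and deletes nothing):
$string =~ tr/\n//d;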
But really chomp should work:
while (<filehandle>){
chomp;
...
}
Also, s/\n|\r// only replaces the first occurrence of \r or \n. If you want to replace all occurrences, add the global modifier at the end: s/\r|\n//g.
Note: if you're including \r for Windows, lines there usually end in \r\n, so you would want to remove both (e.g. s/(?:\r\n|\n)//). Of course, the statement above (s/\r|\n//g) with the global modifier takes care of that anyway.

$variable = join('',split(/\n/,$variable))
