I'm back with a follow-up to this question.
Let's assume I have the text
====Example 1====
Some text that I want to get that
may include line breaks
or special ~!##$%^&*() characters
====Example 2====
Some more text that I don't want to get.
and use $output = ($text =~ ====Example 1====\s*(.*?)\s*====); to try and get everything from "====Example 1====" to the four equal signs right before "Example 2".
Based on what I've seen on this site, regexpal.com, and by running it myself, Perl finds and matches the text, but $output remains null or is assigned "1". I'm pretty sure that I'm doing something wrong with the capturing parenthesis, but I can't figure out what. Any help would be appreciated.
My full code is:
$text = "====Example 1====\n
Some text that I want to get this text\n
may include line breaks\n
or special ~!##$%^&*() characters\n
\n
====Example 2====]\n
Some more filler text that I don't want to get.";
my ($output) = $text =~ /====Example 1====\s*(.*?)\s*====/;
die "un-defined" unless defined $output;
print $output;
Try with parentheses to force list context, and use /s when matching so . can also match newlines,
my ($output) = $text =~ / /s;
Two things.
Apply the /s flag to the regex to let it know that the input to the regex might be multiple lines.
Switch your parenthesis to be around $output instead of around the ($text ~= regex);.
Example:
($output) = $text =~ /====Example\s1====\s*(.*?)\s*====/s;
For example, putting it into a script like:
#!/usr/bin/env perl
$text="
====Example 1====
Some text that I want to get that
may include line breaks
or special ~!##$%^&*() characters
====Example 2====
Some more text that I don't want to get.
";
print "full text:","\n";
&hr;
print "$text","\n";
&hr;
($output) = $text =~ /====Example\s1====\s*(.*?)\s*====/s;
print "desired output of regex:","\n";
&hr;
print "$output","\n";
&hr;
sub hr {
print "-" x 80, "\n";
}
Leaves you output like:
bash$ perl test.pl
--------------------------------------------------------------------------------
full text:
--------------------------------------------------------------------------------
====Example 1====
Some text that I want to get that
may include line breaks
or special ~!##0^&*() characters
====Example 2====
Some more text that I don't want to get.
--------------------------------------------------------------------------------
desired output of regex:
--------------------------------------------------------------------------------
Some text that I want to get that
may include line breaks
or special ~!##0^&*() characters
--------------------------------------------------------------------------------
Related
To clarify the following post, we have an automation requirement to send shipping information to an online platform so users can track their orders. We receive a daily .csv file through email, we have to extract the unique Shopify order reference from a field (last 10 digits of a field), save the amended .csv file and upload to an FTP site so tracking references can be matched to the specific order.
A previous colleague wrote an application in Perl to handle this, however it has not worked and I have no experience with Perl at all!
The program is called by a "Watcher" monitoring for files, the code for this is as follows:
use strict;
use warnings;
use Datatools::Watcher;
my $hotfolder = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT';
my $process = '"C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl"';
my #backup = ('\\gen-svr-01\users\DATA\MW\DMO_Report_IO\ARCHIVE');
watcher($hotfolder,$process,\#backup);
The main code (PERL PROGRAM) is:
use strict;
use warnings;
use File::Copy;
use Datatools::Watcher;
my $output = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\OUTPUT';
my $desthotfolder = '\\gen-svr-01\users\DATA\MW\Data_TO_MWS_FTP_TEST';
my $shopifyPos = 0;
my $shopifyNew = "";
my $header = 1;
my $inputfile = $ARGV[0];
my ($path,$file,$extention) = $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
my $outputfilename = $file . "_FORMATTED" . $extention;
$outputfilename =~ s/.~#~//;
my $outputfile = "$output\\$outputfilename";
open (INPUT, $inputfile) or die "Could not open input file: $inputfile\n";
open (OUTPUT, ">$outputfile") or die "Could not open output file: $outputfile\n";
while (my $record = <INPUT>){
chomp $record;
my #field = parse_csv($record);
if ($header == 1){
print OUTPUT $record . "\n";
$header = 0;
next;
} else {
$shopifyNew = substr $field[$shopifyPos], -10;
splice (#field, 0, 1, $shopifyNew);
print OUTPUT join(',',#field) . "\n";
next;
}
}
close INPUT;
close OUTPUT;
my $destfile = "$desthotfolder\\$outputfilename";
move $outputfile, $destfile or die "Could not move output file: $outputfile\nto: $destfile\n";
print "\nProcessing complete\n";
sub parse_csv {
my ($shift) = #_;
my $text = $shift; # record containing comma-separated values
my #new = ();
push(#new, $+) while $text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(#new, undef) if substr($text, -1,1) eq ',';
return #new; # list of values that were comma-separated
}
When the program runs, the "Watcher" details the following:
File Seen, Processing File \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csvUse of uninitialized value $file in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Use of uninitialized value $extention in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Processing complete
Line 47 refers to the following code:
my $outputfilename = $file . "_FORMATTED" . $extention;
In the output folder, there is a file with the name "_FORMATTED" (no file extensions)
I have looked for a solution, and from my limited understanding I don't think the variables: file and extension are being defined, but I have no idea how to correct!
It would help to know which is line 47 in this code. I assume it's this line:
my $outputfilename = $file . "_FORMATTED" . $extention;
So, at this point, $file and $extention are both uninitialised. They are both supposed to be initialised in the previous line:
my ($path,$file,$extention) =
$inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
So it seems that your $inputfile doesn't match the regex. This leaves us with two options:
$inputfile isn't being set at all (which would mean it isn't being passed to the program).
$inputfile isn't in the correct format to to match the regex.
To work out which of the problems we have here, add the following validation lines before the line which tries to set $file and $extention:
die "No input file given\n" unless $inputfile;
die "Input file name ($inputfile) is the wrong format\n"
unless $inputfile =~ / \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
Update: From recent updates to your question, I can see that you are running the program and passing it the filename \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv.
Let's take a closer look at your regex.
m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms
The /x option at the end means that the regex compiler ignores any literal whitespace in the string. So we can do the same. Let's break down what the individual parts are trying to match:
\A : matches the start of the string
(.*\/) : matches anything up to and including the last / in your string. It captures the matched substring into $1. This is what is stored in $path in your code. It's the directory that your file is in.
(.+\d\d\d\d) : This matches one or more of any character followed by four digits. This is stored in $2 and in your code it ends up in `$file``. It's the main part of the filename.
.+ : Matches one or more characters. Any characters. Your code does nothing with these characters.
([.]\w{3}) : Matches a dot followed by three "word" characters (basically alphanumerics). This is captured into $3 and ends up in your $extention variable.
\z : Matches the end of the string.
Putting all that together, you have a regex that looks for filenames and splits them into three parts - the path, the name and the extension. The only complication is that the filename section needs to contain four consecutive digits. And your filename is OrderTracking - which doesn't contain those required digits. So the regex doesn't match and your variables don't get set.
When this program was written, it was assumed that the filenames would contain four digits. The files that you are trying to process do not contain digits, so the program fails.
We can't suggest how you fix this. You need to speak to the people who supply your input files and find out why they have started to send you files with a different name format. Once you know that, you can decide one the best approach to work round the problems.
I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my #matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parenthesis around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing _null with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my #text;
if ($text =~ /,/) {
#text = split /,(?:null_)?/, $text;
}
else {
#text = $text;
}
for (#text) {
print "1 is : $_\n";
push (#matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)
I am using a perl script to remove all stopwords in a text. The stop words are stored one by line. I am using Mac OSX command line and perl is installed correctly.
This script is not working properly and has a boundary problem.
#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;
# poor man's argument handler
open(WORDS, shift #ARGV) || die "failed to open words file: $!";
open(REPLACE, shift #ARGV) || die "failed to open replacement file: $!";
my #words;
# get all words into an array
while ($_=<WORDS>) {
chop; # strip eol
push #words, split; # break up words on line
}
# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
#words=sort { length($b) <=> length($a) } #words;
# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;
# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (#words) {
$text =~ s/\b\Q$word\E\s?//sg;
}
# output "fixed" text
print $text;
sample.txt
$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i
think id rather go up and above
stopwords.txt
I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..
Output:
$ ./remove.pl stopwords.txt sample.txt
i decide look fterwards cross do you think good idea go out d i
think id rather go up d bove
As you can see, it replaces afterwards using a as fterwards. Think its a regex problem. Please can somebody help me to patch this quickly? Thanks for all the help :J
Use word-boundary on both sides of your $word. Currently, you are only checking for it at the beginning.
You won't need the \s? condition with the \b in place:
$text =~ s/\b\Q$word\E\b//sg;
Your regex is not strict enough.
$text =~ s/\b\Q$word\E\s?//sg;
When $word is a, the command is effectively s/\ba\s?//sg. This means, remove all occurrences of a new word starting with a followed by zero or more whitespace. In afterwards, this will successfully match the first a.
You can make the match more stricter by ending word with another \b. Like
$text =~ s/\b\Q$word\E\b\s?//sg;
I have a file of the following:
Question:What color is the sky?
Explanation:The sky reflects the ocean.
Question:Why did the chicken cross the road?
Explanation:He was hungry.
What I'm trying to obtain is a list of ("What color is the sky?", "Why did the chicken cross the road")
I'm trying to use perl regex to parse this file, but with no luck.
I have the entire contents of my file in a string called $file, and this is what I'm trying
my #questions = ($file =~ /Question:(.*)\n/g);
But this always just returns the entire $file string to me.
Your (.*) is greedily matching the whole line until it gets to the \n, which is probably a result of how you are getting the string.
You can add a ? to make the match not greedy.
So try
my #questions = ($file =~ /Question:(.*?\?)/g);
Notice I escaped \?, so the regex will match up to the questionmark
Put the whole file in a value will occupy too many memory if the is large, a better way is to process the file line by line.
For example you could do something like
my #questions;
while (<>) {
chomp;
if (m/Question:(.*)/) {
push #questions, $1;
}
}
Some explanations:
I/O Operators of perlop:
Input from <> comes either from standard input, or from each file listed on the command line.
For each line I need to add a semicolon exactly one character before the first match of an alphanumeric sign but only for the alphanumeric sign after the first appearance of a semicolon.
Example:
Input:
00000001;Root;;
00000002; Documents;;
00000003; oracle-advanced_plsql.zip;file;
00000004; Public;;
00000005; backup;;
00000006; 20110323-JM-F.7z.001;file;
00000007; 20110426-JM-F.7z.001;file;
00000008; 20110603-JM-F.7z.001;file;
00000009; 20110701-JM-F-via-summer_school;;
00000010; 20110701-JM-F-via-summer_school.7z.001;file;
Desired output:
00000001;;Root;;
00000002; ;Documents;;
00000003; ;oracle-advanced_plsql.zip;file;
00000004; ;Public;;
00000005; ;backup;;
00000006; ;20110323-JM-F.7z.001;file;
00000007; ;20110426-JM-F.7z.001;file;
00000008; ;20110603-JM-F.7z.001;file;
00000009; ;20110701-JM-F-via-summer_school;;
00000010; ;20110701-JM-F-via-summer_school.7z.001;file;
Could someone helps me please to create Perl regex for that? I'd need it in a program, not as a oneliner.
This is a way to insert a semi-colon after the first semi-colon and whitespace, but before the first non-whitespace.
s/;\s*\K(?=\S)/;/
If you feel the need, you can use \w instead of \S, but I felt with this input it was an unnecessary specification.
The \K (keep) escape is similar to a lookbehind assertion in that it does not remove what it matches. The same goes for the lookahead assertion, so all this substitution does is insert a semi-colon in the designated spot.
First of all, here is a program that seems to match your requirements:
#/usr/bin/perl -w
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
Store it in a file 'program.pl', make it executable with 'chmod u+x program.pl' and run it on your input data like this:
program.pl input-data.txt
Here is an explanation of the regular expression:
s/ # start search-and-replace regexp
^ # start at the beginning of this line
( # save the matched characters until ')' in $1
.*?; # go forward until finding the first semicolon
.*? # go forward until finding... (to be continued below)
)
( # save the matched characters until ')' in $2
\w # ... the next alphanumeric character.
)
/ # continue with the replace part
$1;$2 # write all characters found above, but insert a ; before $2
/ # finish the search-and-replace regexp.
Based on your sample input, I would use a more specific regular expression:
s/^(\d*; *)(\w)/$1;$2/;
This expression starts at the beginning of the line, skips over numbers (\d*) followed by the first semicolon and space. Before the following word character, it inserts a semicolon.
Take what fits best to your needs!
First of all thank you for your really great answers!
Actually my code snippet looks like this:
our $seperator=";" # at the beginning of the file
#...
sub insert {
my ( $seperator, $line, #all_lines, $count, #all_out );
$count = 0;
#all_lines = read_file($filename);
foreach $line (#all_lines) {
$count = sprintf( "%08d", $count );
chomp $line;
$line =~ s/\:/$seperator/; # works
$line =~ s/\ file/file/; # works
#$line=~s/;\s*\K(?=\S)/;/; # doesn't work
$line =~ s/^(.*?$seperator.*?)(\w)/$1$seperator$2/; # doesn't work
say $count . $seperator . $line . $seperator;
$count++; # btw, is there maybe a hidden index variable in a foreach-loop I could us instead of a new variable??
push( #all_out, $count . $seperator . $line . $seperator . "\n" );
}
write_file( $csvfile, #all_out ); # using File::Slurp
}
In order to get the input which I presented you, I made already some small substitutions, as you can see in the beginning of the foreach-loop.
I am curious, why the regular expressions presented by TLP and Yaakov do not work in my code. In general they work, but only when written like in the example which Yaakov gave:
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}