To clarify the following post, we have an automation requirement to send shipping information to an online platform so users can track their orders. We receive a daily .csv file through email, we have to extract the unique Shopify order reference from a field (last 10 digits of a field), save the amended .csv file and upload to an FTP site so tracking references can be matched to the specific order.
A previous colleague wrote an application in Perl to handle this, however it has not worked and I have no experience with Perl at all!
The program is called by a "Watcher" monitoring for files, the code for this is as follows:
use strict;
use warnings;
use Datatools::Watcher;
my $hotfolder = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT';
my $process = '"C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl"';
my #backup = ('\\gen-svr-01\users\DATA\MW\DMO_Report_IO\ARCHIVE');
watcher($hotfolder,$process,\#backup);
The main code (PERL PROGRAM) is:
use strict;
use warnings;
use File::Copy;
use Datatools::Watcher;
my $output = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\OUTPUT';
my $desthotfolder = '\\gen-svr-01\users\DATA\MW\Data_TO_MWS_FTP_TEST';
my $shopifyPos = 0;
my $shopifyNew = "";
my $header = 1;
my $inputfile = $ARGV[0];
my ($path,$file,$extention) = $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
my $outputfilename = $file . "_FORMATTED" . $extention;
$outputfilename =~ s/.~#~//;
my $outputfile = "$output\\$outputfilename";
open (INPUT, $inputfile) or die "Could not open input file: $inputfile\n";
open (OUTPUT, ">$outputfile") or die "Could not open output file: $outputfile\n";
while (my $record = <INPUT>){
chomp $record;
my #field = parse_csv($record);
if ($header == 1){
print OUTPUT $record . "\n";
$header = 0;
next;
} else {
$shopifyNew = substr $field[$shopifyPos], -10;
splice (#field, 0, 1, $shopifyNew);
print OUTPUT join(',',#field) . "\n";
next;
}
}
close INPUT;
close OUTPUT;
my $destfile = "$desthotfolder\\$outputfilename";
move $outputfile, $destfile or die "Could not move output file: $outputfile\nto: $destfile\n";
print "\nProcessing complete\n";
sub parse_csv {
my ($shift) = #_;
my $text = $shift; # record containing comma-separated values
my #new = ();
push(#new, $+) while $text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(#new, undef) if substr($text, -1,1) eq ',';
return #new; # list of values that were comma-separated
}
When the program runs, the "Watcher" details the following:
File Seen, Processing File \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csvUse of uninitialized value $file in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Use of uninitialized value $extention in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Processing complete
Line 47 refers to the following code:
my $outputfilename = $file . "_FORMATTED" . $extention;
In the output folder, there is a file with the name "_FORMATTED" (no file extensions)
I have looked for a solution, and from my limited understanding I don't think the variables: file and extension are being defined, but I have no idea how to correct!
It would help to know which is line 47 in this code. I assume it's this line:
my $outputfilename = $file . "_FORMATTED" . $extention;
So, at this point, $file and $extention are both uninitialised. They are both supposed to be initialised in the previous line:
my ($path,$file,$extention) =
$inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
So it seems that your $inputfile doesn't match the regex. This leaves us with two options:
$inputfile isn't being set at all (which would mean it isn't being passed to the program).
$inputfile isn't in the correct format to to match the regex.
To work out which of the problems we have here, add the following validation lines before the line which tries to set $file and $extention:
die "No input file given\n" unless $inputfile;
die "Input file name ($inputfile) is the wrong format\n"
unless $inputfile =~ / \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
Update: From recent updates to your question, I can see that you are running the program and passing it the filename \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv.
Let's take a closer look at your regex.
m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms
The /x option at the end means that the regex compiler ignores any literal whitespace in the string. So we can do the same. Let's break down what the individual parts are trying to match:
\A : matches the start of the string
(.*\/) : matches anything up to and including the last / in your string. It captures the matched substring into $1. This is what is stored in $path in your code. It's the directory that your file is in.
(.+\d\d\d\d) : This matches one or more of any character followed by four digits. This is stored in $2 and in your code it ends up in `$file``. It's the main part of the filename.
.+ : Matches one or more characters. Any characters. Your code does nothing with these characters.
([.]\w{3}) : Matches a dot followed by three "word" characters (basically alphanumerics). This is captured into $3 and ends up in your $extention variable.
\z : Matches the end of the string.
Putting all that together, you have a regex that looks for filenames and splits them into three parts - the path, the name and the extension. The only complication is that the filename section needs to contain four consecutive digits. And your filename is OrderTracking - which doesn't contain those required digits. So the regex doesn't match and your variables don't get set.
When this program was written, it was assumed that the filenames would contain four digits. The files that you are trying to process do not contain digits, so the program fails.
We can't suggest how you fix this. You need to speak to the people who supply your input files and find out why they have started to send you files with a different name format. Once you know that, you can decide one the best approach to work round the problems.
Related
I am seeking advice on extracting a section of a string, that is always occurs as the first instance data between parenthesis using perl and regex and assign that value to a variable.
Here is the precise situation, I am using perl and regex to extract the courseID from a university catalog and assign it to a variable. Please consider the following:
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)
The typical format is [course-section-name] [(courseID)] [courseName]
My goal is to create a script which can take each entry, one at a time, assign it to a variable and then use regex to extract only the courseID and assign only the courseID to a variable.
My approach has been to use search and replace to replace everything not matching that with '' and then saving what is left (the courseID) to the variable. Here are a few examples of what I have tried the following:
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ s/[^\d\d\d\d\d]//g;
print $courseID;
Result: 21090112345 --- printing the course-section-name and courseID
$string = "BIO-2109-01 (12345) Introduction to Biology";
$($courseID = $string) =~ s/[^\b\(\d{5}\)]\b//g;
print $courseID;
Result: 210901(12345) --- printing course-section-name, parens, and courseID
So I haven't had much luck with search and replace - however I found this nugget:
\(([^\)]+)\)
On http://regexr.com/ that will match the parens section. However, it would also match multiple parans, including for example (abc).
I'm not really sure at this point how to do something like this:
$string = "BIO-2109-01 (12345) Introduction to Biology";
($courseID = $string) =~ [magicRegex_goes_here];
print courseID;
result 12345
OR, better:
$string = IDS-3250-01 (98765) History of US (1860-2000)
($courseID = $string) =~ [magicRegex_goes_here];
print courseID;
result 98765
Any advice or direction would be greatly appreciated. I have tried everything I know and can research in regards to regex to solve this problem. If there is anymore information I can include please ask away.
UPDATE
use warnings 'all';
use strict;
use feature 'say';
my $file = './data/enrollment.csv'; #File this script generates
my $course = ""; #Complete course string [name-of-course] [(courseID)] [course_name]
my #arrayCourses = ""; #Array of courseIDs
my $i = ""; #i in for loop
my $courseID = ""; #Extracted course ID
my $userName = ""; #Username of person we are enrolling
my $action = "add,"; #What we are doing to user
my $permission = "teacher,"; #What permissions to assign to user
my $stringToPrint = ""; #Concatinated string to write to file
my $n = "\n"; #\n
my $c = ","; #,
#BEGIN PROGRAM
print "Enter the username \n";
chomp($userName = <STDIN>); #Get the enrollee username from user
print "\n";
print "Enter course name and press enter. Enter 'x' to end. \n"; #prompt for course names
while ($course ne 'x') {
chomp($course = <STDIN>);
if ($course ne "x") {
if (($courseID) = ($course =~ /[^(]+\(([^)]+)\)/) ) { #nasty regex to extract courseID - thnx PerlDuck and zdim
push #arrayCourses, $courseID; #put the courseID into array
}
else {
print "Cannot process last entry check it";
}
}
else {
last;
}
}
shift #arrayCourses; #Remove first entry from array - add,teacher,,username
open(my $fh,'>', $file); #open file
for $i (#arrayCourses) #write array to file
{
$stringToPrint= join "", $action, $permission, $i, $c, $userName, $n ;
print $fh $stringToPrint;
}
close $fh;
That'll do it! Suggestions or improvements are always welcome! Thanks #PerlDuck and #zdim
#!/usr/bin/env perl
use strict;
use warnings;
while( my $line = <DATA> ) {
if (my ($courseID) = ($line =~ /[^(]+\(([^)]+)\)/) ) {
print "course-ID = $courseID; -- line was $line";
}
}
__DATA__
BIO-2109-01 (12345) Introduction to Biology
CHM-3501-F2-01 (54321) Introduction to Chemistry
IDS-3250-01 (98765) History of US (1860-2000)
SPN-1234-02-F1 (45678) Spanish History (1900-2010)
Output:
course-ID = 12345; -- line was BIO-2109-01 (12345) Introduction to Biology
course-ID = 54321; -- line was CHM-3501-F2-01 (54321) Introduction to Chemistry
course-ID = 98765; -- line was IDS-3250-01 (98765) History of US (1860-2000)
course-ID = 45678; -- line was SPN-1234-02-F1 (45678) Spanish History (1900-2010)
The pattern I used, /[^(]+\(([^)]+)\)/, can also be written as
/ [^(]+ # 1 or more characters that are not a '('
\( # a literal '('. You must escape that because you don't want
# to start it a capture group.
([^)]+) # 1 or more chars that are not a ')'.
# The sorrounding '(' and ')' capture this match
\) # a literal ')'
/x
The /x modifier allows you to insert spaces, comments, and even newlines right in the pattern.
Just in case you're unsure about the /x. You can indeed write:
while( my $line = <DATA> ) {
if (my ($courseID) = ($line =~ / [^(]+ # …
\( # …
([^)]+) # …
\) # …
/x ) ) {
print "course-ID = $courseID; -- line was $line";
}
}
That's probably not nice to read but you can also store the regex in a separate variable:
my $pattern =
qr/ [^(]+ # 1 or more characters that are not a '('
\( # a literal '(' (you must escape it)
([^)]+) # 1 or more chars that are not a ')'.
# The sorrounding '(' and ')' capture this match
\) # a literal ')'
/x;
And then:
if (my ($courseID) = ($line =~ $pattern)) {
…
}
Since you nailed down the format
my ($section, $id, $name) =
$string =~ /^\s* ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x;
The key here is the negated character class, [^...], which matches any one character other than those listed inside following the ^ (which makes it "negated"). The un-escaped parenthesis capture the match, except inside a character class [] where they are taken as literal.
It first matches all consecutive characters other than (, so up to first (, what is captured by the pair of ( ) around it. Then all other than ), so up to the first closing paren, also captured by its own pair ( ). This comes between literal parenthesis \( ... \), which are outside of ( ) since we don't want them captured. Then all the rest is captured, (.+), requiring at least some characters since + means one or more. Note though that these can be spaces. We exclude possible leading white space from the first capture, by matching it specifically before the capturing parenthesis, and extract (some of) possible spaces around id-parenthesis.
The /x modifier allows use of spaces (and comments and newlines) inside, what helps reaadbility. The match operator returns a list of all matches, which we assign to variables. Note, even if there is only one match it still returns (it as) a list. See Regular Expressions Tutorial (perlretut).
Then, assuming that you have the catalog in a file
use warnings 'all';
use strict;
use feature 'say';
my $file = 'catalog.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>)
{
next if $line =~ /^\s*$/; # skip empty lines
# Strip leading and trailing white space
$line =~ s{^\s*|\s*$}{}g;
my ($section, $id, $name) =
$line =~ /^ ([^(]+) \(\s* ([^)]+) \)\s* (.+) $/x
or do {
warn "Error with expected format -- ";
next;
};
say "$section, $id, $name";
}
close $fh;
I use s{}{} delimiters since s/// confuse markup's syntax highlighter with this pattern, which is also a good demonstration since these sometimes help readability a lot.
You would store the retrieved variables in a suitable data structure. Any combination of arrays and hashes (and their references) comes to mind, depending on what need be done with them later. See Cookbook of Data Structures (perldsc).
Note on the error handling. Since none of the matches involve * (allowing zero matches -- nothing), if any component of your format isn't as expected there won't be a match at all and we get an error. The .+ is extremely permissive but it still requires something to be there. This is why the trailing space is first stripped, so that the last pattern (.+) cannot be satisfied by spaces alone.
If the only objective is the course id and we are certain that the first parenthesis are around it
my ($id) = $line =~ / \(\s* ([^)]+) \) /x or do { ... };
We now only need to match and capture the middle piece, something inside parenthesis.
This code grabs a keyword 'fun' from text files that I have and then prints the 20 characters before and after the keyword. However, I also want it to print the previous 2 lines and the next two lines, and I'm not sure how to do that. I wasn't sure if it is easier to change the code with this or just read the whole file at one time.
{my $inputfile = "file";
$searchword = 'fun';
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while ($line1=<INPUT>)
{
#read in a line of the file
if ($line1 =~m/$searchword/i)
{print "searchword found\n";
$keepline = $line1;
$goodline =1;
$keepline =~/(.{1,20})(fun)(.{1,20})/gi;
if ($goodline==1)
{&write_excel};
$goodline =0;
}
Your code as is seems to
Take 20 chars each side of 'pledge' not $searchword;
Have an unmatched '{' at the start;
Doesn't print any file contents save for &write_excel which we can't examine; and
Has a logic problem in that if $searchword is found, $goodline is unconditionally set to '1' and then tested to see if its '1' and finally reset to '0'
Putting that aside, the question as to whether to read in the whole file depends on your circumstances some what - how big are the files you're going to be searching, does your machine have plenty of memory; is the machine a shared resource and so on. I'm going to presume you can read in the whole file as that's the more common position in my experience (those who disagree please keep in mind (a) I've acknowledge that its debatable; and (b) its very dependant on the circumstances that only the OP knows)
Given that, there are several ways to read in a whole file but the consensus seems to be to go with the module File::Slurp. Given those parameters, the answer looks like this;
#!/usr/bin/env perl
use v5.12;
use File::Slurp;
my $searchword = 'fun';
my $inputfile = "file.txt";
my $contents = read_file($inputfile);
my $line = '\N*\n';
if ( $contents =~ /(
$line?
$line?
\N* $searchword \N* \n?
$line?
$line?
)/x) {
say "Found:\n" . $1 ;
}
else {
say "Not found."
}
File::Slurp prints a reasonable error message if the file isn't present (or something else goes wrong), so I've left out the typical or die.... Whenever working with regexes - particularly if your trying to match stuff on multiple lines, it pays to use "extended mode" (by putting an 'x' after the final '/') to allow insignificant whitespace in the regex. This allows a clearer layout.
I've also separated out the definition of a line for added clarity which consists of 0, 1 or more non-newlines characters, \N*, followed by a new line, \n. However, if your target is on the first, second, second-last or last line I presume you still want the information, so the requested preceding and following pairs of lines are optionally matched. $line?
Please note that regular expressions are pedantic and there are inevitably 'fine details' that effect the definition of a successful match vs an unwanted match - ie. Don't expect this to do exactly what you want in all circumstances. Expect that you'll have to experiment and tweek things a bit.
I'm not sure I understand your code block (what purpose does "pledge" have? what is &write_excel?), but I can answer your question itself.
First, is this grep command acceptable? It's far faster and cleaner:
grep -i -C2 --color "fun" "file"
The -C NUM flag tells grep to provide NUM lines of context surrounding each pattern match. Obviously, --color is optional, but it may help you find the matches on really long lines.
Otherwise, here's a bit of perl:
#!/usr/bin/perl
my $searchword = "fun";
my $inputfile = "file";
my $blue = "\e[1;34m"; # change output color to blue
my $green = "\e[1;32m"; # change output color to green
my $nocolor = "\e[0;0m"; # reset output to no color
my $prev1 = my $prev2 = my $result = "";
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while(<INPUT>) {
if (/$searchword/i) {
$result .= $prev2 . $prev1 . $_; # pick up last two lines
$prev2 = $prev1 = ""; # prevent reusing last two lines
for (1..2) { # for two more non-matching lines
while (<INPUT>) { # parse them to ensure they don't match
$result .= $_; # pick up this line
last unless /$searchword/i; # reset counting if it matched
}
}
} else {
$prev2 = $prev1; # save last line as $prev2
$prev1 = $_; # save current line as $prev1
}
}
close $inputfile;
exit 1 unless $result; # return with failure if without matches
$result =~ # add colors (okay to remove this line)
s/([^\e]{0,20})($searchword)([^\e]{0,20})/$blue$1$green$2$blue$3$nocolor/g;
print "$result"; # print the result
print "\n" unless $result =~ /\n\Z/m; # add newline if there wasn't already one
Bug: this assumes that the two lines before and the two lines after are actually 20+ characters. If you need to fix this, it goes in the else stanza.
I am using a perl script to remove all stopwords in a text. The stop words are stored one by line. I am using Mac OSX command line and perl is installed correctly.
This script is not working properly and has a boundary problem.
#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;
# poor man's argument handler
open(WORDS, shift #ARGV) || die "failed to open words file: $!";
open(REPLACE, shift #ARGV) || die "failed to open replacement file: $!";
my #words;
# get all words into an array
while ($_=<WORDS>) {
chop; # strip eol
push #words, split; # break up words on line
}
# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
#words=sort { length($b) <=> length($a) } #words;
# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;
# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (#words) {
$text =~ s/\b\Q$word\E\s?//sg;
}
# output "fixed" text
print $text;
sample.txt
$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i
think id rather go up and above
stopwords.txt
I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..
Output:
$ ./remove.pl stopwords.txt sample.txt
i decide look fterwards cross do you think good idea go out d i
think id rather go up d bove
As you can see, it replaces afterwards using a as fterwards. Think its a regex problem. Please can somebody help me to patch this quickly? Thanks for all the help :J
Use word-boundary on both sides of your $word. Currently, you are only checking for it at the beginning.
You won't need the \s? condition with the \b in place:
$text =~ s/\b\Q$word\E\b//sg;
Your regex is not strict enough.
$text =~ s/\b\Q$word\E\s?//sg;
When $word is a, the command is effectively s/\ba\s?//sg. This means, remove all occurrences of a new word starting with a followed by zero or more whitespace. In afterwards, this will successfully match the first a.
You can make the match more stricter by ending word with another \b. Like
$text =~ s/\b\Q$word\E\b\s?//sg;
I'm new to perl and I'm trying to figure out a find and replace. I have a large csv file (actually semi-colon separated). Some of the numbers (int and decimals) in the file have a negative symbol after the number. I need to move the negative sign to before the number.
E.g: Change
ABC;10.00-;XYZ
to
ABC;-10.00;XYZ
I'm not sure how to do this in perl. Can someone please help?
Regards,
Anand
I would not dabble around in a large csv file with regexes, unless I was very sure about my data and the regex. Using a CSV module seems to me to be the safest way.
This script will take input files as arguments, and write the corrected files with a .new extension.
If you notice undesired changes in your output file, you can try to un-comment the keep_meta_info line.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $out_ext = ".new";
my $csv = Text::CSV->new( {
sep_char => ";",
# keep_meta_info => 1,
binary => 1,
eol => $/,
} ) or die "" . Text::CSV->error_diag();
for my $arg (#ARGV) {
open my $input, '<', $arg;
open my $output, '>', $arg . $out_ext;
while (my $row = $csv->getline($input)) {
for (#$row) {
s/([0-9\.]+)\-$/-$1/;
}
$csv->print($output, $row);
}
}
I'll assume you don't have to worry about quoteing or escaping in your delimited file. I'll read from standard in/out, change to appropriate files if req'd
while( my $line = <STDIN> )
{
chop( $line );
my #rec = split( ';', $line );
map( s/^(\d*\.?\d+)\-$/-$1/, #rec );
print join(';',#rec) . "\n";
}
If you do have to worry about escaping and quoting, then use Text::CSV_XS instead of the <STDIN>, split, and join oprerations
In general, the replace command is s/old/new/flags:
s/( # start a capture group
\d+ # first part of the number
(\.\d+)? # possibly a decimal dot and the fractional part
)- # end capture group, match the minus sign
/-$1/gx # move minus to the front
The g flag means “global” (replace all occurences), and x is “extended legibility” (allows whitespace and comments in the pattern). You have to test the expression on your data to see what corner cases you might have missed, it usually takes a few iterations to get the right one. Samples:
$ echo "10.5-;10-;0-;a-" | perl -pe 's/(\d+(\.\d+)?)-/-$1/g'
-10.5;-10;-0;a-
See also perldoc perlop (search for “replacement” to jump to the right section).
I'm doing this in Perl.
I have a text file that contains several paragraphs and 61 sentences.
First, I need to match a series of words that are input on the command line, which I have no trouble at all doing:
my $input = $ARGV[0];
$file =~ m/$input/gi;
Unfortunately, there are some wrinkles-
1. The input can be for multiple items and
2. The multiple items can be on different lines.
I will show you an example:
3 sentences match the pattern "fall|election|2009". The sentences are:
4: "We hate elections."
16: "The dog was injured in a fall from the balcony."
24: "There will be no 2009 fall election."
In this case, the program found counted three sentences within the document that contained either fall, election or 2009, where fall|election|2009 was the input.
My question is twofold:
How do I count the number of sentences that the inputs appear in? I'm very unexperienced with regex, but I would have thought that the default match would try to match the first occurrence of either fall, election, or 2009 that occurred within the file and neither count how many instances there were of each individual word and then add them up. I'm kind of hung up on this, as I don't understand counting with regex at all.
The second part of my question relates to how to first find which sentence the input is found in (i.e. elections appearing in line 4) and how to extract the whole sentence that the input is located in. I think this would be done using first an if: if there is a match within the string to the input then a new scalar equals the text file =~ a substitution? of the sentence... I'm totally unsure.
Edit: I actually have a fully parsed HTML document that I am performing this on. If printed, the output of one example is:
"The Journal is now on Facebook! Check out our page here. It's a work in progress, and we're hungry for your feedback. So let us know what you think on our discussion board, comment below or send us an email. Get breaking news, insider information and curiosities by following The Journal on Twitter. Here are some feeds and writers you might want to follow:"
My command line looks like this: perl WebScan.pl information|writers WebPage000.htm
I have, as aforementioned parsed through the webpage and removed all tags, leaving just text. Now, I have to find the input, which in this case is "information" or "writers". I have to find out how many times these occur within the text of the file (so 2), as well as in which sentence they appear in (so 5 and 6 respectively). I will show you my code so far:
use strict;
use warnings;
my $file;
open (FILENAME, $ARGV[1]);
$file = do { local $/; <FILENAME> };
$file =~ s{
< # open tag
(?: # open group (A)
(!--) | # comment (1) or
(\?) | # another comment (2) or
(?i: # open group (B) for /i
( # one of start tags
SCRIPT | # for which
APPLET | # must be skipped
OBJECT | # all content
STYLE # to correspond
) # end tag (3)
) | # close group (B), or
([!/A-Za-z]) # one of these chars, remember in (4)
) # close group (A)
(?(4) # if previous case is (4)
(?: # open group (C)
(?! # and next is not : (D)
[\s=] # \s or "="
["`'] # with open quotes
) # close (D)
[^>] | # and not close tag or
[\s=] # \s or "=" with
`[^`]*` | # something in quotes ` or
[\s=] # \s or "=" with
'[^']*' | # something in quotes ' or
[\s=] # \s or "=" with
"[^"]*" # something in quotes "
)* # repeat (C) 0 or more times
| # else (if previous case is not (4))
.*? # minimum of any chars
) # end if previous char is (4)
(?(1) # if comment (1)
(?<=--) # wait for "--"
) # end if comment (1)
(?(2) # if another comment (2)
(?<=\?) # wait for "?"
) # end if another comment (2)
(?(3) # if one of tags-containers (3)
</ # wait for end
(?i:\3) # of this tag
(?:\s[^>]*)? # skip junk to ">"
) # end if (3)
> # tag closed
}{}gsx; # STRIP THIS TAG
$file =~ s/ //gi;
$file =~ s/ //gi;
$file =~ s/;//gi;
$file =~ s/[\h\v]+/ /g;
my $count = $file =~ s/((^|\s)\S)/$2/g;
my $sentencecount = $file =~ s/((^|\s)\S).*?(\.|\?|\!)/$1/g;
print "Input file $ARGV[1] contains $sentencecount sentences and $count words.";
So, I need perl to, using $ARGV[0] as keywords, search through a text file, counting the number of times the keyword appears. Then, I need to state what sentence the keyword appeared in (i.e. print the whole sentence in full), as well as the number that the sentence is in.
It's not clear if you have your sentences delimited (or if you have some criteria for split them). If so, and if understand your problem right, you can do something as this:
#words = qw/hi bye 2009 a*d/;
#lines = ('Lets see , hi ',
' hi hi hi ',
' asdadasdas ',
'a2009a',
'hi bye');
$pattern="";
foreach $word (#words) {
$pattern .= quotemeta($word) . '|';
}
chop $pattern; # chop last |
print "pattern='$pattern'\n";
$cont = 0;
foreach $line (#lines) {
$cont++ if $line =~ /$pattern/o;
}
printf "$cont/%d lines matched\n",scalar(#lines);
I build the pattern with quotemeta escaping just in case there are some special characters in the words (as in my example, we dont want it to match).
Edit to match updated question
Okay, let me start with a truism: don't try to parse HTML by yourself. HTML::TreeBuilder is your friend.
For regular expressions, the perlfaq6 is a great source of knowledge.
The following sample works with the following syntax: perl WebScan.pl --regex="information|writers" --filename=WebPage000.htm.
It will print a list of paragraphs and their matches.
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
use Data::Dumper;
use Getopt::Long;
my #regexes;
my $filename;
GetOptions('regex=s' => \#regexes, 'filename=s' => \$filename);
my $tb = HTML::TreeBuilder->new_from_file($filename);
$tb->normalize_content;
my #patterns = map { qr/$_/ } #regexes;
my #all;
foreach my $node ($tb->find_by_tag_name('p', 'pre', 'blockquote')) {
my $text = $node->as_text;
my #matches;
foreach my $r (#patterns) {
while ($text =~ /$r/gi) {
push #matches, $&;
}
}
push #all, { paragraph => $text, matches => \#matches } if #matches;
}
foreach (#all) {
print "Paragraph:\n\t$_->{paragraph}\nMatches:\n\t", join(', ', #{$_->{matches}}), "\n";
}
Hopefully, this can point you in the right direction.