Perl removing words from file1 with file2 - regex

I am using a Perl script to remove all stopwords from a text. The stopwords are stored one per line. I am using the Mac OS X command line, and Perl is installed correctly.
This script is not working properly and has a word-boundary problem.
#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;
# poor man's argument handler
open(WORDS, shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";
my @words;
# get all words into an array
while ($_=<WORDS>) {
chop; # strip eol
push @words, split; # break up words on line
}
# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
@words=sort { length($b) <=> length($a) } @words;
# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;
# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (@words) {
$text =~ s/\b\Q$word\E\s?//sg;
}
# output "fixed" text
print $text;
sample.txt
$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i
think id rather go up and above
stopwords.txt
I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..
Output:
$ ./remove.pl stopwords.txt sample.txt
i decide look fterwards cross do you think good idea go out d i
think id rather go up d bove
As you can see, it strips the leading a from afterwards, leaving fterwards. I think it's a regex problem. Can somebody please help me patch this quickly? Thanks for all the help :)

Use a word boundary (\b) on both sides of your $word. Currently, you are only checking for one at the beginning.
You won't need the \s? condition with the \b in place:
$text =~ s/\b\Q$word\E\b//sg;

Your regex is not strict enough.
$text =~ s/\b\Q$word\E\s?//sg;
When $word is a, the command is effectively s/\ba\s?//sg. This means: remove every a that begins a word, together with at most one following whitespace character. In afterwards, this successfully matches the first a.
You can make the match stricter by ending the word with another \b, like this:
$text =~ s/\b\Q$word\E\b\s?//sg;
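A minimal sketch of the two-sided boundary fix, run on the first line of the sample text from the question. The sub name remove_stopwords, the inline lowercase word list, and the final cleanup of leftover spaces are assumptions added for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: remove each stopword only where it stands alone as a word.
sub remove_stopwords {
    my ($text, @stopwords) = @_;
    for my $word (@stopwords) {
        # \b on both sides keeps "a" from matching inside "afterwards".
        $text =~ s/\b\Q$word\E\b//g;
    }
    # Collapse the runs of spaces left behind by removed words,
    # then trim spaces at the start and end of each line.
    $text =~ s/ {2,}/ /g;
    $text =~ s/^ +| +$//gm;
    return $text;
}

my $sample = "how about i decide to look at it afterwards what";
print remove_stopwords($sample, qw(a about at how it)), "\n";
# prints: i decide to look afterwards what
```

The match is case-sensitive here, as in the original substitution, so the stopword I would not remove the lowercase i unless /i were added.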

Related

Wild card matching

I need to match sentences that contain both of the characters / and . in the same sentence. How can I do it with Perl?
Suppose my file has the following sentences:
ttterfasghti.
ddseghies/affag
hhail/afgsh.
asfsdgagh/
adterhjc/sgsagh.
My expected output should be :
hhail/afgsh.
adterhjc/sgsagh.

Given a clarification from a comment
Any order but the matching line should contain both / and .
an easy way
perl -wne'print if m{/} and m{\.}' filename
This is inefficient in the sense that it starts the regex engine twice and scans each string twice. However, in most cases that is unnoticeable, and this code is much clearer than a single regex for the task.
I use {} delimiters so as not to have to escape the /, in which case the m in front is compulsory. Then I use the same m{...} on the other pattern for consistency.
A most welcome inquiry asks that this be done in a script, not a one-liner! Happy to oblige.
use warnings;
use strict;
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
print if m{/} and m{\.};
}
close $fh;
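For comparison, the single-regex variant mentioned above can be sketched with two lookaheads anchored at the start of the line. The sub name has_slash_and_dot is a hypothetical helper for illustration; the sample lines are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line contains both "/" and "." in any order.
sub has_slash_and_dot {
    my ($line) = @_;
    # Each lookahead scans ahead from the start without consuming text,
    # so the two conditions do not depend on each other's position.
    return $line =~ m{\A(?=.*/)(?=.*\.)}s ? 1 : 0;
}

my @lines = ('ttterfasghti.', 'ddseghies/affag', 'hhail/afgsh.',
             'asfsdgagh/', 'adterhjc/sgsagh.');
print "$_\n" for grep { has_slash_and_dot($_) } @lines;
# prints:
# hhail/afgsh.
# adterhjc/sgsagh.
```

This does the work in one pass of the regex engine, at the cost of being harder to read than the two plain matches.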

This feels like a duplicate, but I just can't find a good previous question for this.
For / there are two ways:
use m// operator with different separator characters, e.g. m,<regex with />,, m{<regex with />}, or
escape it, i.e. /\//
For . use escaping.
Note that inside a character class ([...]) many special characters no longer need escaping.
Hence we get:
$ perl <dummy.txt -ne 'print "$1: $_" if m,(\w+/\w*\.),'
hhail/afgsh.: hhail/afgsh.
adterhjc/sgsagh.: adterhjc/sgsagh.
i.e. the line is printed if it contains one or more word characters, followed by a /, zero or more word characters, and then a literal . (dot).
Recommended reading perlrequick, perlretut & perlre.
UPDATE after OP clarified the requirement in a comment:
$ perl <dummy.txt -ne 'print if m,/, && m{\.}'
hhail/afgsh.
adterhjc/sgsagh.

Repeating regex pattern

I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where one or more words are enclosed in tags. In those instances where there is more than one word (the words are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually, so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...

This needs a mini nested parser, and I'd recommend a script, as it is easier to maintain:
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
One way to provide input for a script is via command-line arguments, available in the @ARGV global array in the script. For the use indicated in the question's "Edit", replace the hardcoded
my $str = q(...);
with
my $str = shift @ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
which in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from @ARGV, drop the -p switch, and add a print
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where the ... in the one-liner stands for the same code as above, and say is now needed since we don't have the -p with which $_ is printed out once it's processed.
shift takes an element off the front of an array and returns it. Without an argument it does that to @ARGV when outside a subroutine, as here (inside a subroutine its default target is @_).

This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
The s/ at the beginning is a substitution, so the command would be something like:
perl -pe 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g' in > out
No explicit print is needed: the -p switch prints $_ after each line has been processed. (The substitution itself returns the number of substitutions made, but -p prints $_, not that return value.)
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
$e = $'; # ' (stop markup highlighter); saved here, before the inner s/// can overwrite it
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.
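The loop described above can be packaged as a function, sketched here under the assumption that each input line is handled on its own. The sub name split_gl_words is hypothetical. The post-match text is saved before the inner substitution runs, because a successful inner match would otherwise replace the value of $'.

```perl
use strict;
use warnings;

# Hypothetical helper: split every multi-word <gl>...</gl> span so that
# each word gets its own pair of tags; text outside the tags is untouched.
sub split_gl_words {
    my ($in) = @_;
    my $out  = '';
    my $rest = $in;    # returned unchanged if no <gl> span is found
    while ($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g) {
        my ($prefix, $inner) = ($1, $2);
        $rest = $';    # text after the current match, saved before the inner s///
        $inner =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
        $out .= $prefix . $inner;
    }
    return $out . $rest;
}

print split_gl_words('word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>'), "\n";
# prints: word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
```

To process a file, the function could be called from a loop such as while (my $line = <STDIN>) { print split_gl_words($line) }, run without the -p or -n switches.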

Resolve Perl error: "Use of uninitialized value"

To clarify the following post, we have an automation requirement to send shipping information to an online platform so users can track their orders. We receive a daily .csv file through email, we have to extract the unique Shopify order reference from a field (last 10 digits of a field), save the amended .csv file and upload to an FTP site so tracking references can be matched to the specific order.
A previous colleague wrote an application in Perl to handle this, however it has not worked and I have no experience with Perl at all!
The program is called by a "Watcher" monitoring for files, the code for this is as follows:
use strict;
use warnings;
use Datatools::Watcher;
my $hotfolder = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT';
my $process = '"C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl"';
my @backup = ('\\gen-svr-01\users\DATA\MW\DMO_Report_IO\ARCHIVE');
watcher($hotfolder,$process,\@backup);
The main code (PERL PROGRAM) is:
use strict;
use warnings;
use File::Copy;
use Datatools::Watcher;
my $output = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\OUTPUT';
my $desthotfolder = '\\gen-svr-01\users\DATA\MW\Data_TO_MWS_FTP_TEST';
my $shopifyPos = 0;
my $shopifyNew = "";
my $header = 1;
my $inputfile = $ARGV[0];
my ($path,$file,$extention) = $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
my $outputfilename = $file . "_FORMATTED" . $extention;
$outputfilename =~ s/.~#~//;
my $outputfile = "$output\\$outputfilename";
open (INPUT, $inputfile) or die "Could not open input file: $inputfile\n";
open (OUTPUT, ">$outputfile") or die "Could not open output file: $outputfile\n";
while (my $record = <INPUT>){
chomp $record;
my @field = parse_csv($record);
if ($header == 1){
print OUTPUT $record . "\n";
$header = 0;
next;
} else {
$shopifyNew = substr $field[$shopifyPos], -10;
splice (@field, 0, 1, $shopifyNew);
print OUTPUT join(',',@field) . "\n";
next;
}
}
close INPUT;
close OUTPUT;
my $destfile = "$desthotfolder\\$outputfilename";
move $outputfile, $destfile or die "Could not move output file: $outputfile\nto: $destfile\n";
print "\nProcessing complete\n";
sub parse_csv {
my ($shift) = @_;
my $text = $shift; # record containing comma-separated values
my @new = ();
push(@new, $+) while $text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text, -1,1) eq ',';
return @new; # list of values that were comma-separated
}
When the program runs, the "Watcher" details the following:
File Seen, Processing File \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv
Use of uninitialized value $file in concatenation (.) or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Use of uninitialized value $extention in concatenation (.) or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Processing complete
Line 47 refers to the following code:
my $outputfilename = $file . "_FORMATTED" . $extention;
In the output folder, there is a file with the name "_FORMATTED" (no file extensions)
I have looked for a solution, and from my limited understanding I don't think the variables $file and $extention are being defined, but I have no idea how to correct this!

It would help to know which is line 47 in this code. I assume it's this line:
my $outputfilename = $file . "_FORMATTED" . $extention;
So, at this point, $file and $extention are both uninitialised. They are both supposed to be initialised in the previous line:
my ($path,$file,$extention) =
$inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
So it seems that your $inputfile doesn't match the regex. This leaves us with two options:
$inputfile isn't being set at all (which would mean it isn't being passed to the program).
$inputfile isn't in the correct format to match the regex.
To work out which of the problems we have here, add the following validation lines before the line which tries to set $file and $extention:
die "No input file given\n" unless $inputfile;
die "Input file name ($inputfile) is the wrong format\n"
unless $inputfile =~ / \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
Update: From recent updates to your question, I can see that you are running the program and passing it the filename \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv.
Let's take a closer look at your regex.
m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms
The /x option at the end means that the regex compiler ignores any literal whitespace in the string. So we can do the same. Let's break down what the individual parts are trying to match:
\A : matches the start of the string
(.+\/) : matches anything up to and including the last / in your string. It captures the matched substring into $1. This is what is stored in $path in your code. It's the directory that your file is in.
(.+\d\d\d\d) : This matches one or more of any character followed by four digits. This is stored in $2 and in your code it ends up in $file. It's the main part of the filename.
.+ : Matches one or more characters. Any characters. Your code does nothing with these characters.
([.]\w{3}) : Matches a dot followed by three "word" characters (basically alphanumerics). This is captured into $3 and ends up in your $extention variable.
\z : Matches the end of the string.
Putting all that together, you have a regex that looks for filenames and splits them into three parts - the path, the name and the extension. The only complication is that the filename section needs to contain four consecutive digits. And your filename is OrderTracking - which doesn't contain those required digits. So the regex doesn't match and your variables don't get set.
When this program was written, it was assumed that the filenames would contain four digits. The files that you are trying to process do not contain digits, so the program fails.
We can't suggest how you fix this. You need to speak to the people who supply your input files and find out why they have started to send you files with a different name format. Once you know that, you can decide on the best approach to work around the problem.
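The mismatch described above can be reproduced in isolation. This is a sketch only: the sub name split_input_name is a hypothetical wrapper around the regex from the program, and the second filename is made up to show a name that does contain four consecutive digits.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex from the program under discussion.
# Returns (path, file, extension), or an empty list when the name does not match.
sub split_input_name {
    my ($inputfile) = @_;
    return $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
}

# The second name is invented for illustration.
for my $name ('INPUT/OrderTracking.csv', 'INPUT/Orders2023_x.csv') {
    my ($path, $file, $extention) = split_input_name($name);
    if (defined $file) {
        print "$name: path=$path file=$file ext=$extention\n";
    }
    else {
        print "$name: no match\n";
    }
}
# prints:
# INPUT/OrderTracking.csv: no match
# INPUT/Orders2023_x.csv: path=INPUT/ file=Orders2023 ext=.csv
```

Note that the .+ after the digits also requires at least one character between the four digits and the extension.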

Capitalize all letters after a period and a space

I am trying to capitalize all occurrences of small letters after a period and a space using Perl. This is an example of an input:
...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...
This is the output I'd like to see:
...so, that's our art. The 4 of us can now have a dialog. We can have a conversation. We can speak to...
I have tried multiple regexes without much success--for instance:
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
or
$currentLine =~ s/([\.!?]\s*)(\w)/$1\U$2/g;
But I don't get the intended result. Help please!
UPDATE
To provide context, as somebody pointed out, the problem may lie elsewhere. The regexes are used in the context of this little script which does a few things besides the step that originated this post. I run it on long SRT files obtained from video closed captions. Thanks again for your help.
#! perl
use strict;
use warnings;
my $filename = $ARGV[0];
open(INPUT_FILE, $filename)
or die "Couldn't open $filename for reading!";
while (<INPUT_FILE>) {
my $currentLine = $_;
# Remove empty lines and lines that start with digits
if ($currentLine =~ /^[\s+|\d+]/){
next;
}
# Remove all carriage returns
$currentLine =~ s/\R$/ /;
# Convert all letters to lower case
$currentLine =~ s/([A-Z])/\l$1/g;
# Capitalize after period <= STEP THAT DOES NOT WORK
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
print $currentLine;
}
close(INPUT_FILE);

Try this.
Use a lookbehind so that only the letter itself is matched, and use \U to change it to uppercase:
$str ="...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...";
$str =~ s/(?<=\w\.\s)(\w)/\U$1/g;
print $str
Alternatively, use \K to keep the text matched before it out of the substitution:
$str =~ s/\w\.\s\K(\w)/\U$1/g;
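A minimal, self-contained sketch of the \K approach applied to the sample input from the question. The sub name capitalize_after_period is an assumption for illustration, and the pattern is restricted to lowercase ASCII letters after a period.

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase a lowercase letter that follows a period and whitespace.
sub capitalize_after_period {
    my ($text) = @_;
    # \K (Perl 5.10+) drops the period and whitespace from the matched text,
    # so only the captured letter is replaced; \u uppercases the next character.
    $text =~ s/\.\s+\K([a-z])/\u$1/g;
    return $text;
}

my $str = "...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...";
print capitalize_after_period($str), "\n";
# prints: ...so, that's our art. The 4 of us can now have a dialog. We can have a conversation. We can speak to...
```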

One problem is the code:
if ($currentLine =~ /^[\s+|\d+]/){
next;
}
Contrary to the comment, this ignores lines that start with a space, a digit, a plus or a pipe symbol. This is probably sending you down the wrong track. You likely meant to write:
next if /^(\s+$|\d)/;
This skips a line if the whole line is spaces, or if the first character is a digit.
You could simplify your loop, and generalize it, with:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
# Remove empty lines and lines that start with digits. sometimes
next if /^(\s+$|\d)/;
# Remove all carriage returns. forever
s/\R$//;
# Convert all letters to lower case. always
s/([A-Z])/\l$1/g;
# Capitalize after period <=... STEP THAT DOES NOT WORK
s/\.\s([a-z])/\. \u$1/g;
print "$_\n";
}
When run on itself, the output is:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
# remove empty lines and lines that start with digits. Sometimes
next if /^(\s+$|\d)/;
# remove all carriage returns. Forever
s/\r$//;
# convert all letters to lower case. Always
s/([a-z])/\l$1/g;
# capitalize after period <=... Step that does not work
s/\.\s([a-z])/\. \u$1/g;
print "$_\n";
}
Note that for the converted script to work, you'd need to use /gi as the modifier (instead of /g) on the substitute operations. There is plenty of room for improvement in this code, still.
One basic way of testing what's going on is to print everything at each step.
while (<INPUT_FILE>) {
print "## $_";
my $currentLine = $_;
# Remove empty lines and lines that start with digits
if ($currentLine =~ /^[\s+|\d+]/){
print "#SKIP# $currentLine";
next;
}
# Remove all carriage returns
$currentLine =~ s/\R$/ /;
print "#EOL# $currentLine##\n";
# Convert all letters to lower case
$currentLine =~ s/([A-Z])/\l$1/g;
print "#LC# $currentLine##\n";
# Capitalize after period <= STEP THAT DOES NOT WORK
$currentLine =~ s/\.\s([a-z])/\. \u$1/g;
print "#CAPS# $currentLine##\n";
print $currentLine; # Needs a newline!
}
This would have told you what was going on, and going wrong. Note that replacing the generic EOL (\R) with a blank means that the output doesn't end with a newline. That's a bad idea too — and it's why the outputs I generate end with a newline; either the one read from the file, or adding one after that's been removed.
Also, you should avoid ALL_CAPS file handles and use lexical ones — when you need an explicit file handle at all.
open my $fh, '<', $filename
or die "Couldn't open $filename for reading!";
Good work on including the file name in the error message (though adding $! to report the system error message would be a good idea too).

# (char)(char)(char) (char)(char)(char) Uppercase the 3rd
$str =~ s/(\.)(\s)(\w)/$1$2\U$3/g;
print $str
...so, that's our art. the 4 of us can now have a dialog. we can have a conversation. we can speak to...
...so, that's our art. The 4 of us can now have a dialog. We can have a conversation. We can speak to...

How do I return all characters that begin and end with certain characters in Perl (Or C++)?

note: I'm running Perl 5 on Linux
I'm currently doing a project where I have to input a few words and then return words that begin with "d" and end with "e". I'm not using a pre-done list, for example I input into the console Done, Dish, Dome, and Death. I want it to return Done and Dome, but not the other words. I hope to receive help how to do this in Perl, but C++ would help if Perl doesn't work out.

perl -ne ' print if /^d/i && /e$/i ' < words
Since you are using Linux, it may be simpler to use grep(1):
grep -i '^d.*e$' < words

That's almost trivial in Perl:
$ perl -nE 'say "ok" if /^d.*e$/i'
Done
ok
Dish
Dome
ok
Death
It reads from STDIN and says ok if the line matched. This is useful while debugging regular expressions. You just want to output matching lines, so you could simply replace say "ok" by say
$ perl -nlE 'say if /^d.*e$/i' words
where words is the filename of your words file; the -n switch makes Perl loop over its lines. Short explanation of that regular expression match:
^ # start of the line
d # the literal character 'd' (case-insensitive because of the i switch)
.* # everything allowed here
e # the literal character 'e' (also case-insensitive)
$ # end of the line

Not often I answer perl questions, but I think this does the trick.
my @words = ...;
@words = grep(/^d.*e$/i, @words);
grep uses a regular expression to filter the words.
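As a runnable sketch of the grep approach, using the four words from the question. The sub name words_d_to_e is a hypothetical helper, not something from the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the words that start with "d" and end with "e",
# ignoring case.
sub words_d_to_e {
    my @words = @_;
    return grep { /^d.*e$/i } @words;
}

my @words = qw(Done Dish Dome Death);
print join(', ', words_d_to_e(@words)), "\n";   # prints "Done, Dome"
```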

How about:
#!/usr/bin/perl -Tw
use strict;
use warnings;
for my $word (@ARGV) {
if ( $word =~ m{\A d .* e \z}xmsi ) {
print "$word\n";
}
}
