Perl regex find and replace

I'm new to perl and I'm trying to figure out a find and replace. I have a large csv file (actually semi-colon separated). Some of the numbers (int and decimals) in the file have a negative symbol after the number. I need to move the negative sign to before the number.
E.g: Change
ABC;10.00-;XYZ
to
ABC;-10.00;XYZ
I'm not sure how to do this in perl. Can someone please help?
Regards,
Anand

I would not dabble around in a large csv file with regexes, unless I was very sure about my data and the regex. Using a CSV module seems to me to be the safest way.
This script will take input files as arguments, and write the corrected files with a .new extension.
If you notice undesired changes in your output file, you can try to un-comment the keep_meta_info line.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $out_ext = ".new";
my $csv = Text::CSV->new( {
sep_char => ";",
# keep_meta_info => 1,
binary => 1,
eol => $/,
} ) or die "" . Text::CSV->error_diag();
for my $arg (@ARGV) {
open my $input, '<', $arg;
open my $output, '>', $arg . $out_ext;
while (my $row = $csv->getline($input)) {
for (@$row) {
s/([0-9\.]+)\-$/-$1/;
}
$csv->print($output, $row);
}
}
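To see what the per-field substitution in the script above does, here is a minimal sketch that runs it on a few hand-made fields without Text::CSV. The sample values and the helper name fix_sign are hypothetical; only the substitution itself comes from the script.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the script's substitution to a copy of one field.
sub fix_sign {
    my ($field) = @_;
    $field =~ s/([0-9\.]+)\-$/-$1/;   # a trailing minus after digits/dots moves to the front
    return $field;
}

# Made-up sample fields, not taken from the asker's file.
my @fields = ('ABC', '10.00-', '7-', 'XYZ-', '-3.5');
print join(';', map { fix_sign($_) } @fields), "\n";
# prints: ABC;-10.00;-7;XYZ-;-3.5
```

Because the pattern is anchored only at the end of the field, a value such as A1- would become A-1. If such fields can occur, adding ^ in front of the capture group is one way to rule that out.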

I'll assume you don't have to worry about quoting or escaping in your delimited file. I'll read from standard input and write to standard output; change to the appropriate files if required.
while( my $line = <STDIN> )
{
chomp( $line ); # chomp removes only a trailing newline; chop would remove any last character
my @rec = split( ';', $line, -1 ); # -1 keeps trailing empty fields
map( s/^(\d*\.?\d+)\-$/-$1/, @rec );
print join(';',@rec) . "\n";
}
If you do have to worry about escaping and quoting, then use Text::CSV_XS instead of the <STDIN>, split, and join operations.

In general, the replace command is s/old/new/flags:
s/( # start a capture group
\d+ # first part of the number
(\.\d+)? # possibly a decimal dot and the fractional part
)- # end capture group, match the minus sign
/-$1/gx # move minus to the front
The g flag means “global” (replace all occurrences), and x is “extended legibility” (allows whitespace and comments in the pattern). You have to test the expression on your data to see what corner cases you might have missed; it usually takes a few iterations to get the right one. Samples:
$ echo "10.5-;10-;0-;a-" | perl -pe 's/(\d+(\.\d+)?)-/-$1/g'
-10.5;-10;-0;a-
See also perldoc perlop (search for “replacement” to jump to the right section).
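One corner case worth testing is a field where the digits are only the tail of some other text, such as X9-. The sketch below is one possible tightening, not part of the answer above: it moves the minus sign only when the number fills a whole semicolon-separated field. The sub name move_minus and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: fixes trailing minus signs on whole numeric fields only.
sub move_minus {
    my ($line) = @_;
    $line =~ s{
        (?<![^;])            # preceded by ';' or the start of the string
        ( \d+ (?:\.\d+)? )   # the number: digits, optional fractional part
        -                    # the trailing minus sign
        (?![^;\n])           # followed by ';', a newline, or the end of the string
    }{-$1}gx;
    return $line;
}

print move_minus("ABC;10.00-;X9-;5-\n");
# prints: ABC;-10.00;X9-;-5
```

The double negation in the lookarounds (a negative assertion around a negated character class) is what lets the pattern accept the start and end of the line as field boundaries too.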

Related

Perl, use regex to find a match and replace just the last character of the match (in this case a line break)

I have to clean several CSV files before I put them in a database. Some of the files have an unexpected line break in the middle of a line. As a line should always end with a number, I managed to fix the files with this one-liner:
perl -pe 's/[^0-9]\r?\n//g'
While it did work, it also removes the last character before the line break:
foob
ar
turns into
fooar
Is there a Perl one-liner that follows the same rule without removing the last character before the line break?
A negative lookbehind, which is an assertion and won't consume characters, can also be used.
(?<!\d)\R
\d is short for digit
\R matches any linebreak sequence
See this demo at regex101
One way is to use \K lookbehind
perl -pe 's/[^0-9]\K\r?\n//g'
Now it drops all matches up to \K so only what follows it is subject to the replacement side.
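A minimal sketch comparing the original substitution with the \K version on an in-memory string; the sample text is made up. With perl -pe each line is processed separately, whereas here the whole text is one string, which does not change the result for this pattern.

```perl
use strict;
use warnings;

# Hypothetical data: the first record is broken after "foob".
my $text = "foob\nar;1\nbaz;2\n";

# Original pattern: the non-digit before the break is part of the match, so it is removed too.
(my $lossy = $text) =~ s/[^0-9]\r?\n//g;

# With \K, everything matched before \K is kept; only the line break is replaced.
(my $kept = $text) =~ s/[^0-9]\K\r?\n//g;

print $lossy;   # first line printed: fooar;1
print $kept;    # first line printed: foobar;1
```

Line breaks that follow a digit are left alone in both versions, so the records that were already complete keep their terminators.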
However, I'd rather recommend to process your CSV with a library, even as it's a little more code. There's already been one problem, that linefeed inside a field, what else may be there? A good library can handle a variety of irregularities.
A simple example with Text::CSV
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
s/\n+//g for @$row;
$csv->say(\*STDOUT, $row);
}
Consider other constructor options, also available via accessors, that are good for all kinds of unexpected problems. Like allow_whitespace for example.
This can be done as a command-line program ("one-liner") as well, if there is a reason for that. The library's functional interface via csv is then convenient
perl -MText::CSV=csv -we'
csv in => *ARGV, on_in => sub { s/\n+//g for @{$_[1]} }' filename
With *ARGV the input is taken either from a file named on command line or from STDIN.
Just capture the last char and put it back:
perl -pe 's/([^0-9])\r?\n/$1/g'

Resolve Perl error: "Use of uninitialized value"

To clarify the following post, we have an automation requirement to send shipping information to an online platform so users can track their orders. We receive a daily .csv file through email, we have to extract the unique Shopify order reference from a field (last 10 digits of a field), save the amended .csv file and upload to an FTP site so tracking references can be matched to the specific order.
A previous colleague wrote an application in Perl to handle this, however it has not worked and I have no experience with Perl at all!
The program is called by a "Watcher" monitoring for files, the code for this is as follows:
use strict;
use warnings;
use Datatools::Watcher;
my $hotfolder = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT';
my $process = '"C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl"';
my @backup = ('\\gen-svr-01\users\DATA\MW\DMO_Report_IO\ARCHIVE');
watcher($hotfolder,$process,\@backup);
The main code (PERL PROGRAM) is:
use strict;
use warnings;
use File::Copy;
use Datatools::Watcher;
my $output = '\\gen-svr-01\users\DATA\MW\DMO_Report_IO\OUTPUT';
my $desthotfolder = '\\gen-svr-01\users\DATA\MW\Data_TO_MWS_FTP_TEST';
my $shopifyPos = 0;
my $shopifyNew = "";
my $header = 1;
my $inputfile = $ARGV[0];
my ($path,$file,$extention) = $inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
my $outputfilename = $file . "_FORMATTED" . $extention;
$outputfilename =~ s/.~#~//;
my $outputfile = "$output\\$outputfilename";
open (INPUT, $inputfile) or die "Could not open input file: $inputfile\n";
open (OUTPUT, ">$outputfile") or die "Could not open output file: $outputfile\n";
while (my $record = <INPUT>){
chomp $record;
my @field = parse_csv($record);
if ($header == 1){
print OUTPUT $record . "\n";
$header = 0;
next;
} else {
$shopifyNew = substr $field[$shopifyPos], -10;
splice (@field, 0, 1, $shopifyNew);
print OUTPUT join(',',@field) . "\n";
next;
}
}
close INPUT;
close OUTPUT;
my $destfile = "$desthotfolder\\$outputfilename";
move $outputfile, $destfile or die "Could not move output file: $outputfile\nto: $destfile\n";
print "\nProcessing complete\n";
sub parse_csv {
my ($shift) = @_;
my $text = $shift; # record containing comma-separated values
my @new = ();
push(@new, $+) while $text =~ m{
# the first part groups the phrase inside the quotes.
# see explanation of this pattern in MRE
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text, -1,1) eq ',';
return @new; # list of values that were comma-separated
}
When the program runs, the "Watcher" details the following:
File Seen, Processing File \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csvUse of uninitialized value $file in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Use of uninitialized value $extention in concatenation <.> or string at C:\Workspace\bin\WS_DMO_Report_Manipulation_v1.0.pl line 47.
Processing complete
Line 47 refers to the following code:
my $outputfilename = $file . "_FORMATTED" . $extention;
In the output folder, there is a file with the name "_FORMATTED" (no file extensions)
I have looked for a solution, and from my limited understanding I don't think the variables: file and extension are being defined, but I have no idea how to correct!
It would help to know which is line 47 in this code. I assume it's this line:
my $outputfilename = $file . "_FORMATTED" . $extention;
So, at this point, $file and $extention are both uninitialised. They are both supposed to be initialised in the previous line:
my ($path,$file,$extention) =
$inputfile =~ m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
So it seems that your $inputfile doesn't match the regex. This leaves us with two options:
$inputfile isn't being set at all (which would mean it isn't being passed to the program).
$inputfile isn't in the correct format to match the regex.
To work out which of the problems we have here, add the following validation lines before the line which tries to set $file and $extention:
die "No input file given\n" unless $inputfile;
die "Input file name ($inputfile) is the wrong format\n"
unless $inputfile =~ / \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;
Update: From recent updates to your question, I can see that you are running the program and passing it the filename \\gen-svr-01\users\DATA\MW\DMO_Report_IO\INPUT/OrderTracking.csv.
Let's take a closer look at your regex.
m/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms
The /x option at the end means that the regex compiler ignores any literal whitespace in the string. So we can do the same. Let's break down what the individual parts are trying to match:
\A : matches the start of the string
(.+\/) : matches anything up to and including the last / in your string. It captures the matched substring into $1. This is what is stored in $path in your code. It's the directory that your file is in.
(.+\d\d\d\d) : This matches one or more of any character followed by four digits. This is stored in $2 and in your code it ends up in $file. It's the main part of the filename.
.+ : Matches one or more characters. Any characters. Your code does nothing with these characters.
([.]\w{3}) : Matches a dot followed by three "word" characters (basically alphanumerics). This is captured into $3 and ends up in your $extention variable.
\z : Matches the end of the string.
Putting all that together, you have a regex that looks for filenames and splits them into three parts - the path, the name and the extension. The only complication is that the filename section needs to contain four consecutive digits. And your filename is OrderTracking - which doesn't contain those required digits. So the regex doesn't match and your variables don't get set.
When this program was written, it was assumed that the filenames would contain four digits. The files that you are trying to process do not contain digits, so the program fails.
We can't suggest how you fix this. You need to speak to the people who supply your input files and find out why they have started to send you files with a different name format. Once you know that, you can decide on the best approach to work round the problem.
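The diagnosis can be checked in isolation. This sketch wraps the regex from the program in a hypothetical sub, split_name, and runs it on two file names: one shaped like the name the watcher reported, and one invented name that contains the four digits the regex expects (plus at least one character between the digits and the extension).

```perl
use strict;
use warnings;

# The regex from the program, unchanged.
my $name_re = qr/ \A (.+\/) (.+\d\d\d\d) .+ ([.]\w{3}) \z/ixms;

# Hypothetical helper: returns (path, file, extension), or an empty list on failure.
sub split_name {
    my ($inputfile) = @_;
    my @parts = $inputfile =~ $name_re;
    return @parts;
}

# The second name is invented for illustration only.
for my $name ('INPUT/OrderTracking.csv', 'INPUT/OrderTracking_2024_final.csv') {
    my @parts = split_name($name);
    if (@parts) {
        print "$name: path=$parts[0] file=$parts[1] ext=$parts[2]\n";
    } else {
        print "$name: no match, so \$file and \$extention would stay undefined\n";
    }
}
```

Only the second name matches, giving path=INPUT/, file=OrderTracking_2024 and ext=.csv.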

perl search and replace a substring

I am trying to search for a substring and replace the whole string if the substring is found. In the example below, someVal could be any value that is unknown to me.
How can I search for someServer.com and replace the whole string $oldUrl with $newUrl?
I can do it on the whole string just fine:
$directory = "/var/tftpboot";
my $oldUrl = "someVal.someServer.com";
my $newUrl = "someNewVal.someNewServer.com";
opendir( DIR, $directory ) or die $!;
while ( my $files = readdir(DIR) ) {
next unless ( $files =~ m/\.cfg$/ );
open my $in, "<", "$directory/$files";
open my $out, ">", "$directory/temp.txt";
while (<$in>) {
s/.*$oldUrl.*/$newUrl/;
print $out $_;
}
rename "$directory/temp.txt", "$directory/$files";
}
Your script will delete much of your content because you are surrounding the match with .*. This will match any character except newline, as many times as it can, from start to end of each line, and replace it.
The functionality that you are after already exists in Perl, the use of the -pi command line switches, so it would be a good idea to make use of it rather than trying to make your own, which works exactly the same way. You do not need a one-liner to use the in-place edit. You can do this:
perl -pi script.pl *.cfg
The script should contain the name definitions and substitutions, and any error checking you need.
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/\Q$old\E/$new/g;
This is the simplest possible solution, when running with the -pi switches, as I showed above. The \Q ... \E is the quotemeta escape, which escapes meta characters in your string (highly recommended).
You might want to prevent partial matches. If you are matching foo.bar, you may not want to match foo.bar.baz, or snafoo.bar. To prevent partial matching, you can put in anchors of different kinds.
(?<!\S) -- do not allow any non-whitespace before match
\b -- match word boundary
Word boundary would be suitable if you want to replace server1.foo.bar in the above example, but not snafoo.bar. Otherwise use whitespace boundary. The reason we do a double negation with a negative lookaround assertion and negated character class is to allow beginning and end of line matches.
So, to sum up, I would do:
use strict;
use warnings;
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/(?<!\S)\Q$old\E(?!\S)/$new/g;
And run it with
perl -pi script.pl *.cfg
If you want to try it out beforehand (highly recommended!), just remove the -i switch, which will make the script print to standard output (your terminal) instead. You can then run a diff on the files to inspect the difference. E.g.:
$ perl -p script.pl test.cfg > test_replaced.cfg
$ diff test.cfg test_replaced.cfg
You will have to decide whether word boundary is more desirable, in which case you replace the lookaround assertions with \b.
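To help with that decision, here is a small sketch that applies both kinds of anchoring to the foo.bar examples mentioned earlier. The sub names and the replacement text NEW are placeholders chosen for the illustration.

```perl
use strict;
use warnings;

my $old = "foo.bar";
my $new = "NEW";   # placeholder replacement

# Whitespace boundaries: the match must be delimited by whitespace or line ends.
sub replace_ws { my ($s) = @_; $s =~ s/(?<!\S)\Q$old\E(?!\S)/$new/g; return $s; }

# Word boundaries: the match may touch punctuation such as a dot.
sub replace_wb { my ($s) = @_; $s =~ s/\b\Q$old\E\b/$new/g; return $s; }

for my $s ("foo.bar", "x foo.bar y", "snafoo.bar", "server1.foo.bar", "foo.bar.baz") {
    printf "%-16s whitespace: %-16s word: %s\n", $s, replace_ws($s), replace_wb($s);
}
```

With these inputs the whitespace version changes only the first two strings. The \b version also rewrites server1.foo.bar (to server1.NEW) and foo.bar.baz (to NEW.baz), and both leave snafoo.bar alone.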
Always use
use strict;
use warnings;
Even in small scripts like this. It will save you time and headaches.
If you want to match and replace any subdomain, then you should devise a specific regular expression to match them.
\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com
The following is a rewrite of your script using more Modern Perl techniques, including Path::Class to handle file and directory operations in a cross platform way and $INPLACE_EDIT to automatically handle the editing of a file.
use strict;
use warnings;
use autodie;
use Path::Class;
my $dir = dir("/var/tftpboot");
while (my $file = $dir->next) {
next unless $file =~ m/\.cfg$/;
local @ARGV = "$file";
local $^I = '.bak';
while (<>) {
s/\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com\b/someNewVal.someNewServer.com/;
print;
}
#unlink "$file$^I"; # Optionally delete backup
}
Watch for the Dot-Star: it matches everything that surrounds the old URL, so the only thing remaining on the line will be the new URL:
s/.*$oldUrl.*/$newUrl/;
Better:
s/$oldUrl/$newUrl/;
Also, you might need to close the output file before you try to rename it.
If the old URL contains special characters (dots, asterisks, dollar signs...) you might need to use \Q$oldUrl to suppress their special meaning in the regex pattern.
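A short sketch of why the quoting matters: without \Q, each dot in the old URL matches any character, so a look-alike string can match by accident. The look-alike string below is made up.

```perl
use strict;
use warnings;

my $oldUrl    = "someVal.someServer.com";
my $lookalike = "someValXsomeServer-com";   # made-up string that is NOT the old URL

my $plain  = $lookalike =~ /$oldUrl/     ? 1 : 0;   # each . matches any character
my $quoted = $lookalike =~ /\Q$oldUrl\E/ ? 1 : 0;   # dots are taken literally

print "plain: $plain, quoted: $quoted\n";
# prints: plain: 1, quoted: 0
```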

find strings in source code by using regex in perl

I am studying regular expressions in Perl.
I want to write a script that accepts a C source code file and finds strings.
This is my code:
my $file1= @ARGV;
open my $fh1, '<', $file1;
while(<>)
{
@words = split(/\s/, $_);
$newMsg = join '', @words;
push @strings,($newMsg =~ m/"(.*\\*.*\\*.*\\*.*)"/) if($newMsg=~/".*\\*.*\\*.*\\*.*"/);
print Dumper(\@strings);
foreach(@strings)
{
print"strings: $_\n";
}
But I have a problem matching a multi-line string like this:
const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent";
What should I do?
Here is a fun solution. It uses MarpaX::Languages::C::AST, an experimental C parser. We can use the c2ast.pl program that ships with the module to convert a piece of C source file to an abstract syntax tree, which we dump to some file (using Data::Dumper). We can then extract all strings with a bit of magic.
Unfortunately, the AST objects have no methods, but as they are autogenerated, we know how they look on the inside.
They are blessed arrayrefs.
Some contain a single unblessed arrayref of items,
Others contain zero or more items (lexemes or objects)
“Lexemes” are an arrayref with two fields of location information, and the string contents at index 2.
This information can be extracted from the grammar.
The code:
use strict; use warnings;
use Scalar::Util 'blessed';
use feature 'say';
our $VAR1;
require "test.dump"; # populates $VAR1
my @strings = map extract_value($_), find_strings($$VAR1);
say for @strings;
sub find_strings {
my $ast = shift;
return $ast if $ast->isa("C::AST::string");
return map find_strings($_), map flatten($_), @$ast;
}
sub flatten {
my $thing = shift;
return $thing if blessed($thing);
return map flatten($_), @$thing if ref($thing) eq "ARRAY";
return (); # we are not interested in other references, or unblessed data
}
sub extract_value {
my $string = shift;
return unless blessed($string->[0]);
return unless $string->[0]->isa("C::AST::stringLiteral");
return $string->[0][0][2];
}
A rewrite of find_strings from recursion to iteration:
sub find_strings {
my @unvisited = @_;
my @found;
while (my $ast = shift @unvisited) {
if ($ast->isa("C::AST::string")) {
push @found, $ast;
} else {
push @unvisited, map flatten($_), @$ast;
}
}
return @found;
}
The test C code:
/* A "comment" */
#include <stdio.h>
static const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent";
int main() {
printf("Hello %s:\n%s\n", "World", text2);
return 0;
}
I ran the commands
$ perl $(which c2ast.pl) test.c -dump >test.dump;
$ perl find-strings.pl
Which produced the output
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"World"
"Hello %s\n"
""
""
""
""
""
""
Notice how there are some empty strings not from our source code, which come somewhere from the included files. Filtering those out would probably not be impossible, but is a bit impractical.
It appears you're trying to use the following regular expression to capture multiple lines in a string:
my $your_regexp = m{
(
.* # anything
\\* # any number of backslashes
.* # anything
\\* # any number of backslashes
.* # anything
\\* # any number of backslashes
.* # anything
)
}x
But it appears more of a grasp of desperation than a deliberately thought out plan.
So you've got two problems:
find everything between double quotes (")
handle the situation where there might be multiple lines between those quotes
Regular expressions can match across multiple lines. The /s modifier makes . match newlines as well; a negated character class such as [^\"] matches them even without it. So try:
my $your_new_regexp = m{
\" # opening quote mark
([^\"]+) # anything that's not a quote mark, capture
\" # closing quote mark
}xs;
You might actually have a 3rd problem:
remove trailing backslash/newline pairs from strings
You could handle this by doing a search-replace:
foreach ( @strings ) {
$_ =~ s/\\\n//g;
}
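Putting the two steps together, a minimal sketch might look like this. It works on a whole source text held in one string instead of reading line by line, the sub name quoted_strings is made up, and, like the pattern above, it does not cope with escaped quotes (\") or with quotes inside comments.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the contents of all double-quoted strings.
sub quoted_strings {
    my ($source) = @_;
    # Capture everything between a pair of double quotes, newlines included.
    my @strings = $source =~ m{ " ([^"]+) " }xsg;
    # Remove backslash-newline pairs left over from line continuations.
    s/\\\n//g for @strings;
    return @strings;
}

# Shortened version of the asker's example; the heredoc keeps backslashes literal.
my $source = <<'END_C';
const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines";
printf("%s\n", text2);
END_C

print "[$_]\n" for quoted_strings($source);
```

This prints the folded literal on one line, followed by [%s\n]. The \n in the format string survives because it is a backslash followed by the letter n, not by a real newline.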
Here is a simple way of extracting all strings in a source file. There is an important decision we can make: Do we preprocess the code? If not, we may miss some strings if they are generated via macros. We would also have to treat the # as a comment character.
As this is a quick-and-dirty solution, syntactic correctness of the C code is not an issue. We will however honour comments.
Now if the source was pre-processed (with gcc -E source.c), then multiline strings are already folded into one line! Also, comments are already removed. Sweet. The only comment-like lines that remain mention line numbers and source files for debugging purposes. Basically all that we have to do is
$ gcc -E source.c | perl -nE'
next if /^#/; # skip line directives etc.
say $1 while /(" (?:[^"\\]+ | \\.)* ")/xg;
'
Output (with the test file from my other answer as input):
""
"__isoc99_fscanf"
""
"__isoc99_scanf"
""
"__isoc99_sscanf"
""
"__isoc99_vfscanf"
""
"__isoc99_vscanf"
""
"__isoc99_vsscanf"
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"
So yes, there is a lot of garbage here (they seem to come from __asm__ blocks), but this works astonishingly well.
Note the regex I used: /(" (?:[^"\\]+ | \\.)* ")/x. The pattern inside the capture can be explained as
" # a literal '"'
(?: # the begin of a non-capturing group
[^"\\]+ # a character class that matches anything but '"' or '\', repeated once or more
|
\\. # an escape sequence like '\n', '\"', '\\' ...
)* # zero or more times
" # closing '"'
What are the limitations of this solution?
We need a preprocessor
This code was tested with gcc
clang also supports the -E option, but I have no idea how the output is formatted.
Character literals are a failure mode, e.g. myfunc('"', a_variable, '"') would be extracted as "', a_variable, '".
We also extract strings from other source files. (false positives)
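The character-literal limitation can be reproduced without a compiler. The sketch below applies the same string pattern to two hypothetical lines of C: one with escaped quotes inside a string, which is handled correctly, and one with character literals, which produces the bogus match described in the list. The sub name strings_in is made up.

```perl
use strict;
use warnings;

# The string pattern from the one-liner.
my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# Hypothetical helper: all string-like matches in a piece of code.
sub strings_in {
    my ($code) = @_;
    my @found = $code =~ /($string_re)/g;
    return @found;
}

# q{} keeps the backslashes literal, as they would appear in the C source.
my $escaped  = q{puts("say \"hi\"\n");};
my $literals = q{myfunc('"', a_variable, '"');};

print "$_\n" for strings_in($escaped), strings_in($literals);
```

The first line yields the whole literal "say \"hi\"\n" including its escaped quotes; the second yields "', a_variable, '", which is not a string in the C code at all.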
Oh wait, we can fix the last bit by parsing the source file comments which the preprocessor inserted. They look like
# 29 "/usr/include/stdio.h" 2 3 4
So if we remember the current filename, and compare it to the filename we want, we can skip unwanted strings. This time, I'll write it as a full script instead of a one-liner.
use strict; use warnings;
use autodie; # automatic error handling
use feature 'say';
my $source = shift @ARGV;
my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;
# open a pipe from the preprocessor
open my $preprocessed, "-|", "gcc", "-E", $source;
my $file;
while (<$preprocessed>) {
$file = $1 if /^\# \s+ \d+ \s+ ($string_re)/x;
next if /^#/;
next if $file ne qq("$source");
say $1 while /($string_re)/xg;
}
Usage: $ perl extract-strings.pl source.c
This now produces the output:
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"
If you cannot use the convenient preprocessor to fold multiline strings and remove comments, this gets a lot uglier, because we have to account for all of that ourselves. Basically, you want to slurp in the whole file at once, not iterate it line by line. Then, you skip over any comments. Do not forget to ignore preprocessor directives as well. After that, we can extract the strings as usual. Basically, you have to rewrite the grammar
Start → Comment Start
Start → String Start
Start → Whatever Start
Start → End
to a regex. As the above is a regular language, this isn't too hard.
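As a rough sketch of that idea (not a complete C lexer), the grammar can be written as one alternation that is matched repeatedly over the slurped source. Comments, preprocessor lines and character literals are matched only so that they can be skipped, and anything else is stepped over by the regex engine. The group names, the sub name c_strings and the sample source are assumptions for the illustration; directives continued over several lines with a backslash, and trigraphs, are not handled.

```perl
use strict;
use warnings;

my $token_re = qr{
      (?<comment>   /\* .*? \*/ | // [^\n]* )        # block or line comment
    | (?<directive> ^ [ \t]* \# [^\n]* )             # preprocessor line
    | (?<string>    " (?: [^"\\]+ | \\. )* " )       # string literal
    | (?<char>      ' (?: [^'\\]+ | \\. )* ' )       # character literal
}xms;

# Hypothetical helper: string literals of a C source held in one scalar.
sub c_strings {
    my ($source) = @_;
    my @strings;
    while ($source =~ /$token_re/g) {
        next unless defined $+{string};        # skip comments, directives, char literals
        (my $s = $+{string}) =~ s/\\\n//g;     # fold backslash-newline continuations
        push @strings, $s;
    }
    return @strings;
}

# Made-up test source; the heredoc keeps backslashes literal.
my $demo = <<'END_C';
/* A "comment" */
#include "local.h"
static const char *text2 =
"gone crazy\
and spans lines";
int main() {
    // another "comment"
    printf("Hello %s:\n%s\n", "World", text2);
    return '"';
}
END_C

print "$_\n" for c_strings($demo);
```

For this input the sketch reports three strings: the folded text2 literal, the printf format and "World". The quoted text in the comments, the include line and the '"' character literal are skipped.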

How do I do optional matching in a regular expression using Perl?

I want to extract the size value from a string. The string can be formatted in one of two ways:
Data-Size: (2000 bytes)
or
file Data-Size: (2082 bytes)
If the string is present in a file, it will appear only once.
So far I have:
#!/usr/bin/perl
use strict;
use warnings;
open FILE, "</tmp/test";
my $input = do { local $/; <FILE> };
my ($length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/m;
$length or die "could not get data length\n";
print "length: $length\n";
The problem seems to be with making the word file optional. I thought I could do this with:
(file)?
But this seems to be stopping matches when the word file is not present. Also when the word file is there it sets $length to the string "file". I think this is because the parenthesis around file also mean extraction.
So how do I match either of the two strings and extract the size value?
You want the second capture in $length. To do that, you could use
my (undef, $length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/;
or
my $length = ( $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/ )[1];
But a much better approach would be to avoid capturing something you're not interested in capturing.
my ($length) = $input =~ /(?:file)?\s*Data-Size: \((\d+) bytes\)/;
Of course, you'd get the same result from
my ($length) = $input =~ /Data-Size: \((\d+) bytes\)/;
By the way, I removed the needless /m. /m changes the meaning of ^ and $, yet neither are present in the pattern.
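The difference between the capturing and the non-capturing group can be seen by running both patterns on the two input formats from the question. The sub names below are made up for this sketch.

```perl
use strict;
use warnings;

# Non-capturing group: the digits are the first (and only) capture.
sub data_size {
    my ($input) = @_;
    my ($length) = $input =~ /(?:file)?\s*Data-Size: \((\d+) bytes\)/;
    return $length;   # undef if the line is not present
}

# Capturing group, as in the question: the first capture is "file" or undef.
sub first_capture {
    my ($input) = @_;
    my ($first) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/;
    return $first;
}

for my $input ("Data-Size: (2000 bytes)", "file Data-Size: (2082 bytes)") {
    my $first = first_capture($input);
    printf "%-30s first capture: %-5s size: %s\n",
        $input, defined $first ? $first : 'undef', data_size($input);
}
```

In the question's version the match itself succeeds in both cases; it is the list assignment that picks up the first capture (undef or "file") instead of the digits.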
Just my 2 cents: you can also make the match optional with an empty alternative (the literal parentheses still have to be escaped):
/(?:file|)\s*Data-Size: \((\d+) bytes\)/