I need to parse through many delimited files and I had a question. Within a while loop, how do I make a substitution within a field? Let me give some example code and data.
Data Example:
Word,Name,Number,You/Me,Data
Hello,Josh,123,Me,Data
Hello,Joe,222,Me,Data
GOAL:
In this example what I would like to do is do a substitution on $[2] and $[3].
In other words, the field $[2] would be the number field. $[3] would be the You/Me field.
What I have in my code is this:
my @F = split;
while (<>) {
if ($F[3] =~ /^You$/ print "Me";) next;
if ($F[2] =~ /^222$/ print "P";) next;
if ($F[2] =~ /^123$/ print "P";) next;
print @F;
}
I can't seem to find the correct syntax to make substitutions in specific fields and was hoping someone would have a suggestion. The goal of my results is below.
DESIRED RESULTS:
Word,Name,Number,You/Me,Data
Hello,Josh,P,Me,Data
Hello,Joe,P,Me,Data
I would just like to perform substitutions on the fields specified and leave everything else the same.
Additionally I may have a scenario where I would like to delimit certain fields by something completely different while leaving everything else the same. I would also like to use scalars.
Data Example:
Word,Name,Number,You/Me,Data
Hello,Josh,123,Me,Data
Hello,Joe,222,Me,Data
DESIRED RESULTS:
Word,Name,Number,You/Me,Data
Hello,Josh,P-Me,Data
Hello,Joe,P-Me,Data
EXAMPLE with scalar
my $numbers = qw/222|123/;
my @F = split;
while (<>) {
if ($F[3] =~ /^You$/ print "Me";) next;
if ($F[2] = /^$numbers$/ print "P";) next;
join ("-",$[2],$[3]);
print @F;
}
So in the end, I would like to know how to substitute in fields, join specific fields with a different delimiter, and implement scalars into this sort of split/join script.
This perl script does what you need.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $file = "file";
local @ARGV = $file;
local $^I = '.bak';
my $numbers = qr/^(?:222|123)$/;
while (<>) {
print and next if /^\s+$/ or $.==1;
my @flds = split /,/;
$flds[2] = "P" if $flds[2] =~ $numbers;
$flds[3] = "Me" if $flds[3] eq "You";
print join ",", @flds[0,1], join ("-", @flds[2,3]), @flds[4..$#flds];
}
#unlink "$file$^I";
Outputs:
Word,Name,Number,You/Me,Data
Hello,Josh,P-Me,Data
Hello,Joe,P-Me,Data
We use the $^I variable to make the changes in place while keeping a backup of the original with the extension .bak. We split each line on , and populate an array called @flds.
Then we do two checks and, if they succeed, change the corresponding fields. Once the checks are done we print the fields, joining fields 2 and 3 with - and everything else with ,.
unlink is commented out. You can uncomment it if you don't want the backup file.
The print and next if /^\s+$/ or $.==1; line passes blank lines and the header line (line 1) through unchanged. You can drop the /^\s+$/ test if you do not have any blank lines in your file.
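The split, modify, join steps described above can also be tried on a single string, without in-place editing. This is only a minimal sketch: the sub name rewrite_line is hypothetical, it assumes every data line has at least four fields, and the caller is expected to skip the header row.

```perl
use strict;
use warnings;

# Same pattern as in the script above.
my $numbers = qr/^(?:222|123)$/;

# Hypothetical helper: apply the field logic to one line and return the result.
sub rewrite_line {
    my ($line) = @_;
    chomp $line;
    my @flds = split /,/, $line, -1;    # -1 keeps trailing empty fields
    $flds[2] = "P"  if $flds[2] =~ $numbers;
    $flds[3] = "Me" if $flds[3] eq "You";
    # Fields 2 and 3 are joined with "-", everything else with ","
    return join ",", @flds[0, 1], join("-", @flds[2, 3]), @flds[4 .. $#flds];
}

print rewrite_line("Hello,Josh,123,Me,Data"), "\n";   # Hello,Josh,P-Me,Data
```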
Would you like to change "You" to "Me"? Your desired results suggest you don't want that change, but your code says otherwise.
I think you can use $F[2] =~ s/^123$/P/. See perldoc perlre.
And maybe you would like to use the $^I variable if you want in-place editing.
See perldoc perlvar.
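As a rough illustration of the s/// suggestion, the sketch below applies the substitution to one field and keeps the comma as the delimiter, which corresponds to the first set of desired results in the question. The sub name mask_number is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: substitute inside field 2 only, then rebuild the line.
sub mask_number {
    my ($line) = @_;
    chomp $line;
    my @F = split /,/, $line, -1;             # -1 keeps trailing empty fields
    $F[2] =~ s/^(?:123|222)$/P/ if @F > 2;    # only touch the Number field
    return join ",", @F;
}

print mask_number("Hello,Joe,222,Me,Data"), "\n";   # Hello,Joe,P,Me,Data
```

The header row passes through unchanged here because "Number" does not match the pattern.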
Related
I have been working on this for a little while now and can't seem to figure it out. I have a file containing a bunch of lines all structured like the one below meaning each line starts with "!" and has three separators "<DIV>".
!the<DIV>car<DIV>drove down the<DIV>road off into the distance
I am interested in retrieving the last string "road off into the distance" I can't seem to get it to work. Below I have listed the current code I have.
while($line = <INFILE>) {
$line =~ /<SEP>{3}(.*)/;
print $1;
}
Any help would be greatly appreciated!
The statement
@b = $a =~ /^!(.*?)<DIV>(.*?)<DIV>(.*?)<DIV>(.*)/
will split the string into a list, and you can then extract the 4th element with
$b[3]
If you really want only the last one, do this instead:
($text) = $a =~ /^!.*<DIV>(.*)/
I don't know whether you insist on a regex or simply didn't think of anything else, but split does this nicely:
$text = (split '<DIV>', $str)[-1];
If you regularly have such repeating patterns split may well be better for the job than a pure regex. (Split also uses full regular expressions in its pattern, of course.)
ADDED
All this can be done directly, if you simply only need to pull the last thing from each line:
open my $fh, '<', $file;
my @text = map { (split '<DIV>')[-1] } <$fh>;
close $fh;
print "$_\n" for @text;
The split by default uses $_, which inside the map is the current element processed. For lines without a <DIV> this returns the whole line. A file handle in the list context serves all lines as a list; the list context is imposed by map here.
In case you want all text between delimiters you can do
my @rlines = map { [ split '<DIV>' ] } <$fh>;
where [ ] creates a reference to an anonymous array holding the list returned by split, so @rlines contains array references, each holding the text between the <DIV>s on one line. The leading ! is still there, though, and dropping it takes a little more processing.
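The extra processing needed to drop the leading ! could look like the sketch below. The sub name fields_of is hypothetical, and it assumes the ! only ever appears as the first character of a line.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line on <DIV> after stripping the leading "!".
sub fields_of {
    my ($line) = @_;
    chomp $line;
    $line =~ s/^!//;                    # drop the leading marker, if present
    return [ split /<DIV>/, $line ];    # array reference, one entry per field
}

my $f = fields_of("!the<DIV>car<DIV>drove down the<DIV>road off into the distance\n");
print $f->[-1], "\n";   # road off into the distance
```

Inside a map over the file handle this would be used as map { fields_of($_) } <$fh>.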
Of course, for the map block you can use { (/.*<DIV>(.*)/)[0] } from Jim Garrison's answer for a single match, or modify the regex a little to catch'em all.
If performance is a factor then that's a little different question.
A simple substitution could work too:
while(<DATA>){
chomp;
my $text = (s/.*<DIV>//g, $_);
say $text;
}
Simple regex which answers your question:
my $match= '';
while($line = <INFILE>) {
($match) = $line =~/.*<DIV>(.*)/;
}
print $match, "\n";
I'd like to use one of perl's special variable to make this snippet a bit less large and ugly:
my $mysqlpass = "mysqlpass=verysecret";
$mysqlpass = first { /mysqlpass=/ } @vars;
$mysqlpass =~ s/mysqlpass=//;
I have looked this info up and tried several special variables ($',$1,$`, etc) to no avail
A s/// will return true if it replaces something.
Therefore, it is possible to simply combine those two statements instead of having a redundant m//:
use strict;
use warnings;
use List::Util qw(first);
chomp(my @vars = <DATA>);
my $mysqlpass = first { s/mysqlpass=// } @vars;
print "$mysqlpass\n";
__DATA__
mysqluser=notsosecret
mysqlpass=verysecret
mysqldb=notsecret
Outputs:
verysecret
One Caveat
Because $_ is an alias to the original element, the substitution will affect the corresponding @vars value as well.
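One way to sidestep that aliasing side effect, sketched here with the sample data inlined rather than read from DATA, is to capture the value instead of substituting it away. This swaps first for a map with a capture group, so it is a different technique from the one above.

```perl
use strict;
use warnings;

my @vars = ("mysqluser=notsosecret", "mysqlpass=verysecret", "mysqldb=notsecret");

# Capture the value instead of substituting, so @vars is left untouched.
# The list assignment keeps only the first captured value.
my ($mysqlpass) = map { /^mysqlpass=(.*)/ ? $1 : () } @vars;

print "$mysqlpass\n";   # verysecret
print "$vars[1]\n";     # mysqlpass=verysecret (unchanged)
```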
Alternative using split
To avoid that, I would ask whether @vars contains nothing but key/value pairs separated by equals signs. If that's the case, then I would suggest simply translating that array into a hash instead.
This would enable much easier pulling of all keys:
use strict;
use warnings;
chomp(my @vars = <DATA>);
my %vars = map {split '=', $_, 2} @vars;
print "$vars{mysqlpass}\n";
__DATA__
mysqluser=notsosecret
mysqlpass=verysecret
mysqldb=notsecret
Outputs:
verysecret
Yes, you can use a regular expression, if you really want to go down the path of obfuscation.
See following code:
my $string = "mysqlpass=verysecret";
if ($string =~ /^(\w+)\=(\w+)$/) {
print $1; # This stores 'mysqlpass'
print $2; # This stores 'verysecret'
}
I recommend against this, though, because you want your code to be readable.
The one you're looking for is $_.
I have a file with strings in each row as follows
"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1
the next line could look like
84545,X,2
I'm trying to parse this text in Perl. Note: quotes are present in the strings when there are several items in a row, but not when there is only one item.
I would like to parse each item into an array. I tried the following regex
@fields = ($_ =~ /(\d+\_\d+),*/g);
but it is missing the last 2714. How do I capture that edge case? Any help appreciated. Thanks in advance
It looks like you have a CSV File, so use an actual CSV parser for it like Text::CSV.
After you parse the columns, you can separate your first field into the array:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
or die "Cannot use CSV: ".Text::CSV->error_diag ();
my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1};
if ($csv->parse($line)) {
my @columns = $csv->fields();
my @nums = split ',', $columns[0];
print "@nums\n";
}
Outputs:
229269_2 190594_2 94552_2 266076_2 269628_2 165328_2 99319_2 263339_2 263300_2 99315_2 271509_2 2714
Why not a regex ?
Yes, of course it's possible to use a regex for practically anything. But what you need to understand is that this will make your code extremely fragile and difficult to maintain.
Even if you want to use a regular expression, you should STILL do this in two steps. First separate the initial column(s) of your CSV, and then process the specific column that you're worried about.
Because you're just working with the first column, you could use code like the following:
use strict;
use warnings;
my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1};
if ($line =~ /^"(.*?)"|^([^,]*)/) {
my $column0 = $1 // $2;
my @nums = split ',', $column0;
print "@nums\n";
}
The above happens to accomplish the same thing as the previous code. However, it has one big flaw: it's not nearly as obvious to the maintaining programmer what's going on.
Whenever a new coder, or even yourself in 6 months, views the first set of code, it is extremely obvious what format your data is in. You're working with a CSV file, and the first column is a list separated by commas. The second code also works, but the new maintainer must actually read the regex and figure out what's going on to understand both what format the data is in, and whether the code is actually doing it correctly.
Anyway, do whatever you will, but I strongly advise you to use an actual CSV Parser for parsing csv files.
If all you want is all but the last two fields...
my $string = qq("229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1);
$string =~ s/"//g; # delete the quotes
my @f = split (/,/, $string); # split on the comma
pop @f; pop @f; # jettison the last two columns
# @f contains what you're looking for
After searching everywhere on the web for a solution to this, and being a noob to Perl, I have decided to post on Stack.
What I am looking to do is loop through array1, which contains the required matches (they will be different each time and could contain lots of patterns, or rather strings that need to be matched, but I am using this example so I can understand the problem). Each element is then tested with a grep against array2, which contains some strings. Finally, I print out the lines that grep found to match the patterns used.
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw( strftime );
my (@regressions,@current_test_summary_file,@regression_links);
@regressions = ("test","table");
@current_test_summary_file = ("this is the line for test \n","this is the line for table \n","this is the line for to\n");
foreach (@regressions)
{
print $_ . "\n";
@regression_links = grep(/$_/, @current_test_summary_file);
}
foreach(@regression_links)
{
print $_ . "\n";
}
So I would like to pick up only the first two elements instead of all three, which is what happens now.
Hopefully I've explained my problem properly. I've tried quite a few things (using qq, for example) but have only really used grep to try this (I'm unsure how I could do this using something different). If someone can point me in the right direction (and tell me whether I should be using grep at all for this problem), I would be very grateful. I just tried the code below instead, and it only gets the second element. Any ideas? (Sorry, I can't reply to your comment for some reason, but just so you know, Axeman, your second idea worked.)
foreach my $regression (@regressions)
{
print $regression . "\n";
@regression_links = grep(/$regression/, @current_test_summary_file);
}
Inside of grep, $_ refers to the list element involved in the test. Also, /abc/ is short for $_ =~ /abc/, so you're effectively testing $_ =~ /$_/. Guess what the answer is likely to be (with no metacharacters)?
So you're passing all values into @regression_links.
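The effect is easy to reproduce in isolation. The following sketch uses shortened sample lines (an assumption, not the asker's exact data) and compares grep with the implicit $_ against grep with a named loop variable.

```perl
use strict;
use warnings;

my @lines = ("this is the line for test",
             "this is the line for table",
             "this is the line for to");

my @all;
for ("test") {
    # Inside grep, $_ is each element of @lines, so /$_/ matches it against itself.
    @all = grep { /$_/ } @lines;
}

my @wanted;
for my $reg ("test") {
    # A named loop variable keeps the pattern separate from grep's $_.
    @wanted = grep { /$reg/ } @lines;
}

print scalar(@all), " ", scalar(@wanted), "\n";   # 3 1
```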
What you need to do is save the value of $_. But since you're not using the simple print statement, you could just as easily reserve the $_ variable for the grep, like so:
foreach my $reg ( @regressions ) {
print "$reg\n";
@regression_links = grep(/$reg/, @current_test_summary_file );
}
However, you're resetting @regression_links with each loop, and a push would work better:
push @regression_links, grep(/$reg/, @current_test_summary_file );
However, a for loop is a bad choice for this anyway, because you could get duplicates and you might not want them. Since you're matching by regex, one alternative with multiple criteria is to build a regex alternation. But in order to get a proper alternation, we need to sort it by length of string descending and then by alphabetic order (cmp).
# create the alternation expression
my $filter
= join( '|'
, sort { length( $b ) <=> length( $a )
|| $a cmp $b
}
@regressions
);
@regression_links = grep( /$filter/, @current_test_summary_file );
Or other than concatenating a regex, if you wanted to test them separately, the better way would be with something like List::MoreUtils::any:
@regression_links
= grep {
my $c = $_; # save $_ (the current line)
any { $c =~ /$_/ } @regressions; # true if any pattern matches the line
} @current_test_summary_file
;
Axeman is correct, and replacing $_ with a named variable such as $reg will solve your problem. But as for pulling out matches, I would naively push all matches onto @regression_links, producing a list that (probably) contains duplicate matches. You can then use List::MoreUtils::uniq to trim down the list. If you don't have List::MoreUtils installed you can just copy the function (it's 2 lines of code).
# Axeman's changes
foreach my $reg (@regressions) {
print "regression: $reg\n";
# Push all matches.
push @regression_links, grep(/$reg/, @current_test_summary_file);
}
# Trim down the list once matching is done.
use List::MoreUtils qw/uniq/;
foreach ( uniq(@regression_links) ) {
print "$_\n";
}
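If the module is not available, a stand-in can be written with the common %seen idiom. The sketch below is not copied from List::MoreUtils; the name my_uniq is hypothetical, and it keeps the first occurrence of each value in its original order.

```perl
use strict;
use warnings;

# Hypothetical stand-in for uniq: keep the first occurrence of each value.
sub my_uniq {
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a value appears, so it passes once.
    return grep { !$seen{$_}++ } @_;
}

my @links = my_uniq("line for test", "line for table", "line for test");
print "$_\n" for @links;   # prints "line for test", then "line for table"
```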
I've two text files. I want to take text from first one between </sup><sup> tags, and insert it to another text file between {}.
Better example (sth like a dictionary)
Text1:
<sup>1</sup>dog
<sup>2</sup>cat
<sup>3</sup>lion
<sup>1</sup>flower
<sup>2</sup>tree
.
.
Text2:
\chapter1
\pkt{1}{}{labrador retirever is..}
\pkt{2}{}{home pets..}
\pkt{3}{}{wild cats..}
\chapter2
\pkt{1}{}{red rose}
\pkt{2}{}{lemon tree}
.
.
What I want:
Text3:
\chapter1
\pkt{1}{dog}{labrador retirever is..}
\pkt{2}{cat}{home pets..}
\pkt{3}{lion}{wild cats..}
\chapter2
\pkt{1}{flower}{red rose}
\pkt{2}{tree}{lemon tree}
The text is arbitrary, but you can see what I want. Perl would be best.
So get
</sup>**text**<sup>
and paste it to
\pkt{nr}{**here**}{this is translation of this word already stored in text2}.
Text A and B are in order, so if I could read first </sup>text<sup> from Text A, save it in temp, delete this line from Text A, put it on first free {} slot in text B, and start over again it would be great. Numbers will match because order is saved.
Sorry for my English:)
Thanks!
This code puts all dict items in an array, in the order they appear. The tex file is then looped and each time \pkt{num}{} is hit an item from the array is inserted.
Newlines in dict are handled and replaced with spaces (Just remove this replace in the map if you don't want this behavior). \pkt should be found as long as the part \pkt{num}{} is not spanning multiple lines. Otherwise I think the easiest solution would be to undef $/ (the input record separator) and read the whole file into a string and just loop the replacement (could be a bit memory hungry though).
#!/usr/bin/perl -wT
use strict;
my $dict_filename = 'text1';
my $tex_filename = 'text2';
my $out_filename = 'text3';
open(DICT, $dict_filename);
my @dict;
{
# Set newline separator to <sup>
local $/ = '<sup>';
# Throw away first "line", it will be empty
<DICT>;
# Extract string and throw away newlines
@dict = map { $_ =~ m#</sup>\s*(.*?)\s*(?:<sup>|$)#s; $_ = $1; $_ =~ s/\n/ /g; $_; } <DICT>;
}
close(DICT);
open(TEX, $tex_filename);
open(OUT, ">$out_filename");
my $tex_line;
my $dict_pos = 0;
while($tex_line = <TEX>)
{
# Replace any \pkt{num}{} with \pkt{num}{text}
$tex_line =~ s|(\\pkt\{\d+\}\{)(\})|$1$dict[$dict_pos++]$2|g;
print OUT $tex_line;
}
close(TEX);
close(OUT);
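The slurp alternative mentioned above, for the case where \pkt{num}{} spans several lines, might be sketched as follows. Treat it as an assumption-laden illustration: the sub names slurp and fill_pkt are hypothetical, the \s* between the brace groups is a guess at how the text would be broken across lines, and an empty string is inserted if the dictionary runs out of items.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # undef the input record separator
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical helper: fill each empty {} after \pkt{num} from @$dict, in order.
sub fill_pkt {
    my ($tex, $dict) = @_;
    my $pos = 0;
    # \s* lets the two brace groups be separated by whitespace or a newline.
    $tex =~ s|(\\pkt\{\d+\}\s*\{)(\})|$1 . ($pos < @$dict ? $dict->[$pos++] : '') . $2|ge;
    return $tex;
}

my $tex = "\\chapter1\n\\pkt{1}{}{labrador}\n\\pkt{2}\n{}{home pets}\n";
print fill_pkt($tex, ["dog", "cat"]);
```

With real files the call would be fill_pkt(slurp($tex_filename), \@dict), at the cost of holding the whole file in memory.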