perl split a string on "," [duplicate] - regex

This question already has answers here:
How do I efficiently parse a CSV file in Perl?
(6 answers)
Closed 8 years ago.
I am trying to split several hundred lines read from a CSV file on ,. E.g.:
"Acme services","Sesame street","zip","0,56","2013-10-21"
"Black adder, intra-national Association","shaftsville rd","zap code","0,50","2014-10-14"
etc.
I could split the first row on ,, but this would not work for the second row. However, if I split on , then I would trap these cases. I could then remove the " using simple regex (e.g. $col[i] =~ s/\"+//g)
I have tried #cols = split(/\",\"/,$line), and I've tried split('","',$lines) and various variations, but every time, I get the full $line in $col[0], with $cols[1:n] as empty.
Any help would be much appreciated! Thanks.

Why not use Text::CSV. This will take care of edge cases where you have commas in values and all sorts of other problems,
from the cpan page
use Text::CSV;
my #rows;
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
$row->[2] =~ m/pattern/ or next; # 3rd field should match
push #rows, $row;
}
$csv->eof or $csv->error_diag();
close $fh;
$csv->eol ("\r\n");
open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
$csv->print ($fh, $_) for #rows;
close $fh or die "new.csv: $!";
EDIT worked example assuming two given lines are in a.txt
use strict;
use Text::CSV;
my #rows;
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<:encoding(utf8)", "a.txt" or die "a.txt: $!";
while ( my $row = $csv->getline( $fh ) ) {
foreach(#$row){
print "$_\n";
}
print "\n";
}
$csv->eof or $csv->error_diag();
close $fh;
gives
Acme services
Sesame street
zip
0,56
2013-10-21
Black adder, intra-national Association
shaftsville rd
zap code
0,50
2014-10-14

Related

Substituting millions of regular expressions (perl)

I have a text file containing over one million lines of text. On each line, there is an alphanumerical code which needs to be substituted with a name. I have tried doing this using different Perl scripts, but each time the scripts die because they are using too many memory. I am new to Perl, so I imagine that I am doing something wrong, and it making the job too complex?
So far, I have tried:
use strict;
use warnings;
my $filename = 'names.txt';
my $data = read_file($filename);
$data =~ s/88tx0p/Author1/g;
##and then there are 1,000,000+ other substitution regexes.
write_file($filename, $data);
exit;
sub read_file {
my ($filename) = #_;
open my $in, '<:encoding(UTF-8)', $filename or die "Could not open
'$filename' for reading $!";
local $/ = undef;
my $all = <$in>;
close $in;
return $all;
}
sub write_file {
my ($filename, $content) = #_;
open my $out, '>:encoding(UTF-8)', $filename or die "Could not open
'$filename' for writing $!";;
print $out $content;
close $out;
return;
}
But then I realised that this script is trying to write the output to the original file, which I imagine uses more memory? So I tried the following:
use strict;
use utf8;
use warnings;
open(FILE, 'names.txt') || die "File not found";
my #lines = <FILE>;
close(FILE);
my #newlines;
foreach(#lines) {
$_ =~ s/88tx0p/Author1/g;
##and then there are approximately 1,000,000 other substitution regexes.
push(#newlines,$_);
}
open(FILE, '>names_edited.txt') || die "File not found";
;
print FILE #newlines;
close(FILE);
But again, this used too much memory. Please could I get help with ways of doing this while using minimum amount of memory? Thank you all.
Your problem is you're using a foreach loop. That needs you to load all the lines into memory, which is the root of your problem.
Try it in a while loop:
open ( my $file, '<', 'names.txt' ) or die $!;
open ( my $output, '>', 'names_edited.txt' ) or die $!;
select $output; #destination for print;
while ( <$file> ) { #reads one line at a time, sets $_
s/88tx0p/Author1/g; #acts on $_ by default
print; #defaults to printing $_ to the selected filehandle $output
}
That'll work line by line (as your initial code was) but will read only one line at a time, so the memory footprint will be vastly lower.

Why does this regex in perl work for one word but not another?

I'm new to perl so please excuse me if my question seems obvious. I made a small perl script that just examines itself to extract a particular substring I'm looking for and I'm getting results that I can't explain. Here is the script:
use 5.006;
use strict;
use warnings;
use File::Find;
my #files;
find(
sub { push #files, $File::Find::name unless -d; },
"."
);
my #filteredfiles = grep(/.pl/, #files);
foreach my $fileName (#filteredfiles)
{
open (my $fh, $fileName) or die "Could not open file $fileName";
while (my $row = <$fh>)
{
chomp $row;
if ($row =~ /file/)
{
my ($substring) = $row =~ /file\(([^\)]*)\)/;
print "$substring\n" if $substring;
}
}
close $fh;
}
# file(stuff)
# directory(stuff)
Now, when I run this, I get the following output:
stuff
[^\
Why is it printing the lines out of order? Since the "stuff" line occurs later in the file, shouldn't it print later?
Why is it printing that second line wrong? It should be "\(([^\". It's missing the first 3 characters.
If I change my regex to the following: /directory\(([^\)]*)\)/, I get no output. The only difference is the word. It should be finding the second comment. What is going on here?
use 5.006 kind of odd if you are just beginning to learn Perl ... That is an ancient version.
You should not build a potentially huge list of all files in all locations under the current directory and then filter it. Instead, push only the files you want to the list.
Especially with escaped meta characters, regex patterns can be become hard to read very quickly, so use the /x modifier to insert some whitespace into those patterns.
You do not have to match twice: Just check & capture at the same time.
If open fails, include the reason in the error message.
Your second question above does not make sense. You seem to expect your pattern to match the literal string file\(([^\)]*)\)/, but it cannot.
use strict;
use warnings;
use File::Find;
my #files;
find(
sub {
return if -d;
return unless / [.] pl \z/x;
push #files, $File::Find::name;
},
'.',
);
for my $file ( #files ) {
open my $fh, '<', $file
or die "Could not open file $file: $!";
while (my $line = <$fh>) {
if (my ($substring) = ($line =~ m{ (?:file|directory) \( ([^\)]*) \) }x)) {
print "$substring\n";
}
}
close $fh;
}
# file(stuff)
# directory(other)
Output:
stuff
other

perl split 8gb csv with "," as pattern

I recognise this might be a duplicate but the size of the file I have to split requires a method with doesn't load the csv into memory before processing it. ie I'm looking for a line by line method to read and split and output my file. I I only need my output to be the last 3 field without the quotes and without the thousand delimiting comma.
I have a file of arcGIS coordinates which contain quotes and commas internal to the fields. Data example below.
"0","0","1","1","1,058.83","1,455,503.936","5,173,996.331"
I have been trying to do this using variations on split( '","' , $line);.
Here'e my code.
use strict;
use warnings;
open (FH, '<', "DEM_Export.csv") or die "Can't open file DEM_Export.csv";
open (FH2, '>', "DEM_ExportProcessed.csv") or die "Can't open file DEM_ExportProcessed.csv";
print FH2 "EASTING, NORTHING, ELEVATION,\n";
my $count = 0;
foreach my $line (<FH>) {
chomp;
# if ($count == 0){next;}
print $line, "\n";
my #list = split( '","' , $line);
print "1st print $list[5],$list[6],$list[4]\n";
$list[4] =~ s/,//g;
$list[5] =~ s/,//g;
$list[6] =~ s/,//g;
$list[4] =~ s/"//g;
$list[5] =~ s/"//g;
$list[6] =~ s/"//g;
print "2nd print $list[5],$list[6],$list[4]\n";
if ($count == 10) {
exit;
}
my $string = sprintf("%.3f,%.3f,%.3f\n", $list[5],$list[6],$list[4]);
print FH2 $string;
$count++;
}
close FH;
close FH2;
I'm getting close my my wits end with this and really need a solution.
Any help will be gratefully received.
Cheers
This is really very straightforward using the Text::CSV to handle the nastiness of CSV data
Here's an example, which works fine with the sample data you have shown. As long as your input file is plain ASCII and the rows are about the size you have shown it should work fine
It prints its output to STDOUT, so you'll want to use a command-line redirect to put it into the file you want
use strict;
use warnings 'all';
use Text::CSV;
my $csv_file = 'DEM_Export.csv';
open my $in_fh, '<', $csv_file or die qq{Unable to open "$csv_file" for input: $!};
my $csv = Text::CSV->new({ eol => "\n" });
print "EASTING,NORTHING,ELEVATION\n";
while ( my $row = $csv->getline($in_fh) ) {
$csv->print(\*STDOUT, [ map tr/,//dr, #$row[-2,-1,-3] ] );
}
output
1455503.936,5173996.331,1058.83
I guess I should have been braver and had a crack with Text::CSV to start with rather than asking a question.
Many thanks to Сухой27 and choroba for pointing me in the right direction.
Here is the code I ended up with. Probably not the tidiest.
use strict;
use warnings;
use Text::CSV;
my $file = "DEM_Export.csv";
my $file2 = "DEM_ExportProcessed.csv";
open (FH2, '>', $file2) or die "Can't open file $file2: $!";
print FH2 "EASTING, NORTHING, ELEVATION,\n";
print "Starting file processing...\n";
my $csv = Text::CSV->new ({ binary => 1, eol => $/ });
open my $io, "<", $file or die "$file: $!";
while (my $row = $csv->getline ($io)) {
my #fields = #$row;
s/,//g for #fields[3..5];
my $string = sprintf("%.3f,%.3f,%.3f\n", $fields[4],$fields[5],$fields[3]);
print FH2 $string;
}
print "Finished!";
close FH2;
Worked a treat!
Thank you.

How to split one file into multiple files using perl? [duplicate]

This question already has answers here:
Split fasta files based on header
(2 answers)
Closed 1 year ago.
I have the file as ftp.txt which contains many versions of lines such as
>KCY60942 pep:novel supercontig:GCA_000682575.1:ab248605.contig.36_1:19:588:-1 gene:J738_3590 transcript:KCY60942 description:"putative transposase 1"
MTHLNELYLILNKYLKWNKSHLKCFALIMLVIILKQTCNLSSASKALPIKCLPQSFYRRM
QRFFAGQYFDYRQISQLIFNMFSFDQVQLTLDRTNWKWGKRNINILMLAIVYRGIAIPIL
WTLLNKRGNSDTKERIALIQRFIAIFGKDRIVNVFADREFIGEQWFTWLIEQDINFCIRV
KKTSLSPII
>KCY61710 pep:novel supercontig:GCA_000682575.1:ab248605.contig.22_1:4164:6320:1 gene:J738_2986 transcript:KCY61710 description:"tonB-dependent siderophore receptor family protein"
MQRTTKHFQINALALAIAMSTISAHAETDQQTSEYGTLPTIKVKAGSGQENEKSYIAGKT
DTAVPLGLSVREVPQSVSVITQQRLQDQQLSTLVEVAENVTGVSVNRYETNRGGIYSRGF
VVDNYIIDGIPTTYSLPWSSGEIFSSMALYDHIDVVRGATGLTFGAGNPSAAINMVRKRA
TSTEPTANVEVSAGSWDNYRVMGDIANSLNQSGTVRGRAVAQYEQGDSYTDLLSKEKLSL
LLSAEADLSENTLLSGGVTYQEDDPRGPMWGGLPVWFSDGTKTNWSKNITTSADWTRWNV
KYTNLFADLTHKFNDNWSAKLSYSHGKRDANSKLLYVSGSVDKNTGLGLSPYASAYDLEV
EQDNASLQLNGSFDLWGLEQKVVLGYQYSNQDFTAYARSTDTKMEIGNFFEWNGSMPEPV
WNAPTLNEKYNIEQNALFAATYLNPIEPLKFILGGRFTNYEKNIYGRSSSIKYDHEFVPY
AGIIYDFNDVYTAYASYTSIFQPQDKKDFDGNYLDPVEGNSTEVGLKSAWFDGRLNGTLA
LYHIKQDNLAQEAGDVTRNGVKEIYYRAAKGATSEGFEVEVSGQITPDWNITAGYSQFSA
KDTNDVDVNTQLPRKMIQTFTTYKLSGKLENITVGGGVNWQSSTYINAENPKEVIEKVEQ
GDYALVNLMARYQITKDFSAQLNINNVFDKKYYGVFPAYGQITLGAPRNAALTLQYKF
my query is to separate each version and want to save it each version with different file names?
i tried the below code but i get only the line which startsup
#!/usr/local/bin/perl
open( FILE, "/home/httpd/cgi-bin/r/ftp.txt" );
while ( $line = <FILE> ) {
if ( $line =~ m/^\>/g ) {
print $line;
}
}
my desired output should be those two different versions which starts as like this >KCY60942 and >KCY61710 must saved in different filenames such as >KCY60942 should be saved in one file name and >KCY61710 it should be saved in another file name.
Here's another option:
use strict;
use warnings;
local $/ = '';
while (<>) {
my ($fileName) = /^>([^\s]+)/;
open my $fh, '>', "$fileName.txt" or die "Can't write to '$fileName.txt'";
print $fh $_;
close $fh;
}
Usage: perl script.pl inFile
Since each (FASTA?) record is a paragraph, $/ is set to empty ('') to read the file in paragraph mode--one 'record' at a time. Each record's id is captured for use as that record's file name, and then that record is written to its file.
Hope this helps!
Something like this should do the trick:
#!/usr/local/bin/perl
use strict;
use warnings;
open( my $file, "<", "/home/httpd/cgi-bin/r/ftp.txt" );
open( my $output, ">", "pre-match" ) or die $!;
while ( my $line = <$file> ) {
if ( $line =~ m/^\>/g ) {
my ($output_name) = ( $line =~ m/^\>(\w+)/ );
close($output);
open( $output, ">", $output_name . ".output" ) or die $!;
}
print {$output} $line;
}
close($output);
If your line matches that regular expression, we 'pick out' the first word (so KCY61710 etc.) and open a file called KCY61710.output.
We print each line as we go to this output, closing and re-opening each time we hit one of those lines.
A pre-match file exists in case the first line(s) don't match this pattern.

How to extract string from raw data file (stuck)

I have a raw data text file and I am writing a perl script to extract a string from it. I am trying to get the computer name where the result would only be UFLEX-06 however, I am getting Computer Name: UFLEX-06 (with the tab spacing at the back of Computer Name).
This is what the text file contains:
Computer Info:
Computer Name: UFLEX-06
Computer is external to Teradyne.
Computer IP Address(es):
This is what I've done so far in my perl script:
use strict;
use warnings;
my $filename = 'IGXLEventLog.3.17.2015.20.25.12.625.log';
open(my $fn, '<', $filename) or die "Could not open file '$filename' $!";
our %details;
while(my $row = <$fn>)
{
chomp $row;
if($row =~ /Computer Name:/i)
{
my $cp_line;
do
{
$cp_line .= $row;
}while($row !~ /Computer Name:/);
#Remove
$cp_line =~ /Computer Name:/s;
my #String = split(/Computer Name:/, $cp_line);
#Assigning array elements to hash array
$details{cp_line} = $String[0];
print $details{cp_line};
}
}
I'm new to Perl so do pardon me if there are any obvious mistakes. My idea was to use split so that I can remove the Computer Name:. Any help on this?
I really don't follow what you are trying to do with your own code, but this is all that is necessary
use strict;
use warnings;
my $filename = 'IGXLEventLog.3.17.2015.20.25.12.625.log';
open my $fh, '<', $filename or die "Could not open file '$filename': $!";
while ( <$fh> ) {
if ( /Computer Name:\s*(\S+)/i ) {
print $1, "\n";
}
}
You can use a regex with capturing groups like this:
Computer Name: (.*)
And then grab the content from the group
Working demo
I'm not good at Perl but I think you can use below code and adapt it to your needs:
use strict;
use warnings;
my $filename = 'IGXLEventLog.3.17.2015.20.25.12.625.log';
open(my $fn, '<', $filename) or die "Could not open file '$filename' $!";
our %details;
while(my $row = <$fn>)
{
chomp $row;
if($row =~ m/Computer Name: (.*)/)
{
print $1;
}
}