How to split one file into multiple files using perl? [duplicate] - regex

This question already has answers here:
Split fasta files based on header
(2 answers)
Closed 1 year ago.
I have a file named ftp.txt which contains many records ("versions") such as:
>KCY60942 pep:novel supercontig:GCA_000682575.1:ab248605.contig.36_1:19:588:-1 gene:J738_3590 transcript:KCY60942 description:"putative transposase 1"
MTHLNELYLILNKYLKWNKSHLKCFALIMLVIILKQTCNLSSASKALPIKCLPQSFYRRM
QRFFAGQYFDYRQISQLIFNMFSFDQVQLTLDRTNWKWGKRNINILMLAIVYRGIAIPIL
WTLLNKRGNSDTKERIALIQRFIAIFGKDRIVNVFADREFIGEQWFTWLIEQDINFCIRV
KKTSLSPII
>KCY61710 pep:novel supercontig:GCA_000682575.1:ab248605.contig.22_1:4164:6320:1 gene:J738_2986 transcript:KCY61710 description:"tonB-dependent siderophore receptor family protein"
MQRTTKHFQINALALAIAMSTISAHAETDQQTSEYGTLPTIKVKAGSGQENEKSYIAGKT
DTAVPLGLSVREVPQSVSVITQQRLQDQQLSTLVEVAENVTGVSVNRYETNRGGIYSRGF
VVDNYIIDGIPTTYSLPWSSGEIFSSMALYDHIDVVRGATGLTFGAGNPSAAINMVRKRA
TSTEPTANVEVSAGSWDNYRVMGDIANSLNQSGTVRGRAVAQYEQGDSYTDLLSKEKLSL
LLSAEADLSENTLLSGGVTYQEDDPRGPMWGGLPVWFSDGTKTNWSKNITTSADWTRWNV
KYTNLFADLTHKFNDNWSAKLSYSHGKRDANSKLLYVSGSVDKNTGLGLSPYASAYDLEV
EQDNASLQLNGSFDLWGLEQKVVLGYQYSNQDFTAYARSTDTKMEIGNFFEWNGSMPEPV
WNAPTLNEKYNIEQNALFAATYLNPIEPLKFILGGRFTNYEKNIYGRSSSIKYDHEFVPY
AGIIYDFNDVYTAYASYTSIFQPQDKKDFDGNYLDPVEGNSTEVGLKSAWFDGRLNGTLA
LYHIKQDNLAQEAGDVTRNGVKEIYYRAAKGATSEGFEVEVSGQITPDWNITAGYSQFSA
KDTNDVDVNTQLPRKMIQTFTTYKLSGKLENITVGGGVNWQSSTYINAENPKEVIEKVEQ
GDYALVNLMARYQITKDFSAQLNINNVFDKKYYGVFPAYGQITLGAPRNAALTLQYKF
My question is: how do I separate each version and save each one under a different file name?
I tried the code below, but I only get the lines that start with >.
#!/usr/local/bin/perl
open( FILE, "/home/httpd/cgi-bin/r/ftp.txt" );
while ( $line = <FILE> ) {
    if ( $line =~ m/^\>/g ) {
        print $line;
    }
}
My desired output is that the two different versions, which start with >KCY60942 and >KCY61710, are saved under different file names: the >KCY60942 record should go in one file and the >KCY61710 record in another.

Here's another option:
use strict;
use warnings;
local $/ = '';
while (<>) {
    my ($fileName) = /^>([^\s]+)/;
    open my $fh, '>', "$fileName.txt" or die "Can't write to '$fileName.txt'";
    print $fh $_;
    close $fh;
}
Usage: perl script.pl inFile
Since each (FASTA?) record is a paragraph, $/ is set to empty ('') to read the file in paragraph mode--one 'record' at a time. Each record's id is captured for use as that record's file name, and then that record is written to its file.
Hope this helps!
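For anyone unfamiliar with paragraph mode, here is a tiny self-contained illustration (using made-up, blank-line-separated records, which is what this answer assumes about the input):
use strict;
use warnings;

local $/ = '';    # paragraph mode: one blank-line-separated chunk per read
while (my $record = <DATA>) {
    print "--- one record ---\n$record";
}

__DATA__
>REC1 first made-up record
SEQUENCELINEONE
SEQUENCELINETWO

>REC2 second made-up record
ANOTHERSEQUENCE
Each pass of the loop gets a whole record (header plus its sequence lines) in $record, which is why the answer can write one file per iteration.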

Something like this should do the trick:
#!/usr/local/bin/perl
use strict;
use warnings;
open( my $file, "<", "/home/httpd/cgi-bin/r/ftp.txt" ) or die $!;
open( my $output, ">", "pre-match" ) or die $!;
while ( my $line = <$file> ) {
    if ( $line =~ m/^\>/g ) {
        my ($output_name) = ( $line =~ m/^\>(\w+)/ );
        close($output);
        open( $output, ">", $output_name . ".output" ) or die $!;
    }
    print {$output} $line;
}
close($output);
close($output);
If your line matches that regular expression, we 'pick out' the first word (so KCY61710 etc.) and open a file called KCY61710.output.
We print each line as we go to this output, closing and re-opening each time we hit one of those lines.
A pre-match file exists in case the first line(s) don't match this pattern.
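With the sample data from the question, tracing this code suggests you would end up with three files:
pre-match (empty, because the very first line already starts with >)
KCY60942.output (the >KCY60942 header line plus its sequence lines)
KCY61710.output (the >KCY61710 header line plus its sequence lines)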

Related

Using perl to process list of name and details

There are two files: one is a list of names, and the other is a list of names with details. I want to create a third file which contains the names from the first file and the details (for each of those names) from the second file. Can you please suggest an approach?
In the second file, the details for each name are delimited by the name itself (list[i], i.e. a name from the first file) and "</reg>".
#!/usr/intel/bin/perl
use warnings;
use strict;
use Data::Dumper;
my $handle;
unless (open $handle, "<:encoding(utf8)", "/nfs/fm/disks/fm_nvm_7138/WLRD_LOGIC_users/cgoudarx/willard_b02/chiplevel/verif/testsuites/upf/pss_ret_regs.txt") {
    print STDERR "Could not open file '/nfs/fm/disks/fm_nvm_7138/WLRD_LOGIC_users/cgoudarx/willard_b02/chiplevel/verif/testsuites/upf/pss_ret_regs.txt': $!\n";
    # we return 'undefined', we could also 'die' or 'croak'
    return undef
}
chomp(my @list = <$handle>);
unless (close $handle) {
    # what does it mean if close yields an error and you are just reading?
    print STDERR "Don't care error while closing '/nfs/fm/disks/fm_nvm_7138/WLRD_LOGIC_users/cgoudarx/willard_b02/chiplevel/verif/testsuites/upf/pss_ret_regs.txt': $!\n";
}
open ( INPUT, "/nfs/fm/disks/fm_nvm_7138/WLRD_LOGIC_users/cgoudarx/willard_b02/chiplevel/verif/testsuites/upf/tet.xml" ) or die("Could not open xml file.");
my $outffile ="newlist.xml";
open(FILEOUT2, ">$outffile") || die "ERROR: Can't open the output file $outffile: $!";
my $size = @list;
for (my $i=0; $i < $size; $i++) {
    while( my $line = <INPUT> )
    {
        if ( $line =~ m/$list[$i]/) {
            print FILEOUT2 $line;
            while( $line = <INPUT>) # print till empty line
            {
                last if ( $line =~ m/<\/reg>/);
                print FILEOUT2 $line;
            }
            print FILEOUT2 $line;
        };
    };
};
close(INPUT);
close(INPUT);
One of your input files is an XML document. You shouldn't parse XML documents with regular expressions. It is a far better idea to use a proper XML parser (I'd recommend XML::LibXML).
If you insist in parsing XML using regexes, then you cannot process your input file a line at a time, as XML elements will often (usually?) span multiple lines.
Also, please update your file-handling code to use the three-arg version of open() and lexical filehandles.
open ( my $in_fh, '<', "...") or die("Could not open xml file.");
And
open( my $out_fh, '>', $outffile) || die "ERROR: Can't open the output file $outffile: $!";
Oh, and it's a good idea to standardise on using either or or || consistently in those commands.
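For what it's worth, here is a minimal sketch of the XML::LibXML route. It assumes the details live in <reg> elements (based on the </reg> end tag in the question) and that each name simply appears somewhere in the element's text; adjust the XPath and the matching to the real document structure.
use strict;
use warnings;
use XML::LibXML;

# Load the list of names, one per line (short file names used here for brevity).
open my $names_fh, '<:encoding(utf8)', 'pss_ret_regs.txt' or die "Can't open names file: $!";
chomp( my @names = <$names_fh> );
close $names_fh;

# Parse the whole XML document properly instead of matching lines with regexes.
my $doc = XML::LibXML->load_xml( location => 'tet.xml' );

open my $out, '>', 'newlist.xml' or die "Can't open output file: $!";
for my $reg ( $doc->findnodes('//reg') ) {
    my $text = $reg->toString;
    # Keep the whole <reg> block if it mentions any of the names.
    print {$out} $text, "\n" if grep { index($text, $_) >= 0 } @names;
}
close $out;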

Substituting millions of regular expressions (perl)

I have a text file containing over one million lines of text. On each line, there is an alphanumerical code which needs to be substituted with a name. I have tried doing this using different Perl scripts, but each time the scripts die because they are using too much memory. I am new to Perl, so I imagine that I am doing something wrong and making the job too complex?
So far, I have tried:
use strict;
use warnings;
my $filename = 'names.txt';
my $data = read_file($filename);
$data =~ s/88tx0p/Author1/g;
##and then there are 1,000,000+ other substitution regexes.
write_file($filename, $data);
exit;

sub read_file {
    my ($filename) = @_;
    open my $in, '<:encoding(UTF-8)', $filename or die "Could not open '$filename' for reading $!";
    local $/ = undef;
    my $all = <$in>;
    close $in;
    return $all;
}

sub write_file {
    my ($filename, $content) = @_;
    open my $out, '>:encoding(UTF-8)', $filename or die "Could not open '$filename' for writing $!";
    print $out $content;
    close $out;
    return;
}
But then I realised that this script is trying to write the output to the original file, which I imagine uses more memory? So I tried the following:
use strict;
use utf8;
use warnings;
open(FILE, 'names.txt') || die "File not found";
my @lines = <FILE>;
close(FILE);
my @newlines;
foreach(@lines) {
    $_ =~ s/88tx0p/Author1/g;
    ##and then there are approximately 1,000,000 other substitution regexes.
    push(@newlines,$_);
}
open(FILE, '>names_edited.txt') || die "File not found";
print FILE @newlines;
close(FILE);
But again, this used too much memory. Please could I get some help with ways of doing this while using a minimal amount of memory? Thank you all.
Your problem is that you're using a foreach loop, which requires loading all the lines into memory; that is the root of your problem.
Try it in a while loop:
open ( my $file, '<', 'names.txt' ) or die $!;
open ( my $output, '>', 'names_edited.txt' ) or die $!;
select $output;             # destination for print
while ( <$file> ) {         # reads one line at a time, sets $_
    s/88tx0p/Author1/g;     # acts on $_ by default
    print;                  # defaults to printing $_ to the selected filehandle $output
}
That'll work line by line (as your initial code did) but will read only one line at a time, so the memory footprint will be vastly lower.
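The answer above fixes the memory problem; it doesn't address having 1,000,000+ separate s/// statements. A common way to handle that (not part of this answer, and only a sketch with made-up codes) is to keep the code-to-name mapping in a hash and use a single substitution, since Perl copes reasonably well with a large alternation of literal strings:
use strict;
use warnings;

# Hypothetical mapping; in practice this would be built from wherever the
# 1,000,000+ code/name pairs are stored.
my %name_for = (
    '88tx0p' => 'Author1',
    '93qv2z' => 'Author2',
);

# One alternation built from all codes, longest first so no code is
# shadowed by a shorter one.
my $codes   = join '|', map quotemeta, sort { length $b <=> length $a } keys %name_for;
my $code_re = qr/($codes)/;

open my $in,  '<:encoding(UTF-8)', 'names.txt'        or die $!;
open my $out, '>:encoding(UTF-8)', 'names_edited.txt' or die $!;

while (my $line = <$in>) {
    $line =~ s/$code_re/$name_for{$1}/g;   # one pass per line, hash lookup per hit
    print {$out} $line;
}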

Why does this regex in perl work for one word but not another?

I'm new to perl so please excuse me if my question seems obvious. I made a small perl script that just examines itself to extract a particular substring I'm looking for and I'm getting results that I can't explain. Here is the script:
use 5.006;
use strict;
use warnings;
use File::Find;

my @files;
find(
    sub { push @files, $File::Find::name unless -d; },
    "."
);
my @filteredfiles = grep(/.pl/, @files);

foreach my $fileName (@filteredfiles)
{
    open (my $fh, $fileName) or die "Could not open file $fileName";
    while (my $row = <$fh>)
    {
        chomp $row;
        if ($row =~ /file/)
        {
            my ($substring) = $row =~ /file\(([^\)]*)\)/;
            print "$substring\n" if $substring;
        }
    }
    close $fh;
}

# file(stuff)
# directory(stuff)
Now, when I run this, I get the following output:
stuff
[^\
Why is it printing the lines out of order? Since the "stuff" line occurs later in the file, shouldn't it print later?
Why is it printing that second line wrong? It should be "\(([^\". It's missing the first 3 characters.
If I change my regex to the following: /directory\(([^\)]*)\)/, I get no output. The only difference is the word. It should be finding the second comment. What is going on here?
Using use 5.006 is kind of odd if you are just beginning to learn Perl ... that is an ancient version.
You should not build a potentially huge list of all files in all locations under the current directory and then filter it. Instead, push only the files you want to the list.
Especially with escaped metacharacters, regex patterns can become hard to read very quickly, so use the /x modifier to insert some whitespace into those patterns.
You do not have to match twice: just check & capture at the same time.
If open fails, include the reason in the error message.
Your second question above does not make sense. You seem to expect your pattern to match the literal string file\(([^\)]*)\), but it cannot.
use strict;
use warnings;
use File::Find;

my @files;
find(
    sub {
        return if -d;
        return unless / [.] pl \z/x;
        push @files, $File::Find::name;
    },
    '.',
);

for my $file ( @files ) {
    open my $fh, '<', $file
        or die "Could not open file $file: $!";
    while (my $line = <$fh>) {
        if (my ($substring) = ($line =~ m{ (?:file|directory) \( ([^\)]*) \) }x)) {
            print "$substring\n";
        }
    }
    close $fh;
}

# file(stuff)
# directory(other)
Output:
stuff
other

perl split 8gb csv with "," as pattern

I recognise this might be a duplicate, but the size of the file I have to split requires a method which doesn't load the CSV into memory before processing it, i.e. I'm looking for a line-by-line method to read, split, and output my file. I only need my output to be the last 3 fields, without the quotes and without the thousands-delimiting commas.
I have a file of arcGIS coordinates which contain quotes and commas internal to the fields. Data example below.
"0","0","1","1","1,058.83","1,455,503.936","5,173,996.331"
I have been trying to do this using variations on split( '","' , $line);.
Here's my code.
use strict;
use warnings;

open (FH, '<', "DEM_Export.csv") or die "Can't open file DEM_Export.csv";
open (FH2, '>', "DEM_ExportProcessed.csv") or die "Can't open file DEM_ExportProcessed.csv";
print FH2 "EASTING, NORTHING, ELEVATION,\n";

my $count = 0;
foreach my $line (<FH>) {
    chomp;
    # if ($count == 0){next;}
    print $line, "\n";
    my @list = split( '","' , $line);
    print "1st print $list[5],$list[6],$list[4]\n";
    $list[4] =~ s/,//g;
    $list[5] =~ s/,//g;
    $list[6] =~ s/,//g;
    $list[4] =~ s/"//g;
    $list[5] =~ s/"//g;
    $list[6] =~ s/"//g;
    print "2nd print $list[5],$list[6],$list[4]\n";
    if ($count == 10) {
        exit;
    }
    my $string = sprintf("%.3f,%.3f,%.3f\n", $list[5],$list[6],$list[4]);
    print FH2 $string;
    $count++;
}
close FH;
close FH2;
I'm getting close to my wits' end with this and really need a solution.
Any help will be gratefully received.
Cheers
This is really very straightforward using the Text::CSV module to handle the nastiness of CSV data.
Here's an example, which works fine with the sample data you have shown. As long as your input file is plain ASCII and the rows are about the size you have shown, it should work fine.
It prints its output to STDOUT, so you'll want to use a command-line redirect to put it into the file you want.
use strict;
use warnings 'all';

use Text::CSV;

my $csv_file = 'DEM_Export.csv';
open my $in_fh, '<', $csv_file or die qq{Unable to open "$csv_file" for input: $!};

my $csv = Text::CSV->new({ eol => "\n" });

print "EASTING,NORTHING,ELEVATION\n";
while ( my $row = $csv->getline($in_fh) ) {
    $csv->print(\*STDOUT, [ map tr/,//dr, @{$row}[-2,-1,-3] ] );
}
output
1455503.936,5173996.331,1058.83
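For example, if you save the code above as a (hypothetically named) dem_split.pl, the redirect mentioned earlier would be:
perl dem_split.pl > DEM_ExportProcessed.csv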
I guess I should have been braver and had a crack at Text::CSV to start with rather than asking a question.
Many thanks to Сухой27 and choroba for pointing me in the right direction.
Here is the code I ended up with. Probably not the tidiest.
use strict;
use warnings;
use Text::CSV;

my $file = "DEM_Export.csv";
my $file2 = "DEM_ExportProcessed.csv";

open (FH2, '>', $file2) or die "Can't open file $file2: $!";
print FH2 "EASTING, NORTHING, ELEVATION,\n";
print "Starting file processing...\n";

my $csv = Text::CSV->new ({ binary => 1, eol => $/ });
open my $io, "<", $file or die "$file: $!";

while (my $row = $csv->getline ($io)) {
    my @fields = @$row;
    s/,//g for @fields[3..5];
    my $string = sprintf("%.3f,%.3f,%.3f\n", $fields[4],$fields[5],$fields[3]);
    print FH2 $string;
}
print "Finished!";
close FH2;
Worked a treat!
Thank you.

perl split a string on "," [duplicate]

This question already has answers here:
How do I efficiently parse a CSV file in Perl?
(6 answers)
Closed 8 years ago.
I am trying to split several hundred lines read from a CSV file on ,. E.g.:
"Acme services","Sesame street","zip","0,56","2013-10-21"
"Black adder, intra-national Association","shaftsville rd","zap code","0,50","2014-10-14"
etc.
I could split the first row on ,, but this would not work for the second row. However, if I split on "," then I would trap these cases. I could then remove the " using a simple regex (e.g. $col[$i] =~ s/\"+//g).
I have tried @cols = split(/\",\"/,$line), and I've tried split('","',$lines) and various variations, but every time I get the full $line in $col[0], with $cols[1:n] empty.
Any help would be much appreciated! Thanks.
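For reference, here is a minimal sketch of the split-then-strip approach described above, using the first sample row; it works for rows shaped like these, but the Text::CSV answer below copes with embedded quotes and other edge cases far more robustly.
use strict;
use warnings;

my $line = '"Acme services","Sesame street","zip","0,56","2013-10-21"';
my @cols = split /","/, $line;   # split on the "," between quoted fields
s/"//g for @cols;                # strip the leading/trailing quotes that remain
print join("|", @cols), "\n";    # Acme services|Sesame street|zip|0,56|2013-10-21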
Why not use Text::CSV? This will take care of edge cases where you have commas in values, and all sorts of other problems.
From the CPAN page:
use Text::CSV;

my @rows;
my $csv = Text::CSV->new ( { binary => 1 } )  # should set binary attribute.
    or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    $row->[2] =~ m/pattern/ or next;  # 3rd field should match
    push @rows, $row;
}
$csv->eof or $csv->error_diag();
close $fh;

$csv->eol ("\r\n");
open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
$csv->print ($fh, $_) for @rows;
close $fh or die "new.csv: $!";
EDIT: a worked example, assuming the two given lines are in a.txt:
use strict;
use Text::CSV;

my @rows;
my $csv = Text::CSV->new ( { binary => 1 } )  # should set binary attribute.
    or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<:encoding(utf8)", "a.txt" or die "a.txt: $!";
while ( my $row = $csv->getline( $fh ) ) {
    foreach(@$row){
        print "$_\n";
    }
    print "\n";
}
$csv->eof or $csv->error_diag();
close $fh;
$csv->eof or $csv->error_diag();
close $fh;
gives
Acme services
Sesame street
zip
0,56
2013-10-21
Black adder, intra-national Association
shaftsville rd
zap code
0,50
2014-10-14