Using perl to process list of name and details

There are two files. One is a list of names; the other is a list of names with their details. I want to create a third file that contains each name from the first file together with the details for that name from the second file. Can you please suggest how?
In the second file, the details for a name start at the line matching $list[$i] (a name from the first file) and end at "</reg>".
#!/usr/intel/bin/perl
use warnings;
use strict;
use Data::Dumper;
my $handle;
unless (open $handle, "<:encoding(utf8)", "/nfs/fm/disks/fm_nvm_7138/WLRD_LOGIC_users/cgoudarx/willard_b02/chiplevel/verif/testsuites/upf/pss_ret_regs.txt") {
print STDERR "Could not open file '/nfs/fm/disks/fm_nvm_7138/WLRD_LOGIC_users/cgoudarx/willard_b02/chiplevel/verif/testsuites/upf/pss_ret_regs.txt': $!\n";
# we return 'undefined', we could also 'die' or 'croak'
return undef
}
chomp(my @list = <$handle>);
unless (close $handle) {
# what does it mean if close yields an error and you are just reading?
print STDERR "Don't care error while closing '/nfs/fm/disks/fm_nvm_7138/WLRD_LOGIC_users/cgoudarx/willard_b02/chiplevel/verif/testsuites/upf/pss_ret_regs.txt': $!\n";
}
open ( INPUT, "/nfs/fm/disks/fm_nvm_7138/WLRD_LOGIC_users/cgoudarx/willard_b02/chiplevel/verif/testsuites/upf/tet.xml" ) or die("Could not open xml file.");
my $outffile ="newlist.xml";
open(FILEOUT2, ">$outffile") || die "ERROR: Can't open the output file $outffile: $!";
my $size = @list;
for (my $i=0; $i < $size; $i++) {
while( my $line = <INPUT> )
{
if ( $line =~ m/$list[$i]/) {
print FILEOUT2 $line;
while( $line = <INPUT>) # print till empty line
{
last if ( $line =~ m/<\/reg>/);
print FILEOUT2 $line;
}
print FILEOUT2 $line;
};
};
};
close(INPUT);

One of your input files is an XML document, and you shouldn't parse XML documents with regular expressions. It is a far better idea to use a proper XML parser (I'd recommend XML::LibXML).
If you insist on parsing XML using regexes, then you cannot process your input file a line at a time, as XML elements will often (usually?) span multiple lines.
Also, please update your file-handling code to use the three-arg version of open() and lexical filehandles.
open ( my $in_fh, '<', "...") or die("Could not open xml file.");
And
open( my $out_fh, '>', $outffile) || die "ERROR: Can't open the output file $outffile: $!";
Oh, and it's a good idea to standardise on either "or" or "||" in those commands rather than mixing the two.
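If the line-based approach is kept for the time being, note that the nested loops in the question read the XML filehandle to end-of-file while searching for the first name, so no later name can ever match. A minimal single-pass sketch follows, assuming each block starts on a line containing one of the names and ends on the next line containing </reg>. The sub name extract_blocks and the sample data are hypothetical, and the names are matched as plain substrings, so a name that is a prefix of another would match both.

```perl
use strict;
use warnings;

# Hypothetical helper: given a list of names and the lines of the details
# file, return every block that starts on a line containing one of the
# names and ends on the next line containing </reg>.
sub extract_blocks {
    my ($names, $lines) = @_;
    return unless @$names;

    # One alternation of all names; quotemeta makes them match literally.
    my $alternation = join '|', map { quotemeta($_) } @$names;
    my $name_re     = qr/$alternation/;

    my @out;
    my $in_block = 0;
    for my $line (@$lines) {
        $in_block = 1 if !$in_block && $line =~ $name_re;
        next unless $in_block;
        push @out, $line;
        $in_block = 0 if $line =~ m{</reg>};
    }
    return @out;
}

# Made-up sample data standing in for the two input files.
my @names   = ('reg_a', 'reg_c');
my @details = (
    "reg_a\n", "  width 8\n",  "</reg>\n",
    "reg_b\n", "  width 16\n", "</reg>\n",
    "reg_c\n", "  width 32\n", "</reg>\n",
);
print extract_blocks(\@names, \@details);
```

The blocks come out in the order they appear in the details file, not in the order of the name list. This is still regex matching on XML, so it only holds as long as the file keeps that one-tag-per-line layout.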

Related

Perl in-place substitution

In a Perl one-liner, we can use the -i switch to do an in-place substitution. What's the equivalent of -i when writing Perl code in the IDE?
Consider the following code:
binmode(STDOUT, ':raw');
open my $fh, '<', $filename;
while (<$fh>) {
s/^/<rootroot>/ if $.==1;
if (/(<link rel[^<>\n]*?)(\/?)(>)/g) {
my ($p1, $p2, $p3) = ($1, $2, $3);
s/$p1$p2$p3/($p2 ? qq|$p1$p2$p3<span class="entry">| : qq|$p1\/$p3<span class="entry">|)/ge;
};
s/<\/>/<entry_end><\/entry_end>/;
s/$/<\/rootroot>/ if eof;
}
How can we save all the changed lines in place?
I need to do a quick validation of the HTML file using XML::LibXML right after the in-place change of the HTML source.
Thanks in advance.

You can try something like this:
my $filename = 'test.dat';
@ARGV = ($filename);
$^I = '';
while (<<>>) {
    binmode(ARGV, ':raw');
    # Do the substitution on $_ here ...
    print;
}
I did not find out how to set binmode before the loop, since ARGV is only defined after the <> operator has been used.
The $^I and ARGV variables are described in perlvar.
See perlop for information about why you should use <<>> instead of <>.
Some notes:
The while (<>) { ... } loop:
According to perlop, the loop
while (<>) {
    ... # code for each line
}
is equivalent to the following Perl-like pseudo code:
unshift(@ARGV, '-') unless @ARGV;
while ($ARGV = shift) {
    open(ARGV, $ARGV);
    while (<ARGV>) {
        ... # code for each line
    }
}
Using in-place edit without a backup file: $^I="":
According to perlrun:
If no extension is supplied, and your system supports it, the original
file is kept open without a name while the output is redirected to a
new file with the original filename. When perl exits, cleanly or not,
the original file is unlinked.
and some more information in this blog:
Perl opens and immediately unlink()s the original file, then opens a
new file with the same name (new file descriptor and inode), and sends
output to this second file; at the end, the old file is closed and
thus deleted because it was unlinked, and what's left is a changed
file with the same name as the original.
See also doio.c for actual implementation.
According to the above, the following might work:
my $fn = 'test.dat';
open ( my $fh, '<:raw', $fn ) or die "Could not open file '$fn': $!";
unlink $fn or die "$!";
open ( my $fh2, '>:raw', $fn ) or die "Could not reopen file '$fn': $!";
while(<$fh>) {
# Do the substitutions on $_ here ...
print $fh2 $_;
}
close $fh;
close $fh2;
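The $^I approach can also be wrapped in a subroutine so that in-place mode is switched off again before the file is validated. The following is only a sketch: the sub name edit_in_place, the callback interface and the temporary sample file are all hypothetical, <<>> needs Perl 5.22 or later, and, as the perlrun quote says, an empty $^I only works where the system supports it. It does not set :raw on the handles.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: runs $edit on every line of $path and writes the
# result back to the same file name.
sub edit_in_place {
    my ($path, $edit) = @_;

    # Localise the globals so in-place mode ends when the sub returns.
    local $^I   = '';          # empty string: keep no backup copy
    local @ARGV = ($path);

    while (<<>>) {
        $edit->();             # the callback modifies $_
        print;                 # with $^I set, print goes to the new file
    }
    return;
}

# Made-up sample file so the sketch can be run as-is.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "foo one\nfoo two\n";
close $tmp_fh or die "Could not close '$tmp_path': $!";

edit_in_place($tmp_path, sub { s/foo/bar/ });

# The file on disk now holds the edited lines and can be re-read.
open my $check, '<', $tmp_path or die "Could not reopen '$tmp_path': $!";
print while <$check>;
close $check;
```

Once edit_in_place returns, the rewritten file is closed, so a validation step that opens the file by name sees the new contents.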

Substituting millions of regular expressions (perl)

I have a text file containing over one million lines of text. On each line, there is an alphanumerical code which needs to be substituted with a name. I have tried doing this using different Perl scripts, but each time the scripts die because they are using too much memory. I am new to Perl, so I imagine that I am doing something wrong and making the job too complex?
So far, I have tried:
use strict;
use warnings;
my $filename = 'names.txt';
my $data = read_file($filename);
$data =~ s/88tx0p/Author1/g;
##and then there are 1,000,000+ other substitution regexes.
write_file($filename, $data);
exit;
sub read_file {
my ($filename) = @_;
open my $in, '<:encoding(UTF-8)', $filename or die "Could not open
'$filename' for reading $!";
local $/ = undef;
my $all = <$in>;
close $in;
return $all;
}
sub write_file {
my ($filename, $content) = @_;
open my $out, '>:encoding(UTF-8)', $filename or die "Could not open
'$filename' for writing $!";;
print $out $content;
close $out;
return;
}
But then I realised that this script is trying to write the output to the original file, which I imagine uses more memory? So I tried the following:
use strict;
use utf8;
use warnings;
open(FILE, 'names.txt') || die "File not found";
my @lines = <FILE>;
close(FILE);
my @newlines;
foreach (@lines) {
    $_ =~ s/88tx0p/Author1/g;
    ##and then there are approximately 1,000,000 other substitution regexes.
    push(@newlines, $_);
}
open(FILE, '>names_edited.txt') || die "File not found";
print FILE @newlines;
close(FILE);
But again, this used too much memory. Please could I get help with ways of doing this while using minimum amount of memory? Thank you all.

Your problem is that you read the whole file into an array so that you can run a foreach loop over it. That forces every line to sit in memory at once, which is the root of your problem.
Try it in a while loop:
open ( my $file, '<', 'names.txt' ) or die $!;
open ( my $output, '>', 'names_edited.txt' ) or die $!;
select $output; # destination for print
while ( <$file> ) { # reads one line at a time, sets $_
    s/88tx0p/Author1/g; # acts on $_ by default
    print; # defaults to printing $_ to the selected filehandle $output
}
That'll work line by line (as your initial code was) but will read only one line at a time, so the memory footprint will be vastly lower.
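Reading line by line addresses the memory taken by the file, but a million separate s/// statements would still run against every line. One possible further step, which is not part of the answer above, is to keep the code-to-name pairs in a hash and look each token up once. The sketch below assumes the codes are standalone alphanumeric tokens; the sub name replace_codes and the second sample pair are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample mapping; the real one would be loaded from wherever the
# 1,000,000+ code/name pairs currently live.
my %name_for = (
    '88tx0p' => 'Author1',
    'k3j9zq' => 'Author2',
);

# Hypothetical helper: replaces every known code in one line of text.
sub replace_codes {
    my ($line, $map) = @_;
    # Visit each alphanumeric token once; swap it only if it is a known code.
    $line =~ s/\b([A-Za-z0-9]+)\b/exists $map->{$1} ? $map->{$1} : $1/ge;
    return $line;
}

print replace_codes("post by 88tx0p\n", \%name_for);
```

Inside the while loop from the answer, the body would then become a single call such as print replace_codes($_, \%name_for); instead of one substitution per code.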

How to split one file into multiple files using perl? [duplicate]

I have a file, ftp.txt, which contains many records such as:
>KCY60942 pep:novel supercontig:GCA_000682575.1:ab248605.contig.36_1:19:588:-1 gene:J738_3590 transcript:KCY60942 description:"putative transposase 1"
MTHLNELYLILNKYLKWNKSHLKCFALIMLVIILKQTCNLSSASKALPIKCLPQSFYRRM
QRFFAGQYFDYRQISQLIFNMFSFDQVQLTLDRTNWKWGKRNINILMLAIVYRGIAIPIL
WTLLNKRGNSDTKERIALIQRFIAIFGKDRIVNVFADREFIGEQWFTWLIEQDINFCIRV
KKTSLSPII
>KCY61710 pep:novel supercontig:GCA_000682575.1:ab248605.contig.22_1:4164:6320:1 gene:J738_2986 transcript:KCY61710 description:"tonB-dependent siderophore receptor family protein"
MQRTTKHFQINALALAIAMSTISAHAETDQQTSEYGTLPTIKVKAGSGQENEKSYIAGKT
DTAVPLGLSVREVPQSVSVITQQRLQDQQLSTLVEVAENVTGVSVNRYETNRGGIYSRGF
VVDNYIIDGIPTTYSLPWSSGEIFSSMALYDHIDVVRGATGLTFGAGNPSAAINMVRKRA
TSTEPTANVEVSAGSWDNYRVMGDIANSLNQSGTVRGRAVAQYEQGDSYTDLLSKEKLSL
LLSAEADLSENTLLSGGVTYQEDDPRGPMWGGLPVWFSDGTKTNWSKNITTSADWTRWNV
KYTNLFADLTHKFNDNWSAKLSYSHGKRDANSKLLYVSGSVDKNTGLGLSPYASAYDLEV
EQDNASLQLNGSFDLWGLEQKVVLGYQYSNQDFTAYARSTDTKMEIGNFFEWNGSMPEPV
WNAPTLNEKYNIEQNALFAATYLNPIEPLKFILGGRFTNYEKNIYGRSSSIKYDHEFVPY
AGIIYDFNDVYTAYASYTSIFQPQDKKDFDGNYLDPVEGNSTEVGLKSAWFDGRLNGTLA
LYHIKQDNLAQEAGDVTRNGVKEIYYRAAKGATSEGFEVEVSGQITPDWNITAGYSQFSA
KDTNDVDVNTQLPRKMIQTFTTYKLSGKLENITVGGGVNWQSSTYINAENPKEVIEKVEQ
GDYALVNLMARYQITKDFSAQLNINNVFDKKYYGVFPAYGQITLGAPRNAALTLQYKF
I want to separate the records and save each one under a different file name.
I tried the code below, but I only get the lines that start with ">":
#!/usr/local/bin/perl
open( FILE, "/home/httpd/cgi-bin/r/ftp.txt" );
while ( $line = <FILE> ) {
if ( $line =~ m/^\>/g ) {
print $line;
}
}
My desired output is for the two records, which start with >KCY60942 and >KCY61710, to be saved under different file names: the >KCY60942 record in one file and the >KCY61710 record in another.

Here's another option:
use strict;
use warnings;
local $/ = '';
while (<>) {
    my ($fileName) = /^>([^\s]+)/;
    open my $fh, '>', "$fileName.txt" or die "Can't write to '$fileName.txt'";
    print $fh $_;
    close $fh;
}
Usage: perl script.pl inFile
Since each (FASTA?) record is a paragraph, $/ is set to empty ('') to read the file in paragraph mode--one 'record' at a time. Each record's id is captured for use as that record's file name, and then that record is written to its file.
Hope this helps!

Something like this should do the trick:
#!/usr/local/bin/perl
use strict;
use warnings;
open( my $file, "<", "/home/httpd/cgi-bin/r/ftp.txt" ) or die $!;
open( my $output, ">", "pre-match" ) or die $!;
while ( my $line = <$file> ) {
    if ( $line =~ m/^\>/ ) {
        my ($output_name) = ( $line =~ m/^\>(\w+)/ );
        close($output);
        open( $output, ">", $output_name . ".output" ) or die $!;
    }
    print {$output} $line;
}
close($output);
If your line matches that regular expression, we 'pick out' the first word (so KCY61710 etc.) and open a file called KCY61710.output.
We print each line as we go to this output, closing and re-opening each time we hit one of those lines.
A pre-match file exists in case the first line(s) don't match this pattern.

Perl regex to change date format

I'm totally new to Perl, and I need to get a small find and replace done to change the date format in a set of large files.
The files have dates in the format: dd.mm.yyyy
and I need to change them to: mm-dd-yyyy
How do I do this with perl?
I have some basic code that reads through the files in a directory and writes to an output file; I believe my replace logic needs to go inside the while loop (if I'm not wrong!).
#!c:/perl64/bin/perl.exe
#loop around a directory
@files = <C:/perl64/data/*>;
# loop around files
foreach $file (@files) {
#Read File
open READ, $file or die "Cannot open $read for read :$!";
#Output File
$fname=substr($file, rindex($file,"/")+1,length($file)-rindex($file,"/")-1);
$write="C:/perl64/output/$fname";
open WRITE, ">$write" or die "Cannot open $write for write :$!";
#Loop Around file
while (<READ>) {
# TO DO: Change date format from dd.mm.yyyy to mm-dd-yyyy
#Write to output file
print WRITE "$_";
}
}
Regards,
Anand

You can use the substitution operator s///:
while (<READ>) {
s/(\d{2})\.(\d{2})\.(\d{4})/$2-$1-$3/;
print WRITE "$_";
}
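To see the substitution on its own, here is a small sketch applied to a made-up sample string. The sub name swap_date is hypothetical, and the \b anchors and the /g flag are additions to the pattern above: \b keeps longer digit runs from matching, and /g converts every date on a line rather than only the first.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the answer.
sub swap_date {
    my ($text) = @_;
    # dd.mm.yyyy -> mm-dd-yyyy: capture day, month and year, then
    # emit them as month-day-year.
    $text =~ s/\b(\d{2})\.(\d{2})\.(\d{4})\b/$2-$1-$3/g;
    return $text;
}

print swap_date("Start 31.12.2023, end 01.02.2024\n");
# prints: Start 12-31-2023, end 02-01-2024
```

The pattern only checks the shape of the date, not that the day and month are valid values.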

Here's a simple script that will take optional arguments for input and output directories.
The use of opendir instead of a glob will save you some trouble cleaning up the file names.
use strict;
use warnings;
use autodie;
my $indir = shift || "C:/perl64/data";
my $outdir = shift || "C:/perl64/output";
opendir(my $in, $indir);
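while (defined(my $name = readdir $in)) {
    # readdir returns bare names, so prepend the directory before testing or opening
    next unless -f "$indir/$name";
    open my $infile, '<', "$indir/$name";
    open my $outfile, '>', "$outdir/$name";
    while (<$infile>) {
        s/([0-9]{2})\.([0-9]{2})\.([0-9]{4})/$2-$1-$3/g;
        print $outfile $_;
    }
}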

Why is this substitution involving the end of each line adding more to the beginning of the line?

I have some document into which I want to add something at the beginning and at the end of each line. The original document looks like this:
firstLine
secondline
I want to turn it into this:
put 'firstLine';
put 'secondline';
By using the following Perl script, I can only turn it into this:
put 'firstLine';
';put 'secondline';
It seems that there is a $ at the end of the first line and at the beginning of the second line. Could someone help me to figure out what is wrong with the following Perl script?
use File::Find;
use strict;
my ($filename, @lines, $oldterm, $newterm); #,$File::Find::name);
my $dir = ".";
open MYFILE, ">error.txt" or die $!;
find(\&edits, $dir);
sub edits() {
$filename = $File::Find::name;
if (grep(/\.txt$/, $filename)) { #only process the perl files
# open the file and read data
# die with grace if it fails
open(FILE, "<$filename") or die "Can't open $filename: $!\n";
@lines = <FILE>;
close FILE;
# open same file for writing, reusing STDOUT
open(STDOUT, ">$filename") or die "Can't open $filename: $!\n";
# walk through lines, putting into $_, and substitute 2nd away
for (@lines) {
s/(&.+)/' "$1" '/ig;
s/^/put '/ig;
s/$/';/ig;
print;
}
#Finish up
close STDOUT;
}
}

Don't use regular expressions at all; you already have the lines separated in the @lines array:
for ( @lines ) {
    chomp; # remove the newline at the end of the implicit variable $_
    print "put '$_';\n";
}
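As for why the original script produced the stray "';" at the start of the next line: $ can match both just before a trailing newline and at the very end of the string, so with the /g flag the substitution s/$/';/g fires twice on a line that still carries its newline. A small demonstration on a made-up string (the sub name add_suffix is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper running the same end-of-line substitution as the question.
sub add_suffix {
    my ($text) = @_;
    $text =~ s/$/';/g;    # /g lets $ match at every end-of-line position
    return $text;
}

# With the newline still attached, $ matches before "\n" and again after it.
my $broken = add_suffix("firstLine\n");    # "firstLine';\n';"

# After chomp there is only one end-of-string position left.
my $line = "firstLine\n";
chomp $line;
my $fixed = add_suffix($line);             # "firstLine';"

# Show the newline as a visible \n so both results fit on one line each.
for my $result ($broken, $fixed) {
    (my $shown = $result) =~ s/\n/\\n/g;
    print "$shown\n";
}
```

Dropping the /g flag from that substitution also avoids the second match, since only the first end-of-line position is then replaced.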

If you do it in one step, capturing the whole line and wrapping it in a single substitution, you should have better luck. Something like:
s/^(.+)$/put '$1';/m;
