I have written a small perl "hack" to replace 1's with alphabets in a range of columns in a tab delimited file. The file looks like this:
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
The gola is to take columns 11 through 16 and append letters A to Z if a value of 1 is seen in those columns. My code so far is producing an empty output and this is my first time doing regular expressions.
cat infile.txt \
| perl -ne '#alphabet=("A".."Z");
$is_known_intron = 0;
$is_known_donor = 1;
$is_known_acceptor = 1;
chomp;
$_ =~ s/^\s+//;
#d = split /\s+/, $_;
#d_bool=#d[$11-$16];
$ct=1;
$known_intron = $d[$10];
$num_of_overlapping_gene = $d[$9];
$known_acceptor = $d[$8];
$known_donor = $d[$7];
$k="";
if (($known_intron == $is_known_intron) and ($known_donor == $is_known_donor) and ($known_acceptor == $is_known_acceptor)) {
for ($i = 0; $i < scalar #d_bool; $i++){
$k.=$alphabet[$i] if ($d_bool[$i])
}
$alphabet_ct{$k}+=$ct;
}
END
{
foreach $k (sort keys %alphabet_ct){
print join("\t", $k, $alphabet_ct{$k}), "\n";
}
} '\
> Outfile.txt
What should I be doing instead?
Thanks!
* Edit *
Expected Output
ABCD 45
BCD 23
ABCDEF 1215
so on and so forth.
I converted your code into a script for ease of debugging. I've put comments in the code to point out dodgy bits:
use strict;
use warnings;
my %alphabet_ct;
my #alphabet = ( "A" .. "Z" );
my $is_known_intron = 0;
my $is_known_donor = 1;
my $is_known_acceptor = 1;
while (<DATA>) {
# don't process the first line
next unless /chr10/;
chomp;
# this should remove whitespace at the beginning of the line but is doing nothing as there is none
$_ =~ s/^\s+//;
my #d = split /\s+/, $_;
# the range operator in perl is .. (not "-")
my #d_bool = #d[ 10 .. 15 ];
my $known_intron = $d[9];
my $known_acceptor = $d[7];
my $known_donor = $d[6];
my $k = "";
# this expression is false for all the data in the sample you provided as
# $is_known_intron is set to 0
if ( ( $known_intron == $is_known_intron )
and ( $known_donor == $is_known_donor )
and ( $known_acceptor == $is_known_acceptor ) )
{
for ( my $i = 0; $i < scalar #d_bool; $i++ ) {
$k .= $alphabet[$i] if $d_bool[$i];
}
# it is more idiomatic to write $alphabet_ct{$k}++;
# $alphabet_ct{$k} += $ct;
$alphabet_ct{$k}++;
}
}
foreach my $k ( sort keys %alphabet_ct ) {
print join( "\t", $k, $alphabet_ct{$k} ) . "\n";
}
__DATA__
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
With $is_known_intron set to 1, the sample data gives the results:
ABCDE 1
ABCE 1
ACD 1
CE 2
Related
I'm looking for a clean method in Perl to merge a collection of
lists. The all have the same length, and each consists mainly of
zeros, but also has short contiguous segments of non-zero
entries. For example, here are two representative lists of
length 25:
#flags1 = qw( 0 0 0 0 21 22 23 0 0 0 0 0 0 0 0 41 42 43 0 0 0 0 0 0 0);
#flags2 = qw(11 12 13 0 0 0 0 0 0 0 0 0 0 31 32 33 0 0 0 0 0 51 52 53 0);
The objective is to merge the elements of #flags2 into #flags1 for all the places
where a contiguous clump of non-zero elements in #flags2 replaces
only zero entries in #flags1. If there is an overlap with any of
the non-zero elements of #flags1, the associated contiguous clump
of non-zero values in #flags2 is discarded instead of being
merged.
Thus, for the example above, the contiguous clump of values 31,
32, and 33 in #flags2[13..15] are discarded because one of the
entries, $flags2[15] is non-zero and collides with the non-zeros value at $flags1[15]. The resulting desired merged list
would be:
#merged = qw(11 12 13 0 21 22 23 0 0 0 0 0 0 0 0 41 42 43 0 0 0 51 52 53 0);
I have experimented with collecting up contiguous elements of
non-zero elements into a list of lists, and then comparing them
using for and if statements, but it's a mess, and I think it will
be difficult for any other developer to understand the logic. If
anyone could propose an more elegant solution that would be much
appreciated.
use List::Util qw( none );
my $s = 0;
while (1) {
# Find start of next clump.
++$s while $s < #flags2 && !$flags2[$s];
# Exit if at end of array.
last if $s == #flags2;
# Find end of clump.
my $e = $s+1;
++$e while $e < #flags2 && $flags2[$e];
# Merge in clump.
my #clump = $s .. $e-1;
if ( none { $_ } #flags1[ #clump ] ) { # Or `!grep { $_ }`
#flags1[ #clump ] = #flags2[ #clump ];
}
$s = $e;
# Exit if at end of array.
last if $s == #flags2;
}
This is another approach that is akin to the merge portion of a merge sort.
sub get_next_clump {
my ( $f, $s ) = #_;
++$s while $s < #$f && !$f[$s];
return if $s == #$f;
my $e = $s+1;
++$e while $e < #$f && $f[$e];
return $s, $e;
}
my $ok1 = my ( $f1_s, $f1_e ) = get_next_clump( \#flags1, 0 );
my $ok2 = my ( $f2_s, $f2_e ) = get_next_clump( \#flags2, 0 );
while ( $ok1 && $ok2 ) {
if ( $f2_s < $f1_e && $f2_e > $f1_s ) {
$ok2 = ( $f2_s, $f2_e ) = get_next_clump( \#flags2, $f2_e );
next;
}
if ( $f1_s < $f2_s ) {
$ok1 = ( $f1_s, $f1_e ) = get_next_clump( \#flags1, $f1_e );
} else {
#flags1[ $f2_s .. $f2_e-1 ] = #flags2[ $f2_s .. $f2_e-1 ];
$ok2 = ( $f2_s, $f2_e ) = get_next_clump( \#flags2, $f2_e );
}
}
while ( $ok2 ) {
#flags1[ $f2_s .. $f2_e-1 ] = #flags2[ $f2_s .. $f2_e-1 ];
$ok2 = ( $f2_s, $f2_e ) = get_next_clump( \#flags2, $f2_e );
}
Your approach is workable, it just needs some organization. Let's take it a step at a time:
sub to_ranges {
my $in = shift;
my (#ret, $in_range);
for my $i (0 .. $#$in) {
if ($in->[$i]) {
if ($in_range) { # Extend an existing range
$ret[-1]{end} = $i;
push #{$ret[-1]{values}}, $in->[$i];
} else { # Start a new one
push #ret, { start => $i, end => $i, values => [ $in->[$i] ] };
$in_range = 1;
}
} else {
$in_range = 0;
}
}
# Dummy entry to make sure the output will be padded to the right length
push #ret, { start => scalar #$in, end => scalar #$in, values => [] };
return \#ret;
}
This turns a list into a list of "chunks", each of which knows its start, end, and the values it contains. (end is not strictly necessary but it makes things tidier).
sub from_ranges {
my $in = shift;
my #ret;
for my $r (#$in) {
push #ret, 0 while $#ret < $r->{end};
splice #ret, $r->{start}, $r->{end} - $r->{start} + 1, #{ $r->{values} };
}
return \#ret;
}
This does the reverse transformation: from_ranges(to_ranges(\#x)) should contain the same elements as #x.
sub overlaps_any {
my ($r, $ll) = #_;
for my $l (#$ll) {
return 1 if $r->{start} >= $l->{start} && $r->{start} <= $l->{end};
return 1 if $r->{end} >= $l->{start} && $r->{end} <= $l->{end};
}
return 0;
}
This is a helper that returns true if the range $r overlaps any of the ranges in #$ll.
sub merge_ranges {
my ($ll, $rr) = #_;
my #rr_new = grep { !overlaps_any($_, $ll) } #$rr;
return [
sort {
$a->{start} <=> $b->{start}
} #$ll, #rr_new
];
}
And this takes two sets of ranges, #$ll and #$rr and returns all of the ranges in #$ll plus the ranges in #$rr that don't overlap. The sort is actually only for ease of debugging; you can just return [ #$ll, #rr_new ] if you prefer.
sub merge {
my ($ll, $rr) = #_;
return from_ranges(
merge_ranges(
to_ranges($ll),
to_ranges($rr),
)
);
}
put the pieces together, and it works.
ikegami has provided an overall simpler soluion, but I'll still offer up this one because maybe you have other things you need to do that would benefit from this representation.
I need to replace a variable in one file with information/regex found in another file. For example, I need to grab the first string/variable (that starts with MSTRG) from each row in file1:
MSTRG.5734 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
MSTRG.26588 MSTRG.26588 196 5 0 12 0 0
and use this to search in file2 that will look something like this:
Chr1 StringTie transcript 24039360 24041181 1000 - . gene_id "MSTRG.5734"; transcript_id "MSTRG.5734.1";
Chr1 StringTie transcript 24039810 24040595 1000 - . gene_id "MSTRG.5734"; transcript_id "Transcript:AT1G64700.1"; ref_gene_id "Gene:AT1G64700"
Chr1 StringTie exon 24040560 24041181 1000 - . gene_id "MSTRG.19266"; transcript_id "MSTRG.19266.1"; exon_number "2";
Chr1 StringTie exon 24040560 24041181 1000 - . gene_id "MSTRG.26588"; transcript_id "MSTRG.26588.1"; exon_number "2";
Chr1 StringTie transcript 24039810 24040595 1000 - . gene_id "MSTRG.26588"; transcript_id "Transcript:AT5G41000.1"; ref_gene_id "Gene:AT5G41000";
Ideally when e.g. MSTRG.5734 is found on a line in file2 that also contains the string e.g. Gene:AT1G64700, it will grab the information Gene:AT1G64700 and replace MSTRG.5734 in file1. So, every MSTRG on each row in file1 is unique and it will theoretically match a unique Gene in file2. If it does not match a Gene then I need the original row in file1 to be maintained.
File1 output should then look like:
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
My current perl code is:
use strict;
use warnings;
use vars qw($outfile #id $mstrg $gene);
open(SEARCH, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
open(FILE, $ARGV[1]) or die "Couldn't open $ARGV[1]: $!";
$outfile = "testout.txt";
open (OUT, ">$outfile") || die "Can't open $outfile for creation: $!\n";
my %mstrg;
while (<SEARCH>) {
chomp;
if (/^MSTRG/) {
chomp $_;
#id = split (/\t/, $_);
$mstrg{$id[1]}.="$id[1]";
}
}
while (<FILE>) # {#ffn=<FILE>};
{
chomp ($gene=$_);
if ($mstrg =~ /$gene/) {
print OUT "$id[1]\t$id[2]";}
else {
#print OUT "$_\n";
}
next;
}
close FILE;
This is unfortunately where I get stuck and not sure how to proceed?
Thanks for any help given, appreciated and apologies for the previous post where I did not include any code to those that saw.
I would do this somewhat like:
use 5.014; #needed min 5.014 because the /r modifier
use warnings;
use Path::Tiny '0.077'; #added the min. req. version
my $file1='file1.txt';
my $file2='file2.txt';
my %mstmap = map { split /\s+/, s/.*?gene_id\s*"\s*(MSTRG\.\d+).*ref_gene_id\s*"\s*(Gene:\w+)".*/$1 $2/r }
grep { /ref_gene_id.*Gene:/ } path($file2)->lines({chomp => 1});
path($file1)->edit_lines( sub { s/^(MSTRG\.\d+)/exists($mstmap{$1}) ? $mstmap{$1} : $1/e });
for your input files produces
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
It create a hash for the pairs: MSTRG.number => Gene:String (from the file2), and using the Path::Tiny module editing fucntion doing the replaces in the file1.
After #Borodin comments the above could be reduced to:
use 5.014;
use warnings;
use Path::Tiny '0.077';
my $file1='f1';
my $file2='f2';
my %mstmap = map {
/.*?gene_id\s*"\s*(MSTRG\.\d+).*ref_gene_id\s*"\s*(Gene:\w+).*/
} path($file2)->lines({chomp => 1});
path($file1)->edit_lines( sub { s/^(MSTRG\.\d+)/exists($mstmap{$1}) ? $mstmap{$1} : $1/e });
There's no need to write impregnable code to achieve the result you want
This program reads through $file2 building hash %mstrg from all the lines that contain both an MSTRG. and a Gene: string. It then creates a regex in $re that will match any one of the MSTR. strings found
$file1 is opened, and that regex is used to replace any of the hash keys with the corresponding hash value it appears at the start of the line. The line is then printed
It isn't clear whether the first two fields of file1.txt are always the same, but I've opted to alter only the first field
I've used the autodie pragma to avoid having to explicitly check for success of any file IO operations
The program prints the output to STDOUT, so you can redirect it wherever you like on the command line
use strict;
use warnings 'all';
use autodie;
my ( $file1, $file2 ) = #ARGV;
my %mstrg;
{
open my $fh, '<', $file2;
while ( <$fh> ) {
$mstrg{$1} = $2 if /"(MSTRG.\d+)".*"(Gene:\w+)"/;
}
}
my $re = join '|', sort { length $b <=> length $a } keys %mstrg;
open my $fh, '<', $file1;
while ( <$fh> ) {
s/^($re)\b/$mstrg{$1}/;
print;
}
output
Gene:AT1G64700 MSTRG.5734 509 -4 0 -14 0 0
MSTRG.19266 MSTRG.19266 842 -4 0 -12 0 0
Gene:AT5G41000 MSTRG.26588 196 5 0 12 0 0
I want to match the fifth column i.e ",,," and ",,,," and "," except new line i.e "\n" and then replace them with some value. the following is a file delimited by space . I tried following code:
Note: Though the example shows commas in the fifth column.It could be any characters (including tab \t) other than newline (\n).
my $delimiter="**";
my $dir_to_check=$DIR;
opendir my $DIR, $dir_to_check or die "Error in opening dir '$dir_to_check' because: $!";
my #files = readdir($DIR);
closedir($DIR);
foreach my $file (#files)
{
if($file =~ /\.fmt/)
{
unless ( open( CONTRL_FILE, "< $dir_to_check/$file" ) ) {
print "error while opening file $dir_to_check/$file \n"
} # UNLESS
if ($file eq 'test.fmt')
{
unless ( open( CONTRL_FILE_1, "> $dir_to_check/$file.temp" ) ) {
print "error while opening file $file \n"
} # UNLESS
while(<CONTRL_FILE>)
{
$_ =~ s/"[^\s]+"/"$delimiter"/ ;
print CONTRL_FILE_1 $_;
}
close(CONTRL_FILE_1);
}
}
}
Data:
1 SQLCHAR 0 5 ",,," 1 ""
2 SQLCHAR 0 25 ",,,," 2 ""
3 SQLCHAR 0 1 "," 3 ""
4 SQLCHAR 0 12 "," 4 ""
5 SQLCHAR 0 1 "\n" 5 ""
Result:
1 SQLCHAR 0 5 "*****" 1 ""
2 SQLCHAR 0 25 "*****" 2 ""
3 SQLCHAR 0 1 "*****" 3 ""
4 SQLCHAR 0 12 "*****" 4 ""
5 SQLCHAR 0 1 "*****" 5 ""
Expected Result :
1 SQLCHAR 0 5 "**" 1 ""
2 SQLCHAR 0 25 "**" 2 ""
3 SQLCHAR 0 1 "**" 3 ""
4 SQLCHAR 0 12 "**" 4 ""
5 SQLCHAR 0 1 "\n" 5 ""
If you are using an older version of Perl then that may be a factor. In any case, I would suggest you make a minor modification ...
$_ =~ s/"[^\s"]+"/"$delimiter"/;
... that is a ", one or more NOT whitespace OR ", then a "
C:\Users\Ken>type test.pl
#!C:\Strawberry\perl\bin\perl -w
$\="\n";
my $d="**";
my $L1="5 SQLCHAR 0 1 \",,,\" 5 \"\"";
my $L2="5 SQLCHAR 0 1 \"\n\" 5 \"\"";
foreach my $L ($L1,$L2)
{
print "LineIn=$L";
if ($L=~ s/"[^\s"]+"/"$d"/) {print "#YES L=$L";}
else {print "#NO L=$L";}
}
C:\Users\Ken>test.pl
LineIn=5 SQLCHAR 0 1 ",,," 5 ""
#YES L=5 SQLCHAR 0 1 "**" 5 ""
LineIn=5 SQLCHAR 0 1 "
" 5 ""
#NO L=5 SQLCHAR 0 1 "
" 5 ""
Since the OP says in his comment that the contents on 4th column could be "any combination of non white-space characters", and that he states that he does not want substitution to happen for the case when 4th column contains "\n" literally, I suggest he matches the contents of 4th column and then test, in two steps, whether what is in quotes includes a literal representation of what Perl would understand as a whitespace.
For doing that, we could use eval or we could use a regexp with the ee modifier, which is better and safer.
Here is an example using the latter (update - dataset correctly includes the OP's and additional cases):
#!/usr/bin/perl
use strict;
use warnings;
my $delimiter="**";
while (<DATA>) {
# we capture the contents of the quotes in
# 4th column, checking also the expected format
if (/(^([^\s]+\s+){4})"([^"]+)"(.*)/) {
my $st = $3;
# "\n" in the file is actually "\\n" for Perl
# so, to have Perl understand it as "\n", we need
# to have Perl effectively escape it, we can
# do that with a regexp and the ee modifier
$st =~ s/\\([tnfr])/"qq{\\$1}"/gee;
# now this will match an "\n", "\r", "\f" or "\t"
if (!($st =~ /\s/)) {
print "$1\"$delimiter\"$4\n";
} else {
print $_;
}
} else {
print "error: wrong line format: $_\n";
}
}
__DATA__
1 SQLCHAR 0 5 ",,," 1 ""
2 SQLCHAR 0 25 ",,,," 2 ""
3 SQLCHAR 0 1 "," 3 ""
4 SQLCHAR 0 12 "," 4 ""
5 SQLCHAR 0 1 "\n" 5 ""
6 SQLCHAR 0 8 "a b" 6 ""
7 SQLCHAR 0 8 "\t" 7 ""
8 SQLCHAR 0 9 "\" 8 ""
9 SQLCHAR 0 9 "stuff\" 8 ""
which would result in:
1 SQLCHAR 0 5 "**" 1 ""
2 SQLCHAR 0 25 "**" 2 ""
3 SQLCHAR 0 1 "**" 3 ""
4 SQLCHAR 0 12 "**" 4 ""
5 SQLCHAR 0 1 "\n" 5 ""
6 SQLCHAR 0 8 "a b" 6 ""
7 SQLCHAR 0 8 "\t" 7 ""
8 SQLCHAR 0 9 "**" 8 ""
9 SQLCHAR 0 9 "**" 8 ""
Please note that there is no easy way to determine what a given script running on a given environment could understand as being a "Perl whitespace", since it depends on many factors, and that [\t\n\f\r ] is just a simplified view of what Perl can understand as whitespace.
Quoting a bit of perlrecharclass:
Whitespace
\s matches any single character considered whitespace.
If the /a modifier is in effect ...
In all Perl versions, \s matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab, the newline, the form feed, the carriage
return, and the space. Starting in Perl v5.18, it also matches the
vertical tab, \cK . See note 1 below for a discussion of this.
otherwise ...
For code points above 255 ...
\s matches exactly the code points above 255 shown with an "s" column in the table below.
For code points below 256 ...
if locale rules are in effect ...
\s matches whatever the locale considers to be whitespace. (...)
I want the match to occur but it's not happening.
I have an issue. I wish to find occurences of lines from one file in another.
Here's one file (#file)
735 1 1
1891 1 0
2021 1 1
1892 2 1
667 1 0
802 2 1
665 1 0
666 1 1
596 1 0
3193 2 1
Here's the one in which I have to find above lines (#file1)
1521 1 0 : 1167 0 0 : 1167 2 0 : 1167 1 0 ;
2605 1 1 ;
2280 0 1 : 2280 2 0 : 1892 0 0 : 2280 1 0 : 2021 0 0 ;
1892 2 1 : 667 0 1 : 667 1 0 ;
1892 1 1 ;
Here's the code I wrote
foreach $leadline (#file1) {
foreach $line (#file) {
$_ = ' ' . $leadline;
$line = ' ' . $line;
if (m/$line/) {
push #final, $_;
}
}
}
But I am unable to detect the lines.
#file1 and #file variables store the contents of the files.
I either get no lines detected or all lines detected.
The reason I am concatenating a Space before the two lines is , that sometimes 667 1 0 can occur as the very first phrase in a given line.I am not comfortable with Regex to do that in Regex directly.
Note :- If line i and line j in first file occur as a pattern in the same line of the other file then output should be just one of the lines. Also , if a pattern 1667 1 0 is found, it shouldn't be confused with 667 1 0. Hence I added the whitespace.
I was able to achieve this goal in Python but unable to replicate it in Perl . Here's the Python snippet :-
for line1 in file1:
for j in range(0,len(file0)-1):# in file0:
if ' '+lines[j][0:len(file0[j])-1] in ' '+line1:
i = i + 1
print line1[0:len(line1)-1]
break
Expected output is :-
1892 2 1 : 667 0 1 : 667 1 0 ;
I now think this is a solution to a different problem, but here it is anyway!
use warnings;
use strict;
use 5.010;
use Array::Utils 'array_diff';
open my $fh, '<', 'f1.txt' or die $!;
my #f1;
while ( <$fh> ) {
push #f1, [split];
}
my #final;
open $fh, '<', 'f2.txt' or die $!;
while ( <$fh> ) {
my #f2 = map [ /\d+/g ], split /:/;
for my $f1 ( #f1 ) {
my #matches = grep { not array_diff(#$f1, #$_) } #f2;
push #final, map "#$_", #matches;
}
}
say for #final;
output
1892 2 1
667 0 1
667 1 0
Update
Okay here's my second attempt! This is essentially what choroba wrote but using map and with the addition of stripping all trailing whitespace on the data from the first file.
use warnings;
use strict;
use 5.014; # For non-destructive substitution
open my $fh, '<', 'f1.txt' or die $!;
my #f1 = map s/\s+\z//r, <$fh>;
my $re = join '|', #f1;
open $fh, '<', 'f2.txt' or die $!;
my #final = grep /\b(?:$re)\b/, <$fh>;
print for #final;
output
1892 2 1 : 667 0 1 : 667 1 0 ;
You can create a regex by joining the lines from file1 by | (and applying quotemeta on each). \b should prevent matching 667 in 1667.
#!/usr/bin/perl
use warnings;
use strict;
my #search;
open my $F1, '<', 'file1' or die $!;
while (<$F1>) {
chomp;
push #search, quotemeta;
}
my $regex = join '|', #search;
$regex = qr/\b(?:$regex)\b/;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
print if /$regex/;
}
Here is how I'd do the job:
use Modern::Perl;
use Data::Dumper;$Data::Dumper::Indent = 1;
my #file = (
'735 1 1',
'1891 1 0',
'2021 1 1',
'1892 2 1',
'667 1 0',
'802 2 1',
'665 1 0',
'666 1 1',
'596 1 0',
'3193 2 1',
);
my #final;
while(my $line = <DATA>) {
chomp $line;
if (grep{$line =~ /\b$_\b/} #file) {
push #final, $line;
}
}
say Dumper\#final;
__DATA__
1521 1 0 : 1167 0 0 : 1167 2 0 : 1167 1 0 ;
2605 1 1 ;
2280 0 1 : 2280 2 0 : 1892 0 0 : 2280 1 0 : 2021 0 0 ;
1892 2 1 : 667 0 1 : 667 1 0 ;
1892 1 1 ;
Output:
$VAR1 = [
'1892 2 1 : 667 0 1 : 667 1 0 ; '
];
With your files:
use Modern::Perl;
use Data::Dumper;$Data::Dumper::Indent = 1;
open my $fh, '<', 'file.txt' or die "unable to open 'file.txt': $!";
my #file = <$fh>;
chomp #file;
my #final;
open $fh, '<', 'file1.txt' or die "unable to open 'file1.txt': $!";
while(my $line = <$fh>) {
chomp $line;
if (grep{$line =~ /\b$_\b/} #file) {
push #final, $line;
}
}
say Dumper\#final;
I'm trying to extract the specific lines from a trace file like below:
- 0.118224 0 7 ack 40 ------- 1 2.0 7.0 0 2
r 0.118436 1 2 tcp 40 ------- 2 7.1 2.1 0 1
+ 0.118436 1 2 ack 40 ------- 2 3.1 2.1 0 3
- 0.118436 1 2 ack 40 ------- 2 4.1 2.1 0 3
r 0.120256 0 7 ack 40 ------- 1 2.0 7.0 0 2
I want to extract any line that have the following:
r x.xxxxx 1 2 xxx xx ------- x numbers.x 2.x x x.
Note: x means any value and numbers could be between 3-to-7.
here is my try-its not working !!:
if {[regexp \r+ ([0-9.]+) 1 2.*- ([3-7.]+) 2.*- ([0-9.]+) $line -> time]}
Any suggestion??
Here's another approach: extract the fields you want to use for comparison
while {[gets $f line] != -1} {
lassign [split $line] a - b c - - - - d e - -
if {
$a eq "r" &&
$b == 1 &&
$c == 2 &&
3 <= floor($d) && floor($d) <= 7 &&
floor($e) == 2
} {
puts $line
}
}
You have to escape the . with a \. It means "any character" in regexp.
So your regexp could look like:
if {[regexp {r \d\.\d{5} 1 2 \d{3} \d{2} ------- \d [3-7]\.\d 2\.\d \d \d} $line -> time ]} {
# ...
}
Now you have to place () around the part you want.
Btw: I used the following transformation on your description of what you want to match:
set input {r x.xxxxx 1 2 xxx xx ------- x numbers.x 2.x x x}
set re [subst [regsub -all {x{2,}} $data {\\\\d{[string length \0]}}]]
set re [string map {. {\.} x {\d} numbers {[3-7]}} $re]