Is there any grep/sed option which will allow me to match a pattern after matching another pattern? For example: Input file (foos are variable patterns starting with 0 mixed with random numbers preceded by # in front):
0foo1
0foo2
0foo3
\#89888
0foo4
0foo5
\#98980
0foo6
So once I try to search for a variable pattern (eg. foo2), I also want to match another pattern (eg, #number) from this pattern line number, in this case, #89888.
Therefore output for variable foo2 must be:
foo2 #89888
For variable foo5:
foo5 #98980
The foos can contain any character, including characters that may be treated as regex metacharacters.
I tried a basic regex match script using tcl which will first search for foo* and then search for next immediate #, but since I am working with a very large file, it will take days to finish. Any help is appreciated.
A Perl one-liner that slurps the whole file and matches across newlines for the pattern you seek would look like:
perl -0777 -nle 'm{(foo2).*(\#89888)}s and print join " ",$1,$2' file
The -0777 switch enables "slurp" mode, which tells Perl not to split the file into records but to treat it as one large string (-000, by contrast, is paragraph mode, which splits on blank lines). The s modifier lets . match any character, including a newline.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my ( %matches, $recent_foo );
while(<DATA>)
{
chomp;
( $matches{$recent_foo} ) = $1 if m/(\\#\d+)/;
( $recent_foo ) = $1 if m/(0foo\d+)/;
}
print Dumper( \%matches );
__DATA__
0foo1
0foo2
0foo3
\#89888
0foo4
0foo5
\#98980
0foo6
./perl
$VAR1 = {
'0foo5' => '\\#98980',
'0foo3' => '\\#89888'
};
If what you want is 0foo1, 0foo2 and 0foo3 to all have the same value the following will do:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my ( %matches, @recent_foo );
while(<DATA>)
{
chomp;
if (/^\\#/)
{
@matches{@recent_foo} = ($') x @recent_foo;
undef @recent_foo;
}
elsif (/^0/)
{
push @recent_foo, $';
}
}
print Dumper( \%matches );
__DATA__
0foo1
0foo2
0foo3
\#89888
0foo4
0foo5
\#98980
0foo6
gives:
$VAR1 = {
'foo2' => '89888',
'foo1' => '89888',
'foo5' => '98980',
'foo3' => '89888',
'foo4' => '98980'
};
Var='foo2'
sed "#n
/${Var}/,/#[0-9]\{1,\}/ {
H
/#[0-9]\{1,\}/ !d
s/.*//;x
s/.//;s/\n.*\\n/ /p
q
}" YourFile
The request is not entirely clear. This takes the first occurrence of your pattern foo2 up to the first #number, removes the lines in between, prints the two lines as one, and then quits (no other extraction is done).
A Tcl solution. The procedure runs in a little over 3 microseconds, so you'll need very large data files to have it run for days. If more than one token matches, the first match is used (it's easy to rewrite the procedure to return all matches).
set data {
0foo1
0foo2
0foo3
\#89888
0foo4
0foo5
\#98980
0foo6
}
proc find {data pattern} {
set idx [lsearch -regexp $data $pattern]
if {$idx >= 0} {
lrange $data $idx $idx+1
}
}
find $data 0foo3
# -> 0foo3 #89888
find $data 0f.*5
# -> 0foo5 #98980
Documentation: if, lrange, lsearch, proc, set
sed
sed -n '/foo2/,/#[0-9]\+/ {s/^[[:space:]]*[0\\]//; p}' file |
sed -n '1p; $p' |
paste -s
The first sed prints all the lines between the first pattern and the 2nd, removing optional leading whitespace and the leading 0 or \.
The second sed extracts only the first and last lines.
The paste command prints the 2 lines as a single line, separated with a tab.
awk
awk -v p1=foo5 '
$0 ~ p1 {found = 1}
found && /#[0-9]+/ { sub(/^\\/, ""); print p1, $0; exit }
' file
tcl
lassign $argv filename pattern1
set found false
set fid [open $filename r]
while {[gets $fid line] != -1} {
if {[string match "*$pattern1*" $line]} {
set found true
}
if {$found && [regexp {#\d+} $line number]} {
puts "$pattern1 $number"
break
}
}
close $fid
Then
$ tclsh 2patt.tcl file foo4
foo4 #98980
Is this what you want?
$ awk -v tgt="foo2" 'index($0,tgt){f=1} f&&/#[0-9]/{print tgt, $0; exit}' file
foo2 \#89888
$ awk -v tgt="foo5" 'index($0,tgt){f=1} f&&/#[0-9]/{print tgt, $0; exit}' file
foo5 \#98980
I'm using index() above as it searches for a string not a regexp and so could not care less what RE metacharacters are in foo - they are all just literal characters in a string.
It's not clear from your question if you want to find a specific number after a specific foo or the first number after foo2 or even if you want to search for a specific foo value or all "foo"s or...
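As an illustration of the same idea outside awk, here is a minimal Python sketch, assuming the goal is "the first #number at or after the first line that contains the target as a literal string". The function name first_number_after and the in-memory sample list are hypothetical; with a real file you would pass the open file object instead, so that lines are read one at a time.

```python
import re
from typing import Iterable, Optional

NUMBER = re.compile(r"#\d+")  # the "#number" marker lines

def first_number_after(lines: Iterable[str], target: str) -> Optional[str]:
    """Return 'target #number' for the first #number found at or after
    the first line containing target, or None if there is none."""
    found = False
    for line in lines:
        # Literal substring test, so regex metacharacters in target are harmless.
        if not found and target in line:
            found = True
        if found:
            match = NUMBER.search(line)
            if match:
                return f"{target} {match.group(0)}"
    return None

# Sample data mirroring the input file from the question.
sample = ["0foo1", "0foo2", "0foo3", r"\#89888",
          "0foo4", "0foo5", r"\#98980", "0foo6"]
print(first_number_after(sample, "foo2"))  # foo2 #89888
print(first_number_after(sample, "foo5"))  # foo5 #98980
```

Like the awk version, this stops at the first hit, so it never reads more of the input than it needs.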
Related
Recently I ran into a strange bug when using a dynamic regular expression in Perl to match nested brackets. The original string is " {...test{...}...} ", and I want to grab the brace pair beginning with test, "test{...}". There may be many pairs of braces before and after this group, and I don't know how deeply they are nested.
Following is my match scripts: nesting_parser.pl
#! /usr/bin/env perl
use Getopt::Long;
use Data::Dumper;
my %args = @ARGV;
if(exists$args{'-help'}) {printhelp();}
unless ($args{'-file'}) {printhelp();}
unless ($args{'-regex'}) {printhelp();}
my $OpenParents;
my $counts;
my $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;$counts++; print "\nLeft:".$OpenParents." ;"})
| \} (?(?{$OpenParents ne 0; $counts++}) (?{$OpenParents--;print "Right: ".$OpenParents." ;"})) (?(?{$OpenParents eq 0}) (?!))
)*
)
}x;
my $string = `cat $args{'-file'}`;
my $partten = $args{'-regex'} ;
print "####################################################\n";
print "Grep [$partten\{...\}] from $args{'-file'}\n";
print "####################################################\n";
while ($string =~ /($partten$NestedGuts)/xmgs){
print $1."}\n";
print $2."####\n";
}
print "Regex has seen $counts brackts\n";
sub printhelp{
print "Usage:\n";
print "\t./nesting_parser.pl -file [file] -regex '[regex expression]'\n";
print "\t[file] : file path\n";
print "\t[regex] : regex string\n";
exit;
}
Actually my regex is:
our $OpenParents;
our $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;})
| \} (?(?{$OpenParents ne 0}) (?{$OpenParents--})) (?(?{$OpenParents eq 0}) (?!))
)*
)
}x;
I have added brace counting in nesting_parser.pl.
I also wrote a string generator for debugging: gen_nesting.pl
#! /usr/bin/env perl
use strict;
my $buffer = "{{{test{";
unless ($ARGV[0]) {print "Please specify the nest pair number!\n"; exit}
for (1..$ARGV[0]){
$buffer.= "\n\{\{\{\{$_\}\}\}\}";
#$buffer.= "\n\{\{\{\{\{\{\{\{\{$_\}\}\}\}\}\}\}\}\}";
}
$buffer .= "\n\}}}}";
open TEXT, ">log_$ARGV[0]";
print TEXT $buffer;
close TEXT;
You can generate a test file by
./gen_nesting.pl 1000
It will create a log file named log_1000, which contains 1000 lines of brace pairs.
Now we test our match scripts:
./nesting_parser.pl -file log_1000 -regex "test" > debug_1000
debug_1000 looks like a perfect result: the match succeeded. But when I generate a 4000-line test file and match again, it seems to crash:
./gen_nesting.pl 4000
./nesting_parser.pl -file log_4000 -regex "test" > debug_4000
The end of debug_4000 shows
{{{{3277}
####
Regex has seen 26213 brackts
I don't know what's wrong with the regular expression. It mostly works well for paired brackets, until recently I found that it crashed when I tried to match a text file of more than 600,000 lines.
I'm really confused by this problem and hope to solve it.
Thank you all!
First for matching nested brackets I normally use Regexp::Common.
Next, I'm guessing that your problem is that Perl's regular expression engine breaks after matching 32767 groups. You can verify this by turning on warnings and looking for a message like Complex regular subexpression recursion limit (32766) exceeded.
If so, you can rewrite your code using /g and \G and pos. The idea being that you match the brackets in a loop like this untested code:
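my $start = pos($string) // 0;   # pos is undef before the first /g match
pos($string) = $start;
my $open_brackets = 0;
my $failed;
while (0 < $open_brackets or $start == pos($string)) {
    # consume non-brace text, then exactly one brace
    if ($string =~ m/\G[^{}]*(\{|\})/g) {
        if ($1 eq '{') {
            $open_brackets++;
        }
        else {
            $open_brackets--;
        }
    }
    else {
        $failed = 1;
        last; # WE FAILED TO MATCH (Perl uses last, not break)
    }
}
if (not $failed and 0 == $open_brackets) {
    # substr takes a length, not an end offset
    my $matched = substr($string, $start, pos($string) - $start);
}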
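The counting idea above can also be sketched without any regex engine, which sidesteps recursion limits entirely. The following Python sketch is only an illustration under stated assumptions: the helper names balanced_block and find_named_block are hypothetical, braces inside quotes or comments are not treated specially, and the generated sample only imitates the shape of the file that gen_nesting.pl writes.

```python
from typing import Optional

def balanced_block(text: str, start: int) -> Optional[str]:
    """Return the substring from the '{' at text[start] through its
    matching '}', or None if there is no such balanced block."""
    if start >= len(text) or text[start] != "{":
        return None
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:          # the opening brace is now closed
                return text[start:i + 1]
    return None                     # input ended with braces still open

def find_named_block(text: str, name: str) -> Optional[str]:
    """Find the first 'name{...}' group with balanced braces."""
    pos = text.find(name + "{")
    if pos < 0:
        return None
    block = balanced_block(text, pos + len(name))
    return None if block is None else name + block

# Input shaped like the generated log file: 4000 lines of {{{{n}}}}.
body = "".join("\n" + "{" * 4 + str(n) + "}" * 4 for n in range(1, 4001))
sample = "{{{test{" + body + "\n}}}}"

result = find_named_block(sample, "test")
print(result is not None and result.endswith("\n}"))  # True
```

Because the depth is a plain integer, the number of brace pairs is limited only by memory, not by any engine limit.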
I need to grep a value from an array.
For example, I have these values:
@a = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl');
@Array = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl', 'branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
Now I need to loop over @a and find whether each value matches an element of @Array. For example
It works for me with grep. You'd do it the exact same way as in the List::MoreUtils example below, except for having grep instead of any. You can also shorten it to
my $got_it = grep { /$str/ } @paths;
my @matches = grep { /$str/ } @paths;
The pattern is by default tested against $_, which is each element of the list in turn. In scalar context grep returns the number of matching elements; in list context it returns the elements themselves. The $str and @paths are the same as below.
You can use the module List::MoreUtils as well. Its function any returns true/false depending on whether the condition in the block is satisfied for any element in the list, i.e. whether there was a match in this case.
use warnings;
use strict;
use List::MoreUtils qw(any);
my $str = 'branches/Soft/a.txt';
my @paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
my $got_match = any { $_ =~ m/$str/ } @paths;
With the list above, containing the $str, the $got_match is 1.
Or you can roll it by hand and catch the match as well
foreach my $p (@paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
This does print out the match.
Note that the strings you show in your example do not contain the one to match. I added it to my list for a test. Without it in the list no match is found in either of the examples.
To test for more than one string, with the added sample
my @strings = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl');
my @paths = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl',
'branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
foreach my $str (@strings) {
foreach my $p (@paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
# Or, instead of the foreach loop above use
# my $match = grep { /$str/ } @paths;
# print "Matched for $str\n" if $match;
}
This prints
Found it: branches/Soft/a.txt
Found it: branches/Soft/h.cpp
Found it: branches/Main/utils.pl
When the lines with grep are uncommented and foreach ones commented out I get the corresponding prints for the same strings.
The slashes and the dot in $a can cause problems in a regex, so you either have to escape them when doing the regex match or use a simple eq to find the matches:
Regex match with $a escaped:
my @matches = grep { /\Q$a\E/ } @array;
Simple comparison with "equals":
my @matches = grep { $_ eq $a } @array;
With your sample data both will give an empty array @matches because there is no match.
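The difference between an unescaped pattern, an escaped pattern and plain equality can be seen in a small Python sketch. This is only an illustration: re.escape plays the role that \Q...\E plays in Perl, and the path 'branches/Soft/aXtxt' is a made-up entry added to show what an unescaped dot lets through.

```python
import re

paths = ["branches/Soft/a.txt", "branches/Soft/aXtxt", "branches/Docs/A1/b.txt"]
needle = "branches/Soft/a.txt"

# Unescaped: the dot matches any character, so "aXtxt" also matches.
unescaped = [p for p in paths if re.search(needle, p)]

# Escaped: every character of the needle is matched literally.
escaped = [p for p in paths if re.search(re.escape(needle), p)]

# Plain equality: no regex at all, the whole string must be identical.
exact = [p for p in paths if p == needle]

print(unescaped)  # ['branches/Soft/a.txt', 'branches/Soft/aXtxt']
print(escaped)    # ['branches/Soft/a.txt']
print(exact)      # ['branches/Soft/a.txt']
```

Note that the escaped regex still matches substrings, while equality requires the whole string to be the same.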
This solved my question. Thanks to all, especially @zdim, for the valuable time and support.
my @SVNFILES = ('branches/Soft/a.txt', 'branches/Soft/b.txt');
my @paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
foreach my $svn (@SVNFILES)
{
chomp ($svn);
my $m = grep { /$svn/ } (@paths);
if ( $m eq '0' ) {
print "Files Mismatch\n";
exit 1;
}
}
You should escape characters like '/' and '.' in a regex when you need them as literal characters.
For example (single quotes keep the backslashes intact):
$a = 'branches\/Soft\/a\.txt';
Retry whatever you did with either grep or perl with that. If it still doesn't work, tell us precisely what you tried.
kindly explain, why this issue comes
my data file
DATA----1
DATA----2
DATA----3
DATA----4
DATA----5
DATA----6
DATA----7
SAMPLE----1
SAMPLE----12
SAMPLE----13
SAMPLE----2
SAMPLE----3
SAMPLE----4
SAMPLE----5
OTHER----1
OTHER----2
OTHER----3
I need every line that starts with DATA or SAMPLE in one array, and another array holding the lines that start with SAMPLE and end with a two-digit number.
I got the output with the following script:
use strict;
use warnings;
open(FH, "di.txt");
my @file = <FH>;
close(FH);
my @arr2 = grep { $_ =~ m/^SAMPLE.+\d\d$/g } @file; ## this array prints
my @arr1 = grep { $_ =~ m/^DATA|^SAMPLE/g } @file;
print @arr1,"\n\t~~~~~~~~~~~\n\n",@arr2;
I first wrote it as
use strict;
use warnings;
open(FH, "di.txt");
my @file = <FH>;
close(FH);
my @arr1 = grep { $_ =~ m/^DATA|^SAMPLE/g } @file;
my @arr2 = grep { $_ =~ m/^SAMPLE.+\d\d$/g } @file; ## this doesn't print
print @arr1,"\n\t~~~~~~~~~~~\n\n",@arr2;
When I run this one, only @arr1 is printed.
What is the reason @arr2 doesn't print?
The problem is because of the behaviour of the global match /g option in scalar context
Every scalar variable has a marker that remembers where the most recent global match left off, and hence where the next one should start searching. It enables the use of the \G anchor in regex patterns, as well as while loops like this
my $s = 'aaabacad';
while ( $s =~ /a(.)/g ) {
print "$1 ";
}
which prints
a b c d
In truth you're not interested in a global match in this case; you just want to discover whether or not the pattern can be found in the string. The grep operator applies scalar context to its first parameter, so in using the /g option in this statement
my @arr1 = grep { $_ =~ m/^DATA|^SAMPLE/g } @file;
you have left every matching element of @file with the marker set to right after DATA or SAMPLE. That means the next match on the same element, m/^SAMPLE.+\d\d$/g, will start looking from there and clearly can't find the ^ anchor, so the match fails
The pos function gives you access to the marker, and you can fix your original code by resetting it to the start of the string after the first grep call. If you write this instead
my @arr1 = grep { $_ =~ m/^DATA|^SAMPLE/g } @file;
pos($_) = 0 for @file;
my @arr2 = grep { $_ =~ m/^SAMPLE.+\d\d$/g } @file; ## this now prints
then the output will be what you expected
The correct fix, however, is to write what you mean anyway, which means you should remove the /g option from the pattern matches. This code also works fine, and it's also more concise, more readable, and far less fragile
my @arr1 = grep /^DATA|^SAMPLE/, @file;
my @arr2 = grep /^SAMPLE.+\d\d$/, @file;
I have been doing this by hand and I just can't do it anymore-- I have thousands of lines and I think this is a job for sed or awk.
Essentially, we have a file like this:
A sentence X
A matching sentence Y
A sentence Z
A matching sentence N
This pattern continues for the entire file. I want to flip every sentence and matching sentence so the entire file will end up like:
A matching sentence Y
A sentence X
A matching sentence N
A sentence Z
Any tips?
edit: extending the initial problem
Dimitre Radoulov provided a great answer for the initial problem. This is an extension of the main problem-- some more details:
Let's say we have an organized file (due to the sed line Dimitre gave, the file is organized). However, now I want to organize the file alphabetically but only using the language (English) of the second line.
watashi
me
annyonghaseyo
hello
dobroye utro!
Good morning!
I would like to organize alphabetically via the English sentences (every 2nd sentence). Given the above input, this should be the output:
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
For the first part of the question, here is one way to swap every other line with each other in sed without using regular expressions:
sed -n 'h;n;p;g;p'
The -n command-line option suppresses automatic printing. Command h copies the current line from the pattern space to the hold space, n reads the next line into the pattern space, and p prints it; g copies the first line from the hold space back into the pattern space, and the final p prints it.
sed 'N;
s/\(.*\)\n\(.*\)/\2\
\1/' infile
N - append the next line of input into the pattern space
\(.*\)\n\(.*\) - save the matching parts of the pattern space
the one before and the one after the newline.
\2\
\1 - exchange the two lines (\1 is the first saved part,
\2 the second). Use escaped literal newline for portability
With some sed implementations you could use the escape sequence
\n: \2\n\1 instead.
First question:
awk '{x = $0; getline; print; print x}' filename
next question: sort by 2nd line
paste - - < filename | sort -f -t $'\t' -k 2 | tr '\t' '\n'
which outputs:
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
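For comparison, the two steps (swap each pair, then sort the pairs by their second line) can be sketched in Python. This is a minimal illustration, not a replacement for the shell pipeline: the function names swap_pairs and sort_pairs_by_second are hypothetical, casefold stands in for the case-insensitive sort -f, and the input is assumed to hold an even number of lines (a trailing unpaired line is kept by the swap and ignored by the sort).

```python
from typing import List

def swap_pairs(lines: List[str]) -> List[str]:
    """Swap line 1 with line 2, line 3 with line 4, and so on."""
    out: List[str] = []
    for i in range(0, len(lines) - 1, 2):
        out.extend([lines[i + 1], lines[i]])
    if len(lines) % 2:              # keep a trailing unpaired line as-is
        out.append(lines[-1])
    return out

def sort_pairs_by_second(lines: List[str]) -> List[str]:
    """Keep pairs together and order them by their second line,
    ignoring case. A trailing unpaired line is dropped."""
    pairs = [(lines[i], lines[i + 1]) for i in range(0, len(lines) - 1, 2)]
    pairs.sort(key=lambda pair: pair[1].casefold())
    return [line for pair in pairs for line in pair]

sentences = ["A sentence X", "A matching sentence Y",
             "A sentence Z", "A matching sentence N"]
print(swap_pairs(sentences))

words = ["watashi", "me", "annyonghaseyo", "hello",
         "dobroye utro!", "Good morning!"]
print("\n".join(sort_pairs_by_second(words)))
```

The second call prints the same six lines, in the same order, as the paste/sort/tr pipeline output shown above.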
Assuming an input file like this:
A sentence X
Z matching sentence Y
A sentence Z
B matching sentence N
A sentence Z
M matching sentence N
You could do both exchange and sort with Perl:
perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort keys %_;
}' infile
The output I get is:
% perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort keys %_;
}' infile
B matching sentence N
A sentence Z
M matching sentence N
A sentence Z
Z matching sentence Y
A sentence X
If you want to order by the first line (before the exchange):
perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile
So, if the original file looks like this:
% cat infile1
me
watashi
hello
annyonghaseyo
Good morning!
dobroye utro!
The output should look like this:
% perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile1
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
This version should handle duplicate records correctly:
perl -lne'
$_{ $_, $. } = $v unless $. % 2;
$v = $_;
END {
print substr( $_, 0, rindex( $_, $; ) ) , $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile
And another version, inspired by the solution posted by Glenn (record exchange included and assuming the pattern _ZZ_ is not present in the text file):
sed 'N;
s/\(.*\)\n\(.*\)/\1_ZZ_\2/' infile |
sort |
sed 's/\(.*\)_ZZ_\(.*\)/\2\
\1/'
I have a file and a list of string pairs which I get from another file. I need substitute the first string of the pair with the second one, and do this for each pair.
Is there a more efficient/simple way to do this (using Perl, grep, sed or other) than running a separate regexp substitution for each pair of values?
#! /usr/bin/perl
use warnings;
use strict;
my %replace = (
"foo" => "baz",
"bar" => "quux",
);
my $to_replace = qr/@{["(" .
join("|" => map quotemeta($_), keys %replace) .
")"]}/;
while (<DATA>) {
s/$to_replace/$replace{$1}/g;
print;
}
__DATA__
The food is under the bar in the barn.
The @{[...]} bit may look strange. It's a hack to interpolate generated content inside quote and quote-like operators. The result of the join goes inside the anonymous array-reference constructor [] and is immediately dereferenced thanks to @{}.
If all that seems too wonkish, it's the same as
my $search = join "|" => map quotemeta($_), keys %replace;
my $to_replace = qr/($search)/;
minus the temporary variable.
Note the use of quotemeta—thanks Ivan!—which escapes the first string of each pair so the regular-expression engine will treat them as literal strings.
Output:
The bazd is under the quux in the quuxn.
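The same single-pass technique (one alternation built from the escaped keys, with the replacement looked up per match) can be sketched in Python. This is a minimal illustration with a hypothetical function name, replace_all; sorting the keys longest first is an extra assumption added here so that a longer key wins over a shorter key that is its prefix.

```python
import re
from typing import Dict

def replace_all(text: str, table: Dict[str, str]) -> str:
    """Replace every key of table found in text with its value, in one pass."""
    if not table:
        return text
    # Longest keys first, so "foobar" is tried before "foo".
    keys = sorted(table, key=len, reverse=True)
    # re.escape makes each key match literally, like quotemeta in Perl.
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)

print(replace_all("The food is under the bar in the barn.",
                  {"foo": "baz", "bar": "quux"}))
# The bazd is under the quux in the quuxn.
```

Because the text is scanned once, replaced text is never scanned again, so swapping two strings (a to b and b to a) works as expected.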
Metaprogramming—that is, writing a program that writes another program—is also nice. The beginning looks familiar:
#! /usr/bin/perl
use warnings;
use strict;
use File::Compare;
die "Usage: $0 path ..\n" unless @ARGV >= 1;
# stub
my @pairs = (
["foo" => "baz"],
["bar" => "quux"],
['foo$bar' => 'potrzebie\\'],
);
Now we generate the program that does all the s/// replacements—but is quotemeta on the replacement side a good idea?—
my $code =
"sub { while (<>) { " .
join(" " => map "s/" . quotemeta($_->[0]) .
"/" . quotemeta($_->[1]) .
"/g;",
@pairs) .
"print; } }";
#print $code, "\n";
and compile it with eval:
my $replace = eval $code
or die "$0: eval: $@\n";
To do the replacements, we use Perl's ready-made in-place editing:
# set up in-place editing
$^I = ".bak";
my @save_argv = @ARGV;
$replace->();
Below is an extra nicety that restores backups that the File::Compare module judges to have been unnecessary:
# in-place editing is conservative: it creates backups
# regardless of whether it modifies the file
foreach my $new (@save_argv) {
my $old = $new . $^I;
if (compare($new, $old) == 0) {
rename $old => $new
or warn "$0: rename $old => $new: $!\n";
}
}
There are two ways, both of them require you to compile a regex alternation on the keys of the table:
my %table = qw<The A the a quick slow lazy dynamic brown pink . !>;
my $alt
= join( '|'
, map { quotemeta }
sort { ( length $b <=> length $a ) || $a cmp $b }
keys %table
)
;
my $keyword_regex = qr/($alt)/;
Then you can use this regex in a substitution:
my $text
= <<'END_TEXT';
The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.
END_TEXT
$text =~ s/$keyword_regex/$table{ $1 }/ge; # <- 'e' means execute code
Or you can do it in a loop:
use English qw<@LAST_MATCH_START @LAST_MATCH_END>;
while ( $text =~ /$keyword_regex/g ) {
my $key = $1;
my $rep = $table{ $key };
# use the 4-arg form
substr( $text, $LAST_MATCH_START[1]
, $LAST_MATCH_END[1] - $LAST_MATCH_START[1], $rep
);
# reset the position to start + new actual
pos( $text ) = $LAST_MATCH_START[1] + length $rep;
}
Build a hash of the pairs. Then split the target string into word tokens, and check each token against the keys in the hash. If it's present, replace it with the value of that key.
If eval is not a security concern:
eval $(awk 'BEGIN { printf "sed \047"} {printf "%s", "s/\\<" $1 "\\>/" $2 "/g;"} END{print "\047 substtemplate"}' substwords )
This constructs a long sed command consisting of multiple substitution commands. It's subject to potentially exceeding your maximum command line length. It expects the word pair file to consist of two words separated by whitespace on each line. Substitutions will be made for whole words only (no clbuttic substitutions).
It may choke if the word pair file contains characters that are significant to sed.
You can do it this way if your sed insists on -e:
eval $(awk 'BEGIN { printf "sed"} {printf "%s", " -e \047s/\\<" $1 "\\>/" $2 "/g\047"} END{print " substtemplate"}' substwords)