I have been doing this by hand and I just can't do it anymore-- I have thousands of lines and I think this is a job for sed or awk.
Essentially, we have a file like this:
A sentence X
A matching sentence Y
A sentence Z
A matching sentence N
This pattern continues for the entire file. I want to flip every sentence and matching sentence so the entire file will end up like:
A matching sentence Y
A sentence X
A matching sentence N
A sentence Z
Any tips?
edit: extending the initial problem
Dimitre Radoulov provided a great answer for the initial problem. This is an extension of the main problem-- some more details:
Let's say we have an organized file (thanks to the sed line Dimitre gave). However, now I want to sort the file alphabetically, using only the English text on the second line of each pair.
watashi
me
annyonghaseyo
hello
dobroye utro!
Good morning!
I would like to organize alphabetically via the English sentences (every 2nd sentence). Given the above input, this should be the output:
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
For the first part of the question, here is one way to swap every other line with each other in sed, without using regular expressions:
sed -n 'h;n;p;g;p'
The -n command line option suppresses the automatic printing. The h command copies the current line from the pattern space to the hold space; n reads the next line into the pattern space, and p prints it; g then copies the held first line back into the pattern space, and the final p prints it.
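For instance, run against the four-line sample from the question (file.txt is just a placeholder name):
$ sed -n 'h;n;p;g;p' file.txt
A matching sentence Y
A sentence X
A matching sentence N
A sentence Z
Note that with -n, an unpaired final line would be silently dropped by most seds, since n quits when there is no next input line.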
sed 'N;
s/\(.*\)\n\(.*\)/\2\
\1/' infile
N - append the next line of input to the pattern space
\(.*\)\n\(.*\) - capture the two parts of the pattern space:
the one before and the one after the newline
\2\
\1 - exchange the two lines (\1 is the first captured part,
\2 the second); the escaped literal newline in the replacement is used for portability
With some sed implementations you could use the escape sequence
\n instead: \2\n\1.
First question:
awk '{x = $0; getline; print; print x}' filename
next question: sort by 2nd line
paste - - < filename | sort -f -t $'\t' -k 2 | tr '\t' '\n'
which outputs:
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
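To see the mechanics of the paste trick, here is the intermediate result of the paste step alone (columns are tab-separated):
$ paste - - < filename
watashi	me
annyonghaseyo	hello
dobroye utro!	Good morning!
paste - - joins each pair of lines into one tab-separated line, sort -f -t $'\t' -k 2 sorts those lines case-insensitively on the English field, and tr '\t' '\n' splits the pairs back onto separate lines.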
Assuming an input file like this:
A sentence X
Z matching sentence Y
A sentence Z
B matching sentence N
A sentence Z
M matching sentence N
You could do both exchange and sort with Perl:
perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort keys %_;
}' infile
The output I get is:
% perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort keys %_;
}' infile
B matching sentence N
A sentence Z
M matching sentence N
A sentence Z
Z matching sentence Y
A sentence X
If you want to order by the first line (before the exchange):
perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile
So, if the original file looks like this:
% cat infile1
me
watashi
hello
annyonghaseyo
Good morning!
dobroye utro!
The output should look like this:
% perl -lne'
$_{ $_ } = $v unless $. % 2;
$v = $_;
END {
print $_, $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile1
dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me
This version should handle duplicate records correctly:
perl -lne'
$_{ $_, $. } = $v unless $. % 2;
$v = $_;
END {
print substr( $_, 0, index( $_, $; ) ), $/, $_{ $_ }
for sort {
$_{ $a } cmp $_{ $b }
} keys %_;
}' infile
And another version, inspired by the solution posted by Glenn (record exchange included and assuming the pattern _ZZ_ is not present in the text file):
sed 'N;
s/\(.*\)\n\(.*\)/\1_ZZ_\2/' infile |
sort |
sed 's/\(.*\)_ZZ_\(.*\)/\2\
\1/'
Related
I am writing an awk oneliner for this purpose:
file1:
1 apple
2 orange
4 pear
file2:
1/4/2/1
desired output: apple/pear/orange/apple
addendum: Missing numbers would be best kept unchanged, e.g. 1/4/2/3 = apple/pear/orange/3, to prevent loss of info.
Methodology:
Build an associative array key[$1] = $2 for file1
capture each run of characters between the slashes and replace it with the matching value from the associative array, e.g. key[4] = pear
Tried:
gawk 'NR==FNR { key[$1] = $2 }; NR>FNR { r = gensub(/(\w+)/, "key[\\1]" , "g"); print r}' file1.txt file2.txt
#gawk because need to use \w+ regex
#gensub used because need to use a capturing group
Unfortunately, the results are:
1/4/2/1
key[1]/key[4]/key[2]/key[1]
Any suggestions? Thank you.
You may use this awk:
awk -v OFS='/' 'NR==FNR {key[$1] = $2; next}
{for (i=1; i<=NF; ++i) if ($i in key) $i = key[$i]} 1' file1 FS='/' file2
apple/pear/orange/apple
Note that the ($i in key) check keeps a field unchanged when its number doesn't exist in the key array; without that check, assigning key[$i] for a missing key would make those fields empty.
file1 FS='/' file2 will keep the default field separators for file1 but will use / as the field separator while reading file2.
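A quick way to see that per-file switch in action (the FILENAME/NF printout is just an illustrative sketch, using the sample files):
awk '{print FILENAME, NF}' file1 FS='/' file2
file1 2
file1 2
file1 2
file2 4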
EDIT: In case a number in file2 has no match in file1 and you want to keep the original value as it is, then try the following:
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val=(val=="" ? "" : val FS) (($i in arr)?arr[$i]:$i)
}
print val
}
' file1 FS="/" file2
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val = (val=="" ? "" : val FS) arr[$i]
}
print val
}
' file1 FS="/" file2
Explanation: Read file1 first, creating array arr indexed by the 1st field with the 2nd field as value; then set the field separator to / for file2, traverse each field of file2, accumulate its translated value in val, and print val at the end of each line.
As @Sundeep mentions in the comments, you can't use a backreference as an array index. You could mix match and gensub (well, I'm using sub below). Not that this would be the recommended method anywhere, but just as an example:
$ awk '
NR==FNR {
k[$1]=$2 # hash them
next
}
{
while(match($0,/[0-9]+/)) # keep doing it while it lasts
sub(/[0-9]+/,k[substr($0,RSTART,RLENGTH)]) # replace here
}1' file1 file2
Output:
apple/pear/orange/apple
And of course, if you have k[1]="word1", you'll end up with a never-ending loop, since the replacement itself contains digits for match to find.
With perl (assuming key is always found):
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|$h{$&}|g; print }' f1 f2
apple/pear/orange/apple
if(!$#ARGV) to determine first file (assuming exactly two files passed)
$h{$F[0]}=$F[1] create hash based on first field as key and second field as value
[^/]+ match non / characters
$h{$&} get the value based on matched portion from the hash
If some keys aren't found, leave it as is:
$ cat f2
1/4/2/1/5
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|exists $h{$&} ? $h{$&} : $&|ge; print }' f1 f2
apple/pear/orange/apple/5
exists $h{$&} checks if the matched portion exists as key.
Another approach using awk, without a loop:
awk 'FNR==NR{
a[$1]=$2;
next
}
$1 in a{
printf("%s%s",FNR>1 ? RS: "",a[$1])
}
END{
print ""
}' f1 RS='/' f2
$ cat f1
1 apple
2 orange
4 pear
$ cat f2
1/4/2/1
$ awk 'FNR==NR{a[$1]=$2;next}$1 in a{printf("%s%s",FNR>1?RS:"",a[$1])}END{print ""}' f1 RS='/' f2
apple/pear/orange/apple
The script below works, but it requires a kludge. By "kludge" I mean a line of code which makes the script do what I want --- but I do not understand why the line is necessary. Evidently, I do not understand exactly what the multiline regex substitution, ending /mg, is doing.
Is there not a more elegant way to accomplish the task?
The script reads through a file by paragraphs. It partitions each paragraph into two subsets: $text and $cmnt. The $text includes the left part of every line, i.e., from the first column up to the first %, if it exists, or to end of the line if it doesn't. The $cmnt includes the rest.
Motivation: The files to be read are LaTeX markup, where % announces the beginning of a comment. We could change the value of $breaker to equal # if we were reading through a perl script. After separating $text from $cmnt, one could perform a match across lines such as
print "match" if ($text =~ /WOLF\s*DOG/s);
Please see the line labeled "kludge."
Without that line, something funny happens after the last % in a record. If there are lines of $text
(material not commented out by %) after the last commented line of the record, those lines are included both at the end of $cmnt and in $text.
In the example below, this means that without the kludge, in record 2, "cat lion" is included both in the $text, where it belongs, and also in $cmnt.
(The kludge causes an unnecessary % to appear at the end of every non-void $cmnt. This is because the kludge-pasted-on % announces a final, fictitious empty comment line.)
According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means
Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
Therefore, I would expect the 2nd match in
s/^([^$breaker]*)($breaker.*?)$/$2/mg
to start with the first %, to extend as far as the end of the line, and stop there. So even without the kludge, it should not include the "cat lion" in record 2? But obviously it does, so I am misreading, or missing, some part of the documentation. I suspect it has to do with the /g regex modifier?
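For what it's worth, here is a minimal demonstration (plain core Perl, with % hard-coded) that the substitution really does stop at each end-of-line; a line with no % simply fails to match and is carried through unchanged, which is exactly how uncommented trailing lines survive into $cmnt:
$ perl -e '$_ = "a %x\nplain\nb %y\n"; s/^([^%]*)(%.*?)$/$2/mg; print'
%x
plain
%y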
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/$breaker/; # kludge
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy
{
$cmnt = $_;
die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy
}
else
{
$cmnt = '';
}
print "\nRECORD $count_record:\n";
print "******** text==";
print "\n|";
print $text;
print "|\n";
print "******** cmnt==|";
print $cmnt;
print "|\n";
}
Example file to run it on:
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA
% what was that?
cat lion
no comments in this line
%The last paragraph of this file is nothing but a single-line comment.
You must also delete the lines that do not contain a comment from $cmnt:
use feature qw(say);
use strict;
use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy
{
$cmnt = $_;
$cmnt =~ s/^[^$breaker]*?$//mg;
die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy
}
else
{
$cmnt = '';
}
print "\nRECORD $count_record:\n";
print "******** text==";
print "\n|";
print $text;
print "|\n";
print "******** cmnt==|";
print $cmnt;
print "|\n";
}
Output:
RECORD 1:
******** text==
|dog wolf
DOG WOLF
DOG WOLLLLLLF
|
******** cmnt==|% flea
% FLEA
% FLLLLLLEA
|
RECORD 2:
******** text==
|
cat lion
|
******** cmnt==|% what was that?
|
RECORD 3:
******** text==
|no comments in this line
|
******** cmnt==||
RECORD 4:
******** text==
||
******** cmnt==|%The last paragraph of this file is nothing but a single-line comment.
|
My main source of confusion was a failure to distinguish between:
whether or not an entire record matches (here, a record is potentially a multi-line paragraph), and
whether or not a line inside a record matches.
The following script incorporates insights from both answers that others offered, and includes extensive explanation.
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<DATA>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
print "RECORD $count_record:";
print "\n|"; print $_; print "|\n";
# https://perldoc.perl.org/perlre.html#Modifiers
# the following regex:
# ^ /m: ^==start of line, not of record
# ([^$breaker]*) zero or more characters that are not $breaker
# ($breaker.*?) non-greedy: the first instance of $breaker, followed by everything after $breaker
# $ /m: $==end of line, not of record
# /g: "globally match the pattern repeatedly in the string"
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)
{
$cmnt = $_;
# In at least one line of this record, the pattern above has matched.
# But this does not mean every line matches. There may be any number of
# lines inside the record that do not match /$breaker/; for these lines,
# in spite of /g, there will be no match, and thus the exclusion of $1 and printing only of $2,
# in the substitution below, will not take place. Thus, those particular lines must be deleted from $cmnt.
# Thus:
$cmnt =~ s/^[^$breaker]*?$/\n/mg; # remove entire line if it does not match /$breaker/
# recall that /m guarantees that ^ and $ match the start and end of the line, not of the record.
die "code error: cmnt does not match this record" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);
if ( $text =~ /\S/ )
{
print "|text|==\n|$text|\n";
}
else
{
print "NO text found\n";
}
print "|cmnt|==\n|$cmnt|\n";
}
else
{
print "NO comment found\n";
}
}
__DATA__
one dogs% one comment %d**n lies %statistics
two %two comment
thuh-ree
fower
fi-yiv % (he means 5)
SIX 66 % ¿666==antichrist?
seven % the seventh seal, the seven days
ate
niner
ten
As Douglass said to Lincoln ...
%Darryl Pinckney
The regular expression modifiers /mg assume that the string being matched may span multiple lines (i.e., contain \n characters) and instruct the regex engine to apply the pattern repeatedly, across all lines in the string.
Please study the following code, which should simplify the solution to your problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $breaker = '%';
my @records = do { local $/ = ''; <DATA> };
for ( @records ) {
my %hash = ( /(.*?)$breaker(.*)/mg );
next unless %hash;
say Dumper(\%hash);
}
__DATA__
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA
% what was that?
cat lion
no comments in this line
%The last paragraph of this file is nothing but a single-line comment.
Output
$VAR1 = {
'DOG WOLF ' => ' FLEA ',
'dog wolf ' => ' flea ',
'DOG WOLLLLLLF ' => ' FLLLLLLEA '
};
$VAR1 = {
'' => ' what was that?'
};
$VAR1 = {
'' => 'The last paragraph of this file is nothing but a single-line comment.'
};
I have a text file in this format:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375 Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375 aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375 abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Here I call the first string before the first space the word (for example abacası).
The string which starts after the first space and ends with a number is a definition (for example Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875).
I want to do this: if a line includes more than one definition (the first line has one, the second line has two, the third line has three), insert a newline before each additional definition and put the first string (the word) at the beginning of each new line. Expected output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
I have almost 1,500,000 lines in my text file, and the number of definitions per line is not fixed; it can be anywhere from 1 to 5.
A small Python script does the job. Input is expected in input.txt; output goes to output.txt.
import re

# the word: first whitespace-delimited token, captured with its trailing space
rf = re.compile(r'([^\s]+\s).+')
# one definition: "key : floating-point-score"
r = re.compile(r'([^\s]+\s:\s\d+\.\d+)')

with open("input.txt", "r") as f:
    text = f.read()

with open("output.txt", "w") as f:
    for l in text.split('\n'):
        offset = 0
        first = ""
        match = re.search(rf, l[offset:])
        if match:
            first = match.group(1)
            offset = len(first)
            while True:
                match = re.search(r, l[offset:])
                if not match:
                    break
                s = match.group(1)
                offset += len(s)
                f.write(first + s + "\n")  # "first" already ends with a space
I am assuming the following format:
word definitionkey : definitionvalue [definitionkey : definitionvalue …]
None of those elements may contain a space and they are always delimited by a single space.
The following code should work:
awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }' file
Explanation (this is the same code but with comments and more spaces):
awk '
# match any line
{
# iterate over each "key : value"
for (i=2; i<=NF; i+=3)
print $1, $i, $(i+1), $(i+2) # prints each "word key : value"
}
' file
awk has some tricks that you may not be familiar with. It works on a line-by-line basis. Each stanza has an optional conditional before it (awk 'NF >= 4 {…}' would make sense here, since a line with fewer than four fields would produce incomplete output). NF is the number of fields and a dollar sign ($) indicates we want the value of the given field, so $1 is the value of the first field, $NF is the value of the last field, and $(i+1) is the value of the third field (assuming i=2). print will default to using spaces between its arguments and adds a line break at the end (otherwise, we'd need printf "%s %s %s %s\n", $1, $i, $(i+1), $(i+2), which is a bit harder to read).
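As a concrete illustration of the field arithmetic (the echoed line is made up for the demo):
$ echo 'word k1 : 1.5 k2 : 2.5' | awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }'
word k1 : 1.5
word k2 : 2.5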
With perl:
perl -a -F'[^]:]\K\h' -ne 'chomp(@F);$p=shift(@F);print "$p ",shift(@F),"\n" while(@F);' yourfile.txt
With bash:
while read -r line
do
pre=${line%% *}
echo "$line" | sed 's/\([0-9]\) /\1\n'$pre' /g'
done < "yourfile.txt"
This script reads the file line by line. For each line, the prefix is extracted with a parameter expansion (everything up to the first space), and each space preceded by a digit is replaced with a newline plus the prefix, using sed.
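The parameter expansion on its own (a toy value, just to show the "everything before the first space" behaviour):
$ line='abacı Aba[Noun] : 16.30'
$ echo "${line%% *}"
abacı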
edit: as tripleee suggested, it's much faster to do it all with sed:
sed -i.bak ':a;s/^\(\([^ ]*\).*[0-9]\) /\1\n\2 /;ta' yourfile.txt
Assuming there are always 4 space-separated words for each definition:
awk '{for (i=1; i<NF; i+=4) print $i, $(i+1), $(i+2), $(i+3)}' file
Or if the split should occur after that floating point number
perl -pe 's/\b\d+\.\d+\K\s+(?=\S)/\n/g' file
(This is the perl equivalent of Avinash's answer)
Bash and grep:
#!/bin/bash
while IFS=' ' read -r in1 in2 in3 in4; do
if [[ -n $in4 ]]; then
prepend="$in1"
echo "$in1 $in2 $in3 $in4"
else
echo "$prepend $in1 $in2 $in3"
fi
done < <(grep -o '[[:alnum:]][^:]\+ : [[:digit:].]\+' "$1")
The output of grep -o puts every definition on a separate line, but definitions after the first one from the same input line are missing the "word" at the beginning:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
The read loop now loops over this, using a space as the input field separator. If in4 is a zero-length string, we're on a line where the "word" is missing, so we prepend it.
The script takes the input file name as its argument, and saving output to an output file can be done with simple redirection:
./script inputfile > outputfile
Using perl:
$ perl -nE 'm/([^ ]*) (.*)/; my $word=$1; $_=$2; say $word . " " . $_ for / *(.*?[0-9]+\.[0-9]+)/g;' < input.log
Output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Explanation:
Split the line to separate the first field as word.
Then split the remaining line using the regex .*?[0-9]+\.[0-9]+.
Print word concatenated with every match of the above regex.
I would approach this with one of the excellent Awk answers here; but I'm posting a Python solution to point to some oddities and problems with the currently accepted answer:
It reads the entire input file into memory before processing it. This is harmless for small inputs, but the OP mentions that the real-world input is kind of big.
It needlessly uses re when simple whitespace tokenization appears to be sufficient.
I would also prefer a tool which prints to standard output, so that I can redirect it where I want it from the shell; but to keep this compatible with the earlier solution, this hard-codes output.txt as the destination file.
with open('input.txt', 'r') as input:
    with open('output.txt', 'w') as output:
        for line in input:
            tokens = line.rstrip().split()
            if not tokens:  # skip blank lines
                continue
            word = tokens[0]
            # step through the remaining tokens three at a time: key, colon, value
            for idx in range(1, len(tokens), 3):
                print(word, ' '.join(tokens[idx:idx+3]), file=output)
If you really, really wanted to do this in pure Bash, I suppose you could:
while read -r word analyses; do
set -- $analyses
while [ $# -gt 0 ]; do
printf "%s %s %s %s\n" "$word" "$1" "$2" "$3"
shift; shift; shift
done
done <input.txt >output.txt
Please find the following bash code:
#!/bin/bash
# read.sh
while read variable
do
for i in "$variable"
do
var=`echo "$i" |wc -w`
array_1=( $i )
counter=0
for((j=1 ; j < $var ; j++))
do
if [ $counter = 0 ] #1
then
echo -ne ${array_1[0]}' '
fi #1
echo -ne ${array_1[$j]}' '
counter=$(expr $counter + 1)
if [ $counter = 3 ] #2
then
counter=0
echo
fi #2
done
done
done
I have tested and it is working.
To test
On bash shell prompt give the following command
$ ./read.sh < input.txt > output.txt
where read.sh is the script, input.txt is the input file, and output.txt is where the output is generated
Here is a sed in action:
sed -r '/^indirger(ken|di)/{s/([0-9]+[.][0-9]+ )(indirge)/\1\n\2/g}' my_file
output
indirgerdi indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]+YDH[Past] : 22.2626953125
indirge[Verb]+[Pos]+Hr[Aor]+YDH[Past]+[A3sg] : 18.720703125
indirgerken indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]-Yken[Adv+While] : 19.6201171875
Is there any grep/sed option which will allow me to match a pattern after matching another pattern? For example, input file (the foos are variable patterns starting with 0, mixed with random numbers preceded by #):
0foo1
0foo2
0foo3
\#89888
0foo4
0foo5
\#98980
0foo6
So when I search for a variable pattern (e.g. foo2), I also want to match the next pattern of the other kind (e.g. #number) that follows that pattern's line, in this case #89888.
Therefore output for variable foo2 must be:
foo2 #89888
For variable foo5:
foo5 #98980
The foos may consist of any characters, including ones that would be considered regex metacharacters.
I tried a basic regex match script in tcl which first searches for foo* and then searches for the next immediate #, but since I am working with a very large file, it will take days to finish. Any help is appreciated.
A Perl one-liner to slurp the whole file and match across any newlines for the pattern you seek would look like:
perl -000 -nle 'm{(foo2).*(\#89888)}s and print join " ",$1,$2' file
The -000 switch puts Perl into paragraph mode; since the sample input contains no blank lines, the whole file is read as one large string (use -0777 to slurp a file unconditionally). The s modifier lets . match any character, including a newline.
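If the input could contain blank lines, here is a sketch of the same idea using the unconditional slurp switch -0777 and a generic #number pattern (non-greedy, so the nearest number after the match wins):
perl -0777 -nle 'm{(foo2).*?(\#[0-9]+)}s and print join " ", $1, $2' file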
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my ( %matches, $recent_foo );
while(<DATA>)
{
chomp;
( $matches{$recent_foo} ) = $1 if m/(\\#\d+)/;
( $recent_foo ) = $1 if m/(0foo\d+)/;
}
print Dumper( \%matches );
__DATA__
0foo1
0foo2
0foo3
\#89888
0foo4
0foo5
\#98980
0foo6
./perl
$VAR1 = {
'0foo5' => '\\#98980',
'0foo3' => '\\#89888'
};
If what you want is 0foo1, 0foo2 and 0foo3 to all have the same value the following will do:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my ( %matches, @recent_foo );
while(<DATA>)
{
chomp;
if (/^\\#/)
{
@matches{@recent_foo} = ($') x @recent_foo;
undef @recent_foo;
}
elsif (/^0/)
{
push @recent_foo, $';
}
}
print Dumper( \%matches );
__DATA__
0foo1
0foo2
0foo3
\#89888
0foo4
0foo5
\#98980
0foo6
gives:
$VAR1 = {
'foo2' => '89888',
'foo1' => '89888',
'foo5' => '98980',
'foo3' => '89888',
'foo4' => '98980'
};
Var='foo2'
sed "#n
/${Var}/,/#[0-9]\{1,\}/ {
H
/#[0-9]\{1,\}/ !d
s/.*//;x
s/.//;s/\n.*\\n/ /p
q
}" YourFile
The request is not entirely clear. This takes the lines from the first occurrence of your pattern (foo2) through the first #number line, deletes the lines in between, prints both of those lines as one, then quits (nothing else is extracted).
A Tcl solution. The procedure runs in a little over 3 microseconds, so you'll need very large data files to have it run for days. If more than one token matches, the first match is used (it's easy to rewrite the procedure to return all matches).
set data {
0foo1
0foo2
0foo3
\#89888
0foo4
0foo5
\#98980
0foo6
}
proc find {data pattern} {
set idx [lsearch -regexp $data $pattern]
if {$idx >= 0} {
lrange $data $idx $idx+1
}
}
find $data 0foo3
# -> 0foo3 #89888
find $data 0f.*5
# -> 0foo5 #98980
Documentation: if, lrange, lsearch, proc, set
sed
sed -n '/foo2/,/#[0-9]\+/ {s/^[[:space:]]*[0\\]//; p}' file |
sed -n '1p; $p' |
paste -s
The first sed prints all the lines between the first pattern and the 2nd, removing optional leading whitespace and the leading 0 or \.
The second sed extracts only the first and last lines.
The paste command prints the 2 lines as a single line, separated with a tab.
awk
awk -v p1=foo5 '
$0 ~ p1 {found = 1}
found && /#[0-9]+/ { sub(/^\\/, ""); print p1, $0; exit }
' file
tcl
lassign $argv filename pattern1
set found false
set fid [open $filename r]
while {[gets $fid line] != -1} {
if {[string match "*$pattern1*" $line]} {
set found true
}
if {$found && [regexp {#\d+} $line number]} {
puts "$pattern1 $number"
break
}
}
close $fid
Then
$ tclsh 2patt.tcl file foo4
foo4 #98980
Is this what you want?
$ awk -v tgt="foo2" 'index($0,tgt){f=1} f&&/#[0-9]/{print tgt, $0; exit}' file
foo2 \#89888
$ awk -v tgt="foo5" 'index($0,tgt){f=1} f&&/#[0-9]/{print tgt, $0; exit}' file
foo5 \#98980
I'm using index() above as it searches for a string not a regexp and so could not care less what RE metacharacters are in foo - they are all just literal characters in a string.
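Here is the difference in one line (the f.o2 value is contrived for the demo): a regexp comparison treats the dot as "any character", while index() looks for the literal string:
$ echo '0fXo2' | awk -v tgt="f.o2" '{ print ($0 ~ tgt), index($0, tgt) }'
1 0
The ~ match succeeds (f.o2 matches fXo2), while index() returns 0, correctly reporting that the literal string f.o2 is absent.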
It's not clear from your question if you want to find a specific number after a specific foo or the first number after foo2 or even if you want to search for a specific foo value or all "foo"s or...
I have a file and a list of string pairs which I get from another file. I need to substitute the first string of each pair with the second one, and do this for every pair.
Is there a more efficient/simpler way to do this (using Perl, grep, sed, or another tool) than running a separate regexp substitution for each pair of values?
#! /usr/bin/perl
use warnings;
use strict;
my %replace = (
"foo" => "baz",
"bar" => "quux",
);
my $to_replace = qr/@{["(" .
join("|" => map quotemeta($_), keys %replace) .
")"]}/;
while (<DATA>) {
s/$to_replace/$replace{$1}/g;
print;
}
__DATA__
The food is under the bar in the barn.
The @{[...]} bit may look strange. It's a hack to interpolate generated content inside quote and quote-like operators. The result of the join goes inside the anonymous array-reference constructor [] and is immediately dereferenced thanks to @{}.
If all that seems too wonkish, it's the same as
my $search = join "|" => map quotemeta($_), keys %replace;
my $to_replace = qr/($search)/;
minus the temporary variable.
Note the use of quotemeta—thanks Ivan!—which escapes the first string of each pair so the regular-expression engine will treat them as literal strings.
Output:
The bazd is under the quux in the quuxn.
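To see what quotemeta buys you (a throwaway example string):
$ perl -e 'print quotemeta(q{foo$bar.baz}), "\n"'
foo\$bar\.baz
Each non-word character gets a protective backslash, so $ and . lose their regex meaning.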
Metaprogramming—that is, writing a program that writes another program—is also nice. The beginning looks familiar:
#! /usr/bin/perl
use warnings;
use strict;
use File::Compare;
die "Usage: $0 path ..\n" unless #ARGV >= 1;
# stub
my @pairs = (
["foo" => "baz"],
["bar" => "quux"],
['foo$bar' => 'potrzebie\\'],
);
Now we generate the program that does all the s/// replacements—but is quotemeta on the replacement side a good idea?—
my $code =
"sub { while (<>) { " .
join(" " => map "s/" . quotemeta($_->[0]) .
"/" . quotemeta($_->[1]) .
"/g;",
@pairs) .
"print; } }";
#print $code, "\n";
and compile it with eval:
my $replace = eval $code
or die "$0: eval: $@\n";
To do the replacements, we use Perl's ready-made in-place editing:
# set up in-place editing
$^I = ".bak";
my @save_argv = @ARGV;
$replace->();
Below is an extra nicety that restores backups that the File::Compare module judges to have been unnecessary:
# in-place editing is conservative: it creates backups
# regardless of whether it modifies the file
foreach my $new (@save_argv) {
my $old = $new . $^I;
if (compare($new, $old) == 0) {
rename $old => $new
or warn "$0: rename $old => $new: $!\n";
}
}
There are two ways, both of them require you to compile a regex alternation on the keys of the table:
my %table = qw<The A the a quick slow lazy dynamic brown pink . !>;
my $alt
= join( '|'
, map { quotemeta }
sort { ( length $b <=> length $a ) || $a cmp $b }
keys %table
)
;
my $keyword_regex = qr/($alt)/;
Then you can use this regex in a substitution:
my $text
= <<'END_TEXT';
The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.
END_TEXT
$text =~ s/$keyword_regex/$table{ $1 }/ge; # <- 'e' means execute code
Or you can do it in a loop:
use English qw<@LAST_MATCH_START @LAST_MATCH_END>;
while ( $text =~ /$keyword_regex/g ) {
my $key = $1;
my $rep = $table{ $key };
# use the 4-arg form
substr( $text, $LAST_MATCH_START[1]
, $LAST_MATCH_END[1] - $LAST_MATCH_START[1], $rep
);
# reset the position to start + new actual
pos( $text ) = $LAST_MATCH_START[1] + length $rep;
}
Build a hash of the pairs. Then split the target string into word tokens, and check each token against the keys in the hash. If it's present, replace it with the value of that key.
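That token-wise approach is simple to sketch in awk (pairs.txt and target.txt are hypothetical file names; it assumes tokens are whitespace-delimited, so punctuation glued to a word will defeat the lookup):
awk 'NR==FNR { rep[$1] = $2; next }
{ for (i = 1; i <= NF; i++) if ($i in rep) $i = rep[$i]; print }' pairs.txt target.txt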
If eval is not a security concern:
eval $(awk 'BEGIN { printf "sed \047"} {printf "%s", "s/\\<" $1 "\\>/" $2 "/g;"} END{print "\047 substtemplate"}' substwords )
This constructs a long sed command consisting of multiple substitution commands. It's subject to potentially exceeding your maximum command line length. It expects the word pair file to consist of two words separated by whitespace on each line. Substitutions will be made for whole words only (no clbuttic substitutions).
It may choke if the word pair file contains characters that are significant to sed.
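For a substwords file containing the two lines "foo baz" and "bar quux", the command handed to eval would look like this (a sketch; substtemplate is the file being edited, as in the answer):
sed 's/\<foo\>/baz/g;s/\<bar\>/quux/g;' substtemplate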
You can do it this way if your sed insists on -e:
eval $(awk 'BEGIN { printf "sed"} {printf "%s", " -e \047s/\\<" $1 "\\>/" $2 "/g\047"} END{print " substtemplate"}' substwords)