Print several lines between patterns (first pattern not unique) - regex

Need help with sed/awk/grep/whatever could solve my task.
I have a large file and I need to extract multiple sequential lines from it.
I have start pattern: <DN>
and end pattern: </GR>
and several lines in between, like this:
<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>
I've tried this:
sed -n '/\<DN\>/,/\<\/GR\>/p'
and several other ones (using awk and sed).
It works, but the problem is that the source file may contain lines starting with <DN> that are never followed by a closing </GR>; such a garbage block is later followed by a normal block with a proper ending:
<DN>234</DN> - unneeded DN
<AB>sdfsd</AB>
<DC>456456</DC>
<EF>6575675 sdfsd</EF>
....really large piece of unwanted text here....
<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>
How can I extract only the needed lines and ignore garbage pieces of the log containing <DN> without a closing </GR>?
Next, I need to convert the multiline pieces from <DN> to </GR> into a file of single lines, each starting with <DN> and ending with </GR>.
Any help would be appreciated. I'm stuck.

This might work for you (GNU sed):
sed -n '/<DN>/{h;b};x;/./G;x;/<\/GR/{x;/./p;z;x}' file
Use the hold space to accumulate the lines between <DN> and </GR>: a new <DN> overwrites anything saved so far, and the saved block is printed only when a </GR> line arrives.
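A quick way to see this behave, using sample data modeled on the question (a stray <DN> block that never closes, then a complete block; requires GNU sed for the z command):

```shell
printf '%s\n' '<DN>1</DN>' '<AB>junk</AB>' '<DN>2</DN>' '<RAC>3</RAC>' '<GR>4</GR>' |
sed -n '/<DN>/{h;b};x;/./G;x;/<\/GR/{x;/./p;z;x}'
# Only the complete <DN>2...</GR> block is printed; the stray block is dropped.
```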

awk '
# Lines that start with <DN> begin our matching.
/^<DN>/ {
    # If we saw a start without a matching end, throw away everything saved.
    if (dn) {
        d=""
    }
    # Mark being in a <DN> element.
    dn=1
    # Save the current line.
    d=$0
    next
}
# Lines that end with </GR> end our matching (but only if we are currently in a match).
dn && /<\/GR>$/ {
    # We are not in a <DN> element anymore.
    dn=0
    # Print out the lines we saved plus the current line.
    printf "%s%s%s\n", d, OFS, $0
    # Reset our saved contents.
    d=""
    next
}
# If we are in a <DN> element and have saved contents, append the current line (separated by OFS).
dn && d {
    d=d OFS $0
}
' file
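This script also answers the second half of the question, since it collapses each complete block onto one OFS-separated line. A quick run on sample data with a stray unclosed <DN> block in front:

```shell
printf '%s\n' '<DN>1</DN>' '<AB>junk</AB>' '<DN>2</DN>' '<BR>3</BR>' '<GR>4</GR>' |
awk '
/^<DN>/ { if (dn) d=""; dn=1; d=$0; next }
dn && /<\/GR>$/ { dn=0; printf "%s%s%s\n", d, OFS, $0; d=""; next }
dn && d { d=d OFS $0 }
'
# The stray <DN>1 block is discarded; the complete block prints as one line.
```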

awk '
/^<DN>/ { n = 1 }
n { lines[n++] = $0 }
n && /<\/GR>$/ {
    for (i=1; i<n; i++) printf "%s", lines[i]
    print ""
    n = 0
}
' file

with bash:
fun ()
{
local line output;
while IFS= read -r line; do
if [[ $line =~ ^'<DN>' ]]; then
output=$line;
else
if [[ -n $output ]]; then
output=$output$'\n'$line;
if [[ $line =~ '</GR>'$ ]]; then
echo "$output";
output=;
fi;
fi;
fi;
done
}
fun <file
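The function above, run against sample input containing a stray unclosed <DN> block, prints only the complete block (redefined here so the snippet is self-contained):

```shell
fun ()
{
    local line output
    while IFS= read -r line; do
        if [[ $line =~ ^'<DN>' ]]; then
            output=$line                     # a new <DN> discards anything pending
        elif [[ -n $output ]]; then
            output=$output$'\n'$line         # accumulate lines inside a block
            if [[ $line =~ '</GR>'$ ]]; then
                echo "$output"               # complete block: print and reset
                output=
            fi
        fi
    done
}
printf '%s\n' '<DN>1</DN>' '<AB>junk</AB>' '<DN>2</DN>' '<BR>3</BR>' '<GR>4</GR>' | fun
```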

You could use the pcregrep tool for this.
$ pcregrep -o -M '(?s)(?<=^|\s)<DN>(?:(?!<DN>).)*?</GR>(?=\n|$)' file
<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>

Related

Replace a block of text

I have a file in this pattern:
Some text
---
## [Unreleased]
More text here
I need to replace the text between '---' and '## [Unreleased]' with something else in a shell script.
How can it be achieved using sed or awk?
Perl to the rescue!
perl -lne 'my @replacement = ("First line", "Second line");
    if ($p = (/^---$/ .. /^## \[Unreleased\]/)) {
        print $replacement[$p-1];
    } else { print }'
The flip-flop operator .. tells you whether you're between the two strings, moreover, it returns the line number relative to the range.
This might work for you (GNU sed):
sed '/^---/,/^## \[Unreleased\]/c\something else' file
Change the lines between two regexp to the required string.
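For a multi-line replacement, GNU sed's c command also accepts \n escapes in the text (a sketch on the question's sample):

```shell
printf '%s\n' 'Some text' '---' '## [Unreleased]' 'More text here' |
sed '/^---$/,/^## \[Unreleased\]$/c\First line\nSecond line'
# the whole range is replaced once, by two lines
```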
This example may help you.
$ cat f
Some text
---
## [Unreleased]
More text here
$ seq 1 5 >mydata.txt
$ cat mydata.txt
1
2
3
4
5
$ awk '/^---/{f=1; while(getline < c)print;close(c);next}/^## \[Unreleased\]/{f=0;next}!f' c="mydata.txt" f
Some text
1
2
3
4
5
More text here
awk -v RS="\0" 'gsub(/---\n\n## \[Unreleased\]\n/,"something")+1' file
give this line a try.
An awk solution that:
is portable (POSIX-compliant).
can deal with any number of lines between the start line and the end line of the block, and potentially with multiple blocks (although they'd all be replaced with the same text).
reads the file line by line (as opposed to reading the entire file at once).
awk -v new='something else' '
/^---$/ { f=1; next } # Block start: set flag, skip line
f && /^## \[Unreleased\]$/ { f=0; print new; next } # Block end: unset flag, print new txt
! f # Print line, if before or after block
' file
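A quick check of this script against the sample from the question:

```shell
printf '%s\n' 'Some text' '---' '## [Unreleased]' 'More text here' |
awk -v new='something else' '
  /^---$/ { f=1; next }
  f && /^## \[Unreleased\]$/ { f=0; print new; next }
  ! f
'
```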

Unify lines that contains same patterns

I have a database with this structure:
word1#element1.1#element1.2#element1.3#...
word2#element2.1#element2.2#element2.3#...
...
...
I would like to unify the elements of 2 or more lines every time the word at the beginning is the same.
Example:
...
word8#element8.1#element8.2#element8.3#...
word9#element9.1#element9.2#element9.3#...
...
Now, lets suppose word8=word9, this is the result:
...
word8#element8.1#element8.2#element8.3#...#element9.1#element9.2#element9.3#...
...
I tried with the command sed:
I match 2 lines at a time with N
Memorize the first word of the first line: ^\([^#]*\) (all the characters except '#')
Memorize all the other elements of the first line: \([^\n]*\)
Check if the same word is present in the second line (after \n): \1
If so, I just take out the newline char and the first word of the second line: \1#\2
This is the complete code:
sed 'N;s/^\([^#]*\)#\([^\n]*\)\n\1/\1#\2/' database
I would like to understand why it's not working and how I can solve that problem.
Thank you very much in advance.
This might work for you (GNU sed):
sed 'N;s/^\(\([^#]*#\).*\)\n\2/\1#/;P;D' file
Read 2 lines at all times and remove the line feed and the matching portion of the second line (reinstating the #) if the words at the beginning of those 2 lines match.
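A quick run on data shaped like the question's example (adjacent duplicate first fields; requires GNU sed, which prints the pattern space when N hits end of input):

```shell
printf '%s\n' 'word1#a' 'word8#b' 'word8#c' 'word9#d' |
sed 'N;s/^\(\([^#]*#\).*\)\n\2/\1#/;P;D'
```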
sed '#n
H
$ { x
:cycle
s/\(\n\)\([^#]*#\)\([^[:cntrl:]]*\)\1\2/\1\2\3#/g
t cycle
s/.//
p
}' YourFile
Assuming the words are sorted:
load the whole file into the buffer (the code could be adapted to hold only a few lines at a time if the file is too big)
at the end, move the hold buffer contents into the working buffer
remove the newline and the first word of any line whose previous line starts with the same word (and add a # as separator)
if a substitution occurred, try again
if not, remove the first character (a newline left over from the loading process)
print
You can try with perl. It reads the input file line by line, splits on the first # character and uses a hash of arrays, saving the first word as key and appending the rest of the line as value. In the END block it sorts by the first word and joins the lines:
perl -lne '
    ($key, $line) = split /#/, $_, 2;
    push @{$hash{$key}}, $line;
    END {
        for $k ( sort keys %hash ) {
            printf qq|%s#%s\n|, $k, join q|#|, @{$hash{$k}};
        }
    }
' infile
$ cat file
word1#element1.1#element1.2#element1.3
word2#element2.1#element2.2#element2.3
word8#element8.1#element8.2#element8.3
word8#element9.1#element9.2#element9.3
word9#element9.1#element9.2#element9.3
$ awk 'BEGIN{FS=OFS="#"}
NR>1 && $1!=prev { print "" }
$1==prev { sub(/^[^#]+/,"") }
{ printf "%s",$0; prev=$1 }
END { print "" }
' file
word1#element1.1#element1.2#element1.3
word2#element2.1#element2.2#element2.3
word8#element8.1#element8.2#element8.3#element9.1#element9.2#element9.3
word9#element9.1#element9.2#element9.3
Using text replacements:
perl -p0E 'while( s/(^|\n)(.+?#)(.*)\n\2(.*)/$1$2$3#$4/ ){}' yourfile
or indented:
perl -p0E 'while( # while we can
    s/(^|\n) # substitute \n
      (.+?\#) (.*) \n # id elems1
      \2 (.*) # id elems2
    /$1$2$3#$4/x # id elems1 # elems2
    ){}'
thanks: @birei

Remove matching and previous line

I need to remove a line containing "not a dynamic executable" and the previous line from a stream using grep, awk, sed or something else. My current working solution is to tr the entire stream to strip off newlines, then replace the newline preceding my match using sed, then use tr to add the newlines back in, and then use grep -v. I'm somewhat wary of artifacts with this approach, but I don't see how else to do it at the moment:
tr '\n' '|' | sed 's/|\tnot a dynamic executable/__MY_REMOVE/g' | tr '|' '\n'
EDIT:
Input is a list of mixed files piped to xargs ldd; basically I want to ignore all output about non-library files since that has nothing to do with what I'm doing next. I didn't want to use a lib*.so mask since that could conceivably be different.
Most simply with pcregrep in multi-line mode:
pcregrep -vM '\n\tnot a dynamic executable' filename
If pcregrep is not available to you, then awk or sed can also do this by reading one line ahead and skipping the printing of previous lines when a marker line appears.
You could be boring (and sensible) with awk:
awk '/^\tnot a dynamic executable/ { flag = 1; next } !flag && NR > 1 { print lastline; } { flag = 0; lastline = $0 } END { if(!flag) print }' filename
That is:
/^\tnot a dynamic executable/ { # in lines that start with the marker
flag = 1 # set a flag
next # and do nothing (do not print the last line)
}
!flag && NR > 1 { # if the last line was not flagged and
# is not the first line
print lastline # print it
}
{ # and if you got this far,
flag = 0 # unset the flag
lastline = $0 # and remember the line to be possibly
# printed.
}
END { # in the end
if(!flag) print # print the last line if it was not flagged
}
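Run against mock ldd-style output (tab-indented marker line, as ldd prints it), the script drops the file name line together with the marker:

```shell
printf 'foo:\n\tnot a dynamic executable\nlibx.so:\n\tlinux-vdso.so.1\n' |
awk '/^\tnot a dynamic executable/ { flag = 1; next }
     !flag && NR > 1 { print lastline }
     { flag = 0; lastline = $0 }
     END { if (!flag) print }'
```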
But sed is fun:
sed ':a; $! { N; /\n\tnot a dynamic executable/ d; P; s/.*\n//; ba }' filename
Explanation:
:a # jump label
$! { # unless we reached the end of the input:
N # fetch the next line, append it
/\n\tnot a dynamic executable/ d # if the result contains a newline followed
# by "\tnot a dynamic executable", discard
# the pattern space and start at the top
# with the next line. This effectively
# removes the matching line and the one
# before it from the output.
# Otherwise:
P # print the pattern space up to the newline
s/.*\n// # remove the stuff we just printed from
# the pattern space, so that only the
# second line is in it
ba # and go to a
}
# and at the end, drop off here to print
# the last line (unless it was discarded).
Or, if the file is small enough to be completely stored in memory:
sed ':a;$!{N;ba}; s/[^\n]*\n\tnot a dynamic executable[^\n]*\n//g' filename
Where
:a;$!{ N; ba }                                   # read the whole file into
                                                 # the pattern space
s/[^\n]*\n\tnot a dynamic executable[^\n]*\n//g  # and cut out the offending bit.
This might work for you (GNU sed):
sed 'N;/\n.*not a dynamic executable/d;P;D' file
This keeps a moving window of 2 lines and deletes both if the marker string is found in the second. If not, the first line is printed and deleted, the next line is appended, and the process repeats.
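The same mock ldd-style input as above shows the window in action (GNU sed assumed, since it prints the pattern space when N hits end of input):

```shell
printf 'foo:\n\tnot a dynamic executable\nlibx.so:\n\tlinux-vdso.so.1\n' |
sed 'N;/\n.*not a dynamic executable/d;P;D'
# foo: and its marker line are deleted as a pair
```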
Always keep in mind that while grep and sed are line-oriented awk is record-oriented and so can easily handle problems that span multiple lines.
It's a guess given you didn't post any sample input and expected output but it sounds like all you need is (using GNU awk for multi-char RS):
awk -v RS='^$' -v ORS= '{gsub(/[^\n]+\n\tnot a dynamic executable/,"")}1' file

Remove \n newline if string contains keyword

I'd like to know if I can remove a \n (newline) only if the current line has one or more keywords from a list; for instance, I want to remove the \n if the line contains the word hello or world.
Example:
this is an original
file with lines
containing words like hello
and world
this is the end of the file
And the result would be:
this is an original
file with lines
containing words like hello and world this is the end of the file
I'd like to use sed, or awk and, if needed, grep, wc or whatever commands work for this purpose. I want to be able to do this on a lot of files.
Using awk you can do:
awk '/hello|world/{printf "%s ", $0; next} 1' file
this is an original
file with lines
containing words like hello and world this is the end of the file
Here is a simple one using sed:
sed -r ':a;$!{N;ba};s/((hello|world)[^\n]*)\n/\1 /g' file
Explanation
:a;$!{N;ba} reads the whole file into the pattern space, like this: this is an original\nfile with lines\ncontaining words like hello\nand world\nthis is the end of the file$
s/((hello|world)[^\n]*)\n/\1 /g searches for the keywords hello or world and removes the following \n.
The g flag makes the substitution apply to all matches of the regexp, not just the first.
A non-regex approach:
awk '
BEGIN {
# define the word list
w["hello"]
w["world"]
}
{
printf "%s", $0
for (i=1; i<=NF; i++)
if ($i in w) {
printf " "
next
}
print ""
}
'
or a perl one-liner
perl -pe 'BEGIN {#w = qw(hello world)} s/\n/ / if grep {$_ ~~ #w} split'
To edit the file in-place, do:
awk '...' filename > tmpfile && mv tmpfile filename
perl -i -pe '...' filename
This might work for you (GNU sed):
sed -r ':a;/^.*(hello|world).*\'\''/M{$bb;N;ba};:b;s/\n/ /g' file
This checks whether the last line of a possibly multi-line pattern space contains the required string(s) and, if so, reads another line, until end-of-file or until the last line does not contain those strings. The newlines are then replaced by spaces and the line is printed.
$ awk '{ORS=(/hello|world/?FS:RS)}1' file
this is an original
file with lines
containing words like hello and world this is the end of the file
sed -n '
:beg
/hello/ b keep
/world/ b keep
H;s/.*//;x;s/\n/ /g;p;b
: keep
H;s/.*//
$ b beg
' YourFile
This is a bit harder because the current line may arrive while a previous hello or world line is already held.
Principle:
on every pattern match, keep the line in the hold buffer
otherwise, load the hold buffer and remove the \n's (using a swap, and emptying the current line, due to the limited buffer operations available), then print the content
a special case handles a pattern match on the last line (normally it would stay in the hold buffer and never be printed)

Replace entire paragraph with another from linux command line

The problem I have is pretty straightforward (or so it seems). All I want to do is replace a paragraph of text (it's a header comment) with another paragraph. This will need to happen across a diverse number of files in a directory hierarchy (source code tree).
The paragraph to be replaced must be matched in its entirety as there are similar text blocks in existence.
e.g.
To Replace
// ----------
// header
// comment
// to be replaced
// ----------
With
// **********
// some replacement
// text
// that could have any
// format
// **********
I have looked at using sed, and from what I can tell it works on at most two lines at a time (with the N command).
My question is: what is the way to do this from the linux command line?
EDIT:
Solution obtained: Best solution was Ikegami's, fully command line and best fit for what I wanted to do.
My final solution required some tweaking; the input data contained a lot of special characters, as did the replacement data. To deal with this, the data needs to be preprocessed to insert appropriate \n's and escape characters. The end product is a shell script that takes 3 arguments: a file containing the text to search for, a file containing the text to replace with, and a folder to recursively parse for files with .cc and .h extensions. It's fairly easy to customise from here.
SCRIPT:
#!/bin/bash
if [ -z "$1" ]; then
    echo 'First parameter is a path to a file that contains the excerpt to be replaced, this must be supplied'
    exit 1
fi
if [ -z "$2" ]; then
    echo 'Second parameter is a path to a file containing the text to replace with, this must be supplied'
    exit 1
fi
if [ -z "$3" ]; then
    echo 'Third parameter is the path to the folder to recursively parse and replace in'
    exit 1
fi
sed 's!\([]()|\*\$\/&[]\)!\\\1!g' "$1" > temp.out
sed ':a;N;$!ba;s/\n/\\n/g' temp.out > final.out
searchString=$(cat final.out)
sed 's!\([]|\[]\)!\\\1!g' "$2" > replace.out
replaceString=$(cat replace.out)
find "$3" -regex ".*\.\(cc\|h\)" -execdir perl -i -0777pe "s{$searchString}{$replaceString}" {} +
find -name '*.pm' -exec perl -i~ -0777pe'
s{// ----------\n// header\n// comment\n// to be replaced\n// ----------\n}
{// **********\n// some replacement\n// text\n// that could have any\n// format\n// **********\n};
' {} +
Using perl:
#!/usr/bin/env perl
# script.pl
use strict;
use warnings;
use Inline::Files;
my $lines = join '', <STDIN>; # read stdin
my $repl = join '', <REPL>; # read replacement
my $src = join '', <SRC>; # read source
chomp $repl; # remove trailing \n from $repl
chomp $src; # id. for $src
$lines =~ s#$src#$repl#gm; # global multiline replace
print $lines; # print output
__SRC__
// ----------
// header
// comment
// to be replaced
// ----------
__REPL__
// **********
// some replacement
// text
// that could have any
// format
// **********
Usage: ./script.pl < yourfile.cpp > output.cpp
Requirements: Inline::Files (install from cpan)
Tested on: perl v5.12.4, Linux _ 3.0.0-12-generic #20-Ubuntu SMP Fri Oct 7 14:56:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
This might work:
# cat <<! | sed ':a;N;s/this\nand\nthis\n/something\nelse\n/;ba'
> a
> b
> c
> this
> and
> this
> d
> e
> this
> not
> this
> f
> g
> !
a
b
c
something
else
d
e
this
not
this
f
g
The trick is to slurp everything into the pattern space using the N command and the :a;...;ba loop.
This is probably more efficient:
sed '1{h;d};H;$!d;x;s/this\nand\nthis\n/something\nelse\n/g;p;d'
A more general purpose solution may use files for match and substitute data like so:
match=$(sed ':a;N;${s/\n/\\n/g};ba;' match_file)
substitute=$(sed ':a;N;${s/\n/\\n/g};ba;' substitute_file)
sed '1{h;d};H;$!d;x;s/'"$match"'/'"$substitute"'/g;p;d' source_file
Another way (probably less efficient) but cleaner looking:
sed -s '$s/$/\n###/' match_file substitute_file |
sed -r '1{h;d};H;${x;:a;s/^((.*)###\n(.*)###\n(.*))\2/\1\3/;ta;s/(.*###\n){2}//;p};d' - source_file
The last uses the GNU sed --separate option to treat each file as a separate entity. The second sed command uses a loop for the substitute to obviate .* greediness.
As long as the header comments are delimited uniquely (i.e., no other header comment starts with // ----------), and the replacement text is constant, the following awk script should do what you need:
BEGIN { normal = 1 }
/\/\/ ----------/ {
    if (normal) {
        normal = 0;
        print "// **********";
        print "// some replacement";
        print "// text";
        print "// that could have any";
        print "// format";
        print "// **********";
    } else {
        normal = 1;
        next;
    }
}
{
    if (normal) print;
}
This prints everything it sees until it runs into the paragraph delimiter. When it sees the first one, it prints out the replacement paragraph. Until it sees the 2nd paragraph delimiter, it will print nothing. When it sees the 2nd paragraph delimiter, it will start printing lines normally again with the next line.
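A condensed run of the toggle logic, with shortened replacement text and an assumed one-line body between the delimiters:

```shell
printf '%s\n' '// ----------' '// header' '// ----------' 'int x;' |
awk 'BEGIN { normal = 1 }
     /\/\/ ----------/ {
         if (normal) {
             normal = 0
             print "// **********"; print "// replacement"; print "// **********"
         } else { normal = 1; next }
     }
     { if (normal) print }'
```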
While you can technically do this from the command line, you may run into tricky shell quoting issues, especially if the replacement text has any single quotes. It may be easier to put the script in a file. Just put #!/usr/bin/awk -f (or whatever path which awk returns) at the top.
EDIT
To match multiple lines in awk, you'll need to use getline. Perhaps something like this:
/\/\/ ----------/ {
    lines[0] = "// header";
    lines[1] = "// comment";
    lines[2] = "// to be replaced";
    lines[3] = "// ----------";
    linesRead = $0 "\n";
    for (i = 0; i < 4; i++) {
        getline line;
        linesRead = linesRead line "\n";
        if (line != lines[i]) {
            printf "%s", linesRead;  # print partial matches
            next;
        }
    }
    # print the replacement paragraph here
    next;
}