Input text file:
This is a simple test file.
#BEGIN
These lines should be extracted by our script.
Everything here will be copied.
#END
That should be all.
#BEGIN
Nothing from here.
#END
Desired output:
These lines should be extracted by our script.
Everything here will be copied.
My awk script is:
#!/usr/bin/awk -f
$1 ~ /#BEGIN/{a=1;next};a;$1 ~ /#END/ {exit}
and my current output is:
These lines should be extracted by our script.
Everything here will be copied.
#END
The only problem I'm having is that I'm still printing the "#END" line. I've been trying for a long time to eliminate it somehow, but I'm not sure exactly how to do it.
This becomes obvious, IMO, if we comment each command in the script. The script can be written like this:
#!/usr/bin/awk -f
$1 ~ /#BEGIN/ { # If we match the BEGIN line
a=1 # Set a flag to one
next # skip to the next line
}
a != 0 { # if the flag is not zero
print $0 # print the current line
}
$1 ~ /#END/ { # if we match the END line
exit # exit the process
}
Note that I expanded the lone a to the equivalent form a!=0{print $0}, to make the point clearer.
So the script starts printing lines once the flag is set, and when it reaches the END line, it prints that line before it exits. Since you don't want the END line printed, you should exit before printing it. So the script should become:
#!/usr/bin/awk -f
$1 ~ /#BEGIN/ { # If we match the BEGIN line
a=1 # Set a flag to one
next # skip to the next line
}
$1 ~ /#END/ { # if we match the END line
exit # exit the process
}
a != 0 { # if the flag is not zero
print $0 # print the current line
}
In this case, we exit before the line is printed. In a condensed form, it can be written as:
awk '$1~/#BEGIN/{a=1;next}$1~/#END/{exit}a' file
or a bit shorter
awk '$1~/#END/{exit}a;$1~/#BEGIN/{a=1}' file
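For reference, assuming the sample input above is saved as file, either one-liner should produce the desired output:
$ awk '$1~/#BEGIN/{a=1;next}$1~/#END/{exit}a' file
These lines should be extracted by our script.
Everything here will be copied.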
Regarding the additional constraints raised in the comments, to avoid skipping any BEGIN blocks within the block that is to be printed, we should remove the next statement and rearrange the rules as in the example right above. In expanded form it would look like this:
#!/usr/bin/awk -f
$1 ~ /#END/ { # if we match the END line
exit # exit the process
}
a != 0 { # if the flag is not zero
print $0 # print the current line
}
$1 ~ /#BEGIN/ { # If we match the BEGIN line
a=1 # Set a flag to one
}
To also avoid exiting if an END line is found before the block to be printed, we can check if the flag is set before exiting:
#!/usr/bin/awk -f
$1 ~ /#END/ && a != 0 { # if we match the END line and the flag is set
exit # exit the process
}
a != 0 { # if the flag is not zero
print $0 # print the current line
}
$1 ~ /#BEGIN/ { # If we match the BEGIN line
a=1 # Set a flag to one
}
or in a condensed form:
awk '$1~/#END/&&a{exit}a;$1~/#BEGIN/{a=1}' file
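As a sketch of those constraints, assuming an input saved as tricky.txt with a stray #END before the block and another #BEGIN inside it, the final one-liner ignores the early #END, prints the inner #BEGIN line instead of swallowing it, and stops at the block's closing #END:
$ cat tricky.txt
#END
some junk
#BEGIN
keep this
#BEGIN
keep this too
#END
leftover
$ awk '$1~/#END/&&a{exit}a;$1~/#BEGIN/{a=1}' tricky.txt
keep this
#BEGIN
keep this too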
Try the sed command below to get the desired output:
vipin@kali:~$ sed '/#BEGIN/,/#END/!d;/END/q' kk.txt|sed '1d;$d'
These lines should be extracted by our script.
Everything here will be copied.
vipin@kali:~$
Explanation:
d deletes everything outside the range between the two expressions (!d keeps only the lines inside it), and q quits as soon as #END is found, so the second block is ignored.
The second sed's 1d;$d then deletes the first and last lines of that output, in our case #BEGIN and #END.
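As an aside, the two sed invocations can be collapsed into one (GNU sed syntax) by excluding the boundary lines inside the range; a sketch, not part of the original answer:
vipin@kali:~$ sed -n '/#BEGIN/,/#END/{/#BEGIN/!{/#END/!p}};/#END/q' kk.txt
These lines should be extracted by our script.
Everything here will be copied.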
I want to delete the header from all the files, and the header consists of lines starting with //.
If I wanted to delete all lines that start with //, I could do the following:
sed '/^\/\//d'
But that is not what I need. I just need to delete the lines at the beginning of the file that start with //.
Sample file:
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Update:
If there is an empty line at the beginning or in between, it doesn't work. Is there any way to take care of that scenario?
Sample file:
< new empty line >
// This is the header
< new empty line >
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Can someone suggest a way to do this? Thanks in advance!
Update: The accepted answer works well for whitespace at the beginning or in between.
Could you please try the following. It also takes care of the empty-line scenario; written and tested at https://ideone.com/IKN3QR
awk '
(NF == 0 || /^[[:blank:]]*\/\//) && !found{
next
}
NF{
found=1
}
1
' Input_file
Explanation: if a line is either empty OR starts with // (after optional blanks) AND the variable found is not yet set, simply skip it. As soon as a non-empty line that does not start with // is seen, found is set, so every line from that point to the end of Input_file is printed.
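For instance, with the second sample file (the one containing empty lines) saved as Input_file, this should print:
$ awk '(NF == 0 || /^[[:blank:]]*\/\//) && !found{next} NF{found=1} 1' Input_file
print "Hi"
// This should not be deleted
print "Hello"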
With sed:
sed -n '1{:a; /^[[:space:]]*\/\/\|^$/ {n; ba}};p' file
print "Hi"
// This should not be deleted
print "Hello"
Slightly shorter version with GNU sed:
sed -nE '1{:a; /^\s*\/\/|^$/ {n; ba}};p' file
Explanation:
1 { # execute this block on the first line only
:a; # this is a label
/^\s*\/\/|^$/ { n; # on lines matching `^\s*\/\/` or `^$`, do: read the next line
ba } # and go to label :a
}; # end block
p # print line unchanged:
# we only get here after the header or when it's not found
sed -n makes sed not print any lines without the p command.
Edit: updated the pattern to also skip empty lines.
It sounds like you just want to start printing from the first line that's neither blank nor just a comment:
$ awk 'NF && ($1 !~ "^//"){f=1} f' file
print "Hi"
// This should not be deleted
print "Hello"
The above simply sets a flag f when it finds such a line and prints every line from then on. It will work using any awk in any shell on every UNIX box.
Note that, unlike some of the potential solutions posted, it doesn't store more than 1 line at a time in memory and so will work no matter how large your input file is.
It was tested against this input:
$ cat file
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
To run the above on many files at once, modifying each file as you go, use this with GNU awk:
awk -i inplace 'NF && ($1 !~ "^//"){f=1} f' *
and this with any awk:
ip_awk() { local f t=$(mktemp) && for f in "${@:2}"; do awk "$1" "$f" > "$t" && mv -- "$t" "$f"; done; }
ip_awk 'NF && ($1 !~ "^//"){f=1} f' *
In case perl is available, this may also work in slurp mode:
perl -0777 -pe 's~\A(?:\h*(?://.*)?\R+)+~~' file
\A will only match start of the file and (?:\h*(?://.*)?\R+)+ will match 1 or more lines that are blank or have // with optional leading spaces.
With GNU sed:
sed -i -Ez 's/^((\/\/[^\n]*|\s*)\n)+//' file
The ^((\/\/[^\n]*|\s*)\n)+ expression will match one or more lines starting with //, also matching blank lines, only at the start of the file.
Using ed (the file editor that the stream editor sed is based on),
printf '1,/^[^/]/ g|^\(//.*\)\{0,1\}$| d\nw\n' | ed tmp.txt
Some explanations are probably in order.
ed takes the name of the file to edit as an argument, and reads commands from standard input. Each command is terminated by a newline. (You could also read commands from a here document, rather than from printf via a pipe.)
1,/^[^/]/ addresses the first lines in the file, up to and including the first one that does not start with /. (All the lines you want to delete will be included in this set.)
g|^\(//.*\)\{0,1\}$|d deletes all the addressed lines that are either empty or do start with //.
w saves the changes.
Step 2 is a bit ugly; unfortunately, ed does not support regular expression operators you may take for granted, like ? or |. Breaking the regular expression down a bit:
^ matches the start of the line.
//.* matches // followed by zero or more characters.
\(//.*\)\{0,1\} matches the preceding regular expression 0 or 1 times (i.e., optionally)
$ matches the end of the line.
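For the here-document alternative mentioned above, this should be equivalent (a sketch; the -s flag just suppresses ed's byte-count chatter):
ed -s tmp.txt <<'EOF'
1,/^[^/]/ g|^\(//.*\)\{0,1\}$| d
w
EOF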
I have a file in this pattern:
Some text
---
## [Unreleased]
More text here
I need to replace the text between '---' and '## [Unreleased]' with something else in a shell script.
How can it be achieved using sed or awk?
Perl to the rescue!
perl -lne 'my @replacement = ("First line", "Second line");
if ($p = (/^---$/ .. /^## \[Unreleased\]/)) {
print $replacement[$p-1];
} else { print }'
The flip-flop operator .. tells you whether you're between the two strings, moreover, it returns the line number relative to the range.
This might work for you (GNU sed):
sed '/^---/,/^## \[Unreleased\]/c\something else' file
Change the lines between two regexp to the required string.
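For example, with the sample in file:
$ sed '/^---/,/^## \[Unreleased\]/c\something else' file
Some text
something else
More text here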
This example may help you.
$ cat f
Some text
---
## [Unreleased]
More text here
$ seq 1 5 >mydata.txt
$ cat mydata.txt
1
2
3
4
5
$ awk '/^---/{f=1; while(getline < c)print;close(c);next}/^## \[Unreleased\]/{f=0;next}!f' c="mydata.txt" f
Some text
1
2
3
4
5
More text here
awk -v RS="\0" 'gsub(/---\n\n## \[Unreleased\]\n/,"something")+1' file
give this line a try.
An awk solution that:
is portable (POSIX-compliant).
can deal with any number of lines between the start line and the end line of the block, and potentially with multiple blocks (although they'd all be replaced with the same text).
reads the file line by line (as opposed to reading the entire file at once).
awk -v new='something else' '
/^---$/ { f=1; next } # Block start: set flag, skip line
f && /^## \[Unreleased\]$/ { f=0; print new; next } # Block end: unset flag, print new txt
! f # Print line, if before or after block
' file
I want to see if there is a better/quicker way to do this.
Basically, I have a file and I need to add some more information to it, based on one of its fields. e.g.
File to edit:
USER|ROLE
user1|role1
user1|role2
user2|role1
user2|role11
Input File:
Role|Application
role1|applicationabc
role2|application_qwerty
role3|application_new_app_new
role4|qwerty_abc_123
role11|applicationabc123
By the end, I want to be left with something like this:
USER|ROLE|Application
user1|role1|applicationabc
user1|role2|application_qwerty
user2|role11|applicationabc123
user2|role3|application_new_app_new
My idea:
cat inputfile | while IFS='|' read src rep
do
sed -i "s#\<$src\>#$src\|$rep#" /path/to/file/filename.csv
done
What I've written works to an extent, but it is very slow. Also, if it finds a match anywhere in the line, it will replace it. For example, for user2, and role11, the script would match role1 before it matches role11.
So my questions are:
Is there a quicker way to do this?
Is there a way to match against the exact expression/string? Putting quotes in my input file doesn't seem to work.
With join:
join -i -t "|" -1 2 -2 1 <(sort -t '|' -k2b,2 file) <(sort -t '|' -k 1b,1 input)
From the join manpage:
Important: FILE1 and FILE2 must be sorted on the join fields.
That's why we need to sort the two files first: file on the first field and input on the second.
Then join joins the two files on those fields (-1 2 -2 1). The output would then be:
ROLE|USER|Application
role1|user1|applicationabc
role1|user2|applicationabc
role11|user2|applicationabc123
role2|user1|application_qwerty
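If you need the original USER|ROLE|Application column order back, a small awk post-process would do; a sketch, not part of the original answer:
join -i -t "|" -1 2 -2 1 <(sort -t '|' -k2b,2 file) <(sort -t '|' -k 1b,1 input) | awk -F'|' -v OFS='|' '{print $2, $1, $3}'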
Piece of cake with awk:
$ cat file1
USER|ROLE
user1|role1
user1|role2
user2|role1
user2|role11
$ cat file2
ROLE|Application
role1|applicationabc
role2|application_qwerty
role3|application_new_app_new
role4|qwerty_abc_123
role11|applicationabc123
$ awk -F'\\|' 'NR==FNR{a[$1]=$2; next}; {print $0 "|" a[$2]}' file2 file1
USER|ROLE|Application
user1|role1|applicationabc
user1|role2|application_qwerty
user2|role1|applicationabc
user2|role11|applicationabc123
Please try the following:
awk 'FNR==NR{A[$1]=$2;next}s=$2 in A{ $3=A[$2] }s' FS='|' OFS='|' file2 file1
or:
awk 'FNR==NR{A[$1]=$2;next} $3 = $2 in A ? A[$2] : 0' FS='|' OFS='|' file2 file1
Explanation
awk '
# FNR==NR is true only while awk is reading the first file
FNR==NR{
# Create array A where index = field1($1) and value = field2($2)
A[$1]=$2
# stop processing and go to next line
next
}
# Here we read the 2nd file, which is file1 in your case
# (var in array) returns either 1 (true) or 0 (false)
# if array A has index field2 ($2), then s will be 1, otherwise 0
# whenever s is 1 (true), we create a new field
# $3, whose value is the array element indexed by field2
s=$2 in A{
$3=A[$2]
}s
# An awk program is a series of condition-action pairs,
# conditions being outside of curly braces and actions being enclosed in them.
# A condition is considered false if it evaluates to zero or the empty string,
# anything else is true (uninitialized variables are zero or empty string,
# depending on context, so they are false).
# Either a condition or an action can be implied;
# braces without a condition are considered to have a true condition and
# are always executed if they are hit,
# and any condition without an action will print the line
# if and only if the condition is met.
# So the final }s at the end of the script
# executes the default action for every line,
# printing the line whenever s is 1 (true),
# i.e. whenever it was modified by the previous action in braces
# FS = Input Field Separator
# OFS = Output Field Separator
' FS='|' OFS='|' file2 file1
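With file2 and file1 as shown in the previous answer, the first command should produce the same joined output; note that a line whose ROLE had no match in file2 would be dropped entirely, since s would be 0 for it:
$ awk 'FNR==NR{A[$1]=$2;next}s=$2 in A{ $3=A[$2] }s' FS='|' OFS='|' file2 file1
USER|ROLE|Application
user1|role1|applicationabc
user1|role2|application_qwerty
user2|role1|applicationabc
user2|role11|applicationabc123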
I need to remove a line containing "not a dynamic executable" and the previous line from a stream using grep, awk, sed or something else. My current working solution would be to tr the entire stream to strip off newlines, then replace the newline preceding my match with something else using sed, then use tr to add the newlines back in, and finally use grep -v. I'm somewhat wary of artifacts with this approach, but I don't see how else to do it at the moment:
tr '\n' '|' | sed 's/|\tnot a dynamic executable/__MY_REMOVE/g' | tr '|' '\n'
EDIT:
The input is a list of mixed files piped to xargs ldd; basically I want to ignore all output about non-library files since that has nothing to do with what I'm doing next. I didn't want to use a lib*.so mask since that could conceivably be different.
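For reference, the output being filtered looks roughly like this (a sketch; the exact libraries differ per system, and the dependency lines are tab-indented, which is what the \t in the answers below matches):
$ echo /bin/ls /etc/hosts | xargs ldd
/bin/ls:
	linux-vdso.so.1 (0x00007ffd...)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f...)
/etc/hosts:
	not a dynamic executable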
Most simply with pcregrep in multi-line mode:
pcregrep -vM '\n\tnot a dynamic executable' filename
If pcregrep is not available to you, then awk or sed can also do this by reading one line ahead and skipping the printing of previous lines when a marker line appears.
You could be boring (and sensible) with awk:
awk '/^\tnot a dynamic executable/ { flag = 1; next } !flag && NR > 1 { print lastline; } { flag = 0; lastline = $0 } END { if(!flag) print }' filename
That is:
/^\tnot a dynamic executable/ { # in lines that start with the marker
flag = 1 # set a flag
next # and do nothing (do not print the last line)
}
!flag && NR > 1 { # if the last line was not flagged and
# is not the first line
print lastline # print it
}
{ # and if you got this far,
flag = 0 # unset the flag
lastline = $0 # and remember the line to be possibly
# printed.
}
END { # in the end
if(!flag) print # print the last line if it was not flagged
}
But sed is fun:
sed ':a; $! { N; /\n\tnot a dynamic executable/ d; P; s/.*\n//; ba }' filename
Explanation:
:a # jump label
$! { # unless we reached the end of the input:
N # fetch the next line, append it
/\n\tnot a dynamic executable/ d # if the result contains a newline followed
# by "\tnot a dynamic executable", discard
# the pattern space and start at the top
# with the next line. This effectively
# removes the matching line and the one
# before it from the output.
# Otherwise:
P # print the pattern space up to the newline
s/.*\n// # remove the stuff we just printed from
# the pattern space, so that only the
# second line is in it
ba # and go to a
}
# and at the end, drop off here to print
# the last line (unless it was discarded).
Or, if the file is small enough to be completely stored in memory:
sed ':a $!{N;ba}; s/[^\n]*\n\tnot a dynamic executable[^\n]*\n//g' filename
Where
:a $!{ N; ba } # read the whole file into
# the pattern space
s/[^\n]*\n\tnot a dynamic executable[^\n]*\n//g # and cut out the offending bit.
This might work for you (GNU sed):
sed 'N;/\n.*not a dynamic executable/d;P;D' file
This keeps a moving window of 2 lines and deletes them both if the desired string is found in the second. If not the first line is printed and then deleted and then next line appended and the process repeated.
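Applied to the sketched ldd output above, for example:
$ echo /bin/ls /etc/hosts | xargs ldd | sed 'N;/\n.*not a dynamic executable/d;P;D'
/bin/ls:
	linux-vdso.so.1 (0x00007ffd...)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f...)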
Always keep in mind that while grep and sed are line-oriented awk is record-oriented and so can easily handle problems that span multiple lines.
It's a guess given you didn't post any sample input and expected output, but it sounds like all you need is (using GNU awk for multi-char RS):
awk -v RS='^$' -v ORS= '{gsub(/[^\n]+\n\tnot a dynamic executable/,"")}1' file
I need help with sed/awk/grep/whatever could solve my task.
I have a large file and I need to extract multiple sequential lines from it.
I have start pattern: <DN>
and end pattern: </GR>
and several lines in between, like this:
<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>
I've tried this:
sed -n '/\<DN\>/,/\<\/GR\>/p'
and several other ones (using awk and sed).
It works okay, but the problem is that the source file may contain bunches of lines starting with <DN> but without </GR> at the end, and only later a part that starts with another <DN> and ends with a normal </GR>:
<DN>234</DN> - unneeded DN
<AB>sdfsd</AB>
<DC>456456</DC>
<EF>6575675 sdfsd</EF>
....really large piece of unwanted text here....
<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>
How can I extract only needed lines and ignore garbage pieces of log, containing <DN> without ending </GR>?
And next, I need to convert the multiline pieces from <DN> to </GR> into a file of single lines, each starting with <DN> and ending with </GR>.
Any help would be appreciated. I'm stuck.
This might work for you (GNU sed):
sed -n '/<DN>/{h;b};x;/./G;x;/<\/GR/{x;/./p;z;x}' file
Use the hold space to collect lines between <DN> and </GR>: a <DN> line overwrites the hold space (discarding any unterminated block), subsequent lines are appended only if a block has been started, and when a </GR> line arrives the accumulated block is printed and the hold space is cleared.
awk '
# Lines that start with '<DN>' start our matching.
/^<DN>/ {
# If we saw a start without a matching end throw everything we've saved away.
if (dn) {
d=""
}
# Mark being in a "<DN>" element.
dn=1
# Save the current line.
d=$0
next
}
# Lines that end with "</GR>" end our matching (but only if we are currently in a match).
dn && /<\/GR>$/ {
# We aren't in a <DN> element anymore.
dn=0
# Print out the lines we've saved and the current line.
printf "%s%s%s\n", d, OFS, $0
# Reset our saved contents.
d=""
next
}
# If we are in a <DN> element and have saved contents, append the current line to the contents (separated by OFS).
dn && d {
d=d OFS $0
}
' file
awk '
/^<DN>/ { n = 1 }
n { lines[n++] = $0 }
n && /<\/GR>$/ {
for (i=1; i<n; i++) printf "%s", lines[i]
print ""
n = 0
}
' file
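For instance, against the larger sample with the garbage pieces, this should emit one joined line per complete <DN>...</GR> block (condensed to a one-liner here; the output is a single long line):
$ awk '/^<DN>/{n=1} n{lines[n++]=$0} n&&/<\/GR>$/{for(i=1;i<n;i++)printf "%s",lines[i]; print ""; n=0}' file
<DN>234</DN><DD>sdfsd</DD><BR>456456</BR><COL>6575675 sdfsd</COL><RAC>456464</RAC><GR>sdfsdfsFFFDd</GR>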
with bash:
fun ()
{
local line output;
while IFS= read -r line; do
if [[ $line =~ ^'<DN>' ]]; then
output=$line;
else
if [[ -n $output ]]; then
output=$output$'\n'$line;
if [[ $line =~ '</GR>'$ ]]; then
echo "$output";
output=;
fi;
fi;
fi;
done
}
fun <file
You could use the pcregrep tool for this. (?s) lets . match newlines, (?:(?!<DN>).)*? keeps a match from running across a second <DN>, and -M and -o enable multiline matching and print only the matched text.
$ pcregrep -o -M '(?s)(?<=^|\s)<DN>(?:(?!<DN>).)*?</GR>(?=\n|$)' file
<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>