grep/pcregrep/sed/awk the data after the last match to the end of a file - regex

I need to grab the content after the last match of ENTRY to the end of the file, and I can't seem to do it. It can be multiple lines and the data can include any character to the end of the file including (,\n, ).
I’ve tried:
tail -1 file # doesn’t work due to it not consistently being one line
grep "^(.*" # only grabs one line
pcregrep -M '\n(.*' file # I think a variation of this is the solution, but I’ve had no luck so far.
File that grows below:
TOP OF FILE
%
ENTRY
(S®s
√6ûíπ‹ôTìßÅDPˆ¬k·Ù"=ÓxF)*†‰ú˚ÃQ´¿J‘\˜©ŒG»‡∫QÆ’<πsµ-ù±ñ∞NäAOilWçk
N+P}V<ôÒ∏≠µW*`Hß”;–GØ»14∏åR"ºã
FD‘mÍõ?*ÊÎÉC)(S®s
√6ûíπ‹ôTìßÅDPˆ¬k·Ù"=ÓxF)*†‰ú˚ÃQ´¿J‘\˜©ŒG»‡∫QÆ’<πsµ-ù±ñ∞NäAOilWçk
N+P}V<ôÒ∏≠µW*`Hß”;–GØ»14∏åR"ºã
FD‘mÍõ?*ÊÎÉC)eq
{
DATA
}
ENTRY
(A® S\kÉflã1»Âbπ¯Ú∞⁄äπHZ#F◊§•Ã*‹¡‹…ÿPkJòÑíòú˛¶à˛¨¢v|u«Ùbó–Ö¶¢∂5ıÜ#¨•˘®#W´≥‡*`H∑”ı–Só¬<˙ìEçöf∞Gg±:œe™flflå)A® S\kÉflã1»Âbπ¯Ú∞⁄äπHZ#F◊§•Ã*‹¡‹…ÿPkJòÑíòú˛¶à˛¨¢v|u«Ùbó–Ö¶¢∂5ıÜ#¨•˘®#W´≥‡*`H∑”ı–Só¬<˙ìEçöf∞Gg±:œe™flflå)eq
{
DATA
}if
ENTRY
(ÌSYõ˛9°\K¬∞≈fl|”/í÷L
Ö˙h/ÜÇi"û£fi±€ÀNéÓ›bÏÿmâ[≈4J’XPü´Z
oÜlø∫…qìõ¢,ßü©cÓ{—˜e&ÚÀÓHÏÜ‚m(Œ∆⁄ˆQ˝òêpoÉÄÂ(S‘E ⁄ !ŸQ§ô6ÉH

$ awk '/^[(]/{s="";} {s=s"\n"$0;} END{print substr(s,2);}' file
(ÌSYõ˛9°\K¬∞≈fl|”/í÷L
Ö˙h/ÜÇi"û£fi±€ÀNéÓ›bÏÿmâ[≈4J’XPü´Z
oÜlø∫…qìõ¢,ßü©cÓ{—˜e&ÚÀÓHÏÜ‚m(Œ∆⁄ˆQ˝òêpoÉÄÂ(S‘E ⁄ !ŸQ§ô6ÉH
How it works
awk implicitly loops through files line-by-line. This script stores whatever we want to print in the variable s.
/^[(]/{s="";}
Every time that we find a line which starts with (, we set s to an empty string.
The purpose of this is to remove everything before the last occurrence of a line starting with (.
s=s"\n"$0
We add the current line onto the end of s.
END{print substr(s,2);}
After we reach the end of the file, we print s (omitting the first character which will be a surplus newline character).
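The script above keys on a line starting with (, which happens to be the first data line in the sample. As a minimal sketch of the same buffer-reset idea keyed on the ENTRY marker itself (the helper name after_last_entry is made up, and it assumes ENTRY sits alone on its line and that your awk copes with the data bytes, which binary content with NUL bytes may break):

```shell
# after_last_entry: hypothetical helper; prints everything after the last line
# that is exactly "ENTRY". Reads the named files, or stdin if none are given.
after_last_entry() {
  awk '
    $0 == "ENTRY" { s = ""; next }   # marker line: discard everything buffered so far
    { s = s $0 "\n" }                # any other line: append it to the buffer
    END { printf "%s", s }           # print whatever followed the last marker
  ' "$@"
}

# Small made-up input standing in for the growing file.
printf '%s\n' 'TOP OF FILE' 'ENTRY' 'old data' 'ENTRY' '(new data' 'more)' |
  after_last_entry
```

If the input contains no ENTRY line at all, this prints the whole input, since the buffer is never reset.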

Interesting problem. I think you can do it with just sed. When you find a match, zero the hold space and add the match line to the hold space. On the last line, print the hold space.
sed -n -e '/ENTRY/,$ { /ENTRY/ { h; n; }; H; $ { x; p; } }'
Don't print by default. From the first entry to the end of the file:
If it is an entry line; copy the new line over the hold space and move on.
Otherwise append the line to the hold space.
If it is the last line, swap the hold space and pattern space, and print the pattern space (what was in the hold space).
You might worry about what happens if the last line in the file is an ENTRY line.
Given a data file:
TOP OF FILE
not wanted
ENTRY
could be wanted
ENTRY
but it wasn't
and this isn't
because
ENTRY
this is here
EOF
The output is:
ENTRY
this is here
EOF
If you don't want ENTRY to appear, modify the script slightly:
sed -n -e '/ENTRY/,$ { /ENTRY/ { s/.*//; h; n; }; H; $ { x; s/^\n//; p; } }'

Using tac you could do it:
tac <file> | sed -e '/ENTRY/,$d' | tac
This will print the file with the lines reversed, then use sed to remove everything from what is now the first occurrence of ENTRY to the now end of the file, then reverse the lines again to get the original order.
As Jonathan Leffler pointed out, the sed step can be made more efficient by quitting as soon as it finds the ENTRY line, instead of processing the rest of the file just to delete it. The gain is probably small, because tac still has a lot to do and the pipeline still has the overhead of requiring three processes instead of one:
tac <file> | sed -e '/ENTRY/q' | tac
though his answer is often going to be better still. That answer will include the ENTRY line. If you don't want that you could also do
tac <file> | sed -n '/ENTRY/q;p' | tac
This prints nothing by default, uses the p command to print each line until the ENTRY line is reached, and quits as soon as that line is found.

This should work too (at least with gawk)
awk -vRS="ENTRY" 'END{print $0}'
set the record separator as your pattern and print the last record.

Loading the whole file into memory:
sed -e 'H;$!d' -e 'x;s/.*ENTRY[[:blank:]]*\n//' YourFile

Related

How can I delete the lines starting with "//" (e.g., file header) which are at the beginning of a file?

I want to delete the header from all the files, and the header has the lines starting with //.
If I want to delete all the lines that starts with //, I can do following:
sed '/^\/\//d'
But, that is not something I need to do. I just need to delete the lines in the beginning of the file that starts with //.
Sample file:
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Update:
If there is a new line in the beginning or in-between, it doesn't work. Is there any way to take care of that scenario?
Sample file:
< new empty line >
// This is the header
< new empty line >
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Can someone suggest a way to do this? Thanks in advance!
Update: The accepted answer works well for white space in the beginning or in-between.
Could you please try following. This also takes care of new line scenario too, written and tested in https://ideone.com/IKN3QR
awk '
(NF == 0 || /^[[:blank:]]*\/\//) && !found{
next
}
NF{
found=1
}
1
' Input_file
Explanation: While the variable found is unset, skip every line that is either empty or starts with // (optionally preceded by blanks). The first line that is neither sets found, and from that line to the end of Input_file every line is printed.
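A minimal sketch of that flag logic run against the second sample from the question (the one with empty lines). The function name strip_header is made up for illustration, and [ \t] is used in place of [[:blank:]] only to stay friendly to older awks:

```shell
# strip_header: hypothetical helper; drops leading blank lines and // lines,
# then prints everything from the first real line onward.
strip_header() {
  awk '
    !found && (NF == 0 || /^[ \t]*\/\//) { next }  # still in the header: skip
    { found = 1; print }                           # first real line and all later lines
  ' "$@"
}

printf '%s\n' '' '// This is the header' '' '// This should be deleted' \
  'print "Hi"' '// This should not be deleted' 'print "Hello"' |
  strip_header
```

Because found is set on the first kept line, later // lines and later blank lines pass through untouched.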
With sed:
sed -n '1{:a; /^[[:space:]]*\/\/\|^$/ {n; ba}};p' file
print "Hi"
// This should not be deleted
print "Hello"
Slightly shorter version with GNU sed:
sed -nE '1{:a; /^\s*\/\/|^$/ {n; ba}};p' file
Explanation:
1 { # execute this block on the first line only
:a; # this is a label
/^\s*\/\/|^$/ { n; # on lines matching `^\s*\/\/` or `^$`, do: read the next line
ba } # and go to label :a
}; # end block
p # print line unchanged:
# we only get here after the header or when it's not found
sed -n makes sed not print any lines without the p command.
Edit: updated the pattern to also skip empty lines.
It sounds like you just want to start printing from the first line that's neither blank nor just a comment:
$ awk 'NF && ($1 !~ "^//"){f=1} f' file
print "Hi"
// This should not be deleted
print "Hello"
The above simply sets a flag f when it finds such a line and prints every line from then on. It will work using any awk in any shell on every UNIX box.
Note that, unlike some of the potential solutions posted, it doesn't store more than 1 line at a time in memory and so will work no matter how large your input file is.
It was tested against this input:
$ cat file
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
To run the above on many files at once, modifying each file in place, use this with GNU awk:
awk -i inplace 'NF && ($1 !~ "^//"){f=1} f' *
and this with any awk:
ip_awk() { local f t=$(mktemp) && for f in "${@:2}"; do awk "$1" "$f" > "$t" && mv -- "$t" "$f"; done; }
ip_awk 'NF && ($1 !~ "^//"){f=1} f' *
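For what it's worth, here is a hedged sketch of the same "any awk" in-place idea spelled out over several lines and run on throwaway files. It is a variant, not the answer's exact function: it copies the result back with cat rather than mv, on the assumption that you want to keep each file's permissions and inode. The file names are invented for the demo.

```shell
# ip_awk SCRIPT FILE...: run an awk script over each file and write the result back.
ip_awk() {
  local script=$1 f tmp
  shift
  tmp=$(mktemp) || return 1
  for f in "$@"; do
    # Only overwrite the file if awk succeeded.
    awk "$script" "$f" > "$tmp" && cat "$tmp" > "$f"
  done
  rm -f "$tmp"
}

# Demo in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' '// header' 'print "Hi"' '// keep' > "$dir/a.txt"
printf '%s\n' '' '// header' 'print "Hello"' > "$dir/b.txt"
ip_awk 'NF && ($1 !~ "^//"){f=1} f' "$dir"/*.txt
cat "$dir/a.txt" "$dir/b.txt"
rm -rf "$dir"
```

Note that local is not in POSIX sh, though bash, dash and most other shells accept it.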
In case perl is available then this may also work in slurp mode:
perl -0777 -pe 's~\A(?:\h*(?://.*)?\R+)+~~' file
\A will only match start of the file and (?:\h*(?://.*)?\R+)+ will match 1 or more lines that are blank or have // with optional leading spaces.
With GNU sed:
sed -i -Ez 's/^((\/\/[^\n]*|\s*)\n)+//' file
The ^((\/\/[^\n]*|\s*)\n)+ expression will match one or more lines starting with //, also matching blank lines, only at the start of the file.
Using ed (the file editor that the stream editor sed is based on),
printf '1,/^[^/]/ g|^\(//.*\)\{0,1\}$| d\nw\n' | ed tmp.txt
Some explanations are probably in order.
ed takes the name of the file to edit as an argument, and reads commands from standard input. Each command is terminated by a newline. (You could also read commands from a here document, rather than from printf via a pipe.)
1. 1,/^[^/]/ addresses the first lines in the file, up to and including the first one that does not start with /. (All the lines you want to delete will be included in this set.)
2. g|^\(//.*\)\{0,1\}$|d deletes all the addressed lines that are either empty or do start with //.
3. w saves the changes.
Step 2 is a bit ugly; unfortunately, ed does not support regular expression operators you may take for granted, like ? or |. Breaking the regular expression down a bit:
^ matches the start of the line.
//.* matches // followed by zero or more characters.
\(//.*\)\{0,1\} matches the preceding regular expression 0 or 1 times (i.e., optionally)
$ matches the end of the line.

Modifying a pattern-matched line as well as next line in a file

I'm trying to write a script that, among other things, automatically enable multilib. Meaning in my /etc/pacman.conf file, I have to turn this
#[multilib]
#Include = /etc/pacman.d/mirrorlist
into this
[multilib]
Include = /etc/pacman.d/mirrorlist
without accidentally removing # from lines like these
#[community-testing]
#Include = /etc/pacman.d/mirrorlist
I already accomplished this by using this code
linenum=$(rg -n '\[multilib\]' /etc/pacman.conf | cut -f1 -d:)
sed -i "$((linenum))s/#//" /etc/pacman.conf
sed -i "$((linenum+1))s/#//" /etc/pacman.conf
but I'm wondering, whether this can be solved in a single line of code without any math expressions.
With GNU sed. Find row starting with #[multilib], append next line (N) to pattern space and then remove all # from pattern space (s/#//g).
sed -i '/^#\[multilib\]/{N;s/#//g}' /etc/pacman.conf
If the two lines contain further #, then these are also removed.
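If that caveat matters, for example when the Include line carries a trailing comment, a sketch along these lines strips only the leading # of each of the two lines. The function name uncomment_pair and the "# mirrors" trailing comment are made up for the demo. The substitution uses backslash-newline in the replacement so it should not depend on GNU extensions.

```shell
# uncomment_pair: hypothetical helper; uncomments #[multilib] and the line after it,
# touching only the leading # on each of the two lines.
uncomment_pair() {
  sed '/^#\[multilib\]$/{
N
s/^#//
s/\n#/\
/
}' "$@"
}

printf '%s\n' \
  '#[community-testing]' \
  '#Include = /etc/pacman.d/mirrorlist' \
  '#[multilib]' \
  '#Include = /etc/pacman.d/mirrorlist # mirrors' |
  uncomment_pair
```

With s/#//g the last line would also lose the # in front of "mirrors"; here it is kept.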
Could you please try the following, written with the shown samples only. It assumes you only want to change the multilib line and the line immediately after it.
awk '
/multilib/ || found{
found=$0~/multilib/?1:""
sub(/^#+/,"")
print
}
' Input_file
Explanation:
First, check whether the line contains multilib or the variable found is set; if so, run the block.
Inside the block, set found to 1 if the line contains multilib and clear it otherwise, so that only the line right after multilib is processed.
Use awk's sub function to remove one or more leading # characters.
Then print the current line.
This will work using any awk in any shell on every UNIX box:
$ awk '$0 == "#[multilib]"{c=2} c&&c--{sub(/^#/,"")} 1' file
[multilib]
Include = /etc/pacman.d/mirrorlist
and if you had to uncomment 500 lines instead of 2 lines then you'd just change c=2 to c=500 (as opposed to typing N 500 times as with the currently accepted solution). Note that you also don't have to escape any characters in the string you're matching on. So in addition to being robust and portable this is a much more generally useful idiom to remember than the other solutions you have so far. See printing-with-sed-or-awk-a-line-following-a-matching-pattern/17914105#17914105 for more.
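To make the counter idiom concrete, here is a small sketch that wraps it in a function and runs it on input containing both sections from the question. The function name uncomment_section and its two parameters are invented for illustration. The header is compared as a plain string, so the brackets need no escaping.

```shell
# uncomment_section HEADER N: hypothetical helper; strips one leading # from the
# line equal to HEADER and from the lines after it, N lines in total.
uncomment_section() {
  awk -v hdr="$1" -v n="$2" '
    $0 == hdr { c = n }           # header seen: arm the counter
    c && c-- { sub(/^#/, "") }    # while the counter is non-zero, strip one leading #
    { print }
  '
}

printf '%s\n' \
  '#[community-testing]' \
  '#Include = /etc/pacman.d/mirrorlist' \
  '' \
  '#[multilib]' \
  '#Include = /etc/pacman.d/mirrorlist' |
  uncomment_section '#[multilib]' 2
```

Only the last two lines change; the community-testing block stays commented.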
A perl one-liner:
perl -0777 -api.back -e 's/#(\[multilib]\R)#/$1/' /etc/pacman.conf
modify in place with a backup of original in /etc/pacman.conf.back
If there is only one [multilib] entry, with ed and the shell's printf
printf '/^#\[multilib\]$/;+1s/^#//\n,p\nQ\n' | ed -s /etc/pacman.conf
Change Q to w to edit pacman.conf
Match #[multilib]
; include the next address
+1 the next line (plus one line below)
s/^#// remove the leading #
,p prints everything to stdout
Q exit/quit ed without error message.
-s means do not print any message.
Ed can do this.
cat > edjoin.txt << 'EOF'
/multilib/;+j
s/#//
s/#/\
/
wq
EOF
ed -s pacman.conf < edjoin.txt
rm -v ./edjoin.txt
This will only work on the first match. If you have more matches, repeat as necessary.
This might work for you (GNU sed):
sed '/^#\[multilib\]/,+1 s/^#//' file
Focus on a range of lines (in this case, two) where the first line begins #[multilib] and remove the first character in those lines if it is a #.
N.B. The [ and ] must be escaped in the regexp otherwise they will match a single character that is m,u,l,t,i or b. The range can be extended by changing the integer +1 to +n if you were to want to uncomment n lines plus the matching line.
To remove all comments in a [multilib] section, perhaps:
sed '/^#\?\[[^]]*\]$/h;G;/^#\[multilib\]/M s/^#//;P;d' file

How can I delete the last word in the current line, but only if a pattern occurs on the next line?

The contents of the file are
some line DELETE_ME
some line this_is_the_pattern
If the this_is_the_pattern occurs in the next line, then delete the last word (in this case DELETE_ME) in the current line.
How can I do this using sed or awk? My understanding is that sed is more appropriate for this task than awk is, because awk is suitable for operations on data stored tabular format. If my understanding is incorrect, please let me know.
$ awk '/this_is_the_pattern/{sub(/[^[:space:]]+$/, "", last)} NR>1{print last} {last=$0} END{print last}' file
some line
some line this_is_the_pattern
How it works
This script uses a single variable called last which contains the previous line in the file. In summary, if the current line contains the pattern, then the last word is removed from last. Otherwise, last is printed as is.
In detail, taking each command in turn:
/this_is_the_pattern/{sub(/[^[:space:]]+$/, "", last)}
If this line has the pattern, remove the final word from the last line.
NR>1{print last}
For each line after the first line, print the last line.
last=$0
Save the current line in variable last.
END{print last}
Print the last line from the file.
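As a rough illustration on a slightly longer input, here is the same one-line buffer wrapped in a function that takes the pattern as a parameter. The name drop_word_before and the extra sample lines are made up. [^ \t] stands in for [^[:space:]] purely for compatibility with older awks.

```shell
# drop_word_before PATTERN: hypothetical helper; removes the last word of any line
# whose following line matches PATTERN (an awk regex).
drop_word_before() {
  awk -v pat="$1" '
    $0 ~ pat { sub(/[^ \t]+$/, "", last) }  # current line matches: trim the buffered line
    NR > 1   { print last }                 # emit the buffered (previous) line
             { last = $0 }                  # buffer the current line
    END      { if (NR) print last }         # flush the final line
  '
}

printf '%s\n' 'alpha KEEP' 'beta DELETE_ME' 'gamma this_is_the_pattern' 'delta' |
  drop_word_before 'this_is_the_pattern'
```

The space before the removed word is left in place, so the second output line ends in a trailing blank.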
awk 'NR>1 && /this_is_the_pattern/ {print t;}
NR>1 && !/this_is_the_pattern/ {print f;}
{f=$0;$NF="";t=$0}
END{print f}' input-file
Note that this will modify whitespace in any lines in which the last field is removed, squeezing runs of whitespace into a single space.
You could simplify this to:
awk 'NR>1 { print( /this_is_the_pattern/? t:f)}
{f=$0;$NF="";t=$0}
END{print f}' input-file
and you can resolve the squeezed whitespace issue with:
awk 'NR>1 { print( /this_is_the_pattern/? t:f)}
{f=$0;sub(" [^ ]*$","");t=$0}
END{print f}' input-file
You could use tac to cat the file backwards, so that you see the pattern first. Then set a flag and delete the last word on the next line you see. Then at the end, reverse the file through tac back to the original order.
tac file | awk '/this_is_the_pattern/{f=1;print;next} f==1{sub(/ [^ ]+$/, "");print;f=0}' | tac
Use buffer to keep previous line in memory
sed -n 'H;1h;1!{x;/\nPAGE/ s/[^ ]*\(\n\)/\1/;P;s/.*\n//;h;$p;}' YourFile
Use loop but same concept
sed -n ':cycle
N;/\nPAGE/ s/[^ ]*\(\n\)/\1/;P;s/.*\n//;$p;b cycle' YourFile
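In both cases, the last word of the previous line is removed when the search pattern (PAGE in these commands) appears on the following line, so the test spans two consecutive lines.
Both work on the two most recently read lines: test whether the pattern is on the later one, delete the word from the earlier one if so, then print the earlier line, remove it, and repeat.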
The idiomatic awk solution is simply to keep a buffer of the previous line (or N lines in the general case) so you can test the current line and then modify and/or print the buffer accordingly:
$ awk '
NR>1 {
if (/this_is_the_pattern/) {
sub(/[^[:space:]]+$/,"",prev)
}
print prev
}
{ prev = $0 }
END { print prev }
' file
some line
some line this_is_the_pattern

SED: addressing two lines before match

I want to print the line that is situated 2 lines before the match (pattern).
I tried next:
sed -n ': loop
/.*/h
:x
{n;n;/cen/p;}
s/./c/p
t x
s/n/c/p
t loop
{g;p;}
' datafile
The script:
sed -n "1N;2N;/XXX[^\n]*$/P;N;D"
works as follows:
Read the first three lines into the pattern space, 1N;2N
Search for the test string XXX anywhere in the last line, and if found print the first line of the pattern space, P
Append the next line input to pattern space, N
Delete first line from pattern space and restart cycle without any new read, D, noting that 1N;2N is no longer applicable
This might work for you (GNU sed):
sed -n ':a;$!{N;s/\n/&/2;Ta};/^PATTERN\'\''/MP;$!D' file
This will print the line 2 lines before the PATTERN throughout the file.
This one with grep, a bit simpler solution and easy to read [However need to use one pipe]:
grep -B2 'pattern' file_name | sed -n '1,2p'
If you can use awk try this:
awk '/pattern/ {print b} {b=a;a=$0}' file
This prints the line two lines before each line that matches pattern.
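If you need "N lines before" rather than exactly two, one possible generalization of the a/b variables is a small ring buffer. This is only a sketch: the function name line_before and its parameters are made up.

```shell
# line_before PATTERN N: hypothetical helper; for every line matching PATTERN
# (an awk regex), prints the line N lines above it.
line_before() {
  awk -v pat="$1" -v n="$2" '
    NR > n && $0 ~ pat { print buf[NR % n] }  # slot NR%n still holds line NR-n
    { buf[NR % n] = $0 }                      # then overwrite it with the current line
  '
}

printf '%s\n' one two three four five six seven eight nine ten |
  line_before 'six' 2
```

Only the last N lines are kept in memory at any time, so the file size does not matter.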
I've tested your sed command but the result is strange (and obviously wrong), and you didn't give any explanation. You will have to save three lines in a buffer (named hold space), do a pattern search with the newest line and print the oldest one if it matches:
sed -n '
## At the beginning read three lines.
1 { N; N }
## Append them to "hold space". In following iterations it will append
## only one line.
H
## Get content of "hold space" to "pattern space" and check if the
## pattern matches. If so, extract content of first line (until a
## newline) and exit.
g
/^.*\nsix$/ {
s/^\n//
P
q
}
## Remove the old of the three lines saved and append the new one.
s/^\n[^\n]*//
h
' infile
Assuming an input file (infile) with the following content:
one
two
three
four
five
six
seven
eight
nine
ten
It searches for six and outputs:
four
Here are some other variants:
awk '{a[NR]=$0} /pattern/ {f=NR} END {print a[f-2]}' file
This stores all lines in an array a. When the pattern is found, it stores the line number in f.
At the end, it prints the line two before that line number.
PS may be slow with large files
Here is another one:
awk 'FNR==NR && /pattern/ {f=NR;next} f-2==FNR' file{,}
This reads the file twice (file{,} is the same as file file)
On the first pass it finds the pattern and stores the line number in the variable f.
On the second pass it prints the line two before the line number in f.

What does this sed expression from todo.sh do?

What does the sed expression: G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P do? Exactly what does it match and how does it match it?
It's from todo.sh. In context:
archive()
{
#defragment blank lines
sed -i.bak -e '/./!d' "$TODO_FILE" ## delete all empty lines
[ $TODOTXT_VERBOSE -gt 0 ] && grep "^x " "$TODO_FILE" ## if verbose mode print completed tasks..
grep "^x " "$TODO_FILE" >> "$DONE_FILE" ## append completed tasks to $DONE_FILE
sed -i.bak '/^x /d' "$TODO_FILE" ## delete completed tasks
cp "$TODO_FILE" "$TMP_FILE"
sed -n 'G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P' "$TMP_FILE" > "$TODO_FILE"
## G; Add a newline
## s/\n/&&/; Substitute newline with && (two newlines?)
## /^\([ ~-]*\n\).*\n\1/d; Delete duplicate lines???
## s/\n// Remove newlines
## h Hold: copy pattern space to buffer
## P Print first line of pattern space
if [ $TODOTXT_VERBOSE -gt 0 ]; then
echo "TODO: $TODO_FILE archived."
fi
}
Ok, you've got some of the story already. Recall that the sed expression is executed for each input line. So the G at the beginning appends the contents of the hold space to the current line (with a newline in between). The contents of the hold space is empty initially but expanded by the h command at the end of each input cycle.
Then s/\n/&&/ duplicates the first newline only, the one between the current line and what was grabbed from the hold space. This is in preparation for the next command. /^\([ -~]*\n\).*\n\1/ indeed matches if the current line is identical to a line in the hold space:
    ^\([ -~]*\n\) matches a line at the beginning of the buffer¹
        Note that this matches only if the line contains only printable ASCII characters.
        If your system supports locales, ^\([[:print:]]*\n\) would be better.
    .*\n matches at least one subsequent line
    \1 matches a line identical to the first line
The extra newline added by the previous s command takes care of the case when the duplicate is the very first line from the hold space. The point of the \n\1 is to “anchor” the duplicate at the beginning of a line, otherwise bar would be considered a duplicate of foobar. If the current line is a duplicate, the d command discards it and execution branches to the next line.
If the current line is not a duplicate, s/\n// discards that extra newline (again, no g modifier, so only the first newline is removed). Then the h command results in the hold space containing what it contained before, with the current line prepended. Finally P prints the current input line.
Ok, now what does the hold space contain? It starts empty, then gets each successive line prepended unless it's a duplicate. So the hold space contains the input lines, in reverse order, minus the duplicates.
¹ Uh, I don't know how you did that, but that should be [ -~], not [ ~-] which wouldn't make any sense.
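A small worked example may help here. This is just the corrected one-liner wrapped in a function (the name dedupe is made up) and fed input where one line is a suffix of another, to show the anchoring at work. LC_ALL=C is set on the assumption that you want the [ -~] range to mean plain printable ASCII.

```shell
# dedupe: hypothetical wrapper around the one-liner; keeps the first occurrence
# of each line and drops later repeats, preserving the original order.
dedupe() {
  LC_ALL=C sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' "$@"
}

# "bar" is not treated as a duplicate of "foobar", because \n\1 only matches
# at the start of a line held in the hold space.
printf '%s\n' foobar bar foo bar foobar baz | dedupe
```

This prints foobar, bar, foo and baz, one per line: the second bar and the second foobar are dropped.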
Here's another way of doing this, if you have a POSIX-conforming set of tools (Single Unix v2 is good enough).
<"$TMP_FILE" \
nl -s: | # add line numbers
sort -t: -k2 -u | # sort, ignoring the line numbers, and remove duplicates
sort -t: -k1 -n | # sort by line number
cut -d: -f2- # cut out the line numbers
Oh, you wanted to do this legibly and concisely? Just use awk.
<"$TMP_FILE" awk '!seen[$0] {++seen[$0]; print}'
If the current line hasn't been seen yet, mark it as seen, and print it.
Note that like the sed method, the awk method essentially stores the whole file in memory. The method above using sort has the advantage that only sort needs to keep more than one line of input at a time, and it's designed for this.
Of course, if you don't care about the order of the lines, it's as simple as sort -u.
After Gilles presented his excellent answer I found Famous Sed One-Liners Explained, which includes this exact sed expression; adding here for reference:
70. Delete duplicate, nonconsecutive lines from a file.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
This is a very tricky one-liner. It stores the unique lines in hold buffer and at each newly read line, tests if the new line already is in the hold buffer. If it is, then the new line is purged. If it's not, then it's saved in hold buffer for future tests and printed.
A more detailed description - at each line this one-liner appends the contents of hold buffer to pattern space with "G" command. The appended string gets separated from the existing contents of pattern space by "\n" character. Next, a substitution is made that substitutes the "\n" character with two "\n\n". The substitute command "s/\n/&&/" does that. The "&" means the matched string. As the matched string was "\n", then "&&" is two copies of it "\n\n". Next, a test "/^\([ -~]*\n\).*\n\1/" is done to see if the contents of capture group 1 is repeated. The capture group 1 is all the characters from space " " to "~" (which include all printable chars). The "[ -~]*" matches that. Replacing one "\n" with two was the key idea here. As "\([ -~]*\n\)" is greedy (matches as much as possible), the double newline makes sure that it matches as little text as possible. If the test is successful, the current input line was already seen and "d" purges the whole pattern space and starts script execution from the beginning. If the test was not successful, the doubled "\n\n" gets replaced with a single "\n" by "s/\n//" command. Then "h" copies the whole string to hold buffer, and "P" prints the new line.