Grepping out a block of text, regex

Given a large log file, what is the best way to grep a block of text?
text to be ignored
more text to be ignored
--- <---- start capture here
lots of
text with separators like "---"
---
spanning
multiple lines
--- <---- end capture here
text to be ignored
more text to be ignored
What is known?
Max number of characters in line (55 but may be less)
Number of lines in a block
Separator (which may repeat itself)
What regular expression would match this block? Desired output: list of blocks of text.
Please assume Linux command line environment

Several years ago I used this to split patches into hunks:
sed -e '$ {x;q}' -e '/##/ !{H;d}' -e '/##/ x' # note - i know sed better now
Replace /##/ with /---/.
To remove everything before first '---' and after last '---' add -e '1,/---/d' and remove the whole -e '$ {x;q}'.
Result would be like this:
sed -e '1,/---/d' -e '/---/ !{H;d}' -e x
Just tested it and it works with the given example.

Keep it simple:
$ awk 'NR==FNR {if (/^---/) { if (!start) start=NR; end=NR } next} FNR>=start && FNR<=end' file file
--- <---- start capture here
lots of
text with separators like "---"
---
spanning
multiple lines
--- <---- end capture here
$ awk 'NR==FNR {if (/^---/) { if (!start) start=NR; end=NR } next} FNR>start && FNR<end' file file
lots of
text with separators like "---"
---
spanning
multiple lines

If you have enough memory, you can use the following line. Note, however, that it will read the whole logfile into memory!
perl -0777 -lnE 'm{ ^--- .+ ^--- }xms and say $&' logfile
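If holding the whole log in memory is a concern, the same first-to-last-separator capture can be sketched in a single pass with awk. The function name extract_block is made up for illustration, and the sketch assumes that separator lines begin with --- at the start of the line. It buffers only the lines seen since the most recent separator; text after the final separator is still buffered until end of input and then discarded.

```shell
# Hypothetical helper: print everything from the first line starting
# with "---" through the last such line, reading the input once.
extract_block() {
    awk '
        /^---/ {
            if (started) printf "%s", buf   # flush lines held since the previous separator
            started = 1
            buf = ""
            print                           # the separator line itself
            next
        }
        started { buf = buf $0 "\n" }       # hold lines until another separator confirms them
    ' "$@"
}
```

Usage would be along the lines of extract_block logfile, or as a filter at the end of a pipe.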

Related

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but the newlines cause errors, so I need to replace them with a space.
I run some other perl commands on these files to fix other errors, so I would like a solution as a one-line perl command.
I've found some similar questions on Stack Overflow, but they don't quite do the same thing, and I couldn't solve my problem with the solutions mentioned there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....

Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.

In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This appends the next line to the pattern space, and as long as the appended line doesn't start with four or more digits followed by a | char (see the [0-9]\{4,\}| POSIX BRE regex pattern), the line break between the two is replaced with a space. The search and replace repeats until the next line does start a new record or the end of the file is reached.
With perl, if the file is huge but still fits into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file
With -0777, you slurp the file, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) not followed by four or more digits and a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot succeed by backtracking into the line break sequence.

With your shown samples, this can be done simply in awk. Written and tested in GNU awk; it should work in any awk. It should be fast even on huge files (better than slurping the whole file into memory, given that the OP may use it on huge files).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##Checking condition if number of " are EVEN or not, because if they are NOT even then it means they are NOT closed properly.
if(val==""){ val=$0 } ##Checking condition if val is NULL then set val to current line.
else {print val $0;val=""} ##Else(if val NOT NULL) then print val current line and nullify val here.
next ##next will skip further statements from here.
}
1 ##In case number of " are EVEN in any line it will skip above condition(gsub one) and simply print the line.
' Input_file ##Mentioning Input_file name here.
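The command above pairs up exactly two lines and joins them without a separator. A sketch that keeps a running quote count instead lets a quoted field span any number of lines and turns each removed newline into a space. The function name join_quoted_newlines is hypothetical, and the sketch assumes that double quotes are never escaped or doubled inside a field.

```shell
# Hypothetical helper: join physical lines until the number of double
# quotes in the accumulated record is even, separating them with a space.
join_quoted_newlines() {
    awk '
        {
            line = $0
            quotes += gsub(/"/, "&", line)        # count quotes seen in the current record
            buf = (buf == "" ? $0 : buf " " $0)   # append this line to the pending record
            if (quotes % 2 == 0) {                # all quotes closed: record is complete
                print buf
                buf = ""
                quotes = 0
            }
        }
        END { if (buf != "") print buf }          # flush an unterminated record
    ' "$@"
}
```

Because only the current record is held in memory, this should behave on large files in the same way as the shorter command.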

How can I delete the lines starting with "//" (e.g., file header) which are at the beginning of a file?

I want to delete the header from all the files, and the header has the lines starting with //.
If I want to delete all the lines that starts with //, I can do following:
sed '/^\/\//d'
But, that is not something I need to do. I just need to delete the lines in the beginning of the file that starts with //.
Sample file:
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Update:
If there is a new line in the beginning or in-between, it doesn't work. Is there any way to take care of that scenario?
Sample file:
< new empty line >
// This is the header
< new empty line >
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Can someone suggest a way to do this? Thanks in advance!
Update: The accepted answer works well for white space in the beginning or in-between.

Could you please try following. This also takes care of new line scenario too, written and tested in https://ideone.com/IKN3QR
awk '
(NF == 0 || /^[[:blank:]]*\/\//) && !found{
next
}
NF{
found=1
}
1
' Input_file
Explanation: If a line is either empty or starts with // (optionally preceded by blanks) and the variable found is not yet set, skip it. Once the first non-empty line that does not start with // is seen, found is set, so that line and every following line are printed through the end of Input_file.

With sed:
sed -n '1{:a; /^[[:space:]]*\/\/\|^$/ {n; ba}};p' file
print "Hi"
// This should not be deleted
print "Hello"
Slightly shorter version with GNU sed:
sed -nE '1{:a; /^\s*\/\/|^$/ {n; ba}};p' file
Explanation:
1 { # execute this block on the first line only
:a; # this is a label
/^\s*\/\/|^$/ { n; # on lines matching `^\s*\/\/` or `^$`, do: read the next line
ba } # and go to label :a
}; # end block
p # print line unchanged:
# we only get here after the header or when it's not found
sed -n makes sed not print any lines without the p command.
Edit: updated the pattern to also skip empty lines.

It sounds like you just want to start printing from the first line that's neither blank nor just a comment:
$ awk 'NF && ($1 !~ "^//"){f=1} f' file
print "Hi"
// This should not be deleted
print "Hello"
The above simply sets a flag f when it finds such a line and prints every line from then on. It will work using any awk in any shell on every UNIX box.
Note that, unlike some of the potential solutions posted, it doesn't store more than 1 line at a time in memory and so will work no matter how large your input file is.
It was tested against this input:
$ cat file
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
To run the above on many files at once, modifying each file in place, use this with GNU awk:
awk -i inplace 'NF && ($1 !~ "^//"){f=1} f' *
and this with any awk:
ip_awk() { local f t=$(mktemp) && for f in "${@:2}"; do awk "$1" "$f" > "$t" && mv -- "$t" "$f"; done; }
ip_awk 'NF && ($1 !~ "^//"){f=1} f' *
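The flag approach can also be wrapped in a small function and checked against the second sample from the question, the one with empty lines before and inside the header. The function name strip_header is made up for this sketch; the awk program is the one shown above.

```shell
# Hypothetical wrapper around the flag-based awk program: start printing
# at the first line that is neither blank nor a // comment.
strip_header() {
    awk 'NF && ($1 !~ "^//") { f = 1 } f' "$@"
}
```

Called as strip_header file, it writes the stripped content to standard output and leaves the file untouched.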

In case perl is available then this may also work in slurp mode:
perl -0777 -pe 's~\A(?:\h*(?://.*)?\R+)+~~' file
\A will only match start of the file and (?:\h*(?://.*)?\R+)+ will match 1 or more lines that are blank or have // with optional leading spaces.

With GNU sed:
sed -i -Ez 's/^((\/\/[^\n]*|\s*)\n)+//' file
The ^((\/\/[^\n]*|\s*)\n)+ expression will match one or more lines starting with //, also matching blank lines, only at the start of the file.

Using ed (the file editor that the stream editor sed is based on),
printf '1,/^[^/]/ g|^\(//.*\)\{0,1\}$| d\nw\n' | ed tmp.txt
Some explanations are probably in order.
ed takes the name of the file to edit as an argument, and reads commands from standard input. Each command is terminated by a newline. (You could also read commands from a here document, rather than from printf via a pipe.)
1. 1,/^[^/]/ addresses the first lines in the file, up to and including the first one that does not start with /. (All the lines you want to delete will be included in this set.)
2. g|^\(//.*\)\{0,1\}$|d deletes all the addressed lines that are either empty or do start with //.
3. w saves the changes.
Step 2 is a bit ugly; unfortunately, ed does not support regular expression operators you may take for granted, like ? or |. Breaking the regular expression down a bit:
^ matches the start of the line.
//.* matches // followed by zero or more characters.
\(//.*\)\{0,1\} matches the preceding regular expression 0 or 1 times (i.e., optionally)
$ matches the end of the line.

Can not replace multiple empty lines with one

Why does the following not replace multiple empty lines with one?
$ cat some_random_text.txt
foo
bar
test
and this does not work:
$ cat some_random_text.txt | perl -pe "s/\n+/\n/g"
foo
bar
test
I am trying to replace the multiple new lines (i.e. empty lines) to a single empty new line but the regex I use for that does not work as you can see in the example snippet.
What am I messing up?
Expected outcome is:
foo
bar
test

The reason it doesn't work is that -p tells perl to process the input line by line, and there's never more than one \n in a single line.
Better idea:
perl -00 -lpe 1
-00: Enable paragraph mode (input records are terminated by any sequence of 2+ newlines).
-l: Enable autochomp mode (the input record separators are trimmed automatically, so since we're in paragraph mode, all trailing newlines are removed, and output records get "\n\n" added).
-p: Enable automatic input/output (the main code is executed for each input record; anything left in $_ is printed automatically).
-e 1: Use a dummy main program that does nothing.
Taken all together this does nothing except normalize paragraph terminators to exactly two newlines.

You are executing the following program:
LINE: while (<>) {
s/\n+/\n/g;
}
continue {
die "-p destination: $!\n" unless print $_;
}
Since you are reading one line at a time, and since a line is a sequence of characters that aren't line feeds terminated by a line feed, your pattern will never match more than one newline.
The simple fix is to tell Perl to treat the entire file as one line. Also, you don't want to replace every line feed, but just those found in sequence of two or more, and you want to replace the sequence with two line feeds.
perl -0777pe's/\n\n\K\n+//g; s/^\n+//; s/\n\K\n\z//' some_random_text.txt
The second and third substitutions ensure there are no blank lines at the start and end of the file.
While reading the entire file into memory is easy, it's not necessary. The desired output can also be achieved by maintaining a flag that indicates whether the previous line was blank or not.
perl -ne'if (/\S/) { print "\n" if $f; print; $f=0 } else { $f=1 }' some_random_text.txt
This solution also removes blank lines from the start and end of the file.
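The same flag idea can be sketched in awk for systems where perl is not at hand. The function name squeeze_blank_lines is hypothetical, and the extra seen flag is an addition in this sketch: it keeps leading blank lines from producing an empty first line. Lines containing only whitespace are treated as blank.

```shell
# Hypothetical helper: collapse each run of blank lines to one empty line
# and drop blank lines at the start and end of the input.
squeeze_blank_lines() {
    awk '
        NF {                                # line has at least one non-blank field
            if (pending && seen) print ""   # one empty line between paragraphs
            print
            seen = 1
            pending = 0
            next
        }
        { pending = 1 }                     # remember that blank lines were skipped
    ' "$@"
}
```

Like the perl version, this reads one line at a time, so memory use does not depend on the file size.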

Given:
$ echo "$txt"
foo
bar
test
You can use sed to reduce the runs of blank lines to a single \n:
$ echo "$txt" | sed '/^$/N;/^\n$/D'
foo
bar
test
Even easier, you can use cat -s:
$ echo "$txt" | cat -s # same output
In perl the idiomatic 1 liner is to use -00 for paragraph mode:
$ echo "$txt" | perl -00pe0 # same output
And in awk you have the flexibility of using paragraph mode by setting RS= and then set ORS= to what you want the replacement for runs of \n to be:
$ echo "$txt" | awk '1' RS= ORS="\n\n" # same output
ikegami correctly states that printf 'a\n\n' | ... will produce two trailing newlines with these solutions. That may or may not be an issue.

sed: remove strings between two patterns leaving the 2nd pattern intact (half inclusive)

I am trying to filter out text between two patterns, I've seen a dozen examples but didn't manage to get exactly what I want:
Sample input:
START LEAVEMEBE text
data
START DELETEME text
data
more data
even more
START LEAVEMEBE text
data
more data
START DELETEME text
data
more
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I want to stay with:
START LEAVEMEBE text
data
START LEAVEMEBE text
data
more data
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I tried running sed with:
sed 's/^START DELETEME/,/^[^ ]/d'
And got an inclusive removal, I tried adding "exclusions" (not sure if I really understand this syntax well):
sed 's/^START DELETEME/,/^[^ ]/{/^[^ ]/!d}'
But my "START DELETEME" line is still there (yes, I can grep it out, but that's ugly :) and besides - it DOES remove the empty line in this sample as well and I'd like to leave empty lines if they are my end pattern intact )
I am wondering if there is a way to do it with a single sed command.
I have an awk script that does this well:
BEGIN { flag = 0 }
{
if ($0 ~ "^START DELETEME")
flag=1
else if ($0 !~ "^ ")
flag=0
if (flag != 1)
print $0
}
But as you know "A is for awk which runs like a snail". It takes forever.
Thanks in advance.
Dave.

Using a loop in sed:
sed -n '/^START DELETEME/{:l n; /^[ ]/bl};p' input

GNU sed
sed '/LEAVEMEBE/,/DELETEME/!d;{/DELETEME/d}' file

I would stick with awk:
awk '
/LEAVE|SOMETHING/{flag=1}
/DELETE/{flag=0}
flag' file
But if you still prefer sed, here's another way:
sed -n '
/LEAVE/,/DELETE/{
/DELETE/b
p
}
' file
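For comparison, here is a sketch that follows the rule from the question rather than the LEAVE and DELETE keywords: drop the START DELETEME line and the continuation lines after it, and keep whichever line ends the block. It assumes, as the asker's awk script does, that continuation lines begin with a space; the function name drop_deleteme_blocks is made up for illustration.

```shell
# Hypothetical helper: remove each "START DELETEME" line and the indented
# lines that follow it; the first non-indented line ends the block and is kept.
drop_deleteme_blocks() {
    awk '
        /^START DELETEME/ { skip = 1; next }   # drop the block header itself
        skip && /^ /      { next }             # drop indented continuation lines
        { skip = 0; print }                    # any other line ends the block and is kept
    ' "$@"
}
```

An empty line that terminates a block is printed as well, which matches the asker's wish to keep end patterns intact.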

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing

My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt

You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print the resulting line
-r this makes it so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'

I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl. The -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read and, if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also, e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt

If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).

You can use GNU awk (gawk) with match() to access the captured group; the three-argument form that fills an array is a gawk extension:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.
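Where neither gawk nor grep -P is available, the two-argument match() with RSTART and RLENGTH can do the same job. The sketch below uses a made-up function name, extract_numbers, and hard-codes the three-character abc prefix and xyz suffix from the example; it also loops so that several matches on one line are all printed.

```shell
# Hypothetical helper: print the digits of every abc<digits>xyz occurrence,
# one per output line, without relying on gawk extensions.
extract_numbers() {
    awk '
        {
            rest = $0
            # match() sets RSTART and RLENGTH in every POSIX awk
            while (match(rest, /abc[0-9]+xyz/)) {
                # skip the 3-character "abc" prefix and drop the 3-character "xyz" suffix
                print substr(rest, RSTART + 3, RLENGTH - 6)
                rest = substr(rest, RSTART + RLENGTH)   # continue after this match
            }
        }
    ' "$@"
}
```

From a script, the value could then be captured with something like myvalue=$(extract_numbers input.txt).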

perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before abc and after xyz to get rid of the text around the number in the substitution. Anchoring the group with abc and xyz also keeps the greedy leading .* from swallowing all but the last digit.

If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp

why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution

there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.

you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"

For awk. I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default: do nothing
}

gawk '/.*abc([0-9]+)xyz.*/' file
