Sed not reading multiline input? - regex

I have some text as below:
path => ["/home/Desktop/**/auditd.log",
"/home/Desktop/**/rsyslog*.log",
"/home/Desktop/**/snmpd.log",
"/home/Desktop/**/kernel.log",
"/home//Desktop/**/ntpd.log",
"/home/Desktop/**/mysql*.log",
"/home/Desktop/**/sshd.log",
"/hme/Desktop/**/su.log",
"/home/Desktop/**/run-parts(.log"
]
I want to extract the values inside [ ], so I am doing:
sed -n 's/.*\[\(.*\)\]/\1/p'
Sed is not returning anything.
If I do sed -n 's/.*\[\(.*\)log/\1/p it's returning properly the string between [ and log.
"/home/Desktop/**/auditd.",
So it's able to search within the line.
How to make this work??
EDIT:
I created a file with content:
path => [asd,masd,dasd
sdalsd,ad
asdlmas;ldasd
]
When I do grep -o '\[.*\]' it does not work but grep -o '\[.*' this returns the 1st line [asd,masd,dasd. So it's working for single line not for multiple lines.

Try doing this :
$ grep -o '".*",?' file
OUTPUT:
"/home/Desktop/**/auditd.log",
"/home/Desktop/**/rsyslog*.log",
"/home/Desktop/**/snmpd.log",
"/home/Desktop/**/kernel.log",
"/home//Desktop/**/ntpd.log",
"/home/Desktop/**/mysql*.log",
"/home/Desktop/**/sshd.log",
"/hme/Desktop/**/su.log",
"/home/Desktop/**/run-parts(.log"
-o for grep print only the matching part
" is a literal double quote
.* is anything
" si the closing double quote
, is a literal double quote
? mean o or 1 occurrence

Well, I was a bit too slow, but I think the question of how to apply sed substitutions to a whole file as a block rather than on a line-by-line basis merits a general answer, so I'll leave one here.
In your specific case, you could use this pattern:
sed -n 'H; $ { x; s/.*\[\(.*\)\].*/\1/; p; }' foo.txt
The general trick is
sed -n 'H; $ { x; s/pattern/replacement/flags; p; }' file
What this means is: Every line that comes in is appended to the hold buffer (with the H command), and at the end of the file ($), when the whole file is in the hold buffer, the stuff between the brackets is executed. In that block, the hold buffer is swapped with the pattern space (x), then the substitution is done, and what remains in the pattern space is printed (p).
EDIT: One caveat of this simple form is that it doesn't work properly if your pattern wants to match the beginning of the file; for reasons the hold buffer is not entirely empty when sed is first called (it contains an empty line). If this is important, the slightly more complicated form
sed -n '1 h; 1 !H; $ { x; s/pattern/replacement/flags; p; }' file
fixes it. This will use h instead of H for the first line, which overwrites the hold buffer with the pattern space rather than appending the pattern space to the hold buffer.

You can do it with awk by replacing .*[ or ] or white spaces with nothing:
$ awk '{gsub(/.*\[|\]| /, ""); print}' filename
"/home/Desktop/**/auditd.log",
"/home/Desktop/**/rsyslog*.log",
"/home/Desktop/**/snmpd.log",
"/home/Desktop/**/kernel.log",
"/home//Desktop/**/ntpd.log",
"/home/Desktop/**/mysql*.log",
"/home/Desktop/**/sshd.log",
"/hme/Desktop/**/su.log",
"/home/Desktop/**/run-parts(.log"

from your sample, it could also be (treat only first and last line)
sed '1s/^[^[]*\[//;$d' YourFile

Related

How can I delete the lines starting with "//" (e.g., file header) which are at the beginning of a file?

I want to delete the header from all the files, and the header has the lines starting with //.
If I want to delete all the lines that starts with //, I can do following:
sed '/^\/\//d'
But, that is not something I need to do. I just need to delete the lines in the beginning of the file that starts with //.
Sample file:
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Update:
If there is a new line in the beginning or in-between, it doesn't work. Is there any way to take care of that scenario?
Sample file:
< new empty line >
// This is the header
< new empty line >
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
Expected output:
print "Hi"
// This should not be deleted
print "Hello"
Can someone suggest a way to do this? Thanks in advance!
Update: The accepted answer works well for white space in the beginning or in-between.
Could you please try following. This also takes care of new line scenario too, written and tested in https://ideone.com/IKN3QR
awk '
(NF == 0 || /^[[:blank:]]*\/\//) && !found{
next
}
NF{
found=1
}
1
' Input_file
Explanation: Simply checking conditions if a line either is empty OR starting from // AND variable found is NULL then simply skip those lines. Once any line without // found then setting variable found here so all next coming lines should be printed from line where it's get set to till end of Input_file printed.
With sed:
sed -n '1{:a; /^[[:space:]]*\/\/\|^$/ {n; ba}};p' file
print "Hi"
// This should not be deleted
print "Hello"
Slightly shorter version with GNU sed:
sed -nE '1{:a; /^\s*\/\/|^$/ {n; ba}};p' file
Explanation:
1 { # execute this block on the fist line only
:a; # this is a label
/^\s*\/\/|^$/ { n; # on lines matching `^\s*\/\/` or `^$`, do: read the next line
ba } # and go to label :a
}; # end block
p # print line unchanged:
# we only get here after the header or when it's not found
sed -n makes sed not print any lines without the p command.
Edit: updated the pattern to also skip empty lines.
I sounds like you just want to start printing from the first line that's neither blank nor just a comment:
$ awk 'NF && ($1 !~ "^//"){f=1} f' file
print "Hi"
// This should not be deleted
print "Hello"
The above simply sets a flag f when it finds such a line and prints every line from then on. It will work using any awk in any shell on every UNIX box.
Note that, unlike some of the potential solutions posted, it doesn't store more than 1 line at a time in memory and so will work no matter how large your input file is.
It was tested against this input:
$ cat file
// This is the header
// This should be deleted
print "Hi"
// This should not be deleted
print "Hello"
To run the above on many files at once and modify each file as you go is this with GNU awk:
awk -i inplace 'NF && ($1 !~ "^//"){f=1} f' *
and this with any awk:
ip_awk() { local f t=$(mktemp) && for f in "${#:2}"; do awk "$1" "$f" > "$t" && mv -- "$t" "$f"; done; }
ip_awk 'NF && ($1 !~ "^//"){f=1} f' *
In case perl is available then this may also work in slurp mode:
perl -0777 -pe 's~\A(?:\h*(?://.*)?\R+)+~~' file
\A will only match start of the file and (?:\h*(?://.*)?\R+)+ will match 1 or more lines that are blank or have // with optional leading spaces.
With GNU sed:
sed -i -Ez 's/^((\/\/[^\n]*|\s*)\n)+//' file
The ^((\/\/[^\n]*|\s*)\n)+ expression will match one or more lines starting with //, also matching blank lines, only at the start of the file.
Using ed (the file editor that the stream editor sed is based on),
printf '1,/^[^/]/ g|^\(//.*\)\{0,1\}$| d\nw\n' | ed tmp.txt
Some explanations are probably in order.
ed takes the name of the file to edit as an argument, and reads commands from standard input. Each command is terminated by a newline. (You could also read commands from a here document, rather than from printf via a pipe.)
1,/^[^/]/ addresses the first lines in the file, up to and including the first one that does not start with /. (All the lines you want to delete will be included in this set.)
g|^\(//.*\)\{0,1\}$|d deletes all the addressed lines that are either empty or do start with //.
w saves the changes.
Step 2 is a bit ugly; unfortunately, ed does not support regular expression operators you may take for granted, like ? or |. Breaking the regular expression down a bit:
^ matches the start of the line.
//.* matches // followed by zero or more characters.
\(//.*\)\{0,1\} matches the preceding regular expression 0 or 1 times (i.e., optionally)
$ matches the end of the line.

Can not replace multiple empty lines with one

Why does the following not replace multiple empty lines with one?
$ cat some_random_text.txt
foo
bar
test
and this does not work:
$ cat some_random_text.txt | perl -pe "s/\n+/\n/g"
foo
bar
test
I am trying to replace the multiple new lines (i.e. empty lines) to a single empty new line but the regex I use for that does not work as you can see in the example snippet.
What am I messing up?
Expected outcome is:
foo
bar
test
The reason it doesn't work is that -p tells perl to process the input line by line, and there's never more than one \n in a single line.
Better idea:
perl -00 -lpe 1
-00: Enable paragraph mode (input records are terminated by any sequence of 2+ newlines).
-l: Enable autochomp mode (the input record separators are trimmed automatically, so since we're in paragraph mode, all trailing newlines are removed, and output records get "\n\n" added).
-p: Enable automatic input/output (the main code is executed for each input record; anything left in $_ is printed automatically).
-e 1: Use a dummy main program that does nothing.
Taken all together this does nothing except normalize paragraph terminators to exactly two newlines.
You are executing the following program:
LINE: while (<>) {
s/\n+/\n/g;
}
continue {
die "-p destination: $!\n" unless print $_;
}
Since you are reading one line at at time, and since a line is a sequence of characters that aren't line feeds terminated by a line feed, your pattern will never match more than one newline.
The simple fix is to tell Perl to treat the entire file as one line. Also, you don't want to replace every line feed, but just those found in sequence of two or more, and you want to replace the sequence with two line feeds.
perl -0777pe's/\n\n\K\n+//g; s^\n+//; s/\n\K\n\z//' some_random_text.txt
The second and third substitutions ensure there are no blank lines at the start and end of the file.
While reading the entire file into memory is easy, it's not necessary. The desired output can also be achieved by maintaining a flag that indicates whether the previous line was blank or not.
perl -ne'if (/\S/) { print "\n" if $f; print; $f=0 } else { $f=1 }' some_random_text.txt
This solution also removes blank lines from the start and end of the file.
Given:
$ echo "$txt"
foo
bar
test
You can use sed to reduce the runs of blank lines to a single \n:
$ echo "$txt" | sed '/^$/N;/^\n$/D'
foo
bar
test
Even easier, you can use cat -s:
$ echo "$txt" | cat -s # same output
In perl the idiomatic 1 liner is to use -00 for paragraph mode:
$ echo "$txt" | perl -00pe0 # same output
And in awk you have the flexibility of using paragraph mode by setting RS= and then set ORS= to what you want the replacement for runs of \n to be:
$ echo "$txt" | awk '1' RS= ORS="\n\n" # same output
ikegami correctly states that printf 'a\n\n' | ... will produce two trailing spaces with these solutions. That may or may not be an issue.

How can I merge multiple blocks/lines with sed or regex?

Is it possible to merge multiple blocks/lines into a "single" line?
So basically if the next line starts with the same "#Msg" tag then append it to the previous line. (Hard to explain, but my example speaks for itself) (The blocks are separated by a new/blank line)
My input file looks like this:
#Msg,00000
#Msg,00001
#Msg,00002
#Msg,00003
#Msg,00004
#Msg,00005
#Msg,00006
#Msg,00007
#Msg,00008
#Msg,00009
#Msg,00010
#Msg,00011
Output should be like this:
#Msg,00000
#Msg,00001 #Msg,00002
#Msg,00003 #Msg,00004
#Msg,00005
#Msg,00006 #Msg,00007 #Msg,00008
#Msg,00009
#Msg,00010 #Msg,00011
Any advice is very welcome.
This would be pretty easy to do in Perl:
perl -00 -ple 'tr/\n/ /'
-e CODE specifies the program.
-p wraps a read/write line loop around it (by default it reads from STDIN, but you can also specify one or more filenames on the command line).
-00 specifies that the input "lines" are actually paragraphs.
-l has two effects: Incoming line terminators are automatically stripped from lines, and outgoing lines get line terminators added to them (and because we used -00 (paragraph mode), our line terminator is actually \n\n).
To recap:
We read the input one paragraph at a time. For each paragraph, we remove any trailing newlines. We then translate every newline to a space. Finally we output the transformed paragraph, followed by \n\n.
No point in trying to produce a shorter code than is possible with Perl!
Collect lines from the input file in list group until a blank line appears. Then output the contents of group, empty it and start again. When end-of-file is encountered output whatever is in group, if it is non-empty.
group = []
with open('vollschauer.txt') as vollschauer:
for line in vollschauer:
line = line.rstrip()
if line:
group.append(line)
else:
if group:
print (' '.join(group))
print()
group = []
if group:
print (' '.join(group))
group = []
$ awk -v RS= -v ORS='\n\n' '{$1=$1}1' file
#Msg,00000
#Msg,00001 #Msg,00002
#Msg,00003 #Msg,00004
#Msg,00005
#Msg,00006 #Msg,00007 #Msg,00008
#Msg,00009
#Msg,00010 #Msg,00011
If you insist on using sed, this should do the trick:
sed -r ':a; N; /^(#[^,]+,).*\n\1/! { P; D }; s/\n/ /; ba' file
It takes different tags into account. Such tags won't be grouped together (that's what I understood is the desired behavior):
$ cat file
#Msg,00000
#Msg,00001
#Hello,00002
#Hello,00003
#What,00004
#What,00005
$ sed -r ':a; N; /^(#[^,]+,).*\n\1/! { P; D }; s/\n/ /; ba' file
#Msg,00000 #Msg,00001
#Hello,00002
#Hello,00003
#What,00004 #What,00005
Note that this solution uses GNU sed.
This might work for you (GNU sed):
sed ':a;N;/^$/M!s/\n/ /;ta' file
Gather up lines, replacing each newline by a space until an empty line.
N.B. The use of the M flag on the repexp /^$/ which matches an empty line on a pattern space containing multiple lines.

Finding and replacing a numeric string between colons, before a space, using sed?

I am attempting to change all coordinate information in a fastq file to zeros. My input file is composed of millions of entries in the following repeating 4-line structure:
#HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I would like to replace the two numeric strings in the first line 16030:89299 with zeros in a generic way, such that any numeric string between the colons, before the space, is replaced. I would like the output to appear as follows, replacing the two strings globally throughout the file with zeros:
#HWI-SV007:140:C173GACXX:6:2215:0:0 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I am attempting to do this using the following sed:
sed 's/:^[0-9]+$:^[0-9]+$\s/:0:0 /g'
However, this does not behave as expected.
I think you will need to use sed -r option.
Also, ^ matches beginning of the line and $ matches end of the line.
Thus this is the command line that works against your sample.
sed -r 's/:[0-9]+:[0-9]+\s/:0:0 /g'
some alternative
awk -F ":" 'BEGIN{ OFS = ":" }{ if ( NF > 1 ) {$6 = 0; sub( /^[0-9]*/, 0, $7)}; print $0 }' YourFile
using column separate by :
sed 's/^\(\([^:]*:\)\{5\}\)[^[:blank:]]*/\10:0/' YourFile
using 5 first element separate by : thant space as delimiter
for your sed
sed 's/:[0-9]+:[0-9]+\(\s\)/:0:0\1/'
^and $ are relative to the whole string not the current word
option to keep the original space instead of replacing by a blank space (case of several or other like \t)
g is not needed (and better not to use here) because normaly only 1 occurence per line
you need to be sure that the pattern is not possible somewhere else (never a space after the previous number) because it's a small one

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and I added p tag for printing match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print the resulting line
-r this makes it so you don't have the escape the capture group parens().
\1 the capture group match
/g global match
/p print the result
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches prints out the contents of the first set of bracks ($1).
You can do this will multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" match ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match ... or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from with the pattern space (a line).
You can use awk with match() to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*([0-9]+).*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before and after the ([0-9]+) to get rid of text before and after the number in the substitution.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if you actual situation is more complex, the REs will need to me modified. For example if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##abc}"
echo "num is ${t%%xyz}";;
esac
done <"file"
For awk. I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
/* default, do nothing */
}
gawk '/.*abc([0-9]+)xyz.*/' file