Finding and replacing a numeric string between colons, before a space, using sed?

I am attempting to change all coordinate information in a fastq file to zeros. My input file is composed of millions of entries in the following repeating 4-line structure:
#HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I would like to replace the two numeric strings in the first line 16030:89299 with zeros in a generic way, such that any numeric string between the colons, before the space, is replaced. I would like the output to appear as follows, replacing the two strings globally throughout the file with zeros:
#HWI-SV007:140:C173GACXX:6:2215:0:0 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I am attempting to do this using the following sed:
sed 's/:^[0-9]+$:^[0-9]+$\s/:0:0 /g'
However, this does not behave as expected.

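You need to use sed's -r option, which enables extended regular expressions so that + means "one or more" (without it, + is a literal plus sign).
Also, ^ matches the beginning of the line and $ matches the end of the line, so they cannot be used to delimit a field in the middle of it.
This command line works against your sample: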
sed -r 's/:[0-9]+:[0-9]+\s/:0:0 /g'
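As a quick check, the corrected command can be run against the sample record. This is only a minimal sketch: it uses -E (the extended-regex switch accepted by both GNU and BSD sed, equivalent to GNU's -r) and a literal space instead of \s, and the variable names record and zeroed are chosen purely for illustration.

```shell
# Sample 4-line record from the question (variable names are illustrative).
record='#HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ'

# Replace ":digits:digits " with ":0:0 " on every line where it occurs.
zeroed=$(printf '%s\n' "$record" | sed -E 's/:[0-9]+:[0-9]+ /:0:0 /')
printf '%s\n' "$zeroed"
```

Only the header line contains two colon-delimited numbers followed by a space, so the other three lines pass through unchanged.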

Some alternatives:
awk -F ":" 'BEGIN{ OFS = ":" }{ if ( NF > 1 ) {$6 = 0; sub( /^[0-9]*/, 0, $7)}; print $0 }' YourFile
This treats : as the column separator, sets the sixth column to 0 and replaces the leading digits of the seventh.
sed 's/^\(\([^:]*:\)\{5\}\)[^[:blank:]]*/\10:0/' YourFile
This keeps the first five :-separated elements and replaces everything after them, up to the first blank, with 0:0.
As for your own sed:
sed 's/:[0-9]\+:[0-9]\+\(\s\)/:0:0\1/'
^ and $ are relative to the whole line, not to the current word.
\+ is used because, without -r, a bare + is a literal plus sign (\+ and \s are GNU extensions).
The captured \(\s\) keeps the original whitespace character instead of replacing it with a blank (useful if there are several, or another kind such as \t).
g is not needed (and is better left out here) because there is normally only one occurrence per line.
Because the pattern is short, you need to be sure it cannot occur anywhere else on the line (that is, no earlier pair of numbers is ever followed by a space).
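One way to address that last caveat is to touch header lines only. The sketch below assumes every record is exactly four lines long, so headers sit on lines 1, 5, 9 and so on; that layout is an assumption taken from the sample, not something awk checks. It matters for the field-based approach because a quality string can itself contain :, in which case a test such as NF > 1 would also rewrite quality lines. The sample data and the names fastq and fixed are illustrative.

```shell
# Two illustrative records; the first quality line contains ':' on purpose.
fastq='#HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACA
+
FF:FF:FF:FF:FF:FF:FF
#HWI-SV007:140:C173GACXX:6:2215:7:12 2:N:0:CAGATC
ACGT
+
IIII'

# Assumes strict 4-line records: only lines 1, 5, 9, ... are edited.
fixed=$(printf '%s\n' "$fastq" |
  awk -F: 'BEGIN { OFS = ":" }
           NR % 4 == 1 { $6 = 0; sub(/^[0-9]*/, "0", $7) }
           { print }')
printf '%s\n' "$fixed"
```

If the file may contain wrapped sequences or blank lines, the line-number test no longer identifies headers and a different criterion would be needed.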

Related

How can I replace all occurrences of a character within matched string in sed

I want to replace all occurrences of a character, say ,, in a matched string in sed. The matched string looks like:
[cite!abc,cde]
I want to replace it with:
[!cite](abc, cde)
The command to replace the outer format is:
sed 's/\[cite!\([^]]\+\)\]/[!cite](\1)/g' file
which gives
[!cite](abc,cde)
However, I want to put a space after each , and there may be an arbitrary number of comma-delimited entries, e.g.
[cite!abc,cde,def,fgh]
Is there an elegant way of doing this in sed or do I need to resort to perl scripts?
If you're guaranteed no spaces after commas in the input string:
sed 's/,/, /g' file
If you do have spaces after the commas in the input string, you'll get extra spaces in the output.
EDIT:
If there may be spaces after the commas in some of the elements, you can avoid adding more with:
sed 's/,\([^ ]\)/, \1/g' file
$ echo '1, 2, 3,4,5,6' | sed -e 's/,\([^ ]\)/, \1/g'
1, 2, 3, 4, 5, 6
This might work for you (GNU sed):
sed -E ':a;/\[([^]!]+)!([^]]+)\]/{s//\n[!\1](\2)\n/;h;s/,/, /g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/;ta}' file
The solution works as follows:
The string to be amended is identified, and a first pass rearranges it into the required format, except for the spacing of the comma-separated list.
The rearranged string is surrounded by newlines and a copy of the line is made in the hold space. Then every comma in the pattern space is followed by a space, the copy is joined to the current line, and pattern matching assembles the final result from the untouched text of the copy and the spaced list.
The process is then repeated on the current line so as to effect a global replacement.
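The same idea (reformat the tag, then space the commas inside that tag only) can also be sketched with awk, which some readers may find easier to follow than the hold-space version. This is an alternative technique, not the sed answer above; it uses only index, substr and gsub, and the sample line and variable names are made up for the illustration.

```shell
# Illustrative input: commas outside the [cite!...] tags must stay untouched.
line='see a,b [cite!abc,cde,def] and [cite!x,y] c,d'

converted=$(printf '%s\n' "$line" | awk '{
  out = ""; rest = $0
  # Handle one [cite!...] tag per iteration, left to right.
  while ((s = index(rest, "[cite!")) > 0) {
    tail = substr(rest, s + 6)        # text after "[cite!"
    e = index(tail, "]")              # closing bracket of this tag
    if (e == 0) break                 # unterminated tag: leave it as is
    inner = substr(tail, 1, e - 1)    # the comma-separated list
    gsub(/, */, ", ", inner)          # exactly one space after each comma
    out = out substr(rest, 1, s - 1) "[!cite](" inner ")"
    rest = substr(tail, e + 1)
  }
  print out rest
}')
printf '%s\n' "$converted"
```

Because the comma substitution is applied to the extracted list rather than to the whole line, text such as a,b outside the tags keeps its original spacing.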

Is there a way to use sed to remove only the exact string match?

I have recently started learning bash and ran into a problem doing an assignment. I have a txt file that contains something like
foo:abc:200:1:1:1
foobar:asd:100:3:2:1
bar:test:100:2:2:2
where the first column is the title of the book, followed by the author name, the price, the quantity available and the quantity sold, all separated by the delimiter ":".
The goal here is to remove a book based on the title and author the user types in.
I have searched around and found that sed might be able to help me with this problem. I have tried to test sed by deleting based on the title alone with
sed /"foo"/d Book.txt
I expected the output to be
foobar:asd:100:3:2:1
bar:test:100:2:2:2
however the output was
bar:test:100:2:2:2
which tells me that any line in the txt file containing "foo" will get deleted
Hence I would like to ask
Is there any way to use sed so it deletes the exact match only instead of lines containing foo?
Is there any way to use delimiters with sed so I can use both title and author?
Should I be using something other than sed?
Using sed it is better to use:
sed -E '/(^|:)foo(:|$)/d' file
foobar:asd:100:3:2:1
bar:test:100:2:2:2
Which makes sure foo is preceded by start or : and followed by end or :.
However this job is more suitable for awk as data is delimited by colon:
awk -F: '$1 != "foo"' file
Is there any way to use sed so it deletes the exact match only instead of lines containing foo?
Yes, for the given example you can: anchor the search pattern so that it matches exactly foo: at the start of the line. For example:
sed '/^foo:/d' file
The ^ anchors the match to the start of the line, so only lines beginning with foo followed by a colon are deleted, which matches your use case. This assumes foo only needs to be matched in the first column.
Is there any way to use delimiters with sed so I can use both title and author?
Should I be using something other than sed?
If you are dealing with an input file that has a fixed delimiter like :, which will never be part of valid column content, then awk or perl is better suited, as they split text easily once a delimiter is set.
For example, to change the quantity in the fourth column for one particular book named foobar, with awk you can just do
awk -F: 'BEGIN { OFS = FS } $1 == "foobar" { $4 = 6 }1' input-file
To decode the line above: the content within '..' is left untouched by the shell and passed literally to awk, which is why the program is wrapped in single quotes. The statements inside it are awk code and have no meaning to the shell.
The -F: sets the input field separator to :, so as the command reads the file line by line, each line is broken into tokens separated by :. The first column is $1, and so on up to $NF, the last column of the line. The part BEGIN { OFS = FS } sets the output field separator to the same value as the input one, i.e. it keeps the : delimiter when awk writes the output.
The part $1 == "foobar" { $4 = 6 } is almost self-explanatory: if the first column equals the quoted string, run the action inside {..}, which sets the fourth column to 6. The trailing 1 is an always-true pattern with no action, so awk applies the default action, print, to every line; a line whose field was modified is rebuilt using the output field separator.
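Applied to the original goal of removing a book by both title and author, the same field-based approach might look like the sketch below. The temporary file, the variable names and the way the user input is obtained are assumptions made for the illustration; in a real script title and author would come from read or from the script's arguments.

```shell
# Illustrative data file and user input (names and values are assumptions).
books=$(mktemp)
printf '%s\n' 'foo:abc:200:1:1:1' 'foobar:asd:100:3:2:1' 'bar:test:100:2:2:2' > "$books"
title='foo'
author='abc'

# Print every line except those whose first two fields equal title and author.
remaining=$(awk -F: -v t="$title" -v a="$author" '!($1 == t && $2 == a)' "$books")
printf '%s\n' "$remaining"
rm -f "$books"
```

Passing the values with -v keeps user input out of the awk program text, although awk still interprets backslash escapes in such values. To update the file itself, the output would have to be written to a temporary file and then moved over the original, since awk does not edit in place.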
This might work for you (GNU sed):
sed '/\<foo\>/d' file
Or
sed '/\bfoo\b/d' file
The first solution uses \< start word and \> end word. The second solution uses the \b word boundary.
P.S. The dual of \b is \B so to delete lines that contain foobar or foobaz but not foo only, use:
sed '/\bfoo\B/d' file

Sed not reading multiline input?

I have some text as below:
path => ["/home/Desktop/**/auditd.log",
"/home/Desktop/**/rsyslog*.log",
"/home/Desktop/**/snmpd.log",
"/home/Desktop/**/kernel.log",
"/home//Desktop/**/ntpd.log",
"/home/Desktop/**/mysql*.log",
"/home/Desktop/**/sshd.log",
"/hme/Desktop/**/su.log",
"/home/Desktop/**/run-parts(.log"
]
I want to extract the values inside [ ], so I am doing:
sed -n 's/.*\[\(.*\)\]/\1/p'
Sed is not returning anything.
If I do sed -n 's/.*\[\(.*\)log/\1/p' it properly returns the string between [ and log:
"/home/Desktop/**/auditd.",
So it's able to search within the line.
How to make this work??
EDIT:
I created a file with content:
path => [asd,masd,dasd
sdalsd,ad
asdlmas;ldasd
]
When I do grep -o '\[.*\]' it does not work, but grep -o '\[.*' returns the first line, [asd,masd,dasd. So it works for a single line but not across multiple lines.
Try doing this :
$ grep -oE '".*",?' file
OUTPUT:
"/home/Desktop/**/auditd.log",
"/home/Desktop/**/rsyslog*.log",
"/home/Desktop/**/snmpd.log",
"/home/Desktop/**/kernel.log",
"/home//Desktop/**/ntpd.log",
"/home/Desktop/**/mysql*.log",
"/home/Desktop/**/sshd.log",
"/hme/Desktop/**/su.log",
"/home/Desktop/**/run-parts(.log"
-o makes grep print only the matching part
-E enables extended regular expressions, which the ? operator requires
" is a literal double quote
.* is anything
" is the closing double quote
, is a literal comma
? means 0 or 1 occurrence
Well, I was a bit too slow, but I think the question of how to apply sed substitutions to a whole file as a block rather than on a line-by-line basis merits a general answer, so I'll leave one here.
In your specific case, you could use this pattern:
sed -n 'H; $ { x; s/.*\[\(.*\)\].*/\1/; p; }' foo.txt
The general trick is
sed -n 'H; $ { x; s/pattern/replacement/flags; p; }' file
What this means is: Every line that comes in is appended to the hold buffer (with the H command), and at the end of the file ($), when the whole file is in the hold buffer, the stuff between the brackets is executed. In that block, the hold buffer is swapped with the pattern space (x), then the substitution is done, and what remains in the pattern space is printed (p).
EDIT: One caveat of this simple form is that it doesn't work properly if your pattern wants to match the beginning of the file: H always appends a newline followed by the pattern space, so even though the hold buffer starts out empty, the collected text begins with an extra newline (an empty first line). If this is important, the slightly more complicated form
sed -n '1 h; 1 !H; $ { x; s/pattern/replacement/flags; p; }' file
fixes it. This will use h instead of H for the first line, which overwrites the hold buffer with the pattern space rather than appending the pattern space to the hold buffer.
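As an illustration, here is the second form applied to a shortened version of the question's input. The two paths are made-up sample data, and the variable names are arbitrary; the sed script itself is the general pattern with the bracket-extracting substitution filled in.

```shell
# Small multi-line sample in the shape of the question's input.
sample='path => ["/var/log/a.log",
"/var/log/b.log"
]'

# 1h / 1!H collect the whole input in the hold buffer; on the last line the
# buffer is swapped in and a single substitution runs across all lines.
inside=$(printf '%s\n' "$sample" |
  sed -n '1h; 1!H; ${ x; s/.*\[\(.*\)\].*/\1/; p; }')
printf '%s\n' "$inside"
```

The captured text ends with the newline that preceded the closing ], so sed itself prints one extra empty line; the command substitution used here strips it.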
You can do it with awk by replacing .*[ or ] or white spaces with nothing:
$ awk '{gsub(/.*\[|\]| /, ""); print}' filename
"/home/Desktop/**/auditd.log",
"/home/Desktop/**/rsyslog*.log",
"/home/Desktop/**/snmpd.log",
"/home/Desktop/**/kernel.log",
"/home//Desktop/**/ntpd.log",
"/home/Desktop/**/mysql*.log",
"/home/Desktop/**/sshd.log",
"/hme/Desktop/**/su.log",
"/home/Desktop/**/run-parts(.log"
Judging from your sample, it could also be enough to treat only the first and last lines:
sed '1s/^[^[]*\[//;$d' YourFile

How to use sed to replace partial of a find

some of the lines in a file look like this:
LOB ("VALUE") STORE AS SECUREFILE "L_MS_WRKNPY_VALUE_0000000011"(
LOB ("INDEX_XML") STORE AS SECUREFILE "L_HRRPTRY_INDX_L_0000000011"(
What I can assume is that the last quoted string starts with L_ and ends with a 10-digit number.
For each line that:
starts with LOB (with white-space before the LOB)
has L_ as the first two characters inside the ""
always ends with "(
I want to replace the last 10 characters inside the "" with a variable.
All I have managed to do is:
cat /tmp/out.log | sed 's/_[0-9_]*/$NUM/g' > /tmp/newout.log
to find the required rows I run:
grep "^ LOB" create_tables_clean.sql | grep "\"L_"
However, I don't know how to combine the two to get what I want.
sed -r 's/(\sLOB.*"L_.+_)([0-9]{10})("\()/\1'"$myVar"'\3/'
Replace $myVar with your variable, obviously. It is spliced in between the two single-quoted parts of the script; the double quotes keep the shell from word-splitting its value.
I made three capturing groups:
(\sLOB.*"L_.+_) #catches everything until the 10 numbers
([0-9]{10}) #catches the 10 numbers
("\() #catches the last "(
The first capturing group matches only if your line contains LOB with a preceding whitespace character, followed later by "L_.
Then you simply substitute the second capturing group (containing only the 10 numbers) with your variable while keeping the first and third capturing groups (\1'"$myVar"'\3).
Your whole call would look like
cat /tmp/out.log | sed -r 's/(\sLOB.*"L_.+_)([0-9]{10})("\()/\1'"$NUM"'\3/g' > /tmp/newout.log
(notice I added the g modifier to the substitution, so it will replace every occurrence on a line)
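A self-contained way to try the substitution on one sample line is sketched below. It is a variant rather than the exact command above: it uses -E and [[:space:]] for portability, anchors the match to the start and end of the line as the question describes, and keeps the underscore outside the first group so that the digits of the variable never directly follow \1. The value of NUM and the variable names are illustrative.

```shell
# Illustrative value; it must not contain "/", "&" or "\" because it is
# pasted into the sed replacement text.
NUM='0000000042'
line=' LOB ("VALUE") STORE AS SECUREFILE "L_MS_WRKNPY_VALUE_0000000011"('

# The single-quoted script is closed, "$NUM" is expanded inside double
# quotes, and the single-quoted script resumes.
renamed=$(printf '%s\n' "$line" |
  sed -E 's/^([[:space:]]+LOB.*"L_[^"]*)_[0-9]{10}("\()$/\1_'"$NUM"'\2/')
printf '%s\n' "$renamed"
```

Lines that do not start with whitespace followed by LOB, or that do not end with "(, are left unchanged because the anchored pattern does not match them.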
Several characters in the accepted sed command are unnecessary; here is a shorter one.
sed -r 's/(LOB.*"L_.*)([0-9]{10})("\()/\1'"$myVar"'\3/' file
Second, @Basti M, when you need to echo a string that contains double quotes, use single quotes; then you needn't escape the double quotes.
echo 'LOB ("VALUE") STORE AS SECUREFILE "L_MS_WRKNPY_VALUE_0000000011"('

SED: addressing two lines before match

I want to print the line that is situated 2 lines before the match (pattern).
I tried the following:
sed -n ': loop
/.*/h
:x
{n;n;/cen/p;}
s/./c/p
t x
s/n/c/p
t loop
{g;p;}
' datafile
The script:
sed -n "1N;2N;/XXX[^\n]*$/P;N;D"
works as follows:
Read the first three lines into the pattern space, 1N;2N
Search for the test string XXX anywhere in the last line, and if found print the first line of the pattern space, P
Append the next line input to pattern space, N
Delete first line from pattern space and restart cycle without any new read, D, noting that 1N;2N is no longer applicable
This might work for you (GNU sed):
sed -n ':a;$!{N;s/\n/&/2;Ta};/^PATTERN\'\''/MP;$!D' file
This will print the line 2 lines before the PATTERN throughout the file.
This one uses grep; it is a bit simpler and easier to read (though it needs one pipe). Note that it prints both lines preceding the first match, not only the line two before it:
grep -B2 'pattern' file_name | sed -n '1,2p'
If you can use awk try this:
awk '/pattern/ {print b} {b=a;a=$0}' file
This prints the line two lines before each line that matches pattern.
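The two-variable trick generalises to any distance with a small ring buffer. The following is a sketch, not part of the answer above: the variable n, the array name buf and the sample input are illustrative choices.

```shell
# n is the distance (in lines) between the match and the line to print.
before=$(printf '%s\n' one two three four five six seven |
  awk -v n=2 '
    # Slot NR % n still holds line NR - n, because line NR is stored below.
    /six/ && NR > n { print buf[NR % n] }
    { buf[NR % n] = $0 }
  ')
printf '%s\n' "$before"
```

Only n lines are kept in memory at any time, and the NR > n guard skips matches that occur too early in the file to have a line n positions before them.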
I've tested your sed command but the result is strange (and obviously wrong), and you didn't give any explanation. You will have to save three lines in a buffer (named hold space), do a pattern search with the newest line and print the oldest one if it matches:
sed -n '
## At the beginning read three lines.
1 { N; N }
## Append them to "hold space". In following iterations it will append
## only one line.
H
## Get content of "hold space" to "pattern space" and check if the
## pattern matches. If so, extract content of first line (until a
## newline) and exit.
g
/^.*\nsix$/ {
s/^\n//
P
q
}
## Remove the oldest of the three saved lines and append the new one.
s/^\n[^\n]*//
h
' infile
Assuming an input file (infile) with the following content:
one
two
three
four
five
six
seven
eight
nine
ten
It will search for six and yield as output:
four
Here are some other variants:
awk '{a[NR]=$0} /pattern/ {f=NR} END {print a[f-2]}' file
This stores all lines in an array a. When the pattern is found, it stores the line number in f.
At the end it prints the line two before the last match.
PS: this may be slow (and memory-hungry) with large files, since every line is kept.
Here is another one:
awk 'FNR==NR && /pattern/ {f=NR;next} f-2==FNR' file{,}
This reads the file twice (file{,} is the same as file file)
In the first pass it finds the pattern and stores the line number in the variable f
Then in the second pass it prints the line two before the value in f