Is there a way to use sed to remove only the exact string match? - regex

I have recently started learning bash and I ran into a problem doing an assignment, So I have a txt file and in it contains something like
foo:abc:200:1:1:1
foobar:asd:100:3:2:1
bar:test:100:2:2:2
where the first column is the title of the book followed by the author name followed by price,quantity available and qty sold all seperated with the delimiter ":"
the goal here is to remove a book base on the name and author the user types in.
I have searched around and found that sed might possibly be able to help me with this problem, I have tried to test sed by deleting base on the title alone with
sed /"foo"/d Book.txt
I expected the output to be
foobar:asd:100:3:2:1
bar:test:100:2:2:2
however the output was
bar:test:100:2:2:2
which tells me that any line in the txt file containing "foo" will get deleted
Hence I would like to ask
Is there any way to use sed so it deletes the exact match only instead of lines containing foo?
is there any way to use delimiters with sed so I can use both title and author?
Should I be using something other than sed?

Using sed it is better to use:
sed -E '/(^|:)foo(:|$)/d' file
foobar:asd:100:3:2:1
bar:test:100:2:2:2
Which makes sure foo is preceded by start or : and followed by end or :.
However this job is more suitable for awk as data is delimited by colon:
awk -F: '$1 != "foo"' file

Is there any way to use sed so it deletes the exact match only instead of lines containing foo?
Yes you can for the given example, if you mark your search pattern to match exactly foo: you can have luck deleting it. For e.g. if you do below
sed '/^foo:/d' file
The pattern ^ marks that the string starting with foo followed by a colon mark : which matches your use-case. This is assuming foo can be part of the fist column only
Is there any way to use delimiters with sed so I can use both title and author?
Should I be using something other than sed?
If you are dealing with a input file has a fixed de-limiter like : which will never form a part of your valid column content, then using awk/perl are better suited as they read text easily once a de-limiter is set.
As an example, consider an e.g. if you want to change the quantity name from fourth column for one particular book named foobar, with awk you can just do
awk -F: 'BEGIN { OFS = FS } $1 == "foobar" { $4 = 6 }1' input-file
To decode above line, the content within '..' are left untouched by the shell and passed literally to the command, that's why we wrap the content in single quotes. Also the statements inside it are not meaningful in the context of the shell.
So the -F: sets the input field-separator to : which is when the command reads the file line by line, the first line is broken down into tokens separated by :. The first column is labelled $1, which is extended up to $NF, meaning the last column of the line. The part BEGIN { OFS = FS } assigns the output field separator as the same as input i.e. retain the : de-limitation when awk writes the output also.
The part $1 == "foobar" { $4 = 6 } is almost self-explanatory in a sense, that if the first column contains the string within quotes do the action inside {..}, which is set the fourth column value as 6. The {..}1 is a short-hand notation for {...; print} which is to re-construct the line based on the output field/record separators defined.

This might work for you (GNU sed):
sed '/\<foo\>/d' file
Or
sed '/\bfoo\b/d' file
The first solution uses \< start word and \> end word. The second solution uses the \b word boundary.
P.S. The dual of \b is \B so to delete lines that contain foobar or foobaz but not foo only, use:
sed '/\bfoo\B/d' file

Related

How to use sed to search and replace a pattern who appears multiple times in the same line?

Because the question can be misleading, here is a little example. I have this kind of file:
some text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##
In this example, I want to replace each occurrence of KEY- inside a pair of ## by VALUE-. I started with this sed command:
sed -i 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g'
Here is how it works:
\(##[^#]*\): create a first group composed of two # and any characters except # ...
KEY-: ... until the last occurrence of KEY- on that line
\([^#]*##\): and create a second group with all the characters except # until the next pair of #.
The problem is my command can't handle correctly the following line because there are multiple KEY- inside my pair of ##:
again ##some-text-KEY-some-other-text-KEY-text##
Indeed, I get this result:
again ##some-text-KEY-some-other-text-VALUE-text##
If I want to replace all the occurrences of KEY- in that line, I have to run my command multiple times and I prefer to avoid that. I also tried with lazy operators but the problem is the same.
How can I create a regex and a sed command who can handle correctly all my file?
The problem is rather complex: you need to replace all occurrences of some multicharacter text inside blocks of text between identical multicharacter delimiters.
The easiest and safest way to solve the task is using Perl:
perl -i -pe 's/(##)(.*?)(##)/$end_delim=$3; "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim"/ge' file
See the online demo.
The (##)(.*?)(##) pattern will match strings between two adjacent ## substrings capturing the start delimiter into Group 1, end delimiter in Group 3, and all text in between into Group 2. Since the regex substitution re-sets all placeholders, the temporary variable is used to keep the value of the end delimiter ($end_delim=$3), then, "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim" replaces the match with the value in the Group 1 of the first match (the first ##), then the Group 2 value with all KEY- replaced with VALUE-, and then the end delimiter.
If there are no KEY-s in between matches on the same line you may use a branch with sed by enclosing your command with :A and tA:
sed -i ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' file
Note you missed the first placeholder in \VALUE-\2, it should be \1VALUE-\2.
See the online demo:
s="some KEY- text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##"
sed ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' <<< "$s"
Output:
some KEY- text
some text ##some-text-VALUE-some-other-text##
text again ##some-text-VALUE-some-other-text## ##some-text-VALUE-some-other-text##
again ##some-text-VALUE-some-other-text-VALUE-text##
some text with KEY ##VALUE-some-text##
blabla ##KEY##
More details:
sed allows the usage of loops and branches. The :A in the code above is a label, a special location marker that can be "jumped at" using the appropriate operator. t is used to create a branch, this "command jumps to the label only if the previous substitute command was successful". So, once the pattern matched and the replacement occurred, sed goes back to where it was and re-tries a match. If it is not successful, sed goes on to search for the matches further in the string. So, tA means go back to the location marked with A if there was a successful search-and-replace operation.
This might work for you (GNU sed):
sed -E 's/##/\n/g;:a;s/^([^\n]*(\n[^\n]*\n[^\n]*)*\n[^\n]*)KEY-/\1VALUE-/;ta;s/\n/##/g' file
Convert ##'s to newlines. Using a loop, replace VAL- between matched newlines to VALUE-. When all done replace newlines by ##'s.

How to match and replace string following the match via sed or awk

I have a file which I want to modify into a new file using cat.
So the file contains lines like:
name "myName"
place "xyz"
and so on....
I want these lines to be changed to
name "Jon"
place "paris"
I tried to do it like this but its not working:
cat originalFile | sed 's/^name\*/name "Jon"/' > tempFile
I tried using all sorts of special characters and it did not work. I am unable to recognize the space characters after name and then "myName".
You may match the rest of the line using .*, and you may match a space with a space, or [[:blank:]] or [[:space:]]:
sed 's/^\(name[[:space:]]\).*/\1"Jon"/;s/^\(place[[:space:]]\).*/\1"paris"/' originalFile > tempFile
Note there are two replace commands here joined with s semicolon. The first parts are wrapped with a capturing group that is necessary because the space POSIX character class is not literal and in order to keep it after replacing the \1 backreference should be used (to insert the text captured with Group 1).
See the online demo:
s='name "myName"
place "xyz"'
sed 's/^\(name[[:space:]]\).*/\1"Jon"/;s/^\(place[[:space:]]\).*/\1"paris"/' <<< "$s"
Output:
name "Jon"
place "paris"
An awk alternative:
awk '$1=="name"{$0="name \"Jon\""} $1=="place"{$0="place \"paris\""} 1' originalFile
It will work when there're space(s) before name or place.
It's not regex match here but just string compare.
awk separates fields by space characters which including \n or .
Append > tempFile to it when the results seems correct to you.

Insert text into line if that line doesn't contain another string using sed

I am merging a number of text files on a linux server but the lines in some differ slightly and I need to unify them.
For example some files will have line like
id='1244' group='american' name='fred',american
Other files will be like
id='2345' name='frank', english
finally others will be like
id='7897' group='' name='maria',scottish
what I need to do is, if group='' or group is not in the string at all I need to add it somewhere before the comma setting it to the text after the comma so in the 2nd example above the line would become:
id='2345' name='frank' group='english',english
and the same in the last example which would become
id='7897' name='maria' group='scottish',scottish
This is going into a bash script. I can't actually delete the line and add to the end of the file as it relates to the following line.
I've used the following:
sed -i.bak 's#group=""##' file
which deletes the group="" string so the lines will either contain group='something' or wont contain it at all and that works
Then I tried to add the group if it doesn't exist using the following:
sed -i.bak '/group/! s#,(.*$)#group="\1",\1#' file
but that throws up the error
sed: -e expression #1, char 38: invalid reference \1 on `s' command's RHS
EDIT by Ed Morton to create a single sample input file and expected output:
Sample Input:
id='1244' group='american' name='fred',american
foo
id='2345' name='frank', english
bar
id='7897' group='' name='maria',scottish
Expected Output:
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish
sed -r "
/group=''/ s/// # group is empty, remove it
/group=/! s/,[[:blank:]]*(.+)/ group='\\1',\\1/ # group is missing, add it
" file
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish
The foo and bar lines are untouched because the s/// command did not match a comma followed by characters.
something like
sed '
/^[^,]*group[^,]*,/ ! {
s/, *\(.*\)/ group='\''\1'\'', \1/
}
/^[^,]*group='\'\''/ {
s/group='\'\''\([^,]*\), *\(.*\)/group='\''\2'\''\1, \2/
}
'
This GNU awk may help:
awk -v sq="'" '
BEGIN{RS="[ ,\n]+"; FS="="; found=0}
$1=="group"{
if($2==sq sq)
{next}
else
{found=1}
}
NF>1{
printf "%s=%s ",$1,$2
}
NF==1{
if(!found)
{printf "group=%s",$1}
print ","$1
found=0
}
' file
The script relies on the record separator RS which is set to get all key='value' pairs.
If the key group isn't found or is empty, it is printed when reaching a record with only one field.
Note that the variable sq holds the single quote character and is used to detect empty group field.
Sed can be pretty ugly. And your data format appears to be somewhat inconsistent. This MIGHT work for you:
$ sed -e "/group='[a-z]/b e" -e "s/group='' *//" -e "s/,\([a-z]*\)$/ group='\1', /" -e ':e' input.txt
Broken out for easier reading, here's what we're doing:
/group='[a-z]/b e - If the line contains a valid group, branch to the end.
s/group='' *// - Remove any empty group,
s/,\([a-z]*\)$/ group='\1', / - add a new group based on your specs
:e - branch label for the first command.
And then the default action is to print the line.
I really don't like manipulating data this way. It's prone to error, and you'll be further ahead reading this data into something that accurately stores its data structure, then prints the data according to a new structure. A more robust solution would likely be tied directly to whatever is producing or consuming this data, and would not sit in the middle like this.

How to add a new line to the beginning and end of every line matching a pattern?

How can sed be used to add a \n to the beginning and to the end of every line matching the pattern %%###? This is how far I got:
If foo.txt contains
foobar
%%## Foo
%%### Bar
foobar
then sed 's/^%%###/\n&\n/g' foo.txt gives
foobar
%%## Foo
%%###
Bar
foobar
instead of
foobar
%%## Foo
%%### Bar
foobar
Note: This seems related to this post
Update: I'm actually looking for case where lines starting with the pattern are considered only.
With GNU sed:
sed 's/.*%%###.*/\n&\n/' file
Output:
foobar
%%## Foo
%%### Bar
foobar
&: Refer to that portion of the pattern space which matched
If you want to edit your file "in place" use sed's option -i.
It is cumbersome to directly add newlines via sed. But here is one option if you have perl available:
$ foo.txt | perl -pe 's/(.*%%###.*)/\n$1\n/'
Here we capture every matching line, which is defined as any line containing the pattern %%### anywhere, and then we replace with that line surrounded by leading and trailing newlines.
This might work for you (GNU sed):
sed '/%%###/!b;G;i\\' file
For those lines that meet the criteria, append a newline from the hold space (the hold space by default contains a newline) and insert an empty line.
Another way:
sed -e '/%%###/!b;i\\' -e 'a\\' file
This time insert and then append empty lines.
N.B. The i and a must be followed by a newline, this can be achieved by putting them in separate -e invocations.
A third way:
sed '/%%###/!b;G;s/.*\(.\)/\1&/' file
As in the first way, append a newline from the hold space, then copy it i.e. the last character of the amended current line, and prepend it to the current line.
Yet another way:
sed '/%%###/{x;p;x;G}' file
Swap to the hold space, print the newline, swap back and append the newline.
N.B. If the hold space may not be empty (a previous, x,h,H,g or G command may have changed it) the buffer may be cleared before it is printed (p) by using the zap command z.
And of course:
sed '/%%###/s/^\|$/\n/g' file
Personally I'd use a string instead of regexp comparison since there's no repexp characters in the text you want to match and if there were you wouldn't want them treated as such, and just print newlines around the string instead of modifying the string itself:
awk '{print ($1=="^%%###" ? ORS $0 ORS : $0)}' file
The above will work with any awk in every shell on every UNIX box and is easily modified if you don't want multiple blank lines between consecutive %%### lines or don't want blank lines added if the existing surrounding lines are already blank or you need to do anything else.

Last Occurrence of Character Field Separator AWK

I'm using a find command to find all files of a certain format, that command has been golden. I'm piping that output into an awk command and I want to use the last underscore as a field separator. The problem being that depending on the path the file is in, there could be one or two underscores before the fact.
find . -regex ".*prob[0-9]*_.*" | awk 'BEGIN { FS = "_.*$" } { print $1 " " $2 }'
I get what's wrong with the regular expression in my field separator, it thinks to separate on the underscore and whatever follows, is there away to specify just the single character itself. Moreover, how do I specifically use a field separator on the last occurrence of a character.
This is somewhat an extension of a question I asked earlier:
Suppress output to StdOut when piping echo
The files I get are generally like this, the wrinkle being that the directory can have an underscore as well:
/the/directory/probXXXXX_XX
where X is any integer.
A workaround I've been thinking of is separating at every underscore and then print every column... I'd rather like to get it working in the method above though.
A trick of awk that is not obvious is that $ is an operator; you can use it with a variable or even an expression, and in particular with expressions involving the predefined variable NF: $NF gets the last field, $(NF - 1) the second last field.