How do I remove a particular pattern with a number sequence sed

I'm very new to the sed command, so I'm trying to learn.
I'm currently faced with a few thousand markdown files I need to clean up, and I'm trying to create a command that deletes part of the following:
# null 864: Headline
body text
I need everything that comes before the headline deleted, which is '# null 864: '.
It's always '# null ', then some digits, then ': '.
I'm using gnu-sed because I'm on a Mac.
The best I've come up with so far is
gsed -i '/#\snull\s([1-9]|[1-9][0-9]|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9]):\s/d' *.md
The above does not seem to work.
However, if I do
gsed -i '/#\snull/d' *.md
it does what I want, but it also does some unintended things in the body text.
How do I make sure that only the headline and the body text remain?

Assuming you want to print the part before the headline and don't want to print any other lines, try the following.
sed -E -n 's/^(#\s+null\s+[0-9]+:\s+)Headline/\1/p' Input_file
If you want to print the part before Headline and, when no match is found, print the complete line, try the following:
sed -E 's/^(#\s+null\s+[0-9]+:\s+)Headline/\1/' Input_file
Explanation: The -E option enables ERE (extended regular expressions), and the s command performs the substitution. The pattern matches # followed by space(s), null, space(s), digits, a colon and space(s), and keeps that in the 1st capturing group; the replacement is the 1st capturing group.
NOTE: The above commands print their output to the terminal. To save the changes in place, add the -i option once you are satisfied with the output.

If I'm understanding correctly, you have files like this:
This should get deleted
This should too.
# null 864: Headline
body text
this should get kept
You want to keep the headline, and everything after, right? You can do this in awk:
awk '/# null [0-9]+:/,eof {print}' foo.md

You might use awk, and replace the # null 864: part with an empty string using sub.
See this page to either create a new file, or to overwrite the same file.
The }1 prints the whole line as 1 evaluates to true.
awk '{sub(/^# null [0-9]+:[[:blank:]]+/,"")}1' file
The pattern matches
^# null Match literally from the start of the string
[0-9]+:[[:blank:]]+ match 1+ digits, then : and 1+ spaces
Output
Headline
body text
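The same substitution can be sketched in sed. The attempt in the question fails for two reasons: the d command deletes the whole matching line rather than the matched text, and without -E the characters ( and | are literal in sed's basic regular expressions. The sample input below is made up for illustration.

```shell
# Hypothetical sample input modelled on the question.
input='# null 864: Headline
body text
see # null 12: inside the body'

# Remove "# null <digits>: " only when it starts the line;
# s/// edits the matched text, whereas d would drop the whole line.
cleaned=$(printf '%s\n' "$input" | sed -E 's/^# null [0-9]+:[[:blank:]]+//')

printf '%s\n' "$cleaned"
```

Once the output looks right, the same script can be given to gsed -i -E to edit the files in place.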

On a Mac, ed should be installed by default, so you can use that.
The content of script.ed
g/^# null [[:digit:]]\{1,\}: Headline$/s/^.\{1,\}: //
,p
Q
for file in *.md; do ed -s "$file" < ./script.ed; done
If the output is OK, remove the ,p and change the Q to w so that ed edits the file in place:
g/^# null [[:digit:]]\{1,\}: Headline$/s/^.\{1,\}: //
w
Run the loop again.

I'd use a range in sed same as Andy Lester's awk solution.
Borrowing his infile,
$: cat tst.md
This should get deleted
This should too.
# null 864: Headline
body text
this should get kept
$: sed -E -n -i '/^# null [0-9]+:/,${p;d};d;' tst.md
$: cat tst.md
# null 864: Headline
body text
this should get kept
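A shorter way to express the same range, as a minimal sketch that prints to standard output instead of editing in place: with -n, only lines from the first match through the last line ($) are printed.

```shell
# Sample file contents borrowed from the answer above.
sample='This should get deleted
This should too.
# null 864: Headline
body text
this should get kept'

# -n suppresses automatic printing; the range runs from the first
# "# null <digits>:" line to the last line, and p prints it.
kept=$(printf '%s\n' "$sample" | sed -n '/^# null [0-9][0-9]*:/,$p')

printf '%s\n' "$kept"
```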

Related

How to find and replace a pattern string using sed/perl/awk?

I have a file foo.properties with contents like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
In my script, I need to replace whatever value is set for ph (the current value is unknown to the bash script) and change it to 0.5. So the file should look like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
I know it can be easily done if the current value is known by using
sed "s/\,ph\:0.03\,/\,ph\:0.5\,/" foo.properties
But in my case, I have to actually read the contents of allNames, search for the value, and then replace it within a for loop. Everything else is taken care of, but I can't figure out the sed/perl command for this.
I tried using sed "s/\,ph\:.*\,/\,ph\:0.5\,/" foo.properties and some variations, but it didn't work.

A simpler sed solution:
sed -E 's/([=,]ph:)[0-9.]+/\10.5/g' file
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
Here we match ([=,]ph:) (i.e. , or = followed by ph:) and capture it in group #1. This must be followed by 1+ characters from [0-9.] to match any number. In the replacement we put \1 back, followed by 0.5.
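Since the question needs this inside a script, here is a small sketch of the same substitution with the new value taken from a shell variable. The variable names are illustrative only. Note that sed back-references are a single digit, so \1 followed by 0.5 is read as group 1 and then the text 0.5.

```shell
# Hypothetical variable holding the replacement value.
new_value=0.5
line='allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5'

# Double quotes let the shell expand new_value; \\1 reaches sed as \1.
# Assumption: new_value contains no "/" or "&", which are special to s///.
updated=$(printf '%s\n' "$line" |
  sed -E "s/([=,]ph:)[0-9.]+/\\1${new_value}/")

printf '%s\n' "$updated"
```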

With your shown samples, please try following awk code.
awk -v new_val="0.5" '
match($0,/,ph:[0-9]+(\.[0-9]+)?/){
val=substr($0,RSTART+1,RLENGTH-1)
sub(/:.*/,":",val)
print substr($0,1,RSTART) val new_val substr($0,RSTART+RLENGTH)
next
}
1
' Input_file
Detailed explanation: The awk variable new_val holds the new value to insert. In the main program, match() tests each line against the regex ,ph:[0-9]+(\.[0-9]+)? and, when it matches, the matched text without its leading comma is stored in val. sub() then replaces everything from the : to the end of val with a bare :, leaving ph:. The print statement outputs the part of the line up to and including the comma, then val, then new_val, then the rest of the line after the match. next skips any further processing of that line, and the trailing 1 prints every line that did not match.
2nd solution: Using sub function of awk.
awk -v newVal="0.5" '/^allNames=/{sub(/,ph:[^,]*/,",ph:"newVal)} 1' Input_file

Would you please try a perl solution:
perl -pe '
s/(?<=\bph:)[\d.]+(?=,|$)/0.5/;
' foo.properties
The -pe options make perl read the input line by line, perform
the operation, then print the line, as sed does.
The regex (?<=\bph:) is a zero-length lookbehind which matches
the string ph: preceded by a word boundary.
The regex [\d.]+ will match a decimal number.
The regex (?=,|$) is a zero-length lookahead which matches
a comma or the end of the string.
As the lookbehind and the lookahead have zero length, they are not
replaced by the s/../../ operator.
[Edit]
As Dave Cross comments, the lookahead (?=,|$) is unnecessary as long as the input file is correctly formatted.

Works with or without a decimal part, or with no value at all, anywhere in the line.
sed -E 's/(^|[^-_[:alnum:]])ph:[0-9]*(\.[0-9]+)?/\1ph:0.5/g'
Or possibly:
sed -E 's/(^|[=,[:space:]])ph:[0-9]+(\.[0-9]+)?/\1ph:0.5/g'
The top one uses "not other naming characters" to describe the character immediately before a name; the bottom one uses delimiter characters (you could add more characters to either). In both, \1 puts back the character matched before ph:, and the dot is escaped so that it matches only a literal decimal point. The purpose is to avoid clashing with other_ph or autograph.
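A quick check of the delimiter-based variant, as a sketch on a made-up line that contains both other_ph and autograph. The leading delimiter is captured and restored with \1 so that the comma before ph: survives the substitution.

```shell
# Hypothetical input with look-alike names before the real ph entry.
line='other_ph:9,autograph:7,ph:0.03,delta:1.0'

# Only a ph: preceded by start of line, "=", "," or whitespace is changed.
guarded=$(printf '%s\n' "$line" |
  sed -E 's/(^|[=,[:space:]])ph:[0-9]+(\.[0-9]+)?/\1ph:0.5/g')

printf '%s\n' "$guarded"
```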

Here you go
#!/usr/bin/perl
use strict;
use warnings;
print "\nPerl Starting ... \n\n";
while (my $recordLine = <DATA>)
{
    chomp($recordLine);
    if (index($recordLine, "ph:") != -1)
    {
        $recordLine =~ s/ph:.*?,/ph:0.5,/g;
        print "recordLine: $recordLine ...\n";
    }
}
print "\nPerl End ... \n\n";
__DATA__
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
output:
Perl Starting ...
recordLine: allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5 ...
Perl End ...

Using any sed in any shell on every Unix box (the other sed solutions posted that use sed -E require GNU or BSD seds):
a) if ph: is never the first tag in the allNames list (as shown in your sample input):
$ sed 's/\(,ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
b) or if it can be first:
$ sed 's/\([,=]ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5

How to use sed to search and replace a pattern that appears multiple times in the same line?

Because the question can be misleading, here is a little example. I have this kind of file:
some text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##
In this example, I want to replace each occurrence of KEY- inside a pair of ## by VALUE-. I started with this sed command:
sed -i 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g'
Here is how it works:
\(##[^#]*\): create a first group composed of two # and any characters except # ...
KEY-: ... until the last occurrence of KEY- on that line
\([^#]*##\): and create a second group with all the characters except # until the next pair of #.
The problem is my command can't handle correctly the following line because there are multiple KEY- inside my pair of ##:
again ##some-text-KEY-some-other-text-KEY-text##
Indeed, I get this result:
again ##some-text-KEY-some-other-text-VALUE-text##
If I want to replace all the occurrences of KEY- in that line, I have to run my command multiple times, and I would prefer to avoid that. I also tried lazy operators, but the problem is the same.
How can I create a regex and a sed command that correctly handle my whole file?

The problem is rather complex: you need to replace all occurrences of some multicharacter text inside blocks of text between identical multicharacter delimiters.
The easiest and safest way to solve the task is using Perl:
perl -i -pe 's/(##)(.*?)(##)/$end_delim=$3; "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim"/ge' file
See the online demo.
The (##)(.*?)(##) pattern will match strings between two adjacent ## substrings capturing the start delimiter into Group 1, end delimiter in Group 3, and all text in between into Group 2. Since the regex substitution re-sets all placeholders, the temporary variable is used to keep the value of the end delimiter ($end_delim=$3), then, "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim" replaces the match with the value in the Group 1 of the first match (the first ##), then the Group 2 value with all KEY- replaced with VALUE-, and then the end delimiter.
If there are no KEY-s in between matches on the same line you may use a branch with sed by enclosing your command with :A and tA:
sed -i ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' file
Note you missed the first placeholder in \VALUE-\2, it should be \1VALUE-\2.
See the online demo:
s="some KEY- text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##"
sed ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' <<< "$s"
Output:
some KEY- text
some text ##some-text-VALUE-some-other-text##
text again ##some-text-VALUE-some-other-text## ##some-text-VALUE-some-other-text##
again ##some-text-VALUE-some-other-text-VALUE-text##
some text with KEY ##VALUE-some-text##
blabla ##KEY##
More details:
sed allows the usage of loops and branches. The :A in the code above is a label, a location marker that can be jumped to with a branch command. t is a conditional branch: it jumps to the label only if a substitution has succeeded since the current input line was read or since the last t branch was taken. So, once the pattern has matched and the replacement has occurred, sed goes back to :A and retries the substitution on the modified line. When the substitution no longer matches, t does not branch and sed continues with the next command (here, the end of the script). So tA means: go back to the location marked with A if there was a successful search-and-replace operation.
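To see what the loop changes, here is a small sketch that runs the substitution once and then inside the :A ... tA loop on the problematic line from the question. The label and branch are passed as separate -e arguments, which avoids differences in how sed implementations parse a label followed by a semicolon.

```shell
line='again ##some-text-KEY-some-other-text-KEY-text##'

# One pass: the greedy [^#]* leaves only the last KEY- to be replaced.
single_pass=$(printf '%s\n' "$line" |
  sed 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g')

# Loop: t jumps back to :A as long as the substitution keeps succeeding.
looped=$(printf '%s\n' "$line" |
  sed -e ':A' -e 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g' -e 'tA')

printf '%s\n' "$single_pass" "$looped"
```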

This might work for you (GNU sed):
sed -E 's/##/\n/g;:a;s/^([^\n]*(\n[^\n]*\n[^\n]*)*\n[^\n]*)KEY-/\1VALUE-/;ta;s/\n/##/g' file
Convert ##'s to newlines. Using a loop, replace each KEY- between a matched pair of newlines with VALUE-. When all are done, replace the newlines with ##'s.

Is there a way to use sed to remove only the exact string match?

I have recently started learning bash and I ran into a problem doing an assignment. I have a txt file that contains something like
foo:abc:200:1:1:1
foobar:asd:100:3:2:1
bar:test:100:2:2:2
where the first column is the title of the book, followed by the author name, price, quantity available and quantity sold, all separated by the delimiter ":".
The goal here is to remove a book based on the title and author the user types in.
I have searched around and found that sed might be able to help me with this problem. I have tried to test sed by deleting based on the title alone with
sed /"foo"/d Book.txt
I expected the output to be
foobar:asd:100:3:2:1
bar:test:100:2:2:2
however the output was
bar:test:100:2:2:2
which tells me that any line in the txt file containing "foo" will get deleted
Hence I would like to ask
Is there any way to use sed so it deletes the exact match only instead of lines containing foo?
is there any way to use delimiters with sed so I can use both title and author?
Should I be using something other than sed?

Using sed it is better to use:
sed -E '/(^|:)foo(:|$)/d' file
foobar:asd:100:3:2:1
bar:test:100:2:2:2
Which makes sure foo is preceded by start or : and followed by end or :.
However this job is more suitable for awk as data is delimited by colon:
awk -F: '$1 != "foo"' file
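The question also asks about matching on both title and author. A minimal sketch along the same lines, assuming the two values arrive in shell variables (the names title and author are illustrative): awk compares the first two fields exactly, so foobar is not affected by a search for foo.

```shell
# Sample data from the question.
books='foo:abc:200:1:1:1
foobar:asd:100:3:2:1
bar:test:100:2:2:2'

# Hypothetical user input.
title='foo'
author='abc'

# Keep every line except the one whose title AND author both match exactly.
# Note: awk -v interprets backslash escapes in the assigned values.
remaining=$(printf '%s\n' "$books" |
  awk -F: -v t="$title" -v a="$author" '!($1 == t && $2 == a)')

printf '%s\n' "$remaining"
```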

Is there any way to use sed so it deletes the exact match only instead of lines containing foo?
Yes, for the given example you can: anchor the search pattern so that it matches exactly foo: at the start of the line. For example:
sed '/^foo:/d' file
The ^ anchors the pattern to the start of the line, so it matches only lines starting with foo followed by a colon, which fits your use case. This assumes foo can appear in the first column only.
Is there any way to use delimiters with sed so I can use both title and author?
Should I be using something other than sed?
If you are dealing with an input file that has a fixed delimiter like :, which will never form part of your valid column content, then awk or perl are better suited, as they split text into fields easily once a delimiter is set.
As an example, if you want to change the quantity in the fourth column for one particular book named foobar, with awk you can just do
awk -F: 'BEGIN { OFS = FS } $1 == "foobar" { $4 = 6 }1' input-file
To decode the above line: the content within '..' is left untouched by the shell and passed literally to the command, which is why we wrap it in single quotes. The statements inside it are not meaningful in the context of the shell.
The -F: sets the input field separator to :, so as the command reads the file line by line, each line is broken into fields separated by :. The first field is $1, and so on up to $NF, the last field of the line. The part BEGIN { OFS = FS } sets the output field separator to the same value as the input one, i.e. it retains the : delimiter when awk writes the output.
The part $1 == "foobar" { $4 = 6 } is almost self-explanatory: if the first column equals the string within quotes, do the action inside {..}, which sets the fourth column to 6. Assigning to a field makes awk rebuild the line using the output field separator. The trailing 1 is a pattern that is always true; with no action attached, awk applies the default action, print, so every line is printed.
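The effect of BEGIN { OFS = FS } is easiest to see by running the command with and without it. A short sketch on two rows from the question's file:

```shell
# Sample rows taken from the question's file.
books='foo:abc:200:1:1:1
foobar:asd:100:3:2:1'

# With OFS set, the rebuilt line keeps its colons.
with_ofs=$(printf '%s\n' "$books" |
  awk -F: 'BEGIN { OFS = FS } $1 == "foobar" { $4 = 6 } 1')

# Without OFS, the modified line is rejoined with single spaces;
# the untouched line is printed as it was read.
without_ofs=$(printf '%s\n' "$books" |
  awk -F: '$1 == "foobar" { $4 = 6 } 1')

printf '%s\n' "$with_ofs" "$without_ofs"
```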

This might work for you (GNU sed):
sed '/\<foo\>/d' file
Or
sed '/\bfoo\b/d' file
The first solution uses \< start word and \> end word. The second solution uses the \b word boundary.
P.S. The dual of \b is \B so to delete lines that contain foobar or foobaz but not foo only, use:
sed '/\bfoo\B/d' file

Insert text into line if that line doesn't contain another string using sed

I am merging a number of text files on a linux server but the lines in some differ slightly and I need to unify them.
For example some files will have line like
id='1244' group='american' name='fred',american
Other files will be like
id='2345' name='frank', english
finally others will be like
id='7897' group='' name='maria',scottish
What I need to do is this: if group='' or group is not in the string at all, I need to add it somewhere before the comma, setting it to the text after the comma. So in the 2nd example above the line would become:
id='2345' name='frank' group='english',english
and the same in the last example which would become
id='7897' name='maria' group='scottish',scottish
This is going into a bash script. I can't actually delete the line and add to the end of the file as it relates to the following line.
I've used the following:
sed -i.bak 's#group=""##' file
which deletes the group="" string, so the lines will either contain group='something' or won't contain it at all, and that works.
Then I tried to add the group if it doesn't exist using the following:
sed -i.bak '/group/! s#,(.*$)#group="\1",\1#' file
but that throws up the error
sed: -e expression #1, char 38: invalid reference \1 on `s' command's RHS
EDIT by Ed Morton to create a single sample input file and expected output:
Sample Input:
id='1244' group='american' name='fred',american
foo
id='2345' name='frank', english
bar
id='7897' group='' name='maria',scottish
Expected Output:
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish

sed -r "
/group=''/ s/// # group is empty, remove it
/group=/! s/,[[:blank:]]*(.+)/ group='\\1',\\1/ # group is missing, add it
" file
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish
The foo and bar lines are untouched because the second s command finds no comma followed by characters on those lines.
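The "invalid reference \1" error in the question comes from the regex dialect: without -r or -E, sed uses basic regular expressions, where unescaped ( and ) are literal characters, so the pattern defines no group for \1 to refer to. A minimal sketch of both spellings on one line from the sample input:

```shell
line="id='2345' name='frank', english"

# Basic regular expressions (default): groups are written \( ... \).
bre=$(printf '%s\n' "$line" |
  sed "/group=/! s/,[[:blank:]]*\(.*\)/ group='\1',\1/")

# Extended regular expressions (-E): plain ( ... ) form a group.
# Inside double quotes, \\1 reaches sed as \1.
ere=$(printf '%s\n' "$line" |
  sed -E "/group=/! s/,[[:blank:]]*(.*)/ group='\\1',\\1/")

printf '%s\n' "$bre" "$ere"
```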

something like
sed '
/^[^,]*group[^,]*,/ ! {
s/, *\(.*\)/ group='\''\1'\'', \1/
}
/^[^,]*group='\'\''/ {
s/group='\'\''\([^,]*\), *\(.*\)/group='\''\2'\''\1, \2/
}
'

This GNU awk may help:
awk -v sq="'" '
BEGIN{RS="[ ,\n]+"; FS="="; found=0}
$1=="group"{
if($2==sq sq)
{next}
else
{found=1}
}
NF>1{
printf "%s=%s ",$1,$2
}
NF==1{
if(!found)
{printf "group=%s",$1}
print ","$1
found=0
}
' file
The script relies on the record separator RS which is set to get all key='value' pairs.
If the key group isn't found or is empty, it is printed when reaching a record with only one field.
Note that the variable sq holds the single quote character and is used to detect empty group field.

Sed can be pretty ugly. And your data format appears to be somewhat inconsistent. This MIGHT work for you:
$ sed -e "/group='[a-z]/b e" -e "s/group='' *//" -e "s/, *\([a-z]*\)$/ group='\1',\1/" -e ':e' input.txt
Broken out for easier reading, here's what we're doing:
/group='[a-z]/b e - If the line contains a valid group, branch to the end.
s/group='' *// - Remove any empty group.
s/, *\([a-z]*\)$/ group='\1',\1/ - Add a new group based on your specs, keeping the text after the comma.
:e - Branch label for the first command.
And then the default action is to print the line.
I really don't like manipulating data this way. It's prone to error, and you'll be further ahead reading this data into something that accurately stores its data structure, then prints the data according to a new structure. A more robust solution would likely be tied directly to whatever is producing or consuming this data, and would not sit in the middle like this.

sed command to delete text until match is found for each line of a csv

I have a csv file and I am trying to delete all characters from the beginning of the line till it finds the first occurrence of "2015". I want to do this for each line in the csv file.
My csv file structure is as follows:
Field1 , Field2 , Field3 , Field4
sometext1 , 2015-07-15 , sometext2, sometext3
sometext1 , 2015-07-14 , sometext2, sometext3
sometext1 , 2015-07-13 , sometext2, sometext3
I cannot use the cut command or sed for the first occurrence of a comma because the text in the Field1 sometimes has commas in them too, which is making it complicated for parsing. I figured if I search for the first occurrence of the text 2015 for each line and replace all the preceding characters with nothing, then that should work.
FYI, I want to do this for the FIRST occurrence of 2015 only. There is another text field with 2015 in it within another column, and I don't want any text prior to that to be affected.
For example, if my original line is:
sometext1,#015,2015-07-10,sometext2,2015,sometext3
I want it to return:
2015-07-10,sometext2,2015,sometext3
Does anyone know the sed command to do this?
Any help will be appreciated!
Thanks

Here is a way to do it with sed assuming "#####" never occurs in a line:
sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
For example:
> echo sometext1,#015,2015-07-10,sometext2,2015,sometext3\
|sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
2015-07-10,sometext2,2015,sometext3
The first sed command prefixes "#####" to the first occurrence of 2015, and the second sed command removes everything from the beginning of the line to the end of the "#####" prefix.
The basic reason for using this two stage method is that sed's regular expression matcher has only greedy wildcards that always pick the longest match and does not support lazy matching which picks the shortest match.
If "#####" may occur in a line a more unlikely string could be substituted for it such as "7z#dNjm_wG8a3!esu#Rhv=".
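The difference between the greedy match and the marker approach can be checked on the example line from the question. This is a sketch that combines both steps in a single sed call with two -e expressions instead of a pipe:

```shell
line='sometext1,#015,2015-07-10,sometext2,2015,sometext3'

# Greedy .* runs up to the LAST 2015, so too much is removed.
greedy=$(printf '%s\n' "$line" | sed 's/.*2015/2015/')

# Mark the FIRST 2015, then delete everything up to and including the marker.
# Assumption: "#####" never occurs in the data.
marked=$(printf '%s\n' "$line" |
  sed -e 's/2015/#####&/' -e 's/.*#####//')

printf '%s\n' "$greedy" "$marked"
```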

To do this with sed without Perl-style non-greedy operators, you need to mark the first instance with something you know won't be in the line, as Tris describes. However, that solution requires knowledge of what won't be in the file. Fortunately, you can guarantee that a newline won't be in the line because that's what terminated the line. Thus you can do something like:
sed 's/2015/\n&/;s/.*\n//' input.txt > output.txt
NOTE: this won't modify the header row which you would have to treat specially.
