process a delimited text file with sed - regex

I have a ";" delimited file:
aa;;;;aa
rgg;;;;fdg
aff;sfg;;;fasg
sfaf;sdfas;;;
ASFGF;;;;fasg
QFA;DSGS;;DSFAG;fagf
I'd like to process it, replacing each missing value with \N.
The result should be:
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;\N
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
I'm trying to do it with a sed script:
sed "s/;\(;\)/;\\N\1/g" file1.txt >file2.txt
But what I get is
aa;\N;;\N;aa
rgg;\N;;\N;fdg
aff;sfg;\N;;fasg
sfaf;sdfas;\N;;
ASFGF;\N;;\N;fasg
QFA;DSGS;\N;DSFAG;fagf

You don't need to enclose the second semicolon in parentheses just to use it as \1 in the replacement string. You can use ; in the replacement string:
sed 's/;;/;\\N;/g'
As you noticed, when it finds a pair of semicolons it replaces it with the desired string then skips over it, not reading the second semicolon again and this makes it insert \N after every two semicolons.
A solution is to use positive lookaheads; the regex is /;(?=;)/ but sed doesn't support them.
But it's possible to solve the problem using sed in a simple manner: duplicate the search command; the first command replaces the odd appearances of ;; with ;\N, the second one takes care of the even appearances. The final result is the one you need.
The command is as simple as:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
It duplicates the previous command and uses the ; between g and s to separate them. Alternatively you can use the -e command line option once for each search expression:
sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g'
Update:
The OP asks in a comment "What if my file have 100 columns?"
Let's try and see if it works:
$ echo "0;1;;2;;;3;;;;4;;;;;5;;;;;;6;;;;;;;" | sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
0;1;\N;2;\N;\N;3;\N;\N;\N;4;\N;\N;\N;\N;5;\N;\N;\N;\N;\N;6;\N;\N;\N;\N;\N;\N;
Look, ma! It works!
:-)
Update #2
I ignored the fact that the question doesn't ask to replace ;; with something else but to replace the empty/missing values in a file that uses ; to separate the columns. Accordingly, my expression doesn't fix the missing value when it occurs at the beginning or at the end of the line.
As the OP kindly added in a comment, the complete sed command is:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g;s/^;/\\N;/g;s/;$/;\\N/g'
or (for readability):
sed -e 's/;;/;\\N;/g;' -e 's/;;/;\\N;/g;' -e 's/^;/\\N;/g' -e 's/;$/;\\N/g'
The two additional steps add \N next to a ; found at the beginning or at the end of the line.
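As a quick check, the complete command can be wrapped in a small shell function and run against a few rows. This is only a sketch: the function name fill_empty and the extra test row ;x;; are made up for illustration, and the g flag is dropped from the two anchored substitutions because ^ and $ can match only once per line.

```shell
# fill_empty is a hypothetical helper name; it reads from stdin.
fill_empty() {
  # Two passes for ;; (matches cannot overlap), then the two line edges.
  sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g' \
      -e 's/^;/\\N;/' -e 's/;$/;\\N/'
}

printf '%s\n' 'aa;;;;aa' 'sfaf;sdfas;;;' ';x;;' | fill_empty
# aa;\N;\N;\N;aa
# sfaf;sdfas;\N;\N;\N
# \N;x;\N;\N
```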

You can use this sed command with 2 s (substitute) commands:
sed 's/;;/;\\N;/g; s/;;/;\\N;/g;' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
Or using lookarounds regex in a perl command:
perl -pe 's/(?<=;)(?=;)/\\N/g' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf

The main problem is that the same characters can't be matched several times within a single replacement pass:
s/;;/..../g: The second ; can't be reused for the next match in a string like ;;;
If you want to do it with sed without using a Perl-like regex mode, you can use a loop with the conditional command t:
sed ':a;s/;;/;\\N;/g;ta;' file
:a defines a label "a"; ta jumps to this label only if something has been replaced.
For the ; at the end of the line (and to deal with possible trailing whitespace):
sed ':a;s/;;/;\\N;/g;ta; s/;[ \t\r]*$/;\\N/1' file
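A sketch of the same loop split into separate -e expressions, which avoids relying on sed accepting a label followed by ; on one line. The function name fill_loop, the extra step for a leading ;, and the use of [[:space:]] in place of [ \t\r] are my assumptions, not part of the commands above.

```shell
# fill_loop is a hypothetical helper name; it reads from stdin.
fill_loop() {
  # Repeat the substitution until no ;; is left, then handle the line edges.
  sed -e ':a' -e 's/;;/;\\N;/' -e 'ta' \
      -e 's/^;/\\N;/' \
      -e 's/;[[:space:]]*$/;\\N/'
}

printf '%s\n' ';;;' 'sfaf;sdfas;;; ' | fill_loop
# \N;\N;\N;\N
# sfaf;sdfas;\N;\N;\N
```

Note that the last substitution also drops any whitespace that followed the final ;.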

this awk one-liner will give you what you want:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N"}7' file
if you really want the line: sfaf;sdfas;\N;\N;\N , this line works for you:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N";sub(/;$/,";\\N")}7' file

sed 's/;/;\\N/g;s/;\\N\([^;]\)/;\1/g;s/;[[:blank:]]*$/;\\N/' YourFile
Non-recursive, one-liner, POSIX compliant.
Concept:
change all ;
put back the ones that were followed by a value
handle the special case of the last ;, possibly followed by spaces before the end of the line

This might work for you (GNU sed):
sed -r ':;s/^(;)|(;);|(;)$/\2\3\\N\1\2/g;t' file
There are 4 scenarios in which an empty field may occur: at the start of a record, between 2 field delimiters, an empty field following an empty field and at the end of a record. Alternation can be employed to cater for scenarios 1, 2 and 4, and scenario 3 can be catered for by a second pass using a loop (:;...;t). Multiple scenarios can be replaced in both passes using the g flag.

Related

Modifying a pattern-matched line as well as next line in a file

I'm trying to write a script that, among other things, automatically enables multilib. Meaning in my /etc/pacman.conf file, I have to turn this
#[multilib]
#Include = /etc/pacman.d/mirrorlist
into this
[multilib]
Include = /etc/pacman.d/mirrorlist
without accidentally removing # from lines like these
#[community-testing]
#Include = /etc/pacman.d/mirrorlist
I already accomplished this by using this code
linenum=$(rg -n '\[multilib\]' /etc/pacman.conf | cut -f1 -d:)
sed -i "$((linenum))s/#//" /etc/pacman.conf
sed -i "$((linenum+1))s/#//" /etc/pacman.conf
but I'm wondering, whether this can be solved in a single line of code without any math expressions.
With GNU sed. Find row starting with #[multilib], append next line (N) to pattern space and then remove all # from pattern space (s/#//g).
sed -i '/^#\[multilib\]/{N;s/#//g}' /etc/pacman.conf
If the two lines contain further #, then these are also removed.
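If stray # characters later in either line must survive (for example a trailing comment), one possible variant anchors each substitution to the start of the line and uses n instead of N. This is a sketch under that assumption: the function name and the sample trailing comment are invented for the demonstration, and it reads stdin rather than editing /etc/pacman.conf in place.

```shell
# uncomment_multilib is a hypothetical helper name; it reads from stdin.
uncomment_multilib() {
  sed '/^#\[multilib\]$/{
s/^#//
n
s/^#//
}'
}

printf '%s\n' \
  '#[community-testing]' \
  '#Include = /etc/pacman.d/mirrorlist' \
  '#[multilib]' \
  '#Include = /etc/pacman.d/mirrorlist # keep this' |
  uncomment_multilib
```

Here n prints the already uncommented header and loads the next line, so each s/^#// only ever sees one line and removes at most its leading #.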
Could you please try the following, written with the shown samples only. It assumes you only want to deal with the multilib line and the line right after it.
awk '
/multilib/ || found{
found=$0~/multilib/?1:""
sub(/^#+/,"")
print
}
' Input_file
Explanation:
First, check whether the line contains multilib or the variable found is set; if so, run the instructions inside the block.
Inside the block, set found to 1 if the line contains multilib and clear it otherwise, so that only the line right after multilib gets processed.
Use awk's sub function to replace one or more leading hashes with an empty string.
Then print the current line.
This will work using any awk in any shell on every UNIX box:
$ awk '$0 == "#[multilib]"{c=2} c&&c--{sub(/^#/,"")} 1' file
[multilib]
Include = /etc/pacman.d/mirrorlist
and if you had to uncomment 500 lines instead of 2 lines then you'd just change c=2 to c=500 (as opposed to typing N 500 times as with the currently accepted solution). Note that you also don't have to escape any characters in the string you're matching on. So in addition to being robust and portable this is a much more generally useful idiom to remember than the other solutions you have so far. See printing-with-sed-or-awk-a-line-following-a-matching-pattern/17914105#17914105 for more.
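The counter idiom can be wrapped so that the header line and the number of lines become parameters. A minimal sketch: uncomment_block, hdr and n are hypothetical names introduced here, and the header is assumed to contain no backslashes (awk -v interprets escape sequences).

```shell
# uncomment_block HEADER COUNT: strip one leading # from HEADER and the
# lines after it, COUNT lines in total. Reads stdin. Hypothetical helper.
uncomment_block() {
  awk -v hdr="$1" -v n="$2" '
    $0 == hdr { c = n }              # exact string match starts the countdown
    c && c--  { sub(/^#/, "") }      # runs for the next n lines, header included
    1                                # print every line
  '
}

printf '%s\n' \
  '#[community-testing]' \
  '#Include = /etc/pacman.d/mirrorlist' \
  '#[multilib]' \
  '#Include = /etc/pacman.d/mirrorlist' |
  uncomment_block '#[multilib]' 2
```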
A perl one-liner:
perl -0777 -api.back -e 's/#(\[multilib]\R)#/$1/' /etc/pacman.conf
modify in place with a backup of original in /etc/pacman.conf.back
If there is only one [multilib] entry, with ed and the shell's printf
printf '/^#\[multilib\]$/;+1s/^#//\n,p\nQ\n' | ed -s /etc/pacman.conf
Change Q to w to edit pacman.conf
Match #[multilib]
; include the next address
+1 the next line (plus one line below)
s/^#// remove the leading #
,p prints everything to stdout
Q exit/quit ed without error message.
-s means do not print any message.
Ed can do this.
cat >> edjoin.txt << 'EOF'
/multilib/;+j
s/#//
s/#/\
/
wq
EOF
ed -s pacman.conf < edjoin.txt
rm -v ./edjoin.txt
This will only work on the first match. If you have more matches, repeat as necessary.
This might work for you (GNU sed):
sed '/^#\[multilib\]/,+1 s/^#//' file
Focus on a range of lines (in this case, two) where the first line begins #[multilib] and remove the first character in those lines if it is a #.
N.B. The [ and ] must be escaped in the regexp otherwise they will match a single character that is m,u,l,t,i or b. The range can be extended by changing the integer +1 to +n if you were to want to uncomment n lines plus the matching line.
To remove all comments in a [multilib] section, perhaps:
sed '/^#\?\[[^]]*\]$/h;G;/^#\[multilib\]/M s/^#//;P;d' file

Replace several lines by one using sed

I have an input like this:
This_is(A)
Goto(B,condition_1)
Goto(C,condition_2)
This_is(B)
Goto(A,condition_3)
This_is(C)
Goto(B,condition_1)
I want it to become like this
(A,B,condition_1)
(A,C,condition_2)
(B,A,condition_3)
(C,B,condition_1)
Anyone knows how to do this with sed?
Assuming you don't really need to do this with sed, this will work using any awk in any shell on every UNIX box:
$ awk -F'[()]' '/^[^[:space:]]/{s=$2; next} {sub(/[^[:space:]]*\(/,"("s",")} 1' file
(A,B,condition_1)
(A,C,condition_2)
(B,A,condition_3)
(C,B,condition_1)
This is a possible sed solution, where I have hardcoded a few bits, like This_is and Goto because the OP did not clarify if those strings change along the file in the actual file:
sed '/^This_is/{:a;N;s/\(^This_is(\(.\)).*\)\(\n *\)Goto(\([^)]*)\)$/\1\3(\2,\4/;$!ta;s/[^\n]*\n//}' input_file
(Unfortunately, with all these parentheses, using -E does not shorten the command much.)
The code is slightly more readable if split over several lines:
sed '/^This_is/{
:a
N
s/\(^This_is(\(.\)).*\)\(\n *\)Goto(\([^)]*)\)$/\1\3(\2,\4/
$!ta
s/[^\n]*\n//
}' input_file
Here you can see that the code takes action only on the lines starting with This_is; when the program hits those lines, it does the following.
It uses the N command to append the next line to the pattern space (interspersing \ns),
and it attempts a substitution with s/…/…/, which essentially tries to pick the x in This_is(x) and to put it just after the last Goto( on the multiline,
and it keeps doing this as long as the latter action is successful (ta branches to :a if s was successful) and the last line has not been read ($! matches all lines but the last);
Indeed, this is a do-while loop, where :a marks the entry point, where the control jumps back if the while-condition is true, and ta is the command that evaluates the logical condition.
When the above while loop terminates, the shorter s/…/…/ command removes the leading line from the multiline pattern space, which is the This_is line.
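For comparison, the same hard-coded-keyword assumption (every line is either This_is(x) or Goto(...)) can be sketched in awk by remembering the most recent x. The function name flatten_gotos and the variable src are illustrative only, and unlike the sed version this sketch drops any indentation in front of Goto.

```shell
# flatten_gotos is a hypothetical helper name; it reads from stdin.
flatten_gotos() {
  awk '
    /^[[:space:]]*This_is\(/ {
      sub(/^[^(]*\(/, "")            # drop everything up to and including "("
      sub(/\).*/, "")                # drop ")" and whatever follows
      src = $0                       # remember the current source label
      next
    }
    /^[[:space:]]*Goto\(/ {
      sub(/^[^(]*\(/, "(" src ",")   # turn "Goto(" into "(<src>,"
      print
    }
  '
}

printf '%s\n' \
  'This_is(A)' '  Goto(B,condition_1)' '  Goto(C,condition_2)' \
  'This_is(B)' '  Goto(A,condition_3)' |
  flatten_gotos
# (A,B,condition_1)
# (A,C,condition_2)
# (B,A,condition_3)
```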
This might work for you (GNU sed):
sed -E '/^\S.*\(.*\)/{h;d};G;s/\S+\((.*\))\n.*(\(.*)\).*/\2,\1/;P;d' file
If a line starts with a non-white space character and contains parens, copy it to the hold space (HS) and then delete it.
Otherwise, append the HS, remove non-white characters up to the opening paren, insert the value between parens from the stored value, add a comma and print the first line and then delete the whole of the pattern space.
N.B. Lines that do not meet the substitution criteria will be unchanged.
An alternative solution using GNU parallel and sed:
parallel --pipe --recstart T -kqN1 sed -E '1{h;d};G;s/\S+\((.*)\n.*(\(.*)\).*/\2,\1/;P;d' <file

Adding a line using sed

Can't seem to find the right way to do this, despite checking my regex in a reg checker.
Given a text file containing, amongst others, this entry:
zone "example.net" {
type master;
file "/etc/bind/zones/db.example.net";
allow-transfer { x.x.x.x;y.y.y.y; };
also-notify { x.x.x.x;y.y.y.y; };
};
I want to add lines after the also-notify line, for that domain specifically.
So using this sed command string:
sed '/"example\.net".*?also-notify.*?};/a\nxxxxxxx/s' named.conf.local
I thought should work to add 'xxxxxxx' after the line. But nope. What am I doing wrong?
With POSIX sed, you can use the a for append command with an escaped literal new line:
$ sed '/^[[:blank:]]*also-notify/ a\
NEW LINE' file
With GNU sed, a is slightly more natural since the new line is assumed:
$ gsed '/^[[:blank:]]*also-notify/ a NEW LINE' file
The issue with the sed in your example is twofold.
The first is that a sed regex is applied to one line at a time, so it cannot match across lines as example\.net".*?also-notify.*? would need to. That is more of a perl type match. You would need to use a range operator for the start as in:
$ sed '/"example\.net/,/also-notify/{
/^[[:blank:]]*also-notify/ a\
NEW LINE
}' file
The second issue is the \n in the appended text. With POSIX sed, the \n is not supported in any context. With GNU sed, the new line is assumed and the \n is out of context (if immediately after the a) and interpreted as an escaped literal n. You can use \n with GNU sed after 1 character but not immediately after. In POSIX sed, leading spaces of the appended line will always be stripped.
Following awk may help on this.
awk -v new_lines="new_line here" '/also-notify/{flag=1;print new_lines} /^};/{flag=""} !flag' Input_file
If you want to edit Input_file itself, append > temp_file && mv temp_file Input_file to the above command. Here new_lines is a variable; you could also print the new lines directly in the print statement.
You're pretty close already. Just use a range (/pattern/,/pattern/{ #commands }) to select the text you want to operate on and then use /pattern/a\ ... to add the line you want.
/"example\.net"/,/also-notify/{
/also-notify/a\
\ this is the text I want to add.
}
sed trims leading space on text to be appended. Adding a backslash \ at the start of the line prevents this.
In Bash, this would look something like:
sed -e '/"example\.net"/,/also-notify/{
/also-notify/a\
\ this is the text I want to add.
}' named.conf.local
Also note that sed uses an older dialect of regular expressions that doesn't support non-greedy quantifiers like *?.
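To check that the range really restricts the edit to one zone, the script can be run on input with two zones. A sketch only: example.org and the shortened zone bodies are made-up test data, add_after_notify is a hypothetical name, and xxxxxxx is the placeholder text from the question.

```shell
# add_after_notify is a hypothetical helper name; it reads from stdin.
add_after_notify() {
  sed '/"example\.net"/,/also-notify/{
/also-notify/a\
xxxxxxx
}'
}

printf '%s\n' \
  'zone "example.org" {' \
  'also-notify { x.x.x.x; };' \
  '};' \
  'zone "example.net" {' \
  'also-notify { x.x.x.x;y.y.y.y; };' \
  '};' |
  add_after_notify
```

Only the also-notify line inside the example.net zone is followed by the new line; the one in the first zone falls outside the range and is left alone.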

sed: Replacing a double quote in a quoted field within a delimited record

Given an optionally quoted, pipe delimited file with the following records:
"foo"|"bar"|123|"9" Nails"|"2"
"blah"|"blah"|456|"Guns "N" Roses"|"7"
"brik"|"brak"|789|""BB" King"|"0"
"yin"|"yang"|789|"John "Cougar" Mellencamp"|"5"
I want to replace any double quotes not next to a delimiter.
I used the following and it almost works. With one exception.
sed "s/\([^|]\)\"\([^|]\)/\1'\2/g" a.txt
The output looks like this:
"foo"|"bar"|123|"9' Nails"|"2"
"blah"|"blah"|456|"Guns 'N" Roses"|"7"
"brik"|"brak"|789|"'BB' King"|"0"
"yin"|"yang"|789|"John 'Cougar' Mellencamp"|"5"
It doesn't catch the second set of quotes if they are separated by a single character as in Guns "N" Roses. Does anyone know why that is and how it can be fixed? In the mean time I'm just piping the output to a second regex to handle the special case. I'd prefer to do this in one pass since some of the files can be largish.
Thanks in advance.
You can use substitution twice in sed:
sed -r "s/([^|])\"([^|])/\1'\2/g; s/([^|])\"([^|])/\1'\2/g" file
"foo"|"bar"|123|"9' Nails"|"2"
"blah"|"blah"|456|"Guns 'N' Roses"|"7"
"brik"|"brak"|789|"'BB' King"|"0"
"yin"|"yang"|789|"John 'Cougar' Mellencamp"|"5"
sed kind of implements a "while" loop:
sed ':a; s/\([^|]\)"\([^|]\)/\1'\''\2/g; ta' file
The t command loops to the label a if the previous s/// command replaced something. So that will repeat the replacement until no other matches are found.
Also, perl handles your case without looping, thanks to zero-width look-ahead:
perl -pe 's/[^|]\K"(?!\||$)/'\''/g'
But it doesn't handle consecutive double quotes, so the loop:
perl -pe 's//'\''/g while /[^|]\K"(?!\||$)/' file
You may like to use \x27 instead of the awkward '\'' method to insert a single quote in a single quoted string. Works with perl and GNU sed.
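The sed loop can also be written with separate -e expressions and double quotes around the substitution, which sidesteps the '\'' dance without relying on \x27. A sketch; fix_quotes is a hypothetical name.

```shell
# fix_quotes is a hypothetical helper name; it reads from stdin.
fix_quotes() {
  # Replace one embedded quote per iteration until none is left.
  sed -e ':a' -e "s/\([^|]\)\"\([^|]\)/\1'\2/" -e 'ta'
}

printf '%s\n' '"blah"|"blah"|456|"Guns "N" Roses"|"7"' | fix_quotes
# "blah"|"blah"|456|"Guns 'N' Roses"|"7"
```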

search and replace substring in string in bash

I have the following task:
I have to replace several links, but only the links which end with .do
Important: the files have also other links within, but they should stay untouched.
<li>Einstellungen verwalten</li>
to
<li>Einstellungen verwalten</li>
So I have to search for links with .do, take the part before and remember it for example as $a , replace the whole link with
<s:url action=' '/>
and paste $a between the quotes.
I thought about sed, but as far as I know sed only searches for a whole string and replaces it completely.
I also tried bash parameter expansions in combination with sed, but ran into several problems with the quotes and the variables.
cat ./src/main/webapp/include/stoBox2.jsp | grep -e '<a href=".*\.do">' | while read a;
do
b=${a#*href=\"};
c=${b%.do*};
sed -i 's/href=\"$a.do\"/href=\"<s:url action=\'$a\'/>\"/g' ./src/main/webapp/include/stoBox2.jsp;
done;
any ideas ?
Thanks a lot.
sed -i 's#href="\(.*\)\.do"#href="<s:url action='"'\1'"'/>"#g' ./src/main/webapp/include/stoBox2.jsp
Use a pattern with parentheses to capture the link without .do. The single and double quotes split the sed script into 3 parts (which the shell joins into one argument) so that the quotes in your text are passed through:
's#href="\(.*\)\.do"#href="<s:url action='
"'\1'"
'/>"#g'
The -i parameter modifies your file directly. If you don't want that, just remove it and save the result to a temporary file with > tmp.
Try this one:
sed -i "s%\(href=\"\)\([^\"]\+\)\.do%\1<s:url action='\2'/>%g" \
./src/main/webapp/include/stoBox2.jsp;
You can capture patterns with parentheses (\(,\)) and use them in the replacement pattern.
Here I catch a string without any " but preceding .do (\([^\"]\+\)\.do), and insert it without the .do suffix (\2).
There is a / in the replacement, so I used % to delimit the expressions instead of the traditional /.
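Since the question's sample lines lost their markup, here is a sketch on an invented line. The href values settings.do and index.html, and the function name rewrite_do_links, are assumptions for the demonstration. It uses [^\"]* so a match cannot run past the closing quote of the attribute, and it also requires that closing quote right after .do.

```shell
# rewrite_do_links is a hypothetical helper name; it reads from stdin.
rewrite_do_links() {
  sed "s%href=\"\([^\"]*\)\.do\"%href=\"<s:url action='\1'/>\"%g"
}

printf '%s\n' '<a href="settings.do">A</a> <a href="index.html">B</a>' |
  rewrite_do_links
# <a href="<s:url action='settings'/>">A</a> <a href="index.html">B</a>
```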