Conditional replacement with SED - replace

I'm new to SED and have what may be a simple question. I've used it before to replace and delete characters but this is a little different. I need to eliminate commas within quotations, then eliminate the quotations in a csv file. So this:
"5,196,386","99,017",493,21
should end up looking like this:
5196386,99017,493,21

gnu awk one-liner:
awk -v FPAT='([^,]+|"[^"]+")' -v OFS="," '{for(i=1;i<=NF;i++)gsub(/[",]/,"",$i)}7'
with your example:
kent$ awk -v FPAT='([^,]+|"[^"]+")' -v OFS="," '{for(i=1;i<=NF;i++)gsub(/[",]/,"",$i)}7' <<< '"5,196,386","99,017",493,21'
5196386,99017,493,21

You'll need to do that with multiple s/// operations. The first will eliminate the commas between pairs of quotes when there are only commas and digits between the quotes; the second will eliminate the quotes (which by now have only digits between them):
sed -e 's/"\([0-9][0-9]*\),\([0-9,][0-9,]*\)"/"\1\2"/g' \
-e 's/"\([0-9][0-9]*\),\([0-9,][0-9,]*\)"/"\1\2"/g' \
-e 's/"\([0-9][0-9]*\)"/\1/g'
You have to repeat the first operation as often as the maximum number of commas that can appear between quotes. If your values go into the billions, you'll need a third copy of it.
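If the maximum number of commas per field is not known in advance, the repeated copies can be replaced by a loop. This is a minimal sketch, assuming a sed that supports labels and the t command (both are in POSIX). The function name strip_quoted_commas and the label again are made up for illustration:

```shell
# Remove commas inside quoted all-digit fields, then drop the quotes.
strip_quoted_commas() {
    sed -e ':again' \
        -e 's/"\([0-9][0-9]*\),\([0-9,][0-9,]*\)"/"\1\2"/' \
        -e 't again' \
        -e 's/"\([0-9][0-9]*\)"/\1/g'
}

printf '%s\n' '"5,196,386","99,017",493,21' | strip_quoted_commas
```

Each pass removes one comma from one quoted field; t again jumps back for as long as a substitution was made, so no copy count has to be chosen.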

I'd use a language with a proper CSV parser. For example:
echo '"5,196,386","99,017",493,21' |
ruby -rcsv -ne 'CSV.parse($_) do |row|
puts CSV.generate_line(row.map {|e| e.delete(",")})
end'
5196386,99017,493,21

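This should work with nearly all awk implementations (splitting a line into characters with an empty FS is a common extension; POSIX leaves it unspecified):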
echo '"5,196,386","99,017",493,21' | awk 'BEGIN {FS=OFS=""} {for (i=1;i<=NF;i++) {if ($i=="\"") {f=!f;$i=""}; if (f && $i==",") $i=""}}1'
5196386,99017,493,21
How does it work:
awk '
BEGIN { # Begin block
FS=OFS=""} # Set the input and output field separators to "" (nothing), so every character becomes its own field
{for (i=1;i<=NF;i++) { # Loop through the line, one character at a time
if ($i=="\"") { # If a double quote is found:
f=!f # Toggle the flag "f" (while "f" is true, you are inside a double-quoted string)
$i=""} # Delete the double quote
if (f && $i==",") # If "f" is true and we find a comma (inside a double-quoted string):
$i=""} # Delete the comma
}
1 # Print the line.
' file
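The same inside/outside-quotes idea can also be sketched without per-character fields, by splitting on the double quote itself: the even-numbered fields are then the quoted parts. This assumes quotes are balanced and never escaped, and unquote_and_strip is a made-up name:

```shell
# Split on the double quote, strip commas from the quoted parts, rejoin without quotes.
unquote_and_strip() {
    awk -F'"' -v OFS="" '{
        for (i = 2; i <= NF; i += 2)   # even fields were inside quotes
            gsub(/,/, "", $i)          # drop their commas
        $1 = $1                        # force a rebuild of $0 with the empty OFS
        print
    }'
}

printf '%s\n' '"5,196,386","99,017",493,21' | unquote_and_strip
```

The $1 = $1 assignment matters: without it, a line whose quoted fields contain no commas would not be rebuilt and would keep its quotes.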

This might work for you (GNU sed):
sed -r ':a;s/"[0-9,]+"/\n&\n/;T;h;s/[,"]//g;G;s/.*\n(.*)\n.*\n(.*)\n.*\n/\2\1/;ta' file
This puts \n markers round a double quoted field, makes a copy of the whole line, removes double quotes and commas, then puts the line back together again and repeats till no more changes are needed.
An alternative method:
sed -r 's/^/\n/;ta;:a;s/\n+$//;t;s/\n\n"/\n/;ta;s/\n"/\n\n/;ta;s/\n\n,/\n\n/;ta;s/(\n+)(.)/\2\1/;ta' file
Passes through the string character by character, using a \n as a marker. Two \n's mark that the next character is within a quoted field.

awk '{gsub(/"5,196,386","99,017"/,"5196386,99017")}1' file
5196386,99017,493,21

Related

Using protected wildcard character in awk field separator doesn't work

I have a file that contains paragraphs separated by lines of *(any amount). When I use egrep with the regex of '^\*+$' it works as intended, only displaying the lines that contain only stars.
However, when I use the same expression in awk -F or awk FS, it doesn't work and just prints out the whole document, excluding the lines of stars.
Commands that I tried so far:
awk -F'^\*+$' '{print $1, $2}' msgs
awk -F'/^\*+$/' '{print $1, $2}' msgs
awk 'BEGIN{ FS="/^\*+$/" } ; { print $1,$2 }' msgs
Printing the first field always prints out the whole document. The first version excludes the lines with the stars; the other versions include everything from the file.
Example input:
Par1 test teststsdsfsfdsf
fdsfdsfdsftesyt
fdsfdsfdsf fddsteste345sdfs
***
Par2 dsadawe232343a5edsfe
43s4esfsd s45s45e4t rfgsd45
***
Par3 dsadasd
fasfasf53sdf sfdsf s45 sdfs
dfsf dsf
***
Par4 dasdasda r3ar d afa fs
ds fgdsfgsdfaser ar53d f
***
Par 5 dasdawr3r35a
fsada35awfds46 s46 sdfsds5 34sdf
***
Expected output for print $1:
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs
EDIT: Added example input and expected output
Strings used as regexps in awk are parsed twice:
1. to turn them into a regexp, and
2. to use them as a regexp.
So if you want to use a string as a regexp (including any time you assign a Field Separator or Record Separator as a regexp) then you need to double any escapes as each iteration of parsing will consume one of them. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for details.
Good (a literal/constant regexp):
$ echo 'a(b)c' | awk '$0 ~ /\(b)/'
a(b)c
Bad (a poorly-written dynamic/computed regexp):
$ echo 'a(b)c' | awk '$0 ~ "\(b)"'
awk: cmd. line:1: warning: escape sequence `\(' treated as plain `('
a(b)c
Good (a well-written dynamic/computed regexp):
$ echo 'a(b)c' | awk '$0 ~ "\\(b)"'
a(b)c
but IMHO if you're having to double escapes to make a char literal then it's clearer to use a bracket expression instead:
$ echo 'a(b)c' | awk '$0 ~ "[(]b)"'
a(b)c
Also, ^ in a regexp means "start of string", which is only matched at the start of all the input, just like $ would only be matched at the end of all of the input. ^ does not mean "start of line" as some documents/scripts may lead you to believe. It only appears to mean that in grep and sed because they are line-oriented, so the regexp is usually being compared to 1 line at a time. awk isn't line-oriented, it's record-oriented, so the input being compared to the regexp isn't necessarily just a line (the same is true in sed if you read multiple lines into its hold space).
So to match a line of *s as a Record Separator (RS) assuming you're using gawk or some other awk that can treat a multi-char RS as a regexp, you'd have to write this regexp:
(^|\n)[*]+(\n|$)
but be aware that also matches the newlines before the first and after the last *s on the target lines so you need to handle that appropriately in your code.
It seems like this is what you're really trying to do:
$ awk -v RS='(^|\n)[*]+(\n|$)' 'NR==1{$1=$1; print}' file
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs
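For an awk without regexp RS support, a line-by-line sketch can produce the joined paragraphs too. Each record is then a single line, so ^ and $ do anchor to the line here. join_paragraphs and buf are illustrative names; this version prints every paragraph rather than only the first, and unlike $1=$1 it does not squeeze whitespace inside a line:

```shell
# Join the lines of each paragraph; paragraphs are separated by lines made only of stars.
join_paragraphs() {
    awk '
        /^[*]+$/ { if (buf != "") print buf; buf = ""; next }   # separator line: flush
        { buf = (buf == "" ? $0 : buf " " $0) }                 # otherwise append the line
        END { if (buf != "") print buf }                        # last paragraph, if any
    ' "$@"
}
```

Usage would be join_paragraphs msgs, or join_paragraphs msgs | head -n 1 to get only the first paragraph.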

Replace n occurrences of a character in a string from the end

I am struggling to come up with a solution to replace n occurrences of a character with another character in a string starting from the end of the string. For example, if I want to replace last 5 occurrences of "," with "|" in a string like
abc, def,,{"data":{"xyz":null,"uan":"5643df"},{"path":"/abc/def/xyz"}},546,453,,,
to get a result like
abc, def,,{"data":{"xyz":null,"uan":"5643df"},{"path":"/abc/def/xyz"}}|546|453|||
I have looked at multiple solutions which help you find the last occurrence, all occurrences, or 5 occurrences from the beginning, but nothing which helps me do it from the end of the string. Reversing the string, doing it from the beginning and then reversing the string again is not an option because of the sheer size of the file.
With GNU sed: run the same substitution five times. Each pass replaces the last remaining comma with a pipe and keeps the rest of the line after it (s/,([^,]*)$/|\1/):
echo 'a,b,c,d,e,f,g,h' | sed -r 's/,([^,]*)$/|\1/; s/,([^,]*)$/|\1/; s/,([^,]*)$/|\1/; s/,([^,]*)$/|\1/; s/,([^,]*)$/|\1/;'
Output:
a,b,c|d|e|f|g|h
An awk version:
echo 'a,b,c,d,e,f,g,h' | awk -F, '{printf "%s",$1;for(i=2;i<=NF;i++) printf (NF-5<i?"|%s":",%s"),$i;print ""}'
a,b,c|d|e|f|g|h
It uses a loop to print each field, counting up to decide whether to put , or | in front of it. Change the number to get a different result.
Example for the last two fields:
echo 'a,b,c,d,e,f,g,h' | awk -F, '{printf "%s",$1;for(i=2;i<=NF;i++) printf (NF-2<i?"|%s":",%s"),$i;print ""}'
a,b,c,d,e,f|g|h
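The hard-coded 5 and 2 above can be turned into a parameter. A small sketch, where the function name replace_last_commas and the awk variable n are made up for illustration:

```shell
# usage: replace_last_commas N [file...]
# Replace the last N commas of every line with a pipe.
replace_last_commas() {
    count=$1
    shift
    awk -F, -v n="$count" '{
        line = $1
        for (i = 2; i <= NF; i++)                    # the separator before field i ...
            line = line (i > NF - n ? "|" : ",") $i  # ... is | for the last n fields
        print line
    }' "$@"
}

printf '%s\n' 'a,b,c,d,e,f,g,h' | replace_last_commas 5
```

If a line has fewer than N commas, all of them are replaced.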
This might work for you (GNU sed):
sed -E '/(,[^,]*){5}$/{s//\n&/;h;y/,/|/;H;g;s/\n.*\n//}' file
Insert a newline just before the fifth comma from the end of a line, make a copy, replace all ,'s by |'s, append the current line to the copy and remove everything between the first and last newlines.
An alternative using GNU parallel and sed:
parallel -n0 -q echo 's/\(.*\),/\1|/' ::: {1..5} | sed -f - file
N.B. The first solution only amends a line if there are at least 5 commas, whereas the second solution amends a line regardless of how many commas there are.

find and replace using sed command

I want to find single quote ' between double quotes and replace it with (back slash single quote single quote) \' ' using sed command.
input = 'gender':"Men's",'colour':'Red','name':"Men's levi's"
output = 'gender':"Men\' 's",'colour':'Red','name':"Men\' 's levi\' 's"
I tried this where I can replace comma with pipe but when trying to replace single quote with \' ' it doesn't work:
sed 's/(\"[^"\'']\{1,\}),([^"\'']\{1,\}\")/\1 | \2/g' test.csv
Here is a way to do that using awk:
awk 'BEGIN{FS=OFS=","} {
for (i=1; i<=NF; i++)
if (split($i, a, / *: */) == 2 && a[2] ~ /^"/) {
gsub("\047", "\\\047 \047", a[2])
$i=a[1] ":" a[2]
}
} 1' file
'gender':"Men\' 's",'colour':'Red','name':"Men\' 's levi\' 's"
With GNU awk for multi-char RS and RT, all you need is:
$ awk -v RS='"[^"]+"' '{gsub(/\047/,"\\\047 \047",RT); ORS=RT} 1' file
'gender':"Men\' 's",'colour':'Red','name':"Men\' 's levi\' 's"
With sed you could do this:
sed -e ":a"
-e "s/'\([^\\\":]*\(\\.[^\\\":]*\)*\"\)/\\\\\f \f\1/"
-e "ta"
-e "s/\\\\\f \f/\\\' '/g" file
Line breaks and indentation are for readability. The point is that you first match a single quote that is followed (not necessarily immediately) by a double quote and replace it with \\\f \f (\\ is a literal backslash, \f a form feed). A loop (t) repeats this until nothing matches, and then the last command replaces the placeholder with your desired string. The main regex also takes care of escaped double quotation marks inside a double-quoted string, but it fails if you have colons : or commas , within it.
One-liner:
sed -e ":a" -e "s/'\([^\\\":]*\(\\.[^\\\":]*\)*\"\)/\\\\\f \f\1/" -e "ta" -e "s/\\\\\f \f/\\\' '/g" file
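A portable awk sketch of the same task, assuming the double quotes are balanced and never escaped: split each line on the double quote, so the even-numbered fields are the double-quoted parts, and rebuild those fields with plain string concatenation instead of a gsub replacement. escape_inner_quotes, part and s are illustrative names:

```shell
# Replace every single quote inside double quotes with \' '
escape_inner_quotes() {
    awk -F'"' -v OFS='"' '{
        for (i = 2; i <= NF; i += 2) {          # even fields sit between double quotes
            n = split($i, part, "\047")         # \047 is the single quote
            s = part[1]
            for (k = 2; k <= n; k++)
                s = s "\\\047 \047" part[k]     # join the pieces with \047-based \<quote> <quote>
            $i = s
        }
        print
    }' "$@"
}
```

Because the replacement is built by concatenation, there is no need to reason about how gsub treats backslashes in its replacement string.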

file cleanup using sed and regex (remove some but not all newlines)

i have a text file that i would like to load into hive. it has linebreaks within a string column so it won't load properly. from what i found out online the file needs to be preprocessed and all those linebreaks be removed. i have tried many regexes so far, but to no avail.
this is the file:
/biz/1-or-8;5.0;"a bunch of
text
with some
linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more
text
here.";2016-10-18
the desired output should be this:
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
i could achieve this in notepad++ by using this as a regex: (\r\n^(?!\/biz\/))+
however, when i run that regex using sed like so it doesn't work:
sed -e 's/(\r\n^(?!\/biz\/))+//g' original.csv > clean.csv
As stated, sed doesn't support lookaround assertions such as (?!\/biz\/).
Since your input is essentially record-oriented, awk offers a convenient solution.
With GNU awk or Mawk (required to support multi-character input record separators):
awk -v RS='/biz/' '$1=$1 { print RS $0 }' file
RS='/biz/' splits the input into records by /biz/ (reserved variable RS is the input-record separator, \n by default).
$1=$1 looks like a no-op, but actually rebuilds the input record at hand ($0) by normalizing any record-internal runs of whitespace - including newlines - to a single space each, relying on awk's default field-splitting and output behavior.
Additionally, since $1=$1 serves as a pattern (conditional), the outcome of the assignment decides whether the associated action ({ ... }) is executed for the record at hand.
For an empty record - such as the implied one before the very first /biz/ - the assignment returns '', which in a Boolean context evaluates to false and therefore skips the associated block.
{ print RS $0 } prints the rebuilt input record, prefixed by the input record separator; print automatically appends the output record separator, ORS, which defaults to \n.
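The effect of $1=$1 can be seen in isolation with a tiny sketch (squeeze_ws is a made-up name). With the default FS and OFS, rebuilding the record drops leading and trailing whitespace and turns every internal run of whitespace into one space:

```shell
# Rebuild each record so that runs of whitespace collapse to single spaces.
squeeze_ws() {
    awk '{ $1 = $1; print }'
}

printf '  a   b\tc  \n' | squeeze_ws
```

When the record separator is something other than a newline, the newlines inside a record count as whitespace too, which is what joins the lines in the command above.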
Note: Your code references \r\n, i.e., Windows-style CRLF line breaks. Since you're trying to use sed, I trust that the versions of the Unix utilities available to you on Windows transparently handle CRLF.
If you're actually on a Unix platform and only happen to be dealing with a Windows-originated file, a little more work is needed.
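One possible form of that extra work, as a sketch: strip the carriage returns with tr before awk sees the data. The awk part here is a portable line-by-line variant that does not need a multi-character RS. join_biz_records and rec are illustrative names, and the sketch assumes every record starts with /biz/ at the beginning of a line:

```shell
# usage: join_biz_records < original.csv > clean.csv
join_biz_records() {
    tr -d '\r' | awk '
        /^\/biz\// { if (rec != "") print rec; rec = $0; next }   # a new record starts
        { rec = (rec == "" ? $0 : rec " " $0) }                    # continuation line
        END { if (rec != "") print rec }                           # flush the last record
    '
}
```

The output uses plain \n line endings; if the consumer needs CRLF again, that would have to be added back in a separate step.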
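Maybe this can help you. Note that [^\/biz] is a bracket expression: it matches any single character other than \, /, b, i or z, not "anything but the string /biz". It works on the sample, but a continuation line starting with one of those characters would not be joined.
sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^\/biz]\)| \1|g;p}'
Test: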
$ sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^\/biz]\)| \1|g;p}' test
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
awk to the rescue! (with multi-char RS support)
$ awk -v RS='\n?^/' 'NF{$1=$1; print "/" $0}' file
or
$ awk -v RS='\n?^/' 'NF{$1="/"$1}NF' file
Create files
$ cat biz.awk
{ # read the entire input into the string `f', joining lines with a single space
f = f (NR > 1 ? " " : "") $0
}
END {
gsub(" /biz/", "\n/biz/", f) # start a new line before every /biz/
# except the first
print f
}
and
$ cat file
/biz/1-or-8;5.0;"a bunch of
text
with some
linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more
text
here.";2016-10-18
Usage:
awk -f biz.awk file
sed doesn't support lookarounds, perl does
$ perl -0777 -pe 's/(\n^(?!\/biz\/))+//mg' original.csv
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18
-0777 option will slurp entire file as single string
m option allows to use ^$ anchors in multiline strings
Note: line endings on Unix-like systems do not use \r, but if your input does have them, use \r\n as in the question.
Use different delimiter to avoid having to escape /
perl -0777 -pe 's|(\n^(?!/biz/))+||mg' original.csv
Another way to do it is delete all \n characters between a pair of double quotes
$ perl -0777 -pe 's|".*?"|$&=~s/\n//gr|gse' ip.txt
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18
s modifier allows .* to match across multiple lines and e modifier allows to use expression instead of string in replacement
$&=~s/\n//gr allows to perform substitution on matched text ".*?"
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk. With GNU awk for multi-char RS and RT:
$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n+/," ",RT); print $0 RT}' file
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18

sed substitution with user-specified replacement string

The general form of the substitution command in sed is:
s/regexp/replacement/flags
where the '/' characters may be uniformly replaced by any other single character. But how do you choose this separator character when the replacement string is being fed in by an environment variable and might contain any printable character? Is there a straightforward way to escape the separator character in the variable using bash?
The values are coming from trusted administrators so security is not my main concern. (In other words, please don't answer with: "Never do this!") Nevertheless, I can't predict what characters will need to appear in the replacement string.
You can use control character as regex delimiters also like this:
s^Aregexp^Areplacement^Ag
Where ^A is typed by pressing Ctrl-V followed by Ctrl-A.
Or else use awk and don't worry about delimiters:
awk -v s="search" -v r="replacement" '{gsub(s, r)} 1' file
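One caveat with this: gsub() still treats s as a regexp and & in r as "the matched text", and -v processes backslash escapes in the values. If the strings must be taken literally, a sketch along these lines avoids all three by passing them through the environment and using index() and substr(). replace_literal, FROM and TO are made-up names:

```shell
# usage: replace_literal FROM TO < input
# Replace every literal occurrence of FROM with TO, with no regexp or & handling.
replace_literal() {
    FROM=$1 TO=$2 awk '
        BEGIN { from = ENVIRON["FROM"]; to = ENVIRON["TO"]; len = length(from) }
        {
            out = ""
            rest = $0
            while (len > 0 && (pos = index(rest, from)) > 0) {
                out = out substr(rest, 1, pos - 1) to   # text before the match, then TO
                rest = substr(rest, pos + len)          # continue after the match
            }
            print out rest
        }'
}

printf '%s\n' '=abc=' | replace_literal 'abc' '///'
```

An empty FROM leaves the input unchanged, since the loop is skipped when len is 0.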
There is no (easy) solution for the following using sed:
while read -r string from to wanted
do
echo "in [$string] want replace [$from] to [$to] wanted result: [$wanted]"
final=$(echo "$string" | sed "s/$from/$to/")
[[ "$final" == "$wanted" ]] && echo OK || echo WRONG
echo
done <<EOF
=xxx= xxx === =====
=abc= abc /// =///=
=///= /// abc =abc=
EOF
which prints
in [=xxx=] want replace [xxx] to [===] wanted result: [=====]
OK
in [=abc=] want replace [abc] to [///] wanted result: [=///=]
sed: 1: "s/abc/////": bad flag in substitute command: '/'
WRONG
in [=///=] want replace [///] to [abc] wanted result: [=abc=]
sed: 1: "s/////abc/": bad flag in substitute command: '/'
WRONG
Can't resist: Never do this! (with sed). :)
Is there a straightforward way to escape the separator character in
the variable using bash?
No. Because you are passing the strings in from variables, you can't easily escape the separator character: in "s/$from/$to/" the separator can appear not only in the $to part but in the $from part too. E.g. if you escape the separator in the $from part, the replacement will not happen at all, because $from will no longer be found.
Solution: use something other than sed
1.) Use pure bash. In the above script, instead of sed, use:
final=${string//$from/$to}
2.) If bash's substitutions are not enough, use a tool to which you can pass $from and $to as variables.
As @anubhava already said, you can use: awk -v f="$from" -v t="$to" '{gsub(f, t)} 1' file
Or you can use perl and pass the values as environment variables:
final=$(echo "$string" | perl_from="$from" perl_to="$to" perl -pe 's/$ENV{perl_from}/$ENV{perl_to}/')
or pass the variables to perl via command-line arguments:
final=$(echo "$string" | perl -spe 's/$f/$t/' -- -f="$from" -t="$to")
2 options:
1) Pick a character that appears in neither string (this needs a pre-processing pass over the content, and there is no guarantee that one of the candidate characters is actually available).
# Quick and dirty sample using `'/_##|!%=:;,-` arbitrary sequence
Separator="$( printf "%sa%s%s" '/_##|!%=:;,-' "${regexp}" "${replacement}" \
| sed -n ':cycle
s/\(.\)\(.*a.*\1.*\)\1/\1\2/g;t cycle
s/\(.\)\(.*a.*\)\1/\2/g;t cycle
s/^\(.\).*a.*/\1/p
' )"
echo "Separator: [ ${Separator} ]"
sed "s${Separator}${regexp}${Separator}${replacement}${Separator}flag" YourFile
2) Escape the chosen separator character in both strings (this needs a pre-processing step to do the escaping).
# Quick and dirty sample using # as an arbitrary separator, with few safety checks on the escaping
regexpEsc="$( printf "%s" "${regexp}" | sed 's/#/\\#/g' )"
replacementEsc="$( printf "%s" "${replacement}" | sed 's/#/\\#/g' )"
# double quotes so the shell expands the variables; put the flags you need (e.g. g) in place of "flags"
sed "s#${regexpEsc}#${replacementEsc}#flags" YourFile
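A fuller sketch of option 2, assuming single-line strings and a sed that uses basic regular expressions (no -E or -r): escape the BRE metacharacters and the separator in the pattern, and \, & and the separator in the replacement, so that both strings are taken literally. sed_literal_replace, lit_from and lit_to are hypothetical names, not part of sed:

```shell
# usage: sed_literal_replace FROM TO < input
# Replace the first literal occurrence of FROM on each line with TO.
sed_literal_replace() {
    # pattern side: escape \ . * ^ $ / and [
    lit_from=$(printf '%s\n' "$1" | sed 's|[\.*^$/[]|\\&|g')
    # replacement side: escape \ / and &
    lit_to=$(printf '%s\n' "$2" | sed 's|[\/&]|\\&|g')
    sed "s/$lit_from/$lit_to/"
}

printf '%s\n' '=///=' | sed_literal_replace '///' 'abc'
```

The two helper sed calls use | as their own separator so that / can sit inside the bracket expressions without ending the command.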
From man sed
\cregexpc
Match lines matching the regular expression regexp. The c may be any
character.
When working with paths I often use # as the separator:
sed s\#find/path#replace/path#
No need to escape / with ugly \/.