Replacing double quotes in CSV - regex

I have the following problem and haven't found a solution. My CSV file looks roughly like this:
1223;"B630521 ("L" fixed bracket)";"2" width";"length: 5"";2;alternate A
1224;"B630522 ("L" fixed bracket)";"3" width";"length: 6"";2;alternate B
As you can see, some " characters (the inch sign and the quotes around "L") appear inside the enclosing ".
Now I'm looking for a UNIX shell script that replaces the " (inch) and "L" double quotes with 2 single quotes, along the lines of the following:
sed "s/$OLD/$NEW/g" $QFILE > $TFILE && mv $TFILE $QFILE
Can anyone help me?

Update (using perl it is easy, since you get full lookahead and lookbehind support)
perl -pe 's/(?<!^)(?<!;)"(?!(;|$))/'"'"'/g' file
Output
1223;"B630521 ('L' fixed bracket)";"2' width";"length: 5'";2;alternate A
1224;"B630522 ('L' fixed bracket)";"3' width";"length: 6'";2;alternate B
Using sed, grep only
Using only grep and sed (and not perl, php, python etc.), a not-so-elegant solution can be:
grep -o '[^;]*' file | sed 's/"/`/; s/"$/`/; s/"/'"'"'/g; s/`/"/g'
Output - for your input file it gives:
1223
"B630521 ('L' fixed bracket)"
"2' width"
"length: 5'"
2
alternate A
1224
"B630522 ('L' fixed bracket)"
"3' width"
"length: 6'"
2
alternate B
grep -o basically splits the input on ;
sed first replaces the " at the start of the line with `
then it replaces the " at the end of the line with another `
it then replaces all remaining double quotes " with a single quote '
finally it puts back the " at the start and end
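To make the placeholder steps above easier to follow, here is a minimal sketch that runs them one at a time on a single field from the question. The variable names (field, step1 ... step4) are only for illustration and are not part of the original answer.

```shell
# One field, as grep -o '[^;]*' would emit it (sample taken from the question).
field='"B630521 ("L" fixed bracket)"'

# 1. Turn the first (opening) double quote into a backtick placeholder.
step1=$(printf '%s\n' "$field" | sed 's/"/`/')
# 2. Turn the double quote at the end of the line into a placeholder too.
step2=$(printf '%s\n' "$step1" | sed 's/"$/`/')
# 3. Any double quote still present is an inner one: make it a single quote.
step3=$(printf '%s\n' "$step2" | sed "s/\"/'/g")
# 4. Turn the placeholders back into double quotes.
step4=$(printf '%s\n' "$step3" | sed 's/`/"/g')

printf '%s\n' "$step1" "$step2" "$step3" "$step4"
```

The last line printed is "B630521 ('L' fixed bracket)". Note that this relies on the data never containing a backtick of its own.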

Maybe this is what you want:
sed "s/\([0-9]\)\"\([^;]\)/\1''\2/g"
I.e.: find a double quote (") that follows a digit ([0-9]) and is followed by a character other than a semicolon ([^;]), and replace it with two single quotes.
Edit:
I can extend my command (it's becoming quite long now):
sed "s/\([0-9]\)\"\([^;]\)/\1''\2/g;s/\([^;]\)\"\([^;]\)/\1\'\2/g;s/\([^;]\)\"\([^;]\)/\1\'\2/g"
As you are using SunOS I guess you cannot use extended regular expressions (sed -r)? Therefore I did it that way: The first s command replaces all inch " with '', the second and the third s are the same. They substitute all " that are not a direct neighbor of a ; with a single '. I have to do it twice to be able to substitute the second " of e.g. "L" because there's only one character between both " and this character is already matched by \([^;]\). This way you would also substitute "" with ''. If you have """ or """" etc. you have to put one more (but only one more) s.
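A small sketch of why the same s command has to appear twice, assuming a shortened sample line of my own (not from the question). The helper variables q and pass are just for readability.

```shell
q="'"                                       # a literal single quote
pass="s/\([^;]\)\"\([^;]\)/\1$q\2/g"        # one "not next to ;" substitution
line='1;"a ("L" b)";2'                      # shortened, hypothetical sample

# One pass: the L is consumed as the right-hand neighbour of the first
# quote, so the quote after it cannot be matched in the same pass.
once=$(printf '%s\n' "$line" | sed "$pass")

# Two passes: the second run picks up the quote that was skipped.
twice=$(printf '%s\n' "$line" | sed "$pass;$pass")

printf '%s\n%s\n' "$once" "$twice"
```

The first line printed still contains one inner double quote (after the L); the second line has both replaced, while the field-enclosing quotes next to ; are left alone.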

For the "L" try this:
sed "s/\"L\"/'L'/g"
For inches you can try:
sed "s/\([0-9]\)\"\"/\1''\"/g"
I am not sure it is the best option, but I have tried and it works. I hope this is helpful.

GNU sed regex to fix mySQL db inserts for SQLite

I am trying to translate a huge mySQL database dump file from mySQL syntax into SQLite syntax.
At https://regex101.com/ I have successfully created an ECMAScript flavor regex to turn something like:
,'foo\'s bar!',
into:
,"foo\'s bar!"
with this regular expression:
/,'([^']+)\\'([^']+)',/"$1\\'$2"/g
testing against this short file:
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n','Henry\'s_1.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h','Henry\'s_2.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
Resulting in:
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n'"Henry\'s_1.csv"'text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h'"Henry\'s_2.csv"'text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
but for the life of me I cannot translate this into a GNU sed flavor regex.
For example, this command does not make any substitutions in the output:
sed -r s/,'([^']+)\\'([^']+)',/"$1\\'$2"/g <test.sql
...
sed -r s/,'([^']+)\\'([^']+)',/"\1\\'\2"/g <test.sql: doesn't work either.
I have looked online for a regex tool that translates between different flavors of regex but cannot find one that works for GNU sed (shipped with Git: sed (GNU sed) 4.8). PCRE seems to be close to what sed has, but that doesn't work. I tried perl as well, no luck.
Does anyone know a regular expression that works, or a translator tool that works?
I am just about ready to write a nodejs program to do this for me.
Also, for extra credit, how can I write a sed script to handle any number of escaped quotes within a quoted string? I have that issue to deal with as well in my DB dump file.
Examples:
'foo\'-bar' // one instance
'foo\'and\'bar' // two instances
'foo\'and\'bar\'s on the deck' // three instances
and so on...
Thanks!
You can use
sed -E "s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g test.sql
The "s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g consists of
"s/,'([^']+)\\\\'([^']+)',/" - a s/,'([^']+)\\'([^']+)',/ part (inside double quotes, so backslashes need doubling)
'"' - a " char (inside single quotes)
"\\1\\\\'\\2" - \1\\'\2 pattern (inside double quotes, so backslashes are doubled)
'"' - a " char (inside single quotes)
/g - the global flag (no quoting needed here).
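If the quoting is hard to follow, one way to check it is to let the shell print the assembled string before handing it to sed. This is only a sketch; the variable name script and the one-field input are my own.

```shell
# Build the sed script from the same quoted pieces as above.
script="s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g

# What sed actually receives after the shell has removed its quoting:
printf '%s\n' "$script"

# Apply it to a single field with one escaped quote.
result=$(printf '%s\n' ",'foo\\'s bar'," | sed -E "$script")
printf '%s\n' "$result"
```

The first printf shows the s command with single backslash pairs and plain quotes, i.e. the regex101 expression with \1 and \2 as back-references. As in the question's own result, the surrounding commas are consumed by the match and not put back.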
First look at your command
sed -r s/,'([^']+)\\'([^']+)',/"\1\\'\2"/g test.sql
I prefer writing the whole sed command in single quotes. When you need a single quote, you must close the string ('), use an escaped single quote (\') and open the next string with a ', all joined: '\''.
I also added two , characters.
sed -r 's/,'\''([^'\'']+)\\'\''([^'\'']+)'\'',/,"\1\\'\''\2",/g' test.sql
# Shorter
sed -r 's/,'\''([^'\'']+\\'\''[^'\'']+)'\'',/,"\1",/g' test.sql
# Using another way to write the single quotes, with the hex notation
sed -r 's/,\x27([^\x27]+\\\x27[^\x27]+)\x27,/,"\1",/g' test.sql
This works for simple cases, not for 'foo\'and\'bar\'s on the deck'.
I think you want to replace the quotes in the simple fields too.
Suppose you want to transform
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n','Henry\'s_1.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h','Henry\'s_2.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
(2000,'extra credit from question','foo\'and\'bar\'s on the deck','text/csv','{\"identified\":true,\"analyzed\":true}',33999,'KgSBFstbdthdsssssstvbA==','2022-01-02 13:07:23'),
into
(1058,"gpl5q0x51349lmdq3e0ijm4k9b6n","Henry\'s_1.csv","text/csv","{\"identified\":true,\"analyzed\":true}",33854,"mUVk0/XGX+afIpkrqBm7LQ==","2021-01-06 03:07:23"),
(1059,"xzj8mivsenkakkrurfjytxjsaj1h","Henry\'s_2.csv","text/csv","{\"identified\":true,\"analyzed\":true}",33555,"KfRYqfAWtSIYXZ6oQZyYbA==","2021-01-06 03:07:23"),
(2000,"extra credit from question","foo\'and\'bar\'s on the deck","text/csv","{\"identified\":true,\"analyzed\":true}",33999,"KgSBFstbdthdsssssstvbA==","2022-01-02 13:07:23"),
In this answer I don't use the '\'' but the hexadecimal notation \x27.
First "backup" the \' combinations (replace them by an unused character like \r), replace all normal quotes by double quotes and "restore the backup" (change back the \r).
sed 's/\\\x27/\r/g; s/\x27/"/g; s/\r/\\\x27/g' test.sql
# or hex value for double quote "
sed 's/\\\x27/\r/g; s/\x27/\x22/g; s/\r/\\\x27/g' test.sql
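The \x27 and \r escapes are, as far as I know, GNU extensions. A sketch of the same backup/replace/restore idea that avoids them by passing the characters in through shell variables (q and cr are names I made up, and the input line is shortened):

```shell
q="'"                    # a literal single quote
cr=$(printf '\r')        # a literal carriage return, used as the placeholder
line="(2000,'foo\\'and\\'bar\\'s on the deck','text/csv'),"

# 1. back up every \' as CR, 2. turn remaining ' into ", 3. restore \'
result=$(printf '%s\n' "$line" |
  sed "s/\\\\$q/$cr/g; s/$q/\"/g; s/$cr/\\\\$q/g")

printf '%s\n' "$result"
```

This prints (2000,"foo\'and\'bar\'s on the deck","text/csv"), and it assumes the input contains no carriage returns of its own (so it is not suitable for files with CRLF line endings as written).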

Remove double quotes from the substring preceding the first dot

I previously asked a question about removing double quotes from a string after a . (dot). I kindly received an answer, but I am unsure how exactly it works.
I am now trying to remove the double quotes from around the string before the . (dot). I have tried to edit the original command by trial and error without much luck; the closest I have come so far is shown below.
Could someone please explain how and why the first command works and, if possible, help me edit my attempt so that it removes the double quotes from around the string to the left of the . (dot)?
Original Command - removes " " from the right of the dot:
sed 's/\."\([^"]*\)"/.\1/g' file
Sample Before:
"A".HELLO
A."HELLO"
"A"."HELLO"
Required Result:
A.HELLO
A."HELLO"
A."HELLO"
Attempt:
sed -i 's/"*"\.\([^"]*\)"/.\1/g' $(2)
After:
"A".HELLO
A."HELLO"
"A.HELLO"
Link to original post: UNIX Bash - Removing double quotes from specific strings within a file
Credit to user potong for original answer.
You could try the following (if you are OK with awk); it was written and tested with the shown samples in GNU awk.
awk 'BEGIN{FS=OFS="."} {gsub(/"/,"",$1)} 1' Input_file
Explanation: The BEGIN section sets both the field separator and the output field separator to . (dot). The main block then globally replaces " with the empty string in the first field only; since . is the field separator and the OP wants to remove double quotes only before the first ., working on the 1st field does the trick. The trailing 1 prints the current line of Input_file.
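A quick way to try this without a file is to feed the samples from the question through a pipe. The fourth input line, with two dots, is my own addition to show that only the part before the first dot is touched.

```shell
result=$(printf '%s\n' '"A".HELLO' 'A."HELLO"' '"A"."HELLO"' '"A"."B"."C"' |
  awk 'BEGIN{FS=OFS="."} {gsub(/"/,"",$1)} 1')

printf '%s\n' "$result"
```

The first three output lines match the required result from the question, and the fourth comes out as A."B"."C".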
Using a primitive loop:
$ sed -e ':L' -e 's/\([^.]*\)"\([^.]*\.\)/\1\2/' -e 'tL' file
A.HELLO
A."HELLO"
A."HELLO"
This works with all POSIX-compliant seds, not only GNU sed.

Replace all commas between two quotes in a bash script

I need all "," between two " to be replaced with ";" within a bash script. I'm close, but hours on the internet and Stack Overflow only got me this far:
echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' | sed -r ':a;s/(".*?),(.*?")/\1;\2/;ta'
With the result:
,,Lung,,"Lobular; each.|lungs; right.";false;;;;"organ; left.",,,,,
Correct would be:
,,Lung,,"Lobular; each.|lungs; right.",false,,,,"organ; left.",,,,,
Not sure how you want to deal with lines that have an odd number of double quotes (eg, the double quoted string spans multiple lines), but perhaps:
awk '!(NR%2){gsub(",",";")} 1' RS=\" ORS=\"
This simply treats " as the record separator and does the replacement only on even-numbered records, which are the ones between a pair of quotes. Seems to work as desired. (Or, rather, it works as you seem to desire!)
As oguz points out in a comment, this prints an extra " at the end. That can be fixed with:
awk '!(NR%2){gsub(",",";")} {printf "%s%s", RFS, $0} {RFS="\""}' RS=\"
which is a bit uglier but more correct (or, rather, less incorrect!). RFS is just an ordinary variable here: it is empty for the first record and a " for every later one. If your input stream ends with a ", that quote will be truncated. If, however, your input is terminated by a newline rather than a ", this will do what you want.
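Here is a sketch of the record-separator idea run against the sample line from the question. I have named the separator variable sep and passed RS with -v; both are my own choices for the example.

```shell
line=',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,'

# RS='"' makes every chunk between double quotes a record. Even-numbered
# records are the quoted ones, so only those get their commas replaced.
# sep is empty before the first record and a double quote afterwards.
result=$(printf '%s\n' "$line" |
  awk -v RS='"' '!(NR%2){gsub(",",";")} {printf "%s%s", sep, $0; sep="\""}')

printf '%s\n' "$result"
```

For this input the output is the "Correct would be" line from the question: the commas outside the quotes are left untouched.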
OTOH, you might just want to do:
perl -wpE 'BEGIN{$/=\1}; y/,/;/ if $in; $in = ! $in if $_ eq "\""'
Which reads one character and uses a simple state machine. ($_ is the current character, so $in = ! $in changes state when a double quote is seen and the transliteration only happens when $in is non-zero.)
If you /really/ wanted to use sed, you could do a whole line replace and include a clause like ^(([^"]*"[^"]*")*[^"]*) at the beginning of your existing expression. That prefix consumes an even number of quotes, which ensures that the quote matched next is an "odd" (opening) one.
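One possible way to spell out that sed idea, as a rough sketch only: it assumes the quotes on each line are balanced and that a sed with -E is available. The exact expression is mine, not the answerer's.

```shell
line=',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,'

# The prefix (([^"]*"[^"]*")*[^"]*) skips complete quoted strings, so the
# following " opens a quoted string; [^",]* then walks to a comma inside it.
# Each iteration replaces one such comma; the loop runs until none is left.
result=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/^(([^"]*"[^"]*")*[^"]*"[^",]*),/\1;/' -e 'ta')

printf '%s\n' "$result"
```

Because every pass rescans the line from the start, this can be slow on very long lines; the awk and perl versions above read the input only once.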

Finding and replacing a numeric string between colons, before a space, using sed?

I am attempting to change all coordinate information in a fastq file to zeros. My input file is composed of millions of entries in the following repeating 4-line structure:
#HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I would like to replace the two numeric strings in the first line 16030:89299 with zeros in a generic way, such that any numeric string between the colons, before the space, is replaced. I would like the output to appear as follows, replacing the two strings globally throughout the file with zeros:
#HWI-SV007:140:C173GACXX:6:2215:0:0 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I am attempting to do this using the following sed:
sed 's/:^[0-9]+$:^[0-9]+$\s/:0:0 /g'
However, this does not behave as expected.
I think you will need sed's -r option (extended regular expressions) so that + works as a quantifier.
Also, ^ matches the beginning of the line and $ matches the end of the line, so they do not belong in the middle of the pattern.
This command works against your sample:
sed -r 's/:[0-9]+:[0-9]+\s/:0:0 /g'
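A self-contained way to check this on one record, as a sketch: I use -E and [[:space:]] instead of -r and \s on the assumption that this is a bit more portable, and the sequence and quality lines are shortened versions of the ones in the question.

```shell
record='#HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACAG
+
###FFFDFHGGDHII'

# Two colon-separated numbers directly followed by whitespace become 0:0;
# the whitespace character itself is captured and put back unchanged.
result=$(printf '%s\n' "$record" |
  sed -E 's/:[0-9]+:[0-9]+([[:space:]])/:0:0\1/')

printf '%s\n' "$result"
```

Only the header line changes; the other three lines pass through untouched because they contain no whitespace after a pair of numbers.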
Some alternatives:
awk -F ":" 'BEGIN{ OFS = ":" }{ if ( NF > 1 ) {$6 = 0; sub( /^[0-9]*/, 0, $7)}; print $0 }' YourFile
This splits the line into columns on : and zeroes the 6th column and the leading digits of the 7th.
sed 's/^\(\([^:]*:\)\{5\}\)[^[:blank:]]*/\10:0/' YourFile
This keeps the first 5 elements separated by :, then replaces everything up to the next blank with 0:0.
As for your own sed:
sed 's/:[0-9]\+:[0-9]\+\(\s\)/:0:0\1/'
^ and $ are relative to the whole line, not the current word.
Without -r, GNU sed needs \+ for "one or more"; a bare + is taken literally.
Capturing \s keeps the original whitespace instead of replacing it with a plain space (useful if there are several, or another kind such as \t).
g is not needed (and better left out here) because there is normally only 1 occurrence per line.
Since the pattern is short, you need to be sure it cannot match somewhere else (i.e. that no earlier pair of numbers is followed by a space).

Substitute words not in double quotes

$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want a Unix sed command that changes only the basic that is not in quotes (change basic to ring).
Expected output:
$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (with {2}*). And then we match basic. Because we matched all of that stuff in the line before basic, we would replace that as well. That's why this part is wrapped in another pair of parentheses, thus capturing the start of the line and writing it back in the replacement with \1 followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because the pattern is anchored at the start of the line (so it can match only once per line) and the prefix is greedy. A solution would be a lookbehind, but it would have to be a variable-length lookbehind, which very few regex engines support (.NET is one) and sed does not. So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
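To illustrate the caveat, here is a sketch that runs the substitution once and then in a loop until nothing changes. The helper names fix_once and fix_all and the sample line are mine; I have also written the "even number of quotes" part with explicit grouping, (("[^"]*){2})*, which I assume behaves the same as the expression above.

```shell
# Single pass: replaces only the last unquoted "basic" on the line.
fix_once() { sed -E 's/^([^"]*(("[^"]*){2})*)basic/\1ring/'; }

# Loop (:a ... ta): repeat the substitution for as long as it succeeds.
fix_all() { sed -E -e ':a' -e 's/^([^"]*(("[^"]*){2})*)basic/\1ring/' -e 'ta'; }

line='basic "basic" basic'
once=$(printf '%s\n' "$line" | fix_once)
all=$(printf '%s\n' "$line" | fix_all)

printf '%s\n%s\n' "$once" "$all"
```

The single pass leaves the first basic alone, while the loop turns both unquoted occurrences into ring and keeps the quoted one. The loop terminates because the replacement text does not contain basic.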
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you want to edit the file in place, use the --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"])/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes
Assuming that there are no escaped quotes in the strings, e.g. "This is a trap \" hehe", awk might be able to solve this problem
awk -F\" 'BEGIN {OFS=FS}
{
    for(i=1; i<=NF; i++){
        if(i%2)
            gsub(/basic/,"ring",$i)
    }
    print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
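For reference, one way the one-liner might look, as a sketch: it steps through the odd-numbered fields directly instead of testing i%2. The two input lines are made up for the example (the second is taken from the question's file0).

```shell
result=$(printf '%s\n' 'basic "basic" basic' '"^/))basic"' |
  awk -F'"' -v OFS='"' '{for (i = 1; i <= NF; i += 2) gsub(/basic/, "ring", $i)} 1')

printf '%s\n' "$result"
```

The first line becomes ring "basic" ring and the second is printed unchanged, since its only basic sits in an even-numbered (quoted) field.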
If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0