sed - remove quotes within quotes in large csv files - regex

I am using the stream editor sed to convert a large set of text files (400MB) into CSV format.
I am very close to finishing, but the outstanding problem is quotes within quotes, in data like this:
1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"
The desired output is:
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
I have searched around for help, but I have not gotten close to a solution. I have tried the following sed commands with regex patterns:
sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt
These are from the below questions, but do not seem to be working for sed:
Related question for perl
Related question for SISS
The original files are *.txt and I am trying to edit them in place with sed.

Here's one way using GNU awk and the FPAT variable:
gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file
Results:
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
Explanation:
Using FPAT, a field is defined as either "anything that is not a
comma," or "a double quote, anything that is not a double quote, and a
closing double quote". Then on every line of input, loop through each
field and if the field starts and ends with a double quote, remove all
quotes from the field. Finally, add double quotes surrounding the
field.

sed -e ':r s:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":; tr' FILE
This looks for strings of the form "STR1 "STR2" STR3 " and converts them to "STR1 STR2 STR3". If it finds one, it repeats, to be sure that it eliminates all nested strings at a depth > 2.
It also ensures that none of the STRx parts contains a comma.
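For comparison, the second pattern from the question does produce the desired output in an engine that supports lookbehind and lookahead, which sed does not. Below is a minimal sketch in Python rather than sed; the function name strip_inner_quotes is made up for illustration, and it assumes that a quote belonging to the text never sits directly next to a comma.

```python
import re

# Same idea as the second pattern in the question: remove a double quote only
# when there is a non-comma character on both sides of it.
INNER_QUOTE = re.compile(r'(?<=[^,])"(?=[^,])')

def strip_inner_quotes(line: str) -> str:
    # Drop the newline first, so the closing quote of the last field has no
    # character after it and is therefore kept.
    return INNER_QUOTE.sub("", line.rstrip("\n"))

if __name__ == "__main__":
    print(strip_inner_quotes('3,word3,"description for "word3"","another text","more text and more"'))
```

Quotes at the start of a field (after a comma) or at the end of a field (before a comma or the end of the line) are left alone, so line 2 of the sample passes through unchanged.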

Related

Add backslash before single and double quote

I am trying to add a backslash before every single and double quote. The problem is that I want to exclude triple quotes.
What I have so far:
for single quote:
sed -e s/\'/\\\\\'/g test.txt > test1.txt
for double quote:
sed -e s/\"/\\\\\"/g test.txt > test1.txt
I have text like:
1,"""Some text XM'SD12X""","""Some text XM'SD12X""","""Auto " Moto " Some text"Some text"""
What I want is:
1,"""Some text XM\'SD12X""","""Some text XM\'SD12X""","""Auto \" Moto \" Some text\"Some text"""
If perl is okay:
perl -pe 's/"{3}(*SKIP)(*F)|[\x27"]/\\$&/g'
"{3}(*SKIP)(*F) don't change triple double quotes
use (\x27{3}|"{3})(*SKIP)(*F) if you shouldn't change triple single/double quotes
|[\x27"] match single or double quotes
\\$& prefix \ to the matched portion
With sed, you can replace the triple quotes with newline character (since newline character cannot be present in pattern space for default line-by-line usage), then replace the single/double quote characters and then change newline characters back to triple quotes.
# assuming only triple double quotes are present
sed 's/"""/\n/g; s/[\x27"]/\\&/g; s/\n/"""/g'
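The placeholder trick above can be sketched outside sed as well. This is only an illustration in Python; the function name escape_quotes is hypothetical, and the NUL character is assumed never to occur in the input (it plays the role the newline plays in the sed command).

```python
import re

def escape_quotes(line: str, placeholder: str = "\x00") -> str:
    # Step 1: hide every triple double quote behind the placeholder.
    # Assumption: the placeholder character is not present in the data.
    protected = line.replace('"""', placeholder)
    # Step 2: put a backslash in front of each remaining single or double quote.
    escaped = re.sub(r"['\"]", r"\\\g<0>", protected)
    # Step 3: turn the placeholders back into triple double quotes.
    return escaped.replace(placeholder, '"""')
```

As with the sed version, only triple double quotes are protected; triple single quotes would be escaped character by character.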

Pattern to swap backslashes in particular quoted strings

I'm using sed (actually ssed with extended "-r", so I have access to most any regex functionality I might need) to process files, one change being to convert doubled backslash characters to single forward slashes inside quoted strings. The problem is that some quoted strings that contain the doubled backslashes should not be converted. All quoted strings I want to target have a certain word "myPhrase" inside the quotes.
So for a file with these two lines:
"\\\\server\\dir\\myPhrase\\subdir"
"Don't change \\something me!"
the output would be:
"//server/dir/myPhrase/subdir"
"Don't change \\something me!"
I've tried various combinations of lookahead like (?=myPhrase) within a search pattern that finds the desired quoted chunks and replaces a capture group (\\) with / as the replacement, but all my attempts either replace just the first occurrence of the doubled backslashes, or those to the left of myPhrase, etc.
I'm sure there is some combination of lookahead/noncapture/recursion that should do this, but I'm blanking out completely right now.
With GNU awk for the 3rd arg to match():
$ cat file
"dont change \\this" "\\\\server\\dir\\myPhrase\\subdir" "nor\\this"
"Don't change \\something me!"
$ awk 'match($0,/(.*)("[^"]*myPhrase[^"]*")(.*)/,a){gsub(/[\\][\\]/,"/",a[2]); $0=a[1] a[2] a[3]} 1' file
"dont change \\this" "//server/dir/myPhrase/subdir" "nor\\this"
"Don't change \\something me!"
Try this Perl solution
perl -pe ' s{"(.+)?"}{$x=$1; if($x=~m/myPhrase/ ) {$x=~s!\\\\!/!g};sprintf("\x22%s\x22",$x)}ge ' file
with the below inputs
$ cat ykdvd.txt
"\\\\server\\dir\\myPhrase\\subdir"
"Don't change \\something me!"
Another line
$ perl -pe ' s{"(.+)?"}{$x=$1; if($x=~m/myPhrase/ ) {$x=~s!\\\\!/!g};sprintf("\x22%s\x22",$x)}ge ' ykdvd.txt
"//server/dir/myPhrase/subdir"
"Don't change \\something me!"
Another line
$
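Both answers follow the same two-step idea: find a quoted string, and only if it contains myPhrase convert its doubled backslashes. A rough Python sketch of that idea follows; the name fix_slashes is invented here, and it assumes quotes are never escaped inside a string. Unlike the two commands above, it handles several matching quoted strings on one line.

```python
import re

def fix_slashes(line: str, marker: str = "myPhrase") -> str:
    def convert(match: re.Match) -> str:
        quoted = match.group(0)
        if marker in quoted:
            # "\\\\" is a Python literal for two backslash characters.
            return quoted.replace("\\\\", "/")
        return quoted

    # Each "..." run without an inner double quote is handled on its own.
    return re.sub(r'"[^"]*"', convert, line)
```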

Replacing spaces with underscores within quotes

I need to replace within a large text file all occurrences such as 'yw234DV w-23-sDf wef23s-d-f' with the same strings but with underscores instead of spaces for all spaces within quotes, without replacing any spaces outside quotes with underscores.
I'm trying to find a solution for substitution within vim, but a sed solution would also be much appreciated. The number of tokens in each quote-delimited string may vary.
I've been playing with some regexes in vim, but they're pretty elementary and seem to be missing what I need.
My current attempt:
%s/'{[:alnum:] }*/'\0\_/g
And I'm experimenting with variations on that.
This is most similar to my question, though it is Java:
Replacing spaces within quotes
Sample Input:
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
Sample Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
You may try this with VIM, tried this on Macvim:
%s/\%('[^']*'\)*\('[^']*'\)/\=substitute(submatch(1), ' ', '_', 'g')/g
A much simpler solution, thanks to @SergioAraujo:
%s/\v%(('[^']*'))/\=substitute(submatch(1),' ', '_', 'g')/g
Not sure however, if below is the outcome you have expected
Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
In perl:
perl -i -pe's{(\x27.*?\x27)}{ (my $subst = $1) =~ tr/ /_/; $subst }ge' yourfile
or with perl5.14 or above:
perl -i -pe's{(\x27.*?\x27)}{ $1 =~ tr/ /_/r }ge' yourfile
With this as the input file:
$ cat file
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
We can convert all spaces inside of single-quotes into underscores with:
$ sed -E ":a; s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/; ta" file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
How it works
:a
This creates a label a.
s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/
This inserts the underscores where we want them.
^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]
This looks for any odd number of single quotes followed by any number of non-quote characters followed by a space. Everything before that space is saved in group 1.
\1_
This replaces the matched text with group 1 followed by an underscore.
ta
If the previous command put any new underscores in the string, then jump back to label a and try again.
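The sed loop has to count quotes from the start of the line because it replaces one space per pass. Where a replacement callback is available, each quoted run can be handled as a unit instead. A minimal Python sketch, with the hypothetical name underscore_quoted, assuming quotes are balanced and never escaped; it replaces plain spaces only, whereas the sed command uses [[:space:]].

```python
import re

def underscore_quoted(line: str) -> str:
    # Each '...' run is matched as a whole; spaces inside it become underscores,
    # while text between the quoted runs is left untouched.
    return re.sub(r"'[^']*'", lambda m: m.group(0).replace(" ", "_"), line)
```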
Using FPAT variable in gnu awk you can do this:
awk -v OFS=', ' -v FPAT="'[^']*'" '{for (h=1; h<=NF; h++)
{gsub(/[[:blank:]]/, "_", $h); printf "%s%s", $h, (h < NF ? OFS : ORS)}}' file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'

Substitute words not in double quotes

$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want a Unix sed command that changes only the basic that is not in quotes (change basic to ring).
Expected output:
$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*(("[^"]*){2})*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will edit the file directly, instead of printing the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (by wrapping it in {2} and repeating that group with *). And then we match basic. Because everything in the line before basic is part of the match, it would be replaced as well. That's why this part is wrapped in the outermost pair of parentheses, which captures it so that the replacement can write it back with \1 followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because the pattern is anchored to the start of the line and can therefore match only once per line. A solution would be a lookbehind, but it would have to be a variable-length lookbehind, which few regex engines support (.NET is one) and sed has no lookbehind at all. So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you want to edit the file in place, use the --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes
Assuming that there are no escaped quotes in the strings, e.g. "This is a trap \" hehe", awk can solve this problem
awk -F\" 'BEGIN {OFS=FS}
{
    for(i=1; i<=NF; i++){
        if(i%2)
            gsub(/basic/,"ring",$i)
    }
    print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
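The same odd/even field idea can be sketched in Python. The function name replace_outside_quotes is made up for this example, and it makes the same assumption as the awk version: no escaped quotes. Note that Python lists are 0-based, so awk's odd-numbered fields are the even indices here.

```python
def replace_outside_quotes(line: str, old: str = "basic", new: str = "ring") -> str:
    # Splitting on the double quote puts text outside quotes at indices 0, 2, 4, ...
    # and text inside quotes at indices 1, 3, 5, ...
    parts = line.split('"')
    parts[0::2] = [part.replace(old, new) for part in parts[0::2]]
    return '"'.join(parts)
```

Unlike the anchored sed pattern, this replaces every occurrence outside quotes in a single pass.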
If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0

replacing doublequotes in csv

I have the following problem and couldn't find a solution. This is roughly what my CSV file looks like:
1223;"B630521 ("L" fixed bracket)";"2" width";"length: 5"";2;alternate A
1224;"B630522 ("L" fixed bracket)";"3" width";"length: 6"";2;alternate B
As you can see there are some " written for inch and "L" in the enclosing ".
Now I'm looking for a UNIX shell script to replace the " (inch) and "L" double quotes with 2 single quotes, like the following example:
sed "s/$OLD/$NEW/g" $QFILE > $TFILE && mv $TFILE $QFILE
Can anyone help me?
Update (using perl it is easy, since you get full lookaround support)
perl -pe 's/(?<!^)(?<!;)"(?!(;|$))/'"'"'/g' file
Output
1223;"B630521 ('L' fixed bracket)";"2' width";"length: 5'";2;alternate A
1224;"B630522 ('L' fixed bracket)";"3' width";"length: 6'";2;alternate B
Using sed, grep only
Just by using grep, sed (and not perl, php, python etc) a not so elegant solution can be:
grep -o '[^;]*' file | sed 's/"/`/; s/"$/`/; s/"/'"'"'/g; s/`/"/g'
Output - for your input file it gives:
1223
"B630521 ('L' fixed bracket)"
"2' width"
"length: 5'"
2
alternate A
1224
"B630522 ('L' fixed bracket)"
"3' width"
"length: 6'"
2
alternate B
grep -o is basically splitting the input by ;
sed first replaces " at start of line by `
then it replaces " at end of line by another `
it then replaces all remaining double quotes " by single quote '
finally it puts back all " at the start and end
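The steps above can also be done without splitting the record across lines. The following Python sketch keeps each record on one line; the name requote_fields is hypothetical, and like the grep -o approach it assumes that no field contains a semicolon.

```python
def requote_fields(line: str, sep: str = ";") -> str:
    fields = []
    for field in line.split(sep):
        # Only touch fields that are wrapped in double quotes.
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            # Keep the outer pair, turn every inner double quote into a single quote.
            field = '"' + field[1:-1].replace('"', "'") + '"'
        fields.append(field)
    return sep.join(fields)
```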
Maybe this is what you want:
sed "s/\([0-9]\)\"\([^;]\)/\1''\2/g"
I.e.: Find double quotes (") following a number ([0-9]) but not followed by a semicolon ([^;]) and replace it with two single quotes.
Edit:
I can extend my command (it's becoming quite long now):
sed "s/\([0-9]\)\"\([^;]\)/\1''\2/g;s/\([^;]\)\"\([^;]\)/\1\'\2/g;s/\([^;]\)\"\([^;]\)/\1\'\2/g"
As you are using SunOS I guess you cannot use extended regular expressions (sed -r)? Therefore I did it that way: The first s command replaces all inch " with '', the second and the third s are the same. They substitute all " that are not a direct neighbor of a ; with a single '. I have to do it twice to be able to substitute the second " of e.g. "L" because there's only one character between both " and this character is already matched by \([^;]\). This way you would also substitute "" with ''. If you have """ or """" etc. you have to put one more (but only one more) s.
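Why the same substitution has to run twice can be seen in a small experiment. This is a Python sketch of the second and third s commands only (not the inch rule), with the made-up name one_pass: a global substitution never reuses a character it has already consumed, so the L matched as the right-hand neighbour of the first quote cannot also serve as the left-hand neighbour of the second one.

```python
import re

# A double quote with a non-semicolon character on each side.
PAIR = re.compile(r'([^;])"([^;])')

def one_pass(text: str) -> str:
    # Matches cannot overlap, so adjacent candidates need another pass.
    return PAIR.sub(r"\1'\2", text)

if __name__ == "__main__":
    field = '"B630521 ("L" fixed bracket)"'
    print(one_pass(field))            # second quote around L is still there
    print(one_pass(one_pass(field)))  # both quotes around L are replaced
```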
For the "L" try this:
sed "s/\"L\"/'L'/g"
For inches you can try:
sed "s/\([0-9]\)\"\"/\1''\"/g"
I am not sure it is the best option, but I have tried and it works. I hope this is helpful.