CSV parsing for embedded double quotes - c++

I've written a simple CSV file parser. But after looking at the wiki page on CSV formats I noticed some "extensions" to the basic format. Specifically, embedded commas via double quotes. I've managed to parse those, however there is a second issue: embedded double quotes.
Example:
12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]
I can't seem to find the correct way to distinguish between an embedded (escaped) double quote and one that encloses the field. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?

The way I normally think about this is to treat a value as either a single unquoted value or a sequence of double-quoted segments that are joined by quotes to form the value. That is, to parse the next atom in the row:
read up to the first non-whitespace character
if the current character is not a quote:
    mark the current spot
    read up to the next comma or newline
    return the text between the mark and the character before the comma (strip spaces if appropriate)
if the current character is a quote:
    create an empty string buffer
    while the current character is a quote:
        mark the current position + 1 (skip the quote character)
        read up to the next quote
        if this is not the first segment, append a quote to the buffer
        append to the buffer the text between the mark and the character before the current position (to strip both quotes)
        advance one character (past the quote just read)
    read up to the next comma or newline
    return the buffer
Essentially, split the quoted string into its double-quoted segments and then concatenate them with quotes in between. Thus "ABC, ""IJK"" XYZ" is split into the segments [ABC, ], [IJK] and [ XYZ], which are joined into ABC, "IJK" XYZ.

I would do this using a single character look-ahead, so if you're scanning the string and find a double quote, look at the next character to see if it is also a double quote. If it is, then the pair represents a single doublequote character in the output. If it's any other character, you're looking at the end of the quoted string (and hopefully that next character is a comma!). Be sure to account for the end-of-line condition when looking at the next character, too.
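The look-ahead idea can be sketched in C++ roughly as follows. This is a minimal sketch, not a complete CSV reader: the function name parseCsvLine is made up for illustration, and it assumes one record per line with the trailing newline already removed (so quoted fields containing line breaks are not handled).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split one CSV record into fields, using one character of look-ahead
// to tell an escaped quote ("") from a closing quote.
// Assumption: `line` holds a single record without its trailing newline.
std::vector<std::string> parseCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    bool inQuotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                // Look at the next character, guarding against end of line.
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    field += '"';  // "" inside quotes is one literal quote
                    ++i;           // consume the second quote of the pair
                } else {
                    inQuotes = false;  // lone quote closes the quoted part
                }
            } else {
                field += c;  // commas are ordinary characters in here
            }
        } else if (c == '"') {
            inQuotes = true;  // opening quote, not part of the value
        } else if (c == ',') {
            fields.push_back(field);  // unquoted comma ends the field
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);  // the last field has no trailing comma
    return fields;
}
```

Called on the line from the question, this should yield the two fields 12345 and ABC, "IJK" XYZ. The sketch does not report malformed input such as a missing closing quote; a real parser would probably want to.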

If you find a double-quote, then you should look for a matching double-quote at the end of the word/string. If you can't find one, then there is an error. The same goes for a single quote.
I suggest you try Flex/Bison in order to write a parser for the CSV file. Both tools will help you to generate a parser and then you can use the C files with the parser and call it from your C++ program.
On Flex, you create a scanner that can find your tokens, like "word" or ""word"". On Bison, you define the syntax.

A double double-quote ("") is a literal double-quote, while a lone double-quote (") is used for enclosing text (including commas).
Here's a regex for a csv field, if that makes things easier:
([^",\n][^,\n]*)|"((?:[^"]|"")+)"
Group 1 will contain the field if it isn't in quotes, group 2 will contain the field if it is in quotes, minus the surrounding quotes. In that case, just replace all instances of "" with ".
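As a rough illustration, the regex above can be used from C++ with std::regex (default ECMAScript grammar). The function name fieldsByRegex is hypothetical. Note that the pattern, as written, never matches an empty field, so empty fields are silently skipped in this sketch.

```cpp
#include <regex>
#include <string>
#include <vector>

// Extract the fields of one CSV line with the regex from the answer.
// Group 1 = unquoted field, group 2 = quoted field without the outer quotes.
std::vector<std::string> fieldsByRegex(const std::string& line) {
    // A custom raw-string delimiter is needed because the pattern contains )"
    static const std::regex fieldRe(R"re(([^",\n][^,\n]*)|"((?:[^"]|"")+)")re");

    std::vector<std::string> fields;
    for (std::sregex_iterator it(line.begin(), line.end(), fieldRe), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        if (m[1].matched) {
            fields.push_back(m[1].str());
            continue;
        }
        // Quoted field: replace every "" with a single ".
        std::string value = m[2].str();
        std::string::size_type pos = 0;
        while ((pos = value.find("\"\"", pos)) != std::string::npos) {
            value.erase(pos, 1);  // drop one quote of the pair
            ++pos;                // step over the quote that remains
        }
        fields.push_back(value);
    }
    return fields;
}
```

For anything beyond short lines, a hand-written scanner is likely the safer choice, since regex engines can be slow or stack-hungry on long quoted fields.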

I suggest reading: Stop Rolling Your Own CSV Parser and this CSV RFC. The first is really just someone who wants you to use their C# CSV parser, but still explains many issues.
Your parser should be examining a character at a time. I used a double bool strategy for my parser in D. Each quote toggles whether the string is quoted or not. When in a quoted cell, you set a flag when you hit a quote and turn off quoting. If the next character is a quote, quoting is turned back on, a quote is added to the result, and the flag is cleared. If the next character isn't a quote, the flag is cleared and quoting stays off.
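The two-flag strategy described above might look something like this in C++ (the original was in D and is not shown, so this is only a guess at its shape). The names parseWithFlags, quoted and pendingQuote are invented for the sketch, and it assumes a single record without a trailing newline.

```cpp
#include <string>
#include <vector>

// Character-at-a-time CSV split driven by two booleans instead of look-ahead.
// Assumption: `line` is one record with the newline already stripped.
std::vector<std::string> parseWithFlags(const std::string& line) {
    std::vector<std::string> cells;
    std::string cell;
    bool quoted = false;        // currently inside a quoted cell
    bool pendingQuote = false;  // previous char was a quote seen while quoted

    for (char c : line) {
        if (pendingQuote) {
            pendingQuote = false;
            if (c == '"') {
                // Second quote of a "" pair: emit one quote, resume quoting.
                cell += '"';
                quoted = true;
                continue;
            }
            // Otherwise the earlier quote really closed the cell;
            // fall through and treat c as an ordinary unquoted character.
        }
        if (quoted) {
            if (c == '"') {
                quoted = false;       // tentatively leave quoted mode
                pendingQuote = true;  // and remember that we just saw a quote
            } else {
                cell += c;
            }
        } else if (c == '"') {
            quoted = true;
        } else if (c == ',') {
            cells.push_back(cell);
            cell.clear();
        } else {
            cell += c;
        }
    }
    cells.push_back(cell);
    return cells;
}
```

Compared with look-ahead, this variant never reads past the current character, which can be convenient when the input arrives as a stream.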

Related

putting regex to make git diff split words at punctuation into .gitconfig file

current setup
My .gitconfig currently includes this alias:
[alias]
wdiff = diff --color-words --histogram
to let me write git wdiff and get word-by-word rather than line-by-line diff output. I use this for writing scholarly prose in LaTeX.
goal
This method divides words only at white space. I would like to divide at punctuation marks so that, for example, last word of sentence. changed to last word of sentence.\footnote{New footnote.} produces diff output that looks something like this:
last word of sentence.{+\footnote{New footnote.}+}
rather than the current output:
last word of [-sentence.-]{+sentence.\footnote{New footnote.}+}
(where [-...-] marks a deletion and {+...+} marks an addition).
attempted solution
I found this other question that begins with a regex that does exactly what I want in the command line, but I haven't figured out how to put this in my .gitconfig file without producing the error message fatal: bad config line 12 in file /Users/alex/.gitconfig. This is what I put in my .gitconfig file:
[alias]
wdiff = diff --color-words='[^][<>()\{},.;:?/|\\=+*&^%$##!~`"'\''[:space:]]+|[][<>(){},.;:?/|\\=+*&^%$##!~`"'\'']' --histogram
The problem seems to be the semicolon.
A different question that deals with a similar problem in .gitconfig suggested putting double-quotes around an entire alias. But when I do that in my case, I get the same error message. I think this is because the regex also includes double-quotes.
question
How can I put the regex into my .gitconfig file such that it can be properly parsed?
I was confused as well until I found this page of documentation. The part you are interested in is:
A line that defines a value can be continued to the next line by ending it with a \; the backslash and the end-of-line are stripped. Leading whitespaces after name =, the remainder of the line after the first comment character # or ;, and trailing whitespaces of the line are discarded unless they are enclosed in double quotes. Internal whitespaces within the value are retained verbatim.
Inside double quotes, double quote " and backslash \ characters must be escaped: use \" for " and \\ for \.
The following escape sequences (beside \" and \\) are recognized: \n for newline character (NL), \t for horizontal tabulation (HT, TAB) and \b for backspace (BS). Other char escape sequences (including octal escape sequences) are invalid.
So, here is the correct alias in .git/config:
wdiff = "diff --color-words='[^][<>()\\{},.;:?/|\\\\=+*&^%$##!~`\"'\\''[:space:]]+|[][<>(){},.;:?/|\\\\=+*&^%$##!~`\"'\\'']' --histogram"
In this case you just need to enclose everything in double quotes and escape both " and backslashes.
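The escaping rule is mechanical enough to automate. Here is a small sketch, with the hypothetical helper name quoteForGitConfig, that wraps a value in double quotes and escapes the two characters the documentation singles out. It deliberately does not handle newlines or tabs, which would need \n and \t.

```cpp
#include <string>

// Wrap `value` in double quotes for a git config file, escaping
// double quotes and backslashes as the quoted documentation requires.
// Assumption: `value` contains no newline or tab characters.
std::string quoteForGitConfig(const std::string& value) {
    std::string out = "\"";
    for (char c : value) {
        if (c == '"' || c == '\\') {
            out += '\\';  // prefix the character with a backslash
        }
        out += c;
    }
    out += '"';
    return out;
}
```

Applied to the unescaped alias from the question, this doubles every backslash and turns each " into \", which is the same transformation that was done by hand above.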

Finding text in string

I have a requirement to find double quotes in the JSON data which are part of the data itself.
Eg: {"Key": "Value", "Key1", "Val"ue1"}
Now the double quote in Val"ue1 needs to be retrieved and not other double quotes.
Any idea of how to achieve this?
How about this:
: \"([^(?:\"[,}])]*\"[^(?:\"[,}])]*)\"[,}]
This regex looks for text between a colon followed by a space and one double quote, and another double quote followed by a comma or closing curly bracket. The text between those has to consist of something*, at least one quote, and something*.
*something means anything (including empty string) that does not contain a quote followed by comma or closing curly bracket, as this would end the value.
Applied to your example (corrected to replace the comma by a colon), it returns Val"ue1.
String json = "{\"Key\": \"Value\", \"Key1\": \"Val\"ue1\"}";
Matcher m = Pattern.compile(": \"([^(?:\"[,}])]*\"[^(?:\"[,}])]*)\"[,}]").matcher(json);
while (m.find())
{
System.out.println(m.group(1));
}
A regex for extracting data from double quotes would be (?<=")([^"]*)(?="). The parentheses in the middle form the first capture group, which you can extract with \1 or $1.
However, for parsing JSON (and parsing standards in general) it is recommended to use a library; it will be much more readable, understandable and safe than a regex that may not cover edge cases (not saying that there would be some here, but take this as general advice).
For instance, if you use Java, Gson is a nice library for parsing and generating JSON; if you use JavaScript, just use the built-in JSON object. If you don't use either of these languages, I'm sure there are other nice libraries for this as well.
Hope this helps

Replace repeated special characters with a single special character

I am attempting to use REGEXREPLACE in Google Sheets to remove the repeating special character \n.
I can't get it to replace all repeating instances of the characters with a single instance.
Here is my code:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope","\\n+","\\n")
I want the results to be:
Hi Gene\nHope
But it always maintains the new lines.
Hi Gene\n\n\n\n\nHope
It has to be an issue with replacing the special characters because this:
REGEXREPLACE("Hi Gennnne\nHope","n+","n")
Produces:
Hi Gene\nHope
How do I remove repeating instances of special characters with a single instance of the special character in Google Sheets?
Edit
Just found easier way:
=REGEXREPLACE("Hi Gene\n\n\n\n\nHope","(\\n)+","\\n")
Original solution
Try this formula:
=REGEXREPLACE(A1,REPT(F2,(len(A1)-len(REGEXREPLACE(A1,"\\n","")))/2),"\\n")
Put your text in A1.
How it works
It's a workaround. We want to end up with a formula like this:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope","\\n+\\n+\\n+\\n+\\n+","\\n")
The first step is to find how many times to repeat \\n+:
=(len(F1)-len(REGEXREPLACE(F1,F2,F3)))/2
Then just combine RegEx.
https://support.google.com/docs/answer/3098245?hl=en
REGEXREPLACE(text, regular_expression, replacement)
The problem seems to be how it interprets the "text". If I put this in a cell REGEXREPLACE("Hi Gene\n\n\n\n\nHope","","")
the output is Hi Gene\n\n\n\n\nHope as well.
If I place the text in a cell by itself with proper newlines and have this REGEXREPLACE(A1, "(\n)\n*", "$1") it works.
Note I could not just do s/\n+/\n/ as it still does not interpret the newline notation as anything special. It would just output \n instead of a newline.
I believe that you don't need to double escape the newlines, e.g. just search for \n:
REGEXREPLACE("Hi Gene\n\n\n\n\nHope", "\n+", "\n")
When you replace \\n you are searching for the literal text \n, rather than newline.
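The distinction between a real newline and the two-character text \n can be shown outside of Sheets. The following C++ sketch is only an analogy (it says nothing about how Google Sheets itself parses string literals), and both function names are made up: one collapses runs of actual newline characters, the other collapses runs of a literal backslash followed by n.

```cpp
#include <regex>
#include <string>

// Collapse each run of real newline characters (0x0A) into a single one.
std::string collapseNewlines(const std::string& text) {
    static const std::regex run("\n+");  // pattern holds an actual newline
    return std::regex_replace(text, run, "\n");
}

// Collapse each run of the two-character text backslash + 'n' into one.
std::string collapseLiteralBackslashN(const std::string& text) {
    // In the regex, \\ matches one literal backslash; the group repeats "\n".
    static const std::regex run(R"((?:\\n)+)");
    return std::regex_replace(text, run, R"(\n)");
}
```

If the cell text contains the characters \ and n rather than line breaks, only the second kind of pattern can match, which is consistent with the (\\n)+ formula working on the literal example.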

Using regex to find a double quote within string encased in double quotes

I am using UltraEdit with regex. I would like to find (and replace) any embedded double quotes found within a string that starts/ends with a double quote. This is a text file with pipe | as the delimiter.
How do I find the embedded double quotes:
"This string is ok."|"This is example with a "C" double quoted grade in middle."|"Next line"
I eventually need to replace the double quotes in "C" to just have C.
The big trade off in CSV is correct parsing in every case versus simplicity.
This is a reasonably moderate approach. If you have really wily strings with quotes next to pipes in them, you had better use something like Perl and Text::CSV.
There is a bother with a regex that requires a non-pipe character on each side of the quote (such as [^|]) in that the parser will absorb the C and then won't find the other quote next to the C.
This example will work pretty well as long as you don't have pipes and quotes next to each other in your actual CSV strings. The lookaheads and behinds are zero-width, so they do not remove any additional characters besides the quote.
(?<!^)(?<!\|)"(?!\|)(?!$)
The four lookarounds, from left to right:
1. (?<!^) Don't match quotes at the beginning of the line.
2. (?<!\|) Don't match quotes with a pipe in front.
3. (?!\|) Don't match quotes with a pipe afterwards.
4. (?!$) Don't match quotes at the end of a string.
Every quote thus matched can be removed. Don't forget to specify global replacement to get all of the quotes.
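For readers doing this in code rather than in an editor: std::regex's default ECMAScript grammar has no lookbehind, so the same four conditions can be checked by hand instead. A minimal sketch, assuming one record per string and using the invented name removeEmbeddedQuotes:

```cpp
#include <cstddef>
#include <string>

// Drop every double quote that is not at the start or end of the line
// and has no pipe directly before or after it.
// Assumption: `line` is a single pipe-delimited record without a newline.
std::string removeEmbeddedQuotes(const std::string& line) {
    std::string out;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const bool embedded =
            line[i] == '"' &&
            i > 0 &&                 // not at the beginning of the line
            i + 1 < line.size() &&   // not at the end of the line
            line[i - 1] != '|' &&    // no pipe in front
            line[i + 1] != '|';      // no pipe afterwards
        if (!embedded) {
            out += line[i];
        }
    }
    return out;
}
```

Like the regex, this checks neighbours in the original text, so it shares the same limitation: a field whose content has a quote right next to a pipe will be misread.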
Try this find:
(["][^"]*)["]C["]([^"]*["])
and replace:
\1C\2
Turn on Regular Expressions in Perl mode.
Tried out in UltraEdit Professional Text/HEX Editor, version 21.30.0.1005.
Start with:
"This string is ok."|"This is example with a "C" double quoted grade in middle."|"Next line"
"This string is ok."|"This is example with a C double quoted grade in middle."|"Next line"
Ends with:
"This string is ok."|"This is example with a C double quoted grade in middle."|"Next line"
"This string is ok."|"This is example with a C double quoted grade in middle."|"Next line"
Breakdown of the regex FIND.
First part.
(["][^"]*)
from (["][^"]*)["]C["]([^"]*["])
This looks for a sequence of:
Double quote: ["].
Any number of characters that are not double quotes: [^"]*
The parentheses that surround ["][^"]* indicate that the regex engine should store this sequence of characters so that the REPLACE part can refer back to it (as a back reference).
Note that this is repeated at the start and end - meaning that there are two sequences stored.
Second part.
["]C["]
from (["][^"]*)["]C["]([^"]*["])
This looks for a sequence of:
Double quote: ["].
The capital letter C (which may or may not stand for Cookies).
Double quote: ["].
Breakdown of the regex REPLACE.
\1C\2
\1 is a back reference that means replace this with the first sequence saved.
The capital letter C (which may or may not stand for Cookies).
\2 is a back reference that means replace this with the second sequence saved.
For the example you gave just "\w" works as the regex to find "C"
The replacing mechanism is probably built into UltraEdit
You really don't want to do this with regex. You should use a CSV parser that can understand pipe delimiters. If I were to do this with just regex, I would use multiple replacements like this:
Find and replace the good quotes with placeholder text. Start/end quote:
s/(^"|"$)/QUOTE/g
Quotes near pipe delimiters:
s/"\|"/DELIMITER/g
Now only embedded double quotes remain. To delete all of them:
s/"//g
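Now put the good quotes back (the delimiter placeholder stands for quote, pipe, quote, so it has to be restored separately):
s/QUOTE/"/g
s/DELIMITER/"|"/g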
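The placeholder sequence can be written out as plain string replacements. This is a sketch under the same assumption the answer makes, namely that the words QUOTE and DELIMITER never occur in the data; the helper names replaceAll and stripEmbeddedQuotes are invented, and each placeholder is restored to the text it stood for.

```cpp
#include <string>

// Replace every occurrence of `from` (must be non-empty) in `text` with `to`.
std::string replaceAll(std::string text, const std::string& from,
                       const std::string& to) {
    for (std::string::size_type pos = 0;
         (pos = text.find(from, pos)) != std::string::npos;
         pos += to.size()) {
        text.replace(pos, from.size(), to);
    }
    return text;
}

// Remove embedded quotes from one pipe-delimited record.
// Assumption: the data never contains the words QUOTE or DELIMITER.
std::string stripEmbeddedQuotes(std::string line) {
    // 1. Protect the quotes at the very start and end of the line.
    if (!line.empty() && line.front() == '"') line.replace(0, 1, "QUOTE");
    if (!line.empty() && line.back() == '"') {
        line.replace(line.size() - 1, 1, "QUOTE");
    }
    // 2. Protect the quote-pipe-quote sequences between fields.
    line = replaceAll(line, "\"|\"", "DELIMITER");
    // 3. Whatever quotes remain are embedded ones: delete them.
    line = replaceAll(line, "\"", "");
    // 4. Restore the protected text.
    line = replaceAll(line, "QUOTE", "\"");
    line = replaceAll(line, "DELIMITER", "\"|\"");
    return line;
}
```

The order of the steps matters: the protective replacements have to run before the blanket deletion, and the restoration has to come last.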
nanny posted a good solution, but for a Perl script, not for usage in a text editor like UltraEdit.
In general it is possible to have double quotes within a field value. But each double quote must be escaped with one more double quote. This is explained, for example, in the Wikipedia article about comma-separated values.
This very simple escaping algorithm makes reading in a CSV file character by character coded in a programming language very easy. But double quotes, separators and line breaks included in a double quoted value are a nightmare for a regular expression find and replace in a CSV file.
I have recorded several replaces into an UltraEdit macro:
InsertMode
ColumnModeOff
Top
PerlReOn
Find MatchCase RegExp "^"|"$"
Replace All "QuOtE"
Find MatchCase ""|"
Replace All "QuOtE|"
Find MatchCase "|""
Replace All "|QuOtE"
Find MatchCase """"
Replace All "QuOtEQuOtE"
Find MatchCase """
Replace All """"
Find MatchCase "QuOtE"
Replace All """
The first replace is a Perl regular expression replace. Each double quote at beginning or end of a line is replaced by the string QuOtE by this replace. I'm quite sure that QuOtE does not exist in the CSV file.
Each double quote before and after the pipe character is also replaced by QuOtE by the next 2 non regular expression replaces.
Escaped double quotes "" in the CSV file are replaced next by QuOtEQuOtE with a non regular expression replace.
Now the remaining single double quotes are replaced by two double quotes to make them valid in CSV file. You could of course also remove those single double quotes.
Finally, all QuOtE are replaced back to double quotes.
Note: This is not the ultimate solution. Those replaces could nevertheless produce a wrong result, for example for an already valid CSV line like this one
"first value with separator ""|"" included"|second value|"third value again with separator|"|fourth value contains ""Hello!"""|fifth value
as the result is
"first value with separator """|""" included"|second value|"third value again with separator|"|fourth value contains ""Hello!"""|fifth value
PS: The valid example line above should be displayed in a spreadsheet application as
first value with separator "|" included second value third value again with separator| fourth value contains "Hello!" fifth value

Notepad++ Search and Replace with Tab Delimited File

I have a file that is tab delimited. When exporting from Excel, if the cell has a comma in it, it will wrap the cell with double quotes.
To find the first double quote, I can look for a tab then double quote ex: \t"
The next double quote to remove is at the end of the line, so I would like to find double quote then newline ex: \n" but this is not working.
Example of the file format:
textTABtextTAB"moretextwithquotes"CRLF
First, you're searching for \n" instead of "\n, if I understand your problem correctly.
Secondly, you need to search for \r\n instead of \n, so your final result should be "\r\n.
If all your data is consistent, in that double quotes are matched and encapsulate fields, I would just do a global find and replace on the quoted text, replacing each match with just the field data. This strips the quotes and leaves everything else untouched.
Find: "([^"\\]*(?:\\.[^"\\]*)*)"
Replace: $1
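The same find/replace pair can be tried outside the editor. A minimal C++ sketch using std::regex (the function name stripFieldQuotes is made up); it assumes, as the answer does, that the quotes in the data are balanced.

```cpp
#include <regex>
#include <string>

// Replace each double-quoted field with its contents.
// The pattern also tolerates backslash-escaped characters inside the quotes.
// Assumption: quotes in `text` are balanced and only wrap whole fields.
std::string stripFieldQuotes(const std::string& text) {
    // Custom raw-string delimiter because the pattern itself contains )"
    static const std::regex quoted(R"re("([^"\\]*(?:\\.[^"\\]*)*)")re");
    return std::regex_replace(text, quoted, "$1");  // $1 = text between quotes
}
```

Unquoted fields, tabs and the CRLF line ending are not touched, since the pattern only ever matches a quote, its contents and the closing quote.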