C++ trying to read in malformed CSV with erroneous commas

I am trying to make a simple CSV file parser to transfer a large number of orders from an order system to an invoicing system. The issue is that the CSV I am downloading has erroneous commas which are sometimes present in the name field, and this throws the whole process off.
The company INSISTS, which is really starting to piss me off, that they are simply copying data they receive into the CSV and so it's valid data.
Excel mostly seems to interpret this correctly, or at least puts the data in the right field; my program, however, doesn't. I opened the CSV in Notepad++ and there are no quotes around strings, just raw strings separated by commas.
This is currently how I am reading the file.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

#define vstring vector<string> // shorthand explained below

vstring explode(string const & s, char delim); // defined after main(), so declare it here

int main()
{
    string t;
    getline(cin, t);
    string Output;
    string path = "in.csv";
    ifstream input(path);
    vstring readout;
    vstring contact, InvoiceNumber, InvoiceDate, DueDate, Description, Quantity, UnitAmount, AccountCode, TaxType, Currency, Allocator, test, Backup, AllocatorBackup;
    vector<int> read, add, total;
    if (input.is_open()) {
        for (string line; getline(input, line); ) {
            auto arr = explode(line, ','); // NB: assumes every line yields at least 13 fields
            contact.push_back(arr[7]);         // Source site is the customer in this instance.
            InvoiceNumber.push_back(arr[0]);   // OrderID will be the invoice number
            InvoiceDate.push_back(arr[1]);     // Purchase date
            DueDate.push_back(arr[1]);         // Same as order date
            Description.push_back(arr[0]);
            Quantity.push_back(arr[0]);
            UnitAmount.push_back(arr[10]);     // The total
            AccountCode.push_back(arr[7]);     // Will be set depending on other factors - but contains the site of purchase
            Currency.push_back(arr[11]);       // EUR/GBP
            Allocator.push_back(arr[6]);       // This will decide the VAT treatment normally.
            AllocatorBackup.push_back(arr[5]); // This will decide VAT treatment if the column is off by one.
            Backup.push_back(arr[12]);
            TaxType = Currency;
        }
    }
    return 0;
}

vstring explode(string const & s, char delim) {
    vstring result;
    istringstream q(s);
    for (string token; getline(q, token, delim); ) {
        result.push_back(move(token));
    }
    return result;
}
vstring is a macro I created to save me typing vector<string> so often, so it's the same thing.
The issue is that when I come across one of the fields with a comma in it (normally the name field, which is [3]), it of course pushes everything back by one, so account code becomes [8], etc. This is extremely troublesome, as in some cases it's difficult to tell whether or not I am dealing with correct data in the next field.
So two questions:
1) Is there any simple way I could detect this anomaly and correct for it that I've missed? I do, of course, try to check in my loop where I can whether valid data is where it's expected to be, but this is becoming messy and does not cope with more than one comma.
2) Is the company correct in telling me that it's "expected behavior" to allow commas entered by a customer to creep into this CSV without being processed, or have they completely misunderstood the CSV "standard"?

Retired Ninja mentioned in the comments that one approach would be to parse all the fields on either side of the 'problem field' first, and then put the remaining data into the problem field. This is the best approach if you know which field might contain corruption. If you don't know which field could be corrupted, you still have options though!
You know:
The number of fields that should be present
Something about the type of data in each of those fields.
If you codify the types of the fields (implement classes for the different data types, so your vectors of strings become vectors of OrderIDs or Dates or Counts or ...), you can test different concatenations (joining adjacent fields that are separated by a comma) and score them according to how many of the fields pass some data validation. You then choose the best-scoring interpretation of the data. This builds some data validation into the process and makes everything a bit more robust.
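Here is a minimal sketch of that scoring idea, assuming you know the expected column count and can write a per-column predicate. The Validator alias, search() and bestInterpretation() are names invented for this example; real predicates would check date formats, numeric amounts, currency codes and so on:

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using vstring = std::vector<std::string>;

// One predicate per expected column: returns true if the text looks right
// for that column (a placeholder - real checks would parse dates, numbers...).
using Validator = std::function<bool(const std::string&)>;

// Recursively decide, for each token, whether the comma before it was a real
// separator (start a new field) or part of the data (join it to the previous
// field), and keep the best-scoring interpretation with the right field count.
static void search(const vstring& tokens, std::size_t pos, vstring current,
                   const std::vector<Validator>& checks, std::size_t want,
                   int& bestScore, vstring& best)
{
    if (current.size() > want) return;          // already too many fields
    if (pos == tokens.size()) {
        if (current.size() != want) return;     // wrong field count
        int score = 0;
        for (std::size_t i = 0; i < want; ++i)
            if (checks[i](current[i])) ++score;
        if (score > bestScore) { bestScore = score; best = current; }
        return;
    }
    // Option 1: this token starts a new field.
    current.push_back(tokens[pos]);
    search(tokens, pos + 1, current, checks, want, bestScore, best);
    current.pop_back();
    // Option 2: the comma was data, so the token continues the previous field.
    if (!current.empty()) {
        current.back() += "," + tokens[pos];
        search(tokens, pos + 1, current, checks, want, bestScore, best);
    }
}

vstring bestInterpretation(const vstring& tokens,
                           const std::vector<Validator>& checks)
{
    int bestScore = -1;
    vstring best;
    search(tokens, 0, {}, checks, checks.size(), bestScore, best);
    return best;   // empty if no interpretation had the right field count
}

With 13 expected columns and only one or two stray commas there are just a handful of candidate interpretations, so the exponential worst case rarely matters in practice.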

CSV is not that well defined. There is the standard way, where ',' separates the columns and '\n' the rows. Sometimes '"' is used to handle these symbols inside a field, but Excel includes the quotes only if a control character is involved.
Here is the definition from Wikipedia:
RFC 4180 formalized CSV. It defines the MIME type "text/csv", and CSV files that follow its rules should be very widely portable. Among its requirements:
-MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
-An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
-Each record "should" contain the same number of comma-separated fields.
-Any field may be quoted (with double quotes).
-Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
-A (double) quote character in a field must be represented by two (double) quote characters.
Comma-separated values
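For reference, a minimal sketch of a record splitter that honours those quoting rules (commas inside quotes and doubled quotes); it deliberately ignores the line-break rule, which would require reading across lines:

#include <cstddef>
#include <string>
#include <vector>

// Split one CSV record following the RFC 4180 rules quoted above: fields may
// be wrapped in double quotes, a quoted field may contain commas, and an
// embedded quote is written as two double quotes ("").
std::vector<std::string> splitRfc4180(const std::string& line)
{
    std::vector<std::string> fields;
    std::string field;
    bool inQuotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (inQuotes) {
            if (c == '"' && i + 1 < line.size() && line[i + 1] == '"') {
                field += '"';          // doubled quote -> one literal quote
                ++i;
            } else if (c == '"') {
                inQuotes = false;      // closing quote
            } else {
                field += c;
            }
        } else if (c == '"') {
            inQuotes = true;           // opening quote
        } else if (c == ',') {
            fields.push_back(field);   // unquoted comma ends the field
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);           // last field has no trailing comma
    return fields;
}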
Keep in mind that Excel behaves differently depending on the system and its language settings. It might be that their Excel is parsing it correctly, but somewhere else it isn't.
For example, in countries like Germany, ';' is used to separate the columns. The decimal separators differ as well:
1.5 << english
1,5 << german
The same goes for the thousands separator:
1,000,000 << english
1.000.000 << german
or
1 000 000 << also german
Now, Excel also has different CSV export settings like .csv (Separated values), .csv (Macintosh) and .csv (MS-DOS), so I guess there can be differences there too.
Now for your questions: in my opinion, they are not clearly wrong in what they are doing with their files, but you should think about discussing an (E)BNF with them. Here are some links:
BNF
EBNF
It is a grammar that you decide on, and with clear definitions the code should be no problem. I know customers can block something like this because they don't want the extra work, but it is simply the best solution. If you want '"' in your file, they should provide it somehow. I don't know how they copy their data, but it should also be some kind of program (I don't think they do this by hand?), so your code and their code should use the same (E)BNF, which you decide on together with them.

Related

Reading/Writing several structs of unknown size to file C++

I want to make a book database to record what books I have read. So far, I have a structure for a book entry.
struct Entry
{
    string title;
    string author;
    int pages;
};
As you can see, the title and the author variables are of undetermined size. I would like to store several structures within one file, and then to read all those structures when I want to display the database.
How would I read/write several of these structures from a file? Would I have to have predetermined sizes? Please provide an example.
You could easily store it in a CSV format the following way:
title_1,author_1,pages_1,title_2,author_2,pages_2,title_3,auth...
When reading back the file, parse it according to your separators and you have your data back.
Edit:
As @Mark Setchell suggested, you shouldn't use ',' as a separator, because a comma might also be part of the title itself. Instead, you should use a rarer character as the separator, one which isn't likely to appear in book titles. Some examples are ';', '|' or '#', or even unprintable characters.
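A minimal sketch of that round trip, with one record per line and '|' as the separator (save(), load() and the file layout are just choices made for this example):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct Entry
{
    std::string title;
    std::string author;
    int pages;
};

// Write one entry per line, fields separated by '|' (assumed not to occur
// in titles or author names).
void save(const std::vector<Entry>& entries, const std::string& path)
{
    std::ofstream out(path);
    for (const Entry& e : entries)
        out << e.title << '|' << e.author << '|' << e.pages << '\n';
}

// Read the entries back by splitting each line on the same separator.
std::vector<Entry> load(const std::string& path)
{
    std::vector<Entry> entries;
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); ) {
        std::istringstream fields(line);
        Entry e;
        std::string pages;
        if (std::getline(fields, e.title, '|') &&
            std::getline(fields, e.author, '|') &&
            std::getline(fields, pages)) {
            e.pages = std::stoi(pages);
            entries.push_back(e);
        }
    }
    return entries;
}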

Fortran 90: reading a generic string with enclosed some "/" characters

Hi everybody, I've found some problems reading unformatted character strings from a simple file. When the first / is found, everything after it is missed.
This is an example of the text I would like to read: after the first 18 character blocks, which are fixed (from #Mod to Flow[kW]), there is a list of chemical species' names, which varies (in this case there are 5) within the program I'm writing.
#Mod ID Mod Name Type C. #Coll MF[kg/s] Pres.[Pa] Pres.[bar] Temp.[K] Temp.[C] Ent[kJ/kg K] Power[kW] RPM[rad/s] Heat Flow[kW] METHANE ETHANE PROPANE NITROGEN H2O
I would like to skip, after some formal checks, the first 18 blocks, then read the chemical species. To do the former, I created a character array of dimension 18, each element with a length of 20.
character(20), dimension(18) :: chapp
Then I would like to associate the 18 blocks with the character array:
read(1,*) (chapp(i),i=1,18)
...but this is the result: the first 7 strings are saved correctly into chapp(1) through chapp(7), but this is chapp(8):
chapp(8) = 'MF[kg '
and from here on, everything is left blank!
How could I overcome this reading problem?
The problem is due to your using list-directed input (the * as the format). List-directed input is useful for quick and dirty input, but it has its limitations and quirks.
You stumbled across a quirk: A slash (/) in the input terminates assignment of values to the input list for the READ statement. This is exactly the behavior that you described above.
This is not a choice of the compiler writer, but is mandated by all relevant Fortran standards.
The solution is to use formatted input. There are several options for this:
If you know that your labels will always be in the same columns, you can use a format string like '(1X,A4,2X,A2,1X,A3,2X)' (this is not complete) to read in the individual labels. This is error-prone, and is also bad if the program that writes out the data changes its format for some reason or other, or if the labels are edited by hand.
If you can control the program that writes the labels, you can use tab characters to separate the individual labels (and also, later, the data). Read in the whole line, split it into tab-separated substrings using INDEX, and read in the individual fields using an (A) format. Don't use list-directed format, or you will get hit by the / quirk mentioned above. This has the advantage that your labels can also include spaces, and that the data can be imported from/to Excel rather easily. This is what I usually do in such cases.
Otherwise, you can read in the whole line and split on multiple spaces. A bit more complicated than splitting on single tab characters, but it may be the best option if you cannot control the data source. You cannot have labels containing spaces then.

Deleting duplicate entries in a log file C++

I've written a program to parse through a log file. The file in question has around a million entries or so, and my intent is to remove all duplicate entries by date. So if there are 100 log-ins on a date, it will only show one log-in per name. The log output I've created is in the form:
AA 01/Jan/2013
AA 01/Jan/2013
BB 01/Jan/2013
etc. etc. all through the month of January.
This is what I've written so far. The constant i in the for loop is the number of entries to be sorted through, and namearr and datearr are the arrays used for name and date. My end game is to have no repeated values in the first field corresponding to each date. I'm trying to follow proper etiquette and protocol, so if I'm off base with this question, I apologize.
My first thought for solving this myself was to nest a for loop to compare all previous names to the date, but since I'm learning about data structures and algorithm analysis, I don't want to creep up to high run times.
if (inFile.is_open())
{
    for (int a = 0; a < i; a++)
    {
        inFile >> name;                // Take input file name
        namearr[a] = name;             // Store file name into array
        // If names are duplicates, erase them
        if (namearr[a] == temp)
        {
            inFile.ignore(1000, '\n'); // If duplicate, skip to next line
        }
        else
        {
            temp = name;
            inFile.ignore(1, ' ');
            inFile >> date;            // Store date
            datearr[a] = date;         // Put date into array
            inFile.ignore(1000, '\n'); // Skip to next line
            cout << namearr[a] << " " << datearr[a] << endl; // Output code to window
            oFile << namearr[a] << " " << datearr[a] << endl; // Output code to file
        }
    }
}
Ughhh... you'd better use a regular expression library to deal easily with a file of that size. Check Boost Regex:
http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/index.html
You can construct a key composed of the name and the date with simple string concatenation. That string becomes the index to a map. As you are processing the file line by line, check to see if that string is already in the map. If it is, then you have encountered the name on that day once before. If you've seen it already do one thing, if it's new do another.
This is efficient because you're constructing a string that will only be found a second time if the name has already been seen on that date, and maps efficiently search the space of keys to find whether a key exists in the map or not.
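A minimal sketch of that idea. Since no mapped value is needed here, a std::set of keys behaves the same as a map; the file names and the "name date" line format are assumptions based on the sample output above:

#include <fstream>
#include <set>
#include <string>

int main()
{
    std::ifstream in("log.txt");          // placeholder input name
    std::ofstream out("deduped.txt");     // placeholder output name
    std::set<std::string> seen;           // name+date keys already written
    std::string name, date;
    while (in >> name >> date) {
        std::string key = name + "|" + date;  // '|' keeps the key unambiguous
        if (seen.insert(key).second)          // insert() fails on a duplicate
            out << name << " " << date << "\n";
    }
}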

How to parse text-based table in C++

I am trying to parse a table in the form of a text file using ifstream, and to evaluate/manipulate each entry. However, I'm having trouble figuring out how to approach this because of the omission of particular items. Consider the following table:
NEW VER ID NAME
1 2a 4 "ITEM ONE" (2001)
1 7 "2 ITEM" (2002) {OCT}
1.1 10 "SOME ITEM 3" (2003)
1 12 "DIFFERENT ITEM 4" (2004)
1 a4 16 "ITEM5" (2005) {DEC}
As you can see, sometimes the "NEW" column has nothing in it. What I want to do is take note of the ID, the name, the year (in brackets), and note whether there are braces or not afterwards.
When I started doing this, I looked for a "split" function, but I realized that it would be a bit more complicated because of the aforementioned missing items and the titles becoming separated.
The one thing I can think of is reading each line word by word, keeping track of the latest number I saw. Once I hit a quotation mark, I'd note that the latest number was an ID (with something like split, the array position right before the quotation mark), then keep a record of everything until the next quote (the title), and finally start looking for brackets and braces for the other information. However, this seems really primitive, and I'm looking for a better way to do it.
I'm doing this to sharpen my C++ skills and work with larger, existing datasets, so I'd like to use C++ if possible, but if another language (I'm looking at Perl or Python) makes this trivially easy, I could just learn how to interface a different language with C++. What I'm trying to do now is just sifting data anyways which will eventually become objects in C++, so I still have chances to improve my C++ skills.
EDIT: I also realize that this is possible to complete using only regex, but I'd like to try using different methods of file/string manipulation if possible.
If the column offsets are truly fixed (no tabs, just true space chars a la 0x20), I would read it a line at a time (std::getline) and break each line down using the fixed offsets into a set of four strings (string::substr).
Then postprocess each 4-tuple of strings as required.
I would not hard-code the offsets; store them in a separate input file that describes the format of the input, like a table description in SQL Server or another DB.
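A minimal sketch of that substr() approach; the {start, width} offsets below are guesses based on the sample table, and would really be loaded from the separate format-description file:

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Break one fixed-width line into its four columns with string::substr.
std::vector<std::string> splitFixed(const std::string& line)
{
    const std::pair<std::size_t, std::size_t> cols[] = {
        {0, 4},                      // NEW
        {4, 4},                      // VER
        {8, 4},                      // ID
        {12, std::string::npos}      // NAME: the rest of the line
    };
    std::vector<std::string> fields;
    for (const auto& c : cols)
        fields.push_back(c.first < line.size()
                             ? line.substr(c.first, c.second)
                             : std::string());
    return fields;
}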
Something like this:
Read the first line, find "ID", and store the index.
Read each data line using std::getline().
Create a substring from a data line, starting at the index where you found "ID" in the header line. Use this to initialize a std::istringstream.
Read the ID using iss >> an_int.
Search for the first ". Search for the second ". Search for the ( and remember its index. Search for the ) and remember that index, too. Create a substring from the characters between those indexes and use it to initialize another std::istringstream. Read the number from this stream.
Search for the braces.
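Put together, a sketch of those steps might look like this (no error checking, and it assumes every data line is shaped like the sample; "table.txt" is a placeholder name):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream in("table.txt");
    std::string header;
    std::getline(in, header);
    std::size_t idCol = header.find("ID");   // column offset of the ID field

    for (std::string line; std::getline(in, line); ) {
        // The ID starts at the header's "ID" offset.
        std::istringstream iss(line.substr(idCol));
        int id;
        iss >> id;

        // The name sits between the first pair of double quotes.
        std::size_t q1 = line.find('"');
        std::size_t q2 = line.find('"', q1 + 1);
        std::string name = line.substr(q1 + 1, q2 - q1 - 1);

        // The year sits between the parentheses after the name.
        std::size_t p1 = line.find('(', q2);
        std::size_t p2 = line.find(')', p1);
        int year = std::stoi(line.substr(p1 + 1, p2 - p1 - 1));

        // Braces after the year are optional.
        bool hasBraces = line.find('{', p2) != std::string::npos;

        std::cout << id << " | " << name << " | " << year
                  << " | " << (hasBraces ? "braces" : "no braces") << "\n";
    }
}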

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really: just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regexes make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_])+ .*?,?/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handily for me. I actually used it three or four times for similar field-to-method-call conversions.
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
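For what it's worth, a search/replace pair in the same style as the JDBC example above could do this, assuming each input line is exactly a type, a space, and a name (\n in the replacement stands for a newline; the exact syntax depends on the editor):
My Regex Search:
/(\w+) (\w+)/
My Replacement:
public void \2(\1 \2){\n}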
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200 of them) and I need them in a '0000000444','000000004445' format. Works wonders for me!
I also use it to pull email addresses out of an email. I send out group emails often, and all the bounced returns come back in one email. So I regex to pull them all out and then drop them into a string var to remove them from the database.
I even wrote a little dialog prog that applies a regex to my clipboard. It grabs the contents, applies the regex, and then loads the result back into the clipboard.
One thing I use it for in web development all the time is stripping text of its HTML tags. This might need to be done to sanitize user input for security, or to display a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page, at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make sure a user has entered a syntactically correct email address into a web form, this is about the only way of checking it thoroughly.
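The rough tag-stripping pattern alluded to here is usually some variant of replacing
/<[^>]+>/
with an empty string, which drops anything between angle brackets. That's fine for a quick preview; a security-grade sanitizer needs more care than a single regex.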
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
The first thing I do with any editor is try to figure out its regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match, using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using an anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop the recording.
Now all that is needed to modify the next line is to repeat the macro.
I could live without support for regex, but could not live without anonymous keyboard macros.