Proper design to parse a file in C++

I'm trying to parse a file and store the different fields in some variable.
However, I'm not sure of the proper design to do this:
Should I have a class for parsing and a class for storage, or should the parser also be used as storage to hold the fields?
I also have to keep in mind that I might need to extend the parser as new fields may appear in the file.
Edit: It has been mentioned that the question was too broad. I have included more details here, but I didn't want to influence the answers with what I have already implemented. In my current implementation I ran into issues extending the parser, and someone pointed out that I shouldn't separate the data storage and the parser (because of the single responsibility principle), but to me it made sense to separate them.

It's a question of necessity. If your problem is likely to stay small, put the parsing and storage together. You'll have to have the parsed data somewhere in memory regardless, so why not right where it comes in?
The main reason you'd want to separate them is if you need to store the data in different ways, such as in a database or a file, and you need some sort of interface that can deal with that extensibility.
But think about that only if it looks likely to come up.
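To make the trade-off concrete, here is a minimal sketch of the combined approach; the key=value line format and every name here are assumptions for illustration, not taken from the asker's file:

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>

// Parser and storage combined: reasonable while the problem stays small.
class ParsedFile {
public:
    bool load(const std::string& path) {
        std::ifstream in(path);
        if (!in) return false;
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream parts(line);
            std::string key, value;
            // Hypothetical "key=value" line format.
            if (std::getline(parts, key, '=') && std::getline(parts, value))
                fields_[key] = value;
        }
        return true;
    }
    // The accessors double as the storage interface.
    std::string get(const std::string& key) const {
        auto it = fields_.find(key);
        return it == fields_.end() ? std::string() : it->second;
    }
private:
    std::unordered_map<std::string, std::string> fields_;
};
```

If a second storage target (say, a database) ever shows up, the load() logic can be lifted out into its own parser class at that point; nothing here prevents the split later.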

Related

Design patterns for aggregating heterogeneous tabular data

I'm working on some C++ code that integrates information from several dozen CSV files. They all contain some time-stamped record data I want to extract, but the representation is somewhat different in each file. The differences between representations go beyond column orderings and column names; for example, what is one row with multiple columns in one file may be multiple rows in a different file.
So I need some custom handling for each file to put together a unified data structure that includes the necessary information from all the files. My question is whether there's a preferred code pattern to keep the complexity manageable and the code elegant? Or if there's a good case study I should examine to see how this sort of complexity has been handled in the past.
(I realize something like this might be easier in a scripting language like Perl, but the project is in C++ for now. Also, my question is more about whether there's a code pattern to deal with this, so the answer doesn't have to be too language-specific.)
There are several phrases in your question that stick out to me: custom handling for each file, representation is somewhat different, complexity manageable. Given that you will need a different parsing algorithm for each CSV format, and that you (from what I can tell) want to loosely couple your parsing mechanism, I would recommend the strategy pattern.
The strategy pattern decouples the parsing mechanism from the users of the data contained in the CSV files. The users of the data have no interest in what format a CSV file is in; they are only interested in the information within it, which makes the strategy pattern an excellent choice. If there are similarities between your parsing mechanisms, you can use the template method and strategy patterns together to reduce duplication and take advantage of inheritance.
By using the strategy pattern you can then extract strategy creation into a factory method or abstract factory as you see fit, further decoupling clients from the parsing method.
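A minimal sketch of what that might look like; the Record fields and the two format classes are invented for illustration:

```cpp
#include <istream>
#include <vector>

// Unified record the rest of the program consumes; the fields are
// placeholders for whatever the time-stamped data actually holds.
struct Record {
    long   timestamp;
    double value;
};

// The strategy interface: one implementation per CSV layout.
class ParsingStrategy {
public:
    virtual ~ParsingStrategy() = default;
    virtual std::vector<Record> parse(std::istream& in) = 0;
};

class OneRowPerRecordFormat : public ParsingStrategy {
public:
    std::vector<Record> parse(std::istream& in) override {
        std::vector<Record> out;
        // ... layout-specific parsing elided ...
        return out;
    }
};

class MultiRowPerRecordFormat : public ParsingStrategy {
public:
    std::vector<Record> parse(std::istream& in) override {
        std::vector<Record> out;
        // ... layout-specific parsing elided ...
        return out;
    }
};

// Clients depend only on Record and ParsingStrategy, never on a layout.
std::vector<Record> loadRecords(std::istream& in, ParsingStrategy& strategy) {
    return strategy.parse(in);
}
```

A factory that maps a file identifier to a concrete strategy would then keep even the choice of parser out of client code.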
I am not quite sure what you want to do with the different files. If the idea is to use them like database tables, and you have some keys with attached information scattered across multiple files, you might want to have a look at something like MapReduce, where you build up part of the information from each file first and then aggregate the information sharing the same key in a second step.
As for data structures, it depends on the layout of your files. I would probably have a dedicated reader for each file type, which would store the information in dedicated data structures representing the contents of the file. You could attach a key to each piece of information and use a reduce operation to merge all the information fragments that share a key, aggregating them in a proxy structure.
On the other hand, if the idea is to build identical objects from different serialization methods (i.e. the different files are independent but represent the same type of data with different layouts), without knowing in advance which serialization method was employed, I am afraid the only solution left is to brute-force the deserialization. You can have a set of readers, one for each input type, and try to parse the file with each in turn; if one fails, the next one starts, and so on, until you either discover a new file format or find the appropriate reader. I don't think there is any named pattern covering this.
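A sketch of that reader chain, with all names invented; the only assumption is that each reader can detect that input is not in its format and report failure:

```cpp
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct Data { /* the common in-memory representation */ };

class Reader {
public:
    virtual ~Reader() = default;
    // Returns the parsed data, or nothing if the format does not match.
    virtual std::optional<Data> tryParse(const std::string& path) = 0;
};

// Try each reader in turn until one recognizes the file.
std::optional<Data> deserialize(const std::string& path,
                                const std::vector<std::unique_ptr<Reader>>& readers) {
    for (const auto& reader : readers)
        if (auto data = reader->tryParse(path))
            return data;
    return std::nullopt;  // unknown format: time to write a new reader
}
```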

Is it bad form to put an exception class in the same file as your main class

I'm writing a very simple program with only one class. I also want to derive my own exception but I don't want to have to use a new file. Is this okay or is it bad coding to have multiple not exactly related classes in one file?
It's okay for small projects, but it can quickly turn into a maintenance headache for anything bigger than that.
Categorizing classes into files and namespaces helps with organization, makes it more intuitive to find what you're looking for, and makes it much, much easier to keep track of changes (in version control).
There's also the matter of dependencies, which is probably the most important point: what if you want to use that exception somewhere else? You can't include the implementation file, and even if you keep it in a header shared with other unrelated classes, you'd be polluting the new translation unit with unneeded symbols.
I think the answer, like many things, is that it depends. In this case, the question is how tightly coupled the exception is to the class. If your class is called FooBarBaz and your exception is called FooBarBazException, then it seems unlikely that anyone would want to use it elsewhere, and I think it's alright to package it with the class. (In fact, you might argue that it should be a nested class of FooBarBaz.)
That said, when I write exceptions I try to tie them to causes rather than specific classes. For example, if I'm writing a compression library and I have support for several different algorithms, I may write a single DataCorruptionException class to convey detection of corruption. Then I can use the exception message to provide more detailed information, such as what class/function the exception came from.
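As a rough illustration of that style (the class and function names here are invented, not from a real library):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// An exception named for the failure cause, not for one class.
// Any algorithm in the hypothetical compression library can throw it;
// the message carries the algorithm-specific detail.
class DataCorruptionException : public std::runtime_error {
public:
    explicit DataCorruptionException(const std::string& detail)
        : std::runtime_error("data corruption detected: " + detail) {}
};

void decompressLz(const unsigned char* buf, std::size_t len) {
    if (len < 4)  // minimal header size, made up for the example
        throw DataCorruptionException("LZ stream truncated in decompressLz");
    // ...
}
```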

Efficient ways to save and load data from C++ simulation

I would like to know the best way to save and load C++ data.
I am mostly interested in saving classes and matrices (not sparse) I use in my simulations.
Right now I just save them as .txt files, but if I add a member to a class I then have to modify the function that loads the data (it has to parse the txt file and check for the value), which I think is not ideal.
What would you recommend in general? (P.S. As I'd like to release my code, I'd really like to use only standard C++ or libraries that can be redistributed.)
In this case, there is no "best." What is best for you is highly dependent upon your situation. But let's walk through an example to get you thinking about your details and how deep this rabbit hole can go.
If you absolutely positively must have the fastest save possible without question (and you're willing to pay the price), you can define your own memory management to put all objects into a contiguous array of a common type (such as integers). This allows you to write that array to disk as binary data very rapidly. You might need this in a simulation that uses threads efficiently to load every core/processor to run at real time.
Why is this a rather horrible solution? Because it takes a LOT of work and runs many risks of problems, all in the name of "optimization."
It requires you to build your own memory management (operator new() and operator delete()) which may need to be thread safe.
If you try to load from this array, you will have to placement-new all objects with a unique non-modifying constructor in order to ensure all virtual table pointers are set properly. Oh, and you have to track the type at each address to know how to do this.
For portability with other systems and between versions of the binary, you will need to have utilities to convert from the binary format to something generic enough to be cross platform (including repopulating pointers to other objects).
I have done this. It was highly unpleasant. I have no doubt there are still problems with it and I have only listed a few here. But, it was very, very fast and very, very, very problematic.
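The tame core of the idea is easy to show; everything painful sits on top of it. Assuming the state is one contiguous array of a trivially copyable type (the Particle struct is invented for the sketch), a single write persists all of it. The custom allocation, placement-new reloads, and pointer fix-ups described above are what turn this into the unpleasant version:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Particle {          // hypothetical simulation state, no pointers
    float x, y, z;
    float vx, vy, vz;
};

bool dump(const std::vector<Particle>& world, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    // One bulk write of the raw bytes: very fast, totally unportable.
    std::size_t written =
        std::fwrite(world.data(), sizeof(Particle), world.size(), f);
    std::fclose(f);
    return written == world.size();
}
```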
You must design to your needs. Generally, the first need is "Make it work." Don't care about efficiency, just about something that accurately persists and that you have the information known and accessible at some point to do it. Also, you should encapsulate the process of saving and loading. Then, if the need "Make it better" steps in, you should be able to change that one bit of code and the rest should work. You might even make the saving format selectable on user needs instead of your needs which you must assume for all users.
Given all the assumptions, pros and cons listed, you should be able to elaborate your particular needs for this question.
Given that performance is not your concern -- which is a critical part of the answer -- the Boost Serialization library is a great answer.
The link in the comment leads to the documentation. Read the tutorial (which is overkill for what you are initially wanting, but well worth it).
Finally, since you have mostly array matrices, try to encapsulate the entire process of save and load so that, should you need to change it later, you are only writing a new implementation and choosing between the existing ones. I expect the added time for the smarts of Boost Serialization would not be great; however, you might find a future requirement moves you to something else, or to multiple something elses.
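To make the recommendation concrete, here is a minimal Boost Serialization sketch; the Matrix class is a stand-in for whatever the simulation actually stores:

```cpp
#include <fstream>
#include <vector>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/access.hpp>
#include <boost/serialization/vector.hpp>

class Matrix {
    friend class boost::serialization::access;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        // Adding a member to the class means adding it on this line;
        // Boost handles reading and writing in both directions.
        ar & rows_ & cols_ & data_;
    }
    int rows_ = 0, cols_ = 0;
    std::vector<double> data_;
};

void save(const Matrix& m, const char* path) {
    std::ofstream os(path);
    boost::archive::text_oarchive oa(os);
    oa << m;
}

void load(Matrix& m, const char* path) {
    std::ifstream is(path);
    boost::archive::text_iarchive ia(is);
    ia >> m;
}
```

Note that Boost's license permits redistribution, which matters given the release requirement in the question.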
The C++ Middleware Writer automates the creation of marshalling functions. When you add a member to a class, it updates the marshalling functions for you.

How to organize import and export code for versioned files?

If an application has to be able to open (and possibly save) the file format for the last N releases, how should the code be organized sanely so that updating the file format is easy and less error-prone? Assume the file format is in XML, and functions take in objects for export and produce objects for import.
Append a number to the end of each function name, and copy/paste it and increment the number for each new version? That's like maintaining multiple versions of version-controlled functions within source code. Perhaps do some magic at build time?
Firstly, supporting import of old versions is a lot easier than export. This is because later versions usually differ by supporting more features, so saving to an old format may well mean loss of data. Consequently, my experience has only been with supporting import of multiple versions, spanning over a decade.
XML is of course the smart solution. It is designed with this problem in mind. The key point to me is that clean code structure follows from a clean data model. Provided new versions add features and these are represented by support for additional tags, you do not really have to recode handling of existing tags at all.
Now, you could change the semantics of existing tags, requiring them to be recoded. Solution: don't do this if you can avoid it. When you add an attribute or tag, make sure you define its default value; then old and new data files are handled seamlessly.
So it seems to me that, with care, you should be able to avoid most cases where you really have significantly different code for handling the same fields in different file versions. Where this does occur, I would guess there are "special circumstances" (that's life with software). When you design the generic solution you'll have specific use cases in mind, and such special cases may not be handled anyway.
In summary: you'll future-proof most efficiently by defining the upgrade path for the data model.
A version number is probably required.
But the best thing is to actually make a design for your XML. And make sure that the XML is structured in an intuitive and natural way. Otherwise the current organisation of your code may leak into the structure of the XML, which makes the XML harder to read for future versions of your product.
When saving enumerated values, write the name of the enumerator, not its numeric value. If some element could in principle occur multiple times, even if it doesn't in your current application, design it as an array in the XML. Make sure the numbers you write are in a unit that is logical in the problem domain, not whatever your application happens to use internally right now.
In XML written this way, it should not be hard to support legacy versions of your XML.
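As a sketch of the default-value advice (using tinyxml2 as one example XML library; the element and attribute names are invented):

```cpp
#include <tinyxml2.h>

// QueryIntAttribute leaves the output variable untouched when the
// attribute is missing, so initializing with the default means files
// written before the attribute existed still load seamlessly.
bool loadSettings(const char* path) {
    tinyxml2::XMLDocument doc;
    if (doc.LoadFile(path) != tinyxml2::XML_SUCCESS) return false;

    tinyxml2::XMLElement* root = doc.FirstChildElement("document");
    if (!root) return false;

    int version = 1;   // files from before versioning carried no attribute
    root->QueryIntAttribute("version", &version);

    int retries = 3;   // attribute added later; the default keeps old files valid
    if (tinyxml2::XMLElement* net = root->FirstChildElement("network"))
        net->QueryIntAttribute("retries", &retries);

    return true;
}
```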
Edit:
If you make drastic changes, it can be helpful to implement a legacy data object that reads the legacy XML, and then write a conversion method from the old data model to the new one. This gives you a fresh start, especially if the old data model was badly designed.
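A minimal sketch of that conversion step, with both models invented for illustration:

```cpp
#include <string>

struct LegacyModelV1 {      // mirrors the old XML exactly, warts and all
    std::string ownerName;
    int flags;
};

struct Model {              // the redesigned current model
    std::string owner;
    bool readOnly;
};

// One-way conversion: the legacy type exists only to be read and converted.
Model convert(const LegacyModelV1& old) {
    Model m;
    m.owner = old.ownerName;
    m.readOnly = (old.flags & 0x1) != 0;  // invented flag bit, for illustration
    return m;
}
```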

What is the best way I should go about creating a program to store information into a file, edit the information in that file, and add new information

I'm about to start on a little project where I create a C++ program to store inventory data in a file (I guess a .txt will do):
• Item Description
• Quantity on Hand
• Wholesale Cost
• Retail Cost
• Date Added to Inventory
I need to be able to:
• Add new records to the file
• Display any record in the file
• Change any record in the file
Is there anything I should know before I start that could make this much easier and more efficient?
For example, should I try to use XML, or would that be too hard to work with from C++?
I've never really understood the most efficient way of doing this.
Would I search through the file and look for things in brackets or something?
EDIT
The data size shouldn't be too large. It is for homework, I guess you could say. I want to write the struct's contents directly into a file; how would I go about doing that?
There are many approaches. Is this for homework or for real use? If it's for homework, there are probably some restrictions on what you may use.
Otherwise, I suggest an embedded DBMS such as SQLite. There are others too, but this will be the most powerful solution, and it will also be the easiest to implement.
XML is also acceptable, and there are many reusable implementations available, but it will start losing performance once you get into thousands of records. The same goes for JSON. And one might still debate which of the two is simpler.
Another possibility is to create a struct and write its contents directly to the file. This gets tricky, though, if the record size is not constant; and if the record format changes, the file will need to be rebuilt. Otherwise this could be one of the best solutions performance-wise, if implemented carefully.
Could you please enlighten us as to why you don't want to use a database engine for this?
If it is just for learning, then please give us an estimated size of the data stored in that file and the access pattern (how many users, how often they access it, etc.).
The challenge will be to create an efficient search and modification code.
For the search, it's about data structures and organization.
For the modification, the question is how to write updates to the file without reading it completely into memory, updating it there, and then writing it completely back out.
If this is a project that will actually be used, with the potential to have features added over time, go for a database solution from the start, even if it seems overkill. I've been down this road before, small features get added over time, and before you realize it you have implemented a database. Poorly. Bite the bullet and use a database.
If this is a learning exercise, it depends on the amount of data you want to store. If it is small, the easiest thing to do is read the entire file into memory and operate on it there. When changes are made, write the entire file back out to disk. If the data is too large for that, the next best thing is to use fixed-size records. Create a POD struct that contains all of the data (i.e., no pointers, STL containers, etc.). Then you can rewrite individual records without needing to rewrite the entire file, as sketched below. If neither of these will work, your best bet is a database solution.
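A sketch of that fixed-size-record approach using the fields from the question; the field widths are arbitrary choices for the example:

```cpp
#include <cstddef>
#include <fstream>

// POD record: no pointers or containers, so every record occupies
// exactly sizeof(InventoryItem) bytes on disk.
struct InventoryItem {
    char   description[64];
    int    quantityOnHand;
    double wholesaleCost;
    double retailCost;
    char   dateAdded[11];   // "YYYY-MM-DD" plus terminator
};

// Overwrite record number `index` in place, leaving the rest untouched.
bool updateRecord(const char* path, std::size_t index, const InventoryItem& item) {
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;
    f.seekp(static_cast<std::streamoff>(index * sizeof(InventoryItem)));
    f.write(reinterpret_cast<const char*>(&item), sizeof(item));
    return static_cast<bool>(f);
}
```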
If you insist on doing it manually, I suggest JSON instead of XML.
Also, consider SQLite.
This sounds like a perfect job for SQLite. Small, fast, flexible and easy to use.
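For completeness, a minimal sketch of the SQLite route through its C API; the table schema is invented to match the question's fields:

```cpp
#include <cstdio>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("inventory.db", &db) != SQLITE_OK) return 1;

    char* err = nullptr;
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS inventory ("
        "  description TEXT, quantity INTEGER,"
        "  wholesale REAL, retail REAL, date_added TEXT);"
        "INSERT INTO inventory VALUES"
        "  ('widget', 10, 1.25, 2.50, '2014-01-01');",
        nullptr, nullptr, &err);
    if (err) { std::printf("%s\n", err); sqlite3_free(err); }

    sqlite3_close(db);
    return 0;
}
```

Searching and modifying then become SQL queries instead of hand-rolled file scanning, which addresses the bracket-searching worry directly.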