Suggestions about file reader tool implementation - C++

Say I need to implement a tool that always reads and updates a file.
The tool is text based and takes the commands to perform from the command line.
If the tool always reads and writes its data to and from the DB (the DB is just files in this case), should I keep any data structures in main memory to make things easier?
I thought about just making an interpreter that reads a command, parses it and performs it. Also, when there is a request for data, the tool just goes over the file and grabs the required data (without saving any of it in a data structure).
Keep in mind that the tool updates its DB whenever required, so I'd also have to update the data structures every time it updates the DB.
Bottom line: is it a good idea to go over the file and grab the information every time it is required, or should I build data structures within the program to make it faster and easier to keep the data?
The interpreter class (struct in this case) is something like:
struct Interpreter {
    virtual ~Interpreter() = default; // needed to delete through a base pointer
    virtual void interpret(const std::string& cmd) = 0;
};
The concrete interpreter inherits from it.
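For concreteness, here's a minimal sketch of one such concrete interpreter that scans the file on demand (the base struct is repeated for completeness; the "get <key>" command, the key-prefixed line format and the class names are made-up assumptions, not part of the original design):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

struct Interpreter {
    virtual ~Interpreter() = default;
    virtual void interpret(const std::string& cmd) = 0;
};

// Hypothetical concrete interpreter: handles "get <key>" by scanning
// the DB file for a line that starts with <key>; nothing is cached.
struct FileInterpreter : Interpreter {
    explicit FileInterpreter(std::string path) : path_(std::move(path)) {}

    void interpret(const std::string& cmd) override {
        std::istringstream iss(cmd);
        std::string verb, key;
        iss >> verb >> key;
        if (verb == "get") {
            std::ifstream in(path_);
            std::string line;
            while (std::getline(in, line)) {
                if (line.rfind(key, 0) == 0) {   // line starts with key
                    std::cout << line << '\n';
                    return;
                }
            }
            std::cout << "not found\n";
        }
    }

private:
    std::string path_;
};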
I would love to hear some suggestions.
Thanks!

Use Boost:
- program_options to read the command line
- filesystem to handle the file path
As for whether you re-read the file or cache it? You say: "have to update the data structures every time it updates its DB". I read from this that you must re-read the files every time, to stay consistent with the DB.
Even if you later find that you need to cache the files to improve performance, remember the old adage: "it's easier to make working code fast than to make fast code work!"
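For instance, a minimal program_options sketch (the --db option name and the overall shape are just an illustration, not a prescribed interface):

#include <boost/program_options.hpp>
#include <iostream>
#include <string>

namespace po = boost::program_options;

int main(int argc, char* argv[]) {
    std::string dbPath;

    po::options_description desc("Allowed options");
    desc.add_options()
        ("help", "print this help message")
        ("db", po::value<std::string>(&dbPath)->required(), "path to the DB file");

    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    if (vm.count("help")) { std::cout << desc << '\n'; return 0; }
    po::notify(vm);  // throws if a required option is missing

    std::cout << "using DB file: " << dbPath << '\n';
}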

Related

Multiple processes reading different parts of a big binary file simultaneously

I have a large binary file, saved on an NFS share. In the cluster, I want multiple processes to read this big file simultaneously. Each process is given a starting offset; it opens the big file, seeks to that offset and reads some number of bytes from there.
How do I design this project? As far as I can tell, it is similar to some concurrent databases. Is there any lightweight library or open-source project related to this? I use the C++ language.
Not sure there is a point in using a library.
You could use the basics: open the file, reposition yourself, then perform the read:
http://www.cplusplus.com/reference/fstream/ifstream/open/
http://www.cplusplus.com/reference/istream/istream/seekg/
or
http://www.cplusplus.com/reference/cstdio/fopen/
http://www.cplusplus.com/reference/cstdio/fseek/
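In case it helps, a minimal sketch of the seek-and-read approach with the fstream API (the path, offset and chunk size are made-up example values):

#include <fstream>
#include <vector>

int main() {
    std::ifstream in("/mnt/nfs/bigfile.dat", std::ios::binary);
    if (!in) return 1;

    in.seekg(100, std::ios::beg);       // reposition to byte 100
    std::vector<char> buf(4096);
    in.read(buf.data(), buf.size());    // read up to 4096 bytes
    std::streamsize got = in.gcount();  // number of bytes actually read
    return got > 0 ? 0 : 1;
}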
nicolae: I agree :-)
mining: so far you haven't said anything about a need for interaction between your readers.
Consider a simple scenario.
Let's say you have your C++ program called "dostuff" which takes the following arguments:
--name something to label your output.
--offset offset point, seek to here (defaults to zero).
--bytes number of bytes to process.
inputfile the file you want to read
The following would run your two processes in the background.
$ dostuff --name "proc1" --offset=0 --bytes=100 \\myserver\myshare\bigfile.dat &
$ dostuff --name "proc2" --offset=100 --bytes=100 \\myserver\myshare\bigfile.dat &
You can open a file handle within each process.
So long as the data access is read only why do you want to make it more complex?
important: I'm not saying it shouldn't be more complex, I'm suggesting you haven't yet shown a need for additional complexity. And that complexity is going to come from a need for your readers to collaborate. If they don't need to collaborate then you're pretty much done with your architecture - use the links Nicolae provided and good luck to you.
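For illustration, a rough sketch of what such a dostuff could look like; the flag parsing is deliberately naive and assumes --name=proc1 style arguments rather than a real options library:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    std::string name = "proc";
    long offset = 0, bytes = 0;
    std::string path;

    for (int i = 1; i < argc; ++i) {
        std::string arg = argv[i];
        if (arg.rfind("--name=", 0) == 0)        name = arg.substr(7);
        else if (arg.rfind("--offset=", 0) == 0) offset = std::atol(arg.c_str() + 9);
        else if (arg.rfind("--bytes=", 0) == 0)  bytes = std::atol(arg.c_str() + 8);
        else                                     path = arg;  // positional input file
    }

    std::ifstream in(path, std::ios::binary);  // independent handle per process
    if (!in) { std::cerr << name << ": cannot open " << path << '\n'; return 1; }

    in.seekg(offset, std::ios::beg);
    std::vector<char> buf(bytes > 0 ? static_cast<size_t>(bytes) : 1);
    in.read(buf.data(), buf.size());
    std::cout << name << ": read " << in.gcount() << " bytes\n";
}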

Beginner - data storage through XML or text files

I am a beginner with Visual Studio and have only coded C and C++ in command-line settings.
Currently, I am taking a module (software development) which requires me to come up with an expense tracker - a program which helps a user track his/her daily expenses. Therefore, at the end of each day, or after a user finishes using the program, we would have to store all the info in one place, and load it again at the next usage.
My constraints include not using any relational database (although I have no idea what that is :( ). Data storage must be done using XML or text files. Following this, I have several questions regarding data storage:
1) If data is stored successfully, do we load it every time we start the program? And every time the user closes the program, do we overwrite the existing data file and store the new data accordingly?
2) I have heard from some people that using a text file may be easier. Searching the internet and the library only gives me information about XML, not text. Would anyone be able to help me with it? Tutorial links and such?
Thank you very much!
File writing/handling works much like any other stream in C++.
You can enable file handling using the fstream header. You can create a file, write to it and overwrite it every time the program runs, or you can even create a file the first time the program runs and then append to it on every subsequent run.
I've only ever done text files, never tried XML, but I'm guessing they're similar.
http://www.cplusplus.com/doc/tutorial/files/ should give you everything you need to know.
Your choice of XML vs plain text depends on the kind of data that you'll be storing.
The reason why you'll only find XML libraries on the internet is because XML is a lot more complicated than plain text. If you don't know what XML is or if the data that you're storing isn't very complex, then I would suggest going with plain text.
For example, to track expenses, you might store a file like this:
sandwich 5.00
coffee 2.30
soft drink 1.50
...
It's very easy to read/write lines like this to/from a file in C++.
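A minimal sketch of that read/write round trip (the file name is arbitrary; note that a multi-word item like "soft drink" would need a different delimiter or quoting, since operator>> splits on whitespace):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

struct Expense {
    std::string item;
    double amount = 0.0;
};

int main() {
    {   // append two expenses, one per line
        std::ofstream out("expenses.txt", std::ios::app);
        out << "sandwich " << 5.00 << '\n';
        out << "coffee "   << 2.30 << '\n';
    }

    std::vector<Expense> expenses;   // read them all back
    std::ifstream in("expenses.txt");
    Expense e;
    while (in >> e.item >> e.amount)
        expenses.push_back(e);

    for (const auto& x : expenses)
        std::cout << x.item << " costs " << x.amount << '\n';
}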

Reading/writing only needed data to/from a large data file to minimize memory footprint

I'm currently brainstorming a financial program that will deal with (over time) fairly large amounts of data. It will be a C++/Qt GUI app.
I figure reading all the data into memory at runtime is out of the question because given enough data, it might hog too much memory.
I'm trying to come up with a way to read into memory only what I need, for example, if I have an account displayed, only the data that is actually being displayed (and anything else that is absolutely necessary). That way the memory footprint could remain small even if the data file is 4gb or so.
I thought about some sort of searching function that would slowly read the file line by line and find a 'tag' or something identifying the specific data I want, and then load that, but considering this could theoretically happen every time there's a gui update that seems like a terrible way to go.
Essentially I want to be able to efficiently locate specific data in a file, read only that into memory, and possibly change it and write it back without reading and writing the whole file every time. I'm not an experienced programmer and my googling for ideas hasn't been very successful.
Edit: I should probably mention I intend to use Qt's fancy QDataStream related classes to store the data. In other words the file will likely be binary and not easily searchable line by line like a text file.
Okay, based on your comments:
Start simple. Forget about your fiscal application for now, except as background, and build a suitable test bed for your file layout:
- One data type, e.g. accounts.
- Start with fixed-width columns, giving you a fixed-width record.
- One file for the data.
- Another file for the index on account number.
Implement Insert, Update and Delete - you'll learn a lot.
For instance, Delete: you could find the index entry and the data, move them out and rebuild both files. Or you could have an internal field on the account record that indicates it has been deleted, set that in the data, and just remove the index entry - though removing the index entry also means rewriting the entire index file. You could put the delete flag in the index file instead...
When inserting, do you want to append, or do you want to find a deleted record and reuse its slot?
Is your index just going to be a straight list of account numbers and positions, or do you want to hash it, or use a tree? You could spend weeks if not months looking at indexing strategies alone.
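To make the fixed-width idea concrete, here's a small sketch; all field widths and file names are invented, and dumping a raw struct like this is not portable across compilers or platforms (padding, endianness), which is fine for a learning exercise:

#include <cstring>
#include <fstream>
#include <iostream>

struct Account {
    char   number[16];  // fixed-width account number
    char   name[48];    // fixed-width holder name
    double balance;
    char   deleted;     // delete flag kept inside the record
};

// Record N lives at byte offset N * sizeof(Account).
void writeRecord(std::fstream& f, long n, const Account& a) {
    f.seekp(n * static_cast<long>(sizeof(Account)), std::ios::beg);
    f.write(reinterpret_cast<const char*>(&a), sizeof(Account));
}

bool readRecord(std::fstream& f, long n, Account& a) {
    f.seekg(n * static_cast<long>(sizeof(Account)), std::ios::beg);
    f.read(reinterpret_cast<char*>(&a), sizeof(Account));
    return static_cast<bool>(f);
}

int main() {
    // Create the file if it doesn't exist yet, then reopen for update.
    { std::ofstream create("accounts.dat", std::ios::binary | std::ios::app); }
    std::fstream f("accounts.dat", std::ios::in | std::ios::out | std::ios::binary);

    Account a{};
    std::strncpy(a.number, "ACC-0001", sizeof(a.number) - 1);
    std::strncpy(a.name, "Alice", sizeof(a.name) - 1);
    a.balance = 100.0;
    writeRecord(f, 0, a);

    Account b{};
    if (readRecord(f, 0, b))
        std::cout << b.number << " " << b.name << " " << b.balance << '\n';
}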
Happy learning anyway. It will be interesting to help with your future questions.

How to save changes to XML file using TinyXML?

I'm working on a project that requires me to load some of the data from an XML file on to a GUI. The GUI allows the user to make some changes to the data. What I want to be able to do is to save these changes back onto the XML file.
I know it is possible to rewrite the whole file but the file is pretty huge, and not all the data in the file is being changed or even being used in my program.
This is my first project working with TinyXML and C++ Builder. I am just looking for some suggestions as to how I should approach this.
Unless you are certain that the new text will be exactly the same size as the old, rewriting only part of a text file is not a good idea in general. There are file formats where piecemeal replacement is possible. XML is not one of them. Not in the general case, at least.
Inserting data in the middle of a file, thus moving the rest down, is basically equivalent to loading the rest of the file, making the file bigger, and writing it back. So you may as well just load the entire file, make your modifications, and save it again. Your code will be simpler and likely not much slower.
And no, a SAX parser isn't going to help you here. It allows you to stream reading (though I would suggest a pull parser rather than a push one), but that's not going to allow you to insert data into the file. That's generally not supported by most XML parsers I know of. They can write data, but writing and non-destructively inserting are two different things.
TinyXml will let you do what you want without damaging the file contents (as long as it's valid XML). I just checked this, so I am quite certain. Obviously you have to know precisely which attributes and tags you want to edit, but you can add/edit tags without affecting the existing attributes/tags/comments, even within the tags you edit. It will take a while until you get used to the structure, but it is definitely possible.
You do have to know the structure of the XML!
TiXmlDocument doc("filepath"); // will open your document
if (!doc.LoadFile()) // you do have to open the whole file
{
    cout << "No XML structure found" << endl;
    return; // exit the function, don't load anything
}
TiXmlElement *root = doc.RootElement(); // pointer to the root element
Now you can use this pointer and commands like:
TiXmlElement *tageone = root->FirstChild("tageone")->ToElement();
tageone->SetDoubleAttribute("attribute", value);
to change stuff.
Sorry for the rushed explanation, but you'll need to read through the documentation a bit to get the hang of it.
cheers
Update
As I said in the comment, I don't think you are better off inserting into the middle of a file. However, if you need/want additional security, I suggest two additional steps:
- Perform a sanity check of the XML file at all the important steps. This can be anything that makes sure the file you are reading is really what you need.
- Calculate a checksum over the content of the whole file before saving, and check it afterwards. This does not necessarily need to be a CRC; I just named the function calculate_crc(). Anything that lets you verify the integrity of the data is good.
I would do this approximately as follows (pseudocode):
TiXmlDocument doc( "demo.xml" );
doc.LoadFile();
perform_sanitycheck(doc);
// do whatever you need to change
perform_sanitycheck(doc);
unsigned int crc = calculate_crc(doc);
doc.SaveFile("temp_name.xml"); // save the file under another name
TiXmlDocument doc2( "temp_name.xml" );
perform_sanitycheck(doc2);
if (verify_crc(doc, crc))
{
    delete_file("demo.xml");
    rename_file("temp_name.xml", "demo.xml");
}
The sanity check would take the appropriate action if necessary. You need to substitute the two functions delete_file() and rename_file() with an API or library function for your environment.
The functions calculate_crc() and verify_crc() could be specifically crafted to check only the parts that you need to have unchanged.
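For the delete_file()/rename_file() placeholders, C++17's std::filesystem has portable equivalents; a tiny sketch (errors surface as exceptions here, which real code would catch):

#include <filesystem>
namespace fs = std::filesystem;

// Move the verified temp file into the place of the original document.
void commit_file(const fs::path& temp, const fs::path& target) {
    fs::remove(target);        // delete_file("demo.xml")
    fs::rename(temp, target);  // rename_file("temp_name.xml", "demo.xml")
}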

Writing to the middle of the file (without overwriting data)

In windows is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible what approach/workaround is usually taken? Re-writing everything after the insertion point becomes prohibitive really quickly with big (ie, gigabytes) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files, where the user types stuff and then saves. I also can't split the file into several smaller ones.
I'm unaware of any way to do this if the interim result you need is a flat file that can be used by applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of the file, since it's really just a sequential file.
But the italics are there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document, rather they appended a delta record to the end of the file. Then, when re-reading the file, it applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be usable immediately to another application that doesn't understand the file format.
What I'm proposing there is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as much as they want but the time-expensive operation won't have as much of an impact.
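A toy sketch of that append-a-delta idea; the record layout, names and file handling here are entirely invented, just to show the shape of it:

#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical delta record: replace `length` bytes at `offset` in the
// base document with the bytes that follow this header in the file.
struct DeltaHeader {
    std::uint64_t offset;
    std::uint32_t length;
};

void appendDelta(const std::string& path, std::uint64_t offset,
                 const std::string& newText) {
    std::ofstream out(path, std::ios::binary | std::ios::app);
    DeltaHeader h{offset, static_cast<std::uint32_t>(newText.size())};
    out.write(reinterpret_cast<const char*>(&h), sizeof(h));
    out.write(newText.data(), newText.size());
}
// On load, read the base document and then apply each delta in order.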
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiencies which would speed things up. You'd probably still have to rewrite to the end of the file, but it would be happening at a lower level in the OS.
If the primary reason you want fast save is to start letting the user keep working (rather than having the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. Then you would need synchronisation between the two threads to prevent the user modifying data yet to be saved to disk.
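A minimal sketch of handing the save to a worker thread: snapshot the buffer, write the copy in the background, and return control immediately. The names are invented, and real code still needs the synchronisation mentioned above:

#include <fstream>
#include <string>
#include <thread>

// The snapshot is taken by value, so the user can keep editing the
// live buffer while this copy is written out in the background.
void saveAsync(std::string snapshot, std::string path) {
    std::thread([snap = std::move(snapshot), path = std::move(path)] {
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        out.write(snap.data(), snap.size());
    }).detach();  // fire and forget; a real editor would track completion
}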
The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation under a completely unprotected system like MS-DOS would have been difficult but bordering on reasonable. On most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible -- in fact, it's now thoroughly insane to even contemplate, where it was once almost reasonable.
I'm not sure about the format of your file, but you could make it 'record' based:
- Write your data in chunks and give each chunk an id. The id could be the data's offset in the file.
- At the start of the file you could have a header with a list of ids, so that you can read records in order.
- At the end of the 'list of ids' you could point to another location in the file (an id/offset) that stores another list of ids.
Something similar to a filesystem.
To add new data you append it at the end and update the index (add its id to the list) - see the sketch below.
You have to figure out how to handle delete and update.
If records are all the same size, then to delete one you can just mark it empty and reuse the slot next time, with the appropriate updates to the index table.
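A toy sketch of the append-and-update-index step, using the record's byte offset as its id as suggested above; file names and layout are made up:

#include <cstdint>
#include <fstream>
#include <vector>

// Append a record to the data file and its id (byte offset) to the
// index file; returns the new record's id.
std::uint64_t appendRecord(const char* dataPath, const char* indexPath,
                           const std::vector<char>& record) {
    std::ofstream data(dataPath, std::ios::binary | std::ios::app);
    data.seekp(0, std::ios::end);
    std::uint64_t id = static_cast<std::uint64_t>(data.tellp());
    data.write(record.data(), record.size());

    std::ofstream index(indexPath, std::ios::binary | std::ios::app);
    index.write(reinterpret_cast<const char*>(&id), sizeof(id));
    return id;
}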
Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.
If you're using .NET 4, try a memory-mapped file if you have an editor-like application - it might just be the ticket. Something like this (I didn't type it into VS so I'm not sure I got the syntax right):
MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
    @"C:\bigfile.dat",
    FileMode.Create,
    "BigFileMemMapped",
    1024 * 1024,
    MemoryMappedFileAccess.ReadWrite);
MemoryMappedViewAccessor view = bigFile.CreateViewAccessor();
long offset = 1000; // must fall within the mapped capacity
view.Write(offset, ref myStruct); // myStruct is some value type instance
I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total under 16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternate data stream. Non-trivial, of course.
I know that this question is marked "Windows", but I'll still add my $0.05 and say that on Linux it is possible to both insert or remove a lump of data to/from the middle of a file without either leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
Again, I know that this probably won't help the OP, but I personally landed here searching for a Linux-specific answer. (There is no "Windows" word in the question body, so the search engine saw no problem with sending me here.)
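A minimal sketch of those two calls (Linux-specific; both offset and len must be multiples of the filesystem block size, and the flags need a reasonably recent kernel on a filesystem such as ext4 or XFS):

#include <fcntl.h>         // open(), fallocate() (glibc with _GNU_SOURCE)
#include <linux/falloc.h>  // FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("bigfile.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    const off_t blk = 4096;  // assuming 4 KiB filesystem blocks

    // Remove one block at offset 4096, shifting the tail up...
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, blk, blk) != 0)
        perror("collapse");

    // ...or open up one block of new space at offset 4096, shifting the
    // tail down; the new range reads as zeros until you write into it.
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, blk, blk) != 0)
        perror("insert");

    close(fd);
}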