Multiple processes reading different parts of a big binary file simultaneously - concurrency

I have a large binary file saved on an NFS share. In the cluster, I want multiple processes to read this big file simultaneously. Each process is given an offset; it opens the big file, seeks to the supplied offset, and reads some number of bytes.
How should I design this project? As far as I can tell, it is similar to what some concurrent databases do. Is there any lightweight library or open-source project related to my project? I use the C++ language.

Not sure there is a point in using a library.
You could use basic stuff. Open the file, reposition yourself, and then perform the read; a short sketch follows these links:
http://www.cplusplus.com/reference/fstream/ifstream/open/
http://www.cplusplus.com/reference/istream/istream/seekg/
or
http://www.cplusplus.com/reference/cstdio/fopen/
http://www.cplusplus.com/reference/cstdio/fseek/
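For instance, a minimal sketch of the ifstream approach (the file name, offset, and size here are made up for illustration):

#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // Each process would run this with its own offset and byte count.
    std::streamoff offset = 0;   // hypothetical: where this process starts
    std::size_t nbytes = 100;    // hypothetical: how much it reads

    std::ifstream in("bigfile.dat", std::ios::binary);
    if (!in) { std::cerr << "cannot open file\n"; return 1; }

    in.seekg(offset);            // reposition to this process's slice
    std::vector<char> buf(nbytes);
    in.read(buf.data(), buf.size());
    std::size_t got = static_cast<std::size_t>(in.gcount());

    std::cout << "read " << got << " bytes at offset " << offset << "\n";
}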

nicolae: I agree :-)
mining: so far you haven't said anything about a need for interaction between your readers.
Consider a simple scenario.
Let's say you have your C++ program called "dostuff" which takes the following arguments:
--name: something to label your output.
--offset: offset point; seek to here (defaults to zero).
--bytes: number of bytes to process.
inputfile: the file you want to read.
The following would run your two processes in the background.
$ dostuff --name "proc1" --offset=0 --bytes=100 \\myserver\myshare\bigfile.dat &
$ dostuff --name "proc2" --offset=100 --bytes=100 \\myserver\myshare\bigfile.dat &
You can open a file handle within each process.
So long as the data access is read-only, why make it more complex?
important: I'm not saying it shouldn't be more complex, I'm suggesting you haven't yet shown a need for additional complexity. And that complexity is going to come from a need for your readers to collaborate. If they don't need to collaborate then you're pretty much done with your architecture - use the links Nicolae provided and good luck to you.

Related

How to track the number of times my console application in C++14 has been launched?

I'm building a barebones Notepad-styled project (console-based, does not have a GUI as of now) and I'd like to track, display (and later use it in some ways) the number of times the console application has been launched. I don't know if this helps, but I'm building my console application on Windows 10, but I'd like it to run on Windows 7+ as well as on Linux distros such as Ubuntu and the like.
I prefer not storing the details in a file and then subsequently reading from it to maintain count. Please suggest a way or any other resource that details how to do this.
I'd put a strikethrough on my quote above, but SO doesn't have it apparently.
Note that this is my first time building such a project so I may not be familiar with advanced stuff... So, when you're answering please try to explain as is required for a not-so-experienced software developer.
Thanks & Have a great one!
Edit: It seems that the general advice is to use text files to preserve portability and to account for the fact that if, down the line, I need to store some extra info, the text file will come in super handy. In light of this, I'll focus my efforts on the text file.
Thanks to all for keeping my efforts from de-railing!
I prefer not storing the details in a file
In the comments, you wrote that the reason is security and that you consider using a file "overkill" in this case.
Security can be solved easily - just encrypt the file. You can use a library like this to get it done.
In addition, since you are writing to and reading from the file only once each time the application is opened/closed, and the file should take only a small number of bytes to store such data, I think it's the right, portable solution.
If you still don't want to use a file, you can use the Windows registry to store the data, but this solution is not portable.
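A minimal sketch of the plain-text-file approach (the file name is arbitrary, error handling kept short):

#include <fstream>
#include <iostream>

int main() {
    const char* counterFile = "launch_count.txt";  // hypothetical location

    // Read the previous count; defaults to 0 on the first launch.
    unsigned long long count = 0;
    std::ifstream in(counterFile);
    if (in) in >> count;

    ++count;

    // Write the updated count back.
    std::ofstream out(counterFile, std::ios::trunc);
    out << count;

    std::cout << "This application has been launched " << count << " time(s).\n";
}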

Is it a good idea to include a large text variable in compiled code?

I am writing a program that produces a formatted file for the user, but it doesn't only produce the formatted file; it does more.
I want to distribute a single binary to the end user, and when the user runs the program, it will generate the XML file with the appropriate data.
In order to achieve this, I want to place the file contents in a char array variable that is compiled into the code. When the user runs the program, I will write out the char array to generate an XML file for the user.
const char* buffers = "a xml format file contents, \
this represent many block text \
from a file,...";
I have two questions.
Q1. Do you have any other ideas for how to compile my file contents into the binary, i.e., distribute it as one binary file?
Q2. Is this even a good idea as I described above?
What you describe is by far the norm for C/C++. For large amounts of text data, or for arbitrary binary data (or indeed any data you can store in a file, e.g. a zip file), you can put the data in a file and link that file directly into your program.
An example may be found on sites like this one.
I'd recommend using a separate file to hold the data rather than putting it into the binary, unless you have your own reasons. I don't know of other portable ways to put strings into a binary file, but your solution seems OK.
However, note that when using \ at the end of a line to form a multi-line string, the indentation must be taken care of, because the pieces are concatenated from the beginning of the next line:
const char* buffers = "a xml format file contents, \
    this represent many block text \
    from a file,...";
Or you can use another form:
const char *buffers =
    "a xml format file contents, "
    "this represent many block text "
    "from a file,...";
Probably my answer provides much redundant information for the topic starter, but here is what I'm aware of:
Embedding in source code: the plain C/C++ solution. It is a bad idea because each time you want to change your content, you will need to:
recompile
relink
This is acceptable only if your content changes very rarely or never, or if build time is not an issue (i.e. your app is small).
Embedding in the binary: a few slightly more flexible ways of embedding content in executables exist, but none of them are cross-platform (you've not stated your target platform):
Windows: resource files. With most IDEs it is very simple.
Linux: objcopy (see the sketch below).
macOS: application bundles. Even simpler than on Windows.
You will not need to recompile your C++ file(s), only re-link.
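For the Linux case, a hedged sketch of the objcopy route (the file name data.xml is made up; objcopy derives the symbol names from the input filename):

$ objcopy -I binary -O elf64-x86-64 -B i386:x86-64 data.xml data.o

The resulting object file exports _binary_data_xml_start, _binary_data_xml_end and _binary_data_xml_size, which you can reference from C++:

#include <cstddef>
#include <iostream>

// Symbols created by objcopy; their addresses delimit the embedded data.
extern "C" const char _binary_data_xml_start[];
extern "C" const char _binary_data_xml_end[];

int main() {
    std::size_t size =
        static_cast<std::size_t>(_binary_data_xml_end - _binary_data_xml_start);
    std::cout.write(_binary_data_xml_start, size);  // dump the embedded file
}

Link data.o in with the rest of your program, e.g. g++ main.o data.o -o myapp.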
Application virtualization: there are special utilities that wrap all your application resources into a single executable and run it, similar to a virtual machine.
I'm only aware of such utilities for Windows (ThinApp, BoxedApp), but there are probably such things for other OSes too, or even cross-platform ones.
Consider distributing your application in some form of installer: when the installer starts, it creates all the resources and unpacks the executable. This is similar to having the main executable generate the whole lot. It can be a large and complex package or even a simple self-extracting archive.
Of course, the choice depends on what kind of application you are creating, who your target audience is, how you will ship the package to end users, etc. If it is a game targeting children, it's not the same as a Unix console utility for C++ coders =)
It depends. If you are doing some small Unix-style utility with no prospect of internationalization, then it's probably fine. You don't want to bloat a distribution package with a file no one would ever touch anyway.
But in general it is bad practice, because eventually someone might want to modify this data and would have to rebuild the whole thing just to fix a typo.
The decision is really up to you.
If you just want to keep your distribution package in one piece, you might also find this thread interesting: Store data in executable
Why don't you distribute your application with an additional configuration file? E.g., package your application executable and the config file together.
If you do want to make it a single file, try embedding your config file into the executable as a resource.
I see it as more of an OS issue than a C/C++ one. You can add the text to the resource part of your binary/program. In Windows programs, HTML, graphics and even movie files are often compiled into resources that become part of the final binary.
That is handy for possible future translation into another language, plus you can modify the resource part of the binary without recompiling the code.

Is there any reference/resource about how to design the structure of a data file? [duplicate]

Possible Duplicate:
What are important points when designing a (binary) file format?
I am going to develop a program which will store data in a file. The file can be big. The data in the file is basically made up of variable-length records, and I need random access to the records.
I just want to read some resources/books about how to design the structure of a data file, but I can't find any yet.
Any suggestion is much appreciated.
You might find http://decoy.iki.fi/texts/filefd/filefd useful. It's a general starting point to the techniques to consider.
Also look at this question here on SO: What are important points when designing a (binary) file format?
The problem you describe is a central theme of Database Theory.
Any decent text on the subject should give you some good ideas. The standard text from uni was:
Fundamentals of Database Systems - Elmasri & Navathe (PDF) (Amazon)
Another approach is to use a memory mapped array of structs, take a look at my bountied answer to a similar question
Yet another approach is to use a binary protocol like Google protobuf and "send" your data to the file when writing and "receive" it when reading.
If the answer you're looking for is "what book to read" I can't help.
If "how do to that" may be good for you as well I've some suggestions.
One good solution is the one suggested by Srykar; I would just add that I'd use SQLite instead of MySQL. It's an open-source C library that you can embed in your program. It lets you store data in a DB just the way you'd do with SQL statements, but by calling the library's C functions instead. In your case you could keep everything in memory and then save the data to disk at the proper time.
Reference:
http://www.sqlite.org
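A minimal sketch of the SQLite route (the file name, table name and schema are invented for illustration):

#include <sqlite3.h>
#include <iostream>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("records.db", &db) != SQLITE_OK) {
        std::cerr << sqlite3_errmsg(db) << "\n";
        return 1;
    }

    // One table holding the variable-length records; SQLite indexes the
    // integer primary key for us, which gives random access by id.
    const char* sql =
        "CREATE TABLE IF NOT EXISTS records(id INTEGER PRIMARY KEY, payload BLOB);"
        "INSERT INTO records(payload) VALUES (x'DEADBEEF');";
    char* err = nullptr;
    if (sqlite3_exec(db, sql, nullptr, nullptr, &err) != SQLITE_OK) {
        std::cerr << err << "\n";
        sqlite3_free(err);
    }

    sqlite3_close(db);
}

Compile with something like g++ main.cpp -lsqlite3.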
Another option is the old "do it yourself" way. I mean: there's nothing very complicated about storing your data in a file (unless your data is very, very structured, but then I'd go with option no. 1).
You write down a plan of how you want the structure of your file to be, and you follow that plan both when writing the file to disk and when reading it back into memory.
If you have n records, write n to disk, then write each record.
If each record has variable length, then write the length of each record before the record itself.
You mention "random access" in your question. Probably you mean that the file is very big, and at access time you want to read from disk only the portion you're interested in.
If so, plan to build an index; that index will record the offset of each element, in bytes, from the beginning of the file. Store the index at the beginning of the file, then the data.
When you read the file, you start by reading the index, get the offset of the data you need, and read that portion of the file.
These are very basic examples, just to get the idea...
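For instance, a minimal sketch of the plan above, with length-prefixed records (the format details are invented for illustration, and the byte order is whatever the machine uses):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Write: the record count first, then each record as <length><bytes>.
void save(const std::string& path, const std::vector<std::string>& records) {
    std::ofstream out(path, std::ios::binary);
    std::uint32_t n = static_cast<std::uint32_t>(records.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    for (const auto& r : records) {
        std::uint32_t len = static_cast<std::uint32_t>(r.size());
        out.write(reinterpret_cast<const char*>(&len), sizeof len);
        out.write(r.data(), len);
    }
}

// Read: follow the same plan in reverse.
std::vector<std::string> load(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::uint32_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    std::vector<std::string> records(n);
    for (auto& r : records) {
        std::uint32_t len = 0;
        in.read(reinterpret_cast<char*>(&len), sizeof len);
        r.resize(len);
        in.read(&r[0], len);
    }
    return records;
}

A real format would pin the endianness and add the offset index described above for random access.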
Hope this helps!
Is there any reason you are not considering putting this data in a persistent DB store like MySQL? These systems are built to deal with random data access, with proper indexes to speed up your data retrieval. Plus, while reading from a plain file you would have to read the entire file to get what you want, as there are no indexes and no query language.
On top of this, they have mechanisms in place to make sure multiple running processes can access the same data without it getting corrupted, and they provide data recovery in case of inconsistencies.
So just storing the data is the simple part; it does not end there. You would have to provide all the other features eventually, so better to use what's available.

How to measure the amount of data transmitted by my MPI program?

I'm experimenting with my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD), which can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment, which consists of running my algorithm several times while varying the number of nodes and the size of the input data.
I want to know the amount of data used in the MPI communications for each run of my algorithm, so I can see how the amount of data changes when varying the previously mentioned parameters. And I want to do all this automatically using a batch program.
Someone told me to use tcpdump, but I found some difficulties in this approach.
First, I don't know how to call tcpdump in my batch program (which is written in C++ and uses system() for making calls) before each run of my algorithm, since tcpdump requires another terminal to run in parallel with my application. And I can't run tcpdump on another computer, since the network uses a switch. So I need to run it on the master node.
Second, I watched the traffic with tcpdump while my experiment was going on and I couldn't figure out which port was used by MPI. It seems to use many ports. I wanted to know that so I could filter the packets.
Third, I tried capturing whole packets and saving them to a file using tcpdump, and within a few seconds the file was 3.5 MB. But my whole experiment takes 2 days. So the final log file would be huge if I followed this approach.
The ideal approach would be to capture just the size field in the header of each packet and sum these up to obtain the total amount of data transmitted. That way the log file would be much smaller than if I were capturing whole packets. But I don't know how to do it.
Another restriction is that I don't have access to the computers' disks. So I only have the RAM and my 4GB USB flash drive, and I can't keep huge log files.
I have already thought about using an MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. I have only tested Sun Performance Analyzer so far. The problem is that I guess it will be difficult, maybe even impossible, to install those tools on BCCD. In addition, such a tool will make my experiment take longer to finish, since it adds overhead. But if someone is familiar with BCCD and thinks one of those tools is a good choice, please let me know.
I hope someone has a solution.
Approaches like tcpdump won't work anyway if there are multi-core nodes that use shared memory to communicate.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call, and parse the resulting text file yourself. By the way, note that MPE is explicitly discussed on the bccd website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. I've only found a very modest overhead for MPE.
IPM is another nice tool that counts messages and sizes; you should be able either to parse the XML output or to use the post-processing tools and just manually integrate the graphs (say, either bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead for IPM is even less than for MPE, and mostly comes after the program has finished running, to do the file I/O.
If you were really super worried about the overhead of either of these approaches, you could always write your own MPI wrappers using the profiling interface that wrap MPI_Send, MPI_Recv, etc., just count the number of bytes sent and received per process, and output only those totals at the end.
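A minimal sketch of that profiling-interface idea (it only intercepts blocking MPI_Send/MPI_Recv, and assumes the const-correct MPI-3 signatures; older MPI-2 headers declare the send buffer as plain void*):

#include <mpi.h>
#include <cstdio>

static long long g_bytes_sent = 0;
static long long g_bytes_recv = 0;

// The MPI profiling interface lets us define MPI_Send ourselves and forward
// to the real implementation through PMPI_Send.
extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype dt,
                        int dest, int tag, MPI_Comm comm) {
    int size = 0;
    MPI_Type_size(dt, &size);
    g_bytes_sent += static_cast<long long>(count) * size;
    return PMPI_Send(buf, count, dt, dest, tag, comm);
}

extern "C" int MPI_Recv(void* buf, int count, MPI_Datatype dt, int src,
                        int tag, MPI_Comm comm, MPI_Status* status) {
    MPI_Status local;
    if (status == MPI_STATUS_IGNORE) status = &local;  // we need a real status
    int ret = PMPI_Recv(buf, count, dt, src, tag, comm, status);
    int recvd = 0;
    MPI_Get_count(status, dt, &recvd);  // elements actually received
    if (recvd != MPI_UNDEFINED) {
        int size = 0;
        MPI_Type_size(dt, &size);
        g_bytes_recv += static_cast<long long>(recvd) * size;
    }
    return ret;
}

extern "C" int MPI_Finalize(void) {
    int rank = 0;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::printf("rank %d: sent %lld bytes, received %lld bytes\n",
                rank, g_bytes_sent, g_bytes_recv);
    return PMPI_Finalize();
}

Compile this into your program (or into a library linked before the MPI library) and every rank prints its totals when it finalizes.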

Writing to the middle of the file (without overwriting data)

In Windows, is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible what approach/workaround is usually taken? Re-writing everything after the insertion point becomes prohibitive really quickly with big (ie, gigabytes) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files in several smaller ones.
I'm unaware of any way to do this if the interim result you need is a flat file that can be used by applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of the file, since it's really just a sequential file.
But the italics are there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document, rather they appended a delta record to the end of the file. Then, when re-reading the file, it applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be usable immediately to another application that doesn't understand the file format.
What I'm proposing here is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step that converts it to a usable text file infrequently (e.g., on editor exit). That way, the user can save as much as they want, but the time-expensive operation won't have as much of an impact.
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiencies that would speed things up. You'd probably still have to rewrite to the end of the file, but it would happen at a lower level in the OS.
If the primary reason you want fast save is to start letting the user keep working (rather than having the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. Then you would need synchronisation between the two threads to prevent the user modifying data yet to be saved to disk.
The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation under a completely unprotected system like MS-DOS would have been difficult but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible, but in practice it's now thoroughly insane to even contemplate, where it was once almost reasonable.
I'm not sure about the format of your file, but you could make it 'record' based.
Write your data in chunks and give each chunk an id. The id could be the data's offset in the file.
At the start of the file you could have a header with a list of ids, so that you can read the records in order.
At the end of the 'list of ids' you could point to another location in the file (an id/offset) that stores another list of ids.
Something similar to a filesystem.
To add new data, you append it at the end and update the index (add its id to the list).
You have to figure out how to handle record deletion and updates.
If records are all the same size, then to delete one you can just mark it empty and reuse it next time, with appropriate updates to the index table.
Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.
If using .NET 4, try a memory-mapped file if you have an editor-like application - it might just be the ticket. Something like this (I didn't type it into VS so not sure if I got the syntax right):
using System.IO;
using System.IO.MemoryMappedFiles;

// Create a 2 GB mapping backed by the file.
MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
    @"C:\bigfile.dat",
    FileMode.Create,
    "BigFileMemMapped",
    2L * 1024 * 1024 * 1024,
    MemoryMappedFileAccess.ReadWrite);
MemoryMappedViewAccessor view = bigFile.CreateViewAccessor();
long offset = 1000000000;
view.Write(offset, ref myStruct);  // myStruct: some value type (struct) declared elsewhere
I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total < 16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternate data stream. Non-trivial, of course.
I know that this question is marked "Windows", but I'll still add my $0.05 and say that on Linux it is possible to insert or remove a lump of data to/from the middle of a file without either leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
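For illustration, a minimal Linux-only sketch of the insert case (the file name and sizes are made up; both offset and len must be multiples of the filesystem block size, and only some filesystems, such as ext4 and XFS, support these flags):

#include <fcntl.h>    // open, fallocate, FALLOC_FL_INSERT_RANGE (glibc, _GNU_SOURCE)
#include <unistd.h>   // pwrite, close
#include <cstdio>

int main() {
    const off_t offset = 4096;  // must be block-aligned
    const off_t len    = 4096;  // must be block-aligned

    int fd = open("bigfile.dat", O_RDWR);
    if (fd < 0) { std::perror("open"); return 1; }

    // Shift everything from `offset` onward right by `len` bytes,
    // creating a gap we can now write into.
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len) != 0) {
        std::perror("fallocate");  // e.g. EOPNOTSUPP on unsupported filesystems
        close(fd);
        return 1;
    }

    char block[4096] = {0};     // the new data for the gap
    pwrite(fd, block, sizeof block, offset);

    close(fd);
    return 0;
}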
Again, I know that this probably won't help the OP, but I personally landed here searching for a Linux-specific answer. (The word "Windows" doesn't appear in the question itself, so the search engine saw no problem with sending me here.)