dd - Understanding block size - c++

I've used "dd" for creating test files and performing backups across HDDs. No problem.
Currently, I'm trying to use it to test NFS transfer rates. At first, I was varying the block size ("bs" argument)... But this got me thinking, why would I need to vary this argument?
A typical use-case that I want to simulate is:
Node X has a large data structure in memory
Node X wants to write it to a file located in a NFS-mounted directory
In this case, the typical C/C++ code for a 2D array would be:
FILE *ptr = fopen("path_to_nfs_area", "w");
for (int i = 0; i < data.size(); ++i)
fwrite(data[i], sizeof(float), width, ptr);
...
So in this case, we're writing to a buffer in 32bit increments (sizeof(float)) - and since this is a FILE object, it's probably being buffered as well (maybe that's not a good thing, but might be irrelevant for this discussion).
I'm having a hard time making the jump from "dd" writing from if->of in "bs" chunks versus an application writing out variables from memory (and simulating this with dd).
Does it make sense to say that it is pointless to vary the value of "bs" less than the system PAGE_SIZE?
Here's my current understanding, so I don't see why changing the "dd" block size would matter:

You might get better answers on superuser.com, as this question is a bit off-topic here.
But consider the possibility that the nfs share is not mounted with an async flag - in that case, each single write would need to be confirmed by the nfs server before the next write can evens start. So bs=1 would need about double time compared to bs=2, and each of them would be MUCH slower that a sensible block size.
If the async flag is set on the nfs mount, your kernel might merge several small writes into one big one anyway, so the effect of setting bs should be neglectable.
Anyway, if you're testing to set up an environment for a specific application, use that application for testing, nothing else. Performance might depend on so much application-specific behaviour that any generic tool won't be able to reproduce.

Related

Optimal method to mmap a file to RAM?

I am using mmap to read a file and I only recently found out that it is not actually getting it into RAM, but is only creating a virtual address space for it. This will cause any accessing of the data to still use disk which I want to avoid, so I want to read it all into RAM.
I am reading the file via:
char* cs_virt;
cs_virt = (char*)mmap(0, nchars, PROT_READ, MAP_PRIVATE, finp, offset);
and when I loop after this, I see that the virtual memory for this process has, indeed, been blown up. I want to copy this into RAM, though, so I do the following:
char* cs_virt;
cs_virt = (char*)mmap(0, nchars, PROT_READ, MAP_PRIVATE, finp, offset);
cs = (char*)malloc(nchars*sizeof(char));
for(int ichar = 0; ichar < nchars; ichar++) {
cs[ichar] = cs_virt[ichar];
}
Is this the best method? If not, what is a more efficient method to do this? I have this taking place in a function and cs is declared outside the function. Once I exit the function, I will retain cs, but will cs_virt need to be deleting or will it go away on it's own since it is declared locally in the function?
If you are using Linux, you may be able to use MAP_POPULATE:
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later
accesses to the mapping will not be blocked by page faults.
MAP_POPULATE is supported for private mappings
only since Linux 2.6.23.
This may be useful if you have time to spare when you mmap() but your later accesses need to be responsive. Consider also MAP_LOCKED if you really need the file to be mapped in and never swapped back out.
MPI and I/O is a murky issue. HDF5 seems to be the most common library that can help you with that, but it often needs tuning for the particular cluster, which is often impossible for mere users of the cluster. A colleague of mine had better success with SIONlib, and was able to get his code working on nearly 1e6 cores on JUGENE with that, so I'd have look at that.
In both cases you will probably need to adapt your file format. In the case of my colleague it even paid of to write the data in parallel fashion using SIONlib, and to later do e sequential postprocessing to "defragment" the holes left be the parallel access pattern that SIONlib chose. It might be similar for input.

Improving/optimizing file write speed in C++

I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data is coming in at a rate of 10MS/s, and it's then saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app| ios::binary);
if(outFile.is_open())
cout << "Yes" << endl;
while(1)
{
rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
switch(metaData.error_code)
{
//Irrelevant error checking...
//Write data to a file
std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
}
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.
So the main problem here is that you try to write in the same thread as you receive, which means that your recv() can only be called again after copy is complete. A few observations:
Move the writing to a different thread. This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. Simply "write()" to a file descriptor might be better, but that's performance measurements that are up to you
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even if decoupling receiving from writing thread-wise, then there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.
Let's do a bit of math.
Your samples are (apparently) of type std::complex<std::float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100-100 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
uint32_t nItems = std::end(rxBuffer)-std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
I'd also note that as it stands right now, your code has an even more serious problem: since it hasn't specified a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output will add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win--at 10 MS/s, you're probably going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you some data to copy data to the cache, then somewhat later copy it out to the disk. Worse, it's likely to pollute the cache with all this data, so it's no longer storing other data that's a lot more likely to benefit from caching.

c++ Alternative implementation to avoid shifting between RAM and SWAP memory

I have a program, that uses dynamic programming to calculate some information. The problem is, that theoretically the used memory grows exponentially. Some filters that I use limit this space, but for a big input they also can't avoid that my program runs out of RAM - Memory.
The program is running on 4 threads. When I run it with a really big input I noticed, that at some point the program starts to use the swap memory, because my RAM is not big enough. The consequence of this is, that my CPU-usage decreases from about 380% to 15% or lower.
There is only one variable that uses the memory which is the following datastructure:
Edit (added type) with CLN library:
class My_Map {
typedef std::pair<double,short> key;
typedef cln::cl_I value;
public:
tbb::concurrent_hash_map<key,value>* map;
My_Map() { map = new tbb::concurrent_hash_map<myType>(); }
~My_Map() { delete map; }
//some functions for operations on the map
};
In my main program I am using this datastructure as globale variable:
My_Map* container = new My_Map();
Question:
Is there a way to avoid the shifting of memory between SWAP and RAM? I thought pushing all the memory to the Heap would help, but it seems not to. So I don't know if it is possible to maybe fully use the swap memory or something else. Just this shifting of memory cost much time. The CPU usage decreases dramatically.
If you have 1 Gig of RAM and you have a program that uses up 2 Gb RAM, then you're going to have to find somewhere else to store the excess data.. obviously. The default OS way is to swap but the alternative is to manage your own 'swapping' by using a memory-mapped file.
You open a file and allocate a virtual memory block in it, then you bring pages of the file into RAM to work on. The OS manages this for you for the most part, but you should think about your memory usage so not to try to keep access to the same blocks while they're in memory if you can.
On Windows you use CreateFileMapping(), on Linux you use mmap(), on Mac you use mmap().
The OS is working properly - it doesn't distinguish between stack and heap when swapping - it pages you whatever you seem not to be using and loads whatever you ask for.
There are a few things you could try:
consider whether myType can be made smaller - e.g. using int8_t or even width-appropriate bitfields instead of int, using pointers to pooled strings instead of worst-case-length character arrays, use offsets into arrays where they're smaller than pointers etc.. If you show us the type maybe we can suggest things.
think about your paging - if you have many objects on one memory page (likely 4k) they will need to stay in memory if any one of them is being used, so try to get objects that will be used around the same time onto the same memory page - this may involve hashing to small arrays of related myType objects, or even moving all your data into a packed array if possible (binary searching can be pretty quick anyway). Naively used hash tables tend to flay memory because similar objects are put in completely unrelated buckets.
serialisation/deserialisation with compression is a possibility: instead of letting the OS swap out full myType memory, you may be able to proactively serialise them into a more compact form then deserialise them only when needed
consider whether you need to process all the data simultaneously... if you can batch up the work in such a way that you get all "group A" out of the way using less memory then you can move on to "group B"
UPDATE now you've posted your actual data types...
Sadly, using short might not help much because sizeof key needs to be 16 anyway for alignment of the double; if you don't need the precision, you could consider float? Another option would be to create an array of separate maps...
tbb::concurrent_hash_map<double,value> map[65536];
You can then index to map[my_short][my_double]. It could be better or worse, but is easy to try so you might as well benchmark....
For cl_I a 2-minute dig suggests the data's stored in a union - presumably word is used for small values and one of the pointers when necessary... that looks like a pretty good design - hard to improve on.
If numbers tend to repeat a lot (a big if) you could experiment with e.g. keeping a registry of big cl_Is with a bi-directional mapping to packed integer ids which you'd store in My_Map::map - fussy though. To explain, say you get 987123498723489 - you push_back it on a vector<cl_I>, then in a hash_map<cl_I, int> set [987123498723489 to that index (i.e. vector.size() - 1). Keep going as new numbers are encountered. You can always map from an int id back to a cl_I using direct indexing in the vector, and the other way is an O(1) amortised hash table lookup.

Variable Length Array Performance Implications (C/C++)

I'm writing a fairly straightforward function that sends an array over to a file descriptor. However, in order to send the data, I need to append a one byte header.
Here is a simplified version of what I'm doing and it seems to work:
void SendData(uint8_t* buffer, size_t length) {
uint8_t buffer_to_send[length + 1];
buffer_to_send[0] = MY_SPECIAL_BYTE;
memcpy(buffer_to_send + 1, buffer, length);
// more code to send the buffer_to_send goes here...
}
Like I said, the code seems to work fine, however, I've recently gotten into the habit of using the Google C++ style guide since my current project has no set style guide for it (I'm actually the only software engineer on my project and I wanted to use something that's used in industry). I ran Google's cpplint.py and it caught the line where I am creating buffer_to_send and threw some comment about not using variable length arrays. Specifically, here's what Google's C++ style guide has to say about variable length arrays...
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Variable-Length_Arrays_and_alloca__
Based on their comments, it appears I may have found the root cause of seemingly random crashes in my code (which occur very infrequently, but are nonetheless annoying). However, I'm a bit torn as to how to fix it.
Here are my proposed solutions:
Make buffer_to_send essentially a fixed length array of a constant length. The problem that I can think of here is that I have to make the buffer as big as the theoretically largest buffer I'd want to send. In the average case, the buffers are much smaller, and I'd be wasting about 0.5KB doing so each time the function is called. Note that the program must run on an embedded system, and while I'm not necessarily counting each byte, I'd like to use as little memory as possible.
Use new and delete or malloc/free to dynamically allocate the buffer. The issue here is that the function is called frequently and there would be some overhead in terms of constantly asking the OS for memory and then releasing it.
Use two successive calls to write() in order to pass the data to the file descriptor. That is, the first write would pass only the one byte, and the next would send the rest of the buffer. While seemingly straightforward, I would need to research the code a bit more (note that I got this code handed down from a previous engineer who has since left the company I work for) in order to guarantee that the two successive writes occur atomically. Also, if this requires locking, then it essentially becomes more complex and has more performance impact than case #2.
Note that I cannot make the buffer_to_send a member variable or scope it outside the function since there are (potentially) multiple calls to the function at any given time from various threads.
Please let me know your opinion and what my preferred approach should be. Thanks for your time.
You can fold the two successive calls to write() in your option 3 into a single call using writev().
http://pubs.opengroup.org/onlinepubs/009696799/functions/writev.html
I would choose option 1. If you know the maximum length of your data, then allocate that much space (plus one byte) on the stack using a fixed size array. This is no worse than the variable length array you have shown because you must always have enough space left on the stack otherwise you simply won't be able to handle your maximum length (at worst, your code would randomly crash on larger buffer sizes). At the time this function is called, nothing else will be using the further space on your stack so it will be safe to allocate a fixed size array.

using MPI: C++ std::bad_alloc

i'm working with supercomputer, using MPI.
but problem in.. C++
have a program, which open file with data and read it into vector<long>v1
//open file
...
vector<long>v1;
while (!f1.eof()){
//input data into
v1.push_back(s1);
}
okey, when file of data contains only 50 millions of "long-numbers", it worked perfect.
but when file of data contains over 75 millions of "long-numbers", it failed with exception:
std::bad_alloc();
how to improve this?
besides, use many processors ( over 100 )
Don't use a vector for this. A vector requires all its elements to fit in consecutive memory locations and it isn't suitable for very large collections. The right data structure to use depends on your access patterns, list will work, but it will waste a lot of memory (two pointers for each long you store). Perhaps you want to break the longs into groups of 100 or so and make a linked list of those groups. Again, the right answer depends on your actual outer problem.
While I don't have much experience with supercomputers (at all), I can tell you that std::bad_alloc should only occur when you run out of system resources.
Chances are in this case that you have reached the limit the computer is imposing on your heap (either from an operating system perspective or a physical perspective {kind of the same thing in the end}) since your vector will be dynamically allocating elements on the heap.
You can try using top or a similar command to monitor your resource usage, and check your system settings against what you're actually using.
Another note - you should create your vector and call reserve() if you know how many elements it will roughly be using - it will greatly improve your efficiency.