Allocating ~10GB of vectors - how can I speed it up?


I'm loading about ~1000 files, each representing an array of ~3 million floats. I need to have them all in memory together as I need to do some calculations that involve all of them.
In the code below, I've broken out the memory allocation and file reading, so I can observe the speed of each separately. I was a bit surprised to find the memory allocation taking much longer than the file reading.
std::vector<std::vector<float>*> v(matrix_count);
for (int i = 0; i < matrix_count; i++) {
    v[i] = new std::vector<float>(array_size);  // value-initializes (zeroes) every element
}
for (int i = 0; i < matrix_count; i++) {
    std::ifstream is(files[i], std::ios::binary);  // binary mode for raw float data
    is.read((char*) &((*v[i])[0]), size);
    is.close();
}
Measuring the time, the allocating loop took 6.8s while file loading took 2.5s. It seems counter-intuitive that reading from the disk is almost 3x faster than just allocating space for it.
Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_alloc -- I guess a 10GB vector isn't ok.

Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_alloc -- I guess a 10GB vector isn't ok.
I mainly wanted to respond by addressing this one part: bad_alloc exceptions tend to be misunderstood. They're not necessarily the result of "running out of memory"; they can also be the result of the system failing to find a contiguous block of unused pages. You could have more than enough memory available and still get a bad_alloc if you get in the habit of trying to allocate massive blocks of contiguous memory, simply because the system can't find a contiguous set of pages that are free. You can't necessarily avoid bad_alloc by "making sure plenty of memory is free": as you might have already seen, having over 100 gigabytes of RAM can still leave you vulnerable when trying to allocate a mere 10 GB block. The way to avoid them is to allocate memory in smaller chunks instead of one epic array. At a large enough scale, structures like unrolled lists can start to offer favorable performance over a gigantic array, and a much lower probability of ever getting a bad_alloc exception unless you actually do exhaust all the memory available. There is a point beyond which contiguity and the locality of reference it provides cease to be beneficial and may actually hinder memory performance (mainly due to paging, not caching).
For the kind of epic scale input you're handling, you might actually get better performance out of std::deque given the page-friendly nature of it (it's one of the few times where deque can really shine without need for push_front vs. vector). It's something to potentially try if you don't need perfect contiguity.
Naturally it's best if you measure this with an actual profiler. It'll help us home in on the actual problem, though it might not be completely shocking (surprising but maybe not shocking) that you are bottlenecked by memory here instead of disk IO, given the kind of "massive number of massive blocks" you're allocating (disk IO is slow, but heap allocation can sometimes be expensive if you are really stressing the system). It depends a lot on the system's allocation strategy, but even slab or buddy allocators can fall back to a much slower code branch if you allocate such epic blocks of memory en masse, and allocations may even start to require something resembling a search or more access to secondary storage in those extreme cases. This paragraph is purely conjecture: I'm not sure exactly what goes on under the hood when allocating so many massive blocks. I have "felt" and measured these kinds of bottlenecks before, but never quite figured out what the OS was doing.
Here it's kind of counter-intuitive but you can often get better performance allocating a larger number of smaller blocks. Typically that makes things worse, but if we're talking about 3 million floats per memory block and a thousand memory blocks like it, it might help to start allocating in, say, page-friendly 4k chunks. Typically it's cheaper to pre-allocate memory in large blocks in advance and pool it, but "large" in this case is more like 4 kilobyte blocks, not 10 gigabyte blocks.
std::deque will typically do this kind of thing for you, so it might be the quickest thing to try out to see if it helps. With std::deque, you should be able to make a single one for all 10 GB worth of contents without splitting it into smaller ones to avoid bad_alloc. It also doesn't have the zero-initialization overhead of the entire contents that some cited, and push_backs to it are constant-time even in the worst-case scenario (not amortized constant time as with std::vector), so I would try std::deque with actual push_back calls instead of pre-sizing it and using operator[]. You could read the file contents in small chunks at a time (e.g., using 4 KB buffers) and just push back the floats. It's something to try anyway.
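A minimal sketch of that chunked read, assuming the files are raw binary floats as in the question. The function name append_floats and the 4 KB buffer size are illustrative choices, not something from the original code; the range insert at the end is equivalent to pushing back each float in turn.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <fstream>
#include <string>

// Appends every float stored in a raw binary file to `out`, reading through
// a small fixed-size buffer instead of one huge pre-sized allocation.
// Returns the number of floats appended (0 if the file cannot be opened).
std::size_t append_floats(const std::string& path, std::deque<float>& out) {
    constexpr std::size_t kFloatsPerChunk = 4096 / sizeof(float);  // 4 KB buffer
    float buffer[kFloatsPerChunk];

    std::ifstream in(path, std::ios::binary);
    std::size_t total = 0;
    while (in) {
        in.read(reinterpret_cast<char*>(buffer), sizeof(buffer));
        // gcount() reports the bytes actually read, which may be short at EOF.
        const std::size_t got =
            static_cast<std::size_t>(in.gcount()) / sizeof(float);
        out.insert(out.end(), buffer, buffer + got);
        total += got;
    }
    return total;
}
```

Calling append_floats once per file on the same deque keeps everything in one container. Whether this actually beats the pre-sized vectors is something only a measurement on the target machine can tell.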
Anyway, these are all just educated guesses without code and profiling measurements, but these are some things to try out after your measurements.
Memory-mapped files (MMFs) may also be the ideal solution for this scenario. Let the OS handle all the tricky details of what it takes to access the file's contents.

Use multiple threads for both memory allocation and reading files. You can create a set of say 15 threads and let each thread pick up the next available job.
When you dig deeper, you will see that opening the file also has a considerable overhead which gets reduced substantially by using multiple threads.

You don't need to hold all the data in memory. Instead, you could use something like a virtual vector that loads the required data when needed. That approach saves memory and avoids the side effects of huge memory allocations.

Related

How to allocate a large dynamic array in C++?

So I am currently trying to allocate dynamically a large array of elements in C++ (using "new"). Obviously, when "large" becomes too large (>4GB), my program crashes with a "bad_alloc" exception because it can't find such a large chunk of memory available.
I could allocate each element of my array separately and then store the pointers to these elements in a separate array. However, time is critical in my application so I would like to avoid as much cache misses as I can. I could also group some of these elements into blocks but what would be the best size for such a block?
My question is then: what is the best way (timewise) to dynamically allocate a large array of elements such that the elements do not have to be stored contiguously but must be accessible by index (using [])? This array is never going to be resized, and no elements are going to be inserted into or deleted from it.
I thought I could use std::deque for this purpose, knowing that the elements of an std::deque might or might not be stored contiguously in memory but I read there are concerns about the extra memory this container takes?
Thank you for your help on this!

If your problem is such that you actually run out of memory, allocating fairly small blocks (as is done by deque) is not going to help; the overhead of tracking the allocations will only make the situation worse. You need to re-think your implementation such that you can deal with the data in blocks that will still fit in memory. For such problems, if using x86 or x64 based hardware, I would suggest blocks of at least 2 megabytes (the large page size).

Obviously, when "large" becomes too large (>4GB), my program crashes
with a "bad_alloc" exception because it can't find such a large chunk
of memory available.
You should be using a 64-bit CPU and OS at this point; allocating a huge contiguous chunk of memory should not be a problem unless you are actually running out of memory. It is possible that you are building a 32-bit program. In that case you won't be able to allocate more than 4 GB. You should build a 64-bit application.
If you want something better than plain operator new, then your question is OS-specific. Look at API provided by your OS: on POSIX system you should look for mmap and for VirtualAlloc on Windows.
There are multiple problems with large allocations:
For security reasons OS kernel never gives you pages filled with garbage values, instead all new memory will be zero initialized. This means you don't have to initialize that memory as long as zeroes are exactly what you want.
OS gives you real memory lazily on first access. If you are processing large array, you might waste a lot of time taking page faults. To avoid this you can use MAP_POPULATE on Linux. On Windows you can try PrefetchVirtualMemory (but I am not sure if it can do the job). This should make init allocation slower, but should decrease total time spent in kernel.
Working with large chunks of memory wastes slots in the Translation Lookaside Buffer (TLB). Depending on your memory access pattern, this can cause a noticeable slowdown. To avoid this you can try using large pages (mmap with MAP_HUGETLB, MAP_HUGE_2MB, MAP_HUGE_1GB on Linux, VirtualAlloc with MEM_LARGE_PAGES on Windows). Using large pages is not easy, as they are usually not available by default. They also cannot be swapped out (always "locked in memory"), so using them requires privileges.
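As a rough illustration of the first two points on a POSIX system, here is a sketch that asks the kernel for zero-filled anonymous memory and optionally pre-faults it. The helper name map_zeroed is made up for this example, and MAP_POPULATE is only applied where the platform headers define it (it is Linux-specific).

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Maps `bytes` of zero-filled anonymous memory straight from the kernel.
// Returns nullptr on failure. Release the block with munmap(ptr, bytes).
void* map_zeroed(std::size_t bytes, bool prefault) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    if (prefault) flags |= MAP_POPULATE;  // fault the pages in up front
#else
    (void)prefault;  // no pre-faulting flag available on this platform
#endif
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}
```

With prefault set to false the pages are only backed by physical memory on first touch, so the call returns quickly and the cost shows up later as page faults. With it set to true the cost moves into the mmap call itself.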
If you don't want to use OS-specific functions, the best you can find in C++ is std::calloc. Unlike std::malloc or operator new, it returns zero-initialized memory, so you can probably avoid wasting time initializing that memory. Other than that, there is nothing special about that function. But this is the closest you can get while staying within standard C++.
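A small sketch of the std::calloc route, wrapped so the memory is released automatically. The names FreeDeleter, ZeroedFloats and make_zeroed_floats are illustrative, and the code assumes the usual IEEE 754 representation in which all-zero bytes read back as 0.0f.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Deleter so that unique_ptr releases calloc'd memory with std::free.
struct FreeDeleter {
    void operator()(void* p) const { std::free(p); }
};

using ZeroedFloats = std::unique_ptr<float[], FreeDeleter>;

// Returns `count` floats whose bytes are all zero, or an empty pointer if
// the allocation fails. Nothing is written by this code itself.
ZeroedFloats make_zeroed_floats(std::size_t count) {
    return ZeroedFloats(
        static_cast<float*>(std::calloc(count, sizeof(float))));
}
```

Unlike std::vector<float>(count), this does not loop over the elements to initialize them; whether the C library can hand back untouched zero pages for a large request is up to the implementation.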
There are no standard containers designed to handle large allocations; moreover, all standard containers are really bad at handling those situations.
Some OSes (like Linux) overcommit memory, others (like Windows) do not. Windows might refuse to give you memory if it knows it won't be able to satisfy your request later. To avoid this you might want to increase your page file. Windows needs to reserve that space on disk beforehand, but that does not mean it will use it (start swapping). As actual memory is given to programs lazily, there might be a lot of memory reserved for applications that will never actually be given to them.
If increasing page file is too inconvenient, you can try creating large file and map it into memory. That file will serve as a "page file" for your memory. See CreateFileMapping and MapViewOfFile.

The answer to this question is extremely application and platform dependent. These days, if you just need a small integer factor greater than 4GB, you use a 64-bit machine, if possible. Sometimes reducing the size of the element in the array is possible as well (e.g., using 16-bit fixed-point or half-float instead of 32-bit float).
Beyond this, you are either looking at sparse arrays or out-of-core techniques. Sparse arrays are used when you are not actually storing elements at all locations in the array. There are many possible implementations and which is best depends on both the distribution of the data and the access pattern of the algorithm. See Eigen for example.
Out-of-core involves explicitly reading and writing parts of the array to/from disk. This used to be fairly common, but people work pretty hard to avoid doing this now. Applications that really require such are often built on top of a database or similar to handle the data management. In scientific computing, one ends up needing to distribute the compute as well as the data storage so there's a lot of complexity around that as well. For important problems the entire design may be driven by having good locality of reference.
Any sparse data structure will have overhead in how much space it takes. This can be fairly low, but it means you have to be careful if you actually have a dense array and are simply looking to avoid memory fragmentation.
If your problem can be broken into smaller pieces that only access part of the array at a time, and the main issue is memory fragmentation making it hard to allocate one large block, then breaking the array into pieces, effectively adding an outer vector of pointers, is a good bet. If you have random access to an array larger than 4 gigabytes and no way to localize the accesses, 64-bit is the way to go.
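The "outer vector of pointers" idea could look roughly like this for a fixed-size array. BlockedArray is a hypothetical name, and the default block size of 2 MiB simply follows the large-page suggestion made in another answer here; the right value would need measuring.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Fixed-size array stored as equally sized blocks. Elements are contiguous
// only within a block, but operator[] still provides random access.
template <typename T, std::size_t BlockElems = (2u << 20) / sizeof(T)>
class BlockedArray {
public:
    explicit BlockedArray(std::size_t size) : size_(size) {
        const std::size_t blocks = (size + BlockElems - 1) / BlockElems;
        blocks_.reserve(blocks);
        for (std::size_t b = 0; b < blocks; ++b) {
            // make_unique<T[]> value-initializes, so numbers start at zero.
            blocks_.push_back(std::make_unique<T[]>(BlockElems));
        }
    }

    T& operator[](std::size_t i) {
        assert(i < size_);
        return blocks_[i / BlockElems][i % BlockElems];
    }
    const T& operator[](std::size_t i) const {
        assert(i < size_);
        return blocks_[i / BlockElems][i % BlockElems];
    }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    std::vector<std::unique_ptr<T[]>> blocks_;
};
```

Each lookup costs one extra division and one extra pointer hop compared with a flat array. With a power-of-two block size the compiler can usually turn the division and modulo into a shift and a mask.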

Depending on what you need the memory for and your speed concerns, and if you're using Linux, you can always try using mmap and simulate a sort of swap. It might be slower, but you can map very large sizes. See Mmap() an entire large file

Is there a downside to a significant overestimation in a reserve()?

Let's suppose we have a method that creates and uses possibly very big vector<foo>s.
The maximum number of elements is known to be maxElems.
Standard practice as of C++11 is to my best knowledge:
vector<foo> fooVec;
fooVec.reserve(maxElems);
//... fill fooVec using emplace_back() / push_back()
But what happens if we have a scenario where the number of elements is going to be significantly less in the majority of calls to our method?
Is there any disadvantage to the conservative reserve call other than the excess allocated memory (which supposedly can be freed with shrink_to_fit() if necessary)?

Summary
There is likely to be some downside to using a too-large reserve, but how much depends both on the size and context of your reserve() as well as on your specific allocator, operating system and their configuration.
As you are probably aware, on platforms like Windows and Linux, large allocations are generally not allocating any physical memory or page table entries until it is first accessed, so you might imagine large, unused allocations to be "free". Sometimes this is called "reserving" memory without "committing" it, and I'll use those terms here.
Here are some reasons this might not be as free as you'd imagine:
Page Granularity
The lazy commit described above only happens at page granularity. If you are using (typical) 4096 byte pages, it means that if you reserve 4,000 bytes for a vector that will usually contain elements taking up 100 bytes, the lazy commit buys you nothing! The whole page of 4096 bytes has to be committed anyway, so you don't save physical memory. So it isn't just the ratio between the expected and reserved size that matters, but also the absolute size of the reservation that determines how much waste you'll see.
Keep in mind that many systems are now using "huge pages" transparently, so in some cases the granularity will be on the order of 2 MB or more. In that case you need allocations on the order of 10s or 100s of MB to really take advantage of the lazy allocation strategy.
Worse Allocation Performance
Memory allocators for C++ generally try to allocate large chunks of memory (e.g., via sbrk or mmap on Unix-like platforms) and then efficiently carve that up into the small chunks the application is requesting. Getting these large chunks of memory via say a system call like mmap may be several orders of magnitude slower than the fast path allocation within the allocator which is often only a dozen instructions or so. When you ask for large chunks that you mostly won't use, you defeat that optimization and you'll often be going down the slow path.
As a concrete example, let's say your allocator asks mmap for chunks of 128 KB which it carves up to satisfy allocations. You are allocating about 2K of stuff in a typical vector, but reserve 64K. You'll now pay a mmap call for every other reserve call, but if you just asked for the 2K you ultimately needed, you'd have about 32 times fewer mmap calls.
Dependence on Overcommit Handling
When you ask for a lot of memory and don't use it, you can get into the situation where you've asked for more memory than your system supports (e.g., more than your RAM + swap). Whether this is even allowed depends on your OS and how it is configured, and no matter what, you are in for some interesting behavior if you subsequently commit more memory simply by writing to it. It means that arbitrary processes may be killed, or you might get unexpected errors on any memory write. What works on one system may fail on another due to different overcommit tunables.
Finally, it makes managing your process a bit harder since the "VM size" metric as reported by monitoring tools won't have much relationship to what your process may ultimately commit.
Worse Locality
Allocating more memory than you need makes it likely that your working set will be more sparsely spread out in the virtual address space. The overall effect is a reduction in locality of reference. For very small allocations (e.g., a few dozen bytes) this may reduce the within-same-cache-line locality, but for larger sizes the main effect is likely to be to spread your data onto more physical pages, increasing TLB pressure. The exact thresholds will depend a lot on details like whether hugepages are enabled.

What you cite as standard C++11 practice is hardly standard, and probably not even good practice.
These days I'd be inclined to discourage the use of reserve, and let your platform (i.e. the C++ standard library optimised to your platform) deal with the reallocation as it sees fit.
That said, calling reserve with an excessive amount may well also effectively be benign due to modern operating systems only giving you the memory if you actually use it (Linux is particularly good at that). But relying on this could cause you trouble if you port to a different operating system, whereas simply omitting reserve is less likely to.

You have 2 options:
You don't call reserve and let the default implementation of the vector figure out the size, which uses exponential growth.
Or
You call reserve(maxElems) and shrink_to_fit() afterwards.
The first option is less likely to give you a std::bad_alloc (even though modern OS's probably will never throw this if you don't touch the last block of the reserved memory)
The second option is less likely to trigger multiple reallocations: it will most likely make 2 calls, the reserve and the shrink_to_fit() (which might be a no-op depending on the implementation, since it's non-binding), while the first option might make significantly more. Fewer calls = better performance.
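For reference, a minimal sketch of the reserve-then-shrink pattern. The function name build and its two parameters are placeholders for whatever actually produces the elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reserves the known upper bound, fills in fewer elements, then asks the
// library to give back the unused capacity (a non-binding request).
std::vector<int> build(std::size_t max_elems, std::size_t actual_elems) {
    assert(actual_elems <= max_elems);
    std::vector<int> v;
    v.reserve(max_elems);  // one allocation up front
    const int* before = v.data();
    for (std::size_t i = 0; i < actual_elems; ++i) {
        v.push_back(static_cast<int>(i));
    }
    assert(v.data() == before);  // filling never reallocated
    v.shrink_to_fit();           // may reallocate and copy, or do nothing
    return v;
}
```

Note that shrink_to_fit, when it does something, typically allocates a smaller block and moves the elements over, so it is one more allocation plus a copy rather than a free trim.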

If you are on linux reserve will call malloc which only allocates virtual memory, but not physical. Physical memory will be used when you actually insert elements to a vector. That's why you can considerably overestimate reserve size.
If you can estimate maximum vector size you can reserve it just once on start to avoid reallocations and no physical memory will be wasted.

But what happens if we have a scenario where the number of elements is going to be significantly less in the majority of calls to our method?
The allocated memory simply remains unused.
Is there a downside to a significant overestimation in a reserve()?
Yes, at least a potential downside: The memory that was allocated for the vector can not be used for other objects.
This is especially problematic in embedded systems that do not usually have virtual memory, and little physical memory to spare.
Concerning programs running inside an operating system, if the operating system does not "over commit" the memory, then this can still easily cause the virtual memory allocation of the program to reach the limit given to the process.
Even in over committing system, particularly gratuitous overestimation can in theory result in exhaustion of virtual address space. But you need pretty big numbers to achieve that on 64 bit architectures.
Is there any disadvantage to the conservative reserve call other than the excess allocated memory (which supposedly can be freed with shrink_to_fit() if necessary)?
Well, this is slower than initially allocating exactly correct amount of memory, but the difference might be marginal.

Most efficient way to grow array C++

Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and its size increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well

If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that amortized it's O(1) for the append.
Unless you are running on something very weird the OS memory allocation should take care of most fragmentation issues and the fact that it's hard to find a 800MB free memory block.

As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
Yet I tend to avoid this solution for the most part at these kinds of input scales, but for simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, when using a debug build, the memory is initialized in its entirety early, and we end up mapping the allocated pages to DRAM immediately on reserving capacity for such a vector. Then we might end up leading to a slight startup delay and a task manager showing 800 megabytes of memory usage for a debug build even before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together by either linking them (forming an unrolled list) or storing a vector of pointers to them in a separate aggregate depending on whether random-access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales which often have times dominated by the upper levels of the memory hierarchy rather than register and instruction level.
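A bare-bones sketch of the random-access variant, using a vector of pointers to 512-double chunks. The class name ChunkedDoubles is made up, and it keeps the chunk metadata in the outer vector rather than inside each chunk, so the full 512 doubles fit.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t kChunkDoubles = 512;  // 512 * 8 bytes = 4 KB per chunk

// Append-only sequence of doubles stored in fixed-size chunks. Growing it
// never moves existing elements; only the small pointer vector reallocates.
class ChunkedDoubles {
public:
    void push_back(double x) {
        if (size_ % kChunkDoubles == 0) {  // no chunk yet, or the last is full
            chunks_.push_back(std::make_unique<double[]>(kChunkDoubles));
        }
        chunks_.back()[size_ % kChunkDoubles] = x;
        ++size_;
    }

    double& operator[](std::size_t i) {
        assert(i < size_);
        return chunks_[i / kChunkDoubles][i % kChunkDoubles];
    }

    std::size_t size() const { return size_; }
    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::vector<std::unique_ptr<double[]>> chunks_;
    std::size_t size_ = 0;
};
```

Memory use here tracks the number of elements actually stored, rounded up to one chunk, in debug and release builds alike, which is the property the chunked approach is after.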
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.

The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.

How does a memory leak improve performance

I'm building a large RTree (spatial index) full of nodes. It needs to be able to handle many queries AND updates. Objects are continuously being created and destroyed. The basic test I'm running is to see the performance of the tree as the number of objects in the tree increases. I insert from 100-20000 uniformly size, randomly located objects in increments of 100. Searching and updating are irrelevant to the issue I am currently faced with.
Now, when there is NO memory leak the "insert into tree" performance is everywhere. It goes anywhere from 10.5 seconds with ~15000 objects to 1.5 with ~18000. There is no pattern whatsoever.
When I deliberately add in a leak, as simple as putting in "new int;" on a line by itself without assigning the result to anything, the performance instantly falls onto a nice gentle curve sloping from (roughly) 0 seconds for 100 objects to 1.5 for the full 20k.
Very, very lost at this point. If you want source code I can include it but it's huuugggeeee and literally the only line that makes a difference is "new int;"
Thanks in advance!
-nick

I'm not sure how you came up with this new int test, but it's not a very good way to fix things :) Run your code using a profiler and find out where the real delays are. Then concentrate on fixing the hot spots.
g++ has it built in - just compile with -pg

Without more information it's impossible to be sure.
However, I wonder if this is to do with heap fragmentation. By creating and freeing many blocks of memory you'll likely be creating a whole load of small fragments of memory linked together. The memory manager needs to keep track of them all so it can allocate them again if needed.
Some memory managers, when you free a block, try to "merge" it with surrounding blocks of memory, and on a highly fragmented heap this can be very slow as it tries to find the surrounding blocks. Not only this, but if you have limited physical memory it can "touch" many physical pages of memory as it follows the chain of memory blocks, which can cause a whole load of extremely slow page faults that will be very variable in speed depending on exactly how much physical memory the OS decides to give that process.
By leaving some un-freed memory you will be changing this pattern of access, which might make a large difference to the speed. You might, for example, be forcing the run time library to allocate a new block of memory each time rather than having to track down a suitably sized existing block to reuse.
I have no evidence this is the case in your program, but I do know that memory fragmentation is often the cause of slow programs when a lot of new and free is performed.

The possible thing that is happening which explains this (a theory)
The compiler did not remove the empty new int
The new int is in one of the inner loops or somewhere in your recursive traversal where it gets executed most often
The overall RSS of the process increases and eventually the total memory being used by the process
There are page faults happening because of this
Because of the page-faults, the process becomes I/O bound instead of being CPU bound
End result, you see a drop in the throughput. It will help if you can mention the compiler being used and the options for the compiler that you are using to build the code.

I am taking a stab in the dark here, but the problem could be the way the heap gets fragmented. You said that you are creating and destroying large numbers of objects. I will assume that the objects are all of different sizes.
When one allocates memory on the heap, a cell the size needed is broken off from the heap. When the memory is freed, the cell is added to a freelist. When one does a new alloc, the allocator walks the heap until a cell that is big enough is found. When doing large numbers of allocations, the free list can get rather long and walking the list can take a non-trivial amount of time.
Now an int is rather small. So when you do your new int, it may well eat up all the small heap cells on the free list and thus dramatically speed up larger allocations.
The chances are, however, that you are allocating and freeing similarly sized objects. If you use your own freelists, you will save yourself many heap walks and may dramatically improve performance. This is exactly what the STL allocators do to improve performance.
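A per-type free list can be sketched in a few lines. FreeListPool is a hypothetical name and this is deliberately minimal: it hands out raw storage for one T at a time, is not thread-safe, and only releases slots that have been returned to it by the time it is destroyed.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Pool for objects of one type: freed slots are pushed onto an intrusive
// free list and reused before the general-purpose heap is asked again.
template <typename T>
class FreeListPool {
    union Slot {
        Slot* next;                                   // used while the slot is free
        alignas(T) unsigned char storage[sizeof(T)];  // used while it holds a T
    };

public:
    FreeListPool() = default;
    FreeListPool(const FreeListPool&) = delete;
    FreeListPool& operator=(const FreeListPool&) = delete;

    ~FreeListPool() {
        while (free_) {  // release every slot currently on the free list
            Slot* next = free_->next;
            delete free_;
            free_ = next;
        }
    }

    // Returns storage for one T; construct the object with placement new.
    void* allocate() {
        if (free_) {  // fast path: pop the most recently freed slot
            Slot* s = free_;
            free_ = s->next;
            return s;
        }
        ++fresh_;
        return new Slot;  // slow path: go to the heap
    }

    // Takes back storage whose T has already been destroyed.
    void deallocate(void* p) {
        Slot* s = static_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
    }

    std::size_t fresh_allocations() const { return fresh_; }

private:
    Slot* free_ = nullptr;
    std::size_t fresh_ = 0;
};
```

Because every slot has the same size, allocation and deallocation are a couple of pointer moves with no searching, which is the saving described above for workloads that churn through many similarly sized nodes.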

Solution: Do not run from Visual Studio. Actually run the .exe file. Figured this out because that's what the profilers were doing and the numbers were magically dropping. Checked memory usage and version running (and giving me EXCEPTIONAL times) was not blowing up to excessively huge sizes.
Solution to why the hell Visual Studio does ridiculous crap like this: No clue.

Faster to malloc multiple small times or few large times?

When using malloc to allocate memory, is it generally quicker to do multiple mallocs of smaller chunks of data or fewer mallocs of larger chunks of data? For example, say you are working with an image file that has black pixels and white pixels. You are iterating through the pixels and want to save the x and y position of each black pixel in a new structure that also has a pointer to the next and previous pixels x and y values. Would it be generally faster to iterate through the pixels allocating a new structure for each black pixel's x and y values with the pointers, or would it be faster to get a count of the number of black pixels by iterating through once, then allocating a large chunk of memory using a structure containing just the x and y values, but no pointers, then iterating through again, saving the x and y values into that array? I'm assuming certain platforms might be different than others as to which is faster, but what does everyone think would generally be faster?

It depends:
Multiple small times means multiple times, which is slower
There may be a special/fast implementation for small allocations.
If I cared, I'd measure it! If I really cared a lot, and couldn't guess, then I might implement both, and measure at run-time on the target machine, and adapt accordingly.
In general I'd assume that fewer is better, but there are size and run-time library implementations such that a (sufficiently) large allocation will be delegated to the (relatively slow) O/S, whereas a (sufficiently) small allocation will be served from a (relatively quick) already-allocated heap.

Allocating large blocks is more efficient; additionally, since you are using larger contiguous blocks, you have greater locality of reference, and traversing your in-memory structure once you've generated it should also be more efficient! Further, allocating large blocks should help to reduce memory fragmentation.

Generally speaking, allocating larger chunks of memory fewer times will be faster. There's overhead involved each time a call to malloc() is made.

Apart from speed issues, there is also the memory fragmentation problem.

Allocating memory is work. The amount of work done when allocating a block of memory is typically independent of the size of the block. You work it out from here.

It's faster not to allocate in performance-sensitive code at all. Allocate the memory you're going to need once in advance, and then use and reuse that as much as you like.
Memory allocation is a relatively slow operation in general, so don't do it more often than necessary.
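One way to picture "allocate once, then reuse" is a scratch buffer that is reserved up front and cleared between batches. The sketch below assumes a std::vector as the buffer; count_reallocations is a hypothetical helper that reports how often the buffer's storage moved, which should be zero if the reservation is large enough.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill and clear one scratch buffer `rounds` times and return how many
// times its underlying storage changed after the initial reservation.
std::size_t count_reallocations(std::size_t rounds, std::size_t batch_size) {
    std::vector<int> scratch;
    scratch.reserve(batch_size);               // the only allocation we expect
    assert(scratch.capacity() >= batch_size);

    const int* storage = scratch.data();
    std::size_t moves = 0;
    for (std::size_t r = 0; r < rounds; ++r) {
        scratch.clear();                       // drops the elements, keeps the capacity
        for (std::size_t i = 0; i < batch_size; ++i) {
            scratch.push_back(static_cast<int>(i));
        }
        if (scratch.data() != storage) {       // storage moved, so a reallocation happened
            ++moves;
            storage = scratch.data();
        }
    }
    return moves;
}
```

Because clear() leaves the capacity alone, the inner loop never has to go back to the allocator.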

In general malloc is expensive. It has to find an appropriate memory chunk from which to allocate memory and keep track of non-contiguous memory blocks. In several libraries you will find small memory allocators that try to minimize the impact by allocating a large block and managing the memory in the allocator.
Alexandrescu deals with the problem in 'Modern C++ Design' and in the Loki library, if you want to take a look at one such library.
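To give an idea of what "allocating a large block and managing the memory in the allocator" can look like, here is a deliberately tiny fixed-capacity pool. It is not Loki's allocator; FixedPool, Node and build_chain are hypothetical names, slots are never returned individually, and the sketch is limited to trivially destructible types.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <type_traits>

// Hypothetical pool: one malloc up front, then slots are handed out in order.
template <typename T>
class FixedPool {
    static_assert(std::is_trivially_destructible<T>::value,
                  "this sketch never runs destructors");
    static_assert(alignof(T) <= alignof(std::max_align_t),
                  "malloc alignment must be sufficient for T");
public:
    explicit FixedPool(std::size_t capacity)
        : bytes_(static_cast<unsigned char*>(std::malloc(capacity * sizeof(T)))),
          capacity_(capacity), used_(0) {
        if (bytes_ == nullptr && capacity > 0) throw std::bad_alloc();
    }
    ~FixedPool() { std::free(bytes_); }        // everything is released in one call
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Returns a value-initialized T, or nullptr when the pool is exhausted.
    T* allocate() {
        if (used_ == capacity_) return nullptr;
        void* slot = bytes_ + used_ * sizeof(T);
        ++used_;
        return ::new (slot) T();
    }
    std::size_t used() const { return used_; }

private:
    unsigned char* bytes_;
    std::size_t capacity_;
    std::size_t used_;
};

// Hypothetical list node, similar to the structure described in the question.
struct Node { int x; int y; Node* prev; Node* next; };

// Link up to n nodes taken from the pool; returns how many were created.
std::size_t build_chain(FixedPool<Node>& pool, std::size_t n) {
    Node* prev = nullptr;
    std::size_t built = 0;
    for (std::size_t i = 0; i < n; ++i) {
        Node* node = pool.allocate();
        if (node == nullptr) break;            // pool exhausted
        node->x = static_cast<int>(i);
        node->prev = prev;
        if (prev != nullptr) prev->next = node;
        prev = node;
        ++built;
    }
    return built;
}
```

The design trades flexibility for speed: each allocate() is a bounds check and a pointer bump, at the cost of fixing the capacity in advance.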

This question is one of pragmatism, I'm afraid; that is to say, it depends.
If you have a LOT of pixels, only a few of which are black then counting them might be the highest cost.
If you're using C++, which your tags suggest you are, I would strongly suggest using the STL, something like std::vector.
The implementation of vector, if I remember correctly, uses a pragmatic approach to allocation. There are a few heuristics for allocation strategies; an informative one is this:
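#include <cstdlib>
class SampleVector {
    int N, used, *data;  // capacity (in elements), elements in use, storage
public:
    // malloc/realloc take sizes in bytes, hence the sizeof(int) factor
    SampleVector() : N(1), used(0), data(static_cast<int*>(std::malloc(N * sizeof(int)))) {}
    ~SampleVector() { std::free(data); }  // copying is not handled in this sketch
    void push_back(int i)
    {
        if (used >= N)
        {
            // out of room: double the capacity (error handling omitted)
            N *= 2;
            data = static_cast<int*>(std::realloc(data, N * sizeof(int)));
        }
        data[used++] = i;
    }
};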
In this case, you DOUBLE the amount of memory allocated every time you realloc.
This means that reallocations progressively halve in frequency.
Your STL implementation will have been well-tuned, so if you can use that, do!
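The effect of geometric growth can be observed on a real std::vector by counting how often its capacity changes while elements are appended. The helper name count_growth_steps is made up for this sketch, and the exact count depends on the library's growth factor (commonly 1.5 or 2), so no specific number is promised here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Append n ints one by one and count how many times the capacity changed,
// i.e. roughly how many reallocations push_back triggered.
std::size_t count_growth_steps(std::size_t n) {
    std::vector<int> v;
    std::size_t steps = 0;
    std::size_t capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != capacity) {        // the vector grew its storage
            ++steps;
            capacity = v.capacity();
        }
    }
    assert(v.size() == n);
    return steps;
}
```

For a million elements the count should come out in the tens rather than the millions, which is the point of growing by a constant factor.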

Another point to consider is how this interacts with threading. Using malloc many times in a threaded concurrent application can be a major drag on performance. In that environment you are better off with a scalable allocator like the one used in Intel's Threading Building Blocks or Hoard. The major limitation with many traditional malloc implementations is that there is a single global lock that all the threads contend for. It can be so bad that adding another thread dramatically slows down your application.

As already mentioned, malloc is costly, so fewer calls will probably be faster.
Also, working with the pixels in one contiguous block will, on most platforms, cause fewer cache misses and be faster.
However, there is no guarantee of this on every platform.

Next to the allocation overhead itself, allocating multiple small chunks may result in lots of cache misses, while if you can iterate through a contiguous block, chances are better.
The scenario you describe asks for preallocation of a large block, imho.

Although allocating large blocks is faster per byte of allocated memory, it will probably not be faster if you artificially increase the allocation size only to chop it up yourself. You are just duplicating the memory management.

Do an iteration over the pixels to count the number of them to be stored.
Then allocate an array for the exact number of items. This is the most efficient solution.
You can use std::vector for easier memory management (see std::vector::reserve). Note: reserve guarantees room for at least the requested number of elements; an implementation is allowed to allocate more than necessary, although in practice most allocate exactly what was requested.
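The count-then-allocate approach might look like the sketch below. It assumes a row-major 8-bit grayscale image in which the value 0 means black; Pixel and collect_black_pixels are illustrative names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Pixel { int x, y; };

// Two passes: count the black pixels, allocate once, then store their positions.
// Assumption: `image` is row-major with width * height entries and 0 means black.
std::vector<Pixel> collect_black_pixels(const std::vector<std::uint8_t>& image,
                                        std::size_t width, std::size_t height) {
    assert(image.size() == width * height);

    std::size_t count = 0;                     // pass 1: count
    for (std::uint8_t value : image) {
        if (value == 0) ++count;
    }

    std::vector<Pixel> result;
    result.reserve(count);                     // single allocation for all positions

    for (std::size_t y = 0; y < height; ++y) { // pass 2: fill
        for (std::size_t x = 0; x < width; ++x) {
            if (image[y * width + x] == 0) {
                result.push_back(Pixel{static_cast<int>(x), static_cast<int>(y)});
            }
        }
    }
    return result;
}
```

Whether the extra pass pays off depends on the image size and on how many pixels are black, so it is still worth timing against a plain push_back loop without the counting pass.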

"I can allocate-it-all" (really, I can!)
We can philosophize about some special implementations that speed up small allocations considerably ... yes! But in general this holds:
malloc must be general. It must handle all different kinds of allocations. That is the reason it is considerably slow! You might be using a special super-duper library that speeds things up, but even those cannot work wonders, since they have to implement malloc in its full spectrum.
The rule is: when you have more specialized allocation code, you are always faster than the broad "I can allocate-it-all" routine "malloc".
So when you are able to allocate the memory in bigger blocks in your code (and it does not cost you too much), you can speed things up considerably. Also, as mentioned by others, you will get a lot less memory fragmentation, which also speeds things up and can cost less memory. Note too that malloc needs additional memory for every chunk of memory it returns to you (yes, special routines can reduce this ... but you don't know what it really does unless you implemented it yourself or bought some wonder library).
