Trilinos sparse block matrix anomalous memory consumption - c++

I am building an application based on distributed linear algebra using Trilinos, the main issue is that memory consumption is much higher than expected.
I have built a simple test case for building an Epetra::VbrMatrix with 1.5 million doubles grouped as 5 millions blocks of 3 doubles, which should be about 115MB.
After building the matrix on 2 processors, half data each, I get a memory consumption of 500MB on each processor, which is about 7.5 times the data, it looks unreasonable to me, the matrix should just have some integer arrays for locating the nonzero blocks.
I asked on the trilinos-users mailing list, they say memory usage looks too high, but hope to have some more help here.
I tested both on my laptop with Ubuntu + gcc 4.4.5 + Trilinos 10.0 and on a cluster with PGI compiler and Trilinos 10.4.0, the result is about the same.
My test code is on gist https://gist.github.com/848310, where I also wrote memory consumption at different stage in my testing with 2 MPI processes on my laptop.
If anybody has any suggestion that would be really helpful. Also if you could even just build, run and report memory consumption it would be great.

answer by Alan Williams form the trilinos-users list, in short VBRmatrix is not suitable for such small blocks, as the storage overhead is bigger than the data themselves:
The VbrMatrix storage format
definitely incurs some storage
overhead, as compared to the simple
number of double-precision values
being stored.
In your program, you are storing
5,000,000 X 1 X 3 == 15 million
doubles. With 8 bytes per double, that
is 120 million bytes.
The matrix class Epetra_VbrMatrix
(which is the base class for
Epetra_FEVbrMatrix) internally stores
a Epetra_CrsGraph object, which
represents the sparsity structure.
This requires a couple of integers per
block-row, and 1 integer per
block-nonzero. (Your case has 5
million block-rows with 1
block-nonzero per row, so at least 15
million integers in total.)
Additionally, the Epetra_VbrMatrix
class stores a
Epetra_SerialDenseMatrix object for
each block-nonzero. This adds a couple
of integers, plus a bool and a
pointer, for each block-nonzero. In
your case, since your block-nonzeros
are small (1x3 doubles), this is a
substantial overhead. The VbrMatrix
format has proportionally less
overhead the bigger your
block-nonzeros are. But in your case,
with 1x3 blocks, the VbrMatrix is
indeed occupying several times more
memory than is required for the
15million doubles.

Related

Why does the capacity of a vector increase by a factor of 1.5? [duplicate]

C++ has std::vector and Java has ArrayList, and many other languages have their own form of dynamically allocated array. When a dynamic array runs out of space, it gets reallocated into a larger area and the old values are copied into the new array. A question central to the performance of such an array is how fast the array grows in size. If you always only grow large enough to fit the current push, you'll end up reallocating every time. So it makes sense to double the array size, or multiply it by say 1.5x.
Is there an ideal growth factor? 2x? 1.5x? By ideal I mean mathematically justified, best balancing performance and wasted memory. I realize that theoretically, given that your application could have any potential distribution of pushes that this is somewhat application dependent. But I'm curious to know if there's a value that's "usually" best, or is considered best within some rigorous constraint.
I've heard there's a paper on this somewhere, but I've been unable to find it.
I remember reading many years ago why 1.5 is preferred over two, at least as applied to C++ (this probably doesn't apply to managed languages, where the runtime system can relocate objects at will).
The reasoning is this:
Say you start with a 16-byte allocation.
When you need more, you allocate 32 bytes, then free up 16 bytes. This leaves a 16-byte hole in memory.
When you need more, you allocate 64 bytes, freeing up the 32 bytes. This leaves a 48-byte hole (if the 16 and 32 were adjacent).
When you need more, you allocate 128 bytes, freeing up the 64 bytes. This leaves a 112-byte hole (assuming all previous allocations are adjacent).
And so and and so forth.
The idea is that, with a 2x expansion, there is no point in time that the resulting hole is ever going to be large enough to reuse for the next allocation. Using a 1.5x allocation, we have this instead:
Start with 16 bytes.
When you need more, allocate 24 bytes, then free up the 16, leaving a 16-byte hole.
When you need more, allocate 36 bytes, then free up the 24, leaving a 40-byte hole.
When you need more, allocate 54 bytes, then free up the 36, leaving a 76-byte hole.
When you need more, allocate 81 bytes, then free up the 54, leaving a 130-byte hole.
When you need more, use 122 bytes (rounding up) from the 130-byte hole.
In the limit as n → ∞, it would be the golden ratio: ϕ = 1.618...
For finite n, you want something close, like 1.5.
The reason is that you want to be able to reuse older memory blocks, to take advantage of caching and avoid constantly making the OS give you more memory pages. The equation you'd solve to ensure that a subsequent allocation can re-use all prior blocks reduces to xn − 1 − 1 = xn + 1 − xn, whose solution approaches x = ϕ for large n. In practice n is finite and you'll want to be able to reusing the last few blocks every few allocations, and so 1.5 is great for ensuring that.
(See the link for a more detailed explanation.)
It will entirely depend on the use case. Do you care more about the time wasted copying data around (and reallocating arrays) or the extra memory? How long is the array going to last? If it's not going to be around for long, using a bigger buffer may well be a good idea - the penalty is short-lived. If it's going to hang around (e.g. in Java, going into older and older generations) that's obviously more of a penalty.
There's no such thing as an "ideal growth factor." It's not just theoretically application dependent, it's definitely application dependent.
2 is a pretty common growth factor - I'm pretty sure that's what ArrayList and List<T> in .NET uses. ArrayList<T> in Java uses 1.5.
EDIT: As Erich points out, Dictionary<,> in .NET uses "double the size then increase to the next prime number" so that hash values can be distributed reasonably between buckets. (I'm sure I've recently seen documentation suggesting that primes aren't actually that great for distributing hash buckets, but that's an argument for another answer.)
One approach when answering questions like this is to just "cheat" and look at what popular libraries do, under the assumption that a widely used library is, at the very least, not doing something horrible.
So just checking very quickly, Ruby (1.9.1-p129) appears to use 1.5x when appending to an array, and Python (2.6.2) uses 1.125x plus a constant (in Objects/listobject.c):
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
PyErr_NoMemory();
return -1;
} else {
new_allocated += newsize;
}
newsize above is the number of elements in the array. Note well that newsize is added to new_allocated, so the expression with the bitshifts and ternary operator is really just calculating the over-allocation.
Let's say you grow the array size by x. So assume you start with size T. The next time you grow the array its size will be T*x. Then it will be T*x^2 and so on.
If your goal is to be able to reuse the memory that has been created before, then you want to make sure the new memory you allocate is less than the sum of previous memory you deallocated. Therefore, we have this inequality:
T*x^n <= T + T*x + T*x^2 + ... + T*x^(n-2)
We can remove T from both sides. So we get this:
x^n <= 1 + x + x^2 + ... + x^(n-2)
Informally, what we say is that at nth allocation, we want our all previously deallocated memory to be greater than or equal to the memory need at the nth allocation so that we can reuse the previously deallocated memory.
For instance, if we want to be able to do this at the 3rd step (i.e., n=3), then we have
x^3 <= 1 + x
This equation is true for all x such that 0 < x <= 1.3 (roughly)
See what x we get for different n's below:
n maximum-x (roughly)
3 1.3
4 1.4
5 1.53
6 1.57
7 1.59
22 1.61
Note that the growing factor has to be less than 2 since x^n > x^(n-2) + ... + x^2 + x + 1 for all x>=2.
Another two cents
Most computers have virtual memory! In the physical memory you can have random pages everywhere which are displayed as a single contiguous space in your program's virtual memory. The resolving of the indirection is done by the hardware. Virtual memory exhaustion was a problem on 32 bit systems, but it is really not a problem anymore. So filling the hole is not a concern anymore (except special environments). Since Windows 7 even Microsoft supports 64 bit without extra effort. # 2011
O(1) is reached with any r > 1 factor. Same mathematical proof works not only for 2 as parameter.
r = 1.5 can be calculated with old*3/2 so there is no need for floating point operations. (I say /2 because compilers will replace it with bit shifting in the generated assembly code if they see fit.)
MSVC went for r = 1.5, so there is at least one major compiler that does not use 2 as ratio.
As mentioned by someone 2 feels better than 8. And also 2 feels better than 1.1.
My feeling is that 1.5 is a good default. Other than that it depends on the specific case.
The top-voted and the accepted answer are both good, but neither answer the part of the question asking for a "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory". (The second-top-voted answer does try to answer this part of the question, but its reasoning is confused.)
The question perfectly identifies the 2 considerations that have to be balanced, performance and wasted memory. If you choose a growth rate too low, performance suffers because you'll run out of extra space too quickly and have to reallocate too frequently. If you choose a growth rate too high, like 2x, you'll waste memory because you'll never be able to reuse old memory blocks.
In particular, if you do the math1 you'll find that the upper limit on the growth rate is the golden ratio ϕ = 1.618… . Growth rate larger than ϕ (like 2x) mean that you'll never be able to reuse old memory blocks. Growth rates only slightly less than ϕ mean you won't be able to reuse old memory blocks until after many many reallocations, during which time you'll be wasting memory. So you want to be as far below ϕ as you can get without sacrificing too much performance.
Therefore I'd suggest these candidates for "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory":
≈1.466x (the solution to x4=1+x+x2) allows memory reuse after just 3 reallocations, one sooner than 1.5x allows, while reallocating only slightly more frequently
≈1.534x (the solution to x5=1+x+x2+x3) allows memory reuse after 4 reallocations, same as 1.5x, while reallocating slightly less frequently for improved performance
≈1.570x (the solution to x6=1+x+x2+x3+x4) only allows memory reuse after 5 reallocations, but will reallocate even less infrequently for even further improved performance (barely)
Clearly there's some diminishing returns there, so I think the global optimum is probably among those. Also, note that 1.5x is a great approximation to whatever the global optimum actually is, and has the advantage being extremely simple.
1 Credits to #user541686 for this excellent source.
It really depends. Some people analyze common usage cases to find the optimal number.
I've seen 1.5x 2.0x phi x, and power of 2 used before.
If you have a distribution over array lengths, and you have a utility function that says how much you like wasting space vs. wasting time, then you can definitely choose an optimal resizing (and initial sizing) strategy.
The reason the simple constant multiple is used, is obviously so that each append has amortized constant time. But that doesn't mean you can't use a different (larger) ratio for small sizes.
In Scala, you can override loadFactor for the standard library hash tables with a function that looks at the current size. Oddly, the resizable arrays just double, which is what most people do in practice.
I don't know of any doubling (or 1.5*ing) arrays that actually catch out of memory errors and grow less in that case. It seems that if you had a huge single array, you'd want to do that.
I'd further add that if you're keeping the resizable arrays around long enough, and you favor space over time, it might make sense to dramatically overallocate (for most cases) initially and then reallocate to exactly the right size when you're done.
I recently was fascinated by the experimental data I've got on the wasted memory aspect of things. The chart below is showing the "overhead factor" calculated as the amount of overhead space divided by the useful space, the x-axis shows a growth factor. I'm yet to find a good explanation/model of what it reveals.
Simulation snippet: https://gist.github.com/gubenkoved/7cd3f0cb36da56c219ff049e4518a4bd.
Neither shape nor the absolute values that simulation reveals are something I've expected.
Higher-resolution chart showing dependency on the max useful data size is here: https://i.stack.imgur.com/Ld2yJ.png.
UPDATE. After pondering this more, I've finally come up with the correct model to explain the simulation data, and hopefully, it matches experimental data nicely. The formula is quite easy to infer simply by looking at the size of the array that we would need to have for a given amount of elements we need to contain.
Referenced earlier GitHub gist was updated to include calculations using scipy.integrate for numerical integration that allows creating the plot below which verifies the experimental data pretty nicely.
UPDATE 2. One should however keep in mind that what we model/emulate there mostly has to do with the Virtual Memory, meaning the over-allocation overheads can be left entirely on the Virtual Memory territory as physical memory footprint is only incurred when we first access a page of Virtual Memory, so it's possible to malloc a big chunk of memory, but until we first access the pages all we do is reserving virtual address space. I've updated the GitHub gist with CPP program that has a very basic dynamic array implementation that allows changing the growth factor and the Python snippet that runs it multiple times to gather the "real" data. Please see the final graph below.
The conclusion there could be that for x64 environments where virtual address space is not a limiting factor there could be really little to no difference in terms of the Physical Memory footprint between different growth factors. Additionally, as far as Virtual Memory is concerned the model above seems to make pretty good predictions!
Simulation snippet was built with g++.exe simulator.cpp -o simulator.exe on Windows 10 (build 19043), g++ version is below.
g++.exe (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
PS. Note that the end result is implementation-specific. Depending on implementation details dynamic array might or might not access the memory outside the "useful" boundaries. Some implementations would use memset to zero-initialize POD elements for whole capacity -- this will cause virtual memory page translated into physical. However, std::vector implementation on a referenced above compiler does not seem to do that and so behaves as per mock dynamic array in the snippet -- meaning overhead is incurred on the Virtual Memory side, and negligible on the Physical Memory.
I agree with Jon Skeet, even my theorycrafter friend insists that this can be proven to be O(1) when setting the factor to 2x.
The ratio between cpu time and memory is different on each machine, and so the factor will vary just as much. If you have a machine with gigabytes of ram, and a slow CPU, copying the elements to a new array is a lot more expensive than on a fast machine, which might in turn have less memory. It's a question that can be answered in theory, for a uniform computer, which in real scenarios doesnt help you at all.
I know it is an old question, but there are several things that everyone seems to be missing.
First, this is multiplication by 2: size << 1. This is multiplication by anything between 1 and 2: int(float(size) * x), where x is the number, the * is floating point math, and the processor has to run additional instructions for casting between float and int. In other words, at the machine level, doubling takes a single, very fast instruction to find the new size. Multiplying by something between 1 and 2 requires at least one instruction to cast size to a float, one instruction to multiply (which is float multiplication, so it probably takes at least twice as many cycles, if not 4 or even 8 times as many), and one instruction to cast back to int, and that assumes that your platform can perform float math on the general purpose registers, instead of requiring the use of special registers. In short, you should expect the math for each allocation to take at least 10 times as long as a simple left shift. If you are copying a lot of data during the reallocation though, this might not make much of a difference.
Second, and probably the big kicker: Everyone seems to assume that the memory that is being freed is both contiguous with itself, as well as contiguous with the newly allocated memory. Unless you are pre-allocating all of the memory yourself and then using it as a pool, this is almost certainly not the case. The OS might occasionally end up doing this, but most of the time, there is going to be enough free space fragmentation that any half decent memory management system will be able to find a small hole where your memory will just fit. Once you get to really bit chunks, you are more likely to end up with contiguous pieces, but by then, your allocations are big enough that you are not doing them frequently enough for it to matter anymore. In short, it is fun to imagine that using some ideal number will allow the most efficient use of free memory space, but in reality, it is not going to happen unless your program is running on bare metal (as in, there is no OS underneath it making all of the decisions).
My answer to the question? Nope, there is no ideal number. It is so application specific that no one really even tries. If your goal is ideal memory usage, you are pretty much out of luck. For performance, less frequent allocations are better, but if we went just with that, we could multiply by 4 or even 8! Of course, when Firefox jumps from using 1GB to 8GB in one shot, people are going to complain, so that does not even make sense. Here are some rules of thumb I would go by though:
If you cannot optimize memory usage, at least don't waste processor cycles. Multiplying by 2 is at least an order of magnitude faster than doing floating point math. It might not make a huge difference, but it will make some difference at least (especially early on, during the more frequent and smaller allocations).
Don't overthink it. If you just spent 4 hours trying to figure out how to do something that has already been done, you just wasted your time. Totally honestly, if there was a better option than *2, it would have been done in the C++ vector class (and many other places) decades ago.
Lastly, if you really want to optimize, don't sweat the small stuff. Now days, no one cares about 4KB of memory being wasted, unless they are working on embedded systems. When you get to 1GB of objects that are between 1MB and 10MB each, doubling is probably way too much (I mean, that is between 100 and 1,000 objects). If you can estimate expected expansion rate, you can level it out to a linear growth rate at a certain point. If you expect around 10 objects per minute, then growing at 5 to 10 object sizes per step (once every 30 seconds to a minute) is probably fine.
What it all comes down to is, don't over think it, optimize what you can, and customize to your application (and platform) if you must.

Apparent CUDA magic

I'm using CUDA (in reality I'm using pyCUDA if the difference matters) and performing some computation over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I'm declaring two linear arrays of 20000 components using:
float test[20000]
float test2[20000]
With these arrays I perform simple calculations, like for example filling them with constant values. The point is that the kernel executes normally and perform correctly the computations (you can see this filling an array with a random component of test and sending that array to host from device).
The problem is that my NVIDIA card has only 2GB of memory and the total amount of memory to allocate the arrays test and test2 is 320*600*20000*4 bytes that is much more than 2GB.
Where is this memory coming from? and how can CUDA perform the computation in every thread?
Thank you for your time
The actual sizing of the local/stack memory requirements is not as you suppose (for the entire grid, all at once) but is actually based on a formula described by #njuffa here.
Basically, the local/stack memory require is sized based on the maximum instantaneous capacity of the device you are running on, rather than the size of the grid.
Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
available GPU memory/(#of SMs)/(max threads per SM)
For your first case:
float test[20000];
float test2[20000];
That total is 160KB (per thread) so we are under the maximum limit of 512KB per thread. What about the 2nd limit?
GTX 650m has 2 cc 3.0 (kepler) SMs (each Kepler SM has 192 cores). Therefore, the second limit above gives, if all the GPU memory were available:
2GB/2/2048 = 512KB
(kepler has 2048 max threads per multiprocessor)
so it is the same limit in this case. But this assumes all the GPU memory is available.
Since you're suggesting in the comments that this configuration fails:
float test[40000];
float test2[40000];
i.e. 320KB, I would conclude that your actual available GPU memory is at the point of this bulk allocation attempt is somewhere above (160/512)*100% i.e. above 31% but below (320/512)*100% i.e. below 62.5% of 2GB, so I would conclude that your available GPU memory at the time of this bulk allocation request for the stack frame would be something less than 1.25GB.
You could try to see if this is the case by calling cudaGetMemInfo right before the kernel launch in question (although I don't know how to do this in pycuda). Even though your GPU starts out with 2GB, if you are running the display from it, you are likely starting with a number closer to 1.5GB. And dynamic (e.g. cudaMalloc) and or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch, will all impact available memory.
This is all to explain some of the specifics. The general answer to your question is that the "magic" arises due to the fact that the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be a number that is significantly less than what would be required for the whole grid.

Best way to stock 2xx millions of points

So I need to have an array of Terrain (ground) which contains 2xx little terrain (ground). Each little terrain(ground) contains 1 000 000 floats points.
So I thought about something like
QVector<QVector<float> > terrain;
QVector<float> littleTerrain;
But at a moment there is a problem with all the allocations I can do .
So of course, I thought about
QVector<float*>
but it's necessary to delete all the pointers in the vector
So, using the smart pointer is possible here ? Otherwise, what is the best choice to do ?
float is 4 bytes and pointer is also 4 bytes in 32bit config so there is no point to hold pointers for your floats. Vectors can be used for you purpose but you should first reserve memory for your vectors. Also try to determined the limits of your vector implementation. Implementation can be different from library to library.
If you only wish to support 64 bit architectures, then certainly a list of vectors would work. It'll work very crappily since there's likely not enough physical RAM to back it all up.
On 32 bit architectures you don't have enough address space to keep it all in a simple data structure anyway.
So, what you need to do is a caching scheme where only a subset of this data is resident in the in-memory data structure at any time. To optimize the caching behavior, at all levels (L1, L2, L3, paging), you need to - and this is important -- keep spatially neighboring data in neighboring memory addresses.
One way of doing it would be to have all terrain subdivided into 32x32 unit squares. Note that 32*32*sizeof(float) == 4096 is the small page size on most systems. This square should be further subdivided in 4x4 unit squares. Note that 4*4*sizeof(float) == 64 and is the cache line size on many systems.
The overhead will be proportional to the number of containers you have - a few hundred containers, a few kilobytes in total. Nothing at all, in other words.
The important thing to do is to reserve sufficient memory for the number of points per "littleTerrain". Otherwise Qt will make sure it fits, but probably overallocate by 50% or so.

Do I have a memory inconcistency? Macos

I currently use Activity monitor, however this yields a serious inconsistency. I have a running program that builds and stores a 60 x 100,000 array of 10 dimensional GSL vectors in double precision, and another 6 x 60 array of 16,807 dimensional GSL vectors.
I code in C++, I'm using the GSL library out of convenience for the minute.
GSL vectors are essentially an array of double precision data and a pointer, so I think it should be accurate to measure their usage in terms of just the double precision components.
Now, by my calculations, I should be storing around 500 MB of data (8bytes per double). However, my Macos Activity monitor tells me I'm storing 1.4 GB of "real" memory. Now, this may be a very inaccurate method to measure memory usage, but it's not inaccurate at predicting when my machine will switch from using RAM to using swaps, and become very slow! When I increase the first array size to 60 x 400k, for example, I run out of memory and everything stops dead.
So is it that my math is wrong, or is there something going wrong with the way my computer is estimating how much data it's storing?
EDIT: Or is it something about the way I'm storing pointer based data that's confusing the allocator into massively overcompensating the storage need?
EDIT 2: The data is stored in stl::vector< stl::vector<gsl::gsl_vector * > > structures. I read that Eigen does not use dynamical memory allocation: could this lead to substantial improvement in the memory management?
You are creating 600,000 std::vectors. That's a lot of overhead. Each of those vectors isn't simply an array - there is an overhead of pointer, size, capacity, alignment, etc. Plus I suspect you allocate your GSL vectors on the heap?
Eigen may partially solve your problems as it has a static specialization of Matrix class. However, you should be really thinking of using a proper 3d contigous storage array/tensor object.

Is there a limit on the size of array that can be used in CUDA?

Written a program that calculates the integral of a simple function. When testing it I found that if I used an array of size greater than 10 million elements it produced the wrong answer. I found that the error seemed to be occurring when once the array had been manipulated in a CUDA kernel. 10 millions elements and below worked fine and produced the correct result.
Is there a size limit on the amount of elements that can be transferred across to the GPU or calculated upon the GPU?
P.S. using C style arrays containing floats.
There are many different kinds of memory that you can use with CUDA. In particular, you have
Linear Memory (cuMemAlloc)
Pinned Memory (cuMemHostAlloc)
Zero-Copy Memory (cuMemAllocHost)
Pitch Allocation (cuMemAllocPitch)
Textures Bound to Linear Memory
Textures Bound to CUDA Arrays
Textures Bound to Pitch Memory
...and cube maps and surfaces, which I will not list here.
Each kind of memory is associated with its own hardware resource limits, many of which you can find by using cuDeviceGetAttribute. The function cuMemGetInfo returns the amount of free and total memory on the device, but because of alignment requirements, allocating 1,000,000 floats may result in more than 1,000,000 * sizeof(float) bytes being consumed. The maximum number of blocks that you can schedule at once is also a limitation: if you exceed it, the kernel will fail to launch (you can easily find this number using cuDeviceGetAttribute). You can find out the alignment requirements for different amounts of memory using the CUDA Driver API, but for a simple program, you can make a reasonable guess and check the value of allocation function to determine whether the allocation was successful.
There is no restriction on the amount of bytes that you can transfer; using asynchronous functions, you can overlap kernel execution with memory copying (providing that your card supports this). Exceeding the maximum number of blocks you can schedule, or consuming the available memory on your device means that you will have to split up your task so that you can use multiple kernels to handle it.
For compute capability>=3.0 the max grid dimensions are 2147483647x65535x65535,
so for a that should cover any 1-D array of sizes up to 2147483647x1024 = 2.1990233e+12.
1 billion element arrays are definitely fine.
1,000,000,000/1024=976562.5, and round up to 976563 blocks. Just make sure that if threadIdx.x+blockIdx.x*blockDim.x>= number of elements you return from kernel without processing.