I have a mystery that I do not have an answer for. I have written a simple program in C++(and I should say that I'm not a professional C++ developer). Here it is:
#include <iostream>
int main(){
const int SIZE = 1000;
int pool1 [SIZE];
int pool2 [SIZE];
int result [SIZE*SIZE];
//Prepare data
for(int i = 0; i < SIZE; i++){
pool1[i] = i + 1;
pool2[i] = SIZE - i;
}
//Run test
for(int i = 0; i < SIZE; i++){
for(int j = 0; j < SIZE; j++){
result[i*SIZE + j] = pool1[i]*pool2[j];
}
}
return 0;
}
The program seems to work (I use it as a kind of benchmark for different languages) but then I ran it with valgrind and it started complaing:
==25912== Invalid read of size 4
==25912== at 0x804864B: main (in /home/renra/Dev/Benchmarks/array_iteration/array_iteration_cpp)
==25912== Address 0xbee79da0 is on thread 1's stack
==25912== Invalid write of size 4
==25912== at 0x8048632: main (in /home/renra/Dev/Benchmarks/array_iteration/array_iteration_cpp)
==25912== Address 0xbeaa9498 is on thread 1's stack
==25912== More than 10000000 total errors detected. I'm not reporting any more.
==25912== Final error counts will be inaccurate. Go fix your program!
Hmm, does not look good. Size 4 probably refers to the size of int. As you can see at first I was using SIZE 1000 so the results array would be 1,000,000 ints long. So, I thought, it was just overflowing and I needed a larger value type(at least for the iterators and the array of results). I used unsigned long long (the max of unsigned long is 18,446,744,073,709,551,615 and all I needed was 1,000,000 - SIZE*SIZE ). But I'm still getting these error messages (and they still say the read and write size is 4 even though sizeof(long long) is 8).
Also the messages are not there when I use a lower SIZE, but they seem to kick in exactly at SIZE 707 regardless of the used type. Anybody has a clue? I'm quite curious :-).
C and C++ both have no clear limit on what sizes of arrays you will be able to use on the stack and also usually no builtin protection. Just don't allocate such large chunks as automatic (scope local) variables. Use malloc in C or new in C++ for such a purpose.
Related
asking on stack again. I have an array wich I want to be always at the minimum size, because I have to send over the internet. The problem is, the program has no way to know what the minimum size is until the operation is finished. This leads me to having to ways: using vectors, or make an array of the maximum lenght the program could ever need, and then that it knows the minimum size, initialize a pointer with new and put the data there. But I can't use vectors because they require serialization to be sent, and both vector and serialization have overheads I don't want. Example:
unsigned short data[1270], // the maximum size the operation could take is 1270 shorts
*packet; // pointer
int counter; //this is to count how big "packet" will be
//example of operation, wich of course is different from my program
// in this case the operation takes 6 bytes
while(true) {
for (int i; i != 6; i++) {
counter++;
data[i]= 1;
}
packet=new unsigned short[counter];
for (int i; i!=counter; i++) {
packet[i]=data[i];
}
}
Like you might have noticed, this code runs in cycles, so the problem might be my way to repeatedly re-initialize the same pointer.
The problem in this code is, if I do:
std::cout<<counter<<" "<<sizeof(packet)/sizeof(unsigned short)<<" ";
counter variates in size (usually from 1 to 35), but the size of packet is always 2. I also tried delete [] before new, but it didn't solve the problem.
This issue could also be related to another part of the code, but here i am just asking:
Is my way of repeatedly allocate memory right?
Continually add to an std::vector while requesting to the compiler that the size allocated in heap memory not exceed the amount actually needed:
std::vector<int> vec;
std::size_t const maxSize = 10;
for (std::size_t i; i != maxSize; ++i)
{
vec.reserve(vec.size() + 1u);
vec.push_back(1234); // whatever you're adding
}
I should add though that I see no good reason for doing this under normal circumstances. The performance of this "program" could be severely hampered with no obvious benefit.
You can always use pointers and realloc. C++ is such a powerfull language because of its pointers, you don't need to use arrays.
Take a look at the cplusplus entry on realloc.
For your case you could use it like this:
new_packet = (unsigned short*) realloc (packet, new_size * sizeof(unsigned short));
if (new_packet!=NULL) {
packet = new_packet;
for(int i ; i < new_size ; i++)
packet[i] = new_values[i];
}
else {
if( packet != NULL )
free (packet);
puts ("Error (re)allocating memory");
exit (1);
}
Okay, I see a couple problems in your logic here. Lets start with the main one: Why do you need to alloc a whole fresh array with a copy of whats in data just to send it over a socket? It's not like sending a letter dude, send() will transfer a copy of the information, not literally move it over the network. It's perfectly fine to do this:
send(socket, data, counter * sizeof(unsigned short), 0);
There. You don't need a new pointer for anything.
Also, I don't know where you got the serialization thing from. Vectors are basically arrays that resize automatically, and will also delete themselves from memory once the function is done. You could do this:
std::vector<unsigned short> packet;
packet.reserve(counter);
for (std::size_t i = 0; i < counter; ++i)
packet[i] = data[i];
send(socket, &packet[0], packet.size() * sizeof(unsigned short), 0);
Or even shorten to:
std::vector<unsigned short> packet;
for (std::size_t i = 0; i < counter; ++i)
packet.push_back(data[i]);
But with this option the vector will resize counter times, what is performance consuming. Always set its size first if you have the information available.
First I should say I'm quite new to programming in C++ (let alone CUDA), though it is what I first learned with about 184 years ago. I'd say I'm a bit out of touch with memory allocation, and datatype sizes, though I'm learning. Anyway here goes:
I have a GPU with compute capability 3.0 (It's a Geforce 660 GTX w/ 2GB of DRAM).
Going by ./deviceQuery found in the CUDA samples (and by other charts I've found online), my maximum grid size is listed:
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
At 2,147,483,647 (2^31-1) that x dimension is huge and kind of nice… YET, when I run my code, pushing beyond 65535 in the x dimension, things get... weird.
I used an example from an Udacity course, and modified it to test the extremes. I've kept the kernel code fairly simple to prove the point:
__global__ void referr(long int *d_out, long int *d_in){
long int idx = blockIdx.x;
d_out[idx] = idx;
}
Please note below the ARRAY_SIZE being the size of the grid, but also being the size of the array of integers on which to do operations. I am leaving the size of the blocks at 1x1x1. JUST for the sake of understanding the limitations, I KNOW that having this many operations with blocks of only 1 thread makes no sense, but I want to understand what's going on with the grid size limitations.
int main(int argc, char ** argv) {
const long int ARRAY_SIZE = 522744;
const long int ARRAY_BYTES = ARRAY_SIZE * sizeof(long int);
// generate the input array on the host
long int h_in[ARRAY_SIZE];
for (long int i = 0; i < ARRAY_SIZE; i++) {
h_in[i] = i;
}
long int h_out[ARRAY_SIZE];
// declare GPU memory pointers
long int *d_in;
long int *d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
// transfer the array to the GPU
cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
// launch the kernel with ARRAY_SIZE blocks in the x dimension, with 1 thread each.
referr<<<ARRAY_SIZE, 1>>>(d_out, d_in);
// copy back the result array to the CPU
cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
// print out the resulting array
for (long int i =0; i < ARRAY_SIZE; i++) {
printf("%li", h_out[i]);
printf(((i % 4) != 3) ? "\t" : "\n");
}
cudaFree(d_in);
cudaFree(d_out);
return 0;
}
This works as expected with an ARRAY_SIZE at MOST of 65535. The last few lines of the output below
65516 65517 65518 65519
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534
If I push the ARRAY_SIZE beyond this the output gets really unpredictable and eventually if the number gets too high I get a Segmentation fault (core dumped) message… whatever that even means. Ie. with an ARRAY_SIZE of 65536:
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534 131071
Why is it now stating that the blockIdx.x for this last one is 131071?? That is 65535+65535+1. Weird.
Even weirder, when I set the ARRAY_SIZE to 65537 (65535+2) I get some seriously strange results for the last lines of the output.
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534 131071
131072 131073 131074 131075
131076 131077 131078 131079
131080 131081 131082 131083
131084 131085 131086 131087
131088 131089 131090 131091
131092 131093 131094 131095
131096 131097 131098 131099
131100 131101 131102 131103
131104 131105 131106 131107
131108 131109 131110 131111
131112 131113 131114 131115
131116 131117 131118 131119
131120 131121 131122 131123
131124 131125 131126 131127
131128 131129 131130 131131
131132 131133 131134 131135
131136 131137 131138 131139
131140 131141 131142 131143
131144 131145 131146 131147
131148 131149 131150 131151
131152 131153 131154 131155
131156 131157 131158 131159
131160 131161 131162 131163
131164 131165 131166 131167
131168 131169 131170 131171
131172 131173 131174 131175
131176 131177 131178 131179
131180 131181 131182 131183
131184 131185 131186 131187
131188 131189 131190 131191
131192 131193 131194 131195
131196 131197 131198 131199
131200
Isn't 65535 the limit for older GPUs? Why is my GPU "messing up" when I push past the 65535 barrier for the x grid dimension? Or is this by design? What in the world is going on?
Wow, sorry for the long question.
Any help to understand this would be greatly appreciated! Thanks!
You should be using proper CUDA error checking . And you should be compiling for a compute 3.0 architecture by specifying -arch=sm_30 when you compile with nvcc.
This question already has answers here:
Why doesn't this program segfault?
(1 answer)
Why doesn't this code cause a segfault?
(3 answers)
Closed 9 years ago.
I wrote the following after noticing something weird happening in another project. This code does not produce a segfault even though arrays are being called out of bounds multiple times. Can someone explain to me why there is no segfault from running the code below?
#include <stdlib.h>
#include <stdio.h>
int main()
{
int *a = (int *)malloc(4 * sizeof(int));
int *b = (int *)malloc(3 * sizeof(int));
int i = 0;
for(i =0; i <3 ; i++)
{
b[i] = 3+i;
}
for(i = 0; i < 4; i++)
{
a[i] = i;
}
for(i = 0; i < 100 ; i++){
a[i] = -1;
}
for(i = 0 ; i < 100 ; i++){
printf("%d \n", b[i]);
}
}
A segfault only happens if you try to access memory locations that are not mapped into your process.
The mallocs are taken from bigger chunks of preallocated memory that makes the heap. E.g. the system may make (or increase) the heap in 4K blocks, so reaching beyond the the bounds of your arrays will still be inside that block of heap-memory that is already allocated to your process (and from which it would assign memory for subsequent mallocs).
In a different situation (where more memory was allocated previously, so your mallocs are near the end of a heap block), this may segfault, but it is basically impossible to predict this (especially taking into account different platforms or compilers).
Undefined behaviour is undefined. Anything can happen, including the appearance of "correct" behaviour.
A segmentation fault occurs when a process tries to access memory which OS accounts as not belonging to the process. As memory accounting inside an OS is done by pages (usually 1 page = 4 KB), a process can access any memory within the allocated page without OS noticing it.
Should be using new and not malloc
What is the platform?
When you try undefined behaviour - gues what it is undefined.
I have a program that generates files containing random distributions of the character A - Z. I have written a method that reads these files (and counts each character) using fread with different buffer sizes in an attempt to determine the optimal block size for reads. Here is the method:
int get_histogram(FILE * fp, long *hist, int block_size, long *milliseconds, long *filelen)
{
char *buffer = new char[block_size];
bzero(buffer, block_size);
struct timeb t;
ftime(&t);
long start_in_ms = t.time * 1000 + t.millitm;
size_t bytes_read = 0;
while (!feof(fp))
{
bytes_read += fread(buffer, 1, block_size, fp);
if (ferror (fp))
{
return -1;
}
int i;
for (i = 0; i < block_size; i++)
{
int j;
for (j = 0; j < 26; j++)
{
if (buffer[i] == 'A' + j)
{
hist[j]++;
}
}
}
}
ftime(&t);
long end_in_ms = t.time * 1000 + t.millitm;
*milliseconds = end_in_ms - start_in_ms;
*filelen = bytes_read;
return 0;
}
However, when I plot bytes/second vs. block size (buffer size) using block sizes of 2 - 2^20, I get an optimal block size of 4 bytes -- which just can't be correct. Something must be wrong with my code but I can't find it.
Any advice is appreciated.
Regards.
EDIT:
The point of this exercise is to demonstrate the optimal buffer size by recording the read times (plus computation time) for different buffer sizes. The file pointer is opened and closed by the calling code.
There are many bugs in this code:
It uses new[], which is C++.
It doesn't free the allocated memory.
It always loops over block_size bytes of input, not bytes_read as returned by fread().
Also, the actual histogram code is rather inefficient, since it seems to loop over each character to determine which character it is.
UPDATE: Removed claim that using feof() before I/O is wrong, since that wasn't true. Thanks to Eric for pointing this out in a comment.
You're not stating what platform you're running this on, and what compile time parameters you use.
Of course, the fread() involves some overhead, leaving user mode and returning. On the other hand, instead of setting the hist[] information directly, you're looping through the alphabet. This is unnecessary and, without optimization, causes some overhead per byte.
I'd re-test this with hist[j-26]++ or something similar.
Typically, the best timing would be achieved if your buffer size equals the system's buffer size for the given media.
I am doing some audio processing and therefore mixing some C and Objective C. I have set up a class that handles my OpenAL interface and my audio processing. I have changed the class suffix to
.mm
...as described in the Core Audio book among many examples online.
I have a C style function declared in the .h file and implemented in the .mm file:
static void granularizeWithData(float *inBuffer, unsigned long int total) {
// create grains of audio data from a buffer read in using ExtAudioFileRead() method.
// total value is: 235377
float tmpArr[total];
// now I try to zero pad a new buffer:
for (int j = 1; j <= 100; j++) {
tmpArr[j] = 0;
// CRASH on first iteration EXC_BAD_ACCESS (code=1, address= ...blahblah)
}
}
Strange??? Yes I am totally out of ideas as to why THAT doesn't work but the FOLLOWING works:
float tmpArr[235377];
for (int j = 1; j <= 100; j++) {
tmpArr[j] = 0;
// This works and index 0 - 99 are filled with zeros
}
Does anyone have any clue as to why I can't declare an array of size 'total' which has an int value? My project uses ARC, but I don't see why this would cause a problem. When I print the value of 'total' when debugging, it is in fact the correct value. If anyone has any ideas, please help, it is driving me nuts!
Problem is that that array gets allocated on the stack and not on the heap. Stack size is limited so you can't allocate an array of 235377*sizeof(float) bytes on it, it's too large. Use the heap instead:
float *tmpArray = NULL;
tmpArray = (float *) calloc(total, sizeof(float)); // allocate it
// test that you actually got the memory you asked for
if (tmpArray)
{
// use it
free(tmpArray); // release it
}
Mind that you are always responsible of freeing memory which is allocated on the heap or you will generate a leak.
In your second example, since size is known a priori, the compiler reserves that space somewhere in the static space of the program thus allowing it to work. But in your first example it must do it on the fly, which causes the error. But in any case before being sure that your second example works you should try accessing all the elements of the array and not just the first 100.