I am running into a very serious issue with one function in C++. This is my function:
double** Fun1(unsigned l, unsigned n, vector<int>& list,
              vector<string>& DataArray)
{
    double** array2D = 0;
    array2D = new double*[l];
    string alphabet = "ACGT";
    for (int i = 0; i < l; i++)
    {
        array2D[i] = new double[4];
        vector<double> count(4, 0.0);
        for (int j = 0; j < n; ++j)
        {
            for (int k = 0; k < 4; k++)
            {
                if (toupper(DataArray[list[j]][i]) == alphabet[k])
                    count[k] = count[k] + 1;
            }
        }
        for (int k = 0; k < 4; k++)
            array2D[i][k] = count[k];
        count.clear();
    }
    return array2D;
}
The value of l is around 100 and n = 1; DataArray has size 50000 x l, and list contains a single number between 0 and 49999.
Now I am calling this function from my main program a very large number of times (maybe more than 50 million times). Up to a certain point it runs very smoothly, but after around 2-3 minutes my system hangs. I am unable to find what the problem with this code is. I guess memory is running short, but I don't know why.
You are missing the corresponding delete[] from your code.
Note the [], which means you are deleting an array. If you forget to add these, you will be venturing into undefined territory (3.7.4.2 in N3797).
You may wish to try using std::array to mitigate having to new and delete[] so much. Also, if this is called as much as you say and the loop is this small, I would be concerned about the coherency of the data.
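If restructuring is an option, here is a minimal sketch of that idea (the name Fun1Counts is mine) that returns the same counts as a std::vector of std::array, so every allocation is released automatically:
#include <array>
#include <cctype>
#include <string>
#include <vector>

// Sketch: same counting logic as the question, but ownership is automatic.
std::vector<std::array<double, 4>> Fun1Counts(unsigned l, unsigned n,
                                              std::vector<int>& list,
                                              std::vector<std::string>& DataArray)
{
    const std::string alphabet = "ACGT";
    // value-initialized: every row starts as {0, 0, 0, 0}
    std::vector<std::array<double, 4>> counts(l);
    for (unsigned i = 0; i < l; i++)
        for (unsigned j = 0; j < n; ++j)
            for (int k = 0; k < 4; k++)
                if (std::toupper(static_cast<unsigned char>(DataArray[list[j]][i])) == alphabet[k])
                    counts[i][k] += 1.0;
    return counts; // no delete[] needed anywhere
}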
Up to a certain point it runs very smoothly, but after around 2-3 minutes
my system hangs
It's not a hang. If you are working on a Linux or Unix machine, check top (system performance): the virtual memory of your system is getting full. Do a delete or delete[], as appropriate, for every new.
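Concretely, for the function as posted, every call would need a matching cleanup like this sketch (the names result and l mirror the question):
double** result = Fun1(l, n, list, DataArray);
// ... use result ...
for (unsigned i = 0; i < l; i++)
    delete[] result[i];  // one delete[] per row allocated with new double[4]
delete[] result;         // and one for the array of row pointers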
I have a distance matrix D of size n by n and a constant L as input. I need to create a vector v containing all entries of D whose value is at most L. Here v must be in a specific order v = [v1 v2 .. vn], where vi contains the entries in the ith row of D with value at most L. The order of entries within each vi is not important.
I wonder whether there is a fast way to create v using vector, array, or any other data structure plus parallelization. What I did was use for loops, and it is very slow for large n.
vector<int> v;
for (int i = 0; i < n; ++i){
    for (int j = 0; j < n; ++j){
        if (D(i,j) <= L) v.push_back(j);
    }
}
The best way mostly depends on the context. If you are seeking GPU parallelization, you should take a look at OpenCL.
For CPU-based parallelization, the C++ standard <thread> library is probably your best bet, but you need to be careful:
Threads take time to create, so if n is relatively small (<1000 or so) they will slow you down.
D(i,j) has to be readable by multiple threads at the same time.
v has to be writable by multiple threads; a standard vector won't cut it.
v may be a 2D vector with the vi as its subvectors, but these have to be initialized before the parallelization:
std::vector<std::vector<int>> v;
v.reserve(n);
for(size_t i = 0; i < n; i++)
{
    v.push_back(std::vector<int>());
}
You need to decide how many threads you want to use. If this is for one machine only, hardcoding is a valid option. There is a function in the thread library that returns the number of supported threads, but it is more of a hint than something trustworthy.
size_t threadAmount = std::thread::hardware_concurrency(); // how many threads should run; this gives you a hint, but it's not necessarily optimal (and may return 0)
std::vector<std::thread> t; // to store the threads in
t.reserve(threadAmount - 1); // you need threadAmount-1 extra threads (we already have the main thread)
To start a thread you need a function it can execute. In this case, that function reads through part of your matrix.
// n and D are assumed to be accessible here (e.g. globals), as in the question
void CheckPart(size_t start, size_t amount, int L, std::vector<std::vector<int>>& vec)
{
    for(size_t i = start; i < amount + start; i++)
    {
        for(size_t j = 0; j < n; j++)
        {
            if(D(i,j) <= L)
            {
                vec[i].push_back(j); // each thread only writes to its own rows
            }
        }
    }
}
Now you need to split your matrix into parts of about n/threadAmount rows each and start the threads. The thread constructor needs a function and its parameters, but it will always try to copy the parameters, even if the function wants a reference. To prevent this, you need to force passing a reference with std::ref():
int i = 0;
int rows;
for(size_t a = 0; a < threadAmount - 1; a++)
{
    rows = n/threadAmount + ((n % threadAmount > a) ? 1 : 0);
    t.push_back(std::thread(CheckPart, i, rows, L, std::ref(v)));
    i += rows;
}
The threads are now running, and all that is left to do is run the last block on the main thread:
CheckPart(i, n - i, L, v); // process the remaining rows on the main thread
After that you need to wait for the threads to finish and clean them up:
for(unsigned int a = 0; a < threadAmount - 1; a++)
{
    if(t[a].joinable())
    {
        t[a].join();
    }
}
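Putting the pieces above together, here is a self-contained sketch; modeling D as a lambda over a flat array is my assumption, since the question never shows how D(i,j) is implemented:
#include <thread>
#include <vector>

int main() {
    const size_t n = 4000;
    const int L = 50;

    // stand-in for the distance matrix; any thread-safe read works
    std::vector<int> Ddata(n * n);
    for (size_t k = 0; k < Ddata.size(); k++) Ddata[k] = k % 100;
    auto D = [&](size_t i, size_t j) { return Ddata[i * n + j]; };

    std::vector<std::vector<int>> v(n); // one pre-made subvector per row

    size_t threadAmount = std::thread::hardware_concurrency();
    if (threadAmount == 0) threadAmount = 2; // the hint may be 0; fall back

    auto checkPart = [&](size_t start, size_t amount) {
        for (size_t i = start; i < start + amount; i++)
            for (size_t j = 0; j < n; j++)
                if (D(i, j) <= L) v[i].push_back(j); // each thread owns its rows
    };

    std::vector<std::thread> t;
    size_t i = 0;
    for (size_t a = 0; a < threadAmount - 1; a++) {
        size_t rows = n / threadAmount + ((n % threadAmount > a) ? 1 : 0);
        t.emplace_back(checkPart, i, rows);
        i += rows;
    }
    checkPart(i, n - i); // remaining rows on the main thread
    for (auto& th : t) th.join();
}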
Please note that this is just a quick and dirty example. Different problems might need different implementations, and since I can't guess the context, the help I can give is rather limited.
Taking the comments into consideration, I made the appropriate corrections (marked in emphasis).
Have you looked into tips for writing high-performance code, threading, asm instructions (if your generated assembly is not exactly what you want), and OpenCL for parallel processing? If not, I strongly recommend it!
In some cases, declaring all for-loop variables outside the for loop (to avoid declaring them many times) will make things faster, but not in this case (comment from our friend Paddy).
Also, using new instead of vector can be faster, as we see here: Using arrays or std::vectors in C++, what's the performance gap? - and I tested it: with vector it's 6 seconds slower than with new, which only takes 1 second. I guess the safety and ease-of-management guarantees that come with std::vector are not desired when someone is chasing performance, especially since using new is not so difficult; just avoid heap overflow with your calculations and remember to use delete[].
user4581301 is correct here, and the following statement is untrue: Finally, if you build D in an array instead of a matrix (or maybe if you copy D into a constant array, maybe...), it will be much more cache-friendly and will save one for-loop statement.
I'm struggling to understand the output I'm getting from gprof.
I have a simple wrapper class around a 2D array that I need to be contiguous in memory.
Its constructor and the method I use to access values are:
Array2d::Array2d(int size, double initialValue)
    : mSize(size) {
    data = new double *[size];
    data[0] = new double[size * size];
    for (int i = 1; i < size; ++i) {
        data[i] = data[0] + i * size;
    }
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            data[i][j] = initialValue;
        }
    }
}

double &Array2d::operator()(int i, int j) {
    return data[i][j];
}
In the numerical code I'm working on, this is one output I obtained from gprof:
  %    cumulative    self                  self     total
 time    seconds    seconds        calls  ms/call  ms/call  name
49.33      34.80      34.80  43507867293     0.00     0.00  Array2d::operator()(int, int)
18.05      47.53      12.73                                 jacobi(Array2d&, Array2d&, int, int, double, double, double, int)
I'm surprised to see that almost 50% of the running time is spent accessing values from the array.
This Array2d class replaced the use of std::vector<double>, which was much faster.
What am I failing to understand here?
I'm surprised to see that almost 50% of the running time is spent
accessing values from the array.
We cannot say much about this without seeing your code. It is easily possible to write code where a single call accounts for an even higher percentage of the runtime. Consider:
int main() {
    while(true){
        foo();
    }
}
A profiler will tell you that close to 100% of the runtime is spent in foo. Does that mean foo is slow? No, we don't know.
The percentages you get from the profiler rather give you a hint as to where the hot spots are in your code. If you know that 50% of the time is spent in one particular function call, then you know that it is a good candidate for performance improvement. If you optimize this one function call, you can achieve a speedup of up to x2 (cf. Amdahl's law).
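To spell out that x2 bound (this is just Amdahl's law written out, with p the profiled fraction and s the speedup you achieve on that part):
S = 1 / ((1 - p) + p / s)
With p = 0.5, even letting s go to infinity gives S = 1 / 0.5 = 2, which is where the "up to x2" comes from.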
On the other hand, a function that uses only 0.1% of the total runtime can be made 1000 times faster without a significant impact on the total runtime.
Whether your element access is slow or fast, you can only know if you implement a second variant, leave everything else in the code as it is, and repeat the profiling. The variant that leads to a higher percentage performs worse.
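For example, a hypothetical second variant of the accessor (the name at is mine) could index the contiguous block directly, reusing the data and mSize members set up in the constructor above; swap it in, re-profile, and compare:
// Hypothetical alternative accessor: one multiply-add into the contiguous
// block, avoiding the extra pointer load through data[i].
double &Array2d::at(int i, int j) {
    return data[0][i * mSize + j];
}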
I've run into a strange situation.
In my program I have a loop that combines a bunch of data together into a giant vector. I was trying to figure out why it was running so slowly, even though it seemed like I was doing everything right to allocate memory efficiently on the fly.
In my program it is difficult to determine how big the final vector of combined data should be, but the size of each piece of data is known as it is processed. So instead of reserving and resizing the combined-data vector in one go, I was reserving enough space for each data chunk as it was added to the larger vector. That's when I ran into this issue, which is reproducible using the simple snippet below:
std::vector<float> arr1;
std::vector<float> arr2;
std::vector<float> arr3;
std::vector<float> arr4;
int numLoops = 10000;
int numSubloops = 50;
{
    // Test 1
    // Naive test where no pre-allocation occurs
    for (int q = 0; q < numLoops; q++)
    {
        for (int g = 0; g < numSubloops; g++)
        {
            arr1.push_back(q * g);
        }
    }
}
{
    // Test 2
    // Ideal situation where the total amount of data is reserved beforehand
    arr2.reserve(numLoops * numSubloops);
    for (int q = 0; q < numLoops; q++)
    {
        for (int g = 0; g < numSubloops; g++)
        {
            arr2.push_back(q * g);
        }
    }
}
{
    // Test 3
    // Total data is not known beforehand, so allocations are made for each
    // data chunk as it is processed, using the 'resize' method
    int arrInx = 0;
    for (int q = 0; q < numLoops; q++)
    {
        arr3.resize(arr3.size() + numSubloops);
        for (int g = 0; g < numSubloops; g++)
        {
            arr3[arrInx++] = q * g;
        }
    }
}
{
    // Test 4
    // Total data is not known beforehand, so allocations are made for each
    // data chunk as it is processed, using the 'reserve' method
    for (int q = 0; q < numLoops; q++)
    {
        arr4.reserve(arr4.size() + numSubloops);
        for (int g = 0; g < numSubloops; g++)
        {
            arr4.push_back(q * g);
        }
    }
}
The results of this test, after compilation in Visual Studio 2017, are as follows:
Test 1: 7 ms
Test 2: 3 ms
Test 3: 4 ms
Test 4: 4000 ms
Why is there such a huge discrepancy in running times?
Why does calling reserve a bunch of times, followed by push_back, take 1000x longer than calling resize a bunch of times, followed by direct index access?
How does it make any sense that it could take 500x longer than the naive approach, which includes no pre-allocations at all?
How does it make any sense that it could take 500x longer than the
naive approach, which includes no pre-allocations at all?
That's where you're mistaken. The 'naive' approach you speak of does do pre-allocations. They're just done behind the scenes, and infrequently, in the call to push_back. It doesn't allocate room for just one more element every time you call push_back; it allocates some amount that is a factor (usually between 1.5x and 2x) of the current capacity, and then it doesn't need to allocate again until that capacity runs out. This is much more efficient than your loop, which does an allocation every time 50 elements are added, with no regard for the current capacity.
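A quick sketch you can run to watch this happen (the exact growth factor is implementation-defined, so the printed capacities will differ between standard libraries):
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    std::size_t lastCapacity = 0;
    for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != lastCapacity) {
            // prints only when a reallocation occurred
            std::cout << "size " << v.size()
                      << " -> capacity " << v.capacity() << '\n';
            lastCapacity = v.capacity();
        }
    }
}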
@Benjamin Lindley's answer explains the capacity behavior of std::vector. However, exactly why the 4th test case is that slow is in fact an implementation detail of the standard library.
[vector.capacity]
void reserve(size_type n);
...
Effects: A directive that informs a vector of a planned change in size, so that it can manage the storage allocation accordingly. After reserve(), capacity() is greater or equal to the argument of reserve if reallocation happens; and equal to the previous value of capacity() otherwise. Reallocation happens at this point if and only if the current capacity is less than the argument of reserve().
Thus it is not guaranteed by the C++ standard that after reserve() for a larger capacity, the actual capacity will be anything more than the requested one. Personally I think it's not unreasonable for an implementation to follow some specific growth policy when such a larger-capacity request is received. However, I also tested on my machine, and it seems this STL just does the simplest thing: it reallocates to exactly the requested capacity, so every reserve(arr4.size() + numSubloops) call in Test 4 triggers a fresh reallocation and a copy of all existing elements.
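A rough sanity check under that exact-fit assumption: Test 4 reallocates on every one of the 10000 outer iterations and copies everything accumulated so far, i.e. about 50 * (0 + 1 + ... + 9999), roughly 2.5 billion element copies, versus a single up-front allocation in Test 2. That is the right order of magnitude for the observed ~1000x gap.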
(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)
The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges), and I try to create 2 threads, each one sorting the first half and the second half respectively, by calling the function merge_sort() with the respective parameters.
On the threaded run I see the process going to 80-100% CPU usage (on a dual-core CPU), while on the no-threads run it only stays at 50%, yet the run times are very close.
This is the (relevant) code:
//These are the 2 sorting functions; each thread will call merge_sort(..). Is this a problem? Both threads calling the same (normal) function?
void merge (int *v, int start, int middle, int end) {
    //dynamically creates 2 new arrays for v[start..middle] and v[middle+1..end]
    //copies the original values into the 2 halves
    //then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
    //recursively calls merge_sort(v, start, (start+end)/2) and merge_sort(v, (start+end)/2+1, end) to sort the halves
    //calls merge(v, start, middle, end)
}
//here I'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code, to find the bug more easily)
void* mergesort_t2(void * arg) {
    t_data* th_info = (t_data*)arg;
    merge_sort(v, th_info->a, th_info->b);
    return (void*)0;
}
//in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
    //some stuff
    //getting the clock to calculate run time
    clock_t t_inceput, t_sfarsit;
    t_inceput = clock();
    //ignore crt_depth for this example (in the full code I'm recursively creating new threads and I need this to know when to stop)
    //a and b are the range of values the created thread will have to sort
    pthread_t thread[2];
    t_data next_info[2];
    next_info[0].crt_depth = 1;
    next_info[0].a = 0;
    next_info[0].b = n/2;
    next_info[1].crt_depth = 1;
    next_info[1].a = n/2+1;
    next_info[1].b = n-1;
    for (int i=0; i<2; i++) {
        if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
            cerr<<"error\n";
            return err;
        }
    }
    for (int i=0; i<2; i++) {
        if (pthread_join(thread[i], &status) != 0) {
            cerr<<"error\n";
            return err;
        }
    }
    //now I merge the 2 sorted halves
    merge(v, 0, n/2, n-1);
    //calculate end time
    t_sfarsit = clock();
    cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
    delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.
I've used the inplace_merge(..) function instead of my own, and the program run times are as they should be now.
Here's my initial merge function (not really sure if it is the initial one; I've probably modified it a few times since, and the array indices might be wrong right now, as I went back and forth between [a,b] and [a,b); this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { //sorts v[a..m] and v[m+1..c] into v[a..c]
    //create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];
    //copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];
    //merge them back together in sorted order
    int is=0, id=0;
    for (int i=0; i<=c-a; i++) {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
    delete[] st;
    delete[] dr;
}
All this was replaced with:
inplace_merge(v+a, v+m+1, v+c+1); // half-open ranges: merges v[a..m] and v[m+1..c]
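For reference, std::inplace_merge takes half-open ranges, with the middle iterator pointing at the first element of the second half, which is why the inclusive bounds above need the +1 adjustments. A tiny standalone check:
#include <algorithm>
#include <iostream>

int main() {
    int v[] = {1, 4, 9, 2, 3, 8};          // v[0..2] and v[3..5] each sorted
    std::inplace_merge(v, v + 3, v + 6);   // middle = first element of second half
    for (int x : v) std::cout << x << ' '; // prints: 1 2 3 4 8 9
}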
Edit: some timings on my 3 GHz dual-core CPU:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays [...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular, the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten opening clause of any big-O complexity statement: "There is an n0 such that for all n > n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
Note: since the OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for the sake of those who might find the information useful.
clock() is the wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in the case of multiple threads is the sum of the CPU time of all threads. You need to measure elapsed, or wall-clock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells which API can be used instead of clock().
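For illustration, a minimal sketch of measuring wall-clock time with std::chrono instead (steady_clock is the usual choice for intervals; where exactly you place the timed region is up to you):
#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();
    // ... run the threaded sort here ...
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start; // wall-clock seconds
    std::cout << "Sort time (s): " << elapsed.count() << '\n';
}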
In the MPI-based implementation that you are trying to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included, so the CPU time is close to the wall-clock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one thing, if a program waits for e.g. a network event or a message from another MPI process, it still spends time, just not CPU time.
Update: in Microsoft's implementation of the C run-time library, clock() returns wall-clock time, so it is OK to use for your purpose. It's unclear, though, whether you are using Microsoft's toolchain or something else, like Cygwin or MinGW.
I am doing some audio processing and am therefore mixing some C and Objective-C. I have set up a class that handles my OpenAL interface and my audio processing. I have changed the class's file suffix to
.mm
...as described in the Core Audio book and in many examples online.
I have a C-style function declared in the .h file and implemented in the .mm file:
static void granularizeWithData(float *inBuffer, unsigned long int total) {
    // create grains of audio data from a buffer read in using the ExtAudioFileRead() method.
    // total value is: 235377
    float tmpArr[total];
    // now I try to zero-pad a new buffer:
    for (int j = 1; j <= 100; j++) {
        tmpArr[j] = 0;
        // CRASH on first iteration: EXC_BAD_ACCESS (code=1, address= ...blahblah)
    }
}
Strange??? Yes, I am totally out of ideas as to why THAT doesn't work, but the FOLLOWING works:
float tmpArr[235377];
for (int j = 1; j <= 100; j++) {
    tmpArr[j] = 0;
    // This works, and indices 1 through 100 are filled with zeros
}
Does anyone have any clue as to why I can't declare an array of size 'total', which is an integer value? My project uses ARC, but I don't see why this would cause a problem. When I print the value of 'total' while debugging, it is in fact the correct value. If anyone has any ideas, please help; it is driving me nuts!
The problem is that the array gets allocated on the stack and not on the heap. Stack size is limited, so you can't allocate an array of 235377*sizeof(float) bytes on it; it's too large. Use the heap instead:
float *tmpArray = NULL;
tmpArray = (float *) calloc(total, sizeof(float)); // allocate it
// test that you actually got the memory you asked for
if (tmpArray)
{
    // use it
    free(tmpArray); // release it
}
Mind that you are always responsible for freeing memory allocated on the heap, or you will create a leak.
In your second example, since the size is known a priori, the compiler can set that space aside ahead of time, which allows it to work. But in your first example it must do it on the fly, which causes the error. In any case, before being sure that your second example works, you should try accessing all the elements of the array, not just the first 100.
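Since the file is already Objective-C++ (.mm), another option is std::vector, which allocates on the heap and frees itself automatically; a minimal sketch reusing the question's function signature (the body is my assumption):
#include <vector>

static void granularizeWithData(float *inBuffer, unsigned long int total) {
    // heap-allocated and zero-initialized; freed automatically when it goes out of scope
    std::vector<float> tmpArr(total, 0.0f);
    // ... create grains from inBuffer using tmpArr ...
}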