I've been comparing two ways of finding the highest value in a matrix (if the maximum is duplicated, one of the duplicates is chosen at random): single-threaded vs. multi-threaded. Typically the multi-threaded version should be faster, assuming I coded it properly. It isn't; it is far slower, so I can only assume I did something wrong. Can someone pinpoint what I did wrong?
Note: I know I shouldn't use rand(), but for this purpose I feel there aren't that many problems with doing so; I'll replace it with a mt19937_64 once this is working properly.
Thanks in advance!
double* RLPolicy::GetActionWithMaxQ(std::tuple<int, double*, int*, int*, int, double*>* state, long* selectedActionIndex, bool& isActionQZero)
{
const bool useMultithreading = true;
double* qIterator = Utilities::DiscretizeStateActionPairToStart(_world->GetCurrentStatePointer(), (long*)&(std::get<0>(*state)));
// Represents the action-pointer for which Q-values are duplicated
// Note: A shared_ptr is used instead of a unique_ptr since C++11 wont support unique_ptrs for pointers to pointers **
static std::shared_ptr<double*> duplicatedQValues(new double*[*_world->GetActionsNumber()], std::default_delete<double*>());
/*[](double** obj) {
delete[] obj;
});*/
static double* const defaultAction = _actionsListing.get();// [0];
double* actionOut = defaultAction; //default action
static double** const duplicatedQsDefault = duplicatedQValues.get();
if (!useMultithreading)
{
const double* const qSectionEnd = qIterator + *_world->GetActionsNumber() - 1;
double* largestValue = qIterator;
int currentActionIterator = 0;
long duplicatedIndex = -1;
do {
if (*qIterator > *largestValue)
{
largestValue = qIterator;
actionOut = defaultAction + currentActionIterator;
*selectedActionIndex = currentActionIterator;
duplicatedIndex = -1;
}
// duplicated value, map it
else if (*qIterator == *largestValue)
{
++duplicatedIndex;
*(duplicatedQsDefault + duplicatedIndex) = defaultAction + currentActionIterator;
}
++currentActionIterator;
++qIterator;
} while (qIterator != qSectionEnd);
// If duped (equal) values are found, select among them randomly with equal probability
if (duplicatedIndex >= 0)
{
*selectedActionIndex = (std::rand() % duplicatedIndex);
actionOut = *(duplicatedQsDefault + *selectedActionIndex);
}
isActionQZero = *largestValue == 0;
return actionOut;
}
else
{
static const long numberOfSections = 6;
unsigned int actionsPerSection = *_world->GetActionsNumber() / numberOfSections;
unsigned long currentSectionStart = 0;
static double* actionsListing = _actionsListing.get();
long currentFoundResult = FindActionWithMaxQInMatrixSection(qIterator, 0, actionsPerSection, duplicatedQsDefault, actionsListing);
static std::vector<std::future<long>> maxActions;
for (int i(0); i < numberOfSections - 1; ++i)
{
currentSectionStart += actionsPerSection;
maxActions.push_back(std::async(&RLPolicy::FindActionWithMaxQInMatrixSection, std::ref(qIterator), currentSectionStart, std::ref(actionsPerSection), std::ref(duplicatedQsDefault), actionsListing));
}
long foundActionIndex;
actionOut = actionsListing + currentFoundResult;
for (auto &f : maxActions)
{
f.wait();
foundActionIndex = f.get();
if (actionOut == nullptr)
actionOut = defaultAction;
else if (*(actionsListing + foundActionIndex) > *actionOut)
actionOut = actionsListing + foundActionIndex;
}
maxActions.clear();
return actionOut;
}
}
/*
Deploy a thread to find the action with the highest Q-value for the provided Q-Matrix section.
#return - The index of the action (on _actionListing) which contains the highest Q-value.
*/
long RLPolicy::FindActionWithMaxQInMatrixSection(double* qMatrix, long sectionStart, long sectionLength, double** dupListing, double* actionListing)
{
double* const matrixSectionStart = qMatrix + sectionStart;
double* const matrixSectionEnd = matrixSectionStart + sectionLength;
double** duplicatedSectionStart = dupListing + sectionLength;
static double* const defaultAction = actionListing;
long maxValue = sectionLength;
long maxActionIndex = 0;
double* qIterator = matrixSectionStart;
double* largestValue = matrixSectionStart;
long currentActionIterator = 0;
long duplicatedIndex = -1;
do {
if (*qIterator > *largestValue)
{
largestValue = qIterator;
maxActionIndex = currentActionIterator;
duplicatedIndex = -1;
}
// duplicated value, map it
else if (*qIterator == *largestValue)
{
++duplicatedIndex;
*(duplicatedSectionStart + duplicatedIndex) = defaultAction + currentActionIterator;
}
++currentActionIterator;
++qIterator;
} while (qIterator != matrixSectionEnd);
// If duped (equal) values are found, select among them randomly with equal probability
if (duplicatedIndex >= 0)
{
maxActionIndex = (std::rand() % duplicatedIndex);
}
return maxActionIndex;
}
Parallel programs are not necessarily faster than serial programs. There are both fixed and variable time costs to setting up parallel algorithms, and for small and/or simple problems this parallel overhead can be greater than the cost of the serial algorithm as a whole. Examples of parallel overhead include thread spawning and synchronization, extra memory copying, and memory bus pressure. At around 2 microseconds for your serial program and around 500 microseconds for your parallel program, it's likely your matrix is small enough that the work of setting up the parallel algorithm overshadows the work of solving the matrix problem.
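One way to act on this is to take the parallel path only when the input is large enough to pay for the task launches. The following is a minimal sketch of that idea, not the asker's code: ArgMaxSerial, ArgMax, the chunk count and kParallelThreshold are all hypothetical names and values, and the cutoff has to be measured on the target machine. For brevity, ties are resolved to the lowest index rather than randomly.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical cutoff: below this many elements, stay serial.
// The right value is machine-dependent and has to be measured.
constexpr std::size_t kParallelThreshold = 1u << 20;

// Index of the largest element in [first, first + n); ties go to the lowest index.
std::size_t ArgMaxSerial(const double* first, std::size_t n)
{
    return static_cast<std::size_t>(std::max_element(first, first + n) - first);
}

// Splits the range into `chunks` pieces only when it is large enough to
// amortize the cost of launching tasks; otherwise uses the serial scan.
std::size_t ArgMax(const double* first, std::size_t n, std::size_t chunks = 4,
                   std::size_t threshold = kParallelThreshold)
{
    if (n < threshold || chunks < 2)
        return ArgMaxSerial(first, n);

    const std::size_t chunkLen = n / chunks;
    std::vector<std::future<std::size_t>> parts;
    for (std::size_t c = 1; c < chunks; ++c) {
        const std::size_t begin = c * chunkLen;
        // The last chunk also takes the remainder of the division.
        const std::size_t len = (c + 1 == chunks) ? n - begin : chunkLen;
        parts.push_back(std::async(std::launch::async, [=] {
            return begin + ArgMaxSerial(first + begin, len);
        }));
    }
    // The calling thread handles the first chunk while the others run.
    std::size_t best = ArgMaxSerial(first, chunkLen);
    for (auto& f : parts) {
        const std::size_t candidate = f.get();
        if (first[candidate] > first[best])
            best = candidate;
    }
    return best;
}
```

With a few hundred actions per state, a call such as ArgMax(q, n) would stay on the serial branch, which is consistent with the timings quoted above.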
I am a C++ newbie.
Context: I found this third-party snippet of code that seems to work, but based on my (very limited) knowledge of C++ I suspect it will cause problems. The snippet is as follows:
int aVariable;
int anInt = 1;
int anotherInt = 2;
int lastInt = 3;
aVariable = CHAIN(anInt, anotherInt, lastInt);
Where CHAIN is defined as follows (this is part of a library):
int CHAIN(){ Map(&CHAIN, MakeProcInstance(&_CHAIN), MAP_IPTR_VPN); }
int _CHAIN(int i, int np, int p){ return ASMAlloc(np, p, &chainproc); }
int keyalloc[16384], kpos, alloc_locked, tmp[4];
int ASMAlloc(int np, int p, alias proc)
{
int v, x;
// if(alloc_locked) return 0 & printf("WARNING: you can declare compound key statements (SEQ, CHAIN, EXEC, TEMPO, AXIS) only inside main() call, and not during an event.\xa");
v = elements(&keyalloc) - kpos - 4;
if(v < np | !np) return 0; // not enough allocation space or no parameters
Map(&v, p); Dim(&v, np); // v = params array
keyalloc[kpos] = np + 4; // size
keyalloc[kpos+1] = &proc; // function
keyalloc[kpos+2] = kpos + 2 + np; // parameters index
while(x < np)
{
keyalloc[kpos+3+x] = v[x];
x = x+1;
}
keyalloc[kpos+3+np] = kpos + 3 | JUMP;
x = ASMFind(kpos);
if(x == kpos) kpos = kpos + np + 4;
return x + 1 | PROC; // skip block size
}
int ASMFind(int x)
{
int i, j, k; while(i < x)
{
k = i + keyalloc[i]; // next
if(keyalloc[i] == keyalloc[x]) // size
if(keyalloc[i+1] == keyalloc[x+1]) // proc
{
j = x-i;
i = i+3;
while(keyalloc[i] == keyalloc[j+i]) i = i+1; // param
if((keyalloc[i] & 0xffff0000) == JUMP) return x-j;
}
i = k;
}
return x;
}
EDIT:
The weird thing is that running
CHAIN(aVariable);
effectively executes
CHAIN(anInt, anotherInt, lastInt);
Somehow. This is what led me to believe that aVariable is, in fact, a pointer.
QUESTION:
Is it correct to store a parametrized function call into an integer variable like so? Does "aVariable" work just as a pointer, or is this likely to corrupt random memory areas?
You're calling a function (through an obfuscated interface), and storing the result in an integer. It might or might not cause problems, depending on how you use the value / what you expect it to mean.
Your example contains too many undefined symbols for the reader to provide any better answer.
Also, I think this is C, not C++ code.
I am new to threading in C++, but I've done enough reading to at least get what I'm working on to compile. So far it hasn't improved performance at all. Right now it creates as many threads as there are loop iterations, and I can imagine that could quickly cause the system to thrash. Is there a better alternative to brute-force control of the number of threads? I also intend to run this on the WestGrid computing system, where I can specify the number of processors to use. What is the best way to set the number of threads to optimize for the number of processors?
void ExecuteCRTProcess(const long &numberOfRows, ZZ* rij, const ZZ &powRoh, const int &rowLength, long* PublicKey, ZZ* rQ0, const ZZ &Q0, ZZ* primes, const ZZ &productOfPrimes, ZZ* resultsArray, const bool IsItPrimeArray, long* caseStudy, long* caseStudyTracker, const ZZ &X0, const long &Roh){
int rc;
pthread_t threads[numberOfRows];
struct parameters ThreadParameters[numberOfRows];
for(int i = 0; i< numberOfRows ; i++){
FillR(rij,powRoh,rowLength); // fill up the vector rij with random numbers between 0 powRoh
MultiplyVectorByTwo(rij, rowLength, i, IsItPrimeArray); //Multiply rij vector by 2. If calculating Xi' also add 1 to appropriate values.
ThreadParameters[i].rij = rij;
ThreadParameters[i].rQ0 = rQ0;
ThreadParameters[i].primes = primes;
ThreadParameters[i].rowLength = rowLength;
ThreadParameters[i].Q0 = Q0;
ThreadParameters[i].i = i;
ThreadParameters[i].X0 = X0;
rc = pthread_create(&threads[i],NULL,CRTNew,(void *)&ThreadParameters[i]);
if(rc){
cout << "Error: unable to create thread, " << rc << endl;
exit(-1);
}
/*for(long j = 0; j< rowLength; j++){
cout << (resultsArray[i] % primes[j]) << " ";
}
cout << endl;*/
}
for(int i = 0; i< numberOfRows; i++){
pthread_join(threads[i], NULL);
resultsArray[i] = ThreadParameters[i].result;
}
}
The threads created run this function
void* CRTNew(void *threadArg){
struct parameters *local_data;
local_data = (struct parameters *) threadArg;
ZZ a, p, A, P, crt;
long Z, Public;
a = local_data->rQ0[local_data->i];
p = local_data->Q0;
A = local_data->rij[0];
P = local_data->primes[0];
for(int i = 1; i<=local_data->rowLength; i++){
A = A%P;
Z = CRT(a, p, A, P);
A = local_data->rij[i]; P = local_data->primes[i];
if(i == local_data->rowLength) Public = Z;
}
if(a < 0) crt = a+p;
else crt = a%p;
local_data->result = crt%local_data->X0;
pthread_exit(NULL);
}
'What is the best way to set the number of threads to optimize for the number of processors?'
1) Create [num of cores] threads (or maybe a few more) at app startup.
2) Never create any more threads.
3) Never let the threads terminate.
4) Have them wait for work tasks on a producer-consumer queue, in the manner of a pool.
Alternatively, use a thread pool class or equivalent parallel language feature that already works.
I have a huge vector<vector<int>> (18M x 128). Frequently I want to take 2 rows of this vector and compare them by this function:
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
for (int k = 0; k < 128; k += 2) {
pplus = nodeL[indx2][k] - nodeL[indx1][k];
pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
As you can see, the function loops through the two row vectors, does some subtraction, and at the end returns a maximum. This function will be called millions of times, so I was wondering if it can be accelerated with SSE instructions. I use Ubuntu 12.04 and gcc.
Of course it is a micro-optimization, but it would be helpful if you could provide some help, since I know nothing about SSE. Thanks in advance.
Benchmark:
int nofTestCases = 10000000;
vector<int> nodeIds(nofTestCases);
vector<int> goalNodeIds(nofTestCases);
vector<int> results(nofTestCases);
for (int l = 0; l < nofTestCases; l++) {
nodeIds[l] = randomNodeID(18000000);
goalNodeIds[l] = randomNodeID(18000000);
}
double time, result;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
where
int randomNodeID(int n) {
return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
}
/** Returns a timestamp ('now') in seconds (incl. a fractional part). */
inline double timestamp() {
struct timeval tp;
gettimeofday(&tp, NULL);
return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:
#include <smmintrin.h>
int getDiff_SSE(int indx1, int indx2)
{
int result[4] __attribute__ ((aligned(16))) = { 0 };
const int * const p1 = &nodeL[indx1][0];
const int * const p2 = &nodeL[indx2][0];
const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);
__m128i vresult = _mm_set1_epi32(0);
for (int k = 0; k < 128; k += 4)
{
__m128i v1, v2, vmax;
v1 = _mm_loadu_si128((__m128i *)&p1[k]);
v2 = _mm_loadu_si128((__m128i *)&p2[k]);
v1 = _mm_xor_si128(v1, vke);
v2 = _mm_xor_si128(v2, vko);
v1 = _mm_sub_epi32(v1, vke);
v2 = _mm_sub_epi32(v2, vko);
vmax = _mm_add_epi32(v1, v2);
vresult = _mm_max_epi32(vresult, vmax);
}
_mm_store_si128((__m128i *)result, vresult);
return max(max(max(result[0], result[1]), result[2]), result[3]);
}
You can probably get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason is that there is a lot of memory access compared to computation. The CPU is much faster than the memory, and a trivial implementation of the above will already have the CPU stalling while it waits for data to arrive over the system bus. Making the CPU faster will just increase the amount of waiting it does.
The declaration of nodeL can have an effect on the performance, so it's important to choose an efficient container for your data.
There is a threshold where optimising does have a benefit, and that's when you're doing more computation between memory reads, i.e. the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.
It can be helpful, however, to optimise the code if you've got non-memory-constrained tasks that can run in parallel, so that the CPU is kept busy whilst waiting for the data.
This will be faster. Double dereference of vector of vectors is expensive. Caching one of the dereferences will help. I know it's not answering the posted question but I think it will be a more helpful answer.
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
const vector<int>& nodetemp1 = nodeL[indx1];
const vector<int>& nodetemp2 = nodeL[indx2];
for (int k = 0; k < 128; k += 2) {
pplus = nodetemp2[k] - nodetemp1[k];
pminus = nodetemp1[k + 1] - nodetemp2[k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
A couple of things to look at. One is the amount of data you are passing around. That will cause a bigger issue than the trivial calculation.
I've tried to rewrite it with SIMD instructions (AVX) using a vector class library.
The original code on my system ran in 11.5s
With Neil Kirk's optimisation, it went down to 10.5s
EDIT: Tested the code with a debugger rather than in my head!
int getDiff(std::vector<std::vector<int>>& nodeL,int row1, int row2) {
Vec4i result(0);
const std::vector<int>& nodetemp1 = nodeL[row1];
const std::vector<int>& nodetemp2 = nodeL[row2];
Vec8i mask(-1,0,-1,0,-1,0,-1,0);
for (int k = 0; k < 128; k += 8) {
Vec8i nodeA(nodetemp1[k],nodetemp1[k+1],nodetemp1[k+2],nodetemp1[k+3],nodetemp1[k+4],nodetemp1[k+5],nodetemp1[k+6],nodetemp1[k+7]);
Vec8i nodeB(nodetemp2[k],nodetemp2[k+1],nodetemp2[k+2],nodetemp2[k+3],nodetemp2[k+4],nodetemp2[k+5],nodetemp2[k+6],nodetemp2[k+7]);
Vec8i tmp = select(mask,nodeB-nodeA,nodeA-nodeB);
Vec4i tmp_a(tmp[0],tmp[2],tmp[4],tmp[6]);
Vec4i tmp_b(tmp[1],tmp[3],tmp[5],tmp[7]);
Vec4i max_tmp = max(tmp_a,tmp_b);
result = select(max_tmp > result,max_tmp,result);
}
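// The answer is the largest lane, not the sum of the lanes
return std::max(std::max(result[0], result[1]), std::max(result[2], result[3]));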
}
The lack of branching speeds it up to 9.5s but still data is the biggest impact.
If you want to speed it up more, try to change the data structure to a single array/vector rather than a 2D one (à la std::vector<std::vector<int>>), as that will reduce cache pressure.
EDIT
I thought of something: you could add a custom allocator to ensure you allocate the 2*18M vectors in a contiguous block of memory, which allows you to keep the data structure and still go through it quickly. But you'd need to profile it to be sure.
EDIT 2: Tested the code with a debugger rather than in my head!
Sorry Alex, this should be better. Not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single array approach. Give this a go though.
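The single-array approach mentioned above could look roughly like this. It is a sketch under assumptions, not measured code: FlatNodes, Row, getDiffFlat and kRowLength are hypothetical names, and the rows are assumed to be exactly 128 ints long as in the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kRowLength = 128;  // entries per row, as in the question

// Row-major flat storage: row r occupies [r * kRowLength, (r + 1) * kRowLength).
struct FlatNodes {
    std::vector<int> data;
    explicit FlatNodes(std::size_t rows) : data(rows * kRowLength, 0) {}
    int* Row(std::size_t r) { return data.data() + r * kRowLength; }
    const int* Row(std::size_t r) const { return data.data() + r * kRowLength; }
};

// Same computation as the scalar getDiff, but each row is reached with one
// multiplication and one addition instead of a second pointer dereference.
int getDiffFlat(const FlatNodes& nodes, std::size_t indx1, std::size_t indx2)
{
    const int* a = nodes.Row(indx1);
    const int* b = nodes.Row(indx2);
    int result = 0;
    for (std::size_t k = 0; k < kRowLength; k += 2) {
        const int pplus = b[k] - a[k];
        const int pminus = a[k + 1] - b[k + 1];
        result = std::max(result, std::max(pplus, pminus));
    }
    return result;
}
```

The whole table is then one allocation, so there is no per-row heap header and no scattered row placement. Random row pairs will still miss the cache, so the benefit has to be confirmed with the benchmark from the question.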
I am making a 3D application where a boat has to drive through buoy tracks. I also need to store the tracks in groups or "layouts". The buoys class is basically a list of "buoy layouts" inside of which is a list of "buoy tracks", inside of which is a list of buoys.
I checked the local variable watcher and all memory allocations in the constructor appear to work. Later, when the calculateCoordinates function is called, it enters a for loop. On the first iteration of the for loop the function pointer is used and works fine, but then on this line
ctMain[j+1][1] = 0;
the function pointers are set to NULL. I am guessing it has something to do with the structs not being allocated or addressed correctly. I am not sure what to do from here. Maybe I am not understanding how malloc works.
Update
I replaced the M3DVector3d main_track with double ** main_track, thinking maybe malloc is not handling the typedefs correctly. But I am getting the same error when trying to access the main_track variable later in calculateCoordinates.
Update
It ended up being memory corruption caused by accessing a pointer wrong in the line
rotatePointD(&(cTrack->main_track[j]), rotation);
It only led to an error later when I tried to access it.
// Buoys.h
////////////////////////////////////////////
struct buoy_layout_t;
struct buoy_track_t;
typedef double M3DVector3d[3];
class Buoys {
public:
Buoys();
struct buoy_layout_t ** buoyLayouts;
int nTrackLayouts;
int currentLayoutID;
void calculateCoordinates();
};
struct buoy_track_t {
int nMain, nYellow, nDistract;
M3DVector3d * main_track,
yellow_buoys,
distraction_buoys;
double (*f)(double x);
double (*fp)(double x);
double thickness;
M3DVector3d start, end;
};
struct buoy_layout_t {
int nTracks;
buoy_track_t ** tracks;
};
// Buoys.cpp
/////////////////////////////
// polynomial and its derivative, for shape of track
double buoyfun1(double x) {return (1.0/292.0)*x*(x-12.0)*(x-24.0);}
double buoyfun1d(double x) {return (1.0/292.0)*((3.0*pow(x,2))-(72.0*x)+288.0);}
// ... rest of buoy shape functions go here ...
Buoys::Buoys() {
struct buoy_layout_t * cLayout;
struct buoy_track_t * cTrack;
nTrackLayouts = 1;
buoyLayouts = (buoy_layout_t **) malloc(nTrackLayouts*sizeof(*buoyLayouts));
for (int i = 0; i < nTrackLayouts; i++) {
buoyLayouts[i] = (buoy_layout_t *) malloc(sizeof(*(buoyLayouts[0])));
}
currentLayoutID = 0;
// ** Layout 1 **
cLayout = buoyLayouts[0];
cLayout->nTracks = 1;
cLayout->tracks = (buoy_track_t **) malloc(sizeof(*(cLayout->tracks)));
for (int i = 0; i < 1; i++) {
cLayout->tracks[i] = (buoy_track_t *) malloc (sizeof(*(cLayout->tracks)));
}
cTrack = cLayout->tracks[0];
cTrack->main_track = (M3DVector3d *) malloc(30*sizeof(*(cTrack->main_track)));
cTrack->nMain = 30;
cTrack->f = buoyfun1;
cTrack->fp = buoyfun1d;
cTrack->thickness = 5.5;
cTrack->start[0] = 0; cTrack->start[1] = 0; cTrack->start[2] = 0;
cTrack->end[0] = 30; cTrack->end[1] = 0; cTrack->end[2] = -19;
// ... initialize rest of layouts here ...
// ** Layout 2 **
// ** Layout 3 **
// ...
// ** Layout N **
calculateCoordinates();
}
void Buoys::calculateCoordinates()
{
int i, j;
buoy_layout_t * cLayout = buoyLayouts[0];
for (i = 0; i < (cLayout->nTracks); i++) {
buoy_track_t * cTrack = cLayout->tracks[i];
M3DVector3d * ctMain = cTrack->main_track;
double thickness = cTrack->thickness;
double rotation = getAngleD(cTrack->start[0], cTrack->start[2],
cTrack->end[0], cTrack->end[2]);
double full_disp = sqrt(pow((cTrack->end[0] - cTrack->start[0]), 2)
+ pow((cTrack->end[2] - cTrack->start[2]), 2));
// nBuoys is nBuoys per side. So one side has nBuoys/2 buoys.
for (j=0; j < cTrack->nMain; j+=2) {
double id = j*((full_disp)/(cTrack->nMain));
double y = (*(cTrack->f))(id);
double yp = (*(cTrack->fp))(id);
double normal, normal_a;
if (yp!=0) {
normal = -1.0/yp;
}
else {
normal = 999999999;
}
if (normal > 0) {
normal_a = atan(normal);
}
else {
normal_a = atan(normal) + PI;
}
ctMain[j][0] = id + ((thickness/2.0)*cos(normal_a));
ctMain[j][1] = 0;
ctMain[j][2] = y + ((thickness/2.0)*sin(normal_a));
ctMain[j+1][0] = id + ((thickness/2.0)*cos(normal_a+PI));
ctMain[j+1][1] = 0; // function pointers get set to null here
ctMain[j+1][2] = y + ((thickness/2.0)*sin(normal_a+PI));
}
for (j=0; j < cTrack->nMain; j++) {
rotatePointD(&(cTrack->main_track[j]), rotation);
}
}
}
Unless you are required to practice pointers or cannot use the STL, I'd strongly recommend using more of the STL, given that you are writing C++; it is your friend. But anyway...
First, the type of ctMain is M3DVector3d *. So you can safely access ctMain[0], but you cannot access ctMain[1]. Maybe you meant for the type of ctMain to be M3DVector3d **, in which case the initialization line you had written, which is:
cTrack->main_track = (M3DVector3d *) malloc(30*sizeof(*(cTrack->main_track)));
would make sense.
More Notes
Why are you allocating 30 of these here?
cTrack->main_track = (M3DVector3d *) malloc(30*sizeof(*(cTrack->main_track)));
Given the type of main_track, you only need:
cTrack->main_track = (M3DVector3d *) malloc(sizeof(M3DVector3d));
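In addition, when using sizeof in an allocation it is clearer to name the actual type rather than going through the variable. For the first two allocations the result is the same, but for the per-track allocation it is not: sizeof(*(cLayout->tracks)) is the size of a buoy_track_t * pointer, not of a buoy_track_t, so each track was under-allocated. With these changes: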
buoyLayouts = (buoy_layout_t **) malloc(nTrackLayouts*sizeof(buoy_layout_t*));
for (int i = 0; i < nTrackLayouts; i++) {
buoyLayouts[i] = (buoy_layout_t *) malloc(sizeof(buoy_layout_t));
}
cLayout->tracks = (buoy_track_t **) malloc(cLayout->nTracks * sizeof(buoy_track_t*));
for (int i = 0; i < 1; i++) {
cLayout->tracks[i] = (buoy_track_t *) malloc(sizeof(buoy_track_t));
}
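As an illustration of the "use more STL" suggestion at the top of this answer, the same layout/track/buoy nesting can be held in vectors so that no malloc size has to be computed by hand. This is only a sketch: Vec3d, BuoyTrack, BuoyLayout and MakeLayouts are hypothetical names standing in for the asker's types, and yellow_buoys and distraction_buoys are assumed to be lists of points like main_track.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3d = std::array<double, 3>;  // stands in for M3DVector3d

struct BuoyTrack {
    std::vector<Vec3d> main_track;
    std::vector<Vec3d> yellow_buoys;
    std::vector<Vec3d> distraction_buoys;
    double (*f)(double) = nullptr;   // shape of the track
    double (*fp)(double) = nullptr;  // its derivative
    double thickness = 0.0;
    Vec3d start{};
    Vec3d end{};
};

struct BuoyLayout {
    std::vector<BuoyTrack> tracks;
};

// Polynomial and its derivative, copied from the question.
double buoyfun1(double x) { return (1.0 / 292.0) * x * (x - 12.0) * (x - 24.0); }
double buoyfun1d(double x) { return (1.0 / 292.0) * ((3.0 * x * x) - (72.0 * x) + 288.0); }

// Builds the single layout from the question's constructor; every element is
// sized by the container, so a pointer-sized allocation cannot happen.
std::vector<BuoyLayout> MakeLayouts()
{
    std::vector<BuoyLayout> layouts(1);
    layouts[0].tracks.resize(1);
    BuoyTrack& t = layouts[0].tracks[0];
    t.main_track.resize(30);  // 30 zero-initialized points
    t.f = buoyfun1;
    t.fp = buoyfun1d;
    t.thickness = 5.5;
    t.start = {0.0, 0.0, 0.0};
    t.end = {30.0, 0.0, -19.0};
    return layouts;
}
```

Indexing with main_track.at(j) instead of main_track[j] during debugging would turn an out-of-range access into an exception at the faulty line, rather than silent corruption that shows up later.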
I have a dynamic programming algorithm for Knapsack in C++. When it was implemented as a function and accessing variables passed into it, it was taking 22 seconds to run on a particular instance. When I made it the member function of my class KnapsackInstance and had it use variables that were data members of that class, it started taking 37 seconds to run. As far as I know, only accessing member functions goes through the vtable so I'm at a loss to explain what might be happening.
Here's the code of the function
int KnapsackInstance::dpSolve() {
int i; // Current item number
int d; // Current weight
int * tbl; // Array of size weightLeft
int toret;
tbl = new int[weightLeft+1];
if (!tbl) return -1;
memset(tbl, 0, (weightLeft+1)*sizeof(int));
for (i = 1; i <= numItems; ++i) {
for (d = weightLeft; d >= 0; --d) {
if (profitsWeights.at(i-1).second <= d) {
/* Either add this item or don't */
int v1 = profitsWeights.at(i-1).first + tbl[d-profitsWeights.at(i-1).second];
int v2 = tbl[d];
tbl[d] = (v1 < v2 ? v2 : v1);
}
}
}
toret = tbl[weightLeft];
delete[] tbl;
return toret;
}
tbl is one column of the DP table. We start from the first column and go on until the last column. The profitsWeights variable is a vector of pairs, the first element of which is the profit and the second the weight. toret is the value to return.
Here is the code of the original function :-
int dpSolve(vector<pair<int, int> > profitsWeights, int weightLeft, int numItems) {
int i; // Current item number
int d; // Current weight
int * tbl; // Array of size weightLeft
int toret;
tbl = new int[weightLeft+1];
if (!tbl) return -1;
memset(tbl, 0, (weightLeft+1)*sizeof(int));
for (i = 1; i <= numItems; ++i) {
for (d = weightLeft; d >= 0; --d) {
if (profitsWeights.at(i-1).second <= d) {
/* Either add this item or don't */
int v1 = profitsWeights.at(i-1).first + tbl[d-profitsWeights.at(i-1).second];
int v2 = tbl[d];
tbl[d] = (v1 < v2 ? v2 : v1);
}
}
}
toret = tbl[weightLeft];
delete[] tbl;
return toret;
}
This was run on Debian Lenny with g++-4.3.2, with -O3 -DNDEBUG turned on.
Thanks
In a typical implementation, a member function receives a pointer to the instance data as a hidden parameter (this). As such, access to member data is normally via a pointer, which may account for the slow-down you're seeing.
On the other hand, it's hard to do more than guess with only one version of the code to look at.
After looking at both pieces of code, I think I'd write the member function more like this:
int KnapsackInstance::dpSolve() {
std::vector<int> tbl(weightLeft+1, 0);
// Local copy, so the inner loop does not read the member through `this`
std::vector<std::pair<int, int> > weights(profitsWeights);
int v1;
for (int i = 0; i <numItems; ++i)
for (int d = weightLeft; d >= 0; --d)
// Item i is used consistently for both its weight and its profit
if ((weights[i].second <= d) &&
((v1 = weights[i].first + tbl[d-weights[i].second])>tbl[d]))
tbl[d] = v1;
return tbl[weightLeft];
}