I have a function here that can make a program count, wait, etc. with a least count of 1 millisecond. But I was wondering if I can do the same with a smaller least count (sub-millisecond precision). I have read other answers, but they are mostly about switching to Linux or say that sleep is a guesstimate, and those answers are around a decade old, so perhaps newer functions exist for this by now.
Here's the function:
void sleep(unsigned int mseconds)
{
    clock_t goal = mseconds + clock();
    while (goal > clock());
}
Actually, I was trying to make a function similar to secure_compare, but I don't think it is a wise idea to waste a full millisecond (the current least count) just to compare two strings.
Here is the function I made for that:
bool secure_compare(string a, string b){
    clock_t limit = wait + clock();   // limit on the time the comparison is allowed to take ('wait' is defined elsewhere)
    bool x = (a == b);
    if(clock() > limit){              // if the comparison took longer, increase 'wait' so this new maximum is used for later comparisons too
        wait = clock() - limit;
        cout << "Error";
        secure_compare(a, b);
    }
    while(clock() < limit);           // burn the remaining time to make this a constant-time function
    return x;
}
You're trying to make a comparison function time-independent. There are basically two ways to do this:
Measure the time taken for the call and sleep the appropriate amount
This might only swap out one side channel (timing) for another (power consumption, since sleeping and computation may have different power usage characteristics).
Make the control flow more data-independent:
Instead of using the normal string comparison, you could implement your own comparison that compares all characters and not just up until the first mismatch, like this:
bool match = true;
size_t min_length = min(a.size(), b.size());
for (size_t i = 0; i < min_length; ++i) {
    match &= (a[i] == b[i]);
}
return match;
Here, no branching (no conditional operations) takes place, so every call of this method with strings of the same length should take roughly the same time. The only side-channel information you leak is the length of the strings you compare, but that would be difficult to hide anyway if they are of arbitrary length.
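For reference, here is a minimal sketch of how that loop could sit inside a complete function; the name secure_compare and the way unequal lengths are folded into the result are my own choices, not part of the snippet above.
#include <algorithm>
#include <string>

// Constant-time(ish) equality check: every character up to the shorter length
// is always inspected, and a length mismatch simply forces the result to false.
bool secure_compare(const std::string& a, const std::string& b)
{
    bool match = (a.size() == b.size());
    size_t min_length = std::min(a.size(), b.size());
    for (size_t i = 0; i < min_length; ++i) {
        match &= (a[i] == b[i]);
    }
    return match;
}
The loop count still depends on the shorter length, which is exactly the leak the edit below tries to reduce.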
EDIT: Incorporating Passer By's comment:
If we want to reduce the size leakage, we could try to round the size up and clamp the index values.
bool match = true;
size_t min_length = min(a.size(), b.size());
size_t rounded_length = (min_length + 1023) / 1024 * 1024;
for (size_t i = 0; i < rounded_length; ++i) {
    size_t clamped_i = min(i, min_length - 1);
    match &= (a[clamped_i] == b[clamped_i]);
}
return match;
There might be a tiny cache-timing side channel present (because we don't get any more cache misses once i > clamped_i), but since a and b should be in the cache hierarchy anyway, I doubt the difference is usable in any way.
I have a function with a factor that needs to be adjusted according to the load on the machine, so that the function consumes exactly the wall time passed to it. The factor can vary with the load on the machine.
void execute_for_wallTime(int factor, int wallTime)
{
    double d = 0;
    for (int n = 0; n < factor; ++n)
        for (int m = 0; wallTime; ++m)
            d += d * n * m;
}
Is there a way to dynamically check the load on the machine and adjust the factor accordingly in order to consume the exact wall time passed to the function?
The wall time is read from a file and passed to this function. The values are in microseconds, e.g.:
73
21
44
According to the OP's comment:
#include <sys/time.h>

// Time in microseconds between the two timevals (might require longs for long intervals).
int deltaTime(struct timeval *tv1, struct timeval *tv2){
    return ((tv2->tv_sec - tv1->tv_sec) * 1000000) + tv2->tv_usec - tv1->tv_usec;
}

void execute_for_wallTime(int wallTime)
{
    struct timeval tvStart, tvNow;
    gettimeofday(&tvStart, NULL);

    double d = 0;
    for (int m = 0; ; ++m){
        gettimeofday(&tvNow, NULL);
        if(deltaTime(&tvStart, &tvNow) >= wallTime) { // if wallTime is 1000 microseconds,
                                                      // this function returns after
                                                      // 1000 microseconds (and a
                                                      // little more due to overhead)
            return;
        }
        d += d * m;
    }
}
Now deal with wallTime by increasing or decreasing it in logic outside this function, depending on your performance calculations. This function simply runs for wallTime microseconds.
For C++ style, you can use std::chrono.
I must add that I would handle things differently, for example by calling nanosleep(). These filler operations make no sense unless in the actual code you plan to replace them with real work; in that case you might consider threads and schedulers. Besides, the clock calls add overhead.
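As a rough illustration of the std::chrono route (my own sketch, not the original code), the same busy-wait can be written without gettimeofday(); if actually sleeping is acceptable, std::this_thread::sleep_for(std::chrono::microseconds(wallTime)) or nanosleep() would be simpler still.
#include <chrono>

// Spins until roughly wallTime microseconds of wall-clock time have elapsed.
// steady_clock is monotonic, so it is not affected by system clock adjustments.
void execute_for_wallTime(int wallTime)
{
    const auto start = std::chrono::steady_clock::now();
    const auto limit = std::chrono::microseconds(wallTime);
    volatile double d = 0;   // volatile keeps the filler work from being optimized away
    for (int m = 0; std::chrono::steady_clock::now() - start < limit; ++m)
        d += d * m;
}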
(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)
The idea of the code is that I have a global array v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges), and I try to create 2 threads, each one sorting the first half and the second half respectively by calling merge_sort() with the appropriate parameters.
On the threaded run, I see the process going to 80-100% CPU usage (on a dual-core CPU), while on the non-threaded run it only stays at 50%, yet the run times are very close.
This is the (relevant) code:
// These are the 2 sorting functions; each thread will call merge_sort(..).
// Is it a problem that both threads call the same (normal) function?
void merge (int *v, int start, int middle, int end) {
    // dynamically creates 2 new arrays for v[start..middle] and v[middle+1..end]
    // copies the original values into the 2 halves
    // then merges them back into the v array in sorted order
}

void merge_sort (int *v, int start, int end) {
    // recursively calls merge_sort(v, start, (start+end)/2) and merge_sort(v, (start+end)/2+1, end) to sort the halves
    // then calls merge(v, start, middle, end)
}
// Here I'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code, to make the bug easier to find)
void* mergesort_t2(void * arg) {
    t_data* th_info = (t_data*)arg;

    merge_sort(v, th_info->a, th_info->b);
    return (void*)0;
}
// In main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
    // some stuff

    // getting the clock to calculate run time
    clock_t t_inceput, t_sfarsit;
    t_inceput = clock();

    // ignore crt_depth for this example (in the full code I'm recursively creating new threads and I need it to know when to stop)
    // a and b are the range of values the created thread will have to sort
    pthread_t thread[2];
    t_data next_info[2];
    next_info[0].crt_depth = 1;
    next_info[0].a = 0;
    next_info[0].b = n/2;
    next_info[1].crt_depth = 1;
    next_info[1].a = n/2+1;
    next_info[1].b = n-1;

    for (int i=0; i<2; i++) {
        if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
            cerr << "error\n";
            return err;
        }
    }
    for (int i=0; i<2; i++) {
        if (pthread_join(thread[i], &status) != 0) {
            cerr << "error\n";
            return err;
        }
    }

    // now merge the 2 sorted halves
    merge(v, 0, n/2, n-1);

    // calculate end time
    t_sfarsit = clock();

    cout << "Sort time (s): " << double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC << endl;
    delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.
I've used the inplace_merge(..) function instead of my own, and the program run times are now as they should be.
Here's my initial merge function (I'm not really sure it's the initial version; I've probably modified it a few times since, and the array indices might be wrong right now because I went back and forth between [a,b] and [a,b). This was just the last commented-out version):
void merge (int *v, int a, int m, int c) { // merges v[a..m] and v[m+1..c] into v[a..c]
    // create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];

    // copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];

    // merge them back together in sorted order
    int is = 0, id = 0;
    for (int i = 0; i <= c-a; i++) {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
    delete [] st;
    delete [] dr;
}
All of this was replaced with:
inplace_merge(v+a, v+m, v+c);
Edit: some timings on my 3 GHz dual-core CPU:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays [...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular, the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
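To illustrate the allocation point from the first paragraph: one way to avoid the per-merge allocations entirely (my own sketch, not the poster's code) is to give each thread a single reusable scratch buffer:
#include <vector>

// Merges v[start..middle] and v[middle+1..end] using a caller-provided scratch
// buffer, so no allocation happens inside the recursion.
void merge_with_scratch(int* v, int start, int middle, int end, std::vector<int>& scratch)
{
    scratch.assign(v + start, v + end + 1);          // copy the whole range once
    int i = 0, j = middle - start + 1, k = start;
    while (i <= middle - start && j <= end - start)
        v[k++] = (scratch[i] <= scratch[j]) ? scratch[i++] : scratch[j++];
    while (i <= middle - start) v[k++] = scratch[i++];
    while (j <= end - start)    v[k++] = scratch[j++];
}

void merge_sort_with_scratch(int* v, int start, int end, std::vector<int>& scratch)
{
    if (start >= end) return;
    int middle = start + (end - start) / 2;
    merge_sort_with_scratch(v, start, middle, scratch);
    merge_sort_with_scratch(v, middle + 1, end, scratch);
    merge_with_scratch(v, start, middle, end, scratch);
}
Each thread then constructs one scratch vector before calling merge_sort_with_scratch on its half, so the two threads never contend for the allocator inside the recursion.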
Note: since the OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I've left it for the sake of those who might find the information useful.
clock() is the wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in the case of multiple threads is the sum of the CPU time of all threads. You need to measure elapsed, or wall-clock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also explains which APIs can be used instead of clock().
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included, so the CPU time is close to wall-clock time. Nevertheless, it's still wrong to use CPU time (and hence clock()) for performance measurement, even in serial programs; for one thing, if a program waits for e.g. a network event or a message from another MPI process, it still spends time, but not CPU time.
Update: In Microsoft's implementation of the C run-time library, clock() returns wall-clock time, so it is OK to use for your purpose. It's unclear, though, whether you use Microsoft's toolchain or something else, like Cygwin or MinGW.
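For instance, the timing in the poster's main() could be done with std::chrono instead of clock(); a small sketch assuming C++11 (the sorting itself is elided):
#include <chrono>
#include <iostream>

int main()
{
    auto t_start = std::chrono::steady_clock::now();

    // ... create the sorting threads, join them, do the final merge ...

    auto t_end = std::chrono::steady_clock::now();
    std::cout << "Sort time (s): "
              << std::chrono::duration<double>(t_end - t_start).count() << std::endl;
}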
I have several counters that keep increasing (never decreasing), incremented by concurrent threads. Each thread is responsible for one counter. Occasionally, one of the threads needs to find the minimum of all counters. I do this with a simple iteration over all counters, selecting the minimum. I need to ensure that this minimum is no greater than any of the counters. Currently, I don't use any concurrency mechanisms. Is there any chance that I get a wrong answer (i.e., end up with a minimum that is greater than one of the counters)? The code works most of the time, but occasionally (less than 0.1% of the time) it breaks by finding a minimum that is larger than one of the counters. The code is C++ and looks like this.
unsigned long int counters[NUM_COUNTERS];

void* WorkerThread(void* arg) {
    int i_counter = *((int*) arg);

    // do some work
    counters[i_counter]++;

    occasionally {
        unsigned long int min = counters[i_counter];
        for (int i = 0; i < NUM_COUNTERS; i++) {
            if (counters[i] < min)
                min = counters[i];
        }
        // The minimum is now stored in min
    }
}
Update:
After employing the fix suggested by @JerryCoffin, the code looks like this:
unsigned long int counters[NUM_COUNTERS];

void* WorkerThread(void* arg) {
    int i_counter = *((int*) arg);

    // do some work
    counters[i_counter]++;

    occasionally {
        unsigned long int min = counters[i_counter];
        for (int i = 0; i < NUM_COUNTERS; i++) {
            unsigned long int counter_i = counters[i];
            if (counter_i < min)
                min = counter_i;
        }
        // The minimum is now stored in min
    }
}
Yes, it's broken -- it has a race condition.
In other words, when you pick out the smallest value, it's certainly no larger than any other value you looked at -- but if the thread that owns that counter increments it after your comparison but before you save it, the value you end up with could be larger than some other counter by the time you try to use it.
if (counters[i] < min)
    // could change between the comparison above and the assignment below
    min = counters[i];
The relatively short interval between comparing and saving the value explains why the answer you're getting is right most of the time -- it'll only go wrong if there's a context switch immediately after the comparison, and the other thread increments that counter often enough before control switches back that it's no longer the smallest counter by the time it gets saved.
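If you also want the reads and writes themselves to be well-defined (concurrent access to a plain unsigned long is formally a data race in C++11), the same snapshot idea can be written with std::atomic. This is a sketch; the relaxed memory order and the placeholder NUM_COUNTERS value are my own assumptions:
#include <atomic>

const int NUM_COUNTERS = 8;                     // placeholder; the question leaves the size unspecified
std::atomic<unsigned long> counters[NUM_COUNTERS];

void increment(int i_counter)
{
    counters[i_counter].fetch_add(1, std::memory_order_relaxed);
}

unsigned long current_minimum()
{
    // Each counter is read exactly once into a local snapshot. Because counters
    // only ever increase, the snapshot can never exceed the counter it came from,
    // so the returned minimum is never greater than any counter.
    unsigned long min = counters[0].load(std::memory_order_relaxed);
    for (int i = 1; i < NUM_COUNTERS; ++i) {
        unsigned long snapshot = counters[i].load(std::memory_order_relaxed);
        if (snapshot < min)
            min = snapshot;
    }
    return min;
}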
I'm writing code that takes a number from the user and prints it back in words as a string. I want to know which is better performance-wise: to have if statements, like
if (n < 100) {
    // code for 2-digit numbers
} else if (n < 1000) {
    // code for 3-digit numbers
} // etc..
or to put the number in a string and get its length, then work on it as a string.
The code is written in C++.
Of course if-else will be faster.
To compare two numbers you just compare them directly; on typical hardware this is a single compare instruction, so it's a very fast operation.
To get the length of the string you first need to build the string, put the data into it, and then compute its length somehow (there are different ways of doing that too, the simplest being counting all the characters). Of course this takes much more time.
On a simple example, though, you will not notice any difference. It often amazes me that people get concerned with such things (no offense). It will not make any difference to you whether the code executes in 0.003 seconds instead of 0.001 seconds, really... You should make such low-level optimizations only after you know that this exact place is a bottleneck of your application, and when you are sure that you can increase the performance by a decent amount.
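For comparison, the string-based variant the question describes would, with C++11, boil down to something like the following sketch (the function name is mine); it allocates and formats a string just to measure its length, which is why the plain comparisons win:
#include <string>

// Digit count obtained by converting the number to text and taking the length.
int num_digits_via_string(unsigned int n)
{
    return static_cast<int>(std::to_string(n).size());
}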
Until you measure and this really is a bottleneck, don't worry about performance.
That said, the following should be even faster (for readability, let's assume you use a type that ranges between 0 and 99999999):
if (n < 10000) {
    // code for 4 or fewer digits
    if (n < 100)
    {
        // code for 2 or fewer digits
        if (n < 10)
            return 1;
        else
            return 2;
    }
    else
    {
        // code for more than 2 digits, but no more than 4
        if (n >= 1000)
            return 4;
        else
            return 3;
    }
} else {
    // similar
} // etc..
Basically, it's a variation of binary search. Worst case, this will take O(log(n)) as opposed to O(n) - n being the maximum number of digits.
The string variant will be slower:
std::stringstream ss; // allocation, initialization ...
ss << 4711; // parsing, setting internal flags, ...
std::string str = ss.str(); // allocations, array copies ...
// cleaning up (compiler does it for you) ...
str.~string();
ss.~stringstream(); // destruction ...
The ... indicate there's more stuff happening.
A compact (good for cache) loop (good for branch prediction) might be what you want:
int num_digits (int value, int base=10) {
    int num = 0;
    while (value) {
        value /= base;
        ++num;
    }
    return num;   // note: returns 0 for value == 0
}

int num_zeros (int value, int base=10) {
    return num_digits(value, base) - 1;
}
Depending on circumstances, because it is cache and prediction friendly, this may be faster than solutions based on relational operators.
The templated variant enables the compiler to do some micro optimizations for your division:
template <int base=10>
int num_digits (int value) {
    int num = 0;
    while (value) {
        value /= base;
        ++num;
    }
    return num;
}
The answers are good, but think a bit about relative times.
Even by the slowest method you can think of, the program can do it in some tiny fraction of a second, like maybe 100 microseconds.
Balance that against the fastest user you can imagine, who could type in the number in maybe 500 milliseconds, and who could read the output in another 500 milliseconds, before doing whatever comes next.
OK, the machine does essentially nothing for 1000 milliseconds, and in the middle it has to crunch like crazy for 100 microseconds because, after all, we don't want the user to think the program is slow ;-)
I have the following C++ code:
const int N = 1000000;
int id[N];      // value can range from 0 to 9
float value[N];

// load id and value from an external source...

int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.
This kind of loop is very hard to optimize using SIMD instructions. Not only is there no easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"); even if there were, this particular loop would still have the problem that you might get two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
In this case, the obvious approach (in pseudocode)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int   size[10] = { 0 }, size2[10] = { 0 };
float sum[10]  = { 0 }, sum2[10]  = { 0 };

for (int i = 0; i < N/2; i++) {
    int id0 = id[i*2+0], id1 = id[i*2+1];
    ++size[id0];
    ++size2[id1];
    sum[id0]  += value[i*2+0];
    sum2[id1] += value[i*2+1];
}

// if N was odd, process the last element
if (N & 1) {
    ++size[id[N-1]];
    sum[id[N-1]] += value[N-1];
}

// add the partial sums together
for (int i = 0; i < 10; i++) {
    size[i] += size2[i];
    sum[i]  += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are reading id[i] twice in your loop. You could store it in a local variable, or a register int if you wanted to.
register int index;
for (int i = 0; i < N; ++i)
{
    index = id[i];
    ++size[index];
    sum[index] += value[i];
}
The MSDN docs state this about register:
"The register keyword specifies that the variable is to be stored in a machine register. Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored."
Something you can do is compile with the -S flag (or the equivalent if you aren't using gcc) and compare the assembly output at the -O, -O2, and -O3 optimization levels. One common way to optimize a loop is to do some degree of unrolling; as a (very simple, naive) example:
// note: assumes N is even
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
    index = 2 * i;
    ++size[id[index]];
    sum[id[index]] += value[index];
    index++;
    ++size[id[index]];
    sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point); then it is STILL taking 10 seconds. The original 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads. I.e. you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading you process buffer 0. When finished, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing.
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminates the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple-core machine, process the whole data set in a tenth of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
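A sketch of what that could look like on the loop from this question; the 64-element prefetch distance is a guess that would need tuning, and the function wrapper is mine:
#include <xmmintrin.h>

// Same accumulation loop as in the question, with software prefetches issued
// a fixed distance ahead of the current element.
void accumulate_with_prefetch(const int* id, const float* value, int n,
                              int size[10], float sum[10])
{
    const int ahead = 64;   // tuning parameter: how far ahead to prefetch
    for (int i = 0; i < n; ++i)
    {
        if (i + ahead < n)
        {
            _mm_prefetch(reinterpret_cast<const char*>(id + i + ahead), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(value + i + ahead), _MM_HINT_T0);
        }
        ++size[id[i]];
        sum[id[i]] += value[i];
    }
}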
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+: size[0:10], sum[0:10]) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC; array-section reductions like the one above need OpenMP 4.5 or later). However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
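A self-contained sketch of that pragma applied to the loop from the question (the function wrapper is my own; treat the exact reduction clause syntax as something to check against your compiler):
void accumulate_omp(const int* id, const float* value, int n,
                    int size[10], float sum[10])
{
    // Each thread accumulates into private copies of size/sum, which are
    // summed together when the parallel region ends.
    #pragma omp parallel for reduction(+: size[0:10], sum[0:10]) schedule(static)
    for (int i = 0; i < n; ++i)
    {
        ++size[id[i]];
        sum[id[i]] += value[i];
    }
}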
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
int i, j, k;
float tmp;
// i walks value[]; k marks the end of the current id bucket; tmp accumulates that bucket's sum
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
    for (k += size[j], tmp = 0; i < k; i++)
        tmp += value[i];
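A sketch of the pre-sort pass itself, written as a counting sort by id; the function name and the use of std::vector for the scratch copy are my own choices:
#include <vector>

// Reorders value[] so all elements with id 0 come first, then id 1, and so on,
// and fills size[] with the bucket counts. After this pass, the summation loop
// above no longer needs to read id[] at all.
void presort_by_id(const int* id, float* value, int n, int size[10])
{
    int offset[10] = { 0 };
    for (int j = 0; j < 10; ++j) size[j] = 0;
    for (int i = 0; i < n; ++i) ++size[id[i]];        // count each bucket
    for (int j = 1; j < 10; ++j)                      // starting offset of each bucket
        offset[j] = offset[j - 1] + size[j - 1];

    std::vector<float> sorted(n);
    for (int i = 0; i < n; ++i)                       // scatter values into their buckets
        sorted[offset[id[i]]++] = value[i];
    for (int i = 0; i < n; ++i)                       // copy back
        value[i] = sorted[i];
}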