So recently I ran into a problem that I thought was interesting and I couldn't fully explain. I've highlighted the nature of the problem in the following code:
#include <cstring>
#include <chrono>
#include <iostream>
#define NLOOPS 10
void doWorkFast(int total, int *write, int *read)
{
for (int j = 0; j < NLOOPS; j++) {
for (int i = 0; i < total; i++) {
write[i] = read[i] + i;
}
}
}
void doWorkSlow(int total, int *write, int *read, int innerLoopSize)
{
for (int i = 0; i < NLOOPS; i++) {
for (int j = 0; j < total/innerLoopSize; j++) {
for (int k = 0; k < innerLoopSize; k++) {
write[j*k + k] = read[j*k + k] + j*k + k;
}
}
}
}
int main(int argc, char *argv[])
{
int n = 1000000000;
int *heapMemoryWrite = new int[n];
int *heapMemoryRead = new int[n];
for (int i = 0; i < n; i++)
{
heapMemoryRead[i] = 1;
}
std::memset(heapMemoryWrite, 0, n * sizeof(int));
auto start1 = std::chrono::high_resolution_clock::now();
doWorkFast(n,heapMemoryWrite, heapMemoryRead);
auto finish1 = std::chrono::high_resolution_clock::now();
auto duration1 = std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1);
for (int i = 0; i < n; i++)
{
heapMemoryRead[i] = 1;
}
std::memset(heapMemoryWrite, 0, n * sizeof(int));
auto start2 = std::chrono::high_resolution_clock::now();
doWorkSlow(n,heapMemoryWrite, heapMemoryRead, 10);
auto finish2 = std::chrono::high_resolution_clock::now();
auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2);
std::cout << "Small inner loop:" << duration1.count() << " microseconds.\n" <<
"Large inner loop:" << duration2.count() << " microseconds." << std::endl;
delete[] heapMemoryWrite;
delete[] heapMemoryRead;
}
Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses. I understand that in the doWorkSlow implementation, we are doing one or two more operations to resolve j*k + k, however, I think it's reasonably safe to assume that relative to the time it takes to do the load/stores for memory read and write, the time contribution of these operations is negligible.
Nevertheless, doWorkSlow takes about twice as long (46.8s) compared to doWorkFast (25.5s) on my i7-3700 using g++ --version 7.5.0. While things like cache prefetching and branch prediction come to mind, I don't have a great explanation as to why doWorkFast is much faster than doWorkSlow. Does anyone have insight?
Thanks
Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses.
This is not true!
In doWorkFast, you index each integer incrementally, as array[i].
array[0]
array[1]
array[2]
array[3]
In doWorkSlow, you index each integer as array[j*k + k], which jumps around and repeats.
When j is 10, for example, and you iterate k from 0 onwards, you are accessing
array[0] // 10*0+0
array[11] // 10*1+1
array[22] // 10*2+2
array[33] // 10*3+3
This will prevent your optimizer from using instructions that can operate on many adjacent integers at once.
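To make the difference concrete, here is a minimal sketch that simply records which indices each scheme touches. The helper names are hypothetical, and the `j * innerLoopSize + k` form is only an assumption about what the question intended (visiting each block of `innerLoopSize` elements in order).

```cpp
#include <cassert>
#include <vector>

// Indices visited by the question's inner loops as written: j*k + k.
std::vector<int> indicesAsWritten(int total, int innerLoopSize)
{
    assert(innerLoopSize > 0);
    std::vector<int> out;
    for (int j = 0; j < total / innerLoopSize; j++)
        for (int k = 0; k < innerLoopSize; k++)
            out.push_back(j * k + k);
    return out;
}

// Indices visited if the block offset is j*innerLoopSize (assumed intent).
std::vector<int> indicesBlocked(int total, int innerLoopSize)
{
    assert(innerLoopSize > 0);
    std::vector<int> out;
    for (int j = 0; j < total / innerLoopSize; j++)
        for (int k = 0; k < innerLoopSize; k++)
            out.push_back(j * innerLoopSize + k);
    return out;
}
```

With `total = 20` and `innerLoopSize = 10`, the first function yields 0..9 followed by 0, 2, 4, ..., 18, while the second yields 0..19 in order, which matches what doWorkFast touches.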
I wrote a code to calculate moving L2 norm of two arrays.
void func_lstl2(const int &nx, const float x[],const int &ny, const float y[], int &shift, double &lstl2)
{
int maxshift = 200;
int len_z = maxshift * 2;
int len_work = len_z + ny;
//initialize array work and array z
double *z = new double[len_z]; float *work = new float[len_work];
for (int i = 0; i < len_z; i++)
z[i] = 0;
for (int i = 0; i < len_work; i++)
work[i] = 0;
for (int i = 0; i < ny; i++)
work[i + maxshift] = y[i];
// do moving least square residue calculation
float temp;
for (int i = 0; i < len_z; i++)
{
for (int j = 0; j < nx; j++)
{
temp = x[j] - work[i + j];
z[i] += temp * temp;
}
}
// find the best fit value
lstl2 = 1E30;
shift = 0;
for (int i = 0; i < len_z; i++)
{
if (z[i] < lstl2)
{
lstl2 = z[i];
shift = i - maxshift;
}
}
//end of program
delete[] z;
delete[] work;
}
I tested two pairs of arrays with exactly the same length and scale.
int shift; double lstl2;
func_lstl2(2000,z1,2000,z2,shift,lstl2) ;
func_lstl2(2000,x1,2000,x2,shift,lstl2) ;
For the z arrays it took 0.0032346 seconds; for the x arrays it took 0.0140903 seconds. I cannot figure out why there is more than a fourfold difference in run time. Could you help me figure it out? Thank you very much!
Here is the link for z array and x array.
https://drive.google.com/file/d/1aONKTjE_7NI1bp8YkDL2CMfg9C5h67Fe/view?usp=sharing
I strongly suspect you're dealing with denormalized floating-point calculation effects. Using your existing function, loading the values as appropriate into vectors, and turning them loose seven times on the provided input (compiled with -O3 optimization):
for (int i = 0; i < 7; ++i)
{
int shift = 0;
double lstl2 = 0;
auto tp0 = steady_clock::now();
func_lstl2(2000, v1.data(), 2000, v2.data(), shift, lstl2);
auto tp1 = steady_clock::now();
std::cout << pr[0] << ',' << pr[1] << ':';
std::cout << duration_cast<milliseconds>(tp1 - tp0).count() << "ms\n";
}
I receive the following output, confirming your conundrum:
x1.txt,x2.txt:23ms
x1.txt,x2.txt:19ms
x1.txt,x2.txt:21ms
x1.txt,x2.txt:21ms
x1.txt,x2.txt:19ms
x1.txt,x2.txt:22ms
x1.txt,x2.txt:21ms
z1.txt,z2.txt:8ms
z1.txt,z2.txt:9ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:6ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:5ms
However, enabling denormalize-as-zero (DAZ) and flush-to-zero (FTZ) for floating calculations (the mechanism for doing so is toolchain-dependent; below is clang 13.01 on macOS):
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
delivers the following:
x1.txt,x2.txt:4ms
x1.txt,x2.txt:4ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:5ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:5ms
z1.txt,z2.txt:7ms
z1.txt,z2.txt:6ms
z1.txt,z2.txt:4ms
z1.txt,z2.txt:3ms
z1.txt,z2.txt:3ms
z1.txt,z2.txt:4ms
z1.txt,z2.txt:3ms
Your x-data set is sensitive to this; z does not appear to be. See this question for a better explanation.
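One possible way to check whether a data set is likely to hit this path is to count subnormal values with std::fpclassify. The sketch below is only a diagnostic aid under the assumption of default floating-point settings (no -ffast-math, FTZ/DAZ off); the helper names are hypothetical. The second helper looks at the squared difference, since in func_lstl2 the product temp * temp is computed in float and can underflow even when the inputs themselves are normal.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Counts the subnormal (denormalized) values in data[0..n).
std::size_t countSubnormal(const float* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (std::fpclassify(data[i]) == FP_SUBNORMAL)
            ++count;
    return count;
}

// Counts positions where the float product (x[j]-y[j])^2 is subnormal.
std::size_t countSubnormalSquares(const float* x, const float* y, std::size_t n)
{
    std::size_t count = 0;
    for (std::size_t j = 0; j < n; ++j)
    {
        float temp = x[j] - y[j];
        float sq = temp * temp; // same float multiply as in func_lstl2
        if (std::fpclassify(sq) == FP_SUBNORMAL)
            ++count;
    }
    return count;
}
```

A large count for the x data and a small one for the z data would be consistent with the timings above, though it would not prove the cause on its own.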
Given an integer n and an array a, find the maximum of (a[i]+a[j])*(j-i) with 1<=i<=n-1 and i+1<=j<=n.
Example:
Input
5
1 3 2 5 4
Output
21
Explanation: With i=2 and j=5, the maximum of (a[i]+a[j])*(j-i) is (3+4)*(5-2)=21
Constraints:
n<=10^6
a[i]>0 with 1<=i<=n
I can solve this problem with n<=10^4, but what should I do if n is too large, like the constraints?
First, let's reference the "brute force" algorithm. This will have some issues, which I will call out below, but it is a correct solution.
struct Result
{
size_t i;
size_t j;
int64_t value;
};
Result findBestBruteForce(const vector<int>& a)
{
size_t besti = 0;
size_t bestj = 0;
int64_t bestvalue = INT64_MIN;
for (size_t i = 0; i < a.size(); i++)
{
for (size_t j = i + 1; j < a.size(); j++)
{
// do the math in 64-bit space to avoid overflow
int64_t value = (a[i] + (int64_t)a[j]) * (j - i);
if (value > bestvalue)
{
bestvalue = value;
besti = i;
bestj = j;
}
}
}
return { besti, bestj, bestvalue };
}
The problem with the above code is that it runs in O(N²). Or more precisely, for the N iterations of the outer for-loop (where i goes from 0 to N), there are an average of N/2 iterations of the inner for-loop. If N is small, this isn't a problem.
On my PC, with full optimizations turned on, the run time is less than a second when N is under 20000. Once N approaches 100000, it takes several seconds to process the 5 billion iterations. Let's just go with "a billion operations per second" as an expected rate. If N were 1000000, the maximum the OP outlined, it would probably take 500 seconds. Such is the nature of an N-squared algorithm.
So how can we speed it up? Here's an interesting observation. Let's say our array was this:
10 5 4 15 13 100 101 6
On the first iteration of the outer loop above, where i=0, we'd be computing this on each iteration of the inner loop:
for each j: (a[0]+a[j])(j-0)
for each j: (10+a[j])(j-0)
for each j: [15*1, 14*2, 25*3, 23*4, 110*5, 111*6, 16*7]
= [15, 28, 75, 92, 550, 666, 112]
Hence, for i=0, a[i] = 10 and the largest value computed from that set is 666.
Since a[0] is 10, and we're tracking a current "best" value, there's no incentive to iterate all the values again for i=1, since a[1]==5 is less than 10. There's no j index that would compute a value of (a[1]+a[j])*(j-1) larger than what's already been found, because (5+a[j])*(j-1) will always be less than (10+a[j])*(j-0). (This assumes all values in the array are non-negative.)
So to generalize, the outer loop can skip over any index of i where A[best_i] > A[i]. And that's a real simple alteration to our above code:
Result findBestOptimized(const std::vector<int>& a)
{
if (a.size() < 2)
{
return {0,0,INT64_MIN};
}
size_t besti = 0;
size_t bestj = 0;
int64_t bestvalue = INT64_MIN;
int minimum = INT_MIN;
for (size_t i = 0; i < a.size(); i++)
{
if (a[i] <= minimum)
{
continue;
}
for (size_t j = i + 1; j < a.size(); j++)
{
int64_t value = (a[i] + (int64_t)a[j]) * (j - i);
if (value > bestvalue)
{
bestvalue = value;
besti = i;
bestj = j;
minimum = a[i];
}
}
}
return { besti, bestj, bestvalue };
}
Above, we introduce a minimum value that A[i] must exceed before the full inner-loop enumeration is worth doing.
I benchmarked this with build optimizations on. On a random array of a million items, it runs in under a second.
But wait... there's another optimization!
If the inner loop fails to find an index j such that value > bestvalue, then we already know that the current A[i] is greater than minimum. Hence, we can raise minimum to A[i] at the end of the inner loop regardless.
Now, I'll present the final solution:
Result findBestOptimizedEvenMore(const std::vector<int>& a)
{
if (a.size() < 2)
{
return { 0,0,INT64_MIN };
}
size_t besti = 0;
size_t bestj = 0;
int64_t bestvalue = INT64_MIN;
int minimum = INT_MIN;
for (size_t i = 0; i < a.size(); i++)
{
if (a[i] <= minimum)
{
continue;
}
for (size_t j = i + 1; j < a.size(); j++)
{
int64_t value = (a[i] + (int64_t)a[j]) * (j - i);
if (value > bestvalue)
{
bestvalue = value;
besti = i;
bestj = j;
}
}
minimum = a[i]; // since we know a[i] > minimum, we can do this
}
return { besti, bestj, bestvalue };
}
I benchmarked the above solution on different array sizes from N=100 to N=1000000. It does all iterations in under 25 milliseconds.
In the above solution, there's likely a worst case runtime of O(N²) again when all the items in the array are in ascending order. But I believe the average case should be on the order of O(N lg N) or better. I'll do some more analysis later if anyone is interested.
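One way to gain some confidence in the pruning before analyzing it further is to cross-check it against the brute-force scan on reproducible inputs. The sketch below is only a test aid: the function names and the small LCG used to fill the array are assumptions for illustration, and only the best value is compared (not the indices).

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference O(N^2) scan over every pair i < j; returns only the best value.
int64_t bestValueBruteForce(const std::vector<int>& a)
{
    int64_t best = INT64_MIN;
    for (std::size_t i = 0; i < a.size(); i++)
        for (std::size_t j = i + 1; j < a.size(); j++)
        {
            int64_t value = (a[i] + (int64_t)a[j]) * (int64_t)(j - i);
            if (value > best) best = value;
        }
    return best;
}

// Pruned scan: skip every i whose a[i] does not exceed the largest a[i] already scanned.
int64_t bestValuePruned(const std::vector<int>& a)
{
    int64_t best = INT64_MIN;
    int minimum = INT_MIN;
    for (std::size_t i = 0; i < a.size(); i++)
    {
        if (a[i] <= minimum) continue;
        for (std::size_t j = i + 1; j < a.size(); j++)
        {
            int64_t value = (a[i] + (int64_t)a[j]) * (int64_t)(j - i);
            if (value > best) best = value;
        }
        minimum = a[i];
    }
    return best;
}

// Compares both versions on a deterministic pseudo-random array of positive values.
bool crossCheck(std::size_t n, uint32_t seed)
{
    assert(n >= 2);
    std::vector<int> a(n);
    uint32_t state = seed;
    for (auto& v : a)
    {
        state = state * 1664525u + 1013904223u; // simple LCG, for reproducibility only
        v = static_cast<int>((state >> 8) % 1000u) + 1;
    }
    return bestValueBruteForce(a) == bestValuePruned(a);
}
```

For the question's sample input 1 3 2 5 4, both functions return 21.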
Note: Some notation for variables and the Result class in the code have been copied from #selbie's excellent answer.
Here's another O(n^2) worst-case solution with (likely provable) O(n) expected performance on random permutations and room for optimization.
Suppose [i, j] are our array bounds for an optimal pair. By the problem definition, this means all elements left of i must be strictly less than A[i], and all elements right of j must be strictly less than A[j].
This means we can compute the left-maxima of A: all elements strictly greater than all previous elements, as well as the right-maxima of A. Then, we only need to consider left endpoints from the left-maxima and right endpoints from the right-maxima.
I don't know the expectation of the product of the sizes of left and right maxima sets, but we can get an upper bound. The size of left maxima is at most the size of the longest increasing subsequence (LIS) of A. The right maxima are at most the size of the longest decreasing subsequence. These aren't independent, but I'm taking as an (unproven) assumption that the LIS and LDS lengths are inversely correlated with each other for random permutations. The right-maxima must start after the left-maxima end, so this seems like a safe assumption.
The length of the LIS for random permutations, suitably centered and scaled, follows the Tracy-Widom distribution: it has mean approximately 2*sqrt(N) and standard deviation on the order of N^(1/6). The expected square of the size is therefore about 4N + O(N^(1/3)), so ~4N. This isn't exactly the proof we wanted, since you'd need to sum over the probability density function to be rigorous, but the LIS is already an upper bound on the left-maxima size, so I think the conclusion is still true.
C++ code (Result class and some variable names taken from selbie's post, as mentioned):
struct Result
{
size_t i;
size_t j;
int64_t value;
};
Result find_best_sum_size_product(const std::vector<int>& nums)
{
/* Given: list of positive integers nums
Returns: Tuple with (best_i, best_j, best_product)
where best_i and best_j maximize the product
(nums[i]+nums[j])*(j-i) over 0 <= i < j < n
Runtime: O(n^2) worst case,
O(n) average on random permutations.
*/
int n = nums.size();
if (n < 2)
{
return {0,0,INT64_MIN};
}
std::vector<int> left_maxima_indices;
left_maxima_indices.push_back(0);
for (int i = 1; i < n; i++){
if (nums.at(i) > nums.at(left_maxima_indices.back())) {
left_maxima_indices.push_back(i);
}
}
std::vector<int> right_maxima_indices;
right_maxima_indices.push_back(n-1);
for (int i = n-1; i >= 0; i--){
if (nums.at(i) > nums.at(right_maxima_indices.back())) {
right_maxima_indices.push_back(i);
}
}
size_t best_i = 0;
size_t best_j = 0;
int64_t best_product = INT64_MIN;
int i = 0;
int j = 0;
for (size_t left_idx = 0;
left_idx < left_maxima_indices.size();
left_idx++)
{
i = left_maxima_indices.at(left_idx);
for (size_t right_idx = 0;
right_idx < right_maxima_indices.size();
right_idx++)
{
j = right_maxima_indices.at(right_idx);
if (i == j) continue;
int64_t value = (nums.at(i) + (int64_t)nums.at(j)) * (j - i);
if (value > best_product)
{
best_product = value;
best_i = i;
best_j = j;
}
}
}
return { best_i, best_j, best_product };
}
I started from the two excellent answers by #selbie and #kcsquared.
Their solutions gave impressive results for random inputs. What was not clear is the worst case behavior.
What sequence would correspond to the worst case?
I finally found a critical sequence for these two answers, a triangle sequence: this sequence slightly increases up to a maximum, and then slightly decreases. With such a sequence and n=10^5, for example, these answers take more than 10s.
My solution starts from #selbie's solution and adds two improvements:
I add #kcsquared's trick: to the right of j, there can only be lower elements
When considering a new left element a[i], it is useless to start from i + 1 to get the second element. We can start from the current best_j
With these tricks, I was able to improve the performance of the two posted answers a little bit. However, it still
fails to solve the triangle sequence issue: about 10s for n = 10^5.
#include <iostream>
#include <vector>
#include <string>
#include <cstdlib>
#include <ctime>
#include <chrono>
#include <climits>
#include <cstdint>
struct Result {
size_t i;
size_t j;
int64_t value;
};
void print (const Result& res, const std::string& prefix = "") {
std::cout << prefix;
std::cout << "(" << res.i << ", " << res.j << ") -> " << res.value << std::endl;
}
Result findBest(const std::vector<int>& a) {
if (a.size() < 2) {
return { 0, 0, INT64_MIN };
}
int n = a.size();
std::vector<int> next_max(n, -1);
int current_max = n-1;
for (int i = n-1; i >= 0; --i) {
if (a[i] > a[current_max]) {
current_max = i;
}
next_max[i] = current_max;
}
size_t besti = 0;
size_t bestj = 0;
int64_t bestvalue = INT64_MIN;
int minimum = INT_MIN;
for (size_t i = 0; i < a.size(); i++) {
if (a[i] <= minimum) {
continue;
}
minimum = a[i];
size_t jmin = (bestj > i) ? bestj : i+1;
for (size_t j = jmin; j < a.size(); j++) {
j = next_max[j];
int64_t value = (a[i] + (int64_t)a[j]) * (j - i);
if (value > bestvalue) {
bestvalue = value;
besti = i;
bestj = j;
}
}
}
return { besti, bestj, bestvalue };
}
int main() {
int n = 1000000;
int vmax = 100000000;
std::vector<int> A (n);
std::srand(std::time(0));
for (int i = 0; i < n; ++i) {
A[i] = rand() % vmax + 1;
}
std::cout << "n = " << n << std::endl;
auto t0 = std::chrono::high_resolution_clock::now();
auto res = findBest (A);
auto t1 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
print (res, "Random: ");
std::cout << "time = " << duration/1000 << " ms" << std::endl;
int i_max = n/2;
for (int i = 0; i < i_max; ++i) A[i] = i+1;
A[i_max] = 10 * i_max;
for (int i = i_max+1; i < n; ++i) {
A[i] = 2*i_max - i;
}
t0 = std::chrono::high_resolution_clock::now();
res = findBest (A);
t1 = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
print (res, "Triangle sequence: ");
std::cout << "time = " << duration/1000 << " ms" << std::endl;
return 0;
}
I am working on a project about multithreading. Here Operation is a class which contains a type, a key, a time and an answer.
Here is my code:
#include <cstdlib>
#include <fstream>
#include <string>
#include <iomanip>
#include <pthread.h>
#include <vector>
#include "block.h"
using namespace std;
std::vector<Operation> *data;
block_bloom_filter filter(10000000, 0.01);
int ans[30000000];
void *test(void *arg)
{
int thread_id = *((int *)arg);
for (auto &op : data[thread_id])
{
if (op.type == 1)
{
filter.insert(op);
}
else
{
filter.query(op);
}
}
return 0;
}
int main(int argc, char **argv)
{
int k = atoi(argv[1]);
int *op_num = new int[k];
data = new vector<Operation>[k];
for (int i = 0; i < k; i++)
{
string tmp = "data" + to_string(i + 1) + ".in";
const char *s = tmp.c_str();
ifstream fin;
fin.open(s);
fin >> op_num[i];
//data[i] = new Operation[op_num[i]];
for (int j = 0; j < op_num[i]; j++)
{
string tmp1;
fin >> tmp1;
if (tmp1 == "insert")
{
Operation tmp2;
tmp2.type = 1;
fin >> tmp2.key >> tmp2.time;
tmp2.ans = -1;
data[i].push_back(tmp2);
}
else
{
Operation tmp2;
tmp2.type = 2;
fin >> tmp2.key >> tmp2.time;
tmp2.ans = -1;
data[i].push_back(tmp2);
}
}
fin.close();
}
auto start = std::chrono::high_resolution_clock::now();
int num_threads = k;
pthread_t *threads = new pthread_t[num_threads];
//auto **threads = new thread *[num_threads];
//pthread_t *threads = new pthread_t[k];
/*for (int i = 0; i < num_threads; i++)
{
threads[i] = new thread(test, i);
}
for (int i = 0; i < num_threads; i++)
{
threads[i]->join();
}*/
for (int i = 0; i < k; i++)
{
pthread_create(&threads[i], NULL, test, (void *)&(i));
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
//std::cerr << "duration = " << duration.count() << "us" << std::endl;
double time_used = duration.count() / 1e3;
std::ofstream f_time("time.out");
f_time << std::fixed << std::setprecision(3) << time_used << std::endl;
f_time.close();
for (int i = 0; i < k; i++)
{
for (int j = 0; j < op_num[i]; j++)
{
ans[data[i][j].time - 1] = data[i][j].ans;
}
}
ofstream fout;
fout.open("result.out");
for (int i = 0; i < 30000000; i++)
{
if (ans[i] >= 0)
fout << ans[i] << endl;
}
fout.close();
delete[] data;
delete[] threads;
delete[] op_num;
//pthread_exit(NULL);
}
My code can compile, but when running it shows segmentation fault and can only generate time.out no result.out. I've been working on it for a long time but still do not know why. Hope someone can help me.
Below is block.h
#include <algorithm>
#include <chrono>
#include <cmath>
#include <ctime>
#include <fstream>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>
#include "Headers/MurmurHash3.h"
#include "xxHash/xxhash.c"
#define M_LN2 0.69314718055994530942
using namespace std;
typedef std::vector<bool> bit_vector;
class Operation
{
public:
int type; // 1: insert, 2: query
char key[17];
int time;
int ans;
};
int str_len = 16;
int cache_size = 64;
int block_size = 512;
int key_num = 10000000;
int slot_num = 1 << 27;
int hash_num = int((double)slot_num / key_num * M_LN2);
int block_num = (slot_num + block_size - 1) / block_size;
class bloom_filter
{
uint32_t size; // Probable Number of elements in universe
double fpr; // False positive rate
int m; // optimal size of bloom filter
int k; // Number of hash functions
bit_vector bloom;
public:
int get_size() { return size; }
double get_fpr() { return fpr; }
bloom_filter(int n, double fpr)
{
this->size = n;
this->fpr = fpr;
this->m = ceil(
-((n * log(fpr)) /
pow(log(2), 2.0))); // Natural logarithm m = −n ln p/(ln 2)2
// cout << m<< "\n";
this->k = ceil(
(m / n) * log(2)); // Calculate k k = (m/n) ln 2 2-k ≈ 0.6185 m/n
// cout << k;
bloom.resize(m, false);
}
void insert(string S)
{
uint32_t *p = new uint32_t(1); // For storing Hash Vaue
const void *str = S.c_str(); // Convert string to C string to use as a
// parameter for constant void
int index;
// cout<<S.length()<<"\t"<<sizeof(str)<<"\n";
// cout<<S<<"\n";
for (int i = 0; i < k; i++)
{
// MurmurHash3_x64_128();
MurmurHash3_x86_32(str, S.length(), i + 1,
p); // String, String size
index = *p % m;
// cout<<*p<<"\t"<<index<<"\t";
bloom[index] = true;
}
// cout<<"\n";
// print();
}
/*void print()
{
for (int i = 0; i < bloom.size(); i++)
{
cout << bloom.at(i);
}
}*/
char query(string S)
{
uint32_t *p = new uint32_t(1); // For storing Hash Vaue
const void *str = S.c_str(); // Convert string to C string to use as a
// parameter for constant void
int index;
// cout << S.length() << "\t" << sizeof(str) << "\n";
// cout<<S<<"\n";
for (int i = 0; i < k; i++)
{
// MurmurHash3_x64_128();
MurmurHash3_x86_32(str, S.length(), i + 1,
p); // String, String size
index = *p % m;
// cout<<*p<<"\t"<<index<<"\t";
if (bloom[index] == false)
return 'N';
}
return 'Y';
}
};
class block_bloom_filter
{
int size; // Probable Number of elements in universe
double fpr; // False positive rate
int m; // optimal size of bloom filter
int k; // Number of hash functions
int s; // Number of bloom filters
bit_vector block_bloom;
int cache_line_size;
public:
int get_size() { return size; }
double get_fpr() { return fpr; }
block_bloom_filter(int n, double fpr)
{
this->size = n;
this->fpr = fpr;
this->m = ceil(
-((n * log(fpr)) /
pow(log(2), 2.0))); // Natural logarithm m = −n ln p/(ln 2)2
// cout << m << "\n";
this->k = ceil(
(m / n) * log(2)); // Calculate k k = (m/n) ln 2 2-k ≈ 0.6185 m/n
// cout << k<<"\n";
this->cache_line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE) * 8;
this->s =
ceil((double)m / cache_line_size); // Total number of Bloom Filters
// cout<<s<<"s valye\n";
block_bloom.resize(cache_line_size * s, false);
}
/*void insert(Operation &S)
{
int block_number;
int first_index, last_index;
int index;
uint32_t *p = new uint32_t(1); // For storing Hash Value
const void *str = S.key.c_str(); // Convert string to C string to use as a
// parameter for constant void
MurmurHash3_x86_32(str, sizeof(str), 1,
p); // String, String size//Find out block number
// if(s!=0)
block_number = *p % s;
first_index = block_number * cache_line_size;
for (int i = 1; i < k; i++)
{
// MurmurHash3_x64_128();
MurmurHash3_x86_32(str, S.key.length(), i + 1,
p); // String, String size
// cout<<*p<<"\n";
// cout<<"div="<<div << "\n";
index = (*p) % cache_line_size;
// cout<<index<<"\t";
// if(index>m) cout<<"\n"<<index<<"\tError detected\n";
// cout<<"\n"<<index<<"a\t\n";
// cout<<"\n"<<first_index<<"a\t\n";
// cout<<(index+first_index)<<"a\t\n";
block_bloom[index + first_index] = true;
}
// cout<<"\n";
// print();
}*/
XXH64_hash_t GetHash(const char *str)
{
return XXH3_64bits_withSeed(str, 16, /* Seed */ 123976235672331983ll);
}
void insert(Operation &s)
{
XXH64_hash_t hash = GetHash(s.key);
XXH64_hash_t hash1 = hash % m;
XXH64_hash_t hash2 = (hash / m) % m;
for (int i = 0; i < k; i++)
{
int pos = (hash1 + i * hash2) % m;
block_bloom[pos] = 1;
}
}
void query(Operation &s)
{
XXH64_hash_t hash = GetHash(s.key);
XXH64_hash_t hash1 = hash % m;
XXH64_hash_t hash2 = (hash / m) % m;
for (int i = 0; i < k; i++)
{
int pos = (hash1 + i * hash2) % m;
if (!block_bloom[pos])
{
s.ans = 0;
return;
}
}
s.ans = 1;
return;
}
};
for (int i = 0; i < k; i++)
{
pthread_create(&threads[i], NULL, test, (void *)&(i));
The fourth parameter to pthread_create(), the thread function's argument, is a pointer to the loop variable. The thread function reads it as follows:
void *test(void *arg)
{
int thread_id = *((int *)arg);
There are no guarantees whatsoever that this gets executed by the new execution thread before the parent execution thread increments i. When it comes to multiple execution threads, neither POSIX nor the C++ library gives you any guarantees as to the relative execution order of multiple threads.
All that pthread_create() guarantees you is that at some point in time later, which can be before or after pthread_create() returns, the new execution thread pops into existence and begins executing the thread function.
And it may very well be that one or more (if not all) execution threads finally begin executing, for real, after the for loop terminates and i gets destroyed. At which point, when they do start executing, they will discover a pointer to a destroyed variable as their argument, and dereferencing it becomes undefined behavior.
Or, some of those execution threads get their gear running, at some point after they get created. By this time i's been incremented a couple of times already. So they both read the *(int *)arg, whose value is now -- who knows? And, just to make things interesting, both execution threads do this at the same time, and read the same value. At this point, the end result is already going to be garbage. It is clear that the intent here is for each execution thread to get a unique value for its parameter, but this is very unlikely to happen here. There's nothing in the shown code that ensures that each execution thread actually gets its own unique thread_id.
Additionally, the original parent execution thread seems to assume that all the execution threads will all finish their job before the parent execution thread reads their results, and writes them out to a file.
Unfortunately, there's no code in the parent execution thread that appears to actually wait for all execution threads to finish. As soon as they're all started, it takes it on faith that they complete instantly, and it reads the partial results, and writes it out to a file:
auto stop = std::chrono::high_resolution_clock::now();
Well, the bad news here is that there's nothing that actually waits for all execution threads to actually stop, at this point. They're still running here. Even if the program manages to avoid crashing, the output results will be incomplete, and mostly junk.
ans[data[i][j].time - 1]
It appears that the value of .time here was originally read from the input file. There does not appear to be any bounds checking here. It's possible for this vector/array access to be out of bounds, resulting in an undefined behavior and a likely crash.
Also, another problem with the shown code: There are plenty of calls to new, but only some of those get deleted, resulting in multiple memory leaks. Inspecting the shown code, there is no clear reason to new anything, in the first place.
In conclusion, there are multiple problems with the shown code that result in undefined behavior, and any of them could be the reason for the observed crash. The shown approach is very error-prone, and will require much more substantial work, proper multi-threading support, and inter-thread sequencing, in order to get all events to happen in the correct order across all the execution threads.
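As a rough illustration of the two main points (each thread receives its own id by value, and the parent waits for every thread before reading results), here is a minimal sketch. It uses std::thread rather than pthreads, which is a substitution on my part (the question's commented-out code hints at it), and the per-thread work, summing a slice, is a hypothetical stand-in for the filter operations.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker processes only data[id] and writes only partial[id].
long long processAll(const std::vector<std::vector<int>>& data)
{
    std::vector<long long> partial(data.size(), 0);
    std::vector<std::thread> threads;
    threads.reserve(data.size());

    for (std::size_t id = 0; id < data.size(); ++id)
    {
        // id is captured by value, so later loop iterations cannot change it.
        threads.emplace_back([id, &data, &partial] {
            for (int v : data[id])
                partial[id] += v;
        });
    }
    assert(threads.size() == data.size());

    // Wait for every worker before touching the results.
    for (auto& t : threads)
        t.join();

    long long total = 0;
    for (long long p : partial)
        total += p;
    return total;
}
```

With pthreads, the equivalent would be to store the ids in an array that outlives the threads and call pthread_join() on each one before taking the stop timestamp. Note that this sketch does not address concurrent access to a shared structure such as the filter, which would need its own synchronization.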
(Keep in mind, I'm a complete beginner to c++)
I have tried to write a sieve of Eratosthenes function in C++, and it currently looks as follows:
#include <iostream>
#include <unordered_set>
#include <vector>
#include <numeric>
#include <chrono>
int main() {
auto start = std::chrono::high_resolution_clock::now();
int max_val = 100000;
std::vector<int> primes(max_val);
std::iota(primes.begin(), primes.end(), 0);
for (int i = 2; i <= primes.size(); i++) {
int j = i+i;
while (j < primes.size()) {
primes[j] = 0;
j += i;
}
}
std::unordered_set<int> set_primes(primes.begin(), primes.end());
set_primes.erase(1);
set_primes.erase(0);
std::cout << set_primes.size() << "\n";
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
std::cout << duration.count() << "\n|";
}
However, this function is very inefficient (generating a set of primes less than 10mil takes around 11 seconds, where my python program can do it in about 4). I'm guessing the issue might lie in me iterating over millions of zeros in my vector when creating the unordered_set. What could I do to improve the efficiency of my program?
First, every composite number has a divisor no greater than its square root, so the outer loop only needs to run from 2 to sqrt(n). Second, you can start the inner loop from i * i instead of i + i, because every smaller multiple of i has a smaller prime factor and has already been crossed out. So the final code is going to be:
int n;
cin >> n;
vector<bool> prime (n+1, true);
prime[0] = prime[1] = false;
int i = 2;
while (i * i <= n) {
if (prime[i])
for (int j = i * i; j <= n; j += i)
prime[j] = false;
i++;
}
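For reference, the same idea can be wrapped in a self-contained function. This is a sketch rather than a tuned implementation: the function name is hypothetical, and it counts the primes directly from the bit vector instead of building an unordered_set, on the assumption that only the count is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of primes p with 2 <= p <= n.
std::size_t countPrimesUpTo(int n)
{
    assert(n >= 0);
    if (n < 2)
        return 0;

    std::vector<bool> prime(static_cast<std::size_t>(n) + 1, true);
    prime[0] = prime[1] = false;

    // long long avoids overflow of i * i for large n.
    for (long long i = 2; i * i <= n; ++i)
    {
        if (!prime[i])
            continue;
        for (long long j = i * i; j <= n; j += i)
            prime[j] = false; // j is a multiple of i, hence composite
    }

    std::size_t count = 0;
    for (bool isPrime : prime)
        if (isPrime)
            ++count;
    return count;
}
```

For example, countPrimesUpTo(100) returns 25.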
I am looking for the best algorithm to do a rotation of a row in a 2D array / matrix. Let's say we have
mat[3][3] = {{1,2,3},{4,5,6},{7,8,9}};
I want to shift the elements to the left by one, so the first row 1 2 3 becomes 2 3 1. The function does this by copying each element to the left through dynamically allocated memory.
void rotate_row(const int& row) {
int *temp_row = (int *) (malloc(3 * sizeof(int))); // room for 3 ints, not 3 bytes
for (int i = 0; i < 3; ++i) {
temp_row[i % 3] = mat[row][(i + 1) % 3];
}
memcpy(mat[row], temp_row, 3 * sizeof(int)); // copy all 3 ints back
free(temp_row);
}
To manipulate any specific row, we simply call the function rotate_row(row).
I don't quite understand the malloc thing in C, since I grew up learning a completely different way of dynamic allocation, so I first changed it to:
void rotate_rows(const int& row) {
//int *temp_row = (int *) (malloc(3));
int *temp_row = new int[3];
for (int i = 0; i < 3; ++i) {
temp_row[i % 3] = mat[row][(i + 1) % 3];
}
memcpy( mat[row], temp_row, 3 * sizeof(int)); // size is in bytes, so copy 3 ints
//free(temp_row);
delete [] temp_row;
temp_row = NULL;
}
My first question is: will simply changing the way the memory is dynamically allocated speed up the code?
Also, I don't think it is necessary to use dynamic memory allocation for my purpose (rotating the row). Is there any better (not necessarily the best) algorithm available?
Rotating will not change array size, hence doing it in-place sounds much more performant to me, no need for dynamic memory allocation and freeing previous pointer.
void rotate(int * array, size_t n) {
if (n <= 1)
return;
const int head = array[0];
for (size_t i = 1; i < n; ++i)
array[i - 1] = array[i];
array[n - 1] = head;
}
You can avoid all the dynamic memory allocation, and use the std::rotate algorithm:
#include <algorithm>
#include <iostream>
int main()
{
int mat[3][3] = { {1,2,3},{4,5,6},{7,8,9} };
// rotate left each row by 1
for (int i = 0; i < 3; ++i)
std::rotate(&mat[i][0], &mat[i][1], &mat[i][3]);
for (int i = 0; i < 3; ++i)
std::cout << mat[i][0] << " " << mat[i][1] << " " << mat[i][2] << "\n";
}
Output:
2 3 1
5 6 4
8 9 7
Edit:
Here is a sample of rotating each row by its row index + 1:
#include <algorithm>
#include <iostream>
int main()
{
int mat[3][3] = { {1,2,3},{4,5,6},{7,8,9} };
// rotate left each row by its row index + 1
for (int i = 0; i < 3; ++i)
std::rotate(&mat[i][0], &mat[i][i+1], &mat[i][3]);
for (int i = 0; i < 3; ++i)
std::cout << mat[i][0] << " " << mat[i][1] << " " << mat[i][2] << "\n";
}
Output:
2 3 1
6 4 5
7 8 9
Yes there's a better way and no need to use dynamic memory allocation. You actually don't need another array to solve this.
Here's a sample code:
for(int i = 0; i < n/2; i++){
for(int j = i; j < n-i-1; j++){
int tmp = matrix[i][j];
matrix[i][j] = matrix[n-j-1][i];
matrix[n-j-1][i] = matrix[n-i-1][n-j-1];
matrix[n-i-1][n-j-1] = matrix[j][n-i-1];
matrix[j][n-i-1] = tmp;
}
}
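This is pretty straightforward, so I think you'll understand the code easily without explanation. Note that it rotates the entire n x n matrix 90 degrees clockwise in place, moving four elements per step, rather than shifting a single row.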
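For anyone who wants to try the loop above in isolation, here is a self-contained sketch of it. The Matrix alias and the function name are assumptions for illustration; the loop body is the same four-way swap, which turns the matrix 90 degrees clockwise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Rotates a square matrix 90 degrees clockwise in place, one ring at a time.
void rotateClockwise(Matrix& matrix)
{
    const std::size_t n = matrix.size();
    for (const auto& row : matrix)
        assert(row.size() == n); // the swap pattern only works for square matrices

    for (std::size_t i = 0; i < n / 2; i++)
    {
        for (std::size_t j = i; j < n - i - 1; j++)
        {
            int tmp = matrix[i][j];
            matrix[i][j] = matrix[n - j - 1][i];                 // left column -> top row
            matrix[n - j - 1][i] = matrix[n - i - 1][n - j - 1]; // bottom row -> left column
            matrix[n - i - 1][n - j - 1] = matrix[j][n - i - 1]; // right column -> bottom row
            matrix[j][n - i - 1] = tmp;                          // top row -> right column
        }
    }
}
```

Applied to {{1,2,3},{4,5,6},{7,8,9}}, this gives {{7,4,1},{8,5,2},{9,6,3}}.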