Windows Threads: Parallel Mergesort - c++

I have what is hopefully a very easy question; I just can't find the answer online. I made a merge sort function (which I'm sure has inefficiencies), but I'm here to ask about the threads. I'm using Windows' CreateThread function to spawn threads to sort intervals of a given array. Once all the threads are finished, I will merge the segments together for the final result. I haven't implemented the final merge yet because I'm getting strange errors, which I'm sure come from a dumb mistake in the threads. I'll post my code; if you could, kindly look at paraMSort. I'll post the whole MergeSort.h file so you can see the helper functions as well. Sometimes the code will compile and run perfectly. Sometimes the console will abruptly close with no errors/exceptions. There shouldn't be mutex issues because I'm doing operations on different segments of the array (different memory locations altogether). Does anyone see something wrong? Thanks so much.
PS. Are threads made with Windows' CreateThread kernel level? In other words, if I make 2 threads on a dual core computer, may they run simultaneously on separate cores? I'm thinking yes, since on this computer I can do the same work in 1/2 the time with 2 threads (with another test example).
PPS. I also saw some parallelism answers solved using Boost.Thread. Should I just use boost threads instead of windows threads? I don't have experience with Boost.
#include "Windows.h"
#include <iostream>
using namespace std;
void insert_sort(int* A, int sA, int eA, int* B, int sB, int eB)
{
int value;
int iterator;
for(int i = sA + 1; i < eA; i++)
{
value = A[i]; // Grab the next value in the array
iterator = i - 1;
// Move this value left up the list until its in the right spot
while(iterator >= sA && value < A[iterator])
A[iterator + 1] = A[iterator--];
A[iterator + 1] = value; // Put value in its correct spot
}
for(int i = sA; i < eB; i++)
{
B[i] = A[i]; // Put results in B
}
}
void merge_func(int* a, int sa, int ea, int* b, int sb, int eb, int* c, int sc)
{
int i = sa, j = sb, k = sc;
while(i < ea && j < eb)
c[k++] = a[i] < b[j] ? a[i++] : b[j++];
while(i < ea)
c[k++] = a[i++];
while(j < eb)
c[k++] = b[j++];
}
void msort_big(int* a, int* b, int s, int e, bool inA)
{
if(e-s < 4)
{
insert_sort(a, s, e, b, s, e);
return; // We sorted (A,s,e) into (B,s,e).
}
int m = (s + e)/2;
msort_big(a, b, s, m, !inA);
msort_big(a, b, m, e, !inA);
// If we want to merge in A, do it. Otherwise, merge in B
inA ? merge_func(b, s, m, b, m, e, a, s) : merge_func(a, s, m, a, m, e, b, s);
}
void msort(int* toBeSorted, int s, int e)
// Sorts toBeSorted from [s, e+1), so just enter [s, e] and
// the call to msort_big adds one.
{
int* b = new int[e - s + 1];
msort_big(toBeSorted, b, s, e+1, true);
delete [] b;
}
template <class T>
struct SortData_Send
{
T* data;
int start;
int end;
};
DWORD WINAPI msort_para_callback(LPVOID lpParam)
{
SortData_Send<int> dat = *(SortData_Send<int>*)lpParam;
msort(dat.data, dat.start, dat.end);
cout << "done! " << endl;
}
int ceiling_func(double num)
{
int temp = (int)num;
if(num > (double)temp)
{
return temp + 1;
}
else
{
return temp;
}
}
void paraMSort(int* toBeSorted, int s, int e, int numThreads)
{
HANDLE threads[numThreads];
DWORD threadIDs[numThreads];
SortData_Send<int>* sent[numThreads];
for(int i = 0; i < numThreads; i++)
{
// So for each thread, make an interval and pass the pointer to the array of ints.
// So for numThreads = 3 and array size of 0 to 99 (100), we have 0-32, 33-65, 66-100.
// 100 because sort function takes [start, end).
sent[i] = new SortData_Send<int>;
sent[i]->data = toBeSorted;
sent[i]->start = s + ceiling_func(double(i)*(double)e/double(numThreads));
sent[i]->end = ceiling_func(double(i+1)*double(e)/double(numThreads)) + ((i == numThreads-1) ? 1 : -1);
threads[i] = CreateThread(NULL, 0, msort_para_callback, sent[i], 0, &threadIDs[i]);
}
WaitForMultipleObjects(numThreads, threads, true, INFINITE);
cout << "Done waiting!" <<endl;
}

Assuming 's' is your starting point and 'e' is your ending point for the whole range, shouldn't your code be something like this?
sent[i]->start = s + ceiling_func(double(i)*(double)(e-s)/double(numThreads));
sent[i]->end = (i == numThreads-1) ? e : (s - 1 + ceiling_func(double(i+1)*(double)(e-s)/double(numThreads)));
This matters when your function void paraMSort(int* toBeSorted, int s, int e, int numThreads) is called with a value of 's' not equal to 0. In that case the original computation could cause you to read the wrong sections of memory.
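A minimal sketch of another way to compute the intervals with integer arithmetic only, assuming paraMSort keeps the inclusive [s, e] convention that msort uses. The helper name chunk_bounds is hypothetical and not part of the original code:

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: splits the inclusive index range [s, e] into
// numThreads contiguous inclusive chunks whose sizes differ by at most one.
std::vector<std::pair<int, int>> chunk_bounds(int s, int e, int numThreads)
{
    std::vector<std::pair<int, int>> chunks;
    const int total = e - s + 1;            // number of elements to sort
    const int base  = total / numThreads;   // minimum chunk size
    const int extra = total % numThreads;   // the first `extra` chunks get one more
    int start = s;
    for (int i = 0; i < numThreads; ++i) {
        const int len = base + (i < extra ? 1 : 0);
        chunks.emplace_back(start, start + len - 1);  // inclusive end
        start += len;
    }
    return chunks;
}
```

For s = 0, e = 99 and 3 threads this gives 0-33, 34-66 and 67-99. If there are fewer elements than threads, some chunks come out empty (end < start), so the caller would need to skip those.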


How can I approach this CP task?

The task (from a Bulgarian judge, click on "Език" to change it to English):
I am given the size of the first (S_1 = A) of N corals. The size of every subsequent coral (S_i, where i > 1) is calculated using the formula (B*S_{i-1} + C) % D, where A, B, C and D are some constants. I am told that Nemo is nearby the Kth coral (when the sizes of all corals are sorted in ascending order).
What is the size of the above-mentioned Kth coral ?
I will have T tests and for every one of them I will be given N, K, A, B, C and D and prompted to output the size of the Kth coral.
The requirements:
1 ≤ T ≤ 3
1 ≤ K ≤ N ≤ 10^7
0 ≤ A < D ≤ 10^18
1 ≤ C, B*D ≤ 10^18
Memory available is 64 MB
Time limit is 1.9 sec
The problem I have:
For the worst case scenario I will need 10^7 * 8 B, which is about 76 MB.
If at least 80 MB of memory were available, the solution would be:
#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
using biggie = long long;
int main() {
int t;
std::cin >> t;
int i, n, k, j;
biggie a, b, c, d;
std::vector<biggie>::iterator it_ans;
for (i = 0; i != t; ++i) {
std::cin >> n >> k >> a >> b >> c >> d;
std::vector<biggie> lut{ a };
lut.reserve(n);
for (j = 1; j != n; ++j) {
lut.emplace_back((b * lut.back() + c) % d);
}
it_ans = std::next(lut.begin(), k - 1);
std::nth_element(lut.begin(), it_ans, lut.end());
std::cout << *it_ans << '\n';
}
return 0;
}
Question 1: How can I approach this CP task given the requirements listed above ?
Question 2: Is it somehow possible to use std::nth_element to solve it since I am not able to store all N elements ? I mean using std::nth_element in a sliding window technique (If this is possible).
# Christian Sloper
#include <iostream>
#include <queue>
using biggie = long long;
int main() {
int t;
std::cin >> t;
int i, n, k, j, j_lim;
biggie a, b, c, d, prev, curr;
for (i = 0; i != t; ++i) {
std::cin >> n >> k >> a >> b >> c >> d;
if (k < n - k + 1) {
std::priority_queue<biggie, std::vector<biggie>, std::less<biggie>> q;
q.push(a);
prev = a;
for (j = 1; j != k; ++j) {
curr = (b * prev + c) % d;
q.push(curr);
prev = curr;
}
for (; j != n; ++j) {
curr = (b * prev + c) % d;
if (curr < q.top()) {
q.pop();
q.push(curr);
}
prev = curr;
}
std::cout << q.top() << '\n';
}
else {
std::priority_queue<biggie, std::vector<biggie>, std::greater<biggie>> q;
q.push(a);
prev = a;
for (j = 1, j_lim = n - k + 1; j != j_lim; ++j) {
curr = (b * prev + c) % d;
q.push(curr);
prev = curr;
}
for (; j != n; ++j) {
curr = (b * prev + c) % d;
if (curr > q.top()) {
q.pop();
q.push(curr);
}
prev = curr;
}
std::cout << q.top() << '\n';
}
}
return 0;
}
This gets accepted (Succeeds all 40 tests. Largest time 1.4 seconds, for a test with T=3 and D≤10^9. Largest time for a test with larger D (and thus T=1) is 0.7 seconds.).
#include <iostream>
using biggie = long long;
int main() {
int t;
std::cin >> t;
int i, n, k, j;
biggie a, b, c, d;
for (i = 0; i != t; ++i) {
std::cin >> n >> k >> a >> b >> c >> d;
biggie prefix = 0;
for (int shift = d > 1000000000 ? 40 : 20; shift >= 0; shift -= 20) {
biggie prefix_mask = ((biggie(1) << (40 - shift)) - 1) << (shift + 20);
int count[1 << 20] = {0};
biggie s = a;
int rank = 0;
for (j = 0; j != n; ++j) {
biggie s_vs_prefix = s & prefix_mask;
if (s_vs_prefix < prefix)
++rank;
else if (s_vs_prefix == prefix)
++count[(s >> shift) & ((1 << 20) - 1)];
s = (b * s + c) % d;
}
int i = -1;
while (rank < k)
rank += count[++i];
prefix |= biggie(i) << shift;
}
std::cout << prefix << '\n';
}
return 0;
}
The result is a 60 bits number. I first determine the high 20 bits with one pass through the numbers, then the middle 20 bits in another pass, then the low 20 bits in another.
For the high 20 bits, generate all the numbers and count how often each high-20-bits pattern occurs. After that, add up the counts until you reach K. The pattern where you reach K covers the K-th smallest number. In other words, that's the result's high 20 bits.
The middle and low 20 bits are computed similarly, except we take the by then known prefix (the high 20 bits or high+middle 40 bits) into account. As a little optimization, when D is small, I skip computing the high 20 bits. That got me from 2.1 seconds down to 1.4 seconds.
This solution is like user3386109 described, except with bucket size 2^20 instead of 10^6 so I can use bit operations instead of divisions and think of bit patterns instead of ranges.
For the memory constraint you hit:
(B*S_{i-1} + C) % D
requires only the value directly before it. So you don't have to store every element: you can treat the values in pairs and use only 1/2 of the total memory you would otherwise need. Store only the even-indexed values and compute each odd-indexed value on the fly from its stored predecessor. So you can use a half-length LUT and compute the odd values in-flight. Modern CPUs are fast enough to do extra calculations like these.
std::vector<biggie> lut{ a_i,a_i_2,a_i_4,... };
a_i_3=computeOddFromEven(lut[1]);
You can use a longer stride such as 4 or 8 too. If the dataset is large, RAM latency is significant, so this is like placing checkpoints across the whole search space to balance memory use against core use. Checkpoints 1000 elements apart would spend a lot of CPU cycles on recalculation, but then the array would fit in the CPU's L2/L1 cache, which is not bad. When sorting, each element access would then need up to 1000 recalculation iterations, so O(1000 x size) overall. That may be a big constant, but the compiler might be able to optimize it if some of the constants are really compile-time constants.
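A minimal sketch of the checkpoint idea, assuming 0-based indices and a fixed stride. The names next_term, build_checkpoints and term_at are made up for illustration:

```cpp
#include <vector>

using biggie = long long;

// Next term of the recurrence S_i = (B*S_{i-1} + C) % D.
// Assumes b*s + c fits in a long long, which the task's bounds allow.
inline biggie next_term(biggie s, biggie b, biggie c, biggie d)
{
    return (b * s + c) % d;
}

// Stores only every `stride`-th term: checkpoints[j] is the term at index j*stride.
std::vector<biggie> build_checkpoints(int n, int stride,
                                      biggie a, biggie b, biggie c, biggie d)
{
    std::vector<biggie> checkpoints;
    checkpoints.reserve((n + stride - 1) / stride);
    biggie s = a;
    for (int i = 0; i < n; ++i) {
        if (i % stride == 0) checkpoints.push_back(s);
        s = next_term(s, b, c, d);
    }
    return checkpoints;
}

// Recomputes the term at `index` from the nearest checkpoint at or before it.
// Costs at most stride - 1 extra iterations per lookup.
biggie term_at(const std::vector<biggie>& checkpoints, int stride, int index,
               biggie b, biggie c, biggie d)
{
    biggie s = checkpoints[index / stride];
    for (int step = 0; step < index % stride; ++step)
        s = next_term(s, b, c, d);
    return s;
}
```

This trades memory for time: with stride L the table holds about N/L values, and every lookup pays up to L - 1 recurrence steps.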
If CPU performance becomes problem again:
write a compiling function that writes your source code with all the "constant" given by user to a string
compile the code using command-line (assuming target computer has some accessible from command line like g++ from main program)
run it and get results
Compiler should enable more speed/memory optimizations when those are really constant in compile-time rather than depending on std::cin.
If you really need to add a hard-limit to the RAM usage, then implement a simple cache with the backing-store as your heavy computations with brute-force O(N^2) (or O(L x N) with checkpoints every L elements as in first method where L=2 or 4, or ...).
Here's a sample direct-mapped cache with 8M long-long value space:
int main()
{
std::vector<long long> checkpoints = {
a_0, a_16, a_32,...
};
auto cacheReadMissFunction = [&](int key){
// your pure computational algorithm here, helper meant to show variable
long long result = checkpoints[key/16];
for(int step = 0; step < key % 16; ++step) // advance from the checkpoint to the key
result = iterate(result);
return result;
};
auto cacheWriteMissFunction = [&](int key, long long value){
/* not useful for your algorithm as it doesn't change behavior per element */
// backing_store[key] = value;
};
// due to special optimizations, size has to be 2^k
int cacheSize = 1024*1024*8;
DirectMappedCache<int, long long> cache(cacheSize,cacheReadMissFunction,cacheWriteMissFunction);
std::cout << cache.get(20)<<std::endl;
return 0;
}
If you use a cache-friendly sorting-algorithm, a direct cache access would make a lot of re-use for nearly all the elements in comparisons if you fill the output buffer/terminal with elements one by one by following something like a bitonic-sort-path (that is known in compile-time). If that doesn't work, then you can try accessing files as a "backing-store" of cache for sorting whole array at once. Is file system prohibited for use? Then the online-compiling method above won't work either.
Implementation of a direct mapped cache (don't forget to call flush() after your algorithm finishes, if you use any cache.set() method):
#ifndef DIRECTMAPPEDCACHE_H_
#define DIRECTMAPPEDCACHE_H_
#include<vector>
#include<functional>
#include<mutex>
#include<iostream>
/* Direct-mapped cache implementation
* Only usable for integer type keys in range [0,maxPositive-1]
*
* CacheKey: type of key (only integers: int, char, size_t)
* CacheValue: type of value that is bound to key (same as above)
*/
template< typename CacheKey, typename CacheValue>
class DirectMappedCache
{
public:
// allocates buffers for numElements number of cache slots/lanes
// readMiss: cache-miss for read operations. User needs to give this function
// to let the cache automatically get data from backing-store
// example: [&](MyClass key){ return redis.get(key); }
// takes a CacheKey as key, returns CacheValue as value
// writeMiss: cache-miss for write operations. User needs to give this function
// to let the cache automatically set data to backing-store
// example: [&](MyClass key, MyAnotherClass value){ redis.set(key,value); }
// takes a CacheKey as key and CacheValue as value
// numElements: has to be integer-power of 2 (e.g. 2,4,8,16,...)
DirectMappedCache(CacheKey numElements,
const std::function<CacheValue(CacheKey)> & readMiss,
const std::function<void(CacheKey,CacheValue)> & writeMiss):size(numElements),sizeM1(numElements-1),loadData(readMiss),saveData(writeMiss)
{
// initialize buffers
for(size_t i=0;i<numElements;i++)
{
valueBuffer.push_back(CacheValue());
isEditedBuffer.push_back(0);
keyBuffer.push_back(CacheKey()-1);// mapping of 0+ allowed
}
}
// get element from cache
// if cache doesn't find it in buffers,
// then cache gets data from backing-store
// then returns the result to user
// then cache is available from RAM on next get/set access with same key
inline
const CacheValue get(const CacheKey & key) noexcept
{
return accessDirect(key,nullptr);
}
// only syntactic difference
inline
const std::vector<CacheValue> getMultiple(const std::vector<CacheKey> & key) noexcept
{
const int n = key.size();
std::vector<CacheValue> result(n);
for(int i=0;i<n;i++)
{
result[i]=accessDirect(key[i],nullptr);
}
return result;
}
// thread-safe but slower version of get()
inline
const CacheValue getThreadSafe(const CacheKey & key) noexcept
{
std::lock_guard<std::mutex> lg(mut);
return accessDirect(key,nullptr);
}
// set element to cache
// if cache doesn't find it in buffers,
// then cache sets data on just cache
// writing to backing-store only happens when
// another access evicts the cache slot containing this key/value
// or when cache is flushed by flush() method
// then returns the given value back
// then cache is available from RAM on next get/set access with same key
inline
void set(const CacheKey & key, const CacheValue & val) noexcept
{
accessDirect(key,&val,1);
}
// thread-safe but slower version of set()
inline
void setThreadSafe(const CacheKey & key, const CacheValue & val) noexcept
{
std::lock_guard<std::mutex> lg(mut);
accessDirect(key,&val,1);
}
// use this before closing the backing-store to store the latest bits of data
void flush()
{
try
{
std::lock_guard<std::mutex> lg(mut);
for (size_t i=0;i<size;i++)
{
if (isEditedBuffer[i] == 1)
{
isEditedBuffer[i]=0;
auto oldKey = keyBuffer[i];
auto oldValue = valueBuffer[i];
saveData(oldKey,oldValue);
}
}
}catch(std::exception &ex){ std::cout<<ex.what()<<std::endl; }
}
// direct mapped access
// opType=0: get
// opType=1: set
CacheValue const accessDirect(const CacheKey & key,const CacheValue * value, const bool opType = 0)
{
// find tag mapped to the key
CacheKey tag = key & sizeM1;
// compare keys
if(keyBuffer[tag] == key)
{
// cache-hit
// "set"
if(opType == 1)
{
isEditedBuffer[tag]=1;
valueBuffer[tag]=*value;
}
// cache hit value
return valueBuffer[tag];
}
else // cache-miss
{
CacheValue oldValue = valueBuffer[tag];
CacheKey oldKey = keyBuffer[tag];
// eviction algorithm start
if(isEditedBuffer[tag] == 1)
{
// if it is "get"
if(opType==0)
{
isEditedBuffer[tag]=0;
}
saveData(oldKey,oldValue);
// "get"
if(opType==0)
{
const CacheValue && loadedData = loadData(key);
valueBuffer[tag]=loadedData;
keyBuffer[tag]=key;
return loadedData;
}
else /* "set" */
{
valueBuffer[tag]=*value;
keyBuffer[tag]=key;
return *value;
}
}
else // not edited
{
// "set"
if(opType == 1)
{
isEditedBuffer[tag]=1;
}
// "get"
if(opType == 0)
{
const CacheValue && loadedData = loadData(key);
valueBuffer[tag]=loadedData;
keyBuffer[tag]=key;
return loadedData;
}
else // "set"
{
valueBuffer[tag]=*value;
keyBuffer[tag]=key;
return *value;
}
}
}
}
private:
const CacheKey size;
const CacheKey sizeM1;
std::mutex mut;
std::vector<CacheValue> valueBuffer;
std::vector<unsigned char> isEditedBuffer;
std::vector<CacheKey> keyBuffer;
const std::function<CacheValue(CacheKey)> loadData;
const std::function<void(CacheKey,CacheValue)> saveData;
};
#endif /* DIRECTMAPPEDCACHE_H_ */
You can solve this problem using a Max-heap.
Insert the first k elements into the max-heap. The largest element of these k will now be at the root.
For each remaining element e:
Compare e to the root.
If e is larger than the root, discard it.
If e is smaller than the root, remove the root and insert e into the heap structure.
After all elements have been processed, the k-th smallest element is at the root.
This method uses O(K) space and O(N log K) time.
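The steps above can be sketched as follows, assuming the values arrive one at a time from a generator. The function name kth_smallest_stream and the std::function parameter are illustrative and not from the question:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Returns the k-th smallest (1-based) of the n values produced by next().
// Assumes 1 <= k <= n; at most k values are held in memory at any time.
long long kth_smallest_stream(int n, int k, const std::function<long long()>& next)
{
    std::priority_queue<long long> heap;  // std::priority_queue is a max-heap by default
    for (int i = 0; i < n; ++i) {
        const long long e = next();
        if (static_cast<int>(heap.size()) < k) {
            heap.push(e);                 // still filling the first k slots
        } else if (e < heap.top()) {
            heap.pop();                   // e replaces the largest of the kept values
            heap.push(e);
        }
        // otherwise e is at least as large as the root and is discarded
    }
    return heap.top();                    // largest of the k smallest values
}
```

When K is larger than N/2, the same idea with a min-heap of size N - K + 1 keeps the heap smaller, which is what the code in the question does.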
There’s an algorithm that people often call LazySelect that I think would be perfect here.
With high probability, we make two passes. In the first pass, we save a random sample of size n much less than N. The answer will be around index (K/N)n in the sorted sample, but due to the randomness, we have to be careful. Save the values a and b at (K/N)n ± r instead, where r is the radius of the window. In the second pass, we save all of the values in [a, b], count the number of values less than a (let it be L), and select the value with index K−L if it’s in the window (otherwise, try again).
The theoretical advice on choosing n and r is fine, but I would be pragmatic here. Choose n so that you use most of the available memory; the bigger the sample, the more informative it is. Choose r fairly large as well, but not quite as aggressively due to the randomness.
C++ code below. On the online judge, it’s faster than Kelly’s (max 1.3 seconds on the T=3 tests, 0.5 on the T=1 tests).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <limits>
#include <optional>
#include <random>
#include <vector>
namespace {
class LazySelector {
public:
static constexpr std::int32_t kTargetSampleSize = 1000;
explicit LazySelector() { sample_.reserve(1000000); }
void BeginFirstPass(const std::int32_t n, const std::int32_t k) {
sample_.clear();
mask_ = n / kTargetSampleSize;
mask_ |= mask_ >> 1;
mask_ |= mask_ >> 2;
mask_ |= mask_ >> 4;
mask_ |= mask_ >> 8;
mask_ |= mask_ >> 16;
}
void FirstPass(const std::int64_t value) {
if ((gen_() & mask_) == 0) {
sample_.push_back(value);
}
}
void BeginSecondPass(const std::int32_t n, const std::int32_t k) {
sample_.push_back(std::numeric_limits<std::int64_t>::min());
sample_.push_back(std::numeric_limits<std::int64_t>::max());
const double p = static_cast<double>(sample_.size()) / n;
const double radius = 2 * std::sqrt(sample_.size());
const auto lower =
sample_.begin() + std::clamp<std::int32_t>(std::floor(p * k - radius),
0, sample_.size() - 1);
const auto upper =
sample_.begin() + std::clamp<std::int32_t>(std::ceil(p * k + radius), 0,
sample_.size() - 1);
std::nth_element(sample_.begin(), upper, sample_.end());
std::nth_element(sample_.begin(), lower, upper);
lower_ = *lower;
upper_ = *upper;
sample_.clear();
less_than_lower_ = 0;
equal_to_lower_ = 0;
equal_to_upper_ = 0;
}
void SecondPass(const std::int64_t value) {
if (value < lower_) {
++less_than_lower_;
} else if (upper_ < value) {
} else if (value == lower_) {
++equal_to_lower_;
} else if (value == upper_) {
++equal_to_upper_;
} else {
sample_.push_back(value);
}
}
std::optional<std::int64_t> Select(std::int32_t k) {
if (k < less_than_lower_) {
return std::nullopt;
}
k -= less_than_lower_;
if (k < equal_to_lower_) {
return lower_;
}
k -= equal_to_lower_;
if (k < sample_.size()) {
const auto kth = sample_.begin() + k;
std::nth_element(sample_.begin(), kth, sample_.end());
return *kth;
}
k -= sample_.size();
if (k < equal_to_upper_) {
return upper_;
}
return std::nullopt;
}
private:
std::default_random_engine gen_;
std::vector<std::int64_t> sample_ = {};
std::int32_t mask_ = 0;
std::int64_t lower_ = std::numeric_limits<std::int64_t>::min();
std::int64_t upper_ = std::numeric_limits<std::int64_t>::max();
std::int32_t less_than_lower_ = 0;
std::int32_t equal_to_lower_ = 0;
std::int32_t equal_to_upper_ = 0;
};
} // namespace
int main() {
int t;
std::cin >> t;
for (int i = t; i > 0; --i) {
std::int32_t n;
std::int32_t k;
std::int64_t a;
std::int64_t b;
std::int64_t c;
std::int64_t d;
std::cin >> n >> k >> a >> b >> c >> d;
std::optional<std::int64_t> ans = std::nullopt;
LazySelector selector;
do {
{
selector.BeginFirstPass(n, k);
std::int64_t s = a;
for (std::int32_t j = n; j > 0; --j) {
selector.FirstPass(s);
s = (b * s + c) % d;
}
}
{
selector.BeginSecondPass(n, k);
std::int64_t s = a;
for (std::int32_t j = n; j > 0; --j) {
selector.SecondPass(s);
s = (b * s + c) % d;
}
}
ans = selector.Select(k - 1);
} while (!ans);
std::cout << *ans << '\n';
}
}

D-lang being faster than C++? [closed]

So I was practising for an upcoming programming contest on algorithms, and I stumbled upon a problem from the previous year.
I pretty much solved it (in C++), but I was getting some timeouts, so I took a look at the official solution, and it was written in Dlang.
I then tried to imitate what the official answer did in D, but I was still getting timeouts (> 4 seconds on a single input). Afaik, C++ is supposed to be faster than D, but D solves the same input in a split second and C++ takes more than 5 seconds for it.
Here is the D answer code
import std.stdio;
import std.algorithm;
struct edge {
int src, des, w, o;
int opCmp (ref const edge e) const {
if(w != e.w) return w - e.w;
else return o - e.o;
}
};
const int MAXN = 100004, MAXM = 200004;
int N, M, D, ee, weight, days;
int[MAXN] ds;
edge[] edges;
void init() {
for(int i=1;i<=N;i++) ds[i] = i;
}
int find(int x) {
return ds[x] = (x == ds[x] ? x: find(ds[x]));
}
bool connected(int x, int y) {
return find(x) == find(y);
}
bool merge(int x, int y) {
int xr = find(x), yr = find(y);
if(xr ^ yr) {
ds[xr] = yr;
return 1;
}
return 0;
}
void main() {
scanf("%d%d%d", &N, &M, &D);
for(int i=1, a, b, c;i<=M;i++) {
scanf("%d%d%d", &a, &b, &c);
if(i < N)
edges ~= edge(a, b, c, 0);
else
edges ~= edge(a, b, c, 1);
}
edges.sort();
init();
int i, maxe=0;
for(i=0;i<edges.length;i++) {
auto e = edges[i];
if(merge(e.src, e.des)) {
if(e.o)
days ++;
}
}
printf("%d", days);
}
And then here is what I wrote in C++ as the answer code
#include <iostream>
#include <vector>
#include <map>
#include <algorithm>
using namespace std;
struct Edge{
long long source, end, weight, old;
Edge(long long _s, long long _e, long long _w, long long _o):source(_s), end(_e), weight(_w), old(_o){}
};
int parents[100004];
vector<Edge>edges;
bool inc(Edge a, Edge b)
{
if(a.weight == b.weight)return a.old > b.old;
return a.weight < b.weight;
}
long long find(long long node)
{
if(parents[node] == node)return node;
else return find(parents[node]);
}
void init(long long M)
{
for(long long i = 0; i < M; ++i)parents[i] = i;
}
bool connect(long long x, long long y)
{
long long fx = find(x);
long long fy = find(y);
if(fx == fy)return false;
parents[fx] = fy;
return true;
}
long long noOfDays()
{
long long days = 0;
for(auto edge : edges){
if(connect(edge.source, edge.end)){
if(!edge.old)++days;
}
}
return days;
}
int main()
{
ios::sync_with_stdio(false);
long long N, M , D;
cin >> N >> M >> D;
N--;
for(long long i = 0; i < M; ++i){
long long a,b,c;
cin >> a >> b >> c;
if(i < N){
edges.push_back(Edge(a,b,c,1));
}else{
edges.push_back(Edge(a,b,c,0));
}
}
sort(edges.begin(), edges.end(), inc);
init(N+2);
cout << noOfDays() << endl;
}
The input which takes more than 5 seconds on C++, and a split second on D can be found here "http://ddl3.data.hu/get/356808/10699419/s4.24.in"
Here is the question I was actually trying to solve "https://dmoj.ca/problem/ccc17s4"(I am only doing the 11 point part).
Is there any way I can make my C++ code as fast as the D code? and why exactly isn't my C++ code running as fast as the D code?
EDIT: For all clarifications, g++ was used for the C++ without any optimizations, and 'dmd' for the Dlang, without any optimizations either
find() seems to be heavily used and they are very different in D and C++ implementations:
int find(int x) {
return ds[x] = (x == ds[x] ? x: find(ds[x]));
}
vs:
long long find(long long node)
{
if(parents[node] == node)return node;
else return find(parents[node]);
}
find() in D modifies the array (it looks like a kind of memoization, where you cache the previous result; in a union-find structure this is known as path compression), while in C++ you always do a full lookup. You should compare apples to apples, especially since this code could be written exactly the same way in C++.
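A minimal sketch of what the caching version could look like in C++, written iteratively here so that long parent chains don't deepen the call stack. The name find_root and the std::vector parameter are illustrative; the question uses a global array instead:

```cpp
#include <vector>

// Disjoint-set "find" with path compression: after the call, every node
// visited on the way up points directly at the root.
int find_root(std::vector<int>& parent, int node)
{
    // First walk: locate the representative of the set.
    int root = node;
    while (parent[root] != root)
        root = parent[root];

    // Second walk: repoint each node on the path straight at the root.
    while (parent[node] != root) {
        const int next = parent[node];
        parent[node] = root;
        node = next;
    }
    return root;
}
```

Without the second walk, a chain of merged nodes has to be re-traversed in full on every lookup, which is what the C++ find() in the question does.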
Out of curiosity, I tried running OPs code, and also the version below, which I created by minimally tweaking the 'D' code so that it would compile under C++. OPs C++ version took around 12 seconds to run. The version below took around 0.25 seconds to run.
My conclusion, in answer to the question is that the difference in run time seen by the OP is likely due to differences in implementation as described in some of the other answers, as opposed to poor performance of C++.
#include <cstdio>
#include <vector>
#include <algorithm>
struct edge {
edge(int src, int des, int w, int o) : src(src), des(des), w(w), o(o) {}
int src, des, w, o;
int opCmp(const edge& e) const {
if (w != e.w) return w - e.w;
else return o - e.o;
}
};
const int MAXN = 100004, MAXM = 200004;
int N, M, D, ee, weight, days;
int ds[MAXN];
std::vector<edge> edges;
void init() {
for (int i = 1; i <= N; i++) ds[i] = i;
}
int find(int x) {
return ds[x] = (x == ds[x] ? x : find(ds[x]));
}
bool connected(int x, int y) {
return find(x) == find(y);
}
bool merge(int x, int y) {
int xr = find(x), yr = find(y);
if (xr ^ yr) {
ds[xr] = yr;
return 1;
}
return 0;
}
int main() {
std::scanf("%d%d%d", &N, &M, &D);
for (int i = 1, a, b, c; i <= M; i++) {
scanf("%d%d%d", &a, &b, &c);
if (i < N)
edges.push_back(edge(a, b, c, 0));
else
edges.push_back(edge(a, b, c, 1));
}
std::sort(edges.begin(), edges.end(), [](const edge& lhs, const edge& rhs) { return lhs.opCmp(rhs) < 0; });
init();
int i, maxe = 0;
for (i = 0; i<edges.size(); i++) {
auto e = edges[i];
if (merge(e.src, e.des)) {
if (e.o)
days++;
}
}
printf("%d", days);
}
One possible contributor to the slow performance of the C++ version is the 'inc' function. It receives 2 'Edge' structs by value, which in C++ will mean copying of the structs for every comparison during the sort call towards the end of main().
Try changing the signature of 'inc' to accept 'const Edge&' instead of 'Edge'. This will cause the struct values to be passed by reference and so avoid the extra copying.
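For illustration, a sketch of the comparator with that change applied. The Edge struct is trimmed to its fields here, and sort_edges is a hypothetical wrapper added only to show the call:

```cpp
#include <algorithm>
#include <vector>

struct Edge {
    long long source, end, weight, old;
};

// Same ordering as the question's `inc`, but the arguments are taken by
// const reference, so no Edge is copied per comparison.
bool inc(const Edge& a, const Edge& b)
{
    if (a.weight == b.weight) return a.old > b.old;
    return a.weight < b.weight;
}

// Hypothetical wrapper showing the call site; the sort call itself is unchanged.
void sort_edges(std::vector<Edge>& edges)
{
    std::sort(edges.begin(), edges.end(), inc);
}
```

An optimizing build may remove most of these copies anyway, so the gain is likely to show mainly in unoptimized builds like the one described in the question.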
Also, if you run a profiler you should be able to find where the majority of the time is being spent. This is the 'right' way to approach optimization: measure to find where you have a performance bottleneck, address the bottleneck and measure again to confirm you have indeed improved the performance.

How does this strange I/O method work?

While taking input output in C++ I have only used scanf/printf and cin/cout. Now I recently came across this code taking I/O in a strange fashion.
Also note that this I/O method is causing the code to run extremely fast, as this code uses almost the same algorithm as most of the other codes but it executes in a much smaller time. Why is this I/O so fast and how does this work in general?
edit: code
#include <bits/stdtr1c++.h>
#define MAXN 200010
#define MAXQ 200010
#define MAXV 1000010
#define clr(ar) memset(ar, 0, sizeof(ar))
#define read() freopen("lol.txt", "r", stdin)
using namespace std;
const int block_size = 633;
long long res, out[MAXQ]; int n, q, ar[MAXN], val[MAXN], freq[MAXV];
namespace fastio{
int ptr, ye;
char temp[25], str[8333667], out[8333669];
void init(){
ptr = 0, ye = 0;
fread(str, 1, 8333667, stdin);
}
inline int number(){
int i, j, val = 0;
while (str[ptr] < 45 || str[ptr] > 57) ptr++;
while (str[ptr] > 47 && str[ptr] < 58) val = (val * 10) + (str[ptr++] - 48);
return val;
}
inline void convert(long long x){
int i, d = 0;
for (; ;){
temp[++d] = (x % 10) + 48;
x /= 10;
if (!x) break;
}
for (i = d; i; i--) out[ye++] = temp[i];
out[ye++] = 10;
}
inline void print(){
fwrite(out, 1, ye, stdout);
} }
struct query{
int l, r, d, i;
inline query() {}
inline query(int a, int b, int c){
i = c;
l = a, r = b, d = l / block_size;
}
inline bool operator < (const query& other) const{
if (d != other.d) return (d < other.d);
return ((d & 1) ? (r < other.r) : (r > other.r));
} } Q[MAXQ];
void compress(int n, int* in, int* out){
unordered_map <int, int> mp;
for (int i = 0; i < n; i++) out[i] = mp.emplace(in[i], mp.size()).first->second; }
inline void insert(int i){
res += (long long)val[i] * (1 + 2 * freq[ar[i]]++); }
inline void erase(int i){
res -= (long long)val[i] * (1 + 2 * --freq[ar[i]]); }
inline void run(){
sort(Q, Q + q);
int i, l, r, a = 0, b = 0;
for (res = 0, i = 0; i < q; i++){
l = Q[i].l, r = Q[i].r;
while (a > l) insert(--a);
while (b <= r) insert(b++);
while (a < l) erase(a++);
while (b > (r + 1)) erase(--b);
out[Q[i].i] = res;
}
for (i = 0; i < q; i++) fastio::convert(out[i]); }
int main(){
fastio::init();
int n, i, j, k, a, b;
n = fastio::number();
q = fastio::number();
for (i = 0; i < n; i++) val[i] = fastio::number();
compress(n, val, ar);
for (i = 0; i < q; i++){
a = fastio::number();
b = fastio::number();
Q[i] = query(a - 1, b - 1, i);
}
run();
fastio::print();
return 0; }
This solution, http://codeforces.com/contest/86/submission/22526466 (624 ms, 32 MB RAM), uses a single fread and manual parsing of numbers from memory (so it uses more memory); many other solutions are slower and use scanf (http://codeforces.com/contest/86/submission/27561563, 1620 ms, 9 MB) or C++ iostream cin (http://codeforces.com/contest/86/submission/27558562, 3118 ms, 15 MB). Not all of the difference between the solutions comes from input/output and parsing (the solution methods differ too), but some of it does.
fread(str, 1, 8333667, stdin);
This code uses a single fread library call to read up to 8 MB, which is the full file. The file may contain up to 2 (n, t) + 200000 (a_i) + 2*200000 (l, r) numbers of 6 or 7 digits, separated by line breaks or by one (?) space, so around 8 chars at most per number (6 or 7 for the number, as 1000000 is allowed too, plus 1 space or \n); the maximum input file size is about 0.6 M * 8 bytes =~ 5 MB.
inline int number(){
int i, j, val = 0;
while (str[ptr] < 45 || str[ptr] > 57) ptr++;
while (str[ptr] > 47 && str[ptr] < 58) val = (val * 10) + (str[ptr++] - 48);
return val;
}
Then the code parses decimal int numbers by hand. According to the ASCII table, http://www.asciitable.com/, decimal codes 48...57 are the decimal digits '0'...'9' (the second while loop), so we can just subtract 48 from the character code to get the digit, multiply the partially read val by 10 and add the current digit. The chr < 45 || chr > 57 test in the first while loop looks like it skips non-digits in the input, but it is not quite correct: it also stops at codes 45, 46, 47 = '-', '.', '/', which the second loop does not accept. If one of these chars appears in the input, number() returns 0 without advancing ptr, so no number after it will ever be read.
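A sketch of a more defensive variant, assuming the whole input is already in a buffer of known length. The name parse_int and its signature are hypothetical and not part of the submission; overflow is not checked:

```cpp
#include <cstddef>

// Hypothetical variant of number(): parses the next decimal integer from buf
// starting at pos and advances pos past it. Everything that is not a digit,
// or a '-' directly followed by a digit, is skipped as a separator.
// Returns false once the buffer holds no further number.
bool parse_int(const char* buf, std::size_t len, std::size_t& pos, int& value)
{
    auto is_digit = [](char ch) { return ch >= '0' && ch <= '9'; };

    // Skip separators, including stray '-', '.', '/' characters.
    while (pos < len && !is_digit(buf[pos]) &&
           !(buf[pos] == '-' && pos + 1 < len && is_digit(buf[pos + 1])))
        ++pos;
    if (pos >= len) return false;

    bool negative = false;
    if (buf[pos] == '-') {
        negative = true;
        ++pos;
    }

    int result = 0;
    while (pos < len && is_digit(buf[pos]))
        result = result * 10 + (buf[pos++] - '0');

    value = negative ? -result : result;
    return true;
}
```

The bounds checks make it a little slower than the original number(), but it cannot run past the end of the buffer or get stuck on an unexpected character.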
n = fastio::number();
q = fastio::number();
for (i = 0; i < n; i++) val[i] = fastio::number();
for (i = 0; i < q; i++){
a = fastio::number();
b = fastio::number();
The actual reading uses this fastio::number() method, while other solutions call scanf or the iostream operator >> in a loop:
for (int i = 0; i < N; i++) {
scanf("%d", &(arr[i]));
add(arr[i]);
}
or
for (int i = 1; i <= n; ++i)
cin >> a[i];
Both methods are more universal, but they make a library call, which reads some chars from an internal buffer (like 4 KB) or makes an OS syscall to refill the buffer, and every function does many checks and has error reporting. For every input number, scanf re-parses the same format string from its first argument and performs all the logic described in POSIX http://pubs.opengroup.org/onlinepubs/7908799/xsh/fscanf.html, including all the error checking. C++ iostream has no format string, but it is still more universal: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/istream.tcc#L156 'operator>>(int& __n)'.
So, standard library functions have more logic inside, more calls, and more branching; they are more universal and much safer, and should be used in real-world programming. This "sport programming" contest allows users to solve the task with standard library functions, which are fast enough if you can come up with the algorithm. Authors of tasks are required to write several solutions with standard I/O functions to check that the time limit of the task is correct and that the task can be solved. (The TopCoder system is better with I/O: you do not implement I/O yourself; the data is already passed into your function in some language structs/collections.)
Sometimes tasks in sport programming have tight limits on memory: the input file is several times bigger than the allowed memory usage, and the programmer can't read the whole file into memory. For example: get 20 million digits of a single very long number from the input file and add 1 to it, with a memory limit of 2 MB. You can't read the full input number from the file in the forward direction, and it is very hard to read it correctly in chunks in the backward direction; you just need to forget the standard method of addition (columnar addition) and build an FSM (finite-state machine) whose state counts sequences of 9s.
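One possible shape of such an FSM, as a rough sketch rather than a reference solution: read the digits forward, hold back the last non-9 digit together with a count of the 9s that follow it, and flush them unchanged as soon as another non-9 digit shows that the carry can never reach them. The function names add_one_streaming and add_one are hypothetical, and the sketch assumes the input contains only decimal digits plus whitespace.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical sketch: writes (number + 1) to out while reading the digits
// of the number from in in forward order, using O(1) memory.
// State: one pending non-9 digit and the count of 9s seen after it.
void add_one_streaming(std::istream& in, std::ostream& out) {
    int pending = -1;      // last non-9 digit not yet written; -1 means none
    long long nines = 0;   // number of 9s that follow the pending digit
    char ch;
    while (in.get(ch)) {
        if (ch < '0' || ch > '9') continue;  // ignore spaces and line breaks
        if (ch == '9') {
            ++nines;
        } else {
            // The carry from the end stops at this digit at the latest,
            // so everything held back so far is final.
            if (pending >= 0) out.put(static_cast<char>('0' + pending));
            for (long long k = 0; k < nines; ++k) out.put('9');
            pending = ch - '0';
            nines = 0;
        }
    }
    // End of input: the pending digit absorbs the carry, trailing 9s become 0s.
    if (pending >= 0) out.put(static_cast<char>('0' + pending + 1));
    else out.put('1');     // the number was all 9s (or empty): new leading 1
    for (long long k = 0; k < nines; ++k) out.put('0');
}

// Convenience wrapper for small inputs held in a string.
std::string add_one(const std::string& digits) {
    std::istringstream in(digits);
    std::ostringstream out;
    add_one_streaming(in, out);
    return out.str();
}
```

In a real submission the streams would be stdin and stdout; the string wrapper is only there to make the sketch easy to try on small inputs.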

Finding the number of unique paths on a grid

I'm trying to understand how to solve the problem of finding all unique paths in a grid using dynamic programming:
A robot is located at the top-left corner of a m x n grid (marked ‘Start’ in the diagram below). The robot can only move either down or right at any point in time. The robot is trying to reach the bottom-right corner of the grid (marked ‘Finish’ in the diagram below). How many possible unique paths are there?
I was looking at this article and I was wondering why, in the solution below, the matrix is sized M_MAX + 2 by N_MAX + 2, and also why the last parameter in the signature of backtrack is declared as int mat[][N_MAX+2].
const int M_MAX = 100;
const int N_MAX = 100;
int backtrack(int r, int c, int m, int n, int mat[][N_MAX+2]) {
if (r == m && c == n)
return 1;
if (r > m || c > n)
return 0;
if (mat[r+1][c] == -1)
mat[r+1][c] = backtrack(r+1, c, m, n, mat);
if (mat[r][c+1] == -1)
mat[r][c+1] = backtrack(r, c+1, m, n, mat);
return mat[r+1][c] + mat[r][c+1];
}
int bt(int m, int n) {
int mat[M_MAX+2][N_MAX+2];
for (int i = 0; i < M_MAX+2; i++) {
for (int j = 0; j < N_MAX+2; j++) {
mat[i][j] = -1;
}
}
return backtrack(1, 1, m, n, mat);
}
Then in the author's bottom-up approach solution:
const int M_MAX = 100;
const int N_MAX = 100;
int dp(int m, int n) {
int mat[M_MAX+2][N_MAX+2] = {0};
mat[m][n+1] = 1;
for (int r = m; r >= 1; r--)
for (int c = n; c >= 1; c--)
mat[r][c] = mat[r+1][c] + mat[r][c+1];
return mat[1][1];
}
I don't know what the purpose of the line mat[m][n+1] = 1; serves.
I'm not familiar with Java, so I apologize if these boil down to syntactical or language-specific questions.
Firstly, notice that the author and the second solution both use 1-based indexing. So, of course, mat[M_MAX+1][N_MAX+1] would be quite justified.
Now, notice the logic the author is using.
mat[r][c] = mat[r+1][c] + mat[r][c+1];
Hence, to prevent r+1 or c+1 from going out of bounds when r = m or c = n, instead of adding if-statements like this:
if (r == m)
mat[r][c] = mat[r][c+1];
if (c == n)
mat[r][c] = mat[r+1][c];
He has decided to simply add an extra row and an extra column with the value 0 stored in them. Hence:
mat[M_MAX+2][N_MAX+2] = {0};
Finally, in a bottom-up approach, one must initialize mat[m][n] to 1. Instead of doing that directly, knowing that mat[m][n] = mat[m+1][n] + mat[m][n+1], he initialized:
mat[m][n+1] = 1; // mat[m+1][n] = 0;
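To see the sentinel trick in isolation, here is a small sketch of the same bottom-up idea with the border sized from m and n instead of the fixed M_MAX and N_MAX. The name unique_paths and the use of std::vector and long long are my own choices for illustration, not the article's.

```cpp
#include <vector>

// Sketch of the bottom-up approach with a one-cell border of zeros.
// Cells are 1-based: (1,1) is Start and (m,n) is Finish.
long long unique_paths(int m, int n) {
    // Rows 0..m+1 and columns 0..n+1, all zero, so that mat[r+1][c] and
    // mat[r][c+1] are always valid for 1 <= r <= m and 1 <= c <= n.
    std::vector<std::vector<long long>> mat(m + 2, std::vector<long long>(n + 2, 0));

    // Seed just outside the grid: mat[m][n] = mat[m+1][n] + mat[m][n+1] = 0 + 1.
    mat[m][n + 1] = 1;

    for (int r = m; r >= 1; --r)
        for (int c = n; c >= 1; --c)
            mat[r][c] = mat[r + 1][c] + mat[r][c + 1];

    return mat[1][1];
}
```

For a 3 x 3 grid this gives 6 and for a 3 x 7 grid 28, which matches the binomial count C(m+n-2, m-1).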
Feel free to ask any questions in comments.

How to convert recursive to iterative solution

I've managed to write my algorithm in recursive way:
int fib(int n) {
if (n == 1)
return 3;
else if (n == 2)
return 2;
else
return fib(n - 2) + fib(n - 1);
}
Currently I'm trying to convert it to iterative approach without success:
int fib(int n) {
int i = 0, j = 1, k, t;
for (k = 1; k <= n; ++k)
{
if(n == 1) {
j = 3;
}
else if(n == 2) {
j = 2;
}
else {
t = i + j;
i = j;
j = t;
}
}
return j;
}
So how can I rectify my code to reach my goal?
Solving this problem by a general convert-to-iterative is a bad idea. But, that is what you asked.
None of these are good ways to solve fib: there are closed form solutions for fib, and/or iterative solutions that are cleaner, and/or recursive memoized solutions. Rather, I'm showing relatively mechanical techniques for taking a recursive function (that isn't tail-recursive or otherwise simple to solve), and solving it without using the automatic storage stack (recursion).
I have had code that does too deep a recursive nesting and blows the stack in medium-high complexity cases; when refactored to iterative, the problem went away. These are the kinds of solutions required when what you have is a recursive solution you half understand, and you need it to be iterative.
The general means to convert a recursive to an iterative solution is to manage the stack manually.
In this case, I'll also memoize return values.
We cache the return values in retvals.
If we cannot immediately solve a problem, we state what problems we first need to solve in order to solve our problem (in particular, the n-1 and n-2 cases). Then we queue up solving our problem again (by which point, we will have what we need ready to go).
int fib( int n ) {
std::map< int, int > retvals {
{1,3},
{2,2}
};
std::vector<int> arg;
arg.push_back(n);
while( !arg.empty() ) {
int n = arg.back();
arg.pop_back();
// have we solved this already? If so, stop.
if (retvals.count(n)>0)
continue;
// are we done? If so, calculate the result:
if (retvals.count(n-1)>0 && retvals.count(n-2)>0) {
retvals[n] = retvals[n-1] + retvals[n-2];
continue;
}
// to calculate n, first calculate n-1 and n-2:
arg.push_back(n); arg.push_back(n-1); arg.push_back(n-2);
}
return retvals[n];
}
No recursion, just a loop.
A "dumber" way to do this is to take the function and make it a pseudo-coroutine.
First, rewrite your recursive code to do one thing per line:
int fib(int n) {
if(n == 1)
return 3;
if (n == 2)
return 2;
int a = fib(n-2);
int b = fib(n-1);
return a+b;
}
Next, create a struct with all of the functions' state:
struct fib_data {
int n, a, b, r;
};
and add labels at each point where we make a recursive call, and an enum with similar names:
enum Calls {
e1, e2
};
int fib(int n) {
fib_data d;
d.n = n;
if(d.n == 1)
return 3;
if (d.n == 2)
return 2;
d.a = fib(n-2);
CALL1:
d.b = fib(n-1);
CALL2:
d.r = d.a+d.b;
return d.r;
}
Add a Calls member to your fib_data.
Next, create a stack of fib_data:
enum Calls {
e0, e1, e2
};
struct fib_data {
Calls loc = Calls::e0;
int n, a, b, r;
};
int fib(int n) {
std::vector<fib_data> stack;
stack.push_back({n});
if(stack.back().n == 1)
return 3;
if (stack.back().n == 2)
return 2;
stack.back().a = fib(stack.back().n-2);
CALL1:
stack.back().b = fib(stack.back().n-1);
CALL2:
stack.back().r = stack.back().a + stack.back().b;
return stack.back().r;
}
Now create a loop. Instead of recursively calling, set the return location in your fib_data, push a fib_data onto the stack with an n and an e0 location, then continue the loop. At the top of the loop, switch on the location stored at the top of the stack.
To return: Create a function local variable r to store return values. To return, set r, pop the stack, and continue the loop.
If the stack is empty at the start of the loop, return r from the function.
enum Calls {
e0, e1, e2
};
struct fib_data {
int n, a, b, r;
Calls loc = Calls::e0;
};
int fib(int n) {
std::vector<fib_data> stack;
stack.push_back({n});
int r;
while (!stack.empty()) {
switch(stack.back().loc) {
case e0: break;
case e1: goto CALL1;
case e2: goto CALL2;
};
if(stack.back().n == 1) {
r = 3;
stack.pop_back();
continue;
}
if (stack.back().n == 2){
r = 2;
stack.pop_back();
continue;
}
stack.back().loc = e1;
stack.push_back({stack.back().n-2});
continue;
CALL1:
stack.back().a = r;
stack.back().loc = e2;
stack.push_back({stack.back().n-1});
continue;
CALL2:
stack.back().b = r;
stack.back().r = stack.back().a + stack.back().b;
r = stack.back().r;
stack.pop_back();
continue;
}
return r; // the stack is empty, so r holds the final result
}
Then note that b and r do not have to be in the stack -- remove them from fib_data, and make them local.
This "dumb" transformation emulates what the C++ compiler does when you recurse, but the stack is stored in the free store instead of automatic storage, and can reallocate.
If pointers to the local variables need to persist, using a std::vector for the stack won't work. Replace the pointers with offsets into the standard vector, and it will work.
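For reference, this is roughly where the transformation ends up once b and r are local and the gotos are folded into the switch. Treat it as a sketch: the names Step, Frame and fib_explicit_stack are mine, and it assumes n >= 1, just like the recursive original.

```cpp
#include <vector>

// Where a frame resumes: at the start, after fib(n-2), or after fib(n-1).
enum class Step { Start, AfterFirstCall, AfterSecondCall };

// One simulated stack frame. Only n, the first result, and the resume point
// have to survive across a simulated call.
struct Frame {
    int n;
    int a;
    Step step;
    explicit Frame(int n_) : n(n_), a(0), step(Step::Start) {}
};

// Assumes n >= 1, with fib(1) = 3 and fib(2) = 2 as in the question.
int fib_explicit_stack(int n) {
    std::vector<Frame> stack;
    stack.push_back(Frame(n));
    int r = 0;  // "return register": result of the most recently finished frame

    while (!stack.empty()) {
        Frame& top = stack.back();  // not used again after any push or pop below
        switch (top.step) {
        case Step::Start:
            if (top.n == 1) { r = 3; stack.pop_back(); break; }
            if (top.n == 2) { r = 2; stack.pop_back(); break; }
            top.step = Step::AfterFirstCall;
            stack.push_back(Frame(top.n - 2));   // simulate a = fib(n-2)
            break;
        case Step::AfterFirstCall:
            top.a = r;                           // result of fib(n-2)
            top.step = Step::AfterSecondCall;
            stack.push_back(Frame(top.n - 1));   // simulate b = fib(n-1)
            break;
        case Step::AfterSecondCall:
            r = top.a + r;                       // r currently holds fib(n-1)
            stack.pop_back();
            break;
        }
    }
    return r;
}
```

With the question's base cases this produces 3, 2, 5, 7, 12, 19 for n = 1..6. It still does exponential work, exactly like the recursive version; only the stack has moved to the free store.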
This should be fib(0) = 0, fib(1) = 1, fib(2) = 1, fib(3) = 2, fib(4) = 3, fib(5) = 5, fib(6) = 8, ... .
int fib(int n)
{
int f0, f1, t;
if(n < 2)
return n;
n -= 2;
f0 = 1;
f1 = 1;
while(n--){
t = f1+f0;
f0 = f1;
f1 = t;
}
return f1;
}
or you can unfold the loop a bit, and get rid of the temp variable:
int fib(int n)
{
int f0, f1;
if(n < 2)
return n;
f0 = 1-(n&1);
f1 = 1;
while(0 < (n -= 2)){
f0 += f1;
f1 += f0;
}
return f1;
}
This is a classic problem. You cannot simply get rid of the recursion if you are given n and want to calculate downwards from it.
The solution is dynamic programming: create an array with room for n values, then fill it up starting from index 0 until you reach index n-1.
Something like this:
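#include <vector>
int fib(int n)
{
// buffer[i] holds fib(i+1): fib(1) = 3 sits at index 0, fib(2) = 2 at index 1.
// std::vector replaces the variable-length array, which is not standard C++.
std::vector<int> buffer(n + 1);
buffer[0]=3;
buffer[1]=2;
for(int i=2;i<n; ++i)
{
buffer[i] = buffer[i-1] + buffer[i-2];
}
return buffer[n-1];
}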
Alternatively, to save memory and avoid a big array, you can use:
int fib(int n)
{
int buffer [2];
buffer[0] = 3; // fib(1)
buffer[1] = 2; // fib(2)
if (n == 1)
return buffer[0];
for(int i=3; i<=n; i++)
{
int tmp = buffer[0] + buffer[1];
buffer[0] = buffer[1];
buffer[1] = tmp;
}
return buffer[1];
}
For the sake of completeness, here is the iterative solution with O(1) space complexity:
int fib(int n)
{
int i;
int a0 = 3;
int a1 = 2;
int tmp;
if (n == 1)
return a0;
for (i = 3; i <=n; i++ )
{
tmp = a0 + a1;
a0 = a1;
a1 = tmp;
}
return a1;
}