OpenCL data parallel summation into a variable - c++

Is it possible to use an OpenCL data-parallel kernel to sum a vector of size N without doing the partial sum trick?
Say you have access to 16 work items and your vector is of size 16. Wouldn't it be possible to just have a kernel doing the following:
__kernel void summation(__global float* input, __global float* sum)
{
int idx = get_global_id(0);
sum[0] += input[idx];
}
When I tried this, the sum variable doesn't get updated, only overwritten. I've read something about using barriers, and I tried inserting a barrier before the summation above; it does update the variable somehow, but it doesn't produce the correct sum.

Let me try to explain why sum[0] is overwritten rather than updated.
In your case of 16 work items, there are 16 threads which are running simultaneously. Now sum[0] is a single memory location which is shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.
Now the instruction sum[0] += input[idx] (I think) expands to a read of sum[0], followed by adding input[idx] to that value and writing the result back to sum[0].
There will be a data race, as multiple threads are reading from and writing to the same shared memory location. So what might happen is:
All threads may read the value of sum[0] before any other thread
writes their updated result back to sum[0], in which case the final
result of sum[0] would be the value of input[idx] of the thread
which executed the slowest. Since this will be different each time,
if you run the example multiple times you should see different
results.
Or, one thread may execute slightly more slowly, in which case
another thread may have already written an updated result back to
sum[0] before this slow thread reads sum[0], in which case there
will be an addition using the values of more than one thread, but not
all threads.
So how can you avoid this?
Option 1 - Atomics (Worse Option):
You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (and incurring the costs of parallelisation -- such as moving memory between the host and the device and creating the threads).
Option 2 - Reduction (Better Option):
The best solution would be to reduce the array, since that uses the parallelism most effectively and can finish in O(log(N)) parallel steps. Here is a good overview of reduction using OpenCL: Reduction Example.
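To make the reduction idea concrete, here is a minimal CPU-side sketch in C++ of the pairwise (tree) access pattern. It is not an OpenCL kernel, and the name tree_reduce_sum is made up for illustration. Each pass of the outer loop stands for one synchronised step on the device, and the inner loop iterations are the independent additions that separate work items would perform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-side emulation of a tree (pairwise) reduction.
// Outer loop: one synchronised step per pass, about log2(N) passes in total.
// Inner loop: independent additions that would map to separate work items.
float tree_reduce_sum(std::vector<float> data) {
    if (data.empty()) return 0.0f;
    std::size_t n = data.size();
    while (n > 1) {
        const std::size_t half = (n + 1) / 2;  // active "work items" in this step
        for (std::size_t i = 0; i + half < n; ++i) {
            data[i] += data[i + half];         // each i touches a distinct pair: no race
        }
        n = half;
    }
    return data[0];
}
```

With 16 elements this takes 4 passes (8, 4, 2 and 1 additions), and no two additions in the same pass touch the same element, which is what removes the race from the original kernel.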

Option 3 (and worst of all)
__kernel void summation(__global float* input, __global float* sum)
{
int idx = get_global_id(0);
for(int j=0;j<N;j++)
{
barrier(CLK_GLOBAL_MEM_FENCE| CLK_LOCAL_MEM_FENCE);
if(idx==j)
sum[0] += input[idx];
else
doOtherWorkWhileSingleCoreSums();
}
}
Using a mainstream GPU, this should sum all of them about as slowly as a Pentium MMX. This is just like computing on a single core while giving the other cores other jobs, only slower.
A CPU device could be better than a GPU for this kind of work.

Related

race condition using OpenMP atomic capture operation for 3D histogram of particles and making an index

I have a piece of code in my full code:
const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
int ix,jy,kz;
int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
int ic=GetIndex(particles[i]);
#pragma omp atomic update
Length[ic]++;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
if(particles[i].ip==3)
{
int ic=GetIndex(particles[i]);
if(ic>=cub3 || ic<0) printf("ic=%d out of range!\n",ic);
int cnt=0;
#pragma omp atomic capture
cnt=Counter[ic]++;
MIndex[Begin[ic]+cnt]=i;
}
}
If I remove
#pragma omp parallel for
the code works properly and the output results are always the same.
But with this pragma there is some undefined behaviour/race condition in the code, because each run gives different output.
How can I fix this issue?
Update: The task is the following. I have lots of particles with random coordinates. I need to write to the array MIndex the indices (into the array particles) of the particles in each cell (a Cartesian cube, for example 1×1×1 cm) of the coordinate system. So the beginning of MIndex should hold the indices of the particles in the 1st cell of the coordinate system, then those in the 2nd, then the 3rd and so on. The order of indices within a given cell in MIndex is not important and may be arbitrary. If possible, I need to do this in parallel, perhaps using atomic operations.
There is a straightforward way: traverse all the coordinate cells in parallel and, for each cell, check the coordinates of all the particles. But for a large number of cells and particles this seems slow. Is there a faster approach? Is it possible to traverse the particles array only once in parallel and fill the MIndex array using atomic operations, something like the code above?
You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters which would be a disaster, see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.
Keep per-thread count arrays from the initial histogram, use in 2nd pass
(Update: this might not work: I didn't notice the if(particles[i].ip==3) in the source before. I was assuming that Counter[ic] will go as high as Length[ic] in the serial version. If that's not the case, this strategy might leave gaps or something.
But as Laci points out, perhaps you want that check when calculating Length in the first place, then it would be fine.)
Manually multi-thread the first histogram (into Length[]), with each thread working on a known range of i values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].
So Length[thread][ic] is the number of particles in that cube, out of the range of i values that this thread worked on. (And will loop over again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)
Pre-process that into a per-thread Begin[][] array, so each thread knows where in MIndex to put data from each bucket.
// pseudo-code, fairly close to actual C
int pos = 0;
for (int ic = 0; ic < cub3; ic++) {
  // perhaps do this "vertical" sum into a temporary array
  // or prefix-sum within Length before combining across threads?
  for (int t = 0 ; t<nthreads ; t++) {
    Begin[t][ic] = pos;    // where thread t starts writing for this cube bucket
    pos += Length[t][ic];  // prefix-sum across threads for this cube bucket
  }
}
This has a pretty terrible cache access pattern, especially with cuba=8 making Length[t][0] and Length[t+1][0] 16384 bytes (4096 ints, a multiple of 4096 bytes) apart from each other. (So 4k aliasing is a possible problem, as are cache conflict misses).
Perhaps each thread can prefix-sum its own slice of Length into that slice of Begin, 1. for cache access pattern (and locality since it just wrote those Lengths), and 2. to get some parallelism for that work.
Then in the final loop with MIndex, each thread can do int pos = --Length[t][ic] to derive a unique ID from the Length. (Like you were doing with Counter[], but without introducing another per-thread array to zero.)
Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic] positions, MIndex[...] = i stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.
Don't overdo it with number of threads, especially if cuba is greater than 8. The amount of Length / Begin pre-processing work scales with number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8 meaning each per-thread array is only 4096 ints, i.e. 16 KiB (too small to parallelize the zeroing of, BTW), it's really not that much.
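Putting the steps above together, here is a rough C++ sketch of the two-pass scheme. The function name group_by_bucket, the bucket_of input (standing in for GetIndex(particles[i])) and the nchunks parameter are all assumptions for illustration, and a chunk here plays the role of one thread's range of i values. It also assumes every item is counted in both passes; a filter like ip==3 would have to be applied identically in both.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: groups item indices by bucket without atomics.
// bucket_of[i] is the bucket (cell) of item i, in [0, nbuckets).
// Returns the grouped indices (MIndex); begin[b] receives each bucket's start.
std::vector<int> group_by_bucket(const std::vector<int>& bucket_of, int nbuckets,
                                 int nchunks, std::vector<int>& begin) {
    assert(nbuckets > 0 && nchunks > 0);
    const std::size_t n = bucket_of.size();
    // Chunk t owns items [lo(t), lo(t+1)); the same split is used in both passes.
    auto lo = [&](int t) { return n * static_cast<std::size_t>(t) / nchunks; };

    // Pass 1: per-chunk histogram. No counter is shared between chunks.
    std::vector<std::vector<int>> length(nchunks, std::vector<int>(nbuckets, 0));
    #pragma omp parallel for
    for (int t = 0; t < nchunks; ++t)
        for (std::size_t i = lo(t); i < lo(t + 1); ++i)
            ++length[t][bucket_of[i]];

    // Serial prefix sum: start[t][b] is where chunk t writes its items of bucket b.
    std::vector<std::vector<int>> start(nchunks, std::vector<int>(nbuckets, 0));
    begin.assign(nbuckets, 0);
    int pos = 0;
    for (int b = 0; b < nbuckets; ++b) {
        begin[b] = pos;
        for (int t = 0; t < nchunks; ++t) {
            start[t][b] = pos;
            pos += length[t][b];
        }
    }

    // Pass 2: each chunk revisits its own items, counting its lengths back to zero.
    std::vector<int> mindex(n);
    #pragma omp parallel for
    for (int t = 0; t < nchunks; ++t)
        for (std::size_t i = lo(t); i < lo(t + 1); ++i) {
            const int b = bucket_of[i];
            mindex[start[t][b] + --length[t][b]] = static_cast<int>(i);
        }
    return mindex;
}
```

The pragmas are optional: compiled without OpenMP the code runs serially and produces the same grouping, which is a handy way to check the indexing before worrying about threads.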
(Previous answer before your edit made it clearer what was going on.)
Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.
After your update, you are doing a histogram into Length[], so that part can be sped up.
Atomic RMWs would be necessary for your code as-is, performance disaster
Atomic increments of shared counters would be slower, and on x86 might destroy the memory-level parallelism too badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens, and blocking later loads from starting until after it happens.
As opposed to a single thread which can have cache misses to multiple Counter, Begin and MIndex elements outstanding, using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]++ can be doing the load while there are cache misses outstanding for Begin[ic] and/or for MIndex[] stores.)
ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)
Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter accesses are scattered enough that the cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem too.)
Perhaps software prefetch can help, like prefetchw (write-prefetching) the counter for 5 or 10 i iterations later.
It wouldn't be deterministic which point went in which order, even with memory_order_seq_cst increments, though. Whichever thread increments Counter[ic] first is the one that associates that cnt with that i.
Alternative access patterns
Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. So the set of Counter[] elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.
Filtering by p.kz ranges maybe makes the most sense since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[].
But if your points aren't uniformly distributed, you'd need to know how to break things up to approximately equally divide the work. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work.
Maybe a couple fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
Re: your edit
You already loop over the whole array of points doing Length[ic]++;? Seems redundant to do the same histogramming work again with Counter[ic]++;, but not obvious how to avoid it.
The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with different count arrays for each thread, and just vertically adding at the end. Should scale perfectly with threads since the count array is small enough for L1d cache.
BTW, for() Length[i]=Counter[i]=0; is too small to be worth parallelizing. For cuba=8, it's 16*16*16 * sizeof(int) = 16384 bytes, just four pages, so it's just two small memsets.
(Of course if each thread has their own separate Length array, they each need to zero it). That's small enough to even consider unrolling with maybe 2 count arrays per thread to hide store/reload serial dependencies if a long sequence of points are all in the same bucket. Combining count arrays at the end is a job for #pragma omp simd or just normal auto-vectorization with gcc -O3 -march=native since it's integer work.
For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down from --Length[i], and another counting up from 0 in Counter[i]++. With different threads looking at different points, this could give you a factor of 2 speedup. (Modulo contention for MIndex stores.)
To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top.
You are right to make the update Counter[ic]++ atomic, but there is an additional problem on the next line: MIndex[Begin[ic]+cnt]=i; Different iterations can write into the same location here, unless you have mathematical proof that this is never the case from the structure of MIndex. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speedup is probably going to be abysmal.
EDIT the second line however is not of the right form for an atomic operation, so you have to make it critical. Which is going to make performance even worse.
Also, @Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this cannot be parallelized.

why using more than 2 threads consume more time?

I want to optimize my sequential code that computes a gradient.
The main thread computes the gradient for the border of the image, and each of the other threads computes the gradient for a chunk of the image.
Using 2 threads plus the main thread gives a better result than the sequential code, but using more than 2 threads consumes more time and is even worse than the sequential code.
I tried this code to speed up the gradient process:
for (int n = 0; n<iter_outer; n++)
{
int chunk = 1 + ((row - 1) / num_threads); //ceiling
int start=0;
int end=0;
//Launch a group of threads
for (int tid = 0; tid < num_threads; ++tid)
{
start = tid * chunk;
end = start + chunk;
t[tid] = thread(gradient, tid, g, vx, vy, row, col, 1, start, end);
}
//Launched from the main;
gradient(1, g, vx, vy, row, col,0, start, end);
//Join the threads with the main thread
for (int i = 0; i < num_threads; ++i)
{
t[i].join();
}
}
For any parallel execution you have to take into account Amdahl's law. It states that the time required to do some task in parallel does not scale linearly with the number of processors:
t = ( (1-p) + p/n ) * T
where
T is the time needed for the task when it is done sequentially
p fraction of time that can be parallelized
n is the number of processors
Note that I used a slightly different formulation, but the statement is the same: The total speedup you get is limited by 1/(1-p) (e.g. if p=50% your parallel version will run maximum twice as fast).
In addition to that you have to consider that adding more parallelism in reality also adds more overhead (for synchronisation, setting up threads, etc), so a more realistic estimate is:
t = ( (1-p) + p/n ) * T + o*n
(the last term, o*n, is the overhead, which grows with the number of processors)
This t, as a function of the number of processors n, has a minimum at a certain number of processors. Adding more processors beyond that will not result in a speedup but rather in a slowdown, because the time needed for the parallel portion can at best shrink towards zero, while the overhead you add by adding more processors grows without limit.
This does not explain why you don't get a speedup in your case, but in general it is not a big surprise that simply adding more processors to a task does not always result in a speedup.
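As a worked example of that model, here is a small C++ sketch. It assumes the overhead term grows linearly with the number of processors (o*n, with o a made-up per-processor cost in the same time unit as T); the function names are hypothetical too.

```cpp
#include <cassert>
#include <cmath>

// Model: t(n) = ((1 - p) + p / n) * T + o * n
// p: parallel fraction, n: number of processors, T: sequential time,
// o: assumed overhead per processor (same time unit as T).
double modelled_time(double p, int n, double T, double o) {
    assert(n >= 1);
    return ((1.0 - p) + p / n) * T + o * n;
}

// Brute-force search for the processor count that minimises the model.
int best_processor_count(double p, double T, double o, int max_n) {
    int best = 1;
    for (int n = 2; n <= max_n; ++n)
        if (modelled_time(p, n, T, o) < modelled_time(p, best, T, o)) best = n;
    return best;
}
```

For p = 0.8, T = 100 and o = 1, the model gives t(8) = 38, t(9) ≈ 37.9 and t(10) = 38, so the minimum sits at 9 processors and everything beyond that only adds overhead.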
Parallel execution is a huge benefit for tasks that are easily splittable and whose threads do not depend on each other; however, creating threads comes at a price. Let's imagine a computer that does nothing but run your program (there is no OS and no other processes). The processor has 2 cores; each is a processor in its own right and can run code concurrently with the other. With just one thread, the second core sits idle, hence there is potential for a speedup. If you spawn a second thread (and give it 50% of the task), the second core now works as well and theoretically the speedup is 2 (ignoring the sequential parts and practical aspects). Now, let's make 4 threads. Wait... we have two processors and 4 threads? Yes, now each CPU does more than one thing, and before changing the task it works on, the CPU has to switch contexts (change the register values to hold the appropriate variables, jump to a different code section and so on). This takes time, and if you create way too many threads the switching will in fact take more time than doing the job itself. This can have a huge impact on any threaded application and should be considered before deciding how many threads to run.
Note that this post is a simplification: many modern CPUs can efficiently run more than one thread per core (e.g. Hyper-Threading).
It seems like your CPU is dual-core. So, actually, only 2 tasks can run in parallel.
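One way to avoid oversubscribing the cores is to cap the number of workers at what the hardware reports. A minimal sketch, where choose_thread_count is a hypothetical helper and std::thread::hardware_concurrency() is only a hint that may return 0:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Picks a worker count no larger than the reported hardware thread count.
// hardware_concurrency() may return 0 when the value is unknown,
// in which case this falls back to a single worker.
unsigned choose_thread_count(unsigned requested) {
    assert(requested >= 1);
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;
    return std::min(requested, hw);
}
```

In the question's code, num_threads could be passed through something like this before the chunk size is computed, keeping in mind that the main thread is doing work as well.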

Multi-threaded GEMM slower than single threaded one?

I wrote some naive GEMM code and I am wondering why it is much slower than the equivalent single-threaded GEMM code.
With a 200x200 matrix, Single Threaded: 7ms, Multi Threaded: 108ms, CPU: 3930k, 12 threads in thread pool.
template <unsigned M, unsigned N, unsigned P, typename T>
static Matrix<M, P, T> multiply( const Matrix<M, N, T> &lhs, const Matrix<N, P, T> &rhs, ThreadPool & pool )
{
Matrix<M, P, T> result = {0};
Task<void> task(pool);
for (auto i=0u; i<M; ++i)
for (auto j=0u; j<P; j++)
task.async([&result, &lhs, &rhs, i, j](){
T sum = 0;
for (auto k=0u; k < N; ++k)
sum += lhs[i * N + k] * rhs[k * P + j];
result[i * P + j] = sum;
});
task.wait();
return std::move(result);
}
I do not have experience with GEMM, but your problem seems to be related to issues that appear in all kinds of multi-threading scenarios.
When using multi-threading, you introduce a couple of potential overheads, the most common of which usually are
1. creation/cleanup of starting/ending threads
2. context switches when (number of threads) > (number of CPU cores)
3. locking of resources, waiting to obtain a lock
4. cache synchronization issues
The items 2. and 3. probably don't play a role in your example: you are using 12 threads on 12 (hyperthreaded) logical cores, and your algorithm does not involve locks.
However, 1. might be relevant in your case: you are creating a total of 40000 units of work (one per output element), each of which multiplies and adds only 200 values. I'd suggest trying less fine-grained threading, maybe only splitting after the first loop. It's always a good idea not to split up the problem into pieces smaller than necessary.
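To illustrate the coarser split, here is a sketch that hands out whole output rows instead of single elements. It uses flat std::vector storage and an OpenMP pragma rather than the Matrix and ThreadPool types from the question (both swaps are assumptions), and the i-k-j loop order is an extra choice for sequential memory access, not something the question's code does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrices in flat vectors: lhs is M x N, rhs is N x P.
// One parallel iteration per output row instead of one task per element.
std::vector<double> multiply_rows(const std::vector<double>& lhs,
                                  const std::vector<double>& rhs,
                                  int M, int N, int P) {
    assert(lhs.size() == static_cast<std::size_t>(M) * N);
    assert(rhs.size() == static_cast<std::size_t>(N) * P);
    std::vector<double> result(static_cast<std::size_t>(M) * P, 0.0);
    #pragma omp parallel for
    for (int i = 0; i < M; ++i) {
        for (int k = 0; k < N; ++k) {
            const double a = lhs[i * N + k];
            for (int j = 0; j < P; ++j)
                result[i * P + j] += a * rhs[k * P + j];  // row i is written by one thread only
        }
    }
    return result;
}
```

For a 200x200 product this means 200 pieces of work of 40000 multiply-adds each, rather than 40000 pieces of 200.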
Also 4. will very likely be important in your case. While you're not running into a race condition when writing the results to the array (because every thread is writing to its own index position), you are very likely to provoke a large overhead of cache syncs.
"Why?" you might think, because you're writing to different places in memory. That's because a typical CPU cache is organized in cache lines, which on the current Intel and AMD CPU models are 64 bytes long. This is the smallest size that can be used for transfers from and to the cache, when something is changed. Now that all CPU cores are reading and writing to adjacent memory words, this leads to synchronization of 64 bytes between all the cores whenever you are writing just 4 bytes (or 8, depending on the size of the data type you're using).
If memory is not an issue, you can simply "pad" every output array element with "dummy" data so that there is only one output element per cache line. If you're using 4-byte data types, this would mean skipping 15 array elements for each 1 real data element. The cache issues will also improve when you make your threading less fine-grained, because every thread will access its own contiguous region in memory practically without interfering with other threads' memory.
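In C++ the padding does not have to be done by hand. A small sketch, assuming 64-byte cache lines as stated above (PaddedFloat and kCacheLine are names made up for this example):

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines, as on the Intel and AMD CPUs mentioned above.
constexpr std::size_t kCacheLine = 64;

// alignas rounds the struct size up to 64 bytes, so neighbouring
// array elements never share a cache line.
struct alignas(kCacheLine) PaddedFloat {
    float value = 0.0f;
};

static_assert(sizeof(PaddedFloat) == kCacheLine, "one element per cache line");
static_assert(alignof(PaddedFloat) == kCacheLine, "elements start on a line boundary");
```

An array of PaddedFloat then uses 16 times the memory of a plain float array, which is the "skip 15 elements" trade-off in a different form.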
Edit: A more detailed description by Herb Sutter (one of the Gurus of C++) can be found here: http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273
Edit2: BTW, it's suggested to avoid std::move in the return statement, as this might get in the way of return-value-optimization and copy-elision rules, which the standard now demands to happen automatically. See Is returning with `std::move` sensible in the case of multiple return statements?
Multi-threading always means synchronization, context switching and function calls. This all adds up and costs CPU cycles that you could otherwise spend on the main task itself.
If you just have a third nested loop, you save all these steps and can do the computation inline instead of in a subroutine, where you must set up a stack, call into it, switch to a different thread, return the result and switch back to the main thread.
Multi-threading is only useful if these costs are small compared to the main task. I guess you will see better results with multi-threading when the matrix is larger than just 200x200.
In general, multi-threading is well suited to tasks which take a lot of time, preferably because of their complexity and not because of device access. The loop you showed us takes too short a time to execute for it to be effectively parallelized.
You have to remember that there is a lot of overhead with thread creation. There is also some (but significantly less) overhead with synchronization.

OpenMP and C++ parallel for loop: why does my code slow down when using OpenMP?

I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.
#include<iostream>
#include<vector>
#include<ctime>
#include<omp.h>
using namespace std;
int main(){
srand(time(NULL));//Seed random number generator
vector<int>v;//Create vector to hold random numbers in interval [0,9]
vector<int>d(10,0);//Vector to hold counts of each integer initialized to 0
for(int i=0;i<1e9;++i)
v.push_back(rand()%10);//Push back random numbers [0,9]
clock_t c=clock();
#pragma omp parallel for
for(int i=0;i<v.size();++i)
d[v[i]]+=1;//Count number stored at v[i]
cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;
for(vector<int>::iterator i=d.begin();i!=d.end();++i)
cout<<*i<<endl;
return 0;
}
The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each different integer there is (i.e., how many ones are found in v, how many twos, etc.)
Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?
My problem is when I try to make the counting loop parallel. Without the #pragma OpenMP statement, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.
Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my error please or point me in the direction of some insightful literature with appropriate keywords to help my search?
Your code exhibits:
race conditions due to unsynchronised access to a shared variable
false and true sharing cache problems
wrong measurement of run time
Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but with more than one thread). Compare the outputs from different runs.
False sharing occurs when two threads write to memory locations that are so close to one another that they end up on the same cache line. This results in the cache line constantly bouncing from core to core (or from CPU to CPU in multi-socket systems) and in an excess of cache coherency messages. With 32 bytes per cache line, 8 elements of the vector fit in one cache line. With 64 bytes per cache line, the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and slightly slower (but not as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) ones. True sharing occurs at those elements that are accessed by two or more threads at the same time. You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to elements of d (faster or slower, depending on your OpenMP runtime) or accumulate local values and then do a final synchronised reduction (fastest). The first one is implemented like this:
#pragma omp parallel for
for(int i=0;i<v.size();++i)
#pragma omp atomic
d[v[i]]+=1;//Count number stored at v[i]
The second is implemented like this:
omp_lock_t locks[10];
for (int i = 0; i < 10; i++)
omp_init_lock(&locks[i]);
#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
int vv = v[i];
omp_set_lock(&locks[vv]);
d[vv]+=1;//Count number stored at v[i]
omp_unset_lock(&locks[vv]);
}
for (int i = 0; i < 10; i++)
omp_destroy_lock(&locks[i]);
(include omp.h to get access to the omp_* functions)
I leave it up to you to come up with an implementation of the third option.
You are measuring elapsed time using clock(), but it measures CPU time, not wall-clock time. If you have one thread running at 100% CPU usage for 1 second, then clock() would indicate an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() would indicate an increase in CPU time of 8 seconds (that is, 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high-resolution timer API) instead.
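As one example of "some other timer API", standard C++ offers std::chrono::steady_clock, which measures wall-clock time. A minimal sketch, where wall_seconds is a hypothetical helper name:

```cpp
#include <cassert>
#include <chrono>

// Returns the wall-clock seconds taken by work(), measured with a monotonic clock.
template <typename F>
double wall_seconds(F&& work) {
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Wrapping the counting loop in a lambda and passing it to this helper reports elapsed time regardless of how many threads were busy, unlike clock().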
EDIT
Once your race condition is resolved via correct synchronization, the following paragraph applies; before that, your data races unfortunately make speed comparisons moot:
Your program is slowing down because there are 10 possible output elements in the pragma section, and they are accessed randomly. OpenMP cannot safely update any of those elements without a lock (which you would need to provide via synchronization), and locking will cost your threads more overhead than you gain from counting in parallel.
A solution to speed this up is to give each OpenMP thread a local variable that counts all of the 0-9 values that particular thread has seen, then sum those up into the master count vector. This is easily parallelized and much faster, as the threads don't need to lock on a shared write vector. I would expect close to an Nx speedup, where N is the number of OpenMP threads, as very little locking should be required. This solution also avoids the race conditions currently in your code.
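A sketch of that thread-local approach could look like the following. The function name count_values is made up, and it assumes every value in v lies in [0, nbins). Each thread fills its own private vector and only the short merge at the end is synchronised.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts occurrences of each value in [0, nbins) using per-thread local counts.
std::vector<long long> count_values(const std::vector<int>& v, int nbins) {
    assert(nbins > 0);
    std::vector<long long> total(nbins, 0);
    #pragma omp parallel
    {
        std::vector<long long> local(nbins, 0);  // private to this thread
        #pragma omp for nowait
        for (long long i = 0; i < static_cast<long long>(v.size()); ++i)
            ++local[v[i]];                       // no sharing, no lock
        #pragma omp critical
        for (int b = 0; b < nbins; ++b)          // one short merge per thread
            total[b] += local[b];
    }
    return total;
}
```

The critical section is entered once per thread rather than once per element, so its cost is negligible next to the billion increments. Built without OpenMP the pragmas are ignored and the function still returns the right counts.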
See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on thread local OpenMP

OpenMP and optimising vector operations

I'm running an algorithm at the moment that is very heavy but extremely parallel.
I've been looking at ways to speed it up and I've noticed that the slowest operation I have is my VecAdd function (It gets called thousands of times on a 6000 or so wide vector).
It is implemented as follows:
bool VecAdd( float* pOut, const float* pIn1, const float* pIn2, unsigned int num )
{
for( int idx = 0; idx < num; idx++ )
{
pOut[idx] = pIn1[idx] + pIn2[idx];
}
return true;
}
It's a very simple loop but all the additions can be performed in parallel. My first optimisation option is to move over to using SIMD as I can easily get a near 4 times speed up doing this.
However I'm also interested in the possibility of using OpenMP and having it automatically thread the for loop (potentially giving me a further 4x speedup for a total of 16x with SIMD).
However it really runs slowly. With the loop straight it takes around 3.2 seconds to process my example data. If I insert
#pragma omp parallel for
before the for loop I was assuming it would farm out several blocks of additions to other threads.
Unfortunately the result is that it takes ~7 seconds to process my example data.
Now I understand that a lot of my problem here will be caused by overheads with setting up threads and so forth but I'm still surprised just how much slower it makes things run.
Is it possible to speed this up by somehow setting up the thread pool in advance or will I never be able to combat these overheads?
Any thoughts on advice as to whether I can thread this nicely with OpenMP will be much appreciated!
Your loop should parallelize fine with the #pragma omp parallel for.
However, I think the problem is that you shouldn't parallelize at that level. You said that the function gets called thousands of times, but only operates on 6000 floats. Parallelize at the higher level, so that each thread is responsible for thousands/4 calls to VecAdd. Right now you have this algorithm:
serial execution
(re) start threads
do short computation
synchronize threads (at the end of the for loop)
back to serial code
Change it so that it's parallel at the highest possible level.
Memory bandwidth of course matters, but there is no way it would result in slower than serial execution.
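As a rough sketch of what "parallel at a higher level" could look like: the inner routine stays serial and the loop over calls is shared between threads. This assumes the individual calls are independent of each other, which the question does not confirm; vec_add and add_all are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The inner routine stays serial (and is easy for the compiler to vectorise).
void vec_add(float* out, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Assumption: the many calls are independent, so the loop over calls
// (not the loop over elements) is what gets divided between threads.
void add_all(std::vector<std::vector<float>>& out,
             const std::vector<std::vector<float>>& a,
             const std::vector<std::vector<float>>& b) {
    assert(a.size() == b.size() && out.size() == a.size());
    #pragma omp parallel for
    for (int c = 0; c < static_cast<int>(a.size()); ++c) {
        assert(a[c].size() == b[c].size() && out[c].size() == a[c].size());
        vec_add(out[c].data(), a[c].data(), b[c].data(), a[c].size());
    }
}
```

The threads are started and synchronised once for the whole batch instead of once per 6000-element addition, which is where the original version loses its time.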