Best way to implement global counters for highly concurrent applications? [duplicate] - concurrency

What is the best way to implement global counters for a highly concurrent application? In my case I may have 10K-20K goroutines performing "work", and I want to count the number and types of items that the goroutines are collectively working on.
The "classic" synchronous coding style would look like:
var work_counter int32
func GoWorkerRoutine() {
	for {
		// do work
		atomic.AddInt32(&work_counter, 1)
	}
}
Now this gets more complicated because I want to track the "type" of work being done, so really I'd need something like this:
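var work_counter map[string]int
var work_mux sync.Mutex
func GoWorkerRoutine() {
	for {
		// do work
		work_mux.Lock()
		work_counter["type1"]++ // "type1" stands in for the kind of work done
		work_mux.Unlock()
	}
}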
It seems like there should be a "go" optimized way using channels or something similar to this:
var work_counter int
var work_chan chan int // make() called somewhere else (buffered)

// started somewhere else
func GoCounterRoutine() {
	for {
		select {
		case c := <-work_chan:
			work_counter += c
		}
	}
}

func GoWorkerRoutine() {
	for {
		// do work
		work_chan <- 1
	}
}
This last example is still missing the map, but that's easy enough to add. Will this style provide better performance than just a simple atomic increment? I can't tell if this is more or less complicated when we're talking about concurrent access to a global value versus something that may block on I/O to complete...
Thoughts are appreciated.
Update 5/28/2013:
I tested a couple of implementations, and the results were not what I expected. Here's my counter source code:
package helpers

type CounterIncrementStruct struct {
	bucket string
	value  int
}

type CounterQueryStruct struct {
	bucket  string
	channel chan int
}

var counter map[string]int
var counterIncrementChan chan CounterIncrementStruct
var counterQueryChan chan CounterQueryStruct
var counterListChan chan chan map[string]int

func CounterInitialize() {
	counter = make(map[string]int)
	counterIncrementChan = make(chan CounterIncrementStruct, 0)
	counterQueryChan = make(chan CounterQueryStruct, 100)
	counterListChan = make(chan chan map[string]int, 100)
	go goCounterWriter()
}

// goCounterWriter is the only goroutine that touches the map.
func goCounterWriter() {
	for {
		select {
		case ci := <-counterIncrementChan:
			if len(ci.bucket) == 0 {
				return
			}
			counter[ci.bucket] += ci.value
		case cq := <-counterQueryChan:
			val, found := counter[cq.bucket]
			if found {
				cq.channel <- val
			} else {
				cq.channel <- -1
			}
		case cl := <-counterListChan:
			// Reply with a copy so callers never share the live map.
			nm := make(map[string]int)
			for k, v := range counter {
				nm[k] = v
			}
			cl <- nm
		}
	}
}

func CounterIncrement(bucket string, counter int) {
	if len(bucket) == 0 || counter == 0 {
		return
	}
	counterIncrementChan <- CounterIncrementStruct{bucket, counter}
}

func CounterQuery(bucket string) int {
	if len(bucket) == 0 {
		return -1
	}
	reply := make(chan int)
	counterQueryChan <- CounterQueryStruct{bucket, reply}
	return <-reply
}

func CounterList() map[string]int {
	reply := make(chan map[string]int)
	counterListChan <- reply
	return <-reply
}
It uses channels for both writes and reads which seems logical.
Here are my test cases:
func bcRoutine(b *testing.B,e chan bool) {
for i := 0; i < b.N; i++ {
func BenchmarkChannels(b *testing.B) {
e:=make(chan bool)
go bcRoutine(b,e)
go bcRoutine(b,e)
go bcRoutine(b,e)
go bcRoutine(b,e)
go bcRoutine(b,e)
var mux sync.Mutex
var m map[string]int
func bmIncrement(bucket string, value int) {
func bmRoutine(b *testing.B,e chan bool) {
for i := 0; i < b.N; i++ {
func BenchmarkMutex(b *testing.B) {
e:=make(chan bool)
for i := 0; i < b.N; i++ {
go bmRoutine(b,e)
go bmRoutine(b,e)
go bmRoutine(b,e)
go bmRoutine(b,e)
go bmRoutine(b,e)
I implemented a simple benchmark with just a mutex around the map (just testing writes), and benchmarked both with 5 goroutines running in parallel. Here are the results:
$ go test --bench=. helpers
BenchmarkChannels 100000 15560 ns/op
BenchmarkMutex 1000000 2669 ns/op
ok helpers 4.452s
I would not have expected the mutex to be that much faster...
Further thoughts?

If you're trying to synchronize a pool of workers (e.g. allow n goroutines to crunch away at some amount of work) then channels are a very good way to go about it, but if all you actually need is a counter (e.g. page views) then they are overkill. The sync and sync/atomic packages are there to help.
import "sync/atomic"

type count32 int32

func (c *count32) inc() int32 {
	return atomic.AddInt32((*int32)(c), 1)
}

func (c *count32) get() int32 {
	return atomic.LoadInt32((*int32)(c))
}

Don't use sync/atomic. From the linked page:
"Package atomic provides low-level atomic memory primitives useful for implementing synchronization algorithms. These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package."
Last time I had to do this I benchmarked something which looked like your second example with a mutex and something which looked like your third example with a channel. The channels code won when things got really busy, but make sure you make the channel buffer big.

Don't be afraid of using mutexes and locks just because you think they're "not proper Go". In your second example it's absolutely clear what's going on, and that counts for a lot. You will have to try it yourself to see how contended that mutex is, and whether adding complication will increase performance.
If you do need increased performance, perhaps sharding is the best way to go:
The downside is that your counts will only be as up-to-date as your sharding decides. There may also be performance hits from calling time.Since() so much, but, as always, measure it first :)
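The sharding example from this answer did not survive in this copy, so what follows is only a minimal sketch of one possible approach, not the answerer's code. It assumes a fixed number of shards, each with its own atomic counter padded to reduce false sharing, and it sums the shards on read rather than flushing on a timer as the answer describes. The names ShardedCounter, numShards, Add and Sum are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// numShards is an assumed value; tune it by measurement.
const numShards = 16

// shard pads each counter to 64 bytes so that neighboring shards are less
// likely to share a cache line (64 is an assumption about the hardware).
type shard struct {
	n int64
	_ [56]byte
}

// ShardedCounter is a hypothetical type, not taken from the answer.
type ShardedCounter struct {
	shards [numShards]shard
}

// Add increments the shard selected by hint (for example a worker ID).
// hint must be non-negative.
func (c *ShardedCounter) Add(hint int, delta int64) {
	atomic.AddInt64(&c.shards[hint%numShards].n, delta)
}

// Sum adds up all shards. While writers are active the result is only
// approximately current, because the shards are read one after another.
func (c *ShardedCounter) Sum() int64 {
	var total int64
	for i := range c.shards {
		total += atomic.LoadInt64(&c.shards[i].n)
	}
	return total
}

func main() {
	var c ShardedCounter
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Add(id, 1)
			}
		}(w)
	}
	wg.Wait()
	fmt.Println(c.Sum()) // prints 8000
}
```

Writes rarely touch the same cache line, at the cost of a slower and less precise read. Whether that trade pays off depends on the read/write ratio, so measure it.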

The other answer using sync/atomic is suited for things like page counters, but its example does not show how to hand out unique identifiers to an external API. For that you need an "increment-and-return" operation. One way to implement it is a CAS loop (atomic.AddInt32 also returns the updated value, so the loop is not strictly required).
Here's a CAS loop around an int32 to generate unique message IDs:
import "sync/atomic"

type UniqueID struct {
	counter int32
}

func (c *UniqueID) Get() int32 {
	for {
		val := atomic.LoadInt32(&c.counter)
		if atomic.CompareAndSwapInt32(&c.counter, val, val+1) {
			return val
		}
	}
}
To use it, simply do:
requestID := client.msgID.Get()
form.Set("id", strconv.Itoa(int(requestID))) // Set expects a string (assuming form is a url.Values)
This has an advantage over channels in that it doesn't require as many extra idle resources - existing goroutines are used as they ask for IDs rather than using one goroutine for every counter your program needs.
TODO: Benchmark against channels. I'm going to guess that channels are worse in the no-contention case and better in the high-contention case, as they have queuing while this code simply spins attempting to win the race.
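For comparison, atomic.AddInt32 returns the updated value, so an increment-and-return can also be written without an explicit loop. The sketch below is a variant under that assumption, not the answerer's code. The type AddID and the helper allDistinct are hypothetical names. Unlike the CAS loop above it returns the value after the increment, so IDs start at 1 rather than 0.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// AddID is a hypothetical counterpart to UniqueID.
type AddID struct {
	counter int32
}

// Get increments the counter and returns the new value in one atomic step.
func (c *AddID) Get() int32 {
	return atomic.AddInt32(&c.counter, 1)
}

// allDistinct (hypothetical helper) draws n IDs from each of the given number
// of goroutines and reports whether every ID was handed out exactly once.
func allDistinct(workers, n int) bool {
	var gen AddID
	var mu sync.Mutex // protects seen and distinct
	seen := make(map[int32]bool)
	distinct := true
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < n; i++ {
				id := gen.Get()
				mu.Lock()
				if seen[id] {
					distinct = false
				}
				seen[id] = true
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return distinct && len(seen) == workers*n
}

func main() {
	fmt.Println(allDistinct(8, 1000)) // prints true
}
```

Like the CAS loop, the int32 counter wraps around after roughly two billion IDs, which matters for long-running processes.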

Old question but I just stumbled upon this and it may help:
Basically, the engineers at Uber have built a few nice utility functions on top of the sync/atomic package.
I haven't tested this in production yet but the codebase is very small and the implementation of most functions is quite stock standard
Definitely preferred over using channels or basic mutexes

The last one was close:
package main

import "fmt"

func main() {
	ch := make(chan int, 3)
	go GoCounterRoutine(ch)
	go GoWorkerRoutine(1, ch)
	// not run as a goroutine because main() would just end
	GoWorkerRoutine(2, ch)
}

// started somewhere else
func GoCounterRoutine(ch chan int) {
	counter := 0
	for {
		ch <- counter
		counter += 1
	}
}

func GoWorkerRoutine(n int, ch chan int) {
	for seq := range ch {
		// do work:
		fmt.Println(n, seq)
	}
}
This introduces a single point of failure: if the counter goroutine dies, everything is lost. This may not be a problem if all goroutines are executed on one computer, but may become a problem if they are scattered over the network. To make the counter immune to failures of single nodes in the cluster, special algorithms have to be used.

I implemented this with a simple map + mutex which seems to be the best way to handle this since it is the "simplest way" (which is what Go says to use to choose locks vs channels).
package main

import "sync"

type single struct {
	mu     sync.Mutex
	values map[string]int64
}
var counters = single{
	values: make(map[string]int64),
}
func (s *single) Get(key string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.values[key]
}
func (s *single) Incr(key string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[key]++
	return s.values[key]
}
func main() {
You can run the code on
I made a simple packaged version on

see by yourself and let me know what you think.
package helpers
type CounterIncrementStruct struct {
bucket string
value int
type CounterQueryStruct struct {
bucket string
channel chan int
var counter map[string]int
var counterIncrementChan chan CounterIncrementStruct
var counterQueryChan chan CounterQueryStruct
var counterListChan chan chan map[string]int
func CounterInitialize() {
counter = make(map[string]int)
counterIncrementChan = make(chan CounterIncrementStruct, 0)
counterQueryChan = make(chan CounterQueryStruct, 100)
counterListChan = make(chan chan map[string]int, 100)
go goCounterWriter()
func goCounterWriter() {
for {
select {
case ci := <-counterIncrementChan:
if len(ci.bucket) == 0 {
return
}
counter[ci.bucket] += ci.value
case cq := <-counterQueryChan:
val, found := counter[cq.bucket]
if found {
cq.channel <- val
} else {
cq.channel <- -1
}
case cl := <-counterListChan:
nm := make(map[string]int)
for k, v := range counter {
nm[k] = v
cl <- nm
func CounterIncrement(bucket string, counter int) {
if len(bucket) == 0 || counter == 0 {
return
}
counterIncrementChan <- CounterIncrementStruct{bucket, counter}
}
func CounterQuery(bucket string) int {
if len(bucket) == 0 {
return -1
reply := make(chan int)
counterQueryChan <- CounterQueryStruct{bucket, reply}
return <-reply
func CounterList() map[string]int {
reply := make(chan map[string]int)
counterListChan <- reply
return <-reply
package distributed
type Counter struct {
buckets map[string]int
incrQ chan incrQ
readQ chan readQ
sumQ chan chan int
func New() Counter {
c := Counter{
buckets: make(map[string]int, 100),
incrQ: make(chan incrQ, 1000),
readQ: make(chan readQ, 0),
sumQ: make(chan chan int, 0),
return c
func (c Counter) run() {
for {
select {
case a := <-c.readQ:
a.res <- c.buckets[a.bucket]
case a := <-c.sumQ:
var sum int
for _, cnt := range c.buckets {
sum += cnt
a <- sum
case a := <-c.incrQ:
c.buckets[a.bucket] += a.count
func (c Counter) Get(bucket string) int {
res := make(chan int)
c.readQ <- readQ{bucket: bucket, res: res}
return <-res
func (c Counter) Sum() int {
res := make(chan int)
c.sumQ <- res
return <-res
type readQ struct {
bucket string
res chan int
type incrQ struct {
bucket string
count int
func (c Counter) Agent(bucket string, limit int) *Agent {
a := &Agent{
bucket: bucket,
limit: limit,
sendIncr: c.incrQ,
return a
type Agent struct {
bucket string
limit int
count int
sendIncr chan incrQ
func (a *Agent) Incr(n int) {
a.count += n
if a.count > a.limit {
select {
case a.sendIncr <- incrQ{bucket: a.bucket, count: a.count}:
a.count = 0
func (a *Agent) Done() {
a.sendIncr <- incrQ{bucket: a.bucket, count: a.count}
a.count = 0
package counters
import (
var mux sync.Mutex
var m map[string]int
func bmIncrement(bucket string, value int) {
m[bucket] += value
func BenchmarkMutex(b *testing.B) {
m = make(map[string]int)
buckets := []string{
var wg sync.WaitGroup
for i := 0; i < b.N; i++ {
go func() {
for _, b := range buckets {
bmIncrement(b, 5)
for _, b := range buckets {
bmIncrement(b, 5)
package counters
import (
func BenchmarkDistributed(b *testing.B) {
counter := distributed.New()
agents := []*distributed.Agent{
counter.Agent("abc123", 100),
counter.Agent("def456", 100),
counter.Agent("ghi789", 100),
var wg sync.WaitGroup
for i := 0; i < b.N; i++ {
go func() {
for _, a := range agents {
for _, a := range agents {
for _, a := range agents {
$ go test --bench=. --count 10 -benchmem
goos: linux
goarch: amd64
pkg: test/counters
BenchmarkDistributed-4 3356620 351 ns/op 24 B/op 0 allocs/op
BenchmarkDistributed-4 3414073 368 ns/op 11 B/op 0 allocs/op
BenchmarkDistributed-4 3371878 374 ns/op 7 B/op 0 allocs/op
BenchmarkDistributed-4 3240631 387 ns/op 3 B/op 0 allocs/op
BenchmarkDistributed-4 3169230 389 ns/op 2 B/op 0 allocs/op
BenchmarkDistributed-4 3177606 386 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 3064552 390 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 3065877 409 ns/op 2 B/op 0 allocs/op
BenchmarkDistributed-4 2924686 400 ns/op 1 B/op 0 allocs/op
BenchmarkDistributed-4 3049873 389 ns/op 0 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1106 ns/op 17 B/op 0 allocs/op
BenchmarkMutex-4 948331 1246 ns/op 9 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1244 ns/op 12 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1246 ns/op 11 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1228 ns/op 1 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1235 ns/op 2 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1244 ns/op 1 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1214 ns/op 0 B/op 0 allocs/op
BenchmarkMutex-4 956024 1233 ns/op 0 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1213 ns/op 0 B/op 0 allocs/op
ok test/counters 37.461s
If you change the limit value to 1000, the code gets much faster:
$ go test --bench=. --count 10 -benchmem
goos: linux
goarch: amd64
pkg: test/counters
BenchmarkDistributed-4 5463523 221 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5455981 220 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5591240 213 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5277915 212 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5430421 213 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5374153 226 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5656743 219 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5337343 211 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5353845 217 ns/op 0 B/op 0 allocs/op
BenchmarkDistributed-4 5416137 217 ns/op 0 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1002 ns/op 135 B/op 0 allocs/op
BenchmarkMutex-4 1253211 1141 ns/op 58 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1261 ns/op 3 B/op 0 allocs/op
BenchmarkMutex-4 987345 1678 ns/op 59 B/op 0 allocs/op
BenchmarkMutex-4 925371 1247 ns/op 0 B/op 0 allocs/op
BenchmarkMutex-4 1000000 1259 ns/op 2 B/op 0 allocs/op
BenchmarkMutex-4 978800 1248 ns/op 0 B/op 0 allocs/op
BenchmarkMutex-4 982144 1213 ns/op 0 B/op 0 allocs/op
BenchmarkMutex-4 975681 1254 ns/op 0 B/op 0 allocs/op
BenchmarkMutex-4 994789 1205 ns/op 0 B/op 0 allocs/op
ok test/counters 34.314s
Changing the length of Counter.incrQ will also greatly affect performance, at the cost of more memory.
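The effect of limit can be illustrated with a smaller sketch: each worker accumulates counts locally and only touches shared state once its local count exceeds limit. This is a simplified stand-in that flushes into a shared atomic instead of the channel-based Counter above. The names localBatcher, Incr, Flush and run are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// localBatcher is owned by exactly one goroutine, so only the flush into the
// shared total needs synchronization.
type localBatcher struct {
	shared  *int64
	limit   int
	pending int
	flushes int
}

// Incr adds n locally and flushes once the pending count exceeds limit.
func (b *localBatcher) Incr(n int) {
	b.pending += n
	if b.pending > b.limit {
		b.Flush()
	}
}

// Flush publishes the pending count; call it once more when the worker is done.
func (b *localBatcher) Flush() {
	if b.pending == 0 {
		return
	}
	atomic.AddInt64(b.shared, int64(b.pending))
	b.pending = 0
	b.flushes++
}

// run starts the workers and returns the shared total and the number of flushes.
func run(workers, perWorker, limit int) (total int64, flushes int) {
	var shared int64
	perWorkerFlushes := make([]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			b := &localBatcher{shared: &shared, limit: limit}
			for i := 0; i < perWorker; i++ {
				b.Incr(1)
			}
			b.Flush()
			perWorkerFlushes[w] = b.flushes
		}(w)
	}
	wg.Wait()
	for _, f := range perWorkerFlushes {
		flushes += f
	}
	return atomic.LoadInt64(&shared), flushes
}

func main() {
	for _, limit := range []int{100, 1000} {
		total, flushes := run(4, 10000, limit)
		fmt.Println("limit:", limit, "total:", total, "shared writes:", flushes)
	}
}
```

A larger limit means fewer writes to shared state for the same total, which is the same trade the Agent makes: the global count lags behind by up to limit per worker.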

If your work counter types are not dynamic, i.e. you can write them all out upfront, I don't think you'll get much simpler or faster than this.
No mutex, no channel, no map. Just a statically sized array and an enum.
type WorkType int

const (
	WorkType1 WorkType = iota
	WorkType2
	NumWorkTypes // keep last: it is the number of work types
)

var workCounter [NumWorkTypes]int64

func updateWorkCount(workType WorkType, delta int) {
	atomic.AddInt64(&workCounter[workType], int64(delta))
}
Usage like so:
updateWorkCount(WorkType1, 1)
If you need to sometimes work with work types as strings for display purposes, you can always generate code with a tool like stringer


SwiftUI Drag & Drop - NSInternalInconsistencyException: Could not get the cell at indexPath

I have implemented drag & drop between two lists in SwiftUI. Regular drag and drop works fine when dropping an item in between list items, but when I do something more advanced, like dragging it onto a cell itself, my app crashes.
The crash happens even before I drop the item (while I am still dragging it, without releasing).
Has anyone encountered this crash? Is this a SwiftUI bug, or am I doing something wrong?
I have tried removing the code blocks inside dropDestination, yet the crash still occurs, so what could I be doing wrong?
Here is the gif of the crash (crash occurs right after I drag it onto the second cell, without dropping/releasing it):
Code inside dropDestination does not even get called.
SwiftUI Code
var body2: some View {
HStack {
VStack {
List {
ForEach(viewModel.pendingOrders) { order in
RestaurantOrderCardView(order: order) {
VStack {
List {
ForEach(viewModel.inPreparationOrders) { item in
RestaurantOrderCardView(order: item)
.dropDestination(for: RestaurantOrder.self) { items, offset in
guard let firstItem = items.first else { return }
withAnimation {
viewModel.dropItem(firstItem, offset: offset)
.background {
if !viewModel.inPreparationOrders.isEmpty {
.overlay {
if viewModel.inPreparationOrders.isEmpty {
RoundedRectangle(cornerRadius: 8)
.foregroundColor(!inDropArea ? .grayBackgroundColor : .grayBackgroundColor.opacity(0.25))
.dropDestination(for: RestaurantOrder.self) { items, location in
guard let firstItem = items.first else { return false }
withAnimation {
viewModel.dropItem(firstItem, offset: 0)
return true
} isTargeted: { inDropArea in
withAnimation {
self.inDropArea = inDropArea
Crash Log
2023-01-06 11:25:05.995303+0100 SimplifyStayAdmin[78761:2158775] *** Assertion failure in -[_UICollectionViewDragAndDropController _beginDragAndDropInsertingItemAtIndexPath:], _UICollectionViewDragAndDropController.m:620
2023-01-06 11:25:06.074407+0100 SimplifyStayAdmin[78761:2158775] *** Terminating app due to uncaught exception 'NSInternalInconsistencyException', reason: 'Could not get the cell at indexPath <NSIndexPath: 0x8846e25efd6dfbf1> {length = 2, path = 0 - 0} to start the reording portion of the Drag-and-drop'
*** First throw call stack:
0 CoreFoundation 0x000000018040e7c8 __exceptionPreprocess + 172
1 libobjc.A.dylib 0x0000000180051144 objc_exception_throw + 56
2 Foundation 0x0000000180b13b98 _userInfoForFileAndLine + 0
3 UIKitCore 0x0000000109a546ec -[_UICollectionViewDragAndDropController _beginDragAndDropInsertingItemAtIndexPath:] + 540
4 UIKitCore 0x0000000109a53064 -[_UICollectionViewDragAndDropController beginReorderingForItemAtIndexPath:cell:] + 204
5 UIKitCore 0x0000000109a1bad0 -[UICollectionView _beginInteractiveMovementForItemAtIndexPath:] + 196
6 UIKit 0x000000011803040c -[UICollectionViewAccessibility beginInteractiveMovementForItemAtIndexPath:] + 80
7 UIKitCore 0x0000000109a5c55c -[_UICollectionViewDragDestinationController _reorderingDisplayLinkDidTick] + 912
8 QuartzCore 0x0000000187dd04f8 _ZN2CA7Display11DisplayLink14dispatch_itemsEyyy + 808
9 QuartzCore 0x0000000187ec89a8 _ZL22display_timer_callbackP12__CFMachPortPvlS1_ + 336
10 CoreFoundation 0x000000018033ee94 __CFMachPortPerform + 172
11 CoreFoundation 0x000000018037387c __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 56
12 CoreFoundation 0x0000000180372e9c __CFRunLoopDoSource1 + 496
13 CoreFoundation 0x000000018036d43c __CFRunLoopRun + 2176
14 CoreFoundation 0x000000018036c7a4 CFRunLoopRunSpecific + 584
15 GraphicsServices 0x0000000188ff7c98 GSEventRunModal + 160
16 UIKitCore 0x000000010a1f237c -[UIApplication _run] + 868
17 UIKitCore 0x000000010a1f6374 UIApplicationMain + 124
18 SwiftUI 0x000000010e6150d4 OUTLINED_FUNCTION_51 + 496
19 SwiftUI 0x000000010e614f7c OUTLINED_FUNCTION_51 + 152
20 SwiftUI 0x000000010dd7ab60 OUTLINED_FUNCTION_10 + 88
21 SimplifyStayAdmin 0x000000010044d0c4 $s17SimplifyStayAdmin0abC3AppV5$mainyyFZ + 40
22 SimplifyStayAdmin 0x000000010044d378 main + 12
23 dyld 0x00000001080c1fa0 start_sim + 20
24 ??? 0x00000001081c9e50 0x0 + 4431060560
25 ??? 0x0869800000000000 0x0 + 606156362346397696
libc++abi: terminating with uncaught exception of type NSException
*** Terminating app due to uncaught exception 'NSInternalInconsistencyException', reason: 'Could not get the cell at indexPath <NSIndexPath: 0x8846e25efd6dfbf1> {length = 2, path = 0 - 0} to start the reording portion of the Drag-and-drop'
terminating with uncaught exception of type NSException
CoreSimulator 857.14 - Device: iPad (10th generation) (D79A0AD8-4970-4F84-A5F9-078DAFB697A3) - Runtime: iOS 16.2 (20C52) - DeviceType: iPad (10th generation)

Can anyone please tell why this code is producing the wrong output?

Problem statement:
Given two arrays A and B. Given Q queries each having a positive integer i denoting an index of the array A. For each query, your task is to find all the elements less than or equal to Ai in the array B.
My code doesn't seem to work for all the test cases.
20000 // array size
24527 11330 19029 903 1178 1687 3954 11549 15705 23325 14294 23378 28891 27002 26716 13346 12153 14766 7641 17062 4928 2979 11867 833 27474 25860 28590 27 13961 12627 10021 4560 12638 10294 9878 6249 31497 28178 15015 16523 5610 8923 20040 10478 18216 21291 26497 31761 6552 32657 24942 21036 2016 11819 1928 28519 4572 14967 30245 12873 16704 22374 25667 18035 24959 30642 14977 28558 28396 4210 7022 130 287 27116 16646 21224 13467 29354 21370 21187 22446 18640 7472 29290 24216 28076 16395 6857 25327 22415 20460 27593 12865 21979 30329 24845 12284 31582 1053 11999 3723 734 4687 27498 9154 25077 6936 22569 23676 32288 10703 24479 4994 14354 2344 6985 20399 16718 4717 30161 11602 28660 522 15748 30420 1243 30031 15110 12443 6113 30066 8260 7213 7807 13267 25515 30361 16545 23428 23448 30227 28596 7177 11791 19166 29696 20828 26799 10095 25656 27957 21733 5071 15183 1415 23649 4161 142 11342 4550 19237 13796 29832 12710 28188 125 18561 12205 18029 16277 30036 9244 19623 1423 4015 1164.................
The correct output is:
And my code's output is:
This is my code:
#include <iostream>
using namespace std;
int findindex (int arr[], int start, int end, int x)
{
    while(start <= end)
    {
        int mid = (start + end) / 2;
        if (arr[mid] <= x) // its lesser so more el can exist in rt search space
            start = mid + 1;
        else
            end = mid - 1;
    }
    return end;
}
int main() {
int t ; cin>>t;
int n,i,q;
int arr1[n],arr2[n];
for(i = 0; i < n; i++) // arr1
for(i = 0; i < n; i++) //arr2
cin>>q; // no of queries
{ int x;
int index=findindex(arr2,0,n-1,x) ;
return 0;
When you are calling the findindex() function, pass arr1[x] instead of x.
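The fix can be sketched in Go (used here only for illustration; the question's code is C++). The sketch assumes B is sorted in ascending order, as binary search requires, that the expected answer is the count of matching elements, and that query indices are 0-based. The helper name countLE is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// countLE returns how many elements of the ascending-sorted slice b are <= x.
func countLE(b []int, x int) int {
	// sort.Search returns the first index whose element is greater than x,
	// which equals the number of elements that are <= x.
	return sort.Search(len(b), func(i int) bool { return b[i] > x })
}

func main() {
	a := []int{4, 1, 7}
	b := []int{9, 4, 2, 4}
	sort.Ints(b) // b is now [2 4 4 9]

	// Each query is an index into a; the value a[q] is what gets searched for.
	for _, q := range []int{0, 1, 2} {
		fmt.Println(countLE(b, a[q])) // prints 3, then 0, then 3
	}
}
```

Passing q itself instead of a[q] compares B against the query index, which is the mistake the answer points out.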

high performance calculations and saving of the threads identificators

I am writing a grid-stride loop for high-performance calculations over a large N, for example long long N = 1<<36 or even more. From the whole grid I need only some indexes, namely those that satisfy a given condition.
__global__ void Indexes(int *array, int N) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
while( index<N)
if (condition)
{....//do something to save index in array}
index += blockDim.x * gridDim.x;
Of course, it is possible to use Thrust, which provides both host and device arrays. But in this case the calculation will obviously be extremely inefficient, because it first creates a lot of unneeded elements and then has to delete them.
What is the most efficient way to save the indexes directly in a device array and then pass them to the CPU?
If your output is relatively dense (i.e. a lot of indices and relatively few zeros), then the stream compaction approach suggested in comments is a good solution. There are a lot of ready-to-go stream compaction implementations which you can probably adapt to your purposes.
If your output is sparse, so you need to save relatively few indices for a lot of inputs, then stream compaction isn't such a great solution because it will waste a lot of GPU memory. In that case (and if you can roughly estimate an upper bound on the number of output indices) something like this:
template <typename T>
struct Array
T* p;
int Nmax;
int* next;
Array() = default;
__host__ __device__
Array(T* _p, int _Nmax, int* _next) : p(_p), Nmax(_Nmax), next(_next) {};
__device__
int append(T& val)
int pos = atomicAdd(next, 1);
if (pos > Nmax) {
atomicExch(next, Nmax);
return -1;
} else {
p[pos] = val;
return pos;
is probably more appropriate. Here, the idea is to use an atomically incremented position in the output array to keep track of where a thread should store its index. The code will signal if you fill the index array, and there will be information from which you can work out a restart strategy to stop the current kernel and then start from the last known index which you were able to store.
A complete example:
$ cat
#include <iostream>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
namespace AppendArray
template <typename T>
struct Array
T* p;
int Nmax;
int* next;
Array() = default;
__host__ __device__
Array(T* _p, int _Nmax, int* _next) : p(_p), Nmax(_Nmax), next(_next) {};
__device__
int append(T& val)
int pos = atomicAdd(next, 1);
if (pos > Nmax) {
atomicExch(next, Nmax);
return -1;
} else {
p[pos] = val;
return pos;
__global__
void kernelfind(int* input, int N, AppendArray::Array<int> indices)
int idx = threadIdx.x + blockIdx.x * blockDim.x;
for(; idx < N; idx += gridDim.x*blockDim.x) {
if (input[idx] % 10000 == 0) {
if (indices.append(idx) < 0) return;
int main()
const int Ninputs = 1 << 20;
thrust::device_vector<int> inputs(Ninputs);
thrust::counting_iterator<int> vals(1);
thrust::copy(vals, vals + Ninputs, inputs.begin());
int* d_input = thrust::raw_pointer_cast(inputs.data());
int Nindices = Ninputs >> 12;
thrust::device_vector<int> indices(Nindices);
int* d_indices = thrust::raw_pointer_cast(indices.data());
int* pos; cudaMallocManaged(&pos, sizeof(int)); *pos = 0;
AppendArray::Array<int> index(d_indices, Nindices-1, pos);
int gridsize, blocksize;
cudaOccupancyMaxPotentialBlockSize(&gridsize, &blocksize, kernelfind, 0, 0);
kernelfind<<<gridsize, blocksize>>>(d_input, Ninputs, index);
cudaDeviceSynchronize(); // wait for the kernel before reading *pos on the host
for(int i = 0; i < *pos; ++i) {
int idx = indices[i];
std::cout << i << " " << idx << " " << inputs[idx] << std::endl;
return 0;
$ nvcc -std=c++11 -arch=sm_52 -o append
$ ./append
0 9999 10000
1 19999 20000
2 29999 30000
3 39999 40000
4 49999 50000
5 69999 70000
6 79999 80000
7 59999 60000
8 89999 90000
9 109999 110000
10 99999 100000
11 119999 120000
12 139999 140000
13 129999 130000
14 149999 150000
15 159999 160000
16 169999 170000
17 189999 190000
18 179999 180000
19 199999 200000
20 209999 210000
21 219999 220000
22 239999 240000
23 249999 250000
24 229999 230000
25 279999 280000
26 269999 270000
27 259999 260000
28 319999 320000
29 329999 330000
30 289999 290000
31 299999 300000
32 339999 340000
33 349999 350000
34 309999 310000
35 359999 360000
36 379999 380000
37 399999 400000
38 409999 410000
39 369999 370000
40 429999 430000
41 419999 420000
42 389999 390000
43 439999 440000
44 459999 460000
45 489999 490000
46 479999 480000
47 449999 450000
48 509999 510000
49 539999 540000
50 469999 470000
51 499999 500000
52 569999 570000
53 549999 550000
54 519999 520000
55 589999 590000
56 529999 530000
57 559999 560000
58 619999 620000
59 579999 580000
60 629999 630000
61 669999 670000
62 599999 600000
63 609999 610000
64 699999 700000
65 639999 640000
66 649999 650000
67 719999 720000
68 659999 660000
69 679999 680000
70 749999 750000
71 709999 710000
72 689999 690000
73 729999 730000
74 779999 780000
75 799999 800000
76 809999 810000
77 739999 740000
78 849999 850000
79 759999 760000
80 829999 830000
81 789999 790000
82 769999 770000
83 859999 860000
84 889999 890000
85 879999 880000
86 819999 820000
87 929999 930000
88 869999 870000
89 839999 840000
90 909999 910000
91 939999 940000
92 969999 970000
93 899999 900000
94 979999 980000
95 959999 960000
96 949999 950000
97 1019999 1020000
98 1009999 1010000
99 989999 990000
100 1029999 1030000
101 919999 920000
102 1039999 1040000
103 999999 1000000

Huge latency spikes while running simple code

I have a simple benchmark that demonstrates the performance of busy-wait threads. It runs in two modes: the first simply takes two timepoints sequentially; the second iterates through a vector and measures the duration of each iteration.
Two sequential calls to clock::now() take about 50 nanoseconds on average, and one iteration through the vector takes about 100 nanoseconds on average. But sometimes these operations are executed with a huge delay: about 50 microseconds in the first case and 10 milliseconds (!) in the second case.
The test runs on a single isolated core, so context switches should not occur. I also call mlockall at the beginning of the program, so I assume that page faults do not affect the performance.
The following additional optimizations were also applied:
kernel boot parameters: intel_idle.max_cstate=0 idle=halt
irqaffinity=0,14 isolcpus=4-13,16-27 pti=off spectre_v2=off audit=0
selinux=0 nmi_watchdog=0 nosoftlockup=0 rcu_nocb_poll rcu_nocbs=19-20
rcu[^c] kernel threads moved to a housekeeping CPU core 0;
network card RxTx queues moved to a housekeeping CPU core 0;
writeback kernel workqueue moved to a housekeeping CPU core 0;
transparent_hugepage disabled;
Intel CPU HyperThreading disabled;
swap file/partition is not used.
System details:
Default Archlinux kernel:
5.1.9-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 11 16:18:09 UTC 2019 x86_64 GNU/Linux
that has following PREEMPT and HZ settings:
Hardware details:
RAM: 256GB
CPU(s): 28
On-line CPU(s) list: 0-27
Thread(s) per core: 1
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 3200.011
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5202.68
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13
NUMA node1 CPU(s): 14-27
Example code:
struct TData
std::vector<char> Data;
TData() = default;
TData(size_t aSize)
for (size_t i = 0; i < aSize; ++i)
using TBuffer = std::vector<TData>;
TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
if (!aPerform)
return TData {};
const TData& result = aBuffer[outBufferIndex];
if (++outBufferIndex == aBuffer.size())
outBufferIndex = 0;
return result;
void WarmUp(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer)
size_t bufferIndex = 0;
for (size_t i = 0; i < aCyclesCount; ++i)
auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
void TestCycle(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer, Measurings& outStatistics)
size_t bufferIndex = 0;
for (size_t i = 0; i < aCyclesCount; ++i)
auto t1 = std::chrono::steady_clock::now();
auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
auto t2 = std::chrono::steady_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
outStatistics.AddMeasuring(diff, t2);
int Run(int aCpu, size_t aDataSize, size_t aBufferSize, size_t aCyclesCount, bool aAllocate, bool aPerform)
if (mlockall(MCL_CURRENT | MCL_FUTURE))
throw std::runtime_error("mlockall failed");
std::cout << "Test parameters"
<< ":\ndata size=" << aDataSize
<< ",\nnumber of elements=" << aBufferSize
<< ",\nbuffer size=" << aBufferSize * aDataSize
<< ",\nnumber of cycles=" << aCyclesCount
<< ",\nallocate=" << aAllocate
<< ",\nperform=" << aPerform
<< ",\nthread ";
TBuffer buffer;
if (aPerform)
std::fill(buffer.begin(), buffer.end(), TData { aDataSize });
std::cout << "Running..."<< std::endl;
WarmUp(aBufferSize * 2, aPerform, buffer);
Measurings statistics;
TestCycle(aCyclesCount, aPerform, buffer, statistics);
if (munlockall())
throw std::runtime_error("munlockall failed");
return 0;
And following results are received:
StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=0
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
thread 14056 on cpu 19
Statistics: min: 16: max: 18985: avg: 18
0 - 10 : 0 (0 %): -
10 - 100 : 999993494 (99 %): min: 40: max: 117130: avg: 40
100 - 1000 : 946 (0 %): min: 380: max: 506236837: avg: 43056598
1000 - 10000 : 5549 (0 %): min: 56876: max: 70001739: avg: 7341862
10000 - 18985 : 11 (0 %): min: 1973150818: max: 14060001546: avg: 3644216650
StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=1
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
thread 3264 on cpu 19
Statistics: min: 36: max: 4967479: avg: 48
0 - 10 : 0 (0 %): -
10 - 100 : 964323921 (96 %): min: 60: max: 4968567: avg: 74
100 - 1000 : 35661548 (3 %): min: 122: max: 4972632: avg: 2023
1000 - 10000 : 14320 (0 %): min: 1721: max: 33335158: avg: 5039338
10000 - 100000 : 130 (0 %): min: 10010533: max: 1793333832: avg: 541179510
100000 - 1000000 : 0 (0 %): -
1000000 - 4967479 : 81 (0 %): min: 508197829: max: 2456672083: avg: 878824867
Any ideas what is the reason of such huge delays and how it may be investigated?
TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex);
It returns a std::vector<char> by value. That involves a memory allocation and data copying. The memory allocations can do a syscall (brk or mmap) and memory mappings related syscalls are notorious for being slow.
When timings include syscalls one cannot expect low variance.
You may like to run your application with /usr/bin/time --verbose <app> or perf stat -ddd <app> to see the number of page faults and context switches.

Why is my supposedly parallel go program not parallel

package main

import (
	"fmt"
	"sync"
)

var wg sync.WaitGroup

func alphabets() {
	for char := 'a'; char < 'a'+26; char++ {
		fmt.Printf("%c ", char)
	}
	wg.Done() //decrement number of goroutines to wait for
}

func numbers() {
	for number := 1; number < 27; number++ {
		fmt.Printf("%d ", number)
	}
	wg.Done()
}

func main() {
	wg.Add(2) //wait for two goroutines
	fmt.Println("Starting Go Routines")
	go alphabets()
	go numbers()
	fmt.Println("\nWaiting To Finish")
	wg.Wait() //wait for the two goroutines to finish executing
	fmt.Println("\nTerminating Program")
}
I expect the output to be mixed up(for lack of a better word), but instead; a sample output is:
$ go run parallel_prog.go
Starting Go Routines
Waiting To Finish
a b c d e f g h i j k l m n o p q r s t u v w x y z 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Terminating Program
What am I missing?
You're missing nothing. It's working. The calls aren't showing up "interlaced" (mixed up) not because they're not being parallelized, but because they're happening really fast.
You can easily add some calls to time.Sleep to see the parallelization better. By sleeping, we know 100% that printing alphabets and numbers should be interlaced.
Your program with Sleep calls to "force" interlacing
package main

import (
	"fmt"
	"sync"
	"time"
)

var wg sync.WaitGroup

func alphabets() {
	defer wg.Done()
	for char := 'a'; char < 'a'+26; char++ {
		fmt.Printf("%c ", char)
		time.Sleep(time.Second * 2)
	}
}

func numbers() {
	defer wg.Done()
	for number := 1; number < 27; number++ {
		fmt.Printf("%d ", number)
		time.Sleep(time.Second * 3)
	}
}

func main() {
	wg.Add(2) // wait for two goroutines
	fmt.Println("Starting Go Routines")
	go alphabets()
	go numbers()
	fmt.Println("\nWaiting To Finish")
	wg.Wait() // block until both goroutines have called Done
	fmt.Println("\nTerminating Program")
}
You probably already know this, but setting GOMAXPROCS doesn't have any effect on whether or not this example is executed in parallel, just how many resources it consumes.
The GOMAXPROCS setting controls how many operating systems threads attempt to execute code simultaneously. For example, if GOMAXPROCS is 4, then the program will only execute code on 4 operating system threads at once, even if there are 1000 goroutines. The limit does not count threads blocked in system calls such as I/O.
Source: Go 1.5 GOMAXPROCS Default
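To check what your own program is running with, the runtime package can report both values. This is a minimal sketch; the helper name procInfo is hypothetical. Calling runtime.GOMAXPROCS with 0 only queries the current setting and does not change it.

```go
package main

import (
	"fmt"
	"runtime"
)

// procInfo returns the current GOMAXPROCS setting and the number of logical
// CPUs usable by this process.
func procInfo() (maxProcs, numCPU int) {
	return runtime.GOMAXPROCS(0), runtime.NumCPU()
}

func main() {
	p, n := procInfo()
	fmt.Printf("GOMAXPROCS=%d NumCPU=%d\n", p, n)
}
```

If GOMAXPROCS reports 1, goroutines are still scheduled concurrently, but never execute Go code at the same instant.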
Are you using the Go playground by any chance? When I run your code locally, I get:
Starting Go Routines
Waiting To Finish
1 2 3 4 5 6 7 8 9 10 11 12 a 13 14 15 16 17 b 18 19 c 20 21 d 22 23 e 24 25 f 26 g h i j k l m n o p q r s t u v w x y z
Terminating Program
The playground is deterministic in nature. Goroutines don't yield as often and don't run in multiple threads.