My code does the following:
1. Do some long-running, intense computation (called "useless" below).
2. Do a small latency-critical task.
I find that the time it takes to execute the latency-critical task is higher with the long-running computation than without it.
Here is some stand-alone C++ code to reproduce this effect:
#include <stdio.h>
#include <stdint.h>
#define LEN 128
#define USELESS 1000000000
//#define USELESS 0
// Read timestamp counter
static inline long long get_cycles()
{
unsigned low, high;
unsigned long long val;
asm volatile ("rdtsc" : "=a" (low), "=d" (high));
val = high;
val = (val << 32) | low;
return val;
}
// Compute a simple hash
static inline uint32_t hash(uint32_t *arr, int n)
{
uint32_t ret = 0;
for(int i = 0; i < n; i++) {
ret = (ret + (324723947 + arr[i])) ^ 93485734985;
}
return ret;
}
int main()
{
uint32_t sum = 0; // For adding dependencies
uint32_t arr[LEN]; // We'll compute the hash of this array
for(int iter = 0; iter < 3; iter++) {
// Create a new array to hash for this iteration
for(int i = 0; i < LEN; i++) {
arr[i] = (iter + i);
}
// Do intense computation
for(int useless = 0; useless < USELESS; useless++) {
sum += (sum + useless) * (sum + useless);
}
// Do the latency-critical task
long long start_cycles = get_cycles() + (sum & 1);
sum += hash(arr, LEN);
long long end_cycles = get_cycles() + (sum & 1);
printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
}
}
When compiled with -O3 with USELESS set to 1 billion, the three iterations took 588, 4184, and 536 cycles, respectively. When compiled with USELESS set to 0, the iterations took 394, 358, and 362 cycles, respectively.
Why could this (particularly the 4184 cycles) be happening? I suspected cache misses or branch mispredictions induced by the intense computation. However, without the intense computation, the zeroth iteration of the latency-critical task is already fast, so I don't think a cold cache or cold branch predictor is the cause.
Moving my speculative comment to an answer:
It is possible that while your busy loop is running, other tasks on the server are pushing the cached arr data out of the L1 cache, so that the first memory access in hash needs to reload from a lower level cache. Without the compute loop this wouldn't happen. You could try moving the arr initialization to after the computation loop, just to see what the effect is.
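Here is a minimal sketch of that experiment, reusing the variables and helpers from the question (only the body of the loop in main changes; nothing else is assumed):
for (int iter = 0; iter < 3; iter++) {
    // Do the intense computation first
    for (int useless = 0; useless < USELESS; useless++) {
        sum += (sum + useless) * (sum + useless);
    }
    // Create the array to hash immediately before the timed section,
    // so it is guaranteed to be hot in L1
    for (int i = 0; i < LEN; i++) {
        arr[i] = (iter + i);
    }
    // Latency-critical task, timed as before
    long long start_cycles = get_cycles() + (sum & 1);
    sum += hash(arr, LEN);
    long long end_cycles = get_cycles() + (sum & 1);
    printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
}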
Related
I write a program to calculate the sum of an array of 1M numbers where all elements = 1. I use OpenMP for multithreading. However, the run time doesn't scale with the number of threads. Here is the code:
#include <iostream>
#include <omp.h>
#define SIZE 1000000
#define N_THREADS 4
using namespace std;
int main() {
int* arr = new int[SIZE];
long long sum = 0;
int n_threads = 0;
omp_set_num_threads(N_THREADS);
double t1 = omp_get_wtime();
#pragma omp parallel
{
if (omp_get_thread_num() == 0) {
n_threads = omp_get_num_threads();
}
#pragma omp for schedule(static, 16)
for (int i = 0; i < SIZE; i++) {
arr[i] = 1;
}
#pragma omp for schedule(static, 16) reduction(+:sum)
for (int i = 0; i < SIZE; i++) {
sum += arr[i];
}
}
double t2 = omp_get_wtime();
cout << "n_threads " << n_threads << endl;
cout << "time " << (t2 - t1)*1000 << endl;
cout << sum << endl;
}
The run time (in milliseconds) for different values of N_THREADS is as follows:
n_threads 1
time 3.6718
n_threads 2
time 2.5308
n_threads 3
time 3.4383
n_threads 4
time 3.7427
n_threads 5
time 2.4621
I used schedule(static, 16) so that each thread works on chunks of 16 iterations, to avoid the false-sharing problem. I thought the performance issue was related to false sharing, but I now think it's not. What could the problem be?
Your code is memory bound, not computationally expensive. Its speed depends on the speed of memory access (cache utilization, number of memory channels, etc.), so it is not expected to scale well with the number of threads.
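As a rough back-of-envelope check (assuming 4-byte ints, which is an assumption on my part): with SIZE = 1,000,000 the parallel region writes about 4 MB and then reads it back, roughly 8 MB of traffic in ~3.6 ms, i.e. on the order of 2 GB/s including the first-touch page faults from the initialization loop. That is the regime where caches and memory bandwidth, not arithmetic, set the limit.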
UPDATE: I ran this code with a 1000x bigger SIZE (i.e. #define SIZE 1000000000), compiled with g++ -fopenmp -O3 -mavx2.
Here are the results; it still scales badly with the number of threads:
n_threads 1
time 652.656
time 657.207
time 608.838
time 639.168
1000000000
n_threads 2
time 422.621
time 373.995
time 425.819
time 386.511
time 466.632
time 394.198
1000000000
n_threads 3
time 394.419
time 391.283
time 470.925
time 375.833
time 442.268
time 449.611
time 370.12
time 458.79
1000000000
n_threads 4
time 421.89
time 402.363
time 424.738
time 414.368
time 491.843
time 429.757
time 431.459
time 497.566
1000000000
n_threads 8
time 414.426
time 430.29
time 494.899
time 442.164
time 458.576
time 449.313
time 452.309
1000000000
Five threads contending for the same accumulator in the reduction, or the chunk size of only 16 iterations, may be inhibiting efficient pipelining of the loop iterations. Try a coarser region per thread.
Maybe more importantly, you need to repeat the benchmark programmatically, both to get an average and to warm the CPU caches and let the cores ramp up to higher frequencies, in order to get a better measurement.
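A rough sketch of both suggestions, reusing arr, sum and SIZE from the question (the repeat count of 10 and the use of the best run are my own choices, not from the comments):
double best = 1e30;
for (int rep = 0; rep < 10; rep++) {
    sum = 0;
    double t1 = omp_get_wtime();
    // Default static schedule: one contiguous range per thread instead of 16-element chunks
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < SIZE; i++) {
        sum += arr[i];
    }
    double t2 = omp_get_wtime();
    if (t2 - t1 < best) best = t2 - t1;
}
cout << "best time " << best * 1000 << endl;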
The benchmark results suggest about 1 MB/s. Surely even the worst RAM will do 1000 times better than that, so memory is not the bottleneck (for now). One million elements in about 4 seconds looks like locking contention or a non-warmed-up benchmark; normally even a Pentium 1 would manage more bandwidth than that. Are you sure you are compiling with -O3?
I have reimplemented the test as a Google Benchmark with different values:
#include <benchmark/benchmark.h>
#include <memory>
#include <omp.h>
constexpr int SCALE{32};
constexpr int ARRAY_SIZE{1000000};
constexpr int CHUNK_SIZE{16};
void original_benchmark(benchmark::State& state)
{
const int num_threads{state.range(0)};
const int array_size{state.range(1)};
const int chunk_size{state.range(2)};
auto arr = std::make_unique<int[]>(array_size);
long long sum = 0;
int n_threads = 0;
omp_set_num_threads(num_threads);
// double t1 = omp_get_wtime();
#pragma omp parallel
{
if (omp_get_thread_num() == 0) {
n_threads = omp_get_num_threads();
}
#pragma omp for schedule(static, chunk_size)
for (int i = 0; i < array_size; i++) {
arr[i] = 1;
}
#pragma omp for schedule(static, chunk_size) reduction(+:sum)
for (int i = 0; i < array_size; i++) {
sum += arr[i];
}
}
// double t2 = omp_get_wtime();
// cout << "n_threads " << n_threads << endl;
// cout << "time " << (t2 - t1)*1000 << endl;
// cout << sum << endl;
state.counters["n_threads"] = n_threads;
}
static void BM_original_benchmark(benchmark::State& state) {
for (auto _ : state) {
original_benchmark(state);
}
}
BENCHMARK(BM_original_benchmark)
->Args({1, ARRAY_SIZE, CHUNK_SIZE})
->Args({1, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({1, ARRAY_SIZE, SCALE * CHUNK_SIZE})
->Args({2, ARRAY_SIZE, CHUNK_SIZE})
->Args({2, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({2, ARRAY_SIZE, SCALE * CHUNK_SIZE})
->Args({4, ARRAY_SIZE, CHUNK_SIZE})
->Args({4, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({4, ARRAY_SIZE, SCALE * CHUNK_SIZE})
->Args({8, ARRAY_SIZE, CHUNK_SIZE})
->Args({8, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({8, ARRAY_SIZE, SCALE * CHUNK_SIZE})
->Args({16, ARRAY_SIZE, CHUNK_SIZE})
->Args({16, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({16, ARRAY_SIZE, SCALE * CHUNK_SIZE});
BENCHMARK_MAIN();
I only have access to Compiler Explorer at the moment, which will not execute the complete suite of benchmarks. However, it looks like increasing the chunk size will improve the performance. Obviously, benchmark and optimize for your own system.
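If it helps to reproduce locally: assuming Google Benchmark and an OpenMP-capable compiler are installed, something like g++ -O3 -fopenmp bench.cpp -lbenchmark -lpthread should build it, though the exact flags depend on your setup.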
I am trying to benchmark linear and binary search as a part of an assignment. I have written the necessary search and randomizer functions. But when I try to benchmark them I get 0 delay even for higher array sizes.
The code:
#include<iostream>
#include <time.h>
#include <windows.h>
using namespace std;
double getTime()
{
LARGE_INTEGER t, f;
QueryPerformanceCounter(&t);
QueryPerformanceFrequency(&f);
return (double)t.QuadPart/(double)f.QuadPart;
}
int linearSearch(int arr[], int len,int target){
int resultIndex = -1;
for(int i = 0;i<len;i++){
if(arr[i] == target){
resultIndex = i;
break;
}
}
return resultIndex;
}
void badSort(int arr[],int len){
for(int i = 0 ; i< len;i++){
int indexToSwapWith = i;
for(int j = i+1;j < len;j++){
if(arr[j] < arr[indexToSwapWith] )
indexToSwapWith = j;
}
if(indexToSwapWith != i){
int t = arr[i];
arr[i] = arr[indexToSwapWith];
arr[indexToSwapWith] = t;
}
}
}
int binSearch(int arr[], int len,int target){
int resultIndex = -1;
int first = 0;
int last = len - 1;
int mid = first;
while(first <= last){
mid = (first + last)/2;
if(target < arr[mid])
last = mid-1;
else if(target > arr[mid])
first = mid+1;
else
break;
}
if(arr[mid] == target)
resultIndex = mid;
return resultIndex;
}
void fillArrRandomly(int arr[],int len){
srand(time(NULL));
for(int i = 0 ; i < len ;i++){
arr[i] = rand();
}
}
void benchmarkRandomly(int len){
float startTime = getTime();
int arr[len];
fillArrRandomly(arr,len);
badSort(arr,len);
/*
for(auto i : arr)
cout<<i<<"\n";
*/
float endTime = getTime();
float timeElapsed = endTime - startTime;
cout<< "prep took " << timeElapsed<<endl;
int target = rand();
startTime = getTime();
int result = linearSearch(arr,len,target);
endTime = getTime();
timeElapsed = endTime - startTime;
cout<<"linear search result for "<<target<<":"<<result<<" after "<<startTime<<" to "<<endTime <<":"<<timeElapsed<<"\n";
startTime = getTime();
result = binSearch(arr,len,target);
endTime = getTime();
timeElapsed = endTime - startTime;
cout<<"binary search result for "<<target<<":"<<result<<" after "<<startTime<<" to "<<endTime <<":"<<timeElapsed<<"\n";
}
int main(){
benchmarkRandomly(30000);
}
Sample output:
prep took 0.9375
linear search result for 29445:26987 after 701950 to 701950:0
binary search result for 29445:26987 after 701950 to 701950:0
I have tried using clock_t as well, but the result was the same. Do I need an even higher array size, or am I benchmarking the wrong way?
In the course I have to implement most of the stuff myself, which is why I'm not using the STL. I'm not sure if using std::chrono is allowed, but I'd like to ensure that the problem does not lie elsewhere first.
Edit: In case it isn't clear, I can't include the time for sorting and random generation in the benchmark.
One problem is that you set startTime = getTime() before you fill your test array with random values. If the random number generation is slow, this might dominate the returned result. The main effort is sorting the array; the search time will be extremely low compared to that.
It is probably too coarse an interval, as you suggest. For a binary search on 30k items we are talking about only around 15 iterations, so on a modern machine roughly 20 / 1000000000 seconds at most. This is approximately zero milliseconds.
Increasing the number of array entries won't help much unless you go very large; you could try increasing the array size until you get near the memory limit, but then your problem will be that the preparatory random number generation and sorting take forever.
I would suggest either:
A. Checking for a very large number of items:
unsigned int total = 0;
startTime = getTime();
for (int i = 0; i < 10000000; i++)
    total += binSearch(arr, len, rand());
endTime = getTime();
B. Modify your code to count the number of times you compare elements and use that information instead of timing.
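A minimal sketch of suggestion B, counting one probe per loop iteration (the extra out-parameter and the name binSearchCounted are my own additions, not part of the assignment code):
int binSearchCounted(int arr[], int len, int target, int &probes) {
    probes = 0;
    int first = 0;
    int last = len - 1;
    while (first <= last) {
        int mid = (first + last) / 2;
        ++probes;                       // one element examined per iteration
        if (target < arr[mid])
            last = mid - 1;
        else if (target > arr[mid])
            first = mid + 1;
        else
            return mid;
    }
    return -1;
}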
It looks like you're using the search result (by printing it with cout outside the timed region; that's good). And the data + key are randomized, so the search shouldn't be getting optimized away at compile time. (Benchmarking with optimization disabled is pointless, so you need tricks like this.)
Have you looked at timeElapsed with a debugger? Maybe it's a very small float that prints as 0 with default cout settings?
Or maybe float endTime - float startTime actually is equal to 0.0f because rounding to the nearest float made them equal. Subtracting two large nearby floating-point numbers produces "catastrophic cancellation".
Remember that float only has 24 bits of significand, so regardless of the frequency you divide by, if the PerformanceCounter values differ in less than 1 part in 2^24, you'll get zero. (If that function returns raw counts from x86 rdtsc, then that will happen if your system's last reboot was more than 2^24 times longer ago than the time interval. x86 TSC starts at zero when the system boots, and (on CPUs in the last ~10 years) counts at a "reference frequency" that's (approximately) equal to your CPU's rated / "sticker" frequency, regardless of turbo or idle clock speeds. See Get CPU cycle count?)
double might help, but it's much better to subtract in the integer domain before dividing. Also, rewriting that part will take QueryPerformanceFrequency out of the timed interval!
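A minimal sketch of that suggestion (the helper names getTicks and ticksToSeconds are mine, not from the question):
long long getTicks()
{
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return t.QuadPart;              // raw integer ticks, no division yet
}
double ticksToSeconds(long long ticks)
{
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);  // queried outside the timed interval
    return (double)ticks / (double)f.QuadPart;
}
// usage:
// long long start = getTicks();
// ... code under test ...
// double seconds = ticksToSeconds(getTicks() - start);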
As #Jon suggests, it's often better to put the code under test into a repeat loop inside one longer timed interval, so (code) caches and branch prediction can warm up.
But then you have the problem of making sure repeated calls aren't optimized away, and of randomizing the search key inside the loop. (Otherwise a smart compiler might hoist the search out of the loop).
Something like volatile int result = binSearch(...); can help, because assigning to (or initializing) a volatile is a visible side-effect that can't be optimized away. So the compiler needs to actually materialize each search result in a register.
For some compilers, e.g. ones that support GNU C inline asm, you can use inline asm to require the compiler to produce a value in a register without adding any overhead of storing it anywhere. AFAIK this isn't possible with MSVC inline asm.
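Putting those pieces together, a sketch of a repeat-loop benchmark using the question's getTime, binSearch, arr and len (the repeat count is an arbitrary choice of mine):
const int REPEATS = 1000000;
double start = getTime();
for (int r = 0; r < REPEATS; ++r) {
    int target = rand();                                 // re-randomize the key each iteration
    volatile int result = binSearch(arr, len, target);   // volatile sink: cannot be optimized away
    (void)result;
}
double end = getTime();
cout << "binary search: about " << (end - start) / REPEATS * 1e9
     << " ns per call (including rand() overhead)\n";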
TL;DR: How do I safely perform a single-bit update A[n/8] |= (1<<n%8); (i.e., set the n-th bit of A to true) when A is a huge array of chars and the computation runs in parallel using C++11's <thread> library?
I'm performing a computation that's easy to parallelize. I'm computing elements of a certain subset of the natural numbers, and I want to find elements that are not in the subset. For this I create a huge array (like A = new char[20l*1024l*1024l*1024l], i.e., 20 GiB). The n-th bit of this array is true if n lies in my set.
When doing this in parallel and setting bits with A[n/8] |= (1<<n%8);, I seem to get a small loss of information, presumably due to concurrent work on the same byte of A (each thread has to read the byte, update the single bit, and write the byte back). How can I get around this? Is there a way to do this update as an atomic operation?
The code follows. GCC version: g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609. The machine is an 8-core Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 37GB RAM. Compiler options: g++ -std=c++11 -pthread -O3
#include <iostream>
#include <thread>
typedef long long myint; // long long to be sure
const myint max_A = 20ll*1024ll*1024ll; // 20 MiB for testing
//const myint max_A = 20ll*1024ll*1024ll*1024ll; // 20 GiB in the real code
const myint n_threads = 1; // Number of threads
const myint prime = 1543; // Tested prime
char *A;
const myint max_n = 8*max_A;
inline char getA(myint n) { return A[n/8] & (1<<(n%8)); }
inline void setAtrue(myint n) { A[n/8] |= (1<<n%8); }
void run_thread(myint startpoint) {
// Calculate all values of x^2 + 2y^2 + prime*z^2 up to max_n
// We loop through x == startpoint (mod n_threads)
for(myint x = startpoint; 1*x*x < max_n; x+=n_threads)
for(myint y = 0; 1*x*x + 2*y*y < max_n; y++)
for(myint z = 0; 1*x*x + 2*y*y + prime*z*z < max_n; z++)
setAtrue(1*x*x + 2*y*y + prime*z*z);
}
int main() {
myint n;
// Only n_threads-1 threads, as we will use the master thread as well
std::thread T[n_threads-1];
// Initialize the array
A = new char[max_A]();
// Start the threads
for(n = 0; n < n_threads-1; n++) T[n] = std::thread(run_thread, n);
// We use also the master thread
run_thread(n_threads-1);
// Synchronize
for(n = 0; n < n_threads-1; n++) T[n].join();
// Print and count all elements not in the set and n != 0 (mod prime)
myint cnt = 0;
for(n=0; n<max_n; n++) if(( !getA(n) )&&( n%1543 != 0 )) {
std::cout << n << std::endl;
cnt++;
}
std::cout << "cnt = " << cnt << std::endl;
return 0;
}
When n_threads = 1, I get the correct value cnt = 29289. When n_threads = 7, I got cnt = 29314 and cnt = 29321 on two different runs, suggesting that some of the bitwise operations on a single byte raced with each other.
std::atomic provides all the facilities that you need here:
std::array<std::atomic<char>, max_A> A;
static_assert(sizeof(A[0]) == 1, "Shall not have memory overhead");
static_assert(std::atomic<char>::is_always_lock_free,
"No software-level locking needed on common platforms");
inline char getA(myint n) { return A[n / 8] & (1 << (n % 8)); }
inline void setAtrue(myint n) { A[n / 8].fetch_or(1 << n % 8); }
The load in getA is atomic (equivalent to load()), and std::atomic even has built-in support for oring the stored value with another one (fetch_or), atomically of course.
When initializing A, the naive way of for (auto& a : A) a = 0; would require synchronization after every store, which you can avoid by giving up some ordering guarantees. std::memory_order_release only requires that what we write is visible to other threads (but not that other threads' writes are visible to us). And indeed, if you do
// Initialize the array
for (auto& a : A)
a.store(0, std::memory_order_release);
you get the safety you need without any assembly-level synchronization on x86. You could do the reverse for the loads after the threads finish, but that has no added benefit on x86 (it's just a mov either way).
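On the same theme, a possible variant of the answer's setAtrue (my own sketch, not from the answer): since the worker threads only ever set bits and the main thread reads A only after join(), which itself synchronizes, a relaxed fetch_or is sufficient:
inline void setAtrue(myint n)
{
    // Atomic read-modify-write of one byte; relaxed ordering is enough here
    // because no thread reads the bits until after all threads have joined.
    A[n / 8].fetch_or(static_cast<char>(1 << (n % 8)), std::memory_order_relaxed);
}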
Demo on the full code: https://godbolt.org/z/nLPlv1
I wrote a program that reads a 256 KB array, expecting each pass to take about 1 ms. The program is pretty simple and attached below.
However, when I run it in a VM on Xen, I find that the latency is not stable. It has the following pattern (the last column is in ms):
#totalCycle CyclePerLine totalms
22583885 5513 6.452539
3474342 848 0.992669
3208486 783 0.916710
25848572 6310 7.385306
3225768 787 0.921648
3210487 783 0.917282
25974700 6341 7.421343
3244891 792 0.927112
3276027 799 0.936008
25641513 6260 7.326147
3531084 862 1.008881
3233687 789 0.923911
22397733 5468 6.399352
3523403 860 1.006687
3586178 875 1.024622
26094384 6370 7.455538
3540329 864 1.011523
3812086 930 1.089167
25907966 6325 7.402276
I'm thinking some process is doing something, and it looks like an event-driven process. Has anyone encountered this before, or can anyone point out the potential process/service that could make this happen?
Below is my program. I run it 1000 times; each run produces one line of the result above.
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <ctime>
using namespace std;
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
#endif
#define CACHE_LINE_SIZE 64
#define WSS 24567 /* 24 Mb */
#define NUM_VARS WSS * 1024 / sizeof(long)
#define KHZ 3500000
// ./a.out memsize(in KB)
int main(int argc, char** argv)
{
unsigned long wcet = atol(argv[1]);
unsigned long mem_size_KB = 256; // mem size in KB
unsigned long mem_size_B = mem_size_KB * 1024; // mem size in Byte
unsigned long count = mem_size_B / sizeof(long);
unsigned long row = mem_size_B / CACHE_LINE_SIZE;
int col = CACHE_LINE_SIZE / sizeof(long);
unsigned long long start, finish, dur1;
unsigned long temp;
long *buffer;
buffer = new long[count];
// init array
for (unsigned long i = 0; i < count; ++i)
buffer[i] = i;
for (unsigned long i = row-1; i >0; --i) {
temp = rand()%i;
swap(buffer[i*col], buffer[temp*col]);
}
// warm the cache again
temp = buffer[0];
for (unsigned long i = 0; i < row-1; ++i) {
temp = buffer[temp];
}
// First read, should be cache hit
temp = buffer[0];
start = rdtsc();
int sum = 0;
for(int wcet_i = 0; wcet_i < wcet; wcet_i++)
{
for(int j=0; j<21; j++)
{
for (unsigned long i = 0; i < row-1; ++i) {
if (i%2 == 0) sum += buffer[temp];
else sum -= buffer[temp];
temp = buffer[temp];
}
}
}
finish = rdtsc();
dur1 = finish-start;
// Res
printf("%lld %lld %.6f\n", dur1, dur1/row, dur1*1.0/KHZ);
delete[] buffer;
return 0;
}
The use of the RDTSC instruction in a virtual machine is complicated. It is likely that the hypervisor (Xen) is emulating the RDTSC instruction by trapping it. Your fastest runs show around 800 cycles per cache line, which is very, very slow... the only explanation is that RDTSC results in a trap that is handled by the hypervisor, and that overhead is a performance bottleneck. I'm not sure about the even longer times that you see periodically, but given that RDTSC is being trapped, all timing bets are off.
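One way to sanity-check this hypothesis from inside the guest is to time back-to-back RDTSCs; on bare metal the delta is typically a few tens of cycles, while a trapped/emulated RDTSC can cost thousands. A minimal sketch, reusing the question's 64-bit rdtsc() wrapper:
unsigned long long best = ~0ULL;
for (int i = 0; i < 1000; i++) {
    unsigned long long t0 = rdtsc();
    unsigned long long t1 = rdtsc();
    if (t1 - t0 < best) best = t1 - t0;   // keep the minimum observed overhead
}
printf("min back-to-back rdtsc delta: %llu cycles\n", best);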
You can read more about it here
http://xenbits.xen.org/docs/4.2-testing/misc/tscmode.txt
Instructions in the rdtsc family are non-privileged, but privileged software may set a cpuid bit to cause all rdtsc family instructions to trap. This trap can be detected by Xen, which can then transparently "emulate" the results of the rdtsc instruction and return control to the code following the rdtsc instruction.
By the way, that article is wrong in that the hypervisor doesn't set a cpuid bit to cause RDTSC to trap; it is bit #2 in Control Register 4 (CR4.TSD):
http://en.wikipedia.org/wiki/Control_register#CR4
What's a speedy method to count the number of zero-valued bytes in a large, contiguous array? (Or conversely, the number of non-zero bytes.) By large, I mean 2^16 bytes or larger. The array's starting address and length may have any byte alignment.
Naive way:
int countZeroBytes(const char* values, int length)
{
    int zeroCount = 0;
    for (int i = 0; i < length; ++i)
        if (!values[i])
            ++zeroCount;
    return zeroCount;
}
For my problem, I usually just maintain zeroCount and update it based on specific changes to values. However, I'd like to have a fast, general method of re-computing zeroCount after an arbitrary bulk change to values occurs. I'm sure there's a bit-twiddly method of accomplishing this more quickly, but alas, I'm but a novice twiddler.
EDIT: A few people have asked about the nature of the data being zero-checked, so I'll describe it. (It'd be nice if solutions were still general, though.)
Basically, envision a world composed of voxels (e.g. Minecraft), with procedurally generated terrain segregated into cubic chunks, effectively pages of memory indexed as three-dimensional arrays. Each voxel is fly-weighted as a unique byte corresponding to a unique material (air, stone, water, etc.). Many chunks contain only air or water, while others contain varying combinations of 2-4 voxel types in large quantities (dirt, sand, etc.), with effectively 2-10% of voxels being random outliers. Voxels existing in large quantities tend to be highly clustered along every axis.
It seems as though a zero-byte-counting method would be useful in a number of unrelated scenarios, though. Hence, the desire for a general solution.
This is a special case of How to count character occurrences using SIMD with c=0, the char (byte) value to count matches for. See that Q&A for a well-optimized manually-vectorized AVX2 implementation of char_count (char const* vector, size_t size, char c); with a much tighter inner loop than this, avoiding reducing each vector of 0/-1 matches to scalar separately.
This will go as O(n), so the best you can do is decrease the constant. One quick fix is to remove the branch. This gives a result as fast as my SSE version below if the zeros are randomly distributed, likely because GCC vectorizes this loop. However, for long runs of zeros, or for a random density of zeros less than 1%, the SSE version below is still faster.
int countZeroBytes_fix(char* values, int length) {
int zeroCount = 0;
for(int i=0; i<length; i++) {
zeroCount += values[i] == 0;
}
return zeroCount;
}
I originally thought that the density of zeros would matter. That turns out not to be the case, at least with SSE. Using SSE is a lot faster independent of the density.
Edit: actually, it does depend on the density; it's just that the density of zeros has to be smaller than I expected. 1/64 zeros (1.5% zeros) means one zero per four SSE registers on average, so branch prediction does not work very well. However, 1/1024 zeros (0.1% zeros) is faster (see the table of times).
SIMD is even faster if the data has long runs of zeros.
You can pack 16 bytes into an SSE register. Then you can compare all 16 bytes at once with zero using _mm_cmpeq_epi8. To handle runs of zeros cheaply you can use _mm_movemask_epi8 on the result: most of the time the mask will be zero (no zeros among the 16 bytes), and an all-zero block gives 0xffff, so both common cases take a fast path. You could get a speedup of up to 16x in this case (for the first half 1 and second half zero I got over a 12x speedup).
Here is a table of times in seconds for 2^16 bytes (with a repeat of 10000).
                     1.5% zeros   50% zeros   0.1% zeros   1st half 1, 2nd half 0
countZeroBytes       0.8s         0.8s        0.8s         0.95s
countZeroBytes_fix   0.16s        0.16s       0.16s        0.16s
countZeroBytes_SSE   0.2s         0.15s       0.10s        0.07s
You can see the results for last 1/2 zeros at http://coliru.stacked-crooked.com/a/67a169ddb03d907a
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h> // SSE2
#include <omp.h>
int countZeroBytes(char* values, int length) {
int zeroCount = 0;
for(int i=0; i<length; i++) {
if (!values[i])
++zeroCount;
}
return zeroCount;
}
int countZeroBytes_SSE(char* values, int length) {
int zeroCount = 0;
__m128i zero16 = _mm_set1_epi8(0);
__m128i and16 = _mm_set1_epi8(1);
for(int i=0; i<length; i+=16) {
__m128i values16 = _mm_loadu_si128((__m128i*)&values[i]);
__m128i cmp = _mm_cmpeq_epi8(values16, zero16);
int mask = _mm_movemask_epi8(cmp);
if(mask) {
if(mask == 0xffff) zeroCount += 16;
else {
cmp = _mm_and_si128(and16, cmp); //change -1 values to 1
//horizontal sum of 16 bytes
__m128i sum1 = _mm_sad_epu8(cmp,zero16);
__m128i sum2 = _mm_shuffle_epi32(sum1,2);
__m128i sum3 = _mm_add_epi16(sum1,sum2);
zeroCount += _mm_cvtsi128_si32(sum3);
}
}
}
return zeroCount;
}
int main() {
const int n = 1<<16;
const int repeat = 10000;
char *values = (char*)_mm_malloc(n, 16);
for(int i=0; i<n; i++) values[i] = rand()%64; //1.5% zeros
//for(int i=0; i<n/2; i++) values[i] = 1;
//for(int i=n/2; i<n; i++) values[i] = 0;
int zeroCount = 0;
double dtime;
dtime = omp_get_wtime();
for(int i=0; i<repeat; i++) zeroCount = countZeroBytes(values,n);
dtime = omp_get_wtime() - dtime;
printf("zeroCount %d, time %f\n", zeroCount, dtime);
dtime = omp_get_wtime();
for(int i=0; i<repeat; i++) zeroCount = countZeroBytes_SSE(values,n);
dtime = omp_get_wtime() - dtime;
printf("zeroCount %d, time %f\n", zeroCount, dtime);
}
I've come up with this OpenMP implementation, which may take advantage of the array being in the local cache of each processor to actually read it in parallel.
nzeros_total = 0;
#pragma omp parallel for reduction(+:nzeros_total)
for (i=0;i<NDATA;i++)
{
if (v[i]==0)
nzeros_total++;
}
A quick benchmark, consisting of running the naive implementation (the same one the OP wrote in the question) and the OpenMP implementation 1000 times each and taking the best time of both methods, with an array of 65536 ints and a 50% probability of a zero-valued element, on Windows 7 with a quad-core CPU, compiled with VStudio 2012 Ultimate, yields these numbers:
DEBUG RELEASE
Naive method: 580 microseconds. 341 microseconds.
OpenMP method: 159 microseconds. 99 microseconds.
NOTE: I've tried #pragma loop(hint_parallel(4)), but apparently this didn't make the naive version perform any better, so my guess is that the compiler was already applying this optimization, or it couldn't be applied at all. Also, #pragma loop(no_vector) didn't make the naive version perform worse.
You can also use the POPCNT instruction, which returns the number of bits set. This allows further simplifying the code and speeding it up by eliminating unnecessary branches. Here is an example with AVX2 and POPCNT:
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include "immintrin.h"
int countZeroes(uint8_t* bytes, int length)
{
const __m256i vZero = _mm256_setzero_si256();
int count = 0;
for (int n = 0; n < length; n += 32)
{
__m256i v = _mm256_load_si256((const __m256i*)&bytes[n]);
v = _mm256_cmpeq_epi8(v, vZero);
int k = _mm256_movemask_epi8(v);
count += _mm_popcnt_u32(k);
}
return count;
}
#define SIZE 1024
int main()
{
uint8_t bytes[SIZE] __attribute__((aligned(32)));
for (int z = 0; z < SIZE; ++z)
bytes[z] = z % 2;
int n = countZeroes(bytes, SIZE);
printf("%d\n", n);
return 0;
}
For situations where zeros are common it would be faster to check 64 bytes at a time, and only check individual bytes if the span is not all zero. If zeros are rare this will be more expensive. This code assumes that the large block's length is divisible by 64. It also assumes that memcmp is as efficient as you can get.
int countZeroBytes(const char* values, int length)
{
    static const char zeros[64] = {};
int zeroCount = 0;
for (int i = 0; i < length; i+=64)
{
if (::memcmp(values+i, zeros, 64) == 0)
{
zeroCount += 64;
}
else
{
for (int j=i; j < i+64; ++j)
{
if (!values[j])
{
++zeroCount;
}
}
}
}
return zeroCount;
}
Brute force to count zero bytes: Use a vector compare instruction which sets each byte of a vector to 1 if that byte was 0, and to 0 if that byte was not zero.
Do this 255 times to process up to 255 x 64 bytes (if you have 512-bit instructions available; 255 x 32 bytes with 256-bit vectors, or 255 x 16 bytes if you only have 128-bit vectors). Then you just add up the 255 result vectors. Since each byte after the compare had a value of 0 or 1, each per-byte sum is at most 255, so you now have one vector of 64 / 32 / 16 byte counters, down from about 16,000 / 8,000 / 4,000 input bytes.
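A minimal SSE2 sketch of this idea (my own illustration, assuming the length is a multiple of 16; on x86 the compare produces 0xFF rather than 1 for a match, so subtracting the compare result adds 1 per matching byte):
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>
size_t countZeroBytes_blocked(const uint8_t* p, size_t n)
{
    const __m128i vzero = _mm_setzero_si128();
    size_t total = 0;
    size_t i = 0;
    while (i < n) {
        // Inner block of at most 255 vectors so the per-byte counters cannot overflow.
        size_t block_end = i + 255 * 16;
        if (block_end > n) block_end = n;
        __m128i acc = _mm_setzero_si128();
        for (; i < block_end; i += 16) {
            __m128i v   = _mm_loadu_si128((const __m128i*)(p + i));
            __m128i cmp = _mm_cmpeq_epi8(v, vzero);   // 0xFF where the byte is zero
            acc = _mm_sub_epi8(acc, cmp);             // subtracting -1 adds 1 per match
        }
        // Horizontal sum of the 16 byte counters into two 64-bit halves, then add them.
        __m128i sums = _mm_sad_epu8(acc, vzero);
        total += (uint64_t)_mm_cvtsi128_si64(sums)
               + (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(sums, 8));
    }
    return total;
}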
It may be faster to avoid the condition and trade it for a look-up and an add:
char isCharZeroLUT[256] = { 1 }; /* 1 0 0 ... */
int zeroCount = 0;
for (int i = 0; i < length; ++i) {
zeroCount += isCharZeroLUT[values[i]];
}
I haven't measured the differences, though. It is also worth noting that some compilers happily vectorize sufficiently simple loops.