Weird program latency behavior on a VM - C++

I wrote a program that reads a 256 KB array; one pass takes about 1 ms. The program is pretty simple and attached below.
However, when I run it in a VM on Xen, I found that the latency is not stable. It has the following pattern (the time unit is ms):
#totalCycle CyclePerLine totalms
22583885 5513 6.452539
3474342 848 0.992669
3208486 783 0.916710
25848572 6310 7.385306
3225768 787 0.921648
3210487 783 0.917282
25974700 6341 7.421343
3244891 792 0.927112
3276027 799 0.936008
25641513 6260 7.326147
3531084 862 1.008881
3233687 789 0.923911
22397733 5468 6.399352
3523403 860 1.006687
3586178 875 1.024622
26094384 6370 7.455538
3540329 864 1.011523
3812086 930 1.089167
25907966 6325 7.402276
I suspect some other process is doing something periodically, like an event-driven task. Has anyone encountered this before, or can anyone point out which process/service could make this happen?
Below is my program. I run it 1000 times; each run produces one line of the result above.
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <ctime>
using namespace std;

#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
#endif

#define CACHE_LINE_SIZE 64
#define WSS 24567                         /* working-set size in KB (~24 MB), unused below */
#define NUM_VARS WSS * 1024 / sizeof(long)
#define KHZ 3500000

// usage: ./a.out repeat_count
// argv[1] is the outer-loop repeat count; the array size is fixed at 256 KB below.
int main(int argc, char** argv)
{
    unsigned long wcet = atol(argv[1]);
    unsigned long mem_size_KB = 256;                   // mem size in KB
    unsigned long mem_size_B = mem_size_KB * 1024;     // mem size in bytes
    unsigned long count = mem_size_B / sizeof(long);
    unsigned long row = mem_size_B / CACHE_LINE_SIZE;
    int col = CACHE_LINE_SIZE / sizeof(long);
    unsigned long long start, finish, dur1;
    unsigned long temp;

    long *buffer = new long[count];

    // init array
    for (unsigned long i = 0; i < count; ++i)
        buffer[i] = i;

    // build a random pointer-chasing permutation, one entry per cache line
    for (unsigned long i = row - 1; i > 0; --i) {
        temp = rand() % i;
        swap(buffer[i * col], buffer[temp * col]);
    }

    // warm the cache again
    temp = buffer[0];
    for (unsigned long i = 0; i < row - 1; ++i) {
        temp = buffer[temp];
    }

    // First read, should be a cache hit
    temp = buffer[0];
    start = rdtsc();
    int sum = 0;
    for (unsigned long wcet_i = 0; wcet_i < wcet; wcet_i++) {
        for (int j = 0; j < 21; j++) {
            for (unsigned long i = 0; i < row - 1; ++i) {
                if (i % 2 == 0) sum += buffer[temp];
                else sum -= buffer[temp];
                temp = buffer[temp];
            }
        }
    }
    finish = rdtsc();
    dur1 = finish - start;

    // Result: total cycles, cycles per cache line, total time in ms
    printf("%llu %llu %.6f\n", dur1, dur1 / row, dur1 * 1.0 / KHZ);

    delete[] buffer;
    return 0;
}

The use of the RDTSC instruction in a virtual machine is complicated. It is likely that the hypervisor (Xen) is emulating the RDTSC instruction by trapping it. Your fastest runs show around 800 cycles per cache line, which is very, very slow; the only plausible explanation is that each RDTSC results in a trap handled by the hypervisor, and that overhead is the performance bottleneck. I'm not sure about the even longer times that you see periodically, but given that RDTSC is being trapped, all timing bets are off.
You can read more about it here:
http://xenbits.xen.org/docs/4.2-testing/misc/tscmode.txt
Instructions in the rdtsc family are non-privileged, but privileged software may set a cpuid bit to cause all rdtsc family instructions to trap. This trap can be detected by Xen, which can then transparently "emulate" the results of the rdtsc instruction and return control to the code following the rdtsc instruction
By the way, that article is wrong on one point: the hypervisor doesn't set a cpuid bit to cause RDTSC to trap; it sets bit 2 of Control Register 4 (CR4.TSD):
http://en.wikipedia.org/wiki/Control_register#CR4
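One quick way to see whether RDTSC is being trapped on your guest is to time back-to-back RDTSC calls. Here is a minimal sketch using the same rdtsc() helper as your program; on bare metal the delta is typically a few tens of cycles, while a trap-and-emulate path usually shows up as thousands:
#include <cstdio>

// Same x86_64 rdtsc() helper as in the question.
static inline unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

int main()
{
    // Print the cost of RDTSC itself, measured a few times.
    for (int i = 0; i < 10; i++) {
        unsigned long long t0 = rdtsc();
        unsigned long long t1 = rdtsc();
        printf("rdtsc-to-rdtsc delta: %llu cycles\n", t1 - t0);
    }
    return 0;
}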

Related

Not able to get expected results from the VC++ version of GCC code

I need to convert VM-detection code written for GCC to VC++. Here is the code:
static inline unsigned long long rdtsc_diff_vmexit() {
    unsigned long long ret, ret2;
    unsigned eax, edx;
    __asm__ volatile("rdtsc" : "=a" (eax), "=d" (edx));
    ret = ((unsigned long long)eax) | (((unsigned long long)edx) << 32);
    /* vm exit forced here. it uses: eax = 0; cpuid; */
    __asm__ volatile("cpuid" : /* no output */ : "a"(0x00));
    __asm__ volatile("rdtsc" : "=a" (eax), "=d" (edx));
    ret2 = ((unsigned long long)eax) | (((unsigned long long)edx) << 32);
    return ret2 - ret;
}
I found an example of doing the same in VC++, as follows:
BOOL rdtsc_diff_vmexit() {
    ULONGLONG tsc1 = 0;
    ULONGLONG tsc2 = 0;
    ULONGLONG avg = 0;
    INT cpuInfo[4] = {};
    // Try this 10 times in case of small fluctuations
    for (INT i = 0; i < 10; i++)
    {
        tsc1 = __rdtsc();
        __cpuid(cpuInfo, 0);
        tsc2 = __rdtsc();
        // Get the delta of the two RDTSC readings
        avg += (tsc2 - tsc1);
    }
    // We repeated the process 10 times to make the check as reliable as we can
    avg = avg / 10;
    return (avg < 1000 && avg > 0) ? FALSE : TRUE;
}
The problem is that the GCC version returns true when run in VirtualBox, while the VC++ version returns false. Can you please let me know what is wrong? I'm wondering if the GCC asm code can be rewritten for VC++.
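For what it's worth, a more direct VC++ translation of the GCC function, using the __rdtsc and __cpuid intrinsics from <intrin.h> and returning the raw delta instead of a thresholded BOOL, would look roughly like this (untested sketch, hypothetical name):
#include <intrin.h>

// Rough VC++ equivalent of the GCC rdtsc_diff_vmexit(): read the TSC, force a
// VM exit with CPUID leaf 0, read the TSC again, and return the raw delta so
// the caller can apply whatever threshold it wants.
static unsigned long long rdtsc_diff_vmexit_raw()
{
    int cpuInfo[4] = {};
    unsigned long long ret = __rdtsc();
    __cpuid(cpuInfo, 0);        // eax = 0; cpuid -- forces a VM exit under a hypervisor
    unsigned long long ret2 = __rdtsc();
    return ret2 - ret;
}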

How does gzip not find two identical parts?

I wrote the following code:
#include "zlib.h"
unsigned char dst[1<<26];
unsigned char src[1<<24];
int main() {
unsigned long dstlen = 1<<26;
srand (12345);
for (int i=0; i<1<<23; i++) src[i] = src[i | 1<<23] = rand();
compress(dst,&dstlen,src,1<<24);
printf ("%d/%d = %f\n", dstlen, 1<<24, dstlen / double(1<<24));
}
which tries to compress two identical 2^23-byte parts concatenated together. However, the result is
16782342/16777216 = 1.000306
How is it that data with such an obvious repetition is not compressed?
The maximum distance for matching strings in zlib's deflate format is 32,768 bytes back. The second copy starts 2^23 bytes after the first, far outside that window, so the compressor never sees the earlier copy, and the random data within each half is itself incompressible.
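To see the window limit in action, here is a minimal variation of the test (same setup, smaller, hypothetical sizes) where the two identical halves are only 16 KB apart and therefore fit inside the 32 KB window; the ratio should drop to roughly 0.5:
#include <cstdio>
#include <cstdlib>
#include "zlib.h"

unsigned char src[1 << 15];   // 32 KB: two identical 16 KB halves
unsigned char dst[1 << 17];   // generous output buffer

int main() {
    unsigned long dstlen = sizeof(dst);
    srand(12345);
    // The second half is a byte-for-byte copy of the first, only 16 KB away.
    for (int i = 0; i < (1 << 14); i++) src[i] = src[i | (1 << 14)] = rand();
    compress(dst, &dstlen, src, sizeof(src));
    printf("%lu/%d = %f\n", dstlen, 1 << 15, dstlen / double(1 << 15));
}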

Arduino low I2C read speed

I'm currently working on a project using the Genuino 101 where I need to read large amounts of data through I2C to fill an arbitrarily sized buffer. From the timing capture I can see that the read requests themselves only take about 3 milliseconds and the write requests about 200 nanoseconds.
However, there is a very large gap (750+ ms) between read transactions in the same block:
#define RD_BUF_SIZE 32
#define BLCK_SIZE 32   // bytes per requestFrom() chunk (this define is assumed; it is not shown in the snippet)

void i2cRead(unsigned char device, unsigned char memory, int len, unsigned char *rdBuf)
{
    ushort bytesRead = 0;
    ushort _memstart = memory;
    while (bytesRead < len)
    {
        Wire.beginTransmission((int)device);
        Wire.write(_memstart);
        Wire.endTransmission();
        Wire.requestFrom((int)device, BLCK_SIZE);
        int i = 0;
        while (Wire.available())
        {
            rdBuf[bytesRead + i] = Wire.read();
            i++;
        }
        bytesRead += BLCK_SIZE;
        _memstart += BLCK_SIZE;
    }
}
From my understanding this shouldn't take that long, unless adding to _memstart and bytesRead is taking extremely long. By my, arguably limited, understanding of time complexity, this function is O(n) and should, in the best case, only take about 12 ms for a 128-byte query.
Am I missing something?
Those 700 ms are not caused by the execution time of the few instructions in your function; those should be done in microseconds. You may have a buffer overflow, the other device might be delaying transfers, or there's another bug not related to a buffer overflow.
This is about how I'd do it:
void i2cRead(unsigned char device, unsigned char memory, int len, unsigned char *rdBuf, int bufLen)
{
    ushort _memstart = memory;
    if (bufLen < len) {
        len = bufLen;
    }
    while (len > 0)
    {
        Wire.beginTransmission((int)device);
        Wire.write(_memstart);
        Wire.endTransmission();
        int reqSize = 32;
        if (len < reqSize) {
            reqSize = len;
        }
        Wire.requestFrom((int)device, reqSize);
        while (Wire.available() && (len != 0))
        {
            *(rdBuf++) = Wire.read();
            _memstart++;
            len--;
        }
    }
}
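A usage sketch, with a hypothetical device address, start register, and buffer size:
unsigned char buf[128];
// Read 128 bytes starting at register 0x00 of a device at I2C address 0x50.
i2cRead(0x50, 0x00, sizeof(buf), buf, sizeof(buf));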

Busy loop slows down latency-critical computation

My code does the following:
Do some long-running intense computation (called useless below)
Do a small latency-critical task
I find that the time it takes to execute the latency-critical task is higher with the long-running computation than without it.
Here is some stand-alone C++ code to reproduce this effect:
#include <stdio.h>
#include <stdint.h>

#define LEN 128
#define USELESS 1000000000
//#define USELESS 0

// Read timestamp counter
static inline long long get_cycles()
{
    unsigned low, high;
    unsigned long long val;
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    val = high;
    val = (val << 32) | low;
    return val;
}

// Compute a simple hash
static inline uint32_t hash(uint32_t *arr, int n)
{
    uint32_t ret = 0;
    for (int i = 0; i < n; i++) {
        ret = (ret + (324723947 + arr[i])) ^ 93485734985;
    }
    return ret;
}

int main()
{
    uint32_t sum = 0;     // For adding dependencies
    uint32_t arr[LEN];    // We'll compute the hash of this array

    for (int iter = 0; iter < 3; iter++) {
        // Create a new array to hash for this iteration
        for (int i = 0; i < LEN; i++) {
            arr[i] = (iter + i);
        }

        // Do intense computation
        for (int useless = 0; useless < USELESS; useless++) {
            sum += (sum + useless) * (sum + useless);
        }

        // Do the latency-critical task
        long long start_cycles = get_cycles() + (sum & 1);
        sum += hash(arr, LEN);
        long long end_cycles = get_cycles() + (sum & 1);

        printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
    }
}
When compiled with -O3 with USELESS set to 1 billion, the three iterations took 588, 4184, and 536 cycles, respectively. When compiled with USELESS set to 0, the iterations took 394, 358, and 362 cycles, respectively.
Why could this (particularly the 4184 cycles) be happening? I suspected cache misses or branch mispredictions induced by the intense computation. However, without the intense computation, the zeroth iteration of the latency-critical task is already pretty fast, so I don't think a cold cache or branch predictor is the cause.
Moving my speculative comment to an answer:
It is possible that while your busy loop is running, other tasks on the server are pushing the cached arr data out of the L1 cache, so that the first memory access in hash needs to reload from a lower level cache. Without the compute loop this wouldn't happen. You could try moving the arr initialization to after the computation loop, just to see what the effect is.
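For reference, here is a sketch of the reordered version I'm suggesting; the only change from your program is that arr is filled in right before the timed section (same get_cycles and hash helpers):
#include <stdio.h>
#include <stdint.h>

#define LEN 128
#define USELESS 1000000000

// Same helpers as in the question.
static inline long long get_cycles()
{
    unsigned low, high;
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    return ((unsigned long long)high << 32) | low;
}

static inline uint32_t hash(uint32_t *arr, int n)
{
    uint32_t ret = 0;
    for (int i = 0; i < n; i++)
        ret = (ret + (324723947 + arr[i])) ^ 93485734985;
    return ret;
}

int main()
{
    uint32_t sum = 0;
    uint32_t arr[LEN];

    for (int iter = 0; iter < 3; iter++) {
        // Intense computation first...
        for (int useless = 0; useless < USELESS; useless++) {
            sum += (sum + useless) * (sum + useless);
        }

        // ...then initialize arr right before the timed section, so it is
        // hot in L1 when hash() reads it.
        for (int i = 0; i < LEN; i++) {
            arr[i] = (iter + i);
        }

        long long start_cycles = get_cycles() + (sum & 1);
        sum += hash(arr, LEN);
        long long end_cycles = get_cycles() + (sum & 1);

        printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
    }
    return 0;
}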

Why is the CUDA version 2x slower than the CPU code?

It is a little baffling to me why the CUDA code runs about twice as slow as the CPU version. The CPU code is commented out above main. I am just counting all the primes from 0 to 512 * 512 * 512. The CPU version executed in about 97 seconds, whereas the GPU version took 182 seconds. I have an Intel i7 running at 4 GHz and an NVIDIA GTX 960. Any ideas why?
#include <cuda.h>
#include <iostream>
#include <cstdint>
#include <stdio.h>
#include <ctime>
#include <vector>
#include <cstdlib>
#include <climits>
using namespace std;

__host__ __device__ bool is_prime(uint32_t n)
{
    if (n == 2)
        return true;
    if (n % 2 == 0)
        return false;
    uint32_t sr = sqrtf(n);
    for (uint32_t i = 3; i <= sr; i += 2)
        if (n % i == 0)
            return false;
    return true;
}

__global__ void prime_sum(unsigned int* count)
{
    uint32_t n = (blockIdx.y * gridDim.y + blockIdx.x) * blockDim.x + threadIdx.x;
    if (is_prime(n))
        atomicAdd(count, 1);
}

int main()
{
    /* CPU VERSION
    time_t start = time(0);
    int pcount = 0;
    for (uint32_t i = 0; i < (512 * 512 * 512); i++)
    {
        if (is_prime(i)) pcount++;
    }
    start = time(0) - start;
    std::cout << pcount << "\t" << start << std::endl;
    */

    // CUDA VERSION
    time_t start = time(0);
    unsigned int* sum_d;
    cudaMalloc(&sum_d, sizeof(unsigned int));
    cudaMemset(sum_d, 0, sizeof(unsigned int));
    prime_sum<<< dim3(512, 512), 512 >>>(sum_d);
    unsigned int sum = 0;
    cudaMemcpy(&sum, sum_d, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    start = time(0) - start;
    std::cout << sum << "\t" << start << std::endl;
    cudaFree(sum_d);
    return 0;
}
Here is one idea. The efficiency of the is_prime function comes from being able to exit quickly most of the time, because most numbers are divisible by 2 or another small number, so when executed serially the loop usually exits fast. However, due to warps, each group of 32 threads must wait for the worst one to finish. Also, I am including even numbers, so half the threads are eliminated by the first if.
First, GPUs generally have good floating-point compute throughput but weaker integer throughput, and modulo (and division) operations are very slow.
Second, global atomic operations were slow before the Kepler architecture, but you have a GTX 960, so I don't think that's the problem.
Third, in the CPU version, each integer can exit the loop as soon as it is found not to be prime. On the GPU, however, an integer must wait until all 32 threads in its warp exit. In your code, the even threads exit right after they enter the kernel, but they must wait until the odd threads finish their loop.
BTW, why do you use <<< dim3(512, 512), 512 >>>? I think a 1D launch, <<< 512 * 512, 512 >>>, is quite enough.
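For that last point, a minimal sketch of the 1D variant (reusing the is_prime and sum_d from the question; the launch covers the same 512 * 512 * 512 candidates):
// 1D variant of the kernel: one thread per candidate number, no blockIdx.y needed.
__global__ void prime_sum_1d(unsigned int* count)
{
    uint32_t n = blockIdx.x * blockDim.x + threadIdx.x;
    if (is_prime(n))
        atomicAdd(count, 1);
}

// Launched with the same total thread count as the original:
// prime_sum_1d<<< 512 * 512, 512 >>>(sum_d);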