Why is the response time of ONNX Runtime increasing? - c++

I'm using the ONNX Runtime library to run inference on my deep neural network model in C++. The inference time on the CPU is about 10 ms. When I use the GPU (an NVIDIA GTX 1050 Ti), the inference time is about 4 ms for roughly the first minute after the first run, but after that it suddenly increases to over 25 ms. What is the problem?
I am using CUDA 11.8, and the following options are enabled when setting up ONNX Runtime:
sessionOptions.SetIntraOpNumThreads(1);
sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
OrtCUDAProviderOptions cuda_options;
cuda_options.device_id = 0;
sessionOptions.AppendExecutionProvider_CUDA(cuda_options);
And I am measuring the inference time like this:
auto start = std::chrono::high_resolution_clock::now();
my_session->Run(Ort::RunOptions{ nullptr }, inputNames.data(),inputTensors.data(), 1, outputNames.data(),
outputTensors.data(), 2);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Time taken by function: "
<< duration.count() << " microseconds" << endl;
And the result:
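To separate a sustained slowdown from one-off spikes, one option is to time many consecutive Run() calls and print a windowed average instead of a single measurement. The sketch below is only illustrative: it reuses the variable names from the snippets above (my_session, inputNames, inputTensors, outputNames, outputTensors) and passes them in explicitly.
#include <onnxruntime_cxx_api.h>
#include <chrono>
#include <iostream>
#include <vector>
// Illustrative helper, not part of the original program: runs the session
// repeatedly and prints one averaged latency per window of runs.
void benchmark_runs(Ort::Session& session,
                    std::vector<const char*>& inputNames,
                    std::vector<Ort::Value>& inputTensors,
                    std::vector<const char*>& outputNames,
                    std::vector<Ort::Value>& outputTensors,
                    int total_runs = 1000, int window = 50)
{
    double window_sum_ms = 0.0;
    for (int i = 0; i < total_runs; ++i)
    {
        auto start = std::chrono::steady_clock::now();
        session.Run(Ort::RunOptions{ nullptr },
                    inputNames.data(), inputTensors.data(), inputTensors.size(),
                    outputNames.data(), outputTensors.data(), outputTensors.size());
        auto stop = std::chrono::steady_clock::now();
        window_sum_ms += std::chrono::duration<double, std::milli>(stop - start).count();
        // One averaged line per window instead of one line per run.
        if ((i + 1) % window == 0)
        {
            std::cout << "runs " << (i + 1 - window) << "-" << i
                      << ": avg " << (window_sum_ms / window) << " ms\n";
            window_sum_ms = 0.0;
        }
    }
}
If the windowed averages show a clean step from roughly 4 ms to 25 ms after about a minute, the measurement itself is probably fine and the change is happening on the GPU side.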

Related

std::chrono gives abnormal values when I use cpulimit to limit the program

My code looks like this:
while(1)
{
std::cout << "local time " << std::chrono<<std::chrono::duration_cast<std::chrono::milliseconds>((std::chrono::steady_clock::now()).time_since_epoch()).count() << "\n";
clock_t t1, t2;
t1 = clock();
std::chrono::steady_clock::time_point t3 = std::chrono::steady_clock::now(), t4;
MyProcessFunc();
t4 = std::chrono::steady_clock::now();
t2 = clock();
std::cout << "process chrono time " << std::chrono::duration_cast<std::chrono::milliseconds>(t4 - t3).count() << "ms\n";
std::cout << "process clock time " << 1000.0*(t2 - t1)/CLOCKS_PER_SEC << "ms\n";
}
While this program is running, I use "taskset" and "cpulimit" to restrict it to a single CPU core and about 10% of that core. Then I found that chrono gives weird values:
local time 352398168
process chrono time 28ms
process clock time 26.829ms
local time 352398196
process chrono time 808ms
process clock time 26.934ms
local time 352399004
process chrono time 28ms
process clock time 27.168ms
local time 352399032
process chrono time 28ms
process clock time 27ms
local time 352399061
process chrono time 27ms
process clock time 26.931ms
local time 352399089
process chrono time 809ms
process clock time 30.479ms
local time 352399898
process chrono time 33ms
process clock time 32.135ms
I can feel the program stutter while it runs, so chrono's result may be correct. But something outside my program must be blocking it, because the bulk of the elapsed time is not spent in my own code.
Does anyone know why?
clock() returns the CPU time actually consumed by your program (see here), while std::chrono::steady_clock measures wall-clock time and keeps counting even when your program is not scheduled.
This explains your results: cpulimit keeps suspending the process, so the wall-clock (chrono) time occasionally balloons while the CPU (clock) time stays roughly constant.
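A minimal illustration of that difference (my own sketch, not taken from the question): a sleep consumes wall-clock time but almost no CPU time, which is the same effect cpulimit produces by repeatedly suspending the process.
#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>
int main()
{
    std::clock_t c1 = std::clock();
    auto t1 = std::chrono::steady_clock::now();
    // Stand-in for the process being descheduled (cpulimit does this with SIGSTOP/SIGCONT).
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    std::clock_t c2 = std::clock();
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "clock():      " << 1000.0 * (c2 - c1) / CLOCKS_PER_SEC << " ms of CPU time\n";
    std::cout << "steady_clock: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms of wall-clock time\n";
}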

batch inference is as slow as single image inference in tensorflow c++

OS: Ubuntu 16.04
Version: TensorFlow C++ 2.0-beta1 (compiled with all optimization flags: AVX AVX2 SSE4.1 SSE4.2 FMA XLA)
IDE: Eclipse
With CUDA: No (CPU-only prediction)
I have measured single-image inference with the TensorFlow C++ API at about 0.02 seconds, which is slower than I can believe, since I compiled the TensorFlow C++ shared library with all the optimizations (AVX/AVX2/SSE4.1/SSE4.2/FMA). I need to reduce the prediction time. Someone told me the time can drop significantly if I use batch inference instead of single-image inference. Unfortunately, batch inference takes 0.7 seconds with a batch size of 32. In other words, 0.7/32 ≈ 0.02 s per image, so it is just as slow as single-image inference.
To improve the TensorFlow C++ API inference performance (i.e. decrease the prediction time), I tried MKL-DNN in TensorFlow, but it did not reduce the time at all.
Initially I compiled the TensorFlow C++ shared library without the optimization flags (AVX/AVX2/SSE4.1/SSE4.2/FMA/XLA), so six optimization warnings appeared every time I ran a prediction. I expected the time to drop after recompiling TensorFlow with those six flags, but the inference time with the flags is nearly the same as without them. I am confused and am now looking for other ways to reduce the inference time, such as replacing single-image prediction with the batch inference I am testing here.
The code below is single image inference.
Session* session;
Status status = NewSession(SessionOptions(), &session);
const std::string graph_fn = "/media/root/Ubuntu311/projects/Ecology_projects/JPMVCNN_AlgaeAnalysisMathTestDemo/model-0723/model.meta";
MetaGraphDef graphdef;
Status status_load = ReadBinaryProto(Env::Default(), graph_fn, &graphdef); // read the graph definition from the .meta file
if (!status_load.ok()) {
std::cout << "ERROR: Loading model failed..." << graph_fn << std::endl;
std::cout << status_load.ToString() << "\n";
return -1;
}
Status status_create = session->Create(graphdef.graph_def()); // import the graph into the session
if (!status_create.ok()) {
std::cout << "ERROR: Creating graph in session failed..." << status_create.ToString() << std::endl;
return -1;
}
// cout << "Session successfully created.Load model successfully!"<< endl;
// load the weights of the pre-trained model
const std::string checkpointPath = "/media/root/Ubuntu311/projects/Ecology_projects/JPMVCNN_AlgaeAnalysisMathTestDemo/model-0723/model";
Tensor checkpointPathTensor(DT_STRING, TensorShape());
checkpointPathTensor.scalar<std::string>()() = checkpointPath;
status = session->Run(
{{ graphdef.saver_def().filename_tensor_name(), checkpointPathTensor },},
{},{graphdef.saver_def().restore_op_name()},nullptr);
if (!status.ok())
{
throw runtime_error("Error loading checkpoint from " + checkpointPath + ": " + status.ToString());
}
// cout << "Load weights successfully!"<< endl;
//read image for prediction...
char srcfile[200];
double alltime=0.0;
for(int numingroup=0;numingroup<1326;numingroup++)
{
sprintf(srcfile, "/media/root/Ubuntu311/projects/Ecology_projects/copy/cnn-imgs96224/%d.JPG",numingroup);
cv::Mat srcimg=cv::imread(srcfile,0);
if(!srcimg.data)
{
continue;
}
Tensor resized_tensor(DT_FLOAT, TensorShape({1,96,224,1}));
float *imgdata = resized_tensor.flat<float>().data();
cv::Mat cameraImg(96, 224, CV_32FC1, imgdata);
srcimg.convertTo(cameraImg, CV_32FC1);
// preprocess the image
cameraImg=cameraImg/255;
// std::cout <<"Read image successfully: "<< resized_tensor.DebugString()<<endl;
vector<std::pair<string, Tensor> > inputs;
std::string Input1Name = "input";
inputs.push_back(std::make_pair(Input1Name, resized_tensor));
Tensor is_training_val(DT_BOOL,TensorShape());
is_training_val.scalar<bool>()()=false;
std::string Input2Name = "is_training";
inputs.push_back(std::make_pair(Input2Name, is_training_val));
vector<tensorflow::Tensor> outputs;
string output="output";
cv::TickMeter timer;
timer.start();
Status status_run = session->Run(inputs, {output}, {}, &outputs);
if (!status_run.ok()) {
std::cout << "ERROR: RUN failed..." << std::endl;
std::cout << status_run.ToString() << "\n";
return -1;
}
timer.stop();
cout<<"single image inference time is: "<<timer.getTimeSec()<<" s."<<endl;
alltime+=(timer.getTimeSec());
timer.reset();
Tensor t = outputs[0];
int ndim2 = t.shape().dims();
auto tmap = t.tensor<float, 2>(); // Tensor Shape: [batch_size, target_class_num]
int output_dim = t.shape().dim_size(1);
std::vector<double> tout;
// Argmax: Get Final Prediction Label and Probability
int output_class_id = -1;
double output_prob = 0.0;
for (int j = 0; j < output_dim; j++)
{
std::cout << "Class " << j << " prob:" << tmap(0, j) << "," << std::endl;
if (tmap(0, j) >= output_prob) {
output_class_id = j;
output_prob = tmap(0, j);
}
}
// std::cout << "Final class id: " << output_class_id << std::endl;
// std::cout << "Final class prob: " << output_prob << std::endl;
}
cout<<"all image have been predicted and time is: "<<alltime<<endl;
And the information below is the output.
root@rootwd-Default-string:/media/root/Ubuntu311/projects/Ecology_projects/tensorflowtest/Debug# ./tensorflowtest
2019-08-12 17:44:40.362149: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3407969999 Hz
2019-08-12 17:44:40.362455: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2af1b90 executing computations on platform Host. Devices:
2019-08-12 17:44:40.362469: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
single image inference time is: 0.941759 s.
single image inference time is: 0.0218276 s.
single image inference time is: 0.0230476 s.
single image inference time is: 0.0221443 s.
single image inference time is: 0.0222238 s.
single image inference time is: 0.021393 s.
single image inference time is: 0.0223495 s.
single image inference time is: 0.0227179 s.
single image inference time is: 0.021407 s.
single image inference time is: 0.02372 s.
single image inference time is: 0.0220384 s.
single image inference time is: 0.0225262 s.
single image inference time is: 0.0217821 s.
single image inference time is: 0.0230875 s.
single image inference time is: 0.0228805 s.
single image inference time is: 0.0217929 s.
single image inference time is: 0.0220751 s.
single image inference time is: 0.0281811 s.
single image inference time is: 0.0257438 s.
single image inference time is: 0.0259228 s.
single image inference time is: 0.0264548 s.
single image inference time is: 0.0242932 s.
single image inference time is: 0.025251 s.
single image inference time is: 0.0258176 s.
single image inference time is: 0.025607 s.
single image inference time is: 0.0265529 s.
single image inference time is: 0.0252388 s.
single image inference time is: 0.0229052 s.
single image inference time is: 0.0234532 s.
single image inference time is: 0.0219921 s.
single image inference time is: 0.0222037 s.
single image inference time is: 0.0228582 s.
single image inference time is: 0.0231251 s.
single image inference time is: 0.0211131 s.
single image inference time is: 0.0234812 s.
single image inference time is: 0.0227733 s.
single image inference time is: 0.02183 s.
single image inference time is: 0.0215002 s.
single image inference time is: 0.0222 s.
single image inference time is: 0.022995 s.
single image inference time is: 0.0217708 s.
single image inference time is: 0.0226695 s.
single image inference time is: 0.0234447 s.
...
single image inference time is: 0.0226969 s.
single image inference time is: 0.0216993 s.
single image inference time is: 0.0220073 s.
single image inference time is: 0.0224785 s.
single image inference time is: 0.0219879 s.
single image inference time is: 0.0233075 s.
single image inference time is: 0.0229301 s.
single image inference time is: 0.0215029 s.
single image inference time is: 0.0230741 s.
single image inference time is: 0.0224437 s.
single image inference time is: 0.0220314 s.
single image inference time is: 0.0212338 s.
single image inference time is: 0.0226974 s.
all image have been predicted and time is: 31.1918
root@rootwd-Default-string:/media/root/Ubuntu311/projects/Ecology_projects/tensorflowtest/Debug#
The code below is batch inference.
Session* session;
Status status = NewSession(SessionOptions(), &session);
const std::string graph_fn = "/media/root/Ubuntu311/projects/Ecology_projects/JPMVCNN_AlgaeAnalysisMathTestDemo/model-0723/model.meta";
MetaGraphDef graphdef;
Status status_load = ReadBinaryProto(Env::Default(), graph_fn, &graphdef); // read the graph definition from the .meta file
if (!status_load.ok()) {
std::cout << "ERROR: Loading model failed..." << graph_fn << std::endl;
std::cout << status_load.ToString() << "\n";
return -1;
}
Status status_create = session->Create(graphdef.graph_def()); // import the graph into the session
if (!status_create.ok()) {
std::cout << "ERROR: Creating graph in session failed..." << status_create.ToString() << std::endl;
return -1;
}
// cout << "Session successfully created.Load model successfully!"<< endl;
// load the weights of the pre-trained model
const std::string checkpointPath = "/media/root/Ubuntu311/projects/Ecology_projects/JPMVCNN_AlgaeAnalysisMathTestDemo/model-0723/model";
Tensor checkpointPathTensor(DT_STRING, TensorShape());
checkpointPathTensor.scalar<std::string>()() = checkpointPath;
status = session->Run(
{{ graphdef.saver_def().filename_tensor_name(), checkpointPathTensor },},
{},{graphdef.saver_def().restore_op_name()},nullptr);
if (!status.ok())
{
throw runtime_error("Error loading checkpoint from " + checkpointPath + ": " + status.ToString());
}
// cout << "Load weights successfully!"<< endl;
int cnnrows=96;
int cnncols=224;
//read image for prediction...
char srcfile[200];
const int imgnum=1326;
const int batch=32;
double alltime=0.0;
//all image inference...
for(int imgind=0;imgind<imgnum/batch;imgind++)
{
//a batch inference...
tensorflow::Tensor input_tensor(tensorflow::DT_FLOAT, tensorflow::TensorShape({ batch, cnnrows, cnncols, 1 }));
auto input_tensor_mapped = input_tensor.tensor<float, 4>();
int batchind=0;
int imgrealind=imgind*batch;
for(;batchind!=batch;batchind++)
{
sprintf(srcfile, "/media/root/Ubuntu311/projects/Ecology_projects/copy/cnn-imgs96224/%d.JPG",imgrealind);
cv::Mat srcimg=cv::imread(srcfile,0);
if(!srcimg.data)
{
continue;
}
cv::Mat cameraImg(96, 224, CV_32FC1);
srcimg.convertTo(cameraImg, CV_32FC1);
cameraImg=cameraImg/255;
//convert batch cv image to tensor
for (int y = 0; y < cnnrows; ++y)
{
const float* source_row = (float*)cameraImg.data + (y * cnncols);
for (int x = 0; x < cnncols; ++x)
{
const float* source_pixel = source_row + x;
input_tensor_mapped(batchind, y, x, 0) = *source_pixel;
}
}
imgrealind++;
//a batch image transfer done...
}
vector<std::pair<string, Tensor> > inputs;
std::string Input1Name = "input";
inputs.push_back(std::make_pair(Input1Name, input_tensor));
Tensor is_training_val(DT_BOOL,TensorShape());
is_training_val.scalar<bool>()()=false;
std::string Input2Name = "is_training";
inputs.push_back(std::make_pair(Input2Name, is_training_val));
vector<tensorflow::Tensor> outputs;
string output="output";
cv::TickMeter timer;
timer.start();
Status status_run = session->Run(inputs, {output}, {}, &outputs);
if (!status_run.ok()) {
std::cout << "ERROR: RUN failed..." << std::endl;
std::cout << status_run.ToString() << "\n";
return -1;
}
timer.stop();
cout<<"time of this batch inference is: "<<timer.getTimeSec()<<" s."<<endl;
alltime+=(timer.getTimeSec());
timer.reset();
auto finalOutputTensor = outputs[0].tensor<float, 2>();
int output_dim = outputs[0].shape().dim_size(1);
for(int b=0; b<batch;b++)
{
for(int i=0; i<output_dim; i++)
{
// cout << b << "the output for class "<<i<<" is "<< finalOutputTensor(b, i) <<endl;
}
}
//all images inference done...
}
cout<<"all image have been predicted and time is: "<<alltime<<endl;
And the information below is its output:
root@rootwd-Default-string:/media/root/Ubuntu311/projects/Ecology_projects/tensorflowtest/Debug# ./tensorflowtest
2019-08-12 17:47:26.517909: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3407969999 Hz
2019-08-12 17:47:26.518092: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1481b90 executing computations on platform Host. Devices:
2019-08-12 17:47:26.518106: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
time of this batch inference is: 1.73786 s.
time of this batch inference is: 0.735492 s.
time of this batch inference is: 0.735382 s.
time of this batch inference is: 0.714616 s.
time of this batch inference is: 0.753576 s.
time of this batch inference is: 0.734335 s.
time of this batch inference is: 0.738822 s.
time of this batch inference is: 0.727782 s.
time of this batch inference is: 0.726601 s.
time of this batch inference is: 0.724234 s.
time of this batch inference is: 0.737588 s.
time of this batch inference is: 0.743579 s.
time of this batch inference is: 0.737886 s.
time of this batch inference is: 0.729694 s.
time of this batch inference is: 0.72652 s.
time of this batch inference is: 0.724418 s.
time of this batch inference is: 0.728979 s.
time of this batch inference is: 0.720166 s.
time of this batch inference is: 0.727582 s.
time of this batch inference is: 0.732912 s.
time of this batch inference is: 0.734843 s.
time of this batch inference is: 0.732175 s.
time of this batch inference is: 0.724297 s.
time of this batch inference is: 0.724738 s.
time of this batch inference is: 0.736695 s.
time of this batch inference is: 0.736627 s.
time of this batch inference is: 0.726824 s.
time of this batch inference is: 0.731248 s.
time of this batch inference is: 0.72861 s.
time of this batch inference is: 0.752497 s.
time of this batch inference is: 0.737133 s.
time of this batch inference is: 0.742782 s.
time of this batch inference is: 0.730087 s.
time of this batch inference is: 0.732464 s.
time of this batch inference is: 0.737972 s.
time of this batch inference is: 0.738182 s.
time of this batch inference is: 0.738349 s.
time of this batch inference is: 0.72544 s.
time of this batch inference is: 0.741428 s.
time of this batch inference is: 0.733115 s.
time of this batch inference is: 0.743221 s.
all image have been predicted and time is: 31.0668
root@rootwd-Default-string:/media/root/Ubuntu311/projects/Ecology_projects/tensorflowtest/Debug#
Any help will be much appreciated.
For CPU inference, larger batches usually don't help: the CPU executes the per-image work largely serially, so the total computation per image stays the same.
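If the goal is simply to use more of the CPU for each Run() call, one thing worth checking (a sketch, not verified against this particular model or build) is the thread configuration passed to the session. TensorFlow's SessionOptions carries a ConfigProto whose intra-op and inter-op thread counts control how much parallelism a single Run() can use:
#include "tensorflow/core/public/session.h"
// Sketch: create a session that is allowed to use several cores per Run().
// The defaults usually already use all cores, so this mainly matters if the
// build or environment has pinned the thread pools down to one thread.
tensorflow::Session* MakeSession(int num_threads)
{
    tensorflow::SessionOptions options;
    options.config.set_intra_op_parallelism_threads(num_threads); // threads inside one op
    options.config.set_inter_op_parallelism_threads(num_threads); // independent ops run concurrently
    tensorflow::Session* session = nullptr;
    tensorflow::Status status = tensorflow::NewSession(options, &session);
    if (!status.ok()) {
        return nullptr; // caller must check for failure
    }
    return session;
}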

GLUT does not report time with millisecond accuracy

I've got a timer issue in GLUT.
glutGet(GLUT_ELAPSED_TIME) only returns time with one-second granularity (1000, 2000, 3000, ...),
and
glutTimerFunc(...) only fires when its millis parameter is set to more than 1000.
I don't know exactly how GLUT measures time,
but I think something is wrong with my system time settings.
How can I get time with millisecond accuracy in OpenGL?
As already mentioned in the comments above, you could use more reliable C++ date and time utilities like the std::chrono library. Here is a simple example:
#include <iostream>
#include <chrono>
int main()
{
const auto start = std::chrono::high_resolution_clock::now();
// do something...
const auto end = std::chrono::high_resolution_clock::now();
std::cout << "Took " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms\n";
return 0;
}
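If what you actually need is a per-frame delta inside GLUT rather than a one-off measurement, the same clock can be kept across invocations of a timer callback. The sketch below is my own (the callback name is arbitrary); it re-arms itself every 16 ms and prints the measured delta in milliseconds:
#include <GL/glut.h>
#include <chrono>
#include <iostream>
// Measures the real elapsed time between invocations with
// std::chrono::steady_clock instead of glutGet(GLUT_ELAPSED_TIME).
void onTimer(int /*value*/)
{
    static auto last = std::chrono::steady_clock::now();
    auto now = std::chrono::steady_clock::now();
    auto delta_ms = std::chrono::duration_cast<std::chrono::milliseconds>(now - last).count();
    last = now;
    std::cout << "frame delta: " << delta_ms << " ms\n";
    glutPostRedisplay();           // trigger a redraw
    glutTimerFunc(16, onTimer, 0); // re-arm the timer (~60 Hz)
}
// Registered once after glutInit/glutCreateWindow:
//   glutTimerFunc(16, onTimer, 0);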

Boost.Compute slower than plain CPU?

I just started playing with Boost.Compute. To see how much speed it can bring, I wrote a simple program:
#include <iostream>
#include <vector>
#include <algorithm>
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>
namespace compute = boost::compute;
int main()
{
// generate random data on the host
std::vector<float> host_vector(16000);
std::generate(host_vector.begin(), host_vector.end(), rand);
BOOST_FOREACH (auto const& platform, compute::system::platforms())
{
std::cout << "====================" << platform.name() << "====================\n";
BOOST_FOREACH (auto const& device, platform.devices())
{
std::cout << "device: " << device.name() << std::endl;
compute::context context(device);
compute::command_queue queue(context, device);
compute::vector<float> device_vector(host_vector.size(), context);
// copy data from the host to the device
compute::copy(
host_vector.begin(), host_vector.end(), device_vector.begin(), queue
);
auto start = boost::chrono::high_resolution_clock::now();
compute::transform(device_vector.begin(),
device_vector.end(),
device_vector.begin(),
compute::sqrt<float>(), queue);
auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
std::cout << "ans: " << ans << std::endl;
std::cout << "time: " << duration.count() << " ms" << std::endl;
std::cout << "-------------------\n";
}
}
std::cout << "====================plain====================\n";
auto start = boost::chrono::high_resolution_clock::now();
std::transform(host_vector.begin(),
host_vector.end(),
host_vector.begin(),
[](float v){ return std::sqrt(v); });
auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
std::cout << "ans: " << ans << std::endl;
std::cout << "time: " << duration.count() << " ms" << std::endl;
return 0;
}
And here's the sample output on my machine (win7 64-bit):
====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms
My question is: why is the plain (non-opencl) version faster?
As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you're being limited by kernel compilation time and transfer time to the GPU).
To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first one as that will be far greater because it includes the time to compile and store the kernels).
Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm which performs both the transform and reduction with a single kernel. The code would look like this:
float ans = 0;
compute::transform_reduce(
device_vector.begin(),
device_vector.end(),
&ans,
compute::sqrt<float>(),
compute::plus<float>(),
queue
);
std::cout << "ans: " << ans << std::endl;
You can also compile code using Boost.Compute with the -DBOOST_COMPUTE_USE_OFFLINE_CACHE which will enable the offline kernel cache (this requires linking with boost_filesystem). Then the kernels you use will be stored in your file system and only be compiled the very first time you run your application (NVIDIA on Linux already does this by default).
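As a concrete version of the "run it several times and discard the first iteration" advice, here is a sketch that assumes the device_vector and queue already set up in the question's loop:
// Sketch: amortize kernel compilation by timing several iterations and
// reporting the first run (which includes compilation) separately.
const int iterations = 10;
boost::chrono::milliseconds total(0);
for (int i = 0; i < iterations; ++i)
{
    auto start = boost::chrono::high_resolution_clock::now();
    float ans = 0;
    compute::transform_reduce(device_vector.begin(),
                              device_vector.end(),
                              &ans,
                              compute::sqrt<float>(),
                              compute::plus<float>(),
                              queue);
    auto elapsed = boost::chrono::duration_cast<boost::chrono::milliseconds>(
        boost::chrono::high_resolution_clock::now() - start);
    if (i == 0)
        std::cout << "first run (includes kernel compilation): " << elapsed.count() << " ms\n";
    else
        total += elapsed;
}
std::cout << "steady-state average: " << total.count() / double(iterations - 1) << " ms\n";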
I can see one possible reason for the big difference. Compare the CPU and the GPU data flow:-
CPU                GPU
                   copy data to GPU
                   set up compute code
calculate sqrt     calculate sqrt
sum                sum
                   copy data from GPU
Given this, it appears that the Intel chip is just a bit rubbish at general compute, while the NVIDIA card is probably suffering from the extra data copying and the setup cost of getting the GPU to do the calculation.
You should try the same program but with a much more complex operation - sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.
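One way to make the per-element work heavier without changing the rest of the benchmark is a custom device function. This is only a sketch, assuming the device_vector and queue from the question; the 1000-iteration loop is an arbitrary amount of extra work, and the function body is OpenCL C:
// Deliberately expensive per-element function, defined with Boost.Compute's
// custom-function macro; used in place of compute::sqrt<float>().
BOOST_COMPUTE_FUNCTION(float, heavy_op, (float x),
{
    float acc = x;
    for (int i = 0; i < 1000; i++) {
        acc = sqrt(acc) + sin(acc);
    }
    return acc;
});
compute::transform(device_vector.begin(),
                   device_vector.end(),
                   device_vector.begin(),
                   heavy_op,
                   queue);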
In your example, moving the lambda into the accumulate would be faster (one pass over memory vs. two passes).
You're getting bad results because you're measuring time incorrectly.
An OpenCL device has its own time counters, which are unrelated to host counters. Every OpenCL task has four states whose timestamps can be queried (from the Khronos web site):
CL_PROFILING_COMMAND_QUEUED, when the command identified by event is enqueued in a command-queue by the host
CL_PROFILING_COMMAND_SUBMIT, when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
CL_PROFILING_COMMAND_START, when the command identified by event starts execution on the device.
CL_PROFILING_COMMAND_END, when the command identified by event has finished execution on the device.
Keep in mind that these timers are device-side. So, to measure kernel and command-queue performance, you can query them; in your case, the last two timestamps are the ones you need.
In your sample code, you're measuring host time, which includes the data transfer time (as Skizz said) plus all the time spent on command-queue maintenance.
So, to learn the actual kernel performance, you either need to get a cl_event for your kernel launch (I'm not sure how to do that in boost::compute) and query it for the profiling counters, or make your kernel so large and complicated that the overheads become negligible.
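For completeness, here is a sketch of what querying those device-side timers could look like when the kernel is driven by hand through Boost.Compute's lower-level wrappers. The enable_profiling queue flag and the CL_PROFILING_COMMAND_START/END queries are standard OpenCL; treat the exact Boost.Compute calls as an assumption rather than something verified against this Boost version.
#include <boost/compute/core.hpp>
#include <boost/compute/algorithm/fill.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/utility/source.hpp>
#include <iostream>
namespace compute = boost::compute;
int main()
{
    compute::device device = compute::system::default_device();
    compute::context context(device);
    // Profiling must be enabled on the queue for CL_PROFILING_* queries.
    compute::command_queue queue(context, device, compute::command_queue::enable_profiling);
    const size_t n = 16000;
    compute::vector<float> device_vector(n, context);
    compute::fill(device_vector.begin(), device_vector.end(), 2.0f, queue);
    // Trivial kernel standing in for the real workload.
    const char source[] = BOOST_COMPUTE_STRINGIZE_SOURCE(
        __kernel void sqrt_inplace(__global float* data)
        {
            const uint i = get_global_id(0);
            data[i] = sqrt(data[i]);
        }
    );
    compute::program program = compute::program::build_with_source(source, context);
    compute::kernel kernel(program, "sqrt_inplace");
    kernel.set_arg(0, device_vector.get_buffer());
    // Enqueue by hand so we get the event back for profiling.
    compute::event e = queue.enqueue_1d_range_kernel(kernel, 0, n, 0);
    e.wait();
    // Device-side timestamps, in nanoseconds.
    cl_ulong t_start = e.get_profiling_info<cl_ulong>(CL_PROFILING_COMMAND_START);
    cl_ulong t_end   = e.get_profiling_info<cl_ulong>(CL_PROFILING_COMMAND_END);
    std::cout << "kernel time on device: " << (t_end - t_start) / 1e6 << " ms\n";
}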

boost chrono endTime - startTime returns negative in boost 1.51

The Boost.Chrono library (version 1.51) on my MacBook Pro returns negative times when I subtract endTime - startTime. If you print the time points you can see that the end time is earlier than the start time. How can this happen?
typedef boost::chrono::steady_clock clock_t;
clock_t clock;
// Start time measurement
boost::chrono::time_point<clock_t> startTime = clock.now();
short test_times = 7;
// Spend some time...
for ( int i=0; i<test_times; ++i )
{
xnodeptr spResultDoc=parser.parse(inputSrc);
xstring sXmlResult = spResultDoc->str();
const char16_t* szDbg = sXmlResult.c_str();
BOOST_CHECK(spResultDoc->getNodeType()==xnode::DOCUMENT_NODE && sXmlResult == sXml);
}
// Stop time measurement
boost::chrono::time_point<clock_t> endTime = clock.now();
clock_t::duration elapsed( endTime - startTime);
std::cout << std::endl;
std::cout << "Now time: " << clock.now() << std::endl;
std::cout << "Start time: " << startTime << std::endl;
std::cout << "End time: " << endTime << std::endl;
std::cout << std::endl << "Total Parse time: " << elapsed << std::endl;
std::cout << "Avarage Parse time per iteration: " << (boost::chrono::duration_cast<boost::chrono::milliseconds>(elapsed) / test_times) << std::endl;
I tried different clocks but no difference.
Any help would be appreciated!
EDIT: Forgot to add the output:
Now time: 1 nanosecond since boot
Start time: 140734799802912 nanoseconds since boot
End time: 140734799802480 nanoseconds since boot
Total Parse time: -432 nanoseconds
Avarage Parse time per iteration: 0 milliseconds
Hyperthreading or just scheduling interference; the Boost implementation punts monotonic support to the OS:
POSIX: clock_gettime(CLOCK_MONOTONIC), although it may still fail due to kernel errors in handling hyper-threading when calibrating the system.
WIN32: QueryPerformanceCounter(), which on anything older than the Nehalem architecture is not going to be monotonic across cores and threads.
OSX: mach_absolute_time(), i.e. the steady and high-resolution clocks are the same. The source code shows that it uses RDTSC, hence a strict dependency on hardware stability: i.e. no guarantees.
Disabling hyperthreading is a recommended way to go, but on Windows, say, you are really limited. Aside from dropping timer resolution, the only available method is direct access to the underlying hardware timers while ensuring thread affinity.
It looks like a good time to submit a bug to Boost, I would recommend:
Win32: Use GetTickCount64(), as discussed here.
OSX: Use clock_get_time (SYSTEM_CLOCK) according to this question.
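On Windows, the "hardware timers plus thread affinity" route mentioned above boils down to something like this sketch (my own example, not from Boost): pinning the measuring thread to one core before using QueryPerformanceCounter avoids comparing counters from different cores on pre-Nehalem machines.
#include <windows.h>
#include <iostream>
int main()
{
    // Pin the measuring thread to core 0 so both counter reads come from the same core.
    SetThreadAffinityMask(GetCurrentThread(), 1);
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    Sleep(100); // stand-in for the work being measured
    QueryPerformanceCounter(&end);
    double elapsed_ms = 1000.0 * double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
    std::cout << "elapsed: " << elapsed_ms << " ms\n";
    return 0;
}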