How to measure the latency of a low-latency C++ application

I need to measure the message decoding latency (3 to 5 us) of a low-latency application.
I used the following method:
1. Get time T1
2. Decode data
3. Get time T2
4. L1 = T2 - T1
5. Store L1 in an array (size = 100000)
6. Repeat the same steps 100000 times.
7. Print the array.
8. Get the 99th and 95th percentiles for the data set.
But I see fluctuation between test runs. Can someone explain the reason for this?
Could you suggest an alternative method?
Note: The application is a tight loop (uses 100% CPU) and is bound to a CPU via the taskset command.

There are a number of different ways that performance metrics can be gathered either using code profilers or by using existing system calls.
NC State University has a good resource on the different types of timers and profilers that are available as well as the appropriate case for using each and some examples on their HPC website here.
Fluctuations will inevitably occur on most modern systems. Certain BIOS settings related to hyper-threading and frequency scaling can have a significant impact on the performance of some applications, as can power-consumption and cooling/environmental settings.
Looking at the distribution of results as a histogram and/or fitting them to a Gaussian will also help determine how normal the distribution is and whether the fluctuations are ordinary statistical noise or serious outliers. Running additional tests would also be beneficial.
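To make the histogram and outlier suggestion concrete, here is a minimal sketch of an offline analysis step. It assumes the latency samples have already been dumped by the application as microsecond values; the function names, the 1 us bucket width and the sample data are all hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import math
import statistics

def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted list (pct in 0..100)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_samples)))
    return sorted_samples[rank - 1]

def summarize(samples_us, bucket_us=1.0):
    """Return percentiles and a coarse histogram for latencies in microseconds."""
    ordered = sorted(samples_us)
    histogram = {}
    for value in ordered:
        # Bucket key is the lower edge of the bucket the sample falls into.
        bucket = int(value // bucket_us) * bucket_us
        histogram[bucket] = histogram.get(bucket, 0) + 1
    return {
        "median": statistics.median(ordered),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
        "histogram": histogram,
    }

# Hypothetical data: mostly 3-5 us with one slow outlier.
samples = [3.2, 3.5, 4.1, 3.9, 4.8, 3.3, 4.4, 3.7, 41.0, 3.6]
report = summarize(samples)
print("median", report["median"], "p95", report["p95"], "p99", report["p99"])
for bucket in sorted(report["histogram"]):
    print(f"{bucket:6.1f} us: {'#' * report['histogram'][bucket]}")
```

With so few samples a single outlier dominates the high percentiles, which is one reason to compare the whole histogram between runs rather than only the 95th and 99th percentile values.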

Related

SpannerIO Batch Read Speed Affected by Slow Partition Reader

We have been using SpannerIO.readAll to scan a large amount of data in Google Dataflow. The ReadOperations passed to Spanner are created withQuery(query) and withBatching(true). I noticed that although the throughput is initially OK, it drops to a very low level towards the end, probably due to outlier partitions with a larger amount of work. Looking at the BatchSpannerRead code, one DoFn takes care of all the batch scan work for a partition. In a perfect world the generated partitions would handle these outliers, but in practice, would it make sense to re-split the work of those slow workers?

Training Tensorflow Inception-v3 Imagenet on modest hardware setup

I've been training Inception V3 on a modest machine with a single GPU (GeForce GTX 980 Ti, 6GB). The maximum batch size appears to be around 40.
I've used the default learning rate settings specified in the inception_train.py file: initial_learning_rate = 0.1, num_epochs_per_decay = 30 and learning_rate_decay_factor = 0.16. After a couple of weeks of training the best accuracy I was able to achieve is as follows (About 500K-1M iterations):
2016-06-06 12:07:52.245005: precision @ 1 = 0.5767 recall @ 5 = 0.8143 [50016 examples]
2016-06-09 22:35:10.118852: precision @ 1 = 0.5957 recall @ 5 = 0.8294 [50016 examples]
2016-06-14 15:30:59.532629: precision @ 1 = 0.6112 recall @ 5 = 0.8396 [50016 examples]
2016-06-20 13:57:14.025797: precision @ 1 = 0.6136 recall @ 5 = 0.8423 [50016 examples]
I've tried fiddling with the settings towards the end of the training session, but couldn't see any improvements in accuracy.
I've started a new training session from scratch with num_epochs_per_decay = 10 and learning_rate_decay_factor = 0.001 based on some other posts in this forum, but it's sort of grasping in the dark here.
Any recommendations on good defaults for a small hardware setup like mine?
TL;DR: There is no known method for training an Inception V3 model from scratch in a tolerable amount of time on a modest hardware setup. I would strongly suggest retraining a pre-trained model on your desired task.
On a small hardware setup like yours, it will be difficult to achieve maximum performance. Generally speaking, CNNs perform best with the largest batch size possible. This means that for CNNs the training procedure is often limited by the maximum batch size that can fit in GPU memory.
The Inception V3 model available for download here was trained with an effective batch size of 1600 across 50 GPUs, where each GPU ran a batch size of 32.
Given your modest hardware, my number one suggestion would be to download the pre-trained model from the link above and retrain it for the individual task you have at hand. This would make your life much happier.
As a thought experiment (but hardly practical): if you feel especially compelled to exactly match the training performance of the pre-trained model by training from scratch, you could run the following insane procedure on your 1 GPU:
Run with a batch size of 32
Store the gradients from the run
Repeat this 50 times.
Average the gradients from the 50 batches.
Update all variables with the gradients.
Repeat
I am only mentioning this to give you a conceptual sense of what would need to be accomplished to achieve the exact same performance. Given the speed numbers you mentioned, this procedure would take months to run. Hardly practical.
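The procedure above is essentially gradient accumulation. The following is a minimal sketch on a toy one-weight model, not Inception or TensorFlow code; the model, the synthetic data and every function name are assumptions made purely for illustration. It checks that averaging the gradients of 50 batches of 32 gives the same gradient as one batch of 1600.

```python
import random

def batch_gradient(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, data, micro_batch_size, num_micro_batches, learning_rate):
    """One weight update built from several small batches instead of one large one."""
    gradients = []
    for i in range(num_micro_batches):
        start = i * micro_batch_size
        micro_batch = data[start:start + micro_batch_size]
        gradients.append(batch_gradient(w, micro_batch))  # store, do not apply yet
    averaged = sum(gradients) / len(gradients)             # average the stored gradients
    return w - learning_rate * averaged, averaged          # single update of the weight

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(32 * 50)]
data = [(x, 3.0 * x) for x in xs]  # synthetic data whose true weight is 3.0

w = 0.0
new_w, accumulated = accumulated_step(
    w, data, micro_batch_size=32, num_micro_batches=50, learning_rate=0.1)
full = batch_gradient(w, data)     # gradient of one batch of 1600 examples
print(abs(accumulated - full) < 1e-9, new_w)
```

The equivalence holds here because every small batch has the same size and the loss is a mean over examples. The cost is that each update needs 50 forward and backward passes, which is why the procedure is so slow on one GPU.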
More realistically, if you are still strongly interested in training from scratch and doing the best you can, here are some general guidelines:
Always run with the largest batch size possible. It looks like you are already doing that. Great.
Make sure that you are not CPU bound. That is, make sure that the input processing queues are always modestly full, as displayed on TensorBoard. If not, increase the number of preprocessing threads or use a different CPU if available.
Re: learning rate. If you are always running synchronous training (which must be the case if you only have 1 GPU), then the higher the batch size, the higher the tolerable learning rate. I would try a series of several quick runs (e.g. a few hours each) to identify the highest learning rate possible which does not lead to NaNs. After you find such a learning rate, knock it down by, say, 5-10% and run with that.
As for num_epochs_per_decay and decay_rate, there are several strategies. The strategy highlighted by 10 epochs per decay, 0.001 decay factor is to hammer the model for as long as possible until the eval accuracy asymptotes, and then lower the learning rate. This is a simple strategy, which is nice. I would verify in your model monitoring that the eval accuracy does indeed asymptote before you allow the model to decay the learning rate. Finally, the decay factor is a bit ad hoc, but lowering by, say, a power of 10 seems to be a good rule of thumb.
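To see what the two settings discussed in this thread do to the learning rate, here is a small sketch. It assumes a staircase-style schedule, where the rate is multiplied by the decay factor once every num_epochs_per_decay epochs; the function name is hypothetical and this is not code from inception_train.py.

```python
def decayed_learning_rate(initial_rate, decay_factor, epochs_per_decay, epoch):
    """Staircase decay: apply decay_factor once per completed decay period."""
    num_decays = epoch // epochs_per_decay
    return initial_rate * decay_factor ** num_decays

# Default settings from the question: 0.1 initial rate, factor 0.16 every 30 epochs.
for epoch in (0, 29, 30, 60):
    print("default", epoch, decayed_learning_rate(0.1, 0.16, 30, epoch))

# Settings from the second run: factor 0.001 every 10 epochs.
for epoch in (0, 9, 10, 20):
    print("second run", epoch, decayed_learning_rate(0.1, 0.001, 10, epoch))
```

Under this assumption the second configuration cuts the rate a thousandfold after only 10 epochs, so it only makes sense if the eval accuracy has really flattened by then.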
Note again that these are general guidelines and others might even offer differing advice. The reason why we can not give you more specific guidance is that CNNs of this size are just not often trained from scratch on a modest hardware setup.
Excellent tips.
There is precedent for training with a setup similar to yours.
Check this out - http://3dvision.princeton.edu/pvt/GoogLeNet/
These people trained GoogLeNet, although they used Caffe. Still, studying their experience would be useful.

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with the Gradient Boosted Trees learning algorithm from the ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting GBT ensemble in a human-readable format. My experience so far is that, for my problem size, adding more resources to the cluster has no effect on the length of the run. A 10-iteration training run takes roughly 13 hours. This isn't acceptable since I'm looking to do 100-300 iteration runs, and the execution time seems to explode with the number of iterations.
My Spark application
This isn't the exact code, but it can be reduced to:
SparkConf sc = new SparkConf().setAppName("GBT Trainer")
// unlimited max result size for intermediate Map-Reduce ops.
// Having no limit is probably bad, but I've not had time to find
// a tighter upper bound and the default value wasn't sufficient.
.set("spark.driver.maxResultSize", "0");
JavaSparkContext jsc = new JavaSparkContext(sc);
// The input file is encoded in plain-text LIBSVM format ~59GB in size
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(10);
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(1);
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
// Somewhat-convoluted code below reads in Parquet-formatted output
// of the GBT model and writes it back out as JSON.
// There might be cleaner ways of achieving the same, but since output
// size is only a few KB I feel little guilt leaving it as is.
// serialize and output the GBT classifier model the only way that the library allows
String outputPath = "s3://somebucket/somekeyprefex";
model.save(jsc.sc(), outputPath + "/parquet");
// read in the parquet-formatted classifier output as a generic DataFrame object
SQLContext sqlContext = new SQLContext(jsc);
DataFrame outputDataFrame = sqlContext.read().parquet(outputPath + "/parquet");
// output DataFrame-formatted classifier model as json
outputDataFrame.write().format("json").save(outputPath + "/json");
Question
What is the performance bottleneck with my Spark application (or with GBT learning algorithm itself) on input of that size and how can I achieve greater execution parallelism?
I'm still a novice Spark dev, and I'd appreciate any tips on cluster configuration and execution profiling.
More details on the cluster setup
I'm running this app on an AWS EMR cluster (emr-4.0.0, YARN cluster mode) of r3.8xlarge instances (32 cores, 244GB RAM each). I'm using such large instances in order to maximize flexibility of resource allocation. So far I've tried using 1-3 r3.8xlarge instances with a variety of resource allocation schemes between the driver and workers. For example, for a cluster of 1 r3.8xlarge instance I submit the app as follows:
aws emr add-steps --cluster-id $1 --steps Name=$2,\
Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/usr/lib/spark/bin/spark-submit,--verbose,\
--deploy-mode,cluster,--master,yarn,\
--driver-memory,60G,\
--executor-memory,30G,\
--executor-cores,5,\
--num-executors,6,\
--class,GbtTrainer,\
"s3://somebucket/somekey/spark.jar"],\
ActionOnFailure=CONTINUE
For a cluster of 3 r3.8xlarge instances I tweak resource allocation:
--driver-memory,80G,\
--executor-memory,35G,\
--executor-cores,5,\
--num-executors,18,\
I don't have a clear idea of how much memory is useful to give to every executor, but I feel that I'm being generous in either case. Looking through the Spark UI, I'm not seeing tasks with an input size of more than a few GB. I'm erring on the side of caution in giving the driver process so much memory, in order to ensure that it isn't memory starved during any intermediate result-aggregation operations.
I'm trying to keep the number of cores per executor down to 5, as per suggestions in Cloudera's How To Tune Your Spark Jobs series (according to them, more than 5 cores tends to introduce an HDFS IO bottleneck). I'm also making sure that there is enough spare RAM and CPU left over for the host OS and Hadoop services.
My findings thus far
My only clue is Spark UI showing very long Scheduling Delay for a number of tasks at the tail-end of execution. I also get the feeling that the stages/tasks timeline shown by Spark UI does not account for all of the time that the job takes to finish. I suspect that the driver application is stuck performing some kind of a lengthy operation either at the end of every training iteration, or at the end of the entire training run.
I've already done a fair bit of research on tuning Spark applications. Most articles will give great suggestions on using RDD operations which reduce intermediate input size or avoid shuffling of data between stages. In my case I'm basically using an "out-of-the-box" algorithm, which is written by ML experts and should already be well tuned in this regard. My own code that outputs GBT model to S3 should take a trivial amount of time to run.
I haven't used MLlib's GBT implementation, but I have used both LightGBM and XGBoost successfully. I'd highly suggest taking a look at these other libraries.
In general, GBM implementations need to train models iteratively, as they consider the loss of the entire ensemble when building the next tree. This makes GBM training inherently sequential and not easily parallelizable (unlike random forests, which are trivially parallelizable). I'd expect it to perform better with fewer tasks, but that might not be your whole issue. Since you have so many features (500K), you're going to have very high overhead when calculating the histograms and split points during training. You should reduce the number of features you have, especially since there are far more features than samples, which will cause the model to overfit.
As for tuning your cluster:
You want to minimize data movement, so use fewer executors with more memory: 1 executor per EC2 instance, with the number of cores set to whatever the instance provides.
Your data is small enough to fit into ~2 EC2 instances of that size. Assuming you are using doubles (8 bytes), it comes to 8 * 500000 * 50000 = 200 GB. Try loading it all into RAM by using .cache() on your dataframe. If you perform an operation over all the rows (like sum), you should force it to load and you can measure how long the IO takes. Once it's in RAM and cached, any other operations over it will be faster.
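The size estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes a fully dense matrix of 8-byte doubles, and the usable_fraction value (the share of each instance's 244 GB left for cached data) is a made-up assumption, not a measured figure.

```python
import math

def dense_size_bytes(num_samples, num_features, bytes_per_value=8):
    """Size of a dense samples x features matrix of fixed-width values."""
    return num_samples * num_features * bytes_per_value

size_gb = dense_size_bytes(50_000, 500_000) / 1e9   # decimal gigabytes

instance_ram_gb = 244      # r3.8xlarge, from the question
usable_fraction = 0.5      # hypothetical: RAM actually available for cached data
instances = math.ceil(size_gb / (instance_ram_gb * usable_fraction))

print(size_gb, "GB dense, roughly", instances, "instances")
```

Since the input is in LIBSVM format, which stores only non-zero entries, the real footprint may be well below this dense upper bound.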
With a dataset of that size, you may well be better off loading the full dataset into memory and using XGBoost directly rather than the Spark implementation.
If you want to stick with Spark to give greater scalability, I'd recommend taking a closer look at your partitioning strategy. If your data isn't effectively partitioned, adding machines won't improve your runtime, as you describe above, and the subset of overloaded workers will remain your bottleneck. Ensure you have an effective partition key, and repartition your RDD before you begin your training stage.

High CPU usage by Django App

I've created a pretty simple Django app which produces a surprisingly high CPU load: rendering a simple generic view with a list of simple models (20 of them) and 5-6 SQL queries per page results in an Apache process that loads the CPU by 30%-50%. While memory usage is pretty OK (30MB), the CPU load does not seem OK to me, and it is not caused by Apache/WSGI settings: the same CPU load occurs when I run the app via runserver.
Since I'm new to Django, I wanted to ask:
1) Are these 30-50% figures usual for a Django app? (Django 1.4, Ubuntu 12.04, Python 2.7.3)
2) How do I profile CPU load? I used a profile middleware from here: http://djangosnippets.org/snippets/186/ but it shows only ms numbers, not CPU load numbers, and there was nothing special. So how do I identify what eats up so much CPU power?
CPU usage by itself doesn't tell you how efficient your app is. A more important performance metric is how many requests/second your app can process. The kind of processor your machine has naturally also has a huge effect on the results.
I suggest you run ab with multiple concurrent requests and compare the requests/second number to some benchmarks (there should be many around the net). ab will try to test the maximum throughput, so it's natural that one of the resources will be fully utilized (the bottleneck); usually this is disk IO. As an example, if you happen to get CPU usage close to 100%, it may mean you are wasting CPU somewhere (reqs/second is low) or that you have optimized disk IO well (reqs/second is high).
Looking at the %CPU column is not very accurate. I certainly see spikes of 50%-100% CPU all of the time. It does not indicate how long a CPU is being used, just that we hit that value at that specific moment. These would fall into min/max figures, not your average CPU usage.
Another important piece: say you have 4 cores, as I do. The 30-50% figure in top is then out of a maximum of 400%. 50% in top means 50% of one core, or 12.5% of all four.
You can press 1 in top to see individual core cpu figures.
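The per-core arithmetic can be written down as a tiny helper. This is just an illustrative sketch; the function name is hypothetical, and it assumes top is reporting per-process CPU as a percentage of a single core, as described above.

```python
import os

def share_of_machine(top_percent, num_cores):
    """Convert a per-process %CPU figure (100% = one core) to a share of all cores."""
    return top_percent / num_cores

# The example from the answer: 50% in top on a 4-core machine.
print(share_of_machine(50, 4))

# The same figure on whatever machine this runs on.
cores = os.cpu_count() or 1
print(cores, "cores:", share_of_machine(50, cores))
```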

Anyone benchmarked virtual machine performance for build servers?

We have been trying to use virtual machines for build servers. Our build servers all run WinXP32 and we host them on VMware Server 2.0 running on Ubuntu 9.10. We build a mix of C, C++ and Python packages, and run various other deployment tasks (installers, 7z files, archives, etc). Managing VMware-hosted build servers is great. We can move them around, share system resources on one large 8-core box, remotely access the systems through a web interface, and generally manage things better.
But the problem is that the performance compared to a physical machine seems to range from bad to horrid depending upon what day it is. It has proven very frustrating. Sometimes the system load for the host goes above 20 and sometimes it is below 1. It doesn't seem to be based on how much work is actually being done on the systems. I suspect there is a bottleneck in the system, but I can't figure out what it is. (The most recent suspect is I/O, but we have a dedicated 1TB 7200RPM SATA 2 drive with 32MB of cache doing nothing but the virtual machines. That seems like enough for 1-2 machines. All other specs seem to be enough too: 8GB RAM, 2GB per VM; 8 cores, 1 per VM.)
So after exhausting everything I can think of, I wanted to turn to the Stack Overflow community.
Has anyone run, or seen anyone else run, benchmarks of software build performance within a VM?
What should we expect relative to a physical system?
How much performance are we giving up?
What hardware / vm server configurations are people using?
Any help would be greatly appreciated.
Disk IO is definitely a problem here. You just can't do any significant amount of disk IO when you're backing it with a single spindle. The 32MB cache on a single SATA drive is going to be saturated just by your host and a couple of guest OSes ticking over. If you look at the disk queue length counter in your Ubuntu host OS, you should see that it is high (anything above 1 on this system with 2 drives for any length of time means something is waiting for that disk).
When I'm sizing infrastructure for VMs I generally take a ballpark of 30-50 IOPS per VM as an average, and that's for systems that do not exercise the disk subsystem very much. For systems that don't require a lot of IO activity you can drop down a bit, but the IO patterns for build systems will be heavily biased towards lots of very random, fairly small reads. To compound the issue, you want a lot of those VMs building concurrently, which will drive contention for the disk through the roof. Overall disk bandwidth is probably not a big concern (that SATA drive can probably push 70-100 MB/sec when the IO pattern is totally sequential), but when the files are small and scattered you are IO bound by the limits of the spindle, which will be about 70-100 IOs per second on a 7.2k SATA drive. A host OS running a Type 2 hypervisor like VMware Server with a single guest will probably hit that under a light load.
My recommendation would be to build a RAID 10 array with smaller and ideally faster drives. 10k SAS drives will give you 100-150 IOPS each, so a pack of 4 can handle 600 read IOPS and 300 write IOPS before topping out. Also make sure you align all of the data partitions for the drive hosting the VMDKs, and within the guest OSes, if you are putting the VM files on a RAID array. For workloads like these that will give you a 20-30% disk performance improvement. Avoid RAID 5 for something like this: space is cheap, and the write penalty on RAID 5 means you need 4 drives in a RAID 5 pack to equal the write performance of a single drive.
One other point I'd add is that VMware Server is not a great hypervisor in terms of performance. If at all possible, move to a Type 1 hypervisor (like ESXi v4, which is also free). It's not trivial to set up and you lose the host OS completely, so that might be an issue, but you'll see far better IO performance across the board, particularly for disk and network traffic.
Edited to respond to your comment.
1) To see whether you actually have a problem on your existing Ubuntu host.
I see you've tried dstat. I don't think it gives you enough detail to understand what's happening, but I'm not familiar with using it so I might be wrong. iostat will give you a good picture of what is going on; this article on using iostat will help you get a better picture of the actual IO pattern hitting the disk - http://bhavin.directi.com/iostat-and-disk-utilization-monitoring-nirvana/ . avgqu-sz is the raw indicator of how many requests are queued (avgrq-sz is the average request size). High numbers are generally bad, but what is actually bad varies with the disk type and RAID geometry. What you are ultimately interested in is whether your disk IOs are spending more (or increasing) time in the queue than in actually being serviced. The calculation (await-svctm)/await*100 really tells you whether your disk is struggling to keep up: above 50%, your IOs are spending as long queued as being serviced by the disk(s); if it approaches 100%, the disk is getting totally slammed. If you do find that the host is not actually stressed and VMware Server is just lousy (which it could well be, I've never used it on a Linux platform), then you might want to try one of the alternatives like VirtualBox before you jump onto ESXi.
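The queue-time calculation above is easy to script once you have iostat's average wait and service-time figures. The sketch below only does the arithmetic on values you pass in; the function name and the sample numbers are hypothetical, and it does not parse iostat output.

```python
def queue_time_share(await_ms, service_ms):
    """Percentage of the average IO wait that is spent queued rather than serviced."""
    if await_ms <= 0:
        return 0.0
    return (await_ms - service_ms) / await_ms * 100.0

# Hypothetical readings in milliseconds: (average wait, average service time).
readings = {"idle disk": (5.0, 5.0), "busy disk": (10.0, 5.0), "slammed disk": (40.0, 4.0)}
for label, (await_ms, service_ms) in readings.items():
    print(f"{label}: {queue_time_share(await_ms, service_ms):.0f}% of wait spent in queue")
```

A reading of 50% means queue time equals service time, which matches the threshold described above.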
2) To figure out what you need.
Baseline the IO requirements of a typical build on a system that has good or acceptable performance. On Windows, look at the IOPS counters (Disk Reads/sec and Disk Writes/sec) and make sure the average queue length is <1. You need to know the peak values for both while the system is loaded. Instantaneous peaks could be very high if everything is coming from disk cache, so watch for sustained peak values over the course of a minute or so. Once you have those numbers you can scope out a disk subsystem that will deliver what you need. The reason you need to look at the IO numbers is that they reflect the actual switching that the drive heads have to go through to complete your reads and writes (the IOs per second, IOPS), and unless you are doing large file streaming or full disk backups they will most accurately reflect the limits your disk will hit when under load.
Modern disks can sustain approximately the following:
7.2k SATA drives - 70-100 IOPS
10k SAS drives - 120-150 IOPS
15k SAS drives - 150-200 IOPS
Note these are approximate numbers for typical drives and represent the saturated capability of the drives under maximum load with unfavourable IO patterns. This is designing for worst case, which is what you should do unless you really know what you are doing.
RAID packs allow you to parallelize your IO workload, and with a decent RAID controller an N-drive RAID pack will give you N*(base IOPS for 1 disk) for read IO. For write IO there is a penalty caused by the RAID policy. RAID 0 has no penalty; writes are as fast as reads. RAID 5 requires 2 reads and 2 writes per IO (read parity, read existing block, write new parity, write new block), so it has a penalty of 4. RAID 10 has a penalty of 2 (2 writes per IO). RAID 6 has a penalty of 6 (3 reads and 3 writes per IO, because there are two parity blocks). To figure out how many IOPS you need from a RAID array, you take the basic read IOPS number your OS needs and add to that the product of the write IOPS number the OS needs and the relevant penalty factor.
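As a rough illustration of the read and write rules just described, the sketch below estimates what a pack of identical drives can deliver. The function and table names are hypothetical, it covers only RAID 0, 10 and 5, and it ignores controller cache and other real-world effects.

```python
# Write penalty per RAID level, as given in the answer.
WRITE_PENALTY = {"RAID 0": 1, "RAID 10": 2, "RAID 5": 4}

def array_capability(num_drives, iops_per_drive, raid_level):
    """Return (read IOPS, write IOPS) an N-drive pack can sustain."""
    read_iops = num_drives * iops_per_drive
    write_iops = read_iops // WRITE_PENALTY[raid_level]
    return read_iops, write_iops

# Four 10k SAS drives at 150 IOPS each, as in the earlier recommendation.
for level in ("RAID 0", "RAID 10", "RAID 5"):
    reads, writes = array_capability(4, 150, level)
    print(f"{level}: {reads} read IOPS, {writes} write IOPS")
```

For the RAID 5 case the write figure comes out equal to a single drive's 150 IOPS, which is the point made earlier about needing 4 drives to match one drive's write performance.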
3) Now work out the structure of the RAID array that will meet your performance needs
If your analysis of a physical baseline system tells you that you only need 4-5 IOPS, then your single drive might be OK. I'd be amazed if it does, but don't take my word for it: get your data and make an informed decision.
Anyway let's assume you measured 30 read IOPS and 20 write IOPS during your baseline exercise and you want to be able to support 8 instances of these build systems as VM's. To deliver this your disk subsystem will need to be able to support 240 read IOPS and 160 write IOPS to the OS. Adjust your own calculations to suit the number of systems you really need.
If you choose RAID 10, your disks need to be able to deliver 560 IOPS in total (240 for read, and 320 for write in order to account for the RAID 10 write penalty factor of 2). I strongly encourage RAID 10: it sacrifices capacity for performance, but when you design for enough performance you can size the disks to get the capacity you need, and the result will usually be cheaper than RAID 5 unless your IO pattern involves very few writes.
This would require:
- 4 15k SAS drives
- 6 10k SAS drives (round up, RAID 10 requires an even no of drives)
- 8 7.2k SATA drives
If you were to choose RAID 5 you would have to adjust for the increased write penalty and will therefore need 880 IOPS to deliver the performance you want.
That would require:
- 6 15k SAS drives
- 8 10k SAS drives
- 14 7.2k SATA drives
You'll have a lot more space this way but it will cost almost twice as much because you need so many more drives and you'll need a fairly big box to fit those into. This is why I strongly recommend RAID 10 if performance is any concern at all.
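The sizing arithmetic in this worked example can be sketched as follows. The per-drive figures are the low end of the ranges listed earlier, chosen here as a conservative assumption, and the function names are hypothetical. Drive counts are computed for RAID 10 only, where the count is rounded up to an even number.

```python
WRITE_PENALTY = {"RAID 10": 2, "RAID 5": 4}
# Low end of the per-drive IOPS ranges quoted above (a conservative assumption).
DRIVE_IOPS = {"15k SAS": 150, "10k SAS": 120, "7.2k SATA": 70}

def required_backend_iops(read_iops, write_iops, raid_level):
    """Read IOPS plus write IOPS multiplied by the RAID write penalty."""
    return read_iops + write_iops * WRITE_PENALTY[raid_level]

def drives_needed_raid10(backend_iops, iops_per_drive):
    """Ceiling division, then round up to an even number of drives."""
    count = -(-backend_iops // iops_per_drive)
    return count + (count % 2)

vms = 8
read_iops, write_iops = 30 * vms, 20 * vms   # 240 read, 160 write
raid10_iops = required_backend_iops(read_iops, write_iops, "RAID 10")
raid5_iops = required_backend_iops(read_iops, write_iops, "RAID 5")
print("RAID 10 needs", raid10_iops, "IOPS; RAID 5 needs", raid5_iops, "IOPS")

for name, iops in DRIVE_IOPS.items():
    print(name, "drives for RAID 10:", drives_needed_raid10(raid10_iops, iops))
```

Swapping in your own measured read and write IOPS and VM count is the intended use; the drive figures should come from the data sheet of the disks you are actually considering.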
Another option is to find a good SSD (like the Intel X-25E, not the X-25M or anything cheaper) that has enough storage to meet your needs. Buy two and set them up for RAID 1. SSDs are pretty good, but their failure rates (even for drives like the X-25E) are currently worse than rotating disks, so unless you are prepared to deal with a dead system you want RAID 1 at a minimum. Combined with a good high-end controller, something like the X-25E will easily sustain 6k IOPS in the real world, which is the equivalent of 30 15k SAS drives. SSDs are quite expensive per GB of capacity, but if they are used appropriately they can deliver much more cost-effective solutions for tasks that are IO intensive.