Disclaimer: I know almost nothing about CNNs, and I have no idea where else I could ask this.
My research is focused on high-performance computer vision applications. We generate codes representing an image in less than 20 ms, for images whose largest dimension is 500 px.
This is done by combining SURF descriptors and VLAD codes, obtaining a vector representation of the image that is then used in our object recognition application.
Can CNNs be faster? According to this benchmark (which uses much smaller images), the time needed is longer, almost double ours, even though the images are half our size.
Yes, they can be faster. The numbers you quote are for networks trained for ImageNet classification: 1 million images, 1000 classes. Unless your classification problem is of similar scale, using an ImageNet network is overkill.
You should also remember that these networks have on the order of 10-100 million weights, so they are quite expensive to evaluate. But you probably don't need such a big network; you can design your own, with fewer layers and parameters, that is much cheaper to evaluate.
For example, I designed a network to classify 96x96 sonar image patches; with around 4000 weights in total it gets over 95% classification accuracy and runs at 40 ms per frame on a Raspberry Pi 2.
A bigger network with 900K weights and the same input size takes 7 ms to evaluate on a Core i7. So this is surely possible; you just need to experiment with smaller network architectures. A good starting point is SqueezeNet, which achieves good performance on ImageNet with 50 times fewer weights, and is of course much faster than other networks.
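To give a sense of scale, here is a minimal Keras sketch of a deliberately small CNN for 96x96 single-channel patches. It is not the sonar network described above; the layer sizes and the number of classes are illustrative assumptions chosen to keep the total parameter count in the low thousands.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def tiny_cnn(num_classes=4):
    # A deliberately small architecture: two conv/pool stages and a tiny classifier.
    return models.Sequential([
        layers.Input(shape=(96, 96, 1)),
        layers.Conv2D(8, 5, activation="relu"),    # 5*5*1*8 + 8   = 208 weights
        layers.MaxPooling2D(4),
        layers.Conv2D(12, 5, activation="relu"),   # 5*5*8*12 + 12 = 2412 weights
        layers.MaxPooling2D(4),
        layers.GlobalAveragePooling2D(),
        layers.Dense(num_classes, activation="softmax"),  # 12*4 + 4 = 52 weights
    ])

model = tiny_cnn()
model.summary()  # a few thousand parameters in total, versus 10-100 million for ImageNet models
```

Whether such a small network reaches useful accuracy depends entirely on your data, but it shows how far below ImageNet-scale architectures you can go.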
I would be wary of benchmarks and blanket statements. It's important to know every detail that went into generating the quoted values. For example, would running the CNN on GPU hardware improve them?
20ms seems very fast to me; so does 40ms. I have no idea what your requirement is.
What other benefits could a CNN offer? Maybe it's about more than just raw speed.
I don't believe that neural networks are the perfect technique for every problem. Regression, SVM, and other classification techniques are still viable.
There's a bias at work here. Your question reads as if you are looking only to confirm that your current research is best. You have a sunk cost that you're loath to throw away, but you're worried that there might be something better out there. If that's true, I don't think this is a good question for SO.
"I don't know almost nothing on CNNs" - if you're a true researcher, seeking the truth, I think you have an obligation to learn and answer for yourself. TensorFlow and Keras make this easy to do.
To answer your question: yes, they can. CNNs can be slower and they can be faster than classic descriptors. For example, using only a single filter and several max-poolings will almost certainly be faster. But the results will also almost certainly be crappy.
You should ask a much more specific question. Relevant parts are:
Problem: Classification / Detection / Semantic Segmentation / Instance Segmentation / Face verification / ... ?
Constraints: Minimum accuracy / maximum speed / maximum latency?
Evaluation specifics:
Which hardware is available (GPUs)?
Do you evaluate a single image at a time? You can often evaluate up to 512 images in about the same time as one image (see the sketch after this list).
Also: the input image size should not be relevant. If CNNs achieve better results on smaller inputs than classic descriptors do on larger ones, why should you care?
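As a rough illustration of the batching point above, here is a sketch of how you might measure single-image versus batched evaluation time. The model is an arbitrary stand-in (MobileNetV2 built without pretrained weights); the absolute numbers depend entirely on your hardware and network.

```python
import time
import numpy as np
import tensorflow as tf

# Any network will do for this measurement; MobileNetV2 is just a placeholder.
model = tf.keras.applications.MobileNetV2(weights=None)

def seconds_per_forward_pass(batch_size, repeats=10):
    x = np.random.rand(batch_size, 224, 224, 3).astype("float32")
    model.predict(x, verbose=0)                      # warm-up run
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(x, batch_size=batch_size, verbose=0)
    return (time.perf_counter() - start) / repeats

print("batch of 1 :", seconds_per_forward_pass(1))
print("batch of 64:", seconds_per_forward_pass(64))  # usually far less than 64x the single-image time
```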
Papers
Please note that CNNs are usually not tweaked towards speed, but towards accuracy.
Detection: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks: 600 x ~800 px in 200 ms on a GPU
InverseFaceNet: Deep Single-Shot Inverse Face Rendering From A Single Image: 9.79 ms with a GeForce GTX Titan and AlexNet to get FC7 features
Semantic segmentation: Pixel-wise Segmentation of Street with Neural Networks: 20 ms with a GeForce GTX 980
Related
I want to compute a disparity map on the Windows platform. I have tried several codes from the internet, but I can't compute a precise disparity map. I used the OpenCV SGBM algorithm, but the resulting disparity map was very noisy. Could anyone suggest an efficient implementation?
Thanks in advance for your help.
Well, stereo vision is always a trade-off between speed and accuracy; you can't have both. SGM and SGBM already strike a pretty good compromise here. If you want very high precision, you can move towards true global algorithms such as belief propagation, but expect several minutes of computation time to process a single camera frame.
If you want fast processing, you can use block matching. But there you will get poor results on low-texture surfaces and artifacts like foreground fattening.
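For reference, here is a minimal OpenCV sketch (Python bindings) of both approaches. The parameter values are assumptions you will need to tune for your cameras; SGBM's smoothness penalties and speckle filtering are usually what tame the kind of noise you describe.

```python
import cv2

# Rectified grayscale stereo pair (file names are placeholders)
left  = cv2.imread("left.png",  cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching: slower, but smoother and less noisy
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,       # must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,             # smoothness penalties; larger P2 -> smoother map
    P2=32 * 5 * 5,
    uniquenessRatio=10,
    speckleWindowSize=100,    # filter out small noisy blobs
    speckleRange=2,
    disp12MaxDiff=1,
)
disparity_sgbm = sgbm.compute(left, right).astype("float32") / 16.0  # output is fixed-point

# Plain block matching: faster, but noisier on low-texture regions
bm = cv2.StereoBM_create(numDisparities=128, blockSize=15)
disparity_bm = bm.compute(left, right).astype("float32") / 16.0
```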
The only way to speed things up further is to switch to hardware capable of massively parallel processing. There are some great FPGA-based solutions for stereo vision, like this one:
https://nerian.com/products/sp1-stereo-vision/
Using a (high-end) graphics card would also be an option. There are several research papers with proposed implementations, and there might be some code available somewhere.
I am doing a large-scale deep learning experiment involving image data on the order of 800 GB.
The space available on the computational server is only 30 GB and cannot be extended to 800 GB.
At present I work around the problem by splitting my data into 30 GB chunks with Python, copying a chunk over SSH, and processing it. Every time I need another chunk, I delete the current one and repeat the process for the next. Over several epochs of CNN training, this is repeated hundreds of times.
Though I have not benchmarked it, I am concerned that this is a major performance bottleneck, because CNN training itself takes weeks on data of this scale, and the repeated copying might be very costly.
I have never faced this issue before, so I am now wondering whether it is possible to read files directly from the memory of my storage server for processing.
Specifically, my questions are:
Is it possible to read files directly from the memory of another system, as though the files were on the same system, without an explicit scp?
What kind of C++ framework(s) are available for doing something like that?
What techniques are typically used by professional programmers in such a resource-constrained situation?
I am not a computer science major, and this is the first time I have faced such performance-centric issues, so I have almost no practical experience dealing with them. A little enlightenment or a reference would be great.
It may sound a little rude, but you need to realize that you can't do any sort of real-world machine learning on a calculator.
If you have a ten-year-old machine or a dial-up internet connection, you cannot analyze big data. The fact that your server has 30 GB of free disk space, when you can easily buy 1 TB for under $200, means that something is really wrong here.
A lot of machine learning algorithms iterate through the data many, many times before they converge, so any solution that requires downloading and deleting the data repeatedly will be significantly (impractically) slower. Even assuming a fast and steady 200 Mb/s connection, it will take you hours to download the whole dataset. Now repeat that even 100 times (a network converging in fewer than 100 iterations is mostly impossible) and you will see how bad your situation is.
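To make that concrete, here is a quick back-of-the-envelope sketch. The 800 GB and the 100 passes come from the thread; the "200 Mb/s" figure is ambiguous, so both possible readings are shown.

```python
# Rough estimate of the copying overhead alone, under the stated assumptions.
dataset_gbit = 800 * 8      # 800 GB expressed in gigabits
passes = 100                # training passes that each re-download the data

for label, gbit_per_s in [("200 Mbit/s", 0.2), ("200 MB/s", 1.6)]:
    hours_per_pass = dataset_gbit / gbit_per_s / 3600
    print(f"{label}: {hours_per_pass:.1f} h per pass, "
          f"{hours_per_pass * passes / 24:.0f} days over {passes} passes")
# prints roughly 8.9 h/pass (37 days) and 1.1 h/pass (5 days) spent only on copying
```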
This brings me to my final remark: if you want to work with big data, upgrade your machine so it can handle big data.
Which costs more: the explicit cost of copying, or the implicit and hidden cost of reading data with latency?
As a data point, Google just announced a distributed version of TensorFlow, which can do CNN training (see https://www.tensorflow.org/versions/r0.8/tutorials/deep_cnn/index.html for details). In that setup, each machine ends up processing a chunk of data at a time, in a way that is not dissimilar to what you are already doing.
I'm writing performance benchmarks for some of my code. This is both to compare my own implementations as I develop/experiment, and to compare against "competing" implementations. I have no problem writing these, and getting usable results.
It's very well established that more samples are a good thing, as they reduce the impact of erroneous data and give a truer result.
So, if I'm profiling a given function/procedure/whatever, how many samples does it seem reasonable to get?
I'm currently taking about 1 million samples for each test. These are individual operations; the results rarely take longer than 10 s per item, even on an old laptop. Most are under a hundredth of a second.
Actually, it is not well established that more samples are a good thing.
It is nothing more than common wisdom.
I think you are sharing in a general confusion about the reason for profiling, whether the purpose is to measure performance or to find speedups.
For measuring performance, you don't need samples at all.
What you need is a stopwatch, whether implemented in software or not.
If your process runs too quickly for the resolution of the stopwatch, just run it 10^3 or 10^6 times, measure that, and divide by the count.
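For example, a minimal stopwatch along those lines might look like the following, where Python's perf_counter plays the role of the stopwatch and the function under test is just a placeholder:

```python
import time

def operation():
    # stand-in for the code being measured
    sum(i * i for i in range(100))

N = 10**5                          # repeat enough times to swamp the timer's resolution
start = time.perf_counter()
for _ in range(N):
    operation()
elapsed = time.perf_counter() - start
print(f"{elapsed / N * 1e6:.2f} microseconds per call, averaged over {N} runs")
```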
For finding speedups, sampling the call stack is very effective, provided the samples contain line-level or instruction-level call site information.
How many samples do you need?
Well, if you see it doing something that could be removed on one sample, that probably doesn't mean much.
But if you see it on two samples, that estimates it is costing a time fraction F of roughly 2/N, where N is the number of samples.
Example: if you see it twice in 10 samples, that means it costs roughly 20% of time.
In general, if the speedup is going to save you fraction F of time, it takes on average 2/F samples to see it twice.
Example: if it is going to save 30% of time (F = 0.3) you need on average 2/0.3 = 6.67 samples to see it twice.
Of course, if you see it more than twice, all the better.
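Here is a quick simulation of that 2/F figure (my own illustration, not part of the original argument): each random sample hits the removable activity with probability F, and we count samples until it has been seen twice.

```python
import random

def samples_until_seen_twice(F):
    hits = samples = 0
    while hits < 2:
        samples += 1
        if random.random() < F:   # this sample landed in the removable activity
            hits += 1
    return samples

F = 0.3
trials = [samples_until_seen_twice(F) for _ in range(100_000)]
print(sum(trials) / len(trials))  # averages close to 2 / F = 6.67
```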
Bottom line, for finding speedups, you don't need a lot of samples.
What you do need is to examine each one for activity that could be removed.
What you don't need is to mush them together into "statistics" (like most profilers do).
Many people understand this.
If you want a bit more rigorous explanation, look here.
OK, so I am just trying to work out the best way to reduce bandwidth between the GPU and CPU.
Particle Systems.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes things like positions, rotations, velocities, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders, and using the geometry shader as much as possible?
My problem is that the sort of app I have written needs a good few variables sent to the shaders. For example, at run time a user will select emitter positions and velocities, plus a lot more. The sort of thing I am not sure how to tackle is this: if a user wants a random velocity and gives a min and max value to pick the random value from, should that random value be worked out on the CPU and sent as a single value to the GPU, or should both the min and max values be sent to the GPU and a random number generator on the GPU do it? Any comments on reducing bandwidth and optimization are much appreciated.
Should I be pre-calculating most things on the CPU and sending them to the GPU? This includes things like positions, rotations, velocities, calculations for alpha, random numbers, etc.
Or should I be doing as much as I can in the shaders, and using the geometry shader as much as possible?
Impossible to answer. Spend too much CPU time and performance will drop. Spend too much GPU time and performance will drop too. Transfer too much data and performance will drop. So, instead of trying to guess (I don't know what app you're writing, what your target hardware is, etc. - hell, you didn't even specify your target API and platform), measure/profile and select the optimal method. PROFILE instead of trying to guess the performance. There are AQTime 7 Standard, gprof, and NVPerfKit for that (plus many other tools).
Do you actually have a performance problem in your application? If you don't, then don't do anything. Do you have, say, ten million particles per frame in real time? If not, there's little reason to worry, since a 600 MHz CPU could handle thousands of them easily seven years ago. On the other hand, if you have, say, a dynamic 3D environment and the particles must interact with it (bounce), then doing it all on the GPU will be MUCH harder.
Anyway, to me it sounds like you don't have to optimize anything and there's no actual NEED to optimize. So the best idea would be to concentrate on other things.
However, in any case, make sure you're using the correct way to transfer "dynamic" data that is frequently updated. In DirectX that means dynamic write-only vertex buffers locked with D3DLOCK_DISCARD|D3DLOCK_NOOVERWRITE. With OpenGL that will probably mean STREAM or DYNAMIC buffer data with DRAW usage. That should be sufficient to avoid major performance hits.
There's no single right answer to this. Here are some things that might help you make up your mind:
Are you sure the volume of data going over the bus is high enough to be a problem? You might want to do the math and see how much data there is per second versus what's available on the target hardware (see the sketch at the end of this answer).
Is the application likely to be CPU bound or GPU bound? If it's already GPU bound there's no point loading it up further.
Particle systems are pretty easy to implement on the CPU and will run on any hardware. A GPU implementation that supports nontrivial particle systems will be more complex and limited to hardware that supports the required functionality (e.g. stream out and an API that gives access to it.)
Consider a mixed approach. Can you split the particle systems into low complexity, high bandwidth particle systems implemented on the GPU and high complexity, low bandwidth systems implemented on the CPU?
All that said, I think I would start with a CPU implementation and move some of the work to the GPU if it proves necessary and feasible.
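Here is a rough sketch of that "do the math" step. All of the figures are assumptions to replace with your own particle count, vertex layout, frame rate, and bus budget.

```python
# Per-frame upload for CPU-simulated particles versus a rough bus budget.
particles        = 100_000
bytes_per_vertex = 32        # e.g. position + velocity + colour + size as floats
fps              = 60

upload_mb_per_s = particles * bytes_per_vertex * fps / 1e6
bus_budget_mb_s = 8_000      # very rough usable PCIe bandwidth, an assumption

print(f"{upload_mb_per_s:.0f} MB/s uploaded vs ~{bus_budget_mb_s} MB/s available")
# around 192 MB/s here, a small slice of the bus; at tens of millions of
# particles the same arithmetic starts to look very different
```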
I am taking a course on computational geometry in the fall, where we will be implementing some algorithms in C or C++ and benchmarking them. Most of the students generate a few datasets and measure their programs with the time command, but I would like to be a bit more thorough.
I am thinking about writing a program to automatically generate different datasets, run my program with them and use R to test hypotheses and estimate parameters.
So... How do you measure program running time more accurately?
What might be relevant to measure?
What hypotheses might be interesting to test (variance, effects caused by caching, etc.)?
Should I test my code on more than one machine? How should these machines differ?
My overall goals are to learn how these algorithms perform in practice, which implementation techniques are better and how the hardware actually performs.
Profilers are great. Valgrind is pretty popular. Also, I'd suggest trying your code out on RISC machines if you can get access to some. Their performance characteristics differ from those of CISC machines in interesting ways.
You could use the Windows API timing functions (which are not that exact), or you can use the RDTSC inline assembler instruction, which has sub-nanosecond resolution (don't forget that the instruction and the code around it add a small overhead of a few hundred cycles, but this is not a big issue).
In order to get better accuracy with program metrics, you will have to run your program many times, such as 100 or 1000.
For more details, on metrics, search the web for metrics and profiling.
Beware that programs may differ in performance (time) measurements due to things running in the background such as virus scanners, music players, and other programs with timers in them.
You could test your program on different machines. Processor clock rates, L1 and L2 cache sizes, RAM sizes, and Disk speeds are all factors (as well as the number of other programs / tasks running concurrently). Floating point may also be a factor.
If you want, you can challenge your compiler by printing the assembly listings for various optimization settings and seeing which setting produces the shortest or most efficient code.
Since you're processing data, look at data-driven design: http://www.gamearchitect.net/Articles/DataDrivenDesign.html
You can use the Windows High Performance Counter to get nanosecond accuracy. Technically, AFAIK, the HPC can run at any rate, but you can query its counts per second, and as far as I know most CPUs count at a very high rate.
What you should do is just get a professional profiler; that's what they're for. More realistically, however:
If you're only comparing between algorithms, then as long as your machine doesn't happen to excel in one particular area (Pentium D, SSD, that sort of thing), it shouldn't matter too much to do it all on one machine. If you want to look at cache effects, try running the algorithm right after the machine starts up (make sure you get a copy of Windows 7, which should be free for CS students), then leave it doing something plenty cache-heavy, like image processing, for 24 hours or so to convince the OS to cache that instead. Then run the algorithm again and compare.
You didn't specify your platform. If you are on a POSIX system (e.g. Linux), have a look at clock_gettime. This lets you access different kinds of clocks, e.g. wall-clock time or CPU time. You can also query the precision of each clock.
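As a quick illustration of what those clocks report, here is a sketch using Python's bindings of the same POSIX clocks (a stand-in for the C calls; the workload is a placeholder):

```python
import time

# Resolution of the clocks, as reported by the OS
print("monotonic resolution:", time.clock_getres(time.CLOCK_MONOTONIC))
print("cpu-time resolution :", time.clock_getres(time.CLOCK_PROCESS_CPUTIME_ID))

wall0 = time.clock_gettime(time.CLOCK_MONOTONIC)
cpu0  = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)

sum(i * i for i in range(10**6))   # stand-in for the algorithm being measured

wall1 = time.clock_gettime(time.CLOCK_MONOTONIC)
cpu1  = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)
print(f"wall: {wall1 - wall0:.6f} s, cpu: {cpu1 - cpu0:.6f} s")
```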
Since you are willing to do proper statistics on your numbers, you should repeat your experiments often enough that the statistical tests give you enough confidence.
If your measurements are not too fine-grained and your variance is low, around 10 runs is often quite good. But if you go down to a small scale, a short function or so, you might need many more.
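As a sketch of that "enough repeats" idea (my own example, using a normal approximation and a placeholder workload): estimate the mean run time and a 95% confidence interval from repeated measurements, and increase the number of runs until the interval is tight enough for your purposes.

```python
import statistics
import time

def measure_once():
    start = time.perf_counter()
    sum(i * i for i in range(10**5))   # stand-in for the algorithm being measured
    return time.perf_counter() - start

samples = [measure_once() for _ in range(30)]
mean = statistics.mean(samples)
sem  = statistics.stdev(samples) / len(samples) ** 0.5   # standard error of the mean
print(f"{mean * 1e3:.3f} ms +/- {1.96 * sem * 1e3:.3f} ms (95% CI, n={len(samples)})")
```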
You also have to ensure reproducible experimental conditions: no other load on the machine, enough memory available, etc.