Why do I keep getting different 'time to test' results for the same dataset, algorithm and test parameters in WEKA? - weka

I need to observe the test/training times of running different algorithms. But when I run an ML such as NB for example, at one point it gives me 1.7s as the test time and at other times 2.3s or 0.8s (all parameters are kept same).
Similarly, using a dataset with 60 features and the same number of flows but with 20 features for example, the results could show me that the smaller dataset took a longer time.
I would be grateful for an explanation or advice please. Thank you

The time that Weka is displaying is the so-called wall time, which is merely the time that elapsed between starting the evaluation and finishing it. This does not represent the actual time (as in number of CPU cycles) that your machine was performing the evaluation. Depending on other processes on your machine, other Java threads requiring CPU time, this can vary easily.
The Experimenter, in contrast to command-line execution or evaluations in the Explorer, also generates time measures based on actual CPU cycles (UserCPU_Time_...).

Related

How to locate a certain series of assembly instruction within a period of time?

Suppose I'm observing the memory of an application(e.g. Calculator) and I want to find out a series of instructions being called within a period of time, say, 10:20 AM - 10:21 AM 25/08/14.
At 10:20 AM, I should be pressing the button of execution(geting the result of computation).
And I want to find out all the associated instructions and memory calls in the execution process.
I know I can do this in a simply way like by iterative searching for input values on the calculator. However, in other cases it becomes difficult to search for the corresponding value due to complex layers of pointers.
My question:
Is it possible to implement this(finding out instructions or calls within a period of time) in C++?
Try start with using statistical profiling. If the execution is question is not a short moment, and takes at least a few statistical timer periods, you'll get enough to dig into. Multiple starts will increase result accuracy.

Clueless on how to execute big tasks on C++ AMP

I have a task to see if an algorithm I developed can be ran faster using computing on GPU rather than CPU. I'm new to computing on accelerators, I was given a book "C++ AMP" which I've read thoroughly, and I thought I understood it reasonably well (I coded in C and C++ in the past but nowadays its mostly C#).
However, when going into real application, I seem to just not get it. So please, help me if you can.
Let's say I have a task to compute some complicated function that takes a huge matrix input (like 50000 x 50000) and some other data and outputs matrix of same size. Total calculation for the whole matrix takes several hours.
On CPU, I'd just cut tasks into several pieces (number of pieces being something like 100 or so) and execute them using Parralel.For or just a simple task managing loop I wrote myself. Basically, keep several threads running (num of threads = num of cores), start new part when thread finishes, until all parts are done. And it worked well!
However, on GPU, I cannot use the same approach, not only because of memory constraints (that's ok, can partition into several parts) but because of the fact that if something runs for over 2 seconds it's considered a "timeout" and GPU gets reset! So, I must ensure that every part of my calculation takes less than 2 seconds to run.
But that's not every task (like, partition a hour-long work into 60 tasks of 1sec each), which would be easy enough, thats every bunch of tasks, because no matter what queue mode I choose (immediate or automatic), if I run (via parralel_for_each) anything that takes in total more than 2s to execute, GPU will get reset.
Not only that, but if my CPU program hogs all CPU resource, as long as it is kept in lower priority, UI stays interactive - system is responsive, however, when executing code on GPU, it seems that screen is frozen until execution is finished!
So, what do I do? In the demonstrations to the book (N-Body problem), it shows that it is supposed to be like 100x as effective (multicore calculations give 2 gflops, or w/e amount of flops that was, while amp give 200 gflops), but in real application, I just don't see how to do it!
Do I have to partition my big task into like, into billions of pieces, like, partition into pieces that each take 10ms to execute and run 100 of them in parralel_for_each at a time?
Or am I just doing it wrong, and there is a better solution I just don't get?
Help please!
TDRs (the 2s timeouts you see) are a reality of using a resource that is shared between rendering the display and executing your compute work. The OS protects your application from completely locking up the display by enforcing a timeout. This will also impact applications which try and render to the screen. Moving your AMP code to a separate CPU thread will not help, this will free up your UI thread on the CPU but rendering will still be blocked on the GPU.
You can actually see this behavior in the n-body example when you set N to be very large on a low power system. The maximum value of N is actually limited in the application to prevent you running into these types of issues in typical scenarios.
You are actually on the right track. You do indeed need to break up your work into chunks that fit into sub 2s chunks or smaller ones if you want to hit a particular frame rate. You should also consider how your work is being queued. Remember that all AMP work is queued and in automatic mode you have no control over when it runs. Using immediate mode is the way to have better control over how commands are batched.
Note: TDRs are not an issue on dedicated compute GPU hardware (like Tesla) and Windows 8 offers more flexibility when dealing with TDR timeout limits if the underlying GPU supports it.

Inaccuracy in gprof output

I am trying to profile a c++ function using gprof, I am intrested in the %time taken. I did more than one run and for some reason I got a large difference in the results. I don't know what is causing this, I am assuming the sampling rate or I read in other posts that I/O has something to do with it. So is there a way to make it more accurate and generate somehow almost constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler but I want it to generate results in a similar format to grof as function time% function name, I tried Valgrind but it gave me a massive file in size. So maybe I am generating the file with the wrong command.
Waiting for your input
Regards
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, begs the question - What is it's purpose? Usually, the purpose is "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called a lot of times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And, it completely ignores I/O, as if all I/O that is done is unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer form statistical inaccuracies - if the error is too large then you need to sample for longer and/or with a smaller sampling interval.

How do you measure the effect of branch misprediction?

I'm currently profiling an implementation of binary search. Using some special instructions to measure this I noticed that the code has about a 20% misprediction rate. I'm curious if there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS based architecture.
You're losing 0.2 * N cycles per iteration, where N is the number of cycles that it takes to flush the pipelines after a mispredicted branch. Suppose N = 10 then that means you are losing 2 clocks per iteration on aggregate. Unless you have a very small inner loop then this is probably not going to be a significant performance hit.
Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it's a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is probably a reasonable guess.
On an in-order CPU you may be able to calculate the approximate mispredict cost as a product of the number of mispredicts and the mispredict cost (which is generally a function of some part of the pipeline)
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight1, only some of which are flushed by a misprediction. The surrounding code may be latency bound by one or more chains of dependent instructions, or it may be throughput bound on resources like execution units, renaming throughput, etc, or it may be somewhere in-between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic: that one found a penalty size of ranging from 9 to 35 cycles averaged across entire benchmarks: if you look at some small piece of code the range will be even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the 100s of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well a simple approach is just to control the number of mispredictions and measure the difference! If you set up your benchmark input have a range of behavior, starting with always following the same branch pattern, all the way to having a random pattern, you can plot the misprediction count versus runtime degradation. If you do, share your result!
1Hundreds of instructions in-flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.
Look at your specs for that info and if that fails, run it a billion times and time it external to your program (stop watch of something.) Then run it with without a miss and compare.

Predict C++ program running time

How to predict C++ program running time, if program executes different functions (working with database, reading files, parsing xml and others)? How installers do it?
They do not predict the time. They calculate the number of operations to be done on a total of operations.
You can predict the time by using measurement and estimation. Of course the quality of the predictions will differ. And BTW: The word "predict" is correct.
You split the workload into small tasks, and create an estimation rule for each task, e.g.: if copying files one to ten took 10s, then the remaining 90 files may take another 90s. Measure the time that these tasks take at runtime, and update your estimations.
Each new measurement will make the prediction a bit more precise.
There really is no way to do this in any sort of reliable way, since it depends on thousands of factors.
Progress bars typically measure this in one of two ways:
Overall progress - I have n number of bytes/files/whatever to transfer, and so far I have done m.
Overall work divided by current speed - I have n bytes to transfer, and so far I have done m and it took t seconds, so if things continue at this rate it will take u seconds to complete.
Short answer:
No you can't. For progress bars and such, most applications simply increase the bar length with a percentage based on the overall tasks done. Some psuedo-code:
for(int i=0; i<num_files_to_load; ++i){
files.push_back(File(filepath[i]));
SetProgressBarLength((float)i/((float)num_files_to_load) - 1.0f);
}
This is a very simplified example. Making a for-loop like this would surely block the window system's event/message queue. You would probably add a timed event or something similar instead.
Longer answer:
Given N known parameters, the problem finding whether a program completes at all is undecidable. This is called the Halting problem. You can however, find the time it takes to execute a single instruction. Some very old games actually depended on exact cycle timings, and failed to execute correctly on newer computers due to race conditions that occur because of subtle differences in runtime. Also, on architectures with data and instruction caches, the cycles the instructions consume is not constant anymore. So cache makes cycle-counting unpredictable.
Raymond Chen discussed this issue in his blog.
Why does the copy dialog give such
horrible estimates?
Because the copy dialog is just
guessing. It can't predict the future,
but it is forced to try. And at the
very beginning of the copy, when there
is very little history to go by, the
prediction can be really bad.
In general it is impossible to predict the running time of a program. It is even impossible to predict whether a program will even halt at all. This is undecidable.
http://en.wikipedia.org/wiki/Halting_problem
As others have said, you can't predict the time. Approaches suggested by Partial and rmn are valid solutions.
What you can do more is assign weights to certain operations (for instance, if you know a db call takes roughly twice as long as some processing step, you can adjust accordingly).
A cool installer compiler would execute a faux install, time each op, then save this to disk for the future.
I used such a technique for a 3D application once, which had a pretty dead-on progress bar for loading and mashing data, after you've run it a few times. It wasn't that hard, and it made development much nicer. (Since we had to see that bar 10-15 times/day, startup was 10-20 secs)
You can't predict it entirely.
What you can do is wait until a fraction of the work is done, say 1%, and estimate the remaining time by that - just time how long it takes for 1% and multiply by 100, for example. That is easily done if you can enumerate all that you have to do in advance, or have some kind of a loop going on..
As I mentioned in a previous answer, it is impossible in general to predict the running time.
However, empirically it may be possible to predict with good accuracy.
Typically all of these programs are approximatelyh linear in some input.
But if you wanted a more sophisticated approach, you could define a large number of features (database size, file size, OS, etc. etc.) and input those feature values + running time into a neural network. If you had millions of examples (obviously you would have an automated method for gathering data, e.g. some discovery programs) you might come up with a very flexible and intelligent prediction algorithm.
Of course this would only be worth doing for fun, as I'm sure the value to your company over some crude guessing algorithm will probably be nil :)
You should make estimation of time needed for different phases of the program. For example: reading files - 50, working with database - 30, working with network - 20. In ideal it would be good if you make some progress callback during all of those phases, but it requires coding the progress calculation into the iterations of algorithm.