Akka testkit: what is the timeFactor? - akka

There are several methods on the Akka TestProbe that say they are "correctly treating the timeFactor." What does that mean?
http://doc.akka.io/api/akka/2.0/akka/testkit/TestProbe.html
For example, see the second version of expectMsg.

Using the setting akka.test.timefactor (1 by default) you can scale all timeouts in your tests up or down, which is useful when the same tests run on machines with very different performance. The methods that mention the timeFactor use a maximum duration of akka.test.single-expect-default (3 seconds by default) multiplied by akka.test.timefactor. For example, with timefactor = 3, an expectMsg call without an explicit timeout waits up to 3 s × 3 = 9 s.
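As a sketch of what that could look like in your test configuration (assuming the usual application.conf / HOCON syntax; the value 3.0 is only an illustration):

akka {
  test {
    timefactor = 3.0            # scales every test timeout by 3
    single-expect-default = 3s  # base timeout for expectMsg and friends
  }
}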

Related

Can I make OpenMP revert to the ideal # of threads after using omp_set_num_threads?

Is there a way to make OpenMP revert the number of threads (for the next time it's used) back to the default after the application has already called omp_set_num_threads() with a specific number?
For example, is there a special code (e.g. 0 or -1) that I can supply to omp_set_num_threads?
Or should I just try doing something like omp_set_num_threads(omp_get_max_threads())?
I am making the assumption that the default number is whatever the implementation of OpenMP deems as "optimal". But I don't know what, if anything, the default is guaranteed to be or even what it should be. All I know is that I have an application that calls omp_set_num_threads(4) for one specific OpenMP block which I must not edit (for now). But I'd like to prevent that one setting from affecting other OpenMP blocks in my code.
I've had this problem before. (Disclaimer: I work with MSVC, which currently only implements the OpenMP 2.0 standard). To the best of my knowledge, there is nothing in the OpenMP 2.0 standard that allows you to find out this default value. omp_get_max_threads() is not required to return it (all subsequent emphasis mine):
The omp_get_max_threads function returns an integer that is guaranteed to be at least as large as the number of threads that would be used to form a team if a parallel region without a num_threads clause were to be encountered at that point in the code.
In other words, it might return a number that is larger than the currently set (or default) value.
There is no special value for omp_set_num_threads either:
The omp_set_num_threads function sets the default number of threads to use for subsequent parallel regions that do not specify a num_threads clause. [...] The value of the parameter num_threads must be a positive integer.
And if you get it wrong, it's up to the implementation what will happen:
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads requested for the parallel region exceeds the number that the run-time system can supply, the behavior of the program is implementation-defined. An implementation may, for example, interrupt the execution of the program, or it may serialize the parallel region.
You might find more precise (and less unsettling) information in the documentation of your OpenMP implementation. However, in the case of MSVC, that documentation is just a verbatim copy of the OpenMP 2.0 standard...
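That said, a practical (if not guaranteed) workaround is to record what omp_get_max_threads() reports before anything has called omp_set_num_threads(), and to restore that value afterwards. A minimal sketch, assuming a C++ compiler with OpenMP enabled; per the quote above, the saved value is not guaranteed to equal the default (it may be larger), but in practice it usually is:

#include <omp.h>
#include <cstdio>

int main()
{
    // Capture the "default" before anyone changes it.
    int default_threads = omp_get_max_threads();

    omp_set_num_threads(4);                  // the block you cannot edit does this
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("forced block:   %d threads\n", omp_get_num_threads());
    }

    omp_set_num_threads(default_threads);    // restore for subsequent blocks
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("restored block: %d threads\n", omp_get_num_threads());
    }
    return 0;
}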
Since you are in the business of modifying the number of threads this way, I would like to preemptively caution about the interaction of omp_set_dynamic with omp_get_num_threads within MSVC:
Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?

How to write a unit test for VIs that contain the "Tick Count (ms)" LabVIEW function?

There is a VI whose outputs (indicators) depend not only on the inputs but also on the values of "Tick Count" functions. The problem is that it does not produce the same output for the same inputs: each time I run it, it gives different outputs, so a unit test that only captures inputs and outputs would fail. The question is how to write a unit test for this situation.
I cannot include the VI in the question as it contains several subVIs, and the "Tick Count" functions are spread through all levels of its subVIs.
EDIT1: I wrote a wrapper that subtracts the output values of two consecutive runs in order to eliminate the base reference time (which is undefined in this function) but it spoils the outputs.
I think you have been given a very difficult task, as the function you've been asked to test is non-deterministic and it is challenging to write unit tests against non-deterministic code.
There are some ways to test non-deterministic functions: for example, one could test that a random number generator produces values uniformly distributed to some tolerance, or that a clock-setting function matches an NTP server to some tolerance. But I think your team will be happier if you can make the underlying code deterministic.
Your idea to use conditional disable is good, but I would add the step of creating a wrapper VI and then searching and replacing all native Tick Count calls with it. This way you can make any modifications to Tick Count in one place. If for some reason the code actually uses the tick count for something other than profiling (for example, to seed a pseudorandom number generator), your "test/debug" case can read from a Notifier into which your testing code injects a set of fake tick counts. Notifiers work great for something like this.
You could add an optional input that allows you to override the tick count value. Give it a default value of -1, and in the VI use the real tick count only when that input is -1.
However, I have never seen code relying on the tick count.

How to calculate std dev, quartiles, ... when benchmarking code?

I wrote some functions to benchmark a function/piece of code. I do it like this:
start = timer
for(1 second)
call fun
iterations++
stop = timer
And then I have a MEAN (AVERAGE) time: (stop - start) / iterations, right?
A single call is too 'short' to measure, so how can I calculate std dev, quartiles, etc. from this type of measurement?
Standard deviation and quartiles both deal with the distribution of values in a group.
With only one measurement, these become trivial or meaningless. Since there's only one measurement, that value is the mean, the minimum, the maximum, and the mode. Since none of the measurements deviate from the mean, the variance and standard deviation are zero.
You'll have to find a way to measure the time precisely enough. You'll need the times for individual calls to fun in order to get any meaningful standard deviation etc.
This question may contain useful hints, and I'm sure there are quite a few platform-specific high-resolution timers out there as well.
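As a rough sketch of what per-call measurement and the statistics could look like, assuming C++11 and that a single call is long enough for std::chrono::steady_clock to resolve (the body of fun() here is just a placeholder for the code under test):

#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Placeholder for the code under test.
void fun()
{
    volatile int x = 0;
    for (int i = 0; i < 1000; ++i) x += i;
}

int main()
{
    using clock = std::chrono::steady_clock;
    const int runs = 10000;
    std::vector<double> samples;   // per-call time in nanoseconds
    samples.reserve(runs);

    for (int i = 0; i < runs; ++i) {
        auto start = clock::now();
        fun();
        auto stop = clock::now();
        samples.push_back(std::chrono::duration<double, std::nano>(stop - start).count());
    }

    // Mean and standard deviation.
    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / samples.size();
    double sq = 0.0;
    for (double s : samples) sq += (s - mean) * (s - mean);
    const double stddev = std::sqrt(sq / samples.size());

    // Quartiles from the sorted samples (nearest-rank; good enough here).
    std::sort(samples.begin(), samples.end());
    const double q1 = samples[samples.size() / 4];
    const double median = samples[samples.size() / 2];
    const double q3 = samples[3 * samples.size() / 4];

    std::printf("mean %.1f ns, stddev %.1f ns, Q1 %.1f, median %.1f, Q3 %.1f\n",
                mean, stddev, q1, median, q3);
    return 0;
}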
In general, due to processing speed and the difficulty of obtaining microsecond and millisecond resolution, most performance measurements are based on a large number of iterations.
For example:
Read start time
for 1,000,000 iterations do
perform function
end-for
read end time.
The duration is the end time - start time.
The average execution time is the duration divided by the number of iterations.
There are other reasons for using the average time: interruptions by the OS, data cache misses, and external factors (such as hard drive accesses).
For more exact measurements, you will have to use a "test point" and an oscilloscope. Write a high pulse to the test point before the iterations and a low pulse afterwards. Set the oscilloscope to capture the duration. If your oscilloscope has statistical functions and storage, move the test-point writes to immediately before and after the function execution.
If a single call is too short to measure, then why do you care how long it takes?
I'm being a bit facetious, but if you're on Intel Linux, and your process is pinned to one core, you can read the CPU's timestamp counter (TSC), which is the highest-resolution tick you can get. In recent Intel CPUs it ticks very solidly at the nominal CPU frequency independent of the actual frequency (which varies wildly). If you Google for "rdtsc", you'll find several implementations of an rdtsc() function that you can just call. You could then try something like:
uint64_t tic, elapsed[10000];
for (int i = 0; i < 10000; i++) {
    tic = rdtsc();
    my_func();
    elapsed[i] = rdtsc() - tic;   // read again after the call; tic - rdtsc() would come out negative
}
That might get you within shouting distance of maybe kinda/sorta semi-valid values for individual function calls, from which you can then produce whatever statistics you want (mean/mode/median/variance/std. dev.). The validity of this is seriously open to question, but it's the best that can be done with anything like your method. I'd be much more inclined to run the whole application under perf record and then use perf report to see where the cycles are being expended and focus on that.
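For completeness, here is one hedged sketch of such an rdtsc() wrapper built on the compiler intrinsic (assuming x86/x86-64 with GCC, Clang, or MSVC; note that rdtsc is not a serializing instruction, which is part of why the measurements above are only semi-valid):

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Thin wrapper around the timestamp-counter intrinsic.
static inline uint64_t rdtsc()
{
    return __rdtsc();
}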

timers, threads and compiler misbehaviour

I'm having trouble with something and couldn't find any answers about it, as I don't even know what to search for. I have written a timer class using QueryPerformanceCounter. From my application I launch a second thread object that has its own timer instance, and in an infinite loop I get the delta time from the timer and use it to output the number of loop iterations per second.
I noticed it was giving me weird values, so I started printing the delta time and found that it sometimes came out as 0; I then went inside the method that returns the delta time and did some testing. This is my deltaTime() method:
double MyTimer2::deltaTime()
{
    LARGE_INTEGER timenow;
    QueryPerformanceCounter(&timenow);
    //std::cout << "timenow=" << (double)timenow.QuadPart << " currentticks=" << (double)m_currentTicks.QuadPart << std::endl;
    double m_deltaTime = (double)(timenow.QuadPart - m_currentTicks.QuadPart) /* 1000.0*/ / (double)m_frequency.QuadPart;
    m_currentTicks = timenow;
    if (m_deltaTime < 0.000001)
        return 0.0;
    return m_deltaTime;
}
So, I put a breakpoint on "return 0.0;" and what happens is that it gets there most of the time, which is not correct. However, if I uncomment the printing code and run, I will never stop on the breakpoint. So in theory, my printing code is making it work correctly, whereas if I remove it, things stop working as they should! How is this possible, why is it happening and how can I fix it? I've tried _ReadWriteBarrier() unsuccessfully.
Thanks in advance!
EDIT: I need a high-resolution timer for physics simulation!
A couple of processor generations ago, QueryPerformanceCounter() would read the CPU's cycle counter (e.g. via rdtsc). With this method, the number of ticks between successive reads was never zero, and the resolution was equal to the CPU clock rate, e.g. 3 GHz.
Modern processors have two characteristics which make the cycle counter useless for timing. First, you have multiple cores, which each have their own cycle counter. Threads can migrate between cores, and if you read the cycle counter from two different cores, the difference would not be related to elapsed time. It could even be negative. Secondly, you have dynamic clocking based on load (both underclocking to save power and overclocking for performance). Intel calls these "SpeedStep" and "Turbo Boost", respectively. When the cycle rate isn't fixed, there's no way to convert from ticks to time.
So, QueryPerformanceCounter now uses a dedicated piece of hardware called the High Precision Event Timer (HPET), with a resolution of several MHz. Importantly, there is only one regardless of how many cores you have, and it doesn't change speed dynamically. But since the resolution is lower, it is now possible to read it twice between ticks, in which case the elapsed time is reported as zero.
In practice, this isn't a problem. If you need timing more precise than what the HPET can provide, then a general purpose computer is not suitable for you. Timing in the nanosecond range will be severely affected by interrupts.
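If you want to see what resolution QueryPerformanceCounter actually provides on a given machine, you can ask for the counter frequency. A minimal sketch (QueryPerformanceFrequency is the standard Win32 call for this; error handling omitted):

#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   // counts per second
    std::cout << "QPC frequency: " << freq.QuadPart << " Hz\n";
    std::cout << "Resolution:    " << 1e9 / freq.QuadPart << " ns per tick\n";
    return 0;
}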
What could possibly be the purpose of this block?
if(m_deltaTime < 0.000001)
return 0.0;
It has no value, it simply screws with the results, telling you the time was zero when it actually wasn't.
First of all, your timer approach is problematic: the infinite polling loop consumes the CPU intensively. On a single-core machine it will slow down the whole system. If you want to create a timer and you target Windows, you can use the Windows timer functions (see the sketch after this answer).
Second, every non-negative value returned by your deltaTime() function is valid. Since you are not running on a real-time operating system, any operation can take an arbitrary amount of time. One iteration can take tens of processor cycles, or tens of years; nothing is guaranteed.
Third, about the experimental results: it seems that if a context switch happens between two consecutive time measurements, you get a value of about 0.016 s; if not, you get a value below 0.000001 s, which your code floors to 0.
As was said, printing to the console is a relatively heavy operation, and you essentially always get a context switch when you enable it.
EDIT
While QueryPerformanceCounter seems to offer great resolution, it is a trap: you will never get a truly high-resolution timer unless you are working in a real-time OS.
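As mentioned above, one way to get a periodic tick on Windows without burning a core is a waitable timer. A rough sketch (assuming the Win32 waitable-timer API; the 10 ms period is arbitrary and error handling is omitted):

#include <windows.h>
#include <cstdio>

int main()
{
    // Auto-reset waitable timer.
    HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);

    LARGE_INTEGER due;
    due.QuadPart = -100000LL;   // first fire after 10 ms (100-ns units, negative = relative)
    SetWaitableTimer(timer, &due, 10, NULL, NULL, FALSE);   // then every 10 ms

    for (int i = 0; i < 5; ++i) {
        WaitForSingleObject(timer, INFINITE);   // sleeps instead of spinning
        std::printf("tick %d\n", i);
    }

    CloseHandle(timer);
    return 0;
}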

How/why do functional languages (specifically Erlang) scale well?

I have been watching the growing visibility of functional programming languages and features for a while. I looked into them and didn't see the reason for the appeal.
Then, recently I attended Kevin Smith's "Basics of Erlang" presentation at Codemash.
I enjoyed the presentation and learned that a lot of the attributes of functional programming make it much easier to avoid threading/concurrency issues. I understand the lack of state and mutability makes it impossible for multiple threads to alter the same data, but Kevin said (if I understood correctly) that all communication takes place through messages and the messages are processed synchronously (again avoiding concurrency issues).
But I have read that Erlang is used in highly scalable applications (the whole reason Ericsson created it in the first place). How can it be efficient handling thousands of requests per second if everything is handled as a synchronously processed message? Isn't that why we started moving towards asynchronous processing - so we can take advantage of running multiple threads of operation at the same time and achieve scalability? It seems like this architecture, while safer, is a step backwards in terms of scalability. What am I missing?
I understand the creators of Erlang intentionally avoided supporting threading to avoid concurrency problems, but I thought multi-threading was necessary to achieve scalability.
How can functional programming languages be inherently thread-safe, yet still scale?
A functional language doesn't (in general) rely on mutating a variable. Because of this, we don't have to protect the "shared state" of a variable, because the value is fixed. This in turn avoids the majority of the hoop jumping that traditional languages have to go through to implement an algorithm across processors or machines.
Erlang takes it further than traditional functional languages by baking in a message passing system that allows everything to operate on an event based system where a piece of code only worries about receiving messages and sending messages, not worrying about a bigger picture.
What this means is that the programmer is (nominally) unconcerned that the message will be handled on another processor or machine: simply sending the message is good enough for it to continue. If it cares about a response, it will wait for it as another message.
The end result of this is that each snippet is independent of every other snippet: no shared code, no shared state, and all interactions coming from a message system that can be distributed among many pieces of hardware (or not).
Contrast this with a traditional system: we have to place mutexes and semaphores around "protected" variables and code execution. We have tight binding in a function call via the stack (waiting for the return to occur). All of this creates bottlenecks that are less of a problem in a shared nothing system like Erlang.
EDIT: I should also point out that Erlang is asynchronous. You send your message and maybe/someday another message arrives back. Or not.
Spencer's point about out of order execution is also important and well answered.
The message queue system is cool because it effectively produces a "fire-and-wait-for-result" effect which is the synchronous part you're reading about. What makes this incredibly awesome is that it means lines do not need to be executed sequentially. Consider the following code:
r = methodWithALotOfDiskProcessing();
x = r + 1;
y = methodWithALotOfNetworkProcessing();
w = x * y;
Consider for a moment that methodWithALotOfDiskProcessing() takes about 2 seconds to complete and that methodWithALotOfNetworkProcessing() takes about 1 second to complete. In a procedural language this code would take about 3 seconds to run because the lines would be executed sequentially. We're wasting time waiting for one method to complete that could run concurrently with the other without competing for a single resource. In a functional language lines of code don't dictate when the processor will attempt them. A functional language would try something like the following:
Execute line 1 ... wait.
Execute line 2 ... wait for r value.
Execute line 3 ... wait.
Execute line 4 ... wait for x and y value.
Line 3 returned ... y value set, message line 4.
Line 1 returned ... r value set, message line 2.
Line 2 returned ... x value set, message line 4.
Line 4 returned ... done.
How cool is that? By going ahead with the code and only waiting where necessary we've reduced the waiting time to two seconds automagically! :D So yes, while the code is synchronous it tends to have a different meaning than in procedural languages.
EDIT:
Once you grasp this concept in conjunction with Godeke's post it's easy to imagine how simple it becomes to take advantage of multiple processors, server farms, redundant data stores and who knows what else.
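To make the timing arithmetic from the example above concrete in a more familiar setting, here is a hedged C++ sketch (not Erlang, and not the message-passing mechanism described in this answer, just an illustration with std::async and made-up sleep durations): the two independent calls overlap, so the total is roughly the longer of the two rather than their sum.

#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

// Stand-ins for the two methods from the example above; the sleeps mimic
// the 2 s disk call and the 1 s network call.
int methodWithALotOfDiskProcessing()
{
    std::this_thread::sleep_for(std::chrono::seconds(2));
    return 41;
}

int methodWithALotOfNetworkProcessing()
{
    std::this_thread::sleep_for(std::chrono::seconds(1));
    return 2;
}

int main()
{
    auto t0 = std::chrono::steady_clock::now();

    // Launch both independent calls; neither waits for the other.
    auto rf = std::async(std::launch::async, methodWithALotOfDiskProcessing);
    auto yf = std::async(std::launch::async, methodWithALotOfNetworkProcessing);

    int x = rf.get() + 1;   // blocks only until the disk call is done
    int w = x * yf.get();   // the network call has long since finished

    double elapsed = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    std::printf("w = %d, elapsed ~%.1f s (not ~3 s)\n", w, elapsed);
    return 0;
}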
It's likely that you're mixing up synchronous with sequential.
The body of a function in erlang is processed sequentially.
So what Spencer said about this "automagical effect" doesn't hold true for erlang. You could model this behaviour with erlang though.
For example you could spawn a process that calculates the number of words in a line.
As we have several lines, we spawn one such process for each line and receive the answers to calculate a sum from them.
That way, we spawn processes that do the "heavy" computations (utilizing additional cores if available) and later we collect the results.
-module(countwords).
-export([count_words_in_lines/1]).

count_words_in_lines(Lines) ->
    % For each line in Lines run spawn_summarizer with the process id (pid)
    % and a line to work on as arguments.
    % This is a list comprehension and spawn_summarizer will return the pid
    % of the process that was created. So the variable Pids will hold a list
    % of process ids.
    Pids = [spawn_summarizer(self(), Line) || Line <- Lines],
    % For each pid receive the answer. This will happen in the same order in
    % which the processes were created, because we saved [pid1, pid2, ...] in
    % the variable Pids and now we consume this list.
    Results = [receive_result(Pid) || Pid <- Pids],
    % Sum up the results.
    WordCount = lists:sum(Results),
    io:format("We've got ~p words, Sir!~n", [WordCount]).

spawn_summarizer(S, Line) ->
    % Create an anonymous function and save it in the variable F.
    F = fun() ->
            % Split the line into words.
            ListOfWords = string:tokens(Line, " "),
            Length = length(ListOfWords),
            io:format("process ~p calculated ~p words~n", [self(), Length]),
            % Send a tuple containing our pid and Length to S.
            S ! {self(), Length}
        end,
    % There is no return in Erlang; instead the last value in a function is
    % returned implicitly.
    % Spawn the anonymous function and return the pid of the new process.
    spawn(F).

% The variable Pid gets bound in the function head.
% In Erlang, you can only assign to a variable once.
receive_result(Pid) ->
    receive
        % Pattern matching: the block behind "->" will execute only if we
        % receive a tuple that matches the one below. The variable Pid is
        % already bound, so we are waiting here for the answer of a specific
        % process. N is unbound, so we accept any value.
        {Pid, N} ->
            io:format("Received \"~p\" from process ~p~n", [N, Pid]),
            N
    end.
And this is what it looks like, when we run this in the shell:
Eshell V5.6.5 (abort with ^G)
1> Lines = ["This is a string of text", "and this is another", "and yet another", "it's getting boring now"].
["This is a string of text","and this is another",
"and yet another","it's getting boring now"]
2> c(countwords).
{ok,countwords}
3> countwords:count_words_in_lines(Lines).
process <0.39.0> calculated 6 words
process <0.40.0> calculated 4 words
process <0.41.0> calculated 3 words
process <0.42.0> calculated 4 words
Received "6" from process <0.39.0>
Received "4" from process <0.40.0>
Received "3" from process <0.41.0>
Received "4" from process <0.42.0>
We've got 17 words, Sir!
ok
4>
The key thing that enables Erlang to scale is related to concurrency.
An operating system provides concurrency by two mechanisms:
operating system processes
operating system threads
Processes don't share state – one process can't crash another by design.
Threads share state – one thread can crash another by design – that's your problem.
With Erlang, one operating system process is used by the virtual machine, and the VM provides concurrency to Erlang programs not by using operating system threads but by providing Erlang processes – that is, Erlang implements its own timeslicer.
These Erlang processes talk to each other by sending messages (handled by the Erlang VM, not the operating system). The Erlang processes address each other using a process ID (PID), which has a three-part address <<N3.N2.N1>>:
process no N1 on
VM N2 on
physical machine N3
Two processes on the same VM, on different VMs on the same machine, or on two different machines communicate in the same way – your scaling is therefore independent of the number of physical machines you deploy your application on (to a first approximation).
Erlang is only threadsafe in a trivial sense – it doesn't have threads. (The language, that is; the SMP/multi-core VM uses one operating system thread per core.)
You may have a misunderstanding of how Erlang works. The Erlang runtime minimizes context-switching on a CPU, but if there are multiple CPUs available, then all are used to process messages. You don't have "threads" in the sense that you do in other languages, but you can have a lot of messages being processed concurrently.
Erlang messages are purely asynchronous; if you want a synchronous reply to your message, you need to explicitly code for that. What was possibly said was that messages in a process's mailbox are processed sequentially. Any message sent to a process sits in that process's mailbox, and the process picks one message from that box, processes it, and then moves on to the next one, in whatever order it sees fit. This is a very sequential act, and the receive block does exactly that.
Looks like you have mixed up synchronous and sequential as chris mentioned.
Referential transparency: See http://en.wikipedia.org/wiki/Referential_transparency_(computer_science)
In a purely functional language, order of evaluation doesn't matter - in a function application fn(arg1, .. argn), the n arguments can be evaluated in parallel. That guarantees a high level of (automatic) parallelism.
Erlang uses a process model where a process can run on the same virtual machine or on a different processor – there is no way to tell. That is only possible because messages are copied between processes; there is no shared (mutable) state. Multi-processor parallelism goes a lot farther than multi-threading: since threads depend upon shared memory, there can only be 8 threads running in parallel on an 8-core CPU, while multi-processing can scale to thousands of parallel processes.