I want to read an input file (in C/C++) and process each line independently as fast as possible. The processing takes a few ticks itself, so I decided to use OpenMP threads. I have this code:
#pragma omp parallel num_threads(num_threads)
{
    string line;
    while (true) {
        #pragma omp critical(input)
        {
            getline(f, line);
        }
        if (f.eof())
            break;
        process_line(line);
    }
}
My question is, how do I determine the optimal number of threads to use? Ideally, I would like this to be dynamically detected at runtime. I don't understand the DYNAMIC schedule option for parallel, so I can't say if that would help. Any insights?
Also, I'm not sure how to determine the optimal number "by hand". I tried various numbers for my specific application. I would have thought the CPU usage reported by top would help, but it doesn't(!) In my case, the CPU usage stays consistently at around num_threads*(85-95). However, using pv to observe the speed at which I'm reading input, I noted that the optimal number is around 2-5; above that, the input speed decreases. So my question is: why would I then see a CPU usage of 850 when using 10 threads? Can this be due to some inefficiency in how OpenMP handles threads waiting to get into the critical section?
EDIT: Here are some timings. I obtained them with:
for NCPU in $(seq 1 20) ; do echo "NCPU=$NCPU" ; { pv -f -a my_input.gz | pigz -d -p 20 | { { sleep 60 ; PID=$(ps gx -o pid,comm | grep my_prog | sed "s/^ *//" | cut -d " " -f 1) ; USAGE=$(ps h -o "%cpu" $PID) ; kill -9 $PID ; sleep 1 ; echo "usage: $USAGE" >&2 ; } & cat ; } | ./my_prog -N $NCPU >/dev/null 2>/dev/null ; sleep 2 ; } 2>&1 | grep -v Killed ; done
NCPU    Input speed    CPU usage (%)
1       8.27 MB/s      98.4
2       12.5 MB/s      196
3       18.4 MB/s      294
4       23.6 MB/s      393
5       28.9 MB/s      491
6       33.7 MB/s      589
7       37.4 MB/s      688
8       40.3 MB/s      785
9       41.9 MB/s      884
10      41.3 MB/s      979
11      41.5 MB/s      1077
12      42.5 MB/s      1176
13      41.6 MB/s      1272
14      42.6 MB/s      1370
15      41.8 MB/s      1493
16      40.7 MB/s      1593
17      40.8 MB/s      1662
18      39.3 MB/s      1763
19      38.9 MB/s      1857
20      37.7 MB/s      1957
My problem is that I can achieve 40MB/s with 785 CPU usage, but also with 1662 CPU usage. Where do those extra cycles go??
EDIT2: Thanks to Lirik and John Dibling, I now understand that the reason I find the timings above puzzling has nothing to do with I/O, but rather, with the way OpenMP implements critical sections. My intuition is that if you have 1 thread inside a CS and 10 threads waiting to get in, the moment the 1st thread exits the CS, the kernel should wake up one other thread and let it in. The timings suggest otherwise: can it be that the threads wake up many times on their own and find the CS occupied? Is this an issue with the threading library or with the kernel?
"I want to read an input file (in C/C++) and process each line independently as fast as possible."
Reading from a file makes your application I/O bound, so the maximum performance you can achieve for the reading portion alone is to read at the maximum disk speed (on my machine that takes less than 10% of CPU time). This means that if you could completely free the reading thread from any processing, keeping up would require the processing to take less than the remaining CPU time (90% on my machine). If the line-processing threads take up more than the remaining CPU time, then you will not be able to keep up with the hard drive.
There are several options in that case:
Queue up the input and let the processing threads dequeue "work" until they've caught up with the input that was presented (given that you have enough RAM to do so).
Open enough threads and just max out your processor until all the data is read, which is your best effort scenario.
Throttle reading/processing so that you don't take up all of the system resources (in case you're worried about UI responsiveness and/or user experience).
"...the processing takes a few ticks itself, so I decided to use OpenMP threads"
This is a good sign, but it also means that your CPU utilization will not be very high. This is the part where you can optimize your performance, and it's probably best to do it by hand, as John Dibling mentioned. In general, it's best if you queue up each line and let the processing threads pull processing requests from the queue until you have nothing more to process. The latter is also known as a Producer/Consumer design pattern, a very common pattern in concurrent computing.
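For illustration, here is a minimal mutex-and-condition-variable sketch of that queue-based design (an assumption-laden outline, not the asker's actual program; process_line stands in for the per-line work from the question, and the file name and worker count are arbitrary):

#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the question's per-line work.
void process_line(const std::string &line) { (void)line; }

void run(std::istream &f, int num_workers) {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; ++i) {
        workers.emplace_back([&] {
            std::string line;
            for (;;) {
                {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [&] { return done || !q.empty(); });
                    if (q.empty())
                        return;                    // done and queue drained
                    line = std::move(q.front());
                    q.pop();
                }
                process_line(line);                // work happens outside the lock
            }
        });
    }

    // Producer: the single reader keeps the queue fed.
    std::string line;
    while (std::getline(f, line)) {
        {
            std::lock_guard<std::mutex> lk(m);
            q.push(std::move(line));
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_all();
    for (auto &t : workers) t.join();
}

int main() {
    std::ifstream f("input.txt");                  // hypothetical input file
    run(f, 4);                                     // arbitrary worker count
}

The point is that only the reader thread touches the file, and the lock is held just long enough to move a line in or out of the queue. (The queue here is unbounded; a real version would cap its size.)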
Update
Why is there a difference between
(i) each process get lock, pull data, release lock, process data; and
(ii) one process: pull data, get lock, enqueue chunk, release lock,
others: get lock, dequeue chunk, release lock, process data?
There is very little difference: in a way, both represent a consumer/producer pattern. In the first case (i) you don't have an actual queue, but you could consider the file stream to be your Producer (queue) and the Consumer is the thread that reads from the stream. In the second case (ii) you're explicitly implementing the consumer/producer pattern, which is more robust, reusable and provides better abstraction for the Producer. If you ever decide to use more than one "input channel," then the latter case is better.
Finally (and probably most importantly), you can use a lock-free queue with a single producer and a single consumer, which will make (ii) a lot faster than (i) in terms of getting you to be I/O bound. With a lock-free queue you can pull data, enqueue a chunk and dequeue a chunk without locking.
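To make the lock-free idea concrete, here is a bare-bones single-producer/single-consumer ring buffer sketch (purely illustrative, not a production queue; the capacity handling and memory orderings are kept minimal):

#include <atomic>
#include <cstddef>
#include <utility>

// Minimal single-producer/single-consumer ring buffer: exactly one thread may
// call try_push and exactly one other thread may call try_pop, with no locks.
template <typename T, size_t Capacity>
class SpscQueue {
public:
    bool try_push(T item) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                                   // queue full
        buf_[head] = std::move(item);
        head_.store(next, std::memory_order_release);       // publish the slot
        return true;
    }

    bool try_pop(T &out) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                                   // queue empty
        out = std::move(buf_[tail]);
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buf_[Capacity];
    std::atomic<size_t> head_{0};   // next slot to write (touched by producer only)
    std::atomic<size_t> tail_{0};   // next slot to read (touched by consumer only)
};

Note that the no-lock guarantee relies on there being exactly one producer and one consumer; with several processing threads you would need one such queue per worker, or a multi-consumer design.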
The best you can hope to do is tune it yourself, by hand, through repeated measure-adjust-compare cycles.
The optimal number of threads to use to process a dataset is highly dependent on many factors, not the least of which:
The data itself
The algorithm you use to process it
The CPU(s) the threads are running on
The operating system
You could try to design some kind of heuristic that measures the throughput of your processors and adjusts it on the fly, but this kind of thing tends to be way more trouble than it's worth.
As a rule, for tasks that are I/O bound I generally start with about 12 threads per core and tune from there. For tasks that are CPU bound, I'd start with about 4 threads per core and go from there. The key is the "going from there" part if you really want to optimize your processor use.
Also keep in mind that you should make this setting configurable if you really want to optimize, because every system on which this is deployed will have different characteristics.
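As a small illustration of making the count configurable (the NUM_THREADS environment variable name and the helper are just assumptions, not part of the asker's program):

#include <cstdlib>
#include <omp.h>

// Pick the worker count from an environment variable if present, otherwise
// fall back to the OpenMP default for this machine.
int choose_num_threads() {
    if (const char *env = std::getenv("NUM_THREADS")) {
        int n = std::atoi(env);
        if (n > 0) return n;
    }
    return omp_get_max_threads();
}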
Related
I have a periodic task in C++, running on an embedded Linux platform, that has to run at 5 ms intervals. It seems to be working as expected, but is my current solution good enough?
I have implemented the scheduler using sleep_until(), but some comments I have received are that setitimer() is better. As I would like the application to be at least somewhat portable, I would prefer the C++ standard library... unless of course there are other problems.
I have found plenty of sites that show an implementation with each, but I have not found any arguments for why one solution is better than the other. As I see it, sleep_until() will use an "optimal" mechanism on any (supported) platform, and I get the feeling the comments I have received are focused more on usleep() (which I do not use).
My implementation looks a little like this:
#include <chrono>
#include <cstdlib>
#include <ratio>
#include <thread>

// next_period_start() and do_the_magic() are defined elsewhere in the application.

bool is_submilli_capable() {
    return std::ratio_greater<std::milli,
                              std::chrono::system_clock::period>::value;
}

int main() {
    if (not is_submilli_capable())
        exit(1);

    while (true) {
        auto next_time = next_period_start();
        do_the_magic();
        std::this_thread::sleep_until(next_time);
    }
}
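For context, a next_period_start() consistent with this loop might look roughly like the following (purely a hypothetical sketch; the question does not show the real implementation):

#include <chrono>

// Hypothetical helper: advance a fixed 5 ms schedule, so the period does not
// drift even if do_the_magic() occasionally runs long.
std::chrono::steady_clock::time_point next_period_start() {
    static auto next = std::chrono::steady_clock::now();
    next += std::chrono::milliseconds(5);
    return next;
}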
A short summary of the issue:
I have an embedded Linux platform, built with Yocto and with RT capabilities
The application needs to read and process incoming data every 5 ms
Building with gcc 11.2.0
Using C++20
All the "hard work" is done in separate threads, so this question only regards triggering the task periodically and with minimal jitter
Since the application is supposed to read and process the data every 5 ms, it is possible that a few times, it does not perform the required operations. What I mean to say is that in a time interval of 20 ms, do_the_magic() is supposed to be invoked 4 times... But if the time taken to execute do_the_magic() is 10 ms, it will get invoked only 2 times. If that is an acceptable outcome, the current implementation is good enough.
Since the application is reading data, it probably receives it from the network or disk. And adding the overhead of processing it, it likely takes more than 5 ms to do so (depending on the size of the data). If it is not acceptable to miss out on any invocation of do_the_magic, the current implementation is not good enough.
What you could probably do is create a few threads. Each thread executes the do_the_magic function and then goes back to sleep. Every 5 ms, you wake one sleeping thread, which itself takes far less than 5 ms. This way no invocation of do_the_magic is missed. The number of threads you need depends on how long do_the_magic takes to execute.
bool is_submilli_capable() {
    return std::ratio_greater<std::milli,
                              std::chrono::system_clock::period>::value;
}

void wake_some_thread() {
    static int i = 0;
    release_semaphore(i); // Release the semaphore associated with thread i
    i++;
    i = i % NUM_THREADS;
}

void *thread_func(void *args) {
    while (true) {
        // Wait (acquire) this thread's semaphore here
        do_the_magic();
    }
}

int main() {
    if (not is_submilli_capable())
        exit(1);

    while (true) {
        auto next_time = next_period_start();
        wake_some_thread(); // Releases a semaphore to wake a thread
        std::this_thread::sleep_until(next_time);
    }
}
Create as many semaphores as there are threads, where thread i waits on semaphore i. wake_some_thread can then release the semaphores in turn, from index 0 up to NUM_THREADS - 1, and then wrap around.
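Since the question mentions C++20, here is a compilable sketch of this semaphore-per-thread idea using std::binary_semaphore (the pool size, the steady_clock schedule, and the do_the_magic stub are illustrative assumptions, not part of the original answer):

#include <chrono>
#include <semaphore>
#include <thread>
#include <vector>

constexpr int NUM_THREADS = 4;            // assumed pool size

void do_the_magic() { /* stand-in for the question's work */ }

struct Worker {
    std::binary_semaphore sem{0};         // starts blocked
};
Worker workers[NUM_THREADS];

void wake_some_thread() {
    static int i = 0;
    workers[i].sem.release();             // wake worker i
    i = (i + 1) % NUM_THREADS;
}

void thread_func(int id) {
    while (true) {
        workers[id].sem.acquire();        // block until this worker is woken
        do_the_magic();
    }
}

int main() {
    std::vector<std::jthread> pool;
    for (int id = 0; id < NUM_THREADS; ++id)
        pool.emplace_back(thread_func, id);

    auto next_time = std::chrono::steady_clock::now();
    while (true) {
        next_time += std::chrono::milliseconds(5);   // 5 ms period
        wake_some_thread();
        std::this_thread::sleep_until(next_time);
    }
}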
5ms is a pretty tight timing.
You can get a jitter-free 5ms tick only if you do the following:
Isolate a CPU for this thread. Configure it with nohz_full and rcu_nocbs
Pin your thread to this CPU, assign it a real-time scheduling policy (e.g., SCHED_FIFO)
Do not let any other threads run on this CPU core.
Do not allow any context switches in this thread. This includes avoiding system calls altogether. I.e., you cannot use std::this_thread::sleep_until(...) or anything else.
Do a busy wait in between processing (ensure 100% CPU utilisation); see the sketch after this list
Use lock-free communication to transfer data from this thread to other, non-real-time threads, e.g., for storing the data to files, accessing network, logging to console, etc.
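A rough sketch of such a busy-wait loop using the TSC directly (illustrative only and x86-specific; it assumes an isolated, pinned core and an invariant TSC whose frequency, TSC_HZ here, has been calibrated elsewhere; do_the_magic is the question's work function):

#include <cstdint>
#include <x86intrin.h>   // __rdtsc, _mm_pause (x86-specific)

void do_the_magic();     // user-space work only, no syscalls

constexpr std::uint64_t TSC_HZ = 2'000'000'000ULL;   // assumed calibrated TSC frequency
constexpr std::uint64_t PERIOD_TICKS = TSC_HZ / 200; // 5 ms in TSC ticks

void real_time_loop() {
    std::uint64_t next = __rdtsc() + PERIOD_TICKS;
    for (;;) {
        do_the_magic();
        while (__rdtsc() < next)
            _mm_pause();                             // spin; no system calls, no sleeps
        next += PERIOD_TICKS;
    }
}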
Now, the question is how you're going to "read and process data" without system calls. It depends on your system. If you can do any user-space I/O (map the physical register addresses to your process address space, use DMA without interrupts, etc.) - you'll have a perfectly real-time processing. Otherwise, any system call will trigger a context switch, and latency of this context switch will be unpredictable.
For example, you can do this with certain Ethernet devices (SolarFlare, etc.), with 100% user-space drivers. For anything else you're likely to have to write your own user-space driver, or even implement your own interrupt-free device (e.g., if you're running on an FPGA SoC).
Problem
We are trying to implement a program that sends commands to a robot in a given cycle time. Thus this program should be a real-time application. We set up a PC with a PREEMPT_RT Linux kernel and are launching our programs with chrt -f 98 or chrt -rr 99 to define the scheduling policy and priority. Loading of the kernel and launching of the program seem to be fine and work (see details below).
Now we measured the time (CPU ticks) it takes to execute our program. We expected this time to be constant with very little variation. What we measured, though, were quite significant differences in computation time. Of course, we thought this could be undefined behavior in our rather complex program, so we created a very basic program and measured the time as well. The behavior was similarly bad.
Question
Why are we not measuring a (close to) constant computation time even for our basic program?
How can we solve this problem?
Environment Description
First of all, we installed an RT Linux Kernel on the PC using this tutorial. The main characteristics of the PC are:
PC Characteristics    Details
CPU                   Intel(R) Atom(TM) Processor E3950 @ 1.60GHz with 4 cores
Memory (RAM)          8 GB
Operating System      Ubuntu 20.04.1 LTS
Kernel                Linux 5.9.1-rt20 SMP PREEMPT_RT
Architecture          x86-64
Tests
The first time we detected this problem was when we were measuring the time it takes to execute this "complex" program with a single thread. We did a few tests with this program but also with a simpler one:
The CPU execution times
The wall time (the world real-time)
The difference (Wall time - CPU time) between them and the ratio (CPU time / Wall time).
We also did a latency test on the PC.
Latency Test
For this one, we followed this tutorial, and these are the results:
[Figures: latency test results for the generic kernel and for the RT kernel]
The processes are shown in htop with a priority of RT
Test Program - Complex
We called the function multiple times in the program and measured the time each takes. The results of the 2 tests are:
From this we observed that:
The first execution (around 0.28 ms) always takes longer than the second one (around 0.18 ms), but most of the time it is not the longest iteration.
The mode is around 0.17 ms.
For those that take 17 ms the difference is usually 0 and the ratio 1. Although this is not exclusive to this time. For these, it seems like only 1 CPU is being used and it is saturated (there is no waiting time).
When the difference is not 0, it is usually negative. This, from what we have read here and here, is because more than 1 CPU is being used.
Test Program - Simple
We did the same test but this time with a simpler program:
#include <vector>
#include <iostream>
#include <time.h>

int main(int argc, char** argv) {
    int iterations = 5000;
    double a = 5.5;
    double b = 5.5;
    double c = 4.5;
    std::vector<double> wallTime(iterations, 0);
    std::vector<double> cpuTime(iterations, 0);
    struct timespec beginWallTime, endWallTime, beginCPUTime, endCPUTime;

    std::cout << "Iteration | WallTime | cpuTime" << std::endl;

    for (unsigned int i = 0; i < iterations; i++) {
        // Start measuring time
        clock_gettime(CLOCK_REALTIME, &beginWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCPUTime);

        // Function
        a = b + c + i;

        // Stop measuring time and calculate the elapsed time
        clock_gettime(CLOCK_REALTIME, &endWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCPUTime);

        wallTime[i] = (endWallTime.tv_sec - beginWallTime.tv_sec) + (endWallTime.tv_nsec - beginWallTime.tv_nsec)*1e-9;
        cpuTime[i]  = (endCPUTime.tv_sec - beginCPUTime.tv_sec) + (endCPUTime.tv_nsec - beginCPUTime.tv_nsec)*1e-9;

        std::cout << i << " | " << wallTime[i] << " | " << cpuTime[i] << std::endl;
    }

    return 0;
}
Final Thoughts
We understand that:
If the ratio == number of CPUs used, they are saturated and there is no waiting time.
If the ratio < number of CPUs used, it means that there is some waiting time (theoretically we should only be using 1 CPU, although in practice we use more).
Of course, we can give more details.
Thanks a lot for your help!
Your function will almost certainly be optimized away, so you are just measuring how long it takes to read the clocks. And as you can see, that doesn't take very long, with some exceptions:
The very first time you run the code (unless you just compiled it) the pages need to be loaded from disk. If you are unlucky the code spans pages and you include the loading of the next page in the measured time. Quite unlikely given the code size.
On the first loop iteration, the code and any data need to be loaded into cache, so that iteration takes longer to execute. The branch predictor might also need a few iterations to predict the loop correctly, so the second and third iterations might be slightly longer too.
For everything else I think you can blame scheduling:
an IRQ happens but nothing gets rescheduled
the process gets paused while another process runs
the process gets moved to another CPU thread leaving the caches hot
the process gets moved to another CPU core making L1 cache cold but leaving L2/L3 caches hot (if your L2 is shared)
the process gets moved to a CPU on another socket making L1/L2 caches cold but L3 cache hot (if L3 is shared)
You can do little about IRQs. Some you can fix to specific cores but others are just essential (like the timer interrupt for the scheduler itself). You kind of just have to live with that.
But you can pin your program to a specific CPU and pin everything else to all the other cores, basically reserving that core for the real-time code. I guess you would have to use cgroups for this, to keep everything else off the chosen core. You might still get some kernel threads running on the reserved core; nothing you can do about that. But this should eliminate most of the large execution times.
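For example, pinning the measured process to one core could look roughly like this (a Linux-specific sketch using sched_setaffinity; keeping everything else off that core via cgroups or boot parameters is configured separately, and core 3 is an arbitrary choice):

#include <sched.h>    // sched_setaffinity, CPU_ZERO, CPU_SET (Linux-specific)
#include <cstdio>

bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means "the calling thread".
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return false;
    }
    return true;
}

int main() {
    pin_to_core(3);   // arbitrary core reserved for the measurement loop
    // ... run the timing loop here ...
    return 0;
}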
Why does the running time of the code below decrease when I increase kNumCacheLines?
In every iteration, the code modifies one of kNumCacheLines cachelines, writes the line to the DIMM with the clwb instruction, and blocks until the store hits the memory controller with sfence. This example requires Intel Skylake-server or newer Xeon, or IceLake client processors.
#include <stdlib.h>
#include <stdint.h>

#define clwb(addr) \
  asm volatile(".byte 0x66; xsaveopt %0" : "+m"(*(volatile char *)(addr)));

static constexpr size_t kNumCacheLines = 1;

int main() {
    uint8_t *buf = new uint8_t[kNumCacheLines * 64];
    size_t data = 0;

    for (size_t i = 0; i < 10000000; i++) {
        size_t buf_offset = (i % kNumCacheLines) * 64;
        buf[buf_offset] = data++;
        clwb(&buf[buf_offset]);
        asm volatile("sfence" ::: "memory");
    }

    delete [] buf;
}
(editor's note: _mm_sfence() and _mm_clwb(void*) would avoid needing inline asm, but this inline asm looks correct, including the "memory" clobber).
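For reference, a sketch of the same loop using those intrinsics instead of inline asm (assuming a compiler and CPU with CLWB support, e.g. building with -mclwb):

#include <immintrin.h>   // _mm_clwb, _mm_sfence
#include <stdint.h>
#include <stdlib.h>

static constexpr size_t kNumCacheLines = 1;

int main() {
    uint8_t *buf = new uint8_t[kNumCacheLines * 64];
    size_t data = 0;

    for (size_t i = 0; i < 10000000; i++) {
        size_t buf_offset = (i % kNumCacheLines) * 64;
        buf[buf_offset] = data++;
        _mm_clwb(&buf[buf_offset]);   // write the dirty line back toward memory
        _mm_sfence();                 // order the write-back before later stores
    }

    delete[] buf;
}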
Here are some performance numbers on my Skylake Xeon machine, reported by running time ./bench with different values of kNumCacheLines:
kNumCacheLines Time (seconds)
1 2.00
2 2.14
3 1.74
4 1.82
5 1.00
6 1.17
7 1.04
8 1.06
Intuitively, I would expect kNumCacheLines = 1 to give the best performance because of hits in the memory controller's write pending queue. But, it is one of the slowest.
As an explanation for the unintuitive slowdown, it is possible that while the memory controller is completing a write to a cache line, it blocks other writes to the same cache line. I suspect that increasing kNumCacheLines increases performance because of higher parallelism available to the memory controller. The running time jumps from 1.82 seconds to 1.00 seconds when kNumCacheLines goes from four to five. This seems to correlate with the fact that the memory controller's write pending queue has space for 256 bytes from a thread [https://arxiv.org/pdf/1908.03583.pdf, Section 5.3].
Note that because buf is smaller than 4 KB, all accesses use the same DIMM. (Assuming it's aligned so it doesn't cross a page boundary)
This is probably fully explained by Intel's CLWB instruction invalidating cache lines - turns out SKX runs clwb the same as clflushopt, i.e. it's a stub implementation for forward compatibility so persistent-memory software can start using it without checking CPU feature levels.
More cache lines means more memory-level parallelism in reloading the invalidated lines for the next store. Or that the flush part is finished before we try to reload. One or the other; there are a lot of details I don't have a specific explanation for.
In each iteration, you store a counter value into a cache line and clwb it. (and sfence). The previous activity on that cache line was kNumCacheLines iterations ago.
We were expecting that these stores could just commit into lines that were already in Exclusive state, but in fact they're going to be Invalid with eviction probably still in flight down the cache hierarchy, depending on exactly when sfence stalls, and for how long.
So each store needs to wait for an RFO (Read For Ownership) to get the line back into cache in Exclusive state before it can commit from the store buffer to L1d.
It seems that you're only getting a factor of 2 speedup from using more cache lines, even though Skylake(-X) has 12 LFBs (i.e. can track 12 in-flight cache lines incoming or outgoing). Perhaps sfence has something to do with that.
The big jump from 4 to 5 is surprising. (Basically two levels of performance, not a continuous transition). That lends some weight to the hypothesis that it's something to do with the store having made it all the way to DRAM before we try to reload, rather than having multiple RFOs in flight. Or at least casts doubt on the idea that it's just MLP for RFOs. CLWB forcing eviction is pretty clearly key, but the specific details of exactly what happens and why there's any speedup is just pure guesswork on my part.
A more detailed analysis might tell us something about microarchitectural details if anyone wants to do one. This hopefully isn't a very normal access pattern so probably we can just avoid doing stuff like this most of the time!
(Possibly related: apparently repeated writes to the same line of Optane DC PM memory are slower than sequential writes, so you don't want write-through caching or an access pattern like this on that kind of non-volatile memory either.)
I have been watching the growing visibility of functional programming languages and features for a while. I looked into them and didn't see the reason for the appeal.
Then, recently I attended Kevin Smith's "Basics of Erlang" presentation at Codemash.
I enjoyed the presentation and learned that a lot of the attributes of functional programming make it much easier to avoid threading/concurrency issues. I understand the lack of state and mutability makes it impossible for multiple threads to alter the same data, but Kevin said (if I understood correctly) that all communication takes place through messages and the messages are processed synchronously (again avoiding concurrency issues).
But I have read that Erlang is used in highly scalable applications (the whole reason Ericsson created it in the first place). How can it be efficient handling thousands of requests per second if everything is handled as a synchronously processed message? Isn't that why we started moving towards asynchronous processing - so we can take advantage of running multiple threads of operation at the same time and achieve scalability? It seems like this architecture, while safer, is a step backwards in terms of scalability. What am I missing?
I understand the creators of Erlang intentionally avoided supporting threading to avoid concurrency problems, but I thought multi-threading was necessary to achieve scalability.
How can functional programming languages be inherently thread-safe, yet still scale?
A functional language doesn't (in general) rely on mutating a variable. Because of this, we don't have to protect the "shared state" of a variable, because the value is fixed. This in turn avoids the majority of the hoop jumping that traditional languages have to go through to implement an algorithm across processors or machines.
Erlang takes it further than traditional functional languages by baking in a message passing system that allows everything to operate on an event based system where a piece of code only worries about receiving messages and sending messages, not worrying about a bigger picture.
What this means is that the programmer is (nominally) unconcerned that the message will be handled on another processor or machine: simply sending the message is good enough for it to continue. If it cares about a response, it will wait for it as another message.
The end result of this is that each snippet is independent of every other snippet. No shared code, no shared state and all interactions coming from a message system that can be distributed among many pieces of hardware (or not).
Contrast this with a traditional system: we have to place mutexes and semaphores around "protected" variables and code execution. We have tight binding in a function call via the stack (waiting for the return to occur). All of this creates bottlenecks that are less of a problem in a shared nothing system like Erlang.
EDIT: I should also point out that Erlang is asynchronous. You send your message and maybe/someday another message arrives back. Or not.
Spencer's point about out of order execution is also important and well answered.
The message queue system is cool because it effectively produces a "fire-and-wait-for-result" effect which is the synchronous part you're reading about. What makes this incredibly awesome is that it means lines do not need to be executed sequentially. Consider the following code:
r = methodWithALotOfDiskProcessing();
x = r + 1;
y = methodWithALotOfNetworkProcessing();
w = x * y;
Consider for a moment that methodWithALotOfDiskProcessing() takes about 2 seconds to complete and that methodWithALotOfNetworkProcessing() takes about 1 second to complete. In a procedural language this code would take about 3 seconds to run because the lines would be executed sequentially. We're wasting time waiting for one method to complete that could run concurrently with the other without competing for a single resource. In a functional language lines of code don't dictate when the processor will attempt them. A functional language would try something like the following:
Execute line 1 ... wait.
Execute line 2 ... wait for r value.
Execute line 3 ... wait.
Execute line 4 ... wait for x and y value.
Line 3 returned ... y value set, message line 4.
Line 1 returned ... r value set, message line 2.
Line 2 returned ... x value set, message line 4.
Line 4 returned ... done.
How cool is that? By going ahead with the code and only waiting where necessary we've reduced the waiting time to two seconds automagically! :D So yes, while the code is synchronous it tends to have a different meaning than in procedural languages.
EDIT:
Once you grasp this concept in conjunction with Godeke's post it's easy to imagine how simple it becomes to take advantage of multiple processors, server farms, redundant data stores and who knows what else.
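(For readers more at home in the C++ used elsewhere on this page, the same dataflow can be expressed explicitly with futures; this is a rough analogue only, not how Erlang implements it, using the hypothetical method names from the example above:)

#include <future>

int methodWithALotOfDiskProcessing();      // ~2 s, hypothetical
int methodWithALotOfNetworkProcessing();   // ~1 s, hypothetical

int compute() {
    // Start the two independent calls concurrently.
    auto rf = std::async(std::launch::async, methodWithALotOfDiskProcessing);
    auto yf = std::async(std::launch::async, methodWithALotOfNetworkProcessing);
    int x = rf.get() + 1;   // block only when the value is actually needed
    int y = yf.get();
    return x * y;           // total wait is ~2 s instead of ~3 s
}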
It's likely that you're mixing up synchronous with sequential.
The body of a function in Erlang is processed sequentially.
So what Spencer said about this "automagical effect" doesn't hold true for Erlang. You could model this behaviour with Erlang, though.
For example, you could spawn a process that calculates the number of words in a line.
As we have several lines, we spawn one such process for each line and receive the answers to calculate a sum from them.
That way, we spawn processes that do the "heavy" computations (utilizing additional cores if available) and later collect the results.
-module(countwords).
-export([count_words_in_lines/1]).

count_words_in_lines(Lines) ->
    % For each line in Lines run spawn_summarizer with the process id (pid)
    % and a line to work on as arguments.
    % This is a list comprehension and spawn_summarizer will return the pid
    % of the process that was created. So the variable Pids will hold a list
    % of process ids.
    Pids = [spawn_summarizer(self(), Line) || Line <- Lines],
    % For each pid receive the answer. This will happen in the same order in
    % which the processes were created, because we saved [pid1, pid2, ...] in
    % the variable Pids and now we consume this list.
    Results = [receive_result(Pid) || Pid <- Pids],
    % Sum up the results.
    WordCount = lists:sum(Results),
    io:format("We've got ~p words, Sir!~n", [WordCount]).

spawn_summarizer(S, Line) ->
    % Create an anonymous function and save it in the variable F.
    F = fun() ->
            % Split the line into words.
            ListOfWords = string:tokens(Line, " "),
            Length = length(ListOfWords),
            io:format("process ~p calculated ~p words~n", [self(), Length]),
            % Send a tuple containing our pid and Length to S.
            S ! {self(), Length}
        end,
    % There is no return in Erlang; instead the last value in a function is
    % returned implicitly.
    % Spawn the anonymous function and return the pid of the new process.
    spawn(F).

% The variable Pid gets bound in the function head.
% In Erlang, you can only assign to a variable once.
receive_result(Pid) ->
    receive
        % Pattern matching: the block behind "->" will execute only if we receive
        % a tuple that matches the one below. The variable Pid is already bound,
        % so we are waiting here for the answer of a specific process.
        % N is unbound so we accept any value.
        {Pid, N} ->
            io:format("Received \"~p\" from process ~p~n", [N, Pid]),
            N
    end.
And this is what it looks like, when we run this in the shell:
Eshell V5.6.5 (abort with ^G)
1> Lines = ["This is a string of text", "and this is another", "and yet another", "it's getting boring now"].
["This is a string of text","and this is another",
"and yet another","it's getting boring now"]
2> c(countwords).
{ok,countwords}
3> countwords:count_words_in_lines(Lines).
process <0.39.0> calculated 6 words
process <0.40.0> calculated 4 words
process <0.41.0> calculated 3 words
process <0.42.0> calculated 4 words
Received "6" from process <0.39.0>
Received "4" from process <0.40.0>
Received "3" from process <0.41.0>
Received "4" from process <0.42.0>
We've got 17 words, Sir!
ok
4>
The key thing that enables Erlang to scale is related to concurrency.
An operating system provides concurrency by two mechanisms:
operating system processes
operating system threads
Processes don't share state – one process can't crash another by design.
Threads share state – one thread can crash another by design – that's your problem.
With Erlang – one operating system process is used by the virtual machine, and the VM provides concurrency to Erlang programs not by using operating system threads but by providing Erlang processes – that is, Erlang implements its own timeslicer.
These Erlang processes talk to each other by sending messages (handled by the Erlang VM, not the operating system). The Erlang processes address each other using a process ID (PID) which has a three-part address <<N3.N2.N1>>:
process no N1 on
VM N2 on
physical machine N3
Two processes on the same VM, on different VMs on the same machine, or on two different machines communicate in the same way – your scaling is therefore independent of the number of physical machines you deploy your application on (to a first approximation).
Erlang is only threadsafe in a trivial sense – it doesn't have threads. (The language that is, the SMP/multi-core VM uses one operating system thread per core).
You may have a misunderstanding of how Erlang works. The Erlang runtime minimizes context-switching on a CPU, but if there are multiple CPUs available, then all are used to process messages. You don't have "threads" in the sense that you do in other languages, but you can have a lot of messages being processed concurrently.
Erlang messages are purely asynchronous; if you want a synchronous reply to your message, you need to code for that explicitly. What was possibly said is that messages in a process's mailbox are processed sequentially. Any message sent to a process sits in that process's mailbox, and the process picks one message from the mailbox, processes it, and then moves on to the next one, in the order it sees fit. This is a very sequential act, and the receive block does exactly that.
Looks like you have mixed up synchronous and sequential as chris mentioned.
Referential transparency: See http://en.wikipedia.org/wiki/Referential_transparency_(computer_science)
In a purely functional language, order of evaluation doesn't matter - in a function application fn(arg1, .. argn), the n arguments can be evaluated in parallel. That guarantees a high level of (automatic) parallelism.
Erlang uses a process model where a process can run in the same virtual machine, or on a different processor -- there is no way to tell. That is only possible because messages are copied between processes; there is no shared (mutable) state. Multi-processor parallelism goes a lot farther than multi-threading: since threads depend upon shared memory, there can only be 8 threads running in parallel on an 8-core CPU, while multi-processing can scale to thousands of parallel processes.
I am creating a test program to test the functionality of a program which calculates CPU utilization.
Now I want to test that program at different times when CPU utilization is 100%, 50%, 0%, etc.
My question is how to make the CPU utilization reach 100%, or maybe > 80%.
I think creating a while loop like this will suffice:
while (i++ < 2000)
{
    cout << " in while " << endl;
    Sleep(10); // sleep for 10 ms.
}
After running this I don't get high CPU utilization.
What would be possible solutions to make it CPU intensive?
You're right to use a loop, but:
You've got IO
You've got a sleep
Basically nothing in that loop is going to take very much CPU time compared with the time it's sleeping or waiting for IO.
To kill a CPU you need to give it just CPU stuff. The only tricky bit really is making sure the C++ compiler doesn't optimise away the loop. Something like this should probably be okay:
// A bit like generating a hashcode. Pretty arbitrary choice,
// but simple code which would be hard for the compiler to
// optimise away.
int running_total = 23;
for (int i = 0; i < some_large_number; i++)
{
    running_total = 37 * running_total + i;
}
return running_total;
Note the fact that I'm returning the value out of the loop. That should stop the C++ compiler from noticing that the loop is useless (if you never used the value anywhere, the loop would have no purpose). You may want to disable inlining too, as otherwise I guess there's a possibility that a smart compiler would notice you calling the function without using the return value, and inline it to nothing. (As Suma points out in another answer, storing the result in a volatile variable should keep the compiler from optimising the call away.)
Your loop mostly sleeps, which means it puts a very light load on the CPU. Besides the Sleep, be sure to include a loop performing some computation, like this (a Factorial implementation is left as an exercise to the reader; you may replace it with any other non-trivial function).
while (i++ < 2000)
{
    int sleepBalance = 10;     // increase this to reduce the CPU load
    int computeBalance = 1000; // increase this to increase the CPU load

    for (int i = 0; i < computeBalance; i++)
    {
        /* both volatiles are important to prevent compiler */
        /* optimizing out the function */
        volatile int n = 30;
        volatile int pretendWeNeedTheResult = Factorial(n);
    }

    Sleep(sleepBalance);
}
By adjusting sleepBalance / computeBalance you can adjust how much CPU this program takes. If you want to use this as a CPU load simulation, you might want to take a few additional steps:
on a multicore system, be sure to either spawn the loop like this in multiple threads (one for each CPU) or execute the process multiple times, and assign thread/process affinity explicitly to make the scheduling predictable (see the sketch after this list)
sometimes you may also want to increase the thread/process priority to simulate the environment where CPU is heavily loaded with high priority applications.
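A rough sketch of that multi-threaded variant on Windows (the core count handling, affinity masks and amount of work are illustrative; a real load simulator would make these configurable):

#include <windows.h>
#include <thread>
#include <vector>

// Busy work that the optimizer cannot remove, along the lines of the answers above.
volatile long long g_sink = 0;

void burn(int core) {
    // Pin this thread to one core so the load is spread predictably.
    SetThreadAffinityMask(GetCurrentThread(), 1ULL << core);
    long long running_total = 23;
    for (;;) {
        for (int i = 0; i < 1000000; ++i)
            running_total = 37 * running_total + i;
        g_sink = running_total;       // keep the result observable
    }
}

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> threads;
    for (unsigned c = 0; c < cores; ++c)
        threads.emplace_back(burn, c);
    for (auto &t : threads) t.join(); // never returns; kill the process to stop
}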
Use consume.exe in the Windows SDK.
Don't roll your own when someone else has already done the work and will give it to you for free.
If you call Sleep in your loop then most of the loop's time will be spent doing nothing (sleeping). This is why your CPU utilization is low: that 10 ms sleep is huge compared to the time the CPU will spend executing the rest of the code in each loop iteration. It is a non-trivial task to write code to accurately waste CPU time. Roger's suggestion of using CPU Burn-In is a good one.
I know that the "yes" command on UNIX systems, when routed to /dev/null, will eat up 100% of a single core (it doesn't thread). You can launch multiple instances of it to utilize each core. You could probably compile the "yes" code into your application and call it directly. You don't specify what C++ compiler you are using for Windows, but I am going to assume it has POSIX compatibility of some kind (a la Cygwin). If that's the case, "yes" should work fine.
To make a thread use a lot of CPU, make sure it doesn't block/wait. Your Sleep call will suspend the thread and not schedule it for at least the number of ms the Sleep call indicates, during which it will not use the CPU.
Get hold of a copy of CPU Burn-In.