Huge difference in MPI_Wtime() after using MPI_Barrier()?

This is the part of the code.
if(rank==0) {
temp=10000;
var=new char[temp] ;
MPI_Send(&temp,1,MPI_INT,1,tag,MPI_COMM_WORLD);
MPI_Send(var,temp,MPI_BYTE,1,tag,MPI_COMM_WORLD);
//MPI_Wait(&req[0],&sta[1]);
}
if(rank==1) {
MPI_Irecv(&temp,1,MPI_INT,0,tag,MPI_COMM_WORLD,&req[0]);
MPI_Wait(&req[0],&sta[0]);
var=new char[temp] ;
MPI_Irecv(var,temp,MPI_BYTE,0,tag,MPI_COMM_WORLD,&req[1]);
MPI_Wait(&req[0],&sta[0]);
}
//I am talking about this MPI_Barrier
MPI_Barrier(MPI_COMM_WORLD);
cout << MPI_Wtime()-t1 << endl ;
cout << "hello " << rank << " " << temp << endl ;
MPI_Finalize();
}
1. When using MPI_Barrier(): as expected, all processes take almost the same amount of time, on the order of 0.02 seconds.
2. When not using MPI_Barrier(): the root process (the one sending the message) waits for some extra time,
and (MPI_Wtime() - t1) varies a lot; the time taken by the root process is on the order of 2 seconds.
If I am not mistaken, MPI_Barrier() is only used to bring all running processes to the same point. So why isn't the time about 2 seconds for every process when I use MPI_Barrier() (i.e. the time taken by the root process)? Please explain.

Thanks to Wesley Bland for noticing that you are waiting twice on the same request. Here is an explanation of what actually happens.
There is something called progression of asynchronous (non-blocking) operations in MPI. That is when the actual transfer happens. Progression could happen in many different ways and at many different points within the MPI library. When you post an asynchronous operation, its progression could be deferred indefinitely, even until the point that one calls MPI_Wait, MPI_Test or some call that would result in new messages being pushed to or pulled from the transmit/receive queue. That's why it is very important to call MPI_Wait or MPI_Test as quickly as possible after the initiation of a non-blocking operation.
Open MPI supports a background progression thread that takes care to progress the operations even if the condition in the previous paragraph is not met, e.g. if MPI_Wait or MPI_Test is never called on the request handle. This has to be explicitly enabled when the library is being built. It is not enabled by default since background progression increases the latency of the operations.
What happens in your case is that you are waiting on the incorrect request the second time you call MPI_Wait in the receiver, therefore the progression of the second MPI_Irecv operation is postponed. The message is more than 40 KiB in size (10000 times 4 bytes + envelope overhead) which is above the default eager limit in Open MPI (32 KiB). Such messages are sent using the rendezvous protocol that requires both the send and the receive operations to be posted and progressed. The receive operation doesn't get progressed and hence the send operation in rank 0 blocks until at some point in time the clean-up routines that MPI_Finalize in rank 1 calls eventually progress the receive.
When you put the call to MPI_Barrier, it leads to the progression of the outstanding receive, acting almost like an implicit call to MPI_Wait. That's why the send in rank 0 completes quickly and both processes move on in time.
Note that MPI_Irecv, immediately followed by MPI_Wait is equivalent to simply calling MPI_Recv. The latter is not only simpler, but also less prone to simple typos like the one that you've made.

You're waiting on the same request twice for your Irecvs. The second one is the one that would take all of the time, and since it's getting skipped, rank 0 is getting way ahead.
MPI_Barrier can be implemented such that some processes can leave the algorithm before the rest of the processes enter it. That's probably what's happening here.

In the tests that I have run, I see almost no difference in the runtimes. The main difference being that you seem to be running your code one time whereas I looped over your code thousands of times then took the average. My output is below:
With the barrier
[0]: 1.65071e-05
[1]: 1.66872e-05
Without the barrier
[0]: 1.35653e-05
[1]: 1.30711e-05
So I would assume any variation you are seeing is a result of your operating system more than your program.
Also, why are you using MPI_Irecv coupled with an MPI_Wait rather than just using MPI_Recv?

Related

Best way to implement a periodic linux task in c++20

I have a periodic task in C++, running on an embedded Linux platform, that has to run at 5 ms intervals. It seems to be working as expected, but is my current solution good enough?
I have implemented the scheduler using sleep_until(), but some of the comments I have received say that setitimer() is better. As I would like the application to be at least somewhat portable, I would prefer the C++ standard library, unless of course there are other problems.
I have found plenty of sites that show an implementation with each, but I have not found any arguments for why one solution is better than the other. As I see it, sleep_until() will use an "optimal" implementation on any (supported) platform, and I get the feeling that the comments I have received are focused more on usleep() (which I do not use).
My implementation looks a little like this:
bool is_submilli_capable() {
return std::ratio_greater<std::milli,
std::chrono::system_clock::period>::value;
}
int main() {
if (not is_submilli_capable())
exit(1);
while (true) {
auto next_time = next_period_start();
do_the_magic();
std::this_thread::sleep_until(next_time);
}
}
A short summary of the issue:
I have an embedded Linux platform, built with Yocto and with RT capabilities
The application needs to read and process incoming data every 5 ms
Building with gcc 11.2.0
Using C++20
All the "hard work" is done in separate threads, so this question only concerns triggering the task periodically and with minimal jitter
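For reference, the next_period_start() helper used in the loop above is not shown in the question. A minimal sketch of one common way to write such a loop follows, assuming std::chrono::steady_clock as the time base and a hypothetical run_periodic() helper (neither appears in the original). Each deadline is derived from the previous deadline rather than from the current time, so a late wake-up does not shift all later ticks.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not from the question): calls `task` `iterations` times,
// once per `period`, and returns how many calls finished after their deadline.
template <typename Task>
int run_periodic(std::chrono::steady_clock::duration period, int iterations, Task task) {
    using Clock = std::chrono::steady_clock;  // monotonic, unaffected by wall-clock changes
    auto next_time = Clock::now() + period;
    int overruns = 0;
    for (int i = 0; i < iterations; ++i) {
        task();
        if (Clock::now() > next_time) {
            ++overruns;  // the task ran past the start of the next period
        }
        std::this_thread::sleep_until(next_time);  // returns immediately if already late
        next_time += period;  // advance from the deadline, not from now()
    }
    return overruns;
}
```

Calling run_periodic(std::chrono::milliseconds(5), n, do_the_magic) would correspond to the loop in the question. The overrun count is only a diagnostic; whether sleep_until() meets a 5 ms deadline with acceptable jitter still depends on the scheduler and kernel configuration.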

Since the application is supposed to read and process the data every 5 ms, it is possible that a few times, it does not perform the required operations. What I mean to say is that in a time interval of 20 ms, do_the_magic() is supposed to be invoked 4 times... But if the time taken to execute do_the_magic() is 10 ms, it will get invoked only 2 times. If that is an acceptable outcome, the current implementation is good enough.
Since the application is reading data, it probably receives it from the network or disk. And adding the overhead of processing it, it likely takes more than 5 ms to do so (depending on the size of the data). If it is not acceptable to miss out on any invocation of do_the_magic, the current implementation is not good enough.
What you could probably do is create a few threads. Each thread executes the do_the_magic function and then goes to sleep. Every 5 ms, you wake a sleeping thread which will most likely take less than 5 ms to happen. This way no invocation of do_the_magic is missed. Also, the number of threads depends on how long will do_the_magic take to execute.
bool is_submilli_capable() {
return std::ratio_greater<std::milli,
std::chrono::system_clock::period>::value;
}
void wake_some_thread () {
static int i = 0;
release_semaphore (i); // Release semaphore associated with thread i
i++;
i = i % NUM_THREADS;
}
void * thread_func (void * args) {
while (true) {
// Wait for the semaphore associated with this thread
do_the_magic();
}
}
int main() {
if (not is_submilli_capable())
exit(1);
while (true) {
auto next_time = next_period_start();
wake_some_thread (); // Releases a semaphore to wake a thread
std::this_thread::sleep_until(next_time);
}
}
Create as many semaphores as the number of threads where thread i is waiting for semaphore i. wake_some_thread can then release a semaphore starting from index 0 till NUM_THREADS and start again.

5ms is a pretty tight timing.
You can get a jitter-free 5ms tick only if you do the following:
Isolate a CPU for this thread. Configure it with nohz_full and rcu_nocbs
Pin your thread to this CPU, assign it a real-time scheduling policy (e.g., SCHED_FIFO)
Do not let any other threads run on this CPU core.
Do not allow any context switches in this thread. This includes avoiding system calls altogether. I.e., you cannot use std::this_thread::sleep_until(...) or anything else.
Do a busy wait in between processing (ensure 100% CPU utilisation)
Use lock-free communication to transfer data from this thread to other, non-real-time threads, e.g., for storing the data to files, accessing network, logging to console, etc.
Now, the question is how you're going to "read and process data" without system calls. It depends on your system. If you can do any user-space I/O (map the physical register addresses to your process address space, use DMA without interrupts, etc.) - you'll have a perfectly real-time processing. Otherwise, any system call will trigger a context switch, and latency of this context switch will be unpredictable.
For example, you can do this with certain Ethernet devices (SolarFlare, etc.), with 100% user-space drivers. For anything else you're likely to have to write your own user-space driver, or even implement your own interrupt-free device (e.g., if you're running on an FPGA SoC).

Why my std::atomic<int> variable isn't thread-safe?

I don't know why my code isn't thread-safe, as it outputs some inconsistent results.
value 48
value 49
value 50
value 54
value 51
value 52
value 53
My understanding of an atomic object is it prevents its intermediate state from exposing, so it should solve the problem when one thread is reading it and the other thread is writing it.
I used to think I could use std::atomic without a mutex to solve the multi-threading counter increment problem, and it didn't look like the case.
I have probably misunderstood what an atomic object is. Can someone explain?
void
inc(std::atomic<int>& a)
{
while (true) {
a = a + 1;
printf("value %d\n", a.load());
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}
}
int
main()
{
std::atomic<int> a(0);
std::thread t1(inc, std::ref(a));
std::thread t2(inc, std::ref(a));
std::thread t3(inc, std::ref(a));
std::thread t4(inc, std::ref(a));
std::thread t5(inc, std::ref(a));
std::thread t6(inc, std::ref(a));
t1.join();
t2.join();
t3.join();
t4.join();
t5.join();
t6.join();
return 0;
}

I used to think I could use std::atomic without a mutex to solve the multi-threading counter increment problem, and it didn't look like the case.
You can, just not the way you have coded it. You have to think about where the atomic accesses occur. Consider this line of code …
a = a + 1;
1. First the value of a is fetched atomically. Let's say the value fetched is 50.
2. We add one to that value, getting 51.
3. Finally we atomically store that value into a using the = operator.
4. a ends up being 51.
5. We atomically load the value of a by calling a.load().
6. We print the value we just loaded by calling printf().
So far so good. But between steps 1 and 3 some other threads may have changed the value of a - for example to the value 54. So, when step 3 stores 51 into a it overwrites the value 54 giving you the output you see.
As @Sopel and @Shawn suggest in the comments, you can atomically increment the value in a using one of the appropriate functions (like fetch_add) or operator overloads (like operator++ or operator+=). See the std::atomic documentation for details.
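A minimal sketch of the fetch_add approach, under some assumptions: the helper name count_with_fetch_add and the per-thread result vectors are illustrative and not part of the question's code, and printing is replaced by recording values so the result can be checked. Each thread keeps the value produced by its own increment instead of reloading a afterwards.

```cpp
#include <atomic>
#include <cassert>
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper: every thread does `per_thread` atomic increments and
// records the value its own increment produced. Returns all recorded values.
std::vector<int> count_with_fetch_add(int num_threads, int per_thread) {
    std::atomic<int> a{0};
    std::vector<std::vector<int>> seen(num_threads);  // one private vector per thread
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&a, &seen, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // fetch_add is a single indivisible read-modify-write and
                // returns the value held before the addition.
                int mine = a.fetch_add(1) + 1;
                seen[t].push_back(mine);  // no reload of `a`, so no other thread can interfere
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    std::vector<int> all;
    for (const auto& v : seen) {
        all.insert(all.end(), v.begin(), v.end());
    }
    return all;
}
```

With six threads doing 1000 increments each, every value from 1 to 6000 should appear exactly once in the result. The order in which threads would print those values is still unspecified.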
Update
I added steps 5 and 6 above. Those steps can also lead to results that may not look correct.
Between the store at step 3 and the call to a.load() at step 5, other threads can modify the contents of a. After our thread stores 51 in a at step 3, it may find that a.load() returns some different number at step 5. Thus the thread that set a to the value 51 may not pass the value 51 to printf().
Another source of problems is that nothing coordinates the execution of steps 5. and 6. between two threads. So, for example, imagine two threads X and Y running on a single processor. One possible execution order might be this …
Thread X executes steps 1 through 5 above incrementing a from 50 to 51 and getting the value 51 back from a.load()
Thread Y executes steps 1 through 5 above incrementing a from 51 to 52 and getting the value 52 back from a.load()
Thread Y executes printf() sending 52 to the console
Thread X executes printf() sending 51 to the console
We've now printed 52 on the console, followed by 51.
Finally, there's another problem lurking at step 6. because printf() doesn't make any promises about what happens if two threads call printf() at the same time (at least I don't think it does).
On a multiprocessor system threads X and Y above might call printf() at exactly the same moment (or within a few ticks of exactly the same moment) on two different processors. We can't make any prediction about which printf() output will appear first on the console.
Note The documentation for printf mentions a lock introduced in C++17 "… used to prevent data races when multiple threads read, write, position, or query the position of a stream." In the case of two threads simultaneously contending for that lock we still can't tell which one will win.

Besides the increment of a being done non-atomically, the fetch of the value to display after the increment is non-atomic with respect to the increment. It is possible that one of the other threads increments a after the current thread has incremented it but before the fetch of the value to display. This would possibly result in the same value being shown twice, with the previous value skipped.
Another issue here is that the threads do not necessarily run in the order they were created. A later thread (say thread 6) could execute its output before threads 3, 4, and 5, but after all four of those threads have incremented a. Since the thread that did the last increment displays its output earlier, you end up with the output not being sequential. This is more likely to happen on a system with fewer than six hardware threads available to run on.
Adding a small sleep between the various thread creates (e.g., sleep_for(10)) would make this less likely to occur, but would still not eliminate the possibility. The only sure way to keep the output ordered is to use some sort of exclusion (like a mutex) to ensure only one thread has access to the increment and output code, and treat both the increment and output code as a single transaction that must run together before another thread tries to do an increment.
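A minimal sketch of the "single transaction" idea, assuming a hypothetical OrderedCounter type and a vector standing in for the console (both are illustrative, not from the question). The increment and the output happen under the same lock, so no other thread can get between them.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical type: the increment and the "output" are one critical section.
struct OrderedCounter {
    std::mutex m;
    int value = 0;         // plain int is enough, it is only touched under the lock
    std::vector<int> log;  // stands in for the console output

    void inc_and_report() {
        std::lock_guard<std::mutex> lock(m);
        ++value;
        log.push_back(value);  // in the question's code this would be the printf call
    }
};

// Runs `num_threads` threads that each call inc_and_report() `per_thread` times.
std::vector<int> run_ordered(int num_threads, int per_thread) {
    OrderedCounter c;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&c, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                c.inc_and_report();
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return c.log;
}
```

The recorded values come out strictly as 1, 2, 3, and so on, at the cost of fully serializing the threads while they are inside inc_and_report().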

The other answers point out the non-atomic increment and various problems. I mostly want to point out some interesting practical details about exactly what we see when running this code on a real system. (x86-64 Arch Linux, gcc9.1 -O3, i7-6700k 4c8t Skylake).
It can be useful to understand why certain bugs or design choices lead to certain behaviours, for troubleshooting / debugging.
Use int tmp = ++a; to capture the fetch_add result in a local variable instead of reloading it from the shared variable. (And as 1202ProgramAlarm says, you might want to treat the whole increment and print as an atomic transaction if you insist on having your counts printed in order as well as being done properly.)
Or you might want to have each thread record the values it saw in a private data structure to be printed later, instead of also serializing threads with printf during the increments. (In practice all trying to increment the same atomic variable will serialize them waiting for access to the cache line; ++a will go in order so you can tell from the modification order which thread went in which order.)
Fun fact: a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_release) is what you might do for a variable that was only written by 1 thread, but read by multiple threads. You don't need an atomic RMW because no other thread ever modifies it. You just need a thread-safe way to publish updates. (Or better, in a loop keep a local counter and just .store() it without loading from the shared variable.)
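A minimal sketch of that single-writer pattern, assuming a hypothetical publish_and_observe() helper written for illustration only. The writer keeps a local counter and only stores it; the reader only loads. No atomic read-modify-write is involved.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one writer publishes 1..final_value, one reader polls.
// Returns the last value the reader saw, or -1 if it ever saw the value go backwards.
int publish_and_observe(int final_value) {
    std::atomic<int> a{0};
    std::thread writer([&a, final_value] {
        for (int local = 1; local <= final_value; ++local) {
            // Only this thread writes `a`, so a plain store is sufficient.
            a.store(local, std::memory_order_release);
        }
    });
    int last = 0;
    bool monotonic = true;
    while (last < final_value) {
        int now = a.load(std::memory_order_acquire);
        if (now < last) {
            monotonic = false;  // would indicate a lost or reordered update
        }
        last = now;
    }
    writer.join();
    return monotonic ? last : -1;
}
```

The reader may skip intermediate values, which is expected: it observes whatever was most recently published, not every update. This only works while there is exactly one writer.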
If you used the default a = ... for a sequentially-consistent store, you might as well have done an atomic RMW on x86. One good way to compile that is with an atomic xchg, or mov+mfence is as expensive (or more).
What's interesting is that despite the massive problems with your code, no counts were lost or stepped on (no duplicate counts), merely printing reordered. So in practice the danger wasn't encountered because of other effects going on.
I tried it on my own machine and did lose some counts. But after removing the sleep, I just got reordering. (I copy-pasted about 1000 lines of the output into a file, and sort -u to uniquify the output didn't change the line count. It did move some late prints around though; presumably one thread got stalled for a while.) My testing didn't check for the possibility of lost counts, skipped by not saving the value being stored into a, and instead reloading it. I'm not sure there's a plausible way for that to happen here without multiple threads reading the same count, which would be detected.
Store + reload, even a seq-cst store which has to flush the store buffer before it can reload, is very fast compared to printf making a write() system call. (The format string includes a newline and I didn't redirect output to a file so stdout is line-buffered and can't just append the string to a buffer.)
(write() system calls on the same file descriptor are serializing in POSIX: write(2) is atomic. Also, printf(3) itself is thread-safe on GNU/Linux, as required by C++17, and probably by POSIX long before that.)
Stdio locking in printf happens to be enough serialization in almost all cases: the thread that just unlocked stdout and left printf can do the atomic increment and then try to take the stdout lock again.
The other threads were all blocked trying to take the lock on stdout. One (other?) thread can wake up and take the lock on stdout, but for its increment to race with the other thread it would have to enter and leave printf and load a the first time before that other thread commits its a = ... seq-cst store.
This does not mean it's actually safe
Just that testing this specific version of the program (at least on x86) doesn't easily reveal the lack of safety. Interrupts or scheduling variations, including competition from other things running on the same machine, certainly could block a thread at just the wrong time.
My desktop has 8 logical cores so there were enough for every thread to get one, not having to get descheduled. (Although normally that would tend to happen on I/O or when waiting on a lock anyway).
With the sleep there, it is not unlikely for multiple threads to wake up at nearly the same time and race with each other in practice on real x86 hardware. It's so long that timer granularity becomes a factor, I think. Or something like that.
Redirecting output to a file
With stdout open on a non-TTY file, it's full-buffered instead of line-buffered, and doesn't always make a system call while holding the stdout lock.
(I got a 17MiB file in /tmp from hitting control-C a fraction of a second after running ./a.out > output.)
This makes it fast enough for threads to actually race with each other in practice, showing the expected bugs of duplicate values. (A thread reads a but loses ownership of the cache line before it stores (tmp)+1, resulting in two or more threads doing the same increment. And/or multiple threads reading the same value when they reload a after flushing their store buffer.)
1228589 unique lines (sort -u | wc) but total output of
1291035 total lines. So ~5% of the output lines were duplicates.
I didn't check if it was usually one value duplicated multiple times or if it was usually only one duplicate. Or how far backward the value ever jumped. If a thread happened to be stalled by an interrupt handler after loading but before storing val+1, it could be quite far. Or if it actually slept or blocked for some reason, it could rewind indefinitely far.

Multi threaded program on multi-core platform not printing output (std::cout) if I run a tight poll loop

As the title states, I have multiple threads, two of them being tight poll loops (I need polling) with regular sleeps: 1 second of sleep after every 10 seconds.
Program has multiple interim updates to be printed with:
std::cout << "progress report text" << std::endl;
Body of thread that polls, pretty much looks like:
void PollHardwareFunction ()
{
lastTimeSlept = std::chrono::HighResClock::now();
while (!stopSignal)
{
poll_hardware();
// Process the data received from hardware
if (std::chrono::HighResClock::now() - lastTimeSlept > std::chrono::seconds(10))
{
std::this_thread::sleep_for(std::chrono::seconds(1));
auto lastTimeSlept = std::chrono::HighResClock::now();
}
}
}
Other threads are pretty normal that do few logical steps and prints status after each step.
void LongRunningFunction ()
{
int dataCounter = 0;
while (wait_for_data_from_hardware_in_concurrent_queue)
{
std::cout << "Data received: " << dataCounter++ << std::endl;
// Process the data received from hardware
std::cout << "STEP1 done." << std::endl;
std::cout << "STEP2 done." << std::endl;
std::cout << "STEP3 done." << std::endl;
}
}
This prints all messages as expected, but only in bulk after 10 seconds, making the program look non-responsive or stuck during those 10 seconds.
Program is run on following environment:
Compiled with GCC 6.2, run on RHEL 7, an 8 core CPU.
I notice that the program prints on the console only when the spinning threads go to sleep/idle. Once the busy threads go to sleep, all of the prints appear on my output console together. To add to it, data received from hardware is regular - say every 100 milliseconds.
With several CPU cores free, why does the program stay in a non-responsive state until the spinning threads stop or pause?

From your comments:
My program is bit better structured - it uses atomic variables and some of lockfree data structures I have implemented.
and
poll_hardware is a function from the hardware vendor's API that reads hardware buffer and pushes data into a concurrent queue.
That sounds dubious. Did you write your own data structure or did you use an existing one? Regardless, please post the code for the queue. If the queue was provided by the vendor, please post the API.
my perspective here is to understand what can cause the programs output remain (feels) stuck where as std::cout << operator is executed (completed execution) with std::endl?
You don't call cout from PollHardwareFunction(), so the issue MUST be from wait_for_data_from_hardware_in_concurrent_queue blocking when it's not supposed to. (If you want to be sure, switch cout to cerr to avoid buffering writes.)
The first thing I would check is whether poll_hardware() is dominating a lock by re-locking as soon as it releases. You may have created what is effectively a spin-lock. This is why user Snps suggested sleeping for 1 ms in the comments. One yield is not enough. I understand that your data is time critical, but you said 100 ms, so theoretically you could poll every 50 ms and be fine. A few ms should be totally OK for debugging purposes.
Lock dominating can be both caused by and solved with a reader/writer lock. Reader/writer locks need to be custom designed with the characteristics of the situation in mind. (how many threads are reading vs writing? how often do reads vs writes occur?)
The second thing I would check are your assumptions about sequential programming and memory caching in your lock-free data structures. Loads and stores can be delayed, rearranged, buffered, etc. as an optimization. Everyone is your "frienemy"--the compiler will do this, then the OS will do it, the CPU will take its turn, and then hardware will do it too.
To prevent this, you have to use a memory barrier (aka memory fence) to keep any of your frienemies from optimizing memory accesses. FYI, mutexes use memory barriers in their implementation. A quick way to see if this fixes your problem is to make your shared variables volatile. HOWEVER, don't trust volatile. It only keeps the compiler from reordering your commands, not necessarily the OS or CPU (depending on compiler, naturally).
It would be good to know about some of your other atomic variables, because there could be a logic bug there.
Lastly, your use of auto here defines a new block-scoped variable lastTimeSlept that shadows the "actual" lastTimeSlept:
if (std::chrono::HighResClock::now() - lastTimeSlept > std::chrono::seconds(10))
{
std::this_thread::sleep_for(std::chrono::seconds(1));
auto lastTimeSlept = std::chrono::HighResClock::now();
}
yikes! I don't think that's causing your issue, though.
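With the shadowing declaration, the outer lastTimeSlept is never updated, so once 10 seconds have passed the condition stays true on every iteration. A minimal sketch of the intended bookkeeping follows, assuming std::chrono::steady_clock in place of the question's HighResClock and a hypothetical SleepSchedule class (neither is from the original code). The point is only that the existing variable is assigned rather than redeclared.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: remembers when the polling loop last slept.
class SleepSchedule {
public:
    using Clock = std::chrono::steady_clock;

    SleepSchedule(Clock::time_point start, Clock::duration interval)
        : last_time_slept_(start), interval_(interval) {}

    // True once more than `interval` has elapsed since the last recorded sleep.
    bool due(Clock::time_point now) const {
        return now - last_time_slept_ > interval_;
    }

    // Plain assignment to the member. Writing `auto last_time_slept_ = now;`
    // here would declare a new local variable and leave the member untouched.
    void mark_slept(Clock::time_point now) {
        last_time_slept_ = now;
    }

private:
    Clock::time_point last_time_slept_;
    Clock::duration interval_;
};
```

In the polling loop this would be used as: if due(now) then sleep for one second and call mark_slept() with a fresh timestamp.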

timers, threads and compiler misbehaviour

I'm having trouble with something and couldn't find any answers about it, as I don't even know what to search for. I have written a timer class using QueryPerformanceCounter. From my application, I launch a second thread object that has its own timer instance, and I just have an infinite loop that gets the delta time from the timer and uses it to output the number of loop iterations per second.
I noticed that it was giving me weird values, so I started printing the delta time and found out it was sometimes coming out as 0. So I went inside the method that returns the delta time and did some testing. This is my deltaTime() method:
double MyTimer2::deltaTime()
{
LARGE_INTEGER timenow;
QueryPerformanceCounter(&timenow);
//std::cout << "timenow=" << (double)timenow.QuadPart << " currentticks=" << (double)m_currentTicks.QuadPart << std::endl;
double m_deltaTime = (double)(timenow.QuadPart - m_currentTicks.QuadPart) /* 1000.0*/ / (double)m_frequency.QuadPart;
m_currentTicks = timenow;
if(m_deltaTime < 0.000001)
return 0.0;
return m_deltaTime;
}
So, I put a breakpoint on "return 0.0;" and what happens is that it gets there most of the time, which is not correct. However, if I uncomment the printing code and run, I will never stop on the breakpoint. So in theory, my printing code is making it work correctly, whereas if I remove it, things stop working as they should! How is this possible, why is it happening and how can I fix it? I've tried _ReadWriteBarrier() unsuccessfully.
Thanks in advance!
EDIT: I need a high-resolution timer for physics simulation!

A couple processor generations ago, QueryPerformanceCounter() would read the CPU's cycle counter (e.g. rdtsc). Using this method, the number of ticks from successive reads would never be zero. The resolution was equal to the CPU clock rate, e.g. 3 GHz.
Modern processors have two characteristics which make the cycle counter useless for timing. First, you have multiple cores, which each have their own cycle counter. Threads can migrate between cores, and if you read the cycle counter from two different cores, the difference would not be related to elapsed time. It could even be negative. Secondly, you have dynamic clocking based on load (both underclocking to save power and overclocking for performance). Intel calls these "SpeedStep" and "Turbo Boost", respectively. When the cycle rate isn't fixed, there's no way to convert from ticks to time.
So, QueryPerformanceCounter now uses a dedicated piece of hardware called the High Precision Event Timer (HPET), with a resolution of several MHz. Importantly, there's only one regardless of how many cores you have, and it doesn't change speed dynamically. But, since the resolution is lower, it is now possible to read it twice between ticks, in which case you'll get an elapsed time reported as zero.
In practice, this isn't a problem. If you need timing more precise than what the HPET can provide, then a general purpose computer is not suitable for you. Timing in the nanosecond range will be severely affected by interrupts.

What could possibly be the purpose of this block?
if(m_deltaTime < 0.000001)
return 0.0;
It has no value, it simply screws with the results, telling you the time was zero when it actually wasn't.
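A minimal sketch of a delta timer without that clamp. Note the substitution: it uses the portable std::chrono::steady_clock rather than QueryPerformanceCounter, and the DeltaTimer name is hypothetical. The measured difference is returned as-is, so a very small or zero delta is passed on to the caller instead of being rewritten.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical portable counterpart of the question's timer class.
class DeltaTimer {
public:
    using Clock = std::chrono::steady_clock;  // monotonic, never goes backwards

    DeltaTimer() : last_(Clock::now()) {}

    // Seconds elapsed since the previous call (or since construction).
    // The result may legitimately be 0.0 if the clock has not ticked in between.
    double delta_seconds() {
        const Clock::time_point now = Clock::now();
        const std::chrono::duration<double> dt = now - last_;
        last_ = now;
        return dt.count();
    }

private:
    Clock::time_point last_;
};
```

A caller that divides by the delta (for example to compute iterations per second) needs to handle the zero case itself, for instance by accumulating deltas over many iterations before dividing.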

First of all, your timer is wrong: it uses the CPU intensively. On a single-core machine it will slow down the whole system. If you want to create a timer and you target Windows, you can use the timer functions.
Second, every non-negative value returned by your deltaTime() function is valid. Unless you are running on a real-time operating system, any operation can take an arbitrary amount of time. One iteration can take tens of processor cycles, or tens of years. Nothing guarantees otherwise.
Third, about the experimental results. It seems that if a context switch happens between two consecutive time measurements, you get a value of about 0.016 s; if not, you get a value below 0.000001 s, which is floored to 0 s.
As has been said, printing to the console is a relatively heavy operation, and you effectively always get a context switch when you enable it.
EDIT
While QueryPerformanceCounter seems to offer great resolution, it is a trap. You will never actually get a high-resolution timer unless you work on a real-time OS.

How/why do functional languages (specifically Erlang) scale well?

I have been watching the growing visibility of functional programming languages and features for a while. I looked into them and didn't see the reason for the appeal.
Then, recently I attended Kevin Smith's "Basics of Erlang" presentation at Codemash.
I enjoyed the presentation and learned that a lot of the attributes of functional programming make it much easier to avoid threading/concurrency issues. I understand the lack of state and mutability makes it impossible for multiple threads to alter the same data, but Kevin said (if I understood correctly) all communication takes place through messages and the messages are processed synchronously (again avoiding concurrency issues).
But I have read that Erlang is used in highly scalable applications (the whole reason Ericsson created it in the first place). How can it be efficient handling thousands of requests per second if everything is handled as a synchronously processed message? Isn't that why we started moving towards asynchronous processing - so we can take advantage of running multiple threads of operation at the same time and achieve scalability? It seems like this architecture, while safer, is a step backwards in terms of scalability. What am I missing?
I understand the creators of Erlang intentionally avoided supporting threading to avoid concurrency problems, but I thought multi-threading was necessary to achieve scalability.
How can functional programming languages be inherently thread-safe, yet still scale?

A functional language doesn't (in general) rely on mutating a variable. Because of this, we don't have to protect the "shared state" of a variable, because the value is fixed. This in turn avoids the majority of the hoop jumping that traditional languages have to go through to implement an algorithm across processors or machines.
Erlang takes it further than traditional functional languages by baking in a message passing system that allows everything to operate on an event based system where a piece of code only worries about receiving messages and sending messages, not worrying about a bigger picture.
What this means is that the programmer is (nominally) unconcerned that the message will be handled on another processor or machine: simply sending the message is good enough for it to continue. If it cares about a response, it will wait for it as another message.
The end result of this is that each snippet is independent of every other snippet. No shared code, no shared state, and all interactions coming from a message system that can be distributed among many pieces of hardware (or not).
Contrast this with a traditional system: we have to place mutexes and semaphores around "protected" variables and code execution. We have tight binding in a function call via the stack (waiting for the return to occur). All of this creates bottlenecks that are less of a problem in a shared nothing system like Erlang.
EDIT: I should also point out that Erlang is asynchronous. You send your message and maybe/someday another message arrives back. Or not.
Spencer's point about out of order execution is also important and well answered.

The message queue system is cool because it effectively produces a "fire-and-wait-for-result" effect which is the synchronous part you're reading about. What makes this incredibly awesome is that it means lines do not need to be executed sequentially. Consider the following code:
r = methodWithALotOfDiskProcessing();
x = r + 1;
y = methodWithALotOfNetworkProcessing();
w = x * y;
Consider for a moment that methodWithALotOfDiskProcessing() takes about 2 seconds to complete and that methodWithALotOfNetworkProcessing() takes about 1 second to complete. In a procedural language this code would take about 3 seconds to run because the lines would be executed sequentially. We're wasting time waiting for one method to complete that could run concurrently with the other without competing for a single resource. In a functional language lines of code don't dictate when the processor will attempt them. A functional language would try something like the following:
Execute line 1 ... wait.
Execute line 2 ... wait for r value.
Execute line 3 ... wait.
Execute line 4 ... wait for x and y value.
Line 3 returned ... y value set, message line 4.
Line 1 returned ... r value set, message line 2.
Line 2 returned ... x value set, message line 4.
Line 4 returned ... done.
How cool is that? By going ahead with the code and only waiting where necessary we've reduced the waiting time to two seconds automagically! :D So yes, while the code is synchronous it tends to have a different meaning than in procedural languages.
EDIT:
Once you grasp this concept in conjunction with Godeke's post it's easy to imagine how simple it becomes to take advantage of multiple processors, server farms, redundant data stores and who knows what else.

It's likely that you're mixing up synchronous with sequential.
The body of a function in Erlang is processed sequentially.
So what Spencer said about this "automagical effect" doesn't hold true for Erlang. You could model this behaviour with Erlang though.
For example, you could spawn a process that calculates the number of words in a line.
Since we have several lines, we spawn one such process for each line and receive the answers to calculate a sum from them.
That way, we spawn processes that do the "heavy" computations (utilizing additional cores if available) and later we collect the results.
-module(countwords).
-export([count_words_in_lines/1]).

count_words_in_lines(Lines) ->
    % For each line in lines run spawn_summarizer with the process id (pid)
    % and a line to work on as arguments.
    % This is a list comprehension and spawn_summarizer will return the pid
    % of the process that was created. So the variable Pids will hold a list
    % of process ids.
    Pids = [spawn_summarizer(self(), Line) || Line <- Lines],
    % For each pid receive the answer. This will happen in the same order in
    % which the processes were created, because we saved [pid1, pid2, ...] in
    % the variable Pids and now we consume this list.
    Results = [receive_result(Pid) || Pid <- Pids],
    % Sum up the results.
    WordCount = lists:sum(Results),
    io:format("We've got ~p words, Sir!~n", [WordCount]).

spawn_summarizer(S, Line) ->
    % Create an anonymous function and save it in the variable F.
    F = fun() ->
        % Split line into words.
        ListOfWords = string:tokens(Line, " "),
        Length = length(ListOfWords),
        io:format("process ~p calculated ~p words~n", [self(), Length]),
        % Send a tuple containing our pid and Length to S.
        S ! {self(), Length}
    end,
    % There is no return in Erlang, instead the last value in a function is
    % returned implicitly.
    % Spawn the anonymous function and return the pid of the new process.
    spawn(F).

% The variable Pid gets bound in the function head.
% In Erlang, you can only assign to a variable once.
receive_result(Pid) ->
    receive
        % Pattern-matching: the block behind "->" will execute only if we receive
        % a tuple that matches the one below. The variable Pid is already bound,
        % so we are waiting here for the answer of a specific process.
        % N is unbound so we accept any value.
        {Pid, N} ->
            io:format("Received \"~p\" from process ~p~n", [N, Pid]),
            N
    end.
And this is what it looks like when we run this in the shell:
Eshell V5.6.5 (abort with ^G)
1> Lines = ["This is a string of text", "and this is another", "and yet another", "it's getting boring now"].
["This is a string of text","and this is another",
"and yet another","it's getting boring now"]
2> c(countwords).
{ok,countwords}
3> countwords:count_words_in_lines(Lines).
process <0.39.0> calculated 6 words
process <0.40.0> calculated 4 words
process <0.41.0> calculated 3 words
process <0.42.0> calculated 4 words
Received "6" from process <0.39.0>
Received "4" from process <0.40.0>
Received "3" from process <0.41.0>
Received "4" from process <0.42.0>
We've got 17 words, Sir!
ok
4>
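The same spawn-then-collect idea can be applied to the disk/network example from the earlier answer. This is only a sketch: disk_work/0 and network_work/0 are hypothetical stand-ins that sleep for a short while, and async/1 and await/1 are helper names invented here, not library functions.

```erlang
-module(overlap).
-export([run/0]).

%% Hypothetical stand-ins for the slow calls (sleeps shortened to milliseconds).
disk_work() -> timer:sleep(200), 10.
network_work() -> timer:sleep(100), 3.

%% Run Fun in its own process; the returned reference identifies the reply.
async(Fun) ->
    Caller = self(),
    Ref = make_ref(),
    spawn(fun() -> Caller ! {Ref, Fun()} end),
    Ref.

%% Block until the reply tagged with Ref arrives.
await(Ref) ->
    receive {Ref, Value} -> Value end.

run() ->
    %% Both slow calls are started before either result is needed.
    RRef = async(fun disk_work/0),
    YRef = async(fun network_work/0),
    R = await(RRef),
    X = R + 1,
    Y = await(YRef),
    X * Y.
```

The body of run/0 still executes line by line. The overlap comes only from the two explicitly spawned processes, so the total wait is roughly that of the slower call rather than the sum of both.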

The key thing that enables Erlang to scale is related to concurrency.
An operating system provides concurrency by two mechanisms:
operating system processes
operating system threads
Processes don't share state – one process can't crash another by design.
Threads share state – one thread can crash another by design – that's your problem.
With Erlang, the virtual machine runs in one operating system process, and the VM provides concurrency to Erlang programs not by using operating system threads but by providing Erlang processes. In other words, Erlang implements its own timeslicer.
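These Erlang processes talk to each other by sending messages (handled by the Erlang VM, not the operating system). The Erlang processes address each other using a process ID (PID), which the shell prints as a three-part value <A.B.C>:
A identifies the node (the VM instance, possibly on another machine) the process lives on, with 0 meaning the local node
B and C together identify the process within that node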
Two processes on the same VM, on different VMs on the same machine, or on two different machines all communicate in the same way, so your scaling is (to a first approximation) independent of the number of physical machines you deploy your application on.
Erlang is only thread-safe in a trivial sense: it doesn't have threads. (The language, that is; the SMP/multi-core VM uses one operating system thread per core.)
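To get a feel for how cheap Erlang processes are compared with operating system threads, here is a small sketch. The module and function names are invented for illustration; erlang:system_info(schedulers) reports how many scheduler threads the VM is running.

```erlang
-module(many_procs).
-export([run/1]).

%% Spawn N Erlang processes, each of which squares one number and
%% sends the result back, then add the results up.
run(N) ->
    Parent = self(),
    Pids = [spawn(fun() -> Parent ! {self(), I * I} end)
            || I <- lists:seq(1, N)],
    Sum = lists:sum([collect(Pid) || Pid <- Pids]),
    %% Report the number of scheduler threads next to the sum.
    {erlang:system_info(schedulers), Sum}.

%% Wait for the answer of one specific process.
collect(Pid) ->
    receive {Pid, Square} -> Square end.
```

Calling many_procs:run(10000) creates ten thousand processes, while the first element of the returned tuple will typically be about the number of cores on the machine.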

You may have a misunderstanding of how Erlang works. The Erlang runtime minimizes context-switching on a CPU, but if there are multiple CPUs available, then all are used to process messages. You don't have "threads" in the sense that you do in other languages, but you can have a lot of messages being processed concurrently.

Erlang messages are purely asynchronous; if you want a synchronous reply to your message, you need to explicitly code for that. What was possibly said was that the messages in a process's message box are processed sequentially. Any message sent to a process sits in that process's message box, and the process picks one message from the box, processes it and then moves on to the next one, in the order it sees fit. This is a very sequential act, and the receive block does exactly that.
Looks like you have mixed up synchronous and sequential as chris mentioned.
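A small sketch of that mailbox behaviour follows. The module name and the low/high tags are made up for illustration, and the process simply sends messages to itself to keep the example short.

```erlang
-module(mailbox_demo).
-export([run/0]).

run() ->
    Self = self(),
    %% Each send returns immediately; the messages queue up in the mailbox.
    Self ! {low, 1},
    Self ! {high, 2},
    Self ! {low, 3},
    %% The process decides which message to take next: here the
    %% 'high' message is picked first although it arrived second.
    First = receive {high, H} -> H end,
    %% Messages that match the same pattern come out in arrival order.
    Second = receive {low, L1} -> L1 end,
    Third = receive {low, L2} -> L2 end,
    [First, Second, Third].
```

Each receive handles exactly one message and only then does the process move on, which is the sequential part. Nothing about the sends themselves is synchronous.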

Referential transparency: See http://en.wikipedia.org/wiki/Referential_transparency_(computer_science)

In a purely functional language, order of evaluation doesn't matter - in a function application fn(arg1, .. argn), the n arguments can be evaluated in parallel. That guarantees a high level of (automatic) parallelism.
Erlang uses a process model where a process can run in the same virtual machine or on a different processor; there is no way to tell. That is only possible because messages are copied between processes and there is no shared (mutable) state. Multi-processor parallelism goes a lot farther than multi-threading: since threads depend upon shared memory, there can only be 8 threads running in parallel on an 8-core CPU, while multi-processing can scale to thousands of parallel processes.
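Erlang itself does not evaluate function arguments in parallel, but when the function being applied has no side effects the parallelism is easy to request by hand. Below is a minimal sketch of a parallel map. The name pmap/2 is an assumption made for this example, not a standard library function.

```erlang
-module(pmap_demo).
-export([pmap/2]).

%% Apply F to every item in its own process and return the results
%% in the original order. This is only safe to reason about when F
%% is free of side effects, so evaluation order cannot change the result.
pmap(F, Items) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, F(X)} end),
                Ref
            end || X <- Items],
    %% Collect by reference, so the result order matches the input order
    %% no matter which process finishes first.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

For example, pmap_demo:pmap(fun(X) -> X * X end, [1, 2, 3]) evaluates the three squares in separate processes and still returns them in input order.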
