Fork and core dump with threads - c++

Similar points to the one in this question have been raised before here and here, and I'm aware of the Google coredump library (which I've appraised and found lacking, though I might try and work on that if I understand the problem better).
I want to obtain a core dump of a running Linux process without interrupting the process. The natural approach is to say:
if (!fork()) { abort(); }
Since the forked process gets a fixed snapshot copy of the original process's memory, I should get a complete core dump, and since the copy uses copy-on-write, it should generally be cheap. However, a critical shortcoming of this approach is that fork() only forks the current thread, and all other threads of the original process won't exist in the forked copy.
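Fleshed out a little, that approach might look like the sketch below (the rlimit and signal-disposition details are my own assumptions about what a robust version would need, not part of the one-liner):
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

void dump_snapshot(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: its memory is a copy-on-write snapshot of the parent. */
        struct rlimit unlimited = { RLIM_INFINITY, RLIM_INFINITY };
        setrlimit(RLIMIT_CORE, &unlimited);   /* make sure a core can be written */
        signal(SIGABRT, SIG_DFL);             /* default action dumps core */
        abort();                              /* child dies, dumping the snapshot */
    }
    waitpid(pid, NULL, 0);   /* reap the child; the parent was never stopped */
}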
My question is whether it is possible to somehow obtain the relevant data of the other, original threads. I'm not entirely sure how to approach this problem, but here are a couple of sub-questions I've come up with:
Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
Is it possible to (quickly) enumerate all the running threads in the original process and store the addresses of the bases of their stacks? As I understand it, the base of a thread's stack on Linux contains a pointer to the kernel's thread bookkeeping data, so...
with the stored thread base addresses, could you read out the relevant data for each of the original threads in the forked process?
If that is possible, perhaps it would only be a matter of appending the data of the other threads to the core dump. However, if that data is lost at the point of the fork already, then there doesn't seem to be any hope for this approach.

Are you familiar with process checkpoint-restart? In particular, CRIU? It seems to me it might provide an easy option for you.
I want to obtain a core dump of a running Linux process without interrupting the process [and] to somehow obtain the relevant data of the other, original threads.
Forget about not interrupting the process. If you think about it, a core dump has to interrupt the process for the duration of the dump; your true goal must therefore be to minimize the duration of this interruption. Your original idea of using fork() does interrupt the process, it just does so for a very short time.
Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
Not usefully. The memory itself is still mapped in the forked copy, but fork() only retains the thread that makes the actual call; the other threads' register states (and with them their stack pointers) are lost, so their stacks cannot be reliably located or interpreted from the child alone.
Here is the procedure I'd use, assuming CRIU is unsuitable:
Have a parent process that generates a core dump of the child process whenever the child is stopped. (Note that more than one consecutive stop event may be generated; only the first one until the next continue event should be acted on.)
You can detect the stop/continue events using waitpid(child,,WUNTRACED|WCONTINUED).
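A minimal sketch of that detection loop (error handling elided; grab_core_dump() is a hypothetical placeholder for the dumping described in the later steps):
#include <sys/types.h>
#include <sys/wait.h>

void monitor(pid_t child) {
    int status, dumping = 0;
    while (waitpid(child, &status, WUNTRACED | WCONTINUED) == child) {
        if (WIFSTOPPED(status) && !dumping) {
            dumping = 1;             /* act only on the first stop event */
            /* grab_core_dump(child);   hypothetical: attach and dump here */
        } else if (WIFCONTINUED(status)) {
            dumping = 0;             /* re-arm for the next stop event */
        } else if (WIFEXITED(status) || WIFSIGNALED(status)) {
            break;                   /* the child is gone */
        }
    }
}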
Optional: Use sched_setaffinity() to restrict the process to a single CPU, and sched_setscheduler() (and perhaps sched_setparam()) to drop the process priority to IDLE.
You can do this from the parent process, which only needs the CAP_SYS_NICE capability in both the effective and permitted sets (you can grant it with setcap 'cap_sys_nice=pe' parent-binary, if you have filesystem capabilities enabled, as most current Linux distributions do).
The intent is to minimize the progress of the other threads between the moment a thread decides it wants a snapshot/dump, and the moment when all threads have been stopped. I have not tested how long it takes for the changes to take effect -- certainly they only happen at the end of their current timeslices at the very earliest. So, this step should probably be done a bit beforehand.
Personally, I don't bother. On my four-core machine, the following SIGSTOP alone yields similar latencies between threads as a mutex or a semaphore does, so I don't see any need to strive for even better synchronization.
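Should you want this optional step anyway, a minimal sketch of it from the parent's side (assuming the parent holds CAP_SYS_NICE and child is the child's PID):
#define _GNU_SOURCE 1   /* for CPU_ZERO/CPU_SET and sched_setaffinity */
#include <sched.h>
#include <sys/types.h>

int restrict_child(pid_t child) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                /* one CPU only */
    if (sched_setaffinity(child, sizeof set, &set) == -1)
        return -1;
    struct sched_param sp;
    sp.sched_priority = 0;                           /* must be 0 for SCHED_IDLE */
    return sched_setscheduler(child, SCHED_IDLE, &sp);
}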
When a thread in the child process decides it wants to take a snapshot of itself, it sends a SIGSTOP to itself (via kill(getpid(), SIGSTOP)). This stops all threads in the process.
The parent process will receive the notification that the child was stopped. It first examines /proc/PID/task/ to obtain the TIDs for each thread of the child process (and perhaps the /proc/PID/task/TID/ pseudofiles for other information), then attaches to each TID using ptrace(PTRACE_ATTACH, TID). Obviously, ptrace(PTRACE_GETREGS, TID, ...) will obtain the per-thread register states, which can be used in conjunction with /proc/PID/task/TID/smaps and /proc/PID/task/TID/mem to obtain the per-thread stack trace, and whatever other information you're interested in. (For example, you could create a debugger-compatible core file for each thread.)
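A sketch of that attach-and-read loop (x86-64 register names assumed; error handling elided):
#define _GNU_SOURCE 1   /* for __WALL */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

void dump_thread_registers(pid_t pid) {
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/task", (int)pid);
    DIR *dir = opendir(path);
    struct dirent *ent;
    while (dir && (ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                        /* skip "." and ".." */
        pid_t tid = (pid_t)atoi(ent->d_name);
        ptrace(PTRACE_ATTACH, tid, NULL, NULL);
        waitpid(tid, NULL, __WALL);          /* wait until the TID is traced */
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, tid, NULL, &regs);
        printf("tid %d: rip=%llx rsp=%llx\n", (int)tid,
               (unsigned long long)regs.rip, (unsigned long long)regs.rsp);
        /* ...read /proc/PID/task/TID/smaps and .../mem to walk the stack... */
        ptrace(PTRACE_DETACH, tid, NULL, NULL);
    }
    if (dir)
        closedir(dir);
}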
When the parent process is done grabbing the dump, it lets the child process continue. I believe you need to send a separate SIGCONT signal to let the entire child process continue, instead of just relying on ptrace(PTRACE_CONT, TID), but I haven't checked this; do verify this, please.
I do believe that the above will yield a minimal delay in wall clock time between the threads in the process stopping. Quick tests on an AMD Athlon II X4 640 running Xubuntu with a 3.8.0-29-generic kernel indicate that tight loops incrementing a volatile variable in the other threads only advance the counters by a few thousand, depending on the number of threads (there's too much noise in the few tests I made to say anything more specific).
Limiting the process to a single CPU, and even to IDLE priority, will drastically reduce that delay even further. CAP_SYS_NICE capability allows the parent to not only reduce the priority of the child process, but also lift the priority back to original levels; filesystem capabilities mean the parent process does not even have to be setuid, as CAP_SYS_NICE alone suffices. (I think it'd be safe enough -- with some good checks in the parent program -- to be installed in e.g. university computers, where students are quite active in finding interesting ways to exploit the installed programs.)
It is possible to create a kernel patch (or module) that provides a boosted kill(getpid(), SIGSTOP) that also tries to kick the other threads off the CPUs they are running on, and thus make the delay between the threads stopping even smaller. Personally, I wouldn't bother. Even without the CPU/priority manipulation I get sufficient synchronization (small enough delays between the times the threads are stopped).
Do you need some example code to illustrate my ideas above?

When you fork you get a full copy of the running process's memory. This includes all threads' stacks (after all, you could have valid pointers into them). But only the calling thread continues to execute in the child.
You can easily test this. Make a multithreaded program and run:
pid_t parent_pid = getpid();
if (!fork()) {
    /* Child: stop the parent so its mappings cannot change underneath us. */
    kill(parent_pid, SIGSTOP);
    char buffer[0x1000];
    pid_t child_pid = getpid();
    /* Compare the two memory maps; they should match, stacks included. */
    snprintf(buffer, sizeof buffer, "diff /proc/%d/maps /proc/%d/maps",
             parent_pid, child_pid);
    system(buffer);
    kill(parent_pid, SIGTERM);
    return 0;
} else for (;;);
So all your memory is there, and when you create a core dump it will contain all the other threads' stacks (provided your maximum core file size permits it). The only pieces that will be missing are their register sets. If you need those, then you will have to ptrace your parent to obtain them.
You should keep in mind though that core dumps are not designed to contain runtime information of more than one thread - the one that caused the core dump.
To answer some of your other questions:
You can enumerate threads by going through /proc/[pid]/task, but you cannot identify their stack bases until you ptrace them.
Yes, you have full access to the other threads' stack snapshots (see above) from the forked process. It is not trivial to determine where they are, but they do get put into a core dump, provided the core file size permits it. Your best bet is to save their locations in some globally accessible structure upon thread creation, if you can.
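As a sketch of that last suggestion (glibc's non-portable pthread_getattr_np() assumed), each thread could record its own stack region at startup:
#define _GNU_SOURCE 1   /* for pthread_getattr_np */
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64
static struct { void *base; size_t size; } stacks[MAX_THREADS];
static int nstacks;
static pthread_mutex_t stacks_lock = PTHREAD_MUTEX_INITIALIZER;

void register_my_stack(void) {   /* call at the top of each thread function */
    pthread_attr_t attr;
    pthread_getattr_np(pthread_self(), &attr);
    pthread_mutex_lock(&stacks_lock);
    pthread_attr_getstack(&attr, &stacks[nstacks].base, &stacks[nstacks].size);
    nstacks++;
    pthread_mutex_unlock(&stacks_lock);
    pthread_attr_destroy(&attr);
}
The forked child can then walk the stacks array, since it sees a snapshot of it.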

If you want the core file at no specific location - just a core image of the running process without killing it - then you can use gcore.
If you want the core file at a specific location (condition) while the process continues running, a crude approach is to execute gcore programmatically from that location.
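A crude sketch of that (assuming gdb's gcore utility is installed and ptrace attachment to the process is permitted):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void dump_self(const char *prefix) {
    char cmd[256];
    /* gcore attaches, writes prefix.<pid>, detaches; we keep running after. */
    snprintf(cmd, sizeof cmd, "gcore -o %s %d", prefix, (int)getpid());
    system(cmd);
}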
A more classical, clean approach would be to check the API which gcore uses and embed it in your application - but that would be too much of an effort compared to the need, most of the time.
HTH!

If your goal is to snapshot the entire process in order to understand the exact state of all threads at a specific point then I can't see any way to do this that doesn't require some kind of interrupt service routine. You must halt all processors and record off the current state of each thread.
I don't know of any system that provides this kind of full process core dump. The rough outlines of the process would be:
issue an interrupt across all CPUs (both logical and physical cores).
busy wait for all cores to synchronize (this shouldn't take long).
clone the desired process's memory space: duplicate the page tables and mark all pages as copy on write.
have each processor check whether its current thread is in the target process. If so record the current stack pointer for that thread.
for every other thread examine the thread data block for the current stack pointer and record it.
create a kernel thread to save off the copied memory spaces and the thread stack pointers.
resume all cores.
This should capture the entire process state, including a snapshot of any threads that were actually running at the moment the inter-processor interrupt was issued. Because all threads are interrupted (either through the standard scheduler suspension process, or via our custom interrupt process), all register states will be on a stack somewhere in the process memory. You then only need to know where the top of each thread stack is. Using the copy-on-write mechanism to clone the page tables allows transparent save-off while the original process is allowed to resume.
This is a pretty heavyweight option, since its main functionality requires suspending all processors for a significant amount of time (synchronize, clone, walk all threads). However, this should allow you to exactly capture the status of all threads, as well as determine which threads were running (and on which CPUs) when the checkpoint was reached. I would assume some of the framework for doing this exists (in CRIU for instance). Of course, resuming the process will result in a storm of page allocations as the copy-on-write mechanism protects the check-pointed system state.

Related

Ensure that each thread gets a chance to execute in a given time period using C++11 threads

Suppose I have a multi-threaded program in C++11, in which each thread controls the behavior of something displayed to the user.
I want to ensure that for every time period T during which one of the threads of the given program has run, each thread gets a chance to execute for at least time t, so that the display looks as if all threads are executing simultaneously. The idea is to have a mechanism for round-robin scheduling with time sharing, based on some information stored in the thread, forcing a thread to wait after its time slice is over, instead of relying on the operating system scheduler.
Preferably, I would also like to ensure that each thread is scheduled in real time.
In case there is no way other than relying on the operating system, is there any solution for Linux?
Is it possible to do this? How?
No, that's not possible cross-platform with C++11 threads. How often and how long a thread runs isn't up to the application; it's up to the operating system you're using.
However, there are still functions with which you can tell the OS that a particular thread/process is really important, and so influence the scheduling for your purposes.
You can acquire the platform dependent thread handle to use OS functions.
native_handle_type native_handle(); // (since C++11)
Returns the implementation defined underlying thread handle.
I just want to stress again that this requires an implementation which is different for each platform!
Microsoft Windows
According to the Microsoft documentation:
SetThreadPriority function
Sets the priority value for the specified thread. This value, together
with the priority class of the thread's process determines the
thread's base priority level.
Linux/Unix
For Linux things are more difficult because there are different systems for how threads can be scheduled. Microsoft Windows uses a priority system, but on Linux that doesn't seem to be the default scheduling.
For more information, please take a look at this Stack Overflow question (it should be the same for std::thread because of this).
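As an illustration of the Linux side, here is a minimal sketch using native_handle() with pthread_setschedparam() (assumes NPTL, and that the process is allowed to use real-time policies, e.g. via CAP_SYS_NICE or root):
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

int main() {
    std::thread worker([]{ /* ... display work ... */ });

    sched_param sp{};
    sp.sched_priority = 10;   // valid range is 1..99 for SCHED_RR
    int rc = pthread_setschedparam(worker.native_handle(), SCHED_RR, &sp);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);

    worker.join();
}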
I want to ensure that for every time period T during which one of the threads of the given program have run, each thread gets a chance to execute for at least time t, so that the display looks as if all threads are executing simultaneously.
You are using threads to make it seem as though different tasks are executing simultaneously. That is not recommended for the reasons stated in Arthur's answer, to which I really can't add anything.
If, instead of having long-lived threads each doing its own task, you can express the work as tasks that can be executed without mutual exclusion, you can have a queue of tasks and a thread pool dequeuing and executing them.
If you cannot, you might want to look into wait-free data structures and algorithms. In a wait-free algorithm/data structure, every thread is guaranteed to complete its work in a finite (and even specified) number of steps. I can recommend the book The Art of Multiprocessor Programming, where this topic is discussed at length. The gist of it is: every lock-free algorithm/data structure can be modified to be wait-free by adding communication between threads, over which a thread that's about to do work makes sure that no other thread is starved/stalled. Basically, prefer fairness over total throughput of all threads. In my experience this is usually not a good compromise.

How do I determine from strace output what part of my program is failing to acquire a mutex

I'm working on an embedded Linux system (3.12.something), and our application, after some random amount of time, starts hogging the CPU. I've run strace on our application, and right when the problem happens, I see a lot of lines similar to this in the strace output:
[48530666] futex(0x485f78b8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.009002>
I'm pretty sure this is the smoking gun I'm looking for and there is a race of some sort. However, I now need to figure out how to identify the place in the code that's trying to get this mutex. How can I do that? Our code is compiled with GCC and has debugging symbols in it.
My current thinking (that I haven't tried yet) is to print out a string to stdout and flush before trying to grab any mutex in our system, with the expectation that the string will print right before strace complains about getting the lock ... but there are a LOT of places in the code that would have to be instrumented like this.
EDIT: Another strange thing that I just realized is that our program doesn't start hogging the CPU until some random time has passed since it was run (5 minutes to 5 hours and anywhere in between). During that time, there are zero futex syscalls happening. Why do they suddenly start? From what I've read, I think maybe they are being used properly in userspace until something fails and falls back to making a futex() syscall...
Any suggestions?
If you perpetually and often lock a mutex for a short time from different threads, like e.g. one protecting a global logger, you might cause a so-called thread convoy. The problem doesn't occur until two threads compete for the lock. The first gets the lock and holds it for a short time; then, when it needs the lock a second time, it gets preempted because the second one is waiting already. The second one does the same. The timeslice available to each thread is suddenly reduced to the time between two lock attempts, causing many context switches and the corresponding slowdown. Further, all but one thread are always blocked on the mutex, effectively disabling any parallel execution.
In order to fix this, make your threads cooperate instead of competing for resources. For above example of a logger, consider e.g. a lock-free queue for the entries or separate queues for each thread using thread-local storage.
Concerning the futex() calls, the idea is to poll an atomic flag and, after some spinning, fall back to the actual OS mutex. The atomic flag is available without the expensive switch between user-space and kernel-space. For longer waits, handing over to the kernel (with futex()) avoids blocking the CPU with polling. This explains why the program doesn't need any futex() calls in normal operation.
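To illustrate the fast-path/slow-path split, here is a minimal sketch of a futex-backed lock (Linux only, modeled on the classic three-state design from Ulrich Drepper's "Futexes Are Tricky"; an illustration, not a production lock):
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>

static std::atomic<int> word{0};   // 0 = free, 1 = locked, 2 = contended

static long futex_op(void *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

void lock() {
    int expected = 0;
    // Fast path: an uncontended acquire is one atomic CAS - no syscall,
    // so nothing shows up in strace.
    if (word.compare_exchange_strong(expected, 1))
        return;
    // Slow path: mark the lock contended and sleep in the kernel until woken.
    while (word.exchange(2) != 0)
        futex_op(&word, FUTEX_WAIT_PRIVATE, 2);
}

void unlock() {
    // Only enter the kernel if another thread may be sleeping.
    if (word.exchange(0) == 2)
        futex_op(&word, FUTEX_WAKE_PRIVATE, 1);
}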
You basically need to generate a core file at that moment. Then you can load program + core into GDB and look at it:
man gcore
or
generate-core-file
During that time, there are zero futex syscalls happening. Why do they suddenly start?
This is due to the fact that an uncontended mutex, implemented via futex, doesn't make a system call - just an atomic operation, purely in user space. Only a CONTENDED lock is visible as a system call.

Windows C++ Process vs Thread

In Windows C++, CreateThread() causes some of the threads to slow down if one thread is doing a very CPU-intensive operation. Will CreateProcess() alleviate this? If so, does CreateProcess() imply the code must reside in a second executable, or can this all take place inside the same executable?
The major difference between a process and a thread is that each process has its own memory space, while threads share the memory space of the process that they are running within.
If a thread is truly CPU bound, it will only slow another thread if they are both executing on the same processor core. CreateProcess will not alleviate this, since a process would still have the same issue.
Also, what kind of machine are you running this on? Does it have more than one core?
Not likely - a process is much "heavier" than a thread, so it is likely to be slower still. I'm not sure what you're asking about the second executable, but you can use CreateProcess on the same .exe.
http://msdn.microsoft.com/en-us/library/ms682425(v=vs.85).aspx
It sounds like you're chasing down some performance issues, so perhaps trying out a threading-oriented profiler would be helpful: http://software.intel.com/en-us/articles/using-intel-thread-profiler-for-win32-threads-philosophy-and-theory/
Each process provides the resources needed to execute a program. A process has a virtual address space, executable code, open handles to system objects, a security context, a unique process identifier, environment variables, a priority class, minimum and maximum working set sizes, and at least one thread of execution. Each process is started with a single thread, often called the primary thread, but can create additional threads from any of its threads.
A thread is the entity within a process that can be scheduled for execution. All threads of a process share its virtual address space and system resources. In addition, each thread maintains exception handlers, a scheduling priority, thread local storage, a unique thread identifier, and a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread's set of machine registers, the kernel stack, a thread environment block, and a user stack in the address space of the thread's process. Threads can also have their own security context, which can be used for impersonating clients.
CreateProcess and CreateThread both add lines of execution to what is a resource-limited environment. No matter how you do parallel processing, at some point your other lines of execution will impede the current one. For very large problems that are suited to parallelization, distributed systems are used. There are pluses and minuses to both threads and processes.
Threads
Threads allow separate execution inside one address space, meaning you can very easily share data variables and object instances between threads; however, this also means you run into many more synchronization issues. These are painful, and as you can see from the sheer number of API functions involved, not a light subject. Threads are lighter weight on Windows than processes, and as such spin up and down faster and use fewer resources to maintain. Threads also suffer in that one thread can cause the entire process to fail.
Processes
Processes each have their own address space and as such protect themselves from being brought down by another process, but lack the ability to communicate easily. Any communication will necessarily involve some type of IPC (pipes, TCP, ...).
The code does not have to be in a second executable; just two instances need to run.
That would make things worse. When switching threads, the CPU needs to swap out only a few registers. Since all threads of a process share the same memory, there's no need to flush the cache. But when switching between processes, you also switch mapped memory. Therefore, the CPU has to flush the L1 cache. That's painful.
(L2 cache is physically mapped, i.e. uses hardware addresses. Those don't change, of course.)

What is process and thread?

Yes, I have read many materials related to operating systems, and I am still reading. But it seems all of them describe the process and thread in an "abstract" way, giving a lot of high-level elaboration on their behavior and logical organization. I am wondering what they are physically. In my opinion, they are just some in-memory "data structures" which are maintained and used by the kernel code to facilitate the execution of programs. For example, the operating system uses some process data structure (PCB) to describe the aspects of the process assigned to a certain program, such as its priority, its address space and so on. Is this right?
The first thing you need to know to understand the difference between a process and a thread is the fact that processes do not run - threads do.
So, what is a thread? The closest I can get to explaining it is that it's an execution state, as in: a combination of CPU registers, stack, the lot. You can see proof of that by breaking in a debugger at any given moment. What do you see? A call stack, a set of registers. That's pretty much it. That's the thread.
Now then, what is a process? Well, it's like an abstract "container" entity for running threads. As far as the OS is concerned, in a first approximation, it's an entity the OS allocates some VM to, assigns some system resources to (like file handles, network sockets), &c.
How do they work together? The OS creates a "process" by reserving some resources for it and starting a "main" thread. That thread then can spawn more threads. Those are the threads in one process. They can more or less share those resources one way or another (say, locking might be needed so they don't spoil the fun for others, &c). From there on, the OS is normally responsible for maintaining those threads "inside" that VM (detecting and preventing attempts to access memory which doesn't "belong" to that process), and for providing some type of scheduling of those threads, so that they can run "one-after-another-and-not-just-one-all-the-time".
Normally when you run an executable like notepad.exe, this creates a single process. These processes can spawn other processes, but in most cases there is a single process for each executable that you run. Within the process, there can be many threads. Usually at first there is one thread, which starts at the program's "entry point", which is usually the main function. Instructions are executed one by one in order; like a person who only has one hand, a thread can only do one thing at a time before it moves on to the next.
That first thread can create additional threads. Each additional thread has its own entry point, which is usually defined with a function. The process is like a container for all the threads that have been spawned within it.
That is a pretty simplistic explanation. I could go into more detail but probably would overlap with what you will find in your textbooks.
EDIT: You'll notice there are lots of "usually"s in my explanation, as there are occasionally rare programs that do things drastically differently.
One of the reasons why it is pretty much impossible to describe threads and processes in a non-abstract way is that they are abstractions.
Their concrete implementations differ tremendously.
Compare for example an Erlang Process and a Windows Process: an Erlang Process is very lightweight, often less than 400 Bytes. You can start 10 million processes on a not very recent laptop without any problems. They start up very quickly, they die very quickly and you are expected to be able to use them for very short tasks. Every Erlang Process has its own Garbage Collector associated with it. Erlang Processes can never share memory, ever.
Windows Processes are very heavy, sometimes hundreds of MiBytes. You can start maybe a couple of thousand of them on a beefy server, if you are lucky. They start up and die pretty slowly. Windows Processes are the units of Applications such as IDEs or Text Editors or Word Processors, so they are usually expected to live quite a long time (at least several minutes). They have their own Address Space, but no Garbage Collector. Windows Processes can share memory, although by default they don't.
Threads are a similar matter: an NPTL Linux Thread on x86 can be as small as 4 KiByte and with some tricks you can start 800000+ on a 32 Bit x86 machine. The machine will certainly be useable with thousands, maybe tens of thousands of threads. A .NET CLR Thread has a minimum size of about 1 MiByte, which means that just 4000 of those will eat up your entire address space on a 32 Bit machine. So, while 4000 NPTL Linux Threads is generally not a problem, you can't even start 4000 .NET CLR Threads because you will run out of memory before that.
OS Processes and OS Threads are also implemented very differently between different Operating Systems. The two main approaches are as follows. In the first, the kernel knows only about processes, and Threads are implemented by a Userspace Library without any knowledge of the kernel at all; in this case there are again two variants: 1:1 (every Thread maps to one Kernel Process) or m:n (m Threads map to n Processes, where usually m > n and often n == #CPUs). This was the early approach taken on many Operating Systems after Threads were invented. However, it is usually deemed inefficient and has been replaced on almost all systems by the second approach: Threads are implemented (at least partially) in the kernel, so that the kernel now knows about two distinct entities, Threads and Processes.
One Operating System that goes a third route, is Linux. In Linux, Threads are neither implemented in Userspace nor in the Kernel. Instead, the Kernel provides an abstraction of both a Thread and a Process (and indeed a couple of more things), called a Task. A Task is a Kernel Scheduled Entity, that carries with it a set of flags that determine which resources it shares with its siblings and which ones are private.
Depending on how you set those flags, you get either a Thread (share pretty much everything) or a Process (share all system resources like the system clock, the filesystem namespace, the networking namespace, the user ID namespace, the process ID namespace, but do not share the Address Space). But you can also get some other pretty interesting things, too. You can trivially get BSD-style jails (basically the same flags as a Process, but don't share the filesystem or the networking namespace). Or you can get what other OSs call a Virtualization Container or Zone (like a jail, but don't share the UID and PID namespaces and system clock). Since a couple of years ago, via a technology called KVM (Kernel Virtual Machine), you can even get a full-blown Virtual Machine (share nothing, not even the processor's Page Tables). [The cool thing about this is that you get to reuse the highly-tuned mature Task Scheduler in the kernel for all of these things. One of the things the Xen Virtual Machine has often been criticized for was the poor performance of its scheduler. The KVM developers have a much superior scheduler than Xen, and the best thing is they didn't even have to write a single line of code for it!]
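To make the flags concrete, here is a minimal sketch (Linux-specific; the 64 KiB stack is an arbitrary choice) showing how the same clone() call yields a process-like or a thread-like Task depending on what it shares:
#define _GNU_SOURCE 1   /* for clone() */
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;            /* lives in the parent's image */

static int child_fn(void *arg) {
    counter++;                     /* visible to the parent only with CLONE_VM */
    return 0;
}

static void run(int flags, const char *label) {
    static char stack[64 * 1024];  /* child stack; grows downwards on x86 */
    pid_t pid = clone(child_fn, stack + sizeof stack, flags, NULL);
    waitpid(pid, NULL, 0);
    printf("%s: counter = %d\n", label, counter);
    counter = 0;
}

int main(void) {
    /* Process-like Task: private copy-on-write memory; increment invisible. */
    run(SIGCHLD, "process-like");
    /* Thread-like Task: shared address space; the increment is visible. */
    run(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD, "thread-like");
    return 0;
}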
So, on Linux, the performance of Threads and Processes is much closer than on Windows and many other systems, because on Linux, they are actually the same thing. Which means that the usage patterns are very different: on Windows, you typically decide between using a Thread and a Process based on their weight: can I afford a Process or should I use a Thread, even though I actually don't want to share state? On Linux (and usually Unix in general), you decide based on their semantics: do I actually want to share state or not?
One reason why Processes tend to be lighter on Unix than on Windows, is different usage: on Unix, Processes are the basic unit of both concurrency and functionality. If you want to use concurrency, you use multiple Processes. If your application can be broken down into multiple independent pieces, you use multiple Processes. Every Process does exactly one thing and only that one thing. Even a simple one-line shell script often involves dozens or hundreds of Processes. Applications usually consist of many, often short-lived Processes.
On Windows, Threads are the basic units of concurrency and COM components or .NET objects are the basic units of functionality. Applications usually consist of a single long-running Process.
Again, they are used for very different purposes and have very different design goals. It's not that one or the other is better or worse, it's just that they are so different that the common characteristics can only be described very abstractly.
Pretty much the only few things you can say about Threads and Processes are that:
Threads belong to Processes
Threads are lighter than Processes
Threads share most state with each other
Processes share significantly less state than Threads (in particular, they generally share no memory, unless specifically requested)
I would say that :
A process has a memory space, opened files,..., and one or more threads.
A thread is an instruction stream that can be scheduled by the system on a processor.
Have a look at the detailed answer I gave previously here on SO. It gives an insight into a toy kernel structure responsible for maintaining processes and the threads...
Hope this helps,
Best regards,
Tom.
We have discussed this very issue a number of times here. Perhaps you will find some helpful information here:
What is the difference between a process and a thread
Process vs Thread
Thread and Process
A process is a container for a set of resources used while executing a program.
A process includes the following:
Private virtual address space
A program.
A list of handles.
An access token.
A unique process ID.
At least one thread.
A pointer to the parent process, whether or not the parent still exists.
That being said, a process can contain multiple threads.
Processes themselves can be grouped into jobs, which are containers for processes and are executed as single units.
A thread is what Windows uses to schedule execution of instructions on the CPU. Every process has at least one.
I have a couple of pages on my wiki you could take a look at:
Process
Thread
Threads are memory structures in the scheduler of the operating system, as you say. A thread points to the start of some instructions in memory, and these are processed when the scheduler decides they should be. While the thread is executing, the hardware timer will run. Once it hits the desired time, an interrupt will be invoked. After this, the hardware will stop execution of the current program and invoke the registered interrupt handler function, which is part of the scheduler, to signal that the current thread's timeslice is over.
Physically:
Process is a structure that maintains the owning credentials, the thread list, and an open handle list
A Thread is a structure containing a context (i.e. a saved register set + a location to execute), a set of PTEs describing what pages are mapped into the process's Virtual Address space, and an owner.
This is of course an extremely simplified explanation, but it gets the important bits. The fundamental unit of execution on both Linux and Windows is the Thread - the kernel scheduler doesn't care about processes (much). This is why, on Linux, a thread is just a process that happens to share PTEs with another process.
A process is an area in memory managed by the OS to run an application. A thread is a small area in memory within a process to run a dedicated task.
Processes and Threads are abstractions - there is nothing physical about them, or any other part of an operating system for that matter. That is why we call it software. If you view a computer in physical terms you end up with a jumble of electronics that emulates what a Turing Machine does.
Trying to do anything useful with a raw Turing Machine would turn your brain to Jell-O in five minutes flat. To avoid that unpleasant experience, computer folks developed a set of abstractions to compartmentalize various aspects of computing. This lets you focus on the level of abstraction that interests you without having to worry about all the other stuff supporting it.
Some things have been cast into circuitry (e.g. adders and the like), which makes them physical, but the vast majority of what we work with is based on a set of abstractions. As a general rule, the abstractions we use have some form of mathematical underpinning to them. This is why stacks, queues and "state" play such an important role in computing - there is a well-founded body of mathematics around these abstractions that lets us build upon and reason about their manipulation.
The key is realizing that software is always based on a composite of abstract models of "things". Those "things" don't always relate to anything physical; more likely they relate to some other abstraction. This is why you cannot find a satisfactory "physical" basis for Processes and Threads anywhere in your textbooks.
Several other people have posted links to and explanations about what threads and processes are; none of them point to anything "physical" though. As you guessed, they are really just a set of data structures and rules that live within the larger context of an operating system (which in turn is just more data structures and rules...)
Software is like an onion, layers on layers on layers; once you peel all the layers (abstractions) away, nothing much is left! But the onion is still very real.
It's kind of hard to give a short answer which does this question justice.
And at the risk of getting this horribly wrong and simplifying things, you can say threads and processes are an operating-system/platform concept; under the hood, you can define a single-threaded process by:
Low-level CPU instructions (aka, the program).
State of execution - meaning the instruction pointer (really, a special register), register values, and the stack.
The heap (aka, general purpose memory).
In modern operating systems, each process has its own memory space. Aside from shared memory (only some OSes support this), the operating system forbids one process from writing into the memory space of another. In Windows, you'll see a general protection fault if a process tries.
So you can say a multi-threaded process is the whole package. And each thread is basically nothing more than state of execution.
So when a thread is pre-empted for another (say, on a uni-processor system), all the operating system has to do in principle is save the state of execution of the thread (not sure if it has to do anything special for the stack) and load in another.
Pre-empting an entire process, on the other hand, is more expensive as you can imagine.
Edit: The ideas apply in abstracted platforms like Java as well.
They are not physical pieces of string, if that's what you're asking. ;)
As I understand it, pretty much everything inside the operating system is just data. Modern operating systems depend on a few hardware requirements: virtual memory address translation, interrupts, and memory protection (There's a lot of fuzzy hardware/software magic that happens during boot, but I'm not very familiar with that process). Once those physical requirements are in place, everything else is up to the operating system designer. It's all just chunks of data.
The reason they are only mentioned in an abstract way is that they are concepts; while they will be implemented as data structures, there is no universal rule for how they have to be implemented.
This is at least true for the threads/processes on their own; they won't do much good without a scheduler and an interrupt timer.
The scheduler is the algorithm by which the operating system chooses the next thread to run for a limited amount of time and the interrupt timer is a piece of hardware which periodically interrupts the execution of the current thread and hands control back to the scheduler.
Forgot something: the above is not true if you only have cooperative threading. Cooperative threads have to actively yield control to the next thread, which can get ugly, with one thread polling for the results of another thread, which in turn waits for the first to yield.
These are even more lightweight than other threads as they don't require support of the underlying operating system to work.
I have seen many of the answers, but most of them are not clear enough for an OS beginner.
In any modern-day operating system, a process has a virtual CPU, virtual memory, and virtual I/O.
Virtual CPU: if you have multiple cores, the process might be assigned one or more of the cores for processing by the scheduler.
Virtual I/O: I/O might be shared between various processes. For example, the keyboard can be shared by multiple processes. So when you type in a notepad, you see the text changing while a key logger running as a daemon is storing all the keystrokes. So the processes are sharing an I/O resource.
Virtual memory: you can go through http://en.wikipedia.org/wiki/Virtual_memory for this.
So when a process is taken out of execution by the scheduler, its state - the values stored in the registers, its stack and heap, and much more - is saved into a data structure.
So when we compare a process with a thread: threads started by a process share the virtual I/O and virtual memory assigned to the process that started them, but not the virtual CPU.
So there might be multiple threads started by a process, all sharing the same virtual memory and virtual I/O, but having different virtual CPUs.
So you understand the need for locking the resources of a process, be they statically allocated (stack) or dynamically allocated (heap), since the virtual memory space is shared between the threads of a process.
Also, with each thread having its own virtual CPU, threads can run in parallel on different cores and significantly reduce the completion time of a process (the reduction will be observable only if you have managed the memory wisely and there are multiple cores).
A thread is controlled by a process; a process is controlled by the operating system.
Processes don't share memory with each other - each works in a so-called "protected flat model" - whereas threads share the same memory.
With Windows, at least once you get past Win 3.1, the operating system (OS) runs multiple processes, each with its own memory space, and a process can't interact with other processes without the OS.
Each process has one or more threads that share the same memory space and do not need the OS to interact with other threads.
Process is a container of threads.
Well, I haven't seen an answer to "What are they physically?" yet, so I'll give it a try.
Processes and threads are nothing physical. They are a feature of the operating system. Usually no physical component of a computer knows about them. The CPU only processes a sequential stream of opcodes. These opcodes might belong to a thread. Then the OS uses traps and interrupts to regain control, decide which code to execute, and switch to another thread.
A process is one complete entity, e.g. an exe file or one JVM. There can be a child process of a parent process, where the exe file runs again in a separate space. A thread is a separate path of execution in the same process, where the process controls which thread executes, halts, etc.
Trying to answer this question from the Java world.
A process is an execution of a program but a thread is a single execution sequence within the process. A process can contain multiple threads. A thread is sometimes called a lightweight process.
For example:
Example 1:
A JVM runs in a single process and threads in a JVM share the heap belonging to that process. That is why several threads may access the same object. Threads share the heap and have their own stack space. This is how one thread’s invocation of a method and its local variables are kept thread safe from other threads. But the heap is not thread-safe and must be synchronized for thread safety.
Example 2:
A program might not be able to draw pictures while reading keystrokes. The program must give its full attention to the keyboard input, and lacking the ability to handle more than one event at a time will lead to trouble. The ideal solution to this problem is the seamless execution of two or more sections of a program at the same time. Threads allow us to do this. Here, drawing the picture is the process and reading keystrokes is a sub-task (thread).

Hibernating/restarting a thread

I'm looking for a way to restart a thread, either from inside that thread's context or from outside the thread, possibly from within another process. (Any of these options will work.) I am aware of the difficulty of hibernating entire processes, and I'm pretty sure that those same difficulties attend to threads. However, I'm asking anyway in the hopes that someone has some insight.
My goal is to pause, save to file, and restart a running thread from its exact context with no modification to that thread's code, or rather, modification in only a small area - i.e., I can't go writing serialization functions throughout the code. The main block of code must be unmodified, and will not have any global/system handles (file handles, sockets, mutexes, etc.) Really down-and-dirty details like CPU registers do not need to be saved; but basically the heap, stack, and program counter should be saved, and anything else required to get the thread running again logically correctly from its save point. The resulting state of the program should be no different, if it was saved or not.
This is for a debugging program for high-reliability software; the goal is to run simulations of the software with various scripts for input, and be able to pause a running simulation and then restart it again later - or get the sim to a branch point, save it, make lots of copies and then run further simulations from the common starting point. This is why the main program cannot be modified.
The main program is written in C++ and should run on Windows and Linux; however, if there is a way to do this on only one system, then that's acceptable too.
Thanks in advance.
I think what you're asking is much more complicated than you think. I am not too familiar with Windows programming but here are some of the difficulties you'll face in Linux.
A saved thread can only be restored from the root process that originally spawned it; otherwise the dynamic libraries would be broken. Because of this, saving to disk is essentially meaningless. The reason is that dynamic libraries are loaded at a different address each time they're loaded. The only way around this would be to take complete control of dynamic linking - no small feat. It's possible, but pretty scary.
The suspended thread will have variables in the heap. You'd need to be able to find all globals 'owned' by the thread, but the 'owned' state of any piece of the heap cannot be determined. (In the future it may be possible with C++0x's garbage collection ABI.) You can't just assume the whole heap belongs to the thread to be paused: the main thread uses the heap when creating threads, so blowing away the heap when deserializing the paused thread would break the main thread.
You need to address the issues with globals. And not just the globals created in the threads: globals (or statics) can be, and often are, created in dynamic libraries.
There are more resources to a program than just memory. You have file handles, network sockets, database connections, etc. A file handle is just a number; serializing its memory is completely meaningless without the context of the process the file was opened in.
All that said. I don't think the core problem is impossible, just that you should consider a different approach.
Anyway, to try to implement this, the thread to be paused needs to be in a known state. I imagine the thread to be stopped would call a library function meant to halt the process, so that it could later be resumed.
I think the Linux system call fork() is your friend. Fork perfectly duplicates a process. Have the system run to the desired point and fork. One fork waits, ready to fork others; the second fork runs one set of input.
Once it completes, the first fork can fork again, and the new child can run another set of input.
Continue ad infinitum.
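A minimal sketch of that scheme (run_simulation() stands in for your unmodified simulation code; the names here are hypothetical):
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* ...run the simulation up to the common branch point here... */
    for (int input_set = 0; input_set < 4; input_set++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: a copy-on-write snapshot of the branch-point state. */
            printf("child %d simulating input set %d\n", (int)getpid(), input_set);
            /* run_simulation(input_set);   hypothetical */
            _exit(0);
        }
        waitpid(pid, NULL, 0);   /* or let several children run in parallel */
    }
    return 0;
}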
Threads run in the context of a process. So if you want to do anything like persist a thread state to disk, you need to "hibernate" the entire process.
You will need to serialise the entire set of the process's data, and you'll need to store the current thread execution point. I think serialising the process is doable (check out Boost.Serialization), but the thread stop point is a lot more difficult. I would put stop points throughout the code, but as you say, you cannot modify the code.
Given that problem, you're looking at virtualising the platform the app is running on, and using its suspend functionality to pause the entire thing. You might find more information about how to do this in the virtualisation vendor's feature documentation, e.g. for Xen.
As the whole logical address space of the program is part of the thread's context, you would have to hibernate the whole process.
If you can guarantee that the thread only uses local variables, you could save its stack. It is easy to suspend a thread with pthreads, but I don't see how you could access its stack from outside then.
The way you would have to do this is via VM snapshots: get a copy of VMware Workstation, and then you can write code to automate starting/stopping/snapshotting the machine at different points. Any other approach is pretty untenable, as while you might be able to freeze and thaw a process, you can't reconstruct the system state it expects (all the stuff that Caspin mentions, like file handles et al.).