Safely growing a memory mapped file in C++ application


I have an ancient C++ application, originally built in Visual C++ 6.0, that uses a very complex shared-memory DLL to share data among about 8 EXEs and DLLs. They all have a pool of values that could be replaced by one or two dictionaries with strings for the keys and records for the values. The application is multi-threaded and multi-process. There are three primary executables reading and writing into the shared memory area, and several of the executables have 3 or more threads that read/write or "queue" information into this pooled memory area. In a few hundred places, Structured Exception Handling (SEH) with __try and __except is used to filter exceptions and to try to handle access violations by resizing the shared memory, which is held in segments managed by a class called CGmmf, meaning "growable memory mapped file".
The most salient details are shown here because I cannot find any cohesive source of documentation on the technique in use, or on its safety and suitability. Experimentally I have found that this library didn't work very well on a single-core system in 1998, that it works somewhat on a single-core virtual machine running Windows XP, and that it doesn't work at all on modern 2+ GHz multi-core Windows 7 64-bit systems in 2013. I'm trying to repair it or replace it.
#define ResAddrSpace(pvAddress, dwSize) \
(m_hFileMapRes = CreateFileMapping(HFILE_PAGEFILE, &m_SecAttr, \
PAGE_READWRITE| SEC_RESERVE, 0, dwSize, m_szRegionName), \
(m_hFileMapRes == NULL) ? NULL : \
MapViewOfFileEx(m_hFileMapRes, FILE_MAP_ALL_ACCESS, 0, 0, dwSize, 0))
void CGmmf::Create(void)
{
DWORD dwMaxRgnSize;
if (Gsinf.dwAllocationGranularity == 0)
{
GetSystemInfo(&Gsinf);
}
m_dwFileSizeMax = RoundUp(m_dwFileSizeMax, Gsinf.dwAllocationGranularity);
m_dwFileGrowInc = RoundUp(m_dwFileGrowInc, Gsinf.dwAllocationGranularity);
dwMaxRgnSize = m_dwFileSizeMax + m_dwOverrunBuf;
m_pbFile = (PBYTE)ResAddrSpace(NULL, dwMaxRgnSize);
Adjust(m_dwFileSizeNow);
}
void CGmmf::Adjust(IN DWORD dwDiskFileNow)
{
int nThreadPriority;
__try
{
//
// Boost our thread's priority so that another thread is
// less likely to use the same address space while
// we're changing it.
//
nThreadPriority = GetThreadPriority(GetCurrentThread());
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
//
// Restore the contents with the properly adjusted lengths
//
Construct(dwDiskFileNow);
}
__finally
{
//
// Make sure that we always restore our priority class and thread
// priority so that we do not continue to adversely affect other
// threads in the system.
//
SetThreadPriority(GetCurrentThread(), nThreadPriority);
}
}
void CGmmf::Construct(IN DWORD dwDiskFileNow)
{
DWORD dwDiskFileNew = RoundUp(dwDiskFileNow, m_dwFileGrowInc),
dwStatus = ERROR_SUCCESS;
PBYTE pbTemp;
if (dwDiskFileNew > 0)
{
//
// Grow the MMF by creating a new file-mapping object.
//
// use VirtualAlloc() here to commit
// the requested memory: VirtualAlloc will not fail
// even if the memory block is already committed:
pbTemp = (PBYTE)VirtualAlloc(m_pbFile,dwDiskFileNew,MEM_COMMIT,PAGE_READWRITE);
if(NULL == pbTemp)
{
LogError(GetLastError(), MEM_CREATE_MMF, m_szRegionName);
//
// File-mapping could not be created, the disk is
// probably full.
//
RaiseException(EXCEPTION_GMMF_DISKFULL,
EXCEPTION_NONCONTINUABLE,
0,
NULL);
}
//
// Check to see if our region has been corrupted
// by another thread.
//
if (pbTemp != m_pbFile)
{
RaiseException(EXCEPTION_GMMF_CORRUPTEDRGN,
EXCEPTION_NONCONTINUABLE,
0,
NULL);
}
}
}
So far my options for replacing it include replacing all the shared memory with DCOM (out-of-process COM) and COM (in-process COM), as appropriate to the places where the memory-mapped files are used, and guarding against concurrency issues by hand, using synchronization/mutex/critical-section or other thread-safe constructs as appropriate.
I want to know if there is already some thread-safe memory-dictionary type I could replace all of this with. Even in the above snippet, which is less than 1% of the code of this ancient shared-memory-library-for-visual-C++-6, there are things that make me shudder. For example, raising thread priority as a strategy for avoiding deadlocks, race conditions and general corruption. Maybe that used to make this code stop crashing quite so much on an 80486 CPU at 33 MHz. Shudder.
I have the code building and running in Visual C++ 6.0 and also a branch of it runs in Visual C++ 2008, and I could probably get it going in Visual C++ 2010. What could I use that would give me dictionary semantics, shared memory across processes, and is stable and reliable?
Update By "dictionary" I mean the dictionary datatype as known in Python, which is also called a "key/value store" in some places, and in others (like in the C++ standard library), it's known as std::map. Boost documentation that discusses this is here.

It sounds like you should take a look at Boost Interprocess. You can use it to have std::map-like objects in shared memory, and a lot more. It's been years since I last used it, so I cannot go into much detail, but the library documentation is good and has tons of examples; it should get you going in 30 minutes.
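Whichever container ends up holding the data, the "guard against concurrency issues by hand" part of the question can be sketched within a single process as below. This is only an illustration under stated assumptions: Record and LockedDictionary are hypothetical names, std::mutex needs a C++11 compiler (newer than the Visual C++ versions mentioned in the question), and a std::mutex protects threads of one process only. Sharing across processes would still need an interprocess lock and shared-memory-aware storage such as Boost.Interprocess provides.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical record type standing in for the application's values.
struct Record {
    int    id;
    double reading;
};

// A string-keyed dictionary whose every operation takes the same lock.
// This protects threads inside ONE process; it does not cross processes.
class LockedDictionary {
public:
    // Inserts or overwrites the record stored under 'key'.
    void put(const std::string& key, const Record& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[key] = value;
    }

    // Copies the record out while holding the lock, so the caller never
    // keeps a reference into the map. Returns false if the key is absent.
    bool get(const std::string& key, Record& out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::map<std::string, Record>::const_iterator it = entries_.find(key);
        if (it == entries_.end()) return false;
        out = it->second;
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return entries_.size();
    }

private:
    mutable std::mutex mutex_;              // guards entries_
    std::map<std::string, Record> entries_;
};
```

Returning copies rather than references is a deliberate choice here: a reference handed out from under the lock could be invalidated by another thread's put().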

Related

How do I reserve memory regions before Windows maps my program's DLLs?

My Windows program needs to use very specific regions of memory. Unfortunately, Windows loads quite a few DLLs in memory and because of ASLR, their locations are not predictable, so they could end up being mapped into regions that my program needs to use. On Linux, Wine solves this problem by using a preloader application which reserves memory regions and then manually loads and executes the actual image and dynamic linker. I assume that specific method is not possible on Windows, but is there another way to get reserved regions of memory that are guaranteed to not be used by DLLs or the process heap?
If it helps, the memory regions are fixed and known at compile time. Also, I'm aware that ASLR can be disabled system-wide in the registry or per-process using the Enhanced Mitigation Experience Toolkit, but I don't want to require my users to do that.

I think I finally got it using a method similar to what dxiv suggested in the comments. Instead of using a dummy DLL, I build a basic executable that loads at the beginning of my reserved region using the /FIXED and /BASE linker options. The code for the executable contains an uninitialized array that ensures the image covers the needed addresses in memory, but doesn't take up any extra space in the file:
unsigned char Reserved[4194304]; // 4MB
At runtime, the executable copies itself to a new location in memory and updates a couple of fields in the Process Environment Block to point to it. Without updating the fields, calling certain functions like FormatMessage would cause a crash.
#include <intrin.h>
#include <windows.h>
#include <winternl.h>
#pragma intrinsic(__movsb)
void Relocate() {
void *Base, *NewBase;
ULONG SizeOfImage;
PEB *Peb;
LIST_ENTRY *ModuleList, *NextEntry;
/* Get info about the PE image. */
Base = GetModuleHandleW(NULL);
SizeOfImage = ((IMAGE_NT_HEADERS *)(((ULONG_PTR)Base) +
((IMAGE_DOS_HEADER *)Base)->e_lfanew))->OptionalHeader.SizeOfImage;
/* Allocate memory to hold a copy of the PE image. */
NewBase = VirtualAlloc(NULL, SizeOfImage, MEM_COMMIT, PAGE_READWRITE);
if (!NewBase) {
ExitProcess(GetLastError());
}
/* Copy the PE image to the new location using __movsb since we don't have
a C library. */
__movsb(NewBase, Base, SizeOfImage);
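/* Locate the Process Environment Block. On 32-bit x86 its address is
stored at fs:[0x30], so this code is x86-only. */
Peb = (PEB *)__readfsdword(0x30);
/* Update the ImageBaseAddress field of the PEB (offset 0x08 in the
32-bit layout). */
*((PVOID *)((ULONG_PTR)Peb + 0x08)) = NewBase;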
/* Update the base address in the PEB's loader data table. */
ModuleList = &Peb->Ldr->InMemoryOrderModuleList;
NextEntry = ModuleList->Flink;
while (NextEntry != ModuleList) {
LDR_DATA_TABLE_ENTRY *LdrEntry = CONTAINING_RECORD(
NextEntry, LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
if (LdrEntry->DllBase == Base) {
LdrEntry->DllBase = NewBase;
break;
}
NextEntry = NextEntry->Flink;
}
}
I built the executable with /NODEFAULTLIB just to reduce its size and the number of DLLs loaded at runtime, hence the use of the __movsb intrinsic. You could probably get away with linking to MSVCRT if you wanted to and then replace __movsb with memcpy. You can also import memcpy from ntdll.dll or write your own.
Once the executable is moved out of the way, I call a function in a DLL that contains the rest of my code. The DLL uses UnmapViewOfFile to get rid of the original PE image, which gives me a nice 4MB+ chunk of memory to work with, guaranteed not to contain mapped files, thread stacks, or heaps.
A few things to keep in mind with this technique:
This is a huge hack. I felt dirty writing it and it very well could fall apart in future versions of Windows. The code works on Windows 7 and Windows 10, at least; I haven't tested it on anything else.
Since the executable is built with /FIXED /BASE, its code is not position-independent and you can't just jump to the relocated executable.
If the DLL function that calls UnmapViewOfFile returns, the program will crash because the code section we called from doesn't exist anymore. I use ExitProcess to ensure the function never returns.
Some sections in the relocated PE image like those containing code can be released using VirtualFree to free up some physical memory.
My code doesn't bother re-sorting the loader data table entries. It seems to work fine that way, but it could break if something were to depend on the entries being ordered by image address.
Some anti-virus programs might get suspicious about this stuff. Microsoft Security Essentials didn't complain, at least.
In hindsight, dxiv's dummy DLL method may have been easier, because I wouldn't need to mess with the PEB. But I stuck with this technique because the executable is more likely to be loaded at its desired base address. The dummy DLL method didn't work for me. DLLs are loaded by Ntdll after Windows has already reserved regions of memory that I need.

Multi-processing with singletons C++ on Linux x86_64

For the following question, I am looking for an answer that is based on "pure" C/C++ fundamentals, so I would appreciate a non-Boost answer. Thanks.
I have an application (for example, a telecommunications infrastructure server) which will, when started, spawn several processes on a Linux environment (one for logging, one for Timer management, one for protocol messaging, one for message processing etc.). It is on an x86_64 environment on Gentoo. The thing is, I need a singleton to be able to be accessible from all the processes.
This is different from multi-threading using, say, POSIX threads on Linux, because all POSIX threads share the same address space, but that is not the case when multiple processes, generated by the fork() function call, are used. When the same address space is used, the singleton is just the same address in all the threads, and the problem is trivially solved (using the well known protections, which are old hat for everybody on SO). I do enjoy the protections offered to me by multiple processes generated via fork().
Going back to my problem, I feel like the correct way to approach this would be to create the singleton in shared memory, and then pass a handle to the shared memory into the calling tasks.
I imagine the following (SomeSingleton.h):
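#include <fcntl.h>    // O_CREAT, O_EXCL, O_RDWR
#include <sys/mman.h> // shm_open, mmap
#include <unistd.h>   // ftruncate
#include "SomeGiantObject.h"
const int size = 8192; // Enough to contain the SomeSingleton object
// Creates and maps the shared region; returns NULL on failure.
static void* mapSingletonMemory()
{
int shm_fd = shm_open ("/some_singleton_shm", O_CREAT | O_EXCL | O_RDWR, 0666);
if (shm_fd < 0 || ftruncate (shm_fd, size) != 0)
return NULL;
void* p = mmap (NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
return (p == MAP_FAILED) ? NULL : p;
}
// Global pointer to the shared memory, initialized once at program start.
static void* sharedMemoryLocationForSomeSingleton = mapSingletonMemory();
class SomeSingleton
{
public:
static SomeSingleton* getInstance ()
{
return reinterpret_cast<SomeSingleton*>(sharedMemoryLocationForSomeSingleton);
}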
private:
SomeSingleton();
/*
Whole bunch of attributes that is shared across processes.
These attributes also should be in shared memory.
e.g., in the following
SomeGiantObject* obj;
obj should also be in shared memory.
*/
};
The getInstance() method returns the shared memory location for the SomeSingleton object.
My questions are as follows:
Is this a legitimate way to handle the problem? How have folks on SO handled this problem before?
For the code above to work, I envision a global declaration (static by definition) that points to the shared memory as shown before the class declaration.
Last but not least, I know that on Linux the overheads of creating threads vs. processes are "relatively similar," but I was wondering why there is not much by way of multi-processing discussion on SO (gobs of multi-threading, though!). There isn't even a tag here! Has multi-processing (using fork()) fallen out of favor among the C++ coding community? Any insight on that is also appreciated. Also, may I request someone with a reputation > 1500 to create a tag "multi-processing"? Thanks.

If you create the shared memory region before forking, then it will be mapped at the same address in all peers.
You can use a custom allocator to place contained objects inside the shared region also. This should probably be done before forking as well, but be careful of repetition of destructor calls (destructors that e.g. flush buffers are fine, but anything that makes an object unusable should be skipped, just leak and let the OS reclaim the memory after all processes close the shared memory handle).
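A minimal sketch of the "map before forking" idea, under some assumptions: it uses an anonymous shared mapping (MAP_ANONYMOUS, a common extension available on Linux) instead of the shm_open() object from the question, and SharedCounter and shareAcrossFork are hypothetical names. The object is constructed with placement new before fork(), so parent and child see it at the same address.

```cpp
#include <new>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical payload: plain data only, no pointers into private memory.
struct SharedCounter {
    int value;
};

// Maps one shared region, forks, lets the child write into it, and returns
// the value the parent reads back afterwards (or -1 on failure).
int shareAcrossFork()
{
    void* region = mmap(NULL, sizeof(SharedCounter), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return -1;

    // Construct the object in place BEFORE forking so that both processes
    // inherit the mapping at the same address.
    SharedCounter* counter = new (region) SharedCounter();
    counter->value = 1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(region, sizeof(SharedCounter));
        return -1;
    }
    if (pid == 0) {
        counter->value = 42;  // child writes through the inherited mapping
        _exit(0);             // leave without running destructors or atexit handlers
    }

    int status = 0;
    waitpid(pid, &status, 0); // the child has finished writing once this returns
    int seen = counter->value;
    munmap(region, sizeof(SharedCounter));
    return seen;
}
```

Calling shareAcrossFork() in the parent should return the value written by the child. The child leaves with _exit() rather than returning, which is one way to avoid the repeated destructor calls mentioned above. With concurrent access, rather than the wait-then-read used here, a process-shared lock would also be needed.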

How can I cope with 32bit/64bit mismatches when doing IPC via SendMessage?

I have a piece of C++ code which reads out the text of a tree item (as contained in a plain Common Controls Tree View) using the TVM_GETITEM window message. The tree view which receives the message is in a different process, so I'm using a little bit of shared memory for the structure which is pointed to by one of the arguments to the window message. I have to do this work since the remote process is not under my control (I'm writing an application similar to Spy++).
This works well in principle, but fails in case the target process is substantially different:
If the code of the target process was built with UNICODE defined but my own code wasn't, the two processes will have different ideas about the structure of the string members in the TVITEM structure. I solved this already using an IsWindowUnicode call and then explicitly sending either TVM_GETITEMA or TVM_GETITEMW (recoding the result if necessary).
If the calling process was built in 32bit mode and the target process is 64bit (or the other way round), the layout (and size) of the TVITEM structure is different since pointers have a different size.
I'm currently trying to find a good way to solve the second issue. This particular use case (getting the tree item text) is just an example, the same issue exists for other window messages which my code is sending. Right now, I'm considering two approaches:
Build my code twice and then execute either the 32bit or the 64bit code depending on what the target process does. This requires some changes to our build- and packaging system, and it requires factoring the code which is architecture specific out into a dedicated process (right now it's in a DLL). Once that is done, it should work nicely.
Detect the image format of the target process at runtime and then use custom structs instead of the TVITEM structure which explicitly use 32bit or 64bit wide pointers. This requires writing code to detect the architecture of a remote process (I hope I can do this by calling GetModuleFileName on the remote process and then analyzing the PE header using the Image Help Library) and hardcoding two structs (one with 32bit pointers, one with 64bit). Furthermore, I have to make sure that the shared memory address is in the 32bit address space (so that my own code can always access it, even if it's compiled in 32bit mode).
Did anybody else have to solve a similar problem? Are there easier solutions?

I ended up checking whether the remote process is 32bit or 64bit at runtime, and then writing the right structure to the shared memory before sending a message.
For instance, here's how you can use the TVM_GETITEM message even if there's a 32bit <-> 64bit mixup between the caller and the receiver of the message:
/* This template is basically a copy of the TVITEM struct except that
* all fields which return a pointer have a variable type. This allows
* creating different types for different pointer sizes.
*/
template <typename AddrType>
struct TVITEM_3264 {
UINT mask;
AddrType hItem;
UINT state;
UINT stateMask;
AddrType pszText;
int cchTextMax;
int iImage;
int iSelectedImage;
int cChildren;
AddrType lParam;
};
typedef TVITEM_3264<UINT32> TVITEM32;
typedef TVITEM_3264<UINT64> TVITEM64;
// .... later, I can then use the above template like this:
LPARAM _itemInfo;
DWORD pid;
::GetWindowThreadProcessId( treeViewWindow, &pid );
if ( is64BitProcess( pid ) ) {
TVITEM64 itemInfo;
ZeroMemory( &itemInfo, sizeof( itemInfo ) );
itemInfo.mask = TVIF_HANDLE | TVIF_TEXT;
itemInfo.hItem = (UINT64)m_item;
itemInfo.pszText = (UINT64)(LPTSTR)sharedMem->getSharedMemory( sizeof(itemInfo) );
itemInfo.cchTextMax = MaxTextLength;
_itemInfo = (LPARAM)sharedMem->write( &itemInfo, sizeof(itemInfo) );
} else {
TVITEM32 itemInfo;
ZeroMemory( &itemInfo, sizeof( itemInfo ) );
itemInfo.mask = TVIF_HANDLE | TVIF_TEXT;
itemInfo.hItem = (UINT32)m_item;
itemInfo.pszText = (UINT32)(LPTSTR)sharedMem->getSharedMemory( sizeof(itemInfo) );
itemInfo.cchTextMax = MaxTextLength;
_itemInfo = (LPARAM)sharedMem->write( &itemInfo, sizeof(itemInfo) );
}
The sharedMem->getSharedMemory function is a little helper function to get a pointer to the shared memory region; the optional function argument specifies an offset value. What's important is that the shared memory region should always be in the 32bit address space (so that even a 32bit remote process can access it).
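Since the whole approach depends on the hand-written structs having exactly the layout the remote process expects, a compile-time check can catch mistakes early. The sketch below restates the template with <cstdint> types so it compiles without the Windows headers; TvItemLayout is a hypothetical name, and the expected sizes assume natural alignment (8-byte alignment for 64-bit integers). It is worth comparing them against sizeof(TVITEM) in a real 32-bit and a real 64-bit build rather than trusting these numbers.

```cpp
#include <cstddef>
#include <cstdint>

// Same field order as the TVITEM_3264 template above, written with
// <cstdint> types instead of the Windows typedefs.
template <typename AddrType>
struct TvItemLayout {
    std::uint32_t mask;
    AddrType      hItem;
    std::uint32_t state;
    std::uint32_t stateMask;
    AddrType      pszText;
    std::int32_t  cchTextMax;
    std::int32_t  iImage;
    std::int32_t  iSelectedImage;
    std::int32_t  cChildren;
    AddrType      lParam;
};

typedef TvItemLayout<std::uint32_t> TvItemLayout32;
typedef TvItemLayout<std::uint64_t> TvItemLayout64;

// 32-bit variant: ten 4-byte fields, no padding.
static_assert(sizeof(TvItemLayout32) == 40, "unexpected 32-bit layout");
static_assert(offsetof(TvItemLayout32, pszText) == 16, "unexpected 32-bit offset");

// 64-bit variant: the compiler inserts 4 bytes of padding after 'mask'
// so that 'hItem' starts on an 8-byte boundary (assumes 8-byte alignment
// of 64-bit integers on the build platform).
static_assert(sizeof(TvItemLayout64) == 56, "unexpected 64-bit layout");
static_assert(offsetof(TvItemLayout64, hItem) == 8, "unexpected 64-bit offset");
static_assert(offsetof(TvItemLayout64, pszText) == 24, "unexpected 64-bit offset");
```

If one of the assertions fails, the build stops instead of a malformed structure being sent to another process at runtime.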

IMHO there is a design problem. I don't know why you are doing it this way; maybe you don't have total control of all parts. But from a basic MVC perspective, you are peeking at values from a view instead of asking the model for them.

I'm not familiar with this particular message, but if the Windows TVM_GETITEM message is supposed to function correctly across processes, then Windows should fill in the TVITEM struct in the caller's address space and handle any needed conversions for you, without needing you to supply shared memory. If it isn't, then I doubt the problem you're seeing here is easily solvable without some uncomfortable contortions.
The shared memory bit confuses me; generally you have to make both processes explicitly aware of the shared memory segment, and you didn't mention DLL injection or anything like that. How exactly is the callee made aware of the shared memory section in its address space, and how are you using it? Are you sure it's needed for this API?

Can I pass an object to another process just by passing its pointer to shared memory?

I have a very complicated class (it has an unordered_map and so on inside it) and I want to share an object of it between my two processes. Can I simply pass a pointer to it from one process to another? I think not, but I hope to hear "Yes!".
If "no", I'd be grateful for any links on how to cope in such cases.
I need to have only one instance of this object for all processes because it's very large and all of the processes will work with it read-only.

You certainly can use IPC to accomplish this, and there are plenty of cases where multiple processes make more sense than a multithreaded process (at least one of the processes is built on legacy code to which you can't make extensive modifications, they would best be written in different languages, you need to minimize the chance of faults in one process affecting the stability of the others, etc.) In a POSIX-compatible environment, you would do
int descriptor = shm_open("/unique_name_here", O_RDWR | O_CREAT, 0777);
if (descriptor < 0) {
/* handle error */
} else {
ftruncate(descriptor, sizeof(Object));
void *ptr = mmap(NULL, sizeof(Object), PROT_READ | PROT_WRITE | PROT_EXEC, MAP_SHARED, descriptor, 0);
if (!ptr || ptr == MAP_FAILED)
/* handle error */ ;
Object *obj = new (ptr) Object(arguments);
}
in one process, and then
int descriptor = shm_open("/the_same_name_here", O_RDWR | O_CREAT, 0777);
if (descriptor < 0) {
/* handle error */
} else {
Object *obj = (Object *) mmap(NULL, sizeof(Object), PROT_READ | PROT_WRITE | PROT_EXEC, MAP_SHARED, descriptor, 0);
if (!obj || obj == MAP_FAILED)
/* handle error */ ;
}
in the other. There are many more options, and I didn't show the cleanup code when you're done, so you still ought to read the shm_open() and mmap() manpages, but this should get you started. A few things to remember:
/All/ of the memory the object uses needs to be shared. For example, if the Object contains pointers or references to other objects, or dynamically allocated members (including things like containers, std::string, etc.), you'll have to use placement new to create everything (or at least everything that needs to be shared with the other processes) inside the shared memory blob. You don't need a new shm_open() for each object, but you do have to track (in the creating process) their sizes and offsets, which can be error-prone in non-trivial cases and absolutely hair-pulling if you have fancy auto-allocating types such as STL containers.
If any process will be modifying the object after it's been shared, you'll need to provide a separate synchronization mechanism. This is no worse than what you'd do in a multithreaded program, but you do have to think about it.
If the 'client' processes do not need to modify the shared object, you should open their handles with O_RDONLY instead of O_RDWR and invoke mmap() without the PROT_WRITE permission flag. If the client processes might make local modifications that need not be shared with the other processes, invoke mmap() with MAP_PRIVATE instead of MAP_SHARED. This will greatly reduce the amount of synchronization required and the risks of screwing it up.
If these processes will be running on a multiuser system and/or the shared data may be sensitive and/or this is a high-availability application, you're going to want more sophisticated access control than what is shown above. Shared memory is a common source of security holes.

No, processes do not (naturally) share memory. If Boost is an option, you can have a look at Boost.Interprocess for easy memory sharing.

No, the pointer is meaningless to the other process. The OS creates a separate address space for other processes; by default, they have no idea that other processes are running, or even that such a thing is possible.

The trick here is that the memory has to be mapped the same way in both your processes. If your mapped shared memory can be arranged that way, it'll work, but I bet it'll be very difficult.
There are a couple of other possibilities. The first one is to use an array; array indices will work across both processes.
You can also use placement new to make sure you're allocating objects at a known location within the shared memory, and use those offsets.
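The offset idea can be sketched without any real shared memory. In the example below (Header, Node and the function names are hypothetical), a small linked list is stored in a byte buffer using offsets from the start of the buffer instead of pointers. The buffer is then copied to a second buffer at a different address, which stands in for another process mapping the same region somewhere else, and the list is still readable there.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Header stored at offset 0 of the region.
struct Header {
    std::size_t first;   // byte offset of the first node, 0 if the list is empty
};

// List node: 'next' is a byte offset from the start of the region, not a
// pointer, so it stays valid wherever the region happens to be mapped.
struct Node {
    int         value;
    std::size_t next;    // 0 means "end of list"
};

// Writes the list 10 -> 20 -> 30 into the region starting at 'base'.
void buildList(unsigned char* base)
{
    Header header;
    header.first = sizeof(Header);
    std::memcpy(base, &header, sizeof header);
    for (int i = 0; i < 3; ++i) {
        Node node;
        node.value = (i + 1) * 10;
        node.next  = (i < 2) ? sizeof(Header) + (i + 1) * sizeof(Node) : 0;
        std::memcpy(base + sizeof(Header) + i * sizeof(Node), &node, sizeof node);
    }
}

// Walks the list using only 'base' plus stored offsets and sums the values.
int sumList(const unsigned char* base)
{
    Header header;
    std::memcpy(&header, base, sizeof header);
    int sum = 0;
    for (std::size_t offset = header.first; offset != 0; ) {
        Node node;
        std::memcpy(&node, base + offset, sizeof node);
        sum += node.value;
        offset = node.next;
    }
    return sum;
}

// Builds the list in one buffer, copies the bytes to a buffer at another
// address (imitating a second process's mapping), and reads it from there.
int sumThroughSecondMapping()
{
    std::vector<unsigned char> first(sizeof(Header) + 3 * sizeof(Node));
    buildList(first.data());
    std::vector<unsigned char> second(first);  // same bytes, different address
    return sumList(second.data());
}
```

Had the nodes stored raw pointers into the first buffer, the copy would still point back at the original addresses, which is exactly what goes wrong when two processes map a region at different base addresses.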

If you are on Linux, you could use shared memory to store common data between processes. For the general case, take a look at the Boost IPC library.
But a pointer from one process cannot be used in another (its address can be used if accessing I/O or some special devices).

If you use Qt4 there's QSharedMemory or you could use sockets and a custom serialization protocol.

What is the closest thing Windows has to fork()?

I guess the question says it all.
I want to fork on Windows. What is the most similar operation and how do I use it.

Cygwin has a fully featured fork() on Windows. So if using Cygwin is acceptable for you and performance is not an issue, the problem is solved.
Otherwise you can take a look at how Cygwin implements fork(). From a quite old Cygwin's architecture doc:
5.6. Process Creation
The fork call in Cygwin is particularly interesting because it does not map well on top of the Win32 API. This makes it very difficult to implement correctly. Currently, the Cygwin fork is a non-copy-on-write implementation similar to what was present in early flavors of UNIX.
The first thing that happens when a parent process forks a child process is that the parent initializes a space in the Cygwin process table for the child. It then creates a suspended child process using the Win32 CreateProcess call. Next, the parent process calls setjmp to save its own context and sets a pointer to this in a Cygwin shared memory area (shared among all Cygwin tasks). It then fills in the child's .data and .bss sections by copying from its own address space into the suspended child's address space. After the child's address space is initialized, the child is run while the parent waits on a mutex. The child discovers it has been forked and longjumps using the saved jump buffer. The child then sets the mutex the parent is waiting on and blocks on another mutex. This is the signal for the parent to copy its stack and heap into the child, after which it releases the mutex the child is waiting on and returns from the fork call. Finally, the child wakes from blocking on the last mutex, recreates any memory-mapped areas passed to it via the shared area, and returns from fork itself.
While we have some ideas as to how to speed up our fork implementation by reducing the number of context switches between the parent and child process, fork will almost certainly always be inefficient under Win32. Fortunately, in most circumstances the spawn family of calls provided by Cygwin can be substituted for a fork/exec pair with only a little effort. These calls map cleanly on top of the Win32 API. As a result, they are much more efficient. Changing the compiler's driver program to call spawn instead of fork was a trivial change and increased compilation speeds by twenty to thirty percent in our tests.
However, spawn and exec present their own set of difficulties. Because there is no way to do an actual exec under Win32, Cygwin has to invent its own Process IDs (PIDs). As a result, when a process performs multiple exec calls, there will be multiple Windows PIDs associated with a single Cygwin PID. In some cases, stubs of each of these Win32 processes may linger, waiting for their exec'd Cygwin process to exit.
Sounds like a lot of work, doesn't it? And yes, it is slooooow.
EDIT: the doc is outdated, please see this excellent answer for an update

I certainly don't know the details on this because I've never done it, but the native NT API has a capability to fork a process (the POSIX subsystem on Windows needs this capability - I'm not sure if the POSIX subsystem is even supported anymore).
A search for ZwCreateProcess() should get you some more details - for example this bit of information from Maxim Shatskih:
The most important parameter here is SectionHandle. If this parameter
is NULL, the kernel will fork the current process. Otherwise, this
parameter must be a handle of the SEC_IMAGE section object created on
the EXE file before calling ZwCreateProcess().
Though note that Corinna Vinschen indicates that Cygwin found using ZwCreateProcess() still unreliable:
Iker Arizmendi wrote:
> Because the Cygwin project relied solely on Win32 APIs its fork
> implementation is non-COW and inefficient in those cases where a fork
> is not followed by exec. It's also rather complex. See here (section
> 5.6) for details:
>
> http://www.redhat.com/support/wpapers/cygnus/cygnus_cygwin/architecture.html
This document is rather old, 10 years or so. While we're still using
Win32 calls to emulate fork, the method has changed noticably.
Especially, we don't create the child process in the suspended state
anymore, unless specific datastructes need a special handling in the
parent before they get copied to the child. In the current 1.5.25
release the only case for a suspended child are open sockets in the
parent. The upcoming 1.7.0 release will not suspend at all.
One reason not to use ZwCreateProcess was that up to the 1.5.25
release we're still supporting Windows 9x users. However, two
attempts to use ZwCreateProcess on NT-based systems failed for one
reason or another.
It would be really nice if this stuff would be better or at all
documented, especially a couple of datastructures and how to connect a
process to a subsystem. While fork is not a Win32 concept, I don't
see that it would be a bad thing to make fork easier to implement.

Well, Windows doesn't really have anything quite like it. Especially since fork can be used to conceptually create a thread or a process in *nix.
So, I'd have to say:
CreateProcess()/CreateProcessEx()
and
CreateThread() (I've heard that for C applications, _beginthreadex() is better).

People have tried to implement fork on Windows. This is the closest thing to it I can find:
Taken from: http://doxygen.scilab.org/5.3/d0/d8f/forkWindows_8c_source.html#l00216
static BOOL haveLoadedFunctionsForFork(void);
int fork(void)
{
HANDLE hProcess = 0, hThread = 0;
OBJECT_ATTRIBUTES oa = { sizeof(oa) };
MEMORY_BASIC_INFORMATION mbi;
CLIENT_ID cid;
USER_STACK stack;
PNT_TIB tib;
THREAD_BASIC_INFORMATION tbi;
CONTEXT context = {
CONTEXT_FULL |
CONTEXT_DEBUG_REGISTERS |
CONTEXT_FLOATING_POINT
};
if (setjmp(jenv) != 0) return 0; /* return as a child */
/* check whether the entry points are
initilized and get them if necessary */
if (!ZwCreateProcess && !haveLoadedFunctionsForFork()) return -1;
/* create forked process */
ZwCreateProcess(&hProcess, PROCESS_ALL_ACCESS, &oa,
NtCurrentProcess(), TRUE, 0, 0, 0);
/* set the Eip for the child process to our child function */
ZwGetContextThread(NtCurrentThread(), &context);
/* In x64 the Eip and Esp are not present,
their x64 counterparts are Rip and Rsp respectively.
Cast through a type as wide as the register so the
address is not truncated on x64. */
#if _WIN64
context.Rip = (DWORD64)child_entry;
#else
context.Eip = (DWORD)child_entry;
#endif
#if _WIN64
ZwQueryVirtualMemory(NtCurrentProcess(), (PVOID)context.Rsp,
MemoryBasicInformation, &mbi, sizeof mbi, 0);
#else
ZwQueryVirtualMemory(NtCurrentProcess(), (PVOID)context.Esp,
MemoryBasicInformation, &mbi, sizeof mbi, 0);
#endif
stack.FixedStackBase = 0;
stack.FixedStackLimit = 0;
stack.ExpandableStackBase = (PCHAR)mbi.BaseAddress + mbi.RegionSize;
stack.ExpandableStackLimit = mbi.BaseAddress;
stack.ExpandableStackBottom = mbi.AllocationBase;
/* create thread using the modified context and stack */
ZwCreateThread(&hThread, THREAD_ALL_ACCESS, &oa, hProcess,
&cid, &context, &stack, TRUE);
/* copy exception table */
ZwQueryInformationThread(NtCurrentThread(), ThreadBasicInformation,
&tbi, sizeof tbi, 0);
tib = (PNT_TIB)tbi.TebBaseAddress;
ZwQueryInformationThread(hThread, ThreadBasicInformation,
&tbi, sizeof tbi, 0);
ZwWriteVirtualMemory(hProcess, tbi.TebBaseAddress,
&tib->ExceptionList, sizeof tib->ExceptionList, 0);
/* start (resume really) the child */
ZwResumeThread(hThread, 0);
/* clean up */
ZwClose(hThread);
ZwClose(hProcess);
/* exit with child's pid */
return (int)cid.UniqueProcess;
}
static BOOL haveLoadedFunctionsForFork(void)
{
HANDLE ntdll = GetModuleHandle("ntdll");
if (ntdll == NULL) return FALSE;
if (ZwCreateProcess && ZwQuerySystemInformation && ZwQueryVirtualMemory &&
ZwCreateThread && ZwGetContextThread && ZwResumeThread &&
ZwQueryInformationThread && ZwWriteVirtualMemory && ZwClose)
{
return TRUE;
}
ZwCreateProcess = (ZwCreateProcess_t) GetProcAddress(ntdll,
"ZwCreateProcess");
ZwQuerySystemInformation = (ZwQuerySystemInformation_t)
GetProcAddress(ntdll, "ZwQuerySystemInformation");
ZwQueryVirtualMemory = (ZwQueryVirtualMemory_t)
GetProcAddress(ntdll, "ZwQueryVirtualMemory");
ZwCreateThread = (ZwCreateThread_t)
GetProcAddress(ntdll, "ZwCreateThread");
ZwGetContextThread = (ZwGetContextThread_t)
GetProcAddress(ntdll, "ZwGetContextThread");
ZwResumeThread = (ZwResumeThread_t)
GetProcAddress(ntdll, "ZwResumeThread");
ZwQueryInformationThread = (ZwQueryInformationThread_t)
GetProcAddress(ntdll, "ZwQueryInformationThread");
ZwWriteVirtualMemory = (ZwWriteVirtualMemory_t)
GetProcAddress(ntdll, "ZwWriteVirtualMemory");
ZwClose = (ZwClose_t) GetProcAddress(ntdll, "ZwClose");
if (ZwCreateProcess && ZwQuerySystemInformation && ZwQueryVirtualMemory &&
ZwCreateThread && ZwGetContextThread && ZwResumeThread &&
ZwQueryInformationThread && ZwWriteVirtualMemory && ZwClose)
{
return TRUE;
}
else
{
ZwCreateProcess = NULL;
ZwQuerySystemInformation = NULL;
ZwQueryVirtualMemory = NULL;
ZwCreateThread = NULL;
ZwGetContextThread = NULL;
ZwResumeThread = NULL;
ZwQueryInformationThread = NULL;
ZwWriteVirtualMemory = NULL;
ZwClose = NULL;
}
return FALSE;
}

Before Microsoft introduced the Windows Subsystem for Linux (WSL), CreateProcess() was the closest thing Windows had to fork(), but it requires you to specify an executable to run in the new process.
UNIX process creation is quite different from Windows. The fork() call duplicates the current process almost in its entirety, gives the copy its own address space, and lets both continue running separately. Although the two processes are distinct, they are still running the same program. See here for a good overview of the fork/exec model.
Going back the other way, the equivalent of the Windows CreateProcess() is the fork()/exec() pair of functions in UNIX.
If you are porting software to Windows and don't mind a translation layer, Cygwin provides the capability that you want, but it is rather kludgy.
Of course, with the new Linux subsystem, the closest thing Windows has to fork() is actually fork() :-)

As other answers have mentioned, NT (the kernel underlying modern versions of Windows) has an equivalent of Unix fork(). That's not the problem.
The problem is that cloning a process's entire state is not generally a sane thing to do. This is as true in the Unix world as it is in Windows, but in the Unix world, fork() is used all the time, and libraries are designed to deal with it. Windows libraries aren't.
For example, the system DLLs kernel32.dll and user32.dll maintain a private connection to the Win32 server process csrss.exe. After a fork, there are two processes on the client end of that connection, which is going to cause problems. The child process should inform csrss.exe of its existence and make a new connection – but there's no interface to do that, because these libraries weren't designed with fork() in mind.
So you have two choices. One is to forbid the use of kernel32 and user32 and other libraries that aren't designed to be forked – including any libraries that link directly or indirectly to kernel32 or user32, which is virtually all of them. This means that you can't interact with the Windows desktop at all, and are stuck in your own separate Unixy world. This is the approach taken by the various Unix subsystems for NT.
The other option is to resort to some sort of horrible hack to try to get unaware libraries to work with fork(). That's what Cygwin does. It creates a new process, lets it initialize (including registering itself with csrss.exe), then copies most of the dynamic state over from the old process and hopes for the best. It amazes me that this ever works. It certainly doesn't work reliably – even if it doesn't randomly fail due to an address space conflict, any library you're using may be silently left in a broken state. The claim of the current accepted answer that Cygwin has a "fully-featured fork()" is... dubious.
Summary: In an Interix-like environment, you can fork by calling fork(). Otherwise, please try to wean yourself from the desire to do it. Even if you're targeting Cygwin, don't use fork() unless you absolutely have to.

The following document provides some information on porting code from UNIX to Win32:
https://msdn.microsoft.com/en-us/library/y23kc048.aspx
Among other things, it indicates that the process model is quite different between the two systems and recommends consideration of CreateProcess and CreateThread where fork()-like behavior is required.

"as soon as you want to do file access or printf then io are refused"
You cannot have your cake and eat it too... In msvcrt.dll, printf() is based on the Console API, which itself uses LPC to communicate with the console subsystem (csrss.exe). The connection with csrss is initiated at process start-up, which means that any process that begins its execution "in the middle" skips that step. Unless you have access to the source code of the operating system, there is no point in trying to connect to csrss manually. Instead, you should create your own subsystem and, accordingly, avoid the console functions in applications that use fork().
Once you have implemented your own subsystem, don't forget to also duplicate all of the parent's handles for the child process ;-)
"Also, you probably shouldn't use the Zw* functions unless you're in kernel mode, you should probably use the Nt* functions instead."
This is incorrect. When accessed in user mode, there is absolutely no difference between Zw* and Nt*; these are merely two different exported names (in ntdll.dll) that refer to the same (relative) virtual address.
ZwGetContextThread(NtCurrentThread(), &context);
Obtaining the context of the current (running) thread by calling ZwGetContextThread is wrong, is likely to crash, and (due to the extra system call) is also not the fastest way to accomplish the task.

Your best options are CreateProcess() or CreateThread(). There is more information on porting here.

There is no easy way to emulate fork() on Windows.
I suggest you use threads instead.

fork() semantics are necessary where the child needs access to the actual memory state of the parent as of the instant fork() is called. I have a piece of software which relies on the implicit mutex of memory copying as of the instant fork() is called, which makes threads impossible to use. (This is emulated on modern *nix platforms via copy-on-write/update-memory-table semantics.)
The closest that exists on Windows as a syscall is CreateProcess. The best that can be done is for the parent to freeze all other threads during the time that it is copying memory over to the new process's memory space, then thaw them. Neither the Cygwin frok [sic] class nor the Scilab code that Eric des Courtis posted does the thread-freezing, that I can see.
Also, you probably shouldn't use the Zw* functions unless you're in kernel mode, you should probably use the Nt* functions instead. There's an extra branch that checks whether you're in kernel mode and, if not, performs all of the bounds checking and parameter verification that Nt* always do. Thus, it's very slightly less efficient to call them from user mode.

The closest you say... Let me think... This must be fork() I guess :)
For details see Does Interix implement fork()?

Most of the hacky solutions are outdated. The Winnie fuzzer has a version of fork that works on current versions of Windows 10 (though it requires system-specific offsets and can break easily too).
https://github.com/sslab-gatech/winnie/tree/master/forklib

If you only care about creating a subprocess and waiting for it, perhaps the _spawn* APIs in process.h are sufficient. Here's more information about that:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/process-and-environment-control
https://en.wikipedia.org/wiki/Process.h
