Interprocess communication between 32- and 64-bit apps on Windows x64

Interprocess communication between 32- and 64-bit apps on Windows x64 - c++

We'd like to support some hardware that has recently been discontinued. The driver for the hardware is a plain 32-bit C DLL. We don't have the source code, and (for legal reasons) are not interested in decompiling or reverse engineering the driver.
The hardware sends tons of data quickly, so the communication protocol needs to be pretty efficient.
Our software is a native 64-bit C++ app, but we'd like to access the hardware via a 32-bit process. What is an efficient, elegant way for 32-bit and 64-bit applications to communicate with each other (that, ideally, doesn't involve inventing a new protocol)?
The solution should be in C/C++.
Update: several respondents asked for clarification whether this was a user-mode or kernel-mode driver. Fortunately, it's a user-mode driver.

If this is a real driver (kernel mode), you're SOL. Vista x64 doesn't allow installing unsigned drivers. It this is just a user-mode DLL, you can get a fix by using any of the standard IPC mechanisms. Pipes, sockets, out-of-proc COM, roughly in that order. It all operates on bus speeds so as long as you can buffer enough data, the context switch overhead shouldn't hurt too much.

I would just use sockets. It would allow you to use it over IP if you need it in the future, and you won't be tied down to one messaging API. If in the future you wish to implement this on another OS or language, you can.

This article might be of interest. It discusses the problem and then suggests using COM as a solution. I'm not a big fan of COM but given its ubiquity in the Windows universe, it's possible that it might be efficient enough. You will probably want to architect your solution so that you can batch data (you don't want to do one COM call for each item of data).

Elegant? C++? DCOM/RPC calls to yourself might work, or you could create a named pipe and use that to talk between the two processes (maybe create a "CMessage class" or something), though watch out for different structure alignment between x86 and x64.

If the driver does turn out to be a real driver, nobugz is almost right -- you're going to have to work a lot harder, you're not completely SOL. One solution is to install Win32 on some other machine (or virtual machine) and then use some form of RPC, such as sockets (as suggested by Pyrolistical) or UDP or MQ or even Tibco Rendezvous (which claims to support very high throughput in order to handle the volumes of data generated by the financial markets -- at least that's what I remember from back in the old days).

A memory-mapped file, shared by both sides would have the same contents. The OS will have to do some interesting pointer stuff to make it happen, but quite likely will be able to setup the 2 views in such a way that you're not physically copying memory around. Zero copies is about as good as it gets

Related

kernel vs user-space audio device driver on macOS

I'm in a need to develop an audio device driver for System Audio Capture(based on Soundflower). But soon a problem appeared that it seems IOAudioFamily stack is being deprecated in OSX 10.10 and later. Looking through the IOAudioDevice and IOAudioEngine header files it seems that apple recommends now using the <CoreAudio/AudioServerPlugIn.h> API which runs in user-space. But I can't find lots of information on this user-space device drivers topic. It seems that the only resource is the Apple provided sample devices from https://developer.apple.com/library/prerelease/content/samplecode/AudioDriverExamples/Introduction/Intro.html
Looking through the examples I find that its a lot harder and more work to develop a user-space driver instead of I/O Kit kernel based.
So the question arises what should motivate to develop a device driver in user-space instead of kernel space?

The "SimpleAudioDriver" example is somewhat misnamed. It demonstrates pretty much every feature of the API. This is handy as a reference if you actually need to use those features. It's also structured in a way that's maybe a little more complicated than necessary.
For a virtual device, the NullAudioDriver is probably a much better base, and much, much easier to understand (single source file, if I remember correctly). SimpleAudioDriver is more useful for dealing with issues such as hotplugging, multiple instances of identical devices, etc.
IOAudioEngine is deprecated as you say, and has been since OS X 10.10. Expect it to go away eventually, so if you build your driver with it, you'll probably need to rewrite it sooner than if you create a Core Audio Server Plugin based one.
Testing and debugging audio drivers is awkward either way (due to being so time sensitive), but I'd say userspace ones are slightly less frustrating to deal with. You'll still want to test on a different machine than your development Mac, because if coreaudiod crashes or hangs, apps usually start locking up too, so being able to just ssh in, delete your plugin and kill coreaudiod is handy. Certainly quicker turnaround than having to reboot.
(FWIW, I've shipped both kernel and userspace OS X audio drivers, and I spend a lot of time working on kexts.)

There is a great book on this subject, available free online here:
http://free-electrons.com/doc/books/ldd3.pdf
See page 37 for a summary of why you might want a user-space driver, copied here for convenience:
The advantages of user-space drivers are:
The full C library can be linked in. The driver can perform many exotic tasks without resorting to external programs (the utility
programs implementing usage policies that are usually distributed
along with the driver itself).
The programmer can run a conventional debugger on the driver code without having to go through contortions to debug a running kernel.
If a user-space driver hangs, you can simply kill it. Problems with the driver are unlikely to hang the entire system, unless the hardware
being controlled is really misbehaving.
User memory is swappable, unlike kernel memory. An infrequently used device with a huge driver won’t occupy RAM that other programs could
be using, except when it is actually in use.
A well-designed driver program can still, like kernel-space drivers, allow concurrent access to a device.
If you must write a closed-source driver, the user-space option makes it easier for you to avoid ambiguous licensing situations and
problems with changing kernel interfaces.

Fastest technique to pass messages between processes on Linux?

What is the fastest technology to send messages between C++ application processes, on Linux? I am vaguely aware that the following techniques are on the table:
TCP
UDP
Sockets
Pipes
Named pipes
Memory-mapped files
are there any more ways and what is the fastest?

Whilst all the above answers are very good, I think we'd have to discuss what is "fastest" [and does it have to be "fastest" or just "fast enough for "?]
For LARGE messages, there is no doubt that shared memory is a very good technique, and very useful in many ways.
However, if the messages are small, there are drawbacks of having to come up with your own message-passing protocol and method of informing the other process that there is a message.
Pipes and named pipes are much easier to use in this case - they behave pretty much like a file, you just write data at the sending side, and read the data at the receiving side. If the sender writes something, the receiver side automatically wakes up. If the pipe is full, the sending side gets blocked. If there is no more data from the sender, the receiving side is automatically blocked. Which means that this can be implemented in fairly few lines of code with a pretty good guarantee that it will work at all times, every time.
Shared memory on the other hand relies on some other mechanism to inform the other thread that "you have a packet of data to process". Yes, it's very fast if you have LARGE packets of data to copy - but I would be surprised if there is a huge difference to a pipe, really. Main benefit would be that the other side doesn't have to copy the data out of the shared memory - but it also relies on there being enough memory to hold all "in flight" messages, or the sender having the ability to hold back things.
I'm not saying "don't use shared memory", I'm just saying that there is no such thing as "one solution that solves all problems 'best'".
To clarify: I would start by implementing a simple method using a pipe or named pipe [depending on which suits the purposes], and measure the performance of that. If a significant time is spent actually copying the data, then I would consider using other methods.
Of course, another consideration should be "are we ever going to use two separate machines [or two virtual machines on the same system] to solve this problem. In which case, a network solution is a better choice - even if it's not THE fastest, I've run a local TCP stack on my machines at work for benchmark purposes and got some 20-30Gbit/s (2-3GB/s) with sustained traffic. A raw memcpy within the same process gets around 50-100GBit/s (5-10GB/s) (unless the block size is REALLY tiny and fits in the L1 cache). I haven't measured a standard pipe, but I expect that's somewhere roughly in the middle of those two numbers. [This is numbers that are about right for a number of different medium-sized fairly modern PC's - obviously, on a ARM, MIPS or other embedded style controller, expect a lower number for all of these methods]

I would suggest looking at this also: How to use shared memory with Linux in C.
Basically, I'd drop network protocols such as TCP and UDP when doing IPC on a single machine. These have packeting overhead and are bound to even more resources (e.g. ports, loopback interface).

NetOS Systems Research Group from Cambridge University, UK has done some (open-source) IPC benchmarks.
Source code is located at https://github.com/avsm/ipc-bench .
Project page: http://www.cl.cam.ac.uk/research/srg/netos/projects/ipc-bench/ .
Results: http://www.cl.cam.ac.uk/research/srg/netos/projects/ipc-bench/results.html
This research has been published using the results above: http://anil.recoil.org/papers/drafts/2012-usenix-ipc-draft1.pdf

Check CMA and kdbus:
https://lwn.net/Articles/466304/
I think the fastest stuff these days are based on AIO.
http://www.kegel.com/c10k.html

As you tagged this question with C++, I'd recommend Boost.Interprocess:
Shared memory is the fastest interprocess communication mechanism. The
operating system maps a memory segment in the address space of several
processes, so that several processes can read and write in that memory
segment without calling operating system functions. However, we need
some kind of synchronization between processes that read and write
shared memory.
Source
One caveat I've found is the portability limitations for synchronization primitives. Nor OS X, nor Windows have a native implementation for interprocess condition variables, for example,
and so it emulates them with spin locks.
Now if you use a *nix which supports POSIX process shared primitives, there will be no problems.
Shared memory with synchronization is a good approach when considerable data is involved.

Well, you could simply have a shared memory segment between your processes, using the linux shared memory aka SHM.
It's quite easy to use, look at the link for some examples.

posix message queues are pretty fast but they have some limitations

Need help to choose real-time OS and Hardware

I heard and read that Windows/Linux OS machines are not real-time.
I have read this article. It listed WindowsCE is one of RTOS. That's kind of confusing to me since I always thought WindowsCE is for a mobile or embeded device.
I need a real-time application running 24/7 and processing signals various sensors from each quick moving object to db and monitor by running several machine learning algorithms.
What would be proper real-time hardware and OS for this kind of applications? Development environment would be MFC or Qt C++. I really need opinions from experienced developers. Thanks

QNX has served me well in the past. I should warn you that it was only for training purposes (real-time industrial process control), and that I have implemented real time control programs with this OS by I've never really deployed one.
The first rule with real-time systems is to specify your real-time constraints, such as:
the system must be able to process up to 600 signals per minute; or
the system must spend no more than 1/10 second per signal.
The difference is subtle, but these are different constraints.
Just keep in mind that there is absolutely no way to decide if any hardward/OS/library combination is good enough for you unless you specify these constraints
For that, you think QNX might be proper? What would be its advantages over Windows/Linux systems with high priority setting?
If you look at the QNX documentation for many POSIX systems calls, you will notice they specify extra constraints on performance, which are possibly required to guarantee your real-time constraints. The OS is specifically designed to match these constraints. You won't get this on a system that is not officially an RTOS. If you are going to write real-time software, I recommend that you buy a good book on the subject. There is considerable literature given that the subject is very sensitive.
Get yourself a good book on real-time system design to get a feel of what questions to ask, and then read the technical documentation of each product you will use to see if it can match your constraints. Example of things to look in software libraries like Qt is when they allocate memory. If this is not documented in each class interface, there is no way to guarantee meeting your constraints since there is hidden algorithmic complexity.
Development environment would be MFC or Qt C++.
I would think that Qt compiles on QNX, but I'm not sure if Qt provides the guarantees required to match your real-time constraints. Libraries that abstract away too much stuff are risk since it's difficult to determine if they satisfy your requirements. Hidden memory management is often problematic, but there are other questions you should ask about too.
It seems to me that people say Real-time systems == embedded systems. Am I wrong?
Real-time system definitely does not equal "embedded system", though many embedded systems have real-time constraints.

How real time do you need?
Remember real time is about responsiveness, not speed. In fact most RTOS will be slower on average than general OS.
Do you need to guarrantee a certain average number of transactions/second or do you need to always respond within n seconds of an event?
Do you have custom hardware or are you relying on inputs over ethernet, USB, etc?
Are drivers for the hardware available on the RTOS or will you have to write them yourself ?
Windows and linux are possible RTOS. Windows embedded allows you to turn off services to give much more reliable response rate and there are both realtime kernels and realtime add-ons to Linux which give pretty much the same real time performance as something like VxWorks.
It also depends on how many tasks you need to handle. A lot of the complexity of true RTOS (like VxWorks) is that they can control many tasks at the same time while allowing each a guaranteed latency and CPU share - important for a Mars rover but not for a single data collection PC

How to program in Windows 7.0 to make it more deterministic?

My understanding is that Windows is non-deterministic and can be trouble when using it for data acquisition. Using a 32bit bus, and dual core, is it possible to use inline asm to work with interrupts in Visual Studio 2005 or at least set some kind of flags to be consistent in time with little jitter?
Going the direction of an RTOS(real time operating system): Windows CE with programming in kernel mode may get too expensive for us.

Real time solutions for Windows such as LabVIEW Real-time or RTX are expensive; a stand-alone RTOS would often be less expensive (or even free), but if you need Windows functionality as well, you are perhaps no further forward.
If cost is critical, you might run a free or low-cost RTOS in a virtual machine. This can work, though there is no cooperation over hardware access between the RTOS and Windows, and no direct communication mechanism (you could use TCP/IP over a virtual (or real) network I suppose.
Another alternative is to perform the real-time data acquisition on stand-alone hardware (a microcontroller development board or SBC for example) and communicate with Windows via USB or TCP/IP for example. It is possible that way to get timing jitter down to the microsecond level or better.

There are third-party realtime extensions to Windows. See, e. g. http://msdn.microsoft.com/en-us/library/ms838340(v=winembedded.5).aspx

Windows is not an RTOS, so there is no magic answer. However, there are some things you can do to make the system more "real time friendly".
Disable background processes that can steal system resources from you.
Use a multi-core processor to reduce the impact of context switching
If your program does any disk I/O, move that to its own spindle.
Look into process priority. Make sure your process is running as High or Realtime.
Pay attention to how your program manages memory. Avoid doing thigs that will lead to excessive disk paging.
Consider a real-time extension to Windows (already mentioned).
Consider moving to a real RTOS.
Consider dividing your system into two pieces: (1) real time component running on a microcontroller/DSP/FPGA, and (2) The user interface portion that runs on the Windows PC.

Creating a native application for X86?

Is there a way I could make a C or C++ program that would run without an operating system and that would draw something like a red pixel to the top left corner? I have always wondered how these types of applications are made. Since Windows is written in C I imagine there is a way to do this.
Thanks

If you're writing for a bare processor, with no library support at all, you'll have to get all the hardware manuals, figure out how to access your video memory, and perform whatever operations that hardware requires to get a pixel drawn onto the display (or a sound on the beeper, or a block of memory read from the disk, or whatever).
When you're using an operating system, you'll rely on device drivers to know all this for you. Programs are still written, every day, for platforms without operating systems, but rarely for a bare processor. Many small MPUs come with a support library, usually a set of routines that lets you manipulate whatever peripheral devices they support.

It can certainly be done. You typically write the code in C, and you pretty much have to do everything on your own, with no standard library. To set your pixel, you'd usually load a pointer to the physical address of the screen, and write the correct value to that pointer. Alternatively, on a PC you could consider using the VESA BIOS. In all honesty, it's fairly similar to the way most code for MS-DOS was written (most used MS-DOS to read and write data on disk, but little else).

The core bootloader and the part of the Kernel that bootstraps the OS are written in assembly. See http://en.wikipedia.org/wiki/Booting for a brief writeup of how an operating system boots. There's no way I'm aware of to write a bootloader or Kernel purely in a higher level language such as C or C++ without using assembly.

You need to write a bootstrapper and a loader combination followed by a payload which involves setting the VGA mode manually by interrupt, grabbing a handle to the basic video buffer and then writing a value to the 0th byte.
Start here: http://en.wikipedia.org/wiki/Bootstrapping_(computing)

Without an OS it's difficult to have a loader, which means no dynamic libc. You'd have to link statically, as well as have a decent amount of bootstrap code written in assembly (although it could be provided as object files which you could then link with). Also, since you'd be at the mercy of whatever the system has, you'd be stuck with the VESA video modes (unless you want to write your own graphics driver and subsystem, which you don't).

There is, but not generally from within the OS. Initially, they are an asm stub that's executed from the MBR on the drive. See MBR. For x86 processors, this is generally 16-bit processing code, this generally jumps into the operating system code from here, and upgrades to 32-bit/64-bit mode depending on the operating system and chipset.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js