I have an RPC COM interface which passes data from a service to a client. On the client side, I need to store this data temporarily and put it into a queue so that it can be processed later on the UI thread (it has to be the UI thread because that is where the view and view model objects need to be created).
The structure is a bit complex and contains pointers to other structures and variable length strings.
Question - is there an easy way to grab the full "blob" of marshaled memory for stowing, or do I need to duplicate the same structure and repack it myself so I can process it later on the UI thread? Currently this looks like duplicating the same structs but replacing LPCWSTR with CComBSTR etc., which seems kind of dirty and wasteful to me...
Thanks
Until you need to make sense of the data, you can just treat it as a sequence of bytes. The only thing you need to know is how much data you have. You can then do something like:
std::vector<uint8_t> buf;
buf.resize(length);
memcpy(buf.data(), source, length);
and then at a later point in time, assuming your vector is still around:
memcpy(dest, buf.data(), buf.size());
The vector will get freed when it goes out of scope.
The only tricky part may be getting the length. That might need some knowledge of the data being sent, but otherwise there is no need to unpack the data.
You're supposed to marshal the data between your background thread and your UI thread (see CoMarshalInterThreadInterfaceInStream and CoGetInterfaceAndReleaseStream for example). Passing a COM object between threads directly is illegal.
Using the above-mentioned APIs, you would generate the IStream object and queue the corresponding pointer. The UI thread would eventually recover the pointer, call the second API, and retrieve the marshalled copy of the object.
That all assumes that the data is in a COM object to begin with. If it's a blob copied over the wire by your COM object, then proper semantics dictate that when you requested the data from the COM object, you acquired ownership of it (and the responsibility to release/delete it). Then just queue a pointer to the data as-is, and have the UI thread do the release/delete. (How? I'm not sure; I'd have to look it up. Maybe CoTaskMemFree()?)
If it's a more bizarre scenario, please clarify what that is.
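For the blob case, the hand-off can be as simple as queuing ownership of the raw bytes across threads. Below is a minimal portable sketch; the BlobQueue name and the use of std::vector are my own illustration, not any real API - in real COM code the queued item might instead be the IStream pointer from CoMarshalInterThreadInterfaceInStream, and the release might be CoTaskMemFree.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical sketch: the worker thread pushes ownership of each blob,
// and the UI thread pops and frees it (here, automatically via unique_ptr).
struct BlobQueue {
    void push(std::unique_ptr<std::vector<uint8_t>> blob) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(blob));
    }
    // Returns nullptr when the queue is empty.
    std::unique_ptr<std::vector<uint8_t>> try_pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return nullptr;
        auto blob = std::move(q_.front());
        q_.pop();
        return blob;
    }
private:
    std::mutex m_;
    std::queue<std::unique_ptr<std::vector<uint8_t>>> q_;
};
```

Whichever thread pops the blob owns it from that point on, which matches the "queue a pointer as-is, free on the UI thread" semantics described above.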
Related
I am new to Windows IOCP, and currently I am rewriting the network part of our server using IOCP. I am trying to figure out how to handle the memory (OVERLAPPED / WSABUF objects). I have a little struct that derives from OVERLAPPED and contains a WSABUF and a few other fields.
I tried keeping a ring buffer of these objects and reusing them, but it didn't work. I thought that when the completion routine was called, one of the objects would be passed back to me and I could mark it as available again.
When I allocate them on the heap (and keep in a vector), how do I know when it's safe to delete/reuse them?
Thanks,
Michael
As we all know, an echo server is a server that reads from a socket, and writes that very data into another socket.
Since Windows I/O completion ports give you different ways to do things, I was wondering what the best (most efficient) way to implement an echo server is. I'm sure someone has tested the approaches I describe here and can share their findings.
My classes are Stream, which abstracts a socket, named pipe, or whatever, and IoRequest, which abstracts both an OVERLAPPED structure and the memory buffer to do the I/O (of course, suitable for both reading and writing). In this way, when I allocate an IoRequest, I'm allocating memory for the data buffer plus the OVERLAPPED structure in one shot, so I call malloc() only once.
In addition to this, I also implement fancy and useful things in the IoRequest object, such as an atomic reference counter, and so on.
That said, let's explore the ways to build the best echo server:
-------------------------------------------- Method A. ------------------------------------------
1) The "reader" socket completes its reading, the IOCP callback returns, and you have an IoRequest just completed with the memory buffer.
2) Let's copy the buffer just received with the "reader" IoRequest to the "writer" IoRequest. (this will involve a memcpy() or whatever).
3) Let's fire a new read with ReadFile() in the "reader", with the same IoRequest used for reading.
4) Let's fire a new writing with WriteFile() in the "writer".
-------------------------------------------- Method B. ------------------------------------------
1) The "reader" socket completes its reading, the IOCP callback returns, and you have an IoRequest just completed with the memory buffer.
2) Instead of copying data, pass that IoRequest to the "writer" for writing, without copying data with memcpy().
3) The "reader" now needs a new IoRequest to continue reading; allocate a new one, or reuse one allocated earlier, perhaps one whose write has just completed.
So, in the first case, every Stream object has its own IoRequest, data is copied with memcpy() or similar functions, and everything works fine.
In the second case the two Stream objects pass IoRequest objects to each other, without copying data, but it's a little more complex: you have to manage the "swapping" of IoRequest objects between the two Stream objects, with the possible drawback of synchronization problems (what if those completions happen on different threads?).
My questions are:
Q1) Is avoiding copying data really worth it!?
Copying two buffers with memcpy() or similar is very fast, in part because the CPU cache is exploited for exactly this purpose.
Let's consider that with the first method, I can echo from one "reader" socket to multiple "writer" sockets, but with the second one I can't, since I would have to create N new IoRequest objects for the N writers: each WriteFile() needs its own OVERLAPPED structure.
Q2) I guess that when I fire N writes for N different sockets with WriteFile(), I have to provide N different OVERLAPPED structures AND N different buffers from which to read the data.
Or can I fire N WriteFile() calls with N different OVERLAPPEDs, all taking the data from the same buffer?
Is avoiding copying data really worth it!?
Depends on how much you are copying. 10 bytes, not so much. 10MB, then yes, it's worth avoiding the copying!
In this case, since you already have an object that contains the rx data and an OVERLAPPED block, it seems somewhat pointless to copy it - just reissue it to WSASend(), or whatever.
but with the second one I can't do that
You can, but you need to separate the 'IORequest' class from a 'Buffer' class. The Buffer holds the data, an atomic int reference count, and any other per-buffer management info; the IORequest holds the OVERLAPPED block, a pointer to the buffer, and any per-call management info.
The IORequest is the class used for each send call. Since it contains only a pointer to the buffer, there is no need to copy the data, so it's reasonably small and its size is O(1) with respect to the data size.
When the tx completions come in, the handler threads get the IORequest, dereference the buffer, and decrement its atomic count toward zero. The thread that decrements it to 0 knows that the buffer object is no longer needed and can delete it (or, more likely, in a high-performance server, return it to a pool for later reuse).
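The Buffer/IORequest split described above can be sketched portably like this; the class and member names are illustrative, and in real IOCP code the IORequest would also carry the OVERLAPPED block:

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// One Buffer can back many IORequests. The completion that releases
// the last reference is responsible for deleting (or repooling) it.
struct Buffer {
    std::vector<char> data;
    std::atomic<int> refs{0};
};

struct IORequest {
    // In real IOCP code this struct would begin with an OVERLAPPED block.
    Buffer* buf;
    explicit IORequest(Buffer* b) : buf(b) { buf->refs.fetch_add(1); }
    // Returns true if this caller released the last reference and
    // should therefore delete/repool the buffer.
    bool complete() { return buf->refs.fetch_sub(1) == 1; }
};
```

With this shape, firing N sends from one buffer means creating N IORequests against the same Buffer; only the Nth completion to run frees it, regardless of which thread it lands on.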
Or, I can fire N WriteFile() calls with N different OVERLAPPED taking the data from the same buffer for the N sockets?
Yes, you can. See above.
Re. threading - sure, if your 'management data' can be reached from multiple completion-handler threads, then yes, you may want to protect it with a critical-section, but an atomic int should do for the buffer refcount.
How does Win32 manage instances of the OVERLAPPED struct in the context of two functions:
GetQueuedCompletionStatus
PostQueuedCompletionStatus
When I call GetQueuedCompletionStatus, does Win32 free the OVERLAPPED instance, or must I do it myself?
When I send data with PostQueuedCompletionStatus, does Win32 copy it to internal structures? When must I free the memory of the sent data?
Where could I find a diagram showing how OVERLAPPED data flows between GetQueuedCompletionStatus, PostQueuedCompletionStatus, and the IOCP queue?
The OVERLAPPED structure must exist from when a successful I/O operation (or manual PostQueuedCompletionStatus()) executes until the OVERLAPPED emerges from a call to GetQueuedCompletionStatus().
You are responsible for the lifetime of the structure.
You'll see from the MSDN docs that GetQueuedCompletionStatus() actually takes "a pointer to a variable that receives the address of the OVERLAPPED structure that was specified when the completed I/O operation was started.". What you actually get out of that call is a pointer to the original OVERLAPPED that you passed when you made the PostQueuedCompletionStatus() call (or initiated an overlapped I/O operation).
This is all actually very useful as the "normal" way to use the OVERLAPPED structure is to place it inside a larger structure which holds all of the 'per operation' information that you might need - so it's the ideal way to navigate directly from the limited information that you're given when you call GetQueuedCompletionStatus() to, for example, the data buffer that you used in your overlapped read call...
I find the best way to deal with OVERLAPPED structures is to a) embed them in the buffer you're using for read/write b) reference count them and c) return them to a pool for reuse when the ref count drops to 0.
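The "larger structure" navigation trick can be shown portably with a stand-in for OVERLAPPED (on Windows you would use the real struct from <windows.h>); the member names here are illustrative:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the Windows OVERLAPPED struct, used so the layout trick
// compiles anywhere. On Windows, use the real OVERLAPPED instead.
struct FakeOverlapped { void* internal1; void* internal2; };

// The per-operation structure: the OVERLAPPED is the FIRST member, so the
// pointer that GetQueuedCompletionStatus() hands back can be cast straight
// back to the full context (buffer, operation type, etc.).
struct PerOperation {
    FakeOverlapped ov;   // must be first for the cast below to be valid
    char buffer[1024];   // data buffer for this read/write
    int  opType;         // e.g. 0 = read, 1 = write
};
static_assert(offsetof(PerOperation, ov) == 0,
              "OVERLAPPED must be at offset 0");

PerOperation* FromOverlapped(FakeOverlapped* p) {
    // Valid because ov is the first member of a standard-layout struct.
    return reinterpret_cast<PerOperation*>(p);
}
```

On Windows, the same recovery is often written with the CONTAINING_RECORD macro instead of a raw cast, which also works when the OVERLAPPED is not the first member.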
I have some source code that you could download (here) which may make this a little easier to understand (it's a full IOCP server example so it's a little complex, but it works and shows how these things can be used).
You should pass the address of an OVERLAPPED * to GetQueuedCompletionStatus. This gets filled in with the value passed to PostQueuedCompletionStatus.
You should not free this data in the PostQueuedCompletionStatus context. It should be done by the context using GetQueuedCompletionStatus. (Assuming it was allocated dynamically in the first place - there is no requirement that it is a dynamically allocated structure, it could be taken out of a fixed pool, or be allocated on the stack of a function that doesn't return until it has been signalled that the operation is complete).
I'm not sure there is such a picture.
I have written a program (call it X) in C++ which creates a data structure and then uses it continuously.
Now I would like to modify that data structure without aborting the previous program.
I tried two ways to accomplish this task:
In the same program X, I first created the data structure and then created a child process which accesses and uses that data structure for some purpose. The parent process continues with its execution, asks the user for modifications like insertion, deletion, etc., takes input from the console, and performs the modification. The problem is that this doesn't modify the copy of the data structure that the child process is using: the child process has its own copy, so modifications made via the parent process are not reflected in it. Since that's not what I wanted, I went for multithreading.
Instead of creating a child process, I created another thread which accesses and uses the data structure, and tried to take user input from the console in a different thread. Even this didn't work, because of the very fast switching between threads.
So, please help me solve this issue. I want the modification to be reflected in the original data structure. Also, I don't want the process (which is accessing and using it continuously) to have to wait, since it's time-critical.
First point: this is not a trivial problem. To handle it at all well, you need to design a system, not just a quick hack or two.
First of all, to support the dynamic changing, you'll almost certainly want to define the data structure in code in something like a DLL or .so, so you can load it dynamically.
Part of how to proceed will depend on whether you're talking about data that's stored strictly in memory, or whether it's more file-oriented. In the latter case, some of the decisions will depend a bit on whether the new form of a data structure is larger than the old one (i.e., whether you can upgrade in place or not).
Let's start out simple, and assume you're only dealing with structures in memory. Each data item will be represented as an object. In addition to whatever's needed to access the data, each object will provide locking, and a way to build itself from an object of the previous version of the object (lazily -- i.e., on demand, not just in the ctor).
When you load the DLL/.so defining a new object type, you'll create a collection of those objects the same size as your current collection of existing objects. Each new object will be in the "lazy" state: initialized, but not yet actually created from the old object.
You'll then kick off a thread that makes the new collection known to the rest of the program, then walks through the collection of new objects, locking an old object, using it to create a new object, then destroying the old object and removing it from the old collection. It will use a fairly short timeout when it tries to lock an old object (i.e., if an object is in use, it won't wait long; it'll just go on to the next). It will iterate repeatedly until all the old objects have been updated and the collection of old objects is empty.
For data on disk, things can be just about the same, except your collections of objects provide access to the data on disk. You create two separate files, and copy data from one to the other, converting as needed.
Another possibility (especially if the data can be upgraded in place) is to use a single file, but embed a version number into each record. Read some raw data, check the version number, and use appropriate code to read/write it. If you're reading an old version number, read with the old code, convert to the new format, and write in the new format. If you don't have space to update in place, write the new record to the end of the file, and update the index to indicate the new position.
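The version-number-per-record idea can be sketched as follows; the record layouts and field names here are hypothetical, purely to show the upgrade-on-read shape:

```cpp
#include <cassert>

// Hypothetical record formats: both begin with a version tag.
struct RecordV1 { int version; int value; };            // old format
struct RecordV2 { int version; int value; int extra; }; // new format

// Reads a record in either format, upgrading V1 to V2 on the fly.
RecordV2 ReadRecord(const void* raw) {
    // The version tag is the first int in every format.
    int version = *static_cast<const int*>(raw);
    if (version == 1) {
        const RecordV1* old = static_cast<const RecordV1*>(raw);
        return RecordV2{2, old->value, /*extra=*/0};  // default the new field
    }
    return *static_cast<const RecordV2*>(raw);
}
```

The write side is symmetric: always write V2, so the file gradually converges on the new format as records are touched.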
Your approach to concurrent access is similar to sharing a cake between a classroom full of blindfolded toddlers. It's no surprise that you end up with a sticky mess. Each toddler will either have to wait their turn to dig in or know exactly which part of the cake she alone can touch.
Translating to code, the former means having a lock or mutex that controls access to a data structure so that only one thread can modify it at any time.
The latter can be done by having a data structure that is modified in place by threads that each know exactly which parts of the data structure they can update, e.g. by passing a struct with details on which range to update, effectively splitting up the data beforehand. These should not overlap and iterators should not be invalidated (e.g. by resizing), which may not be possible for a given problem.
There are many, many algorithms for handling resource contention, so this is grossly simplified. Distributed computing is a significant field of computer science dedicated to these kinds of problems; study the problem (you didn't give details) and don't expect magic.
I think I'm in a problem. I have two TCP apps connected to each other which use winsock I/O completion ports to send/receive data (non-blocking sockets).
Everything works just fine until there's a data transfer burst. The sender starts sending incorrect/malformed data.
I allocate the buffers I'm sending on the stack, and if I understand correctly, that's the wrong thing to do, because these buffers must remain as I sent them until I get the "write complete" notification from IOCP.
Take this for example:
void some_function()
{
    char cBuff[1024];

    // filling cBuff with some data
    WSASend(...); // sending cBuff, non-blocking mode

    // filling cBuff with other data
    WSASend(...); // again, sending cBuff

    // ..... and so forth!
}
If I understand correctly, each of these WSASend() calls should have its own unique buffer, and that buffer can be reused only when the send completes.
Correct?
Now, what strategies can I implement to maintain a big sack of such buffers, how should I handle them, how can I avoid a performance penalty, etc.?
And if I am to use temporary buffers, that means I'd copy the data to be sent from the source buffer to the temporary one; thus, I'd set SO_SNDBUF on each socket to zero, so the system will not re-copy what I already copied. Are you with me? Please let me know if I wasn't clear.
Take a serious look at boost::asio. Asynchronous I/O is its specialty (just as the name suggests). It's a pretty mature library by now, having been in Boost since 1.35. Many people use it in production for very intensive networking. There's a wealth of examples in the documentation.
One thing is for sure - it takes working with buffers very seriously.
Edit:
The basic idea for handling bursts of input is queuing.
Create, say, three linked lists of pre-allocated buffers: one for free buffers, one for to-be-processed (received) data, and one for to-be-sent data.
Every time you need to send something, take a buffer off the free list (allocate a new one if the free list is empty), fill it with data, and put it onto the to-be-sent list.
Every time you need to receive something, take a buffer off the free list as above and give it to the I/O receive routine.
Periodically take buffers off the to-be-sent queue and hand them off to the send routine.
On send completion (inline or asynchronous), put them back onto the free list.
On receive completion, put the buffer onto the to-be-processed list.
Have your "business" routine take buffers off the to-be-processed list.
The bursts will then fill that input queue until you are able to process them. You might want to limit the queue size to avoid blowing through all the memory.
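The list-shuffling above can be sketched like this; the class and method names are my own, the lists here are unprotected (real code would guard each with a lock or use an interlocked list), and the fixed 1024-byte buffer size is an arbitrary choice:

```cpp
#include <cassert>
#include <list>
#include <vector>

using Buffer = std::vector<char>;

// Buffers cycle: free -> to-be-sent -> (send completes) -> free,
// and free -> (receive completes) -> to-be-processed -> free.
struct BufferLists {
    std::list<Buffer*> free, toSend, toProcess;

    Buffer* acquire() {               // take from the free list, or allocate
        if (free.empty()) return new Buffer(1024);
        Buffer* b = free.front();
        free.pop_front();
        return b;
    }
    void queueForSend(Buffer* b) { toSend.push_back(b); }
    Buffer* nextToSend() {            // hand off to the send routine
        if (toSend.empty()) return nullptr;
        Buffer* b = toSend.front();
        toSend.pop_front();
        return b;
    }
    void onSendComplete(Buffer* b) { free.push_back(b); }
};
```

Capping the total number of buffers ever allocated is how you'd implement the queue-size limit mentioned above.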
I don't think it is a good idea to do a second send before the first send is finished.
Similarly, I don't think it is a good idea to change the buffer before the send is finished.
I would be inclined to store the data in some sort of queue. One thread can keep adding data to the queue. The second thread can work in a loop: do a send and wait for it to finish; if there is more data, do another send, else wait for more data.
You would need a critical section (or some such) to nicely share the queue between the threads and possibly an event or a semaphore for the sending thread to wait on if there is no data ready.
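In modern C++ the critical section becomes a std::mutex and the event becomes a std::condition_variable; here is a minimal sketch of the producer/sender split (names are illustrative, and the globals would normally live in a class):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

std::mutex m;
std::condition_variable cv;
std::queue<std::string> pending;

// Producer thread: add data to the queue, wake the sender.
void producer(const std::string& msg) {
    { std::lock_guard<std::mutex> lock(m); pending.push(msg); }
    cv.notify_one();
}

// One iteration of the sending thread's loop: wait until there is
// data, then take the next item (real code would then send it and
// wait for the send to complete before looping).
std::string consumeOne() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return !pending.empty(); });
    std::string msg = pending.front();
    pending.pop();
    return msg;
}
```

On Windows specifically, the same shape can be built with a CRITICAL_SECTION plus an event or semaphore, as the answer suggests.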
Now, what strategies can I implement in order to maintain a big sack of such buffers, how should I handle them, how can I avoid performance penalty, etc'?
It's difficult to know the answer without knowing more about your specific design. In general I'd avoid maintaining your own "sack of buffers" and instead use the OS's built-in sack of buffers - the heap.
But in any case, what I would do in the general case is expose an interface to the callers of your code that mirror what WSASend is doing for overlapped i/o. For example, suppose you are providing an interface to send a specific struct:
struct Foo
{
    int x;
    int y;
};

// foo will be consumed by SendFoo, and deallocated; don't use it after this call
void SendFoo(Foo* foo);
I would require users of SendFoo allocate a Foo instance with new, and tell them that after calling SendFoo the memory is no longer "owned" by their code and they therefore shouldn't use it.
You can enforce this even further with a little trickery:
// After this operation the resultant foo ptr will no longer point to
// memory passed to SendFoo
void SendFoo(Foo*& foo);
This allows the body of SendFoo to send the address of the memory down to WSASend, but modify the passed in pointer to NULL, severing the link between the caller's code and their memory. Of course, you can't really know what the caller is doing with that address, they may have a copy elsewhere.
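A minimal sketch of that trick follows; the g_lastSent global stands in for whatever pending-send bookkeeping the real code would do (it would hand the address to WSASend and free the memory on completion):

```cpp
#include <cassert>

struct Foo { int x; int y; };

// Stands in for the send machinery's record of the in-flight buffer.
static Foo* g_lastSent = nullptr;

// Takes the pointer BY REFERENCE: ownership of the memory moves into
// the send machinery, and the caller's copy of the pointer is nulled.
void SendFoo(Foo*& foo) {
    g_lastSent = foo;  // real code: queue for WSASend, free on completion
    foo = nullptr;     // sever the caller's link to the memory
}
```

After the call, any use of the caller's pointer is an immediate null dereference rather than a silent data race with the pending send, which is exactly the enforcement the trick is after.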
This interface also enforces that a single block of memory will be used with each WSASend. You are really treading into dangerous territory trying to share one buffer between two WSASend calls.