Cross process COM Marshaler: reduce number of copies for large arrays

Cross process COM Marshaler: reduce number of copies for large arrays - c++

As simplified case: I need to transfer a VARIANT to another process over the existing COM interface. I currently use the MIDL-generated marshaller.
The actual transfer is for many values, is part of a time-critical process, and may involve large strings or safearray's (a few MB), thus number of copies made seems relevant.
Since the receiver needs to "keep" the data beyond the function call, at least one copy needs to be made by the marshaler. All signatures I can think of invlove two copies, however:
SetValue([in] VARIANT)
GetValue([out] VARIANT *) // called by receiver
In both cases, in my understanding the marshaller makes a cross-process copy that does get destroyed by the marshaller. Since I need to keep the data in the receiver, I need to make a second copy.
I considered "detaching" the data at the receiver:
SetValue([in, out] VARIANT *)
// receiver detaches value and sets to VT_EMPTY for return
But this would also destroy the source.
Q1: Is it possible to get the MIDL-generated marshaling code to do only one copy?
Q2: Would this be possible with a custom marshaller, and at what cost? (My first looks into that were extremly discouraging)
I am pretty mouch bound to using SAFEARRAY and/or other VARIANT/PROPVARIANT types, and to transfer the whole array.
[edit]
Both sides use C++, the interfaces are IUnknown-based, and it needs to work cross-process on a single machine, in the same context.

You don't say so explicitly, but it seems the problem you are seeking to solve is speed issues. In any case, consider using a profiler to identify the bottleneck if you haven't already done so.
I very much doubt in this case that it is the copying which is taking the time. Rather, it is likely to be the context-switching between processes involved, as you are getting the values one at a time. This means that for each value you retrieve, you have to switch processes to the target of the call, then switch back again.
You could speed this up enormously be making your design less "chatty" when setting or getting multiple values.
Something like this:
SetMultipleValues(
[in] SAFEARRAY(BSTR)* asNames,
[in] SAFEARRAY(VARIANT)* avValues
)
GetMultipleValues(
[in] SAFEARRAY(BSTR)* asNames,
[out,retval] SAFEARRAY(VARIANT)* pavValues
)
I.e. when calling GetMultipleValues, pass in an array of 10 names, and receive an array of 10 values in the same order as the names passed in, (or VT_EMPTY if the value does not exist).

Related

How to determinte how much space to allocate for boost::stacktrace::safe_dump_to?

I'm looking at the boost::stacktrace::safe_dump_to API and I cannot for the life of me work out how to determine how much space to allocate for the safe_dump_to() call. If I pass (nullptr, 0), it just returns 0, so that's not it. I could guess some constant number, but how do i know that's enough?

The docs specify:
This header contains low-level async-signal-safe functions for dumping call stacks. Dumps are binary serialized arrays of void*, so you could read them by using 'od -tx8 -An stacktrace_dump_failename' Linux command or using boost::stacktrace::stacktrace::from_dump functions.
And additionally
Returns:
Stored call sequence depth including terminating zero frame. To get the actually consumed bytes multiply this value by the sizeof(boost::stacktrace::frame::native_frame_ptr_t))
It's not overly explicit, but this means you need sizeof(boost::stacktrace::frame::native_frame_ptr_t)*N where N is the number of stackframes in the trace.
Now of course, you can find out N, but there is not async-safe way to do allocate dynamically anyways, so you'd simply have to pick a number that suits your application. E.g. 256 frames might be reasonable, but you should look at your own needs (e.g. DEBUG builds will show a lot more stack frames, especially with code that heavily relies on template generics, YMMV).
Since the whole safe_dump_to construct is designed to be async-safe, I always just use the overload that writes to a file. When reading it back (typically after a restart) you will be able to deduce the number of frames from the file-size.
Optionally see some of my answers for code samples/more background information

C++ - Managing References in Disk Based Vector

I am developing a set of vector classes that all derived from an abstract vector. I am doing this so that in our software that makes use of these vectors, we can quickly switch between the vectors without any code breaking (or at least minimize failures, but my goal is full compatibility). All of the vectors match.
I am working on a Disk Based Vector that mostly conforms to match the STL Vector implementation. I am doing this because we need to handle large out of memory files that contain various formats of data. The Disk Vector handles data read/write to disk by using template specialization/polymorphism of serialization and deserialization classes. The data serialization and deserialization has been tested, and it works (up to now). My problem occurs when dealing with references to the data.
For example,
Given a DiskVector dv, a call to dv[10] would get a point to a spot on disk, then seek there, read out the char stream. This stream gets passed to a deserializor which converts the byte stream into the appropriate data type. Once I have the value, I my return it.
This is where I run into a problem. In the STL, they return it as a reference, so in order to match their style, I need to return a reference. What I do it store the value in an unordered_map with the given index (in this example, 10). Then I return a reference to the value in the unordered_map.
If this continues without cleanup, then the purpose of the DiskVector is lost because all the data just gets loaded into memory, which is bad due to data size. So I clean up this map by deleting the indexes later on when other calls are made. Unfortunately, if a user decided to store this reference for a long time, and then it gets deleted in the DiskVector, we have a problem.
So my questions
Is there a way to see if any other references to a certain instance are in use?
Is there a better way to solve this while still maintaining the polymorphic style for reasons described at the beginning?
Is it possible to construct a special class that would behave as a reference, but handle the disk IO dynamically so I could just return that instead?
Any other ideas?

So a better solution at what I was trying to do is to use SQLite as the backend for the database. Use BLOBs as the column types for key and value columns. This is the approach I am taking now. That said, in order to get it to work well, I need to use what cdhowie posted in the comments to my question.

Variable Length Array Performance Implications (C/C++)

I'm writing a fairly straightforward function that sends an array over to a file descriptor. However, in order to send the data, I need to append a one byte header.
Here is a simplified version of what I'm doing and it seems to work:
void SendData(uint8_t* buffer, size_t length) {
uint8_t buffer_to_send[length + 1];
buffer_to_send[0] = MY_SPECIAL_BYTE;
memcpy(buffer_to_send + 1, buffer, length);
// more code to send the buffer_to_send goes here...
}
Like I said, the code seems to work fine, however, I've recently gotten into the habit of using the Google C++ style guide since my current project has no set style guide for it (I'm actually the only software engineer on my project and I wanted to use something that's used in industry). I ran Google's cpplint.py and it caught the line where I am creating buffer_to_send and threw some comment about not using variable length arrays. Specifically, here's what Google's C++ style guide has to say about variable length arrays...
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Variable-Length_Arrays_and_alloca__
Based on their comments, it appears I may have found the root cause of seemingly random crashes in my code (which occur very infrequently, but are nonetheless annoying). However, I'm a bit torn as to how to fix it.
Here are my proposed solutions:
Make buffer_to_send essentially a fixed length array of a constant length. The problem that I can think of here is that I have to make the buffer as big as the theoretically largest buffer I'd want to send. In the average case, the buffers are much smaller, and I'd be wasting about 0.5KB doing so each time the function is called. Note that the program must run on an embedded system, and while I'm not necessarily counting each byte, I'd like to use as little memory as possible.
Use new and delete or malloc/free to dynamically allocate the buffer. The issue here is that the function is called frequently and there would be some overhead in terms of constantly asking the OS for memory and then releasing it.
Use two successive calls to write() in order to pass the data to the file descriptor. That is, the first write would pass only the one byte, and the next would send the rest of the buffer. While seemingly straightforward, I would need to research the code a bit more (note that I got this code handed down from a previous engineer who has since left the company I work for) in order to guarantee that the two successive writes occur atomically. Also, if this requires locking, then it essentially becomes more complex and has more performance impact than case #2.
Note that I cannot make the buffer_to_send a member variable or scope it outside the function since there are (potentially) multiple calls to the function at any given time from various threads.
Please let me know your opinion and what my preferred approach should be. Thanks for your time.

You can fold the two successive calls to write() in your option 3 into a single call using writev().
http://pubs.opengroup.org/onlinepubs/009696799/functions/writev.html

I would choose option 1. If you know the maximum length of your data, then allocate that much space (plus one byte) on the stack using a fixed size array. This is no worse than the variable length array you have shown because you must always have enough space left on the stack otherwise you simply won't be able to handle your maximum length (at worst, your code would randomly crash on larger buffer sizes). At the time this function is called, nothing else will be using the further space on your stack so it will be safe to allocate a fixed size array.

Why are empty classes 8 bytes and larger classes always > 8 bytes?

class foo { }
writeln(foo.classinfo.init.length); // = 8 bytes
class foo { char d; }
writeln(foo.classinfo.init.length); // = 9 bytes
Is d actually storing anything in those 8 bytes, and if so, what? It seems like a huge waste, If I'm just wrapping a few value types then the the class significantly bloats the program, specifically if I am using a lot of them. A char becomes 8 times larger while an int becomes 3 times as large.
A struct's minimum size is 1 byte.

In D, object have a header containing 2 pointer (so it may be 8bytes or 16 depending on your architecture).
The first pointer is the virtual method table. This is an array that is generated by the compiler filled with function pointer, so virtual dispatch is possible. All instances of the same class share the same virtual method table.
The second pointer is the monitor. It is used for synchronization. It is not sure that this field stay here forever, because D emphasis local storage and immutability, which make synchronization on many objects useless. As this field is older than these features, it is still here and can be used. However, it may disapear in the future.
Such header on object is very common, you'll find the same in Java or C# for instance. You can look here for more information : http://dlang.org/abi.html

D uses two machine words in each class instance for:
A pointer to the virtual function table. This contains the addresses of virtual methods. The first entry points towards the class's classinfo, which is also used by dynamic casts.
The monitor, which allows the synchronized(obj) syntax, documented here.
These fields are described in the D documentation here (scroll down to "Class Properties") and here (scroll down to "Classes").

I don't know the particulars of D, but in both Java and .net, every class object contains information about its type, and also holds information about whether it's the target of any monitor locks, whether it's eligible for finalization cleanup, and various other things. Having a standard means by which all objects store such information can make many things more convenient for both users and implementers of the language and/or framework. Incidentally, in 32-bit versions of .net, the overhead for each object is 8 bytes except that there is a 12-byte minimum object size. This minimum stems from the fact that when the garbage-collector moves objects around, it needs to temporarily store in the old location a reference to the new one as well as some sort of linked data structure that will permit it to examine arbitrarily-deep nested references without needing an arbitrarily-large stack.
Edit
If you want to use a class because you need to be able to persist references to data items, space is at a premium, and your usage patterns are such that you'll know when data items are still useful and when they become obsolete, you may be able to define an array of structures, and then pass around indices to the array elements. It's possible to write code to handle this very efficiently with essentially zero overhead, provided that the structure of your program allows you to ensure that every item that gets allocated is released exactly once and things are not used once they are released.
If you would not be able to readily determine when the last reference to an object is going to go out of scope, eight bytes would be a very reasonable level of overhead. I would expect that most frameworks would force objects to be aligned on 32-bit boundaries (so I'm surprised that adding a byte would push the size to nine rather than twelve). If a system is going have a garbage collector that works better than a Commodore 64(*), it would need to have an absolute minimum of a bit of overhead per object to indicate which things are used and which aren't. Further, unless one wants to have separate heaps for objects which can contain supplemental information and those which can't, one will every object to either include space for a supplemental-information pointer, or include space for all the supplemental information (locking, abandonment notification requests, etc.). While it might be beneficial in some cases to have separate heaps for the two categories of objects, I doubt the benefits would very often justify the added complexity.
(*) The Commodore 64 garbage collector worked by allocating strings from the top of memory downward, while variables (which are not GC'ed) were allocated bottom-up. When memory got full, the system would scan all variables to find the reference to the string that was stored at the highest address. That string would then be moved to the very top of memory and all references to it would be updated. The system would then scan all variables to find the reference to the string at the highest address below the one it just moved and update all references to that. The process would repeat until it didn't find any more strings to move. This algorithm didn't require any extra data to be stored with strings in memory, but it was of course dog slow. The Commodore 128 garbage collector stored with each string in GC space a pointer to the variable that holds a reference and a length byte that could be used to find the next lower string in GC space; it could thus check each string in order to find out whether it was still used, relocating it to the top of memory if so. Much faster, but at the cost of three bytes' overhead per string.

You should look into the storage requirements for various types. Every instruction, storage allocation (ie:variable/object, etc) uses up a specific amount of space. In c# an Int32 type integer object should store integer information to the tune of 4 bytes (32bit). It might have other information, too, because it is an object, but your character data type probably only requires 1 byte of information. If you have constructs like for or while in your class, those things will take up space, too, because each of those things is telling your class to do something. The class itself requires a number of instructions to be created in memory, which would account for the 8 initial bytes.
Take an assembler language course. You'll learn all you ever wanted to know and then some about why your programs use however much memory or take up however much storage when compiled.

Lock Free Queue -- Single Producer, Multiple Consumers

I am looking for a method to implement lock-free queue data structure that supports single producer, and multiple consumers. I have looked at the classic method by Maged Michael and Michael Scott (1996) but their version uses linked lists. I would like an implementation that makes use of bounded circular buffer. Something that uses atomic variables?
On a side note, I am not sure why these classic methods are designed for linked lists that require a lot of dynamic memory management. In a multi-threaded program, all memory management routines are serialized. Aren't we defeating the benefits of lock-free methods by using them in conjunction with dynamic data structures?
I am trying to code this in C/C++ using pthread library on a Intel 64-bit architecture.
Thank you,
Shirish

The use of a circular buffer makes a lock necessary, since blocking is needed to prevent the head from going past the tail. But otherwise the head and tail pointers can easily be updated atomically. Or in some cases the buffer can be so large that overwriting is not an issue. (in real life you will see this in automated trading systems, with circular buffers sized to hold X minutes of market data. If you are X minutes behind, you have wayyyy worse problems than overwriting your buffer).
When I implemented the MS queue in C++, I built a lock free allocator using a stack, which is very easy to implement. If I have MSQueue then at compile time I know sizeof(MSQueue::node). Then I make a stack of N buffers of the required size. The N can grow, i.e. if pop() returns null, it is easy to go ask the heap for more blocks, and these are pushed onto the stack. Outside of the possibly blocking call for more memory, this is a lock free operation.
Note that the T cannot have a non-trivial dtor. I worked on a version that did allow for non-trivial dtors, that actually worked. But I found that it was easier just to make the T a pointer to the T that I wanted, where the producer released ownership, and the consumer acquired ownership. This of course requires that the T itself is allocated using lockfree methods, but the same allocator I made with the stack works here as well.
In any case the point of lock-free programming is not that the data structures themselves are slower. The points are this:
lock free makes me independent of the scheduler. Lock-based programming depends on the scheduler to make sure that the holders of a lock are running so that they can release the lock. This is what causes "priority inversion" On Linux there are some lock attributes to make sure this happens
If I am independent of the scheduler, the OS has a far easier time managing timeslices, and I get far less context switching
it is easier to write correct multithreaded programs using lockfree methods since I dont have to worry about deadlock , livelock, scheduling, syncronization, etc This is espcially true with shared memory implementations, where a process could die while holding a lock in shared memory, and there is no way to release the lock
lock free methods are far easier to scale. In fact, I have implemented lock free methods using messaging over a network. Distributed locks like this are a nightmare
That said, there are many cases where lock-based methods are preferable and/or required
when updating things that are expensive or impossible to copy. Most lock free methods use some sort of versioning, i.e. make a copy of the object, update it, and check if the shared version is still the same as when you copied it, then make the current version you update version. Els ecopy it again, apply the update, and check again. Keep doing this until it works. This is fine when the objects are small, but it they are large, or contain file handles, etc then not recommended
Most types are impossible to access in a lock free way, e.g. any STL container. These have invariants that require non atomic access , for example assert(vector.size()==vector.end()-vector.begin()). So if you are updating/reading a vector that is shared, you have to lock it.

This is an old question, but no one has provided an accepted solution. So I offer this info for others who may be searching.
This website: http://www.1024cores.net
Provides some really useful lockfree/waitfree data structures with thorough explanations.
What you are seeking is a lock-free solution to the reader/writer problem.
See: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem

For a traditional one-block circular buffer I think this simply cannot be done safely with atomic operations. You need to do so much in one read. Suppose you have a structure that has this:
uint8_t* buf;
unsigned int size; // Actual max. buffer size
unsigned int length; // Actual stored data length (suppose in write prohibited from being > size)
unsigned int offset; // Start of current stored data
On a read you need to do the following (this is how I implemented it anyway, you can swap some steps like I'll discuss afterwards):
Check if the read length does not surpass stored length
Check if the offset+read length do not surpass buffer boundaries
Read data out
Increase offset, decrease length
What should you certainly do synchronised (so atomic) to make this work? Actually combine steps 1 and 4 in one atomic step, or to clarify: do this synchronised:
check read_length, this can be sth like read_length=min(read_length,length);
decrease length with read_length: length-=read_length
get a local copy from offset unsigned int local_offset = offset
increase offset with read_length: offset+=read_length
Afterwards you can just do a memcpy (or whatever) starting from your local_offset, check if your read goes over circular buffer size (split in 2 memcpy's), ... . This is 'quite' threadsafe, your write method could still write over the memory you're reading, so make sure your buffer is really large enough to minimize that possibility.
Now, while I can imagine you can combine 3 and 4 (I guess that's what they do in the linked-list case) or even 1 and 2 in atomic operations, I cannot see you do this whole deal in one atomic operation :).
You can however try to drop 'length' checking if your consumers are very smart and will always know what to read. You'd also need a new woffset variable then, because the old method of (offset+length)%size to determine write offset wouldn't work anymore. Note this is close to the case of a linked list, where you actually always read one element (= fixed, known size) from the list. Also here, if you make it a circular linked list, you can read to much or write to a position you're reading at that moment!
Finally: my advise, just go with locks, I use a CircularBuffer class, completely safe for reading & writing) for a realtime 720p60 video streamer and I have got no speed issues at all from locking.

This is an old question but no one has provided an answer that precisely answers it. Given that still comes up high in search results for (nearly) the same question, there should be an answer, given that one exists.
There may be more than one solution, but here is one that has an implementation:
https://github.com/tudinfse/FFQ
The conference paper referenced in the readme details the algorithm.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js