I just want to know whether it makes a performance difference when copying objects in C++ if I pass many instances of a class or use std::shared_ptr instead.
Background: I have some structures which are delivered through a signals&slots mechanism (Qt). (I know that instances are copied when a signal is sent.) These deliveries can occur many times, so they have to be fast and use little memory.
edit (add some details):
I am writing an embedded application (yes, I know Qt is not the fastest backend for embedded) which can have a dynamic number of "modules". Each module has its own functionality. Every module has a signal and a slot. Which modules receive emitted signals is freely configurable. So it could be that many signals are emitted in a very short time. In this case the signals have to be delivered as fast as possible. The delivered structure has some module-specific data and the data which has to be delivered to the other modules. I cannot say how large the delivered data will be, because in the future there will be many more modules which may deliver a lot of data.
BTW: I abuse std::shared_ptr in this case. I do not use it for really sharing the ownership. Qt just treats references and instances the same way in signals&slots: it copies the object. So to have the benefits of both, the easy memory management of an instance and the lower memory usage of a reference, I thought of using a std::shared_ptr.
Qt just treats references and instances the same way in signals&slots: it copies the object.
No, it only copies in specific circumstances. See this answer for details. TL;DR: It only copies if it needs to deliver the call over a queued connection, and then only if there is a receiver attached to a given signal. With direct connections (or default automatic connections within the same thread), if both the signal and the slot pass the argument by reference, then no copies will be made.
I abuse std::shared_ptr in this case.
It's not abuse. You're passing what amounts to a shared data structure, held by a shared_ptr. It makes perfect sense.
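To make the difference concrete, here is a minimal sketch in plain standard C++ (no Qt involved). `Payload`, `receive_by_value` and `receive_shared` are hypothetical names invented for this illustration; the copy counter only exists to make the copies visible.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical payload type; counts how often it is copy-constructed.
struct Payload {
    static int copies;
    std::vector<int> data;

    explicit Payload(std::size_t n) : data(n, 0) {}
    Payload(const Payload& other) : data(other.data) { ++copies; }
};
int Payload::copies = 0;

// Receiver taking the payload by value: each call from an lvalue
// copy-constructs the Payload, including its vector.
std::size_t receive_by_value(Payload p) { return p.data.size(); }

// Receiver taking a shared_ptr by value: only the pointer and its
// reference count are copied; the Payload itself is not.
std::size_t receive_shared(std::shared_ptr<const Payload> p) {
    assert(p != nullptr);
    return p->data.size();
}
```

Copying the shared_ptr is not free either (the reference count is updated atomically), so for small structures a plain copy may well be cheaper. Measuring both variants with realistic payload sizes is the only reliable way to decide.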
The real question is: if your structures are expensive to copy, why won't you use explicit sharing via QSharedData and QExplicitlySharedDataPointer? And why doesn't your question include measurement results to substantiate your concern? Come on, such things are trivial to measure. You've got Qt to help you out - use it.
I want to understand what the true Asio way to use shared data is.
Reading the Asio and Beast examples, the only example I found that uses shared data is http_crawl.cpp. (perhaps I missed something)
In that example the shared object is only used to collect statistics for sessions; that is, the sessions do not read that object's data.
As a result I have three questions:
Is it implied that interaction with shared data in asio-style is an Active Object? i.e. should mutexes be avoided?
Is it correct that reading the shared data also requires "requests" to the Active Object, again without mutexes?
Has anyone tried to evaluate the overhead of "requests" to an Active Object, compared to using mutexes?
Is it implied that interaction with shared data in asio-style is an Active Object? i.e. should mutexes be avoided?
Starting at the end: yes, mutexes should be avoided. This is because all service handlers (initiations and completions) will be executed on the service thread(s), which means that blocking in a handler will block all other handlers.
Whether that leads to Active Object seems to be a choice to me. Yes, a typical approach would be like Active Object (see e.g. boost::asio and Active Object), where operations queue for the data.
However, other approaches are viable and frequently seen, e.g. the data moving along with its task(s) through a task flow.
Is it correct that reading the shared data also requires "requests" to the Active Object, again without mutexes?
Yes, synchronization needs to happen for shared state, regardless of the design pattern chosen (although some design patterns reduce sharing altogether).
The Asio approach is using strands, which abstract away the scheduling from the control flow. This gives the service the option to optimize for various cases (e.g. continuation on the same strand, the case where there's only one service thread anyway etc.).
has anyone tried to evaluate the overhead of "requests" to Active Object, compared to using mutexes?
Lots of people, lots of times. People are often wary of trying Asio because "it uses locking internally". If you know what you're doing, throughput can be excellent, which goes for most patterns and industrial-strength frameworks.
Specific benchmarks depend heavily on specific implementation choices. I'm pretty sure you can find examples on GitHub, blogs and perhaps even on this site.
(perhaps I missed something)
You're missing the fact that all IO objects are not thread-safe, which means that they themselves are shared data for any composed asynchronous operation (chain).
My current application owns multiple «activatable» objects*. My intent is to "run" all those objects in the same io_context and to add the necessary protection in order to toggle from single to multiple threads (to make it scalable).
If these objects were completely independent from each other, the number of threads running the associated io_context could grow smoothly. But since those objects need to cooperate, the application crashes when run with multiple threads, despite the strand in each object.
Let's say we have objects of type A and type B, all of them served by the same io_context. Each of those types runs asynchronous operations (timers and sockets - their handlers are wrapped with bind_executor(strand, handler)), and can build a cache based on information received via sockets and posted operations to them. Objects of type A need to get information cached from multiple instances of B in order to perform their own work.
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
If not, what strategy could be adopted to achieve the scalability?
I already tried playing with futures but that strategy leads unsurprisingly to deadlocks.
Thanx
(*) Maybe I'm wrong in the terminology: objects get a reference to an io_context and own their own strand, so I think they are activatable, because they don't own really a running thread
You're mixing vague words a bit. "Activatable", "Strandify", "inter coorporating". They're all close to meaningful concepts, yet, narrowly avoid binding to any precise meaning.
Deconstructing
Let's simplify using more precise concepts.
Let's say we have objects of type A and type B, all of them served by the same io_context
I think it's more fruitful to say "types A and B have associated executors". When you make sure all operations on A and B operate from that executor and you make sure that executor serializes access, then you basically get the Active Object pattern.
[can build a cache based on information received via sockets] and posted operations to them
That's interesting. I take that to mean you don't directly call members of the class, unless they defer the actual execution to the strand. This, again, would be the Active Object.
However, your symptoms suggest that not all operations are "posted to them". Which implies they run on arbitrary threads, leading to your problem.
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
The key to your problems is here. Data dependencies. It's also likely going to limit the usefulness of scaling, unless of course the generation of information to retrieve from other threads is a computationally expensive operation.
However, the phrase "to get information cached from multiple instances of B" suggests that the data is in fact available instantaneously, and you'll just be paying synchronization costs for accessing it across threads.
Questions
Q. Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
Technically, yes. By making sure all operations go on the strand, and the objects become true active objects.
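As an illustration of "all operations go through one serializing executor", here is a toy Active Object written with the standard library only. It is a sketch, not Asio code: `ActiveCache`, `put`, `get` and `post` are names invented here, and a single worker thread plus a task queue stands in for a strand. Returning -1 for a missing key is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Toy Active Object: data_ is only ever touched by the worker thread.
class ActiveCache {
public:
    ActiveCache() : worker_([this] { run(); }) {}

    ~ActiveCache() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining tasks are drained before the thread exits
    }

    // Writes are queued, never applied directly by the caller.
    void put(std::string key, int value) {
        post([this, key, value] { data_[key] = value; });
    }

    // Reads are queued too; the result comes back through a future.
    std::future<int> get(std::string key) {
        auto result = std::make_shared<std::promise<int>>();
        std::future<int> fut = result->get_future();
        post([this, key, result] {
            auto it = data_.find(key);
            result->set_value(it == data_.end() ? -1 : it->second);
        });
        return fut;
    }

private:
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ is set and the queue is empty
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();  // runs outside the queue lock
        }
    }

    std::mutex m_;                             // protects tasks_ and done_ only
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
    bool done_ = false;
    std::map<std::string, int> data_;          // owned by the worker thread
    std::thread worker_;                       // declared last: starts after the other members exist
};
```

A caller would write `cache.put("a", 1); int v = cache.get("a").get();`. Note two things: the queue itself is still protected by a mutex, and calling `.get()` on the returned future from inside a task running on the same worker would deadlock, which matches the experience with futures described in the question.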
However, there's an important caveat: strands aren't zero-cost. Only in certain contexts can they be optimized (e.g. in immediate continuations or when the execution context has no concurrency).
But in all other contexts, they end up synchronizing at a similar cost to mutexes. The purpose of a strand is not to remove lock contention. Instead, it allows one to declaratively specify the synchronization requirements for tasks, so that the same code can be correctly synchronized regardless of the method of async completion (callbacks, futures, coroutines, awaitables, etc.) or the chosen execution context(s).
Example: I recently uncovered a vivid illustration of the cost of strand synchronization even in a simple context (where serial execution was already implicitly guaranteed) here:
sehe mar 15, 23:08 Oh cool. The strands were unnecessary. I add them for safety until I know it's safe to go without. In this case the async call chains form logical strands (there are no timers or full duplex sockets going on, so it's all linear). That... improves the situation :)
Now it's 3.5gbps even with the 1024 byte server buffer
The throughput increased ~7x from just removing the strand.
Q. If not, what strategy could be adopted to achieve the scalability?
I suspect you really want caches that contain shared_futures. So that the first retrieval puts the future for the result in cache, where subsequent retrievals get the already existing shared future immediately.
If you make sure your cache lookup datastructure is threadsafe, likely with a reader/writer lock (shared_mutex), you will be free to access it with minimal overhead from any actor, instead of requiring to go through individual strands of each producer.
Keep in mind that awaiting futures is a blocking operation. So, if you do that from tasks posted on the execution context, you may easily run out of threads. In such cases it may be better to provide async_get in terms of boost::asio::async_result or boost::asio::async_completion so you can wait in a non-blocking fashion.
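A minimal sketch of such a cache, using only the standard library (C++17 for std::shared_mutex). `FutureCache` and its `Loader` callback are hypothetical names, and the use of a deferred std::async is a choice made for this sketch: the value is computed lazily by whichever caller first waits on the future, rather than on a separate thread.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Cache of shared_futures guarded by a reader/writer lock.
class FutureCache {
public:
    using Loader = std::function<int(const std::string&)>;

    explicit FutureCache(Loader loader) : loader_(std::move(loader)) {}

    std::shared_future<int> get(const std::string& key) {
        {
            // Fast path: many readers may look up concurrently.
            std::shared_lock<std::shared_mutex> read(m_);
            auto it = cache_.find(key);
            if (it != cache_.end()) return it->second;
        }
        // Slow path: take the exclusive lock and re-check, because another
        // thread may have inserted the entry in the meantime.
        std::unique_lock<std::shared_mutex> write(m_);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;

        // Deferred: the loader runs in the first caller that waits on the
        // future, outside of the cache lock.
        std::shared_future<int> fut =
            std::async(std::launch::deferred, loader_, key).share();
        cache_.emplace(key, fut);
        return fut;
    }

private:
    Loader loader_;
    std::shared_mutex m_;
    std::map<std::string, std::shared_future<int>> cache_;
};
```

Every caller gets a copy of the same shared_future, so the loader runs at most once per key. The blocking caveat above still applies to `fut.get()`.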
I'm writing a generic pure C++14 implementation of a signal class, with lifetime tracking of the connected objects, and an additional plus that all this works with copy and move operations.
There are three types of implementations I have now discovered are possible:
Use a global connection_manager that, well, manages all connections. This would be terrible in a highly concurrent scenario where many signals and slots are connected and disconnected, and would need fine-tuned locking to alleviate some of these issues.
Each signal stores its connections, and proper move and copy semantics are implemented. This has two disadvantages: the signal object is large, and move operations are expensive (all the connections need to be updated when the signal or the connectee is moved). This burdens both sides with having proper move constructors.
Each signal stores a pointer to the real signal. This extra level of indirection makes moves a lot cheaper, as the real connected object stays put in memory behind the extra level of indirection. Same goes for the connectee, although each signal emission will need to dereference one extra pointer to get the real object. No move semantics need to be implemented explicitly.
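To illustrate the third option, here is a stripped-down sketch in C++14. It shows only the indirection: lifetime tracking, disconnection and copy semantics are left out, and `Signal`, `Impl` and `identity` are names made up for this example.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Handle type: the real state lives behind a pointer and never moves.
template <typename... Args>
class Signal {
public:
    Signal() : impl_(std::make_unique<Impl>()) {}

    // Moving only transfers the pointer; no connection has to be updated.
    Signal(Signal&&) noexcept = default;
    Signal& operator=(Signal&&) noexcept = default;

    void connect(std::function<void(Args...)> slot) {
        impl_->slots.push_back(std::move(slot));
    }

    // One extra pointer dereference per emission.
    // Calling this on a moved-from Signal is undefined (impl_ is null).
    void emit(Args... args) const {
        for (const auto& slot : impl_->slots) slot(args...);
    }

    // Address of the heap-allocated state, exposed for demonstration only.
    const void* identity() const { return impl_.get(); }

private:
    struct Impl {
        std::vector<std::function<void(Args...)>> slots;
    };
    std::unique_ptr<Impl> impl_;
};
```

After `Signal<int> b = std::move(a);` the connected slots are reachable through `b` and the `Impl` object has not changed address, which is what makes the move cheap. The price is one heap allocation per signal and the extra dereference on each emission.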
Currently, I almost have version 2 ready. I don't know exactly what Qt does, but I know it covers almost everything I want to be able to do with signals/slots, albeit no copy/move semantics and a pre-compilation step.
Am I missing something, or is number three the best way to go in the end if I don't want to have a huge impact on users of signals (i.e. I don't want to increase the size and complexity of their classes just because my signal/connectee classes do a lot of work each time they're moved/copied)?
Note I am looking for real-world experience and suggestions, not opinions.
The boost::interprocess::message_queue mechanism seems primarily designed for just that: interprocess communication.
The problem is that it serializes the objects in the message:
"A message queue just copies raw bytes between processes and does not send objects."
This makes it completely unsuitable for fast and repeated interthread communication with large composite objects being passed.
I want to create a message with a ref/shared_ptr/pointer to a known and previously-created object and safely pass it from one thread to the next.
You CAN use asio::io_service and post with bind completions, but that's rather klunky AND requires that the thread in question be using asio, which seems a bit odd.
I've already written my own, sadly based on asio::io_service, but would prefer to switch over to a Boost-supported general mechanism.
You need a mechanism that is designed for interprocess communication because separate processes have separate address spaces, and you cannot simply pass pointers except in very special cases. For communication between threads you can use standard containers like std::stack, std::queue and std::priority_queue; you just need to provide proper synchronization through mutexes. Or you can use lock-free containers, which are also provided by Boost. What else would you need for interthread communication?
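A minimal sketch of that idea: a std::queue guarded by a mutex and a condition variable, carrying shared_ptrs so that only the pointer crosses the thread boundary. `Channel` and `Message` are hypothetical names for this example, not Boost types.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical message type standing in for a large composite object.
struct Message {
    std::string text;
};

// Unbounded blocking queue for passing values between threads.
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(value));
        }
        cv_.notify_one();
    }

    // Blocks until an element is available.
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};
```

With `Channel<std::shared_ptr<const Message>>`, the producer calls `push(msg)` and the consumer thread calls `pop()`; both end up pointing at the same object, and the `const` discourages either side from mutating it after it has been shared.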
Whilst I'm no expert in Boost per se, there is a fundamental difficulty in communicating between processes and threads via a pipe, message queue, etc, especially if it is assumed that a program's data is classes containing dynamically allocated memory (which is pretty much the case for things written with Boost; a string is not a simple object like it is in C...).
Copying of Data in Classes
Message queues and pipes are indeed just a way of passing a collection of bytes from one thread/process to another thread/process. Generally when you use them you're looking for the destination thread to end up with a copy of the original data, not just a copy of the references to the data (which would be pointing back at the original data).
With a simple C struct containing no pointers at all it's easy; a copy of the struct contains all the data, no problem. But a C++ class with complex data types like strings is now a structure containing references / pointers to allocated memory. Copy that structure and you haven't actually copied the data in the allocated memory.
That's where serialisation comes in. For interprocess communications, where both processes can't ordinarily share the same memory, serialisation serves as a way of parcelling up the structure to be sent, plus all the data it refers to, into a stream of bytes that can be unpacked at the other end. For threads it's no different if you don't want the two threads accessing the same memory at the same time. Serialisation is a convenient way of saving yourself from having to navigate through a class to see exactly what needs to be copied.
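As a small illustration, here is a hand-rolled serialiser for a struct that owns a string. The `Reading` type and the wire layout (two little-endian 32-bit integers followed by the raw characters) are assumptions made up for this sketch; a real project would more likely use an existing serialisation library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical message: the string's characters live in separately
// allocated memory, so copying the struct's bytes would not copy them.
struct Reading {
    std::uint32_t id;
    std::string label;
};

// Layout: id (4 bytes, little-endian), label length (4 bytes), label bytes.
std::vector<unsigned char> serialise(const Reading& r) {
    std::vector<unsigned char> out;
    auto put_u32 = [&out](std::uint32_t v) {
        for (int shift = 0; shift < 32; shift += 8)
            out.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
    };
    put_u32(r.id);
    put_u32(static_cast<std::uint32_t>(r.label.size()));
    out.insert(out.end(), r.label.begin(), r.label.end());
    return out;
}

// Rebuilds an independent Reading from the byte stream.
Reading deserialise(const std::vector<unsigned char>& in) {
    auto get_u32 = [&in](std::size_t pos) {
        if (pos + 4 > in.size()) throw std::runtime_error("truncated message");
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
        return v;
    };
    Reading r;
    r.id = get_u32(0);
    const std::size_t len = get_u32(4);
    if (8 + len > in.size()) throw std::runtime_error("truncated message");
    r.label.assign(in.begin() + 8, in.begin() + 8 + len);
    return r;
}
```

The receiver ends up with its own copy of the label, so the two sides share no memory after the transfer.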
Efficiency
I don't know what Boost uses for serialisation, but clearly serialising to XML would be painfully inefficient. A binary serialisation like ASN.1 BER would be much faster.
Also, copying data through pipes and message queues is no longer as inefficient as it used to be. Traditionally programmers didn't do it because of the perceived waste of time spent copying the data repeatedly just to share it with another thread. With a single core machine that involves a lot of slow and wasteful memory accesses.
However, if one considers what "memory access" is in these days of QPI, Hypertransport, and so forth, it's not so very different to just copying the data in the first place. In both cases it involves data being sent over a serial bus from one core's memory controller to another core's cache.
Today's CPUs are really NUMA machines with memory access protocols layered on top of serial networks to fake an SMP environment. Programming in the style of copying messages through pipes, message queues, etc. is definitely edging towards saying that one is content with the idea of NUMA, and that really you don't need SMP at all.
Also, if you do all your inter-thread communications as message queues, they're not so very different to pipes, and pipes aren't so different to network sockets (at least that's the case on Not-Windows). So if you write your code carefully you can end up with a program that can be redeployed across a distributed network of computers or across a number of threads within a single process. That's a nice way of getting scalability because you're not changing the shape or feel of your program in any significant way when you scale up.
Fringe Benefits
Depending on the serialisation technology used there can be some fringe benefits. With ASN.1 you specify a message schema in which you set out the valid ranges of the message's contents. You can say, for example, that a message contains an integer, and it can have values between 0 and 10. The encoders and decoders generated by decent ASN.1 tools will automatically check that the data you're sending or receiving meets that constraint, and return errors if not.
I would be surprised if other serialisers like Google Protocol Buffers didn't do a similar constraints check for you.
The benefit is that if you have a bug in your program and you try and send an out of spec message, the serialiser will automatically spot that for you. That can save a ton of time in debugging. Also it is something you definitely don't get if you share a memory buffer and protect it with a semaphore instead of using a message queue.
CSP
Communicating Sequential Processes and the Actor model are based on sending copies of data through message queues, pipes, etc. just like you're doing. CSP in particular is worth paying attention to because it's a good way of avoiding a lot of the pitfalls of multi-threaded software that can lurk undetected in source code.
There are some CSP implementations you can just use. There's JCSP, a class library for Java, and C++CSP, built on top of Boost to do CSP for C++. They're both from the University of Kent.
C++CSP looks quite interesting. It has a template class called csp::mobile, which is kind of like a Boost smart pointer. If you send one of these from one thread to another via a channel (CSP's word for a message queue) you're sending the reference, not the data. However, the template records which thread 'owns' the data. So a thread receiving a mobile now owns the data (which hasn't actually moved), and the thread that sent it can no longer access it. So you get the benefits of CSP without the overhead of copying the data.
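The same ownership-transfer idea can be approximated in standard C++ with std::unique_ptr. The sketch below is not the C++CSP API: `ToyChannel` is a made-up, single-threaded stand-in for a channel, and a real one would add the locking needed to cross threads.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

using Buffer = std::vector<int>;

// Single-threaded stand-in for a channel that carries ownership.
class ToyChannel {
public:
    // Taking the unique_ptr by value forces the caller to std::move it in,
    // which leaves the caller's pointer null.
    void send(std::unique_ptr<Buffer> buffer) { q_.push(std::move(buffer)); }

    std::unique_ptr<Buffer> receive() {
        assert(!q_.empty());
        std::unique_ptr<Buffer> buffer = std::move(q_.front());
        q_.pop();
        return buffer;
    }

private:
    std::queue<std::unique_ptr<Buffer>> q_;
};
```

After `ch.send(std::move(buf));` the sender's `buf` is null, so it can no longer reach the data, while the receiver gets the very same buffer without any element being copied.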
It also looks like C++CSP is able to do channels over TCP; that's a very attractive feature, because it makes scaling up really simple. JCSP works over network connections too.
I am developing an app right now which creates and stores a connection to a local XMPP server in the Application scope. The connection methods are stored in a cfc that makes sure the Application.XMPPConnection is connected and authorized each time it is used, and makes use of the connection to send live events to users. As far as I can tell, this is working fine. BUT it hasn't been tested under any kind of stress.
My question is: Will this setup cause problems later on? I only ask because I can't find evidence of other people using Application variables in this way. If I weren't using Railo I would be using CF's event gateway instead to accomplish the same task.
Size itself isn't a problem. If you were to initialize one object per request, you'd burn a lot more memory. The problem is access.
If you have a large number of requests competing for the same object, you need to measure the access time for that object vs. instantiation. Keep in mind that, for data objects, more than one thread can read them. My understanding, though, is that when an object's function is called, it locks that object to other threads until the function returns.
Also, if the object maintains state, you need to consider what to do when multiple threads are getting/setting that data. Will you end up with race conditions?
You might consider handling this object in the session scope, so that it is only instantiated per user (who, likely, will only make one or two simultaneous requests).
Of course you can use application scope for storing these components if they are used by all users in different parts of application.
Now, possible issues are:
size of the component(s)
time needed for initialization if these are set during application start
race conditions between setting/getting the state of these components
For the first, there are ways to calculate the size of a component in memory. Lately there have been lots of posts on this topic, so it should be easy to find some. If you don't have some large structure or query saved inside, I guess you're ok here.
Second, again, if you are not filling this cfc with some large query from the DB or doing some slow parsing, you're ok here too.
Third, pay attention to possible situations where several users are changing the state of these components. If so, use cflock around every write to the components' state.