Factory pattern: Can the "definition" be too large? - C++

Is there a disadvantage when the "object definition" that is passed into a "factory" becomes (too) big/complex?
Example
In a game engine, I have a prototype class for 3D graphics objects that is quite large, at least for me.
It contains:
a pointer (handle) to a 3D mesh
pointers (handles) to 8 textures (e.g. lambertian, specular)
colors for the 8 textures (e.g. color multipliers) - 4 floats each
custom settings for the 8 textures - 4 floats each
~10 boolean flags for blending, depth test, depth write, etc.
(gradually added as the project proceeds)
In game logic, I cache some (100?) prototype instances scattered around; most of them are stored as fields in many subsystems.
I have found that it is also very convenient to store prototypes by value.
Question
Besides the obvious direct memory/CPU cost, are there any easily overlooked disadvantages that occur when the prototype is very big?
What are the criteria for determining that a prototype (the definition passed into the factory) is too big/complex? What remedy or design pattern can cure it?
Should the prototype instead be stored in business/game logic by handle/pointer? (I have this idea because people tend to use pointers for large objects, but that is a very weak reason.)

Answers:
Besides the obvious direct memory/CPU cost, are there any easily overlooked disadvantages that occur when the prototype is very big?
If you hold copies of one graphics object in different places, then when you change the object you have to update every copy (under a lock, if multithreaded), or you will run into inconsistency issues. This increases code complexity and the risk of stale data.
What are the criteria for determining that a prototype (the definition passed into the factory) is too big/complex? What remedy or design pattern can cure it?
The factory pattern is about object creation. If you find the logic or code in the factory too complex, the problem is your object structure, not the factory pattern.
Should the prototype instead be stored in business/game logic by handle/pointer? (I have this idea because people tend to use pointers for large objects, but that is a very weak reason.)
For your case, I recommend the pimpl idiom or a smart pointer (e.g. std::shared_ptr) so that every subsystem refers to the same stored object, which can greatly reduce complexity and the number of live copies.
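For example, a minimal sketch of the smart-pointer route (the type and subsystem names here are invented for illustration): subsystems share one immutable prototype instead of each holding a copy.

#include <memory>

// Stand-in for the large prototype described in the question.
struct GraphicPrototype
{
    // mesh handle, texture handles, colors, custom settings, flags, ...
};

struct RenderSubsystem
{
    std::shared_ptr<const GraphicPrototype> prototype; // shared, not copied
};

struct AnimationSubsystem
{
    std::shared_ptr<const GraphicPrototype> prototype; // refers to the same object
};

Making the pointee const means every subsystem reads one consistent definition; any change goes through a single owner that publishes a new prototype.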

Related

Managing memory in 3d array of polymorphic classes

I've got a question here that someone might be able to help out with.
If you can, many thanks!
I'm trying to store a three dimensional array containing many, small objects.
All of these object classes will inherit from one parent class.
Ideally I'm looking at a 32x32x32 array of objects, and objects are not likely to exceed 8 bytes each.
So, 32768 objects at up to 8 bytes each = 256 KB
However some of them will NOT require 8 bytes of data, and will only require, in some cases, a single byte of data.
The problem is that each of these objects will have a distinct set of behaviours based on what they are. So if I call (obj.print()) each one will behave differently.
This in itself is easily solved by creating subclasses that inherit from the generic interface class. However, because they're being stored in large quantities in a 3D array, storing them as pointers automatically adds more overhead: 8 bytes for the object itself, plus 4 bytes for the pointer to it (on a 32-bit system).
On average, even if some of the objects are less than 8 bytes, the overhead will still increase memory usage by at least 30%.
That in itself isn't a huge issue, until I start storing many of these data blocks - then memory becomes a rather serious concern.
It seems that, from a long-term maintainability and code management point of view, this really is the best way to do it. But I come to the smart folks here to see if you've got any pearls of wisdom you could share! :)
Thanks,
Brian
If you insist on runtime polymorphism, I have bad news for you - you will be using pointers, which will add some memory overhead.
As far as 1-byte polymorphic structures go, there is a vtable pointer added when your member functions are virtual, so I'm not entirely sure how you would go about implementing them.
If you could drop the polymorphism, std::tuple might be an alternative. There might be trouble iterating over it, or initializing it with 32^3 elements, but it might be doable using templates, depending on what you want to achieve.
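For illustration, here is a minimal way to observe the vtable-pointer overhead mentioned above (the exact sizes are implementation-defined; on a typical 64-bit compiler the virtual variant grows from 1 byte to 16 because of the vtable pointer plus alignment padding):

#include <cstdio>

struct Plain   { char data; };                        // no vtable pointer
struct Virtual { char data; virtual ~Virtual() {} };  // vtable pointer added

int main()
{
    std::printf("sizeof(Plain)   = %zu\n", sizeof(Plain));    // typically 1
    std::printf("sizeof(Virtual) = %zu\n", sizeof(Virtual));  // typically 16 on 64-bit
    return 0;
}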

In generic object update loop, is it better to update per controller or per object?

I'm writing some generic code which basically will have a vector of objects being updated by a set of controllers.
The code is a bit complex in my specific context but a simplification would be:
#include <memory>
#include <vector>

template< class T >
class Controller
{
public:
    virtual ~Controller(){}
    virtual void update( T& ) = 0;
    // and potentially other functions used in other cases than update
};

template< class T >
class Group
{
public:
    typedef std::shared_ptr< Controller<T> > ControllerPtr;
    void add_controller( ControllerPtr );    // register a controller
    void remove_controller( ControllerPtr ); // remove a controller
    void update();                           // update all objects using controllers
private:
    std::vector< T > m_objects;
    std::vector< ControllerPtr > m_controllers;
};
I intentionally didn't use std::function because I can't use it in my specific case.
I also intentionally use shared pointers instead of raw pointers, this is not important for my question actually.
Anyway, it's the update() implementation that interests me.
I can do it two ways.
A) For each controller, update all objects.
template< class T >
void Group<T>::update()
{
    for( auto& controller : m_controllers )
        for( auto& object : m_objects )
            controller->update( object );
}
B) For each object, update by applying all controllers.
template< class T >
void Group<T>::update()
{
    for( auto& object : m_objects )
        for( auto& controller : m_controllers )
            controller->update( object );
}
"Measure! Measure! Measure!" you will say and I fully agree, but I can't measure what I don't use. The problem is that it's generic code. I don't know the size of T, I just assume it will not be gigantic, maybe small, maybe still a bit big. Really I can't assume much about T other than it is designed to be contained in a vector.
I also don't know how many controllers or T instances will be used. In my current use cases, there would be widely different counts.
The question is: which solution would be the most efficient in general?
I'm thinking about cache coherency here. Also, I assume this code would be used on different compilers and platforms.
My gut tells me that updating the instruction cache is certainly faster than updating the data cache, which would make solution B) the more efficient in general. However, I have learned not to trust my gut when I have doubts about performance, so I'm asking here.
The solution I'm getting to would allow the user to choose (using a compile-time policy) which update implementation to use with each Group instance, but I want to provide a default policy and I can't decide which one would be the most efficient for most of the cases.
We have living proof that modern compilers (Intel C++ in particular) are able to interchange loops, so it shouldn't really matter for you.
I remember it from Mysticial's great answer:
Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate!
Wikipedia article about the topic
Detecting whether loop interchange can be done requires checking if the swapped code will really produce the same results. In theory it could be possible to prepare classes that won't allow for the swap, but then again, it could be possible to prepare classes that would benefit from either version more.
Cache-Friendliness Is Close to Godliness
Knowing nothing else about how the update methods of individual Controllers behave, I think the most important factor in performance would be cache-friendliness.
Considering cache effectiveness, the only difference between the two loops is that m_objects are laid out contiguously (they are contained in the vector) and accessed linearly in memory (the loop is in order), while m_controllers are only pointed to here: they can be anywhere in memory, and moreover they can be of different types with different update() methods that themselves can reside anywhere. Therefore, while looping over them we would be jumping around in memory.
With respect to cache, the two loops would behave like this: (things are never simple and straightforward when you are concerned about performance, so bear with me!)
Loop A: The inner loop runs efficiently (unless the objects are large - hundreds or thousands of bytes - or they store their data outside themselves, e.g., std::string) because the cache access pattern is predictable and the CPU will prefetch consecutive cachelines so there won't be much stalling on reading memory for the objects. However, if the size of the vector of objects is larger than the size of the L2 (or L3) cache, each iteration of the outer loop will require reloading of the entire cache. But again, that cache reloading will be efficient!
Loop B: If indeed the controllers have many different types of update() methods, the inner loop here may cause wild jumping around in memory, but all these different update functions will be working on data that is cached and available (which helps especially if objects are large or themselves contain pointers to data scattered in memory). Unless the update() methods access so much memory themselves (because, e.g., their code is huge or they require a large amount of their own - i.e. controller - data) that they thrash the cache on each invocation; in which case all bets are off anyways.
So, I suggest the following strategy generally, which requires information that you probably don't have:
If objects are small (or smallish!) and POD-like (don't contain pointers themselves) definitely prefer loop A.
If objects are large and/or complex, or if there are many many different types of complex controllers (hundreds or thousands of different update() methods) prefer loop B.
If objects are large and/or complex, and there are so very many of them that iterating over them will thrash the cache many times (millions of objects), and the update() methods are many and they are very large and complex and require a lot of other data, then I'd say the order of your loop doesn't make any difference and you need to consider redesigning objects and controllers.
Sorting the Code
If you can, it may be beneficial to sort the controllers based on their type! You can use some internal mechanism in Controller, or something like typeid(), or another technique to sort the controllers by type, so that the behavior of consecutive update() passes becomes more regular and predictable (see the sketch below).
This is a good idea regardless of which loop order you choose to implement, but it will have much more effect in loop B.
However, if you have a lot of variation among controllers (i.e. if practically all are unique), this won't help much. Also, obviously, if you need to preserve the order in which controllers are applied, you won't be able to do this.
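A minimal sketch of the sorting idea (this assumes a sort_controllers_by_type member is added to the question's Group template; std::stable_sort keeps the relative order within each type, though not the overall application order):

#include <algorithm>
#include <typeinfo>

template< class T >
void Group<T>::sort_controllers_by_type()
{
    std::stable_sort( m_controllers.begin(), m_controllers.end(),
        []( const ControllerPtr& a, const ControllerPtr& b )
        {
            // typeid(*ptr) yields the dynamic type because Controller<T>
            // is polymorphic; before() gives an arbitrary but consistent order.
            return typeid( *a ).before( typeid( *b ) );
        } );
}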
Adaptation and Improvisation
It should not be hard to implement both loop strategies and select between them at compile-time (or even runtime) based on either user hint or based on information available at compile time (e.g. size of T or some traits of T; if T is small and/or a POD, you probably should use loop A.)
You can even do this at runtime, basing your decision on the number of objects and controllers and anything else you can find out about them.
But these kinds of "Klever" tricks can get you into trouble, as the behavior of your container will depend on weird, opaque, and even surprising heuristics and hacks. They might even hurt performance in some cases, because there are many other factors meddling in the performance of these two loops, including but not limited to: the nature of the data and the code in objects and controllers; the exact sizes, configurations, and relative speeds of the cache levels; the architecture of the CPU and the exact way it handles prefetching, branch prediction, and cache misses; the code that the compiler generates; and much more.
If you want to use this technique (implementing both loops and switching between them are compile- and/or run-time) I highly suggest that you let the user do the choosing. You can accept a hint about which update strategy to use, either as a template parameter or a constructor argument. You can even have two update functions (e.g. updateByController() and updateByObject()) that the user can call at will.
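A sketch of the two-update-functions variant (assuming these are simply added as members of the question's Group template):

template< class T >
void Group<T>::updateByController()  // loop A: controller-major order
{
    for( auto& controller : m_controllers )
        for( auto& object : m_objects )
            controller->update( object );
}

template< class T >
void Group<T>::updateByObject()      // loop B: object-major order
{
    for( auto& object : m_objects )
        for( auto& controller : m_controllers )
            controller->update( object );
}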
On Branch Prediction
The only interesting branch here is the virtual update call, and as an indirect call through two pointers (the pointer to the controller instance and then the pointer to its vtable) it is quite hard to predict. However, sorting controllers based on type will help immensely with this.
Also remember that a mispredicted branch will cause a stall of a few to a few dozen CPU cycles, but for a cache miss, the stall will be in the hundreds of cycles. Of course, a mispredicted branch can cause a cache miss too, so... As I said before, nothing is simple and straightforward when it comes to performance!
In any case, I think cache friendliness is by far the most important factor in performance here.

Optimization and testability at the same time - how to break up code into smaller units

I am trying to break up a long "main" program in order to be able to modify it, and also perhaps to unit-test it. It uses some huge data, so I hesitate:
What is best: to have function calls, with possibly extremely large (memory-wise) data being passed,
(a) by value, or
(b) by reference
(by extremely large, I mean maps and vectors of vectors of some structures and small classes... even images... that can be really large)
(c) Or to have private data that all the functions can access? That may also mean that main_processing() or something could have a vector of all of them, while some functions will only have an item... with the advantage of the functions being testable.
My question, though, has to do with optimization: while I am trying to break this monster into baby monsters, I also do not want to run out of memory.
It is not very clear to me how many copies of the data I am going to have if I create local variables.
Could someone please explain?
Edit: this is not a generic "how to break down a very large program into classes" question. This program is part of a large solution that is already broken down into small entities.
The executable I am looking at, while fairly large, is a single entity with non-divisible data. So the data will either all be created as member variables in a single class (which I have already created), or it will (all of it) be passed around as arguments between functions.
Which is better ?
If you want unit testing, you cannot "have private data that all the functions can access", because then all of that data would be part of each test case.
So, you must think about each function, and define exactly on which part of the data it works. As for function parameters and return values, it's very simple: use pass-by-value for small objects, and pass-by-reference for large objects.
You can use a guesstimate for the threshold that separates small from large. I use the rule "8 bytes is small, anything more is large", but what is good for my system may not be equally good for yours.
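A minimal sketch of that rule of thumb (the function and type names are invented): small results travel by value, large inputs by const reference, and large in-out data by non-const reference.

#include <numeric>
#include <vector>

struct Stats { double mean = 0.0; };  // small: cheap to pass and return by value

// Large read-only input: const reference, no copy is made.
Stats compute_stats( const std::vector<double>& samples )
{
    Stats s;
    if( !samples.empty() )
        s.mean = std::accumulate( samples.begin(), samples.end(), 0.0 )
                 / samples.size();
    return s;
}

// Large in-out data: non-const reference; the small Stats still goes by value.
void demean( std::vector<double>& samples, Stats s )
{
    for( double& x : samples )
        x -= s.mean;
}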
This seems more like a general question about OOP. Split up your data into logically grouped concepts (classes), and place the code that works with those data elements with the data (member functions), then tie it all together with composition, inheritance, etc.
Your question is too broad to give more specific advice.

Matrix creation/destruction in C++ - best practice?

Suppose I have C++ code with many small functions, in each of which I will typically need a matrix float M1(n,p), with n and p known at run time, to contain the results of intermediate computations (no need to initialize M1, just to declare it, because each function will simply overwrite all rows of M1).
Part of the reason for this is that each function works on an original data matrix that it can't modify, so many operations (sorting, de-meaning, sphering) need to be done "elsewhere".
Is it better practice to create a temporary M1(n,p) within each function, or rather once and for all in the main() and pass it to each function as a sort of bucket that each function can use as scrap space?
n and p are often moderately large [10^2-10^4] for n and [5-100] for p.
(originally posted at the codereview stackexchange but moved here).
Best,
1. Heap allocations are indeed quite expensive.
2. Premature optimization is bad, but if your library is quite general and the matrices are huge, it may not be premature to seek an efficient design. After all, you don't want to modify your design after you have accumulated many dependencies on it.
3. There are various levels at which you can tackle this problem. You can, for example, avoid the heap-allocation expense at the memory-allocator level (e.g. a per-thread memory pool).
4. While heap allocation is expensive, you are creating one giant matrix only to do some rather expensive operations on it (typically linear complexity or worse). Relatively speaking, allocating a matrix on the free store may not be that expensive compared to what you are inevitably going to do with it afterwards, so it may actually be quite cheap in comparison to the overall logic of a function like sorting.
I recommend you write the code naturally, taking point 3 into account as a future possibility. That is, don't take in references to matrix buffers for intermediate computations just to accelerate the creation of temporaries. Make the temporaries and return them by value. Correctness and good, clear interfaces come first.
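A sketch of that advice, assuming a hypothetical Matrix type with rows(), cols(), and operator(): the local temporary never leaks into the interface, and NRVO or a move makes the return cheap.

#include <cstddef>

Matrix demean_rows( const Matrix& data )
{
    Matrix result( data.rows(), data.cols() );  // local temporary, scrap space
    for( std::size_t i = 0; i < data.rows(); ++i )
    {
        double mean = 0.0;
        for( std::size_t j = 0; j < data.cols(); ++j )
            mean += data( i, j );
        mean /= data.cols();
        for( std::size_t j = 0; j < data.cols(); ++j )
            result( i, j ) = data( i, j ) - mean;
    }
    return result;  // no copy in practice: NRVO or a cheap move
}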
Mostly the goal here is to separate the creational policy of a matrix (via allocator or other means) which gives you that breathing room to optimize as an afterthought without changing too much existing code. If you can do it by modifying only the implementation details of the functions involved or, better yet, modifying only the implementation of your matrix class, then you're really well off because then you're free to optimize without changing the design, and any design which allows that is generally going to be complete from an efficiency standpoint.
WARNING: The following is only intended if you really want to squeeze the most out of every cycle. It is essential to understand #4 and also get yourself a good profiler. It's also worth noting that you'll probably do better by optimizing memory access patterns for these matrix algorithms than trying to optimize away the heap allocation.
If you need to optimize the memory allocation, consider optimizing it with something general like a per-thread memory pool. You could make your matrix take in an optional allocator, for instance, but I emphasize optional here and I'd also emphasize correctness first with a trivial allocator implementation.
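A rough sketch of that optional-allocator idea (all names here are made up; the trivial default keeps the simple, correctness-first interface intact while leaving room for a pooled allocator later):

#include <cstddef>

// Trivial default: plain new/delete. A per-thread pool could subclass this.
struct MatrixAllocator
{
    virtual ~MatrixAllocator() = default;
    virtual float* allocate( std::size_t n ) { return new float[n]; }
    virtual void deallocate( float* p ) { delete[] p; }
};

class M1
{
public:
    M1( std::size_t n, std::size_t p,
        MatrixAllocator& alloc = default_allocator() )
        : m_alloc( &alloc ), m_data( alloc.allocate( n * p ) ) {}
    ~M1() { m_alloc->deallocate( m_data ); }
    M1( const M1& ) = delete;            // copying omitted from this sketch
    M1& operator=( const M1& ) = delete;
private:
    static MatrixAllocator& default_allocator()
    {
        static MatrixAllocator a;
        return a;
    }
    MatrixAllocator* m_alloc;
    float* m_data;
};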
In other words:
Is it better practice to declare M1(n,p) within each function, or
rather once and for all in the main() and pass it to each function as
a sort of bucket that each function can use as scrap space.
Go ahead and create M1 as a temporary in each function. Try to avoid requiring the client to make some matrix that has no meaning to him/her only to compute intermediary results. That would be exposing an optimization detail which is something we should strive not to do when designing interfaces (hide all details that clients should not have to know about).
Instead, focus on more general concepts if you absolutely want that option to accelerate the creation of these temporaries, like an optional allocator. This fits in with practical designs like with std::set:
std::set<int, std::less<int>, MyFastAllocator<int>> s; // <-- okay
Even though most people just do:
std::set<int> s;
In your case, it might simply be:
M1 my_matrix(n, p, alloc);
It's a subtle difference, but an allocator is a much more general concept we can use than a cached matrix which otherwise has no meaning to the client except that it's some kind of cache that your functions require to help them compute results faster. Note that it doesn't have to be a general allocator. It could just be your preallocated matrix buffer passed in to a matrix constructor, but conceptually it might be good to separate it out merely for the fact that it is something a bit more opaque to clients.
Additionally, constructing this temporary matrix object would also require care not to share it across threads. That's another reason you probably want to generalize the concept a bit if you do go the optimization route, as something a bit more general like a matrix allocator can take into account thread safety or at the very least emphasize more by design that a separate allocator should be created per thread, but a raw matrix object probably cannot.
The above is only useful if you really care about the quality of your interfaces first and foremost. If not, I'd recommend going with Matthieu's advice as it is much simpler than creating an allocator, but both of us emphasize making the accelerated version optional.
Do not optimise prematurely. Create something that works properly and well, and optimise later if it is shown to be slow.
(By the way I don't think stackoverflow is the correct place for it either).
In reality, if you want to speed up an application operating on large matrices, concurrency would be your solution. And if you are using concurrency, you are likely to get into far more trouble with one big global matrix.
Essentially it means you can never have more than one calculation happening at a time, even if you have the memory for it.
The design of your matrix needs to be optimal. We would have to look at this design.
I would therefore generally say in your code, no, do NOT create one big global matrix because it sounds wrong for what you want to do with it.
First try to define the matrix inside the function; that is definitely the better design choice. But if you get performance losses you can't afford, I think passing a buffer by reference is OK, as long as you keep in mind that the functions aren't thread-safe anymore. If at any point you use threads, each thread needs its own buffer.
There are advantages in terms of performance in requiring an externally supplied buffer, especially when you are required to chain the functions that make use of it.
However, from a user point of view, it can soon get annoying.
I have often found that it's simple enough in C++ to get the best of both worlds by simply offering both ways:
int compute(Matrix const& argument, Matrix& buffer);

inline int compute(Matrix const& argument) {
    Matrix buffer(argument.width, argument.height);
    return compute(argument, buffer);
}
This very simple wrapping means that the code is written once, and two slightly different interfaces are presented.
The more involved API (taking a buffer) is also slightly less safe, as the buffer must respect some size constraints with respect to the argument, so you might want to further insulate the fast API (for example, behind a namespace) so as to encourage users to try the slower but safer interface first, and only reach for the fast one when it's proven necessary.
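Combining this wrapper with the thread-safety caveat from the previous answer, one more variant (a sketch; it assumes Matrix is assignable and exposes width/height as above) gives each thread its own reusable scrap buffer:

int compute_reusing_buffer(Matrix const& argument) {
    // Each thread gets its own buffer, reused across calls on that thread.
    thread_local Matrix buffer(argument.width, argument.height);
    if (buffer.width != argument.width || buffer.height != argument.height)
        buffer = Matrix(argument.width, argument.height); // resize when needed
    return compute(argument, buffer);
}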

Storing collections of items that an algorithm might need

I have a class MyClass that stores a collection of PixelDescriptor* objects. MyClass uses a function specified by a Strategy pattern style object (call it DescriptorFunction) to do something for each descriptor:
void MyFunction()
{
    for (PixelDescriptor* descriptor : DescriptorCollection)
    {
        DescriptorFunction->DoSomething(descriptor);
    }
}
However, this only makes sense if the descriptors are of a type that the DescriptorFunction knows how to handle. That is, not all DescriptorFunctions know how to handle all descriptor types, but as long as the stored descriptors are of the types that the specified visitor knows about, all is well.
How would you ensure the right type of descriptors are computed? Even worse, what if the strategy object needs more than one type of descriptor?
I was thinking about making a composite descriptor type, something like:
class CompositeDescriptor
{
    std::vector<PixelDescriptor*> Descriptors;
};
Then a CompositeDescriptor could be passed to the DescriptorFunction. But again, how would I ensure that the correct descriptors are present in the CompositeDescriptor?
As a more concrete example, say one descriptor is Color and another is Intensity. One Strategy may be to average Colors. Another strategy may be to average Intensities. A third strategy may be to pick the larger of the average color or the average intensity.
I've thought about having another Strategy style class called DescriptorCreator that the client would be responsible for setting up. If a ColorDescriptorCreator was provided, then the ColorDescriptorFunction would have everything it needs. But making the client responsible for getting this pairing correct seems like a bad idea.
Any thoughts/suggestions?
EDIT: In response to Tom's comments, a bit more information:
Essentially, DescriptorFunction is comparing two pixels in an image. These comparisons can be done in many ways (besides just finding the absolute difference between the pixels themselves). For example: 1) compute the average of corresponding pixels in regions around the pixels (centered at the pixels); 2) compute a fancier "descriptor", which typically produces a vector at each pixel, and average the difference of the two vectors element-wise; 3) compare 3D points corresponding to the pixel locations in external data; etc.
I've run into two problems.
1) I don't want to compute everything inside the strategy (if the strategy just took the 2 pixels to compare as arguments), because then the strategy has to store lots of other data (the image, a mask describing some invalid regions, etc.) and I don't think it should be responsible for that.
2) Some of these things are expensive to compute. I have to do this millions of times (the pixels being compared are always different, but the features at each pixel do not change), so I don't want to compute any feature more than once. For example, suppose the strategy function compares the fancy descriptors. In each iteration, one pixel is compared to all other pixels; this means that in the second iteration, all of the features would have to be computed again, which is extremely redundant. This data needs to be stored somewhere that all of the strategies can access - this is why I was trying to keep a vector in the main client, as sketched below.
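(A rough sketch of what I mean, with made-up names: features are computed once up front, and the strategies only read them.)

#include <cstddef>
#include <vector>

struct Features
{
    std::vector<float> fancyDescriptor; // computed once per pixel
    float regionAverage;
};

class MyClass
{
public:
    void PrecomputeFeatures(); // fills m_features once, before any comparisons
    const Features& FeaturesAt(std::size_t pixel) const { return m_features[pixel]; }
private:
    std::vector<Features> m_features; // one entry per pixel, shared by all strategies
};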
Does this clarify the problem? Thanks for the help so far!
The first part sounds like a visitor pattern could be appropriate. A visitor can ignore any types it doesn't want to handle.
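A minimal sketch of what that could look like for the Color/Intensity example from the question (the accept/visit names are the usual visitor vocabulary, not from the question; the empty default visit() bodies are what let a visitor ignore types it doesn't handle):

struct ColorDescriptor;
struct IntensityDescriptor;

struct DescriptorVisitor
{
    virtual ~DescriptorVisitor() = default;
    virtual void visit(ColorDescriptor&) {}     // default: ignore
    virtual void visit(IntensityDescriptor&) {} // default: ignore
};

struct PixelDescriptor
{
    virtual ~PixelDescriptor() = default;
    virtual void accept(DescriptorVisitor& v) = 0;
};

struct ColorDescriptor : PixelDescriptor
{
    void accept(DescriptorVisitor& v) override { v.visit(*this); }
};

struct IntensityDescriptor : PixelDescriptor
{
    void accept(DescriptorVisitor& v) override { v.visit(*this); }
};

// A strategy that only cares about colors overrides just one visit().
struct AverageColors : DescriptorVisitor
{
    void visit(ColorDescriptor& c) override { /* accumulate c here */ }
};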
If they require more than one descriptor type, then it is a different abstraction entirely. Your description is somewhat vague, so it's hard to say exactly what to do. I suspect that you are overthinking it. I say that because, generally, choosing arguments for an algorithm is a high-level concern.
I think the best advice is to try it out.
I would write each algorithm with the concrete arguments (or stubs if it's well understood). Write code to translate the collection into the concrete arguments. If there is an abstraction to be made, it should become apparent while writing your translations. Writing a few choice helper functions for these translations is usually the bulk of the work.
Giving a concrete example of the descriptors and a few example declarations might give enough information to offer more guidance.