std::tuple sizeof, is it a missed optimization?

I've checked all major compilers, and sizeof(std::tuple<int, char, int, char>) is 16 for all of them. Presumably they just put elements in order into the tuple, so some space is wasted because of alignment.
If tuple stored elements internally like: int, int, char, char, then its sizeof could be 12.
Is it possible for an implementation to do this, or is it forbidden by some rule in the standard?

std::tuple sizeof, is it a missed optimization?
Yep.
Is it possible for an implementation to do this[?]
Yep.
[Is] it forbidden by some rule in the standard?
Nope!
Reading through [tuple], there is no constraint placed upon the implementation to store the members in template-argument order.
In fact, every passage I can find seems to go to lengths to avoid making any reference to member-declaration order at all: get<N>() is used in the description of operational semantics. Other wording is stated in terms of "elements" rather than "members", which seems like quite a deliberate abstraction.
Indeed, some implementations apparently store the members in reverse order, probably simply due to the way they use recursive inheritance to unpack the template arguments (and because, as above, they're permitted to).
Speaking specifically about your hypothetical optimisation, though, I'm not aware of any implementation that doesn't store elements in [some trivial function of] the user-given order. I'm guessing it would be "hard" to come up with such an order and to provide the machinery for std::get, at least compared with the gain you'd get from doing so. If you are really concerned about padding, you may of course choose your element order carefully to avoid it (on some given platform), much as you would with a class (without delving into the world of "packed" attributes). (A "packed" tuple could be an interesting proposal…)
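For illustration, a quick check shows how grouping elements by alignment removes the padding (sizes assume a typical ABI where int is 4 bytes with 4-byte alignment; they are not guaranteed by the standard):

#include <iostream>
#include <tuple>

int main() {
    // Interleaved alignments force 3 bytes of padding after each char:
    std::cout << sizeof(std::tuple<int, char, int, char>) << '\n'; // typically 16
    // Grouping elements by alignment eliminates the padding:
    std::cout << sizeof(std::tuple<int, int, char, char>) << '\n'; // typically 12
}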

Yes, it's possible and has been (mostly) done by R. Martinho Fernandes. He used to have a blog called Flaming Danger Zone, which is now down for some reason, but its sources are still available on GitHub.
Here are all four parts of the Size Matters series on this exact topic: 1, 2, 3, 4.
You might wish to view them raw, since GitHub doesn't understand the C++ highlighting markup used and renders the code snippets as unreadable one-liners.
He essentially computes a permutation of the tuple indices via a C++11 template metaprogram that sorts the elements by alignment in non-ascending order, stores the elements according to that permutation, and then applies it to the index on every access.
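A minimal sketch of the mechanism (hypothetical names, C++17 for brevity, and with the permutation hard-coded for a single instantiation instead of computed by a sorting metaprogram as in the articles):

#include <cstddef>
#include <tuple>

template <typename... Ts>
struct packed_tuple; // primary template left undefined in this sketch

// User order: int, char, int, char  ->  storage order: int, int, char, char
template <>
struct packed_tuple<int, char, int, char> {
    std::tuple<int, int, char, char> storage;
    // user index -> storage index: 0->0, 1->2, 2->1, 3->3
    static constexpr std::size_t map[4] = {0, 2, 1, 3};
};

// get<N> forwards through the index map, so callers keep using user order.
template <std::size_t N, typename... Ts>
auto& get(packed_tuple<Ts...>& t) {
    return std::get<packed_tuple<Ts...>::map[N]>(t.storage);
}

int main() {
    packed_tuple<int, char, int, char> t;
    get<1>(t) = 'a'; // the user's second element (a char), stored third
    static_assert(sizeof(t) <= sizeof(std::tuple<int, char, int, char>), "");
}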

They could. One possible reason they don't: some architectures, including x86, have an addressing mode that can compute base + size × index in a single instruction, but only when size is a power of 2. Or it might be slightly faster to do a load or store aligned to a 16-byte boundary. This could make code that addresses arrays of std::tuple slightly faster and more compact if the four padding bytes are kept.
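A rough illustration of that point (hypothetical, and entirely target- and compiler-dependent):

#include <cstddef>
#include <tuple>

// With sizeof == 16 (a power of 2), each &a[i] is base + (i << 4): a single
// shift, with small scale factors (1/2/4/8) even folding directly into x86's
// scaled-index addressing. With sizeof == 12, the address is base + i * 12,
// which needs a multiply or a shift-and-add sequence instead.
std::tuple<int, char, int, char> a[1000]; // sizeof is typically 16

int sum_first(std::size_t n) {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += std::get<0>(a[i]); // indexing scales i by the element size
    return s;
}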

Related

Is it better to perform direct pointer operations or []?

I have a 2D array. I need to perform a few operations on it as fast as possible (the function will be called dozens of times per second, so it would be nice to make it efficient).
Now, let's say I want to get element A[i][j]: is there any difference in speed between simply using A[i][j] and *(A + (i*width + j)) (ignoring the fact that I need to calculate i*width + j; let's say I already have this value)?
With all the optimizations turned on, there should be no difference - not only in the timing, but also in the code the compiler generates for these two constructs.
The biggest difference from a programmer's point of view is readability. The first construct immediately tells the reader that he's dealing with a 2D array, while the second one requires some thinking (is it a row-major order, or a column-major order? Where is the width calculated? What was the reason to choose this way over a more obvious 2D array syntax?). That is why the first construct is preferable in real-life scenarios.
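For the curious, a pair of minimal functions (hypothetical names) makes this easy to verify in a compiler explorer; with optimizations on, both typically produce identical machine code:

// [] is defined as pointer arithmetic plus a dereference, so an optimizing
// compiler emits the same code for both of these accesses.
int via_subscript(const int* A, int offset) { return A[offset]; }
int via_pointer(const int* A, int offset)   { return *(A + offset); }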
Depending on the quality of the compiler, the [] notation can result in faster code. The reason is that when you use raw pointer arithmetic, the compiler can't always be sure that pointer aliasing is not occurring, and this can preclude certain optimizations.
On the other hand, if the [] notation on a genuine 2D array is used, those concerns do not apply and the compiler can get more aggressive with applying optimizations.

Efficient use of boolean true and false in C++?

Would any compiler experts be able to comment on the efficient use of boolean values? Specifically, is the compiler able to optimize a std::vector<bool> to use minimal memory? Is there an equivalent data structure that would?
Back in the day, there were languages that had compilers that could compress an array of booleans to a representation of just one bit per boolean value. Perhaps the best that could be done for C++ is to use std::vector<char> to store the boolean values for minimal memory usage?
The use case here would be storing hundreds of millions of boolean values, where a single byte would save lots of space over 4 or more bytes per value and a single bit, even more.
See the std::vector reference, under "Specializations":
The standard library provides a specialization of std::vector for the type bool, which is optimized for space efficiency.
vector<bool> — space-efficient dynamic bitset (class template specialization)
and from "Working Draft C++, 2012-11-02"
23.3.7 Class vector [vector.bool]
1 To optimize space allocation, a specialization of vector for bool elements is provided:
template <class Allocator> class vector<bool, Allocator> {
...
};
3 There is no requirement that the data be stored as a contiguous allocation of bool values. A space-optimized representation of bits is recommended instead.
So there is no requirement, but only a recommendation, to store the bool values as bits.
std::vector<bool> is a template specialization that does what you are asking for.
You may also want to explore the standard std::bitset.
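For a sense of scale (figures are approximate and implementation-dependent):

#include <iostream>
#include <vector>

int main() {
    // 100 million flags at one bit each: roughly 12.5 MB of storage,
    // versus ~100 MB for std::vector<char> or ~400 MB for std::vector<int>.
    std::vector<bool> marks(100000000, false);
    marks[12345] = true;
    std::cout << marks[12345] << '\n'; // prints 1
}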
Note that vector<bool> is not a container, although it pretends to be one and provides iterators.
One day that may cause confusion and errors if you treat it like a normal container, e.g. when trying to take the address of an element.
You may consider std::bitset or boost::dynamic_bitset if you need to store 1 bit per Boolean value. These data structures do not pretend to be containers, so it is unlikely you will make any such errors when using them, especially in template code.
In what is widely considered to be a flaw in the standard, std::vector<bool> is specialised to use a single bit to represent each bool value.
If that happens to be what you are looking for, then just use it.
As a standard-agnostic way of guaranteeing efficient storage, you could create your own Bitvector class. Essentially for every 8 bool values you only need to allocate a single char and then you can store each bool in a single bit. You can then use bit shifting techniques in the accessors/mutators to store/retrieve your individual bits.
One such example is outlined in Ron Penton and André LaMothe's Data Structures for Game Programmers (which I'd also recommend as a general data structure reference). It's not too difficult to write your own, though, and there are probably further examples on the Internet.
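A minimal sketch of such a class (the names are illustrative, not taken from the book):

#include <cstddef>
#include <vector>

// Packs one boolean per bit, eight to a byte.
class Bitvector {
    std::vector<unsigned char> bytes_;
    std::size_t size_;
public:
    explicit Bitvector(std::size_t n) : bytes_((n + 7) / 8), size_(n) {}

    bool get(std::size_t i) const {
        return (bytes_[i / 8] >> (i % 8)) & 1u;
    }
    void set(std::size_t i, bool value) {
        if (value) bytes_[i / 8] |= static_cast<unsigned char>(1u << (i % 8));
        else       bytes_[i / 8] &= static_cast<unsigned char>(~(1u << (i % 8)));
    }
    std::size_t size() const { return size_; }
};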

Are recursive types really the only way to build non-contiguous arbitrary-size data structures?

I just noticed a question asking what recursive data types ("self-referential types") would be good for in C++ and I was tempted to boldly claim
It's the only way to construct data structures (more precisely, containers) that can accept arbitrarily large data collections without using contiguous memory areas.
That is, if you had no random-access arrays, you would require some means of referring (logically) to a type within that type (obviously, instead of having a MyClass* next member you could say void* next, but that would still point to a MyClass object or a derived type).
However, I am careful with absolute statements -- just because I couldn't think of something doesn't mean it's not possible, so am I overlooking something? Are there data structures that are neither organised using mechanisms similar to linked lists / trees nor using contiguous sequences exclusively?
Note: This is tagged both c++ and language-agnostic as I'd be interested specifically in the C++ language but also in theoretical aspects.
It's the only way to construct data structures (more precisely, containers) that can accept arbitrarily large data collections without using contiguous memory areas.
After contemplating this for a while, the statement seems to be correct. It is self-evident, in fact.
Suppose I have a collection of elements in non-contiguous memory. Also suppose that I'm currently at element e. Now the question is: how would I find the next element in the collection? Is there any way?
Given an element e from a collection, there are only two ways to compute the location of the next element:
If I assume that it is at offset sizeof(e) irrespective of what e is, then it means that the next element starts where the current element ends. But this implies that the collection is in contiguous memory, which is forbidden in this discussion.
The element e itself tells us the location of the next element. It may store the address itself, or an offset. Either way, it is using the concept of self-reference, which too is forbidden in this discussion.
As I see it, the underlying idea of both of these approaches is exactly the same: they both implement self-reference. The only difference is that in the former the self-reference is implicit, using sizeof(e) as the offset; this implicit self-reference is supported by the language itself and implemented by the compiler. In the latter it is explicit: everything is done by the programmer, since the offset (or pointer) is stored in the element itself.
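The two flavours of self-reference, side by side (a trivial sketch):

// Implicit self-reference: the successor lives at a fixed offset, which is
// exactly what forces contiguous storage.
int contiguous[3]; // &contiguous[i + 1] == &contiguous[i] + 1

// Explicit self-reference: each element records where its successor is,
// so elements can live anywhere in memory.
struct Node {
    int   value;
    Node* next;
};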
Hence, I don't see any third approach to implementing self-reference. And if not self-reference, what terminology would one even use to describe the computation of the location of the element following e?
So my conclusion is, your statement is absolutely correct.
The problem is that the dynamic allocator itself is managing contiguous storage. Think about the "tape" used for a Turing Machine, or the Von Neumann architecture. So to seriously consider the problem, you would likely need to develop a new computing model and new computer architecture.
If you think disregarding the contiguous memory of the underlying machine is okay, I am sure a number of solutions are possible. The first that comes to my mind is that each node of the container is marked with an identifier that has no relation to its position in memory. Then, to find the associated node, all of memory is scanned until the identifier is found. This isn't even particularly inefficient if given enough computing elements in a parallel machine.
Here's a sketch of a proof.
Given that a program must be of finite size, all types defined within the program must contain only finitely many members and reference only finitely many other types. The same holds for any program entrypoint and for any objects defined before program initialisation.
In the absence of contiguous arrays (which are the product of a type with a runtime natural number and are therefore unconstrained in size), all types must be arrived at through the composition of types as above; derivation of types (pointer-to-pointer-to-A) is still constrained by the size of the program. There are no facilities other than contiguous arrays to compose a runtime value with a type.
This is a little contentious; if e.g. mappings are considered primitive then one can approximate an array with a map whose keys are the natural numbers. Of course, any implementation of a map must use self-referential data structures (B-trees) or contiguous arrays (hash tables).
Next, if the types are non-recursive then any chain of types (A references B references C...) must terminate, and can be of no greater length than the number of types defined in the program. Thus the total size of data referenceable by the program is limited to the product of the sizes of each type multiplied by the number of names defined in the program (in its entrypoint and static data).
This holds even if functions are recursive (which strictly speaking breaks the prohibition on recursive types, since functions are types); the amount of data immediately visible at any one point in the program is still limited to the product of the sizes of each type multiplied by the number of names visible at that point.
An exception to this is if you store a "container" in a stack of recursive function calls; however such a program would not be able to traverse its data at random without unwinding the stack and having to reread data, which is something of a disqualification.
Finally, if it is possible to create types dynamically, the above proof does not hold; we could, for example, create a Lisp-style list structure where each cell is of a distinct type: cons<4>('h', cons<3>('e', cons<2>('l', cons<1>('l', cons<0>('o', nil))))). This is not possible in most statically typed languages, although it is possible in some dynamic languages, e.g. Python.
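For what it's worth, C++ can spell out such a per-cell-distinct-type list at compile time, since every template instantiation is a distinct type; but because the nesting depth is fixed when the program is compiled, this stays within the proof's bound rather than escaping it:

// Each cell's type encodes the type of its tail; no type refers to itself.
struct nil {};

template <typename Head, typename Tail>
struct cons {
    Head head;
    Tail tail;
};

// "hello" as a five-cell list whose depth is fixed at compile time.
cons<char, cons<char, cons<char, cons<char, cons<char, nil>>>>> hello =
    {'h', {'e', {'l', {'l', {'o', {}}}}}};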
The statement is not correct. The simple counterexample is std::deque in C++. The basic data structure (for the language-agnostic part) is a contiguous array of pointers to arrays of data. The actual data is stored in non-contiguous blocks, which are chained together through a contiguous array.
This might be borderline with respect to your requirements, depending on what without using contiguous memory areas means. I am using the interpretation that the stored data is not contiguous; but this data structure does depend on having an array for the intermediate layer.
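A simplified picture of that layout (real std::deque bookkeeping is more involved):

#include <cstddef>

// A deque-like two-level structure: a contiguous "map" of pointers to
// fixed-size blocks; the blocks themselves may be scattered in memory.
template <typename T, std::size_t BlockSize = 512>
struct DequeSketch {
    T**         blocks;      // contiguous array of block pointers
    std::size_t block_count; // number of allocated blocks

    T& at(std::size_t i) {
        return blocks[i / BlockSize][i % BlockSize]; // two-level lookup
    }
};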
I think a better phrasing would be:
It's the only way to construct data structures (more precisely, containers) that can accept arbitrarily large data collections without using memory areas of determinable address.
What I mean is that normal arrays use addr(idx) = idx*size + initial_addr to get the memory address of an element. However, if you change that to something like addr(idx) = idx*idx*size + initial_addr, then the elements of the data structure are not stored in contiguous memory areas; rather, there are large gaps between where elements are stored. Thus, it is not contiguous memory.

An integer hashing problem

I have a (C++) std::map<int, MyObject*> that contains a couple of million objects of type MyObject*. The maximum number of objects that I can have is around 100 million. The key is the object's id. During a certain process, these objects must somehow be marked (with a 0 or 1) as fast as possible. The marking cannot happen on the objects themselves (so I cannot introduce a member variable and use that for the marking process). Since I know the minimum and maximum id (1 to 100,000,000), the first thought that occurred to me was to use a std::bitset<100000000> and perform my marking there. This solves my problem and also makes it easier when marking processes run in parallel, since each uses its own bitset to mark things. But I was wondering what the solution could be if I had to use something other than a 0-1 marking, e.g. what could I use if I had to mark all objects with an integer number?
Is there some form of data structure that can deal with this kind of problem in a compact (memory-wise) manner, and also be fast? The main queries of interest are whether an object is marked, and what it was marked with.
Thank you.
Note: the std::map<int, MyObject*> cannot be changed. Whatever data structure I use must not involve the map itself.
How about making the value_type of your map a std::pair<bool, MyObject*> instead of MyObject*?
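A sketch of that suggestion (note, though, that the question's final constraint forbids changing the map; this applies only if that can be relaxed):

#include <map>
#include <utility>

struct MyObject; // a forward declaration suffices for storing pointers

std::map<int, std::pair<bool, MyObject*> > my_map; // mark and object together

// my_map[id].first  is the mark
// my_map[id].second is the object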
If you're not concerned with memory, then a std::vector<int> (or whatever suits your need in place of an int) should work.
If you don't like that, and you can't modify your map, then why not create a parallel map for the markers?
std::map<int, MyObject*> my_object_map;
std::map<int, int>       my_marker_map; // id -> marker
If you cannot modify the objects directly, have you considered wrapping the objects before you place them in the map? e.g.:
struct T_wrapper
{
    int marker;
    T*  p_x;  // T being your object type
};
std::map<int, T_wrapper> my_map;
If you're going to need to do lookups anyway, then this will be no slower.
EDIT: As @tenfour suggests in his/her answer, a std::pair may be a cleaner solution here, as it saves the struct definition. Personally, I'm not a big fan of std::pair, because you have to refer to everything as first and second rather than by meaningful names. But that's just me...
The most important question to ask yourself is "How many of these 100,000,000 objects might be marked (or remain unmarked)?" If the answer is smaller than roughly 100,000,000/(2*sizeof(int)), then just use another std::set or std::tr1::unordered_set (hash_set prior to TR1) to track which ones are marked (or remain unmarked).
Where does 2*sizeof(int) come from? It's a rough estimate of the per-item memory overhead of maintaining a node-based structure holding the items that will be marked.
If it is larger, then use std::bitset as you were about to. Its overhead is effectively 0% for the scale of quantity you need. You'll need about 13 megabytes of contiguous RAM to hold the bitset.
If you need to store a marking as well as presence, then use std::tr1::unordered_map with a key of Object* and a value of marker_type. And again, if the percentage of marked nodes is higher than the aforementioned threshold, then you'll want some sort of bitset to hold the number of bits needed, with suitable adjustments in size, at 12.5 megabytes per bit of marker width.
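A sketch of the sparse variant (written against std::unordered_map for brevity; substitute std::tr1::unordered_map if C++11 is unavailable, and an id key in place of a pointer works just as well):

#include <unordered_map>

// Sparse marking: only marked ids occupy memory, which pays off when few
// of the ~100M objects are ever marked.
std::unordered_map<int, int> marks; // object id -> integer marker

void mark(int id, int value) { marks[id] = value; }
bool is_marked(int id)       { return marks.find(id) != marks.end(); }
int  marker_of(int id)       { return marks.at(id); } // throws if unmarked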
A purpose-built object holding the bitset might be your best choice, given the clarification of the requirements.
Edit: this assumes that you've done proper time-complexity computations for what are acceptable solutions to you, since changing the base std::map structure is no longer permitted.
If you don't mind using hacks, take a look at the memory optimization used in Boost.MultiIndex. It can store one bit in the LSB of a stored pointer.
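The hack relies on alignment: if the pointee type's alignment is at least 2, the low bit of every valid pointer is zero and can carry a flag. A hypothetical sketch of the idea (this is not Boost.MultiIndex's actual code):

#include <cassert>
#include <cstdint>

// Stores one boolean flag in the least significant bit of a pointer.
template <typename T>
class TaggedPointer {
    std::uintptr_t bits_;
public:
    explicit TaggedPointer(T* p = nullptr)
        : bits_(reinterpret_cast<std::uintptr_t>(p)) {
        assert((bits_ & 1u) == 0 && "pointee must be at least 2-byte aligned");
    }
    T*   get() const      { return reinterpret_cast<T*>(bits_ & ~std::uintptr_t(1)); }
    bool flag() const     { return (bits_ & 1u) != 0; }
    void set_flag(bool f) { bits_ = (bits_ & ~std::uintptr_t(1)) | (f ? 1u : 0u); }
};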

Collate Hash Function

In the locale object there is a collate facet.
The collate facet has a hash method that returns a long.
http://www.cplusplus.com/reference/std/locale/collate/hash/
Two questions:
Does anybody know what hashing method is used?
I need a 32-bit value.
If my long is longer than 32 bits, does anybody know of techniques for folding the hash into a shorter version? I can see that, if done incorrectly, folding could generate lots of clashes (and though I can cope with clashes, since I need to take that into account anyway, I would prefer that they were minimized).
Note:
I can't use C++0x features
Boost may be OK.
No, nobody really knows -- it can vary from one implementation to another. The primary requirements are (N3092, §20.8.15):
For all object types Key for which there exists a specialization hash<Key>, the instantiation hash<Key> shall:
satisfy the Hash requirements (20.2.4), with Key as the function call argument type, the DefaultConstructible requirements (33), the CopyAssignable requirements (37),
be swappable (20.2.2) for lvalues,
provide two nested types result_type and argument_type which shall be synonyms for size_t and Key, respectively,
satisfy the requirement that if k1 == k2 is true, h(k1) == h(k2) is also true, where h is an object of type hash<Key> and k1 and k2 are objects of type Key.
and (N3092, §20.2.4):
A type H meets the Hash requirements if:
it is a function object type (20.8),
it satisfies the requirements of CopyConstructible and Destructible (20.2.1),
the expressions shown in the following table are valid and have the indicated semantics, and
it satisfies all other requirements in this subclause.
§20.8.15 covers the requirements on the result of hashing, §20.2.4 on the hash itself. As you can see, however, both are pretty general. The table that's mentioned basically covers three more requirements:
A hash function must be "pure" (i.e., the result depends only on the input, not any context, history, etc.)
The function must not modify the argument that's passed to it, and
It must not throw any exceptions.
Exact algorithms definitely are not specified though -- and despite the length, most of the requirements above are really just stating requirements that (at least to me) seem pretty obvious. In short, the implementation is free to implement hashing nearly any way it wants to.
If the implementation uses a reasonable hash function, there should be no bits in the hash value that have any special correlation with the input. So if the hash function gives you 64 "random" bits, but you only want 32 of them, you can just take the first/last/... 32 bits of the value as you please. Which ones you take doesn't matter since every bit is as random as the next one (that's what makes a good hash function).
So the simplest and yet completely reasonable way to get a 32-bit hash value would be:
int32_t value = hash(...);
(Of course this collapses groups of 4 billion values down to one, which looks like a lot, but that can't be avoided if there are four billion times as many source values as target values.)
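If you'd rather not bet on every bit being equally good (say, you don't fully trust the implementation's hash), a common alternative is to XOR-fold the upper half into the lower half instead of truncating. A sketch using Boost's integer typedefs, since C++0x is off the table:

#include <boost/cstdint.hpp> // C++03-friendly fixed-width integer types

// Fold a (possibly 64-bit) hash down to 32 bits, mixing in the upper half
// rather than discarding it.
boost::uint32_t fold_hash(unsigned long h)
{
    if (sizeof(unsigned long) > 4)
        h ^= h >> 31 >> 1; // i.e. h >> 32, written to stay well-defined
                           // even where unsigned long is 32 bits wide
    return static_cast<boost::uint32_t>(h);
}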