A compile-time ordering on types - C++

I've been looking for a way to get an ordering on types at compile time. This would be useful, for example, for implementing (efficient) compile-time type-sets.
One obvious way to do it would be if there were a way to map every type to a unique integer. An answer to a previous question on that topic succinctly captures why that's difficult, and it seems like it would apply equally to any other way of trying to get an ordering:
the compiler has no way of knowing all compilation units and the linker has no concept of a type
Indeed, the challenge to the compiler would be considerable: it has to ensure that, in any invocation, for any source file, it returns the same integer for a given type (equivalently, the same ordering between any two given types), while at the same time the universe of types is open and it has no knowledge of any types outside the current file. A hard problem.
The idea I had is that types have names. And by the laws of C++, as far as I know, the fully qualified name of a type must be unique across the entire program; otherwise you will get errors or undefined behaviour of some sort or another.
If two types have the same name, then they are the same type.
If two types are the same type, then either they have the same name, or they are typedefs for one another. The compiler has full knowledge of typedefs.
Names are strings, and strings have an ordering. So if I have it right, you could define a globally consistent ordering on types based on their names. More specifically, the ordering between any two types would be the ordering between the names of the types with the typedefs fully resolved. (Having a type behave differently from its typedefs would be problematic.)
Of course, standard C++ doesn't have any facilities for retrieving the names of types.
My questions are:
Do I have anything wrong? Are there any reasons this wouldn't, in theory, work?
Are there any compilers which give you access to the names of types (and ideally their typedef-resolved forms) at compile time as a language extension?
Is there any other way it could be done? Are there any compilers which do?
(I recognize that it's not polite to ask more than one question in the same question, but it seemed strange to post three separate questions with the same basic throat-clearing preceding them.)

the fully qualified name of a type must be unique across the entire program
But of course, that's only true if you consider separate anonymous namespaces in different translation units to have different names in some sense, and have some way to figure out what they are.
The only sense in which I'm aware they really do have different names is in mangled linker symbols; you may (depending on the compiler) be able to get that from type_info::name(), but that isn't guaranteed, is limited to types with RTTI, and in any case doesn't seem to be declared constexpr, so you can't use the value at compile time.
The ordering produced by type_info::before() naturally has the same limitations.
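As an aside (not part of the original answer): some compilers do expose enough to build such an ordering as an extension. The sketch below assumes GCC/Clang's __PRETTY_FUNCTION__ (MSVC has __FUNCSIG__) and C++17; the slicing offsets are compiler-specific and purely illustrative.
#include <string_view>

// Sketch: recover a type's spelled-out name at compile time from the
// __PRETTY_FUNCTION__ extension (GCC/Clang formatting assumed).
template <typename T>
constexpr std::string_view type_name() {
    std::string_view p = __PRETTY_FUNCTION__;
    // e.g. "... type_name() [with T = int]" on GCC, "... [T = int]" on Clang
    auto start = p.find("T = ") + 4;
    auto end = p.find_last_of(']');
    return p.substr(start, end - start);
}

// A globally consistent (if compiler-specific) ordering based on the names:
template <typename A, typename B>
constexpr bool type_less = type_name<A>() < type_name<B>();

static_assert(type_less<double, int>);   // "double" < "int" lexicographically
Note that because the template parameter is the already-resolved type, a typedef and its underlying type yield the same name here, which is the behaviour the question asks for.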
Out of interest, what are you trying to achieve with your compile-time type ordering?

Related

Practical meaning of std::strong_ordering and std::weak_ordering

I've been reading a bit about C++20's consistent comparison (i.e. operator<=>) but couldn't understand what the practical difference is between std::strong_ordering and std::weak_ordering (the same goes for the _equality versions, for that matter).
Other than being very descriptive about the substitutability of the type, does it actually affect the generated code? Does it add any constraints for how one could use the type?
Would love to see a real-life example that demonstrates this.
Does it add any constraints for how one could use the type?
One very significant constraint (which wasn't intended by the original paper) was the adoption by P0732 of strong_ordering as an indicator that a class type can be used as a non-type template parameter. weak_ordering isn't sufficient for this case due to how template equivalence has to work. This is no longer the case, as non-type template parameters no longer work this way (see P1907R0 for an explanation of the issues and P1907R1 for the wording of the new rules).
Generally, it's possible that some algorithms simply require weak_ordering but other algorithms require strong_ordering, so being able to annotate that on the type might mean a compile error (insufficiently strong ordering provided) instead of simply failing to meet the algorithm's requirements at runtime and hence just being undefined behavior. But all the algorithms in the standard library and the Ranges TS that I know of simply require weak_ordering. I do not know of one that requires strong_ordering off the top of my head.
Does it actually affect the generated code?
Outside of the cases where strong_ordering is required, or an algorithm explicitly chooses different behavior based on the comparison category, no.
There really isn't any reason to have std::weak_ordering. It's true that the standard describes operations like sorting in terms of a "strict" weak order, but there's an isomorphism between strict weak orderings and a totally ordered partition of the original set into incomparability equivalence classes. It's rare to encounter generic code that is interested both in the order structure (which considers each equivalence class to be one "value") and in some possibly finer notion of equivalence: note that when the standard library uses < (or <=>) it does not use == (which might be finer).
The usual example for std::weak_ordering is a case-insensitive string, since for instance printing two strings that differ only by case certainly produces different behavior despite their equivalence (under any operator). However, lots of types can have different behavior despite being ==: two std::vector<int> objects, for instance, might have the same contents and different capacities, so that appending to them might invalidate iterators differently.
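For concreteness, here is a minimal sketch (not from the original answer) of that case-insensitive string; its operator<=> returns std::weak_ordering because equivalent strings are not substitutable:
#include <algorithm>
#include <cctype>
#include <compare>
#include <cstddef>
#include <string>

struct CaseInsensitiveString {
    std::string s;
    friend std::weak_ordering operator<=>(const CaseInsensitiveString& a, const CaseInsensitiveString& b) {
        std::size_t n = std::min(a.s.size(), b.s.size());
        for (std::size_t i = 0; i < n; ++i) {
            int ca = std::tolower(static_cast<unsigned char>(a.s[i]));
            int cb = std::tolower(static_cast<unsigned char>(b.s[i]));
            if (ca != cb)
                return ca < cb ? std::weak_ordering::less : std::weak_ordering::greater;
        }
        return a.s.size() <=> b.s.size();   // strong_ordering converts to weak_ordering
    }
};

// "abc" and "ABC" compare equivalent here, yet printing them produces different output.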
The simple fact is that the "equality" implied by std::strong_ordering::equivalent but not by std::weak_ordering::equivalent is irrelevant to the very code that stands to benefit from it, because generic code doesn't depend on the implied behavioral changes, and non-generic code doesn't need to distinguish between the ordering types because it knows the rules for the type on which it operates.
The standard attempts to give the distinction meaning by talking about "substitutability", but that is inevitably circular because it can sensibly refer only to the very state examined by the comparisons. This was discussed prior to publishing C++20, but (perhaps for the obvious reasons) not much of the planned further discussion has taken place.

Will C++ compiler generate code for each template type?

I have two questions about templates in C++. Let's imagine I have written a simple List and now I want to use it in my program to store pointers to different object types (A*, B* ... ALot*). My colleague says that for each type there will be generated a dedicated piece of code, even though all pointers in fact have the same size.
If this is true, can somebody explain to me why? For example, in Java generics have the same purpose as templates for pointers in C++. Generics are only used for pre-compile type checking and are stripped down before compilation. And of course the same byte code is used for everything.
Second question is, will dedicated code also be generated for char and short (considering that they both have the same size and there are no specializations)?
If this makes any difference, we are talking about embedded applications.
I have found a similar question, but it did not completely answer my question: Do C++ template classes duplicate code for each pointer type used?
Thanks a lot!
I have two questions about templates in C++. Let's imagine I have written a simple List and now I want to use it in my program to store pointers to different object types (A*, B* ... ALot*). My colleague says that for each type there will be generated a dedicated piece of code, even though all pointers in fact have the same size.
Yes, this is equivalent to having both functions written.
Some linkers will detect the identical functions and eliminate them. Some libraries are aware that their linker doesn't have this feature, and factor out common code into a single implementation, leaving only a casting wrapper around the common code. I.e., a std::vector<T*> specialization may forward all work to a std::vector<void*> and then cast on the way out.
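A sketch of that casting-wrapper idea, with invented names (this is one way a library might share code across pointer instantiations, not what std::vector is required to do):
#include <cstddef>
#include <vector>

template <typename T>
class PtrList;                        // primary template intentionally left undefined here

template <typename T>
class PtrList<T*> {                   // partial specialization for pointer types
public:
    void push_back(T* p) { impl_.push_back(p); }
    T* operator[](std::size_t i) const { return static_cast<T*>(impl_[i]); }
    std::size_t size() const { return impl_.size(); }
private:
    std::vector<void*> impl_;         // PtrList<A*> and PtrList<B*> share this code
};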
Now, comdat folding is delicate: it is relatively easy to make functions you think are identical, but end up not being the same, so two functions are generated. As a toy example, you could go off and print the typename via typeid(x).name(). Now each version of the function is distinct, and they cannot be eliminated.
In some cases, you might do something like this thinking that it is a run time property that differs, and hence identical code will be created, and the identical functions eliminated -- but a smart C++ compiler might figure out what you did, use the as-if rule and turn it into a compile-time check, and block not-really-identical functions from being treated as identical.
If this is true, can somebody explain to me why? For example, in Java generics have the same purpose as templates for pointers in C++. Generics are only used for pre-compile type checking and are stripped down before compilation. And of course the same byte code is used for everything.
No, they aren't. Generics are roughly equivalent to the C++ technique of type erasure, such as what std::function<void()> does to store any callable object. In C++, type erasure is often done via templates, but not all uses of templates are type erasure!
The things that C++ does with templates that are not in essence type erasure are generally impossible to do with Java generics.
In C++, you can create a type erased container of pointers using templates, but std::vector doesn't do that -- it creates an actual container of pointers. The advantage to this is that all type checking on the std::vector is done at compile time, so there doesn't have to be any run time checks: a safe type-erased std::vector may require run time type checking and the associated overhead involved.
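A small illustration of the difference (illustrative code, not from the original answer):
#include <functional>
#include <iostream>
#include <vector>

int main() {
    // Type erasure, the closest analogue of Java generics: one concrete
    // container type holds any callable behind a uniform interface.
    std::vector<std::function<void()>> callbacks;
    callbacks.push_back([] { std::cout << "hello\n"; });
    callbacks.front()();
    // Templates without erasure: each instantiation is a distinct type,
    // checked entirely at compile time, with no runtime dispatch or casts.
    std::vector<int*> int_ptrs;       // not the same type as std::vector<double*>
    std::vector<double*> double_ptrs;
    return 0;
}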
Second question is, will dedicated code also be generated for char and short (considering that they both have the same size and there are no specializations)?
They are distinct types. I can write code that will behave differently with a char or short value. As an example:
std::cout << x << "\n";
with x being a short, this prints an integer whose value is x; with x being a char, it prints the character corresponding to x.
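Made into a complete (hypothetical) example, the same template body does observably different things for char and short:
#include <iostream>

template <typename T>
void print(T x) { std::cout << x << "\n"; }

int main() {
    print<short>(65);   // prints "65"
    print<char>(65);    // prints "A"
}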
Now, almost all template code exists in header files, and is implicitly inline. While inline doesn't mean what most folk think it means, it does mean that the compiler can hoist the code into the calling context easily.
If this makes any difference, we are talking about embedded applications.
What really makes a difference is what your particular compiler and linker is, and what settings and flags they have active.
The answer is maybe. In general, each instantiation of a template is a unique type, with a unique implementation, and will result in a totally independent instance of the code. Merging the instances is possible, but would be considered "optimization" (under the "as if" rule), and this optimization isn't widespread.
With regards to comparisons with Java, there are several points to keep in mind:
C++ uses value semantics by default. An std::vector, for example, will actually insert copies. And whether you're copying a short or a double does make a difference in the generated code. In Java, short and double will be boxed, and the generated code will clone a boxed instance in some way; cloning doesn't require different code, since it calls a virtual function of Object, but physically copying does.
C++ is far more powerful than Java. In particular, it allows comparing things like the addresses of functions, and it requires that the functions in different instantiations of templates have different addresses. Usually, this is not an important point, and I can easily imagine a compiler with an option which tells it to ignore this point, and to merge instances which are identical at the binary level. (I think VC++ has something like this.)
Another issue is that the implementation of a template in C++ must be present in the header file. In Java, of course, everything must be present, always, so this issue affects all classes, not just templates. This is, of course, one of the reasons why Java is not appropriate for large applications. But it means that you don't want any complicated functionality in a template; doing so loses one of the major advantages of C++, compared to Java (and many other languages). In fact, it's not rare, when implementing complicated functionality in templates, to have the template inherit from a non-template class which does most of the implementation in terms of void*. While implementing large blocks of code in terms of void* is never fun, it does have the advantage of offering the best of both worlds to the client: the implementation is hidden in compiled files, invisible in any way, shape or manner to the client.
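A sketch of that last pattern, with invented names; in practice ListImpl's member functions would be defined once in a compiled source file, and only the thin template wrapper would live in the header:
#include <cstddef>
#include <vector>

class ListImpl {                      // non-template implementation, compiled once
public:
    void push_back(void* p) { data_.push_back(p); }
    void* at(std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }
private:
    std::vector<void*> data_;
};

template <typename T>
class List : private ListImpl {       // header-only casting wrapper
public:
    void push_back(T* p) { ListImpl::push_back(p); }
    T* at(std::size_t i) const { return static_cast<T*>(ListImpl::at(i)); }
    using ListImpl::size;
};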

Memory management for types in complex languages

I've come across a slight problem when writing memory management for the internal representation of types in a compiler for statically typed, complex languages. Consider a simple snippet in C++ which easily demonstrates a type that refers to itself.
class X {
    void f(const X&) {}
};
Types can have nearly infinitely complex relationships to each other. So, as a compiler process, how do you make sure that they are properly collected?
So far, I've decided that garbage collection might be the right way to go (which I wouldn't be too happy with, because I want to write the compiler in C++), or alternatively, to just leave the types alone and never collect them for the life of the compile phase in which they are needed (which has a very fixed lifetime) and then collect them all afterwards. The problem with that is that if you had a lot of complex types, you could lose a lot of memory that way.
Memory management is easy: just have a table from type name to type descriptor for each declaration scope. Types are uniquely identified by name, no matter how complex the nesting is. Even a recursive type is still only a single type. As tp1 correctly says, you typically perform multiple passes to fill in all the blanks. For instance, you might check that a type name is known in the first pass, then compute all the links, and later on compute the type.
Keep in mind that languages like C don't have a really complex type system -- even though they have pointers (which allow for recursive types), there is not much type computation going on.
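A minimal sketch of that per-scope table, with invented names; descriptors are owned by the scope and cross-reference each other by name, so recursive types don't create ownership cycles:
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TypeDescriptor {
    std::string name;
    std::vector<std::string> referenced_types;   // references by name, not by owning pointer
};

struct Scope {
    Scope* parent = nullptr;
    std::map<std::string, std::unique_ptr<TypeDescriptor>> types;
    TypeDescriptor* lookup(const std::string& name) const {
        for (const Scope* s = this; s != nullptr; s = s->parent) {
            auto it = s->types.find(name);
            if (it != s->types.end()) return it->second.get();
        }
        return nullptr;                          // not declared in any enclosing scope
    }
};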
I think you can remove the cycles from the dependency graph by using separate objects to represent declarations and definitions. Assuming a type system similar to C++, you will then have a hierarchical dependency:
Function definitions depend on type definitions and function declarations
Type definitions depend on function and type declarations (and definitions of contained types)
Function declarations depend on type declarations
In your example, the dependency graph is f_def -> X_def -> f_decl -> X_decl.
With no cycles in the graph, you can manage objects using simple reference counting.
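A sketch of that decl/def split (invented names). Because each kind of node only points "downwards" in the hierarchy above, the shared_ptr graph is acyclic and plain reference counting suffices:
#include <memory>
#include <string>
#include <vector>

struct TypeDecl { std::string name; };

struct FuncDecl {                                        // depends on type declarations
    std::string name;
    std::vector<std::shared_ptr<TypeDecl>> param_types;
};

struct TypeDef {                                         // depends on function and type declarations
    std::shared_ptr<TypeDecl> decl;
    std::vector<std::shared_ptr<FuncDecl>> member_funcs;
    std::vector<std::shared_ptr<TypeDecl>> member_types;
};

struct FuncDef {                                         // depends on type definitions and function declarations
    std::shared_ptr<FuncDecl> decl;
    std::vector<std::shared_ptr<TypeDef>> used_types;
};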

Does a compiler collapse classes which are identical in their structure?

I hope this isn't a duplicate of a question itself, but the search terms are so ambiguous, I can't think of anything better.
Say we have two classes:
class FloatRect
{
    float x, y, width, height;
};
and somewhere else
class FloatBox
{
    float top, left, bottom, right;
};
From a practical standpoint, they're the same, so does the compiler treat them both as some sort of typedef?
Or will it produce two separate units of code?
I'm curious because I'd like to go beyond typedefs and make a few variants of a type to improve readability.
I don't want needless duplication, though...
This is completely implementation specific.
For example I can use CLang / LLVM to illustrate both point of view at once:
Clang is the C++ front-end; it uses two distinct types to resolve function calls etc., and treats them as completely different values.
LLVM is the optimizer backend; it doesn't care (yet) about names, only about structural representation, and will therefore collapse them into a single type... or even remove the type definition entirely if it is unused.
If the question is whether introducing a similarly laid-out class creates overhead, then the answer is no, so write the classes that you need.
Note: the same happens for functions, i.e. the optimizer can merge blocks of functions that are identical to get tighter code; that's not a reason to copy/paste, though.
They are totally unrelated classes as far as the compiler is concerned.
If they are just POD C-structs, it won't actually generate any real code for them as such. (Yes, there is a silent assignment operator and some other functions, but I doubt any code will actually be compiled for them; they will just be inlined if they are used.)
Since the classes you use as samples are only relevant during compilation, there's nothing to duplicate or collapse. At runtime, the member variables are simply accessed as "the value at offset N".
This is, of course, hugely implementation-specific.
Any internal collapse here would be completely internal to the mechanism of the compiler, and would not have an effect on the produced translated code.
I would imagine it's very unlikely that this is the case, as I can think of no benefit and several ways in which this would really complicate matters. I can't present any evidence, though.
No. As they are literally two different types.
The compiler must treat them that way.
There is no magic merging going on.
No they are not treated as typedefs, because they are different types and can for example be used for overloading functions.
On the other hand, the types have no code in them so there will be nothing to duplicate.
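To make that last point concrete, a small illustrative example: identical layout does not make the two classes interchangeable, and overload resolution keeps them apart:
#include <iostream>

class FloatRect { float x, y, width, height; };
class FloatBox { float top, left, bottom, right; };

void describe(const FloatRect&) { std::cout << "rect\n"; }
void describe(const FloatBox&) { std::cout << "box\n"; }

int main() {
    FloatRect r{};
    FloatBox b{};
    describe(r);   // picks the FloatRect overload
    describe(b);   // picks the FloatBox overload
}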

unique synthesised name

I would like to generate various data types in C++ with unique deterministic names. For example:
struct struct_int_double { int mem0; double mem1; };
At present my compiler synthesises names using a counter, which means the names don't agree when compiling the same data type in distinct translation units.
Here's what won't work:
Using the ABI mangled_name function, because it already depends on structs having unique names. It might work in a C++11-compliant ABI by pretending the struct is anonymous?
Templates, e.g. struct2, because templates don't work with recursive types.
A complete mangling, because it gives names which are way too long (hundreds of characters!).
Apart from a global registry (YUK!) the only thing I can think of is to first create a unique long mangled name, and then use a digest or hash function to shorten it (and hope there are no clashes).
Actual problem: to generate libraries which can be called where the types are anonymous, e.g. tuples, sum types, function types.
Any other ideas?
EDIT: Additional description of the recursive type problem. Consider defining a linked list like this:
template<class T>
typedef pair<list<T>*, T> list;
This is actually what is required. It doesn't work for two reasons: first, you can't template a typedef. [NO, you can NOT use a template class with a typedef in it, it doesn't work] Second, you can't pass in list* as an argument because it isn't defined yet. In C without polymorphism you can do it:
struct list_int { struct list_int *next; int value; };
There are several workarounds. For this particular problem you can use a variant of the Barton-Nackman trick, but it doesn't generalise.
There is a general workaround, first shown me by Gabrielle des Rois, using a template with open recursion, and then a partial specialisation to close it. But this is extremely difficult to generate and would probably be unreadable even if I could figure out how to do it.
There's another problem doing variants properly too, but that's not directly related (it's just worse because of the stupid restriction against declaring unions with constructible types).
Therefore, my compiler simply uses ordinary C types. It has to handle polymorphism anyhow: one of the reasons for writing it was to bypass the problems of C++ type system including templates. This then leads to the naming problem.
Do you actually need the names to agree? Just define the structs separately, with different names, in the different translation units and reinterpret_cast<> where necessary to keep the C++ compiler happy. Of course that would be horrific in hand-written code, but this is code generated by your compiler, so you can (and I assume do) perform the necessary static type checks before the C++ code is generated.
If I've missed something and you really do need the type names to agree, then I think you already answered your own question: Unless the compiler can share information between the translation of multiple translation units (through some global registry), I can't see any way of generating unique, deterministic names from the type's structural form except the obvious one of name-mangling.
As for the length of names, I'm not sure why it matters? If you're considering using a hash function to shorten the names then clearly you don't need them to be human-readable, so why do they need to be short?
Personally I'd probably generate semi-human-readable names, in a similar style to existing name-mangling schemes, and not bother with the hash function. So, instead of generating struct_int_double you might generate sid (struct, int, double) or si32f64 (struct, 32-bit integer, 64-bit float) or whatever. Names like that have the advantage that they can still be parsed directly (which seems like it would be pretty much essential for debugging).
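A toy sketch of that kind of compact, still-parsable scheme (the encoding and names are invented for illustration):
#include <string>
#include <vector>

std::string mangle_struct(const std::vector<std::string>& field_types) {
    std::string out = "s";                                   // 's' for struct
    for (const std::string& t : field_types) {
        if (t == "int") out += "i32";
        else if (t == "double") out += "f64";
        else out += std::to_string(t.size()) + t;            // length-prefixed fallback
    }
    return out;
}

// mangle_struct({"int", "double"}) == "si32f64"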
Edit
Some more thoughts:
Templates: I don't see any real advantage in generating template code to get around this problem, even if it were possible. If you're worried about hitting symbol name length limits in the linker, templates can't help you, because the linker has no concept of templates: any symbols it sees will be mangled forms of the template structure generated by the C++ compiler and will have exactly the same problem as long mangled names generated directly by the felix compiler.
Any types that have been named in felix code should be retained and used directly (or nearly directly) in the generated C++ code. I would think there are practical (soft) readability/maintainability constraints on the complexity of anonymous types used in felix code, which are the only ones you need to generate names for. I assume your "variants" are discriminated unions, so each component part must have a name (the tag) defined in the felix code, and again these names can be retained. (I mentioned this in a comment, but since I'm editing my answer I might as well include it)
Reducing mangled-name length: Running a long mangled name through a hash function sounds like the easiest way to do it, and the chance of collisions should be acceptable as long as you use a good hash function and retain enough bits in your hashed name (and your alphabet for encoding the hashed name has 37 characters, so a full 160-bit sha1 hash could be written in about 31 characters). The hash function idea means that you won't be able to get directly back from a hashed name to the original name, but you might never need to do that. And you could dump out an auxiliary name-mapping table as part of the compilation process I guess (or re-generate the name from the C struct definition maybe, where it's available). Alternatively, if you still really don't like hash functions, you could probably define a reasonably compact bit-level encoding (then write that in the 37-character identifier alphabet), or even run some general purpose compression algorithm on that bit-level encoding. If you have enough felix code to analyse you could even pre-generate a fixed compression dictionary. That's stark raving bonkers of course: just use a hash.
Edit 2: Sorry, brain failure -- sha-1 digests are 160 bits, not 128.
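A sketch of the hashing idea, using 64-bit FNV-1a purely as a stand-in (the answer suggests something stronger, such as SHA-1) and the 37-character identifier alphabet mentioned above:
#include <cstdint>
#include <string>

std::string shorten(const std::string& long_mangled_name) {
    std::uint64_t h = 0xcbf29ce484222325ull;                 // FNV-1a offset basis
    for (unsigned char c : long_mangled_name) {
        h ^= c;
        h *= 0x100000001b3ull;                               // FNV-1a prime
    }
    static const char alphabet[] = "0123456789abcdefghijklmnopqrstuvwxyz_";   // 37 symbols
    std::string out = "t_";                                  // keep the identifier from starting with a digit
    while (h != 0) {
        out += alphabet[h % 37];
        h /= 37;
    }
    return out;
}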
PS. Not sure why this question was down-voted -- it seems reasonable to me, although some more context about this compiler you're working on might help.
I don't really understand your problem.
template<typename T>
struct SListItem
{
    SListItem* m_prev;
    SListItem* m_next;
    T m_value;
};

int main()
{
    SListItem<int> sListItem;
}