Why does std::filesystem provide so many non-member functions?

Consider for example file_size. To get the size of a file we will be using
std::filesystem::path p = std::filesystem::current_path();
// ... usual "does this exist && is this a file" boilerplate
auto n = std::filesystem::file_size(p);
Nothing wrong with that, if it were plain ol' C, but having been taught that C++ is an OO language [I do know it's multi-paradigm, apologies to our language lawyers :-)] that just feels so ... imperative (shudder) to me, where I have come to expect the object-ish
auto n = p.file_size();
instead. The same holds for other functions, such as resize_file, remove and probably more.
Do you know of any rationale why Boost and consequently std::filesystem chose this imperative style instead of the object-ish one? What is the benefit? Boost mentions the rule (at the very bottom), but no rationale for it.
I was thinking about inherent issues such as p's state after remove(p), or error flags (overloads with an additional argument), but neither approach solves these any less elegantly than the other.
You can observe a similar pattern with iterators, where nowadays we can (are supposed to?) do begin(it) instead of it.begin(), but here I think the rationale was to be more in line with the non-modifying next(it) and such.

There are a couple of good answers already posted, but they do not get to the heart of the matter: all other things being equal, if you can implement something as a free, non-friend function, you always should.
Why?
Because free, non-friend functions do not have privileged access to state. Testing classes is much harder than testing functions because you have to convince yourself that the class' invariants are maintained no matter which member functions are called, or even which combinations of member functions. The more member/friend functions you have, the more work you have to do.
Free functions can be reasoned about and tested standalone. Because they don't have privileged access to class state, they cannot possibly violate any class invariants.
I don't know the details of what invariants path has and what its privileged access allows, but obviously they were able to implement a lot of functionality as free functions, and they made the right choice in doing so.
See Scott Meyers' brilliant article on this topic, which gives the "algorithm" for deciding whether to make a function a member or not.
Here's Herb Sutter bemoaning the massive interface of std::string. Why? Because, much of string's interface could have been implemented as free functions. It may be a bit more unwieldy to use on occasion, but it's easier to test, reason about, improves encapsulation and modularity, opens opportunities up for code reuse that were not there before, etc.
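To make the encapsulation argument concrete, here is a minimal sketch. IntBag and contains are hypothetical names invented for illustration, not anything from std::filesystem or std::string. The class keeps a small member interface, and contains is a non-member, non-friend function built only on that public interface, so it cannot break any invariant the class maintains.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class with a deliberately small member interface.
class IntBag {
public:
    void add(int value) { items_.push_back(value); }
    std::size_t size() const { return items_.size(); }
    int at(std::size_t i) const { return items_.at(i); }

private:
    std::vector<int> items_;  // only the three members above can touch this
};

// Non-member, non-friend: uses only the public interface, so it has no
// privileged access and can be tested without looking inside IntBag.
bool contains(const IntBag& bag, int value) {
    for (std::size_t i = 0; i < bag.size(); ++i) {
        if (bag.at(i) == value) return true;
    }
    return false;
}

// Small usage demo; call it from main() to run the checks.
void demo_contains() {
    IntBag bag;
    bag.add(3);
    bag.add(7);
    assert(contains(bag, 7));
    assert(!contains(bag, 4));
}
```

If items_ later changes to a different container, contains keeps working unchanged as long as size() and at() keep their meaning.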

The Filesystem library has a very clear separation between the filesystem::path type, which represents an abstract path name (that doesn't even have to be the name of a file that exists) and operations that access the actual physical filesystem, i.e. read+write data on disks.
You even pointed to the explanation of that:
The design rule is that purely lexical operations are supplied as class path member functions, while operations performed by the operating system are provided as free functions.
This is the reason.
It's theoretically possible to use a filesystem::path on a system with no disks. The path class just holds a string of characters and allows manipulating that string, converting between character sets and using some rules that define the structure of filenames and pathnames on the host OS. For example it knows that directory names are separated by / on POSIX systems and by \ on Windows. Manipulating the string held in a path is a "lexical operation", because it just performs string manipulation.
The non-member functions that are known as "filesystem operations" are entirely different. They don't just work with an abstract path object that is just a string of characters, they perform the actual I/O operations that access the filesystem (stat system calls, open, readdir etc.). These operations take a path argument that names the files or directories to operate on, and then they access the real files or directories. They don't just manipulate strings in memory.
Those operations depend on the API provided by the OS for accessing files, and they depend on hardware that might fail in completely different ways to in-memory string manipulations. Disks might be full, or might get unplugged before an operation completes, or might have hardware faults.
Looked at like that, of course file_size isn't a member of path, because it's nothing to do with the path itself. The path is just a representation of a filename, not of an actual file. The function file_size looks for a physical file with the given name and tries to read its size. That's not a property of the file name, it's a property of a persistent file on the filesystem. Something that exists entirely separately from the string of characters in memory that holds the name of a file.
Put another way, I can have a path object that contains complete nonsense, like filesystem::path p("hgkugkkgkuegakugnkunfkw") and that's fine. I can append to that path, or ask if it has a root directory etc. But I can't read the size of such a file if it doesn't exist. I can have a path to files that do exist, but I don't have permission to access, like filesystem::path p("/root/secret_admin_files.txt"); and that's also fine, because it's just a string of characters. I'd only get a "permission denied" error when I tried to access something in that location using the filesystem operation functions.
Because path member functions never touch the filesystem they can never fail due to permissions, or non-existent files. That's a useful guarantee.
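A short sketch of that split, assuming a C++17 standard library. The helper names make_nonsense_path and try_file_size are made up for this illustration. The first function only does lexical work on the path; the second asks the OS and therefore has to deal with failure, here through the std::error_code overload of file_size.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Lexical only: manipulates the string held by the path.
// Nothing with this name needs to exist on disk.
fs::path make_nonsense_path() {
    fs::path p("hgkugkkgkuegakugnkunfkw");
    p /= "data.txt";  // appends a component, still no filesystem access
    return p;
}

// Filesystem operation: queries the OS, so it can fail.
// Returns false (and leaves size_out untouched) if the size cannot be read.
bool try_file_size(const fs::path& p, std::uintmax_t& size_out) {
    std::error_code ec;
    const std::uintmax_t n = fs::file_size(p, ec);
    if (ec) return false;
    size_out = n;
    return true;
}

// Usage demo; assumes no file with the nonsense name exists in the
// current directory.
void demo_path_vs_filesystem() {
    const fs::path p = make_nonsense_path();
    assert(p.filename() == "data.txt");   // lexical query, cannot fail
    assert(!p.has_root_directory());      // lexical query, cannot fail
    std::uintmax_t size = 0;
    assert(!try_file_size(p, size));      // needs a real file, so it fails
}
```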
You can observe a similar pattern with iterators, where nowadays we can (are supposed to?) do begin(it) instead of it.begin(), but here I think the rationale was to be more in line with the non-modifying next(it) and such.
No, it was because it works equally well with arrays (which can't have member functions) and class types. If you know the range-like thing you are dealing with is a container not an array then you can use x.begin() but if you're writing generic code and don't know whether it's a container or an array then std::begin(x) works in both cases.
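For example, a generic function written against std::begin and std::end accepts both built-in arrays and containers. sum_all is a hypothetical name used only for this sketch.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Works for anything std::begin/std::end accept: built-in arrays as well
// as class types with begin()/end() members.
template <class Range>
long sum_all(const Range& r) {
    long total = 0;
    for (auto it = std::begin(r); it != std::end(r); ++it) {
        total += *it;
    }
    return total;
}

// Usage demo: the same template handles an array and a vector.
void demo_sum_all() {
    int raw[] = {1, 2, 3};
    std::vector<int> vec{4, 5};
    assert(sum_all(raw) == 6);  // r.begin() would not compile for an array
    assert(sum_all(vec) == 9);
}
```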
The reasons for both these things (the filesystem design and the non-member range access functions) are not some anti-OO preference; they're far more sensible, practical reasons. It would have been poor design to base either of them on what feels better to some people who like OO, or what feels better to people who don't like OO.
Also, there are things you can't do when everything's a member function:
struct ConvertibleToPath {
operator const std::filesystem::path& () const;
// ...
};
ConvertibleToPath c;
auto n = std::filesystem::file_size(c); // works fine
But if file_size was a member of path:
c.file_size(); // wouldn't work
static_cast<const std::filesystem::path&>(c).file_size(); // yay, feels object-ish!

Several reasons (somewhat speculative though, I don't follow the standardization process very closely):
Because it's based on boost::filesystem, which is designed that way. Now, you could ask "Why is boost::filesystem designed that way?", which would be a fair question, but given that it was, and that it's seen a lot of mileage the way it is, it was accepted into the standard with very few changes. So were some other Boost constructs (although sometimes there are some changes, under the hood mostly).
A common principle when designing classes is "if a function doesn't need access to a class' protected/private members, and can instead use existing members - you don't make it a member as well." While not everyone subscribes to that - it seems the designers of boost::filesystem do.
See a discussion of (and an argument for) this in the context of std::string, a "monolith" class with a zillion methods, by C++ luminary Herb Sutter, in Guru of the Week #84.
It was expected that in C++17 we might already have Uniform Call Syntax (see Bjarne Stroustrup's highly readable proposal). If that had been accepted into the standard, calling
p.file_size();
would have been equivalent to calling
file_size(p);
so you could have chosen whatever you like. Basically.

Just in addition to what others already stated.
One of the reasons why people are unhappy with the "non-member" approach is the need to type std::filesystem:: in front of every API call, or to use using-directives.
But actually you don't have to, and simply skipping namespace for API call like this:
#include <iostream>
#include <filesystem>
int main()
{
auto p = std::filesystem::path{"/bin/cat"};
//notice file_size below has no namespace qualifiers
std::cout << "Binary size for your /bin/cat is " << file_size(p);
}
works perfectly fine, because function names are also looked up in the namespaces of their arguments thanks to ADL (argument-dependent lookup).
(live sample https://wandbox.org/permlink/JrFz8FJG3OdgRwg9)

Related

How is is_standard_layout useful?

From what I understand, standard layout allows three things:
Empty base class optimization
Backwards compatibility with C with certain pointer casts
Use of offsetof
Now, included in the library is the is_standard_layout predicate metafunction, but I can't see much use for it in generic code as those C features I listed above seem extremely rare to need checking in generic code. The only thing I can think of is using it inside static_assert, but that is only to make code more robust and isn't required.
How is is_standard_layout useful? Are there any things which would be impossible without it, thus requiring it in the standard library?
General response
It is a way of validating assumptions. You wouldn't want to write code that assumes standard layout if that wasn't the case.
C++11 provides a bunch of utilities like this. They are particularly valuable for writing generic code (templates) where you would otherwise have to trust the client code to not make any mistakes.
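As a sketch of what "validating assumptions" can look like: the helper below (first_member is a hypothetical name, as are the two structs) relies on the guarantee that a pointer to a standard-layout object can be converted to a pointer to its first member. The static_assert turns a silent wrong assumption into a compile error. Note that nothing here checks that First really is the type of the first member; that part is still on the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

struct PacketHeader {      // hypothetical C-compatible struct
    unsigned magic;
    unsigned short version;
};

struct WithVirtual {       // has a virtual function: not standard layout
    virtual ~WithVirtual() = default;
    int x = 0;
};

// Reinterprets an object as its first member. Only meaningful for
// standard-layout types, so the assumption is checked at compile time.
// The caller must still pass the real type of the first member as First.
template <class First, class T>
const First* first_member(const T& obj) {
    static_assert(std::is_standard_layout<T>::value,
                  "first_member requires a standard-layout type");
    return reinterpret_cast<const First*>(&obj);
}

static_assert(std::is_standard_layout<PacketHeader>::value,
              "PacketHeader is expected to be standard layout");
static_assert(!std::is_standard_layout<WithVirtual>::value,
              "a class with virtual functions is not standard layout");
static_assert(offsetof(PacketHeader, magic) == 0,
              "the first member of a standard-layout struct sits at offset 0");

// Usage demo. first_member<int>(WithVirtual{}) would be rejected by the
// static_assert instead of compiling into something fragile.
void demo_first_member() {
    PacketHeader h{0xCAFEu, 2};
    assert(*first_member<unsigned>(h) == 0xCAFEu);
}
```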
Notes specific to is_standard_layout
It looks to me like the (pseudo code) definition of is_pod would roughly be...
// note: applied recursively to all members
bool is_pod(T) { return is_standard_layout(T) && is_trivial(T); }
So, you need to know is_standard_layout in order to implement is_pod. Given that, we might as well expose is_standard_layout as a tool available to library developers. Also of note: if you have a use-case for is_pod, you might want to consider the possibility that is_standard_layout might actually be a better (more accurate) choice in that case, since POD is essentially a subset of standard layout.
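To see that the two halves of that definition really are independent, here is a small sketch. is_pod_like and the three example types are hypothetical names; is_pod_like simply spells out the conjunction from the pseudo code rather than calling std::is_pod.

```cpp
#include <type_traits>

// Spells out the pseudo code: standard layout AND trivial.
template <class T>
struct is_pod_like
    : std::integral_constant<bool, std::is_standard_layout<T>::value &&
                                       std::is_trivial<T>::value> {};

// Both properties hold.
struct Plain {
    int a;
    double b;
};

// Standard layout, but not trivial because of the user-provided constructor.
struct UserCtor {
    UserCtor() : a(0) {}
    int a;
};

// Trivial, but not standard layout because the data members have
// different access control.
class MixedAccess {
public:
    int a;
    int get_b() const { return b; }

private:
    int b;
};

static_assert(is_pod_like<Plain>::value, "Plain has both properties");

static_assert(std::is_standard_layout<UserCtor>::value, "layout is fine");
static_assert(!std::is_trivial<UserCtor>::value, "but it is not trivial");
static_assert(!is_pod_like<UserCtor>::value, "so it is not POD-like");

static_assert(std::is_trivial<MixedAccess>::value, "trivial");
static_assert(!std::is_standard_layout<MixedAccess>::value,
              "but not standard layout");
static_assert(!is_pod_like<MixedAccess>::value, "so it is not POD-like");
```

UserCtor is the case where asking for standard layout alone is the more accurate check: its layout guarantees hold even though it is not a POD.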
I get the feeling that they added every conceivable variant of type evaluation, regardless of any obvious value, just in case someone might encounter a need sometime before the next standard comes out. I doubt if piling on these "extra" type properties adds a significant additional burden to compiler developers.
There is a nice discussion of standard layout here: Why is C++11's POD "standard layout" definition the way it is?
There is also a lot of good detail at cppreference.com: Non-static data members

Generic/template programming best practices: To limit types, or not to limit types

That is my question. I'm just curious what the consensus is on limiting the types that can be passed in to a generic function or class. I thought I had read at some point, that if you're doing generic programming, it was generally better to leave things open instead of trying to close them down (don't recall the source).
I'm writing a library that has some internal generic functions, and I feel that they should only allow types within the library to be used with them, simply because that's how I mean for them to be used. On the other hand, I'm not really sure my effort to lock things down is worth it.
Anybody maybe have some sources for statistics or authoritative commentary on this topic? I'm also interested in sound opinions. Hopefully that doesn't invalidate this question altogether :\
Also, are there any tags here on SO that equate to "best-practice"? I didn't see that one specifically, but it seems like it'd be helpful to be able to bring up all best-practice info for a given SO topic... maybe not, just a thought.
Edit: One answer so far mentioned that the type of library I'm doing would be significant. It's a database library that ends up working with STL containers, variadics (tuple), Boost Fusion, things of that nature. I can see how that would be relevant, but I'd also be interested in rules of thumb for determining which way to go.
Always leave it as open as possible - but make sure to
document the required interface and behaviour for valid types to use with your generic code.
use a type's interface characteristics (traits) to determine whether to allow/disallow it. Don't base your decision on the type name.
produce reasonable diagnostics if someone uses a wrong type. C++ templates are great at raising tons of deeply-nested errors if they get instantiated with the wrong types; using type traits, static assertions and related techniques, one can easily produce more succinct error messages.
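A minimal sketch of the second and third points, assuming C++17 for std::void_t. All names (has_table_name, insert_statement, User) are invented for illustration. The trait checks for an interface characteristic rather than a type name, and the static_assert gives a one-line diagnostic instead of a deep instantiation trace.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Trait: does T provide a const member function table_name()?
template <class T, class = void>
struct has_table_name : std::false_type {};

template <class T>
struct has_table_name<
    T, std::void_t<decltype(std::declval<const T&>().table_name())>>
    : std::true_type {};

// Generic code that documents and checks its requirement up front.
template <class Row>
std::string insert_statement(const Row& row) {
    static_assert(has_table_name<Row>::value,
                  "Row must provide a const member function table_name()");
    return "INSERT INTO " + std::string(row.table_name());
}

// Any type with the right interface is accepted, whoever wrote it.
struct User {
    const char* table_name() const { return "users"; }
};

static_assert(has_table_name<User>::value, "User meets the requirement");
static_assert(!has_table_name<int>::value, "int does not");

// Usage demo; insert_statement(42) would stop at the static_assert message.
void demo_insert_statement() {
    assert(insert_statement(User{}) == "INSERT INTO users");
}
```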
In my database framework, I decided to forgo templates and use a single base class. Generic programming meant that any or all objects can be used. The specific type classes outweighed the few generic operations. For example, strings and numbers can be compared for equality; BLOBs (Binary Large OBjects) may want to use a different method (such as comparing MD5 checksums stored in a different record).
Also, there was an inheritance branch between strings and numeric types.
By using an inheritance hierarchy, I can refer to any field by using the Field class or to a specialized class such as Field_Int.
It's one of the strongest selling points of the STL that it's so open, and that its algorithms work with my data structures as well as with the one it provides itself, and that my algorithms work with its data structures as well as with mine.
Whether it makes sense to leave your algorithms open to all types or limit them to yours depends largely on the library you're writing, which we know nothing about.
(Initially I meant to answer that being widely open is what Generic Programming is all about, but now I see that there are always limits to genericity, and that you have to draw the line somewhere. It might just as well be limited to your types, if that makes sense.)
At least IMO, the right thing to do is roughly what concepts attempted: rather than attempting to verify that you're receiving the specified type (or one of the set of specified types), do your best to specify the requirements on the type, and verify that the type you've received has the right characteristics, and can meet the requirements of your template.
Much like with concepts, much of the motivation for that is to simply provide good, useful error messages when those requirements aren't met. Ultimately, the compiler will produce an error message if somebody attempts to instantiate your template over a type that doesn't meet its requirements. The problem is that, as likely as not, the error message won't be very helpful unless you take steps to ensure that it is.
The Problem
If your clients can see your internal functions in public headers, and if the names of these internal generic functions are "common", then you may be putting your clients at risk of accidentally calling your internal generic functions.
For example:
namespace Database
{
// internal API, not documented
template <class DatabaseItem>
void
store(DatabaseItem)
{
// ...
}
struct SomeDataBaseType {};
} // Database
namespace ClientCode
{
template <class T, class U>
struct base
{
};
// external API, documented
template <class T, class U>
void
store(base<T, U>)
{
// ...
}
template <class T, class U>
struct derived
: public base<T, U>
{
};
} // ClientCode
int main()
{
ClientCode::derived<int, Database::SomeDataBaseType> d;
store(d); // intended ClientCode::store
}
In this example the author of main doesn't even know Database::store exists. He intends on calling ClientCode::store, and gets lazy, letting ADL choose the function instead of specifying ClientCode::store. After all, his argument to store comes from the same namespace as store so it should just work.
It doesn't work. This example calls Database::store. Depending on the innards of Database::store this call may result in a compile-time error, or worse yet, a run time error.
How To Fix
The more generically you name your functions, the more likely this is to happen. Give your internal functions (the ones that must appear in your headers) really non-generic names. Or put them in a sub-namespace like details. In the latter case you have to make sure your clients won't ever have details as an associated namespace for the purpose of ADL. That's usually accomplished by not creating types that the client will use, either directly or indirectly, in namespace details.
If you want to get more paranoid, start locking things down with enable_if.
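A sketch of what locking down with enable_if could look like for the example above. The trait is_database_item is a hypothetical addition; the other names are reused from the example. Database::store only takes part in overload resolution for types the library has opted in, so the ADL call with the client's derived type picks ClientCode::store again.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

namespace Database {
struct SomeDataBaseType {};

namespace details {
// Hypothetical opt-in trait: false unless the library says otherwise.
template <class T>
struct is_database_item : std::false_type {};
template <>
struct is_database_item<SomeDataBaseType> : std::true_type {};
}  // namespace details

// Internal API: removed from overload resolution for non-library types.
template <class Item>
typename std::enable_if<details::is_database_item<Item>::value,
                        const char*>::type
store(const Item&) {
    return "Database::store";
}
}  // namespace Database

namespace ClientCode {
template <class T, class U>
struct base {};

// External API, documented.
template <class T, class U>
const char* store(const base<T, U>&) {
    return "ClientCode::store";
}

template <class T, class U>
struct derived : public base<T, U> {};
}  // namespace ClientCode

// Usage demo: both calls are unqualified and resolved through ADL.
void demo_store() {
    ClientCode::derived<int, Database::SomeDataBaseType> d;
    assert(std::string(store(d)) == "ClientCode::store");
    assert(std::string(store(Database::SomeDataBaseType{})) ==
           "Database::store");
}
```

Without the enable_if, the unconstrained Database::store would be an exact match for d and would beat ClientCode::store, which needs a derived-to-base conversion.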
If perhaps you think your internal functions might be useful to your clients, then they are no longer internal.
The above example code is not far-fetched. It has happened to me. It has happened to functions in namespace std. I call store in this example overly generic. std::advance and std::distance are classic examples of overly generic code. It is something to guard against. And it is a problem concepts attempted to fix.

Creating serializeable unique compile-time identifiers for arbitrary UDT's

I would like a generic way to create unique compile-time identifiers for any C++ user defined types.
for example:
unique_id<my_type>::value == 0 // true
unique_id<other_type>::value == 1 // true
I've managed to implement something like this using preprocessor meta programming, the problem is, serialization is not consistent. For instance if the class template unique_id is instantiated with other_type first, then any serialization in previous revisions of my program will be invalidated.
I've searched for solutions to this problem, and found several ways to implement this with non-consistent serialization if the unique values are compile-time constants. If RTTI or similar methods, like boost::sp_typeinfo are used, then the unique values are obviously not compile-time constants and extra overhead is present. An ad-hoc solution to this problem would be, instantiating all of the unique_id's in a separate header in the correct order, but this causes additional maintenance and boilerplate code, which is not different than using an enum unique_id{my_type, other_type};.
A good solution to this problem would be using user-defined literals, unfortunately, as far as I know, no compiler supports them at this moment. The syntax would be 'my_type'_id; 'other_type'_id; with udl's.
I'm hoping somebody knows a trick that allows implementing serialize-able unique identifiers in C++ with the current standard (C++03/C++0x), I would be happy if it works with the latest stable MSVC and GNU-G++ compilers, although I expect if there is a solution, it's not portable.
I would like to make clear, that using mpl::set or similar constructs like mpl::vector and filtering, does not solve this problem, because the scope of the meta-set/vector is limited and actually causes more problems than just preprocessor meta programming.
A while back I added a build step to one project of mine, which allowed me to write #script_name(args) in a C++ source file and have it automatically replaced with the output of the associated script, for instance ./script_name.pl args or ./script_name.py args.
You may balk at the idea of polluting the language into nonstandard C++, but all you'd have to do is write #sha1(my_type) to get the unique integer hash of the class name, regardless of build order and without the need for explicit instantiation.
This is just one of many possible nonstandard solutions, and I think a fairly clean one at that. There's currently no great way to impose an arbitrary, consistent ordering on your classes without just specifying it explicitly, so I recommend you simply give in and go the explicit instantiation route; there's nothing really wrong with centralising the information, but as you said it's not all that different from an enumeration, which is what I'd actually use in this situation.
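For what it's worth, hashing the type name can also be done inside the language with a constexpr function, at the cost of spelling the name once per type. The sketch below swaps the external SHA-1 script for a 32-bit FNV-1a hash, and REGISTER_TYPE_ID is a hypothetical macro. It is stable across build order, but it still needs one registration line per type, and hash collisions are possible and are not detected here.

```cpp
#include <cstdint>

// 32-bit FNV-1a over a NUL-terminated string, usable in constant
// expressions. Written recursively so it also works as C++11 constexpr.
constexpr std::uint32_t fnv1a(const char* s,
                              std::uint32_t h = 2166136261u) {
    return *s == '\0'
               ? h
               : fnv1a(s + 1,
                       (h ^ static_cast<std::uint32_t>(
                                static_cast<unsigned char>(*s))) *
                           16777619u);
}

// Primary template left undefined: using an unregistered type is a
// compile-time error.
template <class T>
struct unique_id;

// Hypothetical registration macro: the id depends only on the spelling
// of the type name, not on instantiation order.
#define REGISTER_TYPE_ID(T)                                  \
    template <>                                              \
    struct unique_id<T> {                                    \
        static constexpr std::uint32_t value = fnv1a(#T);    \
    }

struct my_type {};
struct other_type {};

REGISTER_TYPE_ID(my_type);
REGISTER_TYPE_ID(other_type);

static_assert(unique_id<my_type>::value == fnv1a("my_type"),
              "id is derived from the name only");
static_assert(unique_id<my_type>::value != unique_id<other_type>::value,
              "these two names happen not to collide");
```

Renaming a type changes its id, so a serialized format built on this would have to treat type names as part of the format.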
Persistence of data is a very interesting problem.
My first question would be: do you really want serialization? If you are willing to investigate an alternative, then jump to the next section.
If you're still there, I think you have not given the typeid solution all its due.
// static detection
template <typename T>
size_t unique_id()
{
static size_t const id = some_hash(typeid(T)); // or boost::sp_typeinfo
return id;
}
// dynamic detection
template <typename T>
size_t unique_id(T const& t)
{
return some_hash(typeid(t)); // no memoization possible
}
Note: I am using a local static to avoid the order of initialization issue, in case this value is required before main is entered
It's pretty similar to your unique_id<some_type>::value, and even though it's computed at runtime, it's only computed once, and the result (for the static detection) is then memoized for future calls.
Also note that it's fully generic: no need to explicitly write the function for each type.
It may seem silly, but the issue of serialization is that you have a one-to-one mapping between the type and its representation:
you need to version the representation, so as to be able to decode "older" versions
dealing with forward compatibility is pretty hard
dealing with cyclic references is pretty hard (some frameworks handle it)
and then there is the issue of moving information from one to another --> deserializing older versions becomes messy and frustrating
For persistent saves, I usually recommend using a dedicated BOM. Think of the saved data as a message to your future self. And I usually go the extra mile and propose the awesome Google Protocol Buffers library:
Backward and Forward compatibility baked-in
Several format outputs -> human readable (for debug) or binary
Several languages can read/write the same messages (C++, Java, Python)
Pretty sure that you will have to implement your own extension to make this happen, I've not seen nor heard of any such construct for compile-time. MSVC offers __COUNTER__ for the preprocessor but I know of no template equivalent.

Features of C++ that can't be implemented in C?

I have read that C++ is a superset of C and provides a real-time implementation by creating objects. Also, C++ is closer to the real world as it is enriched with object-oriented concepts.
What all concepts are there in C++ that can not be implemented in C?
Some say that we cannot overload functions in C; then how can we have different flavors of printf()?
For example printf("sachin"); will print sachin and printf("%d, %s",count ,name); will print 1, sachin assuming count is an integer whose value is 1 and name is a character array initialised with "sachin".
Some say data abstraction is achieved in C++, so what about structures?
Some responders here argue that most things that can be produced with C++ code can also be produced in C with enough ambition. This is true in part, but some things are inherently impossible to achieve unless you modify the C compiler to deviate from the standard.
Fakeable:
Inheritance (pointer to parent-struct in the child-struct)
Polymorphism (Faking vtable by using a group of function pointers)
Data encapsulation (opaque sub structures with an implementation not exposed in public interface)
Impossible:
Templates (which might as well be called preprocessor step 2)
Function/method overloading by arguments (some try to emulate this with ellipses, but never really comes close)
RAII (Constructors and destructors are automatically invoked in C++, so your stack resources are safely handled within their scope)
Complex cast operators (in C you can cast almost anything)
Exceptions
Worth checking out:
GLib (a C library) has a rather elaborate OO emulation
I posted a question once about what people miss the most when using C instead of C++.
Clarification on RAII:
This concept is usually misinterpreted when it comes to its most important aspect - implicit resource management, i.e. the concept of guaranteeing (usually on language level) that resources are handled properly. Some believe that RAII can be achieved by leaving this responsibility to the programmer (e.g. explicit destructor calls at goto labels), which unfortunately doesn't come close to providing the safety principles of RAII as a design concept.
A quote from a wikipedia article which clarifies this aspect of RAII:
"Resources therefore need to be tied to the lifespan of suitable objects. They are acquired during initialization, when there is no chance of them being used before they are available, and released with the destruction of the same objects, which is guaranteed to take place even in case of errors."
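A minimal sketch of that guarantee. Handle and the counter g_open_handles are made-up stand-ins for a real resource such as a file or a lock. The release happens in the destructor, so it runs during stack unwinding without the programmer writing any cleanup at the point of the error.

```cpp
#include <cassert>
#include <stdexcept>

int g_open_handles = 0;  // stands in for a real resource count

// Acquires in the constructor, releases in the destructor.
class Handle {
public:
    Handle() { ++g_open_handles; }
    ~Handle() { --g_open_handles; }
    Handle(const Handle&) = delete;             // one owner per resource
    Handle& operator=(const Handle&) = delete;
};

void work_that_fails() {
    Handle h;  // acquired here
    throw std::runtime_error("something went wrong");
    // no explicit cleanup: ~Handle() runs while the exception propagates
}

// Returns true if the resource was released despite the error.
bool released_after_error() {
    try {
        work_that_fails();
    } catch (const std::runtime_error&) {
        // by the time we get here, h has already been destroyed
    }
    return g_open_handles == 0;
}

// Usage demo; call it from main() to run the check.
void demo_raii() {
    assert(released_after_error());
}
```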
How about RAII and templates.
It is less about what features can't be implemented, and more about what features are directly supported in the language, and therefore allow clear and succinct expression of the design.
Sure you can implement, simulate, fake, or emulate most C++ features in C, but the resulting code will likely be less readable, or maintainable. Language support for OOP features allows code based on an Object Oriented Design to be expressed far more easily than the same design in a non-OOP language. If C were your language of choice, then often OOD may not be the best design methodology to use - or at least extensive use of advanced OOD idioms may not be advisable.
Of course if you have no design, then you are likely to end up with a mess in any language! ;)
Well, if you aren't going to implement a C++ compiler using C, there are thousands of things you can do with C++, but not with C. To name just a few:
C++ has classes. Classes have constructors and destructors, which run code automatically when an object is initialized or destroyed (by going out of scope or via the delete keyword).
Classes define a hierarchy. You can extend a class. (Inheritance)
C++ supports polymorphism. This means that you can define virtual methods. The compiler will choose which method to call based on the type of the object.
C++ supports Run-Time Type Information (RTTI).
You can use exceptions with C++.
Although you can emulate most of the above in C, you need to rely on conventions and do the work manually, whereas the C++ compiler does the job for you.
There is only one printf() in the C standard library. Other varieties are implemented by changing the name, for instance sprintf(), fprintf() and so on.
Structures can't hide implementation, there is no private data in C. Of course you can hide data by not showing what e.g. pointers point to, as is done for FILE * by the standard library. So there is data abstraction, but not as a direct feature of the struct construct.
Also, you can't overload operators in C, so a + b always means that some kind of addition is taking place. In C++, depending on the type of the objects involved, anything could happen.
Note that this implies (subtly) that + in C actually is overridden; int + int is not the same code as float + int for instance. But you can't do that kind of override yourself, it's something for the compiler only.
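For comparison, this is roughly what doing that "override" yourself looks like in C++. Vec2 is a hypothetical type used only for illustration.

```cpp
#include <cassert>

struct Vec2 {
    int x;
    int y;
};

// User-defined meaning for a + b when both operands are Vec2.
// In C this would have to be a named function such as vec2_add(a, b).
Vec2 operator+(const Vec2& a, const Vec2& b) {
    return Vec2{a.x + b.x, a.y + b.y};
}

// Usage demo: the built-in + and the user-defined + side by side.
void demo_operator_plus() {
    const int n = 1 + 2;                          // built-in addition
    const Vec2 sum = Vec2{1, 2} + Vec2{3, 4};     // calls operator+ above
    assert(n == 3);
    assert(sum.x == 4 && sum.y == 6);
}
```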
You can implement C++ fully in C... The original C++ compiler from AT&T was in fact a preprocessor called Cfront, which just translated C++ code into C and compiled that.
This approach is still used today by Comeau Computing, who produce one of the most standards-compliant C++ compilers there is; it supports all of C++'s features.
namespace
All the rest is "easily" faked :)
printf uses a variable-length argument list, not overloaded versions of the function.
C structures do not have constructors and are unable to inherit from other structures; they are simply a convenient way to address grouped variables.
C is not an OO language and has none of the features of an OO language.
Having said that, you are able to imitate C++ functionality with C code, but with C++ the compiler will do all the work for you at compile time.
What all concepts are there in C++ that can not be implemented in C?
This is somewhat of an odd question, because really any concept that can be expressed in C++ can be expressed in C. Even functionality similar to C++ templates can be implemented in C using various horrifying macro tricks and other crimes against humanity.
The real difference comes down to 2 things: what the compiler will agree to enforce, and what syntactic conveniences the language offers.
Regarding compiler enforcement, in C++ the compiler will not allow you to directly access private data members from outside of a class or friends of the class. In C, the compiler won't enforce this; you'll have to rely on API documentation to separate "private" data from "publicly accessible" data.
And regarding syntactic convenience, C++ offers all sorts of conveniences not found in C, such as operator overloading, references, automated object initialization and destruction (in the form of constructors/destructors), exceptions and automated stack-unwinding, built-in support for polymorphism, etc.
So basically, any concept expressed in C++ can be expressed in C; it's simply a matter of how far the compiler will go to help you express a certain concept and how much syntactic convenience the compiler offers. Since C++ is a newer language, it comes with a lot more bells and whistles than you would find in C, thus making the expression of certain concepts easier.
One feature that isn't really OOP-related is default arguments, which can be a real keystroke-saver when used correctly.
Function overloading
I suppose there are so many things namespaces, templates that could not be implemented in C.
There shouldn't be too many such things, because early C++ compilers produced C source code from C++ source code. Basically you can do everything in assembler, but you don't WANT to.
Quoting Joel, I'd say a powerful "feature" of C++ is operator overloading. That for me means having a language that will drive you insane unless you maintain your own code. For example,
i = j * 5;
… in C you know, at least, that j is being multiplied by five and the results stored in i.
But if you see that same snippet of code in C++, you don’t know anything. Nothing. The only way to know what’s really happening in C++ is to find out what types i and j are, something which might be declared somewhere altogether else. That’s because j might be of a type that has operator* overloaded and it does something terribly witty when you try to multiply it. And i might be of a type that has operator= overloaded, and the types might not be compatible so an automatic type coercion function might end up being called. And the only way to find out is not only to check the type of the variables, but to find the code that implements that type, and God help you if there’s inheritance somewhere, because now you have to traipse all the way up the class hierarchy all by yourself trying to find where that code really is, and if there’s polymorphism somewhere, you’re really in trouble because it’s not enough to know what type i and j are declared, you have to know what type they are right now, which might involve inspecting an arbitrary amount of code and you can never really be sure if you’ve looked everywhere thanks to the halting problem (phew!).
When you see i=j*5 in C++ you are really on your own, bubby, and that, in my mind, reduces the ability to detect possible problems just by looking at code.
But again, this is a feature. (I know I will be modded down, but at the time of writing only a handful of posts talked about downsides of operator overloading)

Why use prefixes on member variables in C++ classes

A lot of C++ code uses syntactical conventions for marking up member variables. Common examples include
m_memberName for public members (where public members are used at all)
_memberName for private members or all members
Others try to enforce using this->member whenever a member variable is used.
In my experience, most larger code bases fail at applying such rules consistently.
In other languages, these conventions are far less widespread. I see it only occasionally in Java or C# code. I think I have never seen it in Ruby or Python code. Thus, there seems to be a trend with more modern languages to not use special markup for member variables.
Is this convention still useful today in C++, or is it just an anachronism, especially as it is used so inconsistently across libraries? Haven't the other languages shown that one can do without member prefixes?
I'm all in favour of prefixes done well.
I think (System) Hungarian notation is responsible for most of the "bad rap" that prefixes get.
This notation is largely pointless in strongly typed languages. In C++, for example, "lpsz" tells you that your string is a long pointer to a nul-terminated string, when segmented architecture is ancient history, C++ strings are by common convention pointers to nul-terminated char arrays, and it's not really all that difficult to know that "customerName" is a string!
However, I do use prefixes to specify the usage of a variable (essentially "Apps Hungarian", although I prefer to avoid the term Hungarian due to it having a bad and unfair association with System Hungarian), and this is a very handy timesaving and bug-reducing approach.
I use:
m for members
c for constants/readonlys
p for pointer (and pp for pointer to pointer)
v for volatile
s for static
i for indexes and iterators
e for events
Where I wish to make the type clear, I use standard suffixes (e.g. List, ComboBox, etc).
This makes the programmer aware of the usage of the variable whenever they see/use it. Arguably the most important case is "p" for pointer (because the usage changes from var. to var-> and you have to be much more careful with pointers - NULLs, pointer arithmetic, etc), but all the others are very handy.
For example, you can use the same variable name in multiple ways in a single function: (here a C++ example, but it applies equally to many languages)
MyClass::MyClass(int numItems)
{
    mNumItems = numItems;
    for (int iItem = 0; iItem < mNumItems; iItem++)
    {
        Item *pItem = new Item();
        itemList[iItem] = pItem;
    }
}
You can see here:
No confusion between member and parameter
No confusion between index/iterator and items
Use of a set of clearly related variables (item list, pointer, and index) that avoid the many pitfalls of generic (vague) names like "count", "index".
Prefixes reduce typing: they are shorter, and work better with auto-completion, than alternatives like "itemIndex" and "itemPtr".
Another great point of "iName" iterators is that I never index an array with the wrong index, and if I copy a loop inside another loop I don't have to refactor one of the loop index variables.
Compare this unrealistically simple example:
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 5; j++)
        list[i].score += other[j].score;
(which is hard to read and often leads to use of "i" where "j" was intended)
with:
for (int iCompany = 0; iCompany < numCompanies; iCompany++)
    for (int iUser = 0; iUser < numUsers; iUser++)
        companyList[iCompany].score += userList[iUser].score;
(which is much more readable, and removes all confusion over indexing. With auto-complete in modern IDEs, this is also quick and easy to type)
The next benefit is that code snippets don't require any context to be understood. I can copy two lines of code into an email or a document, and anyone reading that snippet can tell the difference between all the members, constants, pointers, indexes, etc. I don't have to add "oh, and be careful because 'data' is a pointer to a pointer", because it's called 'ppData'.
And for the same reason, I don't have to move my eyes out of a line of code in order to understand it. I don't have to search through the code to find if 'data' is a local, parameter, member, or constant. I don't have to move my hand to the mouse so I can hover the pointer over 'data' and then wait for a tooltip (that sometimes never appears) to pop up. So programmers can read and understand the code significantly faster, because they don't waste time searching up and down or waiting.
(If you don't think you waste time searching up and down to work stuff out, find some code you wrote a year ago and haven't looked at since. Open the file and jump about half way down without reading it. See how far you can read from this point before you don't know if something is a member, parameter or local. Now jump to another random location... This is what we all do all day long when we are single stepping through someone else's code or trying to understand how to call their function.)
The 'm' prefix also avoids the (IMHO) ugly and wordy "this->" notation, and the inconsistency that it guarantees (even if you are careful you'll usually end up with a mixture of 'this->data' and 'data' in the same class, because nothing enforces a consistent spelling of the name).
'this' notation is intended to resolve ambiguity - but why would anyone deliberately write code that can be ambiguous? Ambiguity will lead to a bug sooner or later. And in some languages 'this' can't be used for static members, so you have to introduce 'special cases' in your coding style. I prefer to have a single simple coding rule that applies everywhere - explicit, unambiguous and consistent.
The last major benefit is with Intellisense and auto-completion. Try using Intellisense on a Windows Form to find an event - you have to scroll through hundreds of mysterious base class methods that you will never need to call to find the events. But if every event had an "e" prefix, they would automatically be listed in a group under "e". Thus, prefixing works to group the members, consts, events, etc in the intellisense list, making it much quicker and easier to find the names you want. (Usually, a method might have around 20-50 values (locals, params, members, consts, events) that are accessible in its scope. But after typing the prefix (I want to use an index now, so I type 'i...'), I am presented with only 2-5 auto-complete options. The 'extra typing' people attribute to prefixes and meaningful names drastically reduces the search space and measurably accelerates development speed)
I'm a lazy programmer, and the above convention saves me a lot of work. I can code faster and I make far fewer mistakes because I know how every variable should be used.
Arguments against
So, what are the cons? Typical arguments against prefixes are:
"Prefix schemes are bad/evil". I agree that "m_lpsz" and its ilk are poorly thought out and wholly useless. That's why I'd advise using a well designed notation designed to support your requirements, rather than copying something that is inappropriate for your context. (Use the right tool for the job).
"If I change the usage of something I have to rename it". Yes, of course you do, that's what refactoring is all about, and why IDEs have refactoring tools to do this job quickly and painlessly. Even without prefixes, changing the usage of a variable almost certainly means its name ought to be changed.
"Prefixes just confuse me". As does every tool until you learn how to use it. Once your brain has become used to the naming patterns, it will filter the information out automatically and you won't really mind that the prefixes are there any more. But you have to use a scheme like this solidly for a week or two before you'll really become "fluent". And that's when a lot of people look at old code and start to wonder how they ever managed without a good prefix scheme.
"I can just look at the code to work this stuff out". Yes, but you don't need to waste time looking elsewhere in the code or remembering every little detail of it when the answer is right on the spot your eye is already focussed on.
"(Some of) that information can be found by just waiting for a tooltip to pop up on my variable". Yes. Where supported, for some types of prefix, when your code compiles cleanly, after a wait, you can read through a description and find the information the prefix would have conveyed instantly. I feel that the prefix is a simpler, more reliable and more efficient approach.
"It's more typing". Really? One whole character more? Or is it - with IDE auto-completion tools, it will often reduce typing, because each prefix character narrows the search space significantly. Press "e" and the three events in your class pop up in intellisense. Press "c" and the five constants are listed.
"I can use this-> instead of m". Well, yes, you can. But that's just a much uglier and more verbose prefix! Only it carries a far greater risk (especially in teams) because to the compiler it is optional, and therefore its usage is frequently inconsistent. m on the other hand is brief, clear, explicit and not optional, so it's much harder to make mistakes using it.
I generally don't use a prefix for member variables.
I used to use an m prefix, until someone pointed out that "C++ already has a standard prefix for member access: this->".
So that's what I use now. That is, when there is ambiguity, I add the this-> prefix, but usually, no ambiguity exists, and I can just refer directly to the variable name.
To me, that's the best of both worlds. I have a prefix I can use when I need it, and I'm free to leave it out whenever possible.
Of course, the obvious counter to this is "yes, but then you can't see at a glance whether a variable is a class member or not".
To which I say "so what? If you need to know that, your class probably has too much state. Or the function is too big and complicated".
In practice, I've found that this works extremely well. As an added bonus it allows me to promote a local variable to a class member (or the other way around) easily, without having to rename it.
And best of all, it is consistent! I don't have to do anything special or remember any conventions to maintain consistency.
By the way, you shouldn't use leading underscores for your class members. You get uncomfortably close to names that are reserved by the implementation.
The standard reserves all names containing a double underscore or starting with an underscore followed by a capital letter. It also reserves all names starting with a single underscore in the global namespace.
So a class member with a leading underscore followed by a lower-case letter is legal, but sooner or later you're going to do the same to an identifier starting with upper-case, or otherwise break one of the above rules.
So it's easier to just avoid leading underscores. Use a postfix underscore, or a m_ or just m prefix if you want to encode scope in the variable name.
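To make those rules concrete, here is a minimal sketch. The class Widget, its members and the helper demo_reserved_names are hypothetical names invented for this illustration; the reserved spellings are left commented out because using them is exactly what the rules forbid.

```cpp
#include <cassert>

// Hypothetical class, used only to illustrate the naming rules above.
class Widget {
public:
    explicit Widget(int count)
        : count_(count), m_limit(2 * count), _size(count + 1) {}

    int count() const { return count_; }
    int limit() const { return m_limit; }
    int size() const { return _size; }

private:
    int count_;   // trailing underscore: not reserved
    int m_limit;  // m_ prefix: not reserved
    int _size;    // underscore + lower-case letter: legal at class scope,
                  // but one capital letter away from a reserved name

    // int _Size;   // reserved: underscore followed by an upper-case letter
    // int __size;  // reserved: contains a double underscore
};

// int _total;  // reserved: leading underscore in the global namespace

// Small usage example; call it from main() to try the class out.
inline void demo_reserved_names() {
    Widget w(3);
    assert(w.count() == 3);
    assert(w.limit() == 6);
    assert(w.size() == 4);
}
```

Whether _size is worth the risk is a style decision; the sketch only shows that the trailing-underscore and m_ spellings stay clear of every reserved pattern.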
You have to be careful with using a leading underscore. A leading underscore before a capital letter in a word is reserved.
For example:
_Foo
_L
are both reserved identifiers, while
_foo
_l
are not. There are other situations where leading underscores before lowercase letters are not allowed. In my specific case, I found the _L happened to be reserved by Visual C++ 2005 and the clash created some unexpected results.
I am on the fence about how useful it is to mark up local variables.
Here is a link about which identifiers are reserved:
What are the rules about using an underscore in a C++ identifier?
I prefer postfix underscores, like such:
class Foo
{
private:
    int bar_;
public:
    int bar() { return bar_; }
};
Lately I have tended to prefer the m_ prefix over having no prefix at all. The reason isn't so much that it's important to flag member variables, but that it avoids ambiguity. Say you have code like:
void set_foo(int foo) { foo = foo; }
That of course doesn't work: only one foo is allowed. So your options are:
this->foo = foo;
I don't like it: it causes parameter shadowing, so you can no longer use g++ -Wshadow warnings, and it's also longer to type than m_. You also still run into naming conflicts between variables and functions when you have an int foo; and an int foo();.
foo = foo_; or foo = arg_foo;
I've been using that for a while, but it makes the argument lists ugly; documentation shouldn't have to deal with name disambiguation in the implementation. Naming conflicts between variables and functions also exist here.
m_foo = foo;
API documentation stays clean, you don't get ambiguity between member functions and variables, and it's shorter to type than this->. The only disadvantage is that it makes POD structures ugly, but as POD structures don't suffer from the name ambiguity in the first place, one doesn't need to use it with them. Having a unique prefix also makes a few search&replace operations easier.
foo_ = foo;
Most of the advantages of m_ apply, but I reject it for aesthetic reasons: a trailing or leading underscore just makes the variable look incomplete and unbalanced. m_ just looks better. Using m_ is also more extensible, as you can use g_ for globals and s_ for statics.
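The difference between these options can be sketched in a few lines. The struct names Shadowed, WithThis and WithPrefix and the helper demo_setters are hypothetical, and the first setter deliberately keeps the bug described above so that its effect can be observed.

```cpp
#include <cassert>

// Hypothetical structs; each shows one spelling of the same setter.
struct Shadowed {
    int foo = 0;
    // Deliberate bug: both sides name the parameter, so the member is
    // never written and the call silently does nothing.
    void set_foo(int foo) { foo = foo; }
};

struct WithThis {
    int foo = 0;
    // Works, but the parameter still shadows the member, which is what
    // the answer above says trips g++ -Wshadow.
    void set_foo(int foo) { this->foo = foo; }
};

struct WithPrefix {
    int m_foo = 0;
    // The names differ, so there is nothing to shadow or disambiguate.
    void set_foo(int foo) { m_foo = foo; }
};

// Small usage example; call it from main() to see the three behaviours.
inline void demo_setters() {
    Shadowed a;
    a.set_foo(5);
    assert(a.foo == 0);    // member untouched by the buggy setter

    WithThis b;
    b.set_foo(5);
    assert(b.foo == 5);

    WithPrefix c;
    c.set_foo(5);
    assert(c.m_foo == 5);
}
```

The buggy version compiles without complaint under default compiler settings, which is why a naming rule that makes the mistake impossible to write can be attractive.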
PS: The reason why you don't see m_ in Python or Ruby is that both languages enforce their own prefix: Ruby uses @ for member variables and Python requires self.
When reading through a member function, knowing who "owns" each variable is absolutely essential to understanding the meaning of the variable. In a function like this:
void Foo::bar( int apples )
{
    int bananas = apples + grapes;
    melons = grapes * bananas;
    spuds += melons;
}
...it's easy enough to see where apples and bananas are coming from, but what about grapes, melons, and spuds? Should we look in the global namespace? In the class declaration? Is the variable a member of this object or a member of this object's class? Without knowing the answer to these questions, you can't understand the code. And in a longer function, even the declarations of local variables like apples and bananas can get lost in the shuffle.
Prepending a consistent label for globals, member variables, and static member variables (perhaps g_, m_, and s_ respectively) instantly clarifies the situation.
void Foo::bar( int apples )
{
    int bananas = apples + g_grapes;
    m_melons = g_grapes * bananas;
    s_spuds += m_melons;
}
These may take some getting used to at first—but then, what in programming doesn't? There was a day when even { and } looked weird to you. And once you get used to them, they help you understand the code much more quickly.
(Using "this->" in place of m_ makes sense, but is even more long-winded and visually disruptive. I don't see it as a good alternative for marking up all uses of member variables.)
A possible objection to the above argument would be to extend the argument to types. It might also be true that knowing the type of a variable "is absolutely essential to understanding the meaning of the variable." If that is so, why not add a prefix to each variable name that identifies its type? With that logic, you end up with Hungarian notation. But many people find Hungarian notation laborious, ugly, and unhelpful.
void Foo::bar( int iApples )
{
    int iBananas = iApples + g_fGrapes;
    m_fMelons = g_fGrapes * iBananas;
    s_dSpuds += m_fMelons;
}
Hungarian does tell us something new about the code. We now understand that there are several implicit casts in the Foo::bar() function. The problem with the code now is that the value of the information added by Hungarian prefixes is small relative to the visual cost. The C++ type system includes many features to help types either work well together or to raise a compiler warning or error. The compiler helps us deal with types—we don't need notation to do so. We can infer easily enough that the variables in Foo::bar() are probably numeric, and if that's all we know, that's good enough for gaining a general understanding of the function. Therefore the value of knowing the precise type of each variable is relatively low. Yet the ugliness of a variable like "s_dSpuds" (or even just "dSpuds") is great. So, a cost-benefit analysis rejects Hungarian notation, whereas the benefit of g_, s_, and m_ overwhelms the cost in the eyes of many programmers.
I can't say how widespread it is, but speaking personally, I always (and have always) prefixed my member variables with 'm'. E.g.:
class Person {
    ....
private:
    std::string mName;
};
It's the only form of prefixing I do use (I'm very anti Hungarian notation) but it has stood me in good stead over the years. As an aside, I generally detest the use of underscores in names (or anywhere else for that matter), but do make an exception for preprocessor macro names, as they are usually all uppercase.
The main reason for a member prefix is to distinguish between a member function and a member variable with the same name. This is useful if you use getters with the name of the thing.
Consider:
class person
{
public:
    person(const std::string& full_name)
        : full_name_(full_name)
    {}
    const std::string& full_name() const { return full_name_; }
private:
    std::string full_name_;
};
The member variable could not be named full_name in this case. You need to rename the member function to get_full_name() or decorate the member variable somehow.
I don't think one syntax has real value over another. It all boils down, like you mentioned, to uniformity across the source files.
The only point where I find such rules interesting is when I need two things named identically, for example:
void myFunc(int index){
    this->index = index;
}
void myFunc(int index){
    m_index = index;
}
I use it to differentiate the two. The same applies when I wrap calls, e.g. from a Windows DLL: RecvPacket(...) from the DLL might be wrapped in RecvPacket(...) in my code. On these particular occasions, using a prefix like "_" makes the two look alike, so it is easy to identify which is which, yet they are different to the compiler.
Some responses focus on refactoring, rather than naming conventions, as the way to improve readability. I don't feel that one can replace the other.
I've known programmers who are uncomfortable with using local declarations; they prefer to place all the declarations at the top of a block (as in C), so they know where to find them. I've found that, where scoping allows for it, declaring variables where they're first used decreases the time that I spend glancing backwards to find the declarations. (This is true for me even for small functions.) That makes it easier for me to understand the code I'm looking at.
I hope it's clear enough how this relates to member naming conventions: When members are uniformly prefixed, I never have to look back at all; I know the declaration won't even be found in the source file.
I'm sure that I didn't start out preferring these styles. Yet over time, working in environments where they were used consistently, I optimized my thinking to take advantage of them. I think it's possible that many folks who currently feel uncomfortable with them would also come to prefer them, given consistent usage.
Those conventions are just that. Most shops use code conventions to ease code readability so anyone can easily look at a piece of code and quickly decipher between things such as public and private members.
Others try to enforce using this->member whenever a member variable is used
That is usually because there is no prefix. The compiler needs enough information to resolve the variable in question, be it a unique name because of the prefix, or via the this keyword.
So, yes, I think prefixes are still useful. I, for one, would prefer to type '_' to access a member rather than 'this->'.
Other languages will use coding conventions, they just tend to be different. C# for example has probably two different styles that people tend to use, either one of the C++ methods (_variable, mVariable or other prefix such as Hungarian notation), or what I refer to as the StyleCop method.
private int privateMember;
public int PublicMember;
public int Function(int parameter)
{
    // StyleCop enforces using this. for class members.
    this.privateMember = parameter;
    return this.privateMember;
}
In the end, it becomes what people know, and what looks best. I personally think code is more readable without Hungarian notation, but it can become easier to find a variable with intellisense for example if the Hungarian notation is attached.
In my example above, you don't need an m prefix for member variables because prefixing your usage with this. indicates the same thing in a compiler-enforced method.
This doesn't necessarily mean the other methods are bad, people stick to what works.
When you have a big method or code block, it's convenient to know immediately whether you are using a local variable or a member. It helps avoid errors and improves clarity!
IMO, this is personal. I don't use any prefixes at all. That said, if the code is meant to be public, I think it is better to have some prefixes so that it is more readable.
Large companies often have their own so-called 'developer rules'.
Btw, the funniest yet smartest I have seen was DRY KISS (Don't Repeat Yourself. Keep It Simple, Stupid). :-)
As others have already said, the importance is to be colloquial (adapt naming styles and conventions to the code base in which you're writing) and to be consistent.
For years I have worked on a large code base that uses both the "this->" convention as well as using a postfix underscore notation for member variables. Throughout the years I've also worked on smaller projects, some of which did not have any sort of convention for naming member variables, and other which had differing conventions for naming member variables. Of those smaller projects, I've consistently found those which lacked any convention to be the most difficult to jump into quickly and understand.
I'm very anal-retentive about naming. I will agonize over the name to be ascribed to a class or variable to the point that, if I cannot come up with something that I feel is "good", I will choose to name it something nonsensical and provide a comment describing what it really is. That way, at least the name means exactly what I intend it to mean--nothing more and nothing less. And often, after using it for a little while, I discover what the name should really be and can go back and modify or refactor appropriately.
One last point on the topic of an IDE doing the work: that's all nice and good, but IDEs are often not available in environments where I have to perform the most urgent work. Sometimes the only thing available at that point is a copy of 'vi'. Also, I've seen many cases where IDE code completion has propagated stupidity such as incorrect spelling in names. Thus, I prefer to not have to rely on an IDE crutch.
The original idea for prefixes on C++ member variables was to store additional type information that the compiler didn't know about. So for example, you could have a string that's a fixed length of chars, and another that's variable and terminated by a '\0'. To the compiler both are just char data reached through a pointer, but if you try to copy from one to the other you get in huge trouble. So, off the top of my head,
const char *aszFred = "Hi I'm a null-terminated string";
char arrWilma[] = {'O', 'o', 'p', 's'};
where "asz" means this variable is an "ascii string (zero-terminated)" and "arr" means this variable is a character array.
Then the magic happens. The compiler will be perfectly happy with this statement:
strcpy(arrWilma, aszFred);
But you, as a human, can look at it and say "hey, those variables aren't really the same type, I can't do that".
Unfortunately a lot of places use standards such as "m_" for member variables, "i" for integers no matter how used, "cp" for char pointers. In other words they're duplicating what the compiler knows, and making the code hard to read at the same time. I believe this pernicious practice should be outlawed by statute and subject to harsh penalties.
Finally, there are two points I should mention:
Judicious use of C++ features allows the compiler to know the information you had to encode in raw C-style variables. You can make classes that will only allow valid operations. This should be done as much as practical.
If your code blocks are so long that you forget what type a variable is before you use it, they are way too long. Don't use names, re-organize.
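As a rough sketch of the first point, the distinction that the "asz" and "arr" prefixes carried by convention can be moved into types, so that the compiler rejects the dangerous copy. ZString, CharBlock, copy_into and demo_typed_strings are hypothetical names invented here, not part of any library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical wrapper types: what the prefixes used to say by
// convention is now visible to the compiler.
struct ZString {                 // zero-terminated, variable-length text
    std::string text;
};

struct CharBlock {               // exactly four characters, no terminator
    std::array<char, 4> chars;
};

// The only copy offered between the two types is explicit and
// refuses a source whose length does not match the block.
inline bool copy_into(CharBlock& dst, const ZString& src) {
    if (src.text.size() != dst.chars.size()) return false;
    for (std::size_t i = 0; i < dst.chars.size(); ++i) {
        dst.chars[i] = src.text[i];
    }
    return true;
}

// Small usage example; call it from main() to try the types out.
inline void demo_typed_strings() {
    ZString fred{"Hi I'm a null-terminated string"};
    CharBlock wilma{{'O', 'o', 'p', 's'}};

    assert(!copy_into(wilma, fred));            // too long: rejected
    assert(wilma.chars[0] == 'O');              // block left untouched
    assert(copy_into(wilma, ZString{"Fine"}));  // exactly four chars: accepted
    assert(wilma.chars[0] == 'F');

    // strcpy(wilma, fred);  // would not compile: neither type converts to char*
}
```

With types like these, the human check described above ("those variables aren't really the same type") becomes a compile-time error instead of a reading exercise.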
Our project has always used "its" as a prefix for member data, and "the" as a prefix for parameters, with no prefix for locals. It's a little cutesy, but it was adopted by the early developers of our system because they saw it used as a convention by some commercial source libraries we were using at the time (either XVT or RogueWave - maybe both). So you'd get something like this:
void
MyClass::SetName(const RWCString &theName)
{
    itsName = theName;
}
The big reason I see for scoping prefixes (and no others - I hate Hungarian notation) is that it prevents you from getting into trouble by writing code where you think you're referring to one variable, but you're really referring to another variable with the same name defined in the local scope. It also avoids the problem of coming up with variable names to represent the same concept in different scopes, like the example above. In that case, you would have to come up with some prefix or different name for the parameter "theName" anyway, so why not make a consistent rule that applies everywhere?
Just using this-> isn't really good enough - we're not as interested in reducing ambiguity as we are in reducing coding errors, and masking names with locally scoped identifiers can be a pain. Granted, some compilers may have the option to raise warnings for cases where you've masked the name in a larger scope, but those warnings may become a nuisance if you're working with a large set of third party libraries that happen to have chosen names for unused variables that occasionally collide with your own.
As for the its/the itself - I honestly find it easier to type than underscores (as a touch typist, I avoid underscores whenever possible - too much stretching off the home rows), and I find it more readable than a mysterious underscore.
I use it because VC++'s Intellisense can't tell when to show private members when accessing from outside the class. The only indication is a little "lock" symbol on the field icon in the Intellisense list. A prefix just makes it easier to identify private members (fields). It's also a habit from C#, to be honest.
#include <string>
class Person {
    std::string m_Name;
public:
    std::string Name() const { return m_Name; }
    void SetName(std::string name) { m_Name = name; }
};
int main() {
    Person *p = new Person();
    p->Name(); // valid
    p->m_Name; // invalid: the compiler reports an error, but Intellisense doesn't know this
    delete p;
    return 0;
}
I think that, if you need prefixes to distinguish class members from member function parameters and local variables, either the function is too big or the variables are badly named. If it doesn't fit on the screen so you can easily see what is what, refactor.
Given that they often are declared far from where they are used, I find that naming conventions for global constants (and global variables, although IMO there's rarely ever a need to use those) make sense. But otherwise, I don't see much need.
That said, I used to put an underscore at the end of all private class members. Since all my data is private, this implies members have a trailing underscore. I usually don't do this anymore in new code bases, but since, as a programmer, you mostly work with old code, I still do this a lot. I'm not sure whether my tolerance for this habit comes from the fact that I used to do this always and am still doing it regularly or whether it really makes more sense than the marking of member variables.
In Python, leading double underscores are used to emulate private members. For more details see this answer
I use m_ for member variables just to take advantage of Intellisense and related IDE-functionality. When I'm coding the implementation of a class I can type m_ and see the combobox with all m_ members grouped together.
But I could live without m_ 's without problem, of course. It's just my style of work.
It is useful to differentiate between member variables and local variables due to memory management. Broadly speaking, heap-allocated member variables should be destroyed in the destructor, while heap-allocated local variables should be destroyed within that scope. Applying a naming convention to member variables facilitates correct memory management.
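A minimal sketch of that rule of thumb follows, using raw new and delete only because that is the situation the answer describes. Item, Basket and demo_ownership are hypothetical names, and the "mp" and "p" prefixes are one possible spelling borrowed from the schemes discussed earlier in this thread.

```cpp
#include <cassert>

// Hypothetical types, for illustration only.
struct Item {
    int value = 0;
};

class Basket {
public:
    Basket() : mpItem(new Item()) {}            // heap-allocated member...
    ~Basket() { delete mpItem; }                // ...released in the destructor

    Basket(const Basket&) = delete;             // a copy would delete mpItem twice
    Basket& operator=(const Basket&) = delete;

    int next_value() const {
        Item* pScratch = new Item();            // heap-allocated local...
        pScratch->value = mpItem->value + 1;
        int result = pScratch->value;
        delete pScratch;                        // ...released before leaving the scope
        return result;
    }

private:
    Item* mpItem;                               // the prefix marks destructor-owned storage
};

// Small usage example; call it from main() to try the class out.
inline void demo_ownership() {
    Basket basket;
    assert(basket.next_value() == 1);
}
```

In current C++ the same ownership would more commonly be expressed with std::unique_ptr or a plain value member, which removes the need to remember either delete.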
Code Complete recommends m_varname for member variables.
While I've never thought the m_ notation useful, I would give McConnell's opinion weight in building a standard.
I almost never use prefixes in front of my variable names. If you're using a decent enough IDE you should be able to refactor and find references easily. I use very clear names and am not afraid of having long variable names. I've never had trouble with scope either with this philosophy.
The only time I use a prefix would be on the signature line. I'll prefix parameters to a method with _ so I can program defensively around them.
You should never need such a prefix. If such a prefix offers you any advantage, your coding style in general needs fixing, and it's not the prefix that's keeping your code from being clear. Typical bad variable names include "other" or "2". You do not fix that with requiring it to be mOther, you fix it by getting the developer to think about what that variable is doing there in the context of that function. Perhaps he meant remoteSide, or newValue, or secondTestListener or something in that scope.
It's an effective anachronism that's still propagated too far. Stop prefixing your variables and give them proper names whose clarity reflects how long they're used. Up to 5 lines you could call it "i" without confusion; beyond 50 lines you need a pretty long name.
I like variable names to give only a meaning to the values they contain, and leave how they are declared/implemented out of the name. I want to know what the value means, period. Maybe I've done more than an average amount of refactoring, but I find that embedding how something is implemented in the name makes refactoring more tedious than it needs to be. Prefixes indicating where or how object members are declared are implementation specific.
color = Red;
Most of the time, I don't care if Red is an enum, a struct, or whatever, and if the function is so large that I can't remember if color was declared locally or is a member, it's probably time to break the function into smaller logical units.
If your cyclomatic complexity is so great that you can't keep track of what is going on in the code without implementation-specific clues embedded in the names of things, most likely you need to reduce the complexity of your function/method.
Mostly, I only use 'this' in constructors and initializers.
According to JOINT STRIKE FIGHTER AIR VEHICLE C++ CODING STANDARDS (december 2005):
AV Rule 67
Public and protected data should only be used in
structs—not classes. Rationale: A class is able to maintain its
invariant by controlling access to its data. However, a class cannot
control access to its members if those members are non-private. Hence all
data in a class should be private.
Thus, the "m" prefix becomes useless, as all data should be private.
But it is a good habit to use the p prefix before a pointer as it is a dangerous variable.
Many of those conventions are from a time without sophisticated editors. I would recommend using a proper IDE that allows you to color every kind of variable. Color is by far easier to spot than any prefix.
If you need to get even more detail on a variable any modern IDE should be able to show it to you by moving the caret or cursor over it. And if you use a variable in a wrong way (for instance a pointer with the . operator) you will get an error, anyway.
Personally, I use a relatively "simple" system to denote what variables are.
I combine the different "flags", then an underscore, then the memory type, and finally the name.
I like this because you can narrow down the amount of variables in an IDE's completion as much as possible as quickly as possible.
The stuff I use is:
m for member variable
s for static
c for const/constexpr
then an underscore _
then the variable memory type
p for unowned pointer
v for list
r for reference
nothing for owned value
for example if I had a member variable which is a list of ints I would put
m_vName
and for a static const pointer to a pointer of lists of ints I would put
sc_ppvName
This lets me quickly tell what the variable is used for and how to access it, as well as how to get/drop values.
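Applied to a small class, the scheme might look like the sketch below. ScoreBoard, its members and demo_scoreboard are hypothetical names, and "list" is assumed here to mean std::vector; the answer does not say which container it has in mind.

```cpp
#include <cassert>
#include <vector>

// Hypothetical class following the scheme: flags, underscore, memory type, name.
class ScoreBoard {
public:
    explicit ScoreBoard(int& external)
        : m_pExternal(&external), m_rExternal(external) {}

    void add(int score) { m_vScores.push_back(score); }
    int count() const { return static_cast<int>(m_vScores.size()); }
    bool full() const { return count() >= sc_Limit; }
    void publish() { *m_pExternal = count(); }   // 'p' says: dereference, not owned
    int external() const { return m_rExternal; } // 'r' says: alias of a caller's int

private:
    std::vector<int> m_vScores;         // m + _ + v: member, list
    int* m_pExternal;                   // m + _ + p: member, unowned pointer
    int& m_rExternal;                   // m + _ + r: member, reference
    static constexpr int sc_Limit = 2;  // s + c + _: static const, owned value
};

// Small usage example; call it from main() to try the class out.
inline void demo_scoreboard() {
    int shared = 0;
    ScoreBoard board(shared);
    board.add(5);
    assert(!board.full());
    board.add(7);
    assert(board.full());
    board.publish();
    assert(shared == 2);
    assert(board.external() == 2);
}
```

Typing m_ in an editor with completion would then list only the three members, and m_p only the pointer, which is the narrowing effect the answer describes.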