I know that there are rules for using underscores in identifiers in C/C++. Are there any rules for using them in source code filenames?
For instance, are there any restrictions against beginning or ending a filename with an underscore? Or having an underscore as the last character before the .c or .h extension? Double underscores?
References are appreciated, if there are any.
If the source files are subject to preprocessor #include directives, then the C and C++ standards specify a minimum set of requirements on the filename. The C standard says:
6.10.2 Source file inclusion
...
The implementation shall provide unique mappings for sequences consisting of one or more nondigits or digits (6.4.2.1) followed by a period (.) and a single nondigit. The first character shall not be a digit. The implementation may ignore distinctions of alphabetical case and restrict the mapping to eight significant characters before the period.
... where nondigit contains the letters A-Z, a-z and underscore.
The exact same text (except for the paragraph numbers) can also be found in the C++ standard, 16.2 Source file inclusion.
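To make the quoted rule concrete, here is a minimal sketch of a check for the name shape that the standard guarantees a unique mapping for. The function name `has_guaranteed_mapping` is my own invention, not anything from the standard, and the check ignores the optional eight-character and case-folding restrictions. A name that fails the check is not invalid; it is merely outside the guaranteed minimum.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of any standard or library).
// True if `name` is: one or more nondigits or digits, not starting with a
// digit, followed by a period and exactly one nondigit.
// Assumes the default "C" locale, where isalpha accepts only A-Z and a-z.
bool has_guaranteed_mapping(const std::string& name) {
    const std::string::size_type dot = name.find('.');
    // Need at least one character before the period and exactly one after it.
    if (dot == std::string::npos || dot == 0 || dot + 2 != name.size()) {
        return false;
    }
    // The first character shall not be a digit.
    if (std::isdigit(static_cast<unsigned char>(name[0]))) {
        return false;
    }
    // Everything before the period must be a letter, a digit or an underscore.
    for (std::string::size_type i = 0; i < dot; ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (c != '_' && !std::isalnum(c)) {
            return false;
        }
    }
    // The single character after the period must be a nondigit.
    const unsigned char ext = static_cast<unsigned char>(name[dot + 1]);
    return ext == '_' || std::isalpha(ext) != 0;
}
```

Under this sketch, leading, trailing and double underscores are all within the guaranteed set (`_foo.h`, `foo_.h`, `foo__bar.c`), while something like `foo.cpp` relies on the implementation going beyond the minimum, which in practice they all do.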
Beyond that, what passes for a valid filename depends on the operating system, file system, compiler, linker and other parts of the compilation tool chain.
These days, I'd expect most modern systems to allow almost anything that isn't directly forbidden by the file system.
References
The final public draft of the C11 standard, n1570
The final public draft of the C++11 standard, n3337
No. Files can be named whatever they want (given that the underlying file-system supports the name) - neither the C++, nor the C standard have any stake in that.
There are rules about "_" in identifiers, yes, but that does not carry over to external things like file names.
Neither the C nor the C++ language (and please remember that they're two different languages) has any rules for source file names. Both specify the names for their standard headers, and follow certain conventions for those names, but those conventions are not imposed on other source files. (Note that standard headers are not necessarily even implemented as files.)
An operating system, or a file system, or a compiler, or some other part of the environment might impose some requirements.
More specifically, Unix-like systems typically permit any characters in file names other than '/' (which is the directory path delimiter) and '\0' (which is the string terminator), and compilers typically permit any valid file name (possibly paying attention to the extension to determine which language to compile). Windows disallows some other characters. Case-sensitivity varies from one system to another; foo.c and Foo.c may or may not name the same file. The latter can be significant for foo.c vs. foo.C; sometimes .C is used as an extension for C++ source. Use something else if your code might be used on a case-insensitive file system (Windows, macOS).
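As a rough illustration of the case-sensitivity point, the sketch below compares two file names the way a case-insensitive file system might. The helper name `same_name_ignoring_case` is hypothetical, and it only folds ASCII letters; real file systems apply their own, more involved folding rules.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: ASCII-only, case-insensitive comparison of file names.
// Two names for which this returns true may collide on a case-insensitive
// file system.
bool same_name_ignoring_case(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) {
        return false;
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) {
            return false;
        }
    }
    return true;
}
```

With this, `foo.c` and `foo.C` compare equal, which is exactly the situation to avoid if the two are meant to be different files.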
There are a number of conventions for file extensions used to identify the contents of a source file, such as .c for C source, .h for a header file, .cpp or .cc or .cxx for C++ source, and so on. Consult your compiler's documentation.
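A minimal sketch of how a tool might apply these extension conventions is shown below. The enum, the function name and the exact mapping are assumptions made for illustration; no particular compiler is claimed to use this table.

```cpp
#include <string>

// Hypothetical classification, for illustration only.
enum SourceKind { kCSource, kCxxSource, kHeader, kUnknown };

// Classifies a file name by the text after its last period.
// The comparison is case-sensitive, so ".C" is treated as C++ and ".c" as C.
SourceKind classify_by_extension(const std::string& file_name) {
    const std::string::size_type dot = file_name.rfind('.');
    if (dot == std::string::npos) {
        return kUnknown;
    }
    const std::string ext = file_name.substr(dot);
    if (ext == ".c") {
        return kCSource;
    }
    if (ext == ".cpp" || ext == ".cc" || ext == ".cxx" || ext == ".C") {
        return kCxxSource;
    }
    if (ext == ".h") {
        return kHeader;
    }
    return kUnknown;
}
```

Note that underscores play no role here: `foo_.c` and `_foo.c` are classified exactly like `foo.c`.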
Related
T.C. left an interesting comment to my answer on this question:
Why aren't include guards in c++ the default?
T.C. states:
There's "header" and there's "source file". "header"s don't need to be
actual files.
What does this mean?
Perusing the standard, I see plenty of references to both "header files" and "headers". However, regarding #include, I noticed that the standard seems to make reference to "headers" and "source files". (C++11, § 16.2)
A preprocessing directive of the form
# include < h-char-sequence> new-line
searches a sequence of implementation-defined places for a header identified uniquely
by the specified sequence between the < and > delimiters, and causes the replacement
of that directive by the entire contents of the header. How the places are specified
or the header identified is implementation-defined.
and
A preprocessing directive of the form
# include " q-char-sequence" new-line
causes the replacement of that directive by the entire contents of the source *file*
identified by the specified sequence between the " delimiters. The named source *file*
is searched for in an implementation-defined manner.
I don't know if this is significant. It could be that "headers" in a C++ context unambiguously means "header files" but the word "sources" would be ambiguous so "headers" is a shorthand but "sources" is not. Or it could be that a C++ compiler is allowed leeway for bracket includes and only needs to act as if textual replacement takes place.
So when are header (files) not files?
The footnote mentioned by T.C. in the comments below is quite direct:
174) A header is not necessarily a source file, nor are the sequences
delimited by < and > in header names necessarily valid source file
names (16.2).
For the standard header "files" the C++ standard doesn't really make a mandate that the compiler uses a file or that the file, if it uses one, actually looks like a C++ file. Instead, the standard header files are specified to make a certain set of declarations and definitions available to the C++ program.
An alternative implementation to a file could be a readily packaged set of declarations represented in the compiler as a data structure which is made available when using the corresponding #include directive. I'm not aware of any compiler which does exactly that, but clang started to implement a module system which makes the headers available from some already processed format.
They do not have to be files. Since the C and C++ preprocessors are nearly identical, it is reasonable to look to the C99 rationale for some clarity on this. The Rationale for International Standard—Programming Languages—C says in section 7.1.2 Standard headers (emphasis mine):
In many implementations the names of headers are the names of files in
special directories. This implementation technique is not required,
however: the Standard makes no assumptions about the form that a file
name may take on any system. Headers may thus have a special status if
an implementation so chooses. Standard headers may even be built into
a translator, provided that their contents do not become “known” until
after they are explicitly included. One purpose of permitting these
header “files” to be “built in” to the translator is to allow an
implementation of the C language as an interpreter in a free-standing
environment where the only “file” support may be a network interface.
It really depends on the definition of files.
If you consider any database which maps filenames to contents to be a filesystem, then yes, headers are files. If you only consider files to be that which is recognized by the OS kernel open system call, then no, headers don't have to be files.
They could be stored in a relational database. Or a compressed archive. Or downloaded over the network. Or stored in alternate streams or embedded resources of the compiler executable itself.
In the end, though, textual replacement is done, and the text comes from some sort of indexed-by-name database.
Dietmar mentioned modules and loading already processed content... but this is generally NOT allowable behavior for #include according to the C++ standard (modules will have to use a different syntax, or perhaps #include with a completely new quotation scheme other than <> or ""). The only processing that could be done in advance is tokenization. But contents of headers and included source files are subject to stateful preprocessing.
Some compilers implement "precompiled headers" which have done more processing than mere tokenization, but eventually you find some behavior that violates the Standard. For example, in Visual C++:
The compiler ... skips to just beyond the #include directive associated with the .h file, uses the code contained in the .pch file, and then compiles all code after filename.
Ignoring the actual source code prior to #include definitely does not conform to the Standard. (That doesn't prevent it from being useful, but you need to be aware that edits may not produce the expected behavior changes)
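One standard header makes the "stateful preprocessing" point easy to see: the meaning of `<cassert>` depends on whether `NDEBUG` is defined at the point of inclusion, and the header may be included more than once with different results. The sketch below relies only on that standard behavior; the function names are made up for the example.

```cpp
#undef NDEBUG
#include <cassert>  // NDEBUG is not defined here, so assert is active

int evaluations_with_assert_enabled() {
    int n = 0;
    assert(++n == 1);  // the operand is evaluated, so n becomes 1
    return n;
}

#define NDEBUG
#include <cassert>  // same header, different preprocessor state

int evaluations_with_assert_disabled() {
    int n = 0;
    assert(++n == 1);  // expands to ((void)0); the operand is never evaluated
    return n;
}

#undef NDEBUG
#include <cassert>  // restore the active definition for any later code
```

A scheme that cached the fully processed contents of the header from the first inclusion could not give the second inclusion its different meaning, which is why anything beyond tokenization has to be done with care.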
Reading the answer from What are the rules about using an underscore in a c identifier I stumbled across the following quotation:
From the 2003 C++ Standard:
17.4.3.2.1 Global names [lib.global.names]
Certain sets of names and function signatures are always reserved to the implementation:
Each name that contains a double underscore (_ _) or begins with an underscore followed by an uppercase letter (2.11) is reserved to the implementation for any use.
Each name that begins with an underscore is reserved to the implementation for use as a name in the global namespace.165
165) Such names are also reserved in namespace ::std (17.4.3.1).
What exactly is meant with reserved for the implementation?
It means exactly that: you are only allowed to create such names if you are providing a compiler or standard library implementation.
The "implementation" refers to the "implementation of the C++ language". It consists of everything needed to execute a C++ program: A compiler, a standard library, hardware on which to execute, an operating system, a visualization system, input, etc.
The restriction in question means that your compiler may predefine names of the reserved form without telling you, or your standard library implementation may do so. For example, your standard library may define a macro __Foo, so if you tried to use __Foo as an identifier in your source code, you'd actually end up with the macro replacement.
The purpose of reserved names is to give your compiler and standard library freedom to express functionality in plain C++ without worrying about introducing name clashes with user code.
For a vivid example of how this is used in practice, just look at any header file of your standard library implementation.
Some reserved names have actually been made into well-defined, publicly available facilities: __FILE__, __cplusplus, __VA_ARGS__, to name a few. The C language (which has the same rules for reserved identifiers) has been using reserved names exclusively to introduce new keywords (e.g. _Bool).
Implementation here means the combination of the compiler (say gcc, msvc and so on), the standard library (says what features are included in the language), the operating system (Windows, Mac etc.) and the hardware (Intel, ARM and so on).
Depending upon the implementation, certain values are defined which the compiler uses to produce the object code that is specific to the implementation. For example
__TARGET_ARCH_ARM is defined by RealView #Matches first case
_M_ARM is defined by Visual Studio #Matches second case
to identify the target CPU architecture.
In short these clauses are meant to discourage you from using macros of mentioned format.
In fact, n3797->17.6.5.3 Restrictions on macro definitions says, if you wish to define macros of the aforementioned formats they are :
suitable for use in #if preprocessing directives, unless explicitly
stated otherwise.
Example :
#ifndef _M_ARM
#define _M_ARM // Say you're compiling for another platform
#endif
Note
Macros reserved for the implementation are not restricted to the format mentioned in the question. For instance, __arm__ is defined by gcc to identify the target architecture.
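In user code the usual pattern is to test such macros, never to define them. A minimal sketch, using only the three macro names mentioned in this answer (the function name and the returned strings are made up for the example):

```cpp
#include <string>

// Hypothetical helper: reports the target based on implementation-defined
// macros. The macros are only tested here, never defined by user code.
std::string target_architecture() {
#if defined(__arm__) || defined(_M_ARM) || defined(__TARGET_ARCH_ARM)
    return "arm";
#else
    return "other";
#endif
}
```

Which branch is taken depends entirely on the compiler and target, so the result differs from one build to another.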
For the purposes of this question, I am interested only in Standard-Compliant C++, not C or C++0x, and not any implementation-specific details.
Questions arise from time to time regarding the difference between #include "" and #include <>. The argument typically boils down to two differences:
Specific implementations often search different paths for the two forms. This is platform-specific, and not in the scope of this question.
The Standard says #include <> is for "headers" whereas #include "" is for a "source file." Here is the relevant reference:
ISO/IEC 14882:2003(E)
16.2 Source file inclusion [cpp.include]
1 A #include directive shall identify a header or source file that can be processed by the implementation.
2 A preprocessing directive of the form
# include < h-char-sequence > new-line
searches a sequence of implementation-defined places for a header identified uniquely by the specified sequence between the < and > delimiters, and causes the replacement of that directive by the entire contents of the header. How the places are specified or the header identified is implementation-defined.
3 A preprocessing directive of the form
# include "q-char-sequence" new-line
causes the replacement of that directive by the entire contents of the source file identified by the specified sequence between the " delimiters. The named source file is searched for in an implementation-defined manner. If this search is not supported, or if the search fails, the directive is reprocessed as if it read
# include < h-char-sequence > new-line
with the identical contained sequence (including > characters, if any) from the original directive.
(Emphasis in quote above is mine.) The implication of this difference seems to be that the Standard intends to differentiate between a 'header' and a 'source file', but nowhere does the document define either of these terms or the difference between them.
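The fallback in paragraph 3 can be observed directly. In the sketch below the quoted form is used for a standard header; assuming there is no file literally named `vector` where the implementation looks for source files, the directive ends up being handled like the angle-bracket form. The function name is made up for the example.

```cpp
// Assumption: no source file named "vector" is found by the quoted search,
// so the directive is reprocessed as if it read #include <vector>.
#include "vector"

int sum_first_three() {
    std::vector<int> values;
    values.push_back(1);
    values.push_back(2);
    values.push_back(3);
    return values[0] + values[1] + values[2];
}
```

This compiles on the implementations I know of, but it is shown only to illustrate the rule; writing `#include <vector>` remains the conventional spelling.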
There are few other places where headers or source files are even mentioned. A few:
158) A header is not necessarily a source file, nor are the sequences delimited by < and > in header names necessarily valid source file names (16.2).
Seems to imply a header may not reside in the filesystem, but it doesn't say that source files do, either.
2 Lexical conventions [lex]
1 The text of the program is kept in units called source files in this International Standard. A source file together with all the headers (17.4.1.2) and source files included (16.2) via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion (16.1) preprocessing directives, is called a translation unit. [Note: a C++ program need not all be translated at the same time. ]
This is the closest I could find to a definition, and it seems to imply that headers are not the "text of the program." But if you #include a header, doesn't it become part of the text of the program? This is a bit misleading.
So what is a header? What is a source file?
My reading is that the standard headers, included by use of <> angle brackets, need not be actual files on the filesystem; e.g. an implementation would be free to enable a set of "built-in" operations providing the functionality of iostream when it sees #include <iostream>.
On the other hand, "source files" included with #include "xxx.h" are intended to be literal files residing on the filesystem, searched in some implementation-dependent manner.
Edit: to answer your specific question, I believe that "headers" are limited only to those #includeable facilities specified in the standard: iostream, vector and friends---or by the implementation as extensions to the standard. "Source files" would be any non-standard facilities (as .h files, etc.) the programmer may write or use.
Isn't this saying that a header may be implemented as a source file, but then again may not be? As for "what is a source file", it seems very sensible for the standard not to spell this out, given the many ways that "files" are implemented.
The standard headers (string, iostream) don't necessarily have to be files with those names, or even files at all. As long as when you say
#include <iostream>
a certain list of declarations come into scope, the Standard is satisfied. Exactly how that comes about is an implementation detail. (when the Standard was being written, DOS could only handle 8.3 filenames, but some of the standard header names were longer than that)
As your quotes say: a header is something included using <>, and a source file is the file being compiled, or something included using "". Exactly where the contents of these come from, and what non-standard headers are available, is up to the implementation. All the Standard specifies is what is defined if you include the standard headers.
By convention, headers are generally system-wide things, and source files are generally local to a project (for some definition of project), but the standard wisely doesn't get bogged down in anything to do with project organisation; it just gives very general definitions that are compatible with such conventions, leaving the details to the implementation and/or the user.
Nearly all of the standard deals with the program after it's been preprocessed, at which time there are no such things as source files or headers, just the translation units that your last quote defines.
Hmmm...
My casual understanding has been that the distinction between <> includes and "" includes was inherited from C and (though not defined by the standards) the de facto meaning was that <> searched paths for system and compiler provided headers and "" also searched local and user specified paths.
The definition above seems to agree in some sense with that usage, but restricts the use of "header" to things provided by the compiler or system exclusive of code provided by the user, even if they have the traditional "interface goes in the header" form.
Anyway, very interesting.
I was taking a look through some open-source C++ code and I noticed a lot of double underscores used within the code, mainly at the start of variable names.
return __CYGWIN__;
Just wondering: Is there a reason for this, or is it just some people's code styles? I would think that it makes it hard to read.
From Programming in C++, Rules and Recommendations :
The use of two underscores (`__') in identifiers is reserved for the compiler's internal use according to the ANSI-C standard.
Underscores (`_') are often used in names of library functions (such as "_main" and "_exit"). In order to avoid collisions, do not begin an identifier with an underscore.
Unless they feel that they are "part of the implementation", i.e. the standard libraries, then they shouldn't.
The rules are fairly specific, and are slightly more detailed than some others have suggested.
All identifiers that contain a double underscore or start with an underscore followed by an uppercase letter are reserved for the use of the implementation at all scopes, i.e. they might be used for macros.
In addition, all other identifiers which start with an underscore (i.e. not followed by another underscore or an uppercase letter) are reserved for the implementation at the global scope. This means that you can use these identifiers in your own namespaces or in class definitions.
This is why Microsoft use function names with a leading underscore and all in lowercase for many of their core runtime library functions which aren't part of the C++ standard. These function names are guaranteed not to clash with either standard C++ functions or user code functions.
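The two rules above can be written down as a small classifier. This is only a sketch of the rules as stated in this answer; the enum and function names are hypothetical, and the code does not check that its input is a valid identifier in the first place.

```cpp
#include <cctype>
#include <string>

// Hypothetical names, for illustration only.
enum Reservation {
    kReservedEverywhere,         // double underscore, or '_' + uppercase letter
    kReservedInGlobalNamespace,  // any other leading underscore
    kNotReserved
};

Reservation classify_identifier(const std::string& id) {
    // Rule 1a: a double underscore anywhere in the identifier.
    if (id.find("__") != std::string::npos) {
        return kReservedEverywhere;
    }
    // Rule 1b: a leading underscore followed by an uppercase letter.
    if (id.size() >= 2 && id[0] == '_' &&
        std::isupper(static_cast<unsigned char>(id[1]))) {
        return kReservedEverywhere;
    }
    // Rule 2: any other identifier that begins with an underscore.
    if (!id.empty() && id[0] == '_') {
        return kReservedInGlobalNamespace;
    }
    return kNotReserved;
}
```

So `__CYGWIN__`, `my__var` and `_Bool` are off limits everywhere, `_exit` is only a problem in the global namespace, and a trailing underscore as in `member_` is fine.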
According to the C++ Standard, identifiers starting with one underscore are reserved for libraries. Identifiers starting with two underscores are reserved for compiler vendors.
The foregoing comments are correct. __Symbol__ is generally a magic token provided by your helpful compiler (or preprocessor) vendor. Perhaps the most widely-used of these are __FILE__ and __LINE__, which are expanded by the C preprocessor to indicate the current filename and line number. That's handy when you want to log some sort of program assertion failure, including the textual location of the error.
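A minimal sketch of that logging use follows. `__FILE__` and `__LINE__` are the standard predefined macros; `format_location` and `CURRENT_LOCATION` are names I made up for the example.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats a source location as "file:line".
std::string format_location(const char* file, int line) {
    std::ostringstream out;
    out << file << ':' << line;
    return out.str();
}

// Expands at the point of use, so __FILE__ and __LINE__ describe the caller.
// The macro name deliberately avoids leading or double underscores.
#define CURRENT_LOCATION() format_location(__FILE__, __LINE__)
```

The macro wrapper is needed because `__FILE__` and `__LINE__` are replaced where they appear in the source; inside an ordinary function they would always report the function's own location.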
Double underscores are reserved to the implementation
The top voted answer cites Programming in C++: Rules and Recommendations:
"The use of two underscores (`__') in identifiers is reserved for the compiler's internal use according to the ANSI-C standard."
However, after reading a few C++ and C standards, I was unable to find any mention of underscores being restricted to just the compiler's internal use. The standards are more general, reserving double underscores for the implementation.
C++
C++ (current working draft, accessed 2019-5-26) states in lex.name:
Each identifier that contains a double underscore __ or begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
Each identifier that begins with an underscore is reserved to the implementation for use as a name in the global namespace.
C
Although this question is specific to C++, I've cited relevant sections from C standards 99 and 17:
C99 section 7.1.3
All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use.
All identifiers that begin with an underscore are always reserved for use as identifiers with file scope in both the ordinary and tag name
spaces.
C17 says the same thing as C99.
What is the implementation?
For C/C++, the implementation loosely refers to the set of resources necessary to produce an executable from user source files. This includes:
preprocessor
compiler
linker
standard library
Example implementations
There are a number of different C++ implementations mentioned on Wikipedia. (no anchor link, ctrl+f "implementation")
Here's an example of Digital Mars' C/C++ implementation reserving some keywords for a feature of theirs.
It's something you're not meant to do in 'normal' code. This ensures that compilers and system libraries can define symbols that won't collide with yours.
In addition to libraries, which many other people responded about, some people also name macros or #define values this way for use with the preprocessor. This would make it easier to work with, and may have allowed bugs in older compilers to be worked around.
Like others mentioned, it helps prevent name collision and helps to delineate between library variables and your own.