I can't quite understand what is meant by the following passage in this article:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1566.htm
It is interesting to note that C89 explicitly allowed only letters in
header and include file names. C++ added underscores, and C99 added
digits. Probably both standards should allow both.
I found the following statements in all C and C++ standards:
ISO/IEC 9899:1990
6.1.7 Header names
Syntax
1 header-name:
< h-char-sequence >
" q-char-sequence "
h-char-sequence:
h-char
h-char-sequence h-char
h-char:
any member of the source character set except
the new-line character and >
q-char-sequence:
q-char
q-char-sequence q-char
q-char:
any member of the source character set except
the new-line character and "
ISO/IEC 9899:1990
5.2.1 Character sets
...
Both the basic source and basic execution character sets shall have the following
members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
For example, I see the underscore and the digits even in C89/C90.
It's referring to this:
There shall be an implementation-defined mapping between the delimited
sequence and the external source file name. The implementation shall
provide unique mappings for sequences consisting of one or more
letters (as defined in $2.2.1) followed by a period (.) and a single
letter. The implementation may ignore the distinctions of
alphabetical case and restrict the mapping to six significant
characters before the period.
(C89)
This is the C99 version:
The implementation shall provide unique mappings for sequences
consisting of one or more letters or digits (as defined in 5.2.1)
followed by a period (.) and a single letter. The first character shall
be a letter. The implementation may ignore the distinctions of
alphabetical case and restrict the mapping to eight significant
characters before the period.
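As a rough illustration of the C99 wording, here is a minimal C++ sketch that checks whether a header name falls into the class for which unique mappings are guaranteed. The function names are hypothetical, and the letter and digit tests assume an ASCII-like encoding in which a-z and A-Z are contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ASCII-only helpers (assumption: contiguous letter ranges, as in ASCII).
bool is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Hypothetical helper: true if `name` has the form for which C99 guarantees a
// unique mapping: a letter, then letters or digits, a period, one letter.
bool has_guaranteed_mapping_c99(const std::string& name) {
    if (name.size() < 3) return false;               // shortest form is "a.h"
    const std::size_t dot = name.size() - 2;         // expected period position
    if (name[dot] != '.' || !is_letter(name.back())) return false;
    if (!is_letter(name[0])) return false;           // first character: a letter
    for (std::size_t i = 1; i < dot; ++i) {
        if (!is_letter(name[i]) && !is_digit(name[i])) return false;
    }
    return true;
}
```

Under these rules stdio.h and utf8.h are covered, while my_header.h (underscore) and 8bit.h (leading digit) are not. The sketch does not model the optional case folding or the eight-significant-character limit.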
Related
Is it ok to write the following code?
const char* str = "§some-text";
Will str contain the correct UTF-8 representation of the § character if the source file was saved in UTF-8 encoding?
Or is the only way to write it to use u8-prefixed string literals?
Whether you can use Unicode characters in your source code (not just in string literals) is implementation-defined. The only way to be portable is to stick to characters in the "basic source character set" and use u8"\u00a7some-text".
[lex.phases]/1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
The "basic source character set" is:
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
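A minimal sketch of the portable spelling, for anyone who wants to check the resulting bytes. The names kText and code_unit are made up for this example. The source below contains only basic source characters, and auto is used because the element type of a u8 literal is char before C++20 and char8_t from C++20 on.

```cpp
#include <cassert>
#include <cstddef>

// The section sign is written as a universal-character-name, so the file
// itself never contains a character outside the basic source character set.
const auto* const kText = u8"\u00a7some-text";

// Returns the i-th code unit of kText as an unsigned value.
unsigned code_unit(std::size_t i) {
    return static_cast<unsigned char>(kText[i]);
}
```

U+00A7 is encoded in UTF-8 as the two bytes 0xC2 0xA7, so code_unit(0) and code_unit(1) should yield those values, followed by the bytes of "some-text".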
What are the valid characters for the file name in the #include directive?
Under Linux, for example, I can use pretty much any character except / and \0 in a valid file name, but I would expect the C preprocessor to be more restrictive about the names of files I can include.
In C++, the source character set is defined in [lex.charset]:
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
And the grammar for #includes allows anything in the source character set except the characters that would interfere with parsing the directive: new-lines, and either > or " depending on the delimiter used ([lex.header]):
header-name:
< h-char-sequence >
" q-char-sequence "
h-char-sequence:
h-char
h-char-sequence h-char
h-char:
any member of the source character set except new-line and >
q-char-sequence:
q-char
q-char-sequence q-char
q-char:
any member of the source character set except new-line and "
So something like this is perfectly† fine:
#include <hel& lo% world$#-".h>
#include ">.<>."
†For some definition of "perfectly" anyway...
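To make the grammar concrete, here is a small C++ sketch of the two delimiter rules. The function names are hypothetical, and the checks cover only the excluded characters: they do not verify membership in the source character set, and they say nothing about whether an implementation can actually map the name to a file.

```cpp
#include <cassert>
#include <string>

// h-char-sequence: one or more characters, none of them new-line or '>'.
bool is_h_char_sequence(const std::string& s) {
    return !s.empty() && s.find_first_of("\n>") == std::string::npos;
}

// q-char-sequence: one or more characters, none of them new-line or '"'.
bool is_q_char_sequence(const std::string& s) {
    return !s.empty() && s.find_first_of("\n\"") == std::string::npos;
}
```

Both names from the example above pass their respective check, while >.<>. is rejected as an h-char-sequence because it contains >.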
I'm going through my homework and can't seem to figure out how to do this one.
Say the alphabet is {a,b,c}, and we want an expression that matches strings with an even number of c's.
Example strings that are included:
the empty string,
ccab
abcc
cabc
ababababcc
and so on: anything with an even number of c's.
You can use this regex to allow only an even number of c's in the input:
^(?=(([^c\n]*c){2})*[^\nc]*$)[abc]*$
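The regexes below match strings that contain an even, nonzero number of c's:
^(?:[^c]*c[^c]*c[^c\n]*)+?$
or
^(?:[ab]*c[ab]*c[ab]*)+?$
Both require at least one pair of c's, so they reject the empty string and strings such as ab. To accept those as well, use ^[ab]*(?:c[ab]*c[ab]*)*$ instead.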
Assuming that it is the total number of c's that counts, not the number of consecutive c's, there is a nice theoretical approach, based on the fact that **the strings with an even number of c's can be recognized by a finite-state automaton with two states**.
The first state is the initial state, and it is also an accepting state. The second one is a rejecting state. Each c toggles us between the states. Other letters do nothing.
Now, you can convert this simple machine to a regex using one of the standard conversion methods, such as state elimination.
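Before converting anything, the two-state machine itself can be sketched directly in C++. This is only an illustration of the automaton described above; the names State and has_even_number_of_c are made up for the example.

```cpp
#include <cassert>
#include <string>

// Even is both the initial state and the accepting state; Odd rejects.
enum class State { Even, Odd };

bool has_even_number_of_c(const std::string& input) {
    State state = State::Even;
    for (char ch : input) {
        if (ch == 'c') {
            // Each c toggles the state; every other letter leaves it alone.
            state = (state == State::Even) ? State::Odd : State::Even;
        }
    }
    return state == State::Even;
}
```

All the example strings from the question (the empty string, ccab, abcc, cabc, ababababcc) end in the Even state, while something like abc ends in Odd.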
Something like
^([^c]*(c[^c]*c)+)*[^c]*$
ought to do it. We can break it down as follows:
^ # - start-of-line, followed by
( # - a group, consisting of
[^c]* # - zero or more characters other than 'c', followed by
( # - a group, consisting of
c # - the literal character 'c', followed by
[^c]* # - zero or more characters other than 'c', followed by
c # - the literal character 'c'
)+ # repeated one or more times
)* # repeated zero or more times, followed by
[^c]* # - a final sequence of zero or more characters other than 'c', followed by
$ # - end-of-line
One might note that something like the following C# method will likely perform better and be easier to understand:
public static class StringExtensions
{
    // Extension methods must be static members of a static class.
    public static bool ContainsEvenNumberOfCharacters( this string s , char x )
    {
        int cnt = 0 ;
        foreach( char c in s )
        {
            cnt += ( c == x ? 1 : 0 ) ;
        }
        bool isEven = 0 == (cnt&1) ; // it's even if the low-order bit is off.
        return isEven ;
    }
}
Simply
/^[^c]*((c[^c]*){2})*$/
In English:
Any number of non-c's, followed by zero or more groups, each of which contains exactly two instances of a c, with every c followed by any number of non-c's.
This solution has the advantage that it is easily extended to the case of a string whose number of c's is a multiple of 3, etc., and makes no assumptions about the alphabet.
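The extension to other multiples can be sketched in C++ with std::regex. The helper names are hypothetical, k is assumed to be at least 1, and the pattern shape ^[^c]*((c[^c]*){k})*$ is an assumption chosen so that strings containing no c at all are also accepted.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Builds ^[^c]*((c[^c]*){k})*$ : a run of non-c characters, then zero or more
// groups of exactly k c's, each c followed by a run of non-c characters.
std::regex multiple_of_k_cs(unsigned k) {
    return std::regex("^[^c]*((c[^c]*){" + std::to_string(k) + "})*$");
}

// True if the number of c's in s is a multiple of k (k >= 1 assumed).
bool count_of_c_is_multiple_of(const std::string& s, unsigned k) {
    return std::regex_match(s, multiple_of_k_cs(k));
}
```

With k = 2 this accepts the example strings from the question, and with k = 3 it accepts a string such as cacbc. Compiling the regex on every call is wasteful, so real code would probably cache it.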
Recently a friend of mine encountered this question in an interview. The interviewer asked him whether special characters like $, @, |, ^, ~ have any use in C or C++, and where.
I know that |, ^ and ~ are used as bitwise OR, XOR and complement respectively.
But I don't know whether @ and $ have any special meaning. If they do, could you please give an example of where they can be applied?
@ is generally invalid in C; it is not used for anything. It is used for various purposes by Objective-C, but that's a whole other kettle of fish.
$ is invalid as well, but many implementations allow it to appear in identifiers, just like a letter. (In these implementations, for instance, you could name a variable or function $$$ if you liked.) Even there, though, it doesn't have any special meaning.
To complete the accepted answer, the @ can be used to specify the absolute address of a variable with some embedded-systems compilers.
unsigned char buf[128]@0x2000;
Note this is a non-standard compiler extension.
Check out a good explanation here
To complete the other answers, the C99 standard says in 5.2.1, paragraph 3:
Both the basic source and basic execution character sets shall have
the following members:
the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
Any other character might not even exist on a given implementation (and should not be used).
But there is also this point under Common extensions in Annex J (J.5.2):
Characters other than the underscore _, letters, and digits, that are not part of the basic
source character set (such as the dollar sign $, or characters in national character sets)
may appear in an identifier (6.4.2).
Which is basically what duskwuff already wrote.
I read the following code from an open source library. What confuses me is the usage of dollar sign. Can anyone please clarify the meaning of $ in the code. Your help is greatly appreciated!
__forceinline MutexActive( void ) : $lock(LOCK_IS_FREE) {}
void lock ( void );
__forceinline void unlock( void ) {
__memory_barrier(); // compiler must not schedule loads and stores around this point
$lock = LOCK_IS_FREE;
}
protected:
enum ${ LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
Atomic $lock;
There is a gcc switch, -fdollars-in-identifiers, which explicitly allows $ in identifiers.
Perhaps they enable it and use the $ as something that is highly unlikely to clash with normal names.
-fdollars-in-identifiers
Accept $ in identifiers. You can also explicitly prohibit use of $ with the option -fno-dollars-in-identifiers. (GNU C allows $ by
default on most target systems, but there are a few exceptions.)
Traditional C allowed the character $ to form part of identifiers.
However, ISO C and C++ forbid $ in identifiers.
See the gcc documentation. Hopefully the link stays good.
It is being used as part of an identifier.
[C++11: 2.11/1] defines an identifier as "an arbitrarily long sequence of letters and digits." It defines "letters and digits" in a grammar given immediately above, which names only numeric digits, lower- and upper-case roman letters, and the underscore character explicitly, but does also allow "other implementation-defined characters", of which this is presumably one.
In this scenario the $ has no special meaning other than as part of an identifier — in this case, the name of a variable. There is no special significance with it being at the start of the variable name.
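A stripped-down sketch of the same naming style, to show that nothing special is going on. It assumes a compiler that accepts $ in identifiers as an extension (GCC does by default on most targets, per the documentation quoted earlier); a strictly conforming compiler may reject it. The Counter class and count_after_two_increments are invented for this example and are not from the library in the question.

```cpp
#include <cassert>

// Non-portable: compiles only where $ is accepted in identifiers.
struct Counter {
    Counter() : $count(0) {}
    void increment() { ++$count; }
    int value() const { return $count; }
private:
    int $count;  // an ordinary data member whose name happens to start with $
};

// Increments a fresh counter twice and reports its value.
int count_after_two_increments() {
    Counter c;
    c.increment();
    c.increment();
    return c.value();
}
```

Renaming $count to count_ (or any other valid name) would give exactly the same behaviour, so the $ is purely a naming convention.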
Even though the dollar sign is not valid in identifiers according to the standard, an implementation may accept it. For example, Visual Studio (and I think GCC too, but I'm not sure about that) seems to accept it.
Check this doc: http://msdn.microsoft.com/en-us/library/565w213d(v=vs.80).aspx
and this: Are dollar-signs allowed in identifiers in C++03?
The C++ standard says:
The basic source character set consists of 96 characters: the space
character, the control characters representing horizontal tab,
vertical tab, form feed, and new-line, plus the following 91 graphical
characters: a b c d e f g h i j k l m n o p q r s t u v w x y z A B C
D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
There is no $ in the basic source character set described above; the $ character in your code is an extension to the basic source character set, which implementations are not required to support. Consider Britain, where the pound symbol (£ or ₤) is used in place of the dollar symbol ($).