L_ macro in glibc source code - c++

I was reading through the source code of glibc and found that it has two macros with the same name.
This one is on line 105:
#define L_(Str) L##Str
and this one on line 130:
#define L_(Str) Str
What do these macros really mean? They only seem to be used for comparing two characters.
For example, on line 494 you can see it being used to compare the character values of *f and '$':
if (*f == L_('$')). If we wanted to compare the two characters, couldn't we have compared them directly instead of routing them through a macro? Also, what is the use case for the macro on line 105?

It prepends the macro argument with the L prefix, turning it into a wchar_t literal (wchar_t uses as large a data type as is needed to represent every possible character code point, instead of the normal 8-bit char), when you are compiling the wscanf version of the function (line 105). Otherwise it just passes the argument through as-is (line 130).
## is the token-pasting operator of the C preprocessor, so L##'$' eventually expands to L'$'.
To sum up: it is used to compile two mutually exclusive versions of the vfscanf function, one operating on wchar_t and one on char.
Check out this answer: What exactly is the L prefix in C++?
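To make the expansion concrete, here is a small self-contained sketch (not the glibc source itself) that you can run through the preprocessor or a compiler:
#include <wchar.h>

#define L_(Str) L##Str   /* token pasting: L_('$') becomes L'$' */

int main(void)
{
    wchar_t wc = L_('$');        /* same as: wchar_t wc = L'$'; */
    return wc == L'$' ? 0 : 1;   /* the two literals are identical */
}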

Let's read the code. (I have no idea what it does, but I can read code.)
First, why are there two defines, as you point out? One of them is used when COMPILE_WSCANF is defined, the other is used otherwise. What is COMPILE_WSCANF? If we look further down the file, we can see that different functions are defined. When COMPILE_WSCANF is defined, the function we end up with (through various macros) is vfwscanf; otherwise we get vfscanf. This is a pretty good indication that this file is used to compile two different functions: one for normal characters, one for wide characters. Most likely, the build system compiles the file twice with different defines. This is done so that we don't have to write the same file twice, since the normal and wide character functions will be pretty similar.
I'm pretty sure that means that this macro has something to do with wide characters. If we look at how it's used, it is used to wrap character constants in comparisons and such. When 'x' is a normal character constant, L'x' is a wide character constant (wchar_t type) representing the same character.
So the macro is used to wrap character constants inside the code so that we don't have to have #ifdef COMPILE_WSCANF.
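A stripped-down sketch of that pattern (hypothetical file and function names, not the actual glibc code) would look something like this:
#include <wchar.h>

/* The build system compiles this one file twice: once with
   -DCOMPILE_WSCANF, once without (hypothetical sketch). */
#ifdef COMPILE_WSCANF
typedef wchar_t CHAR_T;
# define L_(Str)   L##Str
# define FUNC_NAME my_vfwscanf
#else
typedef char CHAR_T;
# define L_(Str)   Str
# define FUNC_NAME my_vfscanf
#endif

/* One function body serves both builds; L_ keeps the character
   constants in sync with CHAR_T. */
int FUNC_NAME(const CHAR_T *f)
{
    if (*f == L_('$'))
        return 1;
    return 0;
}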

Using #define to a string in C++

While trying to upload an image with cURL, I am confused because of the code below.
#define UPLOAD_FILE_AS "testImage.jpg"
static const char buf_1 [] = "RNFR " UPLOAD_FILE_AS;
What I want to understand is:
the exact type of the defined UPLOAD_FILE_AS: an array of char, a string, or something else?
the exact operation performed in the second line: after the second line, buf_1 becomes "RNFR testImage.jpg". But the second line only has a space between "RNFR " and UPLOAD_FILE_AS. I've never heard that a space can replace the "+" operator or a merging function. How is this possible?
The definition of the macro is preprocessor-style: it is just a sequence of characters that happens to be enclosed in "". There is no type. The macro is expanded before the compiler (with its notion of types) starts the actual compilation.
C++ will always concatenate adjacent character-sequences-within-"": "A" "B" will always be handled as "AB" during building. There is no operator, not even an implicit one. This is often used to build very long string literals spanning several lines of code.
#defines do not create a variable, so there is no concept of type. What happens is that, before the source file is compiled, the preprocessor is run. It literally replaces all instances of your macro, namely UPLOAD_FILE_AS, with its value ("testImage.jpg").
In other words, after the preprocessor stage, your code looks like this:
static const char buf_1 [] = "RNFR " "testImage.jpg";
And since adjacent string literals are concatenated automatically, both of these strings become one: "RNFR testImage.jpg". You can find a better explanation here: link, mainly:
String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!". If the two strings have the same encoding prefix (or neither has one), the resulting string will have the same encoding prefix (or no prefix).
exact type of defined UPLOAD_FILE_AS
There is no type. It is not a variable. It is a macro. Macros exist entirely outside of the type system.
The pre-processor replaces all instances of the macro with its definition. See the next paragraph for an example.
exact operation performed in second line
The operation is macro replacement. After the file is pre-processed, the second line becomes:
static const char buf_1 [] = "RNFR " "testImage.jpg";
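Putting the two steps together in a small compilable example (the names are taken from the question):
#include <cstdio>

#define UPLOAD_FILE_AS "testImage.jpg"

/* Preprocessing turns the next line into:
       static const char buf_1[] = "RNFR " "testImage.jpg";
   and the adjacent literals are then merged into "RNFR testImage.jpg". */
static const char buf_1[] = "RNFR " UPLOAD_FILE_AS;

int main()
{
    std::puts(buf_1);                          /* prints: RNFR testImage.jpg */
    std::printf("%zu\n", sizeof buf_1 - 1);    /* 18 characters, not counting '\0' */
}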

If a C++ compiler supports the Unicode character set, does that necessarily mean the basic character set of the implementation is also Unicode?

Consider the following statement -
cout << "∫";
It displays an integration sign (a Unicode character) when compiled with my g++ 4.8.2.
1). Does it mean the basic character set of this implementation is also Unicode?
If yes, then consider the following statement -
C++ defines 'byte' differently: a C++ byte consists of at least enough bits to accommodate the total number of characters in the implementation's basic character set.
2). If my compiler supports Unicode, then according to the above definition of 'byte', the number of bits in a byte must be greater than 8. Hence CHAR_BIT > 8 here, right? But my compiler shows CHAR_BIT == 8. Why?
Reference : C++ Primer Plus
P.S. I'm a beginner. Don't throw me into the complex technical details. Keep it simple and straight. Thanks in advance!
Unicode has nothing to do with your compiler or C++ defining "byte" differently. It's simply about separating the concept of "byte" and "character" at the string level and the string level alone.
The only time Unicode's multi-byte characters come into play is during display and when manipulating the strings. See also the difference between std::wstring and std::string for a more technical explanation.
The compiler just compiles. It doesn't care about your character set except when it comes to dealing with the source-code.
Bytes are, as on virtually every modern platform, 8 bits.
Does it mean the basic character set of this implementation is also Unicode?
No, there is no such requirement, and there are very few implementations where char is large enough to hold arbitrary Unicode characters.
char is large enough to hold members of the basic character set, but what happens with characters outside the basic character set depends on the implementation.
On some systems, everything might be converted to one character set such as ISO8859-1 which has fewer than 256 characters, so fits entirely in char.
On other systems, everything might be encoded as UTF-8, meaning a single logical character potentially takes up several char values.
Many compilers support UTF-8, with the basic character set being ASCII. In UTF-8, a Unicode code point consists of 1 to 4 bytes, so typically 1 to 4 chars. UTF-8 is designed so that most C and C++ code works just fine with it without any direct support. Just be aware that, for example, strlen() returns the number of bytes, not the number of code points. But most of the time you don't really care about that. (Functions like strncpy, which are dangerous anyway, become just slightly more dangerous with UTF-8.)
And of course forget about using char to store a Unicode code point. But then once you get into a bit more sophisticated string handling, many, many things cannot be done on a character level anyway.
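A short example of that byte-versus-code-point distinction, assuming the source and execution character sets are UTF-8:
#include <climits>
#include <cstdio>
#include <cstring>

int main()
{
    /* In UTF-8, '€' (U+20AC) is encoded as the three bytes 0xE2 0x82 0xAC. */
    const char *s = "€uro";

    std::printf("CHAR_BIT  = %d\n", CHAR_BIT);           /* still 8 */
    std::printf("strlen(s) = %zu\n", std::strlen(s));    /* 6 bytes, but only 4 code points */
}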

C++ language symbol separator

I need to parse some C++ files to get some information out of them. One use case: I have an enum value "ID_XYZ", and I want to find out how many times it appears in a source file. So my question is: which separator symbols divide tokens in C++?
You can't really tokenize C or C++ source code based purely on separator characters -- you pretty much need to read in a character at a time, and figure out whether that character can be part of the current token or not.
Just for a couple of examples, when you see a C-style begin-comment token, you need to look at characters until you encounter a close-comment token. Likewise for strings and pre-processor directives (e.g., #if 0 .... #endif sequences). To do it truly correctly, you also need to deal with trigraphs. For example, consider something like this:
// Why doesn't this work??/
ID_XYZ = 1;
If the lexer doesn't handle trigraphs correctly, it will probably identify this as an instance of your ID_XYZ -- but in reality, it's not -- the ??/ at the end of the previous line is really a trigraph that resolves to \, which means the "single-line" comment actually extends to the end of the next line, and the apparent instance of ID_XYZ is really part of the comment.

Strings. TCHAR LPWCS LPCTSTR CString. What's what here, simple and quick

TCHAR szExeFileName[MAX_PATH];
GetModuleFileName(NULL, szExeFileName, MAX_PATH);
CString tmp;
lstrcpy(szExeFileName, tmp);
CString out;
out.Format("\nInstall32 at %s\n", tmp);
TRACE(tmp);
Error (At the Format):
error C2664: 'void ATL::CStringT<BaseType,StringTraits>::Format(const wchar_t *,...)' : cannot convert parameter 1 from 'const char [15]' to 'const wchar_t *'
I'd just like to get the current path that this program was launched from and copy it into a CString so I can use it elsewhere. I am currently just trying to see the path by TRACE'ing it out. But strings, chars, char arrays: I can never keep them all straight. Could someone give me a pointer?
The accepted answer addresses the problem. But the question also asked for a better understanding of the differences among all the character types on Windows.
Encodings
A char on Windows (and virtually all other systems) is a single byte. A byte is typically interpreted as either an unsigned value [0..255] or a signed value [-128..127]. (The standard guarantees a signed range of only [-127..127], but most implementations use two's complement and give [-128..127]; C++20 finally makes two's complement mandatory.)
ASCII is a character mapping for values in the range [0..127] to particular characters, so you can store an ASCII character in either a signed byte or an unsigned byte, and thus it will always fit in a char.
But ASCII doesn't have all the characters necessary for most languages, so the character sets were often extended by using the rest of the values available in a byte to represent the additional characters needed for certain languages (or families of languages). So, while [0..127] almost always mean the same thing, values like 150 can only be interpreted in the context of a particular encoding. For single-byte alphabets, these encodings are called code pages.
Code pages helped, but they didn't solve all the problems. You always had to know which code page a particular document used in order to interpret it correctly. Furthermore, you typically couldn't write a single document that used different languages.
Also, some languages have more than 256 characters, so there was no way to map one char to one character. This led to the development of multi-byte character encodings, where [0..127] is still ASCII, but some of the other values are "escapes" that mean you have to look at some number of following chars to figure out what character you really had. (It's best to think of multi-byte as variable-byte, as some characters require only one byte while other require two or more.) Multi-byte works, but it's a pain to code for.
Meanwhile, memory was becoming more plentiful, so a bunch of organizations got together and created Unicode, with the goal of making a universal mapping of values to characters (for appropriately vague definitions of "characters"). Initially, it was believed that all characters (or at least all the ones anyone would ever use) would fit into 16-bit values, which was nice because you wouldn't have to deal with multi-byte encodings--you'd just use two bytes per character instead of one. About this time, Microsoft decided to adopt Unicode as the internal representation for text in Windows.
WCHAR
So Windows has a type called WCHAR, a two-byte value that represents a "Unicode" "character". I'm using quotation marks here because Unicode evolved past the original two-byte encoding, so what Windows calls "Unicode" isn't really Unicode today--it's actually a particular encoding of Unicode called UTF-16. And a "character" is not as simple a concept in Unicode as it was in ASCII, because, in some languages, characters combine or otherwise influence adjacent characters in interesting ways.
Newer versions of Windows used these 16-bit WCHAR values for text internally, but there was a lot of code out there still written for single-byte code pages, and even some for multi-byte encodings. Those programs still used chars rather than WCHARs. And many of these programs had to work with people using older versions of Windows that still used chars internally as well as newer ones that use WCHAR. So a technique using C macros and typedefs was devised so that you could mostly write your code one way and--at compile time--choose to have it use either char or WCHAR.
TCHAR
To accomplish this flexibility, you use a TCHAR for a "text character". In some header file (often <tchar.h>), TCHAR would be typedef'ed to either char or WCHAR, depending on the compile time environment. Windows headers adopted conventions like this:
LPTSTR is a (long) pointer to a string of TCHARs.
LPWSTR is a (long) pointer to a string of WCHARs.
LPSTR is a (long) pointer to a string of chars.
(The L for "long" is a leftover from 16-bit days, when we had long, far, and near pointers. Those are all obsolete today, but the L prefix tends to remain.)
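A simplified sketch of the idea (the real <tchar.h> and Windows headers contain far more, and the exact details differ):
#ifdef _UNICODE
typedef wchar_t TCHAR;
#else
typedef char    TCHAR;
#endif

typedef TCHAR   *LPTSTR;    /* (long) pointer to a string of TCHARs */
typedef wchar_t *LPWSTR;    /* (long) pointer to a string of WCHARs */
typedef char    *LPSTR;     /* (long) pointer to a string of chars  */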
Most of the Windows API functions that take and return strings were actually replaced with two versions: the A version (for "ANSI" characters) and the W version (for wide characters). (Again, historical legacy shows in these. The code pages scheme was often called ANSI code pages, though I've never been clear if they were actually ruled by ANSI standards.)
So when you call a Windows API like this:
SetWindowText(hwnd, lptszTitle);
what you're really doing is invoking a preprocessor macro that expands to either SetWindowTextA or SetWindowTextW. The choice is consistent with how TCHAR is defined: if you want strings of chars, you get the A version, and if you want strings of WCHARs, you get the W version.
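Roughly, the Windows headers contain something like this (simplified sketch):
#ifdef UNICODE
#define SetWindowText  SetWindowTextW
#else
#define SetWindowText  SetWindowTextA
#endif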
But it's a little more complicated because of string literals. If you write this:
SetWindowText(hwnd, "Hello World"); // works only in "ANSI" mode
then that will only compile if you're targeting the char version, because "Hello World" is a string of chars, so it's only compatible with the SetWindowTextA version. If you wanted the WCHAR version, you'd have to write:
SetWindowText(hwnd, L"Hello World"); // only works in "Unicode" mode
The L here means you want wide characters. (The L actually stands for long, but it's a different sense of long than the long pointers above.) When the compiler sees the L prefix on the string, it knows that string should be encoded as a series of wchar_ts rather than chars.
(Compilers targeting Windows use a two-byte value for wchar_t, which happens to be identical to what Windows defines as a WCHAR. Compilers targeting other systems often use a four-byte value for wchar_t, which is what it really takes to hold a single Unicode code point.)
So if you want code that can compile either way, you need another macro to wrap the string literals. There are two to choose from: _T() and TEXT(). They work exactly the same way. The first comes from the compiler's library and the second from the OS's libraries. So you write your code like this:
SetWindowText(hwnd, TEXT("Hello World")); // compiles in either mode
If you're targeting chars, the macro is a no-op that just returns the regular string literal. If you're targeting WCHARs, the macro prepends the L.
So how do you tell the compiler that you want to target WCHAR? You define UNICODE and _UNICODE. The former is for the Windows APIs and the latter is for the compiler libraries. Make sure you never define one without the other.
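Conceptually the wrappers boil down to something like this (simplified; the real macros go through an extra level of expansion):
#ifdef _UNICODE
#define _T(x)  L##x    /* _T("Hi") -> L"Hi", a string of wchar_t */
#else
#define _T(x)  x       /* _T("Hi") -> "Hi",  a string of char    */
#endif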
My guess is you are compiling in Unicode mode.
Try enclosing your format string in the _T macro, which is designed to provide an always-correct method of providing constant string parameters, regardless of whether you're compiling in Unicode or ANSI mode:
out.Format(_T("\nInstall32 at %s\n"), tmp);

Correct use of string storage in C and C++

Popular software developers and companies (Joel Spolsky, Fog Creek Software) tend to use wchar_t for Unicode character storage when writing C or C++ code. When and how should one use char and wchar_t with respect to good coding practices?
I am particularly interested in POSIX compliance when writing software that leverages Unicode.
When using wchar_t, you can look up characters in an array of wide characters on a per-character or per-array-element basis:
/* C code fragment */
const wchar_t *overlord = L"ov€rlord";
if (overlord[2] == L'€')
wprintf(L"Character comparison on a per-character basis.\n");
How can you compare Unicode bytes (or characters) when using char?
So far my preferred way of comparing strings and characters of type char in C often looks like this:
/* C code fragment */
const char *mail[] = { "ov€rlord@masters.lt", "ov€rlord@masters.lt" };
if (mail[0][2] == mail[1][2] && mail[0][3] == mail[1][3] && mail[0][4] == mail[1][4])
printf("%s\n%zu", *mail, strlen(*mail));
This method scans for the byte equivalent of a Unicode character. The Unicode euro sign € takes up 3 bytes in UTF-8, so one needs to compare three char array bytes to know whether the Unicode characters match. Often you need to know the size of the character or string you want to compare, and the exact bytes it encodes to, for the solution to work. This does not look like a good way of handling Unicode at all. Is there a better way of comparing strings and character elements of type char?
In addition, when using wchar_t, how can you scan the file contents to an array? The function fread does not seem to produce valid results.
If you know that you're dealing with Unicode, neither char nor wchar_t is appropriate on its own, as their sizes are compiler- and platform-defined. For example, wchar_t is 2 bytes on Windows (MSVC) but 4 bytes on Linux (GCC). The C11 and C++11 standards are a bit more rigorous and define two new character types (char16_t and char32_t) with associated literal prefixes for creating UTF-8, UTF-16, and UTF-32 strings.
If you need to store and manipulate unicode characters, you should use a library that is designed for the job, as neither the pre-C11 nor pre-C++11 language standards have been written with unicode in mind. There are a few to choose from, but ICU is quite popular (and supports C, C++, and Java).
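For illustration, the C++11 literal prefixes look like this (note that C++20 later changed u8 literals to the separate type char8_t):
const char     *s8  = u8"€";   /* UTF-8:  three bytes for '€'    */
const char16_t *s16 = u"€";    /* UTF-16: one char16_t code unit */
const char32_t *s32 = U"€";    /* UTF-32: one char32_t code unit */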
I am particularly interested in POSIX compliance when writing software that leverages Unicode.
In this case, you'll probably want to use UTF-8 (with char) as your preferred Unicode string type. POSIX doesn't have a lot of functions for working with wchar_t — that's mostly a Windows thing.
This method scans for the byte equivalent of a Unicode character. The Unicode euro sign € takes up 3 bytes in UTF-8, so one needs to compare three char array bytes to know whether the Unicode characters match. Often you need to know the size of the character or string you want to compare, and the exact bytes it encodes to, for the solution to work.
No, you don't. You just compare the bytes. Iff the bytes match, the strings match. strcmp works just as well with UTF-8 as it does with any other encoding.
Unless you want something like a case-insensitive or accent-insensitive comparison, in which case you'll need a proper Unicode library.
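A minimal illustration of the plain byte-wise case, assuming UTF-8 source and execution character sets:
#include <cstdio>
#include <cstring>

int main()
{
    /* Both literals are identical UTF-8 byte sequences, so a plain
       byte-wise comparison is all that is needed. */
    const char *a = "ov€rlord";
    const char *b = "ov€rlord";

    if (std::strcmp(a, b) == 0)
        std::puts("equal, byte for byte");

    std::printf("%zu bytes, 8 code points\n", std::strlen(a));   /* 10 bytes */
}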
You should never, ever compare bytes, or even code points, to decide whether strings are equal. That's because a lot of strings can be identical from the user's perspective without being identical from the code point perspective.