C/C++: Inherent ambiguity of "\xNNN" format in literal strings


Consider these two strings:
const wchar_t* x = L"xy\x588xla";
const wchar_t* y = L"xy\x588bla";
Upon reading this you would expect that both string literals are the same except one character - an 'x' instead of a 'b'.
It turns out that this is not the case. The first string compiles to:
x = {'x', 'y', 0x588, 'x', 'l', 'a' }
and the second is actually:
y = {'x', 'y', 0x588b, 'l', 'a' }
They are not even the same length!
Yes, the 'b' is eaten up by the hex representation ('\xNNN') character.
At the very least, this could cause confusion and subtle bugs in hand-written strings (you could argue that Unicode strings don't belong in the code body).
But the more serious problem, and the one I am facing, is in auto-generated code. There just doesn't seem to be any way to express this: {'x', 'y', 0x588, 'b', 'l', 'a' } as a literal string without resorting to writing the entire string in hex representation, which is wasteful and unreadable.
Any idea of a way around this?
What's the sense in the language behaving like this?

A simple way is to use compile time string literal concatenation, thus:
wchar_t const* y = L"xy\x588" L"bla";
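For the auto-generated case, the same trick can be applied mechanically. The following is only a sketch: toWideLiteral is a name invented here, and it assumes an ASCII-compatible source character set. It closes and reopens the literal after every hex escape, so the escape can never absorb the character that follows.

```cpp
#include <cstdio>
#include <string>

// Hypothetical generator helper (not from the original post): renders `text`
// as the source code of a C++ wide string literal.
std::string toWideLiteral(const std::wstring& text) {
    std::string out = "L\"";
    for (wchar_t ch : text) {
        // Assumption: printable ASCII can be emitted as-is, except the
        // double quote and the backslash, which are escaped like the rest.
        if (ch >= 0x20 && ch < 0x7f && ch != L'"' && ch != L'\\') {
            out += static_cast<char>(ch);
        } else {
            char buf[24];
            std::snprintf(buf, sizeof buf, "\\x%lx",
                          static_cast<unsigned long>(ch));
            out += buf;
            // Split the literal here; adjacent literals are concatenated
            // by the compiler, and the escape ends at the closing quote.
            out += "\" L\"";
        }
    }
    out += "\"";
    return out;
}
```

Called with the six characters 'x', 'y', 0x588, 'b', 'l', 'a', this produces the text L"xy\x588" L"bla". For code points such as 0x588, a universal character name like \u0588 is another option, since it always takes exactly four hex digits.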

Related

Is there a function for moving the contents of a char array a certain amount of addresses back in C++?

I have the following code for Arduino (C++). This is for a checksum consisting of 2 characters forming a base 16 value between 0 and 255. It takes int outputCheckSum and converts it to char outputCheckSumHex[3].
itoa (outputCheckSum, outputCheckSumHex, 16);
if (outputCheckSum < 16) { //Adds a 0 if CS has fewer than 2 numbers
outputCheckSumHex[1] = outputCheckSumHex[0];
outputCheckSumHex[0] = '0';
}
Since the output of itoa would be "X" instead of "0X" in the event of X having fewer than 2 characters, the last 3 lines are to move the characters one step back.
I now have plans to scale this up to a CS of 8 characters, and I was wondering whether there exists a function in C++ that can achieve that before I start writing more code. Please let me know if more information is required.

You should be able to use memmove, it's one of the legacy C functions rather than C++ but it's available in the latter (in cstring header) and handles overlapping memory correctly, unlike memcpy.
So, for example, you could use:
char buff[5] = {'a', 'b', 'c', '.', '.'};
memmove(&(buff[2]), &(buff[0]), 3);
// Now it's {'a', 'b', 'a', 'b', 'c'} and you can replace the first two characters.
Alternatively, you could use std::copy from the algorithm header but, for something this basic, memmove should be fine.

You can use memmove in <cstring> for this. It does not check for terminating null characters, but instead copies num bytes (third argument) and works with overlapping regions as well.
void* memmove(void* destination, const void* source, size_t num);
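Applied to the checksum problem, the idea could look like the sketch below. padLeftWithZeros is a hypothetical helper name, and the caller is assumed to supply a buffer of at least width + 1 bytes that already holds a null-terminated string (for example the output of itoa).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: left-pads the null-terminated string in buf with '0'
// until it is `width` characters long. Does nothing if it is already that long.
// Assumption: buf has room for at least width + 1 bytes.
void padLeftWithZeros(char* buf, std::size_t width) {
    assert(buf != nullptr);
    std::size_t len = std::strlen(buf);
    if (len >= width) return;
    std::size_t shift = width - len;
    // The regions overlap, so memmove (not memcpy) is required.
    // len + 1 moves the terminating null along with the digits.
    std::memmove(buf + shift, buf, len + 1);
    std::memset(buf, '0', shift);
}
```

With a 9-byte buffer holding "1a2b", padLeftWithZeros(buf, 8) leaves "00001a2b" in place, so the same call covers both the 2-character and the 8-character checksum.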

The meaning of char *p_c = new char['1', '2', '3', '4'];

Consider
char *p_c = new char['1', '2', '3', '4'];
Is this syntax correct? If yes, then what does it do?
I don’t know why, but compiler allows this syntax! What will it do with regards to memory? I am not able to access the variable by *p_c. How does one determine the size of and the number of elements present?

Your code is syntactically valid C++, if rather strange, and I don't think it does what you intended:
new char['1', '2', '3', '4'] is evaluated as new char['4'] due to the way the comma operator works. (The preceding elements are evaluated from left to right, but the value of the expression is that of the rightmost element.)
So your statement is equivalent to char *p_c = new char['4'];
'4' is a char type with a numeric value that depends on the encoding that your platform uses (ASCII, EBCDIC &c. although the former is most likely on a desktop system.).
So the number of elements in the array is whatever '4' evaluates to when converted to a size_t. On an ASCII system the number of elements would be 52.

The syntax for the new expression you used is something like:
identifier = new Type[<expression>];
In the <expression> above, C++ allows any expression whose result is convertible to std::size_t. And for your own expression, you used the comma operator.
<expression> := '1', '2', '3', '4'
which evaluates every item in the comma list and returns the last one, '4'. That result is converted to std::size_t, probably 52, so the code is equivalent to:
char* p_c = new char['4'];

char *p_c = new char['1', '2', '3', '4'];
is functionally equivalent to:
char *p_c = new char['4'];
because of the comma operator, which evaluates its operands left to right and discards every value except the last (right-most) one.
The character literal '4' has the value 52 in ASCII. Neither the C nor the C++ standard requires ASCII, but almost all modern systems use it.
So, it's as if you used:
char *p_c = new char[52];
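A small sketch of both points, assuming C++11 or later. kCommaResult and makeDigits are names made up for illustration, and the second part is only a guess at what the question intended (an array of four chars holding those values).

```cpp
#include <cstddef>

// The parenthesized comma expression yields its right-most operand, '4',
// which is 52 on an ASCII system. Compilers typically warn that the
// left-hand operands have no effect.
constexpr std::size_t kCommaResult = ('1', '2', '3', '4');

// Hypothetical helper: allocates four chars and list-initializes them.
// The caller owns the array and must release it with delete[].
char* makeDigits() {
    return new char[4]{'1', '2', '3', '4'};
}
```

Note that a plain new char[n] carries no size information with it; the program has to remember the element count itself, or use std::string or std::vector<char> instead.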

Macro that expands its only argument into its component characters

Is it possible to write a macro for the C/C++ preprocessor that expands its single argument into the characters it is composed of?
For example
EXPAND( abcd )
would expand to
'a', 'b', 'c', 'd'
Other examples are
EXPAND( 1 )
'1'
EXPAND( 12 )
'1', '2'
EXPAND( func_name )
'f', 'u', 'n', 'c', '_', 'n', 'a', 'm', 'e'
EDIT:
The purpose would be to pass a character sequence as a parameter to a template like the one below
template<char... args>
struct Struct {
...
};
Instead of having to code the tedious
Struct<'a', 'b', 'c'>
one would simply do
Struct<EXPAND( abc )>
Ideally it would be best if one could code
Struct<"abc">
but string literals are not converted into char... sequences automatically.

No. This functionality is not provided by the C preprocessor.
Depending on your use case, a string might be equivalent (except for the null byte), so stringifying might work as well.
You might have a look at m4, a more advanced preprocessor by K&R. Maybe it provides the functionality you need.
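The stringifying route mentioned above can be sketched like this. STRINGIFY and kName are names chosen here for illustration; the result is a char array, not a char... pack, so it only helps where a string is an acceptable substitute.

```cpp
// Hypothetical macro: turns its argument into a string literal.
#define STRINGIFY(x) #x

// kName is a char array holding "func_name" plus the terminating null byte.
constexpr char kName[] = STRINGIFY(func_name);
static_assert(sizeof(kName) == 10, "nine characters plus the null byte");
```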

Storing a string in an array of chars without the null character

I'm reading the C++ Primer Plus by Stephen Prata. He gives this example:
char dog[8] = { 'b', 'e', 'a', 'u', 'x', ' ', 'I', 'I'}; // not a string!
char cat[8] = {'f', 'a', 't', 'e', 's', 's', 'a', '\0'}; // a string!
with the comment that:
Both of these arrays are arrays of char, but only the second is a string. The null character plays a fundamental role in C-style strings. For example, C++ has many functions that handle strings, including those used by cout. They all work by processing a string character-by-character until they reach the null character. If you ask cout to display a nice string like cat in the preceding example, it displays the first seven characters, detects the null character, and stops. But if you are ungracious enough to tell cout to display the dog array from the preceding example, which is not a string, cout prints the eight letters in the array and then keeps marching through memory byte-by-byte, interpreting each byte as a character to print, until it reaches a null character. Because null characters, which really are bytes set to zero, tend to be common in memory, the damage is usually contained quickly; nonetheless, you should not treat nonstring character arrays as strings.
Now, if I declare my variables global, like this:
#include <iostream>
using namespace std;
char a[8] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'};
char b[8] = {'1', '2', '3', '4', '5', '6', '7', '8'};
int main(void)
{
cout << a << endl;
cout << b << endl;
return 0;
}
the output will be:
abcdefgh12345678
12345678
So, indeed, cout "keeps marching through memory byte-by-byte", but only to the end of the second character array. The same thing happens with any combination of char arrays. I'm thinking that all the other addresses are initialized to 0 and that's why cout stops. Is this true? If I do something like:
for (int i = 0; i < 100; ++i)
{
cout << *(&a + i) << endl;
}
I'm getting mostly empty space at output (like 95%, perhaps), but not everywhere.
If, however, I declare my char arrays a little bit shorter, like:
char a[3] = {'a', 'b', 'c'};
char b[3] = {'1', '2', '3'};
keeping all other things the same, I'm getting the following output:
abc
123
Now the cout doesn't even get past the first char array, not to mention the second. Why is this happening? I've checked the memory addresses and they are sequential, just like in the first scenario. For example,
cout << &a << endl;
cout << &b << endl;
gives
003B903C
003B9040
Why is the behavior different in this case? Why doesn't it read beyond the first char array?
And, lastly, if I declare my variables inside main, then I do get the behavior suggested by Prata: a lot of junk gets printed before a null character is reached somewhere.
I'm guessing that in the first case, the char array is declared on the heap and that this is initialized to 0 (but not everywhere, why?) and cout behaves differently based on the length of the char array (why?)
I'm using Visual Studio 2010 for these examples.

It looks like your C++ compiler is allocating space in 4-byte chunks, so that every object has an address that is a multiple of 4 (the hex addresses in your dump are divisible by 4). Compilers like to do this to make sure larger datatypes such as int and float (4 bytes wide) are aligned to 4-byte boundaries, because some kinds of computer hardware take longer to load/move/store unaligned int and float values.
In your first example, each array needs 8 bytes of memory (a char fills a single byte), so the compiler allocates exactly 8 bytes. In the second example each array is 3 bytes, so the compiler allocates 4 bytes, fills the first 3 bytes with your data, and leaves the 4th byte unused.
Now in this second case it appears the unused byte was filled with a null which explains why cout stopped at the end of the string. But as others have pointed out, you cannot depend on unused bytes to be initialized to any particular value, so the behaviour of the program cannot be guaranteed.
If you change your sample arrays to have 4 bytes the program will behave as in the first example.

The contents of memory out of bounds is indeterminate. Accessing memory you do not own, even just for reading, leads to undefined behavior.
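To stay inside the array, pass the length explicitly instead of relying on a terminator. A minimal sketch, with letters and arrayToString as names invented for this example:

```cpp
#include <cstddef>
#include <string>

// Eight characters and no terminating null, as in the question.
char letters[8] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'};

// Hypothetical helper: copies exactly N chars, so nothing past the end of
// the array is ever read. N is deduced from the array type.
template <std::size_t N>
std::string arrayToString(const char (&arr)[N]) {
    return std::string(arr, N);
}
```

The resulting std::string can be streamed to cout safely; cout.write(letters, sizeof letters) is an alternative that avoids the copy.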

It's undefined behaviour; you cannot say what will happen.
Try it on some other system and you may get different output.
The answer to your question is that it is undefined behaviour, and its output cannot be relied upon.
In addition to the above explanation: in your particular case, you have declared the arrays globally.
Therefore, in your second example, a \0 ends up in the fourth byte of the four-byte boundary, as explained by Peter Raynham.

The '\0' is just one way to tell how long a string is. Alternatively, you could record the length by storing a value before the string.
But when you intentionally leave the terminator out, the library functions, and normally your own code as well, will keep searching for the delimiter (which is a null character).
What lies beyond the bounds of an object is undefined, and it varies greatly.
In MinGW in debug mode with gdb it's usually zeroed out; without gdb it's just junk, although this is only my experience.
Locally declared variables are usually on the stack, so what you are reading is probably your call stack.

Enumerating digits

For my book class I'm storing an ISBN number and I need to validate the data entered, so I decided to use an enumeration. (First three inputs must be single digits, last one must be a letter or digit.) However, I'm wondering if it is even possible to enumerate numbers. I've tried putting them in as regular integers, string style with double quotes, and char style with single quotes.
Example:
class Book{
public:
enum ISBN_begin{
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
};
enum ISBN_last{
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', a, b, c, d, e, f,
g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z};
The compiler error says "expected identifier", which means that it's seeing the numbers as values for the identifiers and not as identifiers themselves. So is enumerating digits possible?

I think you're going about this the wrong way...why not just use a simple regex that will validate the entire thing a bit more simply?
(yes, I know, that wasn't the original question, but it might make your life a lot easier.)
This page and this page provide some good examples on using regex to validate isbn numbers.
I think creating an enumeration whose values are equal to the entities they're enumerating...I think you're doing a lot more than you have to.

Why would you want to enumerate numbers? Enums exist to give numbers a name. If you want to store a number, store an integer - or in this case a char (as you also need characters to be stored). For validation, accept a string and write a function like this:
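// expectedLength is passed in by the caller; the original left the ISBN length unspecified.
bool ISBN_Validate(const std::string& val, std::size_t expectedLength)
{
    if (val.length() != expectedLength) return false;
    if (val[0] < '0' || val[0] > '9') return false; // first character must be a digit
    for (char ch : val)
    {
        // crude range check: rejects anything outside '0'..'z'
        if (ch < '0' || ch > 'z') return false;
    }
    return true;
}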
Easy - and no silly enumerations ;)

#include <ctype.h>
Don't forget the basics. The above include file gives you isalpha(), isdigit(), etc.
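A sketch of how those functions could express the rule from the question (first three inputs are digits, the last is a letter or digit). isValidIsbnParts is a hypothetical name, and treating the four inputs as one four-character string is an assumption made here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical validator: exactly four characters, the first three digits,
// the last one a letter or a digit.
bool isValidIsbnParts(const std::string& s) {
    if (s.size() != 4) return false;
    for (std::size_t i = 0; i < 3; ++i) {
        // Cast to unsigned char first: passing a negative char to the
        // <cctype> functions is undefined behaviour.
        if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
    }
    return std::isalnum(static_cast<unsigned char>(s[3])) != 0;
}
```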

I would suggest using a string for each of the begin/end criteria, ie:
string BeginCriteria = "0123456789";
string EndCriteria = "0123456789abcd... so forth";
// Now to validate the input
// for the first three input characters:
if ( BeginCriteria.find( chInput ) != string::npos )
// Then it's good.
// For the last character:
if ( EndCriteria.find( chInput ) != string::npos )
// Then it's good.

enums really aren't what you want to use. They aren't sets like that. The members of an enum have to be symbols like variables or function names and you can give them values.
enum numbers { One = 1, Two, Three };
One after this is equivalent to a named constant with integer value 1. numbers is equivalent to a new type with a subrange of integer values.
What you probably want is to use a regular expression.

You would have to do something like this:
enum ISBN_begin {
ZERO, ONE, TWO // etc.
};

If you can't use regular expressions, why not just use an array of char instead? You could even use the same array, and just have a const index number where the ISBN_last chars begin in the array.

Enums are not defined with literals; their members must be identifiers.

Some languages (Ada for one) allow what you want, so your request is not too silly. You are simply forgetting that in C and C++, character literals are just another form of integer literals (of type int in C, of type char in C++).
