Weird result (`\210`) when printing the end of a char array - c++

My code looks like this:
int main(int argc, char *argv[])
{
char ca[] = {'0'};
cout << *ca << endl;
cout << *(ca+1) << endl;
cout << ca[1] << endl;
cout << (char)(0) << endl;
return 0;
}
The result is like this:
0
\210
\210
^#
From this thread, I learned that ^# is actually the same as \0. However, \210 does not seem to be, judging by what hexdump shows:
bash-3.2$ ./playground | hexdump -C
00000000 30 0a 88 0a 88 0a 00 0a |0.......|
00000008
It can be seen clearly that \210 is 88 instead of 00.
As I understand it, ca+1 should point to a null terminator, which is \0. So why does cout << *(ca+1) << endl; give me \210 as the result?

Because you have to manually add the null terminator when declaring a character array. If you make it a string (such as in char myString[] = "hi"), then it will add a null terminator. But if you make it an array, with the braces, it will not.
As for the 0x88 byte, it just happened to be the next byte in RAM for whatever reason.

In any valid C program, string literals are always null terminated. Here you are initializing the individual elements of a character array with list-initialization syntax, not with a string literal. Because this is a fixed-size array declared in the same function, you can confirm this with the sizeof operator.
sizeof(ca) should give you 1, i.e. a one-character array. However, if you had written char ca[] = "0"; then sizeof(ca) would give you 2, i.e. the character '0' and the null terminator. As aaaaaa123456789 mentioned, the output you are getting is just whatever byte happens to come next in memory. If you run this at a different time, you may see different output, or your program may crash. Reading past the end of the array is undefined behavior and may cause any kind of runtime anomaly.

Related

Why printing out the characters “” (147, 148 ascii) does not work as expected on c++?

I do not understand what's going on here. This is compiled with GCC 10.2.0 compiler. Printing out the whole string is different than printing out each character.
#include <iostream>
int main(){
char str[] = "“”";
std::cout << str << std::endl;
std::cout << str[0] << str[1] << std::endl;
}
Output
“”
��
Why aren't the two output lines the same? I would expect the same line twice. Printing out alphanumeric characters does output the same line twice.
Bear in mind that, on almost all systems, the maximum value a (signed) char can hold is 127. So, more likely than not, your two 'special' characters are actually being encoded as multi-byte combinations.
In such a case, passing the string pointer to std::cout keeps feeding bytes from that buffer until a zero (nul-terminator) byte is encountered. std::cout does not interpret those bytes itself; it appears that, on your system, the terminal decodes multi-byte character sequences properly, so it shows the expected characters.
However, when you pass the individual char elements str[0] and str[1], you send only the first two bytes of a longer multi-byte sequence. On their own, those bytes do not form a complete, valid character, so the 'weird' � symbol is shown instead.
"“”" contains more bytes than you think. It's usually encoded as utf8. To see that, you can print the size of the array:
std::cout << sizeof str << '\n';
This prints 7 in my testing: three bytes for each of the two quotation marks, plus the null terminator. UTF-8 is a multi-byte encoding, which means a single character may be encoded as several bytes. You are printing individual bytes of a UTF-8 encoded string, and those bytes are not printable on their own. That's why you get � when you try to print them.

How to get Windows-1252 character values in c++?

I have a weird input file with all kinds of control characters like nulls. I want to remove all control characters from this Windows-1252 encoded text file, but if you do this:
std::string test="tést";
for (int i=0;i<test.length();i++)
{
if (test[i]<32) test[i]=32; // change all control characters into spaces
}
It will change the é into a space as well.
So if you have a string like this, encoded in Windows-1252:
std::string test="tést";
The hex values would be:
t é s t
74 E9 73 74
See https://en.wikipedia.org/wiki/ASCII and https://en.wikipedia.org/wiki/Windows-1252
test[0] would equal decimal 116 (=0x74), but apparently with é/0xE9, test[1] does not equal the decimal value 233.
So how can you recognize that é properly?
32 is a signed int literal, so the compiler promotes the char to int and performs a signed comparison: where char is signed, 0xE9 is -23, and -23 < 32 is true.
Using the unsigned literal 32u makes the comparison unsigned: the char is converted to unsigned int, so -23 becomes a very large positive value and the test is false.
Replace :
if (test[i]<32) test[i]=32;
By:
if (test[i]<32u) test[i]=32u;
And you should get the expected result.
Test this here:
https://onlinegdb.com/BJ8tj0kbd
Note: you can check that char is signed with the following code:
#include <limits>
...
std::cout << std::numeric_limits<char>::is_signed << std::endl;
Change
if (test[i]<32)
to
if (test[i] >= 0 && test[i] < 32)
chars are often signed types, and 0xE9 is a negative value in a signed eight-bit integer.

Binary data as command line argument

I have a simple c++ program (and a similar one for c) that just prints out the first argument
#include <iostream>
int main(int argc, char** argv)
{
if(argc > 1)
std::cout << ">>" << argv[1] << "<<\n";
}
I can pass binary data (I have tried this in bash) as an argument like
$./a.out $(printf "1\x0123")
>>1?23<<
If I try to pass a null there, I get
./a.out $(printf "1\x0023")
bash: warning: command substitution: ignored null byte in input
>>123<<
Clearly bash(?) does not allow this
But is it possible to send a null as a command line argument this way?
Do either c or c++ put any restrictions on this?
Edit: I am not using this in day-to-day c++, this question is just out of curiosity
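This answer is written in C. Because it uses a variable-length array, it compiles as C++ only with compilers that accept that as an extension (GCC and Clang do), and it then works the same in both. I quote from the C11 standard; there are equivalent definitions in the C++ standards.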
There isn't a good way to pass null bytes to a program's arguments
C11 §5.1.2.2.1 Program startup:
If the value of argc is greater than zero, the array members argv[0] through argv[argc-1] inclusive shall contain pointers to strings, which are given implementation-defined values by the host environment prior to program startup.
C11 §7.1.1 Definitions of terms
A string is a contiguous sequence of characters terminated by and including the first null character.
That means that each argument passed to main() in argv is a null-terminated string. There is no reliable data after the null byte at the end of the string — searching there would be accessing out of bounds of the string.
So, as noted at length in the comments to the question, it is not possible in the ordinary course of events to get null bytes to a program via the argument list because null bytes are interpreted as being the end of each argument.
By special agreement
That doesn't leave much wriggle room. However, if both the calling/invoking program and the called/invoked program agree on the convention, then, even with the limitations imposed by the standards, you can pass arbitrary binary data, including arbitrary sequences of null bytes, to the invoked program — up to the limits on the length of an argument list imposed by the implementation.
The convention has to be along the lines of:
All arguments (except argv[0], which is ignored, and the last argument, argv[argc-1]) consist of a stream of non-null bytes followed by a null.
If you need adjacent nulls, you have to provide empty arguments on the command line.
If you need trailing nulls, you have to provide empty arguments as the last arguments on the command line.
This could lead to a program such as (null19.c):
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static void hex_dump(const char *tag, size_t size, const char *buffer);
int main(int argc, char **argv)
{
if (argc < 2)
{
fprintf(stderr, "Usage: %s arg1 [arg2 '' arg4 ...]\n", argv[0]);
exit(EXIT_FAILURE);
}
size_t len_args = 0;
for (int i = 1; i < argc; i++)
len_args += strlen(argv[i]) + 1;
char buffer[len_args];
size_t offset = 0;
for (int i = 1; i < argc; i++)
{
size_t arglen = strlen(argv[i]) + 1;
memmove(buffer + offset, argv[i], arglen);
offset += arglen;
}
assert(offset != 0);
offset--;
hex_dump("Argument list", offset, buffer);
return 0;
}
static inline size_t min_size(size_t x, size_t y) { return (x < y) ? x : y; }
static void hex_dump(const char *tag, size_t size, const char *buffer)
{
printf("%s (%zu):\n", tag, size);
size_t offset = 0;
while (size != 0)
{
printf("0x%.4zX:", offset);
size_t count = min_size(16, size);
for (size_t i = 0; i < count; i++)
printf(" %.2X", buffer[offset + i] & 0xFF);
putchar('\n');
size -= count;
offset += count;
}
}
This could be invoked using:
$ ./null19 '1234' '5678' '' '' '' '' 'def0' ''
Argument list (19):
0x0000: 31 32 33 34 00 35 36 37 38 00 00 00 00 00 64 65
0x0010: 66 30 00
$
The first argument is deemed to consist of 5 bytes — four digits and a null byte. The second is similar. The third through sixth arguments each represent a single null byte (it gets painful if you need large numbers of contiguous null bytes), then there is another string of five bytes (three letters, one digit, one null byte). The last argument is empty but ensures that there is a null byte at the end. If omitted, the output would not include that final terminal null byte.
$ ./null19 '1234' '5678' '' '' '' '' 'def0'
Argument list (18):
0x0000: 31 32 33 34 00 35 36 37 38 00 00 00 00 00 64 65
0x0010: 66 30
$
This is the same as before except there is no trailing null byte in the data. The two examples in the question are easily handled:
$ ./null19 $(printf "1\x0123")
Argument list (4):
0x0000: 31 01 32 33
$ ./null19 1 23
Argument list (4):
0x0000: 31 00 32 33
$
This works strictly within the standard assuming only that empty strings are recognized as valid arguments. In practice, those arguments are already contiguous in memory so it might be possible on many platforms to avoid the copying phase into the buffer. However, the standard does not stipulate that the argument strings are laid out contiguously in memory.
If you need multiple arguments with binary data, you can modify the convention. For example, you could take a control argument of a string which indicates how many subsequent physical arguments make up one logical binary argument.
All this relies on the programs interpreting the argument list as agreed. It is not really a general solution.

How to convert int into a char array?

If I want to compile the following code:
int x = 8;
int y = 17;
char final[2];
final[0] = (char) x;
final[1] = (char) y%10;
cout << final[0] << final[1] << endl;
It shows nothing. I don't know why? So how can I successfully convert it?
(char)8 is not '8', but the ASCII value 8 (the backspace character). To display the character 8 you can add it to '0':
int x = 8;
int y = 17;
char final[2];
final[0] = '0' + x;
final[1] = '0' + (y%10);
cout << final[0] << final[1] << endl;
According to your program, you are printing char 8 and char 7.
They are not printable. In fact, they are the Backspace and BEL characters respectively.
Just run your program and redirect its output to a file.
Then do a hexdump and you will see what is printed.
./a.out > /tmp/b
hd /tmp/b
00000000 08 07 0a |...|
00000003
What you need to understand is that in C++, chars are numbers, just like ints, only in a smaller range. When you cast the integer 8 to char, C++ thinks you want a char holding the number 8. If we look at our handy ASCII table, we can see that 8 is BS (backspace) and 7 is BEL (which sometimes rings a bell when you print it to a terminal). Neither of those are printable, which is why you aren't seeing anything.
If you just want to print the 87 to standard output, then this code will do that:
cout << x << (y%10) << endl;
If you really need to convert it to chars first, then fear not, it can still be done. I'm not much of a C++ person, but this is the way to do it in C:
char final[3];
snprintf(final, sizeof final, "%d%d", x, y % 10);
(See snprintf(3))
This code treats the array as a string and writes the two numbers you specified to it using a printf format string. Strings in C end with a NUL byte, so the array needs three chars: two digits plus the terminator. Because we told snprintf how large the buffer is, it never writes past the end; with only a 2-char array it would store '8' and the NUL and drop the '7'.
This should also work in C++.

Difference between null terminated char (\0) and `^#`

My code looks like this:
#include <iostream>
using std::cout;
using std::endl;
int main(int argc, char *argv[])
{
cout << (int)('\0') << endl;
cout << (char)(0) << endl;
return 0;
}
I expected to see in terminal like this:
$ test-program
0
$
However, what I saw is like this:
$ test-program
0
^#
$
What confuses me is that I think '\0' can be converted to 0, and 0 can also be cast to \0. I expected to see a null char followed by an endl, but the result is something weird like ^#.
Does anyone have ideas about this?
^# is just how your terminal emulator renders '\0'.
^# is a common representation of a null character. Similarly, ^A is used to represent a character with ordinal value 1, ^G for the character with ordinal value 7 (bell), and ^M for the character with ordinal value 13 (Carriage return).
cout << (char)0
is just printing the character representation rather than the integer representation.
If you're looking at your terminal's output, you don't know if it's your program which doesn't behave as expected or maybe just your terminal emulator.
On UNIXoid systems, use ./myProgram | hexdump -C to see the output in hexadecimal. That way you can confirm that your program does what you expect, and that it is the terminal that is not behaving as you expected:
00000000 30 0a 00 0a |0...|
00000004
If you see the same output as I do, you're actually printing a zero '0', newline '\n', null '\0', newline '\n'. So in this case, your program behaves as you expected.
You might want to try different terminal emulators or settings.
If you cast 0 to char, it doesn't generate '0', but '\0' -- keep in mind casting from an integral type to char creates a char with that ASCII code.
So basically, the second line is outputting a null character (ASCII 0).
If you wanted to output a '0' char, you would have needed to do something like (char) 48, because 48 happens to be the ASCII value for the character 0.