Issues with converting between "string" "const unsigned char" and "utf8proc_uint8_t" - c++

Maybe a simple issue, but I've gotten confused between "byte arrays", pointers and casts in c++.
Take a look at the following and let me know what I need to read about to fix it, as well as the fix. It relates to the utf8proc library.
const unsigned char *aa = (const unsigned char*)e.c_str();
utf8proc_uint8_t* a = utf8proc_NFC(aa);
char b = (char)a;
string d = string(b);
It is bad enough no need for an error message here, but there is no constructor string on the string(b) line.

There appear to be a couple problems here. The biggest is the assignment:
char b = (char)a;
What you are doing is telling the compiler to take the pointer (memory location) and convert that to a char, then assign it to single char value b. So you'll basically have random jibberish in b.
Instead, if you want to treat a like a basic char*, you would write:
char* b = (char*)a;
Then you could use the string class with either:
string d = string(b);
or you could skip several line by the direct conversion:
string d = string((char*)a);
You are also looking for a headache down the line if you don't delete the conversion value returned by the utf8proc_NFC() call, and if you don't do an error check after the conversion.
Plus I'll put in a plug for using some Hungarian notation to distinguish a pointer (a 'p' prefix on variables). This makes it obvious that you can do things like:
char tmp = *pStr; // a single character (first in the string)
char tmp2 = pStr[1]; // a single character (second in the string)
char* pTmp = pStr; // a pointer to a null terminated string
But you would never see:
char tmp3 = (char)pStr; // compiles, but makes no sense to treat pointer as a character.
So here is a clean version of all of this:
utf8proc_uint8_t* pUTF = utf8proc_NFC( (const unsigned char*)e.c_str() );
string strUTF;
if (pUTF)
{
strUTF = (char*)pUTF;
free pUTF;
}

This code is almost certainly not what you want, since it is casting a pointer into a scalar.
utf8proc_uint8_t* a = ...;
char b = (char)a;
Instead, you want to cast and produce a pointer:
utf8proc_uint8_t* a = ...;
const char *b = (const char *)a;
I also added const, which is not strictly necessary but a good idea to use wherever you can.

Related

Pointers confusion char* and int*

I got this confusion while learning c++:
int *a = 8 ;
This gives an error because, as I have understood it, i am trying to set an integer to a pointer which is a memory location. But,
const char *name = "name";
works perfectly fine? I don't get it as name should be an hexadecimal memory location but i am trying to set it to a series of characters.
A string literal, "name" in your case, is of typeconst char[]. An array can decay to a pointer which is what's happening in this case. That pointer will then point to the first element in the array. Note that since C++11 assigning to a char* instead of a const char* (thus needing a conversion) as you are doing is illegal, always use const char* for string literals, or better yet, std::string.
8 is of type int, which has no conversion to a pointer type that an array has.
The rules for int* and char* are the same. However, you compare apples to oranges.
You cannot assign a value of the respective type to its pointer (it doesn’t matter whether there is a const here; I made things const for consistency with the next example where it matters to some extend):
int const* ip = 8; // ERROR
char const* cp = ‘8’; // ERROR
Arrays decay into pointers upon the slightest opportunity. String literals are arrays of type char const[N] where N is the number of chars in the string literal, including the terminating null char.
int ia[] = { 8 };
char ca[] = { ‘8’ };
int const* ip = ia; // OK
char const* cp = ca; // OK
char const* lp = “8”; // OK

modify const char * vs char * content in easy way

im realy confused about const char * and char *.
I know in char * when we want to modify the content, we need to do something like this
const char * temp = "Hello world";
char * str = new char[strlen(temp) + 1];
memcpy(str, temp, strlen(temp));
str[strlen(temp) + 1] = '\0';
and if we want to use something like this
char * str = "xxx";
char * str2 = "xts";
str = str2;
we get compiler warning. it's ok I know when i want to change char * I have to use something memory copy. but about const char * im realy confused. in const char * I can use this
const char * str = "Hello";
const char * str2 = "World";
str = str2; // and now str is Hello
and I have no compiler error ! why ? why we use memory copy when is not const and in const we only use equal operator ! and done !... how possible? is it ok to just use equal in const? no problem happen later?
As other answers say, you should distinguish pointers and bytes they point to.
Both types of pointers, char * and const char *, can be changed, that is, "redirected" to point to different bytes. However, if you want to change the bytes (characters) of the strings, you cannot use const char *.
So, if you have string literals "Hello" and "World" in your program, you can assign them to pointers, and printing the pointer will print the corresponding literal. However, to do anything non-trivial (e.g. change Hello to HELLO), you will need non-const pointers.
Another example: with some pointer manipulation, you can remove leading bytes from a string literal:
const char* str = "Hello";
std::cout << str; // Hello
str = str + 2;
std::cout << str; // llo
However, if you want to extract a substring, or do any other transformation on a string, you should reallocate it, and for that you need a non-const pointer.
BTW since you are using C++, you can use std::string, which makes it easier to work with strings. It reallocates strings without your intervention:
#include <string>
std::string str("Hello");
str = str.substr(1, 3);
std::cout << str; // ell
This is a confusing hangover from the days of early C. Early C didn't have const, so string literals were "char *". They remained char * to avoid breaking old code, but they became non-modifiable, so const char * in all but name. So modern C++ either warns or gives an error (to be strictly conforming) when the const is omitted.
Your memcpy missed the trailing nul byte, incidentally. Use strcpy() to copy a string, that's the right function with the right name. You can create a string in read/write memory by use of the
char rwstring[] = "I am writeable";
syntax.
That is cause your variables are just a pointers *. You're not modifiying their contents, but where they are pointing to.
char * a = "asd";
char * b = "qwe";
a = b;
now you threw away the contents of a. Now a and b points to the same place. If you modify one, both are modified.
In other words. Pointers are never constants (mostly). your const predicate in a pointer variable does not means nothing to the pointer.
The real difference is that the pointer (that is not const) is pointing to a const variable. and when you change the pointer it will be point to ANOTHER NEW const variable. That is why const has no effect on simple pointers.
Note: You can achieve different behaviours with pointers and const with more complex scenario. But with simple as it, it mostly has no effect.
Citing Malcolm McLean:
This is a confusing hangover from the days of early C. Early C didn't have const, so string literals were "char *". They remained char * to avoid breaking old code, but they became non-modifiable, so const char * in all but name.
Actually, string literals are not pointers, but arrays, this is why sizeof("hello world") works as a charm (yields 12, the terminating null character is included, in contrast to strlen...). Apart from this small detail, above statement is correct for good old C even in these days.
In C++, though, string literals have been arrays of constant characters (char const[]) right from the start:
C++ standard, 5.13.5.8:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration.
(Emphasised by me.) In general, you are not allowed to assign pointer to const to pointer to non-const:
char const* s = "hello";
char ss = s;
This will fail to compile. Assigning string literals to pointer to non-const should normally fail, too, as the standard explicitly states in C.1.1, subclause 5.13.5:
Change: String literals made const.
The type of a string literal is changed from “array of char” to “array of const char”.
[...]char* p = "abc"; // valid in C, invalid in C++
Still, string literal assignement to pointer to non-const is commonly accepted by compilers (as an extension!), probably to retain compatibility to C. As this is, according to the standard, invalid, the compiler yields a warning, at least...

Character pointer access

I wanted to access character pointer ith element. Below is the sample code
string a_value = "abcd";
char *char_p=const_cast<char *>(a_value.c_str());
if(char_p[2] == 'b') //Is this safe to use across all platform?
{
//do soemthing
}
Thanks in advance
Array accessors [] are allowed for pointer types, and result in defined and predictable behaviors if the offset inside [] refers to valid memory.
const char* ptr = str.c_str();
if (ptr[2] == '2') {
...
}
Is correct on all platforms if the length of str is 3 characters or more.
In general, if you are not mutating the char* you are looking at, it best to avoid a const_cast and work with a const char*. Also note that std::string provides operator[] which means that you do not need to call .c_str() on str to be able to index into it and look at a char. This will similarly be correct on all platforms if the length of str is 3 characters or more. If you do not know the length of the string in advance, use std::string::at(size_t pos), which performs bound checking and throws an out_of_range exception if the check fails.
You can access the ith element in a std::string using its operator[]() like this:
std::string a_value = "abcd";
if (a_value[2] == 'b')
{
// do stuff
}
If you use a C++11 conformant std::string implementation you can also use:
std::string a_value = "abcd";
char const * p = &a_value[0];
// or char const * p = a_value.data();
// or char const * p = a_value.c_str();
// or char * p = &a_value[0];
21.4.1/5
The char-like objects in a basic_string object shall be stored contiguously.
21.4.7.1/1: c_str() / data()
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
The question is essentially about querying characters in a string safely.
const char* a = a_value.c_str();
is safe unless some other operation modifies the string after it. If you can guarantee that no other code performs a modification prior to using a, then you have safely retrieved a pointer to a null-terminated string of characters.
char* a = const_cast<char *>(a_value.c_str());
is never safe. You have yielded a pointer to memory that is writeable. However, that memory was never designed to be written to. There is no guarantee that writing to that memory will actually modify the string (and actually no guarantee that it won't cause a core dump). It's undefined behaviour - absolutely unsafe.
reference here: http://en.cppreference.com/w/cpp/string/basic_string/c_str
addressing a[2] is safe provided you can prove that all possible code paths ensure that a represents a pointer to memory longer than 2 chars.
If you want safety, use either:
auto ch = a_string.at(2); // will throw an exception if a_string is too short.
or
if (a_string.length() > 2) {
auto ch = a_string[2];
}
else {
// do something else
}
Everyone explained very well for most how it's safe, but i'd like to extend a bit if that's ok.
Since you're in C++, and you're using a string, you can simply do the following to access a caracter (and you won't have any trouble, and you still won't have to deal with cstrings in cpp :
std::string a_value = "abcd";
std::cout << a_value.at(2);
Which is in my opinion a better option rather than going out of the way.
string::at will return a char & or a const char& depending on your string object. (In this case, a const char &)
In this case you can treat char* as an array of chars (C-string). Parenthesis is allowed.

c++ char * initialization in constructor

I'm just curious, I want to know what's going on here:
class Test
{
char * name;
public:
Test(char * c) : name(c){}
};
1) Why won't Test(const char * c) : name(c){} work? Because char * name isn't const? But what about this:
main(){
char * name = "Peter";
}
name is char*, but "Peter" is const char*, right? So how does that initialization work?
2) Test(char * c) : name(c){ c[0] = 'a'; } - this crashes the program. Why?
Sorry for my ignorance.
Why won't Test(const char * c) : name(c) {} work? Because char * name isn't const?
Correct.
how does this initialization work: char * name = "Peter";
A C++ string literal is of type char const[] (see here, as opposed to just char[] in C, as it didn't have the const keyword1). This assignment is considered deprecated in C++, yet it is still allowed2 for backward compatibility with C.
Test(char * c) : name(c) { c[0] = 'a'; } crashes the program. Why?
What are you passing to Test when initializing it? If you're passing a string literal or an illegal pointer, doing c[0] = 'a' is not allowed.
1 The old version of the C programming language (as described in the K&R book published in 1978) did not include the const keyword. Since then, ANSI C borrowed the idea of const from C++.
2 Valid in C++03, no longer valid in C++11.
A conversion to const is a one-way street, so to speak.
You can convert from T * to T const * implicitly.
Conversion from T const * to T * requires an explicit cast. Even if you started from T *, then converted to T const *, converting back to T * requires an explicit cast, even though it's really just "restoring" the access you had to start with.
Note that throughout, T const * and const T * are precisely equivalent, and T stands for "some arbitrary type" (char in your example, but could just as easily be something else like int or my_user_defined_type).
Initializing a char * from a string literal (e.g., char *s = "whatever";) is allowed even though it violates this general rule (the literal itself is basically const, but you're creating a non-const pointer to it). This is simply because there's lots of code that depends on doing this, and nobody's been willing to break that code, so they have a rule to allow it. That rule's been deprecated, though, so at least in theory some future compiler could reject code that depends on it.
Since the string literal itself is basically const, any attempt at modifying it results in undefined behavior. On most modern systems, this will result in the process being terminated, because the memory storing the string literal will be marked at 'read only'. That's not the only possible result. Just for example, back in the days of MS-DOS, it would often succeed. It could still have bizarre side-effects though. For one example, many compilers "knew" that string literals were supposed to be read-only, so they'd "merge" identical string literals. Therefore if you had something like:
char *a = "Peter"; a[1] = 'a';
char *b = "Peter";
cout << b;
The compiler would have "merged" a and b to actually point at the same memory -- so when you modified a, that change would also affect b, so it would print out "Pater" instead of "Peter".
Note that the string literals didn't need to be entirely identical for this to happen either. As long as one was identical to the end of another, they could be merged:
char *a = "this?";
char *b = "What's this?";
a[2] = 'a';
a[3] = 't';
cout << b; // could print "What's that?"
Mandating one behavior didn't make sense, so the result was (and is) simply undefined.
First of all this is C++, you have std::string. You should really consider using it.
Regarding your question, "Peter" is a char literal, hence it is unmodifiable and surely you can't write on it. You can:
have a const char * member variable and initialize it like you are doing name(c), by declaring "Peter" as const
have a char * member variable and copy the content, eg name(strdup(c)) (and remember to release it in destructor.
Correct.
"Peter" is typically stored in a read-only memory location (actually, it depends on what type of device we are on) because it is a string literal. It is undefined what happens when you attempt to modify a string literal (but you can probably guess that you shouldn't).
You should use std::string anyways.
1a) Right
1b) "Peter" is not const char*, its is char* but it may not be modified. The reason is for compatibility with times before const existed in the language. A lot of code already existed that said char* p = "fred"; and they couldn't just make that code illegal overnight.
2) Can't say why that would crash the program without seeing how you are using that constructor.

C++ Swap string

I am trying to create a non-recursive method to swap a c-style string. It throws an exception in the Swap method. Could not figure out the problem.
void Swap(char *a, char* b)
{
char temp;
temp = *a;
*a = *b;
*b = temp;
}
void Reverse_String(char * str, int length)
{
for(int i=0 ; i <= length/2; i++) //do till the middle
{
Swap(str+i, str+length - i);
}
}
EDIT: I know there are fancier ways to do this. But since I'm learning, would like to know the problem with the code.
It throws an exception in the Swap method. Could not figure out the problem.
No it doesn't. Creating a temporary character and assigning characters can not possibly throw an exception. You might have an access violation, though, if your pointers don't point to blocks of memory you own.
The Reverse_String() function looks OK, assuming str points to at least length bytes of writable memory. There's not enough context in your question to extrapolate past that. I suspect you are passing invalid parameters. You'll need to show how you call Reverse_String() for us to determine if the call is valid or not.
If you are writing something like this:
char * str = "Foo";
Reverse_String(str, 3);
printf("Reversed: '%s'.\n", str);
Then you will definitely get an access violation, because str points to read-only memory. Try the following syntax instead:
char str[] = "Foo";
Reverse_String(str, 3);
printf("Reversed: '%s'.\n", str);
This will actually make a copy of the "Foo" string into a local buffer you can overwrite.
This answer refers to the comment by #user963018 made under #André Caron's answer (it's too long to be a comment).
char *str = "Foo";
The above declares a pointer to the first element of an array of char. The array is 4 characters long, 3 for F, o & o and 1 for a terminating NULL character. The array itself is stored in memory marked as read-only; which is why you were getting the access violation. In fact, in C++, your declaration is deprecated (it is allowed for backward compatibility to C) and your compiler should be warning you as such. If it isn't, try turning up the warning level. You should be using the following declaration:
const char *str = "Foo";
Now, the declaration indicates that str should not be used to modify whatever it is pointing to, and the compiler will complain if you attempt to do so.
char str[] = "Foo";
This declaration states that str is a array of 4 characters (including the NULL character). The difference here is that str is of type char[N] (where N == 4), not char *. However, str can decay to a pointer type if the context demands it, so you can pass it to the Swap function which expects a char *. Also, the memory containing Foo is no longer marked read-only, so you can modify it.
std::string str( "Foo" );
This declares an object of type std::string that contains the string "Foo". The memory that contains the string is dynamically allocated by the string object as required (some implementations may contain a small private buffer for small string optimization, but forget that for now). If you have string whose size may vary, or whose size you do not know at compile time, it is best to use std::string.