std::string::size() strange behaviour - c++

I believe the output has to do with UTF, but I do not know how.
Would someone, please, explain?
#include <iostream>
#include <cstdint>
#include <iomanip>
#include <string>
int main()
{
std::cout << "sizeof(char) = " << sizeof(char) << std::endl;
std::cout << "sizeof(std::string::value_type) = " << sizeof(std::string::value_type) << std::endl;
std::string _s1 ("abcde");
std::cout << "s1 = " << _s1 << ", _s1.size() = " << _s1.size() << std::endl;
std::string _s2 ("abcdé");
std::cout << "s2 = " << _s2 << ", _s2.size() = " << _s2.size() << std::endl;
return 0;
}
The output is:
sizeof(char) = 1
sizeof(std::string::value_type) = 1
s1 = abcde, _s1.size() = 5
s2 = abcdé, _s2.size() = 6
g++ --version prints g++ (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609
QTCreator compiles like this:
g++ -c -m32 -pipe -g -std=c++0x -Wall -W -fPIC -I../strsize -I. -I../../Qt/5.5/gcc/mkspecs/linux-g++-32 -o main.o ../strsize/main.cpp
g++ -m32 -Wl,-rpath,/home/rodrigo/Qt/5.5/gcc -o strsize main.o
Thanks a lot!

é is encoded as 2 bytes, 0xC3 0xA9, in utf-8.

GCC's default input character set is UTF-8. Your editor probably also saved the file as UTF-8, so in your .cpp file the string abcdé occupies 6 bytes (as Peter already answered, LATIN SMALL LETTER E WITH ACUTE is encoded in UTF-8 as 2 bytes). std::string::size, like std::string::length, returns the length in bytes, i.e. 6. QED
You should open your source .cpp file in a hex editor to confirm.

Even in C++11 std::string has nothing to do with UTF-8. In the description of size and length methods of std::string we can see:
For std::string, the elements are bytes (objects of type char), which are not the same as characters if a multibyte encoding such as UTF-8 is used.
Thus, you should use some third-party unicode-compatible library to handle unicode strings.
If you continue to use non-unicode string classes with unicode strings, you may face LOTS of other problems. For example, you'll get a bogus result when trying to compare same-looking combining character and precomposed character.

Related

How can I convert unsigned char 0xFF into "FF" string in C++ [duplicate]

I want to do:
int a = 255;
cout << a;
and have it show FF in the output, how would I do this?
Use:
#include <iostream>
...
std::cout << std::hex << a;
There are many other options to control the exact formatting of the output number, such as leading zeros and upper/lower case.
To manipulate the stream to print in hexadecimal use the hex manipulator:
cout << hex << a;
By default the hexadecimal characters are output in lowercase. To change it to uppercase use the uppercase manipulator:
cout << hex << uppercase << a;
To later change the output back to lowercase, use the nouppercase manipulator:
cout << nouppercase << b;
std::hex is defined in <ios> which is included by <iostream>. But to use things like std::setprecision/std::setw/std::setfill/etc you have to include <iomanip>.
If you want to print a single hex number, and then revert back to decimal you can use this:
std::cout << std::hex << num << std::dec << std::endl;
I understand this isn't what OP asked for, but I still think it is worth pointing out how to do it with printf. I almost always prefer it over std::cout (even with no previous C background).
printf("%.2X", a);
'2' defines the precision, 'X' or 'x' defines case.
std::hex gets you the hex formatting, but it is a stateful option, meaning you need to save and restore state or it will impact all future output.
Naively switching back to std::dec is only good if that's where the flags were before, which may not be the case, particularly if you're writing a library.
#include <iostream>
#include <ios>
...
std::ios_base::fmtflags f( std::cout.flags() ); // save flags state
std::cout << std::hex << a;
std::cout.flags( f ); // restore flags state
This combines Greg Hewgill's answer and info from another question.
There are different kinds of flags and masks you can use as well. Please refer to http://www.cplusplus.com/reference/iostream/ios_base/setf/ for more information.
#include <iostream>
using namespace std;
int main()
{
int num = 255;
cout.setf(ios::hex, ios::basefield);
cout << "Hex: " << num << endl;
cout.unsetf(ios::hex);
cout << "Original format: " << num << endl;
return 0;
}
Use std::uppercase and std::hex to format integer variable a to be displayed in hexadecimal format.
#include <iostream>
int main() {
int a = 255;
// Formatting Integer
std::cout << std::uppercase << std::hex << a << std::endl; // Output: FF
std::cout << std::showbase << std::hex << a << std::endl; // Output: 0XFF
std::cout << std::nouppercase << std::showbase << std::hex << a << std::endl; // Output: 0xff
return 0;
}
C++20 std::format
This is now the cleanest method in my opinion, as it does not pollute std::cout state with std::hex:
main.cpp
#include <format>
#include <iostream>
#include <string>
int main() {
std::cout << std::format("{:x} {:#x} {}\n", 16, 17, 18);
}
Expected output:
10 0x11 18
Not yet implemented on GCC 10.0.1, Ubuntu 20.04.
But the fmt library, which std::format is based on and which should behave the same, worked once installed on Ubuntu 22.04 with:
sudo apt install libfmt-dev
or:
git clone https://github.com/fmtlib/fmt
cd fmt
git checkout 061e364b25b5e5ca7cf50dd25282892922375ddc
mkdir build
cd build
cmake ..
sudo make install
main2.cpp
#include <fmt/core.h>
#include <iostream>
int main() {
std::cout << fmt::format("{:x} {:#x} {}\n", 16, 17, 18);
}
Compile and run:
g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o main2.out main2.cpp -lfmt
./main2.out
Documented at:
https://en.cppreference.com/w/cpp/utility/format/format
https://en.cppreference.com/w/cpp/utility/format/formatter#Standard_format_specification
More info at: std::string formatting like sprintf
Pre-C++20: cleanly print and restore std::cout to previous state
main.cpp
#include <iostream>
#include <string>
int main() {
std::ios oldState(nullptr);
oldState.copyfmt(std::cout);
std::cout << std::hex;
std::cout << 16 << std::endl;
std::cout.copyfmt(oldState);
std::cout << 17 << std::endl;
}
Compile and run:
g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o main.out main.cpp
./main.out
Output:
10
17
More details: Restore the state of std::cout after manipulating it
Tested on GCC 10.0.1, Ubuntu 20.04.
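To print a byte array as two-digit uppercase hex values:
#include <iostream>
#include <iomanip>
int main() {
unsigned char arr[] = {4, 85, 250, 206};
for (const auto & elem : arr) {
// setw applies only to the next output, so it is repeated for each element
std::cout << std::setfill('0')
<< std::setw(2)
<< std::uppercase
<< std::hex
<< (0xFF & elem)
<< " ";
}
std::cout << std::endl; // prints: 04 55 FA CE
return 0;
}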

I can't compile the filesystem library

I'm trying to use the filesystem library, but I can't get it to compile and I need help.
I tried changing the included header and I updated my compiler, but nothing works.
Here are the includes I use:
#include <experimental/filesystem>
namespace fs = std::filesystem;
I compile the cpp file with this command
g++ -Wall -c indexation_fichier.cpp
I get this error
indexation_fichier.cpp:5:10: fatal error: experimental/filesystem: No such file or directory
#include <experimental/filesystem>
^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
and here is my compiler version
g++ (MinGW.org GCC-8.2.0-1) 8.2.0
when I type
g++ --version
I want to know what is wrong and what I need to do to make this library work because I need it for my project.
thanks.
As @pete mentioned in the comments, drop experimental from the include: the filesystem library is now part of standard C++17. With GCC 8 you also need to compile with -std=c++17 and link with the -lstdc++fs flag.
#include <filesystem>
#include <iostream>
namespace fs = std::filesystem;
int main(){
fs::path pathToShow(fs::current_path());
std::cout << "exists() = " << fs::exists(pathToShow) << "\n"
<< "root_name() = " << pathToShow.root_name() << "\n"
<< "root_path() = " << pathToShow.root_path() << "\n"
<< "relative_path() = " << pathToShow.relative_path() << "\n"
<< "parent_path() = " << pathToShow.parent_path() << "\n"
<< "filename() = " << pathToShow.filename() << "\n"
<< "stem() = " << pathToShow.stem() << "\n"
<< "extension() = " << pathToShow.extension() << "\n";
return 0;
}
and then something like g++ -std=c++17 -o fs filesystem.cpp -lstdc++fs should work (GCC 9 and later no longer need -lstdc++fs).

C++ printing pointer doesn't acknowledge showbase

I've noticed a discrepancy in the way we print pointers. gcc by default is adding 0x prefix to hex output of pointer, and Microsoft's compiler doesn't do that. showbase/noshowbase doesn't affect either of them.
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
void * n = (void *)1;
cout << noshowbase << hex << n << dec << endl;
// output (g++ (GCC) 4.7.2, 5.4.0): 0x1
// output (VS 2010, 2013): 00000001
n = (void *)10;
cout << noshowbase << hex << n << dec << endl;
// output (g++ (GCC) 4.7.2, 5.4.0): 0xa
// output (VS 2010, 2013): 0000000A
n = (void *)0;
cout << noshowbase << hex << n << dec << endl;
// output (g++ (GCC) 4.7.2, 5.4.0): 0
// output (VS 2010, 2013): 00000000
return 0;
}
I assume this is implementation defined behavior and not a bug, but is there a way to stop any compiler from prepending the 0x? We are already prepending the 0x on our own but in gcc it comes out like 0x0xABCD.
I'm sure I could do something like #ifdef __GNUC__ ... but I wonder if I'm missing something more obvious. Thanks
I propose that you cast to intptr_t (declared in <cstdint>) and treat the input as a normal integer.
Your I/O manipulators should then work.
#include <cstdint>
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
void * n = (void *)1;
cout << noshowbase << hex << reinterpret_cast<intptr_t>(n) << dec << endl;
n = (void *)10;
cout << noshowbase << hex << reinterpret_cast<intptr_t>(n) << dec << endl;
n = (void *)0;
cout << noshowbase << hex << reinterpret_cast<intptr_t>(n) << dec << endl;
}
// 1
// a
// 0
At first I was a little surprised by your question (more specifically, by these implementations' behaviour), but the more I think about it the more it makes sense. Pointers are not numbers*, and there's really no better authority on how to render them than the implementation.
* I'm serious! Although, funnily enough, for implementation/historical reasons, the standard calls a const void* "numeric" in this context, using the num_put locale stuff for formatted output, ultimately deferring to printf in [facet.num.put.virtuals]. The standard states that the formatting specifier %p should be used but, since the result of %p is implementation-defined, you could really get just about anything with your current code.

snprintf equivalent for wchar_t to calculate formatted string size (mac)

Is there any function available to calculate the formatted string size for wchar_t similar to char (snprintf) ?
MSVC has _snwprintf, but I couldn't find an equivalent on Mac.
If not, is there a way to calculate this with the standard libraries (without Boost)?
I tried this on OSX (Apple LLVM version 8.0.0, clang-800.0.42.1) and CentOS (g++ 4.8.5 20150623 Red Hat 4.8.5-11) using the swprintf() function recommended by Cubbi.
NB the man entry for swprintf() on Mac indicates that its return value is the "number of chars written", not the number of wchars.
I ran the following code, and got the output results shown in the comments on each platform:
#include <iostream>
#include <clocale>
#include <string>
#include <cstring>
#include <cwchar>
int main()
{
std::setlocale(LC_ALL, "en_US.utf8");
char narrow_str[] = "z\u00df\u6c34\U0001f34c";
std::cout << narrow_str << std::endl; // zß水🍌
std::cout << strlen(narrow_str) << std::endl; // 10 (= 1 + 2 + 3 + 4)
std::wstring wide_str = L"z\u00df\u6c34\U0001f34c";
std::cout << wide_str.length() << std::endl; // 4
std::cout << wcslen(wide_str.c_str()) << std::endl; // 4
wchar_t warr[100];
int n1 = std::swprintf(warr, sizeof warr/sizeof *warr, L"%s", narrow_str);
int n2 = std::swprintf(warr, sizeof warr/sizeof *warr, L"Converted from UTF-8: '%s'", narrow_str);
std::cout << n1 << std::endl; // 10 with LLVM, 4 with gcc
std::cout << n2 << std::endl; // 34 with LLVM, 28 with gcc
}
From this it appears that with LLVM, (1) the original code at cppreference fails and returns -1 because the warr[] array was too small, and (2) the return value from swprintf() is measured in char.
This may, or may not, be helpful.
'_snwprintf' is a wide-character version of _snprintf; the pointer arguments to '_snwprintf' are wide-character strings. Detection of encoding errors in '_snwprintf' might differ from the detection in _snprintf. '_snwprintf', just like swprintf, writes output to a string instead of a destination of type FILE.
int _snwprintf(
wchar_t *buffer,
size_t count,
const wchar_t *format [,
argument] ...
);
Parameters:
buffer
Storage location for the output.
count
Maximum number of characters to store.
format
Format-control string.
argument
Optional arguments.

Gcc: making UTF-8 the execution character set

I wrote this test program in a Latin-1 encoded file...
#include <cstring>
#include <iostream>
using namespace std;
const char str[] = "ÅÄÖ";
int main() {
cout << sizeof(str) << ' ' << strlen(str) << ' '
<< sizeof("Åäö") << ' ' << strlen("åäö") << endl;
return 0;
}
...and compiled it with g++ -fexec-charset=UTF-8 -pedantic -Wall on OpenBSD 5.3. I was expecting to see the size of the strings being 6 chars + 1 NUL char, but I get the output 4 3 4 3?
I tried changing my system locale from ISO-8859-1 to UTF-8 with export LC_CTYPE=sv_SE.UTF-8, but that didn't help. (According to the gcc manual, that only changes the input character set, but hey, it was worth a try.)
So, what am I doing wrong?