A file path is passed as a string. How do I convert this string to a std::filesystem::path? Example:
#include <filesystem>
#include <string>
std::string inputPath = "a/custom/path.ext";
const std::filesystem::path path = inputPath; // Is this assignment safe?
Yes, this construction is safe:
const std::filesystem::path path = inputPath; // Is this assignment safe?
That is not assignment, that is copy initialization. You are invoking this constructor:
template< class Source >
path( const Source& source );
which takes:
Constructs the path from a character sequence provided by source (4), which is a pointer or an input iterator to a null-terminated character/wide character sequence, an std::basic_string or an std::basic_string_view,
So you're fine. Plus, it would be really weird if you couldn't construct a filesystem::path from a std::string.
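As a minimal sketch of the constructor described above (the helper names `from_string`, `from_view` and `from_c_string` are made up for illustration), each of these source types goes through the same templated `Source` constructor:

```cpp
#include <filesystem>
#include <string>
#include <string_view>

namespace fs = std::filesystem;

// Hypothetical helpers: each one just forwards its argument to
// path's templated Source constructor.
fs::path from_string(const std::string& s) { return fs::path(s); }
fs::path from_view(std::string_view s)     { return fs::path(s); }
fs::path from_c_string(const char* s)      { return fs::path(s); }
```

Note that this only shows that the construction compiles and decomposes the path as expected; it says nothing about which encoding the narrow characters are assumed to be in.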
I'm aware that this is a 5 year old question, but since it is still the top result for the search "string to filesystem::path" IMHO it requires some additional comment on usages on Windows (although the original question probably was implicitly focused on Linux, judging from the chosen path separator).
On Windows, the implicit construction of a filesystem::path from a std::string is not safe with respect to encoding (even though it is legal according to the standard). If you have a path as a std::string that contains non-ASCII characters, you have to be absolutely clear about whether it is encoded in the local code page or in UTF-8 (in practice there is no such distinction on Linux). The implicit construction shown in the question assumes the local code page, while many third-party libraries assume UTF-8 (e.g. Qt's QString::toStdString() returns UTF-8, and strings in gRPC/protobuf should be expected to be UTF-8). Therefore, you have to be very careful not to mix these:
#include <filesystem>
#include <fstream>
#include <string>
const std::string path_as_string = "\xe4\xb8\xad"; // Chinese character Zhong, encoded as UTF-8
const std::filesystem::path wrong_path = path_as_string; // looks innocent, but the bytes are interpreted in the local code page!
std::ofstream stream;
stream.open(std::filesystem::current_path() / wrong_path); // creates a file with the broken name
stream.close();
The correct way to construct the path in this case would be
const std::filesystem::path correct_path = std::filesystem::u8path(path_as_string);
Please also note that this is a perfectly valid path, and you can easily create a file containing such characters using regular applications or Windows Explorer. Because it is so easy to get the conversion wrong, IMHO it would have been preferable for Windows development not to have any implicit constructor in the current situation, but requests for an option to disable that behavior in the STL implementations have been rejected with reference to the standard. On the other hand, it seems that Windows will slowly move to UTF-8 as the default code page in the future (see e.g. this comment). Until then, the caveats above should be taken into account.
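Since std::filesystem::u8path is deprecated as of C++20, one possible way to keep the UTF-8 intent explicit in both language versions is a small wrapper. This is only a sketch: the name `path_from_utf8` is made up, and it assumes the caller has already verified that the input really is UTF-8.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: builds a path from a std::string that the caller
// guarantees to be UTF-8, regardless of the active Windows code page.
inline std::filesystem::path path_from_utf8(const std::string& utf8)
{
#if defined(__cpp_lib_char8_t)
    // C++20: a char8_t source is always interpreted as UTF-8.
    return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path is the intended entry point (deprecated in C++20).
    return std::filesystem::u8path(utf8);
#endif
}
```

With such a helper, the example above would read `path_from_utf8(path_as_string)` instead of relying on the implicit conversion.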
Related
As this question is some years old: Is C++20 'char8_t' the same as our old 'char'?
I would like to know what the recommended way to handle the char8_t and char conversion is right now. boost::nowide (1.80.0) does not yet understand char8_t, nor (AFAIK) does boost::locale.
As Tom Honermann noted:
reinterpret_cast<const char *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text"); // Undefined behavior.
So: How do I interact with APIs that just accept const char* or const wchar_t* (think Win32 API) if my application's "default" string type is std::u8string? The recommendation seems to be https://utf8everywhere.org/.
If I have a std::u8string and convert it to std::string (and back) by
std::u8string convert(std::string str)
{
    // char -> char8_t: this is the direction the question is worried about
    return std::u8string(reinterpret_cast<const char8_t*>(str.data()), str.size());
}
std::string convert(std::u8string str)
{
    // char8_t -> char
    return std::string(reinterpret_cast<const char*>(str.data()), str.size());
}
This would invoke the same UB that Tom Honermann mentioned. This would be used when I talk to the Win32 API or any other API that wants some const char* or gives some const char* back. I could route all conversions through boost::nowide, but in the end I get a const char* back from boost::nowide::narrow() that I need to cast.
Is the current recommendation to just stay at char and ignore char8_t?
This would invoke the same UB that Tom Honermann mentioned.
As pointed out in the post you referred to, UB only happens when you cast from a char* to a char8_t*. The other direction is fine.
If you are given a char* which is encoded in UTF-8 (and you want to avoid the UB of just doing the cast), you can use std::transform to convert each char to a char8_t:
#include <algorithm>
#include <string>
std::u8string convert(std::string str)
{
    std::u8string ret(str.size(), u8'\0'); // basic_string has no size-only constructor
    std::ranges::transform(str, ret.begin(), [](char c) { return char8_t(c); });
    return ret;
}
C++23's ranges::to will make using a named return variable unnecessary.
For dealing with wchar_t interfaces (which you shouldn't have to, since nowadays UTF-8 support exists through narrow character interfaces on Windows), you'll have to do an actual UTF-8->UTF-16 conversion. Which you would have had to do anyway.
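To illustrate what "an actual UTF-8->UTF-16 conversion" involves, here is a minimal hand-written sketch. The function name `utf8_to_utf16` is hypothetical, and throwing on malformed input is just one possible policy. It returns a std::u16string; on Windows, where wchar_t is 16 bits wide, the code units could be copied into a std::wstring.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Throws std::invalid_argument on truncated or malformed input.
std::u16string utf8_to_utf16(std::string_view in)
{
    // Smallest code point that may legally use a sequence of the given length.
    static constexpr char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const auto lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Fold in the continuation bytes (10xxxxxx), six bits each.
        for (std::size_t k = 1; k < len; ++k) {
            const auto cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }

        // Reject overlong encodings, surrogates and out-of-range values.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

In production code a platform facility or an established library is usually a better choice than a hand-rolled decoder like this one.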
Personally, I think all the char8_t stuff in C++ is practically unusable!
With the current standard combined with OS support, I would recommend avoiding it if possible.
But that is not all. There is more to criticize:
Unfortunately, the C++ standard itself deprecates its own conversion support before it offers a replacement!
For example, the support in std::filesystem for a UTF-8 encoded standard string (not u8string) is deprecated (std::filesystem::u8path). That makes even a UTF-8 encoded std::string a pain to use, because you must always convert it from one type to the other and back again!
To your questions: it depends on what you want to do. If you want to have a std::string which is UTF-8 encoded but you only have a std::u8string, then you can simply do the following (no reinterpret_cast needed):
std::string convert( std::u8string str )
{
return std::string(str.begin(), str.end());
}
But here, I personally would expect the standard to offer a move constructor in std::string taking a std::u8string. Otherwise you always have to make a copy, with an extra allocation, of data that does not change.
Unfortunately the standard does not offer such simple things. It forces users to do uncomfortable and expensive things.
The same is true in the other direction: if you have a std::string and you have 100% verified that it is valid UTF-8, then you can convert it directly:
std::u8string convert( std::string str )
{
return std::u8string( str.begin(), str.end() );
}
While writing this long answer I realized that it is even worse than I thought when it comes to conversion! If you need to do a real conversion of the encoding, it turns out that std::u8string is not supported at all.
The only possible way (according to my research so far) is to use std::string as the data holder for the conversion, since the available routines work on char and NOT on char8_t!
So, for the conversion from a narrow-encoded std::string to std::u8string you must do the following:
Use std::mbrtoc16 or std::mbrtoc32 to convert the narrow chars to either UTF-16 or UTF-32.
Use std::codecvt_utf8 (or std::codecvt_utf8_utf16 for UTF-16 input) to produce a UTF-8 encoded std::string.
Finally use the routine above to convert from the UTF-8 encoded std::string to std::u8string.
For the other way round, from std::u8string to a narrow-encoded std::string, you must do the following:
Use the routine above to create a UTF-8 encoded std::string.
Use std::codecvt_utf8 (or std::codecvt_utf8_utf16) to create a UTF-16/32 string.
And finally use std::c16rtomb or std::c32rtomb to produce a narrow-encoded std::string.
But guess what? The codecvt routines are deprecated without a replacement...
So, personally, I would recommend to use the Windows API for it and use std::string only (or on Windows std::wstring). Usually only on Windows the std::string / char is encoded with a Windows code page and everywhere else you can normally expect it is UTF-8 (except maybe for Mainframes and some very rare old systems).
The conclusion can only be: Don't mess around with char8_t and std::u8string at all. It is practically unusable.
This is happening inside a big project, so I cannot really post a minimal reproducible example, but I'll try asking anyway. I'm building a set of benchmark applications integrated with a framework we're working on, and in one of them the conversion we need to make (float -> string) with to_string produces a result with a comma as the decimal separator.
| Monitored values:
| [ my_time_monitor.average = 61720,000000 ]
This is the function responsible:
std::string operating_point_parser::operator()(const int32_t num_threads, const float exec_time_ms) const {
return "{\"compute\":[{\"knobs\":{\"num_threads\":" + std::to_string(num_threads) + "},\"metrics\":{\"exec_time_ms\":[" + std::to_string(exec_time_ms) + ",0]}}]}";
}
Since, as I said, the exact same function is called by other applications which don't show this unexpected behavior, my guess is that some compilation flags are interfering.
set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++17 -DENABLE_THREADS -DHAVE_RANDOM -DHAVE_UNISTD_H -DHAVE_SYS_FILE_H -DHAVE_SYS_MMAN_H -DHAVE_CONFIG_H -DVIPSDATASET_PATH=\"\\\"${CMAKE_CURRENT_SOURCE_DIR}/dataset/orion_18000x18000.v\\\"\"" )
set( CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DHAVE_RANDOM -DHAVE_UNISTD_H -DIM_PREFIX=\"\\\"${CMAKE_CURRENT_SOURCE_DIR}/dataset\\\"\" -DIM_EXEEXT=\"\\\"\\\"\" -DIM_LIBDIR=\"\\\"${CMAKE_INSTALL_PREFIX}/lib\\\"\" -DGETTEXT_PACKAGE=\"\\\"vips7\"\\\" -DHAVE_SYS_FILE_H -DHAVE_SYS_MMAN_H -DHAVE_CONFIG_H")
If you want to take a look at the full application code, here's the link. The operating_point_parser::operator() is called inside margot::compute::push_custom_monitor_values().
As stated in the Notes of https://en.cppreference.com/w/cpp/string/basic_string/to_string, to_string relies on the current locale for formatting purposes:
std::to_string relies on the current locale for formatting purposes,
and therefore concurrent calls to std::to_string from multiple threads
may result in partial serialization of calls. C++17 provides
std::to_chars as a higher-performance locale-independent alternative.
So, if you want to have dots instead of commas, you have to adjust the current locale.
Or, instead of changing the global locale with std::locale::global(...), you could use a stringstream and imbue() the locale on that stream only, for example:
std::ostringstream ss;
ss.imbue(std::locale::classic());  // or any other locale you want
ss << value;                       // write what you need
std::string formatted = ss.str();  // get the formatted string
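Applied to the function from the question, the stream-based approach could look roughly like the sketch below. The free-function name `operating_point_json` is made up, and std::fixed with a precision of 6 is chosen here only to mimic the digits std::to_string would print.

```cpp
#include <cstdint>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Sketch: same JSON as in the question, but formatted through a stream
// that is pinned to the classic "C" locale, so the global locale is never consulted.
std::string operating_point_json(std::int32_t num_threads, float exec_time_ms)
{
    std::ostringstream json;
    json.imbue(std::locale::classic());
    json << std::fixed << std::setprecision(6)
         << "{\"compute\":[{\"knobs\":{\"num_threads\":" << num_threads
         << "},\"metrics\":{\"exec_time_ms\":[" << exec_time_ms << ",0]}}]}";
    return json.str();
}
```

Because the locale is attached to the stream rather than set globally, other threads and libraries that depend on the global locale are left alone.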
std::to_string uses the currently active locale for formatting.
You can set the active locale from a C locale name:
const char* locale = "C";
std::locale::global(std::locale(locale));
Meaning of locale name is specified in the C standard (quote from C11 draft):
7.11.1.1 The setlocale function
A value of "C" for locale specifies the minimal environment for C translation; a value of "" for locale specifies the locale-specific native environment. Other implementation-defined strings may be passed as the second argument to setlocale.
That value is going to be formatted inside a JSON string
In this case, and more generally whenever you wish to format in a style that shouldn't depend on the global locale, you should avoid std::to_string.
What would you recommend since you advised on avoiding it?
Anything that doesn't use locale, or lets you specify the locale to use instead of using the global locale. For example:
std::format("{}", 0.42); // doesn't use locale
std::format(std::locale("C"), "{:L}", 0.42); // use a specific locale (the L flag opts in to locale-aware formatting)
Another example is a stringstream with imbued locale as suggested in the other answer.
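The std::to_chars function mentioned in the cppreference note is another locale-independent option. A minimal sketch, assuming a standard library that implements the floating-point overloads (the helper name `float_to_string` is made up):

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical helper: formats like "%.6f" in the "C" locale,
// without ever looking at the global locale.
std::string float_to_string(float value)
{
    char buffer[64]; // large enough for any float in fixed notation with 6 decimals
    const auto result = std::to_chars(buffer, buffer + sizeof buffer, value,
                                      std::chars_format::fixed, 6);
    if (result.ec != std::errc{})
        return {}; // would only happen if the buffer were too small
    return std::string(buffer, result.ptr);
}
```

The fixed format and precision are passed explicitly here to match what std::to_string prints; leaving them out yields the shortest representation that round-trips.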
I'm on Windows and I'm constructing std::filesystem::path from std::string. According to constructor reference (emphasis mine):
If the source character type is char, the encoding of the source is assumed to be the native narrow encoding (so no conversion takes place on POSIX systems)
If I understand correctly, this means string content will be treated as encoded in ANSI under Windows. To treat it as encoded in UTF-8, I need to use std::filesystem::u8path() function. See the demo: http://rextester.com/PXRH65151
I want constructor of path to treat contents of narrow string as UTF-8 encoded. For boost::filesystem::path I could use imbue() method to do this:
boost::filesystem::path::imbue(std::locale(std::locale(), new std::codecvt_utf8_utf16<wchar_t>()));
However, I do not see such method in std::filesystem::path. Is there a way to achieve this behavior for std::filesystem::path? Or do I need to spit u8path all over the place?
My solution to this problem is to fully alias std::filesystem in a different namespace named std::u8filesystem, with classes and methods that treat std::string as UTF-8 encoded. Each class inherits from its counterpart of the same name in std::filesystem, without adding any fields or virtual methods, to offer full API/ABI interoperability. Full proof-of-concept code is here; it has been tested only on Windows so far and is far from complete. The following snippet shows the core of the helper:
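std::wstring U8ToW(const std::string &string); // UTF-8 -> UTF-16 helper, defined elsewhere
// Note: adding declarations to namespace std is formally undefined behavior;
// a project-specific namespace would be the safer choice.
namespace std
{
namespace u8filesystem
{
#ifdef _WIN32
    class path : public filesystem::path
    {
    public:
        path(const std::string &string)
            : filesystem::path(U8ToW(string)) // treat the narrow string as UTF-8
        {
        }
        inline std::string string() const
        {
            return filesystem::path::u8string(); // hand UTF-8 back out as well
        }
    };
#else
    using namespace filesystem;
#endif
}
}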
For the sake of performance, path does not have a global way to define locale conversions. Since C++ pre-20 does not have a specific type for UTF-8 strings, the system assumes any char strings are narrow character strings. So if you want to use UTF-8 strings, you have to spell it out explicitly, either by providing an appropriate conversion locale to the constructor or by using u8path.
C++20 gave us char8_t, which is always presumed to be UTF-8. So if you consistently use char8_t-based strings (like std::u8string), path's implicit conversion will pick up on it and work appropriately.
Let's say you have used the new std::filesystem (or std::experimental::filesystem) code to hunt down a file. You have a path variable that contains the full pathname of this file.
How do you open that file?
That may sound silly, but consider the obvious answer:
std::filesystem::path my_path = ...;
std::ifstream stream(my_path.c_str(), std::ios::binary);
This is not guaranteed to work. Why? Because on Windows for example, path::string_type is std::wstring. So path::c_str will return a const wchar_t*. And std::ifstream can only take paths with a const char* type.
Now it turns out that this code will actually function in VS. Why? Because Visual Studio has a library extension that does permit this to work. But that's non-standard behavior and therefore not portable. For example, I have no idea if GCC on Windows provides the same feature.
You could try this:
std::filesystem::path my_path = ...;
std::ifstream stream(my_path.string().c_str(), std::ios::binary);
Only Windows confounds us again. Because if my_path happened to contain Unicode characters, then now you're reliant on setting the Windows ANSI locale stuff correctly. And even that won't necessarily save you if the path happens to have characters from multiple languages that cannot exist in the same ANSI locale.
Boost Filesystem actually had a similar problem. But they extended their version of iostreams to support paths directly.
Am I missing something here? Did the committee add a cross-platform filesystem library without adding a cross-platform way to open files in it?
Bo Persson pointed out that this is the subject of a standard library defect report. The defect has been resolved, and C++17 will ship requiring implementations where path::value_type is not char to have their file stream types take const filesystem::path::value_type* arguments in addition to the usual const char* versions.
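In C++17 as published, the file stream constructors and open() also accept a std::filesystem::path directly, so no c_str() or string() call is needed. A minimal sketch (the helper name `read_file` is made up):

```cpp
#include <filesystem>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole file into a string, passing the path
// object straight to the stream so no narrowing conversion is involved.
std::string read_file(const std::filesystem::path& file)
{
    std::ifstream stream(file, std::ios::binary);
    if (!stream)
        throw std::runtime_error("cannot open file");
    std::ostringstream contents;
    contents << stream.rdbuf();
    return contents.str();
}
```

Passing the path itself lets the implementation use the native wide-character name on Windows instead of going through the ANSI code page.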
From boost/filesystem/path.hpp:
# ifdef BOOST_WINDOWS_API
const std::string string() const
{
[...]
}
# else // BOOST_POSIX_API
// string_type is std::string, so there is no conversion
const std::string& string() const { return m_pathname; }
[...]
# endif
For wstring() it is exactly the other way around - returning by reference on Windows and by value on POSIX. Is there an interesting reason for that?
On Windows, path stores a wstring, since the only way to handle Unicode-encoded paths in Windows is with UTF-16. On other platforms, the filesystems handle Unicode via UTF-8 (or close enough), so on those platforms, path stores a string.
So on non-Windows platforms, path::string will return a const-reference to the actual internal data structure. On Windows, it has to generate a std::string, so it returns it by copy.
Note that the File System TS bound for C++17 does not do this. There, path::string will always return a copy. If you want the natively stored string type, you must use path::native, whose type will be platform-dependent.
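For std::filesystem as standardized in C++17, the difference between the two accessors can be checked at compile time. The following is a small sketch; `native_length` is a made-up helper for illustration.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <type_traits>
#include <utility>

namespace fs = std::filesystem;

// string() returns a std::string by value on every platform.
static_assert(std::is_same_v<
    decltype(std::declval<const fs::path&>().string()), std::string>);

// native() returns a const reference to the stored string, whose
// character type (char or wchar_t) depends on the platform.
static_assert(std::is_same_v<
    decltype(std::declval<const fs::path&>().native()),
    const fs::path::string_type&>);

// Hypothetical helper: inspects the stored string without copying it.
std::size_t native_length(const fs::path& p)
{
    const fs::path::string_type& stored = p.native();
    return stored.size(); // counted in native code units
}
```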
With the Windows API it returns by value because the variable 'm_pathname' needs to be converted into a different format (string), as implemented by 'path_traits'. This introduces a temporary variable, which of course cannot be returned by reference, though the extra copy will be elided by NRVO or replaced by an implicit move.
In the POSIX case, 'm_pathname' is already in the native format (string), so there is no need to convert, and hence it can be returned as a const reference.