Clean Way to Convert Python 3 Unicode to std::string


I wrap a lot of C++ using the Python 2 API (I can't use things like swig or boost.python for various technical reasons). When I have to pass a string (usually a path, always ASCII) into C/C++, I use something like this:
std::string file_name = PyString_AsString(py_file_name);
if (PyErr_Occurred()) return NULL;
Now I'm considering updating to Python 3, where PyString_* methods don't exist. I found one solution that says I should do something like this:
PyObject* bytes = PyUnicode_AsUTF8String(py_file_name);
std::string file_name = PyBytes_AsString(bytes);
if (PyErr_Occurred()) return NULL;
Py_DECREF(bytes);
However this is twice as many lines and seems a bit ugly (not to mention that it could introduce a memory leak if I forget the last line).
The other option is to redefine the python functions to operate on bytes objects, and to call them like this
def some_function(path_name):
    _some_function(path_name.encode('utf8'))
This isn't terrible, but it does require a python-side wrapper for each function.
Is there some cleaner way to deal with this?

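Python 3.3 and later provide char* PyUnicode_AsUTF8(PyObject* unicode) (the return type is const char* from Python 3.7 on). The returned pointer refers to a UTF-8 buffer cached inside the string object, so there is nothing to Py_DECREF afterwards. For ASCII paths it can be used just like the PyString_AsString() function from Python 2.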

If you know (and of course, you could check with an assert or similar) that it's all ASCII, then you could simply create it like this:
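std::string py_string_to_std_string(PyObject* py_file_name)
{
    // Assumes py_file_name is a str that contains only ASCII code points.
    Py_ssize_t len = PyUnicode_GetLength(py_file_name);
    if (len < 0)
        return std::string(); // not a str; a Python exception has been set
    std::string str;
    str.reserve(len);
    for (Py_ssize_t i = 0; i < len; i++)
        str += static_cast<char>(PyUnicode_ReadChar(py_file_name, i));
    return str;
}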

This improves on the accepted answer: instead of PyUnicode_AsUTF8(...), it is better to use PyUnicode_AsUTF8AndSize(...).
The string may contain a null character (code point 0) somewhere in the middle. If you build the std::string from the bare pointer returned by PyUnicode_AsUTF8(...), the result is truncated at that character.
Py_ssize_t size = 0;
char const * pc = PyUnicode_AsUTF8AndSize(obj, &size);
std::string s;
if (pc) {
    s = std::string(pc, size);
} else {
    // Error: a Python exception has been set; handle it here
}
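The truncation is easy to reproduce without the Python API. A minimal sketch, in which the buffer raw and the length size are hypothetical stand-ins for what PyUnicode_AsUTF8AndSize would hand back:

```cpp
#include <cstddef>
#include <string>

// Stand-in for a UTF-8 buffer with an embedded NUL: 'a', 'b', '\0', 'c', 'd',
// followed by the usual terminating NUL.
static const char raw[] = {'a', 'b', '\0', 'c', 'd', '\0'};
static const std::size_t size = 5;

// Constructing from a bare pointer stops at the first NUL byte.
std::string from_pointer() { return std::string(raw); }

// Constructing from pointer plus length keeps every byte, NUL included.
std::string from_pointer_and_size() { return std::string(raw, size); }
```

Only the second form preserves the bytes after the embedded NUL, which is why the size-aware call is the safer default.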

Related

Reading contents of file into dynamically allocated char* array- can I read into std::string instead?

I have found myself writing code which looks like this
// Treat the following as pseudocode - just an example
iofile.seekg(0, std::ios::end); // iofile is a file opened for read/write
uint64_t f_len = iofile.tellg();
if(f_len >= some_min_length)
{
// Focus on the following code here
char *buf = new char[7];
char buf2[]{"MYFILET"}; // just some random string
// if we see this it's a good indication
// the rest of the file will be in the
// expected format (unlikely to see this
// sequence in a "random file", but don't
// worry too much about this)
iofile.read(buf, 7);
if(memcmp(buf, buf2, 7) == 0) // I am confident this works
{
// carry on processing file ...
// ...
// ...
}
}
else
cout << "invalid file format" << endl;
This code is probably an okay sketch of what we might want to do when opening a file, which has some specified format (which I've dictated). We do some initial check to make sure the string "MYFILET" is at the start of the file - because I've decided all my files for the job I'm doing are going to start with this sequence of characters.
I think this code would be better if we didn't have to play around with "c-style" character arrays, but used strings everywhere instead. This would be advantageous because we could do things like if(buf == buf2) if buf and buf2 were std::strings.
A possible alternative could be,
// Focus on the following code here
std::string buf;
std::string buf2("MYFILET"); // very nice
buf.resize(7); // okay, but not great
iofile.read(buf.data(), 7); // pretty awful - error prone if wrong length argument given
// also we have to resize buf to 7 in the previous step
// lots of potential for mistakes here,
// and the length was used twice which is never good
if(buf == buf2) then do something
What are the problems with this?
We had to use the length variable 7 (or constant in this case) twice. Which is somewhere between "not ideal" and "potentially error prone".
We had to access the contents of buf using .data() which I shall assume here is implemented to return a raw pointer of some sort. I don't personally mind this too much, but others may prefer a more memory-safe solution, perhaps hinting we should use an iterator of some sort? I think in Visual Studio (for Windows users, which I am not) this may return an iterator anyway, which may produce warnings or errors, though I'm not sure about this.
We had to have an additional resize statement for buf. It would be better if the size of buf could be automatically set somehow.

Before C++17, std::string::data() returns a const char*, and writing through that pointer is undefined behavior (C++17 added a non-const overload). However, you are free to use std::vector::data() in this way.
If you want to use std::string, and dislike setting the size yourself, you may consider whether you can use std::getline(). This is the free function, not std::istream::getline(). The std::string version will read up to a specified delimiter, so if you have a text format you can tell it to read until '\0' or some other character which will never occur, and it will automatically resize the given string to hold the contents.
If your file is binary in nature, rather than text, I think most people would find std::vector<char> to be a more natural fit than std::string anyway.

We had to use the length variable 7 (or constant in this case) twice.
Which is somewhere between "not ideal" and "potentially error prone".
The second time you can use buf.size()
iofile.read(buf.data(), buf.size());
We had to access the contents of buf using .data() which I shall
assume here is implemented to return a raw pointer of some sort.
As John Zwinck pointed out, .data() returns a pointer to const.
I suppose you could define buf as std::vector<char>; for a vector (if I'm not mistaken) .data() returns a pointer to char (in this case), not to const char.
size() and resize() work the same way.
We had to have an additional resize statement for buf. It would be
better if the size of buf could be automatically set somehow.
I don't think read() permits this.
p.s.: sorry for my bad English.

We can validate a signature without double buffering (rdbuf and a string) and allocating from the heap...
// terminating null not included
constexpr char sig[] = { 'M', 'Y', 'F', 'I', 'L', 'E', 'T' };
auto ok = all_of(begin(sig), end(sig), [&fs](char c) { return fs.get() == (int)c; });
if (ok) {}

template<class Src>
std::string read_string( Src& src, std::size_t count){
std::string buf;
buf.resize(count);
src.read(&buf.front(), count); // in C++17 make it buf.data()
return buf;
}
Now auto read = read_string( iofile, 7 ); is clean at point of use.
buf2 is a bad plan. I'd do:
if(read=="MYFILET")
directly, or use a const char myfile_magic[] = "MYFILET";.

I liked many of the ideas from the examples above, however I wasn't completely satisfied that there was an answer which would produce undefined-behaviour-free code for C++11 and C++17. I currently write most of my code in C++11 - because I don't anticipate using it on a machine in the future which doesn't have a C++11 compiler.
If one doesn't, then I add a new compiler or change machines.
However it does seem to me to be a bad idea to write code which I know may not work under C++17... That's just my personal opinion. I don't anticipate using this code again, but I don't want to create a potential problem for myself in the future.
Therefore I have come up with the following code. I hope other users will give feedback to help improve this. (For example there is no error checking yet.)
std::string
fstream_read_string(std::fstream& src, std::size_t n)
{
char *const buffer = new char[n + 1];
src.read(buffer, n);
buffer[n] = '\0';
std::string ret(buffer);
delete [] buffer;
return ret;
}
This seems like a basic, probably fool-proof method... It's a shame there seems to be no way to get std::string to use the same memory as allocated by the call to new.
Note we had to add an extra trailing null character in the C-style string, which is sliced off in the C++-style std::string.

C++: Overwrite std::string in Cache

I have a string variable (it contains a passphrase) and would like to overwrite its value with a sequence of '0' characters before the variable is released. I thought about doing something like:
void overwrite(std::string &toOverwrite){
if(toOverwrite.empty())
return;
else{
std::string removeString;
size_t length = toOverwrite.size();
for(int i = 0; i < length; i++){
removeString += "0";
}
toOverwrite = removeString;
}
}
But somehow this doesn't feel right.
First because it seems to produce much overhead in the for loop.
Moreover I'm not sure if the last line would really overwrite the string. I know that e.g. in Java strings are immutable and therefore can not be overwritten at all. They are not immutable in C++ (at least not std::string) but would toOverwrite = removeString really replace toOverwrite or just make that the "pointer" of toOverwrite will point to removeString?
Is it possible that my compiler will optimize the code and removes this overwriting?
Maybe I should use the std::string::replace method or change the datatype to char* / byte[]?

Chances are that will just swap and free pointers, leaving the passphrase somewhere in memory which is no longer pointed to. If you want to overwrite the string data, do:
std::fill(toOverwrite.begin(), toOverwrite.end(), '0');
And you don't need a test for an empty string either.
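Put together, the in-place version is a one-liner. This is only a sketch: it shows that the existing buffer is rewritten rather than replaced, but an optimizer is still allowed to drop stores it can prove are never read, so it is not a guarantee that the passphrase leaves memory.

```cpp
#include <algorithm>
#include <string>

// Overwrites every character of the existing buffer with '0'.
// No new string is allocated and the length is unchanged.
void overwrite(std::string& toOverwrite) {
    std::fill(toOverwrite.begin(), toOverwrite.end(), '0');
}
```

Because std::fill on an empty range does nothing, the empty-string check from the question is not needed.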

C++/CX - I need to pass a Platform::String into a method that takes a const char*?

I'm new to c++ (I'm a c# developer).
I have an SQLite wrapper class that requires you to pass in a database name as a const char* , however I only have it as a Platform::String (after doing a file search).
I can't seem to find a way to convert the Platform::String to const char*.
I've seen another question on StackOverflow that explains why it isn't straightforward, but no sample code or end-to-end solution.
Can anyone help me ?
Thanks

Disclaimer: I know little about C++/CX, and I'm basing the answer on the documentation here.
The String class contains 16-bit Unicode characters, so you can't directly get a pointer to 8-bit char-typed characters; you'll need to convert the contents.
If the string is known to only contain ASCII characters, then you can convert it directly:
String^ s = whatever();
std::string narrow(s->Begin(), s->End());
function_requiring_cstring(narrow.c_str());
Otherwise, the string will need translating, which gets rather hairy. The following might do the right thing, converting the wide characters to multi-byte sequences of narrow characters:
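String^ s = whatever();
std::wstring wide(s->Begin(), s->End());
std::vector<char> buffer(s->Length()+1); // We'll need at least that much
for (;;) {
    size_t length = std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    if (length == (size_t)-1) {
        // Conversion failed: a character has no representation in the current locale
        buffer.assign(1, '\0');
        break;
    } else if (length == buffer.size()) {
        // Buffer was filled without room for the terminator; grow and retry
        buffer.resize(buffer.size()*2);
    } else {
        buffer.resize(length+1);
        break;
    }
}
function_requiring_cstring(buffer.data());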
Alternatively, you may find it easier to ignore Microsoft's ideas about how strings should be handled, and use std::string instead.
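The grow-and-retry conversion can also be tried out in portable C++, starting from a std::wstring instead of a Platform::String. The following is a sketch only; the function name narrow is made up here, and the result depends on the current C locale, which is assumed to be able to represent the characters involved:

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a wide string to a multi-byte string
// using the current C locale, growing the buffer until the result fits.
std::string narrow(const std::wstring& wide) {
    std::vector<char> buffer(wide.size() + 1);
    for (;;) {
        std::size_t length =
            std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
        if (length == static_cast<std::size_t>(-1))
            throw std::runtime_error("character not representable in current locale");
        // length < size means the terminating NUL fit, so nothing was cut off.
        if (length < buffer.size())
            return std::string(buffer.data(), length);
        buffer.resize(buffer.size() * 2);
    }
}
```

The returned std::string owns its storage, so narrow(wide).c_str() can be passed to a function expecting const char* for the duration of that call.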

What is std::safe_string?

An answer to one of my questions included the following line of code:
label = std::safe_string(name); // label is a std::string
The intent seems to be a wrapper around a string literal (so presumably no allocation takes place). I've never heard of safe_string and neither, apparently, has google (nor could I find it in the 98 standard).
Does anyone know what this is about?

After searching google code search (I should have thought of this first...) I found this:
//tools-cgi.cpp
string safe_string (const char * s)
{
return (s != NULL) ? s : "";
}
Which converts NULLs to zero length strings. Although this is not standard it's probably some sort of extension in a specific STL implementation which was referred to in the answer.

There is no standard safe_string. The safe_string you're seeing in that answerer's response is from what looks like a private STL extensions utility library.
Google for "stlext/stringext.h" and you'll see the same library referenced in a post on another forum.

There is no such thing as std::safe_string

It is not part of C++ standard (but perhaps it should be?)
I have been using the same kind of helper function to keep a std::string from throwing an exception when it is constructed from a NULL char * string. But it was more something like:
// defined somewhere else as ""
extern const char * const g_strEmptyString ;
inline const char * safe_string(const char * p)
{
return (p) ? (p) : (g_strEmptyString) ;
}
No overhead, and no crash of a std::string when I feed it a char * string that could be NULL but that, in that particular case, should behave as an empty string.

C++: how to get fprintf results as a std::string w/o sprintf

I am working with an open-source UNIX tool that is implemented in C++, and I need to change some code to get it to do what I want. I would like to make the smallest possible change in hopes of getting my patch accepted upstream. Solutions that are implementable in standard C++ and do not create more external dependencies are preferred.
Here is my problem. I have a C++ class -- let's call it "A" -- that currently uses fprintf() to print its heavily formatted data structures to a file pointer. In its print function, it also recursively calls the identically defined print functions of several member classes ("B" is an example). There is another class C that has a member std::string "foo" that needs to be set to the print() results of an instance of A. Think of it as a to_str() member function for A.
In pseudocode:
class A {
public:
...
void print(FILE* f);
B b;
...
};
...
void A::print(FILE *f)
{
std::string s = "stuff";
fprintf(f, "some %s", s.c_str());
b.print(f);
}
class C {
...
std::string foo;
bool set_foo(std::string);
...
};
...
A a;
C c;
...
// wish i knew how to write A's to_str()
c.set_foo(a.to_str());
I should mention that C is fairly stable, but A and B (and the rest of A's dependents) are in a state of flux, so the less code changes necessary the better. The current print(FILE* F) interface also needs to be preserved. I have considered several approaches to implementing A::to_str(), each with advantages and disadvantages:
1: Change the calls to fprintf() to sprintf()
   - I wouldn't have to rewrite any format strings
   - print() could be reimplemented as: fprintf(f, "%s", this->to_str().c_str());
   - But I would need to manually allocate char[]s, merge a lot of C strings, and finally convert the character array to a std::string
2: Try to catch the results of a.print() in a string stream
   - I would have to convert all of the format strings to << output format. There are hundreds of fprintf()s to convert :-{
   - print() would have to be rewritten because there is no standard way that I know of to create an output stream from a UNIX file handle (though I have seen it suggested that this may be possible).
3: Use Boost's string format library
   - More external dependencies. Yuck.
   - Format's syntax is different enough from printf() to be annoying:
     printf(format_str, args) -> cout << boost::format(format_str) % arg1 % arg2 % etc
4: Use Qt's QString::asprintf()
   - A different external dependency.
So, have I exhausted all possible options? If so, which do you think is my best bet? If not, what have I overlooked?
Thanks.

Here's the idiom I like for making functionality identical to 'sprintf', but returning a std::string, and immune to buffer overflow problems. This code is part of an open source project that I'm writing (BSD license), so everybody feel free to use this as you wish.
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>
std::string vformat (const char *fmt, va_list ap); // defined below
std::string
format (const char *fmt, ...)
{
va_list ap;
va_start (ap, fmt);
std::string buf = vformat (fmt, ap);
va_end (ap);
return buf;
}
std::string
vformat (const char *fmt, va_list ap)
{
// Allocate a buffer on the stack that's big enough for us almost
// all the time.
size_t size = 1024;
char buf[1024]; // fixed size; a runtime-sized array is not standard C++
// Try to vsnprintf into our buffer.
va_list apcopy;
va_copy (apcopy, ap);
int needed = vsnprintf (&buf[0], size, fmt, ap);
// NB. On Windows, vsnprintf returns -1 if the string didn't fit the
// buffer. On Linux & OSX, it returns the length it would have needed.
if (needed >= 0 && (size_t) needed < size) {
// It fit fine the first time, we're done.
return std::string (&buf[0]);
} else {
// vsnprintf reported that it wanted to write more characters
// than we allotted. So do a malloc of the right size and try again.
// This doesn't happen very often if we chose our initial size
// well.
std::vector <char> buf;
size = needed + 1; // room for the terminating NUL
buf.resize (size);
needed = vsnprintf (&buf[0], size, fmt, apcopy);
va_end (apcopy);
return std::string (&buf[0]);
}
}
EDIT: when I wrote this code, I had no idea that this required C99 conformance and that Windows (as well as older glibc) had different vsnprintf behavior, in which it returns -1 for failure, rather than a definitive measure of how much space is needed. Here is my revised code. Could everybody look it over? If you think it's OK, I will edit again to make that the only code listed:
std::string
Strutil::vformat (const char *fmt, va_list ap)
{
// Allocate a buffer on the stack that's big enough for us almost
// all the time. Be prepared to allocate dynamically if it doesn't fit.
size_t size = 1024;
char stackbuf[1024];
std::vector<char> dynamicbuf;
char *buf = &stackbuf[0];
va_list ap_copy;
while (1) {
// Try to vsnprintf into our buffer.
va_copy(ap_copy, ap);
int needed = vsnprintf (buf, size, fmt, ap_copy);
va_end(ap_copy);
// NB. C99 (which modern Linux and OS X follow) says vsnprintf
// failure returns the length it would have needed. But older
// glibc and current Windows return -1 for failure, i.e., not
// telling us how much was needed.
if (needed < (int)size && needed >= 0) {
// It fit fine so we're done.
return std::string (buf, (size_t) needed);
}
// vsnprintf reported that it wanted to write more characters
// than we allotted. So try again using a dynamic buffer. This
// doesn't happen very often if we chose our initial size well.
size = (needed > 0) ? (needed+1) : (size*2);
dynamicbuf.resize (size);
buf = &dynamicbuf[0];
}
}
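Where C99 behavior of vsnprintf can be relied on (C++11's std::vsnprintf is specified that way), the retry loop can be reduced to two passes: one to measure, one to write. This is a sketch under that assumption and is not the code from the project mentioned above; the name format_string is hypothetical:

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical printf-style helper that returns a std::string.
// Assumes C99 semantics: vsnprintf returns the length it would have written.
std::string format_string(const char* fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    va_list ap_copy;
    va_copy(ap_copy, ap); // each va_list may only be consumed once

    // First pass: no buffer, just ask how many characters are needed.
    int needed = std::vsnprintf(nullptr, 0, fmt, ap);
    va_end(ap);
    if (needed < 0) { // encoding error
        va_end(ap_copy);
        return std::string();
    }

    // Second pass: write into a buffer with room for the terminating NUL.
    std::vector<char> buf(static_cast<std::size_t>(needed) + 1);
    std::vsnprintf(buf.data(), buf.size(), fmt, ap_copy);
    va_end(ap_copy);
    return std::string(buf.data(), static_cast<std::size_t>(needed));
}
```

The trade-off is that every call formats twice, whereas the stack-buffer version above only pays that cost for long outputs.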

I am using #3: the boost string format library - but I have to admit that I've never had any problem with the differences in format specifications.
Works like a charm for me - and the external dependencies could be worse (a very stable library)
Edited: adding an example how to use boost::format instead of printf:
sprintf(buffer, "This is a string with some %s and %d numbers", "strings", 42);
would be something like this with the boost::format library:
string = boost::str(boost::format("This is a string with some %s and %d numbers") %"strings" %42);
Hope this helps clarify the usage of boost::format
I've used boost::format as a sprintf / printf replacement in 4 or 5 applications (writing formatted strings to files, or custom output to logfiles) and never had problems with format differences. There may be some (more or less obscure) format specifiers that behave differently, but I never had a problem.
In contrast, there were some format specifications I couldn't really reproduce with streams (as far as I remember).

You can use std::string and iostreams with formatting, such as the setw() call and others in iomanip
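For instance, column-aligned output similar to a printf format like "%-8s%8.2f" can be built with an ostringstream and manipulators from <iomanip>. A small sketch, where format_row is a made-up name:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: left-aligns `name` in 8 columns, then right-aligns
// `value` in 8 columns with two digits after the decimal point.
std::string format_row(const std::string& name, double value) {
    std::ostringstream os;
    os << std::left << std::setw(8) << name
       << std::right << std::setw(8) << std::fixed << std::setprecision(2)
       << value;
    return os.str();
}
```

Note that setw applies only to the next inserted item, while left, right, fixed and setprecision stay in effect on the stream until changed.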

The {fmt} library provides fmt::sprintf function that performs printf-compatible formatting (including positional arguments according to POSIX specification) and returns the result as std::string:
std::string s = fmt::sprintf("The answer is %d.", 42);
Disclaimer: I'm the author of this library.

The following might be an alternative solution:
void A::printto(ostream& outputstream) {
    char buffer[100];
    string s = "stuff";
    sprintf(buffer, "some %s", s.c_str());
    outputstream << buffer << endl;
    b.printto(outputstream);
}
(B::printto similar), and define
void A::print(FILE *f) {
    // A standard ofstream cannot be constructed from a FILE*, so go through the string
    fputs(to_str().c_str(), f);
}
string A::to_str() {
ostringstream os;
printto(os);
return os.str();
}
Of course, you should really use snprintf instead of sprintf to avoid buffer overflows. You could also selectively change the more risky sprintfs to << format, to be safer and yet change as little as possible.

You should try the Loki library's SafeFormat header file (http://loki-lib.sourceforge.net/index.php?n=Idioms.Printf). It's similar to boost's string format library, but keeps the syntax of the printf(...) functions.
I hope this helps!

Is this about serialization? Or printing proper?
If the former, consider boost::serialization as well. It's all about "recursive" serialization of objects and sub-object.

Very very late to the party, but here's how I'd attack this problem.
1: Use pipe(2) to open a pipe.
2: Use fdopen(3) to convert the write fd from the pipe to a FILE *.
3: Hand that FILE * to A::print().
4: Use read(2) to pull bufferloads of data, e.g. 1K or more at a time from the read fd.
5: Append each bufferload of data to the target std::string
6: Repeat steps 4 and 5 as needed to complete the task.
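The steps above can be sketched roughly as follows on a POSIX system. The name capture_output is hypothetical, and the sketch assumes the printed output is small enough to fit in the pipe's kernel buffer, because nothing reads from the pipe until printing has finished; larger outputs would need a reader running concurrently.

```cpp
#include <cstdio>
#include <string>
#include <unistd.h>

// Hypothetical helper: runs print_fn against a FILE* backed by a pipe and
// returns everything it wrote. Assumes the output fits in the pipe buffer.
std::string capture_output(void (*print_fn)(FILE*)) {
    int fds[2];
    if (pipe(fds) != 0)
        return std::string();

    FILE* out = fdopen(fds[1], "w"); // wrap the write end in a FILE*
    if (!out) {
        close(fds[0]);
        close(fds[1]);
        return std::string();
    }

    print_fn(out);
    fclose(out); // flushes stdio buffers and closes the write end

    // With the write end closed, read() returns 0 once the pipe is drained.
    std::string result;
    char chunk[1024];
    ssize_t n;
    while ((n = read(fds[0], chunk, sizeof chunk)) > 0)
        result.append(chunk, static_cast<std::size_t>(n));
    close(fds[0]);
    return result;
}
```

A to_str() member could then hand A::print() the pipe-backed FILE* in the same way, leaving the existing fprintf() calls untouched.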
