How to extract values from boost::python::list to C++

I am creating .so files on Linux so that I can import them into Python scripts and start using them. I need to pass data from Python to the C++ layer so that I can use it there. I am not able to extract the values despite referring to many posts. I have given the reference code below (u8 is a typedef for unsigned char):
#include "cp2p_layer.h"
#include <boost/python.hpp>
using namespace boost::python;
BOOST_PYTHON_MODULE(cp2p_hal)
{
class_<SCSICommandsB>("SCSICommandsB")
.def("Write10", &SCSICommandsB::Write10)
;
}
The following code is from cp2p_layer.cpp. I can get the length of the list, but the data is always blank:
u16 SCSICommandsB::Write10(u8 lun, u8 config, u32 LBA, u16 transferLen,
                           u8 control, u8 groupNo, boost::python::list pythonList)
{
    u16 listLen;
    u8* pDataBuf = new u8[transferLen];

    listLen = boost::python::len(pythonList);
    if (listLen != transferLen)
    {
        cout << "\nwarning: The write10 cdb has transfer length " << transferLen
             << " that doesn't match with data buffer size " << listLen << "\n";
    }

    for (int i = 0; i < listLen; i++)
    {
        pDataBuf[i] = boost::python::extract<u8>(pythonList[i]);
        cout << boost::python::extract<u8>(pythonList[i]) << "-";
        //cout << pDataBuf[i] << ".";
    }
    cout << "\n";
    cout << "info: inside write10 boost len:" << listLen << "\n";

    return oScsi.Write10(lun, config, LBA, transferLen, control, groupNo, pDataBuf);
}
When I execute the Python script as
#!/usr/bin/python
import cp2p_hal
scsiCmds = cp2p_hal.SCSICommandsB()
plist = [0,1,2,3,4,5,6,7,8,9]
print len(plist)
scsiCmds.Write10(0,0,0,10,0,0,plist)
The output comes as
10
-------- -
info: inside write10 boost len:10
Any help is much appreciated. I also have questions regarding how to read data back from the C++ layer once the read command has executed; I will create a new post once I get this part done. Thanks in advance.

The problem is only in your printing of the values. A u8 in C++ is an unsigned char, and cout will output the corresponding ASCII character. Your values (0-9) are unprintable, except for 9, which happens to be a tab; that explains the space before the final hyphen in your output.
How to fix it? Cast to an int before outputting:
cout << static_cast<int>(boost::python::extract<u8>(pythonList[i])) << "-";
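For illustration, here is a minimal standalone sketch of the same cout behavior (reusing the question's u8 typedef):

#include <iostream>

typedef unsigned char u8;

int main()
{
    u8 byte = 9;                                  // ASCII 9 is a tab
    std::cout << byte << "\n";                    // streams a tab, not "9"
    std::cout << static_cast<int>(byte) << "\n";  // prints 9
    std::cout << +byte << "\n";                   // unary + promotes to int, also 9
}

The unary + shorthand works because integral promotion turns the unsigned char into an int before operator<< selects an overload.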

Related

How can I "convert" ISO-8859-7 strings to UTF-8 in C++?

I'm working with 10+ year old machines which use ISO 8859-7 to represent Greek characters, using a single byte each.
I need to catch those characters and convert them to UTF-8 in order to inject them into a JSON to be sent via HTTPS.
Also, I'm using GCC v4.4.7 and I don't feel like upgrading, so I can't use codecvt or such.
Example: for "OΛΑ" I get the char values [0xcf, 0xcb, 0xc1] and I need to write the string "\u039F\u039B\u0391".
PS: I'm not a charset expert, so please avoid philosophical answers like "ISO 8859 is a subset of Unicode so you just need to implement the algorithm".
Given that there are so few values to map, a simple solution is to use a lookup table.
Pseudocode:
id_offset    = 0x80  // 0x00 .. 0x7F are the same in UTF-8
c1_offset    = 0x20  // 0x80 .. 0x9F are control characters
table_offset = id_offset + c1_offset

table = [
    u8"\u00A0",  // 0xA0
    u8"‘",       // 0xA1
    u8"’",
    u8"£",
    u8"€",
    u8"₯",
    // ... Refer to ISO 8859-7 for the full list of characters.
]

let S be the input string
let O be an empty output string

for each char C in S
    reinterpret C as unsigned char U
    if U less than id_offset            // same in both encodings
        append char C to O
    else if U less than table_offset    // control code
        append char '\xC2' to O         // lead byte
        append char C to O
    else
        append string table[U - table_offset] to O
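Rendered as C++, the same algorithm might look like the following sketch (assuming a C++11-but-pre-C++20 compiler so the u8 literals are plain char arrays; on GCC 4.4 you would write the raw bytes instead, e.g. "\xE2\x80\x98" for the left quote):

#include <string>

// Truncated table for illustration; a real one needs all 96 entries
// covering 0xA0 .. 0xFF from the ISO 8859-7 code chart.
static const char* const table[] = {
    u8"\u00A0",  // 0xA0
    u8"‘", u8"’", u8"£", u8"€", u8"₯",
    // ...
};

std::string iso88597_to_utf8(const std::string& s)
{
    std::string out;
    for (std::string::size_type i = 0; i != s.size(); ++i) {
        unsigned char u = static_cast<unsigned char>(s[i]);
        if (u < 0x80) {
            out += s[i];               // ASCII: identical in UTF-8
        } else if (u < 0xA0) {
            out += '\xC2';             // C1 control code: two-byte sequence
            out += s[i];
        } else {
            out += table[u - 0xA0];    // everything else: table lookup
        }
    }
    return out;
}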
All that said, I recommend saving some time by using a library instead.
One way is to use the POSIX libiconv library. On Linux, the functions needed (iconv_open, iconv and iconv_close) are even included in libc, so no extra linkage is needed there. On your old machines you may need to install libiconv, but I doubt it.
Converting may be as simple as this:
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>

// A wrapper for the iconv functions
class Conv {
public:
    // Open a conversion descriptor for the two selected character sets
    Conv(const char* to, const char* from) : cd(iconv_open(to, from)) {
        if(cd == reinterpret_cast<iconv_t>(-1))
            throw std::runtime_error(std::strerror(errno));
    }
    Conv(const Conv&) = delete;
    ~Conv() { iconv_close(cd); }

    // the actual conversion function
    std::string convert(const std::string& in) {
        const char* inbuf = in.c_str();
        size_t inbytesleft = in.size();

        // make the "out" buffer big to fit whatever we throw at it and set pointers
        std::string out(inbytesleft * 6, '\0');
        char* outbuf = out.data();
        size_t outbytesleft = out.size();

        // the const_cast shouldn't be needed but my "iconv" function declares it
        // "char**" not "const char**"
        size_t non_rev_converted = iconv(cd, const_cast<char**>(&inbuf),
                                         &inbytesleft, &outbuf, &outbytesleft);

        if(non_rev_converted == static_cast<size_t>(-1)) {
            // here you can add misc handling like replacing erroneous chars
            // and continue converting etc.
            // I'll just throw...
            throw std::runtime_error(std::strerror(errno));
        }
        // shrink to keep only what we converted
        out.resize(outbuf - out.data());
        return out;
    }

private:
    iconv_t cd;
};

int main() {
    Conv cvt("UTF-8", "ISO-8859-7");

    // create a string from the ISO-8859-7 data
    unsigned char data[]{0xcf, 0xcb, 0xc1};
    std::string iso88597_str(std::begin(data), std::end(data));

    auto utf8 = cvt.convert(iso88597_str);
    std::cout << utf8 << '\n';
}
Output (in UTF-8):
ΟΛΑ
Using this you can create a mapping table, from ISO-8859-7 to UTF-8, that you include in your project instead of iconv.
OK, I decided to do this myself instead of looking for a compatible library. Here's how I did it.

The main problem was figuring out how to fill the two bytes for Unicode using the single one for ISO, so I used the debugger to read the value for the same character, first written by the old machine and then written with a constant string (UTF-8 by default). I started with "O" and "Π" and saw that in UTF-8 the first byte was always 0xCE while the second one was filled with the ISO value plus an offset (-0x30). I built the following code to implement this and used a test string filled with all Greek letters, both upper and lower case. Then I realised that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second one changed, so I added a test to figure out which of the two rules to apply.

The following method returns a bool to let the caller know whether the original string contained ISO characters (useful for other purposes) and overwrites the original string, passed as reference, with the new one. I worked with char arrays instead of strings for coherence with the rest of the project, which is basically a C project written in C++.
bool iso_to_utf8(char* in){
    bool wasISO=false;

    if(in == NULL)
        return wasISO;

    // count chars
    int i=strlen(in);
    if(!i)
        return wasISO;

    // create and size the new buffer (worst case: every char doubles,
    // plus one byte for the terminator)
    char *out = new char[2*i + 1];
    // fill with 0's, useful for watching the string as it gets built
    memset(out, 0, 2*i + 1);

    // ready to start from head of old buffer
    i=0;
    // index for new buffer
    int j=0;

    // for each char in old buffer
    while(in[i]!='\0'){
        if(in[i] >= 0){
            // it's already utf8-compliant, take it as it is
            out[j++] = in[i];
        }else{
            // it's ISO
            wasISO=true;
            // get plain value
            int val = in[i] & 0xFF;
            // first byte to CF or CE
            out[j++] = val > 0xEF ? 0xCF : 0xCE;
            // second byte to the plain value, normalized
            out[j++] = val - (val > 0xEF ? 0x70 : 0x30);
        }
        i++;
    }
    // add string terminator
    out[j]='\0';

    // paste into the old char array (the caller must have allocated
    // enough room for the expansion)
    strcpy(in, out);
    delete[] out;

    return wasISO;
}
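A quick usage sketch (a hypothetical driver, assuming iso_to_utf8 from above is in scope; note the caller must size the buffer for the worst-case expansion):

#include <cstdio>
#include <cstring>

int main(){
    // ISO-8859-7 bytes for "ΟΛΑ", in a buffer big enough to hold the
    // UTF-8 expansion (up to twice the length, plus the '\0')
    char buf[8];
    strcpy(buf, "\xcf\xcb\xc1");

    bool wasISO = iso_to_utf8(buf);
    printf("%s (wasISO=%d)\n", buf, wasISO ? 1 : 0);
}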

How to read custom string with C++ from binary recursively

I've recently been getting into I/O with C++. I am trying to read a string from a binary file stream.
The custom type is saved like this:
The string is prefixed with the length of the string, so "Hello" would be stored like this: 6Hello\0 (the length byte includes the terminating null).
I am basically reading text from a table (in this case a name table) in a binary file. The file header tells me the offset of this table (112 bytes in this case) and the number of names (318).
Using this information I can read the first byte at this offset. This tells me the length of the string (e.g. 6). So I'll start at the next byte and read 5 more to get the full string "Hello". This seems to work fine with the first name at the offset. Trying to recursively read the rest produces a lot of garbage, really. I've tried using loops and recursive functions, but it's not working out so well. Not sure what the problem is, so I reverted to the original one-name retrieval method. Here's the code:
int printName(fstream& fileObj, __int8 buff, DWORD offset, int& iteration){
    fileObj.seekg(offset);
    fileObj.read((char*)&buff, sizeof(char));
    int nameSize = (int)buff;

    char* szName = new char[nameSize];
    for(int i=1; i <= nameSize; i++){
        fileObj.seekg(offset+i);
        fileObj.read((char*)&szName[i-1], sizeof(char));
    }
    cout << szName << endl;
    return 0;
}
Any idea how to iterate through all 318 names without creating dodgy output?
Thanks for taking the time to look through this, your help is greatly appreciated.
You're overcomplicating this a bit - there's no need to seek before each sequential read.
Removing unused and pointless parameters, I would write this function something like this:
void printName(fstream& fileObj, DWORD offset) {
    char size = 0;
    if (fileObj.seekg(offset) && fileObj.read(&size, sizeof(char)))
    {
        char* name = new char[size];
        if (fileObj.read(name, size))
        {
            cout << name << endl;
        }
        delete [] name;
    }
}
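To walk all 318 names there is no need for per-name offsets either: seek to the table once and keep reading length-prefixed strings back to back. A hypothetical driver along those lines (assuming, as above, that each length byte counts the trailing '\0'):

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

void printAllNames(fstream& fileObj, streamoff tableOffset, int nameCount) {
    if (!fileObj.seekg(tableOffset))     // position once at the name table
        return;
    for (int i = 0; i < nameCount; ++i) {
        char size = 0;
        if (!fileObj.read(&size, sizeof(char)))
            break;
        string name(static_cast<unsigned char>(size), '\0');
        if (!fileObj.read(&name[0], name.size()))
            break;
        cout << name.c_str() << endl;    // c_str() stops at the stored '\0'
    }
}

With the header values from the question this would be called as printAllNames(fileObj, 112, 318).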

Insert UTF-8 data in OpenLDAP with the C API

What is the correct method to insert UTF-8 data into an OpenLDAP database? I have data in a std::wstring, which was converted to Unicode with:
std::wstring converted = boost::locale::conv::to_utf<wchar_t>(line, "Latin1");
When the string needs to be added to an LDAPMod structure, I use this function:
std::string str8(const std::wstring& s) {
    return boost::locale::conv::utf_to_utf<char>(s);
}
to convert from wstring to string. This is used in my function to create an LDAPMod:
LDAPMod ** y::ldap::server::createMods(dataset& values) {
    LDAPMod ** mods = new LDAPMod*[values.elms() + 1];
    mods[values.elms()] = NULL;

    for(int i = 0; i < values.elms(); i++) {
        mods[i] = new LDAPMod;
        data & d = values.get(i);

        switch (d.getType()) {
            case NEW   : mods[i]->mod_op = 0;                break;
            case ADD   : mods[i]->mod_op = LDAP_MOD_ADD;     break;
            case MODIFY: mods[i]->mod_op = LDAP_MOD_REPLACE; break;
            case DELETE: mods[i]->mod_op = LDAP_MOD_DELETE;  break;
            default: assert(false);
        }

        std::string type = str8(d.getValue(L"type"));
        mods[i]->mod_type = new char[type.size() + 1];
        std::copy(type.begin(), type.end(), mods[i]->mod_type);
        mods[i]->mod_type[type.size()] = '\0';

        mods[i]->mod_vals.modv_strvals = new char*[d.elms(L"values") + 1];
        for(int j = 0; j < d.elms(L"values"); j++) {
            std::string value = str8(d.getValue(L"values", j));
            mods[i]->mod_vals.modv_strvals[j] = new char[value.size() + 1];
            std::copy(value.begin(), value.end(), mods[i]->mod_vals.modv_strvals[j]);
            mods[i]->mod_vals.modv_strvals[j][value.size()] = '\0';
        }
        mods[i]->mod_vals.modv_strvals[d.elms(L"values")] = NULL;
    }
    return mods;
}
The resulting LDAPMod array is passed on to ldap_modify_ext_s and works as long as I only use ASCII characters. But if other characters are present in the string, I get an LDAP operations error.
I've also tried this with the function provided by the LDAP library (ldap_x_wcs_to_utf8s), but the result is the same as with the Boost conversion.
It's not the conversion itself that is wrong, because if I convert the modifications back to a std::wstring and show them in my program output, the encoding is still correct.
AFAIK OpenLDAP has supported UTF-8 for a long time, so I wonder if there's something else that must be done before this works?
I've looked into the OpenLDAP client/tools examples, but the UTF-8 functions provided by the library are never used in there.
Update:
I noticed I can insert UTF-8 characters like é into LDAP with Apache Directory Studio, and I can retrieve these values from LDAP in my C++ program. But if I insert the same character again, without changing anything in that string, I get the LDAP operations error again.
It turns out that my code was not wrong at all. My modifications tried to store the full name in the 'displayName' field as well as in 'gecos'. But apparently 'gecos' cannot handle UTF-8 data.
We don't actually use gecos anymore. The value was only present because of some software we used years ago, so I removed it from the directory.
What made it hard to find was that even though the loglevel was set to 'parse', this error still did not appear in the logs.
Because libldap can be such a hard nut to crack, I'll include a link to the complete code of the project I'm working on. It might serve as a starting point for other programmers. (Most of the code in the tutorials I have found is outdated.)
https://github.com/yvanvds/yATools/tree/master/libadmintools/ldap
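One thing to keep in mind with code like createMods above: everything is allocated with C++ new/new[], so a matching cleanup is needed once ldap_modify_ext_s is done. A hypothetical counterpart (freeMods is not part of the project; it just mirrors the allocations, and these arrays must not be passed to the library's ldap_mods_free, since they weren't allocated with the allocator libldap expects):

void freeMods(LDAPMod** mods) {
    for (int i = 0; mods[i] != NULL; i++) {
        delete[] mods[i]->mod_type;
        for (int j = 0; mods[i]->mod_vals.modv_strvals[j] != NULL; j++) {
            delete[] mods[i]->mod_vals.modv_strvals[j];
        }
        delete[] mods[i]->mod_vals.modv_strvals;  // the NULL-terminated value array
        delete mods[i];
    }
    delete[] mods;  // the NULL-terminated outer array
}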

C++ Character Encoding

This is my C++ code, where I'm trying to encode a received file path to UTF-8.
#include <string>
#include <iostream>
#include <cstring>   // for memset/memcpy
using namespace std;

void latin1_to_utf8(unsigned char *in, unsigned char *out);
string encodeToUTF8(string _strToEncode);

int main(int argc, char* argv[])
{
    // Code to receive fileName from Sockets

    cout << "recvd ::: " << recvdFName << "\n";
    string encStr = encodeToUTF8(recvdFName);
    cout << "encoded :::" << encStr << "\n";
}

void latin1_to_utf8(unsigned char *in, unsigned char *out)
{
    while (*in)
    {
        if (*in < 128)
        {
            *out++ = *in++;
        }
        else
        {
            *out++ = 0xc2 + (*in > 0xbf);
            *out++ = (*in++ & 0x3f) + 0x80;
        }
    }
    *out = '\0';
}

string encodeToUTF8(string _strToEncode)
{
    int len = _strToEncode.length();

    unsigned char* inpChar = new unsigned char[len+1];
    unsigned char* outChar = new unsigned char[2*(len+1)];

    memset(inpChar, '\0', len+1);
    memset(outChar, '\0', 2*(len+1));
    memcpy(inpChar, _strToEncode.c_str(), len);

    latin1_to_utf8(inpChar, outChar);

    string _toRet = (const char*)(outChar);

    delete[] inpChar;
    delete[] outChar;

    return _toRet;
}
And the OutPut is
recvd ::: /Users/zeus/ÄÈÊÑ.txt
encoded ::: /Users/zeus/AÌEÌEÌNÌ.txt
The above function latin1_to_utf8 is provided as a solution in Convert ISO-8859-1 strings to UTF-8 in C/C++, and it looks like it works (the answer is accepted). So I think I must be making some mistake, but I'm not able to identify what it is. Can someone help me out with this, please?
I first posted this question on Code Review, but I'm not getting any answers there. So sorry for the duplication.
Do you use any platform, or do you build on top of the standard library? I am sure that many people need such conversions, and therefore there are libraries for it. I strongly recommend you use a library, because a library is tested and usually implements the best-known approach. One library I found that does this is Boost.Locale. If you use Qt, I recommend the Qt conversion facilities instead (they are platform independent).
In case you want to do it yourself (because you want to see how it works, or for any other reason):
1. Make sure that you allocate memory! This is very important in C and C++. Since you use iostream, use new to allocate memory and delete to release it (this also matters because C++ won't figure out when to release it for you; that is the developer's job here - C++ is hardcore :D).
2. Check that you allocate the right amount of memory. Expect the UTF-8 output to be larger than the input (it encodes more symbols and sometimes uses multiple bytes per character).
3. As already mentioned above, read from somewhere (terminal or file) but write the output to a new file. After that, when you open the file with a text editor, make sure you set the encoding to UTF-8 (your text editor has to know how to interpret the data).
I hope that helps.
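For reference, a minimal sketch of the Boost.Locale route (assuming the input really is Latin-1 and the boost_locale library is linked in):

#include <boost/locale/encoding.hpp>
#include <iostream>
#include <string>

int main()
{
    // the Latin-1 bytes for "ÄÈÊÑ" from the question's output
    std::string latin1 = "/Users/zeus/\xC4\xC8\xCA\xD1.txt";

    // one call converts the whole string to UTF-8
    std::string utf8 = boost::locale::conv::to_utf<char>(latin1, "Latin1");
    std::cout << utf8 << "\n";
}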
You are first outputting the original Latin-1 string to a terminal expecting a certain encoding, probably Latin-1. You then transcode to UTF-8 and output it to the same terminal, which interprets it differently. Classic mojibake. Try the following with the output instead:

for(size_t i = 0, len = strlen((const char*)outChar); i != len; ++i)
    std::cout << static_cast<unsigned>(static_cast<unsigned char>(outChar[i])) << ' ';

Note that the two casts first get the unsigned byte value and then widen it to unsigned, to keep the stream from treating it as a char. Your char might already be unsigned, but that's compiler-dependent.

What is the proper method of reading and parsing data files in C++?

What is an efficient, proper way of reading in a data file with mixed content? For example, I have a data file that contains a mixture of data loaded from other files: 32-bit integers, characters and strings. Currently I am using an fstream object, but it gets stopped once it hits an int32 or the end of a string. If I add random data onto the end of the string in the data file, it seems to follow through with the rest of the file. This leads me to believe that the null termination appended to strings is messing it up. Here's an example of loading in the file:
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    fstream fin("C://mark.dat", ios::in|ios::binary|ios::ate);
    char *mymemory = 0;
    int size = 0;

    if (fin.is_open())
    {
        size = static_cast<int>(fin.tellg());
        mymemory = new char[static_cast<int>(size+1)];
        memset(mymemory, 0, static_cast<int>(size + 1));

        fin.seekg(0, ios::beg);
        fin.read(mymemory, size);
        fin.close();

        printf(mymemory);

        std::string hithere;
        hithere = cin.get();
    }
}
Why might this code stop after reading in an integer or a string? How might one get around this? Is this the wrong approach when dealing with these types of files? Should I be using fstream at all?
Have you ever considered that the file reading is working perfectly and it is printf(mymemory) that is stopping at the first null?
Have a look with the debugger and see if I am right.
Also, if you want to print someone else's buffer, use puts(mymemory) or printf("%s", mymemory). Don't accept someone else's input for the format string, it could crash your program.
Try
for (int i = 0; i < size; ++i)
{
    // %02X: pad with 0s, two digits, hex with capital A-F (0A, 1B, etc.)
    // cast through unsigned char first to avoid sign extension on bytes >= 0x80
    printf("%02X ", (int)(unsigned char)mymemory[i]);
    if ((i + 1) % 32 == 0)
        printf("\n"); // new line every 32 bytes
}
as a way to dump your data file back out as hex.
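More generally, for a file that mixes 32-bit integers and strings, the usual approach is to read each field with its own typed read instead of printing the raw buffer. A rough sketch (the field layout here is hypothetical; it assumes the file and host share endianness and that strings are stored as <int32 length><bytes>, so adjust it to the actual format):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// read one 32-bit integer
int32_t readInt32(std::istream& in)
{
    int32_t v = 0;
    in.read(reinterpret_cast<char*>(&v), sizeof v);
    return v;
}

// read one string stored as <int32 length><bytes>
std::string readString(std::istream& in)
{
    int32_t len = readInt32(in);
    if (len <= 0)
        return std::string();
    std::string s(len, '\0');
    in.read(&s[0], len);
    return s;
}

int main()
{
    std::ifstream fin("C://mark.dat", std::ios::binary);
    int32_t id = readInt32(fin);        // hypothetical field order
    std::string name = readString(fin);
    std::cout << id << " " << name << "\n";
}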