iconv encoding conversion problem - c++

I am having trouble converting strings from UTF-8 to GB2312. My convert function is below:
void convert(const char *from_charset,const char *to_charset, char *inptr, char *outptr)
{
size_t inleft = strlen(inptr);
size_t outleft = inleft;
iconv_t cd; /* conversion descriptor */
if ((cd = iconv_open(to_charset, from_charset)) == (iconv_t)(-1))
{
fprintf(stderr, "Cannot open converter from %s to %s\n", from_charset, to_charset);
exit(8);
}
/* return code of iconv() */
int rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
if (rc == -1)
{
fprintf(stderr, "Error in converting characters\n");
if(errno == E2BIG)
printf("errno == E2BIG\n");
if(errno == EILSEQ)
printf("errno == EILSEQ\n");
if(errno == EINVAL)
printf("errno == EINVAL\n");
iconv_close(cd);
exit(8);
}
iconv_close(cd);
}
This is an example of how I used it:
int len = 1000;
char *result = new char[len];
convert("UTF-8", "GB2312", some_string, result);
Edit: most of the time I get an E2BIG error.

outleft should be the size of the output buffer (e.g. 1000 bytes), not the size of the incoming string.
When converting, the string length usually changes in the process and you cannot know how long it is going to be until afterwards. E2BIG means that the output buffer wasn't large enough, in which case you need to give it more output buffer space (notice that it has already converted some of the data and adjusted the four variables passed to it accordingly).
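A minimal sketch of that advice in C++, assuming the glibc-style iconv() signature (char **inbuf). The name convert_string and the initial size guess are illustrative, not from the original post. outleft always reflects the free space in the output buffer, and the buffer is grown whenever iconv() reports E2BIG:

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from `from_charset` to `to_charset`,
// growing the output buffer whenever iconv() reports E2BIG.
std::string convert_string(const char* from_charset, const char* to_charset,
                           const std::string& input)
{
    iconv_t cd = iconv_open(to_charset, from_charset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(input.size() + 16, '\0');      // initial guess only
    char* inptr = const_cast<char*>(input.data()); // iconv() does not modify the input
    std::size_t inleft = input.size();
    std::size_t used = 0;                          // bytes written to `out` so far
    bool flushing = false;                         // second phase: emit any shift sequence

    for (;;) {
        char* outptr = &out[used];
        std::size_t outleft = out.size() - used;   // free space in the OUTPUT buffer
        std::size_t rc = flushing
            ? iconv(cd, nullptr, nullptr, &outptr, &outleft)
            : iconv(cd, &inptr, &inleft, &outptr, &outleft);
        const int err = errno;
        used = out.size() - outleft;               // iconv() advanced outptr for us

        if (rc != (std::size_t)-1) {
            if (flushing)
                break;                             // everything converted and flushed
            flushing = true;
            continue;
        }
        if (err != E2BIG) {                        // EILSEQ or EINVAL: bad input
            iconv_close(cd);
            throw std::runtime_error("iconv failed on invalid input");
        }
        out.resize(out.size() * 2);                // not enough room: grow and retry
    }

    iconv_close(cd);
    out.resize(used);                              // trim to the bytes actually written
    return out;
}
```

With this shape the call from the question would become something like convert_string("UTF-8", "GB2312", some_string), and the caller gets the exact output length from the returned string's size().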

As others have noted, E2BIG means that the output buffer wasn't large enough for the conversion and you were using the wrong value for outleft.
But I've also noticed some other possible problems with your function. Namely, with the way your function works, your caller has no way of knowing how many bytes are in the output string. Your convert() function neither nul-terminates the output buffer nor does it have a means of telling its caller the number of bytes it wrote to outptr.
If you want to deal with nul-terminated strings (and it appears that's what you want to do since your input string is nul-terminated), you might find the following approach to be much better:
char *
convert (const char *from_charset, const char *to_charset, const char *input)
{
size_t inleft, outleft, converted = 0;
char *output, *outbuf, *tmp;
const char *inbuf;
size_t outlen;
iconv_t cd;
if ((cd = iconv_open (to_charset, from_charset)) == (iconv_t) -1)
return NULL;
inleft = strlen (input);
inbuf = input;
/* we'll start off allocating an output buffer which is the same size
* as our input buffer. */
outlen = inleft;
/* we allocate 4 bytes more than what we need for nul-termination... */
if (!(output = malloc (outlen + 4))) {
iconv_close (cd);
return NULL;
}
do {
errno = 0;
outbuf = output + converted;
outleft = outlen - converted;
converted = iconv (cd, (char **) &inbuf, &inleft, &outbuf, &outleft);
if (converted != (size_t) -1 || errno == EINVAL) {
/*
* EINVAL An incomplete multibyte sequence has been encountered
* in the input.
*
* We'll just truncate it and ignore it.
*/
break;
}
if (errno != E2BIG) {
/*
* EILSEQ An invalid multibyte sequence has been encountered
* in the input.
*
* Bad input, we can't really recover from this.
*/
iconv_close (cd);
free (output);
return NULL;
}
/*
* E2BIG There is not sufficient room at *outbuf.
*
* We just need to grow our outbuffer and try again.
*/
converted = outbuf - output;
outlen += inleft * 2 + 8;
if (!(tmp = realloc (output, outlen + 4))) {
iconv_close (cd);
free (output);
return NULL;
}
output = tmp;
outbuf = output + converted;
} while (1);
/* flush the iconv conversion */
iconv (cd, NULL, NULL, &outbuf, &outleft);
iconv_close (cd);
/* Note: not all charsets can be nul-terminated with a single
* nul byte. UCS2, for example, needs 2 nul bytes and UCS4
* needs 4. I hope that 4 nul bytes is enough to terminate all
* multibyte charsets? */
/* nul-terminate the string */
memset (outbuf, 0, 4);
return output;
}

Related

How to use libiconv correctly in C so that it would not report "Arg list too long"?

Code
/*char* to wchar_t* */
wchar_t*strtowstr(char*str){
iconv_t cd=iconv_open("wchar_t","UTF-8");
if(cd==(iconv_t)-1){
return NULL;
}
size_t len1=strlen(str),len2=1024;
wchar_t*wstr=(wchar_t*)malloc((len1+1)*sizeof(wchar_t));
char*ptr1=str;
wchar_t*ptr2=wstr;
if((int)iconv(cd,&ptr1,&len1,(char**)&ptr2,&len2)<0){
free(wstr);
iconv_close(cd);
return NULL;
}
*ptr2=L'\0';
iconv_close(cd);
return wstr;
}
I use strerror(errno) to get the error message, and it says "Arg list too long".
How can I solve it?
Thanks to the comments, I have changed the code above.
I only use the function to read a text file. I think it reports the error because the file is too large, so I want to know how to use iconv for a long string.

According to the man page, you get E2BIG when there's insufficient room at *outbuf.
I think the fifth argument should be a number of bytes.
wchar_t *utf8_to_wstr(const char *src) {
iconv_t cd = iconv_open("wchar_t", "UTF-8");
if (cd == (iconv_t)-1)
goto Error1;
size_t src_len = strlen(src); // In bytes, excludes NUL
size_t dst_len = sizeof(wchar_t) * src_len; // In bytes, excludes NUL
size_t dst_size = dst_len + sizeof(wchar_t); // In bytes, including NUL
wchar_t *buf = malloc(dst_size);
if (!buf)
goto Error2;
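char *in = (char *)src; /* iconv() needs real lvalues of type char * */
char *out = (char *)buf;
if (iconv(cd, &in, &src_len, &out, &dst_len) == (size_t)-1)
goto Error3;
*(wchar_t *)out = L'\0'; /* out now points just past the last converted character */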
iconv_close(cd);
return buf;
Error3:
free(buf);
Error2:
iconv_close(cd);
Error1:
return NULL;
}

How to get string from CP866 bytes

My program reads bytes from a file and tries to convert them to a string. The head of the file is text in CP866.
iconv_t cd = iconv_open("UTF-8","CP866");
char* iconv_in = bytes.data(); //Bytes is a char vector
char* iconv_out = (char *)malloc(counter * sizeof(char)); //counter is a length of bytes array (char vector)
size_t iconv_in_bytes =counter;
size_t iconv_out_bytes = counter;
size_t ret = iconv(cd, &iconv_in, &iconv_in_bytes, &iconv_out, &iconv_out_bytes);
if ((size_t) -1 == ret) {
cout << "Error convert";
return NULL;
}
The program ends with a failure (Error convert). This approach is also neither simple nor elegant. Is there a simpler solution?

simulate ulltoa() with a radix/base of 36

I need to convert an unsigned 64-bit integer into a string. That is in Base 36, or characters 0-Z. ulltoa does not exist in the Linux manpages. But sprintf DOES. How do I use sprintf to achieve the desired result? i.e. what formatting % stuff?
Or if snprintf does not work, then how do I do this?

You can always just write your own conversion function. The following idea is heavily inspired by this fine answer:
char * int2base36(unsigned int n, char * buf, size_t buflen)
{
static const char digits[] = "0123456789ABCDEFGHI...";
if (buflen < 1) return NULL; // buffer too small!
char * b = buf + buflen;
*--b = 0;
do {
if (b == buf) return NULL; // buffer too small!
*--b = digits[n % 36];
n /= 36;
} while(n);
return b;
}
This will return a pointer to a null-terminated string containing the base36-representation of n, placed in a buffer that you provide. Usage:
char buf[100];
std::cout << int2base36(37, buf, 100);
If you want and you're single-threaded, you can also make the char buffer static -- I guess you can figure out a suitable maximal length:
char * int2base36_not_threadsafe(unsigned int n)
{
static char buf[128];
static const size_t buflen = 128;
// rest as above

How to implement readlink to find the path

Using the readlink function used as a solution to How do I find the location of the executable in C?, how would I get the path into a char array? Also, what do the variables buf and bufsize represent and how do I initialize them?
EDIT: I am trying to get the path of the currently running program, just like the question linked above. The answer to that question said to use readlink("/proc/self/exe"). I do not know how to implement that in my program. I tried:
char buf[1024];
string var = readlink("/proc/self/exe", buf, bufsize);
This is obviously incorrect.

See "Use the readlink() function properly" for the correct uses of the readlink function.
If you have your path in a std::string, you could do something like this:
#include <unistd.h>
#include <limits.h>
#include <string>
#include <iostream>
std::string do_readlink(std::string const& path) {
char buff[PATH_MAX];
ssize_t len = ::readlink(path.c_str(), buff, sizeof(buff)-1);
if (len != -1) {
buff[len] = '\0';
return std::string(buff);
}
/* handle error condition; an empty string is returned here so every path returns a value */
return std::string();
}
If you're only after a fixed path:
std::string get_selfpath() {
char buff[PATH_MAX];
ssize_t len = ::readlink("/proc/self/exe", buff, sizeof(buff)-1);
if (len != -1) {
buff[len] = '\0';
return std::string(buff);
}
/* handle error condition; an empty string is returned here so every path returns a value */
return std::string();
}
To use it:
int main()
{
std::string selfpath = get_selfpath();
std::cout << selfpath << std::endl;
return 0;
}

The accepted answer is almost correct, except that you can't rely on PATH_MAX, because it is "not guaranteed to be defined per POSIX if the system does not have such limit" (from the readlink(2) manpage).
Also, when it's defined it doesn't always represent the "true" limit. (See http://insanecoding.blogspot.fr/2007/11/pathmax-simply-isnt.html )
The readlink manpage also gives a way to size the buffer for a symlink:
Using a statically sized buffer might not provide enough room for the
symbolic link contents. The required size for the buffer can be
obtained from the stat.st_size value returned by a call to lstat(2) on
the link. However, the number of bytes written by readlink() and
readlinkat() should be checked to make sure that the size of the symbolic
link did not increase between the calls.
However, in the case of /proc/self/exe, as for most /proc files, stat.st_size would be 0. The only remaining solution I see is to keep growing the buffer until the result fits.
I suggest using vector<char> for this purpose, as follows:
std::string get_selfpath()
{
std::vector<char> buf(400);
ssize_t len;
do
{
buf.resize(buf.size() + 100);
len = ::readlink("/proc/self/exe", &(buf[0]), buf.size());
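} while (len >= 0 && static_cast<size_t>(len) == buf.size()); // buffer was filled, so the result may be truncated: grow and retry
if (len > 0)
{
return std::string(&(buf[0]), len); // readlink() does not NUL-terminate, so pass the length
}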
/* handle error */
return "";
}

Let's look at what the manpage says:
readlink() places the contents of the symbolic link path in the buffer
buf, which has size bufsiz. readlink does not append a NUL character to
buf.
OK. Should be simple enough. Given your buffer of 1024 chars:
char buf[1024];
/* The manpage says it won't null terminate. Let's zero the buffer. */
memset(buf, 0, sizeof(buf));
/* Note we use sizeof(buf)-1 since we may need an extra char for NUL. */
if (readlink("/proc/self/exe", buf, sizeof(buf)-1) < 0)
{
/* There was an error... Perhaps the path does not exist
* or the buffer is not big enough. errno has the details. */
perror("readlink");
return -1;
}

char *
readlink_malloc (const char *filename)
{
int size = 100;
char *buffer = NULL;
while (1)
{
buffer = (char *) xrealloc (buffer, size);
int nchars = readlink (filename, buffer, size);
if (nchars < 0)
{
free (buffer);
return NULL;
}
if (nchars < size)
return buffer;
size *= 2;
}
}
Taken from: http://www.delorie.com/gnu/docs/glibc/libc_279.html

#include <stdlib.h>
#include <unistd.h>
static char *exename(void)
{
char *buf;
char *newbuf;
size_t cap;
ssize_t len;
buf = NULL;
for (cap = 64; cap <= 16384; cap *= 2) {
newbuf = realloc(buf, cap);
if (newbuf == NULL) {
break;
}
buf = newbuf;
len = readlink("/proc/self/exe", buf, cap);
if (len < 0) {
break;
}
if ((size_t)len < cap) {
buf[len] = 0;
return buf;
}
}
free(buf);
return NULL;
}
#include <stdio.h>
int main(void)
{
char *e = exename();
printf("%s\n", e ? e : "unknown");
free(e);
return 0;
}
This uses the traditional "when you don't know the right buffer size, reallocate increasing powers of two" trick. We assume that allocating less than 64 bytes for a pathname is not worth the effort. We also assume that an executable pathname as long as 16384 (2**14) bytes has to indicate some kind of anomaly in how the program was installed, and it's not useful to know the pathname as we'll soon encounter bigger problems to worry about.
There is no need to bother with constants like PATH_MAX. Reserving so much memory is overkill for almost all pathnames, and as noted in another answer, it's not guaranteed to be the actual upper limit anyway. For this application, we can pick a common-sense upper limit such as 16384. Even for applications with no common-sense upper limit, reallocating increasing powers of two is a good approach. You only need log n calls for an n-byte result, and the amount of memory capacity you waste is proportional to the length of the result. It also avoids race conditions where the length of the string changes between the realloc() and the readlink().

storing return value from function into pointer to char variable is rightway to do?

I have written a read function which reads values from a serial port (Linux). It returns the values as a pointer to char. I call this function from another function and store the result in another pointer-to-char variable. I occasionally get a stack overflow and am not sure whether this function is causing it.
A sample is provided below. Please give some suggestions or criticism.
char *ReadToSerialPort( )
{
const int buffer_size = 1024;
char *buffer = (char *)malloc(buffer_size);
char *bufptr = buffer;
size_t iIn;
int iMax = buffer+buffer_size-bufptr;
if ( fd < 1 )
{
printf( "port is not open\n" );
// return -1;
}
iIn = read( fd, bufptr, iMax-1 );
if ( iIn < 0 )
{
if ( errno == EAGAIN )
{
printf( "The errror in READ" );
return 0; // assume that command generated no response
}
else
printf( "read error %d %s\n", errno, strerror(errno) );
}
else
{
// *bufptr = '\0';
bufptr[(int)iIn<iMax?iIn:iMax] = '\0';
if(bufptr != buffer)
return bufptr;
}
free(buffer);
return 0;
} // end ReadAdrPort
int ParseFunction(void)
{
// some other code
char *sResult;
if( ( sResult = ReadToSerialPort()) >= 0)
{
printf("Response is %s\n", sResult);
// code to store char in string and put into db .
}
}
Thanks and regards,
SamPrat

You do not deallocate the buffer. You need to call free() once you have finished working with it.
char * getData()
{
char *buf = (char *)malloc(255);
// Fill buffer
return buf;
}
void anotherFunc()
{
char *data = getData();
// Process data
free(data);
}
In your case I think you should free the buffer after printf:
if( ( sResult = ReadToSerialPort()) >= 0)
{
printf("Response is %s\n", sResult);
// code to store char in string and put into db .
free(sResult);
}
UPDATE Static buffer
Another option is to use a static buffer. It could improve performance a little, but the getData function will not be thread-safe.
char buff[1024];
char *getData()
{
// Write data to buff
return buff;
}
int main()
{
char *data = getData();
printf("%s", data);
}
UPDATE Some notes about your code
int iMax = buffer+buffer_size-bufptr; - iMax will always be 1024, because bufptr equals buffer at that point;
I do not see any point in using bufptr, since its value is the same as buffer and you never change it in your function;
You can therefore pass the buffer size to read directly: iIn = read( fd, bufptr, buffer_size-1 );
You can replace bufptr[(int)iIn<iMax?iIn:iMax] = '\0'; with bufptr[iIn] = '\0';
if(bufptr != buffer) is always false, which is why you never return the buffer and always return 0;
Do not forget to free the buffer if errno == EAGAIN is true. Currently you just return 0 without free(buffer).
Good luck ;)

Elalfer is partially correct. You do free() your buffer, but not in every case.
For example, when you reach if ( errno == EAGAIN ) and it evaluates to true, you return without doing free on your buffer.
The best approach would be to pass the buffer as a parameter and make it obvious that the caller owns the buffer and must free it outside the function (this is basically what Elalfer says in his edited answer).
Just realized this is a C question, I blame SO filtering for this :D sorry! Disregard the following, I'm leaving it so that comments still make sense.
The correct solution should use std::vector<char>, that way the destructor handles memory deallocation for you at the end of scope.

what is the purpose of the second pointer?
char *buffer = (char *)malloc(buffer_size);
char *bufptr = buffer;
what is the purpose of this?
int iMax = buffer+buffer_size-bufptr; // eh?
What is the purpose of this?
bufptr[(int)iIn<iMax?iIn:iMax] = '\0'; // so you pass in 1023 (iMax - 1), it reads 1023, you've effectively corrupted the last byte.
I would start over, consider using std::vector<char>, something like:
std::vector<char> buffer(1500); // default constructs 1500 chars
int iRead = read(fd, &buffer[0], 1500);
// resize the buffer if valid
if (iRead > 0)
buffer.resize(iRead); // this logically trims the buffer so that the iterators begin/end are correct.
return buffer;
Then in your calling function, use the vector<char> and if you need a string, construct one from this: std::string foo(vect.begin(), vect.end()); etc.

When you are setting the null terminator with "bufptr[(int)iIn<iMax?iIn:iMax] = '\0';", the iMax branch gives
bufptr[iMax]=>bufptr[1024]=>one byte beyond your allocation, since arrays start at 0.
Also, in this case "int iMax = buffer+buffer_size-bufptr;" can be rewritten as iMax = buffer_size. As written, it makes the code less readable.
