utf8 aware strncpy - c++

I find it hard to believe I'm the first person to run into this problem but searched for quite some time and didn't find a solution to this.
I'd like to use strncpy but have it be UTF8 aware so it doesn't partially write a utf8 character into the destination string.
Otherwise you can never be sure that the resulting string is valid UTF8, even if you know the source is (when the source string is larger than the max length).
Validating the resulting string can work but if this is to be called a lot it would be better to have a strncpy function that checks for it.
glib has g_utf8_strncpy but this copies a certain number of unicode chars, whereas Im looking for a copy function that limits by the byte length.
To be clear, by "utf8 aware", I mean that it should not exceed the limit of the destination buffer and it must never copy only part of a utf-8 character. (Given valid utf-8 input must never result in having invalid utf-8 output).
Note:
Some replies have pointed out that strncpy nulls all bytes and that it wont ensure zero termination, in retrospect I should have asked for a utf8 aware strlcpy, however at the time I didn't know of the existence of this function.

I've tested this on many sample UTF8 strings with multi-byte characters. If the source is too long, it does a reverse search of it (starts at the null terminator) and works backward to find the last full UTF8 character which can fit into the destination buffer. It always ensures the destination is null terminated.
char* utf8cpy(char* dst, const char* src, size_t sizeDest )
{
if( sizeDest ){
size_t sizeSrc = strlen(src); // number of bytes not including null
while( sizeSrc >= sizeDest ){
const char* lastByte = src + sizeSrc; // Initially, pointing to the null terminator.
while( lastByte-- > src )
if((*lastByte & 0xC0) != 0x80) // Found the initial byte of the (potentially) multi-byte character (or found null).
break;
sizeSrc = lastByte - src;
}
memcpy(dst, src, sizeSrc);
dst[sizeSrc] = '\0';
}
return dst;
}

I'm not sure what you mean by UTF-8 aware; strncpy copies bytes, not
characters, and the size of the buffer is given in bytes as well. If
what you mean is that it will only copy complete UTF-8 characters,
stopping, for example, if there isn't room for the next character, I'm
not aware of such a function, but it shouldn't be too hard to write:
int
utf8Size( char ch )
{
static int const sizeTable[] =
{
// ...
};
return sizeTable( static_cast<unsigned char>( ch ) )
}
char*
stru8ncpy( char* dest, char* source, int n )
{
while ( *source != '\0' && utf8Size( *source ) < n ) {
n -= utf8Size( *source );
switch ( utf8Size( ch ) ) {
case 6:
*dest ++ = *source ++;
case 5:
*dest ++ = *source ++;
case 4:
*dest ++ = *source ++;
case 3:
*dest ++ = *source ++;
case 2:
*dest ++ = *source ++;
case 1:
*dest ++ = *source ++;
break;
default:
throw IllegalUTF8();
}
}
*dest = '\0';
return dest;
}
(The contents of the table in utf8Size are a bit painful to generate,
but this is a function you'll be using a lot if you're dealing with
UTF-8, and you only have to do it once.)

To reply to own question, heres the C function I ended up with (Not using C++ for this project):
Notes:
- Realize this is not a clone of strncpy for utf8, its more like strlcpy from openbsd.
- utf8_skip_data copied from glib's gutf8.c
- It doesn't validate the utf8 - which is what I intended.
Hope this is useful to others and interested in feedback, but please no pedantic zealot's about NULL termination behavior unless its an actual bug, or misleading/incorrect behavior.
Thanks to James Kanze who provided the basis for this, but was incomplete and C++ (I need a C version).
static const size_t utf8_skip_data[256] = {
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
};
char *strlcpy_utf8(char *dst, const char *src, size_t maxncpy)
{
char *dst_r = dst;
size_t utf8_size;
if (maxncpy > 0) {
while (*src != '\0' && (utf8_size = utf8_skip_data[*((unsigned char *)src)]) < maxncpy) {
maxncpy -= utf8_size;
switch (utf8_size) {
case 6: *dst ++ = *src ++;
case 5: *dst ++ = *src ++;
case 4: *dst ++ = *src ++;
case 3: *dst ++ = *src ++;
case 2: *dst ++ = *src ++;
case 1: *dst ++ = *src ++;
}
}
*dst= '\0';
}
return dst_r;
}

strncpy() is a terrible function:
If there is insufficient space, the resulting string will not be nul terminated.
If there is enough space, the remainder is filled with NULs. This can be painful if the target string is very big.
Even if the characters stay in the ASCII range (0x7f and below), the resulting string will not be what you want. In the UTF-8 case it might be not nul-terminated and end in an invalid UTF-8 sequence.
Best advice is to avoid strncpy().
EDIT:
ad 1):
#include <stdio.h>
#include <string.h>
int main (void)
{
char buff [4];
strncpy (buff, "hello world!\n", sizeof buff );
printf("%s\n", buff );
return 0;
}
Agreed, the buffer will not be overrun. But the result is still unwanted. strncpy() solves only part of the problem. It is misleading and unwanted.
UPDATE(2012-10-31): Since this is a nasty problem, I decided to hack my own version, mimicking the ugly strncpy() behavior. The return value is the number of characters copied, though..
#include <stdio.h>
#include <string.h>
size_t utf8ncpy(char *dst, char *src, size_t todo);
static int cnt_utf8(unsigned ch, size_t len);
static int cnt_utf8(unsigned ch, size_t len)
{
if (!len) return 0;
if ((ch & 0x80) == 0x00) return 1;
else if ((ch & 0xe0) == 0xc0) return 2;
else if ((ch & 0xf0) == 0xe0) return 3;
else if ((ch & 0xf8) == 0xf0) return 4;
else if ((ch & 0xfc) == 0xf8) return 5;
else if ((ch & 0xfe) == 0xfc) return 6;
else return -1; /* Default (Not in the spec) */
}
size_t utf8ncpy(char *dst, char *src, size_t todo)
{
size_t done, idx, chunk, srclen;
srclen = strlen(src);
for(done=idx=0; idx < srclen; idx+=chunk) {
int ret;
for (chunk=0; done+chunk < todo; chunk++) {
ret = cnt_utf8( src[idx+chunk], srclen - (idx+chunk) );
if (ret ==1) continue; /* Normal character: collect it into chunk */
if (ret < 0) continue; /* Bad stuff: treat as normal char */
if (ret ==0) break; /* EOF */
if (!chunk) chunk = ret;/* an UTF8 multibyte character */
else ret = 1; /* we allready collected a number (chunk) of normal characters */
break;
}
if (ret > 1 && done+chunk > todo) break;
if (done+chunk > todo) chunk = todo - done;
if (!chunk) break;
memcpy( dst+done, src+idx, chunk);
done += chunk;
if (ret < 1) break;
}
/* This is part of the dreaded strncpy() behavior:
** pad the destination string with NULs
** upto its intended size
*/
if (done < todo) memset(dst+done, 0, todo-done);
return done;
}
int main(void)
{
char *string = "Hell\xc3\xb6 \xf1\x82\x82\x82, world\xc2\xa1!";
char buffer[30];
unsigned result, len;
for (len = sizeof buffer-1; len < sizeof buffer; len -=3) {
result = utf8ncpy(buffer, string, len);
/* remove the following line to get the REAL strncpy() behaviour */
buffer[result] = 0;
printf("Chop #%u\n", len );
printf("Org:[%s]\n", string );
printf("Res:%u\n", result );
printf("New:[%s]\n", buffer );
}
return 0;
}

Here is a C++ solution:
u8string.h:
#ifndef U8STRING_H
#define U8STRING_H 1
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif
/**
* Copies the first few characters of the UTF-8-encoded string pointed to by
* \p src into \p dest_buf, as many UTF-8-encoded characters as can be written in
* <code>dest_buf_len - 1</code> bytes or until the NUL terminator of the string
* pointed to by \p str is reached.
*
* The string of bytes that are written into \p dest_buf is NUL terminated
* if \p dest_buf_len is greater than 0.
*
* \returns \p dest_buf
*/
char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len);
#ifdef __cplusplus
}
#endif
#endif
u8slbcpy.cpp:
#include "u8string.h"
#include <cstring>
#include <utf8.h>
char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len)
{
if (dest_buf_len <= 0) {
return dest_buf;
} else if (dest_buf_len == 1) {
dest_buf[0] = '\0';
return dest_buf;
}
size_t num_bytes_remaining = dest_buf_len - 1;
utf8::unchecked::iterator<const char *> it(src);
const char * prev_base = src;
while (*it++ != '\0') {
const char *base = it.base();
ptrdiff_t diff = (base - prev_base);
if (num_bytes_remaining < diff) {
break;
}
num_bytes_remaining -= diff;
prev_base = base;
}
size_t n = dest_buf_len - 1 - num_bytes_remaining;
std::memmove(dest_buf, src, n);
dest_buf[n] = '\0';
return dest_buf;
}
The function u8slbcpy() has a C interface, but it is implemented in C++. My implementation uses the header-only UTF8-CPP library.
I think that this is pretty much what you are looking for, but note that there is still the problem that one or more combining characters might not be copied if the combining characters apply to the nth character (itself not a combining character) and the destination buffer is just large enough to store the UTF-8 encoding of characters 1 through n, but not the combining characters of character n. In this case, the bytes representing characters 1 through n are written, but none of the combining characters of n are. In effect, you could say that the nth character is partially written.

To comment on the above answer "strncpy() is a terrible function:".
I hate to even comment on such blanket statements at the expense of creating yet another internet programming jihad, but will anyhow since statements like this are misleading to those that might come here to look for answers.
Okay maybe C string functions are "old school". Maybe all strings in C/C++ should be in some kind of smart containers, etc., maybe one should use C++ instead of C (when you have a choice), these are more of a preference and an argument for other topics.
I came here looking for a UTF-8 strncpy() my self. Not that I couldn't make one (the encoding is IMHO simple and elegant) but wanted to see how others made theirs and perhaps find a optimized in ASM one.
To the "gods gift" of the programming world people, put your hubris aside for a moment and look at some facts.
There is nothing wrong with "strncpy()", or any other of the similar functions with the same side effects and issues like "_snprintf()", etc.
I say: "strncpy() is not terrible", but rather "terrible programmers use it terribly".
What is "terrible" is not knowing the rules.
Furthermore on the whole subject because of security (like buffer overrun) and program stability implications, there wouldn't be a need for example Microsoft to add to it's CRT lib "Safe String Functions" if the rules were just followed.
The main ones:
"sizeof()" returns the length of a static string w/terminator.
"strlen()" returns the length of string w/o terminator.
Most if no all "n" functions just clamp to 'n' with out adding a terminator.
There is implicit ambiguity on what "buffer size" is in functions that require and input buffer size. I.E. The "(char *pszBuffer, int iBufferSize)" types.
Safer to assume the worst and pass a size one less then the actual buffer size, and adding a terminator at the end to be sure.
For string inputs, buffers, etc., set and use a reasonable size limit based on expected average and maximum. To hopefully avoid input truncation, and to eliminate buffer overruns period.
This is how I personally handle such things, and other rules that are just to be known and practiced.
A handy macro for static string size:
// Size of a string with out terminator
#define SIZESTR(x) (sizeof(x) - 1)
When declaring local/stack string buffers:
A) The size for example limited to 1023+1 for terminator to allow for strings up to 1023 chars in length.
B) I'm initializing the the string to zero in length, plus terminating at the very end to cover a possible 'n' truncation.
char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;
Alternately one could do just:
char szBuffer[1024] = {0};
of course but then there is some performance implication for a compiler generated "memset() like call to zero the whole buffer. It makes things cleaner for debugging though, and I prefer this style for static (vs local/stack) strings buffers.
Now a "strncpy()" following the rules:
char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;
strncpy(szBuffer, pszSomeInput, SIZESTR(szBuffer));
There are other "rules" and issues of course, but these are the main ones that come to mind.
You just got to know how the lib functions work and to use safe practices like this.
Finally in my project I use ICU anyhow so I decided to go with it and use the macros in "utf8.h" to make my own "strncpy()".

Related

Subsetting char array without copying it in C++

I have a long array of char (coming from a raster file via GDAL), all composed of 0 and 1. To compact the data, I want to convert it to an array of bits (thus dividing the size by 8), 4 bytes at a time, writing the result to a different file. This is what I have come up with by now:
uint32_t bytes2bits(char b[33]) {
b[32] = 0;
return strtoul(b,0,2);
}
const char data[36] = "00000000000000000000000010000000101"; // 101 is to be ignored
char word[33];
strncpy(word,data,32);
uint32_t byte = bytes2bits(word);
printf("Data: %d\n",byte); // 128
The code is working, and the result is going to be written in a separate file. What I'd like to know is: can I do that without copying the characters to a new array?
EDIT: I'm using a const variable here just to make a minimal, reproducible example. In my program it's a char *, which is continually changing value inside a loop.
Yes, you can, as long as you can modify the source string (in your example code you can't because it is a constant, but I assume in reality you have the string in writable memory):
uint32_t bytes2bits(const char* b) {
return strtoul(b,0,2);
}
void compress (char* data) {
// You would need to make sure that the `data` argument always has
// at least 33 characters in length (the null terminator at the end
// of the original string counts)
char temp = data[32];
data[32] = 0;
uint32_t byte = bytes2bits(data);
data[32] = temp;
printf("Data: %d\n",byte); // 128
}
In this example by using char* as a buffer to store that long data there is not necessary to copy all parts into a temporary buffer to convert it to a long.
Just use a variable to step through the buffer by each 32 byte length period, but after the 32th byte there needs the 0 termination byte.
So your code would look like:
uint32_t bytes2bits(const char* b) {
return strtoul(b,0,2);
}
void compress (char* data) {
int dataLen = strlen(data);
int periodLen = 32;
char* periodStr;
char tmp;
int periodPos = periodLen+1;
uint32_t byte;
periodStr = data[0];
while(periodPos < dataLen)
{
tmp = data[periodPos];
data[periodPos] = 0;
byte = bytes2bits(periodStr);
printf("Data: %d\n",byte); // 128
data[periodPos] = tmp;
periodStr = data[periodPos];
periodPos += periodLen;
}
if(periodPos - periodLen <= dataLen)
{
byte = bytes2bits(periodStr);
printf("Data: %d\n",byte); // 128
}
}
Please than be careful to the last period, which could be smaller than 32 bytes.
const char data[36]
You are in violation of your contract with the compiler if you declare something as const and then modify it.
Generally speaking, the compiler won't let you modify it...so to even try to do so with a const declaration you'd have to cast it (but don't)
char *sneaky_ptr = (char*)data;
sneaky_ptr[0] = 'U'; /* the U is for "undefined behavior" */
See: Can we change the value of an object defined with const through pointers?
So if you wanted to do this, you'd have to be sure the data was legitimately non-const.
The right way to do this in modern C++ is by using std::string to hold your string and std::string_view to process parts of that string without copying it.
You can using string_view with that char array you have though. It's common to use it to modernize the classical null-terminated string const char*.

Alternate reading as char* and wchar_t*

I'm trying to write a program that parses ID3 tags, for educational purposes (so please explain in depth, as I'm trying to learn). So far I've had great success, but stuck on an encoding issue.
When reading the mp3 file, the default encoding for all text is ISO-8859-1. All header info (frame IDs etc) can be read in that encoding.
This is how I've done it:
ifstream mp3File("../myfile.mp3");
mp3File.read(mp3Header, 10); // char mp3Header[10];
// .... Parsing the header
// After reading the main header, we get into the individual frames.
// Read the first 10 bytes from buffer, get size and then read data
char encoding[1];
while(1){
char frameHeader[10] = {0};
mp3File.read(frameHeader, 10);
ID3Frame frame(frameHeader); // Parses frameHeader
if (frame.frameId[0] == 'T'){ // Text Information Frame
mp3File.read(encoding, 1); // Get encoding
if (encoding[0] == 1){
// We're dealing with UCS-2 encoded Unicode with BOM
char data[frame.size];
mp3File.read(data, frame.size);
}
}
}
This is bad code, because data is a char*, its' inside should look like this (converted undisplayable chars to int):
char = [0xFF, 0xFE, C, 0, r, 0, a, 0, z, 0, y, 0]
Two questions:
What are the first two bytes? - Answered.
How can I read wchar_t from my already open file? And then get back to reading the rest of it?
Edit Clarification: I'm not sure if this is the correct way to do it, but essentially what I wanted to do was.. Read the first 11 bytes to a char array (header+encoding), then the next 12 bytes to a wchar_t array (the name of the song), and then the next 10 bytes to a char array (the next header). Is that possible?
I figured out a decent solution: create a new wchar_t buffer and add the characters from the char array in pairs.
wchar_t* charToWChar(char* cArray, int len) {
char wideChar[2];
wchar_t wideCharW;
wchar_t *wArray = (wchar_t *) malloc(sizeof(wchar_t) * len / 2);
int counter = 0;
int endian = BIGENDIAN;
// Check endianness
if ((uint8_t) cArray[0] == 255 && (uint8_t) cArray[1] == 254)
endian = LITTLEENDIAN;
else if ((uint8_t) cArray[1] == 255 && (uint8_t) cArray[0] == 254)
endian = BIGENDIAN;
for (int j = 2; j < len; j+=2){
switch (endian){
case LITTLEENDIAN: {wideChar[0] = cArray[j]; wideChar[1] = cArray[j + 1];} break;
default:
case BIGENDIAN: {wideChar[1] = cArray[j]; wideChar[0] = cArray[j + 1];} break;
}
wideCharW = (uint16_t)((uint8_t)wideChar[1] << 8 | (uint8_t)wideChar[0]);
wArray[counter] = wideCharW;
counter++;
}
wArray[counter] = '\0';
return wArray;
}
Usage:
if (encoding[0] == 1){
// We're dealing with UCS-2 encoded Unicode with BOM
char data[frame.size];
mp3File.read(data, frame.size);
wcout << charToWChar(data, frame.size) << endl;
}

substitute strlen with sizeof for c-string

I want to use mbstowcs_s method but without iostream header. Therefore I cannot use strlen to predict the size of my buffer. The following method has to simply change c-string to wide c-string and return it:
char* changeToWide(char* value)
{
wchar_t* vOut = new wchar_t[strlen(value)+1];
mbstowcs_s(NULL,vOut,strlen(val)+1,val,strlen(val));
return vOut;
}
As soon as i change it to
char* changeToWide(char* value)
{
wchar_t* vOut = new wchar_t[sizeof(value)];
mbstowcs_s(NULL,vOut,sizeof(value),val,sizeof(value)-1);
return vOut;
}
I get wrong results (values are not the same in both arrays). What is the best way to work it out?
I am also open for other ideas how to make that conversion without using strings but pure arrays
Given a char* or const char* you cannot use sizeof() to get the size of the string being pointed by your char* variable. In this case, sizeof() will return you the number of bytes a pointer uses in memory (commonly 4 bytes in 32-bit architectures and 8 bytes in 64-bit architectures).
If you have an array of characters defined as array, you can use sizeof:
char text[] = "test";
auto size = sizeof(text); //will return you 5 because it includes the '\0' character.
But if you have something like this:
char text[] = "test";
const char* ptext = text;
auto size2 = sizeof(ptext); //will return you probably 4 or 8 depending on the architecture you are working on.
Not that I am an expert on this matter, but char to wchar_t conversion being made is seemingly nothing but using a wider space for the exact same bytes, in other words, prefixing each char with some set of zeroes.
I don't know C++ either, just C, but I can derive what it probably would look like in C++ by looking at your code, so here it goes:
wchar_t * changeToWide( char* value )
{
//counts the length of the value-array including the 0
int i = 0;
while ( value[i] != '\0' ) i++;
//allocates enough much memory
wchar_t * vOut = new wchar_t[i];
//assigns values including the 0
i = 0;
while ( ( vOut[i] = 0 | value[i] ) != '\0' ) i++;
return vOut;
}
0 | part looks truly obsolete to me, but I felt like including it, don't really know why...

Convert wchar_t to char

I was wondering is it safe to do so?
wchar_t wide = /* something */;
assert(wide >= 0 && wide < 256 &&);
char myChar = static_cast<char>(wide);
If I am pretty sure the wide char will fall within ASCII range.
Why not just use a library routine wcstombs.
assert is for ensuring that something is true in a debug mode, without it having any effect in a release build. Better to use an if statement and have an alternate plan for characters that are outside the range, unless the only way to get characters outside the range is through a program bug.
Also, depending on your character encoding, you might find a difference between the Unicode characters 0x80 through 0xff and their char version.
You are looking for wctomb(): it's in the ANSI standard, so you can count on it. It works even when the wchar_t uses a code above 255. You almost certainly do not want to use it.
wchar_t is an integral type, so your compiler won't complain if you actually do:
char x = (char)wc;
but because it's an integral type, there's absolutely no reason to do this. If you accidentally read Herbert Schildt's C: The Complete Reference, or any C book based on it, then you're completely and grossly misinformed. Characters should be of type int or better. That means you should be writing this:
int x = getchar();
and not this:
char x = getchar(); /* <- WRONG! */
As far as integral types go, char is worthless. You shouldn't make functions that take parameters of type char, and you should not create temporary variables of type char, and the same advice goes for wchar_t as well.
char* may be a convenient typedef for a character string, but it is a novice mistake to think of this as an "array of characters" or a "pointer to an array of characters" - despite what the cdecl tool says. Treating it as an actual array of characters with nonsense like this:
for(int i = 0; s[i]; ++i) {
wchar_t wc = s[i];
char c = doit(wc);
out[i] = c;
}
is absurdly wrong. It will not do what you want; it will break in subtle and serious ways, behave differently on different platforms, and you will most certainly confuse the hell out of your users. If you see this, you are trying to reimplement wctombs() which is part of ANSI C already, but it's still wrong.
You're really looking for iconv(), which converts a character string from one encoding (even if it's packed into a wchar_t array), into a character string of another encoding.
Now go read this, to learn what's wrong with iconv.
An easy way is :
wstring your_wchar_in_ws(<your wchar>);
string your_wchar_in_str(your_wchar_in_ws.begin(), your_wchar_in_ws.end());
char* your_wchar_in_char = your_wchar_in_str.c_str();
I'm using this method for years :)
A short function I wrote a while back to pack a wchar_t array into a char array. Characters that aren't on the ANSI code page (0-127) are replaced by '?' characters, and it handles surrogate pairs correctly.
size_t to_narrow(const wchar_t * src, char * dest, size_t dest_len){
size_t i;
wchar_t code;
i = 0;
while (src[i] != '\0' && i < (dest_len - 1)){
code = src[i];
if (code < 128)
dest[i] = char(code);
else{
dest[i] = '?';
if (code >= 0xD800 && code <= 0xD8FF)
// lead surrogate, skip the next code unit, which is the trail
i++;
}
i++;
}
dest[i] = '\0';
return i - 1;
}
Technically, 'char' could have the same range as either 'signed char' or 'unsigned char'. For the unsigned characters, your range is correct; theoretically, for signed characters, your condition is wrong. In practice, very few compilers will object - and the result will be the same.
Nitpick: the last && in the assert is a syntax error.
Whether the assertion is appropriate depends on whether you can afford to crash when the code gets to the customer, and what you could or should do if the assertion condition is violated but the assertion is not compiled into the code. For debug work, it seems fine, but you might want an active test after it for run-time checking too.
Here's another way of doing it, remember to use free() on the result.
char* wchar_to_char(const wchar_t* pwchar)
{
// get the number of characters in the string.
int currentCharIndex = 0;
char currentChar = pwchar[currentCharIndex];
while (currentChar != '\0')
{
currentCharIndex++;
currentChar = pwchar[currentCharIndex];
}
const int charCount = currentCharIndex + 1;
// allocate a new block of memory size char (1 byte) instead of wide char (2 bytes)
char* filePathC = (char*)malloc(sizeof(char) * charCount);
for (int i = 0; i < charCount; i++)
{
// convert to char (1 byte)
char character = pwchar[i];
*filePathC = character;
filePathC += sizeof(char);
}
filePathC += '\0';
filePathC -= (sizeof(char) * charCount);
return filePathC;
}
one could also convert wchar_t --> wstring --> string --> char
wchar_t wide;
wstring wstrValue;
wstrValue[0] = wide
string strValue;
strValue.assign(wstrValue.begin(), wstrValue.end()); // convert wstring to string
char char_value = strValue[0];
In general, no. int(wchar_t(255)) == int(char(255)) of course, but that just means they have the same int value. They may not represent the same characters.
You would see such a discrepancy in the majority of Windows PCs, even. For instance, on Windows Code page 1250, char(0xFF) is the same character as wchar_t(0x02D9) (dot above), not wchar_t(0x00FF) (small y with diaeresis).
Note that it does not even hold for the ASCII range, as C++ doesn't even require ASCII. On IBM systems in particular you may see that 'A' != 65

Howto read chunk of memory as char in c++

Hello I have a chunk of memory (allocated with malloc()) that contains bits (bit literal), I'd like to read it as an array of char, or, better, I'd like to printout the ASCII value of 8 consecutively bits of the memory.
I have allocated he memory as char *, but I've not been able to take characters out in a better way than evaluating each bit, adding the value to a char and shifting left the value of the char, in a loop, but I was looking for a faster solution.
Thank you
What I've wrote for now is this:
for allocation:
char * bits = (char*) malloc(1);
for writing to mem:
ifstream cleartext;
cleartext.open(sometext);
while(cleartext.good())
{
c = cleartext.get();
for(int j = 0; j < 8; j++)
{ //set(index) and reset(index) set or reset the bit at bits[i]
(c & 0x80) ? (set(index)):(reset(index));//(*ptr++ = '1'):(*ptr++='0');
c = c << 1;
}..
}..
and until now I've not been able to get character back, I only get the bits printed out using:
printf("%s\n" bits);
An example of what I'm trying to do is:
input.txt contains the string "AAAB"
My program would have to write "AAAB" as "01000001010000010100000101000010" to memory
(it's the ASCII values in bit of AAAB that are 65656566 in bits)
Then I would like that it have a function to rewrite the content of the memory to a file.
So if memory contains again "01000001010000010100000101000010" it would write to the output file "AAAB".
int numBytes = 512;
char *pChar = (char *)malloc(numBytes);
for( int i = 0; i < numBytes; i++ ){
pChar[i] = '8';
}
Since this is C++, you can also use "new":
int numBytes = 512;
char *pChar = new char[numBytes];
for( int i = 0; i < numBytes; i++ ){
pChar[i] = '8';
}
If you want to visit every bit in the memory chunk, it looks like you need std::bitset.
char* pChunk = malloc( n );
// read in pChunk data
// iterate over all the bits.
for( int i = 0; i != n; ++i ){
std::bitset<8>& bits = *reinterpret_cast< std::bitset<8>* >( pByte );
for( int iBit = 0; iBit != 8; ++iBit ) {
std::cout << bits[i];
}
}
I'd like to printout the ASCII value of 8 consecutively bits of the memory.
The possible value for any bit is either 0 or 1. You probably want at least a byte.
char * bits = (char*) malloc(1);
Allocates 1 byte on the heap. A much more efficient and hassle-free thing would have been to create an object on the stack i.e.:
char bits; // a single character, has CHAR_BIT bits
ifstream cleartext;
cleartext.open(sometext);
The above doesn't write anything to mem. It tries to open a file in input mode.
It has ascii characters and common eof or \n, or things like this, the input would only be a textfile, so I think it should only contain ASCII characters, correct me if I'm wrong.
If your file only has ASCII data you don't have to worry. All you need to do is read in the file contents and write it out. The compiler manages how the data will be stored (i.e. which encoding to use for your characters and how to represent them in binary, the endianness of the system etc). The easiest way to read/write files will be:
// include these on as-needed basis
#include <algorithm>
#include <iostream>
#include <iterator>
#include <fstream>
using namespace std;
// ...
/* read from standard input and write to standard output */
copy((istream_iterator<char>(cin)), (istream_iterator<char>()),
(ostream_iterator<char>(cout)));
/*-------------------------------------------------------------*/
/* read from standard input and write to text file */
copy(istream_iterator<char>(cin), istream_iterator<char>(),
ostream_iterator<char>(ofstream("output.txt"), "\n") );
/*-------------------------------------------------------------*/
/* read from text file and write to text file */
copy(istream_iterator<char>(ifstream("input.txt")), istream_iterator<char>(),
ostream_iterator<char>(ofstream("output.txt"), "\n") );
/*-------------------------------------------------------------*/
The last remaining question is: Do you want to do something with the binary representation? If not, forget about it. Else, update your question one more time.
E.g: Processing the character array to encrypt it using a block cipher
/* a hash calculator */
struct hash_sha1 {
unsigned char operator()(unsigned char x) {
// process
return rc;
}
};
/* store house of characters, could've been a vector as well */
basic_string<unsigned char> line;
/* read from text file and write to a string of unsigned chars */
copy(istream_iterator<unsigned char>(ifstream("input.txt")),
istream_iterator<char>(),
back_inserter(line) );
/* Calculate a SHA-1 hash of the input */
basic_string<unsigned char> hashmsg;
transform(line.begin(), line.end(), back_inserter(hashmsg), hash_sha1());
Something like this?
char *buffer = (char*)malloc(42);
// ... put something into the buffer ...
printf("%c\n", buffer[0]);
But, since you're using C++, I wonder why you bother with malloc and such...
char* ptr = pAddressOfMemoryToRead;
while(ptr < pAddressOfMemoryToRead + blockLength)
{
char tmp = *ptr;
// temp now has the char from this spot in memory
ptr++;
}
Is this what you are trying to achieve:
char* p = (char*)malloc(10 * sizeof(char));
char* p1 = p;
memcpy(p,"abcdefghij", 10);
for(int i = 0; i < 10; ++i)
{
char c = *p1;
cout<<c<<" ";
++p1;
}
cout<<"\n";
free(p);
Can you please explain in more detail, perhaps including code? What you're saying makes no sense unless I'm completely misreading your question. Are you doing something like this?
char * chunk = (char *)malloc(256);
If so, you can access any character's worth of data by treating chunk as an array: chunk[5] gives you the 5th element, etc. Of course, these will be characters, which may be what you want, but I can't quite tell from your question... for instance, if chunk[5] is 65, when you print it like cout << chunk[5];, you'll get a letter 'A'.
However, you may be asking how to print out the actual number 65, in which case you want to do cout << int(chunk[5]);. Casting to int will make it print as an integer value instead of as a character. If you clarify your question, either I or someone else can help you further.
Are you asking how to copy the memory bytes of an arbitrary struct into a char* array? If so this should do the trick
SomeType t = GetSomeType();
char* ptr = malloc(sizeof(SomeType));
if ( !ptr ) {
// Handle no memory. Probably should just crash
}
memcpy(ptr,&t,sizeof(SomeType));
I'm not sure I entirely grok what you're trying to do, but a couple of suggestions:
1) use std::vector instead of malloc/free and new/delete. It's safer and doesn't have much overhead.
2) when processing, try doing chunks rather than bytes. Even though streams are buffered, it's usually more efficient grabbing a chunk at a time.
3) there's a lot of different ways to output bits, but again you don't want a stream output for each character. You might want to try something like the following:
void outputbits(char *dest, char source)
{
dest[8] = 0;
for(int i=0; i<8; ++i)
dest[i] = source & (1<<(7-i)) ? '1':'0';
}
Pass it a char[9] output buffer and a char input, and you get a printable bitstring back. Decent compilers produce OK output code for this... how much speed do you need?