How to detect UTF-16 strings in PE files - C++

I need to extract Unicode strings from a PE file, which means I first need to detect them. For UTF-8 strings I used the approach from the following link - How to easily detect utf8 encoding in the string?. Is there a similar way to detect UTF-16 strings? I have tried the following code. Is this right? Any help or suggestions are appreciated. Thanks in advance!
BYTE temp1 = buf[offset];
BYTE temp2 = buf[offset+1];
while (!(temp1 == 0x00 && temp2 == 0x00) && offset <= bufSize)
{
    if ((temp1 >= 0x00 && temp1 <= 0xFF) && (temp2 >= 0x00 && temp2 <= 0xFF))
    {
        tmp += 2;
    }
    else
    {
        break;
    }
    offset += 2;
    temp1 = buf[offset];
    temp2 = buf[offset+1];
    if (temp1 == 0x00 && temp2 == 0x00)
    {
        break;
    }
}

I just implemented a function for you, DecodeUtf16Char(). It can do two things: either just check whether the input is valid UTF-16 (when check_only = true), or check it and return the decoded Unicode code point (32-bit). It also supports both big-endian (the default, big_endian = true) and little-endian (big_endian = false) byte order within each two-byte UTF-16 word. bad_skip is the number of bytes to skip when a character fails to decode (invalid UTF-16); bad_value is the value used to signal that the UTF-16 could not be decoded (was invalid), -1 by default.
Examples of usage/tests are included after the function definition. You pass a starting pointer (ptr) and an ending pointer to the function and check the return value: if it is -1, the bytes at the starting pointer were an invalid UTF-16 sequence; otherwise the returned value is a valid 32-bit Unicode code point. The function also advances ptr, by the number of decoded bytes for valid UTF-16 or by bad_skip bytes if the sequence is invalid.
The functions should be very fast, because they contain only a few ifs (plus a bit of arithmetic when you ask to actually decode characters). Place them in a header so they can be inlined into the calling function and produce very fast code. Also pass only compile-time constants for check_only and big_endian; this lets the compiler optimize away the unused decoding code.
If, for example, you just want to detect long runs of UTF-16 bytes, iterate in a loop calling this function: the first position where it returns something other than -1 is a possible beginning of text, and the last such position marks the end of the text. It is also important to pass bad_skip = 1 when searching for UTF-16, because a valid character may start at any byte.
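A rough sketch of that scanning loop (begin, end, run_start and run_end are names I made up; it assumes little-endian UTF-16, as used in PE files, and calls the DecodeUtf16Char() defined below):
// Sketch only: scan [begin, end) and remember where valid UTF-16 starts and ends.
uint8_t const * run_start = nullptr;  // first byte of a possible text run
uint8_t const * run_end   = nullptr;  // one past the last valid character seen
uint8_t const * p = begin;
while (p < end) {
    uint8_t const * prev = p;
    // check_only = true, big_endian = false, bad_skip = 1
    uint32_t c = DecodeUtf16Char(p, end, true, false, 1);
    if (c != uint32_t(-1)) {
        if (run_start == nullptr)
            run_start = prev;   // possible beginning of text
        run_end = p;            // last point of text so far
    }
}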
For testing I used a mix of characters: English ASCII, Russian characters (two-byte UTF-16), plus two 4-byte characters (surrogate pairs, i.e. two UTF-16 words). The tests append the converted line to a test.txt file, which is UTF-8 encoded so it can easily be viewed, e.g. in Notepad. All of the code after the decoding function is not needed for decoding; it is just test code.
The decoder consists of two functions: _DecodeUtf16Char_ReadWord() (a helper) and DecodeUtf16Char() (the main decoder). Only one standard header, <cstdint>, is included; if you are not allowed to include anything, just define uint8_t, uint16_t and uint32_t yourself, as those are the only definitions used from that header.
Also, for reference, see my other post, which implements all conversions between UTF-8 <-> UTF-16 <-> UTF-32, both from scratch and using the standard C++ library.
Try it online!
#include <cstdint>
static inline bool _DecodeUtf16Char_ReadWord(
    uint8_t const * & ptrc, uint8_t const * end,
    uint16_t & r, bool const big_endian
) {
    if (ptrc + 1 >= end) {
        // No data left.
        if (ptrc < end)
            ++ptrc;
        return false;
    }
    if (big_endian) {
        r  = uint16_t(*ptrc) << 8; ++ptrc;
        r |= uint16_t(*ptrc)     ; ++ptrc;
    } else {
        r  = uint16_t(*ptrc)     ; ++ptrc;
        r |= uint16_t(*ptrc) << 8; ++ptrc;
    }
    return true;
}
static inline uint32_t DecodeUtf16Char(
    uint8_t const * & ptr, uint8_t const * end,
    bool const check_only = true, bool const big_endian = true,
    uint32_t const bad_skip = 1, uint32_t const bad_value = -1
) {
    auto ptrs = ptr, ptrc = ptr;
    uint32_t c = 0;
    uint16_t v = 0;
    if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
        // No data left.
        c = bad_value;
    } else if (v < 0xD800 || v > 0xDFFF) {
        // Correct single-word symbol.
        if (!check_only)
            c = v;
    } else if (v >= 0xDC00) {
        // Disallowed UTF-16 sequence (lone low surrogate)!
        c = bad_value;
    } else { // Possibly double-word sequence.
        if (!check_only)
            c = (v & 0x3FF) << 10;
        if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
            // No data left.
            c = bad_value;
        } else if ((v < 0xDC00) || (v > 0xDFFF)) {
            // Disallowed UTF-16 sequence (high surrogate not followed by low surrogate)!
            c = bad_value;
        } else {
            // Correct double-word symbol.
            if (!check_only) {
                c |= v & 0x3FF;
                c += 0x10000;
            }
        }
    }
    if (c == bad_value)
        ptr = ptrs + bad_skip; // Skip bytes.
    else
        ptr = ptrc; // Skip all eaten bytes.
    return c;
}
// --------- The following code is only for testing and is not needed for decoding ------------
#include <iostream>
#include <string>
#include <codecvt>
#include <fstream>
#include <locale>
static std::u32string DecodeUtf16Bytes(uint8_t const * ptr, uint8_t const * end) {
    std::u32string res;
    while (true) {
        if (ptr >= end)
            break;
        uint32_t c = DecodeUtf16Char(ptr, end, false, false, 2);
        if (c != -1)
            res.append(1, c);
    }
    return res;
}
#if (!_DLL) && (_MSC_VER >= 1900 /* VS 2015*/) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif
template <typename CharT = char>
static std::basic_string<CharT> U32ToU8(std::u32string const & s) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv;
    auto res = utf_8_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
    return res;
}
template <typename WCharT = wchar_t>
static std::basic_string<WCharT> U32ToU16(std::u32string const & s) {
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffffUL, std::little_endian>, char32_t> utf_16_32_conv;
    auto res = utf_16_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
    return std::basic_string<WCharT>((WCharT*)(res.c_str()), (WCharT*)(res.c_str() + res.length()));
}
template <typename StrT>
void OutputString(StrT const & s) {
    std::ofstream f("test.txt", std::ios::binary | std::ios::app);
    f.write((char*)s.c_str(), size_t((uint8_t*)(s.c_str() + s.length()) - (uint8_t*)s.c_str()));
    f.write("\n\x00", sizeof(s.c_str()[0]));
}
int main() {
    std::u16string a = u"привет|мир|hello|𐐷|world|𤭢|again|русский|english";
    *((uint8_t*)(a.data() + 12) + 1) = 0xDD; // Introduce bad utf-16 byte.
    // Also truncate by 1 byte ("... - 1" in next line).
    OutputString(U32ToU8(DecodeUtf16Bytes((uint8_t*)a.c_str(), (uint8_t*)(a.c_str() + a.length()) - 1)));
    return 0;
}
Output:
привет|мир|hllo|𐐷|world|𤭢|again|русский|englis

Related

String into binary

I used a Huffman encoding that we wrote to compress a file.
The function takes a String and its output is a String.
The problem is that I want to save the encoded output as binary so it is smaller than the original, but when I keep it as a string of 0's and 1's, its size is larger than the original file. How can I convert that string of 0's and 1's to binary so that every character is stored in 1 bit? I am using Qt to achieve this:
string Huffman_encoding(string text)
{
    buildHuffmanTree(text);
    string encoded = "";
    unordered_map<char, string> StringEncoded;
    encoding(main_root, "", StringEncoded);
    for (char ch : text) {
        encoded += StringEncoded[ch];
    }
    return encoded;
}
The canonical solution uses a "bit packer" that accepts bitstrings and emits packed bytes. As a starting point, replace encoded with an instance of the following:
class BitPacker {
    QByteArray res;
    quint8 bitsLeft = 8;
    quint8 buf = 0;
public:
    void operator+=(const std::string& s) {
        for (auto c : s) {
            buf = buf << 1 | (c - '0');
            if (--bitsLeft == 0) {
                res.append(buf);
                buf = 0;
                bitsLeft = 8;
            }
        }
    }
    QByteArray finish() {
        if (bitsLeft < 8) {
            res.append(buf << bitsLeft);
            buf = 0;
            bitsLeft = 8;
        }
        return res;
    }
};
operator+= will add additional bits to buf and flush complete bytes to res. At the end of the process you may be left with, say, 3 bits. finish uses a simple algorithm: it pads the buffer with zeroes to produce a final byte and hands you back the fully encoded buffer.
A more sophisticated solution might be to introduce an explicit "end of stream" token that is not present in the source character set.
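A minimal sketch of what the call site could then look like; text, the file name and the QFile part are my assumptions, not from the original post:
#include <QFile>
BitPacker packer;
packer += Huffman_encoding(text);     // feed the "0"/"1" string bit by bit
QByteArray packed = packer.finish();  // pad the last byte and collect the result
QFile out("compressed.bin");          // file name is arbitrary
if (out.open(QIODevice::WriteOnly))
    out.write(packed);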
It seems what you're searching for is a way to convert a string containing a sequence of 0s and 1s, like "0000010010000000", into an actual binary representation (the numbers 4 and 128 in this example).
This could be achieved with a function like this:
#include <algorithm>
#include <iostream>
#include <string>
#include <cstdint>
#include <vector>
std::vector<uint8_t> toBinary(std::string const& binStr)
{
    std::vector<uint8_t> result;
    result.reserve(binStr.size() / 8);
    size_t pos = 0;
    size_t len = binStr.length();
    while (pos < len)
    {
        size_t curLen = std::min(static_cast<size_t>(8), len - pos);
        auto curStr = binStr.substr(pos, curLen) + std::string(8 - curLen, '0');
        std::cout << "curLen: " << curLen << ", curStr: " << curStr << "\n";
        result.push_back(std::stoi(curStr, 0, 2));
        pos += 8;
    }
    return result;
}
// test:
int main()
{
    std::string binStr("000001001000000001");
    auto bin = toBinary(binStr);
    for (auto i : bin)
    {
        std::cout << static_cast<int>(i) << " ";
    }
    return 0;
}
Output:
4 128 64
You can then do whatever you want with these numbers, e.g. write them into a binary file.
Note that toBinary, as written above, pads the last byte with zeros if it is incomplete.
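Since the point of the exercise is a smaller file, the resulting bytes can then be written out in binary mode, for example like this (a sketch extending the test main() above; the file name is arbitrary):
#include <fstream>
// ...
std::ofstream out("encoded.bin", std::ios::binary);
out.write(reinterpret_cast<const char*>(bin.data()),
          static_cast<std::streamsize>(bin.size()));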
You can create a bitstream using bitwise logic like this:
#include <cassert>
#include <cstdint>
#include <string>
#include <stdexcept>
#include <vector>
auto to_bit_stream(const std::string& bytes)
{
    std::vector<std::uint8_t> stream;
    std::uint8_t shift{ 0 };
    std::uint8_t out{ 0 };
    // allocate enough bytes to hold the bits
    // speeds up the code a bit
    stream.reserve((bytes.size() + 7) / 8);
    // loop over all bytes
    for (const auto c : bytes)
    {
        // check input
        if (!((c == '0') || (c == '1'))) throw std::invalid_argument("invalid character in input");
        // shift output by one to accept next bit
        out <<= 1;
        // keep track of number of shifts
        // after 8 shifts a byte has been filled
        shift++;
        // or the output with a 1 if needed
        out |= (c == '1');
        // complete an output byte
        if (shift == 8)
        {
            stream.push_back(out);
            out = 0;
            shift = 0;
        }
    }
    return stream;
}
int main()
{
    // stream is 8 bits per value, values 0,1,2,3
    auto stream = to_bit_stream("00000000000000010000001000000011");
    assert(stream.size() == 4ul);
    assert(stream[0] == 0);
    assert(stream[1] == 1);
    assert(stream[2] == 2);
    assert(stream[3] == 3);
    return 0;
}
Use std::stoi()
int n = std::stoi("01000100", nullptr, 2);

boost::uuid into char* without std::string

I am trying to convert a boost UUID to a char * without having to use std::string at all.
I basically adapted the to_string method from https://www.boost.org/doc/libs/1_68_0/boost/uuid/uuid_io.hpp into my own version. However, it fails on certain UUIDs.
Here is my modification:
#include <boost/uuid/string_generator.hpp>
#include <boost/uuid/uuid_generators.hpp>
using UUID = boost::uuids::uuid;
static constexpr std::size_t UUID_STR_LEN = 37;
inline char uuid_byte_to_char(size_t i)
{
    if (i <= 9) {
        return static_cast<char>('0' + i);
    } else {
        return static_cast<char>('a' + (i - 10));
    }
}
inline void uuid_to_cstr(UUID const& uuid, char out[UUID_STR_LEN])
{
    std::size_t out_i = 0;
    std::size_t dash_i = 0;
    for (UUID::const_iterator it_data = uuid.begin(); it_data != uuid.end(); ++it_data, ++dash_i) {
        const size_t hi = ((*it_data) >> 4) & 0x0F;
        out[out_i++] = uuid_byte_to_char(hi);
        const size_t lo = (*it_data) & 0x0F;
        out[out_i++] = uuid_byte_to_char(lo);
        if (dash_i == 3 || dash_i == 5 || dash_i == 7 || dash_i == 9) {
            out[out_i++] += '-';
        }
    }
    out[UUID_STR_LEN - 1] = '\0';
}
Usage:
int main() {
    UUID uuid(uuid_generator());
    char uuid_cstr[UUID_STR_LEN];
    uuid_to_cstr(uuid, uuid_cstr);
    std::cout << uuid_cstr << "\n";
}
So if the UUID was cd0fa728-e7d6-4578-9450-7beb284e0103 for example this works fine.
However, for 0cf31c43-7621-407c-94d6-6d593bae96e8 what I actually end up getting is 0cf31c43-7621�407cQ94d6-6d593bae96e8.
What's the problem in my code? As far as I'm aware, my char manipulations mimic what the std::string is doing minus the temporary copies due to the constant appending. Or am I mistaken?
Your buffer char uuid_cstr[UUID_STR_LEN]; is allocated on the stack, so it contains garbage: every element starts with some indeterminate value, probably not 0.
1) You can set all elements to zero first:
char uuid_cstr[UUID_STR_LEN];
memset(uuid_cstr, 0, UUID_STR_LEN);
Then the following statement works as intended:
out[out_i++] += '-';
2) Or use the assignment
out[out_i++] = '-';
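For completeness, a sketch of the call site with fix 2) applied; it assumes uuid_generator is a boost::uuids::random_generator, which is only a guess based on the question:
#include <boost/uuid/random_generator.hpp>
#include <iostream>
int main() {
    boost::uuids::random_generator uuid_generator;
    UUID uuid = uuid_generator();
    char uuid_cstr[UUID_STR_LEN];   // no memset needed once '=' is used inside uuid_to_cstr
    uuid_to_cstr(uuid, uuid_cstr);
    std::cout << uuid_cstr << "\n";
}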

uintx_t to const char* in freestanding c++ using GNU compiler

So I am trying to convert some integers into character arrays that my terminal can write, so that I can see the values of my code's calculations for debugging purposes while it is running.
For example, if int_t count = 57, I want the terminal to write 57,
so the char* would be an array of the characters 5 and 7.
The kicker here is that this is a freestanding environment, which means no standard C++ library.
EDIT:
This means no std::string, no c_str, no _tostring; I can't just print integers.
The headers I have access to are iso646, stddef, float, limits, stdint, stdalign, stdarg, stdbool and stdnoreturn.
I've tried a few things, from casting the int to a const char* (which just led to random characters being displayed) to feeding my compiler different headers from the GCC collection, but they just kept needing other headers, which I kept feeding it until I no longer knew which header the compiler wanted.
So here is where the value needs to be printed:
uint8_t count = 0;
while (true)
{
    terminal_setcolor(3);
    terminal_writestring("hello\n");
    count++;
    terminal_writestring((const char*)count);
    terminal_writestring("\n");
}
Any advice would be greatly appreciated.
I am using a GNU g++ cross compiler targeted at 686-elf, and I guess I am using C++11 since I have access to stdnoreturn.h, but it could be C++14 since I only just built the compiler with the latest GNU software dependencies.
Without the C/C++ standard library you have no option except writing a conversion function manually, e.g.:
template <int N>
const char* uint_to_string(
    unsigned int val,
    char (&str)[N],
    unsigned int base = 10)
{
    static_assert(N > 1, "Buffer too small");
    static const char* const digits = "0123456789ABCDEF";
    if (base < 2 || base > 16) return nullptr;
    int i = N - 1;
    str[i] = 0;
    do
    {
        --i;
        str[i] = digits[val % base];
        val /= base;
    }
    while (val != 0 && i > 0);
    return val == 0 ? str + i : nullptr;
}
template <int N>
const char* int_to_string(
    int val,
    char (&str)[N],
    unsigned int base = 10)
{
    // Output as unsigned.
    if (val >= 0) return uint_to_string(val, str, base);
    // Output as binary representation if base is not decimal.
    if (base != 10) return uint_to_string(val, str, base);
    // Output signed decimal representation.
    const char* res = uint_to_string(-val, str, base);
    // Buffer has place for minus sign.
    if (res > str)
    {
        const auto i = res - str - 1;
        str[i] = '-';
        return str + i;
    }
    else return nullptr;
}
Usage:
char buf[100];
terminal_writestring(int_to_string(42, buf)); // Will print '42'
terminal_writestring(int_to_string(42, buf, 2)); // Will print '101010'
terminal_writestring(int_to_string(42, buf, 8)); // Will print '52'
terminal_writestring(int_to_string(42, buf, 16)); // Will print '2A'
terminal_writestring(int_to_string(-42, buf)); // Will print '-42'
terminal_writestring(int_to_string(-42, buf, 2)); // Will print '11111111111111111111111111010110'
terminal_writestring(int_to_string(-42, buf, 8)); // Will print '37777777726'
terminal_writestring(int_to_string(-42, buf, 16)); // Will print 'FFFFFFD6'
Live example: http://cpp.sh/5ras
You could declare a string and get a pointer to its contents:
std::string str = std::to_string(count);
str += "\n";
terminal_writestring(str.c_str());

Most efficient way to convert 8 hex chars into a 4-uint8_t array?

I have a const char*, pointing to an array of 8 characters (that may be a part of a larger string), containing a hexadecimal value. I need a function that converts those chars into an array of 4 uint8_t, where the two first characters in the source array will become the first element in the target array, and so on. For example, if I have this
const char* s = "FA0BD6E4";
I want it converted to
uint8_t i[4] = {0xFA, 0x0B, 0xD6, 0xE4};
Currently, I have these functions:
inline constexpr uint8_t HexChar2UInt8(char h) noexcept
{
    return static_cast<uint8_t>((h & 0xF) + (((h & 0x40) >> 3) | ((h & 0x40) >> 6)));
}
inline constexpr uint8_t HexChars2UInt8(char h0, char h1) noexcept
{
    return (HexChar2UInt8(h0) << 4) | HexChar2UInt8(h1);
}
inline constexpr std::array<uint8_t, 4> HexStr2UInt8(const char* in) noexcept
{
    return {{
        HexChars2UInt8(in[0], in[1]),
        HexChars2UInt8(in[2], in[3]),
        HexChars2UInt8(in[4], in[5]),
        HexChars2UInt8(in[6], in[7])
    }};
}
Here's what it will look like where I call it from:
const char* s = ...; // the source string
std::array<uint8_t, 4> a; // I need to place the resulting value in this array
a = HexStr2UInt8(s); // the function call does not have to look like this
What I'm wondering is: is there any more efficient (and portable) way to do this? For example, is returning a std::array a good thing to do, or should I pass a dst pointer to HexChars2UInt8? Or is there any other way to improve my function(s)?
The main reason I'm asking this is because I will likely need to optimize this at some point, and it will be problematic if the API (the function prototype) is changed in the future.
You can add parallelism: the HexChar2UInt8 logic can be applied to 8 characters at the same time. It's probably faster to load a non-aligned 64-bit value once than to load 8 chars one by one (and call the conversion function for each):
void hexChar2Uints(uint8_t *ptr, uint64_t *result) // make result aligned to qword
{
    uint64_t d = *(uint64_t*)ptr;
    uint64_t hi = (d >> 6) & 0x0101010101010101;
    d &= 0x0f0f0f0f0f0f0f0f;
    *result = d + (hi * 9); // let the compiler decide the fastest method
}
The last stage has to be done as the OP suggested, just reading from the modified "string":
for (n=0;n<4;n++) arr[n]=(tmp[2*n]<<4) | tmp[2*n+1];
The chances are slim that this can be sped up considerably. The << 4 operation could be folded into hexChar2Uints, making that part parallel too, but I doubt it can be done in fewer than 4 arithmetic operations.
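Pieced together for one 8-character group, the two stages might look roughly like this (tmp64 and arr are my names; the unaligned 64-bit load is the same assumption the helper above already makes):
uint64_t tmp64;
hexChar2Uints((uint8_t*)s, &tmp64);      // s points at 8 hex characters
uint8_t* tmp = (uint8_t*)&tmp64;         // each byte now holds one nibble value
uint8_t arr[4];
for (int n = 0; n < 4; n++)
    arr[n] = (tmp[2*n] << 4) | tmp[2*n+1];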
The most efficient, i.e. the fastest, way to do the conversion is probably to set up a table of 65536 entries, one for every possible pair of characters, and store the converted values in the valid ones.
If you store the entries as unsigned chars you won't be able to catch errors, so you'll just have to hope you get valid input. If you use a value type bigger than unsigned char you can reserve some kind of "error" value, but checking for it will be an overhead. (The extra 65536 bytes probably isn't.)
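A sketch of that table, using uint16_t entries so an "error" value fits (pairTable, initPairTable and pairLookup are names I made up):
#include <cstdint>
static uint16_t pairTable[65536];   // 128 KB: one entry per possible character pair
void initPairTable()
{
    auto nibble = [](unsigned c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    for (unsigned hi = 0; hi < 256; ++hi)
        for (unsigned lo = 0; lo < 256; ++lo) {
            int h = nibble(hi), l = nibble(lo);
            pairTable[(hi << 8) | lo] =
                (h < 0 || l < 0) ? 0xFFFF                 // "error" marker
                                 : uint16_t((h << 4) | l);
        }
}
// e.g. pairLookup('F', 'A') == 0xFA; invalid pairs return 0xFFFF
inline uint16_t pairLookup(char c0, char c1)
{
    return pairTable[((uint8_t)c0 << 8) | (uint8_t)c1];
}
Whether the 128 KB table actually beats the arithmetic version depends on cache behaviour, so it is worth measuring.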
What you have written is probably efficient enough too though. Of course once again you are also not checking for invalid input and will get a result anyway.
If you keep yours I might change:
((h & 0x40) >> 3) | ((h & 0x40) >> 6)
which seems to be a substitute for
( (h & 0x40) ? 10 : 0 )
I can't see how my expression would be less efficient than yours, and it is probably clearer in intent. (Use 0xA rather than 10 if you insist on hex.)
There are several approaches possible. The simplest and most portable is to break the input down into two-character std::strings, use each to initialize an std::istringstream, set up the correct format flags, and read the value from that.
A somewhat more efficient solution would be to create a single string, insert whitespace to separate the individual values, and just use one std::istringstream, something like:
std::vector<uint8_t>
convert4UChars( std::string const& in )
{
    assert( in.size() >= 8 );
    std::string tmp( in.begin(), in.begin() + 8 );
    int i = tmp.size();
    while ( i > 2 ) {
        i -= 2;
        tmp.insert( i, 1, ' ');
    }
    std::istringstream s(tmp);
    s.setf( std::ios_base::hex, std::ios_base::basefield );
    std::vector<int> results( 4 );
    s >> results[0] >> results[1] >> results[2] >> results[3];
    if ( !s ) {
        // error...
    }
    return std::vector<uint8_t>( results.begin(), results.end() );
}
If you really want to do it by hand, the alternative is to create a 256-entry table, indexed by each character, and use that:
class HexValueTable
{
    std::array<uint8_t, 256> myValues;
public:
    HexValueTable()
    {
        std::fill( myValues.begin(), myValues.end(), -1 );
        for ( int i = '0'; i <= '9'; ++ i ) {
            myValues[ i ] = i - '0';
        }
        for ( int i = 'a'; i <= 'f'; ++ i ) {
            myValues[ i ] = i - 'a' + 10;
        }
        for ( int i = 'A'; i <= 'F'; ++ i ) {
            myValues[ i ] = i - 'A' + 10;
        }
    }
    uint8_t operator[]( char ch ) const
    {
        uint8_t results = myValues[static_cast<unsigned char>( ch )];
        if ( results == static_cast<unsigned char>( -1 ) ) {
            // error, throw some exceptions...
        }
        return results;
    }
};
std::array<uint8_t, 4>
convert4UChars( std::string const& in )
{
    static HexValueTable const hexValues;
    assert( in.size() >= 8 );
    std::array<uint8_t, 4> results;
    std::string::const_iterator source = in.begin();
    for ( int i = 0; i < 4; ++ i ) {
        results[i]  = (hexValues[*source ++]) << 4;
        results[i] |= hexValues[*source ++];
    }
    return results;
}
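A possible call site for the table-based version, matching the example input from the question:
#include <cstdio>
int main()
{
    std::array<uint8_t, 4> a = convert4UChars("FA0BD6E4");
    // a now holds {0xFA, 0x0B, 0xD6, 0xE4}
    for (auto b : a)
        std::printf("%02X ", b);   // prints: FA 0B D6 E4
}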

Converting from char string to an array of uint8_t?

I'm reading a string from a file so it's in the form of a char array. I need to tokenize the string and save each char array token as a uint8_t hex value in an array.
char* starting = "001122AABBCC";
// ...
uint8_t[] ending = {0x00,0x11,0x22,0xAA,0xBB,0xCC}
How can I convert from starting to ending? Thanks.
Here is a complete working program. It is based on Rob I's solution, but fixes several problems and has been tested to work.
#include <string>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iostream>
const char* starting = "001122AABBCC";
int main()
{
    std::string starting_str = starting;
    std::vector<unsigned char> ending;
    ending.reserve( starting_str.size());
    for (int i = 0 ; i < starting_str.length() ; i+=2) {
        std::string pair = starting_str.substr( i, 2 );
        ending.push_back(::strtol( pair.c_str(), 0, 16 ));
    }
    for(int i=0; i<ending.size(); ++i) {
        printf("0x%X\n", ending[i]);
    }
}
strtoul will convert text in any base you choose into a number. You have to do a little work to chop the input string into individual digits, or you can convert 32 or 64 bits at a time.
P.S. uint8_t[] ending = {0x00,0x11,0x22,0xAA,0xBB,0xCC}
doesn't mean anything by itself: you aren't storing the data in a uint8_t as 'hex', you are storing bytes; it's up to how you (or your debugger) interpret the binary data.
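A sketch of the "32 bits at a time" variant (hex_to_bytes is a made-up helper; error handling is omitted and only whole 8-character chunks are converted):
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>
std::vector<uint8_t> hex_to_bytes(const char* s)
{
    std::vector<uint8_t> out;
    size_t len = std::strlen(s);
    for (size_t i = 0; i + 8 <= len; i += 8) {
        char chunk[9];
        std::memcpy(chunk, s + i, 8);
        chunk[8] = '\0';
        uint32_t v = (uint32_t)std::strtoul(chunk, nullptr, 16);
        // strtoul gives one 32-bit value; split it back into big-endian bytes
        out.push_back((uint8_t)(v >> 24));
        out.push_back((uint8_t)(v >> 16));
        out.push_back((uint8_t)(v >> 8));
        out.push_back((uint8_t)(v));
    }
    return out;
}
For "001122AABBCC" this still leaves the trailing 4 characters to handle separately, which is why chopping into pairs is simpler in practice.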
With C++11, you may use std::stoi for that:
std::vector<uint8_t> convert(const std::string& s)
{
    if (s.size() % 2 != 0) {
        throw std::runtime_error("Bad size argument");
    }
    std::vector<uint8_t> res;
    res.reserve(s.size() / 2);
    for (std::size_t i = 0, size = s.size(); i != size; i += 2) {
        std::size_t pos = 0;
        res.push_back(std::stoi(s.substr(i, 2), &pos, 16));
        if (pos != 2) {
            throw std::runtime_error("bad character in argument");
        }
    }
    return res;
}
Live example.
I think any canonical answer (w.r.t. the bounty notes) would involve some distinct phases in the solution:
Error checking for valid input (length check and data content check)
Element conversion
Output creation
Given the usefulness of such conversions, the solution should probably include some flexibility w.r.t. the types being used and the locale required.
From the outset, given the date of the request for a "more canonical answer" (circa August 2014) liberal use of C++11 will be applied.
An annotated version of the code, with types corresponding to the OP:
std::vector<std::uint8_t> convert(std::string const& src)
{
    // error check on the length
    if ((src.length() % 2) != 0) {
        throw std::invalid_argument("conversion error: input is not even length");
    }
    auto ishex = [] (decltype(*src.begin()) c) {
        return std::isxdigit(c, std::locale()); };
    // error check on the data contents
    if (!std::all_of(std::begin(src), std::end(src), ishex)) {
        throw std::invalid_argument("conversion error: input values are not all xdigits");
    }
    // allocate the result, initialised to 0 and sized to the correct length
    std::vector<std::uint8_t> result(src.length() / 2, 0);
    // run the actual conversion
    auto str = src.begin(); // track the location in the string
    std::for_each(result.begin(), result.end(), [&str](decltype(*result.begin())& element) {
        element = static_cast<std::uint8_t>(std::stoul(std::string(str, str + 2), nullptr, 16));
        std::advance(str, 2); // next two elements
    });
    return result;
}
The template version of the code adds flexibility:
template <typename Int /*= std::uint8_t*/,
          typename Char = char,
          typename Traits = std::char_traits<Char>,
          typename Allocate = std::allocator<Char>,
          typename Locale = std::locale>
std::vector<Int> basic_convert(std::basic_string<Char, Traits, Allocate> const& src, Locale locale = Locale())
{
    using string_type = std::basic_string<Char, Traits, Allocate>;
    auto ishex = [&locale] (decltype(*src.begin()) c) {
        return std::isxdigit(c, locale); };
    if ((src.length() % 2) != 0) {
        throw std::invalid_argument("conversion error: input is not even length");
    }
    if (!std::all_of(std::begin(src), std::end(src), ishex)) {
        throw std::invalid_argument("conversion error: input values are not all xdigits");
    }
    std::vector<Int> result(src.length() / 2, 0);
    auto str = std::begin(src);
    std::for_each(std::begin(result), std::end(result), [&str](decltype(*std::begin(result))& element) {
        element = static_cast<Int>(std::stoul(string_type(str, str + 2), nullptr, 16));
        std::advance(str, 2);
    });
    return result;
}
The convert() function can then be based on the basic_convert() as follows:
std::vector<std::uint8_t> convert(std::string const& src)
{
    return basic_convert<std::uint8_t>(src, std::locale());
}
Live sample.
uint8_t is typically no more than a typedef of an unsigned char. If you're reading characters from a file, you should be able to read them into an unsigned char array just as easily as a signed char array, and an unsigned char array is a uint8_t array.
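For example, slurping the raw bytes of a file straight into a uint8_t buffer could look like this (a sketch; the helper name is made up):
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>
std::vector<uint8_t> read_all(const char* path)
{
    std::ifstream f(path, std::ios::binary);
    return std::vector<uint8_t>(std::istreambuf_iterator<char>(f),
                                std::istreambuf_iterator<char>());
}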
I'd try something like this:
std::string starting_str = starting;
uint8_t* ending = new uint8_t[starting_str.length()/2];
for (int i = 0 ; i < starting_str.length() ; i+=2) {
    std::string pair = starting_str.substr( i, 2 );
    ending[i/2] = ::strtol( pair.c_str(), 0, 16 );
}
Didn't test it but it looks good to me...
You may add your own conversion from the set of chars { '0','1',...,'E','F' } to uint8_t:
uint8_t ctoa(char c)
{
    if( c >= '0' && c <= '9' ) return c - '0';
    else if( c >= 'a' && c <= 'f' ) return 0xA + c - 'a';
    else if( c >= 'A' && c <= 'F' ) return 0xA + c - 'A';
    else return 0;
}
Then it is easy to convert the string into an array:
uint32_t endingSize = strlen(starting)/2;
uint8_t* ending = new uint8_t[endingSize];
for( uint32_t i=0; i<endingSize; i++ )
{
    ending[i] = ( ctoa( starting[i*2] ) << 4 ) + ctoa( starting[i*2+1] );
}
This simple solution should work for your problem:
char* starting = "001122AABBCC";
uint8_t ending[6];
// This algorithm works for any (even) length of starting.
// However, you have to make sure that ending has enough space.
size_t i = 0;
while (i < strlen(starting))
{
    // copy the next two characters into a small string
    char str[3] = { starting[i], starting[i+1], '\0' };
    // convert the string to an int, base 16
    ending[i/2] = (uint8_t)strtol(str, NULL, 16);
    i += 2;
}
uint8_t* ending = reinterpret_cast<uint8_t*>(starting);