So I've asked this before but with significantly less detail. The question title accurately describes the problem: I have a method in C++ that I am trying to call from assembly (x86) that has both parameters and a return value. I have a rough understanding, at best, of assembly and a fairly solid understanding of C++ (otherwise I would not have undertaken this problem). Here's what I have as far as code goes:
// methodAddr is a pointer to the method address
void* methodAddr = method->Address;
// buffer is an int array of parameter values. The parameters can be anything (of any type)
// but are copied into an int array so they can be pushed onto the stack in reverse order
// 4 bytes at a time (as in push (int)). I know there's an issue here that is irrelevant to my baseline testing, in that if any parameter is over 4 bytes it will be broken and
// reversed (which is not good) but for basic testing this isn't an issue, so overlook this.
for (int index = bufferElementCount - 1; index >= 0; index--)
{
int val = buffer[index];
__asm
{
push val
}
}
int returnValueCount = 0;
// if there is a return value, allocate some space for it and push that onto the stack after
// the parameters have been pushed on
if (method->HasReturnValue)
{
*returnSize = method->ReturnValueSize;
outVal = new char[*returnSize];
returnValueCount = (*returnSize / 4) + (*returnSize % 4 != 0 ? 1 : 0);
memset(outVal, 0, *returnSize);
for (int index = returnValueCount - 1; index >= 0; index--)
{
char* addr = ((char*)outVal) + (index * 4);
__asm
{
push addr
}
}
}
// calculate the stack pointer offset so after the call we can pop the parameters and return value
int espOffset = (bufferElementCount + returnValueCount) * 4;
// call the method
__asm
{
call methodAddr;
add esp, espOffset
};
For my basic testing I am using a method with the following signature:
Person MyMethod3( int, char, int );
The problem is this: when I omit the return value from the method signature, all of the parameter values are properly passed. But when I leave the method as is, the parameter data that is passed is incorrect but the value returned is correct. So my question, obviously, is what is wrong? I've tried pushing the return value space onto the stack before the parameters. The Person structure is as follows:
class Person
{
public:
Text Name;
int Age;
float Cash;
ICollection<Person*>* Friends;
};
Any help would be greatly appreciated. Thanks!
I'm using Visual Studio 2013 with the November 2013 CTP compiler for C++, targeting x86.
As it relates to disassembly, this is the straight method call:
int one = 876;
char two = 'X';
int three = 9738;
Person p = MyMethod3(one, two, three);
And here is the disassembly for that:
00CB0A20 mov dword ptr [one],36Ch
char two = 'X';
00CB0A27 mov byte ptr [two],58h
int three = 9738;
00CB0A2B mov dword ptr [three],260Ah
Person p = MyMethod3(one, two, three);
00CB0A32 push 10h
00CB0A34 lea ecx,[p]
00CB0A37 call Person::__autoclassinit2 (0C6AA2Ch)
00CB0A3C mov eax,dword ptr [three]
00CB0A3F push eax
00CB0A40 movzx ecx,byte ptr [two]
00CB0A44 push ecx
00CB0A45 mov edx,dword ptr [one]
00CB0A48 push edx
00CB0A49 lea eax,[p]
00CB0A4C push eax
00CB0A4D call MyMethod3 (0C6B783h)
00CB0A52 add esp,10h
00CB0A55 mov dword ptr [ebp-4],0
My interpretation of this is as follows:
Execute the assignments to the local variables. Then create the output register. Then put the parameters in a particular register (the order here happens to be eax, ecx, and edx, which makes sense (eax and ebx are for one, ecx is for two, and edx and some other register for the last parameter?)). Then call LEA (load-effective address) which I don't understand but have understood to be a MOV. Then it calls the method with an address as the parameter? And then moves the stack pointer to pop the parameters and return value.
Any further explanation is appreciated, as I'm sure my understanding here is somewhat flawed.
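One way to read the disassembly above: the last value pushed before the call is the address of p (lea eax,[p] followed by push eax), and add esp,10h removes four 4-byte slots rather than three. That is consistent with the caller passing a hidden pointer to its own return slot as an extra first argument. The sketch below models that in plain C++. PersonLike, MyMethod3_by_value and MyMethod3_lowered are hypothetical stand-ins made up for illustration; they are not the real Person class or the code the compiler emits.

```cpp
#include <cassert>

// Hypothetical stand-in for Person: four 4-byte-sized fields on 32-bit x86.
struct PersonLike {
    int   name_id;
    int   age;
    float cash;
    void* friends;
};

// What the source code says: the struct is returned by value.
PersonLike MyMethod3_by_value(int one, char two, int three) {
    PersonLike p;
    p.name_id = one;
    p.age     = three;
    p.cash    = static_cast<float>(two);
    p.friends = nullptr;
    return p;
}

// Roughly what the call site in the disassembly does: the caller owns the
// storage and passes its address as one extra argument, pushed last.
void MyMethod3_lowered(PersonLike* ret, int one, char two, int three) {
    assert(ret != nullptr);
    ret->name_id = one;
    ret->age     = three;
    ret->cash    = static_cast<float>(two);
    ret->friends = nullptr;
}
```

If that reading is right, the manual call in the question would push a single pointer to outVal, once, after the parameters, rather than one pointer per 4-byte chunk of the return value.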
While writing new code for Windows, I stumbled upon _cpuinfo() from the Windows API. As I am mainly dealing with a Linux environment (GCC) I want to have access to the CPUInfo.
I have tried the following:
#include <iostream>
int main()
{
int a, b;
for (a = 0; a < 5; a++)
{
__asm ( "mov %1, %%eax; " // a into eax
"cpuid;"
"mov %%eax, %0;" // eax into b
:"=r"(b) // output
:"r"(a) // input
:"%eax","%ebx","%ecx","%edx" // clobbered register
);
std::cout << "The CPUID level " << a << " gives EAX= " << b << '\n';
}
return 0;
}
This uses assembly, but I don't want to re-invent the wheel. Is there any other way to implement CPUInfo without assembly?
Since you are compiling with GCC then you can include cpuid.h which declares these functions:
/* Return highest supported input value for cpuid instruction. ext can
be either 0x0 or 0x80000000 to return highest supported value for
basic or extended cpuid information. Function returns 0 if cpuid
is not supported or whatever cpuid returns in eax register. If sig
pointer is non-null, then first four bytes of the signature
(as found in ebx register) are returned in location pointed by sig. */
unsigned int __get_cpuid_max (unsigned int __ext, unsigned int *__sig)
/* Return cpuid data for requested cpuid level, as found in returned
eax, ebx, ecx and edx registers. The function checks if cpuid is
supported and returns 1 for valid cpuid information or 0 for
unsupported cpuid level. All pointers are required to be non-null. */
int __get_cpuid (unsigned int __level,
unsigned int *__eax, unsigned int *__ebx,
unsigned int *__ecx, unsigned int *__edx)
You don't need to, and should not, re-implement this functionality.
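As a minimal sketch of how the two functions quoted above might be used, the following reads the vendor string from CPUID leaf 0. The helper names cpu_vendor and cpu_max_basic_leaf and the HAVE_GCC_CPUID macro are made up for this example, and the preprocessor guard is an assumption so that the code still compiles (and returns empty results) on non-x86 targets or compilers without cpuid.h.

```cpp
#include <cstring>
#include <string>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_GCC_CPUID 1
#else
#define HAVE_GCC_CPUID 0
#endif

// Returns the 12-character vendor string from CPUID leaf 0, or an empty
// string when cpuid.h is unavailable or the leaf is not supported.
std::string cpu_vendor() {
#if HAVE_GCC_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        return std::string();
    }
    char vendor[12];
    std::memcpy(vendor + 0, &ebx, 4);  // the vendor string is stored in
    std::memcpy(vendor + 4, &edx, 4);  // the order EBX, EDX, ECX
    std::memcpy(vendor + 8, &ecx, 4);
    return std::string(vendor, sizeof vendor);
#else
    return std::string();
#endif
}

// Highest supported basic CPUID leaf, or 0 when CPUID is unavailable.
unsigned int cpu_max_basic_leaf() {
#if HAVE_GCC_CPUID
    return __get_cpuid_max(0x0, nullptr);
#else
    return 0;
#endif
}
```

On an x86 machine cpu_vendor() would typically return something like "GenuineIntel" or "AuthenticAMD"; the exact value depends on the CPU (or hypervisor).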
for (a =0; a < 5; ++a;)
There should only be two semicolons there. You've got three.
This is basic C/C++ syntax; the CPUID is a red herring.
What's the reason behind applying two explicit type casts as below?
if (unlikely(val != (long)(char)val)) {
Code taken from lxml.etree.c source file from lxml's source package.
That's a cheap way to check to see if there's any junk in the high bits. The char cast chops off the upper 8, 24 or 56 bits (depending on sizeof(val)) and then promotes it back. If char is signed, it will sign extend as well.
A better test might be:
if (unlikely(val & ~0xff)) {
or
if (unlikely(val & ~0x7f)) {
depending on whether this test cares about bit 7.
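To make the difference between the cast test and the mask tests concrete, here is a small sketch with the signedness spelled out, since whether plain char is signed is implementation-defined. The function names are made up for this example. It assumes the usual wrap-around behaviour when an out-of-range value is converted to signed char (implementation-defined before C++20, but what mainstream compilers do).

```cpp
// Round-trip through signed char: true exactly for -128..127.
bool fits_signed_char(long val) {
    return val == static_cast<long>(static_cast<signed char>(val));
}

// Round-trip through unsigned char: true exactly for 0..255.
bool fits_unsigned_char(long val) {
    return val == static_cast<long>(static_cast<unsigned char>(val));
}

// The mask tests, inverted so that "true" means "no junk in the high bits".
bool no_bits_above_8(long val) { return (val & ~0xffL) == 0; }
bool no_bits_above_7(long val) { return (val & ~0x7fL) == 0; }
```

The 0xff mask agrees with the unsigned round-trip. The signed round-trip has no mask equivalent here: it accepts -128 through -1, which both masks reject, so swapping one test for the other changes behaviour for negative values.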
Just for grins and completeness, I wrote the following test code:
void RegularTest(long val)
{
if (val != ((int)(char)val)) {
printf("Regular = not equal.");
}
else {
printf("Regular = equal.");
}
}
void MaskTest(long val)
{
if (val & ~0xff) {
printf("Mask = not equal.");
}
else {
printf("Mask = equal.");
}
}
And here's what the cast code turns into in debug in Visual Studio 2010:
movsx eax, BYTE PTR _val$[ebp]
cmp DWORD PTR _val$[ebp], eax
je SHORT $LN2#RegularTes
This is the mask code:
mov eax, DWORD PTR _val$[ebp]
and eax, -256 ; ffffff00H
je SHORT $LN2#MaskTest
In release, I get this for the cast code:
movsx ecx, al
cmp eax, ecx
je SHORT $LN2#RegularTes
In release, I get this for the mask code:
test DWORD PTR _val$[ebp], -256 ; ffffff00H
je SHORT $LN2#MaskTest
So what's going on? In the cast case it does a byte move with sign extension (ha! a bug: the two tests are not equivalent, because char is signed here) and then a compare. To be totally sneaky, the compiler/linker has also made this function take its argument in a register. In the mask code in release, it has folded everything up into a single test instruction.
Which is faster? Beats me - and frankly unless you're running this kind of test on a VERY slow CPU or are running it several billion times, it won't matter. Not in the least.
So the answer in this case, is to write code that is clear about its intent. I would expect a C/C++ jockey to look at the mask code and understand its intent, but if you don't like that, you should opt for something like this instead:
#define BitsAbove8AreSet(x) ((x) & ~0xff)
#define BitsAbove7AreSet(x) ((x) & ~0x7f)
or:
inline bool BitsAbove8AreSet(long t) { return (t & ~0xff) != 0; } // make it a bool to be nice
inline bool BitsAbove7AreSet(long t) { return (t & ~0x7f) != 0; }
And use the predicates instead of the actual code.
In general, I think "is it cheap?" is not a particularly good question to ask about this unless you're working in some very specific problem domains. For example, I work in image processing and when I have some kind of operation going from one image to another, I often have code that looks like this:
BYTE *srcPixel = PixelOffset(src, x, y, srcrowstride, srcdepth);
int srcAdvance = PixelAdvance(srcrowstride, right, srcdepth);
BYTE *dstPixel = PixelOffset(dst, x, y, dstrowstride, dstdepth);
int dstAdvance = PixelAdvance(dstrowstride, right, dstdepth);
for (y = top; y < bottom; y++) {
for (x=left; x < right; x++) {
ProcessOnePixel(srcPixel, srcdepth, dstPixel, dstdepth);
srcPixel += srcdepth;
dstPixel += dstdepth;
}
srcPixel += srcAdvance;
dstPixel += dstAdvance;
}
And in this case, assume that ProcessOnePixel() is actually a chunk of inline code that will be executed billions and billions of times. In this case, I care a whole lot about not doing function calls, not doing redundant work, not rechecking values, ensuring that the computational flow will translate into something that will use registers wisely, etc. But my actual primary concern is that the code can be read by the next poor schmuck (probably me) who has to look at it.
And in our current coding world, it is FAR FAR CHEAPER for nearly every problem domain to spend a little time up front ensuring that your code is easy to read and maintain than it is to worry about performance out of the gate.
Speculations:
cast to char: to mask the 8 low bits,
cast to long: to bring the value back to signed (if char is unsigned).
If val is a long then the (char) will strip off all but the bottom 8 bits. The (long) casts it back for the comparison.
I've run into an issue porting a codebase from linux (gcc) to windows (msvc). It seems like the C99 function vsscanf isn't available and has no obvious replacement.
I've read about a solution using the internal function _input_l and linking statically to the crt runtime, but unfortunately I cannot link statically since it would mess with all the plugins (as dlls) being loaded by the application.
So is there any replacement or a way to write a wrapper for vsscanf?
Update 2016-02-24:
When this was first asked there was no native replacement but since then MSVC has implemented support for this and much more.
VS2013 and later implements vsscanf and friends.
C++11 includes support as well.
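With a toolchain that provides vsscanf (C99/C++11, and VS2013 or later as noted above), a wrapper is just the usual va_list forwarding. A minimal sketch; the name scan_fields is made up for this example:

```cpp
#include <cstdarg>
#include <cstdio>

// Hypothetical wrapper: forwards its variadic arguments to vsscanf and
// returns the number of fields converted, exactly as sscanf would.
int scan_fields(const char* text, const char* format, ...) {
    va_list args;
    va_start(args, format);
    const int converted = std::vsscanf(text, format, args);
    va_end(args);
    return converted;
}
```

It is called like sscanf itself, for example scan_fields("-52 Hello 456 #", "%d Hello %u %c", &a, &b, &c) with an int, an unsigned and a char.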
A hack that should work:
int vsscanf(const char *s, const char *fmt, va_list ap)
{
void *a[20];
int i;
for (i=0; i<sizeof(a)/sizeof(a[0]); i++) a[i] = va_arg(ap, void *);
return sscanf(s, fmt, a[0], a[1], a[2], a[3], a[4], a[5], a[6], /* etc... */);
}
Replace 20 with the max number of args you think you might need. This code isn't terribly portable but it's only intended to be used on one particular broken system missing vsscanf so that shouldn't matter so much.
A quick search turned up several suggestions, including http://www.flipcode.net/archives/vsscanf_for_Win32.shtml
As this is tagged C++ have you considered just biting the bullet and moving away from the scanf line of functions completely? The C++ idiomatic way would be to use a std::istringstream. Rewriting to make use of that instead of looking for a vsscanf replacement would possibly be easier and more portable, not to mention having much greater type safety.
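A rough sketch of what the std::istringstream route could look like, using the sample input that appears elsewhere in this thread ("-52 Hello 456 #"). The function name parse_record and the fixed "Hello" keyword are assumptions for the example, not part of any existing codebase.

```cpp
#include <sstream>
#include <string>

// Parses text shaped like "<int> Hello <unsigned> <char>".
// Returns false if any field is missing or malformed.
bool parse_record(const std::string& text, int& first, unsigned& second, char& tag) {
    std::istringstream in(text);
    std::string word;
    if (!(in >> first >> word) || word != "Hello") {
        return false;
    }
    return static_cast<bool>(in >> second >> tag);
}
```

The types of the output variables drive the parsing, so there is no format string that can disagree with the arguments.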
Funny it never came up for me before today. I could've sworn I'd used the function in the past. But anyway, here's a solution that works and is as safe as your arguments and format string:
template < size_t _NumArgs >
int VSSCANF_S(LPCTSTR strSrc, LPCTSTR ptcFmt, INT_PTR (&arr)[_NumArgs]) {
class vaArgs
{
vaArgs() {}
INT_PTR* m_args[_NumArgs];
public:
vaArgs(INT_PTR (&arr)[_NumArgs])
{
for(size_t nIndex=0;nIndex<_NumArgs;++nIndex)
m_args[nIndex] = &arr[nIndex];
}
};
return sscanf_s(strSrc, ptcFmt, vaArgs(arr));
}
///////////////////////////////////////////////////////////////////////////////
int _tmain(int, LPCTSTR argv[])
{
INT_PTR args[3];
int nScanned = VSSCANF_S(_T("-52 Hello 456 #"), _T("%d Hello %u %c"), args);
return printf(_T("Arg1 = %d, arg2 = %u, arg3 = %c\n"), args[0], args[1], args[2]);
}
Out:
Arg1 = -52, arg2 = 456, arg3 = #
Press any key to continue . . .
Well I can't get the formatting right but you get the idea.
If you want to wrap sscanf and you are using C++11, you can do this:
template<typename... Args>
int mysscanf(const char* str, const char* fmt, Args... args) {
//...
return sscanf(str, fmt, args...);
}
To make this work on MSVC, you need to download this update:
http://www.microsoft.com/en-us/download/details.aspx?id=35515
Modified from:
http://www.gamedev.net/topic/310888-no-vfscanf-in-visual-studio/
#if defined(_WIN32) && (_MSC_VER <= 1500)
static int vsscanf(
const char *buffer,
const char *format,
va_list argPtr
)
{
// Get an upper bound for the # of args
size_t count = 0;
const char* p = format;
while(1)
{
char c = *(p++);
if (c == 0)
break;
if (c == '%' && (p[0] != '*' && p[0] != '%'))
++count;
}
if (count <= 0)
return 0;
int result;
// copy stack pointer
_asm
{
mov esi, esp;
}
// push variable parameters pointers on stack
for (int i = count - 1; i >= 0; --i)
{
_asm
{
mov eax, dword ptr[i];
mov ecx, dword ptr [argPtr];
mov edx, dword ptr [ecx+eax*4];
push edx;
}
}
int stackAdvance = (2 + count) * 4;
_asm
{
// now push on the fixed params
mov eax, dword ptr [format];
push eax;
mov eax, dword ptr [buffer];
push eax;
// call sscanf, and move the result into result
call dword ptr [sscanf];
mov result, eax;
// restore stack pointer
mov eax, dword ptr[stackAdvance];
add esp, eax;
}
return result;
}
#endif // _WIN32 / _MSC_VER <= 1500
Tested only on Visual Studio 2008.
Below is my current char* to hex string function. I wrote it as an exercise in bit manipulation. It takes ~7ms on an AMD Athlon MP 2800+ to hexify a 10 million byte array. Is there any trick or other way that I am missing?
How can I make this faster?
Compiled with -O3 in g++
static const char _hex2asciiU_value[256][2] =
{ {'0','0'}, {'0','1'}, /* snip..., */ {'F','E'},{'F','F'} };
std::string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
std::string str;
str.resize(_len*2);
char* pszHex = &str[0];
const unsigned char* pEnd = _pArray + _len;
clock_t stick, etick;
stick = clock();
for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
pszHex[0] = _hex2asciiU_value[*pChar][0];
pszHex[1] = _hex2asciiU_value[*pChar][1];
}
etick = clock();
std::cout << "ticks to hexify " << etick - stick << std::endl;
return str;
}
Updates
Added timing code
Brian R. Bondy: replace the std::string with a heap alloc'd buffer and change ofs*16 to ofs << 4 - however the heap allocated buffer seems to slow it down? - result ~11ms
Antti Sykäri: replace inner loop with
int upper = *pChar >> 4;
int lower = *pChar & 0x0f;
pszHex[0] = pHex[upper];
pszHex[1] = pHex[lower];
result ~8ms
Robert: replace _hex2asciiU_value with a full 256-entry table, sacrificing memory space but result ~7ms!
HoyHoy: Noted it was producing incorrect results
This assembly function (based off my previous post here, but I had to modify the concept a bit to get it to actually work) processes 3.3 billion input characters per second (6.6 billion output characters) on one core of a 3 GHz Core 2 Conroe. Penryn is probably faster.
%include "x86inc.asm"
SECTION_RODATA
pb_f0: times 16 db 0xf0
pb_0f: times 16 db 0x0f
pb_hex: db 48,49,50,51,52,53,54,55,56,57,65,66,67,68,69,70
SECTION .text
; int convert_string_to_hex( char *input, char *output, int len )
cglobal _convert_string_to_hex,3,3
movdqa xmm6, [pb_f0 GLOBAL]
movdqa xmm7, [pb_0f GLOBAL]
.loop:
movdqa xmm5, [pb_hex GLOBAL]
movdqa xmm4, [pb_hex GLOBAL]
movq xmm0, [r0+r2-8]
movq xmm2, [r0+r2-16]
movq xmm1, xmm0
movq xmm3, xmm2
pand xmm0, xmm6 ;high bits
pand xmm2, xmm6
psrlq xmm0, 4
psrlq xmm2, 4
pand xmm1, xmm7 ;low bits
pand xmm3, xmm7
punpcklbw xmm0, xmm1
punpcklbw xmm2, xmm3
pshufb xmm4, xmm0
pshufb xmm5, xmm2
movdqa [r1+r2*2-16], xmm4
movdqa [r1+r2*2-32], xmm5
sub r2, 16
jg .loop
REP_RET
Note it uses x264 assembly syntax, which makes it more portable (to 32-bit vs 64-bit, etc). To convert this into the syntax of your choice is trivial: r0, r1, r2 are the three arguments to the function in registers. It's a bit like pseudocode. Or you can just get common/x86/x86inc.asm from the x264 tree and include that to run it natively.
P.S. Stack Overflow, am I wrong for wasting time on such a trivial thing? Or is this awesome?
At the cost of more memory you can create a full 256-entry table of the hex codes:
static const char _hex2asciiU_value[256][2] =
{ {'0','0'}, {'0','1'}, /* ..., */ {'F','E'},{'F','F'} };
Then direct index into the table, no bit fiddling required.
const char *pHexVal = pHex[*pChar];
pszHex[0] = pHexVal[0];
pszHex[1] = pHexVal[1];
Faster C Implementation
This runs nearly 3x faster than the C++ implementation. Not sure why as it's pretty similar. For the last C++ implementation that I posted it took 6.8 seconds to run through a 200,000,000 character array. The C implementation took only 2.2 seconds.
#include <stdio.h>
#include <stdlib.h>
char* char_to_hex(const unsigned char* p_array,
unsigned int p_array_len,
char** hex2ascii)
{
unsigned char* str = malloc(p_array_len*2+1);
const unsigned char* p_end = p_array + p_array_len;
size_t pos=0;
const unsigned char* p;
for( p = p_array; p != p_end; p++, pos+=2 ) {
str[pos] = hex2ascii[*p][0];
str[pos+1] = hex2ascii[*p][1];
}
str[pos] = '\0'; /* terminate the string; the buffer was sized with +1 for this */
return (char*)str;
}
int main()
{
size_t hex2ascii_len = 256;
char** hex2ascii;
int i;
hex2ascii = malloc(hex2ascii_len*sizeof(char*));
for(i=0; i<hex2ascii_len; i++) {
hex2ascii[i] = malloc(3*sizeof(char));
snprintf(hex2ascii[i], 3,"%02X", i);
}
size_t len = 8;
const unsigned char a[] = "DO NOT WANT";
printf("%s\n", char_to_hex((const unsigned char*)a, len, (char**)hex2ascii));
}
Operate on 32 bits at a time (4 chars), then deal with the tail if needed. When I did this exercise with url encoding a full table lookup for each char was slightly faster than logic constructs, so you may want to test this in context as well to take caching issues into account.
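A minimal sketch of the "four at a time, then the tail" shape described above, assuming a 16-entry digit table rather than the full 256-entry one. The function name to_hex_by4 is made up for this example, and whether it is actually faster depends on the compiler and the cache behaviour, so it would need measuring in context.

```cpp
#include <cstddef>
#include <string>

std::string to_hex_by4(const unsigned char* data, std::size_t len) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out(len * 2, '\0');
    char* dst = &out[0];
    std::size_t i = 0;

    // Main loop: four input bytes (eight output characters) per iteration.
    for (; i + 4 <= len; i += 4, dst += 8) {
        dst[0] = digits[data[i]     >> 4];  dst[1] = digits[data[i]     & 0x0f];
        dst[2] = digits[data[i + 1] >> 4];  dst[3] = digits[data[i + 1] & 0x0f];
        dst[4] = digits[data[i + 2] >> 4];  dst[5] = digits[data[i + 2] & 0x0f];
        dst[6] = digits[data[i + 3] >> 4];  dst[7] = digits[data[i + 3] & 0x0f];
    }

    // Tail: the zero to three bytes left over.
    for (; i < len; ++i, dst += 2) {
        dst[0] = digits[data[i] >> 4];
        dst[1] = digits[data[i] & 0x0f];
    }
    return out;
}
```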
It works for me with unsigned char:
unsigned char c1 = byteVal >> 4;
unsigned char c2 = byteVal & 0x0f;
c1 += c1 <= 9 ? '0' : ('a' - 10);
c2 += c2 <= 9 ? '0' : ('a' - 10);
std::string sHex(" ");
sHex[0] = c1 ;
sHex[1] = c2 ;
//sHex - contain what we need. For example "0f"
For one, instead of multiplying by 16 do a bitshift << 4
Also don't use the std::string, instead just create a buffer on the heap and then delete it. It will be more efficient than the object destruction that is needed from the string.
Not going to make a lot of difference... *pChar-(ofs*16) can be done with [*pChar & 0x0F]
This is my version, which, unlike the OP's version, doesn't assume that std::basic_string has its data in a contiguous region:
#include <string>
using std::string;
static char const* digits("0123456789ABCDEF");
string
tohex(string const& data)
{
string result(data.size() * 2, 0);
string::iterator ptr(result.begin());
for (string::const_iterator cur(data.begin()), end(data.end()); cur != end; ++cur) {
unsigned char c(*cur);
*ptr++ = digits[c >> 4];
*ptr++ = digits[c & 15];
}
return result;
}
I assume this is Windows+IA32.
Try to use short int instead of the two hexadecimal letters.
short int hex_table[256] = {'0'*256+'0', '1'*256+'0', '2'*256+'0', ..., 'E'*256+'F', 'F'*256+'F'};
unsigned short int* pszHex = (unsigned short int*) &str[0];
stick = clock();
for (const unsigned char* pChar = _pArray; pChar != pEnd; pChar++)
*pszHex++ = hex_table[*pChar];
etick = clock();
Changing
ofs = *pChar >> 4;
pszHex[0] = pHex[ofs];
pszHex[1] = pHex[*pChar-(ofs*16)];
to
int upper = *pChar >> 4;
int lower = *pChar & 0x0f;
pszHex[0] = pHex[upper];
pszHex[1] = pHex[lower];
results in roughly 5% speedup.
Writing the result two bytes at a time as suggested by Robert results in about 18% speedup. The code changes to:
_result.resize(_len*2);
short* pszHex = (short*) &_result[0];
const unsigned char* pEnd = _pArray + _len;
const char* pHex = _hex2asciiU_value;
for(const unsigned char* pChar = _pArray;
pChar != pEnd;
pChar++, ++pszHex )
{
*pszHex = short_table[*pChar];
}
Required initialization:
short short_table[256];
for (int i = 0; i < 256; ++i)
{
char* pc = (char*) &short_table[i];
pc[0] = _hex2asciiU_value[i >> 4];
pc[1] = _hex2asciiU_value[i & 0x0f];
}
Doing it 2 bytes at a time or 4 bytes at a time will probably result in even greater speedups, as pointed out by Allan Wind, but then it gets trickier when you have to deal with the odd characters.
If you're feeling adventurous, you might try to adapt Duff's device to do this.
Results are on an Intel Core 2 Duo processor and gcc -O3.
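The two-bytes-at-a-time store above goes through a short* cast into the string's buffer, which leans on alignment and aliasing behaviour that the language does not guarantee. A hedged alternative is to keep the 256-entry table of character pairs but copy each pair with std::memcpy, which optimizing compilers typically turn into the same single 16-bit store. HexPairTable and to_hex_pairs are names made up for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical table type: one 16-bit entry per byte value, holding the two
// hex characters in memory order (so it works on either byte order).
struct HexPairTable {
    std::uint16_t pair[256];
    HexPairTable() {
        static const char digits[] = "0123456789ABCDEF";
        for (int i = 0; i < 256; ++i) {
            const char two[2] = { digits[i >> 4], digits[i & 0x0f] };
            std::memcpy(&pair[i], two, 2);
        }
    }
};

std::string to_hex_pairs(const unsigned char* data, std::size_t len) {
    static const HexPairTable table;  // built once, on first use
    std::string out(len * 2, '\0');
    char* dst = &out[0];
    for (std::size_t i = 0; i < len; ++i, dst += 2) {
        std::memcpy(dst, &table.pair[data[i]], 2);  // one 2-byte copy per input byte
    }
    return out;
}
```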
Always measure that you actually get faster results — a pessimization pretending to be an optimization is less than worthless.
Always test that you get the correct results — a bug pretending to be an optimization is downright dangerous.
And always keep in mind the tradeoff between speed and readability — life is too short for anyone to maintain unreadable code.
(Obligatory reference to coding for the violent psychopath who knows where you live.)
Make sure your compiler optimization is turned on to the highest working level.
You know, flags like '-O1' to '-O3' in gcc.
I have found that using an index into an array, rather than a pointer, can speed things up a tick. It all depends on how your compiler chooses to optimize. The key is that the processor has instructions to do complex things like [i*2+1] in a single instruction.
If you're rather obsessive about speed here, you can do the following:
Each character is one byte, representing two hex values. Thus, each character is really two four-bit values.
So, you can do the following:
Unpack the four-bit values to 8-bit values using a multiplication or similar instruction.
Use pshufb, the SSSE3 instruction (Core2-only though). It takes an array of 16 8-bit input values and shuffles them based on the 16 8-bit indices in a second vector. Since you have only 16 possible characters, this fits perfectly; the input array is a vector of 0 through F characters, and the index array is your unpacked array of 4-bit values.
Thus, in a single instruction, you will have performed 16 table lookups in fewer clocks than it normally takes to do just one (pshufb is 1 clock latency on Penryn).
So, in computational steps:
A B C D E F G H I J K L M N O P (64-bit vector of input values, "Vector A") -> 0A 0B 0C 0D 0E 0F 0G 0H 0I 0J 0K 0L 0M 0N 0O 0P (128-bit vector of indices, "Vector B"). The easiest way is probably two 64-bit multiplies.
pshufb [0123456789ABCDEF], Vector B
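The two steps above can be modelled in plain scalar C++ to check the data flow before writing any SIMD code. This is only a model of what the shuffle instruction computes, not a vectorized implementation, and it will not be fast; the names pshufb_model and hex8_model are made up for this sketch.

```cpp
#include <cstdint>
#include <string>

// Scalar model of one 128-bit byte shuffle: each output byte is
// table[index & 15], or 0 when the index byte has its top bit set.
void pshufb_model(const std::uint8_t table[16],
                  const std::uint8_t index[16],
                  std::uint8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        out[i] = (index[i] & 0x80) ? 0 : table[index[i] & 0x0f];
    }
}

// Hexifies exactly 8 input bytes into 16 characters using the two steps.
std::string hex8_model(const std::uint8_t in[8]) {
    static const std::uint8_t digits[16] = {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'
    };

    // Step 1: unpack each byte into two 4-bit indices, high nibble first.
    std::uint8_t index[16];
    for (int i = 0; i < 8; ++i) {
        index[2 * i]     = static_cast<std::uint8_t>(in[i] >> 4);
        index[2 * i + 1] = static_cast<std::uint8_t>(in[i] & 0x0f);
    }

    // Step 2: sixteen table lookups, which the real instruction does at once.
    std::uint8_t out[16];
    pshufb_model(digits, index, out);
    return std::string(reinterpret_cast<const char*>(out), 16);
}
```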
I'm not sure doing it more bytes at a time will be better... you'll probably just get tons of cache misses and slow it down significantly.
What you might try is to unroll the loop though, take larger steps and do more characters each time through the loop, to remove some of the loop overhead.
Consistently getting ~4ms on my Athlon 64 4200+ (~7ms with original code)
for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++) {
const char* pchars = _hex2asciiU_value[*pChar];
*pszHex++ = *pchars++;
*pszHex++ = *pchars;
}
The function as it is shown when I'm writing this produces incorrect output even when _hex2asciiU_value is fully specified. The following code works, and on my 2.33GHz MacBook Pro runs in about 1.9 seconds for 200,000,000 characters.
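#include <cstdio>   // snprintf
#include <cstdlib>  // rand, srand
#include <ctime>    // clock, time
#include <iostream>
#include <string>
using namespace std;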
static const size_t _h2alen = 256;
static char _hex2asciiU_value[_h2alen][3];
string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
string str;
str.resize(_len*2);
char* pszHex = &str[0];
const unsigned char* pEnd = _pArray + _len;
const char* pHex = _hex2asciiU_value[0];
for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
pszHex[0] = _hex2asciiU_value[*pChar][0];
pszHex[1] = _hex2asciiU_value[*pChar][1];
}
return str;
}
int main() {
for(int i=0; i<_h2alen; i++) {
snprintf(_hex2asciiU_value[i], 3,"%02X", i);
}
size_t len = 200000000;
char* a = new char[len];
string t1;
string t2;
clock_t start;
srand(time(NULL));
for(int i=0; i<len; i++) a[i] = rand()&0xFF;
start = clock();
t1=char_to_hex((const unsigned char*)a, len);
cout << "char_to_hex conversion took ---> " << (clock() - start)/(double)CLOCKS_PER_SEC << " seconds\n";
}