C++ ASM IDA address patterns

I am trying to write a pattern finder for a test application I made that uses DirectX 9, but I am having some issues with finding the correct start pattern.
The pointer I am trying to find is the IDirect3DDevice9 device pointer.
uintptr_t dwStartAddress = 0x14008613F, dwLen = (0x14008615D - 0x14008613F);
bool bDataCompare(const unsigned char* pData, const unsigned char* bMask, const char* szMask)
{
    for (; *szMask; ++szMask, ++pData, ++bMask)
        if (*szMask == 'x' && *pData != *bMask)
            return false;
    return (*szMask) == 0;
}
uintptr_t dwFindPattern(unsigned char* bMask, char* szMask, uintptr_t dw_Address = dwStartAddress, uintptr_t dw_Len = dwLen)
{
    for (uintptr_t i = 0; i < dw_Len; i++)
        if (bDataCompare((unsigned char*)(dw_Address + i), bMask, szMask))
            return dw_Address + i;
    return 0;
}
I've dug through the d3d9.h file and taken a look at the sample projects for the Microsoft DirectX SDK (August 2009) found here: https://www.microsoft.com/en-us/download/details.aspx?id=23549
According to https://msdn.microsoft.com/en-us/library/windows/desktop/bb219676(v=vs.85).aspx, Direct3DCreate9Ex checks whether Direct3D 9Ex features are supported.
According to https://msdn.microsoft.com/en-us/library/windows/desktop/bb174313(v=vs.85).aspx,
IDirect3D9::CreateDevice returns the device pointer.
Below I am making the assumption that I can start my address scan at 0x14008613F and end at 0x14008615D.
Also I am making the assumption that
jnz short loc_14008615E
mov [rbx+0E8h], eax
is the location of the device pointer I'm looking for, with
[rbx+0E8h]
being the pointer returned from IDirect3D9::CreateDevice.
I hope I've given enough information on this, as this is my first attempt at doing any real memory manipulation.
Edit:
In case it's important, here is my signature and mask:
uintptr_t m_dwaddy = dwFindPattern((PBYTE)"\x00\x48\x8B\xF8", (char *)"?xxx");
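For completeness, here is a rough sketch of what I plan to do once a match is found; I am assuming the match lands on the mov [rbx+0E8h], eax instruction (encoded as 89 83 E8 00 00 00), so the 32-bit displacement starts 2 bytes into the instruction, and rbx itself still has to be resolved separately:
uintptr_t instr = dwFindPattern((PBYTE)"\x89\x83\x00\x00\x00\x00", (char *)"xx????");
if (instr)
{
    unsigned int deviceOffset = *(unsigned int*)(instr + 2); // 0xE8 in the disassembly above
    // deviceOffset is only the displacement off rbx; rbx must come from
    // somewhere else (e.g. a known object pointer), so this does not yet
    // give the device pointer itself.
}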

Related

OpenCL result changes with arbitrary code alterations that are not related

This is a very strange issue. I'm working on a GPU-based crypto miner and I have an issue with a SHA hash function.
1 - The initial function calls a SHA256 routine and then prints the results. I'm comparing those results to a CPU-based SHA256 to make sure I get the same thing.
2 - Later on in the function, there are other operations that occur, such as addition, XOR, and additional SHA rounds.
As part of the miner kernel, I wrote an auxiliary function to decompose an array of 8 uints into an array of 32 unsigned chars, using an AND mask and bit shifts.
I'm calling the kernel with a global/local work size of 1.
So, here's where things get really strange. The part I am comparing is the very first SHA. I get a buffer of 80 bytes in, SHA it and then print the result. It matches under certain conditions. However, if I make changes to the code that executes AFTER that SHA, then it doesn't match. This is what I've been able to narrow down:
1 - If I put a printf debug in the decomposition auxiliary function, the results match. Just removing that printf causes it to mismatch.
2 - There are 4 operations I use to decompose the uint into char. I tried lots of different ways to do this with the same result. However, if I remove any 1 of the 4 "for" loops in the routine, it matches. Simply removing a for loop in code that gets executed -after- the initial code, changes the result of the initial SHA.
3 - If I change my while loop to never execute then it matches. Again, this is all -after- the initial SHA comparison.
4 - If I remove all the calls to the auxiliary function, then it matches. Simply calling the function after the initial SHA causes a mismatch.
I've tried adding memory guards everywhere, but given that it's 1 global and 1 local work item, I don't see how that could apply.
I'd love to debug this, but apparently OpenCL cannot be debugged in VS 2019 (really?).
Any thoughts, guesses, insight would be appreciated.
Thanks!
inline void loadUintHash ( __global unsigned char* dest, __global uint* src) {
    //**********if I remove this it doesn't work
    printf ("src1 %08x%08x%08x%08x%08x%08x%08x%08x",
        src[0],
        src[1],
        src[2],
        src[3],
        src[4],
        src[5],
        src[6],
        src[7]
        );
    //**********if I take away any one of these 4 for loops, then it works
    for ( int i = 0; i < 8; i++)
        dest[i*4+3] = (src[i] & 0xFF000000) >> 24;
    for ( int i = 0; i < 8; i++)
        dest[i*4+2] = (src[i] & 0x00FF0000) >> 16;
    for ( int i = 0; i < 8; i++)
        dest[i*4+1] = (src[i] & 0x0000FF00) >> 8;
    for ( int i = 0; i < 8; i++)
        dest[i*4] = (src[i] & 0x000000FF);
    //**********if I remove this it doesn't work
    printf ("src2 %08x%08x%08x%08x%08x%08x%08x%08x",
        src[0],
        src[1],
        src[2],
        src[3],
        src[4],
        src[5],
        src[6],
        src[7]
        );
}
#define HASHOP_ADD 0
#define HASHOP_XOR 1
#define HASHOP_SHA_SINGLE 2
#define HASHOP_SHA_LOOP 3
#define HASHOP_MEMGEN 4
#define HASHOP_MEMADD 5
#define HASHOP_MEMXOR 6
#define HASHOP_MEM_SELECT 7
#define HASHOP_END 8
__kernel void dyn_hash (__global uint* byteCode, __global uint* memGenBuffer, int memGenSize, __global uint* hashResult, __global char* foundFlag, __global unsigned char* header, __global unsigned char* shaScratch) {
    int computeUnitID = get_global_id(0);
    __global uint* myMemGen = &memGenBuffer[computeUnitID * memGenSize * 8]; //each memGen unit is 256 bits, or 8 uints
    __global uint* myHashResult = &hashResult[computeUnitID * 8];
    __global char* myFoundFlag = foundFlag + computeUnitID;
    __global unsigned char* myHeader = header + (computeUnitID * 80);
    __global unsigned char* myScratch = shaScratch + (computeUnitID * 32);
    sha256 ( computeUnitID, 80, myHeader, myHashResult );
    //**********this is the result I am comparing
    if (computeUnitID == 0) {
        printf ("gpu first sha uint %08x%08x%08x%08x%08x%08x%08x%08x",
            myHashResult[0],
            myHashResult[1],
            myHashResult[2],
            myHashResult[3],
            myHashResult[4],
            myHashResult[5],
            myHashResult[6],
            myHashResult[7]
            );
    }
    uint linePtr = 0;
    uint done = 0;
    uint currentMemSize = 0;
    uint instruction = 0;
    //**********if I change this to done == 1, then it works
    while (done == 0) {
        if (byteCode[linePtr] == HASHOP_ADD) {
            linePtr++;
            uint arg1[8];
            for ( int i = 0; i < 8; i++)
                arg1[i] = byteCode[linePtr+i];
            linePtr += 8;
        }
        else if (byteCode[linePtr] == HASHOP_XOR) {
            linePtr++;
            uint arg1[8];
            for ( int i = 0; i < 8; i++)
                arg1[i] = byteCode[linePtr+i];
            linePtr += 8;
        }
        else if (byteCode[linePtr] == HASHOP_SHA_SINGLE) {
            linePtr++;
        }
        else if (byteCode[linePtr] == HASHOP_SHA_LOOP) {
            printf ("HASHOP_SHA_LOOP");
            linePtr++;
            uint loopCount = byteCode[linePtr];
            for ( int i = 0; i < loopCount; i++) {
                loadUintHash(myScratch, myHashResult);
                sha256 ( computeUnitID, 32, myScratch, myHashResult );
                if (computeUnitID == 1) {
                    loadUintHash(myScratch, myHashResult);
                    ... more irrelevant code...
This is how the kernel is being called:
size_t globalWorkSize = 1;// computeUnits;
size_t localWorkSize = 1;
returnVal = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);
The issue ended up being multiple things. 1 - The CPU SHA had a bug in it that was causing an incorrect result in some cases. 2 - There was a very strange syntax error which seems to have broken the compiler in a weird way:
void otherproc () {
...do stuff...
}
if (something) {/
...other code
}
That forward slash after the opening curly brace was messing up "otherproc" in a weird way, and the compiler did not throw an error. After staring at the code line by line I found that slash, removed it, and everything started working.
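In case it helps anyone chasing something similar, here is a minimal host-side sketch (mine, not from the original code) of dumping the program build log via clGetProgramBuildInfo, assuming a cl_program named program and a cl_device_id named device. In this particular case the compiler accepted the stray slash silently, but the build log is still the first place worth checking for front-end oddities:
size_t logSize = 0;
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
char* log = (char*)malloc(logSize + 1);
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
log[logSize] = '\0';
printf("build log:\n%s\n", log);
free(log);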
If anyone is interested, the working implementation of a GPU miner can be found here:
https://github.com/dynamofoundation/dyn_miner

C++ PBKDF2 Issue

I have the following function:
void PBKDF2_HMAC_SHA_512_string(const char* pass, const char* salt, int32_t iterations, uint32_t HashLength, char* out) {
unsigned int i;
HashLength = HashLength / 2;
unsigned char* digest = new unsigned char[HashLength];
PKCS5_PBKDF2_HMAC(pass, strlen(pass), (const unsigned char*)salt, strlen(salt), iterations, EVP_sha512(), HashLength, digest);
for (i = 0; i < sizeof(digest); i++) {
sprintf(out + (i * 2), "%02x", 255 & digest[i]);
}
}
When I call the function like below, I expect to get back a hash of length 2400, however what I get back has a length of 16:
char PBKDF2Hash[1025]; //\0 terminating space?
memset(PBKDF2Hash, 0, sizeof(PBKDF2Hash));
PBKDF2_HMAC_SHA_512_string("Password", "0123456789123456", 3500, 1024, PBKDF2Hash);
//PBKDF2Hash is now always 16 long -> strlen(PBKDF2Hash),
//while I expect it to be 2400 long?
//How is this possible and why is this happening?
//I can't figure it out
Since digest is a pointer, sizeof(digest) will not give the length of the array. Depending on the platform, sizeof(digest) may give you 4 or 8, which is not what you want. You should use for (i = 0; i < HashLength; i++) instead.
Another, unrelated issue with your code is that digest is never deleted in PBKDF2_HMAC_SHA_512_string, which causes a memory leak.
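A minimal corrected sketch along those lines (the loop bound uses HashLength and the digest buffer frees itself; the OpenSSL call is the same one used above, and the caller's out buffer still needs room for HashLength hex characters plus the terminator):
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>
#include <openssl/evp.h>

void PBKDF2_HMAC_SHA_512_string(const char* pass, const char* salt,
                                int32_t iterations, uint32_t HashLength, char* out)
{
    HashLength = HashLength / 2;                  // bytes of raw digest
    std::vector<unsigned char> digest(HashLength);
    PKCS5_PBKDF2_HMAC(pass, strlen(pass),
                      (const unsigned char*)salt, strlen(salt),
                      iterations, EVP_sha512(),
                      HashLength, digest.data());
    for (uint32_t i = 0; i < HashLength; i++)     // HashLength, not sizeof(digest)
        sprintf(out + (i * 2), "%02x", digest[i]);
}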

char to int conversion in host device with CUDA

I have been having trouble converting from a single character to an integer while in the host function of my CUDA program. After the line -
token[j] = token[j] * 10 + (buf[i] - '0' );
I use cuda-gdb to check the value of token[j], and I always get different numbers that do not seem to have a pattern. I have also tried simple casting, not multiplying by ten (which I saw in another thread), and not subtracting '0', and I always seem to get a different result. Any help would be appreciated. This is my first time posting on Stack Overflow, so give me a break if my formatting is awful.
-A fellow struggling coder
__global__ void rread(unsigned int *table, char *buf, int *threadbytes, unsigned int *token) {
    int i = 0;
    int j = 0;
    *token = NULL;
    int tid = threadIdx.x;
    unsigned int key;
    char delim = ' ';
    for (i = tid * *threadbytes; i < (tid * *threadbytes) + *threadbytes; i++)
    {
        if (buf[i] != delim) { //check if its not a delim
            token[j] = token[j] * 10 + (buf[i] - '0' );
There's a race condition on writing to token.
If you want to have a local array per block you can use shared memory. If you want a local array per thread, you will need to use local per-thread memory and declare the array on the stack. In the first case you will have to deal with concurrency inside the block as well. In the latter you don't have to, although you might potentially waste a lot more memory (and reduce collaboration).
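To make the two options concrete, a minimal sketch (the array name and size are mine, not from the question's kernel):
// Option 1: one array per block, in shared memory. Threads in the block
// share it, so access has to be coordinated (e.g. with __syncthreads()).
__global__ void per_block_tokens()
{
    __shared__ unsigned int token[64];
    if (threadIdx.x < 64) token[threadIdx.x] = 0;
    __syncthreads();
    // ... block-wide work on token ...
}

// Option 2: one array per thread, on the thread's own stack (local memory).
// No races, but the 64 uints are now stored once per thread.
__global__ void per_thread_tokens()
{
    unsigned int token[64] = {0};
    // ... per-thread work on token ...
}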

Mapping large files using MapViewOfFile

I have a very large file and I need to read it in small pieces and then process each piece. I'm using the MapViewOfFile function to map a piece in memory, but after reading the first part I can't read the second; it throws when I try to map it.
char *tmp_buffer = new char[bufferSize];
LPCWSTR input = L"input";
OFSTRUCT tOfStr;
tOfStr.cBytes = sizeof tOfStr;
HANDLE inputFile = (HANDLE)OpenFile(inputFileName, &tOfStr, OF_READ);
HANDLE fileMap = CreateFileMapping(inputFile, NULL, PAGE_READONLY, 0, 0, input);
while (offset < fileSize)
{
    long k = 0;
    bool cutted = false;
    offset -= tempBufferSize;
    if (fileSize - offset <= bufferSize)
    {
        bufferSize = fileSize - offset;
    }
    char *buffer = new char[bufferSize + tempBufferSize];
    for (int i = 0; i < tempBufferSize; i++)
    {
        buffer[i] = tempBuffer[i];
    }
    char *tmp_buffer = new char[bufferSize];
    LPCWSTR input = L"input";
    HANDLE inputFile;
    OFSTRUCT tOfStr;
    tOfStr.cBytes = sizeof tOfStr;
    long long offsetHigh = ((offset >> 32) & 0xFFFFFFFF);
    long long offsetLow = (offset & 0xFFFFFFFF);
    tmp_buffer = (char *)MapViewOfFile(fileMap, FILE_MAP_READ, (int)offsetHigh, (int)offsetLow, bufferSize);
    memcpy(&buffer[tempBufferSize], &tmp_buffer[0], bufferSize);
    UnmapViewOfFile(tmp_buffer);
    offset += bufferSize;
    offsetHigh = ((offset >> 32) & 0xFFFFFFFF);
    offsetLow = (offset & 0xFFFFFFFF);
    if (offset < fileSize)
    {
        char *next;
        next = (char *)MapViewOfFile(fileMap, FILE_MAP_READ, (int)offsetHigh, (int)offsetLow, 1);
        if (next[0] >= '0' && next[0] <= '9')
        {
            cutted = true;
        }
        UnmapViewOfFile(next);
    }
    ostringstream path_stream;
    path_stream << tempPath << splitNum;
    ProcessChunk(buffer, path_stream.str(), cutted, bufferSize);
    delete buffer;
    cout << (splitNum + 1) << " file(s) sorted" << endl;
    splitNum++;
}
One possibility is that you're not using an offset that's a multiple of the allocation granularity. From MSDN:
The combination of the high and low offsets must specify an offset within the file mapping. They must also match the memory allocation granularity of the system. That is, the offset must be a multiple of the allocation granularity. To obtain the memory allocation granularity of the system, use the GetSystemInfo function, which fills in the members of a SYSTEM_INFO structure.
If you try to map at something other than a multiple of the allocation granularity, the mapping will fail and GetLastError will return ERROR_MAPPED_ALIGNMENT.
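For illustration (this is not from the original code, and it assumes an unsigned 64-bit offset), the usual way to honor that requirement when you need an arbitrary offset is to round the offset down to the granularity and keep the remainder:
SYSTEM_INFO si = {0};
GetSystemInfo(&si);
unsigned long long alignedOffset = offset - (offset % si.dwAllocationGranularity);
DWORD delta = static_cast<DWORD>(offset - alignedOffset);
// map at alignedOffset (split into high/low DWORDs as usual),
// then add delta to the returned pointer to reach the data you wanted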
Other than that, there are many problems in the code sample that make it very difficult to see what you're trying to do and where it's going wrong. At a minimum, you need to solve the memory leaks. You seem to be allocating and then leaking completely unnecessary buffers. Giving them better names can make it clear what they are actually used for.
Then I suggest putting a breakpoint on the calls to MapViewOfFile, and then checking all of the parameter values you're passing in to make sure they look right. As a start, on the second call, you'd expect offsetHigh to be 0 and offsetLow to be bufferSize.
A few suspicious things off the bat:
HANDLE inputFile = (HANDLE)OpenFile(inputFileName, &tOfStr, OF_READ);
Every cast should make you suspicious. Sometimes they are necessary, but make sure you understand why. At this point you should ask yourself why every other file API you're using requires a HANDLE and this function returns an HFILE. If you check OpenFile documentation, you'll see, "This function has limited capabilities and is not recommended. For new application development, use the CreateFile function." I know that sounds confusing because you want to open an existing file, but CreateFile can do exactly that, and it returns the right type.
long long offsetHigh = ((offset >> 32) & 0xFFFFFFFF);
What type is offset? You probably want to make sure it's an unsigned long long or equivalent. When bitshifting, especially to the right, you almost always want an unsigned type to avoid sign-extension. You also have to make sure that it's a type that has more bits than the amount you're shifting by--shifting a 32-bit value by 32 (or more) bits is actually undefined in C and C++, which allows the compilers to do certain types of optimizations.
long long offsetLow = (offset & 0xFFFFFFFF);
In both of these statements, you have to be careful about the 0xFFFFFFFF value. Since you didn't cast it or give it a suffix, it can be hard to predict whether the compiler will treat it as an int or unsigned int. In this case it'll be an unsigned int, but that won't be obvious to many people. In fact, I got this wrong when I first wrote this answer. [This paragraph corrected 16-MAY-2017] With bitwise operations, you almost always want to make sure you're using unsigned values.
tmp_buffer = (char *)MapViewOfFile(fileMap, FILE_MAP_READ, (int)offsetHigh, (int)offsetLow, bufferSize);
You're casting offsetHigh and offsetLow to ints, which are signed values. The API actually wants DWORDs, which are unsigned values. Rather than casting in the call, I would declare offsetHigh and offsetLow as DWORDs and do the casting in the initialization, like this:
DWORD offsetHigh = static_cast<DWORD>((offset >> 32) & 0xFFFFFFFFul);
DWORD offsetLow = static_cast<DWORD>( offset & 0xFFFFFFFFul);
tmp_buffer = reinterpret_cast<const char *>(MapViewOfFile(fileMap, FILE_MAP_READ, offsetHigh, offsetLow, bufferSize));
Those fixes may or may not resolve your problem. It's hard to tell what's going on from the incomplete code sample.
Here's a working sample you can compare to:
// Calls ProcessChunk with each chunk of the file.
void ReadInChunks(const WCHAR *pszFileName) {
    // Offsets must be a multiple of the system's allocation granularity. We
    // guarantee this by making our view size equal to the allocation granularity.
    SYSTEM_INFO sysinfo = {0};
    ::GetSystemInfo(&sysinfo);
    DWORD cbView = sysinfo.dwAllocationGranularity;
    HANDLE hfile = ::CreateFileW(pszFileName, GENERIC_READ, FILE_SHARE_READ,
                                 NULL, OPEN_EXISTING, 0, NULL);
    if (hfile != INVALID_HANDLE_VALUE) {
        LARGE_INTEGER file_size = {0};
        ::GetFileSizeEx(hfile, &file_size);
        const unsigned long long cbFile =
            static_cast<unsigned long long>(file_size.QuadPart);
        HANDLE hmap = ::CreateFileMappingW(hfile, NULL, PAGE_READONLY, 0, 0, NULL);
        if (hmap != NULL) {
            for (unsigned long long offset = 0; offset < cbFile; offset += cbView) {
                DWORD high = static_cast<DWORD>((offset >> 32) & 0xFFFFFFFFul);
                DWORD low  = static_cast<DWORD>( offset        & 0xFFFFFFFFul);
                // The last view may be shorter.
                if (offset + cbView > cbFile) {
                    cbView = static_cast<int>(cbFile - offset);
                }
                const char *pView = static_cast<const char *>(
                    ::MapViewOfFile(hmap, FILE_MAP_READ, high, low, cbView));
                if (pView != NULL) {
                    ProcessChunk(pView, cbView);
                    ::UnmapViewOfFile(pView);
                }
            }
            ::CloseHandle(hmap);
        }
        ::CloseHandle(hfile);
    }
}
You have a memory leak in your code:
char *tmp_buffer = new char[bufferSize];
[ ... ]
while (offset < fileSize)
{
[ ... ]
char *tmp_buffer = new char[bufferSize];
[ ... ]
tmp_buffer = (char *)MapViewOfFile(fileMap, FILE_MAP_READ, (int)offsetHigh, (int)offsetLow, bufferSize);
[ ... ]
}
You never delete what you allocate via new char[] on each iteration there. If your file is large enough / you do enough iterations of this loop, the memory allocation will eventually fail - that's when you'll see a throw from the allocator.
Win32 API calls like MapViewOfFile() are not C++ and never throw; they return error codes (the latter returns NULL on failure). Therefore, if you see exceptions, something's wrong in your C++ code. Likely the above.
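For example, using the names from the question's loop, checking the return value is how such a failure shows up:
char* view = (char*)MapViewOfFile(fileMap, FILE_MAP_READ, (DWORD)offsetHigh, (DWORD)offsetLow, bufferSize);
if (view == NULL) {
    DWORD err = GetLastError();  // e.g. ERROR_MAPPED_ALIGNMENT for a misaligned offset
    // handle or log the failure here; the API itself will not raise an exception
}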
I also had some trouble with memory-mapped files.
Basically I just wanted to share memory (1 MB) between 2 apps on the same PC.
- Both apps were written in Delphi
- Using Windows 8 Pro
At first one application (the first one launched) could read and write the memory-mapped file, but the second one could only read it (error 5: access denied).
Finally, after a lot of testing, it suddenly worked when both applications were using CreateFileMapping. I even tried to create my own security descriptor; nothing helped.
Before that, my applications were first calling OpenFileMapping and then CreateFileMapping if the first call failed.
Another thing that misled me is that the handles, although visibly referencing the same memory-mapped file, were different in both applications.
One last thing: after this correction my application seemed to work all right, but after a while I got ERROR_NOT_ENOUGH_MEMORY when calling MapViewOfFile.
It was just a beginner's mistake on my part: I was not always calling UnmapViewOfFile.
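The equivalent pattern in Win32/C++ terms looks roughly like this (my sketch, not the original Delphi code; the section name and 1 MB size are illustrative). Both sides call CreateFileMapping with the same name, and whichever process runs second simply opens the existing section:
#include <windows.h>
#include <cstdio>

int main()
{
    const DWORD cbShared = 1 << 20; // 1 MB
    HANDLE hMap = ::CreateFileMappingW(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                       0, cbShared, L"Local\\MySharedBlock");
    if (hMap == NULL) return 1;
    bool alreadyExisted = (::GetLastError() == ERROR_ALREADY_EXISTS);

    void* pView = ::MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, cbShared);
    if (pView != NULL) {
        std::printf("mapped %s section at %p\n",
                    alreadyExisted ? "existing" : "new", pView);
        ::UnmapViewOfFile(pView);   // always unmap, as noted above
    }
    ::CloseHandle(hMap);
    return 0;
}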

char[] to hex string exercise

Below is my current char* to hex string function. I wrote it as an exercise in bit manipulation. It takes ~7ms on an AMD Athlon MP 2800+ to hexify a 10 million byte array. Is there any trick or other way that I am missing?
How can I make this faster?
Compiled with -O3 in g++
static const char _hex2asciiU_value[256][2] =
{ {'0','0'}, {'0','1'}, /* snip..., */ {'F','E'},{'F','F'} };
std::string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
std::string str;
str.resize(_len*2);
char* pszHex = &str[0];
const unsigned char* pEnd = _pArray + _len;
clock_t stick, etick;
stick = clock();
for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
pszHex[0] = _hex2asciiU_value[*pChar][0];
pszHex[1] = _hex2asciiU_value[*pChar][1];
}
etick = clock();
std::cout << "ticks to hexify " << etick - stick << std::endl;
return str;
}
Updates
Added timing code
Brian R. Bondy: replace the std::string with a heap alloc'd buffer and change ofs*16 to ofs << 4 - however the heap allocated buffer seems to slow it down? - result ~11ms
Antti Sykäri: replace inner loop with
int upper = *pChar >> 4;
int lower = *pChar & 0x0f;
pszHex[0] = pHex[upper];
pszHex[1] = pHex[lower];
result ~8ms
Robert: replace _hex2asciiU_value with a full 256-entry table, sacrificing memory space but result ~7ms!
HoyHoy: Noted it was producing incorrect results
This assembly function (based on my previous post here, but I had to modify the concept a bit to get it to actually work) processes 3.3 billion input characters per second (6.6 billion output characters) on one core of a 3 GHz Core 2 Conroe. Penryn is probably faster.
%include "x86inc.asm"
SECTION_RODATA
pb_f0: times 16 db 0xf0
pb_0f: times 16 db 0x0f
pb_hex: db 48,49,50,51,52,53,54,55,56,57,65,66,67,68,69,70
SECTION .text
; int convert_string_to_hex( char *input, char *output, int len )
cglobal _convert_string_to_hex,3,3
movdqa xmm6, [pb_f0 GLOBAL]
movdqa xmm7, [pb_0f GLOBAL]
.loop:
movdqa xmm5, [pb_hex GLOBAL]
movdqa xmm4, [pb_hex GLOBAL]
movq xmm0, [r0+r2-8]
movq xmm2, [r0+r2-16]
movq xmm1, xmm0
movq xmm3, xmm2
pand xmm0, xmm6 ;high bits
pand xmm2, xmm6
psrlq xmm0, 4
psrlq xmm2, 4
pand xmm1, xmm7 ;low bits
pand xmm3, xmm7
punpcklbw xmm0, xmm1
punpcklbw xmm2, xmm3
pshufb xmm4, xmm0
pshufb xmm5, xmm2
movdqa [r1+r2*2-16], xmm4
movdqa [r1+r2*2-32], xmm5
sub r2, 16
jg .loop
REP_RET
Note it uses x264 assembly syntax, which makes it more portable (to 32-bit vs 64-bit, etc.). To convert this into the syntax of your choice is trivial: r0, r1, r2 are the three arguments to the function, held in registers. It's a bit like pseudocode. Or you can just get common/x86/x86inc.asm from the x264 tree and include that to run it natively.
P.S. Stack Overflow, am I wrong for wasting time on such a trivial thing? Or is this awesome?
At the cost of more memory you can create a full 256-entry table of the hex codes:
static const char _hex2asciiU_value[256][2] =
{ {'0','0'}, {'0','1'}, /* ..., */ {'F','E'},{'F','F'} };
Then direct index into the table, no bit fiddling required.
const char *pHexVal = pHex[*pChar];
pszHex[0] = pHexVal[0];
pszHex[1] = pHexVal[1];
Faster C Implementation
This runs nearly 3x faster than the C++ implementation. Not sure why, as it's pretty similar. The last C++ implementation that I posted took 6.8 seconds to run through a 200,000,000 character array; this one took only 2.2 seconds.
#include <stdio.h>
#include <stdlib.h>
char* char_to_hex(const unsigned char* p_array,
                  unsigned int p_array_len,
                  char** hex2ascii)
{
    unsigned char* str = malloc(p_array_len*2+1);
    const unsigned char* p_end = p_array + p_array_len;
    size_t pos=0;
    const unsigned char* p;
    for( p = p_array; p != p_end; p++, pos+=2 ) {
        str[pos] = hex2ascii[*p][0];
        str[pos+1] = hex2ascii[*p][1];
    }
    str[pos] = '\0'; /* terminate so printf("%s") stops at the converted data */
    return (char*)str;
}
int main()
{
    size_t hex2ascii_len = 256;
    char** hex2ascii;
    int i;
    hex2ascii = malloc(hex2ascii_len*sizeof(char*));
    for(i=0; i<hex2ascii_len; i++) {
        hex2ascii[i] = malloc(3*sizeof(char));
        snprintf(hex2ascii[i], 3, "%02X", i);
    }
    size_t len = 8;
    const unsigned char a[] = "DO NOT WANT";
    printf("%s\n", char_to_hex((const unsigned char*)a, len, (char**)hex2ascii));
}
Operate on 32 bits at a time (4 chars), then deal with the tail if needed. When I did this exercise with URL encoding, a full table lookup for each char was slightly faster than logic constructs, so you may want to test this in context as well to take caching issues into account.
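One possible reading of that suggestion, as a rough sketch (my code, and it assumes a little-endian target for the byte order inside each 32-bit word):
#include <cstdint>
#include <cstring>
#include <string>

std::string to_hex_by_words(const unsigned char* p, size_t len)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out(len * 2, '\0');
    char* dst = &out[0];
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {            // 4 input bytes -> 8 output chars
        uint32_t w;
        std::memcpy(&w, p + i, 4);            // avoids unaligned-access issues
        for (int b = 0; b < 4; ++b) {
            unsigned char c = (w >> (8 * b)) & 0xFF;   // little-endian byte order
            *dst++ = digits[c >> 4];
            *dst++ = digits[c & 0x0F];
        }
    }
    for (; i < len; ++i) {                    // tail: 0-3 leftover bytes
        *dst++ = digits[p[i] >> 4];
        *dst++ = digits[p[i] & 0x0F];
    }
    return out;
}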
It works for me with unsigned char:
unsigned char c1 = byteVal >> 4;
unsigned char c2 = byteVal & 0x0f;
c1 += c1 <= 9 ? '0' : ('a' - 10);
c2 += c2 <= 9 ? '0' : ('a' - 10);
std::string sHex("  ");
sHex[0] = c1;
sHex[1] = c2;
//sHex now contains what we need, for example "0f"
For one, instead of multiplying by 16 do a bitshift << 4
Also don't use the std::string, instead just create a buffer on the heap and then delete it. It will be more efficient than the object destruction that is needed from the string.
not going to make a lot of difference... *pChar-(ofs*16) can be done with [*pChar & 0x0F]
This is my version, which, unlike the OP's version, doesn't assume that std::basic_string has its data in a contiguous region:
#include <string>
using std::string;
static char const* digits("0123456789ABCDEF");
string
tohex(string const& data)
{
string result(data.size() * 2, 0);
string::iterator ptr(result.begin());
for (string::const_iterator cur(data.begin()), end(data.end()); cur != end; ++cur) {
unsigned char c(*cur);
*ptr++ = digits[c >> 4];
*ptr++ = digits[c & 15];
}
return result;
}
I assume this is Windows+IA32.
Try to use short int instead of the two hexadecimal letters.
short int hex_table[256] = {'0'*256+'0', '1'*256+'0', '2'*256+'0', ..., 'E'*256+'F', 'F'*256+'F'};
unsigned short int* pszHex = (unsigned short int*)&str[0];
stick = clock();
for (const unsigned char* pChar = _pArray; pChar != pEnd; pChar++)
    *pszHex++ = hex_table[*pChar];
etick = clock();
Changing
ofs = *pChar >> 4;
pszHex[0] = pHex[ofs];
pszHex[1] = pHex[*pChar-(ofs*16)];
to
int upper = *pChar >> 4;
int lower = *pChar & 0x0f;
pszHex[0] = pHex[upper];
pszHex[1] = pHex[lower];
results in roughly 5% speedup.
Writing the result two bytes at time as suggested by Robert results in about 18% speedup. The code changes to:
_result.resize(_len*2);
short* pszHex = (short*) &_result[0];
const unsigned char* pEnd = _pArray + _len;
for (const unsigned char* pChar = _pArray;
     pChar != pEnd;
     pChar++, ++pszHex)
{
    *pszHex = short_table[*pChar];
}
Required initialization:
short short_table[256];
for (int i = 0; i < 256; ++i)
{
    char* pc = (char*) &short_table[i];
    pc[0] = "0123456789ABCDEF"[i >> 4];
    pc[1] = "0123456789ABCDEF"[i & 0x0f];
}
Doing it 2 bytes at a time or 4 bytes at a time will probably result in even greater speedups, as pointed out by Allan Wind, but then it gets trickier when you have to deal with the odd characters.
If you're feeling adventurous, you might try to adapt Duff's device to do this.
Results are on an Intel Core 2 Duo processor and gcc -O3.
Always measure that you actually get faster results — a pessimization pretending to be an optimization is less than worthless.
Always test that you get the correct results — a bug pretending to be an optimization is downright dangerous.
And always keep in mind the tradeoff between speed and readability — life is too short for anyone to maintain unreadable code.
(Obligatory reference to coding for the violent psychopath who knows where you live.)
Make sure your compiler optimization is turned on to the highest working level.
You know, flags like '-O1' to '-O3' in gcc.
I have found that using an index into an array, rather than a pointer, can speed things up a tick. It all depends on how your compiler chooses to optimize. The key is that the processor has addressing modes that can compute things like [i*2+1] as part of a single instruction.
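For concreteness, the indexed form of the question's loop would look roughly like this (same table and buffers as the original; whether it beats the pointer version depends on the compiler):
for (unsigned int i = 0; i < _len; ++i) {
    pszHex[i * 2]     = _hex2asciiU_value[_pArray[i]][0];
    pszHex[i * 2 + 1] = _hex2asciiU_value[_pArray[i]][1];
}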
If you're rather obsessive about speed here, you can do the following:
Each character is one byte, representing two hex values. Thus, each character is really two four-bit values.
So, you can do the following:
Unpack the four-bit values to 8-bit values using a multiplication or similar instruction.
Use pshufb, the SSSE3 instruction (Core2-only though). It takes an array of 16 8-bit input values and shuffles them based on the 16 8-bit indices in a second vector. Since you have only 16 possible characters, this fits perfectly; the input array is a vector of 0 through F characters, and the index array is your unpacked array of 4-bit values.
Thus, in a single instruction, you will have performed 16 table lookups in fewer clocks than it normally takes to do just one (pshufb is 1 clock latency on Penryn).
So, in computational steps:
A B C D E F G H I J K L M N O P (64-bit vector of input values, "Vector A") -> 0A 0B 0C 0D 0E 0F 0G 0H 0I 0J 0K 0L 0M 0N 0O 0P (128-bit vector of indices, "Vector B"). The easiest way is probably two 64-bit multiplies.
pshufb [0123456789ABCDEF], Vector B
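A rough C++ intrinsics rendering of that idea (my sketch, not the poster's code; it assumes SSSE3 and handles 16 input bytes per call rather than the 8 described above):
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

// Convert 16 input bytes into 32 hex characters.
static void hex16(const unsigned char* src, char* dst)
{
    const __m128i bytes  = _mm_loadu_si128((const __m128i*)src);
    const __m128i mask0f = _mm_set1_epi8(0x0F);
    const __m128i digits = _mm_setr_epi8('0','1','2','3','4','5','6','7',
                                         '8','9','A','B','C','D','E','F');
    __m128i hi = _mm_and_si128(_mm_srli_epi16(bytes, 4), mask0f); // high nibbles
    __m128i lo = _mm_and_si128(bytes, mask0f);                    // low nibbles
    // Interleave so each source byte becomes (high digit index, low digit index).
    __m128i idx_lo = _mm_unpacklo_epi8(hi, lo);
    __m128i idx_hi = _mm_unpackhi_epi8(hi, lo);
    _mm_storeu_si128((__m128i*)dst,        _mm_shuffle_epi8(digits, idx_lo));
    _mm_storeu_si128((__m128i*)(dst + 16), _mm_shuffle_epi8(digits, idx_hi));
}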
I'm not sure doing it more bytes at a time will be better... you'll probably just get tons of cache misses and slow it down significantly.
What you might try is to unroll the loop though, take larger steps and do more characters each time through the loop, to remove some of the loop overhead.
Consistently getting ~4ms on my Athlon 64 4200+ (~7ms with original code)
for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++) {
const char* pchars = _hex2asciiU_value[*pChar];
*pszHex++ = *pchars++;
*pszHex++ = *pchars;
}
The function as it is shown when I'm writing this produces incorrect output even when _hex2asciiU_value is fully specified. The following code works, and on my 2.33 GHz MacBook Pro runs in about 1.9 seconds for 200,000,000 characters.
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
using namespace std;
static const size_t _h2alen = 256;
static char _hex2asciiU_value[_h2alen][3];
string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
    string str;
    str.resize(_len*2);
    char* pszHex = &str[0];
    const unsigned char* pEnd = _pArray + _len;
    const char* pHex = _hex2asciiU_value[0];
    for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
        pszHex[0] = _hex2asciiU_value[*pChar][0];
        pszHex[1] = _hex2asciiU_value[*pChar][1];
    }
    return str;
}
int main() {
    for(int i=0; i<_h2alen; i++) {
        snprintf(_hex2asciiU_value[i], 3, "%02X", i);
    }
    size_t len = 200000000;
    char* a = new char[len];
    string t1;
    string t2;
    clock_t start;
    srand(time(NULL));
    for(int i=0; i<len; i++) a[i] = rand()&0xFF;
    start = clock();
    t1=char_to_hex((const unsigned char*)a, len);
    cout << "char_to_hex conversion took ---> " << (clock() - start)/(double)CLOCKS_PER_SEC << " seconds\n";
}