Memory scanner with a slow scan - c++

I'm working on a Memory Scanner, but the scan is so slow.. can anybody help-me improve it?
procedure FirstScan(scantype, scanvalue: string);
var
value :integer;
dwEndAddr : dword;
i:dword;
mbi : TMemoryBasicInformation;
begin
while (VirtualQuery(Pointer(DWORD(mbi.BaseAddress) + MBI.RegionSize), MBI, SizeOf(MEMORY_BASIC_INFORMATION))=SizeOf(TMemoryBasicInformation)) do begin
if (MBI.State = MEM_COMMIT) and (MBI.Protect = PAGE_READWRITE) then begin
dwEndAddr := DWORD(mbi.BaseAddress) + MBI.RegionSize;
for i := DWORD(MBI.BaseAddress) to (dwEndAddr - 1 - sizeof(DWORD)) do begin
Application.ProcessMessages;
try
if scantype = '1 Byte' then begin
value := PBYTE(i)^;
if scanvalue = IntToStr(value) then ListBox1.Items.Add(IntToHex(i,8));
end;
//others scantypes here...
except
Break;
end;
end;
end;
end;
end;
I've learned that I need to read 4096 byte pages at a time then store those in memory and do operations on it, until I need a new page then get another 4096 byte page...
But I don't know how can I do that...
Can anybody help-me? The code can be in C or C++...

I can help you a bit...get that Application.ProcessMessages out of the inner loop. You're calling it for every. single. byte. you. scan. You don't need to be quite that responsive to window messages. :)
Move it up into the outer loop, and you should see a significant speed increase. I'd say spawn a thread and get rid of Application.ProcessMessages altogether, as it's really not this code's job to handle, but i'm not sure how/whether Delphi does threads.
Also....you're passing the scan params as strings? If you insist on doing that, set an int or enum or something before you start the loop that says what scan type to use, convert the value to a useful type for your search, and compare that. String comparisons tend to be slower than integer comparisons, particularly when you're creating new strings each time.

To make slow code fast, there are a few things you can do. First, make sure your code is correct. Wrong results are still wrong results, even if you get them quickly. To that end, make sure that when you call VirtualQuery, you're passing in valid values for all the parameters. At that start of this function mbi is uninitialized, so the result of DWORD(mbi.BaseAddress) + MBI.RegionSize will be who-knows-what.
Once you have correctly working code, there are two ways to make it faster:
Find the slow parts and make them fast. To do this right, you need a profiler. A profiler will observe your program as it runs, and then tell you what percentage of time your program spent executing each part. That tells you where to focus your efforts.
Replace slow algorithms with faster algorithms. This might mean throwing away the entire function, or it might mean fixing just certain parts of the code.
For example, profiling might show that you spend a lot of time call ProcessMessages. You can't really make that function any faster since it's part of the VCL, but you can call it less often. You might even find that you don't need to call it at all, if the thread you're running this code on isn't expected to receive any messages that need processing.
Profiling might show that you're spending a lot of time doing string comparisons. If the starts of your strings are frequently equal, and usually only differ at the end, then you might wish to change your string-comparison algorithm to start comparing strings at the last character instead of the first.
Profiling might show that you're spending a lot of time converting integers into strings before you compare them. Most programming languages support comparing integers directly, so instead of using a string-comparing algorithm, you could try using an integer-comparing algorithm instead. You could convert scanvalue to an integer with StrToInt(scanvalue) and compare it directly to value.
Profiling might show that you're repeatedly calculating the same result from the same input. If a value doesn't change over some portion of a program, then values calculated from it won't change, either. You can reduce the cost of converting values by doing it only when a value has changed. For example, if you do integer comparisons, then you'll probably find that the integer version of scanvalue doesn't change in your function. You could convert scanvalue to an integer once at the start of the function, and then compare value to that inside the loop instead of calling StrToInt(scanvalue) lots of times.

Also: you're converting the current pointer to a string for every byte you access. Yech, horribly slow. Instead, change scanvalue to a BYTE at the start of your procedure and do the comparison directly on BYTEs.
Finally -- pull the "if scantype = '1 byte'" out of the for i:=DWORD(MBI.BaseAddress) loop. You don't want to be doing that "if" statement for every byte -- instead, do
if scantype = '1 byte' then
for i:= DWORD(...)
else if scantype='other scan type' then...
and so on. (And yes, you should convert that "if scantype" comparison to an enum or whatnot.)

Related

Large performance difference between comparing a variable to a fixed value and reading or writing from mapped memory address

I'm developing a software that runs on a DE10 board, in an ARM Cortex-A9 processor.
This software has to access physical memory addresses in order to communicate with the FPGA in the DE10, and this is done mapping /dev/mem, this method is described here.
I have a situation where I have to select which of 4 addresses to send some values, and this could be done in 1 of 2 ways:
Using an if statement and checking a integer variable (which is always 0 or 1 at that part of the loop) and only write if it's 1.
Multiply the values that should be sent by the aforementioned variable and write on all addresses without any conditional, because writing zero doesn't have any effect on my system.
I was curious about which would be faster, so I tried this:
First, I made this loop:
int test=0;
for(int i=0;i<1000000;i++)
{
if(test==9)
{
test=15;
}
test++;
if(test==9)
{
test=0;
}
}
The first if statement should never be satisfied, so its only contribution to the time taken in the loop is from its comparison itself.
The increment and the second if statement are just things I added in an attempt to prevent the compiler from just "optimizing out" the first if statement.
This loop is ran once without being benchmarked (just in case there's any frequency scaling ramp, although I'm pretty sure it has none) and then its ran once again being benchmarked, and it takes around 18350 μs to complete.
Without the first if statement, it takes around 17260 μs
Now, If I change that first if statement by a line that sets the value of a memory-mapped address to the value of the integer test, like this:
for(int i=0;i<1000000;i++)
{
*(uint8_t*)address=test;
test++;
if(test==9)
{
test=0;
}
}
This loops takes around 253600 μs to complete, almost 14 x slower.
Reading that address instead of writing on it barely changes anything.
Is this what it really is, or is there some kind of compiler optimization possibly frustrating my benchmarking?
Should I expect this difference in performance (and thus favoring the comparison method) in the actual software?

How do jump-tables work?

In the following document, pages 4-5:
http://www.open-std.org/jtc1/sc22/wg21/docs/ESC_Boston_01_304_paper.pdf
typedef int (* jumpfnct)(void * param);
static int CaseError(void * param)
{
return -1;
}
static jumpfnct const jumptable[] =
{
CaseError, CaseError, ...
.
.
.
Case44, CaseError, ...
.
.
.
CaseError, Case255
};
result = index <= 0xFF ? jumptable[index](param) : -1;
it is comparing IF-ELSE vs SWITCH and then introduces this "Jump table". Apparently it is the fastest implementation of the three. What exactly is it? I cannot see how it could work??
The jumptable is a method of mapping some input integer to an action. It stems from the fact that you can use the input integer as the index of an array.
The code sets up an array of pointers to functions. Your input integer is then used to select on of these function-pointers. Generally, it looks like it's going to be a pointer to the function CaseError. However, every now and again, it will be a different function that is being pointed to.
It's designed so that
jumptable[62] = Case62;
jumptable[95] = Case95;
jumptable[35] = Case35;
jumptable[34] = CaseError; /* For example... and so it goes on */
Thus, selecting the right function to call is constant time... with the if-elses and selects, the time taken to select the correct function is dependent on the input integer... assuming the compiler doesn't optimize the select to a jumptable itself... if it's for embedded code, then there's a chance that optimizations of this kind have been disabled... you'd have to check.
Once the correct function-pointer is found, the last line simply calls it:
result = index <= 0xFF ? jumptable[index](param) : -1;
becomes
result = index <= 0xFF /* Check that the index is within
the range of the jump table */
? jumptable[index](param) /* jumptable[index] selects a function
then it gets called with (param) */
: -1; /* If the index is out of range, set result to be -1
Personally, I think a better choice would be to call
CaseError(param) here */
Jumpfnct is a pointer to a function. Jumptable is an array that consists of a number of jumpfncts. The functions can be called just by referencing their position in the array.
For example, jumptable0 will execute the first function, passing along param. jumptable1 will execute the second function, etc.
If you don't know about function pointers, you shouldn't use this trick. They're very handy, in a narrow domain.
It's very fast and space efficient, when what you're doing is switching between a large number of similar function calls. You are adding a function call overhead that a switch statement doesn't necessarily have, so it might not be appropriate in all circumstances. If your code is something like this:
switch(x) {
case 1:
function1();
break;
case 2:
function2();
break;
...
}
A jump table might be a good substitution. If, though, your switch is something like this:
switch(x) {
case 1:
y++;
break;
case 1023:
y--;
break;
...
}
It probably wouldn't be worth doing.
I've used them in a toy FORTH language interpreter, where they were invaluable, but in most cases you're not going to see a speed benefit that makes them worth using. Use them if it makes the logic of your program clearer, not for optimization.
This jumptable returns a pointer-to-function by its index. You define this table in a way that invalid indexes point to the function that returns some invalid code (like -1 in the example) and valid indexes point to the functions you need to call.
Construction
jumptable[index]
returns pointer-to-function and this function gets called
jumptable[index](param)
where param is some custom parameter.
A Jump-Table is an obvious, but rarely used optimization, that for some reason seems to have fallen out of favor.
Briefly, instead of testing a value and exiting out of a switch/case or if-else block to branch to function or code path, you create an array which is filled with the addresses of the functions the program can branch to.
Once completed, this arrangement eliminates the relentless if testing attendant with if-else and switch/case blocks. The code uses the variable that would otherwise be tested with if as a subscript into the function-pointer array, and proceeds directly the the appropriate code - sans ANY if testing. A perfectly efficient branch. The assembly code should literally be a jump.
If you profile code, and find a hot-spot where the program is spending a large % of it's time, look to this kind of optimization to improve performance. A little bit of this can go a long way if it's part of a code's hot-spot.
Thanks for the link. Nice find!
As mentioned in the comment above, whether this solution is more or less effiicent than, for example, a switch statement depends on the amount of work needed to be done for each case.
Writing a regular switch statement for the values you want to process will definitely be a clearer way to see what the code does. So unless either space or speed requirements dictate that a more sophisticated solution, I would suggest that this is not a "better" solution.
Tables of function pointers is however an efficient and good way to solve certain problems. I use function pointers in a table quite regularly to do things like "Benchmark 11 different solutions to a problem", where I have a struct wiht the name of the function and the function, and some parameters perhaps. Then I have one function to time and loop over the code a few million times (or whatever it takes to get a long enough measurement to make sense)

Text iteration, Assembly versus C++

I am making a program in which is is frequently reading chunks of text received from the web looking for specific characters and parsing the data accordingly. I am becoming fairly skilled with C++, and have made it work well, however, is Assembly going to be faster than a
for(size_t len = 0;len != tstring.length();len++) {
if(tstring[len] == ',')
stuff();
}
Would an inline-assembly routine using cmp and jz/jnz be faster? I don't want to waste my time working with asm for the fact being able to say I used it, but for true speed purposes.
Thank you,
No way. Your loop is so simple, the cost of the optimizer losing the ability to reason about your code is going to be way higher than any performance you could gain. This isn't SSE intrinsics or a bootloader, it's a trivial loop.
An inline assembly routine using "plain old" jz/jnz is unlikely to be faster than what you have; that said, you have a few inefficiencies in your code:
you're retrieving tstring.length() once per loop iteration; that's unnecessary.
you're using random indexing, tstring[len] which might be a more-expensive operation than using a forward iterator.
you're calling stuff() during the loop; depending on what exactly that does, it might be faster to just let the loop build a list of locations within the string first (so that the scanned string as well as the scanning code stays cache-hot and is not evicted by whatever stuff() does), and only afterwards iterate over those results.
There's already a likely low-level optimized standard library function available,strchr(), for exactly that kind of scanning. The C++ STL std::string::find() is also likely to have been optimized for the purpose (and/or might use strchr() in the char specialization).
In particular, strchr() has SSE2 (using pcmpeqb, maskmov... and bsf) or SSE4.2 (using the string op pcmpistri) implementations; for examples/actual SSE code doing this, check e.g. strchr() in GNU libc (as used on Linux). See also the references and comments here (suitably named website ...).
My advice: Check your library implementation / documentation, and/or the actual generated assembly code for your program. You might well be using fast code already ... or would be if you'd switch from your hand-grown character-by-character simple search to just using std::string::find() or strchr().
If this is ultra-speed-critical, then inlining assembly code for strchr() as used by known/tested implementations (watch licensing) would eliminate function calls and gain a few cycles. Depends on your requirements ... code, benchmark, vary, benchmark again, ...
Checking characters one by one is not the fastest thing to do. Maybe you should try something like this and find out if it's faster.
string s("xxx,xxxxx,x,xxxx");
string::size_type pos = s.find(',');
while(pos != string::npos){
do_stuff(pos);
pos = s.find(',', pos+1);
}
Each iteration of the loop will give you the next position of a ',' character so the program will need only few loops to finish the job.
Would an inline-assembly routine using cmp and jz/jnz be faster?
Maybe, maybe not. It depends upon what stuff() does, what the type and scope of tstring is, and what your assembly looks like.
First, measure the speed of the maintainable C++ code. Only if this loop dominates your program's speed should you consider rewriting it.
If you choose to rewrite it, keep both implementations available, and comparatively measure them. Only use the less maintainable version if it is faster, and if the speed increase matters. Also, since you have the original version in place, future readers will be able to understand your intent even if they don't know asm that well.

Guessing a string with time comparison. Is it possible?

I was wondering about a strange idea: you are given and algorithm wich takes a string in input and compares it to a string that you don't know. The algoritm is just a trivial comparison, one char at a time. When a couple that doesn't match is found, 0 is returned. Otherwise it returns 1.
Can you guess the secret string in a polynomial time by using the provided algorithm?
When a string doesn't match, the time used to give the answer 0 is less than the time taken to return 1, because less comparisons are needed. Times involved are very small, and for this reason you can try a single instance many times to get a more accurate estimation. Estimating the time taken we could have informations about the secret string. If this works properly, we can guess the string one char at a time, in a polynomial time. So if this can happen we can try some kind of brute force attack char by char.
Does this make sense? Or is there something I'm misunderstanding?
Thanks in advance.
You can guess the secret string if you can can input your own strings to compare, or just observe enough strings (not chosen by you) being compared to the secret string, if the string comparison has been written in a way such that its execution time reveals information about the secret string.
This is a known weakness cryptographic software can have, and all serious cryptographic software written nowadays avoids this weakness.
For instance, to avoid revealing information about its arguments, a function that tests whether two buffers are the same or different may be written:
int crypto_memcmp(const char *s1, const char *s2, size_t n)
{
size_t i;
int answer;
for (i=0; i<n; i++)
answer = answer | (s1[i] != s2[i]);
return answer;
}
You can use several techniques to check that a piece of code does not leak secrets through timing attacks. I wrote how to do it with static analysis here but this is based on a previous idea that used Valgrind (dynamic analysis) here.
Note that it goes further than that. This article showed how you did not even need the execution path to depend on the secret to leak information. It was enough that the secret was used in the computation of some array indices that were subsequently accessed. On modern computers, this changes the execution time because the cache will make two successive accesses to similar indices faster than two successive accesses to indices that are far from each other, revealing information about the secret.
You can determine the string bit by bit. for each bit use binary search
for example:
you already know the first a bits. say it (Sa).
now you have to determine the (a+1)th bit.
there are upper bound (Sa)zzzzzzz... and lower bound (Sa)azzzzzz....
first you guess the (a+1)th bit is (a+z)/2, say r, then the string is (Sa)rzzzzzz..., and with the result, you update the upper bound and lower bound.

Any better alternatives for getting the digits of a number? (C++)

I know that you can get the digits of a number using modulus and division. The following is how I've done it in the past: (Psuedocode so as to make students reading this do some work for their homework assignment):
int pointer getDigits(int number)
initialize int pointer to array of some size
initialize int i to zero
while number is greater than zero
store result of number mod 10 in array at index i
divide number by 10 and store result in number
increment i
return int pointer
Anyway, I was wondering if there is a better, more efficient way to accomplish this task? If not, is there any alternative methods for this task, avoiding the use of strings? C-style or otherwise?
Thanks. I ask because I'm going to be wanting to do this in a personal project of mine, and I would like to do it as efficiently as possible.
Any help and/or insight is greatly appreciated.
The time it takes to extract the digits will be dwarfed by the time required to dynamically allocate the array. Consider returning the result in a struct:
struct extracted_digits
{
int number_of_digits;
char digits[12];
};
You'll want to pick a suitable value for the maximum number of digits (12 here, which is enough for a 32-bit integer). Alternatively, you could return a std::array<char, 12> and encode the terminal by using an invalid value (so, after the last value, store a 10 or something else that isn't a digit).
Depending on whether you want to handle negative values, you'll also have to decide how to report the unary minus (-).
Unless you want the representation of the number in a base that's a power of 2, that's about the only way to do it.
Smacks of premature optimisation. If profiling proves it matters, then be sure to compare your algo to itoa - internally it may use some CPU instructions that you don't have explicit access to from C++, and which your compiler's optimiser may not be clever enough to employ (e.g. AAM, which divs while saving the mod result). Experiment (and benchmark) coding the assembler yourself. You might dig around for assembly implementations of ITOA (which isn't identical to what you're asking for, but might suggest the optimal CPU instructions).
By "avoiding the use of strings", I'm going to assume you're doing this because a string-only representation is pretty inefficient if you want an integer value.
To that end, I'm going to suggest a slightly unorthodox approach which may be suitable. Don't store them in one form, store them in both. The code below is in C - it will work in C++ but you may want to consider using c++ equivalents - the idea behind it doesn't change however.
By "storing both forms", I mean you can have a structure like:
typedef struct {
int ival;
char sval[sizeof("-2147483648")]; // enough for 32-bits
int dirtyS;
} tIntStr;
and pass around this structure (or its address) rather than the integer itself.
By having macros or inline functions like:
inline void intstrSetI (tIntStr *is, int ival) {
is->ival = i;
is->dirtyS = 1;
}
inline char *intstrGetS (tIntStr *is) {
if (is->dirtyS) {
sprintf (is->sval, "%d", is->ival);
is->dirtyS = 0;
}
return is->sval;
}
Then, to set the value, you would use:
tIntStr is;
intstrSetI (&is, 42);
And whenever you wanted the string representation:
printf ("%s\n" intstrGetS(&is));
fprintf (logFile, "%s\n" intstrGetS(&is));
This has the advantage of calculating the string representation only when needed (the fprintf above would not have to recalculate the string representation and the printf only if it was dirty).
This is a similar trick I use in SQL with using precomputed columns and triggers. The idea there is that you only perform calculations when needed. So an extra column to hold the indexed lowercased last name along with an insert/update trigger to calculate it, is usually a lot more efficient than select lower(non_lowercased_last_name). That's because it amortises the cost of the calculation (done at write time) across all reads.
In that sense, there's little advantage if your code profile is set-int/use-string/set-int/use-string.... But, if it's set-int/use-string/use-string/use-string/use-string..., you'll get a performance boost.
Granted this has a cost, at the bare minimum extra storage required, but most performance issues boil down to a space/time trade-off.
And, if you really want to avoid strings, you can still use the same method (calculate only when needed), it's just that the calculation (and structure) will be different.
As an aside: you may well want to use the library functions to do this rather than handcrafting your own code. Library functions will normally be heavily optimised, possibly more so than your compiler can make from your code (although that's not guaranteed of course).
It's also likely that an itoa, if you have one, will probably outperform sprintf("%d") as well, given its limited use case. You should, however, measure, not guess! Not just in terms of the library functions, but also this entire solution (and the others).
It's fairly trivial to see that a base-100 solution could work as well, using the "digits" 00-99. In each iteration, you'd do a %100 to produce such a digit pair, thus halving the number of steps. The tradeoff is that your digit table is now 200 bytes instead of 10. Still, it easily fits in L1 cache (obviously, this only applies if you're converting a lot of numbers, but otherwise efficientcy is moot anyway). Also, you might end up with a leading zero, as in "0128".
Yes, there is a more efficient way, but not portable, though. Intel's FPU has a special BCD format numbers. So, all you have to do is just to call the correspondent assembler instruction that converts ST(0) to BCD format and stores the result in memory. The instruction name is FBSTP.
Mathematically speaking, the number of decimal digits of an integer is 1+int(log10(abs(a)+1))+(a<0);.
You will not use strings but go through floating points and the log functions. If your platform has whatever type of FP accelerator (every PC or similar has) that will not be a big deal ,and will beat whatever "sting based" algorithm (that is noting more than an iterative divide by ten and count)