I'm learning reverse engineering, and I have the following snippet which I am trying to make sense of:
var = strcmp("C:\\Windows\\System32\\svchost.exe", pe.szExeFile);
if (var)
    var = -(var < 0) | 1;
if (var)
{
    // additional code here
}
I think I understand most of what is going on here, but I'm confused about the purpose of the
var = -(var < 0) | 1; line. I'm only very vaguely familiar with C/C++, so I'm having a hard time wrapping my head around what this line does.
I understand that it's a bitwise OR, but I'm unsure how the -(var < 0) works. Is the expression inside the parentheses evaluated to a 1 or 0 and then the negative is applied and the OR? Is it evaluated as a boolean? If so, how does the | work on a boolean?
Or am I totally missing the point here?
strcmp() returns one of three possible results:
< 0
0
> 0
Assuming the usual two's-complement representation, after the first if the variable var will be:
-1 for the former "< 0"
0 for the former "= 0"
+1 for the former "> 0"
However, the second if will be taken only if var is non-zero.
The "mysterious" first if therefore has no effect, as far as the code you show is concerned: var is non-zero after the normalization exactly when it was non-zero before.
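In Java, where a comparison yields a boolean rather than an int, the same normalization can be sketched like this (a minimal illustration of mine; the normalize name is not from the original code):

```java
public class SignNormalize {
    // Java analogue of the C idiom `if (var) var = -(var < 0) | 1;`.
    // For a negative input: -(1) | 1 == -1 | 1 == -1 (all bits set).
    // For a positive input: -(0) | 1 ==  0 | 1 ==  1.
    static int normalize(int var) {
        if (var != 0) {
            var = -(var < 0 ? 1 : 0) | 1;
        }
        return var;
    }

    public static void main(String[] args) {
        System.out.println(normalize(-42)); // prints -1
        System.out.println(normalize(0));   // prints 0
        System.out.println(normalize(7));   // prints 1
    }
}
```

In other words, the line computes the sign of var (what Java offers as Integer.signum), which only matters if later code distinguishes -1 from other negative values.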
Suppose I want to solve the equation x + 3 = 40 using GNU bc. One way I could do this would be to start by checking to see if 0 is a solution, then checking 1, and so on, until I get to the right answer. (Obviously not the best way to do algebra, but oh well.) So I enter the following code into GNU bc:
int solver(int x);
define solver(x){
if(x + 3 == 40) return x;
x = x + 1;
solver(x)
}
solver(0)
It produces 37 - the right answer, of course - but the 37 is then followed by 37 zeros. Based on some experimentation, it seems like each zero comes from an instance of the if statement being false, but how do I prevent the zeros from showing up? I'm using GNU bc to solve more complicated functions and create more complex lists of numbers, so it really isn't practical for me to sort through all the zeros. Any help would be appreciated, since I haven't yet figured anything out.
For each expression that isn't an assignment, bc prints the resulting value. One way to suppress that is to assign the result to the dummy variable . (which just holds the last printed result anyway); another is to make sure you explicitly print exactly what you need.
I would have written your function like this:
#!/usr/bin/bc -q
define solver(x) {
if (x + 3 == 40) return x
return solver(x+1)
}
print solver(0), "\n"
quit
A few remarks for your attempt:
I don't understand what your first line is supposed to do (it looks like a C-style function prototype, which bc doesn't use), so I just dropped it
I've indented the code, added some whitespace and removed the semicolons – mostly a matter of taste and readability
I've simplified the recursive call so that the solver(x) line no longer stands on its own, as that is what produces the spurious 0s
As for your suspicion that the if statement produces the zeroes: try, in an interactive session, the following:
1 == 2 # Equality test on its own produces output
0
1 == 1 # ... for both true and false statements
1
if (1 == 2) print "yes\n" # No output from false if condition
if (1 == 1) print "yes\n" # If statement is true, print string
yes
When testing small strings (e.g. isPhoneNumber or isHexadecimal) is there a performance benefit from using regular expressions, or would brute forcing them be faster? Wouldn't brute forcing them by just checking whether or not the given string's chars are within a specified range be faster than using a regex?
For example:
public static boolean isHexadecimal(String value)
{
if (value.startsWith("-"))
{
value = value.substring(1);
}
value = value.toLowerCase();
if (value.length() <= 2 || !value.startsWith("0x"))
{
return false;
}
for (int i = 2; i < value.length(); i++)
{
char c = value.charAt(i);
if (!(c >= '0' && c <= '9' || c >= 'a' && c <= 'f'))
{
return false;
}
}
return true;
}
vs.
Regex.match(/0x[0-9a-f]+/, "0x123fa") // returns true if regex matches whole given expression
It seems like there would be some overhead associated with the regex, even when the pattern is pre-compiled, simply because regular expressions have to handle many general cases. In contrast, the brute-force method does exactly what is required and no more. Am I missing some optimization that regular expressions have?
Checking whether string characters are within a certain range is exactly what regular expressions are built to do. They convert the expression into a series of atomic instructions; they essentially carry out your manual parsing steps, but at a lower level.
What tends to be slow with regular expressions is the conversion of the expression into instructions. You can see real performance gains when a regex is used more than once. That's when you can compile the expression ahead of time and then simply apply the resulting compiled instructions in a match, search, replace, etc.
As is the case with anything to do with performance, perform some tests and measure the results.
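For example, in Java the compile-once approach looks like this (a minimal sketch; the HEX field name and the optional leading minus sign are my own choices, not taken from the question):

```java
import java.util.regex.Pattern;

public class CompiledRegexDemo {
    // Pattern.compile is the expensive step; do it once and reuse the
    // compiled Pattern for every match.
    private static final Pattern HEX = Pattern.compile("-?0x[0-9a-fA-F]+");

    static boolean isHex(String s) {
        return HEX.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHex("0x123fa"));  // prints true
        System.out.println(isHex("0x123fax")); // prints false
    }
}
```

String.matches(...), by contrast, recompiles the pattern on every call, which is the uncompiled case benchmarked below.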
I've written a small benchmark to estimate the performance of the:
NOP method (to get an idea of the baseline iteration speed);
original method, as provided by the OP;
RegExp;
compiled RegExp;
version provided by @maraca (w/o toLowerCase and substring);
"fastIsHex" version (switch-based), which I added just for fun.
The test machine configuration is as follows:
JVM: Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
CPU: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
And here are the results I got for the original test string "0x123fa" and 10,000,000 iterations:
Method "NOP" => #10000000 iterations in 9ms
Method "isHexadecimal (OP)" => #10000000 iterations in 300ms
Method "RegExp" => #10000000 iterations in 4270ms
Method "RegExp (Compiled)" => #10000000 iterations in 1025ms
Method "isHexadecimal (maraca)" => #10000000 iterations in 135ms
Method "fastIsHex" => #10000000 iterations in 107ms
As you can see, even the original method by the OP is faster than the RegExp method (at least with the JDK-provided RegExp implementation).
(for your reference)
Benchmark code:
public static void main(String[] argv) throws Exception {
//Number of ITERATIONS
final int ITERATIONS = 10000000;
//NOP
benchmark(ITERATIONS,"NOP",() -> nop(longHexText));
//isHexadecimal
benchmark(ITERATIONS,"isHexadecimal (OP)",() -> isHexadecimal(longHexText));
//Un-compiled regexp
benchmark(ITERATIONS,"RegExp",() -> longHexText.matches("0x[0-9a-fA-F]+"));
//Pre-compiled regexp
final Pattern pattern = Pattern.compile("0x[0-9a-fA-F]+");
benchmark(ITERATIONS,"RegExp (Compiled)", () -> {
pattern.matcher(longHexText).matches();
});
//isHexadecimal (maraca)
benchmark(ITERATIONS,"isHexadecimal (maraca)",() -> isHexadecimalMaraca(longHexText));
//FastIsHex
benchmark(ITERATIONS,"fastIsHex",() -> fastIsHex(longHexText));
}
public static void benchmark(int iterations,String name,Runnable block) {
//Start Time
long stime = System.currentTimeMillis();
//Benchmark
for(int i = 0; i < iterations; i++) {
block.run();
}
//Done
System.out.println(
String.format("Method \"%s\" => #%d iterations in %dms",name,iterations,(System.currentTimeMillis()-stime))
);
}
NOP method:
public static boolean nop(String value) { return true; }
fastIsHex method:
public static boolean fastIsHex(String value) {
//Value must be at least 4 characters long (0x00)
if(value.length() < 4) {
return false;
}
//Compute where the data starts
int start = ((value.charAt(0) == '-') ? 1 : 0) + 2;
//Check prefix
if(value.charAt(start-2) != '0' || value.charAt(start-1) != 'x') {
return false;
}
//Verify data
for(int i = start; i < value.length(); i++) {
switch(value.charAt(i)) {
case '0':case '1':case '2':case '3':case '4':case '5':case '6':case '7':case '8':case '9':
case 'a':case 'b':case 'c':case 'd':case 'e':case 'f':
case 'A':case 'B':case 'C':case 'D':case 'E':case 'F':
continue;
default:
return false;
}
}
return true;
}
So, the answer is no: for short strings and the task at hand, RegExp is not faster.
When it comes to longer strings, the balance is quite different;
below are the results for the 8192-character hex string, which I generated with:
hexdump -n 8196 -v -e '/1 "%02X"' /dev/urandom
and 10,000 iterations:
Method "NOP" => #10000 iterations in 2ms
Method "isHexadecimal (OP)" => #10000 iterations in 1512ms
Method "RegExp" => #10000 iterations in 1303ms
Method "RegExp (Compiled)" => #10000 iterations in 1263ms
Method "isHexadecimal (maraca)" => #10000 iterations in 553ms
Method "fastIsHex" => #10000 iterations in 530ms
As you can see, the hand-written methods (the one by maraca and my fastIsHex) still beat the RegExp, but the original method does not
(due to substring() and toLowerCase()).
Sidenote:
This benchmark is admittedly very simple and only tests the "worst case" scenario (i.e. a fully valid string); real-life results, with mixed data lengths and a non-zero share of invalid inputs, might be quite different.
Update:
I also gave a try to the char[] array version:
char[] chars = value.toCharArray();
for (idx += 2; idx < chars.length; idx++) { ... }
and it was even a bit slower than the charAt(i) version:
Method "isHexadecimal (maraca) char[] array version" => #10000000 iterations in 194ms
Method "fastIsHex, char[] array version" => #10000000 iterations in 164ms
My guess is that this is due to the array copy inside toCharArray().
Update (#2):
I've run an additional 8K-string / 100,000-iteration test to see if there is any real difference in speed between the "maraca" and "fastIsHex" methods, and have also normalized them to use exactly the same precondition code:
Run #1
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5341ms
Method "fastIsHex" => #100000 iterations in 5313ms
Run #2
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5313ms
Method "fastIsHex" => #100000 iterations in 5334ms
I.e. the speed difference between these two methods is marginal at best, and probably comes down to measurement noise (as I'm running this on my workstation, not a specially set-up clean test environment).
A brute-force approach systematically tests all combinations; that is not what your code does.
You can get better performance from a hand-written procedure. You can take advantage of the data distribution if you know it in advance, or you can make clever shortcuts that apply to your case. But it really is not guaranteed that what you write will automatically be faster than a regex. Regex implementations are optimized too, and you can easily end up with code that is worse.
The code in your question is really nothing special, and it would most probably be on par with the regex. When I tested it, there was no clear winner: sometimes one was faster, sometimes the other, and the difference was small. Your time is limited; think wisely about where you spend it.
You're misusing the term "brute force." A better term is ad hoc custom matching.
Regex interpreters are generally slower than custom pattern matchers. The regex is compiled into a byte code, and compilation takes time. Even ignoring compilation (which might be fine if you compile only once and match a very long string and/or many times so the compilation cost isn't important), machine instructions spent in the matching interpreter are overhead that the custom matcher doesn't have.
In cases where the regex matcher wins out, it's normally that the regex engine is implemented in very fast native code, while the custom matcher is written in something slower.
Now, you can compile regexes to native code that runs just as fast as a well-done custom matcher. This is the approach of e.g. lex/flex and others. But the most common libraries and built-in language facilities (Java, Python, Perl, etc.) don't take this approach; they use interpreters.
Native code-generating libraries tend to be cumbersome to use except maybe in C/C++ where they've been part of the air for decades.
In other languages, I'm a fan of state machines. To me they are easier to understand and get correct than either regexes or custom matchers. Below is one for your problem. State 0 is the start state, and D stands for a hex digit.
Implementation of the machine can be extremely fast. In Java, it might look like this:
static boolean isHex(String s) {
int state = 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
switch (state) {
case 0:
if (c == '-') state = 1;
else if (c == '0') state = 2;
else return false;
break;
case 1:
if (c == '0') state = 2;
else return false;
break;
case 2:
if (c == 'x') state = 3;
else return false;
break;
case 3:
if (isHexDigit(c)) state = 4;
else return false;
break;
case 4:
if (isHexDigit(c)) ; // state already = 4
else return false;
break;
}
}
return true;
}
static boolean isHexDigit(char c) {
return '0' <= c && c <= '9' || 'A' <= c && c <= 'F' || 'a' <= c && c <= 'f';
}
The code isn't super short, but it's a direct translation of the diagram. There's nothing to mess up short of simple typographical errors.
In C, you can implement states as goto labels:
int isHex(char *s) {
char c;
s0:
c = *s++;
if (c == '-') goto s1;
if (c == '0') goto s2;
return 0;
s1:
c = *s++;
if (c == '0') goto s2;
return 0;
s2:
c = *s++;
if (c == 'x') goto s3;
return 0;
s3:
c = *s++;
if (isxdigit(c)) goto s4;
return 0;
s4:
c = *s++;
if (isxdigit(c)) goto s4;
if (c == '\0') return 1;
return 0;
}
This kind of goto matcher written in C is generally the fastest I've seen. On my MacBook using an old gcc (4.6.4), this one compiles to only 35 machine instructions.
Usually what's better depends on your goals. If readability is the main goal (which it should be, unless you've detected a performance issue), then regexes are just fine.
If performance is your goal, then you have to analyze the problem first. E.g. if you know it's either a phone number or a hexadecimal number (and nothing else) then the problem becomes much simpler.
Now let's have a look at your function (performance-wise) to detect hexadecimal numbers:
Getting the substring is bad (it creates a new object); better to work with an index and advance it.
Instead of using toLowerCase(), it's better to compare against both upper- and lower-case letters (the string is only iterated once, no superfluous substitutions are performed, and no new object is created).
So a performance-optimized version could look something like this (you could maybe optimize further by using a char array instead of the string):
public static final boolean isHexadecimal(String value) {
if (value.length() < 3)
return false;
int idx;
if (value.charAt(0) == '-' || value.charAt(0) == '+') { // also supports unary plus
if (value.length() < 4) // necessairy because -0x and +0x are not valid
return false;
idx = 1;
} else {
idx = 0;
}
if (value.charAt(idx) != '0' || value.charAt(idx + 1) != 'x')
return false;
for (idx += 2; idx < value.length(); idx++) {
char c = value.charAt(idx);
if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')))
return false;
}
return true;
}
Well-implemented regular expressions can be faster than a naive brute-force implementation of the same pattern.
On the other hand, you can always implement a faster solution for a specific case.
Also, as stated in the article above, most implementations in popular languages are not efficient (in some cases).
I'd implement my own solution only when performance is an absolute priority, and then with extensive testing and profiling.
To get performance better than naive hand-coded validators, you may use a regular expression library based on deterministic automata, e.g. Brics Automaton.
I wrote a short JMH benchmark:
@State(Scope.Thread)
public abstract class MatcherBenchmark {
private String longHexText;
@Setup
public void setup() {
initPattern("0x[0-9a-fA-F]+");
this.longHexText = "0x123fa";
}
public abstract void initPattern(String pattern);
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
@Fork(1)
public void benchmark() {
boolean result = benchmark(longHexText);
if (!result) {
throw new RuntimeException();
}
}
public abstract boolean benchmark(String text);
@TearDown
public void tearDown() {
donePattern();
this.longHexText = null;
}
public abstract void donePattern();
}
and implemented it with:
@Override
public void initPattern(String pattern) {
RegExp r = new RegExp(pattern);
this.automaton = new RunAutomaton(r.toAutomaton(true));
}
@Override
public boolean benchmark(String text) {
return automaton.run(text);
}
I also created benchmarks for Zeppelin's and Gene's solutions, for the compiled java.util.regex solution, and for a solution with rexlex. These are the results of the JMH benchmark on my machine:
BricsMatcherBenchmark.benchmark      avgt  10  0.014 ± 0.001  us/op
GenesMatcherBenchmark.benchmark      avgt  10  0.017 ± 0.001  us/op
JavaRegexMatcherBenchmark.benchmark  avgt  10  0.097 ± 0.005  us/op
RexlexMatcherBenchmark.benchmark     avgt  10  0.061 ± 0.002  us/op
ZeppelinsBenchmark.benchmark         avgt  10  0.008 ± 0.001  us/op
Starting the same benchmark with a non-hex string 0x123fax produces the following results (note: I inverted the validation for this benchmark):
BricsMatcherBenchmark.benchmark      avgt  10  0.015 ± 0.001  us/op
GenesMatcherBenchmark.benchmark      avgt  10  0.019 ± 0.001  us/op
JavaRegexMatcherBenchmark.benchmark  avgt  10  0.102 ± 0.001  us/op
RexlexMatcherBenchmark.benchmark     avgt  10  0.052 ± 0.002  us/op
ZeppelinsBenchmark.benchmark         avgt  10  0.009 ± 0.001  us/op
Regexes have a great many advantages, but they do still come with a performance cost.
Which language is smart enough to understand a = 0, 20, ..., 300? That is, a language that lets you easily create arrays by giving a step, a start value and a last value (or, better, no last value at all, a la infinite arrays), and not only for numbers, but even for complex numbers and custom structures like sedenions, which you would probably define yourself as a class.
The point is to find a language (or an algorithm usable in a language) that can catch the law by which the array members you've given (or their parameters) change, and compose from that law a structure from which you can get any member.
To everyone: the examples you provide are very helpful for all the beginners out there, and at the same time they are the basic knowledge required to build such a 'smart array' class. So thank you very much for your enthusiastic help.
As JeffSahol noticed
all possible rules might include some
that require evaluation of some/all
existing members to generate the nth
member.
So it is a hard question. And I think a language that would do it 'naturally' would be great to play/work with, hopefully not only for mathematicians.
Haskell:
Prelude> let a=[0,20..300]
Prelude> a
[0,20,40,60,80,100,120,140,160,180,200,220,240,260,280,300]
btw: infinite lists are possible, too:
Prelude> let a=[0,20..]
Prelude> take 20 a
[0,20,40,60,80,100,120,140,160,180,200,220,240,260,280,300,320,340,360,380]
Excel:
Write 0 in A1
Write 20 in A2
Select A1:A2
Drag the corner downwards
MATLAB:
a = [0:20:300]
F#:
> let a = [|0..20..300|];;
val a : int [] =
[|0; 20; 40; 60; 80; 100; 120; 140; 160; 180; 200; 220; 240; 260; 280; 300|]
With complex numbers:
let c1 = Complex.Create( 0.0, 0.0)
let c2 = Complex.Create(10.0, 10.0)
let a = [|c1..c2|]
val a : Complex [] =
[|0r+0i; 1r+0i; 2r+0i; 3r+0i; 4r+0i; 5r+0i; 6r+0i; 7r+0i; 8r+0i; 9r+0i; 10r+0i|]
As you can see it increments only the real part.
If the step is a complex number too, it will increment the real part AND the imaginary part, till the last var real part has been reached:
let step = Complex.Create(2.0, 1.0)
let a = [|c1..step..c2|]
val a: Complex [] =
[|0r+0i; 2r+1i; 4r+2i; 6r+3i; 8r+4i; 10r+5i|]
Note that if this behavior doesn't match your needs you still can overload (..) and (.. ..) operators. E.g. you want that it increments the imaginary part instead of the real part:
let (..) (c1:Complex) (c2:Complex) =
seq {
for i in 0..int(c2.i-c1.i) do
yield Complex.Create(c1.r, c1.i + float i)
}
let a = [|c1..c2|]
val a : Complex [] =
[|0r+0i; 0r+1i; 0r+2i; 0r+3i; 0r+4i; 0r+5i; 0r+6i; 0r+7i; 0r+8i; 0r+9i; 0r+10i|]
And PHP:
$a = range(0, 300, 20);
Wait...
Python:
print range(0, 320, 20)
gives
[0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300]
Props to the comments (I knew there was a more succinct way :P)
Scala:
scala> val a = 0 to 100 by 20
a: scala.collection.immutable.Range = Range(0, 20, 40, 60, 80, 100)
scala> a foreach println
0
20
40
60
80
100
Infinite Lists:
scala> val b = Stream from 1
b: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> b take 5 foreach println
1
2
3
4
5
In Python you have
a = xrange(start, stop, step)
(or simply range in Python 3)
This gives you an iterator from start to stop. It is evaluated lazily, so the full list is never materialized in memory.
>>> a = xrange(0, 300, 20)
>>> for item in a: print item
...
0
20
40
60
80
100
120
140
160
180
200
220
240
260
280
And C++ too [use FC++ library]:
// List is different from STL list
List<int> integers = enumFrom(1); // Lazy list of all numbers starting from 1
// filter and ptr_to_fun definitions provided by FC++
// The idea is to _filter_ prime numbers in this case
// prime is user provided routine that checks if a number is prime
// So the end result is a list of infinite primes :)
List<int> filtered_nums = filter( ptr_to_fun(&prime), integers );
FC++ lazy list implementation: http://www.cc.gatech.edu/~yannis/fc++/New/new_list_implementation.html
More details: http://www.cc.gatech.edu/~yannis/fc++/
Arpan
Groovy,
assert [ 1, *3..5, 7, *9..<12 ] == [1,3,4,5,7,9,10,11]
The SWYM language, which appears to no longer be online, could infer arithmetic and geometric progressions from a few example items and generate an appropriate list.
I believe the syntax in Perl 6 is start, * + increment_value ... end, for example 0, *+20 ... 300
You should instead use math.
- (int) infiniteList: (int)x
{
return (x*20);
}
The "smart" arrays use this format, since I seriously doubt Haskell could let you do this:
a[1] = 15
after defining a.
C#, for example, implements Enumerable.Range(int start, int count); PHP offers the function range(mixed low, mixed high, number step); ... There are programming languages that are "smart" enough.
Besides that, an infinite array is pretty much useless: it wouldn't really be infinite, just all-memory-consuming.
You cannot do this enumeration so simply with complex numbers, as there is no direct successor or predecessor for a given complex number. Edit: this does not mean that you cannot compare complex numbers or create an array with a specified step!
I may be misunderstanding the question, but the answers that show how to code the specific example you gave (counting by 20s) don't really meet the requirement that the array "cache" an arbitrary rule for generating its members. It seems that almost any complete solution would require a custom collection class that generates members with a delegated function/method, especially since the possible rules include some that require evaluating some or all of the existing members to generate the nth member.
Just about any programming language can give you this sequence; the question is what syntax you want to use to express it. For example, in C# you can write:
Enumerable.Range(0, 301).Where(x => (x % 20) == 0)
or
for (int i = 0; i < 300; i += 20) yield return i;
or encapsulated in a class:
new ArithmeticSequence(0, 301, 20);
or in a method in a static class:
Enumerable2.ArithmeticSequence(0, 301, 20);
So, what is your criteria?
Assembly:
Assuming edi contains the address of the desired array:
xor eax, eax
loop_location:
mov [edi], eax
add edi, #4
add eax, #20
cmp eax, #300
jle loop_location
MATLAB
It is not a programming language itself but a tool, yet you can still use it like a programming language.
It is built for exactly this kind of mathematical operation, so arrays are a breeze :)
a = 0:1:20;
creates an array from 0 to 20 with an increment of 1.
Instead of the number 1 you can provide any value/expression as the increment.
PHP always does things much more simply, and sometimes dangerously simply too :)
Well… Java is the only language I've ever seriously used that couldn't do that (although I believe using a Vector instead of an Array allowed that).
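For the record, later Java versions narrow that gap with streams. Here is a hedged sketch of mine (the stepRange helper is not a standard API) that builds the 0, 20, ..., 300 sequence from a lazily generated IntStream:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class StepRange {
    // IntStream.iterate produces a lazy, conceptually unbounded sequence;
    // limit(count) takes only the elements we want before materializing.
    static int[] stepRange(int start, int step, int count) {
        return IntStream.iterate(start, i -> i + step)
                        .limit(count)
                        .toArray();
    }

    public static void main(String[] args) {
        // 16 elements: 0, 20, ..., 300
        System.out.println(Arrays.toString(stepRange(0, 20, 16)));
    }
}
```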