Regex vs brute-force for small strings

When testing small strings (e.g. isPhoneNumber or isHexadecimal), is there a performance benefit from using regular expressions, or would brute-forcing them be faster? Wouldn't brute-forcing them, by just checking whether the given string's chars are within a specified range, be faster than using a regex?
For example:
public static boolean isHexadecimal(String value)
{
    if (value.startsWith("-"))
    {
        value = value.substring(1);
    }
    value = value.toLowerCase();
    if (value.length() <= 2 || !value.startsWith("0x"))
    {
        return false;
    }
    for (int i = 2; i < value.length(); i++)
    {
        char c = value.charAt(i);
        if (!(c >= '0' && c <= '9' || c >= 'a' && c <= 'f'))
        {
            return false;
        }
    }
    return true;
}
vs.
Regex.match(/0x[0-9a-f]+/, "0x123fa") // returns true if the regex matches the whole given string
It seems like there would be some overhead associated with the regex, even when the pattern is pre-compiled, simply because regular expressions have to work in many general cases. In contrast, the brute-force method does exactly what is required and no more. Am I missing some optimization that regular expressions have?

Checking whether string characters are within a certain range is exactly what regular expressions are built to do. The engine converts the expression into a compact series of matching instructions; it is essentially writing out your manual parsing steps, but at a lower level.
What tends to be slow with regular expressions is the conversion of the expression into those instructions. You see real performance gains when a regex is used more than once: that's when you can compile the expression ahead of time and then simply apply the resulting compiled instructions in a match, search, replace, etc.
As is the case with anything to do with performance, perform some tests and measure the results.
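For instance, in Java the ahead-of-time compilation might look like this (a minimal sketch of my own; the pattern string follows the one used in the benchmarks below):
import java.util.regex.Pattern;

public class HexCheck {
    // Compiled once and reused; Pattern.compile() is the expensive step.
    private static final Pattern HEX = Pattern.compile("0x[0-9a-fA-F]+");

    public static boolean isHexadecimal(String value) {
        // matches() anchors at both ends, so the whole string must match.
        return HEX.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHexadecimal("0x123fa"));  // true
        System.out.println(isHexadecimal("0x123fax")); // false
    }
}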

I've written a small benchmark to estimate the performance of:
the NOP method (to get an idea of the baseline iteration speed);
the original method, as provided by the OP;
RegExp;
compiled RegExp;
the version provided by @maraca (without toLowerCase and substring);
the "fastIsHex" version (switch-based), which I added just for fun.
The test machine configuration is as follows:
JVM: Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
CPU: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
And here are the results I got for the original test string "0x123fa" and 10,000,000 iterations:
Method "NOP" => #10000000 iterations in 9ms
Method "isHexadecimal (OP)" => #10000000 iterations in 300ms
Method "RegExp" => #10000000 iterations in 4270ms
Method "RegExp (Compiled)" => #10000000 iterations in 1025ms
Method "isHexadecimal (maraca)" => #10000000 iterations in 135ms
Method "fastIsHex" => #10000000 iterations in 107ms
As you can see, even the original method by the OP is faster than the RegExp method (at least when using the JDK-provided RegExp implementation).
For your reference, the benchmark code:
import java.util.regex.Pattern;

// The string under test; "0x123fa" for the short-string run (declaration added for completeness).
static final String longHexText = "0x123fa";

public static void main(String[] argv) throws Exception {
    // Number of iterations
    final int ITERATIONS = 10000000;
    // NOP
    benchmark(ITERATIONS, "NOP", () -> nop(longHexText));
    // isHexadecimal
    benchmark(ITERATIONS, "isHexadecimal (OP)", () -> isHexadecimal(longHexText));
    // Un-compiled regexp
    benchmark(ITERATIONS, "RegExp", () -> longHexText.matches("0x[0-9a-fA-F]+"));
    // Pre-compiled regexp
    final Pattern pattern = Pattern.compile("0x[0-9a-fA-F]+");
    benchmark(ITERATIONS, "RegExp (Compiled)", () -> {
        pattern.matcher(longHexText).matches();
    });
    // isHexadecimal (maraca)
    benchmark(ITERATIONS, "isHexadecimal (maraca)", () -> isHexadecimalMaraca(longHexText));
    // fastIsHex
    benchmark(ITERATIONS, "fastIsHex", () -> fastIsHex(longHexText));
}

public static void benchmark(int iterations, String name, Runnable block) {
    // Start time
    long stime = System.currentTimeMillis();
    // Benchmark
    for (int i = 0; i < iterations; i++) {
        block.run();
    }
    // Done
    System.out.println(
        String.format("Method \"%s\" => #%d iterations in %dms", name, iterations, (System.currentTimeMillis() - stime))
    );
}
NOP method:
public static boolean nop(String value) { return true; }
fastIsHex method:
public static boolean fastIsHex(String value) {
    //Value must be at least 4 characters long (0x00)
    if (value.length() < 4) {
        return false;
    }
    //Compute where the data starts
    int start = ((value.charAt(0) == '-') ? 1 : 0) + 2;
    //Check prefix
    if (value.charAt(start - 2) != '0' || value.charAt(start - 1) != 'x') {
        return false;
    }
    //Verify data
    for (int i = start; i < value.length(); i++) {
        switch (value.charAt(i)) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
            case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
            case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                continue;
            default:
                return false;
        }
    }
    return true;
}
So the answer is no: for short strings and the task at hand, RegExp is not faster.
When it comes to longer strings the balance is quite different; below are the results for an 8192-character hex string, which I generated with:
hexdump -n 8196 -v -e '/1 "%02X"' /dev/urandom
and 10,000 iterations:
Method "NOP" => #10000 iterations in 2ms
Method "isHexadecimal (OP)" => #10000 iterations in 1512ms
Method "RegExp" => #10000 iterations in 1303ms
Method "RegExp (Compiled)" => #10000 iterations in 1263ms
Method "isHexadecimal (maraca)" => #10000 iterations in 553ms
Method "fastIsHex" => #10000 iterations in 530ms
As you can see, the hand-written methods (the one by maraca and my fastIsHex) still beat the RegExp, but the original method does not (due to substring() and toLowerCase()).
Sidenote:
This benchmark is very simple indeed and only tests the "worst case" scenario (i.e. a fully valid string); real-life results, with mixed data lengths and a non-zero ratio of valid to invalid inputs, might be quite different.
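For illustration, one way to build such a mixed input set might be the following (a sketch only; the length range and the roughly 50/50 valid-to-invalid split are my own assumptions, not part of this benchmark):
import java.util.Random;

// Builds a pool of hex-like strings of varying lengths, roughly half of
// which are corrupted with a non-hex character, for a more realistic test.
static String[] buildMixedInputs(Random rnd, int count) {
    String[] inputs = new String[count];
    for (int n = 0; n < count; n++) {
        int len = 1 + rnd.nextInt(32); // 1..32 hex digits
        StringBuilder sb = new StringBuilder("0x");
        for (int j = 0; j < len; j++) {
            sb.append("0123456789abcdef".charAt(rnd.nextInt(16)));
        }
        if (rnd.nextBoolean()) {
            // Corrupt one digit position so the string is invalid.
            sb.setCharAt(2 + rnd.nextInt(len), 'z');
        }
        inputs[n] = sb.toString();
    }
    return inputs;
}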
Update:
I also gave the char[] array version a try:
char[] chars = value.toCharArray();
for (idx += 2; idx < chars.length; idx++) { ... }
and it was even a bit slower than the charAt(i) version:
Method "isHexadecimal (maraca) char[] array version" => #10000000 iterations in 194ms
Method "fastIsHex, char[] array version" => #10000000 iterations in 164ms
My guess is that this is due to the array copy inside toCharArray().
Update (#2):
I've run an additional 8 K / 100,000-iteration test to see if there is any real difference in speed between the "maraca" and "fastIsHex" methods, and I have also normalized them to use exactly the same precondition code:
Run #1
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5341ms
Method "fastIsHex" => #100000 iterations in 5313ms
Run #2
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5313ms
Method "fastIsHex" => #100000 iterations in 5334ms
I.e. the speed difference between these two methods is marginal at best, and is probably due to measurement error (as I'm running this on my workstation, not in a specially set up clean test environment).

A brute-force approach means systematically testing all combinations; that is not really your case.
You can get better performance from a hand-written procedure. You can take advantage of the data distribution if you know it in advance, or you can make some clever shortcuts that apply to your case. But it really is not guaranteed that what you write will automatically be faster than a regex. Regex implementations are optimized too, and you can easily end up with code that is worse than them.
The code in your question is really nothing special, and most probably it would be on par with the regex. When I tested it, there was no clear winner; sometimes one was faster, sometimes the other, and the difference was small. Your time is limited; think wisely about where you spend it.

You're misusing the term "brute force." A better term is ad hoc custom matching.
Regex interpreters are generally slower than custom pattern matchers. The regex is compiled into a bytecode, and that compilation takes time. Even ignoring compilation (which might be fine if you compile only once and then match a very long string and/or match many times, so the compilation cost isn't important), the machine instructions spent in the matching interpreter are overhead that the custom matcher doesn't have.
In cases where the regex matcher wins out, it's normally that the regex engine is implemented in very fast native code, while the custom matcher is written in something slower.
Now you can compile regexes to native code that runs just as fast as a well-done custom matcher. This is the approach of e.g. lex/flex and others. But the most common libraries and built-in language engines don't take this approach (Java, Python, Perl, etc.); they use interpreters.
Native-code-generating libraries tend to be cumbersome to use, except maybe in C/C++ where they've been part of the air for decades.
In other languages, I'm a fan of state machines. To me they are easier to understand and get correct than either regexes or custom matchers. Below is one for your problem: state 0 is the start state, and the accepting state (4) is reached only after "0x" (optionally preceded by a minus sign) and at least one hex digit.
Implementation of the machine can be extremely fast. In Java, it might look like this:
static boolean isHex(String s) {
    int state = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        switch (state) {
            case 0:
                if (c == '-') state = 1;
                else if (c == '0') state = 2;
                else return false;
                break;
            case 1:
                if (c == '0') state = 2;
                else return false;
                break;
            case 2:
                if (c == 'x') state = 3;
                else return false;
                break;
            case 3:
                if (isHexDigit(c)) state = 4;
                else return false;
                break;
            case 4:
                if (isHexDigit(c)) ; // state already = 4
                else return false;
                break;
        }
    }
    return state == 4; // accept only if at least one hex digit was seen
}

static boolean isHexDigit(char c) {
    return '0' <= c && c <= '9' || 'A' <= c && c <= 'F' || 'a' <= c && c <= 'f';
}
The code isn't super short, but it's a direct translation of the state machine. There's nothing to mess up short of simple typographical errors.
In C, you can implement states as goto labels:
#include <ctype.h> /* for isxdigit */

int isHex(char *s) {
    char c;
s0:
    c = *s++;
    if (c == '-') goto s1;
    if (c == '0') goto s2;
    return 0;
s1:
    c = *s++;
    if (c == '0') goto s2;
    return 0;
s2:
    c = *s++;
    if (c == 'x') goto s3;
    return 0;
s3:
    c = *s++;
    if (isxdigit(c)) goto s4;
    return 0;
s4:
    c = *s++;
    if (isxdigit(c)) goto s4;
    if (c == '\0') return 1;
    return 0;
}
This kind of goto matcher written in C is generally the fastest I've seen. On my MacBook using an old gcc (4.6.4), this one compiles to only 35 machine instructions.

Usually what's better depends on your goals. If readability is the main goal (as it should be, unless you have detected a performance issue), then regexes are just fine.
If performance is your goal, then you have to analyze the problem first. E.g. if you know it's either a phone number or a hexadecimal number (and nothing else), then the problem becomes much simpler.
Now let's have a look at your function (performance-wise) to detect hexadecimal numbers:
Taking the substring is bad (creating a new object in general); better to work with an index and advance it.
Instead of using toLowerCase(), it's better to compare against both the upper- and lower-case letters (the string is only iterated once, no superfluous substitutions are performed, and no new object is created).
So a performance-optimized version could look something like this (you can maybe optimize further by using a char array instead of the string; a sketch of that variant follows the method):
public static final boolean isHexadecimal(String value) {
    if (value.length() < 3)
        return false;
    int idx;
    if (value.charAt(0) == '-' || value.charAt(0) == '+') { // also supports unary plus
        if (value.length() < 4) // necessary because -0x and +0x are not valid
            return false;
        idx = 1;
    } else {
        idx = 0;
    }
    if (value.charAt(idx) != '0' || value.charAt(idx + 1) != 'x')
        return false;
    for (idx += 2; idx < value.length(); idx++) {
        char c = value.charAt(idx);
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')))
            return false;
    }
    return true;
}
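A sketch of that char-array variant, for reference (my own illustration rather than part of the answer; note that the benchmark update in the earlier answer found the copy made by toCharArray() can actually make this slightly slower):
public static final boolean isHexadecimalChars(String value) {
    if (value.length() < 3)
        return false;
    char[] chars = value.toCharArray(); // note: copies the backing array
    int idx = (chars[0] == '-' || chars[0] == '+') ? 1 : 0;
    if (idx == 1 && chars.length < 4) // -0x and +0x are not valid
        return false;
    if (chars[idx] != '0' || chars[idx + 1] != 'x')
        return false;
    for (idx += 2; idx < chars.length; idx++) {
        char c = chars[idx];
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')))
            return false;
    }
    return true;
}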

Well-implemented regular expressions can be faster than a naive brute-force implementation of the same pattern.
On the other hand, you can always implement a faster solution for a specific case.
Also, the regex implementations in most popular languages are not efficient in some cases.
I'd implement my own solution only when performance is an absolute priority, and only with extensive testing and profiling.

To get performance that is better than naive hand-coded validators, you may use a regular-expression library that is based on deterministic automata, e.g. Brics Automaton.
I wrote a short JMH benchmark:
@State(Scope.Thread)
public abstract class MatcherBenchmark {

    private String longHexText;

    @Setup
    public void setup() {
        initPattern("0x[0-9a-fA-F]+");
        this.longHexText = "0x123fa";
    }

    public abstract void initPattern(String pattern);

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @Warmup(iterations = 10)
    @Measurement(iterations = 10)
    @Fork(1)
    public void benchmark() {
        boolean result = benchmark(longHexText);
        if (!result) {
            throw new RuntimeException();
        }
    }

    public abstract boolean benchmark(String text);

    @TearDown
    public void tearDown() {
        donePattern();
        this.longHexText = null;
    }

    public abstract void donePattern();
}
and implemented it with:
@Override
public void initPattern(String pattern) {
    RegExp r = new RegExp(pattern);
    this.automaton = new RunAutomaton(r.toAutomaton(true));
}

@Override
public boolean benchmark(String text) {
    return automaton.run(text);
}
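For comparison, the corresponding java.util.regex implementation of the same abstract benchmark might look like this (a sketch; the compiled field is my own naming):
private java.util.regex.Pattern compiled;

@Override
public void initPattern(String pattern) {
    // Compile once during setup, outside the measured benchmark method.
    this.compiled = java.util.regex.Pattern.compile(pattern);
}

@Override
public boolean benchmark(String text) {
    return compiled.matcher(text).matches();
}

@Override
public void donePattern() {
    this.compiled = null;
}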
I also created benchmarks for Zeppelin's and Gene's solutions, the compiled java.util.regex solution (as sketched above), and a solution with rexlex. These are the results of the JMH benchmark on my machine:
BricsMatcherBenchmark.benchmark     avgt  10  0,014 ± 0,001 us/op
GenesMatcherBenchmark.benchmark     avgt  10  0,017 ± 0,001 us/op
JavaRegexMatcherBenchmark.benchmark avgt  10  0,097 ± 0,005 us/op
RexlexMatcherBenchmark.benchmark    avgt  10  0,061 ± 0,002 us/op
ZeppelinsBenchmark.benchmark        avgt  10  0,008 ± 0,001 us/op
Starting the same benchmark with the non-hex string 0x123fax produces the following results (note: I inverted the validation for this benchmark):
BricsMatcherBenchmark.benchmark     avgt  10  0,015 ± 0,001 us/op
GenesMatcherBenchmark.benchmark     avgt  10  0,019 ± 0,001 us/op
JavaRegexMatcherBenchmark.benchmark avgt  10  0,102 ± 0,001 us/op
RexlexMatcherBenchmark.benchmark    avgt  10  0,052 ± 0,002 us/op
ZeppelinsBenchmark.benchmark        avgt  10  0,009 ± 0,001 us/op

Regexes have a great many advantages, but they still come with a performance cost.

Related

Enforce Compile Time Branch Determinism in a Loop with If Constexpr in C++17

Suppose we have the following code snippet in C++17:
for (int8 l = 0; l < 6; l++)
{
    if constexpr (l % 2 == 0)
    {
        if (some_runtime_number > some_other_runtime_number)
        {
            // Do stuff
        }
    }
    else
    {
        if (some_runtime_number < some_other_runtime_number)
        {
            // Do stuff
        }
    }
}
Will this actually evaluate at compile time, so that the if statement switches on every iteration?
The actual question is: can I enforce the constexpr so that this is handled at compile time, rather than paying for the extra if on every iteration?
I expect the code to function like this:
When l = 0, it uses the > if statement
When l = 1, it uses the < if statement
When l = 2, it uses the > if statement
etc.
As far as I understand, this could be done because all the numbers are known at compile time (the loop uses literals), but I would like a more knowledgeable individual to clarify that this is valid and maybe provide a better understanding, especially in regard to how this gets translated into machine code, because the goal is to erase the if constexpr (l % 2 == 0) evaluation at runtime completely.

I just created an extremely fast way to sort primes. How do I improve it?

Basically, how it works is it converts a number into a string, and if it finds any even digit in the string then it gives the foundEven variable a positive value. The same goes for odd digits.
(One thing I don't get is why, if I switch the '>' sign with a '<' in if (FoundEvenSignedInt < FoundOddSignedInt), it gives you the correct result of an odd number.)
Are there any ways I could improve the code? Are there any bugs in it? I'm fairly new at C++ programming.
#include <string>
#include <cstddef>

int IsPrime(long double a)
{
    int var;
    long double AVar = a;
    signed int FoundEvenSignedInt, FoundOddSignedInt;
    std::string str = std::to_string(a);
    std::size_t foundEven = str.find_last_of("2468");
    std::size_t foundOdd = str.find_last_of("3579");
    FoundEvenSignedInt = foundEven;
    FoundOddSignedInt = foundOdd;
    if (FoundEvenSignedInt < FoundOddSignedInt)
    {
        var = 1;
        goto EndOfIsPrimeFunction;
    }
    if (FoundEvenSignedInt > FoundOddSignedInt)
    {
        var = 2;
        goto EndOfIsPrimeFunction;
    }
    // This if statement kept giving me this weird warning so I made it like this
    if (FoundEvenSignedInt == -1)
    {
        if (FoundOddSignedInt == -1)
        {
            if (AVar == 10 || 100 || 1000 || 10000 || 100000 || 1000000)
            {
                var = 2;
                goto EndOfIsPrimeFunction;
            }
        }
    }
EndOfIsPrimeFunction:
    return var;
}
Here are some ways to improve the code.
The Collatz conjecture is about integers. long double is a floating-point data type, so it is unsuitable for checking the conjecture. You need to work with an integral data type such as unsigned long long. If that doesn't have enough range for you, you need to work with some kind of bignum data type; there isn't one in the standard C library, so you'd need a third-party one.
The Collatz conjecture has nothing to do with being prime. It is about even and odd integers. It is true that all prime numbers except 2 are odd, but this fact doesn't help you.
The data type for answering yes/no questions in C++ is bool. By convention, for any other numeric data type, zero means "no" and all other values mean "yes" (technically, when converted to bool, zero is converted to false and other values to true, so you can do things like if (a % 2)). A function that returns 1 and 2 for yes and no is highly unconventional.
A natural method of checking whether a number is odd is this:
bool isOdd(unsigned long long a)
{
    return a % 2;
}
It is somewhat faster than your code (by a factor of about 400 on my computer), gives correct results every time, is readable, and has zero goto statements.
Instead of the if (AVar == 10 || 100 || ...), you can say if (!(AVar % 10)).

Separating every second digit in an integer C++

I am currently finishing up an assignment for my OOP class and I am struggling with one part in particular. Keep in mind I am still a beginner. The question is as follows:
If the string contains 13 characters, all of the characters are digits, and the check digit is modulo 10, this function returns true; false otherwise.
This is in regards to an EAN. I basically have to separate every second digit from the rest of the digits. For example, in 9780003194876 I need to do calculations with 7, 0, 0, 1, 4, 7. I have no clue how to do this.
Any help would be greatly appreciated!
bool isValid(const char* str) {
    if (atoi(str) == 13) {
    }
    return false;
}
You can start with a for loop that increments by 2 on each iteration:
for (int i = 1, len = strlen(str); i < len; i += 2)
{
    int digit = str[i] - '0';
    // do something with digit
}
The above is just an example though...
Since the question was tagged as C++ (not C, so I suggest other answerers not solve this using C libraries, please; let's get the OP's C++ knowledge started the right way from the beginning), and it is an OOP class, I'm going to solve this the C++ way: use the std::string class:
bool is_valid(const std::string& str)
{
    if (str.size() == 13)
    {
        for (std::size_t i = 0; i < 13; i += 2)
        {
            int digit = str[i] - '0';
            // Do what you want with the digit
        }
        return true; // placeholder so all paths return; fold your digit checks into this
    }
    else
        return false;
}
First, if it's EAN, you have to process every digit, not just every other one. In fact, all you need to do is compute a weighted sum of the digits; for EAN-13, the weights alternate between 1 and 3, starting with three. The simplest solution is probably to put them in a table (i.e. int weight[] = { 1, 3, 1, 3, ... };) and iterate over the string (in this case using an index rather than iterators, since you want to be able to index into weight as well), converting each digit into a numerical value (str[i] - '0', if isdigit(static_cast<unsigned char>(str[i])) is true; if it's false, you haven't got a digit), multiplying it by the corresponding weight, and adding it to the running total. When you're finished, if the total, modulo 10, is 0, it's correct. Otherwise, it isn't.
You certainly don't want to use atoi, since you don't want the numerical value of the string; you want to treat each digit separately.
Just for the record, professionally, I'd write something like:
bool
isValidEAN13(std::string const& value)
{
    return value.size() == 13
        && std::find_if(
               value.begin(),
               value.end(),
               [](unsigned char ch) { return !isdigit(ch); })
           == value.end()
        && calculateEAN13(value) == value.back() - '0';
}
where calculateEAN13 does the actual calculations (and can be used for both generation and checking). I suspect that this goes beyond the goal of the assignment, however, and that all your teacher is looking for is the calculateEAN13 function, with the last check (which is why I'm not giving it in full).

Wildcard String Search Algorithm

In my program I need to search a quite big string (~1 MB) for a relatively small substring (< 1 KB).
The problem is that the string contains simple wildcards in the sense of "a?c", which means I want to search for strings like "abc" or also "apc", etc. (I am only interested in the first occurrence).
Until now I have used the trivial approach (here in pseudocode):
algorithm "search", input: haystack (string), needle (string)
    for (i = 0, i < length(haystack), ++i)
        if (!CompareMemory(haystack+i, needle, length(needle)))
            return i;
    return -1; (Not found)
Where "CompareMemory" returns 0 iff the first and second argument are identical (also concerning wildcards) only regarding the amount of bytes the third argument gives.
My question is now if there is a fast algorithm for this (you don't have to give it, but if you do I would prefer c++, c or pseudocode). I started here
but I think most of the fast algorithms don't allow wildcards (by the way they exploit the nature of strings).
I hope the format of the question is ok because I am new here, thank you in advance!
A fast way, which is kind of the same thing as using a regexp (which I would recommend anyway), is to find something that is fixed in needle ("a", but not "?"), search for it, and then see if you've got a complete match:
j = firstNonWildcardPos(needle)
for (i = j, i < length(haystack)-length(needle)+j, ++i)
    if (haystack[i] == needle[j])
        if (!CompareMemory(haystack+i-j, needle, length(needle)))
            return i;
return -1; (Not found)
A regexp would generate code similar to this (I believe).
Among strings over an alphabet of c characters, let S have length s and let T_1 ... T_k have average length b. S will be searched for each of the k target strings. (The problem statement doesn't mention multiple searches of a given string; I mention it below because in that paradigm my program does well.)
The program uses O(s+c) time and space for setup, and (if S and the T_i are random strings) O(k*u*s/c) + O(k*b + k*b*s/c^u) total time for searching, with u=3 in program as shown. For longer targets, u should be increased, and rare, widely-separated key characters chosen.
In step 1, the program creates an array L of s+TsizMax integers (in program, TsizMax = allowed target length) and uses it for c lists of locations of next occurrences of characters, with list heads in H[] and tails in T[]. This is the O(s+c) time and space step.
In step 2, the program repeatedly reads and processes target strings. Step 2A chooses u = 3 different non-wild key characters (in current target). As shown, the program just uses the first three such characters; with a tiny bit more work, it could instead use the rarest characters in the target, to improve performance. Note, it doesn't cope with targets with fewer than three such characters.
The line "L[T[r]] = L[g+i] = g+i;" within Step 2A sets up a guard cell in L with proper delta offset so that Step 2G will automatically execute at end of search, without needing any extra testing during the search. T[r] indexes the tail cell of the list for character r, so cell L[g+i] becomes a new, self-referencing, end-of-list for character r. (This technique allows the loops to run with a minimum of extraneous condition testing.)
Step 2B sets vars a,b,c to head-of-list locations, and sets deltas dab, dac, and dbc corresponding to distances between the chosen key characters in target.
Step 2C checks if key characters appear in S. This step is necessary because otherwise a while loop in Step 2E will hang. We don't want more checks within those while loops because they are the inner loops of search.
Step 2D performs steps 2E to 2I until var c points past the end of S, at which point it is impossible to make any more matches.
Step 2E consists of u = 3 while loops, that "enforce delta distances", that is, crawl indexes a,b,c along over each other as long as they are not pattern-compatible. The while loops are fairly fast, each being in essence (with ++si instrumentation removed) "while (v+d < w) v = L[v]" for various v, d, w. Replicating the three while loops a few times may increase performance a little and will not change net results.
In Step 2G, we know that the u key characters match, so we do a complete compare of target to match point, with wild-character handling. Step 2H reports result of compare. Program as given also reports non-matches in this section; remove that in production.
Step 2I advances all the key-character indexes, because none of the currently-indexed characters can be the key part of another match.
You can run the program to see a few operation-count statistics. For example, the output
Target 5=<de?ga>
012345678901234567890123456789012345678901
abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg
# 17, de?ga and de3ga match
# 24, de?ga and defg4 differ
# 31, de?ga and defga match
Advances: 'd' 0+3 'e' 3+3 'g' 3+3 = 6+9 = 15
shows that Step 2G was entered 3 times (ie, the key characters matched 3 times); the full compare succeeded twice; step 2E while loops advanced indexes 6 times; step 2I advanced indexes 9 times; there were 15 advances in all, to search the 42-character string for the de?ga target.
/* jiw
   $Id: stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $
   Re: Concept-code for searching a long string for short targets,
   where targets may contain wildcard characters.
   The user can enter any number of targets as command line parameters.
   This code has 2 long strings available for testing; if the first
   character of the first parameter is '1' the jay[42] string is used,
   else kay[321].
   Eg, for tests with *hay = jay use command like
     ./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee? ddee?f
   or with *hay = kay,
     ./stringsearch bc?e? jih? pa?j ?av??j
   to exercise program.
   Copyright 2011 James Waldby. Offered without warranty
   under GPL v3 terms as at http://www.gnu.org/licenses/gpl.html
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
//================================================
int main(int argc, char *argv[]) {
    char jay[]="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg";
    char kay[]="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja"
        "etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii"
        "etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii"
        "ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach"
        "ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch";
    char *hay = (argc>1 && argv[1][0]=='1')? jay:kay;
    enum { chars=1<<CHAR_BIT, TsizMax=40, Lsiz=TsizMax+sizeof kay, L1, L2 };
    int L[L2], H[chars], T[chars], g, k, par;

    // Step 1. Make arrays L, H, T.
    for (k=0; k<chars; ++k) H[k] = T[k] = L1;  // Init H and T
    for (g=0; hay[g]; ++g) {  // Make linked character lists for hay.
        k = hay[g];           // In same loop, could count char freqs.
        if (T[k]==L1) H[k] = T[k] = g;
        T[k] = L[T[k]] = g;
    }

    // Step 2. Read and process target strings.
    for (par=1; par<argc; ++par) {
        int alpha[3], at[3], a=g, b=g, c=g, da, dab, dbc, dac, i, j, r;
        char * targ = argv[par];
        enum { wild = '?' };
        int sa=0, sb=0, sc=0, ta=0, tb=0, tc=0;
        printf ("Target %d=<%s>\n", par, targ);

        // Step 2A. Choose 3 non-wild characters to follow.
        // As is, chooses first 3 non-wilds for a,b,c.
        // Could instead choose 3 rarest characters.
        for (j=0; j<3; ++j) alpha[j] = -j;
        for (i=j=0; targ[i] && j<3; ++i)
            if (targ[i] != wild) {
                r = alpha[j] = targ[i];
                if (alpha[0]==alpha[1] || alpha[1]==alpha[2]
                        || alpha[0]==alpha[2]) continue;
                at[j] = i;
                L[T[r]] = L[g+i] = g+i;
                ++j;
            }
        if (j != 3) {
            printf ("  Too few target chars\n");
            continue;
        }

        // Step 2B. Set a,b,c to head-of-list locations, set deltas.
        da = at[0];
        a = H[alpha[0]];  dab = at[1]-at[0];
        b = H[alpha[1]];  dbc = at[2]-at[1];
        c = H[alpha[2]];  dac = at[2]-at[0];

        // Step 2C. See if key characters appear in haystack
        if (a >= g || b >= g || c >= g) {
            printf ("  No match on some character\n");
            continue;
        }
        for (g=0; hay[g]; ++g) printf ("%d", g%10);
        printf ("\n%s\n", hay);  // Show haystack, for user aid

        // Step 2D. Search for match
        while (c < g) {
            // Step 2E. Enforce delta distances
            while (a+dab < b) { a = L[a]; ++sa; }  // Replicate these
            while (b+dbc < c) { b = L[b]; ++sb; }  // 3 abc lines as many
            while (a+dac > c) { c = L[c]; ++sc; }  // times as you like.
            while (a+dab < b) { a = L[a]; ++sa; }  // Replicate these
            while (b+dbc < c) { b = L[b]; ++sb; }  // 3 abc lines as many
            while (a+dac > c) { c = L[c]; ++sc; }  // times as you like.
            // Step 2F. See if delta distances were met
            if (a+dab==b && b+dbc==c && c<g) {
                // Step 2G. Yes, so we have 3-letter-match and need to test whole match.
                r = a-da;
                for (k=0; targ[k]; ++k)
                    if ((hay[r+k] != targ[k]) && (targ[k] != wild))
                        break;
                printf ("# %3d, %s and ", r, targ);
                for (i=0; targ[i]; ++i) putchar(hay[r++]);
                // Step 2H. Report match, if found
                puts (targ[k]? " differ" : " match");
                // Step 2I. Advance all of a,b,c, to go on looking
                a = L[a]; ++ta;
                b = L[b]; ++tb;
                c = L[c]; ++tc;
            }
        }
        printf ("Advances: '%c' %d+%d '%c' %d+%d '%c' %d+%d = %d+%d = %d\n",
                alpha[0], sa,ta, alpha[1], sb,tb, alpha[2], sc,tc,
                sa+sb+sc, ta+tb+tc, sa+sb+sc+ta+tb+tc);
    }
    return 0;
}
Note, if you like this answer better than current preferred answer, unmark that one and mark this one. :)
Regular expressions usually use a finite-state-automaton-based search, I think. Try implementing that.

c++ isalpha lookup table

What's the easiest way to implement a lookup table that checks whether a character is alphabetic, using an array of 256 chars (256 bytes)? I know I can use the isalpha function, but a lookup table can supposedly be more efficient, requiring one comparison instead of several. I was thinking of making the index correspond to the char's decimal value and checking the table entry directly.
I've always used this single-compare method (I assume it pipelines better), since it's faster than doing up to four compares:
unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A'
I benchmarked a few different ways and took into account TLB cache misses for the lookup-table method. I ran the benchmarks on Windows. Here are the times if the charset was '0'..'z':
lookup tbl no tlb miss: 4.8265 ms
lookup table with tlb miss: 7.0217 ms
unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A': 10.5075 ms
(ch>='A' && ch<='Z') || (ch>='a' && ch<='z'): 17.2290 ms
isalpha: 28.0504 ms
You can clearly see that the locale code has a cost.
Here are the times if the charset was 0..255:
tbl no tlb miss: 12.6833 ms
unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A': 29.2403 ms
(ch>='A' && ch<='Z') || (ch>='a' && ch<='z'): 34.8818 ms
isalpha: 78.0317 ms
tbl with tlb miss: 143.7135 ms
The times are longer because more chars were tested. The # of segments I used for the tlb "flush" was larger in the second test. It might be that the table lookup method suffers more from the tlb miss than the first run indicates. You can also see that the single cmp method works better when the character is an alpha.
The lookup table method is the best if comparing many characters in a row, but it is not that much better than the single cmp method. If you are comparing characters here and there, then the tlb cache miss might make the tbl method worse than the single cmp method. The single cmp method works best when the characters are more likely to be alphas.
Here is the code:
#include <windows.h> // QueryPerformanceCounter/Frequency (includes assumed; the original listing omits them)
#include <cstdio>
#include <cstdlib>
#include <cctype>

// QueryPerformanceCounter-based timer used by the benchmarks in main();
// it must be defined before main() uses it.
class CBench {
public:
    __declspec(noinline) CBench(CBench* p) : m_fAccumulate(false), m_nTicks(0),
        m_cCalls(0), m_pBench(p), m_desc(NULL), m_nStart(GetBenchMark()) { }
    __declspec(noinline) CBench(const char *desc_in, bool fAccumulate=false) :
        m_fAccumulate(fAccumulate), m_nTicks(0), m_cCalls(0), m_pBench(NULL),
        m_desc(desc_in), m_nStart(GetBenchMark()) { }
    __declspec(noinline) ~CBench() {
        __int64 n = (m_fAccumulate) ? m_nTicks : GetBenchMark() - m_nStart;
        if (m_pBench) {
            m_pBench->m_nTicks += n;
            m_pBench->m_cCalls++;
            return;
        } else if (!m_fAccumulate) {
            m_cCalls++;
        }
        __int64 nFreq;
        QueryPerformanceFrequency((LARGE_INTEGER*)&nFreq);
        double ms = ((double)n * 1000)/nFreq;
        printf("%s took: %.4f ms, calls: %d, avg:%f\n", m_desc, ms, m_cCalls,
               ms/m_cCalls);
    }
    __declspec(noinline) __int64 GetBenchMark(void) {
        __int64 nBenchMark;
        QueryPerformanceCounter((LARGE_INTEGER*)&nBenchMark);
        return nBenchMark;
    }
    LPCSTR m_desc;
    __int64 m_nStart, m_nTicks;
    DWORD m_cCalls;
    bool m_fAccumulate;
    CBench* m_pBench;
};

__forceinline bool isalpha2(BYTE ch) {
    return (ch>='A' && ch<='Z') || (ch>='a' && ch<='z');
}

__forceinline bool isalpha1(BYTE ch) {
    return unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A';
}

bool* aTbl[256];

int main(int argc, char* argv[])
{
    __int64 n = 0, cTries = 100000;
    int b=63;
    int ch0=' ', ch1 ='z'+1;
    ch0=0, ch1 = 256;
    // having 255 tables instead of one lets us "flush" the tlb.
    // Intel tlb should have about ~32 entries (depending on model!) in it,
    // so using more than 32 tables should have a noticable effect.
    for (int i1=0 ; i1<256 ; ++i1) {
        aTbl[i1] = (bool*)malloc(16384);
        for (int i=0 ; i<256 ; ++i)
            aTbl[i1][i] = isalpha(i);
    }
    { CBench Bench("tbl with tlb miss");
        for (int i=1 ; i<cTries ; ++i) {
            for (int ch = ch0 ; ch < ch1 ; ++ch)
                n += aTbl[ch][ch]; // tlb buster
        }
    }
    { CBench Bench("tbl no tlb miss");
        for (int i=1 ; i<cTries ; ++i) {
            for (int ch = ch0 ; ch < ch1 ; ++ch)
                n += aTbl[0][ch];
        }
    }
    { CBench Bench("isalpha");
        for (int i=1 ; i<cTries ; ++i) {
            for (int ch = ch0 ; ch < ch1 ; ++ch)
                n += isalpha(ch);
        }
    }
    { CBench Bench("unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A'");
        for (int i=1 ; i<cTries ; ++i) {
            for (int ch = ch0 ; ch < ch1 ; ++ch)
                n += isalpha1(ch);
        }
    }
    { CBench Bench("(ch>='A' && ch<='Z') || (ch>='a' && ch<='z')");
        for (int i=1 ; i<cTries ; ++i) {
            for (int ch = ch0 ; ch < ch1 ; ++ch)
                n += isalpha2(ch);
        }
    }
    return n;
}
Remember the first rule of optimisation: don't do it.
Then remember the second rule of optimisation, to be applied very rarely: don't do it yet.
Then, if you really are encountering a bottleneck and you've identified isalpha as the cause, then something like this might be faster, depending on how your library implements the function. You'll need to measure the performance in your environment, and only use it if there really is a measurable improvement. This assumes you don't need to test values outside the range of unsigned char (typically 0...255); you'll need a bit of extra work for that.
#include <cctype>
#include <climits>

class IsAlpha
{
public:
    IsAlpha()
    {
        for (int i = 0; i <= UCHAR_MAX; ++i)
            table[i] = std::isalpha(i);
    }
    bool operator()(unsigned char i) const { return table[i]; }
private:
    bool table[UCHAR_MAX+1];
};
Usage:
IsAlpha isalpha;
for (int i = 0; i <= UCHAR_MAX; ++i)
    assert(isalpha(i) == bool(std::isalpha(i)));
Actually, according to Plauger in "The Standard C Library" [91], isalpha is oftentimes implemented using a lookup table. That book is really dated, but this might still be the case today. Here's his proposed definition for isalpha:
Function:
int isalpha(int c)
{
    return (_Ctype[c] & (_LO|_UP|_XA));
}
Macro:
#define isalpha(c) (_Ctype[(int)(c)] & (_LO|_UP|_XA))
Your compiler library's implementation is likely to be quite efficient and is probably already using a lookup table for most cases, but also handling some situations that might be a little tricky to get right if you're going to do your own isalpha():
dealing with signed characters correctly (using negative indexing on the lookup table)
dealing with non-ASCII locales
You might not need to handle non-ASCII locales, in which case you might (maybe) be able to improve slightly over the library.
In fact, I wouldn't be surprised if a macro or function that simply returned the result of:
((('a' <= (c)) && ((c) <= 'z')) || (('A' <= (c)) && ((c) <= 'Z')))
might be faster than a table lookup since it wouldn't have to hit memory. But I doubt it would be faster in any meaningful way, and would be difficult to measure a difference except maybe in a benchmark that did nothing but isalpha() calls (which might also improve the table lookup results since the table would likely be in the cache for many of the tests).
And is isalpha() really a bottleneck? For anyone?
Just use the one in your compiler's library.
If you are looking only for alphabetic characters (a-z, A-Z), that's a lot fewer characters than your 256-entry array. You can subtract 'A' (the lowest alphabetic character) from the ASCII character in question, which gives the index into your array; e.g. 'B' - 'A' is 1. If the result is negative, the character is not alpha. If it is greater than your max alpha index ('z' - 'A'), then it is not alpha.
If you are using Unicode at all, this method will not work.
I think you can implement isalpha a lot more trivially than with a lookup table. Using the fact that the characters 'a'-'z' and 'A'-'Z' are contiguous in ASCII, a simple test like this is sufficient:
char c;
// c gets some value
if (('A' <= c && 'Z' >= c) || ('a' <= c && 'z' >= c)) // c is alpha
Note that this doesn't take into account different locales.