What's the easiest way to implement a lookup table that checks whether a character is an alpha or not, using an array of 256 chars (256 bytes)? I know I can use the isalpha function, but a lookup table can supposedly be more efficient, requiring one comparison instead of multiple ones. I was thinking of using the char's numeric value as the index into the array and reading the answer directly from the table.
I've always used this single-compare method (I assume it pipelines better), since it's faster than doing up to four compares:
unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A'
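To see why this works: in ASCII, clearing bit 5 (0x20) folds each lowercase letter onto its uppercase counterpart, and the unsigned subtraction turns the two-sided range test into a single compare, because anything below 'A' wraps around to a huge unsigned value. A minimal sketch that verifies the equivalence over all 256 byte values (the function names here are mine, not from the original post):
#include <cstdio>

// Single-compare test: fold lowercase onto uppercase, then one unsigned range check.
static bool isAlphaOneCmp(unsigned char ch) {
    return unsigned((ch & ~(1u << 5)) - 'A') <= 'Z' - 'A';
}

// Reference: the obvious four-compare version.
static bool isAlphaRef(unsigned char ch) {
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

int main() {
    for (int c = 0; c < 256; ++c)
        if (isAlphaOneCmp((unsigned char)c) != isAlphaRef((unsigned char)c))
            std::printf("mismatch at %d\n", c); // prints nothing: the two agree on all bytes
    return 0;
}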
I benchmarked a few different ways and took into account TLB cache misses for the lookup-table method. I ran the benchmarks on Windows. Here are the times if the charset was '0'..'z':
lookup tbl no tlb miss: 4.8265 ms
lookup table with tlb miss: 7.0217 ms
unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A': 10.5075 ms
(ch>='A' && ch<='Z') || (ch>='a' && ch<='z'): 17.2290 ms
isalpha: 28.0504 ms
You can clearly see that the locale code has a cost.
Here are the times if the charset was 0..255:
tbl no tlb miss: 12.6833 ms
unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A': 29.2403 ms
(ch>='A' && ch<='Z') || (ch>='a' && ch<='z'): 34.8818 ms
isalpha: 78.0317 ms
tbl with tlb miss: 143.7135 ms
The times are longer because more characters were tested. The number of separate tables I used for the TLB "flush" was also larger in the second test, so the table-lookup method may suffer more from TLB misses than the first run indicates. You can also see that the single-compare method does better when the character is an alpha.
The lookup-table method is the best if you are comparing many characters in a row, but it is not that much better than the single-compare method. If you are only comparing characters here and there, the TLB misses might make the table method worse than the single-compare method. The single-compare method works best when the characters are more likely to be alphas.
Here is the code:
#include <windows.h> // QueryPerformanceCounter/Frequency, BYTE, DWORD, LPCSTR
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

__forceinline bool isalpha2(BYTE ch) {
return (ch>='A' && ch<='Z') || (ch>='a' && ch<='z');
}
__forceinline bool isalpha1(BYTE ch) {
return unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A';
}
bool* aTbl[256];
int main(int argc, char* argv[])
{
__int64 n = 0, cTries = 100000;
int ch0=' ', ch1 ='z'+1;
// the line below overrides the line above: it selects the full 0..255 charset
// used for the second set of timings.
ch0=0, ch1 = 256;
// having 256 tables instead of one lets us "flush" the tlb.
// Intel tlb should have about ~32 entries (depending on model!) in it,
// so using more than 32 tables should have a noticeable effect.
for (int i1=0 ; i1<256 ; ++i1) {
aTbl[i1] = (bool*)malloc(16384); // 16 KB per table, so each table sits on its own page(s)
for (int i=0 ; i<256 ; ++i)
aTbl[i1][i] = isalpha(i) != 0;
}
{ CBench Bench("tbl with tlb miss");
for (int i=1 ; i<cTries ; ++i) {
for (int ch = ch0 ; ch < ch1 ; ++ ch)
n += aTbl[ch][ch]; // tlb buster
}
}
{ CBench Bench("tbl no tlb miss");
for (int i=1 ; i<cTries ; ++i) {
for (int ch = ch0 ; ch < ch1 ; ++ ch)
n += aTbl[0][ch];
}
}
{ CBench Bench("isalpha");
for (int i=1 ; i<cTries ; ++i) {
for (int ch = ch0 ; ch < ch1 ; ++ ch)
n += isalpha(ch);
}
}
{ CBench Bench("unsigned((ch&(~(1<<5))) - 'A') <= 'Z' - 'A'");
for (int i=1 ; i<cTries ; ++i) {
for (int ch = ch0 ; ch < ch1 ; ++ ch)
n += isalpha1(ch);
}
}
{ CBench Bench("(ch>='A' && ch<='Z') || (ch>='a' && ch<='z')");
for (int i=1 ; i<cTries ; ++i) {
for (int ch = ch0 ; ch < ch1 ; ++ ch)
n += isalpha2(ch);
}
}
return n;
}
// Note: CBench must appear before main() for the code above to compile;
// it is shown after main() here only for presentation.
class CBench {
public:
__declspec(noinline) CBench(CBench* p) : m_fAccumulate(false), m_nTicks(0),
m_cCalls(0), m_pBench(p), m_desc(NULL), m_nStart(GetBenchMark()) { }
__declspec(noinline) CBench(const char *desc_in, bool fAccumulate=false) :
m_fAccumulate(fAccumulate), m_nTicks(0), m_cCalls(0), m_pBench(NULL),
m_desc(desc_in), m_nStart(GetBenchMark()) { }
__declspec(noinline) ~CBench() {
__int64 n = (m_fAccumulate) ? m_nTicks : GetBenchMark() - m_nStart;
if (m_pBench) {
m_pBench->m_nTicks += n;
m_pBench->m_cCalls++;
return;
} else if (!m_fAccumulate) {
m_cCalls++;
}
__int64 nFreq;
QueryPerformanceFrequency((LARGE_INTEGER*)&nFreq);
double ms = ((double)n * 1000)/nFreq;
printf("%s took: %.4f ms, calls: %d, avg:%f\n", m_desc, ms, m_cCalls,
ms/m_cCalls);
}
__declspec(noinline) __int64 GetBenchMark(void) {
__int64 nBenchMark;
QueryPerformanceCounter((LARGE_INTEGER*)&nBenchMark);
return nBenchMark;
}
LPCSTR m_desc;
__int64 m_nStart, m_nTicks;
DWORD m_cCalls;
bool m_fAccumulate;
CBench* m_pBench;
};
Remember the first rule of optimisation: don't do it.
Then remember the second rule of optimisation, to be applied very rarely: don't do it yet.
Then, if you really are encountering a bottleneck and you've identified isalpha as the cause, then something like this might be faster, depending on how your library implements the function. You'll need to measure the performance in your environment, and only use it if there really is a measurable improvement. This assumes you don't need to test values outside the range of unsigned char (typically 0...255); you'll need a bit of extra work for that.
#include <cctype>
#include <climits>
class IsAlpha
{
public:
IsAlpha()
{
for (int i = 0; i <= UCHAR_MAX; ++i)
table[i] = std::isalpha(i);
}
bool operator()(unsigned char i) const {return table[i];}
private:
bool table[UCHAR_MAX+1];
};
Usage:
#include <cassert>
IsAlpha isalpha;
for (int i = 0; i <= UCHAR_MAX; ++i)
assert(isalpha(i) == bool(std::isalpha(i)));
Actually, according to Plauger in "The Standard C Library" [91], isalpha is often implemented using a lookup table. That book is really dated, but this might still be the case today. Here is his proposed definition for isalpha:
Function
int isalpha(int c)
{
return (_Ctype[c] & (_LO|_UP|_XA));
}
Macro
#define isalpha(c) (_Ctype[(int)(c)] & (_LO|_UP|_XA))
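To illustrate the scheme (a rough sketch only: _Ctype, _LO, _UP and _XA are Plauger's names, and the initialization below is my own simplified stand-in for what a library would set up per locale):
#include <stdio.h>

// Classification bits, in the spirit of Plauger's _LO / _UP / _XA.
static const unsigned char XLO = 1; // lower-case letter
static const unsigned char XUP = 2; // upper-case letter
static const unsigned char XXA = 4; // extra alphabetic (locale-specific)

static unsigned char Ctype[256];

static void init_ctype(void) {
    for (int c = 'a'; c <= 'z'; ++c) Ctype[c] |= XLO;
    for (int c = 'A'; c <= 'Z'; ++c) Ctype[c] |= XUP;
    // A non-ASCII locale could additionally mark letters with XXA here.
}

static int my_isalpha(int c) { // one table load and one mask; no branches
    return Ctype[(unsigned char)c] & (XLO | XUP | XXA);
}

int main(void) {
    init_ctype();
    printf("%d %d\n", my_isalpha('Q') != 0, my_isalpha('7') != 0); // prints: 1 0
}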
Your compiler library's implementation is likely to be quite efficient and is probably already using a lookup table for most cases, but also handling some situations that might be a little tricky to get right if you're going to do your own isalpha():
dealing with signed characters correctly (using negative indexing on the lookup table; see the sketch below)
dealing with non-ASCII locales
You might not need to handle non-ASCII locales, in which case you might (maybe) be able to improve slightly over the library.
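On the signed-char point, one classic layout (a sketch of the general technique, not of any particular library's internals) gives the table an extra slot in front so that EOF, which is -1, indexes safely:
#include <stdio.h>

// One extra slot in front so that index -1 (EOF) is a valid access.
static unsigned char storage[1 + 256];
static unsigned char* const tbl = storage + 1; // tbl[-1] is storage[0], reserved for EOF

static void init_tbl(void) {
    for (int c = 'a'; c <= 'z'; ++c) tbl[c] = 1;
    for (int c = 'A'; c <= 'Z'; ++c) tbl[c] = 1;
}

static int my_isalpha(int c) { // c must be EOF or representable as unsigned char
    return tbl[c];
}

int main(void) {
    init_tbl();
    printf("%d %d\n", my_isalpha('x'), my_isalpha(EOF)); // prints: 1 0
}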
In fact, I wouldn't be surprised if a macro or function that simply returned the result of:
((('a' <= (c)) && ((c) <= 'z')) || (('A' <= (c)) && ((c) <= 'Z')))
might be faster than a table lookup, since it wouldn't have to hit memory. But I doubt it would be faster in any meaningful way, and it would be difficult to measure a difference except maybe in a benchmark that did nothing but isalpha() calls (which might also flatter the table lookup, since the table would likely stay in the cache for many of the tests).
And is isalpha() really a bottleneck? For anyone?
Just use the one in your compiler's library.
If you are looking for alphabetic characters, 'A'..'z', that's a lot fewer characters than a 256-entry array. You can subtract 'A' (the lowest alphabetic character) from the character in question, which gives the index into your array; e.g. 'B' - 'A' is 1. If the result is negative, the character is not alpha. If it is greater than 'z' - 'A' (your highest alpha), it is not alpha either. A sketch follows below.
If you are using unicode at all, this method will not work.
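A minimal sketch of that smaller table ('A' through 'z' is only 58 entries; the six punctuation characters between 'Z' and 'a' simply stay false):
#include <stdio.h>

// 58-entry table covering 'A'..'z'.
static bool small_tbl['z' - 'A' + 1];

static void init_small_tbl(void) {
    for (int c = 'A'; c <= 'Z'; ++c) small_tbl[c - 'A'] = true;
    for (int c = 'a'; c <= 'z'; ++c) small_tbl[c - 'A'] = true;
}

static bool is_alpha_small(unsigned char ch) {
    int idx = ch - 'A'; // negative means below 'A', i.e. not alphabetic
    return idx >= 0 && idx <= 'z' - 'A' && small_tbl[idx];
}

int main(void) {
    init_small_tbl();
    printf("%d %d %d\n", is_alpha_small('B'), is_alpha_small('_'), is_alpha_small('3')); // 1 0 0
}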
I think you can implement isalpha a lot more trivially than with a lookup table. Using the fact that the characters 'a'-'z' and 'A'-'Z' are contiguous in ASCII, a simple test like this is sufficient:
char c ;
// c gets some value
if(('A'<=c && 'Z'>=c) || ('a'<=c && 'z'>=c)) // c is alpha
Note that this doesn't take into account different locales.
Related
I recently learnt that using getchar_unlocked() is a faster way of reading input. I searched on the internet and found the code snippet below, but I am unable to understand it.
#include <stdio.h> // getchar_unlocked() is declared here on POSIX systems

void fast_scanf(int &number)
{
register int ch = getchar_unlocked();
number= 0;
while (ch > 47 && ch < 58) {
number = number * 10 + ch - 48;
ch = getchar_unlocked();
}
}
int main(void)
{
int test_cases;
fast_scanf(test_cases);
while (test_cases--)
{
int n;
fast_scanf(n);
int array[n]; // variable-length array: a GCC extension, not standard C++
for (int i = 0; i < n; i++)
fast_scanf(array[i]);
}
return 0;
}
So, this code reads an integer array of size n for a given number of test_cases. I didn't understand anything in the function fast_scanf, like why this line:
while (ch > 47 && ch < 58)
{ number = number * 10 + ch - 48;
Why is register used when declaring ch?
Why is getchar_unlocked() called twice in the function? And so on...
It would be a great help if someone elaborated on this for me. Thanks in advance!
Okay, since what you are asking needs to be explained clearly, I am writing it here... so I don't jumble it all up inside the comments...
The function (edited a bit to make it look more like standard C++):
void fast_scanf(int &number)
{
auto ch = getchar_unlocked();
number= 0;
while (ch >= '0' && ch <= '9')
{
number = number * 10 + ch - '0';
ch = getchar_unlocked();
}
}
Here, look at an ASCII table first; you won't understand how the results come about if you don't...
1) Here, the variable ch takes the input character from the user using getchar_unlocked(); the auto keyword deduces its type (int) for you, and works this way only in C++, not C...
2) You assign the variable number to zero so that the variable can be re-used, note that the variable is a reference so it changes inside your program as well...
3) while (ch >= '0' && ch <= '9')... As pointed out, checks whether the characters is within the numerical ASCII limit, similar to saying that the character has to be greater than or equal to 48 but less than or equal to 57...
4) Here, things are a little bit tricky: number is assigned the sum of number * 10 and the real integer value of the character you stored (ch - '0'); multiplying by 10 shifts the digits read so far one decimal place to the left, and the new digit is appended...
5) In the next line, ch is reassigned so that you don't stay in the loop forever; ch would keep its old value if you never read another character... remember a loop goes back to its condition after reaching the end of the body, continues if the condition is true, and breaks otherwise...
For example: 456764
Here, ch will first take 4 then the others follow so we go with 4 first...
1) Number will be assigned to zero. While loop checks if the given character is a number or not, if it is continues the loop else breaks it...
2) Multiplying 0 by 10 gives zero... and adding the difference of 52 (that is, '4') and 48 (that is, '0') gives you 4 (the real numerical value, not the char '4')...
So the variable number now is 4...
And the same continues with the others as well... See...
number = number * 10 + '5' - '0'
number = 4 * 10 + 53 - 48
number = 40 + 5
number = 45... etc, etc. for other numbers...
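One caveat worth noting: the snippet never checks for EOF and cannot read negative numbers. A sketch of a slightly more robust variant (the EOF handling and the minus-sign handling are my additions; getchar_unlocked() itself is POSIX, not standard C++):
#include <stdio.h>

// Returns false when input is exhausted before a number is found.
static bool fast_scanf(int &number)
{
    int ch = getchar_unlocked();
    // Skip everything that is neither a digit nor a minus sign; bail out on EOF.
    while (ch != '-' && (ch < '0' || ch > '9')) {
        if (ch == EOF) return false;
        ch = getchar_unlocked();
    }
    bool negative = (ch == '-');
    if (negative) ch = getchar_unlocked();
    number = 0;
    while (ch >= '0' && ch <= '9') {
        number = number * 10 + (ch - '0'); // shift left one decimal place, append digit
        ch = getchar_unlocked();
    }
    if (negative) number = -number;
    return true;
}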
When testing small strings (e.g. isPhoneNumber or isHexadecimal) is there a performance benefit from using regular expressions, or would brute forcing them be faster? Wouldn't brute forcing them by just checking whether or not the given string's chars are within a specified range be faster than using a regex?
For example:
public static boolean isHexadecimal(String value)
{
if (value.startsWith("-"))
{
value = value.substring(1);
}
value = value.toLowerCase();
if (value.length() <= 2 || !value.startsWith("0x"))
{
return false;
}
for (int i = 2; i < value.length(); i++)
{
char c = value.charAt(i);
if (!(c >= '0' && c <= '9' || c >= 'a' && c <= 'f'))
{
return false;
}
}
return true;
}
vs.
Regex.match(/0x[0-9a-f]+/, "0x123fa") // returns true if regex matches whole given expression
There seems like there would be some overhead associated with the regex, even when the pattern is pre-compiled, just from the fact that regular expressions have to work in many general cases. In contrast, the brute-force method does exactly what is required and no more. Am I missing some optimization that regular expressions have?
Checking whether string characters are within a certain range is exactly what regular expressions are built to do. They convert the expression into an atomic series of instructions; they're essentially writing out your manual parsing steps, but at a lower level.
What tends to be slow with regular expressions is the conversion of the expression into instructions. You can see real performance gains when a regex is used more than once. That's when you can compile the expression ahead of time and then simply apply the resulting compiled instructions in a match, search, replace, etc.
As is the case with anything to do with performance, perform some tests and measure the results.
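To make the compile-once point concrete (a sketch in C++ with std::regex; Java's Pattern.compile/matcher split, shown in the benchmarks below, follows the same idea): construct the pattern object once and reuse it for every match:
#include <regex>
#include <stdio.h>
#include <string>

// Compiled once, at startup; each call then pays only for the match itself.
static const std::regex hex_re("0x[0-9a-fA-F]+");

static bool is_hex(const std::string& s) {
    return std::regex_match(s, hex_re);
}

int main() {
    printf("%d %d\n", is_hex("0x123fa") ? 1 : 0, is_hex("0x123fax") ? 1 : 0); // prints: 1 0
}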
I've written a small benchmark to estimate the performance of the:
NOP method (to get an idea of the baseline iteration speed);
Original method, as provided by the OP ;
RegExp;
Compiled Regexp;
The version provided by @maraca (w/o toLowerCase and substring);
"fastIsHex" version (switch-based), I've added just for fun.
The test machine configuration is as follows:
JVM: Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
CPU: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
And here are the results I got for the original test string "0x123fa" and 10.000.000 iterations:
Method "NOP" => #10000000 iterations in 9ms
Method "isHexadecimal (OP)" => #10000000 iterations in 300ms
Method "RegExp" => #10000000 iterations in 4270ms
Method "RegExp (Compiled)" => #10000000 iterations in 1025ms
Method "isHexadecimal (maraca)" => #10000000 iterations in 135ms
Method "fastIsHex" => #10000000 iterations in 107ms
As you can see, even the original method by the OP is faster than the RegExp method (at least when using the JDK-provided RegExp implementation).
(for your reference)
Benchmark code:
public static void main(String[] argv) throws Exception {
//Number of ITERATIONS
final int ITERATIONS = 10000000;
//NOP
benchmark(ITERATIONS,"NOP",() -> nop(longHexText));
//isHexadecimal
benchmark(ITERATIONS,"isHexadecimal (OP)",() -> isHexadecimal(longHexText));
//Un-compiled regexp
benchmark(ITERATIONS,"RegExp",() -> longHexText.matches("0x[0-9a-fA-F]+"));
//Pre-compiled regexp
final Pattern pattern = Pattern.compile("0x[0-9a-fA-F]+");
benchmark(ITERATIONS,"RegExp (Compiled)", () -> {
pattern.matcher(longHexText).matches();
});
//isHexadecimal (maraca)
benchmark(ITERATIONS,"isHexadecimal (maraca)",() -> isHexadecimalMaraca(longHexText));
//FastIsHex
benchmark(ITERATIONS,"fastIsHex",() -> fastIsHex(longHexText));
}
public static void benchmark(int iterations,String name,Runnable block) {
//Start Time
long stime = System.currentTimeMillis();
//Benchmark
for(int i = 0; i < iterations; i++) {
block.run();
}
//Done
System.out.println(
String.format("Method \"%s\" => #%d iterations in %dms",name,iterations,(System.currentTimeMillis()-stime))
);
}
NOP method:
public static boolean nop(String value) { return true; }
fastIsHex method:
public static boolean fastIsHex(String value) {
//Value must be at least 4 characters long (0x00)
if(value.length() < 4) {
return false;
}
//Compute where the data starts
int start = ((value.charAt(0) == '-') ? 1 : 0) + 2;
//Check prefix
if(value.charAt(start-2) != '0' || value.charAt(start-1) != 'x') {
return false;
}
//Verify data
for(int i = start; i < value.length(); i++) {
switch(value.charAt(i)) {
case '0':case '1':case '2':case '3':case '4':case '5':case '6':case '7':case '8':case '9':
case 'a':case 'b':case 'c':case 'd':case 'e':case 'f':
case 'A':case 'B':case 'C':case 'D':case 'E':case 'F':
continue;
default:
return false;
}
}
return true;
}
So, the answer is no: for short strings and the task at hand, RegExp is not faster.
When it comes to longer strings the balance is quite different; below are results for an 8192-character hex string, which I generated with:
hexdump -n 8196 -v -e '/1 "%02X"' /dev/urandom
and 10.000 iterations:
Method "NOP" => #10000 iterations in 2ms
Method "isHexadecimal (OP)" => #10000 iterations in 1512ms
Method "RegExp" => #10000 iterations in 1303ms
Method "RegExp (Compiled)" => #10000 iterations in 1263ms
Method "isHexadecimal (maraca)" => #10000 iterations in 553ms
Method "fastIsHex" => #10000 iterations in 530ms
As you can see, the hand-written methods (the one by maraca and my fastIsHex) still beat the RegExp, but the original method does not (due to substring() and toLowerCase()).
Sidenote:
This benchmark is very simple indeed and only tests the "worst case" scenario (i.e. a fully valid string); real-life results, with mixed data lengths and a non-zero valid-to-invalid ratio, might be quite different.
Update:
I also gave a try to the char[] array version:
char[] chars = value.toCharArray();
for (idx += 2; idx < chars.length; idx++) { ... }
and it was even a bit slower than the charAt(i) version:
Method "isHexadecimal (maraca) char[] array version" => #10000000 iterations in 194ms
Method "fastIsHex, char[] array version" => #10000000 iterations in 164ms
my guess is that this is due to the array copy inside toCharArray().
Update (#2):
I've run an additional 8k/100.000-iterations test to see if there is any real difference in speed between the "maraca" and "fastIsHex" methods, and I have also normalized them to use exactly the same precondition code:
Run #1
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5341ms
Method "fastIsHex" => #100000 iterations in 5313ms
Run #2
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5313ms
Method "fastIsHex" => #100000 iterations in 5334ms
I.e. the speed difference between these two methods is marginal at best, and is probably due to measurement error (as I'm running this on my workstation and not in a specially set-up, clean test environment).
A brute-force approach systematically tests all combinations; that is not what your code does.
You can get better performance from a hand-written procedure. You can take advantage of the data distribution if you know it in advance, or you can make some clever shortcuts that apply to your case. But it really is not guaranteed that what you write will automatically be faster than a regex. Regex implementations are optimized too, and you can easily end up with code that is worse than them.
The code in your question is really nothing special, and most probably it would be on par with the regex. As I tested it, there was no clear winner; sometimes one was faster, sometimes the other, and the difference was small. Your time is limited; think wisely about where you spend it.
You're misusing the term "brute force." A better term is ad hoc custom matching.
Regex interpreters are generally slower than custom pattern matchers. The regex is compiled into a byte code, and compilation takes time. Even ignoring compilation (which might be fine if you compile only once and match a very long string and/or many times so the compilation cost isn't important), machine instructions spent in the matching interpreter are overhead that the custom matcher doesn't have.
In cases where the regex matcher wins out, it's normally that the regex engine is implemented in very fast native code, while the custom matcher is written in something slower.
Now you can compile regexes to native code that runs just as fast as a well-done custom matcher. This is the approach of e.g. lex/flex and others. But the most common library or built-in languages don't take this approach (Java, Python, Perl, etc.). They use interpreters.
Native code-generating libraries tend to be cumbersome to use except maybe in C/C++ where they've been part of the air for decades.
In other languages, I'm a fan of state machines. To me they are easier to understand and get correct than either regexes or custom matchers. Below is one for your problem: state 0 is the start state, states 1 through 3 consume the optional '-' sign and the "0x" prefix, and state 4 is reached once at least one hex digit has been seen (D stood for a hex digit in the original diagram).
Implementation of the machine can be extremely fast. In Java, it might look like this:
static boolean isHex(String s) {
int state = 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
switch (state) {
case 0:
if (c == '-') state = 1;
else if (c == '0') state = 2;
else return false;
break;
case 1:
if (c == '0') state = 2;
else return false;
break;
case 2:
if (c == 'x') state = 3;
else return false;
break;
case 3:
if (isHexDigit(c)) state = 4;
else return false;
break;
case 4:
if (isHexDigit(c)) ; // state already = 4
else return false;
break;
}
}
return state == 4; // only state 4 (at least one hex digit seen) is accepting
}
static boolean isHexDigit(char c) {
return '0' <= c && c <= '9' || 'A' <= c && c <= 'F' || 'a' <= c && c <= 'f';
}
The code isn't super short, but it's a direct translation of the diagram. There's nothing to mess up short of simple typographical errors.
In C, you can implement states as goto labels:
int isHex(char *s) {
char c;
s0:
c = *s++;
if (c == '-') goto s1;
if (c == '0') goto s2;
return 0;
s1:
c = *s++;
if (c == '0') goto s2;
return 0;
s2:
c = *s++;
if (c == 'x') goto s3;
return 0;
s3:
c = *s++;
if (isxdigit(c)) goto s4;
return 0;
s4:
c = *s++;
if (isxdigit(c)) goto s4;
if (c == '\0') return 1;
return 0;
}
This kind of goto matcher written in C is generally the fastest I've seen. On my MacBook using an old gcc (4.6.4), this one compiles to only 35 machine instructions.
Usually what's better depends on your goals. If readability is the main goal (which it should be, unless you have detected a performance issue), then regexes are just fine.
If performance is your goal, then you have to analyze the problem first. E.g. if you know it's either a phone number or a hexadecimal number (and nothing else) then the problem becomes much simpler.
Now let's have a look at your function (performance-wise) to detect hexadecimal numbers:
Getting the substring is bad (creating a new object in general); better to work with an index and advance it.
Instead of using toLowerCase(), it's better to compare against both the upper- and lower-case letters (the string is only iterated once, no superfluous substitutions are performed, and no new object is created).
So a performance-optimized version could look something like this (you can maybe optimize further by using a char array instead of the string):
public static final boolean isHexadecimal(String value) {
if (value.length() < 3)
return false;
int idx;
if (value.charAt(0) == '-' || value.charAt(0) == '+') { // also supports unary plus
if (value.length() < 4) // necessairy because -0x and +0x are not valid
return false;
idx = 1;
} else {
idx = 0;
}
if (value.charAt(idx) != '0' || value.charAt(idx + 1) != 'x')
return false;
for (idx += 2; idx < value.length(); idx++) {
char c = value.charAt(idx);
if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')))
return false;
}
return true;
}
Well-implemented regular expressions can be faster than a naive brute-force implementation of the same pattern.
On the other hand, you can always implement a faster solution for a specific case.
Also, as stated in the article above, most implementations in popular languages are not efficient (in some cases).
I'd implement my own solution only when performance is an absolute priority, and then with extensive testing and profiling.
To get performance that is better than naive hand-coded validators, you may use a regular-expression library that is based on deterministic automata, e.g. Brics Automaton.
I wrote a short JMH benchmark:
@State(Scope.Thread)
public abstract class MatcherBenchmark {
private String longHexText;
@Setup
public void setup() {
initPattern("0x[0-9a-fA-F]+");
this.longHexText = "0x123fa";
}
public abstract void initPattern(String pattern);
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
@Fork(1)
public void benchmark() {
boolean result = benchmark(longHexText);
if (!result) {
throw new RuntimeException();
}
}
public abstract boolean benchmark(String text);
@TearDown
public void tearDown() {
donePattern();
this.longHexText = null;
}
public abstract void donePattern();
}
and implemented it with:
@Override
public void initPattern(String pattern) {
RegExp r = new RegExp(pattern);
this.automaton = new RunAutomaton(r.toAutomaton(true));
}
@Override
public boolean benchmark(String text) {
return automaton.run(text);
}
I also created benchmarks for Zeppelin's and Gene's solutions, the compiled java.util.regex solution, and a solution with rexlex. These are the results of the JMH benchmark on my machine:
BricsMatcherBenchmark.benchmark      avgt   10  0,014 ± 0,001  us/op
GenesMatcherBenchmark.benchmark      avgt   10  0,017 ± 0,001  us/op
JavaRegexMatcherBenchmark.benchmark  avgt   10  0,097 ± 0,005  us/op
RexlexMatcherBenchmark.benchmark     avgt   10  0,061 ± 0,002  us/op
ZeppelinsBenchmark.benchmark         avgt   10  0,008 ± 0,001  us/op
Starting the same benchmark with a non-hex digit appended ("0x123fax") produces the following results (note: I inverted the validation check for this benchmark):
BricsMatcherBenchmark.benchmark      avgt   10  0,015 ± 0,001  us/op
GenesMatcherBenchmark.benchmark      avgt   10  0,019 ± 0,001  us/op
JavaRegexMatcherBenchmark.benchmark  avgt   10  0,102 ± 0,001  us/op
RexlexMatcherBenchmark.benchmark     avgt   10  0,052 ± 0,002  us/op
ZeppelinsBenchmark.benchmark         avgt   10  0,009 ± 0,001  us/op
Regexes have a lot of advantages, but they do still have a performance cost.
I am writing a program that takes a char and compares it to see if it's in a range of certain chars. For instance, if the char I get is an 'n' I go to state 3; if it's 'a'-'m' or 'o'-'z' I go to state 4. I'm new to C++, so I'm still learning.
Can I say something like:
char c = file.next_char();
...
if (c in 'a'...'m', 'o'...'z')
{
state = 3;
} else {
state = 4;
}
There is no such syntax in C++. The options are:
Use a switch statement when the list of values is generally not contiguous (see the sketch after the code below), or
Convert the list of explicit character values into contiguous ranges and use the equivalent boolean expressions. As you know, the alphabetic characters occupy contiguous ranges of octets in ASCII, so your pseudo-code is equivalent to:
if ( (c >= 'a' && c <= 'm')
||
(c >= 'o' && c <= 'z'))
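For the switch option, note that standard C++ has no case ranges; GCC and Clang accept them as an extension, so a sketch relying on that extension would be:
#include <cstdio>

int classify(char c) { // returns the state, per the question's rules
    switch (c) {
    case 'n':
        return 3;
    case 'a' ... 'm': // case ranges are a GCC/Clang extension, not standard C++
    case 'o' ... 'z':
        return 4;
    default:
        return 0; // not a lower-case letter
    }
}

int main() {
    std::printf("%d %d %d\n", classify('n'), classify('b'), classify('?')); // prints: 3 4 0
}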
If you are using ASCII (English), you can rely on the fact that all the lower-case letters are adjacent: just check 'a' <= c && c <= 'z' after ruling out 'n'.
You never said what happens if the state is not one of those, so I left it alone.
// 3 and 4 mean nothing. Give your states meaningful names
enum state_type {FirstState, SecondState, ThirdState, FourthState};
state_type state = FirstState;
char c = get_next_char();
if ('n' == c){
state = FourthState;
} else if ('a' <= c && c <= 'z'){
state = ThirdState;
} else {
// no change?
}
You could maybe use a for loop to compare it against each letter in the first or second range. The code would be something like this:
char range1[] = "abcdefghijklmnopqrstuvwxyz";
//Substitute the range on the line above (the characters in the string)
for(int i = 0; range1[i] != '\0'; i++) {
if(range1[i] == c){ // c is your letter variable
//Code here
}
}
I'm sorry but I don't know of a more efficient way.
I am currently finishing up an assignment for my OOP class and I am struggling with one part in particular. Keep in mind I am still a beginner. The question is as follows:
If the string contains 13 characters, all of the characters are digits, and the check digit is modulo 10, this function returns true; false otherwise.
This is in regard to an EAN. I basically have to separate every second digit from the rest of the digits; for example, in 9780003194876 I need to do calculations with 7, 0, 0, 1, 4, 7. I have no clue how to do this.
Any help would be greatly appreciated!
bool isValid(const char* str){
if (atoi(str) == 13){
}
return false;
}
You can start with a for loop which increments by 2 on each iteration:
for (int i = 1, len = strlen(str); i < len; i += 2)
{
int digit = str[i] - '0';
// do something with digit
}
The above is just an example though...
Since the question was tagged as C++ (not C, so I suggest other answerers please not solve this using C libraries; let's get the OP's C++ knowledge onto the right path from the beginning), and it is for an OOP class, I'm going to solve this the C++ way: use the std::string class:
bool is_valid( const std::string& str )
{
if( str.size() != 13 )
return false;

for( std::size_t i = 1 ; i < 13 ; i += 2 ) // every second digit: indices 1, 3, 5, ...
{
int digit = str[i] - '0';
//Do what you want with the digit
}
return true; // replace with the real check-digit test
}
First, if it's EAN, you have to process every digit, not just every other one. In fact, all you need to do is a weighted sum of the digits; for EAN-13, the weights alternate between 1 and 3. The simplest solution is probably to put them in a table (i.e. int weight[] = { 1, 3, 1, 3... };) and iterate over the string (in this case using an index rather than iterators, since you want to be able to index into weight as well), converting each digit into a numerical value (str[i] - '0', if isdigit(static_cast<unsigned char>(str[i])) is true; if it's false, you haven't got a digit), multiplying it by the corresponding weight, and adding it to a running total. When you're finished, if the total, modulo 10, is 0, it's correct. Otherwise, it isn't.
You certainly don't want to use atoi, since you don't want the numerical value of the string; you want to treat each digit separately.
Just for the record, professionally, I'd write something like:
bool
isValidEAN13( std::string const& value )
{
return value.size() == 13
&& std::find_if(
value.begin(),
value.end(),
[]( unsigned char ch ){ return !isdigit( ch ); } )
== value.end()
&& calculateEAN13( value ) == value.back() - '0';
}
where calculateEAN13 does the actual calculations (and can be used for both generation and checking). I suspect that this goes beyond the goal of the assignment, however, and that all your teacher is looking for is the calculateEAN13 function, with the last check (which is why I'm not giving it in full).
I am trying to solve problem PLD on SPOJ, but I'm getting a WA on the 9th testcase.
My Approach:
I am implementing Manacher's algorithm, and I believe that if anything is wrong, it is most likely in this part:
if((k%2==0)&&(p[i]>=k)&&(temp[i]=='#'))
count++;
if((k%2==1)&&(p[i]>=k)&&(temp[i]!='#'))
count++;
According to my approach, if the character is '#', then a palindromic substring centered at it can only have even length; so when p[i] >= k, I increase count if we are looking for palindromic strings of even length.
Similarly for input characters (i.e. those other than '#') centered at the i-th location, but for odd-length strings.
#include<stdio.h>
#include<string.h>
char a[30002],temp[60010];
int p[60010];
int min(int a,int b)
{
if(a<b)
return a;
return b;
}
int main()
{
//freopen("input.txt","r+",stdin);
//freopen("a.txt","w+",stdout);
int k,len,z;
scanf("%d",&k);
getchar();
fgets(a,sizeof(a),stdin); // gets() is unsafe (and removed from modern C)
a[strcspn(a,"\n")]='\0'; // strip the trailing newline
len=strlen(a);
//Converting the string
temp[0]='$';
temp[1]='#';
z=2;
for(int i=1;i<=len;i++)
{
temp[z++]=a[i-1];
temp[z++]='#';
}
len=z;
int r=0,c=0,check=0,idash,t,count=0;
for(int i=1;i<len;i++)
{
check=0;
idash=c-(i-c);
p[i]=r>i?min(r-i,p[idash]):0;
t=p[i];
while(temp[i+p[i]+1]==temp[i-1-p[i]])
p[i]++;
if(r<i+p[i])
{
c=i;
r=i+p[i];
}
if((k%2==0)&&(p[i]>=k)&&(temp[i]=='#'))
count++;
if((k%2==1)&&(p[i]>=k)&&(temp[i]!='#'))
count++;
}
printf("%d",count);
//getchar();
//getchar();
return 0;
}
You may want to take advantage of C++ short-circuit evaluation of logical expressions.
For example, rearrange the order so you check for '#' first:
if ((temp[i] == '#') && (k % 2 == 0) && (p[i] >= k))
In the above rearrangement, if the character is not '#', none of the other expressions are evaluated.
You may want to extract (p[i] >= k) to an outside if statement since it is common to both:
if (p[i] >= k)
{
if ((temp[i] == '#') && (k % 2 == 0)) ++count;
if ((temp[i] != '#') && (k % 2 == 1)) ++count;
}
The above modification will result in only one evaluation of the expression (p[i] >= k).
Also examine your for loop to see if there are statements or expressions that don't change or are repeated. If a statement or expression doesn't change inside the loop, it is called loop-invariant and can be moved before or after the loop (see the sketch below).
Statements or expressions that are duplicated (such as array index calculations) can be evaluated once and stored in a temporary variable. Although good compilers may do this (depending on the optimization level), given your performance requirements you may want to help the compiler out.
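For example (hypothetical code, not taken from the program above), hoisting a loop-invariant strlen() call into a local so it is evaluated once:
#include <stdio.h>
#include <string.h>

int main(void) {
    const char* s = "invariant";
    // Before: for (size_t i = 0; i < strlen(s); ++i) ...
    //         strlen(s) never changes, yet may be re-evaluated on every iteration.
    // After: compute the invariant once, before the loop.
    const size_t len = strlen(s);
    for (size_t i = 0; i < len; ++i)
        putchar(s[i]);
    putchar('\n');
    return 0;
}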
Another suggestion is to replace repeated uses of p[i] with a reference (or a pointer) to that location. Again, this is to help out the compiler when the optimization level is not set optimally:
int& p_slot_i = p[i]; // a reference to p[i]
// or
int* p_slot_p = &p[i]; // a pointer to p[i]
//...
t = p_slot_i;
while (temp[i + p_slot_i + 1] == temp[i - 1 - p_slot_i])
{
    ++p_slot_i; // with the pointer version this must be ++*p_slot_p;
}               // *p_slot_p++ would advance the pointer, not the value
Lastly, elimination of spaces, blank lines and curly braces DOES NOT AFFECT PROGRAM PERFORMANCE. A program written on one line or spaced across multiple lines has the exact same assembly translation and the exact same performance. So please, add spaces, blank lines and curly braces to improve readability.
Edit 1: performance of min()
You may want to declare your min() function as inline to suggest to the compiler that the function body be pasted where it is called, rather than making a call. Function calls slow down a program's execution.
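Applied to the min() from the code above, that is simply (a sketch):
inline int min(int a, int b)
{
    if (a < b)
        return a;
    return b; // same body as before; inline just hints that the call be pasted in place
}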