I just wanted to know if there is any difference between these two expressions:
1 : a = ( a | ( b&0x7F ) >> 7 );
2 : a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
I'm not only asking about the result, but also about efficiency (though the first one looks better).
The purpose is to concatenate the 7 lower bits of multiple bytes, and I was at first using number 2 like this:
while(thereIsByte)
{
a = ( ( a << 8 ) | ( b&0x7F ) << i );
++i;
}
Thanks.
The two expressions don't do anything alike:
a = ( a | ( b&0x7F ) >> 7 );
Explaining:
a = 0010001000100011
b = 1000100010001100
0x7f = 0000000001111111
b&0x7f = 0000000000001100
(b&0x7f) >> 7 = 0000000000000000 (this is always 0: you select the lowest
                7 bits of 'b' and then shift right by 7 bits,
                discarding all of the selected bits)
(a | (b&0x7f) >> 7) is always equal to `a`
a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
Explaining:
a = 0010001000100011
b = 1000100010001100
0x7f = 0000000001111111
b&0x7f = 0000000000001100
(b&0x7f) << 1 = 0000000000011000
(a << 8) = 0010001100000000
(a << 8) | (b&0x7F) << 1 = 0010001100011000
In the second expression the result has the 3 lowest bytes of a as its 3 highest bytes, and the lowest byte of b, without its highest bit, shifted 1 bit to the left. It is equivalent to a = a * 256 + (b & 0x7f) * 2.
If you want to concatenate the lowest 7 bits of b into a, it would be:
while (thereIsByte) {
a = (a << 7) | (b & 0x7f);
// read the next byte into `b`
}
For example, if sizeof(a) is 4 bytes and you concatenate four 7-bit chunks, the result of the pseudocode would be:
a = uuuuzzzzzzzyyyyyyyxxxxxxxwwwwwww
Where the z are the 7 bits of the first byte read, the y are the 7 bits of the second, and so on. The u are unused bits (they hold whatever was in the lowest 4 bits of a at the beginning).
In this case the size of a needs to be at least the total number of bits you want to concatenate (e.g. at least 32 bits if you want to concatenate four 7-bit chunks).
If a and b are one byte in size there isn't really much to concatenate; you probably need a data structure like boost::dynamic_bitset, to which you can append bits repeatedly and which grows accordingly.
Yes they are different. On MSVC2010 here is the disassembly when both a and b are chars.
a = ( a | ( b&0x7F ) >> 7 );
012713A6 movsx eax,byte ptr [a]
012713AA movsx ecx,byte ptr [b]
012713AE and ecx,7Fh
012713B1 sar ecx,7
012713B4 or eax,ecx
012713B6 mov byte ptr [a],al
a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
012813A6 movsx eax,byte ptr [a]
012813AA shl eax,8
012813AD movsx ecx,byte ptr [b]
012813B1 and ecx,7Fh
012813B4 shl ecx,1
012813B6 or eax,ecx
012813B8 mov byte ptr [a],al
Notice that the second method performs two shift operations while the first performs only a single shift; this is caused by the order of operations. The first method is slightly shorter, but shifting is one of the CPU's cheapest operations, so the difference is negligible for most applications.
Notice also that the compiler loads a and b as single bytes (movsx), not as full signed ints.
Related
I am calculating CRC on a large chunk of data every cycle in hardware (64B per cycle). In order to parallelize the CRC calculation, I want to calculate the CRC for small data chunks and then XOR them in parallel.
Approach:
We divide the data into small chunks (64B data divided into 8 chunks
of 8B each).
Then we calculate CRC's for all the chunks
individually (8 CRC's in parallel for 8B chunks).
Finally calculate
the CRC for padded data. This answer points out that the CRC
for padded data is obtained by multiplying the old CRC with x^n.
Hence, I calculate the CRC for a small chunk of data, then multiply it with the CRC of 0x1 shifted by 'i' positions, as shown below.
In short, I am trying to accomplish the following:
For example: CRC-8 on this site:
Input Data=(0x05 0x07) CRC=0x54
Step-1: Data=0x5 CRC=0x1B
Step-2: Data=0x7 CRC=0x15
Step-3: Data=(0x1 0x0) CRC=0x15
Step-4: Multiply step-1 CRC and step-3 CRC with primitive polynomial 0x7. So, I calculate (0x1B).(0x15) = (0x1 0xC7) mod 0x7.
Step-5: Calculate CRC Data=(0x1 0xC7) CRC=0x4E (I assume this is same as (0x1 0xC7) mod 0x7)
Step-6: XOR the result to get the final CRC. 0x4E^0x15=0x5B
As we can see, the result in step-6 is not the correct result.
Can someone help me how to calculate the CRC for padded data? Or where am I going wrong in the above example?
Rather than calculate and then adjust multiple CRCs, bytes of data can be carryless multiplied to form a set of 16-bit "folded" products, which are then XOR'ed, and a single modulo operation performed on the XOR'ed "folded" products. An optimized modulo operation uses two carryless multiplies, so it's avoided until all folded products have been generated and XOR'ed together. A carryless multiply uses XOR instead of ADD, and a borrowless divide uses XOR instead of SUB.
Intel has a pdf file about this using the XMM instruction PCLMULQDQ (carryless multiply), where 16 bytes are read at a time, split into two 8-byte groups, with each group folded into a 16-byte product, and the two 16-byte products XOR'ed to form a single 16-byte product. Using 8 XMM registers to hold folding products, 128 bytes at a time are processed (256 bytes at a time in the case of AVX512 and ZMM registers).
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
Assume your hardware can implement a carryless multiply that takes two 8 bit operands and produces a 16 bit (technically 15 bit) product.
Let message = M = 31 32 33 34 35 36 37 38. In this case CRC(M) = C7
pre-calculated constants (all values shown in hex):
2^38%107 = DF cycles forwards 0x38 bits
2^30%107 = 29 cycles forwards 0x30 bits
2^28%107 = 62 cycles forwards 0x28 bits
2^20%107 = 16 cycles forwards 0x20 bits
2^18%107 = 6B cycles forwards 0x18 bits
2^10%107 = 15 cycles forwards 0x10 bits
2^08%107 = 07 cycles forwards 0x08 bits
2^00%107 = 01 cycles forwards 0x00 bits
16 bit folded (cycled forward) products (can be calculated in parallel):
31·DF = 16CF
32·29 = 07E2
33·62 = 0AC6
34·16 = 03F8
35·6B = 0A17
36·15 = 038E
37·07 = 0085
38·01 = 0038
----
V = 1137 the xor of the 8 folded products
CRC(V) = 113700 % 107 = C7
To avoid having to use borrowless divide for the modulo operation, CRC(V) can be computed using carryless multiply. For example
V = FFFE
CRC(V) = FFFE00 % 107 = 23.
Implementation, again all values in hex (hex 10 = decimal 16), ⊕ is XOR.
input:
V = FFFE
constants:
P = 107 polynomial
I = 2^10 / 107 = 107 "inverse" of polynomial
by coincidence, it's the same value
2^10 % 107 = 15 for folding right 16 bits
fold the upper 8 bits of FFFE00 16 bits to the right:
U = FF·15 ⊕ FE00 = 0CF3 ⊕ FE00 = F2F3 (check: F2F3%107 = 23 = CRC)
Q = ((U>>8)·I)>>8 = (F2·107)>>8 = ...
to avoid a 9 bit operand, split up 107 = 100 ⊕ 7
Q = ((F2·100) ⊕ (F2·07))>>8 = ((F2<<8) ⊕ (F2·07))>>8 = (F200 ⊕ 02DE)>>8 = F0DE>>8 = F0
X = Q·P = F0·107 = F0·100 ⊕ F0·07 = F0<<8 ⊕ F0·07 = F000 ⊕ 02D0 = F2D0
CRC = U ⊕ X = F2F3 ⊕ F2D0 = 23
Since the CRC is 8 bits, there's no need for the upper 8 bits in the last two steps, but it doesn't help that much for the overall calculation.
X = (Q·(P&FF))&FF = (F0·07)&FF = D0
CRC = (U&FF) ⊕ X = F3 ⊕ D0 = 23
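The reduction steps above (U, Q, X) can be sketched in C. clmul is a plain software stand-in for the hardware carryless multiplier, and the constants P, I, and 0x15 are the ones given in the text:

```c
#include <stdint.h>

/* Plain software carryless multiply; operands assumed below 2^16. */
uint32_t clmul(uint32_t a, uint32_t b)
{
    uint32_t p = 0;
    for (int i = 0; i < 16; i++)
        if (b & (1u << i))
            p ^= a << i;
    return p;
}

/* CRC of a 16-bit folded value V, i.e. (V << 8) % 0x107, using two
   carryless multiplies instead of a borrowless divide. */
uint8_t crc8_of_folded(uint16_t v)
{
    const uint32_t P = 0x107;  /* the polynomial */
    const uint32_t I = 0x107;  /* 2^0x10 / P ("inverse"), same value by coincidence */
    const uint32_t K = 0x15;   /* 2^0x10 % P, folds the upper byte 16 bits right */
    uint32_t u = clmul(v >> 8, K) ^ ((uint32_t)(v & 0xFF) << 8); /* U */
    uint32_t q = clmul(u >> 8, I) >> 8;                          /* Q, quotient estimate */
    return (uint8_t)(u ^ clmul(q, P));                           /* U xor X = CRC */
}
```

Running the worked example: crc8_of_folded(0xFFFE) yields 0x23, and crc8_of_folded(0x1137) yields 0xC7, the CRC of the folded message.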
Example program to generate 2^0x10 / 0x107 and powers of 2 % 0x107:
#include <stdio.h>
#include <stdint.h>
#define poly 0x107
uint16_t geninv(void) /* generate 2^16 / 9 bit poly */
{
uint16_t q = 0x0000u; /* quotient */
uint16_t d = 0x0001u; /* initial dividend = 2^0 */
for(int i = 0; i < 16; i++){
d <<= 1;
q <<= 1;
if(d&0x0100){ /* if bit 8 set */
q |= 1; /* q |= 1 */
d ^= poly; /* d ^= poly */
}
}
return q; /* return inverse */
}
uint8_t powmodpoly(int n) /* generate 2^n % 9 bit poly */
{
uint16_t d = 0x0001u; /* initial dividend = 2^0 */
for(int i = 0; i < n; i++){
d <<= 1; /* shift dvnd left */
if(d&0x0100){ /* if bit 8 set */
d ^= poly; /* d ^= poly */
}
}
return (uint8_t)d; /* return remainder */
}
int main()
{
printf("%04x\n", geninv());
printf("%02x %02x %02x %02x %02x %02x %02x %02x %02x %02x\n",
powmodpoly(0x00), powmodpoly(0x08), powmodpoly(0x10), powmodpoly(0x18),
powmodpoly(0x20), powmodpoly(0x28), powmodpoly(0x30), powmodpoly(0x38),
powmodpoly(0x40), powmodpoly(0x48));
printf("%02x\n", powmodpoly(0x77)); /* 0xd9, cycles crc backwards 8 bits */
return 0;
}
Long hand example for 2^0x10 / 0x107.
100000111 quotient
-------------------
divisor 100000111 | 10000000000000000 dividend
100000111
---------
111000000
100000111
---------
110001110
100000111
---------
100010010
100000111
---------
10101 remainder
I don't know how many registers you can have in your hardware design, but assume there are five 16 bit registers used to hold folded values, and either two or eight 8 bit registers (depending on how parallel the folding is done). Then following the Intel paper, you fold values for all 64 bytes, 8 bytes at a time, and only need one modulo operation. Register size, fold# = 16 bits, reg# = 8 bits. Note that powers of 2 modulo poly are pre-calculated constants.
foldv = prior buffer's folding value, equivalent to folded msg[-2 -1]
reg0 = foldv>>8
reg1 = foldv&0xFF
foldv = reg0·((2^0x18)%poly) advance by 3 bytes
foldv ^= reg1·((2^0x10)%poly) advance by 2 bytes
fold0 = msg[0 1] ^ foldv handling 2 bytes at a time
fold1 = msg[2 3]
fold2 = msg[4 5]
fold3 = msg[6 7]
for(i = 8; i < 56; i += 8){
reg0 = fold0>>8
reg1 = fold0&ff
fold0 = reg0·((2^0x48)%poly) advance by 9 bytes
fold0 ^= reg1·((2^0x40)%poly) advance by 8 bytes
fold0 ^= msg[i+0 i+1]
reg2 = fold1>>8 if not parallel, reg0
reg3 = fold1&ff and reg1
fold1 = reg2·((2^0x48)%poly) advance by 9 bytes
fold1 ^= reg3·((2^0x40)%poly) advance by 8 bytes
fold1 ^= msg[i+2 i+3]
...
fold3 ^= msg[i+6 i+7]
}
reg0 = fold0>>8
reg1 = fold0&ff
fold0 = reg0·((2^0x38)%poly) advance by 7 bytes
fold0 ^= reg1·((2^0x30)%poly) advance by 6 bytes
reg2 = fold1>>8 if not parallel, reg0
reg3 = fold1&ff and reg1
fold1 = reg2·((2^0x28)%poly) advance by 5 bytes
fold1 ^= reg3·((2^0x20)%poly) advance by 4 bytes
fold2 ... advance by 3 2 bytes
fold3 ... advance by 1 0 bytes
foldv = fold0^fold1^fold2^fold3
Say the final buffer has 5 bytes:
foldv = prior folding value, equivalent to folded msg[-2 -1]
reg0 = foldv>>8
reg1 = foldv&0xFF
foldv = reg0·((2^0x30)%poly) advance by 6 bytes
foldv ^= reg1·((2^0x28)%poly) advance by 5 bytes
fold0 = msg[0 1] ^ foldv
reg0 = fold0>>8
reg1 = fold0&ff
fold0 = reg0·((2^0x20)%poly) advance by 4 bytes
fold0 ^= reg1·((2^0x18)%poly) advance by 3 bytes
fold1 = msg[2 3]
reg2 = fold1>>8
reg3 = fold1&ff
fold1 = reg2·((2^0x10)%poly) advance by 2 bytes
fold1 ^= reg3·((2^0x08)%poly) advance by 1 byte
fold2 = msg[4] just one byte loaded
fold3 = 0
foldv = fold0^fold1^fold2^fold3
now use the method above to calculate CRC(foldv)
As shown in your diagram, you need to calculate the CRC of 0x05 0x00, (A,0), and the CRC of 0x00 0x07, (0,B), and then exclusive-or those together. Calculating on the site you linked, you get 0x41 and 0x15 respectively. Exclusive-or those together, and, voila, you get 0x54, the CRC of 0x05 0x07.
There is a shortcut for (0,B), since for this CRC, the CRC of a string of zeros is zero. You can calculate the CRC of just 0x07 and get the same result as for 0x00 0x07, which is 0x15.
See crcany for how to combine CRCs in general. crcany will generate C code to compute any specified CRC, including code to combine CRCs. It employs a technique that applies n zeros to a CRC in O(log(n)) time instead of O(n) time.
Could somebody explain how this actually works, for example with the char input 'a'?
I understand that << shifts the bits over by four places (for more than one character). But why add 9 in the second part? I know 0xf = 15... Am I missing something obvious?
result = result << 4 | *str + 9 & 0xf;
Here is my understand so far:
char input = 'a' ascii value is 97. Add 9 is 106, 106 in binary is 01101010. 0xf = 15 (00001111), therefore 01101010 & 00001111 = 00001010, this gives the value of 10 and the result is then appended on to result.
Thanks in advance.
First, let's rewrite this with parenthesis to make the order of operations more clear:
result = (result << 4) | ((*str + 9) & 0xf);
If result is 0 on input, then we have:
result = (0 << 4) | ((*str + 9) & 0xf);
Which simplifies to:
result = (0) | ((*str + 9) & 0xf);
And again to:
result = (*str + 9) & 0xf;
Now let's look at the hex and binary representations of a - f:
a = 0x61 = 01100001
b = 0x62 = 01100010
c = 0x63 = 01100011
d = 0x64 = 01100100
e = 0x65 = 01100101
f = 0x66 = 01100110
After adding 9, the & 0xf operation clears out the top 4 bits, so we don't need to worry about those. So we're effectively just adding 9 to the lower 4 bits. In the case of a, the lower 4 bits are 1, so adding 9 gives you 10, and similarly for the others.
As chux mentioned in his comment, a more straightforward way of achieving this is as follows:
result = *str - 'a' + 10;
Suppose X and Y are two positive integers and Y is a power of two. Then what does this expression calculate?
(X+Y-1) & ~(Y-1)
I found this expression appearing in certain c/c++ implementation of Memory Pool (X represents the object size in bytes and Y represents the alignment in bytes, the expression returns the block size in bytes fit for use in the Memory Pool).
&~(Y-1), where Y is a power of 2, zeroes the last n bits, where Y = 2^n: Y-1 produces n 1-bits, inverting that via ~ gives you a mask with n zeroes at the end, and ANDing via bit-level & zeroes the bits where the mask is zero.
Effectively that produces a number that is some multiple of Y's power of 2.
The masking can reduce the number by at most Y-1, so that amount is added first, giving (X+Y-1) & ~(Y-1). This is a number that's not less than X, and is a multiple of Y.
It gives you the next Y-aligned address of current address X.
Say, your current address X is 0x10000, and your alignment is 0x100, it will give you 0x10000. But if your current address X is 0x10001, you will get "next" aligned address of 0x10100.
This is useful in the scenario that you want your new object always to be aligned to blocks in memory, but not leaving any block unused. So you want to know what is the next available block-aligned address.
Why don't you just try some input and observe what happens?
#include <iostream>
unsigned compute(unsigned x, unsigned y)
{
return (x + y - 1) & ~(y - 1);
}
int main()
{
std::cout << "(x + y - 1) & ~(y - 1)" << std::endl;
for (unsigned x = 0; x < 9; ++x)
{
std::cout << "x=" << x << ", y=2 -> " << compute(x, 2) << std::endl;
}
std::cout << "----" << std::endl;
std::cout << "(x + y - 1) & ~(y - 1)" << std::endl;
for (unsigned x = 0; x < 9; ++x)
{
std::cout << "(x=" << x << ", y=4) -> " << compute(x, 4) << std::endl;
}
return 0;
}
Live Example
Output:
First set uses x in [0, 8] and y is constant 2. Second set uses x in [0, 8] and y is constant 4.
(x + y - 1) & ~(y - 1)
x=0, y=2 -> 0
x=1, y=2 -> 2
x=2, y=2 -> 2
x=3, y=2 -> 4
x=4, y=2 -> 4
x=5, y=2 -> 6
x=6, y=2 -> 6
x=7, y=2 -> 8
x=8, y=2 -> 8
----
(x + y - 1) & ~(y - 1)
(x=0, y=4) -> 0
(x=1, y=4) -> 4
(x=2, y=4) -> 4
(x=3, y=4) -> 4
(x=4, y=4) -> 4
(x=5, y=4) -> 8
(x=6, y=4) -> 8
(x=7, y=4) -> 8
(x=8, y=4) -> 8
It's easy to see the output (i.e., result right of ->) is always a multiple of y such that the output is greater than or equal to x.
First I assume that X and Y are unsigned integers.
Let's have a look at the right part:
If Y is a power of 2, it is represented in binary by a single 1 bit with all the others 0. Example: 8 in binary is 00..01000.
If you subtract 1, that bit becomes 0 and all the bits to its right become 1. Example: 8-1 = 7, in binary 00..00111.
If you negate this number with ~, all the higher bits (including the original one) turn to 1 and the lowest bits to 0. Example: ~7 is 11..11000.
Now if you do a binary AND (&) with any number, you set to 0 all the lower bits; in our example, the 3 lower bits. The resulting number is hence a multiple of Y.
Let's look at the left side:
We've already analysed Y-1. In our example we had 7, that is 00..00111.
Adding this to any number ensures that the AND afterwards cannot bring the result below the original number, i.e. it rounds up instead of down. Example with 5: 5+7=12, so 00..01100; and with 10: 10+7=17, so 00..10001.
If you then perform the AND, you erase the lower bits. So in our example with 5 we get 00..01000 = 8, and with 10 we get 00..10000 = 16.
Conclusion: it's the smallest multiple of Y which is greater than or equal to X.
Let's break it down, piece by piece.
(X+Y-1) & ~(Y-1)
Let's suppose that X = 11 and Y = 16 in accordance with your rules and that the integers are 8 bits.
(11+16-1) & ~(16-1)
Do the Addition and Subtraction
(26) & ~(15)
Translate this into binary
(0001 1010) & ~(0000 1111)
~ means NOT, i.e. invert the zeros and ones
(0001 1010) & (1111 0000)
& keeps only the bits that are one in both operands
0001 0000
convert back to decimal
16
other examples
X = 78, Y = 32 results in 96
X = 25, Y = 64 results in 64
X = 47, Y = 16 results in 48
So, it would seem to me that the purpose of this is to find the lowest multiple of Y that is equal to or greater than X. This could be used for finding the start/end address of a block of memory, for positioning items on the screen, or any number of other things. But without context, and possibly a full code example, there's no guarantee.
(X+Y-1) & ~(Y-1)
x = 7 = 0b0111
y = 4 = 0b0100
x+y-1 = 0b1010
y-1 = 3 = 0b0011
~(y-1) = 0b1100
(x+y-1) & ~(y-1) = 0b1000 = 8
--
x = 12 = 0b1100
y = 2 = 0b0010
x+y-1 = 13 = 0b1101
y-1 = 1 = 0b0001
~(y-1) = 0b1110
(x+y-1) & ~(y-1) = 0b1100 = 12
(x+y-1) & ~(y-1) is the smallest multiple of y greater than or equal to x
It seems to provide a specified alignment of a value, for example of a memory address (e.g. when you want to get the next aligned address).
For example if you want that a memory address would be aligned at the paragraph bound you can write
( address + 16 - 1 ) & ~( 16 - 1 )
or
( address + 15 ) & ~15
or
( address + 15 ) & ~0xf
In this case the four low bits of the address will be zeroed.
This part of the expression
( address + alignment - 1 )
is used for rounding up,
and this part of the expression
~( alignment - 1 )
is used to build a mask that zeroes the low bits.
I have the following code which is about 7 times faster than inet_addr. I was wondering if there is a way to improve it further, or if a faster alternative exists.
This code requires that a valid null-terminated IPv4 address is supplied with no whitespace, which in my case is always the way, so I optimized for that case. Usually you would have more error checking, but if there is a way to make the following even faster, or a faster alternative exists, I would really appreciate it.
UINT32 GetIP(const char *p)
{
UINT32 dwIP=0,dwIP_Part=0;
while(true)
{
if(p[0] == 0)
{
dwIP = (dwIP << 8) | dwIP_Part;
break;
}
if(p[0]=='.')
{
dwIP = (dwIP << 8) | dwIP_Part;
dwIP_Part = 0;
p++;
}
dwIP_Part = (dwIP_Part*10)+(p[0]-'0');
p++;
}
return dwIP;
}
Since we are speaking about maximizing throughput of IP address parsing, I suggest using a vectorized solution.
Here is an x86-specific fast solution (needs SSE4.1, or at least SSSE3 as a slower fallback):
__m128i shuffleTable[65536]; //can be reduced 256x times, see #IwillnotexistIdonotexist
UINT32 MyGetIP(const char *str) {
__m128i input = _mm_lddqu_si128((const __m128i*)str); //"192.167.1.3"
input = _mm_sub_epi8(input, _mm_set1_epi8('0')); //1 9 2 254 1 6 7 254 1 254 3 208 245 0 8 40
__m128i cmp = input; //...X...X.X.XX... (signs)
UINT32 mask = _mm_movemask_epi8(cmp); //6792 - magic index
__m128i shuf = shuffleTable[mask]; //10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1
__m128i arr = _mm_shuffle_epi8(input, shuf); //3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0
__m128i coeffs = _mm_set_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
__m128i prod = _mm_maddubs_epi16(coeffs, arr); //3 0 | 1 0 | 67 100 | 92 100
prod = _mm_hadd_epi16(prod, prod); //3 | 1 | 167 | 192 | ? | ? | ? | ?
__m128i imm = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 4, 2, 0);
prod = _mm_shuffle_epi8(prod, imm); //3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
return _mm_extract_epi32(prod, 0);
// return (UINT32(_mm_extract_epi16(prod, 1)) << 16) + UINT32(_mm_extract_epi16(prod, 0)); //no SSE 4.1
}
And here is the required precalculation for shuffleTable:
void MyInit() {
memset(shuffleTable, -1, sizeof(shuffleTable));
int len[4];
for (len[0] = 1; len[0] <= 3; len[0]++)
for (len[1] = 1; len[1] <= 3; len[1]++)
for (len[2] = 1; len[2] <= 3; len[2]++)
for (len[3] = 1; len[3] <= 3; len[3]++) {
int slen = len[0] + len[1] + len[2] + len[3] + 4;
int rem = 16 - slen;
for (int rmask = 0; rmask < 1<<rem; rmask++) {
// { int rmask = (1<<rem)-1; //note: only maximal rmask is possible if strings are zero-padded
int mask = 0;
char shuf[16] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
int pos = 0;
for (int i = 0; i < 4; i++) {
for (int j = 0; j < len[i]; j++) {
shuf[(3-i) * 4 + (len[i]-1-j)] = pos;
pos++;
}
mask ^= (1<<pos);
pos++;
}
mask ^= (rmask<<slen);
_mm_store_si128(&shuffleTable[mask], _mm_loadu_si128((__m128i*)shuf));
}
}
}
Full code with testing is available here. On an Ivy Bridge processor it prints:
C0A70103
Time = 0.406 (1556701184)
Time = 3.133 (1556701184)
It means that the suggested solution is 7.8 times faster in terms of throughput than the code by OP. It processes 336 million addresses per second (on a single 3.4 GHz core).
Now I'll try to explain how it works. Note that on each line of the listing you can see contents of the value just computed. All the arrays are printed in little-endian order (though set intrinsics use big-endian).
First of all, we load 16 bytes from an unaligned address with the lddqu instruction. Note that in 64-bit mode memory is allocated in 16-byte chunks, so this tends to work automatically. On 32-bit it may theoretically cause issues with out-of-range access, though I do not believe it really can. The subsequent code works properly regardless of the values in the after-the-end bytes. Anyway, you'd better ensure that each IP address takes at least 16 bytes of storage.
Then we subtract '0' from all the chars. After that '.' turns into -2, and zero turns into -48, all the digits remain nonnegative. Now we take bitmask of signs of all the bytes with _mm_movemask_epi8.
Depending on the value of this mask, we fetch a nontrivial 16-byte shuffling mask from the lookup table shuffleTable. The table is quite large: 1 MB total. And it takes quite some time to precompute. However, it does not take precious space in the CPU cache, because only 81 elements of the table are really used: each part of an IP address can be one, two, or three digits long, hence 3^4 = 81 variants in total.
Note that random trashy bytes after the end of the string may in principle cause increased memory footprint in the lookup table.
EDIT: you can find a version modified by #IwillnotexistIdonotexist in the comments, which uses a lookup table of only 4 KB (it is a bit slower, though).
The ingenious _mm_shuffle_epi8 intrinsic allows us to reorder the bytes with our shuffle mask. As a result XMM register contains four 4-byte blocks, each block contains digits in little-endian order. We convert each block into a 16-bit number by _mm_maddubs_epi16 followed by _mm_hadd_epi16. Then we reorder bytes of the register, so that the whole IP address occupies the lower 4 bytes.
Finally, we extract the lower 4 bytes from the XMM register to GP register. It is done with SSE4.1 intrinsic (_mm_extract_epi32). If you don't have it, replace it with other line using _mm_extract_epi16, but it will run a bit slower.
Finally, here is the generated assembly (MSVC2013), so that you can check that your compiler does not generate anything suspicious:
lddqu xmm1, XMMWORD PTR [rcx]
psubb xmm1, xmm6
pmovmskb ecx, xmm1
mov ecx, ecx //useless, see #PeterCordes and #IwillnotexistIdonotexist
add rcx, rcx //can be removed, see #EvgenyKluev
pshufb xmm1, XMMWORD PTR [r13+rcx*8]
movdqa xmm0, xmm8
pmaddubsw xmm0, xmm1
phaddw xmm0, xmm0
pshufb xmm0, xmm7
pextrd eax, xmm0, 0
P.S. If you are still reading it, be sure to check out comments =)
As for alternatives: this is similar to yours but with some error checking:
#include <iostream>
#include <string>
#include <cstdint>
uint32_t getip(const std::string &sip)
{
uint32_t r=0, b, p=0, c=0;
const char *s;
s = sip.c_str();
while (*s)
{
r<<=8;
b=0;
while (*s&&((*s==' ')||(*s=='\t'))) s++;
while (*s)
{
if ((*s==' ')||(*s=='\t')) { while (*s&&((*s==' ')||(*s=='\t'))) s++; if (*s!='.') break; }
if (*s=='.') { p++; s++; break; }
if ((*s>='0')&&(*s<='9'))
{
b*=10;
b+=(*s-'0');
s++;
}
}
if ((b>255)||(*s=='.')) return 0;
r+=b;
c++;
}
return ((c==4)&&(p==3))?r:0;
}
void testip(const std::string &sip)
{
uint32_t nIP=0;
nIP = getip(sip);
std::cout << "\nsIP = " << sip << " --> " << std::hex << nIP << "\n";
}
int main()
{
testip("192.167.1.3");
testip("292.167.1.3");
testip("192.267.1.3");
testip("192.167.1000.3");
testip("192.167.1.300");
testip("192.167.1.");
testip("192.167.1");
testip("192.167..1");
testip("192.167.1.3.");
testip("192.1 67.1.3.");
testip("192 . 167 . 1 . 3");
testip(" 192 . 167 . 1 . 3 ");
return 0;
}
#include <iostream>
using namespace std;

int main()
{
unsigned char c,i;
union temp
{
float f;
char c[4];
} k;
cin>>k.f;
c=128;
for(i=0;i<8;i++)
{
if(k.c[3] & c) cout<<'1';
else cout<<'0';
c=c>>1;
}
c=128;
cout<<'\n';
for(i=0;i<8;i++)
{
if(k.c[2] & c) cout<<'1';
else cout<<'0';
c=c>>1;
}
return 0;
}
if(k.c[2] & c)
That is called bitwise AND.
Illustration of bitwise AND
//illustration : mathematics of bitwise AND
a = 10110101 (binary representation)
b = 10011010 (binary representation)
c = a & b
= 10110101 & 10011010
= 10010000 (binary representation)
= 128 + 16 (decimal)
= 144 (decimal)
Bitwise AND uses this truth table:
X | Y | R = X & Y
---------
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
See these tutorials on bitwise AND:
Bitwise Operators in C and C++: A Tutorial
Bitwise AND operator &
A bitwise operation (AND in this case) performs a bit-by-bit operation between the 2 operands.
For example the & :
11010010 &
11000110 =
11000010
Bitwise Operation in your code
c = 128 therefore the binary representation is
c = 10000000
a & c ANDs every i-th bit of c with the i-th bit of a. Because c has a 1 only in the MSB position (bit 7), a & c is non-zero if a has a 1 in bit 7, and zero if a has a 0 there. This logic is used in the if block above: the if block is entered depending on whether the MSB (bit 7) of the byte is 1 or not.
Suppose a = ? ? ? ? ? ? ? ? where a ? is either 0 or 1
Then
a = ? ? ? ? ? ? ? ?
AND & & & & & & & &
c = 1 0 0 0 0 0 0 0
---------------
? 0 0 0 0 0 0 0
As 0 & ? = 0 and 1 & ? = ?, the answer is 0 if bit position 7 is 0, and 1 if bit position 7 is 1.
In each iteration c is shifted right one position (c = c >> 1), so the single 1 in c propagates to the right. So in each iteration, masking with the other variable tells you whether there is a 1 or a 0 at that position of the variable.
Use in your code
You have
union temp
{
float f;
char c[4];
} k;
Inside the union the float and the char c[4] share the same memory location (as the property of union).
Now, sizeof(f) is 4 bytes. You assign k.f = 5345341 or whatever. When you access k.c[0] it accesses the 0th byte of the float f; k.c[1] accesses the 1st byte of the float f. The array is not separate storage: the float and the array refer to the same memory location but access it differently. This is effectively a mechanism to access the 4 bytes of the float bytewise.
NOTE THAT k.c[0] may address the last byte instead of the first byte (as told above); this depends on the byte ordering of storage in memory (see little-endian and big-endian byte ordering).
Union k
+--------+--------+--------+--------+ --+
|  c[0]  |  c[1]  |  c[2]  |  c[3]  |   |
+--------+--------+--------+--------+   |---> Shares same location (in little endian)
|              float f              |   |
+-----------------------------------+ --+
Or the byte ordering could be reversed
Union k
+--------+--------+--------+--------+ --+
|  c[3]  |  c[2]  |  c[1]  |  c[0]  |   |
+--------+--------+--------+--------+   |---> Shares same location (in big endian)
|              float f              |   |
+-----------------------------------+ --+
Your code loops over this and shifts c, which moves the single 1 from bit 7 down to bit 0 one step at a time; the bitwise AND thus checks every bit position of the bytes of the float f, printing 1 if the bit is set and 0 otherwise.
If you print all the 4 bytes of the float, then you can see the IEEE 754 representation.
c has a single bit set: 128 is 10000000 in binary. if(k.c[2] & c) checks whether that bit is set in k.c[2] as well. Then the bit in c is shifted along to check the other bits.
As a result, the program displays the binary representation of the float, it seems.