I was reading the following blog post. A so-called "SHLD+BSR" Huffman decoder is mentioned, which is then further expanded to MOV, MOV, SHLD, OR, BSR, MOV, SHR, MOV, OR, ADD, ADC, however I have not found any reference or source code that describes such decoding. Does anyone know which decoding method is being referred to?
I haven't really succeeded at understanding this method for decoding Huffman codes, but the relevant "inner loop" here contains something like this (edited slightly to make the SHLD and BSR obvious):
uint32 posidx = pos >> 5;
uint32 code = src32[posidx];
uint32 extrabits = src32[posidx + 1];
SHLD(code, extrabits, pos);
code |= 1;
uint32 idx = BSR(code);
const uint8 *p = (const uint8 *)(table->mBsrLenTable[idx] +
    2*(code >> table->mBsrShiftTable[idx]));
result = p[0];
pos += p[1];
That accounts for the MOV, MOV, SHLD, OR, BSR, MOV, SHR, MOV, but then I'm not sure anymore. I think the ADD refers to the multiplication by 2, the ADC is actually a trick to add p[1] to pos, and the OR joins the mBsrLenTable entry with the rest of the code, but that seems to be in the wrong order, and then the OR would correspond to an addition in the source. Perhaps I shouldn't do this sort of thing after midnight...
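For what it's worth, here is my guess at how the C maps onto the quoted instruction list, written as comments against the same snippet (unverified; I have not checked this against VirtualDub's actual disassembly):
uint32 code = src32[posidx];                // MOV  (load the dword at the bit position)
uint32 extrabits = src32[posidx + 1];       // MOV  (load the next dword)
SHLD(code, extrabits, pos);                 // SHLD (align the bitstream to pos)
code |= 1;                                  // OR   (guarantee BSR finds a set bit)
uint32 idx = BSR(code);                     // BSR
uint32 shift = table->mBsrShiftTable[idx];  // MOV  (table load)
uint32 slot = code >> shift;                // SHR
// the remaining MOV, OR, ADD, ADC presumably cover the mBsrLenTable load, the
// 2*slot indexing, reading p[0]/p[1] and advancing pos, but as said above I'm
// not sure of the exact order.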
You'd probably better take a look at the source yourself, because to be honest my answer is useless. I got it here: sourceforge.net/projects/virtualdub/files/virtualdub-win/1.9.11.32842/VirtualDub-1.9.11-src.7z. Look for the file src\Meia\source\decode_huffyuv.cpp; it starts with the table initialization, and the actual decoding is in a macro called DECODE, about 200 lines down.
I am writing a virtual machine for my own assembly language. I want to be able to set the carry, parity, zero, sign and overflow flags as they are set in the x86-64 architecture when I perform operations such as addition.
Notes:
I am using Microsoft Visual C++ 2015 & Intel C++ Compiler 16.0
I am compiling as a Win64 application.
My virtual machine (currently) only does arithmetic on 8-bit integers
I'm not (currently) interested in any other flags (e.g. AF)
My current solution is using the following function:
void update_flags(uint16_t input)
{
    Registers::flags.carry = (input > UINT8_MAX);
    Registers::flags.zero = (input == 0);
    Registers::flags.sign = (input < 0);
    Registers::flags.overflow = (int16_t(input) > INT8_MAX || int16_t(input) < INT8_MIN);
    // I am assuming that overflow is handled by truncation
    uint8_t input8 = uint8_t(input);
    // The parity flag
    int ones = 0;
    for (int i = 0; i < 8; ++i)
        if ((input8 & (1 << i)) != 0) ++ones;
    Registers::flags.parity = (ones % 2 == 0);
}
For addition, I would use it as follows:
uint8_t a, b;
update_flags(uint16_t(a) + uint16_t(b));
uint8_t c = a + b;
EDIT:
To clarify, I want to know if there is a more efficient/neat way of doing this (such as by accessing RFLAGS directly)
Also my code may not work for other operations (e.g. multiplication)
EDIT 2: I have now updated my code to this:
void update_flags(uint32_t result)
{
    Registers::flags.carry = (result > UINT8_MAX);
    Registers::flags.zero = (result == 0);
    Registers::flags.sign = (int32_t(result) < 0);
    Registers::flags.overflow = (int32_t(result) > INT8_MAX || int32_t(result) < INT8_MIN);
    Registers::flags.parity = (_mm_popcnt_u32(uint8_t(result)) % 2 == 0);
}
One more question: will my code for the carry flag work properly? I also want it to be set correctly for "borrows" that occur during subtraction.
Note: The assembly language I am virtualising is of my own design, meant to be simple and based on Intel's implementation of x86-64 (i.e. Intel64), and so I would like these flags to behave in mostly the same way.
TL:DR: use lazy flag evaluation, see below.
input is a weird name. Most ISAs update flags based on the result of an operation, not the inputs. You're looking at the 16bit result of an 8bit operation, which is an interesting approach. In C, you should just use unsigned int, which is guaranteed to be at least as wide as uint16_t. It will compile to better code on x86, where unsigned is 32bit. 16bit ops take an extra prefix and can lead to partial-register slowdowns.
That might help with the 8bx8b->16b mul problem you noted, depending on how you want to define the flag-updating for the mul instruction in the architecture you're emulating.
I don't think your overflow detection is correct. See this tutorial linked from the x86 tag wiki for how it's done.
This will probably not compile to very fast code, especially the parity flag. Do you need the ISA you're emulating/designing to have a parity flag? You never said you're emulating an x86, so I assume it's some toy architecture you're designing yourself.
An efficient emulator (esp. one that needs to support a parity flag) would probably benefit a lot from some kind of lazy flag evaluation. Save a value that you can compute flags from if needed, but don't actually compute anything until you get to an instruction that reads flags. Most instructions only write flags without reading them, and they just save the uint16_t result into your architectural state. Flag-reading instructions can either compute just the flag they need from that saved uint16_t, or compute all of them and store that somehow.
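For example, a minimal sketch of what that might look like (the LazyFlags name and its members are purely illustrative, not from your code):
#include <cstdint>

// Flag-writing instructions just stash the widened result; nothing is computed yet.
struct LazyFlags {
    uint16_t result;   // 16-bit result of the last flag-writing 8-bit operation

    void set(uint16_t r) { result = r; }

    // Flag-reading instructions compute only what they need, on demand.
    bool zero()  const { return uint8_t(result) == 0; }
    bool sign()  const { return (result & 0x80) != 0; }
    bool carry() const { return result > 0xFF; }   // carry-out of an addition
    // OF (and AF, and the borrow for subtraction) can't be derived from the
    // result alone; you'd also have to save the operands or the operation kind.
};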
Assuming you can't get the compiler to actually read PF from the result, you might try _mm_popcnt_u32((uint8_t)x) & 1. Or, horizontally XOR all the bits together:
x = (x&0b00001111) ^ (x>>4)
x = (x&0b00000011) ^ (x>>2)
PF = (x&0b00000001) ^ (x>>1) // tweaking this to produce better asm is probably possible
I doubt any of the major compilers can peephole-optimize a bunch of checks on a result into LAHF + SETO al, or a PUSHF. Compilers can be led into using a flag condition to detect integer overflow to implement saturating addition, for example. But having it figure out that you want all the flags, and actually use LAHF instead of a series of setcc instructions, is probably not possible. The compiler would need a pattern-recognizer for when it can use LAHF, and probably nobody's implemented that because the use-cases are so vanishingly rare.
There's no C/C++ way to directly access flag results of an operation, which makes C a poor choice for implementing something like this. IDK if any other languages do have flag results, other than asm.
I expect you could gain a lot of performance by writing parts of the emulation in asm, but that would be platform-specific. More importantly, it's a lot more work.
I appear to have solved the problem by splitting the arguments to update_flags into an unsigned and a signed result, as follows:
void update_flags(int16_t unsigned_result, int16_t signed_result)
{
    Registers::flags.zero = unsigned_result == 0;
    Registers::flags.sign = signed_result < 0;
    Registers::flags.carry = unsigned_result < 0 || unsigned_result > UINT8_MAX;
    Registers::flags.overflow = signed_result < INT8_MIN || signed_result > INT8_MAX;
}
For addition (which should produce the correct result for both signed & unsigned inputs) I would do the following:
int8_t a, b;
int16_t signed_result = int16_t(a) + int16_t(b);
int16_t unsigned_result = int16_t(uint8_t(a)) + int16_t(uint8_t(b));
update_flags(unsigned_result, signed_result);
int8_t c = a + b;
And for signed multiplication I would do the following:
int8_t a, b;
int16_t result = int16_t(a) * int16_t(b);
update_flags(result, result);
int8_t c = a * b;
And so on for the other operations that update the flags
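For example, subtraction (where the carry flag acts as the borrow) follows the same pattern; the zero-extended difference only goes negative when an 8-bit borrow occurs, which the unsigned_result < 0 test in update_flags picks up:
int8_t a, b;
int16_t signed_result = int16_t(a) - int16_t(b);
int16_t unsigned_result = int16_t(uint8_t(a)) - int16_t(uint8_t(b));
update_flags(unsigned_result, signed_result);
int8_t c = a - b;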
Note: I am assuming here that int16_t(a) sign extends, and int16_t(uint8_t(a)) zero extends.
I have also decided against having a parity flag; my _mm_popcnt_u32 solution should work if I change my mind later.
P.S. Thank you to everyone who responded, it was very helpful. Also if anyone can spot any mistakes in my code, that would be appreciated.
In my application I need to use an encryption algorithm that allows me to decrypt a single byte at a requested offset in an encrypted buffer, without reading the surrounding blocks. My choice is AES in CTR mode using the Crypto++ library. Since I couldn't find any good example, I wrote it on my own:
unique_ptr<vector<byte>> GetIV(int counter)
{
    byte* counterPtr = (byte*)&counter;
    unique_ptr<vector<byte>> iv(new vector<byte>());
    for (int j = 0; j < 4; j++)
    {
        iv->push_back(counterPtr[j]);
    }
    return move(iv);
}
unique_ptr<vector<uint8_t>> Encrypt(const vector<uint8_t>& plainInput)
{
    unique_ptr<vector<uint8_t>> encryptedOutput(new vector<uint8_t>(plainInput.size()));
    for (int i = 0; i < plainInput.size(); i++)
    {
        auto iv = GetIV(i);
        CTR_Mode<AES>::Encryption encryptor(_key->data(), _key->size(), iv->data());
        byte encryptedValue = encryptor.ProcessByte(plainInput.at(i));
        encryptedOutput->at(i) = encryptedValue;
    }
    return move(encryptedOutput);
}
unique_ptr<vector<uint8_t>> Decrypt(const vector<uint8_t>& encryptedInput, int position)
{
    unique_ptr<vector<uint8_t>> decryptedOutput(new vector<uint8_t>(encryptedInput.size()));
    for (int i = 0; i < encryptedInput.size(); i++)
    {
        auto iv = GetIV(position + i);
        CTR_Mode<AES>::Decryption decryptor(_key->data(), _key->size(), iv->data());
        byte decryptedValue = decryptor.ProcessByte(encryptedInput.at(i));
        decryptedOutput->at(i) = decryptedValue;
    }
    return move(decryptedOutput);
}
As you can see, I iterate through all bytes in my input buffer and encrypt/decrypt each of them separately, because it is necessary to have a unique counter for each block (in CTR mode). Since I need to be able to decrypt a random byte, I need to have as many blocks as the buffer size, is that correct? My solution works, but it is very, very slow... Am I doing it right? Or maybe there is a much more efficient way to do this?
There are several major problems with your code:
You are using unauthenticated encryption, which is insecure in most application domains. Please use AES-GCM instead, which looks a lot like AES-CTR anyway. This is in fact mentioned right on the documentation of Crypto++.
The IV of CTR mode is 16 bytes long, yet you use only 4 bytes. Your code not only calculates it wrong, but also exhibits undefined behavior.
IV is per message, not per byte.
Because you choose the IV wrong, your algorithm basically reduces to the one-time pad, except not as secure. If you ever encrypt two messages with the same key, the system is broken.
The performance issue is your least concern. This whole implementation is simply incorrect and insecure. You must study cryptography systematically before trying to utilize it, for it is not a field you can learn just by trial and error. It is easy to design a system that passes all the unit tests and looks fine to your own eyes, but is completely broken to trained ones.
I recommend cryptography on coursera.
No, you are not doing this right. You don't need to iterate through the input of the decrypt method at all.
You only have to calculate the right counter for the block that contains the byte to decrypt. Then you can use that counter as the IV value. Now you can encrypt or decrypt the block of ciphertext and retrieve the right byte. There is no need to decrypt specific bytes separately.
So if the block size of the cipher is 16, the IV/nonce is F000000000000000F000000000000000h and the offset of the byte is 260, then the counter/IV needs to be advanced by 260 / 16 = 16 = 10h: F000000000000000F000000000000000h + 10h = F000000000000000F000000000000010h. Then you decrypt that block (block index 16, i.e. the 17th block) and take the byte at offset 4 within it (as 260 % 16 = 4).
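A hedged sketch of that with Crypto++, reusing _key from the question (the DecryptByteAt helper, the single 16-byte base IV for the whole message, and the usual includes/using-directives are my assumptions, not part of the original code):
// Decrypt only the byte at absolute offset `pos`, touching just its block.
byte DecryptByteAt(const vector<byte>& ciphertext, uint64_t pos, const byte baseIV[AES::BLOCKSIZE])
{
    const uint64_t blockIndex = pos / AES::BLOCKSIZE;
    const size_t blockOffset = size_t(pos % AES::BLOCKSIZE);

    // Advance the 16-byte counter block by blockIndex (big-endian addition).
    byte iv[AES::BLOCKSIZE];
    memcpy(iv, baseIV, AES::BLOCKSIZE);
    uint64_t carry = blockIndex;
    for (int i = AES::BLOCKSIZE - 1; i >= 0 && carry != 0; --i)
    {
        carry += iv[i];
        iv[i] = byte(carry & 0xFF);
        carry >>= 8;
    }

    // Decrypt just the block containing the requested byte.
    CTR_Mode<AES>::Decryption decryptor(_key->data(), _key->size(), iv);
    const size_t blockStart = size_t(blockIndex) * AES::BLOCKSIZE;
    const size_t blockLen = min(size_t(AES::BLOCKSIZE), ciphertext.size() - blockStart);
    byte block[AES::BLOCKSIZE];
    decryptor.ProcessData(block, ciphertext.data() + blockStart, blockLen);
    return block[blockOffset];
}
Note this only addresses the random-access question, not the authentication and IV-handling concerns raised above.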
I'm having trouble reading in a 16bit .wav file. I have read in the header information, however, the conversion does not seem to work.
For example, in Matlab if I read in wave file I get the following type of data:
-0.0064, -0.0047, -0.0051, -0.0036, -0.0046, -0.0059, -0.0051
However, in my C++ program the following is returned:
0.960938, -0.00390625, -0.949219, -0.00390625, -0.996094, -0.00390625
I need the data to be represented the same way. Now, for 8 bit .wav files I did the following:
uint8_t c;
for(unsigned i=0; (i < size); i++)
{
    c = (unsigned)(unsigned char)(data[i]);
    double t = (c-128)/128.0;
    rawSignal.push_back(t);
}
This worked. However, when I did this for 16bit:
uint16_t c;
for(unsigned i=0; (i < size); i++)
{
    c = (signed)(signed char)(data[i]);
    double t = (c-256)/256.0;
    rawSignal.push_back(t);
}
it does not work and shows the output (above).
I'm following the standards found Here
Here, data is a char array and rawSignal is a std::vector<double>. I'm probably just handling the conversion wrong but cannot seem to find out where. Anyone have any suggestions?
Thanks
EDIT:
This is what it is now displaying (in a graph):
This is what it should be displaying:
There are a few problems here:
8-bit wavs are unsigned, but 16-bit wavs are signed. Therefore, the subtraction step given in the answers by Carl and Jay is unnecessary. I presume they just copied from your code, but they are wrong.
16-bit wavs have a range from -32,768 to 32,767, not from -256 to 255, making the scaling you are using incorrect anyway.
16-bit wavs are 2 bytes per sample, thus you must read two bytes to make one sample, not one. You appear to be reading one character at a time. When you read the bytes, you may have to swap them if your native endianness is not little-endian.
Assuming a little-endian architecture, your code would look more like this (very close to Carl's answer):
for (int i = 0; i < size; i += 2)
{
    // assemble the little-endian sample and keep it signed
    int16_t c = int16_t(((unsigned char)data[i + 1] << 8) | (unsigned char)data[i]);
    double t = c/32768.0;
    rawSignal.push_back(t);
}
for a big-endian architecture:
for (int i = 0; i < size; i += 2)
{
    int16_t c = int16_t(((unsigned char)data[i] << 8) | (unsigned char)data[i + 1]);
    double t = c/32768.0;
    rawSignal.push_back(t);
}
That code is untested, so please LMK if it doesn't work.
(First of all, about little-endian/big-endian-ness: WAV is just a container format; the data encoded in it can be in countless formats. Most of the codecs are lossy (MPEG Layer-3 aka MP3 streams can be "packaged" into a WAV, as can various CCITT and other codecs). You assume that you are dealing with some kind of PCM format, where you see the actual wave in raw form and no lossy transformation was done on it. The endianness depends on the codec which produced the stream; see: Is the endianness of format params guaranteed in RIFF WAV files?)
It's also a question whether a PCM sample is a linearly scaled integer or whether there is some scaling, log scale or other transformation behind it. The regular PCM wav files I have encountered were simple linear-scale samples, but I'm not working in the audio recording or producing industry.
So a path to your solution:
Make sure that you are dealing with regular 16 bit PCM encoded RIFF WAV file.
While reading the stream, always read two bytes (char) at a time and convert the two chars into a 16 bit short. People showed this before me.
The wave form you show clearly suggests that you either did not estimate the frequency well, or you have just one mono channel instead of stereo. The sampling rate (44.1 kHz, 22 kHz, 11 kHz, 8 kHz, etc.) is just as important as the resolution (8 bit, 16 bit, 24 bit, etc.). In the first case you may have had stereo data: you can read it in as mono and not notice it. In the second case, if you have mono data, then you'll run out of samples halfway through reading the data. That's what seems to happen according to your graphs. As for the other cause: lower sampling resolutions (and 16 bit is also on the lower side) are often paired with lower sampling rates. So if you compute how much data to read from the recording time, and you think you have 22 kHz data but it's actually just 11 kHz, then again you'll run out of actual samples halfway through and read in memory garbage. So it's one of these.
Make sure that you interpret and treat your loop iterator variable and the size well. It seems that size tells how many bytes you have, so you'll have exactly half as many short integer samples. Notice that Bjorn's solution correctly increments i by 2 because of that.
My working code is
int8_t* buffer = new int8_t[size];
/*
HERE buffer IS FILLED
*/
for (int i = 0; i < size; i += 2)
{
    int16_t c = ((unsigned char)buffer[i + 1] << 8) | (unsigned char)buffer[i];
    double t = c/32768.0;
    rawSignal.push_back(t);
}
A 16-bit quantity gives you a range from -32,768 to 32,767, not from -256 to 255 (that's just 9 bits). Use:
for (int i = 0; i < size; i += 2)
{
    c = (data[i + 1] << 8) + data[i]; // WAV files are little-endian
    double t = (c - 32768)/32768.0;
    rawSignal.push_back(t);
}
You might want something more like this:
uint16_t c;
for(unsigned i=0; (i < size); i++)
{
    // get a 16 bit pointer to the array
    uint16_t* p = (uint16_t*)data;
    // get the i-th element
    c = *( p + i );
    // convert to signed? I'm guessing this is what you want
    int16_t cs = (int16_t)c;
    double t = (cs-256)/256.0;
    rawSignal.push_back(t);
}
Your code converts the 8 bit value to a signed value then writes it into an unsigned variable. You should look at that and see if it's what you want.
I use Linux x86_64 and clang 3.3.
Is this even possible in theory?
std::atomic<__int128_t> doesn't work (undefined references to some functions).
__atomic_add_fetch also doesn't work ('error: cannot compile this atomic library call yet').
Both std::atomic and __atomic_add_fetch work with 64-bit numbers.
It's not possible to do this with a single instruction, but you can emulate it and still be lock-free. Except for the very earliest AMD64 CPUs, x64 supports the CMPXCHG16B instruction. With a little multi-precision math, you can do this pretty easily.
I'm afraid I don't know the intrinsic for CMPXCHG16B in GCC, but hopefully you get the idea of having a spin loop of CMPXCHG16B. Here's some untested code for VC++:
// atomically adds 128-bit src to dst, with src getting the old dst.
void fetch_add_128b(uint64_t *dst, uint64_t* src)
{
    uint64_t srclo, srchi, olddst[2], exchlo, exchhi;
    srchi = src[0];
    srclo = src[1];
    olddst[0] = dst[0];
    olddst[1] = dst[1];
    do
    {
        exchlo = srclo + olddst[1];
        exchhi = srchi + olddst[0] + (exchlo < srclo); // add and carry
    }
    while(!_InterlockedCompareExchange128((long long*)dst,
                                          exchhi, exchlo,
                                          (long long*)olddst));
    src[0] = olddst[0];
    src[1] = olddst[1];
}
Edit: here's some untested code going off of what I could find for the GCC intrinsics:
// atomically adds 128-bit src to dst, returning the old dst.
__uint128_t fetch_add_128b(__uint128_t *dst, __uint128_t src)
{
    __uint128_t dstval, olddst;
    dstval = *dst;
    do
    {
        olddst = dstval;
        dstval = __sync_val_compare_and_swap(dst, dstval, dstval + src);
    }
    while(dstval != olddst);
    return dstval;
}
That isn't possible. There is no x86-64 instruction that does a 128-bit add in one instruction, and to do something atomically, a basic starting point is that it be a single instruction (there are some instructions which aren't atomic even then, but that's another matter).
You will need to use some other lock around the 128-bit number.
Edit: It is possible that one could come up with something that uses something like this:
__asm__ __volatile__(
"    mov (%0), %%rax\n"
"    mov 8(%0), %%rdx\n"
"1:\n"
"    mov (%1), %%rbx\n"
"    mov 8(%1), %%rcx\n"
"    add %%rax, %%rbx\n"
"    adc %%rdx, %%rcx\n"
"    lock; cmpxchg16b (%0)\n"
"    jnz 1b\n"
:
: "r"(&arg1), "r"(&arg2)
: "rax", "rbx", "rcx", "rdx", "cc", "memory");
That's just something I just hacked up, and I haven't compiled it, never mind validated that it will work. But the principle is that it repeats until it compares equal.
Edit2: Darn, typing too slow, Cory Nelson just posted the same thing, but using intrinsics.
Edit3: Updated the loop to not unnecessarily read memory that doesn't need reading... CMPXCHG16B does that for us.
Yes; you need to tell your compiler that you're on hardware that supports it.
This answer is going to assume you're on x86-64; there's likely a similar spec for arm.
From the generic x86-64 microarchitecture levels, you'll want at least x86-64-v2 to let the compiler know that you have the cmpxchg16b instruction.
Here's a working godbolt, note the compiler flag -march=x86-64-v2:
https://godbolt.org/z/PvaojqGcx
For more reading on the x86-64-psABI, the spec is published here.
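I can't inline the godbolt listing here, but a minimal sketch of the kind of code it demonstrates would be (compiled with -march=x86-64-v2, or just -mcx16):
#include <atomic>

std::atomic<__int128> counter{0};

void add(__int128 x)
{
    // With cmpxchg16b known to be available, this becomes a lock-free
    // compare-exchange retry loop (inlined or via libatomic, depending on
    // the compiler) rather than taking a lock.
    __int128 old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + x))
    {
        // on failure, `old` has been reloaded with the current value; retry
    }
}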
What's the reason behind applying two explicit type casts as below?
if (unlikely(val != (long)(char)val)) {
Code taken from lxml.etree.c source file from lxml's source package.
That's a cheap way to check to see if there's any junk in the high bits. The char cast chops off the upper 8, 24 or 56 bits (depending on sizeof(val)) and then promotes it back. If char is signed, it will sign extend as well.
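A quick worked example of both effects, assuming an 8-bit signed char and the value coming in as a long (my numbers, not anything from the lxml source):
long val = 0x1234;
// (char)val truncates to 0x34, and (long)(char)val promotes that back to 0x34,
// which differs from 0x1234, so the unlikely() branch is taken.

long val2 = 0x80;
// With a signed char, (char)val2 is -128 and (long)(char)val2 is -128, which
// differs from 0x80, so the check also fires when only bit 7 is set.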
A better test might be:
if (unlikely(val & ~0xff)) {
or
if (unlikely(val & ~0x7f)) {
depending on whether this test cares about bit 7.
Just for grins and completeness, I wrote the following test code:
void RegularTest(long val)
{
    if (val != ((int)(char)val)) {
        printf("Regular = not equal.");
    }
    else {
        printf("Regular = equal.");
    }
}

void MaskTest(long val)
{
    if (val & ~0xff) {
        printf("Mask = not equal.");
    }
    else {
        printf("Mask = equal.");
    }
}
And here's what the cast code turns into in debug in visual studio 2010:
movsx eax, BYTE PTR _val$[ebp]
cmp DWORD PTR _val$[ebp], eax
je SHORT $LN2#RegularTes
this is the mask code:
mov eax, DWORD PTR _val$[ebp]
and eax, -256 ; ffffff00H
je SHORT $LN2#MaskTest
In release, I get this for the cast code:
movsx ecx, al
cmp eax, ecx
je SHORT $LN2#RegularTes
In release, I get this for the mask code:
test DWORD PTR _val$[ebp], -256 ; ffffff00H
je SHORT $LN2#MaskTest
So what's going on? In the cast case it's doing a byte mov with sign extension (ha! bug: the code is not the same, because chars are signed) and then a compare; and to be totally sneaky, the compiler/linker has also made this function use register passing for the argument. In the mask code in release, it has folded everything up into a single test instruction.
Which is faster? Beats me - and frankly unless you're running this kind of test on a VERY slow CPU or are running it several billion times, it won't matter. Not in the least.
So the answer in this case is to write code that is clear about its intent. I would expect a C/C++ jockey to look at the mask code and understand its intent, but if you don't like that, you should opt for something like this instead:
#define BitsAbove8AreSet(x) ((x) & ~0xff)
#define BitsAbove7AreSet(x) ((x) & ~0x7f)
or:
inline bool BitsAbove8AreSet(long t) { return (t & ~0xff) != 0; } // make it a bool to be nice
inline bool BitsAbove7AreSet(long t) { return (t & ~0x7f) != 0; }
And use the predicates instead of the actual code.
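With one of those in place, the original check would read something like:
if (unlikely(BitsAbove7AreSet(val))) {
    /* handle the too-wide value */
}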
In general, I think "is it cheap?" is not a particularly good question to ask about this unless you're working in some very specific problem domains. For example, I work in image processing and when I have some kind of operation going from one image to another, I often have code that looks like this:
BYTE *srcPixel = PixelOffset(src, x, y, srcrowstride, srcdepth);
int srcAdvance = PixelAdvance(srcrowstride, right, srcdepth);
BYTE *dstPixel = PixelOffset(dst, x, y, dstrowstride, dstdepth);
int dstAdvance = PixelAdvance(dstrowstride, right, dstdepth);
for (y = top; y < bottom; y++) {
    for (x = left; x < right; x++) {
        ProcessOnePixel(srcPixel, srcdepth, dstPixel, dstdepth);
        srcPixel += srcdepth;
        dstPixel += dstdepth;
    }
    srcPixel += srcAdvance;
    dstPixel += dstAdvance;
}
And in this case, assume that ProcessOnePixel() is actually a chunk of inline code that will be executed billions and billions of times. In this case, I care a whole lot about not doing function calls, not doing redundant work, not rechecking values, ensuring that the computational flow will translate into something that will use registers wisely, etc. But my actual primary concern is that the code can be read by the next poor schmuck (probably me) who has to look at it.
And in our current coding world, it is FAR FAR CHEAPER for nearly every problem domain to spend a little time up front ensuring that your code is easy to read and maintain than it is to worry about performance out of the gate.
Speculations:
cast to char: to mask the 8 low bits,
cast to long: to bring the value back to signed (if char is unsigned).
If val is a long then the (char) will strip off all but the bottom 8 bits. The (long) casts it back for the comparison.