Inconsistent results in compressing with zlib between Win32 and Linux-64 bit - compression

I'm using zlib in a program and noticed a one-byte difference in how "foo" is compressed on Windows (1F8B080000000000000A4BCBCF07002165738C03000000) versus Linux (1F8B08000000000000034BCBCF07002165738C03000000). Both decompress back to "foo".
I decided to check outside our code to see if the implementation was correct and used the test programs in the zlib repository to double check. I got the same results:
Linux: echo -n foo | ./minigzip64 > text.txt
Windows: echo|set /p="foo" | minigzip > text.txt
What would account for this difference? Is it a problem?
1F8B 0800 0000 0000 00*0A/03* 4BCB CF07 0021 6573 8C03 0000 00 (0A on Windows, 03 on Linux)

First off, if it decompresses to what was compressed, then it's not a problem. Different compressors, or the same compressor at different settings, or even the same compressor with the same settings, but different versions, can produce different compressed output from the same input.
Second, the compressed data in this case is identical. Only the last byte
of the gzip header that precedes the compressed data is different. That byte identifies the originating operating system. Hence it rightly varies between Linux and Windows.
Even on the same operating system, the header can vary since it carries a modification date and time. However, in both of your cases the modification date and time were left out (set to zeros).
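To check this concretely, here is a small sketch (mine, not part of the answer) that compares two .gz files after zeroing the MTIME and OS header bytes; it assumes both files use a plain 10-byte member header with no FLG bits set, as in the two outputs above.

// Sketch: compare two .gz files while ignoring the MTIME (bytes 4-7) and
// OS (byte 9) fields of the gzip header. Assumes a plain 10-byte header
// with no FLG bits set, as in the two outputs above.
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

static std::vector<unsigned char> readAll(const char *path) {
    std::ifstream f(path, std::ios::binary);
    return {std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>()};
}

int main(int argc, char **argv) {
    if (argc != 3) { std::cerr << "usage: gzcmp a.gz b.gz\n"; return 2; }
    auto a = readAll(argv[1]), b = readAll(argv[2]);
    for (auto *v : {&a, &b}) {
        if (v->size() < 10) { std::cerr << "not a gzip member\n"; return 2; }
        for (size_t i : {4, 5, 6, 7, 9}) (*v)[i] = 0;  // blank MTIME and OS
    }
    std::cout << (a == b ? "identical apart from MTIME/OS\n" : "payloads differ\n");
    return a == b ? 0 : 1;
}

For the two "foo" outputs above, this reports them identical apart from MTIME/OS, since only the OS byte differs.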

Just to add to the accepted answer here: I got curious, tried it out for myself, saved the raw data, and opened it with 7-Zip.
(7-Zip file-information screenshots for the Windows and Linux outputs omitted.)
You can immediately notice that the only field that differs is the Host OS.
What the data means
Header | Data | Footer
1F8B080000000000000A | 4BCBCF0700 | 2165738C03000000
Let's break that down.
Header
First, from this answer I realize it's actually a gzip instead of a zlib header:
Level | ZLIB | GZIP
1 | 78 01 | 1F 8B
9 | 78 DA | 1F 8B
Further searching led me to an article about Gzip on forensics wiki.
The values in this case are:
Offset | Size | Value | Description
0 | 2 | 1f8b | Signature (or identification byte 1 and 2)
2 | 1 | 08 | Compression Method (deflate)
3 | 1 | 00 | Flags
4 | 4 | 00000000 | Last modification time
8 | 1 | 00 | Compression flags (or extra flags)
9 | 1 | 0A | Operating system (0A = TOPS-20; this is the Windows file, the Linux file has 03 = Unix here)
Footer
Offset | Size | Value | Description
0 | 4 | 2165738C | Checksum (CRC-32, little endian)
4 | 4 | 03000000 | Uncompressed data size in bytes (little endian; here 3)
An interesting thing to note here is that even though the last modification time and operating system in the header differ, the data compresses to the same bytes with the same checksum in the footer.
The IETF RFC (RFC 1952) has a more detailed summary of the format.
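To make the little-endian layout concrete, here is a small sketch (not from the original answer) that decodes the footer bytes shown above:

// Sketch: decode the gzip footer of the example stream above. Both the CRC-32
// and ISIZE (uncompressed size modulo 2^32) are stored little-endian (RFC 1952).
#include <cstdint>
#include <cstdio>

int main() {
    // Footer bytes from the example: 21 65 73 8C | 03 00 00 00
    const unsigned char footer[8] = {0x21, 0x65, 0x73, 0x8C, 0x03, 0x00, 0x00, 0x00};

    auto le32 = [](const unsigned char *p) {
        return uint32_t(p[0]) | uint32_t(p[1]) << 8 |
               uint32_t(p[2]) << 16 | uint32_t(p[3]) << 24;
    };

    std::printf("CRC-32: 0x%08X\n", unsigned(le32(footer)));       // 0x8C736521
    std::printf("ISIZE : %u bytes\n", unsigned(le32(footer + 4))); // 3 ("foo")
}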

Related

Removing null bytes from a file results in larger output after XZ compression

I am developing a custom file format consisting of a large number of integers. I'm using XZ to compress the data as it has the best compression ratio of all compression algorithms I've tested.
All integers are stored as u32s in RAM, but they are all a maximum of 24 bits large, so I updated the serializer to skip the high byte (as it will always be 0 anyways), to try to compress the data further. However, this had the opposite effect: the compressed file is now larger than the original.
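For illustration only (hypothetical code, not the asker's actual serializer), the change described above amounts to writing each u32 as its three low bytes:

// Hypothetical sketch of the 24-bit serialization described above: each u32
// (known to fit in 24 bits) is written as its three low bytes, little-endian,
// dropping the always-zero high byte.
#include <cstdint>
#include <ostream>
#include <vector>

void write24(std::ostream &out, const std::vector<uint32_t> &values) {
    for (uint32_t v : values) {
        const char bytes[3] = {static_cast<char>(v & 0xFF),
                               static_cast<char>((v >> 8) & 0xFF),
                               static_cast<char>((v >> 16) & 0xFF)};
        out.write(bytes, 3);  // byte (v >> 24) is always 0, so it is skipped
    }
}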
$ xzcat 24bit.xz | hexdump -e '3/1 "%02x" " "'
020000 030000 552d07 79910c [...] b92c23 c82c23
$ xzcat 32bit.xz | hexdump -e '4/1 "%02x" " "'
02000000 03000000 552d0700 79910c00 [...] b92c2300 c82c2300
$ xz -l 24bit.xz 32bit.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1     82.4 MiB    174.7 MiB  0.472  CRC64   24bit.xz
    1       1     77.2 MiB    233.0 MiB  0.331  CRC64   32bit.xz
-------------------------------------------------------------------------------
    2       2    159.5 MiB    407.7 MiB  0.391  CRC64   2 files
Now, I wouldn't have an issue if the size of the file had remained the same, as a perfect compressor would detect that all of those bytes are redundant anyways, and compress them all down to practically nothing. However, I do not understand how removing data from the source file can possibly result in a larger file after compression?
I've tried changing the LZMA compressor settings, and xz --lzma2=preset=9e,lc=4,pb=0 yielded a slightly smaller file at 82.2M, but this is still significantly larger than the original file.
For reference, both files can be downloaded from https://files.spaceface.dev/sW1Xif/32bit.xz and https://files.spaceface.dev/wKXEQm/24bit.xz.
The order of the integers is somewhat important, so naively sorting the entire file won't work. The file is made up of different chunks, and the numbers making up each chunk are currently sorted for slightly better compression; however, the order does not matter, just the order of the chunks themselves.
Chunk 1: 000002 000003 072d55 0c9179 148884 1e414b
Chunk 2: 00489f 0050c5 0080a6 0082f0 0086f6 0086f7 01be81 03bdb1 03be85 03bf4e 04dfe6 04dfea 0583b1 061125 062006 067499 07d7e6 08074d 0858b8 09d35d 09de04 0cfd78 0d06be 0d3869 0d5534 0ec366 0f529c 0f6d0d 0fecce 107a7e 107ab3 13bc0b 13e160 15a4f9 15ab39 1771e3 17fe9c 18137d 197a30 1a087a 1a2007 1ab3b9 1b7d3c 1ba52c 1bc031 1bcb6b 1de7d2 1f0866 1f17b6 1f300e 1f39e1 1ff426 206c51 20abbe 20cbbc 211a58 211a59 215f73 224ea8 227e3f 227eab 22f3b7 231aef 004b15 004c86 0484e7 06216e 08074d 0858b8 0962ed 0eb020 0ec366 1a62c2 1fefae 224ea8 0a2701 1e414b
Chunk 3: 000006 003b17 004b15 004b38 [...]

How do I reproduce checksum of gzip files copied with s3DistCp (from Google Cloud Storage to AWS S3)

I copied a large number of gzip files from Google Cloud Storage to AWS's S3 using s3DistCp (as this AWS article describes). When I try to compare the files' checksums, they differ (md5/sha-1/sha-256 have same issue).
If I compare the sizes (bytes) or the decompressed contents of a few files (diff or another checksum), they match. (In this case, I'm comparing files pulled directly down from Google via gsutil vs pulling down my distcp'd files from S3).
Using file, I do see a difference between the two:
file1-gs-direct.gz: gzip compressed data, original size modulo 2^32 91571
file1-via-s3.gz: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 91571
My Goal/Question:
My goal is to verify that my downloaded files match the original files' checksums, but I don't want to have to re-download or analyze the files directly on Google. Is there something I can do on my s3-stored files to reproduce the original checksum?
Things I've tried:
Re-gzipping with different compressions:
While I wouldn't expect s3DistCp to change the original file's compression, here's my attempt at recompressing:
target_sha=$(shasum -a 1 file1-gs-direct.gz | awk '{print $1}')
for i in {1..9}; do
  cur_sha=$(cat file1-via-s3.gz | gunzip | gzip -n -$i | shasum -a 1 | awk '{print $1}')
  echo "$i. $target_sha == $cur_sha ? $([[ $target_sha == $cur_sha ]] && echo 'Yes' || echo 'No')"
done
1. abcd...1234 == dcba...4321 ? No
2. ... ? No
...
9. ... ? No
While typing out my question, I figured out the answer:
S3DistCp is apparently switching the "OS" version in the gzip header, which explains the "FAT filesystem" label I'm seeing with file. (Note: to rule out S3 directly causing the issue, I copied my "file1-gs-direct.gz" up to S3, and after pulling down, the checksum remains the same.)
Here's the diff between the two files:
$ diff <(cat file1-gs-direct.gz | hexdump -C) <(cat file1-via-s3.gz | hexdump -C)
1c1
< 00000000 1f 8b 08 00 00 00 00 00 00 ff ed 7d 59 73 db 4a |...........}Ys.J|
---
> 00000000 1f 8b 08 00 00 00 00 00 00 00 ed 7d 59 73 db 4a |...........}Ys.J|
It turns out the 10th byte in a gzip file "identifies the type of file system on which compression took place" (Gzip RFC):
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
Using hexedit, I'm able to change my "via-s3" file's OS from 00 to FF and then the checksums match.
Caveat: Editing this on a file that is later decompressed may cause unexpected issues, so use with caution. (In my case, I'm only comparing file checksums, so the worst case is a file showing as mismatching even though the uncompressed contents are the same.)
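If a hex editor is inconvenient, the same one-byte patch can be scripted. Here is a sketch (mine, not the answerer's) that overwrites byte offset 9, the OS field, in place:

// Sketch: overwrite byte 9 (the OS field) of a gzip file in place.
// Usage: patch-gzip-os file.gz FF
// Caution (as noted above): this edits the file directly; keep a copy.
#include <cstdio>
#include <cstdlib>
#include <fstream>

int main(int argc, char **argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: patch-gzip-os <file.gz> <hex byte>\n");
        return 2;
    }
    char os = static_cast<char>(std::strtoul(argv[2], nullptr, 16));
    std::fstream f(argv[1], std::ios::in | std::ios::out | std::ios::binary);
    if (!f) {
        std::fprintf(stderr, "could not open %s\n", argv[1]);
        return 1;
    }
    f.seekp(9);        // 10th byte of the gzip member header = OS
    f.write(&os, 1);
    return f.good() ? 0 : 1;
}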

Which sample type (not size) to choose in QtMultimedia's QAudioFormat for 24, 32 and 64 bit audio?

I am writing a media player with Qt, but I'm now facing an unfamiliar situation. I'm trying to use QAudioOutput and QAudioDecoder to play high-resolution music (24-, 32- or even 64-bit audio). But QAudioFormat (the glue between all the audio classes) specifies a sampleType as in the following table:
| Constant | Value | Description |
|---------------------------|-------|--------------------------------|
| QAudioFormat::Unknown | 0 | Not Set |
| QAudioFormat::SignedInt | 1 | Samples are signed integers |
| QAudioFormat::UnSignedInt | 2 | Samples are unsigned integers |
| QAudioFormat::Float | 3 | Samples are floats |
Now, the problem arises when I also set the sample size to something greater than 16 bits. I have one hypothesis that I need confirmed:
assuming ints are 32 bits in size, if I want to support up to 32-bit sample sizes I have to use QAudioFormat::SignedInt with PCM audio for 24- and 32-bit audio (padding with 0 for 24-bit audio).
But what if there is a higher sample size (e.g. 64-bit audio for DSD converted to PCM)? Should I assume that I still set the sample type to QAudioFormat::SignedInt but that each 64-bit "sample" is stored in two ints? Or is it simply not supported by QtMultimedia?
I'm open to any enlightenment 😙!
From the documentation for QAudioFormat::setSampleSize():
void QAudioFormat::setSampleSize(int sampleSize)
Sets the sample size to the sampleSize specified, in bits.
This is typically 8 or 16, but some systems may support higher sample
sizes.
Therefore, to use 64-bit samples, you'd need to call setSampleSize(64). That could be called in combination with a call to setSampleType() to specify whether the samples will be fixed-point-signed vs fixed-point-unsigned vs floating-point -- note that the values in setSampleType() do not imply any particular sample size.
For 64-bit audio, each sample will be stored as 64 bits of data; you could access each sample as a long long int, or alternatively as an int64_t (or unsigned long long int or uint64_t for unsigned samples, or as a double for floating-point samples).
(Of course none of this guarantees that your Qt library's QtMultimedia actually supports 64-bit samples; it may or may not, but at least the API supports telling Qt what you want :) )
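As a rough sketch of that (Qt 5 API; whether a given backend actually accepts it is another matter), one might request and verify such a format like this:

// Sketch (Qt 5 QtMultimedia): request 64-bit signed-integer PCM and check
// whether the default output device supports it before creating a QAudioOutput.
#include <QAudioDeviceInfo>
#include <QAudioFormat>
#include <QDebug>

QAudioFormat make64BitFormat() {
    QAudioFormat fmt;
    fmt.setCodec("audio/pcm");
    fmt.setSampleRate(96000);                    // example rate
    fmt.setChannelCount(2);
    fmt.setSampleSize(64);                       // 64 bits per sample
    fmt.setSampleType(QAudioFormat::SignedInt);  // fixed-point signed samples
    fmt.setByteOrder(QAudioFormat::LittleEndian);
    return fmt;
}

bool deviceSupports64Bit() {
    const QAudioDeviceInfo dev = QAudioDeviceInfo::defaultOutputDevice();
    const QAudioFormat fmt = make64BitFormat();
    if (!dev.isFormatSupported(fmt)) {
        qWarning() << "64-bit PCM not supported; nearest sample size:"
                   << dev.nearestFormat(fmt).sampleSize();
        return false;
    }
    return true;
}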

Reliable way to programmatically get the number of hardware threads on Windows

I'm struggling to find a reliable way to get the number of hardware threads on Windows. I am running Windows 7 Professional SP1 64-bit on a machine with dual Intel Xeon E5-2699 v3 CPUs @ 2.30 GHz, totaling 36 cores and 72 threads.
I have tried different methods to get the number of cores, and I have found that only two of them seem to work accurately in a 32-bit or 64-bit process. Here are my results:
+------------------------------------------------+----------------+----------------+
| Methods | 32-bit process | 64-bit process |
+------------------------------------------------+----------------+----------------+
| GetSystemInfo->dwNumberOfProcessors | 32 | 36 |
| GetNativeSystemInfo->dwNumberOfProcessors | 36 | 36 |
| GetLogicalProcessorInformation | 36 | 36 |
| GetProcessAffinityMask.processAffinityMask | 32 | 32 |
| GetProcessAffinityMask.systemAffinityMask | 32 | 32 |
| omp_get_num_procs | 32 | 36 |
| getenv("NUMBER_OF_PROCESSORS") | 36 | 36 |
| GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) | 64 | 72 |
| GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS) | 64 | 72 |
| boost::thread::hardware_concurrency() | 32 | 36 |
| Performance counter API | 36 | 36 |
| WMI | 72 | 72 |
| HARDWARE\DESCRIPTION\System\CentralProcessor | 72 | 72 |
+------------------------------------------------+----------------+----------------+
I cannot explain why all these functions return different values. The only two methods that seem reliable to me are either using WMI (which is fairly complicated) or simply reading the following key in the Windows registry: HARDWARE\DESCRIPTION\System\CentralProcessor.
What do you think?
Do you confirm that the WMI and registry key methods are the only reliable methods?
Thanks in advance
The API function that you need is GetLogicalProcessorInformationEx. Since you have more than 64 processors, your processors are grouped. GetLogicalProcessorInformation only reports the processors in the processor group to which the calling thread is currently assigned. You need to use GetLogicalProcessorInformationEx to get past that limitation.
The documentation says:
On systems with more than 64 logical processors, the GetLogicalProcessorInformation function retrieves logical processor information about processors in the processor group to which the calling thread is currently assigned. Use the GetLogicalProcessorInformationEx function to retrieve information about processors in all processor groups on the system.
Late answer with code:
#include <windows.h>
#include <cstdlib>
#include <memory>

size_t myHardwareConcurrency() {
    size_t concurrency = 0;
    // First call with a null buffer to learn the required buffer length.
    DWORD length = 0;
    if (GetLogicalProcessorInformationEx(RelationAll, nullptr, &length) != FALSE)
        return concurrency;
    if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
        return concurrency;
    std::unique_ptr<void, void (*)(void*)> buffer(std::malloc(length), std::free);
    if (!buffer)
        return concurrency;
    unsigned char* mem = reinterpret_cast<unsigned char*>(buffer.get());
    // Second call fills the buffer with variable-sized records.
    auto* info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(mem);
    if (GetLogicalProcessorInformationEx(RelationAll, info, &length) == FALSE)
        return concurrency;
    // For each core record, count the set bits of its group masks
    // (one bit per logical processor, across all processor groups).
    for (DWORD i = 0; i < length;) {
        auto* proc = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(mem + i);
        if (proc->Relationship == RelationProcessorCore) {
            for (WORD group = 0; group < proc->Processor.GroupCount; ++group) {
                for (KAFFINITY mask = proc->Processor.GroupMask[group].Mask; mask != 0; mask >>= 1) {
                    concurrency += mask & 1;
                }
            }
        }
        i += proc->Size;
    }
    return concurrency;
}
It worked on my dual Xeon Gold 6154 machine with 64-bit Windows (2 processors * 18 cores/processor * 2 threads/core = 72 hardware threads). The result is 72 for both 32-bit and 64-bit processes.
I do not have access to a system with 32-bit Windows, though.
In case of error, it returns zero, like std::thread::hardware_concurrency does.
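For comparison, here is a minimal sketch of the much shorter, group-aware call already listed in the question's table: GetActiveProcessorCount, which according to its documentation counts logical processors across all processor groups on Windows 7 / Server 2008 R2 and later.

// Sketch: count logical processors across all processor groups using
// GetActiveProcessorCount (Windows 7 / Server 2008 R2 and later).
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0601   // target Windows 7 so the API is declared
#endif
#include <windows.h>
#include <cstdio>

int main() {
    DWORD logical = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
    WORD  groups  = GetActiveProcessorGroupCount();
    std::printf("%lu logical processors in %u processor group(s)\n",
                static_cast<unsigned long>(logical),
                static_cast<unsigned>(groups));
    return 0;
}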
You can use the CPUID instruction to query the processor directly (it is platform independent, though since you can no longer use inline asm in MSVC, on some compilers you'll need compiler-specific intrinsics to access it). The only downside is that, as of a few years ago, Intel and AMD handle this instruction differently, and you'll need to do a lot of work to ensure you are reading the information correctly. In fact, not only will you be able to get a core count, but you can get all kinds of processor topology information. I'm not sure how it behaves in a VM, though, if you are using that environment.

What is IACA and how do I use it?

I've found this interesting and powerful tool called IACA (the Intel Architecture Code Analyzer), but I have trouble understanding it. What can I do with it, what are its limitations and how can I:
Use it to analyze code in C or C++?
Use it to analyze code in x86 assembler?
2019-04: Reached EOL. Suggested alternative: LLVM-MCA
2017-11: Version 3.0 released (latest as of 2019-05-18)
2017-03: Version 2.3 released
What it is:
IACA (the Intel Architecture Code Analyzer) is a (2019: end-of-life) freeware, closed-source static analysis tool made by Intel to statically analyze the scheduling of instructions when executed by modern Intel processors. This allows it to compute, for a given snippet,
In Throughput mode, the maximum throughput (the snippet is assumed to be the body of an innermost loop)
In Latency mode, the minimum latency from the first instruction to the last.
In Trace mode, prints the progress of instructions through their pipeline stages.
when assuming optimal execution conditions (all memory accesses hit the L1 cache and there are no page faults).
IACA supports computing schedulings for Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Skylake processors as of version 2.3 and Haswell, Broadwell and Skylake as of version 3.0.
IACA is a command-line tool that produces ASCII text reports and Graphviz diagrams. Versions 2.1 and below supported 32- and 64-bit Linux, Mac OS X and Windows and analysis of 32-bit and 64-bit code; Version 2.2 and up only support 64-bit OSes and analysis of 64-bit code.
How to use it:
IACA's input is a compiled binary of your code, into which two markers have been injected: a start marker and an end marker. The markers make the code unrunnable, but allow the tool to quickly find the relevant pieces of code and analyze them.
You do not need the ability to run the binary on your system; in fact, the binary supplied to IACA can't run anyway because of the presence of the injected markers in the code. IACA only requires the ability to read the binary to be analyzed. Thus it is possible, using IACA, to analyze a Haswell binary employing FMA instructions on a Pentium III machine.
C/C++
In C and C++, one gains access to marker-injecting macros with #include "iacaMarks.h", where iacaMarks.h is a header that ships with the tool in the include/ subdirectory.
One then inserts the markers around the innermost loop of interest, or the straight-line chunk of interest, as follows:
/* C or C++ usage of IACA */
while (cond) {
    IACA_START
    /* Loop body */
    /* ... */
}
IACA_END
The application is then rebuilt as it otherwise would be, with optimizations enabled (in Release mode for users of IDEs such as Visual Studio). The output is a binary that is identical in all respects to the Release build, except for the presence of the marks, which make the application non-runnable.
IACA relies on the compiler not reordering the marks excessively; as such, for such analysis builds certain powerful optimizations may need to be disabled if they reorder the marks to include extraneous code not within the innermost loop, or to exclude code within it.
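For example, here is a complete translation unit one might mark up and feed to IACA (a sketch; the kernel is hypothetical, chosen to resemble the FMA loop analyzed below):

/* Sketch: a self-contained file marked up for IACA. The kernel is hypothetical;
   iacaMarks.h (from the IACA distribution's include/ directory) must be on the
   include path. Build with optimizations, e.g.: g++ -O3 -mavx2 -mfma -c fma_kernel.cpp */
#include "iacaMarks.h"

void fma_kernel(float *dst, const float *a, const float *b, long n) {
    for (long i = 0; i < n; ++i) {
        IACA_START                      /* start marker: top of the loop body */
        dst[i] = a[i] * b[i] + dst[i];  /* innermost-loop work of interest */
    }
    IACA_END                            /* end marker: just after the loop */
}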
Assembly (x86)
IACA's markers are magic byte patterns injected at the correct location within the code. When using iacaMarks.h in C or C++, the compiler handles inserting the magic bytes specified by the header at the correct location. In assembly, however, you must manually insert these marks. Thus, one must do the following:
; NASM usage of IACA
mov ebx, 111 ; Start marker bytes
db 0x64, 0x67, 0x90 ; Start marker bytes
.innermostlooplabel:
; Loop body
; ...
jne .innermostlooplabel ; Conditional branch backwards to top of loop
mov ebx, 222 ; End marker bytes
db 0x64, 0x67, 0x90 ; End marker bytes
It is critical for C/C++ programmers that the compiler achieve this same pattern.
What it outputs:
As an example, let us analyze the following assembler example on the Haswell architecture:
.L2:
    vmovaps      ymm1, [rdi+rax]        ; L1 (load)
    vfmadd231ps  ymm1, ymm2, [rsi+rax]  ; L2 (load + FMA)
    vmovaps      [rdx+rax], ymm1        ; S1 (store)
    add          rax, 32                ; ADD
    jne          .L2                    ; JMP
We add immediately before the .L2 label the start marker and immediately after jne the end marker. We then rebuild the software, and invoke IACA thus (On Linux, assumes the bin/ directory to be in the path, and foo to be an ELF64 object containing the IACA marks):
iaca.sh -64 -arch HSW -graph insndeps.dot foo
This produces an analysis report of the 64-bit binary foo as it would run on a Haswell processor, plus a graph of the instruction dependencies, viewable with Graphviz.
The report is printed to standard output (though it may be directed to a file with a -o switch). The report given for the above snippet is:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 1.5 1.0 | 1.5 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
# - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
| 2 | 0.5 | 0.5 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
| 1 | | | | | | | 1.0 | | | add rax, 0x20
| 0F | | | | | | | | | | jnz 0xffffffffffffffec
Total Num Of Uops: 6
The tool helpfully points out that the bottleneck is currently the Haswell front-end and the AGUs on ports 2 and 3. This example allows us to diagnose the problem as the store not being processed by port 7, and to take remedial action.
Limitations:
IACA does not support a few instructions, and they are simply ignored in the analysis. It does not support processors older than Nehalem, and it does not support non-innermost loops in throughput mode (it has no ability to guess which branch is taken how often and in what pattern).