Where in the zlib compress/deflate code does the library decide that a block should be copied as uncompressed into the compressed stream? - compression

In the "Maximum Expansion Factor" section of the zlib technical details it is stated "In the worst possible case, where the other block types would expand the data, deflation falls back to stored (uncompressed) blocks."
I am having a hard time figuring out where in the zlib compress/deflate code this decision actually happens. I can see that deflate_stored is called when the selected level is 0, which makes sense, but besides that I don't see it being used.
If someone can point me in the right direction that would be helpful.
Also, what is the block granularity (in terms of the uncompressed data) that these decisions are taken? I understand that in deflate the uncompressed block can be upto 64KB, but there is no defined block size for the compressed chunks. Clearly it has to do with how useful the huffman codes are for the blocks but it would be nice to know if there was a block size at which these decisions are taken.

Here in trees.c:
/* ===========================================================================
* Determine the best encoding for the current block: dynamic trees, static
* trees or store, and write out the encoded block.
*/
void ZLIB_INTERNAL _tr_flush_block(s, buf, stored_len, last)
The size of a block is measured in symbols, i.e. the number of literal and length/distance pairs -- not the size of the uncompressed data. The size of an emitted block is determined by the memory setting, where by default it is 16383 symbols. At that time, or at the end of the input data if that comes first, it is determined whether a dynamic, static, or stored block for those symbols would be coded the smallest.

The decision at the block level is taken in tree.c (https://github.com/madler/zlib/blob/master/trees.c)

Related

Cache mapping techniques

Im trying to understand hardware Caches. I have a slight idea, but i would like to ask on here whether my understanding is correct or not.
So i understand that there are 3 types of cache mapping, direct, full associative and set associative.
I would like to know is the type of mapping implemented with logic gates in hardware and specific to say some computer system and in order to change the mapping, one would be required to changed the electrical connections?
My current understanding is that in RAM, there exists a memory address to refer to each block of memory. Within a block contains words, each words contain a number of bytes. We can represent the number of options with number of bits.
So for example, 4096 memory locations, each memory location contains 16 bytes. If we were to refer to each byte then 2^12*2^4 = 2^16
16 bit memory address would be required to refer to each byte.
The cache also has a memory address, valid bit, tag, and some data capable of storing a block of main memory of n words and thus m bytes. Where m = n*i (bytes per word)
For an example, direct mapping
1 block of main memory can only be at one particular memory location in cache. When the CPU requests for some data using a 16bit memory location of RAM, it checks for cache first.
How does it know that this particular 16bit memory address can only be in a few places?
My thoughts are, there could be some electrical connection between every RAM address to a cache address. The 16bit address could then be split into parts, for example only compare the left 8bits with every cache memory address, then if match compare the byte bits, then tag bits then valid bit
Is my understanding correct? Thank you!
Really do appreciate if someone read this long post
You may want to read 3.3.1 Associativity in What Every Programmer Should Know About Memory from Ulrich Drepper.
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf#subsubsection.3.3.1
The title is a little bit catchy, but it explains everything you ask in detail.
In short:
the problem of caches is the number of comparisons. If your cache holds 100 blocks, you need to perform 100 comparisons in one cycle. You can reduce this number with the introduction of sets. if A specific memory-region can only be placed in slot 1-10, you reduce the number of comparisons to 10.
The sets are addressed by an additional bit-field inside the memory-address called index.
So for instance your 16 Bit (from your example) could be splitted into:
[15:6] block-address; stored in the `cache` as the `tag` to identify the block
[5:4] index-bits; 2Bit-->4 sets
[3:0] block-offset; byte position inside the block
so the choice of the method depends on the availability of hardware-resources and the access-time you want to archive. Its pretty much hardwired, since you want to reduce the comparison-logic.
There are few mapping functions used for map cache lines with main memory
Direct Mapping
Associative Mapping
Set-Associative Mapping
you have to have an slight idea about these three mapping functions

PE: Adding code at the end of .txt section

As I understand in PE file Virtual Size shows how much space it allocates for section during loading and Raw Size shows how big the section is on disk.
I came across to this executable which did the following:
It subtracted the virtual size (offset 0x8) from the raw data size (offset 0x10) and made sure there was some space(for example 100 bytes). At offset 0x14 from the text section header it found the offset in the file for the section itself. It added to this the virtual size, finding right where the section ends in the file. it copied some shellcode(which at the end eventually jumped to the original entry point of an executable to make sure original executable ran) to the end of the text section of the binary.
Now I'm little confused here, if the Virtual Size shows the exact space which will be allocated for executable, would not adding code at the end of .txt section overwrite some other data of executable and make it crash? thanks.
This is a great question and illustrates a very important point (or you might say quirk) about how the Windows loader calculates memory sizes.
The PE/COFF spec does indeed describe VirtualSize as "The total size of the section when loaded into memory". This is technically true if you consider total to be the total amount of REAL data that comprises the section, but it is not the total amount of memory Windows allocates for the section. You'll find VirtualSize is often LESS than the amount Windows allocates for the section in memory because VirtualSize must be rounded UP to the nearest memory alignment value (set in the PE image).
In other words, VirtualSize represents the unrounded size of the section while SizeOfRawData is the size of the data in the image file, but rounded up to the nearest file alignment padding value. This is the reason VirtualSize is the better representation of the true "raw" size of the data, in memory or on disk. The PE/COFF spec does not make this distinction. Why one is rounded in the image file and not the other probably has its ancient roots in backwards compatibility.
This is precisely why your shellcode uses VirtualSize to find the "real" end of data even as it resides in the image file. Not surprisingly, you can calculate SizeOfRawData by rounding VirtualSize up to the file alignment's value, at least in well formed PE files.
The shellcode simply used VirtualSize to find the end of the REAL code. Between there and SizeOfRawData bytes, are just unused padding zeroes, making it a prime spot to add new code without affecting the size of the file or breaking addressing offsets within the PE file.
In summary, the Windows loader essentially takes the VirtualSize value and rounds it up to the memory alignment value to get the real size of the memory allocation (and even this may be rounded up to the nearest 4k-minimum memory page). Then up to SizeOfRawData bytes are copied from the file to the beginning of the memory section. If it is less than the size of the section in memory, the remainder is zero filled.
Uhmmm, this hack code seems crafted to hide malicious code between sections.
Going to your question, you are correct VirtualSize is the space really allocated in memory and RawSize is the the space used on disk to hold the section data.
What you missed is (from MS PECOFF spec):
VirtualSize: The total size of the section when loaded into memory. If this value is greater than SizeOfRawData, the section is zero-padded. This field is valid only for executable images and should be set to zero for object files.
This means that if a the result of SizeOfRawData-VirtualSize is positive we have some space available on disk actually filled with 0's.
If that space is enough to hold the malicious code then adding the start of text section on disk with the VirtualSize you can get the start of 0's padded area that can be used to copy the code.
The rest is story...

Searching in large memory mapped files

I have a large data structure stored in memory mapped file. Data structure is very simple:
struct Header {
...some metadata...
uint32_t index_size;
uint64_t index[]
};
This header is placed in the beginning of the file, it uses a structure hack - variable sized structure, size of the last element is not set in stone and can be changed.
char* mmaped_region = ...; // This memory comes from memory mapped file!
Header* pheader = reinterpret_cast<Header*>(mmaped_region);
Memory mapped region starts with Header and Header::index_size contains correct length of the Header::index array. This array contains offsets of the data elements, we can do this:
uint64_t offset = pheader->index[x];
DataItem* item = reinterpret_cast<DataItem*>(mmaped_region + offset);
// At this point, variable item contains pointer to data element
// if variable x contains correct index value (less than pheader->index_size)
All the data elements is sorted (less than relation defined for data elements). Their are stored in the same memory mapped region as Header but starting from the end to the beginning. Data elements can't be moved, because their are of variable size, instead of that - indexes in header are moved during sort procedure. This is very much like B-tree page in modern databases, index array is usually called an indirection vector.
Searches
This data-structure is searched with interpolation search algorithm (with limited amount of steps) and than with binary search. First, I have a whole index array to search, I'm trying to calculate - where searched element can be stored if distribution is uniform. I get some calculated index - look at element at this index and it usually doesn't match. Than I narrow the search range and repeat. Number of interpolation search steps is limited by some small number. After that data-structure is searched with binary search. This works very good with small data-sets, because distribution is usually uniform. Few iterations of the interpolation search and we're done.
Problem definition.
Memory mapped region can be very large in reality. For testing I use 32Gb file backed storage and search for some random keys. This is very slow because this pattern cause lot of random disk reads (all data can't be cached in memory).
What can be done here? I think that setting MADV_RANDOM with madvise syscall can help, but probably not very much. I want to get on par with B-tree search speed. Maybe it is possible to use mincore syscall to check what data-elements can be painlessly checked during interpolation search? Maybe I can use prefetching of some sort?
The interpolation search appears to be a good idea here. It usually has a small benefit, but in this case even a small number of iterations saved helps a lot since they're s slow (disk I/O).
However, real databases duplicate the actual key values in their indices. The space overhead for that is fully justified in the performance improvement. Btrees are a further improvement because they pack multiple related nodes in a single contiguous block of memory, further reducing disk seeks.
This is probably the correct solution for you as well. You should duplicate the keys to avoid disk I/O. You can probably get away by duplicating the keys in a separate structure and keeping that that fully in memory, if you can't alter the existing header.
A compromise is possible, where you just cache the top (2^N)-1 keys for the first N levels of binary search. That means you have to give up your interpolation for that part of the search, but as noted before interpolation is not a huge win anyway. The disk seeks saved will easily pay off. Even caching just the median key (N=1) will already save you one disk seek per lookup. And you can still use interpolation once you've run out of the cache.
In comparison, any attempt to fiddle with memory mapping parameters will give you a few percent speed improvement at best. "On par with B-trees" is not going to happen. If your algorithm needs those physical seeks, you lose. No magical pixie dust will fix a bad algorithm or a bad datastructure.

Compressing strings with common parts

I have an application that manages a large number of strings. Strings are in a path-like format and have many common parts, but without a clear rule. They are not paths on the file-system but can be considered like so.
I clearly need to optimize memory consumption but without a big performance sacrifice.
I am considering 2 options:
- implement a compressed_string class that stores data zipped, but i need a fixed dictionary and i cant find a library for this right now. I don't want a Huffman on bytes, I want it on words.
- implement some kind of flyweight pattern on string parts.
The problem looks like a common one and I'm wonder what is the best solution to it or if someone knows a library that targets this issue.
thanks
Although it might be tempting to tune a specific algorithm for your problem, it is likely to require an unreasonable amount of time and effort, while standard compression techniques will immediately provide you a great boost to solve your memory consumption problem.
The "standard" way to handle this issue is to chunk source data into small blocks (such as 256KB), and compress them individually. When accessing data into the block, you need to decode it first. Therefore, the optimal block size really depends on your application, i.e. the more the application streams, the larger the blocks; on the other hand, the more random access pattern, the smaller the block size.
If you are worried by the compression/decompression speed, use a high-speed algorithm. If decompression speed is the most important metric (for access time), something like LZ4 will provide you about 1GB/s decoding performance per core, so this gives you an idea of how many blocks per second you can decode.
If only decompression speed matters, you may use the high-compression variant LZ4-HC, which will boost compression ratio even more by about 30%, while also improving decompression speed.
Strings are in a path-like format and have many common parts, but without a clear rule.
In the sense that they are locators in a hierarchy of the form name, (separator, name)*? If so, you can use interning: store the name parts as char const * elements that point into a pool of strings. That way, you effectively compress a name that is used n times to just over n * sizeof(char const *) + strlen(name) bytes. The full path would become a sequence of interned names, e.g. an std::vector.
It might seem that sizeof(char const *) is big on 64-bit hardware, but you also save some of the allocation overhead. Or, if you know for some reason that you'll never need more than, say, 65536 strings, you might store them as
class interned_name
{
uint16_t tab_idx;
public:
char const *c_str() const
{
return NAME_TABLE[tab_idx];
}
};
where NAME_TABLE is an static std::unordered_map<uint16_t, char const *>.

Can __attribute__((packed)) affect the performance of a program?

I have a structure called log that has 13 chars in it. after doing a sizeof(log) I see that the size is not 13 but 16. I can use the __attribute__((packed)) to get it to the actual size of 13 but I wonder if this will affect the performance of the program. It is a structure that is used quite frequently.
I would like to be able to read the size of the structure (13 not 16). I could use a macro, but if this structure is ever changed ie fields added or removed, I would like the new size to be updated without changing a macro because I think this is error prone. Have any suggestion?
Yes, it will affect the performance of the program. Adding the padding means the compiler can use integer load instructions to read things from memory. Without the padding, the compiler must load things separately and do bit shifting to get the entire value. (Even if it's x86 and this is done by the hardware, it still has to be done).
Consider this: Why would compilers insert random, unused space if it was not for performance reasons?
Don't use __attribute__((packed)). If your data structure is in-memory, allow it to occupy its natural size as determined by the compiler. If it's for reading/writing to/from disk, write serialization and deserialization functions; do not simply store cpu-native binary structures on disk. "Packed" structures really have no legitimate uses (or very few; see the comments on this answer for possible disagreeing viewpoints).
Yes, it can affect the performance. In this case, if you allocate an array of such structures with the ((packed)) attribute, most of them must end up unaligned (whereas if you use the default packing, they can all be aligned on 16 byte boundaries). Copying such structures around can be faster if they are aligned.
Yes, it can affect performance. How depends on what it is and how you use it.
An unaligned variable can possibly straddle two cache lines. For example, if you have 64-byte cache lines, and you read a 4-byte variable from an array of 13-byte structures, there is a 3 in 64 (4.6%) chance that it will be spread across two lines. The penalty of an extra cache access is pretty small. If everything your program did was pound on that one variable, 4.6% would be the upper bound of the performance hit. If logging represents 20% of the program's workload, and reading/writing to the that structure is 50% of logging, then you're already at a small fraction of a percent.
On the other hand, presuming that the log needs to be saved, shrinking each record by 3 bytes is saving you 19%, which translates to a lot of memory or disk space. Main memory and especially the disk are slow, so you will probably be better off packing the log to reduce its size.
As for reading the size of the structure without worrying about the structure changing, use sizeof. However you like to do numerical constants, be it const int, enum, or #define, just add sizeof.
As with all other performance optimizations, you'll need to profile your code to find the right answer. The right answer will vary by architecture --- and how your use your structure.
If you're creating gigantic arrays the space savings from packing might mean the difference between fitting and not fitting in cache. Or your data might already fit into your cache, in which case it will make no difference. If you're allocating large numbers of the structures in an STL associative container that allocates the storage for your struct with operator new it might not matter at all --- operator new might round your storage up to something that's aligned anyway.
If most of your structures live on the stack the extra storage might already be optimized away anyway.
For a change this simple to test, I suggest building a timing rig and then trying things both ways. For further optimizations I suggest using a profiler to identify your bottlenecks and go from there.