I'm trying to find this information but without much luck so far.
I have a fairly large file on S3 (in the range of 64-128 GB) and I would like to read random byte ranges from it. Is that doable? I guess the byte-range offset would be very large for chunks close to the end of the file.
Typical read chunks are 1-12 KB, if that's important.
This won't be a problem.
The Amazon docs state that the Range header is supported as per the spec (but only one range per request):
Range
Downloads the specified range bytes of an object. For more information about the HTTP Range header, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
RFC2616 doesn't define any upper limit.
The first-byte-pos value in a byte-range-spec gives the byte-offset of the first byte in a range. The last-byte-pos value gives the byte-offset of the last byte in the range; that is, the byte positions specified are inclusive. Byte offsets start at zero.
This makes sense because the range is automatically limited by the file size, and a system that, say, only supports 32-bit numbers overall will usually also only support files up to a size of 2 GiB (minus one).
The maximum size of objects in S3 is 5 TiB.
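As a rough illustration of the inclusive, zero-based byte positions quoted above, here is a minimal sketch that builds the Range header value with 64-bit arithmetic. The function name make_range_header is made up for this example, and how the header is actually sent depends on the HTTP client or SDK you use.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build the value of an HTTP Range header for `length` bytes starting at
// byte `offset`. Both byte positions are inclusive and zero-based.
// 64-bit integers are used so offsets far beyond 4 GiB are representable.
std::string make_range_header(std::uint64_t offset, std::uint64_t length) {
    assert(length > 0);  // an empty range cannot be written as first-last
    const std::uint64_t last = offset + length - 1;
    return "bytes=" + std::to_string(offset) + "-" + std::to_string(last);
}
```

For a 12 KB chunk starting 100 GiB into the object, this yields bytes=107374182400-107374194687.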
I'm trying to understand hardware caches. I have a rough idea, but I would like to ask here whether my understanding is correct.
I understand that there are three types of cache mapping: direct, fully associative and set associative.
I would like to know: is the type of mapping implemented with logic gates in hardware and specific to a given computer system, so that changing the mapping would require changing the electrical connections?
My current understanding is that in RAM there is a memory address referring to each block of memory. A block contains words, and each word contains a number of bytes. We can express the number of addressable options as a number of bits.
For example, take 4096 memory locations, each containing 16 bytes. To refer to each byte we need 2^12 * 2^4 = 2^16 addresses,
so a 16-bit memory address would be required.
The cache also has a memory address, a valid bit, a tag, and some data, capable of storing one block of main memory of n words and thus m bytes, where m = n*i (i = bytes per word).
As an example, take direct mapping.
One block of main memory can only be at one particular location in the cache. When the CPU requests some data using a 16-bit RAM address, it checks the cache first.
How does it know that this particular 16-bit memory address can only be in a few places?
My thought is that there could be some electrical connection between every RAM address and a cache address. The 16-bit address could then be split into parts: for example, compare only the left 8 bits with every cache memory address, then on a match compare the byte bits, then the tag bits, then the valid bit.
Is my understanding correct? Thank you!
I really appreciate anyone reading this long post.
You may want to read section 3.3.1, Associativity, in What Every Programmer Should Know About Memory by Ulrich Drepper.
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf#subsubsection.3.3.1
The title is a little bit catchy, but it explains everything you ask in detail.
In short:
The problem with caches is the number of comparisons. If your cache holds 100 blocks, you need to perform 100 comparisons in one cycle. You can reduce this number by introducing sets: if a specific memory region can only be placed in slots 1-10, you reduce the number of comparisons to 10.
The sets are addressed by an additional bit field inside the memory address, called the index.
So for instance your 16 bits (from your example) could be split into:
[15:6] block-address; stored in the `cache` as the `tag` to identify the block
[5:4] index-bits; 2Bit-->4 sets
[3:0] block-offset; byte position inside the block
So the choice of method depends on the availability of hardware resources and the access time you want to achieve. It's pretty much hardwired, since you want to reduce the comparison logic.
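To make the bit-field split above concrete, here is a small sketch that decodes a 16-bit address into tag, index and block offset using exactly the [15:6] / [5:4] / [3:0] layout from the example. The struct and function names are invented for illustration; real hardware does this with wiring, not code.

```cpp
#include <cstdint>

// The three fields of a 16-bit address under the example layout.
struct CacheAddress {
    std::uint16_t tag;     // bits [15:6], stored in the cache to identify the block
    std::uint8_t  index;   // bits [5:4], selects one of 4 sets
    std::uint8_t  offset;  // bits [3:0], byte position inside the 16-byte block
};

// Split an address with shifts and masks; no lookup or comparison is needed
// to find the set, which is why the mapping can be hardwired.
CacheAddress split_address(std::uint16_t addr) {
    CacheAddress parts;
    parts.offset = static_cast<std::uint8_t>(addr & 0xF);
    parts.index  = static_cast<std::uint8_t>((addr >> 4) & 0x3);
    parts.tag    = static_cast<std::uint16_t>(addr >> 6);
    return parts;
}
```

For address 0xABCD this gives tag 0x2AF, index 0 and offset 0xD, so only the tags stored in set 0 have to be compared.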
There are a few mapping functions used to map cache lines to main memory:
Direct Mapping
Associative Mapping
Set-Associative Mapping
You need to have at least a rough idea of these three mapping functions.
I have a DynamoDB table where the attribute names are long strings, but the whole item is only 1 KB. Should I shorten the attribute names for network and storage performance, since each item stores the attribute names as well as the values? Or will DynamoDB automatically compress them to short codes before storing?
While it is true that single-item reads are charged indirectly via provisioned capacity units in increments of 4 KB, and writes in increments of 1 KB, during a query or table scan these calculations are applied to the summed total size of the data read or written.
In other words, using short attribute names does help significantly in increasing throughput capacity (for the same provisioned price) for queries: each item is smaller, so you can read many more items per second, because it takes more of them to reach the 4 KB (reads) or 1 KB (writes) at which a capacity unit is consumed.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/CapacityUnitCalculations.html
"Note
We recommend that you choose shorter attribute names rather than long ones. This will help you optimize capacity unit consumption and reduce the amount of storage required for your data."
Note also that numbers are stored more compactly than strings, at 2 digits per byte. So an ISO-format datetime (e.g. 2018-02-21T10:17:44.123Z), which contains letters, takes up much more space (24 bytes) than the same value stored as a number with the letters left out (e.g. 20180221101744.123), which takes 10 bytes: each digit pair is one byte, plus one byte for sign and decimal places.
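The arithmetic in that comparison can be checked with a small sketch. It only applies the rule of thumb stated above (one byte per digit pair plus one extra byte) and is not DynamoDB's actual encoder. The function name is hypothetical, and it assumes the literal has no leading or trailing zeros.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Approximate stored size of a number: one byte per pair of digits
// (rounded up) plus one byte for sign and decimal position.
// Assumption: every digit in the literal is significant.
std::size_t approx_number_bytes(const std::string& literal) {
    std::size_t digits = 0;
    for (unsigned char c : literal) {
        if (std::isdigit(c)) ++digits;  // skip '-', '.', etc.
    }
    return (digits + 1) / 2 + 1;
}
```

With the example values, approx_number_bytes("20180221101744.123") counts 17 digits and returns 10, while the ISO string "2018-02-21T10:17:44.123Z" is 24 characters long.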
Attribute names are user-determined except for the primary keys of the base table and indexes, so DynamoDB is not able to optimize on storing attribute names. Furthermore writes are charged in 1KB increments. It does not matter if your item size is 600 or 1000 bytes; such an item will incur 1 WCU to write. For usability purposes, it is better to have human-readable attribute names so if your application permits it, perhaps leave the attribute names as is?
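To see how the 1 KB and 4 KB increments mentioned in these answers play out, here is a simplified sketch. It only does the rounding and ignores read consistency modes and any other billing details. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Number of whole units of `unit_bytes` needed to cover `total_bytes`.
std::size_t units_for(std::size_t total_bytes, std::size_t unit_bytes) {
    assert(unit_bytes > 0);
    return (total_bytes + unit_bytes - 1) / unit_bytes;  // round up
}

// A single write is rounded up to the next 1 KB.
std::size_t write_units(std::size_t item_bytes) {
    return units_for(item_bytes, 1024);
}

// For a query, the sizes of all returned items are summed first and the
// total is then rounded up to the next 4 KB.
std::size_t query_read_units(std::size_t item_bytes, std::size_t item_count) {
    return units_for(item_bytes * item_count, 4096);
}
```

Under these assumptions a 600-byte item and a 1000-byte item both cost 1 unit to write, but a query returning 100 items costs 25 units at 1000 bytes each and 15 units at 600 bytes each, which is where shorter attribute names pay off.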
I'm developing an application that involves screen capture and hashing with C/C++. The image I'm capturing is about 250x250 pixels and I'm using the WinAPI HashData function for hashing.
My goal is to compare 2 hashes (i.e. of 2 images of 250x250) and instantly tell if they're equal.
My code:
const int PIXEL_SIZE = (sc_part.height * sc_part.width) * 3;  // 3 bytes (R, G, B) per pixel
BYTE* pixels = new BYTE[PIXEL_SIZE];
// Flatten the captured COLORREF pixels into a packed RGB byte buffer.
for (UINT y = 0, b = 0; y < sc_part.height; y++) {
    for (UINT x = 0; x < sc_part.width; x++) {
        COLORREF rgb = sc_part.pixels[(y * sc_part.width) + x];
        pixels[b++] = GetRValue(rgb);
        pixels[b++] = GetGValue(rgb);
        pixels[b++] = GetBValue(rgb);
    }
}
const int MAX_HASH_LEN = 64;
BYTE Hash[MAX_HASH_LEN] = {0};
HashData(pixels, PIXEL_SIZE, Hash, MAX_HASH_LEN);
// ... I now have my variable-size hash; the example above uses 64 bytes
delete[] pixels;
I've tested different hash sizes and their approximate completion times, which were roughly:
32 bytes = ~30ms
64 bytes = ~47ms
128 bytes = ~65ms
256 bytes = ~125ms
My question is:
How long should the hash code be for a 250x250 image to prevent any duplicates, like never?
I don't like a hash code of 256 characters, since it will cause my app to run slowly (the captures are very frequent). Is there a "safe" hash size per image dimensions for comparing?
Thanks
Assuming, based on your comments, that you're adding the hash calculated "on-the-fly" to the database, so that the hash of every image in the database ends up getting compared to the hash of every other image in the database, then you've run into the birthday paradox. The likelihood that there are two identical numbers in a set of randomly selected numbers (e.g. the birthdays of a group of people) is greater than what you'd intuitively assume. If there are 23 people in a room then there's a 50:50 chance two of them share the same birthday.
That means that, assuming a good hash function, you can expect a collision (two images having the same hash despite not being identical) after about 2^(N/2) hashes, where N is the number of bits in the hash. If your hash function isn't so good you can expect a collision even earlier. Unfortunately only Microsoft knows how good HashData actually is.
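For a feel of the numbers, here is a sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2d)) for n random values drawn from d possibilities. It assumes an ideal, uniformly distributed hash, which may not hold for HashData. The function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that at least two of `n` uniformly random values
// out of `space` possible values coincide (birthday approximation).
double collision_probability(double n, double space) {
    assert(space > 0.0);
    // -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -std::expm1(-n * (n - 1.0) / (2.0 * space));
}

// Same estimate for an ideal hash with `bits` bits of output.
double collision_probability_bits(double n, int bits) {
    return collision_probability(n, std::ldexp(1.0, bits));  // space = 2^bits
}
```

With 23 values and 365 possibilities this gives roughly 0.5, matching the birthday example. For an ideal 64-bit hash, 2^32 hashes already give a probability of about 0.39.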
Your comments also bring up a couple of other issues. One is that HashData doesn't produce variable-sized hashes. It produces an array of bytes that's always the same length as the value you passed as the hash length. Your problem is that you're treating it as a string of characters instead. C-style strings are zero terminated, meaning that the end of the string is marked with a zero-valued character ('\0'). Since the array of bytes will contain zero-valued elements at random positions, it will appear to be truncated when used as a string. Treating the hash as a string like this makes it much more likely that you'll get a collision.
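The truncation problem is easy to reproduce. The sketch below compares two fixed-length byte arrays once as raw bytes (std::memcmp over the full length) and once as C strings (std::strcmp, shown only for contrast as the approach to avoid). Both helper names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Correct: compare all `len` bytes, zero-valued bytes included.
bool hashes_equal(const unsigned char* a, const unsigned char* b, std::size_t len) {
    assert(a != nullptr && b != nullptr);
    return std::memcmp(a, b, len) == 0;
}

// For contrast only: treats the hashes as zero-terminated strings, so the
// comparison stops at the first zero byte and ignores everything after it.
bool hashes_equal_as_strings(const unsigned char* a, const unsigned char* b) {
    return std::strcmp(reinterpret_cast<const char*>(a),
                       reinterpret_cast<const char*>(b)) == 0;
}
```

For the 4-byte arrays {0x12, 0x00, 0x34, 0x56} and {0x12, 0x00, 0x99, 0x56}, hashes_equal reports a difference, while the string comparison stops at the zero byte and wrongly reports them as equal.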
The other issue is that you said that you stored the images being compared in your database and that these images must be unique. If this uniqueness is being enforced by the database then checking for uniqueness in your own code is redundant. Your database might very well be able to do this faster than your own code.
GUIDs (Globally Unique IDs) are 16 bytes long, and Microsoft assumes that no GUIDs will ever collide.
Using a 32 byte hash is equivalent to taking two randomly generated GUIDs and comparing them against two other randomly generated GUIDs.
The odds that you will get a collision with a 32 byte hash are vanishingly small: 1/2^256, or about 8.64E-78.
The universe will reach heat death long before you get a collision.
This comment from Michael Grier more or less encapsulates my beliefs. In the worst case, you should take an image, compute a hash, change the image by 1 byte, and recompute the hash. A good hash should change by more than one byte.
You also need to trade this off against the "birthday effect"; by the pigeonhole principle, any hash will generate collisions. A quick comparison of the first N bytes, though, will typically reject collisions.
Cryptographic hashes are typically "better" hashes in the sense that more hash bits change per input bit change, but are much slower to compute.
I want to use APR to mmap a really large file, greater than 4 GB. First I need to create a file this big, but I found that the function apr_file_seek accepts a parameter of type apr_seek_where_t, which is just an alias for int. So it seems possible to seek within the first 4 gigs only. Is it possible to handle large files with APR?
You can seek multiple times with APR_CUR.
Also note that an int on a 32-bit system allows you to seek two gibibytes forward, not four.
Also note that on a 32-bit system the mmap will most probably fail to map more than two to three gibibytes. (When the address space is limited to 32 bits the maximum address space is four gibibytes, but the operating system has to reserve some of that address space for itself.)
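The "seek multiple times" idea can be sketched without touching APR at all. Under the question's premise that a single relative seek is limited to a signed 32-bit offset, a large target position is split into steps that each fit. The function name split_seek is made up, and issuing each step as an actual relative seek is left to the caller.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Split an absolute 64-bit target offset into relative steps, each no larger
// than the maximum signed 32-bit value (2 GiB minus one byte).
std::vector<std::int32_t> split_seek(std::uint64_t target) {
    const std::uint64_t max_step = std::numeric_limits<std::int32_t>::max();
    std::vector<std::int32_t> steps;
    while (target > 0) {
        const std::uint64_t step = std::min(target, max_step);
        steps.push_back(static_cast<std::int32_t>(step));
        target -= step;
    }
    return steps;
}
```

Reaching the 5 GiB mark from the start of the file takes three relative seeks with this scheme.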
I want to pack lots of information into a block bit by bit, and save it to a file.
To keep my file small, I want to use a small number of bits to save each piece of information instead of an int.
For example, I want to store Day, Hour, Minute to a file.
I only want 5 bits (Day) + 5 bits (Hour) + 6 bits (Minute) = 16 bits of memory for data storage.
I cannot find an efficient way to store it in a block to put in a file.
There are some big problems that concern me:
The data length I want to store each time is not constant. It depends on the incoming information, so I cannot use a structure to store it.
There must not be any unused bits in my block. I found some topics mentioning that if I store 30 bits in an int (a 4-byte variable), then the next 3 bits I save will automatically go into the next int. I do not want that to happen!
I know I can use shift right and shift left to put a number into a char, and put the char into a block, but that is inefficient.
I want a char array that I can keep appending specified bits to, and then use write to put it into a file.
I think I'd just use the number of bits necessary to store the largest value you might ever need for any given piece of information. Then, Huffman encode the data as you write it (and obviously Huffman decode it as you read it). Most other approaches are likely to be less efficient, and many are likely to be more complex as well.
I haven't seen such a library, so I'm afraid you'll have to write one yourself. It won't be difficult, anyway.
As for efficiency: this kind of operation always needs bit shifting and masking, because few CPUs support operating directly on bits, especially across two machine words. The only difference is whether you or your compiler does the translation.
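As a starting point for writing it yourself, here is a minimal sketch of a bit writer that appends fields of arbitrary width to a byte buffer with no padding between them. The class name BitWriter and its interface are invented for this example. It works one bit at a time for clarity rather than speed, and uses the shifting and masking described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends bit fields back to back into a growing byte buffer.
class BitWriter {
public:
    // Append the low `count` bits of `value`, most significant bit first.
    void put(std::uint32_t value, unsigned count) {
        assert(count <= 32);
        for (unsigned i = count; i-- > 0;) {
            if (bit_count_ % 8 == 0) bytes_.push_back(0);  // start a new byte
            if ((value >> i) & 1u) {
                bytes_.back() |= static_cast<std::uint8_t>(0x80u >> (bit_count_ % 8));
            }
            ++bit_count_;
        }
    }

    // Packed bytes; only the last byte may contain unused (zero) low bits.
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

    // Total number of bits written so far.
    std::size_t bit_count() const { return bit_count_; }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t bit_count_ = 0;
};
```

Writing day 31 (5 bits), hour 23 (5 bits) and minute 59 (6 bits) produces exactly two bytes, 0xFD 0xFB. The buffer returned by bytes() can then be written to a file in one call.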