I have a DynamoDB table where the attribute names are long strings, but each whole item is only about 1 KB. Should I shorten the attribute names for network and storage performance, since each item stores the attribute names as well as the values, or does DynamoDB automatically compress them to short codes before storing?
While it is true that single-item reads are charged (indirectly, via provisioned capacity units) in increments of 4 KB and writes in increments of 1 KB, during a query or table scan that rounding is applied to the summed total of the data read or written.
In other words, using short attribute names does help significantly in increasing throughput (for the same provisioned price) on queries: each item is smaller, so many more items fit into each 4 KB read increment (or 1 KB write increment) before a capacity unit is consumed.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/CapacityUnitCalculations.html
"Note
We recommend that you choose shorter attribute names rather than long ones. This will help you optimize capacity unit consumption and reduce the amount of storage required for your data."
Note also that numbers are stored more compactly than strings, at roughly two digits per byte. So an ISO-format datetime (e.g. 2018-02-21T10:17:44.123Z), which contains letters and separators, takes 24 bytes as a string, whereas storing it as a number (e.g. 20180221101744.123) takes only about 10 bytes: one byte per pair of digits plus one byte for the sign and decimal places.
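To make the arithmetic concrete, here is a minimal back-of-envelope sketch (the attribute names and value sizes are made up, and DynamoDB's exact serialized size has some per-type overhead this ignores) of how attribute-name length feeds into item size and therefore into capacity-unit rounding:

#include <cstdio>
#include <string>
#include <vector>

// Rough estimate: each attribute contributes the length of its name plus the size of its value.
struct Attr { std::string name; std::size_t valueBytes; };

std::size_t estimateItemSize(const std::vector<Attr>& attrs) {
    std::size_t total = 0;
    for (const auto& a : attrs) total += a.name.size() + a.valueBytes;
    return total;
}

int main() {
    // Hypothetical item: long attribute names vs. short names for the same values.
    std::vector<Attr> verbose = {{"customerLoyaltyProgramIdentifier", 36},
                                 {"lastPurchaseTimestampIsoString",   24}};   // ISO datetime as a string
    std::vector<Attr> terse   = {{"clpId", 36},
                                 {"lpTs",  10}};                              // datetime stored as a number

    for (const auto& item : {verbose, terse}) {
        std::size_t bytes = estimateItemSize(item);
        // Writes round up to 1 KB per WCU; strongly consistent reads round up to 4 KB per RCU,
        // and a Query sums the returned items before rounding.
        std::printf("~%zu bytes -> %zu WCU per write, ~%zu items per 4 KB read unit\n",
                    bytes, (bytes + 1023) / 1024, static_cast<std::size_t>(4096) / bytes);
    }
}

The shorter names roughly halve the item size in this example, which is exactly the effect the documentation note above is pointing at.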
Attribute names are user-determined, and apart from the primary keys of the base table and its indexes nothing is declared up front, so DynamoDB is not able to optimize how attribute names are stored. Furthermore, writes are charged in 1 KB increments: it does not matter whether your item is 600 or 1,000 bytes; either way it costs 1 WCU to write. For usability, human-readable attribute names are preferable, so if your application permits it, perhaps leave the attribute names as they are?
I'm trying to find this information but without much luck so far.
I have a fairly large file on S3 (in the range of 64-128 GB) and I would like to read random byte ranges from it - is that doable? I guess the byte-range offset would be very large for chunks somewhere near the end of the file.
Typical read chunks are 1-12 KB, if that's important.
This won't be a problem.
The Amazon docs state that the Range header is supported as per the spec (but only one range per request):
Range
Downloads the specified range bytes of an object. For more information about the HTTP Range header, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
RFC2616 doesn't define any upper limit.
The first-byte-pos value in a byte-range-spec gives the byte-offset of the first byte in a range. The last-byte-pos value gives the byte-offset of the last byte in the range; that is, the byte positions specified are inclusive. Byte offsets start at zero.
This makes sense because the range is automatically limited by the file size, and a system that, say, only supports 32-bit numbers overall will usually also only support files up to a size of 2 GiB (minus one).
The maximum size of objects in S3 is 5 TiB.
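For instance, here is a minimal sketch (plain C++, no SDK calls; the object size and chunk size are made up) of computing the Range header for a chunk near the end of a 128 GB object; the only real care needed is that the offsets require a 64-bit integer type:

#include <cstdint>
#include <cstdio>
#include <string>

int main() {
    // Hypothetical 128 GB object and a 12 KB chunk right at its end.
    const std::uint64_t objectSize = 128ULL * 1024 * 1024 * 1024;  // 137,438,953,472 bytes
    const std::uint64_t chunkSize  = 12 * 1024;
    const std::uint64_t first      = objectSize - chunkSize;       // byte offsets are zero-based
    const std::uint64_t last       = objectSize - 1;               // last-byte-pos is inclusive

    // This header value would be attached to the GET request by whatever
    // HTTP client or SDK you use to talk to S3.
    const std::string range = "Range: bytes=" + std::to_string(first) + "-" + std::to_string(last);
    std::printf("%s\n", range.c_str());  // Range: bytes=137438941184-137438953471
}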
I am planning to create id CHARACTER VARYING(100) ENCODE ZSTD,
id2 CHARACTER VARYING(5000) ENCODE ZSTD.
Now my id and id2 values are only about 20 characters long.
In AWS Redshift, is space allocated based on the actual data size (about 20 characters) or based on the declared sizes of 100 and 5000 respectively? And how is performance affected in these scenarios?
Thanks
TOM
Two things here.
Storage: With varchars, the amount of space consumed is based on the actual amount of space required, not the length (in bytes) declared.
Query Performance: During query execution, Redshift does not know in advance how many bytes a varchar value will require, so it allocates memory based on the declared length. Over-declared varchars therefore cause queries to consume more memory, which can in certain cases cause them to spill to disk. This can have a particularly negative impact on vacuum performance.
Summary: Declare varchars to be as short as possible. So, in your case, if the values are 20 characters or so, a declared length of 25-30 would be a good choice.
Amazon Redshift stores data using a Compression Encoding, so it is not vital to allocate the minimum space.
It is often best to allow Redshift to choose the compression type when data is loaded via the COPY command, rather than specifying it yourself. This will result in the most efficient method being chosen, based on the first 100,000 rows loaded.
Apologies in advance, as I think I need to give some background on my problem.
We have a proprietary database engine, written in native C++ and built for a 32-bit runtime. Database records are identified by their record number (essentially the offset in the file where the record is written) and a "unique id" (which is nothing more than a value from -100 down to LONG_MIN).
Previously the engine limited a database to only 2 GB (where a record block can have a minimum size of 512 bytes, up to 512 * (1 to 7) bytes). This effectively limits the number of records to about 4 million.
We index these 4 million records and store the index in a hash table (we implemented extensible hashing for this), and it works brilliantly for a 2 GB database. Each index entry is 24 bytes. Both the record number and the "unique id" of every record are indexed (the entries reside on the heap, and both the record number and the "unique id" can point to the same entry). The index is kept in memory and persisted to file (though only the record-number-based entries are stored in the file). In memory, a 2 GB database's index consumes about 95 MB, which is still fine in a 32-bit runtime (but we limited the software to hosting about 7 databases per database engine as a safety measure).
The problem began when we decided to increase the size of a database from 2 GB to 32 GB. This effectively increases the number of records to about 64 million, which would mean the hash table would hold 1.7 GB worth of index entries in heap memory for a single 32 GB database alone.
I ditched the in-memory hash table and wrote the index straight to a file, but I failed to consider the time it would take to search for an entry in the file, given that I cannot sort the entries on demand (writes to the database happen all the time, so the index must be updated almost immediately). Basically I'm having problems with re-indexing: the software needs to check whether a record exists by looking it up in the current index, and since I changed it from an in-memory index to file I/O, it now takes forever just to finish indexing 32 GB (by my calculation, even indexing 2 GB will apparently take 3 days to complete).
I then decided to store the entries in order of record number so I don't have to search the file, and structured each index entry like this:
struct node {
long recNum; // Record Number
long uId; // Unique Id
long prev;
long next;
long rtype;
long parent;
};
This works perfectly if I use recNum to determine where in the file the entry is stored and retrieve it using read(...); my problem is searching based on the "unique id".
When I search the index file based on "unique id", what I'm essentially doing is loading chunks of the 1.7 GB index file and checking the "unique id" of each entry until I get a hit, but this proves to be a very slow process. I attempted to create an index of the index so that I could loop more quickly, but it is still slow. Basically, there is a function in the software that will eventually check every record in the database by first checking whether it exists in the index via a "unique id" query, and when this function runs, going through the 1.7 GB index will take 4 weeks by my calculation if I implement a file-based index query and write.
So I guess what I'm trying to ask is: when dealing with large databases (such as 30 GB worth of data), where persisting the index in memory in a 32-bit runtime isn't an option due to limited resources, how does one implement a file-based index or hash table without sacrificing so much time that it becomes impractical?
It's quite simple: Do not try to reinvent the wheel.
Any full SQL database out there is easily capable of storing and indexing tables with several million entries.
For a large table you would commonly use a B+Tree. You don't need to balance the tree on every insert, only when a node exceeds the minimum or maximum size. This gives a bad worst case runtime, but the cost is amortized.
There is also a lot of logic involved in efficiently and dynamically caching and evicting parts of the index in memory. I strongly advise against trying to re-implement all of that on your own.
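If you do end up rolling your own anyway, the lookup side of a disk-based B+-tree is the easy part. Below is a heavily simplified sketch (made-up page layout, search only; no inserts, splits, node caching or crash recovery, which is where most of the real work lies) of what a file-based lookup by "unique id" could look like:

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Heavily simplified on-disk B+-tree lookup: search path only, no inserts,
// splits, caching or crash recovery (the parts warned about above).
// Assumes the index file is an array of fixed-size 4 KB pages with page 0 as the root.
constexpr int kOrder = 255;                          // max keys per node (made-up figure)

struct Node {
    std::uint16_t isLeaf;
    std::uint16_t count;                             // number of keys in use
    std::int64_t  key[kOrder];                       // e.g. the record's "unique id"
    std::int64_t  child[kOrder + 1];                 // page numbers (internal) or record numbers (leaf)
};                                                   // 4096 bytes with the usual padding

bool readPage(std::FILE* f, std::int64_t page, Node* out) {
    // For index files larger than 2 GB on a 32-bit build you would need
    // _fseeki64 / fseeko instead of the plain long-based fseek.
    return std::fseek(f, static_cast<long>(page * static_cast<std::int64_t>(sizeof(Node))), SEEK_SET) == 0 &&
           std::fread(out, sizeof(Node), 1, f) == 1;
}

// Returns the record number stored for uniqueId, or -1 if it is not present.
std::int64_t lookup(std::FILE* f, std::int64_t uniqueId) {
    Node node;
    std::int64_t page = 0;                           // start at the root page
    while (readPage(f, page, &node)) {
        const std::int64_t* end = node.key + node.count;
        if (node.isLeaf) {
            const std::int64_t* pos = std::lower_bound(node.key, end, uniqueId);
            return (pos != end && *pos == uniqueId) ? node.child[pos - node.key] : -1;
        }
        // Internal node: child[i] covers keys in [key[i-1], key[i]).
        page = node.child[std::upper_bound(node.key, end, uniqueId) - node.key];
    }
    return -1;                                       // read failed
}

Each lookup touches only a logarithmic number of pages, so even a 64-million-entry index needs just a handful of reads per query instead of a scan over the whole 1.7 GB file.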
Is there a performance penalty (or improvement) for using STRING(MAX) instead of some fixed limit like STRING(256)?
Nope. STRING(MAX) is treated exactly the same as strings of limited length under the hood, and the same applies to BYTES(MAX). So there is no performance difference.
The main reason to use a fixed limit is if there are logical constraints you want to enforce in your schema. For example, if you are using a STRING to store 2-letter country codes, you might want to use STRING(2).
Note that, according to the docs, you can always change the length limit for a string, with one caveat:
Supported Schema Updates: Increase or decrease the length limit for a STRING or BYTES type (including to MAX), unless it is a primary key column inherited by one or more child tables.
I am writing a database and I wish to assign every item of a specific type a unique ID (for internal data-management purposes). However, the database is expected to run for a long (theoretically infinite) time and with a high turnover of entries (entries are deleted and added on a regular basis).
If we model our unique ID as an unsigned int, and assume there will always be fewer than 2^32 - 1 entries in the database (we cannot use 0 as a unique ID), we could do something like the following:
void GenerateUniqueID( Object* pObj )
{
static unsigned int iCurrUID = 1;
pObj->SetUniqueID( iCurrUID++ );
}
However, this is only fine until entries start getting removed and new ones added in their place: there may still be fewer than 2^32 - 1 entries, but iCurrUID can overflow, and we end up assigning "unique" IDs that are already in use.
One idea I had was to use a std::bitset<std::numeric_limits<unsigned int>::max() - 1> and traverse it to find the first free unique ID, but this has high memory consumption and takes linear time to find a free ID, so I'm looking for a better method if one exists.
Thanks in advance!
I'm aware that changing the datatype to a 64-bit integer instead of a 32-bit integer would resolve my problem; however, because I am working in the Win32 environment and working with lists (with DWORD_PTR being 32 bits), I am looking for an alternative solution. Moreover, the data is sent over a network, and I was trying to reduce bandwidth consumption by using a smaller unique ID.
With a uint64_t (64-bit), it would take you well, well over 100 years, even if you insert close to 100k entries per second.
Over 100 years at that rate you would insert somewhere around 315,360,000,000,000 records (not taking leap years, leap seconds, etc. into account). This number fits into 49 bits.
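A quick sanity check of that arithmetic (the 100-year and 100k-per-second figures are just the assumptions from above):

#include <cstdint>
#include <cstdio>

int main() {
    // 100k inserts per second for 100 years of 365 days (leap days ignored).
    const std::uint64_t perSecond = 100000;
    const std::uint64_t seconds   = 100ULL * 365 * 24 * 3600;
    const std::uint64_t total     = perSecond * seconds;           // 315,360,000,000,000
    std::printf("%llu records; fits in 49 bits? %s\n",
                static_cast<unsigned long long>(total),
                total < (1ULL << 49) ? "yes" : "no");               // prints "yes"
}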
How long do you anticipate the application to run? Over 100 years?
This is the common thing database administrators do when an autoincrement field approaches the 32-bit limit: they change the column to the native 64-bit type (or 128-bit) for their DB system.
The real question is how many entries you can have before you are guaranteed that the first one has been deleted, and how often you create new entries. An unsigned long long is guaranteed to have a maximum value of at least 2^64 - 1, about 1.8x10^19. Even at one creation per microsecond, that will last for a couple of thousand centuries. Realistically, you're not going to be able to create entries that fast (since disk speed won't allow it), and your program isn't going to run for hundreds of centuries (because the hardware won't last that long). If the unique IDs are for something disk-based, you're safe using unsigned long long for the ID.
Otherwise, of course, generate as many bits as you think you might need. If you're really paranoid, it's trivial to use a 256-bit unsigned integer, or even longer. At some point you'll be fine even if every atom in the universe creates a new entry every picosecond, until the end of the universe. (But realistically... unsigned long long should suffice.)
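To make that concrete, here is a minimal sketch of the counter from the question widened to 64 bits (the std::atomic is my own assumption that IDs might be requested from more than one thread; drop it if generation is strictly single-threaded):

#include <atomic>
#include <cstdint>

// 64-bit replacement for the 32-bit counter in the question. Even at 100k IDs
// per second, exhausting a 64-bit counter would take millions of years.
std::uint64_t GenerateUniqueID()
{
    static std::atomic<std::uint64_t> s_nextId{1};   // 0 stays reserved as "no ID"
    return s_nextId.fetch_add(1, std::memory_order_relaxed);
}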