AWS Redshift data type space allocation

I am planning to create two columns: id CHARACTER VARYING(100) ENCODE ZSTD and id2 CHARACTER VARYING(5000) ENCODE ZSTD.
In practice, the values in id and id2 are only about 20 characters long.
In AWS Redshift, does space allocation happen based on the actual data size (about 20 characters), or is it allocated up front based on the declared sizes of 100 and 5000 respectively? If the latter, how is performance affected in these scenarios?
Thanks
TOM

Two things here.
Storage: With varchars, the amount of space consumed is based on the actual length of each value, not on the declared length (in bytes).
Query performance: Redshift does not know in advance how many bytes a varchar value will actually need, so during query execution it allocates memory based on the declared length. This makes queries consume more memory, which in some cases can cause them to spill to disk, and it can have a particularly negative impact on vacuum performance.
Summary: Declare varchars to be as short as practical. In your case, if the values are around 20 characters, a length of 25-30 would be a good choice.
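For example, a minimal sketch of a tighter table definition (the table name is made up; VARCHAR is Redshift's shorthand for CHARACTER VARYING):
CREATE TABLE my_table (           -- hypothetical table name
    id  VARCHAR(30) ENCODE ZSTD,  -- values are ~20 characters, so 30 leaves headroom
    id2 VARCHAR(30) ENCODE ZSTD   -- same data; no need to declare 5000
);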

Amazon Redshift stores data using a compression encoding on each column, so it is not vital to allocate the minimum space.
It is often best to let Redshift choose the compression type when data is loaded via the COPY command, rather than specifying it yourself. The most efficient encoding is then chosen per column based on a sample of the data (the first 100,000 rows loaded).
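As a sketch, assuming the data already sits in S3 and you have a suitable IAM role (the bucket, table, and role names below are placeholders), loading into an empty table with automatic compression analysis looks roughly like this:
COPY my_table
FROM 's3://my-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE ON;  -- analyze a sample of the incoming data and choose column encodings
For a table that is already loaded, ANALYZE COMPRESSION my_table; reports the encodings Redshift would recommend.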

Related

Usage of BYTES in BigQuery?

I was wondering what might be a use case for the BYTES type in BigQuery. Digging through the public datasets, the only place I could find the BYTES data type used is the bitcoin_blockchain dataset, and in that case it looks like the data could just be base64-encoded as a string (briefly glancing at the preview, it seems this may already be the case).
So basically my question is: what are some use cases for the BYTES data type where the job couldn't just as easily be done with the STRING type? (Does anyone store multimedia data in BQ or a data warehouse?) Could BQ do everything it currently does without the BYTES type, or is it an essential (and used) type?
A base64 string inflates the data by about 33% in size. Considering that BigQuery charges you for the size of the data stored and the size of the data scanned, if you have enough binary data that cost is a concern, BYTES gives you a lower cost.
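As a rough illustration (the table and column names are made up), you can compare how much space a base64 STRING takes versus the BYTES it decodes to:
-- hypothetical table/column: base64 text size vs. decoded binary size
SELECT
  LENGTH(payload_base64)                   AS base64_chars,
  BYTE_LENGTH(FROM_BASE64(payload_base64)) AS raw_bytes
FROM `my_project.my_dataset.blobs`
LIMIT 10;
raw_bytes comes out to roughly three quarters of base64_chars, which is where the ~33% overhead comes from.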

Performance difference for STRING(MAX)?

Is there a performance penalty (or improvement) for using STRING(MAX) instead of some fixed limit like STRING(256)?
Nope. STRING(MAX) is treated exactly the same under the hood as strings with a fixed length limit; the same applies to BYTES(MAX). So there is no performance difference.
The main reason to use a fixed limit is if there are logical constraints you want to enforce in your schema. For example, if you are using a STRING to store 2-letter country codes, you might want to use STRING(2).
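For example, a hypothetical table that bakes the two-letter constraint into the schema:
CREATE TABLE Countries (
  Code STRING(2) NOT NULL,  -- logical constraint: two-letter country codes only
  Name STRING(MAX)          -- unbounded, with no performance penalty
) PRIMARY KEY (Code);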
Note that, according to the docs, you can always change the length limit for a string, except with one caveat:
Supported Schema Updates: Increase or decrease the length limit for a STRING or BYTES type (including to MAX), unless it is a primary key column inherited by one or more child tables.

dynamodb attribute name compression

I have a DynamoDB table where the attribute names are long strings, but each whole item is only about 1 KB. Should I shorten the attribute names to improve network and storage performance, since every item stores the attribute names as well as the values, or does DynamoDB automatically compress them to short codes before storing?
While it is true that single-item reads are charged (indirectly, via provisioned capacity units) in increments of 4 KB and writes in increments of 1 KB, for a query or table scan the calculation is applied to the summed total size of the data read or written.
In other words, using short attribute names does help significantly in increasing throughput capacity (for the same provisioned price) for queries: each item is smaller, so it takes more of them to reach the 4 KB read or 1 KB write boundary at which a capacity unit is consumed, and you can read many more items per second.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/CapacityUnitCalculations.html
"Note
We recommend that you choose shorter attribute names rather than long ones. This will help you optimize capacity unit consumption and reduce the amount of storage required for your data."
Note also that numbers are stored more compactly than strings, at roughly two digits per byte. So an ISO-format datetime such as 2018-02-21T10:17:44.123Z, which contains letters and punctuation, takes up much more space (24 bytes) than the same instant stored as a number such as 20180221101744.123 (about 10 bytes: one byte per digit pair, plus one byte for the sign and decimal places).
Attribute names are user-determined (except for the primary keys of the base table and indexes), so DynamoDB cannot optimize how attribute names are stored. Also note that writes are charged in 1 KB increments: it does not matter whether your item is 600 or 1,000 bytes, it will still incur 1 WCU to write. For usability, human-readable attribute names are better, so if your application permits it, perhaps leave the attribute names as they are.

Size/Resize of GHashTable

Here is my use case: I want to use GLib's GHashTable with IP addresses as keys and the volume of data sent/received by each IP address as the value. So far I have managed to implement this in user space, using some kernel variables to look up the volume per IP address.
Now the question is: suppose I have a lot of IP addresses (500,000 up to 1,000,000 unique ones). It is really not clear how much space is allocated, what initial size a new hash table gets when created with g_hash_table_new()/g_hash_table_new_full(), and how the whole thing works in the background. It is known that resizing a hash table can take a lot of time, so how can I play with these parameters?
Neither g_hash_table_new() nor g_hash_table_new_full() let you specify the size.
The size of a hash table is only exposed as the number of values stored in it; you don't have access to the actual array size that is typically used in the implementation.
However, the existence of g_spaced_primes_closest() kind of hints that GLib's hash table uses a prime-sized internal array.
I would say that although a million keys is quite a lot, it's not extraordinary. Try it, and then measure the performance to determine if it's worth digging deeper.

limit on string size in c++?

I have about a million records, each of about 30 characters, coming in over a socket. Can I read all of it into a single string? Is there a limit on the string size I can allocate?
If so, is there some way I can send data over the socket record by record and receive it record by record? I don't know the size of each record until runtime.
To answer your first question: The maximum size of a C++ string is given by string::max_size
std::string::max_size() will tell you the theoretical limit imposed by the architecture your program is running under. Other than that, as long as you have sufficient RAM and/or disk swap space, you can have std::strings of huge size.
The answer to your second question is yes, you can send record by record, moreover you might not be able to send big chunks of data over a socket at once - there are limits on the size of a single send operation. That the size of a single string is not known until runtime is not a problem, it doesn't need to be known at compile time for sending them over a socket. How to actually send those strings record by record depends on what socket/networking library you are using; consult the relevant documentation.
There is no official limit on the size of a string. The software will ask your system for memory and, as long as it gets it, it will be able to add characters to your string.
The rest of your question is not clear.
The only practical limit on string size in C++ is your available memory. That said, it will be expensive to keep reallocating your string to the right size as you receive data (assuming you do not know its total size in advance). Normally you would read chunks of the data into a fixed-size buffer and decode it into its natural shape (your records) as you go.
The size of a string is only limited by the amount of memory available to the program; it is more of an operating system limitation than a C++ limitation. C/C++ strings are null-terminated, so the string routines will happily process extremely long strings until they find a null.
On Win32 the maximum amount of memory available for data is normally around 2 Gigs.
You can read arbitrarily large amounts of data from a socket, but you must have some way of delimiting the data that you're reading. There must be an end-of-record marker or a length associated with the records you are reading, so use that to parse the records. Do you really want to read the data into a string? What happens if you don't have enough free memory to hold the data in RAM? I suspect there is a more efficient way to handle this data, but I don't know enough about the problem.
In theory, no. But don't go allocating 100 GB of memory, because the user will probably not have that much RAM. If you are using std::string, the upper bound is given by std::string::max_size().
If we are talking about char*, you are limited to something around 2^32 bytes on 32-bit systems and 2^64 on (surprise) 64-bit ones.
Update: this is wrong; see the comments.
How about sending them in a different format?
In your server (sock is assumed to be the connected socket descriptor):
// send the record length first, as a fixed-size value in network byte order
uint32_t cbRecord = htonl(strlen(szRecordServer));
send(sock, &cbRecord, sizeof(cbRecord), 0);
send(sock, szRecordServer, strlen(szRecordServer), 0);
In your client:
// read the length, allocate a buffer, then read exactly that many bytes
uint32_t cbRecord;
recv(sock, &cbRecord, sizeof(cbRecord), 0);
cbRecord = ntohl(cbRecord);
char *szRecordClient = (char *)malloc(cbRecord + 1);
recv(sock, szRecordClient, cbRecord, 0);
szRecordClient[cbRecord] = '\0';
(In real code, check the return values and loop until all the bytes have been sent/received.)
And repeat this a million times.