Performance difference for STRING(MAX)? - google-cloud-platform

Is there a performance penalty (or improvement) for using STRING(MAX) instead of some fixed limit like STRING(256)?

Nope. Under the hood, STRING(MAX) is treated exactly the same as a string with a fixed length limit, and the same applies to BYTES(MAX). So there is no performance difference.
The main reason to use a fixed limit is if there are logical constraints you want to enforce in your schema. For example, if you are using a STRING to store 2-letter country codes, you might want to use STRING(2).
Note that, according to the docs, you can always change the length limit for a STRING or BYTES column, with one caveat:
Supported Schema Updates: Increase or decrease the length limit for a STRING or BYTES type (including to MAX), unless it is a primary key column inherited by one or more child tables.
In other words, if the column is part of a primary key that interleaved child tables inherit, its length limit cannot be changed.

Related

SAS Hash Tables: Is there a way to find/join on different keys or have optional keys

I frequently work with data for which the keys are not perfect, and I need to join data from a different source. I want to continue using hash objects for the speed advantage; however, when I am using a lot of data I can run into crashes (memory constraints).
A simplified overview: I have two different keys, each unique but not present for every record; we will call them Key1 and Key2.
My current solution, which is not very elegant (but it works), is the following:
if _N_ = 1 then do;
    /* Load the same lookup dataset into two hash tables, keyed on key1 and key2 */
    declare hash h1(Dataset:"DataSet1");
    h1.DefineKey("key1");
    h1.DefineData("Value");
    h1.DefineDone();

    declare hash h2(Dataset:"DataSet1");
    h2.DefineKey("key2");
    h2.DefineData("Value");
    h2.DefineDone();
end;

set DataSet2;

/* Try key1 first; fall back to key2 only if key1 is not found */
rc = h1.find();
if rc NE 0 then do;
    rc = h2.find();
end;
So I have exactly the same dataset in two hash tables, but with two different keys defined; if the first key is not found, I try to find the second key.
Does anyone know of a way to make this more efficient/easier to read/less memory intensive?
Apologies if this seems a bad way to accomplish the task, I absolutely welcome criticism so I can learn!
Thanks in advance,
Adam.
I am a huge proponent of hash table lookups - they've helped me do some massive multi-hundred-million row joins in minutes that would otherwise have taken hours.
The way you're doing it isn't a bad route. If you find yourself running low on memory, the first thing to identify is how much memory your hash table is actually using. This article by sasnrd shows exactly how to do this.
Once you've figured out how much it's using and have a benchmark, or if it doesn't even run at all because it runs out of memory, you can play around with some options to see how they improve your memory usage and performance.
1. Include only the keys and data you need
When loading your hash table, exclude any unnecessary variables. You can do this before loading the hash table, or during. You can use dataset options to help reduce table size, such as where, keep, and drop.
dcl hash h1(dataset: 'mydata(keep=key var1)');
2. Reduce the variable lengths
Long character variables take up more memory. Decreasing the length to their minimum required value will help reduce memory usage. Use the %squeeze() macro to automatically reduce all variables to their minimum required size before loading. You can find that macro here.
%squeeze(mydata, mydata_smaller);
3. Adjust the hashexp option
hashexp helps improve performance and reduce hash collisions. Larger values of hashexp will increase memory usage but may improve performance; smaller values will reduce memory usage. I recommend reading the link above, and also the link at the top of this post by sasnrd, to get an idea of how it will affect your join. This value should be sized appropriately depending on the size of your table. There's no hard and fast answer as to what value you should use; my recommendation is as big as your system can handle.
dcl hash h1(dataset: 'mydata', hashexp:2);
4. Allocate more memory to your SAS session
If you often run out of memory with your hash tables, you may have too low of a memsize. Many machines have plenty of RAM nowadays, and SAS does a really great job of juggling multiple hard-hitting SAS sessions even on moderately equipped machines. Increasing this can make a huge difference, but you want to adjust this value as a last resort.
The default memsize option is 2GB. Try increasing it to 4GB, 8GB, 16GB, etc., but don't go overboard, like setting it to 0 to use as much memory as it wants. You don't want your SAS session to eat up all the memory on the machine if other users are also on it.
Temporarily setting it to 0 can be a helpful troubleshooting tool to see how much memory your hash object actually occupies if it's not running. But if it's your own machine and you're the only one using it, you can just go ham and set it to 0.
memsize can be adjusted at SAS invocation or within the SAS Configuration File directly (sasv9.cfg on 9.4, or SASV9_Option environment variable in Viya).
I have a fairly similar problem that I approached slightly differently.
First: all of what Stu says is good to keep in mind, regardless of the issue.
If you are in a situation though where you can't really reduce the character variable size (remember, numerics are always 8 bytes in RAM regardless of the length they have in the dataset, so don't try to shrink them for this reason), you can approach it this way.
1. Build a hash table with key1 as the key, and key2 as data along with your actual data. Make sure that key1 is the "better" key - the one that is more fully populated. Rename key2 to some other variable name, to make sure you don't overwrite your real key2.
2. Search on key1. If key1 is found, great! Move on.
3. If key1 is missing, use a hiter object (hash iterator) to iterate over all of the records, searching for your key2.
This is not very efficient if key2 is used a lot. Step 3 also might be better done in a different way than using a hiter - you could do a keyed set or something else for those records, for example. In my particular case, both the table and the lookup were missing key1, so it was possible to simply iterate over the much smaller subset missing key1 - if in your case that's not true, and your master table is fully populated for both keys, then this is going to be a lot slower.
The other thing I'd consider is abandoning hash tables and using a keyed set, or a format, or something else that doesn't use RAM.
Or split your dataset:
data haskey1 nokey1;
    set yourdata;
    if missing(key1) then output nokey1;
    else output haskey1;
run;
Then run two data steps, one doing the lookup with a hash keyed on key1 and the other with a hash keyed on key2, and combine the two results back together.
Which of these is the most efficient depends heavily on your dataset sizes (both master and lookup) and on the missingness of key1.

AWS Redshift Data type space allocation

I am planning to create id CHARACTER VARYING(100) ENCODE ZSTD,
id2 CHARACTER VARYING(5000) ENCODE ZSTD.
Now my id and id2 values are only about 20 characters long.
In AWS Redshift, does space allocation happen based on the actual data size (about 20 characters), or is it allocated based on the defined sizes (100 and 5000, respectively)? If the latter, how is performance affected in these scenarios?
Thanks
TOM
Two things here.
Storage: With varchars, the amount of space consumed is based on the actual amount of space required, not the length (in bytes) declared.
Query Performance: Redshift does not know in advance how many bytes will be required to hold the varchar, so it allocates memory based on the length declared for the varchar. This causes queries to consume more memory, which can in certain cases cause queries to spill to disk. It can have a particularly negative impact on vacuum performance.
Summary: Declare varchars to be as short as possible. So, in your case if it's 20 or so, maybe 25-30 would be a good length to go with.
Amazon Redshift stores data using a Compression Encoding, so it is not vital to allocate the minimum space.
It is often best to allow Redshift to choose the compression type when data is loaded via the COPY command, rather than specifying it yourself. This will result in the most efficient method being chosen, based on the first 100,000 rows loaded.

dynamodb attribute name compression

I have a DynamoDB table where the attribute names are long strings, but the whole item is only about 1 KB. Should I shorten the attribute names for network and storage performance, since each item stores the attribute names as well as the values, or will DynamoDB automatically compress them to short codes before storing?
While it is true that single-item reads are charged indirectly, via provisioned capacity units, in increments of 4 KB while writes are charged in increments of 1 KB, during a query or table scan these calculations are applied to the summed total of the data read or written.
In other words, using short attribute names does help significantly in increasing throughput capacity (for the same provisioned price) for queries: each item is smaller, so more items fit within the 4 KB read (or 1 KB write) increment at which a capacity unit is consumed, and you can read many more items per second.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/CapacityUnitCalculations.html
"Note: We recommend that you choose shorter attribute names rather than long ones. This will help you optimize capacity unit consumption and reduce the amount of storage required for your data."
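For instance (illustrative numbers only): if shorter attribute names cut each item from 1,024 bytes to 768 bytes, a Query returning 400 of those items reads 300 KB instead of 400 KB, which works out to roughly 75 strongly consistent read capacity units instead of 100, because item sizes are summed and then rounded up to the next 4 KB boundary.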
Note also that numbers are stored more compactly than strings, at roughly two digits per byte. So an ISO-format datetime (e.g. 2018-02-21T10:17:44.123Z) stored as a string takes up much more space (24 bytes) than storing it as a number with the letters and separators left out (e.g. 20180221101744.123), which takes about 10 bytes (each digit pair is one byte, plus one byte for sign and decimal places).
Attribute names are user-determined except for the primary keys of the base table and indexes, so DynamoDB is not able to optimize how attribute names are stored. Furthermore, writes are charged in 1 KB increments: it does not matter if your item size is 600 or 1,000 bytes; such an item will incur 1 WCU to write. For usability purposes, it is better to have human-readable attribute names, so if your application permits it, perhaps leave the attribute names as they are?

How do you save objects with variable length members to a binary file usefully?

I'm working with C++ and I am writing a budget program (I'm aware many are available - it's just a learning project).
I want to save what I call a book object that contains other objects such as 'pages'. Pages also contain cashflows and entries. The issue is that there can be any number of entries or cashflows.
I have found a lot of information on saving data to text files but that is not what I want to do.
I have tried looking into using the Boost library, as I've been told serialization might be the solution to this problem. I'm not entirely sure which functions in Boost are going to help, or even what the proper way is to use Boost.
Most examples of binary files that I have seen involve objects with fixed-size members. For example, a point might contain an x value and a y value that are both doubles. This will always be the case, so it is simple to just use sizeof(Point).
So, I'm either looking for direct answers to this question or useful links to information on how to solve my problem. But please make sure your links are specific to the question.
I've also posted the same question on cplusplus
In general, there are two methods to store variable-length records:
1. Store a size integer first, followed by the data.
2. Store the data, then append a sentinel character (or value) at the end.
C-style strings use the second option.
For option one, the number contains the size of the data.
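As a minimal sketch of option one for a single string (the function names here are mine, not from the answer above): the length is written first, and the reader uses it to know exactly how many bytes follow.

#include <cstdint>
#include <fstream>
#include <string>

// Option one: write the size first, then the raw bytes.
// (The length is written in host byte order; see the endianness note
// further down if the file is shared between platforms.)
void write_string(std::ofstream& out, const std::string& s) {
    std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(s.data(), len);
}

// Reading mirrors the writing: read the size, then exactly that many bytes.
std::string read_string(std::ifstream& in) {
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof(len));
    std::string s(len, '\0');
    in.read(&s[0], len);
    return s;
}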
Optional Fields
If you're considering a relational database design for optional fields, you would have one table with the known or fixed records and another table containing the optional field together with the record ID.
A simpler route may be to use something similar to XML: field labels.
Split your object into two sections: static fields and optional fields.
The static field section would be followed by an optional field section. The optional field section would contain the field name, followed by the field data. Read in the field name then the value.
I suggest you review your design to see if optional fields can be eliminated. Also, for complex fields, have them read in their own data.
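As a rough sketch of that layout (the exact field layout and the fixed double value type are my own assumptions, not part of the answer): each optional field is stored as a length-prefixed name followed by its value, and an empty name ends the section.

#include <cstdint>
#include <fstream>
#include <string>

// Read an optional-field section: a length-prefixed field name, then an
// 8-byte double value; an empty name marks the end of the section.
void read_optional_fields(std::ifstream& in) {
    for (;;) {
        std::uint32_t name_len = 0;
        in.read(reinterpret_cast<char*>(&name_len), sizeof(name_len));
        if (!in || name_len == 0)
            break;                      // end of optional fields

        std::string name(name_len, '\0');
        in.read(&name[0], name_len);

        double value = 0.0;
        in.read(reinterpret_cast<char*>(&value), sizeof(value));

        // Dispatch on the field name; unknown names can simply be ignored.
        // if (name == "interest_rate") { ... }
    }
}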
Storing Binary Data
If the data is shared between platforms, consider using ASCII or textual representation.
Read up on endianness and also on bit sizes. For example, one platform could store its binary representation least-significant byte first and use 32 bits (4 bytes). The receiving platform, 64-bit and most-significant byte first, would have problems reading the data directly and would need to convert it, thus losing much of the benefit of binary storage.
Similarly, floating point doesn't fare well in binary either. There is also the loss of precision when converting between floating point formats.
When using optional fields in binary, one would use a sentinel byte or number for the field ID rather than a textual name.
Also, data in textual format is much easier to debug than data in binary format.
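As a small illustration of the endianness point (this sketch is mine, not from the answer): write multi-byte values one byte at a time in an agreed-upon order, so the reader does not depend on the writer's native byte order.

#include <cstdint>
#include <ostream>

// Write a 32-bit value least-significant byte first, whatever the host's
// native byte order is, so reader and writer agree on the layout.
void write_u32_le(std::ostream& out, std::uint32_t v) {
    unsigned char bytes[4] = {
        static_cast<unsigned char>(v & 0xFF),
        static_cast<unsigned char>((v >> 8) & 0xFF),
        static_cast<unsigned char>((v >> 16) & 0xFF),
        static_cast<unsigned char>((v >> 24) & 0xFF)
    };
    out.write(reinterpret_cast<const char*>(bytes), sizeof(bytes));
}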
Consider using a Database
See At what point is it worth using a database?
The boost::serialization documentation is here.
boost::serialization handles user-written classes as well as STL containers: std::deque, std::list, etc.
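For instance, a sketch of what that can look like for the book/pages/entries structure described in the question (the Entry and Page types and their members are made up for illustration; Boost.Serialization handles the variable-length std::vector automatically):

#include <fstream>
#include <string>
#include <vector>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>

struct Entry {
    std::string description;
    double amount;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & description;
        ar & amount;
    }
};

struct Page {
    std::string title;
    std::vector<Entry> entries;   // any number of entries is fine

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & title;
        ar & entries;             // vector size is stored for you
    }
};

int main() {
    const Page original{"Groceries", {{"Milk", 3.49}, {"Bread", 2.99}}};

    {   // write to a binary file
        std::ofstream ofs("page.dat", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        oa << original;
    }

    Page restored;
    {   // read it back
        std::ifstream ifs("page.dat", std::ios::binary);
        boost::archive::binary_iarchive ia(ifs);
        ia >> restored;
    }
    return 0;
}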

Size/Resize of GHashTable

Here is my use case: I want to use glib's GHashTable with IP addresses as keys and the volume of data sent/received by each IP address as the value. So far I have managed to implement the whole thing in user space, using some kernel variables to look up the volume per IP address.
Now the question: suppose I have a LOT of IP addresses (i.e. 500,000 up to 1,000,000 unique keys). It is really not clear how much space is allocated, what initial size a new hash table gets when created with g_hash_table_new()/g_hash_table_new_full(), or how the whole thing works in the background. It is known that resizing a hash table can take a lot of time, so how can we play with these parameters?
Neither g_hash_table_new() nor g_hash_table_new_full() let you specify the size.
The size of a hash table is only available as the number of values stored in it; you don't have access to the actual array size that is typically used in the implementation.
However, the existence of g_spaced_primes_closest() kind of hints that glib's hash table uses a prime-sized internal array.
I would say that although a million keys is quite a lot, it's not extraordinary. Try it, and then measure the performance to determine if it's worth digging deeper.
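For reference, a minimal sketch of the use case from the question (IP-address strings as keys, byte counters as values; the names are illustrative). The table is created with no size hint because the API doesn't offer one, and it grows on its own as keys are inserted:

#include <glib.h>

int main(void) {
    /* Keys are duplicated IP-address strings, values are heap-allocated
       64-bit byte counters; both are freed automatically on destroy. */
    GHashTable *bytes_by_ip =
        g_hash_table_new_full(g_str_hash, g_str_equal, g_free, g_free);

    const char *ip = "192.0.2.1";   /* example address */
    guint64 received = 1500;        /* bytes seen for this address */

    guint64 *total = (guint64 *) g_hash_table_lookup(bytes_by_ip, ip);
    if (total == NULL) {
        total = g_new0(guint64, 1);
        g_hash_table_insert(bytes_by_ip, g_strdup(ip), total);
    }
    *total += received;

    g_print("%s -> %" G_GUINT64_FORMAT " bytes\n", ip, *total);

    g_hash_table_destroy(bytes_by_ip);
    return 0;
}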