How can we store a hash table in Apache Arrow? - apache-arrow

I am pretty new to Apache Arrow, so this question may be ignorant. Apache Arrow provides the capability to store data structures like primitive types/structs/arrays in a standardised memory format. I wonder if it is possible to store more complex data structures, like a hash table (or a balanced search tree), with Apache Arrow?
Many algorithms rely on these data structures to work; do Apache Arrow users need to convert Arrow data into language-specific data structures in this case?

You can certainly define a static/immutable hash table backed by the Arrow columnar format (e.g. if you want to be able to memory map an on-disk hash table). You have to decide what the "schema" of the hash table is; for example, it could be:
is_filled: boolean
key: KeyType
value: ValueType
This presumes that the hash and comparison functions are known and constant to the application based on the key type.
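A minimal sketch of that first layout in the Arrow C++ API, assuming int64 keys and utf8 values purely for illustration (KeyType and ValueType are whatever your application needs; this is not an Arrow-provided hash table API):

#include <arrow/api.h>
#include <memory>

// Hypothetical open-addressing slot layout expressed as an Arrow schema.
std::shared_ptr<arrow::Schema> HashTableSchema() {
  return arrow::schema({
      arrow::field("is_filled", arrow::boolean()),  // slot occupancy flag
      arrow::field("key", arrow::int64()),          // stand-in for KeyType
      arrow::field("value", arrow::utf8()),         // stand-in for ValueType
  });
}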
If you want the keys and values to be next to each other in memory then you could encode them in a binary type
is_filled: boolean
keyvalue: binary
The actual implementation of the hash table is up to you. You're welcome to contribute such code to the Apache Arrow codebase itself.
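For example, a lookup against such a table could be a simple linear probe over a record batch. This is only a hedged sketch assuming the illustrative schema above (int64 keys, utf8 values) and that slots were assigned as hash(key) % num_slots when the table was written:

#include <arrow/api.h>
#include <functional>
#include <optional>
#include <string>

// Probe the open-addressing table for `key`; returns the value if present.
// Assumes column 0 = is_filled, 1 = key, 2 = value, as in the sketch above.
std::optional<std::string> Lookup(const arrow::RecordBatch& batch, int64_t key) {
  auto filled = std::static_pointer_cast<arrow::BooleanArray>(batch.column(0));
  auto keys   = std::static_pointer_cast<arrow::Int64Array>(batch.column(1));
  auto values = std::static_pointer_cast<arrow::StringArray>(batch.column(2));
  const int64_t n = batch.num_rows();
  const int64_t start =
      static_cast<int64_t>(std::hash<int64_t>{}(key) % static_cast<uint64_t>(n));
  for (int64_t i = 0; i < n; ++i) {
    const int64_t slot = (start + i) % n;
    if (!filled->Value(slot)) return std::nullopt;   // empty slot: key absent
    if (keys->Value(slot) == key) return values->GetString(slot);
  }
  return std::nullopt;                               // table full, key absent
}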

Related

binary vs. string vs. number for storing UUID in DynamoDB partition key?

I'm trying to decide whether to use binary, number, or string for my DynamoDB table's partition key. My application is a React.js/Node.js social event-management application where as much as half of the data volume stored in DynamoDB will be used to store relationships from Items and Attributes to other Items and Attributes. For example: friends of a user, attendees at an event, etc.
Because the schema is so key-heavy, and because the maximum DynamoDB Item size is only 400KB, and for perf & cost reasons, I'm concerned about keys taking up too much space. That said, I want to use UUIDs for partition keys. There are well-known reasons to prefer UUIDs (or something with similar levels of entropy and minimal chance of collisions) for distributed, serverless apps where multiple nodes are giving out new keys.
So, I think my choices are:
Use a hex-encoded UUID (32 bytes stored after dashes are removed)
Encode the UUID using base64 (22 bytes)
Encode the UUID using z85 (20 bytes)
Use a binary-typed attribute for the key (16 bytes)
Use a number-typed attribute for the key (16-18 bytes?) - the Number type can only accommodate 127 bits, so I'd have to perform some tricks like stripping a version bit, but for my app that's probably OK. See How many bits of integer data can be stored in a DynamoDB attribute of type Number? for more info.
Obviously there's a tradeoff in developer experience. Using a hex string is the clearest but also the largest. Encoded strings are smaller but harder to deal with in logs, while debugging, etc. Binary and Number are harder than strings, but are the smallest.
I'm sure I'm not the first person to think about these tradeoffs. Is there a well-known best practice or heuristic to determine how UUID keys should be stored in DynamoDB?
If not, then I'm leaning towards using the Binary type, because it's the smallest storage and because its native representation (as a base64-encoded string) can be used everywhere humans need to view and reason about keys, including queries, logging, and client code. Other than having to transform it to/from a Buffer if I use DocumentClient, am I missing some problem with the Binary type or advantage of one of the other options in the list above?
If it matters, I'm planning for all access to DynamoDB to happen via a Lambda API, so even if there's conversion or marshalling required, that's OK because I can do it inside my API.
BTW, this question is a sequel to a 4-year-old question (UUID data type in DynamoDB) but 4 years is a looooooong time in a fast-evolving space, so I figured it was worth asking again.
I had a similar issue and concluded that the size of the key did not matter too much, as all my options were going to be small and lightweight, with only minor tradeoffs. I decided that a programmer-friendly way (i.e. friendly to me) would be to use the 'sub', the identifier Cognito creates for each unique user. That way any collision issues, should they arise, would also be taken care of by Cognito. I could then encode it or not. However a user logs in, they end up with the 'sub'; I then match that against the hash key of the DynamoDB records, which immediately grants them fine-grained access to only their data. Three years later, I have found that to be a very reliable method.

DynamoDB storage volume of attribute keys. Do shorter attribute keys and attribute types use less storage?

I was just wondering whether it is useful to have shorter attribute keys in DynamoDB. I know this has drawbacks since they are not human readable, but thinking of millions of rows this could mean significant storage volume, at least in my mind.
So, does a shorter attribute key use less storage?
P.S. side question: how about the types. Can I always use strings or are there advantages of using e.g. numbers or booleans?
Yes, shorter attribute keys use less storage. From the documentation:
An item size is the sum of lengths of its attribute names and values
(binary and UTF-8 lengths).
For example, an attribute named createdTimestamp adds 16 bytes to every item that carries it, while a shortened name like ct adds only 2.
In terms of types, it is about how you want to model your data and what operations you want to perform on them. If you want to increment an attribute, you would want to use the Number data type. If you want to store a blob of compressed image data, you might use the Binary data type. If you want a 'deleted' flag on a row, you would use the Boolean data type. It also depends on whether you have indexing requirements for any of these attributes.

C++ OpenSSL: md5-based 64-bit hash

I know the original md5 algorithm produces a 128-bit hash.
Following Mark Adler's comments here, I'm interested in getting a good 64-bit hash.
Is there a way to create an md5-based 64-bit hash using OpenSSL? (md5 looks good enough for my needs.)
If not, is there another algorithm implemented in the OpenSSL library that can get this job done with quality not less than md5's (except for the length, of course)?
I claim that 'hash quality' is strongly related to the hash length.
AFAIK, OpenSSL does not have 64-bit hash algos, so the first idea I had is simple and most probably worthless:
halfMD5 = md5.hiQuadWord ^ md5.lowQuadWord
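A sketch of that fold using OpenSSL's one-shot MD5() function (deprecated since OpenSSL 3.0 but still available); it just XORs the high and low quad words of the 128-bit digest:

#include <openssl/md5.h>
#include <cstdint>
#include <cstring>
#include <string>

// Fold the 128-bit MD5 digest of `data` into 64 bits: hiQuadWord ^ lowQuadWord.
uint64_t half_md5(const std::string& data) {
  unsigned char digest[MD5_DIGEST_LENGTH];  // 16 bytes
  MD5(reinterpret_cast<const unsigned char*>(data.data()), data.size(), digest);
  uint64_t hi, lo;
  std::memcpy(&hi, digest, sizeof hi);      // first 8 bytes
  std::memcpy(&lo, digest + 8, sizeof lo);  // last 8 bytes
  return hi ^ lo;
}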
Finally, I'd simply use an algorithm with appropriate output, like crc64.
Some crc64 sources to verify:
http://www.backplane.com/matt/crc64.html
http://bioinfadmin.cs.ucl.ac.uk/downloads/crc64/
http://en.wikipedia.org/wiki/Computation_of_CRC
http://www.pathcom.com/~vadco/crc.html
Edit
At first glance, Jenkins looks perfect, however I'm trying to find a friendly C++ implementation for it without luck so far. BTW, I'm wondering, since this is such a good hash for databases' duplication checking, how come none of the common open-source libraries, like OpenSSL, provide an API for it? – Subway
This might simply be due to the fact that OpenSSL is a crypto library in the first place, using large hash values with appropriate crypto characteristics.
Hash algos for data structures have some other primary goals, e.g. good distribution characteristics for hash tables, where small hash values are used as an index into a list of buckets containing zero, one or multiple (colliding) element(s).
So the point is, whether, how and where collisions are handled.
In a typical DBMS, an index on a column will handle them itself.
Corresponding containers (maps or sets):
C++: std::size_t (32 or 64 bits) for std::unordered_multimap and std::unordered_multiset
In Java, one would make a mapping with lists as buckets: HashMap<K,List<V>>
The unique constraint would additionally prohibit insertion of equal field contents:
C++: std::size_t (32 or 64 bits) for std::unordered_map and std::unordered_set
Java: int (32 bits) for HashMap and HashSet
For example, we have a table with file contents (plaintext, non-crypto application) and a checksum or hash value for mapping or consistency checks. We want to insert a new file. For that, we precompute the hash value or checksum and query for existing files with equal hash values or checksums respectively. If none exists, there won't be a collision, insertion would be safe. If there are one or more existing records, there is a high probability for an exact match and a lower probability for a 'real' collision.
In case collisions are to be ignored, one could add a unique constraint to the hash column and reuse existing records, accepting the possibility of mismatching/colliding contents. Here, you'd want to have a database-friendly hash algo like 'Jenkins'.
In case collisions need to be handled, one could add a unique constraint to the plaintext column instead. Less database-friendly checksum algos like crc won't have an influence on collisions among records and can be chosen according to certain types of corruption to be detected or other requirements. It is even possible to use the XOR'ed quad words of an md5 as mentioned at the beginning.
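As an in-memory illustration of that workflow (not database code): bucket records by their 64-bit hash and fall back to comparing the full contents, so real matches are distinguished from mere hash collisions. The hash `h` would be something like the half_md5 fold above, or a Jenkins/crc64 value.

#include <cstdint>
#include <string>
#include <unordered_map>

// hash value -> full plaintext contents; equal hashes land in the same bucket.
std::unordered_multimap<uint64_t, std::string> table;

// Returns true only on an exact content match, not on a hash collision.
bool contains(uint64_t h, const std::string& contents) {
  auto range = table.equal_range(h);
  for (auto it = range.first; it != range.second; ++it)
    if (it->second == contents) return true;
  return false;
}

void insert_if_absent(uint64_t h, const std::string& contents) {
  if (!contains(h, contents)) table.emplace(h, contents);
}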
Some other thoughts:
If an index/constraint on plaintext columns does the mapping, any hash value can be used to do reasonably fast lookups to find potential matches.
No one will stop you from adding both a mapping-friendly hash and a checksum.
Unique constraints will also add an index, which is basically like the hash tables mentioned above.
In short, it greatly depends on what exactly you want to achieve with a 64-bit hash algo.

set map implementation in C++

I find that both set and map are implemented as a tree: set as a binary search tree, and map as a self-balancing binary search tree, such as a red-black tree? I am confused about the difference in the implementations. The differences I can imagine are as follows:
1) An element in a set has only one value (the key); an element in a map has two values.
2) A set is used to store and fetch elements by themselves; a map is used to store and fetch elements via a key.
What else are important?
Maps and sets have almost identical behavior and it's common for the implementation to use the exact same underlying technique.
The only important difference is that a map doesn't use the whole value_type for comparisons, just the key part of it.
Usually you'll know right away which you need: if you just have a bool for the "value" argument to the map, you probably want a set instead.
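A small sketch of that point: a map whose mapped value carries no information is usually better expressed as a set, since ordering and lookup use only the key either way.

#include <map>
#include <set>
#include <string>

int main() {
  std::map<std::string, bool> seen_map;   // the bool "value" adds nothing
  seen_map["alice"] = true;

  std::set<std::string> seen_set;         // same lookups, no dummy value
  seen_set.insert("alice");

  // Both containers order and find elements by the key alone.
  bool in_map = seen_map.count("alice") > 0;
  bool in_set = seen_set.count("alice") > 0;
  return (in_map && in_set) ? 0 : 1;
}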
Set is a discrete mathematics concept that, in my experience, pops up again and again in programming. The STL set class is a relatively efficient way to keep track of sets where the most common operations are insert/remove/find.
Maps are used where objects have a unique identity that is small compared to their entire set of attributes. For example, a web page can be defined as a URL and a byte stream of contents. You could put that byte stream in a set, but the binary search process would be extremely slow (since the contents are much bigger than the URL) and you wouldn't be able to look up a web page if its contents change. The URL is the identity of the web page, so it is the key of the map.
A map is usually implemented as a set< std::pair<> >.
The set is used when you want an ordered list to quickly search for an item, basically, while a map is used when you want to retrieve a value given its key.
In both cases, the key (for map) or value (for set) must be unique. If you want to store multiple values that are the same, you would use multimap or multiset.

what type of data structure would be efficient for searching a process table

I have to search a process table which is populated by the names of processes running on a given set of IP addresses.
Currently I am using multimaps in C++ with the process name as the key and the IP address as the value.
Is there any other, more efficient data structure which can do the same task?
Also, can I gain any sort of parallelism by using pthreads? If so, can anyone point me in the right direction?
You do not need parallelism to access a data structure of several thousand entries in RAM. You can just lock around it (making sure only one process/thread accesses it at a time), and the access will be fast enough. Multimap is okay. A hashmap would be better though.
What is the typical query to your table?
Try using a hash map; it can be faster for big tables.
How do you store the names and IPs? UTF, string, char*? The IP as a uint32 or a string?
For a read-only structure with a lot of read queries you can benefit from several threads.
upd: use std::unordered_multimap from #include <unordered_map> (or std::tr1::unordered_multimap from #include <tr1/unordered_map> before C++11)
Depending on the size of the table, you may find a hash table more efficient than the multimap container (which is implemented with a balanced binary tree).
The non-standard hash_multimap extension implements a hash table container in the style of the STL, and could be of use to you; std::unordered_multimap is its standardized equivalent.
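A minimal sketch of that suggestion using the standardized C++11 names (rather than the TR1/SGI spellings): process name as key, IP address as value, and equal_range for lookups. The entries are hypothetical and only for illustration.

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
  // process name -> IP address it was seen on (a name may repeat across hosts)
  std::unordered_multimap<std::string, std::string> procs;
  procs.emplace("sshd",  "10.0.0.1");
  procs.emplace("sshd",  "10.0.0.2");
  procs.emplace("nginx", "10.0.0.3");

  // All hosts running "sshd".
  auto range = procs.equal_range("sshd");
  for (auto it = range.first; it != range.second; ++it)
    std::cout << it->first << " @ " << it->second << '\n';
  return 0;
}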