SAS Hash Tables: Is there a way to find/join on different keys or have optional keys

I frequently work with data for which the keys are not perfect, and I need to join on data from a different source. I want to continue using hash objects for the speed advantage; however, when I am using a lot of data I can run into crashes (memory constraints).
A simplified overview: I have 2 different keys, each unique but not present for every record; we will call them Key1 and Key2.
My current solution, which is not very elegant (but it works), is to do the following:
if _N_ = 1 then do;
    /* same lookup dataset loaded twice, once per key */
    declare hash h1(Dataset:"DataSet1");
    h1.DefineKey("key1");
    h1.DefineData("Value");
    h1.DefineDone();

    declare hash h2(Dataset:"DataSet1");
    h2.DefineKey("key2");
    h2.DefineData("Value");
    h2.DefineDone();
end;
set DataSet2;

/* try key1 first, then fall back to key2 */
rc = h1.find();
if rc NE 0 then do;
    rc = h2.find();
end;
So I have exactly the same dataset in two hash tables, but with 2 different keys defined; if the first key is not found, then I try to find the second key.
Does anyone know of a way to make this more efficient/easier to read/less memory intensive?
Apologies if this seems a bad way to accomplish the task, I absolutely welcome criticism so I can learn!
Thanks in advance,
Adam.

I am a huge proponent of hash table lookups - they've helped me do some massive multi-hundred-million-row joins in minutes that otherwise could have taken hours.
The way you're doing it isn't a bad route. If you find yourself running low on memory, the first thing to identify is how much memory your hash table is actually using. This article by sasnrd shows exactly how to do this.
Once you've figured out how much it's using and have a benchmark, or if it doesn't even run at all because it runs out of memory, you can play around with some options to see how they improve your memory usage and performance.
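As a quick sanity check, the hash object's num_items and item_size attributes can be multiplied together; below is a rough sketch of that idea (actual usage is somewhat higher because of internal overhead):
data _null_;
    if 0 then set DataSet1;                 /* define key/data variables */
    declare hash h1(dataset: "DataSet1");
    h1.DefineKey("key1");
    h1.DefineData("Value");
    h1.DefineDone();
    approx_mb = h1.item_size * h1.num_items / 1024**2;
    put "Hash h1 occupies roughly " approx_mb 8.1 " MB";
    stop;
run;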
1. Include only the keys and data you need
When loading your hash table, exclude any unnecessary variables. You can do this before loading the hash table, or during. You can use dataset options to help reduce table size, such as where, keep, and drop.
dcl hash h1(dataset: 'mydata(keep=key var1)');
2. Reduce the variable lengths
Long character variables take up more memory. Decreasing the length to their minimum required value will help reduce memory usage. Use the %squeeze() macro to automatically reduce all variables to their minimum required size before loading. You can find that macro here.
%squeeze(mydata, mydata_smaller);
3. Adjust the hashexp option
hashexp helps improve performance and reduce hash collisions. Larger values of hashexp will increase memory usage but may improve performance; smaller values will reduce memory usage. I recommend reading the link above and also looking at the link at the top of this post by sasnrd to get an idea of how it will affect your join. This value should be sized appropriately depending on the size of your table. There's no hard and fast answer as to what value you should use; my recommendation is as big as your system can handle.
dcl hash h1(dataset: 'mydata', hashexp:2);
4. Allocate more memory to your SAS session
If you often run out of memory with your hash tables, you may have too low of a memsize. Many machines have plenty of RAM nowadays, and SAS does a really great job of juggling multiple hard-hitting SAS sessions even on moderately equipped machines. Increasing this can make a huge difference, but you want to adjust this value as a last resort.
The default memsize option is 2GB. Try increasing it to 4GB, 8GB, 16GB, etc., but don't go overboard, like setting it to 0 to use as much memory as it wants. You don't want your SAS session to eat up all the memory on the machine if other users are also on it.
Temporarily setting it to 0 can be a helpful troubleshooting tool to see how much memory your hash object actually occupies when the step otherwise won't run. But if it's your own machine and you're the only one using it, you can just go ham and set it to 0.
memsize can be adjusted at SAS invocation or within the SAS Configuration File directly (sasv9.cfg on 9.4, or SASV9_Option environment variable in Viya).
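For example (the exact form depends on how your site launches SAS), at invocation:
sas -memsize 8G
or as a line in sasv9.cfg:
-MEMSIZE 8G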

I have a fairly similar problem that I approached slightly differently.
First: all of what Stu says is good to keep in mind, regardless of the issue.
If you are in a situation, though, where you can't really reduce the character variable size (remember, all numerics are 8 bytes in RAM no matter what their length is in the dataset, so don't try to shrink them for this reason), you can approach it this way.
1. Build a hash table with key1 as the key, and key2 as data along with your actual data. Make sure that key1 is the "better" key - the one that is more fully populated. Rename key2 to some other variable name, to make sure you don't overwrite your real key2.
2. Search on key1. If key1 is found, great! Move on.
3. If key1 is missing, then use a hiter object (hash iterator) to iterate over all of the records searching for your key2.
This is not very efficient if key2 is used a lot. Step 3 also might be better done in a different way than using a hiter - you could do a keyed set or something else for those records, for example. In my particular case, both the table and the lookup were missing key1, so it was possible to simply iterate over the much smaller subset missing key1 - if in your case that's not true, and your master table is fully populated for both keys, then this is going to be a lot slower.
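Here is a rough sketch of that approach; key2_h is a hypothetical rename of key2 so that neither the load nor the iteration overwrites the real key2 coming from DataSet2, and "want" is a placeholder output name:
data want;
    if _N_ = 1 then do;
        if 0 then set DataSet1(rename=(key2=key2_h));   /* define host variables */
        declare hash h1(dataset: "DataSet1(rename=(key2=key2_h))");
        h1.DefineKey("key1");
        h1.DefineData("key2_h", "Value");
        h1.DefineDone();
        declare hiter hi("h1");
    end;
    set DataSet2;

    rc = h1.find();
    if rc ne 0 then do;                  /* fall back to scanning on key2 */
        rc = hi.first();
        do while (rc = 0);
            if key2_h = key2 then leave; /* Value is now in the PDV       */
            rc = hi.next();
        end;
        /* if the loop ends with rc ne 0, neither key matched */
    end;
run;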
The other thing I'd consider is abandoning hash tables and using a keyed set, or a format, or something else that doesn't use RAM.
Or split your dataset:
data haskey1 nokey1;
    set yourdata;
    if missing(key1) then output nokey1;
    else output haskey1;
run;
Then two data steps, one with a hash with key1 and one with a hash with key2, then combine the two back together.
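A sketch of that split (the output dataset names are placeholders); each step holds only one copy of the lookup data in memory:
data haskey1_lkp;
    if _N_ = 1 then do;
        if 0 then set DataSet1;
        declare hash h1(dataset: "DataSet1");
        h1.DefineKey("key1");
        h1.DefineData("Value");
        h1.DefineDone();
    end;
    set haskey1;
    rc = h1.find();
run;

data nokey1_lkp;
    if _N_ = 1 then do;
        if 0 then set DataSet1;
        declare hash h2(dataset: "DataSet1");
        h2.DefineKey("key2");
        h2.DefineData("Value");
        h2.DefineDone();
    end;
    set nokey1;
    rc = h2.find();
run;

data combined;
    set haskey1_lkp nokey1_lkp;
run;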
Which of these is the most efficient depends heavily on your dataset sizes (both master and lookup) and on the missingness of key1.

Related

Performance implications of using a flatter schema

I'm using FlatBuffers (C++) to store metadata information about a file. This includes EXIF, IPTC, GPS and various other metadata values.
In my current schema, I have a fairly normalized definition whereby each of the groups listed above has its own table. The root table just includes properties for each sub-table.
Basic Example:
table GPSProperties {
    latitude:double;
    longitude:double;
}

table ContactProperties {
    name:string;
    email:string;
}

table EXIFProperties {
    camera:string;
    lens:string;
    gps:GPSProperties;
}

table IPTCProperties {
    city:string;
    country:string;
    contact:ContactProperties;
}

table Registry {
    exifProperties:EXIFProperties;
    iptcProperties:IPTCProperties;
}

root_type Registry;
This works, but the nesting restrictions when building a buffer are starting to make the code pretty messy. Also, breaking up the properties into separate tables is only for clarity in the schema.
I'm considering just "flattening" the entire schema into a single table but I was wondering if there are any performance or memory implications of doing that. This single table could have a few hundred fields, though most would be empty.
Proposal:
table Registry {
    exif_camera:string;
    exif_lens:string;
    exif_gps_latitude:double;
    exif_gps_longitude:double;
    iptc_city:string;
    iptc_country:string;
    iptc_contact_name:string;
    iptc_contact_email:string;
}

root_type Registry;
Since properties that are either not set or set to their default value don't take up any memory, I'm inclined to believe that a flattened schema might not be a problem. But I'm not certain.
(Note that performance is my primary concern, followed closely by memory usage. The normalized schema is performing excellently, but I think a flattened schema would really help me clean up my codebase.)
Basics you should first be clear about:
Every table has a vtable at the top of it which tells the offset at which each field of the table can be found. If there are too many fields in a table, this vtable will grow large, whether or not you store the data.
If you create a hierarchy of tables, you are creating extra vtables and also adding indirection cost to the design.
Also, vtables are shared if similar data is being stored in multiple objects - for example, if you are creating many objects with only the exif_camera field set.
So it depends: if your data is going to be huge and heterogeneous, use the more organized hierarchy; if your data is going to be homogeneous, prefer a flattened table.
Since most of your data is strings, the size and speed of both of these designs will be very similar, so you should probably choose based on what works better for you from a software engineering perspective.
That said, the flat version will likely be slightly more efficient in size (fewer vtables) and certainly will be faster to access (though again, that is marginal given that it is mostly string data).
The only way in which the flat version could be less efficient is if you were to store a lot of them in one buffer, where which fields are set varies wildly between each table. Then the non-flat version may generate more vtable sharing.
In the non-flat version, tables like GPSProperties could be a struct if the fields are unlikely to ever change, which would be more efficient.
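To illustrate the earlier point that absent fields cost nothing, here is a sketch of building the flattened Registry with the FlatBuffers C++ API; the header name and accessor names are whatever flatc generates from the flat schema above, so treat them as assumptions:
#include "registry_generated.h"        // hypothetical name of the flatc output
#include "flatbuffers/flatbuffers.h"

int main() {
    flatbuffers::FlatBufferBuilder fbb;

    // Strings must be created before the table builder starts (the nesting
    // restriction mentioned in the question).
    auto camera = fbb.CreateString("X100V");

    RegistryBuilder rb(fbb);
    rb.add_exif_camera(camera);          // only two fields are set; the
    rb.add_exif_gps_latitude(48.8566);   // hundreds of absent fields take no space
    fbb.Finish(rb.Finish());

    // fbb.GetBufferPointer() / fbb.GetSize() now hold the serialized Registry.
    return 0;
}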
This single table could have a few hundred fields, though most would be empty.
The performance cost is likely to be so small you won't notice, but your above quote, to me, is the swaying factor about which design to use.
While others are talking about the cost of vtables, I wouldn't worry about that at all. There's a single vtable per class, prepared once per run, and it will not be expensive.
Having hundreds of strings that are empty and unused, however, is going to be very expensive (memory-usage-wise) and a drain on every object you create; in addition, reading your fields becomes much more complex, since you can no longer assume that all the data for the class is there as you read it.
If most / all the fields were always there, then I can see the attraction of making a single class; but they're not.

How to handle allocation/deallocation for small objects of variable size in C++

I am currently writing C++ code to store and retrieve tabular data (e.g. a spreadsheet) in memory. The data is loaded from a database. The user can work with the data and there is also a GUI class which should render the tabular data. The GUI renders only a few rows at once, but the tabular data could contain 100,000s of rows at once.
My classes look like this:
Table: provides access to rows (by index) and column-definitions (by name of column)
Column: contains the column definition like name of the column and data-type
Row: contains multiple fields (as many as there are columns) and provides access to those fields (by column name)
Field: contains some "raw" data of variable length and methods to get/set this data
With this design a table with 40 columns and 200k rows contains over 8 million objects. After some experiments I saw that allocating and deallocating 8 million objects is a very time-consuming task. Some research showed that other people are using custom allocators (like Boost's pool_allocator) to solve that problem. The problem is that I can't use them in my problem domain, since their performance boost comes from relying on the fact that all allocated objects have the same size. This is not the case in my code, since my objects differ in size.
Are there any other techniques I could use for memory management? Or do you have suggestions about the design?
Any help would be greatly appreciated!
Cheers,
gdiquest
Edit: In the meantime I found out what my problem was. I started my program under Visual Studio, which means that the debugger was attached to the debug build and also to the release build. With an attached debugger my executable uses a so-called debug heap, which is very slow (further details here). When I start my program without a debugger attached, everything is as fast as I would have expected.
Thank you all for participating in this question!
Why not just allocate 40 large blocks of memory, one for each column? Most of the columns will have fixed-length data, which makes those easy and fast, e.g. vector<int> col1(200000). For the variable-length ones just use vector<string> col5(200000). The Small String Optimization will ensure that your short strings require no extra allocation; only rows with longer strings (generally > 15 characters) will require allocations.
If your variable-length columns are not storing strings, then you could also use vector<vector<unsigned char>>. This also allows a nice pre-allocation strategy. E.g. assuming your biggest variable-length field in this column is 100 bytes, you could do:
vector<vector<unsigned char>> col2(200000);
for (auto& cell : col2)
{
    cell.resize(100);
}
Now you have a preallocated column that supports 200000 rows with a max data length of 100 bytes. I would definitely go with the std::string version though if you can as it is conceptually simpler.
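A minimal sketch of this column-per-block layout (the struct and field names are just illustrative):
#include <cstddef>
#include <string>
#include <vector>

// One contiguous container per column instead of one heap-allocated
// Field object per cell.
struct Table {
    std::vector<int>         id;     // fixed-width column: one allocation total
    std::vector<double>      price;  // fixed-width column
    std::vector<std::string> name;   // variable-width column; SSO keeps short
                                     // strings allocation-free
};

int main() {
    const std::size_t rows = 200000;
    Table t;
    t.id.resize(rows);
    t.price.resize(rows);
    t.name.resize(rows);
    t.name[0] = "short";             // fits in the SSO buffer, no heap allocation
    return 0;
}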
Try rapidjson allocators; they are not limited to objects of the same size, AFAIK.
You might attach an allocator to a table and allocate all table objects with it.
For more granularity, you might have row or column pools.
Apache does this, attaching all data to request and connection pools.
If you want them to be STL-compatible then perhaps this answer will help to integrate them, although I'm not sure. (I plan to try something like this myself, but haven't gotten to it yet).
Also, some allocators might be faster than what your system offers by default. TCMalloc, for example. (See also). So, you might want to profile and see whether using a different system allocator helps.

Best way to store, load and use an inverted index in C++ (~500 MB)

I'm developing a tiny search engine using TF-IDF and cosine similarity. When pages are added, I build an inverted index to keep word frequencies across the different pages. I remove stopwords and the most common words, and plurals/verbs/etc. are stemmed.
My inverted index looks like:
map< string, map<int, float> > index
[
    word_a => [ id_doc => frequency, id_doc2 => frequency2, ... ],
    word_b => [ id_doc => frequency, id_doc2 => frequency2, ... ],
    ...
]
With this data structure, I can get the document frequency needed for the idf weight with index["word_a"].size(). Given a query, the program loops over the keywords and scores the documents.
I don't know well data structures and my questions are:
How should I store a 500 MB inverted index so that I can load it at search time? Currently, I use Boost to serialize the index:
ofstream ofs_index("index.sr", ios::binary);
boost::archive::binary_oarchive oa(ofs_index);
oa << index;
And then I load it at search time:
ifstream ifs_index("index.sr", ios::binary);
boost::archive::binary_iarchive ia(ifs_index);
ia >> index;
But it is very slow; it sometimes takes 10 seconds to load.
I don't know if std::map is efficient enough for an inverted index.
In order to cluster documents, I get all keywords from each document and loop over these keywords to score similar documents, but I would like to avoid reading each document again and instead use only this inverted index. I'm afraid this data structure would be costly for that, though.
Thank you in advance for any help!
The answer will pretty much depend on whether you need to support data comparable to or larger than your machine's RAM and whether in your typical use case you are likely to access all of the indexed data or rather only a small fraction of it.
If you are certain that your data will fit into your machine's memory, you can try to optimize the map-based structure you are using now. Storing your data in a map should give pretty fast access, but there will always be some initial overhead when you load the data from disk into memory. Also, if you only use a small fraction of the index, this approach may be quite wasteful as you always read and write the whole index, and keep all of it in memory.
Below I list some suggestions you could try out, but before you commit too much time to any of them, make sure that you actually measure what improves the load and run times and what does not. Without profiling the actual working code on actual data you use, these are just guesses which may be completely wrong.
std::map is implemented as a tree (usually a red-black tree). In many cases, a hash map (e.g. std::unordered_map) may give you better performance as well as better memory usage (fewer allocations and less fragmentation, for example).
Try reducing the size of the data - less data means it will be faster to read it from disk, potentially less memory allocation, and sometimes better in-memory performance due to better locality. You may for example consider that you use float to store the frequency, but perhaps you could store only the number of occurrences as an unsigned short in the map values and in a separate map store the number of all words for each document (also as a short). Using the two numbers, you can re-calculate the frequency, but use less disk space when you save the data to disk, which could result in faster load times.
Your map has quite a few entries, and sometimes using custom memory allocators helps improve performance in such a case.
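A minimal sketch of the first two suggestions combined (a hash map instead of std::map, and a small occurrence count instead of a float frequency, with the frequency recomputed on demand):
#include <cstdint>
#include <string>
#include <unordered_map>

using Postings      = std::unordered_map<int, std::uint16_t>;  // doc id -> count
using InvertedIndex = std::unordered_map<std::string, Postings>;

int main() {
    InvertedIndex index;
    std::unordered_map<int, std::uint32_t> doc_length;  // doc id -> total words

    index["word_a"][42] += 1;   // one more occurrence of "word_a" in doc 42
    doc_length[42]      += 1;

    // Term frequency of "word_a" in doc 42, recomputed on demand:
    float tf = float(index["word_a"][42]) / float(doc_length[42]);
    (void)tf;
    return 0;
}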
If your data could potentially grow beyond the size of your machine's RAM, I would suggest you use memory-mapped files for storing the data. Such an approach may require re-modelling your data structures and either using custom STL allocators or using completely custom data structures instead of std::map, but it may improve your performance by an order of magnitude if done well. In particular, this approach frees you from having to load the whole structure into memory at once, so your startup times will improve dramatically, at the cost of slight delays from disk accesses distributed over time as you touch different parts of the structure for the first time. The subject is quite broad and requires much deeper changes to your code than just tuning the map, but if you plan on handling huge data, you should certainly have a look at mmap and friends.

remove all duplicate records efficiently

I have a file which might be 30 GB or more. Each line in this file is called a record and is composed of 2 columns, like this:
id1 id2
Both of these ids are 32-bit integers. My job is to write a program to remove all duplicate records, make the records unique, and finally output the unique id2 values into a file.
There are some constraints: at most 30 GB of memory is allowed, and the job should be done efficiently by a single-threaded, single-process program.
Initially I came up with an idea: because of the memory constraints, I decided to read the file n times, each time keeping in memory only those records with id1 % n = i (i = 0, 1, 2, ..., n-1). The data structure I use is a std::map<int, std::set<int> >; it takes id1 as the key and puts id2 into id1's std::set.
This way, the memory constraints are not violated, but it's quite slow. I think that's because as the std::map and std::set grow larger, the insertion speed goes down. Moreover, I need to read the file n times, and when each round is done, I have to clear the std::map for the next round, which also costs some time.
I also tried hashing, but it doesn't satisfy me either; I thought there might be too many collisions even with 3 million buckets.
So I'm posting my problem here, hoping you guys can offer me a better data structure or algorithm.
Thanks a lot.
PS
Scripts (shell, Python) are also welcome, if they can do the job efficiently.
Unless I overlooked a requirement, it should be possible to do this on the Linux shell as
sort -u inputfile > outputfile
Many implementations enable you to use sort in a parallelised manner as well:
sort --parallel=4 -u inputfile > outputfile
for up to four parallel executions.
Note that sort might use a lot of space in /tmp temporarily. If you run out of disk space there, you may use the -T option to point it to an alternative place on disk to use as temporary directory.
(Edit:) A few remarks about efficiency:
A significant portion of the time spent during execution (of any solution to your problem) will be spent on IO, something that sort is highly optimised for.
Unless you have an extreme amount of RAM, your solution is likely to end up performing some of the work on disk (just like sort). Again, optimising this means a lot of work, while for sort all of that work has been done.
One disadvantage of sort is that it operates on string representations of the input lines. If you were to write your own code, one thing you could do (similar to what you suggested already) is to convert the input lines to 64-bit integers and hash them. If you have enough RAM, that may be a way to beat sort in terms of speed, if you get IO and integer conversions to be really fast. I suspect it may not be worth the effort, as sort is easy to use and - I think - fast enough.
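For what it's worth, a sketch of that "pack each record into a 64-bit integer and hash it" idea, assuming the set of unique records fits within the 30 GB budget; it reads "id1 id2" text lines from stdin and prints id2 for each record seen for the first time:
#include <cstdint>
#include <cstdio>
#include <unordered_set>

int main() {
    std::unordered_set<std::uint64_t> seen;   // one 64-bit key per unique record
    unsigned int id1, id2;
    while (std::scanf("%u %u", &id1, &id2) == 2) {
        std::uint64_t key = (static_cast<std::uint64_t>(id1) << 32) | id2;
        if (seen.insert(key).second) {        // true only on first occurrence
            std::printf("%u\n", id2);
        }
    }
    return 0;
}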
I just don't think you can do this efficiently without using a bunch of disk. Any form of data structure will introduce so much memory and/or storage overhead that your algorithm will suffer. So I would expect a sorting solution to be best here.
I reckon you can sort large chunks of the file at a time, and then merge (ie from merge-sort) those chunks after. After sorting a chunk, obviously it has to go back to disk. You could just replace the data in the input file (assuming it's binary), or write to a temporary file.
As far as the records, you just have a bunch of 64-bit values. With 30GB RAM, you can hold almost 4 billion records at a time. That's pretty sweet. You could sort that many in-place with quicksort, or half that many with mergesort. You probably won't get a contiguous block of memory that size. So you're going to have to break it up. That will make quicksort a little trickier, so you might want to use mergesort in RAM as well.
During the final merge it's trivial to discard duplicates. The merge might be entirely file-based, but at worst you'll use an amount of disk equivalent to twice the number of records in the input file (one file for scratch and one file for output). If you can use the input file as scratch, then you have not exceeded your RAM limits OR your disk limits (if any).
I think the key here is the requirement that it shouldn't be multithreaded. That lends itself well to disk-based storage. The bulk of your time is going to be spent on disk access. So you wanna make sure you do that as efficiently as possible. In particular, when you're merge-sorting you want to minimize the amount of seeking. You have large amounts of memory as buffer, so I'm sure you can make that very efficient.
So let's say your file is 60GB (and I assume it's binary) so there's around 8 billion records. If you're merge-sorting in RAM, you can process 15GB at a time. That amounts to reading and (over)writing the file once. Now there are four chunks. If you want to do pure merge-sort then you always deal with just two arrays. That means you read and write the file two more times: once to merge each 15GB chunk into 30GB, and one final merge on those (including discarding of duplicates).
I don't think that's too bad. Three times in and out. If you figure out a nice way to quicksort then you can probably do this with one fewer pass through the file. I imagine a data structure like deque would work well, as it can handle non-contiguous chunks of memory... But you'd probably wanna build your own and finely tune your sorting algorithm to exploit it.
Instead of std::map<int, std::set<int> >, use std::unordered_multimap<int, int>. If you cannot use C++11, write your own.
std::map is node based and calls malloc on each insertion; this is probably why it is slow. With an unordered map (hash table), if you know the number of records, you can pre-allocate. Even if you don't, the number of mallocs will be O(log N) instead of the O(N) of std::map.
I can bet this will be several times faster and more memory efficient than using external sort -u.
This approach may help when there are not too many duplicate records in the file.
1st pass. Allocate most of the memory for a Bloom filter. Hash every pair from the input file and put the result into the Bloom filter. Write each duplicate found by the Bloom filter into a temporary file (this file will also contain some false positives, which are not duplicates).
2nd pass. Load temporary file and construct a map from its records. Key is std::pair<int,int>, value is a boolean flag. This map may be implemented either as std::unordered_map/boost::unordered_map, or as std::map.
3rd pass. Read input file again, search each record in the map, output its id2 if either not found or flag is not yet set, then set this flag.

Size/Resize of GHashTable

Here is my use case: I want to use glib's GHashTable with IP addresses as keys and the volume of data sent/received by each IP address as the value. For instance, I have already managed to implement the whole thing in user space, using some kernel variables to look up the volume per IP address.
Now the question: suppose I have a LOT of IP addresses (i.e. 500,000 up to 1,000,000 unique ones). It is really not clear what space is allocated, what initial size is given to a new hash table created with g_hash_table_new()/g_hash_table_new_full(), and how the whole thing works in the background. It is known that resizing a hash table can take a lot of time. So how can we play with these parameters?
Neither g_hash_table_new() nor g_hash_table_new_full() lets you specify the size.
The size of a hash table is only available as the number of values stored in it; you don't have access to the actual array size that is typically used in the implementation.
However, the existence of g_spaced_primes_closest() kind of hints that glib's hash table uses a prime-sized internal array.
I would say that although a million keys is quite a lot, it's not extraordinary. Try it, and then measure the performance to determine if it's worth digging deeper.