Mapping into multiple maps in parallel with Java 8 Streams - concurrency

I'm iterating over a CloseableIterator and currently adding each element to a HashMap (putting entries in one at a time, dealing with conflicts as needed). My goal is to do this in parallel: add to multiple HashMaps in chunks, using parallelism to speed up the process, and then reduce them to a single HashMap.
I'm not sure how to do the first step, using streams to map into multiple HashMaps in parallel. I'd appreciate any help.

Parallel streams collected with Collectors.toMap will already process the stream on multiple threads and then combine the per-thread maps as a final step. In the case of toConcurrentMap, multiple threads process the stream and insert the data into a single thread-safe map.
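For illustration, here is a rough sketch of both collectors on a parallel stream (the word-length example is made up, not from the question):

import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectExample {
    // toMap: each thread fills its own map; the maps are merged at the end.
    static Map<String, Integer> lengthsToMap(Stream<String> words) {
        return words.parallel()
                    .collect(Collectors.toMap(Function.identity(), String::length, (a, b) -> a));
    }

    // toConcurrentMap: all threads insert into one shared concurrent map.
    static ConcurrentMap<String, Integer> lengthsToConcurrentMap(Stream<String> words) {
        return words.parallel()
                    .collect(Collectors.toConcurrentMap(Function.identity(), String::length, (a, b) -> a));
    }
}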

If you only have an Iterator (as opposed to an Iterable or a Spliterator), it's probably not worth parallelizing. In Effective Java, Josh Bloch states that:
Even under the best of circumstances, parallelizing a pipeline is unlikely to increase its performance if the source is from Stream.iterate, or the intermediate operation limit is used.
An Iterator has only a next method, which (typically) must be called sequentially. Thus, any attempt to parallelize would be doing essentially what Stream.iterate does: sequentially starting the stream and then sending the data to other threads. There's a lot of overhead that comes with this transfer, and the cache is not on your side at all. There's a good chance that it wouldn't be worth it, except maybe if you have few elements to iterate over and you have a lot of work to do on each one. In this case, you may as well put them all into an ArrayList and parallelize from there.
It's a different story if you can get a reasonably parallelizable Stream. You can get these if you have a good Iterable or Spliterator. If you have a good Spliterator, you can get a Stream using the StreamSupport.stream methods. Any Iterable has a spliterator method. If you have a Collection, use the parallelStream method.
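As a rough sketch of those two options (assuming your CloseableIterator behaves like a plain Iterator; the helper names are made up):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class StreamSources {
    // Option 1: drain the iterator into a list, then parallelize from there.
    static <E> Stream<E> fromIterator(Iterator<E> it) {
        List<E> buffer = new ArrayList<>();
        it.forEachRemaining(buffer::add);
        return buffer.parallelStream();
    }

    // Option 2: if you have a good Spliterator, wrap it directly.
    static <E> Stream<E> fromSpliterator(Spliterator<E> sp) {
        return StreamSupport.stream(sp, true); // true = parallel
    }
}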
A Map in Java has key-value pairs, so I'm not exactly sure what you mean by "putting into a HashMap." For this answer, I'll assume you mean that you're making a call to the put method where the key is one of the elements and the value is Boolean.TRUE. If you update your question, I can give a more specific answer.
In this case, your code could look something like this:
public static <E> Map<E, Boolean> putInMap(Stream<E> elements) {
    return elements.parallel()
                   .collect(Collectors.toConcurrentMap(e -> e, e -> Boolean.TRUE, (a, b) -> Boolean.TRUE));
}
e -> e is the key mapper, making it so that the keys are the elements.
e -> Boolean.TRUE is the value mapper, making the stored values Boolean.TRUE.
(a, b) -> Boolean.TRUE is the merge function, deciding how to combine the values of two entries that have the same key.

Related

What are some efficient ways to de-dupe a set of > 1 million strings?

For my project, I need to de-dupe very large sets of strings very efficiently. I.e., given a list of strings that may contain duplicates, I want to produce a list of all the strings in that list, but without any duplicates.
Here's the simplified pseudocode:
set = # empty set
deduped = []
for string in strings:
    if !set.contains(string):
        set.add(string)
        deduped.add(string)
Here's the simplified C++ for it (roughly):
std::unordered_set<const char *> set;
for (auto &string : strings) {
    // do some non-trivial work here that is difficult to parallelize
    auto result = set.insert(string);
}
// afterwards, iterate over set and dump strings into vector
However, this is not fast enough for my needs (I've benchmarked it carefully). Here are some ideas to make it faster:
Use a different C++ set implementation (e.g., abseil's)
Insert into the set concurrently (however, per the comment in the C++ implementation, this is hard. Also, there will be performance overhead to parallelizing)
Because the set of strings changes very little across runs, perhaps cache whether the hash function produces any collisions. If it produces none (after accounting for the changes), then strings can be compared by their hash during lookup rather than by actual string equality (strcmp).
Storing the de-duped strings in a file across runs (this may seem simple, but there are lots of complexities here)
All of these solutions, I've found, are either prohibitively tricky or don't provide that big of a speedup. Any ideas for fast de-duping? Ideally, something that doesn't require parallelization or file caching.
You can try various algorithms and data structures to solve your problem:
Try using a prefix tree (trie), a suffix automaton, or a hash table. A hash table is one of the fastest ways to find duplicates. Try different hash table implementations.
Use various data attributes to reduce unnecessary calculations. For example, you can process only subsets of strings that have the same length.
Try to implement a "divide and conquer" approach to parallelize the computation. For example, divide the set of strings into a number of subsets equal to the number of hardware threads, de-dupe each subset, then combine the subsets into one. Since the subsets shrink in the process (if the number of duplicates is large enough), combining them should not be too expensive (see the sketch after this answer).
Unfortunately, there is no general approach to this problem. To a large extent, the decision depends on the nature of the data being processed. The second item on my list seems to me the most promising. Always try to reduce the computations to work with a smaller data set.
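A minimal sketch of that third suggestion, assuming the strings are already in memory in a std::vector (the function name and threading details are illustrative, not a tuned implementation):

#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

std::unordered_set<std::string> dedupe_parallel(const std::vector<std::string>& strings,
                                                unsigned num_threads)
{
    if (num_threads == 0) num_threads = 1;
    std::vector<std::unordered_set<std::string>> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (strings.size() + num_threads - 1) / num_threads;

    // Each thread de-dupes its own chunk into a private set (no locking needed).
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(strings.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                partial[t].insert(strings[i]);
        });
    }
    for (auto& w : workers) w.join();

    // Combine the (already reduced) partial sets into one.
    std::unordered_set<std::string> result;
    for (const auto& p : partial)
        result.insert(p.begin(), p.end());
    return result;
}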
You can significantly parallelize your task by implementing a simplified version of std::unordered_set manually:
Create an arbitrary number of buckets (probably proportional or equal to the number of threads in your thread pool).
Using the thread pool, calculate the hashes of your strings in parallel and split the strings, together with their hashes, between the buckets. You may need to lock individual buckets when adding strings to them, but the operation should be short, and/or you may use a lock-free structure.
Process each bucket individually using your thread pool: compare hashes and, if they are equal, compare the strings themselves.
You may need to experiment with the bucket size and check how it affects performance. Logically it should be neither too big nor too small, to avoid contention. A rough sketch of this scheme follows below.
By the way, from your description it sounds like you load all the strings into memory and then eliminate duplicates. You could try reading your data directly into a std::unordered_set instead; then you would save memory and increase speed as well.
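A rough sketch of that bucketed scheme (single-threaded splitting for brevity, although the answer suggests hashing in parallel as well; the names are made up and num_buckets is assumed to be greater than zero):

#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

std::vector<std::string> dedupe_buckets(const std::vector<std::string>& strings,
                                        std::size_t num_buckets)
{
    // Partition by hash so that all copies of a string land in the same bucket.
    std::hash<std::string> hasher;
    std::vector<std::vector<std::string>> buckets(num_buckets);
    for (const auto& s : strings)
        buckets[hasher(s) % num_buckets].push_back(s);

    // De-duplicate each bucket independently; buckets are disjoint, so no
    // merge step or locking is needed afterwards.
    std::vector<std::unordered_set<std::string>> uniques(num_buckets);
    std::vector<std::thread> workers;
    for (std::size_t b = 0; b < num_buckets; ++b)
        workers.emplace_back([&, b] {
            uniques[b].insert(buckets[b].begin(), buckets[b].end());
        });
    for (auto& w : workers) w.join();

    // Gather the de-duped strings.
    std::vector<std::string> deduped;
    for (const auto& u : uniques)
        deduped.insert(deduped.end(), u.begin(), u.end());
    return deduped;
}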

Is there a faster way to query from NDB using list?

I have a list of values for which I need to query the corresponding information.
I can do:
for i in list:
    database.query(infomation==i).fetch()
but this is very slow, because for every element in the list it has to make a round trip to the database and back, instead of querying everything at once. Is there a way to speed this process up?
You can use the ndb async operations to speed up your code. Basically you would launch all your queries pretty much in parallel, then process the results as they come in, which would result in potentially much faster overall execution, especially if your list is long. Something along these lines:
futures = []
for i in list:
    futures.append(
        database.query(infomation==i).fetch_async())

for future in futures:
    results = future.get_result()
    # do something with your results
There are more advanced ways of using the async operations described in the mentioned doc which you may find interesting, depending on the structure of your actual code.

Data structure for sparse insertion

I am asking this question mostly for confirmation, because I am not an expert in data structures, but I think the structure that suits my need is a hashmap.
Here is my problem (which I guess is typical?):
We are looking at pairwise interactions between a large number of objects (say N=90k), so think about the storage as a sparse matrix;
There is a process, say (P), which randomly starts from one object, and computes a model which may lead to another object: I cannot predict the pairs in advance, so I need to be able to "create" entries dynamically (arguably the performance is not very critical here);
The process (P) may "revisit" existing pairs and update the corresponding element in the matrix: this happens a lot, and therefore I need to be able to find and update as fast as possible.
Finally, the process (P) is repeated millions of times, but it only requires write access to the data structure; it does not need to know about the latest "state" of that storage. This feels intuitively like a detail that might be exploited to improve performance, but I don't think hashmaps can take advantage of it.
This last point is actually the main reason for my question here: is there a data structure which satisfies the first three points (I'm thinking hash map, correct?), and which would also exploit the last feature for improved performance (I'm thinking of something like buffering operations and executing them in bulk asynchronously)?
EDIT: I am working with C++, and would prefer it if there was an existing library implementing that data structure. In addition, I am limited by the system requirements; I cannot use C++11 features.
I would use something like:
#include <boost/unordered_map.hpp>

class Data
{
    boost::unordered_map<std::pair<int, int>, double> map;
public:
    void update(int i, int j, double v)
    {
        map[std::pair<int, int>(i, j)] += v;
    }
    void output(); // Prints data somewhere.
};
That will get you going (you may need to declare a suitable hash function). You might be able to speed things up by making the key type a 64-bit integer and using ((int64_t)i << 32) | j as the index.
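For example, a hypothetical variant of the class above with the packed key (it assumes i and j are non-negative and fit in 32 bits, and uses boost::int64_t to avoid depending on C++11):

#include <boost/cstdint.hpp>
#include <boost/unordered_map.hpp>

class PackedData
{
    boost::unordered_map<boost::int64_t, double> map;
public:
    void update(int i, int j, double v)
    {
        // Cast j through an unsigned type so a plain int is not sign-extended.
        boost::int64_t key = (boost::int64_t(i) << 32) | boost::uint32_t(j);
        map[key] += v;
    }
    void output(); // Prints data somewhere.
};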
If nearly all the updates go to a small fraction of the pairs, you could have two maps (small and large) and directly update the small map. Every time the size of small passes a threshold, you could merge it into large and clear small. You would need to do some careful testing to see whether this helps. The only reason I think it might help is improved cache locality.
Even if you end up using a different data structure, you can keep this class interface, and the rest of the code will be undisturbed. In particular, dropping sparsehash into the same structure will be very easy.

When to use a sequence in F# as opposed to a list?

I understand that a list actually contains values, and a sequence is an alias for IEnumerable<T>. In practical F# development, when should I be using a sequence as opposed to a list?
Here's some reasons I can see when a sequence would be better:
When interacting with other .NET languages or libraries that require IEnumerable<T>.
Need to represent an infinite sequence (probably not really useful in practice).
Need lazy evaluation.
Are there any others?
I think your summary for when to choose Seq is pretty good. Here are some additional points:
Use Seq by default when writing functions, because then they work with any .NET collection
Use Seq if you need advanced functions like Seq.windowed or Seq.pairwise
I think choosing Seq by default is the best option, so when would I choose a different type?
Use List when you need recursive processing using the head::tail patterns
(to implement some functionality that's not available in the standard library)
Use List when you need a simple immutable data structure that you can build step-by-step
(for example, if you need to process the list on one thread - to show some statistics - and concurrently continue building the list on another thread as you receive more values i.e. from a network service)
Use List when you work with short lists - list is the best data structure to use if the value often represents an empty list, because it is very efficient in that scenario
Use Array when you need large collections of value types
(arrays store data in a flat memory block, so they are more memory efficient in this case)
Use Array when you need random access or more performance (and cache locality)
Also prefer seq when:
You don't want to hold all elements in memory at the same time.
Performance is not important.
You need to do something before and after enumeration, e.g. connect to a database and close connection.
You are not concatenating (repeated Seq.append will stack overflow).
Prefer list when:
There are few elements.
You'll be prepending and decapitating a lot.
Neither seq nor list are good for parallelism but that does not necessarily mean they are bad either. For example, you could use either to represent a small bunch of separate work items to be done in parallel.
Just one small point: Seq and Array are better than List for parallelism.
You have several options: PSeq from F# PowerPack, Array.Parallel module and Async.Parallel (asynchronous computation). List is awful for parallel execution due to its sequential nature (head::tail composition).
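A tiny illustrative sketch of that difference (the function names are made up):

// Runs the mapping across cores with Array.Parallel.
let squaresParallel (xs: int[]) =
    xs |> Array.Parallel.map (fun x -> x * x)

// The same mapping over a list runs sequentially.
let squaresSequential (xs: int list) =
    xs |> List.map (fun x -> x * x)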
Lists are more functional and math-friendly: two lists are equal when each of their elements is equal. Sequences are not.
let list1 = [1..3]
let list2 = [1..3]
printfn "equal lists? %b" (list1=list2)
let seq1 = seq {1..3}
let seq2 = seq {1..3}
printfn "equal seqs? %b" (seq1=seq2)
You should always expose Seq in your public APIs. Use List and Array in your internal implementations.

Scheme: Constant Access to the End of a List?

In C, you can have a pointer to the first and last element of a singly-linked list, providing constant time access to the end of a list. Thus, appending one list to another can be done in constant time.
As far as I am aware, Scheme does not provide this functionality (namely constant-time access to the end of a list) by default. To be clear, I am not looking for "pointer" functionality. I understand that is non-idiomatic in Scheme and (as I suppose) unnecessary.
Could someone either 1) demonstrate a way to append two lists in constant time or 2) assure me that this is already available by default in Scheme or Racket (e.g., tell me that append is in fact a constant-time operation if I am wrong to think otherwise)?
EDIT:
I should make myself clearer. I am trying to create an inspectable queue. I want to have a list that I can 1) push onto the front in constant time, 2) pop off the back in constant time, and 3) iterate over using Racket's foldr or something similar (a Lisp right fold).
Standard Lisp lists cannot be appended to in constant time.
However, if you make your own list type, you can do it. Basically, you can use a record type (or just a cons cell)---let's call this the "header"---that holds pointers to the head and tail of the list, and update it each time someone adds to the list.
However, be aware that if you do that, lists are no longer structurally inductive. i.e., a longer list isn't simply an extension of a shorter list, because of the extra "header" involved. Thus, you lose a great part of the simplicity of Lisp algorithms which involve recursing into the cdr of a list at each iteration.
In other words, the lack of easy appending is a tradeoff to enable recursive algorithms to be written much more easily. Most functional programmers will agree that this is the right tradeoff, since appending in a pure-functional sense means that you have to copy every cell in all but the last list---so it's no longer O(1), anyway.
ETA to reflect OP's edit
You can create a queue, but with the opposite behaviour: you add elements to the back, and retrieve elements in the front. If you are willing to work with that, such a data structure is easy to implement in Scheme. (And yes, it's easy to append two such queues in constant time.)
Racket also has a similar queue data structure, but it uses a record type instead of cons cells, because Racket cons cells are immutable. You can convert your queue to a list using queue->list (at O(n) complexity) for times when you need to fold.
You want a FIFO queue. user448810 mentions the standard implementation for a purely-functional FIFO queue.
Your concern about losing the "key advantage of Lisp lists" needs to be unpacked a bit:
You can write combinators for custom data structures in Lisp. If you implement a queue type, you can easily write fold, map, filter and so on for it.
Scheme, however, does lack in the area of providing polymorphic sequence functions that can work on multiple sequence types. You do often end up either (a) converting your data structures back to lists in order to use the rich library of list functions, or (b) implementing your own versions of various of these functions for your custom types.
This is very much a shame, because singly-linked lists, while they are hugely useful for tons of computations, are not a do-all data structure.
But what is worse is that there are a lot of Lisp folk who like to pretend that lists are a "universal datatype" that can and should be used to represent any kind of data. I've programmed Lisp for a living, and oh my god I hate the code that these people produce; I call it "Lisp programmer's disease," and have much too often had to go in and fix a lot of O(n^2) code that uses lists to represent sets or dictionaries, replacing them with hash tables or search trees. Don't fall into that trap. Use proper data structures for the task at hand. You can always build your own opaque data types using record types and modules in Racket; you make them opaque by exporting the type but not the field accessors for the record type (you export your type's user-facing operations instead).
It sounds like you are looking for a deque, not a list. The standard idiom for a deque is to keep two lists, the front half of the list in normal order and the back half of the list in reverse order, thus giving access to both ends of the deque. If the half of the list that you want to access is empty, reverse the other half and swap the meaning of the two halves. Look here for a fuller explanation and sample code.
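For reference, here is a minimal sketch of that two-list idea in plain Scheme (amortized constant time rather than worst-case; the names are made up, and it only covers the push-front, pop-back, and fold operations asked about):

; A queue is a pair of lists: (front . back).
; The front half is stored in order, the back half in reverse order.
(define (make-queue) (cons '() '()))

(define (queue-push q x)                ; push onto the front: O(1)
  (cons (cons x (car q)) (cdr q)))

(define (queue-pop-back q)              ; pop off the back: amortized O(1)
  (cond ((and (null? (car q)) (null? (cdr q)))
         (error "queue-pop-back: empty queue"))
        ((null? (cdr q))                ; back is empty: reverse the front into it
         (queue-pop-back (cons '() (reverse (car q)))))
        (else
         (values (car (cdr q))                      ; the popped element
                 (cons (car q) (cdr (cdr q)))))))   ; the remaining queue

(define (queue->list q)                 ; front-to-back order, ready for a right fold
  (append (car q) (reverse (cdr q))))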