Serialization and concurrency

Suppose there are two functions, f and g, both of which change the value of a shared variable x.
Case 1: Both are unserialized and are executed in parallel.
Case 2: f is serialized and g is not. They are executed in parallel.
Question:
Let N be the set of all possible values of x after complete execution in case 1.
Let M be the set of all possible values of x after complete execution in case 2.
Is M equal to N?
In other words:
Is there any difference if only one of the two functions is serialized?
Is serialization of any use unless both functions are serialized?

As the text says,
"serialization creates distinguished sets of procedures such that only one execution of a procedure in each serialized set is permitted to happen at a time"
so you need to serialize both procedures that modify the shared variable.

Non-rigorous (sorry if this is for homework ;-)) but practical answer: all mutations (i.e., both functions in your case) must be synchronised in order to have predictable results.
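A minimal Java sketch of the idea (the question itself is language-agnostic; f, g and the initial value of x here are just illustrative assumptions). With both procedures locked on the same monitor, only whole-procedure orderings remain; remove either synchronized block and the lost-update interleavings reappear:

public class SharedX {
    static int x = 10;
    static final Object lock = new Object();

    // Each procedure reads and writes x. Unsynchronized, their internal
    // steps interleave: both may read 10, then f writes 11 and g
    // overwrites it with 20 (or vice versa) -- a lost update.
    static void f() { synchronized (lock) { x = x + 1; } }
    static void g() { synchronized (lock) { x = x * 2; } }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(SharedX::f);
        Thread t2 = new Thread(SharedX::g);
        t1.start(); t2.start();
        t1.join();  t2.join();
        // With BOTH methods locked: x is 22 (f then g) or 21 (g then f).
        // With only f locked, g's read and write can still straddle f's
        // critical section, so extra outcomes such as 11 or 20 return.
        System.out.println(x);
    }
}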

Using hash(map) as a key in a durable cache

I'm going to use a Redis cache where the key is a Clojure map (serialized into bytes by nippy).
Can I use the hash of the Clojure map as the key in the Redis cache?
In other words, does a Clojure map's hash depend only on the value of the data structure, and not on any memory allocation?
Investigating:
I navigated through the code and found the IHashEq interface, which is implemented by the Clojure data structures.
Ultimately, the IHashEq implementation ends with a call to Object.hashCode, which has the following contract:
Whenever it is invoked on the same object more than once during
an execution of a Java application, the {@code hashCode} method
must consistently return the same integer, provided no information
used in {@code equals} comparisons on the object is modified.
This integer need not remain consistent from one execution of an
application to another execution of the same application.
Well, I just want to confirm that I cannot use the hash as an ID persisted across processes, because:
two equal values give two equal hash codes, but not vice versa, so there is a chance of collision
there is no guarantee that a Clojure map's hash will be the same for the same value in different JVM processes
Please confirm.
Re your two points:
Two equal values will yield the same hash code, but two unequal values may also give the same hash code. So the chance of collision makes this a bad choice of key.
Different JVMs should generate the same hash code for a given value, given the same version of Java & Clojure (and very probably for different versions, although this is not guaranteed).
You can use a secure hash library (like this one) to address your concerns (as in blockchain), although you have to pay a performance penalty.
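A sketch of that approach on the JVM, assuming you hash the bytes nippy already produces (and assuming equal values always serialize to identical bytes, which is worth verifying for your data): SHA-256 yields the same digest for the same bytes in any process, and collisions are negligible in practice.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StableCacheKey {
    // nippyBytes stands for the byte[] you already get from nippy.
    static String cacheKey(byte[] nippyBytes) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(nippyBytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // stable hex encoding
        }
        return hex.toString();
    }
}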

Mapping into multiple maps in parallel with Java 8 Streams

I'm iterating over a CloseableIterator (looping over its elements) and currently adding each one to a HashMap, dealing with conflicts as needed. My goal is to do this in parallel: add to multiple hashmaps in chunks, using parallelism to speed up the process, and then reduce them to a single hashmap.
I'm not sure how to do the first step: using streams to map into multiple hashmaps in parallel. I'd appreciate help.
Parallel streams collected with Collectors.toMap will already process the stream on multiple threads and then combine the per-thread maps as a final step. In the case of toConcurrentMap, multiple threads will process the stream and combine the data into a single thread-safe map.
If you only have an Iterator (as opposed to an Iterable or a Spliterator), it's probably not worth parallelizing. In Effective Java, Josh Bloch states that:
Even under the best of circumstances, parallelizing a pipeline is unlikely to increase its performance if the source is from Stream.iterate, or the intermediate operation limit is used.
An Iterator has only a next method, which (typically) must be called sequentially. Thus, any attempt to parallelize would be doing essentially what Stream.iterate does: sequentially starting the stream and then sending the data to other threads. There's a lot of overhead that comes with this transfer, and the cache is not on your side at all. There's a good chance that it wouldn't be worth it, except maybe if you have few elements to iterate over and you have a lot of work to do on each one. In this case, you may as well put them all into an ArrayList and parallelize from there.
It's a different story if you can get a reasonably parallelizable Stream. You can get these if you have a good Iterable or Spliterator. If you have a good Spliterator, you can get a Stream using the StreamSupport.stream methods. Any Iterable has a spliterator method. If you have a Collection, use the parallelStream method.
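For illustration, a sketch of both routes (the helper names are made up): wrapping the Iterator directly produces a stream that splits poorly, while draining into an ArrayList first gives the framework something it can divide evenly.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class IteratorStreams {
    // Direct wrap: legal, but the unknown-size spliterator splits badly,
    // so parallel speedup is usually poor.
    static <E> Stream<E> directParallel(Iterator<E> it) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(it, 0), true);
    }

    // Usually better: buffer into an ArrayList, which splits cleanly.
    static <E> Stream<E> bufferedParallel(Iterator<E> it) {
        List<E> buffer = new ArrayList<>();
        it.forEachRemaining(buffer::add);
        return buffer.parallelStream();
    }
}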
A Map in Java has key-value pairs, so I'm not exactly sure what you mean by "putting into a HashMap." For this answer, I'll assume that you mean you're making a call to the put method where the key is one of the elements and the value is Boolean.TRUE. If you update your question, I can give a more specific answer.
In this case, your code could look something like this:
public static <E> Map<E, Boolean> putInMap(Stream<E> elements) {
    return elements.parallel()
                   .collect(Collectors.toConcurrentMap(
                           e -> e, e -> Boolean.TRUE, (a, b) -> Boolean.TRUE));
}
e -> e is the key mapper, making the keys the elements themselves.
e -> Boolean.TRUE is the value mapper, making every value true.
(a, b) -> Boolean.TRUE is the merge function, deciding how to combine the values of two entries with the same key.
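For example (with made-up input):

Map<String, Boolean> seen = putInMap(Stream.of("a", "b", "a"));
// seen is {a=true, b=true}; the merge function collapses the duplicate "a".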

Difference between fold and reduce revisited

I've been reading a nice answer to Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)? provided by samthebest, and I am not sure whether I understand all the details:
According to the answer (reduce vs foldLeft):
A big big difference (...) is that reduce should be given a commutative monoid, (...)
This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists.
and
Reduce is defined formally as part of the MapReduce paradigm,
I am not sure how these two statements fit together. Can anyone shed some light on that?
I tested different collections and I haven't seen a performance difference between reduce and foldLeft. It looks like ParSeq is a special case; is that right?
Do we really need order to define fold?
we cannot define fold because chunks do not have an ordering and fold only requires associativity, not commutativity.
Why couldn't it be generalized to unordered collections?
As mentioned in the comments, the term reduce means a different thing when used in the context of MapReduce than when used in the context of functional programming.
In MapReduce, the system groups the results of the map function by a given key and then calls the reduce operation to aggregate values for each group (so reduce is called once for each group). You can see it as a function (K, [V]) -> R taking the group key K together with all the values belonging to the group [V] and producing some result.
In functional programming, reduce is a function that aggregates elements of some collection when you give it an operation that can combine two elements. In other words, you define a function (V, V) -> V and the reduce function uses it to aggregate a collection [V] into a single value V.
When you want to add numbers [1,2,3,4] using + as the function, the reduce function can do it in a number of ways:
It can run from the start and calculate (((1+2)+3)+4)
It can also calculate a = 1+2 and b = 3+4 in parallel and then add a+b!
The foldLeft operation, by definition, always proceeds from the left, and so it always uses evaluation strategy (1). In fact, it also takes an initial value, so it evaluates something more like ((((0+1)+2)+3)+4). This makes foldLeft useful for operations where the order matters, but it also means that it cannot be implemented for unordered collections (because you do not know what "left" is).
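A rough Java sketch of the distinction (Java streams have no foldLeft; its sequential reduce plays that role here, and subtraction is chosen only because it is not associative):

import java.util.List;

class FoldVsReduce {
    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);

        // Associative operation: the runtime may regroup the work,
        // e.g. (1+2)+(3+4) across threads, and the answer is stable.
        int sum = xs.parallelStream().reduce(0, Integer::sum); // always 10

        // Non-associative operation: sequential evaluation fixes the
        // grouping to (((0-1)-2)-3)-4 = -10 ...
        int seq = xs.stream().reduce(0, (a, b) -> a - b);

        // ... but in parallel the grouping (and hence the result) can
        // vary from run to run, which is exactly why reduce demands
        // associativity. (0 is also not an identity for '-', another
        // violated requirement.)
        int par = xs.parallelStream().reduce(0, (a, b) -> a - b);

        System.out.println(sum + " " + seq + " " + par);
    }
}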

Perfect hash function generator for functions

I have a set of C++ functions. I want to map these functions in a hash table, something like unordered_map<function<ReturnType (Args...)>, SomethingElse>, where SomethingElse is not relevant for this question.
This set of functions is known in advance, small (say, fewer than 50), and static (it is not going to change).
Since lookup performance is crucial (it should be O(1)), I want to define a perfect hash function.
Does a perfect hash function generator exist for this scenario?
I know that perfect hash function generators exist (like GPERF or CMPH), but since I've never used them, I don't know whether they're suitable for my case.
REASON:
I'm trying to design a framework where, given a program written in C++, the user can select a subset F of the functions defined in that program.
For each f belonging to F, the framework implements a memoization strategy: when we call f with input i, we store (i,o) inside some data structure, so if we call f with i AGAIN, we return o without performing the (time-expensive) computation again.
The already computed results will be shared among different users (maybe in the cloud), so if user u1 has already computed o, user u2 will save computing time by calling f with i (using the same notation as before).
Obviously, we need to store the set of pairs (f,inputs_sets) (where inputs_sets is the set of already computed results I mentioned before), which is the original question: how do I do it?
So, using the "enumeration trick" proposed in the comments could be a solution here, assuming that all the users use exactly the same enumeration, which could be a problem: supposing our program has f1, f2 and f3, what if u1 wants to memoize only f1 and f2 (so F={f1,f2}), while u2 wants to memoize only f3 (so F={f3})? An overkill solution would be to enumerate all the functions defined in the program, but that could waste a huge amount of memory.
OK, maybe not what you want to hear, but consider this: since you are talking about a few functions, fewer than 50, the hash lookup should be negligible, even with collisions. Have you actually profiled and seen that the lookup is critical?
So my advice is to focus your energy on something else; most likely a perfect hash function would not bring any performance improvement in your case.
I am going to go one step further and say that for fewer than 50 elements, I think a flat map (good ol' vector) would have similar performance (or maybe even better, due to cache locality). But again, measurements are required.
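To make "flat map" concrete, here is a minimal sketch of the idea, written in Java for consistency with the other snippets in this digest (the question is C++, where a vector of pairs plays the same role): a plain list scanned linearly, which for ~50 entries is typically as fast as hashing.

import java.util.ArrayList;
import java.util.List;

final class FlatMap<K, V> {
    private final List<K> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();

    void put(K key, V value) {
        int i = keys.indexOf(key);          // linear scan over a small list
        if (i >= 0) {
            values.set(i, value);           // overwrite existing entry
        } else {
            keys.add(key);
            values.add(value);
        }
    }

    V get(K key) {
        int i = keys.indexOf(key);          // linear scan again
        return i >= 0 ? values.get(i) : null;
    }
}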

Using the composite design pattern to operate on objects of different types, is there a way to prevent an object from being operated on more than once?

I'd like to use the Composite design pattern in C++ to be able to create and operate on groups of objects. A problem I've encountered is that since leaves and composites are treated the same, and composites can be comprised of leaves and composites, it is quite possible for an object to be operated on more than once when a command is issued to a composite.
For example, a composite group1 contains objects A and B. Then, a composite group2 is created containing composite group1 and object A. When composite group2 is operated on, object A will be operated on twice. For some applications I guess this isn't a problem, but for my uses I'd like it if, for any command issued to a composite, unique objects are only operated on once.
Is there an idiomatic way to deal with this problem, either somehow preventing multiple calls of an object's member function, or preventing an object from being included in a composite more than once?
--
Update:
By "idiomatic" I mean "traditional" or "accepted" way of handling this type of problem.
I guess I'm just assuming/hoping this is a common problem that has an established solution.
I don't know what you mean by idiomatic, but a solution would depend on how you traverse the structure of composites. Here are some options:
if you use a visitor and remember the already visited components, ignore duplicates (see the sketch after this list)
use a tick count and have the composite element ignore successive calls with the same tick count
in a two-step method, gather all the composite objects that need to be operated on in a set, and then perform your operation
in a two-step method, set a flag in your composite objects that signifies when they have been touched this round; clear the flag before the next round
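A minimal sketch of the first option, written in Java for consistency with the other snippets here (the question is C++, where an unordered_set of pointers would play the part of the identity set). Every operate call threads a per-round set of already visited nodes, so each unique object does its work at most once:

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

interface Component {
    void operate(Set<Component> visited);
}

class Leaf implements Component {
    public void operate(Set<Component> visited) {
        if (!visited.add(this)) return;  // already handled this round
        // ... do the actual work exactly once ...
    }
}

class Composite implements Component {
    private final List<Component> children;
    Composite(List<Component> children) { this.children = children; }

    public void operate(Set<Component> visited) {
        if (!visited.add(this)) return;  // skip repeated composites too
        for (Component child : children) {
            child.operate(visited);
        }
    }
}

// Each command starts a fresh round with an empty identity-based set:
// group2.operate(Collections.newSetFromMap(new IdentityHashMap<>()));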