Performance of Infinispan distributed streams vs. MapReduce

Infinispan 8.2.x (the latest version) introduces Java 8 distributed streams and deprecates the MapReduce implementation [1]. That raises the question of performance enhancements.
Have any benchmarks been run to test the performance benefits? According to the Infinispan team, internal benchmarks have shown the performance benefit of Infinispan's distributed streams [2]. However, I could not find the results, or pointers to a detailed discussion.
How do Infinispan distributed streams achieve higher performance than Infinispan MapReduce? Do they take advantage of SIMD (single instruction, multiple data) operations?
[1] https://docs.jboss.org/infinispan/8.2/apidocs/org/infinispan/distexec/mapreduce/MapReduceTask.html
[2] https://developer.jboss.org/thread/268188?start=0&tstart=0

Thanks for the reminder; to be honest, we have been a bit lax about blogging on the performance increase here. I hope we can get a post out in the near future. I can, however, show you an image of the graph that was generated in the benchmark test that was run. The y-axis is MB/s and the x-axis is the number of unique words (in all tests it was doing a simple word count). The blog post should give more details.
These tests were all run with RadarGun [2], by the way. The test driver for Map/Reduce can be found at [3] and the one for distributed streams at [4].
From the chart, Map/Reduce is close in performance (~15-30% slower than distributed streams), but once the intermediate results grew in size, Map/Reduce fell off and ran out of memory. Spark in this case also had twice as much memory as distributed streams (so I was fighting those results a little). But this should be detailed more once we have a blog post.
As for why distributed streams are faster, the biggest thing is that we use Java 8 Streams under the hood, which give good CPU cache hit rates and keep context-switching overhead low by using a fork/join pool. Map/Reduce has a lot of these optimizations but was missing some :)
Also, don't forget that distributed streams are fully rehash aware, so you don't have to worry about duplicate or lost entries when a node enters or leaves the cluster, as you do with Map/Reduce. You can also read more about distributed streams at [5].
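For reference, here is a minimal word-count sketch using distributed streams, closely following the pattern shown in the user guide [5]; the cache contents and the exact collector wiring are illustrative rather than taken from the benchmark driver:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.infinispan.Cache;
import org.infinispan.stream.CacheCollectors;

public class WordCountExample {

    // Count words across the whole cluster: each node streams over the
    // entries it owns, and only the partial results travel back to the
    // node that invoked the terminal operation.
    public static Map<String, Long> wordCount(Cache<String, String> cache) {
        return cache.entrySet().parallelStream()       // distributed CacheStream
            .map(e -> e.getValue().split("\\s+"))      // split each stored sentence
            .flatMap(Arrays::stream)
            .collect(CacheCollectors.serializableCollector(
                () -> Collectors.groupingBy(Function.identity(),
                                            Collectors.counting())));
    }
}
```

In a clustered cache the lambdas and the collector have to be marshallable so they can execute on remote nodes, which is why the collector is wrapped with CacheCollectors.serializableCollector instead of being passed directly.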
[2] https://github.com/radargun/radargun
[3] https://github.com/radargun/radargun/blob/master/extensions/mapreduce/src/main/java/org/radargun/stages/mapreduce/MapReduceStage.java
[4] https://github.com/radargun/radargun/blob/master/extensions/cache/src/main/java/org/radargun/stages/stream/StreamStage.java
[5] http://infinispan.org/docs/dev/user_guide/user_guide.html#streams

Related

Clojure : Number of chunks for pmap compared to processor cores

I have a big computation to run, which is basically applying a logistic regression to around 500,000 series.
Because the work is heavy, I divided it into 4 chunks of 125,000 series each.
I have a 2-core processor with hyper-threading, and the result is much faster.
But I have a question about this: should the number of chunks be the same as the number of cores (or the number of threads, in the case of hyper-threading)? I'm not sure how pmap works; I read the Incanter conf and am still not sure, because the presenter has 2 cores and divides the work into 4 threads.
This is quite a heavy job anyway (more than 5 hours with pmap, a lot more without it), so any significant optimization is welcome.
Thanks
Check out clojure.core.reducers before you build your own.
Thinking this through yourself is a worthwhile exercise because it builds understanding and appreciation of how hard this problem really is. Good solutions include concepts like "work stealing", for instance, where idle processors can take work from busy ones.
In real life it's best to go straight to Clojure's built-in reducers. They make this deceptively easy if you are working with immutable vectors as your input, and they automatically manage Java's fork/join framework for handling batch sizes and work allocation. This blog post also gives a lot of the background.
You may then want to look at using transducers to reduce the number of intermediate data structures produced.
The hint to look at c.c.reducers is a good one. If it's ok for results of your job to be returned out of order, you may also want to check out Tesser, which will give you a neat API with a lot of flexibility and power, and happily run your job on several threads or a Hadoop cluster depending on your needs.
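For intuition about what reducers are automating, here is a rough Java sketch of the fork/join split-and-combine pattern they build on; the threshold, the array size, and the per-element work are placeholders, not recommendations for the logistic-regression job:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ChunkedCount extends RecursiveTask<Long> {
    private static final int THRESHOLD = 8_192;    // hypothetical chunk size
    private final double[] data;
    private final int from, to;

    public ChunkedCount(double[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {              // small enough: do the work inline
            long hits = 0;
            for (int i = from; i < to; i++) {
                if (data[i] > 0.5) hits++;         // stand-in for the real per-series work
            }
            return hits;
        }
        int mid = (from + to) >>> 1;               // otherwise split in half...
        ChunkedCount left = new ChunkedCount(data, from, mid);
        ChunkedCount right = new ChunkedCount(data, mid, to);
        left.fork();                               // ...let an idle worker steal one half
        return right.compute() + left.join();      // ...and combine the results
    }

    public static void main(String[] args) {
        double[] series = new double[500_000];     // placeholder input
        long total = ForkJoinPool.commonPool()
                                 .invoke(new ChunkedCount(series, 0, series.length));
        System.out.println(total);
    }
}
```

Because idle worker threads steal forked subtasks from busy ones, the exact number of chunks matters less here than with a fixed manual split into one chunk per core; reducers pick the batch sizes for you on top of this machinery.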

What can Hadoop do when I want to run the same algorithm concurrently with 1000+ groups of different parameters?

I want to run 1000+ different versions of the same algorithm (with different arguments) at the same time. Is Hadoop able to improve performance in this situation?
I have no knowledge of Hadoop currently, so the question might seem dumb.
I just want to know whether Hadoop can do something about this; I don't need to know how to do it.
No, it can't, simply because it does not care what kinds of jobs are running at the same time. You will see a few performance improvements because the OS tries to cache your input, but generally the framework will not optimize for this situation.
Hadoop wasn't built for these kinds of jobs, and I very much doubt that you will get good performance with Hadoop.
You're thinking about Hadoop the wrong way. The strength and advantage of using Hadoop lies in the fact that it allows distributed computing on "data-intensive" tasks. This means that it excels when you have a relatively small/simple amount of processing to do on a large amount of data (many terabytes to a few petabytes even).
So when you're considering Hadoop, the question is, "do I have a huge amount of data?" If yes, then it could work for you. It looks like your answer is no, and you want to use it for concurrent processing. In that case, it's not the way to go for you.
You can do it with Hadoop, but you will only benefit from part of its functionality - distributed task scheduling - and not from the rest.
Technically, I would suggest the following approach (sketched below):
a) Make each set of parameters a single input split.
b) Make each mapper read the parameters from its input and read the data directly from HDFS (or from the distributed cache).
What you will get: distribution of your load over the cluster and automatic restart of failed tasks.
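As a rough illustration of (a) and (b), here is a hedged Java sketch using NLineInputFormat so that each line of a small parameters file becomes its own split, and therefore its own map task; the paths, the parameter format, and the runAlgorithm stub are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParameterSweepJob {

    // Hypothetical location of the dataset shared by every run.
    private static final String SHARED_DATA = "/data/shared-input";

    public static class SweepMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text paramLine, Context context)
                throws IOException, InterruptedException {
            String params = paramLine.toString();   // one parameter set per line

            // (b) read the shared data directly from HDFS inside the mapper
            FileSystem fs = FileSystem.get(context.getConfiguration());
            String result;
            try (FSDataInputStream in = fs.open(new Path(SHARED_DATA))) {
                result = runAlgorithm(params, in);  // placeholder for the real computation
            }
            context.write(new Text(params), new Text(result));
        }

        private String runAlgorithm(String params, FSDataInputStream data) {
            return "result-for:" + params;          // stub
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "parameter-sweep");
        job.setJarByClass(ParameterSweepJob.class);
        job.setMapperClass(SweepMapper.class);
        job.setNumReduceTasks(0);                        // map-only job
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);    // (a) one parameter set per split
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // parameters file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run as a map-only job (zero reducers), each task writes its own result, and the framework contributes exactly the two things mentioned above: scheduling the tasks across the cluster and restarting the ones that fail.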

What NoSQL solution is recommended for mostly writing application?

We are planning to move some of the writes our back-end does from an RDBMS to NoSQL, as we expect them to be the main bottleneck.
Our business process has 95%-99% concurrent writes and only 1%-5% concurrent reads on average. There will be a massive amount of data involved, so an in-memory NoSQL DB won't fit.
Which on-disk NoSQL DB would be optimal for this case?
Thanks!
If the concurrent writes are creating conflicts and data integrity is an issue, NoSQL probably isn't the way to go. You can easily test this with a data-management system that supports "optimistic concurrency", since then you can measure real-life locking conflicts and analyze them in detail.
I am a little bit surprised that you say you "expect problems" without any further details, so let me answer based on the facts you gave us. What are the 100,000 sources, and what is the write scenario? MySQL is not the best example of handling scalable concurrent writes, etc.
It would be helpful if you'd provide some kind of use case, or anything else that helps us understand the problem in detail.
Let me take two examples. First: an in-memory database with an advanced write dispatcher, data versioning, etc. can easily take 1M "writers", the writers being network elements and the application an advanced NMS system. Lots of writes, no conflicts, optimistic concurrency, in-memory write buffering up to 16 GB, asynchronous parallel writing to 200+ virtual spindles (SSD or magnetic disks), and so on. A real "sucker" for eating new data, and an excellent candidate for scaling performance to its limits.
Second example: an MSC with a sparse number space, e.g. mobile numbers forming "clusters" of numbers. A huge number space, but at most 200M individual addresses, and very rare situations with conflicting writes. The RDBMS was replaced with memory-mapped sparse files, and the performance improvement was close to 1000x (yes, 1000x) in the best case, and "only" 100x in the worst case. The replacement code was roughly 300 lines of C. That was true "big NoSQL", because it was a good fit for the problem being solved.
So, in short, without knowing more details there is no "silver bullet" answer to your question. We're not after werewolves here; it's just "big bad data". When we don't know whether your workload is "transactional" (i.e. sensitive to the number of IOs and to latency) or "BLOB-like" (i.e. streaming media, geodata, etc.), any promise would be 100% wrong. Bandwidth and IO rate/latency/transactions are more or less a trade-off in real life.
See, for example, http://publib.boulder.ibm.com/infocenter/soliddb/v6r3/index.jsp?topic=/com.ibm.swg.im.soliddb.sql.doc/doc/pessimistic.vs.optimistic.concurrency.control.html for some further details.

Does anybody have any experience with FastDB (C++ in-memory database)?

FastDB is an open-source, in-memory database that's tightly integrated with C++ (it supports a SQL-like query language where tables are classes and rows are objects). Like most IMDBs, it's meant for applications dominated by read access patterns. The algorithms and data structures are optimized for systems that read and write data entirely in main memory (RAM). It's supposed to be very fast, even compared to other in-memory databases, but I can't find any benchmarks online.
I'm considering using FastDB for time-series data, in a project where 1) sub-millisecond random-access read latencies, and 2) millions of rows per second sequential read throughput would be very good to have.
I can't find many references to first-hand experience with FastDB; has anyone here used it? Can you point to any benchmarks of FastDB, especially those that consider read latency and throughput?
A post on an Erlang forum from 2009 (http://www.trapexit.org/forum/viewtopic.php?p=49476#49476) has Serge Aleynikov recommending FastDB for trading systems with sub-millisecond latencies:
If you don't want to spend too much time coding C++, since you have already done good work of abstracting mnesia backend, why don't you create an Erlang driver for this database: www.fastdb.org. It's based on memory mapped files, implemented in C++, is relatively fast compared to other in-memory databases (about 250k lookups/s, 50k inserts/s), has time-series capabilities, simple C-API. I implemented FastDB interface in several languages, and generally it's good for systems that deal with latencies in sub-milliseconds range. It may suffice for you unless you need to stay in the low microseconds realm.
My 2c.
Serge
It's pretty intimidating to see people worrying about latencies in the low microseconds; I'm considering FastDB for digital signal processing (DSP), where live audio systems generally limit latency to no more than about 10 milliseconds. Of course, if a system responds in milliseconds, we might use input pulses of only a few microseconds in length.
There's no mention of what system was used for the 250K lookups/s, 50K inserts/s. Still, it's a positive sign.

BerkeleyDB Concurrency

What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support?
How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention?
I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency.
My application is pretty simple: I'll be doing gets and puts of records that are about 1 KB each. No cursors, no deleting.
It depends on what kind of application you are building. Create a representative test scenario, and start hammering away. Then you will know the definitive answer.
Besides your use case, it also depends on CPU, memory, front-side bus, operating system, cache settings, etcetera.
Seriously, just test your own scenario.
If you need some numbers (that actually may mean nothing in your scenario):
Oracle Berkeley DB: Performance Metrics and Benchmarks
Performance Metrics & Benchmarks: Berkeley DB
I strongly agree with Daan's point: create a test program, and make sure the way it accesses data mimics as closely as possible the patterns you expect your application to have. This is extremely important with BDB because different access patterns yield very different throughput.
Other than that, these are the general factors I found to have a major impact on throughput:
Access method (which in your case I guess is BTREE).
Level of persistence with which you configured BDB (for example, in my case the DB_TXN_WRITE_NOSYNC environment flag improved write performance by an order of magnitude, but it compromises persistence).
Does the working set fit in cache?
Number of reads vs. writes.
How spread out your access is (remember that BTREE uses page-level locking, so accessing different pages from different threads is a big advantage).
Access pattern - meaning how likely threads are to lock one another, or even deadlock, and what your deadlock resolution policy is (this one can be a killer).
Hardware (disk & memory for cache).
This amounts to the following point:
Scaling a BDB-based solution so that it offers greater concurrency comes down to two things: either minimize the number of locks in your design or add more hardware.
Doesn't this depend on the hardware as well as number of threads and stuff?
I would make a simple test and run it with increasing amounts of threads hammering and see what seems best.
What I did when working against a database of unknown performance was to measure turnaround time on my queries. I kept upping the thread count until turn-around time dropped, and dropping the thread count until turn-around time improved (well, it was processes in my environment, but whatever).
There were moving averages and all sorts of metrics involved, but the take-away lesson was: just adapt to how things are working at the moment. You never know when the DBAs will improve performance or hardware will be upgraded, or perhaps another process will come along to load down the system while you're running. So adapt.
Oh, and another thing: avoid process switches if you can - batch things up.
Oh, I should make this clear: this all happened at run time, not during development.
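As a concrete starting point for the measure-and-adapt approach described above, here is a rough, store-agnostic Java sketch that replays a mix of roughly 1 KB gets and puts at increasing thread counts and reports throughput; the KeyValueStore interface, the 90/10 read/write mix, and the durations are placeholders rather than part of any Berkeley DB API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThroughputSweep {

    /** Placeholder for whatever wraps your real Berkeley DB handle. */
    interface KeyValueStore {
        byte[] get(String key);
        void put(String key, byte[] value);
    }

    // Hammer the store with `threads` workers for `millis` ms; return ops/s.
    static long runFor(KeyValueStore store, int threads, long millis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong ops = new AtomicLong();
        long deadline = System.currentTimeMillis() + millis;
        byte[] payload = new byte[1024];                  // ~1 KB records

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                while (System.currentTimeMillis() < deadline) {
                    String key = "k" + rnd.nextInt(1_000_000);
                    if (rnd.nextInt(100) < 90) {          // hypothetical 90% reads
                        store.get(key);
                    } else {
                        store.put(key, payload);
                    }
                    ops.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(millis + 5_000, TimeUnit.MILLISECONDS);
        return ops.get() * 1000 / millis;
    }

    public static void main(String[] args) throws Exception {
        KeyValueStore store = connectToStore();
        for (int threads = 1; threads <= 64; threads *= 2) {
            System.out.printf("%d threads: %d ops/s%n",
                              threads, runFor(store, threads, 30_000));
        }
    }

    private static KeyValueStore connectToStore() {
        // Replace with a wrapper around your actual Berkeley DB environment.
        throw new UnsupportedOperationException("plug in your store here");
    }
}
```

Once you see where throughput flattens (or turnaround times start to climb), you have a practical concurrency ceiling for that particular page size, cache size, and locking configuration; re-run the sweep whenever those change.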
The way I understand things, Samba created tdb to allow "multiple concurrent writers" for any particular database file. So if your workload has multiple writers your performance may be bad (as in, the Samba project chose to write its own system, apparently because it wasn't happy with Berkeley DB's performance in this case).
On the other hand, if your workload has lots of readers, then the question is how well your operating system handles multiple readers.