WSO2 CEP performance

I have a few questions related to WSO2 CEP performance.
How many events can be processed per second?
How many execution plans can be handled at one time without a large impact on performance?
What is the maximum number of receivers and publishers which can be added to CEP?
What is the maximum number of execution plans which can be added to CEP?

Everything depends on the scenario, so a specific number may not apply to your case. Using the Thrift wso2event protocol, CEP can process over 100,000 events per second. The numbers change with the complexity of the query, and you also need to consider resource allocation such as the memory heap size; allocated memory becomes significant when events are large. Likewise, the number of receivers, publishers, and execution plans you can support depends on the complexity of the processing. Because these factors are very scenario-specific, you can tune the CEP instance accordingly; please refer to the Performance Tuning Recommendations [1] for more details. For instance, if you need very high throughput but are not concerned about latency, you can increase the QueueSize in data-agent-config. Depending on the size of the event, you may also have to increase the heap memory.
[1] https://docs.wso2.com/display/CEP400/Performance+Tuning+Recommendations
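For illustration only, the throughput-over-latency tweak mentioned above lives in the data agent configuration; the file path, element names, and value below are assumptions based on a typical CEP 4.0.0 setup and should be verified against your installation:

    <!-- <CEP_HOME>/repository/conf/data-bridge/data-agent-config.xml (assumed path) -->
    <DataAgentsConfiguration>
        <Agent>
            <Name>Thrift</Name>
            <!-- A larger queue favours throughput over latency; queued events are held
                 in memory, so large events may also require a larger heap. -->
            <QueueSize>32768</QueueSize>
        </Agent>
    </DataAgentsConfiguration>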

Related

What's the effect of the "Bytes Shuffled" metric from BigQuery on cost?

I'm optimizing a query in BigQuery, and I managed to reduce all performance metrics by a good margin except for the "Bytes Shuffled" metric, which increased from 3 GB to 3.56 GB.
I would like to know if the Bytes Shuffled metric has an impact on cost, and if so, by how much?
To understand that, you have to keep the BigQuery architecture in mind. It is more or less a MapReduce architecture.
Map work can be done on a single node (filter, transform, ...). Reduce work requires communication between nodes to perform the operation (join, subtract, ...).
Of course, map operations are much more efficient than reduce operations (in memory only, no network communication, no synchronisation/waiting, ...).
Bytes Shuffled is the volume of bytes exchanged between the nodes.
The cost perspective is not simple to answer. If you pay for BigQuery on demand (no slot reservation), there is no extra cost: the same volume of data is processed, so there is no impact on the bill, only a slower query.
If you have reserved slots (nodes and slots are similar), there is no extra cost either. But you hold the slots longer (the query is slower, so the slot usage is longer), and if you share the slots with other users/queries/projects, that can affect overall performance and perhaps the overall cost of your projects.
So, no direct cost, but keep the impact on query duration in mind.

Vertica PlannedConcurrency

I have been trying to tune the performance of queries running on a Vertica cluster by changing the value of PlannedConcurrency of the general resource pool. We have a cluster of 4 nodes with 32 cores/node.
According to Vertica docs,
Query budget = Queuing threshold of the GENERAL pool / PLANNEDCONCURRENCY
Increasing PlannedConcurrency should reduce the query budget, reserving less memory per query, which might lead to fewer queries being queued up.
Increasing the value of PlannedConcurrency seems to improve query performance.
PlannedConcurrency = 256 gives better performance than 128, which in turn performs better than AUTO.
Since PlannedConcurrency is the preferred number of concurrently executing queries in the resource pool, how can this number be greater than the number of cores and still give better query performance?
Also, the difference between RESOURCE_ACQUISITIONS.MEMORY_INUSE_KB and QUERY_PROFILES.RESERVED_EXTRA_MEMORY should give the memory in use.
However, this number does not remain constant for a single query when the planned concurrency is changed.
Can someone please help me understand why this memory usage differs with the value of PlannedConcurrency?
Thanks!
References:
https://my.vertica.com/blog/do-you-need-to-put-your-query-on-a-budgetba-p236830/
https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/AdministratorsGuide/ResourceManager/GuidelinesForSettingPoolParameters.htm
It's hard to give an exact answer without the actual queries, but in general, increasing the planned concurrency means you reserve and allocate fewer resources per query and allow for greater concurrency.
If your use case has lots of small queries which don't require many resources, it might improve things.
Also keep in mind that the CPU is not the only resource being used: you have to wait for I/O (disks, network, etc.), and that is time better spent running more queries.
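As a rough worked example of the budget formula above (the numbers are invented for illustration, not taken from your cluster): if the GENERAL pool's queueing threshold is 96 GB on a node, PLANNEDCONCURRENCY = 128 budgets about 96 GB / 128 = 768 MB per query, while 256 budgets about 384 MB. The budget is what each query's memory usage is planned around, and queries can still acquire extra memory at run time, which may be why RESOURCE_ACQUISITIONS.MEMORY_INUSE_KB varies for the same query as you change PlannedConcurrency.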

how to design threading for many short tasks

I want to use multiple threads to accelerate my program, but I am not sure which way is optimal.
Say we have 10000 small tasks; it takes maybe only 0.1 s to finish one of them. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
As far as I know, there are two ways:
1. Task pool
There are always 12 threads running; each of them gets a new task from the task pool after it finishes its current work.
2. Separate tasks
Separate the 10000 tasks into 12 parts and have each thread work on one part.
The problem is that with a task pool, time is wasted on locking/unlocking when multiple threads try to access the pool. But the second way is not ideal either, because some threads finish early and the total time depends on the slowest thread.
I am wondering how you deal with this kind of work, and whether there is a better way to do it. Thank you.
EDIT: Please note that the number 10000 is just an example; in practice there may be 1e8 or more tasks, and 0.1 s per task is also an average.
EDIT2: Thanks for all your answers :] It is good to know about the different options.
One midway point between the two approaches is to break the work into, say, 100 batches of 100 tasks each and let a thread pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in the execution time of a single task on a single core, and get an estimate of the mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma:
The slowest thread can take at most 100 * 0.1 s = 10 s more than the others.
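A minimal sketch of that batching idea in C++ (the task count, batch size, and run_task are placeholders, not anything from the question): threads claim a contiguous batch of indices with a single atomic increment, so there is no per-task lock at all.

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    constexpr std::size_t kNumTasks  = 10000; // placeholder task count
    constexpr std::size_t kBatchSize = 100;   // placeholder batch size

    void run_task(std::size_t /*index*/) { /* the real per-task work goes here */ }

    int main() {
        std::atomic<std::size_t> next{0};            // index of the next unclaimed task
        auto worker = [&next] {
            for (;;) {
                // Claim a whole batch with one atomic fetch_add (no mutex per task).
                std::size_t begin = next.fetch_add(kBatchSize);
                if (begin >= kNumTasks) break;       // nothing left to claim
                std::size_t end = std::min(begin + kBatchSize, kNumTasks);
                for (std::size_t i = begin; i < end; ++i) run_task(i);
            }
        };

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < std::thread::hardware_concurrency(); ++t)
            pool.emplace_back(worker);
        for (auto& th : pool) th.join();
    }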
A task pool is always the best solution here. It's not just about optimal time, it's also about comprehensibility of the code. You should never force your tasks to conform to the completely unrelated criterion of having the same number of subtasks as cores - your tasks have nothing to do with that (in general), and such a separation doesn't scale when you change machines, etc. It also requires overhead to collaborate on combining the results of the subtasks into the final result, and it just generally makes an easy task hard.
But you should not be worrying about the use of locks for task pools. There are lock-free queues available if you ever determine them to be necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find everything is running as fast as possible, and you still want that extra second from removing locks, search the internet with your favorite search engine for "lock-free queue" and "wait-free queue". Compare-and-swap makes atomic lists easy.
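If hand-rolling a pool turns out to be unnecessary anyway, a library-provided one may do. A minimal sketch assuming a C++17 toolchain with parallel algorithm support (with GCC this typically needs linking against TBB); process_task is a placeholder for the real work:

    #include <algorithm>
    #include <cstddef>
    #include <execution>
    #include <numeric>
    #include <vector>

    void process_task(std::size_t /*id*/) { /* the real per-task work goes here */ }

    int main() {
        std::vector<std::size_t> tasks(10000);
        std::iota(tasks.begin(), tasks.end(), std::size_t{0});   // task ids 0..9999

        // The standard library (or its backend) owns the thread pool and the work
        // distribution; there is no explicit locking in user code.
        std::for_each(std::execution::par, tasks.begin(), tasks.end(), process_task);
    }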
Both ways suggested in the question will perform well and similarly to each other (in simple cases with predictable and relatively long task durations). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not assume that the optimal number of threads necessarily matches the number of cores. If this is a regular server or desktop system, there will be various system processes kicking in now and then, and you may see your 12 threads variously floating between processors, which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do those resources impose additional potential delays (blocking) or contention? Are there other apps competing for CPU? Will the application need to grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even a small speedup (the additional core will serve OS processes, so the thread affinity of the rest will become more stable compared to 12 threads). Any concurrent interactive activity on the system will see a big boost in responsiveness.
Create exactly 12 threads but explicitly set a different processor affinity mask on each, to impose a 1-1 mapping between threads and processors (see the affinity sketch below). This is good in the simplest near-academical case where there are no resources other than CPU and shared memory involved; you will see no chronic migration of threads across processors. The drawback is an algorithm closely coupled to a particular machine; on another machine it could behave so poorly as to never finish at all (because of an unrelated real-time task that blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread downgrade its own priority once it is past 40% and again once it is past 80% of its share of the work. This will improve load balancing inside your process, but it will behave poorly if your application is competing with other CPU-bound processes.
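For the affinity-mask approach above, here is a minimal sketch on Linux; it assumes glibc's non-portable pthread_setaffinity_np (on Windows you would use SetThreadAffinityMask instead), and run_my_share is a placeholder for the thread's share of the tasks:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE              // glibc guards pthread_setaffinity_np behind this
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    void run_my_share(unsigned /*core*/) { /* this thread's share of the tasks goes here */ }

    int main() {
        unsigned cores = std::thread::hardware_concurrency();
        std::vector<std::thread> threads;
        for (unsigned core = 0; core < cores; ++core) {
            threads.emplace_back([core] {
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(core, &set);     // pin this thread to one specific core
                pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
                run_my_share(core);
            });
        }
        for (auto& t : threads) t.join();
    }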
100ms/task - pile 'em on as they are - pool overhead will be insignificant.
OTOH..
1e8 tasks @ 0.1 s/task = 10,000,000 seconds
≈ 2,777.8 hours
≈ 115.7 days
That's much more than the interval between Patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
Each worker thread may have its own small task queue with a capacity of no more than one or two memory pages. When the queue becomes low (half of its capacity), it should signal a manager thread to populate it with more tasks. If the queue is organized in batches, then worker threads do not need to enter critical sections as long as the current batch is not empty. Avoiding critical sections gives you extra cycles for the actual job. Two batches per queue are enough; in that case one batch can take one memory page, so the queue takes two.
The point of the memory pages is that a thread does not have to jump all over memory to fetch data. If all the data is in one place (one memory page), you avoid cache misses.

ColdFusion Template Request count optimization

In ColdFusion, under Request Tuning in the administrator, how do I determine what is an optimal number (or at least a good guess) for the Maximum Number of Simultaneous Template Requests?
Environment:
CF8 Standard
IIS 6
Win2k3
SQL2k5 on a separate box
The way to find the right number of requests is load testing: measuring changes in throughput under load as you vary the request number. Any significant change to the application or environment would require retesting. But I suspect most folks are going to baulk at that amount of work.
I think a good rule of thumb is about 8 threads per CPU core.
In terms of efficiency, the lower the thread count (up to a point), the less switching between threads will be going on as the CPU processes your requests. If your pages execute very quickly, then a lower number of requests is optimal.
If you have longer-running requests, and especially if you have requests that are waiting on third parties (like a database), then increasing the number of worker threads will improve your throughput. That is, if your CPU is not tied up processing stuff, you can afford to have more simultaneous requests working on the tasks at hand.
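As a rough worked example (the hardware here is hypothetical, not taken from your setup): with the 8-per-core starting point above, a dual quad-core web server would start at 2 x 4 x 8 = 64 simultaneous template requests, and you would then load test and nudge the number up (many long, I/O-bound requests) or down (short, CPU-bound requests) from there.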
Although it's a little bit dated, many of the principles on request tuning in Grant Straker's book on CF Performance & Troubleshooting are still worthwhile.
I would say at least 8 per core, not per CPU. And I think 8 is a little low given modern CPU cores; I would say at least 12.

BerkeleyDB Concurrency

What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support?
How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention?
I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency.
My application is pretty simple, I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting.
It depends on what kind of application you are building. Create a representative test scenario, and start hammering away. Then you will know the definitive answer.
Besides your use case, it also depends on CPU, memory, front-side bus, operating system, cache settings, etcetera.
Seriously, just test your own scenario.
If you need some numbers (that actually may mean nothing in your scenario):
Oracle Berkeley DB: Performance Metrics and Benchmarks
Performance Metrics & Benchmarks: Berkeley DB
I strongly agree with Daan's point: create a test program, and make sure the way in which it accesses data mimics as closely as possible the patterns you expect your application to have. This is extremely important with BDB because different access patterns yield very different throughput.
Other than that, these are the general factors I found to have a major impact on throughput:
Access method (which in your case I guess is BTREE).
Level of persistency with which you configured BDB (for example, in my case the DB_TXN_WRITE_NOSYNC environment flag improved write performance by an order of magnitude, but it compromises persistency).
Does the working set fit in cache?
Number of reads vs. writes.
How spread out your access is (remember that BTREE has page-level locking, so accessing different pages from different threads is a big advantage).
Access pattern - meaning how likely threads are to lock one another, or even deadlock, and what your deadlock resolution policy is (this one may be a killer).
Hardware (disk & memory for cache).
This amounts to the following point: there are two key ways to scale a BDB-based solution for greater concurrency - either minimize the number of locks in your design, or add more hardware.
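To make the persistency trade-off concrete, here is a minimal sketch using the C++ API (db_cxx.h); the environment directory, file name, record size, and thread count are placeholders, and error handling is left to the default exception behaviour:

    #include <db_cxx.h>
    #include <cstdlib>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        // Environment configured for concurrent access; ./dbhome must already exist.
        DbEnv env(0);
        env.set_cachesize(0, 64 * 1024 * 1024, 1);        // 64 MB cache (placeholder)
        env.set_flags(DB_TXN_WRITE_NOSYNC, 1);            // trade durability for write speed
        env.open("./dbhome", DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK |
                             DB_INIT_LOG | DB_INIT_TXN | DB_THREAD, 0);

        Db db(&env, 0);
        db.open(nullptr, "test.db", nullptr, DB_BTREE,
                DB_CREATE | DB_AUTO_COMMIT | DB_THREAD, 0644);

        auto worker = [&db](int id) {
            std::string key = "key-" + std::to_string(id);
            std::string val(1024, 'x');                    // ~1 KB record, as in the question
            Dbt k((void*)key.data(), (u_int32_t)key.size());
            Dbt v((void*)val.data(), (u_int32_t)val.size());
            db.put(nullptr, &k, &v, 0);                    // autocommitted put

            Dbt out;
            out.set_flags(DB_DBT_MALLOC);                  // let BDB allocate the result buffer
            if (db.get(nullptr, &k, &out, 0) == 0)
                free(out.get_data());
        };

        std::vector<std::thread> threads;
        for (int t = 0; t < 8; ++t) threads.emplace_back(worker, t);
        for (auto& th : threads) th.join();

        db.close(0);
        env.close(0);
    }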
Doesn't this depend on the hardware as well as number of threads and stuff?
I would make a simple test, run it with an increasing number of threads hammering away, and see what seems best.
What I did when working against a database of unknown performance was to measure the turnaround time of my queries. I kept upping the thread count until turnaround time got worse, and dropping the thread count until turnaround time improved (well, it was processes in my environment, but whatever).
There were moving averages and all sorts of metrics involved, but the take-away lesson was: just adapt to how things are working at the moment. You never know when the DBAs will improve performance or hardware will be upgraded, or perhaps another process will come along to load down the system while you're running. So adapt.
Oh, and another thing: avoid process switches if you can - batch things up.
Oh, I should make this clear: this all happened at run time, not during development.
The way I understand things, Samba created tdb to allow "multiple concurrent writers" for any particular database file. So if your workload has multiple writers, your performance may be bad (as in, the Samba project chose to write its own system, apparently because it wasn't happy with Berkeley DB's performance in this case).
On the other hand, if your workload has lots of readers, then the question is how well your operating system handles multiple readers.