Vertica PlannedConcurrency - concurrency

I have been trying to tune the performance of queries running on a Vertica cluster by changing the value of PlannedConcurrency of the general resource pool. We have a cluster of 4 nodes with 32 cores/node.
According to Vertica docs,
Query budget = Queuing threshold of the GENERAL pool / PLANNEDCONCURRENCY
Increasing PlannedConcurrency should reduce the query budget, reserving less memory per query, which might lead to fewer queries being queued up.
Increasing the value of PlannedConcurrency does seem to improve query performance:
PlannedConcurrency = 256 gives better performance than 128, which performs better than AUTO.
Given that PlannedConcurrency is the preferred number of concurrently executing queries in the resource pool, how can this number be greater than the number of cores and still give better query performance?
Also, the difference between RESOURCE_ACQUISITIONS.MEMORY_INUSE_KB and QUERY_PROFILES.RESERVED_EXTRA_MEMORY should give the memory actually in use by a query.
However, this number does not remain constant for a single query when PlannedConcurrency is changed.
Can someone please help me understand why this memory usage differs with the value of PlannedConcurrency?
Thanks!
References:
https://my.vertica.com/blog/do-you-need-to-put-your-query-on-a-budgetba-p236830/
https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/AdministratorsGuide/ResourceManager/GuidelinesForSettingPoolParameters.htm

It's hard to give an exact answer without the actual queries.
But in general, increasing PlannedConcurrency means you reserve and allocate fewer resources per query and allow for greater concurrency.
If your use case has lots of small queries which don't require a lot of resources, it can improve things.
Also keep in mind that the CPU is not the only resource being used: queries also wait on I/O (disks, network, etc.), and that wait time is better spent running more queries.
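To see how the formula from the question plays out, here is a minimal Go sketch of the arithmetic. The queuing threshold below is a made-up placeholder; on a real cluster you would compare the result with what Vertica itself reports (for example the QUERY_BUDGET_KB column of RESOURCE_POOL_STATUS, if your version exposes it).

    package main

    import "fmt"

    func main() {
        // Placeholder queuing threshold for the GENERAL pool, in KB (not a real value).
        queuingThresholdKB := 200_000_000

        // Query budget = queuing threshold of the GENERAL pool / PLANNEDCONCURRENCY,
        // so raising PLANNEDCONCURRENCY shrinks the memory reserved per query.
        for _, pc := range []int{32, 128, 256} {
            budgetKB := queuingThresholdKB / pc
            fmt.Printf("PLANNEDCONCURRENCY=%3d -> query budget = %d KB (~%.2f GB)\n",
                pc, budgetKB, float64(budgetKB)/1e6)
        }
    }

The smaller budget is what lets more queries run (or avoid queuing) at once; whether that helps or hurts depends on whether individual queries actually needed the larger budget.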

Related

What's the effect of the "Bytes Shuffled" metric from BigQuery on cost?

I'm optimizing a query in BigQuery, and I managed to reduce all performance metrics by a good margin except for the "Bytes Shuffled" metric, which increased from 3 GB to 3.56 GB.
I would like to know if the Bytes Shuffled metric has an impact on cost, and if so, by how much?
To understand that, you have to keep in mind the BigQuery architecture. It's more or less a MapReduce architecture.
Map steps can be done on a single node (filter, transform, ...). Reduce steps require communication between nodes (joins, subtractions, ...).
Of course, map operations are much more efficient than reduce operations (in memory only, no network communication, no synchronisation/waiting, ...).
Bytes shuffled are the bytes exchanged between the nodes.
The cost perspective is not simple to answer. If you pay as you use BigQuery (no slot reservation), there is no extra cost: the same volume of data is processed, therefore no impact on the bill, only a slower query.
If you have reserved slots (slots play a role similar to nodes), there is no extra cost either. But you keep the slots busy longer (the query is slower, so the slot usage lasts longer), and if you share the slots with other users/queries/projects, it can impact the overall performance and maybe the overall cost of your projects.
So, no direct cost, but keep the impact on query duration in mind at the global level.

Redshift CPU utilisation is 100 percent most of the time

I have a 96 vCPU Redshift ra3.4xlarge 8-node cluster, and most of the time the CPU utilisation is 100 percent. It was a dc2.large 3-node cluster before, which was also always at 100 percent; that's why we upgraded to ra3. We are doing most of our computation on Redshift, but the data is not that big! I read somewhere that it doesn't matter how much compute you add, unless it's a significant amount there will only be a slight improvement. Can anyone explain this?
I can give it a shot. Having 100% CPU for long stretches of time is generally not a good (optimal) thing in Redshift. You see, Redshift is made for performing analytics on massive amounts of structured data. To do this it utilizes several resources: disks/disk IO bandwidth, memory, CPU, and network bandwidth. If your workload is well matched to Redshift, your utilization of all these things will average around 60%. Sometimes CPU bound, sometimes memory bound, sometimes network bandwidth bound, etc. Lots of data being read means disk IO bandwidth is at a premium; lots of redistribution of data means network IO bandwidth is the constraint. If you are using all these factors above 50% capacity, you are getting what you paid for. Once any of these factors gets to 100%, there is a significant drop-off in performance as working around the oversubscribed item steals performance.
Now you are in a situation where you see 100% CPU for a significant portion of the operating time, right? This means you have all these other attributes you have paid for but are not using, AND inefficiencies are being incurred to manage through this (though of all the factors, high CPU has the least overhead). The big question is why.
There are a few possibilities, but the most likely, in my experience, is inefficient queries. An example might be the best way to explain this. I've seen queries that are intended to find all the combinations of certain factors from several tables. So they cross join these tables, but this produces lots of repeats, so they add DISTINCT; problem solved. But this still creates all the duplicates and then reduces the set down: all the work is being done and most of the results are thrown away. However, if they pared down the factors in the tables first, then cross joined them, the total work would be significantly lower. This pattern will do exactly what you are seeing: high CPU as it spins, making repeated combinations and then throwing most of them away.
If you have many of this type of "fat in the middle" query, where lots of extra data is produced and immediately reduced, you won't get a lot of benefit from adding CPU resources. Things will get 2X faster with 2X the cluster size, but you are buying 2X of all these other resources that aren't helping you. You would expect that buying 2X CPU and 2X memory and 2X disk IO etc. would give you much more than a 2X improvement. Being constrained on one thing makes scaling costly. Also, you are unlikely to see the CPU utilization come down, as your queries just "spin the tires" of the CPU. More CPUs will just mean you can run more queries, resulting in more tires spinning.
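A toy illustration of that "fat in the middle" shape, in plain Go with made-up inputs rather than actual Redshift SQL: building every combination first and deduplicating afterwards does all the work and then throws most of it away, while reducing each input to its distinct values first yields the same result from a far smaller intermediate set.

    package main

    import "fmt"

    // distinct returns the unique values of a column, preserving order.
    func distinct(xs []string) []string {
        seen := map[string]bool{}
        var out []string
        for _, x := range xs {
            if !seen[x] {
                seen[x] = true
                out = append(out, x)
            }
        }
        return out
    }

    func main() {
        // Made-up "factor" columns: many rows, few distinct values.
        var colors, sizes []string
        for i := 0; i < 2000; i++ {
            colors = append(colors, []string{"red", "green", "blue"}[i%3])
            sizes = append(sizes, []string{"S", "M", "L", "XL"}[i%4])
        }

        // Cross join first, DISTINCT afterwards: huge intermediate set.
        combos := map[[2]string]bool{}
        intermediate := 0
        for _, c := range colors {
            for _, s := range sizes {
                intermediate++
                combos[[2]string{c, s}] = true
            }
        }
        fmt.Printf("cross join then DISTINCT: %d intermediate rows -> %d results\n",
            intermediate, len(combos))

        // Pare down first, then cross join: same answer, tiny intermediate set.
        dc, ds := distinct(colors), distinct(sizes)
        fmt.Printf("DISTINCT first, then cross join: %d intermediate rows -> %d results\n",
            len(dc)*len(ds), len(dc)*len(ds))
    }
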
Now the above is just my #1 guess based on my consulting experience. It could be that your workload just isn't right for Redshift. I've seen people try to put many small database problems into Redshift, thinking that since it's powerful it must be good at this too. They turn up the slot count to try to pump more work into Redshift but just create more issues. Or I've seen people try to run transactional workloads. Or ... If you have the wrong tool for the job, it may not work well. One 6-ton dump truck isn't the same thing as a 50-motorcycle delivery team; each has its purpose, but they aren't interchangeable.
Another possibility is that you have a very unusual workload but Redshift is still the best tool for the job. You don't need all the strengths of Redshift, but that's OK; you are getting the job done at an appropriate cost. In this case 100% CPU is just how your workload uses Redshift. It's not a problem, just reality. Now, I doubt this is the case, but it is possible. I'd want to be sure I'm getting all the value from the money I'm spending before assuming everything is OK.

RocksDB compaction: how to reduce data size and use more than 1 CPU core?

I'm trying to use RocksDB to store billions of records, so the resulting databases are fairly large - hundreds of gigabytes, several terabytes in some cases. The data is initially imported from a different service snapshot and updated from Kafka afterwards, but that's beside the point.
There are two parts of the problem:
Part 1) The initial data import takes hours with autocompactions disabled (it takes days if I enable them). After that I reopen the database with autocompactions enabled, but they aren't triggered automatically when the DB is opened, so I have to trigger one manually with CompactRange(Range{nil, nil}) in Go.
Manual compaction takes almost as long, with only one CPU core being busy; during compaction the overall size of the DB grows 2x-3x, but it ends up at around 0.5x of the original size.
Question 1: Is there a way to avoid 2x-3x data size growth during compaction? It becomes a problem when the data size reaches terabytes. I use the default Level Compaction, which according to the docs "optimizes disk footprint vs. logical database size (space amplification) by minimizing the files involved in each compaction step".
Question 2: Is it possible to engage more CPU cores for the manual compaction? It looks like only one is used at the moment (even though MaxBackgroundCompactions = 32). It would speed up the process A LOT, as there are no writes during the initial manual compaction; I just need to prepare the DB without waiting for days.
Would it work with several goroutines working on different sets of keys instead of just one goroutine working on all keys? If yes, what's the best way to divide the keys into these sets? (A rough sketch of this idea follows the question.)
Part 2) Even after this manual compaction, RocksDB seems to perform autocompaction later, after I start adding/updating the data, and once it's done the DB size gets even smaller: around 0.4x compared to the size before the manual compaction.
Question 3: What's the difference between manual and autocompaction, and why does autocompaction seem to be more effective in terms of the resulting data size?
My project is in Go, but I'm more or less familiar with RocksDB C++ code and I couldn't find any answers to these questions in the docs or in the source code.
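For reference, here is a rough sketch of the parallel manual compaction idea from Question 2, assuming the tecbot/gorocksdb binding (adjust to whatever binding you actually use). The path and the key boundaries are placeholders, and whether the per-range compactions really run concurrently depends on your options and on how your RocksDB version schedules manual compactions, so treat this as an experiment to measure rather than a guaranteed speed-up.

    package main

    import (
        "log"
        "sync"

        "github.com/tecbot/gorocksdb"
    )

    func main() {
        opts := gorocksdb.NewDefaultOptions()
        opts.SetMaxBackgroundCompactions(32) // same setting mentioned in the question
        opts.IncreaseParallelism(32)

        db, err := gorocksdb.OpenDb(opts, "/path/to/db") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Placeholder split points: pick boundaries that divide your real key
        // space into roughly equal chunks. nil means "open ended" on that side.
        boundaries := [][]byte{nil, []byte("4"), []byte("8"), []byte("c"), nil}

        var wg sync.WaitGroup
        for i := 0; i+1 < len(boundaries); i++ {
            start, limit := boundaries[i], boundaries[i+1]
            wg.Add(1)
            go func(start, limit []byte) {
                defer wg.Done()
                // Compact only this slice of the key space.
                db.CompactRange(gorocksdb.Range{Start: start, Limit: limit})
            }(start, limit)
        }
        wg.Wait()
    }

Picking the boundaries is the other half of the problem; one pragmatic approach is to open an iterator and sample every Nth key to find split points that divide the key space into roughly equal chunks.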

WSO2 CEP performance

I have a few questions related to WSO2 CEP performance.
How many events can it process per second?
How many execution plans can it handle at one time without a huge variation in performance?
What is the maximum number of receivers and publishers which can be added to CEP?
What is the maximum number of execution plans which can be added to CEP?
Everything depends on the scenario, so a specific number may not be applicable to your case. Using the Thrift wso2event protocol, CEP can process over 100,000 events per second. Depending on the complexity of the query, the performance numbers could change. You also need to consider resource allocation such as the memory heap size, etc. Allocated memory becomes significant if the events are large. Therefore, things like the number of receivers, publishers and execution plans you can handle depend on the complexity of the processing. As these are very dynamic situations, you can tune the CEP instance for your scenario; please refer to Performance Tuning Recommendations [1] for more details. For instance, if you need to achieve very high throughput but are not concerned about latency, you can increase the QueueSize in data-agent-config. Depending on the size of the events, you may sometimes have to increase the heap memory as well.
[1] https://docs.wso2.com/display/CEP400/Performance+Tuning+Recommendations

BerkeleyDB Concurrency

What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support?
How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention?
I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency.
My application is pretty simple, I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting.
It depends on what kind of application you are building. Create a representative test scenario, and start hammering away. Then you will know the definitive answer.
Besides your use case, it also depends on CPU, memory, front-side bus, operating system, cache settings, etcetera.
Seriously, just test your own scenario.
If you need some numbers (that actually may mean nothing in your scenario):
Oracle Berkeley DB: Performance Metrics and Benchmarks
Performance Metrics & Benchmarks: Berkeley DB
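To make the "just test your own scenario" advice concrete, here is a rough Go harness for the kind of sweep being suggested: run the same mixed get/put workload of ~1 KB records with an increasing number of workers and watch where throughput stops improving. The store below is only a stand-in; in practice you would route Get and Put through whatever BDB wrapper or binding your application actually uses.

    package main

    import (
        "fmt"
        "math/rand"
        "sync"
        "sync/atomic"
        "time"
    )

    // store is a stand-in for your real BDB wrapper (gets and puts of ~1 KB records).
    type store struct {
        mu sync.Mutex
        m  map[int][]byte
    }

    func (s *store) Put(k int, v []byte) {
        s.mu.Lock()
        s.m[k] = v
        s.mu.Unlock()
    }

    func (s *store) Get(k int) []byte {
        s.mu.Lock()
        defer s.mu.Unlock()
        return s.m[k]
    }

    // sweep runs a mixed read/write workload with the given number of workers
    // for duration d and returns the observed throughput in operations/second.
    func sweep(workers int, d time.Duration) float64 {
        s := &store{m: make(map[int][]byte)}
        val := make([]byte, 1024) // ~1 KB records, as in the question
        var ops int64
        stop := time.Now().Add(d)

        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            wg.Add(1)
            go func(seed int64) {
                defer wg.Done()
                r := rand.New(rand.NewSource(seed))
                for time.Now().Before(stop) {
                    k := r.Intn(100000)
                    if r.Intn(100) < 20 { // e.g. 20% writes, 80% reads
                        s.Put(k, val)
                    } else {
                        s.Get(k)
                    }
                    atomic.AddInt64(&ops, 1)
                }
            }(int64(w))
        }
        wg.Wait()
        return float64(ops) / d.Seconds()
    }

    func main() {
        for _, w := range []int{1, 2, 4, 8, 16, 32, 64} {
            fmt.Printf("%2d workers: %.0f ops/sec\n", w, sweep(w, 3*time.Second))
        }
    }

The shape of the resulting curve (where it flattens or turns down) is the answer to "how many threads", and it will shift with page size, cache size, and lock configuration, which is exactly why the answers above push for measuring your own workload.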
I strongly agree with Daan's point: create a test program, and make sure the way in which it accesses data mimics as closely as possible the patterns you expect your application to have. This is extremely important with BDB because different access patterns yield very different throughput.
Other than that, these are the general factors I found to have a major impact on throughput:
Access method (which in your case I guess is BTREE).
The level of persistence with which you configured BDB (for example, in my case the 'DB_TXN_WRITE_NOSYNC' environment flag improved write performance by an order of magnitude, but it compromises durability).
Does the working set fit in cache?
Number of reads vs. writes.
How spread out your access is (remember that BTREE has page-level locking, so accessing different pages with different threads is a big advantage).
Access pattern, meaning how likely threads are to block one another, or even deadlock, and what your deadlock resolution policy is (this one may be a killer).
Hardware (disk, and memory for the cache).
This amounts to the following point:
Scaling a BDB-based solution for greater concurrency comes down to two key approaches: either minimize the number of locks in your design, or add more hardware.
Doesn't this depend on the hardware as well as the number of threads and so on?
I would make a simple test, run it with increasing numbers of threads hammering away, and see what seems best.
What I did when working against a database of unknown performance was to measure turnaround time on my queries. I kept upping the thread count as long as turnaround time improved, and backed it off when turnaround time started to suffer (well, it was processes in my environment, but whatever).
There were moving averages and all sorts of metrics involved, but the take-away lesson was: just adapt to how things are working at the moment. You never know when the DBAs will improve performance or hardware will be upgraded, or perhaps another process will come along to load down the system while you're running. So adapt.
Oh, and another thing: avoid process switches if you can - batch things up.
Oh, I should make this clear: this all happened at run time, not during development.
The way I understand things, Samba created tdb to allow "multiple concurrent writers" for any particular database file. So if your workload has multiple writers your performance may be bad (as in, the Samba project chose to write its own system, apparently because it wasn't happy with Berkeley DB's performance in this case).
On the other hand, if your workload has lots of readers, then the question is how well your operating system handles multiple readers.