I'm working on a project in AWS Redshift with a few billion rows where the main queries are rollups on time units. The current implementation has MVs (materialized views) for all these rollups. It seems to me that if Redshift is all it's cracked up to be, and the dist and sort keys are defined correctly, the MVs should not be necessary and we could avoid their costs in extra storage and maintenance (refresh). I'm wondering if anyone has analyzed this in a similar application.
You're thinking along the right path but the real world doesn't always allow for 'just do it better'.
You are correct that sometimes MVs are just used to forgo the effort of optimizing a complex query, but sometimes not. The selection of keys, especially the distribution key, is a compromise between optimizing different workloads. Distribute one way and query A gets faster but query B gets slower. But if the results of query B don't need to be completely up to date, one can make an MV out of B and only pay the price on refresh.
Sometimes queries are very complex and time consuming (and not because they aren't optimized). If the results of such a query don't need to include the latest info to be valid, an MV can make the cost of this query infrequent. [In reality MVs often represent complex subqueries that are referenced by a number of other queries, which accentuates the frequent vs. infrequent value of the MV.]
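To make that concrete, here is a minimal sketch of the kind of time-unit rollup MV being discussed, assuming a hypothetical fact table events(event_ts, account_id, amount); the names and the refresh policy are illustrative, not taken from the original setup:

```sql
-- Hypothetical daily rollup; table and column names are illustrative only.
CREATE MATERIALIZED VIEW daily_rollup_mv
AUTO REFRESH YES            -- or leave this off and refresh on your own schedule
AS
SELECT DATE_TRUNC('day', event_ts) AS event_day,
       account_id,
       COUNT(*)    AS event_count,
       SUM(amount) AS total_amount
FROM events
GROUP BY 1, 2;

-- Consumers only pay for scanning the (much smaller) rollup:
SELECT event_day, SUM(total_amount)
FROM daily_rollup_mv
GROUP BY 1;

-- Without AUTO REFRESH, refresh explicitly at low-usage times:
REFRESH MATERIALIZED VIEW daily_rollup_mv;
```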
Sometimes query types don't match well to Redshift's distributed, columnar nature and just don't perform well. Again, current-ness of data can be played off against cluster workload and these queries can be run at low usage times.
With all that said I think you are on the right path as I've also been trying to get people to see that many, many queries are just poorly written. Too often in the data world functionally correct equals done and in reality this is only half done. I've rewritten queries that were taking 90 minutes to execute (browning out the cluster when they ran) and got them down to 17 seconds. So keep up the good fight but use MVs as a last resort when compromise is the only solution.
I have a 96 vCPU Redshift ra3.4xlarge 8-node cluster, and most of the time the CPU utilisation is 100 percent. It was a dc2.large 3-node cluster before; that was also always at 100 percent, which is why we moved up to ra3. We are doing most of our computes on Redshift, but the data is not that much! I read somewhere that no matter how much compute you add, unless you increase it significantly there will only be a slight improvement in the computation. Can anyone explain this?
I can give it a shot. Having 100% CPU for long stretches of time is generally not a good (optimal) thing in Redshift. You see, Redshift is made for performing analytics on massive amounts of structured data. To do this it utilizes several resources - disks/disk IO bandwidth, memory, CPU, and network bandwidth. If your workload is well matched to Redshift, your utilization of all these things will average around 60%. Sometimes CPU bound, sometimes memory bound, sometimes network bandwidth bound, etc. Lots of data being read means disk IO bandwidth is at a premium; lots of redistribution of data means network IO bandwidth is constraining. If you are using all these factors above 50% capacity you are getting what you paid for. Once any of these factors gets to 100% there is a significant drop-off in performance, as working around the oversubscribed item steals performance.
Now you are in a situation where you are seeing 100% for a significant portion of the operating time, right? This means you have all these other attributes you have paid for but are not using, AND inefficiencies are being realized to manage through this (though of all the factors, high CPU has the least overhead). The big question is why.
There are a few possibilities but the most likely, in my experience, is inefficiently written queries. An example might be the best way to explain this. I've seen queries that are intended to find all the combinations of certain factors from several tables. So they cross join these tables, but this produces lots of repeats, so they add DISTINCT - problem solved. But this still creates all the duplicates and then reduces the set down. All the work is being done and most of the results thrown away. However, if they pared down the factors in the tables first, then cross joined them, the total work would be significantly lower. This example will do exactly what you are seeing: high CPU as it spins making repeat combinations and then throwing most of them away.
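A sketch of that anti-pattern and the rewrite (table and column names are invented for illustration):

```sql
-- Anti-pattern: build every combination first, then throw most of it away.
SELECT DISTINCT a.color, b.size, c.region
FROM products a
CROSS JOIN skus b
CROSS JOIN stores c;

-- Rewrite: pare each input down to its distinct factors first, then combine.
WITH colors  AS (SELECT DISTINCT color  FROM products),
     sizes   AS (SELECT DISTINCT size   FROM skus),
     regions AS (SELECT DISTINCT region FROM stores)
SELECT color, size, region
FROM colors
CROSS JOIN sizes
CROSS JOIN regions;
```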
If you have many of this type of "fat in the middle" query, where lots of extra data is made and immediately reduced, you won't get a lot of benefit from adding CPU resources. Things will get 2X faster with 2X the cluster size, but you are buying 2X of all these other resources that aren't helping you. You would expect that buying 2X CPU and 2X memory and 2X disk IO etc. would give you much more than a 2X improvement. Being constrained on 1 thing makes scaling costly. Also, you are unlikely to see the CPU utilization come down, as your queries just "spin the tires" of the CPU. More CPUs will just mean you can run more queries, resulting in more tires spinning.
Now the above is just my #1 guess based on my consulting experience. It could be that your workload just isn't right for Redshift. I've seen people try to put many small database problems into Redshift thinking that it's powerful so it must be good at this too. They turn up the slot count to try to pump more work into Redshift but just create more issues. Or I've seen people try to run transactional workloads. Or ... If you have the wrong tool for the job it may not work well. One 6-ton dump truck isn't the same thing as a 50-motorcycle delivery team - each has its purpose but they aren't interchangeable.
Another possibility is that you have a very unusual workload but Redshift is still the best tool for the job. You don't need all the strengths of Redshift but this is ok; you are getting the job done at an appropriate cost. In this case 100% CPU is just how your workload uses Redshift. It's not a problem, just reality. Now I doubt this is the case, but it is possible. I'd want to be sure I'm getting all the value from the money I'm spending before assuming everything is ok.
We have a bucket in S3 where we store thousands of records every day (we end up having many GBs of data that keep increasing) and we want to be able to run Athena queries on them.
The data in S3 is stored in patterns like this: s3://bucket/Category/Subcategory/file.
There are multiple categories (more than 100) and each category has 1-20 subcategories. All the files we store in S3 (in Apache Parquet format) include sensor readings. There are categories with millions of sensor readings (sensors send thousands per day) and categories with just a few hundred readings (sensors send on average a few readings per month), so the data is not split evenly across categories. A reading includes a timestamp, a sensorid and a value among other things.
We want to run Athena queries on this bucket's objects, based on date and sensorid with the lowest cost possible. e.g.: Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.
What is the best way to partition our Athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries? We have the freedom to save one reading per file, resulting in millions of files (we would be able to easily partition per sensorid or date, but what about performance if we have millions of files per day?), or multiple readings per file (far fewer files, but not able to directly partition per sensorid or date, because not all readings in a file are from the same sensor and we need to save them in the order they arrive). Is Athena a good solution for our case or is there a better alternative?
Any insight would be helpful. Thank you in advance
Some comments.
Is Athena a good solution for our case or is there a better alternative?
Athena is great when you don't need or want to set up a more sophisticated big data pipeline: you simply put (or already have) your data in S3, and you can start querying it immediately. If that's enough for you, then Athena may be enough for you.
Here are a few things that are important to consider to properly answer that specific question:
How often are you querying? (i.e., is it worth having some sort of big data cluster running non-stop, like an EMR cluster? or is it better to just pay when you query, even if it means that per query your cost could end up higher?)
How much flexibility do you want when processing the dataset? (i.e., does Athena offer all the capabilities you need?)
What are all the data stores that you may want to query "together"? (i.e., is all the data in S3, and will it stay there? or do you or will you have data in other services such as DynamoDB, Redshift, EMR, etc...?)
Note that none of these answers would necessarily say "don't use Athena" — they may just suggest what kind of path you may want to follow going forward. In any case, since your data is in S3 already, in a format suitable for Athena, and you want to start querying it already, Athena is a very good choice right now.
Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.
In both examples, you are filtering by category. This suggests that partitioning by category may be a good idea (whether you're using Athena or not!). You're doing that already, by having /Category/ as part of the objects' keys in S3.
One way to identify good candidates for partitioning schemes is to think about all the queries (at least the most common ones) that you're going to run, and check which equality filters or groupings they use. E.g., thinking in terms of SQL, columns that often show up as WHERE XXX = ? are natural partition keys.
Maybe you have many more different types of queries, but I couldn't help but notice that both your examples had filters on category, thus it feels "natural" to partition by category (like you did).
Feel free to add a comment with other examples of common queries if that was just some coincidence and filtering by category is not as important/common as the examples suggest.
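To illustrate the idea (all names here are hypothetical; adjust to your real schema), a table partitioned on category over the existing s3://bucket/Category/Subcategory/ layout could look like this. Note that because the prefixes are not in the Hive-style category=.../ form, the partitions would have to be registered explicitly:

```sql
-- Hypothetical DDL; column names are illustrative.
CREATE EXTERNAL TABLE readings (
  sensorid  string,
  ts        timestamp,
  value     double
)
PARTITIONED BY (category string, subcategory string)
STORED AS PARQUET
LOCATION 's3://bucket/';

-- The prefixes are Category/Subcategory/ rather than the Hive-style
-- category=.../subcategory=.../, so each partition is registered with its location:
ALTER TABLE readings ADD IF NOT EXISTS
  PARTITION (category = 'cat_a', subcategory = 'sub_1')
  LOCATION 's3://bucket/cat_a/sub_1/';

-- Queries that filter on the partition column only read the matching prefixes:
SELECT sensorid, ts, value
FROM readings
WHERE category = 'cat_a'
  AND ts >= timestamp '2020-01-01 00:00:00'
  AND value > 42.0;
```

(Athena's partition projection feature can avoid registering partitions one by one, but whether it fits depends on how predictable your category and subcategory values are.)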
What is the best way to partition our athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries?
There's hardly a single (i.e., best) answer here. It's always a trade-off based on lots of characteristics of the data set (structure; size; number of records; growth; etc) and the access patterns (proportion of reads and writes; kinds of writes, e.g. append-only, updates, removals, etc; presence of common filters among a large proportion of queries; which queries you're willing to sacrifice in order to optimize others; etc).
Here's some general guidance (not only for Athena, but in general, in case you decide you may need something other than Athena).
There are two very important things to focus on to optimize a big data environment:
Reduce I/O (it is slow).
Spread work evenly across all "processing units" you have, ideally fully utilizing each of them.
Here's why they matter.
First, for a lot of "real world access patterns", I/O is the bottleneck: reading from storage is many orders of magnitude slower than filtering a record in the CPU. So try to focus on reducing the amount of I/O. This means both reducing the volume of data read as well as reducing the number of individual I/O operations.
Second, if you end up with an uneven distribution of work across multiple workers, it may happen that some workers finish quickly but other workers take much longer, and their work cannot be divided further. This is also a very common issue. In this case, you'll have to wait for the slowest worker to complete before you can get your results. When you ensure that all workers are doing an equivalent amount of work, they'll all be working at near 100% and they'll all finish at approximately the same time. This way, you don't have to keep waiting for the slower ones.
Things to have in mind to help with those goals:
Avoid files that are too big or too small.
If you have a huge number of tiny files, then your analytics system will have to issue a huge number of I/O operations to retrieve data. This hurts performance (and, in case of S3, in which you pay per request, can dramatically increase cost).
If you have a small number of huge files, depending on the characteristics of the file format and the worker units, you may end up not being able to parallelize work too much, which can cause performance to suffer.
Try to keep the file sizes uniform, so that you don't end up with a worker unit finishing too quickly and then idling (may be an issue in some querying systems, but not in others).
Keeping files in the range of "a few GB per file" is usually a good choice.
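One hedged way to get there is to periodically compact the small incoming files with an Athena CTAS query that rewrites them as fewer, larger Parquet files. The table names, target location and WITH properties below are assumptions from memory, so verify them against the Athena documentation for your engine version, and note that a single CTAS can only write a limited number of partitions (on the order of 100), so with 100+ categories the compaction may need to run in batches:

```sql
-- Hedged sketch: compact many small raw files into larger, partitioned Parquet files.
-- 'readings_raw', the target location, and the property names are assumptions.
CREATE TABLE readings_compacted
WITH (
  format            = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://bucket/compacted/',
  partitioned_by    = ARRAY['category']   -- partition columns must come last in the SELECT
) AS
SELECT sensorid, ts, value, category
FROM readings_raw;
```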
Use compression (and prefer splittable compression algos).
Compressing files greatly improves performance because it reduces I/O tremendously: most "real world" datasets have a lot of common patterns, thus are highly compressible. When data is compressed, the analytics system spends less time reading from storage — and the "extra CPU time" spent to decompress the data before it can truly be queried is negligible compared to the time saved on reading from storage.
Keep in mind that there are some compression algorithms that are non-splittable: it means that one must start from the beginning of the compressed stream to access some bytes in the middle. When using a splittable compression algorithm, it's possible to start decompressing from multiple positions in the file. There are multiple benefits, including that (1) an analytics system may be able to skip large portions of the compressed file and only read what matters, and (2) multiple workers may be able to work on the same file simultaneously, as they can each access different parts of the file without having to go over the entire thing from the beginning.
Notably, gzip is non-splittable (but since you mention Parquet specifically, keep in mind that the Parquet format may use gzip internally, and may compress multiple parts independently and just combine them into one Parquet file, leading to a structure that is splittable; in other words: read the specifics about the format you're using and check if it's splittable).
Use columnar storage.
That is, storing data "per columns" rather than "per rows". This way, a single large I/O operation will retrieve a lot of data for the column you need rather than retrieving all the columns for a few records and then discarding the unnecessary columns (reading unnecessary data hurts performance tremendously).
Not only do you reduce the volume of data read from storage, you also improve how fast a CPU can process that data, since you'll have lots of pages of memory with useful data, and the CPU has a very simple set of operations to perform — this can dramatically improve performance at the CPU level.
Also, by keeping data organized by columns, you generally achieve better compression, leading to even less I/O.
You mention Parquet, so this is taken care of. If you ever want to change it, remember about using columnar storage.
Think about queries you need in order to decide about partitioning scheme.
Like the category filtering in the example above, which was present in both queries you gave as examples.
When you partition like in the example above, you greatly reduce I/O: the querying system will know exactly which files it needs to retrieve, and will avoid having to read the entire dataset.
There you go.
This is just some high-level guidance. For more specific guidance, it would be necessary to know more about your dataset, but this should at least get you started in asking yourself the right questions.
I have a problem with SQLite. It seems that every call takes ~300ms to execute. After some testing I noticed that the delay is caused by transactions. 8 normal inserts with implicit transactions take about 2 seconds; however, if I start a transaction before the inserts and commit it after, I can do almost a million inserts in the same time. Calls affected include DROP TABLE, CREATE TABLE, INSERT and I assume others, too (probably all that implicitly begin a transaction).
Some more info:
Downloaded the source amalgamation from the SQLite website (3200100)
Compiled it using Visual Studio into a static library (Not using any compiler flags, although I have been playing around with them without luck)
I am using sqlite3_open16 followed by sqlite3_prepare16_v3 and then sqlite3_step to start execution and/or receive the first result
No multithreading, no access from multiple processes, database file is exclusively opened by this program
If I create the file on my SSD (960 EVO) instead, the "transaction delay" goes from 300ms down to 10ms. Still an absurdly high value, though, but I feel like the speed of my disk shouldn't influence whatever is slowing the transactions down?
The function that is blocking is sqlite3_step (It also annoys me that I have to call a function with that name just to execute a DROP TABLE, for example, but not that it matters)
Edit: During the transaction, the CPU usage is 100%.
On a side note, is it possible to "help" SQLite with organizing data if you know that every single row of your table will be exactly, say, 64 Byte?
I hope you can help me with this or can possibly recommend an alternative (relational, C++ API, file-based, highly performant).
Thank you very much!
SQLite makes a lot of effort to ensure it doesn't suffer data corruption, so with an implicit transaction you are limited by your hard disk speed.
With an explicit transaction, the data is written to other locations and only committed to disk once, which is much faster.
From the SQLite speed documentation:
With synchronization turned on, SQLite executes an fsync() system call (or the equivalent) at key points to make certain that critical data has actually been written to the disk drive surface.
When you use a transaction, the data is written to other files, and the fsync cost is paid only once, when all the data is committed. That is the price for that part of the configuration. A positive from this is that I have never suffered SQLite data loss through corruption.
I feel like the speed of my disk shouldn't influence whatever is slowing the transactions down?
This is an important trade-off. If you want improved data integrity, then the speed of your disk is relevant.
How long does committing a transaction take?
From the SQLite FAQ, item 19, on why transactions are slow:
SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second.
You can:
Use transactions to batch up more work. The cost is per transaction, so it can be amortized over many statements (see the sketch below).
Use temporary tables. Temporary tables do not suffer this penalty and will run at full speed.
NOT RECOMMENDED: use PRAGMA synchronous=OFF to disable the synchronous write.
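A minimal sketch of the first option in SQL terms; these are the statements your C++ code would feed through sqlite3_exec() / sqlite3_prepare16_v3() / sqlite3_step(), and the table name is made up:

```sql
-- Optional: WAL mode usually lowers the per-commit cost further.
PRAGMA journal_mode = WAL;

BEGIN;                                  -- one explicit transaction around the whole batch
INSERT INTO readings VALUES (1, 'a');
INSERT INTO readings VALUES (2, 'b');
-- ... thousands more rows, ideally one prepared INSERT that you
--     sqlite3_bind(), sqlite3_step() and sqlite3_reset() in a loop ...
COMMIT;                                 -- the fsync is paid once here, not once per INSERT
```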
I have 2 slightly different versions of a web crawler. I want to compare them in performance (specifically the time taken to crawl a given domain). I have considered these two options:
Run them one at a time, compare time taken.
Run both of them at the same time, compare time taken.
The drawback of 1 is that the network can be slower/faster when running the second one. The drawback of 2 is that one can hijack most of the bandwidth and seem to be working faster, while the other could work better given the same bandwidth.
I don't know how to (if it's possible) limit bandwidth (and CPU usage, maybe?) per process. If I could do that, I would give each a fair share and run them at the same time, so it could work.
Any ideas how to do this?
Select Option 1 and take a lot of samples. Run one for a week, then run the other for a week. The network bandwidth will of course vary, but should average out.
On another note, you'll probably want to find a way to throttle your crawler so it doesn't consume all your resources. Once you have that, option 2 becomes a better choice.
I'd like to ask fellow SO'ers for their opinions regarding best of breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).
Two basic types of time-series exist based on the sampling/discretisation characteristic:
Regular discretisation (every sample is taken with a common frequency)
Irregular discretisation (samples are taken at arbitrary time-points)
Queries that will be required:
All values in the time range [t0,t1]
All values in the time range [t0,t1] that are greater/less than v0
All values in the time range [t0,t1] that are in the value range [v0,v1]
The data sets consist of summarized time-series (which sort of gets over the Irregular discretisation), and multivariate time-series. The data set(s) in question are about 15-20TB in size, hence processing is performed in a distributed manner - because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.
Distributed processing in this context also means dispatching the required data specific computation along with the time-series query, so that the computation can occur as close to the data as is possible - so as to reduce node to node communications (somewhat similar to map/reduce paradigm) - in short proximity of computation and data is very critical.
Another issue that the index should be able to cope with is that the overwhelming majority of the data is static/historic (99.999...%); however, on a daily basis new data is added - think of "in the field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCHs, etc.) with as low a latency as possible; some of these running calculations require historical data, some of which will be more than what can be reasonably cached.
I've already considered HDF5; it works well/efficiently for smaller datasets but starts to drag as the datasets become larger, and there also aren't native parallel processing capabilities from the front-end.
Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)
You would probably want to use some type of large, balanced tree. Like Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU into these new "cache oblivious B-trees". For some discussion of the implementation of these things, look up Tokutek DB, they've got a number of good presentations like the following:
http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
Questions 2 and 3 are in general a lot harder, since they involve higher dimensional range searching. The standard data structure for doing this would be the range tree (which gives O(log^{d-1}(n)) query time, at the cost of O(n log^d(n)) storage). You generally would not want to use a k-d tree for something like this. While it is true that kd trees have optimal, O(n), storage costs, it is a fact that you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d=2, this would be O(sqrt(n)) time complexity; and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?)
Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then as a post process go through and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty,+infty] is orders of magnitude larger than the number of points in [t0,t1] x [v0, v1].
General ideas:
Problem 1 is fairly common: Create an index that fits into your RAM and has links to the data on the secondary storage (datastructure: B-Tree family).
Problems 2 and 3 are quite complicated, since your data is so large. You could partition your data into time ranges and calculate the min / max for each time range. Using that information, you can filter out time ranges (e.g., if the max value for a range is 50 and you search for v0 > 60, then the interval is out). The rest needs to be searched by going through the data; a sketch of that pruning step follows below. The effectiveness greatly depends on how fast the data is changing.
You can also do multiple indices by combining the time ranges of lower levels to do the filtering faster.
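To make the pruning step concrete: however you implement the structure in C/C++, if you imagine one summary row per stored time range (a hypothetical block_index with t_min/t_max/v_min/v_max columns), the filtering for query type 2 is equivalent to the following, and only the surviving blocks then need to be scanned in full:

```sql
-- Hypothetical summary table: one row per stored block / time range.
-- Find blocks that could contain points with t in [:t0, :t1] and value > :v0
-- (:t0, :t1, :v0 are query parameters).
SELECT block_id
FROM block_index
WHERE t_max >= :t0        -- the block overlaps the requested time range...
  AND t_min <= :t1
  AND v_max >  :v0;       -- ...and its max value does not rule it out
```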
It is going to be really time consuming and complicated to implement this by yourself. I recommend you use Cassandra.
Cassandra can give you horizontal scalability and redundancy, and will allow you to run complicated map reduce functions in the future.
To learn how to store time series in cassandra please take a look at:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
and http://www.youtube.com/watch?v=OzBJrQZjge0.