SpannerIO Batch Read Speed Affected by Slow Partition Reader - google-cloud-platform

We have been using SpannerIO.readAll to scan large amounts of data in a Google Dataflow setting. The ReadOperations passed to Spanner are created with withQuery(query), and batching is enabled with withBatching(true). I noticed that although the throughput is OK initially, it drops to a very low rate toward the end, probably because of outlier partitions that carry a larger amount of work. Looking at the BatchSpannerRead code, a single DoFn handles all of the batch scan work for a partition. In a perfect world the generated partitions would absorb these outliers, but in practice, would it make sense to re-split the work of those slow workers?
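For reference, here is a minimal sketch of the read path in question (Beam Java SDK; the instance, database and query are placeholders, and batching is enabled on the readAll transform in this sketch so that Spanner's BatchClient generates the partitions):
import com.google.cloud.spanner.Struct;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.spanner.ReadOperation;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
public class SpannerBatchReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // One ReadOperation per query; with batching enabled, each operation is split
    // into partitions and every partition is read by BatchSpannerRead's DoFn.
    PCollection<ReadOperation> operations = pipeline.apply(
        Create.of(ReadOperation.create().withQuery("SELECT * FROM my_table")));
    PCollection<Struct> rows = operations.apply(
        SpannerIO.readAll()
            .withInstanceId("my-instance")
            .withDatabaseId("my-database")
            .withBatching(true));
    pipeline.run();
  }
}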

Related

Redshift: experiencing slow query performance between 2 segments

We’re experiencing slow query performance on AWS Redshift. Frequently we see that queries can take ±12 seconds to run, but only very little time (<500ms) is spent actually executing the query (according to the AWS Redshift console for an individual query).
Querying svl_compile, we can confirm that the query plan has already been compiled.
In svl_query_report we see a long delay between the start times of 2 segments that accounts for the majority of the run time, although the segments themselves all execute very quickly (milliseconds).
There are a number of things that could be going on but I suspect network distribution is involved. Check STL_DIST.
Another possibility is that Redshift broke the query up and a subquery is running during that window. This can happen with very complex queries. Review the plan and see if there are any references to computer-generated table names (I think they begin with 't' but this is just from memory).
Spilling to disk could be happening but this seems unlikely given what you have said so far. Also, queuing delays don't seem like a match. Both are possible but not likely.
If you post more info about how the query is running, things will narrow down. The actual execution report, the explain plan, and/or logging table info would help home in on what is happening during this time window.
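For example, a hypothetical JDBC sketch along these lines pulls the per-segment/step report and the distribution activity for one query id (the cluster endpoint, credentials and query id are placeholders; SELECT * is used to avoid depending on exact column lists):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
public class RedshiftQueryDiagnostics {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:redshift://my-cluster:5439/mydb", "user", "password")) {
      int queryId = 123456; // the query id from the console or STL tables
      // Per-segment/step timing for the query.
      dump(conn, "SELECT * FROM svl_query_report WHERE query = ? ORDER BY segment, step", queryId);
      // Network distribution activity recorded during the query.
      dump(conn, "SELECT * FROM stl_dist WHERE query = ?", queryId);
    }
  }
  private static void dump(Connection conn, String sql, int queryId) throws Exception {
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setInt(1, queryId);
      try (ResultSet rs = ps.executeQuery()) {
        int cols = rs.getMetaData().getColumnCount();
        while (rs.next()) {
          StringBuilder row = new StringBuilder();
          for (int i = 1; i <= cols; i++) row.append(rs.getString(i)).append('\t');
          System.out.println(row);
        }
      }
    }
  }
}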

Redshift Query Performance to reduce CPU utilisation

I want to get a general idea of how I can optimise query performance in a Redshift database. I have huge queries with lots of joins. I understand that it can be achieved using sort and dist keys, but is there a method we can follow in order to get optimal results?
What to look for in a table, and how to approach query optimisation in Redshift?
What are the necessary steps or approach to follow in order to arrive at a concrete plan for optimisation?
Any guidance will help a lot
Having improved many queries on Redshift, there are a few things I can point you towards. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read an EXPLAIN plan and find the expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
So with those understood, let's look at where many queries go sideways on Redshift. I will try to list these in Pareto order, but any of them, or combinations, can create significant issues.
#1 - Fat-in-the-middle queries. When joining, it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen but isn't how it usually happens. If the join-on conditions create a many-to-many join pattern, the number of rows can expand. When the table sizes are very large, this "multiplication" can produce absurd data sizes. The explain plan can show this, but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also need to look at pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and is not well set up for the creation of data - this includes during the execution of a query.
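As a sketch of that COUNT(*) probe (the connection details and the two-table join tree are made up for illustration; run it over your actual join tree with the SELECT list, DISTINCT and GROUP BY stripped off):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class JoinTreeSizeProbe {
  public static void main(String[] args) throws Exception {
    String probe =
        "SELECT COUNT(*) "
      + "FROM orders o "
      + "JOIN line_items li ON li.order_id = o.order_id "
      + "WHERE o.order_date >= '2023-01-01'";
    try (Connection conn = DriverManager.getConnection(
            "jdbc:redshift://my-cluster:5439/mydb", "user", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(probe)) {
      rs.next();
      // If this count is wildly larger than either input table, the join
      // conditions are fanning rows out - the "fat in the middle" case.
      System.out.println("Rows flowing out of the join tree: " + rs.getLong(1));
    }
  }
}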
#2 - Distribution of large amounts of data. Redshift is a cluster, and the nodes are connected together by ethernet cables; these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the types of joins you are getting - is distribution needed? How much data is being distributed? Also, GROUP BY (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time, and the order in which they happen can matter. The query optimizer is making a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes are based on what it sees in the table metadata, so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way too many interactions to itemize them here. The best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several, with temp tables that have their distribution set so that data movement is minimized.
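As a sketch of that decomposition idea, under the assumption of a hypothetical pair of large tables joined on customer_id: pare each side down into a temp table declared with a matching DISTKEY, so the expensive join can run without redistributing either side across the network:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
public class DistKeyStagingSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
            "jdbc:redshift://my-cluster:5439/mydb", "user", "password");
         Statement stmt = conn.createStatement()) {
      // Pare each side down first, co-locating both on the join key.
      stmt.execute(
          "CREATE TEMP TABLE recent_orders DISTKEY(customer_id) AS "
        + "SELECT customer_id, order_id, total "
        + "FROM orders WHERE order_date >= '2023-01-01'");
      stmt.execute(
          "CREATE TEMP TABLE active_customers DISTKEY(customer_id) AS "
        + "SELECT customer_id, segment FROM customers WHERE is_active");
      // The join key matches both DISTKEYs, so this join can stay on-node.
      stmt.execute(
          "CREATE TEMP TABLE order_report AS "
        + "SELECT a.segment, SUM(r.total) AS total "
        + "FROM recent_orders r JOIN active_customers a USING (customer_id) "
        + "GROUP BY a.segment");
    }
  }
}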
#3 - Excessive IO traffic. While not as slow as the network, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Is the metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (A redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that reduce the data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk, it can bring disk IO performance down considerably. Use your metadata and WHERE clauses well.
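To make the redundant WHERE clause idea concrete, here is a hypothetical example assuming an events table sorted on event_date and joined to a small date dimension; the extra predicate on the sort key changes nothing functionally, but it gives Redshift a literal range to compare against block metadata at scan time:
public class RedundantWhereSketch {
  // Functionally, the join to date_dim already restricts the rows...
  static final String ORIGINAL =
      "SELECT e.user_id, COUNT(*) "
    + "FROM events e JOIN date_dim d ON d.date_id = e.event_date "
    + "WHERE d.fiscal_quarter = '2023-Q1' "
    + "GROUP BY e.user_id";
  // ...but repeating the restriction against the sort key itself lets the
  // scan skip blocks whose metadata falls outside the range.
  static final String WITH_REDUNDANT_FILTER =
      "SELECT e.user_id, COUNT(*) "
    + "FROM events e JOIN date_dim d ON d.date_id = e.event_date "
    + "WHERE d.fiscal_quarter = '2023-Q1' "
    + "AND e.event_date BETWEEN '2023-01-01' AND '2023-03-31' "
    + "GROUP BY e.user_id";
}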
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly, these factors I've listed are cluster-wide "resources" of Redshift. If one query takes up a lot of one of these, then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad), and when their pattern is costly on one axis, many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that you can look at to see aggregated IO or network traffic and observe these effects.
Good queries:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible; the trick is to not oversubscribe the cluster resources you have.

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with the Gradient Boosted Trees learning algorithm from the ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting GBT ensemble in human-readable format. My experience so far is that, for my problem size, adding more resources to the cluster seems to have no effect on the length of the run. A 10-iteration training run seems to take roughly 13 hrs. This isn't acceptable, since I'm looking to do 100-300 iteration runs, and the execution time seems to explode with the number of iterations.
My Spark application
This isn't the exact code, but it can be reduced to:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.GradientBoostedTrees;
import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
SparkConf sc = new SparkConf().setAppName("GBT Trainer")
// unlimited max result size for intermediate Map-Reduce ops.
// Having no limit is probably bad, but I've not had time to find
// a tighter upper bound and the default value wasn't sufficient.
.set("spark.driver.maxResultSize", "0");
JavaSparkContext jsc = new JavaSparkContext(sc);
// The input file is encoded in plain-text LIBSVM format ~59GB in size
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(10);
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(1);
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
// Somewhat-convoluted code below reads in the Parquet-formatted output
// of the GBT model and writes it back out as json.
// There might be cleaner ways of achieving the same, but since output
// size is only a few KB I feel little guilt leaving it as is.
// serialize and output the GBT classifier model the only way that the library allows
String outputPath = "s3://somebucket/somekeyprefex";
model.save(jsc.sc(), outputPath + "/parquet");
// read in the parquet-formatted classifier output as a generic DataFrame object
SQLContext sqlContext = new SQLContext(jsc);
DataFrame outputDataFrame = sqlContext.read().parquet(outputPath + "/parquet");
// output DataFrame-formatted classifier model as json
outputDataFrame.write().format("json").save(outputPath + "/json");
Question
What is the performance bottleneck with my Spark application (or with GBT learning algorithm itself) on input of that size and how can I achieve greater execution parallelism?
I'm still a novice Spark dev, and I'd appreciate any tips on cluster configuration and execution profiling.
More details on the cluster setup
I'm running this app on an AWS EMR cluster (emr-4.0.0, YARN cluster mode) of r3.8xlarge instances (32 cores, 244GB RAM each). I'm using such large instances in order to maximize flexibility of resource allocation. So far I've tried using 1-3 r3.8xlarge instances with a variety of resource allocation schemes between the driver and workers. For example, for a cluster of 1 r3.8xlarge instance I submit the app as follows:
aws emr add-steps --cluster-id $1 --steps Name=$2,\
Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/usr/lib/spark/bin/spark-submit,--verbose,\
--deploy-mode,cluster,--master,yarn,\
--driver-memory,60G,\
--executor-memory,30G,\
--executor-cores,5,\
--num-executors,6,\
--class,GbtTrainer,\
"s3://somebucket/somekey/spark.jar"],\
ActionOnFailure=CONTINUE
For a cluster of 3 r3.8xlarge instances I tweak resource allocation:
--driver-memory,80G,\
--executor-memory,35G,\
--executor-cores,5,\
--num-executors,18,\
I don't have a clear idea of how much memory is useful to give to each executor, but I feel that I'm being generous in either case. Looking through the Spark UI, I'm not seeing tasks with an input size of more than a few GB. I'm erring on the side of caution when giving the driver process so much memory, in order to ensure that it isn't memory-starved for any intermediate result-aggregation operations.
I'm trying to keep the number of cores per executor down to 5, as per the suggestions in Cloudera's How To Tune Your Spark Jobs series (according to them, more than 5 cores tends to introduce an HDFS IO bottleneck). I'm also making sure that there is enough spare RAM and CPU left over for the host OS and Hadoop services.
My findings thus far
My only clue is Spark UI showing very long Scheduling Delay for a number of tasks at the tail-end of execution. I also get the feeling that the stages/tasks timeline shown by Spark UI does not account for all of the time that the job takes to finish. I suspect that the driver application is stuck performing some kind of a lengthy operation either at the end of every training iteration, or at the end of the entire training run.
I've already done a fair bit of research on tuning Spark applications. Most articles will give great suggestions on using RDD operations which reduce intermediate input size or avoid shuffling of data between stages. In my case I'm basically using an "out-of-the-box" algorithm, which is written by ML experts and should already be well tuned in this regard. My own code that outputs GBT model to S3 should take a trivial amount of time to run.
I haven't used MLlib's GBT implementation, but I have used both LightGBM and XGBoost successfully. I'd highly suggest taking a look at these other libraries.
In general, GBM implementations need to train models iteratively, as they consider the loss of the entire ensemble when building the next tree. This makes GBM training inherently sequential and not easily parallelizable (unlike random forests, which are trivially parallelizable). I'd expect it to perform better with fewer tasks, but that might not be your whole issue. Since you have so many features (500K), you're going to have very high overhead when calculating the histograms and split points during training. You should reduce the number of features you have, especially since the feature count is much larger than the number of samples, which will cause overfitting.
As for tuning your cluster:
You want to minimize data movement, so use fewer executors with more memory: 1 executor per EC2 instance, with the number of cores set to whatever the instance provides.
Your data is small enough to fit into ~2 EC2 instances of that size. Assuming you are using doubles (8 bytes), it comes to 8 * 500,000 * 50,000 = 200 GB. Try loading it all into RAM by using .cache() on your RDD. If you perform an operation over all the rows (like a sum or count), you force it to load and you can measure how long the IO takes. Once it's in RAM and cached, any other operations over it will be faster.
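A rough sketch of that measurement, continuing from the question's own snippet (so data is the JavaRDD<LabeledPoint> loaded there; count() is just a convenient action to force materialization):
data.cache();
long start = System.currentTimeMillis();
long rows = data.count();   // first pass pays the S3 read / parse cost
System.out.println("Loaded " + rows + " rows in " + (System.currentTimeMillis() - start) + " ms");
start = System.currentTimeMillis();
data.count();               // second pass should hit the in-memory cache
System.out.println("Cached pass took " + (System.currentTimeMillis() - start) + " ms");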
With a dataset of that size, you may well be better off loading the full dataset into memory and using XGBoost directly rather than the Spark implementation.
If you want to stick with Spark to give greater scalability, I'd recommend taking a closer look at your partitioning strategy. If your data isn't effectively partitioned, adding machines won't improve your runtime, as you describe above, and the subset of overloaded workers will remain your bottleneck. Ensure you have an effective partition key, and repartition your RDD before you begin your training stage.
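Continuing from the question's snippet, that could look roughly like this (the partition count is a placeholder to tune, e.g. a small multiple of the total executor cores):
int numPartitions = 90;  // placeholder: roughly 2-3x the total executor cores in the cluster
JavaRDD<LabeledPoint> repartitioned = data.repartition(numPartitions).cache();
GradientBoostedTreesModel model = GradientBoostedTrees.train(repartitioned, boostingStrategy);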

DynamoDB change in throughput

I am using DynamoDB and I want to change throughput of Dynamo tables.
Will the throughput be changed instantly, or will it take some time to be applied completely?
I tried searching for the answer but could not find it even on the Amazon website.
If I change the throughput for a table, how much time will it take to take effect?
It won't be instantaneous. From my experience it depends greatly on the size of your current throughput and data. For a small table with low throughput (less than a few hundred reads or writes per second), it should take a few minutes.
For larger tables with higher throughputs I've seen it take a lot longer, as long as 30 minutes. Sorry, this is just based on observation; I don't have any formal metrics on it. You can continue to use the table while it's updating.
This document explains how DynamoDB responds to increases in throughput: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
While this doesn't explain how long it will take to create new partitions, it explains under what circumstances a new partition would actually need to be made.
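If you want to script the change and observe the delay rather than guess, a sketch along these lines (AWS SDK for Java v1; the table name and capacity numbers are placeholders) requests the new throughput and then polls the table status until it returns to ACTIVE:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;
public class UpdateThroughputAndWait {
  public static void main(String[] args) throws InterruptedException {
    AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
    String tableName = "my-table"; // placeholder
    // Request the new provisioned throughput (read units, write units).
    client.updateTable(new UpdateTableRequest()
        .withTableName(tableName)
        .withProvisionedThroughput(new ProvisionedThroughput(500L, 200L)));
    // The table goes into UPDATING but stays usable; poll until it is ACTIVE again.
    String status;
    do {
      Thread.sleep(10_000);
      status = client.describeTable(tableName).getTable().getTableStatus();
      System.out.println("Table status: " + status);
    } while (!"ACTIVE".equals(status));
  }
}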

What is the most efficient way to store time series in Riak with heavy reads

My current approach:
I have one domain class - Application
Each application in my system is stored in "applications" bucket under APPLICATION_KEY key
Apart from the application metadata stored in this bucket, each application has its own bucket called "time_metrics/APPLICATION_KEY" where I store the time series in the following way:
KEY - timestamp / VALUE - some attributes
My concern is the efficiency of queries made over a specific time window for a given application. Currently, to get the time series from a specific time window and eventually make some reductions, I have to run a map/reduce over the whole "time_metrics/APPLICATION_KEY" bucket, which, from what I have found, is not the recommended use case for Riak MapReduce.
My question: what would be the best db structure for this kind of a system, and how can it be queried efficiently?
Adding onto @macintux's answer.
Basho has had a few customers that have used Riak for time series metrics.
Boundary has a nice tech talk about how they use Riak with their network monitoring software. They rollup data into different chunks of time (1m, 5m, 15m) for analysis.
They also have a series of blog posts about lessons learned while implementing this system.
Kivra also has a good slide deck about how they use timeseries data with riak.
You could roll up your data into some sort of arbitrary time length, then read the range you need by issuing regular K/V gets, and then reconstruct the larger picture / reduce in your application.
If you have spare computing power and you know in advance what keys you need, you certainly can use Riak's MapReduce, but often retrieving the keys and running your processing on the client will be as fast (and won't strain your cluster).
Some general ideas:
Roll up your data into larger blocks
If you're concerned about losing data if your client crashes while buffering it, you can always store the data as it arrives
Similar idea: store the data as it arrives, then retrieve it and roll it up at certain intervals
You can automatically expire data once you're confident it is being reliably stored in larger blocks, using either the Bitcask or Memory backends
Memory backend is quite useful (RAM permitting) for any data that only needs to be stored for a limited period of time
Related: don't be afraid to store multiple copies of your data to make reading/reporting easier later
Multiple chunks of time (5- and 15-minute blocks, for example)
Multiple report formats
Having said all that, if you're doing straight key/value requests (it's ideal to always be able to compute the keys you need, rather than doing indexing or searching), Riak can support very heavy traffic loads, so I wouldn't recommend spending too much time creating alternative storage mechanisms unless you know you're going to face latency problems.
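As a sketch of the key-computation side of the roll-up approach described above (plain key arithmetic, independent of whichever Riak client you use; the 5-minute block size and the key format are arbitrary choices): samples are written under a per-application block key, and any query window maps to an explicit list of keys you can fetch with regular K/V gets and reduce in the application:
import java.util.ArrayList;
import java.util.List;
public class TimeSeriesKeys {
  static final long BLOCK_MILLIS = 5 * 60 * 1000L; // 5-minute roll-up blocks
  // Key under which a sample taken at timestampMillis is rolled up for an application.
  static String blockKey(String applicationKey, long timestampMillis) {
    long blockStart = (timestampMillis / BLOCK_MILLIS) * BLOCK_MILLIS;
    return applicationKey + ":" + blockStart;
  }
  // All block keys covering [fromMillis, toMillis] - fetch these with plain K/V gets
  // instead of map/reducing over the whole bucket.
  static List<String> keysForWindow(String applicationKey, long fromMillis, long toMillis) {
    List<String> keys = new ArrayList<>();
    long start = (fromMillis / BLOCK_MILLIS) * BLOCK_MILLIS;
    for (long t = start; t <= toMillis; t += BLOCK_MILLIS) {
      keys.add(blockKey(applicationKey, t));
    }
    return keys;
  }
  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(keysForWindow("APPLICATION_KEY", now - 60 * 60 * 1000L, now));
  }
}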