Running a large Presto scan query on a 5-node cluster, it looks like only one node acts as the query coordinator and reads the data from all 5 HDFS nodes over the network.
All Presto processes are running on the data nodes.
Is there a way to let all 5 nodes read data from HDFS using short-circuit local reads?
Do the Presto nodes do any pre-aggregation?
It is not clear from your question whether you have installed Presto workers on the same machines as your HDFS data nodes. If you have not, the installation instructions will help you do this.
Once you have Presto workers on all of your data nodes, Presto should automatically perform local reads when accessing data on the local HDFS node. Presto prefers to schedule work on the same machine as the HDFS node, but if that machine is overloaded, it will schedule the work on another machine, so you will typically get some remote reads. The majority of reads should be local, and you can verify this distribution using the com.facebook.presto.execution:name=NodeScheduler MBean on the coordinator.
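If the JMX connector is enabled on the cluster (an etc/catalog/jmx.properties with connector.name=jmx), one way to inspect that MBean is with a plain query. This is only a sketch; the exact table and column names can vary between Presto versions:

-- Scheduler counters exposed via the JMX connector; compare the
-- local vs. remote split counts to see how many reads were local.
SELECT *
FROM jmx.current."com.facebook.presto.execution:name=NodeScheduler";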
Presto always performs partial aggregation on the leaf worker nodes.
If you have Presto installed on all the nodes and want the Presto workers to process local stripes, you need to set the "hive.force-local-scheduling" flag to true. It is false by default in the Presto versions I have seen (0.153).
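For reference, a sketch of the two places this can be set (property names as discussed in the issue and PR below; verify them against your version):

# etc/catalog/hive.properties -- cluster-wide default
hive.force-local-scheduling=true

-- per-session override, if your version exposes the Hive catalog session property
SET SESSION hive.force_local_scheduling = true;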
See more details at:
https://github.com/prestodb/presto/issues/894
https://github.com/prestodb/presto/pull/1770
I have an Amazon DynamoDB table which is used for both read and write operations. Write operations are performed only when a batch job runs at certain intervals, whereas read operations happen consistently throughout the day.
I am facing a problem of increased read latency when a significant number of write operations are happening due to the batch jobs. I explored having a separate read replica for DynamoDB but did not find anything useful. Global tables are not an option because that's not what they are for.
Any ideas how to solve this?
Going by the Dynamo paper, the concept of a read replica for a record or a table does not exist in Dynamo. Within the same region, you will have multiple copies of a record, as determined by the replication factor N, with the read and write quorum sizes R and W chosen so that R + W > N. However, when a client reads, one of those copies is returned depending on the cluster health.
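For example, with a replication factor of N = 3, choosing R = 2 and W = 2 satisfies R + W > N (2 + 2 > 3), so every read quorum overlaps every write quorum and at least one copy seen by a read reflects the latest successful write.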
Depending on how the coordinator node is chosen, either at the client library or at the cluster, the client can only ask for a record (get) or send a record (put) either to the cluster coordinator (one extra hop) or to the node assigned to the record (a single hop). There is simply no way for the client to say 'give me a read replica from another node'. The replicas are there for fault tolerance: if the node containing the master copy of the record dies, the replicas will be used.
I am researching the same problem in the context of hot keys. Every record gets assigned to a node in Dynamo, so a million reads of the same record will lead to a hot key, loss of reads/writes, and so on. How do you deal with this? A read replica would work great here, because I could then manage the hot keys at the application level and move the extra reads to the read replica(s). But that approach is again fraught with issues.
I am trying to understand the difference between concurrent connections and concurrent queries in Redshift. According to the documentation, we can make 500 concurrent connections to a Redshift cluster, but it also says that a maximum of 15 queries can run at the same time in a cluster. So what is the exact value?
How many queries can be in a running state in a cluster at the same time? If it is 15, does that include queries in the RETURNING state as well?
How many concurrent COPY statements can run in a cluster?
We are evaluating Redshift as our primary reporting data store. If we cannot run a large number of queries simultaneously, it may be difficult for us to go with this model.
I think you have misread somewhere; the maximum is 50 concurrent queries per WLM queue. Refer to the thread below for the Amazon support response with more detail.
How many queries can be in a running state in a cluster at the same time? If it is 15, does that include queries in the RETURNING state as well?
At most 50 queries can run concurrently. And yes, that includes INSERT/UPDATE/DELETE and the rest.
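Concurrency is configured per WLM queue, and the total across queues can go up to 50. A sketch of a wlm_json_configuration parameter value with two queues (the query group name and the numbers are made up for illustration):

[
  {
    "query_group": ["etl"],
    "user_group": [],
    "query_concurrency": 10
  },
  {
    "query_group": [],
    "user_group": [],
    "query_concurrency": 15
  }
]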
How many concurrent COPY statements can run in a cluster?
In principle you could go up to 50 concurrently, but COPY works a bit differently.
Amazon Redshift automatically loads in parallel from multiple data files.
If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, which is much slower and requires a VACUUM at the end if the table has a sort column defined. For more information about using COPY to load data in parallel, see Loading Data from Amazon S3.
Meaning, you can run concurrent COPY commands, but make sure there is only one COPY command at a time per table.
So in practice the limit depends not only on the nodes in the cluster, but on the number of tables as well.
For example, if you have only one table and want to execute 50 loads concurrently, you will effectively get only one COPY running at a time.
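So rather than issuing several concurrent COPY commands against the same table, split the input into multiple files under one prefix and issue a single COPY, which Redshift parallelizes across slices. A sketch (the bucket, prefix, and role ARN are placeholders):

-- One COPY per table; Redshift loads all the part-* files in parallel.
COPY sales
FROM 's3://my-bucket/sales/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;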
I was studying Amazon Redshift using the Sybex Official Study Guide, and on page 173 there are a couple of phrases:
You can configure the distribution style of a table to give Amazon RS hints as to how the data should be partitioned to best meet your query patterns. When you run a query, the optimizer shifts the rows to the compute node as needed to perform any joins and aggregates.
That leads me to some questions:
1) What is the role of the "optimizer"? Is data re-arranged across compute nodes to boost performance for each new query?
2) If 1) is true and a completely different new query is performed: what happens to the old data on the compute nodes?
3) Can you explain the 3 distribution styles (EVEN, KEY, ALL) in more detail, particularly the KEY style?
Extra question:
1) Does the leader node hold any records?
To clarify a few things:
- The distribution key is not a hint -- data is actually distributed according to the key.
- When running a query, the data is not "shifted" -- rather, copies of data might be sent to other nodes so that it can be joined on a particular node, but the data does not then reside on the destination node.
- The optimizer doesn't actually "do" anything to the data -- it just computes the process that the nodes will follow (Redshift apparently writes C programs that are sent to each node).
The only thing you really need to know about the Optimizer is:
Query optimizer
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.
From Data Warehouse System Architecture:
Leader node
The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes.
The Leader Node does not contain any data (unless you launch a single-node cluster, in which case the same server is used as the Leader Node and Compute Node).
For information on Distribution Styles, refer to: Distribution Styles
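For a concrete picture, here is a sketch of how each style is declared (the table and column names are made up):

-- EVEN: rows are spread round-robin across slices.
CREATE TABLE events (id BIGINT, payload VARCHAR(256)) DISTSTYLE EVEN;

-- KEY: rows with the same customer_id are stored on the same slice,
-- so joins on customer_id need no data movement.
CREATE TABLE sales (customer_id BIGINT, amount DECIMAL(10,2))
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: a full copy of the table is kept on every node
-- (useful for small dimension tables).
CREATE TABLE dim_country (code CHAR(2), name VARCHAR(64)) DISTSTYLE ALL;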
If you really want to learn about Redshift, then read the Redshift Database Developer Guide. If you are merely studying for the Solutions Architect exam, the above links will be sufficient for the level of Redshift knowledge required.
We are using ElastiCache (Redis) to implement a caching layer for our cloud platform. It has a MongoDB back end. We use Node.js and Java to access this data in different platform components.
Example Node.js code is given below:
var redisClient = require('redis').createClient(config.aws.redis.port, config.aws.redis.endpoint, { no_ready_check: true });
var redisKey = "urls_" + url;
redisClient.get(redisKey, function (redisErr, reply) { /* use reply */ });
In the Redis cache we store different categories of data, say X, Y, and Z. Currently they all live in a single Redis node, and we use namespace prefixes to partition the data.
For example:
key ="X_"+url
key ="Y_"+money
key ="Z_"+length
But I came across this article, which says it is better to avoid namespaces: http://www.mikeperham.com/2015/09/24/storing-data-with-redis/
So what is the best solution for our use case?
Having a different Redis node for each data partition:
X type data will be cached in A redis node
Y type data will be cached in B redis node
Z type data will be cached in C redis node
Or using multiple Redis DBs in a single Redis node?
X type data will be cached in 0th DB of A redis node
Y type data will be cached in 1st DB of A redis node
Z type data will be cached in 2nd DB of A redis node
For ElastiCache, it is generally better to store each set of data in its own cluster.
The primary downside of having multiple clusters is the management overhead, but with ElastiCache most of that is managed for you, minimizing this downside.
While scaling your ElastiCache cluster is managed, it still needs to move data around: the more data you have in a single cluster, the longer it will take to scale, and the more performance impact you will suffer during online rescaling.
As for cost, by splitting the data into multiple clusters, each cluster can use smaller instances, so in the end it may cost about the same.
Of course, there are always exceptions, and you should consider those issues within your unique context.
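As a sketch of the first option in Node.js (the endpoint names below are made up; each category gets its own cluster, so the key prefixes become unnecessary):

var redis = require('redis');

// One client per ElastiCache cluster; the cluster itself is the namespace.
var urlCache    = redis.createClient(6379, 'x-urls.abc123.use1.cache.amazonaws.com');
var moneyCache  = redis.createClient(6379, 'y-money.abc123.use1.cache.amazonaws.com');
var lengthCache = redis.createClient(6379, 'z-length.abc123.use1.cache.amazonaws.com');

// Keys no longer need the "X_"/"Y_"/"Z_" prefixes.
urlCache.get(url, function (err, reply) { /* use reply */ });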
Each input split is replicated 3 times in a Hadoop cluster. Does Hadoop assign a map task to each replica of a split? If so, which map's results are sent to the reduce function? And does Hadoop replicate the reduce function as well?
No. Even though there are three replicas of each split's block, only one mapper is assigned by the MapReduce engine. It uses the concept of data locality to decide which replica of the split to read.
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization since it doesn't use valuable cluster bandwidth. Sometimes, however, all three nodes hosting the HDFS block replicas for a map task's input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
The excerpt above and the one below are both from the Hadoop Definitive Guide.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.