each Input split is replicated in 3 times in hadoop cluster. for each replicate split , does hadoop assigns each map? . If then assign which map results send to reduce function. does hadoop replicates the reduce function also
No, Even though there are three replicas for a split, only one mapper will be assigned by MapReduce engine. It uses the concept called data localization in order to decide which replica of the split to use.
Hadoop does its best to run the map task on a node where the input
data resides in HDFS. This is called the data locality
optimization since it doesn’t use valuable cluster bandwidth.
Sometimes, however, all three nodes hosting the HDFS block replicas
for a map task’s input split are running other map tasks so the job
scheduler will look for a free map slot on a node in the same rack as
one of the blocks. Very occasionally even this is not possible, so an
off-rack node is used, which results in an inter-rack network transfer.
Please find below the excerpts from the Hadoop Definitive guide.
Hadoop divides the input to a MapReduce job into fixed-size pieces
called input splits, or just splits. Hadoop creates one map task for
each split, which runs the user- defined map function for each record
in the split.
Related
I read the docs for AWS Glue GetPartitionsAPI and noticed that it has "Segment" parameter:
"Segment": {
"SegmentNumber": number,
"TotalSegments": number
},
I checked the explanations for Segment here and it says:
Defines a non-overlapping region of a table's partitions, allowing multiple requests to be run in parallel.
SegmentNumber - The zero-based index number of the segment. For example, if the total number of segments is 4, SegmentNumber values range from 0 through 3.
TotalSegments - The total number of segments. Minimum value of 1. Maximum value of 10.
It looks strikingly similar to Segments in DynamoDB, which is used by DynamoDB's parallel scan feature (As discussed here: Scan vs Parallel Scan in AWS DynamoDB?). However, the difference I noticed was how the segments in Glue is limited to only 10, while it seems unlimited on DynamoDB. Therefore:
1. I would like to know what Glue Segments are, if it's really the same thing and use cases for parallel scans in DynamoDB apply here too (e.g. over 20GB table size, read throughput not fully utilized)
Also, Presto/Trino has a Glue-segment-related parameter on config.properties as seen on Trino AWS Glue Configuration Properties:
hive.metastore.glue.partitions-segments - Number of segments for partitioned Glue tables, defaults to 5.
which indicates that segments can be utilized by Presto/Trino. However, when I tried setting the parameter to different numbers, queries keeps using the same number of splits and the speed remains the same (I used same number of nodes every time). Therefore:
2. How Glue Segments fit into Presto/Trino's conceptual model of nodes, tasks and splits?
And in this article about How to use AWS Glue Data Catalog as Metastore for Hive on AWS EMR There is this notice about throttling:
In EMR 5.20.0 or later, parallel partition pruning is enabled automatically for Spark and Hive when is used as the metastore. This change significantly reduces query planning time by executing multiple requests in parallel to retrieve partitions. The total number of segments that can be executed concurrently range between 1 and 10. The default value is 5, which is a recommended setting. You can change it by specifying the property aws.glue.partition.num.segments in hive-site configuration classification. If throttling occurs, you can turn off the feature by changing the value to 1.
3. How Glue Segments may cause throttling and in what kind of scenario? (I think this one might become clear as my understanding about Glue Segment begins to form)
I have an Amazon DynamoDB table which is used for both read and write operations. Write operations are performed only when the batch job runs at certain intervals whereas Read operations are happening consistently throughout the day.
I am facing a problem of increased Read latency when there is significant amount of write operations are happening due to the batch jobs. I explored a little bit about having a separate read replica for DynamoDB but nothing much of use. Global tables are not an option because that's not what they are for.
Any ideas how to solve this?
Going by the Dynamo paper, the concept of a read-replica for a record or a table does not exist in Dynamo. Within the same region, you will have multiple copies of a record depending on the replication factor (R+W > N) where N is the replication factor. However when the client reads, one of those records are returned depending on the cluster health.
Depending on how the co-ordinator node is chosen either at the client library or at the cluster, the client can only ask for a record (get) or send a record(put) to either the cluster co-ordinator ( 1 extra hop ) or to the node assigned to the record (single hop to record). There is just no way for the client to say 'give me a read replica from another node'. The replicas are there for fault-tolerance, if one of the nodes containing the master copy of the record dies, replicas will be used.
I am researching the same problem in the context of hot keys. Every record gets assigned to a node in Dynamo. So a million reads on the same record will lead to hot keys, loss of reads/writes etc. How to deal with this ? A read-replica will work great because I can now manage the hot keys at the application and move all extra reads to read-replica(s). This is again fraught with issues.
I am trying to understand the difference between concurrent connections and concurrent queries in Redshift. As per documents, We can make 500 concurrent connections to a Redshift cluster but it says maximum 15 queries can be run at the same time in a cluster. Now what is the exact value?
How many queries can be in running state in a cluster at the same time ? If it is 15, does it include RETURNING state queries as well ?
How many concurrent COPY statement can run in a cluster ?
We are evaluating Redshift as our primary reporting data store. If we cannot run a large number of queries simultaneously it may be difficult for us to go with this model.
I think, you have misread somewhere, Max concurrent queries are 50 per WLM. Refer below thread for Amazon support response for more detail.
How many queries can be in running state in a cluster at the same time ? If it is 15, does it include RETURNING state queries as well ?
At a time, Max 50 queries could be running concurrently. Yes it does include INSERT/UPDATE/DELETE etc all.
How many concurrent COPY statement can run in a cluster ?
Ideally, you could Max go up to 50 concurrently, but Copy works bit differently.
Amazon Redshift automatically loads in parallel from multiple data files.
If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, which is much slower and requires a VACUUM at the end if the table has a sort column defined. For more information about using COPY to load data in parallel, see Loading Data from Amazon S3.
Meaning, you could run concurrent Copy commands but make sure one copy command at a time per table.
So practically, it doesn't depend on Nodes on cluster, but Number of tables as well.
So if you have only 1 table, you would like to execute 50 insert concurrently, it will result only 1 Copy concurrently.
I was studying Amazon Redshift using the Sybex Official Study Guide, at page 173 there are a couple of phrases:
You can configure the distribution style of a table to give Amazon RS hints as to how the data should be partitioned to best meet your query patterns. When you run a query, the optimizer shifts the rows to the compute node as needed to perform any joins and aggregates.
That leads me to some questions?
1) What is the role of "optimizer"? Do data re-arranged across compute nodes to boost performance for each new query?
2) if 1) is true and new query completly different is performed: What happen to the old data in the compute nodes?
3) Can you explain me better the 3 distribution styles (EVEN, KEY, ALL) particularly the KEY style.
Extra questions:
1) Does the leader node has records?
To clarify a few things:
The Distribution Key is not a hint -- data is actually distributed according to the key
When running a query, the data is not "shifted" -- rather, copies of data might be sent to other nodes so that data can be joined on a particular node, but the data does not then reside on the destination node
The optimizer doesn't actually "do" anything -- it just computes the process that the nodes will follow (Redshift apparently writes C programs that are sent to each node)
The only thing you really need to know about the Optimizer is:
Query optimizer
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.
From Data Warehouse System Architecture:
Leader node
The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes.
The Leader Node does not contain any data (unless you launch a single-node cluster, in which case the same server is used as the Leader Node and Compute Node).
For information on Distribution Styles, refer to: Distribution Styles
If you really want to learn about Redshift, then read the Redshift Database Developer Guide. If you are merely studying for the Solutions Architect exam, the above links will be sufficient for the level of Redshift knowledge.
Running Presto large scan query on 5 nodes cluster it looks that only one node is the query coordinator and reads the data from 5 hdfs nodes over network.
All presto processes are running on data nodes.
Is there way to let 5 nodes to read data from hdfs using shortcut local reads ?
Are presto nodes doing any preaggregation?
It is not clear from you question if you have installed Presto workers on the same machine as your HDFS data nodes. If you have not, the installation instructions will help you do this.
Once you have Presto workers on all of your data nodes, Presto should automatically perform local reads when accessing data from the local DFS node. Presto will prefer scheduling work on the same machine as the DFS node, but if that machine is overloaded, it will schedule the work on another machine, so you will typically get some remote reads. The majority of reads should be local, and you can verify this distribution using the com.facebook.presto.execution:name=NodeScheduler mbean on the coordinator.
Presto always performs partial aggregation on the leaf worker nodes.
If you have presto installed on all the nodes, and want presto workers to process local stripes, you need to turn the "hive.force-local-scheduling" session flag to true. This is false by default in the presto versions i saw (0.153).
See more details at:
https://github.com/prestodb/presto/issues/894
https://github.com/prestodb/presto/pull/1770