MapReduce inefficient reducer

What would cause only a single reducer in a MapReduce job apart from all the keys output by the map function being the same?

Possible causes:
Your cluster still uses the default setting of 1 reducer.
Your code explicitly sets the number of reducers to 1.
You are running in local mode (i.e. no cluster at all).
Quote from mapred-default.xml:
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99%
  of the cluster's reduce capacity, so that if a node fails the reduces can
  still be executed in a single wave.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>
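To see why the reducer count matters regardless of how varied the keys are, here is a minimal Python sketch modelled on hash partitioning (key hash modulo the number of reduce tasks), which is how Hadoop's default partitioner picks a reducer. The function name is hypothetical, and zlib.crc32 stands in for Java's hashCode() purely for illustration.

```python
import zlib

def reducer_for(key: str, num_reduce_tasks: int) -> int:
    """Pick a reducer index for a key: non-negative hash modulo reducer count."""
    # Assumption: zlib.crc32 replaces Java's key.hashCode(); it is stable across runs.
    key_hash = zlib.crc32(key.encode("utf-8"))
    return (key_hash & 0x7FFFFFFF) % num_reduce_tasks

keys = ["apple", "banana", "cherry", "date", "elderberry"]

# With one reduce task, every key lands on reducer 0, however varied the keys are.
print({reducer_for(k, 1) for k in keys})

# With more reduce tasks, distinct keys can spread out over several reducers.
print(sorted({reducer_for(k, 4) for k in keys}))
```

So with mapred.reduce.tasks left at 1, distinct keys make no difference: there is only one reducer index to choose from.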

The Apache Nifi state is not persisted to zookeeper

According to the official Nifi documentation, the state allows Nifi processors to "resume from the place where it left off after NiFi is restarted. Additionally, it allows for a Processor to store some piece of information so that the Processor can access that information from all of the different nodes in the cluster".
If I understand correctly, when a ZooKeeper provider is configured, the state is not persisted locally; instead, the data is sent to ZooKeeper.
I've explored the ZooKeeper znodes and could not find any data related to the state; all I can find is information about the Coordinator and Primary nodes. However, the local state directory is still being filled.
The configuration is very simple: I have 3 external ZK nodes and 3 NiFi instances.
Here is an excerpt of the nifi.properties file:
nifi.cluster.is.node=true
nifi.zookeeper.connect.string=zk-node1:2181,zk-node2:2181,zk-node3:2181
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.provider.cluster=zk-provider
And here is an excerpt of the state-management.xml file:
<cluster-provider>
  <id>zk-provider</id>
  <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
  <property name="Connect String">zk-node1:2181,zk-node2:2181,zk-node3:2181</property>
  <property name="Root Node">/nifi</property>
  <property name="Session Timeout">10 seconds</property>
  <property name="Access Control">Open</property>
</cluster-provider>
When I ls the ZooKeeper root node, I can see only 2 znodes: "components", which is empty, and "leaders", which contains some data about the NiFi Coordinator and Primary nodes.
Also, when I explore the transaction logs, even after using some load-balanced connections, I cannot find anything related to the NiFi state.
Could somebody explain what data goes to ZooKeeper, and why the local state directory is still filled even though the zk provider is configured?
Thanks.
It depends on the processor. In some cases it would never make sense to store cluster-wide state, because it could never be picked up by another node. For example, when ListFile tracks a local directory, another node cannot access that directory, so storing this state in ZK is not helpful.
There is always a local state provider, backed by a write-ahead log in the state directory, and it is up to the processor to say whether state should be stored with cluster or local scope.
The documentation for each processor should say how the state is stored. For example, from ListFile:
@Stateful(scopes = {Scope.LOCAL, Scope.CLUSTER}, description = "After performing a listing of files, the timestamp of the newest file is stored. "
    + "This allows the Processor to list only files that have been added or modified after "
    + "this date the next time that the Processor is run. Whether the state is stored with a Local or Cluster scope depends on the value of the "
    + "<Input Directory Location> property.")
If Input Directory Location is "remote" then it will use cluster state, otherwise local state.

Limit concurrent executions on AWS Data Pipeline

Is there a way to limit concurrent execution on an AWS Data Pipeline? We need to limit simultaneous executions to 1.
Something similar to what Oozie has with the <concurrency> property?
From the Oozie docs:
concurrency: The maximum number of actions for this job that can be running at the same time. This value allows to materialize and submit multiple instances of the coordinator app, and allows operations to catchup on delayed processing. The default value is 1.
You can use the maxActiveInstances field on Ec2Resource / EmrCluster to achieve this.
References:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emrcluster.html
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html
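As a rough illustration of where the field goes, the sketch below builds a fragment of a JSON pipeline definition in Python. Only maxActiveInstances comes from the answer above; the object id, instance type and terminateAfter value are hypothetical placeholders, and a real definition also needs a schedule and activities.

```python
import json

# Hypothetical ids and instance settings; only maxActiveInstances is taken from the answer.
pipeline_definition = {
    "objects": [
        {
            "id": "MyEc2Resource",       # hypothetical object id
            "type": "Ec2Resource",
            "instanceType": "m4.large",  # hypothetical instance type
            "terminateAfter": "2 Hours", # hypothetical timeout
            "maxActiveInstances": "1",   # at most one active instance of this component
        }
    ]
}

# Print the fragment in the JSON form used by pipeline definition files.
print(json.dumps(pipeline_definition, indent=2))
```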

Difference in default partitioning by instance type

My understanding was that Spark chooses the 'default' number of partitions based solely on the size of the file or, if it's a union of many parquet files, on the number of parts.
However, when reading in a set of large parquet files, I see that the default number of partitions for an EMR cluster with a single d2.2xlarge is ~1200, while on a cluster of 2 r3.8xlarge I'm getting ~4700 default partitions.
What metrics does Spark use to determine the default partitions?
EMR 5.5.0
spark.default.parallelism: the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user.
On EMR it defaults to 2X the number of CPU cores available to YARN containers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults
It looks like this matches non-EMR/AWS Spark as well.
I think there was some transient issue: I restarted the EMR cluster with d2.2xlarge and it gave me the number of partitions I expected, which matched the r3.8xlarge and was equal to the number of files on S3.
If anyone knows why this kind of thing happens, though, I'll gladly mark yours as the answer.

Kinesis ProvisionedThroughputExceededException even after sufficient shards

We are facing a ProvisionedThroughputExceededException while writing data to a Kinesis stream.
Case 1:
We used a single m4.4xlarge (16 cores, 64 GB memory) instance to write data to the stream, sending 3k requests from JMeter. The EC2 instance gave us 1100 requests per second, so we chose a 2-shard stream (i.e. 2000 eps).
As a result, we were able to write data to the stream successfully without any loss.
Case 2:
For further testing we created a cluster of 10 EC2 m4.4xlarge (16 cores, 64 GB memory) instances and an 11-shard stream (based on a simple calculation of 1000 eps per shard, so 10 shards + 1 extra).
When we test that EC2 cluster with different request volumes from JMeter, such as 3, 10, and 30 million, we get ProvisionedThroughputExceededException errors in our log file.
On the JMeter side, the EC2 cluster gives us 7500 eps, and I believe that at 7500 eps a stream with 11000 eps capacity should not return such an error.
Could you help me understand the reason behind this issue?
It sounds like Kinesis is not hashing/distributing your data evenly across your shards - some are "hot" (getting the ProvisionedThroughputExceededException), while others are "cold".
To solve this, I recommend:
Use the ExplicitHashKey parameter in order to have control over which shards your data goes to. The PutRecords documentation has some basic info on this (but not as much as it should).
Also, make sure that your shards are evenly split across the hash space (appropriate starting/ending hash key).
The simplest pattern is to have a single pre-defined ExplicitHashKey for each shard, and have your PutRecords logic cycle through them, one per record, which gives a perfectly even distribution. In any case, make sure your record hashing algorithm will distribute records evenly across the shards.
Another alternative/extension based on using ExplicitHashKey is to have a subset of your hashspace dedicated to "overflow" shard(s) - in your case, 1 specific ExplicitHashKey value mapped to one shard - when you start being throttled on your normal shards, send the records there for retry.
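The "one ExplicitHashKey per shard" pattern above can be sketched in plain Python. This is only an illustration under the assumption that the 11 shards split the 128-bit hash space evenly; in a real stream the ranges should be read from each shard's starting/ending hash key. The helper names are hypothetical, and the entries merely mimic the shape of PutRecords records rather than calling any AWS API.

```python
from collections import Counter
from itertools import cycle

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit unsigned integers

def even_shard_ranges(num_shards):
    """Split the hash space into num_shards contiguous, near-equal (start, end) ranges."""
    step = HASH_SPACE // num_shards
    return [
        (i * step, HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1)
        for i in range(num_shards)
    ]

def shard_index(hash_key, ranges):
    """Return the index of the shard whose range contains hash_key."""
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside the hash space")

ranges = even_shard_ranges(11)
# One ExplicitHashKey per shard: the midpoint of its range, as a decimal string.
explicit_keys = [str((start + end) // 2) for start, end in ranges]

# Round-robin over the per-shard keys while building PutRecords-style entries.
key_cycle = cycle(explicit_keys)
entries = [
    {
        "Data": f"event-{n}".encode("utf-8"),
        "PartitionKey": "ignored-for-routing",  # still required, but overridden by ExplicitHashKey
        "ExplicitHashKey": next(key_cycle),
    }
    for n in range(22)
]

# Count how many entries each shard would receive.
per_shard = Counter(shard_index(int(e["ExplicitHashKey"]), ranges) for e in entries)
print(per_shard)
```

Because the keys are cycled, every shard receives the same number of records as long as the batch size is a multiple of the shard count.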
Check your producer side: are you sure you are writing data to different shards? The "PartitionKey" value in the PutRecordRequest call may help you.
I think you need to pass different partition keys for your records so that the data is spread across different shards.
Even if you have created multiple shards, if all of your records use the same partition key then you're still writing to a single shard, because they'll all have the same hash value. See the PartitionKey documentation for more.
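A small Python sketch of why a constant partition key pins everything to one shard: Kinesis maps the partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The function name is hypothetical, and the sketch assumes the shards split the hash space evenly.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide

def shard_for_partition_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, assuming evenly split shard ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    step = HASH_SPACE // num_shards
    # The last shard absorbs the remainder of the hash space.
    return min(hash_key // step, num_shards - 1)

# The same partition key always hashes to the same shard.
same_key_shards = {shard_for_partition_key("same-key", 11) for _ in range(1000)}
print(len(same_key_shards))

# Varied partition keys spread over the shards.
varied_key_shards = {shard_for_partition_key(f"user-{n}", 11) for n in range(10000)}
print(len(varied_key_shards))
```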

Submitting the same Oozie workflow job multiple times at the same time

I am wondering how Oozie handles conflicts (if there really are any) when I submit two identical workflow jobs (just the Oozie sample examples) at the same time.
I can submit the same job twice successfully, and the Oozie server returns two different job IDs. In the Oozie Web Console, I saw that both jobs were RUNNING, and then both SUCCEEDED after some time.
My workflow.xml is as follows:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}/user/${wf:user()}/mapreduce_test/output-data"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>default</value>
        </property>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.apache.oozie.example.SampleMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.oozie.example.SampleReducer</value>
        </property>
        <property>
          <name>mapred.map.tasks</name>
          <value>1</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/${wf:user()}/mapreduce_test/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/${wf:user()}/mapreduce_test/output-data/</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
I know that deleting the output directory in the "prepare" element helps make the action repeatable and enables retries after failure, and I also understand the basic action run model.
So, my questions are:
Are the two identical jobs really running concurrently? (I saw both in the RUNNING state in the Oozie Web Console.)
Is there a write conflict? (Both jobs point to the same output directory.)
Oozie does not detect job duplication or anything like it. It accepts the workflow jobs, schedules them on the cluster for execution, and monitors them until completion or failure.
Are the two identical jobs really running concurrently? (I saw both in the RUNNING state in the Oozie Web Console.)
Yes. Both jobs will be running concurrently.
Is there a write conflict? (Both jobs point to the same output directory.)
Oozie does not have any checks related to write conflicts. I guess these are taken care of by either the MapReduce or HDFS framework.
As per your question:
1. Oozie schedules jobs on the cluster and tracks them until they finish with a status such as success or failure.
2. Both jobs will run at the same time and will execute the same actions as defined.
To avoid this, you can take the steps below, which may be somewhat helpful.
Oozie jobs are started by submitting job.properties or coordinator.properties, and the workflow is then executed at the interval specified in job.xml/coordinator.xml.
So whenever a request is submitted, it makes a fresh entry in the
COORD_JOBS (for coordinators) and WF_JOBS (for workflows)
tables in Oozie's metastore DB (which could be Oracle/MySQL/Postgres/Derby).
So even though the job has already been triggered, the same job can be started repeatedly, because a new ID is assigned each time (the coordinator job ID is assigned incrementally).
One way to avoid duplicate processing of the same job is to add a validation check on the metastore DB side.
Create a trigger on the COORD_JOBS table in the metastore DB that checks the table for an entry with the same job name, with a query like:
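-- Assumes MySQL; this is the body of a BEFORE INSERT trigger on COORD_JOBS.
IF (SELECT COUNT(*) FROM COORD_JOBS WHERE app_name = NEW.app_name AND status = 'RUNNING') > 0 THEN
    -- Reject the new row by raising an error.
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Error: Cannot update table.';
END IF;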
A trigger like this on the COORD_JOBS/WF_JOBS tables runs the check every time Oozie tries to add a new job.
The COORD_JOBS table can be replaced with the WF_JOBS table, which stores the details of workflow jobs started by coordinator.properties.