YARN UI shows no active node while one appears in the HDFS UI - hdfs

I've set up Hadoop on my laptop, and when I submit a job (through either MapReduce or Tez), the status is always ACCEPTED, the progress is stuck at 0%, and the description says something like "waiting for AM container to be allocated".
When I check the nodes through the YARN UI (localhost:8088), it shows 0 active nodes.
But the HDFS UI (localhost:50070) shows that there is one live node.
Is the lack of an available node the main reason the job is stuck? If so, what should I do?

Your YARN UI shows that you have zero vcores and zero memory, so there is no way for any job to ever run: there are no compute resources to allocate. The DataNode is only for storage (HDFS in this case) and has no bearing on why your application is stuck.
To fix your problem, you need to update your yarn-site.xml and provide settings for the memory and vcore properties described in the following:
http://blog.cloudera.com/blog/2015/10/untangling-apache-hadoop-yarn-part-2/
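As a rough sketch for a single-node laptop setup (the 4 GB / 2 vcore numbers are assumptions; size them to your machine as the article above explains), the relevant yarn-site.xml entries look like:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>   <!-- memory this NodeManager offers to containers; assumed 4 GB -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value>      <!-- vcores this NodeManager offers; assumed 2 -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>   <!-- largest single container the scheduler will grant -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>    <!-- smallest single container the scheduler will grant -->
</property>
Restart the ResourceManager and NodeManager afterwards; the YARN UI should then show non-zero memory and vcores and one active node.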
You might consider using a Cloudera QuickStart VM or Hortonworks Sandbox (at least as a reference for configuration values for the yarn-site.xml).
https://www.cloudera.com/downloads/quickstart_vms/5-10.html
https://hortonworks.com/products/sandbox/

Related

Data flow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all required IAM roles.
Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" is caused by worker setup taking too long. To solve this, you can try increasing worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1, with 1 vCPU and 3.75 GB RAM); using a more powerful instance (n1-standard-4, which has four times the resources) solves the problem.
You can debug this by looking at the worker startup logs in Cloud Logging. You are likely to see pip issues while installing dependencies.
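For a Python pipeline, a hedged example of launching with a bigger worker type (the script, project, and bucket names are placeholders):
python my_pipeline.py \
  --runner DataflowRunner \
  --project my-project \
  --region us-central1 \
  --temp_location gs://my-bucket/tmp \
  --machine_type n1-standard-4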
Do you have any error logs showing that Dataflow Workers are crashing when trying to start?
If not, perhaps the worker VMs start but cannot reach the Dataflow service, which is often related to network connectivity.
Please note that by default, Dataflow creates jobs using the network and subnetwork named default (check that they exist in your project); you can point the job at a specific one with --subnetwork. See https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
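If the default network does not exist, a sketch of pointing the job at a specific subnetwork instead (project, region, and subnet names are placeholders):
python my_pipeline.py \
  --runner DataflowRunner \
  --project my-project \
  --region us-central1 \
  --subnetwork https://www.googleapis.com/compute/v1/projects/my-project/regions/us-central1/subnetworks/my-subnet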

Can I configure Google DataFlow to keep nodes up when I drain a pipeline

I am deploying a pipeline to Google Cloud DataFlow using Apache Beam. When I want to deploy a change to the pipeline, I drain the running pipeline and redeploy it. I would like to make this faster. It appears from the logs that on each deploy DataFlow builds up new worker nodes from scratch: I see Linux boot messages going by.
Is it possible to drain the pipeline without tearing down the worker nodes so the next deployment can reuse them?
Rewriting Inigo's answer here:
Answering the original question: no, there's no way to do that; updating the pipeline should be the way to go. I was not aware it was marked as experimental (probably we should change that), but the update approach has not changed in the three years I have been using Dataflow. As for the special cases where update doesn't work: even if the feature you describe existed, the workers would still need the new code, so there isn't really much to save, and update should work in most other cases.
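Concretely, updating means relaunching the new code with --update and the same job name as the running job, instead of draining; a minimal sketch (project and job names are placeholders):
python my_pipeline.py \
  --runner DataflowRunner \
  --project my-project \
  --region us-central1 \
  --job_name my-streaming-job \
  --update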

Trouble configuring Presto's memory allocation on AWS EMR

I am really hoping to use Presto in an ETL pipeline on AWS EMR, but I am having trouble configuring it to fully utilize the cluster's resources. This cluster would exist solely for this one query, and nothing more, then die. Thus, I would like to claim the maximum available memory for each node and the one query by increasing query.max-memory-per-node and query.max-memory. I can do this when I'm configuring the cluster by adding these settings in the "Edit software settings" box of the cluster creation view in the AWS console. But the Presto server doesn't start, reporting in the server.log file an IllegalArgumentException, saying that max-memory-per-node exceeds the usable heap space (which, by default, is far too small for my instance type and use case).
I have tried to use the session setting set session resource_overcommit=true, but that only seems to override query.max-memory, not query.max-memory-per-node, because in the Presto UI, I see that very little of the available memory on each node is being used for the query.
Through Google, I've been led to believe that I need to also increase the JVM heap size by changing the -Xmx and -Xms properties in /etc/presto/conf/jvm.config, but it says here (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that it is not possible to alter the JVM settings in the cluster creation phase.
To change these properties after the EMR cluster is active and the Presto server has been started, do I really have to manually ssh into each node and alter jvm.config and config.properties, and restart the Presto server? While I realize it'd be possible to manually install Presto with a custom configuration on an EMR cluster through a bootstrap script or something, this would really be a deal-breaker.
Is there something I'm missing here? Is there not an easier way to make Presto allocate all of a cluster to one query?
As advertised, increasing query.max-memory-per-node (and, by necessity, the -Xmx heap setting) indeed cannot be done on EMR until Presto has already started with the default options. To increase them, the jvm.config and config.properties found in /etc/presto/conf/ have to be changed, and the Presto server restarted on each node (core and coordinator).
One can do this with a bootstrap script using commands like
sudo sed -i "s/query.max-memory-per-node=.*GB/query.max-memory-per-node=20GB/g" /etc/presto/conf/config.properties
sudo restart presto-server
and similarly for /etc/presto/conf/jvm.config. The only caveats are that the logic in the bootstrap action needs to run only after Presto has been installed, and that the server on the coordinating node needs to be restarted last (and possibly with different settings if the master node's instance type differs from that of the core nodes).
You might also need to change resources.reserved-system-memory from its default by specifying a value for it in config.properties. By default, this value is 0.4 × the Xmx value, which is how much memory Presto claims for the system pool. In my case, I was able to safely decrease this value and give more memory to each node for executing the query.
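Putting those pieces together, a hedged sketch of such a script (the 20GB/24G/4GB values and the wait-for-install check are assumptions to adapt):
#!/bin/bash
# Run in the background from a bootstrap action: wait until EMR has laid down
# the Presto config, then raise the limits and restart the server.
(
  while [ ! -f /etc/presto/conf/config.properties ]; do sleep 10; done
  sudo sed -i "s/query.max-memory-per-node=.*GB/query.max-memory-per-node=20GB/g" /etc/presto/conf/config.properties
  # Heap must comfortably exceed max-memory-per-node; 24G assumes a large instance type.
  sudo sed -i "s/-Xmx.*/-Xmx24G/g" /etc/presto/conf/jvm.config
  # Optional: set the system pool explicitly if it is not already present (see note above).
  echo "resources.reserved-system-memory=4GB" | sudo tee -a /etc/presto/conf/config.properties
  sudo restart presto-server
) &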
As a matter of fact, there are configuration classifications available for Presto in EMR. However, please note that these may vary depending on the EMR release version. For a complete list of the available configuration classifications per release version, see [1] (make sure to switch between the tabs for your desired release version). Regarding jvm.config properties specifically, you will see in [2] that these are not currently configurable via configuration classifications. That being said, you can always edit the jvm.config file manually per your needs.
[1] Amazon EMR 5.x Release Versions
[2] Considerations with Presto on Amazon EMR - Some Presto Deployment Properties not Configurable
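For the properties that are configurable this way (e.g. config.properties settings), the JSON passed at cluster creation might look like the following, assuming the presto-config classification maps to config.properties on your release; check the list in [1]:
[
  {
    "Classification": "presto-config",
    "Properties": {
      "query.max-memory": "200GB",
      "query.max-memory-per-node": "20GB"
    }
  }
]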

Am I fully utilizing my EMR cluster?

Total Instances: I have created an EMR cluster with 11 nodes total (1 master instance, 10 core instances).
job submission: spark-submit myApplication.py
graph of containers: Next, I've got these graphs, which refer to "containers", and I'm not entirely sure what containers are in the context of EMR, so it isn't obvious what they're telling me:
actual running executors: Then I've got this in my Spark history UI, which shows that only 4 executors were ever created.
Dynamic Allocation: Then I've got spark.dynamicAllocation.enabled=True and I can see that in my environment details.
Executor memory: Also, the default executor memory is at 5120M.
Executors: Next, I've got my executors tab, showing that I've got what looks like 3 active and 1 dead executor:
So, at face value, it appears to me that I'm not using all my nodes or available memory.
How do I know if I'm using all the resources I have available?
If I'm not using all available resources to their full potential, how do I change what I'm doing so that they are used to their full potential?
Another way to see how many resources each node of the cluster is using is the Ganglia web tool.
Ganglia is published on the master node and shows a graph of each node's resource usage. The catch is that Ganglia must have been enabled as one of the applications at cluster creation time.
Once enabled, you can go to its web page and see how heavily each node is being utilized.
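If you create clusters from the CLI, a hedged example of including Ganglia alongside Spark at creation time (release label and instance settings are placeholders, other flags omitted for brevity):
aws emr create-cluster \
  --name "spark-with-ganglia" \
  --release-label emr-5.11.0 \
  --applications Name=Spark Name=Ganglia \
  --instance-type m4.xlarge \
  --instance-count 11 \
  --use-default-roles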

Spark step on EMR just hangs as "Running" after done writing to S3

I'm running a PySpark 2 job on EMR 5.1.0 as a step. Even after the script is done, with a _SUCCESS file written to S3 and the Spark UI showing the job as completed, EMR still shows the step as "Running". I've waited for over an hour to see if Spark was just trying to clean itself up, but the step never shows as "Completed". The last thing written in the logs is:
INFO MultipartUploadOutputStream: close closed:false s3://mybucket/some/path/_SUCCESS
INFO DefaultWriterContainer: Job job_201611181653_0000 committed.
INFO ContextCleaner: Cleaned accumulator 0
I didn't have this problem with Spark 1.6. I've tried a bunch of different hadoop-aws and aws-java-sdk jars to no avail.
I'm using the default Spark 2.0 configurations so I don't think anything else like metadata is being written. Also the size of the data doesn't seem to have an impact on this problem.
If you aren't already, you should close your Spark context.
sc.stop()
Also, if you are watching the Spark web UI in a browser, you should close it, as it sometimes keeps the Spark context alive. I recall seeing this on the Spark dev mailing list, but can't find the JIRA for it.
We experienced this problem and resolved it by running the job in cluster deploy mode using the following spark-submit option:
spark-submit --deploy-mode cluster
It had something to do with the fact that in client mode the driver runs on the master instance, and the spark-submit process was getting stuck despite the Spark context closing. This caused the instance controller to continuously poll for the process, since it never received the completion signal. Running the driver on one of the instance nodes using the above option doesn't seem to have this problem. Hope this helps.
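For an EMR step that runs a Python script stored in S3, that might look like (the script path is a placeholder):
spark-submit --deploy-mode cluster --master yarn s3://mybucket/some/path/my_job.py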
I experienced the same issue with Spark on AWS EMR and solved it by calling sys.exit(0) at the end of my Python script. The same worked for a Scala program with System.exit(0).
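A minimal sketch of how the end of such a PySpark script might look (app name and variable names are illustrative):
import sys
from pyspark import SparkContext

sc = SparkContext(appName="my_job")
# ... job logic: read input, transform, write results to S3 ...
sc.stop()      # close the Spark context explicitly, as suggested above
sys.exit(0)    # exit the Python process cleanly so the EMR step reports Completed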