Apache Beam Deploying on DataFlow - google-cloud-platform

Hi, I have created an Apache Beam pipeline, tested it, and run it from inside Eclipse, both locally and using the Dataflow runner. In the Eclipse console I can see that the pipeline is running, and I also see the details, i.e. the logs, on the console.
Now, how do I deploy this pipeline to GCP so that it keeps working irrespective of the state of my machine? For example, if I run it using mvn compile exec:java, the console shows it is running, but I cannot find the job in the Dataflow UI.
Also, what will happen if I kill the process locally? Will the job on the GCP infrastructure also be stopped? How do I know that a job has been triggered on the GCP infrastructure, independent of my machine's state?
The output of mvn compile exec:java with arguments is as follows:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/ThakurG/.m2/repository/org/slf4j/slf4j-jdk14/1.7.14/slf4j-jdk14-1.7.14.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/ThakurG/.m2/repository/org/slf4j/slf4j-nop/1.7.25/slf4j-nop-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
Jan 08, 2018 5:33:22 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ main
INFO: starting the process...
Jan 08, 2018 5:33:25 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ createStream
INFO: pipeline created::Pipeline#73387971
Jan 08, 2018 5:33:27 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ main
INFO: pie crated::Pipeline#73387971
Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply
INFO: Message received::1884408,16/09/2017,A,2007156,CLARK RUBBER FRANCHISING PTY LTD,A ,5075,6,Y,296,40467910,-34.868095,138.683535,66 SILKES RD,,,PARADISE,5075,0,7.4,5.6,18/09/2017 2:09,0.22
Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply
INFO: Payload from msg::1884408,16/09/2017,A,2007156,CLARK RUBBER FRANCHISING PTY LTD,A ,5075,6,Y,296,40467910,-34.868095,138.683535,66 SILKES RD,,,PARADISE,5075,0,7.4,5.6,18/09/2017 2:09,0.22
Jan 08, 2018 5:54:57 PM com.trial.apps.gcp.df.ReceiveAndPersistToBQ$1 apply
This is the Maven command I'm using from the command prompt:
`mvn compile exec:java -Dexec.mainClass=com.trial.apps.gcp.df.ReceiveAndPersistToBQ -Dexec.args="--project=analyticspoc-XXX --stagingLocation=gs://analytics_poc_staging --runner=DataflowRunner --streaming=true"`
This is the piece of code I'm using to create the pipeline and set its options:
PipelineOptions options = PipelineOptionsFactory.create();
DataflowPipelineOptions dfOptions = options.as(DataflowPipelineOptions.class);
dfOptions.setRunner(DataflowRunner.class);
dfOptions.setJobName("gcpgteclipse");
dfOptions.setStreaming(true);
// Then create the pipeline.
Pipeline pipeL = Pipeline.create(dfOptions);

Can you clarify what exactly you mean by "console shows it is running" and by "can not find the job using Dataflow UI"?
If your program's output prints the message:
To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/.../dataflow/job/....
Then your job is running on the Dataflow service. Once it's running, killing the main program will not stop the job - all the main program does is periodically poll the Dataflow service for the status of the job and new log messages. Following the printed link should take you to the Dataflow UI.
If this message is not printed, then perhaps your program is getting stuck somewhere before actually starting the Dataflow job. If you include your program's output, that will help debugging.
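One way to apply the check described above is to scan the captured output of the launching program for that message. The following is only a minimal sketch: the marker text is taken from the message quoted in this answer, while the helper name `find_monitoring_url` and the sample URL are hypothetical and used purely for illustration.

```python
import re

# Marker text as quoted in the answer above. The exact wording may differ
# between SDK versions, so treat it as an assumption.
MARKER = "To access the Dataflow monitoring console, please navigate to"


def find_monitoring_url(output):
    """Return the first URL printed after MARKER, or None if it never appears."""
    for line in output.splitlines():
        if MARKER in line:
            # Only look at the text that follows the marker on the same line.
            match = re.search(r"https?://\S+", line.split(MARKER, 1)[1])
            if match:
                return match.group(0)
    return None


if __name__ == "__main__":
    # Hypothetical output line; a real run prints a link for your own project and job.
    sample = "INFO: " + MARKER + " https://example.invalid/dataflow/job/hypothetical-id"
    url = find_monitoring_url(sample)
    print(url if url else "No monitoring link found; the job may not have been submitted.")
```

If the function returns None for the full output of your run, the program probably stopped before submitting the job.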

To deploy a pipeline for execution by Dataflow, specify the runner and project execution parameters on the command line or via the DataflowPipelineOptions class. runner must be set to DataflowRunner (Apache Beam 2.x.x) and project must be set to your GCP project ID. See Specifying Execution Parameters. If you do not see the job in the Dataflow Jobs UI list for that project, then it is not running in Dataflow.
If you kill the process that deploys a job to Dataflow, then the job will continue to run in Dataflow. It will not be stopped.
This is trivial, but to be absolutely clear, you must call run() on the Pipeline object in order for it to be executed (and therefore deployed to Dataflow). The return value of run() is a PipelineResult object which contains various methods for determining the status of a job. For example, you can call pipeline.run().waitUntilFinish(); to force your program to block execution until the job is complete. If your program is blocked, then you know the job was triggered. See the PipelineResult section of the Apache Beam Java SDK docs for all of the available methods.

Related

Run command from terminal window in AWS Instance at specified time or on start up

I have an AWS Cloud9 instance that starts running at 11:52 PM MST and stops running at 11:59 PM MST. I have a dockerfile within the instance that, when run with the correct mount, will run a set of C++ .cpp files that collect live web data. The ultimate goal of this instance is to be fully automatic, so that every night it collects the live web data for that date, which is why the instance is up at the very end of the day each night. Is it possible to have my AWS instance run a given command in a terminal window at a certain time, say 11:55 PM, or even upon startup? That is, at that time, or at startup, the command "docker run -it...." is run within the instance.
Is automating this process possible? I have looked into CloudWatch events and think that might be the best way to go about automating this process but I am not quite sure how I would create a rule to fulfill the job. If it is not possible to automate a certain command within a terminal window, could I automate the dockerfile to run at a certain time?
Of course. You can automate running any command, not just docker, using the cron daemon. All you need to do is place your command in a shell script file, say doc.sh, in your desired directory.
SSH into your instance.
Open a terminal and type crontab -e.
Enter the details in this form: a b c d e /directory/command
where a is the minute, b the hour, c the day of the month, d the month, and e the day of the week.
/directory/command specifies the location of the script you want to run.
For more cron examples, see https://www.cyberciti.biz/faq/how-do-i-add-jobs-to-cron-under-linux-or-unix-oses/
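The five-field format described above can be sketched with a small helper that builds the line to paste into crontab -e. This is an illustration only: the function names and the script path /home/ec2-user/doc.sh are hypothetical, and the @reboot shortcut is an extension supported by common cron implementations rather than something every cron provides.

```python
def cron_line(minute, hour, command):
    """Build a crontab entry that runs `command` every day at hour:minute."""
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError("minute must be 0-59 and hour must be 0-23")
    # Field order: minute, hour, day of month, month, day of week, command.
    # "*" in the last three time fields means "every day".
    return f"{minute} {hour} * * * {command}"


def reboot_line(command):
    """Build an entry that runs `command` once when the cron daemon starts at boot."""
    return f"@reboot {command}"


if __name__ == "__main__":
    script = "/home/ec2-user/doc.sh"  # hypothetical location of the wrapper script
    print(cron_line(55, 23, script))  # daily at 23:55 in the instance's time zone
    print(reboot_line(script))
```

Two things to check before relying on this. cron interprets the time in the instance's own time zone, which is often UTC on EC2, so 11:55 PM MST may need to be written as a different hour. Also, the script should probably call docker run without -it, since a cron job has no terminal attached.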
If you have a dockerfile that you want to run for a few minutes a day, you should look into Fargate. You can schedule an event with Cloudwatch, run the container and then shut it down when it's done.
It will probably cost around $0.01/day to run this.

Why would Oozie fail a job with Error Code LimitExceededException when yarn reports that oozie launcher & mapreduce job have completed successfully?

There are a few questions similar to this on SO. However nothing has worked for me. So I am posting this question.
I am using CDH 6.2.1.
I have a workflow that has a map-reduce action. The map-reduce job creates a lot of counters (I think the m/r job produces ~300 counters).
I have set the cdh/yarn/config mapreduce.job.counters.max property to 8192.
I have also set the:
YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml
YARN Service MapReduce Advanced Configuration Snippet (Safety Valve)
MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml
If I run the map-reduce job as a stand-alone yarn job (using yarn jar command on the command-line), the job completes successfully.
When I run the job as part of the workflow:
On Yarn/All Applications Page I see that: the oozie launcher job completes successfully.
On Yarn/All Applications Page I see that: the map/reduce job completes successfully.
However oozie fails the job reporting: LimitExceededException: Too many counters: 121 max=120
The configuration for the mapreduce job & oozie launcher as reported by yarn has the setting:
<property>
<name>mapreduce.job.counters.max</name>
<value>8192</value>
<final>true</final>
<source>yarn-site.xml</source>
</property>
The Oozie web interface System-Info/OS-Env reports the following HADOOP_CONF_DIR: /var/run/cloudera-scm-agent/process/459-oozie-OOZIE_SERVER/yarn-conf/
In that folder I can see that the mapred-site.xml also has:
<!--'mapreduce.job.counters.max', originally set to '8192' (final), is overridden below by a safety valve-->
<property>
<name>mapreduce.job.counters.max</name>
<value>8192</value>
<final>true</final>
</property>
However I cannot find that property in the yarn-site.xml.
I am not sure what else I can do at this point...
This is an Oozie issue which has been resolved. However, the fix is not available in the current version of Cloudera.
I am posting this here, in case anyone else has the same issue.

AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through ssh), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfigured the spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}
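The settings above can also be generated rather than typed by hand. The sketch below assumes the notebook uses the sparkmagic PySpark kernel, where a JSON body like this is normally placed after the %%configure -f cell magic in the first cell; that magic is not named in the answer, so treat it as an assumption. The helper name `livy_session_conf` is hypothetical, and the default paths are the ones quoted in the answer, which may differ on other EMR releases.

```python
import json


def livy_session_conf(python_path="/usr/bin/python3.7",
                      virtualenv_path="/usr/bin/virtualenv"):
    """Return the session configuration from the answer above as a JSON string."""
    conf = {
        "conf": {
            # Interpreter the executors should use; it must exist on every node.
            "spark.pyspark.python": python_path,
            "spark.pyspark.virtualenv.enabled": "true",
            "spark.pyspark.virtualenv.type": "native",
            "spark.pyspark.virtualenv.bin.path": virtualenv_path,
        }
    }
    return json.dumps(conf, indent=2)


if __name__ == "__main__":
    # Paste the printed JSON after the cell magic in the first notebook cell.
    print(livy_session_conf())
```

Before using it, it is worth confirming on a worker node that the interpreter path actually exists, since a missing python3 on the executors was the original failure.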

How to execute mvn clean install in goCD

The mvn clean build command does not execute in GoCD. The pipeline gets triggered, but nothing is displayed in the logs, and the job keeps running forever after setting the inactivity time to 1 min.
I have created a pipeline and added the mvn clean install command to it, as in the image below. Please let me know what needs to be changed to generate artifacts as the first step.
The most important clue is in your first screenshot: it says "Agent: Not yet assigned". That means that no agent (aka worker) could be found that can handle your job.
Please read the manual on managing agents, specifically the section Matching jobs to agents.
Frequent reasons why no agent can be assigned:
No agents available at all
The agent(s) are in environments, but the pipeline isn't
Mismatch between resources specified in the job and in the agent management.

Dataflow process hanging

I am running a batch job on dataflow, querying from BigQuery. When I use the DirectRunner, everything works, and the results are written to a new BigQuery table. Things seem to break when I change to DataflowRunner.
The logs show that 30 worker instances are spun up successfully. The graph diagram in the web UI shows the job has started. The first 3 steps show "Running", the rest show "not started". None of the steps show any records transformed (i.e. the output collections all show '-'). The logs show many messages that look like this, which may be the issue:
skipping: failed to "StartContainer" for "python" with CrashLoopBackOff: "Back-off 10s restarting failed container=python pod=......
I took a step back and just ran the minimal wordcount example, and that completed successfully. So all the necessary APIs seem to be enabled for Dataflow runner. I'm just trying to get a sense of what is causing my Dataflow job to hang.
I am executing the job like this:
python2.7 script.py --runner DataflowRunner --project projectname --requirements_file requirements.txt --staging_location gs://my-store/staging --temp_location gs://my-store/temp
I'm not sure if my solution addressed the cause of the error pasted above, but fixing dependency problems (which were not showing up as errors in the log at all!) did solve the hanging Dataflow processes.
So if you have a hanging process, make sure your workers have all their necessary dependencies. You can provide them through the --requirements_file argument, or through a custom setup.py script.
Thanks to the help I received in this post, the pipeline appears to be operating, albeit VERY SLOWLY.
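The two ways of shipping dependencies mentioned above can be sketched as a helper that assembles the launch arguments. This is only an illustration: the function name `dataflow_args` is hypothetical, the project and bucket values are the placeholders from the question, and --setup_file is assumed to be the flag that carries the custom setup.py.

```python
def dataflow_args(project, staging, temp, requirements_file=None, setup_file=None):
    """Assemble command-line arguments for launching a pipeline on Dataflow."""
    args = [
        "--runner", "DataflowRunner",
        "--project", project,
        "--staging_location", staging,
        "--temp_location", temp,
    ]
    # Packages listed in a requirements file are installed on each worker at startup.
    if requirements_file:
        args += ["--requirements_file", requirements_file]
    # A setup.py is the usual route for local modules or non-trivial dependencies.
    if setup_file:
        args += ["--setup_file", setup_file]
    return args


if __name__ == "__main__":
    argv = dataflow_args(
        "projectname",
        "gs://my-store/staging",
        "gs://my-store/temp",
        requirements_file="requirements.txt",
    )
    print("python script.py " + " ".join(argv))
```

Keeping the arguments in one place makes it easier to notice when a job is launched without either dependency flag.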