When I run a PySpark job on a Dataproc cluster like this
gcloud --project <project_name> dataproc jobs submit pyspark --cluster <cluster_name> <python_script>
my print statements don't show up in my terminal.
Is there any way to output data onto the terminal in PySpark when running jobs on the cloud?
Edit: I would like to print/log info from within my transformation. For example:
def print_funct(l):
    print(l)
    return l

rddData.map(lambda l: print_funct(l)).collect()
Should print every line of data in the RDD rddData.
Doing some digging, I found this answer for logging; however, testing it gave me the results of this question, whose answer states that logging isn't possible within the transformation.
Printing or logging inside of a transform will end up in the Spark executor logs, which can be accessed through your Application's AppMaster or HistoryServer via the YARN ResourceManager Web UI.
You could alternatively collect the information you are printing alongside your output (e.g. in a dict or tuple). You could also stash it away in an accumulator and then print it from the driver.
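For example, here is a minimal sketch of both ideas, assuming a SparkContext named sc and placeholder data standing in for rddData (not your actual job):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rddData = sc.parallelize(["a", "b", "c"])  # placeholder data

# 1) Carry the debug info alongside the output and collect it back to the driver.
with_debug = rddData.map(lambda l: (l, "processed %r" % l)).collect()
for value, message in with_debug:
    print(value, message)

# 2) Use an accumulator for simple counters and print its value from the driver.
processed = sc.accumulator(0)

def count_and_pass_through(l):
    processed.add(1)
    return l

rddData.map(count_and_pass_through).collect()
print("rows processed:", processed.value)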
If you are doing a lot of print statement debugging, you might find it faster to SSH into your master node and use the pyspark REPL or IPython to experiment with your code. This would also allow you to use the --master local flag which would make your print statements appear in stdout.
I'm new to Dataproc and am trying to submit a Pig job to Google Dataproc via gcloud:
gcloud config set project PROJECT
gcloud dataproc jobs submit pig --cluster=cluster-workaround --region=us-east4 --verbosity=debug --properties-file=gs://bucket/cvr_gcs_one.properties --file=gs://bucket-temp/intellibid-intermediat-cvr.pig
with the properties file below:
jarLocation=gs://bucket-data-science/emr/jars/pig.jar
pigScriptLocation=gs://bucket-data-science/emr/pigs
logLocation=gs://bucket-data-science/prod/logs
udf_path=gs://bucket-data-science/emr/jars/udfs.jar
csv_dir=gs://bucket-db-dump/prod
currdate=2022-12-13
train_cvr=gs://bucket-temp/{2022-12-09}
output_dir=gs://analytics-bucket/outoout
and below is a sample of the Pig script uploaded to GCS:
register $udf_path;
SET default_parallel 300;
SET pig.exec.mapPartAgg true; -- To remove load on combiner
SET pig.tmpfilecompression TRUE -- To make Compression true between MapReduce Job Mainly when using Joins
SET pig.tmpfilecompression.codec gz -- To Specify the type of compression between MapReduce Job
SET mapreduce.map.output.compress TRUE --To make Compression true between Map and Reduce
SET mapreduce.map.output.compress.codec org.apache.hadoop.io.compress.GzipCodec
set mapred.map.tasks.speculative.execution false
SET mapreduce.task.timeout 10800000
set mapreduce.output.fileoutputformat.compress true
set mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.GzipCodec
SET mapreduce.map.maxattempts 16
SET mapreduce.reduce.maxattempts 16
SET mapreduce.job.queuename HIGH_PRIORITY
define GSUM com.java.udfs.common.javaSUM();
define get_cvr_key com.java.udfs.common.ALL_CTR_MODEL('$csv_dir', 'variableList.ini')
define multiple_file_generator com.java.udfs.common.CVR_KEY_GENERATION('$csv_dir','newcampaignToKeyMap')
train_tmp1 = load '$train_cvr/' using PigStorage('\t','-noschema') as (cookie,AdvID,nviews,ls_dst,ls_src,ls_di,ls_ft,ls_np,tos,nsess,e100_views,e200_views,e300_views,e400_views,e100_tos,e200_tos,e300_tos,e400_tos,uniq_prod,most_seen_prod_freq,uniq_cat,uniq_subcat,search_cnt,click_cnt,cart_cnt,HSDO,os,bwsr,dev,hc_c_v,hc_c_tp,hc_c_up,hc_c_ls,hc_s_v,hc_s_tp,hs_s_up,hc_s_ls,hc_clk_pub,hc_clk_cnt,hc_clk_lm,hp_ls_v,hp_ls_c,hp_ls_s,hp_ms_v,hp_ms_c,hp_ms_s,hu_v,hu_c,hu_s,purchase_flag,hp_ls_cvr,hp_ls_crr,hp_ms_cvr,hp_ms_crr,mpv,gc_c_tp,gc_clk_cnt,gc_c_up,gc_clk_lm,gc_c_v,gc_c_ls,gc_s_v,gc_s_lsts,gc_s_tp,gc_s_up,gc_clk_pub,epoch_ms,gc_ac_s,gc_ac_clk,gc_ac_vclk,udays,hc_vclk_cnt,gc_vclk_cnt,e205_view,e205_tos,AdvID_copy,hc_p_ms_p,hc_c_ms_p,most_seen_cat_freq,hc_p_ls_p,currstage,hc_c_city);
I am getting the error below:
INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
ERROR org.apache.pig.impl.PigContext - Undefined parameter : udf_path
2022-12-13 11:58:51,504 [main] ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException.
org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : udf_path
I tried most of the methods via the console as well, but couldn't find good documentation to go through.
Also, what exactly is the difference between the Query parameters field ("Specify the parameter names and values to insert in place of parameter entries in the query file. The query uses those values when it runs.") and the Properties field ("A list of key-value pairs to configure the job.") in the UI?
Can someone guide me on what I'm doing wrong and how I can run a Pig script in Dataproc?
The properties file sets job configuration properties, not Pig parameter-substitution values, so parameters referenced in the script (such as $udf_path) have to be passed with --params. Pass it like below:
gcloud config set project PROJECT
gcloud dataproc jobs submit pig --cluster=cluster-workaround --region=us-east4 --verbosity=debug --properties-file=gs://bucket/cvr_gcs_one.properties --file=gs://bucket-temp/your_pig.pig --params udf_path=gs://your_udfs.jar
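Presumably every parameter the script references (csv_dir, currdate, output_dir, and so on) needs to be passed the same way; a sketch with placeholder values taken from your properties file might look like:

gcloud dataproc jobs submit pig --cluster=cluster-workaround --region=us-east4 --file=gs://bucket-temp/your_pig.pig --params=udf_path=gs://your_udfs.jar,csv_dir=gs://bucket-db-dump/prod,currdate=2022-12-13,output_dir=gs://analytics-bucket/outoout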
I am trying to process the billed bytes of each BigQuery job run by all users. I was able to find the details in the BigQuery UI under Project History. Also, running bq --location=europe-west3 show --job=true --format=prettyjson JOB_ID in Google Cloud Shell gives exactly the information I want (the BigQuery SQL query, billed bytes, and run time for each job).
For the next step, I want to access the JSON returned by the above command on my local machine. I have already configured the gcloud CLI properly and am able to list BigQuery jobs using gcloud alpha bq jobs list --show-all-users --limit=10.
I select a job ID and run the following command: gcloud alpha bq jobs describe JOB_ID --project=PROJECT_ID
I get (gcloud.alpha.bq.jobs.describe) NOT_FOUND: Not found: Job PROJECT_ID:JOB_ID--toyFH. It is possibly because of the creation and end times, as shown here.
What am I doing wrong? Is there another way to get the details of a BigQuery job using the gcloud CLI (maybe there is a way to get billed bytes along with the query details using the Python SDK)?
You can get job details with different APIs, or the way you are doing it, but first: why are you using the alpha version of the bq command?
To do it in Python, you can try something like this:
from google.cloud import bigquery

def get_job(
    client: bigquery.Client,
    job_id: str,
    location: str = "us",
) -> None:
    # Fetch the job and print a few of its properties.
    job = client.get_job(job_id, location=location)
    print(f"{job.location}:{job.job_id}")
    print(f"Type: {job.job_type}")
    print(f"State: {job.state}")
    print(f"Created: {job.created.isoformat()}")
There are more properties you can read from the job object. Also check the status of the job in the console first, to compare the two.
You can find more details here: https://cloud.google.com/bigquery/docs/managing-jobs#python
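Since the goal is billed bytes per job across all users, a minimal sketch along the same lines might look like the following (PROJECT_ID is a placeholder; list_jobs and total_bytes_billed are part of the same google-cloud-bigquery library):

from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_ID")  # placeholder project ID

# List recent jobs from all users in the project and print their billed bytes.
for job in client.list_jobs(all_users=True, max_results=10):
    if job.job_type == "query":
        print(job.job_id, job.state, job.total_bytes_billed)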
Scenario:
I am running a Spark Scala job in AWS EMR. My job dumps some metadata unique to that application, and for that I am writing to the location "s3://bucket/key/<APPLICATION_ID>", where the application ID is val APPLICATION_ID: String = getSparkSession.sparkContext.getConf.getAppId
Is there a way to instead write to an S3 location like "s3://bucket/key/<emr_cluster_id>_<emr_step_id>"?
How can I get the cluster ID and step ID from inside the Spark Scala application?
Writing the output this way will help me trace it back to the cluster and step and debug the logs.
Is there any way other than reading the "/mnt/var/lib/info/job-flow.json" ?
PS: I am new to Spark, Scala, and EMR. Apologies in advance if this is an obvious question.
With PySpark on EMR, EMR_CLUSTER_ID and EMR_STEP_ID are available as environment variables (confirmed on emr-5.30.1).
They can be used in code as follows:
import os
emr_cluster_id = os.environ.get('EMR_CLUSTER_ID')
emr_step_id = os.environ.get('EMR_STEP_ID')
I can't test it, but similar code like the following should work in Scala:
val emr_cluster_id = sys.env.get("EMR_CLUSTER_ID")
val emr_step_id = sys.env.get("EMR_STEP_ID")
Since sys.env is simply a Map[String, String], its get method returns an Option[String], which doesn't fail if these environment variables don't exist. If you want an exception to be raised when they're missing, you could use sys.env("EMR_x_ID") instead.
The EMR_CLUSTER_ID and EMR_STEP_ID variables are visible in the Spark History Server UI under the Environment tab, alongside other variables that may be of interest.
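As an illustration only, building the output prefix the question asks about could look like this in Python (the bucket, key, and fallback values are placeholders):

import os

# Fall back to placeholder values if the EMR variables are not set,
# e.g. when running outside EMR.
cluster_id = os.environ.get("EMR_CLUSTER_ID", "no-cluster")
step_id = os.environ.get("EMR_STEP_ID", "no-step")
output_path = f"s3://bucket/key/{cluster_id}_{step_id}"
print(output_path)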
I was having the same problem recently, trying to get the cluster ID programmatically. I ended up using the listClusters() method of the emrClient.
You can use the AWS SDK for Java, or a Scala wrapper on top of it, to call this method.
Adding on top of A.B's answer, you can pass the cluster ID to the listSteps method to get a list of the step IDs, like this:
emrClient.listSteps(new ListStepsRequest().withClusterId(jobFlowId)).getSteps()
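If you are working from Python instead, the same approach looks roughly like this with boto3 (a stand-in for the Java SDK calls above; the cluster state filter is an assumption about which clusters you care about):

import boto3

emr = boto3.client("emr")

# Find active clusters, then list the steps of the first one.
clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]
if clusters:
    cluster_id = clusters[0]["Id"]
    for step in emr.list_steps(ClusterId=cluster_id)["Steps"]:
        print(cluster_id, step["Id"], step["Name"])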
The question: Imagine I run a very simple Python script on EMR - assert 1 == 2. This script will fail with an AssertionError. The log that contains the traceback with that AssertionError will be placed (if logs are enabled) in an S3 bucket that I specified on setup, and I can then read the log containing the AssertionError once those logs get dropped into S3. However, where do those logs exist before they get dropped into S3?
I presume they would exist on the EC2 instance that the particular script ran on. Let's say I'm already connected to that EC2 instance and the EMR step that the script ran on had the ID s-EXAMPLE. If I do:
[n1c9#mycomputer cwd]# gzip -d /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr.gz
[n1c9#mycomputer cwd]# cat /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
Then I'll get output with the typical 20/01/22 17:32:50 INFO Client: Application report for application_1 (state: ACCEPTED) lines that you can see in the stderr log file on EMR.
So my question is: Where is the log (stdout) to see the actual AssertionError that was raised? It gets placed in my S3 bucket indicated for logging about 5-7 minutes after the script fails/completes, so where does it exist in EC2 before that? I ask because getting to these error logs before they are placed on S3 would save me a lot of time - basically 5 minutes each time I write a script that fails, which is more often than I'd like to admit!
What I've tried so far: I've checked the stdout file on the EC2 machine at the paths in the code sample above, but the stdout file is always empty.
What I'm struggling to understand is how that stdout file can be empty if there's an AssertionError traceback available on S3 minutes later (am I misunderstanding how this process works?). I also tried looking in some of the temp folders that PySpark builds, but had no luck with those either. Additionally, I've printed the outputs of the consoles for the EC2 instances running on EMR, both core and master, but none of them seem to have the relevant information I'm after.
I also looked through some of the EMR methods for boto3 and tried the describe_step method documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.describe_step - which, for failed steps, returns a FailureDetails JSON dict. Unfortunately, this only includes a LogFile key that links to the stderr.gz file on S3 (even if that file doesn't exist yet) and a Message key that contains a generic Exception in thread.. message, not the stdout. Am I misunderstanding something about the existence of those logs?
Please feel free to let me know if you need any more information!
It is quite normal with log-collecting agents that the actual log files don't grow; the agent just intercepts stdout to do what it needs.
Most probably, when you configure S3 for the logs, the agent is set up to either read and delete your actual log file, or perhaps symlink the log file to somewhere else, so the file is never actually written when a process opens it for writing.
Maybe try checking whether there is any symlink there:
find -L / -samefile /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
But it could be something other than a symlink that achieves the same logic, and I didn't find anything in the AWS docs, so most probably it isn't intended that you have both the S3 copies and local files at the same time, and you may not find it.
If you want to be able to check your logs more frequently, you may want to think about installing a third-party log collector (Logstash, Beats, rsyslog, Fluentd) and shipping logs to SolarWinds Loggly or logz.io, or setting up an ELK stack (Elasticsearch, Logstash, Kibana).
You can check this article from Loggly, or create a free account on logz.io and look at the many free shippers they support.
I've honed my transformations in Dataprep, and am now trying to run the Dataflow job directly using the gcloud CLI.
I've exported my template and template metadata file, and am trying to run them using gcloud dataflow jobs run and passing in the input & output locations as parameters.
I'm getting the error:
Template metadata regex '[ \t\n\x0B\f\r]*\{[ \t\n\x0B\f\r]*((.|\r|\n)*".*"[ \t\n\x0B\f\r]*:[ \t\n\x0B\f\r]*".*"(.|\r|\n)*){17}[ \t\n\x0B\f\r]*\}[ \t\n\x0B\f\r]*' was too large. Max size is 1000 but was 1187.
I've not specified this at the command line, so I know it's getting it from the metadata file - which is straight from DataPrep, unedited by me.
I have 17 input locations - one containing source data, all the others are lookups. There is a regex for each one, plus one extra.
If it runs when triggered from Dataprep, but won't run via the CLI, am I missing something?
I'd suspect the root cause is a limitation in gcloud that is not present in the Dataflow API or Dataprep. The best thing to do in this case is to open a new Cloud Dataflow issue in the public tracker and provide the details there.