I need to be able to get the YARN applicationId from a MapReduce job, but I can't find any API to do that. An example of my MapReduce job:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.submit();
job.waitForCompletion(true);
Is there an API similar to job.getJobID() to retrieve the YARN applicationId? I know about the yarn application -list command, but I need to be able to determine the applicationId in my program through some kind of API. It looks like the jobId is the same as the applicationId except for the prefix ('job' vs 'application'), which I could parse, but I am hoping there is something in the API I can use.
I ended up parsing the jobId, removing the 'job' prefix and adding the 'application' prefix, as it appears the applicationId is not exposed for a MapReduce job and it is basically the same id as the jobId with a different prefix. It's a hacky approach, but it works for now.
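For reference, a minimal sketch of that prefix swap in Scala (assuming job is the submitted org.apache.hadoop.mapreduce.Job; the Java version is the same one-liner with String.replaceFirst):
// The YARN applicationId shares its numeric parts with the jobId; only the prefix differs
// ("job_<timestamp>_<id>" vs "application_<timestamp>_<id>").
val appId: String = job.getJobID.toString.replaceFirst("^job", "application")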
You can also try this:
job.getJobID().appendTo(new StringBuilder("application"))
If you look at the JobID class, you will see that "JOB" is the prefix being passed as an argument, and in this case it can be replaced by "application".
This will give the application id.
When running my job, I am getting the following exception:
Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 32 in stage 2.0 failed 4 times, most recent failure: Lost task 32.3 in stage 2.0 (TID 50) (10.100.1.48 executor 8): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
I have tried to apply the requested configuration value, as follows:
val conf = new SparkConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
val spark: SparkContext = new SparkContext(conf)
//Get current sparkconf which is set by glue
val glueContext: GlueContext = new GlueContext(spark)
val args = GlueArgParser.getResolvedOptions(
sysArgs,
Seq("JOB_NAME").toArray
)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
but the same error occurs. I have also tried setting it to "CORRECTED" via the same approach.
It seems that the config is not properly making its way into the Spark execution. What is the proper way to get Spark config values set from a Scala Spark job on Glue?
When you are migrating between versions, it is always best to check out the AWS migration guides. In your case, this can be set in your Glue job properties by passing the properties below as required. To set these, navigate to the Glue console -> Jobs -> click on the job -> Job details -> Advanced properties -> Job parameters.
- Key: --conf
- Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
Please refer to the guide below for more information:
https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html#migrating-version-30-from-20
This code at the top of my Glue job seems to have done the trick:
val conf = new SparkConf()
//alternatively, use LEGACY if that is required
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
val spark: SparkContext = new SparkContext(conf)
val glueContext: GlueContext = new GlueContext(spark)
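If it helps, you can also sanity-check at runtime that the settings actually took effect; a minimal sketch, assuming the spark SparkContext created above:
// Print the effective value from the running context's configuration
println(spark.getConf.get("spark.sql.legacy.parquet.int96RebaseModeInRead"))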
Scenario:
I am running a Spark Scala job in AWS EMR. The job dumps some metadata unique to that application. For this I am writing to the location "s3://bucket/key/<APPLICATION_ID>", where the application ID is obtained with val APPLICATION_ID: String = getSparkSession.sparkContext.getConf.getAppId
Is there a way to write instead to an S3 location like "s3://bucket/key/<emr_cluster_id>_<emr_step_id>"?
How can I get the cluster ID and step ID from inside the Spark Scala application?
Writing this way would help me reach the right cluster and debug the logs.
Is there any way other than reading "/mnt/var/lib/info/job-flow.json"?
PS: I am new to Spark, Scala and EMR. Apologies in advance if this is an obvious query.
With PySpark on EMR, EMR_CLUSTER_ID and EMR_STEP_ID are available as environment variables (confirmed on emr-5.30.1).
They can be used in code as follows:
import os
emr_cluster_id = os.environ.get('EMR_CLUSTER_ID')
emr_step_id = os.environ.get('EMR_STEP_ID')
I can't test but the following similar code should work in Scala.
val emr_cluster_id = sys.env.get("EMR_CLUSTER_ID")
val emr_step_id = sys.env.get("EMR_STEP_ID")
Since sys.env is simply a Map[String, String], its get method returns an Option[String], which doesn't fail if these environment variables don't exist. If you want an exception to be raised instead, you can use sys.env("EMR_x_ID") directly.
The EMR_CLUSTER_ID and EMR_STEP_ID variables are visible in the Spark History Server UI under the Environment tab, along with other variables that may be of interest.
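Putting it together for the S3 prefix from the original question, a small sketch (the bucket/key and the fallback values are placeholders):
// Build "s3://bucket/key/<emr_cluster_id>_<emr_step_id>" from the environment variables
val clusterId = sys.env.getOrElse("EMR_CLUSTER_ID", "unknown-cluster")
val stepId = sys.env.getOrElse("EMR_STEP_ID", "unknown-step")
val outputPath = s"s3://bucket/key/${clusterId}_${stepId}"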
I was having the same problem recently, trying to get the cluster ID programmatically. I ended up using the listClusters() method of the EMR client.
You can use the AWS SDK for Java, or a Scala wrapper on top of it, to call this method.
Adding on top of A.B's answer, you can pass the cluster ID to the listSteps method to get a list of the step IDs, like this:
emrClient.listSteps(new ListStepsRequest().withClusterId(jobFlowId)).getSteps()
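For completeness, a hedged sketch of that flow in Scala on top of the AWS SDK for Java (v1); filtering on RUNNING clusters and taking the first result are assumptions about how you would identify your own cluster:
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model.{ListClustersRequest, ListStepsRequest}
import scala.collection.JavaConverters._

val emrClient = AmazonElasticMapReduceClientBuilder.defaultClient()

// List clusters (here narrowed to RUNNING ones) and pick the one of interest
val clusterId = emrClient
  .listClusters(new ListClustersRequest().withClusterStates("RUNNING"))
  .getClusters.asScala
  .head.getId // e.g. "j-XXXXXXXXXXXXX"

// Then list the steps of that cluster, as in the snippet above
val stepIds = emrClient
  .listSteps(new ListStepsRequest().withClusterId(clusterId))
  .getSteps.asScala
  .map(_.getId)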
As part of an automated deployment with the .NET AWS SDK, I am trying to create a new task definition revision, update the docker image tag label with my newly deployed version and then update a service to use that new revision.
I have something like this:
var taskDefinitionResponse = await _ecsClient.RegisterTaskDefinitionAsync(new RegisterTaskDefinitionRequest
{
ContainerDefinitions = new List<ContainerDefinition>(new[] {new ContainerDefinition(){Image = "new image:v123"}})
});
await _ecsClient.UpdateServiceAsync(new UpdateServiceRequest()
{
TaskDefinition = taskDefinitionResponse.TaskDefinition.TaskDefinitionArn,
});
My concern is that the above code doesn't duplicate the existing task definition. For example, in the AWS Console, when you click "Create new revision" you have to select a task definition so that the button creates a duplicate, which you can then modify and save as the new revision. So would I need some code that gets an existing task definition, changes just the Docker image, and then calls RegisterTaskDefinitionAsync with the existing definition and the modified Docker image?
The UI automatically makes multiple API calls and gives you the option to create a new revision from previous ones. To achieve the same thing, you can try something like this.
First, list the task definitions using the family prefix (assuming you are creating the task definitions with the image name or some other prefix).
Task<ListTaskDefinitionsResponse> ListTaskDefinitionsAsync(
ListTaskDefinitionsRequest request,
CancellationToken cancellationToken
)
Using the ListTaskDefinitionsResponse, select the latest task definition ARN and make another API call to get the full task definition response.
Task<DescribeTaskDefinitionResponse> DescribeTaskDefinitionAsync(
DescribeTaskDefinitionRequest request,
CancellationToken cancellationToken
)
Now you have the latest TaskDefinition object, in which you can modify the image version and publish it again.
Task<RegisterTaskDefinitionResponse> RegisterTaskDefinitionAsync(
RegisterTaskDefinitionRequest request,
CancellationToken cancellationToken
)
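For what it's worth, here is that describe → modify → register flow sketched with the AWS SDK for Java (v1) from Scala, since that is the language used elsewhere in this thread; treat it as an illustration of the call sequence rather than of the .NET API (whose async methods above mirror it). The family, image tag, cluster and service names, and the set of copied fields are assumptions:
import com.amazonaws.services.ecs.AmazonECSClientBuilder
import com.amazonaws.services.ecs.model.{DescribeTaskDefinitionRequest, RegisterTaskDefinitionRequest, UpdateServiceRequest}
import scala.collection.JavaConverters._

val ecs = AmazonECSClientBuilder.defaultClient()

// 1. Fetch the latest active revision of the family (family name is a placeholder)
val current = ecs
  .describeTaskDefinition(new DescribeTaskDefinitionRequest().withTaskDefinition("my-task-family"))
  .getTaskDefinition

// 2. Copy the container definitions, swapping in the new image tag
val containers = current.getContainerDefinitions.asScala.map(_.withImage("new-image:v123")).asJava

// 3. Register a new revision, carrying over the fields you care about from the old one
val newArn = ecs
  .registerTaskDefinition(
    new RegisterTaskDefinitionRequest()
      .withFamily(current.getFamily)
      .withContainerDefinitions(containers)
      .withTaskRoleArn(current.getTaskRoleArn)
      .withExecutionRoleArn(current.getExecutionRoleArn)
      .withNetworkMode(current.getNetworkMode)
      .withRequiresCompatibilities(current.getRequiresCompatibilities)
      .withCpu(current.getCpu)
      .withMemory(current.getMemory)
      .withVolumes(current.getVolumes))
  .getTaskDefinition.getTaskDefinitionArn

// 4. Point the service at the new revision (cluster and service names are placeholders)
ecs.updateService(new UpdateServiceRequest()
  .withCluster("my-cluster")
  .withService("my-service")
  .withTaskDefinition(newArn))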
AWS .NET SDK Reference -
https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/ECS/TECSClient.html
Let me know your thoughts!
I am using Airflow to run Spark jobs on Google Cloud Composer. I need to
Create a cluster (YAML parameters supplied by the user)
Run a list of Spark jobs (job parameters also supplied by per-job YAML)
With the Airflow API I can read YAML files and push variables across tasks using XCom.
But, consider the DataprocClusterCreateOperator()
cluster_name
project_id
zone
and a few other arguments are marked as templated.
What if I want to pass in other arguments as templated (which currently are not), like image_version, num_workers, worker_machine_type, etc.?
Is there any workaround for this?
I'm not sure what you mean by 'dynamic', but when the YAML file is updated, as long as the file is read in the DAG file body, the DAG will be refreshed and pick up the new arguments from the YAML file. So you actually don't need XCom to get the arguments.
Simply create a params dictionary and then pass it to default_args:
import os
import yaml

CONFIGFILE = os.path.join(
    os.path.dirname(os.path.realpath(__file__)), 'your_yaml_file')

with open(CONFIGFILE, 'r') as ymlfile:
    CFG = yaml.safe_load(ymlfile)

default_args = {
    'cluster_name': CFG['section_A']['cluster_name'],  # edit here according to the structure of your yaml file
    'project_id': CFG['section_A']['project_id'],
    'zone': CFG['section_A']['zone'],
    'image_version': CFG['section_A']['image_version'],
    'num_workers': CFG['section_A']['num_workers'],
    'worker_machine_type': CFG['section_A']['worker_machine_type'],
    # you can add all the params you need here
}

dag = DAG(
    dag_id=DAG_NAME,
    schedule_interval=SCHEDULE_INTERVAL,
    default_args=default_args,  # pass the params to the DAG environment
)

Task1 = DataprocClusterCreateOperator(
    task_id='your_task_id',
    dag=dag,
)
But if you want dynamic DAGs rather than dynamic arguments, you may need another strategy, like this.
So you probably need to figure out the basic idea first:
At which level is the dynamic behaviour needed? Task level? DAG level?
Alternatively, you can create your own operator to do the job and take the parameters.
I fail to understand how to simply list the contents of an S3 bucket on EMR during a Spark job.
I wanted to do the following:
Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false))
This always fails with the following error
java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020
In the hadoopConfiguration, fs.defaultFS -> hdfs://**********.eu-central-1.compute.internal:8020
The way I understand it, if I don't use a protocol (just /myfolder/myfile instead of e.g. hdfs://myfolder/myfile), it will default to fs.defaultFS.
But I would expect that if I specify s3://mybucket/, the fs.defaultFS should not matter.
How does one access the directory information? spark.read.parquet("s3://mybucket/*.parquet") works just fine, but for this task I need to check the existence of some files and would also like to delete some. I assumed org.apache.hadoop.fs.FileSystem would be the correct tool.
PS: I also don't understand how logging works. If I use deploy-mode cluster (I want to deploy jars from S3, which does not work in client mode), then I can only find my logs in s3://logbucket/j-.../containers/application.../container...0001. There is quite a long delay before those show up in S3. How do I find them via ssh on the master? Or is there some faster/better way to check Spark application logs?
UPDATE: Just found them under /mnt/var/log/hadoop-yarn/containers; however, they are owned by yarn:yarn and as the hadoop user I cannot read them. :( Ideas?
In my case I needed to read a Parquet file that was generated by prior EMR jobs, and I was looking for the list of files for a given S3 prefix. But the nice thing is that we don't need to do all that; we can simply do this:
spark.read.parquet(bucket+prefix_directory)
URI.create() should be used to point it to the correct FileSystem.
val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val dirPaths = FileSystem.get(URI.create("<s3-path>"), fs.getConf).listStatus(new Path("<s3-path>"))
I don't think you are picking up the FS right; just use the static FileSystem.get(URI, Configuration) method, or Path.getFileSystem(Configuration).
Try something like:
Path p = new Path("s3://bucket/subdir");
FileSystem fs = p.getFileSystem(conf);
FileStatus[] status= fs.listStatus(p);
Regarding logs, the YARN UI should let you get at them via the node managers.
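Since the question also asks about checking the existence of files and deleting some, here is a minimal Scala sketch building on the same idea (assuming a SparkSession named spark; the path is a placeholder):
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("s3://mybucket/subdir/somefile.parquet") // placeholder
// Resolve the FileSystem from the path's scheme instead of relying on fs.defaultFS
val fs: FileSystem = path.getFileSystem(spark.sparkContext.hadoopConfiguration)

if (fs.exists(path)) {
  fs.delete(path, false) // second argument: recursive delete
}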