I am trying to fetch a GCP Secret Manager secret from a Dataproc Spark job, but I am getting the error "Exception in thread "main" java.lang.NoClassDefFoundError: com/google/cloud/secretmanager/v1/AccessSecretVersionResponse".
I have added the jars "google-cloud-secretmanager-1.4.2.jar" and "gax-1.62.0.jar" as dependencies of the Dataproc Spark job.
I am using the code from the GCP documentation linked below.
https://cloud.google.com/secret-manager/docs/reference/libraries
Am I missing something here?
The 2.0-debian10 image has Python >= 3.0 installed, and google-cloud-secretmanager 1.4.2 does not support Python >= 3.0 (https://pypi.org/project/google-cloud-secret-manager/1.0.0/). Please use a later version of google-cloud-secretmanager.
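For reference, a minimal sketch of the access pattern with a newer Python client (google-cloud-secret-manager 2.x); the project and secret IDs are placeholders:

# Minimal sketch assuming google-cloud-secret-manager 2.x;
# project_id and secret_id are placeholders.
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
project_id = "my-project"  # placeholder
secret_id = "my-secret"    # placeholder
name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"

response = client.access_secret_version(request={"name": name})
payload = response.payload.data.decode("UTF-8")
print(payload)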
I want to launch a Spark job on EMR Serverless from Airflow. I want to use Spark 3.3.0 and Scala 2.13, but the emr-6.9.0 release ships with Scala 2.12. I created a fat jar including all Spark dependencies, and that does not work either. As an alternative, I am trying to use an EMR custom image by creating the application with --image-configuration through the Airflow operator, but it does not pass all of the boto API arguments through.
from airflow.providers.amazon.aws.operators.emr import EmrServerlessCreateApplicationOperator

create_app = EmrServerlessCreateApplicationOperator(
    task_id="create_my_app",
    job_type="SPARK",
    release_label="emr-6.9.0",
    config={
        "name": "data-ingestion",
        "imageConfiguration": {
            "imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1"
        },
    },
)
Airflow gives the following error message:
Unknown parameter in input: "imageConfiguration", must be one of:
name, releaseLabel, type, clientToken, initialCapacity, maximumCapacity, tags, autoStartConfiguration, autoStopConfiguration, networkConfiguration
This other config won't work either:
config={"name": "data-ingestion",
"imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1"})
Does anybody have any ideas other than downgrading my Scala version?
The Airflow operator passes the arguments to the boto3 client, and this client creates the application.
The imageConfiguration parameter was added to the boto3 client in 1.26.44 (PR), and the other parameters were added in different versions (please check the changelog).
So you can try to upgrade the version of boto3 in your Airflow environment, provided that it is compatible with the other dependencies; if it is not, you may need to upgrade your Airflow version.
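A quick way to check what the environment actually has (trivial sketch):

# Check the boto3 version available to Airflow;
# imageConfiguration support requires boto3 >= 1.26.44.
import boto3
print(boto3.__version__)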
I am using pyspark 2.4.7 and pyarrow 6.0.1.
I know from the documentation that there is a compatibility issue, so I need to set ARROW_PRE_0_15_IPC_FORMAT=1 inside spark-env.sh.
This solves the problem on my local machine; however, I am still getting the same error on AWS EMR 5.33.1.
I am using boto3 and have configured spark-env by passing
[...,
 {'Classification': 'spark-env',
  'Configurations': [{'Classification': 'export',
                      'Properties': {'ARROW_PRE_0_15_IPC_FORMAT': '1'}}],
  'Properties': {}}]
EMR loads the property, and the configuration can be seen in the EMR UI.
I've read that this config is only applied to the master node; is that why the worker nodes are still getting the same error?
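For reference, this is roughly how the classification above is passed when the cluster is created (a sketch; instance types, counts, roles and region are placeholders):

# Rough sketch of the cluster creation call; instance types, counts,
# roles and region are placeholders; only the Configurations block matters here.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="pyarrow-test",
    ReleaseLabel="emr-5.33.1",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"ARROW_PRE_0_15_IPC_FORMAT": "1"},
                }
            ],
            "Properties": {},
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)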
I'm currently trying to add a process in EMR 6.1.0 that will use Spark to store aggregated data in MySQL.
However, when I actually run Spark, I get the following error.
Exception in thread "main" java.lang.RuntimeException: Failed to load class of driverClassName com.mysql.jdbc.
This error did not occur in EMR 6.0.0.
In the process of updating from EMR 6.0.0 to 6.1.0, I changed the Spark version from 2.4.4 to 3.0.0.
The code itself has not changed significantly, and we know that it is not a network problem.
I've spent a lot of time looking through the AWS documentation and can't seem to find any hints.
Can anyone help me?
Place the MySQL connector jar under the $SPARK_HOME/jars folder, or pass the MySQL connector jar path in the spark-shell/spark-submit command using the --jars flag.
Spark 3.x depends on HikariCP.
https://github.com/apache/spark/blob/v3.0.0/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L1
The preloaded HikariCP can't load your application's driver class because of the ClassLoader it uses:
https://github.com/brettwooldridge/HikariCP/blob/HikariCP-2.5.1/src/main/java/com/zaxxer/hikari/HikariConfig.java#L318
this.getClass().getClassLoader().loadClass(driverClassName)
You should add shade settings if you use the sbt-assembly plugin:
assembly / assemblyShadeRules := {
  Seq("com.zaxxer.hikari").map { packageName =>
    ShadeRule.rename(s"${packageName}.**" -> s"my_app_shade_package.${packageName}.#1").inAll
  }
}
I have Airflow running in AWS MWAA. I would like to access the REST API, and there are 2 ways to do this, but neither seems to work for me.
Overriding api.auth_backend. This used to work, but now AWS MWAA won't allow you to add this option; it is considered blocklisted and not allowed.
api.auth_backend = airflow.api.auth.backend.default
Using the MWAA CLI (Python). This doesn't work if any of the DAGs use packages that are in the requirements.txt file.
a. As an example, I have "paramiko" in requirements.txt because I have a task that uses SSHOperator. The MWAA CLI fails with "no module paramiko".
b. Also noted here, https://docs.aws.amazon.com/mwaa/latest/userguide/access-airflow-ui.html
"Any command that parses a DAG (such as list_dags, backfill) will fail if the DAG uses plugins that depend on packages that are installed through requirements.txt."
We are using MWAA 2.0.2 and managed to use Airflow's REST API through the MWAA CLI, basically following the instructions and sample code of the Apache Airflow CLI command reference. You'll notice that not all REST API calls are supported, but many of them are (even when you have a requirements.txt in place).
Also have a look at the AWS sample code on GitHub.
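A sketch of the pattern those samples use (environment name, region, and the CLI command string are placeholders):

# Sketch of calling the Airflow CLI through the MWAA CLI endpoint;
# environment name, region and the command string are placeholders.
import base64
import boto3
import requests

mwaa = boto3.client("mwaa", region_name="eu-west-1")
token = mwaa.create_cli_token(Name="my-mwaa-environment")

resp = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags trigger my_dag_id",
)
result = resp.json()
print(base64.b64decode(result["stdout"]).decode("utf-8"))
print(base64.b64decode(result["stderr"]).decode("utf-8"))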
I managed to set up Hadoop as a small cluster with 3 datanodes, and everything works OK.
When trying to access an AWS bucket over the S3A protocol, I get this error:
hadoop fs -ls s3a://my-bucket/
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3208)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3240)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2299)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
... 16 more
What did I do wrong? How do I fix it?
P.S. The bucket on Amazon is fully public; anyone can download from it.
Amazon credentials were configured in hadoop/core-site.xml as described here: Hadoop-AWS module: Integration with Amazon Web Services
As per the link you shared, the issue seems to be a JAR file missing from the CLASSPATH. Can you check whether it is accessible? If it is not, copy the required JARs matching your Hadoop version as shown below and retry.
sudo cp hadoop/share/hadoop/tools/lib/$AWS_JAVA_SDK_VERSION.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/$AWS_HADOOP_VERSION.jar hadoop/share/hadoop/common/lib/