Spark read from Redshift very slow - amazon-web-services

I have a requirement in which I need to read data from AWS Redshift and write the result as CSV in AWS S3 Bucket using Apache Spark on a EC2 node instance.
I am using io.github.spark_redshift_community.spark.redshift driver to read the data from Redshift using a query. This driver executes the query and stores the result in a temporary space in S3 in CSV format.
I do not want to use Athena or the UNLOAD command due to certain constraints
I am able to achieve this but the read process from the S3 temp_directory is very slow.
As you can see above, it is taking almost a minute to read from S3 temp_directory and then write to S3 location 10k records of size 2MB
Based on logs, I can tell that storing the Redshift data into the temp_directory of S3 is fairly quick. The delay is happening while reading from this temp_directory
The EC2 instance on which spark is running has IAM role access to the S3 bucket.
Below is the code which reads from redshift
spark.read()
.format("io.github.spark_redshift_community.spark.redshift")
.option("url",URL)
.option("query", QUERY)
.option("user", USER_ID)
.option("password", PASSWORD)
.option("tempdir", TEMP_DIR)
.option("forward_spark_s3_credentials", "true")
.load();
Below is the pom.xml dependencies
<dependencies>
<dependency>
<groupId>com.eclipsesource.minimal-json</groupId>
<artifactId>minimal-json</artifactId>
<version>0.9.5</version>
</dependency>
<dependency>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>org.ini4j</groupId>
<artifactId>ini4j</artifactId>
<version>0.5.4</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.2</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.26</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.12</artifactId>
<version>3.3.1</version>
</dependency>
<dependency>
<groupId>io.github.spark-redshift-community</groupId>
<artifactId>spark-redshift_2.12</artifactId>
<version>4.2.0</version>
</dependency>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.12</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.15</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>3.3.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
<version>1.12.389</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bundle</artifactId>
<version>1.12.389</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hadoop-cloud_2.12</artifactId>
<version>3.3.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.3.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.3.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>

I found the solution this issue.
Turns out the version 4.2.0 of io.github.spark_redshift_community.spark.redshift driver that I was using was causing this issue.
When I switched to the most recent version which is 5.1.0, the issue was resolved and the same job completed within 10 seconds.
Thanks!

Related

Unable to override uri-endpoint-override override-endpoint options for STS

Our goal is to no longer access AWS endpoints via a custom proxy but to access them via VPC endpoints from AWS. To make this work in our secured network we use our own VPC endpoints which we configure with the option: uri-endpoint-override (string) and override-endpoint (boolean). Now the problem is that the options are not used at all and the application always uses the default endpoints which have no access in our network. Because of this the STS component can't execute a HTTP request.
And in the console the following error message appears:
Unable to execute HTTP request: Connect to sts.eu-central-1.amazonaws.com:443 [sts.eu-central-1.amazonaws.com/54.239.54.207] failed: Connect timed out, ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set
As taken from the error message, the default endpoint sts.eu-central-1.amazonaws.com:443 is used.
This is how our application.properties looks in which the options are set:
camel.component.aws2-sts.override-endpoint=true
camel.component.aws2-sts.uri-endpoint-override=https://vpce-???-???.sts.eu central1.vpce.amazonaws.com
We are using the following versions:
Apache Camel 3.14.2
Spring Boot 2.5.10
Dependencies
<properties>
<java.version>14</java.version>
<camel.version>3.14.2</camel.version>
<spring-boot.version>2.5.10</spring-boot.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
<exclusions>
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.camel.springboot</groupId>
<artifactId>camel-aws2-s3-starter</artifactId>
<exclusions>
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.camel.springboot</groupId>
<artifactId>camel-aws2-sts-starter</artifactId>
<exclusions>
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-jetty</artifactId>
<version>${camel.version}</version>
</dependency>
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-json-validator</artifactId>
<version>${camel.version}</version>
</dependency>
</dependencies>
Info
With ticket CAMEL-16171 , Camel added the usage of uri-endpoint-override and override-endpoint options attributes for all AWS components among others for STS.
Do you have any idea why the options are not overwritten? Thanks a lot for your help!

Unable to start embedded Tomcat On EC2 Instance

I am new to AWS, and I want to deploy my SpringBoot application on My LINUX EC2 instance with the AWS S3 bucket. And for that i have created a SpringBoot Application, Uploaded my Application's jar on S3 bucket successfully, I have installed Java 1.8 on my EC2 instance, I "wget" my jar on my instance. But when I am trying to run command java -jar <myfilename.jar> it gives error: org.springframework.context.ApplicationContextException: Unable to start web server; nested exception is org.springframework.boot.web.server.WebServerException: Unable to start embedded Tomcat.
Could you please tell me what should I do now?
Here is my POM.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.0.2.RELEASE</version>
<relativePath /> <!-- lookup parent from repository -->
</parent>
<groupId>com.ab</groupId>
<artifactId>SpringBootTryApplication-2</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>SpringBootTryApplication-2</name>
<description>Demo project for Spring Boot</description>
<properties>
<java.version>1.8</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-tomcat</artifactId>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-aop</artifactId>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjrt</artifactId>
</dependency>
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjweaver</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context-support</artifactId>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-core</artifactId>
<version> 5.2.1.Final</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.6.7</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<executable>true</executable>
</configuration>
</plugin>
</plugins>
</build>
</project>
It says Java.lang.NoClassDefFoundError: com/fasterxml/jackson/databind/exc/InvalidDefinitionException. It means that this particular version of jackson-binder library does not have this particular class.
It Looks like the version 2.6 does not have this class.But the 2.9 does.
You should use the newer versions of jackson-databind library. The latest version seems to be 2.11.2 as of today.

AWS RDS gives java.lang.NoSuchMethodError

I am facing the following issue while connecting to AWS RDS.
I have tried changing the maven dependencies(from 1.11.458 and above) but I'm still facing the same issue while creating AWSRdsClient.
Exception in thread "main" java.lang.NoSuchMethodError: com.amazonaws.client.AwsSyncClientParams.getAdvancedConfig()Lcom/amazonaws/client/builder/AdvancedConfig;
at com.amazonaws.services.rds.AmazonRDSClient.<init>(AmazonRDSClient.java:334)
at com.amazonaws.services.rds.AmazonRDSClient.<init>(AmazonRDSClient.java:318)
at com.amazonaws.services.rds.AmazonRDSClientBuilder.build(AmazonRDSClientBuilder.java:61)
at com.amazonaws.services.rds.AmazonRDSClientBuilder.build(AmazonRDSClientBuilder.java:27)
at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
at com.cloudlytics.war.rules.RDS.main(RDS.java:35)
Here is what worked for me:
From https://github.com/aws/aws-sdk-java
Add this to the dependencyManagement section of your POM:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bom</artifactId>
<version>1.11.549</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
And then use the SDK Maven modules without specifying a version:
<dependencies>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-ec2</artifactId>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-dynamodb</artifactId>
</dependency>
</dependencies>

springboot rest api: memory and cpu always in high level

I used spring boot to write Restful webservice
I do not use too many libraries and service only some api
For example one dependences I used
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
<exclusions>
<exclusion>
<groupId>javax.validation</groupId>
<artifactId>validation-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>${project.groupId}</groupId>
<artifactId>myapp.wsdl</artifactId>
<version>${project.version}</version>
<exclusions>
<exclusion>
<groupId>org.glassfish.metro</groupId>
<artifactId>webservices-api</artifactId>
</exclusion>
<exclusion>
<groupId>org.glassfish.metro</groupId>
<artifactId>webservices-rt</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>${project.groupId}</groupId>
<artifactId>com.bean</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>3.10.0</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.6</version>
<type>jar</type>
</dependency>
</dependencies>
I don’t understand why deploy on server, when I check RAM and CPU then i saw them always high level: from 9 to 15% of RAM.
Please advice me how to improve performance and reduce RAM and CPU

AWS Beanstalk: Cannot determine embedded database driver class

I have a spring boot application which runs just fine on my local instance (through Intellij) but while deploying on AWS BEanstalk, the application throws the following error (sorry about the formatting. This is how spring generated the exception):
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.springframework.boot.autoconfigure.jdbc.DataSourceAutoConfig
uration$JdbcTemplateConfiguration': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationExcept
ion: Could not autowire field: private javax.sql.DataSource org.springframework.boot.autoconfigure.jdbc.DataSourceAutoConfiguration$JdbcTemplateConfigur
ation.dataSource; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'dataSource' defined in cla
ss path resource [org/springframework/boot/autoconfigure/jdbc/DataSourceAutoConfiguration$NonEmbeddedConfiguration.class]: Bean instantiation via factor
y method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [javax.sql.DataSource]: Factory method
'dataSource' threw exception; nested exception is org.springframework.boot.autoconfigure.jdbc.DataSourceProperties$DataSourceBeanCreationException: Cann
ot determine embedded database driver class for database type NONE. If you want an embedded database please put a supported one on the classpath. If you
have database settings to be loaded from a particular profile you may need to active it (the profiles "aws" are currently active).
pom.xml
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>${jackson.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>${jackson.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>${jackson.version}</version>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-rest</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-dynamodb</artifactId>
<version>1.10.56</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
<plugin>
<artifactId>maven-war-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.my.app.path.MyApplication</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
application.properties
spring.profiles.active=aws
dynamodb.tablename=my_dynamodb_table
application-aws.properties
spring.profiles.active=aws
The application uses a table in dynamodb. Could this be because I might need to set permissions in AWS to allow beanstalk to talk to dynamodb? If so, please let me know how to do that.
My EC2 instance is tomcat8 type.
Found the solution to my question on this post. See the answer by #user672009.
Just add this to your pom.
<dependency>
<groupId>com.h2database</groupId>
<artifactId>h2</artifactId>
<version>1.3.156</version>
</dependency>