I am writing a Spark/Scala program which submits a query to Athena (using aws-java-sdk-athena:1.11.420) and waits for the query to complete. Once the query is complete, my Spark program reads the query's output location directly from the S3 bucket over the s3a protocol, using Spark's sparkSession.read.csv() function.
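Roughly, the flow looks like this (sketched in Python with boto3 and PySpark purely for illustration rather than my actual Scala code; the database, query, bucket and region names below are placeholders):

import time
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
athena = boto3.client("athena", region_name="us-east-1")   # hypothetical region

# Submit the query; Athena writes its CSV output under the given OutputLocation
execution = athena.start_query_execution(
    QueryString="SELECT * FROM my_table",                                     # hypothetical query
    QueryExecutionContext={"Database": "my_database"},                        # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"}, # hypothetical bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Read the query's CSV output directly from S3 over s3a
df = spark.read.csv("s3a://my-bucket/athena-results/%s.csv" % query_id, header=True)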
In order to read the CSV file, I need to use org.apache.hadoop:hadoop-aws:2.8+ and org.apache.hadoop:hadoop-client:2.8+. Both of these libraries are built against AWS SDK version 1.10.6. However, AWS Athena does not have an SDK at that version; the oldest version they offer is 1.11+.
How can I resolve this conflict? I need a recent version of the AWS SDK to get access to Athena, but hadoop-aws pins me to an older version.
Are there versions of hadoop-aws that use the 1.11+ AWS SDKs? If so, which versions will work for me? If not, what other options do I have?
I found out that I can use hadoop-aws:3.1+, which depends on aws-java-sdk-bundle:1.11+. This AWS SDK bundle has Athena packaged in it.
However, I still need to run Spark with the hadoop-common:3.1+ libraries, while the Spark cluster I have runs the 2.8 versions.
Because my cluster runs 2.8, spark-submit jobs were failing, while running the jar directly (java -jar myjar.jar) worked fine. This is because Spark was replacing the Hadoop libraries I provided with the versions it was bundled with.
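One way to confirm which Hadoop build Spark actually loaded at runtime (sketched here in PySpark; the same org.apache.hadoop.util.VersionInfo check works from Scala):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ask the JVM for the Hadoop version Spark picked up on its classpath
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(hadoop_version)  # e.g. 2.8.x when the cluster's bundled jars win over the ones shipped with the job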
Related
I am using AWS Glue to create an ETL workflow, where I fetch data from an API and load it into RDS. In AWS Glue, I use a PySpark script. In the same script, I use the 'aiohttp' and 'asyncio' Python modules to call my API asynchronously. But AWS Glue throws a Module Not Found error, and only for aiohttp.
I have already tried different versions of the aiohttp module and tested them in the Glue job, but it still throws the same error. Can someone please help me with this?
Glue 2.0
AWS Glue version 2.0 lets you provide additional Python modules or different versions at the job level. You can use the --additional-python-modules job parameter with a list of comma-separated Python modules to add a new module or change the version of an existing module.
Also, within the --additional-python-modules option you can specify an Amazon S3 path to a Python wheel module.
This link to official documentation lists all modules already available. If you need a different version or need one to be installed, it can be specified in the parameter mentioned above.
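For example, if you manage the job with boto3, the parameter can be set in DefaultArguments; the job name, role, script location and pinned version below are placeholders:

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # hypothetical region

# Pin aiohttp at the job level; Glue installs it with pip before the script runs
glue.create_job(
    Name="my-etl-job",                                     # hypothetical job name
    Role="MyGlueServiceRole",                              # hypothetical IAM role
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py", # hypothetical script path
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--additional-python-modules": "aiohttp==3.7.4",   # version pinned as an example
    },
)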
Glue 1.0 & 2.0
You can zip the Python library, upload it to S3, and specify the path in the --extra-py-files job parameter (rough sketch below).
See link to official documentation for more information.
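A rough sketch of that packaging step, assuming a pure-Python package directory named mylib and a placeholder bucket:

import shutil
import boto3

# Build mylib.zip with the mylib/ package at the root of the archive
shutil.make_archive("mylib", "zip", root_dir=".", base_dir="mylib")

# Upload the archive to S3 so the Glue job can pull it in
boto3.client("s3").upload_file("mylib.zip", "my-bucket", "glue-libs/mylib.zip")

# Then set the job parameter:
#   --extra-py-files s3://my-bucket/glue-libs/mylib.zip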
I am using Amazon EMR and running Python code via spark-submit.
I would like to access the EMRFS filesystem as an actual file system, meaning I would like to list all the files using something like
import os
os.listdir()
and I would like to save files to local storage and have them persisted in the S3 bucket.
I found this, which is old, seems to work only with Java, and which I had no luck using from Python; and this, which is more general, also old, and does not fit my needs.
How can I do that?
What are the best use cases for AWS Glue in ETL, given its limitations on Python package support?
According to AWS Glue Documentation:
Only pure Python libraries can be used. Libraries that rely on C
extensions, such as the pandas Python Data Analysis Library, are not
yet supported.
I have attempted some ETL jobs run using AWS Glue, and I packaged libraries such as Pandas, Holidays, etc. as a separate zip file and tried that as well, but the jobs failed because of these libraries (ImportError: Pandas).
Does AWS have any ETA for supporting such libraries in the near future?
Or is it too early to adopt AWS Glue, given that the limitations on Python libraries are a major blocker right now?
I am looking to use the KCL (Kinesis Client Library) with Spark Streaming using PySpark.
Any pointers would be helpful.
I tried a few of the approaches given in the Spark Kinesis Integration link,
but I get an error about a Java class reference.
It seems the Python API is wrapping a Java class.
I tried linking
spark-streaming-kinesis-asl-assembly_2.10-2.0.0-preview.jar
while trying to run the KCL app on Spark,
but I still get the error.
Please let me know if anyone has done this already.
If I search online I find more about Twitter and Kafka;
I am not able to find much help with regard to Kinesis.
Spark version used: 1.6.3
I encountered the same problem: the kinesis-asl jar was missing several files.
To overcome this, I included the following jars in my spark-submit:
amazon-kinesis-client-1.9.0.jar
aws-java-sdk-1.11.310.jar
jackson-dataformat-cbor-2.6.7.jar
Note: I am using Spark 2.3.0 so the jar versions listed might not be the same as those you should be using for your spark version.
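With those jars on the classpath (for example passed via --jars to spark-submit), the Python side looks roughly like this; the application name, stream name, endpoint and region below are placeholders:

from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="KinesisExample")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Create the Kinesis DStream; the ASL/KCL jars above provide the underlying worker
stream = KinesisUtils.createStream(
    ssc,
    "my-kcl-app",                               # hypothetical KCL application name (checkpoint table)
    "my-stream",                                # hypothetical Kinesis stream name
    "https://kinesis.us-east-1.amazonaws.com",  # endpoint for the chosen region
    "us-east-1",                                # hypothetical region
    InitialPositionInStream.LATEST,
    10,                                         # checkpoint interval in seconds
    StorageLevel.MEMORY_AND_DISK_2,
)

stream.pprint()
ssc.start()
ssc.awaitTermination()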
Hope this helps.
Amazon's Elastic MapReduce tool seems to only have support for HBase v0.92.x and v0.94.x.
The documentation for the EMR AMIs and HBase is seemingly out-of-date and there is no information about HBase on the newest release label emr-4.0.0.
Using this example from an AWS engineer, I was able to concoct a way to install another version of HBase on the nodes, but it was ultimately unsuccessful.
After much trial and error with the Java SDK to provision EMR with better versions, I ask:
Is it possible to configure EMR to use more recent versions of HBase (e.g. 0.98.x and newer)?
After several days of trial, error and support tickets to AWS, I was able to implement HBase 0.98 on Amazon's ElasticMapReduce service. Here's how to do it using the Java SDK, some bash and ruby-based bootstrap actions.
Credit for these bash and ruby scripts goes to Amazon support. They are in-development scripts and not officially supported. It's a hack.
Supported Version: HBase 0.98.2 for Hadoop 2
I also created mirrors in Google Drive for the supporting files in case Amazon pulls them down from S3.
Java SDK example
RunJobFlowRequest jobFlowRequest = new RunJobFlowRequest()
.withSteps(new ArrayList<StepConfig>())
.withName("HBase 0.98 Test")
.withAmiVersion("3.1.0")
.withInstances(instancesConfig)
.withLogUri(String.format("s3://bucket/path/to/logs"))
.withVisibleToAllUsers(true)
.withBootstrapActions(new BootstrapActionConfig()
.withName("Download HBase")
.withScriptBootstrapAction(new ScriptBootstrapActionConfig()
.withPath("s3://bucket/path/to/wget/ssh/script.sh"))
)
.withBootstrapActions(new BootstrapActionConfig()
.withName("Install HBase")
.withScriptBootstrapAction(new ScriptBootstrapActionConfig()
.withPath("s3://bucket/path/to/hbase/install/ruby/script"))
)
.withServiceRole("EMR_DefaultRole")
.withJobFlowRole("EMR_EC2_DefaultRole");
"Download HBase" Bootstrap Action (bash script)
Original from S3
Mirror from Google Drive
"Install HBase" Bootstrap Action (ruby script)
Original from S3
Mirror from Google Drive
HBase Installation Tarball (used in "Download HBase" script)
Original from S3
Mirror from Google Drive
Make copies of these files
I highly recommend that you download these files and upload them into your own S3 bucket for use in your bootstrap actions and scripts. Adjust paths where necessary.