PySpark from S3 - java.lang.ClassNotFoundException: com.amazonaws.services.s3.model.MultiObjectDeleteException

I'm trying to read data from S3 with PySpark on an AWS EMR cluster.
I keep getting this error: An error occurred while calling o27.parquet. : java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException.
I have tried different versions of jars and clusters, still with no result.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().set("spark.jars","/usr/lib/spark/jars/hadoop-aws-3.2.1.jar,/usr/lib/spark/aws-java-sdk-s3-1.11.873.jar")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df2 = sqlContext.read.parquet("s3a://stackdev2prq/Badges/*")
I'm using hadoop-aws-3.2.1.jar and aws-java-sdk-s3-1.11.873.jar.
Spark 3.0.1 on Hadoop 3.2.1 YARN
I know I need the proper version of the aws-java-sdk, but how can I check which version I should download?

mvnrepository provides the information.
I don't see it for 3.2.1. Looking at the hadoop-project POM and the JIRA versions, HADOOP-15642 says 1.11.375; the move to 1.11.563 only went in with Hadoop 3.2.2.
Do put the whole (huge) aws-java-sdk-bundle on the classpath; that shades everything and avoids version-mismatch hell with Jackson, httpclient, etc.
That said: if you are working with EMR, you should just use s3:// URLs and pick up the EMR team's own S3 connector.
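For reference, a minimal sketch of how the matching pair can be pulled in together outside EMR, assuming Hadoop 3.2.1 with the 1.11.375 SDK bundle mentioned above and a cluster that can reach Maven Central; this is not the EMR-recommended path, just an illustration of keeping hadoop-aws and the bundle in lockstep:
from pyspark.sql import SparkSession

# Sketch only: resolve hadoop-aws together with the aws-java-sdk-bundle it
# was built against, so the transitive Jackson/httpclient versions line up.
spark = (SparkSession.builder
         .appName("s3a-read")
         .config("spark.jars.packages",
                 "org.apache.hadoop:hadoop-aws:3.2.1,"
                 "com.amazonaws:aws-java-sdk-bundle:1.11.375")
         .getOrCreate())

df2 = spark.read.parquet("s3a://stackdev2prq/Badges/*")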

Related

Read/Write Shapefiles .shp stored in AWS S3 from AWS EC2 using python

I am using the code below in a local Python CLI window, where it works, but I am unable to run it on EC2 (Amazon Linux 2). Please help me find a solution, as I have tried many approaches from the internet.
import pandas as pd
import geopandas as gpd
gdf1=gpd.read_file("s3://bucketname/inbound/folder/filename.shp")
gdf2=gpd.read_file("s3://bucketname/inbound/folder/filename.shp")
gdf = gpd.GeoDataFrame(pd.concat([gdf1, gdf2]))
gdfw = gdf.to_file("s3://bucket/outbound/folder/")
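A shapefile is really a set of sidecar files (.shp plus .shx, .dbf and usually .prj), so reading it over s3:// requires all of them to be reachable. A hedged sketch of one possible workaround - download the set with boto3 and read the local copy - assuming boto3 is installed, the instance's IAM role can read the bucket, and all four files exist under the question's key prefix:
import boto3
import geopandas as gpd

# Sketch only: copy the shapefile and its sidecar files to local disk,
# then read the local .shp with geopandas.
s3 = boto3.client("s3")
for ext in ("shp", "shx", "dbf", "prj"):
    s3.download_file("bucketname",
                     "inbound/folder/filename." + ext,
                     "/tmp/filename." + ext)

gdf1 = gpd.read_file("/tmp/filename.shp")
print(gdf1.head())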

EMR 5.21, Spark 2.4 - json4s dependency broken

Issue
In EMR 5.21, Spark-HBase integration is broken:
df.write.options().format().save() fails.
The reason is json4s-jackson version 3.5.3 in Spark 2.4 / EMR 5.21;
it works fine in EMR 5.11.2 with Spark 2.2 and json4s-jackson version 3.2.11.
The problem is that this is EMR, so I can't rebuild Spark with a lower json4s version.
Is there any workaround?
Error
py4j.protocol.Py4JJavaError: An error occurred while calling o104.save. : java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
Submission
spark-submit --master yarn \
--jars /usr/lib/hbase/ \
--packages com.hortonworks:shc-core:1.1.3-2.3-s_2.11 \
--repositories http://repo.hortonworks.com/content/groups/public/ \
pysparkhbase_V1.1.py s3://<bucket>/ <Namespace> <Table> <cf> <Key>
Code
import sys
from pyspark.sql.functions import concat
from pyspark import SparkContext
from pyspark.sql import SQLContext,SparkSession
spark = SparkSession.builder.master("yarn").appName("PysparkHbaseConnection").config("spark.some.config.option", "PyHbase").getOrCreate()
spark.sql("set spark.sql.parquet.compression.codec=uncompressed")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
df = spark.read.parquet(file)
df.createOrReplaceTempView("view")
.
cat = '{|"table":{"namespace":"' + namespace + '", "name":"' + name + '", "tableCoder":"' + tableCoder + '", "version":"' + version + '"}, \n|"rowkey":"' + rowkey + '", \n|"columns":{'
.
df.write.options(catalog=cat).format(data_source_format).save()
There's no obvious answer. A quick check of the SHC POM doesn't show a direct dependency on json4s, so you can't just change that POM and rebuild the artifact yourself.
You're going to have to talk to the EMR team to get them to build the connector and HBase in sync.
FWIW, getting Jackson in sync is one of the stress points of releasing a big data stack, and the AWS SDK's habit of updating its requirements on its fortnightly release cycle is another. Hadoop moved to the shaded AWS SDK purely to stop AWS engineering decisions defining the choices for everyone else.
Downgrading json4s to 3.2.10 can resolve it, but I think it's an SHC bug; it needs to be upgraded.
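If you do try the downgrade, one untested sketch is to add the older json4s to --packages and ask Spark to prefer user-supplied classes; the userClassPathFirst settings are marked experimental, and whether 3.2.10 or 3.2.11 (the level the question reports working on EMR 5.11.2) is published for your Scala version should be checked on Maven Central first:
spark-submit --master yarn \
--jars /usr/lib/hbase/ \
--packages com.hortonworks:shc-core:1.1.3-2.3-s_2.11,org.json4s:json4s-jackson_2.11:3.2.11 \
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--repositories http://repo.hortonworks.com/content/groups/public/ \
pysparkhbase_V1.1.py s3://<bucket>/ <Namespace> <Table> <cf> <Key>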

Export BigQuery table to Google Cloud Storage runs into AttributeError exception

I am trying to export a table from Google BigQuery to Google Cloud Storage as a JSON file,
running this Python snippet:
from google.cloud import bigquery
client = bigquery.Client()
bucket_name = 'mybucket'
destination_uri = 'gs://{}/{}'.format(bucket_name, 'myfile.json')
dataset_ref = client.dataset('mydataset')
table_ref = dataset_ref.table('mytable')
job_config = bigquery.job.ExtractJobConfig()
job_config.destination_format = (
    bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
extract_job = client.extract_table(
    table_ref, destination_uri, job_config=job_config
)
extract_job.result()
I received this error
AttributeError: module 'google.cloud.bigquery' has no attribute 'DestinationFormat'
I followed the official documentation
https://cloud.google.com/bigquery/docs/exporting-data#configuring_export_options
Here are my Python package versions:
google-api-core (1.1.0)
google-auth (1.4.1)
google-cloud-bigquery (0.31.0)
google-cloud-core (0.28.1)
google-resumable-media (0.3.1)
googleapis-common-protos (1.5.3)
How is it possible to receive this error with the latest packages and documentation?
Thank you in advance for your help
Regards
Can you try replacing bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON with bigquery.job.DestinationFormat.NEWLINE_DELIMITED_JSON? It is probably a bug in the documentation.
Ensure the version installed locally has the required attribute; maybe you have an older version. Open a Python console, import bigquery, and use dir() or help() on it to see whether the attribute is there. If it isn't, pip-upgrade the package and try again.
If the attribute is there from the Python shell but missing when you run the script, then there must be a second Python installation involved.
There may be other causes, but let's see what you find.
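A quick way to act on the suggestions above is to check, from the same interpreter the script uses, where the attribute actually lives; a minimal sketch:
from google.cloud import bigquery

# Check whether DestinationFormat is exposed at the top level or only
# under bigquery.job, and which client version is installed.
print(hasattr(bigquery, "DestinationFormat"))
print(hasattr(bigquery.job, "DestinationFormat"))
print(getattr(bigquery, "__version__", "no __version__ attribute"))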

Get files from S3 using Jython in Grinder test script

I have master/worker EC2 instances that I'm using for Grinder tests. I need to try out a load test that directly gets files from an S3 bucket, but I'm not sure how that would look in Jython for the Grinder test script.
Any ideas or tips? I've looked into it a little and saw that Python has the boto package for working with AWS - would that work in Jython as well?
(Edit - adding code and import errors for clarification.)
Python approach:
Did "pip install boto3"
Test script:
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
import boto3
# boto3 for Python
test1 = Test(1, "S3 request")
resource = boto3.resource('s3')
def accessS3():
    obj = resource.Object(<bucket>, <key>)
test1.record(accessS3)
class TestRunner:
    def __call__(self):
        accessS3()
The error for this is:
net.grinder.scriptengine.jython.JythonScriptExecutionException: : No module named boto3
Java approach:
Added aws-java-sdk-1.11.221 jar from .m2\repository\com\amazonaws\aws-java-sdk\1.11.221\ to CLASSPATH
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
import com.amazonaws.services.s3 as s3
# aws s3 for Java
test1 = Test(1, "S3 request")
s3Client = s3.AmazonS3ClientBuilder.defaultClient()
test1.record(s3Client)
class TestRunner:
    def __call__(self):
        result = s3Client.getObject(s3.model.getObjectRequest(<bucket>, <key>))
The error for this is:
net.grinder.scriptengine.jython.JythonScriptExecutionException: : No module named amazonaws
I'm also running things on a Windows computer, but I'm using Git Bash.
Given that you are using Jython, I'm not sure whether you want to execute the S3 request in Java or Python syntax.
However, I would suggest following along with the Python guide at the link below.
http://docs.ceph.com/docs/jewel/radosgw/s3/python/
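Since Jython runs on the JVM, the boto3 route depends on the package being importable by Jython, which the first traceback shows it is not here. Another option is to call the AWS Java SDK directly from the Jython script; a sketch only, assuming the S3 SDK jar and its dependencies are on the CLASSPATH of the Grinder workers (not just the console), with hypothetical bucket and key names:
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
from com.amazonaws.services.s3 import AmazonS3ClientBuilder
from com.amazonaws.services.s3.model import GetObjectRequest

test1 = Test(1, "S3 request")
s3Client = AmazonS3ClientBuilder.defaultClient()

def accessS3():
    # hypothetical bucket/key; fetch the object and close the stream
    obj = s3Client.getObject(GetObjectRequest("my-bucket", "my-key"))
    obj.close()

test1.record(accessS3)

class TestRunner:
    def __call__(self):
        accessS3()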

pyspark Cassandra connector

I have to install the pyspark-cassandra connector, which is available at https://github.com/TargetHolding/pyspark-cassandra,
but I have run into huge problems and errors, and there is no supporting documentation for Spark with Python (PySpark).
I want to know whether the pyspark-cassandra connector package is deprecated or something else. I also need clear step-by-step instructions for cloning the pyspark-cassandra package, installing it, importing it in the pyspark shell, making a successful connection with Cassandra, and performing operations such as building tables or keyspaces via pyspark.
Approach 1 (spark-cassandra-connector)
Use the command below to start the pyspark shell with spark-cassandra-connector:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2
Now you can import the required modules.
Read data from the Cassandra table "emp" in keyspace "test" as follows:
spark.read.format("org.apache.spark.sql.cassandra").options(table="emp", keyspace="test").load().show()
Approach 2 (pyspark-cassandra)
Use the command below to start the pyspark shell with pyspark-cassandra:
pyspark --packages anguenot/pyspark-cassandra:2.4.0
Read data from the Cassandra table "emp" in keyspace "test" as follows:
spark.read.format("org.apache.spark.sql.cassandra").options(table="emp", keyspace="test").load().show()
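Both approaches also need Spark to know where Cassandra is listening; a minimal script-style sketch with an assumed local contact point (host, table and keyspace names below are placeholders to adjust):
from pyspark.sql import SparkSession

# Sketch only: set the Cassandra contact point explicitly and pull the
# connector via spark.jars.packages instead of the pyspark --packages flag.
spark = (SparkSession.builder
         .appName("cassandra-read")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.11:2.4.2")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(table="emp", keyspace="test")
      .load())
df.show()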
I hope this link helps you in your task
https://github.com/datastax/spark-cassandra-connector/#documentation
The link in your question points to a repository where the builds are failing.
It also has a link to the repository above.
There are two ways to do this:
Either using pyspark or spark-shell
#1 pyspark:
Steps to follow:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2
df = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "<keyspace_name>").option("table", "<table_name>").load()
Note: this will create a DataFrame on which you can perform further operations.
Try agg(), select(), show(), etc., or hit tab after 'df.' to see the available options.
example: df.select(sum("<column_name>")).show()
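Note that sum in that example is the aggregate from pyspark.sql.functions, not Python's built-in; a two-line sketch with a hypothetical column name:
from pyspark.sql.functions import sum as sum_  # avoid shadowing the built-in sum

df.select(sum_("salary")).show()  # "salary" is a hypothetical column name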
#2 spark-shell:
Start spark-shell with the same --packages option as above, or pass a connector jar file to spark-shell with --jars.
The steps above (#1) work exactly the same; just use 'val' to create the variable,
e.g. val df = spark.read.format(...).load()
Note: use the ':paste' option in the Scala shell to write multiple lines or to paste your code.
#3 Steps to download spark-cassandra-connector:
download the spark-cassandra-connector by cloning https://github.com/datastax/spark-cassandra-connector.git
cd to the spark-cassandra-connector
./sbt/sbt assembly
this will build the spark-cassandra-connector and put the jars into the 'project' folder
use spark-shell
all set
Cheers 🍻!
You can use this (in the Scala shell) to connect to Cassandra:
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
You can read like this, if you have a keyspace called test and a table called my_table:
val test_spark_rdd = sc.cassandraTable("test", "my_table")
test_spark_rdd.first