All
I am trying to connect to S3 environment from a spark installed in local mac machine and using the following commands
./bin/spark-shell --packages com.amazonaws:aws-java-sdk-pom:1.11.271,org.apache.hadoop:hadoop-aws:3.1.1,org.apache.hadoop:hadoop-hdfs:2.7.1
This connects to scala and downloads all the libraries
Then I execute the following commands in spark shell
val accessKeyId = System.getenv("AWS_ACCESS_KEY_ID")
val secretAccessKey = System.getenv("AWS_SECRET_ACCESS_KEY")
val hadoopConf=sc.hadoopConfigurationhadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", accessKeyId)
hadoopConf.set("fs.s3.awsSecretAccessKey", secretAccessKey)
hadoopConf.set("fs.s3n.awsAccessKeyId", accessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", secretAccessKey)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("s3a://path/1551467354353.c948f177e1fb.dev.0fd8f5fd-22d4-4523-b6bc-b68c181b4906.gz")
But I get NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities when I use S3a or S3
Any idea what I could be missing here ?
Related
I'm trying to provide a list of files for spark to read as and when it needs them (which is why I'd rather not use boto or whatever else to pre-download all the files onto the instance and only then read them into spark "locally").
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master local[3] pyspark-shell"
spark = SparkSession.builder.getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.access.key', credentials['AccessKeyId'])
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.access.key', credentials['SecretAccessKey'])
spark.read.json(['s3://url/3521.gz', 's3://url/2734.gz'])
No idea what local[3] is about but without this --master flag, I was getting another exception:
Exception: Java gateway process exited before sending the driver its port number.
Now, I'm getting this:
Py4JJavaError: An error occurred while calling o37.json.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
...
Not sure what o37.json refers to here but it probably doesn't matter.
I saw a bunch of answers to similar questions suggesting an addition of flags like:
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell"
I tried prepending it and appending it to the other flag but it doesn't work.
Just like the many variations I see in other answers and elsewhere on the internet (with different packages and versions), for example:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] --jars spark-snowflake_2.12-2.8.4-spark_3.0.jar,postgresql-42.2.19.jar,mysql-connector-java-8.0.23.jar,hadoop-aws-3.2.2,aws-java-sdk-bundle-1.11.563.jar'
A typical example for reading files from S3 is as below -
Additional you can go through this answer to ensure the minimalistic structure and necessary modules are in place -
java.io.IOException: No FileSystem for scheme: s3
Read Parquet - S3
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=com.amazonaws:aws-java-sdk-bundle:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell"
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
hadoop_conf = sc._jsc.hadoopConfiguration()
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config.get("****", "aws_access_key_id")
secret_key = config.get("****", "aws_secret_access_key")
session_key = config.get("****", "aws_session_token")
hadoop_conf.set("fs.s3.aws.credentials.provider", "org.apache.hadoop.fs.s3.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.session.token", session_key)
s3_path = "s3a://xxxx/yyyy/zzzz/"
sparkDF = sql.read.parquet(s3_path)
I have written a python program using boto3 to launch a new instance and supplied startup script using UserData parameter as shown below ( '--' some id)
launchec2.py :
import boto3
ec2 = boto3.resource('ec2')
instances = ec2.create_instances(ImageId='--', MinCount=1, MaxCount=1, InstanceType = 't2.micro',KeyName='--',
SecurityGroupIds=['--'], UserData = open('startup_script.sh').read())
print(instances)
instance = instances[0]
instance.wait_until_running()
instance.load()
# printing dns name
print(instance.public_dns_name)
startup_script.sh :
#!/usr/bin/python
import os
os.system('sudo yum install -y python-pip')
os.system('sudo pip install boto3')
os.system('sudo yum -y update')
os.system('sudo yum install -y httpd')
os.system('sudo service httpd start')
# os.system('cd /var/www/html')
import boto3
access_id_key = ''
secret_access_key = ''
session_token_key = ''
s3 = boto3.resource('s3',aws_access_key_id = access_id_key,aws_secret_access_key = secret_acces_key,aws_session_token = session_token_key)
my_bucket = s3.Bucket('cs351-lab2')
s3client = boto3.client('s3',aws_access_key_id = access_id_key,aws_secret_access_key = secret_access_key,aws_session_token = session_token_key)
for file_object in my_bucket.objects.all():
print(file_object.key,type(file_object.key))
s3client.download_file('cs351-lab2',file_object.key,file_object.key)
I have supplied every value like access id, secret key, and session token correctly. Now my problem is that the script is working perfectly fine up to os.system('sudo service httpd start')( when it is passed as UserData) and then it is not able to download files from
s3 bucket.
But if I run the script manually in that instance by command "./startup_script.sh" after creating startup_script.sh and enabling permissions to execute it, it is perfectly working fine and able to install all the files from the s3 bucket, but I am not sure why it is unable to download files when passed as UserData in launchec2.py.
I am using putty.
Can someone please let me know the solution? it would be of great help.
User Data scripts execute as the root user.
Therefore, they should not use the sudo command.
When you run it manually, you are running it as the ec2-user, which is a different environment.
I am trying to use the boto (ver 2.43.0) library in Python to connect to S3, but I keep getting socket.gaierror: [Errno 11004] when I try to do this:
from boto.s3.connection import S3Connection
access_key = 'accesskey_here'
secret_key = 'secretkey_here'
conn = S3Connection(access_key, secret_key)
mybucket = conn.get_bucket('s3://diap.prod.us-east-1.mybucket/')
print("success!")
I can connect to and access folders in mybucket using AWS CLI by using a command like this in Windows:
> aws s3 ls s3://diap.prod.us-east-1.mybucket/
<list of folders in mybucket will be here>
or using software like CloudBerry or S3Browser.
Is there something that I am doing wrong here to access S3 bucket and folders properly?
get_bucket() expects a bucket name.
get_bucket(bucket_name, validate=True, headers=None)
Try:
mybucket = conn.get_bucket('mybucket')
If it doesn't work, show the full stack trace.
{Update]: There is a bug in boto library for bucket names with dot. Update your boto config
[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat
Or
from boto.s3.connection import S3Connection, OrdinaryCallingFormat
conn = S3Connection(access_key, secret_key, calling_format=OrdinaryCallingFormat())
I downloaded the AWS cli and was able to successfully list objects from my bucket. But doing the same from a Python script does not work. The error is forbidden error.
How should I configure the boto to use the same default AWS credentials ( as used by AWS cli )
Thank you
import logging import urllib, subprocess, boto, boto.utils, boto.s3
logger = logging.getLogger("test") formatter =
logging.Formatter('%(asctime)s %(message)s') file_handler =
logging.FileHandler("test.log") file_handler.setFormatter(formatter)
stream_handler = logging.StreamHandler(sys.stderr)
logger.addHandler(file_handler) logger.addHandler(stream_handler)
logger.setLevel(logging.INFO)
# wait until user data is available while True:
logger.info('**************************** Test starts *******************************')
userData = boto.utils.get_instance_userdata()
if userData:
break
time.sleep(5)
bucketName = ''
deploymentDomainName = ''
if bucketName:
from boto.s3.key import Key
s3Conn = boto.connect_s3('us-east-1')
logger.info(s3Conn)
bucket = s3Conn.get_bucket('testbucket')
key.key = 'test.py'
key.get_contents_to_filename('test.py')
CLI is -->
aws s3api get-object --bucket testbucket --key test.py my.py
Is it possible to use the latest Python SDK from Amazon (Boto 3)? If so, set up your credentials as outlined here: Boto 3 Quickstart.
Also, you might check your environment variable. If they don't exist, that is okay. If they don't match those on your account, then that could be the problem as some AWS SDKs and other tools with use environment variables over the config files.
*nix:
echo $AWS_ACCESS_KEY_ID && echo $AWS_SECRET_ACCESS_KEY
Windows:
echo %AWS_ACCESS_KEY% & echo %AWS_SECRET_ACCESS_KEY%
(sorry if my windows-foo is weak)
When you use CLI by default it takes credentials from .aws/credentials file, but for running bot you will have to specify access key and secret key in your python script.
import boto
import boto.s3.connection
access_key = 'put your access key here!'
secret_key = 'put your secret key here!'
conn = boto.connect_s3(
aws_access_key_id = access_key,
aws_secret_access_key = secret_key,
host = 'bucketname.s3.amazonaws.com',
#is_secure=False, # uncomment if you are not using ssl
calling_format = boto.s3.connection.OrdinaryCallingFormat(),
)
I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for the capability of connecting to JDBC:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#jdbc-to-other-databases
The load command seems fairly straightforward (although I don't know how I would enter AWS credentials here, maybe in the options?).
df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
And I'm not entirely sure how to deal with the SPARK_CLASSPATH variable. I'm running Spark locally for now through an iPython notebook (as part of the Spark distribution). Where do I define that so that Spark loads it?
Anyway, for now, when I try running these commands, I get a bunch of undecipherable errors, so I'm kind of stuck for now. Any help or pointers to detailed tutorials are appreciated.
Although this seems to be a very old post, anyone who is still looking for answer, below steps worked for me!
Start the shell including the jar.
bin/pyspark --driver-class-path /path_to_postgresql-42.1.4.jar --jars /path_to_postgresql-42.1.4.jar
Create a df by giving appropriate details:
myDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://host:port/db_name") \
.option("dbtable", "table_name") \
.option("user", "user_name") \
.option("password", "password") \
.load()
Spark Version: 2.2
It turns out you only need a username/pwd to access Redshift in Spark, and it is done as follows (using the Python API):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load(source="jdbc",
url="jdbc:postgresql://host:port/dbserver?user=yourusername&password=secret",
dbtable="schema.table"
)
Hope this helps someone!
If you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel.
If you still want to use JDBC, check out the new built-in JDBC data source in Spark 1.4+.
Disclosure: I'm one of the authors of spark-redshift.
You first need to download Postgres JDBC driver. You can find it here: https://jdbc.postgresql.org/
You can either define your environment variable SPARK_CLASSPATH in .bashrc, conf/spark-env.sh or similar file or specify it in the script before you run your IPython notebook.
You can also define it in your conf/spark-defaults.conf in the following way:
spark.driver.extraClassPath /path/to/file/postgresql-9.4-1201.jdbc41.jar
Make sure it is reflected in the Environment tab of your Spark WebUI.
You will also need to set appropriate AWS credentials in the following way:
sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "***")
The simplest way to make a jdbc connection to Redshift using python is as follows:
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
jdbc_user = "xxx"
jdbc_password = "xxx"
jdbc_driver = "com.databricks.spark.redshift"
spark = SparkSession.builder.master("yarn") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.enableHiveSupport().getOrCreate()
# Read data from a query
df = spark.read \
.format(jdbc_driver) \
.option("url", jdbc_url + "?user="+ jdbc_user +"&password="+ jdbc_password) \
.option("query", "your query") \
.load()
This worked for in Scala in AWS Glue with Spark 2.4:
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql://HOST:PORT/DBNAME?user=USERNAME&password=PASSWORD",
"dbtable" -> "(SELECT a.row_name FROM schema_name.table_name a) as from_redshift")).load()
// back to DynamicFrame
val datasource0 = DynamicFrame(jdbcDF, glueContext)
Works with any SQL query.