Change the file format used by to_sql method - amazon-athena

This works as expected and creates a new table, but the data is stored in a format that only Spark can read. How do I store the data in CSV format?
from pyathena.pandas.util import to_sql

to_sql(
    mrdf,
    "mrdf_table3",
    conn,
    "s3://" + bucket + "/tutorial/s3dir3/",
    schema="hunspell",
    index=False,
    if_exists="replace",
)
I tried flavor="csv" and flavor="textfile", but the generated file is still not readable.
Update: Connection string
from pyathena import connect

bucket = "hunspell"
conn = connect(
    aws_access_key_id="XXX",
    aws_secret_access_key="XXX",
    s3_staging_dir="s3://" + bucket + "/tutorial/staging/",
    region_name="us-east-1",
)
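One possible workaround, sketched below, is to skip to_sql for the data itself: write the DataFrame to S3 as plain CSV and register an external table over that location with a CREATE EXTERNAL TABLE statement. The column names and types (word string, freq int), the S3 prefix, and the new table name are illustrative assumptions; adjust them to your DataFrame.
import io

import boto3

# Assumed layout: mrdf has two columns, `word` (string) and `freq` (int).
csv_prefix = "tutorial/s3dir3_csv/"

# 1) Write the DataFrame to S3 as plain CSV (no header, no index).
buf = io.StringIO()
mrdf.to_csv(buf, index=False, header=False)
boto3.client("s3").put_object(
    Bucket=bucket, Key=csv_prefix + "mrdf.csv", Body=buf.getvalue()
)

# 2) Register an external table over that prefix using the same connection.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS hunspell.mrdf_table3_csv (
    word string,
    freq int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://{bucket}/{prefix}'
""".format(bucket=bucket, prefix=csv_prefix)
conn.cursor().execute(ddl)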

Related

How can I copy data from CSV into QuestDB using Python?

I'm using the psycopg2 module to make queries against QuestDB from Python. I have had some trouble using the cursor's copy_from() method to get CSV data into a table. What's the best way to get this into the database?
I'm trying the following:
import pandas as pd
import numpy as np
import psycopg2
import os

conn = psycopg2.connect(user="admin",
                        password="quest",
                        host="127.0.0.1",
                        port="8812",
                        database="qdb")
cursor = conn.cursor()
dest_table = "eur_fr_bulk"
temp_dataframe = "./temp_dataframe.csv"

# input
df = pd.read_csv("./data/eur_fr.csv")
df.to_csv(temp_dataframe, index_label='id', header=False)
f = open(temp_dataframe, 'r')

cursor = conn.cursor()
try:
    cursor.copy_from(f, dest_table)
    conn.commit()
except (Exception, psycopg2.DatabaseError) as error:
    os.remove(temp_dataframe)
    print("Error: %s" % error)
    conn.rollback()
    cursor.close()
cursor.close()
The copy_from() wrapper in psycopg2 executes some SQL in the background that is not yet supported in QuestDB. Specifically, it will run:
COPY my_table FROM stdin WITH DELIMITER AS ' ' NULL AS '\\N'
The DELIMITER keyword is not yet implemented. As a workaround, you can either make the request via HTTP in Python, which is probably the most convenient:
import requests
csv = {'data': ('my_table_import', open('./data/eur_fr.csv', 'r'))}
server = 'http://localhost:9000/imp'
response = requests.post(server, files=csv)
print(response.text)
or you can specify a copy directory in the server.conf file, which allows loading CSV files with the SQL COPY command. This is documented on the COPY documentation page.
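A minimal sketch of that second approach, assuming the relevant server.conf key is cairo.sql.copy.root and that the file has already been placed in that directory (check the COPY documentation page for the exact property name and syntax):
# server.conf (QuestDB):
#   cairo.sql.copy.root=/path/to/import_dir    # directory QuestDB is allowed to read CSVs from

# With eur_fr.csv copied into that directory, issue the COPY statement over
# the same psycopg2 connection used above:
import psycopg2

conn = psycopg2.connect(user="admin", password="quest",
                        host="127.0.0.1", port="8812", database="qdb")
with conn.cursor() as cur:
    cur.execute("COPY eur_fr_bulk FROM 'eur_fr.csv' WITH HEADER true;")
conn.commit()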

Change output CSV file name of AWS Athena queries

I want to run my Athena query through AWS Lambda, but also change the name of my output CSV file from the query execution ID to my-bucket/folder/my-preferred-string.csv.
I tried searching the web, but couldn't find the exact code for the Lambda function.
I am a data scientist and a beginner with AWS. This is a one-time thing for me, so I'm looking for a quick solution or a patch-up.
This question is already posted here
import time

import boto3

client = boto3.client('athena')
s3 = boto3.resource("s3")

# Run query
queryStart = client.start_query_execution(
    # PUT_YOUR_QUERY_HERE
    QueryString='''
        SELECT *
        FROM "db_name"."table_name"
        WHERE value > 50
    ''',
    QueryExecutionContext={
        # YOUR_ATHENA_DATABASE_NAME
        'Database': "covid_data"
    },
    ResultConfiguration={
        # query result output location you configured in AWS Athena
        "OutputLocation": "s3://bucket-name-X/folder-Y/"
    }
)

# Starts the query and waits 3 seconds for it to finish
queryId = queryStart['QueryExecutionId']
time.sleep(3)

# Copies the newly generated csv file to the preferred name
# (source is the query result output location you configured in AWS Athena)
queryLoc = "bucket-name-X/folder-Y/" + queryId + ".csv"

# Destination bucket and file name
s3.Object("bucket-name-A", "report-2018.csv").copy_from(CopySource=queryLoc)

# Deletes the Athena-generated csv and its metadata file
response = s3.meta.client.delete_object(
    Bucket='bucket-name-X',
    Key="folder-Y/" + queryId + ".csv"
)
response = s3.meta.client.delete_object(
    Bucket='bucket-name-X',
    Key="folder-Y/" + queryId + ".csv.metadata"
)

print('report-2018.csv generated')
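The fixed three-second sleep is fragile for anything but small queries. A more robust variant (a sketch reusing the client and queryId from above) polls get_query_execution until the query reaches a terminal state before copying the result:
# Poll Athena until the query finishes instead of sleeping a fixed time.
while True:
    response = client.get_query_execution(QueryExecutionId=queryId)
    state = response['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state != 'SUCCEEDED':
    raise RuntimeError('Athena query finished with state ' + state)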

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without downloading them locally, extracting them, and then pushing them back to S3.
With boto3 + Lambda, how can I achieve my goal?
I didn't see any extract functionality in the boto3 documentation.
You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 with upload_fileobj.
# python imports
import boto3
from io import BytesIO
import gzip

# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'

# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)  # optional

s3.upload_fileobj(                      # upload a new obj to s3
    Fileobj=gzip.GzipFile(              # read in the output of gzip -d
        None,                           # just return output as BytesIO
        'rb',                           # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                      # target bucket, writing to
    Key=uncompressed_key)               # target key, writing to
Ensure that your key is being read in correctly:
# read the body of the s3 key object into a string to ensure download
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s))  # check to ensure some data was returned
The above answers are for gzip files; for zip files, you may try:
import boto3
import zipfile
from io import BytesIO

bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"

zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
# This will give you a list of files in the folder you mentioned as prefix

s3_resource = boto3.resource('s3')

# Now create a zip object; the code below handles the 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print(zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + f'{filename}')
This will work for your zip file, and the unzipped data will end up in the result_files folder. Make sure to increase the memory and timeout of your AWS Lambda function to the maximum, since some files are pretty large and need time to write.
Amazon S3 is a storage service. There is no built-in capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back up again. Please note that Lambda has only 512 MB of temporary disk space (/tmp) by default, so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io
import boto3

bucket = '<bucket_name>'
key = '<key_name>'

s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
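To wire this sample into the S3 trigger described above, the handler could look roughly like the sketch below. The event parsing follows the standard S3 notification format; the .gz suffix handling and key layout are assumptions for illustration.
import gzip
import io
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Each S3 put notification record carries the bucket and key of the new object.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])  # keys arrive URL-encoded
        if not key.endswith('.gz'):
            continue  # only handle gzip uploads
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        uncompressed = gzip.GzipFile(None, 'rb', fileobj=io.BytesIO(body))
        # Write the decompressed object next to the original, minus the .gz suffix.
        s3.upload_fileobj(Fileobj=uncompressed, Bucket=bucket, Key=key[:-3])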

Fetch a file(.csv) from S3 bucket and copy to an RDS

I'm going to connect to an S3 bucket, get the csv files, and copy the rows to an RDS database. In this script we are using arcpy; I'm not that familiar with this package, and I'm just trying to use the csv file from the S3 bucket directly as the source without downloading it to the server. The code is as follows:
import arcpy
from boto.s3.key import Key
import StringIO
import pandas as pd
import boto
import boto.s3.connection

access_key = ''
secret_key = ''

conn = boto.connect_s3(aws_access_key_id=access_key,
                       aws_secret_access_key=secret_key,
                       host='s3.amazonaws.com')
b = conn.get_bucket('mybucket')
#for key in b.list:
b_key = b.get_key('file1.csv')
arcpy.env.overwriteOutput = True
b_url = b_key.generate_url(0, query_auth=False, force_http=True)
print b_url

## Read file
k = Key(b, 'file1.csv')
content = k.get_contents_as_string()
sourcefile_csv = pd.read_csv(StringIO.StringIO(content))

## CopyRows_management (in_rows, out_table, {config_keyword})
# http://pro.arcgis.com/en/pro-app/tool-reference/data-management/copy-rows.htm
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
print("copy rows done")
This fails in CopyRows with arcgisscripting.ExecuteError: "Failed to execute. Parameters are not valid."
If we use a path on the server as the source path, as below, it works fine:
sourcefile_csv = "D:\\DEV\\file1.csv"
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
Any help would be appreciated.
It looks like you are trying to use the pandas DataFrame as the table to read from with CopyRows_management? I don't think that is a valid input for the function, hence the "Parameters are not valid" error. The documentation says that in_rows should be "The rows from a feature class, layer, table, or table view to be copied." I think the use of pandas is unnecessary here anyway.
So either save the csv somewhere that the script can access it (as you did when you used the path on the server) or, if you don't want to save the file anywhere, just read the contents of the csv and iterate through it using an Insert Cursor to write it to your table/feature class.
See this post on how to read a csv from a string using the csv module. Then just loop through the rows of the csv and use the Insert Cursor to write to your table.
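A rough sketch of that approach, reusing the `content` string fetched from S3 in the question; the field names ("field1", "field2") are placeholders, and it assumes the CSV values can be inserted as-is (convert types first if your table fields are numeric):
import csv
import StringIO

import arcpy

# Parse the CSV text fetched from S3 without writing it to disk.
reader = csv.reader(StringIO.StringIO(content))
header = next(reader)  # skip the header row if the file has one

# Write each row into the target table with an insert cursor.
with arcpy.da.InsertCursor("RDSTablePath", ["field1", "field2"]) as cursor:
    for row in reader:
        cursor.insertRow(row)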
If your RDS happens to be Aurora MySQL, then you should take a look at the LOAD DATA FROM S3 feature, which lets you skip the code and load the file straight into your DB.

AWS Python Lambda Function - Upload File to S3

I have an AWS Lambda function written in Python 2.7 in which I want to:
1) Grab an .xls file from an HTTP address.
2) Store it in a temp location.
3) Store the file in an S3 bucket.
My code is as follows:
from __future__ import print_function
import urllib
import datetime
import boto3
from botocore.client import Config

def lambda_handler(event, context):
    """Make a variable containing the date format based on YYYYMMDD"""
    cur_dt = datetime.datetime.today().strftime('%Y%m%d')

    """Make a variable containing the url and current date based on the variable
    cur_dt"""
    dls = "http://11.11.111.111/XL/" + cur_dt + ".xlsx"
    urllib.urlretrieve(dls, cur_dt + "test.xls")

    ACCESS_KEY_ID = 'Abcdefg'
    ACCESS_SECRET_KEY = 'hijklmnop+6dKeiAByFluK1R7rngF'
    BUCKET_NAME = 'my-bicket'
    FILE_NAME = cur_dt + "test.xls"

    data = open('/tmp/' + FILE_NAME, 'wb')

    # S3 Connect
    s3 = boto3.resource(
        's3',
        aws_access_key_id=ACCESS_KEY_ID,
        aws_secret_access_key=ACCESS_SECRET_KEY,
        config=Config(signature_version='s3v4')
    )

    # Uploaded File
    s3.Bucket(BUCKET_NAME).put(Key=FILE_NAME, Body=data, ACL='public-read')
However, when I run this function, I receive the following error:
'IOError: [Errno 30] Read-only file system'
I've spent hours trying to address this issue but I'm falling on my face. Any help would be appreciated.
'IOError: [Errno 30] Read-only file system'
You seem to lack some write access rights. If your Lambda has a different policy, try attaching this policy to its role:
arn:aws:iam::aws:policy/AWSLambdaFullAccess
It has full access to S3 as well, in case you can't write to your bucket. If it solves your issue, you can trim the permissions back afterwards.
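If you prefer not to grant AWSLambdaFullAccess, a narrower inline policy along these lines is typically enough for the upload itself (a sketch; the bucket name is a placeholder, and PutObjectAcl is only needed because the code sets ACL='public-read'):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:PutObjectAcl"],
            "Resource": "arn:aws:s3:::my-bicket/*"
        }
    ]
}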
I have uploaded an image to an S3 bucket this way. In the Lambda test event, I created a JSON test event that contains the Base64 of the image to be uploaded to the S3 bucket and the image name.
The Lambda test JSON event is as follows:
{
    "ImageName": "Your Image Name",
    "img64": "BASE64 of Your Image"
}
Following is the code to upload an image, or any file, to S3:
import boto3
import base64

def lambda_handler(event, context):
    s3 = boto3.resource(u's3')
    bucket = s3.Bucket(u'YOUR-BUCKET-NAME')
    path_test = '/tmp/output'      # temp path in lambda
    key = event['ImageName']       # assign filename to 'key' variable
    data = event['img64']          # assign base64 of an image to data variable
    data1 = data
    img = base64.b64decode(data1)  # decode the encoded image data (base64)

    with open(path_test, 'wb') as data:
        #data.write(data1)
        data.write(img)

    bucket.upload_file(path_test, key)  # Upload image directly inside bucket
    #bucket.upload_file(path_test, 'FOLDERNAME-IN-YOUR-BUCKET/{}'.format(key))  # Upload image inside a folder of your s3 bucket

    print('res---------------->', path_test)
    print('key---------------->', key)

    return {
        'status': 'True',
        'statusCode': 200,
        'body': 'Image Uploaded'
    }
Change data = open('/tmp/' + FILE_NAME, 'wb'): swap the 'wb' for 'r', since at that point you want to read the file, not open it for writing again.
Also, I assume your IAM user has full access to S3, right?
Or maybe the problem is in the request to that URL...
Also try making the downloaded file path start with "/tmp/":
urllib.urlretrieve(dls, "/tmp/" + cur_dt + "test.xls")
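Putting those suggestions together, a corrected handler would look roughly like the sketch below. It keeps the question's placeholder URL and bucket name, and assumes the Lambda execution role grants s3:PutObject so the hard-coded keys can be dropped.
from __future__ import print_function
import datetime
import urllib

import boto3

def lambda_handler(event, context):
    cur_dt = datetime.datetime.today().strftime('%Y%m%d')
    file_name = cur_dt + "test.xls"
    local_path = "/tmp/" + file_name   # /tmp is the only writable path in Lambda

    # Download into /tmp instead of the read-only working directory.
    dls = "http://11.11.111.111/XL/" + cur_dt + ".xlsx"
    urllib.urlretrieve(dls, local_path)

    # Upload the downloaded file; the execution role must allow s3:PutObject
    # (and s3:PutObjectAcl if you keep the public-read ACL).
    s3 = boto3.resource('s3')
    s3.Bucket('my-bicket').upload_file(local_path, file_name,
                                       ExtraArgs={'ACL': 'public-read'})

    return {'statusCode': 200, 'body': file_name + ' uploaded'}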