I am running the following code to get the number of records in a parquet file placed inside an S3 bucket.
import boto3
import os
s3 = boto3.client('s3')
sql_stmt = """SELECT count(*) FROM s3object s"""
req_fact =s3.select_object_content(
Bucket = 'test_hadoop',
Key = 'counter_db.cm_workload_volume_sec.dt=2023-01-23.cm_workload_volume_sec+2+000000347262.parquet',
ExpressionType = 'SQL',
Expression = sql_stmt,
InputSerialization={'Parquet':{}},
OutputSerialization = {'JSON': {}})
for event in req_fact['Payload']:
if 'Records' in event:
print(event['Records']['Payload'].decode('utf-8'))
elif 'Stats' in event:
print(event['Stats'])
However I get this error: botocore.exceptions.ClientError: An error occurred (XNotImplemented) when calling the SelectObjectContent operation: This node does not support SelectObjectContent.
What is the issue?
I ran your code against a known good (uncompressed) parquet file with no errors.
resp = s3.select_object_content(
Bucket='my-test-bucket',
Key='counter_db.cm_workload_volume_sec.dt=2023-01-23.cm_workload_volume_sec+2+000000347262.parquet',
ExpressionType='SQL',
Expression="SELECT count(*) FROM s3object s",
InputSerialization={'Parquet': {}},
OutputSerialization={'JSON': {}},
)
Output:
{"_1":4}
In the AWS console you can navigate to the S3 bucket and find the file, highlight it and choose to run S3 select there (Actions > Query with S3 Select). That will allow you to validate that the file can be queried with S3 select (which I think is your issue here)
Note the following: Amazon S3 Select does not support whole-object compression for Apache Parquet objects.
Related
Senario:
commit Athena query with boto3 and output to s3
download result in s3
Error: An error occurred (404) when calling the HeadObject operation: Not Found
It's weird that the file exists in S3 and I can copy it down with aws s3 cp command. But I just cannot download with boto3 and failed to execute head-object.
aws s3api head-object --bucket dsp-smaato-sink-prod --key /athena_query_results/c96bdc09-d545-4ee3-bc66-be3be928e3f2.csv
It does work. I've checked account policies and it has granted admin policy.
# snippets
def s3_donwload(url, target=None):
# s3 = boto3.resource('s3')
# client = s3.meta.client
client = boto3.client("s3", region_name=constant.AWS_REGION, endpoint_url='https://s3.ap-southeast-1.amazonaws.com')
s3_file = urlparse(url)
if target:
target = os.path.abspath(target)
else:
target = os.path.abspath(os.path.basename(s3_file.path))
logger.info(f"download {url} to {target}...")
client.download_file(s3_file.netloc, s3_file.path, target)
logger.info(f"download {url} to {target} done!")
Take a look at the value of s3_file.path -- does it start with a slash? If so, it needs to change because Amazon S3 keys do not start with a slash.
I suggest that you print the content of netloc, path and target to see what values it is actually passing.
It's a bit strange to use os.path with an S3 URL, so it might need some tweaking.
I have an s3 bucket which will receive new files throughout the day. I want to download these to my ec2 instance everytime a new file is uploaded to the bucket.
I have read that its possible using sqs or sns or lambda. Which is the easiest of them all? I need the file to be downloaded as early as possible once it is uploaded into the bucket.
EDIT
I basically will be getting png images in the bucket every few seconds or minutes. Everytime a new image is uploaded, I want to download that on the instance which is already running. I will do some AI processing. As the images will keeep coming into the bucket, I want to constantly keep downloading it in the ec2 and process it as soon as possible.
This is my code in the Lambda function so far.
import boto3
import json
def lambda_handler(event, context):
"""Read file from s3 on trigger."""
#print(event)
s3 = boto3.client("s3")
client = boto3.client("ec2")
ssm = boto3.client("ssm")
instanceid = "******"
if event:
file_obj = event["Records"][0]
#print(file_obj)
bucketname = str(file_obj["s3"]["bucket"]["name"])
print(bucketname)
filename = str(file_obj["s3"]["object"]["key"])
print(filename)
response = ssm.send_command(
InstanceIds=[instanceid],
DocumentName="AWS-RunShellScript",
Parameters={
"commands": [f"aws s3 cp {filename} ."]
}, # replace command_to_be_executed with command
)
# fetching command id for the output
command_id = response["Command"]["CommandId"]
time.sleep(3)
# fetching command output
output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
print(output)
return
However I am getting the following error
Test Event Name
test
Response
{
"errorMessage": "2021-12-01T14:11:30.781Z 88dbe51b-53d6-4c06-8c16-207698b3a936 Task timed out after 3.00 seconds"
}
Function Logs
START RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936 Version: $LATEST
END RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936
REPORT RequestId: 88dbe51b-53d6-4c06-8c16-207698b3a936 Duration: 3003.58 ms Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 87 MB Init Duration: 314.81 ms
2021-12-01T14:11:30.781Z 88dbe51b-53d6-4c06-8c16-207698b3a936 Task timed out after 3.00 seconds
Request ID
88dbe51b-53d6-4c06-8c16-207698b3a936
When I remove all the lines related to ssm, it works fine. Is there any permission issue or is there any problem with the code?
EDIT2
My code is working but I dont see any output or change in my ec2 instance. I should be seeing an empty text file in the home directory but I dont see anything
Code
import boto3
import json
import time
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
"""Read file from s3 on trigger."""
#print(event)
s3 = boto3.client("s3")
client = boto3.client("ec2")
ssm = boto3.client("ssm")
instanceid = "******"
print("HI")
if event:
file_obj = event["Records"][0]
#print(file_obj)
bucketname = str(file_obj["s3"]["bucket"]["name"])
print(bucketname)
filename = str(file_obj["s3"]["object"]["key"])
print(filename)
print("sending")
try:
response = ssm.send_command(
InstanceIds=[instanceid],
DocumentName="AWS-RunShellScript",
Parameters={
"commands": ["touch hi.txt"]
}, # replace command_to_be_executed with command
)
# fetching command id for the output
command_id = response["Command"]["CommandId"]
time.sleep(3)
# fetching command output
output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid)
print(output)
except Exception as e:
logger.error(e)
raise e
There are several ways. One would be to setup s3 notifications to invoke a lambda function. Then lambda function would use SSM Run Command to execute AWS CLI S3 command on your instance to download the file from S3.
I don't know why there is any recommendation of Lambda here. What you need is simple: S3 object created event notification -> SQS and some job on your EC2 instance watching a long polling queue.
Here is an example of such a python script. You need to sort out how the object key is encoded in the event, but it will be there. I haven't tested this, but it should be pretty close.
import boto3
def main() -> None:
s3 = boto3.client("s3")
sqs = boto3.client("sqs")
while True:
res = sqs.receive_message(
QueueUrl="yourQueue",
WaitTimeSeconds=20,
)
for msg in res.get("Messages", []):
s3.download_file("yourBucket", msg["key"], "local/file/path")
if __name__ == "__main__":
main()
You can use S3 Event Notifications, which react to a new file coming into the s3 bucket.
The destinations supported by s3 event are SNS, SQS or AWS lambda.
You can directly use the lambda as destination as described by #Marcin
You can use SQS has queue with a lambda behind pulling from the queue. It allows you to have some capability like dead letter queue. You can then pull messages from the queue using different methods:
AWS CLI
AWS SDK
You can use SNS with different things behind (you can have many of these desinations in a row which symbolise the fan-out pattern:
a SQS queue to manage the files
an email to notify
a lambda function
...
You can find more explication in ths article: https://aws.plainenglish.io/system-design-s3-events-to-lambda-vs-s3-events-to-sqs-sns-to-lambda-2d41477d1cc9
following the answers to this question Load S3 Data into AWS SageMaker Notebook I tried to load data from S3 bucket to SageMaker Jupyter Notebook.
I used this code:
import pandas as pd
bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
I replaced 'my-bucket' by the ARN (Amazon Ressource name) of my S3 bucket (e.g. "arn:aws:s3:::name-of-bucket") and replaced 'train.csv' by the csv-filename which is stored in the S3 bucket. Regarding the rest I did not change anything at all. What I got was this ValueError:
ValueError: Failed to head path 'arn:aws:s3:::name-of-bucket/name_of_file_V1.csv': Parameter validation failed:
Invalid bucket name "arn:aws:s3:::name-of-bucket": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
What did I do wrong? Do I have to modify the name of my S3 bucket?
The path should be:
data_location = 's3://{}/{}'.format(bucket, data_key)
where bucket is <bucket-name> not ARN. For example bucket=my-bucket-333222.
I wan to run my Athena query through AWS Lambda, but also change the name of my output CSV file from Query Execution ID to my-bucket/folder/my-preferred-string.csv
I tried searching for the results on web, but couldn't found the exact code for lambda function.
I am a data scientist and a beginner to AWS. This is a one time thing for me, so looking for a quick solution or a patch up.
This question is already posted here
client = boto3.client('athena')
s3 = boto3.resource("s3")
# Run query
queryStart = client.start_query_execution(
# PUT_YOUR_QUERY_HERE
QueryString = '''
SELECT *
FROM "db_name"."table_name"
WHERE value > 50
''',
QueryExecutionContext = {
# YOUR_ATHENA_DATABASE_NAME
'Database': "covid_data"
},
ResultConfiguration = {
# query result output location you mentioned in AWS Athena
"OutputLocation": "s3://bucket-name-X/folder-Y/"
}
)
# Executes query and waits 3 seconds
queryId = queryStart['QueryExecutionId']
time.sleep(3)
# Copies newly generated csv file with appropriate name
# query result output location you mentioned in AWS Athena
queryLoc = "bucket-name-X/folder-Y/" + queryId + ".csv"
# Destination location and file name
s3.Object("bucket-name-A", "report-2018.csv").copy_from(CopySource = queryLoc)
# Deletes Athena generated csv and it's metadata file
response = s3.delete_object(
Bucket='bucket-name-A',
Key=queryId+".csv"
)
response = s3.delete_object(
Bucket='bucket-name-A',
Key=queryId+".csv.metadata"
)
print('{file-name} csv generated')
I have an AWS Lambda function written in Python 2.7 in which I want to:
1) Grab an .xls file form an HTTP address.
2) Store it in a temp location.
3) Store the file in an S3 bucket.
My code is as follows:
from __future__ import print_function
import urllib
import datetime
import boto3
from botocore.client import Config
def lambda_handler(event, context):
"""Make a variable containing the date format based on YYYYYMMDD"""
cur_dt = datetime.datetime.today().strftime('%Y%m%d')
"""Make a variable containing the url and current date based on the variable
cur_dt"""
dls = "http://11.11.111.111/XL/" + cur_dt + ".xlsx"
urllib.urlretrieve(dls, cur_dt + "test.xls")
ACCESS_KEY_ID = 'Abcdefg'
ACCESS_SECRET_KEY = 'hijklmnop+6dKeiAByFluK1R7rngF'
BUCKET_NAME = 'my-bicket'
FILE_NAME = cur_dt + "test.xls";
data = open('/tmp/' + FILE_NAME, 'wb')
# S3 Connect
s3 = boto3.resource(
's3',
aws_access_key_id=ACCESS_KEY_ID,
aws_secret_access_key=ACCESS_SECRET_KEY,
config=Config(signature_version='s3v4')
)
# Uploaded File
s3.Bucket(BUCKET_NAME).put(Key=FILE_NAME, Body=data, ACL='public-read')
However, when I run this function, I receive the following error:
'IOError: [Errno 30] Read-only file system'
I've spent hours trying to address this issue but I'm falling on my face. Any help would be appreciated.
'IOError: [Errno 30] Read-only file system'
You seem to lack some write access right. If your lambda has another policy, try to attach this policy to your role:
arn:aws:iam::aws:policy/AWSLambdaFullAccess
It has full access on S3 as well, in case you can't write in your bucket. If it solves your issue, you'll remove some rights after that.
I have uploaded the image to s3 Bucket. In "Lambda Test Event", I have created one json test event which contains BASE64 of Image to be uploaded to s3 Bucket and Image Name.
Lambda Test JSON Event as fallows ======>
{
"ImageName": "Your Image Name",
"img64":"BASE64 of Your Image"
}
Following is the code to upload an image or any file to s3 ======>
import boto3
import base64
def lambda_handler(event, context):
s3 = boto3.resource(u's3')
bucket = s3.Bucket(u'YOUR-BUCKET-NAME')
path_test = '/tmp/output' # temp path in lambda.
key = event['ImageName'] # assign filename to 'key' variable
data = event['img64'] # assign base64 of an image to data variable
data1 = data
img = base64.b64decode(data1) # decode the encoded image data (base64)
with open(path_test, 'wb') as data:
#data.write(data1)
data.write(img)
bucket.upload_file(path_test, key) # Upload image directly inside bucket
#bucket.upload_file(path_test, 'FOLDERNAME-IN-YOUR-BUCKET /{}'.format(key)) # Upload image inside folder of your s3 bucket.
print('res---------------->',path_test)
print('key---------------->',key)
return {
'status': 'True',
'statusCode': 200,
'body': 'Image Uploaded'
}
change data = open('/tmp/' + FILE_NAME, 'wb') change the wb for "r"
also, I assume your IAM user has full access to S3 right?
or maybe the problem is in the request of that url...
try that cur_dt starts with "/tmp/"
urllib.urlretrieve(dls, "/tmp/" + cur_dt + "test.xls")