AWS SageMaker Processing job

I was able to run a simple Python script in a Notebook instance to read and write CSV files from/to an S3 bucket. Now I want to create a SageMaker Processing job to run the same code without any input/output data configuration. I have downloaded the same code and pushed the image to an ECR repository. How do I run this code in a Processing job so that it can also install the 's3fs' module? I just want to run the Python code in a Processing job without supplying any input/output configuration; I used boto3 to read/write from the S3 bucket. With the current code, the job is stuck in "In Progress".
The code, downloaded into VS Code:
!pip install s3fs

import boto3
import pandas as pd
from io import StringIO

# Read the CSV directly from S3 (pandas uses s3fs for s3:// paths)
path = 's3://weatheranalysis/weatherset.csv'
df = pd.read_csv(path)
df.head()

# Write the DataFrame back to S3 via an in-memory buffer
filename = 'newdata.csv'
bucketName = 'weatheranalysis'
csv_buffer = StringIO()
df.to_csv(csv_buffer)

client = boto3.client('s3')
response = client.put_object(
    ACL='private',
    Body=csv_buffer.getvalue(),
    Bucket=bucketName,
    Key=filename,
)
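
Incidentally, pandas only resolves the s3:// path because s3fs is installed. A hedged alternative that reads through boto3 directly, so the script no longer depends on s3fs at all (bucket and key reuse the values above):

# Alternative read that avoids the s3fs dependency entirely: fetch the
# object with boto3 and parse the bytes with pandas.
import boto3
import pandas as pd
from io import BytesIO

client = boto3.client('s3')
obj = client.get_object(Bucket='weatheranalysis', Key='weatherset.csv')
df = pd.read_csv(BytesIO(obj['Body'].read()))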

You can do so by building a SageMaker processing container and setting up the ScriptProcessor from the SageMaker Python SDK to run your existing Python script (preprocessing.py below).
A simple example:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(
    command=['python3'],
    image_uri='image_uri',
    role='role_arn',
    instance_count=1,
    instance_type='ml.m5.xlarge',
)

script_processor.run(
    code='preprocessing.py',
    inputs=[ProcessingInput(
        source='s3://path/to/my/input-data.csv',
        destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
             ProcessingOutput(source='/opt/ml/processing/output/validation'),
             ProcessingOutput(source='/opt/ml/processing/output/test')],
)
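Since your script reads and writes S3 itself with boto3, the inputs and outputs arguments are optional and can simply be dropped. A minimal sketch under two assumptions: the 'image_uri' and 'role_arn' placeholders are replaced with your own values, and pandas/boto3/s3fs are already baked into the ECR image (e.g. pip install s3fs in its Dockerfile):

# Minimal sketch: no ProcessingInput/ProcessingOutput at all; the script
# talks to S3 directly. Assumes the ECR image already bundles pandas,
# boto3 and s3fs. 'image_uri' and 'role_arn' are placeholders.
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(
    command=['python3'],
    image_uri='image_uri',
    role='role_arn',
    instance_count=1,
    instance_type='ml.m5.xlarge',
)
script_processor.run(code='preprocessing.py')

One thing worth checking for the job stuck in "In Progress": the !pip install s3fs line is IPython notebook syntax and will fail under plain python3, so remove it from the script and install s3fs in the image instead.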

Related

Download FTP file to S3 through AWS Glue, S3 path is not recognized

On an AWS Glue job, I'm using ftplib to download files and store them in S3, with the following code:
from ftplib import FTP
ftp = FTP()
ftp.connect("ftp.ser.ver", 21)
ftp.login("user", "password")
remotefile = 'filename.txt'
download = 's3://bucket/folder/filename.txt'
with open(download, 'wb') as file:
    ftp.retrbinary('RETR %s' % remotefile, file.write)
And I get the following error:
FileNotFoundError: [Errno 2] No such file or directory
Running the same code locally with the download path changed to a local path works. I'm fairly new to S3 and Glue and not sure where to look for the right documentation. Any insight or suggestion is greatly appreciated.
You can't open() an s3:// path, so you can't download an FTP file and save it directly to S3. You will have to stage it in the Glue environment first, using either a memory-based or file-based stream, and then upload it to S3.
import boto3
from ftplib import FTP

ftp = FTP()
ftp.connect("ftp.ser.ver", 21)
ftp.login("user", "password")

# Stage the file on local disk first
with open("/tmp/filename.txt", 'wb') as file:
    ftp.retrbinary('RETR filename.txt', file.write)  # retrbinary needs the full RETR command

# Then upload the staged file to S3
s3 = boto3.client('s3')
with open("/tmp/filename.txt", "rb") as f:
    s3.upload_fileobj(f, "BUCKET_NAME", "OBJECT_NAME")
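
If you'd rather not touch local disk, here is a memory-based variant of the same idea (a sketch, assuming the file fits comfortably in memory):

# Memory-based variant: stream the FTP download into an in-memory buffer,
# then upload the buffer to S3; nothing is written to local disk.
import io
import boto3
from ftplib import FTP

ftp = FTP()
ftp.connect("ftp.ser.ver", 21)
ftp.login("user", "password")

buffer = io.BytesIO()
ftp.retrbinary('RETR filename.txt', buffer.write)
buffer.seek(0)  # rewind before uploading

s3 = boto3.client('s3')
s3.upload_fileobj(buffer, "BUCKET_NAME", "OBJECT_NAME")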

AWS sagemaker model monitor - ImportError: cannot import name 'ModelQualityMonitor'

I am trying to create a model quality monitoring job using the ModelQualityMonitor class from sagemaker.model_monitor. I think I have all the import statements defined, yet I get a "cannot import name" error:
from sagemaker import get_execution_role, Session
from sagemaker.model_monitor import ModelQualityMonitor

role = get_execution_role()
session = Session()

model_quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=1800,
    sagemaker_session=session,
)
Any pointers are appreciated.
Are you using an Amazon SageMaker Notebook? When I run your code above in a new conda_python3 Amazon SageMaker notebook, I don't get any errors at all.
If you're getting something like NameError: name 'ModelQualityMonitor' is not defined, then I suspect you are running in a Python environment that doesn't have the Amazon SageMaker SDK installed. Perhaps try running pip install sagemaker and see if this resolves your error.
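
It is also worth checking which SDK version the environment has; my understanding (worth verifying against the changelog) is that ModelQualityMonitor only exists in 2.x releases of the SageMaker Python SDK, so an older pinned version would raise exactly this ImportError:

# Check the installed SageMaker Python SDK version; ModelQualityMonitor
# is only present in (sufficiently recent) 2.x releases.
import sagemaker
print(sagemaker.__version__)
# If this prints a 1.x version, upgrading should fix the import:
#   pip install --upgrade sagemaker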

Read/Write Shapefiles .shp stored in AWS S3 from AWS EC2 using python

I am using the code below in my local Python CLI, where it works, but I am unable to run it on EC2 (Amazon Linux 2). Please help me find a solution, as I have tried many approaches from the internet.
import pandas as pd
import geopandas as gpd

gdf1 = gpd.read_file("s3://bucketname/inbound/folder/filename.shp")
gdf2 = gpd.read_file("s3://bucketname/inbound/folder/filename.shp")
gdf = gpd.GeoDataFrame(pd.concat([gdf1, gdf2]))
gdfw = gdf.to_file("s3://bucket/outbound/folder/")
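
A common workaround, sketched under two assumptions: geopandas needs s3fs/fsspec installed to open s3:// URLs, and to_file cannot write a multi-part shapefile straight to an S3 prefix, so you write locally and upload the component files with boto3 (bucket and key names reuse the question's placeholders):

# Sketch: read from S3 (requires s3fs/fsspec installed on the EC2 box),
# write the merged shapefile locally, then upload each component file
# (.shp, .shx, .dbf, ...) to the outbound S3 prefix.
import os
import boto3
import pandas as pd
import geopandas as gpd

gdf1 = gpd.read_file("s3://bucketname/inbound/folder/filename.shp")
gdf2 = gpd.read_file("s3://bucketname/inbound/folder/filename.shp")
gdf = gpd.GeoDataFrame(pd.concat([gdf1, gdf2]))

local_dir = "/tmp/merged"
os.makedirs(local_dir, exist_ok=True)
gdf.to_file(os.path.join(local_dir, "merged.shp"))

s3 = boto3.client('s3')
for name in os.listdir(local_dir):
    s3.upload_file(os.path.join(local_dir, name), "bucket", "outbound/folder/" + name)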

AWS Lambda unable to link the tesseract Executable

Tesseract OCR on AWS Lambda via virtualenv (scroll to "Adaptations for tesseract 4").
I used this link to create the executable and the dependency libraries for tesseract, zipped everything, and dropped it in S3.
I am using Lambda to download this zip and extract the dependencies into the /tmp folder. Now I am planning to use these dependencies in my Lambda (Python 3 runtime).
I am getting this error:
Response:
{
  "errorMessage": "tesseract is not installed or it's not in your path",
  "errorType": "TesseractNotFoundError",
  ...
}
This is happening because the environment variables are not set. I have tried to set them but cannot bypass this error.
# Set the module path and environment variables before importing pytesseract
import os
import sys
sys.path.insert(0, '/tmp/')

import boto3
import cv2
import numpy as np
import subprocess

os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['TESSDATA_PREFIX'] = "/tmp/tessdata/"
import pytesseract
I set the environment variables like this in the Lambda function. I have even tried setting them in the Lambda configuration (screenshot omitted). Still no luck.
I am sure this Lambda package works, because I created a new EC2 instance, downloaded the same zip file, and extracted the libraries into the /tmp/ folder. I wrote a basic test script for tesseract, and it works:
import cv2
import pytesseract
import os
# os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['LD_LIBRARY_PATH'] = '/tmp/lib:/tmp'
config = ('-l eng --oem 1 --psm 3')
im = cv2.imread('pytesseract/test-european.jpg', cv2.IMREAD_COLOR)
text = pytesseract.image_to_string(im, config=config)
print(text)
Can somebody tell me what I did wrong with Lambda?
I don't want to zip everything together, because my zip file would be greater than 50 MB. Instead I want to keep downloading the packages/modules/binaries from S3 to Lambda and make that work.
Apparently Lambda doesn't let you change the PATH variable. Try adding this to your script:
pytesseract.pytesseract.tesseract_cmd = r'/var/task/tesseract'
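
Since your setup extracts the binary to /tmp rather than bundling it at /var/task, the equivalent for that layout would presumably be the following (a sketch; the paths assume the /tmp structure you described, with shared libraries in /tmp/lib and language data in /tmp/tessdata, and the test image name is hypothetical):

# Sketch for the /tmp layout: point pytesseract at the extracted binary
# instead of relying on PATH, and set library/tessdata locations up front.
import os
os.environ['LD_LIBRARY_PATH'] = '/tmp/lib:/tmp'    # extracted shared libraries
os.environ['TESSDATA_PREFIX'] = '/tmp/tessdata/'   # extracted language data

import pytesseract
pytesseract.pytesseract.tesseract_cmd = '/tmp/tesseract'  # extracted executable

text = pytesseract.image_to_string('/tmp/test-european.jpg')  # hypothetical test image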

Get files from S3 using Jython in Grinder test script

I have master/worker EC2 instances that I'm using for Grinder tests. I need to try out a load test that directly gets files from an S3 bucket, but I'm not sure how that would look in Jython for the Grinder test script.
Any ideas or tips? I've looked into it a little and saw that Python has the boto package for working with AWS - would that work in Jython as well?
(Edit - adding code and import errors for clarification.)
Python approach:
Did "pip install boto3"
Test script:
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
import boto3  # boto3 for Python

test1 = Test(1, "S3 request")
resource = boto3.resource('s3')

def accessS3():
    obj = resource.Object(<bucket>, <key>)

test1.record(accessS3)

class TestRunner:
    def __call__(self):
        accessS3()
The error for this is:
net.grinder.scriptengine.jython.JythonScriptExecutionException: : No module named boto3
Java approach:
I added the aws-java-sdk-1.11.221 jar from .m2\repository\com\amazonaws\aws-java-sdk\1.11.221\ to the CLASSPATH.
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
import com.amazonaws.services.s3 as s3  # AWS SDK for Java

test1 = Test(1, "S3 request")
s3Client = s3.AmazonS3ClientBuilder.defaultClient()
test1.record(s3Client)

class TestRunner:
    def __call__(self):
        result = s3Client.getObject(s3.model.GetObjectRequest(<bucket>, <key>))  # GetObjectRequest is the class name
The error for this is:
net.grinder.scriptengine.jython.JythonScriptExecutionException: : No module named amazonaws
I'm also running things on a Windows computer, but I'm using Git Bash.
Given that you are using Jython, I'm not sure whether you want to execute the S3 request with Java or Python syntax. However, I would suggest following along with the Python guide at the link below.
http://docs.ceph.com/docs/jewel/radosgw/s3/python/
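
For reference, that guide uses the classic boto 2 API; a sketch in that style (credentials, bucket, and key are placeholders, and boto would still need to be importable on the Jython path):

# Sketch of the boto 2 style documented in the linked guide: connect,
# open a bucket, and read one object. All names are placeholders.
import boto

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)
bucket = conn.get_bucket('my-bucket')
key = bucket.get_key('my-key')
contents = key.get_contents_as_string()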