Get files from S3 using Jython in Grinder test script - amazon-web-services

I have master/worker EC2 instances that I'm using for Grinder tests. I need to try out a load test that directly gets files from an S3 bucket, but I'm not sure how that would look in Jython for the Grinder test script.
Any ideas or tips? I've looked into it a little and saw that Python has the boto package for working with AWS - would that work in Jython as well?
(Edit - adding code and import errors for clarification.)
Python approach:
Did "pip install boto3"
Test script:
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
import boto3
# boto3 for Python
test1 = Test(1, "S3 request")
resource = boto3.resource('s3')
def accessS3():
    obj = resource.Object(<bucket>,<key>)
test1.record(accessS3)
class TestRunner:
    def __call__(self):
        accessS3()
The error for this is:
net.grinder.scriptengine.jython.JythonScriptExecutionException: : No module named boto3
Java approach:
Added aws-java-sdk-1.11.221 jar from .m2\repository\com\amazonaws\aws-java-sdk\1.11.221\ to CLASSPATH
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
import com.amazonaws.services.s3 as s3
# aws s3 for Java
test1 = Test(1, "S3 request")
s3Client = s3.AmazonS3ClientBuilder.defaultClient()
test1.record(s3Client)
class TestRunner:
    def __call__(self):
        result = s3Client.getObject(s3.model.GetObjectRequest(<bucket>,<key>))
The error for this is:
net.grinder.scriptengine.jython.JythonScriptExecutionException: : No module named amazonaws
I'm also running things on a Windows computer, but I'm using Git Bash.

Since you are using Jython, it is not clear whether you want to make the S3 request through the Java SDK or through a Python library.
If Python is acceptable, I would suggest following the Python guide at the link below.
http://docs.ceph.com/docs/jewel/radosgw/s3/python/
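Both errors in the question are import failures, so a useful first step is to check what the interpreter running the script can actually see. Packages installed with pip for CPython, such as boto3, are not automatically visible to Jython, and Java classes can only be imported if the jar that contains them is on the classpath (or sys.path) seen by the Grinder worker process. The snippet below is a minimal diagnostic sketch and not part of the answer above; can_import is a hypothetical helper name, and it relies only on the standard importlib module.

```python
import importlib
import sys


def can_import(module_name):
    """Return True if module_name is importable by this interpreter."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # Show where this interpreter looks for modules.
    for entry in sys.path:
        print(entry)
    # Check the two modules from the question.
    for name in ("boto3", "com.amazonaws.services.s3"):
        print("%s importable: %s" % (name, can_import(name)))
```

Running this from inside the Grinder script (rather than from a normal Python shell) shows the search path the worker really uses.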

Related

Where do I put a custom config (.cfg) file in AWS MWAA Airflow?

I have a config file, dev.cfg, that looks like this:
[S3]
bucket_name = my-bucket
I need this in my code to do S3 things. I do not have the access (nor will I be given it) to modify the environmental or config variables in the AWS console. My only method of putting files in S3 is the CLI (aws s3 cp ...).
This is the project directory structure in S3:
my-bucket/
dags/
dev.cfg
some_dag_file.py
plugins.zip
requirements.txt
In the plugins.zip file, there is a plugin that should set the path to dev.cfg to an env var (DEV_CONFIG_PATH) that my code uses:
import os
from airflow.plugins_manager import AirflowPlugin
os.environ["DEV_CONFIG_PATH"] = os.path.join(
    os.getenv('AIRFLOW_HOME'), 'dags', 'dev.cfg')
class EnvVarPlugin(AirflowPlugin):
    name = 'env_var_plugin'
However, I'm getting an import error in the Airflow UI:
configparser.NoSectionError: No section: 'S3'
Any help is appreciated.
I'm using MWAA with Airflow version 2.4.3 running python 3.10.
Try asking your infra team to update the config: core.lazy_load_plugins : False
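One thing worth ruling out: ConfigParser.read() silently skips files it cannot open, so a wrong or not-yet-set DEV_CONFIG_PATH surfaces later as NoSectionError rather than as a missing-file error. The sketch below is an illustration and not the original poster's code. It fails early and reports the path it tried. load_bucket_name is a hypothetical helper, while the S3 section and bucket_name key come from the dev.cfg shown above.

```python
import configparser
import os


def load_bucket_name(config_path):
    """Read bucket_name from the [S3] section, failing loudly if the file is missing."""
    if not config_path or not os.path.isfile(config_path):
        raise FileNotFoundError("Config file not found: {!r}".format(config_path))
    parser = configparser.ConfigParser()
    parser.read(config_path)
    return parser.get("S3", "bucket_name")
```

Calling load_bucket_name(os.environ.get("DEV_CONFIG_PATH")) inside the DAG code would then show whether the environment variable was set by the time the DAG was parsed.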

AWS SageMaker Processing job

I was able to run simple Python code in a notebook instance to read and write CSV files from/to an S3 bucket. Now I want to create a SageMaker Processing job that runs the same code without any input/output data configuration. I have downloaded the same code and pushed the image to an ECR repository. How can I run this code in a Processing job so that it is also able to install the 's3fs' module? I just want to run Python code in a Processing job without specifying any algorithm or input/output configuration; the code uses boto3 to read from and write to the S3 bucket. With the current code, the job is stuck at "In Progress".
downloaded code in vs code
!pip install s3fs
import boto3
import pandas as pd
from io import StringIO
client = boto3.client('s3')
path = 's3://weatheranalysis/weatherset.csv'
df = pd.read_csv(path)
df.head()
filename = 'newdata.csv'
bucketName = 'weatheranalysis'
csv_buffer = StringIO()
df.to_csv(csv_buffer)
client = boto3.client('s3')
response = client.put_object(
    ACL='private',
    Body=csv_buffer.getvalue(),
    Bucket=bucketName,
    Key=filename
)
You can do so by defining a 'sagemaker-processing-container' image and setting up the ScriptProcessor from the SageMaker Python SDK to run your existing Python script, saved as preprocessing.py.
A simple example can be found here
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
script_processor = ScriptProcessor(command=['python3'],
                                   image_uri='image_uri',
                                   role='role_arn',
                                   instance_count=1,
                                   instance_type='ml.m5.xlarge')
script_processor.run(code='preprocessing.py',
                     inputs=[ProcessingInput(
                         source='s3://path/to/my/input-data.csv',
                         destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                              ProcessingOutput(source='/opt/ml/processing/output/validation'),
                              ProcessingOutput(source='/opt/ml/processing/output/test')])
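With this setup, SageMaker copies the S3 input into the container at the destination path before the script starts, and uploads whatever the script writes under each ProcessingOutput source path when it finishes, so the script itself only touches local files. A minimal sketch of what preprocessing.py could look like under those assumptions follows. The run function and the output file name data.csv are hypothetical, and the standard csv module is used instead of pandas so the example has no extra dependencies.

```python
import csv
import os


def run(input_dir="/opt/ml/processing/input",
        output_dir="/opt/ml/processing/output/train"):
    """Copy rows from the local input CSV into the local output directory."""
    src = os.path.join(input_dir, "input-data.csv")
    dst = os.path.join(output_dir, "data.csv")  # hypothetical output file name
    os.makedirs(output_dir, exist_ok=True)
    rows = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row)  # replace with the real transformation
            rows += 1
    return rows


if __name__ == "__main__":
    # Use the default paths only when they exist, i.e. inside the container.
    if os.path.isdir("/opt/ml/processing/input"):
        print("rows written:", run())
```

If the job should not use any ProcessingInput or ProcessingOutput at all, the same script can instead read and write S3 directly with boto3, provided the execution role grants access to the bucket.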

AWS sagemaker model monitor- ImportError: cannot import name 'ModelQualityMonitor'

I am trying to create a model quality monitoring job using the ModelQualityMonitor class from sagemaker.model_monitor. I think I have all the import statements defined, yet I get a "cannot import name" error.
from sagemaker import get_execution_role, session, Session
from sagemaker.model_monitor import ModelQualityMonitor
role = get_execution_role()
session = Session()
model_quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=1800,
    sagemaker_session=session
)
Any pointers are appreciated
Are you using an Amazon SageMaker Notebook? When I run your code above in a new conda_python3 Amazon SageMaker notebook, I don't get any errors at all.
If you're getting something like NameError: name 'ModelQualityMonitor' is not defined then I suspect you are running in a Python environment that doesn't have the Amazon SageMaker SDK installed in it. Perhaps try running pip install sagemaker and then see if this resolves your error.
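An "ImportError: cannot import name ..." (as opposed to "No module named sagemaker") usually means the package is installed but does not define that name, which can happen with an SDK release older than the one that introduced the class. A quick way to see which version the current environment has is sketched below. installed_version is a hypothetical helper built on the standard importlib.metadata module (Python 3.8+).

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("sagemaker version:", installed_version("sagemaker"))
```

If this prints None or an old version, running pip install --upgrade sagemaker in the same environment and restarting the kernel is a reasonable next step.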

AWS Glue Python Shell Job Connect Timeout Error

I am trying to run an AWS Glue Python Shell job, but it gives me a Connect Timeout Error.
Error Image : https://i.stack.imgur.com/MHpHg.png
Script : https://i.stack.imgur.com/KQxkj.png
It looks like you haven't added a Secrets Manager endpoint to your VPC. Because the job's traffic does not leave the AWS network, there is no internet access inside your Glue job's VPC. So if you want to connect to Secrets Manager, you need to add an endpoint for it to your VPC.
Refer to this on how you can add this to your VPC and this to make sure you have properly configured security groups.
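To confirm this diagnosis from inside the job, a plain TCP probe against the service endpoint can help: if it times out, the job has no network path to Secrets Manager. This is a rough sketch; can_connect is a hypothetical helper, and the host name in the usage note assumes the standard regional endpoint pattern with us-east-1 as a placeholder region.

```python
import socket


def can_connect(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False
```

For example, print(can_connect("secretsmanager.us-east-1.amazonaws.com")) at the top of the Glue script should print True once the VPC endpoint and security groups are in place.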
AWS Glue Git Issue
Hi,
We got AWS Glue Python Shell working with all dependencies as follows. Our Glue job depends on awscli as well as boto3.
AWS Glue Python Shell with Internet
Add awscli and boto3 whl files to Python library path during Glue Job execution. This option is slow as it has to download and install dependencies.
Download the following whl files
awscli-1.18.183-py2.py3-none-any.whl
boto3-1.16.23-py2.py3-none-any.whl
Upload the files to the S3 bucket location used for your Python library path
Add the S3 paths of the whl files to the Python library path. Give the full S3 path of each whl file, separated by commas
AWS Glue Python Shell without Internet connectivity
Reference: AWS Wrangler Glue dependency build
We followed the steps mentioned above for awscli and boto3 whl files
Below is the latest requirements.txt compiled for the newest versions
colorama==0.4.3
docutils==0.15.2
rsa==4.5.0
s3transfer==0.3.3
PyYAML==5.3.1
botocore==1.19.23
pyasn1==0.4.8
jmespath==0.10.0
urllib3==1.26.2
python_dateutil==2.8.1
six==1.15.0
Download the dependencies to libs folder
pip download -r requirements.txt -d libs
Also move the original main whl files to the libs directory
awscli-1.18.183-py2.py3-none-any.whl
boto3-1.16.23-py2.py3-none-any.whl
Package as a zip file
cd libs
zip ../boto3-depends.zip *
Upload the boto3-depends.zip to s3 and add the path to Glue jobs Referenced files path
Note: It is Referenced files path and not Python library path
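The packaging step can also be scripted in Python, which is convenient on machines without a zip binary. The sketch below mirrors "zip ../boto3-depends.zip *" using the standard zipfile module. package_libs is a hypothetical helper name, and files are stored at the archive root, as the shell command does.

```python
import os
import zipfile


def package_libs(libs_dir, zip_path):
    """Zip every top-level file in libs_dir into zip_path, stored at the archive root."""
    names = sorted(
        name for name in os.listdir(libs_dir)
        if os.path.isfile(os.path.join(libs_dir, name))
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            archive.write(os.path.join(libs_dir, name), arcname=name)
    return names
```

For example, package_libs("libs", "boto3-depends.zip") run from the directory that contains libs produces the archive to upload.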
Placeholder code to install the latest awscli and boto3 and load them into the AWS Glue Python Shell.
import os.path
import subprocess
import sys
# borrowed from https://stackoverflow.com/questions/48596627/how-to-import-referenced-files-in-etl-scripts
def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))
zip_file = get_referenced_filepath("awswrangler-depends.zip")
subprocess.run()
# Can't install --user, or without "-t ." because of permissions issues on the filesystem
subprocess.run(, shell=True)
# Additional code as part of AWS Thread https://forums.aws.amazon.com/thread.jspa?messageID=954344
sys.path.insert(0, '/glue/lib/installation')
keys = list(sys.modules.keys())
for k in keys:
    if 'boto' in k:
        del sys.modules[k]
import boto3
print('boto3 version')
print(boto3.__version__)
Check if the code is working with latest AWS CLI API
Thanks
Sarath

AWS Lambda unable to link the tesseract Executable

Tesseract OCR on AWS Lambda via virtualenv
Scroll to Adaptations for tesseract 4.
I have used this link to create the executable and the dependency libraries for tesseract. I have zipped everything and dropped in S3.
I am using lambda to download this zip, extract the dependencies in to /tmp folder. Now I am planning to use these dependencies in my lambda(python3 platform).
I am getting this error
Response:
{
"errorMessage": "tesseract is not installed or it's not in your path",
"errorType": "TesseractNotFoundError",
This is happening because the environment variable is not set.
I have tried to set it but cannot get past this error.
import os
import sys
# Setting the modules path
sys.path.insert(0, '/tmp/')
import boto3
import cv2
import numpy as np
import subprocess
os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['TESSDATA_PREFIX'] = "/tmp/tessdata/"
import pytesseract
I have set the environment variables like this in the Lambda function, but I still get the same error. I have even tried setting the variables as shown in the image below. Still no luck.
I am sure this lambda package works because I have created a new ec2 instance, downloaded the same zip file and extracted the libraries into /tmp/ folder. I wrote a basic test function for testing tesseract. This works.
import cv2
import pytesseract
import os
# os.environ['PATH'] = "{}:/tmp/pytesseract:/tmp/".format(os.environ['PATH'])
os.environ['LD_LIBRARY_PATH'] = '/tmp/lib:/tmp'
config = ('-l eng --oem 1 --psm 3')
im = cv2.imread('pytesseract/test-european.jpg', cv2.IMREAD_COLOR)
text = pytesseract.image_to_string(im, config=config)
print(text)
Can somebody tell me what I did wrong with Lambda?
I don't want to zip everything because my zip file is greater than 50 MB. Also I want to try downloading the packages/modules/binaries from S3 to lambda and make it work.
Apparently lambda doesn't allow you to make changes to the PATH variable.
Try adding this to your script
pytesseract.pytesseract.tesseract_cmd = r'/var/task/tesseract'
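A related thing to check when the binary is unpacked at runtime: if the archive is extracted with Python's zipfile module, Unix permission bits are not restored, so even after tesseract_cmd points at the right file the call can fail because the extracted binary is not executable. The sketch below restores the executable bit. ensure_executable is a hypothetical helper, and the /tmp/tesseract path in the usage note is an assumption about where the archive was extracted, not something stated in the answer above.

```python
import os
import stat


def ensure_executable(path):
    """Add the executable bits to path and report whether it is now executable."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.access(path, os.X_OK)
```

For example, call ensure_executable('/tmp/tesseract') before setting pytesseract.pytesseract.tesseract_cmd = '/tmp/tesseract', adjusting the path to wherever the binary actually lands.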