How to input FSx for Lustre to Amazon SageMaker?

I am trying to set up Amazon SageMaker to read our dataset from our AWS FSx for Lustre file system.
We are using the SageMaker Python SDK, and previously we were reading our dataset from S3, which worked fine:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='model_script.py',
    image_uri='some-repo:some-tag',
    instance_type='ml.m4.10xlarge',
    instance_count=1,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
    subnets=["subnet-1"],
    security_group_ids=["sg-1", "sg-2"],
    debugger_hook_config=False,
)
estimator.fit({
    'training': f"s3://bucket_name/data/{hyperparameters['dataset']}/"
})
But now that I'm changing the input data source to the FSx for Lustre file system, I'm getting an error that the file input should be s3:// or file://. I was following these docs (FSx for Lustre):
from sagemaker.inputs import FileSystemInput

estimator = TensorFlow(
    entry_point='model_script.py',
    # image_uri='some-docker:some-tag',
    instance_type='ml.m4.10xlarge',
    instance_count=1,
    role=role,
    framework_version='2.0.0',
    py_version='py3',
    subnets=["subnet-1"],
    security_group_ids=["sg-1", "sg-2"],
    debugger_hook_config=False,
)
fsx_data_folder = FileSystemInput(
    file_system_id='fs-1',
    file_system_type='FSxLustre',
    directory_path='/fsx/data',
    file_system_access_mode='ro',
)
estimator.fit(f"{fsx_data_folder}/{hyperparameters['dataset']}/")
Throws the following error:
ValueError: URI input <sagemaker.inputs.FileSystemInput object at 0x0000016A6C7F0788>/dataset_name/ must be a valid S3 or FILE URI: must start with "s3://" or "file://"
Does anyone understand what I am doing wrong? Thanks in advance!

I was (quite stupidly, it was late ;)) treating the FileSystemInput object as a string instead of an object. The error complained that the concatenation of object + string is not a valid URI pointing to a location in S3.
The correct way is to build the FileSystemInput object out of the entire path to the dataset. Note that fit() now takes this object and will mount it at data_dir = "/opt/ml/input/data/training".
fsx_data_obj = FileSystemInput(
    file_system_id='fs-1',
    file_system_type='FSxLustre',
    directory_path=f"/fsx/data/{hyperparameters['dataset']}",
    file_system_access_mode='ro',
)
estimator.fit(fsx_data_obj)
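If you want the dataset mounted under a named channel (like the 'training' channel used with S3 above), the same FileSystemInput object can be passed in a channel dict; a minimal sketch, reusing the hypothetical IDs and paths from the question:
fsx_data_obj = FileSystemInput(
    file_system_id='fs-1',                   # hypothetical FSx file system ID
    file_system_type='FSxLustre',
    directory_path='/fsx/data/my-dataset',   # hypothetical dataset directory on the FSx file system
    file_system_access_mode='ro',
)
# The channel name sets the mount point inside the container: /opt/ml/input/data/training
estimator.fit({'training': fsx_data_obj})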

Related

Re-encoding audio file to linear16 for Google Cloud Speech API fails with '[Errno 30] Read-only file system'

I'm trying to convert an audio file to linear16 format using the FFmpeg module. I've stored the audio file in one Cloud Storage bucket and want to move the converted file to a different bucket. The code works perfectly in VS Code and deploys successfully to Cloud Functions, but it fails with [Errno 30] Read-only file system when run on the cloud.
Here's the code
from google.cloud import speech
from google.cloud import storage
import ffmpeg
import sys

out_bucket = 'encoded_audio_landing'
input_bucket_name = 'audio_landing'

def process_audio(input_bucket_name, in_filename, out_bucket):
    '''
    converts audio encoding for GSK call center call recordings to linear16 encoding and 16,000
    hertz sample rate
    Params:
        in_filename: a GSK call audio file
    returns an audio file encoded so that the Google Speech-to-Text API can transcribe it
    '''
    storage_client = storage.Client()
    bucket = storage_client.bucket(input_bucket_name)
    blob = bucket.blob(in_filename)
    blob.download_to_filename(blob.name)
    print('type contents: ', type('processedfile'))
    #print('blob name / len / type', blob.name, len(blob.name), type(blob.name))
    try:
        out, err = (
            ffmpeg.input(blob.name)
            #ffmpeg.input()
            .output('pipe: a', format="s16le", acodec="pcm_s16le", ac=1, ar="16k")
            .overwrite_output()
            .run(capture_stdout=True, capture_stderr=True)
        )
    except ffmpeg.Error as e:
        print(e.stderr, file=sys.stderr)
        sys.exit(1)
    up_bucket = storage_client.bucket(out_bucket)
    up_blob = up_bucket.blob(blob.name)
    #print('type / len out', type(out), len(out))
    up_blob.upload_from_string(out)
    # delete source file
    blob.delete()

def hello_gcs(event, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    This generic function logs relevant data when a file is changed,
    and works for all Cloud Storage CRUD operations.
    Args:
        event (dict): The dictionary with data specific to this type of event.
            The `data` field contains a description of the event in
            the Cloud Storage `object` format described here:
            https://cloud.google.com/storage/docs/json_api/v1/objects#resource
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Cloud Logging
    """
    #print('Event ID: {}'.format(context.event_id))
    #print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    #print('Created: {}'.format(event['timeCreated']))
    #print('Updated: {}'.format(event['updated']))
    # convert audio encoding
    print('begin process_audio')
    process_audio(input_bucket_name, event['name'], out_bucket)
The problem was that I was downloading the file to my local directory, which obviously wouldn't work on the cloud. I read another article where someone added a get_file_path function and used its result as the input to blob.download_to_filename(). I'm not sure why that worked.
I did try just removing the whole download_to_filename bit, but it didn't work without it.
I'd very much appreciate an explanation if someone knows why.
# This gets around downloading the file to a local folder. It creates some sort of temp location.
import os
import tempfile
from werkzeug.utils import secure_filename  # assumed source of secure_filename

def get_file_path(filename):
    file_name = secure_filename(filename)
    return os.path.join(tempfile.gettempdir(), file_name)

def process_audio(input_bucket_name, in_filename, out_bucket):
    '''
    converts audio encoding for GSK call center call recordings to linear16 encoding and 16,000
    hertz sample rate
    Params:
        in_filename: a GSK call audio file
        input_bucket_name: location of the source file that needs to be re-encoded
        out_bucket: where to put the newly encoded file
    returns an audio file encoded so that the Google Speech-to-Text API can transcribe it
    '''
    storage_client = storage.Client()
    bucket = storage_client.bucket(input_bucket_name)
    blob = bucket.blob(in_filename)
    print(blob.name)
    # creates some sort of temp location for the file
    file_path = get_file_path(blob.name)
    blob.download_to_filename(file_path)
    print('type contents: ', type('processedfile'))
    #print('blob name / len / type', blob.name, len(blob.name), type(blob.name))
    # invokes the ffmpeg library to re-encode the audio file; it's actually a command-line application
    # that is available in Python and Google Cloud. The things in the .output bit are options from ffmpeg;
    # you pass these options into ffmpeg there
    try:
        out, err = (
            ffmpeg.input(file_path)
            #ffmpeg.input()
            .output('pipe: a', format="s16le", acodec="pcm_s16le", ac=1, ar="16k")
            .overwrite_output()
            .run(capture_stdout=True, capture_stderr=True)
        )
    except ffmpeg.Error as e:
        print(e.stderr, file=sys.stderr)
        sys.exit(1)
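As for the "why": the Cloud Functions deployment filesystem is read-only and only /tmp is writable, so writing the download to tempfile.gettempdir() (which resolves to /tmp) is what avoids the Errno 30 error. The snippet above stops after the ffmpeg call; a minimal sketch of the remaining upload step, mirroring the question's original code and reusing the same variables:
    # (continuing inside process_audio) upload the re-encoded bytes to the destination
    # bucket, then delete the source object, as in the question's original code
    up_bucket = storage_client.bucket(out_bucket)
    up_blob = up_bucket.blob(blob.name)
    up_blob.upload_from_string(out)
    blob.delete()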

File truncated on upload to GCS

I am uploading a relatively small (<1 MiB) .jsonl file to Google Cloud Storage using the Python API. The function I used is from the GCP documentation:
from google.cloud import storage

def upload_blob(key_path, bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The path to your file to upload
    # source_file_name = "local/path/to/file"
    # The ID of your GCS object
    # destination_blob_name = "storage-object-name"
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
The issue I am having is that the .jsonl file is getting truncated at 9500 lines after the upload. In fact, the 9500th line is not even complete. I am not sure what the issue is and don't think there should be any limit for such a small file. Any help is appreciated.
I had a similar problem some time ago. In my case the upload to the bucket was called inside a Python with clause, right after the line where I wrote the contents to source_file_name, so I just needed to move the upload call outside the with block so that the local file was properly flushed and closed before being uploaded.
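A minimal sketch of that fix, assuming the upload_blob function above and a hypothetical jsonl_contents string holding the lines to write:
# Wrong: uploading inside the with block, before the file is flushed and closed,
# can upload a truncated file.
with open(source_file_name, "w") as f:
    f.write(jsonl_contents)
    # upload_blob(key_path, bucket_name, source_file_name, destination_blob_name)

# Right: upload after the with block, once the file has been closed.
upload_blob(key_path, bucket_name, source_file_name, destination_blob_name)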

ClientError: Failed to download data. Please check your s3 objects and ensure that there is no object that is both a folder as well as a file

How are you?
I'm trying to execute a SageMaker job but I get this error:
ClientError: Failed to download data. Cannot download s3://pocaaml/sagemaker/xsell_sc1_test/model/model_lgb.tar.gz, a previously downloaded file/folder clashes with it. Please check your s3 objects and ensure that there is no object that is both a folder as well as a file.
I do have that model_lgb.tar.gz on that S3 path.
This is my code:
import boto3
import sagemaker
from time import gmtime, strftime
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

project_name = 'xsell_sc1_test'
s3_bucket = "pocaaml"
prefix = "sagemaker/" + project_name
account_id = "029294541817"
s3_bucket_base_uri = "{}{}".format("s3://", s3_bucket)
dev = "dev-{}".format(strftime("%y-%m-%d-%H-%M", gmtime()))
region = sagemaker.Session().boto_region_name
print("Using AWS Region: {}".format(region))
# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()
boto3.setup_default_session(region_name=region)
boto_session = boto3.Session(region_name=region)
s3_client = boto3.client("s3", region_name=region)
sagemaker_boto_client = boto_session.client("sagemaker")  # is this one needed?
sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session, sagemaker_client=sagemaker_boto_client
)
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1", role=role, instance_type='ml.m5.4xlarge', instance_count=1
)
PREPROCESSING_SCRIPT_LOCATION = 'funciones_altas.py'
preprocessing_input_code = sagemaker_session.upload_data(
    PREPROCESSING_SCRIPT_LOCATION,
    bucket=s3_bucket,
    key_prefix="{}/{}".format(prefix, "code")
)
preprocessing_input_data = "{}/{}/{}".format(s3_bucket_base_uri, prefix, "data")
preprocessing_input_model = "{}/{}/{}".format(s3_bucket_base_uri, prefix, "model")
preprocessing_output = "{}/{}/{}/{}/{}".format(s3_bucket_base_uri, prefix, dev, "preprocessing", "output")
processing_job_name = params["project_name"].replace("_", "-") + "-preprocess-{}".format(strftime("%d-%H-%M-%S", gmtime()))
sklearn_processor.run(
    code=preprocessing_input_code,
    job_name=processing_job_name,
    inputs=[
        ProcessingInput(input_name="data",
                        source=preprocessing_input_data,
                        destination="/opt/ml/processing/input/data"),
        ProcessingInput(input_name="model",
                        source=preprocessing_input_model,
                        destination="/opt/ml/processing/input/model")
    ],
    outputs=[
        ProcessingOutput(output_name="output",
                         destination=preprocessing_output,
                         source="/opt/ml/processing/output")
    ],
    wait=False,
)
preprocessing_job_description = sklearn_processor.jobs[-1].describe()
And in funciones_altas.py I'm using ohe_altas.tar.gz, not model_lgb.tar.gz, which makes this error super weird.
Can you help me?
Looks like you are using the SageMaker-generated execution role and the error is related to S3 permissions.
Here are a couple of things you can do:
Make sure the policies on the role grant access to your bucket.
Check whether the objects in your bucket are encrypted; if so, also attach a KMS policy to the role you are linking to the job. https://aws.amazon.com/premiumsupport/knowledge-center/s3-403-forbidden-error/
You can always create your own role as well and pass its ARN to the code that runs the processing job, as in the sketch below.
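A minimal sketch of that last option, with a hypothetical role ARN that has the required S3 (and, if applicable, KMS) permissions:
custom_role_arn = "arn:aws:iam::111122223333:role/MySageMakerProcessingRole"  # hypothetical ARN

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=custom_role_arn,  # custom role instead of the SageMaker-generated execution role
    instance_type="ml.m5.4xlarge",
    instance_count=1,
)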

Loading data from S3 bucket to SageMaker Jupyter Notebook - ValueError - Invalid bucket name

Following the answers to this question, Load S3 Data into AWS SageMaker Notebook, I tried to load data from an S3 bucket into a SageMaker Jupyter Notebook.
I used this code:
import pandas as pd
bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
I replaced 'my-bucket' with the ARN (Amazon Resource Name) of my S3 bucket (e.g. "arn:aws:s3:::name-of-bucket") and replaced 'train.csv' with the name of the CSV file stored in the S3 bucket. Other than that I did not change anything. What I got was this ValueError:
ValueError: Failed to head path 'arn:aws:s3:::name-of-bucket/name_of_file_V1.csv': Parameter validation failed:
Invalid bucket name "arn:aws:s3:::name-of-bucket": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
What did I do wrong? Do I have to modify the name of my S3 bucket?
The path should be:
data_location = 's3://{}/{}'.format(bucket, data_key)
where bucket is the plain bucket name, not the ARN. For example, bucket = 'my-bucket-333222'.
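A minimal sketch of the corrected snippet, reusing the bucket and file names from the error message:
import pandas as pd

bucket = 'name-of-bucket'  # plain bucket name, not 'arn:aws:s3:::name-of-bucket'
data_key = 'name_of_file_V1.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location)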

How to get URI of a blob in a google cloud storage (Python)

If I have a Blob object, how can I get its URI (gs://...)?
The documentation says I can use the self_link property to get the URI, but it returns the HTTPS URL instead (https://googleapis.com...).
I am using the Python client library for Cloud Storage.
Thank you
Since you are not sharing exactly how you are trying to achieve this, I did a quick script in Python to get this info.
There is no specific method on Blob to get the URI as gs:// in Python, but you can try to script this by using the path_helper:
import pprint
from google.cloud import storage

def get_blob_URI():
    """Prints the gs:// URI of a blob."""
    # bucket_name = 'your-bucket-name'
    storage_client = storage.Client()
    bucket_name = 'YOUR_BUCKET'
    blob_name = 'YOUR_OBJECT_NAME'
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)
    link = blob.path_helper(bucket_name, blob_name)
    pprint.pprint('gs://' + link)
If you want to use the gsutil tool, you can also get all the gs:// URIs of a bucket with the command gsutil ls gs://bucket.
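An alternative sketch that simply assembles the URI from the blob's bucket name and object name (the bucket and object names below are placeholders):
from google.cloud import storage

storage_client = storage.Client()
blob = storage_client.bucket('YOUR_BUCKET').blob('YOUR_OBJECT_NAME')
gs_uri = 'gs://{}/{}'.format(blob.bucket.name, blob.name)
print(gs_uri)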