Sagemaker Data Capture does not write files - amazon-web-services

I want to enable data capture for a specific endpoint (so far, only via the console). The endpoint works fine and also logs & returns the desired results. However, no files are written to the specified S3 location.
Endpoint Configuration
The endpoint is based on a training job with a scikit learn classifier. It has only one variant which is a ml.m4.xlarge instance type. Data Capture is enabled with a sampling percentage of 100%. As data capture storage locations I tried s3://<bucket-name> as well as s3://<bucket-name>/<some-other-path>. With the "Capture content type" I tried leaving everything blank, setting text/csv in "CSV/Text" and application/json in "JSON".
Endpoint Invokation
The endpoint is invoked in a Lambda function with a client. Here's the call:
sagemaker_body_source = {
"segments": segments,
"language": language
}
payload = json.dumps(sagemaker_body_source).encode()
response = self.client.invoke_endpoint(EndpointName=endpoint_name,
Body=payload,
ContentType='application/json',
Accept='application/json')
result = json.loads(response['Body'].read().decode())
return result["predictions"]
Internally, the endpoint uses a Flask API with an /invocation path that returns the result.
Logs
The endpoint itself works fine and the Flask API is logging input and output:
INFO:api:body: {'segments': [<strings...>], 'language': 'de'}
INFO:api:output: {'predictions': [{'text': 'some text', 'label': 'some_label'}, ....]}

Data capture can be enabled by using the SDK as shown below -
data_capture_config = DataCaptureConfig(
enable_capture=True, sampling_percentage=100, destination_s3_uri=s3_capture_upload_path
)
predictor = model.deploy(
initial_instance_count=1,
instance_type="ml.m4.xlarge",
endpoint_name=endpoint_name,
data_capture_config=data_capture_config,
)
Make sure to reference your data capture config in your endpoint creation step. I've always seen this method to work. Can you try this and let me know? Reference notebook
NOTE - I work for AWS SageMaker , but my opinions are my own.

So the issue seemed to be related to the IAM role. The default role (ModelEndpoint-Role) does not have access to write S3 files. It worked via the SDK since it uses another role in the sagemaker studio. I did not receive any error message about this.

Related

AWS Service Quota: How to get service quota for Amazon S3 using boto3

I get the error "An error occurred (NoSuchResourceException) when calling the GetServiceQuota operation:" while trying running the following boto3 python code to get the value of quota for "Buckets"
client_quota = boto3.client('service-quotas')
resp_s3 = client_quota.get_service_quota(ServiceCode='s3', QuotaCode='L-89BABEE8')
In the above code, QuotaCode "L-89BABEE8" is for "Buckets". I presumed the value of ServiceCode for Amazon S3 would be "s3" so I put it there but I guess that is wrong and throwing error. I tried finding the documentation around ServiceCode for S3 but could not find it. I even tried with "S3" (uppercase 'S' here), "Amazon S3" but that didn't work as well.
What I tried?
client_quota = boto3.client('service-quotas') resp_s3 = client_quota.get_service_quota(ServiceCode='s3', QuotaCode='L-89BABEE8')
What I expected?
Output in the below format for S3. Below example is for EC2 which is the output of resp_ec2 = client_quota.get_service_quota(ServiceCode='ec2', QuotaCode='L-6DA43717')
I just played around with this and I'm seeing the same thing you are, empty responses from any service quota list or get command for service s3. However s3 is definitely the correct service code, because you see that come back from the service quota list_services() call. Then I saw there are also list and get commands for AWS default service quotas, and when I tried that it came back with data. I'm not entirely sure, but based on the docs I think any quota that can't be adjusted, and possibly any quota your account hasn't requested an adjustment for, will probably come back with an empty response from get_service_quota() and you'll need to run get_aws_default_service_quota() instead.
So I believe what you need to do is probably run this first:
client_quota.get_service_quota(ServiceCode='s3', QuotaCode='L-89BABEE8')
And if that throws an exception, then run the following:
client_quota.get_aws_default_service_quota(ServiceCode='s3', QuotaCode='L-89BABEE8')

multi-part upload Google Storage

I'm trying to implement multi-part upload to Google Storage but to my surprise it does not seem to be straightforward (I could not find java example).
Only mention I found was in the XML API https://cloud.google.com/storage/docs/multipart-uploads
Also found some discussion around a compose API StorageExample.java#L446 mentioned here google-cloud-java issues 1440
Any recommendations how to do multipart upload?
I got the multi-part upload working with #Koblan suggestion. (for details check blog post)
This is how I create the S3 Client and point it to Google Storage
def createClient(accessKey: String, secretKey: String, region: String = "us"): AmazonS3 = {
val endpointConfig = new EndpointConfiguration("https://storage.googleapis.com", region)
val credentials = new BasicAWSCredentials(accessKey, secretKey)
val credentialsProvider = new AWSStaticCredentialsProvider(credentials)
val clientConfig = new ClientConfiguration()
clientConfig.setUseGzip(true)
clientConfig.setMaxConnections(200)
clientConfig.setMaxErrorRetry(1)
val clientBuilder = AmazonS3ClientBuilder.standard()
clientBuilder.setEndpointConfiguration(endpointConfig)
clientBuilder.withCredentials(credentialsProvider)
clientBuilder.withClientConfiguration(clientConfig)
clientBuilder.build()
}
Because I'm doing the upload from the frontend (after I generate signled URLs for each part using the AmazonS3 client) I needed to enable CORS.
For testing, I enabled everything for now
$ gsutil cors get gs://bucket
$ echo '[{"origin": ["*"],"responseHeader": ["Content-Type", "ETag"],"method": ["GET", "HEAD", "PUT", "DELETE", "PATCH"],"maxAgeSeconds": 3600}]' > cors-config.json
$ gsutil cors set cors-config.json gs://bucket
See https://cloud.google.com/storage/docs/configuring-cors#gsutil_1
Currently Java Client library for multi part upload in Cloud Storage is not available. You can raise a feature request for the same in this link. As mentioned by John Hanley, the next best thing you can do is, do a parallel composite upload with gsutil (CLI), JSON and XML support/ resumable upload with Java libraries.
In parallel compose, the parallel writes can be done by using the JSON or XML API for Google Cloud Storage. Specifically, you would write a number of smaller objects in parallel and then (once all of those objects have been written) call the Compose request to compose them into one larger object.
If you're using the JSON API the compose documentation is at : https://cloud.google.com/storage/docs/json_api/v1/objects/compose
If you're using the XML API the compose documentation is at : https://cloud.google.com/storage/docs/reference-methods#putobject (see the compose query parameter).
Also there is an interesting document link provided by Kolban which you can try and work out. Also I would like to mention that you can have multi part uploads in Java, if you use the Google Drive API(v3). Here is the code example where we use the files.create method with uploadType=multipart.

Get all items in DynamoDB with API Gateway's Mapping Template

Is there a simple way to retrieve all items from a DynamoDB table using a mapping template in an API Gateway endpoint? I usually use a lambda to process the data before returning it but this is such a simple task that a Lambda seems like an overkill.
I have a table that contains data with the following format:
roleAttributeName roleHierarchyLevel roleIsActive roleName
"admin" 99 true "Admin"
"director" 90 true "Director"
"areaManager" 80 false "Area Manager"
I'm happy with getting the data, doesn't matter the representation as I can later transform it further down in my code.
I've been looking around but all tutorials explain how to get specific bits of data through queries and params like roles/{roleAttributeName} but I just want to hit roles/ and get all items.
All you need to do is
create a resource (without curly braces since we dont need a particular item)
create a get method
use Scan instead of Query in Action while configuring the integration request.
Configurations as follows :
enter image description here
now try test...you should get the response.
to try it out on postman deploy the api first and then use the provided link into postman followed by your resource name.
API Gateway allows you to Proxy DynamoDB as a service. Here you have an interesting tutorial on how to do it (you can ignore the part related to index to make it work).
To retrieve all the items from a table, you can use Scan as the action in API Gateway. Keep in mind that DynamoDB limits the query sizes to 1MB either for Scan and Query actions.
You can also limit your own query before it is automatically done by using the Limit parameter.
AWS DynamoDB Scan Reference

Is there a way to pass credentials programmatically for using google documentAI without reading from a disk?

I am trying to run the demo code given in pdf parsing of GCP document AI. To run the code, exporting google credentials as a command line works fine. The problem comes when the code needs to run in memory and hence no credential files are allowed to be accessed from disk. Is there a way to pass the credentials in the document ai parsing function?
The sample code of google:
def main(project_id='YOUR_PROJECT_ID',
input_uri='gs://cloud-samples-data/documentai/invoice.pdf'):
"""Process a single document with the Document AI API, including
text extraction and entity extraction."""
client = documentai.DocumentUnderstandingServiceClient()
gcs_source = documentai.types.GcsSource(uri=input_uri)
# mime_type can be application/pdf, image/tiff,
# and image/gif, or application/json
input_config = documentai.types.InputConfig(
gcs_source=gcs_source, mime_type='application/pdf')
# Location can be 'us' or 'eu'
parent = 'projects/{}/locations/us'.format(project_id)
request = documentai.types.ProcessDocumentRequest(
parent=parent,
input_config=input_config)
document = client.process_document(request=request)
# All text extracted from the document
print('Document Text: {}'.format(document.text))
def _get_text(el):
"""Convert text offset indexes into text snippets.
"""
response = ''
# If a text segment spans several lines, it will
# be stored in different text segments.
for segment in el.text_anchor.text_segments:
start_index = segment.start_index
end_index = segment.end_index
response += document.text[start_index:end_index]
return response
for entity in document.entities:
print('Entity type: {}'.format(entity.type))
print('Text: {}'.format(_get_text(entity)))
print('Mention text: {}\n'.format(entity.mention_text))
When you run your workloads on GCP, you don't need to have a service account key file. You MUSTN'T!!
Why? 2 reasons:
It's useless because all GCP products have, at least, a default service account. And most of time, you can customize it. You can have a look on Cloud Function identity in your case.
Service account key file is a file. It means a lot: you can copy it, send it by email, commit it in Git repository... many persons can have access to it and you loose the management of this secret. And because it's a secret, you have to store it securely, you have to rotate it regularly (at least every 90 days, Google recommendation),... It's nighmare! When you can, don't use service account key file!
What the libraries are doing?
There are looking if GOOGLE_APPLICATION_CREDENTIALS env var exists.
There are looking into the "well know" location (when you perform a gcloud auth application-default login to allow the local application to use your credential to access to Google Resources, a file is created in a "standard location" on your computer)
If not, check if the metadata server exists (only on GCP). This server provides the authentication information to the libraries.
else raise an error.
So, simply use the correct service account in your function and provide it the correct role to achieve what you want to do.

Specify Maximum File Size while uploading a file in AWS S3

I am creating temporary credentials via AWS Security Token Service (AWS STS).
And Using these credentials to upload a file to S3 from S3 JAVA SDK.
I need some way to restrict the size of file upload.
I was trying to add policy(of s3:content-length-range) while creating a user, but that doesn't seem to work.
Is there any other way to specify the maximum file size which user can upload??
An alternative method would be to generate a pre-signed URL instead of temporary credentials. It will be good for one file with a name you specify. You can also force a content length range when you generate the URL. Your user will get URL and will have to use a specific method (POST/PUT/etc.) for the request. They set the content while you set everything else.
I'm not sure how to do that with Java (it doesn't seem to have support for conditions), but it's simple with Python and boto3:
import boto3
# Get the service client
s3 = boto3.client('s3')
# Make sure everything posted is publicly readable
fields = {"acl": "private"}
# Ensure that the ACL isn't changed and restrict the user to a length
# between 10 and 100.
conditions = [
{"acl": "private"},
["content-length-range", 10, 100]
]
# Generate the POST attributes
post = s3.generate_presigned_post(
Bucket='bucket-name',
Key='key-name',
Fields=fields,
Conditions=conditions
)
When testing this make sure every single header item matches or you'd get vague access denied errors. It can take a while to match it completely.
I believe there is no way to limit the object size before uploading, and reacting to that would be quite hard. A workaround would be to create an S3 event notification that triggers your code, through a Lambda funcation or SNS topic. That could validate or delete the object and notify the user for example.