Error Tracking in Amazon SageMaker

I am trying to create a custom Image Classifier in Amazon SageMaker. It is giving me the following error:
"ClientError: Data download failed:NoSuchKey (404): The specified key does not exist."
I'm assuming this means one of the pictures in my .lst file is missing from the directory. Is there some way to find out which entry in the .lst file it is specifically having trouble with?

Upon further examination of the log files, it appears the issue does not lie with the .lst file itself, but with the image files it was referencing (which leaves me wondering why AWS doesn't just say that instead of saying the .lst file is corrupt). I'm going through the image files one by one to verify they are correct; hopefully that will solve the problem.
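One way to find the offending entry without checking images by hand: read the .lst file and issue a HEAD request against S3 for every image it references. A minimal sketch, assuming the standard image-classification .lst layout (tab-separated index, label, relative image path); the bucket, prefix, and file names below are placeholders, not values from the question:

```python
import boto3
from botocore.exceptions import ClientError

# Placeholders -- point these at your own bucket, training prefix, and .lst file.
BUCKET = "my-training-bucket"
PREFIX = "image-classification/train"  # prefix the .lst paths are relative to
LST_FILE = "train.lst"

s3 = boto3.client("s3")

missing = []
with open(LST_FILE) as f:
    for line in f:
        # .lst format: <index>\t<label>\t<relative image path>
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue
        key = f"{PREFIX}/{parts[-1]}"
        try:
            s3.head_object(Bucket=BUCKET, Key=key)
        except ClientError as e:
            if e.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
                missing.append(key)
            else:
                raise

print(f"{len(missing)} missing object(s)")
for key in missing:
    print(key)
```

Any key printed here is a candidate for the NoSuchKey failure during data download.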

Related

S3 trigger to perform a file conversion for a multi-part file type

I am working on converting shapefiles to geojson. Shapefiles are composed of at least 3 required files and as many as 8 separate files all residing in a folder. To convert to geojson you need all the constituent parts. Right now I have a batch conversion process that goes through all the shapefiles stored in an s3 bucket, downloads all the separate file parts and performs the conversion. What I'm trying to figure out now is how to run the file conversion process based on the upload of a single shapefile folder, hopefully using an s3 bucket trigger.
I have reviewed this answer (AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function) but in my case there is no frontend client (the answer presented in that question appears to be to signal a final event, but that is done from the client interface). Maybe I need to build one, but I was trying to handle this only in the backend (there is no frontend and no plans to have one). The 'user' would be dropping the files right into s3 directly without a file upload interface.
Of course, when someone uploads a folder with all the shapefile parts in it, the s3 trigger fires once for each part, but no single part can be converted on its own.
A few solutions I thought of, each with its own problems:
1. I am converting the shapefiles to geojson and storing the geojson in a separate s3 bucket, using a naming convention for the geojson based on the s3 file name. In theory you could always check whether the geojson already exists in the destination s3 bucket and, if not, run the conversion. But this still doesn't take care of the timing aspect of the multiple parts of the file being uploaded: the trigger would fire multiple times, fail on some invocations, and then ultimately (probably) succeed once all the parts are in place (see the sketch after this list).
1a. Maybe some type of try/except error checking on the conversion mentioned above? Meaning, for each file part uploaded, go ahead and try to download and convert. This seems fragile and error-prone. Also, I believe a certain subset of the files would likely produce a geojson without error but without all the metadata or the complete set of data, so a 'successful' conversion may not actually be a success.
2. Using a database to track which files have been converted, which would basically be the same solution as 1 above.
3. Partly a question as a solution: the s3 web console offers both 'file' upload and 'folder' upload, and to upload the shapefile folder containing all the component parts you'd have to use the 'folder' option. So, from the event-trigger perspective, is there any way to know that the operation was a folder upload rather than a file upload, and therefore to wait until all the parts of the folder are uploaded? Or is there any event data in AWS that counts the underlying file parts when a folder is uploaded (1 of 6, 2 of 6, etc.) and could send an event after all the parts of the folder have been uploaded?
I am also aware of 'multipart' upload, which would, I think, do what I proposed in #3 above, but that multipart 'tag' only applies if you upload via the SDK or CLI. Unless the s3 console folder upload is a multipart upload underneath?
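Not an answer to the folder-upload question, but a rough sketch of the "wait until all parts are present" idea from 1 above: let every per-part event fire and, inside the handler, list the keys under the shapefile's folder prefix, only running the conversion once all required extensions are there. The required-extension set, the prefix layout, and the convert_to_geojson stub are assumptions, not your actual conversion code:

```python
import os
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

# Assumption: each shapefile lives in its own folder (prefix), and these
# extensions must all be present before a conversion is worth attempting.
REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}


def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])

        if "/" not in key:
            continue  # expecting each shapefile in its own folder prefix
        prefix = key.rsplit("/", 1)[0] + "/"

        # List what is currently under that prefix (fine for small folders;
        # a paginator would be needed for more than 1000 objects).
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        present = {os.path.splitext(obj["Key"])[1].lower()
                   for obj in resp.get("Contents", [])}

        if REQUIRED_EXTENSIONS.issubset(present):
            convert_to_geojson(bucket, prefix)  # hypothetical: your existing conversion
        else:
            print(f"Still waiting on parts for {prefix}: have {present}")


def convert_to_geojson(bucket, prefix):
    # Stub standing in for the existing batch conversion logic.
    print(f"Converting s3://{bucket}/{prefix} to geojson")
```

This can still run the conversion more than once if optional parts arrive after the three mandatory ones, so combining it with the "does the geojson already exist" check from 1 (or some kind of conditional write) is probably still worthwhile.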

AWS Comprehend output

I am new to using AWS Comprehend to search for PII. I get the job to run against an S3 bucket but can't read the output. The output is in another bucket that I specified. All of the output files have .out as the extension. I was expecting output in report form, or at least the ability to open the output files and verify the PII. One example of the output is a file with the extension .png.out (the output for a .png input file).
I do not want to redact the PII at this point. I just want to identify it. Any help would be appreciated.
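Not a full answer, but the .out objects are ordinary files you can download and inspect. A small sketch that dumps them, assuming they contain JSON (which is what Comprehend's asynchronous detection jobs generally write); the bucket and prefix are placeholders:

```python
import json

import boto3

s3 = boto3.client("s3")

# Placeholders -- the output bucket/prefix you gave the Comprehend job.
BUCKET = "my-comprehend-output-bucket"
PREFIX = "pii-job-output/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".out"):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        print(f"--- {obj['Key']} ---")
        for line in body.decode("utf-8", errors="replace").splitlines():
            if not line.strip():
                continue
            try:
                # Expecting one JSON document per line describing detected entities.
                print(json.dumps(json.loads(line), indent=2))
            except json.JSONDecodeError:
                # If it isn't JSON after all, just show the raw text.
                print(line)
```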

GCP create function zip upload error without description

I'm trying to create a simple GCP Cloud Function via the GCP console, but the zip upload fails every time without a detailed reason:
the zip file contains the source files directly (not a folder that in turn contains the source files).
As a result, the function isn't being created. I've tried to search online but couldn't find an answer.
screenshot of the error message
Thanks.

How should I handle public (TCGA) data on AWS?

I'm new to AWS development and studying how to use TCGA data (https://registry.opendata.aws/tcga/) on my EC2 instance.
$ aws s3 ls s3://tcga-2-open/ gives me millions of files. However, the XML listing at http://tcga-2-open.s3.amazonaws.com shows me only 1000 entries. Is there a full list of files, hopefully describing their hierarchy, so that I can look up the coverage of this resource?
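On the 1000-entry limit: that is just the default page size of S3's ListObjects API, not the full contents of the bucket. A paginated listing will enumerate everything, though it can take a long time for millions of keys. A minimal sketch using anonymous, unsigned requests (the bucket is public); the output file name is arbitrary:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client -- the tcga-2-open bucket is publicly readable.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

paginator = s3.get_paginator("list_objects_v2")
count = 0
with open("tcga-2-open_keys.txt", "w") as out:
    # Each response page holds up to 1000 keys; the paginator follows the
    # continuation tokens until the whole bucket has been listed.
    for page in paginator.paginate(Bucket="tcga-2-open"):
        for obj in page.get("Contents", []):
            out.write(obj["Key"] + "\n")
            count += 1

print(f"Listed {count} objects")
```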
All TCGA files have a prefix, and the prefixes seem to be GDC UUIDs (https://docs.gdc.cancer.gov/Encyclopedia/pages/UUID/). However, some files have UUIDs that I cannot find in the original source. Downloading a 'manifest' from https://portal.gdc.cancer.gov/repository gives me a list of UUIDs and what they refer to, but some UUIDs appear only in the manifest file and others only in the XML listing. So how do I know what a file such as http://tcga-2-open.s3.amazonaws.com/00002fe8-ec8e-4e0e-a174-35f039c15d06/6057825020_R01C01_Grn.idat is, when I cannot find '00002fe8-ec8e-4e0e-a174-35f039c15d06' in the ground-truth manifest file?
Is there any step-by-step tutorial to using Open Data on AWS?
Any help to get me through this would be really appreciated.
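On resolving an unknown UUID: one thing worth trying is the GDC API's files endpoint, which can be queried with a file UUID directly. A rough sketch; the field list is only an example, and UUIDs from older or legacy datasets may simply not resolve against the current GDC release:

```python
import requests

# Example UUID taken from the question.
file_uuid = "00002fe8-ec8e-4e0e-a174-35f039c15d06"

url = f"https://api.gdc.cancer.gov/files/{file_uuid}"
# The field list here is just an example of what can be requested.
params = {"fields": "file_name,data_category,data_type,cases.project.project_id"}

resp = requests.get(url, params=params)
if resp.ok:
    print(resp.json()["data"])
else:
    # A 404 suggests the UUID is not in the current GDC release
    # (it may, for example, belong to the legacy archive).
    print(f"Lookup failed with HTTP {resp.status_code}")
```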

AWS Rekognition Custom Labels Training "The manifest file contains too many invalid data objects" error

I'm trying to do a quick PoC on the AWS Rekognition custom labels feature. I'd like to try using it for object detection.
I've had a couple of attempts at setting it up using only tools in the AWS Console. I'm using images imported from the Rekognition bucket in S3, and I added bounding boxes using the tools in the Rekognition console.
All of my images are marked up with bounding boxes, no whole image labels have been used. I have 9 labels, all of which appear in at least 1 drawing.
I've ensured my images are less than 4096x4096 in size (which is mentioned on this AWS forums thread as a possible cause of this issue).
When I attempt to train my model I get the "The manifest file contains too many invalid data objects" error.
What could be wrong here? An error message complaining about the format of a file that I didn't create manually and can't see or edit isn't exactly intuitive.
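One way to dig into the error: the dataset Rekognition Custom Labels trains from is a SageMaker Ground Truth-style manifest (JSON Lines) stored in S3, and you can download it and sanity-check each line yourself. A rough sketch; the bucket and key are placeholders for wherever your project's manifest actually lives, and the field checks are only a guess at common failure modes (lines that aren't valid JSON, missing source-ref, entries without bounding-box annotations):

```python
import json

import boto3

s3 = boto3.client("s3")

# Placeholders -- wherever your project's dataset manifest actually lives.
BUCKET = "custom-labels-console-us-east-1-xxxxxxxxxx"
KEY = "datasets/my-dataset/manifests/output/output.manifest"

body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")

suspicious = []
for i, line in enumerate(body.splitlines(), start=1):
    if not line.strip():
        continue
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        suspicious.append((i, "not valid JSON"))
        continue
    # Ground Truth-style manifests keep the image location under "source-ref".
    if "source-ref" not in entry:
        suspicious.append((i, "missing source-ref"))
        continue
    # Object-detection entries should have at least one label attribute
    # holding an "annotations" list (the bounding boxes).
    has_boxes = any(isinstance(v, dict) and "annotations" in v
                    for k, v in entry.items() if not k.endswith("-metadata"))
    if not has_boxes:
        suspicious.append((i, "no bounding-box annotations found"))

print(f"{len(suspicious)} suspicious line(s)")
for line_no, reason in suspicious:
    print(f"line {line_no}: {reason}")
```

Lines flagged by a check like this are the ones most likely being counted as "invalid data objects" by the training job.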