SageMaker RCF Data - amazon-web-services

I have a DynamoDB table filled with nice data. I use Datapipeline to extract this to S3 and it generates a folder with 3 files.
1) "139xx-x911-407x-83xx-06x5x659xx16" that contains all DB data in this format:
{"TimeStamp":{"s":"1539699960"},"SystemID":{"n":"1001"},"AccMin":{"n":"497"},"AccMax":{"n":"509"},"CustomerID":{"n":"10001"},"SensorID":{"n":"101"}}
2) "manifest"
{"name":"DynamoDB-export","version":3,
entries: [
{"url":"s3://cxxxx/2018-10-18-15-25-02/139xx-x911-407x-83xx-06x5x659xx16","mandatory":true}
]}
3) "_SUCCESS" No data inside.
I then go to SageMaker -> Training Jobs -> Create Training Job. Here I fill in everything to create a Random Cut Forest model and point it towards the above data (I have tried both the manifest file and the bigger data file).
The training fails with the error:
"ClientError: No data was found. Please make sure training data is provided."
What am I doing wrong?

Thank you for your interest in SageMaker.
The manifest is optional, but if provided it should conform to the schema described at https://docs.aws.amazon.com/sagemaker/latest/dg/API_S3DataSource.html . Also, RandomCutForest does not support input data in JSON format. Only protobuf and CSV are supported, see https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html
In order to get training working, you have to convert the input data to CSV or protobuf format and set the content_type value appropriately. If you want to use a manifest file, then the S3 location should point to that file and its contents have to be fixed to conform to the schema. You can, however, remove the manifest and point the S3 location to s3://bucket/path/to/data/.
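For reference, here is a minimal sketch (not part of the original answer) of one way to convert the exported JSON lines above into a headerless CSV that RandomCutForest accepts; the choice of feature columns is an assumption for illustration:

import csv
import json

# Hypothetical: the numeric attributes you want RCF to train on
FEATURES = ["AccMin", "AccMax"]

with open("139xx-x911-407x-83xx-06x5x659xx16") as src, \
        open("train.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        record = json.loads(line)
        # The DynamoDB export wraps each numeric attribute as {"n": "<value>"}
        writer.writerow([record[name]["n"] for name in FEATURES])

You would then upload train.csv to S3, point the training job's S3 location at that file (or its prefix), and set the channel's content type to text/csv.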
I hope this helps.
Regards,
Yury

Related

AWS Glue Crawler Unable to Classify CSV files

I'm unable to get either the default crawler classifier or a custom classifier to work against many of my CSV files. The classification is listed as 'UNKNOWN'. I've tried re-running existing classifiers, as well as creating new ones. Is anyone aware of a specific configuration for a custom classifier for CSV files that works for files of any size?
I'm also unable to find any errors specific to this issue in the logs.
Although I have seen reference to issues for JSON files over 1MB in size, I can't find anything detailing this same issue for CSV files, nor a solution to the problem.
AWS crawler could not classify the file type stores in S3 if its size >1MB
AWS Glue Crawler Classifies json file as UNKNOWN
Default CSV classifiers supported by Glue Crawler:
CSV - Checks for the following delimiters: comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Ctrl-A is the Unicode control character for Start Of Heading.
If you have any other delimiter, it will not work with the default CSV classifier. In that case you will have to write a grok pattern.
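If you do need a custom classifier, here is a hedged sketch (not from the answer above) of registering a Grok classifier with boto3 for a caret-delimited file; the classifier name, classification string, and pattern are illustrative assumptions:

import boto3

glue = boto3.client("glue")

# Hypothetical classifier for rows like: value1^value2^value3
glue.create_classifier(
    GrokClassifier={
        "Name": "caret-delimited-rows",
        "Classification": "caret_csv",
        "GrokPattern": "%{DATA:col1}\\^%{DATA:col2}\\^%{GREEDYDATA:col3}",
    }
)

Attach the classifier to the crawler and re-run it so the custom pattern is tried before the built-in classifiers.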

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case that's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog.
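As a rough sketch (assuming a PySpark Glue job; getSourceWithFormat is the Scala equivalent), reading all the JSON files recursively and writing them back out could look like this, with the bucket paths taken from the question and the output location being an assumption:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read every sales.json under the bucket, recursing through the date folders
sales = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
    format="json",
)

# Write the combined data somewhere else (hypothetical output prefix)
glue_context.write_dynamic_frame_from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/sales/"},
    format="parquet",
)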
Yes, Glue is a great tool for this!
Use a crawler to create a table in the Glue Data Catalog (remember to set "Create a single schema for each S3 path" under "Grouping behavior for S3 data" when creating the crawler).
Read more about it here
Then you can use relationalize to flatten out your JSON structure; read more about that here.
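A minimal sketch of the relationalize step (assuming the crawler created a table named sales in a database named my_db; both names and the staging path are placeholders):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the table the crawler created in the Glue Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="sales"
)

# Flatten the nested JSON; nested arrays become extra frames in the collection
flattened = Relationalize.apply(
    frame=dyf, staging_path="s3://my-bucket/tmp/", name="root"
)
root = flattened.select("root")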
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for jsonl, which is the actual format that AWS expects, making this less of an automatic win for csv. I would say if your data is already in json format, it's probably smarter to convert it to jsonl than to convert to csv.
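If your files currently hold a single JSON array, a quick sketch of the conversion to one-object-per-line (file names are placeholders):

import json

# Read a file containing one big JSON array and rewrite it as JSON Lines
with open("sales.json") as src:
    records = json.load(src)

with open("sales.jsonl", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")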

"Invalid arguments of request" when clicking "Start training" i

I have uploaded 22 images and tagged with 2 tags.
But when I click "Start training" I get an "Invalid arguments of request" error.
The images have been uploaded through the interface. I had to manually create the bucket.
What can have gone wrong? I have attached a screenshot.
This is likely the same error as in Google AutoML training error
Each image is assigned to one of the TRAIN, VALIDATION, and TEST sets. You have enough labeled images, but not enough images assigned to VALIDATION or TEST. Adding images to these two sets should solve this issue.
The best way of adding images to specific sets is to import a CSV file.
According to Preparing your training data:
If you use a CSV file to import data, you need to assign each row to a specific data set. (You need to upload your files to GCS first.)
Ex.:
TRAIN, gs://{YOUR_BUCKET_NAME}/{OBJECT_NAME_1}, {LABEL}
VALIDATION, gs://{YOUR_BUCKET_NAME}/{OBJECT_NAME_2}, {LABEL}
TEST, gs://{YOUR_BUCKET_NAME}/{OBJECT_NAME_3}, {LABEL}
Or, if you do not assign a data set, AutoML automatically places the image in one of the three sets:
gs://{YOUR_BUCKET_NAME}/{OBJECT_NAME_1}, {LABEL}
gs://{YOUR_BUCKET_NAME}/{OBJECT_NAME_2}, {LABEL}
gs://{YOUR_BUCKET_NAME}/{OBJECT_NAME_3}, {LABEL}

AWS Glue custom crawler based on file name

So what I am trying to do is to crawl data on S3 bucket with AWS Glue. Data stored as nested json and path looks like this:
s3://my-bucket/some_id/some_subfolder/datetime.json
When running the default crawler (no custom classifiers), it partitions the data based on the path and deserializes the JSON as expected; however, I would also like to get the timestamp from the file name as a separate field. For now the crawler omits it.
For example if I run crawler on:
s3://my-bucket/10001/fromage/2017-10-10.json
I get table schema like this:
Partition 1: 10001
Partition 2: fromage
Array: JSON data
I did try to add custom classifier based on Grok pattern:
%{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp}
However, whenever I re-run the crawler it skips the custom classifier and uses the default JSON one. As a solution I could obviously append the file name to the JSON itself before running the crawler, but I was wondering if I can avoid this step?
Classifiers only analyze the data within the file, not the filename itself. What you want to do is not possible today. If you can change the path where the files land, you could add the date as another partition:
s3://my-bucket/id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json
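For what it's worth, a hedged sketch (not part of the original answer) of copying the existing objects into that key=value layout with boto3; the bucket name and path parsing follow the example above:

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        key = obj["Key"]                      # e.g. 10001/fromage/2017-10-10.json
        parts = key.split("/")
        if len(parts) != 3 or not key.endswith(".json"):
            continue
        record_id, source, filename = parts
        timestamp = filename[:-len(".json")]
        new_key = (f"id={record_id}/source={source}/"
                   f"timestamp={timestamp}/data-file-{timestamp}.json")
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})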

AWS Data Pipeline, Best way to Structure Data in S3 for DynamoDB Mass Import?

I'm looking at migrating a massive database to Amazon's DynamoDB (think 150 million plus records).
I'm currently storing these records in Elasticsearch.
I'm reading up on Data Pipeline and you can import into DynamoDB from S3 using a TSV, CSV or JSON file.
It seems the best way to go is a JSON file and I've found two examples of how it should be structured:
From AWS:
{"Name"ETX {"S":"Amazon DynamoDB"}STX"Category"ETX {"S":"Amazon Web Services"}}
{"Name"ETX {"S":"Amazon push"}STX"Category"ETX {"S":"Amazon Web Services"}}
{"Name"ETX {"S":"Amazon S3"}STX"Category"ETX {"S":"Amazon Web Services"}}
From Calorious' Blog:
{"Name": {"S":"Amazon DynamoDB"},"Category": {"S":"Amazon Web Services"}}
{"Name": {"S":"Amazon push"},"Category": {"S":"Amazon Web Services"}}
{"Name": {"S":"Amazon S3"},"Category": {"S":"Amazon Web Services"}}
So, my questions are the following:
Do I have to put a literal 'START of LINE (STX)'?
How reliable is this method? Should I be concerned about failed uploads? There doesn't seem to be a way to do error handling, so do I just assume that AWS got it right?
Is there an ideal size of file? For example should I break up the database into say 100K chunks of records and store each 100k chunk in one file?
I want to get this right the first time and not incur extra charges, as apparently you get charged whether your setup is right or wrong.
Any specific parts/links to the manual that I missed would also be greatly appreciated.
I am doing this exact thing right now. In fact, I extracted 340 million rows using Data Pipeline, transformed them using Lambda, and am importing them right now using the pipeline.
A couple of things:
1) JSON is a good way to go.
2) On the export, AWS limits each file to 100,000 records. Not sure if this is required or just a design decision.
3) In order to use the pipeline for import, there is a requirement to have a manifest file. This was news to me. I had an example from the export, which you won't have. Without it your import probably won't work. Its structure is:
{"name":"DynamoDB-export","version":3,
"entries": [
{"url":"s3://[BUCKET_NAME]/2019-03-06-20-17-23/dd3906a0-a548-453f-96d7-ee492e396100-transformed","mandatory":true},
...
]}
4) Calorious' Blog has the format correct. I am not sure if the "S" needs to be lower case - mine all are. Here is an example row from my import file:
{"x_rotationRate":{"s":"-7.05723"},"x_acceleration":{"s":"-0.40001"},"altitude":{"s":"0.5900"},"z_rotationRate":{"s":"1.66556"},"time_stamp":{"n":"1532710597553"},"z_acceleration":{"s":"0.42711"},"y_rotationRate":{"s":"-0.58688"},"latitude":{"s":"37.3782895682606"},"x_quaternion":{"s":"-0.58124"},"x_user_accel":{"s":"0.23021"},"pressure":{"s":"101.0524"},"z_user_accel":{"s":"0.02382"},"cons_key":{"s":"index"},"z_quaternion":{"s":"-0.48528"},"heading_angle":{"s":"-1.000"},"y_user_accel":{"s":"-0.14591"},"w_quaternion":{"s":"0.65133"},"y_quaternion":{"s":"-0.04934"},"rotation_angle":{"s":"221.53970"},"longitude":{"s":"-122.080872377186"}}
I would enter a few rows manually and export them using Data Pipeline to see the exact format it generates, which will then be the same format you need to follow for imports (I think it's the first format in your examples).
Then I would set up a file with a few rows (maybe 100) and run Data Pipeline to ensure it works fine.
Breaking your file into chunks sounds good to me, and it might help you recover from a failure without having to start all over again.
Make sure you don't have keys with empty, null or undefined values. That will break and stop the import completely. When you're exporting entries from your current database you can omit keys with no values or set a default non-empty value for them.
Based on my experience, I recommend JSON as the most reliable format, assuming, of course, that the JSON blobs you generate are properly formatted JSON objects (i.e. proper escaping).
If you can generate valid JSON then go that route!