How to structure input formats for batch inference in SageMaker? - amazon-web-services

The example provided in the AWS documentation, https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html, shows that the input CSV can be structured like the sample below. I noticed that batch jobs in SageMaker can accept JSON as well. How should the JSON be structured? Does each record need to be on a single line, as shown in the CSV example, or can it be multiline?
Record1-Attribute1, Record1-Attribute2, Record1-Attribute3, ..., Record1-AttributeM
...

It is recommended to make use of JSON Lines (i.e., each JSON object on a single line). You can then set BatchStrategy to MultiRecord and SplitType to Line.
Batch Transform will then fit as many records as possible into each mini-batch within the MaxPayloadInMB limit.
Kindly see the CreateTransformJob API for more information.
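For illustration, a minimal sketch of such a job with boto3; the job name, model name, S3 paths, and instance type below are placeholders:

import boto3

# input.jsonl on S3 holds one JSON object per line (JSON Lines), e.g.:
#   {"features": [1.0, 2.0, 3.0]}
#   {"features": [4.0, 5.0, 6.0]}
sm = boto3.client("sagemaker")
sm.create_transform_job(
    TransformJobName="my-batch-job",                 # placeholder
    ModelName="my-model",                            # placeholder
    BatchStrategy="MultiRecord",                     # pack several records per request
    MaxPayloadInMB=6,                                # upper bound for each mini-batch
    TransformInput={
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/input/"}},
        "ContentType": "application/jsonlines",
        "SplitType": "Line",                         # split the input on newlines
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/output/",
                     "AssembleWith": "Line"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)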

Related

Use a custom classifier in Glue for multi line records

I have some files in the following format:
AB1|STUFF|1234|
AB2|SF|STUFF|
AB1|STUFF|45670|
AB2|AF|STUFF|
Each bit of data is delimited by '|' and a record is made up of the data in lines AB1 and AB2.
I would like to use a custom grok classifier in Glue something like the following:
?<LINE1>(?:AB1)?|%{WORD:ignore1}|%{NUMBER:id}\|\n%{WORD:LINE2}|%{WORD:make}|%{WORD:stuff2}\|
That is a multi-line grok expression to extract the data from a multi-line record as shown above. I am unsure how the classifiers in Glue work; any comments or advice would be very helpful.
According to the Glue Documentation:
Grok patterns can process only one line at a time. Multiple-line patterns are not supported. Also, line breaks within a pattern are not supported.
I am not sure what the actual question is. If you need general guidance on how to create your own classifier, I would advise you to read this and this.
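One possible workaround, not covered by the documentation quote above, is to preprocess the files so that each AB1/AB2 pair becomes a single line before Glue sees them; then a one-line grok pattern (or the built-in CSV classifier) can parse the records. A rough sketch, assuming the pairs always appear consecutively and using placeholder file names:

# Join each consecutive AB1/AB2 pair into one pipe-delimited record.
def merge_records(lines):
    pending = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("AB1"):
            pending = line
        elif line.startswith("AB2") and pending is not None:
            yield pending + line    # e.g. "AB1|STUFF|1234|AB2|SF|STUFF|"
            pending = None

with open("input.txt") as src, open("merged.txt", "w") as dst:
    for record in merge_records(src):
        dst.write(record + "\n")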

Annotation specs - AutoML (GCP)

I'm using the Natural Language module on Google Cloud Platform, and more specifically AutoML for text classification.
I came across this error, which I do not understand, after I finished importing my data and the text had been processed:
Error: The dataset has too many annotation specs, the maximum allowed number is 5000.
What does it mean? Has anyone else run into it?
Thanks
Take a look at the AutoML Quotas & Limits documentation for better understanding.
It seems that you are hitting the upper limit of labels per dataset. Check it in the AutoML limits: Labels per dataset --> 2 - 5000 (for classification).
Take into account that limits, unlike quotas, cannot be increased.
I also got this error while I was certain that my number of labels was below 5000. It turned out to be an error with my CSV formatting.
When you create your text data using to_csv() in Pandas, it only quotes the fields that contain a comma, while AutoML Text wants all text fields to be quoted. I have written the solution in this Stack Overflow answer.
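For example, you can force pandas to quote every field; the file names below are placeholders for your own setup:

import csv
import pandas as pd

df = pd.read_csv("labeled_text.csv")                 # placeholder input file
# QUOTE_ALL wraps every field in quotes, not just the ones containing commas.
df.to_csv("automl_ready.csv", index=False, header=False, quoting=csv.QUOTE_ALL)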

AWS Athena - how to process huge results file

I am looking for a way to process a ~4 GB file which is the result of an Athena query, and I am trying to find out:
Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible from the Athena side. It also looks like it is not possible to split it with Lambda - the file is too large, and it looks like s3.open(input_file, 'r') does not work in Lambda :(
Is there some other AWS service that can solve this issue? I want to split this CSV file into small pieces (about 3 - 4 MB) to send them to an external source (POST requests).
You can use CTAS with Athena and take advantage of its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advised to store the newly created table in a compressed format such as Parquet; however, you can also define it to be CSV ('TEXTFILE').
Lastly, it is advised to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by data scanned. The meaningful partitioning is based on your use case and the way that you want to split your data. The most common way is to use time partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(to_utf8(some_column_in_your_data))), 1, 1) as partition_char
Or you can use bucketing and provide how many buckets you want:
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100)
AS SELECT ...
You won't be able to do this with Lambda, as its memory is maxed out at around 3 GB and its file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix-based OS)?
If this job is recurring, needs to be automated, and you still want to be "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation, you can request a range that is about what you think you can fit in memory, process it up to its last line break, and prepend the remaining partial line to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't accumulate a huge data structure, you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line, you can pad each chunk with that amount (both at the beginning and the end). When you read a chunk, you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding, with special handling of the first and last chunk, obviously.
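A minimal sketch of the single-invocation variant with boto3; the bucket, key, and chunk size are placeholders, and each complete line is left for whatever batching you need for the downstream POST requests:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "athena-results/result.csv"   # placeholders
chunk_size = 8 * 1024 * 1024                              # ~8 MB per range request

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
leftover = b""                                            # partial line carried between chunks
offset = 0
while offset < size:
    end = min(offset + chunk_size, size) - 1
    body = s3.get_object(Bucket=bucket, Key=key,
                         Range=f"bytes={offset}-{end}")["Body"].read()
    data = leftover + body
    cut = data.rfind(b"\n") + 1                           # keep only complete lines
    complete, leftover = data[:cut], data[cut:]
    for line in complete.splitlines():
        pass  # collect lines into ~3-4 MB batches and POST them
    offset = end + 1
# leftover now holds the final line if the file does not end with a newline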

Skip top N lines in snowflake load

My actual data in the CSV extracts starts from line 10. How can I skip the top few lines in a Snowflake load using COPY or any other utility? Do we have anything similar to SKIP_HEADER?
I have files on S3 and it is my stage. I will be creating a Snowpipe later on this data source.
Yes, there is a SKIP_HEADER option for CSV, allowing you to skip a specified number of rows when defining a file format. Please have a look here:
https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#type-csv
So you create a file format associated with the CSV files you have in mind and then use it when calling the COPY command.
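A sketch of what that could look like via the Python connector; the connection parameters, stage, table, and format names are placeholders, and the same statements also work in a worksheet:

import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(user="...", password="...", account="...",
                                   warehouse="...", database="...", schema="...")
cur = conn.cursor()
# SKIP_HEADER = 9 skips the first 9 lines, so loading starts at line 10.
cur.execute("""
    CREATE OR REPLACE FILE FORMAT my_csv_format
      TYPE = 'CSV'
      SKIP_HEADER = 9
""")
cur.execute("""
    COPY INTO my_table
    FROM @my_s3_stage
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
""")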

Blankspace and colon not found in firstline

I have a jupyter notebook in SageMaker in which I want to run the XGBoost algorithm. The data has to match 3 criteria:
- No header row
- Outcome variable in the first column, features in the rest of the columns
- All columns need to be numeric
The error I get is the following:
Error for Training job xgboost-2019-03-13-16-21-25-000:
Failed Reason: ClientError: Blankspace and colon not found in firstline
'0.0,0.0,99.0,314.07,1.0,0.0,0.0,0.0,0.48027846,0.0...' of file 'train.csv'
In the error itself it can be seen that there are no headers, the outcome variable is in the first column (it only takes 1.0 and 0.0 values), and all features are numerical. The data is stored in its own bucket.
I have seen a related question on GitHub but there is no solution there. Also, the example notebook that Amazon provides does not change the default separator or anything else when saving a dataframe to CSV for later use.
The error message indicates that XGBoost was expecting the input data set in libsvm format instead of CSV. SageMaker XGBoost assumes by default that the input data set is in libsvm format. To use an input data set in CSV, explicitly specify the content type as text/csv.
For more information: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost
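With the SageMaker Python SDK, that would look roughly like this (assuming SDK v2; the S3 paths are placeholders and xgb is an already-configured Estimator for the built-in XGBoost container):

from sagemaker.inputs import TrainingInput

# Tell SageMaker the channels contain CSV rather than the default libsvm.
train_input = TrainingInput("s3://my-bucket/train/train.csv", content_type="text/csv")
validation_input = TrainingInput("s3://my-bucket/validation/validation.csv", content_type="text/csv")

xgb.fit({"train": train_input, "validation": validation_input})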