SageMaker KMeans Built-In - List of CSV files as input

I want to use the SageMaker KMeans built-in algorithm in one of my applications. I have a large CSV file in S3 (raw data) that I split into several parts to make it easier to clean. Before cleaning, I tried to use these parts as the input of KMeans to run a training job, but it doesn't work.
My manifest file:
[
{"prefix": "s3://<BUCKET_NAME>/kmeans_data/KMeans-2019-28-07-13-40-00-001/"},
"file1.csv",
"file2.csv"
]
The error I've got:
Failure reason: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError)
Caused by: [16:47:31] /opt/brazil-pkg-cache/packages/AIAlgorithmsCppLibs/AIAlgorithmsCppLibs-2.0.1620.0/AL2012/generic-flavor/src/src/aialgs/io/iterator_base.cpp:100: (Input Error) The header of the MXNet RecordIO record at position 0 in the dataset does not start with a valid magic number.
Stack trace returned 10 entries:
[bt] (0) /opt/amazon/lib/libaialgs.so(+0xb1f0) [0x7fb5674c31f0]
[bt] (1) /opt/amazon/lib/libaialgs.so(+0xb54a) [0x7fb5674c354a]
[bt] (2) /opt/amazon/lib/libaialgs.so(aialgs::iterator_base::Next()+0x4a6) [0x7fb5674cc436]
[bt] (3) /opt/amazon/lib/libmxnet.so(MXDataIterNext+0x21) [0x7fb54ecbcdb1]
[bt] (4) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7fb567a1e858]
[bt] (5) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call+0x15f) [0x7fb567a1d95f
My question is: is it possible to use multiple CSV files as input to the SageMaker KMeans built-in algorithm, or is that only possible through the GUI? If it is possible, how should I format my manifest?

The manifest looks fine, but based on the error message, it looks like you haven't set the right data format for your S3 data. It's expecting protobuf, which is the default format :)
You have to set the CSV data format explicitly. See https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.s3_input.
It should look something like this:
s3_input_train = sagemaker.s3_input(
    s3_data='s3://{}/{}/train/manifest_file'.format(bucket, prefix),
    s3_data_type='ManifestFile',
    content_type='csv')
...
kmeans_estimator = sagemaker.estimator.Estimator(kmeans_image, ...)
kmeans_estimator.set_hyperparameters(...)
s3_data = {'train': s3_input_train}
kmeans_estimator.fit(s3_data)
Please note the KMeans estimator in the SDK only supports protobuf, see https://sagemaker.readthedocs.io/en/stable/kmeans.html
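If you would rather use that dedicated KMeans estimator class, here is a minimal sketch of the protobuf route, assuming your cleaned data fits in memory as a NumPy array; the role ARN, instance type, k, and output bucket below are placeholders, not values from your setup:
import numpy as np
import sagemaker
from sagemaker import KMeans

# Placeholders -- substitute your own role ARN, instance type, k and bucket.
role = 'arn:aws:iam::123456789012:role/MySageMakerRole'
session = sagemaker.Session()

# Load one cleaned CSV part into a float32 array (assumes no header row).
train_data = np.loadtxt('file1.csv', delimiter=',', dtype='float32')

kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.m4.xlarge',
                k=10,
                output_path='s3://<BUCKET_NAME>/kmeans_output/',
                sagemaker_session=session)

# record_set() converts the array to RecordIO-protobuf, uploads it to S3,
# and returns a RecordSet that fit() accepts directly (no manifest needed).
kmeans.fit(kmeans.record_set(train_data))
If the data is too large to load into memory, stick with the ManifestFile/CSV channel and the generic Estimator shown above.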

Related

Batch Transform output not included in Amazon SageMaker Model Monitor job

I am trying to do data quality monitoring for a batch transform job on a very simple .csv dataset with five input columns plus one target variable. Both the data capture configuration and the monitoring job completed without error. However, the monitoring job reported a violation on the number of data columns (five instead of six; the baseline is a dataset similar to the input). Any clues about that?
I successfully enabled data capture for the batch transform job using this sample code:
transformer = Transformer(
    model_name=model_name,
    instance_count=1,
    instance_type="ml.m5.large",
    accept="text/csv",
    output_path=output_path,
    sagemaker_session=pipeline_session
)
transformer.transform(
    data=input_path,
    data_type="S3Prefix",
    content_type="text/csv",
    batch_data_capture_config=BatchDataCaptureConfig(
        destination_s3_uri=f"s3://{default_bucket}/{s3_prefix_destination}",
        generate_inference_id=True,
    ),
    split_type="Line"
)
It correctly produces two folders, input/ and output/, inside my S3 destination path. For the data quality monitoring job I used this:
data_quality_model_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.t3.medium",
    volume_size_in_gb=50,
    max_runtime_in_seconds=3600,
)
schedule = data_quality_model_monitor.create_monitoring_schedule(
    monitor_schedule_name="schedule",
    batch_transform_input=BatchTransformInput(
        data_captured_destination_s3_uri=f"s3://{default_bucket}/{s3_prefix_destination}",
        destination="/opt/ml/processing/input",
        dataset_format=MonitoringDatasetFormat.csv(header=False),
    ),
    output_s3_uri=f"s3://{default_bucket}/{s3_prefix_output}",
    statistics=f"s3://{default_bucket}/{s3_prefix_statistics}",
    constraints=f"s3://{default_bucket}/{s3_prefix_constraints}",
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
It reports a violation on the number of data columns (five instead of six). It seems that the monitoring job analyzes only the manifest file contained in the input/ folder and misses the files inside output/. There is probably an option to include output/ as well, but I did not find it.

Process File in Chunks in Apache Beam and Memory error

We have multiple files in a Cloud Storage location. We read each file from that location and, based on the file type, process it and write it to an output topic; the choice of output topic depends on the file type. Here is the code:
PCollection<FileIO.ReadableFile> data = pipeline
    .apply(FileIO.match().filepattern(options.getReadDir())
        .continuously(Duration.standardSeconds(10), Watch.Growth.never()))
    .apply(FileIO.readMatches().withCompression(Compression.AUTO));

PCollectionTuple outputData = data.apply(
    ParDo.of(new Transformer(tupleTagsMap, options.getBlockSize()))
        .withOutputTags(TupleTags.customerTag, TupleTagList.of(tupleTagList)));

outputData.get(TupleTags.topicA)
    .apply("Write to customer topic",
        PubsubIO.writeStrings().to(options.topicA));

outputData.get(TupleTags.topicB)
    .apply("Write to transaction topic",
        PubsubIO.writeStrings().to(options.topicB));
processContext.output(TupleTagsMap().get(importContext.getFileType().toString()), jsonBlock);
This line of code is inside the Transformer.
The issue is that one of the files is very large, containing about 100 million records.
We add to processContext.output above in chunks (jsonBlock is one such chunk), but the data is only written to the output topic once the whole file has been processed.
Because nothing is emitted to the output topic until then, we run into a memory error while processing the large file.
How can we solve this issue?

OutOfMemory error when writing to s3a through EMR

I'm getting an OutOfMemory error for the following PySpark code. (It fails after a certain number of rows are written. This does not happen if I write to the Hadoop filesystem instead of using s3a, so I think I've narrowed the problem down to s3a.) The end goal is to write to s3a.
I was wondering whether there is an optimal s3a configuration with which I will not run out of memory for extremely large tables.
df = spark.sql("SELECT * FROM my_big_table")
df.repartition(1).write.option("header", "true").csv("s3a://mycsvlocation/folder/")
My s3a configuration (EMR defaults):
('fs.s3a.attempts.maximum', '10')
('fs.s3a.buffer.dir', '${hadoop.tmp.dir}/s3a')
('fs.s3a.connection.establish.timeout', '5000')
('fs.s3a.connection.maximum', '15')
('fs.s3a.connection.ssl.enabled', 'true')
('fs.s3a.connection.timeout', '50000')
('fs.s3a.fast.buffer.size', '1048576')
('fs.s3a.fast.upload', 'true')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
('fs.s3a.max.total.tasks', '1000')
('fs.s3a.multipart.purge', 'false')
('fs.s3a.multipart.purge.age', '86400')
('fs.s3a.multipart.size', '104857600')
('fs.s3a.multipart.threshold', '2147483647')
('fs.s3a.paging.maximum', '5000')
('fs.s3a.threads.core', '15')
('fs.s3a.threads.keepalivetime', '60')
('fs.s3a.threads.max', '256')
('mapreduce.fileoutputcommitter.algorithm.version', '2')
('spark.authenticate', 'true')
('spark.network.crypto.enabled', 'true')
('spark.network.crypto.saslFallback', 'true')
('spark.speculation', 'false')
base of the stack trace:
Caused by: java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.hadoop.fs.s3a.S3AFastOutputStream.write(S3AFastOutputStream.java:194)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:60)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
at com.univocity.parsers.common.input.WriterCharAppender.writeCharsAndReset(WriterCharAppender.java:152)
at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:808)
... 16 more
The problem here is that the default s3a upload does not support uploading a single large file bigger than 2 GB, i.e. 2147483647 bytes.
('fs.s3a.multipart.threshold', '2147483647')
My EMR version is older than the more recent ones, so the multipart.threshold parameter is only an int, which caps a single "part" or file at 2147483647 bytes. More recent versions use a long instead of an int and can support a larger single-file size limit.
I'll be using a workaround: writing the file to local HDFS first and then moving it to S3 via a separate Java program.
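For reference, here is a rough sketch of that kind of workaround in PySpark, using s3-dist-cp (which ships with EMR) instead of a separate Java program; the HDFS staging path, table name, and destination bucket are placeholders:
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) Write the single CSV to HDFS first; HDFS has no 2 GB single-upload limit.
df = spark.sql("SELECT * FROM my_big_table")
df.repartition(1).write.option("header", "true").csv("hdfs:///tmp/my_big_table_csv/")

# 2) Copy from HDFS to S3. s3-dist-cp performs multipart uploads, so the
#    2147483647-byte threshold of the old s3a client does not apply.
subprocess.run(
    ["s3-dist-cp", "--src", "hdfs:///tmp/my_big_table_csv/",
     "--dest", "s3://mycsvlocation/folder/"],
    check=True,
)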

Fluentd S3 output plugin not recognizing index

I am facing problems while using the S3 output plugin with Fluentd.
s3_object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
Using %{index} at the end never resolves to _0, _1, etc. I always end up with log file names like
sflow_.log
while I need sflow_0.log.
Can you paste your fluent.conf? It's hard to find the problem without the full config. File creation is mainly controlled by the time slice setting and the buffer configuration:
<buffer>
  #type file or memory
  path /fluentd/buffer/s3
  timekey_wait 1m
  timekey 1m
  chunk_limit_size 64m
</buffer>
time_slice_format %Y%m%d%H%M
With the above, a new file is created every minute; if within that same minute the buffer limit is reached (or another chunk is flushed for any other reason), an additional file is created with index 1 under the same minute.

The flume event was truncated

I'm facing an issue: I receive messages from a Kafka source and wrote an interceptor to extract two fields (dataSoure and businessType) from the Kafka message (JSON format), using gson.fromJson(). But I get the error below.
I want to know whether Flume truncates the Flume event when it exceeds some limit. If yes, how can I raise that limit? My Kafka messages are always very long, about 60K bytes.
Looking forward to your reply. Thanks in advance!
2015-12-09 11:48:05,665 (PollableSourceRunner-KafkaSource-apply)
[ERROR - org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:153)]
KafkaSource EXCEPTION, {} com.google.gson.JsonSyntaxException:
com.google.gson.stream.MalformedJsonException: Unterminated string at line 1 column 4096
at com.google.gson.Gson.fromJson(Gson.java:809)
at com.google.gson.Gson.fromJson(Gson.java:761)
at com.google.gson.Gson.fromJson(Gson.java:710)
at com.xxx.flume.interceptor.JsonLogTypeInterceptor.intercept(JsonLogTypeInterceptor.java:43)
at com.xxx.flume.interceptor.JsonLogTypeInterceptor.intercept(JsonLogTypeInterceptor.java:61)
at org.apache.flume.interceptor.InterceptorChain.intercept(InterceptorChain.java:62)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:146)
at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:130)
Finally, I found the root cause by debugging the source code.
It is because I tried to convert event.getBody() directly to a map using Gson, which is incorrect: event.getBody() is a byte[], not a String, so it can't be parsed as-is. The correct code is as below:
String body = new String(event.getBody(), "UTF-8");
Map<String, Object> map = gson.fromJson(body, new TypeToken<Map<String, Object>>() {}.getType());