How to download and process a large S3 file? - amazon-web-services

I have a large JSON file (anywhere from 100 MB to 3 GB) in S3. How can I process it?
Today I am using s3client.getObjectContent() to get the input stream and trying to process it as it streams.
I pass the input stream to Jackson's JsonParser, fetch each JSON object, and call another microservice to process the JSON object retrieved from the S3 input stream.
Problem:
As I process the JSON objects, the S3 stream gets closed before the entire payload has been processed.
I am getting this warning:
S3AbortableInputStream:Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection
I am looking for a way to handle a large S3 payload without the S3 client closing the stream before I have processed the entire payload. Any best practices or insights are appreciated.
Constraints: I need to process this as a stream or with a minimal memory footprint.

The abort() call is what closes the HTTP connection before the whole payload has been read. Can you please make the following change in your code and check?
FROM:
if (s3ObjectInputStream != null) {
    s3ObjectInputStream.abort();
}
TO:
if (s3ObjectInputStream != null) {
    // close() after reading to the end; only call abort() when you deliberately
    // want to discard the rest of the stream.
    s3ObjectInputStream.close();
}
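For reference, a minimal sketch of the streaming approach, assuming the AWS SDK for Java 1.x, the Jackson streaming API, and a file that is a top-level JSON array of objects; the class name and the downstream call are illustrative:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectInputStream;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class LargeS3JsonProcessor {

    private final ObjectMapper mapper = new ObjectMapper();

    public void process(AmazonS3 s3, String bucket, String key) throws Exception {
        S3Object object = s3.getObject(bucket, key);
        try (S3ObjectInputStream in = object.getObjectContent();
             JsonParser parser = new JsonFactory().createParser(in)) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("Expected a top-level JSON array");
            }
            // Read one object at a time so only a single element is in memory.
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                JsonNode element = mapper.readTree(parser);
                callOtherMicroservice(element); // hypothetical downstream call
            }
            // The stream has been read to the end here, so closing it (via
            // try-with-resources) will not trigger the abort warning.
        }
    }

    private void callOtherMicroservice(JsonNode element) {
        // placeholder for the call to the other microservice
    }
}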

Related

LAMBDA_RUNTIME Failed to post handler success response. Http response code: 413

I have a Node/Express + Serverless backend API which I deploy to a Lambda function.
When I call the API, the request goes through API Gateway to Lambda; the Lambda connects to S3, reads a large bin file, parses it, and generates the output as a JSON object.
The response JSON object is around 8.55 MB (I verified this using Postman, running the node/express code locally). The size can vary with the bin file size.
When I make an API request, it fails with the following message in CloudWatch:
LAMBDA_RUNTIME Failed to post handler success response. Http response code: 413
I can't/don't want to change this pipeline: HTTP API Gateway + Lambda + S3.
What should I do to resolve the issue?
AWS Lambda functions have hard limits on the sizes of the request and response payloads. These limits cannot be increased.
The limits are:
6 MB for synchronous invocations
256 KB for asynchronous invocations
You can find additional information in the official documentation here:
https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html
There are a couple of possible solutions:
use EC2 or ECS/Fargate instead
use the Lambda to parse and transform the bin file into the desired JSON, then save that JSON directly to a public S3 bucket. In the Lambda response you can return the public URL/URI/file name of the created JSON to the client (see the sketch after this answer).
For the last solution, if you don't want to make the JSON file visible to the whole world, you might consider using AWS Amplify in your client and/or AWS Cognito in order to give only an authorised user access to the file that they have just created.
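A minimal sketch of the second option, written in Java for illustration (the asker's stack is Node, but the pattern is the same); the bucket name and key layout are assumptions:

import java.util.UUID;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class LargeResponseStore {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Writes the generated JSON to S3 and returns its URL, so the Lambda
    // response stays far below the 6 MB synchronous payload limit.
    public String storeAndReturnUrl(String json) {
        String bucket = "my-results-bucket";                   // hypothetical bucket
        String key = "results/" + UUID.randomUUID() + ".json"; // hypothetical key layout
        s3.putObject(bucket, key, json);
        return s3.getUrl(bucket, key).toString();              // return this in the Lambda response
    }
}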
As noted in other questions, API Gateway/Lambda has limits on response sizes. From the discussion I read that latency is an additional concern.
With these two requirements Lambda is mostly out of the question, as Lambdas need some time to start up (which can be lowered with provisioned concurrency) and only have normal network connections (whereas EC2/EKS can use enhanced networking).
With these requirements it would be better (from an AWS point of view) to move away from Lambda.
Looking further we could also question the application itself:
Large JSON objects need to be generated on demand. Why can't these be pre-generated asynchronously and then downloaded from S3 directly? That would give you the best latency and speed, and can be coupled with CloudFront.
Why does the JSON need to be so large? Large JSON payloads also need to be parsed on the client side, requiring more CPU. Maybe it can be split and/or compressed?

How to stream microphone audio from the browser to S3

I want to stream the microphone audio from the web browser to AWS S3.
I got it working with:
this.recorder = new window.MediaRecorder(...);
this.recorder.addEventListener('dataavailable', (e) => {
    this.chunks.push(e.data);
});
and then, when the user clicks stop, uploading the chunks (new Blob(this.chunks, { type: 'audio/wav' })) to AWS S3 as a multipart upload.
But the problem is that if the recording is 2-3 hours long, the upload might take exceptionally long, and the user might close the browser before the recording finishes uploading.
Is there a way we can stream the web audio directly to S3 while it's going on?
Things I tried but couldn't get a working example for:
Kinesis Video Streams: it looks like it's only for real-time streaming between multiple clients, and I would have to write my own client to then save it to S3.
I thought about using Kinesis Data Firehose, but couldn't find any client data producer for the browser.
I even tried to find resources on AWS Lex or AWS IVS, but I think they are over-engineering for my use case.
Any help will be appreciated.
You can set the timeslice parameter (in milliseconds) when calling start() on the MediaRecorder, e.g. recorder.start(1000). The MediaRecorder will then emit chunks whose length roughly matches the timeslice parameter.
You could upload those chunks using S3's multipart upload feature as you already mentioned.
Please note that you need a library like extendable-media-recorder if you want to record a WAV file since no browser supports that out of the box.

Kinesis put records not returned in response from get records request

I have a Scala app that uses aws-java-sdk-kinesis to issue a series of putRecord requests to a local Kinesis stream.
The response returned after each putRecord request indicates that the record was successfully put into the stream.
The Scala code making the PutRecordRequest:
def putRecord(kinesisClient: AmazonKinesis, value: Array[Byte], streamName: String): Try[PutRecordResult] = Try {
  val putRecordRequest = new PutRecordRequest()
  putRecordRequest.setStreamName(streamName)
  putRecordRequest.setData(ByteBuffer.wrap(value))
  putRecordRequest.setPartitionKey("integrationKey")
  kinesisClient.putRecord(putRecordRequest)
}
To confirm this, I have a small Python app that consumes from the stream (initialStreamPosition: LATEST) and prints the records it finds by iterating through the shard iterators. Unexpectedly, however, it returns an empty set of records for each shard iterator obtained.
Trying the same thing with the AWS CLI, I do get records back for the same shard iterator. I am confused: how can that be?
Running the Python consumer (with LATEST) returns:
Shard-iterators: ['AAAAAAAAAAH9AUYVAkOcqkYNhtibrC9l68FcAQKbWfBMyNGko1ypHvXlPEuQe97Ixb67xu4CKzTFFGoLVoo8KMy+Zpd+gpr9Mn4wS+PoX0VxTItLZXxalmEfufOqnFbz2PV5h+Wg5V41tST0c4X0LYRpoPmEnnKwwtqwnD0/VW3h0/zxs7Jq+YJmDvh7XYLf91H/FscDzFGiFk6aNAVjyp+FNB3WHY0d']
Records: []
Doing the "same" with the AWS CLI, however, I get:
> aws kinesis get-records --shard-iterator AAAAAAAAAAH9AUYVAkOcqkYNhtibrC9l68FcAQKbWfBMyNGko1ypHvXlPEuQe97Ixb67xu4CKzTFFGoLVoo8KMy+Zpd+gpr9Mn4wS+PoX0VxTItLZXxalmEfufOqnFbz2PV5h+Wg5V41tST0c4X0LYRpoPmEnnKwwtqwnD0/VW3h0/zxs7Jq+YJmDvh7XYLf91H/FscDzFGiFk6aNAVjyp+FNB3WHY0d --endpoint-url http://localhost:4567
Returns:
{"Records":[{"SequenceNumber":"49625122979782922897342908653629584879579547704307482626","ApproximateArrivalTimestamp":1640263797.328,"Data":{"type":"Buffer","data":[123,34,116,105,109,101,115,116,97,109,112,34,58,49,54,52,48,50,54,51,55,57,55,44,34,100,116,109,34,58,49,54,52,48,50,54,51,55,57,55,44,34,101,34,58,34,101,34,44,34,116,114,97,99,107,101,114,95,118,101,114,115,105,111,110,34,58,34,118,101,114,115,105,111,110,34,44,34,117,114,108,34,58,34,104,116,116,112,115,58,47,47,116,101,115,116,46,99,111,109,34,44,34,104,99,99,34,58,102,97,108,115,101,44,34,115,99,34,58,49,44,34,99,111,110,116,101,120,116,34,58,123,34,101,116,34,58,34,101,116,34,44,34,100,101,118,34,58,34,100,101,118,34,44,34,100,119,101,108,108,34,58,49,44,34,111,105,100,34,58,49,44,34,119,105,100,34,58,49,44,34,115,116,97,116,101,34,58,123,34,108,99,34,58,123,34,99,111,100,101,34,58,34,115,111,109,101,45,99,111,100,101,34,44,34,105,100,34,58,34,115,111,109,101,45,105,100,34,125,125,125,44,34,121,117,105,100,34,58,34,102,53,101,52,57,53,98,102,45,100,98,102,100,45,52,102,53,102,45,56,99,56,98,45,53,97,56,98,50,56,57,98,52,48,49,97,34,125]},"PartitionKey":"integrationKey"},{"SequenceNumber":"49625122979782922897342908653630793805399163707871723522","ApproximateArrivalTimestamp":1640263817.338,"Data":{"type":"Buffer","data":[123,34,116,105,109,101,115,116,97,109,112,34,58,49,54,52,48,50,54,51,56,49,55,44,34,100,116,109,34,58,49,54,52,48,50,54,51,56,49,55,44,34,101,34,58,34,101,34,44,34,116,114,97,99,107,101,114,95,118,101,114,115,105,111,110,34,58,34,118,101,114,115,105,111,110,34,44,34,117,114,108,34,58,34,104,116,116,112,115,58,47,47,116,101,115,116,46,99,111,109,34,44,34,104,99,99,34,58,102,97,108,115,101,44,34,115,99,34,58,49,44,34,99,111,110,116,101,120,116,34,58,123,34,101,116,34,58,34,101,116,34,44,34,100,101,118,34,58,34,100,101,118,34,44,34,100,119,101,108,108,34,58,49,44,34,111,105,100,34,58,49,44,34,119,105,100,34,58,49,44,34,115,116,97,116,101,34,58,123,34,108,99,34,58,123,34,99,111,100,101,34,58,34,115,111,109,101,45,99,111,100,101,34,44,34,105,100,34,58,34,115,111,109,101,45,105,100,34,125,125,125,44,34,121,117,105,100,34,58,34,102,53,101,52,57,53,98,102,45,100,98,102,100,45,52,102,53,102,45,56,99,56,98,45,53,97,56,98,50,56,57,98,52,48,49,97,34,125]},"PartitionKey":"integrationKey"},{"SequenceNumber":"49625122979782922897342908653632002731218779711435964418","ApproximateArrivalTimestamp":1640263837.347,"Data":{"type":"Buffer","data":[123,34,116,105,109,101,115,116,97,109,112,34,58,49,54,52,48,50,54,51,56,51,55,44,34,100,116,109,34,58,49,54,52,48,50,54,51,56,51,55,44,34,101,34,58,34,101,34,44,34,116,114,97,99,107,101,114,95,118,101,114,115,105,111,110,34,58,34,118,101,114,115,105,111,110,34,44,34,117,114,108,34,58,34,104,116,116,112,115,58,47,47,116,101,115,116,46,99,111,109,34,44,34,104,99,99,34,58,102,97,108,115,101,44,34,115,99,34,58,49,44,34,99,111,110,116,101,120,116,34,58,123,34,101,116,34,58,34,101,116,34,44,34,100,101,118,34,58,34,100,101,118,34,44,34,100,119,101,108,108,34,58,49,44,34,111,105,100,34,58,49,44,34,119,105,100,34,58,49,44,34,115,116,97,116,101,34,58,123,34,108,99,34,58,123,34,99,111,100,101,34,58,34,115,111,109,101,45,99,111,100,101,34,44,34,105,100,34,58,34,115,111,109,101,45,1pre05,100,34,125,125,125,44,34,121,117,105,100,34,58,34,102,53,101,52,57,53,98,102,45,100,98,102,100,45,52,102,53,102,45,56,99,56,98,45,53,97,56,98,50,56,57,98,52,48,49,97,34,125]},"PartitionKey":"integrationKey"}],"NextShardIterator":"AAAAAAAAAAE+9W/bI4CsDfzvJGN3elplafFFBw81/cVB0RjojS39hpSglW0ptfsxrO6dCWKEJWu1f9BxY7O
ZJS9uUYyLn+dvozRNzKGofpHxmGD+/1WT0MVYMv8tkp8sdLdDNuVaq9iF6aBKma+e+iD079WfXzW92j9OF4DqIOCWFIBWG2sl8wn98figG4x74p4JuZ6Q5AgkE41GT2Ii2J6SkqBI1wzM","MillisBehindLatest":0}
I have used this same Python consumer in many other settings to introspect other Kinesis streams we have, and it works as expected. But for some reason it is not working here.
Does anyone have a clue what might be going on here?
So I was finally able to identify the issue; perhaps it will be useful for someone else with a similar problem.
In my setup I am using a local Kinesis stream (kinesalite), which doesn't support CBOR. You have to disable CBOR explicitly, otherwise you will see the following error when trying to deserialize the received record:
Unable to unmarshall response (We expected a VALUE token but got: START_OBJECT). Response Code: 200, Response Text: OK
In my case, setting the environment variable AWS_CBOR_DISABLE=1 did the trick.
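If you would rather configure this from code on the JVM side (the same calls work from Scala), here is a sketch using the SDK's system property; the endpoint and port are taken from the question, the region and class name are illustrative:

import com.amazonaws.SDKGlobalConfiguration;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;

public class LocalKinesisClientFactory {

    public static AmazonKinesis create() {
        // Equivalent to exporting AWS_CBOR_DISABLE=1; must be set before the client is built.
        System.setProperty(SDKGlobalConfiguration.AWS_CBOR_DISABLE_SYSTEM_PROPERTY, "true");
        return AmazonKinesisClientBuilder.standard()
                .withEndpointConfiguration(
                        new EndpointConfiguration("http://localhost:4567", "us-east-1")) // local kinesalite
                .build();
    }
}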

Streaming upload to Google Storage API when the final stream size is not known

So Google Storage has this great API for resumable uploads (https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload) which I'd like to use to upload a large object in multiple chunks. However, this is done in a stream processing pipeline where the total number of bytes in the stream is not known in advance.
According to the documentation of the API, you're supposed to use the Content-Range header to tell the Google Storage API that you're done uploading the file, e.g.:
PUT https://www.googleapis.com/upload/storage/v1/b/myBucket/o?uploadType=resumable&upload_id=xa298sd_sdlkj2 HTTP/1.1
Content-Length: 1024
Content-Range: bytes 1024-2047/2048
[BYTES 1024-2047]
If I'm understanding how this works correctly, that bytes 1024-2047/2048 value of the Content-Range header is how Google Storage determines that you're uploading the last chunk of data, so it can successfully finish the resumable upload session when it's done.
In my case however the total stream size is not known in advance, so I need to keep uploading until there's no more data to upload. Is there a way to do this using the Google Storage API? Ideally I'd like some way of manually telling the API "hey I'm done, don't expect any more data from me".
In my case however the total stream size is not known in advance,
In this case you need to send Content-Range: bytes 1024-2047/* in the intermediate PUT requests. Note, however, that every chunk except the last must be a multiple of 256 KiB:
https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload#example_uploading_the_file
so I need to keep uploading until there's no more data to upload. Is there a way to do this using the Google Storage API?
Yes. You send the chunks with bytes NNNNN-MMMMM/*.
Ideally I'd like some way of manually telling the API "hey I'm done, don't expect any more data from me".
You do that by either (a) sending a chunk that is not a multiple of 256 KiB, or (b) sending a chunk with bytes NNN-MMM/(MMM+1). That is, the last chunk contains the total size of the upload and indicates that it contains the last byte.
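To make the mechanics concrete, here is a minimal sketch of uploading one chunk of an already-initiated resumable session with the JDK's HttpURLConnection; the class, method, and variable names are illustrative, and intermediate chunks are assumed to be multiples of 256 KiB:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ResumableChunkUploader {

    // Uploads bytes [offset, offset + chunk.length) of the object.
    // Pass totalSize = -1 while the final size is still unknown ("*"), and the
    // real total size together with the last chunk to finish the upload.
    public static int uploadChunk(String sessionUri, byte[] chunk, long offset, long totalSize) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(sessionUri).openConnection();
        conn.setInstanceFollowRedirects(false); // GCS answers 308 ("resume incomplete"); don't treat it as a redirect
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        long last = offset + chunk.length - 1;
        String total = totalSize < 0 ? "*" : Long.toString(totalSize);
        conn.setRequestProperty("Content-Range", "bytes " + offset + "-" + last + "/" + total);
        conn.setFixedLengthStreamingMode(chunk.length);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(chunk);
        }
        // 308 means the chunk was accepted and more data is expected;
        // 200/201 means the upload is complete.
        return conn.getResponseCode();
    }
}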
The documentation you linked states that:
Content-Length. Required unless you are using chunked transfer encoding. Set to the number of bytes in the body of this initial request.
So if you click that link to chunked transfer encoding, the HTTP spec will explain how to send chunks of data until the transfer is complete:
Chunked enables content streams of unknown size to be transferred as a
sequence of length-delimited buffers, which enables the sender to
retain connection persistence and the recipient to know when it has
received the entire message.
It is likely not going to be easy to implement this on your own, so I suggest finding an HTTP client library that knows how to do this for you.
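For what it's worth, the JDK's HttpURLConnection is one client that can emit a chunked body for you; a minimal sketch of sending a body of unknown length with chunked transfer encoding (the target URL and payload are placeholders, and whether a given endpoint accepts a chunked body is up to that endpoint):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkedUploadExample {

    // Streams 'source' to 'targetUrl' without knowing its total size up front.
    public static int upload(String targetUrl, InputStream source) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(targetUrl).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        conn.setChunkedStreamingMode(256 * 1024); // emit 256 KiB chunks; no Content-Length is sent
        byte[] buffer = new byte[8192];
        try (OutputStream out = conn.getOutputStream()) {
            int read;
            while ((read = source.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        return conn.getResponseCode();
    }
}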

Amazon S3 Client setReadLimit

While uploading a file to S3, we are getting this sporadic error message in one particular case:
"If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)"
The source of the message is: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/AmazonS3Client.java
As per AWS SDK for Java 1.8.10, the maximum stream buffer size can be configured per request via
request.getRequestClientOptions().setReadLimit(int)
We are using a com.amazonaws.services.s3.AmazonS3 object to upload data.
Can anyone suggest how we can set the read limit via com.amazonaws.services.s3.AmazonS3?
https://aws.amazon.com/releasenotes/0167195602185387
It sounds like you're uploading data from an InputStream, but some sort of transient error is interrupting the upload. The SDK isn't able to retry the request because an InputStream can only be rewound as far as its mark/reset buffer (the read limit) allows. The error message is trying to give guidance on buffer sizes, but for large data you probably don't want to load it all into memory anyway.
If you're able to upload from a File source instead, you shouldn't see this error again: because a File can be re-read from the beginning, the SDK is able to retry your request if it encounters an error during the first attempt.
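A minimal sketch of the File-based variant described above, assuming the v1 SDK (bucket, key, and path are placeholders):

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class FileUploadExample {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // A File source can be re-read from the start, so the SDK can retry
        // the upload on transient errors without any read-limit tuning.
        s3.putObject(new PutObjectRequest("my-bucket", "my-key", new File("/path/to/data.bin")));
    }
}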
A little bit of necroposting, but you need to create a PutObjectRequest and call setReadLimit on that:
PutObjectRequest putObjectRequest = new PutObjectRequest(bucketName, key, fileInputStream, objectMetadata);
putObjectRequest.getRequestClientOptions().setReadLimit(xxx);
s3Client.putObject(putObjectRequest);
If you look at the implementation of putObject(String, String, InputStream, ObjectMetadata), you can see that it just creates a PutObjectRequest and passes that to putObject(PutObjectRequest).