S3 trigger to perform a file conversion for a multi-part file type - amazon-web-services

I am working on converting shapefiles to geojson. Shapefiles are composed of at least 3 required files and as many as 8 separate files all residing in a folder. To convert to geojson you need all the constituent parts. Right now I have a batch conversion process that goes through all the shapefiles stored in an s3 bucket, downloads all the separate file parts and performs the conversion. What I'm trying to figure out now is how to run the file conversion process based on the upload of a single shapefile folder, hopefully using an s3 bucket trigger.
I have reviewed this answer (AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function) but in my case there is no frontend client (the answer presented in that question appears to be to signal a final event, but that is done from the client interface). Maybe I need to build one, but I was trying to handle this only in the backend (there is no frontend and no plans to have one). The 'user' would be dropping the files right into s3 directly without a file upload interface.
Of course, when someone uploads a folder with all the shapefile parts in it, the s3 trigger fires once for each part, but no single part alone is a complete shapefile.
A few solutions I've thought of, each with its own problems:
1. I am converting the shapefiles to geojson and storing the geojson in a separate s3 bucket, using a naming convention for the geojson based on the s3 file name. In theory you could always check whether the geojson already exists in the destination s3 bucket and, if not, run the conversion. But this still doesn't take care of the timing of the multiple file parts being uploaded. I could check the name, but the trigger would fire multiple times, fail on some invocations, and then ultimately (probably) succeed once all the parts are in place.
1a. Maybe some type of try/except error checking on the conversion mentioned above? Meaning, for each file part uploaded, go ahead and try to download and convert. This seems fragile and potentially error prone. Also, I believe a certain subset of the files will likely produce a geojson without error but without all the metadata or the complete set of data, so a 'successful' conversion may not actually be a success.
2. Using a database to track which files have been converted, which would basically be the same solution as 1 above.
3. Partly a question as a solution: on the s3 web console there is 'file' upload and 'folder' upload. To upload the shapefile folder containing all the component parts, you'd have to use the 'folder' option. The question then is: is there any way to know, from the event trigger's perspective, that the operation was a folder upload rather than just a file upload, and therefore to wait until all the parts of the folder are uploaded? Or is there any event data in AWS that, when a FOLDER is uploaded, counts the underlying file parts (1 of 6, 2 of 6, etc.) and could send an event after all the parts of the folder have been uploaded?
I am also aware of 'multipart' upload, which I think would do what I proposed in #3 above, but that multipart 'tag' only applies if you upload via the sdk or cli. Unless the s3 console folder upload is a multipart upload under the hood?
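For what it's worth, option 1 above can be made reasonably robust if the Lambda fired by each ObjectCreated event simply lists the folder's prefix and only proceeds once every required member is present. A minimal Python sketch of that idea follows; the destination bucket name, the required-extension set and the convert_to_geojson helper are hypothetical placeholders.

import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Assumption: .shp, .shx and .dbf are treated as the minimum required members.
REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}
DEST_BUCKET = "my-geojson-bucket"  # hypothetical destination bucket


def handler(event, context):
    # Fires once per uploaded object; only converts when the set is complete.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        prefix = key.rsplit("/", 1)[0] + "/"  # assumes parts live under a folder prefix

        # See which parts of this shapefile folder have arrived so far.
        listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        present = {os.path.splitext(obj["Key"])[1].lower()
                   for obj in listing.get("Contents", [])}
        if not REQUIRED_EXTENSIONS.issubset(present):
            continue  # a later part's event will re-run this check

        geojson_key = prefix.rstrip("/") + ".geojson"
        try:
            # Skip if a previous invocation already produced the output.
            s3.head_object(Bucket=DEST_BUCKET, Key=geojson_key)
            continue
        except ClientError:
            pass

        convert_to_geojson(bucket, prefix, DEST_BUCKET, geojson_key)  # hypothetical helper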

Related

Correct way to fetch data from an aws server into a flutter app?

I have a general understanding question. I am building a flutter app that relies on a content library containing text files, latex equations, images, pdfs, videos etc.
The content lies on an aws amplify backend. Depending on the navigation of the user in the app, the corresponding data is fetched and displayed.
I am not sure about the correct way of fetching the data. The current method (which works) is that the data is stored in an S3 bucket. When data is requested, the data is downloaded to a temporary directory and then opened and processed in the app. This is actually not slow, but I feel that it is not the way it should be done.
When data is downloaded, a file transfer notification pops up, which bothers me because it is shown all the time. Also, I would like to read the data directly with something like a get request, without downloading the file first (especially for text files, which I would like to read directly into a String). But here I don't know how it works, because I don't see a way to store data in a file system with the other amplify services like data store or the rest api. Also, an S3 bucket is an intuitive way of storing data that is easy to use for the content creators of my company, so to me it seems that the S3 bucket is the way to go. However, with S3 I have only figured out the download method to fetch data.
Could someone give me a hint on what is the correct approach for this use case? Thank you very much!
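One hedged observation: most S3 SDKs can hand you an object's bytes in memory instead of forcing a download to a temporary file. Purely as an illustration of the idea (in Python with boto3, since a similar call pattern exists across SDKs), reading an object straight into a String might look like this; the bucket and key are placeholders.

import boto3

s3 = boto3.client("s3")

# Fetch the object and read its body straight into memory -- no temp file involved.
# "my-content-bucket" and the key are placeholders for illustration.
response = s3.get_object(Bucket="my-content-bucket", Key="library/chapter1.txt")
text = response["Body"].read().decode("utf-8")

print(text[:200])  # the file's contents are now just a String in memory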

Using FileGetMimeType() with uploads to Amazon S3

I have so far allowed users to upload images to my server and then used CF's FileGetMimeType() function to determine if the MIME type is valid (e.g. jpg).
The problem is that FileGetMimeType() wants a full path to the file on the server to work, while with Amazon S3 all I have is a URL to where the image is stored. In order to get FileGetMimeType() to work, I have to first upload the image to Amazon S3, then download it again using CFHTTP, and then determine the file type. This seems way less efficient than the old way.
So why not just upload to my own server first, determine the MIME type, and then upload to S3 right? I can't do that because some of these files are going to be huge with thousands of users uploading at the same time. We're talking videos as well as images.
Is there an efficient way to upload files to an external server i.e. Amazon S3 and then get the MIME type somehow without having to download the file all over again? Can it be done on S3's end?
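Two hedged observations, sketched in Python with boto3 purely for illustration (the bucket and key names are placeholders): S3 returns whatever Content-Type was recorded at upload via a plain HEAD request, and if that can't be trusted you can fetch just the first few bytes with a ranged GET and sniff the magic numbers, so the whole file never has to come back down.

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-upload-bucket", "user-images/photo123"  # placeholders

# Option 1: trust the Content-Type recorded when the object was uploaded (HEAD, no body).
head = s3.head_object(Bucket=BUCKET, Key=KEY)
print(head["ContentType"])

# Option 2: sniff the real type from the first few bytes via a ranged GET,
# so only a dozen bytes are transferred instead of the whole file.
first_bytes = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=0-11")["Body"].read()
if first_bytes.startswith(b"\xff\xd8\xff"):
    print("image/jpeg")
elif first_bytes.startswith(b"\x89PNG\r\n\x1a\n"):
    print("image/png")
else:
    print("unknown -- reject or inspect further")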

Upload files to S3 Bucket directly from a url

We need to move our video file storage to AWS S3. The old location is a cdn, so I only have a url for each file (1000+ files, > 1TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads the file, uploads it to the S3 bucket and updates the DB records with the new HTTP url, and it works perfectly, except that it takes forever.
Downloading the file takes some time (considering each file is close to a gigabyte) and uploading it takes longer.
Is it possible to upload the video file directly from the cdn to S3, so I could cut the processing time in half? Something like reading a chunk of the file and then putting it to S3 while reading the next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed, I run the app on a server with 1GBit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 fetches an object from one bucket and stores it in another bucket (or the same bucket), even across regions and even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, presumably using either asynchronous I/O or threads, so that you can be receiving a stream of downloaded data and uploading it at the same time, probably in symmetric chunks, using S3's Multipart Upload capability. Multipart upload lets you write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. It also supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream": you could discover the size of the object, then GET chunks by range and multipart-upload them. Do this with threads or async I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
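Purely as an illustration of the ranged-GET variant described above, here is a rough Python/boto3 sketch (the equivalent multipart operations also exist in the AWS SDK for .NET). The source URL, bucket, key and part size are placeholder assumptions, the origin is assumed to honour HEAD and Range requests, and retry/abort handling is omitted.

import boto3
import requests
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
SOURCE_URL = "https://cdn.example.com/video.mp4"       # placeholder origin URL
BUCKET, KEY = "my-target-bucket", "videos/video.mp4"   # placeholder destination
PART_SIZE = 16 * 1024 * 1024  # 16MB parts; every part except the last must be >= 5MB

# Discover the object size (assumes the origin answers HEAD with Content-Length).
size = int(requests.head(SOURCE_URL, allow_redirects=True).headers["Content-Length"])
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

def copy_part(part_number, start):
    end = min(start + PART_SIZE, size) - 1
    # Pull just this byte range from the origin...
    chunk = requests.get(SOURCE_URL, headers={"Range": f"bytes={start}-{end}"}).content
    # ...and push it straight to S3 as one part of the multipart upload.
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                          PartNumber=part_number, Body=chunk)
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

starts = list(range(0, size, PART_SIZE))
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(copy_part, range(1, len(starts) + 1), starts))

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})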
This has been answered by me in this question; here's the gist:
require 'aws-sdk-s3'  # provides Aws::S3::Object
require 'open-uri'    # provides URI.open

object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
  # Stream the HTTP response body straight into the S3 upload.
  IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is not a 'direct' pull into S3, though. But at least it doesn't download each file and then upload it serially; it streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
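For comparison, a roughly equivalent sketch in Python with boto3, which likewise streams the download into a managed (multipart) upload without touching disk; the URL, bucket and key are the same placeholder names as above.

import urllib.request
import boto3

s3 = boto3.client("s3")

# urlopen() returns a binary file-like object; upload_fileobj() reads it in
# chunks and performs a managed multipart upload, so nothing hits the disk.
with urllib.request.urlopen("http://example.com/file.ext") as source:
    s3.upload_fileobj(source, "target-bucket", "target-key")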
If a proxy (node express) is suitable for you, the portions of code at these 2 routes could be combined into a GET/POST fetch chain, retrieving and then re-posting the response body to your destination S3 bucket.
Step one creates response.body.
Step two: set the stream in the 2nd link to the response from the GET op in link 1, and you will upload the stream (arrayBuffer) from the first fetch to the destination bucket.

Merging files on AWS S3 (Using Apache Camel)

I have some files that are being uploaded to S3 and processed for some Redshift task. After that task is complete these files need to be merged. Currently I am deleting these files and uploading merged files again.
This eats up a lot of bandwidth. Is there any way the files can be merged directly on S3?
I am using Apache Camel for routing.
S3 allows you to use an S3 file URI as the source for a copy operation. Combined with S3's Multi-Part Upload API, you can supply several S3 object URIs as the source keys for a multi-part upload.
However, the devil is in the details. S3's multi-part upload API has a minimum file part size of 5MB. Thus, if any file in the series of files under concatenation is < 5MB, it will fail.
However, you can work around this by exploiting the loophole which allows the final upload piece to be < 5MB (allowed because this happens in the real world when uploading remainder pieces).
My production code does this by:
Interrogating the manifest of files to be uploaded
If the first part is under 5MB, downloading pieces* and buffering them to disk until 5MB is buffered
Appending parts sequentially until the file concatenation is complete
If a non-terminus file is < 5MB, appending it, then finishing that upload, creating a new upload and continuing
Finally, there is a bug in the S3 API. The ETag (which is really an MD5 file checksum on S3) is not properly recalculated at the completion of a multi-part upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved on the final copy operation.
* Note that you can download a byte range of a file. This way, if part 1 is 10K and part 2 is 5GB, you only need to read in 5110K of part 2 to meet the 5MB size needed to continue.
** You could also keep a 5MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using a byte range of 5MB+1 to EOF-1.
P.S. When I have time to make a Gist of this code I'll post the link here.
You can use Multipart Upload with Copy to merge objects on S3 without downloading and uploading them again.
You can find some examples in Java, .NET or with the REST API here.
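A minimal Python/boto3 sketch of that Multipart Upload with Copy approach, concatenating two existing objects entirely server-side; the bucket, source keys and target key are placeholders, and every source except the last must still respect the 5MB minimum discussed above.

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                                    # placeholder bucket
SOURCES = ["uploads/part-a.csv", "uploads/part-b.csv"]  # objects to concatenate
TARGET = "merged/combined.csv"                          # placeholder result key

upload = s3.create_multipart_upload(Bucket=BUCKET, Key=TARGET)
parts = []

for number, source_key in enumerate(SOURCES, start=1):
    # upload_part_copy tells S3 to read the source object itself, so the data
    # never leaves S3 and no download/re-upload bandwidth is spent.
    resp = s3.upload_part_copy(Bucket=BUCKET, Key=TARGET, UploadId=upload["UploadId"],
                               PartNumber=number,
                               CopySource={"Bucket": BUCKET, "Key": source_key})
    parts.append({"PartNumber": number, "ETag": resp["CopyPartResult"]["ETag"]})

s3.complete_multipart_upload(Bucket=BUCKET, Key=TARGET, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})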

Growing files on Amazon S3

Is it possible to have growing files on amazon s3?
That is, can I upload a file whose final size I don't know when the upload starts, so that I can keep writing more data to the file at a specified offset?
For example, write 1000 bytes in one go, and then in the next call continue writing to the file at offset 1001, so that the next byte written is the 1001st byte of the file.
Amazon S3 indeed allows you to do that by Uploading Objects Using Multipart Upload API:
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. [...]
One of the listed advantages precisely addresses your use case, namely to Begin an upload before you know the final object size - You can upload an object as you are creating it.
This functionality is available by Using the REST API for Multipart Upload, and all AWS SDKs, as well as 3rd party libraries like boto (a Python package that provides interfaces to Amazon Web Services), offer multipart upload support based on this API as well.
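A minimal sketch of that pattern in Python with boto3: parts are sent as the data is produced, and the object only comes into existence when the upload is completed, so the final size never has to be known up front. The bucket, key and the produce_data() generator are hypothetical placeholders.

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "growing/output.bin"  # placeholders
MIN_PART = 5 * 1024 * 1024  # every part except the last must be >= 5MB

upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts, buffer, part_number = [], b"", 1

for chunk in produce_data():  # hypothetical generator yielding bytes as they are created
    buffer += chunk
    if len(buffer) >= MIN_PART:
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                              PartNumber=part_number, Body=buffer)
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1
        buffer = b""

if buffer:  # the final part is allowed to be smaller than 5MB
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                          PartNumber=part_number, Body=buffer)
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})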