AWS S3 Write At Offset - amazon-web-services

is there any possibility to write at some offset inside S3 stored file? We really really don't want to download it for read-modify-write all the time because files are rather big (few GBs each).

There is no way to append data in S3.
One possible workaround could be to create new files every time (possibly using Kinesis Firehose) and run EMR jobs (possibly using Data Pipeline) to merge these small files at hourly or daily cadence as needed.

Related

Compose Google Storage Objects without headers via CLI

I was wondering if it would be possible to compose Google Storage Objects (specifically csv files) without headers (i.e. without the row with column names) while using gsutil.
Currently, I can do the following:
gsutil compose gs://bucket/test_file_1.csv gs://bucket/test_file_2.csv gs://bucket/test-composition-files.csv
However, I will be unable to ingest test-composition-files.csv into Google BigQuery because compose blindly appended the files (including the column names).
One possible solution would be to download the file locally and process it with pandas, but this is not ideal for large files.
Is there any way to do this via the CLI? I could not find anything in the docs.
By reading the comment, I think you are spending effort in the wrong way. I understood that you wanted to load your files into big query, but the large number of file prevented you to do this (too many API calls). And dataflow is too slow.
Maybe you can think differently. I have 2 solutions to propose
If you need "near real time" ingestion, and if file size is bellow 1.5Gb, the best way is to build a function which read the file and perform a stream write to BigQuery. This function is triggered by a Cloud Storage event. If there is several file in the same time, several functions will be spawn. Be careful, stream write to BigQuery is not free
If you can wait up to 2 minutes when a file arrive, I recommend you to build a Cloud Functions, triggered every 2 minutes. This function read the file name in a bucket, move them to a sub directory and perform a load job of all the files in the sub directory. You are limited to 1000 load jobs per day (and per table), a day contains 1440 minutes. Batch every 2 minutes you are OK. The load job are free.
Is it acceptable alternatives?

Storing many small files (on S3)?

I have 2 million zipped HTML files (100-150KB) being added each day that I need to store for a long time.
Hot data (70-150 million) is accessed semi regularly, anything older than that is barely ever accessed.
This means each day I'm storing an additional 200-300GB worth of files.
Now, Standard storage costs $0.023 per GB and $0.004 for Glacier.
While Glacier is cheap, the problem with it is that it has additional costs, so it would be a bad idea to dump 2 million files into Glacier:
PUT requests to Glacier $0.05 per 1,000 requests
Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
Is there a way of gluing the files together, but keeping them accessible individually?
An important point, that if you need to provide quick access to these files, then Glacier can give you access to the file in up to 12 hours. So the best you can do is to use S3 Standard – Infrequent Access (0,0125 USD per GB with millisecond access) instead of S3 Standard. And maybe for some really not using data Glacier. But it still depends on how fast do you need that data.
Having that I'd suggest following:
as html (text) files have a good level of compression, you can compress historical data in big zip files (daily, weekly or monthly) as together they can have even better compression;
make some index file or database to know where each html-file is stored;
read only desired html-files from archives without unpacking whole zip-file. See example in python how to implement that.
Glacier would be extremely cost sensitive when it comes to the number of files. The best method would be to create a Lambda function that handles zip, unzip operations for you.
Consider this approach:
Lambda creates archive_date_hour.zip of the 2 Million files from that day by hour, this solves the "per object" cost problem by creating 24 giant archival files.
Set a policy on the s3 bucket to move expired objects to glacier over 1 day old.
Use an unzipping Lambda function to fetch and extract potential hot items from the glacier bucket from within the zip files.
Keep the main s3 bucket for hot files with high frequent access, as a working directory for the zip/unzip operations, and for collecting new files daily
Your files are just too small. You will need to combine them probably in an ETL pipeline such as glue. You can also use the Range header i.e. -range bytes=1000-2000 to download part of an object on S3.
If you do that you'll need to figure out the best way to track the bytes ranges, such as after combining the files recording the range for each one, and changing the clients to use the range as well.
The right approach though depends on how this data is accessed and figuring out the patterns. If somebody who looks at TinyFileA also looks at TinyFileB you could combine them together and just send them both along with other files they are likely to use. I would be figuring out logical groupings of files which make sense to consumers and will reduce the number of requests they need, without sending too much irrelevant data.

Need to upload 6tb data file

I'm using storage gateway as my second source of real-time backup to S3. The issue is the data is around 6tb. I would like to know can I upload 6tb data to S3. I know that 5tb is the limit. But do we've any way to fulfill the requirement?
The maximum size of a single object is 5TB.
If you are copying multiple files, you will be fine. It will just take a while!

Extend aws firehose stream buffering period

Can I somehow extend the duration of the firehose stream buffering interval to be more than 900 seconds? I'm working with small files sized ~100kb after 15 min of streaming to s3. I need extended buffering interval or another way to merge 4 files every hour and what is the best way to do that?
I don't want to download these files and then do merging because of many firehose streams, so only direct solution on AWS would be considered.
I've read so many things linked to this problem, and I couldn't find any useful answer.
Unfortunately 900s (15 minutes) is a hard limit on the amount of time Kinesis will wait to buffer.
However, if your data is moving this slowly then you can handle your hourly merge yourself.
An approach to this would to be to use a lambda function that was scheduled to invoke every hour, listed the files in the target bucket(s), read them, merged them, wrote them into a "merged" bucket/prefix and then deleted the merged files.
Alternatively you could have a lambda trigger on the S3 fire-hose bucket invoked whenever a file is written. This trigger would read all the files in that bucket, and merge them. It has the advantage that you merge your buckets in parallel and will not have to wait an hour for your file to be combined.
(You should note that S3 is not consistent at fast write speeds or when listing large numbers of files, so this is not a good solution if your data speed increases to the point where you're writing multiple files a minute.)

How to use S3 and EBS in tandem for cost effective analytics on AWS?

I receive very large (5TB) .csv files from my clients on S3 buckets. I have to process these files, add columns to them and store them back.
I might need to work with the files in the same way as I increase the number of features for future improved models.
Clearly because S3 stores data as objects, every time I make a change, I have to read and write 5TB of data.
What is the best approach I can take to process these data cost effectively and promptly:
Store a 5TB file on S3 as object, every time read the object, do
the processing and save the result back to S3
Store the 5TB on S3 as object, read the object, chunk it to smaller objects and save them back to S3 as multiple objects so in future just work with the chunks I am interested in
Save every thing on EBS from start, mount it to the EC2 and do the processing
Thank you
First, a warning -- the maximum size of an object in Amazon S3 is 5TB. If you are going to add information that results in a larger object, then you will likely hit that limit.
The smarter way of processing this amount of data is to do it in parallel and preferably in multiple, smaller files rather than a single 5TB file.
Amazon EMR (effectively, a managed Hadoop environment) is excellent for performing distributed operations across large data sets. It can process data from many files in parallel and can compress/decompress data on-the-fly. It's complex to learn, but very efficient and capable.
If you are sticking with your current method of processing the data, I would recommend:
If your application can read directly from S3, use that as the source. Otherwise, copy the file(s) to EBS.
Process the data
Store the output locally in EBS, preferably in smaller files (GBs rather than TBs)
Copy the files to S3 (or keep them on EBS if that meets your needs)