I know that Parquet files are splittable when they are stored in block storage, e.g. on HDFS.
Are they also splittable when stored in object storage such as AWS S3?
This confuses me because object storage is supposed to be atomic: you either access the entire object or none of it. You can't even change the metadata on an S3 object without rewriting the entire object. On the other hand, AWS recommends using splittable file formats in S3 to improve the performance of Athena and other frameworks in the Hadoop ecosystem.
Yes, Parquet files are splittable.
S3 supports positioned reads (range requests), which can be used to read only selected portions of the input file (object).
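As a concrete illustration, here is a minimal sketch of a ranged read with boto3 (the bucket and key names are hypothetical). This is the primitive that lets a Parquet reader fetch just the file footer and then only the column chunks it needs:
import boto3

s3 = boto3.client('s3')

# Fetch only the last 1024 bytes of the object (a suffix range request).
# Parquet readers use the same mechanism to read the footer at the end
# of the file and then request just the row groups / columns they need.
resp = s3.get_object(
    Bucket='my-bucket',                    # hypothetical bucket
    Key='data/part-0000.snappy.parquet',   # hypothetical key
    Range='bytes=-1024',
)
footer_bytes = resp['Body'].read()
print(len(footer_bytes))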
I'm not 100% sure what you mean here, but generally (I think) you partition Parquet data on partition keys and save columns into blocks of rows. When I have used it in AWS S3, it has been saved like:
|-Folder
|--Partition Keys
|---Columns
|----Rows_1-100.snappy.parquet
|----Rows_101-200.snappy.parquet
This handles the splitting efficiencies you mention.
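For example, a rough sketch of producing that kind of layout with pandas (this assumes pyarrow and s3fs are installed; the bucket, path, and column names are hypothetical):
import pandas as pd

# Hypothetical data with a partition key column.
df = pd.DataFrame({
    'partition_key': ['2020-03-01', '2020-03-01', '2020-03-02'],
    'value': [1, 2, 3],
})

# Writes one folder per partition key value, each containing
# snappy-compressed parquet files.
df.to_parquet(
    's3://my-bucket/folder/',        # hypothetical location
    partition_cols=['partition_key'],
    compression='snappy',
)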
Related
I have multiple S3 files in a bucket.
Input S3 bucket:
File 1 - 2GB data
File 2 - 500MB data
File 3 - 1GB data
File 4 - 2GB data
and so on. Assume there are 50 such files. The data within the files has the same schema, let's say attribute1, attribute2.
I want to merge these files and write the output to a new bucket such that each output file is less than 1GB, with the same schema as before.
File 1 - < 1GB
File 2 - < 1GB
File 3 - < 1GB
I am looking for AWS-based solutions that I can deliver using AWS CDK. I was considering the following two solutions:
AWS Athena - reads from and writes to S3, but I'm not sure whether I can set a 1GB limit while writing.
AWS Lambda - read the files sequentially, store them in memory, and when the size is near 1GB, write to a new file in the S3 bucket; repeat until all files are processed. I'm worried about the 15-minute timeout and not sure Lambda will be able to handle this.
Expected scale -> overall input size sum: 1 TB
What would be a good way to go about implementing this? I hope I have phrased the question right; I'd be happy to clarify in the comments if anything is unclear.
Thanks!
Edit :
Based on a comment ->
Apologies for calling it a merge; it's more of a reset. All files have the same schema and are CSV files. In pseudocode:
List<File> listOfFiles = ReadFromS3(key)
File temp = new File("temp.csv")
for (File file : listOfFiles)
    append file to temp.csv
List<File> finalList = break temp.csv into chunks of 1GB each
for (File file : finalList)
    writeToS3(file)
Amazon Athena can run a query across multiple objects in a given Amazon S3 path, as long as they all have the same format (e.g. the same columns in a CSV file).
It can store the result in a new External Table, with a location pointing to an S3 bucket, by using a CREATE TABLE AS command and a LOCATION parameter.
The size of the output files can be controlled by setting the number of output buckets (which is not the same as an S3 bucket).
See:
Bucketing vs partitioning - Amazon Athena
Set the number or size of files for a CTAS query in Amazon Athena
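For instance, here is a hedged sketch of kicking off such a CTAS query from Python with boto3 (the database, table, bucket names, and column are all hypothetical, and bucket_count is a rough guess of total input size divided by the desired maximum file size):
import boto3

athena = boto3.client('athena')

# CTAS query that rewrites the source table into a new S3 location,
# bucketed so the output is split across a fixed number of files.
ctas = """
CREATE TABLE resized_data
WITH (
    external_location = 's3://my-output-bucket/resized/',
    format = 'TEXTFILE',
    field_delimiter = ',',
    bucketed_by = ARRAY['attribute1'],
    bucket_count = 1024
) AS
SELECT * FROM source_data
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'my_database'},          # hypothetical database
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},
)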
If your process includes an ETL (Extract, Transform, Load) post-processing step, you could use AWS Glue.
Please find here an example for Glue using S3 as a source.
If you'd like to use it with the Java SDK, the best starting points are:
the Glue GitHub repo
the AWS Java code sample catalog for Glue
Out of all of them, the tutorial on creating a crawler (which you can find in GitHub at the above URL) should match your case, as it crawls an S3 bucket and puts it in a Glue catalog for transformation.
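If you are scripting this from Python rather than Java, a minimal sketch of creating and starting such a crawler with boto3 might look like this (the crawler name, IAM role, catalog database, and S3 path are all hypothetical):
import boto3

glue = boto3.client('glue')

# Create a crawler that scans an S3 prefix and registers the
# discovered schema in the Glue Data Catalog.
glue.create_crawler(
    Name='csv-input-crawler',                          # hypothetical
    Role='arn:aws:iam::123456789012:role/GlueRole',    # hypothetical IAM role
    DatabaseName='csv_input_db',                       # hypothetical catalog database
    Targets={'S3Targets': [{'Path': 's3://my-input-bucket/'}]},
)

# Run it; the resulting catalog table can then feed a Glue ETL job.
glue.start_crawler(Name='csv-input-crawler')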
I have a task to analyze weather forecast data in QuickSight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to QuickSight or even Athena?
There are python libraries that will decode the data from the binary files, such as Iris. They are used like this:
import iris
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)
So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:
Respond to an SQS message that a new binary file is available
Access the new binary file and decode it to access the forecast data
Add the decoded data to the set of already decoded data from previous SQS notifications
Make all the decoded data available in Athena / QuickSight
Tricky one, this...
What I would do is probably something like this:
Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 notifications (if you control the bucket), by SNS or SQS, or by a schedule in EventBridge. The function uses the code snippet included in your question to transform each new file and upload the transformed data to another S3 bucket.
I don't know the size of these files or how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small, CSV will probably be easiest and good enough.
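As a hedged sketch (assuming a direct S3 event trigger, a hypothetical output bucket name, and that iris's pandas bridge fits your cube shapes), the conversion Lambda could look roughly like this:
import io
import boto3
import iris
import iris.pandas

s3 = boto3.client('s3')
OUTPUT_BUCKET = 'converted-forecast-data'  # hypothetical output bucket

def handler(event, context):
    # Each record describes one newly uploaded NetCDF object
    # (S3 notification event shape; an SQS-wrapped event would need unwrapping).
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Download to Lambda's ephemeral /tmp storage and decode with iris,
        # as in the snippet from the question.
        local_path = '/tmp/' + key.split('/')[-1]
        s3.download_file(bucket, key, local_path)
        cubes = iris.load(local_path)

        # Convert each cube to CSV via pandas and upload the result.
        # (Higher-dimensional cubes may need slicing before conversion.)
        for i, cube in enumerate(cubes):
            df = iris.pandas.as_data_frame(cube)
            buf = io.StringIO()
            df.to_csv(buf)
            out_key = key.rsplit('.', 1)[0] + f'_{i}.csv'
            s3.put_object(Bucket=OUTPUT_BUCKET, Key=out_key, Body=buf.getvalue())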
With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
If you end up with a lot of small files you might want to implement a second step where, once per day, you combine the converted files into bigger files (and possibly convert to Parquet), but don't do anything like that unless you have to.
An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.
I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases Athena Federated Query was designed for, and it might be worth looking into. If NetCDF files are common and a data source for such files would be useful to other people, I'm sure the Athena team would love to talk to you and help you out.
I have a web app with download buttons to download objects from S3 buckets. I also have plot buttons that read the contents of CSV files in an S3 bucket using pandas read_csv, to read the columns and make visualizations. I want to understand whether the price for S3 data transfer out to the internet applies only to actual downloads of files, or whether it also covers just reading the contents, since the bytes are transferred over the internet in that case as well.
S3 does not operate like a local file system. Reading the contents of an object means transferring its bytes out of S3, exactly as a download does, so when pandas read_csv pulls a CSV over the internet those bytes count as data transfer out. (Clients can issue ranged GETs to fetch only part of an object, but you are still charged for whatever bytes are transferred.) That is why AWS only shows pricing for data transfer.
Is there a way to have the DynamoDB rows for each user backed up in S3 as a CSV file?
Then, using streams, when a row is mutated, change that row in the CSV file in S3.
The CSV readers that are currently out there are geared towards parsing the CSV for use within the Lambda.
Whereas I would like to find a specific row, given by the stream, and then replace it with another row without having to load the whole file into memory, as it may be quite big. The reason I would like a backup on S3 is that in the future I will need to do batch processing on it, and reading 300k rows from DynamoDB within a short period of time is not preferable.
Read the data from S3, parse as csv using your favorite library and update, then write back to S3:
import io
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')

with io.BytesIO() as data:
    # Download the whole object into an in-memory buffer.
    bucket.download_fileobj('my_key', data)

    # parse csv data and update as necessary

    # then write back to s3, rewinding the buffer to the start first
    data.seek(0)
    bucket.upload_fileobj(data, 'my_key')
Note that S3 does not support appending to or updating an object, if that was what you were hoping for – see here. You can only read and overwrite whole objects. You might take this into account when designing your system.
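To make the "parse and update" placeholder concrete, here is a hedged sketch using pandas (the key, column names, and matching logic are hypothetical); it follows the read-whole-object, modify, overwrite-whole-object pattern described above:
import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')

# Read the whole CSV backup into a DataFrame.
with io.BytesIO() as data:
    bucket.download_fileobj('my_key', data)
    data.seek(0)
    df = pd.read_csv(data)

# Replace the row for one user (column names and new values are hypothetical,
# e.g. taken from a DynamoDB stream record).
df.loc[df['user_id'] == 'user-123', ['attribute1', 'attribute2']] = ['new-a', 'new-b']

# Overwrite the object with the updated CSV.
out = io.BytesIO(df.to_csv(index=False).encode('utf-8'))
bucket.upload_fileobj(out, 'my_key')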
I have a large number of files in an S3 bucket and usually import them into Redshift. Since the number of files is large, I need a column in the Redshift table that contains the source file name from the S3 location.
Is there any way to accomplish this?
Agree with Ketan that this is currently not possible in Redshift. If this is what you want to achieve, it is possible through either:
Reading the S3 files programmatically and writing new S3 files with the file name as a column, then loading the new files (a sketch of this follows below), or
Alternatively, using Hive: create an external table on the S3 bucket location, use INPUT__FILE__NAME to get the file names, create a new table, and then write back to S3. You can also do some pre-processing in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
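For the first option, a minimal sketch of the preprocessing step (the bucket names, prefix, and column layout are hypothetical):
import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')
src = s3.Bucket('my-input-bucket')      # hypothetical source bucket
dst = s3.Bucket('my-staging-bucket')    # hypothetical staging bucket for COPY

# Rewrite each CSV with an extra column holding its source file name,
# then COPY the staging bucket into Redshift as usual.
for obj in src.objects.filter(Prefix='data/'):
    body = obj.get()['Body'].read()
    df = pd.read_csv(io.BytesIO(body))
    df['source_file'] = obj.key
    dst.put_object(Key=obj.key, Body=df.to_csv(index=False).encode('utf-8'))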
Hope this helps.
That isn't possible. During a COPY operation, Redshift only loads the file contents into a table; it doesn't provide access to the S3 file names.
To achieve what you want, you need to preprocess the data to add additional information inside the files.