I have an S3 bucket to which many different small files are uploaded (2 files of 1 kB per minute).
Is it good practice to ingest them into Redshift one at a time, with a Lambda triggered on each upload?
Or would it be better to push them to a staging area such as Postgres and then, at the end of the day, do a batch ETL from the staging area into Redshift?
Or should I build a manifest file containing all of the file names for the day and use the COPY command to ingest them into Redshift?
As Mitch says, #3. Redshift wants to work on large data sets, and if you ingest small files many times you will need to vacuum the table. Loading many files at once avoids this.
However, there is another potential problem: your files are too small for efficient bulk retrieval from S3. S3 is an object store, and each request needs to be translated from a bucket/object-key pair to a location in S3. This takes on the order of 0.5 seconds. That is not an issue when loading a few files at a time, but if you need to load a million of them in series, that's 500K seconds of lookup time. Redshift will run the COPY in parallel, but only up to the number of slices you have in your cluster, so it is still going to take a long time.
So depending on your needs, you may need to rethink your use of S3. If so, your solution may end up including a Lambda that combines small files into bigger ones. You can run this in parallel with the Redshift COPY if you only need to load many, many files at once during some recovery process, but an archive of 1 billion 1 kB files will be nearly useless if it needs to be loaded quickly.
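A minimal sketch of option #3, assuming some process records the day's uploaded keys (the bucket and key names below are made up). The JSON structure is the Redshift COPY manifest format:

```python
import json

def build_manifest(bucket, keys, mandatory=True):
    """Build a Redshift COPY manifest listing every small file for the day.

    `bucket` and `keys` come from wherever your ingest process tracks
    uploads; the "entries"/"url"/"mandatory" layout is the documented
    Redshift manifest format.
    """
    return {
        "entries": [
            {"url": f"s3://{bucket}/{key}", "mandatory": mandatory}
            for key in keys
        ]
    }

# One day's worth of 1 kB files (names here are placeholders):
manifest = build_manifest("my-bucket", ["2021/12/21/file-0001.csv",
                                        "2021/12/21/file-0002.csv"])
print(json.dumps(manifest, indent=2))
```

Upload the resulting JSON to S3 and point a single COPY at it with the MANIFEST option, e.g. `COPY my_table FROM 's3://my-bucket/manifests/2021-12-21.json' IAM_ROLE '...' CSV MANIFEST;` (role and paths are placeholders).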
I am running a simple Athena query as in
SELECT * FROM "logs"
WHERE parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z')
BETWEEN parse_datetime('2021-12-01:00:00:00','yyyy-MM-dd:HH:mm:ss')
AND
parse_datetime('2021-12-21:19:00:00','yyyy-MM-dd:HH:mm:ss');
However this times out due to the default DML 30 min timeout.
The path I am querying contains a few million entries.
Is there a way to address this in Athena or is there a better suited alternative for this purpose?
This is normally solved with partitioning. For data that's organized by date, partition projection is the way to go (versus an explicit partition list that's updated manually or via Glue crawler).
That, of course, assumes that your data is organized by the partition (eg, s3://mybucket/2021/12/21/xxx.csv). If not, then I recommend changing your ingest process as a first step.
You may want to change your ingest process anyway: Athena isn't very good at dealing with a large number of small files. While the tuning guide doesn't give an optimal file size, I recommend at least a few tens of megabytes. If you're getting a steady stream of small files, use a scheduled Lambda to combine them into a single file. If you're using Firehose to aggregate files, increase the buffer sizes / time limits.
And while you're doing that, consider moving to a columnar format such as Parquet if you're not already using it.
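A sketch of the batching step such a scheduled Lambda might perform, assuming you can list each object's key and size. The 64 MB target is just an assumption in line with "a few tens of megabytes"; the actual S3 reads and writes are omitted:

```python
def plan_batches(objects, target_bytes=64 * 1024 * 1024):
    """Group (key, size) pairs into batches of roughly `target_bytes` each.

    A scheduled Lambda could then concatenate each batch into one larger
    object before Athena ever queries the data.
    """
    batches, current, current_size = [], [], 0
    for key, size in objects:
        # Flush the current batch once adding this object would overshoot.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Small files group into ~64 MB batches (names/sizes made up):
objs = [(f"raw/file-{i}.gz", 13 * 1024 * 1024) for i in range(10)]
print(plan_batches(objs))
```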
We are in Google Cloud Platform, so technologies there would be a good win. We have a huge file that comes in, and Dataflow scales on the input to break up the file quite nicely. After that, however, it streams through many systems: microservice1 over to data connectors grabbing related data, over to ML, and finally over to a final microservice.
Since the final stage could be around 200-1000 servers depending on load, how can we take all the requests coming in (yes, we have a file id attached to every request, including a customerRequestId in case a file is dropped multiple times) and write every line with the same customerRequestId to the same file on output?
What is the best method to do this? The resulting file is almost always a CSV file.
Any ideas or good options I can explore? Given that Dataflow was good at ingestion and reading a massively large file in parallel, is it also good at taking in various inputs on a cluster of nodes (not a single node, which would bottleneck us)?
EDIT: I seem to recall that HDFS has files partitioned across nodes and, I think, they can be written by many nodes at the same time somehow (a node per partition). Does anyone know if Google Cloud Storage files are this way as well? Is there a way to have 200 nodes writing to 200 partitions of the same file in Google Cloud Storage in such a way that it is all one file?
EDIT 2:
I see that there is a streaming pub/sub to bigquery option that could be done as one stage in this list: https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming
HOWEVER, in this list there is no batch BigQuery-to-CSV option (which is what our customer wants). I do see a BigQuery-to-Parquet option here, though: https://cloud.google.com/dataflow/docs/guides/templates/provided-batch
I would prefer to go directly to CSV. Is there a way?
thanks,
Dean
Your case is complex and hard (and expensive) to reproduce. My first idea is to use BigQuery: sink all the data into the same table with Dataflow.
Then create a temporary table with only the data to export to CSV, like this:
CREATE TABLE `myproject.mydataset.mytemptable`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT ....
Then export the temporary table to CSV. If the table is smaller than 1 GB, only one CSV file will be generated.
If you need to orchestrate these steps, you can use Workflows.
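Whatever does the orchestration, the end state the question asks for is one output file per customerRequestId. A local illustration of that grouping, with StringIO standing in for GCS objects (field names follow the question; everything else is made up):

```python
import csv
import io

def split_by_request_id(rows, id_field="customerRequestId"):
    """Route every row to a per-customerRequestId CSV buffer.

    In the real pipeline these buffers would be GCS objects or BigQuery
    export destinations; StringIO stands in for them here.
    """
    outputs, writers = {}, {}
    for row in rows:
        rid = row[id_field]
        if rid not in outputs:
            outputs[rid] = io.StringIO()
            writers[rid] = csv.DictWriter(outputs[rid], fieldnames=list(row))
            writers[rid].writeheader()
        writers[rid].writerow(row)
    return {rid: buf.getvalue() for rid, buf in outputs.items()}

rows = [
    {"customerRequestId": "r1", "line": "a"},
    {"customerRequestId": "r2", "line": "b"},
    {"customerRequestId": "r1", "line": "c"},
]
files = split_by_request_id(rows)
print(sorted(files))  # one CSV per request id
```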
I'm using Athena, and I'm dealing with about 1000 raw compressed data files daily (each of 13 MB). I need to process and store them efficiently to improve query speed and cost, and I do it with partitions.
I'm currently using Lambdas (triggered on each file creation) as my ETL: each one loads the raw data file, reorganizes columns, changes the file format (to Parquet), and then splits it and saves it to multiple files in an S3 bucket.
Then another Lambda starts and creates new partitions for the Athena table:
s3://raw_data/2020/01/01 --> s3://table_name/2020/01/01/0-50/ (Athena table built over these files)
Many partitions affect query speed, so I've heard of a new feature called Partition Projection. Using this feature I can speed up query processing and automate partition management.
I'm a little bit confused by the options.
Another point: theoretically, if I understand it correctly, I could get rid of the Lambdas and save some cost by using Athena CTAS once to create the table and then executing INSERT INTO daily (limited to 100 partitions per query), but then I would still be using explicit partitions and couldn't use Partition Projection.
What is the most efficient way to deal ETL when it comes to daily provided files?
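The column-reorganization step described above can be sketched locally like this (the Parquet conversion and S3 reads/writes are omitted, and the column names are made up):

```python
import csv
import io

def reorganize_columns(raw_csv, column_order):
    """Rewrite a CSV with its columns in `column_order`.

    Stands in for the first part of the Lambda ETL; columns not listed
    in `column_order` are dropped.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=column_order, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

raw = "b,a,c\n2,1,3\n"
print(reorganize_columns(raw, ["a", "b"]))
```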
I took a look at the link and am trying to understand what S3 Select is.
Most applications have to retrieve the entire object and then filter out only the required data for further analysis. S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to the Amazon S3 service.
Based on the statement above, I am trying to imagine what is the proper use case.
Is it helpful in that, if I have a single Excel file with 100 million rows sitting in S3, I can use S3 Select to query only some of the rows instead of downloading all 100 million?
There are many use cases. But two cases that are apparent are centralization and time efficiency.
Let's say you have this "single Excel file with 100 million rows" in S3. If several people/departments/branches need to access it, all of them would have to download, store, and process it. Since each would download it separately, in no time you would end up with all of them either having an old version of the file (a new version could have been uploaded to S3) or just different versions: one person working on a version from today, another on a version from last week. With S3 Select, all of them would query and get data from the one version of the object stored in S3.
Also, if you have 100 million records, retrieving only selected data can save you a lot of time. Just imagine one person needing only 10 records from this file and another person needing 1,000. Instead of downloading 100 million records, the first person uses S3 Select to fetch just those 10 records, and the other just gets his/her 1,000 - all without downloading 100 million records.
Even more benefit comes from using S3 Select on Glacier, from which you can't readily download your files when needed.
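For reference, here is the shape of the request you would pass to boto3's select_object_content, wrapped in a helper so the parameters are easy to see. The bucket, key, and WHERE clause are placeholders, and note that S3 Select reads CSV, JSON, or Parquet objects, so a native .xlsx file would need converting first:

```python
def s3_select_params(bucket, key, expression):
    """Request parameters for boto3's s3.select_object_content().

    CSV input/output serialization is assumed here; adjust for JSON or
    Parquet objects.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Treat the first CSV line as a header so columns are addressable.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }

# Fetch ~10 matching records instead of downloading all 100 million:
params = s3_select_params(
    "my-bucket", "big-file.csv",
    "SELECT * FROM s3object s WHERE s.is_male = '1' LIMIT 10",
)
print(params["Expression"])
```

You would then call `boto3.client("s3").select_object_content(**params)` and read the streamed event records from the response.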
I have a pipe delimited text file that is 360GB, compressed (gzip).
It has over 1,620 columns. I can't show the exact field names, but here's basically what it is:
primary_key|property1_name|property1_value|...|property800_name|property800_value
12345|is_male|1|is_college_educated|1
Seriously, there are over 800 of these property name/value fields.
There are roughly 280 million rows.
The file is in an S3 bucket.
I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to pivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
What is a good way to pivot the file in the aws environment? The data is in a single file, but I'm planning on splitting the data into many different files to allow for parallel processing.
I've considered using Athena. I couldn't find anything that states the maximum number of columns allowed by Athena. But, I found a page about Presto (on which Athena is based) that says “there is no exact hard limit, but we've seen stuff break with more than few thousand.” (https://groups.google.com/forum/#!topic/presto-users/7tv8l6MsbzI).
Thanks.
First, pivot your data, then load to Redshift.
In more detail, the steps are:
1. Run a Spark job (using EMR or possibly AWS Glue) which reads in your source S3 data and writes out (to a different S3 folder) a pivoted version. By this I mean that if you have 800 name/value pairs, you would write out 800 rows per input row. At the same time, you can split the output into multiple files to enable parallel loading.
2. COPY this pivoted data into Redshift.
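The pivot itself is a simple flat-map: in the Spark job, each wide input line explodes into one row per name/value pair. A plain-Python sketch of that transform, using the question's example row:

```python
def pivot_row(line, delimiter="|"):
    """Explode one wide row (primary_key, name1, value1, name2, value2, ...)
    into (primary_key, name, value) triples."""
    fields = line.rstrip("\n").split(delimiter)
    pk, pairs = fields[0], fields[1:]
    # Walk the remaining fields two at a time: (name, value).
    return [
        (pk, pairs[i], pairs[i + 1])
        for i in range(0, len(pairs) - 1, 2)
    ]

wide = "12345|is_male|1|is_college_educated|1"
for pk, name, value in pivot_row(wide):
    print(f"{pk}|{name}|{value}")
```

In Spark this function would be applied with `flatMap` over the input lines; with 800 pairs per row it turns 280 million rows into roughly 224 billion narrow rows, so splitting the output for parallel COPY matters.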
What I have learnt from AWS over time is that if you are reaching a limit, you are doing it the wrong way, or not in a scalable way. Most of the time the architects designed with scalability and performance in mind.
We had a similar problem, with 2000 columns. Here is how we solved it:
Split the file across 20 different tables, with 100+1 (primary key) columns each.
Do a select across all those tables in a single query to return all the data you want.
If you say you want to see all 1600 columns in a select, then the business user is looking at the wrong columns for their analysis, or even for machine learning.
To load 10 TB+ of data, we split the data into multiple files and loaded them in parallel; that way loading was faster.
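The file split for parallel loading can be as simple as round-robin assignment, ideally producing a file count that is a multiple of the cluster's slice count so COPY keeps every slice busy. A sketch (row contents made up):

```python
def split_for_parallel_load(lines, num_files):
    """Round-robin lines into `num_files` groups so Redshift's COPY can
    load the resulting files in parallel, one per slice."""
    groups = [[] for _ in range(num_files)]
    for i, line in enumerate(lines):
        groups[i % num_files].append(line)
    return groups

parts = split_for_parallel_load([f"row-{i}" for i in range(10)], 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]
```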
Between Athena and Redshift, performance is the main difference; the rest is much the same. Redshift performs better than Athena: Athena's initial load time and scan time are higher than Redshift's.
Hope it helps.