Use case of Amazon S3 Select

I took a look at the link and am trying to understand what S3 Select is.
Most applications have to retrieve the entire object and then filter out only the required data for further analysis. S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to the Amazon S3 service.
Based on the statement above, I am trying to imagine what the proper use case is.
Is it helpful that, if I have a single Excel file with 100 million rows sitting in S3, I can use S3 Select to query just the rows I need instead of downloading the entire 100 million rows?

There are many use cases, but two that stand out are centralization and time efficiency.
Let's say you have this "single Excel file with 100 million rows" in S3. If several people, departments, or branches need to access it, each of them would have to download, store, and process it separately. In no time you would end up with all of them either working from an old version of the file (a new version could have been uploaded to S3) or simply different versions: one person on today's version, another on last week's. With S3 Select, all of them query and get data from the one version of the object stored in S3.
Also, if the file has 100 million records, retrieving only the selected data can save a lot of time. Imagine one person needing only 10 records from this file and another needing 1,000. Instead of downloading 100 million records, the first person uses S3 Select to fetch just those 10 records, and the other gets just their 1,000, all without anyone downloading the full 100 million records.
Even more benefit comes from using S3 Select with Glacier (Glacier Select), where you cannot readily download your files when you need them.
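As a concrete illustration, a minimal sketch of such a query with boto3's select_object_content might look like the following; the bucket, key, and column names are made up, and the object is assumed to be a CSV with a header row:

import boto3

s3 = boto3.client("s3")

# Ask S3 to run the SQL filter server-side and stream back only the matching rows.
# Bucket, key, and the "department" column are hypothetical.
response = s3.select_object_content(
    Bucket="my-bucket",
    Key="reports/records.csv",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.amount FROM s3object s WHERE s.department = 'sales' LIMIT 1000",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; collect the Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")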

Related

AWS Redshift and small datasets

I have an S3 bucket to which many different small files are uploaded (two 1 kB files per minute).
Is it good practice to ingest them into Redshift immediately, using a Lambda trigger?
Or would it be better to push them to a staging area such as Postgres and then run a batch ETL from the staging area into Redshift at the end of the day?
Or should I build a manifest file containing all of the file names for each day and use the COPY command to ingest them into Redshift?
As Mitch says, #3. Redshift wants to work on large data sets, and if you ingest small batches many times you will need to vacuum the table. Loading many files at once fixes this.
However, there is another potential problem: your files are too small for efficient bulk retrieval from S3. S3 is an object store, and each request needs to be translated from a bucket/object-key pair to a location in S3. This takes on the order of 0.5 seconds. That is not an issue when loading a few files at a time, but if you need to load a million of them in series, that's 500K seconds of lookup time. Redshift will run the COPY in parallel, but only up to the number of slices in your cluster, so it is still going to take a long time.
So, depending on your needs, you may need to rethink how you use S3. If so, you may end up with a Lambda that combines small files into bigger ones as part of your solution. You can run this as a parallel process to the Redshift COPY if you only need to load many, many files at once during some recovery process, but an archive of 1 billion 1 kB files will be near useless if they ever need to be loaded quickly.
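As a rough sketch of the "combine small files into bigger ones" idea, a scheduled Lambda or batch script could concatenate a day's worth of small objects into one larger object before the COPY runs; the bucket, prefixes, and the in-memory concatenation are simplifying assumptions that suit tiny 1 kB files:

import boto3

s3 = boto3.client("s3")
BUCKET = "my-ingest-bucket"          # placeholder bucket
SOURCE_PREFIX = "raw/2023-01-01/"    # the day's small files (placeholder prefix)
TARGET_KEY = "combined/2023-01-01.csv"

# List every small object for the day and concatenate their contents in memory.
# Assumes the files have no header row; for larger volumes you would batch or
# use a multipart upload instead of holding everything in memory.
parts = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        parts.append(body)

# Write one larger object that a single Redshift COPY can load efficiently.
s3.put_object(Bucket=BUCKET, Key=TARGET_KEY, Body=b"".join(parts))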

Spark write to S3 with 'partitionBy()'; but not have column name (just value) in S3 directory

I have a bucket in AWS S3 called 's3a://devices'
I am trying to write a dataframe to it, partitioned by company name, using the code below.
# Duplicate companyname so the value stays inside the data files as well as in the partition path.
df = df.withColumn("companyname_part", df["companyname"])
df.repartition("companyname_part").dropDuplicates().write.mode("append").partitionBy("companyname_part").parquet('s3a://devices/')
I have a few questions related to this storage layout and the daily incremental code above.
If I pull all current data daily and want to overwrite and drop duplicates each day, should I change the above code to run daily with "overwrite"? Will previous versions of the data in S3 be overwritten for each partition?
Is there an "HDFS" command that I could automate, or something I could use in the PySpark script through os.command/subprocess, to get rid of the "companyname_part=" portion of the directory names, so that it just keeps the value after the "="? I am worried that all partitions starting with the same text will make Athena queries slower.
Does anyone have feedback/advice on optimal partitioning for S3? I am using this partition column because it will be a lookup key for an API, but I'm curious about the optimal sizing (in megabytes) of partitions. And is 5,000 partitions too many?
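For what it's worth, a rough sketch of the daily-overwrite variant asked about above, assuming df and spark are the dataframe and session from the question's script, and assuming Spark 2.3+ where dynamic partition overwrite is available so that only the partitions present in today's pull get replaced rather than the whole bucket:

# Only overwrite partitions that appear in the incoming data, not the whole path
# (requires Spark 2.3+; "dynamic" is the key setting here).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = df.withColumn("companyname_part", df["companyname"])
(df.repartition("companyname_part")
   .dropDuplicates()
   .write.mode("overwrite")
   .partitionBy("companyname_part")
   .parquet("s3a://devices/"))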

How to split data when archiving from AWS database to S3

For a project we've inherited we have a large-ish set of legacy data, 600GB, that we would like to archive, but still have available if need be.
We're looking at using AWS Data Pipeline to move the data from the database into S3, following this tutorial.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and give each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects
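For example, a minimal sketch of fetching one 'row' from Python via Athena might look like the following, assuming an Athena table (here called legacy_data, with a row_id column) has already been defined over the CSV files; the database name and results bucket are placeholders:

import time
import boto3

athena = boto3.client("athena")

# Start the query; Athena scans the CSV files in S3 directly.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM legacy_data WHERE row_id = 10239",
    QueryExecutionContext={"Database": "archive_db"},                   # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the matching row(s).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)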

Pivoting 1,620 columns to rows in a 360 GB text file in AWS

I have a pipe-delimited text file that is 360 GB, compressed (gzip).
It has over 1,620 columns. I can't show the exact field names, but here's basically what it is:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
Seriously, there are over 800 of these property name/value fields.
There are roughly 280 million rows.
The file is in an S3 bucket.
I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to pivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
What is a good way to pivot the file in the AWS environment? The data is in a single file, but I'm planning on splitting it into many different files to allow for parallel processing.
I've considered using Athena. I couldn't find anything that states the maximum number of columns allowed by Athena. But, I found a page about Presto (on which Athena is based) that says “there is no exact hard limit, but we've seen stuff break with more than few thousand.” (https://groups.google.com/forum/#!topic/presto-users/7tv8l6MsbzI).
Thanks.
First, pivot your data, then load to Redshift.
In more detail, the steps are:
1. Run a Spark job (using EMR or possibly AWS Glue) which reads in your source S3 data and writes out, to a different S3 folder, a pivoted version. By this I mean that if a row has 800 name/value pairs, you would write out 800 rows for it. At the same time, you can split the output into multiple files to enable parallel loading (a sketch of this step follows these steps).
2. "COPY" this pivoted data into Redshift.
What I have learnt from working with AWS is that if you are hitting a limit, you are doing it the wrong way or not in a scalable way. Most of the time, the architects designed these services with scalability and performance in mind.
We had a similar problem, with 2,000 columns. Here is how we solved it:
Split the file across 20 different tables, with 100 columns plus 1 (the primary key) in each (a sketch of this split follows below).
Do a select across all those tables in a single query to return all the data you want.
If you say you want to see all 1,600 columns in a single select, then the business user is looking at the wrong columns for their analysis, or even for machine learning.
To load 10 TB+ of data, we split it into multiple files and loaded them in parallel; that way loading was faster.
Between Athena and Redshift, performance is the main difference; the rest is much the same. Redshift performs better than Athena: Athena's initial load time and scan time are higher than Redshift's.
Hope it helps.
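Purely as an illustration of the column-splitting idea above, here is a rough PySpark sketch that writes one narrower dataset per would-be Redshift table; the paths, the 100-column chunk size, and the primary key name are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split_wide_table").getOrCreate()

# Placeholder source: the wide, ~2,000-column dataset.
wide = (spark.read
        .option("sep", "|")
        .option("header", True)
        .csv("s3a://my-bucket/source/"))

# Split into narrower tables of ~100 columns each, always carrying the
# primary key so the pieces can be joined back together in a single query.
data_cols = [c for c in wide.columns if c != "primary_key"]
for i in range(0, len(data_cols), 100):
    chunk = data_cols[i:i + 100]
    (wide.select("primary_key", *chunk)
         .write.mode("overwrite")
         .parquet(f"s3a://my-bucket/wide_table_{i // 100}/"))   # placeholder output path

Each of these narrower datasets could then be loaded into its own Redshift table with a COPY, as described in the first answer.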

Content replacement in S3 files when a unique ID matches on both sides, using big data solutions

I am trying to explore a use case like this: we have huge data (50B records) in files, each file has around 50M records, and each record has a unique identifier. It is possible that a record present in file 10 is also present in file 100, but the latest state of that record is in file 100. The files sit in AWS S3.
Now let's say around 1B of the 50B records need reprocessing. Once reprocessing is complete, we need to identify all the files that contain any of these 1B records and replace the content of those files for those 1B unique IDs.
Challenges: right now we don't have a mapping that tells us which file contains which unique IDs, and the whole file replacement needs to complete within one day, which means we need parallel execution.
We have already started a task to maintain the file-to-unique-ID mapping. We would need to load this mapping while processing the 1B records, look each record up in it, and identify all the distinct file dates for which content replacement is required.
The mapping will be huge, because it has to hold 50B records, and it may grow further since this is a growing system.
Any thoughts on this?
You will likely need to write a custom script that will ETL all your files.
Tools such as Amazon EMR (Hadoop) and Amazon Athena (Presto) would be excellent for processing the data in the files. However, your requirement to identify the latest version of data based upon filename is not compatible with the way these tools would normally process data. (They look inside the files, not at the filenames.)
If the records merely had an additional timestamp field, then it would be rather simple for either EMR or Presto to read the files and output a new set of files with only one record for each unique ID (with the latest date).
Rather than creating a system to look up unique IDs in files, you should have your system output a timestamp. This way, the data is not tied to a specific file and can easily be loaded and transformed based upon the contents of the file.
I would suggest:
Process each existing file (yes, I know you have a lot!) and add a column that represents the filename.
Once you have a new set of input files with the filename column (which acts to identify the latest record), use Amazon Athena to read all records and output one row per unique ID (with the latest date). This would be a normal SELECT ... GROUP BY statement, with a little playing around to get only the latest record (see the sketch after this list).
Athena would output new files to Amazon S3 containing only the unique records. These would then be the source records for any future processing you perform.
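A rough sketch of that deduplication step, shown here in PySpark rather than the Athena SQL suggested above (the logic is the same); the input path, the unique_id column name, and the assumption that file names sort chronologically are taken from or assumed for the question:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("latest_per_id").getOrCreate()

# Read every file and tag each record with the file it came from;
# per the question, a record in a "later" file supersedes earlier copies.
records = (spark.read
           .option("header", True)
           .csv("s3a://my-bucket/records/")                 # placeholder path
           .withColumn("source_file", F.input_file_name()))

# Keep only the newest copy of each unique ID (here "newest" = highest file name,
# which assumes the file naming sorts chronologically - an assumption from the question).
w = Window.partitionBy("unique_id").orderBy(F.col("source_file").desc())
latest = (records
          .withColumn("rn", F.row_number().over(w))
          .where(F.col("rn") == 1)
          .drop("rn"))

latest.write.mode("overwrite").parquet("s3a://my-bucket/latest/")  # placeholder output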