For each distinct value in col_a, yield a new table - amazon-web-services

I have an Athena table of data in S3 that acts as a source table, with columns id, name, and event. For every unique name value in this table, I would like to output a new table with all of the rows corresponding to that name value and save it to a different bucket in S3. This will result in n new files stored in S3, where n is the number of unique name values in the source table.
I have tried single Athena queries in Lambda using PARTITION BY and CTAS queries, but can't seem to get the result I want. It seems that AWS Glue may be able to produce my expected result, but I've read online that it's more expensive, and that I may be able to achieve the same thing with Lambda.
How can I store a new file (JSON format, preferably) that contains all rows corresponding to each unique name in S3?
Preferably I would run this once a day to update the data stored by name, but the question above is the main concern for now.

When you write your Spark/Glue code you will need to partition the data using the name column. However, this will result in paths of the following format:
s3://bucketname/folder/name=value/file.json
This should give a separate set of files for each name value, but if you want to access each set as a separate table you might need to remove the = sign from the key before you crawl the data and make it available via Athena.
If you do use a Lambda function, the operation still involves going through the data and partitioning it, similar to what Glue does.
It all depends on the volume of data that needs to be processed. Glue, if using Spark, may have a little extra start-up time; Glue Python shell jobs have comparatively better start-up times.
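A minimal Glue (PySpark) sketch of that approach, assuming the source data is already registered in the Glue Data Catalog as my_db.source_table (hypothetical names) and the output should land under the s3://bucketname/folder/ prefix shown above:

# Hedged sketch: read the catalogued source table and write one folder of
# JSON files per distinct `name` value. Database, table, and bucket names
# are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table from the Glue Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",          # hypothetical database name
    table_name="source_table"  # hypothetical table name
)

# Write JSON partitioned by name, producing
# s3://bucketname/folder/name=<value>/part-*.json
(dyf.toDF()
    .write
    .mode("overwrite")
    .partitionBy("name")
    .json("s3://bucketname/folder/"))

A Glue job running this script on a daily schedule would also cover the once-a-day refresh mentioned in the question.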

Related

Athena query is very slow

I am storing 400,000 Parquet files in S3 that are partitioned based on a unique id (e.g. 412812). The files range in size from 25 KB to 250 KB. I then want to query the data using Athena, like so:
Select *
From Table
where id in (412812, 412813, 412814)
This query is much slower than anticipated. I want to be able to search for any set of ids and get a fast response. I believe it is slow because Athena must search through the entire Glue catalog looking for the right file (i.e., a full scan of files).
The following query is extremely fast. Less than a second.
Select *
From Table
where id = 412812
partition.filtering is enabled on the table. I tried adding an index to the table that was the same as the partition, but it did not speed anything up.
Is there something wrong with my approach or a table configuration that would make this process faster?
Your basic problem is that you have too many files and too many partitions.
While Amazon Athena does operate in parallel, there are limits to how many files it can process simultaneously. Plus, each extra file adds overhead for listing, opening, etc.
Also, putting just a single file in each partition greatly adds to the overhead of handling so many partitions and is probably counterproductive for increasing the efficiency of the system.
I have no idea how you actually use your data, but based on your description I would recommend that you create a new table that is bucketed_by the id, rather than partitioned:
CREATE TABLE new_table
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
external_location = 's3://bucket/new_location/',
bucketed_by = ARRAY['id'],
bucket_count = 10 -- required whenever bucketed_by is specified; tune the value to your data volume
)
AS SELECT * FROM existing_table
Let Athena create as many files as it likes -- it will optimize based upon the amount of data. More importantly, it will create larger files that allow it to operate more efficiently.
See: Bucketing vs Partitioning - Amazon Athena
In general, partitions are great when you can divide the data into some major subsets (e.g. by country, state, or something that represents a sizeable chunk of your data), while bucketing is better for fields whose values are relatively uncommon (e.g. user IDs). Bucketing will create multiple files and Athena will be smart enough to know which files contain the IDs you want. However, the data will not be partitioned into subdirectories based upon those values.
Creating this new table will greatly reduce the number of files that Amazon Athena would need to process for each query, which will make your queries run a LOT faster.

How to use max, min functions in projection.columnName.range to the AWS Glue Table Property

I use the table property below to set the range on the date column:
'projection.date.range' = 'NOW-365DAYS,NOW+1DAYS'
The table has no data going back to NOW-365DAYS, as it is a new table. Querying this table from Athena results in a high volume of ListBucket requests, which I don't want. So I thought to set a range like the one below,
'projection.date.range' = 'MAX(2022/01/12, NOW-365DAYS),NOW+1DAYS'
so that I can avoid empty partitions, but it throws an error.
Is there a way to use MAX/MIN functions in projection.date.range?
It's not possible to qualify the partition projection range like that, unfortunately. I suggest setting the lower bound to the actual first date with data until the relative range makes sense.
Since you say Athena is making a lot of S3 list requests I assume you are querying the table without filters on this partition key. This will always result in a lot of S3 listings, at least 365 of them, regardless of whether there is data or not. Why aren't your queries filtering on the date partition key?
Is the reason why you want the range to be the last 365 days that you will remove data after one year, or is there another reason?
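If it helps, here is a minimal boto3 sketch of pinning the lower bound to a fixed first date (2022/01/12, taken from the question). The database, table, and result-location names are placeholders, and the date literal has to match the table's projection.date.format:

# Hedged sketch: replace the relative lower bound with the first date that
# actually has data. All names below are placeholders.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString=(
        "ALTER TABLE my_db.my_table "
        "SET TBLPROPERTIES ('projection.date.range' = '2022/01/12,NOW+1DAYS')"
    ),
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)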

Does "$path" limit the amount of data scanned by Athena?

The main question:
I can't seem to find definitive info about how $path works when used in a where clause in Athena.
select * from <database>.<table> where "$path" = 'known/path/'
Given a table definition at the top level of a bucket, if there are no partitions specified but the bucket is organized using prefixes, does it scan the whole table? Or does it limit the scan to the specified path in a similar way to partitions? Any reference to an official statement on this?
The specific case:
I have information being stored in S3. This information needs to be counted and queried once or twice a day. The prefixes are two different IDs (s3://bucket/IDvalue1/IDvalue2/) followed by the file with the relevant data. On a given day any number of new folders might be created (on busy days it could be tens of thousands), or new files might be added to existing prefixes. So, keeping the partition catalog up to date seems a little complicated.
One proposed approach to avoid partitions is using $path when getting data from a known combination of IDs, but I cannot seem to find out whether using such an approach would actually limit the amount of data scanned per query. I read a comment saying it does not, but I cannot find it in the documentation and was wondering if anyone knows how it works and can point to the proper reference.
So far googling and reading the docs has not clarified this.
Athena does not have any optimisation for limiting the files scanned when using $path in a query. You can verify this for yourself by running SELECT * FROM some_table and SELECT * FROM some_table WHERE $path = '…' and comparing the bytes scanned (they will be the same, if there was an optimisation they would be different – assuming there is more than one file of course).
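As a rough way to make that comparison programmatically, here is a hedged boto3 sketch (table name, $path value, and result location are placeholders) that runs a query and reports the bytes scanned:

# Hedged sketch: run a query through Athena and print the bytes scanned, so
# the same SELECT can be compared with and without a "$path" predicate.
# Table name, path value, and output location are placeholders.
import time
import boto3

athena = boto3.client("athena")

def bytes_scanned(sql):
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return execution["Statistics"]["DataScannedInBytes"]
        time.sleep(1)

print(bytes_scanned("SELECT * FROM some_table"))
print(bytes_scanned(
    "SELECT * FROM some_table WHERE \"$path\" = 's3://bucket/IDvalue1/IDvalue2/file'"
))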
See Query by "$path" field and Athena: $path vs. partition
For your use case I suggest using partition projection with the injected type. This way you can limit the prefixes on S3 that Athena will scan, while at the same time not have to explicitly add partitions.
You could use something like the following table properties to set it up (use the actual column names in place of id_col_1 and id_col_2, obviously):
CREATE EXTERNAL TABLE some_table
…
TBLPROPERTIES (
"projection.id_col_1.type" = "injected",
"projection.id_col_2.type" = "injected",
"storage.location.template" = "s3://bucket/${id_col_1}/${id_col_2}/"
)
Note that when querying a table that uses partition projection with the injected type, all queries must contain explicit values for the projected columns (for example, WHERE id_col_1 = 'value1' AND id_col_2 = 'value2').

Athena Query Results: Are they always strings?

I'm in the process of building new "ETL" pipelines with CTAS. Unfortunately, quite often the CTAS query is too intensive, which causes Athena to time out. As such, I use CTAS to create the initial table and populate it with a small sample. I then write a script that queries the same table the CTAS was generated from (which is in Parquet format) for the remaining days that the CTAS couldn't handle upfront. I write the output of these query results to the same directory that holds the results of the CTAS query, before repairing the table (to pick up the new data). However, it seems to be a pretty clunky process for a number of reasons:
1) Query results written out with standard SQL statements all end up being strings. For example, when I write out the number of DAUs (which is a count, cast to an int) the CSV output is a string, i.e. wrapped in quotes.
Is it possible to write out Athena query results (not the CTAS) as anything other than a string when in CSV format? The main problem is that this means the results can't be read back into the table produced by the CTAS, since those columns expect a bigint. This can, of course, be resolved with a Lambda function, but it seems like a big overhead for something that should be trivial.
2) Can you put query results (not from CTAS) directly into parquet instead of CSV?
3) Is there any way to prevent metadata being generated with the query results (not from CTAS)? Again, it can be cleaned up with a Lambda function, but it's just additional nonsense I need to handle.
Thanks in advance!
The data type of the result depends on the SQL used to create it and also on how you consume it. Based on your question I'm going to assume that you're creating a table using CTAS and that the output is CSV, and that you're then looking at the CSV data directly.
That CSV is going to have quotes in it, but that doesn't mean that it's not possible to read integer values as integers, and so on. Athena uses a schema-on-read approach, and as long as the serde can interpret a value as a particular type, that type will work as the type of the column.
If you query the table created by your CTAS operation you should get back integers for the integer columns.
Using CTAS you can also create output of different types, like JSON, Avro, Parquet, and ORC, that keep the type information. Just use the format property to select the output type.
I am a bit confused about what you mean in your third question. With a normal query you get two files on S3, the data file and the metadata file, and they will be written to the output location given in the StartQueryExecution API call, but with a CTAS query you get the output data in a different location (given in the SQL) than the metadata file.
Are you actually using CTAS, or are you talking about the regular query result files?
Update after the question got clarified:
1) Athena is unfortunately unable to properly read its own output in many situations. This really surprises me; it seems they never considered it before launch. You might be able to set up a table that uses the regex serde.
2) No, unfortunately the only output of a regular query is CSV at this time.
3) No, the metadata is always written to the same prefix as the output.
I think your best bet is running multiple CTAS queries that select subsets of your source data; if there is a date column, for example, you could make one CTAS per month or some other time range that works. After the CTAS queries have completed you can move the result files into the same directory on S3 and create a final table that has that directory as its location.
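A hedged boto3 sketch of that approach, assuming a date column called event_date and placeholder database, table, and bucket names; it issues one CTAS per month, after which the result files can be combined under a final table location as described above:

# Hedged sketch: split one heavy CTAS into one CTAS per month. The table,
# column, and S3 locations are placeholders.
import time
import boto3

athena = boto3.client("athena")

def run(sql):
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]
    # Wait for each chunk to finish before starting the next one
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(5)

for start, end in [("2019-01-01", "2019-02-01"),
                   ("2019-02-01", "2019-03-01"),
                   ("2019-03-01", "2019-04-01")]:
    label = start[:7].replace("-", "_")
    run(f"""
        CREATE TABLE etl_chunk_{label}
        WITH (format = 'PARQUET',
              external_location = 's3://my-bucket/etl-staging/{label}/')
        AS SELECT * FROM source_table
        WHERE event_date >= DATE '{start}' AND event_date < DATE '{end}'
    """)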

What is the idiomatic way to perform a migration on a dynamo table

Suppose I have a dynamo table named Person, which has 2 fields, name (string), age (int). Let's assume it has a TB worth of data and experiences a small amount of read throughput, but a ton of write throughput. Now I want to add a new field called Phone (string). What is the best way to go about moving the data from one table to another?
Note: Dynamo doesn't let you rename tables, and fields cannot be null.
Here are the options I think I have:
Dump the table to .csv and run a script (overnight, probably, since it's a TB worth of data) to add a default phone number to this file. (Not ideal; it will also lose all new data submitted to the old table, unless I bring the service offline to perform the migration, which is not an option in this case.)
Use the Scan API call. (Scan will read all values, then consume significant write throughput on the new table to insert all the old data into it.)
How can I perform a Dynamo migration on a large table without significant data loss?
You don't need to do anything. This is NoSQL, not SQL (i.e. there is no idiomatic way to do this, as you normally don't need migrations for NoSQL).
Just start writing entries with the additional key.
Records you get back that were written before will not have this key. What you normally do is use a default value when it is missing.
If you want to backfill, just go through the table, read each item, and put it back with the additional field. You can do this in one run via a scan, or again do it lazily when accessing the data.
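A hedged boto3 sketch of that backfill (the table name, key attribute, and default value are assumptions, since the question doesn't specify the key schema):

# Hedged sketch: scan the Person table and add a default Phone attribute to
# items that don't have one yet. Table name, key attribute, and default value
# are placeholders; a real run over a TB of data should throttle itself and
# ideally use a parallel scan.
import boto3

table = boto3.resource("dynamodb").Table("Person")

scan_kwargs = {}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        if "Phone" not in item:
            table.update_item(
                Key={"name": item["name"]},  # assumes name is the partition key
                # if_not_exists avoids overwriting a phone number written
                # concurrently by the live application
                UpdateExpression="SET Phone = if_not_exists(Phone, :p)",
                ExpressionAttributeValues={":p": "UNKNOWN"},
            )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]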