The main question:
I can't seem to find definitive info about how $path works when used in a where clause in Athena.
select * from <database>.<table> where "$path" = 'known/path/'
Given a table definition at the top level of a bucket, if there are no partitions specified but the bucket is organized using prefixes does it scan the whole table? Or does it limit the scan to the specified path in a similar way to partitions? Any reference to an official statement on this?
The specific case:
I have information stored in S3 that needs to be counted and queried once or twice a day. The prefixes are two different IDs (s3://bucket/IDvalue1/IDvalue2/), followed by the file with the relevant data. On a given day any number of new folders might be created (on busy days it could be tens of thousands), or new files may be added to existing prefixes. So keeping the partition catalog up to date seems a little complicated.
One proposed approach to avoid partitions is using $path when getting data for a known combination of IDs, but I cannot find out whether such an approach would actually limit the amount of data scanned per query. I read a comment saying it does not, but I cannot find it in the documentation and was wondering if anyone knows how it works and can point to the proper reference.
So far googling and reading the docs has not clarified this.
Athena does not have any optimisation for limiting the files scanned when using $path in a query. You can verify this yourself by running SELECT * FROM some_table and SELECT * FROM some_table WHERE "$path" = '…' and comparing the bytes scanned (they will be the same; if there were an optimisation they would differ, assuming there is more than one file, of course).
See Query by "$path" field and Athena: $path vs. partition
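For example (table name and path are placeholders; note that the $path pseudo-column must be double-quoted because of the dollar sign):

```sql
-- Run both and compare "Data scanned" in the Athena console; the numbers
-- will be the same.
SELECT * FROM some_table;
SELECT * FROM some_table WHERE "$path" = 's3://bucket/IDvalue1/IDvalue2/data.json';
```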
For your use case I suggest using partition projection with the injected type. This way you can limit the prefixes on S3 that Athena will scan, while at the same time not have to explicitly add partitions.
You could use something like the following table properties to set it up (use the actual column names in place of id_col_1 and id_col_2, obviously):
CREATE EXTERNAL TABLE some_table
…
TBLPROPERTIES (
"projection.id_col_1.type" = "injected",
"projection.id_col_2.type" = "injected",
"storage.location.template" = "s3://bucket/${id_col_1}/${id_col_2}/"
)
Note that when querying a table that uses partition projection with the injected type, all queries must contain explicit values for the projected columns.
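A query against that table might then look like this (the ID values are placeholders); Athena resolves it directly to s3://bucket/IDvalue1/IDvalue2/ via the storage.location.template:

```sql
-- Both injected columns must be pinned to explicit values.
SELECT COUNT(*)
FROM some_table
WHERE id_col_1 = 'IDvalue1'
  AND id_col_2 = 'IDvalue2';
```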
Related
I seemingly cannot get Athena partition projection to work.
When I add partitions the "old-fashioned" way and then run MSCK REPAIR TABLE testparts; I can query the data.
I drop the table and recreate it with the partition projection properties below, and it fails to query any data at all. The queries that do run take a very long time with no results, or they time out like the query below.
For the sake of argument I followed AWS documentation:
select distinct year from testparts;
I get :
HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mydb.testparts' can potentially read more than 1000000 partitions.
I have ~7500 files in there at the moment in the file structures indicated in the table setups below.
I have:
Tried entering the date parts as a date type with the format "yyyy-MM-dd", and it still did not work (including deleting and changing my S3 structures as well). I then tried splitting the dates into different folders and setting them as integers (which you see below), and it still did not work.
Given that I can get it to operate "manually" by repairing the table and then successfully querying my structures, I must be doing something wrong at a fundamental level with partition projection.
I have also changed user from the injected type to enum (not ideal given it's a plain old string, but I did it for testing purposes).
Table creation :
CREATE EXTERNAL TABLE `testparts`(
`thisdata` array<struct<thistype:string,amount:float,identifiers:map<string,struct<id:string,type:string>>,selections:map<int,array<int>>>> COMMENT 'from deserializer')
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`user` string,
`thisid` int,
`account` int)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://testoutputs/partitiontests/responses'
TBLPROPERTIES (
'classification'='json',
'projection.day.digits'='2',
'projection.day.range'='1,31',
'projection.day.type'='integer',
'projection.enabled'='true',
'projection.account.range'='1,50',
'projection.account.type'='integer',
'projection.month.digits'='2',
'projection.month.range'='1,12',
'projection.month.type'='integer',
'projection.thisid.range'='1,15',
'projection.thisid.type'='integer',
'projection.user.type'='enum',
'projection.user.values'='usera,userb',
'projection.year.range'='2017,2027',
'projection.year.type'='integer',
'storage.location.template'='s3://testoutputs/partitiontests/responses/year=${year}/month=${month}/day=${day}/user=${user}/thisid=${thisid}/account=${account}/',
'transient_lastDdlTime'='1653445805')
If you run a query like SELECT * FROM testparts, Athena will generate all permutations of the possible partition key values and list the corresponding location on S3 for each one. For your table this means 11 × 12 × 31 × 2 × 15 × 50 = 6,138,000 listings.
I don't believe there's any optimization for SELECT DISTINCT year FROM testparts that would skip building the list of partition key values, so something similar would happen with that query too. Similarly, if you use "Preview table" to run SELECT * FROM testparts LIMIT 10, there is no optimization that skips building the list of partitions or skips listing the locations on S3.
Try running a query that doesn't wildcard any of the partition keys to validate that your config is correct.
Partition projection works differently from adding partitions to the catalog, and some care needs to be taken with wildcards. When partitions are in the catalog, non-existent partitions can be eliminated cheaply, but with partition projection S3 has to be listed for every permutation of partition key values that remains after predicates have been applied.
Partition projection works best when there are never wildcards on partition keys, to minimize the number of S3 listings that need to happen.
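For example, a query of this shape (the values are made up) pins all six partition keys, so Athena lists exactly one S3 prefix:

```sql
SELECT *
FROM testparts
WHERE year = 2022
  AND month = 5
  AND day = 25
  AND user = 'usera'
  AND thisid = 1
  AND account = 1;
```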
I am storing 400,000 Parquet files in S3, partitioned on a unique id (e.g. 412812). The files range in size from 25 kB to 250 kB. I then want to query the data using Athena, like so:
Select *
From Table
where id in (412812, 412813, 412814)
This query is much slower than anticipated. I want to be able to search for any set of ids and get a fast response. I believe it is slow because Athena must search through the entire Glue catalog looking for the right files (i.e., a full scan of files).
The following query is extremely fast. Less than a second.
Select *
From Table
where id = 412812
partition.filtering is enabled on the table. I tried adding an index to the table on the same column as the partition, but it did not speed anything up.
Is there something wrong with my approach or a table configuration that would make this process faster?
Your basic problem is that you have too many files and too many partitions.
While Amazon Athena does operate in parallel, there are limits to how many files it can process simultaneously. Plus, each extra file adds overhead for listing, opening, etc.
Also, putting just a single file in each partition greatly adds to the overhead of handling so many partitions and is probably counterproductive for increasing the efficiency of the system.
I have no idea how you actually use your data, but based on your description I would recommend creating a new table that is bucketed_by the id, rather than partitioned:
CREATE TABLE new_table
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
external_location = 's3://bucket/new_location/',
bucketed_by = ARRAY['id']
)
AS SELECT * FROM existing_table
Let Athena create as many files as it likes -- it will optimize based upon the amount of data. More importantly, it will create larger files that allow it to operate more efficiently.
See: Bucketing vs Partitioning - Amazon Athena
In general, partitions are great when you can divide the data into some major subsets (eg by country, state or something that represents a sizeable chunk of your data), while bucketing is better for fields that have values that are relatively uncommon (eg user IDs). Bucketing will create multiple files and Athena will be smart enough to know which files contain the IDs you want. However, it will not be partitioned into subdirectories based upon those values.
Creating this new table will greatly reduce the number of files that Amazon Athena would need to process for each query, which will make your queries run a LOT faster.
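Assuming the bucketed table created by the CTAS above, the original query can then be run unchanged against it (new_table as in the example):

```sql
-- Athena hashes each requested id to its bucket and reads only those files.
SELECT *
FROM new_table
WHERE id IN (412812, 412813, 412814);
```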
I have an Athena table of data in S3 that acts as a source table, with columns id, name, event. For every unique name value in this table, I would like to output a new table with all of the rows corresponding to that name value, and save it to a different bucket in S3. This will result in n new files stored in S3, where n is the number of unique name values in the source table.
I have tried single Athena queries in Lambda using PARTITION BY and CTAS queries, but can't seem to get the result I want. It seems AWS Glue may be able to produce my expected result, but I've read online that it's more expensive, and that perhaps I can get my expected result using Lambda.
How can I store a new file (JSON format, preferably) that contains all rows corresponding to each unique name in S3?
Preferably I would run this once a day to update the data stored by name, but the question above is the main concern for now.
When you write your Spark/Glue code you will need to partition the data using the name column. However, this will result in paths of the following format:
s3://bucketname/folder/name=value/file.json
This should give a separate set of files for each name value, but if you want to access each set as a separate table you might need to remove the = sign from the key before you crawl the data and make it available via Athena.
If you do use a Lambda, the operation involves going through the data and partitioning it, similar to what Glue does.
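The core of such a Lambda could be sketched roughly like this (pure Python; the row structure is illustrative and the write callback stands in for an s3.put_object call):

```python
import json
from collections import defaultdict

def split_by_name(rows):
    """Group rows by their 'name' value; one group per unique name."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["name"]].append(row)
    return groups

def write_groups(groups, write):
    # 'write' stands in for an S3 upload, e.g.
    # s3.put_object(Bucket=..., Key=f"{name}.json", Body=body)
    for name, rows in groups.items():
        write(f"{name}.json", json.dumps(rows))

rows = [
    {"id": 1, "name": "alice", "event": "login"},
    {"id": 2, "name": "bob", "event": "click"},
    {"id": 3, "name": "alice", "event": "logout"},
]
groups = split_by_name(rows)
# groups["alice"] holds two rows, groups["bob"] holds one
```

Running this once a day (as mentioned in the question) would simply re-read the source and rewrite one JSON object per name.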
I guess it all depends on the volume of data that needs to be processed. Glue, if using Spark, may have a little extra start-up time; Glue Python shells have comparatively better start-up times.
I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe would require two-level partitioning, which is not supported yet.
You can create a column-partitioned table: https://cloud.google.com/bigquery/docs/creating-column-partitions
Then, before inserting, populate that column with the value that was previously used for partitioning; in this case, however, you lose the _PARTITIONTIME value.
Based on the additional clarification: I had a similar problem, and my solution was to write a Python application that reads the source table (read is important here, not query, so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (also free, but with some limits on the number of such operations). The second route requires more coding and exception handling.
You can also do it via Dataflow; it will definitely be more expensive than a custom solution but potentially more robust.
Example using the google-cloud-bigquery Python library:
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")
t1 = client.get_table(source_table_ref)
# Drop the first column, which is the key used to split the data
target_schema = t1.schema[1:]
ds_target = client.dataset(target_dataset, project=target_project)
# Reading rows directly (list_rows) is free, unlike running a query
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)
# ... split rows_to_process by the key column into records_to_stream ...
# stream records to destination (billed); note that create_rows was renamed
# insert_rows in newer versions of google-cloud-bigquery
errors = client.create_rows(target_table, records_to_stream)
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
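As a sketch (dataset, table, and column names are made up, and this assumes you have a timestamp column to partition on instead of relying on ingestion time):

```sql
-- Partition on a date derived from a timestamp column, and cluster on the
-- column you would otherwise have split the table by.
CREATE TABLE mydataset.events_split
PARTITION BY DATE(event_ts)
CLUSTER BY category
AS SELECT * FROM mydataset.events;
```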
I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
- If a table explicitly contains a reference to the last change time, this can be used.
- If a table has guaranteed update characteristics (e.g. no in-place updates and monotonically increasing ID values), this can be used: e.g. read all records whose ID is larger than the last processed ID.
- If the table does not provide intrinsic information about the change time, one can maintain a copy of the table containing only the records processed so far. This copy can then be compared against the current table to compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
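The second option might be sketched like this (table and column names are illustrative):

```sql
-- Fetch only rows added since the last delta run; :last_max_id is the
-- highest ID captured by the previous extraction.
SELECT *
FROM orders
WHERE order_id > :last_max_id
ORDER BY order_id;
```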
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming, and expensive than using the corresponding features of ETL tools.
It is possible to create a log table with columns organized according to your needs, so that by creating triggers on your database tables you can write a log record with timestamp values. Then you can query the log table to determine which records were inserted, updated, or deleted in your source tables.
For example, the following is one of my test trigger codes:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (
        Employee,
        Salary,
        Operation,
        DateTime
    ) VALUES (
        :mynewrow.Employee,
        :mynewrow.Salary,
        'U',
        CURRENT_DATE
    );
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your log table so that you can track more than one table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate log tables for each source table.