DynamoDB date GSI - amazon-web-services

I have a DynamoDB table that stores executions of some programs, this is what it looks like:
Partition Key | Sort Key            | StartDate           | ...
program-name  | execution-id (uuid) | YYYY-MM-DD HH:mm:ss | ...
I have two query scenarios for this table:
Query by program name and execution id (easy)
Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59
What is the correct way to perform the second query?
I understand I need to create a GSI to do that, but what should this GSI look like?
I was thinking about splitting the StartDate attribute into two, like this:
Partition Key | Sort Key            | StartMonthYear | StartDayTime | ...
program-name  | execution-id (uuid) | YYYY-MM        | DD HH:mm:ss  | ...
So I can define a GSI using the StartMonthYear as the partition key and the StartDayTime as the sort key.
The only problem with this approach is that I would have to write some extra logic in my application to identify all the partitions I would need to query in the requested range. For example:
If the range is: 2021-05-15 00:00:00 to 2021-07-15 23:59:59
I would need to query the 2021-05, 2021-06 and 2021-07 partitions, with day/time restrictions applied only to the first and last partitions in this example.
Is this the correct way of doing this or am I totally wrong?

If you quickly want to fetch all executions in a certain time-frame no matter the program, there are a few ways to approach this.
The easiest solution would be a setup like this:
PK          | SK          | GSI1PK         | GSI1SK                             | StartDate
PROG#<name> | EXEC#<uuid> | ALL_EXECUTIONS | S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid> | yyyy-mm-ddThh:mm:ss
PK is the partition key for the base table
SK is the sort key for the base table
GSI1PK is the partition key for the global secondary index GSI1
GSI1SK is the sort key for the global secondary index GSI1
Query by program name and execution id (easy)
Still easy, do a GetItem based on the program name for <name> and uuid for <uuid>.
Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59
Do a Query on GSI1 with the KeyConditionExpression: GSI1PK = 'ALL_EXECUTIONS' AND GSI1SK BETWEEN 'S#2021-05-15 00:00:00' AND 'S#2021-07-15 23:59:59'. This returns all the executions in the given time range.
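As a sketch of what that call looks like in code, here is a hypothetical Python helper that builds the Query parameters (the index and attribute names follow the table sketch above; the table name is an assumption):

```python
def gsi1_range_query(start, end):
    """Build the Query parameters for GSI1 (index and attribute
    names are the illustrative ones from the table sketch)."""
    return {
        "IndexName": "GSI1",
        "KeyConditionExpression": "GSI1PK = :pk AND GSI1SK BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":pk": "ALL_EXECUTIONS",
            ":lo": f"S#{start}",
            ":hi": f"S#{end}",
        },
    }

# usable with a boto3 Table resource, e.g.:
# boto3.resource("dynamodb").Table("executions").query(
#     **gsi1_range_query("2021-05-15 00:00:00", "2021-07-15 23:59:59"))
```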
But: You'll also build a hot partition, since you effectively write all your data in a single partition in GSI1.
To avoid that, we can partition the data a bit and the partitioning depends on the number of executions you're dealing with. You can choose years, months, days, hours, minutes or seconds.
Instead of GSI1PK just being ALL_EXECUTIONS, we can set it to a subset of the StartDate.
PK          | SK          | GSI1PK           | GSI1SK                             | StartDate
PROG#<name> | EXEC#<uuid> | EXCTS#<yyyy-mm>  | S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid> | yyyy-mm-ddThh:mm:ss
In this case you'd have a monthly partition, i.e.: all executions per month are grouped. Now you would have to make multiple queries to DynamoDB and later join the results.
For the query range from 2021-05-15 00:00:00 to 2021-07-15 23:59:59 you'd have to do these queries on GSI1:
#GSI1: GSI1PK=EXCTS#2021-05 AND GSI1SK >= S#2021-05-15 00:00:00
#GSI1: GSI1PK=EXCTS#2021-06
#GSI1: GSI1PK=EXCTS#2021-07 AND GSI1SK <= S#2021-07-15 23:59:59
You can even parallelize these and later join the results together.
Again: Your partitioning scheme depends on the number of executions you have in a day and also which maximum query ranges you want to support.
This is a long-winded way of saying that your approach is correct in principle, but you can choose to tune it based on your use case.
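The partition-enumeration logic for the monthly scheme can be sketched in Python (the EXCTS#<yyyy-mm> key format follows the example above):

```python
from datetime import date

def month_partitions(start: date, end: date):
    """List the EXCTS#<yyyy-mm> partition keys covering [start, end]
    (monthly partitioning scheme from the answer; names are illustrative)."""
    keys = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        keys.append(f"EXCTS#{y:04d}-{m:02d}")
        # step to the next month, rolling over the year at December
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return keys
```

For the first and last month you would additionally bound GSI1SK (>= S#<start> and <= S#<end>, respectively); the middle months are queried whole. The per-month queries are independent, so they can run in parallel (e.g. via a thread pool) and be concatenated afterwards.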

Related

Why can't BigQuery handle a query processing 4TB of data?

I'm trying to run this query
SELECT
id AS id,
ARRAY_AGG(DISTINCT users_ids) AS users_ids,
MAX(date) AS date
FROM
users,
UNNEST(users_ids) AS users_ids
WHERE
users_ids != " 1111"
AND users_ids != " 2222"
GROUP BY
id;
The users table is a sharded table with an id column, a users_ids (comma-separated) column and a date column; it is over 4TB in total, and the query gives me:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations.
Any idea why?
id | userids | date
1  | 2,3,4   | 1-10-20
2  | 4,5,6   | 1-10-20
1  | 7,8,4   | 2-10-20
so the final result I'm trying to reach
id | userids   | date
1  | 2,3,4,7,8 | 2-10-20
2  | 4,5,6     | 1-10-20
Execution details:
It's constantly repartitioning - I would guess that you're trying to cram too much stuff into the aggregation part. Just remove the aggregation part - I don't even think you have to cross join here.
Use a subquery instead of this cross join + aggregation combo.
Edit: just realized that you want to aggregate the arrays but with distinct values
WITH t AS (
  SELECT
    id,
    ARRAY_CONCAT_AGG(ARRAY(
      SELECT DISTINCT uid FROM UNNEST(users_ids) AS uid
      WHERE uid != " 1111" AND uid != " 2222"
    )) AS users_ids,
    MAX(date) AS date
  FROM
    users
  GROUP BY id
)
SELECT
  id,
  ARRAY(SELECT DISTINCT * FROM UNNEST(users_ids)) AS users_ids,
  date
FROM t
This is just a draft (I assume id is the grouping key), but it should be something along those lines. Grouping by arrays is not possible ...
array_concat_agg() has no DISTINCT, so the deduplication comes in a second step.
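To make the two-step idea concrete, here is a small Python sketch of the same aggregation (concatenate per id, then deduplicate), using the sample rows from the question:

```python
from collections import defaultdict

# sample rows from the question: (id, users_ids array, date)
rows = [
    (1, ["2", "3", "4"], "1-10-20"),
    (2, ["4", "5", "6"], "1-10-20"),
    (1, ["7", "8", "4"], "2-10-20"),
]

# step 1: concatenate arrays per id (what ARRAY_CONCAT_AGG does)
# and track MAX(date) along the way
concat, max_date = defaultdict(list), {}
for id_, uids, d in rows:
    concat[id_].extend(u for u in uids if u not in (" 1111", " 2222"))
    max_date[id_] = max(max_date.get(id_, d), d)

# step 2: deduplicate in a second pass, as the outer SELECT DISTINCT does
result = {i: (sorted(set(v)), max_date[i]) for i, v in concat.items()}
```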

How to select data from aws athena table which is partitioned like 'year=yyyy/month=MM/date=dd/' for a given date range?

Athena tables are partitioned in the same way as the S3 folder paths:
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9
parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14
PARTITIONED BY (
`parent` string,
`year` int,
`month` tinyint,
`date` tinyint)
Now how should I form the where condition for a select query to get data for parent = "9ab4fcca-65d8-11ea-bc55-0242ac130003" from 2019-06-01 to 2020-04-31 ?
SELECT *
FROM table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' AND year >= 2019 AND year <= 2020 AND month >= 04 AND month <= 06 AND date >= 01 AND date <= 31 ;
But this isn't correct. Please help
Partitioning on year, month, and day separately makes it unnecessarily difficult to query tables. If you're starting out I really suggest to avoid this kind of partitioning scheme. If you can't avoid it you can still make things easier by creating the table partitions differently.
Most guides will tell you to create directory structures like year=2020/month=4/date=1/file1, create a table with three corresponding partition columns, and then run MSCK REPAIR TABLE to load partitions. This works, but it's far from the best way to use Athena. MSCK REPAIR TABLE has atrocious performance, and partitioning like that is far from ideal.
I suggest creating directory structures that are just 2020-03-01/file1, but if you can't, any structure where there is one distinct prefix per date (2020/03/01/file1, year=2020/month=4/date=1/file1, or anything else) will work more or less equally well.
I also suggest you create tables with only one partition column: date (or dt or day if you want avoid quoting), typed as DATE, not string.
What you do then, instead of running MSCK REPAIR TABLE, is use ALTER TABLE … ADD PARTITION or the Glue APIs directly to add partitions. This command lets you specify the location separately from the partition column value:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/2020-04-01/'
The important thing here is that the partition column value doesn't have to have any relationship at all with the location, this would work equally well:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/data-for-first-of-april/'
For your specific case you could have:
PARTITIONED BY (`parent` string, `day` date)
and then do:
ALTER TABLE your_table ADD
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-17') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-09') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9'
PARTITION (parent = '0fc966a0-bba7-4c0b-a648-cff7f0332059', day = '2020-04-16') LOCATION 's3://your-bucket/parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-14') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14'
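Writing those ALTER TABLE statements by hand gets tedious with many partitions; a small Python sketch (hypothetical helper, matching the parent/day scheme above) that renders the same DDL from a list of locations:

```python
def add_partition_ddl(table, partitions):
    """Render an ALTER TABLE ... ADD PARTITION statement from a list of
    ((parent, day), s3_location) pairs (illustrative helper)."""
    clauses = "\n".join(
        f"PARTITION (parent = '{parent}', day = '{day}') LOCATION '{loc}'"
        for (parent, day), loc in partitions
    )
    return f"ALTER TABLE {table} ADD\n{clauses}"
```

The resulting string can then be submitted as a single Athena DDL query.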
Here is how you can use the year, month and day values that come from the partitions in order to select a date range:
SELECT col1, col2
FROM my_table
WHERE CAST(date_parse(concat(CAST(year AS VARCHAR(4)), '-',
                             CAST(month AS VARCHAR(2)), '-',
                             CAST(date AS VARCHAR(2))
           ), '%Y-%m-%d') AS DATE)
      BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'
You can add additional filter statements as needed.

How to calculate accumulated time for a defined frequency?

I have rows containing descriptions of services that have been ordered by our customers.
Table:
OrderedServices
Columns:
Id (key)
CustomerId
ServiceId
StartDate
EndDate
AmountOfTimeOrdered (hours)
IntervalType (month, week or day)
Interval (integer)
An example:
1;24343;98;2020-01-20;2020-06-05;1.5;day;3
The above is read as ”Customer w/ id 24343 has ordered service #98 to be executed 1.5hrs every 3rd day during the period 2020-01-20 up until 2020-06-05”
The first day of execution is always StartDate, so, in the given example, the service is first executed on 2020-01-20, followed by 2020-01-23 (20+3), 2020-01-26, 2020-01-29 and so on.
Now I want to calculate the total amount of time executed for a given ServiceType for a given time period.
E.g. for 2020-01-01 - 2020-01-31: 4 x 1.5 = 6 hrs of total executed time for the above.
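For reference, the counting logic behind that example can be sketched procedurally; Python is used here only to illustrate the rule (first execution on StartDate, then every Interval days), not as a Power BI solution, and it covers only IntervalType = 'day':

```python
from datetime import date, timedelta

def executed_hours(start, end, hours, interval_days, win_start, win_end):
    """Sum the ordered hours for executions falling inside
    [win_start, win_end]; the first execution is always on start,
    then every interval_days days (sketch for IntervalType = 'day')."""
    total, d = 0.0, start
    while d <= min(end, win_end):
        if d >= win_start:
            total += hours
        d += timedelta(days=interval_days)
    return total
```

For the example row this counts the executions on 2020-01-20, -23, -26 and -29, giving 4 x 1.5 = 6 hours.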
What I can’t figure out is how to create a measure, or a calculated table to achieve this.
Does anyone have an idea?
Kind regards,
Peter
Go to the query editor and use the following steps:
If your column looks like in your example, use Split Column by Delimiter as the first step.
After this just add the following custom column:

DynamoDB sort order of Date RangeKey

I have a DynamoDB table with the following key values: A simple string id as HashKey and a string representing a Date as RangeKey. The date string is in YYYY-MM-DD format.
I am now wondering how DynamoDB orders its entries. When I query for multiple RangeKey values on the same HashKey the result is ordered by the date ascending.
However, according to the DynamoDB documentation, it orders all string range keys by their UTF-8 byte values.
When I now save the following RangeKey entries:
2019-01-01
2018-12-04
2018-12-05
The output of a simple DynamoDBMapper.query(...) results in the correct order:
2018-12-04
2018-12-05
2019-01-01
Is Dynamo ordering the RangeKeys by date or is the byte value calculated a way that it matches with the date representation?
It's sorting in UTF-8 byte order. DynamoDB has no idea that you are sorting dates; to DynamoDB, it's just a string.
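The two orders coincide because zero-padded YYYY-MM-DD strings put the most significant field first, so byte-wise comparison is chronological comparison. A quick Python check:

```python
keys = ["2019-01-01", "2018-12-04", "2018-12-05"]

# DynamoDB compares string range keys byte by byte; for zero-padded
# YYYY-MM-DD strings that byte order matches chronological order
print(sorted(keys))  # ['2018-12-04', '2018-12-05', '2019-01-01']
```

This breaks down for formats that are not zero-padded or that put smaller fields first (e.g. DD-MM-YYYY), which would sort in byte order but not date order.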

SHOW PARTITIONS with order by in Amazon Athena

I have this query:
SHOW PARTITIONS tablename;
Result is:
dt=2018-01-12
dt=2018-01-20
dt=2018-05-21
dt=2018-04-07
dt=2018-01-03
This gives the list of partitions per table. The partition field for this table is dt which is a date column. I want to see the partitions ordered.
The documentation doesn't explain how to do it:
https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html
I tried to add order by:
SHOW PARTITIONS tablename order by dt;
But it gives:
AmazonAthena; Status Code: 400; Error Code: InvalidRequestException;
AWS currently (as of Nov 2020) supports two versions of the Athena engines. How one selects and orders partitions depends upon which version is used.
Version 1:
Use the information_schema table. Assuming you have year, month as partitions (with one partition key, this is of course simpler):
WITH a AS (
  SELECT partition_number AS pn, partition_key AS key, partition_value AS val
  FROM information_schema.__internal_partitions__
  WHERE table_schema = 'my_database'
    AND table_name = 'my_table'
)
SELECT year, month
FROM (
  SELECT val AS year, pn FROM a WHERE key = 'year'
) y
JOIN (
  SELECT val AS month, pn FROM a WHERE key = 'month'
) m ON m.pn = y.pn
ORDER BY year, month
which outputs:
year  month
2018  10
2018  11
2018  12
2019  01
...
Version 2:
Use the built-in $partitions functionality, where the partitions are explicitly available as columns and the syntax is much simpler:
SELECT year, month FROM my_database."my_table$partitions" ORDER BY year, month
year  month
2018  10
2018  11
2018  12
2019  01
...
For more information, see:
https://docs.aws.amazon.com/athena/latest/ug/querying-glue-catalog.html#querying-glue-catalog-listing-partitions
From your comment it sounds like you're looking to sort the partitions as a way to figure out whether or not a specific partition exists. For this purpose I suggest you use the Glue API instead of querying Athena. Run aws glue get-partition help or check your preferred SDK's documentation for how it works.
There is also a variant to list all partitions of a table, run aws glue get-partitions help to read more about that. I don't think it returns the partitions in alphabetical order, but it has operators for filtering.
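As a sketch of the Glue route, here is how the paginated GetPartitions API can be used from boto3, sorting client-side (client creation and AWS credentials are assumed; the function names on the client are real, the helper is hypothetical):

```python
def sorted_partition_values(glue, database, table):
    """Collect every partition's value list via the paginated
    GetPartitions API and sort client-side; `glue` is a boto3 Glue
    client, e.g. boto3.client("glue")."""
    values = []
    for page in glue.get_paginator("get_partitions").paginate(
        DatabaseName=database, TableName=table
    ):
        values.extend(tuple(p["Values"]) for p in page["Partitions"])
    return sorted(values)
```

GetPartitions also accepts an Expression parameter for server-side filtering, which is usually a better fit than sorting when you only need to know whether one partition exists.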
The SHOW PARTITIONS command will not allow you to order the result, since this command does not produce a result set to sort; it only produces string output.
You can on the other hand query the partition column and then order the result by value.
select distinct dt from tablename order by dt asc;