Athena Partition Projection Not Working As Expected

I am moving from registered partitions to partition projection.
Previously my data was partitioned by p_year={yyyy}/p_month={MM}/p_day={dd}/p_hour={HH}/... and I am moving it to p_date={yyyy}-{MM}-{dd} {HH}:00:00/..
I have a recent-events table that stores the last 2 days' worth of events, so my p_date range is NOW-2DAYS,NOW. The full table parameters are:
projection.enabled: 'True'
projection.p_date.type: 'date'
projection.p_date.range: NOW-2DAYS,NOW
projection.p_date.format: 'yyyy-MM-dd HH:mm:ss'
projection.p_date.interval: 1
projection.p_date.interval.unit: 'HOURS'
But when I try to query this, I get no results.
SELECT COUNT(*) FROM recent_events_2d_v2
> 0
However, if I change the date range to 2020-09-01 00:00:00,NOW I do get results.
Something seems off with relative date ranges in partition projection. Can anyone see what I may be doing wrong, or is this a bug?

You need to change your date format to 'yyyy-MM-dd HH:\'00:00\'' (i.e. literal "00:00" instead of minutes and seconds placeholders).
The way partition projection deals with dates leaves some things to be desired. It seems reasonable that if you say the interval is one hour, the timestamps would get rounded to the nearest hour, but that's not what happens. Athena will use the actual "now" to generate the partition values, and if your date format contains fields for minutes and seconds, those will be filled in too.
I assume the reason it worked when you used a hard-coded timestamp is that Athena uses that value as the seed for the sequence, so all the other timestamps are also aligned to the hour.
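For illustration, with that format applied the table properties from the question would read as follows (everything except the format string is unchanged):
projection.enabled: 'True'
projection.p_date.type: 'date'
projection.p_date.range: NOW-2DAYS,NOW
projection.p_date.format: 'yyyy-MM-dd HH:\'00:00\''
projection.p_date.interval: 1
projection.p_date.interval.unit: 'HOURS'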

If you are sure your bucket prefix p_date={yyyy}-{MM}-{dd} {HH}:00:00/.. contains data, then you need to make sure the partitions are correctly loaded. Try running
MSCK REPAIR TABLE recent_events_2d_v2
and rerun the query.

Related

Power BI: Relative Time Under 5 Hours returns no data

I have a PBI Desktop dashboard I've created to pull machine data from a local SQL Server. I'm using a relative date/time filter on one of the pages to drill down into data for a live feed; however, for anything under 5 hours of relative time, the data goes blank.
I use 4 log tables for the raw data, each having its own timestamp for each instance. Each is related via an ID table that contains other general information. In addition, time is related using a calculated table that creates a timeframe of all instances:
Relationship Model
DateTable = DISTINCT(UNION(SUMMARIZE(LogFault, LogFault[Time]), SUMMARIZE(LogGood, LogGood[Time]), SUMMARIZE(LogReject, LogReject[Time]), SUMMARIZE(LogState, LogState[Time])))
5 Hours Relative Time
4 hours relative time
As you can see from the top right of the images, not even the times are pulled onto the page. Is there a limitation in PBI on the relative time function? That wouldn't make sense to me, given there is a "minutes" option under relative time. Any feedback on this would be appreciated.
For those looking at this in the future: unfortunately, Power BI Desktop, along with the service, appears to work only in the UTC time zone. So the relative date/time filter was filtering based on UTC, not my time zone (EST). To resolve this, I had to create a new calculated column next to my distinct timestamps to correct for the time zone. I then used the adjusted time for the relative time filtering, but the charts remained on the original timestamps.
UTC to EST time zone adjust
UTC_AdjustTZ = FORMAT(DateTable[Time]+TIME(4,0,0),"General Date")
Chart Example after adjust
Chart after fix implemented
This is probably because your filter on the Date Table doesn't reach the intended table. Normally a filter moves from the one side to the many side of each relationship in a chain, but in your case, for example, the filter goes from the Date Table to LogReject and then can't move on to RejectDefinitions because of the filter direction. You have 2 options here:
1) Change the model relationships: make LogReject the one side and RejectDefinitions the many side, if that is possible.
OR
2) Set the filter direction to Both in the model.
You need to do this for all the remaining log tables (LogFault-FaultDefinitions, LogState-StateDefinitions).
I hope this solves your problem. Please check that your model is not ambiguous after making those changes.

How would one go about creating a due-by attribute in Redshift

I am currently trying to calculate due-by dates in a table by adding the SLA time to the time the request was created. From what I understand, the way to go about this is to create a table with the work days and hours and query that table to find the due date. However, Redshift does not allow one to declare variables. I was wondering how I would go about creating a work-hour table in Redshift and, if that is not possible, how I would calculate the due date by other means. Thanks!
It appears that you would like to provide a timestamp and then calculate the timestamp that is 'n work hours later', most probably taking into account certain rules such as:
Weekdays: 9am-5pm
Weekends: No Hours
Holidays: Occasional weekdays with No Hours
This could be done by creating a scalar Python UDF (see "Creating a scalar Python UDF" in the Amazon Redshift documentation) that would be passed a 'start' timestamp and a number of hours, and would return the 'end' timestamp.
Please note that Scalar UDFs cannot access tables or 'call outside' of Redshift, so it would need to be self-contained.
There is code on the web that shows how to find the number of hours between two dates excluding weekends and certain holidays in Python (see the Stack Overflow question "How to find the number of hours between two dates excluding weekends and certain holidays in Python? BusinessHours package"). You would need to modify such code so that it applies a given duration to a start timestamp rather than finding the duration between two timestamps.
The alternate method of "creating a work hour table" would work well when trying to find the number of work hours between two timestamps, but would be a bit harder to use when trying to add work hours to a timestamp.
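As a rough illustration of the scalar UDF approach mentioned above, here is a minimal, self-contained Python sketch of the kind of logic the UDF body could wrap; the function name and the business rules (Mon-Fri, 09:00-17:00, no holidays) are assumptions, not anything Redshift provides:
from datetime import datetime, timedelta

def add_work_hours(start, hours_to_add):
    # Hypothetical rules: work hours are Mon-Fri, 09:00-17:00, no holidays.
    WORK_START, WORK_END = 9, 17
    remaining = timedelta(hours=hours_to_add)
    current = start
    while remaining > timedelta(0):
        # Outside business days or past closing time: jump to 09:00 next day.
        if current.weekday() >= 5 or current.hour >= WORK_END:
            current = (current + timedelta(days=1)).replace(
                hour=WORK_START, minute=0, second=0, microsecond=0)
            continue
        # Before opening time: jump forward to 09:00 the same day.
        if current.hour < WORK_START:
            current = current.replace(hour=WORK_START, minute=0,
                                      second=0, microsecond=0)
            continue
        # Consume as much of the current working day as possible.
        end_of_day = current.replace(hour=WORK_END, minute=0,
                                     second=0, microsecond=0)
        step = min(end_of_day - current, remaining)
        current += step
        remaining -= step
    return current

# Example: 4 work hours after Friday 2020-09-04 16:00 lands on Monday 12:00.
print(add_work_hours(datetime(2020, 9, 4, 16, 0), 4))
In Redshift, logic like this would sit inside the body of a CREATE FUNCTION ... LANGUAGE plpythonu definition that takes the start timestamp and the number of hours as its arguments.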

Best partitioning method for multiple devices and timestamps

In my organisation we have multiple devices sending data every second. The data is processed and partitioned in AWS S3 like this: /year=YYYY/month=MM/day=DD/file.csv.
Using AWS Athena we run queries like this: SELECT col1, col2, coln FROM data WHERE year = 'YYYY' AND month = 'MM' AND day = 'dd' AND device_id = 123 to retrieve data from one device for some period within a day. Sometimes we also need to get data from multiple devices (device_id IN (...)) and at different times. Note that the columns device_id and ts exist in the dataset, and only ts is used to generate the partitions.
Here's my question:
Will this method of partitioning be efficient in the long term? At this time we only have about 150 active devices, but we plan to scale to 1000 and more. Considering that the query pattern would stay the same (get data for some device at a certain time), would it be better to partition by device_id and then by date (/device_id/year=YYYY/month=MM/day=DD/file.csv)?
The partitioning is very good for your supplied query -- it will only need to look in one subdirectory for that single day of data.
However, if you were querying for a specific device across all time (without specifying a month/day), then it would not be efficient.
You will need to decide what is going to be more common:
If a specific device will always be queried, then partition by Device, then Date
If a specific day/month will always be queried, then your current method is fine (possibly with an additional partition of device after Day)
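For illustration, the two layouts being compared would place the same object under keys like these (the bucket name, device id and file name are made up):
s3://my-bucket/year=2020/month=09/day=01/file.csv (current layout: date first)
s3://my-bucket/device_id=123/year=2020/month=09/day=01/file.csv (alternative layout: device first)
With the device-first layout, a query for one device across all time scans only that device's prefix, while a query for every device on a single day has to touch each device's date subtree.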

AWS IoT Analytics queries for retrieving data from dataset using boto3

Can we use a query while retrieving data from the dataset in AWS IoT Analytics? I want data between 2 timestamps. I'm using boto3 to fetch the data. I didn't see any option to use a query in get_dataset_content. Below is the boto3 code:
response = client.get_dataset_content(
datasetName='string',
versionId='string'
)
Does anyone have suggestions on how to use a query, or how to retrieve the data between 2 timestamps in AWS IoT Analytics?
Thanks,
Pankaj
There could be a few ways to do this depending on what your workflow is; if you have a few more details, that would be helpful.
Possible approaches are:
1) Create a scheduled query to run every hour (for example) where the query looks something like this:
SELECT * FROM my_datastore WHERE __dt >= current_date - interval '1' day
AND my_timestamp >= now() - interval '1' hour
You may need to adjust the format of the timestamp to suit, depending on how you are storing it (epoch seconds, epoch milliseconds, ISO 8601 etc.). If you set this to run every hour, each time it executes you will get the last one hour of data. Note that the __dt constraint just helps your query run faster (and cheaper) by limiting the scan to the most recent day only.
2) You can improve on the above by using the delta window feature of the dataset, which lets you more easily get just the data that has arrived since the query last ran. You could then simplify your query to look like:
select * from my_datastore where __dt >= current_date - interval '1' day
And configure the delta time window to look at your timestamp field. You then control how much data is retrieved by the frequency at which you execute the query (every 15 mins, every hour etc).
3) If you have a more general-purpose requirement to fetch the data between 2 timestamps that you are calculating programmatically, and that may not be of the form now() - some interval, the way you could do this is to create a dataset and then update the dataset with the revised SQL expression before running it with create-dataset-content. That way the dataset content is updated with just the results you need on each execution. If this is of interest, I can expand upon the actual Python required; a rough sketch is included after this list.
4) As Thomas suggested, it can often be just as easy to pull out a larger chunk of data with the dataset (for example the last day) and then filter down to the timestamp you want in code. This is particularly easy if you are using pandas DataFrames, for example, and there are plenty of related questions, such as this one, that have good answers.
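To illustrate option 3, here is a rough boto3 sketch; the dataset name 'my_dataset', action name 'my_query_action', datastore 'my_datastore' and timestamp column 'my_timestamp' are all placeholders to adapt to your setup:
import boto3

client = boto3.client('iotanalytics')

start_ts = '2020-09-01 00:00:00'
end_ts = '2020-09-02 00:00:00'

# Rewrite the dataset's SQL action with the new time window. Adjust the
# timestamp comparison if your column is stored as epoch seconds/millis.
client.update_dataset(
    datasetName='my_dataset',
    actions=[{
        'actionName': 'my_query_action',
        'queryAction': {
            'sqlQuery': (
                "SELECT * FROM my_datastore "
                f"WHERE __dt >= date '{start_ts[:10]}' "
                f"AND my_timestamp BETWEEN timestamp '{start_ts}' "
                f"AND timestamp '{end_ts}'"
            )
        }
    }]
)

# Then materialise a new piece of content for the dataset.
client.create_dataset_content(datasetName='my_dataset')
If your dataset also has triggers, retention or content delivery rules configured, you may need to pass those along in the same update_dataset call as well.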
Frankly, the easiest thing would be to do your own time filtering (the result of get_dataset_content is a csv file).
That's what QuickSight does to allow you to navigate the dataset in time.
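A minimal sketch of that local filtering, assuming pandas is available and using placeholder names ('my_dataset', 'my_timestamp'):
import boto3
import pandas as pd

client = boto3.client('iotanalytics')

# The content entries carry pre-signed URLs pointing at the CSV result file(s).
content = client.get_dataset_content(datasetName='my_dataset',
                                      versionId='$LATEST_SUCCEEDED')
df = pd.read_csv(content['entries'][0]['dataURI'])

# Filter down to the window of interest locally.
df['my_timestamp'] = pd.to_datetime(df['my_timestamp'])
window = df[(df['my_timestamp'] >= '2020-09-01 00:00:00') &
            (df['my_timestamp'] < '2020-09-02 00:00:00')]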
If this isn't feasible, the alternative is to reprocess the datastore with an updated pipeline that filters out everything except the time range you're interested in (more information here). You should note that while it's tempting to use the startTime and endTime parameters for StartPipelineReprocessing, these are only approximate to the nearest hour.

Merging data with a dates table gives strange behaviour when some record field is added

I'm using Crystal Reports again after not touching it for about 8 years.
I'm having this situation...
I have 1 data table, and 1 table with just day numbers from 1 to 31.
Neither table is really linked to the other.
In my report I let the user select a reference date.
From that date I grab the maximum days of the month.
The report lists a row per day of that month, but there are no actual database fields in there, just the first 2 letters of the day name, the day number, and another formula-based field showing 'yes/no' or '' depending on a main record value.
So far so good.
In the group header I was adding the fields from the main data table, which all went fine until I added fields that, in the query on the SQL Server, rely on some CASE expressions; CR just reads it out as 1 single record row with everything in it.
For some reason the report generation goes from 1-2 seconds to 30-40 seconds once I add that field that just outputs 'X' or '' (it represents things assigned to that user).
Other reports where I'm using the same data still generate in 2 seconds.
To get this working right and to eliminate double date records, I'm stuck with 3 groups.
I think this isn't optimal and is the reason for the slowdown, although it wasn't there at the start.
So I was wondering:
Should I go for a sub report for the day listing?
Can I feed the subreport with my date parameter?
or is there some kind of scripted way to list a row x-times without all the grouping requirements?
Synchro was right, the problem was in the actual query/view.
For some reason the view takes half a minute longer if you just add an ORDER BY on a specific field.
The "where id between 211 and 265 or id=67" clause has been moved from a joined view to the actual query.
Thanks for the hint, Synchro.