How would one go about creating a due-by attribute in Redshift - amazon-web-services

I am currently trying to calculate due-by dates in a table by adding the SLA time to the time the request was created. From what I understand, the way to go about this is to create a table with the work days and hours and query that table to find the due date. However, Redshift does not allow one to declare variables. I was wondering how I would go about creating a work-hour table in Redshift and, if that is not possible, how I would calculate the due date by other means. Thanks!

It appears that you would like to provide a timestamp and then calculate the timestamp that is 'n work hours later', most probably taking into account certain rules such as:
Weekdays: 9am-5pm
Weekends: No Hours
Holidays: Occasional weekdays with No Hours
This could be done by creating a scalar Python UDF (see "Creating a scalar Python UDF" in the Amazon Redshift documentation) that would be passed a 'start' timestamp and a number of hours, and would return the 'end' timestamp.
Please note that Scalar UDFs cannot access tables or 'call outside' of Redshift, so it would need to be self-contained.
There is code on the web that shows how to find the number of hours between two dates excluding weekends and certain holidays in Python (see the BusinessHours package question on Stack Overflow). You would need to modify such code to specify the duration rather than find the duration.
The alternate method of "creating a work hour table" would work well when trying to find the number of work hours between two timestamps, but would be a bit harder when trying to add work hours to a timestamp.
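As a rough, self-contained sketch of the kind of logic such a UDF body could contain (assuming the 9am-5pm weekday rules above and a hard-coded holiday set; the names and dates here are illustrative, not the BusinessHours package):

```python
from datetime import datetime, timedelta

# Illustrative holiday list; a real UDF would hard-code the actual dates,
# since scalar UDFs cannot look them up from a table.
HOLIDAYS = {"2024-01-01", "2024-12-25"}

WORK_START = 9   # 9am
WORK_END = 17    # 5pm

def add_work_hours(start, hours):
    """Return the timestamp 'hours' work hours after 'start'."""
    remaining = timedelta(hours=hours)
    current = start
    while remaining > timedelta(0):
        day_open = current.replace(hour=WORK_START, minute=0, second=0, microsecond=0)
        day_close = current.replace(hour=WORK_END, minute=0, second=0, microsecond=0)
        workday = current.weekday() < 5 and current.strftime("%Y-%m-%d") not in HOLIDAYS
        if not workday or current >= day_close:
            # Skip to the opening time of the next calendar day.
            current = day_open + timedelta(days=1)
            continue
        if current < day_open:
            current = day_open
        # Consume as many work hours as this day has left.
        step = min(day_close - current, remaining)
        current += step
        remaining -= step
    return current
```

For Redshift, the body of such a function would go inside a CREATE FUNCTION ... LANGUAGE plpythonu definition, accepting and returning TIMESTAMP values.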

Related

Athena Partition Projection Not Working As Expected

I am moving from registered partitions to partition projection.
Previously my data was partitioned by p_year={yyyy}/p_month={MM}/p_day={dd}/p_hour={HH}/... and I am moving these to p_date={yyyy}-{MM}-{dd} {HH}:00:00/..
I have a recent-events table that stores the last 2 days' worth of events, so my p_date range is NOW-2DAYS,NOW. The full table parameters are:
projection.enabled: 'True'
projection.p_date.type: 'date'
projection.p_date.range: NOW-2DAYS,NOW
projection.p_date.format: 'yyyy-MM-dd HH:mm:ss'
projection.p_date.interval: 1
projection.p_date.interval.unit: 'HOURS'
But when I try to query this, I get no results.
SELECT COUNT(*) FROM recent_events_2d_v2
> 0
However, if I change the date range to 2020-09-01 00:00:00,NOW I do get results.
Something seems off with the relative date ranges with partition projection. Can anyone see what I may be doing wrong, or is this a bug?
You need to change your date format to 'yyyy-MM-dd HH:\'00:00\'' (i.e. literal "00:00" instead of minutes and seconds placeholders).
The way partition projection deals with dates leaves some things to be desired. It seems reasonable that if you say the interval is one hour that the timestamps get rounded to the nearest hour, but that's not what happens. Athena will use the actual "now" to generate the partition values, and if your date format contains fields for minutes and seconds, those will be filled in too.
I assume the reason it worked when you used a hard-coded timestamp is that Athena uses that value as the seed for the sequence, so all other timestamps are also aligned to the hour.
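Putting it together, the corrected table properties would look something like the following (a sketch, not tested against your table; the doubled quotes follow the usual rule that literal text in a Java-style date format is wrapped in single quotes, which are escaped by doubling inside a SQL string):

```
projection.enabled = 'true',
projection.p_date.type = 'date',
projection.p_date.range = 'NOW-2DAYS,NOW',
projection.p_date.format = 'yyyy-MM-dd HH:''00:00''',
projection.p_date.interval = '1',
projection.p_date.interval.unit = 'HOURS'
```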
If you are sure your bucket p_date={yyyy}-{MM}-{dd} {HH}:00:00/.. contains data, then you need to make sure that partitions are correctly loaded. Try running
MSCK REPAIR TABLE recent_events_2d_v2
and rerun the query.

BigQuery very slow on (seemingly) a very simple query

We are using GCP logs, which are exported into BigQuery using a log sink.
We don't have a huge amount of logs, but each record seems to be fairly large.
Running a simple query seems to take a lot of time with BigQuery. We wonder whether this is normal or whether we are doing something wrong... and whether there is anything we can do to make it a bit more practical to analyze...
For example, query
SELECT
FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(timestamp, "Australia/Melbourne")) as Melb_time,
jsonPayload.lg.a,
jsonPayload.lg.p
FROM `XXX.webapp_usg_logs.webapp_*`
ORDER BY timestamp DESC
LIMIT 100
takes
Query complete (44.2 sec elapsed, 35.2 MB processed)
Thank you!
Try adding this to your query:
WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
It will filter to get tables with a TABLE_SUFFIX from within the last 3 days only - instead of having BigQuery look at each table from maybe many years of history.
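Putting the suggested filter together with the original query (an untested sketch, reusing the table and column names from the question):

```sql
SELECT
  FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(timestamp, "Australia/Melbourne")) AS Melb_time,
  jsonPayload.lg.a,
  jsonPayload.lg.p
FROM `XXX.webapp_usg_logs.webapp_*`
WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY timestamp DESC
LIMIT 100
```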

AWS IoT Analytics queries for retrieving data from dataset using boto3

Can we use a query while retrieving the data from the dataset in AWS IoT Analytics? I want data between 2 timestamps. I'm using boto3 to fetch the data. I didn't see any option to use a query in get_dataset_content. Below is the boto3 code:
response = client.get_dataset_content(
datasetName='string',
versionId='string'
)
Does anyone have suggestions on how to use a query, or how to retrieve the data between 2 timestamps, in AWS IoT Analytics?
Thanks,
Pankaj
There could be a few ways to do this depending on what your workflow is; if you can share a few more details, that would be helpful.
Possible approaches are:
1) Create a scheduled query to run every hour (for example) where the query looks something like this:
SELECT * FROM my_datastore WHERE __dt >= current_date - interval '1' day
AND my_timestamp >= now() - interval '1' hour
You may need to adjust the format of the timestamp to suit, depending on how you are storing it (epoch seconds, epoch milliseconds, ISO 8601, etc.). If you set this to run every hour, each time it executes you will get the last one hour of data. Note that the __dt constraint just helps your query run faster (and cheaper) by limiting the scan to the most recent day only.
2) You can improve on the above by using the delta window function of the dataset, which lets you more easily get the data that has arrived since the query last ran. You could then simplify your query to look like:
select * from my_datastore where __dt >= current_date - interval '1' day
And configure the delta time window to look at your timestamp field. You then control how much data is retrieved by the frequency at which you execute the query (every 15 mins, every hour etc).
3) If you have a more general-purpose requirement to fetch the data between 2 timestamps that you are calculating programmatically, and that may not be of the form now() - some interval, you could create a dataset and then update it with the revised SQL expression before running it with create-dataset-content. That way the dataset content is updated with just the results you need on each execution. If this is of interest, I can expand upon the actual Python required.
4) As Thomas suggested, it can often be just as easy to pull out a larger chunk of data with the dataset (for example the last day) and then filter down to the timestamp you want in code. This is particularly easy if you are using pandas dataframes, for example, and there are plenty of related questions, such as this one, that have good answers.
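To expand on approach (3), a minimal boto3 sketch might look like the following (the datastore, dataset, and timestamp field names are placeholders, and the SQL should be adapted to how your timestamps are actually stored):

```python
def build_between_query(datastore, ts_field, start_iso, end_iso):
    # __dt limits the partition scan; the BETWEEN clause does the real filtering.
    return (
        f"SELECT * FROM {datastore} "
        f"WHERE __dt >= date '{start_iso[:10]}' "
        f"AND {ts_field} BETWEEN timestamp '{start_iso}' AND timestamp '{end_iso}'"
    )

def refresh_dataset(dataset_name, sql):
    import boto3  # requires AWS credentials and a configured region
    client = boto3.client("iotanalytics")
    # Point the dataset at the new SQL, then trigger a content run.
    client.update_dataset(
        datasetName=dataset_name,
        actions=[{"actionName": "sqlAction",
                  "queryAction": {"sqlQuery": sql}}],
    )
    client.create_dataset_content(datasetName=dataset_name)
```

Each call to refresh_dataset replaces the dataset's SQL action and kicks off a new content generation, so the next get_dataset_content returns only the requested window.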
Frankly, the easiest thing would be to do your own time filtering (the result of get_dataset_content is a csv file).
That's what QuickSight does to allow you to navigate the dataset in time.
If this isn't feasible the alternative is to reprocess the datastore with an updated pipeline that filters out everything except the time range you're interested in (more information here). You should note that while it's tempting to use the startTime and endTime parameters for StartPipelineReprocessing, these are only approximate to the nearest hour.
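Doing your own time filtering on the returned CSV, as suggested above, is straightforward with pandas (a sketch; the my_timestamp column name is a placeholder for whatever your datastore uses):

```python
import io
import pandas as pd

def filter_by_time(csv_text, ts_col, start, end):
    # Parse the dataset-content CSV and keep only rows inside [start, end].
    df = pd.read_csv(io.StringIO(csv_text), parse_dates=[ts_col])
    return df[(df[ts_col] >= start) & (df[ts_col] <= end)]
```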

Proc SQL in SAS and lag and lead

I'm struggling a bit with a problem I can't quite get my head around.
Let's say we have a few columns;
IP-address, time stamp, SSN.
How would I go about finding occurrences where the same IP appears in several records, where the time is within the same one-hour window (as an example of a window of time), and there are several SSNs?
This could for example be used for received applications for whatever, where we get a lot of traffic from one location where the data given varies.
Might lag or lead be good?
I'm using SAS, but only Proc SQL really. Might lag or lead be a way to go?
Thank you for the help!
There is some uncertainty in the "one hour window" description: it depends on your starting point - one hour from when?
Otherwise you could end up with a double cycle:
for every IP
  for every timestamp
    check whether other timestamps for the same IP exist within one hour and with a different SSN
A simpler solution might be to use the lag function.
First, sort by IP and time stamp.
Second, use lag to calculate a new column with the time difference between each two rows, and flag it when it is less than 1 hour. Use this flag in the next query's grouping to identify distinct SSNs.
A problem with the latter solution is that it can flag chains of records that, in total, extend beyond the 1-hour window.
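As an untested sketch of the double-cycle idea in Proc SQL itself, a self-join can count distinct SSNs per IP within a rolling hour (table and column names are placeholders; SAS datetime values are seconds, so one hour is 3600):

```sas
proc sql;
  create table flagged as
  select a.ip, a.ts, count(distinct b.ssn) as n_ssn
  from apps as a
       inner join apps as b
         on a.ip = b.ip
        and b.ts between a.ts and a.ts + 3600  /* 1 hour in seconds */
  group by a.ip, a.ts
  having calculated n_ssn > 1;
quit;
```

This avoids the chaining problem of the lag approach, at the cost of a self-join that can be expensive on large tables.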

Add column with difference in days

I'm trying the new Power BI (Desktop) to create a bar chart that shows me the duration in days for the delivery of an order.
I have 2 files: one with the delivery data (date, barcode) and another with the delivery statuses (date, barcode).
I created a relation in the Power BI relations tab on the left side to relate them on barcode: 1 Delivery to many DeliveryStatuses.
Now I want to add a column/measure to calculate the number of days before a package is delivered. I searched a few blogs, but with no success.
The DATEDIFF function is only recognized in a measure, and measures seem to work on table data, not row data, so adding a column using the DATEDIFF function doesn't work.
Adding a column using a formula :
Duration = [DeliveryDate] - Delivery[OrderDate]
results in an error that the right side is a list (it seems the relationship isn't in place)?
What am I doing wrong?
You might try doing this in the Query window instead since I think each barcode has just one delivery date and one delivery status. You could merge the two queries into a single table. Then you wouldn't need to worry about the relationships... If on the other hand you can have multiple lines for each delivery in the delivery status table, then you need to get more fancy. If you're only interested in the last status (as opposed to the history of status) you could again use the Query windows to group the data. If you need the full flexibility, you'd probably need to create a Measure that expresses the logic you want.
The RELATED keyword is used to reference a column in a related table. Update your formula as follows and it should work:
Duration = [DeliveryDate] - RELATED(Delivery[OrderDate])
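As an untested variant (assuming the calculated column lives on the many-side DeliveryStatus table), DAX's DATEDIFF can also be used in a calculated column, not just a measure:

```dax
Duration = DATEDIFF ( RELATED ( Delivery[OrderDate] ), DeliveryStatus[DeliveryDate], DAY )
```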