Use substring function in GROUP BY clause in AWS Athena table

My 'date_validation' column is of string type and displays values like '2018-05-22 13:38:59.0', so to convert it to a date I had to use the SUBSTR and DATE_PARSE functions to get something like '2014-02-26 00:00:00.000'.
I need a count of boardings grouped by date_validation, because there are many validations per day. But I cannot do that, since it gives an error when I use GROUP BY as below.
SELECT
DATE_PARSE(SUBSTR(date_validation, 1, 10), '%Y-%m-%d'),
run_route_code,
stop_label,
COUNT(boardings)
FROM "raw_XX"."validations"
WHERE
run_route_code = 'XXX'
GROUP BY
DATE_PARSE(SUBSTR(date_validation, 1, 10), '%Y-%m-%d'),
run_route_code,
stop_label
ORDER BY
date_validation;
The error says:
SYNTAX_ERROR: line 3:107: '"date_validation"' must be an aggregate expression or appear in GROUP BY clause
If I use the date_validation column directly in the GROUP BY clause, it doesn't actually group; it just displays every value for each day.
Please give me some advice.
Note that I cannot make any changes to the data, as it's a huge partitioned raw table sitting in the Raw bucket.

The error you are seeing is actually caused by the ORDER BY clause, which refers to the original date_validation column. By the time the ORDER BY is evaluated, only the grouped expressions are available, not the raw date_validation column. Order by the DATE_PARSE(...) expression instead and your query should work:
SELECT
DATE_PARSE(SUBSTR(date_validation, 1, 10), '%Y-%m-%d'),
run_route_code,
stop_label,
COUNT(boardings)
FROM "raw_XX"."validations"
WHERE
run_route_code = 'XXX'
GROUP BY
DATE_PARSE(SUBSTR(date_validation, 1, 10), '%Y-%m-%d'),
run_route_code,
stop_label
ORDER BY
DATE_PARSE(SUBSTR(date_validation, 1, 10), '%Y-%m-%d');
Note: you could also just write ORDER BY 1 here as a shorthand, but I give the solution above because it makes the source of the error, and its fix, explicit.
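A third variant, if you prefer not to repeat the expression: give it an alias in the SELECT list and sort by that alias, since ORDER BY may reference output column names. A sketch of the same query; the alias names validation_date and boarding_count are my own, not from the original:
SELECT
DATE_PARSE(SUBSTR(date_validation, 1, 10), '%Y-%m-%d') AS validation_date,
run_route_code,
stop_label,
COUNT(boardings) AS boarding_count
FROM "raw_XX"."validations"
WHERE
run_route_code = 'XXX'
GROUP BY
DATE_PARSE(SUBSTR(date_validation, 1, 10), '%Y-%m-%d'),
run_route_code,
stop_label
-- The alias refers to the grouped expression, so ORDER BY can use it.
ORDER BY
validation_date;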

Related

Column does not exist AWS Timestream Query error

I am trying to apply a WHERE clause to a DIMENSION of my AWS Timestream records, but I get the error: Column does not exist.
Here is my table schema: (screenshots of the table schema and the table measures omitted)
First, here is a sample of the data I put in the table:
SELECT username, time, manual_usage
FROM "meter-reading"."meter-metrics"
ORDER BY time DESC
LIMIT 4
The result: (screenshot omitted)
What I want to do is query and filter the records by a dimension ("username" specifically).
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE measure_name = "OnceADay"
ORDER BY time DESC LIMIT 10
Then I got the error: Column 'OnceADay' does not exist
I searched the quotas for dimension names and checked my schema for errors:
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.naming
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.system_identifier
But I couldn't find anything in my "username" dimension that violated any of the above rules.
I also checked other queries in an AWS blog post, where the author used a WHERE clause on a dimension without any problem:
https://aws.amazon.com/blogs/database/effective-queries-for-common-query-patterns-in-amazon-timestream/
I figured it out after trying the sample code. It turned out to be a silly mistake: I should have used single quotes (') instead of double quotation marks ("). In this SQL dialect, double quotes delimit identifiers such as column names, so "OnceADay" was parsed as a reference to a (nonexistent) column rather than as a string literal.
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE username = 'OnceADay'
ORDER BY time DESC LIMIT 10
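For anyone hitting the same wall, the underlying rule in Presto-derived dialects like Timestream's query language: double quotes delimit identifiers (table and column names), single quotes delimit string literals. A minimal sketch of the contrast, reusing the names from the query above:
-- Single quotes: 'OnceADay' is a string literal compared against the column.
SELECT *
FROM "meter-reading"."meter-metrics"   -- double quotes: database/table identifiers
WHERE username = 'OnceADay'
ORDER BY time DESC LIMIT 10
-- Writing WHERE username = "OnceADay" would instead be parsed as a comparison
-- between two columns, producing: Column 'OnceADay' does not exist.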

MySQL get substring

I'm trying to take a substring dynamically and group by it. If my uri column contains records like /uri1/uri2 and /somelongword/someotherlongword, I would like to get everything up to the second delimiter (the second /) and count it. I'm using the query below, but obviously it cuts the string statically (six characters from the first):
SELECT substr(uri, 1, 6) as URI,
COUNT(*) as COUNTER
FROM staging
GROUP BY substr(uri, 1, 6)
ORDER BY COUNTER DESC
How can I achieve that?
You can use a combination of SUBSTRING() and POSITION().
schema:
CREATE TABLE Table1
(`uri` varchar(10))
;
INSERT INTO Table1
(`uri`)
VALUES
('some/text'),
('some/text1'),
('some/text2'),
('aa/bb'),
('aa/cc'),
('bb/cc')
;
query
SELECT
SUBSTRING(uri,1,POSITION('/' IN uri)-1),
COUNT(*)
FROM Table1
GROUP BY SUBSTRING(uri,1,POSITION('/' IN uri)-1);
http://sqlfiddle.com/#!9/293dd3/3/0
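One caveat I'd add (not covered by the original answer): if a row's uri contains no '/' at all, POSITION() returns 0, the computed length becomes -1, and the result is an empty string. A sketch that groups such rows under NULL instead, using NULLIF:
SELECT
SUBSTRING(uri, 1, NULLIF(POSITION('/' IN uri), 0) - 1) AS prefix,
COUNT(*) AS counter
FROM Table1
-- NULLIF turns a 0 position into NULL, so rows without a delimiter
-- get a NULL prefix instead of an empty string.
GROUP BY SUBSTRING(uri, 1, NULLIF(POSITION('/' IN uri), 0) - 1);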
Edit: I found the Amazon Athena documentation here: https://docs.aws.amazon.com/athena/latest/ug/presto-functions.html and the string function documentation here: https://prestodb.io/docs/0.217/functions/string.html
My answer above still stands, but you might need to change SUBSTRING to SUBSTR.
Edit 2: it seems there's a dedicated function for this in Amazon Athena, called SPLIT_PART().
Query:
SELECT SPLIT_PART(uri, '/', 1), COUNT(*) FROM tbl GROUP BY SPLIT_PART(uri, '/', 1)
From the docs:
split_part(string, delimiter, index) → varchar
Splits string on delimiter and returns the field index. Field indexes start with 1. If the index is larger than the number of fields, then null is returned.
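One detail worth checking against your data (my observation, not from the docs): the URIs in the question start with a leading /, so field 1 of the split is the empty string and you would want field 2. A sketch under that assumption, keeping tbl as the placeholder table name:
-- For '/uri1/uri2', SPLIT_PART(uri, '/', 1) returns '' because the string
-- begins with the delimiter; field 2 is 'uri1'.
SELECT SPLIT_PART(uri, '/', 2) AS first_segment, COUNT(*) AS counter
FROM tbl
GROUP BY SPLIT_PART(uri, '/', 2);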

Filtering reoccurring values in Power BI using DirectQuery

I have a table containing timestamps and Error codes from machines.
The machines will sometimes repeat the same error several times in a row, but I only want to count these as one error, so I'm looking for a way to detect when an error is reoccurring and filter those rows out.
I'm using DirectQuery, so using EARLIER() to fetch the previous error does not seem to work.
How should I filter these reoccurring errors?
If you want to do this in the database, Azure SQL Database supports the LAG function, so the query for loading the data into Power BI could be something like this:
declare @t table([Time] time, [Error] int)
insert into @t([Time], [Error]) values
('11:01', 0),
('12:12', 0),
('13:31', 4),
('14:50', 0),
('15:10', 4),
('15:20', 4),
('15:30', 4),
('15:40', 4),
('17:01', 1),
('18:09', 1),
('19:41', 0)
select
t.[Time]
, t.[Error]
, IIF(t.[Error] <> 0 and LAG(t.[Error], 1) OVER(ORDER BY t.[Time]) = t.[Error], 1, 0) as Reoccuring
from @t t
order by t.[Time]
Please note that the example doesn't show partitioning the data (e.g. by machine), because your sample data doesn't include that. If you need it, add a PARTITION BY clause to the LAG function's OVER clause. If you update your question with the exact database schema, I will update my answer too.
As Andrey Nikolov suggested, I needed a PARTITION BY clause using the machines' serial numbers.
SELECT TOP 100 PERCENT *,
(CASE WHEN error = 0 OR error = LAG(error, 1, 0) OVER (PARTITION BY serial_nr ORDER BY event_time DESC)
THEN 0
ELSE 1
END) AS error_is_new
FROM MyTable
I added a new column to my table indicating whether an error is new, and used error_is_new to show only the errors that were new.
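For completeness, here is how the flag can be rolled up so each run of repeated errors counts only once — a sketch built on the query above, with the derived-table alias t and the name distinct_error_count being my own choices:
SELECT serial_nr, SUM(error_is_new) AS distinct_error_count
FROM (
SELECT *,
(CASE WHEN error = 0 OR error = LAG(error, 1, 0) OVER (PARTITION BY serial_nr ORDER BY event_time DESC)
THEN 0
ELSE 1
END) AS error_is_new
FROM MyTable
) t
-- Each unbroken run of the same non-zero error contributes exactly one 1.
GROUP BY serial_nr;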

PowerBI/DAX: Unable to correctly compare two dates

I have this custom date that I created as a measure:
Start Date = DATE(YEAR(MAX(Loss[dte_month_end]))-1,12,31)
This part looks fine in Power BI and seems to be the right format.
Now I created a new column that checks, for each record, whether it is equal to the "Start Date" defined above:
IsStart = IF(Loss[dte_month_end]=[Start Date], TRUE, FALSE)
But the weird thing is that all records evaluate to FALSE.
I know this is not actually the case in my data; I can find records with dte_month_end = 12/31/2017 as shown above.
Can someone help me understand why the IF statement cannot evaluate this correctly? I initially thought the DATETIME formats might be inconsistent, but I purposefully set both to the same format, to no avail.
Thanks.
Edit 1: FYI, this is the format my dte_month_end field has: (screenshot omitted)
Edit 2: I tried changing the dte_month_end format to Date instead of DateTime, and it still doesn't work: (screenshot omitted)
This is happening because you are using a measure inside a calculated column. When you do that, the filter context for the measure becomes the row context of the table.
To fix this, you need to modify the filter context for your measure. For example:
Start Date = DATE(YEAR(CALCULATE(MAX(Loss[dte_month_end]), ALL(Loss))) - 1, 12, 31)
or
Start Date = DATE(YEAR(MAXX(ALL(Loss), Loss[dte_month_end])) - 1, 12, 31)
If you don't do this, the MAX only looks at the current row, rather than all the rows in the table.

How to Extract Numeric Range from Text in SQL

I am fairly new to SQL and I'm trying to extract some data from an Oracle DB. My goal is to return rows where a query value lies in the range specified by the "AA_SYNTAX" column of a table.
For example: if the "AA_SYNTAX" column is 'p.G12_V14insR' and my search value is 13, I want to return that row. The column is organized as p.%number_%number%, so basically I want to extract the two numeric values from it and check whether my query value is between them.
I have all the tables I need joined together; I'm just not sure how to construct a query like this. I know in regex I would write something like "\d+", but I'm not sure how to translate that into SQL.
Thanks
In Oracle, you can use regular expressions to extract a number from a string. More specifically, I would look into REGEXP_SUBSTR.
Using the data given in your example above, you could write:
with cte as
(
select 'p.G12_V14insR' as AA_SYNTAX from dual
)
select
REGEXP_SUBSTR(AA_SYNTAX,'p\.[[:alpha:]]+([[:digit:]]+)', 1, 1, NULL, 1) as Bottom
,REGEXP_SUBSTR(AA_SYNTAX,'\_[[:alpha:]]+([[:digit:]]+)', 1, 1, NULL, 1) as Top
from cte
I'm sure you could tidy up the regular expressions a bit, but as written this returns 12 for Bottom and 14 for Top.
Hope this helps move you in the right direction.
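To finish the original task of filtering rows where the search value lies inside the range, you could wrap both extractions in TO_NUMBER and compare. A sketch under the same pattern assumptions, with your_table and the bind variable :search_value as hypothetical placeholders:
SELECT *
FROM your_table
-- Extract the lower and upper bounds from AA_SYNTAX and test the search value.
WHERE :search_value BETWEEN
TO_NUMBER(REGEXP_SUBSTR(AA_SYNTAX, 'p\.[[:alpha:]]+([[:digit:]]+)', 1, 1, NULL, 1))
AND
TO_NUMBER(REGEXP_SUBSTR(AA_SYNTAX, '\_[[:alpha:]]+([[:digit:]]+)', 1, 1, NULL, 1));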