AWS Athena: varchar maximum value not matched with query - amazon-web-services

I created an Athena table with this query:
CREATE EXTERNAL TABLE IF NOT EXISTS report (
`token` varchar(40)
)
PARTITIONED BY ( `created_hour` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://demo-kinesis-athena/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'projection.created_hour.format' = 'yyyy/MM/dd/HH',
'projection.created_hour.interval' = '1',
'projection.created_hour.interval.unit' = 'HOURS',
'projection.created_hour.range' = '2018/01/01/00,NOW',
'projection.created_hour.type' = 'date',
'projection.enabled' = 'true',
'storage.location.template' = 's3://demo-kinesis-athena/${created_hour}'
);
The query runs successfully and the table is created, but when I generate the table DDL, the column type is shown as
`token` varchar(65535)
instead.

Related

Convert SQL query into DAX query for PowerBI visual

I have a table like the one below in Databricks, and I am trying to count the number of job_ids that are in "PAUSED" status and show that count in a Power BI report.
If a job_id has both "PAUSED" and "UNPAUSED" statuses with the same timestamp, it should also be counted as being in "PAUSED" status.
table1
job_id  status    timestamp
A       PAUSED    2022-08-02T21:09:13
A       UNPAUSED  2022-08-02T21:09:13
B       PAUSED    2022-08-10T21:09:15
B       PAUSED    2022-07-20T21:09:13
A       PAUSED    2022-08-12T21:09:13
C       PAUSED    2022-08-01T21:07:19
C       PAUSED    2022-08-01T21:07:19
C       UNPAUSED  2022-08-03T21:07:19
B       UNPAUSED  2022-07-20T21:09:13
A       UNPAUSED  2022-08-04T21:09:13
This is the SQL query I wrote:
SELECT count(job_id)
FROM (
    SELECT DISTINCT job_id, status,
           DENSE_RANK() OVER (PARTITION BY job_id ORDER BY timestamp DESC) AS rn
    FROM table1
) t2
WHERE t2.rn = 1 AND t2.status = 'PAUSED'
Next, I want to display this number in Power BI using the Card visual.
To do that, I think I need to convert the SQL query above into a DAX query to get the correct number (I'm not sure whether there is a better way; I'm new to Power BI).
Could anyone help me with this query?
I would appreciate any kind of help!
In Power BI you would need the following measure:
Paused Jobs =
CALCULATE(
    DISTINCTCOUNT('Table'[job_id]),
    'Table'[status] = "PAUSED"
)
You can then put this measure onto a Card visual next to your data table.
I hope this is what you want.
First, we test the summary table:
Paused_Jobs =
VAR SummaryTable =
    ADDCOLUMNS (
        CALCULATETABLE (
            SUMMARIZE ( table1, table1[job_id], table1[status] ),
            table1[status] = "PAUSED"
        ),
        "Latest", CALCULATE ( VALUES ( table1[timestamp] ), LASTDATE ( table1[timestamp] ) ),
        "TotalCount", CALCULATE ( COUNTROWS ( table1 ), LASTDATE ( table1[timestamp] ) )
    )
VAR Result =
    SUMX ( SummaryTable, [TotalCount] )
RETURN
    SummaryTable
You can check this result against the SQL query.
If it is what you want and you need it as a measure:
Paused_Jobs =
VAR SummaryTable =
    ADDCOLUMNS (
        CALCULATETABLE (
            SUMMARIZE ( table1, table1[job_id], table1[status] ),
            table1[status] = "PAUSED"
        ),
        "Latest", CALCULATE ( VALUES ( table1[timestamp] ), LASTDATE ( table1[timestamp] ) ),
        "TotalCount", CALCULATE ( COUNTROWS ( table1 ), LASTDATE ( table1[timestamp] ) )
    )
VAR Result =
    SUMX ( SummaryTable, [TotalCount] )
RETURN
    Result
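To sanity-check any of the measures above, it can help to work out the expected number by hand. Here is a minimal Python sketch of the logic in the SQL query from the question (take the latest timestamp per job_id, then count the jobs that have a PAUSED row at that timestamp), run over the sample rows of table1. The function name `count_paused_jobs` is just an illustrative choice, and it assumes all timestamps use the same ISO-8601 format so that string comparison matches chronological order.

```python
from collections import defaultdict

# Sample rows from table1 in the question: (job_id, status, timestamp).
ROWS = [
    ("A", "PAUSED", "2022-08-02T21:09:13"),
    ("A", "UNPAUSED", "2022-08-02T21:09:13"),
    ("B", "PAUSED", "2022-08-10T21:09:15"),
    ("B", "PAUSED", "2022-07-20T21:09:13"),
    ("A", "PAUSED", "2022-08-12T21:09:13"),
    ("C", "PAUSED", "2022-08-01T21:07:19"),
    ("C", "PAUSED", "2022-08-01T21:07:19"),
    ("C", "UNPAUSED", "2022-08-03T21:07:19"),
    ("B", "UNPAUSED", "2022-07-20T21:09:13"),
    ("A", "UNPAUSED", "2022-08-04T21:09:13"),
]


def count_paused_jobs(rows):
    """Count job_ids that have at least one PAUSED row at their latest timestamp."""
    latest = {}                   # job_id -> latest timestamp seen so far
    statuses = defaultdict(set)   # (job_id, timestamp) -> statuses at that instant
    for job_id, status, ts in rows:
        statuses[(job_id, ts)].add(status)
        # Assumption: same-format ISO-8601 strings sort chronologically.
        if job_id not in latest or ts > latest[job_id]:
            latest[job_id] = ts
    return sum(
        1 for job_id, ts in latest.items() if "PAUSED" in statuses[(job_id, ts)]
    )


print(count_paused_jobs(ROWS))  # 2 (jobs A and B)
```

With the sample data, A and B are PAUSED at their latest timestamp and C is UNPAUSED, so a correct measure should show 2 on the card. A job with both PAUSED and UNPAUSED at the same latest timestamp is counted, as the question requires.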

Run Athena Query via CDK

I am trying to create a table in Athena using the AWS CDK in C#. As my table needs to contain WITH SERDEPROPERTIES (and I cannot see how to add them when using aws-glue-alpha.Table), I have opted to create the table via an Athena query in the CDK.
I have tried using both CfnNamedQuery (which creates a saved query but does not run it) and AthenaStartQueryExecution (which never shows up in CloudFormation).
Here is how they are defined:
var cfnNamedQuery = new CfnNamedQuery(this, "MyCfnNamedQuery", new CfnNamedQueryProps {
    Database = DatabaseName,
    // Verbatim string literals (@"...") let the query span multiple lines.
    QueryString = @"CREATE EXTERNAL TABLE " + Database.DatabaseName + @".workflow(
`instructionid` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket/workflow'
TBLPROPERTIES(
'has_encrypted_data' = 'false'
);
"
});
var startQueryExecutionJob = new AthenaStartQueryExecution(this, "AthenaStartQuery", new AthenaStartQueryExecutionProps {
    QueryString = @"CREATE EXTERNAL TABLE " + DatabaseName + @".workflow(
`instructionid` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket/workflow'
TBLPROPERTIES(
'has_encrypted_data' = 'false'
);
",
    IntegrationPattern = IntegrationPattern.RUN_JOB,
    WorkGroup = "primary",
    ResultConfiguration = new ResultConfiguration {
        OutputLocation = new Location {
            BucketName = "mw-query-results-dev",
            ObjectKey = "myprefix"
        }
    },
    QueryExecutionContext = new QueryExecutionContext {
        DatabaseName = DatabaseName
    }
});
I am ideally looking for an answer to one of the three following questions:
How can I add WITH SERDEPROPERTIES when creating a table using aws-glue-alpha.Table?
How can I execute a saved query?
How do I correctly use AthenaStartQueryExecution?

How to get the Top N values based on a column for each category in PowerBI?

I am facing an issue while filtering data on a "Date" column to fetch the top 3 rows for each category. Below is the sample data:
Can anybody help me get the expected output below?
You can try this (using dummy data here). You can choose ASC or DESC based on your need:
Ranking by Date =
VAR _cat = SELECTEDVALUE ( Sheet1[Category] )
RETURN
    IF (
        RANKX ( FILTER ( ALL ( Sheet1 ), Sheet1[Category] = _cat ), CALCULATE ( MAX ( Sheet1[Date] ) ),, ASC, Skip ) <= 2,
        1,
        BLANK ()
    )
Or by sales:
Ranking by Sales =
IF (
    ISINSCOPE ( 'Sheet1'[Date] ),
    VAR ProductsToRank = 2
    VAR SalesAmount = [SumOf]
    RETURN
        IF (
            SalesAmount > 0,
            VAR VisibleProducts =
                CALCULATETABLE (
                    VALUES ( 'Sheet1' ),
                    ALLSELECTED ( 'Sheet1'[Date] )
                )
            VAR Ranking =
                RANKX (
                    VisibleProducts,
                    [SumOf],
                    SalesAmount
                )
            RETURN
                IF (
                    Ranking > 0 && Ranking <= ProductsToRank,
                    1
                )
        )
)
Or you can create a new table in DAX like this:
Top2 = GENERATE(VALUES(Sheet1[Category]), TOPN(2, FILTER(SELECTCOLUMNS(ALL(Sheet1[Category], Sheet1[Date]),"Cat",[Category],"Date",[Date]),[Cat] = [Category]),[Date]))
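To make the intent of the "top N per category" approaches above concrete, here is a minimal Python sketch of the same idea: group rows by category, sort each group by date descending, and keep the first N. The rows, the column order (category, date, sales) and the function name `top_n_per_category` are hypothetical, since the original sample data is not reproduced here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (category, date, sales).
ROWS = [
    ("Fruit", date(2021, 1, 5), 10),
    ("Fruit", date(2021, 3, 1), 7),
    ("Fruit", date(2021, 2, 10), 4),
    ("Dairy", date(2021, 1, 20), 3),
    ("Dairy", date(2021, 4, 2), 9),
]


def top_n_per_category(rows, n):
    """Return {category: rows}, keeping the n most recent rows per category."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[0]].append(row)
    return {
        category: sorted(items, key=lambda r: r[1], reverse=True)[:n]
        for category, items in by_category.items()
    }


for category, items in top_n_per_category(ROWS, 2).items():
    print(category, [r[1].isoformat() for r in items])
```

Note that this sketch cuts each group at exactly N rows, whereas TOPN in DAX can return more than N rows when there are ties on the ordering column, so the two may differ on data with duplicate dates.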

filter the new table by range of data, e.g. 1997-2020 PowerBI

Error: The expression refers to multiple columns. Multiple columns cannot be converted to a scalar value.
I intend to select data by the date column from 1997 to 2020, but without success.
New Periode =
VAR DateStart =
    DATE ( "1997", "1", "1" )
VAR DateEnd =
    DATE ( "2021", "11", "10" )
RETURN
    CALCULATETABLE (
        'Date_data',
        FILTER ( 'Date_data', 'Date_data'[Date] <= DateEnd && 'Date_data'[Date] >= DateStart )
    )
Can you please try the code below?
Periode =
VAR DateStart = DATE ( "1997", "1", "1" )
VAR DateEnd = DATE ( "2021", "11", "10" )
RETURN
    CALCULATETABLE (
        'Date_data',
        'Date_data'[Date] <= DateEnd
            && 'Date_data'[Date] >= DateStart
    )
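For what it's worth, that error typically appears when a table expression is used where a single value is expected, for example when the formula is entered as a measure or calculated column instead of a new table. The filter itself returns many rows, as this minimal Python sketch of the same inclusive date-range filter shows. The function name `filter_by_date_range` and the sample dates are hypothetical.

```python
from datetime import date


def filter_by_date_range(dates, start, end):
    """Keep the dates that fall within [start, end], both bounds inclusive."""
    return [d for d in dates if start <= d <= end]


# Hypothetical sample values standing in for Date_data[Date].
SAMPLE = [
    date(1996, 12, 31),
    date(1997, 1, 1),
    date(2010, 6, 15),
    date(2021, 11, 10),
    date(2021, 11, 11),
]

# The result is a list of rows (a table), not a single scalar value.
print(filter_by_date_range(SAMPLE, date(1997, 1, 1), date(2021, 11, 10)))
```

Because the result is a set of rows, in Power BI the expression would need to go into a calculated table (New table) rather than a measure.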

Add an IF statement in SUMMARIZE formula

Objective:
I would like to add an IF condition to my query, but I'm not sure how. In other words, I want to replace the MAX expression with the IF statement below.
The IF condition is:
IF(
    'Data Model'[amount] = 'Data Model'[orderTotalPrice],
    'Data Model'[amount] - 'Data Model'[shipping_price] - 'Data Model'[total_tax],
    'Data Model'[amount]
)
My query:
VAR Summary =
    CALCULATETABLE (
        SUMMARIZE (
            'Data Model',
            'Data Model'[orderId],
            "MaxValue", CALCULATE (
                MAX ( 'Data Model'[amount] ),
                'Data Model'[kind] = "refund",
                'Data Model'[status] = "success"
            )
        ),
        USERELATIONSHIP ( 'Data Model'[trx_date], DateTable[Date] )
    )
RETURN
    SUMX ( Summary, [MaxValue] )
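To pin down what the combined calculation should return, here is a minimal Python sketch of one reading of the goal: for each orderId, evaluate the IF condition row by row over the successful refund rows, take the maximum of those adjusted amounts, and then sum the per-order maxima. The sample rows and the function names are hypothetical, and the USERELATIONSHIP date logic is left out. In DAX this row-by-row evaluation would probably need an iterator such as MAXX rather than MAX, but that mapping is an assumption and is not tested here.

```python
def adjusted_amount(row):
    """Apply the IF condition from the question to a single row."""
    if row["amount"] == row["orderTotalPrice"]:
        return row["amount"] - row["shipping_price"] - row["total_tax"]
    return row["amount"]


def sum_of_max_refunds(rows):
    """Sum, over orders, the largest adjusted amount among successful refunds."""
    best = {}  # orderId -> largest adjusted amount seen so far
    for row in rows:
        if row["kind"] == "refund" and row["status"] == "success":
            value = adjusted_amount(row)
            order_id = row["orderId"]
            if order_id not in best or value > best[order_id]:
                best[order_id] = value
    return sum(best.values())


# Hypothetical rows; amounts are integers to keep the arithmetic exact.
ROWS = [
    {"orderId": 1, "kind": "refund", "status": "success",
     "amount": 100, "orderTotalPrice": 100, "shipping_price": 10, "total_tax": 5},
    {"orderId": 1, "kind": "refund", "status": "success",
     "amount": 40, "orderTotalPrice": 100, "shipping_price": 10, "total_tax": 5},
    {"orderId": 2, "kind": "refund", "status": "success",
     "amount": 30, "orderTotalPrice": 50, "shipping_price": 5, "total_tax": 2},
    {"orderId": 2, "kind": "sale", "status": "success",
     "amount": 50, "orderTotalPrice": 50, "shipping_price": 5, "total_tax": 2},
]

print(sum_of_max_refunds(ROWS))  # 115 (85 for order 1 plus 30 for order 2)
```

For order 1 the full refund of 100 is reduced to 85 by shipping and tax, which still beats the partial refund of 40. For order 2 only the refund row of 30 qualifies, since the sale row is filtered out.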