Athena/Presto data discovery query to recommend JSON schema? - amazon-athena

I have an Athena table (raw) with just one column (json).
I have the following query which outputs the frequencies of json keys:
SELECT key, count(*)
FROM (
SELECT map_keys(cast(json_parse(json) AS map(varchar, json))) AS keys
FROM raw
)
CROSS JOIN UNNEST (keys) AS t (key)
GROUP BY key
How can I extend this query so that it'll tell me whether a particular key has values with any non-numeric characters?
[failed attempts deleted after I found answer]

This works:
SELECT k, count(*) as isPresent, sum(isNumber) as isNumber,
count(*)-sum(isNumber) as notIsNumber from (
with dataset as (SELECT
cast(json_parse(json) AS map(varchar, varchar)) as kv FROM raw)
SELECT t.k, t.v,
IF(TRY(cast(t.v as double)) is null, 0, 1) as isNumber
from dataset cross join unnest(kv) as t(k, v)
) group by k
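To sanity-check the logic of the query above outside Athena, here is a minimal Python sketch of the same per-key profile. It assumes each row is a flat JSON object. The function name `key_numeric_profile` and the sample rows are made up for illustration. Python's `float()` is a little more permissive than a SQL cast to double (it tolerates surrounding whitespace, for instance), so treat it as an approximation.

```python
import json
from collections import defaultdict

def key_numeric_profile(rows):
    """Count, per key, how often it is present and how often its value parses as a number."""
    stats = defaultdict(lambda: {"present": 0, "numeric": 0})
    for raw in rows:
        for key, value in json.loads(raw).items():
            stats[key]["present"] += 1
            # bool is a subclass of int in Python, so exclude it (and null) explicitly
            if isinstance(value, bool) or value is None:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                continue  # not castable, counts as non-numeric
            stats[key]["numeric"] += 1
    return dict(stats)

# Hypothetical sample rows standing in for the raw.json column
rows = ['{"a": "1", "b": "x"}', '{"a": 2.5, "b": "3"}', '{"a": "1e3", "c": null}']
profile = key_numeric_profile(rows)
for key, s in sorted(profile.items()):
    print(key, s["present"], s["numeric"], s["present"] - s["numeric"])
```

A key whose non-numeric count is greater than zero has at least one value that could not be read as a number.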

Related

Why is left join in Redshift not working?

We are facing a weird issue with Redshift and I am looking for help to debug it please. Details of the issue are following:
I have 2 tables and I am trying to perform left join as follows:
select count(*)
from abc.orders ot
left outer join abc.events e on ot.context_id = e.context_id
where ot.order_id = '222:102'
The query above returns ~7000 records. It looks like it is performing a default join, as we have only 1 record in the [Orders] table with Order ID = '222:102'.
select count(*)
from abc.orders ot
left outer join abc.events e on ot.event_id = e.event_id
where ot.order_id = '222:102'
The query above correctly returns 1 record. Note that I have only changed the column used to join the 2 tables. Event_ID in the [Events] table is an identity column, but I thought I would get similar results even if I used another column such as Context_ID.
Further, I tried the following query, expecting it to return all ~7000 records since it uses the default join, but surprisingly it returned only 1 record.
select count(*)
from abc.orders ot
join abc.events e on ot.event_id = e.event_id
where ot.order_id = '222:102'
Following are the Redshift database details:
Cutdown version of table metadata:
CREATE TABLE abc.orders (
order_id character varying(30) NOT NULL ENCODE raw,
context_id integer ENCODE raw,
event_id character varying(21) NOT NULL ENCODE zstd,
FOREIGN KEY (event_id) REFERENCES events_20191014(event_id)
)
DISTSTYLE EVEN
SORTKEY ( context_id, order_id );
CREATE TABLE abc.events (
event_id character varying(21) NOT NULL ENCODE raw,
context_id integer ENCODE raw,
PRIMARY KEY (event_id)
)
DISTSTYLE ALL
SORTKEY ( context_id, event_id );
Database: Amazon Redshift cluster
I think I am missing something essential while joining the tables. Could you please guide me in the right direction?
Thank you
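One possible explanation, sketched here with Python's built-in sqlite3 rather than Redshift: if context_id is not unique in events (the DDL only declares event_id as the primary key), a left join returns one row per matching event, so a single order can fan out into many rows. The table contents below are invented for illustration, and the assumption that many events share a context_id is not confirmed by the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT, context_id INTEGER, event_id TEXT);
    CREATE TABLE events (event_id TEXT PRIMARY KEY, context_id INTEGER);
    -- one order; three events share its context_id, only one shares its event_id
    INSERT INTO orders VALUES ('222:102', 7, 'e1');
    INSERT INTO events VALUES ('e1', 7), ('e2', 7), ('e3', 7), ('e4', 8);
""")

def count_rows(join_column):
    # join_column is interpolated only because it comes from the two fixed calls below
    sql = f"""
        SELECT count(*)
        FROM orders ot
        LEFT OUTER JOIN events e ON ot.{join_column} = e.{join_column}
        WHERE ot.order_id = '222:102'
    """
    return conn.execute(sql).fetchone()[0]

by_context = count_rows("context_id")  # one row per event with context_id 7
by_event = count_rows("event_id")      # event_id is unique, so a single match
print(by_context, by_event)
```

Under this assumption both counts are expected: the join type only decides what happens to orders with no match, while the number of matches decides how many rows each order produces.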

MYSQL get substring

I'm trying to get substring dynamically and group by it. So if my uri column contains records like: /uri1/uri2 and /somelongword/someotherlongword I would like to get everything up to second delimiter, namely up to second / and count it. I'm using this query but obviously it is cutting string statically (6 letters after the first one).
SELECT substr(uri, 1, 6) as URI,
COUNT(*) as COUNTER
FROM staging
GROUP BY substr(uri, 1, 6)
ORDER BY COUNTER DESC
How can I achieve that?
You can use a combination of SUBSTRING() and POSITION()
schema:
CREATE TABLE Table1
(`uri` varchar(10))
;
INSERT INTO Table1
(`uri`)
VALUES
('some/text'),
('some/text1'),
('some/text2'),
('aa/bb'),
('aa/cc'),
('bb/cc')
;
query
SELECT
SUBSTRING(uri,1,POSITION('/' IN uri)-1),
COUNT(*)
FROM Table1
GROUP BY SUBSTRING(uri,1,POSITION('/' IN uri)-1);
http://sqlfiddle.com/#!9/293dd3/3/0
edit: here I found amazon athena documentation: https://docs.aws.amazon.com/athena/latest/ug/presto-functions.html and here is the string function documentation: https://prestodb.io/docs/0.217/functions/string.html
my answer above still stands, but you might need to change SUBSTRING to SUBSTR
edit 2: it seems there's a special function to achieve this in amazon athena called SPLIT_PART()
query:
SELECT SPLIT_PART(uri, '/', 1), COUNT(*) FROM tbl GROUP BY SPLIT_PART(uri, '/', 1)
from docs:
split_part(string, delimiter, index) → varchar
Splits string on delimiter and returns the field index. Field indexes start with 1. If the index is larger than the number of fields, then null is returned.
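One thing to watch with the URIs from the question, which start with the delimiter: the first field is then the empty string, so the first path segment is field 2. The following Python sketch emulates the documented behaviour to show this. The helper `split_part` here is a stand-in written for illustration, not the Athena function itself, and treating an index below 1 as out of range is an assumption of the sketch.

```python
from collections import Counter

def split_part(string, delimiter, index):
    """Emulate split_part: 1-based field lookup, None when the index is out of range."""
    fields = string.split(delimiter)
    return fields[index - 1] if 1 <= index <= len(fields) else None

# Sample URIs modelled on the question (they start with '/')
uris = ["/uri1/uri2", "/uri1/other", "/somelongword/someotherlongword"]

first_field = [split_part(u, "/", 1) for u in uris]      # text before the first '/'
counts = Counter(split_part(u, "/", 2) for u in uris)    # first path segment
print(first_field, counts)
```

So for data shaped like /uri1/uri2, grouping by field 2 rather than field 1 is probably what is wanted.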

How to add a partition boundary only when not exists in SQL Data Warehouse?

I am using Azure SQL Data Warehouse Gen 1, and I created a partitioned table like this:
CREATE TABLE [dbo].[StatsPerBin1](
[Bin1] [varchar](100) NOT NULL,
[TimeWindow] [datetime] NOT NULL,
[Count] [int] NOT NULL,
[Timestamp] [datetime] NOT NULL)
WITH
(
DISTRIBUTION = HASH ( [Bin1] ),
CLUSTERED INDEX([Bin1]),
PARTITION
(
[TimeWindow] RANGE RIGHT FOR VALUES ()
)
)
How can I split a partition only when the boundary does not already exist?
My first idea was to get the partition boundaries by table name and then write an IF statement that decides whether to add the boundary.
But I cannot find a way to associate a table with its partition values. The partition values of all partitions can be retrieved with
SELECT * FROM sys.partition_range_values
But it only contains function_id as an identifier, and I don't know how to join it to other tables so that I can get the partition boundaries by table name.
Have you tried joining sys.partition_range_values with sys.partition_functions view?
Granted we cannot create partition functions in SQL DW, but the view seems to be still supported.
I know this is an old question, but I was having the same problem. Here is the query I ended up with, which can get you started. It is modified slightly from a query in the SQL Server documentation:
SELECT s.[name] AS [schema_name]
, t.[name] AS [table_name]
, p.[partition_number] AS [partition_number]
, rv.[value] AS [partition_boundary_value]
, p.[data_compression_desc] AS [partition_compression_desc]
FROM sys.schemas s
JOIN sys.tables t ON t.[schema_id] = s.[schema_id]
JOIN sys.partitions p ON p.[object_id] = t.[object_id]
JOIN sys.indexes i ON i.[object_id] = p.[object_id]
AND i.[index_id] = p.[index_id]
JOIN sys.data_spaces ds ON ds.[data_space_id] = i.[data_space_id]
LEFT JOIN sys.partition_schemes ps ON ps.[data_space_id] = ds.[data_space_id]
LEFT JOIN sys.partition_functions pf ON pf.[function_id] = ps.[function_id]
LEFT JOIN sys.partition_range_values rv ON rv.[function_id] = pf.[function_id]
AND rv.[boundary_id] = p.[partition_number]
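Once the existing boundaries for one table have been fetched with a query like the one above (filtered by table name), the "only when it does not exist" check can live in client code. Below is a minimal Python sketch of that decision. The function name `split_statement` is hypothetical, and it assumes the boundary values have already been normalised to the same string format as the new value.

```python
def split_statement(table, existing_boundaries, new_boundary):
    """Return an ALTER TABLE ... SPLIT RANGE statement, or None if the boundary already exists."""
    if new_boundary in existing_boundaries:
        return None  # nothing to do, the boundary is already there
    return f"ALTER TABLE {table} SPLIT RANGE ('{new_boundary}');"

# Hypothetical boundaries, as they might come back from the catalog query
existing = {"2019-01-01", "2019-02-01"}
skipped = split_statement("[dbo].[StatsPerBin1]", existing, "2019-02-01")
created = split_statement("[dbo].[StatsPerBin1]", existing, "2019-03-01")
print(skipped, created)
```

The same check could be written in T-SQL with IF NOT EXISTS around the catalog query; the sketch only shows the shape of the logic.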

Convert Map<string><string> in QuickSight

I have a column of type Map string->string in Athena and this is not recognized in AWS QuickSight. I am trying to convert this field to varchar in QuickSight using SQL
SELECT cast(body as varchar) FROM db.events;
But it fails
Cannot cast map(varchar,varchar) to varchar
How can I convert this field correctly so QuickSight can query against it?
I think there is no easy way to do that, but maybe there are some workarounds.
If each map has two keys with known names you can create two new columns:
SELECT
ELEMENT_AT(map_col,'key1') AS key1_col
,ELEMENT_AT(map_col,'key2') AS key2_col
FROM
(
SELECT
MAP(
ARRAY['key1','key2'],
ARRAY['val1','val2']
) AS map_col
)
Which will output:
key1_col | key2_col
val1     | val2
If your map column has just one key you can adapt the snippet above and use it or use this one:
SELECT
ARRAY_JOIN(MAP_KEYS(map_col), ', ') AS keys
,ARRAY_JOIN(MAP_VALUES(map_col), ', ') AS vals
FROM
(
SELECT
MAP(
ARRAY['key1'],
ARRAY['val1']
) AS map_col
)
which will result in:
keys | vals
key1 | val1
As said above, there is no clean way to do this. If you have many keys, you can try using the second snippet to build strings that store the keys and values, and later use calculated fields (maybe using split) to access them.
Hope it helps (:

Doctrine join query to get all record satisfies count greater than 1

I tried with normal sql query
SELECT activity_shares.id FROM `activity_shares`
INNER JOIN (SELECT `activity_id` FROM `activity_shares`
GROUP BY `activity_id`
HAVING COUNT(`activity_id`) > 1 ) dup ON activity_shares.activity_id = dup.activity_id
Which gives me record id say 10 and 11
But same query I tried to do in Doctrine query builder,
$qb3=$this->getEntityManager()->createQueryBuilder('c')
->add('select','c.id')
->add('from','MyBundleDataBundle:ActivityShare c')
->innerJoin('c.activity', 'ca')
// ->andWhere('ca.id = c.activity')
->groupBy('ca.id')
->having('count(ca.id)>1');
Edited:
$query3=$qb3->getQuery();
$query3->getResult();
Generated SQL is:
SELECT a0_.id AS id0 FROM activity_shares a0_
INNER JOIN activities a1_ ON a0_.activity_id = a1_.id
GROUP BY a1_.id HAVING count(a1_.id) > 1
This gives only 1 record, that is 10. I want to get both. I can't see where I went wrong. Any idea?
My tables structure is:
ActivityShare
+-----+---------+-----+---
| Id |activity |Share| etc...
+-----+---------+-----+----
| 1 | 1 |1 |
+-----+---------+-----+---
| 2 | 1 | 2 |
+-----+---------+-----+---
Activity is foreign key to Activity table.
I want to get Id's 1 and 2
Simplified SQL
First of all, let me simplify that query so that it gives the same result:
SELECT id FROM `activity_shares`
GROUP BY `id`
HAVING COUNT(`activity_id`) > 1
Doctrine QueryBuilder
If you store the id of the activity in the table, as your SQL suggests:
You can use the simplified SQL to build a query:
$results = $this->getEntityManager()->createQueryBuilder('c')
->add('select','c.id')
->add('from','MyBundleDataBundle:ActivityShare c')
->groupBy('c.id')
->having('count(c.activity)>1')
->getQuery() // the builder must be turned into a Query before fetching results
->getResult();
If you are using association tables (Doctrine logic)
Here you will have to use a join, but the count may be tricky.
Solution 1
Use the associative table like an entity (as I see it, you only need the id).
Let's say the table name is activityshare_activity.
It will have two fields, activity_id and activityshare_id. If you find a way to add a new id column to that table and make it auto-increment and primary, the rest is easy.
The new entity is called ActivityShareActivity:
$results = $this->getEntityManager()->createQueryBuilder('c')
->add('select','c.activityshare_id')
->add('from','MyBundleDataBundle:ActivityShareActivity c')
->groupBy('c.activityshare_id')
->having('count(c.activity_id)>1')
->getQuery() // the builder must be turned into a Query before fetching results
->getResult();
The steps to add the new identification column and make it compatible with Doctrine (you only need to do this once):
1. Add the column (INT, NOT NULL); don't make it auto-increment yet:
ALTER TABLE tableName ADD id INT NOT NULL
2. Populate the column, for example with a PHP for loop.
3. Modify the column to be auto-increment:
ALTER TABLE tableName MODIFY id INT NOT NULL AUTO_INCREMENT
Solution2
The correction to your query
$result = $this->getEntityManager()->createQueryBuilder()
->select('c.id')
->from('MyBundleDataBundle:ActivityShare', 'c')
->innerJoin('c.activity', 'ca')
->groupBy('c.id') //note: it's c.id not ca.id
->having('count(ca.id)>1')
->getQuery() // the builder must be turned into a Query before fetching results
->getResult();
I posted this one last because I am not 100% sure of the output of having + count, but it should work just fine :)
Thanks for your answers. I finally managed to get an answer.
My Doctrine query is:
$subquery=$this->getEntityManager()->createQueryBuilder('as')
->add('select','a.id')
->add('from','MyBundleDataBundle:ActivityShare as')
->innerJoin('as.activity', 'a')
->groupBy('a.id')
->having('count(a.id)>1');
$query=$this->getEntityManager()->createQueryBuilder('c')
->add('select','c.id')
->add('from','ChowzterDataBundle:ActivityShare c')
->innerJoin('c.activity', 'ca');
$query->andWhere($query->expr()->in('ca.id', $subquery->getDql()))
;
$result = $query->getQuery();
print_r($result->getResult());
And SQL looks like:
SELECT a0_.id AS id0 FROM activity_shares a0_
INNER JOIN activities a1_ ON a0_.activity_id = a1_.id
WHERE a1_.id IN (
SELECT a2_.id FROM activity_shares a3_
INNER JOIN activities a2_ ON a3_.activity_id = a2_.id
GROUP BY a2_.id HAVING count(a2_.id) > 1
)
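To see why the grouped join returned a single id while the IN subquery returns both, here is a small Python sketch using the built-in sqlite3 module. The rows mirror the table structure from the question (shares 1 and 2 point at activity 1), with a third share added for contrast. SQLite stands in for MySQL here, so this is an illustration of the idea rather than the exact Doctrine output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activities (id INTEGER PRIMARY KEY);
    CREATE TABLE activity_shares (id INTEGER PRIMARY KEY, activity_id INTEGER);
    INSERT INTO activities VALUES (1), (2);
    -- shares 1 and 2 belong to activity 1; share 3 is the only one for activity 2
    INSERT INTO activity_shares VALUES (1, 1), (2, 1), (3, 2);
""")

# Grouping by the activity collapses its shares into a single output row
grouped = conn.execute("""
    SELECT s.id FROM activity_shares s
    JOIN activities a ON s.activity_id = a.id
    GROUP BY a.id HAVING count(a.id) > 1
""").fetchall()

# Filtering with IN keeps one row per share of every duplicated activity
via_subquery = [row[0] for row in conn.execute("""
    SELECT s.id FROM activity_shares s
    WHERE s.activity_id IN (
        SELECT activity_id FROM activity_shares
        GROUP BY activity_id HAVING count(*) > 1
    )
    ORDER BY s.id
""")]
print(len(grouped), via_subquery)
```

The grouped form yields one row per activity, which is why only one share id came back, whereas the subquery form only uses the grouping to decide which activities qualify.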