I need help querying an array column in Athena. Currently I have a table like the one below:
1 2020-05-06 01:13:48 dv1 [{addedtitle=apple, addedvalue=null, keytitle=Increase apple, key=p9, recvalue=0.899999999, unit=lbs, isbalanced=null}, {addedtitle=Orange (12%), addedvalue=15.0, keytitle=Increase Orange, key=p8, recvalue=18.218999999999998, unit=fl oz, isbalanced=null}, {addedtitle=Lemon, addedvalue=32.0, keytitle=Increase Lemon, key=p10, recvalue=33.6, unit=oz, isbalanced=null}, {addedtitle=Calcium (100%), addedvalue=86.0, keytitle=Increase Calcium , key=p6, recvalue=88.72002, unit=oz, isbalanced=null}, {addedtitle=Mango, addedvalue=10.0, keytitle=Increase Mango, key=p11, recvalue=11.7, unit=oz, isbalanced=null}]
2 2020-05-07 04:30:45 dev2 [{addedtitle=apple (12%), addedvalue=0.0, keytitle=Increase apple, key=p8, recvalue=0.88034375, unit=fl oz, isbalanced=null}, {addedtitle=Orange(31.4%), addedvalue=0.0, keytitle=Decrease Orange, key=p10, recvalue=1.83733225, unit=fl oz, isbalanced=null}, {addedtitle=Tree, addedvalue=0.0, keytitle=Increase Tree, key=p11, recvalue=1.69, unit=oz, isbalanced=null}]
5 2020-05-06 12:55:12 dev5 [{addedtitle=salt, addedvalue=0.0, keytitle=Increase salt, key=p9, recvalue=0.052500000000000005, unit=lbs, isbalanced=null}]
6 2020-05-08 07:03:59 dev6 [{addedtitle=Sugar, addedvalue=6.0, keytitle=Decrease sugar, key=p9, recvalue=2.4000000000000004, unit=fl oz, isbalanced=null}]
7 2020-05-06 12:52:39 dev7 []
8 2020-05-06 04:15:05 dev8 []
9 2020-05-07 05:02:38 dev9 []
I need to break this third array column down into further columns so that I can import the data into QuickSight. At the moment QuickSight does not recognize the third column and reports an unsupported data type.
Can somebody please help with how to break this array into columns/rows for analysis?
The JSON-like data in your example is unfortunately not in a format that Athena can parse.
For anyone else finding this question I can explain how it can be done if the data is JSON formatted (e.g. {"addedtitle": "apple",… and not {addedtitle=apple,…). I'm also going to assume that there are tabs between the columns and not spaces (if there are spaces you have to use the Grok serde).
First you create a table that reads tab-separated values:
CREATE EXTERNAL TABLE my_table (
line_number int,
date_stamp timestamp,
id string,
data string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://my-bucket/path/to/data/'
Note how the data column is typed as string and not a complex type. Had each row been only JSON we could have used the JSON serde and specified the type as a complex type – but as far as I know the serde for TSV does not support complex types (nor embedded JSON).
To extract properties from the JSON data we can use JSON functions, and we can use UNNEST to create rows from each array element. You are probably after a combination of the two, for example:
SELECT
id,
JSON_EXTRACT_SCALAR(element, '$.addedtitle') AS addedtitle,
JSON_EXTRACT_SCALAR(element, '$.recvalue') AS recvalue
FROM my_table
CROSS JOIN UNNEST(CAST(JSON_PARSE(data) AS ARRAY(JSON))) AS t(element)
Given the data in your question this would return the following (rows with an empty array, like dev7, dev8 and dev9, produce no output rows):
id   | addedtitle     | recvalue
-----+----------------+----------------------
dv1  | apple          | 0.899999999
dv1  | Orange (12%)   | 18.218999999999998
dv1  | Lemon          | 33.6
dv1  | Calcium (100%) | 88.72002
dv1  | Mango          | 11.7
dev2 | apple (12%)    | 0.88034375
dev2 | Orange(31.4%)  | 1.83733225
dev2 | Tree           | 1.69
dev5 | salt           | 0.052500000000000005
dev6 | Sugar          | 2.4000000000000004
Please note that the above assumes that the data column is valid JSON; from your question it does not look like this is the case. As it stands, the data does not appear to be in a format that Athena supports.
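If your rows really do arrive in that key=value form and you cannot change the producer, one possible (and admittedly fragile) workaround is to coerce the string into JSON with a regular expression before parsing. The following is only a sketch under assumptions that happen to hold for your sample: keys are single words, and values never contain commas, equals signs, or curly braces. It also quotes every value (including numbers and null) as a JSON string, which JSON_EXTRACT_SCALAR handles fine:
SELECT
  id,
  JSON_EXTRACT_SCALAR(element, '$.addedtitle') AS addedtitle,
  JSON_EXTRACT_SCALAR(element, '$.recvalue') AS recvalue
FROM my_table
CROSS JOIN UNNEST(
  CAST(
    JSON_PARSE(
      -- rewrite key=value pairs as "key": "value" (every value becomes a JSON string)
      REGEXP_REPLACE(data, '(\w+)=([^,}]*)', '"$1": "$2"')
    ) AS ARRAY(JSON)
  )
) AS t(element)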
Related
I read some text from a PDF into Power Automate Desktop. It is in the form of a list like this:
0 | 123 Testing Company 23.00
1 | Generation Z Co 555.11
2 | Tea Company 1,234.99
I need to separate the list into columns, so that the number at the end of each element is in its own column, like this:
0 | 123 Testing Company | 23.00
1 | Generation Z Co | 555.11
2 | Tea Company | 1,234.99
Is there a way to do this? I've tried to extract tables from PDF instead but this method does not return the right data, because it seems PAD doesn't recognize it as a table.
Is there a way to convert a list into a data table?
The only way seems to be to For Loop the list, Parse Text each element into its columns, and insert each one as a row into the data table. I feel like there should be a quicker way to do this.
I'm trying to load S3 data that is in .csv format. The S3 bucket has many files, each with a different number of columns and a different column order, and when I use the COPY command the data ends up stored in the wrong columns.
Example:
File1
client_id | event_timestamp | event_name
aaa1 | 2020-08-21 | app_launch
bbb2 | 2020-10-11 | first_launch
File2
a_sales| event_timestamp | client_id | event_name
2039 | 2020-08-27 | ccc1 | app_used
3123 | 2020-03-15 | aaa2 | app_uninstalled
Desired OUTPUT:
a_sales | client_id | event_name | event_timestamp
2039 | ccc1 | app_used | 2020-08-27
3123 | aaa2 | app_uninstalled | 2020-03-15
| aaa1 | app_launch | 2020-08-21
| bbb2 | first_launch | 2020-10-11
I have tried the SQL script below, which runs successfully but doesn't give the desired output. Can someone help me out with this issue?
COPY public.sample_table
FROM 's3://mybucket/file*'
iam_role 'arn:aws:iam::99999999999:role/RedShiftRole'
FILLRECORD DELIMITER ',' IGNOREHEADER 1;
You can COPY the data from the S3 bucket into staging tables whose structure matches each file layout.
Then you can either move the data from these two staging tables into a combined table, or create a view that reads them into a unified structure.
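As a rough sketch (the staging table names and column types below are assumptions based on your example files, and the S3 paths are illustrative):
-- One staging table per file layout
CREATE TABLE stage_file1 (client_id varchar(50), event_timestamp date, event_name varchar(100));
CREATE TABLE stage_file2 (a_sales int, event_timestamp date, client_id varchar(50), event_name varchar(100));

COPY stage_file1 FROM 's3://mybucket/file1'
iam_role 'arn:aws:iam::99999999999:role/RedShiftRole'
DELIMITER ',' IGNOREHEADER 1;

COPY stage_file2 FROM 's3://mybucket/file2'
iam_role 'arn:aws:iam::99999999999:role/RedShiftRole'
DELIMITER ',' IGNOREHEADER 1;

-- Combine into one structure (a view works the same way with CREATE VIEW ... AS)
INSERT INTO public.sample_table (a_sales, client_id, event_name, event_timestamp)
SELECT a_sales, client_id, event_name, event_timestamp FROM stage_file2
UNION ALL
SELECT CAST(NULL AS int), client_id, event_name, event_timestamp FROM stage_file1;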
So the COPY command does NOT align data to columns based on the text in the header row of the CSV file. You need to specify which columns of the table you want to populate, in the same order as the data appears in the CSV file.
See: Copy-command
Since your two types of files have different column orders (and columns) you will need to have a different column list for each type.
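For example, a sketch of loading each file type into public.sample_table with an explicit column list (the S3 paths here are illustrative, assuming the two layouts sit under different prefixes):
-- Columns listed in the order they appear in File1
COPY public.sample_table (client_id, event_timestamp, event_name)
FROM 's3://mybucket/file1'
iam_role 'arn:aws:iam::99999999999:role/RedShiftRole'
DELIMITER ',' IGNOREHEADER 1;

-- Columns listed in the order they appear in File2
COPY public.sample_table (a_sales, event_timestamp, client_id, event_name)
FROM 's3://mybucket/file2'
iam_role 'arn:aws:iam::99999999999:role/RedShiftRole'
DELIMITER ',' IGNOREHEADER 1;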
I have a table that contains multiple columns whose names have either the suffix _EXPECTED or _ACTUAL. For example, looking at my sold items in the SoldItems table, I have the following columns: APPLES_EXPECTED, BANANAS_EXPECTED, KIWIS_EXPECTED, APPLES_ACTUAL, BANANAS_ACTUAL, KIWIS_ACTUAL (the identifier of the table is the date, so we have results per date). I want to show that data in table form, something like this (for a selected date in the filters):
+------------+----------+--------+
| Sold items | Expected | Actual |
+------------+----------+--------+
| Apples | 10 | 15 |
| Bananas | 8 | 5 |
| Kiwis | 2 | 1 |
+------------+----------+--------+
How can I manage something like this in Power BI? I tried playing with the matrix/table visualization, but I can't figure out a way to merge all the expected and actual columns together.
It looks like the easiest option for you would be to mould the data a bit differently using Power Query. You can unpivot your data so that all the expected and actual values become rows instead of columns. For example, take the following sample:
Date Apples_Expected Apples_Actual
1/1/2019 1 2
Once you unpivot this it will become:
Date Fruit Count
1/1/2019 Apples_Expected 1
1/1/2019 Apples_Actual 2
Once you unpivot, it should be fairly straightforward to get the view you are looking for. The following link should walk you through the steps to unpivot:
https://support.office.com/en-us/article/unpivot-columns-power-query-0f7bad4b-9ea1-49c1-9d95-f588221c7098
Hope this helps.
My understanding of SAS is very elementary. I am trying to do something like this and I need help.
I have a primary dataset A with 20,000 observations, where Col1 stores the CITY and Col2 stores the MILES. Col2 contains a lot of missing data, as shown below.
+----------------+---------------+
| Col1 | Col2 |
+----------------+---------------+
| Gary,IN | 242.34 |
+----------------+---------------+
| Lafayette,OH | . |
+----------------+---------------+
| Ames, IA | 123.19 |
+----------------+---------------+
| San Jose,CA | 212.55 |
+----------------+---------------+
| Schuaumburg,IL | . |
+----------------+---------------+
| Santa Cruz,CA | 454.44 |
+----------------+---------------+
I have another, secondary dataset B. It has around 5,000 observations and is very similar to dataset A, in that Col1 stores the CITY and Col2 stores the MILES. However, in dataset B, Col2 DOES NOT CONTAIN MISSING DATA.
+----------------+---------------+
| Col1 | Col2 |
+----------------+---------------+
| Lafayette,OH | 321.45 |
+----------------+---------------+
| San Jose,CA | 212.55 |
+----------------+---------------+
| Schuaumburg,IL | 176.34 |
+----------------+---------------+
| Santa Cruz,CA | 454.44 |
+----------------+---------------+
My goal is to fill in the missing miles in dataset A based on the miles in dataset B by matching the city names in Col1.
In this example, I am trying to fill in 321.45 from dataset B into dataset A, and similarly 176.34, by matching Col1 (city names) between the two datasets.
I need help doing this in SAS.
You just have to merge the two datasets. Note that the values of Col1 need to match exactly in the two datasets.
Also, I am assuming that Col1 is unique in dataset B. Otherwise you need to specify more exactly which value you want to use, or remove the duplicates (for example by adding nodupkey to the proc sort statement).
Here is an example of how to merge in SAS:
proc sort data=A;
  by Col1;
run;

proc sort data=B;
  by Col1;
run;

data AB;
  merge A(in=a) B(keep=Col1 Col2 rename=(Col2=Col2_new));
  by Col1;
  if a;
  if missing(Col2) then Col2 = Col2_new;
  drop Col2_new;
run;
This includes all observations and columns from dataset A. If Col2 is missing in A then we use the value from B.
Pekka's solution works perfectly; I am adding an alternative solution for the sake of completeness.
Sometimes in SAS a PROC SQL lets you skip some steps compared to a DATA step (with a corresponding saving in storage and computation time), and a MERGE is a typical example.
Here you can avoid sorting both input datasets and renaming variables (here the matching key has the same name, col1, in both datasets, but in general that is not the case).
proc sql;
  create table want as
    select A.col1,
           coalesce(A.col2, B.col2) as col2
    from A left join B
      on A.col1 = B.col1
    order by A.col1;
quit;
The coalesce() function returns the first non-missing value encountered in its argument list.
Hopefully this makes sense; I will probably just keep mulling this over until I figure it out. I have a table that is formatted in such a way that a specific date may have more than one record assigned. Each record is a plant, so the structure of that table looks like the pinkish table in the image below. However, when using the Google Chart API the data needs to be in the format of the blue table for a line chart, which I have working.
I am looking to create a graph in the Google Chart API, similar to the Excel graph, using the pink table, where on one date, e.g. 01/02/2003, there are three species recorded, A, B and C, with values 1, 2 and 3. I thought about possibly using a scatter chart, but that didn't work either.
What ties these together is the CenterID: all these records belong to CenterID XXX. Each record with its species also has a SheetID that groups them together; for example, for SheetID = 23 all those species were recorded on the same date.
Looking for suggestions, whether Google Chart API or PHP amendments. My PHP is below (I will switch to json_encode eventually).
$sql = "SELECT * FROM userrecords";
$stmt = $conn->prepare($sql);
$stmt->execute();
$data = $stmt->fetchAll();

$dataArray = [];
foreach ($data as $row)
{
    // eventdate is stored as YYYY-MM-DD; JavaScript Date months are zero-based
    $dateArray = explode('-', $row['eventdate']);
    $year  = $dateArray[0];
    $month = $dateArray[1] - 1;
    $day   = $dateArray[2];
    $dataArray[] = "[new Date ($year, $month, $day), {$row['scientificname']}, {$row['category_of_taxom']}]";
}
To get that chart, where the dates are the series instead of the axis values, you need to change the way you are pulling your data. Assuming your database is structured like the pink table, you need to pivot the data on the date column instead of the species column to create a structure like this (one way to do the pivot in SQL is sketched after the table):
| Species | 01/02/2003 | 01/03/2003 | 01/04/2003 |
|---------|------------|------------|------------|
| A | 1 | 2 | 3 |
| B | 3 | 1 | 4 |
| C | 1 | 3 | 5 |
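A minimal sketch of that pivot using conditional aggregation in SQL. The table and column names userrecords, eventdate and scientificname come from your PHP; the value column (here called recorded_value) and the exact date literals are assumptions, so substitute whatever column and dates you actually plot:
SELECT
    scientificname AS species,
    SUM(CASE WHEN eventdate = '2003-02-01' THEN recorded_value END) AS value_01_02_2003,
    SUM(CASE WHEN eventdate = '2003-03-01' THEN recorded_value END) AS value_01_03_2003,
    SUM(CASE WHEN eventdate = '2003-04-01' THEN recorded_value END) AS value_01_04_2003
FROM userrecords
GROUP BY scientificname
ORDER BY scientificname;
Each date column then becomes one series, with the species as the axis values, and each result row maps onto one row of the Google Charts DataTable.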