Athena Error- trying to turn arrays into columns - bigquery - amazon-athena

I want several fields in the event_params array to be transformed into columns.
select
  event_date,
  user_pseudo_id,
  (select param.value.string_value
   from "bigquery", unnest(event_params) as t(param)
   where param.key = 'buttonName')
from "bigquery"
I'm getting this error even when trying with only one element of the array:
SUBQUERY_MULTIPLE_ROWS: Scalar sub-query has returned multiple rows
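The scalar subquery here selects from the whole "bigquery" table again, so it returns one string_value for every matching parameter across all events rather than just the current row's array, hence the multiple rows. One possible rewrite, as a sketch, unnests the array with a CROSS JOIN instead (the button_name alias is illustrative; note this form drops rows that have no buttonName parameter):
SELECT
  e.event_date,
  e.user_pseudo_id,
  param.value.string_value AS button_name
FROM "bigquery" AS e
CROSS JOIN UNNEST(e.event_params) AS t(param)
WHERE param.key = 'buttonName'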

Related

How to RENAME struct/array nested columns using ALTER TABLE in BigQuery?

Suppose we have the following table in BigQuery:
CREATE TABLE sample_dataset.sample_table (
id INT
,struct_geo STRUCT
<
country STRING
,state STRING
,city STRING
>
,array_info ARRAY
<
STRUCT<
key STRING
,value STRING
>
>
);
I want to rename the columns inside the STRUCT and the ARRAY using an ALTER TABLE command. It's possible to follow the Google documentation available here for normal ("non-nested") columns:
ALTER TABLE sample_dataset.sample_table
RENAME COLUMN id TO str_id
But when I try to run the same command for nested columns, I get errors from BigQuery.
Running the command for a column inside a STRUCT gives me the following message:
ALTER TABLE sample_dataset.sample_table
RENAME COLUMN `struct_geo.country` TO `struct_geo.str_country`
Error: ALTER TABLE RENAME COLUMN not found: struct_geo.country.
An equivalent message appears when I run the same statement, but targeting a column inside the ARRAY:
ALTER TABLE sample_dataset.sample_table
RENAME COLUMN `array_info.str_key` TO `array_info.str_key`
Error: ALTER TABLE RENAME COLUMN not found: array_info.str_key
I got stuck since the BigQuery documentation about nested columns (available here) lacks examples of ALTER TABLE statements and refers directly to the default documentation for non-nested columns.
I understand that I can rename the columns by simply creating a new table with CREATE TABLE new_table AS SELECT ... and passing the new column names as aliases, but this would run a query over the whole table, which I'd rather avoid since my original table is well over 10 TB...
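For reference, a minimal sketch of that CTAS workaround (the str_-prefixed names are illustrative):
CREATE TABLE sample_dataset.sample_table_renamed AS
SELECT
  id AS str_id,
  STRUCT(
    struct_geo.country AS str_country,
    struct_geo.state AS str_state,
    struct_geo.city AS str_city
  ) AS struct_geo,
  ARRAY(
    SELECT AS STRUCT
      info.key AS str_key,
      info.value AS str_value
    FROM UNNEST(array_info) AS info
  ) AS array_info
FROM sample_dataset.sample_table;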
Thanks in advance for any tips or solutions!

make all values of mutiple columns in one column Power Bi

I want to get this result (shown in the picture) using the code below, but it doesn't work. Any suggestions on how to get all the values of all the columns into one column while keeping duplicate occurrences?
AllZipCode=UNION(SUMMARIZE('Table','Table'[ZipCode1]),
SUMMARIZE('Table','Table'[ZipCode2]),
SUMMARIZE('Table','Table'[ZipCode3]))
You can't combine all the columns and return the result in the existing table; that will return an error. Instead, create a new table that unions the columns of the existing table to achieve the expected output:
Table = UNION(SELECTCOLUMNS(Sheet1,"col1",Sheet1[Zip1]),
SELECTCOLUMNS(Sheet1,"col2",Sheet1[Zip2]),
SELECTCOLUMNS(Sheet1,"col3",Sheet1[Zip3]))
Original table :
Union table:

Auto-Populating to create 1 list in 1 column based on data from two different columns (Google sheets)

Having some trouble with a very simple problem. I want to create one list in a column based on data from two different columns, and I only want each item to appear once in the list column.
To create the list based on data from one column, I used the formula below, but I don't know how to do it from two columns (and the number of occurrences/count isn't needed).
=ArrayFormula(QUERY(J2:J&{"",""},"select Col1, count(Col2) where Col1 != '' group by Col1 label count(Col2) 'Number of occurrences'",-1))
https://docs.google.com/spreadsheets/d/1loPw3eUALLKx3NzXrxhDszqD2_B7G3cyH1EhnYg4tFg/edit?ts=5cf63f94#gid=0
Maybe something like:
=sort(UNIQUE({A1:A10;B1:B12}))
This assumes your locale uses , as the argument separator, that the lists are in A1:A10 and B1:B12, and that you want the result sorted.

Pyspark with AWS Glue join on multiple columns creating duplicates

I have two tables in AWS Glue, table_1 and table_2, that have almost identical schemas; however, table_2 has two additional columns. I am trying to join these two tables together on the columns that are the same and add the columns that are unique to table_2, with null values for the "old" data whose schema does not include those columns.
Currently, I am able to join the two tables, using something similar to:
joined_table = Join.apply(table_1, table_2, 'id', 'id')
where the first 'id' is the id column in table_1 and the second 'id' is the id column in table_2. This call successfully joins the tables into one; however, the resulting joined_table has duplicate fields for the matching columns.
My two questions are:
How can I leverage AWS Glue job with Pyspark to join all columns that match across the two tables so that there are not duplicate columns and while adding the new fields?
This sample call only takes in the 'id' column, as I was just trying to get it to work, but I want to pass in all the columns that match across the two tables. How can I pass a list of columns to this Join.apply call? I am aware of the methods available directly from PySpark, but I'm wondering if there is a way specific to AWS Glue jobs, or if there is something I need to do within AWS Glue to leverage PySpark functionality directly.
I found that I needed to rename the overlapping columns in table_1, and that I was missing a call to .drop_fields after my Join.apply call to remove the old columns from the joined table.
Additionally, you can pass in a list of column names rather than the single 'id' column that I was trying to use in the question.
joineddata = Join.apply(frame1 = table1, frame2 = table2, keys1 = ['id'], keys2 = ['id'], transformation_ctx = 'joinedData')
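A minimal sketch of that rename-then-drop approach, assuming 'name' and 'status' stand in for the overlapping non-key columns (those field names are placeholders, not from the question):
from awsglue.transforms import Join

# rename the overlapping non-key fields in table_1 so they don't collide after the join
table1_renamed = table_1.rename_field('name', 'name_old').rename_field('status', 'status_old')

joineddata = Join.apply(frame1 = table1_renamed, frame2 = table_2, keys1 = ['id'], keys2 = ['id'], transformation_ctx = 'joinedData')

# drop the duplicated "old" columns from the joined DynamicFrame
joineddata = joineddata.drop_fields(['name_old', 'status_old'])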
The join in AWS Glue doesn't handle duplicates. You need to convert to a DataFrame and then drop the duplicates.
If you have duplicates, try this:
selectedFieldsDataFrame = joineddata.toDF()
# dropDuplicates returns a new DataFrame, so keep the result
selectedFieldsDataFrame = selectedFieldsDataFrame.dropDuplicates()

TypeError: dataset must be a Dataset or a DatasetReference(GBQ)

I am trying to list the datasets within a project and the tables within a dataset, but I am unable to understand the meaning of TypeError: dataset must be a Dataset or a DatasetReference.
Code 1 : List datasets within a project
from google.cloud import bigquery
GBQ_client = bigquery.Client(project= config.PROJECT_ID)
print GBQ_client.list_datasets()
Output:
<google.api_core.page_iterator.HTTPIterator object at 0x000000000660ACF8>
Code 2 : List tables within a dataset
tables = GBQ_client.list_tables(dataset = config.Dataset_ID)
where config.Dataset_ID = 'projectId:xxxxxxx'
Output:
TypeError: dataset must be a Dataset or a DatasetReference
To list all datasets within your project you were close; you just need to iterate over the HTTPIterator:
datasets = list(GBQ_client.list_datasets())
Now, to iterate over tables, there are several ways to do so. If you have a specific dataset ID such as "dataset123", one way you could do it is:
dataset = [dataset for dataset in client.list_datasets() if dataset.dataset_id == 'dataset123'][0] # be careful, if empty this will break
tables = list(client.list_tables(dataset=dataset.reference))
And tables will be a list with one item for each table you have in "dataset123".
If you know upfront the name of the table you can retrieve it directly:
table = dataset.table('your table name')
Notice that the list_tables method expects either a bigquery.dataset.Dataset or a bigquery.dataset.DatasetReference object; as it seems, you used a string (config.Dataset_ID), so it is not accepted as input (notice I used dataset.reference as input for this method).
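If the dataset ID is known up front, a minimal sketch of building the reference directly (the 'my_dataset' ID is a placeholder, and this assumes a google-cloud-bigquery version whose list_tables accepts a DatasetReference, as in the error above):
from google.cloud import bigquery

client = bigquery.Client(project=config.PROJECT_ID)

# build a DatasetReference from the project and dataset IDs, then list its tables
dataset_ref = bigquery.DatasetReference(config.PROJECT_ID, 'my_dataset')
for table in client.list_tables(dataset_ref):
    print(table.table_id)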