Hive Array<Struct<field1:string, field2:string>> equivalent in BigQuery - google-cloud-platform

I have a Hive table where one of the columns is an array:
ruleInfoList array<struct<field1:string,field2:string,field3:string,field4:string>>
I am trying to create a similar DDL in BigQuery but am unable to figure out how to do that.
Then I tried inserting the data into BigQuery without creating a table, using Java, and noticed the schema below. I'm not sure what Record is, or where list and element came from. Please see the screenshot below.

RECORD is STRUCT, and REPEATED is ARRAY in BigQuery SQL. The type of your column should be:
ruleinfolist STRUCT<list ARRAY<element STRUCT<dataClassCode STRING, dataRuleGroupCode STRING>>>
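The list and element wrappers above reflect how the loader represented the array; if you declare the table yourself, the direct equivalent of the Hive column is simply an ARRAY of STRUCTs. A minimal sketch (the dataset and table names are placeholders; the field names are taken from the question):
-- Sketch: direct BigQuery equivalent of the Hive array<struct<...>> column
-- (mydataset.mytable is a placeholder name)
CREATE TABLE mydataset.mytable (
  ruleInfoList ARRAY<STRUCT<field1 STRING, field2 STRING, field3 STRING, field4 STRING>>
);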
Tip: to see the type of a table column, you can query INFORMATION_SCHEMA.COLUMNS, and check the DATA_TYPE column.
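For example, a query along these lines (the dataset and table names are placeholders):
SELECT column_name, data_type
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'mytable';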

Related

Big Query - Convert an int column into float

I would like to convert a column called lauder from int to float in Big Query. My table is called historical. I have been able to use this SQL query
SELECT *, CAST(lauder as float64) as temp
FROM sandbox.dailydev.historical
The query works but the changes are not saved into the table. What should I do?
If you use SELECT * you will scan the whole table, and that is what you will be billed for. If the table is small this shouldn't be a problem, but if it is big enough for cost to be a concern, below is another approach:
apply ALTER TABLE ADD COLUMN to add a new column of the needed data type
apply UPDATE to populate the new column
UPDATE table
SET new_column = CAST(old_column as float64)
WHERE true
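A concrete sketch of those two steps for the table in the question; the new column name lauder_float is an assumption:
-- add a new FLOAT64 column (lauder_float is a placeholder name)
ALTER TABLE sandbox.dailydev.historical ADD COLUMN lauder_float FLOAT64;
-- populate it from the existing int column
UPDATE sandbox.dailydev.historical
SET lauder_float = CAST(lauder AS FLOAT64)
WHERE true;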
Do you want to save them in a temporary table to use it later?
You can save it to a temporary table like below and then refer to "temp"
with temp as
( SELECT *, CAST(lauder as float64) as lauder_float
  FROM sandbox.dailydev.historical)
select * from temp
You cannot change a column's data type in a table:
https://cloud.google.com/bigquery/docs/manually-changing-schemas#changing_a_columns_data_type
What you can do is either:
Create a view to sit on top and handle the data type conversion
Create a new column and set the data type to float64 and insert values into it
Overwrite the table
Options 2 and 3 are outlined well including pros and cons in the link I shared above.
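For option 1, a minimal sketch of such a view; the view name historical_v is an assumption:
-- historical_v is a placeholder view name; EXCEPT(lauder) drops the original int column
CREATE OR REPLACE VIEW sandbox.dailydev.historical_v AS
SELECT * EXCEPT(lauder), CAST(lauder AS FLOAT64) AS lauder
FROM sandbox.dailydev.historical;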
Your statement is correct, but table columns in BigQuery are immutable. You need to run your query and save the results to a new table with the modified column.
Click "More" > "Query settings", and in "Destination" select "Set a destination table for query results" and fill in the table name. You can even choose whether to overwrite the existing table with the generated one.
After these settings are set, just "Run" your query as usual.
You can use CREATE OR REPLACE TABLE to write the structural changes along with the data into the same table:
CREATE OR REPLACE TABLE sandbox.dailydev.historical
AS SELECT *, CAST(lauder as float64) as temp FROM sandbox.dailydev.historical;
In this example, the historical table will be recreated with an additional column temp.
In some cases you can change column types:
CREATE TABLE mydataset.mytable(c1 INT64);
ALTER TABLE mydataset.mytable ALTER COLUMN c1 SET DATA TYPE NUMERIC;
Check the conversion rules and the Google documentation.

AWS Athena query on parquet file - using columns in where clause

We are planning to use Athena as a backend service for our data (stored as partitioned Parquet files) in S3.
One of the things we are interested in finding out is how adding additional columns in the where clause of the query affects the query run time.
For example, we have 10 million records in one Hive partition (partitioned on the column 'date').
All the queries below return the same volume - 10 million rows. Would they all take the same time, or does the run time drop when we add additional columns in the where clause (since Parquet is a columnar format)?
I tried to test this, but the results were not consistent, as there was some queuing time as well, I guess:
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'
Parquet files contain page "indexes" (min/max statistics and bloom filters). If you sort the data by the columns in question during insert, for example like this:
insert overwrite table mytable partition (dt)
select col1, --some columns
type,
subtype,
dt
from source_table --source table name is a placeholder
distribute by dt
sort by type, subtype
then these indexes may work efficiently, because data with the same type and subtype will be loaded into the same pages, and data pages will be selected using the indexes. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Switch on predicate push-down: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html
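In Hive, this is typically done with session settings along these lines (a sketch; the properties exist in Hive, but whether they are needed and their defaults depend on your distribution and version):
-- enable predicate push-down and pushing filters down to the storage format (verify for your setup)
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;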

Variable in a Power BI query

I have a SQL query to get the data into Power BI. For example:
select a,b,c,d from table1
where a in ('1111','2222','3333' etc.)
However, the list of variables ('1111','2222','3333' etc.) will change every day, so I would like the SQL statement to be updated before refreshing the data. Is this possible?
Ideally, I would like to keep a spreadsheet with a list of a values (in this example) so before refresh, it will feed those parameters into this script.
Another problem I have is that the list will have a different number of parameters each time, so the last variable needs to be without a comma.
Another option I was considering is to run the script without the where a in ('1111','2222','3333' etc.) clause, then load the spreadsheet with a list of those a's and filter the report down based on that list; however, this would be a lot of data to import into Power BI.
It's my first post ever, although I was sourcing help from Stackoverflow for years, so hopefully, it's all clear.
I would create a new Query to read the "a values" from your spreadsheet. I would set the Load To / Import Data option to Only Create Connection (to avoid duplicating the data).
Then in your SQL query I would remove the where clause. With that gone you actually don't need to write custom SQL at all - just select the table/view from the Navigation UI.
Then from the "table1" query I would add a Merge Queries step, connecting to the "a values" Query on the "a" column, using the Join Type: Inner. The resulting rows will be only those with a matching "a" column value (similar to your current SQL where clause).
Power Query won't be able to send this to your SQL Server as a single query, so it will first select all the rows from table1. But it is still fairly quick and efficient.

Cannot read and write the query result back to BigQuery

I am using BigQueryIO.readTableRows().fromQuery(...) to read rows from BigQuery, then writing TableRow back to BigQuery using BigQueryIO.writeTableRows(). I have a table with the correct schema already created, so I am using CreateDisposition.CREATE_NEVER and will not have to set the schema in the Beam client. The problem is that all Record fields are flattened (underscore appended) in the query result and do not match the schema of the table, which is not in flattened form. Using .withoutResultFlattening() on reads did not help unflatten the Records, so I cannot get around this discrepancy. How do we query without flattening the result?
You can use Standard SQL since the results will not be flattened as explained here.
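For example, the same read expressed in standard SQL keeps nested Record fields intact instead of producing underscore-separated columns; in the Beam Java SDK this typically means enabling standard SQL on the read (e.g. usingStandardSql() on BigQueryIO). The project, dataset, table, and field names below are placeholders:
#standardSQL
-- the nested record is returned as-is, not flattened (all names are placeholders)
SELECT nested_record
FROM `myproject.mydataset.mytable`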

Create DynamoDB tables using Hive

I have in my cloud, inside an S3 bucket, a CSV file with some data.
I would like to export that data into a DynamoDB table with columns "key" and "value".
Here's the current Hive script I wrote:
CREATE EXTERNAL TABLE FromCSV(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ', '
LOCATION 's3://mybucket/output/';
CREATE EXTERNAL TABLE hiveTransfer(col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InvertedIndex",
"dynamodb.column.mapping" = "col1:key,col2:value");
INSERT OVERWRITE TABLE hiveTransfer SELECT * FROM FromCSV;
Now, basically the script works, though I would like to make some modifications to it as follows:
1) The script works only if the table "InvertedIndex" already exists in DynamoDB. I would like the script to create the new table by itself and then put in the data as it already does.
2) In the CSV the key is always a string, but I have two kinds of values, string or integer. I would like the script to distinguish between the two and make two different tables.
Any help with those two modifications will be appreciated.
Thank you
Hi, this could be accomplished, but it is not a trivial case.
1) Creating a DynamoDB table can't be done by Hive, because DynamoDB tables are managed by the Amazon cloud. One thing that comes to mind is to create a Hive UDF for creating the DynamoDB table and call it inside some dummy query before running the insert. For example:
SELECT CREATE_DYNO_TABLE() FROM dummy;
Where dummy table has only one record.
2) You can split the loading into two queries: in one query you will use the RLIKE operator with a [0-9]+ regular expression to detect numeric values, and in the other just the negation of that.
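A sketch of that split, reusing the FromCSV table from the question; the two target table names are assumptions and would each need their own CREATE EXTERNAL TABLE ... STORED BY DynamoDBStorageHandler definition:
-- numeric values only (hiveTransferNumeric is a placeholder table name)
INSERT OVERWRITE TABLE hiveTransferNumeric
SELECT key, value FROM FromCSV WHERE value RLIKE '^[0-9]+$';
-- everything else (hiveTransferString is a placeholder table name)
INSERT OVERWRITE TABLE hiveTransferString
SELECT key, value FROM FromCSV WHERE NOT (value RLIKE '^[0-9]+$');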
HTH,
Dino