I am using BigQueryIO.readTableRows().fromQuery(...) to read rows from BigQuery and then writing the TableRows back to BigQuery with BigQueryIO.writeTableRows(). The destination table already exists with the correct schema, so I use CreateDisposition.CREATE_NEVER and do not have to set a schema in the Beam client. The problem is that all RECORD fields are flattened (with an underscore appended) in the query result and no longer match the schema of the destination table, which is not in flattened form. Using .withoutResultFlattening() on the read did not unflatten the RECORDs, so I cannot get around this discrepancy. How do I query without flattening the result?
You can use Standard SQL, since results of Standard SQL queries are not flattened, as explained here.
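For example, with Beam's Java SDK the read can be switched to Standard SQL via usingStandardSql(); this is only a sketch, and the project, dataset, and table names below are placeholders:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class UnflattenedReadPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Standard SQL results are never flattened, so nested RECORD fields survive the read.
    PCollection<TableRow> rows = p.apply("ReadFromBigQuery",
        BigQueryIO.readTableRows()
            .fromQuery("SELECT * FROM `my-project.my_dataset.source_table`")  // placeholder query
            .usingStandardSql());

    rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.destination_table")                    // placeholder table
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }
}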
I am using BigQuery scheduled queries, and I would like to know one thing about writing the output of a SELECT statement into a separate table.
I am new to Google Cloud Platform.
If the scheduled query ends with ORDER BY, will the destination table preserve that order? Is the result written out exactly as it comes from the query?
I would like the output of the scheduled query to be written to the table sorted by ORDER BY id. Is that possible?
Sorry for the rudimentary question, but thank you in advance.
BigQuery tables don't guarantee any particular order of their records. The exception is clustering, which physically sorts the data in the background, but even that does not guarantee a sorted query output. To get a guaranteed sorted output you have to use ORDER BY in the query that reads the table, which means you need a key you can sort by.
I have a Hive table where one of the columns is an array of structs:
ruleInfoList array<struct<field1:string,field2:string,field3:string,field4:string>>
I am trying to create a similar DDL in BigQuery but cannot figure out how to do it.
I then tried inserting the data into BigQuery from Java without creating the table first, and noticed the schema below. I'm not sure what RECORD is, or where the list and element fields came from. Please see the screenshot below.
In BigQuery SQL, RECORD corresponds to STRUCT and REPEATED corresponds to ARRAY, so the type of your column should be:
ruleinfolist STRUCT<list ARRAY<STRUCT<element STRUCT<dataClassCode STRING, dataRuleGroupCode STRING>>>>
Tip: to see the type of a table column, you can query INFORMATION_SCHEMA.COLUMNS, and check the DATA_TYPE column.
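If you create the table from Java instead of DDL, the same shape can be built with the BigQuery client library. This is only a sketch, and the field names simply mirror the ones in the screenshot:

import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;

public class RuleInfoListSchema {
  public static Schema build() {
    // element: STRUCT<dataClassCode STRING, dataRuleGroupCode STRING>
    Field element = Field.of("element", StandardSQLTypeName.STRUCT,
        Field.of("dataClassCode", StandardSQLTypeName.STRING),
        Field.of("dataRuleGroupCode", StandardSQLTypeName.STRING));

    // list: REPEATED STRUCT, i.e. ARRAY<STRUCT<element STRUCT<...>>>
    Field list = Field.newBuilder("list", StandardSQLTypeName.STRUCT, element)
        .setMode(Field.Mode.REPEATED)
        .build();

    // ruleinfolist: STRUCT<list ARRAY<STRUCT<element STRUCT<...>>>>
    return Schema.of(Field.of("ruleinfolist", StandardSQLTypeName.STRUCT, list));
  }
}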
I have a report that displays data for a certain period. The report has two parameters, the start date and the end date, and data is displayed for the specified period. I want to load the data dynamically using a filter in the report, rather than by changing the query parameters. How can this be implemented?
You can set up a DirectQuery connection, which pulls in the data only when it is needed.
Note that you can have a composite model where some tables use DirectQuery and others use Import.
I am currently exploring how to query only the streaming buffer data in tables at regular intervals, to generate a performance report in near real time, and found the following Stack Overflow question:
How to query for data in streaming buffer ONLY in BigQuery?
However, my table's partitioning is implemented using --time_partitioning_field, i.e. it is partitioned on a column.
Using the following query ends up scanning all the data in the table:
SELECT * FROM `<project>.<data-set>.<time-partitioned-streaming-table>`
WHERE <time-partitioning-field> IS NULL
The query doesn't show any difference, even though the streaming buffer peaks at roughly 60 MB per hour.
Is there a way to query only the streaming buffer data with this type of partitioning?
I believe this should work (but the query itself has to be legacy SQL):
#standardSQL
CREATE TABLE test.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY transaction_date
OPTIONS(
partition_expiration_days=3,
description="a table partitioned by transaction_date"
)
#legacySQL
SELECT * FROM [test.newtable$__UNPARTITIONED__]
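If you need to run that legacy SQL query from a Java program rather than the console, a rough sketch with the BigQuery client library could look like this (the table name is the placeholder one from above):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class StreamingBufferQuery {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // The $__UNPARTITIONED__ decorator only works with legacy SQL,
    // and only applies to ingestion-time partitioned tables.
    QueryJobConfiguration config = QueryJobConfiguration
        .newBuilder("SELECT * FROM [test.newtable$__UNPARTITIONED__]")
        .setUseLegacySql(true)
        .build();

    TableResult result = bigquery.query(config);
    for (FieldValueList row : result.iterateAll()) {
      System.out.println(row);
    }
  }
}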
It is not possible to query the streaming buffer separately for column-partitioned tables, because once a specific TIMESTAMP or DATE value has been set, the data is "streamed directly to the partition".
In the official documentation you can also find the solution for ingestion-time partitioned tables that is mentioned in the link you posted.
I have exported a MySQL table to a Parquet file (Avro based). Now I want to read particular columns from that file. How can I read particular columns completely? I am looking for Java code examples.
Is there an API where I can pass the columns I need and get back a 2D array of the table?
If you can use Hive, creating a Hive table and issuing a simple SELECT query would be by far the easiest option.
create external table tbl1(<columns>) stored as parquet location '<file_path>';
select col1, col2 from tbl1;
-- this works in Hive 0.14
You can use the Hive JDBC driver to do that from a Java program as well.
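A rough sketch of that JDBC route from Java; the HiveServer2 host, port, database, and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Requires the hive-jdbc driver on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "user", "");  // placeholder connection details
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("select col1, col2 from tbl1")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getString(2));
      }
    }
  }
}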
Otherwise, if you want to stay completely in Java, you need to modify the Avro schema by excluding all the fields except the ones you want to fetch. Then, when you read the file, supply the modified schema as the reader schema and it will only read the included columns. But you will get your original Avro record back with the excluded fields nullified, not a 2D array.
To modify the schema, look at org.apache.avro.Schema and org.apache.avro.SchemaBuilder. Make sure the modified schema is compatible with the original schema.
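A rough sketch of that approach with parquet-avro: the trimmed schema is built with SchemaBuilder and passed via AvroReadSupport.setRequestedProjection (since the file is Parquet rather than a plain Avro container). The file path, record name, column names, and types are placeholders you would adapt to your schema:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetColumnReader {
  public static void main(String[] args) throws Exception {
    // Trimmed schema containing only the columns we want to fetch (placeholder names and types).
    Schema projection = SchemaBuilder.record("tbl1").fields()
        .optionalString("col1")
        .optionalLong("col2")
        .endRecord();

    Configuration conf = new Configuration();
    // Only the projected columns are read from the Parquet file.
    AvroReadSupport.setRequestedProjection(conf, projection);

    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("/path/to/file.parquet"))
                 .withConf(conf)
                 .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record.get("col1") + "\t" + record.get("col2"));
      }
    }
  }
}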
Options:
Create a Hive table over the file with all the columns, with storage format Parquet, and read the required columns by specifying the column names in the query.
Create a Thrift definition for the table and use the Thrift fields to read the data from code (Java or Scala).
You can also use Apache Drill, which natively parses Parquet files.