Parse JSON as key value in Dataflow job - google-cloud-platform

How do I parse JSON data in Apache Beam and store it in a BigQuery table?
For example: JSON data
[{ "name":"stack"},{"id":"100"}].
How do I parse the JSON data and convert it to a PCollection of key-value pairs that can be stored in a BQ table?
Appreciate your help!!

Typically you would use a built-in JSON parser for your programming language (are you using Java or Python?), then create a TableRow object and use that for the PCollection which you are passing to the BQ table.
Note: some JSON parsers disallow JSON which starts with a root list, as you have shown in your example. They tend to prefer something like this, with a root map (I believe this is the case in Python's json library):
{"name":"stack", "id":"100"}
Please see this example pipeline for how to create the PCollection and use BigQueryIO.
You may also want to consider using one of the X to BigQuery template pipelines.
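For illustration, here is a minimal Java sketch of that flow, assuming one JSON object per input line in the root-map shape shown above; the Gson parser, the bucket path and the table reference are placeholders of my own, not something from the question:

PCollection<TableRow> rows =
    pipeline
        .apply("Read JSON lines", TextIO.read().from("gs://my-bucket/input.json"))
        .apply("Parse to TableRow", ParDo.of(new DoFn<String, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // One JSON object per line, e.g. {"name":"stack", "id":"100"}.
            JsonObject obj = JsonParser.parseString(c.element()).getAsJsonObject();
            c.output(new TableRow()
                .set("name", obj.get("name").getAsString())
                .set("id", obj.get("id").getAsString()));
          }
        }));

rows.apply("Write to BQ",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

This sketch assumes the destination table already exists with STRING columns name and id; otherwise provide a schema and use CREATE_IF_NEEDED.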

Related

How to read json lines into arrow::Table in c++?

Judging from the git log, support for line-delimited JSON in C++ seems to have never been completed.
example JSON lines:
https://jsonlines.org/examples/
relevant docs in arrow:
https://arrow.apache.org/docs/cpp/json.html#basic-usage
I am trying to compare an arrow::Table with a line-delimited JSON file [1].
Several approaches I tried:
Read the line-delimited JSON into an arrow::Table. Because we do not know the schema of the JSON (e.g. what the keys are and what the types of the values are), there does not seem to be a way to construct the arrow::Schema.
Convert the arrow::Table to line-delimited JSON by converting each row of the arrow::Table into a JSON line. arrow::Table is in columnar format, so it seems we have to traverse the columns to get the elements by index.
[1] https://arrow.apache.org/docs/python/json.html

Groupby existing attribute present in json string line in apache beam java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps, and I have to pick the latest one for each customer. I am planning to achieve this as follows:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten the result, convert to table rows, and insert into BQ.
But I am unable to get past the grouping step. I see GroupByKey.create(), but I am unable to make it use the customer ID as the key.
I am implementing using JAVA. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects =
    p.apply(TextIO.read().from(....))  // read the input files
     .apply(FormatData...);            // parse each line into a JsonObject
// Once we have the data in JsonObjects, we key by customer ID:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.
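If it helps, here is a minimal sketch of the "keep only the latest record" step on top of groupedData; the epoch-millis "timestamp" field name and the javax.json-style accessors are assumptions on my side:

static class LatestPerCustomerFn
    extends DoFn<KV<String, Iterable<JsonObject>>, KV<String, JsonObject>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    JsonObject latest = null;
    for (JsonObject record : c.element().getValue()) {
      // Assumes each record carries an epoch-millis "timestamp" field.
      if (latest == null
          || record.getJsonNumber("timestamp").longValue()
              > latest.getJsonNumber("timestamp").longValue()) {
        latest = record;
      }
    }
    c.output(KV.of(c.element().getKey(), latest));
  }
}

PCollection<KV<String, JsonObject>> latestPerCustomer =
    groupedData.apply(ParDo.of(new LatestPerCustomerFn()));

From there you can map each remaining JsonObject to a TableRow and write it with BigQueryIO, as in your step 4.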

Does Snowflake have a STRUCT data type that is identical to GBQ's STRUCT?

I am currently designing tables in Google Big Query and I will need to move the designs over to Snowflake in AWS within the next year. GBQ has a STRUCT datatype that allows for nested data within a column (Specifying nested and repeated columns). Does Snowflake have a similar data type/functionality? According to this article from Snowflake, the platform supports SQL queries that access semi-structured data. The sample data from both articles look the same and the verbiage is similar, however I am not sure if the two are the same types. Would I be able to translate a design that utilizes GBQ structs over to Snowflake without fully refactoring it?
I'm not a BQ expert, but I believe the key difference here is that BQ requires the definition of the STRUCT schema upfront. The equivalent type in Snowflake is the VARIANT type, which stores semi-structured data but doesn't require the schema upfront. As such, you shouldn't need to refactor anything as long as you can export the STRUCT column to JSON or Parquet or similar.

How to convert row from bigtable to Avro generic records

I am reading Bigtable into my PCollection and then trying to convert the read records to Avro GenericRecords. Is it possible to directly convert the Bigtable read output to generic records without writing any function in the PCollection?
For example : i am trying to do something like below
pipeline
    .apply("Read from Bigtable", read)
    .apply("Transform to generic records using AvroIO", AvroIO.<<>>(read));
In order to write Generic Records with AvroIO, you'll need to provide an Avro Schema, which I believe is incompatible with the output from BigtableIO, so this is not possible without a transformation between BigtableIO and AvroIO.
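To make that concrete, here is a rough sketch of what such an intermediate transformation could look like, assuming the Bigtable read emits com.google.bigtable.v2.Row and an Avro schema that (for illustration only) captures just the row key; the schema, field name and output path are my own placeholders:

static class RowToGenericRecordFn extends DoFn<com.google.bigtable.v2.Row, GenericRecord> {
  // Hypothetical schema: just the Bigtable row key as a string field.
  static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"BigtableRow\","
          + "\"fields\":[{\"name\":\"key\",\"type\":\"string\"}]}";
  private transient Schema schema;

  @Setup
  public void setup() {
    schema = new Schema.Parser().parse(SCHEMA_JSON);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Map whatever cells you need into the record; here only the row key.
    c.output(new GenericRecordBuilder(schema)
        .set("key", c.element().getKey().toStringUtf8())
        .build());
  }
}

Schema schema = new Schema.Parser().parse(RowToGenericRecordFn.SCHEMA_JSON);
pipeline
    .apply("Read from Bigtable", read)
    .apply("To GenericRecord", ParDo.of(new RowToGenericRecordFn()))
    .setCoder(AvroCoder.of(GenericRecord.class, schema))
    .apply("Write Avro", AvroIO.writeGenericRecords(schema).to("gs://my-bucket/avro-out"));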

Glue custom classifiers for CSV with non standard delimiter

I am trying to use AWS Glue to crawl a data set and make it available to query in Athena. My data set is a delimited text file using ^ to separate the columns. Glue is not able to infer the schema for this data, as the CSV classifier only recognises comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Is there a way of updating this classifier to include non-standard delimiters? The option to build custom classifiers only seems to support Grok, JSON or XML, which are not applicable in this case.
You will need to create a custom classifier using the custom Grok pattern and use that in the crawler. Suppose your data is like below with four fields:
qwe^123^22.3^2019-09-02
To process the above data, your custom pattern will look like below:
%{NOTSPACE:name}^%{INT:class_num}^%{BASE10NUM:balance}^%{CUSTOMDATE:balance_date}
Note that CUSTOMDATE is not a standard built-in Grok pattern, so you will likely also need to define it in the classifier's custom patterns field, for example: CUSTOMDATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}
Please let me know if that worked for you.