Using assertions with Google Cloud Dataform - google-cloud-platform

I am creating an incremental table which pulls data from two tables (using a UNION on the two tables):
select * from ${ref("tableTrinity")}
${when(incremental(), `WHERE created_time > (SELECT MAX(created_time) FROM ${self()})`) }
UNION ALL
select * from ${ref("tableDavis")}
${when(incremental(), `WHERE created_time > (SELECT MAX(created_time) FROM ${self()})`) }
On both tables I have applied assertions that the unique key must not be null. To test this, I set the key value of a single row in tableTrinity to null and executed the incremental SQLX script.
The UNION runs with no failures and the null value is pulled in.
Config for tableTrinity:
config {
  type: "table",
  description: "Table for location Trinity & 6th Street",
  assertions: {
    nonNull: ["trip_id"],
    uniqueKey: ["trip_id"]
  },
  tags: ["Derived Table", "Non PII"]
}
Config for the incremental table:
config {
  type: "incremental",
  dependencies: ["dataformTest_tableTrinity_assertions_rowConditions"],
  description: "Incremental Test",
  tags: ["Dependent Table"],
  uniqueKey: ["trip_id"],
  assertions: {
    nonNull: ["trip_id"]
  },
  bigquery: {
    labels: {
      department: "bikes",
      "cost-center": "mechanics"
    }
  }
}
Though there is documentation on assertions, I could not find any example or video of how this works, how the orchestration is stopped on failures, or how to review the failed data.
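My current (possibly wrong) understanding is that each assertion compiles to a view that is expected to return zero rows, so the only way I have found to review the failed data is to query that view directly. The dataset and view names below are guesses based on the generated dependency name above; adjust them to whatever assertion schema your project is configured with:
-- Sketch only: any rows returned by the assertion view are the records that failed the assertion
SELECT *
FROM dataform_assertions.dataformTest_tableTrinity_assertions_rowConditions;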
Has anyone been able to implement assertions to stop an execution on assertion failure?

Related

How to apply policy tags with Dataform

I'm trying to use Dataform (GCP) to apply policy tags; it runs successfully, but the policy tag is not applied to the specified column.
This is an example from the Dataform website, and it doesn't work either.
config {
  type: "table",
  columns: {
    column1: {
      description: "Some description",
      bigqueryPolicyTags: ["projects/dataform-integration-tests/locations/us/taxonomies/800183280162998443/policyTags/494923997126550963"]
    }
  }
}
select "test" as column1
I've already checked the user's grants; it has the BigQuery Admin and Dataplex Admin roles.
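I've also been verifying whether the tag actually landed by dumping the table schema with bq and looking for a policyTags entry on the column (a sketch; substitute your own project, dataset, and table):
# Sketch: dump the schema as JSON and check column1 for a "policyTags" entry
bq show --schema --format=prettyjson my-project:my_dataset.my_table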
Has anyone had the same problem and managed to solve it?
It really was a bug, and Google has already fixed it.

How to pass query parameter of BigQuery insert job in Cloud Workflow using Terraform

I encountered an error when running a Cloud Workflow that's supposed to execute a parameterised query.
The Cloud Workflows error is as follows:
"message": "Query parameter 'run_dt' not found at [1:544]",
"reason": "invalidQuery"
The Terraform code that contains the workflow is like this:
resource "google_workflows_workflow" "workflow_name" {
name = "workflow"
region = "location"
description = "description"
source_contents = <<-EOF
main:
params: [input]
steps:
- init:
assign:
- project_id: ${var.project}
- location: ${var.region}
- run_dt: $${map.get(input, "run_dt")}
- runQuery:
steps:
- insert_query:
call: googleapis.bigquery.v2.jobs.insert
args:
projectId: ${var.project}
body:
configuration:
query:
query: ${replace(templatefile("../../bq-queries/query.sql", { "run_dt" = "input.run_dt" } ), "\n", " ")}
destinationTable:
projectId: ${var.project}
datasetId: "dataset-name"
tableId: "table-name"
create_disposition: "CREATE_IF_NEEDED"
write_disposition: "WRITE_APPEND"
allowLargeResults: true
useLegacySql: false
partitioning_field: "dt"
- the_end:
return: "SUCCESS"
EOF
}
The query in the query.sql file looks like this:
SELECT * FROM `project.dataset.table-name`
WHERE sv.dt=#run_dt
With the code above, the Terraform deployment succeeded, but the workflow failed.
If I wrote "input.run_dt" without double quotes, I'd encounter a Terraform error:
A managed resource "input" "run_dt" has not been declared in the root module.
If I wrote it as $${input.run_dt}, I'd encounter a Terraform error:
This character is not used within the language.
If I wrote it as ${input.run_dt}, I'd encounter a Terraform error:
Expected the start of an expression, but found an invalid expression token.
How can I pass the query parameter of this BigQuery job in Cloud Workflow using Terraform?
Found the solution!
Add a queryParameters field in the subworkflow:
queryParameters:
  parameterType: {"type": "DATE"}
  parameterValue: {"value": '$${run_dt}'}
  name: "run_dt"

Terraform "primary workGroup could not be created"

I'm trying to execute a query on my table in Amazon Athena, but I can't execute any query. I get this error message:
Before you run your first query, you need to set up a query result location in Amazon S3.
Your query has the following error(s):
No output location provided. An output location is required either through the Workgroup result configuration setting or as an API input. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: b6b9aa41-20af-4f4d-91f6-db997e226936)
So I'm trying to add a workgroup, but I get this error:
Error: error creating Athena WorkGroup: InvalidRequestException: primary workGroup could not be created
{
  RespMetadata: {
    StatusCode: 400,
    RequestID: "c20801a0-3c13-48ba-b969-4e28aa5cbf86"
  },
  AthenaErrorCode: "INVALID_INPUT",
  Message_: "primary workGroup could not be created"
}
My code:
resource "aws_s3_bucket" "tony" {
bucket = "tfouh"
}
resource "aws_athena_workgroup" "primary" {
name = "primary"
depends_on = [aws_s3_bucket.tony]
configuration {
enforce_workgroup_configuration = false
publish_cloudwatch_metrics_enabled = true
result_configuration {
output_location = "s3://${aws_s3_bucket.tony.bucket}/"
encryption_configuration {
encryption_option = "SSE_S3"
}
}
}
}
Please, is there a solution?
This probably happens because you already have a primary work group, so you can't create a new one with the same name. Just create a work group with a different name if you want:
name = "primary2"
@Marcin suggested a valid approach, but what may be closer to what you are looking for would be to import the existing workgroup into the state:
terraform import aws_athena_workgroup.primary primary
Once the state knows about the already existing resource, Terraform can plan and apply any changes.

Logsink to BigQuery partitioning not working

I created a logsink at folder level, so it neatly streams all the logs to BigQuery. In the logsink configuration, I specified the following options to let the logsink stream to (daily) partitions:
"bigqueryOptions": {
"usePartitionedTables": true,
"usesTimestampColumnPartitioning": true # output only
}
According to the BigQuery documentation and the bigquery resource type, I would assume that this would automatically create partitions, but it doesn't. I verified that it didn't create the partitions with the following query:
#LegacySQL
SELECT table_id, partition_id from [dataset1.table1$__PARTITIONS_SUMMARY__];
Gives me:
[
  {
    "table_id": "table1",
    "partition_id": "__UNPARTITIONED__"
  }
]
Is there something I am missing here? It should have partitioned by date.
The problem was that I did not wait long enough for the first partition to become active. Basically, a logsink streams data as unpartitioned; after a while, the data is partitioned by date, which only becomes visible after a few hours for today's partition. Problem solved!
[
  {
    "table_id": "table1",
    "partition_id": "__UNPARTITIONED__"
  },
  {
    "table_id": "table1",
    "partition_id": "20200510"
  },
  {
    "table_id": "table1",
    "partition_id": "20200511"
  }
]
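For anyone checking this today, the same information should be available without legacy SQL through the INFORMATION_SCHEMA.PARTITIONS view (a sketch; replace dataset1 and table1 with your own dataset and table):
-- Standard SQL alternative to the $__PARTITIONS_SUMMARY__ query above
SELECT table_name, partition_id, total_rows
FROM dataset1.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'table1';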

Loopback: How to query an array of objects?

How do I query an array within LoopBack 3?
I have the following method:
Driver.reserve = async function(cb) {
  let query = {
    where: {
      preferred_delivery_days: {
        elemMatch: {
          availability: 0
        }
      }
    }
  };
  return await app.models.Driver.find(query);
};
But I am getting the following error:
code: ER_PARSE_ERROR
errno: 1064
sqlMessage: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''{\"availability\":0}' ORDER BY `id`' at line 1
sqlState: 42000
sql: SELECT
driver021489826505814413.`first_name`,driver021489826505814413.`last_name`,driver021489826505814413.`gender`,driver021489826505814413.`preferred_delivery_days` FROM `my_driver_table` driver021489826505814413 WHERE driver021489826505814413.`preferred_delivery_days`'{\"availability\":0}' ORDER BY `id`
Here is an example of a database entry:
[
{
"day": 5,
"time": "morning",
"availability": 0
}
]
I think it might be hard, since according to the docs:
Data source connectors for relational databases don’t support filtering nested properties.
If your project is in its starting phase, you may consider changing the database to MongoDB or another NoSQL database.
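If switching databases is not an option, one workaround (my own sketch, not an official LoopBack feature) is to fetch the candidates and filter the JSON array in application code, since the MySQL connector cannot filter nested properties:
// Sketch of an application-side filter; assumes preferred_delivery_days is stored as a JSON array
Driver.reserve = async function() {
  const drivers = await app.models.Driver.find();
  return drivers.filter(d =>
    Array.isArray(d.preferred_delivery_days) &&
    d.preferred_delivery_days.some(slot => slot.availability === 0)
  );
};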