How to create a HANA temp table in Vora programmatically (Scala)

At the moment a temp table is created using the following statement:
val HANA_TABLE = s"""
CREATE TEMPORARY TABLE TEMP_HANA
USING com.sap.spark.hana
OPTIONS (
  path "TABLE",
  host "HANA1",
  dbschema "SCHEMA",
  user "USER",
  passwd "PASSWD",
  instance "22"
)"""
vc.sql(HANA_TABLE)
Is there a way to do this programmatically in Scala? Something like:
vc.read.format("com.sap.spark.hana").options(options).loadTemp()
On a side note, is there an API for Vora?

Please see the Vora Developer Guide, chapter 8, "Accessing Data in SAP HANA".
Your example could be written this way:
val options = Map(
  "dbschema" -> "SCHEMA",
  "path" -> "TABLE",
  "host" -> "HANA1",
  "instance" -> "22",
  "user" -> "USER",
  "passwd" -> "PASSWD"
)
val inputDF = vc.read.format("com.sap.spark.hana").options(options).load()
inputDF.registerTempTable("TEMP_HANA")
vc.sql("select * from TEMP_HANA").show

Related

Terraform Bigquery create tables replace table instead of edit

I added a JSON file that contains all the tables I want to create:
tables.json
"tables": {
"table1": {
"dataset_id": "dataset1",
"table_id": "table1",
"schema_path": "folder/table1.json"
},
"table2": {
"dataset_id": "dataset2",
"table_id": "table2",
"schema_path": "folder/table2.json"
}
}
Then, with a for_each on a Terraform resource, I want to create these tables dynamically:
local.tf file
locals {
  tables = jsondecode(file("${path.module}/resource/tables.json"))["tables"]
}
variables.tf file
variable "project_id" {
description = "Project ID, used to enforce providing a project id."
type = string
}
variable "time_partitioning" {
description = "Configures time-based partitioning for this table."
type = map(string)
default = {
type = "DAY"
field = "my_field"
}
}
main.tf file
resource "google_bigquery_table" "tables" {
for_each = local.tables
project = var.project_id
dataset_id = each.value["dataset_id"]
table_id = each.value["table_id"]
dynamic "time_partitioning" {
for_each = [
var.time_partitioning
]
content {
type = try(each.value["partition_type"], time_partitioning.value["type"])
field = try(each.value["partition_field"], time_partitioning.value["field"])
expiration_ms = try(time_partitioning.value["expiration_ms"], null)
require_partition_filter = try(time_partitioning.value["require_partition_filter"], null)
}
}
schema = file("${path.module}/resource/schema/${each.value["schema_path"]}")
}
The schema files contain classic BigQuery schemas, for example:
[
  {
    "name": "field",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "My field"
  }
]
The creation of the tables works well, but when I add a new nullable field to a schema, Terraform proposes to "replace table" (destroy and recreate) instead of "update table".
The normal behaviour in this case for native BigQuery and Terraform is to update the table.
When I do the same test with the same Terraform resource but without for_each, Terraform has the expected behaviour and proposes to update the table.
An example of the Terraform plan output with for_each:
# google_bigquery_table.tables["table1"] must be replaced
-/+ resource "google_bigquery_table" "tables" {
~ creation_time = 1616764894477 -> (known after apply)
dataset_id = "dataset1"
deletion_protection = true
~ etag = "G9qwId8jgQS8nN4N61zqcA==" -> (known after apply)
~ expiration_time = 0 -> (known after apply)
~ id = "projects/my-project/datasets/dataset1/tables/table1" -> (known after apply)
- labels = {} -> null
~ last_modified_time = 1617075251337 -> (known after apply)
~ location = "EU" -> (known after apply)
~ num_bytes = 0 -> (known after apply)
~ num_long_term_bytes = 0 -> (known after apply)
~ num_rows = 0 -> (known after apply)
project = "project"
~ schema = jsonencode(
~ [ # forces replacement
{
description = "Field"
mode = "NULLABLE"
name = "field"
type = "STRING"
}
.....
+ {
+ description = "Field"
+ mode = "NULLABLE"
+ name = "newField"
+ type = "STRING"
}
Terraform correctly detects and displays the new column to add to the table, but proposes a replacement instead of an update.
To repeat: the exact same test with the same Terraform resource, without for_each and on a single BigQuery table, works well (same schema, same change). I create the table, and when I add a new nullable column, Terraform proposes an update (the expected behaviour).
I checked the Terraform docs and the web, but I didn't find any examples of managing a list of tables with Terraform.
Is it not possible to create and update tables from a table configuration with for_each?
Thanks for your help.
This sounded like a provider bug. I found this issue in the terraform-provider-google repository that seems related to your issue. The fix was merged just 13 hours ago (at the time of writing). So, maybe you can wait for the next release (v3.63.0) and see if it fixes your issue.
Just FYI: you might want to verify that the fix commit was actually included in the next release. It has happened to me before that something merged into master before a release was not actually part of that release.
Thanks so much @Alessandro, the problem was indeed due to the Google Terraform provider version.
I was using v3.62.0 of the Google provider, and you pointed me in the right direction.
I also saw this link: https://github.com/hashicorp/terraform-provider-google/issues/8503
There is a very useful comment by "tpolekhin" (thanks to him):
Hopefully I'm not beating a dead horse commenting on the closed issue, but I did some testing with various versions of the provider, and it behaves VERY differently each time.
So, our terraform code change was pretty simple: add 2 new columns to existing BigQuery table SCHEDULE
Nothing changes between runs - only provider version
v3.52.0
Plan: 0 to add, 19 to change, 0 to destroy.
Mostly adds + mode = "NULLABLE" to fields in bunch of tables, and adds 2 new fields in SCHEDULE table
v3.53.0
Plan: 0 to add, 2 to change, 0 to destroy.
Adds 2 new fields to SCHEDULE table, and moves one field in another table in a different place (sorting?)
v3.54.0
Plan: 1 to add, 1 to change, 1 to destroy.
Adds 2 new fields to SCHEDULE table, and moves one field in another table in a different place (sorting?) but now with table re-creation for some reason
v3.55.0
Plan: 0 to add, 2 to change, 0 to destroy.
Adds 2 new fields to SCHEDULE table, and moves one field in another table in a different place (sorting?)
behaves exactly like v3.53.0
v3.56.0
Plan: 1 to add, 0 to change, 1 to destroy.
In this comment, we can see that some versions have the problem.
For example, this works with v3.55.0 but not with v3.56.0.
I have temporarily downgraded to v3.55.0, and when the next release solves this issue, I will upgrade.
provider.tf :
provider "google" {
version = "= 3.55.0"
}
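As a side note (my own addition, not from the thread): on Terraform 0.13 and later, the provider version is usually pinned in a required_providers block instead, which keeps the later upgrade to a fixed release a one-line change:
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      # Pin to the last known-good version; bump once the fixed provider release is out.
      version = "= 3.55.0"
    }
  }
}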

How to create a table with AWS Glue with indexes?

Consider a snippet:
glueContext.getSinkWithFormat(
  connectionType = "postgresql",
  options = JsonOptions(Map(
    "url" -> "jdbc://myurl",
    "dbtable" -> "myTable",
    "user" -> "user",
    "password" -> "password"
  ))
).writeDynamicFrame(frame)
This creates the table automatically if it does not exist, but without any indexes or ID columns. Is there a way to set up Glue to create them?
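One possible workaround (a sketch under my own assumptions, not an option the Glue sink exposes as far as I know) is to issue the ID-column and index DDL yourself over plain JDBC after writeDynamicFrame, since the sink only auto-creates a bare table. Connection details and column names below are placeholders:
import java.sql.DriverManager

// Placeholder connection details; in a real job these would come from the
// Glue connection or Secrets Manager rather than being hard-coded.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://myhost:5432/mydb", "user", "password")
try {
  val stmt = conn.createStatement()
  // Add a surrogate id column and an index on a hypothetical column
  // that the DynamicFrame is expected to contain.
  stmt.execute("ALTER TABLE myTable ADD COLUMN IF NOT EXISTS id BIGSERIAL PRIMARY KEY")
  stmt.execute("CREATE INDEX IF NOT EXISTS idx_mytable_some_column ON myTable (some_column)")
  stmt.close()
} finally {
  conn.close()
}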

How do I schedule refresh of a Web.Contents data source?

I am trying to set a 'Scheduled Refresh' on a dataset in the Power BI web app (https://app.powerbi.com).
Normally I should see these options in the dataset settings:
But when I go to settings I am greeted by this warning:
and no way to select the 'Gateway Connection' or data source settings.
I found a useful article which explains a problem with Web.Contents and how to get around it:
https://blog.crossjoin.co.uk/2016/08/23/web-contents-m-functions-and-dataset-refresh-errors-in-power-bi/
I applied this and it still doesn't work.
In Power BI Desktop no data sources are listed as I am using a hand-authored query.
The way it works is that there is a main query (Log Scroll) which calls a recursive function query (RecursiveFetch). The function then calls a Web API which sends a new page of JSON data every time it is called, in a sort of 'scrolling' manner.
The Log Scroll query looks like this:
let
    url = "http://exampleURL:1000",
    Source = RecursiveFetch(url, 5, null, null)
in
    Source
The RecursiveFetch looks like this:
let
    RecursiveFetch = (url, scrollCount, scrollID, counter) =>
        let
            Counter = if (counter = null) then 0 else counter,
            Results = if (scrollID = null) then
                Json.Document(Web.Contents(url,
                    [
                        Headers = [
                            #"Authorization" = "Basic <key here>",
                            #"Content-Type" = "application/json"
                        ]
                    ]
                ))
            else
                Json.Document(Web.Contents(url,
                    [
                        Content = Text.ToBinary(scrollID),
                        Headers = [
                            #"Authorization" = "Basic <key here>",
                            #"Content-Type" = "application/json"
                        ]
                    ]
                )),
            ParsedResults = Table.FromList(Results[hits][hits], Splitter.SplitByNothing(), null, null, ExtraValues.Error),
            Return = if (Counter < scrollCount) then
                ParsedResults & RecursiveFetch(url, scrollCount, scrollID, Counter)
            else
                ParsedResults
        in
            Return
in
    RecursiveFetch
It all works perfectly in Power BI Desktop but when I publish it to the web app I get the errors shown above.
I have manually set up a data source in my Gateway Cluster which connects fine to the URL with the same credentials that the hand-authored query uses.
How do I get this all to work? Is there something I have missed?
It all works perfectly in Power BI Desktop but when I publish it to the web app I get the errors shown above.
To fix that, Web.Contents needs to use options[RelativePath] and options[Query] (or Content for an HTTP POST).
The original:
"https://data.gov.uk/api/3/action/package_search?q=" & Term
will become:
let
    BaseUrl = "https://data.gov.uk",
    Options = [
        RelativePath = "/api/3/action/package_search",
        Headers = [
            Accept = "application/json"
        ],
        Query = [
            q = Term
        ]
    ],
    // wrap 'response' in 'Binary.Buffer' if you are using it multiple times
    response = Web.Contents(BaseUrl, Options),
    buffered = Binary.Buffer(response),
    response_metadata = Value.Metadata(response),
    status_code = response_metadata[Response.Status],
    from_json = Json.Document(buffered)
in
    from_json
The url parameter should stay as the minimal static base URL, with anything dynamic moved into the options record; otherwise the service treats the call as a dynamic data source, which causes the original refresh error.

How to use AVRO format on AWS Glue/Athena

I have a few Kafka topics that write Avro files into S3 buckets, and I would like to run some queries on those buckets using AWS Athena.
I'm trying to create a table, but the AWS Glue crawler runs and doesn't add my table (it works if I change the file type to JSON). I've tried to create a table from the Athena console, but it doesn't offer support for Avro files.
Any idea on how to make it work?
I suggest doing it manually and not via Glue. Glue only works for the most basic situations, and this falls outside that, unfortunately.
You can find the documentation on how to create an Avro table here: https://docs.aws.amazon.com/athena/latest/ug/avro.html
The caveat for Avro tables is that you need to specify both the table columns and the Avro schema. This may look weird and redundant, but it's how Athena/Presto works. It needs a schema to know how to interpret the files, and then it needs to know which of the properties in the files you want to expose as columns (and their types, which may or may not match the Avro types).
CREATE EXTERNAL TABLE avro_table (
  foo STRING,
  bar INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal' = '
{
  "type": "record",
  "name": "example",
  "namespace": "default",
  "fields": [
    {
      "name": "foo",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "bar",
      "type": ["null", "int"],
      "default": null
    }
  ]
}
')
STORED AS AVRO
LOCATION 's3://some-bucket/data/';
Notice how the Avro schema appears as a JSON document inside of a serde property value (single quoted) – the formatting is optional, but makes this example easier to read.
Doing it manually seems to be the way to make it work.
Here is some code to generate the Athena schema directly from an Avro schema literal. It works with avro-python3 on Python 3.7. It is taken from https://github.com/dataqube-GmbH/avro2athena (I am the owner of the repo).
from avro.schema import Parse, RecordSchema, PrimitiveSchema, ArraySchema, MapSchema, EnumSchema, UnionSchema, FixedSchema


def create_athena_schema_from_avro(avro_schema_literal: str) -> str:
    avro_schema: RecordSchema = Parse(avro_schema_literal)

    column_schemas = []
    for field in avro_schema.fields:
        column_name = field.name.lower()
        column_type = create_athena_column_schema(field.type)
        column_schemas.append(f"`{column_name}` {column_type}")

    return ', '.join(column_schemas)


def create_athena_column_schema(avro_schema) -> str:
    if type(avro_schema) == PrimitiveSchema:
        return rename_type_names(avro_schema.type)

    elif type(avro_schema) == ArraySchema:
        items_type = create_athena_column_schema(avro_schema.items)
        return f'array<{items_type}>'

    elif type(avro_schema) == MapSchema:
        values_type = avro_schema.values.type
        return f'map<string,{values_type}>'

    elif type(avro_schema) == RecordSchema:
        field_schemas = []
        for field in avro_schema.fields:
            field_name = field.name.lower()
            field_type = create_athena_column_schema(field.type)
            field_schemas.append(f'{field_name}:{field_type}')

        field_schema_concatenated = ','.join(field_schemas)
        return f'struct<{field_schema_concatenated}>'

    elif type(avro_schema) == UnionSchema:
        # pick the first schema which is not null
        union_schemas_not_null = [s for s in avro_schema.schemas if s.type != 'null']
        if len(union_schemas_not_null) > 0:
            return create_athena_column_schema(union_schemas_not_null[0])
        else:
            raise Exception('union schemas contains only null schema')

    elif type(avro_schema) in [EnumSchema, FixedSchema]:
        return 'string'

    else:
        raise Exception(f'unknown avro schema type {avro_schema.type}')


def rename_type_names(typ: str) -> str:
    if typ in ['long']:
        return 'bigint'
    else:
        return typ
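For illustration, a possible way to call it (the schema literal below is just the one from the CREATE TABLE example above, not part of the original repo snippet):
# Hypothetical usage example with the Avro schema from the CREATE TABLE above.
avro_schema_literal = '''
{
  "type": "record",
  "name": "example",
  "namespace": "default",
  "fields": [
    {"name": "foo", "type": ["null", "string"], "default": null},
    {"name": "bar", "type": ["null", "int"], "default": null}
  ]
}
'''

columns = create_athena_schema_from_avro(avro_schema_literal)
print(columns)  # `foo` string, `bar` int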

AWS IoT rule - timestamp for Elasticsearch

I have a bunch of IoT devices (ESP32) that publish a JSON object to things/THING_NAME/log for general debugging (to be extended into other topics with values in the future).
Here is the IoT rule, which kind of works:
{
  "sql": "SELECT *, parse_time(\"yyyy-mm-dd'T'hh:mm:ss\", timestamp()) AS timestamp, topic(2) AS deviceId FROM 'things/+/stdout'",
  "ruleDisabled": false,
  "awsIotSqlVersion": "2016-03-23",
  "actions": [
    {
      "elasticsearch": {
        "roleArn": "arn:aws:iam::xxx:role/iot-es-action-role",
        "endpoint": "https://xxxx.eu-west-1.es.amazonaws.com",
        "index": "devices",
        "type": "device",
        "id": "${newuuid()}"
      }
    }
  ]
}
I'm not sure how to set @timestamp inside Elasticsearch to allow time-based searches.
Maybe I'm going about this all wrong, but it almost works!
Elasticsearch can recognize date strings matching dynamic_date_formats.
The following format is automatically mapped as a date field in AWS Elasticsearch 7.1:
SELECT *, parse_time("yyyy/MM/dd HH:mm:ss", timestamp()) AS timestamp FROM 'events/job/#'
This approach does not require creating a preconfigured index, which is important for dynamically created indices, e.g. with daily rotation for logs:
devices-${parse_time("yyyy.MM.dd", timestamp(), "UTC")}
According to elastic.co documentation,
The default value for dynamic_date_formats is:
[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]
@timestamp is just a convention, as the @ prefix is the default prefix for Logstash-generated fields. Because you are not using Logstash as a middleman between IoT and Elasticsearch, you don't have a default mapping for @timestamp.
But basically, it is just a name, so call it what you want; the only thing that matters is that you declare it as a date field in the mappings section of the Elasticsearch index.
If for some reason you still need it to be called @timestamp, you can either SELECT it with that prefix right away in the AS section (might be an issue with IoT's SQL restrictions, not sure):
SELECT *, parse_time(\"yyyy-mm-dd'T'hh:mm:ss\", timestamp()) AS @timestamp, topic(2) AS deviceId FROM 'things/+/stdout'
Or you use the copy_to functionality when declaring your mapping:
PUT devices
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "copy_to": "@timestamp"
      },
      "@timestamp": {
        "type": "date"
      }
    }
  }
}