GCP - BigTable to BigQuery

I am trying to query Bigtable data in BigQuery using the external table configuration. I have the following SQL command that I am working with. However, I get an error stating invalid bigtable_options for format CLOUD_BIGTABLE.
The code works when I remove the columns field. For context, the raw data looks like this (running the query without the columns field):

rowkey | aAA.column.name | aAA.column.cell.value
4271   | xxx             | 30
       | yyy             | 25
But I would like the table to look like this:
rowkey | xxx
4271   | 30
CREATE EXTERNAL TABLE dev_test.telem_test
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['https://googleapis.com/bigtable/projects/telem/instances/dbb-bigtable/tables/db1'],
  bigtable_options =
  """
  {
    bigtableColumnFamilies: [
      {
        "familyId": "aAA",
        "type": "string",
        "encoding": "string",
        "columns": [
          {
            "qualifierEncoded": string,
            "qualifierString": string,
            "fieldName": "xxx",
            "type": string,
            "encoding": string,
            "onlyReadLatest": false
          }
        ]
      }
    ],
    readRowkeyAsString: true
  }
  """
);

I think you left the default value in place for each column attribute. Here, string indicates the type of value you are supposed to provide, not the raw value itself, so a bare string token makes no sense in JSON. Try adding double quotes, like this:
CREATE EXTERNAL TABLE dev_test.telem_test
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['https://googleapis.com/bigtable/projects/telem/instances/dbb-bigtable/tables/db1'],
  bigtable_options =
  """
  {
    bigtableColumnFamilies: [
      {
        "familyId": "aAA",
        "type": "string",
        "encoding": "string",
        "columns": [
          {
            "qualifierEncoded": "string",
            "qualifierString": "string",
            "fieldName": "xxx",
            "type": "string",
            "encoding": "string",
            "onlyReadLatest": false
          }
        ]
      }
    ],
    readRowkeyAsString: true
  }
  """
);
The bare false is correct because that field is a boolean. The encoding "string" will be erroneous, though: use a real encoding value instead (for example TEXT or BINARY).

The error here is in this part:
bigtableColumnFamilies: [
It should be:
"columnFamilies": [
As for adding columns for a string, you only need to add:
"columns": [{
  "qualifierString": "name_of_column_from_bt",
  "fieldName": "if_i_want_rename"
}],
fieldName is not required.
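Putting these fixes together, the full statement from the question could look something like the following. This is only a sketch: the qualifier xxx is taken from the sample output above, and STRING / TEXT are used as examples of concrete type and encoding values rather than values confirmed against your data.

CREATE EXTERNAL TABLE dev_test.telem_test
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['https://googleapis.com/bigtable/projects/telem/instances/dbb-bigtable/tables/db1'],
  bigtable_options =
  """
  {
    "columnFamilies": [
      {
        "familyId": "aAA",
        "type": "STRING",
        "encoding": "TEXT",
        "columns": [
          {
            "qualifierString": "xxx",
            "fieldName": "xxx",
            "onlyReadLatest": false
          }
        ]
      }
    ],
    "readRowkeyAsString": true
  }
  """
);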
However, to access your field value you will still have to use SQL code like this:
SELECT
aAA.xxx.cell.value as xxx
FROM dev_test.telem_test

Convert list to tibble in R with NULL values

library(tidyverse)

event <- list(
  reportId = 157250,
  eventId = 4580,
  country = "Moldova",
  disease = "African swine fever",
  subType = NULL
)

event %>% as_tibble()
An error is raised:
! All columns in a tibble must be vectors.
✖ Column subType is NULL.
There are thousands of event lists like this; is there any method to get a proper tibble object?
You can circumvent this by running the following steps:
event %>%
  enframe() %>%                                             # long format with all elements as lists
  unnest("value", keep_empty = TRUE) %>%                    # transform into elementary data
  pivot_wider(names_from = "name", values_from = "value")   # make wide format with all columns
Using map_df can also generate a tibble object; however, the column that is NULL goes missing:
event %>% map_df(.f = ~.x)
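If the goal is to build one tibble from many such event lists at once, an alternative sketch (not taken from the answers above) is to drop the NULL elements with purrr::compact() and let the row-binding fill in the gaps; events below stands in for the real list of events:

library(tidyverse)

events <- list(event, event)   # stand-in for the thousands of real event lists

events %>%
  map(compact) %>%      # drop NULL entries such as subType
  map_dfr(as_tibble)    # one row per event; columns absent from a row become NA

Note that a column only appears in the result if at least one event has a non-NULL value for it.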
I have the same requirement; I have a list obtained from the FlightAware API (converted from JSON) that contains NULLs.
The solution by Manuela Benary did not work straight away because the response includes a nested array of alternatives; here is the specification for the response from the API documentation:
{
  "icao": "string",
  "iata": "string",
  "callsign": "string",
  "name": "string",
  "country": "string",
  "location": "string",
  "phone": "string",
  "shortname": "string",
  "url": "string",
  "wiki_url": "string",
  "alternatives": [
    {
      "icao": "string",
      "iata": "string",
      "callsign": "string",
      "name": "string",
      "country": "string",
      "location": "string",
      "phone": "string",
      "shortname": "string",
      "url": "string",
      "wiki_url": "string"
    }
  ]
}
This simply required running Manuela's code on the alternatives list first, then NULLing that member of the response, and then running the code on the original list. The two sets of results can then be combined with bind_rows().
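A rough sketch of that sequence, assuming the parsed response is stored in a list called resp and wrapping the pipeline from the earlier answer in a made-up helper called flatten_event():

flatten_event <- function(x) {
  x %>%
    enframe() %>%
    unnest("value", keep_empty = TRUE) %>%
    pivot_wider(names_from = "name", values_from = "value")
}

alts <- map_dfr(resp$alternatives, flatten_event)   # flatten each nested alternative
resp$alternatives <- NULL                           # NULL that member of the response
main <- flatten_event(resp)                         # flatten the remaining top-level fields
bind_rows(main, alts)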

How to highlight custom extractions using a2i's crowd-textract-analyze-document?

I would like to create a human review loop for images that have undergone OCR using Amazon Textract and Entity Extraction using Amazon Comprehend.
My process is:
1. send the image to Textract to extract the text
2. send the text to Comprehend to extract entities
3. find the Block IDs in Textract's output of the entities extracted by Comprehend
4. add new Blocks of type KEY_VALUE_SET to Textract's JSON output, per the docs
5. create a Human Task with a crowd-textract-analyze-document element in the template and feed it the modified Textract output
What fails to work in this process is step 5. My custom entities are not rendered properly. By "fails to work" I mean that the entities are not highlighted on the image when I click them on the sidebar. There is no error in the browser's console.
Has anyone tried such a thing?
Sorry for not including examples; I will remove secrets/PII from my files and attach them to the question.
I used the AWS documentation of the a2i-crowd-textract-detection human task element to generate the value of the initialValue attribute. It appears the documentation for that attribute is incorrect. While the doc shows that the value should be in the same format as the output of Textract, namely:
[
  {
    "BlockType": "KEY_VALUE_SET",
    "Confidence": 38.43309020996094,
    "Geometry": { ... },
    "Id": "8c97b240-0969-4678-834a-646c95da9cf4",
    "Relationships": [
      { "Type": "CHILD", "Ids": [...] },
      { "Type": "VALUE", "Ids": [...] }
    ],
    "EntityTypes": ["KEY"],
    "Text": "Foo bar"
  },
]
the a2i-crowd-textract-detection expects the input to have lowerCamelCase attribute names (rather than UpperCamelCase). For example:
[
  {
    "blockType": "KEY_VALUE_SET",
    "confidence": 38.43309020996094,
    "geometry": { ... },
    "id": "8c97b240-0969-4678-834a-646c95da9cf4",
    "relationships": [
      { "Type": "CHILD", "ids": [...] },
      { "Type": "VALUE", "ids": [...] }
    ],
    "entityTypes": ["KEY"],
    "text": "Foo bar"
  },
]
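A small helper along these lines (a sketch, not part of the original answer) can convert Textract's output before it is passed to the template. Note that the "Type" key inside relationships is shown unchanged in the example above, so you may need to exclude it if the element really requires that exact casing:

def to_lower_camel(key):
    # "BlockType" -> "blockType"
    return key[:1].lower() + key[1:]

def lower_camel_keys(obj):
    # Recursively rename dictionary keys; lists and scalar values pass through.
    if isinstance(obj, dict):
        return {to_lower_camel(k): lower_camel_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [lower_camel_keys(item) for item in obj]
    return obj

# e.g. blocks = lower_camel_keys(textract_response["Blocks"])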
I opened a support case with AWS about this documentation error.

Error when importing CSV file into Amazon Personalize

I am trying to import a CSV file into Amazon Personalize. My schema looks like this:
{
  "type": "record",
  "name": "Items",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    {
      "name": "ITEM_ID",
      "type": "string"
    },
    {
      "name": "AUTHOR",
      "type": "string",
      "categorical": true
    },
    {
      "name": "COUNTRY",
      "type": "string",
      "categorical": true
    },
    {
      "name": "CITY",
      "type": "string",
      "categorical": true
    },
    {
      "name": "STYLES",
      "type": "string",
      "categorical": true
    },
    {
      "name": "CATEGORIES",
      "type": "string",
      "categorical": true
    }
  ],
  "version": "1.0"
}
The first few rows of data look like this:
ITEM_ID,AUTHOR,COUNTRY,CITY,STYLES,CATEGORIES
5b4253a7e12434f55875381e,5acd193f48ed4b9b3add5be6,US,city_us_austin,5ad45bc575eb016f3cdb562b|571aa21888a4fd9934f0fd7b|571aa21888a4fd9934f0fd79|5ad45e8c75eb016f3cdb563f|5b4ea35abaa12285687a1f47,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
5b4253a7e12434f55875381f,5acd193f48ed4b9b3add5be6,US,city_us_jackson,571aa21888a4fd9934f0fd82|57600e419e4959cd069658eb|5ad45c3a75eb016f3cdb5631|571aa21888a4fd9934f0fd7b|57aaa7094a393f531ace43f0|575e6d8e34ca56f742bea1c8|571aa21888a4fd9934f0fd8f,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
I get the error
Failed to create a data import job for item dataset.
Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.
How can I figure out what is wrong with the CSV? It's thousands of lines long, so I have no idea whether it's a general mistake or something wrong on a specific line.
In my experience, so long as the dataset is not larger than 250 thousand records, you can still use Excel to check the data using data filters and the corresponding search functions. If it's more than that, look into using Notepad++ and RegEx. Your problem may be one of the following things (a quick programmatic check for all three is sketched after the list):
(1) There's a missing comma. This would misalign your data and keep it from being processed.
(2) There's a missing ITEM_ID value. For Items, Personalize requires ITEM_ID and at least one metadata field. It might give this error if there is an instance where you are missing ITEM_ID or have ITEM_ID but no other metadata field values.
(3) STYLES and/or CATEGORIES exceeds 256 characters. There is probably a limit on string length, but I can't get a clear answer on this from the developer's guide; I would guess it's 256 characters. If I were betting money, this would be my guess for your problem.
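A quick way to look for all three of these programmatically is a short pandas script along these lines (a sketch; the file name items.csv is assumed, and 256 is only the guessed limit from point 3):

import pandas as pd

# Read everything as strings so missing values simply become NaN.
df = pd.read_csv("items.csv", dtype=str)

# (1) Rows with missing fields end up with NaN values after parsing
# (rows with *extra* fields make read_csv raise a ParserError instead).
print(df[df.isna().any(axis=1)])

# (2) Rows without an ITEM_ID.
print(df[df["ITEM_ID"].isna()])

# (3) Rows where STYLES or CATEGORIES exceed 256 characters (guessed limit).
for col in ["STYLES", "CATEGORIES"]:
    print(col, (df[col].str.len() > 256).sum())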
Here is a different approach to solving the problem that may be useful in other cases. I had the same issue, but when dealing with int columns that contain null values. Pandas by default converts such columns to the float data type, which the AWS Personalize dataset import job will not accept if you have defined these columns as int or long. Long story short, converting these columns to pandas' nullable integer type solves the problem:
df.column_name = df.column_name.astype(pd.Int32Dtype())
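For example (a sketch with a made-up YEAR column; any int metadata column with missing values behaves the same way):

import pandas as pd

df = pd.DataFrame({"ITEM_ID": ["a", "b"], "YEAR": [1999, None]})
print(df["YEAR"].dtype)   # float64, because of the missing value

# Convert to pandas' nullable integer type before exporting for Personalize.
df["YEAR"] = df["YEAR"].astype(pd.Int32Dtype())
df.to_csv("items.csv", index=False)   # 1999 is written as 1999, not 1999.0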

Watson Assistant: Problem with extracting value for pattern entity

I am trying to get the value for the first group match of a pattern entity from the json response of Watson Assistant. The pattern is a simple regex to recognize sequences of numbers: ([0-9]+)
The json response looks like this:
[
  {
    "entity": "ID",
    "location": [18, 23],
    "value": "id",
    "confidence": 1.0,
    "groups": [
      {
        "group": "group_0",
        "location": [18, 23]
      }
    ]
  },
  {
    "entity": "sys-number",
    "location": [18, 23],
    "value": "12345",
    "confidence": 1.0,
    "metadata": {
      "numeric_value": 12345.0
    }
  }
]
So, the group is matched all right, but the field "value" is populated with the string literal from the entity config. I would expect to find the actual value there (which is the one in the value field of the next entity, sys-number).
How do I need to change the config so that the value is included as-is in the value field (or somewhere else) and so that I don't have to extract the entity from the text string using the location values? Is it possible at all?
Thanks a lot
Cheers,
Martin
To access the value of a pattern-based entity, you can use either <? @entity_name.literal ?> or, if groups are captured, <? @entity_name.groups[0] ?>. You can find more info in the docs: https://cloud.ibm.com/docs/services/assistant?topic=assistant-entities
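Applied to the @ID entity from the question, a dialog node could, for example, copy the matched digits into a context variable (a sketch; extracted_id is just a made-up variable name):

{
  "context": {
    "extracted_id": "<? @ID.literal ?>"
  }
}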

When predicting, what are the valid values for dataFormat?

Problem
Using the REST API, I have trained and deployed a model that I now want to use for prediction. I've defined the collections for prediction input and output and uploaded a JSON file, formatted accordingly, to Cloud Storage. However, when trying to create a prediction job I cannot figure out what value to use for the dataFormat field, which is a required parameter. Is there any way to list all valid values?
What I've tried
My requests look like the one below. I've tried JSON, NEWLINE_DELIMITED_JSON (like when importing data into BigQuery), and even the JSON MIME type application/json, in pretty much all the casings I can think of (upper and lower combined with snake, camel, etc.).
{
  "jobId": "my_predictions_123",
  "predictionInput": {
    "modelName": "projects/myproject/models/mymodel",
    "inputPaths": [
      "gs://model-bucket/data/testset.json"
    ],
    "outputPath": "gs://model-bucket/predictions/0/",
    "region": "us-central1",
    "dataFormat": "JSON"
  },
  "predictionOutput": {
    "outputPath": "gs://my-bucket/predictions/1/"
  }
}
All my attempts have only gotten me this back though:
{
  "error": {
    "code": 400,
    "message": "Invalid value at 'job.prediction_input.data_format' (TYPE_ENUM), \"JSON\"",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "#type": "type.googleapis.com/google.rpc.BadRequest",
        "fieldViolations": [
          {
            "field": "job.prediction_input.data_format",
            "description": "Invalid value at 'job.prediction_input.data_format' (TYPE_ENUM), \"JSON\""
          }
        ]
      }
    ]
  }
}
According to the Cloud ML API reference document https://cloud.google.com/ml/reference/rest/v1beta1/projects.jobs#DataFormat, the data format field in your request should be "TEXT" for all text inputs (including JSON, CSV, etc.).
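So the request body from the question should work once dataFormat is changed accordingly, for example (everything else copied from the question unchanged):

{
  "jobId": "my_predictions_123",
  "predictionInput": {
    "modelName": "projects/myproject/models/mymodel",
    "inputPaths": [
      "gs://model-bucket/data/testset.json"
    ],
    "outputPath": "gs://model-bucket/predictions/0/",
    "region": "us-central1",
    "dataFormat": "TEXT"
  },
  "predictionOutput": {
    "outputPath": "gs://my-bucket/predictions/1/"
  }
}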