Google BigQuery support for Avro logicalTypes - google-cloud-platform

According to Google, there is no support for converting an Avro logicalType to the corresponding BigQuery-specific type (as described here, at the bottom of the page).
However, I am able to load an Avro file with the following schema:
schema = {
    'name': 'test',
    'namespace': 'testing',
    'type': 'record',
    'fields': [
        {'name': 'test_timestamp', 'type': 'long', 'logicalType': 'timestamp-millis'},
    ],
}
into a BigQuery table with a column of type TIMESTAMP.
The situation is different with the following schema:
schema = {
    'name': 'test',
    'namespace': 'testing',
    'type': 'record',
    'fields': [
        {'name': 'testdate', 'type': 'int', 'logicalType': 'date'},
    ],
}
and a BigQuery table with a column of type DATE. I used bq load in the following way (in both cases):
bq --location=EU load --source_format=AVRO --project_id=test-project dataset.table "gs://bucket/test_file.avro"
and it failed with this exception:
Field testdate has changed type from DATE to INTEGER
Is there any chance that logicalTypes will be supported by BigQuery, or is there an elegant way to work around this situation? (I'm aware of the workaround where a temporary table is used and then a SQL SELECT casts TIMESTAMPs to DATEs, but it's not really pretty :P)

Native understanding of Avro logical types is now publicly available to all BigQuery users. Please refer to the documentation page for more details: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types
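For reference, a minimal load sketch in Python, assuming a reasonably recent google-cloud-bigquery client and reusing the project, dataset, table, and bucket names from the question; the use_avro_logical_types setting is what turns on the logical-type conversion. The bq CLI has an equivalent --use_avro_logical_types flag.

from google.cloud import bigquery

client = bigquery.Client(project='test-project')

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # Convert Avro logical types (date, timestamp-millis, ...) to DATE/TIMESTAMP
    use_avro_logical_types=True,
)

load_job = client.load_table_from_uri(
    'gs://bucket/test_file.avro',
    'test-project.dataset.table',
    job_config=job_config,
)
load_job.result()  # waits for the load job to finish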

Related

I'm not getting the expected response from client.describe_image_scan_findings() using Boto3

I'm trying to use Boto3 to get the number of vulnerabilities from the images in my repositories. I have a list of repository names and image IDs that are passed into this function. Based on the documentation,
I'm expecting a response like this when I filter for ['imageScanFindings']:
'imageScanFindings': {
    'imageScanCompletedAt': datetime(2015, 1, 1),
    'vulnerabilitySourceUpdatedAt': datetime(2015, 1, 1),
    'findingSeverityCounts': {
        'string': 123
    },
    'findings': [
        {
            'name': 'string',
            'description': 'string',
            'uri': 'string',
            'severity': 'INFORMATIONAL'|'LOW'|'MEDIUM'|'HIGH'|'CRITICAL'|'UNDEFINED',
            'attributes': [
                {
                    'key': 'string',
                    'value': 'string'
                },
            ]
        },
    ],
What I really need is the 'findingSeverityCounts' number; however, it's not showing up in my response. Here's my code and the response I get:
main.py
import boto3

repo_names = ['cftest/repo1', 'your-repo-name', 'cftest/repo2']
image_ids = ['1.1.1', 'latest', '2.2.2']

def get_vuln_count(repo_names, image_ids):
    container_inventory = []
    client = boto3.client('ecr')
    for n, i in zip(repo_names, image_ids):
        response = client.describe_image_scan_findings(
            repositoryName=n,
            imageId={'imageTag': i}
        )
        findings = response['imageScanFindings']
        print(findings)
Output
{'findings': []}
The only thing that shows up is findings. I was expecting findingSeverityCounts along with the other keys, but nothing else appears in the response.
THEORY
I have 3 repositories and an image in each repository that I uploaded. One of my theories is that I'm not getting the other fields, such as findingSeverityCounts, because my images don't have vulnerabilities. I have Amazon Inspector set up to scan on push, but the images don't have vulnerabilities, so nothing shows up in the Inspector dashboard. Could that be causing the issue? If so, how could I generate a vulnerability in one of my images to test this out?
My theory was correct: when there are no vulnerabilities, the response completely omits certain values, including the 'findingSeverityCounts' value that I needed.
I created a Docker image based on Python 2.7 to introduce vulnerabilities into my scan so I could test the script properly. My workaround was to add an if statement: if there are vulnerabilities, it returns them; if there aren't any, 'findingSeverityCounts' is omitted from the response, so I have it return 0 instead of raising a KeyError.
Example Solution:
response = client.describe_image_scan_findings(
    repositoryName=n,
    imageId={'imageTag': i}
)
if 'findingSeverityCounts' in response['imageScanFindings']:
    print(response['imageScanFindings']['findingSeverityCounts'])
else:
    print(0)
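A slightly more compact variant of the same workaround, using dict.get with a default of 0 (just a stylistic sketch, same placeholder variables as above):

# Returns the severity-count dict when the scan found something, 0 otherwise
severity_counts = response['imageScanFindings'].get('findingSeverityCounts', 0)
print(severity_counts)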

AWS Quicksight parseJson function not working with redshift

How to fix parseJson function error in AWS Quicksight?
I have a json column in AWS redshift called discount_codes of type varchar. The data looks like this:
{'code': 'blabla', 'amount': '12.00', 'type': 'percentage'}
I want to have a separate column for 'code' in QuickSight. There is a function for this called parseJson. The formula should look like this:
parseJson({discount_codes}, "$.code")
Unfortunately it is not working and gives me the following error:
[Amazon](500310) Invalid operation: JSON parsing error Details: ----------------------------------------------- error: JSON parsing error code: 8001 context: invalid json object {'code': 'blabla', 'amount': '12.00', 'type': 'percentage'}
Any idea how to fix this?
I was able to fix it myself. The JSON column used single quotation marks; I replaced them with double quotes. Now the data looks like this:
{"code": "blabla", "amount": "12.00", "type": "percentage"}
parseJson works now.
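For anyone wondering why the original value fails, here is a small Python illustration (not part of the QuickSight fix itself): strict JSON parsers reject single-quoted strings, which is exactly what the Redshift error reports.

import ast
import json

raw = "{'code': 'blabla', 'amount': '12.00', 'type': 'percentage'}"

# A strict JSON parser rejects single-quoted keys/values
try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print('invalid JSON:', err)

# The value is valid as a Python literal, so it can be converted
# to proper JSON with double quotes
fixed = json.dumps(ast.literal_eval(raw))
print(fixed)  # {"code": "blabla", "amount": "12.00", "type": "percentage"}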

FILEBEAT-KAFKA-EVENTHUB-AZURE DATA EXPLORER : Not able to parse the events using json mapping

I am working on a new PoC to establish a data pipeline for our platform. I am shipping logs from the application to Event Hubs (Kafka-enabled) and trying to consume the messages into an ADX table.
I have created the data connection from ADX to the Event Hub.
My table definition in ADX is:
.create table Trident ( Context:string, Latency:string, TimeStampUtc:string, Status:string, Source:string, Destination:string, LatencyType:string, CorrelationId:string)
I have tried the following table definition and JSON mappings, but ADX was never able to map the incoming event values to the corresponding columns:
.create table Trident ( Context:dynamic, Latency:dynamic, TimeStampUtc:dynamic, Status:dynamic, Source:dynamic, Destination:dynamic, LatencyType:dynamic, CorrelationId:dynamic)
.create-or-alter table Trident ingestion json mapping 'TridentMapping' '[{'column':'Context','path':'$.message.Context','datatype':'dynamic'},{'column':'Latency','path':'$.message.Latency','datatype':'dynamic'},{'column':'TimeStampUtc','path':'$.message.TimeStampUtc','datatype':'dynamic'},{'column':'Status','path':'$.message.Status','datatype':'dynamic'},{'column':'Source','path':'$.message.Source','datatype':'dynamic'},{'column':'Destination','path':'$.message.Destination','datatype':'dynamic'},{'column':'LatencyType','path':'$.message.LatencyType','datatype':'dynamic'}, {'column':'CorrelationId','path':'$.message.CorrelationId','datatype':'dynamic'}]'
.create-or-alter table Trident ingestion json mapping 'TridentMapping' '[{'column':'Context','path':'$.message[Context]','datatype':'string'},{'column':'Latency','path':'$.message[Latency]','datatype':'string'},{'column':'TimeStampUtc','path':'$.message[TimeStampUtc]','datatype':'string'},{'column':'Status','path':'$.message[Status]','datatype':'string'},{'column':'Source','path':'$.message[Source]','datatype':'string'},{'column':'Destination','path':'$.message[Destination]','datatype':'string'},{'column':'LatencyType','path':'$.message[LatencyType]','datatype':'string'}, {'column':'CorrelationId','path':'$.message[CorrelationId]','datatype':'string'}]'
.create-or-alter table Trident ingestion json mapping 'TridentMapping' '[{'column':'Context','transform' : 'Context'},{'column':'Latency','transform' : 'Latency'},{'column':'TimeStampUtc','transform':'TimeStampUtc'},{'column':'Status','transform':'Status'},{'column':'Source','transform':'Source'},{'column':'Destination','transform':'Destination'},{'column':'LatencyType','transform':'$.LatencyType'}, {'column':'CorrelationId','transform':'CorrelationId'}]'
None of these mappings was able to map the incoming events to the corresponding columns of the Trident table.
The JSON payload generated by Filebeat is as follows.
Message Received:
{
    '#timestamp': '2019-07-12T01:43:34.196Z',
    '#metadata': {
        'beat': 'filebeat',
        'type': '_doc',
        'version': '7.2.0',
        'topic': 'trident2'
    },
    'host': {
        'name': 'tridenet-st-az-vm-pragna'
    },
    'agent': {
        'version': '7.2.0',
        'type': 'filebeat',
        'ephemeral_id': '2fb76a89-2d30-45e2-8ac3-8e47f086bb60',
        'hostname': 'tridenet-st-az-vm-pragna',
        'id': 'eb1c4b07-75f5-4c0c-bfc8-5a56016760ee'
    },
    'log': {
        'offset': 2801247,
        'file': {
            'path': '/home/prmoh/trident/CatE2ECSharpLoadGenerator/CatE2ECSharpLoadGen/temp/test.log'
        }
    },
    'message': '{\'Context\':\'Trident-AZ-EastUS2-AzurePublicCloud-0ea43e61-f92c-4dc7-bab6-c9bf049d50d1\',\'Latency\':\'39.3731389843734\',\'TimeStampUtc\':\'7/12/19 1:43:34 AM\',\'Status\':\'200\',\'Source\':\'BC5BCA47-A882-4096-BB2D-D76E6C170534\',\'Destination\':\'090556DA-D4FA-764F-A9F1-63614EDA019A\',\'LatencyType\':\'File-Write\',\'CorrelationId\':\'3e8f064a-2477-490a-88fc-3f55b035cfee\'}',
    'ecs': {
        'version': '1.0.0'
    }
}
The document you pasted does not look like valid JSON.
Can you start with a mapping that maps the whole document to a single column in a test table? For example:
.create table test(message:dynamic)
.create table test ingestion json mapping "map" '[{"column":"message", "path":"$"}]'
This will let you see the actual JSON document that arrived in ADX and easily create the applicable mapping.

Boto3 athena query without saving data to s3

I am trying to use boto3 to run a set of queries, and I don't want to save the data to S3. Instead, I just want to get the results and work with them. I am trying the following:
import boto3

client = boto3.client('athena')
response = client.start_query_execution(
    QueryString='''SELECT * FROM mytable limit 10''',
    QueryExecutionContext={
        'Database': 'my_db'
    },
    ResultConfiguration={
        'OutputLocation': 's3://outputpath',
    }
)
print(response)
But here I don't want to pass ResultConfiguration, because I don't want to write the results anywhere. However, if I remove the ResultConfiguration parameter, I get the following error:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "ResultConfiguration"
So it seems like giving an S3 output location for writing is mandatory. What would be a way to avoid this and get the results only in the response?
The StartQueryExecution action indeed requires an S3 output location; the ResultConfiguration parameter is mandatory.
The alternative way to query Athena is through the JDBC or ODBC drivers. You should probably use that method if you don't want to store results in S3.
You will have to specify an S3 temp bucket location whenever running the 'start_query_execution' command. However, you can get a result set (a dict) by running the 'get_query_results' method using the query id.
The response (dict) will look like this:
{
    'UpdateCount': 123,
    'ResultSet': {
        'Rows': [
            {
                'Data': [
                    {
                        'VarCharValue': 'string'
                    },
                ]
            },
        ],
        'ResultSetMetadata': {
            'ColumnInfo': [
                {
                    'CatalogName': 'string',
                    'SchemaName': 'string',
                    'TableName': 'string',
                    'Name': 'string',
                    'Label': 'string',
                    'Type': 'string',
                    'Precision': 123,
                    'Scale': 123,
                    'Nullable': 'NOT_NULL'|'NULLABLE'|'UNKNOWN',
                    'CaseSensitive': True|False
                },
            ]
        }
    },
    'NextToken': 'string'
}
For more information, see boto3 client doc: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/athena.html#Athena.Client.get_query_results
You can then delete all files in the S3 temp bucket you've specified.
You still need to provide an S3 location as temporary output for Athena to save the data, even though you want to process the data in Python. But you can page through the results using the pagination API; please refer to the example here. Hope that helps.
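Putting the two answers together, a rough sketch of that flow might look like this (the database, table, and output path are the placeholders from the question): start the query, poll until it finishes, then read the rows directly from get_query_results via its paginator, without ever reading the S3 output yourself.

import time

import boto3

client = boto3.client('athena')

query_id = client.start_query_execution(
    QueryString='SELECT * FROM mytable LIMIT 10',
    QueryExecutionContext={'Database': 'my_db'},
    ResultConfiguration={'OutputLocation': 's3://outputpath/'},
)['QueryExecutionId']

# Poll until the query reaches a terminal state
while True:
    state = client.get_query_execution(
        QueryExecutionId=query_id
    )['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Page through the result set; note the first row of the first page
# holds the column headers for SELECT queries
if state == 'SUCCEEDED':
    paginator = client.get_paginator('get_query_results')
    for page in paginator.paginate(QueryExecutionId=query_id):
        for row in page['ResultSet']['Rows']:
            print([col.get('VarCharValue') for col in row['Data']])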

Glue AWS creating a data catalog table on boto3 python

I have been trying to create a table in our data catalog using the Python API, following the documentation posted here and here. I can understand how that goes. Nevertheless, I need to understand how to declare a struct field when I create the table, because when I look at the StorageDescriptor for the table here, there is no explanation of how I should define this type of column. In addition, I don't see where the classification property for the table is covered. Maybe in Parameters? I have used the boto3 documentation for this sample
code:
import boto3

client = boto3.client(service_name='glue', region_name='us-east-1')

response = client.create_table(
    DatabaseName='dbname',
    TableInput={
        'Name': 'tbname',
        'Description': 'tb description',
        'Owner': "I'm",
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'agents', 'Type': 'struct', 'Comment': 'from deserializer'},
                {'Name': 'conference_sid', 'Type': 'string', 'Comment': 'from deserializer'},
                {'Name': 'call_sid', 'Type': 'string', 'Comment': 'from deserializer'}
            ],
            'Location': 's3://bucket/location/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'Compressed': False,
            'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'}
        },
        'TableType': 'EXTERNAL_TABLE'
    }
)
I found this post because I ran into the same issue and eventually found the solution: you can declare the column type as
array<struct<id:string,timestamp:bigint,message:string>>
I found this "hint" while using the AWS Console and clicking on a data type of an existing table created via a Crawler. It hints:
An ARRAY of scalar type as a top-level column.
ARRAY <STRING>
An ARRAY with elements of complex type (STRUCT).
ARRAY < STRUCT <
place: STRING,
start_year: INT
>>
An ARRAY as a field (CHILDREN) within a STRUCT. (The STRUCT is inside another ARRAY, because it is rare for a STRUCT to be a top-level column.)
ARRAY < STRUCT <
spouse: STRING,
children: ARRAY <STRING>
>>
A STRUCT as the element type of an ARRAY.
ARRAY < STRUCT <
street: STRING,
city: STRING,
country: STRING
>>
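Applying that back to the original boto3 call, a sketch of the corrected column declaration might look like the following. The struct layout array<struct<id:string,timestamp:bigint,message:string>> is only an illustrative assumption, and passing classification through Parameters mirrors how crawler-created tables appear to store it.

import boto3

client = boto3.client(service_name='glue', region_name='us-east-1')

response = client.create_table(
    DatabaseName='dbname',
    TableInput={
        'Name': 'tbname',
        'Description': 'tb description',
        'Owner': 'me',
        # classification is not a dedicated field; crawler-built tables keep it here
        'Parameters': {'classification': 'json'},
        'StorageDescriptor': {
            'Columns': [
                # Complex columns are declared with the full Hive-style type string
                {'Name': 'agents',
                 'Type': 'array<struct<id:string,timestamp:bigint,message:string>>',
                 'Comment': 'from deserializer'},
                {'Name': 'conference_sid', 'Type': 'string', 'Comment': 'from deserializer'},
                {'Name': 'call_sid', 'Type': 'string', 'Comment': 'from deserializer'}
            ],
            'Location': 's3://bucket/location/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'Compressed': False,
            'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'}
        },
        'TableType': 'EXTERNAL_TABLE'
    }
)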