FILEBEAT-KAFKA-EVENTHUB-AZURE DATA EXPLORER : Not able to parse the events using json mapping - azure-eventhub

I am working on a new PoC to establish a new data pipeline for our platform. I am shipping logs from the application to Event Hubs (Kafka-enabled) and trying to consume the messages into an ADX table.
I have created the data connection in ADX that maps to the Event Hub.
My table definition in ADX is:
.create table Trident ( Context:string, Latency:string, TimeStampUtc:string, Status:string, Source:string, Destination:string, LatencyType:string, CorrelationId:string)
I have tried the following table definition and JSON mappings, but ADX was never able to map the incoming event values to the corresponding columns:
.create table Trident ( Context:dynamic, Latency:dynamic, TimeStampUtc:dynamic, Status:dynamic, Source:dynamic, Destination:dynamic, LatencyType:dynamic, CorrelationId:dynamic)
.create-or-alter table Trident ingestion json mapping 'TridentMapping' '[{"column":"Context","path":"$.message.Context","datatype":"dynamic"},{"column":"Latency","path":"$.message.Latency","datatype":"dynamic"},{"column":"TimeStampUtc","path":"$.message.TimeStampUtc","datatype":"dynamic"},{"column":"Status","path":"$.message.Status","datatype":"dynamic"},{"column":"Source","path":"$.message.Source","datatype":"dynamic"},{"column":"Destination","path":"$.message.Destination","datatype":"dynamic"},{"column":"LatencyType","path":"$.message.LatencyType","datatype":"dynamic"},{"column":"CorrelationId","path":"$.message.CorrelationId","datatype":"dynamic"}]'
.create-or-alter table Trident ingestion json mapping 'TridentMapping' '[{"column":"Context","path":"$.message[Context]","datatype":"string"},{"column":"Latency","path":"$.message[Latency]","datatype":"string"},{"column":"TimeStampUtc","path":"$.message[TimeStampUtc]","datatype":"string"},{"column":"Status","path":"$.message[Status]","datatype":"string"},{"column":"Source","path":"$.message[Source]","datatype":"string"},{"column":"Destination","path":"$.message[Destination]","datatype":"string"},{"column":"LatencyType","path":"$.message[LatencyType]","datatype":"string"},{"column":"CorrelationId","path":"$.message[CorrelationId]","datatype":"string"}]'
.create-or-alter table Trident ingestion json mapping 'TridentMapping' '[{"column":"Context","transform":"Context"},{"column":"Latency","transform":"Latency"},{"column":"TimeStampUtc","transform":"TimeStampUtc"},{"column":"Status","transform":"Status"},{"column":"Source","transform":"Source"},{"column":"Destination","transform":"Destination"},{"column":"LatencyType","transform":"$.LatencyType"},{"column":"CorrelationId","transform":"CorrelationId"}]'
None of the mappings was able to map the incoming events to the corresponding columns in the Trident table.
The JSON payload generated by Filebeat is as follows.
Message Received:
{
'@timestamp': '2019-07-12T01:43:34.196Z',
'@metadata': {
'beat': 'filebeat',
'type': '_doc',
'version': '7.2.0',
'topic': 'trident2'
},
'host': {
'name': 'tridenet-st-az-vm-pragna'
},
'agent': {
'version': '7.2.0',
'type': 'filebeat',
'ephemeral_id': '2fb76a89-2d30-45e2-8ac3-8e47f086bb60',
'hostname': 'tridenet-st-az-vm-pragna',
'id': 'eb1c4b07-75f5-4c0c-bfc8-5a56016760ee'
},
'log': {
'offset': 2801247,
'file': {
'path': '/home/prmoh/trident/CatE2ECSharpLoadGenerator/CatE2ECSharpLoadGen/temp/test.log'
}
},
'message': '{\'Context\':\'Trident-AZ-EastUS2-AzurePublicCloud-0ea43e61-f92c-4dc7-bab6-c9bf049d50d1\',\'Latency\':\'39.3731389843734\',\'TimeStampUtc\':\'7/12/19 1:43:34 AM\',\'Status\':\'200\',\'Source\':\'BC5BCA47-A882-4096-BB2D-D76E6C170534\',\'Destination\':\'090556DA-D4FA-764F-A9F1-63614EDA019A\',\'LatencyType\':\'File-Write\',\'CorrelationId\':\'3e8f064a-2477-490a-88fc-3f55b035cfee\'}',
'ecs': {
'version': '1.0.0'
}
}

The document you pasted does not look like valid JSON.
Can you start with a mapping that maps the whole document to a single column in a test table? For example:
.create table test(message:dynamic)
.create table test ingestion json mapping "map" '[{"column":"message", "path":"$"}]'
This will allow you to see the actual JSON document that arrived in ADX and to easily create the applicable mapping.
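One more thing worth checking in the sample above: the message field is itself a JSON-encoded string (note the escaped quotes), so a path such as $.message.Context points into a string rather than into a nested object. A minimal Python sketch, using a trimmed-down copy of the sample event, shows that the inner fields only become addressable after a second parse:

import json

# Trimmed-down copy of the Filebeat event from the question;
# "message" is a string that contains escaped JSON, not a nested object.
event = {
    "message": "{\"Context\":\"Trident-AZ-EastUS2\",\"Status\":\"200\"}",
}

print(type(event["message"]))             # <class 'str'> - still a plain string

# Only after a second json.loads do Context/Status/... exist as keys.
inner = json.loads(event["message"])
print(inner["Context"], inner["Status"])  # Trident-AZ-EastUS2 200

If that is indeed what ADX receives, ingesting into a single dynamic column as suggested above and then parsing the message column (or adjusting the Filebeat configuration so the inner JSON is decoded before shipping) is a reasonable next step.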

Related

Glue JSON serialization and athena query, return full record each field

I've been trying for a long time, through Glue's crawlers, to get the .json files in my S3 bucket recognized so they can be queried in Athena. But after several changes to the settings, the best result I got is still wrong.
Glue's crawler even recognizes the column structure of my .json; however, when queried in Athena, it sets up the columns it found but throws all the items onto the same row, one item per column, as in the images below.
My Classifier setting is "$[*]".
The .json data structure
[
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e3a", "airspace_p": 1061, "codedistv1": "SFC", "fid": 299 },
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e39", "airspace_p": 408, "codedistv1": "STD", "fid": 766 },
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e38", "airspace_p": 901, "codedistv1": "STD", "fid": 806 },
...
]
Configuration result in Glue: [image]
Result in Athena from this table: [image]
I have already tried different .json structures and different classifiers, and I have changed and added the JsonSerde.
If you can change the data source, use the JSON lines format instead, then run the Glue crawler without any custom classifier.
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e3a","airspace_p": 1061,"codedistv1": "SFC","fid": 299}
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e39","airspace_p": 408,"codedistv1": "STD","fid": 766}
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e38","airspace_p": 901,"codedistv1": "STD","fid": 806}
The cause of your issue is that Athena doesn't support custom JSON classifiers.
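If the source files can be rewritten as suggested above, a small Python sketch along these lines does the conversion; the file names are placeholders, not taken from the question:

import json

# Read a file containing one JSON array (the original layout) and
# rewrite it as JSON Lines: one compact JSON object per line.
with open("airspaces.json") as src:        # placeholder input path
    records = json.load(src)

with open("airspaces.jsonl", "w") as dst:  # placeholder output path
    for record in records:
        dst.write(json.dumps(record) + "\n")

After that, point the Glue crawler at the .jsonl output and leave the classifier at its default.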

Not able to add data into DynamoDB using API gateway POST method

I made a serverless API backend in the AWS console which uses API Gateway, DynamoDB, and Lambda functions.
After creating it, I can add data to DynamoDB online by adding a JSON item, which looks like this:
{
"id": "4",
"k": "key1",
"v": "value1"
}
But when I try to add this using "Postman", by putting the above JSON data in the body of the POST message, I get a positive return (i.e. no errors), but only the "id" field is added to the database and not "k" or "v".
What is missing?
I think you need to check your Lambda function.
As you are using Postman to make the API calls, the received event's body will be as follows:
{'resource':
...
}, 'body': '{\n\t"id": 1,\n\t"name": "ben"\n
}', 'isBase64Encoded': False
}
As you can see:
'body': '{\n\t"id": 1,\n\t"name": "ben"\n}'
For example, I will use Python 3 for this case. What I need to do is load the body as JSON; then we are able to use it.
result = json.loads(event['body'])
id = result['id']
name = result['name']
Then put them into DynamoDB:
item = table.put_item(
    Item={
        'id': str(id),
        'name': str(name)
    }
)
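Putting those pieces together, a minimal sketch of such a handler might look as follows; the table name, the extra "k"/"v" fields and the returned status code are assumptions based on the question, not code from the original answer:

import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # placeholder table name

def handler(event, context):
    # With the API Gateway proxy integration, the POST payload arrives as a
    # JSON string in event['body'] and must be parsed before use.
    body = json.loads(event["body"])

    table.put_item(
        Item={
            "id": str(body["id"]),
            "k": str(body["k"]),
            "v": str(body["v"]),
        }
    )

    return {"statusCode": 200, "body": json.dumps({"ok": True})}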

Boto3 athena query without saving data to s3

I am trying to use boto3 to run a set of queries, and I don't want to save the data to S3. Instead I just want to get the results and work with them. I am trying to do the following:
import boto3
client = boto3.client('athena')
response = client.start_query_execution(
    QueryString='''SELECT * FROM mytable limit 10''',
    QueryExecutionContext={
        'Database': 'my_db'
    },
    ResultConfiguration={
        'OutputLocation': 's3://outputpath',
    }
)
print(response)
print(response)
But here I don't want to give ResultConfiguration because I don't want to write the results anywhere. But if I remove the ResultConfiguration parameter, I get the following error:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "ResultConfiguration"
So it seems that giving an S3 output location for writing is mandatory. What would be the way to avoid this and get the results only in the response?
The StartQueryExecution action indeed requires an S3 output location; the ResultConfiguration parameter is mandatory.
The alternative way to query Athena is using the JDBC or ODBC drivers. You should probably use that method if you don't want to store results in S3.
You will have to specify an S3 temp bucket location whenever running the 'start_query_execution' command. However, you can get a result set (a dict) by running the 'get_query_results' method using the query id.
The response (dict) will look like this:
{
    'UpdateCount': 123,
    'ResultSet': {
        'Rows': [
            {
                'Data': [
                    {
                        'VarCharValue': 'string'
                    },
                ]
            },
        ],
        'ResultSetMetadata': {
            'ColumnInfo': [
                {
                    'CatalogName': 'string',
                    'SchemaName': 'string',
                    'TableName': 'string',
                    'Name': 'string',
                    'Label': 'string',
                    'Type': 'string',
                    'Precision': 123,
                    'Scale': 123,
                    'Nullable': 'NOT_NULL'|'NULLABLE'|'UNKNOWN',
                    'CaseSensitive': True|False
                },
            ]
        }
    },
    'NextToken': 'string'
}
For more information, see boto3 client doc: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/athena.html#Athena.Client.get_query_results
You can then delete all files in the S3 temp bucket you've specified.
You still need to provide an S3 location as a temporary location for Athena to save the data, even though you want to process the data using Python. But you can page through the data using the pagination API; please refer to the example here. Hope that helps.
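Putting the answers together, a sketch of the whole flow could look like this: start the query (the S3 staging location is still required), poll until it finishes, then page through the rows with get_query_results instead of downloading the CSV from S3 yourself. The bucket, database and table names are placeholders:

import time
import boto3

client = boto3.client("athena")

# Start the query; Athena still requires an S3 staging location.
execution = client.start_query_execution(
    QueryString="SELECT * FROM mytable LIMIT 10",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://outputpath/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = client.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Read the result set from the API, one page at a time.
if state == "SUCCEEDED":
    paginator = client.get_paginator("get_query_results")
    for page in paginator.paginate(QueryExecutionId=query_id):
        for row in page["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])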

AWS: Transforming data from DynamoDB before it's sent to Cloudsearch

I'm trying to set up AWS' Cloudsearch with a DynamoDB table. My data structure is something like this:
{
"name": "John Smith",
"phone": "0123 456 789",
"business": {
"name": "Johnny's Cool Co",
"id": "12345",
"type": "contractor",
"suburb": "Sydney"
},
"profession": {
"name": "Plumber",
"id": "20"
},
"email": "johnsmith@gmail.com",
"id": "354684354-4b32-53e3-8949846-211384"
}
Importing this data from DynamoDB into CloudSearch is a breeze; however, I want to be able to index on some of these nested object parameters (like business.name, profession.name, etc.).
CloudSearch is pulling in some of the nested fields like suburb, but it seems like it's impossible for it to differentiate between the name at the root of the object and the name within the business and profession objects.
Questions:
How do I make these nested parameters searchable? Can I index on business.name or something?
If #1 is not possible, can I somehow send my data through a transforming function before it gets to CloudSearch? This way I could flatten all of my objects and give the fields unique names like businessName and professionName.
EDIT:
My solution at the moment is to have a separate DynamoDB table which replicates our users table, but stores it in a CloudSearch-friendly format. However, I don't like this solution at all so any other ideas are totally welcome!
You can use DynamoDB Streams and write a function that runs in Lambda to capture changes and add documents to CloudSearch, flattening them at that point, instead of keeping an additional DynamoDB table.
For example, within my Lambda function I keep the list of nested fields (within a "body" parent in this case) and just flatten them under their field names; in the case of duplicate sub-field names you can prefix the key with the parent name to create a new field, such as "body-name", as the key.
... misc. setup ...
headers = { "Content-Type": "application/json" }
indexed_fields = ['app', 'name', 'activity']  # fields to flatten

def handler(event, context):  # lambda handler called at each update
    document = {}  # document to be uploaded to cloudsearch
    document['id'] = ...  # your uid, from the dynamo update record likely
    document['type'] = 'add'
    all_fields = {}
    # flatten/pull out info you want indexed
    for record in event['Records']:
        body = record['dynamodb']['NewImage']['body']['M']
        for key in indexed_fields:
            all_fields[key] = body[key]['S']
    document['fields'] = all_fields
    # post update to cloudsearch endpoint
    r = requests.post(url, auth=awsauth, json=document, headers=headers)
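For the duplicate sub-field case mentioned above, a hypothetical helper (not part of the original handler) that prefixes nested keys with their parent name could look like this:

def flatten(image, parent=None):
    # Flatten a DynamoDB stream attribute map, prefixing nested keys with
    # their parent name, e.g. {'body': {'M': {'name': {'S': 'x'}}}}
    # becomes {'body-name': 'x'}.
    flat = {}
    for key, value in image.items():
        name = f"{parent}-{key}" if parent else key
        if "M" in value:                       # nested map: recurse into it
            flat.update(flatten(value["M"], parent=name))
        else:                                  # scalar: keep the typed value
            flat[name] = next(iter(value.values()))
    return flat

With something like this, document['fields'] = flatten(record['dynamodb']['NewImage']) would replace the inner loop above.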

Google BigQuery support for Avro logicalTypes

As Google claims, there is no support for conversion from an Avro logicalType to the corresponding BigQuery type (as described here, at the bottom).
However, I'm able to load an Avro file with the following schema:
schema = {
'name': 'test',
'namespace': 'testing',
'type': 'record',
'fields': [
{'name': 'test_timestamp', 'type': 'long', 'logicalType': 'timestamp-millis'},
],
}
into BigQuery with a column of type TIMESTAMP.
The situation is different with the following schema:
schema = {
'name': 'test',
'namespace': 'testing',
'type': 'record',
'fields': [
{'name': 'testdate', 'type': 'int', 'logicalType': 'date'},
],
}
and a BigQuery table with a column of type DATE. I was using bq load in the following way (in both cases):
bq --location=EU load --source_format=AVRO --project_id=test-project dataset.table "gs://bucket/test_file.avro"
and it failed with the exception:
Field testdate has changed type from DATE to INTEGER
Is there any chance that logicalTypes will be supported by BigQuery, or is there any elegant way to work around this situation? (I'm aware of the workaround where a temporary table is used and then a BigQuery SQL SELECT casts the TIMESTAMPs to DATEs, but it's not really pretty :P)
Native understanding for Avro Logical Types is now available publicly for all BigQuery users. Please refer to the documentation page here for more details: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types
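If you load from Python rather than the bq CLI, the google-cloud-bigquery client exposes this through the use_avro_logical_types load option (the bq equivalent is the --use_avro_logical_types flag). A hedged sketch; the URI and table reference below just reuse the placeholders from the question:

from google.cloud import bigquery

client = bigquery.Client(project="test-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # Convert Avro logical types (date, timestamp-millis, ...) into the
    # corresponding BigQuery types instead of the raw INT/LONG values.
    use_avro_logical_types=True,
)

load_job = client.load_table_from_uri(
    "gs://bucket/test_file.avro",  # placeholder URI from the question
    "dataset.table",               # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete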