Pinot nested JSON ingestion

I have records with this JSON structure:
{
"name": "Pete",
"age": 24,
"subjects": [
{
"name": "maths",
"grade": "A"
},
{
"name": "maths",
"grade": "B"
}
]
}
and I want to ingest this into a Pinot table to run a query like
select age, subjects_grade, count(*) from table group by age, subjects_grade
Is there a way to do this in a Pinot job?

Pinot has two ways to handle JSON records:
1. Flatten the record during ingestion time:
In this case, we treat each nested field as a separate field, so we need to:
Define those fields in the table schema
Define transform functions in the table config to flatten the nested fields
Please see how the columns subjects_name and subjects_grade are defined below. Since subjects is an array, both fields are multi-value columns in Pinot.
2. Directly ingest JSON records:
In this case, we treat the whole nested field as one single field, so we need to:
Define the JSON field in the table schema as a STRING with a maxLength value
Put this field into noDictionaryColumns and jsonIndexColumns in the table config
Define a jsonFormat transform function in the table config to stringify the JSON field
Please see how column subjects_str is defined below.
Below are the sample table schema, table config, and queries:
Sample Pinot Schema:
{
"metricFieldSpecs": [],
"dimensionFieldSpecs": [
{
"dataType": "STRING",
"name": "name"
},
{
"dataType": "LONG",
"name": "age"
},
{
"dataType": "STRING",
"name": "subjects_str"
},
{
"dataType": "STRING",
"name": "subjects_name",
"singleValueField": false
},
{
"dataType": "STRING",
"name": "subjects_grade",
"singleValueField": false
}
],
"dateTimeFieldSpecs": [],
"schemaName": "myTable"
}
Sample Table Config:
{
"tableName": "myTable",
"tableType": "OFFLINE",
"segmentsConfig": {
"segmentPushType": "APPEND",
"segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
"schemaName": "myTable",
"replication": "1"
},
"tenants": {},
"tableIndexConfig": {
"loadMode": "MMAP",
"invertedIndexColumns": [],
"noDictionaryColumns": [
"subjects_str"
],
"jsonIndexColumns": [
"subjects_str"
]
},
"metadata": {
"customConfigs": {}
},
"ingestionConfig": {
"batchIngestionConfig": {
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "DAILY",
"batchConfigMaps": [],
"segmentNameSpec": {},
"pushSpec": {}
},
"transformConfigs": [
{
"columnName": "subjects_str",
"transformFunction": "jsonFormat(subjects)"
},
{
"columnName": "subjects_name",
"transformFunction": "jsonPathArray(subjects, '$.[*].name')"
},
{
"columnName": "subjects_grade",
"transformFunction": "jsonPathArray(subjects, '$.[*].grade')"
}
]
}
}
Sample Queries:
select age, subjects_grade, count(*) from myTable GROUP BY age, subjects_grade
select age, json_extract_scalar(subjects_str, '$.[*].grade', 'STRING') as subjects_grade, count(*) from myTable GROUP BY age, subjects_grade
Comparing the two approaches, we recommend solution 1 (flattening the nested fields out) when the field density is high (e.g. every document has the name and grade fields, so it is worth extracting them into new columns); it gives better query performance and better storage efficiency.
Solution 2 is simpler to configure and is good for sparse fields (e.g. only a few documents have certain fields). It requires using the json_extract_scalar function to access the nested fields.
Please also note the behavior of Pinot GROUP BY on multi-value columns: each value of a multi-value column forms its own group, so a single document can contribute to several groups (e.g. a student with grades ["A", "B"] is counted in both the A group and the B group).
More references:
Pinot Column Transformation
Pinot JSON Functions
Pinot JSON Index
Pinot Multi-value Functions

Power BI Custom Visual: column order is incorrect

I'm developing a Power Bi custom visual, but I have a problem: when the user adds dimensions to the visual, the order shown in the UI does not reflect the actual order of the columns in the data I get from Power Bi. See for example this screenshot:
This is very limiting in a lot of scenarios, for example if I want to draw a table with columns in the order that the user sets.
Why does the API behave like this? Doesn't seem logical to me, am I doing something wrong? Here is the data binding definition if it helps:
"dataRoles": [
{
"displayName": "Fields",
"name": "fields",
"kind": "Grouping"
},
{
"displayName": "Measures",
"name": "measures",
"kind": "Measure"
}
],
"dataViewMappings": [
{
"table": {
"rows": {
"select": [
{
"for": {
"in": "fields"
}
},
{
"for":{
"in": "measures"
}
}
],
"dataReductionAlgorithm": {
"window": {
"count": 30000
}
}
}
}
}
]
I think I solved it.
You can get the actual column order like this:
dataView.metadata.columns.forEach(col => {
let columnOrder = col['rolesIndex'].XXX[0];
});
where XXX is the name of the Data Role, which in the example in my question would be fields.
Note that you have to access the rolesIndex property by key name like I did above (or alternatively cast to any before accessing it), because the DataViewMetadataColumn type is missing that property for some reason.
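For a fuller picture, here is a minimal TypeScript sketch of sorting the columns of one data role into the order the user arranged them in the field well. It assumes the standard powerbi-visuals-api typings; orderedColumns is just an illustrative helper name, and rolesIndex is read through a cast because the typings omit it.
import powerbi from "powerbi-visuals-api";
import DataView = powerbi.DataView;
import DataViewMetadataColumn = powerbi.DataViewMetadataColumn;

// Hypothetical helper: returns the metadata columns assigned to one data role,
// sorted by the position the user gave them in the field well.
function orderedColumns(dataView: DataView, roleName: string): DataViewMetadataColumn[] {
    return dataView.metadata.columns
        .filter(col => col.roles && col.roles[roleName])
        // rolesIndex is not declared on DataViewMetadataColumn, so cast to any.
        .sort((a, b) => (a as any)["rolesIndex"][roleName][0] - (b as any)["rolesIndex"][roleName][0]);
}

// Usage inside update(): const cols = orderedColumns(options.dataViews[0], "fields");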

What's the best practice for unmarshalling data returned from a dynamo operation in aws step functions?

I have a state machine that runs a DynamoDB query (called using CallAwsService). The format returned looks like this:
{
Items: [
{
"string" : {
"B": blob,
"BOOL": boolean,
"BS": [ blob ],
"L": [
"AttributeValue"
],
"M": {
"string" : "AttributeValue"
},
"N": "string",
"NS": [ "string" ],
"NULL": boolean,
"S": "string",
"SS": [ "string" ]
}
}
]
}
I would like to unmarshall this data efficiently and would like to avoid using a Lambda call for this.
The CDK code we're currently using for the query is below:
interface FindItemsStepFunctionProps {
table: Table
id: string
}
export const FindItemsStepFunction = (scope: Construct, props: FindItemsStepFunctionProps): StateMachine => {
const { table, id } = props
const definition = new CallAwsService(scope, 'Query', {
service: 'dynamoDb',
action: 'query',
parameters: {
TableName: table.tableName,
IndexName: 'exampleIndexName',
KeyConditionExpression: 'id = :id',
ExpressionAttributeValues: {
':id': {
'S.$': '$.path.id',
},
},
},
iamResources: ['*'],
})
return new StateMachine(scope, id, {
logs: {
destination: new LogGroup(scope, `${id}LogGroup`, {
logGroupName: `${id}LogGroup`,
removalPolicy: RemovalPolicy.DESTROY,
retention: RetentionDays.ONE_WEEK,
}),
level: LogLevel.ALL,
},
definition,
stateMachineType: StateMachineType.EXPRESS,
stateMachineName: id,
timeout: Duration.minutes(5),
})
}
Can you unmarshall the data downstream? I'm not too well versed in Step Functions; do you have the ability to import utilities?
Unmarshalling DDB JSON is as simple as calling the unmarshall function from the DynamoDB utility package:
https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/modules/_aws_sdk_util_dynamodb.html
You may need to do so downstream, as Step Functions seems to use the low-level client.
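To illustrate the comment above, a minimal sketch of unmarshalling the Items array in a downstream consumer (assuming the AWS SDK for JavaScript v3 packages; unmarshallItems is just an illustrative helper name):
import { unmarshall } from "@aws-sdk/util-dynamodb";
import type { AttributeValue } from "@aws-sdk/client-dynamodb";

// Converts the raw attribute-value records (the Items array from the Query
// output shown in the question) into plain JavaScript objects.
function unmarshallItems(items: Record<string, AttributeValue>[]): Record<string, unknown>[] {
    return items.map(item => unmarshall(item));
}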
Step Functions still doesn't make it easy enough to call DynamoDB directly from a step in a state machine without using a Lambda function. The main missing parts are the handling of the different cases of finding zero, one, or more records in a query, and the unmarshalling of the slightly complicated format of DynamoDB records. Sadly, the $utils library is still not supported in Step Functions.
You will need to implement these two in specific steps in the graph.
Here is a diagram of the steps that we use as a DynamoDB query template:
The first step is used to provide parameters to the query. This step can be omitted if you define the parameters directly in the query step:
"Set Query Parameters": {
"Type": "Pass",
"Next": "DynamoDB Query ...",
"Result": {
"tableName": "<TABLE_NAME>",
"key_value": "<QUERY_KEY>",
"attribute_value": "<ATTRIBUTE_VALUE>"
}
}
The next step is the actual query to DynamoDB. You can also use GetItem instead of Query if you have the record keys.
"Type": "Task",
"Parameters": {
"TableName": "$.tableName",
"IndexName": "<INDEX_NAME_IF_NEEDED>",
"KeyConditionExpression": "#n1 = :v1",
"FilterExpression": "#n2.#n3 = :v2",
"ExpressionAttributeNames": {
"#n1": "<KEY_NAME>",
"#n2": "<ATTRIBUTE_NAME>",
"#n3": "<NESTED_ATTRIBUTE_NAME>"
},
"ExpressionAttributeValues": {
":v1": {
"S.$": "$.key_value"
},
":v2": {
"S.$": "$.attribute_value"
}
},
"ScanIndexForward": false
},
"Resource": "arn:aws:states:::aws-sdk:dynamodb:query",
"ResultPath": "$.ddb_record",
"ResultSelector": {
"result.$": "$.Items[0]"
},
"Next": "Check for DDB Object"
}
The above example seems a bit complicated, using both ExpressionAttributeNames and ExpressionAttributeValues. However, it makes it possible to query on nested attributes such as item.id.
In this example, we only take the first item response with $.Items[0]. However, you can take all the results if you need more than one.
The next step is to check if the query returned a record or not.
"Check for DDB Object": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.ddb_record.result",
"IsNull": false,
"Comment": "Found Context Object",
"Next": "Parse DDB Object"
}
],
"Default": "Do Nothing"
}
And lastly, to answer your original question, we can parse the query result, in case we have one:
"Parse DDB Object": {
"Type": "Pass",
"Parameters": {
"string_object.$": "$.ddb_record.result.string_object.S",
"bool_object.$": "$.ddb_record.result.bool_object.Bool",
"dict_object": {
"nested_dict_object.$": "$.ddb_record.result.item.M.name.S",
},
"dict_object_full.$": "States.StringToJson($.ddb_record.result.JSON_object.S)"
},
"ResultPath": "$.parsed_ddb_record",
"End": true
}
Please note that:
Simple strings are easily converted by "string_object.$": "$.ddb_record.result.string_object.S"
The same goes for numbers and booleans (e.g. "bool_object.$": "$.ddb_record.result.bool_object.BOOL")
Nested objects are parsed through the map (M) attribute (e.g. "nested_dict_object.$": "$.ddb_record.result.item.M.name.S")
Creation of a JSON object can be achieved by using States.StringToJson
The parsed object is added as a new entry on the flow using "ResultPath": "$.parsed_ddb_record"
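Since the question builds the state machine with CDK, here is a rough sketch of how the Choice and Parse steps above could be wired onto the existing CallAwsService task using aws-cdk-lib (paths and state names follow the ASL above; treat it as an outline under those assumptions, not a drop-in implementation):
import * as sfn from 'aws-cdk-lib/aws-stepfunctions'

// `scope` is the surrounding Construct and `query` is the CallAwsService task
// from the question, extended with resultPath: '$.ddb_record' and
// resultSelector: { 'result.$': '$.Items[0]' } as in the ASL above.
const parse = new sfn.Pass(scope, 'Parse DDB Object', {
  parameters: {
    stringObject: sfn.JsonPath.stringAt('$.ddb_record.result.string_object.S'),
  },
  resultPath: '$.parsed_ddb_record',
})

const check = new sfn.Choice(scope, 'Check for DDB Object')
  .when(sfn.Condition.isNotNull('$.ddb_record.result'), parse)
  .otherwise(new sfn.Pass(scope, 'Do Nothing'))

const definition = query.next(check)
The resulting `definition` chain can then be passed to the StateMachine construct exactly as in the question's code.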

DynamoDB getItem - expected item to be a S when it's an N

I have one row in a table where the N value is 1 and not 0. This field is called active_duty_manager and I want to pull back the row where the value is 1 so I can get the user credentials.
When I query the table using the following code:
var params = {
AttributesToGet: ['mobile'],
TableName: 've-users',
Key: { 'is_active_duty_manager': {N:1} },
};
ddb.getItem(params, function (err, data) {
if (err) {
console.log(err);
} else { // Call DynamoDB to read the item from the table
console.log("Success, duty manager =",data.Item.user_id.N);
}
})
I get the following Error:
{ InvalidParameterType: Expected params.Key['is_active_duty_manager'].N to be a string
at ParamValidator.fail (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:50:37)
at ParamValidator.validateType (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:222:10)
at ParamValidator.validateString (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:154:32)
at ParamValidator.validateScalar (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:130:21)
at ParamValidator.validateMember (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:94:21)
at ParamValidator.validateStructure (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:75:14)
at ParamValidator.validateMember (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:88:21)
at ParamValidator.validateMap (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:117:14)
at ParamValidator.validateMember (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:92:21)
at ParamValidator.validateStructure (/Users/kevin/lambda/dynamo/node_modules/aws-sdk/lib/param_validator.js:75:14)
message: 'Expected params.Key[\'is_active_duty_manager\'].N to be a string',
code: 'InvalidParameterType',
time: 2018-02-26T20:13:09.795Z }
If I export a row as a CSV I can see the column types are S or N and, for example, active_duty_manager is definitely a Number. So the question is why the error expects the params.Key value to be a string?
Many thanks
Kevin
So looking at your table, you have a primary key on user_id. This means you cannot write a query on this table that will give you the asked results; no matter what you write, it won't work.
As I see it you basically have two options:
1. Write a scan with a filter on is_active_duty_manager equals 1. This is however fairly expensive, as it will always read all items.
2. Make a global secondary index on is_active_duty_manager and only write 1 to it, leaving it blank otherwise. This way you will get a sparse index with just the items which have this value set. You can then query this index, which will be very fast and cheap (see the query sketch after this answer).
If your table is very small, option 1 might still work for you. Cost optimization is a little bit out of scope here, good luck!
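For option 2, a rough sketch of the query against such a sparse index using the DocumentClient, which also spares you the {"N":"1"} type descriptors discussed in the other answer. The index name here is hypothetical, and the GSI has to be created on the table first:
import { DynamoDB } from 'aws-sdk';

const docClient = new DynamoDB.DocumentClient();

const params = {
  TableName: 've-users',
  IndexName: 'is_active_duty_manager-index', // hypothetical GSI name
  KeyConditionExpression: 'is_active_duty_manager = :v',
  ExpressionAttributeValues: { ':v': 1 }, // native number, no { N: '1' } wrapper needed
  ProjectionExpression: 'mobile, user_id',
};

docClient.query(params, (err, data) => {
  if (err) console.log(err);
  else if (data.Items && data.Items.length > 0)
    console.log('Success, duty manager =', data.Items[0].user_id);
});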
Looks like you need to define the key like
Key: { 'is_active_duty_manager': {'N':'1'} },
You may need to restructure your entire params with quotes.
var params = {
"AttributesToGet": ["mobile"],
"TableName": "ve-users",
"Key": { "is_active_duty_manager": {"N":"1"} },
};
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_GetItem.html
Here is the Request Syntax from the DynamoDB API Reference:
{
"AttributesToGet": [ "string" ],
"ConsistentRead": boolean,
"ExpressionAttributeNames": {
"string" : "string"
},
"Key": {
"string" : {
"B": blob,
"BOOL": boolean,
"BS": [ blob ],
"L": [
"AttributeValue"
],
"M": {
"string" : "AttributeValue"
},
"N": "string",
"NS": [ "string" ],
"NULL": boolean,
"S": "string",
"SS": [ "string" ]
}
},
"ProjectionExpression": "string",
"ReturnConsumedCapacity": "string",
"TableName": "string"
}

ElasticSearch AND query in python

I am trying to query Elasticsearch for logs which have one field with some value and another field with another value.
My logs look like this in Kibana:
{
"_index": "logstash-2016.08.01",
"_type": "logstash",
"_id": "6345634653456",
"_score": null,
"_source": {
"#timestamp": "2016-08-01T09:03:50.372Z",
"session_id": "value_1",
"host": "local",
"message": "some message here with error",
"exception": null,
"level": "ERROR",
},
"fields": {
"#timestamp": [
1470042230372
]
}
}
I would like to receive all logs which have the value "ERROR" in the level field (inside _source) and the value value_1 in the session_id field (also inside _source).
I am managing to query for one of them but not both together:
from elasticsearch import Elasticsearch
host = "localhost"
INDEX = "logstash-2016.08.01"  # index name taken from the example log above
es = Elasticsearch([{'host': host, 'port': 9200}])
query = 'session_id:"{}"'.format("value_1")
result = es.search(index=INDEX, q=query)
Since you need to match exact values, I would recommend using filters, not queries.
Filter for your case would look somewhat like this:
filter = {
"filter": {
"and": [
{
"term": {
"level": "ERROR"
}
},
{
"term": {
"session_id": "value_1"
}
}
]
}
}
And you can pass it to the search call using es.search(index=INDEX, body=filter)
EDIT: reason to use filters instead of queries: "In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g."
Source: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-filter-context.html

How can I turn the deepest elements of nested JSON payload into individual rows in Power Query?

Goal:
I have a JSON payload with the following format:
{
"Values": [
{
"Details": {
"14342": {
"2016-06-07T00:00:00": {
"Value": 99.62,
"Count": 7186
},
"2016-06-08T00:00:00": {
"Value": 99.73,
"Count": 7492
}
},
"14362": {
"2016-06-07T00:00:00": {
"Value": 97.55,
"Count": 1879
},
"2016-06-08T00:00:00": {
"Value": 92.68,
"Count": 355
}
}
},
"Key": "query5570027",
"Total": 0.0
},
{
"Details": {
"14342": {
"2016-06-07T00:00:00": {
"Value": 0.0,
"Count": 1018
},
"2016-06-08T00:00:00": {
"Value": 0.0,
"Count": 1227
}
}
},
"Key": "query4004194",
"Total": 0.0
}
],
"LatencyInMinute": 0.0
}
I want to load this in Power BI and produce a table like so:
Notice how each Value + Count pair has its own row and some elements are repeated.
Problem: When I try to do this in Power BI (via Power Query), I get three initial columns, one of which is Details. The trouble is that I can expand Details, but I just get more columns, whereas what I really want is rows. I tried transposing, pivoting columns, and such, but nothing helped. My troubles are exacerbated by Power Query treating the nested data elements as column names.
Question: Is there a way, in M, to convert this nested JSON payload to the table example I illustrated above?
Chris Webb wrote a recursive function to expand all table-type columns - I've managed to clone it for record-type columns:
https://gist.github.com/Mike-Honey/0a252edf66c3c486b69b
If you use Record.FromList for the expansion it should work.
You can find an example in the script here: https://chris.koester.io/wp-content/uploads/2016/04/TransformJsonArrayWithPowerQueryImkeFeldmann.txt