Parquet Column predicate for null values - mapreduce

I have a problem using Parquet's UnboundRecordFilter with nullable columns.
My Avro schema looks like this:
[{"namespace": "com.test.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "userId", "type": "long"},
{"name": "userType", "type": ["null", "string"], "default":null}
]
}]
I have a filter with a column predicate to read records with a null userType:
public class NullableUserTypeFilter implements UnboundRecordFilter {
    NullUserTypePredicateFunction nullPredicate;

    public NullableUserTypeFilter() {
        this.nullPredicate = new NullUserTypePredicateFunction();
    }

    @Override
    public RecordFilter bind(Iterable<ColumnReader> readers) {
        return ColumnRecordFilter.column("userType", nullPredicate).bind(readers);
    }

    class NullUserTypePredicateFunction implements Predicate {
        public NullUserTypePredicateFunction() {}

        @Override
        public boolean apply(ColumnReader input) {
            return input.getBinary() == null || input.getBinary().toStringUsingUTF8() == null;
        }
    }
}
While running the job I get:
java.lang.Exception: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/Users/kinga/repo/test/test-parquet/target/input/UserSnapshot/0/users-2000.01.01-test.parquet
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/Users/kinga/repo/test/test-parquet/target/input/UserSnapshot/0/users-2000.01.01-test.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:146)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:553)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [userType] BINARY at value 97 out of 100, 97 out of 100 in currentPage. repetition level: 0, definition level: 1
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:483)
at org.apache.parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:416)
at com.test.parquet.filter.NullableUserTypeFilter$NullUserTypePredicateFunction.apply(NullableUserTypeFilter.java:31)
at org.apache.parquet.filter.ColumnRecordFilter.isMatch(ColumnRecordFilter.java:72)
at org.apache.parquet.io.FilteredRecordReader.skipToMatch(FilteredRecordReader.java:80)
at org.apache.parquet.io.FilteredRecordReader.read(FilteredRecordReader.java:60)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
... 14 more
Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readValueDictionaryId(DictionaryValuesReader.java:76)
at org.apache.parquet.column.impl.ColumnReaderImpl$1.read(ColumnReaderImpl.java:166)
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
... 20 more
The problem occurs while reading records with null values.
What is the proper way to deal with nullable fields?
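For what it's worth, the root cause in the trace ("Reading past RLE/BitPacking stream") is consistent with calling getBinary() on a value that is absent: in this (deprecated) parquet.filter API, nullness is signalled by the value's definition level, not by getBinary() returning null. A minimal sketch of the predicate along those lines, assuming the definition-level accessors that parquet-column's ColumnReader exposes:

class NullUserTypePredicateFunction implements ColumnPredicates.Predicate {
    @Override
    public boolean apply(ColumnReader input) {
        // A value is null when its definition level is below the column's
        // max definition level; only call getBinary() for present values.
        return input.getCurrentDefinitionLevel()
                < input.getDescriptor().getMaxDefinitionLevel();
    }
}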

Related

Django loop through json object

How to loop through a JSON object in a Django template?
JSON:
"data": {
"node-A": {
"test1A": "val1A",
"test2A": "val2A",
"progress": {
"conf": "conf123A"
"loc": "loc123A"
},
"test3A": "val3A"
},
"node-B": {
"test1B": "val1B",
"test2B": "val2B",
"progress": {
"conf": "conf123B"
"loc": "loc123B"
},
"test3B": "val3B"
}
}
I am having trouble accessing the nested values "conf" and "loc" inside "progress". How can I access them in a Django template if the data is passed as context, i.e. return render(request, 'monitor.html', {"data_context": json_data['data']})?
The way you have it set up, your data is in a dictionary called 'data_context', so template lookups start from {{ data_context }} and chain downwards with dots. (Note that Django's template syntax does not allow hyphens in lookups, so keys like node-A have to be reached by iterating rather than by dotting into them.)
To avoid the 'data_context.' prefix, pass the inner dict as the context instead:
return render(request, 'monitor.html', json_data['data'])
Dictionary lookups, attribute lookups and list-index lookups are all implemented with dot notation:
{{ my_dict.key.key_nested }}
Since parsed JSON behaves like a dictionary in Python, the data stored under the keys conf and loc is accessible with that notation; because the provided JSON is a nested dictionary, you need to chain the keys to get at your desired data.
Your return statement passes a dictionary, which I will call ret, so the structure should be:
{"data_context": {
"node-A": {
"test1": "val1A",
"test2": "val2A",
"progress": {
"conf": "conf123A",
"loc": "loc123A"
},
"test3": "val3A"
},
"node-B": {
"test1B": "val1B",
"test2B": "val2B",
"progress": {
"conf": "conf123B",
"loc": "loc123B"
},
"test3": "val3B"
}
}
}
Therefore, to access conf and loc:
ret["data_context"]["node-A"]["progress"]["conf"]
will get you the value stored at conf in node-A.
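To loop over all nodes in the template rather than hard-coding node-A, the {% for %} tag works on the dict's items. A minimal sketch of a template body, assuming the context {"data_context": json_data['data']} from the question:

{% for node_name, node in data_context.items %}
    {{ node_name }}: conf = {{ node.progress.conf }}, loc = {{ node.progress.loc }}
{% endfor %}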

databind.exc.MismatchedInputException: Cannot deserialize value - for micronaut - BFF test

I am working with Micronaut GraphQL and writing a test for a request, but I get a deserialization error. The logic seems pretty straightforward: I make a mock call and am supposed to receive a fake response. The request is made without issues (I can see it in the logs), but it seems the response data cannot be parsed and deserialized.
Here is the error message:
message -> Error code: GENERIC_ERROR, description: Exception getData, executionId: 770c8b9b-47cd-445e-9739-19ca0fd890c2, detailedInfo: message = Error instantiating bean of type [com.web.MyApi]: Cannot deserialize value of type `java.util.ArrayList<java.lang.String>` from Object value (token `JsonToken.START_OBJECT`)
at [Source: (String)"{
"data": {
"values": {
"formatted": [
{
"name": "2",
"lastName": "15"
}
]
}
}
}"; line: 5, column: 9] (through reference chain: com.web.DataResponse["data"]->com.web.Values["formatted"]->java.util.ArrayList[0]), cause = com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize value of type `java.util.ArrayList<java.lang.String>` from Object value (token `JsonToken.START_OBJECT`)
at [Source: (String)"{
"data": {
"metr
My test
given("my query description ") {
val scoreCardInfoRequestGraph =
"""
query{
getData(number:"6528")
{
name
lastName
}
}
""".toEscapedQuery()
`when`("posting the query") {
val dto =
client.getResponse(
myGraph,
ReportDataResponse::class.java
)
then("check the score card data formatted correctly") {
//verify
}
}
}
My graph for mocking the response
{
  "data": {
    "values": {
      "formatted": [
        {
          "name": "2",
          "lastName": "15"
        }
      ]
    }
  }
}
And classes that I use for parsing
data class ReportDataResponse(
    @JsonProperty("data")
    val data: Data
)

data class Data(
    @JsonProperty("values")
    val values: Values
)

data class Values(
    @JsonProperty("formatted")
    val formatted: List<List<String>>
)
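For what it's worth, the error says Jackson is binding each element of "formatted" to a List<String> (the java.util.ArrayList<java.lang.String> in the message) and then hits an object (JsonToken.START_OBJECT), while the mocked payload shows each element as an object with name and lastName fields. A sketch of classes matching that payload (the FormattedEntry name is mine, purely illustrative):

data class FormattedEntry(
    @JsonProperty("name") val name: String,
    @JsonProperty("lastName") val lastName: String
)

data class Values(
    @JsonProperty("formatted")
    val formatted: List<FormattedEntry>
)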

jsonPath expression expected a value but returned a list of values

I want to check that the response's "class_type" values are "REGION".
I am testing a Spring Boot API using MockMvc.
The MockHttpServletResponse looks like this:
Status = 200
Error message = null
Headers = {Content-Type=[application/json;charset=UTF-8]}
Content type = application/json;charset=UTF-8
Body =
{"result":true,
"code":200,
"desc":"OK",
"data":{"total_count":15567,
"items": ...
}}
This is the whole response object.
Let's take a closer look, especially at items.
"items": [
{
"id": ...,
"class_type": "REGION",
"region_type": "MULTI_CITY",
"class": "com.model.Region",
"code": "AE-65GQ6",
...
},
{
"id": "...",
"class_type": "REGION",
"region_type": "CITY",
"class": "com.model.Region",
"code": "AE-AAN",
...
},
I tried using jsonPath:
@When("User wants to get list of regions, query is {string} page is {int} pageSize is {int}")
public void userWantsToGetListOfRegionsQueryIsPageIsPageSizeIs(String query, int page, int pageSize) throws Exception {
    mockMvc().perform(get("/api/v1/regions"))
            .andExpect(status().is2xxSuccessful())
            .andDo(print())
            .andExpect(jsonPath("$.data", is(notNullValue())))
            .andExpect(jsonPath("$.data.total_count").isNumber())
            .andExpect(jsonPath("$.data.items").isArray())
            .andExpect(jsonPath("$.data.items[*].class_type").value("REGION"));
    log.info("region list");
}
but
jsonPath("$.data.items[*].class_type").value("REGION")
returns
java.lang.AssertionError: Got a list of values ["REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION","REGION"] instead of the expected single value REGION
I just want to check that "$.data.items[*].class_type" contains "REGION".
How can I change this?
One option would be to check whether your array has elements whose class_type equals 'REGION':
public static final String REGION = "REGION";

mockMvc().perform(get("/api/v1/regions"))
        .andExpect(jsonPath("$.data.items[?(@.class_type == '" + REGION + "')]").exists());
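If you want to assert that every element's class_type is "REGION" (not just that at least one matching element exists), a sketch using the jsonPath overload that accepts a Hamcrest matcher:

import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.everyItem;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;

mockMvc().perform(get("/api/v1/regions"))
        // [*] projects class_type out of every item; everyItem() then checks each value
        .andExpect(jsonPath("$.data.items[*].class_type", everyItem(equalTo("REGION"))));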

"type mismatch error, expected type LIST" for querying a one-to-many relationship in AppSync

The schema:
type User {
  id: ID!
  createdCurricula: [Curriculum]
}

type Curriculum {
  id: ID!
  title: String!
  creator: User!
}
The resolver to query all curricula of a given user:
{
  "version" : "2017-02-28",
  "operation" : "Query",
  "query" : {
    ## Provide a query expression.
    "expression": "userId = :userId",
    "expressionValues" : {
      ":userId" : {
        "S" : "${context.source.id}"
      }
    }
  },
  "index": "userIdIndex",
  "limit": #if(${context.arguments.limit}) ${context.arguments.limit} #else 20 #end,
  "nextToken": #if(${context.arguments.nextToken}) "${context.arguments.nextToken}" #else null #end
}
The response map:
{
  "items": $util.toJson($context.result.items),
  "nextToken": #if(${context.result.nextToken}) "${context.result.nextToken}" #else null #end
}
The query:
query {
  getUser(id: "0b6af629-6009-4f4d-a52f-67aef7b42f43") {
    id
    createdCurricula {
      title
    }
  }
}
The error:
{
"data": {
"getUser": {
"id": "0b6af629-6009-4f4d-a52f-67aef7b42f43",
"createdCurricula": null
}
},
"errors": [
{
"path": [
"getUser",
"createdCurricula"
],
"locations": null,
"message": "Can't resolve value (/getUser/createdCurricula) : type mismatch error, expected type LIST"
}
]
}
The CurriculumTable has a global secondary index titled userIdIndex, which has userId as the partition key.
If I change the response map to this:
$util.toJson($context.result.items)
The output is the following:
{
"data": {
"getUser": {
"id": "0b6af629-6009-4f4d-a52f-67aef7b42f43",
"createdCurricula": null
}
},
"errors": [
{
"path": [
"getUser",
"createdCurricula"
],
"errorType": "MappingTemplate",
"locations": [
{
"line": 4,
"column": 5
}
],
"message": "Unable to convert \n{\n [{\"id\":\"87897987\",\"title\":\"Test Curriculum\",\"userId\":\"0b6af629-6009-4f4d-a52f-67aef7b42f43\"}],\n} to class java.lang.Object."
}
]
}
If I take that string and run it through a console.log in my frontend app, I get:
{
[{"id":"2","userId":"0b6af629-6009-4f4d-a52f-67aef7b42f43"},{"id":"1","userId":"0b6af629-6009-4f4d-a52f-67aef7b42f43"}]
}
That's clearly an object. How do I make it... not an object, so that AppSync properly reads it as a list?
SOLUTION
My response map had a set of curly braces around it. I'm pretty sure it was placed there by Amazon's generator. Removing them fixed it.
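In other words, once those wrapping braces are removed, the response mapping template for the createdCurricula field (type [Curriculum]) reduces to returning the list itself, pagination handling aside:

$util.toJson($context.result.items)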
I think I'm not seeing the complete view of your schema; I was expecting something like:
schema {
  query: Query
}
where Query is the root query type. In fact you didn't share your Query definition with us. Assuming you have the right Query definition, the main problem is in your response template.
> "items": $util.toJson($context.result.items)
This means that you are passing a collection named "items" to the GraphQL query engine, while you refer to that collection as "createdCurricula". To solve this, your response mapping template is the right place to fix it. How? Just replace the above line with the following:
"createdCurricula": $util.toJson($context.result.items),
The main thing to note here is that the mapping template is a bridge between your data sources and GraphQL; feel free to do any computation or name mapping, but don't forget that the object names in that response JSON must match the ones in your schema/query definition.
Change the result type to $util.toJson($ctx.result.data.posts).
The exception message says that it expected type LIST.
Looking at:
{
[{"id":"2","userId":"0b6af629-6009-4f4d-a52f-67aef7b42f43"},{"id":"1","userId":"0b6af629-6009-4f4d-a52f-67aef7b42f43"}]
}
I don't see that createdCurricula is a LIST.
What is currently in DDB is:
"id": "0b6af629-6009-4f4d-a52f-67aef7b42f43",
"createdCurricula": null

Elasticsearch _reindex fails

I am working on AWS Elasticsearch. It doesn't allow opening/closing an index, so settings changes cannot be applied to an existing index.
In order to change the settings of an index, I have to create a new index with the new settings and then move the data from the old index into the new one.
So first I created the new index with:
PUT new_index
{
  "settings": {
    "max_result_window": 3000000,
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "german_keywords": {
          "type": "keyword_marker",
          "keywords": ["whatever"]
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "my_german_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_stop",
            "german_keywords",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
}
It succeeded. Then I tried to move the data from the old index into the new one with this query:
POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}
It failed with
Request failed to get to the server (status code: 504)
I checked the indices with the _cat API, which gives:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open old_index AGj8WN_RRvOwrajKhDrbPw 5 1 2256482 767034 7.8gb 7.8gb
yellow open new_index WnGZ3GsUSR-WLKggp7Brjg 5 1 52000 0 110.2mb 110.2mb
Seemingly some data has been loaded into the new index; I'm just wondering why the _reindex doesn't complete.
You can check the status of the reindex with the tasks API:
GET _tasks?detailed=true&actions=*reindex
There is a "status" object in response which has field "total":
total is the total number of operations that the reindex expects to perform. You can estimate the progress by adding the updated, created, and deleted fields. The request will finish when their sum is equal to the total field.
Link to ES Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
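As an aside, the 504 is the HTTP client timing out, not necessarily the reindex failing; for an index of several gigabytes it is common to launch the reindex asynchronously with the documented wait_for_completion parameter and poll the returned task instead:

POST _reindex?wait_for_completion=false
{
  "source": { "index": "old_index" },
  "dest": { "index": "new_index" }
}

This returns a task id, which can then be polled with GET _tasks/<task id> until the operation completes.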