Store list or collection in Apache Avro schema

I'm currently creating Avro schema to store twitter data streams.
My data source in JSON:
{
  'id': '123456789',
  'text': 'bla bla bla...',
  'entities': {
    'hashtags': [{'text': 'hashtag1'}, {'text': 'hashtag2'}]
  }
}
In Cassandra, I can define a collection (a set or list) to store the hashtags data.
But I have no idea how to define this structure in an Apache Avro schema.
Here's my best try:
{"namespace": "ln.twitter",
 "type": "record",
 "name": "main",
 "fields": [
   {"name": "id", "type": "string"},
   {"name": "text", "type": "string"},
   {"name": "hashtags", "type": "string"} // is there any better format for this?
 ]
}
Need your advice please.
Thanks,
Yusata.

The entities field needs an explicit record (or map) type inside. Here's a schema that should work:
{
  "type": "record",
  "name": "Main",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "text", "type": "string"},
    {
      "name": "entities",
      "type": {
        "type": "record",
        "name": "Entities",
        "fields": [
          {
            "name": "hashtags",
            "type": {
              "type": "array",
              "items": {
                "type": "record",
                "name": "Hashtag",
                "fields": [
                  {"name": "text", "type": "string"}
                ]
              }
            }
          }
        ]
      }
    }
  ]
}
In case it's helpful, you can use this tool to generate an (anonymous) Avro schema from any valid JSON record. You'll then just need to add names to the record types.
You can try it on your example after switching its ' to ":
{
  "id": "123456789",
  "text": "bla bla bla...",
  "entities": {"hashtags": [{"text": "hashtag1"}, {"text": "hashtag2"}]}
}
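As a sanity check, here is a plain-Python sketch (standard library only) of how the nested record and array types in the schema line up with the JSON data. It is only an illustration of the structure, not a real Avro implementation; actual (de)serialization would use a library such as fastavro or the official avro package.

```python
# Minimal structural check: walk a simplified Avro schema (records, arrays,
# and strings only) and verify that a Python dict matches it.

def matches(schema, value):
    """Return True if `value` structurally matches the simplified schema."""
    if isinstance(schema, dict):
        t = schema["type"]
        if t == "record":
            return isinstance(value, dict) and all(
                matches(f["type"], value.get(f["name"])) for f in schema["fields"]
            )
        if t == "array":
            return isinstance(value, list) and all(
                matches(schema["items"], v) for v in value
            )
        schema = t  # unwrap e.g. {"type": "string"}
    if schema == "string":
        return isinstance(value, str)
    return False  # other Avro types are omitted from this sketch

schema = {
    "type": "record",
    "name": "Main",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "text", "type": "string"},
        {"name": "entities", "type": {
            "type": "record",
            "name": "Entities",
            "fields": [{"name": "hashtags", "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Hashtag",
                    "fields": [{"name": "text", "type": "string"}],
                },
            }}],
        }},
    ],
}

record = {
    "id": "123456789",
    "text": "bla bla bla...",
    "entities": {"hashtags": [{"text": "hashtag1"}, {"text": "hashtag2"}]},
}

print(matches(schema, record))  # True
```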

Related

How to delete a column in BigQuery that is part of a nested column

I want to delete a column in a BigQuery table that is part of a record or nested column. I've found this command in their documentation. Unfortunately, this command is not available for nested columns inside existing RECORD fields.
Is there any workaround for this?
For example, given the schema below, I want to remove the address2 field inside the addresses field. So from this:
[
  {
    "name": "name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "addresses",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {"name": "address1", "type": "STRING", "mode": "NULLABLE"},
      {"name": "address2", "type": "STRING", "mode": "NULLABLE"},
      {"name": "country", "type": "STRING", "mode": "NULLABLE"}
    ]
  }
]
to this:
[
  {
    "name": "name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "addresses",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {"name": "address1", "type": "STRING", "mode": "NULLABLE"},
      {"name": "country", "type": "STRING", "mode": "NULLABLE"}
    ]
  }
]
Use the query below:
select * replace(
array(select as struct * except(address2) from t.addresses)
as addresses)
from `project.dataset.table` t
If you want to permanently remove that field, use CREATE OR REPLACE TABLE as in the example below:
create or replace table `project.dataset.new_table` as
select * replace(
array(select as struct * except(address2) from t.addresses)
as addresses)
from `project.dataset.table` t
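As a side note, here is a plain-Python sketch of the transformation that the REPLACE/EXCEPT combination performs on each row: rebuild the repeated addresses record without address2, leaving every other field untouched. The sample row is made up; only the field names follow the example schema.

```python
# Plain-Python illustration (not BigQuery) of SELECT * REPLACE(... EXCEPT(address2)).
def drop_address2(row):
    return {
        **row,  # keep all top-level fields as-is
        "addresses": [
            # rebuild each repeated record without the address2 key
            {k: v for k, v in addr.items() if k != "address2"}
            for addr in row["addresses"]
        ],
    }

row = {
    "name": "Alice",  # hypothetical sample data
    "addresses": [
        {"address1": "1 Main St", "address2": "Apt 4", "country": "US"},
    ],
}
print(drop_address2(row))
# {'name': 'Alice', 'addresses': [{'address1': '1 Main St', 'country': 'US'}]}
```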

How to index the same field in multiple ways, including wildcard, in ElasticSearch

I have the below mappings for a field ("name"):
"name": {
  "analyzer": "ngram_analyzer",
  "search_analyzer": "keyword_analyzer",
  "type": "text",
  "fields": {
    "raw": {"type": "keyword"}
  }
}
It works fine and allows searching both as text and as a keyword.
As per the ES documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html A string field could be mapped as a text field for full-text search and as a keyword field for sorting or aggregation.
But I am trying to extend this mapping to also support wildcard search.
I tried to modify the mapping (e.g., like below) but couldn't get it working.
"name": {
  "analyzer": "ngram_analyzer",
  "search_analyzer": "keyword_analyzer",
  "type": "text",
  "fields": [
    {"raw": {"type": "wildcard"}},
    {"type": "keyword"}
  ]
}
Also tried with:
"name": {
  "analyzer": "ngram_analyzer",
  "search_analyzer": "keyword_analyzer",
  "type": "text",
  "fields": [
    {"raw": {"type": "wildcard"}},
    {"raw": {"type": "keyword"}}
  ]
}
What should the mapping look like to allow name to be searched as text, keyword, and wildcard?
You can use multi-fields to index the name field in multiple ways. The modified index mapping would be:
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "keyword_analyzer",
        "fields": {
          "raw": {"type": "wildcard"},
          "keyword": {"type": "keyword"}
        }
      }
    }
  }
}
Now you can use name for text-based search, name.raw for wildcard search, and name.keyword for keyword search.
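For illustration, the three search styles would then look roughly like this (the index name and search terms are made up):

```
# Full-text search against the analyzed text field
POST /my_index/_search
{"query": {"match": {"name": "john"}}}

# Wildcard search against the wildcard sub-field
POST /my_index/_search
{"query": {"wildcard": {"name.raw": {"value": "jo*n"}}}}

# Exact match against the keyword sub-field
POST /my_index/_search
{"query": {"term": {"name.keyword": "john"}}}
```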

Importing User Events in Google Cloud's Recommendation AI

I'm working on Google Cloud's Recommendation AI. For importing user events, it supports BigQuery, but I can't find the required table schema. As it's in beta, community support and documentation are limited. Can anyone please guide me in creating the table schema for importing user events from BigQuery?
As you can see here, to import data from BigQuery you should use the Recommendations AI Schema. The schema is a JSON schema like below:
[
  {
    "name": "product_metadata",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "images",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {"name": "uri", "type": "STRING", "mode": "REQUIRED"},
          {"name": "height", "type": "STRING", "mode": "NULLABLE"},
          {"name": "width", "type": "STRING", "mode": "NULLABLE"}
        ]
      },
      {
        "name": "exact_price",
        "type": "RECORD",
        "mode": "REQUIRED",
        "fields": [
          {"name": "original_price", "type": "FLOAT", "mode": "NULLABLE"},
          {"name": "display_price", "type": "FLOAT", "mode": "NULLABLE"}
        ]
      },
      {"name": "canonical_product_uri", "type": "STRING", "mode": "NULLABLE"},
      {"name": "currency_code", "type": "STRING", "mode": "NULLABLE"}
    ]
  },
  {"name": "language_code", "type": "STRING", "mode": "NULLABLE"},
  {"name": "description", "type": "STRING", "mode": "NULLABLE"},
  {"name": "title", "type": "STRING", "mode": "REQUIRED"},
  {"name": "tags", "type": "STRING", "mode": "REPEATED"},
  {
    "name": "category_hierarchies",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {"name": "categories", "type": "STRING", "mode": "REPEATED"}
    ]
  },
  {"name": "id", "type": "STRING", "mode": "REQUIRED"}
]
It's also worth mentioning that you can import data from Google Analytics as well. The reference for the Google Analytics schema can be found here.

ElasticSearch (AWS): How to use another index as a query/match parameter?

Basically I am trying to implement this strategy.
Sample Data:
PUT /newsfeed_consumer_network/consumer_network/urn:viadeo:member:me
{
  "producerIds": [
    "urn:viadeo:member:ned",
    "urn:viadeo:member:john",
    "urn:viadeo:member:mary"
  ]
}
PUT /newsfeed/news/urn:viadeo:news:33
{
  "producerId": "urn:viadeo:member:john",
  "published": "2014-12-17T12:45:00.000Z",
  "actor": {
    "id": "urn:viadeo:member:john",
    "objectType": "member",
    "displayName": "John"
  },
  "verb": "add",
  "object": {
    "id": "urn:viadeo:position:10",
    "objectType": "position",
    "displayName": "Software Engineer # Viadeo"
  },
  "target": {
    "id": "urn:viadeo:profile:john",
    "objectType": "profile",
    "displayName": "John's profile"
  }
}
Sample Query:
POST /newsfeed/news/_search
{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "actor.id": {
            "producerId": {
              "index": "newsfeed_consumer_network",
              "type": "consumer_network",
              "id": "urn:viadeo:network:me",
              "path": "producerIds"
            }
          }
        }
      }]
    }
  }
}
I am getting the following error:
"type": "query_parsing_exception",
"reason": "[match] query does not support [index]"
How can I use an index to support a matching query? Is there any way to implement this?
Basically I just want to use another document as the source of the matching parameter for my query. Is this even possible with ElasticSearch?
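For what it's worth, the index/type/id/path parameters in the failing query match the shape of Elasticsearch's terms lookup, which is supported by the terms query rather than match. A sketch of that form, reusing the names from the sample data (whether it fits the intended strategy here is for the reader to verify):

```
POST /newsfeed/news/_search
{
  "query": {
    "terms": {
      "producerId": {
        "index": "newsfeed_consumer_network",
        "type": "consumer_network",
        "id": "urn:viadeo:member:me",
        "path": "producerIds"
      }
    }
  }
}
```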

How to post a model together with an embedded model using Loopback.js?

Here is the form model schema:
{
  "startTime": "",
  "stopTime": "",
  "id": "objectid",
  "formQuestions": [
    {
      "type": 0,
      "label": "",
      "content": [""],
      "id": "objectid"
    }
  ]
}
The relations section of form.json:
"relations": {
  "questions": {
    "type": "embedsMany",
    "model": "FormQuestion",
    "option": {
      "validate": true,
      "autoId": true
    }
  }
}
If I post to http://localhost:3000/api/Forms as follows,
{
  "startTime": "",
  "stopTime": "",
  "formQuestions": [
    {
      "type": 0,
      "label": "a label",
      "content": ["the content"]
    }
  ]
}
it returns:
{
  "startTime": "",
  "stopTime": "",
  "id": "54ccf7ae6159f1bc0bc6b430",
  "formQuestions": []
}
But I want the embedded formQuestion models to be inserted into the database as well. How can I do that? I would be grateful for any help.
Nobody answered, but I found the way myself.
There seems to be a bug in automatic id generation for embedded documents: setting autoId to true has no effect. If I post with a manual id, it succeeds.
Just like this:
{
  "startTime": "",
  "stopTime": "",
  "formQuestions": [
    {
      "id": 1,
      "type": 0,
      "label": "a label",
      "content": ["the content"]
    }
  ]
}