Cloudant Query index over MapReduce view

I've been using Cloudant Query to find documents and found it works really well.
The documentation says "Cloudant Query indexes can also be built using MapReduce Views". I cannot see where the index definition syntax allows specifying a view.
Is there an example of an index or query that uses a view?

The documentation does indeed say that "Cloudant Query indexes can also be built using MapReduce Views" but it is referring to the technology which underpins the Cloudant Query service.
Cloudant Query indexes can take two forms:
type: "text" is an index built on Apache Lucene which is suitable for fielded searches and full-text queries
type: "json" is an index built on MapReduce where materialized views are created to be able to answer the queries you supply
The sentence you refer to in the documentation is intended to convey that Cloudant Query indexes can be specified as type: "json", which results in MapReduce views being built behind the scenes.
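For completeness, here is a minimal sketch of creating a type: "json" index through the _index endpoint; the account, database, credentials, and field names are placeholders, and the requests library is just one way to issue the call:

import requests

# Create a Cloudant Query index of type "json"; behind the scenes this
# is materialized as a MapReduce view. All names below are placeholders.
resp = requests.post(
    "https://ACCOUNT.cloudant.com/DBNAME/_index",
    auth=("USERNAME", "PASSWORD"),
    json={
        "index": {"fields": ["name", "age"]},  # fields to index
        "name": "name-age-index",
        "type": "json",
    },
)
print(resp.json())  # e.g. {"result": "created", ...}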

Related

Is there a way to query multiple partition keys in a DynamoDB table using the AWS dashboard?

I would like to know if there's an option to query with multiple partition keys from a DynamoDB table in the AWS dashboard. I was unable to find any article or similar requests for the dashboard on the web. I will keep you posted if I find an answer.
Thanks in advance.
The Console doesn't support this directly, because there is no support in the underlying API. What you're looking for is the equivalent of the following SQL query:
select *
from table
where PK in ('value_1', 'value_2') /*equivalent to: PK = 'value_1' or PK = 'value_2' */
The console supports the Query and Scan operations. Query always operates on a single item collection, i.e. all items that share the same partition key, which means it can't be used for your use case.
Scan, on the other hand, is a full table scan, which allows you to optionally filter the results. The console's filter language has no support for this kind of OR logic, so that won't really help you. It will, however, allow you to view all items, which includes the ones you're looking for, but as I said, a direct multi-key query isn't really possible.
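If you can step outside the console, a common workaround is to issue one Query per partition key value and merge the results client-side. A minimal sketch with boto3; the table name, key name, and values are placeholders:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my_table")  # placeholder name

# One Query per partition key value, merged client-side,
# since the API has no equivalent of SQL's IN on the partition key.
items = []
for pk_value in ("value_1", "value_2"):
    resp = table.query(KeyConditionExpression=Key("PK").eq(pk_value))
    items.extend(resp["Items"])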

How to create composite indexes in Datastore to filter with multiple attributes in an entity

We are using Google Datastore for our dashboard. In the dashboard we provide a filtering option for end users.
Let's say we have a Datastore kind whose structure is:
{
    'attribute1': 'val1',
    'attribute2': 'val2',
    'attribute3': 'val3',
    'timestamp': 123456789,
}
There is a use case where we need to filter data within a specific time period, in combination with various other attributes.
What is the best way to create composite index to achieve this capability in Datastore?
Any thoughts?
Cloud Datastore provides two types of indexes
Built-in indexes -
These are the indexes that Cloud Datastore automatically creates for each property of each entity kind. These built-in indexes are single property indexes and are suitable for simple queries.
Composite indexes -
Composite indexes are manual indexes built by the user rather than by Cloud Datastore automatically. These are multi-property indexes. Composite indexes are needed for complex queries that filter the data using two or more properties. To build a composite index, you configure an index.yaml file and then create the index by running the following command -
gcloud datastore indexes create path/to/index.yaml
Now coming to your use case -
As you want to filter the data based on more than two properties, you cannot use built-in indexes, so you have to use a composite (manual) index. For that you need to define an index configuration file and deploy it as mentioned above. You want to filter the data based on a specific time period in combination with one or more attributes, so for that use case the following index configuration file should work well.
index.yaml
indexes:
- kind: demo
  properties:
  - name: attribute1
    direction: asc
  - name: attribute2
    direction: asc
  - name: attribute3
    direction: asc
  - name: timestamp
    direction: asc
In the above configuration file the direction property is optional; if you don't specify it, it defaults to ascending (asc). For a descending sort order, specify desc. This will work for filtering the data by any combination of two or more of the properties mentioned in the configuration file. You may go through this page to know more about indexing in Cloud Datastore.
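As an illustration, a query that this composite index would serve might look like the following sketch using the google-cloud-datastore Python client; the values are placeholders:

from google.cloud import datastore

client = datastore.Client()

# Filter by two attributes plus a time-period bound; this combination
# is what the composite index above makes possible.
query = client.query(kind="demo")
query.add_filter("attribute1", "=", "val1")
query.add_filter("attribute2", "=", "val2")
query.add_filter("timestamp", ">", 123456789)
results = list(query.fetch())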
Please note that the index-based query mechanism supports a wide range of queries and is suitable for most applications. Still, there are some restrictions or limitations on Datastore queries. I would suggest going through this page to know more about the restrictions when querying Cloud Datastore.

Elasticsearch vs DynamoDB for filtering

I am building a service which will have millions of rows of data in it. We want good search on it, e.g. searching by some field values. The structure of a row will be as follows:
{
    "field1": "value1",
    "field2": "value2",
    "field3": {
        "field4": "value4",
        "field5": "value5"
    }
}
Also, the structure of field3 can vary, with field4 sometimes present and sometimes not.
We want filters on the following fields: field1, field2 and field4. We can create indexes in DynamoDB to do that, but I am not sure if we can create an index on field4 in DynamoDB easily without flattening the JSON.
Now, my question is: should we use the Elasticsearch data store for this, which, as far as I know, will create indexes on every field in the document so that one can search on every field? Is that right? Or should we use DynamoDB, or a completely different data store?
Please provide some suggestions.
If search is a key requirement for your application, then use a search product, not a database. DynamoDB is great for a lot of things, but ad hoc search is not one of them: you are going to end up running lots of very expensive (slow) scans if you go with DynamoDB; this is what ES was built for.
I have decent working experience with DynamoDB and extensive working experience with Elasticsearch (ES).
Let's first understand the key difference between these two:
DynamoDB: "Amazon DynamoDB is a key-value and document database"
while Elasticsearch: "Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured data."
Now, coming to the question, let's discuss how these systems work internally and how that affects performance.
DynamoDB is great for fetching documents by key but not great for filtering and searching. Just as in a relational database you create an index on columns to improve the performance of those operations, you have to create an index in DynamoDB, since it is a database, not a search engine. Creating an index on fields on the fly is painful, and it is not cached in DynamoDB.
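For example, making a non-key attribute queryable in DynamoDB typically means adding a global secondary index, which is an explicit operation. A hedged sketch with boto3; the table, attribute, and index names are placeholders:

import boto3

client = boto3.client("dynamodb")

# Add a global secondary index so "field1" becomes queryable.
client.update_table(
    TableName="my_table",
    AttributeDefinitions=[{"AttributeName": "field1", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "field1-index",
            "KeySchema": [{"AttributeName": "field1", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
            # Add ProvisionedThroughput here if the table uses
            # provisioned capacity rather than on-demand billing.
        }
    }],
)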
Elasticsearch stores data differently: it builds an inverted index for all indexed fields (the default, as mentioned by the OP), and filtering on these fields is very fast if you use the filter context, which is exactly this use case. More information with examples is in the official ES docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context. Also, because these filters are not used for score calculation and are cached by Elasticsearch, their performance (both read and write) is much better compared to DynamoDB, and you can benchmark that as well.
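A minimal sketch of such a filter-context query with the official Python client, using the field names from the question; the index name and values are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# bool/filter runs in the filter context: no relevance scoring,
# and the filters are cacheable.
resp = es.search(
    index="my-index",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"field1": "value1"}},
                    {"term": {"field3.field4": "value4"}},
                ]
            }
        }
    },
)
print(resp["hits"]["total"])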

Django ElasticSearch DSL DRF aggregations

What is the correct approach to adding aggregations to Django Elasticsearch DSL DRF? The codebase contains some empty filter backends:
https://github.com/barseghyanartur/django-elasticsearch-dsl-drf/tree/a2be5842e36a102ad66988f5c74dec984c31c89b/src/django_elasticsearch_dsl_drf/filter_backends/aggregations
Should I create a custom backend or is there a way to add them directly to my viewset?
In particular, I want to calculate the sum of an IntegerField across all results in a particular facet.
Elasticsearch has more than one type of aggregation. Simple aggregations in django-elasticsearch-dsl-drf are implemented in the FacetedSearchFilterBackend. Read the docs or simply run the example project to experiment.
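A hedged sketch of wiring that backend into a viewset, assuming a document and serializer along the lines of the project's examples; BookDocument, BookDocumentSerializer, and the field names are placeholders:

from django_elasticsearch_dsl_drf.filter_backends import FacetedSearchFilterBackend
from django_elasticsearch_dsl_drf.viewsets import DocumentViewSet

from .documents import BookDocument              # placeholder document
from .serializers import BookDocumentSerializer  # placeholder serializer

class BookDocumentViewSet(DocumentViewSet):
    document = BookDocument
    serializer_class = BookDocumentSerializer
    filter_backends = [FacetedSearchFilterBackend]

    # Facets are requested with ?facet=publisher in the query string;
    # enabled=True computes the facet even when it isn't requested.
    faceted_search_fields = {
        "publisher": {
            "field": "publisher.raw",
            "enabled": True,
        },
    }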

Django - fulltext search with PostgreSQL and Elasticsearch

I have a Django and Django REST Framework powered RESTful API (talking to a PostgreSQL DB backend) which supports filtering on a specific model.
Now I want to add a fulltext search functionality.
Is it possible to use Elasticsearch for full-text search and then apply my existing API filters on top of these search results?
I would suggest considering using only PostgreSQL to do what you asked for.
In my opinion it is the best solution, because you will have the data and the search indexes directly inside PostgreSQL, and you will not be forced to install and maintain additional software (such as Elasticsearch) or to keep the data and indexes in sync.
This is the simplest code example you can have to perform a full-text search in Django with PostgreSQL:
Entry.objects.filter(body_text__search='Cheese')
For all the basic documentation on using the full-text search in Django with PostgreSQL you can use the official documentation: "Full text search"
If you want to deepen further you can read an article that I wrote on the subject:
"Full-Text Search in Django with PostgreSQL"
Your question is too broad to be answered with code, but it's definitely possible.
You can easily search your elasticsearch for rows matching your full-text criteria.
Then get those rows' PK fields (or any other candidate key used to uniquely identify rows in your PostgreSQL DB), and filter your Django ORM-backed models for PKs matching those you found in Elasticsearch.
Pseudocode would be:
def get_chunk(l, length):
    # Yield successive `length`-sized slices of l.
    for i in range(0, len(l), length):
        yield l[i:i + length]

res = es.search(index="index", body={"query": {"match": ...}})

pks = []
for hit in res['hits']['hits']:
    pks.append(hit['_source']['pk'])

for chunk_10k in get_chunk(pks, 10000):
    DjangoModel.objects.filter(pk__in=chunk_10k, **the_rest_of_your_api_filters)
EDIT
To handle a case in which lots and lots of PKs are found by your Elasticsearch query, you can define a generator that yields successive 10K chunks of the results, so you won't exceed your DB query limits and query performance stays reasonable. I've defined it above with a function called get_chunk.
Something like that would also work for alternatives like Redis, MongoDB, etc.