Increase CloudSearch _score by specific number - amazon-web-services

I am new to AWS CloudSearch. There is a relevance score (_score) that is computed automatically based on the occurrences of the search terms.
My question is: can I increase the relevance score (_score) by a specific amount based on a specific key value?
Example:
Lets say cloudsearch returns following two documents
fields: [
  {
    fullname: "Daniel Wildt",
    active: "T",
    _score: "82"
  },
  {
    fullname: "Robert",
    active: "F",
    _score: "84"
  }
]
I want the first document (Daniel Wildt) to rank higher. In other words, when active = T, AWS should add something to the score.

Unfortunately you can't use a custom rank directly, because that's only available for sort-enabled numeric fields (int, double, date).
Here are a couple of alternative options:
Sorting: if you plan to give a lot of weight to the active field, it will become dominant enough to be functionally equivalent to the sort operator. That is, you can just add sort=active desc to your query to get the T results before the F results.
Convert to int: map T and F to numeric values before submitting your documents for indexing, e.g. T=1 and F=0, then use the field in a custom rank expression to affect the ordering of results: &expr.myrank=_score+active&sort=myrank (sketched below).
Field weight: add active:'T' to your query, which would potentially exclude results where active=F, and then use field weights to adjust the impact of this portion of the query: q.options={fields:['active^0.5']}. This will require some tuning.
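For the convert-to-int option, here is a minimal sketch of what the query side might look like with boto3, assuming active has been re-indexed as an int field; the endpoint URL, field names and the size of the boost are placeholders, not values from your domain:
import boto3

# The endpoint is domain-specific; this one is a placeholder.
client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-yourdomain-xxxxx.us-east-1.cloudsearch.amazonaws.com",
)

response = client.search(
    query="daniel",
    queryParser="simple",
    # 'active' is assumed to be an int field: 1 where it used to be "T", 0 for "F".
    # The expression adds a fixed boost to the text relevance score.
    expr='{"myrank": "_score + 10*active"}',
    sort="myrank desc",
    returnFields="fullname,active",
)
for hit in response["hits"]["hit"]:
    print(hit["fields"])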

DynamoDB QuerySpec {MaxResultSize + filter expression}

From the DynamoDB documentation
The Query operation allows you to limit the number of items that it
returns in the result. To do this, set the Limit parameter to the
maximum number of items that you want.
For example, suppose you Query a table, with a Limit value of 6, and
without a filter expression. The Query result will contain the first
six items from the table that match the key condition expression from
the request.
Now suppose you add a filter expression to the Query. In this case,
DynamoDB will apply the filter expression to the six items that were
returned, discarding those that do not match. The final Query result
will contain 6 items or fewer, depending on the number of items that
were filtered.
Looks like the following query should return (at least sometimes) 0 records.
In summary, I have a UserLogins table. A simplified version is:
1. UserId - HashKey
2. DeviceId - RangeKey
3. ActiveLogin - Boolean
4. TimeToLive - ...
Now, let's say UserId = X has 10,000 inactive logins in different DeviceIds and 1 active login.
However, when I run this query against my DynamoDB table:
QuerySpec{
hashKey: null,
rangeKeyCondition: null,
queryFilters: null,
nameMap: {"#0" -> "UserId"}, {"#1" -> "ActiveLogin"}
valueMap: {":0" -> "X"}, {":1" -> "true"}
exclusiveStartKey: null,
maxPageSize: null,
maxResultSize: 10,
req: {TableName: UserLogins,ConsistentRead: true,ReturnConsumedCapacity: TOTAL,FilterExpression: #1 = :1,KeyConditionExpression: #0 = :0,ExpressionAttributeNames: {#0=UserId, #1=ActiveLogin},ExpressionAttributeValues: {:0={S: X,}, :1={BOOL: true}}}
I always get 1 row. The 1 active login for UserId=X. And it's not happening just for 1 user, it's happening for multiple users in a similar situation.
Are my results contradicting the DynamoDB documentation?
It looks like a contradiction because, if maxResultSize=10, DynamoDB should only read the first 10 items (out of 10,001) and then apply the filter active=true to those (which might return 0 results). It seems very unlikely that the record with active=true happened to be in the first 10 records that DynamoDB read.
This is happening for hundreds of customers that are running similar queries. It works great, even though according to the documentation it shouldn't.
I can't see any obvious problem with the Query. Are you sure about your premise that users have 10,000 items each?
Your keys are UserId and DeviceId. That seems to mean that if your user logs in with the same device, it would overwrite the existing item. Put another way, I think you are saying your users have 10,000 different devices each (unless the DeviceId rotates in some way).
In your shoes I would just remove the filter expression and print the results to the log to see what you're getting in your 10 results. Then remove the limit too and see what results you get with that.
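To see the documented behaviour directly, here is a small boto3 sketch (not the Java DocumentAPI used in the question) that runs the same query and compares ScannedCount, the number of items read before filtering, against Count, the number left after the filter; with Limit=10, ScannedCount can never exceed 10, so Count can legitimately be 0 even when a matching item exists further along the key range:
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("UserLogins")

response = table.query(
    KeyConditionExpression=Key("UserId").eq("X"),
    FilterExpression=Attr("ActiveLogin").eq(True),
    Limit=10,
    ConsistentRead=True,
)

# ScannedCount: items read (capped by Limit) before the filter was applied.
# Count: items remaining after the filter; this can be 0.
print(response["ScannedCount"], response["Count"])
for item in response["Items"]:
    print(item)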

dynamodb - scan items where map contains a key

I have a table that contains a field (not a key field), called appsMap, and it looks like this:
appsMap = { "qa-app": "abc", "another-app": "xyz" }
I want to scan all rows whose appsMap contains the key "qa-app" (the value is not important, just the key). I tried something like this, but it doesn't work the way I need:
FilterExpression = '#appsMap.#app <> :v',
ExpressionAttributeNames = {
"#app": "qa-app",
"#appsMap": "appsMap"
},
ExpressionAttributeValues = {
":v": { "NULL": True }
},
ProjectionExpression = "deviceID"
What's the correct syntax?
Thanks.
There is a discussion on the subject here:
https://forums.aws.amazon.com/thread.jspa?threadID=164470
You might be missing this part from the example:
ExpressionAttributeValues: {":name":{"S":"Jeff"}}
However, I just wanted to echo what was already said: scan is an expensive operation that goes through every item, making your database hard to scale.
Unlike with other databases, you have to do plenty of setup with DynamoDB in order to get it to perform at its best. Here is a suggestion:
1) Convert this into a root-level value; for example, add qaExist to the root, with possible values of 0|1 or true|false.
2) Create a secondary index for the newly created attribute.
3) Query the new index, specifying 0 as the search parameter.
This will make your system very fast and very scalable regardless of how many records you get in there later on.
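Assuming you have added such a flag attribute and built a global secondary index on it, the query might look roughly like this with boto3; the table name, attribute name qaExist and index name are all hypothetical:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Devices")  # hypothetical table name

response = table.query(
    IndexName="qaExist-index",                    # hypothetical GSI on the flag attribute
    KeyConditionExpression=Key("qaExist").eq(1),  # whichever value marks "qa-app present"
    ProjectionExpression="deviceID",
)
print(response["Items"])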
If I understand the question correctly, you can do the following:
FilterExpression = 'attribute_exists(#0.#1)',
ExpressionAttributeNames = {
"#0": "appsMap",
"#1": "qa-app"
},
ProjectionExpression = "deviceID"
Since you're being a bit vague about your expectations and what's happening ("I tried something like this but it doesn't work in the way I need"), I'd like to mention that a scan with a filter is very different from a query.
Filters are applied on the server, but only after the scan request is executed, meaning that it will still iterate over all the data in your table; instead of returning you each item, it applies the filter to each response, saving you some network bandwidth, but potentially returning empty results as you page through your entire table.
You could look into creating a GSI on the table if this is a query you expect to have to run often.
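Putting the attribute_exists expression together with the pagination caveat above, a hedged boto3 sketch of the full scan might look like this; the table name is a placeholder:
import boto3

table = boto3.resource("dynamodb").Table("Devices")  # placeholder table name

items, start_key = [], None
while True:
    kwargs = {
        "FilterExpression": "attribute_exists(#m.#k)",
        "ExpressionAttributeNames": {"#m": "appsMap", "#k": "qa-app"},
        "ProjectionExpression": "deviceID",
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    response = table.scan(**kwargs)
    items.extend(response["Items"])          # pages can be empty; keep paging
    start_key = response.get("LastEvaluatedKey")
    if not start_key:
        break

print(items)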

How to use MapReduce when extracting a group of document id's by some criteria from CouchDB

I'm in my first week of CouchDB experimentation and trying to stop thinking in SQL. I have a collection of documents (5000 event files) that all have some ID value that will be common to groups of documents. So there might be 10 that all have TheID: 'foobar'.
(In case someone asks - TheID is not an auto-increment value from a relational database - it is a unique id assigned by a partner company of ours. I cannot redesign my source data to identify itself some other way, I have to use this TheID field to recognise groups of documents.)
I want to query my list of documents:
{ _id: 'document1', Message: { TheID: 'foobar' } }
{ _id: 'document2', Message: { TheID: 'xyz' } }
{ _id: 'document3', Message: { TheID: 'xyz' } }
{ _id: 'document4', Message: { TheID: 'foobar' } }
{ _id: 'document5', Message: { TheID: 'wibble' } }
{ _id: 'document6', Message: { TheID: 'foobar' } }
I want the results:
'foobar': [ 'document1', 'document4', 'document6' ]
'xyz': [ 'document2', 'document3' ]
'wibble': [ 'document5' ]
The aim is to represent groups of documents on our UI grouped by TheID, so the user can see all documents for a specific TheID together, and select that TheID to drill into the data querying just by that TheID value. Yes, the string id of each document is useful - in our case, the _id value of each document is the source event identifier, so it is a unique and useful value that the user is going to want to see in the list on screen.
In SQL one might order by or group by the TheID field and iterate the result set appropriately. I doubt this thinking is any use at all with a CouchDB query.
I know that I can use a map function to extract the TheID value for each document, for example:
function (doc) {
emit(doc.Message.TheID, 1);
}
or perhaps
function (doc) {
emit(doc._id, doc.Message.TheID);
}
I'm not sure exactly what I should emit as the key and value. Even if this is useful, I'm getting the feeling that I should not use a reduce function to try to 'reduce' the large map output (1 result row per document in the database) to what I want (3 results each with a list of document id's).
http://guide.couchdb.org/draft/views.html says "A common mistake new CouchDB users make is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, and not, for instance, a JSON hash with a set of unique keys and the count of each."
I thought I might be able to use reduce to scan the results of the map and somehow collect all results that have a common TheID value into a single result object. What I see when reading the reduce documentation is that it will be given arrays of keys and values that contain fairly unpredictable collections, driven by the structure of the btree underlying the map results. It won't be given arrays guaranteed to contain all similar TheID values that I could scan for. This approach seems completely broken.
So, is a map/reduce pair the right thing to do here? Should I look at using a 'show' or 'list' instead? I'm intending to build a mustache based HTML template engine around the results, so 'list' seems the wrong way to go.
Thanks in advance for any guidance.
EDIT I have done some local dev and come up with what I think is a broken solution. Hopefully this will show you the direction I'm trying to go in. See a public cloud based CouchDB I created at https://neek.iriscouch.com/_utils/database.html?test/_design/test/_view/collectByTheID
This is public. If you would like to play, please copy it to a new view, don't pollute this one in case others come in and want to see the original.
map function:
function(doc) {
emit(doc.Message.TheID, doc._id);
}
reduce function:
function(keys, values, rereduce) {
if (!rereduce) {
return values;
} else {
var ret = [];
values.forEach(function (ar) {
ret.concat(ar);
});
return ret;
}
}
Results:
"foobar" ["document6", "document4", "document1"]
"wibble" ["document5"]
"xyz" ["document3", "document2"]
The reduce function first leaves the array of values alone, and on the second pass concatenates them together. However, when I run this on my large 5000+ document database, it comes up with some TheID values that have empty document id arrays. I believe this suffers from the problem I mentioned before: the arrays of values passed to reduce are built depending on the btree structure of the map they are extracted from, and are not guaranteed to contain a complete set of values for a given key.
Make use of the group_level feature:
Map:
emit([doc.Message.TheID, doc._id], null)
Reduce:
You must include a reduce function to use group_level; it can be empty, as below, or something else, e.g. _count
function(keys, values){
return null;
}
A query with group_level=1 would return:
/_design/d/_view/v?group_level=1
[
{key: ["foobar"], value: null},
{key: ["xyz"], value: null},
{key: ["wibble"], value: null}
]
You would use this query to populate the top level in your grouping UI. When the user expands a category, you would do another query with group_level 2 and start and end keys:
/_design/d/_view/v?group_level=2&startkey=["foobar"]&endkey=["foobar",{}]
[
{key: ["foobar", "document6"], value: null},
{key: ["foobar", "document4"], value: null},
{key: ["foobar", "document1"], value: null}
]
This doesn't produce the output exactly as you requested, but I think you'll find it flexible enough.
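As a rough illustration of the two calls from Python, something like the following would populate the top level and then drill into one TheID; the host, database and design document names are the generic placeholders from the example above, not the ones in your iriscouch instance:
import json
import requests

view = "http://localhost:5984/mydb/_design/d/_view/v"  # placeholder DB/design doc

# Top level: one row per TheID
top = requests.get(view, params={"group_level": 1}).json()["rows"]
the_ids = [row["key"][0] for row in top]

# Drill-down: all document ids for one TheID ("foobar" here)
detail = requests.get(view, params={
    "group_level": 2,
    "startkey": json.dumps(["foobar"]),
    "endkey": json.dumps(["foobar", {}]),
}).json()["rows"]
doc_ids = [row["key"][1] for row in detail]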

Extjs dynamic filter

I have created a filter that contains the unique values of a column:
{
header : 'Вопрос А',
dataIndex: 'answerA',
itemId: 'answerA',
width : 100,
filter: {
type: 'list',
options: this.getStore().collect('answerA'),
},
editor: 'textfield'
}
When the user inputs a new value in a cell of the column, this value must appear in the filter. How can I do this?
I have looked at Extjs Grid Filter - Dynamic ListFilter but it doesn't help me much.
You should look carefully at the StringFilter.js file, which is in ux/grid/filter. I did this myself by adding a combobox to each filter; on combo expand I update its store by parsing the corresponding column data. I also advise you to collect only unique values; for this task use the Ext.Array.unique function. Otherwise, if two cells contain the same data, both values will go to the store, which is not good, because either of them will produce the same filtration. I do not attach my code, because it is heavily dependent on my custom data types, but the idea is quite simple, I think.

Multiple access to static data in a django app

I'm building an application and I'm having trouble choosing the best way to access static data multiple times in a Django app. My experience in the field is close to zero, so I could use some help.
The app basically consists of dragging and dropping foods. When you drag a food to a given place (breakfast, for example), different values get updated: total breakfast calories, total day nutrients (micro/macro), total day calories, ... That's why I think the way I store and access the data is pretty important, performance-wise.
This is an excerpt of the json file I'm currently using:
foods.json
{
  "112": {
    "type": "Vegetables",
    "description": "Mushrooms",
    "nutrients": {
      "Niacin": {
        "unit": "mg",
        "group": "Vitamins",
        "value": 3.79
      },
      "Lysine": {
        "units": "g",
        "group": "Amino Acids",
        "value": 0.123
      }
      ... (+40 nutrients)
    },
    "amount": 1,
    "unit": "cup whole",
    "grams": 87.0
  }
}
I've thought about different options:
1) JSON (the one I'm currently using):
Every time I drag a food to a "droppable" place, I call a getJSON function to access that food's data and then update the corresponding values. This file currently has a size of 2 MB, but it will surely grow as I add more foods to it. I'm using this option because it was the quickest way to start building the app, but I don't think it's a good choice for the live app.
2) RDBMS with normalized fields:
I could create two models, Food and Nutrient, with each food having 40+ nutrients related by a FK. The problem I see with this is that every time a food's data is requested, the app will hit the DB many times to retrieve it.
3) RDBMS with picklefield:
This is the option I'm actually considering. I could create a Food model and put the nutrients in a picklefield.
4) Something with Redis/Django Cache system:
I'll dive more deeply into this option. I've read some things about them but I don't clearly know if there's some way to use them to solve the problem I have.
Thanks in advance,
Mariano.
This is a typical use case for a relational database, and a more or less normalized form is the proper way to go most of the time.
I wrote this data model up off the top of my head, based on your example:
CREATE TABLE unit(
unit_id integer PRIMARY KEY
,unit text NOT NULL
,metric_unit text NOT NULL
,atomic_amount numeric NOT NULL
);
CREATE TABLE food_type(
food_type_id integer PRIMARY KEY
,food_type text NOT NULL
);
CREATE TABLE nutrient_type(
nutrient_type_id integer PRIMARY KEY
,nutrient_type text NOT NULL
);
CREATE TABLE food(
food_id serial PRIMARY KEY
,food text NOT NULL
,food_type_id integer REFERENCES food_type(food_type_id) ON UPDATE CASCADE
,unit_id integer REFERENCES unit(unit_id) ON UPDATE CASCADE
,base_amount numeric NOT NULL DEFAULT 1
);
CREATE TABLE nutrient(
nutrient_id serial PRIMARY KEY
,nutrient text NOT NULL
,metric_unit text NOT NULL
,base_amount numeric NOT NULL
,calories integer NOT NULL DEFAULT 0
);
CREATE TABLE food_nutrient(
food_id integer references food (food_id) ON UPDATE CASCADE ON DELETE CASCADE
,nutrient_id integer references nutrient (nutrient_id) ON UPDATE CASCADE
,amount numeric NOT NULL DEFAULT 1
,CONSTRAINT food_nutrient_pkey PRIMARY KEY (food_id, nutrient_id)
);
CREATE TABLE meal(
meal_id serial PRIMARY KEY
,meal text NOT NULL
);
CREATE TABLE meal_food(
meal_id integer references meal(meal_id) ON UPDATE CASCADE ON DELETE CASCADE
,food_id integer references food (food_id) ON UPDATE CASCADE
,amount numeric NOT NULL DEFAULT 1
,CONSTRAINT meal_food_pkey PRIMARY KEY (meal_id, food_id)
);
This is definitely not how it should work:
every time a food data request is made, the app will hit the db a lot
of times to retrieve it.
You should calculate / aggregate all values you need in a view or function and hit the database only once per request, not many times.
Simple example to calculate the calories of a meal according to the above model:
SELECT sum(n.calories * fn.amount * f.base_amount * u.atomic_amount * mf.amount)
AS meal_calories
FROM meal_food mf
JOIN food f USING (food_id)
JOIN unit u USING (unit_id)
JOIN food_nutrient fn USING (food_id)
JOIN nutrient n USING (nutrient_id)
WHERE mf.meal_id = 7;
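Since the app is in Django, the same single-query aggregate could also be expressed through the ORM; a rough sketch, assuming hypothetical models MealFood, Food, FoodNutrient, Nutrient and Unit that mirror the tables above (the app name, field names and related names are all assumptions):
from django.db.models import F, FloatField, Sum

from myapp.models import MealFood  # hypothetical app and model

# One query: joins meal_food -> food -> unit and food -> food_nutrient -> nutrient
meal_calories = MealFood.objects.filter(meal_id=7).aggregate(
    meal_calories=Sum(
        F("food__foodnutrient__nutrient__calories")
        * F("food__foodnutrient__amount")
        * F("food__base_amount")
        * F("food__unit__atomic_amount")
        * F("amount"),
        output_field=FloatField(),
    )
)["meal_calories"]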
You can also use materialized views. For instance, store computed values per food in a table and update it automatically when the underlying data changes. Most likely, those data rarely change (but are still easily updated this way).
I think the flat-file version you are using comes in last place: every time it is requested, it is read from top to bottom, and given its size that won't scale. The cache system would provide the best performance, but the RDBMS would be the easiest to manage and extend, plus your queries will automatically be cached.