I am currently looking for a way to store my data and quickly look it up.
At the moment the best idea seems to be a hash map.
Reasons:
To identify the item I am looking for, I need to extract a certain set of features from the data, which can then be used as a key to look the item up.
The problem is that the features can contain noise, so the key <-> item match is not 100% exact.
In that case I need to find the key which matches the feature set best,
or a set of keys which match the feature set best.
But how do I do that?
The feature vector is 512 elements long.
Sorting would be a possibility, but how would I sort such a matrix?
And given these complications, is a hash map the correct choice, or is there something better?
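What I seem to be describing is effectively a nearest-neighbour lookup over the stored keys. A minimal brute-force sketch of that idea (NumPy, with placeholder data) would be something like:

```python
import numpy as np

# Placeholder data: one 512-dimensional feature vector per stored item.
keys = np.random.rand(1000, 512)
items = [f"item_{i}" for i in range(1000)]

def lookup(query_features):
    """Return the stored item whose key vector is closest (L2 distance) to the query."""
    distances = np.linalg.norm(keys - query_features, axis=1)
    best = int(np.argmin(distances))
    return items[best], float(distances[best])

item, dist = lookup(np.random.rand(512))
```

But that is a linear scan over all stored keys for every lookup, which is exactly the cost I was hoping a hash map would avoid.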
I'm using Elasticsearch to build search for an ecommerce site.
One index will have products stored in it; in the products index I'll store categories along with the products' other attributes. Categories can be multiple, but each attribute will have a single field value (e.g. color).
Let's say the user types in Black (color) Nike (brand) shoes (category).
I want to process this query so that I can extract the entities (brand, attribute, etc.) and write a request body search.
I have thought of the following options:
Applying a regex to the query first to extract those entities (but with this approach I'm not sure how fuzziness would work; the user may have a typo in any of the entities).
Using the OpenNLP plugin (but this one only works at index time; in the above scenario we want it on the query side).
Using NER from any good NLP framework (this is not time & cost effective because I'll have millions of products in the engine, and they get updated/added on a frequent basis).
What's the best way to solve the above issue?
Edit:
Found a couple of libraries which allow fuzzy text matching in regex. But there will be many entities to find, so what's the best way to optimise that?
Still not sure about OpenNLP.
NER won't work in this case because there is a fixed number of entities, so the prediction is not right when no entity is present in the query.
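For reference on the first point, one of the options is the Python regex package (a drop-in replacement for re), which supports approximate matching; a small sketch with made-up entity lists:

```python
import regex  # third-party "regex" package, not the standard-library "re"

brands = ["nike", "adidas", "puma"]  # in reality these come from the index
query = "black nkie shoes"          # note the typo in "nike"

for brand in brands:
    # {e<=1} allows up to one edit (insertion/deletion/substitution) in the match
    if regex.search(rf"(?:{brand}){{e<=1}}", query, regex.IGNORECASE):
        print("possible brand:", brand)
```

The open question is still how this scales when there are millions of attribute values to test per query.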
If you cannot achieve the desired results by tuning Elasticsearch's built-in scoring/boosting, most likely you'll need some kind of 'natural language query' (NLQ) processing:
Tokenize the free-form query. A regex can be used for splitting lexemes, but very often it is better to write a custom tokenizer for that.
Perform named-entity recognition to determine the possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name) etc. In fact you don't need OpenNLP for this, as it can be just an index (keyword -> field(s)), and you can try to use the Elasticsearch 'suggest' API for the purpose (see the sketch after this list).
(optional) Recognize special phrases or combinations like "released yesterday" or "price below $20".
Generate the possible combinations of matches and, with the help of a scoring function, determine the 'best' recognition result. The scoring function may be hardcoded (reflecting 'common sense' heuristics) or may be the result of a machine learning algorithm.
From the recognition result (match metadata), produce a formal query that returns the search results - this may be an Elasticsearch query with field hints, or even a SQL query.
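A minimal sketch of step 2's (keyword -> field(s)) index, assuming the attribute vocabularies (colors, brands, categories) can be exported from the product index; all names and values here are illustrative:

```python
# Illustrative reverse index from known attribute values to candidate fields.
field_values = {
    "color": {"black", "white", "red"},
    "brand": {"nike", "adidas", "puma"},
    "category": {"shoes", "shirts", "shorts"},
}

keyword_to_fields = {}
for field, values in field_values.items():
    for value in values:
        keyword_to_fields.setdefault(value, set()).add(field)

def tag_tokens(query):
    """Return (token, candidate fields) pairs for a free-form query."""
    return [(token, keyword_to_fields.get(token.lower(), {"product_name"}))
            for token in query.split()]

print(tag_tokens("Black Nike shoes"))
# [('Black', {'color'}), ('Nike', {'brand'}), ('shoes', {'category'})]
```

The same lookup can also be served by Elasticsearch itself (for example a completion or term suggester built over the attribute values), so the vocabulary does not have to be duplicated in application code.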
In general, efficient NLQ processing needs significant development effort - I don't recommend implementing it from scratch unless you have enough resources & time for this feature. As an alternative, you can try to find an existing NLQ solution and integrate it, but most likely it will be a commercial product (I don't know of any good free/open-source NLQ components that are really ready for production use).
I would approach this problem as NER tagging, considering you already have a corpus of tags. My approach for this problem would be as below:
Create an annotated dataset of queries with each word tagged with one of the tags, say {color, brand, category}.
Train an NER model (CRF/LSTM).
Regarding the concern that "this is not time & cost effective because I'll have millions of products in the engine, and they get updated/added on a frequent basis":
To handle this situation I suggest you don't use the words in the query themselves as features, but rather use attributes of the words as features. For example, create an indicator function f(x', y) for a word x with context x' (i.e. the word along with the surrounding words and their attributes) and tag y, which returns 1 or 0. A sample indicator function is below:
f('blue', y) = 1 if 'blue' is in the `color attribute` column of the DB, the word previous to 'blue' is in the `product attribute` column of the DB, and y is `color`; otherwise return 0.
Create a lot of these indicator functions, also known as feature maps.
These indicator functions are then used to train a model using CRFs or LSTMs. Finally, the Viterbi algorithm is used to find the best tagging sequence for your query. For CRFs you can use packages like CRFsuite or CRF++. Using these packages, all you have to do is create the indicator functions and the package will train a model for you. Once trained, you can use this model to predict the best tag sequence for your queries. CRFs are very fast.
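A rough sketch of what such lookup-based features look like using the sklearn-crfsuite wrapper around CRFsuite; the attribute sets and training queries below are illustrative placeholders, not real data:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

# Illustrative lookup tables; in practice these come from the product DB.
COLORS = {"black", "blue", "red"}
BRANDS = {"nike", "adidas", "puma"}

def word_features(tokens, i):
    """Indicator-style features built from DB lookups rather than word identity."""
    word = tokens[i].lower()
    prev = tokens[i - 1].lower() if i > 0 else ""
    return {
        "in_color_column": word in COLORS,
        "in_brand_column": word in BRANDS,
        "prev_in_brand_column": prev in BRANDS,
        "is_first_token": i == 0,
    }

def featurize(tokens):
    return [word_features(tokens, i) for i in range(len(tokens))]

# Tiny annotated dataset (placeholder); real training data would be tagged queries.
train_queries = [["black", "nike", "shoes"], ["red", "adidas", "shirt"]]
train_labels = [["color", "brand", "category"], ["color", "brand", "category"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([featurize(q) for q in train_queries], train_labels)

print(crf.predict([featurize(["blue", "puma", "shoes"])]))
```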
Training this way, without using vector representations of the words, lets your model generalise without the need for retraining. [Look at NER using CRFs.]
I've been racking my brain over the past several days trying to find a solution to a storage/access problem.
I currently have 300 unique items with 10 attributes each (all attributes are currently stored as strings, but some of them are numerical). I am trying to programmatically store them and be able to efficiently access attributes based on item ID. I have attempted storing them in string arrays, vectors, maps, and multimaps with no success.
Goal: to be able to access an item and one of its attributes quickly and efficiently by its unique identifier.
The closest I have been able to get to being successful is:
string item1[] = {"attrib1","attrib2","attrib3","attrib4","attrib5","attrib6","attrib7","attrib8","attrib9","attrib10"};
I was then able to access an element on demand by calling item1[0]; but this is VERY inefficient (particularly when trying to loop through 300 items) and very hard to work with.
Is there a better way to approach this?
If I understand your question correctly, it sounds like you should have some sort of class to hold the attributes, which you would put into a map that has the item ID as the key.
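A minimal sketch of that shape (shown in Python just to illustrate the structure; in C++ the equivalent would be a small struct or class stored in a std::map or std::unordered_map keyed by the item ID). The item and attribute names are made up:

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    color: str
    weight: float  # numerical attributes can keep a numerical type

# Map from unique item ID to its attributes: O(1) average lookup by ID.
items = {
    "item1": Item(name="widget", color="red", weight=1.5),
    "item2": Item(name="gadget", color="blue", weight=0.8),
}

print(items["item1"].color)  # look up an item by ID, then access one attribute
```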
Actually this question was raised by someone else here: https://stackoverflow.com/questions/13338799/does-couchdbs-group-true-prevent-rereduce.
But there is no convincing answer.
group=true is the conceptual equivalent of group_level=exact, so CouchDB runs a reduce per unique key in the map row set.
This is how it is explained in the docs.
It sounds like CouchDB collects all the values for the same key and reduces only once per distinct key.
But another article says:
If the query is on the reduce value of each key (group_by_key = true), then CouchDB tries to locate the boundary of each key. Since this range probably does not fit exactly along the B+Tree nodes, CouchDB needs to figure out the edge at both ends to locate the partially matched leaf B+Tree nodes and resend their map results (with that key) to the View Server. This reduce result will then be merged with the existing rereduce result to compute the final reduce result for this key.
It sounds like rereduce may happen when group=true.
In my project there are many documents, but after grouping there are at most 2 values with the same key for each distinct key.
Will rereduce happen in this case?
Yes. Rereduce is always a possibility.
If this is a problem, there is a rereduce parameter in the reduce function, which allows you to detect if this is happening.
http://docs.couchdb.org/en/latest/couchapp/ddocs.html#reduce-and-rereduce-functions
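For illustration, this is how a rereduce-aware count reduce handles that parameter (sketched here in Python; CouchDB would normally run the equivalent JavaScript function, which has the same (keys, values, rereduce) signature):

```python
def reduce_count(keys, values, rereduce):
    if not rereduce:
        # First pass: "values" are the raw values emitted by the map function.
        return len(values)
    # Rereduce pass: "values" are the results of earlier reduce calls,
    # so sum them instead of counting them again.
    return sum(values)
```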
I'm reading a book about coding style in Django, and one thing they discuss is db_index=True. Ever since I started using Django, I've never used this option because I'm not really sure what it does.
So my question is, when to consider adding indexes?
This is not really Django specific; it's more to do with databases. You add an index on a column when you want to speed up searches on that column.
Typically, only the primary key is indexed by the database. This means lookups using the primary key are optimized.
If you do a lot of lookups on a secondary column, consider adding an index to that column to speed things up.
Keep in mind, like most problems of scale, these only apply if you have a statistically large number of rows (10,000 is not large).
Additionally, every time you do an insert, the indexes need to be updated. So be careful about which columns you add indexes to.
As always, you can only optimize what you can measure - so use the EXPLAIN statement and your database logs (especially any slow query logs) to find out where indexes can be useful.
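A minimal sketch of what this looks like in a Django model (the model and field names are just examples):

```python
from django.db import models

class Profile(models.Model):
    # The primary key gets an index automatically.
    email = models.CharField(max_length=254, db_index=True)  # frequently filtered on
    nickname = models.CharField(max_length=64)                # rarely filtered on, no index
```

Queries like Profile.objects.filter(email="a@example.com") can then use the index instead of scanning the whole table.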
The above answer is correct, but in some cases the search is done on columns that only have a varchar datatype, like email. There you need to add an index.
The following is one way of doing that:
Index(name='covering_index', fields=['headline'], include=['pub_date'])
Reference: https://docs.djangoproject.com/en/3.2/ref/models/indexes/
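For context, that Index declaration goes in the model's Meta.indexes; covering indexes with include require Django 3.2+ and PostgreSQL. A sketch using the field names from the documentation example:

```python
from django.db import models
from django.db.models import Index

class Article(models.Model):
    headline = models.CharField(max_length=200)
    pub_date = models.DateTimeField()

    class Meta:
        indexes = [
            # Index on headline that also stores pub_date, so matching queries
            # can be answered from the index alone (PostgreSQL only).
            Index(name='covering_index', fields=['headline'], include=['pub_date']),
        ]
```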
I'm pretty new to backend web development, but I am building a webapp that will have a database of locations. As the database gets big, I realized that if I want people to be able to query the database and, for example, see points around them, it would take a long time to go through every element and check the distance from each one to any given point.
Does anyone know of a way to query the database quickly, checking only the relevant locations?
I'm using Python, Django, and SQL, if that makes any difference.
The way to quickly access data in a database is adding keys (indexes). But "regular" keys operate on a strict ordering of possible values, i.e. a 1D domain. Geographical locations require 2D access. There are basically two approaches you can use:
Use specific 2D functionality provided by your DBMS, e.g. Spatial Extensions for MySQL.
Use space-filling curves like the Z-order curve for your index, so that close points will correspond to a small number of small intervals in the index space.
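A minimal sketch of the second approach, assuming latitude/longitude are first quantized to unsigned integers (the bit width and quantization here are illustrative):

```python
def quantize(value, low, high, bits=16):
    """Map a coordinate in [low, high] to an integer in [0, 2**bits - 1]."""
    scale = (2 ** bits - 1) / (high - low)
    return int((value - low) * scale)

def morton_code(x, y, bits=16):
    """Compute a Z-order (Morton) code by interleaving the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z

# Store this single integer in an indexed column; nearby points mostly get
# nearby Z-values, so a bounding-box query becomes a few index range scans.
lat, lon = 48.2082, 16.3738
z = morton_code(quantize(lat, -90, 90), quantize(lon, -180, 180))
print(z)
```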
You might also want to look at related tags, e.g. geographical-information and nearest-neighbor, or at related questions like the one about K-Nearest neighbours. And you might want to give more details on the database system you're working with.