I'm working on a web service that needs to accept categories and perform a search using them. Categories can be combined, so a bitmask comes to mind.
Example:
Spring = 1, Summer = 2, Autumn = 4, Winter = 8
Possible options:
1. ?categories=5 - not very user friendly/pretty
2. ?categories=1,4 - needs special parsing
3. ?categories=1&categories=4 - well supported but a bit verbose for a lot of categories
4. ?categories=Spring,Autumn - seems the most user friendly
Is there any standard way or preferred way to model bitmask type data?
I'd suggest going for semantic clarity over compression: leverage native functions like JSON.stringify() and JSON.parse() and model the categories as an array, e.g.
categories = ['spring', 'summer', 'winter']
This is like option #4, but slightly different in that it uses JSON, which you can generate and parse unambiguously.
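For example, a minimal sketch of the parsing side in Python (the URL, parameter name, and values are purely illustrative; the JSON.stringify()/JSON.parse() mentioned above would be the client-side equivalent):

import json
from urllib.parse import parse_qs, urlparse

# Hypothetical request: ?categories=["spring","autumn"], URL-encoded by the client
url = 'https://example.com/search?categories=%5B%22spring%22%2C%22autumn%22%5D'
raw = parse_qs(urlparse(url).query)['categories'][0]  # -> '["spring","autumn"]'
categories = json.loads(raw)
print(categories)  # ['spring', 'autumn']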
I have multiple indices in Elasticsearch (and the corresponding documents in Django created using django-elasticsearch-dsl). All of the indices have these settings:
settings = {'number_of_shards': 1,
'number_of_replicas': 0}
Now, I am trying to perform a search across all 10 indices. In order to retrieve consistent scoring between the results from different indices, I am using dfs_query_then_fetch:
search = Search(index=['mov*'])
search = search.params(search_type='dfs_query_then_fetch')
objects = search.query("multi_match", query='Tom & Jerry', fields=['title', 'actors'])
I get bad results due to inconsistent scoring. A book called 'A story of Jerry and his friend Tom' from one index can be ranked higher than the cartoon 'Tom & Jerry' from another index. The reason is that dfs_query_then_fetch is not working. When I remove it or substitute with the simple query_then_fetch, I get absolutely the same results with the identical scoring.
I have tested it on URI requests as well, and I always get the same scores for both search types.
What can be the reason for it?
UPDATE: The results are actually not the same, but they are only very slightly different, e.g. a score of 50.1 with dfs and 50.0 without it, while the same model within a single index has a score of 80.0.
If the number of shards is 1, then dfs_query_then_fetch and query_then_fetch will return the same results. A DFS query first queries all shards to gather term statistics and then scores the results based on them, but in this case each index has only one shard.
Regarding the scoring, you might want to have a look at your actors field too. Also, let us know which analyzer and tokenizer you are using, if they are custom ones.
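If you want to dig into why the scores barely move, a hedged sketch like this could compare both search types and attach Elasticsearch's per-hit scoring explanation (it assumes elasticsearch-dsl with a configured default connection; the index pattern, query, and fields are taken from the question):

from elasticsearch_dsl import Search

def top_hits(search_type):
    s = Search(index=['mov*']).params(search_type=search_type)
    s = s.query("multi_match", query='Tom & Jerry', fields=['title', 'actors'])
    s = s.extra(explain=True)  # ask ES to explain how each score was computed
    return [(hit.meta.index, hit.meta.score) for hit in s[:10].execute()]

print(top_hits('query_then_fetch'))
print(top_hits('dfs_query_then_fetch'))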
I am building a recommender system using sagemaker's built-in factorisation machine model.
My desired result is to have a rating matrix where I can look up a predicted score by a user id and an item id.
I understand that there is a predict API provided by the model:
result = fm_predictor.predict(X_test[1000:1010].toarray())
But I am not sure how I can use it to achieve the desired purpose. If I want to know, say, if user#123 is interested in movie#456, how can I use the above API?
Reference:
https://medium.com/#julsimon/building-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker-cedbfc8c93d8
https://www.slideshare.net/AmazonWebServices/building-a-recommender-system-on-aws-aws-summit-sydney-2018 (p.41,43)
Updated:
I think I understand how to use the API now: you have to build another one-hot encoded input, for example:
from scipy.sparse import lil_matrix

X_new = lil_matrix((1, nbFeatures)).astype('float32')
X_new[0, 935] = 1    # one-hot index of the user (first block of features)
X_new[0, 1600] = 1   # one-hot index of the movie (second block of features)
prediction2 = X_new[0].toarray()
result2 = fm_predictor.predict(prediction2)
print(result2)
But it seems it would be quite inefficient to fill out the recommendation matrix this way. What would be the best practice?
I can think of two scenarios:
1) If you need very low latency, you can indeed fill up the matrix, i.e. compute all recommendations for all users, and store them in a key/value backend queried by your app. You can definitely predict multiple users at a time, using the one-hot encoding technique above (see the sketch below).
2) Predict on demand by invoking the endpoint directly from the app. This is much simpler, at the cost of a little latency.
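For scenario 1, here is a minimal sketch of scoring several user/item pairs in one call; nbFeatures and fm_predictor come from the question, while the pair indices, nb_users, and the assumption that the features are the user one-hot block followed by the item one-hot block are all illustrative:

from scipy.sparse import lil_matrix

nb_users = 943                     # assumption: size of the user one-hot block
pairs = [(123, 456), (123, 789)]   # hypothetical (user_index, item_index) pairs to score
X_batch = lil_matrix((len(pairs), nbFeatures)).astype('float32')
for i, (user, item) in enumerate(pairs):
    X_batch[i, user] = 1               # user one-hot position
    X_batch[i, nb_users + item] = 1    # item one-hot position, offset by the user block
result = fm_predictor.predict(X_batch.toarray())
print(result)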
Hope this helps.
I have a dataset with mostly integer values. I want to apply association rule mining on it. I have taken a look at the popular algorithms like Apriori, etc. but all of them work on data which have boolean values, i.e., either the item exists in the transaction or doesn't.
Is there an algorithm which lets us account for values of the attributes in addition to their counts? (I plan to normalize the data to have values between 0 and 1)
You can "hack" around this limitation if your nubers are integer (why normalize to 0 1?) and small:
apple banana apple
becomes
apple banana apple_2
which would allow finding association rules like
banana => apple, apple_2
but you need to mix in some clever filters to avoid useless rules like
apple_2 => apple
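A minimal sketch of that encoding, assuming each transaction is a dict of item -> integer count:

def encode_counts(transaction):
    # Expand integer counts into "level" items: apple with count 2 -> apple, apple_2
    items = []
    for item, count in transaction.items():
        items.append(item)
        for level in range(2, count + 1):
            items.append('{}_{}'.format(item, level))
    return items

print(encode_counts({'apple': 2, 'banana': 1}))  # ['apple', 'apple_2', 'banana']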
Item-item collaborative filtering is quite similar to similarity-based data mining techniques like association rule mining. Moreover, collaborative filtering was built to handle continuous and ordinal values, such as star ratings or a Likert scale: this is usually preference information from users.
Content-based filtering is probably your best bet for the situation you describe. It allows for item attributes and weights (that do not change per user for that item), then takes in user preference for each item (that does change per user for that item).
If you want both preference (counts) and attributes to change for each user-item pair, I don't know of an algorithm that handles that. Usually algorithms are built for one input per user-item pair.
Yes. There are some variations of the itemset mining problem that will let you specify additional information. For example, high utility itemset mining algorithms let you specify a quantity for each item occurring in a transaction, as well as a weight for each item.
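To make that concrete, the utility of an itemset in a transaction is typically the sum of quantity times unit weight over its items; a tiny illustrative sketch (generic Python, not any specific tool's input format):

quantities = {'apple': 2, 'banana': 1, 'milk': 3}      # per-transaction item counts
weights = {'apple': 0.5, 'banana': 0.2, 'milk': 1.0}   # per-item weights (e.g. unit profit)

def utility(itemset, quantities, weights):
    # Sum of count * weight over the items of the itemset present in the transaction
    return sum(quantities[i] * weights[i] for i in itemset if i in quantities)

print(utility({'apple', 'milk'}, quantities, weights))  # 2*0.5 + 3*1.0 = 4.0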
I have to work with government-provided data that is sometimes broken in strange ways. My code already contains snippets like:
for row in governmental_data:
# XXX Workaround for that one row among thousands
# that was mislabeled by a clerk and will not be fixed
# before form A-320-Tango-5 is completed and submitted
# on the first Sunday after a solstice.
if row is the_spawn_of_satan:
row = fix_row_A320(row)
# XXX end of workaround
process_row(row)
which before the error was just
for row in governmental_data:
process_row(row)
I cannot make a mirror of the data with the fixes applied, because the data is dynamic.
What can I do to manage these workarounds as they grow in number? Are there any best practices (besides "do not provide broken data to begin with")?
I suggest using the Decorator design pattern to handle this data conversion issue. The Wikipedia page has a coffee-making example. In the same way, I suggest making every data conversion a decorator that takes a row, performs some operations on it, and gives back a row. This design pattern is well established. The Intercepting Filter pattern is similar to this idea and is implemented in both Java (servlet filters) and .NET (ASP.NET MVC filters).
Your code would then look like the following:
listOfDataConversionFilters = [fix_row_A320, ...]  # one callable per workaround
for row in governmental_data:
    for data_filter in listOfDataConversionFilters:
        row = data_filter(row)  # each filter returns the (possibly fixed) row
    process_row(row)
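A variation on the same idea (a sketch with made-up field names), where each workaround registers itself via a decorator so the main loop never changes when new fixes are added; it assumes rows behave like dicts:

ROW_FIXES = []

def row_fix(func):
    # Collect every decorated workaround in registration order
    ROW_FIXES.append(func)
    return func

@row_fix
def fix_row_A320(row):
    # XXX Workaround for the mislabeled row from form A-320-Tango-5
    # ('form' and 'label' are hypothetical field names, for illustration only)
    if row.get('form') == 'A-320-Tango-5' and not row.get('label'):
        row = dict(row, label='corrected')
    return row

def clean(row):
    for fix in ROW_FIXES:
        row = fix(row)
    return row

for row in governmental_data:
    process_row(clean(row))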
I would like to use Apriori to carry out affinity analysis on transaction data. I have a table with a list of orders and their information. I mainly need to use the OrderID and ProductID attributes, which are in the following format:
OrderID ProductID
1 A
1 B
1 C
2 A
2 C
3 A
Weka requires you to create a nominal attribute for every product ID and to specify whether the item is present in the order using a TRUE or FALSE value, like this:
1, TRUE, TRUE, TRUE
2, TRUE, FALSE, TRUE
3, TRUE, FALSE, FALSE
My dataset contains about 10k records and about 3k different products. Can anyone suggest a way to create the dataset in this format? (Besides a manual, time-consuming way...)
How about writing a script to convert it?
Should be less than 10 lines in a good scripting language such as Python.
Or you may look into options of pivoting the relation as desired.
Either way, it is a straightforward programming task, so I don't see the actual question here.
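For example, a minimal conversion sketch (it assumes the pairs sit in a headerless CSV named orders.csv with OrderID,ProductID columns; the resulting basket.csv can then be loaded into Weka and saved as ARFF):

import csv
from collections import defaultdict

# Collect the set of products per order
orders = defaultdict(set)
with open('orders.csv') as f:
    for order_id, product_id in csv.reader(f):
        orders[order_id].add(product_id)

# Write one TRUE/FALSE column per product, one row per order
products = sorted({p for items in orders.values() for p in items})
with open('basket.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['OrderID'] + products)
    for order_id, items in sorted(orders.items()):
        writer.writerow([order_id] + ['TRUE' if p in items else 'FALSE' for p in products])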
You obviously need to convert your data. The easiest way: write a program in the programming language you are most familiar with that reads the file and then writes it out in the appropriate format. Since these are text files, it should not be too complicated.
By the way, if you want more algorithms for pattern mining and association rule mining than just Apriori in Weka, you could check my software SPMF ( http://www.philippe-fournier-viger.com/spmf/ ), which is also in Java, can read ARFF files too, and offers about 50 algorithms specialized in pattern mining (Apriori, FPGrowth, and many others).
Your data is formatted correctly as-is for the arules package in R (and its apriori function). You might consider checking it out, especially if you're not able to get into script coding.