how to use get_offline_features() in the mlrun.feature_store? - nuclio

I am trying to get a feature from an existing feature store.
In the documentation https://docs.mlrun.org/en/latest/api/mlrun.feature_store.html, it says you can either pass a feature vector uri or FeatureVector object to the mlrun.feature_store.get_offline_features().
What is the uri for a feature store?
Where can I find an example?

In MLRun, a Feature Set is a group of features that are ingested together. A Feature Vector is a selection of features from Feature Sets (a few columns here, a few columns there, etc). This is great for joining several data sources together using a common entity/key.
A full example of creating and querying a feature set from MLRun can be found below:
import mlrun.feature_store as fs
from mlrun import set_environment
import pandas as pd
# Set project - for retrieving features later
set_environment(project="my-project")
# Feature set to ingest
df = pd.DataFrame({
    "key": [0, 1, 2, 3],
    "value": ["A", "B", "C", "D"]
})
# Create feature set with desired name and entity/key
fset = fs.FeatureSet("my-feature-set", entities=[fs.Entity("key")])
# Ingest
fs.ingest(featureset=fset, source=df)
# Create feature vector (allows for joining multiple feature sets together)
features = ["my-feature-set.*"] # can also do ["my-feature-set.A", my-feature-set.B", ...]
vector = fs.FeatureVector("my-feature-vector", features)
# Retrieve offline features (vector object)
fs.get_offline_features(vector)
# Retrieve offline features (project + name)
fs.get_offline_features("my-project/my-feature-vector")
# Retrieve offline features as pandas dataframe
fs.get_offline_features("my-project/my-feature-vector").to_dataframe()
You can find more feature store examples in the documentation here: https://docs.mlrun.org/en/latest/feature-store/feature-store.html

Related

Spark DataFrame ArrayType or MapType for checking for value in column

I have a pyspark dataframe, and one column is a list of IDs. I want to, for example, get the count of rows which have a certain ID in it.
AFAIK the two column types relevant to me are ArrayType and MapType. I could use the map type because checking for membership inside a map/dict is more efficient than checking for membership in an array.
However, to use the map I would need to filter with a custom UDF rather than the built-in (Scala) function array_contains.
With a MapType I can do:
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
df = spark.createDataFrame([("a-key", {"345": True, "123": True})], ["key", "ids"])
def is_in_map(k, d):
    return k in d.keys()

def map_udf(key):
    return udf(lambda d: is_in_map(key, d), BooleanType())
c = df.filter(map_udf("123")(df.ids)).count()
Or with an ArrayType I can do:
from pyspark.sql.functions import array_contains
df = spark.createDataFrame([("a-key", ["345", "123"])], ["key", "ids"])
c = df.filter(array_contains(df.ids, "123")).count()
My first reaction is to use the MapType because checking for membership inside the map is (I assume) more efficient.
On the other hand, the built-in function array_contains executes Scala code, and I assume that whatever Scala-defined function I call is going to be more efficient than returning the column dict to a Python context and checking k in d.keys().
For checking membership in this (multi-value) column, is it best to use the MapType or ArrayType pyspark.sql.types?
Update
There is a column method pyspark.sql.Column.getItem, which means I can filter by membership without a Python UDF.
Maps are more performant. In Scala + Spark I used
df.where(df("ids").getItem("123") === true)
It uses the standard DataFrame API, and df("ids").getItem("123") returns a Column with the value from the map (or null), so it will work at Spark's native speed. PySpark developers say that PySpark has that API as well.
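In PySpark the same idea would look roughly like this (a sketch reusing the MapType DataFrame from the question; getItem returns the map value or null, so rows without the key are filtered out without a Python UDF):
df = spark.createDataFrame([("a-key", {"345": True, "123": True})], ["key", "ids"])
c = df.filter(df.ids.getItem("123") == True).count()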

How to speed up Gensim Word2vec model by filtering out some words?

Suppose I have filtered a list of words that I want to use the next time I load my word2vec model. How can I construct my own KeyedVectors object that contains only this filtered list of words?
I tried to make:
w2v_model_keyed = w2v_model.wv
w2v_model_keyed.drop(word)
for a given word, but I get the following error:
AttributeError: 'KeyedVectors' object has no attribute 'drop'
Thank you
The gensim KeyedVectors class doesn't support incremental expansion or modification (such as a .drop() method). You'll need to construct a new instance of just the right size/contents.
You should look at the gensim KeyedVectors source code, and especially the .load_word2vec_format() method, to learn how existing instances are created in gensim, and mimic that to create one of just the size/words you need.
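For illustration, a rough sketch of that approach for Gensim 4.x (the word list is a hypothetical filter; w2v_model_keyed is the KeyedVectors instance from the question):
from gensim.models import KeyedVectors

words_to_keep = ["salmon", "tuna", "cod"]  # hypothetical filtered word list

# Build an empty KeyedVectors with the same dimensionality, then copy over
# only the vectors for the words you want to keep.
smaller = KeyedVectors(vector_size=w2v_model_keyed.vector_size)
for word in words_to_keep:
    if word in w2v_model_keyed:
        smaller.add_vector(word, w2v_model_keyed[word])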
Starting from version 4.1.2, the Gensim KeyedVectors object supports the method .vectors_for_all, which takes a list of words and creates a new KeyedVectors object with the corresponding vectors for those words only.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("model.bin", binary=True)
words_to_keep = ["salmon", "tuna", "cod"]
smaller_model = model.vectors_for_all(words_to_keep)
smaller_model.save_word2vec_format("smaller_model.bin", binary=True)

Grouping values with regular expressions regex

I have a list of names in an Excel sheet (also available as CSV), and I made groups based on the origin of the names.
This is what the groups I made look like.
Now I want to add a new column with the group name behind each name.
This is what I want to obtain.
How do I get this? Do I have to use regular expressions for this?
You don't need regex here. For instance, you can use the csv module of python.
old.csv
groups,,,
Dutch,Lore,Kilian,Daan
German,Marte,,
USA,Eva,Judith,
Python script using the csv module:
import csv

rows = []
with open('old.csv', 'r', newline='') as old_csv:
    old = csv.reader(old_csv, delimiter=',')
    next(old)  # skip the header row
    for row in old:
        for name in row[1:]:
            if name:
                rows.append({'name': name, 'group': row[0]})

with open('new.csv', 'w', newline='') as new_csv:
    fieldnames = ['name', 'group']
    new = csv.DictWriter(new_csv, fieldnames=fieldnames)
    new.writeheader()
    new.writerows(rows)
new.csv
name,group
Lore,Dutch
Kilian,Dutch
Daan,Dutch
Marte,German
Eva,USA
Judith,USA
You can also use the xlrd and xlwt modules, but you have to install them because they aren't part of the standard library.
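For example, a rough sketch with xlrd/xlwt, assuming a workbook laid out like old.csv above (the file and sheet names are hypothetical):
import xlrd
import xlwt

book = xlrd.open_workbook("old.xls")
sheet = book.sheet_by_index(0)

out_book = xlwt.Workbook()
out_sheet = out_book.add_sheet("names")
out_sheet.write(0, 0, "name")
out_sheet.write(0, 1, "group")

out_row = 1
for r in range(1, sheet.nrows):      # skip the header row
    group = sheet.cell_value(r, 0)   # first column holds the group
    for c in range(1, sheet.ncols):
        name = sheet.cell_value(r, c)
        if name:
            out_sheet.write(out_row, 0, name)
            out_sheet.write(out_row, 1, group)
            out_row += 1

out_book.save("new.xls")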

ValueError Scikit learn. Number of features of model don't match input

I am pretty new to machine learning in general and scikit-learn in specific.
I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html
For practicing on my own, I am using my own data-set. My data set is divided into two different CSV files:
Train_data.csv (Contains 32 columns, the last column is the output value).
Test_data.csv (Contains 31 columns; the output column is missing - which should be the case, no?)
The test data has one column less than the training data.
I am using the following code to learn (using training data) and then predict (using test data).
The issue I am facing is the error:
ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time
Here is my code (sorry if it looks completely wrong :( )
import pandas as pd #import the library
from sklearn import svm
mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"] #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data
clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target) #Code from the URL above
test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column
clf.predict(test_data[-1:]) #Code from the URL above
The training data csv labels looks something like this:
Value1,Value2,Value3,Value4,Output
The test data csv labels looks something like this:
Value1,Value2,Value3,Value4.
Thanks :)
Your problem is a supervised learning problem: you have some data in the form of (input, output) pairs.
The input is the set of features describing your example, and the output is the prediction that your model should produce given that input.
In your training data, you'll have one more attribute in your csv file because, in order to train your model, you need to give it the output.
The general workflow in sklearn with a Supervised Problem should look like this
X, Y = read_data(data)   # read_data stands in for your own loading code
n = len(X)
split = int(n * 0.8)     # slice indices must be integers
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]
model.fit(X_train, Y_train)
model.score(X_test, Y_test)
To split your data, you can use train_test_split and you can use several metrics in order to judge your model's performance.
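For instance, a minimal sketch with train_test_split (an 80/20 split mirroring the slicing above; X, Y and model are as in the snippet above, and the random_state value is arbitrary):
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model.fit(X_train, Y_train)
model.score(X_test, Y_test)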
You should check the shape of your data
data.shape
It seems like you're dropping the last 3 columns instead of only the last one. Try this instead:
data = mydata.ix[:,:-1]
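Note that .ix is deprecated (and removed in later pandas versions); the equivalent positional selection with iloc would be:
data = mydata.iloc[:, :-1]  # all rows, every column except the last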

Deduplication / matching in CouchDB?

I have documents in couchdb. The schema looks like below:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person as long as at least one of the following is identical:
email
personal_blog_url
telephone
I have created 3 views, which basically map email/blog_url/telephone to userIds and then combine the userIds into a group under the same key, e.g.:
_view/by_email:
----------------------------------
key values
a_email#gmail.com [123, 345]
b_email#gmail.com [23, 45, 333]
_view/by_blog_url:
----------------------------------
key values
http://myblog.com [23, 45]
http://mysite.com/ss [2, 123, 345]
_view/by_telephone:
----------------------------------
key values
232-932-9088 [2, 123]
000-111-9999 [45, 1234]
999-999-0000 [1]
My questions:
How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
Or whether it is a good practice to do such deduplication in couchdb?
Or what would be a good way to do a deduplication in couch then?
P.S. In the final view, suppose that for all dupes we only keep the smallest userId.
Thanks.
Good question. Perhaps you could listen to _changes and search for the fields you want to be unique for the real user in the views you suggested (by_*).
Merge the views into one (emit different fields in one map):
function (doc) {
  // emit each identifying field only if it exists on the document
  if (doc.email) emit([1, doc.email], [doc._id]);
  if (doc.personal_blog_url) emit([2, doc.personal_blog_url], [doc._id]);
  if (doc.telephone) emit([3, doc.telephone], [doc._id]);
}
Merge the lists of ids in the reduce function.
When a new doc arrives in the changes feed, you can query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If the minimal id is smaller than the changed doc's, update the field realId; otherwise update the documents in the list with the changed id.
I suggest using a separate document to store the { userId, realId } relation.
You can't create new documents by just using a view. You'd need a task of some sort to do the actual merging.
Here's one idea.
Instead of creating 3 views, you could create one view (that indexes the data if it exists):
Key Values
--- ------
[userId, 'phone'] 777-555-1212
[userId, 'email'] username#example.com
[userId, 'url'] favorite.url.example.com
I wouldn't store anything else except the raw value, as you'd end up with lots of unnecessary duplication of data (if you stored the full object for example).
Then, to query, you could do something like:
...startkey=[userId]&endkey=[userId,{}]
That would give you all of the duplicate information as a series of docs for that user Id. You'd still need to parse it apart to see if there were duplicates. But, this way, the results would be nicely merged into a single CouchDB call.
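For illustration, a rough sketch of that range query against CouchDB's HTTP API using the Python requests library (the database name, design document, view name, and user id are all hypothetical):
import json
import requests

user_id = "123"
resp = requests.get(
    "http://localhost:5984/users/_design/dedup/_view/by_field",
    params={
        "startkey": json.dumps([user_id]),
        "endkey": json.dumps([user_id, {}]),
    },
)
for row in resp.json()["rows"]:
    # each row's key is [userId, field] and its value is the raw phone/email/url
    print(row["key"], row["value"])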
Here's a nice example of using arrays as keys on StackOverflow.
You'd still probably load the original "user" document if it had other data that wasn't part of the de-duplication process.
Once discovered, you could consider cleaning up the data on the fly and prevent new duplicates from occurring as new data is entered into your application.