Word2Vec model output types

When a Word2Vec model is trained, three output files are created:
model
model.wv.syn0
model.syn1neg
I have a couple of questions regarding these files.
How are these outputs essentially different from each other?
Which file should I look at if I want to access the trained results?
Thanks in advance!

Those are the three files created by gensim's Word2Vec .save() function. The model file is a Python pickle of the main model object; the other files are the over-large numpy arrays stored separately for efficiency. syn0 happens to contain the raw word vectors, and syn1neg the model's internal weights, but neither is cleanly interpretable without the other data.
So the only supported way to re-load them is the matching .load() function, with all three files available. A successful .load() will result in a model object just like the one you save()d, and you'd access the results via that loaded object.
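For example, a minimal sketch of re-loading and querying (the filename mymodel and the word 'dog' are hypothetical, and exact attribute names vary a little between gensim versions):

    from gensim.models import Word2Vec

    # Re-load the model; gensim picks up the companion .npy files automatically,
    # as long as they sit next to the main file under the same prefix.
    model = Word2Vec.load('mymodel')

    # Access the trained results through the loaded object.
    vector = model.wv['dog']                            # raw vector for one word
    neighbours = model.wv.most_similar('dog', topn=5)   # nearest words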
(If you only need the raw word-vectors, you can also use the .save_word2vec_format() method, which writes in a format compatible with the original Google-released word2vec.c code. But that format carries strictly less information than gensim's native save, so you'd only use it if you absolutely need it for compatibility with other software. Working with the gensim native files ensures you can always save the other format later, while you can't go the other way.)
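A rough sketch of that export path (same hypothetical filename as above; on older gensim versions save_word2vec_format() lives on the model itself rather than on .wv):

    from gensim.models import KeyedVectors, Word2Vec

    model = Word2Vec.load('mymodel')

    # Write just the word vectors in the original word2vec.c binary format...
    model.wv.save_word2vec_format('vectors.bin', binary=True)

    # ...which re-loads as a bare KeyedVectors: vectors only, no training state,
    # so the model can't be trained further from this file.
    word_vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)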

Related

gensim creates files with extensions .bin.trainables.syn1neg.npy and .bin.wv.vectors.npy in addition to .bin

I am using Python gensim to train word2vec on my 93 million sentences. However, when I train my model, I get three files as output, with extensions .bin.trainables.syn1neg.npy and .bin.wv.vectors.npy in addition to .bin. I went through the answer provided here: Why are multiple model files created in gensim word2vec? which explains why this happens. However, I would like to know if there is a way to convert these files into a normal single .bin file?
There is an optional parameter to .save() called sep_limit, with a default value of 10 MiB, which controls the threshold above which separate files are used. You could try setting this to a much larger value, larger than any of the extra files you're seeing, and as long as your model is still small enough not to hit pickle() limits, it might work.
But gensim saves a model to multiple files both for efficiency and to be sure of not hitting size limitations in Python pickle(). You should, if at all possible, just keep the files together as a set. They will always share the same prefix, which you provided as the name to .save().
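A minimal sketch of the sep_limit idea (the two-sentence corpus and the filename are stand-ins):

    from gensim.models import Word2Vec

    sentences = [['hello', 'world'], ['hello', 'gensim']]  # stand-in corpus
    model = Word2Vec(sentences, min_count=1)

    # sep_limit is the array-size threshold above which gensim spills large
    # numpy arrays into separate .npy files; raising it from the 10 MiB default
    # to 4 GiB keeps everything in the single pickle file, subject to pickle's
    # own size limits.
    model.save('mymodel.bin', sep_limit=4 * 1024**3)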

Using Matlab SVM model in C++

I have used libsvm in Matlab to create an SVM model. I can't create the model in the code where I do the prediction, so I need to save the model and use it later. I want to use that model in my C++ code to make predictions. I know how to predict in Matlab itself using svmpredict, but I want to save the model created in Matlab and use it in C++ for predictions. First of all, is it possible? If so, how do I save the model in Matlab and call it back in C++?
One option is to save the parameters learned by the model in a CSV file. The model returned by svmtrain is a struct, and one of the elements of this struct is the model parameters. You could then read this into your C++ code.
However, this seems redundant, because libSVM is already written in C; the predict function you call from Matlab is C code under the hood.
If all you need is to predict values in your C++ code, one thing you can do is extract the model parameters in Matlab and use them for predictions in your C++ code.
You may already know that you can manually do the predictions by substituting the required values and predicting based on the sign.
This answer has information about what parameters to extract in the case of RBF kernel and how you can make predictions.
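To make the arithmetic concrete, here is a rough sketch of that manual decision function, written in Python rather than C++ purely for illustration (the parameter names mirror libSVM's model struct: SVs, sv_coef, rho, and the RBF kernel's gamma; multi-class models and flipped label orderings need more careful sign handling):

    import numpy as np

    def svm_predict_rbf(x, support_vectors, sv_coef, rho, gamma):
        # RBF kernel between x and every support vector.
        k = np.exp(-gamma * np.sum((support_vectors - x) ** 2, axis=1))
        # libSVM convention: sv_coef holds alpha_i * y_i, bias enters as -rho.
        decision = np.dot(sv_coef, k) - rho
        # Predict based on the sign of the decision value.
        return 1 if decision > 0 else -1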

How to see the operations performed in a Tensorflow pb file?

More specifically, I want to see the operations performed in the graph inside 'classify_image_graph_def.pb' which is inside Tensorflow's imagenet Inception model.
This is related to this question - basically you can load it (like in the Python image classification example) and then run get_operations() on the graph, which gives you a list of operations, each of which has a name and a type attribute. You could also, in principle, use TensorBoard, I believe, but for that specific graph I always ran into "too big" errors.
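A rough sketch, written against the TensorFlow 1.x API that was current for this question (under TF 2.x the same calls live under tf.compat.v1):

    import tensorflow as tf

    # Parse the frozen GraphDef from disk.
    with tf.gfile.GFile('classify_image_graph_def.pb', 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    # Import it into a fresh graph, then list every operation.
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name='')

    for op in graph.get_operations():
        print(op.name, op.type)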

What's the difference between core data, essential data and sample data in hybris?

In the hybris wiki trails, there is mention of core data vs. essential data vs. sample data. What is the difference between these three types of data?
Ordinarily, I would assume that sample data is illustrative gobbledygook data created to populate the example apparel and electronics storefronts. However, the wiki trails suggest that core data is for non-store specific data and the sample data is for store specific data.
On the same page, the wiki states that core data contains cockpit and catalog definitions, email templates, CMS layout, and site definitions (countries and user groups impex are included in this as well). This seems rather store specific to me. Does anyone have an explanation for this?
Yes, I have an explanation. Actually, a lot of this is down to arbitrary decisions I made on separating data between the acceleratorcore and acceleratorsampledata extensions as part of the Accelerator in 4.5 (these later had a y- prefix added).
Essential and Project Data are two sets of data that are used within hybris' init/update process. These steps are controlled for each extension via particular Annotations on classes and methods.
Core vs. sample data is more about whether I thought the impex file, or lines, were specific to the sample store or were more general. You will notice your CoreSystemSetup has both essentialdata and projectdata steps.
Lots of work has happened in various continents since then, so, like much of hybris now, it's a bit of a mess.
There are a few fun bugs related to hybris making certain things part of essentialdata, but these are in the platform, not something I can fix without complaining to various people, etc.
To confuse matters further, there is the yacceleratorinitialdata extension. This extension was a way I hoped to make projects easier, by giving some impex skeletons for new sites and stores. This would be generated for you during modulegen. It has rotted heavily since release, though, and is now very out of date.
For a better explanation, take a look at this answer from answers.sap.com.
Hybris imports two types of data during the initialization and update processes; the first is essentialdata and the other is projectdata.
Essentialdata is the core data setup, which is mandatory and is imported whenever you run initialization or update.
Sampledata is your projectdata; it is not mandatory and is imported only when you select the project while updating the system.

How should I link a data class to my GUI code (to display attributes of an object, in C++)?

I have a class (in C++), call it Data, that has thousands of instances (objects) when the code is run. I have a widget (in Qt), call it DataWidget that displays attributes of the objects. To rapidly build the widget I simply wrote the object attributes to a file and had the widget parse the file for the attributes - this approach works, but isn't scalable or pretty.
To be more clear my requirements are:
1 - DataWidget should be able to display multiple, different, Data object's attributes at a time
2 - DataWidget should be able to display thousands of Data objects per second
3 - DataWidget should be run alongside the code that generates new Data objects
4 - each Data object needs to be permanently saved to file/database
Currently, the GUI and the DataWidget are created, then the experiment runs and generates thousands of Data objects (periodically writing some of them to file). After the experiment runs, the DataWidget displays the last Data object written to file (they are written to XML files).
With my current file approach I can satisfy (1) by grabbing more than one file after the experiment runs. Since the experiment isn't tied to DataWidget, there is no concurrency, so I can't do (3) until I add a signal that informs the DataWidget that a new file exists.
I haven't moved forward with this approach for 2 reasons:
Firstly, even though the files aren't immediately written to disk, I can't imagine that this method is scalable unless I implement a caching system - but this seems like I'm reinventing the wheel? Secondly, Data is a wrapper for a graph data-structure, and I'm using GraphML (via the Boost Graph Library, i.e. write_graphml()) to write the structure to XML files; reading the structure back in with Boost's read_graphml() requires me to read the file back into a Data object ... which means the experiment portion of the program encodes the object into XML, writes the XML to a file (hopefully in memory and not to disk), and then the DataWidget reads the XML from a file and decodes it back into an object!
It seems to me like I should be using a database, which would handle all the caching etc. Moreover, it seems like I should be able to skip the file/database step and pass the Data to the DataWidget within the program (perhaps pass it a reference to a list of Data). Yet I also want to save the Data to file, so the file/database step isn't entirely pointless - I'm just using it in the wrong way at the wrong time.
What is the better approach given my requirements?
Are there any general resources and/or guidelines for handling and displaying data like this?
I see you're using Qt. This is good because Qt 4.0 and later includes a powerful model/view framework. And I think this is what you want.
Model/View
Basically, have your Data class inherit from and implement QAbstractItemModel, or a different Qt model class, depending on the kind of model you want. Then set your view widget (most likely a QListView) to use Data for its model.
There are lots of examples at their site and this solution scales nicely with large data sets.
Added: This model test code from labs.trolltech.com comes in really handy:
http://labs.trolltech.com/page/Projects/Itemview/Modeltest
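The question is C++, but the shape of the pattern is identical in any Qt binding; here is a rough sketch in Python with PyQt5, standing in for the C++ QAbstractItemModel subclass (the Data objects are faked as strings, and how a real Data renders itself is up to you):

    from PyQt5.QtCore import QAbstractListModel, QModelIndex, Qt
    from PyQt5.QtWidgets import QApplication, QListView

    class DataListModel(QAbstractListModel):
        # Thin model over the in-memory list the experiment populates.
        def __init__(self, items, parent=None):
            super().__init__(parent)
            self._items = items  # shared reference, no copying or file I/O

        def rowCount(self, parent=QModelIndex()):
            return len(self._items)

        def data(self, index, role=Qt.DisplayRole):
            if index.isValid() and role == Qt.DisplayRole:
                return str(self._items[index.row()])
            return None

    app = QApplication([])
    model = DataListModel(['data0', 'data1', 'data2'])  # stand-in Data objects
    view = QListView()
    view.setModel(model)
    view.show()
    app.exec_()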
It seems to me like I should be using a database which would handle all the caching etc. Moreover, it seems like I should be able to skip the file/database step and pass the Data to the DataWidget in the program (perhaps pass it a reference to a list of Data). Yet, I also want to save the Data to file, so the file/database step isn't entirely pointless - I'm just using it in the wrong way at the wrong time.
If you need to display that much rapidly changing data, having an intermediate file or database will slow it down and likely become the bottleneck. I think the Widget should read the newly generated data directly from memory. This doesn't prevent you from storing the data in a file or database though, it can be done in a separate thread/process.
If all of the data items will fit in memory, I'd say put them in a vector/list, and pass a reference to that to the DataWidget. When it's time to save them, pass a reference to your serializing method. Then your experiment just populates the data structure for the other processes to use.