How to calculate the phrase similarity based on word2vec

I have millions of sentences and want to get phrase vectors so I can calculate phrase similarity. But the problem is that I don't know how to use word2vec to get a phrase vector. Does anyone know of other tools?

The simplest way to do this is to add the corresponding word-vector elements together and renormalise the result, giving you a sentence vector.
In C#, do something like this:

var vec = new double[dims];

// Sum the word vectors for every word in the sentence
foreach (var key in sentence)
{
    var tmp = model[key];
    for (var i = 0; i < dims; i++)
        vec[i] += tmp[i];
}

// Compute the Euclidean length of the summed vector
double len = 0;
for (var i = 0; i < dims; i++)
    len += vec[i] * vec[i];
len = Math.Sqrt(len);

// Renormalise to unit length
var normal = new double[dims];
for (var i = 0; i < dims; i++)
    normal[i] = vec[i] / len;
return normal;

To find the phrase similarity, you combine word vectors (trained with the word2vec algorithm) into phrase vectors.
Here is how to use word2vec to get a phrase vector: How to calculate phrase similarity between phrases
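Once you have two sentence vectors built and normalised as above, the similarity is usually taken as the cosine of the angle between them; for unit-length vectors that is just the dot product. A minimal sketch in C++ (the answer above uses C#, but the computation is the same in any language):

#include <cmath>
#include <vector>

// Cosine similarity of two equal-length vectors; for vectors already
// normalised to unit length this reduces to the plain dot product.
double CosineSimilarity(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0, lenA = 0, lenB = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot  += a[i] * b[i];
        lenA += a[i] * a[i];
        lenB += b[i] * b[i];
    }
    return dot / (std::sqrt(lenA) * std::sqrt(lenB));
}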

Related

Find longest common substring

I need to find the longest common substring of two DNA strings.
My first string is "CGATAC" and the second is "GACAGTC".
With my code the result is "GAC", but you can get a longer substring, namely "GATC". What do I need to change to get the longer substring?
int k = 0;
for (int i = 0; i < substring1.length(); i++) {
    char znak = substring1[i];
    for (int j = k; j < substring2.length(); j++) {
        char znak2 = substring2[j];
        if (znak == znak2) {
            end_substring += znak;
            k = j;
            break;
        }
    }
}
cout << end_substring;
You can improve your code with some basic ideas. I understand you want one of the longest strings, not all of them; then you can store the length of the longest string found so far and only search for strings of at least length+1. But the best solution is to use dynamic programming; you can read about that solution here: https://www.geeksforgeeks.org/longest-common-substring-dp-29/ (Strictly speaking, "GATC" is a common subsequence rather than a contiguous substring; the longest-common-subsequence variant uses a very similar DP.)
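For reference, a minimal sketch of that dynamic-programming approach (not the poster's code): dp[i][j] is the length of the longest common suffix of a[0..i) and b[0..j), and the answer is the largest cell:

#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string a = "CGATAC", b = "GACAGTC";
    // dp[i][j] = length of the longest common suffix of a[0..i) and b[0..j)
    vector<vector<int>> dp(a.size() + 1, vector<int>(b.size() + 1, 0));
    int best = 0;
    size_t endPos = 0;
    for (size_t i = 1; i <= a.size(); i++) {
        for (size_t j = 1; j <= b.size(); j++) {
            if (a[i - 1] == b[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                if (dp[i][j] > best) { best = dp[i][j]; endPos = i; }
            }
        }
    }
    cout << a.substr(endPos - best, best) << "\n"; // longest common substring
}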

looking for a faster way to help reduce/create a huge list of strings

I tried to write an algorithm that guesses correctly in the game "Mastermind".
It works; the average number of guesses is 6, but it takes a lot of time to calculate the best guess.
I used Knuth's idea; the algorithm works as follows:
1. Create the set S of 1296 possible codes (1111, 1112 ... 6665, 6666).
2. Start with the initial guess 1122 (Knuth gives examples showing that other first guesses, such as 1123 or 1234, do not win in five tries on every code).
3. Play the guess to get a response of colored and white pegs.
4. If the response is four colored pegs, the game is won and the algorithm terminates.
5. Otherwise, remove from S any code that would not give the same response if the current guess were the code.
In my code, step 2 takes a random number instead.
I used vector<string> for this.
AllPoss is the vector full of the possible strings, guess is the last guess that was used, and answer is the count of bulls and cows, formatted as "x,y" (where x and y are numbers).
void bullpgia::SmartGuesser::remove(string guess, string answer)
{
    for (auto i = AllPoss.begin(); i != AllPoss.end(); i++) {
        string token = *i;
        if (calculateBullAndPgia(token, guess) != answer)
            AllPoss.erase(i--);
    }
}
This is the part that takes a lot of time to calculate. Is there any way to improve it?
To create the list I used:
void bullpgia::SmartGuesser::All() {
    /**
     * creates a pool of all the possibility strings
     * we then delete the ones we don't need
     * @param length is the length of the word we need to guess
     */
    for (int i = 0; i < pow(10, length); i++) {
        stringstream ss;
        ss << setw(length) << setfill('0') << i;
        string s = ss.str();
        AllPoss.push_back(s);
    }
}
The function calculateBullAndPgia(string, string) is:
string calculateBullAndPgia(const string &choice, const string &guess) {
    string temp = choice;
    string temp2 = guess;
    unsigned int bull = 0;
    unsigned int pgia = 0;
    // first pass: count bulls (right digit, right place) and mask them out
    for (int i = 0; i < temp.length(); i++) {
        if (temp[i] == temp2[i]) {
            bull++;
            temp[i] = 'a';
            temp2[i] = 'z';
        }
    }
    // second pass: count cows (right digit, wrong place)
    for (int i = 0; i < temp.length(); i++) {
        for (int j = 0; j < temp2.length(); j++) {
            if (i != j && temp[i] == temp2[j]) {
                pgia++;
                temp[i] = 'a';
                temp2[j] = 'z';
            }
        }
    }
    return to_string(bull) + "," + to_string(pgia);
}
Erasing a single element in the middle of a vector is O(n). My guess is that you wind up doing that O(n) times per call to SmartGuesser::remove. Then you loop over that, so you probably have an O(n^3) algorithm overall. You could instead use std::remove_if, which is O(n), to move all the to-be-erased elements to the end of the vector, where they can be cheaply erased:
AllPoss.erase(std::remove_if(AllPoss.begin(), AllPoss.end(),
                             [&](const std::string& token) { return calculateBullAndPgia(token, guess) != answer; }),
              AllPoss.end());
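As an aside (an observation, not part of the original answer): the question's loop calls AllPoss.erase(i--), which is undefined behavior when the erased element is the first one, since you cannot decrement begin(). If you prefer an explicit loop over the erase-remove idiom, a safe sketch is:

for (auto i = AllPoss.begin(); i != AllPoss.end(); /* increment inside */) {
    if (calculateBullAndPgia(*i, guess) != answer)
        i = AllPoss.erase(i);  // erase returns an iterator to the next element
    else
        ++i;
}

This still pays the O(n) shuffling cost per erase that std::remove_if avoids, so the one-liner above remains the faster fix.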

Map string in cpp

I want to store all substrings of a string in an unordered_map. I am thinking of using the substr
function from the STL, but its worst-case time complexity is O(n), and when I use it inside a loop over all indexes of the string it becomes O(n^2).
Can we do better, in O(n), by using pointers or something else, so that I can access the substrings later?
If you don't want to copy the substrings into the map then you can use std::string_view to store a view of each substring. This costs you a pointer and a length, so it's as efficient as it can be.
You can build a vector of all the substrings like this:
#include <string>
#include <string_view>
#include <vector>

int main()
{
    std::string word = "word";
    auto size = word.size();
    std::vector<std::string_view> parts;
    parts.reserve(size * (size + 1) / 2); // reserve space for all the substrings
    for (size_t i = 0; i < size; ++i)
        for (size_t j = i; j < size; ++j)
            parts.emplace_back(word.data() + i, j - i + 1);
}
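Since the question specifically asks about an unordered_map, the same views can be used as keys; std::hash<std::string_view> is provided since C++17. A minimal sketch (note the views alias word, so the map is only valid while word is alive and unchanged):

#include <iostream>
#include <string>
#include <string_view>
#include <unordered_map>

int main()
{
    std::string word = "word";
    std::unordered_map<std::string_view, int> counts;
    // count every substring; the views point into `word`, so nothing is copied
    for (size_t i = 0; i < word.size(); ++i)
        for (size_t j = i; j < word.size(); ++j)
            ++counts[std::string_view(word.data() + i, j - i + 1)];
    std::cout << counts.size() << " distinct substrings\n";
}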

How to calculate a mismatch score between n number of strings more efficiently?

Suppose I have a vector that contains n strings, where the strings can be of length 5...n. Each string must be compared with every other string, character by character. If there is a mismatch, the score is increased by one; if there is a match, the score does not increase. The resulting scores are then stored in a matrix.
I have implemented this in the following way:
for (auto i = 0u; i < vector.size(); ++i)
{
    // vector.size() x vector.size() matrix
    std::string first = vector[i]; //horrible naming convention
    for (auto j = 0u; j < vector.size(); ++j)
    {
        std::string second = vector[j];
        int score = 0;
        for (auto k = 0u; k < sizeOfStrings; ++k)
        {
            if (first[k] == second[k])
            {
                score += 0;
            }
            else
            {
                score += 1;
            }
        }
        //store score into matrix
    }
}
I am not happy with this solution because it is O(n^3), so I have been trying to think of ways to make it more efficient. I have thought about writing another function to replace the innards of the j loop; however, that would still be O(n^3), since the function would still need a k loop.
I have also thought about a queue, since I only care about string[0] compared to string[1] through string[n], string[1] compared to string[2] through string[n], string[2] compared to string[3] through string[n], and so on. My solution does unnecessary computations, since each string is compared with every other string in both directions. The problem with this is that I am not really sure how to build my matrix out of it.
Finally, I have looked into the standard template library, but std::mismatch doesn't seem to be what I am looking for, nor does std::find. What other ideas do you have?
I don't think you can easily get away from O(n^3) comparisons, but you can easily implement the change you talk about. Since the comparisons only need to be done one way (i.e. comparing string[1] to string[2] is the same as comparing string[2] to string[1]), you don't need to iterate through the entire array each time and can change the start value of your inner loop to the current index of your outer loop:
for (auto i = 0u; i < vector.size(); ++i) {
    // vector.size() x vector.size() matrix
    std::string first = vector[i]; //horrible naming convention
    for (auto j = i; j < vector.size(); ++j) {
To store it in a matrix, set up your i x j matrix, initialize it to all zeroes, and simply store each score in M[i][j]:
        for (auto k = 0u; k < sizeOfStrings; ++k) {
            if (first[k] != second[k]) {
                M[i][j]++;
            }
        }
If you have n strings each of length m, then no matter what (even with your queue idea), you have to do at least (n-1)+(n-2)+...+1 = n(n-1)/2 string comparisons, so you'll have to do (n(n-1)/2)*m character comparisons. So no matter what, your algorithm is going to be O(mn^2).
General comment:
You don't have to compare the same pair of strings twice. More importantly, you are starting from the beginning each time in the second loop even though you have already computed those diffs, so change the second loop to start from i+1.
By doing so your complexity decreases, as you no longer check pairs you have already checked or strings that are the same.
Improvement
Sort the vector and remove duplicate entries; then, instead of wasting computation checking identical strings, you will only check those that differ, as in the sketch below.
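That deduplication is the classic sort/unique/erase idiom; a minimal sketch:

#include <algorithm>
#include <string>
#include <vector>

void Deduplicate(std::vector<std::string>& v) {
    std::sort(v.begin(), v.end());            // group identical strings together
    v.erase(std::unique(v.begin(), v.end()),  // move unique strings to the front
            v.end());                         // drop the leftover tail
}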
The other answers that say this is at least O(mn^2) or O(n^3) are incorrect. This can be done in O(mn) time, where m is the string length and n is the number of strings.
For simplicity we'll start with the assumption that all characters are ASCII.
You have a data structure:
int counts[m][255]
where counts[x][y] is the number of strings that have ASCII character y at index x in the string.
Now, if you do not restrict to ASCII, then you would need a map per index instead:
std::unordered_map<char, int> counts[m]
It works the same way: at each index in counts you have a map in which each entry (y, z) tells you how many strings (z) use character y at that index. To match the complexity you want a map with constant-time lookups and constant-time insertions, hence unordered_map rather than std::map.
Going back to ASCII and the array:
int counts[m][255] // start by initializing this array to all zeros
First initialize the data structure, where m is the size of the strings and vec is a std::vector containing the strings:
for (int i = 0; i < vec.size(); i++) {
    std::string str = vec[i];
    for (int j = 0; j < m; j++) {
        counts[j][str[j]]++;
    }
}
Now that you have this structure, you can calculate the scores easily:
for (int i = 0; i < vec.size(); i++) {
    std::string str = vec[i];
    int score = 0;
    for (int j = 0; j < m; j++) {
        score += counts[j][str[j]] - 1; // subtracting 1 gives how many other strings have that same char at that index
    }
    std::cout << "string \"" << str << "\" has score " << score;
}
As you can see from this code, this is O(m * n). (Note that the score computed here counts matches; since each string takes part in m*(n-1) character comparisons, its total mismatch score is m*(n-1) minus this value.)
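For reference, a self-contained sketch of this counting approach (the sample strings are made up for illustration; std::array replaces the variable-length array, and chars are cast to unsigned char before indexing):

#include <array>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // made-up sample data: n strings, all of the same length m
    std::vector<std::string> vec = {"abcde", "abfde", "xbcdz"};
    const size_t m = vec[0].size();

    // counts[j][c] = number of strings with character c at index j
    std::vector<std::array<int, 256>> counts(m, std::array<int, 256>{});
    for (const auto& str : vec)
        for (size_t j = 0; j < m; j++)
            counts[j][(unsigned char)str[j]]++;

    for (const auto& str : vec) {
        int matches = 0;
        for (size_t j = 0; j < m; j++)
            matches += counts[j][(unsigned char)str[j]] - 1; // exclude the string itself
        int mismatches = (int)(m * (vec.size() - 1)) - matches;
        std::cout << str << ": " << matches << " matches, "
                  << mismatches << " mismatches against all other strings\n";
    }
}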

what compression algorithm to use for highly redundant data

This program uses sockets to transfer highly redundant 2D byte arrays (image-like). While the transfer rate is comparatively high (10 Mbps), the arrays are also highly redundant (e.g. each row may contain several consecutive similar values).
I have tried zlib and lz4 and the results were promising; however, I am still thinking about a better compression method, and please remember that it should be relatively fast, as lz4 is. Any suggestions?
You should look at the PNG algorithms for filtering image data before compressing. They range from simple to more sophisticated methods for predicting values in a 2D array based on previous values. To the extent that the predictions are good, the filtering can make for dramatic improvements in the subsequent compression step.
You should simply try these filters on your data and then feed the result to lz4.
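For illustration, a minimal sketch of the simplest of those filters, PNG's "Sub" filter, which predicts each byte from its left neighbour and stores only the difference (a sketch of the idea, not libpng's API):

#include <cstdint>
#include <vector>

// PNG "Sub" filter in place: each byte becomes its difference from the
// byte to its left (mod 256); runs of similar values become near-zero runs.
void SubFilterRow(std::vector<uint8_t>& row) {
    for (size_t i = row.size(); i-- > 1; )
        row[i] = (uint8_t)(row[i] - row[i - 1]);
}

// Inverse on the receiving end: a running sum restores the original bytes.
void UnSubFilterRow(std::vector<uint8_t>& row) {
    for (size_t i = 1; i < row.size(); ++i)
        row[i] = (uint8_t)(row[i] + row[i - 1]);
}

Filter each row before handing the buffer to lz4, and apply the inverse after decompression.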
You could create your own scheme: if the data in the rows is similar, you can create a resource/index map, reducing the size substantially. Something like this:
Original file:
row 1: 1212,34,45,1212,45,34,56,45,56
row 2: 34,45,1212,78,54,87,....
You create a list of the unique values, then use each value's index in that list as a replacement:
34,45,54,56,78,87,1212
row 1: 6,0,1,6,1,0,3,1,3
This can potentially save you 30% or more on data transfer, but it depends on how redundant the data is.
UPDATE
Here is a simple implementation:
#include <set>
#include <string>
#include <vector>

using DataTable = std::vector<std::vector<int>>; // assuming a 2D vector implementation

std::string Compress(const DataTable& my2dData) {
    // create the list of unique values (std::set keeps them sorted and distinct)
    std::set<int> seen;
    for (const auto& row : my2dData)
        for (int value : row)
            seen.insert(value);
    std::vector<int> uniqueValues(seen.begin(), seen.end());

    // Find returns the position of a value in the unique list
    auto Find = [&](int value) -> int {
        for (size_t i = 0; i < uniqueValues.size(); ++i)
            if (uniqueValues[i] == value) return (int)i;
        return -1;
    };

    // create the indexes: one comma-separated line per row
    std::string indexMap;
    for (const auto& row : my2dData) {
        std::string tmpRow;
        for (int value : row) {
            if (!tmpRow.empty()) tmpRow += ",";
            tmpRow += std::to_string(Find(value));
        }
        indexMap += tmpRow + "\r\n";
    }

    // create the file to transfer: an "i:" line with the unique values,
    // then a "d:" line with the index map
    std::string fileCompressed;
    for (size_t k = 0; k < uniqueValues.size(); ++k) {
        fileCompressed += (k == 0 ? "i: " : ",");
        fileCompressed += std::to_string(uniqueValues[k]);
    }
    fileCompressed += "\r\nd:" + indexMap;
    return fileCompressed;
}
Now on the receiving end you just do the opposite: if the line starts with "i" you read back the list of unique values, and if it starts with "d" you read the index data and map each index back to its value.
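A matching sketch of that receiving side (it assumes the exact "i:"/"d:" layout produced above):

#include <sstream>
#include <string>
#include <vector>

// Decompress expects an "i: v0,v1,..." header line, then "d:" followed by
// one comma-separated line of indexes per row.
std::vector<std::vector<int>> Decompress(const std::string& fileCompressed) {
    std::istringstream in(fileCompressed);
    std::string line;
    std::vector<int> uniqueValues;
    std::vector<std::vector<int>> data;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.rfind("i:", 0) == 0 || line.rfind("d:", 0) == 0)
            line = line.substr(2);                 // strip the "i:" / "d:" tag
        if (line.empty()) continue;
        std::istringstream fields(line);
        std::vector<int> row;
        std::string field;
        while (std::getline(fields, field, ','))
            row.push_back(std::stoi(field));
        if (uniqueValues.empty()) {
            uniqueValues = row;                    // first line: the value table
        } else {
            for (int& idx : row) idx = uniqueValues[idx]; // map indexes back to values
            data.push_back(row);
        }
    }
    return data;
}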