Is it recommended to remove duplicate words in the word2vec algorithm? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I have data consisting of DNA sequences, where the words are k-mers of length 6 and the sentences are the DNA sequences themselves. Each DNA sequence has 80 k-mers (words).
The full list of k-mers is around 130,000, but after removing the duplicate elements only about 4,500 k-mers remain. This huge gap left me unsure whether or not to remove the duplicate k-mers. My question is: in this case, is it recommended to remove the duplicated k-mers for the word2vec algorithm?
Thanks.

Without an example, it's unclear what you mean by "removing the duplicate elements". (Does that mean, when the same token appears twice in a row? Or twice in one "sentence"? Or, as I'm not familiar with what your data looks like in this domain, something else entirely?)
That you say there are 130,000 tokens in the vocabulary, but then 4,500 later, is also confusing. Typically the "vocabulary" size is the number of unique tokens. Removing duplicate tokens couldn't possibly change the number of unique tokens encountered.
In the usual domain of word2vec, natural language, words don't often repeat one-after-another. To the extent they sometimes might – as in say the utterance "it's very very hot in here" – it's not really an important enough case that I've noticed anyone commenting about handling that "very very" differently than any other two words.
(If a corpus had some artificially-duplicated full-sentences, it might be the case that you'd want to try discarding the exact-duplicate-sentences. Word2vec benefits from a variety of different usage-examples. Repeating the same sentence 10 times essentially just overweights those training examples – it's not nearly as good as 10 contrasting, but still valid, examples of the same words' usage.)
You're in a different domain that's not natural language, with different co-occurrence frequencies, and different end-goals. Word2vec might prove useful, but it's unlikely any general rules-of-thumb or recommendations from other domains will be useful. You should test things both ways, evaluate the results on your ultimate task in a robust repeatable way, and choose based on what you discover.
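Purely as an illustration of one possible reading -- "duplicates" as the same token repeated back-to-back inside one sentence -- here is a minimal C++ sketch of that preprocessing step (the k-mer tokens are invented); per the advice above, you would want to evaluate your end task both with and without it:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // A toy "sentence" of k-mer tokens containing one adjacent repeat.
    std::vector<std::string> sentence = {"ACGTAC", "ACGTAC", "CGTACG", "GTACGT"};

    // std::unique collapses runs of equal adjacent elements;
    // erase() then trims the leftover tail.
    sentence.erase(std::unique(sentence.begin(), sentence.end()), sentence.end());

    for (const auto& kmer : sentence)
        std::cout << kmer << ' ';
    std::cout << '\n';  // prints: ACGTAC CGTACG GTACGT
}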

Related

Finding keys near other keys [C++] [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I'm sorry if the title is undescriptive, I'm not sure how to summarize my issue into a few words.
I'm looking to find which characters are physically near other characters on my QWERTY (UK, but I don't mind if you provide information specific to US) keyboard.
e.g.:
charsNearChar('j') // OUTPUT -> U,I,H,K,N,M.
I can't seem to wrap my head around any solutions besides switch cases for each individual character; any help is appreciated!
There is no (simple) calculation that you could perform to get the list of adjacent keys. You simply need to use an explicitly written list of adjacent keys for each key.
any solutions besides switch cases
You don't need switch cases. What you're essentially asking for is a graph where the nodes are keys and the edges connect "adjacent" keys.
There are many ways to represent graphs. For your use case, perhaps an easy-to-understand and reasonably fast choice is an associative map from each key to its adjacency list (a vector of chars, or a string):
#include <string>
#include <unordered_map>

const std::unordered_map<char, std::string> adjacent {
    {'J', "UIHKNM"},
    // ... one entry per key
};
Since you limit the functionality to alphanumeric keys, they have the interesting property of lying in a hexagonal grid. Such a grid could well be represented with a 2D matrix:
const char* grid[] = {
    "123...",
    "QWE...",
    "ASD...",
    "ZXC...",
};
This representation has less repetition, and the adjacency lists can be generated from this matrix with an algorithm, as sketched below.
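A minimal sketch of that generation step, assuming a plain rectangular grid and an eight-cell neighbourhood (the rows of a physical keyboard are staggered, which is why the question's example lists only six neighbours for 'j'; a faithful version would add per-row column offsets -- the 3x3 grid and charsNearCell below are made up for illustration):

#include <iostream>
#include <string>
#include <vector>

// Hypothetical 3x3 excerpt of the keyboard; a real table would hold full rows.
const std::vector<std::string> grid = {
    "789",
    "UIO",
    "JKL",
};

// Collect the characters in the (up to) eight cells surrounding (row, col).
std::string charsNearCell(int row, int col) {
    std::string result;
    for (int dr = -1; dr <= 1; ++dr) {
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;  // skip the key itself
            int r = row + dr, c = col + dc;
            if (r < 0 || r >= (int)grid.size()) continue;
            if (c < 0 || c >= (int)grid[r].size()) continue;
            result += grid[r][c];
        }
    }
    return result;
}

int main() {
    std::cout << charsNearCell(1, 1) << '\n';  // neighbours of 'I': 789UOJKL
}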

Most frequent substring of fixed length - simple solution needed [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Please describe (without implementation!) an algorithm (the faster the better) which receives a string of n letters and a positive integer k as arguments, and prints the most frequent substring of length k (if there are multiple such substrings, the algorithm may print any one of them). The string is composed of the letters "a" and "b". For example: for the string ababaaaabb and k=3 the answer is "aba", which occurs 2 times (the fact that the occurrences overlap doesn't matter). Describe the algorithm, prove its correctness and calculate its complexity.
I can use only the most basic features of C++: no vectors, classes, objects, etc. I also don't know about std::string, only char arrays. Can someone please explain what the algorithm would be, possibly with an implementation in code for easier understanding? It's a question from a university exam; that's why the constraints are so odd.
A simple solution is to try all possible substrings from left to right (i.e. starting at indices i = 0 to n-k), comparing each one to the substrings that follow it (i.e. starting at indices j = i+1 to n-k).
For every i-substring, you count the number of occurrences, and keep track of the most frequent one so far.
As a string comparison costs at worst k character comparisons, and you will be performing (n-k)(n-k+1)/2 such comparisons, the total cost is of order O(k(n-k)²). [In fact the cost can be lower because some of the string comparisons may terminate early, but I am not able to quantify that saving.]
This solution is simple but probably not efficient.
In theory you can reduce the cost using a more efficient string matching algorithm, such as Knuth-Morris-Pratt, resulting in O((n-k)(n+k)) operations.
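A minimal sketch of the naive approach under the question's constraints (plain char arrays only; the helper name and hard-coded example are made up for illustration):

#include <cstdio>
#include <cstring>

// Compare the k-length substrings of s starting at positions i and j.
bool sameSubstring(const char* s, int i, int j, int k) {
    for (int t = 0; t < k; ++t)
        if (s[i + t] != s[j + t])
            return false;
    return true;
}

int main() {
    const char s[] = "ababaaaabb";
    const int k = 3;
    const int n = (int)std::strlen(s);

    int bestStart = 0, bestCount = 0;
    for (int i = 0; i + k <= n; ++i) {        // each candidate substring
        int count = 1;                        // it occurs at i itself
        for (int j = i + 1; j + k <= n; ++j)  // count later occurrences
            if (sameSubstring(s, i, j, k))
                ++count;
        if (count > bestCount) {
            bestCount = count;
            bestStart = i;
        }
    }

    std::printf("%.*s occurs %d times\n", k, s + bestStart, bestCount);  // aba occurs 2 times
}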

Given billions of URLs, how to determine duplicate content [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I was asked this question in a programming interview. I have described the question in detail below. It was an open-ended question.
Given billions of URLs (deep links), how do I determine which URLs point to duplicate content? The question was then extended to finding out, in cases of duplicate pages, which one is the authentic one. This was the first part.
My approach (under some reasonable assumptions) was to bucket the URLs by domain and then compare the contents of the URLs within the same bucket.
In the second part, the interviewer narrowed down the question stating that:
Given just two URLs, URL1 is a wiki page about a celebrity, (eg: Brad Pitt) and URL2 contains information about many celebrities including Brad Pitt.
How do we identify which one is authentic and which one is the duplicate?
My answer was based on comparing the two pages on the basis of their citations.
The interviewer asked me to build the answer from scratch, and wanted me to assume that we don't have any prior information about duplicate content on the URLs.
Since it's an open-ended question, any lead would be helpful.
You might find this paper to be helpful: "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms" by Monika Henzinger at Google, as this problem has attracted a fair amount of research. From the paper:
A naive solution is to compare all pairs of documents. Since this is prohibitively expensive on large datasets, Manber [11] and Heintze [9] proposed the first algorithms for detecting near-duplicate documents with a reduced number of comparisons. Both algorithms work on sequences of adjacent characters. Brin et al. [1] started to use word sequences to detect copyright violations. Shivakumar and Garcia-Molina [13, 14] continued this research and focused on scaling it up to multi-gigabyte databases [15]. Broder et al. [3] also used word sequences to efficiently find near-duplicate web pages. Later, Charikar [4] developed an approach based on random projections of the words in a document. Recently Hoad and Zobel [10] developed and compared methods for identifying versioned and plagiarised documents.
In other words, it's a complex problem with a variety of solutions of varying success, and not something with a 'right' answer. Most of the answers involve checking word or character sequences.
The above link did not work for me, but I found this page from Stanford, which has an interesting theorem involving shingles and the Jaccard coefficient.
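As a rough illustration of that idea, a minimal sketch that computes the Jaccard coefficient between the word-shingle sets of two documents (the shingle width, tokenizer and function names are arbitrary choices for the example, not taken from either source):

#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a document into whitespace-separated words.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Build the set of w-word shingles (contiguous word sequences).
std::set<std::string> shingles(const std::string& text, size_t w) {
    std::vector<std::string> words = tokenize(text);
    std::set<std::string> result;
    for (size_t i = 0; i + w <= words.size(); ++i) {
        std::string sh = words[i];
        for (size_t j = 1; j < w; ++j) sh += ' ' + words[i + j];
        result.insert(sh);
    }
    return result;
}

// Jaccard coefficient: |A intersect B| / |A union B|.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    size_t common = 0;
    for (const auto& s : a) common += b.count(s);
    size_t unionSize = a.size() + b.size() - common;
    return unionSize == 0 ? 0.0 : double(common) / double(unionSize);
}

int main() {
    auto a = shingles("brad pitt is an american actor and producer", 3);
    auto b = shingles("brad pitt is an american actor and film producer", 3);
    std::cout << jaccard(a, b) << '\n';  // near-duplicate pages score close to 1
}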

Right sequence of brackets [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
Please help with writing a program in C++. We have a sequence of brackets of 4 kinds: (), [], {}, <>. We are required to find the shortest sequence with correctly placed brackets for which the initial sequence would be a subsequence, i.e. the initial sequence could be obtained from the resulting correct sequence by deleting some (possibly zero) number of brackets.
Example:
initial sequence <]}} {([])
the answer: <[] {} {} ([]) <>>
Your proposed answer doesn't seem to fit the requirements. For example, it doesn't look (at least to me) like you can generate the }}{ sequence by deleting elements from <[] {} {} ([]) <>>. You also seem to have a completely unnecessary pair of angle brackets. Presumably, your intent is also that the brackets in the generated sequence are balanced--otherwise, the correct answer is to simply leave the original sequence unchanged. With no other requirements, that's clearly the shortest sequence from which you can generate that sequence by deleting (zero) items.
If the requirement for balancing is correct, it looks like your original input has four possible correct results:
<[]{}{}{([])}>
<[]{}{}{}([])>
<>[]{}{}{}([])
<>[]{}{}{([])}
All of these are the same length, so I don't see a particular reason to prefer one over the others. This looks enough like homework that I'm not going to just give a direct solution to the problem, but I think the simplest code you could write for the job would probably produce the first of these four solutions (and that may provide at least some guidance about how I'd solve the problem).
I'm reasonably certain this can be done entirely using counters--shouldn't need any sort of "context stacks" (though a stack-based solution is certainly possible as well).
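For reference, a minimal sketch of a stack-based greedy repair (the variant mentioned at the end); on this input it produces the first of the four results listed above. Note that a greedy repair is not guaranteed to be shortest for every input -- for example, for "({)" it yields "({()})" although "({})" would suffice -- so treat it as a starting point:

#include <iostream>
#include <stack>
#include <string>

// Return the matching opener for a closing bracket, else 0.
char openerFor(char c) {
    switch (c) {
        case ')': return '(';
        case ']': return '[';
        case '}': return '{';
        case '>': return '<';
        default:  return 0;
    }
}

// Return the matching closer for an opening bracket, else 0.
char closerFor(char c) {
    switch (c) {
        case '(': return ')';
        case '[': return ']';
        case '{': return '}';
        case '<': return '>';
        default:  return 0;
    }
}

std::string greedyBalance(const std::string& input) {
    std::string out;
    std::stack<char> open;                   // currently unclosed openers
    for (char c : input) {
        if (closerFor(c)) {                  // c is an opening bracket
            open.push(c);
            out += c;
        } else if (char o = openerFor(c)) {  // c is a closing bracket
            if (!open.empty() && open.top() == o)
                open.pop();                  // it closes the innermost opener
            else
                out += o;                    // insert the missing opener
            out += c;
        }                                    // anything else is ignored
    }
    while (!open.empty()) {                  // close whatever is still open
        out += closerFor(open.top());
        open.pop();
    }
    return out;
}

int main() {
    std::cout << greedyBalance("<]}}{([])") << '\n';  // prints: <[]{}{}{([])}>
}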

Transliteration between different writing systems [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I need to learn how to convert the transliteration of a text into another writing system. Apparently the best way would somehow involve regular expressions and Perl, probably from the command line? I have used regular expressions before in Notepad++ and TextWrangler, so I know some basics already. If there is some really good (and relatively easy and customizable) way to do this in Ruby or something else, I can start learning that as well. There is a constant need to transliterate linguistic sample texts in my field, Uralic linguistics, where many different variants of transliteration systems are used, so it is worth investing some time.
The material I have now consists of lines with one sentence per line. Some lines contain other data, such as numbers, and those should stay as they are. I want to keep the punctuation marks as they are; this is just about converting one set of Unicode letter characters to another. I searched the site, but a lot of what I found was about converting from ASCII to Unicode and so on -- that is not the problem here.
So the original text is like this (in broad Finno-Ugric Transcription):
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.
And I would need it in a form like this:
мӧдiс иван велӧччыны печораӧ щӧтӧвӧднэй курс вылӧ.
This continues for some thousand lines.
There is a clear correspondence between the characters used, but it is sometimes complex and involves dealing first with digraphs and consonant + vowel combinations, etc. As you can see from the example, in some positions Latin i corresponds to Cyrillic и, but in other positions it can remain as i. Different texts have different solutions, so I would need to adjust the rules in each case. I understand I would need to run a long series of regular expressions in a very specific order to make this work. That order I will figure out myself, but I need to know what kind of tool to feed these rules into and how to do it.
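A minimal sketch of such an ordered replacement pipeline, assuming UTF-8 text and purely literal (non-regex) rules; the applyRules function and the tiny correspondence table are made up for illustration, and a real table would be much longer and tuned per text:

#include <iostream>
#include <string>
#include <vector>

// One ordered rule: replace every occurrence of `from` with `to`.
// Rules are applied strictly in order, so digraphs such as šč must
// appear before the rules for their component letters.
struct Rule { std::string from, to; };

std::string applyRules(std::string line, const std::vector<Rule>& rules) {
    for (const auto& rule : rules) {
        std::string::size_type pos = 0;
        while ((pos = line.find(rule.from, pos)) != std::string::npos) {
            line.replace(pos, rule.from.size(), rule.to);
            pos += rule.to.size();  // skip past the replacement
        }
    }
    return line;
}

int main() {
    // A made-up excerpt of a correspondence table.
    const std::vector<Rule> rules = {
        {"šč", "щ"},  // digraph rule first
        {"ö",  "ӧ"},
        {"m",  "м"},
        {"d",  "д"},
    };

    const std::string line = "mödis";
    // Print the original and converted line separated by a tab.
    std::cout << line << '\t' << applyRules(line, rules) << '\n';
}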
I also often have situations where I would like to have the original sentence and the transliterated one separated by a tab, so that the lines have a form like this:
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.	мӧдiс иван велӧччыны печораӧ щӧтӧвӧдней курс вылӧ.
Of course there are many more questions, but after learning these basics I think I can move forward independently. Learning this would help me a lot. Thanks in advance!
Niko