Most frequent substring of fixed length - simple solution needed [closed] - c++

Please describe (without implementation!) algorithm (possibly fastest)
which receives string of n letters and positive integer k as
arguments, and prints the most frequent substring of length k (if
there are multiple such substrings, algorithm prints any one of them).
String is composed of letters "a" and "b". For example: for string
ababaaaabb and k=3 the answer is "aba", which occurs 2 times (the fact
that they overlap doesn't matter). Describe an algorithm, prove its
correctness and calculate its complexity.
I can use only the most basic features of C++: no vectors, classes, objects etc. I also don't know about strings, only char arrays. Can someone please explain to me what the algorithm would be, possibly with an implementation in code for easier understanding? This is a question from a university exam, which is why it's so weird.

A simple solution is to try all possible substrings from left to right (i.e. starting at indices i=0 to n-k), comparing each to the substrings that follow it (i.e. starting at indices j=i+1 to n-k).
For every i-substring, you count the number of occurrences, and keep track of the most frequent so far. Counting only forward is enough, because the first occurrence of a most frequent substring sees all of its other occurrences.
As a string comparison costs at worst k character comparisons, and you will be performing (n-k+1)(n-k)/2 such comparisons, the total cost is of order O(k(n-k)²). [In fact the cost can be lower because some of the string comparisons may terminate early, but I am not able to perform the evaluation.]
This solution is simple but probably not efficient.
In theory you can reduce the cost using a more efficient string matching algorithm, such as Knuth-Morris-Pratt, resulting in O((n-k)(n+k)) operations.
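A minimal sketch of the brute-force approach, restricted to plain char arrays as the question asks. The function name `most_frequent_substring` and its out-parameter `best_count` are illustrative choices, not part of the original answer:

```cpp
// Brute force: for every start index i, count how many later substrings
// of length k are equal to the one starting at i. Cost is O(k(n-k)^2).
// Returns the start index of a most frequent substring, or -1 if k is
// not in 1..n. The number of occurrences is written to *best_count.
int most_frequent_substring(const char s[], int n, int k, int* best_count) {
    *best_count = 0;
    if (k <= 0 || k > n) return -1;
    int best_start = -1;
    for (int i = 0; i + k <= n; ++i) {
        int count = 1;  // the substring at i itself
        for (int j = i + 1; j + k <= n; ++j) {
            int m = 0;
            // Character-by-character comparison; stops at the first mismatch.
            while (m < k && s[i + m] == s[j + m]) ++m;
            if (m == k) ++count;
        }
        if (count > *best_count) {  // strict '>' keeps the leftmost winner
            *best_count = count;
            best_start = i;
        }
    }
    return best_start;
}
```

For the string ababaaaabb with k=3 this returns index 0 with a count of 2, i.e. "aba". ("aaa" also occurs twice, and the task accepts either.)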

Related

Is it recommended to remove duplicate words in word2vec algorithm? [closed]

I have data consisting of DNA sequences, where the words are kmers of length 6 and the sentences are DNA sequences. Each DNA sequence has 80 kmers (words).
The list of kmers I have contains around 130,000 kmers, but after removing the duplicate elements I would have only 4,500 kmers. This huge gap left me unsure whether to remove the duplicate kmers or not. My question is: is it recommended in this case to remove the duplicated kmers before running the word2vec algorithm?
Thanks.
Without an example, it's unclear what you mean by "removing the duplicate elements". (Does that mean, when the same token appears twice in a row? Or twice in one "sentence"? Or, as I'm not familiar with what your data looks like in this domain, something else entirely?)
That you say there are 130,000 tokens in the vocabulary, but then 4,500 later, is also confusing. Typically the "vocabulary" size is the number of unique tokens. Removing duplicate tokens couldn't possibly change the number of unique tokens encountered.
In the usual domain of word2vec, natural language, words don't often repeat one-after-another. To the extent they sometimes might – as in say the utterance "it's very very hot in here" – it's not really an important enough case that I've noticed anyone commenting about handling that "very very" differently than any other two words.
(If a corpus had some artificially-duplicated full-sentences, it might be the case that you'd want to try discarding the exact-duplicate-sentences. Word2vec benefits from a variety of different usage-examples. Repeating the same sentence 10 times essentially just overweights those training examples – it's not nearly as good as 10 contrasting, but still valid, examples of the same words' usage.)
You're in a different domain that's not natural language, with different co-occurrence frequencies, and different end-goals. Word2vec might prove useful, but it's unlikely any general rules-of-thumb or recommendations from other domains will be useful. You should test things both ways, evaluate the results on your ultimate task in a robust repeatable way, and choose based on what you discover.

C++ sort() function algorithm [closed]

Some days ago I wanted to use C++ sort() function to sort an array of strings, but I had a problem!
What algorithm does it use to sort the array? Is it a deterministic one or may it use different algorithms based on the type of the array?
Also, is there a clear time complexity analysis about it?
Does this function use the same algorithm for sorting numbers array and strings array?
It might or it might not. That is not specified by the standard.
And if we use it to sort an array of strings whose total size is less than 100,000 characters, would it work in less than 1 second (in the worst case)?
It might or it might not. It depends on the machine you're running the program on. Even if it does run in less than 1 second in the worst case on a particular machine, that would be difficult to prove. But you can get a decent estimate by measuring. A measurement only applies to the machine it was performed on, of course.
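One way to take such a measurement is sketched below with `std::chrono::steady_clock`. The helper name `time_sort_ms` is made up for illustration. Note that the standard does bound `std::sort` at O(N log N) comparisons (since C++11), but each string comparison can itself cost up to the length of the strings, so the wall-clock time still has to be measured:

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <vector>

// Sorts 'words' in place with std::sort and returns the elapsed
// wall-clock time in milliseconds. The result is only meaningful for
// the machine, compiler and optimization level it was measured with.
double time_sort_ms(std::vector<std::string>& words) {
    const auto start = std::chrono::steady_clock::now();
    std::sort(words.begin(), words.end());
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To approximate the question's scenario, fill the vector with strings totalling about 100,000 characters, run the function several times on fresh copies of the data, and look at the slowest run rather than the average.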

C++ compare two sentences [closed]

I randomly came across this:
Develop an algorithm to compare two sentences to see if
they match or not. The key aspect of these sentences is that
the words could be in any order (e.g. "california is hot" and "
hot is california" are two sentences that would match).
Any ideas?
Parse each sentence into words, use space as delimiters.
Add all std::string words to a std::vector<std::string>, then sort.
Use operator== to compare the two vectors for equality.
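The three steps above can be sketched as follows. The function names `sorted_words` and `same_words_any_order` are illustrative, and the sketch assumes words are separated by whitespace and compared case-sensitively:

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Splits a sentence on whitespace and returns its words in sorted order.
std::vector<std::string> sorted_words(const std::string& sentence) {
    std::istringstream in(sentence);
    std::vector<std::string> words;
    for (std::string word; in >> word; ) words.push_back(word);
    std::sort(words.begin(), words.end());
    return words;
}

// Two sentences match if they contain the same words the same number
// of times, regardless of order.
bool same_words_any_order(const std::string& a, const std::string& b) {
    return sorted_words(a) == sorted_words(b);
}
```

With this sketch, "california is hot" and "hot is california" match, while "california is hot hot" and "hot is california" do not, because repeated words are counted.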
Perhaps put words into a std::map<string, int> and count up the element each time you find a word on the one side, and down on the other side, then iterate over the map and check that all entries are zero. [This assumes that "california is hot hot" isn't supposed to be the same as "hot is california", in which case you need a bit more logic, to only count words the first time you see them on each side]
Or put each word in each sentence into a std::vector<string>, then sort each vector and compare the two vectors. Again, strategy changes if the sentence needs to be recognised regardless of the number of times each word is seen.
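A minimal sketch of the map-based variant: count up for each word of the first sentence, down for each word of the second, then check that every entry is zero. The function name `same_word_counts` is hypothetical, and words are assumed to be whitespace-separated:

```cpp
#include <map>
#include <sstream>
#include <string>

// Returns true if both sentences contain exactly the same words with
// the same multiplicities, in any order.
bool same_word_counts(const std::string& a, const std::string& b) {
    std::map<std::string, int> balance;
    std::istringstream left(a);
    std::istringstream right(b);
    for (std::string word; left >> word; ) ++balance[word];   // count up
    for (std::string word; right >> word; ) --balance[word];  // count down
    for (const auto& entry : balance) {
        if (entry.second != 0) return false;  // word seen more often on one side
    }
    return true;
}
```

If repeated words should be ignored, one option is to collect each sentence's words into a `std::set<std::string>` and compare the two sets instead.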

Right sequence of brackets [closed]

Please help with writing a program in C++. We have a sequence of brackets of 4 kinds: (), [], {}, <>. We need to find the shortest correctly bracketed sequence of which the initial sequence is a subsequence, i.e. the initial sequence could be obtained from the resulting correct sequence by deleting some (possibly zero) brackets.
Example:
initial sequence <]}} {([])
the answer: <[] {} {} ([]) <>>
Your proposed answer doesn't seem to fit the requirements. For example, it doesn't look (at least to me) like you can generate the }}{ sequence by deleting elements from <[] {} {} ([]) <>>. You also seem to have a completely unnecessary pair of angle brackets. Presumably, your intent is also that the brackets in the generated sequence are balanced--otherwise, the correct answer is to simply leave the original sequence unchanged. With no other requirements, that's clearly the shortest sequence from which you can generate that sequence by deleting (zero) items.
If the requirement for balancing is correct, it looks like your original input has four possible correct results:
<[]{}{}{([])}>
<[]{}{}{}([])>
<>[]{}{}{}([])
<>[]{}{}{([])}
All these are the same length, so I don't see a particular reason to prefer one over the other. This looks enough like homework that I'm not going to just give a direct solution to the problem, but I think the simplest code you could write for the job would probably produce the first of these four solutions (and that may provide at least some guidance about how I'd solve the problem).
I'm reasonably certain this can be done entirely using counters--shouldn't need any sort of "context stacks" (though a stack-based solution is certainly possible as well).
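This is not a solution to the problem, only a sketch of how a candidate answer could be checked against the two requirements discussed above: the candidate must be balanced, and the input must be obtainable from it by deleting brackets. The helper names `is_balanced` and `is_subsequence` are illustrative, and the sketch assumes spaces have been removed from both sequences:

```cpp
#include <string>
#include <vector>

// Stack-based check that every bracket is closed by the matching kind,
// in the right order. Characters that are not brackets are ignored.
bool is_balanced(const std::string& s) {
    const std::string openers = "([{<";
    const std::string closers = ")]}>";
    std::vector<char> expected;  // closers we still have to see
    for (char c : s) {
        const std::size_t pos = openers.find(c);
        if (pos != std::string::npos) {
            expected.push_back(closers[pos]);
        } else if (closers.find(c) != std::string::npos) {
            if (expected.empty() || expected.back() != c) return false;
            expected.pop_back();
        }
    }
    return expected.empty();
}

// True if 'input' can be obtained from 'candidate' by deleting characters.
bool is_subsequence(const std::string& input, const std::string& candidate) {
    std::size_t matched = 0;
    for (char c : candidate) {
        if (matched < input.size() && c == input[matched]) ++matched;
    }
    return matched == input.size();
}
```

For the input `<]}}{([])`, the candidate `<[]{}{}{([])}>` passes both checks, while the sequence proposed in the question, `<[]{}{}([])<>>`, is balanced but fails the subsequence check.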

What is the fastest algorithm in DNA pattern matching [closed]

Suppose we have a string S with a length of several millions. The string only contains 'a' 't' 'g' 'c' and we have a pattern W with a length of roughly 20. What could be the fastest algorithm in C++ to find ALL occurrences of W in S? It seems KMP is not fast enough.
KMP is linear in S+W. You can't get faster than that.
You at least need to read the data, and that is also linear. So even if your algorithm is instant, you still can't do much better than KMP.
I suspect you do something wrong reading the data or traversing it in a way that destroys caching.
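For reference, a compact KMP sketch that reports every (possibly overlapping) occurrence in O(|S|+|W|) time. The function name `kmp_find_all` is made up for illustration, and the sketch assumes the whole text is already in memory as a `std::string`:

```cpp
#include <string>
#include <vector>

// Knuth-Morris-Pratt search. Returns the start index of every
// occurrence of 'pattern' in 'text', including overlapping ones.
std::vector<std::size_t> kmp_find_all(const std::string& text,
                                      const std::string& pattern) {
    std::vector<std::size_t> hits;
    const std::size_t m = pattern.size();
    if (m == 0 || m > text.size()) return hits;

    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of it.
    std::vector<std::size_t> fail(m, 0);
    for (std::size_t i = 1, len = 0; i < m; ) {
        if (pattern[i] == pattern[len]) {
            fail[i++] = ++len;
        } else if (len > 0) {
            len = fail[len - 1];
        } else {
            fail[i++] = 0;
        }
    }

    // Scan the text once; j is the number of pattern characters matched.
    for (std::size_t i = 0, j = 0; i < text.size(); ++i) {
        while (j > 0 && text[i] != pattern[j]) j = fail[j - 1];
        if (text[i] == pattern[j]) ++j;
        if (j == m) {
            hits.push_back(i + 1 - m);
            j = fail[j - 1];  // keep going to find overlapping matches
        }
    }
    return hits;
}
```

If a loop like this is much slower than expected on a few million characters, the time is probably being lost in I/O or in per-character copies rather than in the matching itself.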
You could try a suffix tree, although the tree has to be built first: O(n) with Ukkonen's algorithm, O(n log n) or worse with simpler constructions. For a single search KMP is therefore faster. So if you have multiple distinct 'W's to find I would go with a suffix tree; otherwise KMP is probably your best bet.
From the Wikipedia article on suffix arrays:
The suffix array of a string can be used as an index to quickly locate
every occurrence of a substring pattern P within the string S. Finding
every occurrence of the pattern is equivalent to finding every suffix
that begins with the substring. Thanks to the lexicographical
ordering, these suffixes will be grouped together in the suffix array
and can be found efficiently with two binary searches.
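A minimal sketch of the two binary searches described in the quote. The suffix array is built here by plain sorting, which is simple but far from the fastest construction. For a text of several million characters a dedicated construction algorithm would be needed. The names `build_suffix_array` and `find_all` are illustrative:

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive construction: sort all suffix start positions lexicographically.
std::vector<std::size_t> build_suffix_array(const std::string& s) {
    std::vector<std::size_t> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](std::size_t a, std::size_t b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Finds every occurrence of 'p' in 's' using two binary searches over
// the suffix array 'sa'. Returns the start positions in ascending order.
std::vector<std::size_t> find_all(const std::string& s,
                                  const std::vector<std::size_t>& sa,
                                  const std::string& p) {
    // First suffix whose leading |p| characters are not less than p.
    auto lo = std::lower_bound(sa.begin(), sa.end(), p,
        [&s](std::size_t suffix, const std::string& pat) {
            return s.compare(suffix, pat.size(), pat) < 0;
        });
    // First suffix whose leading |p| characters are greater than p.
    auto hi = std::upper_bound(lo, sa.end(), p,
        [&s](const std::string& pat, std::size_t suffix) {
            return s.compare(suffix, pat.size(), pat) > 0;
        });
    std::vector<std::size_t> hits(lo, hi);
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

Once the array exists, each query costs O(|P| log |S|) plus the number of matches, which is why this only pays off when many different patterns are searched in the same text.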