Using multidimensional arrays to analyze sequences of RNA - C++

I'm currently learning about multidimensional arrays and was given the task of analyzing strands of RNA sequences (given from a .txt file). Here is an example of a strand:
AUGCUUAUUAACUGAAAACAUAUGGGUAGUCGAUGA
Given this string, I am to figure out what protein this RNA strand would create. In order to do so, I am to break down each strand into codons (groups of 3). So for this example, I need to look at AUG CUU AUU AAC UGA, etc. Each of these codons represents an amino acid. So AUG is methionine (represented by 'M'), CUU is leucine (represented by 'L'), and so on and so forth. My output should therefore be a new string of amino acids (M-L-I...).
What would be the best way to approach this problem? From my understanding, I'm to create a 3-D array, let's say
int aminoAcid[4][4][4]
Since there are 4 possible choices for each base (A, U, G, C). I'm not entirely sure where to go from here, though, since certain combinations will give the same amino acid.
EDIT: Am I going in the right direction if I were to first convert the string into number representations (A=0, U=1, G=2, C=3)? From there I can work better with a 3D array, right?

You can use the 3D array to connect amino acids to different sequences. You should learn about enum and figure out how you can use an enum for your array indices so that you can do something like
aminoAcid[A][U][G] = 24
where 24 also corresponds to methionine, meaning you can use another enum there. Use enums whenever you have a limited, known group of items you want to represent with numbers.
It sounds like this is just the beginning of a larger project, so you should follow good practices from the start, thinking about how you can build components that represent your problem.
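Here is a minimal sketch of that enum-plus-3D-array idea. The Base enum, the baseIndex helper, and the use of single-character amino acid codes (instead of a second enum) are just illustrative choices, and only the three codons from your example are filled in; a real table would cover all 64.

#include <iostream>
#include <string>

// Minimal sketch: map each base to an index, then use a 4x4x4 table of
// single-character amino acid codes. Only a few codons are filled in here;
// a real table would cover all 64.
enum Base { A = 0, U = 1, G = 2, C = 3 };

int baseIndex(char b) {
    switch (b) {
        case 'A': return A;
        case 'U': return U;
        case 'G': return G;
        case 'C': return C;
        default:  return -1;   // not a valid RNA base
    }
}

int main() {
    char aminoAcid[4][4][4] = {};          // '\0' means "not filled in yet"
    aminoAcid[A][U][G] = 'M';              // AUG -> methionine
    aminoAcid[C][U][U] = 'L';              // CUU -> leucine
    aminoAcid[A][U][U] = 'I';              // AUU -> isoleucine

    std::string rna = "AUGCUUAUU";
    for (std::size_t i = 0; i + 2 < rna.size(); i += 3) {
        int b1 = baseIndex(rna[i]), b2 = baseIndex(rna[i + 1]), b3 = baseIndex(rna[i + 2]);
        if (b1 < 0 || b2 < 0 || b3 < 0) continue;   // skip malformed codons
        std::cout << aminoAcid[b1][b2][b3] << ' ';
    }
    std::cout << '\n';                      // prints: M L I
}

Anything left at '\0' simply means that codon has not been entered in the table yet.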

Related

Is there a way to restrict string manipulation, e.g. substring?

The problem is that I'm processing some UTF8 strings and I would like to design a class or a way to prevent string manipulations.
String manipulation is not desirable for strings of multibyte characters as splitting the string at a random position (which is measured in bytes) may split a character half way.
I have thought about using const std::string& but the user/developer can create a substring by calling std::string::substr.
Another way would be create a wrapper around const std::string& and expose only the string through getters.
Is this even possible?
Another way would be create a wrapper around const std::string& and expose only the string through getters.
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as storage, and can provide an interface as you see fit to operate on Unicode code points or characters instead of modifying the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.
I would use a wrapper where your external interface provides access to either code points, or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points, and give you the next 4 code points. Alternatively, it would skip the first 3 characters, and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
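Here is a rough sketch of what such a wrapper could look like, operating on code points only (no grapheme clusters) and assuming well-formed UTF-8 input. Utf8String, bytes(), and advance() are made-up names for illustration, not an existing API.

#include <cstddef>
#include <string>

// Rough sketch of the wrapper idea, operating on code points only (no grapheme
// clusters). Utf8String and its members are illustrative names, not a real API.
class Utf8String {
public:
    explicit Utf8String(std::string utf8) : bytes_(std::move(utf8)) {}

    const std::string& bytes() const { return bytes_; }   // read-only access

    // substr measured in code points, not bytes; clamps at the end like std::string::substr.
    Utf8String substr(std::size_t pos, std::size_t count) const {
        std::size_t begin = advance(0, pos);
        std::size_t end   = advance(begin, count);
        return Utf8String(bytes_.substr(begin, end - begin));
    }

private:
    // Advance `n` code points from byte offset `from`; continuation bytes have
    // the form 10xxxxxx, so every non-continuation byte starts a code point.
    std::size_t advance(std::size_t from, std::size_t n) const {
        std::size_t i = from;
        while (n > 0 && i < bytes_.size()) {
            ++i;                           // step past the lead byte
            while (i < bytes_.size() && (static_cast<unsigned char>(bytes_[i]) & 0xC0) == 0x80)
                ++i;                       // skip continuation bytes (10xxxxxx)
            --n;
        }
        return i;
    }

    std::string bytes_;
};

With this, foo.substr(3, 4) means "skip 3 code points, take the next 4", independent of how many bytes those code points occupy.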
Quick aside for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to (about) 2^20 - 1. UTF-8 encodes a code point number in a sequence of 1 to 4 bytes.
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à would normally be encoded as an a followed by a "combining grave accent" (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a is encoded in a single byte and the U+0300 is encoded in two bytes. So it's one character, composed of two code points, encoded in three bytes.
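A tiny demo of those counts, assuming the compiler uses a UTF-8 source/execution encoding (the default for GCC and Clang):

#include <iostream>
#include <string>

int main() {
    // "a" followed by U+0300 COMBINING GRAVE ACCENT: one character ("à"),
    // two code points, three bytes in UTF-8 (0x61, then 0xCC 0x80).
    std::string s = "a\u0300";             // relies on a UTF-8 execution encoding
    std::cout << s << " uses " << s.size() << " bytes\n";   // prints 3
}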
That's not quite all there is to characters (as opposed to code points) but it's sufficient for quite a few languages (especially, for the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this but at least classically, it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU¹ to do most of the real work, or else resign yourself to this being a multi-year project in itself.
1. In theory, you could use another suitable library--but in this case, I'm not aware of such a thing existing.

DEFLATE method reasoning

Why does LZ77 DEFLATE use Huffman encoding for its second pass instead of LZW? Is there something about their combination that is optimal? If so, what is the nature of the output of LZ77 that makes it more suitable for Huffman compression than LZW or some other method entirely?
LZW tries to take advantage of repeated strings, just like the first "stage", as you call it, of LZ77. It then does a poor job of entropy coding that information. LZW has been completely supplanted by more modern approaches. (Except for its legacy use in the GIF format.) Once LZ77 generates a list of literals and matches, there is nothing left for LZW to take advantage of, and it would then make an almost completely ineffective entropy coder for that information.
Mark Adler could best answer this question.
The details of how the LZ77 and Huffman work together need some closer examination. Once the raw data has been turned into a string of characters and special length, distance pairs, these elements must be represented with Huffman codes.
Though this is NOT, repeat, NOT standard terminology, call the point where we start reading in bits a "dial tone." After all, in our analogy, the dial tone is where you can start specifying a series of numbers that will end up mapping to a specific phone. So call the very beginning a "dial tone." At that dial tone, one of three things could follow: a character, a length-distance pair, or the end of the block. Since we must be able to tell which it is, all the possible characters ("literals"), elements that indicate ranges of possible lengths ("lengths"), and a special end-of-block indicator are all merged into a single alphabet. That alphabet then becomes the basis of a Huffman tree. Distances don't need to be included in this alphabet, since they can only appear directly after lengths. Once the literal has been decoded, or the length-distance pair decoded, we are at another "dial-tone" point and we start reading again. If we got the end-of-block symbol, of course, we're either at the beginning of another block or at the end of the compressed data.
Length codes or distance codes may actually be a code that represents a base value, followed by extra bits that form an integer to be added to the base value.
...
Read the whole deal here.
Long story short: LZ77 provides duplicate elimination; Huffman coding provides bit reduction. It's also on the wiki.
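To make the "dial tone" structure above concrete, here is a toy dispatch loop, not a real inflater: real DEFLATE decodes Huffman-coded bits plus extra bits, whereas here the symbols are assumed to be already decoded integers carried in a little Item struct of my own invention. The symbol ranges are the ones from the DEFLATE format: 0-255 are literals, 256 is end-of-block, and codes above 256 (257-285 in the real format) introduce a length that is followed by a distance.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Toy illustration of the "dial tone" dispatch, not a real inflater: the
// symbols here are already decoded integers rather than Huffman-coded bits.
struct Item { int symbol; int length; int distance; };   // length/distance only used for matches

std::string inflateToy(const std::vector<Item>& items) {
    std::string out;
    for (const Item& it : items) {
        if (it.symbol < 256) {
            out.push_back(static_cast<char>(it.symbol));          // literal byte
        } else if (it.symbol == 256) {
            break;                                                // end of block
        } else {
            // Length/distance pair: copy `length` bytes starting `distance` back
            // (assumes the distance is valid; overlapping copies are intentional).
            std::size_t start = out.size() - static_cast<std::size_t>(it.distance);
            for (int i = 0; i < it.length; ++i)
                out.push_back(out[start + static_cast<std::size_t>(i)]);
        }
    }
    return out;
}

int main() {
    // "abc", then "copy 6 bytes from 3 back", then end-of-block.
    std::vector<Item> items = { {'a', 0, 0}, {'b', 0, 0}, {'c', 0, 0}, {257, 6, 3}, {256, 0, 0} };
    std::cout << inflateToy(items) << '\n';                       // prints abcabcabc
}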

How to find the number of occurrences of words in a string paragraph, using pointers, in a C++ program in a single iteration?

Suppose I have the following paragraph of text.
"A tape suggests the crew of the Costa Concordia ship mentioned only a "blackout" in a communication with Italian officials after hitting rocks.A tape suggests the crew of the Costa Concordia ship mentioned only a "blackout" in a communication with Italian officials after hitting rocks."
Now I have to write a C++ program which in a single traversal gives output like
A 2
tape 2
suggests 2
the 2
......... and so on.
Note: only one iteration, NO LOOP, using pointers.
As you traverse the text, keep a record of what words you've seen so far and how many times you've seen each.
Then, when you've finished, print out the results.
You should show us what you have written so far.
If you are stuck, and assuming you have to write in C (you spoke about pointers), take a look at
strtok -- Split string into tokens
The operation would then be linear with a hash table (n calls to strtok, ~O(1) access to the hash at each iteration, plus ~O(1) insertion if the word does not exist).
A hash table can be tricky to write in plain C (but you can look at a prefix tree, for example). You can use the libc hash table or, if it is an option, a C++ std::map.
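Here is a rough single-pass sketch in C++ along those lines: one char pointer walks the text once, a std::unordered_map plays the role of the hash table, and hand-rolled tokenizing (rather than strtok) keeps the input read-only. It is case-sensitive and treats anything non-alphabetic as a separator, which are arbitrary choices for the illustration.

#include <cctype>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    const std::string sentence =
        "A tape suggests the crew of the Costa Concordia ship mentioned only a "
        "\"blackout\" in a communication with Italian officials after hitting rocks.";
    const std::string text = sentence + sentence;   // the sample paragraph repeats the sentence

    std::unordered_map<std::string, int> counts;
    const char* p = text.c_str();
    while (*p != '\0') {
        while (*p != '\0' && !std::isalpha(static_cast<unsigned char>(*p)))
            ++p;                                        // skip separators
        const char* start = p;
        while (*p != '\0' && std::isalpha(static_cast<unsigned char>(*p)))
            ++p;                                        // consume one word
        if (p != start)
            ++counts[std::string(start, p)];            // count it
    }

    for (const auto& [word, n] : counts)
        std::cout << word << ' ' << n << '\n';
}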

Simple spell checking algorithm

I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, cause there would surely be thousands of permutations for each word? It also seems like it'd be very very slow if my initial dictionary of correctly spelled words was large?
I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simpler way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while the removal of a letter creates few "bad words", for the addition or substitution you have many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein Distance
The solution is described in incremental steps; normally the search speed should continuously improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein Distance algorithm
Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would be better performance-wise)
Key points:
The Levenshtein distance computation uses a matrix; at each step, the next row is computed solely from the previous row
The minimum distance in a row is always greater than (or equal to) the minimum of the previous row
The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (the threshold), then whenever the minimum of the current row is greater than 2, you can abandon the computation. A simple strategy is to return threshold + 1 as the distance (a sketch follows below).
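A minimal two-row sketch of that capped computation (levenshteinCapped is just an illustrative name):

#include <algorithm>
#include <string>
#include <vector>

// Short-circuited Levenshtein distance: the distance between `word` and
// `target`, except that as soon as every entry of the current row exceeds
// `threshold` we give up and return threshold + 1.
int levenshteinCapped(const std::string& word, const std::string& target, int threshold) {
    std::vector<int> prev(target.size() + 1), curr(target.size() + 1);
    for (std::size_t j = 0; j <= target.size(); ++j) prev[j] = static_cast<int>(j);

    for (std::size_t i = 0; i < word.size(); ++i) {
        curr[0] = static_cast<int>(i) + 1;
        int rowMin = curr[0];
        for (std::size_t j = 1; j <= target.size(); ++j) {
            int cost = (word[i] == target[j - 1]) ? 0 : 1;
            curr[j] = std::min({ prev[j] + 1,             // deletion
                                 curr[j - 1] + 1,         // insertion
                                 prev[j - 1] + cost });   // substitution
            rowMin = std::min(rowMin, curr[j]);
        }
        if (rowMin > threshold) return threshold + 1;     // short-circuit
        std::swap(prev, curr);
    }
    return prev[target.size()];
}

The linear scan of step 1 is then just a matter of calling this for every dictionary word and keeping the words with the best distance seen so far.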
1. First Tentative
Let's begin simple.
We'll implement a linear scan: for each word we compute the (short-circuited) distance and we list those words which achieved the smallest distance so far.
It works very well on smallish dictionaries.
2. Improving the data structure
The Levenshtein distance is at least equal to the difference in length.
By using the pair (length, word) as a key instead of just the word, you can restrict your search to the length range [length - edit, length + edit] and greatly reduce the search space (see the sketch below).
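A sketch of that restriction, reusing the levenshteinCapped function from the previous sketch; LengthKeyedDict and suggestions are made-up names:

#include <set>
#include <string>
#include <utility>
#include <vector>

// From the sketch above.
int levenshteinCapped(const std::string& word, const std::string& target, int threshold);

// Dictionary keyed by (length, word): only entries whose length is within
// maxEdit of the query length are ever visited.
using LengthKeyedDict = std::set<std::pair<std::size_t, std::string>>;

std::vector<std::string> suggestions(const LengthKeyedDict& dict,
                                     const std::string& query, int maxEdit) {
    std::size_t edit = static_cast<std::size_t>(maxEdit);
    std::size_t lo = query.size() > edit ? query.size() - edit : 0;
    std::size_t hi = query.size() + edit;

    std::vector<std::string> result;
    auto first = dict.lower_bound({ lo, std::string() });
    auto last  = dict.lower_bound({ hi + 1, std::string() });   // first entry longer than hi
    for (auto it = first; it != last; ++it)
        if (levenshteinCapped(query, it->second, maxEdit) <= maxEdit)
            result.push_back(it->second);
    return result;
}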
3. Prefixes and pruning
To improve on this, we can remark that when we build the distance matrix, row by row, one word is entirely scanned (the word we look for) but the other (the referent) is not: we only use one letter for each row.
This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.
Remember that I asked you to store the dictionary sorted? It means that words that share the same prefix are adjacent.
Suppose that you are checking your word against cartoon and at car you realize it does not work (the distance is already too large); then any word beginning with car won't work either, and you can skip words as long as they begin with car.
The skip itself can be done either linearly or with a search (find the first word whose prefix is greater than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefix (many words to skip)
How long is "long" depends on your dictionary and you'll have to measure. I would go with the binary search to begin with.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then, when reading carwash, you need to determine the length of the common prefix (here car) and you can keep the first 4 rows of the matrix (corresponding to the empty prefix, then c, a, r).
Therefore, when you begin computing carwash, you can in fact start iterating at w.
To do this, simply use an array allocated once at the beginning of your search, and make it large enough to accommodate the largest reference word (you should know the largest word length in your dictionary); a sketch follows below.
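Here is a sketch of that reuse; PrefixReuseMatcher is a made-up name, the threshold short-circuit from step 0 is left out for brevity, and it keeps one row per consumed letter of the dictionary word:

#include <algorithm>
#include <string>
#include <vector>

// Keep the whole matrix, one row per consumed letter of the dictionary word,
// and when moving to the next (sorted) word only recompute the rows past the
// prefix shared with the previous word.
class PrefixReuseMatcher {
public:
    explicit PrefixReuseMatcher(std::string query) : query_(std::move(query)) {
        rows_.emplace_back(query_.size() + 1);
        for (std::size_t j = 0; j <= query_.size(); ++j) rows_[0][j] = static_cast<int>(j);
    }

    // Distance between `word` and the query; call with words in sorted order
    // so that consecutive words share long prefixes.
    int distance(const std::string& word) {
        std::size_t keep = 0;                    // length of prefix shared with the previous word
        while (keep < word.size() && keep < prev_.size() && word[keep] == prev_[keep]) ++keep;
        rows_.resize(keep + 1);                  // rows for the shared prefix stay valid

        for (std::size_t i = keep; i < word.size(); ++i) {
            std::vector<int> next(query_.size() + 1);
            next[0] = static_cast<int>(i) + 1;
            for (std::size_t j = 1; j <= query_.size(); ++j) {
                int cost = (word[i] == query_[j - 1]) ? 0 : 1;
                next[j] = std::min({ rows_.back()[j] + 1,           // deletion
                                     next[j - 1] + 1,               // insertion
                                     rows_.back()[j - 1] + cost }); // substitution
            }
            rows_.push_back(std::move(next));
        }
        prev_ = word;
        return rows_.back()[query_.size()];
    }

private:
    std::string query_;
    std::string prev_;
    std::vector<std::vector<int>> rows_;
};

Feed it the dictionary words in sorted order; correctness does not depend on the order, but the amount of row reuse does.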
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a Trie or a Patricia Tree to store the dictionary. However, it's not an STL data structure and you would need to augment it to store, in each subtree, the range of word lengths that are stored, so you'll have to make your own implementation. It's not as easy as it seems, because there are memory explosion issues which can kill locality.
This is a last resort option. It's costly to implement.
You should have a look at this explanation by Peter Norvig on how to write a spelling corrector.
How to write a spelling corrector
Everything is well explained in this article; as an example, the Python code for the spell checker looks like this:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
Hope you can find what you need on Peter Norvig's website.
For a spell checker, many data structures would be useful, for example a BK-Tree. Check Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation of the same here.
My earlier code link may be misleading; this one is correct for the spelling corrector.
Off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
It should be quite fast, but I'm not sure about the code itself; I'm not very well-versed in C++.

Damerau–Levenshtein distance for language specific quirks

To Dutch-speaking people, the two characters "ij" are considered to be a single letter that is easily exchanged with "y".
For a project I'm working on I would like to have a variant of the Damerau–Levenshtein distance that calculates the distance between "ij" and "y" as 1 instead of the current value of 2.
I've been trying this myself but failed. My problem is that I do not have a clue on how to handle the fact that both texts are of different lengths.
Does anyone have a suggestion/code fragment on how to solve this?
Thanks.
The Wikipedia article is rather loose with terminology. There are no such things as "strings" in "natural language". There are phonemes in natural language which can be represented by written characters and character-combinations.
Some character-combinations are vestiges of historical conventions which have survived into modern times, as in modern English "rough" where the "gh" can sound like -f- or make no sound at all. It seems to me that in focusing on raw "strings" the algorithm must be agnostic about the historical relationship of language and orthographic convention, which leads to some arbitrary metrics whenever character-combinations correlate to a single phoneme. How would it measure "rough" to "ruf"? Or "through" to "thru"?
Or German o-umlaut to "oe"?
In your case the -y- can be exchanged phonetically and orthographically with -ij-. So what is that according to the algorithm, two deletions followed by an insertion, or a single deletion of the -j- or of the -i- followed by a transposition of the remaining character to -y-? Or is -ij- being coalesced and the coalescence is followed by a transposition?
I would recommend that you use another, otherwise unused character for -ij- before applying the algorithm, perhaps U+00EC, Latin small letter i with grave accent; a sketch of this preprocessing follows below.
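A minimal sketch of that substitution idea, using plain Levenshtein for brevity (the same preprocessing works in front of Damerau-Levenshtein). IJ_PLACEHOLDER, foldIj, and the sample words are all just illustrative choices:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Fold the digraph "ij" into a single placeholder character before measuring,
// so that "ij" <-> "y" becomes one substitution instead of two edits.
const char IJ_PLACEHOLDER = '\x01';   // any character that cannot occur in the input

std::string foldIj(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i + 1 < s.size() && s[i] == 'i' && s[i + 1] == 'j') {
            out.push_back(IJ_PLACEHOLDER);
            ++i;                       // consume the 'j' as well
        } else {
            out.push_back(s[i]);
        }
    }
    return out;
}

int levenshtein(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 0; i < a.size(); ++i) {
        curr[0] = static_cast<int>(i) + 1;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int cost = (a[i] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({ prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost });
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

int main() {
    std::cout << levenshtein("prijs", "prys") << '\n';                 // 2 without folding
    std::cout << levenshtein(foldIj("prijs"), foldIj("prys")) << '\n'; // 1 with folding
}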
How does the algorithm handle multi-codepoint characters?
Well, the D-L distance itself isn't going to handle it for you, due to the way it measures distances.
As there is no code (or language) involved here, I can only leave you with a suggestion to ensure all strings adhere to the same structure.
To clarify the situation, since you're asking in general terms: bear in mind that the D-L distance compares character by character and doesn't actually interpret your strings, so you'll have to parse before you compare; otherwise, cases where ij shouldn't be exchanged with y will cause other issues instead.
An idea is to translate each string into some sort of constructed orthographemic representation, where digraphs such as "ij" and the English "gh", "th", and friends are only one character long. The distance metric does not have to be equal for all types of replacements when doing Damerau-Levenshtein, so you can use whatever penalties you want, but the table needs to be filled locally, therefore you really want each sound to be one cell in the table.
This, however, breaks when the "ij" was not intended as "ij" but is a misspelling or sits at a word-segmentation border (I don't know if that can happen in Dutch), or in any other situation where it is not actually (meant as) a digraph.
Otherwise you will need to do some lookaround; this will complicate things but should not change the growth order of the algorithm (I believe), provided you only look at a constant number of cells around. The constant factors will still be much bigger, though. A sketch of the per-pair penalty idea follows below.
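And a sketch of the "whatever penalties you want" part: once the digraphs have been folded into single symbols as above, the substitution cost can be looked up per symbol pair instead of being a constant. The names and the scaled cost values (10 for insert/delete and ordinary substitutions, 5 for the ij/y pair) are arbitrary illustrations:

#include <algorithm>
#include <string>
#include <vector>

// Per-pair substitution penalty; costs are scaled by 10 so that "cheap"
// phonetic substitutions can cost less than insertions/deletions.
int subCost(char a, char b) {
    if (a == b) return 0;
    const char IJ = '\x01';                                         // placeholder from the folding step
    if ((a == IJ && b == 'y') || (a == 'y' && b == IJ)) return 5;   // near-equivalent sounds
    return 10;                                                      // ordinary substitution
}

int weightedLevenshtein(const std::string& a, const std::string& b) {
    const int indel = 10;                                           // insertion/deletion penalty
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j) * indel;
    for (std::size_t i = 0; i < a.size(); ++i) {
        curr[0] = (static_cast<int>(i) + 1) * indel;
        for (std::size_t j = 1; j <= b.size(); ++j)
            curr[j] = std::min({ prev[j] + indel,
                                 curr[j - 1] + indel,
                                 prev[j - 1] + subCost(a[i], b[j - 1]) });
        std::swap(prev, curr);
    }
    return prev[b.size()];
}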