Caesar Cipher w/Frequency Analysis how to proceed next? - c++

I understand this has been asked before, and I have a rough grasp of how to compare frequency tables between the ciphertext and English (the language I'm assuming the plaintext is in for my program), but I'm unsure how to turn this into code.
void frequencyUpdate(std::vector<std::vector<std::string>> &file, std::vector<int> &freqArg) {
    for (auto &line : file) {           // each line of the file
        for (auto &word : line) {       // each word in the line
            for (auto &c : word) {      // each character in the word
                c = toupper(static_cast<unsigned char>(c));
                if (c >= 'A' && c <= 'Z') {     // count letters only
                    freqArg.at(c - 'A') += 1;   // 'A' -> 0, ..., 'Z' -> 25
                }
            }
        }
    }
}
This is how I get the letter frequencies of a given file whose contents are split into lines and then into words, hence the double vector of strings; each character's ASCII value minus 65 ('A') is used as the index. The resulting vector of ints holding the frequencies is saved.
Now is where I don't know how to proceed. Should I hardcode a const std::vector<int> with the English letter frequencies and then somehow do the comparison? How would I compare efficiently, rather than naively comparing the two vectors element by element, which is probably not efficient?
This comparison is for finding an appropriate shift value to decrypt a Caesar-ciphered text. I don't want to brute-force it by shifting one at a time until the text is readable. Any advice on how to approach this? Thanks.

Take your frequency vector and the frequency vector for "typical" English text, and find the cross-correlation.
The highest values of the cross-correlation correspond to the most likely shift values. At that point you'll need to use each one to decrypt, and see whether the output is sensible (i.e. forms real words and coherent sentences).
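A minimal sketch of that cross-correlation, assuming observed is the 26-entry count vector that frequencyUpdate fills (the frequency table is approximate published data):

#include <array>
#include <vector>

// Approximate relative frequencies of A..Z in typical English text (percent).
const std::array<double, 26> english = {
    8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
    6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07};

// Cross-correlate the observed counts with the English table at every shift;
// the shift that scores highest is the most likely Caesar key.
int bestShift(const std::vector<int> &observed) {
    int best = 0;
    double bestScore = -1.0;
    for (int shift = 0; shift < 26; ++shift) {
        double score = 0.0;
        for (int i = 0; i < 26; ++i)
            // ciphertext letter (i + shift) % 26 decrypts to plaintext letter i
            score += english[i] * observed[(i + shift) % 26];
        if (score > bestScore) { bestScore = score; best = shift; }
    }
    return best;
}

Decrypt with the returned shift first, then fall back to the next-best scores if the output isn't readable.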

In English, 'e' has the highest frequency. So whichever letter is most frequent in your ciphertext most likely maps to 'e'.
Since e --> X, the key should be the difference between 'e' and your most frequent letter X.
If this is not the right key (a too-short ciphertext distorts the statistics), try matching your most frequent ciphertext letter with the second most frequent letter in English, i.e. 't'.
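In code, that heuristic is only a couple of lines; a minimal sketch, assuming freq is the 26-entry count vector from the question:

#include <algorithm> // std::max_element

// Index of the most frequent ciphertext letter (0 = 'A', ..., 25 = 'Z').
int maxIdx = std::max_element(freq.begin(), freq.end()) - freq.begin();
// If plaintext 'e' was shifted onto that letter, the key is the difference mod 26.
int key = (maxIdx - ('E' - 'A') + 26) % 26;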

I would suggest a graph traversal algorithm. Your starting node has no substitutions assigned and has 26 connected nodes, one for each possible letter substitution for the most frequently occurring ciphertext letter. The next node has another 25 connected nodes for the possible letters for the second most frequent ciphertext letter (one less, since you've already used one possible letter). Which destination node you choose should be based on which letters are most likely given a normal frequency distribution for the target language.
At each node, you can test for success by doing your substitutions into the ciphertext, and finding all the resulting words that now match entries in a dictionary file. The more matches you've found, the more likely you've got the correct substitution key.
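A sketch of just that success test, with hypothetical names (dictionary would be an unordered_set loaded from a word-list file; loading is not shown):

#include <sstream>
#include <string>
#include <unordered_set>

// Fraction of decrypted tokens that appear in the dictionary; the higher the
// score, the more likely the current substitution key is correct.
double matchScore(const std::string &decrypted,
                  const std::unordered_set<std::string> &dictionary) {
    std::istringstream in(decrypted);
    std::string token;
    int total = 0, hits = 0;
    while (in >> token) {
        ++total;
        if (dictionary.count(token)) ++hits;
    }
    return total ? double(hits) / total : 0.0;
}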

Related

Choosing an efficient data structure to find rhymes

I've been working on a program that reads in a whole dictionary and utilizes the WordNet from CMU, which splits every word into its pronunciation.
The goal is to utilize the dictionary to find the best rhymes and alliterations of a given word, given the number of syllables in the word we need to find and its part of speech.
I've decided to use std::map<std::string, vector<Sound> > and std::multimap<int, std::string> where the map maps each word in the dictionary to its pronunciation in a vector, and the multimap is returned from a function that finds all the words that rhyme with a given word.
The int is the number of syllables of the corresponding word, and the string holds the word.
I've been working on the efficiency, but can't seem to get it to be more efficient than O(n). The way I'm finding all the words that rhyme with a given word is
vector<string> *rhymingWords = new vector<string>;
for (const auto &entry : soundMap) {   // soundMap: word -> vector<Sound>
    if (rhymingSyllables(word, entry.first) >= 1 && entry.first != word) {
        rhymingWords->push_back(entry.first);
    }
}
return rhymingWords;
And when I find the best rhyme for a word (a word that rhymes the most syllables with the given word), I do
vector<string> rhymes = *getAllRhymes(rhymesWith);
int maxRhymes = 0;
string bestRhyme;
for (const string &s : rhymes) {
    if (countSyllables(s) == numberOfSyllables) {
        int shared = rhymingSyllables(s, rhymesWith);
        if (shared > maxRhymes) {   // keep the deepest rhyme seen so far
            maxRhymes = shared;
            bestRhyme = s;
        }
    }
}
return bestRhyme;
The drawback is the O(n) access time in terms of the number of words in the dictionary. I'm thinking of ideas to drop this down to O(log n), but I seem to hit a dead end every time. I've considered using a tree structure, but can't work out the specifics.
Any suggestions? Thanks!
The rhymingSyllables function is implemented as such:
int syllableCount = 0;
if ((soundMap.count(word1) == 0) || (soundMap.count(word2) == 0)) {
    return 0; // unknown word: no rhyme
}
vector<Sound> &firstSounds = soundMap.at(word1), &secondSounds = soundMap.at(word2);
// walk both pronunciations backwards from the last sound, counting shared vowels
for (int i = firstSounds.size() - 1, j = secondSounds.size() - 1; i >= 0 && j >= 0; --i, --j) {
    if (firstSounds[i] != secondSounds[j]) return syllableCount;
    else if (firstSounds[i].isVowel()) ++syllableCount;
}
return syllableCount;
P.S.
The vector<Sound> is the pronunciation of the word, where Sound is a class that contains every different pronunciation of a morpheme in English: i.e,
AA vowel      AE vowel      AH vowel      AO vowel      AW vowel
AY vowel      B stop        CH affricate  D stop        DH fricative
EH vowel      ER vowel      EY vowel      F fricative   G stop
HH aspirate   IH vowel      IY vowel      JH affricate  K stop
L liquid      M nasal       N nasal       NG nasal      OW vowel
OY vowel      P stop        R liquid      S fricative   SH fricative
T stop        TH fricative  UH vowel      UW vowel      V fricative
W semivowel   Y semivowel   Z fricative   ZH fricative
Perhaps you could group the morphemes that will be matched during rhyming and compare not the vectors of morphemes, but vectors of associated groups. Then you can sort the dictionary once and get a logarithmic search time.
After looking at rhymingSyllables implementation, it seems that you convert words to sounds, and then match any vowels to each other, and match other sounds only if they are the same. So applying advice above, you could introduce an extra auxiliary sound 'anyVowel', and then during dictionary building convert each word to its sound, replace all vowels with 'anyVowel' and push that representation to dictionary. Once you're done sort the dictionary. When you want to search a rhyme for a word - convert it to the same representation and do a binary search on the dictionary, first by last sound as a key, then by previous and so on. This will give you m*log(n) worst case complexity, where n is dictionary size and m is word length, but typically it will terminate faster.
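A minimal sketch of that idea, assuming each word has already been converted to a list of ARPABET-style phoneme strings (plain std::string stands in for the question's Sound class; all names here are illustrative):

#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

static const std::set<std::string> kVowels = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
    "IH", "IY", "OW", "OY", "UH", "UW"};

// Reverse the phonemes and collapse every vowel to "V" so any two vowels
// compare equal, mirroring how rhymingSyllables treats sounds.
std::string rhymeKey(const std::vector<std::string> &phonemes) {
    std::string key;
    for (auto it = phonemes.rbegin(); it != phonemes.rend(); ++it) {
        key += kVowels.count(*it) ? std::string("V") : *it;
        key += '.';  // separator so adjacent phonemes cannot merge
    }
    return key;
}

int main() {
    // Build once: (key, word) pairs sorted by key.
    std::vector<std::pair<std::string, std::string>> dict = {
        {rhymeKey({"K", "AE", "T"}), "cat"},
        {rhymeKey({"B", "AE", "T"}), "bat"},
        {rhymeKey({"D", "AO", "G"}), "dog"},
    };
    std::sort(dict.begin(), dict.end());

    // Binary search: the entries adjacent to the probe's position share the
    // longest key prefix, i.e. rhyme on the most trailing sounds.
    std::string probe = rhymeKey({"R", "AE", "T"});  // "rat"
    auto pos = std::lower_bound(dict.begin(), dict.end(),
                                std::make_pair(probe, std::string()));
    if (pos != dict.end())   std::cout << "candidate: " << pos->second << "\n";
    if (pos != dict.begin()) std::cout << "candidate: " << std::prev(pos)->second << "\n";
}

This prints "cat" as the nearest rhyme for "rat"; extending the shared-prefix comparison to score the neighbouring candidates gives the m*log(n) lookup described above.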
You could also exploit the fact that for the best rhyme you consider only words with a certain syllable count, and maintain a separate dictionary per syllable count. Then you count the syllables in the word you are looking up rhymes for and search the appropriate dictionary. Asymptotically this doesn't give you any gain, but the speedup may be useful in your application.
I've been thinking about this and I could probably suggest an approach to an algorithm.
I would first take the dictionary and divide it into multiple buckets or batches, where each batch represents the number of syllables each word has. Traversing the vector of strings to distribute words into buckets is linear. The first bucket holds all one-syllable words, so there is nothing more to do with it; every bucket after that needs each word separated into its syllables. If you have, say, 25 buckets, you know the first few and the last few will not hold many words, so their processing time shouldn't be significant and they can be done first. The buckets in the middle, holding words of roughly 3-6 syllables, will be the largest, so if a bucket's size is over a certain amount you could run it on a separate thread and process those buckets in parallel. Once you are done, each bucket should return a std::vector<std::shared_ptr<Word>>, where your structure might look like this:
enum SpeechSound {
    SS_AA,
    SS_AE,
    SS_...          // one enumerator per sound
    SS_ZH
};

enum SpeechSoundType {
    ASPIRATE,
    ...
    VOWEL
};

struct SyllableMorpheme {
    SpeechSound sound;
    SpeechSoundType type;
};

class Word {
private:
    std::string m_strWord;
    // These two containers should match in size! One string for each
    // syllable & one matching struct from above containing two enums.
    std::vector<std::string> m_vSyllables;
    std::vector<SyllableMorpheme> m_vMorphemes;

public:
    explicit Word( const std::string& word );

    std::string getWord() const;
    std::string getSyllable( unsigned index ) const;
    unsigned getSyllableCount() const;
    SyllableMorpheme getMorpheme( unsigned index ) const;

    bool operator==( const Word& other ) const;
    bool operator!=( const Word& other ) const;

private:
    Word( const Word& c );                       // Not Implemented
    Word& operator=( const Word& other ) const;  // Not Implemented
};
This time you will have new buckets, vectors of shared pointers to these class objects. Then you can easily write a function to traverse each bucket, or even several of them, since the buckets have the same signature and differ only in syllable count. Remember: each bucket should already be sorted alphabetically, since we only added words by syllable count and never changed the order read in from the dictionary.
Then you can easily compare whether two words are equal while checking for matching syllables and morphemes. And since these are contained in std::vector<std::shared_ptr<Word>>, you don't have to worry as much about memory cleanup either.
The idea is to use linear search, separation and comparison as much as possible; yet if your container gets too large, then create buckets and run in parallel multiple threads, or maybe use a hash table if it will suite your needs.
Another possibility with this class structure is that you could add more to it later if you wanted or needed to, such as another std::vector<std::string> for its definitions, and another for its parts of speech {noun, verb, etc.}. You could even add other std::vector<std::string> members for things such as homonyms and homophones, and even one for a list of all words that rhyme with it.
Now for your specific task of finding the best matching rhyme you may find that some words may end up having a list of Words that would all be considered a Best Match or Fit! Due to this you wouldn't want to store or return a single string, but rather a vector of strings!
Case Example:
To, Too, Two, Blue, Blew, Hue, Hew, Knew, New
Bare, Bear, Care, Air, Ayre, Heir, Fair, Fare, There, Their, They're
Plain, Plane, Rain, Reign, Main, Mane, Maine
Yes these are all single syllable rhyming words, but as you can see there are many cases where there are multiple valid answers, not just a single best case match. This is something that does need to be taken into consideration.

algorithm to identify string matches with hashing and no false positives

I need to find whether a given string matches strings in a list without having the strings in the list; basically I need to hash the strings and match only against the list of hashes. The problem is being sure that there are no false positives, so that only exact matches will be found and any other set of characters is not. This is of course easy with an actual list of strings (even a simple binary search will work), but I want an algorithm that works without the actual characters present (i.e. precalculated). A Bloom filter doesn't guarantee that some arbitrary set of characters won't be matched.
Update: this is similar to storing only password hashes and then hashing an entered password and comparing it to the hash list to see whether the password is one of them (admittedly not the usual use of a password). The reason for this requirement is to not have to ship the actual text, just the hashes.
Update 2: Is there another way to do this without a perfect hash function? I have hundreds of thousands of entries, finding a perfect hash is hard. Maybe something like a bloom filter but with a better guarantee?
A good cryptographic hash function (with sufficient bits) will make the probability of a false match extremely small; sufficiently small that brute force attacks are essentially impossible. Most security systems feel that such mechanisms are adequate.
If you want an absolute guarantee that no false positive is possible, then you'll actually need to include enough data to validate the input; that cannot be any shorter than the target strings (but it doesn't have to be any larger). In effect, you need to encrypt the target strings. Since the encryption key and the encrypted strings will both be visible, in order to avoid someone simply decrypting the encrypted strings you need to use an asymmetric cipher. Those are computationally expensive, but that might not be a problem for your environment.
Any perfect hash will do. Follow it up with a string compare to verify it is not a false positive.
Here's a "almost perfect" hashing algorithm:
MPQ Hash
It is used in the StarCraft save files. This algorithm is very efficient and has a very low collision probability (about 1:18889465931478580854784 on average).
Here is how this algorithm works.
1. Compute the three hashes (offset hash and two check hashes) and store them in variables.
2. Move to the entry of the offset hash.
3. Is the entry unused? If so, stop the search and return 'file not found'.
4. Do the two check hashes match the check hashes of the file we're looking for? If so, stop the search and return the current entry.
5. Move to the next entry in the list, wrapping around to the beginning if we were on the last entry.
6. Is the entry we just moved to the same as the offset hash (did we look through the whole hash table?)? If so, stop the search and return 'file not found'.
7. Go back to step 3.
And here is the hashing and hash table function:
unsigned long HashString(char *lpszFileName, unsigned long dwHashType)
{
    //lpszFileName is the string to be hashed.
    //dwHashType will change the hash value according to hash mode.
    //You can see how it's used in the beginning of GetHashTablePos().
    //cryptTable is the standard precomputed 0x500-entry MPQ table,
    //built elsewhere before the first call.
    unsigned char *key = (unsigned char *)lpszFileName;
    unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
    int ch;

    while (*key != 0)
    {
        ch = toupper(*key++); //Convert every character to upper case.

        //dwHashType will change the hash value in different hashing modes.
        //(Whether to calculate the position or to verify.)
        seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}
int GetHashTablePos(char *lpszString, MPQHASHTABLE *lpTable, int nTableSize)
{
    const int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;

    //nHash controls where the hash value of the string should be stored in the hash table.
    //nHashA and nHashB are used for verifying the match.
    int nHash = HashString(lpszString, HASH_OFFSET),
        nHashA = HashString(lpszString, HASH_A),
        nHashB = HashString(lpszString, HASH_B),
        nHashStart = nHash % nTableSize, nHashPos = nHashStart;

    while (lpTable[nHashPos].bExists)
    {
        if (lpTable[nHashPos].nHashA == nHashA && lpTable[nHashPos].nHashB == nHashB)
            return nHashPos;                        //If found, return the entry index
        else
            nHashPos = (nHashPos + 1) % nTableSize; //Not found, move to next position
        if (nHashPos == nHashStart)
            break;  //Back at the starting position: searched the whole table, stop.
    }
    return -1; //Error value
}
You might consider putting together a Bloom filter. That is basically a set of hash algorithms and hash tables which, taken as a group, can give a very high probability of a correct match. It's not 100%, but you can get as close to it as you like.
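For illustration, a minimal Bloom filter sketch using double hashing on top of std::hash (the sizes and probe count here are arbitrary, not tuned):

#include <bitset>
#include <functional>
#include <string>

class BloomFilter {
    static const size_t kBits = 1 << 20;  // bit-array size; tune to your data
    static const size_t kProbes = 7;      // hash probes per string; tune as well
    std::bitset<kBits> bits_;

    // Derive probe i from two std::hash values (double hashing).
    static size_t probe(const std::string &s, size_t i) {
        size_t h1 = std::hash<std::string>{}(s);
        size_t h2 = std::hash<std::string>{}(s + "#");  // cheap second hash
        return (h1 + i * h2) % kBits;
    }

public:
    void add(const std::string &s) {
        for (size_t i = 0; i < kProbes; ++i) bits_.set(probe(s, i));
    }
    // Never a false negative; false positives occur at a rate controlled by
    // the bit-array size and probe count, so this cannot give the asker's
    // hard no-false-positive guarantee.
    bool probablyContains(const std::string &s) const {
        for (size_t i = 0; i < kProbes; ++i)
            if (!bits_.test(probe(s, i))) return false;
        return true;
    }
};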

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates "wild card" (probe gave below a threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j = 0; j < counter; j++)       //counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        //boring stuff
    }

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j = 0; j < (int)str2.length(); j++)
    {
        int counter = 0, offset = 0;
        //stay inside both strings while characters match and aren't wildcards
        while (offset < (int)str1.length() && j + offset < (int)str2.length()
               && str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need about n^2/2 comparisons, and you will need half a terabyte of memory even if you store just one byte per string pair.
However, I assume what you are really interested in is long overlaps, those that have more than, say, 20 or 30 or even more than 80 characters in common; you probably don't really want to know that two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing whether it fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't have to, store the indexes of the original strings together with each substring. You'll get something like (example)
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.
e) Repeat 3d with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
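A sketch of steps 1 and 2 only, with illustrative names (what counts as a "relevant" substring will depend on your data):

#include <map>
#include <string>
#include <vector>

// Step 1: split a read on 'X' runs and keep the readable pieces.
std::vector<std::string> readableParts(const std::string &read) {
    std::vector<std::string> parts;
    std::string cur;
    for (char c : read) {
        if (c == 'X') {
            if (!cur.empty()) { parts.push_back(cur); cur.clear(); }
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) parts.push_back(cur);
    return parts;
}

// Step 2: occurrence counts, iterated longest-first via this ordering.
struct LongerFirst {
    bool operator()(const std::string &a, const std::string &b) const {
        return a.size() != b.size() ? a.size() > b.size() : a < b;
    }
};

std::map<std::string, int, LongerFirst>
buildTable(const std::vector<std::string> &reads) {
    std::map<std::string, int, LongerFirst> table;
    for (const std::string &r : reads)
        for (const std::string &part : readableParts(r))
            ++table[part];
    return table;  // begin() is the longest substring, as step 3a wants
}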
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random or you have very few matches, it will probably take longer than your original idea. But it might let you find the interesting parts first and skip the less interesting ones.
I don't see many ways around the fact that you need to compare each string with every other, including shifting them, and that is by itself very slow; a computing cluster seems the best approach.
The only thing I see how to improve is the string comparison by itself: replace A,C,T,G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea, though it is still an option to investigate), and then compare encoded items quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.
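A sketch of that encoding (one byte per base here for simplicity; packing two bases per byte is the denser variant mentioned above):

#include <cstdint>
#include <string>
#include <vector>

uint8_t encode(char c) {
    switch (c) {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;  // 'X': all bits set, so X & anything != 0
    }
}

std::vector<uint8_t> encodeString(const std::string &s) {
    std::vector<uint8_t> out;
    out.reserve(s.size());
    for (char c : s) out.push_back(encode(c));
    return out;
}

// Length of the matching run starting at offsets i and j: one AND per
// position replaces the two comparisons in the original inner loop.
int matchRun(const std::vector<uint8_t> &a, const std::vector<uint8_t> &b,
             size_t i, size_t j) {
    int run = 0;
    while (i < a.size() && j < b.size() && (a[i] & b[j])) {
        ++run; ++i; ++j;
    }
    return run;
}

Note that with 0x0F an X matches any base; if, like the question's original code, you want X to match nothing, encode it as 0x00 instead.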

Topic mining algorithm in c/c++

I am working on a subject-extraction-from-articles algorithm in C++.
First I wrote code to remove words like articles, prepositions, etc.
Then the rest of the words get stored in one char array: char *excluded_string[50] = { 0 };
while ((NULL != word) && (50 > i)) {
    ch[i] = strdup(word);
    excluded_string[j] = strdup(word);
    word = strtok(NULL, " ");
    skp = BoyerMoore_skip(ch[i], strlen(ch[i]));
    if (skp != NULL)   //word is an article or similar, skip it
    {
        i++;
        continue;
    }
    j++;
}
skp is NULL when ch[i] is not an article or a word of a similar category.
This function checks whether a word is an article, preposition, etc.
Now at the end excluded_string[] contains the set of required words. Now I want the occurrence count of each word in this array, and then the word with the maximum occurrences: all of them, if there is more than one.
What logic should I use?
What I thought is:
Taking a two-dimensional array, where the first column holds the word and the second column stores its count.
Then for each word, search the array, and for each occurrence of that word increment the count value stored in the second column.
But this is costly and also complex.
Any other idea?
If you wish to count the occurrences of each word in an array then you can do no better than O(n) (i.e. one pass over the array). However, if you try to store the word counts in a two dimensional array then you must also do a lookup each time to see if the word is already there, and this can quickly become O(n^2).
The trick is to use a hash table to do your lookup. As you step through your word list you increment the right entry in the hash table. Each lookup should be O(1), so it ought to be efficient as long as there are sufficiently many words to offset the complexity of the hashing algorithm and memory usage (i.e. don't bother if you're dealing with less than 10 words, say).
Then, when you're done, you just iterate over the entries in the hash table to find the maximum. In fact, I would probably keep track of that while counting the words so there's no need to do it after ("if thisWordCount is greater than currentMaximumCount then currentMaximum = thisWord").
I believe the standard C++ std::unordered_map type should do what you need.
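A minimal sketch of that counting loop, tracking the running maximum as described above:

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<std::string> words = {"cat", "dog", "cat", "bird", "cat"};

    std::unordered_map<std::string, int> counts;
    std::string currentMaximum;
    int currentMaximumCount = 0;

    for (const std::string &w : words) {
        int thisWordCount = ++counts[w];           // O(1) expected per lookup
        if (thisWordCount > currentMaximumCount) { // track the max as we count
            currentMaximumCount = thisWordCount;
            currentMaximum = w;
        }
    }
    std::cout << currentMaximum << " occurs " << currentMaximumCount << " times\n";
}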

Given a 2D matrix of characters we have to check whether the given word exist in it or not

Given a 2D matrix of characters we have to check whether the given word exist in it or not.
e.g.
s f t
d a h
r y o
we can find "rat in it
(top down , straight ,diagonal or anypath).. even in reverse order. with least complexiety.
my approach is
While traversing the 2d matrix ( a[][] ) row wise.
If ( a[i][j] == first character of given word ) {
search for rest of the letters in 4 directions i.e. right, right diagonally down, down and left diagonally down.
} else if( a[i][j] == last character of the given word ) {
search for remaining characters in reverse order in 4 directions i.e. left, right diagonally up, up, left diagonally up.
}
is there any better approach?
Let me describe a very cool data structure for this problem.
Go ahead and look up Tries.
It takes O(k) time to insert a k-length word into the Trie, and O(k) to look-up the presence of a k-length word.
If you have problems understanding the data structure, or implementing it, I'll be happy to help you there.
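A minimal Trie sketch over lowercase words (illustrative only):

#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next;  // one child per letter
    bool isWord = false;
};

// O(k) insert for a k-letter lowercase word.
void insert(TrieNode &root, const std::string &word) {
    TrieNode *node = &root;
    for (char c : word) {
        auto &slot = node->next[c - 'a'];
        if (!slot) slot = std::make_unique<TrieNode>();
        node = slot.get();
    }
    node->isWord = true;
}

// O(k) lookup.
bool contains(const TrieNode &root, const std::string &word) {
    const TrieNode *node = &root;
    for (char c : word) {
        node = node->next[c - 'a'].get();
        if (!node) return false;
    }
    return node->isWord;
}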
I think I would do this in two phases:
1) Iterate over the array, looking for instances of the first letter in the word.
2) Whenever you find an instance of the first letter, call a function that examines all adjacent cells (up to 8 of them) to see whether any of them holds the second letter of the word. For any second-letter matches found, that function calls itself recursively and looks for third-letter matches in cells adjacent to that, and so on. If the recursion ever reaches the final letter of the word and finds a match for it, then the word exists in the array. (Note that if you're not allowed to use a letter twice, you'll need to flag cells as 'already used' to keep the algorithm from re-using them. Probably the easiest way is to pass a vector of already-used cell coordinates by value into the recursive function, and have it ignore the contents of any cells in that list.)
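A sketch of those two phases, with illustrative names (the used matrix is passed by reference with explicit backtracking here, which avoids the copies of the pass-by-value approach but is otherwise equivalent):

#include <string>
#include <vector>

bool search(const std::vector<std::string> &grid, const std::string &word,
            size_t k, int i, int j, std::vector<std::vector<bool>> &used) {
    int rows = grid.size(), cols = grid[0].size();
    if (i < 0 || i >= rows || j < 0 || j >= cols) return false;
    if (used[i][j] || grid[i][j] != word[k]) return false;
    if (k + 1 == word.size()) return true;  // final letter matched
    used[i][j] = true;                      // flag cell as 'already used'
    for (int di = -1; di <= 1; ++di)        // examine all 8 adjacent cells
        for (int dj = -1; dj <= 1; ++dj)
            if ((di || dj) && search(grid, word, k + 1, i + di, j + dj, used)) {
                used[i][j] = false;
                return true;
            }
    used[i][j] = false;                     // backtrack
    return false;
}

bool wordExists(const std::vector<std::string> &grid, const std::string &word) {
    if (word.empty() || grid.empty()) return false;
    std::vector<std::vector<bool>> used(grid.size(),
                                        std::vector<bool>(grid[0].size(), false));
    for (int i = 0; i < (int)grid.size(); ++i)      // phase 1: scan for first letter
        for (int j = 0; j < (int)grid[0].size(); ++j)
            if (search(grid, word, 0, i, j, used))  // phase 2: recurse
                return true;
    return false;
}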
In fact you have 16 sequences here:
sft
dah
ryo
sdr
fay
tho
sao
rat
tfs
had
oyr
rds
yaf
oht
oas
tar
(3 horizontal + 3 vertical + 2 diagonals) * 2 (reversed) = 16. Let n be the size of the matrix; in your example n = 3. Number of sequences = (n + n + 2) * 2 = 4n + 4.
Now you need to determine whether a sequence is a word or not. Create a hash set (unordered_set in C++, HashSet in Java) with words from a dictionary (found on the internet). You can check one sequence in O(1).
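A sketch for a square matrix stored as a vector of row strings (illustrative names):

#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> allSequences(const std::vector<std::string> &m) {
    int n = m.size();
    std::vector<std::string> seqs;
    for (int i = 0; i < n; ++i)                  // n rows
        seqs.push_back(m[i]);
    for (int j = 0; j < n; ++j) {                // n columns
        std::string col;
        for (int i = 0; i < n; ++i) col += m[i][j];
        seqs.push_back(col);
    }
    std::string d1, d2;                          // the two main diagonals
    for (int i = 0; i < n; ++i) { d1 += m[i][i]; d2 += m[i][n - 1 - i]; }
    seqs.push_back(d1);
    seqs.push_back(d2);
    int straight = seqs.size();                  // now double with reversals
    for (int s = 0; s < straight; ++s)
        seqs.push_back(std::string(seqs[s].rbegin(), seqs[s].rend()));
    return seqs;                                 // 4n + 4 sequences in total
}

bool anyWord(const std::vector<std::string> &m,
             const std::unordered_set<std::string> &dict) {
    for (const std::string &s : allSequences(m))
        if (dict.count(s)) return true;          // O(1) expected per sequence
    return false;
}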
Look for the first letter of your word using a simple loop, and when you find it use the following recursive function.
The function takes five parameters: the word you are looking for, str; your current position k within str; i and j as the position in the array to check; and the direction d.
The stop conditions will be:
- if k >= strlen(str), return 1;
- if arr[i][j] != str[k], return 0.
If neither of the above is true, you increment your letter counter (k++), update your i and j according to your value of d, and call your function again via return func(str, k, i, j, d);
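A sketch of that function, with the direction d encoded as an index into offset tables (names are illustrative; the grid is a vector of equal-length row strings):

#include <string>
#include <vector>

const int DX[8] = {-1, -1, -1, 0, 0, 1, 1, 1};  // the 8 straight-line directions
const int DY[8] = {-1, 0, 1, -1, 1, -1, 0, 1};

int func(const std::vector<std::string> &arr, const std::string &str,
         size_t k, int i, int j, int d) {
    if (k >= str.size()) return 1;                   // matched the whole word
    if (i < 0 || i >= (int)arr.size() ||
        j < 0 || j >= (int)arr[i].size()) return 0;  // walked off the grid
    if (arr[i][j] != str[k]) return 0;               // letter mismatch
    // next letter, next cell in the same direction
    return func(arr, str, k + 1, i + DX[d], j + DY[d], d);
}

The outer loop then tries func(arr, str, 0, i, j, d) for every cell (i, j) holding the first letter and every direction d from 0 to 7.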