How to print histogram of ordered C++ string array?

I read a given text file and then fill my array with each word in the text file (I do a check to make sure the file is under 100 words, storing the number of words up top). I order the words alphabetically (using bubble sort) and end up with an array of words in order, each occurring a different number of times, for example:
string stringText[10] = {"alpha", "alpha", "bravo", "charlie", "charlie", "charlie", ...}
I need to print a histogram where I have each word followed by an 'x' for each occurrence of the word (to create a histogram):
alpha: xx
bravo: x
charlie: xxx
and so on...
My question, I guess, is: should I edit the array and get rid of duplicate elements, or just print the first occurrence of each unique element followed by how many times it occurs? If I delete duplicate elements, my approach would be to go back into the string I read and count how many times that word occurs. I'm more inclined towards leaving the array alone and just printing the first occurrence of each unique word followed by an 'x' for each occurrence, but I'm not sure how to implement that.
I'm not allowed to use maps, vectors, etc.

The code below requires the array to be sorted, and it places no limit on the number of words.
const int wordCount = 6;
string stringText[wordCount] = {"alpha", "alpha", "bravo", "charlie", "charlie", "charlie"};
int counter = 0;
while (counter < wordCount)
{
    // Print the word and the 'x' for its first occurrence.
    cout << stringText[counter];
    cout << " : x";
    // Print one more 'x' per duplicate. Because the array is sorted,
    // duplicates are consecutive; advancing counter here also makes the
    // outer loop skip past them.
    for (int i = counter + 1; i < wordCount; ++i)
    {
        if (stringText[i] == stringText[counter])
        {
            cout << "x";
            counter++;
        }
    }
    cout << endl;
    counter++;
}
And the output is :
alpha : xx
bravo : x
charlie : xxx

This would be easy with a map... but if you can't use one, build an int array aligned with your sorted words:
int count[nb_word];
where
count[0]
holds the number of occurrences of the first word (in your example, alpha).
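A minimal sketch of that idea (my own illustration, not the answerer's code), assuming the sorted array from the question: one pass fills a parallel count array, a second pass prints the histogram.

#include <iostream>
#include <string>
using namespace std;

int main()
{
    const int wordCount = 6;
    string stringText[wordCount] = {"alpha", "alpha", "bravo", "charlie", "charlie", "charlie"};

    // Parallel arrays: uniqueWords[i] is a distinct word, count[i] its occurrences.
    string uniqueWords[wordCount];
    int count[wordCount] = {0};
    int unique = 0;

    for (int i = 0; i < wordCount; ++i)
    {
        // The input is sorted, so a new word always differs from its predecessor.
        if (i == 0 || stringText[i] != stringText[i - 1])
            uniqueWords[unique++] = stringText[i];
        ++count[unique - 1];
    }

    for (int i = 0; i < unique; ++i)
    {
        cout << uniqueWords[i] << ": ";
        for (int j = 0; j < count[i]; ++j)
            cout << "x";
        cout << endl;
    }
}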

Just keep track of the previous element: if the current one is the same, print an 'x'; if not, print the new element followed by its first-occurrence mark.
I am not giving an implementation or a completely exact algorithm on purpose, as this seems like an exercise.

Related

Caesar Cipher w/Frequency Analysis how to proceed next?

I understand this has been asked before, and I somewhat have a grasp on how to compare frequency tables between the ciphertext and English (the language I'm assuming it's in for my program), but I'm unsure how to get this into code.
void frequencyUpdate(std::vector< std::vector<std::string> > &file, std::vector<int> &freqArg) {
    for (int itr_1 = 0; itr_1 < file.size(); ++itr_1) {                               // each line
        for (int itr_2 = 0; itr_2 < file.at(itr_1).size(); ++itr_2) {                 // each word
            for (int itr_3 = 0; itr_3 < file.at(itr_1).at(itr_2).length(); ++itr_3) { // each char
                // Upper-case the character in place.
                file.at(itr_1).at(itr_2).at(itr_3) = toupper(file.at(itr_1).at(itr_2).at(itr_3));
                // Count it only if it is a letter 'A'..'Z' (ASCII 65..90).
                if (!(file.at(itr_1).at(itr_2).at(itr_3) < 'A' || file.at(itr_1).at(itr_2).at(itr_3) > 'Z')) {
                    int temp = file.at(itr_1).at(itr_2).at(itr_3) - 'A';
                    freqArg.at(temp) += 1;
                }
            }
        }
    }
}
This is how I get the frequency for a given file whose contents are split into lines and then into words, hence the double vector of strings; I use the ASCII value of each char minus 65 ('A') as the index. The resulting vector of ints holding the frequencies is saved.
Now is where I don't know how to proceed. Should I hardcode a const std::vector<int> with the English letter frequencies and then somehow do the comparison? And how would I compare efficiently, rather than naively comparing the vectors element against element, which is probably not efficient?
This comparison is for getting an appropriate shift value for Caesar cipher shifting to decrypt a text. I don't want to use brute force and shift one at a time until the text is readable. Any advice on how to approach this? Thanks.
Take your frequency vector and the frequency vector for "typical" English text, and find the cross-correlation.
The highest values of the cross-correlation correspond to the most likely shift values. At that point you'll need to use each one to decrypt, and see whether the output is sensible (i.e. forms real words and coherent sentences).
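Not the poster's code, just a minimal sketch of that cross-correlation, assuming 26-bucket letter counts like the freqArg vector above; the English table is approximate, and any published one will do.

#include <vector>

// Approximate relative frequencies of 'a'..'z' in English text.
const double english[26] = {
    8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.77, 4.0, 2.4,
    6.7, 7.5, 1.9, 0.095, 6.0, 6.3, 9.1, 2.8, 0.98, 2.4, 0.15, 2.0, 0.074
};

// Score each candidate shift by cross-correlating the ciphertext counts
// with the English table; the highest score marks the most likely key.
int bestShift(const std::vector<int> &cipherFreq) {
    int best = 0;
    double bestScore = -1.0;
    for (int shift = 0; shift < 26; ++shift) {
        double score = 0.0;
        for (int i = 0; i < 26; ++i)
            score += english[i] * cipherFreq[(i + shift) % 26];  // plaintext i -> ciphertext (i + shift) % 26
        if (score > bestScore) {
            bestScore = score;
            best = shift;
        }
    }
    return best;
}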
In English, 'e' has the highest frequency. So whichever letter is most frequent in your ciphertext most likely maps to 'e'.
Since e --> X, the key should be the difference between your most frequent letter X and 'e'.
If this is not the right key (a too-short ciphertext can distort the statistics), try matching your most frequent ciphertext letter with the second most frequent letter in English, i.e. 't'.
I would suggest a graph traversal algorithm. Your starting node has no substitutions assigned and has 26 connected nodes, one for each possible letter substitution for the most frequently occurring ciphertext letter. The next node has another 25 connected nodes for the possible letters for the second most frequent ciphertext letter (one less, since you've already used one possible letter). Which destination node you choose should be based on which letters are most likely given a normal frequency distribution for the target language.
At each node, you can test for success by doing your substitutions into the ciphertext, and finding all the resulting words that now match entries in a dictionary file. The more matches you've found, the more likely you've got the correct substitution key.

Increase string overlap matrix building efficiency

I have a huge list (N = ~1 million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates "wild card" (probe gave below a threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j = 0; j < counter; j++)          //counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        //boring stuff
    }

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j = 0; j < str2.length(); j++)
    {
        int counter = 0, offset = 0;
        // Stop at the end of either string so we never read out of bounds.
        while (offset < (int)str1.length() && j + offset < (int)str2.length() &&
               str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you need n^2/2 comparisons, and you would need half a terabyte of memory even if you just stored one byte per string pair.
However, I assume what you are really interested in is long matches, those with more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know whether two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing whether it fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken up by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't have to, store the indexes of the original strings together with each substring. You'll get something like (example):
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you have found a match. Search your original strings for the substring if you haven't stored the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.
e) Repeat 3d) with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random or you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting ones.
I don't see many ways around the fact that you need to compare each string with every other, including shifting them, and that is by itself extremely long; a computing cluster seems the best approach.
The only thing I see how to improve is the string comparison itself: replace A, C, T, G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea, though it's still a possible option to investigate), and then compare them quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you get. That's just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.
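A minimal sketch of that encoding (names are mine, and it stores one 4-bit code per byte for simplicity rather than packing two per byte):

#include <cstdint>
#include <string>
#include <vector>

// Map a base to its bit pattern; X is all ones, so it matches any base.
uint8_t encodeBase(char c) {
    switch (c) {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;  // 'X' wildcard
    }
}

std::vector<uint8_t> encode(const std::string &s) {
    std::vector<uint8_t> out(s.size());
    for (size_t i = 0; i < s.size(); ++i)
        out[i] = encodeBase(s[i]);
    return out;
}

// Length of the run of compatible positions when b is laid over a at offset:
// a nonzero AND means the two codes share at least one possible base.
int overlapAt(const std::vector<uint8_t> &a, const std::vector<uint8_t> &b, size_t offset) {
    int run = 0;
    while (offset + run < a.size() && (size_t)run < b.size() &&
           (a[offset + run] & b[run]) != 0)
        ++run;
    return run;
}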

Topic mining algorithm in c/c++

I am working on a subject-extraction-from-articles algorithm using C++.
First I wrote code to remove words like articles, prepositions, etc.
The remaining words get stored in one char array: char *excluded_string[50] = { 0 };
while ((NULL != word) && (50 > i)) {
    ch[i] = strdup(word);
    excluded_string[j] = strdup(word);
    word = strtok(NULL, " ");
    skp = BoyerMoore_skip(ch[i], strlen(ch[i]));
    if (skp != NULL)    // word is an article or similar; leave it out
    {
        i++;
        continue;
    }
    j++;
}
skp is NULL when ch[i] is not an article or a word of a similar category.
This function checks whether a word belongs to articles, prepositions, etc.
Now at the end excluded_string[] contains the set of required words. I want the occurrence count of each word in this array, and then the word with the maximum occurrences; all of them if more than one.
What logic should I use?
What I thought is:
Take a two-dimensional array: the first column holds the word, and the 2nd column I can use for storing count values.
Then for each word, scan the array, increment the count for each occurrence of that word, and store the count in the 2nd column.
But this is costly and also complex.
Any other idea?
If you wish to count the occurrences of each word in an array then you can do no better than O(n) (i.e. one pass over the array). However, if you try to store the word counts in a two dimensional array then you must also do a lookup each time to see if the word is already there, and this can quickly become O(n^2).
The trick is to use a hash table to do your lookup. As you step through your word list you increment the right entry in the hash table. Each lookup should be O(1), so it ought to be efficient as long as there are sufficiently many words to offset the complexity of the hashing algorithm and memory usage (i.e. don't bother if you're dealing with less than 10 words, say).
Then, when you're done, you just iterate over the entries in the hash table to find the maximum. In fact, I would probably keep track of that while counting the words so there's no need to do it after ("if thisWordCount is greater than currentMaximumCount then currentMaximum = thisWord").
I believe the standard C++ unordered_map type should do what you need.
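A minimal sketch of that approach (my illustration, reading words from std::cin as a stand-in for the filtered word array), which also tracks the maximum while counting, as suggested above:

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> counts;
    std::string word, mostFrequent;
    int maxCount = 0;

    while (std::cin >> word) {
        int c = ++counts[word];
        // Track the running maximum so no second pass is needed.
        if (c > maxCount) {
            maxCount = c;
            mostFrequent = word;
        }
    }
    std::cout << mostFrequent << " occurs " << maxCount << " times\n";
}

If several words tie for the maximum, one extra pass over the map collects them all.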

count number of times a character appears in an array?

I've been thinking for a long time and haven't gotten anywhere with the program; I don't know where to begin. The assignment requires the use of a single function main and only the iostream library.
The task is to declare a char array of 10 elements, take input from the user, and determine whether the array contains any values more than 1 time. Do not show the characters that appear only 1 time.
Sample output:
a 2
b 4
..
a and b are characters, and 2 and 4 represent the number of times they appear in the array.
I tried to use a nested loop to compare a character with all the characters in the array, incrementing a counter each time a similar character is found, but unexpected results are occurring.
Here is the code:
#include <iostream>
using namespace std;
void main()
{
    char ara[10];
    int counter = 0;
    cout << "Enter 10 characters in an array\n";
    for (int a = 0; a < 10; a++)
        cin >> ara[a];
    for (int i = 0; i < 10; i++)
    {
        for (int j = i + 1; j < 10; j++)
        {
            if (ara[i] == ara[j])
            {
                counter++;
                cout << ara[i] << "\t" << counter << endl;
            }
        }
    }
}
Algorithm 2: std::map
Declare / define the container:
std::map<char, unsigned int> frequency;
Open the file and read a letter.
Find the letter: frequency.find(letter).
If the letter exists, increment its frequency: frequency[letter]++;
If the letter does not exist, insert it: frequency[letter] = 1;
After all letters are processed, iterate through the map, displaying each letter and its frequency.
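A minimal sketch of those steps (reading from std::cin instead of a file; note that std::map would break the assignment's iostream-only rule, but it shows the idea):

#include <iostream>
#include <map>

int main() {
    std::map<char, unsigned int> frequency;
    char letter;

    while (std::cin >> letter) {
        if (frequency.find(letter) != frequency.end())
            frequency[letter]++;     // letter seen before
        else
            frequency[letter] = 1;   // first occurrence
    }

    // Show only the characters that appear more than once, per the assignment.
    for (const auto &entry : frequency)
        if (entry.second > 1)
            std::cout << entry.first << " " << entry.second << "\n";
}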
Here's one possible way you can solve this. I'm not giving you full code; it's considered bad to just give full implementations for other people's homework.
First, fill a new array with only unique characters. For example, if the input was:
abacdadeff
The new array should only have:
abcdef
That is, every character should appear only once in it. Do not forget to \0-terminate it, so that you can tell where it ends (since it can have a length smaller than 10).
Then create a new array of int (or unsigned, since you can't have negative occurrences) values that holds the frequency of occurrence of every character from the unique array in the original input array. Every value should initially be 0, which you can achieve with a declaration like:
unsigned freq[10] = { 0 };
Now, iterate over the unique array and, every time you find the current character in the original input array, increment the corresponding element of the frequencies array. So at the end, for the above input, you would have:
a b c d e f (unique array)
3 1 1 2 1 2 (frequencies array)
And you're done. You can now tell how many times each character appears in the input.
Here, I'll tell you what you should do and you code it yourself:
include headers ( stdio libs )
define main ( entry point for your app )
declare input array A[amount_of_chars_in_your_input]
write output requesting user to input
collect input
now the main part:
declare another array of unsigned shorts B[]
declare counter int i = 0
declare counter int j = 0
loop through the array A[] ( in other words i < sizeof ( A ); or a[i] != '\0' )
now loop as many times as there are different letters in the array A
store the count of each letter in B[]
print it out
There are some tricks to applying this, but you can handle it.
Try this:
unsigned int frequency[26] = {0};
char letters[10];
Algorithm:
Open the file / read a letter.
Search the letters array for the new letter.
If the new letter exists, increment the frequency slot for that letter: frequency[toupper(new_letter) - 'A']++;
If the new letter is missing, add it to the array and set its frequency to 1.
After all letters are processed, print out the frequency array:
cout << char('A' + index) << ": " << frequency[index] << endl;
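A minimal sketch of that variant (indexing frequency directly by toupper(c) - 'A' makes the letters[10] array unnecessary; <cctype> is assumed for isalpha and toupper):

#include <cctype>
#include <iostream>
using namespace std;

int main()
{
    unsigned int frequency[26] = {0};
    char c;
    while (cin >> c)
        if (isalpha((unsigned char)c))
            frequency[toupper((unsigned char)c) - 'A']++;  // bucket 0..25

    for (int index = 0; index < 26; index++)
        if (frequency[index] > 1)
            cout << char('A' + index) << ": " << frequency[index] << endl;
}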

Given a string array, return all groups of strings that are anagrams

Given a string array, return all groups of strings that are anagrams.
My solutions:
For each string word in the array, sort it: O(m lg m), where m is the average length of a word.
Build up a hash table < string, list >.
Put the sorted word into the hash table as a key, and also generate all permutations (O(m!)) of the word; search for each permutation in a dictionary (a prefix-tree map) in O(m), and if it is in the dictionary, put it (O(1)) into the hash table, so that all permuted words end up in the list under the same key.
In total: O(n * m * lg m * m!) time and O(n * m!) space, where n is the size of the given array.
If m is large, this is not efficient because of the m! factor.
Any better solutions?
Thanks.
We define an alphabet which contains every letter we may have in our wordlist. Next, we need a different prime for each letter in the alphabet; I recommend using the smallest primes you can find.
That would give us the following mapping:
{ a => 2, b => 3, c => 5, d => 7, etc }
Now count the letters in the word you want to represent as integer, and build your result integer as follows:
Pseudocode:
result = 1
for each letter:
    result *= power(prime[letter], count(letter, word))
some examples:
aaaa => 2^4
aabb => 2^2 * 3^2 = bbaa = baba = ...
and so on.
So you will have an integer representing each word in your dictionary, and the word you want to check can likewise be converted to an integer. So if n is the size of your wordlist and k is the length of the longest word, it will take O(nk) to build your new dictionary and O(k) to check a new word.
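A minimal sketch of that signature in 64-bit arithmetic (which can overflow for long words; a big-integer type, or the sorted-string key from the other answers, avoids that):

#include <cstdint>
#include <string>

// One prime per letter 'a'..'z', smallest first.
const uint64_t primes[26] = {
     2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
};

// Two words are anagrams exactly when their signatures are equal.
uint64_t signature(const std::string &word) {
    uint64_t result = 1;
    for (char c : word)
        result *= primes[c - 'a'];  // assumes lowercase a..z input
    return result;
}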
Hackthissite.com has a programming challenge: given a scrambled word, look it up in a dictionary to see whether any anagrams of it are in the dictionary. There is a good article on an efficient solution to the problem, from which I have borrowed this answer; it also goes into detail on further optimisations.
Use counting sort to sort the word, so that sorting can be done in O(m).
After sorting, generate a key from the word and insert a (key, value) node into the hashtable. Generating the key can be achieved in O(m).
You can make the value in (key, value) a dynamic array that can hold more than one string.
Each time you insert a key which is already present, just push the original word from which the key was generated onto the value array.
So the overall time complexity is O(mn), where n is the total number of words (the size of the input).
Also, this link has solutions to similar problems:
http://yourbitsandbytes.com/viewtopic.php?f=10&t=42
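A minimal sketch of that O(m) key generation via counting sort (assuming lowercase a-z words; the key is the word's letters in sorted order):

#include <string>

std::string countingSortKey(const std::string &word) {
    int count[26] = {0};
    for (char c : word)
        count[c - 'a']++;
    std::string key;
    key.reserve(word.size());
    for (int i = 0; i < 26; i++)
        key.append(count[i], char('a' + i));  // emit each letter count[i] times
    return key;
}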
#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
    std::string word;
    std::map<std::string, std::set<std::string>> anagrams;

    // Group each word under its sorted-letters key.
    while (std::cin >> word) {
        std::string sortedWord(word);
        std::sort(sortedWord.begin(), sortedWord.end());
        anagrams[sortedWord].insert(word);
    }

    // Each map entry holds one group of anagrams.
    for (auto &pair : anagrams) {
        for (auto &word : pair.second) {
            std::cout << word << " ";
        }
        std::cout << "\n";
    }
}
I'll let someone who is better at big-O analysis than I am figure out the complexities.
Turn the dictionary into a mapping from the sorted characters of a word to every word with those characters, and store that. For each word you are given, sort it and add the list of anagrams from the mapping to your output.
I don't believe you can do better in O terms than:
sorting the letters of each word
sorting the list of sorted words
each set of anagrams will now be grouped consecutively.
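A minimal sketch of that sort-only grouping (no hash map; the sample words are mine):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words = {"eat", "tea", "tan", "ate", "nat", "bat"};

    // Pair each word with its sorted-letters form.
    std::vector<std::pair<std::string, std::string>> keyed;
    for (const auto &w : words) {
        std::string key(w);
        std::sort(key.begin(), key.end());
        keyed.emplace_back(key, w);
    }

    // Sorting by key brings each anagram group together.
    std::sort(keyed.begin(), keyed.end());
    for (size_t i = 0; i < keyed.size(); ++i) {
        std::cout << keyed[i].second;
        bool lastInGroup = (i + 1 == keyed.size()) || (keyed[i + 1].first != keyed[i].first);
        std::cout << (lastInGroup ? "\n" : " ");
    }
}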