Count sequences of letters in a string - C++

I'm working on a simple substitution-cipher decoder and using frequency analysis to decrypt the ciphertext. Looking at the frequency of individual letters isn't enough; I also need to count the occurrences of 2-letter sequences (and maybe 3-letter sequences).
My code for counting the occurrences of each letter is below:
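int counterRaw[256][2] = {0};
for (std::size_t i = 0; i < inputString.length(); i++)
    // cast to unsigned char so the array index is never negative
    counterRaw[static_cast<unsigned char>(inputString[i])][1]++;
int counterLetter[26][2] = {0};
for (int i = 0; i < 26; i++) {
    counterLetter[i][0] = 'A' + i;                 // the letter
    counterLetter[i][1] = counterRaw['A' + i][1];  // its count
}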
As you can see, it is very simple yet effective!
But I don't know how to write a 2-letter sequence counter. Do you have any ideas that could help me code this?
Thanks!
EDIT: As an example,
given AZAZ RTYU JKLM, I want my program to output:
AZ : 2
ZA : 1
ZR : 1
RT : 1
...

Something like the following would do the trick, though you'd have to do some jiggery pokery to make it suit your own needs.
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
int main()
{
    std::string message("some string that you will probably get from some encrypted file");
    std::map<std::string, int> occurrences;
    std::string seq(2, ' ');  // two-character buffer for the current pair
    // i runs up to the last character so that the final pair is counted too
    for (std::size_t i = 1; i < message.length(); i++)
    {
        seq[0] = message[i - 1];
        seq[1] = message[i];
        // ignore pairs that contain a space
        if (seq[0] != ' ' && seq[1] != ' ')
        {
            occurrences[seq]++;
        }
    }
    // let's have a look ...
    for (auto iter = occurrences.begin(); iter != occurrences.end(); ++iter)
    {
        std::cout << iter->first << " " << iter->second << "\n";
    }
    return 0;
}
output:
ab 1
at 1
ba 1
bl 1
cr 1
ed 1
en 1
et 1
fi 1
fr 1
ge 1
ha 1
il 2
in 1
le 1
ll 1
ly 1
me 2
nc 1
ng 1
ob 1
om 3
ou 1
pr 1
pt 1
ri 1
ro 2
ry 1
so 2
st 1
te 1
th 1
tr 1
wi 1
yo 1
yp 1

You should create a "composite letter" from two letters.
Since characters in C and C++ are just small integers, you can combine the two letters into a single number, e.g. int C = inputString[i] + 256 * inputString[i+1].
This assumes that the string holds chars and that their values are between 0 and 255 (unsigned char is safer than signed char here).
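A minimal sketch of this composite-key idea, assuming ASCII text where only letters matter. The names letterIndex and countBigrams, the case-insensitive handling, and the 26 x 26 table layout are illustrative choices, not part of the answer. Non-letters are skipped, so a pair may span a space, as in the ZR : 1 line of the question's example.

```cpp
#include <array>
#include <string>

// Hypothetical helper: 0..25 for an ASCII letter (case-insensitive), -1 otherwise.
int letterIndex(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a';
    return -1;
}

// Hypothetical helper: counts adjacent letter pairs in `text`.
// The pair (x, y) is stored at the composite index 26 * x + y.
std::array<int, 26 * 26> countBigrams(const std::string& text)
{
    std::array<int, 26 * 26> counts{};  // value-initialised to zero
    int prev = -1;                      // index of the previous letter, -1 if none yet
    for (char c : text) {
        const int cur = letterIndex(c);
        if (cur < 0) continue;          // skip spaces, digits, punctuation
        if (prev >= 0) ++counts[26 * prev + cur];
        prev = cur;
    }
    return counts;
}
```

To print the result, loop over x and y and output char('A' + x) and char('A' + y) for every non-zero counts[26 * x + y]. The same scheme should extend to 3-letter sequences with a 26 x 26 x 26 table.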

What you are doing right now is a counting sort.
A radix sort would be a viable option for you if you take multiple digits into consideration.

Here you go (based on the idea of user3723779):
#include <cstdio>
#include <map>
#include <string>
// pack two characters into one 16-bit key; the casts avoid sign extension
#define MAKEWORD(a, b) ((unsigned short)(((unsigned char)(a) << 8) | (unsigned char)(b)))
std::string noSpaces(std::string s)
{
    std::string::size_type pos;  // same type as npos, so the comparison is exact
    while ((pos = s.find(' ')) != std::string::npos)
    {
        s.erase(pos, 1);
    }
    return s;
}
std::map<unsigned short, int> seqDet2(const std::string &s)
{
    std::map<unsigned short, int> occ;
    // i + 1 < length() also covers the empty string
    for (std::string::size_type i = 0; i + 1 < s.length(); ++i)
    {
        occ[MAKEWORD(s[i], s[i + 1])]++;
    }
    return occ;
}
int main()
{
    std::string s = "AZAZRTYUI AZTWI";
    auto occ = seqDet2(noSpaces(s));
    for (auto x : occ)
    {
        unsigned char b = (unsigned char)x.first;         // low byte: second letter
        unsigned char a = (unsigned char)(x.first >> 8);  // high byte: first letter
        printf("%c%c - %d\n", a, b, x.second);
    }
    getchar();
}

Related

Occurrences of same alternative characters in a CString

I'm new to C++ programming and this is the task that I have to do, but I can't get the desired output even after many attempts. Can anyone please look at the code and let me know what I should do? I know my code is incomplete, but I don't know how to proceed from here.
Task: Write a program, using functions only, with the following features.
Program reads paragraph(s) from console and stores in a string.
Program then counts the occurrences of double letters appearing in any word of the paragraph(s) and outputs the characters along with their number of occurrences.
If a double letter appears more than once, the program should show it only once, along with its total frequency in the paragraph(s).
Output letters must be in sequence.
Sample input (file):
Double bubble deep in the sea of tooth with teeth and meet with riddle.
Sample output:
bb 1
dd 1
ee 3
oo 1
This is my code:
#include <iostream>
#include <conio.h>
#include <cstring>
#include <ctime>
using namespace std;
int counter = 0;
char alphabets[26] = { 0 };
void alternatives();
int main() {
alternatives();
_getch();
return 0;
}
void alternatives() {
char str[] = "Double bubble deep in the sea of tooth with teeth and meet with riddle.";
int count = 0;
for (int j = 0; j < strlen(str); j++)
str[j] = tolower(str[j]);
for (int i = 0; i < strlen(str); i++) {
if (str[i] == str[i + 1]) {
counter++;
cout << str[i] << str[i + 1] << "\t" << counter << endl;
}
counter = 0;
}
}
Output:
bb 1
ee 1
oo 1
ee 1
ee 1
dd 1

You have 26 letters (I assume) so you need 26 counts. A simple array would do
int counters[26] = { 0 }; // initialise all counts to zero
Now when you find a repeated letter you need to increment the appropriate count, something like
for (int i = 0; i < strlen(str); i++)
{
char letter = str[i];
if (letter >= 'a' && letter <= 'z' && // is it a letter and
letter == str[i + 1]) // is it repeated?
{
counters[letter - 'a']++; // increment count
}
}
Note the use of letter - 'a' to get the offset into the array of counts
Finally you need to output the results
for (char letter = 'a'; letter <= 'z'; ++letter)
{
    int count = counters[letter - 'a'];
    if (count > 0)
        cout << letter << letter << ' ' << count << '\n';  // one pair per line, as in the sample output
}
Not perfect, but hopefully something to get you started. This is untested code.

You can use an int array of length 26 to keep track of repeated instances of letters. You can then iterate over the C string and check for repeats. If you find one, make sure to jump your iterator forward.
#include <cctype>
#include <iostream>
int main() {
    int repeats[26] = {0};
    char str[] = "Double bubble deep in the sea of tooth with teeth and meet with riddle.";
    // lower-case the whole string first
    for (char *ch = str; *ch; ch++)
        *ch = static_cast<char>(std::tolower(static_cast<unsigned char>(*ch)));
    for (char *ch = str; *ch; ch++) {
        // the explicit range check keeps the array index within 0..25
        if (*ch >= 'a' && *ch <= 'z' && *ch == ch[1]) {
            repeats[*ch - 'a']++;
            ch++;  // jump past the second letter of the pair
        }
    }
    for (int i = 0; i < 26; i++) {
        std::cout << static_cast<char>('a' + i) << ": " << repeats[i] << std::endl;
    }
    return 0;
}
Result:
a: 0
b: 1
c: 0
d: 1
e: 3
f: 0
g: 0
h: 0
i: 0
j: 0
k: 0
l: 0
m: 0
n: 0
o: 1
p: 0
q: 0
r: 0
s: 0
t: 0
u: 0
v: 0
w: 0
x: 0
y: 0
z: 0

How to sort my list of most frequent 3-mers in ascending order?

I am writing code to find the most frequent 3-mers in a DNA sequence. My code counts the occurrences of each 3-mer and, if the count is greater than 1, prints both the string and the number of occurrences.
This gives me a redundant list. I want the output to show each 3-mer only once.
Below is the code that I wrote:
int main()
{
char dna[1000];
char read[3] = {0};
char most_freq[3];
printf("Enter the DNA sequence\n");
fgets(dna, sizeof(dna), stdin);
int i, j;
for(i=0; i<strlen(dna)-3; i++)
{
read[0] = dna[i];
read[1] = dna[i+1];
read[2] = dna[i+2];
int count=0, maxcount=1;
for(j = 0; j < strlen(dna); j++)
{
if(dna[j] == read[0] && dna[j+1] == read[1] && dna[j+2] == read[2])
{
count++;
}
else
{
continue;
}
}
if(count > maxcount)
{
maxcount = count;
printf("%s %d\n", read, maxcount);
}
}
}
This is what I get if I input :
CGCCTAAATAGCCTCGCGGAGCCTTATGTCATACTCGTCCT
CGC 2
GCC 3
CCT 4
ATA 2
AGC 2
GCC 3
CCT 4
CTC 2
TCG 2
CGC 2
AGC 2
GCC 3
CCT 4
GTC 2
ATA 2
CTC 2
TCG 2
GTC 2
CCT 4
It is clear that the answer is CCT but I don't want redundancy in output. How do I resolve this?

Here's a reasonably fast way to do it in C
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
typedef struct {
    char n_base[4];  /* 3 bases plus the terminating '\0' */
    int count;
} NMer_3;
typedef struct {
    int count;
    NMer_3 trimer[4 * 4 * 4];  /* assumes the input contains only A, C, G, T */
} dict;
/* compares the first 3 characters; n_base is the first member of NMer_3 */
int cmp(const void* a, const void* b) {
    return strncmp((const char*)a, (const char*)b, 3);
}
void insertTrimer(dict* d, const char* c) {
    NMer_3* ptr = (NMer_3*)bsearch(c, d->trimer, d->count,
                                   sizeof(NMer_3), cmp);
    if (ptr == NULL) {
        int offset = d->count;
        strncpy(d->trimer[offset].n_base, c, 3);
        d->trimer[offset].n_base[3] = '\0';
        d->trimer[offset].count = 1;
        d->count++;
        qsort(d->trimer, d->count, sizeof(NMer_3), cmp);
    } else {
        ptr->count++;
    }
}
int main() {
    char dna[1000];
    dict d = {0};  /* start with an empty dictionary */
    printf("Enter the DNA sequence\n");
    char* res = fgets(dna, sizeof(dna), stdin);
    if (res == NULL)
        return 1;
    dna[strcspn(dna, "\n")] = '\0';  /* drop the newline kept by fgets */
    size_t len = strlen(dna);
    for (size_t i = 0; i + 3 <= len; i++)
        insertTrimer(&d, &dna[i]);
    for (int i = 0; i < d.count; i++)
        printf("%s : %d\n", d.trimer[i].n_base, d.trimer[i].count);
    return 0;
}
Basically, each 3-mer is an entry in a larger struct. The larger struct is binary-searched, and re-sorted with qsort every time a new 3-mer is found. Otherwise, if a repeat is found, its entry is incremented.
Here is the result with your input:
AAA : 1
AAT : 1
ACT : 1
AGC : 2
ATA : 2
ATG : 1
CAT : 1
CCT : 4
CGC : 2
CGG : 1
CGT : 1
CTA : 1
CTC : 2
CTT : 1
GAG : 1
GCC : 3
GCG : 1
GGA : 1
GTC : 2
TAA : 1
TAC : 1
TAG : 1
TAT : 1
TCA : 1
TCC : 1
TCG : 2
TGT : 1
TTA : 1
Ways to improve the speed:
Use a program like Jellyfish
Use a hashmap. There's no standard C library for hash maps/tables, so you would have to write something similar to what I did here, and memory management might be a challenge. However, each lookup becomes O(1) on average instead of O(log(n)), and each insertion becomes O(1) on average instead of needing an O(n*log(n)) sort.
If you do it in C++, you get a lot of benefits, the first being much simpler code:
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
int main() {
    std::string dna;
    std::cout << "Enter the DNA sequence\n";
    std::getline(std::cin, dna);
    auto d = std::map<std::string, int>{};
    // i + 3 <= size() avoids unsigned underflow when the input is shorter than 3
    for (std::size_t i = 0; i + 3 <= dna.size(); i++) {
        auto mer3 = dna.substr(i, 3);
        auto itr = d.find(mer3);
        if (itr == d.end()) {
            d[mer3] = 1;
        } else {
            itr->second += 1;
        }
    }
    for (const auto& i : d) std::cout << i.first << ':' << i.second << '\n';
    std::cout << std::endl;
    return 0;
}
This is effectively the same as the C example.
If you replace map with unordered_map it becomes much faster, however, the output will not be sorted.
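A possible sketch of the unordered_map variant, under the assumption that what you want in the end is the list ranked by frequency. The function name rankTrimers and the tie-breaking rule (alphabetical) are my own choices for illustration. Counting uses the hash map, and a single sort at the end restores a deterministic order.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: counts every 3-mer in `dna` with a hash map, then
// returns (3-mer, count) pairs ordered by descending count, ties alphabetical.
std::vector<std::pair<std::string, int>> rankTrimers(const std::string& dna)
{
    std::unordered_map<std::string, int> counts;
    for (std::size_t i = 0; i + 3 <= dna.size(); ++i)
        ++counts[dna.substr(i, 3)];  // operator[] starts missing keys at 0

    // Copy into a vector so the entries can be sorted once at the end.
    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return ranked;
}
```

With the input from the question, the first entry of the returned vector should be CCT with a count of 4, followed by GCC with 3.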

Combination of unique elements without repetition

Elements: a b c
All combinations, like this: a, b, c, ab, ac, bc, abc
Formula for the total number of combinations of unique elements without repetition = 2^n - 1 (where n is the number of unique elements)
In our case: 2^3 - 1 = 7
Another formula, for the number of combinations of a specific length = n!/(r! * (n - r)!) (where n = number of unique items and r = length)
Example for the above case with r = 2: 3!/(2! * 1!) = 3, which is ab ac bc
Is there any algorithm or function that gets all 7 combinations?
I searched a lot, but all I can find is one that gets the combinations of a specific length only.
UPDATE:
This is what I have so far but it only gets combination with specific length:
void recur(string arr[], string out, int i, int n, int k, bool &flag)
{
flag = 1;
// invalid input
if (k > n)
return;
// base case: combination size is k
if (k == 0) {
flag = 0;
cout << out << endl;
return;
}
// start from next index till last index
for (int j = i; j < n; j++)
{
recur(arr, out + " " + arr[j], j + 1, n, k - 1,flag);
}
}

The best algorithm I've found for this problem uses bitwise operators. You simply count in binary: each 1 bit in the number means that the corresponding element is included.
E.g., for the string "abc":
number , binary , string
1 , 001 , c
2 , 010 , b
3 , 011 , bc
4 , 100 , a
5 , 101 , ac
6 , 110 , ab
7 , 111 , abc
This is the best solution I've found. You can do it with a simple loop, and there are no memory issues.
Here is the code:
#include <cstddef>
#include <iostream>
#include <string>
using namespace std;
int main()
{
    string s("abcd");
    // 2^n bit patterns; assumes s.size() is smaller than the bit width of unsigned int
    unsigned int condition = 1u << s.size();
    for (unsigned int i = 1; i < condition; i++) {
        unsigned int temp = i;
        // bit j of i selects s[j], so i = 1 prints the first character
        for (size_t j = 0; j < s.size(); j++) {
            if (temp & 1) {  // tests the rightmost bit of temp
                cout << s[j];
            }
            temp = temp >> 1;  // shifts temp to the right by 1 bit
        }
        cout << endl;
    }
    return 0;
}

Do a simple exhaustive search.
#include <iostream>
#include <string>
using namespace std;
void exhaustiveSearch(const string& s, int i, string t = "")
{
if (i == static_cast<int>(s.size()))
{
    if (!t.empty()) // skip the empty combination
        cout << t << endl;
}
else
{
exhaustiveSearch(s, i + 1, t);
exhaustiveSearch(s, i + 1, t + s[i]);
}
}
int main()
{
string s("abc");
exhaustiveSearch(s, 0);
}
Complexity: O(2^n)

Here's an answer using recursion, which will take any number of elements as strings:
#include <vector>
#include <string>
#include <iostream>
void make_combos(const std::string& start,
const std::vector<std::string>& input,
std::vector<std::string>& output)
{
for(size_t i = 0; i < input.size(); ++i)
{
auto new_string = start + input[i];
output.push_back(new_string);
if (i + 1 == input.size()) break;
std::vector<std::string> new_input(input.begin() + 1 + i, input.end());
make_combos(new_string, new_input, output);
}
}
Now you can do:
int main()
{
std::string s {};
std::vector<std::string> output {};
std::vector<std::string> input {"a", "b", "c"};
make_combos(s, input, output);
for(auto i : output) std::cout << i << std::endl;
std::cout << "There are " << output.size()
<< " unique combinations for this input." << std::endl;
return 0;
}
This outputs:
a
ab
abc
ac
b
bc
c
There are 7 unique combinations for this input.

Generate all sequences of bits within Hamming distance t

Given a vector of bits v, compute the collection of bits that have Hamming distance 1 with v, then with distance 2, up to an input parameter t.
So for
011 I should get
~~~
111
001
010
~~~ -> 3 choose 1 in number
101
000
110
~~~ -> 3 choose 2
100
~~~ -> 3 choose 3
How can I compute this efficiently? The vector won't always be of dimension 3, e.g. it could be 6. This will run many times in my real code, so some efficiency would be welcome as well (even at the cost of more memory).
My attempt:
#include <iostream>
#include <vector>
void print(const std::vector<char>& v, const int idx, const char new_bit)
{
for(size_t i = 0; i < v.size(); ++i)
if(i != idx)
std::cout << (int)v[i] << " ";
else
std::cout << (int)new_bit << " ";
std::cout << std::endl;
}
void find_near_hamming_dist(const std::vector<char>& v, const int t)
{
// if t == 1
for(size_t i = 0; i < v.size(); ++i)
{
print(v, i, v[i] ^ 1);
}
// I would like to produce t == 2
// only after ALL the t == 1 results are reported
/* how to? */
}
int main()
{
std::vector<char> v = {0, 1, 1};
find_near_hamming_dist(v, 1);
return 0;
}
Output:
MacBook-Pro:hammingDist gsamaras$ g++ -Wall -std=c++0x hammingDist.cpp -o ham
MacBook-Pro:hammingDist gsamaras$ ./ham
1 1 1
0 0 1
0 1 0

First: there is a bijection between bit-vectors at Hamming distance k and subsets (of n, aka v.size()) of cardinality k (the set of indices with changed bits). Hence, I'd enumerate the subsets of changed indices instead. A quick glance at the SO history shows this reference. You'd have to keep track of the correct cardinalities, of course.
Considering efficiency is probably pointless, since the solution to your problem is exponential anyway.

If the Hamming distance h(u, v) = k, then u^v has exactly k bits set. In other words, computing u ^ m over all masks m with k bits set gives all words with the desired Hamming distance. Notice that this set of masks does not depend on u.
That is, for n and t reasonably small, precompute the sets of masks with k bits set, for all k in 1..t, and iterate over these sets as required.
If you don't have enough memory, you may generate the k-bit patterns on the fly. See this discussion for details.
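A rough sketch of the mask idea, assuming words of at most 31 bits stored in a std::uint32_t. The function names masksWithKBits and neighboursAtDistance are illustrative, and the "next mask" step uses the well-known bit trick usually called Gosper's hack, which is one way to generate k-bit patterns and not necessarily the one the linked discussion describes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: all n-bit masks with exactly k bits set, in increasing
// numeric order. Assumes 1 <= k <= n <= 31 so the arithmetic fits in 32 bits.
std::vector<std::uint32_t> masksWithKBits(int n, int k)
{
    assert(k >= 1 && k <= n && n <= 31);
    std::vector<std::uint32_t> masks;
    const std::uint32_t limit = std::uint32_t{1} << n;
    std::uint32_t m = (std::uint32_t{1} << k) - 1;  // smallest mask: k low bits
    while (m < limit) {
        masks.push_back(m);
        // Gosper's hack: next larger integer with the same number of set bits.
        const std::uint32_t low = m & (~m + 1);     // lowest set bit of m
        const std::uint32_t ripple = m + low;       // carry through the low run of 1s
        m = (((ripple ^ m) >> 2) / low) | ripple;
    }
    return masks;
}

// Every n-bit word at Hamming distance exactly k from u.
std::vector<std::uint32_t> neighboursAtDistance(std::uint32_t u, int n, int k)
{
    std::vector<std::uint32_t> result;
    for (std::uint32_t m : masksWithKBits(n, k))
        result.push_back(u ^ m);
    return result;
}
```

Because the masks do not depend on u, masksWithKBits(n, k) can be computed once per k and reused for every input word. Calling neighboursAtDistance for k = 1, 2, ..., t in turn reports the results grouped by distance, as the question asks.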

#include <stdio.h>
#include <stdint.h>
#include <string.h>
void magic(char* str, int i, int changesLeft) {
if (changesLeft == 0) {
printf("%s\n", str);
return;
}
if (i < 0) return;
// flip current bit
str[i] = str[i] == '0' ? '1' : '0';
magic(str, i-1, changesLeft-1);
// or don't flip it (flip it again to undo)
str[i] = str[i] == '0' ? '1' : '0';
magic(str, i-1, changesLeft);
}
int main(void) {
char str[] = "011";
printf("%s\n", str);
size_t len = strlen(str);
size_t maxDistance = len;
for (size_t i = 1 ; i <= maxDistance ; ++i) {
printf("Computing for distance %zu\n", i); /* %zu is the format for size_t */
magic(str, (int)len - 1, (int)i);
printf("----------------\n");
}
return 0;
}
Output:
MacBook-Pro:hammingDist gsamaras$ nano kastrinis.cpp
MacBook-Pro:hammingDist gsamaras$ g++ -Wall kastrinis.cpp
MacBook-Pro:hammingDist gsamaras$ ./a.out
011
Computing for distance 1
010
001
111
----------------
Computing for distance 2
000
110
101
----------------
Computing for distance 3
100
----------------

In response to Kastrinis' answer, I would like to verify that this can be extended to my basic example, like this:
#include <cstdio>
#include <iostream>
#include <vector>
void print(std::vector<char>&v)
{
for (auto i = v.begin(); i != v.end(); ++i)
std::cout << (int)*i;
std::cout << "\n";
}
void magic(std::vector<char>& str, const int i, const int changesLeft) {
if (changesLeft == 0) {
print(str);
return;
}
if (i < 0) return;
// flip current bit
str[i] ^= 1;
magic(str, i-1, changesLeft-1);
// or don't flip it (flip it again to undo)
str[i] ^= 1;
magic(str, i-1, changesLeft);
}
int main(void) {
std::vector<char> str = {0, 1, 1};
print(str);
size_t len = str.size();
size_t maxDistance = str.size();
for (size_t i = 1 ; i <= maxDistance ; ++i) {
printf("Computing for distance %zu\n", i); // %zu is the format for size_t
magic(str, len-1, i);
printf("----------------\n");
}
return 0;
}
The output is identical.
PS - I am also toggling the bit in a different way (with XOR).

c++ Algorithm to convert an integer into an array of bool

I'm trying to code an algorithm that will save to file as binary strings every integer in a range. Eg, for the range 0 to 7:
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Note that the leading zeros and spaces between digits are essential.
What I can't figure out is a simple way to convert the integers to binary numbers represented by bool arrays (or some alternative approach).
EDIT
As requested, my solution so far is:
const int NUM_INPUTS = 6;
bool digits[NUM_INPUTS] = {0};
int NUM_PATTERNS = pow(2, NUM_INPUTS);
for(int q = 0; q < NUM_PATTERNS; q++)
{
for(int w = NUM_INPUTS -1 ; w > -1 ; w--)
{
if( ! ((q+1) % ( (int) pow(2, w))) )
digits[w] = !digits[w];
outf << digits[w] << " ";
}
outf << "\n";
}
Unfortunately, this is a bit screwy as the first pattern it gives me is 000001 instead of 000000.
This is not homework. I'm just coding a simple algorithm to give me an input file for training a neural network.

Don't use pow. Just use binary math:
const int NUM_INPUTS = 6;
int NUM_PATTERNS = 1 << NUM_INPUTS;
for(int q = 0; q < NUM_PATTERNS; q++)
{
for(int w = NUM_INPUTS -1 ; w > -1; w--)
{
outf << ((q>>w) & 1) << " ";
}
outf << "\n";
}

Note: I'm not providing code, but merely a hint because the question sounds like homework
This is quite easy. See this example:
number = 23
binary representation = 10111
first digit = (number )&1 = 1
second digit = (number>>1)&1 = 1
third digit = (number>>2)&1 = 1
fourth digit = (number>>3)&1 = 0
fifth digit = (number>>4)&1 = 1
Alternatively written:
temp = number
for i from 0 to digits_count
digit i = temp&1
temp >>= 1
Note that the order of digits taken by this algorithm is the reverse of what you want to print.
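For readers who want to see the hint spelled out, here is one possible sketch, assuming a fixed width no larger than the number of bits in unsigned int. The names toBits and formatBits are hypothetical. Each extracted digit is stored at the mirrored position, which takes care of the reversed order mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: the lowest `width` bits of `number`, most significant first.
// Assumes 0 <= width <= number of bits in unsigned int.
std::vector<bool> toBits(unsigned int number, int width)
{
    std::vector<bool> bits(width);
    for (int i = 0; i < width; ++i) {
        // (number >> i) & 1 is digit i counted from the right, so it is
        // stored at the mirrored position to get printing order.
        bits[width - 1 - i] = ((number >> i) & 1u) != 0;
    }
    return bits;
}

// Hypothetical helper: formats the bits as digits separated by single spaces.
std::string formatBits(const std::vector<bool>& bits)
{
    std::string out;
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (i != 0) out += ' ';
        out += bits[i] ? '1' : '0';
    }
    return out;
}
```

For example, formatBits(toBits(23, 5)) gives "1 0 1 1 1", and writing formatBits(toBits(q, 3)) for q from 0 to 7 reproduces the eight lines from the question.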

The lazy way would be to use std::bitset.
Example:
#include <bitset>
#include <iostream>
int main()
{
for (unsigned int i = 0; i != 8; ++i){
std::bitset<3> b(i);
std::cout << b << std::endl;
}
}
If you want to output the bits individually, space-separated, replace std::cout << b << std::endl; with a call to something like Write(b), with Write defined as:
template<std::size_t S>
void Write(const std::bitset<S>& B)
{
for (int i = S - 1; i >= 0; --i){
std::cout << std::noboolalpha << B[i] << " ";
}
std::cout << std::endl;
}
