String compression and comparison

Recently, I started writing a program to compare DNA sequences. As the alphabet consists of only four letters (ATCG), compressing each character to 2 bits seemed like it would offer faster comparisons (are two characters the same or different?). However, when I ran a test, char comparisons were much faster than bit comparisons (by ~30%). Compression was carried out in both programs as a control. What am I missing here? Is there a more efficient way to compare bits?
p.s. I also tried vector, but it was a bit slower than bitset.
// File: bittest.cc
// Test use of bitset container
#include <ctime>
#include <iostream>
#include <bitset>
#include <vector>
#include <string>
using namespace std;
void compress(string&, bitset<74>&);
void compare(bitset<74>&, bitset<74>&);
int main()
{
// Start timer
std::clock_t start;
double difference;
start = std::clock();
for(int i=0; i<10000000; ++i){
string frag1="ATCGACTGACTGACTGACTGACTGACTGACTGACTGA";
string frag2="AACGAACGAACGAACGAACGAACGAACGAACGAACGA";
int a=37;
bitset<74> bits1;
bitset<74> bits2;
compress(frag1, bits1);
compress(frag2, bits2);
compare(bits1, bits2);
}
difference = ( std::clock() - start ) / (double)CLOCKS_PER_SEC;
int minutes = difference/60;
int seconds = difference - minutes * 60;
if (seconds < 10){
cout << "\nRunning time: " << minutes << ":0" << seconds << endl << endl;
}else{
cout << "\nRunning time: " << minutes << ":" << seconds << endl << endl;
}
return 0;
}
void compress(string& in, bitset<74>& out){
char c;
int b=0;
for(int i=0; i<in.length(); ++i){
c=in[i];
b=2*i;
switch(c){
case 'A':
break;
case 'C':
out.set(b+1);
break;
case 'G':
out.set(b);
break;
case 'T':
out.set(b);
out.set(b+1);
break;
default:
cout << "Invalid character in fragment.\n";
}
}
}
void compare(bitset<74>& a, bitset<74>& b){
for(int i=0; i<74; ++i){
if(a[i] != b[i]){
}
}
}
And the string harness...
// File: bittest.cc
#include <ctime>
#include <iostream>
#include <bitset>
#include <vector>
#include <string>
using namespace std;
void compress(string&, bitset<74>&);
void compare(string&, string&);
int main()
{
// Start timer
std::clock_t start;
double difference;
start = std::clock();
for(int i=0; i<10000000; ++i){
string frag1="ATCGACTGACTGACTGACTGACTGACTGACTGACTGA";
string frag2="AACGAACGAACGAACGAACGAACGAACGAACGAACGA";
int a=37;
bitset<74> bits1;
bitset<74> bits2;
compress(frag1, bits1);
compress(frag2, bits2);
compare(frag1, frag2);
}
difference = ( std::clock() - start ) / (double)CLOCKS_PER_SEC;
int minutes = difference/60;
int seconds = difference - minutes * 60;
if (seconds < 10){
cout << "\nRunning time: " << minutes << ":0" << seconds << endl << endl;
}else{
cout << "\nRunning time: " << minutes << ":" << seconds << endl << endl;
}
return 0;
}
void compress(string& in, bitset<74>& out){
char c;
int b=0;
for(int i=0; i<in.length(); ++i){
c=in[i];
b=2*i;
switch(c){
case 'A':
break;
case 'C':
out.set(b+1);
break;
case 'G':
out.set(b);
break;
case 'T':
out.set(b);
out.set(b+1);
break;
default:
cout << "Invalid character in frag.\n";
}
}
}
void compare(string& a, string& b){
for(int i=0; i<37; ++i){
if(a[i] != b[i]){
}
}
}

Consider the two comparison routines:
void compare(bitset<74>& a, bitset<74>& b){
    for(int i=0; i<74; ++i){
        if(a[i] != b[i]){
        }
    }
}
and
void compare(string& a, string& b){
    for(int i=0; i<37; ++i){
        if(a[i] != b[i]){
        }
    }
}
Right off the bat you can see that one is executing the loop 74 times and the other is executing the loop 37 times. So already the bitset approach is starting from a position of weakness.
Now consider the type of data being accessed. Accessing individual bytes is reasonably quick. Accessing individual bits depends on how the data structure stores them: it might pack them one per bit, or it might store each bit in an entire byte or even some larger word size. If it stores bits in individual bits, then some bitmasking operations must be introduced, and those all take processing time too. If the bits are stored in bytes, then each comparison covers only half of a character. If the bits are stored in words or larger, you are increasing the amount of the CPU's data cache that the data occupies, potentially spreading something that might fit entirely in one cache line over several cache lines. That can carry a large speed penalty, though on inputs this small, it's probably not too horrible yet.
If you replace your bitset with a char[] that is large enough to hold all your data, manually set the bits yourself in the compression routines, and then compare the char[] array a byte at a time or larger, you can probably drastically improve the speed of the comparison routines. Will the speed up be sufficient to overcome the cost of the compression routines? That's tough to say and depends in part upon how many comparisons you can make with each compressed form.
If you can perform your comparison using int or larger datatypes, you can probably go significantly faster still, as modern CPUs are usually faster at accessing 4 or 8 bytes at a time than 1 byte at a time. Most strcmp(3) or memcmp(3) routines are optimized to perform huge, aligned reads. If you use memcmp(3) to do your comparison, you'll have the best chance of going at top speed -- and that goes for both the compressed and uncompressed versions.
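As a rough illustration of the word-at-a-time idea above, here is a minimal sketch that packs each base into 2 bits of a 64-bit word and finds differing positions with one XOR per word. The names Packed, pack and countMismatches are hypothetical, and the encoding A=0, C=1, G=2, T=3 is an arbitrary choice for this sketch, not the one used in the question. It has not been benchmarked against the code above.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>
#include <string>

// Two 64-bit words hold up to 64 bases at 2 bits each (hypothetical layout).
using Packed = std::array<std::uint64_t, 2>;

// Pack a DNA string (A, C, G, T only) into 2 bits per base.
// Assumed encoding for this sketch: A=0, C=1, G=2, T=3.
Packed pack(const std::string& s) {
    assert(s.size() <= 64);
    Packed out{};  // value-initialised, so both words start at zero
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint64_t code = 0;
        switch (s[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: assert(false && "invalid base");
        }
        out[i / 32] |= code << (2 * (i % 32));  // 32 bases per word
    }
    return out;
}

// Count the positions at which two packed sequences differ.
int countMismatches(const Packed& a, const Packed& b) {
    const std::uint64_t lowBits = 0x5555555555555555ULL;  // low bit of every 2-bit pair
    int total = 0;
    for (std::size_t w = 0; w < a.size(); ++w) {
        std::uint64_t x = a[w] ^ b[w];                  // non-zero pair means the bases differ
        std::uint64_t diff = (x | (x >> 1)) & lowBits;  // one set bit per differing base
        total += static_cast<int>(std::bitset<64>(diff).count());
    }
    return total;
}
```

If only equality matters, comparing the two arrays directly (a == b) checks 32 bases per word comparison. Whether this beats comparing the raw strings still depends on how often each packed sequence is reused, as noted above.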

The CPU will not load anything smaller than a byte, which is eight bits. Therefore, when your program treats a pair of bits, the CPU actually loads eight bits, then masks the unused six out. The masking operation takes processor time.
You must trade memory-usage efficiency against execution time. Which you prefer is your choice.

Related

Random generating of numbers doesn't work properly

I'm trying to create a program that generates sudokus. But when I let the program place numbers at random spots, it doesn't use every position.
I tried rand() with srand(time(0)),
and the random number generators from <random>.
In the constructor I use this:
mt19937_64 randomGeneratorTmp(time(0));
randomGenerator = randomGeneratorTmp;
uniform_int_distribution<int> numGetterTmp(0, 8);
numGetter = numGetterTmp;
randomGenerator and numGetter are member variables, so I can use them in other functions of the sudoku object.
And this is the function where I use the random numbers:
bool fillInNumber(int n){
int placedNums = 0, tries=0;
int failedTries[9][9];
for(int dim1=0;dim1<9;dim1++){
for(int dim2=0;dim2<9;dim2++){
failedTries[dim1][dim2] = 0;
}
}
while(placedNums<9){
int dim1 = numGetter(randomGenerator);
int dim2 = numGetter(randomGenerator);
if(nums[dim1][dim2]==0){
if(allowedLocation(n,dim1,dim2)){
nums[dim1][dim2] = n;
placedNums++;
} else {
failedTries[dim1][dim2]++;
tries++;
}
}
if(tries>100000000){
if(placedNums == 8){
cout<< "Number: " << n << endl;
cout<< "Placing number: " << placedNums << endl;
cout<< "Dim1: " << dim1 << endl;
cout<< "Dim2: " << dim2 << endl;
printArray(failedTries);
}
return false;
}
}
return true;
}
(The array failedTries just shows me which positions the program tried.
Most of the fields have been tried millions of times, while others not once.)
I think the random generation just repeats itself before it has used every number combination, but I don't know what I'm doing wrong.

Don't expect random numbers to have an even distribution over your matrix - there's no guarantee they will. That would be like having a routine to randomly generate cards from a deck, and waiting until you see all 52 values - you may wait a very very long time to get every single card.
That's especially true since "random" numbers are actually generated by pseudorandom number generators. Many simple ones work by multiplying the previous value by a very large number and adding an arbitrary constant. Depending on the algorithm, the output might cluster in unanticipated ways.
If I may make a suggestion: create an array of all of the possible matrix positions, and then shuffle that array. That's how deck shuffling algorithms are able to guarantee you have all the cards in the deck covered, and it's the same problem you're having.
For a shuffle, generate two random positions in the array and exchange the values - repeat as many times as it takes to get a suitably random result. (Since your array is limited to 9x9, I might shuffle an array of ints 0..80: extract the columns and rows with a /9 and a % 9 for each int).
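The shuffled-positions idea can be sketched as follows. Note the substitution: instead of the repeated random swaps described above, this uses std::shuffle from <algorithm>, which produces a uniformly random permutation in a single pass. The function name shuffledCells is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Return the cell indices 0..80 in random order; every cell appears exactly once.
std::vector<int> shuffledCells(std::mt19937_64& rng) {
    std::vector<int> cells(81);
    assert(cells.size() == 9 * 9);
    std::iota(cells.begin(), cells.end(), 0);       // 0, 1, ..., 80
    std::shuffle(cells.begin(), cells.end(), rng);  // uniformly random permutation
    return cells;
}
```

For each value c in the result, the row is c / 9 and the column is c % 9. Walking the vector from front to back then tries every position once, so no cell can be skipped.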

I wrote a simple program that should be equivalent to your code, and it works without issues:
#include <iostream>
#include <random>
#include <vector>
using namespace std;
int main()
{
    std::default_random_engine engine;
    std::uniform_int_distribution<int> distr(0,80);
    std::vector<bool> vals(81,false);
    int attempts = 0;
    int trueCount = 0;
    while(trueCount < 81)
    {
        int newNum = distr(engine);
        if(!vals[newNum])
        {
            vals[newNum] = true;
            trueCount++;
        }
        attempts++;
    }
    std::cout << "attempts: " << attempts;
    return 0;
}
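Usually it prints around 400 attempts, which is the statistical average (the expected number of draws needed to hit all 81 values is 81 * (1 + 1/2 + ... + 1/81), roughly 403).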
You most likely have a bug in your code. I am not sure where though, as you don't show all of your code.

Why is clock() returning 1.84467e+13?

I am trying to time a code I've got in C++. I have an inner and an outer loop that I want to time separately, but at the same time. For some reason when I do this one of the instances returns 1.84467e+13 and always this exact number.
Why is this happening?
Here is a minimum working example that replicates the effect on my machine:
#include <iostream>
#include <stdlib.h>
#include <time.h>
using namespace std;
int main()
{
long int i, j;
clock_t start, finish, tick, tock;
double a = 0.0;
double adding_time, runtime;
start = clock();
for(i=0; i<10; i++)
{
a=0.0;
tick =clock();
for(j=0; j<10000000; j++)
{
a+=1;
}
tock= clock();
adding_time = (double)(tick - tock)/CLOCKS_PER_SEC;
cout << "Computation time:" << adding_time << endl;
}
finish = clock();
runtime = (double)(finish - start)/CLOCKS_PER_SEC;
cout << "Total computation time:" << runtime << endl;
}

Your clock_t is apparently an unsigned 64-bit type.
You're taking tick - tock, where tock was measured after tick, so if there's any difference between the two at all, it's going to try to produce a negative number--but since it's an unsigned type, that's wrapping around to become something close to the largest number that can be represented in that type.
Obviously, you really want to use tock-tick instead.
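A small sketch of the corrected subtraction, plus a reproduction of the wraparound. Both function names are hypothetical. The second function assumes a 64-bit unsigned tick type, as the answer above infers, and the test values assume 1,000,000 ticks per second; that rate is an assumption about the asker's platform, not something stated in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Elapsed seconds between two clock() readings; 'end' must be taken after 'begin'.
double elapsedSeconds(std::clock_t begin, std::clock_t end) {
    return static_cast<double>(end - begin) / CLOCKS_PER_SEC;
}

// Reproduces the bug: subtracting in the wrong order with an unsigned 64-bit type
// wraps around to a value just below 2^64 instead of going negative.
double wrappedSeconds(std::uint64_t tick, std::uint64_t tock, double ticksPerSecond) {
    assert(ticksPerSecond > 0.0);
    return static_cast<double>(tick - tock) / ticksPerSecond;
}
```

Under those assumptions, 2^64 / 1,000,000 is about 1.84467e+13, which matches the number in the question. That is also why the output never changes much: any small negative difference wraps to almost the same huge value.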

Say tick = 2 ms and tock = 4 ms. Then tick - tock is 2 - 4, which is obviously negative, and even if it came out as a positive number it would not be the real elapsed time. Also, the number it prints (which doesn't show up on my computer) is large, so try the output manipulators:
#include <iomanip>
cout << fixed << showpoint;
cout << setprecision(2);
It might work.

Program will not output data to console when using a data input size greater than 30 million

I'm trying to make a program that will eventually show the runtime differences with large data inputs by using a binary search tree and a vector. But before I get to that, I'm testing to see if the insertion and search functions are working properly. It seems to be fine but whenever I assign SIZE to be 30 million or more, after about 10-20 seconds, it will only display Press any key to continue... with no output. However if I assign SIZE to equal to 20 million or less, it will output the search results as I programmed it. So what do you think is causing this problem?
Some side notes:
I'm storing a unique, (no duplicates) randomly generated value into the tree as well as the vector. So at the end, the tree and the vector will both have the exact same values. When the program runs the search portion, if a value is found in the BST, then it should be found in the vector as well. So far this has worked with no problems when using 20 million values or less.
Also, I'm using randValue = rand() * rand(); to generate the random values because I know the maximum value of rand() is 32767. So multiplying two such values gives a range of numbers from 0 to 1,073,676,289 (32767 * 32767). I know the insertion and searching methods I'm using are inefficient because I'm making sure there are no duplicates, but that's not my concern right now. This is just for my own practice.
I'm only posting up my main.cpp for the sake of simplicity. If you think the problem lies in one of my other files, I'll post the rest up.
Here's my main.cpp:
#include <iostream>
#include <time.h>
#include <vector>
#include "BSTTemplate.h"
#include "functions.h"
using namespace std;
int main()
{
const long long SIZE = 30000000;
vector<long long> vector1(SIZE);
long long randNum;
binarySearchTree<long long> bst1;
srand(time(NULL));
//inserts data into BST and into the vector AND makes sure there are no duplicates
for(long long i = 0; i < SIZE; i++)
{
randNum = randLLNum();
bst1.insert(randNum);
if(bst1.numDups == 1)//if the random number generated is duplicated, don't count it and redo that iteration
{
i--;
bst1.numDups = 0;
continue;
}
vector1[i] = randNum;
}
//search for a random value in both the BST and the vector
for(int i = 0; i < 5; i++)
{
randNum = randLLNum();
cout << endl << "The random number chosen is: " << randNum << endl << endl;
//searching with BST
cout << "Searching for " << randNum << " in BST..." << endl;
if(bst1.search(randNum))
cout << randNum << " = found" << endl;
else
cout << randNum << " = not found" << endl;
//searching with linear search using vectors
cout << endl << "Searching for " << randNum << " in vector..." << endl;
if(containsInVector(vector1, SIZE, randNum))
cout << randNum << " = found" << endl;
else
cout << randNum << " = not found" << endl;
}
cout << endl;
return 0;
}

(Comments reposted as answer at OP's request)
Options include: compile 64 bit (if you're not already - may make it better or worse depending on whether RAM or address space are the issue), buy more memory, adjust your operating system's swap memory settings (letting it use more disk), design a more memory-efficient tree (but at best you'll probably only get an order of magnitude improvement, maybe less, and it could impact other things like performance characteristics), redesign your tree so it manually saves data out to disk and reads it back (e.g. with an LRU).
Here's a how-to for compiling 64 bit on VC++: msdn.microsoft.com/en-us/library/9yb4317s.aspx

How Can I Speed My C++ Program Up?

Basically, I am relearning C++ and decided to create a lotto number generator.
The code creates a ticket and, if that ticket does not already exist, adds it to a vector that stores every possible combination.
The program works, but it's far too slow, adding an entry roughly every second, and it will get slower as it becomes harder to find unique combinations out of over 13 million possible combinations.
Anyway, here is my code. Any optimization tips would be appreciated:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <string>
#include <sstream>
#include <vector>
#include <algorithm>
using namespace std;
vector<string> lottoCombos;
const int NUMBERS_PER_TICKET = 6;
const int NUMBERS = 49;
const int POSSIBLE_COMBOS = 13983816;
string createTicket();
void startUp();
void getAllCombinations();
int main()
{
lottoCombos.reserve(POSSIBLE_COMBOS);
cout<< "Random Ticket: "<< createTicket()<< endl;
getAllCombinations();
for (int i = 0; i < POSSIBLE_COMBOS; i++)
{
cout << endl << lottoCombos[i];
}
system("PAUSE");
return 0;
}
string createTicket()
{
srand(static_cast<unsigned int>(time(0)));
vector<int> ticket;
vector<int> numbers;
vector<int>::iterator numberIterator;
//ADD AVAILABLE NUMBERS TO VECTOR
for (int i = 0; i < NUMBERS; i++)
{
numbers.push_back(i + 1);
}
for (int j = 0; j < NUMBERS_PER_TICKET; j++)
{
int ticketNumber = rand() % numbers.size();
numberIterator = numbers.begin()+ ticketNumber;
int nm = *numberIterator;
numbers.erase(numberIterator);
ticket.push_back(nm);
}
sort(ticket.begin(), ticket.end());
string result;
ostringstream convert;
convert << ticket[0] << ", " << ticket[1] << ", " << ticket[2] << ", " << ticket[3] << ", " << ticket[4] << ", " << ticket[5];
result = convert.str();
return result;
}
void getAllCombinations()
{
int i = 0;
cout << "Max Vector Size: " << lottoCombos.max_size() << endl;
cout << "Creating Entries" << endl;
while ( i != POSSIBLE_COMBOS )
{
bool matchFound = true;
string newNumbers = createTicket();
for (int j = 0; j < lottoCombos.size(); j++)
{
if ( newNumbers == lottoCombos[j] )
{
matchFound = false;
break;
}
}
if (matchFound != false)
{
lottoCombos.push_back(createTicket());
i++;
cout << "Entries: "<< i << endl;
}
}
sort(lottoCombos.begin(), lottoCombos.end());
cout << "\nCombination generation complete!!!\n\n";
}

The reason each lottery ticket is taking a second to generate is because you are misusing srand(). By calling srand(time(0)) every time createTicket() is called, you ensure that createTicket() returns the same numbers every time it is called, until the next time the value returned by time() changes, i.e. once per second. So your reject-duplicates algorithm will almost always find a duplicate until the next second goes by. You should move your srand(time(0)) call to the top of main() instead.
That said, there are perhaps larger issues to confront here: my first question would be, is it really necessary to generate and store every possible lottery ticket? (and if so, why?) IIRC real lotteries don't do that when issuing a ticket; they just generate some random numbers and print them out (and if there are multiple winning tickets printed with the same numbers, the owners of those tickets share the prize money).
Assuming you do need to generate every possible lottery ticket for some reason, there are better ways to do it than randomly. If you've ever watched the odometer increment while driving a car, you'll get the idea for how to do it linearly; just imagine an odometer with 6 wheels, where each wheel has 49 different possible positions it can be in (rather than the traditional 10).
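The odometer idea can be sketched like this. One adjustment for lottery tickets: since order doesn't matter and numbers can't repeat, each "wheel" is kept strictly greater than the one to its left. The names nextCombination and countCombinations are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Advance 'combo' (strictly increasing values in 1..n) to the next combination
// in lexicographic order. Returns false when 'combo' was already the last one.
bool nextCombination(std::vector<int>& combo, int n) {
    const int k = static_cast<int>(combo.size());
    assert(k > 0 && k <= n);
    // Find the rightmost wheel that is not yet at its maximum position.
    int i = k - 1;
    while (i >= 0 && combo[i] == n - (k - 1 - i)) --i;
    if (i < 0) return false;  // every wheel is at its maximum
    ++combo[i];
    // Reset the wheels to its right to the smallest valid values.
    for (int j = i + 1; j < k; ++j) combo[j] = combo[j - 1] + 1;
    return true;
}

// Count all k-number tickets drawn from 1..n by stepping through them in order.
long long countCombinations(int n, int k) {
    std::vector<int> combo(k);
    for (int i = 0; i < k; ++i) combo[i] = i + 1;  // first ticket: 1, 2, ..., k
    long long count = 1;
    while (nextCombination(combo, n)) ++count;
    return count;
}
```

Starting from 1, 2, 3, 4, 5, 6 and calling nextCombination(combo, 49) until it returns false visits each ticket exactly once, already in sorted order, so no duplicate check is needed at all.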
Finally, a vector has O(N) lookup time, and if you are doing a lookup in the vector for every value you generate, then your algorithm has O(N^2) time, which is to say, it's going to get really slow really quickly as you generate more tickets. So if you have to store all known tickets in a data structure, you should definitely use a data structure with quicker lookup times, for example a std::map or a std::unordered_set, or even a std::bitset as suggested by @RedAlert.
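For the duplicate check specifically, a minimal sketch with std::unordered_set might look like the following. The helper name addTicket is hypothetical; the ticket string is assumed to be the sorted, comma-separated form that createTicket() in the question returns.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Record a ticket. Returns true if it was new, false if it had been seen before.
// insert() performs the lookup and the insertion in one average O(1) step.
bool addTicket(std::unordered_set<std::string>& seen, const std::string& ticket) {
    assert(!ticket.empty());
    return seen.insert(ticket).second;
}
```

This replaces the inner for loop over lottoCombos with a single call, so the cost per generated ticket no longer grows with the number of tickets already stored.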

C++ Long Division

Whilst working on a personal project of mine, I came across a need to divide two very large arbitrary numbers (each number having roughly 100 digits).
So I wrote out the very basic code for division (i.e., answer = a/b, where a and b are entered by the user) and quickly discovered that it only has a precision of 16 digits! It may be obvious at this point that I'm not a coder!
So I searched the internet and found some code that, as far as I can tell, uses the traditional method of long division on strings (but to be honest I'm not sure, as I'm quite confused by it). But upon running the code, it gives some incorrect answers and won't work at all if a>b.
I'm not even sure whether there's a better way to solve this problem than the method in the code below. Maybe there's simpler code?
So basically I need help writing code, in C++, to divide two very large numbers.
Any help or suggestions are greatly appreciated!
#include <iostream>
#include <iomanip>
#include <cmath>
using namespace std; //avoids having to use std:: with cout/cin
int main (int argc, char **argv)
{
string dividend, divisor, difference, a, b, s, tempstring = ""; // a and b used to store dividend and divisor.
int quotient, inta, intb, diff, tempint = 0;
char d;
quotient = 0;
cout << "Enter the dividend? "; //larger number (on top)
cin >> a;
cout << "Enter the divisor? "; //smaller number (on bottom)
cin >> b;
//making the strings the same length by adding 0's to the beggining of string.
while (a.length() < b.length()) a = '0'+a; // a has less digits than b add 0's
while (b.length() < a.length()) b = '0'+b; // b has less digits than a add 0's
inta = a[0]-'0'; // getting first digit in both strings
intb = b[0]-'0';
//if a<b print remainder out (a) and return 0
if (inta < intb)
{
cout << "Quotient: 0 " << endl << "Remainder: " << a << endl;
}
else
{
a = '0'+a;
b = '0'+b;
diff = intb;
//s = b;
// while ( s >= b )
do
{
for (int i = a.length()-1; i>=0; i--) // do subtraction until end of string
{
inta = a[i]-'0'; // converting ascii to int, used for munipulation
intb = b[i]-'0';
if (inta < intb) // borrow if needed
{
a[i-1]--; //borrow from next digit
a[i] += 10;
}
diff = a[i] - b[i];
char d = diff+'0';
s = d + s; //this + is appending two strings, not performing addition.
}
quotient++;
a = s;
// strcpy (a, s);
}
while (s >= b); // fails after dividing 3 x's
cout << "s string: " << s << endl;
cout << "a string: " << a << endl;
cout << "Quotient: " << quotient << endl;
//cout << "Remainder: " << s << endl;
}
system ("pause");
return 0;
cin.get(); // allows the user to enter variable without instantly ending the program
cin.get(); // allows the user to enter variable without instantly ending the program
}

There are much better methods than that. This subtractive method is arbitrarily slow for large dividends and small divisors. The canonical method is given as Algorithm D in Knuth, D.E., The Art of Computer Programming, volume 2, but I'm sure you will find it online. I'd be astonished if it wasn't in Wikipedia somewhere.
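To be clear, the following is not Knuth's Algorithm D. It is a simpler base-10 schoolbook long division on digit strings, shown only as a sketch of how the work can be bounded to at most nine subtractions per dividend digit, rather than one subtraction per unit of the quotient. All function names are hypothetical, and the inputs are assumed to be non-negative and to contain digits only.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Remove leading zeros, leaving at least one digit.
std::string stripZeros(const std::string& s) {
    std::size_t pos = s.find_first_not_of('0');
    return pos == std::string::npos ? "0" : s.substr(pos);
}

// True if a < b, for decimal strings without leading zeros.
bool lessThan(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return a.size() < b.size();
    return a < b;  // same length, so lexicographic order equals numeric order
}

// Compute a - b for decimal strings with a >= b.
std::string subtract(std::string a, const std::string& b) {
    int borrow = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::size_t ai = a.size() - 1 - i;  // walk from the last digit backwards
        int d = (a[ai] - '0') - borrow - (i < b.size() ? b[b.size() - 1 - i] - '0' : 0);
        borrow = d < 0 ? 1 : 0;
        if (borrow) d += 10;
        a[ai] = static_cast<char>('0' + d);
    }
    return stripZeros(a);
}

// Schoolbook long division. Returns {quotient, remainder}.
std::pair<std::string, std::string> divide(const std::string& dividend,
                                           const std::string& divisor) {
    const std::string d = stripZeros(divisor);
    assert(d != "0");  // division by zero is not handled in this sketch
    std::string quotient, rem = "0";
    for (char c : dividend) {
        rem = stripZeros(rem + c);   // bring down the next digit
        int q = 0;
        while (!lessThan(rem, d)) {  // runs at most 9 times per digit
            rem = subtract(rem, d);
            ++q;
        }
        quotient += static_cast<char>('0' + q);
    }
    return {stripZeros(quotient), rem};
}
```

For two 100-digit inputs this does on the order of 100 * 9 short subtractions, whereas the repeated-subtraction loop in the question needs one subtraction per unit of the quotient. Algorithm D improves on this further by working in a large base and estimating each quotient digit directly.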
