Accessing specific column / row and compare to another (C++) - c++

I'm attempting to take a text file as an input with, let's say, six columns and twenty rows and make various calculations based on the data in the text file.
Is there a way to be able to access a specific column/row in the code and compare it to another? I'm basically trying to see how many numbers in, let's say, column two are +10 away from each other so if column two was 10 11 16 20 21 25 30 31 34 40 50, the program would give me the solution 10,20,30,40,50 and 11,21,31.

It sounds like you may want to use this for more than just checking whether numbers in a row are a set distance from each other, so I'll provide a more general solution.
First create a 20x6 matrix of character pointers:
char *inputmatrix[20][6];
Then load up the matrix with the values from the file (f is the open FILE*, and currow and curcol both start at 0). We first read a whole line from the file with fgets, then split the line on spaces using strtok. For each token we allocate space with malloc, copy in the value (strtok returns a pointer into buffer, which gets overwritten by the next fgets), and then store the pointer in our array:
char buffer[256];
char *value;
while(!feof(f)){
if(!fgets(buffer,256,f))
break;
value = strtok(buffer," ");
while(value != NULL){
inputmatrix[currow][curcol] = (char*)malloc(strlen(value)+1); /* +1 for the terminating '\0' */
memset(inputmatrix[currow][curcol],0,strlen(value)+1);
memcpy(inputmatrix[currow][curcol],value,strlen(value));
curcol++;
value = strtok(NULL," ");
}
currow++;
curcol = 0;
}
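Since the question is tagged C++, the same loading step can also be sketched with streams and vectors. This is only an illustration under the assumption that every field is a whitespace-separated integer; readTable and column are hypothetical helper names, not part of the answer above.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads whitespace-separated integers, one row per line.
std::vector<std::vector<int>> readTable(std::istream& in) {
    std::vector<std::vector<int>> table;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<int> row;
        int value;
        while (fields >> value) {   // stops at the first non-integer field
            row.push_back(value);
        }
        if (!row.empty()) table.push_back(row);
    }
    return table;
}

// Hypothetical helper: copies column `col` out of the table, skipping short rows.
std::vector<int> column(const std::vector<std::vector<int>>& table, std::size_t col) {
    std::vector<int> out;
    for (const auto& row : table) {
        if (col < row.size()) out.push_back(row[col]);
    }
    return out;
}
```

Passing a std::ifstream opened on the input file to readTable and then calling column(table, 1) would give the second column as a plain vector, with no manual malloc bookkeeping.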
Now that we've got a matrix of strings, we can go through and run any algorithm you want. For instance, to find all the elements in a row that are +10 away from each other, we first check that the element converts to a nonzero int with atoi (atoi returns 0 for text that is not a number), then compare it with the next int in the row, and so on (swap the two loops to walk down a column instead):
int curelement = -1, nextelement = -1;
for(int i=0;i<3;i++){
for(int j=0;j<6;j++){
if((nextelement = atoi(inputmatrix[i][j])) != 0){
if(nextelement - curelement == 10){
printf("row %i,: %i,%i\n",i,curelement,nextelement);
}
curelement = nextelement;
}
}
}
The above algorithm only works if the integers in the row are in ascending order and the values that are 10 apart sit next to each other; if not, you have to take each integer and compare it with the rest of the integers in the row.
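For the unordered case, one possible sketch is to put the values in a std::set and follow each chain of values that are `step` apart. The function name chainsWithStep is an assumption made for illustration, and the sketch assumes the column has already been converted to ints.

```cpp
#include <set>
#include <vector>

// Returns every run v, v+step, v+2*step, ... of length >= 2 found in `values`.
// The input does not need to be sorted; duplicates are ignored.
std::vector<std::vector<int>> chainsWithStep(const std::vector<int>& values, int step) {
    std::set<int> present(values.begin(), values.end());
    std::vector<std::vector<int>> chains;
    for (int v : present) {                       // std::set iterates in ascending order
        if (present.count(v - step)) continue;    // v belongs to a chain that started earlier
        std::vector<int> chain{v};
        while (present.count(chain.back() + step)) {
            chain.push_back(chain.back() + step);
        }
        if (chain.size() > 1) chains.push_back(chain);
    }
    return chains;
}
```

Called with the column from the question (10 11 16 20 21 25 30 31 34 40 50) and a step of 10, this yields the two chains 10,20,30,40,50 and 11,21,31.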

Related

Better(faster) algorithm to compare 2 vector of vector of integer?

I have a set of raw data files, each with 8 to 9 million lines (yes,
8,000,000~9,000,000) in the following format,
1,2,3,4,5,16,23,35
1,2,3,4,6,17,23,36
1,2,3,4,7,18,23,37
1,2,3,4,8,19,23,38
1,2,3,4,9,20,23,39
1,2,3,4,10,21,23,40
1,2,3,4,11,22,23,41
1,2,3,4,12,23,24,42
1,2,3,4,13,24,25,43
1,2,3,4,14,25,26,44
Each line has 8 sorted numbers in the range 1~49.
Another set of "filter" files each has 6 to 7 million lines in the
following format,
13,4,7,8,18,20
9,10,11,12,5,6,7,8,1,2,3,4,21,22,23,24,13,14,15,16,29,30,31,32,45,46,47,48
29,49,36,37,34,17,15,9,16,30,28,47,46,27,20,32,14,26,1,4,3,6,10,2,7,48,44,41
Each line has 4~28 unsorted numbers in the range 1~49.
I need to compare each line from the "raw data" file with every line in the "filter" file
and get the intersection count, e.g. line 1 in raw with lines 1~3 in filter:
1 // since only 4 is in common with filter line 1
7 // since only 35 not found in filter line 2
5 // since 5 23 35 not found in filter line 3
After the comparison, I output the result according to a threshold value,
e.g.
output raw data line with intersection value >= 2,
output raw data line with intersection value == 4
I know that there are (at most) 9 million x 8 million line comparisons.
At first, I tried using set_intersection to do the job, but it takes forever (the filter line is sorted before being passed to set_intersection).
int res[8];
int *it = set_intersection(Raw.Data, Raw.Data+8, FilterVal.begin(), FilterVal.end(), res);
ds = GetIntersect(GDE.DrawRes, LotArr) * 2;
int IntersectCnt=it-res;
Next, I tried building an array of integers initialized to zero:
int ResArr[49] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
and use 3 helper functions:
void InitResArr(int * inResArr, vector<int> & FilterVal) {
for (int i = 0; i < FilterVal.size(); i++) {
inResArr[FilterVal[i] - 1] = 1;
}
}
void ResetResArr(int * inResArr, vector<int> & FilterVal) {
for (int i = 0; i < FilterVal.size(); i++) {
inResArr[FilterVal[i] - 1] = 0;
}
}
int GetIntersect(int * inResArr, int * inRawData) {
int RtnVal = 0;
for (int i = 0; i < 8; i++) {
RtnVal+=inResArr[inRawData[i] - 1];
}
return RtnVal;
}
But this approach still takes over 3 hrs to finish 1 comparison (1 raw data file with 1 filter).
And I have 5,000 raw data files and 40,000 filters to go!!!
Is there a better approach to handle this task? Thanks.
Regds
LAM Chi-fung
Not sure how well it'll work for your case (it was hard to understand what you wanted from your description), but I've thought of the following algorithm:
Sort your long rows. This can be done in O(n), where n is the length of a single data row, with a simple counting sort, since the values are limited to 1~49.
After that, for every number in the filter row, do a binary search on the sorted row. That's O(m * log(n)), where m is the number of values in the filter row. It should be a big improvement over O(m*n) (to be precise, you also need to multiply all those complexities by the number of data rows).
Also, pay attention to your I/O; once the algorithm is updated, it might become the next bottleneck (if you are using iostreams, don't forget std::ios::sync_with_stdio(false)).
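A minimal sketch of the binary-search step, assuming the raw row is the sorted side (the question states raw rows are already sorted) and the filter row is searched value by value. The name countIntersection is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Counts how many values of `filter` occur in `raw`.
// `raw` must be sorted ascending, as std::binary_search requires.
int countIntersection(const std::array<int, 8>& raw, const std::vector<int>& filter) {
    int count = 0;
    for (int value : filter) {
        if (std::binary_search(raw.begin(), raw.end(), value)) {
            ++count;
        }
    }
    return count;
}
```

With raw line 1 from the question (1,2,3,4,5,16,23,35), filter line 1 gives 1 and filter line 2 gives 7, matching the expected values listed there. Whether this actually beats the lookup-table version in the question would have to be measured.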

How to work with Bubble Sort to determine highest number [C++]

My assignment is to create a function to determine the highest number of a given array read from a text file. I've looked into using bubble sort, and I think that since the assignment does not ask for sorted numbers, it is unnecessary to store them as such.
Here is what I've got so far
void determineWinner(string namesArr[], float votesArr[], int size)
{
int temp = 0;
string tempname;
for (int i = 0; i < size; i++)
{
if (votesArr[i] > votesArr[i + 1])
{
temp = votesArr[i];
tempname = namesArr[i];
votesArr[i] = votesArr[i + 1];
namesArr[i] = namesArr[i + 1];
votesArr[i + 1] = temp;
namesArr[i + 1] = tempname;
}
}
}
I've created it such that it tests the condition, (with the goal in mind to sort smallest to biggest), and then replaces i with i+1. And then because the "votes" are linked to specific names, I switch the names as the votes move around.
For instance the array would be arranged at first
5000, 4000, 6000, 2500, 1800
and would need to end up as
1800, 2500, 4000, 5000, 6000
I think I'm getting a runtime error with "program name has stopped working", what can I do to fix this up?
Your problem is that the size of your array is 5, which means it contains elements 0, 1, ..., 4, and you iterate from i = 0 to i < 5.
This means that on the last pass (i = 4) you evaluate if (votesArr[4] > votesArr[4+1]), which is not legal: element 4 is the last one, and there is no element 5.
So either start at i = 1 and compare with if (votesArr[i-1] > votesArr[i]), or only go up to i < size-1.
Think about the case where you only have two elements: you need just one comparison.
You should also look over your bubble sort algorithm; it doesn't seem correct to me. A single pass only moves the largest value to the end, so a full sort needs the pass repeated.
If you only want the highest element, keep the highest value seen so far in a variable (like your temp) and overwrite it every time you find a bigger one.
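The "keep the highest so far" idea could look roughly like this. The function name indexOfHighest is an assumption; it returns an index rather than a value so that the matching entry in namesArr can be looked up without moving anything around.

```cpp
// Returns the index of the largest vote count, or -1 if the array is empty.
int indexOfHighest(const float votesArr[], int size) {
    if (size <= 0) return -1;
    int best = 0;                       // index of the highest value seen so far
    for (int i = 1; i < size; i++) {    // starts at 1, so votesArr[i] is always in range
        if (votesArr[i] > votesArr[best]) {
            best = i;
        }
    }
    return best;
}
```

For the votes 5000, 4000, 6000, 2500, 1800 this returns index 2, so namesArr[2] would be the winner.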

C++ Incompatible types: calculating allele frequencies

Here is what the input file looks like:
1-1_Sample 1
GCCCATGGCT
2-1_Sample 1
GAGTGTATGT
3-1_Sample 1
TGTTCTATCT
1-1_Sample 2
GCTTAGCCAT
2-1_Sample 2
TGTAGTCAGT
3-1_Sample 2
GGGAACCAAG
1-1_Sample 3
TGGAAGCGGT
2-1_Sample 3
CGGGAGGAGA
3-1_Sample 3
CTTCAGTTTT
#include <cstdlib>
#include <iostream>
#include <string>
#include <fstream>
#include <stdlib.h>
using namespace std;
const int pops = 10;
const int sequence = 100;
string w;
string popname;
string lastpop;
int totalpops;
string ind;
int i;
int j;
char c;
float dna[pops][4][sequence];
float Af[1][1][1];
int main(int argc, char *argv[])
{
ifstream fin0("dnatest.txt");
lastpop = "nonsense";
totalpops = -1;
if (fin0)
{
do
{
getline(fin0, w);
cout << w<<endl;
i=0;
ind = "";
popname = "";
do {c = w [i];
i++;
if ((c != '>')&(c!='-')) ind=ind+c; } while (c != '-');
do {c = w [i];
i++; } while (c != ' ');
do {c = w [i];
i++;
if (c!= '\n') popname=popname+c; } while (i< w.length());
if (popname != lastpop) { totalpops++;
lastpop=popname;
}
getline (fin0, w);
cout << w<<endl << w.length()<<endl;
for (i=0; i<w.length(); i++)
{if (w[i]=='A') dna[totalpops][0][i]++;
if (w[i]=='C') dna[totalpops][1][i]++;
if (w[i]=='G') dna[totalpops][2][i]++;
if (w[i]=='T') dna[totalpops][3][i]++;
}
for(int k=0;k<1;k++)
{for(int j=0; j<1;j++)
{for (int i=0;i<1;i++)
Af[0] = Af[0][0][0]+dna[i][j][k]; //RETURNS THE ERROR "INCOMPATIBLE TYPES IN ASSIGNMENT OF 'FLOAT' TO 'FLOAT[1][1]'
cout<<Af<<endl;}
}
while (!fin0.eof());
}
system("PAUSE");
return EXIT_SUCCESS;
}
Background:
I am very new to C++ and trying to teach myself to use it to supplement my graduate research. I am a genetics PhD candidate trying to model different evolutionary histories and how they affect the frequency of alleles across populations.
Question:
I am trying to extract certain portions of data from the "dna" array that I created from the input file.
For example, here I have created another array "Af" where I am trying to extract counts for the first "cell," so to speak, of the dna array. The purpose of doing this, is so that I can calculate a frequency by comparing the counts in certain groups of cells to the entire dna array. I can't figure out how to do this. I keep getting the error message: "INCOMPATIBLE TYPES IN ASSIGNMENT OF 'FLOAT' TO 'FLOAT[1][1]'"
I have spent a great deal of time researching this on different forums, but I cannot seem to understand what this error means, and how else to achieve what I'm trying to achieve.
So the dna array I'm visualizing is made from the input file such that there are 4 rows (A, C, G, T) and then 10 columns (one column for each nucleotide in the series). This "grid" is then stacked 3 times (one "sheet" for each Sample; here sample means population, and there are three individuals per population, as listed in the input file).
So from this stack of grids I want to extract, for example, the first cell (the number of A's in Sample 1 at position 1). I would then want to compare this number to the total number of A's at position 1 across all samples. This frequency would then be a meaningful number for the model I'm testing.
The problem is, I don't know how to extract portions of the dna array - once I figure out this condensed example, I will be applying it to very large input files, and will want to extract more than one cell at a time.
Af is a 3-dimensional array:
float Af[1][1][1];
However, it contains only a single element. It has one row, one column, and one "layer" (or however you want to name the 3rd dimension). That makes it a bit pointless. You might as well just have this:
float Af;
Nonetheless, you don't have that - you have a 3D array. Now let's look at this line:
Af[0] = Af[0][0][0] + dna[i][j][k];
So first it takes the (0, 0, 0)th element from Af (which, as we've just seen, is the only element in Af) and adds the (i, j, k)th element from dna to it. That bit is fine because both of these elements are of type float. That is:
Af[0] = Af[0][0][0] + dna[i][j][k];
// ^^^^^^^^^^^ ^^^^^^^^^^^^
// These are both floats
So the result of this addition is also a float. Then what do you try to assign this result to? Well, you try to assign it to Af[0], but that is not a float. You've simply specified the 0th index in the first dimension. There are still two other dimensions to specify. The type of Af[0] is actually float[1][1] (a two-dimensional array of floats). This would work, for example:
Af[0][0][0] = Af[0][0][0] + dna[i][j][k];
// Or equivalently:
Af[0][0][0] += dna[i][j][k];
Whether that's what you want to do or not is completely dependent on the problem, which I can't begin to understand. However, as I said, it makes very little sense to have Af as a 3 dimensional array with only a single element in it. If it's just one float, make it a float, not an array. Then you would do the above line as:
Af += dna[i][j][k];
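As for the frequency the question describes (the count in one population's cell compared with the total for that base and position across all populations), a sketch with a plain float result might look like this. It assumes the same global dna layout as the question, indexed as dna[population][base][position]. The name alleleShare and the numPops parameter are hypothetical; with the question's code, numPops would be totalpops + 1, because totalpops starts at -1.

```cpp
const int pops = 10;
const int sequence = 100;
float dna[pops][4][sequence];   // zero-initialised because it has static storage

// Share of the counts for `base` at position `pos` that belong to population `pop`.
// Returns 0 when that base never occurs at that position.
float alleleShare(int numPops, int pop, int base, int pos) {
    float total = 0.0f;
    for (int p = 0; p < numPops; p++) {
        total += dna[p][base][pos];
    }
    if (total == 0.0f) return 0.0f;   // avoid dividing by zero
    return dna[pop][base][pos] / total;
}
```

Here alleleShare(numPops, 0, 0, 0) corresponds to "A's in Sample 1 at position 1 divided by all A's at position 1"; looping over pos or base would extract more than one cell at a time.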

count number of times a character appears in an array?

I've been thinking about this for a long time and haven't gotten anywhere with the program. I don't know where to begin. The assignment requires a single function, main, and only the iostream library.
The task is to declare a char array of 10 elements and take input from the user. Determine whether the array contains any values more than once. Do not show the characters that appear only once.
Sample output:
a 2
b 4
..
a and b are characters, and 2 and 4 represent the number of times they appear in the array.
I tried to use a nested loop to compare each character with all the other characters in the array, incrementing a counter each time a similar character is found, but I'm getting unexpected results.
Here is the code
#include <iostream>
using namespace std;
void main()
{
char ara[10];
int counter=0;
cout<<"Enter 10 characters in an array\n";
for ( int a=0; a<10; a++)
cin>>ara[a];
for(int i=0; i<10; i++)
{
for(int j=i+1; j<10; j++)
{
if(ara[i] == ara[j])
{
counter++;
cout<<ara[i]<<"\t"<<counter<<endl;
}
}
}
}
Algorithm 2: std::map
Declare / define the container:
std::map<char, unsigned int> frequency;
Open the file
read a letter.
find the letter: frequency.find(letter)
If letter exists, increment the frequency: frequency[letter]++;
If letter does not exist, insert it into frequency: frequency[letter] = 1;
After all letters processed, iterate through the map displaying the letter and its frequency.
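The steps above can be sketched as follows. Note that this uses std::map and std::string, so it would not satisfy an assignment restricted to iostream. The function name repeatedChars is an assumption, and the find/insert steps collapse into one line because operator[] creates a missing entry with value 0.

```cpp
#include <map>
#include <string>

// Returns the characters that occur more than once in `input`, with their counts.
std::map<char, unsigned int> repeatedChars(const std::string& input) {
    std::map<char, unsigned int> frequency;
    for (char letter : input) {
        ++frequency[letter];            // a new key starts at 0, then becomes 1
    }
    // Drop the characters that appear only once, as the task asks.
    for (auto it = frequency.begin(); it != frequency.end();) {
        if (it->second < 2) {
            it = frequency.erase(it);   // erase returns the next valid iterator
        } else {
            ++it;
        }
    }
    return frequency;
}
```

Iterating over the returned map and printing it->first and it->second gives the "character, count" lines from the sample output.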
Here's one possible way you can solve this. I'm not giving you full code; it's considered bad to just give full implementations for other people's homework.
First, fill a new array with only unique characters. For example, if the input was:
abacdadeff
The new array should only have:
abcdef
That is, every character should appear only once in it. Do not forget to \0-terminate it, so that you can tell where it ends (since it can have a length smaller than 10).
Then create a new array of int (or unsigned, since you can't have negative occurrences) values that holds the frequency of occurrence of every character from the unique array in the original input array. Every value should initially be 0, because the loop in the next step counts every occurrence, including the first. You can achieve this with a declaration like:
unsigned freq[10] = { 0 };
Now, iterate over the unique array and every time you find the current character in the original input array, increment the corresponding element of the frequencies array. So at the end, for the above input, you would have:
a b c d e f (unique array)
3 1 1 2 1 2 (frequencies array)
And you're done. You can now tell how many times each character appears in the input.
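For readers who are not doing this as homework, here is one way the unique-array idea might be written. It is a sketch that folds both steps into a single pass over the input (the description above uses two), and tally is a hypothetical name. It needs no library headers at all.

```cpp
// Fills `unique` with each distinct character of input[0..n-1] in order of first
// appearance (NUL-terminated) and `freq` with the matching counts.
// `unique` needs room for n + 1 chars and `freq` for n counters.
// Returns the number of distinct characters.
int tally(const char input[], int n, char unique[], unsigned freq[]) {
    int distinct = 0;
    for (int i = 0; i < n; ++i) {
        int k = 0;
        while (k < distinct && unique[k] != input[i]) {
            ++k;                          // look for input[i] among the known characters
        }
        if (k == distinct) {              // not seen before: open a new slot
            unique[distinct] = input[i];
            freq[distinct] = 0;
            ++distinct;
        }
        ++freq[k];
    }
    unique[distinct] = '\0';
    return distinct;
}
```

For the input abacdadeff this produces the unique array abcdef with frequencies 3 1 1 2 1 2; printing only the entries with a count above 1 gives the output the assignment asks for.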
Here, I'll tell you what you should do and you code it yourself:
include headers ( stdio libs )
define main ( entry point for your app )
declare input array A[amount_of_chars_in_your_input]
write output requesting user to input
collect input
now the main part:
declare another array of unsigned shorts B[]
declare counter int i = 0
declare counter int j = 0
loop through the array A[] (in other words, while i < sizeof(A) or A[i] != '\0')
inside that, loop once for each different letter in the array A
store the count of each letter in B[]
print it out
There are some tricks to applying this, but you can handle it.
Try this:
unsigned int frequency[26] = {0};
char letters[10];
Algorithm:
Open file / read a letter.
Search the letters array for the new letter.
If the new letter exists: increment the frequency slot for that
letter: frequency[toupper(new_letter) - 'A']++;
If the new letter is missing, add to array and set frequency to 1.
After all letters are processed, print out the frequency array:
cout << static_cast<char>('A' + index) << ": " << frequency[index] << endl;
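A small sketch of the frequency[26] idea, assuming the input consists of ASCII letters and that upper and lower case should be counted together. Because the slot is computed directly from the letter, this version does not need a separate letters array. The name countLetters is hypothetical, and it relies on toupper from <cctype>, which goes beyond an iostream-only restriction.

```cpp
#include <cctype>

// Adds to freq[0..25] the number of times each letter occurs in text[0..n-1].
// Characters that are not A-Z after upper-casing are ignored.
void countLetters(const char text[], int n, unsigned int freq[26]) {
    for (int i = 0; i < n; i++) {
        int upper = std::toupper(static_cast<unsigned char>(text[i]));
        if (upper >= 'A' && upper <= 'Z') {
            freq[upper - 'A']++;
        }
    }
}
```

The caller is expected to zero the array first, e.g. with unsigned int frequency[26] = {0}; as in the answer above.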

How can i count the collisions in this hash function?

This is a prefix hashing function. I want to count the number of collisions in this method but I am not sure how to do it. It seems like it might be simple but I just can't think of a great way to do it.
int HashTable_qp::preHash(string & key, int tableSize )
{
string pad = "AA";
//some words in the input are less than 3 letters
//I choose to pad the string with A because all padded characters
//have same ascii val, which is low, and will hopefully alter the results less
if (key.length() < 3)
{
key.append(pad);
}
return ( key[0] + 27 * key[1] + 729 * key[2] ) % tableSize;
}
If the underlying data structure is an array (of pointers, say), do:
int hash = preHash(key, tableSize);
if (array[hash] != nullptr)
++count;
If it's an array of linked lists, do:
if (array[hash] != nullptr && *(array[hash]) != nullptr)
++count;
If you only have access to the stl library I believe just testing that element is null
before adding it would be enough after calling the hash function.
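The "test the slot before adding" idea can be tried out without the real table. The sketch below is an assumption-laden stand-in: it tracks occupancy with a vector of flags instead of the actual HashTable_qp storage, and prefixHash repeats the formula from the question but pads a copy of the key rather than modifying the caller's string. It also assumes plain ASCII keys, so the character values are non-negative.

```cpp
#include <string>
#include <vector>

// Same formula as preHash in the question, applied to a local copy of the key.
int prefixHash(std::string key, int tableSize) {
    if (key.length() < 3) {
        key.append("AA");
    }
    return (key[0] + 27 * key[1] + 729 * key[2]) % tableSize;
}

// Counts the keys whose home slot was already taken by an earlier key.
int countCollisions(const std::vector<std::string>& keys, int tableSize) {
    std::vector<bool> occupied(tableSize, false);
    int collisions = 0;
    for (const std::string& key : keys) {
        int slot = prefixHash(key, tableSize);
        if (occupied[slot]) {
            ++collisions;
        } else {
            occupied[slot] = true;
        }
    }
    return collisions;
}
```

Since only the first three characters are hashed, keys such as "abc" and "abcx" always land in the same slot and count as a collision.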
create a histogram:
unsigned histogram[tablesize] = {0};
generate some (all) possible strings and compute their hashval, and update the histogram accordingly:
for(iter=0; iter < somevalue; iter++) {
hashval = hashfunc( string.iterate(iter) ); // I don't know c++
histogram[hashval] +=1;
}
Now you have to analyze the hashtable for lumps / clusters. Rule of thumb is that for (tablesize==iter), you expect about 37 % of cells with count = 1 and about 37 % empty; the rest have two or more.
If you sum all the (count*(count+1))/2 and divide by the tablesize, you should expect around 1.5. A bad hash function gives higher values; a perfect hash would only have cells with count=1 (and thus: ratio=1). With linear probing you should of course never use tablesize=niter, but make tablesize bigger, say twice as big. You can use the same metric (number of probes / number of entries) to analyse its performance, though.
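The metric described above could be computed from a filled histogram roughly like this. The name probeRatio is hypothetical, and the histogram is assumed to hold one counter per table slot.

```cpp
#include <vector>

// Sums count*(count+1)/2 over all slots and divides by the table size.
double probeRatio(const std::vector<unsigned>& histogram) {
    if (histogram.empty()) return 0.0;
    double sum = 0.0;
    for (unsigned count : histogram) {
        sum += count * (count + 1.0) / 2.0;   // cost of reaching every entry in this slot
    }
    return sum / histogram.size();
}
```

A histogram of all ones gives exactly 1.0, the perfect-hash case; the more the entries pile up in a few slots, the higher the value climbs.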
UPDATE: a great introduction on hashfunctions and their performance can be found at http://www.strchr.com/hash_functions .
You can create an array of integers, each representing one hash. When you're done making the hashes loop back through the array in a nested loop. If you had the following array,
[0] -> 13
[1] -> 5
[2] -> 12
[3] -> 7
[4] -> 5
For each item i in 0..n, check items i+1..n for matches. In English that would be: check if each element equals any of the elements after it.
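A sketch of that nested loop, under the assumption that the hashes have already been collected into a vector. The name countRepeatedHashes is hypothetical. The inner loop stops at the first match so that each element is counted at most once.

```cpp
#include <cstddef>
#include <vector>

// Counts the elements that are equal to some element after them.
int countRepeatedHashes(const std::vector<int>& hashes) {
    int collisions = 0;
    for (std::size_t i = 0; i < hashes.size(); ++i) {
        for (std::size_t j = i + 1; j < hashes.size(); ++j) {
            if (hashes[i] == hashes[j]) {
                ++collisions;
                break;              // one later match is enough for element i
            }
        }
    }
    return collisions;
}
```

For the array shown above (13, 5, 12, 7, 5) the result is 1, because only the first 5 has a match after it. The loop is quadratic in the number of hashes, so it suits small test sets better than millions of keys.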