Reducing Condition statement for Large data - c++

I am writing some sample programs in C++. I have a text file containing 100,000 single-column records, and I have loaded all of them into a std::vector. Now I need to check whether the first character of each record is '#', '$' or '%'. The records are sorted in the order '#', '$', '%'. I can write it like this:
for ( i = 0; i < myArray.size(); i++)
{
    if(myArray[i][0]=='#')
    {
        // do some process for '#' values
    }
    else if(myArray[i][0]=='$')
    {
        // do some process for '$' values
    }
    else if(myArray[i][0]=='%')
    {
        // do some process for '%' values
    }
}
My question is: is this the best way to do it? Is there another way to write this program with better efficiency?

That's about as efficient as it gets. The only thing that might make it a bit faster is using a switch statement.
for ( i = 0; i < myArray.size(); i++)
{
switch(myArray[i][0]){
case '#':
// Do stuff
break;
case '$':
// Do stuff
break;
case '%':
// Do stuff
break;
}
}
I'm also assuming you're only doing this once. If you're doing it more than once with the same data, then it can be made more efficient by sorting it. If that's the case, let me know and I will update my answer.

As stated in why-is-processing-a-sorted-array-faster-than-an-unsorted-array, it may be more efficient to sort your lines first (based on the first character) and then process your vector.
The switch statement proposed by David would be my choice.
But as an alternative, you may try an array of function pointers, something like:
using my_function_pointer = void (*)(const std::string&/*, other required stuff*/);
const my_function_pointer funs[256] = { dummy_f, dummy_f, .., f_charp, f_dollar, f_mod, dummy_f, ..};
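for ( i = 0; i < myArray.size(); i++) {
    // cast to unsigned char so that a negative char cannot index outside the table
    funs[static_cast<unsigned char>(myArray[i][0])](myArray[i] /*, other required stuff*/);
}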
In any case, benchmark the change, as you should for any optimization.

Related

How do I avoid repetitive code in the case of switch statements which are the same but for some substituted variables/vectors/etc.?

The following snippet is from an inventory system I'm working on. I keep running into scenarios where I feel I should be able to simply run a for loop, but am stymied by the fact that in different cases I'm using different vectors/variables/etc. I run into this problem just about any time I need to work with a variable or object whose name won't be known until run-time. In this particular situation, case 1: is exactly the same as case 2: except that the vector tankInventory[] would be dpsInventory[] in case 2:
I feel I'm doing something fundamentally backwards, but I'm not clear on how to reorient my thinking about this. Any advice?
case 1:
//loop through the inventory...
for (int i = 0; i < 6; i++)
{
//looking for an empty spot
if (tankInventory[i] == -1)
{
//add the item...
tankInventory[i] = { item };
//decrement the number of items being added
number--;
//and stop the loop if you're out of items to add
if (!number)
break;
}
}
//if there are no more items to add, break;
if (!number)
break;
//but if there are more...
else
{
//switch to main inventory...
character = 0;
//and return to the top
goto returnPoint;
}
Use a function.
Just extract the common logic out into a function, and take as parameters whatever can change.
Also, it seems like you're using goto and breaking out from the switch instead of doing a loop. I'd do something like do {} while (number) or while (number) {}, depending on what you need. This way it's much easier to use a function.
You're very likely on the right track, this is how we build up the abstractions. A simple way is to define a lambda:
// you might refine the captures
auto processInventory = [&](auto& inventoryToProcess) {
    //loop through the inventory...
    for (int i = 0; i < 6; i++)
    {
        //looking for an empty spot
        if (inventoryToProcess[i] == -1)
        {
            //add the item...
            inventoryToProcess[i] = { item };
            //decrement the number of items being added
            number--;
            //and stop the loop if you're out of items to add
            if (!number)
                break;
        }
    }
    //a lambda cannot break out of the caller's switch or goto one of its
    //labels, so report whether every item was placed and let the caller decide
    return number == 0;
};
switch(condition) {
case 1:
    if (!processInventory(tankInventory)) {
        //there are more items: switch to main inventory and return to the top
        character = 0;
        goto returnPoint;
    }
    break;
case 2:
    if (!processInventory(dpsInventory)) {
        character = 0;
        goto returnPoint;
    }
    break;
}

Fast counting of nucleotide types in a large number of sequences

First, a bit of background about my question.
I work as a bioinformatician, which means that I use computing to try to answer biological questions. For my problem, I have to manipulate a file called a FASTA file, which looks like this:
>Header 1
ATGACTGATCGNTGACTGACTGTAGCTAGC
>Header 2
ATGCATGCTAGCTGACTGATCGTAGCTAGC
ATCGATCGTAGCT
So a FASTA file is basically just a header line, which starts with a '>' character, followed by a sequence on one or more lines that is composed of nucleotides. Nucleotides are characters that can take 5 possible values: A, T, C, G or N.
The thing I would like to do is count the number of times each nucleotide type appears so if we consider this dummy FASTA file :
>Header 1
ATTCGN
I should have, as a result :
A:1 T:2 C:1 G:1 N:1
Here is what I got so far :
ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;
while(getline(sequence_file, line)) {
if(line[0] != '>') {
sequence += line;
}
else {
nucleotide_counts['A'] = boost::count(sequence, 'A');
nucleotide_counts['T'] = boost::count(sequence, 'T');
nucleotide_counts['C'] = boost::count(sequence, 'C');
nucleotide_counts['G'] = boost::count(sequence, 'G');
nucleotide_counts['N'] = boost::count(sequence, 'N');
sequence = "";
}
}
So it reads the file line by line, if it encounters a '>' as the first character of the line, it knows that the sequence is complete and starts to count. Now the problem I'm facing is that I have millions of sequences with several billions of nucleotides in total. I can see that my method is not optimized because I call boost::count five times on the same sequence.
Other things I have tried :
Parsing the sequence to increment a counter for each nucleotide types. I tried using a map<char, double> to map each nucleotide to a value but this was slower than the boost solution.
Using the std::count of the algorithm library but this was too slow too.
I searched the internet for solutions but every solution I found was good if the number of sequences was low, which is not my case. Would you have any idea that could help me speed things up ?
EDIT 1 :
I also tried this version but it was 2 times slower than the boost one :
ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;
while(getline(sequence_file, line)) {
if(line[0] != '>') {
sequence += line;
}
else {
for(int i = 0; i < sequence.size(); i++) {
nucleotide_counts[sequence[i]]++;
}
sequence = "";
}
}
EDIT 2 : Thanks to everyone in this thread, I was able to obtain a speed up of about 30 times compared to the boost original solution. Here is the code :
#include <array> // std::array
#include <fstream> // std::ifstream
#include <string> // std::string
void count_nucleotides(std::array<double, 26> &nucleotide_counts, std::string sequence) {
for(unsigned int i = 0; i < sequence.size(); i++) {
++nucleotide_counts[sequence[i] - 'A'];
}
}
std::ifstream sequence_file(input_file.c_str());
std::string line;
std::string sequence = "";
std::array<double, 26> nucleotide_counts{}; // value-initialized so every counter starts at zero
while(getline(sequence_file, line)) {
if(line[0] != '>') {
sequence += line;
}
else {
count_nucleotides(nucleotide_counts, sequence);
sequence = "";
}
}
In order of importance:
Good code for this task will 100% be I/O-bound. Your processor can count characters much faster than your disk can pump them to the CPU. Thus, the first question to me is: What is the throughput of your storage medium? What are your ideal RAM and cache throughputs? Those are the upper limits. If you've hit them, there's not much point in looking at your code further. It's possible that your boost solution is there already.
std::map lookups are relatively expensive. Yes, it's O(log(N)), but your N=5 is small and constant, so this tells you nothing. For 5 values, the map will have to chase about three pointers for every lookup (not to mention how impossible this is for the branch predictor). Your count solution has 5 map lookups and 5 traversals of each string, whereas your manual solution has a map lookup for every nucleotide (but only one traversal of the string).
Serious suggestion: Use a local variable for each counter. Those will almost surely get placed in CPU registers and are therefore essentially free. You won't ever pollute your cache with the counters that way, unlike map, unordered_map, vector etc.
Replacing abstraction by repetition like this is usually not a good idea, but in this case, it's pretty inconceivable that you'll ever need significantly more counters, so scalability is not an issue.
Consider std::string_view (which would require a different method of reading the file) to avoid creating copies of the data. You load the entire data into memory from disk and then, for each sequence, you copy it. That's not really necessary and (depending on how smart your compiler is) can bog you down. Especially since you keep appending to the string until the next header (which is more unnecessary copying - you could just count after every line).
If, for some reason, you are not hitting the theoretical throughputs, consider multithreading and/or vectorization. But I can't imagine this would be necessary.
By the way, boost::count is a thin wrapper around std::count at least in this version.
I think you did the right thing here though: Writing good and readable code, then identifying it as performance bottleneck and checking if you can make it run faster (potentially by making it slightly more ugly).
If this is the main task you have to perform, you might have an interest in an awk solution. Various problems with FASTA files are very easily tackled with awk:
awk '/^>/ && c { for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" ; delete a }
/^>/ {print; c++; next}
{ for(i=1;i<=length($0);++i) a[substr($0,i,1)]++ }
END{ for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" }' fastafile
This outputs on your example:
>Header 1
N:1 A:7 C:6 G:8 T:8
>Header 2
A:10 C:10 G:11 T:12
note: I am aware that this is not C++, but it is often useful to show other means to achieve the same goal.
Benchmarks with awk:
testfile: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
unzipped size: 2.3G
total records: 5502947
total lines:
Script 0: (runtime: too long) The first mentioned script is utterly slow. Use only on small files
Script 1: (runtime: 484.31 sec) This is an optimised version where we do a targeted count:
/^>/ && f { for(i in c) printf i":"c[i]" "; print "" ; delete c }
/^>/ {print; f++; next}
{ s=$0
c["A"]+=gsub(/[aA]/,"",s)
c["C"]+=gsub(/[cC]/,"",s)
c["G"]+=gsub(/[gG]/,"",s)
c["T"]+=gsub(/[tT]/,"",s)
c["N"]+=gsub(/[nN]/,"",s)
}
END { for(i in c) printf i":"c[i]" "; print "" ; delete c }
Update 2: (runtime: 416.43 sec) Combine all the subsequences into a single sequence and count only once:
function count() {
c["A"]+=gsub(/[aA]/,"",string)
c["C"]+=gsub(/[cC]/,"",string)
c["G"]+=gsub(/[gG]/,"",string)
c["T"]+=gsub(/[tT]/,"",string)
c["N"]+=gsub(/[nN]/,"",string)
}
/^>/ && f { count(); for(i in c) printf i":"c[i]" "; print "" ; delete c; string=""}
/^>/ {print; f++; next}
{ string=string $0 }
END { count(); for(i in c) printf i":"c[i]" "; print "" }
Update 3: (runtime: 396.12 sec) Refine how awk finds its records and fields, and abuse this in a single go.
function count() {
c["A"]+=gsub(/[aA]/,"",string)
c["C"]+=gsub(/[cC]/,"",string)
c["G"]+=gsub(/[gG]/,"",string)
c["T"]+=gsub(/[tT]/,"",string)
c["N"]+=gsub(/[nN]/,"",string)
}
BEGIN{RS="\n>"; FS="\n"}
{
print $1
string=substr($0,length($1)); count()
for(i in c) printf i":"c[i]" "; print ""
delete c; string=""
}
Update 4: (runtime: 259.69 sec) Update the regex search in gsub. This creates a worthy speedup:
function count() {
n=length(string);
gsub(/[aA]+/,"",string); m=length(string); c["A"]+=n-m; n=m
gsub(/[cC]+/,"",string); m=length(string); c["C"]+=n-m; n=m
gsub(/[gG]+/,"",string); m=length(string); c["G"]+=n-m; n=m
gsub(/[tT]+/,"",string); m=length(string); c["T"]+=n-m; n=m
gsub(/[nN]+/,"",string); m=length(string); c["N"]+=n-m; n=m
}
BEGIN{RS="\n>"; FS="\n"}
{
print ">"$1
string=substr($0,length($1)); count()
for(i in c) printf i":"c[i]" "; print ""
delete c; string=""
}
Don't use a map if you want speed and can use an array. Also, std::getline can use a custom delimiter (instead of \n).
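ifstream sequence_file(input_file.c_str());
string header, sequence;
std::array<int, 26> nucleotide_counts{}; // zero-initialized
// For one sequence: skip the leading '>', read the header line,
// then read everything up to the next '>'
sequence_file.ignore(1, '>');
getline(sequence_file, header);
getline(sequence_file, sequence, '>');
for(auto&& c : sequence) {
    if (c >= 'A' && c <= 'Z') // skips the newlines between sequence lines
        ++nucleotide_counts[c-'A'];
}
// nucleotide_counts['X'-'A'] contains the count of nucleotide X in the sequence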
The reason why it's so slow is that you have indirect accesses all the time or 5 scans of the same string.
You don't need a map, use 5 integers, and increment them separately. Then it should be faster than the boost::count version because you don't traverse the string 5 times, and it will be faster than the map or the unordered_map increments because you won't have n indirect accesses.
so use something like:
switch(c) // c is the current character of the sequence
{
case 'A':
++a;
break;
case 'G':
++g;
break;
}
...
As people suggested in the comments, try something like this:
enum eNucleotide {
NucleotideA = 0,
NucleotideT,
NucleotideC,
NucleotideG,
NucleotideN,
Size,
};
void countSequence(std::string line)
{
long nucleotide_counts[eNucleotide::Size] = { 0 };
if(line[0] != '>') {
for(int i = 0; i < line.size(); ++i)
{
switch (line[i])
{
case 'A':
++nucleotide_counts[NucleotideA];
break;
case 'T':
++nucleotide_counts[NucleotideT];
break;
case 'C':
++nucleotide_counts[NucleotideC];
break;
case 'G':
++nucleotide_counts[NucleotideG];
break;
case 'N':
++nucleotide_counts[NucleotideN];
break;
default :
/// error condition
break;
}
}
/// print results
std::cout << "A: " << nucleotide_counts[NucleotideA] << " ";
std::cout << "T: " << nucleotide_counts[NucleotideT] << " ";
std::cout << "C: " << nucleotide_counts[NucleotideC] << " ";
std::cout << "G: " << nucleotide_counts[NucleotideG] << " ";
std::cout << "N: " << nucleotide_counts[NucleotideN] << std::endl;
}
}
and call this function with the content of every line. (I haven't tested this code.)

Is it possible to embed "for loop" inside an "if statement" to compare multiple condition before continuing using c++

In this program, the user must type in a 3-letter departing airport code (userFlight) and I will give them back the possible destinations. To check that what they typed in is one of the valid airport codes (departureAirport), I want to compare userFlight against the possible departure airports, which I have stored in a vector called flights[]. This code obviously isn't working, but is there a similar way to accomplish this?
if
(for (j = 0, j < flights.size, j++)
{
(userFlight != flights[j].departAirport)
})
{return errorCode};
else
{//doSomething()};
If the flight type has an operator== that compares against the airport code the way your condition does, how about
if(std::find(flights.begin(), flights.end(), userFlight) != flights.end())
{
/* found */
}
else
{
/* not found */
}
Else, if you don't like that, just check if the loop runs through all indices:
size_t i;
for (i = 0; i < flights.size(); i++)
{
    if(userFlight == flights[i].departAirport)
        break;
}
if(i < flights.size())
{
/* found */
}
else
{
/* not found */
}
But no, a syntax like you want doesn't exist.
The code structure you were aiming for is:
for (j = 0; j < flights.size(); j++)
if (userFlight == flights[j].departAirport)
break;
if ( j == flights.size() ) // we got to the end
return errorCode;
doSomething(j);
However, this is a C-like code style. Not that there is anything wrong with that, but C++ allows for algorithms to be expressed more abstractly (and therefore, easier to read and maintain). IMHO it would be better to use one of the other suggestions such as std::set or std::find_if.
It sounds like you actually want to have a std::set of departing airports.
std::set<std::string> departing_airports = {"DTW", "MKE", "MSP", };
assert(departing_airports.count("DTW") == 1);
Yet another option is std::any_of. Assuming flights contains objects of type Flight:
if (!std::any_of(std::begin(flights), std::end(flights),
        [&](const Flight& f) { return userFlight == f.departAirport; }))
    return errorCode; // no flight departs from userFlight
doSomething();

iterating vector of strings C++

The code is meant to read instructions from a text file and print out graphic patterns. One of my functions is not working properly. The function is supposed to read the vectors of strings I got from the file into structs.
Below is my output, and my second, third, and sixth graphs are wrong. It seems like the 2nd and 3rd vectors are not putting the correct row and column numbers; and the last one skipped "e" in the alphabetical order.
I tried to debug many times and still can't find the problem.
typedef struct Pattern{
int rowNum;
int colNum;
char token;
bool isTriangular;
bool isOuter;
}Pattern;
void CommandProcessing(vector<string>& , Pattern& );
int main()
{
for (int i = 0; i < command.size(); i++)
{
Pattern characters;
CommandProcessing(command[i], characters);
}
system("pause");
return 0;
}
void CommandProcessing(vector<string>& c1, Pattern& a1)
{
reverse(c1.begin(), c1.end());
string str=" ";
for (int j = 0; j < c1.size(); j++)
{
bool foundAlpha = find(c1.begin(), c1.end(), "alphabetical") != c1.end();
bool foundAll = find(c1.begin(), c1.end(), "all") != c1.end();
a1.isTriangular = find(c1.begin(), c1.end(), "triangular") != c1.end() ? true : false;
a1.isOuter = find(c1.begin(), c1.end(), "outer") != c1.end() ? true : false;
if (foundAlpha ==false && foundAll == false){
a1.token = '*';
}
//if (c1[0] == "go"){
else if (c1[j] == "rows"){
str = c1[++j];
a1.rowNum = atoi(str.c_str());
j--;
}
else if (c1[j] == "columns"){
str = c1[++j];
a1.colNum = atoi(str.c_str());
j--;
}
else if (c1[j] == "alphabetical")
a1.token = 0;
else if (c1[j] == "all"){
str = c1[--j];
a1.token = *str.c_str();
j++;
}
}
}
Before debugging (or posting) your code, you should try to make it cleaner. It contains many strange / unnecessary parts, making your code harder to understand (and resulting in the buggy behaviour you just described).
For example, you have an if in the beginning:
if (foundAlpha ==false && foundAll == false){
If there is no alpha and all command, this will be always true, for the entire length of your loop, and the other commands are all placed in else if statements. They won't be executed.
Because of this, in your second and third example, no commands will be read, except the isTriangular and isOuter flags.
Instead of a mixed structure like this, consider the following changes:
add a default constructor to your Pattern struct, initializing its members. For example if you initialize token to *, you can remove that if, and even the two bool variables required for it.
Do the parsing in one way, consistently - the easiest would be moving your triangular and outer bool to the same if structure as the others. (or if you really want to keep this find lookup, move them before the for loop - you only have to set them once!)
Do not modify your loop variable ever, it's an error magnet! Okay, there are some rare exceptions for this rule, but this is not one of them.
Instead of str = c1[++j];, and decrementing later, you could just write str = c1[j+1]
Also, are you sure you need that reverse? It makes your relative +/-1 indexing unclear. For example, c1[j+1] after the reverse is the word at j-1 in the original command string.
About the last one: that's probably a bug in your outer printing code, which you didn't post.

How to determine whether the 100 numbers in an array are all different from each other

I am coding a Sudoku program. I am finding it hard to determine whether the numbers in an array duplicate each other.
Now I have an array: int streamNum[SIZE]
If SIZE=3, I can handle this problem like: if(streamNum[0]!=streamNum[1])...
If SIZE=100, I think that I need a better solution. Is there any standard practice?
There are a couple of different ways to do this. I suppose the easiest is to write two loops:
bool has_duplicate = false;
for (int i = 0; i < SIZE && !has_duplicate; ++i)
for (int j = i + 1; j < SIZE && !has_duplicate; ++j)
if (streamNum[i] == streamNum[j])
has_duplicate = true;
if (has_duplicate)
{
...
}
else
{
...
}
The first loop goes through each element in the array, the second loop checks if there is a duplicate in the remaining elements of the array (that's why it starts at i + 1). Both loops quit as soon as you find a duplicate (that's what && !has_duplicate does).
This is not the most efficient way, more efficient would be to sort the array before looking for duplicates but that would modify the contents of the array at the same time.
I hope I've understood your requirements well enough.
for(int i=0;i<SIZE;i++){
    for(int j=i+1;j<SIZE;j++){
        if(streamNum[i]==streamNum[j]){
            ...........
        }
    }
}
I assume that you need to know whether there is a duplicate or not, so this may be helpful.
If not, leave a comment.
It's a little unclear what exactly you're looking to do here but I'm assuming as it's sudoku you're only interested in storing numbers 1-9?
If so, to test for a duplicate you could iterate through the source array and use a second array (with 10 elements, so that the values 1-9 can be used directly as indices; I've called it flag) to hold a flag showing whether each number has been used or not.
So.. something like:
bool flag[10] = { false }; // index 0 is unused; values are assumed to be 1-9
for (int loop = 0; loop < size; loop++) {
    if (flag[streamNum[loop]]) {
        //duplicate - do something & break this loop
        break;
    }
    else {
        flag[streamNum[loop]] = true;
    }
}
Here's how I'd test against Sudoku rules - it checks horizontal, vertical and 3x3 block using the idea above but here 3 different flag arrays for the 3 rules. This assumes your standard grid is held in an 81-element array. You can easily adapt this to cater for partially-completed grids..
bool duplicate = false;
for (int loop = 0; loop < 9 && !duplicate; loop++) {
    // fresh flags for row, column and 3x3 block number `loop` (index 0 unused)
    bool flagH[10] = { false };
    bool flagV[10] = { false };
    bool flagS[10] = { false };
    for (int loop2 = 0; loop2 < 9; loop2++) {
        //horizontal
        int h = streamNum[(loop*9)+loop2];
        if (flagH[h]) duplicate = true;
        flagH[h] = true;
        //column test
        int v = streamNum[loop+(loop2*9)];
        if (flagV[v]) duplicate = true;
        flagV[v] = true;
        //3x3 sub section test
        int basecell = (loop%3)*3 + (loop/3)*27; //topleft corner of 3x3 square
        int cell = basecell + (loop2%3) + (loop2/3)*9;
        int s = streamNum[cell];
        if (flagS[s]) duplicate = true;
        flagS[s] = true;
    }
}