I am very new to C++. I am trying to split a string that contains even-length substrings until there is no even-length substring left. For example, if I input AB ABCD ABC, the output should be A B A B C D ABC. I am trying to do it without tokens, because I don't know how to.
What I have so far only splits the first even substring, and it doesn't work if I only have one substring. Can someone please help me out?
Any advice will be much appreciated. Thank you!
string temp = "";

void check(string &str, int &i, int &flag)
{
    int count = 0;
    int reminder;
    do
    {
        count++;
        temp += str[i];
        i++;
    } while (str[i] != ' ');
    i = i - temp.size();
    reminder = count % 2;
    if (reminder == 0)
        flag = 1;
    else
        flag = 0;
}

void SplitEvenWord(string &str)
{
    int i = 0;
    int flag = 0;
    for (i = 0; i < str.size(); i++)
    {
        check(str, i, flag);
        if (flag == 1)
        {
            temp.insert(temp.size() / 2, " ");
            str.replace(i, temp.size() - 1, temp);
        }
    }
}
There are two skills that are absolutely vital in software engineering (Well, more than two, but two for now): developing new functions in isolation, and testing things in the simplest possible way.
You say that the code fails if there is only one substring. You don't say how it fails (I should have mentioned clear error reports in the list) so I don't know whether to test your code with an even-length string which it ought to split ("ABCD" => "A B C D") or an odd-length string which it ought to leave alone ("ABC" => "ABC"). Before I try to code these up, I look at your first function:
void check(string &str, int &i, int &flag)
{
    ...
    do
    {
        count++;
        temp += str[i];
        i++;
    } while (str[i] != ' ');
    ...
}
Trouble already. The strings I have in mind do not contain any spaces, so the loop cannot terminate. This code will run past the end of the string into whatever happens to be in that memory space, which will cause undefined behavior. (If you don't know that term, it means that there's no telling what will happen, but if you're lucky the program will just crash.)
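If it helps to see where a guard could go, here is a minimal sketch that keeps the structure of check() (including the global temp, which has its own problems) but also stops at the end of the string:

#include <string>
using std::string;

string temp = "";

// Same shape as the original check(), but the loop condition also tests the
// index against str.size(), so it cannot read past the end of the string.
// Note: temp is never cleared between calls here, which is a separate issue.
void check(string &str, int &i, int &flag)
{
    int count = 0;
    do
    {
        count++;
        temp += str[i];
        i++;
    } while (i < (int)str.size() && str[i] != ' ');
    i = i - (int)temp.size();
    flag = (count % 2 == 0) ? 1 : 0;
}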
Fix that in your own code, then try running it on "ABC" and "ABCD" and "A" and "" and "ABC DEF", and get it working perfectly. Once it does, take a look at your other function. Don't test it with random typing; test it with short, clearly defined strings. Once it works perfectly, try longer, more complicated ones. If you find a string which causes it to fail, hold onto it! That string will lead you to a bug.
That should be enough to get you started.
I'm writing this as an answer because it was too long to fit as a comment.
I have a couple of suggestions that may help you to figure out what the problem is.
Separate "check" into at least two functions, one to split the string into individual words and check them and one to check the length of the string.
Test the "check" and "tokenize" functions by separately and see if they give you the expected answers. Work on them individually until they are correct.
Separate the formatting of the answers out of "SplitEvenWord" into a separate function.
"SplitEvenWord" should then be nothing more than calling the functions you created as a result of the steps above.
When I'm stuck, I always try to break the problem down into small bite sized pieces that I know I can get working. Eventually, the problem becomes assembling the already working pieces of the solution into a larger function that solves the original problem.
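To make that decomposition concrete, here is one possible sketch; the function names are made up, and it assumes the "split an even-length word in half, repeatedly" behavior implied by the question's temp.insert(temp.size() / 2, " ") line:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split the input into whitespace-separated words.
std::vector<std::string> tokenize(const std::string &str)
{
    std::vector<std::string> words;
    std::istringstream in(str);
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}

// Split an even-length word in half, repeatedly, until every piece is odd.
std::string splitWhileEven(const std::string &word)
{
    if (word.empty() || word.size() % 2 != 0)
        return word;
    std::string left = word.substr(0, word.size() / 2);
    std::string right = word.substr(word.size() / 2);
    return splitWhileEven(left) + " " + splitWhileEven(right);
}

int main()
{
    // "AB ABCD ABC" should come out as "A B A B C D ABC".
    std::vector<std::string> words = tokenize("AB ABCD ABC");
    for (std::size_t i = 0; i < words.size(); ++i)
    {
        if (i != 0) std::cout << ' ';
        std::cout << splitWhileEven(words[i]);
    }
    std::cout << '\n';
}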
I have code that is supposed to separate a string into 3-character sections:
ABCDEFG should be ABC DEF G
However, I have an extremely long string and I keep getting the error
terminate called without an active exception
When I cut the length of the string down, it seems to work. Do I need more space? I thought when using a string I didn't have to worry about space.
int main ()
{
    string code, default_Code, start_C;
    default_Code = "TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT";
    start_C = "AUG";
    code = default_Code;
    for (double j = 0; j < code.length(); j++) { //insert spacing here
        code.insert(j += 3, 1, ' ');
    }
    cout << code;
    return 0;
}
Think about the case when code.length() == 2. You're inserting a space somewhere past the end of the string. I'm not sure, but it might be okay with for (int j = 0; j + 3 < code.length(); j++).
This is some fairly confusing code. You are looping through a string until you reach the end of the string. However, inside the loop you are not only modifying the string you are looping through, but also changing the loop variable when you say j += 3.
It happens to work for strings whose length is a multiple of 3, but you are not correctly handling other cases.
Here is a working version of the for loop that is a bit clearer about what it's doing:
// We skip 4 each time because we added a space.
for (int j = 3; j < code.length(); j += 4)
{
    code.insert(j, 1, ' ');
}
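For example, dropped into a small self-contained program (using string::size_type for the index to avoid a signed/unsigned comparison; otherwise the loop is unchanged):

#include <iostream>
#include <string>
using namespace std;

int main()
{
    string code = "ABCDEFG";
    // We skip 4 each time because we added a space.
    for (string::size_type j = 3; j < code.length(); j += 4)
    {
        code.insert(j, 1, ' ');
    }
    cout << code << '\n'; // prints "ABC DEF G"
    return 0;
}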
You are using an extremely inefficient method to do such an operation. Every time you insert a space you are moving the entire remaining part of the string forward, which means the total number of operations you need is on the order of O(n^2).
You can instead do this transformation in a single O(n) pass by using a read-write approach:
// input string is assumed to be non-empty
std::string new_string((old_string.size()*4 - 1)/3, ' ');
int writeptr = 0, count = 0;
for (int readptr = 0, n = old_string.size(); readptr < n; readptr++) {
    new_string[writeptr++] = old_string[readptr];
    if (++count == 3 && readptr + 1 < n) { // no space after the last group
        count = 0;
        new_string[writeptr++] = ' ';
    }
}
A similar algorithm can also be written to work "in place" instead of creating a new string: you simply have to enlarge the string first and then work backward.
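For reference, a minimal sketch of that in-place variant (the function name is made up; it assumes groups of three with no trailing space, and it grows the string before filling it from the back):

#include <string>

// Insert a space after every third character, working in place.
// The string is grown first, then filled from the back so that no
// character is overwritten before it has been moved.
void space_every_three(std::string &s) {
    if (s.empty()) return;
    const std::size_t n = s.size();
    const std::size_t spaces = (n - 1) / 3;   // no trailing space
    s.resize(n + spaces);
    std::size_t write = s.size();             // one past the last slot
    for (std::size_t read = n; read-- > 0; ) {
        s[--write] = s[read];
        if (read > 0 && read % 3 == 0)        // a new group of three starts here
            s[--write] = ' ';
    }
}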
Note also that, while it's true that for a string you don't need to worry about allocation and deallocation, there are still limits on the size of a string object (even if you are probably not hitting them... your version is so slow that it would take forever to reach that point on a modern computer).
I have a C++ program that reads in data from a binary file, and originally I stored the data in std::vector<char*> data. I have changed my code so that I am now using strings instead of char*, so it is now std::vector<std::string> data. Some of the changes I had to make were, for example, switching from strcmp to compare.
However I have seen my execution time dramatically increase. For a sample file, when I used char* it took 0.38s and after the conversion to string it took 1.72s on my Linux machine. I observed a similar problem on my Windows machine with execution time increasing from 0.59s to 1.05s.
I believe this function is causing the slowdown. It is part of the converter class; note that private variables are designated with _ at the end of the variable name. I clearly am having memory problems here and am stuck between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.
I access ids_ and names_ many times in another function too, so access speed is very important. By creating a map instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!
Example NewList.txt:
2515 ABC 23.5 32 -99 1875.7 1
1676 XYZ 12.5 31 -97 530.82 2
279 FOO 45.5 31 -96 530.8 3
OLD Code:
void converter::updateNewList(){
    FILE* NewList;
    char lineBuffer[100];
    char* id = 0;
    char* name = 0;
    int l = 0;
    int n;
    NewList = fopen("NewList.txt","r");
    if (NewList == NULL){
        std::cerr << "Error in reading NewList.txt\n";
        exit(EXIT_FAILURE);
    }
    while(!feof(NewList)){
        fgets (lineBuffer , 100 , NewList); // Read line
        l = 0;
        while (!isspace(lineBuffer[l])){
            l = l + 1;
        }
        id = new char[l];
        switch (l){
            case 1:
                n = sprintf (id, "%c", lineBuffer[0]);
                break;
            case 2:
                n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
                break;
            case 3:
                n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);
                break;
            case 4:
                n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2], lineBuffer[3]);
                break;
            default:
                n = -1;
                break;
        }
        if (n < 0){
            std::cerr << "Error in processing ids from NewList.txt\n";
            exit(EXIT_FAILURE);
        }
        l = l + 1;
        int s = l;
        while (!isspace(lineBuffer[l])){
            l = l + 1;
        }
        name = new char[l-s];
        switch (l-s){
            case 2:
                n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
                break;
            case 3:
                n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
                break;
            case 4:
                n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2], lineBuffer[s+3]);
                break;
            default:
                n = -1;
                break;
        }
        if (n < 0){
            std::cerr << "Error in processing short name from NewList.txt\n";
            exit(EXIT_FAILURE);
        }
        ids_.push_back ( std::string(id) );
        names_.push_back(std::string(name));
    }
    bool isFound = false;
    for (unsigned int i = 0; i < siteNames_.size(); i ++) {
        isFound = false;
        for (unsigned int j = 0; j < names_.size(); j ++) {
            if (siteNames_[i].compare(names_[j]) == 0){
                isFound = true;
            }
        }
    }
    fclose(NewList);
    delete [] id;
    delete [] name;
}
C++ CODE
void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");
    while(NewList.good()){
        unsigned int id (0);
        std::string name;
        // get the ID and name
        NewList >> id >> name;
        // ignore the rest of the line
        NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');
        info_.insert(std::pair<std::string, unsigned int>(name, id));
    }
    NewList.close();
}
UPDATE: Follow-up question: Bottleneck from comparing strings. Thanks for the very useful help! I will not be making these mistakes in the future!
My guess is that it is tied to the vector<string>'s performance.
About the vector
A std::vector works with an internal contiguous array, meaning that once the array is full, it needs to create another, larger array and copy the strings one by one, which means a copy-construction and then a destruction of each string even though the contents are the same, which is counter-productive...
To confirm this easily, use a std::vector<std::string *> and see if there is a difference in performance.
If this is the case, then you can do one of these four things:
if you know (or have a good idea of) the final size of the vector, use its method reserve() to reserve enough space in the internal array, to avoid useless reallocations (see the sketch after this list).
use a std::deque, which works almost like a vector
use a std::list (which doesn't give you random access to its items)
use the std::vector<char *>
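As a rough sketch of the reserve() option (the container and the line-count estimate here are illustrative, not taken from the question's code):

#include <string>
#include <vector>

// Sketch: reserve once up front so the repeated push_back calls below do not
// trigger grow-and-copy cycles. The estimate (expected_lines) is hypothetical.
void fillNames(std::vector<std::string> &names, std::size_t expected_lines)
{
    names.reserve(expected_lines);
    for (std::size_t i = 0; i < expected_lines; ++i)
        names.push_back("example"); // no reallocation while size <= capacity
}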
About the string
Note: I'm assuming that your strings/char * are created once, and not modified (through a realloc, an append, etc.).
If the ideas above are not enough, then...
The allocation of the string object's internal buffer is similar to a malloc of a char *, so you should see little or no differences between the two.
Now, if your char * are in truth char[SOME_CONSTANT_SIZE], then you avoid the malloc (and thus, will go faster than a std::string).
Edit
After reading the updated code, I see the following problems.
if ids_ and names_ are vectors, and if you have the slightest idea of the number of lines, then you should use reserve() on ids_ and names_
consider making ids_ and names_ deques, or lists.
faaNames_ should be a std::map, or even a std::unordered_map (or whatever hash_map you have on your compiler). Your search currently is two for loops, which is quite costly and inefficient (see the sketch after this list).
Consider comparing the lengths of the strings before comparing their contents. In C++, the length of a string (i.e. std::string::length()) is a zero-cost operation.
Now, I don't know what you're doing with the isFound variable, but if you need to find only ONE true equality, then I guess you should work on the algorithm (I don't know if there is already one, see http://www.cplusplus.com/reference/algorithm/), but I believe this search could be made a lot more efficient just by thinking about it.
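A minimal sketch of what the hash-map lookup could look like (the function and parameter names are made up; it assumes the map stores name -> id pairs as in the updated code):

#include <string>
#include <unordered_map>
#include <vector>

// Sketch: one hash lookup per site name instead of a nested loop over names_.
bool allSitesKnown(const std::vector<std::string> &siteNames,
                   const std::unordered_map<std::string, unsigned int> &info)
{
    for (std::size_t i = 0; i < siteNames.size(); ++i)
    {
        if (info.find(siteNames[i]) == info.end())
            return false; // this site name was not in the file
    }
    return true;
}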
Other comments:
Forget the use of int for sizes and lengths in the STL. At the very least, use size_t. On 64-bit platforms, size_t becomes 64-bit while int remains 32-bit, so your code is not 64-bit ready (on the other hand, I see few cases of incoming 8 GB strings... but still, better to be correct...).
Edit 2
The two (so-called C and C++) codes are different. The "C code" expects ids and names of length less than 5, or the program exits with an error. The "C++ code" has no such limitation. Still, this limitation is grounds for massive optimization, if you confirm names and ids are always fewer than 5 characters.
Before fixing something, make sure that it is the bottleneck. Otherwise you are wasting your time. Also, this sort of optimization is micro-optimization. If you are doing micro-optimization in C++, then consider using bare C.
Resize the vector to a large enough size before you start populating it. Or, use pointers to strings instead of strings.
The thing is that the strings are copied each time the vector is auto-resized. For small objects such as pointers this costs nearly nothing, but for strings each one is copied in full.
Also, id and name should be string instead of char*, and be initialized like this (assuming that you still use string instead of string*):
id = string(lineBuffer, lineBuffer + l);
...
name = string(lineBuffer + s, lineBuffer + l);
...
ids_.push_back(id);
names_.push_back(name);
Except for std::string, this is a C program.
Try using fstream, and use the profiler to detect the bottleneck.
You can try to reserve a number of vector elements in order to reduce the number of allocations (which are costly), as Dialecticus said (probably from ancient Rome?).
But there is something that may deserve some observation: how do you store the strings from the file? Do you perform concatenations, etc.?
In C, strings (which do not exist per se - they don't have a container from a library like the STL) need more work to deal with, but at least we know clearly what happens when dealing with them. In the STL, each convenient operation (meaning requiring less work from the programmer) may actually require a lot of operations behind the scenes, within the string class, depending on how you use it.
So, while the allocations / freeings are a costly process, the rest of the logic, especially the string processing, may / should probably be looked at as well.
I believe the main issue here is that your string version is copying things twice -- first into the dynamically allocated char arrays called name and id, and then into std::strings, while your vector<char *> version probably does not do that. To make the string version faster, you need to read directly into the strings and get rid of all the redundant copies.
Streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:
void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");
    while(NewList.good()){
        int id (0);
        std::string name;
        // get the ID and name
        NewList >> id >> name;
        // ignore the rest of the line
        NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');
        ids_.push_back (id);
        names_.push_back(name);
    }
    NewList.close();
}
There's no need to do the whitespace-tokenizing manually.
Also, you may find this site a helpful reference:
http://www.cplusplus.com/reference/iostream/ifstream/
You can use a profiler to find out where your code consumes most of its time. If you are, for example, using gcc, you can compile your program with -pg. When you run it, it saves profiling results in a file. You can then run gprof on the binary to get human-readable results. Once you know where most of the time is consumed, you can post that piece of code for further questions.
int main()
{
    int var = 0;; // Typo which compiles just fine
}
How else could assert(foo == bar); compile down to nothing when NDEBUG is defined?
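A small sketch of the kind of macro that leaves a lone semicolon behind (the LOG macro is made up; the real assert is defined slightly differently, as noted in the comment):

#include <iostream>

// Hypothetical logging macro: in release builds it expands to nothing at all,
// so a line like "LOG(x);" leaves behind just a bare ";" -- an empty statement.
// (The real assert expands to ((void)0) instead, but the trailing semicolon
// still has to be accepted by the grammar either way.)
#ifdef NDEBUG
#define LOG(msg)
#else
#define LOG(msg) std::cerr << (msg) << '\n'
#endif

int main()
{
    LOG("starting up"); // with NDEBUG defined, this whole line is just ";"
    return 0;
}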
This is the way C and C++ express NOP.
You want to be able to do things like
while (fnorble(the_smurf) == FAILED)
;
and not
while (fnorble(the_smurf) == FAILED)
do_nothing_just_because_you_have_to_write_something_here();
But! Please do not write the empty statement on the same line, like this:
while (fnorble(the_smurf) == FAILED);
That’s a very good way to confuse the reader, since it is easy to miss the semicolon, and therefore think that the next row is the body of the loop. Remember: Programming is really about communication — not with the compiler, but with other people, who will read your code. (Or with yourself, three years later!)
I'm no language designer, but the answer I'd give is "why not?" From the language design perspective, one wants the rules (i.e. the grammar) to be as simple as possible.
Not to mention that "empty expressions" have uses, e.g.
for (i = 0; i < INSANE_NUMBER; i++);
will busy-wait (not a good use, but a use nonetheless).
EDIT: As pointed out in a comment to this answer, any compiler worth its salt would probably not busy wait at this loop, and optimize it away. However, if there were something more useful in the for head itself (other than i++), which I've seen done (strangely) with data structure traversal, then I imagine you could still construct a loop with an empty body (by using/abusing the "for" construct).
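For illustration, a sketch of that kind of traversal (the Node type and last() function are made up):

// Sketch of the "all the work happens in the for head" traversal mentioned
// above, walking to the last node of a singly linked list.
struct Node {
    int value;
    Node* next;
};

Node* last(Node* head) {
    Node* n = head;
    // The body is intentionally empty: the update expression does all the work.
    for (; n != 0 && n->next != 0; n = n->next)
        ; // empty statement, kept on its own line so it is easy to see
    return n;
}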
OK, I'll add this as a worst-case scenario that you may actually use:
for (int yy = 0; yy < nHeight; ++yy) {
    for (int xx = 0; xx < nWidth; ++xx) {
        for (int vv = yy - 3; vv <= yy + 3; ++vv) {
            for (int uu = xx - 3; uu <= xx + 3; ++uu) {
                if (test(uu, vv)) {
                    goto Next;
                }
            }
        }
        Next:;
    }
}
I honestly don't know if this is the real reason, but I think something that makes more sense is to think about it from the standpoint of a compiler implementer.
Large portions of compilers are built by automated tools that analyze special classes of grammars. It seems very natural that useful grammars would allow for empty statements. It seems like unnecessary work to detect such an "error" when it doesn't change the semantics of your code. The empty statement won't do anything, as the compiler won't generate code for those statements.
It seems to me that this is just a result of "Don't fix something that isn't broken"...
Obviously, it is so that we can say things like
for (;;) {
// stuff
}
Who could live without that?
When using ;, please also be aware of one thing. This is OK:
a ? b() : c();
However this won't compile:
a ? b() : ; ;
There are already many good answers, but I have not seen a production-environment sample.
Here is FreeBSD's implementation of strlen:
size_t
strlen(const char *str)
{
    const char *s;
    for (s = str; *s; ++s)
        ;
    return (s - str);
}
The most common case is probably
int i = 0;
for (/* empty */; i != 10; ++i) {
    if (x[i].bad) break;
}
if (i != 10) {
    /* panic */
}
while (1) {
    ; /* do nothing */
}
There are times when you want to sit and do nothing: an event/interrupt-driven embedded application, or when you don't want a function to exit, such as when setting up threads and waiting for the first context switch.
example:
http://lxr.linux.no/linux+v2.6.29/arch/m68k/mac/misc.c#L523
I'm trying to write a function that is able to determine whether a string contains a real or an integer value.
This is the simplest solution I could think of:
int containsStringAnInt(char* strg){
    for (int i = 0; i < strlen(strg); i++) {
        if (strg[i] == '.') return 0;
    }
    return 1;
}
But this solution is really slow when the string is long... Any optimization suggestions?
Any help would really be appreciated!
What's the syntax of your real numbers?
1e-6 is a valid C++ literal, but it will be classified as an integer by your test.
Is your string hundreds of characters long? If not, don't worry about any possible performance issues.
The only inefficiency is that you are calling strlen() on every iteration of the loop, which means many passes over the string. For a simpler solution with the same time complexity (O(n)), but probably slightly faster, use strchr().
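A minimal sketch of the strchr() version (same single pass, same return convention as the question):

#include <string.h>

// Returns 1 if no '.' was found (integer-looking), 0 otherwise.
int containsStringAnInt(char* strg){
    return strchr(strg, '.') == NULL ? 1 : 0;
}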
You are using strlen, which means you are not worried about Unicode. In that case, why use strlen or strchr at all? Just check for '\0' (the null char):
int containsStringAnInt(char* strg){
    for (int i = 0; strg[i] != '\0'; i++) {
        if (strg[i] == '.') return 0;
    }
    return 1;
}
This parses the string only once, rather than re-parsing it (inside strlen) on every iteration of the loop.
Your function does not take into account exponential notation of reals (1E7 and 1E-7 are both doubles).
Use strtol() to try to convert the string to integer first; it will also return the first position in the string where the parsing failed (this will be '.' if the number is real). If the parsing stopped at '.', use strtod() to try to convert to double. Again, the function will return the position in the string where the parsing stopped.
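A rough sketch of that approach (the names and return values are made up, and it ignores overflow and locale details):

#include <stdlib.h>

enum { IS_INTEGER, IS_REAL, IS_NEITHER };

int classify(const char* s) {
    char* end;
    strtol(s, &end, 10);              /* try integer first */
    if (end != s && *end == '\0')
        return IS_INTEGER;
    if (*end == '.' || *end == 'e' || *end == 'E') {
        strtod(s, &end);              /* retry as floating point */
        if (end != s && *end == '\0')
            return IS_REAL;
    }
    return IS_NEITHER;
}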
Don't worry about performance until you have profiled the program. Otherwise, for the fastest possible code, construct a regular expression that describes the acceptable syntax of numbers, and hand-convert it first into an FSM, then into highly optimized code.
So, the standard note first: please don't worry too much about performance if you haven't profiled yet :)
I'm not sure about the manual loop and checking for a dot. Two issues:
Depending on the locale, the dot can actually be a "," too (here in Germany that's the case :)
As others noted, there is the issue with numbers like 1e7
Previously I had a version using sscanf here, but measuring performance showed that sscanf is significantly slower for bigger data sets. So I'll show the faster solution first (well, it's also a whole lot simpler; I had several bugs in the sscanf version until I got it working, while the strto[ld] version worked on the first try):
enum {
    REAL,
    INTEGER,
    NEITHER_NOR
};

int what(char const* strg){
    char *endp;
    strtol(strg, &endp, 10);
    if(*strg && !*endp)
        return INTEGER;
    strtod(strg, &endp);
    if(*strg && !*endp)
        return REAL;
    return NEITHER_NOR;
}
Just for fun, here is the version using sscanf:
int what(char const* strg) {
    // test for int
    {
        int d;     // converted value
        int n = 0; // number of chars read
        int rd = std::sscanf(strg, "%d %n", &d, &n);
        if(!strg[n] && rd == 1) {
            return INTEGER;
        }
    }
    // test for double
    {
        double v;  // converted value
        int n = 0; // number of chars read
        int rd = std::sscanf(strg, "%lf %n", &v, &n);
        if(!strg[n] && rd == 1) {
            return REAL;
        }
    }
    return NEITHER_NOR;
}
I think that should work. Have fun.
Test was done by converting test strings (small ones) randomly 10000000 times in a loop:
6.6s for sscanf
1.7s for strto[dl]
0.5s for manual looping until "."
Clear win for strto[ld]; considering it also parses the numbers correctly, I'll declare it the winner over manual looping. Anyway, a difference of roughly 1.2s / 10000000 = 0.00000012s per conversion isn't all that much in the end.
strlen walks the string to find its length.
You are calling strlen on every pass of the loop. Hence, you are walking the string many more times than necessary. This tiny change should give you a huge performance improvement:
int containsStringAnInt(char* strg){
    int len = strlen(strg);
    for (int i = 0; i < len; i++) {
        if (strg[i] == '.') return 0;
    }
    return 1;
}
Note that all I did was find the length of the string once, at the start of the function, and refer to that value repeatedly in the loop.
Please let us know what kind of performance improvement this gets you.
@Aaron, with your approach you are also traversing the string twice: once within strlen, and once again in the for loop.
The best way to traverse an ASCII string in a for loop is to check for the null char in the loop itself. Have a look at my answer, which parses the string only once within the for loop, and may even stop early if it finds a '.' before the end. That way, if the string is something like 0.01xxx (another 100 chars), you need not go all the way to the end to find the length.
#include <stdlib.h>
int containsStringAnInt(char* strg){
    if (atof(strg) == atoi(strg))
        return 1;
    return 0;
}