Recursive String Transformations


EDIT: The main change I've made is to use iterators to keep track of successive positions in the bit and character strings, and to pass the strings by const reference. Now, when I copy the sample inputs onto themselves multiple times to test the timing, everything finishes within 10 seconds, even for really long bit and character strings and up to 50 lines of sample input. However, when I submit, CodeEval still says the process was aborted after 10 seconds. As I mention below, they don't share their input, so now that "extensions" of the sample input work, I'm not sure how to proceed. Any thoughts on an additional improvement to the performance of my recursion would be greatly appreciated.
NOTE: Memoization was a good suggestion but I could not figure out how to implement it in this case since I'm not sure how to store the bit-to-char correlation in a static look-up table. The only thing I thought of was to convert the bit values to their corresponding integer but that risks integer overflow for long bit strings and seems like it would take too long to compute. Further suggestions for memoization here would be greatly appreciated as well.
This is actually one of the moderate CodeEval challenges. They don't share the sample input or output for moderate challenges but the output "fail error" simply says "aborted after 10 seconds," so my code is getting hung up somewhere.
The assignment is simple enough. You take a filepath as the single command-line argument. Each line of the file will contain a sequence of 0s and 1s and a sequence of As and Bs, separated by a white space. You are to determine whether the binary sequence can be transformed into the letter sequence according to the following two rules:
1) Each 0 can be converted to any non-empty sequence of As (e.g., 'A', 'AA', 'AAA', etc.)
2) Each 1 can be converted to any non-empty sequence of As OR Bs (e.g., 'A', 'AA', etc., or 'B', 'BB', etc.), but not a mixture of the letters
The constraints are to process up to 50 lines from the file and that the length of the binary sequence is in [1,150] and that of the letter sequence is in [1,1000].
The most obvious starting algorithm is to do this recursively. What I came up with was: for each bit, collapse the entire next allowed group of characters first, then test the shortened bit and character strings. If that fails, add back one character at a time from the removed character group and call again.
Here is my complete code. I removed cmd-line argument error checking for brevity.
#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
using namespace std;
//typedefs
typedef string::const_iterator str_it;
//declarations
//use const ref and iterators to save time on copying and erasing
bool TransformLine(const string & bits, str_it bits_front, const string & chars, str_it chars_front);
int main(int argc, char* argv[])
{
//check there are at least two command line arguments: binary executable and file name
//ignore additional arguments
if(argc < 2)
{
cout << "Invalid command line argument. No input file name provided." << "\n"
<< "Goodbye...";
return -1;
}
//create input stream and open file
ifstream in;
in.open(argv[1], ios::in);
while(!in.is_open())
{
string name; //use a string: reading into an uninitialized char* is undefined behavior
cout << "Invalid file name. Please enter file name: ";
cin >> name;
in.open(name.c_str(), ios::in);
}
//variables
string line_bits, line_chars;
//reserve space up to constraints to reduce resizing time later
line_bits.reserve(150);
line_chars.reserve(1000);
int line = 0;
//loop over lines (<=50 by constraint, ignore the rest)
while((in >> line_bits >> line_chars) && (line < 50))
{
line++;
//impose bit and char constraints
if(line_bits.length() > 150 ||
line_chars.length() > 1000)
continue; //skip this line
cout << (TransformLine(line_bits, line_bits.begin(), line_chars, line_chars.begin()) ? "Yes\n" : "No\n");
}
//close file
in.close();
return 0;
}
bool TransformLine(const string & bits, str_it bits_front, const string & chars, str_it chars_front)
{
//using iterators so store current length as local const
//can make these const because they're not altered here
int bits_length = distance(bits_front, bits.end());
int chars_length = distance(chars_front, chars.end());
//check success rule
if(bits_length == 0 && chars_length == 0)
return true;
//Check fail rules (test the lengths first so an end iterator is never dereferenced):
//1. bits length is zero (but char is not, by previous if)
//2. char length is zero (but bits length is not, by previous if)
//3. next bit is 0 but next char is B
if(bits_length == 0 ||
chars_length == 0 ||
(*bits_front == '0' && *chars_front == 'B'))
return false;
//we now know that chars_length != 0 => chars_front != chars.end()
//kill a bit and then call recursively with each possible reduction of front char group
bits_length = distance(++bits_front, bits.end());
//current char group tracker
const char curr_char_type = *chars_front; //use const so compiler can optimize
int curr_pos = distance(chars.begin(), chars_front); //position of current front in char string
//since chars are 0-indexed, the following is also length of current char group
//start searching from curr_pos and length is relative to curr_pos so subtract it!!!
int curr_group_length = chars.find_first_not_of(curr_char_type, curr_pos)-curr_pos;
//make sure this isn't the last group!
if(curr_group_length < 0 || curr_group_length > chars_length)
curr_group_length = chars_length; //distance to end is precisely distance(chars_front, chars.end()) = chars_length
//kill the curr_char_group
//if curr_group_length = char_length then this will make chars_front = chars.end()
//and this will mean that chars_length will be 0 on next recursive call.
chars_front += curr_group_length;
curr_pos = distance(chars.begin(), chars_front);
//call recursively, adding back a char from the current group until 1 less than starting point
int added_back = 0;
while(added_back < curr_group_length)
{
if(TransformLine(bits, bits_front, chars, chars_front))
return true;
//insert back one char from the current group
else
{
added_back++;
chars_front--; //represents adding back one character from the group
}
}
//if here then all recursive checks failed so initial must fail
return false;
}
They give the following test cases, which my code solves correctly:
Sample input:
1010 AAAAABBBBAAAA
00 AAAAAA
01001110 AAAABAAABBBBBBAAAAAAA
1100110 BBAABABBA
Correct output:
Yes
Yes
Yes
No
Since a transformation is possible if and only if copies of it are, I tried just copying each binary and letter sequence onto itself various numbers of times and seeing how the timing goes. Even for very long bit and character strings and many lines, it has finished in under 10 seconds.
My question is: since CodeEval is still saying it is running longer than 10 seconds but they don't share their input, does anyone have any further suggestions to improve the performance of this recursion? Or maybe a totally different approach?
Thank you in advance for your help!

Here's what I found:
Pass by constant reference
Strings and other large data structures should be passed by constant reference.
This allows the compiler to pass a pointer to the original object, rather than making a copy of the data structure.
Call functions once, save result
You are calling bits.length() twice. You should call it once and save the result in a constant variable. This allows you to check the status again without calling the function.
Function calls are expensive for time critical programs.
Use constant variables
If you are not going to modify a variable after assignment, use the const in the declaration:
const char curr_char_type = chars[0];
The const allows compilers to perform higher order optimization and provides safety checks.
Change data structures
Since you are performing inserts, possibly in the middle of a string, you should use a different data structure for the characters. The std::string data type may need to reallocate after an insertion AND move the following letters further down. Insertion is faster with a std::list<char> because a linked list only relinks pointers. There may be a trade-off, because a linked list needs to dynamically allocate memory for each character.
Reserve space in your strings
When you create the destination strings, you should use a constructor that preallocates or reserves room for the largest size string. This will prevent the std::string from reallocating. Reallocations are expensive.
Don't erase
Do you really need to erase characters in the string?
By using starting and ending indices, you can overwrite existing letters without having to erase the entire string.
Partial erasures are expensive. Complete erasures are not.
For more assistance, post to Code Review at StackExchange.

This is a classic recursion problem. However, a naive implementation of the recursion leads to an exponential number of re-evaluations of previously computed function values. Using a simpler example for illustration, compare the runtime of the following two functions for a reasonably large N. Let's not worry about the int overflowing.
int RecursiveFib(int N)
{
if(N<=1)
return 1;
return RecursiveFib(N-1) + RecursiveFib(N-2);
}
int IterativeFib(int N)
{
if(N<=1)
return 1;
int a_0 = 1, a_1 = 1;
for(int i=2;i<=N;i++)
{
int temp = a_1;
a_1 += a_0;
a_0 = temp;
}
return a_1;
}
You would need to follow a similar approach here. There are two common ways of approaching the problem: dynamic programming and memoization. Memoization is the easiest way of modifying your approach. Below is a memoized Fibonacci implementation to illustrate how your implementation can be sped up.
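// requires #include <vector>
int MemoFib(int N)
{
// grows on demand; entries of -1 are not computed yet
static vector<int> memo;
if(N<=1)
return 1;
if(memo.size() <= static_cast<size_t>(N))
memo.resize(N+1, -1);
if(memo[N] != -1)
return memo[N];
// the recursive calls use smaller N, so they never resize memo
const int value = MemoFib(N-1) + MemoFib(N-2);
memo[N] = value;
return value;
}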
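For the bit/letter problem in the question, the same idea might look like the sketch below. It assumes that a sub-problem is fully identified by the pair (position in the bit string, position in the letter string), so the cache can be a small 2-D table rather than anything keyed on the bit values themselves. The names CanTransform and TransformLineMemo are hypothetical and not part of the original code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: memo[i][j] caches whether bits[i..] can produce chars[j..].
// -1 = not computed yet, 0 = impossible, 1 = possible.
bool CanTransform(const std::string& bits, const std::string& chars,
                  std::size_t i, std::size_t j,
                  std::vector<std::vector<signed char>>& memo)
{
    if (i == bits.size() && j == chars.size()) return true;   // both consumed
    if (i == bits.size() || j == chars.size()) return false;  // only one consumed
    signed char& cached = memo[i][j];
    if (cached != -1) return cached == 1;

    const char letter = chars[j];
    bool ok = false;
    // A 0 may only produce As; a 1 may produce a run of whichever letter comes next.
    if (!(bits[i] == '0' && letter == 'B')) {
        // Let the current bit consume chars[j..k], for every k inside the current run.
        for (std::size_t k = j; k < chars.size() && chars[k] == letter && !ok; ++k)
            ok = CanTransform(bits, chars, i + 1, k + 1, memo);
    }
    cached = ok ? 1 : 0;
    return ok;
}

// Hypothetical wrapper that allocates the table for one input line.
bool TransformLineMemo(const std::string& bits, const std::string& chars)
{
    std::vector<std::vector<signed char>> memo(
        bits.size(), std::vector<signed char>(chars.size(), -1));
    return CanTransform(bits, chars, 0, 0, memo);
}
```

Each (i, j) pair is evaluated at most once, so the work is bounded by the table size times the longest run of letters instead of growing exponentially. For the fourth sample line, TransformLineMemo("1100110", "BBAABABBA") returns false.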

Your failure message is "Aborted after 10 seconds" -- implying that the program was working fine as far as it went, but it took too long. This is understandable, given that your recursive program takes exponentially more time for longer input strings -- it works fine for the short (2-8 digit) strings, but will take a huge amount of time for 100+ digit strings (which the test allows for). To see how your running time goes up, you should construct yourself some longer test inputs and see how long they take to run. Try things like
0000000011111111 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAA
00000000111111110000000011111111 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAA
and longer. You need to be able to handle up to 150 digits and 1000 letters.

At CodeEval, you can submit a "solution" that just outputs what the input is, and do that to gather their test set. They may have variations so you may wish to submit it a few times to gather more samples. Some of them are too difficult to solve manually though... the ones you can solve manually will also run very quickly at CodeEval too, even with inefficient solutions, so there's that to consider.
Anyway, I did this same problem at CodeEval (using VB of all things), and my solution recursively looked for the "next index" of both A and B depending on what the "current" index is for where I was in a translation (after checking stoppage conditions first thing in the recursive method). I did not use memoization but that might've helped speed it up even more.
PS, I have not run your code, but it does seem curious that the recursive method contains a while loop within which the recursive method is called... since it's already recursive and should therefore encompass every scenario, is that while() loop necessary?

Related

Minimum Window Substring

I'm working on the Leetcode "Minimum Window Substring" practice problem:
Given two strings s and t of lengths m and n respectively, return the minimum window substring of s such that every character in t (including duplicates) is included in the window. If there is no such substring, return the empty string "".
The testcases will be generated such that the answer is unique.
Example 1:
Input: s = "ADOBECODEBANC", t = "ABC"
Output: "BANC"
Example 2:
Input: s = "a", t = "a"
Output: "a"
Example 3:
Input: s = "a", t = "aa"
Output: ""
Explanation: Both 'a's from t must be included in the window. Since the largest window of s only has one 'a', return empty string.
My solution uses two maps to keep track of character counts:
strr map is to keep count of characters in the window and
patt map is for the given pattern string.
It also uses two indices, start and end, to keep track of the current window (which includes end).
The core of the solution is an outer loop that advances end, adding the new character to strr. It then runs an inner loop as long as the window is valid that:
checks & updates the shortest window seen so far
removes the first character in the window
advances start.
Once the outer loop finishes, the shortest window it encountered should be the answer.
#include <climits>
#include <iostream>
#include <string>
#include <unordered_map>
bool check_map(std::unordered_map<char, int> patt, std::unordered_map<char, int> strr)
{
for(auto data:patt)
{
if(strr[data.first] != data.second)
return false;
}
return true;
}
std::string Substring(std::string s, std::string t)
{
std::unordered_map<char, int> patt;
std::unordered_map<char, int> strr;
std::string ans;
for(int i=0; i<t.length(); i++)
patt[t[i]]++;
int start = 0, length = INT_MAX;
for(int end=0; end<s.length(); end++)
{
strr[s[end]]++;
while(check_map(patt, strr))
{
if(length > (end-start+1))
{
ans = s.substr(start, end+1);
length = end-start+1;
}
strr[s[start]]--;
if(strr[s[start]] == 0)
strr.erase(s[start]);
start++;
}
}
return ans;
}
int main()
{
std::string s = "ADOBECODEBANC",
pattern = "ABC";
std::cout << "String: " << s << std::endl
<< "Pattern: " << pattern << std::endl
<< "Minimum Window Substring is " << Substring(s, pattern) << std::endl;
return 0;
}
For example 1 from the problem, the program should return "BANC" but instead returns "ADOBEC". Program output:
String: ADOBECODEBANC
Pattern: ABC
Minimum Window Substring is ADOBEC
Where is the error in my code?

I am very sorry that I cannot answer your concrete question of “where is the error in my code”.
But what I can do is help you understand the problem, develop an algorithm, and show one of many potential solutions.
The title of the question already implies which algorithm should be used: the so-called “Sliding Window” algorithm.
You will find a very good explanation from Said Sryheni here.
And for your problem, we will use the flexible-size sliding window approach.
We will iterate over the source string character by character and wait until a certain condition is met: in this case, until we have “seen” all characters that need to be searched for. Then we will find a window that contains all these characters.
In the given example, the end of the sliding window is always the last character read from the source string, because that character is the one that fulfills the condition. Then we need to find the beginning of the window: the position of the rightmost character (of the search characters) in the source string from which the condition is still fulfilled.
Then we will continue to read the source string and wait for the condition to be fulfilled again. And then we will recalculate the sliding window positions.
By the way, the other characters in the source string, besides the search characters, are just noise and only extend the width of the sliding window.
But how do we detect that the condition is met?
Especially since the order of the search characters does not matter, and the search string can even contain duplicate characters?
The solution is that we will “count”.
First, we will count the occurrences of all characters in the search string. Additionally, we will use a second counter that indicates whether all characters are matched.
Then, while iterating over the source string, we will decrement the counter of every character that we see. If the count of a search character hits 0, we will decrement the “match” counter. And if that is 0, we have found all search characters and the condition is fulfilled. We can then move on to the calculation of the window positions.
Please note: we will only decrement the match counter if the character counter is exactly 0 after decrementing it.
Example (the noise characters are shown as ‘x’ and skipped in the trace):
Search string “ABC”, source string: “xxAxxxxBBBxCAxx”.
The initial character counters for A,B,C will be 1,1,1; the match counter will be 3.
Reading the first ‘A’. Counters: 0,1,1; match counter: 2
Reading the first ‘B’. Counters: 0,0,1; match counter: 1
Reading the 2nd ‘B’. Counters: 0,-1,1; match counter: 1 (we decrement the match counter only when a character counter hits 0).
Reading the 3rd ‘B’. Counters: 0,-2,1; match counter: 1 (same reason).
Reading the first ‘C’. Counters: 0,-2,0; match counter: 0. The match counter is 0, so the condition is fulfilled.
Please note: negative character counts indicate that there are surplus occurrences of the same character further to the right.
Next, since the condition is now fulfilled, we will determine the positions of the sliding window. The end position is clear: it is the last character read from the source string, the one that led to the fulfillment of the condition. So, easy.
To get the start position of the sliding window, we advance from the last start position (initially the beginning of the source string) and increment the count of each character we pass. If the count of a search character becomes greater than 0, we increment the match counter again. As soon as the match counter is greater than 0, we have found the start position. Counters now: 1,-2,0; match counter: 1
The start position will be incremented for the next check. We will never start again at 0, but always continue from the last used start position.
OK, having found a start and end position, we have our first window and will look for potentially smaller windows. We will continue to read the source string and check whether the condition is fulfilled again.
After the calculation of the sliding window position, the counters are: 1,-2,0; match counter: 1
Reading the next ‘A’. Counters: 0,-2,0; match counter: 0. Again, the condition is fulfilled.
We continue with the sliding window detection. The last start position was pointing to the character ‘x’ after the first ‘A’.
Increment the start position and skip all ‘x’es. Continue:
Passing the first ‘B’. Counters: 0,-1,0; match counter: 0
Passing the 2nd ‘B’. Counters: 0,0,0; match counter: 0
Passing the 3rd ‘B’. Counters: 0,1,0; match counter: 1. Window position calculation done. The start position is the 3rd ‘B’. This window is smaller than the previous one, so take it.
Since the source string is consumed, we are done and have found the solution.
How do we implement that? We will do a small abstraction of the counter and pack it into a mini class. That will encapsulate the inner handling of character and match counts and can be optimized later.
A counter that works for all kinds of char values could be implemented as below:
struct SpecialCounterForGeneralChar {
std::unordered_map<char, int> individualLetter{};
int necessaryMatches{};
SpecialCounterForGeneralChar(const std::string& searchLetters) {
for (const char c : searchLetters) individualLetter[c]++;
necessaryMatches = individualLetter.size();
}
inline void incrementFor(const char c) {
individualLetter[c]++;
if (individualLetter[c] > 0)
++necessaryMatches;
}
inline void decrementFor(const char c) {
individualLetter[c]--;
if (individualLetter[c] == 0)
--necessaryMatches;
}
inline bool allLettersMatched() { return necessaryMatches == 0; }
};
If we know more about the input data and it is for example restricted to an 8 bit char, we can also use:
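struct SpecialCounter {
// int counters (a char counter could overflow); one slot per possible 8 bit value
int individualLetter[256]{};
int necessaryMatches{};
// char may be signed, so convert to unsigned char before using it as an index
static unsigned char toIndex(const char c) { return static_cast<unsigned char>(c); }
SpecialCounter(const std::string& searchLetters) {
for (const char c : searchLetters) {
if (individualLetter[toIndex(c)] == 0) ++necessaryMatches;
individualLetter[toIndex(c)]++;
}
}
inline void incrementFor(const char c) {
individualLetter[toIndex(c)]++;
if (individualLetter[toIndex(c)] > 0)
++necessaryMatches;
}
inline void decrementFor(const char c) {
individualLetter[toIndex(c)]--;
if (individualLetter[toIndex(c)] == 0)
--necessaryMatches;
}
inline bool allLettersMatched() const { return necessaryMatches == 0; }
};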
This will be slightly faster than the above (under the given restrictions).
With that, the rest of the program is just 15 lines of code.
The important message here is that we need to think long and hard before we start to write the first line of code.
A well-chosen algorithm and design will help us find an optimal solution.
Please see the complete example solution below:
#include <string>
#include <iostream>
#include <limits>
using Index = unsigned int;
// We want to hide the implementation of the special counter from the outside world
struct SpecialCounter {
// int counters (a char counter could overflow); one slot per possible 8 bit value
int individualLetter[256]{};
int necessaryMatches{};
// char may be signed, so convert to unsigned char before using it as an index
static unsigned char toIndex(const char c) { return static_cast<unsigned char>(c); }
SpecialCounter(const std::string& searchLetters) {
for (const char c : searchLetters) {
if (individualLetter[toIndex(c)] == 0) ++necessaryMatches;
individualLetter[toIndex(c)]++;
}
}
inline void incrementFor(const char c) {
individualLetter[toIndex(c)]++;
if (individualLetter[toIndex(c)] > 0)
++necessaryMatches;
}
inline void decrementFor(const char c) {
individualLetter[toIndex(c)]--;
if (individualLetter[toIndex(c)] == 0)
--necessaryMatches;
}
inline bool allLettersMatched() const { return necessaryMatches == 0; }
};
std::string solution(const std::string& toBeSearchedIn, const std::string& toBeSearchedFor) {
// Counter with some special properties
SpecialCounter counter(toBeSearchedFor);
// This window will slide. End of window is always the last read character. Start of window may increase
Index currentWindowStart {};
// The potential solution
Index resultingWindowStart {};
// Use the limit of Index itself; the limit of size_t would not fit into an unsigned int
Index resultingWindowWidth{ std::numeric_limits<Index>::max() };
// Iterate over all characters of the string under evaluation
for (Index index{}; index < toBeSearchedIn.length(); ++index) {
// We saw a character. So, subtract from characters to be searched
counter.decrementFor(toBeSearchedIn[index]);
// As long as all necessary characters are in the window, adjust the sliding window's start position
while (counter.allLettersMatched()) {
// Calculate the width of the sliding window and check whether we found a new, narrower window
const Index currentWindowWidth{ index - currentWindowStart + 1 };
if (currentWindowWidth < resultingWindowWidth) {
// Remember one potential solution
resultingWindowWidth = currentWindowWidth;
resultingWindowStart = currentWindowStart;
}
// Now, for the sliding window. We saw and decremented this character before
// Now it leaves the sliding window, so we increment it again.
counter.incrementFor(toBeSearchedIn[currentWindowStart]);
// Slide start of window one position to the right
currentWindowStart++;
}
}
return (resultingWindowWidth != std::numeric_limits<Index>::max()) ? toBeSearchedIn.substr(resultingWindowStart, resultingWindowWidth) : "No solution";
}
int main()
{
const std::string toBeSearchedIn{ "KKKADOBECODEBBBAANCKKK" };
const std::string toBeSearchedFor = { "AABBC" };
std::cout << "Solution:\n" << solution(toBeSearchedIn, toBeSearchedFor) << '\n';
}

Since the question is part of an attempt at an exercise, this answer will not present a complete solution to the exercise problem that inspired it. Instead, it will do just what is asked: it will point out the main issue with the posted code, and how it can be discovered.
Code Examination
An artful approach is to check for mismatches between the requirements, design, and implementation; artful because this approach is more an art than a science, and you can easily lead yourself astray. This basically involves running through design and through implementation in your head, as if you were the processor, though perhaps examining only small parts of the code at a time.
Some of the implementation looks fine, such as: end advancing along in the outer loop, checking for a smaller window (and replacing the previous smallest window). Some could stand closer examination, such as removing entries from the window histogram after checking that the window is valid (for algorithm correctness, it's very useful to think of good loop invariants, such as 'the window should always be valid', and ensure they always hold true).
However, when you look at check_map, there's a mismatch. One problem requirement is:
every character in t (including duplicates) is included in the window
While there is a slight ambiguity in the phrasing (if a character from t occurs in a window more than in t, is the window valid?), the straight reading of this requirement is that the count of a character in s must be at least the count of a character in t. In check_map, the counts are being compared exactly. This strongly suggests a place to examine more closely.
Testing
A semi-automated, systematic approach that can catch all sorts of bugs is using tests, both unit and integration (a search of this site and the web at large will explain these terms). One key part of tests is identifying edge cases to test. For example, if you try with the search string "ACBA" and pattern "AB", the example program correctly finds the minimum window "BA". However, for the search string "ACBBA", it returns "ACB" as the minimum window. This suggests the implementation has an issue with character counts, which makes check_map the prime suspect (and the lines that update strr the secondary suspect).
For another test, consider search string "A123B12345A12BA123A" and pattern "AAB". This has 3 potential windows, with the shortest in the middle. If you fix check_map and test your code against this test case, the code returns "A12BA123A", rather than "A12BA". This suggests something is either wrong with testing the window validity (check_map again) or with setting the answer. Some scaffolding code (e.g. printing start, end and ans when it's updated) will reveal the cause.
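One way to get expected values for such tests is a slow but obviously correct reference implementation to compare against. The sketch below is only a brute-force oracle under the "at least as many" reading of the requirement; the name BruteForceMinWindow is hypothetical, and it is not meant as a replacement for the sliding-window code in the question.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical test oracle: tries every start position and extends the window
// until it holds at least as many of each character as t does.
std::string BruteForceMinWindow(const std::string& s, const std::string& t)
{
    if (t.empty()) return "";
    std::array<int, 256> need{};
    for (const unsigned char c : t) ++need[c];

    std::string best;
    bool found = false;
    for (std::size_t start = 0; start < s.size(); ++start) {
        std::array<int, 256> have{};
        std::size_t missing = t.size();  // characters of t not yet covered
        for (std::size_t end = start; end < s.size(); ++end) {
            const unsigned char c = static_cast<unsigned char>(s[end]);
            if (++have[c] <= need[c]) --missing;
            if (missing == 0) {
                // Shortest valid window for this start; keep it if it beats the best so far.
                const std::size_t len = end - start + 1;
                if (!found || len < best.size()) {
                    best = s.substr(start, len);
                    found = true;
                }
                break;
            }
        }
    }
    return best;
}
```

For the test strings mentioned above, this returns "BA" for ("ACBBA", "AB") and "A12BA" for ("A123B12345A12BA123A", "AAB"), which can then be compared with the output of the function under test.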
Debugging
The most general approach that can reveal an issue with implementation correctness is to use an interactive debugger. For the sample code, breakpoints can be set at various key points, such as beginning of loops and branches. You can furthermore make these breakpoints conditional, at the indices when the code should be finding new windows. If you do this, you'll find that check_map returns false in instances when you'd expect it to be returning true. From there, you can start stepping in to check_map to observe why it's doing this.
Once that's fixed, there is still an issue with the code, though you'll need a test case such as the one with "A123B12345A12BA123A" above, as the issue isn't apparent with the "ADOBECODEBANC" test case. Stepping through the inner loop and examining the various variables will reveal what's going wrong.
Check the API
Bugs basically all have one cause: you expect the code to do one thing, but it does something different. One source of this is misunderstanding an API, so it can be helpful to read the API documentation to make sure your understanding is correct. Typically, before going to the API you'll want to find the specific API calls that aren't behaving as you understand them, which debugging can reveal. I mention this because there is an API call in the sample code that is incorrect.
Conclusion
Each of the above approaches leads to the same bug: the comparison in check_map. Two of them also can lead to an additional bug, given a suitable test case.
Additional Notes
Efficiency
Substring examines & tracks not only those characters in t, but all characters. This leads to the inner loop body being executed (including updating ans) for every character in s, not only those that are present in the pattern. Generally, you should make an implementation correct, then make it efficient. However, in this case it's trivial to make Substring ignore characters that aren't in the pattern and is closer to the problem description.
Types
An earlier formulation of this answer, addressing an earlier formulation of the question, covered examining types to check that they're the most appropriate. For the updated question, this no longer leads to bug discovery.
One point from the early formulation still applies to designing a solution.
Conceptually, the most appropriate data type for the pattern characters and the characters in the current window would be a multiset. As the window shifts, characters can be added and removed simply from a multiset. The validity of the current window is a simple subset operation (pattern ⊆ window). However, multiset in the STL doesn't correspond to the mathematical multiset.
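The subset relation itself can be illustrated with sorted character sequences and std::includes, which takes duplicates into account. This is only a sketch of the mathematical relation: the helper name IsSubMultiset is hypothetical, and re-sorting for every window would be far too slow to use inside the actual sliding-window loop.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true if every character of pattern, counted with
// multiplicity, also occurs in window (pattern ⊆ window as multisets).
bool IsSubMultiset(std::string pattern, std::string window)
{
    // std::includes requires both ranges to be sorted.
    std::sort(pattern.begin(), pattern.end());
    std::sort(window.begin(), window.end());
    return std::includes(window.begin(), window.end(),
                         pattern.begin(), pattern.end());
}
```

With this, IsSubMultiset("ABC", "BANC") is true, while IsSubMultiset("aa", "a") is false because the window holds only one 'a'.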

C++ if statement order

A portion of a program needs to check if two C strings are identical while searching through an ordered list (e.g. {"AAA", "AAB", "ABA", "CLL", "CLZ"}). It is feasible that the list could get quite large, so small improvements in speed are worth some degradation of readability. Assume that you are restricted to C++ (please don't suggest switching to assembly). How can this be improved?
typedef char StringC[5];
void compare (const StringC stringX, const StringC stringY)
{
// use a variable so compareResult won't have to be computed twice
int compareResult = strcmp(stringX, stringY);
if (compareResult < 0) // roughly 50% chance of being true, so check this first
{
// no match. repeat with a 'lower' value string
compare(stringX, getLowerString() );
}
else if (compareResult > 0) // roughly 49% chance of being true, so check this next
{
// no match. repeat with a 'higher' value string
compare(stringX, getHigherString() );
}
else // roughly 1% chance of being true, so check this last
{
// match
reportMatch(stringY);
}
}
You can assume that stringX and stringY are always the same length and you won't get any invalid data input.
From what I understand, a compiler will generate code so that the CPU checks the first if-statement and jumps if it's false, so it would be best if that first statement is the most likely to be true, as jumps interfere with the pipeline. I have also heard that when doing a compare, a[n Intel] CPU will do a subtraction and look at the status of the flags without saving the subtraction's result. Would there be a way to do the strcmp once, without saving the result into a variable, but still be able to check that result in both of the first two if-statements?

std::binary_search may help:
#include <algorithm>
#include <iostream>
#include <iterator>
bool cstring_less(const char (&lhs)[4], const char (&rhs)[4])
{
return std::lexicographical_compare(std::begin(lhs), std::end(lhs),
std::begin(rhs), std::end(rhs));
}
int main(int, char**)
{
const char cstrings[][4] = {"AAA", "AAB", "ABA", "CLL", "CLZ"};
const char lookFor[][4] = {"BBB", "ABA", "CLS"};
for (const auto& s : lookFor)
{
if (std::binary_search(std::begin(cstrings), std::end(cstrings),
s, cstring_less))
{
std::cout << s << " Found.\n";
}
}
}

I think using hash tables can improve the speed of comparison drastically. Also, if your program is multithreaded, you can find some useful hash tables in the Intel Threading Building Blocks library. For example, tbb::concurrent_unordered_map has an API similar to that of std::unordered_map.
I hope it helps you.
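A minimal single-threaded sketch of the hash-table idea, assuming the entries can be stored as std::string keys in a std::unordered_set (the helper names BuildLookup and Contains are hypothetical). Note that a hash lookup only answers "is there an exact match?" and does not make use of the list being ordered.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: build the hash set once from the list of strings.
std::unordered_set<std::string> BuildLookup(const std::vector<std::string>& words)
{
    return std::unordered_set<std::string>(words.begin(), words.end());
}

// Hypothetical helper: average constant-time exact-match test.
bool Contains(const std::unordered_set<std::string>& lookup, const char* candidate)
{
    return lookup.find(candidate) != lookup.end();
}
```

With a lookup built from {"AAA", "AAB", "ABA", "CLL", "CLZ"}, Contains(lookup, "ABA") is true and Contains(lookup, "BBB") is false.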

If you try to compare all the strings to each other, you end up with an O(N*(N-1)) problem. The best approach, since you have stated the lists can grow large, is to sort them (a quicksort-style sort is O(N*log(N)) on average) and then compare each element with the next one in the list, which adds another O(N) pass, giving O(N*log(N)) total complexity. As your list is already ordered, you can just traverse it (making the whole thing O(N)), comparing each element with the next. An example, valid in both C and C++, follows:
for(i = 0; i < N-1; i++) /* one comparison less than the number of elements */
if (strcmp(array[i], array[i+1]) == 0)
break;
if (i < N-1) { /* this is a premature exit from the loop, so we found a match */
/* found a match, array[i] equals array[i+1] */
} else { /* we exhausted all comparisons and got out normally from the loop */
/* no match found */
}

grabbing data sets from a file with an arbitrary amount of spaces

No direct answers or code examples please; this is my homework, which I need to learn from. I'm looking for help concerning the algorithm I need to develop.
I seem to be having a logic error in coming up with a solution for a portion of my class work. The program involves multiple files, but here is the only relevant portion:
I have a file PlayerStats that holds the stats for a basketball player in:
rebounds
points
assists
uniform #
My initial reaction would be to create a while loop and read these into a temporary struct that holds these values, then create a merge function that merges the values of the temp struct with the initial array of records. Simple enough?
struct Baller
{
    //other information on baller
    int rebounds;
    int assists;
    int uniform;
    int points;
    void merge(Baller tmp); //merge the data with the array of records
};
//in my read function..
Baller tmp;
int j = 0;
inFile >> tmp.uniform >> tmp.assists >> tmp.points >> tmp.rebounds;
while(inFile){
    ArrayRecords[j].merge(tmp);
    j++;
    //read in from infile again
}
The catch:
The file can have an arbitrary number of spaces between the identifiers, and the information can be in any order(leaving out the uniform number, that is always first). e.g.
PlayerStats could be
11 p3 a12 r5 //uniform 11, 3 points 12 assists 5 rebounds
//other info
OR
11 p 3 r 5 a 12 //same exact values
What I've come up with
I can't seem to think of an algorithm to grab these values from the file in the correct order. I was thinking of something along these lines:
inFile >> tmp.uniform; //uniform is ALWAYS first
getline(inFile,str); //get the remaining line
int i = 0;
while(str[i] == ' ') //keep going until i find something that isnt space
    i++;
if(str[i] == 'p') //heres where i get stuck, how do i find that number now?
else if(str[i] == 'a')
else if(str[i] == 'r')

If you're only going to check one letter, you could use a switch statement instead of if / else; that would make it easier to read.
You know where the number starts at that point (hint: str[i+1]), so depending on the type of str, you can use either atoi if it's a char array or std::stringstream if it's a std::string.
I'm tempted to give you some code, but you said not to. If you do want some, let me know and I'll edit the answer with some code.
Instead of using a 'merge' function, try using a std::vector so you can just push_back your structure instead of doing any 'merging'. Besides, your merge function is basically a copy assignment operator, which the compiler creates by default, so you don't need a 'merge' function; just use = to copy the data across. If you want to do something special in 'merge', overload the copy assignment operator instead. Simples.

Do something like this:
int readNumber () {
    while isdigit (nextchar) -> collect in numberstring or directly build number
    return that number;
}
lineEater () {
    Read line
    skip over spaces
    uniform=readNumber ();
    haveNum=false;
    haveChar=false;
    Loop until eol {
        skip over spaces
        if (isdigit) {
            number=readNumber ();
            skip over spaces
            haveNum=true;
        } else {
            char=nextChar;
            haveChar=true;
        }
        if (haveChar and haveNum) {
            switch (char) {
                case 'p' : points=number; break;
                ...
            }
            haveNum=false;
            haveChar=false;
        }
    }
}
or, if you are more ambitious, write a grammar for your input and use lex/yacc.
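For readers who do want to see the pseudocode above as real code, here is one possible C++ sketch. The Stats struct and the names skipSpaces, readNumber and parseStatsLine are assumptions made for this example, and it handles only the p / a / r tags mentioned in the question.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical record type; the field names follow the question.
struct Stats {
    int uniform = 0;
    int points = 0;
    int assists = 0;
    int rebounds = 0;
};

// True if pos is inside the line and points at a digit.
static bool atDigit(const std::string& line, std::size_t pos)
{
    return pos < line.size() && std::isdigit(static_cast<unsigned char>(line[pos]));
}

// Advance pos past any run of whitespace.
static void skipSpaces(const std::string& line, std::size_t& pos)
{
    while (pos < line.size() && std::isspace(static_cast<unsigned char>(line[pos])))
        ++pos;
}

// Read consecutive digits starting at pos and build the number directly.
static int readNumber(const std::string& line, std::size_t& pos)
{
    int value = 0;
    while (atDigit(line, pos)) {
        value = value * 10 + (line[pos] - '0');
        ++pos;
    }
    return value;
}

// Parses lines such as "11 p3 a12 r5" or "11 p 3 r 5 a 12".
// Returns false on an unknown tag or a tag that is not followed by digits.
bool parseStatsLine(const std::string& line, Stats& out)
{
    std::size_t pos = 0;
    skipSpaces(line, pos);
    if (!atDigit(line, pos)) return false;   // the uniform number always comes first
    out.uniform = readNumber(line, pos);

    for (;;) {
        skipSpaces(line, pos);
        if (pos >= line.size()) return true; // end of line: everything parsed
        const char tag = line[pos++];
        skipSpaces(line, pos);               // allows "p3" as well as "p   3"
        if (!atDigit(line, pos)) return false;
        const int number = readNumber(line, pos);
        switch (tag) {
            case 'p': out.points = number; break;
            case 'a': out.assists = number; break;
            case 'r': out.rebounds = number; break;
            default:  return false;
        }
    }
}
```

Because each tag is read first and its number second, the order of the p / a / r fields and the amount of whitespace between them do not matter.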

Char* vs String Speed in C++

I have a C++ program that reads data from a binary file. Originally I stored the data in std::vector<char*> data. I have changed my code to use strings instead of char*, so it is now std::vector<std::string> data. One of the changes I had to make was switching from strcmp to compare, for example.
However, I have seen my execution time increase dramatically. For a sample file, when I used char* it took 0.38s, and after the conversion to string it took 1.72s on my Linux machine. I observed a similar problem on my Windows machine, with execution time increasing from 0.59s to 1.05s.
I believe this function is causing the slowdown. It is part of the converter class; note that private variables are designated with _ at the end of the variable name. I clearly am having memory problems here and am stuck in between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.
I access ids_ and names_ many times in another function too, so access speed is very important. By using a map instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!
Example NewList.Txt
2515 ABC 23.5 32 -99 1875.7 1
1676 XYZ 12.5 31 -97 530.82 2
279 FOO 45.5 31 -96 530.8 3
OLD Code:
void converter::updateNewList(){
FILE* NewList;
char lineBuffer[100];
char* id = 0;
char* name = 0;
int l = 0;
int n;
NewList = fopen("NewList.txt","r");
if (NewList == NULL){
std::cerr << "Error in reading NewList.txt\n";
exit(EXIT_FAILURE);
}
while(!feof(NewList)){
fgets (lineBuffer , 100 , NewList); // Read line
l = 0;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
id = new char[l];
switch (l){
case 1:
n = sprintf (id, "%c", lineBuffer[0]);
break;
case 2:
n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
break;
case 3:
n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);
break;
case 4:
n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2],lineBuffer[3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing ids from NewList.txt\n";
exit(EXIT_FAILURE);
}
l = l + 1;
int s = l;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
name = new char[l-s];
switch (l-s){
case 2:
n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
break;
case 3:
n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
break;
case 4:
n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2],lineBuffer[s+3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing short name from NewList.txt\n";
exit(EXIT_FAILURE);
}
ids_.push_back ( std::string(id) );
names_.push_back(std::string(name));
}
bool isFound = false;
for (unsigned int i = 0; i < siteNames_.size(); i ++) {
isFound = false;
for (unsigned int j = 0; j < names_.size(); j ++) {
if (siteNames_[i].compare(names_[j]) == 0){
isFound = true;
}
}
}
fclose(NewList);
delete [] id;
delete [] name;
}
C++ CODE
void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");
    unsigned int id (0);
    std::string name;
    // get the ID and name; the loop stops at end of file or on a malformed line
    while(NewList >> id >> name){
        // ignore the rest of the line
        NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');
        info_.insert(std::pair<std::string, unsigned int>(name,id));
    }
    NewList.close();
}
UPDATE: Follow up question: Bottleneck from comparing strings and thanks for the very useful help! I will not be making these mistakes in the future!

My guess is that it is tied to the vector<string>'s performance.
About the vector
A std::vector works with an internal contiguous array, meaning that once the array is full, it needs to create another, larger array and copy the strings over one by one. That means a copy construction and a destruction of a string with the same contents each time, which is counter-productive...
To confirm this easily, use a std::vector<std::string *> and see if there is a difference in performance.
If this is the case, then you can do one of these four things:
if you know (or have a good idea) of the final size of the vector, use its method reserve() to reserve enough space in the internal array, to avoid useless reallocations.
use a std::deque, which works almost like a vector
use a std::list (which doesn't give you random access to its items)
use the std::vector<char *>
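To see the effect of the first option, here is a small sketch that counts how often the vector's buffer is reallocated while strings are appended. The function name countReallocations is made up for this example. Note that since C++11 the strings are normally moved rather than copied during a reallocation, so the cost per element is lower than described above, but the reallocations themselves still happen.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: append n strings and count how many times the
// vector's capacity changes, i.e. how often it reallocates its buffer.
std::size_t countReallocations(std::size_t n, bool reserveFirst)
{
    std::vector<std::string> v;
    if (reserveFirst) {
        v.reserve(n);  // one allocation up front
    }

    std::size_t reallocations = 0;
    std::size_t lastCapacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back("x");
        if (v.capacity() != lastCapacity) {
            ++reallocations;
            lastCapacity = v.capacity();
        }
    }
    return reallocations;
}
```

With reserveFirst set to true the count is zero, because push_back never exceeds the reserved capacity; without it the count grows with n, at a rate that depends on the implementation's growth factor.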
About the string
Note: I'm assuming that your strings / char * are created once, and not modified (through a realloc, an append, etc.).
If the ideas above are not enough, then...
The allocation of the string object's internal buffer is similar to a malloc of a char *, so you should see little or no differences between the two.
Now, if your char * are in truth char[SOME_CONSTANT_SIZE], then you avoid the malloc (and thus, will go faster than a std::string).
Edit
After reading the updated code, I see the following problems.
if ids_ and names_ are vectors, and if you have the slightest idea of the number of lines, then you should use reserve() on ids_ and names_
consider making ids_ and names_ deques or lists.
faaNames_ should be a std::map, or even a std::unordered_map (or whatever hash_map you have on your compiler). Your search currently is two nested for loops, which is quite costly and inefficient.
Consider comparing the lengths of the strings before comparing their contents. In C++, getting the length of a string (i.e. std::string::length()) is a constant-time operation.
Now, I don't know what you're doing with the isFound variable, but if you need to find only ONE true equality, then I guess you should work on the algorithm. I don't know if there is already one that fits (see http://www.cplusplus.com/reference/algorithm/), but I believe this search could be made a lot more efficient just by thinking about it.
Other comments:
Forget the use of int for sizes and lengths in the STL. At the very least, use size_t. In 64-bit, size_t will become 64-bit, while int will remain 32-bit, so your code is not 64-bit ready (on the other hand, I see few cases of incoming 8 GB strings... but still, better be correct...)
Edit 2
The two (so-called C and C++) codes are different. The "C code" expects ids and names shorter than 5 characters, or the program exits with an error. The "C++ code" has no such limitation. Still, this limitation is grounds for massive optimization, if you can confirm that names and ids are always shorter than 5 characters.
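As a sketch of the map suggestion above, the nested search loops can be replaced by one hash lookup per site name. The function name countKnownSites and the name-to-id table passed to it are assumptions for this example; the question's own map member is called info_.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: count how many site names have an entry in the
// name -> id table. Each lookup is O(1) on average, so the whole pass is
// roughly O(N) instead of the O(N*M) of two nested loops.
std::size_t countKnownSites(
    const std::vector<std::string>& siteNames,
    const std::unordered_map<std::string, unsigned int>& idsByName)
{
    std::size_t known = 0;
    for (const auto& site : siteNames) {
        if (idsByName.find(site) != idsByName.end()) {
            ++known;
        }
    }
    return known;
}
```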

Before fixing something, make sure that it is the bottleneck. Otherwise you are wasting your time. Plus, this sort of optimization is micro-optimization. If you are doing micro-optimization in C++, then consider using bare C.

Resize vector to large enough size before you start populating it. Or, use pointers to strings instead of strings.
The thing is that the strings are being copied each time the vector is being auto-resized. For small objects such as pointers this cost nearly nothing, but for strings the whole string is copied in full.
And id and name should be string instead of char*, and be initialized like this (assuming that you still use string instead of string*; note that after the second scan, l is the index one past the end of the name):
id = string(lineBuffer, lineBuffer + l);
...
name = string(lineBuffer + s, lineBuffer + l);
...
ids_.push_back(id);
names_.push_back(name);

Except for std::string, this is a C program.
Try using fstream, and use a profiler to detect the bottleneck.

You can try to reserve a number of vector elements in order to reduce the number of allocations (which are costly), as Dialecticus said (probably from ancient Rome?).
But there is something else that may deserve a look: how do you store the strings from the file, do you perform concatenations, etc.?
In C, strings (which do not exist per se, since they have no container from a library like the STL) need more work to deal with, but at least we know clearly what happens when dealing with them. In the STL, each convenient operation (meaning one requiring less work from the programmer) may actually require a lot of operations behind the scenes, within the string class, depending on how you use it.
So, while the allocations and deallocations are a costly process, the rest of the logic, especially the string processing, should probably be looked at as well.

I believe the main issue here is that your string version is copying things twice: first into the dynamically allocated char[] buffers called name and id, and then into std::strings, while your vector<char *> version probably does not do that. To make the string version faster, you need to read directly into the strings and get rid of all the redundant copies.

streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:
void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");
    int id (0);
    std::string name;
    // get the ID and name; the loop stops at end of file or on a malformed line
    while(NewList >> id >> name){
        // ignore the rest of the line
        NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');
        ids_.push_back (id);
        names_.push_back(name);
    }
    NewList.close();
}
There's no need to do the whitespace-tokenizing manually.
Also, you may find this site a helpful reference:
http://www.cplusplus.com/reference/iostream/ifstream/

You can use a profiler to find out where your code spends most of its time. If you are, for example, using gcc, you can compile your program with -pg. When you run it, it saves profiling results in a file. You can then run gprof on the binary to get human-readable results. Once you know where most of the time is spent, you can post that piece of code for further questions.

Fastest way to determine whether a string contains a real or integer value

I'm trying to write a function that is able to determine whether a string contains a real or an integer value.
This is the simplest solution I could think of:
int containsStringAnInt(char* strg){
for (int i =0; i < strlen(strg); i++) {if (strg[i]=='.') return 0;}
return 1;
}
But this solution is really slow when the string is long... Any optimization suggestions?
Any help would really be appreciated!

What's the syntax of your real numbers?
1e-6 is valid C++ for a literal, but will be passed as integer by your test.

Is your string hundreds of characters long? If not, don't worry about performance here.
The only inefficiency is that strlen() is called in the loop condition, so the whole string is walked again on every iteration. For a simpler solution that is O(n), and probably slightly faster, use strchr().
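A minimal sketch of the strchr() suggestion, keeping the question's function name and return convention (1 when no '.' is present). Taking the parameter as const char* is a change made for this example.

```cpp
#include <cstring>

// Returns 1 if the string contains no '.', 0 otherwise.
// std::strchr walks the string once and stops at the first match.
int containsStringAnInt(const char* strg)
{
    return std::strchr(strg, '.') == nullptr ? 1 : 0;
}
```

Like the original, this only looks for a dot, so it still treats inputs such as 1e-6 as integers.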

You are using strlen, which means you are not worried about Unicode. In that case, why use strlen or strchr at all? Just check for '\0' (the null character):
int containsStringAnInt(char* strg){
    for (int i = 0; strg[i] != '\0'; i++) {
        if (strg[i] == '.') return 0;
    }
    return 1;
}
This makes only one pass through the string, rather than one pass on every iteration of the loop.

Your function does not take into account exponential notation of reals (1E7, 1E-7 are both doubles)
Use strtol() to try to convert the string to integer first; it will also return the first position in the string where the parsing failed (this will be '.' if the number is real). If the parsing stopped at '.', use strtod() to try to convert to double. Again, the function will return the position in the string where the parsing stopped.
Don't worry about performance, until you have profiled the program. Otherwise, for fastest possible code, construct a regular expression that describes acceptable syntax of numbers, and hand-convert it first into a FSM, then into highly optimized code.

So the standard note first, please don't worry about performance too much if not profiled yet :)
I'm not sure about the manual loop and checking for a dot. Two issues
Depending on the locale, the dot can actually be a "," too (here in Germany that's the case :)
As others noted, there is the issue with numbers like 1e7
Previously I had a version using sscanf here. But measuring performance showed that sscanf is significantly slower for bigger data sets. So I'll show the faster solution first (well, it's also a whole lot simpler; I had several bugs in the sscanf version until I got it working, while the strto[ld] version worked on the first try):
enum {
REAL,
INTEGER,
NEITHER_NOR
};
int what(char const* strg){
char *endp;
strtol(strg, &endp, 10);
if(*strg && !*endp)
return INTEGER;
strtod(strg, &endp);
if(*strg && !*endp)
return REAL;
return NEITHER_NOR;
}
Just for fun, here is the version using sscanf:
int what(char const* strg) {
// test for int
{
int d; // converted value
int n = 0; // number of chars read
int rd = std::sscanf(strg, "%d %n", &d, &n);
if(!strg[n] && rd == 1) {
return INTEGER;
}
}
// test for double
{
double v; // converted value
int n = 0; // number of chars read
int rd = std::sscanf(strg, "%lf %n", &v, &n);
if(!strg[n] && rd == 1) {
return REAL;
}
}
return NEITHER_NOR;
}
I think that should work. Have fun.
Test was done by converting test strings (small ones) randomly 10000000 times in a loop:
6.6s for sscanf
1.7s for strto[dl]
0.5s for manual looping until "."
A clear win for strto[ld]: considering that it parses numbers correctly, I will praise it as the winner over manual looping. Anyway, a difference of roughly 1.2s/10000000 = 0.00000012s per conversion isn't all that much in the end.

Strlen walks the string to find the length of the string.
You are calling strlen on every pass of the loop. Hence, you are walking the string many more times than necessary. This tiny change should give you a huge performance improvement:
int containsStringAnInt(char* strg){
int len = strlen(strg);
for (int i =0; i < len; i++) {if (strg[i]=='.') return 0;}
return 1;
}
Note that all I did was find the length of the string once, at the start of the function, and refer to that value repeatedly in the loop.
Please let us know what kind of performance improvement this gets you.

@Aaron, your way also traverses the string twice: once within strlen, and once again in the for loop.
The best way to traverse an ASCII string in a for loop is to check for the null character in the loop itself. Have a look at my answer, which walks the string only once within the for loop, and only partially if it finds a '.' before the end. That way, if a string is like 0.01xxx (another 100 chars), you need not go to the end to find the length.

#include <stdlib.h>
int containsStringAnInt(char* strg){
if (atof(strg) == atoi(strg))
return 1;
return 0;
}
