Text difference of scrolling output - c++

I have code that is capturing the text from scrolling output and I'm looking for an algorithm (working with C++/Qt) that can tell me which lines are new. NOTE: New lines are only ever added to the end.
So on first capture I might have the following:
hello world
some more text
hello world
some text
And on second capture might have:
hello world
some text
yet more text
hello world
So I want the algorithm to return that I have two new lines:
yet more text
hello world
If possible, it would help performance if it could start from the last line and terminate once it reaches an already processed line. But I'm thinking this is probably not possible, since there can be duplicate lines.

Well, you say it's scrolling and you are using OCR, so could you also capture the size of the scroll widget on the scroll window, and check that along with the lines you've recorded?
Alternatively, can you hook a DLL into the producer program so you can signal when it outputs a new line? Or pipe its output directly into yours?

For your special case I would consider a plain, basic loop-inside-loop algorithm. I don't think performance is really an issue (there are not that many lines, and I expect OCR to take most of the time), so the algorithm should above all be easily readable and robust.
One possible algorithm in pseudo code:
numberOfNewLines = 0
while numberOfNewLines <= numberOfTotalLines do
    compare lines [1..numberOfTotalLines-numberOfNewLines] of textNew
        with lines [1+numberOfNewLines..numberOfTotalLines] of textOld
    if identical then exit while
    numberOfNewLines++
end while
You can break comparison as soon as one line differs, but still the algorithm is O(N^2) in the number of lines.
Then you can output the last numberOfNewLines from the end of textNew. As mentioned in the comment you can of course not detect some edge cases like "10000 times 'ABC' and then 1 times 'DEF'" where most of the lines 'ABC' will be neglected.
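The pseudocode can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: it uses std::vector<std::string> instead of QStringList so that it has no Qt dependency, the name newLinesSince is made up for the example, and unlike the pseudocode it also accepts captures of different lengths. It looks for the longest run of lines that ends the old capture and begins the new one, and treats everything after that run as new.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the lines of newLines that follow the longest
// overlap between the end of oldLines and the start of newLines.
std::vector<std::string> newLinesSince(const std::vector<std::string>& oldLines,
                                       const std::vector<std::string>& newLines)
{
    const std::size_t maxOverlap = std::min(oldLines.size(), newLines.size());
    // Try the largest overlap first, which corresponds to the fewest new lines.
    for (std::size_t overlap = maxOverlap; overlap > 0; --overlap) {
        const auto n = static_cast<std::ptrdiff_t>(overlap);
        // Compare the last `overlap` old lines with the first `overlap` new lines.
        if (std::equal(oldLines.end() - n, oldLines.end(), newLines.begin())) {
            return std::vector<std::string>(newLines.begin() + n, newLines.end());
        }
    }
    return newLines; // no overlap at all: every line counts as new
}
```

With the two captures from the question, the overlap is "hello world" / "some text", so the function returns "yet more text" and "hello world". The repeated-line edge case described above still applies.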

I have tested this against a number of test cases and it works so far:
QStringList scrollDiff(const QStringList& oldLines, const QStringList& newLines)
{
    if (oldLines.empty()) {
        return newLines;
    }
    if (oldLines.size() < newLines.size()) {
        return newLines.mid(oldLines.size());
    }
    /*
     * Note: oldLines.size() == newLines.size()
     */
    // Skip the common prefix
    int i;
    for (i = 0; i < oldLines.size() && oldLines[i] == newLines[i]; ++i);
    if (i == oldLines.size()) {
        return QStringList();
    }
    // Remove lines from oldLines that are no longer shown
    int j = oldLines.indexOf(newLines[i]);
    if (j == -1) {
        return newLines;
    }
    QStringList commonLines = oldLines.mid(j - i);
    return newLines.mid(commonLines.size());
}

Related

Looking for a better algorithm for finding substrings in strings using Qt

UPDATE
I'll add some info about the problem to give you a better idea of why everything is done the way it is.
The main point of the whole script is to find all errors in a special file that keeps original and translated strings.
The script requires the "special" bilingual file (an XML file in real life) and a "special" vocabulary file which keeps words and their translations (xls or xlsx, constructed by hand; PO would probably be better).
As a result it finds all errors in the translation, using the provided vocabulary.
Obviously if the vocab is bad the result sucks.
At some point the whole thing used 'std', or mostly 'std' and Boost regular expressions.
At some other point came the need for UTF-8 support, including in the regular expressions. We had no time to write complex stuff, so it was decided to go the Qt way.
We were aware that it is possible to iterate over bytes. But we needed actual letters and sequences of letters, and we also needed to cut off the word endings, which is done through regular expressions, and no other regex engine supports UTF-8 relatively well.
It was decided that Qt fitted the role far better than anything we would write ourselves in very limited time, as Qt has Unicode support (QString stores its data as UTF-16 internally and converts to and from UTF-8).
It was pointed out that the complexity of the proposed solution looks like O(m * n).
In reality it's probably even worse: closer to O(m * n * log(l)) or even O(m * n * l) outright. Here m is the number of strings, n the number of vocabulary records, and l the number of synonyms each word has (l is always at least 1).
Since we need to check all strings, and for each string run through the whole vocabulary to find all errors, I currently see no way to make it any faster.
As the question implies I am looking for a better solution to an existing coding problem.
I am going to try to explain what exactly the problem is as best I can.
Imagine you have a piece of code written in C++ that takes a string and a translation of the string, and gets rid of pesky word endings.
After that it takes another file which is a vocabulary and actually runs the whole vocab to find out whether the translation of the string has any errors.
Obviously this thing is highly dependent on the actual vocabulary, but that is not really a problem.
I actually have a described piece of code, although I need to mention the whole thing runs through CGI(don't ask, but at some point it was decided that C++ will run it faster). I can have the full code uploaded to git repo, it's rather big, but I will share the essential parts here.
The current problem I am facing is twofold: either the code does not find everything it is supposed to, or it works too slowly (it probably gets stuck somewhere, but I have not yet pinpointed where).
The main idea behind the code was:
// All definitions for the essential structures, so you have a better idea of what the hell is going on
struct Word {
    QString full = "";
    QString stemmed = "";
};
struct VocRecord {
    QVector<Word> orig;
    QVector<Word> trans;
    QString error = "";
    void clearRecord() {
        this->orig.clear();
        this->trans.clear();
        this->error = "";
    }
};
typedef QVector<VocRecord> Vocabluary;
......
Vocabluary voc = .....; // Obviously here we get the vocabulary. How we get it is rather complicated; you can just assume it looks like the defined vector of records.
QString origStemmed, transStemmed, orig, trans;
// orig - original string
// trans - its translation
// origStemmed - original string with word endings removed (we call it stemming, hence stemmed)
// transStemmed - translation with word endings removed.
At first the algo was something along the lines of:
origStemmed = QString(" ") + origStemmed + QString(" "); // Add whitespace at the beginning and end of the string for searching
transStemmed = QString(" ") + transStemmed + QString(" ");
for(int i = 0; i < voc.length(); i++) {
    VocRecord record = voc[i];
    for(int j = 0; j < record.orig.length(); j++) {
        Word origWord = record.orig[j];
        int si = origStemmed.indexOf(QString(" ") + origWord.stemmed + QString(" "));
        if(si > -1) {
            int ind = origWord.stemmed.indexOf(" ");
            int idx = 0;
            if(ind != -1) {
                // Found a space in the record, which means the record contains at least two words.
                // Here we care where the first word ends, and it's part of the global problem
                idx = origMod.indexOf(origWord.full.mid(0, ind));
            } else {
                // We did not find a space, so there is one word only; take the whole thing.
                idx = origMod.indexOf(origWord.full);
            }
            // Now comes the tricky part: we try to figure out whether the original text, in which we found our voc record, had any punctuation after the word.
            // This actually matters only for records that have more than one word, but as you'll see we check all of them and that is not correct - still figuring out how to get around it.
            QChar symb; // We'll keep the last symbol of the first word here
            // origMod - modified original: everything is lowercase, punctuation is kept.
            // The main reason we have this at all is that when stemming we have to get rid of all punctuation, so we keep the "lowercased" string separate.
            // I am 100% sure we don't need it at all since Qt supports case-insensitive search, but I would like to hear your opinion on it.
            if(origMod.indexOf(" ", idx) > 0) {
                symb = origMod[origMod.indexOf(" ", idx)-1];
            } else {
                symb = origMod[origMod.length()-1];
            }
            // When we have the last symbol, we skip the found word if it is followed by punctuation
            if(ind != -1 && (symb == QChar(',') || symb == QChar(';') || symb == QChar('!') || symb == QChar(':') || symb == QChar('?') || symb == QChar('.'))) {
                continue;
            }
            // The important part ends here
            ............
As you will notice, we search for the stemmed word in the original string.
By all accounts it should work, but the main problem with this search is that it can have several matches, including false ones, and we only look at the first one found. The most obvious solution is probably to go through all matches, but I am unsure that is a good idea: it requires another loop and the algo is quite slow already.
The next solution I came up with was to use regular expressions, but I must have messed up, because the algo became "really slow".
The main idea of the second solution:
// We do NOT add spaces! Spaces suck big time.
for(int i = 0; i < voc.length(); i++) {
    VocRecord record = voc[i];
    for(int j = 0; j < record.orig.length(); j++) {
        Word origWord = record.orig[j];
        // Instead of using spaces, we search for a regular expression made from the vocab record.
        // A simple contains actually runs into the same set of problems, namely more than one match or in some cases false matches (when the searched part matches something it should not).
        // Now this is terribly slow, as you can imagine, because we create the regular expressions on the fly and do not pre-make them. But I still have not thought of a way around it.
        if(origStemmed.contains(QRegularExpression(origWord.stemmed + "\\b",
                QRegularExpression::UseUnicodePropertiesOption | QRegularExpression::CaseInsensitiveOption))) {
            // Here we do something ungodly. We take our stemmed voc record, split it by spaces, then go through all parts, making the string that will become our regular expression later
            QString temp;
            parts.clear();
            parts = origWord.stemmed.split(" ");
            for(int k = 0; k < parts.count(); k++) {
                temp += "\\b" + parts[k] + "[a-z]*?\\b";
            }
            // After we have added everything we need, we join the whole thing back with spaces.
            temp = parts.join(" ");
            // And here is the ungodly check - we actually search for the constructed regular expression in the original string, and because we made sure to exclude any punctuation from the expression, in theory this should work.
            if(!origMod.contains(QRegularExpression(temp, QRegularExpression::UseUnicodePropertiesOption | QRegularExpression::CaseInsensitiveOption))) {
                continue;
            }
            // Well, it does not work, or rather it works so slowly that it's impossible to get any result, and even if we do, we still don't find everything we should - I blame the shitty regex here.
            // And the important part ends.
As I pointed out, the second solution sucks big time. Currently I am aiming for some intermediate solution and would gladly accept any tips or suggestions on where to look or what to look for.
If any of you want to see the full code for this thing, just add a comment and I'll put all the important files on GitHub in a separate repo.
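The second attempt builds its regular expressions inside the loops, and its own comment notes that they are not pre-made. One possible way around that is to compile each pattern once, before any string is checked, and reuse it. The sketch below illustrates only that idea and makes several assumptions: it uses std::regex so the example has no Qt dependency (std::regex is not Unicode-aware, so real code would keep QRegularExpression objects instead), the names CompiledEntry, compileVocabulary and findMatches are hypothetical, and stems are assumed to be single words without regex metacharacters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical pairing of a vocabulary stem with its pattern, compiled once.
struct CompiledEntry {
    std::string stem;
    std::regex pattern;
};

// Build every pattern a single time, before any string is checked.
// Assumption: stems contain no regex metacharacters.
std::vector<CompiledEntry> compileVocabulary(const std::vector<std::string>& stems)
{
    std::vector<CompiledEntry> compiled;
    compiled.reserve(stems.size());
    for (const std::string& stem : stems) {
        // \b<stem>\w*\b : the stem at the start of a word, followed by any ending.
        compiled.push_back({stem,
                            std::regex("\\b" + stem + "\\w*\\b",
                                       std::regex::ECMAScript | std::regex::icase)});
    }
    return compiled;
}

// Return the stems that occur in text, reusing the precompiled patterns.
std::vector<std::string> findMatches(const std::vector<CompiledEntry>& vocabulary,
                                     const std::string& text)
{
    std::vector<std::string> found;
    for (const CompiledEntry& entry : vocabulary) {
        if (std::regex_search(text, entry.pattern)) {
            found.push_back(entry.stem);
        }
    }
    return found;
}
```

The loop over the vocabulary is still O(n) per string, so this does not change the overall complexity; it only removes the repeated pattern construction from the inner loop.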

Minimum Window Substring

I'm working on the Leetcode "Minimum Window Substring" practice problem:
Given two strings s and t of lengths m and n respectively, return the minimum window substring of s such that every character in t (including duplicates) is included in the window. If there is no such substring, return the empty string "".
The testcases will be generated such that the answer is unique.
Example 1:
Input: s = "ADOBECODEBANC", t = "ABC"
Output: "BANC"
Example 2:
Input: s = "a", t = "a"
Output: "a"
Example 3:
Input: s = "a", t = "aa"
Output: ""
Explanation: Both 'a's from t must be included in the window. Since the largest window of s only has one 'a', return empty string.
My solution uses two maps to keep track of character counts:
strr map is to keep count of characters in the window and
patt map is for the given pattern string.
It also uses two indices, start and end, to keep track of the current window (which includes end).
The core of the solution is an outer loop that advances end, adding the new character to strr. It then runs an inner loop as long as the window is valid that:
checks & updates the shortest window seen so far
removes the first character in the window
advances start.
Once the outer loop finishes, the shortest window it encountered should be the answer.
#include <climits>   // INT_MAX
#include <iostream>
#include <string>
#include <unordered_map>
bool check_map(std::unordered_map<char, int> patt, std::unordered_map<char, int> strr)
{
for(auto data:patt)
{
if(strr[data.first] != data.second)
return false;
}
return true;
}
std::string Substring(std::string s, std::string t)
{
std::unordered_map<char, int> patt;
std::unordered_map<char, int> strr;
std::string ans;
for(int i=0; i<t.length(); i++)
patt[t[i]]++;
int start = 0, length = INT_MAX;;
for(int end=0; end<s.length(); end++)
{
strr[s[end]]++;
while(check_map(patt, strr))
{
if(length > (end-start+1))
{
ans = s.substr(start, end+1);
length = end-start+1;
}
strr[s[start]]--;
if(strr[s[start]] == 0)
strr.erase(s[start]);
start++;
}
}
return ans;
}
int main()
{
std::string s = "ADOBECODEBANC",
pattern = "ABC";
std::cout << "String: " << s << std::endl
<< "Pattern: " << pattern << std::endl
<< "Minimum Window Substring is " << Substring(s, pattern) << std::endl;
return 0;
}
For example 1 from the problem, the program should return "BANC" but instead returns "ADOBEC". Program output:
String: ADOBECODEBANC
Pattern: ABC
Minimum Window Substring is ADOBEC
Where is the error in my code?
I am very sorry that I cannot answer your concrete question, “where is the error in my code”.
But what I can do is help you understand the problem, develop an algorithm, and show one of many potential solutions.
The title of the question already implies which algorithm to use: the so-called “sliding window” algorithm.
You will find a very good explanation from Said Sryheni here.
For your problem, we will use the flexible-size sliding window approach.
We will iterate over the source string character by character and wait until we meet a certain condition; in this case, until we have “seen” all characters that need to be searched for. Then we will find a window that contains all these characters.
In the given example, the end of the sliding window is always the last read character of the source string, because the last read character is the one that fulfilled the condition. Then we need to find the beginning of the window: the rightmost position in the source string from which the window still fulfills the condition.
Then we will continue to read the source string and wait for the condition to be fulfilled again. And then we will recalculate the sliding window positions.
By the way, the other characters in the source string, besides the search characters, are just noise and only extend the width of the sliding window.
But how do we know when the condition is met?
Especially since the order of the search characters does not matter, and they can even contain duplicate characters?
The solution is that we will “count”.
First, we will count the occurrences of all characters in the search string. Additionally, we will use a second counter that indicates whether all characters are matched.
Then, while iterating over the source string, we will decrement the counter of any character that we see. If the count of a search character hits 0, we will decrement the “match” counter. And if that is 0, we have found all search characters and the condition is fulfilled. We can then proceed to the calculation of the window positions.
Please note: we only decrement the match counter if the character counter is exactly 0 after being decremented.
Example (I will omit the noise of the ‘x’es):
Search string “ABC”, source string “xxAxxxxBBBxCAxx”.
The initial character counters are 1,1,1 and the match counter is 3.
Reading the first ‘A’. Counters: 0,1,1; match counter: 2
Reading the first ‘B’. Counters: 0,0,1; match counter: 1
Reading the 2nd ‘B’. Counters: 0,-1,1; match counter: 1 (we decrement the match counter only when a character counter hits 0).
Reading the 3rd ‘B’. Counters: 0,-2,1; match counter: 1 (we decrement the match counter only when a character counter hits 0).
Reading the first ‘C’. Counters: 0,-2,0; match counter: 0. The match counter is 0, so the condition is fulfilled.
Please note: a negative character count indicates that we have seen more of that character than required.
Next, since the condition is now fulfilled, we will check the positions of the sliding window. The end position is clear: it is the last read character of the source string, the one that led to the fulfillment of the condition. So, easy.
To get the start position of the sliding window, we scan from the beginning of the source string. For each character we pass, we increment its count, and if the count becomes greater than 0, we increment the match counter again. Once the match counter is greater than 0, we have found the start position. Counters now: 1,-2,0; match counter: 1
The start position is incremented for the next check. We never start again at 0, only at the last used start position.
OK, having found a start and an end position, we have our first window and will look for potentially smaller windows. We continue to read the source string and check again.
After the calculation of the sliding window position, the counters are: 1,-2,0; match counter: 1
Reading the next ‘A’. Counters: 0,-2,0; match counter: 0. Again, the condition is fulfilled.
We continue with the sliding window detection. The last start position was pointing to the character ‘x’ after the first ‘A’.
Increment the start position and skip all ‘x’es. Continue.
Reading the first ‘B’. Counters: 0,-1,0; match counter: 0
Reading the 2nd ‘B’. Counters: 0,0,0; match counter: 0
Reading the 3rd ‘B’. Counters: 0,1,0; match counter: 1. The window position calculation is done. The start position is the 3rd ‘B’. This window is smaller than the previous one, so we take it.
Once the rest of the source string is consumed without finding a smaller window, we are done and have found the solution.
How do we implement that? We will do a small abstraction of the counter and pack it into a mini class. That encapsulates the inner handling of the character and match counts and can be optimized later.
A counter which works for all kinds of char types could be implemented as below:
struct SpecialCounterForGeneralChar {
std::unordered_map<char, int> individualLetter{};
int necessaryMatches{};
SpecialCounterForGeneralChar(const std::string& searchLetters) {
for (const char c : searchLetters) individualLetter[c]++;
necessaryMatches = individualLetter.size();
}
inline void incrementFor(const char c) {
individualLetter[c]++;
if (individualLetter[c] > 0)
++necessaryMatches;
}
inline void decrementFor(const char c) {
individualLetter[c]--;
if (individualLetter[c] == 0)
--necessaryMatches;
}
inline bool allLettersMatched() { return necessaryMatches == 0; }
};
If we know more about the input data, for example that it is restricted to 8-bit chars, we can also use:
struct SpecialCounter {
    int individualLetter[256]{};   // int, so counts beyond 127 cannot overflow
    int necessaryMatches{};
    // Map a (possibly negative) char to a valid array index
    static unsigned char indexOf(const char c) { return static_cast<unsigned char>(c); }
    SpecialCounter(const std::string& searchLetters) {
        for (const char c : searchLetters) {
            if (individualLetter[indexOf(c)] == 0) ++necessaryMatches;
            individualLetter[indexOf(c)]++;
        }
    }
    inline void incrementFor(const char c) {
        individualLetter[indexOf(c)]++;
        if (individualLetter[indexOf(c)] > 0)
            ++necessaryMatches;
    }
    inline void decrementFor(const char c) {
        individualLetter[indexOf(c)]--;
        if (individualLetter[indexOf(c)] == 0)
            --necessaryMatches;
    }
    inline bool allLettersMatched() { return necessaryMatches == 0; }
};
This will be slightly faster than the above (under the given restrictions).
The rest of the program will then be just 15 lines of code.
The important message here is that we need to think for a very long time before we start to implement the first line of code.
A well-chosen algorithm and design will help us to find an optimal solution.
Please see the complete example solution below:
#include <string>
#include <iostream>
#include <unordered_map>
#include <limits>
using Index = unsigned int;
// We want to hide the implementation of the special counter from the outside world
struct SpecialCounter {
    int individualLetter[256]{};   // int, so counts beyond 127 cannot overflow
    int necessaryMatches{};
    // Map a (possibly negative) char to a valid array index
    static unsigned char indexOf(const char c) { return static_cast<unsigned char>(c); }
    SpecialCounter(const std::string& searchLetters) {
        for (const char c : searchLetters) {
            if (individualLetter[indexOf(c)] == 0) ++necessaryMatches;
            individualLetter[indexOf(c)]++;
        }
    }
    inline void incrementFor(const char c) {
        individualLetter[indexOf(c)]++;
        if (individualLetter[indexOf(c)] > 0)
            ++necessaryMatches;
    }
    inline void decrementFor(const char c) {
        individualLetter[indexOf(c)]--;
        if (individualLetter[indexOf(c)] == 0)
            --necessaryMatches;
    }
    inline bool allLettersMatched() { return necessaryMatches == 0; }
};
std::string solution(std::string toBeSearchedIn, std::string toBeSearchedFor) {
    // Counter with some special properties
    SpecialCounter counter(toBeSearchedFor);
    // The window that slides. Its end is always the last read character; its start may increase
    Index currentWindowStart {};
    // The potential solution. The maximum Index value means "no window found yet"
    Index resultingWindowStart {};
    Index resultingWindowWidth{ std::numeric_limits<Index>::max() };
    // Iterate over all characters of the string under evaluation
    for (Index index{}; index < toBeSearchedIn.length(); ++index) {
        // We saw a character. So, subtract it from the characters to be searched
        counter.decrementFor(toBeSearchedIn[index]);
        // While all necessary characters are inside the window, try to shrink it from the left
        while (counter.allLettersMatched()) {
            // Calculate the width of the sliding window and check whether it is a new, narrower window
            const Index currentWindowWidth{ index - currentWindowStart + 1 };
            if (currentWindowWidth < resultingWindowWidth) {
                // Remember one potential solution
                resultingWindowWidth = currentWindowWidth;
                resultingWindowStart = currentWindowStart;
            }
            // Now, for the sliding window. We saw and decremented this character before.
            // Now it leaves the sliding window, so we increment it again.
            counter.incrementFor(toBeSearchedIn[currentWindowStart]);
            // Slide the start of the window one position to the right
            currentWindowStart++;
        }
    }
    return (resultingWindowWidth != std::numeric_limits<Index>::max()) ? toBeSearchedIn.substr(resultingWindowStart, resultingWindowWidth) : "No solution";
}
int main()
{
const std::string toBeSearchedIn{ "KKKADOBECODEBBBAANCKKK" };
const std::string toBeSearchedFor = { "AABBC" };
std::cout << "Solution:\n" << solution(toBeSearchedIn, toBeSearchedFor) << '\n';
}
Since the question is part of an attempt at an exercise, this answer will not present a complete solution to the exercise problem that inspired it. Instead, it will do just what is asked: it will point out the main issue with the posted code, and how it can be discovered.
Code Examination
An artful approach is to check for mismatches between the requirements, design, and implementation; artful because this approach is more an art than a science, and you can easily lead yourself astray. This basically involves running through design and through implementation in your head, as if you were the processor, though perhaps examining only small parts of the code at a time.
Some of the implementation looks fine, such as: end advancing along in the outer loop, checking for a smaller window (and replacing the previous smallest window). Some could stand closer examination, such as removing entries from the window histogram after checking that the window is valid (for algorithm correctness, it's very useful to think of good loop invariants, such as 'the window should always be valid', and ensure they always hold true).
However, when you look at check_map, there's a mismatch. One problem requirement is:
every character in t (including duplicates) is included in the window
While there is a slight ambiguity in the phrasing (if a character from t occurs in a window more than in t, is the window valid?), the straight reading of this requirement is that the count of a character in s must be at least the count of a character in t. In check_map, the counts are being compared exactly. This strongly suggests a place to examine more closely.
Testing
A semi-automated, systematic approach that can catch all sorts of bugs is using tests, both unit and integration (a search of this site and the web at large will explain these terms). One key part of tests is identifying edge cases to test. For example, if you try with the search string "ACBA" and pattern "AB", the example program correctly finds the minimum window "BA". However, for the search string "ACBBA", it returns "ACB" as the minimum window. This suggests the implementation has an issue with character counts, which makes check_map the prime suspect (and the lines that update strr the secondary suspect).
For another test, consider search string "A123B12345A12BA123A" and pattern "AAB". This has 3 potential windows, with the shortest in the middle. If you fix check_map and test your code against this test case, the code returns "A12BA123A", rather than "A12BA". This suggests something is either wrong with testing the window validity (check_map again) or with setting the answer. Some scaffolding code (e.g. printing start, end and ans when it's updated) will reveal the cause.
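When writing such tests, it can help to have a slow but obviously correct reference to compare against. The following is a sketch of such a test oracle, not a solution to the exercise: the name minWindowBruteForce is made up for this example, it assumes 8-bit characters, and it simply tries every start position and extends the window until it becomes valid.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical test oracle: tries every window, so it is slow but easy to trust.
std::string minWindowBruteForce(const std::string& s, const std::string& t)
{
    std::array<int, 256> need{};
    for (const char c : t) ++need[static_cast<unsigned char>(c)];

    std::string best;
    bool found = false;
    for (std::size_t start = 0; start < s.size(); ++start) {
        std::array<int, 256> have{};
        for (std::size_t end = start; end < s.size(); ++end) {
            ++have[static_cast<unsigned char>(s[end])];
            // Valid when every character occurs at least as often as in t.
            bool valid = true;
            for (std::size_t c = 0; c < need.size(); ++c) {
                if (have[c] < need[c]) { valid = false; break; }
            }
            if (valid) {
                const std::size_t length = end - start + 1;
                if (!found || length < best.size()) {
                    best = s.substr(start, length);
                    found = true;
                }
                break; // extending further from this start only gives longer windows
            }
        }
    }
    return best; // empty when no window exists
}
```

Running the code under test and this reference on the same inputs, including the edge cases above, makes disagreements easy to spot.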
Debugging
The most general approach that can reveal an issue with implementation correctness is to use an interactive debugger. For the sample code, breakpoints can be set at various key points, such as beginning of loops and branches. You can furthermore make these breakpoints conditional, at the indices when the code should be finding new windows. If you do this, you'll find that check_map returns false in instances when you'd expect it to be returning true. From there, you can start stepping in to check_map to observe why it's doing this.
Once that's fixed, there is still an issue with the code, though you'll need a test case such as the one with "A123B12345A12BA123A" above, as the issue isn't apparent with the "ADOBECODEBANC" test case. Stepping through the inner loop and examining the various variables will reveal what's going wrong.
Check the API
Bugs basically all have one cause: you expect the code to do one thing, but it does something different. One source of this is misunderstanding an API, so it can be helpful to read the API documentation to make sure your understanding is correct. Typically, before going to the API you'll want to find the specific API calls that aren't behaving as you understand them, which debugging can reveal. I mention this because there is an API call in the sample code that is incorrect.
Conclusion
Each of the above approaches leads to the same bug: the comparison in check_map. Two of them also can lead to an additional bug, given a suitable test case.
Additional Notes
Efficiency
Substring examines & tracks not only those characters in t, but all characters. This leads to the inner loop body being executed (including updating ans) for every character in s, not only those that are present in the pattern. Generally, you should make an implementation correct, then make it efficient. However, in this case it's trivial to make Substring ignore characters that aren't in the pattern, and doing so is closer to the problem description.
Types
An earlier formulation of this answer, addressing an earlier formulation of the question, covered examining types to check that they're the most appropriate. For the updated question, this no longer leads to bug discovery.
One point from the early formulation still applies to designing a solution.
Conceptually, the most appropriate data type for the pattern characters and the characters in the current window would be a multiset. As the window shifts, characters can be added and removed simply from a multiset. The validity of the current window is a simple subset operation (pattern ⊆ window). However, multiset in the STL doesn't correspond to the mathematical multiset.
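To illustrate what a mathematical multiset with a subset test could look like, here is a minimal sketch that stores each character together with its multiplicity. The class name CharMultiset and its member functions are hypothetical, chosen only for this example, and the sketch deliberately stops short of a complete solution to the exercise.

```cpp
#include <map>
#include <string>

// Hypothetical multiset of characters, stored as character -> multiplicity.
class CharMultiset {
public:
    CharMultiset() = default;
    explicit CharMultiset(const std::string& chars) {
        for (const char c : chars) add(c);
    }
    void add(char c) { ++counts_[c]; }
    void remove(char c) {
        const auto it = counts_.find(c);
        if (it == counts_.end()) return;          // nothing to remove
        if (--it->second == 0) counts_.erase(it); // keep only positive counts
    }
    int count(char c) const {
        const auto it = counts_.find(c);
        return it == counts_.end() ? 0 : it->second;
    }
    // Multiset inclusion: every multiplicity here is <= the one in other.
    bool isSubsetOf(const CharMultiset& other) const {
        for (const auto& entry : counts_) {
            if (other.count(entry.first) < entry.second) return false;
        }
        return true;
    }
private:
    std::map<char, int> counts_;
};
```

With this type, the window validity test reads as pattern.isSubsetOf(window), and shifting the window is a call to add at one end and remove at the other.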

How to adjust the indentation of code in text file using c++?

I am copying the code from file 1 to file 2, but I want the code in file 2 to be properly indented, like this: at the beginning the indentation is 0, every opening curly bracket increases the indentation depth, and every closing curly bracket reduces it (by 4 spaces, for example). I need help fixing this so that it works:
char preCh;
int depth=0;
int tab = 3;
int d = 0;
int pos = 0;
file1.get(ch);
while(!file1.eof())
{
if(ch=='{')
{
d++;
}
if(ch=='}'){
d--;
}
depth = tab * d;
if(preCh == '{' && ch=='\n'){
file2.put(ch);
for (int i = 0; i <= depth; i++)
{
file2.put(' ');
}
}
else
file2.put(ch);
preCh = ch;
ch = file1.get();
}
}
The result must be indented like in code editors:
int main(){
    if(a>0)
    {
        something();
    }
}
Maybe unexpectedly for you, there is no easy answer to your question.
And because of that, your code will never work.
First and most important, you need to understand and define indentation styles. Please see the Wikipedia article on indentation styles. Even in your given mini example, you are mixing Allman and K&R. So first you must be clear about which one to use.
Then, you must be aware that brackets may appear in single quotes, double quotes, C comments, C++ comments and, even worse, multi-line comments (and #if or #ifdef blocks). This will make life really hard.
And for the closing brackets, in Allman style for example, you will know the needed indentation only after you have already printed the "indentation spaces". So you need to work line-oriented or use a line buffer before you print a complete line.
Example:
}
In this one simple line, you will read the '}' character after you have already printed the spaces. This will always lead to wrong (too far right) indentation.
The logic for this case alone would be complicated. And then consider statements like
if (x ==5) { y = 3; } } } }
So, unfortunately, I cannot give you an easy solution.
A parser would be needed, or I simply recommend using any kind of beautifier or pretty printer.

Better, or advantages in different ways of coding similar functions

I'm writing the code for a GUI (in C++), and right now I'm concerned with the organisation of text in lines. One of the problems I'm having is that the code is getting very long and confusing, and I'm starting to get into an n^2 scenario where, for every option I add for the text's presentation, the number of functions I have to write is the square of that. In trying to deal with this, a particular design choice has come up, and I don't know which method is better, or the extent of the advantages or disadvantages between them:
I have two methods which are very similar in flow, i.e. they iterate through the same objects, taking into account the same constraints, but ultimately perform different operations within this flow. For anyone's interest, the first method renders the text, and the second determines whether any text overflows the line, either because the text wraps around other objects or simply because it reaches the end of the line.
These functions need to be copied and rewritten for left, right or centred text, which have different flow, so whatever design choice I make would be repeated three times.
Basically, I could continue what I have now, which is two separate methods to handle these different actions, or I could merge them into one function, which has if statements within it to determine whether or not to render the text or figure out if any text overflows.
Is there a generally accepted right way to going about this? Otherwise, what are the tradeoffs concerned, what are the signs that might indicate one way should be used over the other? Is there some other way of doing things I've missed?
I've edited through this a few times to try and make it more understandable, but if it isn't please ask me some questions so I can edit and explain. I can also post the source code of the two different methods, but they use a lot of functions and objects that would take too long to explain.
// EDIT: Source Code //
Function 1:
void GUITextLine::renderLeftShifted(const GUIRenderInfo& renderInfo) {
    if(m_renderLines.empty())
        return;
    Uint iL = 0;
    Array2t<float> renderCoords;
    renderCoords.s_x = renderInfo.s_offset.s_x + m_renderLines[0].s_x;
    renderCoords.s_y = renderInfo.s_offset.s_y + m_y;
    float remainingPixelsInLine = m_renderLines[0].s_y;
    for (Uint iTO = 0; iTO != m_text.size(); ++iTO)
    {
        if(m_text[iTO].s_pixelWidth <= remainingPixelsInLine)
        {
            string preview = m_text[iTO].s_string;
            m_text[iTO].render(&renderCoords);
            remainingPixelsInLine -= m_text[iTO].s_pixelWidth;
        }
        else
        {
            FSInternalGlyphData intData = m_text[iTO].stealFSFastFontInternalData();
            float characterWidth = 0;
            Uint iFirstCharacterOfRenderLine = 0;
            for(Uint iC = 0;; ++iC)
            {
                if(iC == m_text[iTO].s_string.size())
                {
                    // wrap up
                    string renderPart = m_text[iTO].s_string;
                    renderPart.erase(iC, renderPart.size());
                    renderPart.erase(0, iFirstCharacterOfRenderLine);
                    m_text[iTO].s_font->renderString(renderPart.c_str(), intData,
                                                     &renderCoords);
                    break;
                }
                characterWidth += m_text[iTO].s_font->getWidthOfGlyph(intData,
                                                                      m_text[iTO].s_string[iC]);
                if(characterWidth > remainingPixelsInLine)
                {
                    // Can't push in the last character
                    // No more space in this line
                    // First though, render what we already have:
                    string renderPart = m_text[iTO].s_string;
                    renderPart.erase(iC, renderPart.size());
                    renderPart.erase(0, iFirstCharacterOfRenderLine);
                    m_text[iTO].s_font->renderString(renderPart.c_str(), intData,
                                                     &renderCoords);
                    if(++iL != m_renderLines.size())
                    {
                        remainingPixelsInLine = m_renderLines[iL].s_y;
                        renderCoords.s_x = renderInfo.s_offset.s_x + m_renderLines[iL].s_x;
                        // Cool, so now try rendering this character again.
                        // The new render line starts at the character that did not
                        // fit, so record it before stepping back for the ++iC above.
                        iFirstCharacterOfRenderLine = iC;
                        --iC;
                        characterWidth = 0;
                    }
                    else
                    {
                        // Quit
                        break;
                    }
                }
            }
        }
    }
    // Done!
}
Function 2:
vector<GUIText> GUITextLine::recalculateWrappingContraints_LeftShift()
{
    m_pixelsOfCharacters = 0;
    float pixelsRemaining = m_renderLines[0].s_y;
    Uint iRL = 0;
    // Go through every text object, fitting them into render lines
    for(Uint iTO = 0; iTO != m_text.size(); ++iTO)
    {
        // If an entire text object fits in a single line
        if(pixelsRemaining >= m_text[iTO].s_pixelWidth)
        {
            pixelsRemaining -= m_text[iTO].s_pixelWidth;
            m_pixelsOfCharacters += m_text[iTO].s_pixelWidth;
        }
        // Otherwise, character by character
        else
        {
            // Get some data now so we don't get it every function call
            FSInternalGlyphData intData = m_text[iTO].stealFSFastFontInternalData();
            for(Uint iC = 0; iC != m_text[iTO].s_string.size(); ++iC)
            {
                float characterWidth = m_text[iTO].s_font->getWidthOfGlyph(intData, '-');
                if(characterWidth < pixelsRemaining)
                {
                    pixelsRemaining -= characterWidth;
                    m_pixelsOfCharacters += characterWidth;
                }
                else // End of render line!
                {
                    m_pixelsOfWrapperCharacters += pixelsRemaining; // we might track how much wrapping px we use
                    // If this is true, then we ran out of render lines before we
                    // ran out of text. Means we have some overflow to return
                    if(++iRL == m_renderLines.size())
                    {
                        return harvestOverflowFrom(iTO, iC);
                    }
                    else
                    {
                        pixelsRemaining = m_renderLines[iRL].s_y;
                    }
                }
            }
        }
    }
    vector<GUIText> emptyOverflow;
    return emptyOverflow;
}
So basically, the render function takes the render info as a parameter and gets from it the global position it needs to render from. The wrapping-constraints function figures out how much text in the object goes over the allocated space, and returns that text.
m_renderLines is an std::vector of a two-float structure, where .s_x = where rendering can start and .s_y = how large the space for rendering is. Note that .s_y is essentially the width of the 'renderLine', not where it ends.
m_text is an std::vector of GUIText objects, which contain a string of text and some data, like style, colour, size etc. Each one also contains, under s_font, a reference to a font object, which performs rendering, calculates the width of a glyph, etc.
Hopefully this clears things up.
There is no generally accepted way in this case.
However, common practice in any programming scenario is to remove duplicated code.
I think you're getting stuck on how to divide code by direction, when direction changes the outcome too much to make this division. In these cases, focus on the common portions of the three algorithms and divide them into tasks.
I did something similar when I duplicated the WinForms flow layout control for MFC. I dealt with two types of objects: fixed positional (your pictures etc.) and auto positional (your words).
In the example you provided I can list out common portions of your example.
Write Line (direction)
bool TestPlaceWord (direction) // returns false if it cannot place word next to previous word
bool WrapPastObject (direction) // returns false if it runs out of line
bool WrapLine (direction) // returns false if it runs out of space for new line.
Each of these would be performed no matter what direction you are faced with.
Ultimately, the algorithm for each direction is just too different to simplify any more than that.
How about an implementation of the Visitor Pattern? It sounds like it might be the kind of thing you are after.
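To make the suggestion concrete, here is a minimal sketch of one wrapping traversal shared by two visitors. Everything in it (WrapVisitor, wrapLeft, the two collector structs and the fixed glyph width) is hypothetical and only stands in for the GUITextLine code in the question; treat it as one possible shape for the idea, not as the poster's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical visitor interface: the traversal reports events and the
// visitor decides what to do with them (draw, measure, collect overflow).
struct WrapVisitor {
    virtual ~WrapVisitor() = default;
    // Called for each run of characters that fits on render line `line`.
    virtual void onRun(std::size_t line, const std::string& run) = 0;
    // Called once if the text does not fit in the available lines.
    virtual void onOverflow(const std::string& rest) = 0;
};

// The single left-aligned traversal. ASSUMPTION: every glyph is
// `glyphWidth` pixels wide and all widths are positive, purely to keep
// the sketch self-contained.
void wrapLeft(const std::string& text, const std::vector<float>& lineWidths,
              float glyphWidth, WrapVisitor& visitor)
{
    std::size_t pos = 0;
    for (std::size_t line = 0; line < lineWidths.size() && pos < text.size(); ++line) {
        std::size_t fit = static_cast<std::size_t>(lineWidths[line] / glyphWidth);
        std::size_t count = std::min(fit, text.size() - pos);
        if (count > 0)
            visitor.onRun(line, text.substr(pos, count));
        pos += count;
    }
    if (pos < text.size())
        visitor.onOverflow(text.substr(pos));
}

// Stand-in for the rendering method: records what would be drawn.
struct CollectRuns : WrapVisitor {
    std::vector<std::string> runs;
    void onRun(std::size_t, const std::string& run) override { runs.push_back(run); }
    void onOverflow(const std::string&) override {}
};

// Stand-in for the overflow method: ignores runs, keeps the leftover text.
struct CollectOverflow : WrapVisitor {
    std::string overflow;
    void onRun(std::size_t, const std::string&) override {}
    void onOverflow(const std::string& rest) override { overflow = rest; }
};
```

With this shape, left, right and centred text would each need one traversal function rather than one per operation, and a new operation is a new visitor rather than three new functions.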

Optimizing WordWrap Algorithm

I have a word-wrap algorithm that basically generates lines of text that fit the width of the text box. Unfortunately, it gets slow when I add too much text.
I was wondering if I have overlooked any major optimizations that could be made. Also, if anyone has a better design that would still let me keep the lines as strings or pointers to strings, I'd be open to rewriting the algorithm.
Thanks
void AguiTextBox::makeLinesFromWordWrap()
{
    textRows.clear();
    textRows.push_back("");
    std::string curStr;
    std::string curWord;
    int curWordWidth = 0;
    int curLetterWidth = 0;
    int curLineWidth = 0;
    bool isVscroll = isVScrollNeeded();
    int voffset = 0;
    if(isVscroll)
    {
        voffset = pChildVScroll->getWidth();
    }
    int AdjWidthMinusVoffset = getAdjustedWidth() - voffset;
    int len = getTextLength();
    int bytesSkipped = 0;
    int letterLength = 0;
    size_t ind = 0;
    for(int i = 0; i < len; ++i)
    {
        //get the unicode character
        letterLength = _unicodeFunctions.bringToNextUnichar(ind,getText());
        curStr = getText().substr(bytesSkipped,letterLength);
        bytesSkipped += letterLength;
        curLetterWidth = getFont().getTextWidth(curStr);
        //push a new line
        if(curStr[0] == '\n')
        {
            textRows.back() += curWord;
            curWord = "";
            curLetterWidth = 0;
            curWordWidth = 0;
            curLineWidth = 0;
            textRows.push_back("");
            continue;
        }
        //ensure word is not longer than the width
        if(curWordWidth + curLetterWidth >= AdjWidthMinusVoffset &&
           curWord.length() >= 1)
        {
            textRows.back() += curWord;
            textRows.push_back("");
            curWord = "";
            curWordWidth = 0;
            curLineWidth = 0;
        }
        //add letter to word
        curWord += curStr;
        curWordWidth += curLetterWidth;
        //if we need a Vscroll bar start over
        if(!isVscroll && isVScrollNeeded())
        {
            isVscroll = true;
            voffset = pChildVScroll->getWidth();
            AdjWidthMinusVoffset = getAdjustedWidth() - voffset;
            i = -1;
            curWord = "";
            curStr = "";
            textRows.clear();
            textRows.push_back("");
            ind = 0;
            curWordWidth = 0;
            curLetterWidth = 0;
            curLineWidth = 0;
            bytesSkipped = 0;
            continue;
        }
        if(curLineWidth + curWordWidth >=
           AdjWidthMinusVoffset && textRows.back().length() >= 1)
        {
            textRows.push_back("");
            curLineWidth = 0;
        }
        if(curStr[0] == ' ' || curStr[0] == '-')
        {
            textRows.back() += curWord;
            curLineWidth += curWordWidth;
            curWord = "";
            curWordWidth = 0;
        }
    }
    if(curWord != "")
    {
        textRows.back() += curWord;
    }
    updateWidestLine();
}
There are two main things making this slower than it could be, I think.
The first, and probably less important: as you build up each line, you're appending words to the line. Each such operation may require the line to be reallocated and its old contents copied. For long lines, this is inefficient. However, I'm guessing that in actual use your lines are quite short (say 60-100 characters), in which case the cost is unlikely to be huge. Still, there's probably some efficiency to be won there.
The second, and probably much more important: you're apparently using this for a text-area in some sort of GUI, and I'm guessing that it's being typed into. If you're recomputing for every character typed, that's really going to hurt once the text gets long.
As long as the user is only adding characters at the end -- which is surely the most common case -- you can make effective use of the fact that with your "greedy" line-breaking algorithm changes never affect anything on earlier lines: so just recompute from the start of the last line.
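A minimal sketch of that append-only shortcut, under simplifying assumptions that are mine and not the question's: every character counts as one unit of width, lines break just after the last space that fits, and only line-start offsets are stored (to keep the example short). The class name AppendOnlyWrapper and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class AppendOnlyWrapper {
public:
    explicit AppendOnlyWrapper(std::size_t maxChars) : maxChars_(maxChars) {
        assert(maxChars_ > 0); // a zero-width line could never make progress
    }

    void append(const std::string& more) {
        text_ += more;
        // Greedy breaks already made depend only on text before them, so
        // earlier lines cannot change: restart at the last line's start.
        std::size_t start = 0;
        if (!lineStarts_.empty()) {
            start = lineStarts_.back();
            lineStarts_.pop_back();
        }
        wrapFrom(start);
    }

    std::size_t lineCount() const { return lineStarts_.size(); }

    // Builds line `i` on demand; a broken line keeps its trailing space.
    std::string line(std::size_t i) const {
        std::size_t begin = lineStarts_[i];
        std::size_t end = (i + 1 < lineStarts_.size()) ? lineStarts_[i + 1]
                                                       : text_.size();
        return text_.substr(begin, end - begin);
    }

private:
    void wrapFrom(std::size_t start) {
        while (start < text_.size()) {
            lineStarts_.push_back(start);
            std::size_t end = start + maxChars_;
            if (end >= text_.size())
                break; // the rest fits on this line
            // Prefer to break just after the last space within the line.
            std::size_t space = text_.rfind(' ', end - 1);
            if (space != std::string::npos && space >= start)
                end = space + 1;
            start = end;
        }
    }

    std::string text_;
    std::vector<std::size_t> lineStarts_;
    std::size_t maxChars_;
};
```

Appending "the quick br" and then "own fox jumps" with a width of 10 gives the same three lines as wrapping the whole string at once, but the second call only re-examines the text from the start of the last line.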
If you want to make it fast even when the user is typing (or deleting or whatever) somewhere in the middle of the text, your code will need to do more work and store more information. For instance: whenever you build a line, remember "if you start a line with this word, it ends with that word and this is the whole resulting line". Invalidate this information when anything changes within that line. Now, after a little editing, most changes will not require very much recalculation. You should work out the details of this for yourself because (1) it's a good exercise and (2) I need to go to bed now.
(To save on memory, you might prefer not to store whole lines at all -- whether or not you implement the sort of trick I just described. Instead, just store here's-the-next-line-break information and build up lines as your UI needs to render them.)
It's probably more complication than you want to take on board right now, but you should also look up Donald Knuth's dynamic-programming-based line-breaking algorithm. It's substantially more complicated than yours but can still be made quite quick, and it produces distinctly better results. See, e.g., http://defoe.sourceforge.net/folio/knuth-plass.html.
Algorithm problems often come with data-structure problems.
Let's make a few observations, first:
paragraphs can be treated independently
editing at a given index only invalidates the current word and those that follow
it is unnecessary to copy whole words when their index would suffice for retrieving them and only their length matters for the computation
Paragraph
I would begin by introducing the notion of a paragraph, which is delimited by user-introduced line breaks. When an edit takes place, you need to locate the paragraph concerned, which requires a look-up structure.
The "ideal" structure here would be a Fenwick Tree, for a small text box however this seems overkill. We'll just have each paragraph store the number of displayed lines that make up its representation and you'll count from the beginning. Note that an access to the last displayed line is an access to the last paragraph.
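For reference, here is roughly what the Fenwick tree variant would look like if the linear count ever became too slow. This is a sketch under my own assumptions: LineCountTree and its member names are hypothetical, paragraphs are numbered from 0, and every paragraph has a non-negative line count.

```cpp
#include <cstddef>
#include <vector>

// Fenwick (binary indexed) tree over per-paragraph displayed-line counts.
class LineCountTree {
public:
    explicit LineCountTree(std::size_t paragraphs) : tree_(paragraphs + 1, 0) {}

    // Add `delta` displayed lines to paragraph `p` (0-based).
    void add(std::size_t p, int delta) {
        for (std::size_t i = p + 1; i < tree_.size(); i += lowBit(i))
            tree_[i] += delta;
    }

    // Total displayed lines in paragraphs [0, p).
    int linesBefore(std::size_t p) const {
        int sum = 0;
        for (std::size_t i = p; i > 0; i -= lowBit(i))
            sum += tree_[i];
        return sum;
    }

    // Index of the paragraph containing displayed line `line` (0-based).
    // Assumes 0 <= line < total number of displayed lines.
    std::size_t paragraphOfLine(int line) const {
        std::size_t pos = 0;
        std::size_t step = 1;
        while (step * 2 < tree_.size())
            step *= 2; // largest power of two not above the paragraph count
        for (; step > 0; step /= 2) {
            std::size_t next = pos + step;
            if (next < tree_.size() && tree_[next] <= line) {
                pos = next;          // all of these paragraphs end before `line`
                line -= tree_[next];
            }
        }
        return pos; // number of whole paragraphs that precede `line`
    }

private:
    static std::size_t lowBit(std::size_t i) { return i & (~i + 1); }
    std::vector<int> tree_;
};
```

Both the update after a re-wrap and the look-up are O(log n) in the number of paragraphs, against O(n) for counting from the beginning.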
The paragraphs are thus stored as a contiguous sequence. In C++ terms, we'll probably take the hit of an indirection (i.e. storing pointers) to save moving them around when a paragraph in the middle is removed.
Each paragraph will store:
its content, the simplest being a single std::string to represent it.
its display, in editable form (which we need to determine still)
Each paragraph will cache its display, this paragraph cache will be invalidated whenever an edit is made.
The actual rendering will be made for only a couple of paragraphs at a time (and better, a couple of displayed lines): those which are visible.
Displayed Line
A paragraph is displayed with at least one line, but there is no maximum. We need to store the "display" in editable form, that is, a form suitable for editing.
A single chunk of characters with \n thrown in is not suitable. Changes imply moving lots of characters around, and users are supposed to be changing the text, so we need better.
Using lengths, instead of characters, we may actually only store a mere 4 bytes (if the string takes more than 3GB... I don't guarantee much about this algorithm).
My first idea was to use the character index; however, after an edit all subsequent indexes change, and the propagation is error-prone. Lengths are offsets, so we have an index relative to the position of the previous word. This does raise the question of what a word (or token) is. Notably, do you collapse multiple spaces? How do you handle them? Here I'll assume that words are separated from one another by a single whitespace character.
For "fast" retrieval, I'll store the length of the whole displayed line as well. This allows quickly skipping the first displayed lines when an edit is made at character 503 of the paragraph.
A displayed line will thus be composed of:
a total length (less than the maximum displayed length of the box, once the computation has ended)
a sequence of word (token) lengths
This sequence should be editable efficiently at both ends (since for wrapping we'll push/pop words at both ends depending on whether an edit added or removed words). It's not so important if in the middle we're not that efficient, because only one line at a time is edited in the middle.
In C++, either a vector or deque should be fine. While in theory a list would be "perfect", in practice its poor memory locality and high memory overhead will offset its asymptotic guarantees. A line is composed of few words, so the asymptotic behavior does not matter and high constants do.
Rendering
For the rendering, pick up a buffer of already sufficient length (a std::string with a call to reserve will do). Normally, you'd clear and rewrite the buffer each time, so no memory allocation occurs.
You need not display what cannot be seen, but do need to know how many lines there are, to pick up the correct paragraph.
Once you get the paragraph:
set offset to 0
for each line hidden, increment offset by its length (+ 1 for the space after it)
a word is accessed as a substring of _content; you can use the iterator-range insert method on buffer: buffer.insert(buffer.end(), _content.begin() + offset, _content.begin() + offset + length)
The difficulty is in maintaining offset, but that's what makes the algorithm efficient.
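The steps above can be sketched as follows. The LineDisplay here is a simplified stand-in (no back-reference to the paragraph), renderLine is a hypothetical name, and the code relies on the earlier assumption that words are separated by exactly one space. It uses std::string::append with a position and a count; the iterator-range insert does the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct LineDisplay {                   // simplified stand-in
    std::uint32_t length;              // characters in the line, inner spaces included
    std::vector<std::uint16_t> words;  // length of each word on the line
};

// Writes displayed line `index` of `content` into `buffer`.
void renderLine(const std::string& content,
                const std::vector<LineDisplay>& lines,
                std::size_t index, std::string& buffer)
{
    buffer.clear(); // in practice keeps the capacity, so no allocation in steady state

    // Skip the hidden lines: each contributes its length plus the one
    // space that separated it from the next line's first word.
    std::size_t offset = 0;
    for (std::size_t i = 0; i < index; ++i)
        offset += lines[i].length + 1;

    // Copy the words of the requested line, re-inserting single spaces.
    const std::vector<std::uint16_t>& words = lines[index].words;
    for (std::size_t w = 0; w < words.size(); ++w) {
        if (w > 0)
            buffer.push_back(' ');
        buffer.append(content, offset, words[w]);
        offset += words[w] + 1; // step over the word and the space after it
    }
}
```

For the content "the quick brown fox jumps" split into lines of word lengths {3,5}, {5,3} and {5}, asking for line 1 skips 10 characters and yields "brown fox".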
Structures
struct Paragraph; // forward declaration, needed by LineDisplay

struct LineDisplay: private boost::noncopyable
{
    Paragraph& _paragraph;
    uint32_t _length;
    std::vector<uint16_t> _words; // copying around can be done with memmove
};

struct Paragraph
{
    std::string _content;
    boost::ptr_vector<LineDisplay> _lines;
};
With this structure, implementation should be straightforward, and should not slow down as much when the content grows.
General change to the algorithm -
Work out whether you need the scroll bar as cheaply as you can, e.g. count the number of \n in the text and, if it's greater than the vheight, turn on the scroll bar; check lengths, and so on.
Prepare the text into appropriate lines for the control now that you know whether you need a scroll bar.
This allows you to remove or reduce the test if(!isVscroll && isVScrollNeeded()), which is run on almost every character. isVScrollNeeded() is probably not cheap, and the example code doesn't seem to pass knowledge of the lines to the function, so I can't see how it tells whether a scroll bar is needed.
Assuming textRows is a vector<string>, textRows.back() += is kind of expensive: not so much the lookup of the back element as the += itself, which is not efficient on strings. I'd change to using a std::ostringstream for gathering the row and push it in when it is done.
Calls to getFont().getTextWidth() are likely to be expensive. Is the font changing? How greatly does the width differ between the narrowest and widest glyphs? There are shortcuts for fixed-width fonts.
Use native methods where possible to get the size of a word since you don't want to break them - GetTextExtentPoint32
Often there will be sufficient space to allow for the VScroll when you change between the two. Restarting the measuring from the beginning could cost you up to twice the time. Store the width of the line with each line so you can skip over the ones that still fit.
Or don't build the line strings directly; keep the words separate, each with its size.
How accurate does it really need to be? Apply some pragmatism...
Just assume the VScroll will be needed; mostly the wrapping won't change much even if it isn't (1-letter words at the end/start of a line).
Try to work with words more than with letters: checking the remaining space for each letter can waste time. Assume each letter in the string is as wide as the widest letter; if letters x widest < remaining space, then put the word in without measuring it.
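A small sketch of that last shortcut. The names surelyFits and wordFits, and the measure callback standing in for the font's real width function, are hypothetical; the widest glyph width is assumed to be known up front for the current font.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Cheap, conservative test: true only if the word fits even when every
// letter is as wide as the widest glyph of the font.
bool surelyFits(std::size_t letters, int widestGlyph, int remaining) {
    return static_cast<long long>(letters) * widestGlyph < remaining;
}

// Falls back to the (assumed expensive) exact measurement only when the
// cheap test cannot decide.
bool wordFits(const std::string& word, int widestGlyph, int remaining,
              const std::function<int(const std::string&)>& measure) {
    if (surelyFits(word.size(), widestGlyph, remaining))
        return true; // no measuring needed
    return measure(word) < remaining;
}
```

Note that word.size() counts bytes, so for multi-byte UTF-8 text it overestimates the number of letters. That only makes the cheap test more cautious: it may fall through to the exact measurement more often, but it never wrongly accepts a word.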