Why is my regex C++ expression not working?

I have the following regex: \\(([^)]+)\\) (ignore the doubled backslashes; they are only there because of C++ string escaping) and the following code:
if (in_str.find("(") != string::npos) {
    print(to_string(countMatchInRegex(in_str, "\\([^ ]*\\.[^ ]*\\)")));
    for (int i = 0; i < countMatchInRegex(in_str, "\\(([^)]+)\\)"); ++i) {
        regex r("\\(([^)]+)\\)");
        smatch m;
        regex_search(in_str, m, r);
        string obj = m.str();
        obj = obj.substr(1, obj.length() - 2);
        string property = obj.substr(obj.find(".") + 1);
        obj = obj.substr(0, obj.find("."));
        in_str = replace(in_str, m.str(), process_property(obj, property));
    }
}
This code is supposed to find, in a string, substrings like (something.somethingelse). It works fine, except that it only works twice, and I don't know why. I do know the problem is not the countMatchInRegex function, because I printed its result and it returns the correct number of substrings that match the regex in the string.
If anyone has any ideas, please share them. I've been stuck on this for weeks.

I actually found the error. I don't know why, but the two countMatchInRegex calls were returning different values, so all I did was assign the result to a variable and use that variable everywhere the function was called in the code.
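A likely explanation, judging from the loop in the question: the for condition calls countMatchInRegex(in_str, ...) again on every iteration, while the loop body replaces one match in in_str each time. The count therefore shrinks as i grows, and the loop stops about halfway through. Below is a minimal sketch that avoids counting altogether and searches until nothing matches. process_property here is a hypothetical stand-in for the function in the question, and expand_properties is a name made up for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for the question's process_property().
std::string process_property(const std::string& obj, const std::string& property) {
    return "<" + obj + ":" + property + ">";
}

// Replaces every "(obj.property)" in the input, scanning left to right so that
// text produced by process_property() is never searched again.
std::string expand_properties(const std::string& in_str) {
    static const std::regex r(R"re(\(([^)]+)\))re");
    std::string out;
    auto pos = in_str.cbegin();
    std::smatch m;
    while (std::regex_search(pos, in_str.cend(), m, r)) {
        out.append(pos, m[0].first);            // text before the match
        const std::string inner = m[1].str();   // text between the parentheses
        const auto dot = inner.find('.');
        if (dot == std::string::npos) {
            out += m.str();                     // no dot: leave the match untouched
        } else {
            out += process_property(inner.substr(0, dot), inner.substr(dot + 1));
        }
        pos = m[0].second;                      // continue right after the match
    }
    out.append(pos, in_str.cend());             // remaining tail
    return out;
}
```

Because the scan position only moves forward, the loop terminates even if the replacement text itself contains parentheses.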

Related

Looking for a better algorithm for finding substrings in strings using Qt

UPDATE
I'll add some info about the problem to give you a better idea of why everything is done the way it is.
The main point of the whole script is to find all errors in a special file that keeps original and translated strings.
The script requires the "special" bilingual file (an XML file in real life) and a "special" vocabulary file that keeps words and their translations (xls or xlsx, constructed by hand; PO would probably be better).
As a result, it finds all errors in the translation using the provided vocabulary.
Obviously if the vocab is bad the result sucks.
At some point the whole thing used 'std', or mostly 'std', and 'boost regular expressions'.
At some other point came the need for UTF-8 support, including in the regular expressions. We had no time to write complex stuff, so it was decided to go the Qt way.
We were aware that it is possible to iterate over bytes, but we needed actual letters and sequences of letters. We also needed to cut off word endings, which is done through regular expressions, and, as far as we could tell, no other regex engine supported UTF-8 reasonably well.
It was decided that Qt fitted the role far better than anything we would write ourselves in very limited time, as Qt has solid Unicode support (QString stores its text as UTF-16 internally and converts to and from UTF-8).
It was pointed out that the complexity of the proposed solution looks like O(m * n).
In reality it's probably even worse: closer to O(m * n * log(l)) or even O(m * n * l) outright. Here m is the number of strings, n is the number of vocabulary records, and l is the number of synonyms each word has (l is always at least 1).
Since we need to check all strings, and for each string run through the whole vocabulary to find all errors, I currently see no way to make it fundamentally faster.
As the question implies I am looking for a better solution to an existing coding problem.
I am gonna try to explain what exactly the problem is as best as I can.
Imagine you have a piece of code written in C++ that takes a string and a translation of that string, and gets rid of the pesky word endings in both.
After that it takes another file which is a vocabulary and actually runs the whole vocab to find out whether the translation of the string has any errors.
Obviously this thing is highly dependent on the actual vocabulary, but that is not really a problem.
I actually have such a piece of code, although I need to mention that the whole thing runs through CGI (don't ask, but at some point it was decided that C++ would run it faster). I can upload the full code to a git repo; it's rather big, but I will share the essential parts here.
The current problem I am facing is twofold: either the code does not find everything it is supposed to, or it runs too slowly (it probably gets stuck somewhere, but I have not yet pinpointed where).
The main idea behind the code was:
// All definitions of the essential structures, so you have a better idea of what the hell is going on
struct Word {
    QString full = "";
    QString stemmed = "";
};
struct VocRecord {
    QVector<Word> orig;
    QVector<Word> trans;
    QString error = "";
    void clearRecord() {
        this->orig.clear();
        this->trans.clear();
        this->error = "";
    }
};
typedef QVector<VocRecord> Vocabluary;
......
Vocabluary voc = .....; // Obviously here we get the vocabulary. How we get it is rather complicated; you can just assume it looks like the vector of records defined above.
QString origStemmed, transStemmed, orig, trans;
// orig - original string
// trans - its translation
// origStemmed - original string with word endings removed (we call it stemming, hence "stemmed")
// transStemmed - translation with word endings removed
At first the algo was something along the lines of:
origStemmed = QString(" ") + origStemmed + QString(" "); // Add spaces at the beginning and end of the string for searching
transStemmed = QString(" ") + transStemmed + QString(" ");
for(int i = 0; i < voc.length(); i++) {
    VocRecord record = voc[i];
    for(int j = 0; j < record.orig.length(); j++) {
        Word origWord = record.orig[j];
        si = origStemmed.indexOf(QString(" ") + origWord.stemmed + QString(" "));
        if(si > -1) {
            int ind = origWord.stemmed.indexOf(" ");
            int idx = 0;
            if(ind != -1) {
                // Found a space in the record, which means the record contains at least two words.
                // Here we care where the first word ends, and it's part of the global problem.
                idx = origMod.indexOf(origWord.full.mid(0, ind));
            } else {
                // We did not find a space, so one word only; take the whole thing.
                idx = origMod.indexOf(origWord.full);
            }
            // Now comes the tricky part: we try to figure out whether the original text, in which we found our voc record, had any punctuation after the word.
            // This actually matters only for records that have more than one word, but as you'll see we check all of them, and that is not correct. Still figuring out how to get around it.
            QChar symb; // We'll keep the last symbol of the first word here
            // origMod - modified original: everything is lowercase, punctuation is kept.
            // The main reason we have this at all is that stemming gets rid of all punctuation, so we keep the "lowercased" string separately.
            // I am 100% sure we don't need it at all, since Qt supports case-insensitive search, but I would like to hear your opinion on it.
            if(origMod.indexOf(" ", idx) > 0) {
                symb = origMod[origMod.indexOf(" ", idx)-1];
            } else {
                symb = origMod[origMod.length()-1];
            }
            // Once we have the last symbol, we skip the found word if it is followed by punctuation
            if(ind != -1 && (symb == QChar(',') || symb == QChar(';') || symb == QChar('!') || symb == QChar(':') || symb == QChar('?') || symb == QChar('.'))) {
                continue;
            }
            // The important part ends here
............
As you will notice, we search for the stemmed word in the original string.
By all accounts it should work, but the main problem with this search is that it can have several matches, including false ones, and we only consider the first one found. The most obvious solution is probably to go through all matches, but I am unsure that is a good idea: it requires another loop, and the algo is quite slow already.
The next solution I came up with was to use regular expressions, but I must have messed up, because the algo became "really slow".
The main idea of the second solution:
// We do NOT add spaces! Spaces suck big time.
for(int i = 0; i < voc.length(); i++) {
    VocRecord record = voc[i];
    for(int j = 0; j < record.orig.length(); j++) {
        Word origWord = record.orig[j];
        // Instead of using spaces, we search for a regular expression made from the vocab record.
        // A simple contains() actually runs into the same set of problems, namely more than one match or, in some cases, false matches (when the searched part matches something it should not).
        // Now this is terribly slow, as you can imagine, because we create regular expressions on the fly instead of pre-making them. But I still have not thought of a way around it.
        if(origStemmed.contains(QRegularExpression(origWord.stemmed + "\\b",
                QRegularExpression::UseUnicodePropertiesOption | QRegularExpression::CaseInsensitiveOption))) {
            // Here we do something ungodly. We take our stemmed voc record, split it by spaces, then go through all the parts, making a string that will later become our regular expression.
            QString temp;
            parts.clear();
            parts = origWord.stemmed.split(" ");
            for(int k = 0; k < parts.count(); k++) {
                temp += "\\b" + parts[k] + "[a-z]*?\\b";
            }
            // After we have added everything we need, we join the whole thing back with spaces.
            temp = parts.join(" ");
            // And here is the ungodly check: we actually search for the regular expression we made in the original string, and because we made sure to exclude any punctuation from the expression, in theory this should work.
            if(!origMod.contains(QRegularExpression(temp, QRegularExpression::UseUnicodePropertiesOption | QRegularExpression::CaseInsensitiveOption))) {
                continue;
            }
            // Well, it does not work, or rather it works so slowly that it's impossible to get any result, and even if we do, we still don't find everything we should. I blame the shitty regex here.
            // And the important part ends.
As I pointed out, the second solution sucks big time. Currently I am aiming for some intermediate solution and would gladly accept any tips or suggestions on where to look or what to look for.
If any of you want to see the full code for this thing, just add a comment and I'll put all the important files on GitHub in a separate repo.
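The comments in the second attempt already point at one likely cost: a regular expression is compiled inside the innermost loop for every string. Note also that temp is overwritten by parts.join(" ") after the loop, so the pieces built inside the loop are discarded. A rough sketch of compiling each pattern once per vocabulary entry and reusing it for all strings follows. It uses std::regex and std::string only so that it stands alone; std::regex is not Unicode-aware, so in the Qt code the same shape would hold QRegularExpression objects instead. CompiledEntry, compileEntry, escapeRegex and findEntries are names invented for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Backslash-escapes characters that are special in an ECMAScript regex.
std::string escapeRegex(const std::string& text) {
    static const std::string special = "\\.^$|()[]{}*+?";
    std::string out;
    for (char c : text) {
        if (special.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

struct CompiledEntry {
    std::string stemmed;  // the stemmed vocabulary phrase
    std::regex pattern;   // compiled once, reused for every string
};

// Builds "\bword1\w*\b\s+\bword2\w*\b..." from a space-separated stemmed phrase.
CompiledEntry compileEntry(const std::string& stemmed) {
    std::istringstream words(stemmed);
    std::string word, pattern;
    while (words >> word) {
        if (!pattern.empty()) pattern += R"(\s+)";
        pattern += R"(\b)" + escapeRegex(word) + R"(\w*\b)";
    }
    return {stemmed, std::regex(pattern, std::regex::icase)};
}

// Returns the indices of the entries whose phrase occurs in the text.
std::vector<std::size_t> findEntries(const std::vector<CompiledEntry>& voc,
                                     const std::string& text) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < voc.size(); ++i) {
        if (std::regex_search(text, voc[i].pattern)) hits.push_back(i);
    }
    return hits;
}
```

The vocabulary would be compiled once at load time, so the per-string work is reduced to running already-built patterns. This does not change the O(m * n) shape of the search; it only removes the repeated compilation.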

Edit string by calling it using concatenation in C++

I'm a very new C++ user (and a new StackOverflow user at that), and I'm trying to code a very basic Tic-Tac-Toe (Naughts and Crosses) game. I'm not sure how to render the board as it is updated.
My main question is whether it is possible to refer to a string variable by building its name through concatenation. I have an array set up that indexes the states of the 9 spaces of the board, using 0 for empty, 1 for an X, and 2 for an O. If I set up 9 variables named bit1, bit2, etc. in a user-defined renderBoard() function, can I refer to them this way:
void renderBoard()
{
int i = 1;
string bit1;
string bit2;
string bit3;
string bit4;
string bit5;
string bit6;
string bit7;
string bit8;
string bit9;
while (i < 10)
{
if (Spaces[i] = 0)
{
(bit + i) = * //This is the main bit I'm wondering about
}
else
{
//Check for 1, 2, and edit the string bits accordingly
}
++i;
}
//Put all of the strings together, as well as some more strings for adding the grid
//Output the whole concatenated string to the command line
}
If anyone knows of a better way to do this, please let me know. I've tried Googling and rifling through various C++ help websites, but I find it difficult to express my particular case through anything other than a long-winded and specific explanation.
Thanks for your help!!

If I understood correctly, you want to access the strings named bit1, bit2, etc. through a variable i, as in bit + i.
No, you cannot do that!
It is a compile-time error.
Please correct me if I didn't get what you are looking for.
But one question remains: why are you using separate string variables bit1, bit2, etc.?
I think you just want to store a single character in each of those strings. If that is the case, you can use a single string of length 9.
You can do this as follows:
int i = 0; // string indices, like array indices, start from 0
string bit(9, ' '); // a string of length 9, filled with spaces by default (use whatever default you like)
while (i < 9) { // i < 9 because the highest index is 8
    if (Spaces[i] == 0) {
        bit[i] = '*';
    } else {
        // check for 1 and 2 and set bit[i] accordingly
    }
    ++i;
}

Declaring 9 variables like this is clearly the wrong approach. What you are looking for is an array.
std::array<std::string, 9> bits;
(You need #include <array> and #include <string>.)
Then you can traverse the array using a for-loop (in C++, arrays are indexed starting from zero, not one):
for (std::size_t i = 0; i < 9; ++i) {
// operate on bits[i]
}
In the for-loop, you can use the subscript operator to access the element: bits[i].
Finally, to put all the strings together, use std::accumulate:
std::accumulate(bits.begin(), bits.end(), std::string{})
(You need #include <numeric>.)
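Putting the pieces from both answers together, a board renderer might look roughly like the sketch below. It assumes the encoding from the question (0 = empty, 1 = X, 2 = O) and takes the board as a std::array<int, 9> parameter instead of the question's global Spaces array; the grid characters are an arbitrary choice made for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Board cells: 0 = empty, 1 = X, 2 = O (the encoding described in the question).
std::string renderBoard(const std::array<int, 9>& spaces) {
    std::string out;
    for (std::size_t i = 0; i < spaces.size(); ++i) {
        char mark = '*';                      // empty cell
        if (spaces[i] == 1) mark = 'X';
        else if (spaces[i] == 2) mark = 'O';
        out += mark;
        if (i % 3 != 2) out += '|';           // separator inside a row
        else if (i != 8) out += "\n-+-+-\n";  // divider between rows
        else out += '\n';                     // end of the last row
    }
    return out;
}
```

Writing the result with std::cout << renderBoard(board); prints the whole grid in one call, and no per-cell variables are needed.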

How can I detect multiple function blocks with braces "{}" using regex?

I need a regular expression that extracts only the function blocks from the following text.
Example:
// Comment 1. function example1() { return 1; } // Comment 2 function example2() { if (a < b) { a++ } } // Comment 3 function example3() { while (1) { i++; } } /* Comment 4 */ function example4() { i = 4; for (i = 1; i < 10; i++) { i++; } return i; }
Note that there are no line breaks; it is a single block of code.
I have tried using the following regular expression:
Expression:
function\s[a-z|A-Z|0-9_]+()\s?{(?:.+)\s}
But there is a problem: the .+ consumes all characters up to the end of the text block.
Thanks in advance guys for the help you can give me.

In PCRE (PHP, R, Delphi), you can achieve this with recursion:
function\s[a-zA-Z0-9_]+\(\)(\s?{(?>[^{}]|(?1))*})
See demo.
In Ruby, just use \g<1> instead of (?1):
function\s[a-zA-Z0-9_]+\(\)(\s?{(?>[^{}]|(\g<1>))*})
In .NET, you can match them using balanced groups:
function\s[a-zA-Z0-9_]+\(\)\s*{((?<sq>{)|(?<-sq>})|[^{}]*)+}
See demo 2
In other languages, there is no recursion, and you need to use a workaround by adding nested levels "manually". In your examples, you have 2 levels.
Thus, in Python, it will look like:
function\s[a-zA-Z0-9_]+\(\)(?:\s?{(?:[^{}]*(?:\s*{[^{}]*}[^{}]*)*})*)
See Python demo (also works in JavaScript).
In Java, you will need to escape {:
function\s[a-zA-Z0-9_]+\(\)(?:\s?\{(?:[^\{}]*(?:\s*\{[^{}]*}[^{}]*)*})*)
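The same manual-nesting workaround carries over to C++, whose std::regex (ECMAScript grammar) has no recursion either. The sketch below assumes at most two levels of braces, as in the example text. The braces are escaped because an unescaped { can be read as the start of a quantifier. extractFunctions is a name made up for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every "function name() { ... }" block, allowing one nested level of braces.
std::vector<std::string> extractFunctions(const std::string& text) {
    static const std::regex fn(
        R"re(function\s[a-zA-Z0-9_]+\(\)\s?\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})re");
    std::vector<std::string> blocks;
    for (std::sregex_iterator it(text.begin(), text.end(), fn), end; it != end; ++it) {
        blocks.push_back(it->str());
    }
    return blocks;
}
```

A block with a third level of braces would not match as a whole; each extra level needs another nested group in the pattern, which is the limitation of this workaround.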

Split an even-length string in C++

I am very new to C++. I am trying to split a string that contains even-length substrings until there is no even-length substring left. For example, if I input AB ABCD ABC, the output should be A B A B C D ABC. I am trying to do it without tokens, because I don't know how to use them.
What I have so far only splits the first even substring, and it doesn't work if I only have 1 substring. Can someone please help me out?
Any advice will be much appreciated. Thank you!
string temp = "";
void check(string &str, int &i, int &flag)
{
int count = 0;
int reminder;
do
{
count++;
temp += str[i];
i++;
} while (str[i] != ' ');
i = i - temp.size();
reminder = count % 2;
if (reminder == 0)
flag = 1;
else
flag = 0;
}
void SplitEvenWord(string &str)
{
int i = 0;
int flag = 0;
for (i = 0; i < str.size(); i++)
{
check(str, i, flag);
if (flag == 1)
{
temp.insert(temp.size() / 2, " ");
str.replace(i, temp.size() - 1, temp);
}
}
}

There are two skills that are absolutely vital in software engineering (Well, more than two, but two for now): developing new functions in isolation, and testing things in the simplest possible way.
You say that the code fails if there is only one substring. You don't say how it fails (I should have mentioned clear error reports in the list) so I don't know whether to test your code with an even-length string which it ought to split ("ABCD" => "A B C D") or an odd-length string which it ought to leave alone ("ABC" => "ABC"). Before I try to code these up, I look at your first function:
void check(string &str, int &i, int &flag)
{
...
do
{
count++;
temp += str[i];
i++;
} while (str[i] != ' ');
...
}
Trouble already. The strings I have in mind do not contain any spaces, so the loop cannot terminate. This code will run past the end of the string into whatever happens to be in that memory space, which will cause undefined behavior. (If you don't know that term, it means that there's no telling what will happen, but if you're lucky the program will just crash.)
Fix that, try running that code on "ABC" and "ABCD" and "A" and "" and "ABC DEF", and get it working perfectly. Once it does, take a look at your other function. Don't test it with random typing, test it with short, clearly defined strings. Once it works perfectly, try longer, more complicated ones. If you find a string which causes it to fail, hold onto it! That string will lead you to a bug.
That should be enough to get you started.

I'm writing this as an answer because it was too long to fit as a comment.
I have a couple of suggestions that may help you to figure out what the problem is.
Separate "check" into at least two functions, one to split the string into individual words and check them and one to check the length of the string.
Test the "check" and "tokenize" functions separately and see if they give you the expected answers. Work on them individually until they are correct.
Separate the formatting of the answers out of "SplitEvenWord" into a separate function.
"SplitEvenWord" should then be nothing more than calling the functions you created as a result of the steps above.
When I'm stuck, I always try to break the problem down into small bite sized pieces that I know I can get working. Eventually, the problem becomes assembling the already working pieces of the solution into a larger function that solves the original problem.
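As one possible way to follow that advice, here is a small sketch that separates reading words from splitting a single word. It assumes the rule is "halve a word repeatedly until the pieces have odd length", which matches the AB ABCD ABC example in the question. splitWord and splitEvenWords are names chosen for this illustration, not functions from the question.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Appends `word` to `out`, halving it for as long as its length is even.
void splitWord(const std::string& word, std::string& out) {
    if (word.empty()) return;                 // nothing to add
    if (word.size() % 2 != 0) {               // odd length: keep the piece whole
        if (!out.empty()) out += ' ';
        out += word;
        return;
    }
    const std::size_t half = word.size() / 2;
    splitWord(word.substr(0, half), out);     // left half
    splitWord(word.substr(half), out);        // right half
}

// Splits the input on whitespace and processes each word independently.
std::string splitEvenWords(const std::string& input) {
    std::istringstream words(input);
    std::string word, out;
    while (words >> word) splitWord(word, out);
    return out;
}
```

Each function can be tested on its own with short inputs such as "", "A", "ABC" and "ABCD", and no index ever runs past the end of the string because the stream extraction handles the word boundaries.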

Regex appears to return pos >1 but length 0

The following code is returning an error on Mid, saying the third argument is -2 - so it thinks the length is 0. We're totally stumped as to how this could happen. The code looks for values between curly braces and strips them out. Can you think of a way to break this?
Str can be anything - we don't know, it's not supplied by us - so that's the var you want to break.
str = "Here's a string with {EmailAddy} and maybe some {otherVariables}";
start = 1;
pos = 0;
length = 0;
tokens = ArrayNew(1);
while(true) {
x = REFind("\{\w*\}", str, start, true);
pos = x.pos[1];
length = x.len[1];
if (pos == 0) {
break;
} else {
// get the token, trimming the curly brackets
token = mid(str, pos+1, length-2);
arrayAppend(tokens, token);
start = pos + length;
}
}
WriteDump(tokens);

You don't need lookbehind:
var Tokens = rematch( "\{\w*(?=\})" , Arguments.Str );
for ( var i = 1 ; i LTE ArrayLen(Tokens) ; i++ )
Tokens[i] = Tokens[i].substring(1);
return Tokens;
And that code should also give you a clue as to the most likely cause of the code breaking, in that you've probably got it in a function in a persisted component, but (without any scoping) everything is going in the component's variables scope, and thus it's not thread-safe and - with multiple calls under load - the variables involved are liable to get corrupted.
This is a general issue you should be looking for throughout the code: generally, the first assignment of every variable inside a function should be prefixed with the var keyword (or explicitly placed in the local. scope) to ensure it is local to that function and not global to the component. (Except, of course, in the instances where a global variable is what is desired.)
Oh, and if you ever do actually want/need to use lookbehind in CF, I've made cfRegex, a library that wraps Java's more powerful regex engine, providing support for lookbehind (with limited-width), and with a (hopefully) easy to use and consistent set of functions for interacting with it.
