Find substr between delimiter characters in Qt with RegEx

I need to obtain a substring in a string in Qt, but with a few details:
the substring I need is delimited by [ and ]
the substring might have some unpredictable characters like /, ^, -. This substring basically describes a unit of measurement.
Also, besides obtaining the substring itself, I need to have a test to check if such a substring exists in the string or not.
I don't know anything about RegEx and I'm new to Qt as well. Most of the examples I found here don't relate to Qt and/or don't explicitly cover what I need.

QRegExp exp("\\[([^\\]]+)\\]");
QString s1 = "5 [sm^2]";
qDebug() << exp.indexIn(s1);
qDebug() << exp.capturedTexts();
Output:
2
("[sm^2]", "sm^2")
If none of the string's parts match the regexp, indexIn will indicate that by returning -1. Otherwise the result will be >= 0, and the capturedTexts()[1] will contain the text that was enclosed in brackets.

Related

How can I use RegEx to remove certain words from a string

I need to clean some cells and only keep important words to generate a search index.
Eg. "How to make an account recovery request" would be trimmed to "Make Account Recovery Request" because "How, To, An" would be in a list of words to be filtered out.
The other complexity is that it will also be in French and Spanish, which means that I have to deal with part-word like d'.
So far I've been trying to use a function but it doesn't work with part-word (d') and if "de" and "des" are listed in the same cell, it will remove DE from DES and then only keep the lonely S because DES is not recognized anymore:
Function ClearWords(s As String, rWords As Range) As String
Static RX As Object
If RX Is Nothing Then
Set RX = CreateObject("VBScript.RegExp")
RX.Global = True
RX.IgnoreCase = True
End If
RX.Pattern = "\b" & Replace(Join(Application.Transpose(rWords), "|"), ".", "\.") & "\b"
ClearWords = Application.Trim(RX.Replace(s, ""))
End Function
If you plan to support English, French, and other European languages you may leverage the regex I posted at Regular expression not working for at least one European character, (?![×÷])[A-Za-zÀ-ÿ]. This is a pattern that is supposed to match all the alphabetic chars you need to support. Since you are going to use it in VBA, it makes sense to replace literal extended letters with \uXXXX entities, and convert it to a single character class, [A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF] ([A-Za-zÀ-ÖØ-öø-ÿ] with literal chars).
Now, you need to build the custom boundaries. The initial boundary is either start of the string, ^, or any char other than the letters above (and possibly digits, and _, if you want to fully emulate \b). Since you want to replace, you need to put these two patterns into a (^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]) capturing group and use $1 in the replacement pattern to restore the value in order not to lose it. The trailing boundary is any char other than the letters above (or digits / _) or the end of the string. Since VBA regex supports lookaheads, we may just use a negative lookahead, (?![A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]).
Putting it together:
RX.Pattern = "(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])(?:" & Replace(Join(Application.Transpose(rWords), "|"), ".", "\.") & ")(?![A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])"
ClearWords = Application.Trim(RX.Replace(s, "$1"))
See this regex demo.
To also remove spaces before, replace "(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])(?:" with (?:\s+|(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]))(?:. See this regex demo.
Bonus: you seem to need to escape the words to use them in a regex:
Dim regExEscape As New RegExp
With regExEscape
.pattern = "[-/\\^$*+?.()|[\]{}]"
.Global = True
.MultiLine = False
End With
Just make sure you process all words you have instead of Replace(Join(Application.Transpose(rWords), "|"), ".", "\.").

Regular Expression starting and ending with special characters

I need to extract all matches from a huge text that start with [" and end with "]. These special characters separate each record from database. I need to extract all records.
Inside this record there are letters, numbers and special characters like -, ., &, (), /, {space} or so.
I'm writing this in Office VBA.
The pattern I have come so far looks like this: .Pattern = "[[][""][a-z|A-Z|w|W]*".
With this pattern, I am able to extract the first word from each record, with the starting characters [". The count of found matches is correct.
Example of one record:
["blabla","blabla","blabla","\u00e1no","nie","\u00e1no","\u00e1no","\u00e1no","\u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-pencil\u0022\u003E\u003C\/i\u003E Upravi\u0165\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;crz-form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva CRZ\u003C\/a\u003E"]
The question is : How can I extract the all records starting with [" and ending with "]?
I don't necessary need the starting and ending characters, but I can clean that up later.
Thanks for help.
The easiest way is to get rid of the initial and trailing [" and "] with either Replace or Left/Right/Mid functions, and then Split with "," (in VBA, """,""").
E.g.
input = "YOUR_STRING"
input = Replace(Replace(input, """]", ""), "[""", "")
result = Split(input, """,""")
If you plan to use Regex, you can use \["[\s\S]*?"] pattern, but it is not that efficient with long inputs and may even freeze the macro if timeout issue occurs. You can unroll it as
\["[^"]*(?:"(?!])[^"]*)*"]
See the regex demo. In VBA, Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"
Note that with this unrolled pattern, you do not even need to use the workarounds for dot matching newline issue (negated character class [^"] matches any char but ", including a newline).
Pattern details:
\[" - [" literally
[^"]* - zero or more characters other than "
(?:"(?!])[^"]*)* - zero or more sequences of
"(?!]) - " not followed with ]
[^"]* - zero or more characters other than "
"] - literal character sequence "]

Splitting strings separated by \r\n into array of strings [C/C++]

I have string containing e.g. "FirstWord\r\nSecondWord\r\nThird Word\n\r" and so on...
I want to split it to string array using vector <string> so I would get:
FileName[0] == "FirstWord";
FileName[1] == "SecondWord";
FileName[2] == "Third Word";
Also, note the space in the third string.
This is what I've got so far:
string text = Files; // Files var contains the huge string of lines separated by \r\n
vector<string> FileName; // (optionally) Here I want to store the result without \r\n
regex rx("[^\\s]+\r\n");
sregex_iterator FormatedFileList(text.begin(), text.end(), rx), rxend;
while(FormatedFileList != rxend)
{
FileName.push_back(FormatedFileList->str().c_str());
++FormatedFileList;
}
It works, but when it comes to the third string which is "Third Word\r\n", it only gives me "Word\r\n".
Can anyone explain to me how do the regular expressions work? I'm a bit confused.
\s matches all spaces, including regular space, tab and a few others. You only want to exclude \r and \n, so your regex should be
regex rx("[^\r\n]+\r\n");
EDIT: This will not fit in a comment, and it will not be exhaustive -- regexes are a fairly complex topic, but I'll do my best to give a cursory explanation. All of this does make more sense if you grok formal languages, so I encourage you to read up on it, and there are countless regex tutorials on the net that go into more detail and that you should also read. Okay.
Your code uses sregex_iterator to walk through all places in the string text where the regular expression rx matches, then turns them into strings and saves them. So, what are regular expressions?
Regular expressions are a way of applying pattern matching to strings. This can range from simple substring searches to...well, to complex substring searches, really. Instead of just looking for an instance of "oba" in the string "foobar", for example, you might search for "oo" followed by any character followed by "a" and find it in "foobar" as well as in "foonarf".
In order to enable this kind of pattern search, you must have a way to specify what pattern you are looking for, and one such way is regular expressions. The details vary across implementations, but in general it works by defining special characters that match special things or modify the behaviour of other parts of the pattern. This sounds confusing, so let's consider a few examples:
The period . matches any single character
Something followed by the Kleene star * matches zero or more instances of that something
Something followed by a + will match one or more instances of that something
Brackets [, ] enclose a set of characters; the whole thing then matches any one of those characters.
The caret ^ inverts the selection of a bracket expression
Still confusing. So let's put it together:
oo.a
is a regular expression using the period. This will match "oo.a", "ooba", "oona", "oo|a" and anything else that is two o's followed by one character followed by an a. It will not match "ooa", "oba" or "nonsense".
a*
will match "", "a", "aa", "aaa", and any other sequence consisting only of a's but nothing else.
[fgh]oobar
will match any of "foobar", "goobar", and "hoobar", nothing else.
[^fgh]oobar
will match "aoobar", "boobar", "coobar" and so forth but not "foobar", "goobar" and "hoobar".
[^fgh]+oobar
will match "aoobar", "aboobar", "abcoobar", but not "oobar", "foobar", "agoobar", and "abhoobar".
In your case,
[^\r\n]+\r\n
will match any instance of one or more characters that are neither \r nor \n followed by \r\n. You then iterate through all those matches and save the matched portions of text.
That is about as deep as I believe I can reasonably go here. This rabbit hole is very deep, which means that you can do freaky cool stuff with regexes but that you should not expect to master them in a day or two. Most of it goes along the lines of what I just outlined, but in true programmer's fashion, most regex implementations go beyond the mathematical scope of regular languages and expressions and introduce useful but mindbendy stuff. Dragons be ahead, but the journey is worth it.
One simple alternative will be to use split_regex from Boost. Eg. split_regex(out, input, boost::regex("(\r\n)+")) where out is a vector of string and input is the input string. A complete example is pasted below:
#include <vector>
#include <iostream>
#include <boost/algorithm/string/regex.hpp>
#include <boost/regex.hpp>
using std::endl;
using std::cout;
using std::string;
using std::vector;
using boost::algorithm::split_regex;
int main()
{
vector<string> out;
string input = "aabcdabc\r\n\r\ndhhh\r\ndabcpqrshhsshabc";
split_regex(out, input, boost::regex("(\r\n)+"));
for (auto &x : out) {
std::cout << "Split: " << x << std::endl;
}
return 0;
}
This is also one way to go:
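// strtok writes into its input, so tokenize a mutable copy instead of casting away const
string buffer = Files;
char * pch = strtok(&buffer[0], "\r\n");
while(pch != NULL)
{
FileName.push_back(pch);
pch = strtok(NULL, "\r\n");
}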
With regex rx("[^\\s]+\r\n"); you are matching the lines rather than splitting the string. The negated character class [^\\s] matches any character that is not whitespace (spaces, tabs or line breaks). The third line contains a space, so your regex only matches the text that comes after that space. In the default ECMAScript grammar, . matches any character except line breaks, so you could use regex rx(".+\r\n"); instead of regex rx("[^\\s]+\r\n");

Regular expression that finds and replaces a long string of words

I am new to Regular Expressions.
What is the expression that would find a long string of words that begin with a 3-digit number and place spaces at the beginning of capitalized words:
REPLACE:
013TheBlueCowJumpedOverTheFence1984.jpg
WITH:
013 The Blue Cow Jumped Over The Fence 1984
Note: removes the .jpg at the end
This will save me ooooodles of time.
I would not use regular expressions for this task. It's going to be ugly and hard to maintain. A better approach would be to loop through the string and rebuild the string as you go based on your input.
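string retVal = "";
foreach (char c in myInput) {
    if (char.IsUpper(c)) {
        retVal += " "; // start a new word before each capital letter
    }
    retVal += c; // keep the character itself
    // insert the rest of your conditions (digits, file extension) here
}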
Try this regular expression: \d+|[A-Z][a-z]*
It will collect all the matches, which you then join with spaces.
This will need two operations since the replacement is different for each.
The first:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z])))/
Replace with: ' $1' (note the space)
Will put spaces between the words. The second:
/\s*(.*)\s*\..*$/
Replace with: '$1'
Will remove trailing spaces and the extension.
The first expression can be broken into parts: (?<![\d])\d finds a digit not preceded by another digit, the second: ((?<![A-Z])[A-Z](?![A-Z])) finds an uppercase letter not preceded or followed by an uppercase letter.
You'll likely have more rules that you will want to incorporate into this, such as how are you dealing with the string: 'BackInTheUSSR.jpg'?
Edit: This should handle that example:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z]))|((?<![A-Z])[A-Z]+(?![a-z])))/
match:
'[A-Z][a-z]*'
replace with
' \0'
Note that this doesn't put a space before 1984, and it doesn't remove .jpg.
You can do the former by matching on
'[0-9]+|[A-Z][a-z]*'
instead. And the latter by removing it in a separate instruction, for example with a regexp replacement of '\.jpg$' with ''
Note that \'s need to be written as \\ in many languages.

Quick Regex Matches Question

(Yes, I am using regex to parse HTML; it's the only solution I know.)
I'm having trouble creating the regex for the piece of code below; there are about 10 matches per page.
Inner Text
This is the regex I've been trying.
Below is the code I usually use to get a match collection:
Private Function Extract(ByVal source As String) As String()
Dim mc As MatchCollection
Dim i As Integer
mc = Regex.Matches(source, _
"<A href=" & Chr(34) & "viewmessage.aspx?message_id *.</A>")
Dim results(mc.Count - 1) As String
For i = 0 To results.Length - 1
results(i) = mc(i).Value
Next
Return results
End Function
Dim str1 As String()
Dim str2 As String
Dim results As New StringBuilder
str1 = Extract(result)
For Each str2 In str1
results.Append(str2 & vbNewLine)
Next
RTBlinks.Text = results.ToString
Could anyone point out what I'm doing wrong? I have spent a few hours trying different things.
I program mainly as a hobby, so apologies if I've made any glaring errors.
You've got *. where you'd need .*. Right now, the quantifier * is applied to the space before it, and the dot matches exactly one character. Switch the two, remove the space (it matters, and there is no space in your test string at this point) and try again.
Be aware that .* matches greedily, i.e. as many characters as possible (except newlines). So if you have no more than one <A> tag per line, it should still work. A bit safer would be .*? instead, making the dot match as few characters as possible; even safer is [^<]*, which matches anything except opening angle brackets, making sure we don't cross tag boundaries.
However, all of those measures fail in certain, not uncommon situations (think comments, attribute strings, nested tags, invalid markup) which is why you should let regexes loose on markup languages only if you can exactly control your inputs and know your limitations.
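Also, I think that in VB.NET you can escape quotes inside a string by doubling them. Note that the . and ? in the literal part of the URL are regex metacharacters and need a backslash (an unescaped ? would make the preceding x optional instead of matching a question mark), so you can simply write
"<A href=""viewmessage\.aspx\?message_id=.*?</A>"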