Detecting text like "#smth" with RegExp (with some more terms) - c++

I'm really bad at regular expressions, so please help me.
I need to find any pieces like #text in a string.
The text must not contain any whitespace characters (\\s). Its length must be at least 2 characters ({2,}), and it must contain at least 1 letter (QChar::isLetter()).
Examples:
#c, #1, #123456, #123 456, #123_456 are incorrect
#cc, #text, #text123, #123text are correct
I use QRegExp.

QRegExp rx("#(\\S+[A-Za-z]\\S*|\\S*[A-Za-z]\\S+)$");
bool result = (rx.indexIn(str) == 0);
rx matches either one or more non-whitespace characters followed by a letter and then any number of non-whitespace characters, or any number of non-whitespace characters followed by a letter and then at least one non-whitespace character.

Styne666 gave the right regex.
Here is a little Perl script which is trying to match its first argument with this regex:
#!/usr/bin/env perl
use strict;
use warnings;
my $arg = shift;
if ($arg =~ m/(#(?=\d*[a-zA-Z])[a-zA-Z\d]{2,})/) {
print "$1 MATCHES THE PATTERN!\n";
} else {
print "NO MATCH\n";
}
Perl is always great to quickly test your regular expressions.
Now, your question is a bit different. You want to find all the substrings in your text string,
and you want to do it in C++/Qt. Here is what I could come up with in a couple of minutes:
#include <QtCore/QCoreApplication>
#include <QRegExp>
#include <iostream>
using namespace std;
int main(int argc, char *argv[])
{
QString str = argv[1];
QRegExp rx("[\\s]?(\\#(?=\\d*[a-zA-Z])[a-zA-Z\\d]{2,})\\b");
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1)
{
QString token = rx.cap(1);
cout << token.toStdString().c_str() << endl;
pos += rx.matchedLength();
}
return 0;
}
To make my test I feed it an input like this (making a long string just one command line argument):
peter@ubuntu01$ qt-regexp "#hjhj 4324 fdsafdsa #33e #22"
And it matches only two words: #hjhj and #33e.
Hope it helps.

The shortest I could come up with (which should work, but I haven't tested extensively) is:
QRegExp("^#(?=[0-9]*[A-Za-z])[A-Za-z0-9]{2,}$");
Which matches:
^ the start of the string
# a literal hash character
(?= then look ahead (but don't match)
[0-9]* zero or more latin numbers
[A-Za-z] a single upper- or lower-case latin letter
)
[A-Za-z0-9]{2,} then match at least two characters which may be upper- or lower-case latin letters or latin numbers
$ then find and consume the end of the line
Technically speaking though this is still wrong. It only matches latin letters and numbers. Replacing a few bits gives you:
QRegExp("^#(?=\\d*[^\\d\\s])\\w{2,}$");
This should work for non-latin letters and numbers but this is totally untested. Have a quick read of the QRegExp class reference for an explanation of each escaped group.
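And then to match within larger strings of text (again, untested). Note that \b has to be written as \\b inside a C++ string literal, otherwise the compiler turns it into a backspace character. Also be aware that # is not a word character, so the leading \\b only matches when the # directly follows a word character:
QRegExp("\\b#(?=\\d*[^\\d\\s])\\w{2,}\\b");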
A useful tool is the Regular Expressions Example which comes with the SDK.

Use this regular expression; hopefully it will solve your problem.
^([#(a-zA-Z)]+[(a-zA-Z0-9)]+)*(#[0-9]+[(a-zA-Z)]+[(a-zA-Z0-9)]*)*$

Related

Regex to replace single occurrence of character in C++ with another character

I am trying to replace a single occurrence of a character '1' in a String with a different character.
This same character can occur multiple times in the String which I am not interested in.
For example, in the below string I want to replace the single occurrence of 1 with 2.
input:-0001011101
output:-0002011102
I tried the regex below, but it is giving me wrong results:
regex b1("(1){1}");
S1 = regex_replace(S, b1, "2");
Any help would be greatly appreciated.
If you used boost::regex, Boost regex library, you could simply use a lookaround-based solution like
(?<!1)1(?!1)
And then replace with 2.
With std::regex, you cannot use lookbehinds, but you can use a regex that captures either start of string or any one char other than your char, then matches your char, and then makes sure your char does not occur immediately on the right.
Then, you may replace with $01 backreference to Group 1 (the 0 is necessary since the $12 replacement pattern would be parsed as Group 12, an empty string here since there is no Group 12 in the match structure):
regex reg("([^1]|^)1(?!1)");
S1 = std::regex_replace(S, reg, "$012");
See the C++ demo online:
#include <iostream>
#include <regex>
int main() {
std::string S = "-0001011101";
std::regex reg("([^1]|^)1(?!1)");
std::cout << std::regex_replace(S, reg, "$012") << std::endl;
return 0;
}
// => -0002011102
Details:
([^1]|^) - Capturing group 1: any char other than 1 ([^...] is a negated character class) or start of string (^ is a start of string anchor)
1 - a 1 char
(?!1) - a negative lookahead that fails the match if there is a 1 char immediately to the right of the current location.
Use a negative lookahead in the regexp to match a 1 that isn't followed by another 1:
regex b1("1(?!1)");

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ code, so there is no split (a Perl built-in function). Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
To also allow a single word at the start of the string, and to allow other characters after the word characters:
\G:?\K\w+

Ignore String containing special words (Months)

I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: an alphanumeric string is any string that contains at least a number and a letter plus any other special characters it can be # - _ [] () {} ç _ \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
vector<string> findString(string s)
{
vector<string> vec;
boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
boost::smatch match;
boost::sregex_iterator begin {s.begin(), s.end(), rgx},
end {};
for (boost::sregex_iterator& i = begin; i != end; ++i)
{
boost::smatch m = *i;
vec.push_back(m.str());
}
return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function.
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson : ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.
You can use negative look ahead, the same way you're using positive lookahead:
(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumerical strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uù]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
the i flag means it will match upper and lower case (so you can reduce your forbidden list), \w is a shortcut for [0-9a-zA-Z_] (careful if you copy-paste, there's a linebreak here for readability between (?! ) and [ ]). Just add in the final [...] whatever special characters you wanna match.

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match before a newline at the end of the string. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}

Regex: mask all but the last 5 digits, ignoring non-digits

I want to match a number containing 17-23 digits interspersed with spaces or hyphens, then replace all but the last five digits with asterisks. I can match with the following regex:
((?:(?:\d)([\s-]*)){12,18})(\d[\s-]*){5}
My problem is that I can't get the regex to group all instances of [\s-] in the first section, and I have no idea how to get it to replace the initial 12-18 digits with asterisks (*).
How about this:
s/\d(?=(?:[ -]*\d){5,22}(?![ -]*\d))/*/g
The positive lookahead ensures that there are at least 5 digits ahead of the just-matched digit, while the embedded negative lookahead ensures that there aren't more than 22.
However, there could still be more digits before the first-matched digit. That is, if there are 24 or more digits, this regex only operates on the last 23 of them. I don't know if that's a problem for you.
Even assuming that this is feasible with regex alone I'd bet that it would be way slower than using the non-capturing version of your regex and then reverse iterating over the match, leaving the first 5 digits alone and replacing the rest of them with '*'.
I think your regex is ok, but you might need to have a callback where you can insert the asterisks with another inline regex. The below is a Perl example.
s/((?:\d[\s-]*){12,18})((?:\d[\s-]*){4}\d)/ add_asterisks($1,$2) /xeg
use strict;
use warnings;
my $str = 'sequence of digits 01-2 3-456-7-190 123-416 78 ';
if ($str =~ s/((?:\d[\s-]*){12,18})((?:\d[\s-]*){4}\d)/ add_asterisks($1,$2) /xeg )
{
print "New string: '$str'\n";
}
sub add_asterisks {
my ($pre,$post) = @_;
$pre =~ s/\d/*/g;
return $pre . $post;
}
__END__
Output
New string: 'sequence of digits **-* *-***-*-*** ***-416 78 '
Here is a Java variant of Alan Moore's answer that masks all word characters (\w, i.e. [a-zA-Z0-9_]) instead of just digits (\d).
This will also work with any length string.
public String maskNumber(String number){
String regex = "\\w(?=(?:\\W*\\w){4,}(?!\\W*\\w))";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(number);
while(m.find()){
number = number.replaceFirst(m.group(),"*");
}
return number;
}
This example
String[] numbers = {
"F4546-6565-55654-5457",
"F4546-6565-55654-54-D57",
"F4546-6565-55654-54-D;5.7",
"F4546-6565-55654-54-g5.37",
"hd6g83g.duj7*ndjd.(njdhg75){7dh i8}",
"####.####.####.675D-45",
"****.****.****.675D-45",
"**",
"12"
};
for (String number : numbers){
System.out.println(maskNumber(number));
}
Gives:
*****-****-*****-5457
*****-****-*****-*4-D57
*****-****-*****-*4-D;5.7
*****-****-*****-**-g5.37
*******.*********.(*******){*dh i8}
####.####.####.**5D-45
****.****.****.**5D-45
**
12