Tokenize with colon using std::tr1::regex - regex

I'm working on a quasi-SCPI command parser and I want to split a string based on colons, ignoring quoted strings. I want to get an empty string if there is no text between colons.
If I use this regular expression in EditPad Pro 7.2.2, it does exactly what I want.
(([^:\"']|\"[^\"]*\"|'[^']*')+)?
As an example, using this data string:
:foo:::bar:baz
I get 6 hits: [empty],foo,[empty],[empty],bar,baz
So far, so good. However, in my code, using std::tr1::regex, I'm getting 9 hits with the same data string. It seems like I'm getting an extra empty hit after each non-empty hit.
void RICommandState::InitRawCommandEnum(const std::string& full_command)
{
// Split string by colons, but ignore text within quotes.
static const std::tr1::regex split_by_colon("(([^:\"']|\"[^\"]*\"|'[^']*')+)?");
raw_command_list.clear();
raw_command_index = 0;
DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum FULL '%S'"), full_command.c_str()));
const std::tr1::sregex_token_iterator end;
for (std::tr1::sregex_token_iterator it(full_command.begin(),
full_command.end(),
split_by_colon);
it != end;
it++)
{
raw_command_list.push_back(*it);
const std::string temp(*it);
DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum '%S'"), temp.c_str()));
}
DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum hits = %d"), raw_command_list.size()));
}
And here is my output:
InitRawCommandEnum FULL ':foo:::bar:baz'
InitRawCommandEnum ''
InitRawCommandEnum 'foo'
InitRawCommandEnum ''
InitRawCommandEnum ''
InitRawCommandEnum ''
InitRawCommandEnum 'bar'
InitRawCommandEnum ''
InitRawCommandEnum 'baz'
InitRawCommandEnum ''
InitRawCommandEnum hits = 9
The most important question is how can I get my regex search to yield one (and only one) hit for every token delimited by a colon? Is the problem with my search expression?
Or maybe I'm misinterpreting the results? Do the empty strings after the non-empty strings have a special meaning? If so, what? And if that's the case, then is the correct solution to simply ignore them?
As a side question, I'm deeply curious why my code is behaving differently than EditPad Pro. EditPad is a useful test environment for experimenting with regular expressions, and it would be nice to know what the gotchas are.
Thanks!

It's still not clear to me what the empty strings mean, but I was able to work around them by ignoring them. I track the position of each hit within the search string and only process hits that start at or beyond the position just past the previously accepted hit and its trailing colon.
Here's my code, without modification. Note that my regex search expression is slightly different, but that's not critical to the answer.
void RICommandState::InitRawCommandEnum(const std::string& full_command)
{
// Split string by colons, but ignore text within quotes.
static const std::tr1::regex split_by_colon("(?:[^:\"']|\"[^\"]*\"|'[^']*')*");
raw_command_list.clear();
raw_command_index = 0;
std::tr1::sregex_iterator::difference_type minPosition = 0;
const std::tr1::sregex_iterator end;
for (std::tr1::sregex_iterator it(full_command.begin(),
full_command.end(),
split_by_colon);
it != end;
it++)
{
if (it->position() >= minPosition)
{
raw_command_list.push_back(it->str());
minPosition = it->position() + it->length() + 1;
}
}
}
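To see where the extra empty hits come from, it can help to list every hit together with its position. The sketch below is an illustration only: it assumes the C++11 `std::regex` classes instead of `std::tr1::regex`, and `RawMatch` and `list_raw_matches` are hypothetical names that are not part of the original code. Because the pattern can match the empty string, the iterator reports a zero-length hit wherever it resumes searching at a colon or at the end of the input, which includes the position right after each non-empty token.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// One raw regex hit: where it starts and what it matched.
struct RawMatch {
    std::ptrdiff_t position;
    std::string text;
};

// Collects every hit, including the zero-length ones, so they can be inspected.
std::vector<RawMatch> list_raw_matches(const std::string& text) {
    // Same pattern as in the workaround above: it may match an empty string.
    static const std::regex token("(?:[^:\"']|\"[^\"]*\"|'[^']*')*");
    std::vector<RawMatch> hits;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), token); it != end; ++it) {
        hits.push_back(RawMatch{it->position(), it->str()});
    }
    return hits;
}
```

For `:foo:::bar:baz` this should report nine hits. The empty hits that begin exactly where `foo`, `bar` and `baz` end are the ones the position check in the workaround skips.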

Related

Regex to check for + anywhere in a string using JavaScript or jQuery

I am trying to validate an input string to check whether it contains the '+' symbol anywhere in the string. I used a for...of loop but didn't get what I expected.
const isMobileValidWithoutPlus = funcLib.isValidMobileWithoutPlus(mobileNumber);
isValidMobileWithoutPlus(mobileNumber) {
if (!mobileNumber) {
return false;
}
const checkRegex = new RegExp('\\+?\\d+');
return checkRegex.test(mobileNumber);
}
But I am not able to get the desired output.
The regex for this would be
const rgx = /\+/; // no g flag: test() on a global regex keeps state in lastIndex between calls
Your regular expression checks whether the string contains an optional + followed by one or more digits. But you're saying you just want to check whether there's a "+" anywhere in the number. For that you can use the regex above.
Also, do you need to use a regex?
You can do this using indexOf on a string if using regex is not a must.
let number = "+001234";
function hasPlus(number) {
return number.indexOf('+') !== -1;
}
Regular expressions are generally useful when you don't have one specific string that you're looking for, or when you want to find all the occurrences of a pattern in a longer string. In your case, checking whether a string contains "+", they aren't necessary.

How to check which matching group was used to match (boost-regex)

I'm using boost::regex to parse a format string in which the '%' symbol is the escape character. Because I do not have much experience with boost::regex, or with regular expressions at all to be honest, I am working by trial and error. This code is a prototype that I came up with.
std::string regex_string =
"(?:%d\\{(.*)\\})|" //this group will catch string for formatting time
"(?:%([hHmMsSqQtTlLcCxXmMnNpP]))|" //symbols that have some meaning
"(?:\\{(.*?)\\})|" //some other groups
"(?:%(.*?)\\s)|"
"(?:([^%]*))";
boost::regex regex;
boost::smatch match;
try
{
regex.assign(regex_string, boost::regex_constants::icase);
boost::sregex_iterator res(pattern.begin(), pattern.end(), regex);
//pattern in line above is string which I'm parsing
boost::sregex_iterator end;
for(; res != end; ++res)
{
match = *res;
output << match.get_last_closed_paren();
//I want to know if the thing that was just written to output is from group describing time string
output << "\n";
}
}
catch(boost::regex_error &e)
{
output<<"regex error\n";
}
And this works pretty well: the output contains exactly what I want to catch. But I do not know which group it came from. I could do something like match[index_of_time_group]!="" but this is fragile and doesn't look good. If I change regex_string, the index of the group that catches the time-formatting string could also change.
Is there a neat way to do this? Something like naming groups? I'll be grateful for any help.
You can use boost::sub_match::matched bool member:
if(match[index_of_time_group].matched) process_it(match);
It is also possible to use named groups in the regexp, like (?<name_of_group>.*), and with this the line above could be changed to:
if(match["name_of_group"].matched) process_it(match);
Dynamically build regex_string from pairs of name/pattern, and return a name->index mapping as well as the regex. Then write some code that determines if the match comes from a given name.
If you are insane, you can do it at compile time (the mapping from tag to index that is). It isn't worth it.
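The idea of building the expression from name/pattern pairs can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: it uses `std::regex` instead of boost::regex so that it is self-contained, and `TaggedRegex`, `build_tagged_regex` and `matched_name` are hypothetical names. Each pattern is wrapped in its own capture group, and `mark_count()` is used to skip over any groups nested inside a pattern when computing the next index.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// A combined regex plus the capture-group index that wraps each named part.
struct TaggedRegex {
    std::regex regex;
    std::map<std::string, std::size_t> group_of;  // name -> group index
};

// Joins the patterns as alternatives, wrapping each one in its own group.
TaggedRegex build_tagged_regex(
    const std::vector<std::pair<std::string, std::string>>& parts) {
    TaggedRegex result;
    std::string combined;
    std::size_t next_index = 1;  // group 0 is the whole match
    for (const auto& part : parts) {
        if (!combined.empty()) combined += '|';
        combined += '(' + part.second + ')';
        result.group_of[part.first] = next_index;
        // Skip the wrapper group and any groups nested inside this pattern.
        next_index += 1 + std::regex(part.second).mark_count();
    }
    result.regex.assign(combined);
    return result;
}

// Returns the name of the alternative that produced the match, or "" if none.
std::string matched_name(const TaggedRegex& tagged, const std::smatch& match) {
    for (const auto& entry : tagged.group_of) {
        if (match[entry.second].matched) return entry.first;
    }
    return std::string();
}
```

With this layout the calling code asks for a group by name and never hard-codes an index, so reordering or adding patterns does not break the lookup.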

Regex to filter strings

I need to filter strings based on two requirements
1) they must start with "city_date"
2) they should not have "metro" anywhere in the string.
This needs to be done in just one check.
To start, I know it should be something like this, but I don't know how to eliminate strings with "metro":
string pattern = "city_date_"
Added: I need to use the regex in a SQL LIKE statement, hence I need it as a string.
Use a negative lookahead assertion (I don't know if this is supported in your regex lib)
string pattern = "^city_date(?!.*metro)"
I also added an anchor ^ at the start, that will match the start of the string.
The negative lookahead assertion (?!.*metro) will fail, if there is the string "metro" somewhere ahead.
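As a rough illustration of how the lookahead behaves, here is a minimal sketch that assumes C++ and `std::regex`, whose default ECMAScript grammar supports negative lookahead. The function name `is_wanted_name` is hypothetical, and whether your own regex library accepts this syntax is still something to verify.

```cpp
#include <regex>
#include <string>

// Returns true when `name` starts with "city_date" and "metro" does not
// appear anywhere after that prefix.
bool is_wanted_name(const std::string& name) {
    static const std::regex pattern("^city_date(?!.*metro)");
    // regex_search is enough here: the ^ anchor pins the match to the start.
    return std::regex_search(name, pattern);
}
```

Because the lookahead only inspects the text and consumes nothing, the match itself is just the `city_date` prefix; the rest of the string is examined but not captured.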
Regular expressions are usually far more expensive than direct comparisons. If direct comparisons can easily express the requirements, use them. This problem doesn't need the overhead of a regular expression. Just write the code:
std::string str = /* whatever */;
const std::string head = "city_date";
const std::string exclude = "metro";
// compare(pos, len, other) checks that str begins with head
if (str.compare(0, head.size(), head) == 0 && str.find(exclude) == std::string::npos) {
// process valid string
}
Using JavaScript:
function filterCity(input) // input contains the string you're matching
{
var pattern = /^city_date/; // matches city_date at the beginning
if (pattern.test(input))
{
var patt = /metro/;
if (patt.test(input)) return "false"; // contains metro
else return input; // matched string without metro
}
else
return "false"; // unable to match city_date
}

ATL regex to parse csv files

Can someone tell me what is wrong with the code below? I am trying to parse CSV files with this program, but it returns zero in the m_uNumGroups field.
int _tmain(int argc, _TCHAR* argv[])
{
CAtlRegExp<> reUrl;
// Five match groups: scheme, authority, path, query, fragment
REParseError status = reUrl.Parse(L"[^\",]+|(?:[ˆ\"])|\"\")");
if (REPARSE_ERROR_OK != status)
{
// Unexpected error.
return 0;
}
TCHAR testing[ ] = L"It’ s \" 10 Grand\" , baby";
CAtlREMatchContext<> mcUrl;
if (!reUrl.Match(testing,&mcUrl))
{
// Unexpected error.
return 0;
}
for (UINT nGroupIndex = 0; nGroupIndex < mcUrl.m_uNumGroups; ++nGroupIndex)
{
const CAtlREMatchContext<>::RECHAR* szStart = 0;
const CAtlREMatchContext<>::RECHAR* szEnd = 0;
mcUrl.GetMatch(nGroupIndex, &szStart, &szEnd);
ptrdiff_t nLength = szEnd - szStart;
printf_s("%d: \"%.*s\"\n", nGroupIndex, nLength, szStart);
}
return 0;
}
With ATL regular expression syntax you need to put curly brackets around the expression you want to capture. Your expression does not have any, so you're just doing a match without sub-expressions.
Check this out: http://msdn.microsoft.com/en-us/library/k3zs4axe%28v=vs.80%29.aspx
{ }
Indicates a match group. The actual text in the input that matches the expression inside the braces can be retrieved through the CAtlREMatchContext object.
I don't know C++, but if you're trying to parse "It’ s \" 10 Grand\" , baby" into It’ s \" 10 Grand\" and baby, then this fails for several reasons:
because that string is not valid CSV syntax. In CSV, quotes within fields need to be escaped by doubling (yours aren't escaped at all, only at string level), and fields that contain quotes must be surrounded by quotes. A valid CSV string would be "\"It’ s \"\" 10 Grand\"\"\", baby".
because your regex is wrong. Parsing CSV with regexes is hard, if not impossible, because of all the gotchas involved. Search StackOverflow for csv regex and find out that you should use a CSV parser instead.
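To give an idea of what a dedicated parser does instead of a regex, here is a minimal sketch of a splitter for a single CSV record. It assumes the quoting rules described above (fields optionally wrapped in double quotes, with a doubled quote standing for a literal quote) and does not handle records that span several lines. The function name `split_csv_line` is hypothetical, and a real project would more likely use an existing CSV library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one CSV record into fields. Quoted fields may contain commas,
// and a doubled quote ("") inside a quoted field stands for one quote.
std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c != '"') {
                current += c;  // ordinary character inside quotes
            } else if (i + 1 < line.size() && line[i + 1] == '"') {
                current += '"';  // escaped quote
                ++i;
            } else {
                in_quotes = false;  // closing quote
            }
        } else if (c == '"') {
            in_quotes = true;  // opening quote
        } else if (c == ',') {
            fields.push_back(current);  // field separator
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);  // last field
    return fields;
}
```

Whitespace around the separators is kept as part of the field, so trimming is left to the caller.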

Regex Rejecting matches because of Instr

What's the easiest way to do an "instring" type function with a regex? For example, how could I reject a whole string because of the presence of a single character such as :? For example:
this - okay
there:is - not okay because of :
More practically, how can I match the following string:
//foo/bar/baz[1]/ns:foo2/@attr/text()
For any node test on the xpath that doesn't include a namespace?
(/)?(/)([^:/]+)
Will match the node tests but includes the namespace prefix which makes it faulty.
I'm still not sure whether you just wanted to detect if the Xpath contains a namespace, or whether you want to remove the references to the namespace. So here's some sample code (in C#) that does both.
class Program
{
static void Main(string[] args)
{
string withNamespace = @"//foo/ns2:bar/baz[1]/ns:foo2/@attr/text()";
string withoutNamespace = @"//foo/bar/baz[1]/foo2/@attr/text()";
ShowStuff(withNamespace);
ShowStuff(withoutNamespace);
}
static void ShowStuff(string input)
{
Console.WriteLine("'{0}' does {1}contain namespaces", input, ContainsNamespace(input) ? "" : "not ");
Console.WriteLine("'{0}' without namespaces is '{1}'", input, StripNamespaces(input));
}
static bool ContainsNamespace(string input)
{
// a namespace must start with a character, but can have characters and numbers
// from that point on.
return Regex.IsMatch(input, @"/?\w[\w\d]+:\w[\w\d]+/?");
}
static string StripNamespaces(string input)
{
return Regex.Replace(input, @"(/?)\w[\w\d]+:(\w[\w\d]+)(/?)", "$1$2$3");
}
}
Hope that helps! Good luck.
Match on :? I think the question isn't clear enough, because the answer is so obvious:
if (Regex.IsMatch(input, ":")) // reject
You might want \w which is a "word" character. From javadocs, it is defined as [a-zA-Z_0-9], so if you don't want underscores either, that may not work....
I don't know regex syntax very well, but could you not do:
[any alphanumeric]*:[any alphanumeric]*
I think something like that should work, no?
Yeah, my question was not very clear. Here's a solution but rather than a single pass with a regex, I use a split and perform iteration. It works as well but isn't as elegant:
string xpath = "//foo/bar/baz[1]/ns:foo2/@attr/text()";
string[] nodetests = xpath.Split( new char[] { '/' } );
for (int i = 0; i < nodetests.Length; i++)
{
if (nodetests[i].Length > 0 && Regex.IsMatch( nodetests[i], @"^(\w|\[|\])+$" ))
{
// does not have a ":", we can manipulate it.
}
}
xpath = String.Join( "/", nodetests );