Trouble with C++ Regex POSIX character class

Trouble with C++ Regex POSIX character class - c++

I'm trying to create a regex that is capable of analysing something like this:
002561-1415179671591i.jpg
The second part is a unix timestamp (before the i), and I need to extract that. I came up with the following syntax, but std::regex keeps throwing a regex_error before I even check for a match and I'm not too sure what's wrong.
Here's what I've got so far:([:d:])*-[[:d:]]*([:alpha:])\.jpg
The C++ code line that throws the error is the constructor to regex
std::regex reg(regex_expr);
where regex_expr is a string that has been read in from a file.
Really appreciate any help you can give.
Edit: I wrote a try catch section, and it seems I'm getting the following error code
std::regex_constants::error_brack
Edit 2: Ok... seems I'm still getting the error even with a cut-down test:
([:alpha:])*
Edit 3:
Seems I can't get any expression to work. Here's a bit more info. I'm using clang++ 3.5.0 on Kubuntu 14.04.

Not sure that [:d:] can stand for [:digit:]. [EDIT] (It seems it's possible)
When you use a POSIX character class, it must be enclosed in a character class like that:
[[:digit:]]
(This syntax allows to compose other classes [[:digit:]ab])
so:
std::string regex_expr = "([[:digit:]]*)-([[:digit:]]*)([[:alpha:]])\\.jpg";
or if you use the basic mode:
std::string regex_expr = "\\([[:digit:]]*\\)-\\([[:digit:]]*\\)\\([[:alpha:]]\\)\\.jpg";

I would rather use the perl-compatible syntax instead of character classes:
\d+-(\d+)[a-z]*\.jpg

After many tests, for whatever reason, I couldn't get this to work. As I had boost anyway, I tried out the boost::regex and it worked straight away, so I presume that something to do with either clang, or the version of the standard on this pc just wasn't working.
So simply put, tried boost and it worked straight away. Not much of an answer, but I guess that's how things go sometimes.

Related

VSCode Snippets: Format File Name from my_file_name to MyFileName

I am creating custom snippets for flutter/dart. My goal is to pull the file name (TM_FILENAME_BASE) remove all of the underscores and convert it to PascalCase (or camelCase).
Here is a link to what I have learned so far regarding regex and vscode's snippets.
https://code.visualstudio.com/docs/editor/userdefinedsnippets
I have been able to remove the underscores nicely with the following code
${TM_FILENAME_BASE/[\\_]/ /}
I can even make it all caps
${TM_FILENAME_BASE/(.*)/${1:/upcase}/}
However, it seems that I cannot do two steps at a time. I am not familiar with regex, this is just me fiddling around with this for the last couple of days.
If anyone could help out a fellow programmer just trying make coding simpler, it would be really appreciated!
I expect the output of "my_file_name" to be "MyFileName".

It's as easy as that: ${TM_FILENAME_BASE/(.*)/${1:/pascalcase}/}

For the camelCase version you mentioned, you can use:
${TM_FILENAME_BASE/(.*)/${1:/camelcase}/}

How to I make gerrit query that spans across few specific projects?

I tried for few hours to find the right syntax for making a regex query that returns reviews from 2-3 different projects but I failed and decided to crowdsource the task ;)
The search is documented at https://review.openstack.org/Documentation/user-search.html and mentions possible use of REGEX,... but it just didn't work.
Task: return all CRs from openstack-infra/gerritlib and openstack-infra/git-review projects from https://review.openstack.org
Doing it for one project works well project:openstack-infra/gerritlib
Ideally I would like to look for somethign like ^openstack-infra\/(gerritlib|git-review), or at least this is the standard regex syntax.
Still, I found impossible to use parentheses so far, every time I used them it stopped it from returning any results.

1) You don't need to escape the "/" character.
2) You need to use double quotes to make the parentheses work.
So the following search should work for you:
project:"^openstack-infra/(gerritlib|git-review)"

Using utf-8 characters in log4cxx

I need to be able to use utf-8-encoded strings with log4cxx. I can print the strings just fine with std::cout (the characters are displayed correctly). Using log4cxx, i.e. putting the strings into the LOG4CXX_DEBUG() macro with a ConsoleAppender will output "??" instead of the special character. I found one solution:
LOG4CXX_DECODE_CHAR(logstring, str);
LOG4CXX_DEBUG(logstring);
where str is my input string, but this does not work. Anyone have an idea how this might work? I google'd around a bit, but I couldn't find anything useful.

You can use
setlocale(LC_CTYPE, "UTF-8");
to set only the character encoding, without changing any other information about the locale.

I met the same problem and searched and searched. I found this post, It may work, but I don't like the setlocaleish solution. so i made more research, finally the solution came out.
I reconfigure log4cxx and build it, the problem was solved!
add two more configure options in log4cxx:
./configure --prefx=blabla --with-apr=blabla --with-apr-util=blabla --with-charset=utf-8 --with-logchar=utf-8
hope this will help anyone who need it.

One solution is to use
setlocale(LC_ALL, "en_US.UTF-8");
in my main function. This is OK for me, but if you want more localizable applications, this will probably become hard to track/use.

The first answer didn't work for me, the second one is more than i want. So I combined the two answers:
setlocale(LC_CTYPE, "xx_XX.UTF-8"); // or "xx_XX.utf8", it means the same
where xx_XX is some language tag. I tried to log strings in many languages with different alphabets (on LINUX, including Chinese, language left-to-right and rigth-to-left); so I tried:
setlocale(LC_CTYPE, "it_IT.UTF-8");
and it worked with any tested language. I cannot understand why the simple "UTF-8" without indicating a language xx_XX doesn't work, since i use UTF8 to be language-independent and one shouldn't indicate one. (If somebody know the reason also for that, would be an interesting improvement to the answer). Maybe this also depends by Operatin System.
Finally, on Linux you can get a list of the encodings by typing on shell:
# locale -a | grep utf

FILE Returns a String with "\/" in the Path

I use the __FILE__ macro for error messages. However, sometimes the path comes back as E:\x\y\/z.ext. It does this for specific files.
For example, E:\programming\v2\wwwindowclass.h comes back as E:\programming\v2\/wwwindowclass.h and E:\programming\v2\test.cpp comes back as E:\programming\v2\test.cpp. In fact, the only file in the directory that works seems to be test.cpp.
To work around this, I used jmucchiello's answer to this question to replace any occurrence of "/" with "\". This worked fine, and the displayed path changed to a normal one.
The problem was when I tried it on Windows 7 (after using XP). The string came up as (null) after calling the function.
Along with this, I sometimes get some seemingly random error 2: File not found errors. I'm unsure of whether this is related at all, but if there's an explanation, it would be nice to hear.
I've tried to find why __FILE__ would be returning the wrong string, but to no avail. I'm using GNU g++ 4.6.1. I'm not actually sure yet if the paths that were wrong in XP were wrong in Windows 7 too. Any insight is appreciated.

The function in the linked question appears to return NULL if there are no changes to make. Probably Windows 7 doesn't suffer from the \/ problem (in some cases).

As per MSalters's comment:
Typically, the compiler does so when you pass #include "v2/wwwindowclass.h" to the compiler.
Since every file has its own include statements, you can (but shouldn't) mix the two styles.
This was the case. My compiler automatically adds a forward slash.

Why is this regular expression faster?

I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
if (rTest.IsMatch(formattedString))
return rTest.Replace(formattedString, string.Empty);
else
return formattedString;
}
I'm new to regular expressions and I was suggested to use this:
static Regex rText = new Regex(#"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
new Regex(#"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster to and reduced the impact on my text rendering. Can someone explain to a regexp newbie, why? :)

Do you really want to do run the regexp twice? Without having checked (bad me) I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...

The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? is going to do lazy quantifing. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.

I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]

Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Trouble with C++ Regex POSIX character class - c++

I would rather use the perl-compatible syntax instead of character classes: \d+-(\d+)[a-z]*\.jpg

Related

VSCode Snippets: Format File Name from my_file_name to MyFileName

How to I make gerrit query that spans across few specific projects?

Using utf-8 characters in log4cxx

FILE Returns a String with "\/" in the Path

Why is this regular expression faster?

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Trouble with C++ Regex POSIX character class - c++

I would rather use the perl-compatible syntax instead of character classes: \d+-(\d+)[a-z]*\.jpg

Related

VSCode Snippets: Format File Name from my_file_name to MyFileName

How to I make gerrit query that spans across few specific projects?

Using utf-8 characters in log4cxx

__FILE__ Returns a String with "\/" in the Path

Why is this regular expression faster?

Categories

Resources

FILE Returns a String with "\/" in the Path