Regex for comparing special Characters in (C)Strings

I have an MFC project where I need to read and compare various configuration strings from (xml-)files.
The problem is that they could contain one or multiple special characters like STX, ETX, LF, CR ... and so on.
An idea is using regex. I could simply write the full regex pattern in the files and compare them with a match function.
When I looked this up via Google and MSDN, I found two different(?) regex frameworks for MFC, but I don't see any difference between them, nor can I tell whether they can solve my problem, i.e. handle special characters.
Does any of you have experience with those frameworks? Can you recommend one, or can you think of another solution for this problem?
Many thanks in advance.

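I recommend std::regex or boost::regex over non-standard alternatives. Both can handle special characters: control characters such as STX or ETX can be written in the pattern as hex escapes (\x02, \x03), and CR and LF as \r and \n.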
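As a rough illustration of the std::regex suggestion, here is a minimal sketch that compares a string against a pattern as it might be stored in a configuration file. The helper name matches_config is made up for this example, and converting an MFC CString to std::string is left out and assumed to happen beforehand.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of MFC or the question's code):
// returns true when the whole of `text` matches `pattern`.
// std::regex uses the ECMAScript grammar by default, which understands
// \xHH hex escapes as well as \r and \n.
bool matches_config(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    return std::regex_match(text, re);
}
```

A pattern stored in the file as \x02[A-Z]+\x03\r\n would then accept a value made of STX, uppercase letters, ETX, CR and LF. Note that std::regex throws std::regex_error for an invalid pattern, so patterns read from files probably deserve a try/catch.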

Related

Search and replace with particular phrase

I need help with a mass search and replace using regex.
I have longer strings in which I need to look for any number followed by a particular string, e.g. 321BS, and replace just the text string that I was looking for. So I need to look for BS in "gf test test2 321BS test" (the pattern is always the same, just the position differs) and change just BS.
Can you please help me find a regex for this?
Update: I need to keep the number and change just the text string. I will be doing this in Notepad++. However, I need a general function for this if possible. I am a rookie in regex. Moreover, is it possible to do it in Trados SDL Studio? Or how am I able to do it in an Excel file in bulk?
Thank you very much!

Your question is a bit vague; however, as I understand it, you want to match any digits followed by BS, i.e. 123BS. You want to keep 123 but replace BS?
Regex: (\d+)BS matches 123BS
In Notepad++ you can:
match (\d+)BS
replace \1NEWTEXT
This will replace 123BS with 123NEWTEXT.
\1 substitutes the capture group (\d+), which matches one or more digits.
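For a general function outside Notepad++, the same capture-group replacement can be sketched in C++ with std::regex_replace. The function name replace_suffix is invented for this example. In the default ECMAScript format string the group is written $1 rather than \1.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces the text "BS" that directly follows a
// run of digits with `replacement`, keeping the digits themselves.
std::string replace_suffix(const std::string& text, const std::string& replacement) {
    static const std::regex re(R"((\d+)BS)");
    // "$1" re-inserts the digits captured by (\d+).
    return std::regex_replace(text, re, "$1" + replacement);
}
```

Called with "gf test test2 321BS test" and "NEWTEXT", this yields "gf test test2 321NEWTEXT test". A replacement that itself contains $ would need escaping, which this sketch does not handle.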

You could do this in Trados Studio using an app. The SDLXLIFF Toolkit may be the most appropriate for you. The advantage over Notepad++ is that it's controlled and will only affect the translatable text and not anything that might break the integrity of the file if you make a mistake. You can also handle multiple files, or even multiple Trados Studio projects in one go.
The syntax would be very similar to the suggestion above... you would:
match (\d+)BS
replace $1NEWTEXT

Regex character required between 1st and 8th character

I am currently using this regex to limit the characters that can be used "([A-Za-z0-9_-]+)". I now have an additional requirement to require a hyphen between the 1st and 8th character. I am not sure where to begin for this and my search results have not been fruitful. Could anyone point me in a direction or give me pointers of where to get started with this request? I can usually cobble together some regex on my own through examples here and elsewhere on the web, but I can't find anything similar to these requirements.
here are some good examples of what I mean:
this-isvalid
so-isthis
Thank you in advance!

Yeah, typically when you know the requirements, use an online regex checker.
http://www.regexplanet.com/advanced/java/index.html
There are a number of them; you can google them.
You can go ahead and specify between 1 and 7 of the allowed characters (without the hyphen) and then a dash, so something like:
(^[A-Za-z0-9_]{1,7}-[A-Za-z0-9_]+)
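To try the idea from code rather than an online checker, here is a small C++ sketch. The function name has_early_hyphen is made up, and the pattern is the one above with the outer parentheses and the ^ dropped, because std::regex_match already requires the whole string to match. Be aware that, as written, the pattern rejects any second hyphen later in the string; that may or may not be what the requirement intends.

```cpp
#include <regex>
#include <string>

// Hypothetical validator: 1 to 7 word characters, one hyphen, then one
// or more word characters. Hyphens after the first one are rejected.
bool has_early_hyphen(const std::string& s) {
    static const std::regex re("[A-Za-z0-9_]{1,7}-[A-Za-z0-9_]+");
    return std::regex_match(s, re);
}
```

Both this-isvalid and so-isthis pass, while a string with no hyphen, or with more than seven characters before the hyphen, fails.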

How to remove emoticons from tweets in C++?

I'm working on a twitter sentiment analysis tool in C++. So far I get the tweets from Twitter and I process them a bit ( lowercase, remove RT, remove # and URLs).
The next step is to remove emoticons and all those special characters. How does one do that? Before you jump on me: I already looked at other similar questions, but none of them deals with C++. Mostly R, Python and PHP.
I was thinking of using regex, but I can't get it to work. I tried it for the removal of hashtags and URLs and gave up. I ended up using plain string::find and find_first_of.
Is there any library or method available to get rid of those emoticons and special characters?
Thanks

I would recommend using regular expressions for this. Now you have two options, you can either extract only the characters you are interested in (if you are working with English tweets this would probably be A-Z,a-z, numbers and maybe some symbols, depending on your needs), or you can select invalid characters (emoticons) and replace them with an empty string.
I only have experience with Qt's regular expression engine. The C++ standard library also has regex support (although I'm not sure how good it is with Unicode), and ICU provides a regex library too.
*I'd provide more links but I don't have enough reputation yet :/
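The first option above (keep only the characters you care about) can be sketched without any regex library at all, which sidesteps the Unicode question. This is a plain filtering loop, not a regex, and the function name keep_ascii_text is invented for the example. It assumes English tweets encoded as UTF-8 and keeps only ASCII letters, digits and spaces.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: copies only ASCII letters, digits and spaces.
// Every byte of a multi-byte UTF-8 sequence (which is how emoji are
// encoded) is >= 0x80 and is therefore dropped.
std::string keep_ascii_text(const std::string& tweet) {
    std::string out;
    out.reserve(tweet.size());
    for (unsigned char c : tweet) {
        if (c < 0x80 && (std::isalnum(c) || c == ' ')) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

This also strips ASCII emoticons such as :) because punctuation is not on the keep list. If some punctuation matters for the sentiment analysis, the condition would need to be widened.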

In what ways can I improve this regular expression?

I have written this regex that works, but honestly, it’s like 75% guesswork.
The goal is this: I have lots of imports in Xcode, like so:
#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
and I only want to return the categories that contain +. There are also lots of lines of code throughout the source which include + in other contexts.
Right now, this returns all of the proper lines throughout the Xcode project. But if there is one thing I’ve learned from googling and searching Stack Overflow for regex tutorials, it is that there are LOTS of different ways to do things. I’d love to see all of the different ways you guys can come up with that make it either more efficient or more bulletproof regarding potential spoofs or misses.
^\#import+.[\"]*+.(?:(?!\+).)*+.*[\"]
Thanks in advance for all of your help.
Update
Also I suppose I’ll accept the answer of whoever does this with the shortest string, without missing any possible spoofs. But again, thanks to everyone who participates in this learning experience.
Resources from answers
This is an awesome resource for practicing regex from Dan Rasmussen: RegExr

The first thing I notice is that your + characters are misplaced: t+. matches t one or more times, followed by any single character (the dot). I'm assuming you wanted to match the end of import, followed by one or more of any character: import.+
Secondly, # doesn't need to be escaped.
Here's what I came up with: ^#import\s+(.*\+.*)$
\s+ matches one or more whitespace characters, so you're guaranteed that the line actually starts with #import and not #importbutnotreally or anything else.
I'm not familiar with Xcode syntax, but the following part of the expression, (.*\+.*), simply matches any string with a + character somewhere in it. This means invalid imports may be matched, but I'm working under the assumption that you're trying to match valid code. If not, this will need to be modified to validate the import syntax as well.
P.S. To test your expression, try RegExr. You can hover over characters to check what they do.
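If you want to run the expression over source lines from code, a minimal C++ sketch could look like the following. The function name import_with_plus is hypothetical. It returns the captured part of the line, or an empty string when the line is not an import containing +.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies ^#import\s+(.*\+.*)$ to a single line
// and returns capture group 1, or "" when the line does not match.
std::string import_with_plus(const std::string& line) {
    static const std::regex re(R"(^#import\s+(.*\+.*)$)");
    std::smatch m;
    if (std::regex_match(line, m, re)) {
        return m[1].str();
    }
    return {};
}
```

For the two sample lines, the NSString+MultilineFontSize.h import is returned with its quotes, and the UIKit import yields an empty string.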

sed 's:^#import \(.*[+].*\):\1:' FILE
will display
"NSString+MultilineFontSize.h"
for your sample.

Win32 API to do wildcard string match

I am looking for a wildcard string match API (not regex match). I cannot use anything other than Win32 APIs.

There is PathMatchSpec - but handling is specialized for files, so results might not be what you expect if you need general wildcard matching.
Otherwise, you should probably go with a regex, as Pavel detailed.
[edit]
I incorrectly assumed PathMatchSpec shares the properties of FindFirstFile/FindNextFile. I ran a few tests and it doesn't. So it looks like the best candidate.

Strange that so many years passed and nobody gave you this answer:
There is a WIN32 API that does exactly what you're looking for. (I found it searching in the MSDN for "wildcard")
Its name is SymMatchString. It sits in DbgHelp.dll, which is part of the operating system.
Put a CriticalSection around the API call if your app is multithreaded!
The API that FindFirstFile uses internally for wildcard matches is probably FsRtlIsNameInExpression.
Elmü

The easiest thing would be to just convert your glob pattern to a regex, by the following rules:
* becomes .*
? becomes .
Any of \|.^$+()[]{} are escaped by preceding them with \
This is partly true.
The following rules are inferred from the behaviour of DIR in the Command Prompt on XP and later:
* is the same as *.* and becomes regex .+
? becomes regex .? unless followed by a non-wildcard
? not followed by a wildcard becomes regex .
*. means "without extension", and becomes [^.]+$
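The first, simpler rule set above (* becomes .*, ? becomes ., regex metacharacters get escaped) can be sketched in C++ as follows. The names glob_to_regex and glob_match are made up for this example, and the DIR-specific refinements are deliberately not implemented. It uses std::regex, so it only helps if the standard library is allowed alongside Win32.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: translates a glob into an ECMAScript regex using
// only the three simple rules (no DIR-style special cases).
std::string glob_to_regex(const std::string& glob) {
    std::string out;
    for (char c : glob) {
        switch (c) {
            case '*': out += ".*"; break;   // any run of characters
            case '?': out += '.';  break;   // exactly one character
            case '\\': case '|': case '.': case '^': case '$': case '+':
            case '(': case ')': case '[': case ']': case '{': case '}':
                out += '\\';                // escape regex metacharacters
                out += c;
                break;
            default: out += c;
        }
    }
    return out;
}

// Hypothetical helper: true when the whole of `text` matches the glob.
bool glob_match(const std::string& text, const std::string& glob) {
    return std::regex_match(text, std::regex(glob_to_regex(glob)));
}
```

With this, the glob *.txt turns into .*\.txt and matches notes.txt but not notesXtxt. Matching is case-sensitive here; passing std::regex::icase to the std::regex constructor would be one way to relax that.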

The FindFirstFile and FindNextFile APIs do wildcard matches, but only against filenames.
You can't use anything but Win32? What about STL or CRT? Are you using Boost?
Without the Win32 API restriction, I would recommend using the code from some open-source project. Another option would be to translate the glob into a regex, which I believe can be done with a regex replace operation.
edit: First google match is the PHP code:
http://cvs.php.net/viewvc.cgi/php-src/win32/

If you're after a simple wildcard compare (globbing), some people have written their own, including this one (which we use in our code)

What exactly is your requirement? Are you just looking to use the '*' symbol to match 0 or more characters, or are you planning on using the '?' symbol as well? If it is just '*', do you need to look for a, a, ab, ab*c, etc. type patterns? If your requirement is limited, you could easily get away with the C++ runtime library's strstr function.
