Recommended built-in WinXP language support for UTF-8 regex - regex

It's my first foray into UTF-8 land. I'm an IIS Admin, so I've never gotten to touch this professionally. I'm trying to help a missionary who's translated the bible into an African language and now needs to do some global matching against large UTF-8 files. We're specifically matching for accented characters.
We're using older XP computers here, so I cobbled together a quick script in VBS knowing the language would be installed on their boxes already. After playing around for a few minutes, it appears VBS regexes handle UTF-8 by breaking each character up into 2 characters. To match a single â, my pattern is \u00c3\u00a2. Shouldn't this be \u00e2?
Since I'm out of my depth I thought I'd seek a little guidance. It almost looks like UTF-8 simply requires this kind of double matching (and UTF-8 is required.) Can someone tell me into which box canyon I'm coding? :-)
Downloading and installing Perl or Java is probably outside this project's bandwidth and technical know-how. The tool should be built in. MS Office is installed, so VBA is an option if there's some library that offers specific support. JavaScript is installed as well, though I don't know what versions.
Thanks

Unless you need to match two or more consecutive dots (e.g. you have .. or ... in your regex but not .*) you can use any ASCII regex library on UTF-8 and expect it to work correctly.
The trick is to know what you are looking for. UTF-8 does encode each non-ASCII character as two or more bytes, so write your regex in whatever you are familiar with, convert it to UTF-8, and it will work unless it contains "..".
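To make the byte breakup concrete, here is a minimal sketch in Python (not VBS) of what is probably happening. It assumes the script reads the file with a single-byte ANSI code page rather than as UTF-8; Latin-1 is used here as a stand-in for that code page, and the variable names are just for illustration.

```python
import re

text = "\u00e2"                    # the character â as a single code point
raw = text.encode("utf-8")         # its UTF-8 encoding: the two bytes C3 A2

# Assumption: the tool decodes each byte as one character (a single-byte code page).
misread = raw.decode("latin-1")    # two characters: U+00C3 followed by U+00A2

# Decoding the same bytes as UTF-8 gives back the single character.
decoded = raw.decode("utf-8")

# The two-character pattern only matches the misread text,
# and the one-character pattern only matches the properly decoded text.
two_char_hit = re.search("\u00c3\u00a2", misread) is not None
one_char_hit = re.search("\u00e2", decoded) is not None
```

So \u00c3\u00a2 is what â looks like when its UTF-8 bytes are treated as two separate characters, and \u00e2 is what you would match if the text were decoded as UTF-8 first.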

What about PowerShell? It uses the .NET regular expressions library, and that is one of the best libraries available, especially for Unicode support.

Related

Python3 unicode regex

I'm not a native English speaker, but it happened that I've never written any regex for any non-ASCII text in my life, so I'm confused with a seemingly trivial case.
I have a large dictionary scraped from a website by a robot. All HTML tags are removed. My goal is to remove most carry-over hyphens. The idea is that >90% of the problematic punctuation has the form lowercase-lowercase, so it could be caught by a regex like '\p{Ll}-\p{Ll}'. This should be able to capture Russian lowercase chars, при-мер for example.
However, it seems like \p isn't supported by Python's re engine. I'm not sure which alternative regex engine I'm supposed to choose, because googling doesn't show any information relevant to Python 3. I thought Python 3 was much more advanced when it comes to i18n and Unicode, and that it was supposed to have Unicode character class support.
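One possible workaround with only the standard library is sketched below. It finds a hyphen between two letters with re and then checks the Unicode category of each neighbour with unicodedata, which is roughly what \p{Ll}-\p{Ll} would express. The function name is made up for illustration. The third-party regex module on PyPI does support \p{Ll} directly, if installing a package is an option.

```python
import re
import unicodedata

# A hyphen with a letter on both sides ([^\W\d_] is "word character
# that is neither a digit nor an underscore").
_HYPHEN = re.compile(r"(?<=[^\W\d_])-(?=[^\W\d_])")

def _is_lower(ch):
    # "Ll" is the Unicode general category for lowercase letters.
    return unicodedata.category(ch) == "Ll"

def remove_carryover_hyphens(text):
    """Drop hyphens that sit between two lowercase letters (hypothetical helper)."""
    def repl(m):
        before = m.string[m.start() - 1]
        after = m.string[m.end()]
        return "" if _is_lower(before) and _is_lower(after) else "-"
    return _HYPHEN.sub(repl, text)
```

For example, remove_carryover_hyphens("при-мер") joins the two halves, while a hyphen followed by an uppercase letter is left alone.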

How to remove emoticons from tweets in C++?

I'm working on a Twitter sentiment analysis tool in C++. So far I get the tweets from Twitter and I process them a bit (lowercase, remove RT, remove # and URLs).
The next step is to remove emoticons and all those special characters. How does one do that? Before you jump on me: I already looked at other similar questions, but none of them deals with C++. Mostly R, Python and PHP.
I was thinking of using regex, but I can't get it to work. I tried it for removing hashtags and URLs and gave up. I ended up using plain string::find and find_first_of.
Is there any library or method available to get rid of those emoticons and special stuff?
Thanks
I would recommend using regular expressions for this. You have two options: you can either extract only the characters you are interested in (if you are working with English tweets this would probably be A-Z, a-z, numbers and maybe some symbols, depending on your needs), or you can select the invalid characters (emoticons) and replace them with an empty string.
I only have experience with Qt's regular expression engine, but the C++ standard library has regex support (although I'm not sure how good it is with Unicode), and ICU provides a regex library too.
*I'd provide more links but I don't have enough reputation yet :/
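The two options above can be sketched as follows. The sketch is in Python rather than C++ purely to keep it short, and the idea carries over to any regex library that understands Unicode code points. The allowed punctuation and the emoji code-point ranges are assumptions and certainly not a complete list; the function names are made up for illustration.

```python
import re

# Option 1: keep only the characters you are interested in.
# Assumption: ASCII letters, digits, whitespace and a little punctuation.
_NOT_ALLOWED = re.compile(r"[^A-Za-z0-9\s.,!?'-]")

# Option 2: select the unwanted characters and remove them.
# Assumption: a few common emoji/symbol blocks plus the variation
# selector and zero-width joiner; this is not exhaustive.
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")

def keep_basic(text):
    """Remove everything that is not on the allow-list."""
    return _NOT_ALLOWED.sub("", text)

def strip_emoji(text):
    """Remove only characters in the listed emoji ranges."""
    return _EMOJI.sub("", text)
```

Note the difference: keep_basic also throws away accented letters, while strip_emoji leaves them in place, so which one fits depends on the language of the tweets.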

Futile attempt to run regular expression find/replace in MS Word using groups on Mac

According to the received wisdom MS Word (more or less) supports find/replace with use of regular expressions. I have a simple regular expression:
^(C[[:alpha:]]*)(\d*)(.*)$
That I'm running on the data:
indSIMDdecile
CSdeccrim12006
CSdeccrim12006
CSdeccrim12009
CSdeccrim12009
CSdeccrim12012
CSdeccrim12012
CSdeceduc12004
CSdeceduc12004
CSdeceduc12006
CSdeceduc12006
CSdeceduc12009
CSdeceduc12009
CSdeceduc12012
CSdeceduc12012
CSdecemp12004.x
I'm interested in returning the first word prior to the digit 1, which works as demonstrated on regex101 here.
Problem
I would like to do the same in MS Word (v. 15.18 on Mac). After getting error messages for supplying unsupported syntax, I learned that MS Word does not support the full regex syntax. I simplified my expression to something along the lines of:
but the search does not find any strings and nothing gets replaced. Hence my question: is it possible to use regex in MS Word on Mac?
The linked help website hints that something like that should be possible, but so far no luck.
The simple answer is "no", if you mean "Does Mac Word have a UI feature that lets you use one of the modern dialects of regex?" Word's Find/Replace only supports its own Regular Expression syntax.
In this case, I think the following will give you what you need:
Find with wildcards:
(C)([!1]@)(1)
and a replace by
\1
(If you also had to find "C1", then that doesn't work, and unfortunately nor does
(C)([!1]{0,})(1)
because Word does not allow 0 in the {,} pattern)
But there is a problem with "@", Word's wildcard for "one or more of the previous item". If the text the "@" is looking for is long, the find/replace may fail. There is supposed to be a 255 limit, but it seems rather more arbitrary than that. (I have long suspected a buffer overrun type error in the Word code, but perhaps there is a simpler explanation).
If you mean, "is there any way to use modern regex with Word?", then the answer is "Yes, but you only get to operate on a copy of the text in the document. You will need to create your own code to do the 'replace' part of the find/replace, and that means that you would have to deal with any of the issues such as preserving formatting that Word's built-in find/replace might get right for you."
On the Windows side, people who want a better regex than Word's often use VBScript's RegExp object because it is easily used from VBA. VBA itself only really has the "Like" operator, which also only has fairly crude pattern matching abilities. I think there are examples of VBScript RegExp use on StackOverflow. On the Mac side, you would either have to use VBA and "shell out" to one of the built-in Mac/Unix utilities to do your finding (and perhaps replacing), or perhaps use AppleScript or JavaScript application scripting to do it. As far as I can remember, AppleScript does not have a 'modern' regex built in either.
[As a bit of history, Word's "regular expressions" were I think introduced in Word 6, around 1993, at a time when most dialects of regex were much more crude than they are today. I don't think Word's version has moved along much at all - it probably added some Unicode support at some point, but that's probably about it. I assume that people using modern regex don't regard it as regex at all, and I personally prefer not to call Word's Regular Expressions 'regex' precisely for that reason.]
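As an illustration of "operating on a copy of the text" outside Word, here is a minimal Python sketch that applies the asker's expression to plain text copied out of the document. Python's re module does not support the POSIX class [[:alpha:]], so [A-Za-z] is substituted here as an assumption; the function names are made up for illustration.

```python
import re

# Assumption: [A-Za-z] stands in for [[:alpha:]] from the original expression.
PATTERN = re.compile(r"^(C[A-Za-z]*)(\d*)(.*)$")

def first_word(line):
    """Return the leading word (group 1), or None if the line does not match."""
    m = PATTERN.match(line)
    return m.group(1) if m else None

def replace_all(text):
    """Replace each matching line with its first group; other lines are kept as-is."""
    return "\n".join(PATTERN.sub(r"\1", line) for line in text.splitlines())
```

With the sample data, a line such as CSdeccrim12006 is reduced to CSdeccrim, while indSIMDdecile is left unchanged because it does not start with C. Any formatting in the Word document is lost this way, as noted above.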

RegEx with Excel VBA on Mac

I need to use regEx with Excel VBA. I'm using Mac OS 10.10 and Office 2011. So there is no DLL file I can use.
What is there to do here?
I read that I have to bind an AppleScript. How is this done, and what does the script need to contain?
You can use VBA's Like operator. It's a very limited regex tester only.
Microsoft Word has its standard wildcards, plus, if you tick Use Wildcards, it is a regex engine (it can also find words that sound the same and words with the same root). So use Word rather than VBScript's RegExp.
Just record a Find and Replace in Word and you'll get most of the program written for you that you'll just need to adapt.
Natively, you can't really - AppleScript isn't actually that good for this kind of thing (where VBA is concerned)
There are other libraries that you can install and use to allow support for things like regular expressions on Mac OS - the one I've seen used the most is Satimage although I've not personally had to use it (yet) so can't vouch for it myself:
http://www.satimage.fr/software/en/downloads/downloads_companion_osaxen.html
I'm working on this problem too and I think Advanced Filters may be your answer if you want to do it in Excel without adding an external library. You can access it through VBA and set up a hidden sheet somewhere to stash your filters.
https://searchengineland.com/advanced-filters-excels-amazing-alternative-to-regex-143680
And you can see what it looks like in VBA here:
https://www.contextures.com/exceladvancedfiltervba.html
However, Advanced Filters does have some notable shortcomings, like the inability to distinguish a digit from a letter. The LIKE command mentioned earlier DOES have this ability however - so you could combine them to overcome that limitation.
Hopefully you and I can both solve this problem using these tools...!

Regex for comparing special Characters in (C)Strings

I have an MFC project where I need to read and compare various configuration strings from (xml-)files.
The problem is that they could contain one or multiple special characters like STX, ETX, LF, CR ... and so on.
An idea is using regex. I could simply write the full regex pattern in the files and compare them with a match function.
As I looked this up via google and msdn, there were two different(?) regex frameworks for MFC but I don't see any difference between them nor do I see if they can solve my problem, meaning handle special characters.
Do any of you have an experience with those frameworks? Can you recommend one or can you think of another solution for this problem?
Many thanks in advance.
I recommend std::regex or boost::regex over non-standard alternatives. Both are also able to handle special characters.