Going from regex to Word VBA (.Find)

I have this regex
<#([^\s]+).*?>\s?<a href=""(.*?)"".*?>(.*?)</a>(\s?\((Pending|Prepared)\))?
And I really need it in a VBA version for Word's .Find method (I don't need the matching groups). Here is what I have so far:
\<\#*\>*\<a href=*\>*\<\/a\>
But I can't get the last part to work. Here I'm talking about:
(\s?\((Pending|Prepared)\))?
I really hope someone can help me, as regex is not an option in this case (although I know I can use regex in VBA!)
Cheers

I don't see an OR | in the documentation (Wildcard character reference) or the examples (Putting regular expressions to work in Word), so instead I suggest splitting it into two separate searches. The Word MVPs site has a good reference on the Word Regex as well if you want more information.
[^\s] can be written in the Word style regex as [! ] (note the space), and + becomes @. It appears that neither the {n,} nor the {n,m} syntax of VBA supports an n value of 0, making ? and * hard to implement in Word. One option that the MS guys seem to use is *, which in Word is "Any string of characters". By my testing, * is lazy, meaning the pattern \<#*\> run against the string <#sometag> asdfsadfasdf > will only match <#sometag>. In addition, it can match 0 characters, for example \<\#*\> will match <#>.
So assuming that the first part is working as you expect, you could try the following two regex:
\<\#*\>*\<a href=*\>*\<\/a\>*\(Pending\)
and
\<\#*\>*\<a href=*\>*\<\/a\>*\(Prepared\)
The trouble here is that the * will match up until it hits the P of Pending or Prepared, so there could be other text in between, but it's the only way I can see of matching an optional space. If you can guarantee that the space will or will not be there, that would go a long way towards making the regex safer.
Give that a try and see if it works for you!
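The "split the alternation into two searches" idea is not specific to Word. As a rough sketch only, here it is in Python's re module (ordinary regex syntax, not Word wildcards). The negated character classes and the sample text in the usage note are my own assumptions, chosen so that a search cannot run past the end of one tag into the next, which is the hazard described above.

```python
import re

# Shared prefix in ordinary regex syntax (NOT Word wildcard syntax).
# Negated classes such as [^>]* are an assumption added here so that a
# match cannot spill across tag boundaries.
PREFIX = r'<#[^\s>]+[^>]*>\s?<a href="[^"]*"[^>]*>[^<]*</a>\s?'

# One pattern per alternative, replacing (Pending|Prepared).
PATTERNS = [
    re.compile(PREFIX + r'\(Pending\)'),
    re.compile(PREFIX + r'\(Prepared\)'),
]


def find_all(text):
    """Run each search separately and merge the hits in document order."""
    hits = []
    for pattern in PATTERNS:
        hits.extend((m.start(), m.group(0)) for m in pattern.finditer(text))
    return [matched for _, matched in sorted(hits)]
```

Calling find_all on a string such as '<#a x> <a href="u1">A</a> (Prepared) and <#b> <a href="u2">B</a> (Pending)' returns both tagged links, in the order they appear in the text.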

Related

RegEx: Searching for numbers (int, float) that are NOT part of a word

I'm hoping we have some regular expression gurus here who might be able to help me - a regex newbie - solve a problem.
I know some people will want to know some background info on this issue:
Regex Flavor: Basic Regex, being used in a Vertica Database using the REGEXP_REPLACE function.
The regex I am using is working great with one exception.
I have a rule that I'm trying to implement, related to stripping the numbers from text, where any number that is part of a word, e.g. table5, go2market, 33monroe, room222, etc. is ignored and NOT filtered.
Here is what I started with for detecting numbers:
[-+]?[0-9]*\.?[0-9]
That seems to work pretty well, including handling directly adjacent commas and parentheses for example.
But all cases where a number is part of alphabetic text are also being detected, which fails the rule that it cannot be a part of a word, and by word, I mean any alphabetic text.
So, in searching for solutions, I happened upon this regex that seems to work well detecting those specific cases where numbers appear next to, or in, any string of characters:
((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
My thought was that maybe I could add this as an INVERTED match to my original regex, to allow it to still select standalone numbers while ignoring those that were a part of a word, like so:
[-+]?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)*\.?[0-9]^((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z])[a-zA-Z0-9]*)
Unfortunately however, it breaks the original detection of standalone numbers.
:(
I'm hoping there is someone here that can spot what I'm doing wrong, and help me identify the right solution?
Thanks in advance!
According to Vertica documentation, the regex flavour seems to follow the Perl syntax. In this case you can use negative lookarounds and in particular a negative lookbehind: (?<!\w) (not preceded with a word character.)
Lookarounds are only tests and don't consume characters.
You can also use a negative lookahead to test the right part, (?!\w) (not followed by a word character), but it's simpler to use a word boundary since the pattern ends with a digit (that is also a word character):
(?<!\w)[-+]?\d*\.?\d+\b
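As a quick check of the lookbehind-plus-word-boundary pattern, here is a small sketch using Python's re module as a stand-in for Vertica's REGEXP_REPLACE (an assumption: the two engines are not identical, although this pattern only uses features common to Perl-style flavours). The sample sentence is made up for illustration.

```python
import re

# Standalone numbers only: not preceded by a word character, and ending
# at a word boundary.
STANDALONE_NUMBER = re.compile(r'(?<!\w)[-+]?\d*\.?\d+\b')


def strip_numbers(text):
    """Remove standalone numbers, leaving digits inside words untouched."""
    return STANDALONE_NUMBER.sub('', text)


def find_numbers(text):
    """List the standalone numbers the pattern detects."""
    return STANDALONE_NUMBER.findall(text)


sample = 'table5 go2market 33monroe room222 cost 42, (3.14) or -7'
print(find_numbers(sample))   # ['42', '3.14', '-7']
print(find_numbers('v1.0'))   # ['0'], the awkward case discussed next
```

Words such as table5 and 33monroe are left alone, while a string like v1.0 still yields a partial match on the trailing 0.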
In the worst case, if you have something like v1.0 in your string and you want to avoid it, you can try to use the backtracking control verbs (*SKIP) and (*FAIL). (*FAIL) forces the pattern to fail and (*SKIP) skips all the already matched positions before it. I hope Vertica supports these Perl regex features.
Something like:
\p{L}+[-+]?\d*\.?\d+(*SKIP)(*FAIL)|[-+]?\d*\.?\d+(*SKIP)(?!\p{L})

RegEx Expression for Eclipse that searches for all items that have not been dealt with

To help stop SQL Injection attacks, I am going through about 2000 parameter requests in my code to validate them. I validate them by determining what type of value (e.g. integer, double) they should return and then applying a function to them to sanitize the value.
Any requests I have dealt with look like this
*SecurityIssues.*(request.getParameter
where * signifies any number of characters on the same line.
What RegExp expression can I use in the Eclipse search (CTRL+H) which will help me search for all the ones I have not yet dealt with, i.e. all the times that the text request.getParameter appears when it is not preceded by the word SecurityIssues?
Examples for matches
The regular expression should match each of the following e.g.
int companyNo = StringFunctions.StringToInt(request.getParameter("COMPANY_NO"))
double percentage = StringFunctions.StringToDouble(request.getParameter("MARKETSHARE"))
int c = request.getParameter("DUMMY")
But should not match:
int companyNo = SecurityIssues.StringToIntCompany(request.getParameter("COMPANY_NO"))
With inspiration and the links provided by @michaeak (thank you), as well as testing in https://regex101.com/ I appear to have found the answer:
^((?!SecurityIssues).)*(request\.getParameter)
The advantage of this answer is that I can blacklist the word SecurityIssues, as opposed to having to whitelist the formats that I do want.
Note, that it is relatively slow, and also slowed down my computer a lot when performing the search.
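To see the blacklist pattern at work outside Eclipse, here is a minimal sketch in Python's re module (an assumption: Eclipse uses Java's regex engine, but this pattern relies only on a negative lookahead, which both support). The lines are the examples from the question.

```python
import re

# Every character before request.getParameter must not begin the word
# "SecurityIssues" (the lookahead is checked at each position).
UNHANDLED = re.compile(r'^((?!SecurityIssues).)*(request\.getParameter)')

LINES = [
    'int companyNo = StringFunctions.StringToInt(request.getParameter("COMPANY_NO"))',
    'double percentage = StringFunctions.StringToDouble(request.getParameter("MARKETSHARE"))',
    'int c = request.getParameter("DUMMY")',
    'int companyNo = SecurityIssues.StringToIntCompany(request.getParameter("COMPANY_NO"))',
]


def is_unhandled(line):
    """True if the line calls request.getParameter without SecurityIssues before it."""
    return UNHANDLED.search(line) is not None


for line in LINES:
    print(is_unhandled(line), line)
```

The first three lines are reported as unhandled and the SecurityIssues line is not. The per-character lookahead is also a plausible reason for the slowness noted above.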
Try e.g.
=\s*?((?!SecurityIssues).)*?(request\.getParameter)\(
Notes
Parentheses ( and ) are special characters used for group matching. To match them literally, they need to be escaped with \.
.* will match anything, including characters that you don't want it to match. Making it reluctant with .*? means it matches as few characters as possible, which can be helpful if other items need to match after the wildcard.
There is a tutorial at https://docs.oracle.com/javase/tutorial/essential/regex/index.html , I think all of these should be available in eclipse. You can then deal with generic replacement also.
Problem
From reading Regular expression that doesn't contain certain string and Regular expression to match a line that doesn't contain a word? it seems quite difficult to create a regex matching anything but not to contain a certain word.

Regular Expression: Filenames

Extremely new to this and have been trying to figure this out on my own, but no luck.
It seems simple. I have files that are named either starting with L or P, followed by 6 numbers. I need to have 2 expressions, one that only reads files starting with L and one that only reads files starting with P.
I have tried using derivatives of ^[K-M], ^\L.*
No luck so far. Hoping someone can offer a suggestion.
Thanks for your time!
Try ^P\d{6} and ^L\d{6}. The ^ says start at the beginning of the string. The \d{6} matches 6 digits.
If at some point you wanted to match both in one go, you could do ^[LP]\d{6}. The [LP] says match one of L or P.
If the above doesn't work, you might be working with a more limited regex implementation. You could try ^P\d\d\d\d\d\d and ^L\d\d\d\d\d\d to get the same results.
If that doesn't work, you could try ^P[0-9][0-9][0-9][0-9][0-9][0-9] and ^L[0-9][0-9][0-9][0-9][0-9][0-9] which should work on all regex implementations. The \d is just shorthand for [0-9] anyway.
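A short sketch of the two expressions in use, written in Python since the question does not say which tool is reading the files. The file names are hypothetical.

```python
import re

# One expression per prefix, each followed by exactly six digits.
L_FILES = re.compile(r'^L\d{6}')
P_FILES = re.compile(r'^P\d{6}')

# Hypothetical file names for illustration.
names = ['L123456.txt', 'P654321.txt', 'K111111.txt', 'L12345.txt']

l_names = [n for n in names if L_FILES.match(n)]
p_names = [n for n in names if P_FILES.match(n)]

print(l_names)  # ['L123456.txt']
print(p_names)  # ['P654321.txt']
```

Note that L12345.txt is rejected because it has only five digits after the L.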
Seth's answer is correct.
If it doesn't matter what comes after the 'P' or 'L' you could also just use ^P and ^L.
In the future, you should try testing how regexes match your input strings using a regex tester such as RegexPal or Regular Expression Editor.

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously if I'm here, the invalid ones are accepted by the regex. The goal of this regex is accept either 1 to 4 numbers with an optional letter, or a single letter or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
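The effect of the missing anchors can be checked with a small sketch, here in Python as an assumed stand-in for whatever language the textbox validation runs in. The inputs are the ones listed in the question.

```python
import re

# Original pattern: only the empty-string alternative is anchored.
ORIGINAL = re.compile(r'([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)')
# Anchored pattern from the answer above.
ANCHORED = re.compile(r'^[0-9]{0,4}[A-Z]?$')

VALID = ['1234A', 'Z', '']
INVALID = ['A1234', 'fhfdsahds527523832dvhsfdg']


def accepts(pattern, text):
    """True if the pattern finds a match anywhere in the text."""
    return pattern.search(text) is not None


for text in VALID + INVALID:
    print(repr(text), accepts(ORIGINAL, text), accepts(ANCHORED, text))
```

The original accepts every input because it only needs one digit run or one capital letter somewhere in the string. The anchored version accepts only the three valid inputs.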
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This looks for a digit, 0 to 4 times, at the beginning of the string
[A-Za-z] = This looks for a single letter (either case)
(?!.*[0-9]) = This only allows the letter if there are no digits anywhere after it.
I haven't quite figured out how to validate against a null (empty) value, but that might be easier done using tools from whatever language you are using. Something along this logic:
if String doesn't equal $null then check the regex
Something along those lines, just adjusted for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments

Regular expression greedy match not working as expected

I have a very basic regular expression and I just can't figure out why it's not working, so the question has two parts: why does my current version not work, and what is the correct expression?
Rules are pretty simple:
Must have minimum 3 characters.
If a % character is the first character must be a minimum of 4 characters.
So the following cases should work out as follows:
AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
^ - Start of string
%? - Greedy check for 0 or 1 % character
\S{3} - 3 other characters that are not white space
The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?
Someone please show me the light :)
Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2})
It was a two-part answer, though, and Alan's really made me understand why. I didn't use his version, ^(?>%?)\S{3}, because it worked but not in the JavaScript implementation. Both are great answers and a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.
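The backtracking behaviour and the alternation fix can both be checked against the test cases from the question. This is a sketch in Python's re module; the patterns use no flavour-specific features, so the results should carry over to other engines.

```python
import re

# Pattern from the question: %? can give the percent sign back.
ORIGINAL = re.compile(r'^%?\S{3}')
# Alternation that spells out both cases explicitly.
FIXED = re.compile(r'^(%\S{3}|[^%\s]\S{2})')

# Expected outcome for each test string from the question.
CASES = {
    'AB': False, 'ABC': True, 'ABCDEFG': True, '%': False,
    '%AB': False, '%ABC': True, '%ABCDEFG': True, '%%AB': True,
}


def passes(pattern, text):
    """True if the pattern matches at the start of the text."""
    return pattern.search(text) is not None


def mismatches(pattern):
    """Test strings where the pattern disagrees with the expected result."""
    return [t for t, expected in CASES.items() if passes(pattern, t) != expected]


print(mismatches(ORIGINAL))  # ['%AB']
print(mismatches(FIXED))     # []
```

Only %AB trips up the original pattern: %? first takes the percent sign, \S{3} then fails on the two remaining characters, and the engine backtracks so that \S{3} matches %AB instead.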
I always love to look at RE questions to see how much time people spend on them to "Save time"
str.len() >= (str[0] == '%' ? 4 : 3)
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
^(%\S{3,}|[^%\s]\S{2,})
with the regex option "^ and $ match at line breaks" on.