regex interval with possible characters before and after number VBA - regex

I'm trying to produce a regular expression that can identify a number within an interval in a string in VBA. Sometimes this number has characters around it, other times not (non-consistent notation from a supplier). The expression should identify that 1413 in the three examples below are within the number range 500-2000 (or alternatively that it's not in the number range 0-50 or 51-499).
Example:
Test 12/2014. Tot.flow:1413 m3 or
Test 12/2014. Tot.flow:1413m3 or
Test 12/2014. Tot.flow: 1413
These strings have some identifiers:
there will always be a colon before the number
there may be a white space between the colon and the number
there may be a white space between the number and the m3
m3 is not necessarily always present, and if not, the number is at the end of the string
So far what I have in my attempt to make an regex that find the number range is ([5-9][0-9][0-9]|[1]\d{3}|2000), but this matches all three digit numbers as well (2001 gives a match on 200). However, I understand that I'm missing out on a couple of concepts to achieve the ultimate goal here. I guess my problems are as following:
How to start the interval at something not being zero (found lots of questions on intervals starting on zero)
How to take into account the variations in notation both for flow: and m3?
I'm only interested in checking that the number lies within the number range. This is driving me bonkers, all help is highly appreciated!

You can just extract the number with regExp.Replace() using the following regex:
^.*:\s*(\d+).*$
The replacement part is $1.
Then, use usual number comparison to check whether the value is in the expected range (e.g. If CLng(result) > 499 And If CLng(result) < 2001 Then ...).
Test macro:
Dim re As RegExp, tgt As String, src As String
Set re = New RegExp
With re
.pattern = "^.*:\s*(\d+).*$"
.Global = False
End With
src = "Test 12/2014. Tot.flow: 1413"
tgt = re.Replace(src, "$1")
MsgBox (CLng(tgt) > 499 And CLng(tgt) < 2001)

You can try with:
:\s?([5-9]\d\d|1\d{3}|2000)\s?(m3|\n)
also, your regex ([5-9][0-9][0-9]|[1]\d{3}|2000) in my opinion is fine, it should not match numbers >500 and 2000<.

Related

RegEx for matching a string after a string up to a comma

Here is a sample string.
"BLAH, blah, going to the store &^5, light Version 12.7(2)L6, anyway
plus other stuff Version 3.3.4.6. Then goes on an on for several lines..."
I want to capture only the first version number without including the word version if possible but not include the periods and parenthesis. The result would stop when it encounters a comma. The result would be:
"1272L6"
I don't want it to include other instances of version in the text. Can this be done?
I've tried (?<=version)[^,]* I know it does not address removing the periods and parens and does not address the subsequent versions.
This exact RegEx, maybe not the best solution, but it might help you to get 1272L6:
([0-9]{2})\.([0-9]{1})\(([0-9]{1})\)([A-Z]{1}[0-9]{1})
It creates four groups (where $1$2$3$4 is your target 1272L6) and passes ., ) and (.
You might change {1} to other numbers of repetitions, such as {1,2}.
Assuming the version number is fixed on format but not on the specific digits or letters, you could do this.
String s = "this is a test 12.7(2)L6, 13.7(2)L6, 14.7(2)L6";
String reg = "(\\d\\d\\.\\d\\(\\d\\)[A-Z]\\d),";
Matcher m = Pattern.compile(reg).matcher(s);
if (m.find()) { // should only find first one
System.out.println(m.group(1).replaceAll("[.()]", ""));
}

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space where ever there is any of the characters
-:\*_/;, present for example JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should beJET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some words exceptions like a/c w/d etc. \D conditions given to avoid date/time values getting separated, but this created an issue, the numbers followed by the above mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
Dim newString As String: newString = reg.Replace(oldString, "$1 ")
formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)

A regex for maximal periodic substrings

This is a follow up to A regex to detect periodic strings .
A period p of a string w is any positive integer p such that w[i]=w[i+p]
whenever both sides of this equation are defined. Let per(w) denote
the size of the smallest period of w . We say that a string w is
periodic iff per(w) <= |w|/2.
So informally a periodic string is just a string that is made up from a another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.
For, example consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2]=x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.
As more examples, abcabca, ababababa and abcabcabc are also periodic.
#horcruz, amongst others, gave a very nice regex to recognize a periodic string. It is
\b(\w*)(\w+\1)\2+\b
I would like to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.
Formally a substring w is a maximal periodic substring if it is periodic and neither w[i-1] = w[i-1+p] nor w[j+1] = w[j+1-p]. Informally, the "run" cannot be contained in a larger "run"
with the same period.
The four maximal periodic substrings (runs) of string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.
The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs):
T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.
The string T = atatbatatb contains the following three runs. They are:
T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.
Is there a regex (with backreferences) that will capture all maximal
periodic substrings?
I don't really mind which flavor of regex but if it makes a difference, anything that the Python module re supports. However I would even be happy with PCRE if that makes the problem solvable.
(This question is partly copied from https://codegolf.stackexchange.com/questions/84592/compute-the-maximum-number-of-runs-possible-for-as-large-a-string-as-possible . )
Let's extend the regex version to the very powerful https://pypi.python.org/pypi/regex . This supports variable length lookbehinds for example.
This should do it, using Python's re module:
(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))
Fiddle: https://regex101.com/r/aA9uJ0/2
Notes:
You must precede the string being scanned by a dummy character; the # in the fiddle. If that is a problem, it should be possible to work around it in the regex.
Get captured group 2 from each match to get the collection of maximal periodic substrings.
Haven't tried it with longer strings; performance may be an issue.
Explanation:
(?<=(.)) - look-behind to the character preceding the maximal periodic substring; captured as group 1
(?=...) - look-ahead, to ensure overlapping patterns are matched; see How to find overlapping matches with a regexp?
(...) - captures the maximal periodic substring (group 2)
(\w*)(\w*...\w\3)\4+ - #horcruz's regex, as proposed by OP
(?!\1) - negative look-ahead to group 1 to ensure the periodic substring is maximal
As pointed out by #ClasG, the result of my regex may be incomplete. This happens when two runs start at the same offset. Examples:
aabaab has 3 runs: aabaab, aa and aa. The first two runs start at the same offset. My regex will fail to return the shortest one.
atatbatatb has 3 runs: atatbatatb, atat, atat. Same problem here; my regex will only return the first and third run.
This may well be impossible to solve within the regex. As far as I know, there is no regex engine that is capable of returning two different matches that start at the same offset.
I see two possible solutions:
Ignore the missing runs. If I am not mistaken, then they are always duplicates; an identical run will follow within the same encapsulating run.
Do some postprocessing on the result. For every run found (let's call this X), scan earlier runs trying to find one that starts with the same character sequence (let's call this Y). When found (and not already 'used'), add an entry with the same character sequence as X, but the offset of Y.
I think it is not possible. Regular expressions cannot do complex nondeterministic jobs, even with backreferences. You need an algorithm for this.
This kind of depends on your input criteria... There is no infinite string of characters.. using back references you will be able to create a suitable representation of the last amount of occurrences of the pattern you wish to match.
\
Personally I would define buckets of length of input and then fill them.
I would then use automata to find patterns in the buckets and then finally coalesce them into larger patterns.
It's not how fast the RegEx is going to be in this case it's how fast you are going to be able to recognize a pattern and eliminate the invalid criterion.

Regex for pattern matching last instance with a range of values

I need to split a string that has a character, that is repeated but only split the last 3 instances and keep the 0 to any number of instances before those 3 instances of the character.
ex1: "Hi####there" after regex (and splitting) "Hi#","there"
ex2: "Hi###there" after regex (and splitting) "Hi","there"
Splitting on #{3} does not give the result I want: "Hi","#there"
edit:
I simplified my quesiton a bit; serialization and the character I am using is not relevant. I'm just interested in the regex that will get me the last x number in a list of y number of a repetitive list of characters. (#{3})(\w+) results in only returning hi# and the rest of the string is lost.
This is done in Javascript, but works
"Hi####there".split(/###(?=[^#])/); // Output = > ["Hi#", "there"]
"Hi###there".split(/###(?=[^#])/); // Output = > ["Hi", "there"]
// Tested on win7 with Chrome 45+
(?=[^#]) ... Lookahead to check if there is no # after the final/splitter ###

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(#"(\d+)");
if (regex.IsMatch(m_Key)) {
string value = "";
int length;
var matches = regex.Matches(m_Key);
foreach (var match in matches) {
if (match.Length >= length) {
value = match.Value;
length = match.Length;
}
}
var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
m_KeyCounter = value;
if (split.Length >= 1) m_KeyPrefix = split(0);
if (split.Length >= 2) m_KeySuffix = split(1);
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
You're input isn't regular so, a regex won't do. I would iterate over the all groups of digits via (\d+) and find the longest and then build a new regex in the form of (.*)<number>(.*) to find your prefix/suffix.
Or if you're comfortable with string operations you can probably just find the start and end of the target group and use substr to find the pre/suf fix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind, that both, NFA and DFA, of almost every Regexp engine are greedy, this means, that a (\d+) will always find the longest match, when it "stumbles" over it.
Now, what I can get from your example, is you always need middle portion of a number, try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
The now look at variables $1, $2, $3. Not all of them will exist: if there are all three of them, $2 will hold your number in question, the other vars, parts of the prefix. when one of the prefixes is missing, only variable $1 and $2 will be set, you have to see for yourself, which one is the integer. If both prefix and suffix are missing, $1 will hold the number.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the modifier /gis present, you can loop through all available combinations, that the machine finds, you can then simply take the one you like most or something.
This example is in PCRE, but I'm sure .NET has a compatible mode.