Regexp: Keyword followed by value to extract - regex

I've had this question a couple of times before, and I still haven't found a good answer.
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
It's odd, because if my pattern-to-match-but-not-into-output ("failures = ") came after the thing I want to extract ([0-9]+), there would be a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T

pattern(?=expression) is a regex positive lookahead. What you are looking for is a positive lookbehind, which is written (?<=expression)pattern. This feature is not supported by all regex flavors, so it depends on which language you are using.
More information is at regular-expressions.info; for a comparison of lookaround support across flavors, scroll about two-thirds of the way down the page.
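As a rough illustration of the lookbehind form, here is a minimal sketch in Python (an assumption, since the question does not name a language). Python's re module supports fixed-width lookbehind, which is enough here. The capture-group variant is shown alongside because it works even in flavors without lookbehind.

```python
import re

# Sample text copied from the question.
output = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3"""

# Positive lookbehind: the match itself contains only the digits.
lookbehind = re.search(r"(?<=failures = )[0-9]+", output)
failures_a = int(lookbehind.group(0))

# Capture group: match the keyword too, but read only group 1.
captured = re.search(r"failures = ([0-9]+)", output)
failures_b = int(captured.group(1))

print(failures_a, failures_b)
```

Both variants read the same number; the capture group is not a trim-by-length workaround, because the engine hands back the sub-match directly.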

If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"
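The same split-on-"=" idea can be sketched in Python. The function name count_failures is hypothetical, and the sketch assumes the output always has the "label = number" shape shown in the question.

```python
def count_failures(output):
    """Return the number after '=' on the first line mentioning 'failure'."""
    for line in output.splitlines():
        if "failure" in line:
            # int() tolerates the surrounding whitespace left by the split.
            return int(line.split("=")[-1])
    return None  # no such line found


sample = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3"""

print(count_failures(sample))
```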

Related

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space wherever any of the characters -:\*_/;, is present.
For example, JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should become JET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some words exceptions like a/c w/d etc. \D conditions given to avoid date/time values getting separated, but this created an issue, the numbers followed by the above mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
    Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
    Dim newString As String: newString = reg.Replace(oldString, "$1 ")
    formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
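The three replacement passes could look roughly like this in Python (not the VBA of the question). The date and time shapes, the placeholder strings, and the function names are all assumptions made for illustration; real data may need looser or stricter patterns.

```python
import re

# Placeholders that should never occur in real text (assumed safe here).
SLASH = "\x00SLASH\x00"
COLON = "\x00COLON\x00"

# Assumed shapes: dd/dd/dd(dd) dates, d(d):dd(:dd) times, and the
# word exceptions listed in the question.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
TIME = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
EXCEPTIONS = re.compile(r"\b(?:a/c|w/d|m/s|s/w|m/o)\b")


def protect(text):
    """Pass 1: hide / and : inside dates, times and exception words."""
    def mask(m):
        return m.group(0).replace("/", SLASH).replace(":", COLON)
    for rx in (DATE, TIME, EXCEPTIONS):
        text = rx.sub(mask, text)
    return text


def add_spaces(text):
    text = protect(text)
    # Pass 2: add a space after each special character not already followed by one.
    text = re.sub(r"([-:*_\\/;,])(?! )", lambda m: m.group(1) + " ", text)
    # Pass 3: restore the hidden characters.
    return text.replace(SLASH, "/").replace(COLON, ":")


print(add_spaces(r"JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c"))
```

Each pass stays simple on its own, which is the point of splitting the work instead of packing every case into one expression.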
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} form a capture group, which puts whatever was captured into a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion: it peeks at the next character to check that it's not a digit (otherwise the match could be the first 5 digits of a longer number). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
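To see the capture group and the negative lookahead at work outside VB.Net, here is a small Python sketch of the same pattern. The helper name tab_before_zip is hypothetical; the sample lines are taken from, or modeled on, the question.

```python
import re

# A space, five digits captured as group 1, then "not another digit".
ZIP = re.compile(r" (\d{5})(?!\d)")


def tab_before_zip(line):
    """Replace the space in front of a 5-digit run with a tab."""
    return ZIP.sub(lambda m: "\t" + m.group(1), line)


print(repr(tab_before_zip("1234 Happy Lane Somewhere, St 12345")))
print(repr(tab_before_zip("CITY ST 99999 999/999-9999")))
```

The phone number is left alone because none of its digit runs is five long and preceded by a space.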

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings, though (the first group), are largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full-text index, and it must also match according to word position; e.g., it should short-circuit if the first word of the input is not found among the first words of the patterns. It should also handle the match-anything case, i.e. %s in the pattern.
One solution is to put the patterns in an in-memory database (e.g. Redis) and build a full-text index on it. (This will not match according to word position.) You should still be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in-memory database. Also note that you are looking for the closest match: one or more words will not match, and the pattern with the highest number of matching words is the one you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
  "Hi": {"there": {"%s": null}},
  "What": {"a": {"lovely": {"day": {"today": null}}}},
  "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
  "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index descends recursively according to word position. So search for the first word; if found, search for the next word within the object returned by the first, and so on. Identical words at a given level share a single key. You should also handle the match-anything case (%s). This should be blindingly fast in memory.
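A word-position index of this kind can be sketched in Python as nested dictionaries. This is only an illustration under stated assumptions: punctuation is ignored by the hypothetical tokenize helper, %s stands for exactly one word, and the names build_index and lookup are invented for the example.

```python
import re

WILDCARD = "%s"
END = None  # key that marks "a pattern ends here" and stores that pattern


def tokenize(text):
    # Keep %s as its own token; otherwise keep runs of word characters
    # and apostrophes, dropping other punctuation.
    return re.findall(r"%s|[\w']+", text)


def build_index(patterns):
    root = {}
    for pattern in patterns:
        node = root
        for token in tokenize(pattern):
            node = node.setdefault(token, {})
        node[END] = pattern
    return root


def lookup(index, text):
    def walk(node, tokens):
        if not tokens:
            return node.get(END)
        head, rest = tokens[0], tokens[1:]
        # Try the literal word first, then fall back to the wildcard.
        for key in (head, WILDCARD):
            child = node.get(key)
            if child is not None:
                found = walk(child, rest)
                if found is not None:
                    return found
        return None
    return walk(index, tokenize(text))


index = build_index([
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
])
print(lookup(index, "Will you be meeting Linda today, John?"))
print(lookup(index, "Goobledigoock"))
```

Each lookup touches at most a couple of dictionary entries per input word, so the cost depends on the sentence length rather than on the number of patterns.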
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re

def create_random_sentence():
    # Build a filler pattern of 4-10 random lowercase words.
    nwords = random.randint(4, 10)
    sentence = []
    for i in range(nwords):
        sentence.append("".join(random.choice(string.ascii_lowercase) for x in range(random.randint(3, 10))))
    ret = " ".join(sentence)
    print(ret)
    return ret

# The four real patterns, with the literal . and ? escaped.
patterns = [r"Hi there, [a-zA-Z]+\.",
            r"What a lovely day today!",
            r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
            r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]
for i in range(95):
    patterns.append(create_random_sentence())

# One big alternation, one group per pattern.
monster_pattern = "|".join("(%s)" % x for x in patterns)
print(monster_pattern)
print("--------------")
monster_regexp = re.compile(monster_pattern)

inputs = ["Hi there, John.",
          "What a lovely day today!",
          "Lovely sunset today, John, isn't it?",
          "Will you be meeting Linda today, John?",
          "Goobledigoock"] * 2000
for i in inputs:
    ret = monster_regexp.search(i)
    if ret:
        print(".", end=" ")
    else:
        print("x", end=" ")
I've created a hundred patterns. This was the maximum number of groups the Python regexp library allowed at the time (the 100-group limit was removed in Python 3.5). 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with 100 groups. (group1)|(group2)|(group3)|.... I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.). That's the monster_regexp.
Testing one string against this tests it against 100 patterns in a single shot. There are methods that fetch the exact group which was matched. I test 10000 strings, 80% of which should match and 20% of which will not. It short-circuits, so if there's a success it will be comparatively quick. Failures have to run through the whole regexp, so they will be slower. You can order the patterns by input frequency to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?"
]

def make_re_pattern(pattern):
    # Characters like . and ? have special meaning in regular expressions,
    # so escape the string to keep them literal.
    # Older versions of re.escape also escape %, so swap %s for XXX first.
    p = re.escape(pattern.replace("%s", "XXX"))
    return p.replace("XXX", r"\w+")

# Join all the patterns into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))

def match(s):
    """Given an input string, return the matched pattern or None."""
    m = rx.match(s)
    if m:
        # Find the index of the matching group.
        index = next(i for i, group in enumerate(m.groups()) if group is not None)
        return patterns[index]

# Testing with a couple of patterns
print(match("Hi there, John."))
print(match("Will you be meeting Linda today, John?"))
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern and making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf, there is an implementation in js: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performances.
The problem isn't clear to me. Do you want to take the patterns and build regexes out of them?
Many regex engines have a "quoted string" option (\Q ... \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
These will be regexes that match exactly the string you want outside your variables.
If you want to use a single regex to match just a single pattern, you can put them in grouped patterns to find out which one matched, but that will not give you EVERY match, just the first one.
If you use a proper parser (I've used PEG.js), it might be more maintainable, though. So that's another option if you think you might get stuck in regex hell.
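The quoting idea can be sketched in Python, whose re module has no \Q...\E but offers re.escape for the same purpose. This is a sketch under assumptions: each %s is taken to stand for any non-empty text, and the names template_to_regex and identify are hypothetical.

```python
import re


def template_to_regex(template, placeholder="%s"):
    """Quote the literal parts and put a capture group at each placeholder."""
    parts = template.split(placeholder)
    body = "(.+?)".join(re.escape(part) for part in parts)
    return re.compile(body)


TEMPLATES = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
COMPILED = [(t, template_to_regex(t)) for t in TEMPLATES]


def identify(text):
    """Return (template, captured values) for the first full match, else None."""
    for template, rx in COMPILED:
        m = rx.fullmatch(text)  # anchored at both ends, like ^...$
        if m:
            return template, m.groups()
    return None


print(identify("Will you be meeting Linda today, John?"))
print(identify("Goobledigoock"))
```

Checking templates one at a time is slower than one combined alternation, but it reports the captured values as well as which template matched.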

Regex: How to match a string that is not only numbers

Is it possible to write a regular expression that matches all strings that do not contain only numbers? If we have these strings:
abc
a4c
4bc
ab4
123
It should match the four first, but not the last one. I have tried fiddling around in RegexBuddy with lookaheads and stuff, but I can't seem to figure it out.
(?!^\d+$)^.+$
This says lookahead for lines that do not contain all digits and match the entire line.
Unless I am missing something, I think the most concise regex is...
/\D/
...or in other words, is there a not-digit in the string?
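A quick check of the "find one non-digit" idea against the question's five samples, sketched in Python (the function name has_non_digit is hypothetical). Note that an empty string contains no non-digit, so it is rejected as well.

```python
import re


def has_non_digit(s):
    """True when at least one character in s is not a digit."""
    return re.search(r"\D", s) is not None


samples = ["abc", "a4c", "4bc", "ab4", "123"]
results = {s: has_non_digit(s) for s in samples}
print(results)
```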
jjnguy had it correct (if slightly redundant) in an earlier revision.
.*?[^0-9].*
@Chad, your regex,
\b.*[a-zA-Z]+.*\b
should probably allow for non letters (eg, punctuation) even though Svish's examples didn't include one. Svish's primary requirement was: not all be digits.
\b.*[^0-9]+.*\b
Then, you don't need the + in there since all you need is to guarantee 1 non-digit is in there (more might be in there as covered by the .* on the ends).
\b.*[^0-9].*\b
Next, you can do away with the \b on either end since these are unnecessary constraints (invoking reference to alphanum and _).
.*[^0-9].*
Finally, note that this last regex shows that the problem can be solved with just the basics, those basics which have existed for decades (eg, no need for the look-ahead feature). In English, the question was logically equivalent to simply asking that 1 counter-example character be found within a string.
We can test this regex in a browser by copying the following into the location bar, replacing the string "6576576i7567" with whatever you want to test.
javascript:alert(new String("6576576i7567").match(".*[^0-9].*"));
/^\d*[a-z][a-z\d]*$/
Or, case insensitive version:
/^\d*[a-z][a-z\d]*$/i
May be a digit at the beginning, then at least one letter, then letters or digits
Try this:
/^.*\D+.*$/
It returns true if there is any symbol that is not a number. It should work in most regex flavors.
Since you said "match", not just validate, the following regex will match correctly
\b.*[a-zA-Z]+.*\b
Passing Tests:
abc
a4c
4bc
ab4
1b1
11b
b11
Failing Tests:
123
If you are trying to match words that have at least one letter but are formed from numbers and letters (or just letters), this is what I have used:
(\d*[a-zA-Z]+\d*)+
If we want to restrict valid characters so that string can be made from a limited set of characters, try this:
(?!^\d+$)^[a-zA-Z0-9_-]{3,}$
or
(?!^\d+$)^[\w-]{3,}$
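The restricted-character variant can be tried out in Python as follows. This is a sketch: fullmatch replaces the explicit ^ and $ anchors, re.ASCII is an added assumption that limits \w to ASCII letters, digits and underscore, and the name is_valid is hypothetical.

```python
import re

# Reject digit-only input up front, then require 3+ allowed characters.
ALLOWED = re.compile(r"(?!\d+$)[\w-]{3,}", re.ASCII)


def is_valid(s):
    return ALLOWED.fullmatch(s) is not None


for candidate in ["abc", "a4c", "123", "ab", "a b", "1-2"]:
    print(candidate, is_valid(candidate))
```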
/\w+/:
Matches any letter, number or underscore. any word character
.*[^0-9]{1,}.*
Works fine for us.
We want to use the used answer, but it's not working within YANG model.
And the one I provided here is easy to understand and it's clear:
The start and end can be any characters, but there must be at least one non-numeric character.
I am using /^[0-9]*$/gm in my JavaScript code to see if string is only numbers. If yes then it should fail otherwise it will return the string.
Below is working code snippet with test cases:
function isValidURL(string) {
  var res = string.match(/^[0-9]*$/gm);
  if (res == null)
    return string;
  else
    return "fail";
};
var testCase1 = "abc";
console.log(isValidURL(testCase1)); // abc
var testCase2 = "a4c";
console.log(isValidURL(testCase2)); // a4c
var testCase3 = "4bc";
console.log(isValidURL(testCase3)); // 4bc
var testCase4 = "ab4";
console.log(isValidURL(testCase4)); // ab4
var testCase5 = "123"; // fail here
console.log(isValidURL(testCase5));
I had to do something similar in MySQL and the following whilst over simplified seems to have worked for me:
WHERE fieldname REGEXP '^[a-zA-Z0-9]+$'
AND fieldname NOT REGEXP '^[0-9]+$'
This shows all fields that are alphabetical and alphanumeric but any fields that are just numeric are hidden. This seems to work.
example:
name1 - Displayed
name - Displayed
name2 - Displayed
name3 - Displayed
name4 - Displayed
n4ame - Displayed
324234234 - Not Displayed

Match overlapping patterns with capture using a MATLAB regular expression

I'm trying to parse a log file that looks like this:
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
...
This excerpt contains two time periods I'd like to extract, from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:
p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');
Returning:
times =
1x16 struct array with fields:
start
name
stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get these to work while still capturing the matches. That is, I not only need to be able to match every delimiter, but also I need to extract part of that match (the timestamp).
Is this possible?
P.S. I realize I can solve this problem by writing a simple state machine or by matching on the delimiters and post-processing, if there's no way to get this to work.
Update: Thanks for the workaround ideas, everyone. I heard from the developer and there's currently no way to do this with the regular expression engine in MATLAB.
MATLAB seems unable to capture characters as a token without removing them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, by noting that the stop time for one block of text is equal to the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:
c =
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
some more junk
...and applied the following expression:
p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';
The processing can then be done with the following code:
names = regexp(c,p,'names');
[names.stop] = deal(names(2:end).start,[]);
names = names(1:end-1);
...which gives us these results for the above sample text:
>> names(1)
ans =
start: '09-May-2009 04:10:29'
name: 'foo'
stop: '09-May-2009 04:10:50'
>> names(2)
ans =
start: '09-May-2009 04:10:50'
name: 'bar'
stop: '09-May-2009 04:11:29'
If you are doing a lot of parsing and such work, you might consider using Perl from within Matlab. It gives you access to the powerful regex engine of Perl and might also make many other problems easier to solve.
All you should have to do is to wrap a lookahead around the part of the regex that matches the second timestamp:
'%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?(?=%{4} (?<stop>.*?)\n)'
EDIT: Here it is without named groups:
'%{4} (.*?)\n% Starting (.*?)\n.*?(?=%{4} (.*?)\n)'
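For comparison, the lookahead-with-capture approach does work in engines that allow capturing inside a lookahead. Here is a sketch in Python rather than MATLAB (the question's update reports that MATLAB's engine could not do this), using the sample log from the question. The [^\n]* pieces replace the lazy .*? so each field stays on one line, and re.S lets the remaining .*? skip across the ignored lines.

```python
import re

log = (
    "%%%% 09-May-2009 04:10:29\n"
    "% Starting foo\n"
    "this is stuff\n"
    "to ignore\n"
    "%%%% 09-May-2009 04:10:50\n"
    "% Starting bar\n"
    "more stuff\n"
    "to ignore\n"
    "%%%% 09-May-2009 04:11:29\n"
)

# The stop timestamp is captured inside a lookahead, so the delimiter
# is not consumed and can start the next match.
pattern = re.compile(
    r"%{4} (?P<start>[^\n]*)\n"
    r"% Starting (?P<name>[^\n]*)\n"
    r".*?(?=%{4} (?P<stop>[^\n]*)\n)",
    re.S,
)

periods = [(m["start"], m["name"], m["stop"]) for m in pattern.finditer(log)]
for period in periods:
    print(period)
```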