Select two ranges, one immediately after another using regular expressions [duplicate] - regex

I have a large log file, and I want to extract a multi-line string between two strings: start and end.
The following is sample from the inputfile:
start spam
start rubbish
start wait for it...
profit!
here end
start garbage
start second match
win. end
The desired solution should print:
start wait for it...
profit!
here end
start second match
win. end
I tried a simple regex but it returned everything from start spam. How should this be done?
Edit: Additional info on real-life computational complexity:
actual file size: 2GB
occurrences of 'start': ~ 12 M, evenly distributed
occurences of 'end': ~800, near the end of the file.

This regex should match what you want:
(start((?!start).)*?end)
Use re.findall method and single-line modifier re.S to get all the occurences in a multi-line string:
re.findall('(start((?!start).)*?end)', text, re.S)
See a test here.

Do it with code - basic state machine:
open = False
tmp = []
for ln in fi:
if 'start' in ln:
if open:
tmp = []
else:
open = True
if open:
tmp.append(ln)
if 'end' in ln:
open = False
for x in tmp:
print x
tmp = []

This is tricky to do because by default, the re module does not look at overlapping matches. Newer versions of Python have a new regex module that allows for overlapping matches.
https://pypi.python.org/pypi/regex
You'd want to use something like
regex.findall(pattern, string, overlapped=True)
If you're stuck with Python 2.x or something else that doesn't have regex, it's still possible with some trickery. One brilliant person solved it here:
Python regex find all overlapping matches?
Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.

You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".

Related

Wrong regexp query for elasticsearch

I have some problems with the regexp query for elasticsearch. In my index there's a text field with comma-separated numeric values (IDs), f.e.
2,140,3,2495
And I have the following query term:
"regexp" : {
"myIds" : {
"value" : "^2495,|,2495,|,2495$|^2495$",
"boost" : 1
}
}
But my result list is empty.
Let me say that I know that regexp queries are kind of slow but the index still exists and is filled with millions of documents so unfortunately it's not an option to restructure it. So I need a regex solution.
In ElasticSearch regex, patterns are anchored by default, the ^ and $ are treated as literal chars.
What you mean to use is "2495,.*|.*,2495,.*|.*,2495|2495" - 2495, at the start of string, ,2495, in the middle, ,2495 at the end or a whole string equal to 2495.
Or, you may use a simpler
"(.*,)?2495(,.*)?"
That means
(.*,)? - an optional text (not including line breaks) ending with ,
2495 - your value
(,.*)? - an optional text (not including line breaks) ending with ,
Here is an online demo showing how this expression works (not a proof though).
Ok, I got it to work but run in another problem now. I built the string as follows:
(.*,)?2495(,.*)?|(.*,)?10(,.*)?|(.*,)?898(,.*)?
It works good for a few IDs but if I have let's say 50 IDs, then ES throws an exception which says that the regexp is too complex to process.
Is there a way to simplify the regexp or restructure the query it selves?

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space where ever there is any of the characters
-:\*_/;, present for example JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should beJET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some words exceptions like a/c w/d etc. \D conditions given to avoid date/time values getting separated, but this created an issue, the numbers followed by the above mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
Dim newString As String: newString = reg.Replace(oldString, "$1 ")
formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)

Vim: How to delete repetition in a line

I am having a log file for analysis, in that few of the line will have repetition of it own, but not complete repetition, say
Alex is here and Alex is here and we went out
We bothWe both went out
I want to remove the first occurrence and get
Alex is here and we went out
We both went out
Please share a regex to do in Vim in Windows.
I don't recommend trying to use regex magic to solve this problem. Just write an external filter and use that.
Here's an external filter written in Python. You can use this to pre-process the log file, like so:
python prefix_chop.py logfile.txt > chopped.txt
But it also works by standard input:
cat logfile.txt | prefix_chop.py > chopped.txt
This means you can use it in vim with the ! command. Try these commands: goto line 1, then pipe from current line through the last line through the external program prefix_chop.py:
1G
!Gprefix_chop.py<Enter>
Or you can do it from ex mode:
:1,$!prefix_chop.py<Enter>
Here's the program:
#!/usr/bin/python
import sys
infile = sys.stdin if len(sys.argv) < 2 else open(sys.argv[1])
def repeated_prefix_chop(line):
"""
Check line for a repeated prefix string. If one is found,
return the line with that string removed, else return the
line unchanged.
"""
# Repeated string cannot be more than half of the line.
# So, start looking at mid-point of the line.
i = len(line) // 2 + 1
while True:
# Look for longest prefix that is found in the string after pos 0.
# The prefix starts at pos 0 and always matches itself, of course.
pos = line.rfind(line[:i])
if pos > 0:
return line[pos:]
i -= 1
# Stop testing before we hit a length-1 prefix, in case a line
# happens to start with a word like "oops" or a number like "77".
if i < 2:
return line
for line in infile:
sys.stdout.write(repeated_prefix_chop(line))
I put a #! comment on the first line, so this will work as a stand-alone program on Linux, Mac OS X, or on Windows if you are using Cygwin. If you are just using Windows without Cygwin, you might need to make a batch file to run this, or just type the whole command python prefix_chop.py. If you make a macro to run this you don't have to do the typing yourself.
EDIT: This program is pretty simple. Maybe it could be done in "vimscript" and run purely inside vim. But the external filter program can be used outside of vim... you can set things up so that the log file is run through the filter once per day every day, if you like.
Regex:\b(.*)\1\b
Replace with:\1 or $1
If you want to deal with more than two repeating sentences you can try this
\b(.+?\b)\1+\b
--
|->avoids matching individual characters in word like xxx
NOTE
Use \< and \> instead of \b
You could do it by matching as much as possible at the beginning of the line and then using a backreference to match the repeated bit.
For example, this command solves the problem you describe:
:%s/^\(.*\)\(\1.*\)/\2

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
more infos at regular-expressions.info for comparison of Lookaround feature scroll down 2/3 on this page.
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

Match overlapping patterns with capture using a MATLAB regular expression

I'm trying to parse a log file that looks like this:
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
...
This excerpt contains two time periods I'd like to extract, from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:
p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');
Returning:
times =
1x16 struct array with fields:
start
name
stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get these to work while still capturing the matches. That is, I not only need to be able to match every delimiter, but also I need to extract part of that match (the timestamp).
Is this possible?
P.S. I realize I can solve this problem by writing a simple state machine or by matching on the delimiters and post-processing, if there's no way to get this to work.
Update: Thanks for the workaround ideas, everyone. I heard from the developer and there's currently no way to do this with the regular expression engine in MATLAB.
MATLAB seems unable to capture characters as a token without removing them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, by noting that the stop time for one block of text is equal to the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:
c =
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
some more junk
...and applied the following expression:
p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';
The processing can then be done with the following code:
names = regexp(c,p,'names');
[names.stop] = deal(names(2:end).start,[]);
names = names(1:end-1);
...which gives us these results for the above sample text:
>> names(1)
ans =
start: '09-May-2009 04:10:29'
name: 'foo'
stop: '09-May-2009 04:10:50'
>> names(2)
ans =
start: '09-May-2009 04:10:50'
name: 'bar'
stop: '09-May-2009 04:11:29'
If you are doing a lot of parsing and such work, you might consider using Perl from within Matlab. It gives you access to the powerful regex engine of Perl and might also make many other problems easier to solve.
All you should have to do is to wrap a lookahead around the part of the regex that matches the second timestamp:
'%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?(?=%{4} (?<stop>.*?)\n)'
EDIT: Here it is without named groups:
'%{4} (.*?)\n% Starting (.*?)\n.*?(?=%{4} (.*?)\n)'