Wrong line numbers reported when using multiline search - regex

Given the following text:
defmodule MyModule do
app_env(:plans, :myapp, [:billing, :plans],
binding_order: [:config],
required: true,
type: :any
)
app_env(
:plans_with_min_amount_of_integrations,
:myapp,
[:billing, :plans_with_min_amount_of_integrations],
binding_order: [:config],
required: true,
type: :any
)
end
I'm trying to match with the following condition in mind:
myapp string,
that is known to be located between app_env and :billing strings.
To do this, I'm running:
rg --replace '$1' --multiline --multiline-dotall "app_env.*?(myapp).*?billing" test.txt
I expect the following output
2:myapp
10:myapp
But for some reason, I'm getting the following output:
2:myapp
8:myapp
Why? How do I change the regexp to return the correct lines, while retaining the conditions?
Note, that this example is simplified, and is a part of a larger code search effort of precisely looking & replacing the code, so simply running rg myapp won't cut it in this case.

The author kindly responded to an issue about this in the ripgrep bug tracker: https://github.com/BurntSushi/ripgrep/issues/2420
It's not an issue with ripgrep, but rather my own faulty expectations, based on a misunderstanding of how line numbers are calculated when capture groups are involved.
In a nutshell, the line number always corresponds to the first line of the whole matched string, not to the line of the first capture group (which I wrongly assumed was the case).
Kudos go to #burntsushi5!
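The same effect can be reproduced outside ripgrep. The following is a minimal Python sketch, not ripgrep's implementation: it uses Python's re module, a shortened sample text (an assumption made for brevity) and a hypothetical helper line_of to show that the line where a match starts and the line where its capture group starts can differ.

```python
import re

# Shortened sample text (an assumption for illustration, not the original file).
text = (
    "defmodule MyModule do\n"
    "  app_env(:plans, :myapp, [:billing, :plans])\n"
    "\n"
    "  app_env(\n"
    "    :other,\n"
    "    :myapp,\n"
    "    [:billing, :other]\n"
    "  )\n"
    "end\n"
)


def line_of(offset):
    """Return the 1-based line number containing the given character offset."""
    return text.count("\n", 0, offset) + 1


# re.DOTALL lets "." cross line breaks, similar in spirit to --multiline-dotall.
pattern = re.compile(r"app_env.*?(myapp).*?billing", re.DOTALL)

match_lines = []  # line on which each whole match starts
group_lines = []  # line on which capture group 1 starts
for m in pattern.finditer(text):
    match_lines.append(line_of(m.start()))
    group_lines.append(line_of(m.start(1)))

print(match_lines)
print(group_lines)
```

In this sample the second match starts on the app_env( line (line 4), while its capture group sits two lines further down (line 6). A tool that reports the start of the match will print the former.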

Related

How to write to a file a regular expression match

I have been parsing a log file and working on it just using print. I've got it working but I can't figure out how to write it to a file instead of printing it to screen.
I've tried opening output file o for writing, then the following regex
matched = re.search(r"(http|https)://(.*?)./+", line)
o.write(matched)
It throws an error saying that the argument to .write has to be a string object. I've also tried o.write(matched(1),line) but that only gets me http. I'm a newbie, so I'm sorry if this is too simple a question, but I don't know enough about this to know where to start.
Here's the documentation for Match objects where one of the functions mentions what you want:
Match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). [...]
Here's a runnable example:
import re
line = "Some text with https://www.example.com/ in it"
matched = re.search(r"(http|https)://(.*?)./+", line)
with open("file.txt", "w") as o:
    o.write(matched.group())
which results in:
$ python3 test.py; cat file.txt; echo
https://www.example.com/
$ python2 test.py; cat file.txt; echo
https://www.example.com/
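When parsing a whole log, some lines will not contain a URL, and re.search returns None for those, so calling .group() on the result would fail. The sketch below is one way to guard against that. The function name write_matches, the simplified pattern and the sample lines are assumptions for illustration, not taken from the question.

```python
import re
import tempfile
from pathlib import Path

# Simplified URL pattern (an assumption; the question's pattern would work too).
URL_RE = re.compile(r"https?://\S+")


def write_matches(lines, out_path):
    """Write the first URL found on each line to out_path, one per line.

    Returns the number of URLs written.
    """
    written = 0
    with open(out_path, "w") as out:
        for line in lines:
            matched = URL_RE.search(line)
            if matched is None:
                continue  # no URL on this line, nothing to write
            out.write(matched.group() + "\n")
            written += 1
    return written


# Hypothetical log lines standing in for the real log file.
log_lines = [
    "GET https://www.example.com/ 200",
    "no url on this line",
    "GET http://example.org/page 404",
]
out_file = Path(tempfile.mkdtemp()) / "urls.txt"
count = write_matches(log_lines, out_file)
print(count)
print(out_file.read_text(), end="")
```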

Wrong regexp query for elasticsearch

I have some problems with the regexp query for Elasticsearch. In my index there's a text field with comma-separated numeric values (IDs), e.g.
2,140,3,2495
And I have the following query term:
"regexp" : {
"myIds" : {
"value" : "^2495,|,2495,|,2495$|^2495$",
"boost" : 1
}
}
But my result list is empty.
Let me say that I know that regexp queries are kind of slow, but the index already exists and is filled with millions of documents, so unfortunately restructuring it is not an option. So I need a regex solution.
In Elasticsearch regex, patterns are anchored by default, and ^ and $ are treated as literal characters.
What you mean to use is "2495,.*|.*,2495,.*|.*,2495|2495" - 2495, at the start of string, ,2495, in the middle, ,2495 at the end or a whole string equal to 2495.
Or, you may use a simpler
"(.*,)?2495(,.*)?"
That means
(.*,)? - an optional text (not including line breaks) ending with ,
2495 - your value
(,.*)? - an optional text (not including line breaks) starting with ,
Here is an online demo showing how this expression works (not a proof though).
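The anchored behaviour can be approximated locally with Python's re.fullmatch, which also requires the pattern to cover the whole string. This is only a stand-in for the Lucene engine used by Elasticsearch, not a test against it; the helper name has_id and the sample values are assumptions.

```python
import re

PATTERN = r"(.*,)?2495(,.*)?"


def has_id(field):
    """Return True if the comma-separated field contains the ID 2495."""
    # fullmatch mimics the implicit anchoring at both ends of the string.
    return re.fullmatch(PATTERN, field) is not None


for sample in ["2,140,3,2495", "2495", "2495,7", "1,2495,7", "12495,7", "1,24950"]:
    print(sample, has_id(sample))
```

The last two samples contain 2495 only as part of a longer number, so they are rejected.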
Ok, I got it to work but run in another problem now. I built the string as follows:
(.*,)?2495(,.*)?|(.*,)?10(,.*)?|(.*,)?898(,.*)?
It works well for a few IDs, but if I have, let's say, 50 IDs, then ES throws an exception which says that the regexp is too complex to process.
Is there a way to simplify the regexp or restructure the query itself?
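One possible simplification is to factor the shared prefix and suffix out of the alternation, so that only the IDs themselves are repeated. Whether this stays under Elasticsearch's complexity limit for 50 IDs is not verified here. The sketch below only checks the factored pattern with Python's re.fullmatch as a stand-in for the anchored Lucene matching, and build_pattern is a hypothetical helper.

```python
import re


def build_pattern(ids):
    """Build one pattern matching any of the given numeric IDs in a CSV field."""
    # int() guards against non-numeric input ending up inside the regex.
    alternatives = "|".join(str(int(i)) for i in ids)
    return "(.*,)?(" + alternatives + ")(,.*)?"


pattern = build_pattern([2495, 10, 898])
print(pattern)


def matches(field):
    """Return True if the whole field matches the combined pattern."""
    return re.fullmatch(pattern, field) is not None


for sample in ["5,10,7", "5,100,7", "898", "2,140,3"]:
    print(sample, matches(sample))
```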

Select two ranges, one immediately after another using regular expressions [duplicate]

I have a large log file, and I want to extract a multi-line string between two strings: start and end.
The following is a sample from the input file:
start spam
start rubbish
start wait for it...
profit!
here end
start garbage
start second match
win. end
The desired solution should print:
start wait for it...
profit!
here end
start second match
win. end
I tried a simple regex but it returned everything from start spam. How should this be done?
Edit: Additional info on real-life computational complexity:
actual file size: 2GB
occurrences of 'start': ~ 12 M, evenly distributed
occurrences of 'end': ~800, near the end of the file.
This regex should match what you want:
(start((?!start).)*?end)
Use the re.findall method with the single-line (DOTALL) flag re.S to get all the occurrences in a multi-line string:
re.findall('(start((?!start).)*?end)', text, re.S)
See a test here.
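Note that re.findall returns tuples of groups when the pattern contains capturing groups. A minimal sketch that avoids this by using non-capturing groups and re.finditer, run on the sample from the question (with indentation dropped, which is an assumption about the original layout):

```python
import re

text = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

# (?:(?!start).)*? consumes one character at a time, refusing to step onto
# another "start", so each match begins at the last "start" before an "end".
block_re = re.compile(r"start(?:(?!start).)*?end", re.S)

blocks = [m.group(0) for m in block_re.finditer(text)]
for block in blocks:
    print(block)
```

This reads the whole text into memory, so for a 2GB file a line-by-line approach is likely the better fit.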
Do it with code - basic state machine:
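# fi is assumed to be an already opened file object for the log
is_open = False
tmp = []
for ln in fi:
    if 'start' in ln:
        if is_open:
            tmp = []  # a newer 'start' discards the unfinished block
        else:
            is_open = True
    if is_open:
        tmp.append(ln)
    if 'end' in ln:
        is_open = False
        for x in tmp:
            print(x, end='')  # lines already carry their own newline
        tmp = []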
This is tricky to do because the re module does not return overlapping matches. The third-party regex module, available on PyPI, does support them.
https://pypi.python.org/pypi/regex
You'd want to use something like
regex.findall(pattern, string, overlapped=True)
If you're stuck with Python 2.x or something else that doesn't have regex, it's still possible with some trickery. One brilliant person solved it here:
Python regex find all overlapping matches?
Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.
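With only the standard re module, overlapping candidates can be collected by wrapping the pattern in a capturing lookahead, and the shortest candidate per end position can then be kept. The sketch below assumes that is the kind of trickery meant (the linked answer is not reproduced here), and innermost_blocks is a hypothetical name.

```python
import re

text = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)


def innermost_blocks(s):
    """Return, for each 'end', the shortest 'start ... end' block reaching it."""
    best = {}  # end offset -> candidate text
    # The lookahead consumes nothing, so a candidate is tried at every 'start'.
    for m in re.finditer(r"(?=(start.*?end))", s, re.S):
        # Candidates arrive in order of increasing start offset, so a later
        # candidate with the same end offset is shorter and overwrites.
        best[m.end(1)] = m.group(1)
    return [best[end] for end in sorted(best)]


blocks = innermost_blocks(text)
for block in blocks:
    print(block)
```

Each candidate is scanned from its own start to the next end, so this can get slow when there are many more starts than ends, as in the real file described in the question.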
You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".
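A minimal Python sketch of that two-step idea, using the sample from the question (indentation dropped, which is an assumption about the original layout):

```python
import re

text = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

# Step 1: cut the text into pieces that run from one "start" up to the next
# "start" or "end", whichever comes first; a trailing "end" is included.
candidate_re = re.compile(r"(?s)start.*?(?=end|start)(?:end)?")
candidates = candidate_re.findall(text)

# Step 2: keep only the pieces that were actually closed by an "end".
blocks = [c for c in candidates if c.endswith("end")]
for block in blocks:
    print(block)
```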

Regex to extract long hexadecimal string

I want to extract with a regex the value after ajaxBrowserNavigationCheck('&x and before the = from the following javascript code:
if (ajaxBrowserNavigationCheck('&x909ef93d-61ac-4311-ac56-20c2ae9770f5=7ebdc2a4-df58-4c1c-9b50-96964c93e927', '', 'servletcontroller', '')){
processBrowserNavigationButton();
Basically, the values I want to extract are &x909ef93d-61ac-4311-ac56-20c2ae9770f5 (the value before the =, and we need the &x)
and 7ebdc2a4-df58-4c1c-9b50-96964c93e927 (the value after the =)
Note that the value is there twice (it's after MODE=BROWSER_NAV)
Note that both values have 36 characters, not counting the &x
The &x is always there for the first string
My regex is a bit rusty; here's what I've got so far:
(&x([0-9a-fA-F]|-)+) gets me the first part
(&x([0-9a-fA-F]|-)+)|(=([0-9a-fA-F]|-)+) gets me both, but with the =, which we don't want...
Edit: Sorry that I forgot the language; it's for a JMeter script, which uses Jakarta ORO.
Edit2: I realize I can split those into two variables, or even three, in JMeter, which makes it a bit easier.
Edit3: I removed the window location part because it was misleading, since it was the same as in the ajax part.
in ajaxBrowserNavigationCheck('&x909ef93d-61ac-4311-ac56-20c2ae9770f5=7ebdc2a4-df58-4c1c-9b50-96964c93e927', '', 'servletcontroller', ''))
we want &x909ef93d-61ac-4311-ac56-20c2ae9770f5 and 7ebdc2a4-df58-4c1c-9b50-96964c93e927
You haven't said what language you are using, so it's hard to give a solid answer.
This matches just your targets:
&x[a-fA-F0-9-]*(?==)
The last term is a look ahead, which asserts, but does not capture, an equals sign.
This regex matches all the input and captures each target twice as groups 1 and 2:
(?m).*?(&x[a-fA-F0-9-]*)=.*(&x[a-fA-F0-9-]*)=.*
See a live demo on rubular
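If both values are wanted in one pass, two capture groups around the = can do it. The sketch below is Python for illustration only; the pattern uses plain Perl-style syntax, but it has not been tested with JMeter's ORO engine. The fixed length of 36 comes from the question, and the variable names are assumptions.

```python
import re

js = (
    "if (ajaxBrowserNavigationCheck("
    "'&x909ef93d-61ac-4311-ac56-20c2ae9770f5=7ebdc2a4-df58-4c1c-9b50-96964c93e927',"
    " '', 'servletcontroller', '')){"
)

# Group 1: "&x" plus 36 hex/dash characters; group 2: the 36 characters
# after the "=". The "=" itself is matched but not captured.
pair_re = re.compile(r"(&x[0-9a-fA-F-]{36})=([0-9a-fA-F-]{36})")

m = pair_re.search(js)
key, value = m.group(1), m.group(2)
print(key)
print(value)
```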

Match overlapping patterns with capture using a MATLAB regular expression

I'm trying to parse a log file that looks like this:
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
...
This excerpt contains two time periods I'd like to extract, from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:
p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');
Returning:
times =
1x16 struct array with fields:
start
name
stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get these to work while still capturing the matches. That is, I not only need to be able to match every delimiter, but also I need to extract part of that match (the timestamp).
Is this possible?
P.S. I realize I can solve this problem by writing a simple state machine or by matching on the delimiters and post-processing, if there's no way to get this to work.
Update: Thanks for the workaround ideas, everyone. I heard from the developer and there's currently no way to do this with the regular expression engine in MATLAB.
MATLAB seems unable to capture characters as a token without removing them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, by noting that the stop time for one block of text is equal to the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:
c =
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
some more junk
...and applied the following expression:
p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';
The processing can then be done with the following code:
names = regexp(c,p,'names');
[names.stop] = deal(names(2:end).start,[]);
names = names(1:end-1);
...which gives us these results for the above sample text:
>> names(1)
ans =
start: '09-May-2009 04:10:29'
name: 'foo'
stop: '09-May-2009 04:10:50'
>> names(2)
ans =
start: '09-May-2009 04:10:50'
name: 'bar'
stop: '09-May-2009 04:11:29'
If you are doing a lot of parsing and such work, you might consider using Perl from within Matlab. It gives you access to the powerful regex engine of Perl and might also make many other problems easier to solve.
All you should have to do is to wrap a lookahead around the part of the regex that matches the second timestamp:
'%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?(?=%{4} (?<stop>.*?)\n)'
EDIT: Here it is without named groups:
'%{4} (.*?)\n% Starting (.*?)\n.*?(?=%{4} (.*?)\n)'
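For comparison, in an engine where groups inside a lookahead do capture, such as Python's re module, the lookahead idea yields every period. The sketch below is Python, not MATLAB, and says nothing about MATLAB's behaviour; it uses [^\n]* for the captured fields and a trimmed copy of the sample log.

```python
import re

log = (
    "%%%% 09-May-2009 04:10:29\n"
    "% Starting foo\n"
    "this is stuff\n"
    "to ignore\n"
    "%%%% 09-May-2009 04:10:50\n"
    "% Starting bar\n"
    "more stuff\n"
    "to ignore\n"
    "%%%% 09-May-2009 04:11:29\n"
)

period_re = re.compile(
    r"%{4} (?P<start>[^\n]*)\n"          # delimiter line with the start time
    r"% Starting (?P<name>[^\n]*)\n"     # name of the period
    r".*?"                               # stuff to ignore (re.S lets . cross lines)
    r"(?=%{4} (?P<stop>[^\n]*)\n)",      # peek at the next delimiter, do not consume it
    re.S,
)

# Because the next delimiter is only looked at, it is still available as the
# start of the following match.
periods = [m.groupdict() for m in period_re.finditer(log)]
for period in periods:
    print(period)
```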