Match text between tags - regex

How do I match the text between tags when the end tag is non repeating?
Example:
DATA GOES HERE
aaa
DATA GOES HERE
bbb
The goal is to capture "aaa" and "bbb". I tried the following regex, but it fails to match the second batch:
^(DATA\sGOES\sHERE).*?\k<1>
The result is always the first batch:
DATA GOES HERE
aaa
DATA GOES HERE
Thanks.

Try:
(?s)^(DATA GOES HERE\R)(.+?)(?=\1|\z)
The string you want is in group 2.
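For readers working in Python, the same idea can be sketched as follows. This is an adaptation, not the pattern above verbatim: Python's re module does not support \R, so the line break is written as \r?\n, the end of input is written as \Z, and the tag is spelled out in the lookahead instead of using a backreference. The sample text is the one from the question.

```python
import re

# Tag line, then a lazy body that stops before the next tag line or end of input.
# (?m) lets ^ match at each line start; (?s) lets . match line breaks.
BLOCK = re.compile(r"(?ms)^DATA GOES HERE\r?\n(.+?)(?=^DATA GOES HERE|\Z)")

text = "DATA GOES HERE\naaa\nDATA GOES HERE\nbbb"

# A captured body may end with the line break that precedes the next tag,
# so surrounding whitespace is stripped.
bodies = [m.group(1).strip() for m in BLOCK.finditer(text)]
print(bodies)  # ['aaa', 'bbb']
```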

Assuming that the tag is always DATA GOES HERE:
(?<=DATA GOES HERE[\r\n]).+
Explanation:
(?<=DATA GOES HERE[\r\n]) - this is a positive lookbehind. It means 'make sure this is preceded by'.
.+ One or more of any characters (not newlines).
Essentially this looks for any run of characters that is preceded by a line containing DATA GOES HERE. A lookbehind is zero-length, so it does not become part of the matched text, which is why you only get aaa and bbb. I am assuming that is what you wanted.
Update based on comment
It doesn't work if line-break is CRLF, also when there are multiple lines to catch
Quite correct about the CRLF, there should have been a + after [\r\n]. To match multiple lines you can use the following:
(?<=(DATA GOES HERE[\r\n]+)).[\s\S]+?(?=\1)|(?<=DATA GOES HERE[\r\n]+).[\s\S]+
The updates are:
[\s\S]+ Any characters including new lines.
| = OR. Now it will match either between DATA GOES HERE blocks or for the last text after DATA GOES HERE.
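In Python the lookbehind patterns above cannot be used as written, because the re module only accepts fixed-width lookbehinds and rejects [\r\n]+ inside (?<=...). A minimal sketch of an alternative, assuming the tag always sits on its own line, is to split on the tag line instead. The function name and the multi-line sample text are made up for illustration.

```python
import re

# Split on each tag line (LF or CRLF) instead of using a lookbehind.
TAG_LINE = re.compile(r"^DATA GOES HERE\r?\n", re.MULTILINE)

def blocks_between_tags(text):
    """Return the text following each tag line, without surrounding whitespace."""
    parts = TAG_LINE.split(text)
    return [part.strip() for part in parts if part.strip()]

sample = "DATA GOES HERE\r\naaa\r\nmore\r\nDATA GOES HERE\r\nbbb"
print(blocks_between_tags(sample))  # ['aaa\r\nmore', 'bbb']
```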

You can try
(?:DATA GOES HERE\n(.+)(?=|$))+
to capture the texts between the tags (aaa and bbb).

Notepad++ and regex (multiline)

I have been facing a challenge. I have a text file with the following pattern:
SOME RANDOM TITLE IN CAPS (nnnn)
text text text
more text
...
SOME OTHER RANDOM TITLE IN CAPS (nnnn)
What is certain is that the lines I want to extract contain a date in parentheses, e.g. (2015) or (2008).
After the (nnnn) there is no text: sometimes a space and CR LF, sometimes just CR LF.
I would like to delete everything else and keep just the title lines with the brackets.
In the time I have spent I could have done it by hand (there are 100 lines), but I like the challenge :)
I thought I could work it out, but I am stuck.
I have tried something along this line:
^.*\(\d\d\d\d\)(?s)(.*)(^.*\(\d\d\d\d\))
But I don't get what I want. I can't seem to stop the (?s)(.*) going all the way to the end of the text instead of stopping at the next occurrence.
I suggest using the Search > Mark feature. Use a pattern like \(\d{4}\) and check the "Bookmark Line" option then click "Mark All". Then use Search > Bookmark > Remove Unmarked Lines. This will remove all lines except the ones that have matched your pattern.
Note: If it's possible to have parentheses with 4 digits within your other lines you could add $ to the end of the expression to ensure that the pattern only matches the end of the line. E.g. more text (1234) and other stuff would be matched by the pattern I gave above but if you use pattern \(\d{4}\)$ it will no longer match.
If you want to be even more specific with your pattern by looking for those lines with only uppercase letters and spaces followed by parentheses with 4 digits inside where the parentheses are at the end of the line, then you could use a pattern like this: [A-Z ]+\(\d{4}\)$
Sample input:
SOME RANDOM TITLE IN CAPS (2008)
text text text
more text
...
SOME OTHER RANDOM TITLE IN CAPS (2010)
Mark the lines as described above and click "Mark All"; both title lines are then bookmarked.
Now use Search > Bookmark > Remove Unmarked Lines and only the two title lines remain.
The following regex matches the 2 lines with brackets containing 4 digits:
.*?\(\d{4}\)\s*
It lazily matches any characters from the start of the line, then an opening bracket, 4 digits and a closing bracket, and finally any trailing whitespace, including the line break.
If you want to remove all lines but the ones that end with (4numbers) you may try with this:
^(?!.*\(\d{4}\)\h*$).*(?:\r?\n|\z)
Replace by: (nothing)
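Outside Notepad++, the same filter can be sketched in Python. This is an illustration only: the function name is hypothetical, and [ \t]* stands in for \h*, which Python's re module does not support. The sample is the input shown earlier plus the "more text (1234) and other stuff" line mentioned above.

```python
import re

# A line is kept when it ends with four digits in parentheses,
# optionally followed by spaces or tabs.
TITLE_LINE = re.compile(r"\(\d{4}\)[ \t]*$")

def keep_title_lines(text):
    """Return only the lines that end with (nnnn); handles LF and CRLF input."""
    return [line.rstrip() for line in text.splitlines() if TITLE_LINE.search(line)]

sample = (
    "SOME RANDOM TITLE IN CAPS (2008)\r\n"
    "text text text\r\n"
    "more text (1234) and other stuff\r\n"
    "...\r\n"
    "SOME OTHER RANDOM TITLE IN CAPS (2010) \r\n"
)
print(keep_title_lines(sample))
```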

How do you "quantify" a variable number of lines using a regexp?

Say you know the starting and ending lines of some section of text, but the chars in some lines and the number of lines between the starting and ending lines are variable, à la:
aaa
bbbb
cc
...
...
...
xx
yyy
Z
What quantifier do you use, something like:
aaa\nbbbb\ncc\n(.*\n)+xx\nyyy\nZ\n
to parse those sections of text as a group?
You can use the s flag to match multi-line text, for example:
~\w+ ~s.
There is a similar question here:
Javascript regex multiline flag doesn't work
If I understood correctly, you know that your text begins with aaa\nbbbb\ncc and ends with xx\nyyy\nZ\n. You could use aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z so that no quantifier is greedy and you don't accidentally capture two sections at once. The text in between the start and end markers would be in match group 1. You also need to turn on the setting that makes the dot match newlines.
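In Python, for instance, that setting is the re.DOTALL flag. A minimal sketch, using the stricter \n-separated markers from the question and invented sample lines between them:

```python
import re

# DOTALL makes . match newlines; the lazy (.+?) stops at the first end marker.
SECTION = re.compile(r"aaa\nbbbb\ncc\n(.+?)xx\nyyy\nZ\n", re.DOTALL)

text = (
    "aaa\nbbbb\ncc\nfirst\nsecond\nxx\nyyy\nZ\n"
    "aaa\nbbbb\ncc\nthird\nxx\nyyy\nZ\n"
)

# findall returns the captured middle part of every section.
sections = SECTION.findall(text)
print(sections)  # ['first\nsecond\n', 'third\n']
```

With a greedy (.+) instead, the single match would run from the first start marker to the last end marker and swallow both sections.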
Try this:
aaa( |\n)bbbb( |\n)cc( |\n)( |\n){0,1}(.|\n)*xx( |\n)yyy( |\n)Z
( |\n) matches a space or a newline (so your starting and ending phrases can be split into different lines)
At the end of the day what worked for me using Kate was:
( )+aaa\n( )+bbbb\n( )+cc\n(.|\n)*( )+xx\n( )+yyy\n( )+Z\n
Using regexps like this you can clear quite a bit of junk out of a page.

Get all the characters until a new date/hour is found

I have to parse a lot of content with a regular expression.
The content might, for example, be:
14-08-2015 14:18 : Example : Hello =) How are you?
What are you doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
I have this regular expression that will of course return 2 matches, and the groups that I need - date, hour, name, multi-line message:
(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):([^\d]+)
The problem is that if a number is written inside the message this will not be OK, because the regex will stop getting more characters.
For example in this case this will not work:
14-08-2015 14:18 : Example : Hello =) How are you?
What are you 2 doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
How do I get all the characters until a new date/hour is found?
The problem is with your final capturing group ([^\d]+).
Instead you can use ((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)
The outer parentheses, ( ... ), form a capturing group.
The next set of parentheses, (?: ... )+, is a non-capturing group that is matched one or more times.
Inside it is a negative lookahead, (?!\d{2}-\d{2}-\d{4}). It says that a date must not start at the current position.
What is actually consumed is [\s\S], which matches any single character, including a newline.
The entire regex that works looks like this:
(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)
https://regex101.com/r/wH5xR2/2
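As a quick check, the full pattern can be run in Python against the failing example from the question. This is only a sketch; the variable names are made up, and the groups are stripped of surrounding whitespace for readability.

```python
import re

MESSAGE = re.compile(
    r"(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):"
    r"((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)"
)

log = (
    "14-08-2015 14:18 : Example : Hello =) How are you?\n"
    "What are you 2 doing?\n"
    "14-08-2015 14:19: Example2 : I'm fine thanks!"
)

# Each findall result is a (date, hour, name, message) tuple.
entries = [tuple(part.strip() for part in groups) for groups in MESSAGE.findall(log)]
for entry in entries:
    print(entry)
```

The digit in "What are you 2 doing?" no longer ends the message, because only a full dd-mm-yyyy date stops the last group.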
Use a lookahead for dates and get everything up to that.
/^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)/sm
I've edited your regex in two ways:
Added ^ to the front, so a match can only start at a timestamp at the beginning of a line, which should filter out most issues with people posting timestamps inside messages.
Replaced the last capturing group with ((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)
(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}) is a negative lookahead for a date and time at the start of a line.
(?:(lookahead).)* matches any number of characters, one at a time, as long as no line-initial timestamp starts at the current position.
((?:(lookahead).)*) Just captures the group for you.
It's not that efficient, but it works. Note the s flag for dotall (dot matches newlines) and m flag that lets ^ match at the start of line. ^ is necessary in the lookahead so that you don't stop the match in case someone posts a timestamp, and in the start to make sure you only match dates from the start of a line.
DEMO: https://regex101.com/r/rX8eH0/3
DEMO with flags in regex: https://regex101.com/r/rX8eH0/4
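The effect of the ^ anchors can be sketched in Python, where the s and m flags are re.S and re.M. The sample log is a variation on the question's example with a timestamp posted inside a message; that variation and the variable names are assumptions made for illustration.

```python
import re

MESSAGE = re.compile(
    r"^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?"
    r"((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)",
    re.S | re.M,  # . matches newlines, ^ matches at every line start
)

log = (
    "14-08-2015 14:18 : Example : See you on 15-08-2015 12:00\n"
    "What are you 2 doing?\n"
    "14-08-2015 14:19: Example2 : I'm fine thanks!"
)

# The timestamp in the middle of the first line does not end the message,
# because the lookahead only fires at the start of a line.
entries = [tuple(part.strip() for part in groups) for groups in MESSAGE.findall(log)]
for entry in entries:
    print(entry)
```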

Vim: remove matching braces and the first word in the braces

For example, change
text 12345 {\color{red}text 123 \ref{label} 567
1234} 567
to
text 12345 text 123 \ref{label} 567
1234 567
What kind of operation should be done in vim?
I want to find every occurrence of the pattern {\color{red}
and remove both the pattern and its matching closing brace },
while keeping the text in between.
The pattern {\color{red} can be anywhere in the line (not necessarily at the beginning of the line).
The text between the {\color{red} ...} can have multiple lines as shown above.
Thanks a lot for your help.
Edit:
I just found a way to do it, but it may not be efficient enough.
:g/\\color{red}/norm ndiBvaBpd%
g: global
/\\color{red}: match the pattern
/norm: normal mode command
n: forward the cursor to next matching pattern from the cursor. But if the pattern is at the beginning of the line, it may fail to find it.
diB: delete inner block from the cursor
vaB: select block around the cursor
p: put to the selected block
d%: delete \color{red}
I'm not sure exactly what you mean. There are many ways to do it.
{\color{red}text 123 \ref{label} 567}
^
|cursor
you could do:
df}$x
If you have surround.vim installed, removing the surrounding braces is easier (ds{).
EDIT
For the question update:
open your file, and type:
:g#{\\color{red}#normal 0df}$x
I hope the command does what you want.
EDIT II based on question update
If your target text object spans lines, you could try this:
g/{\\color{red}/normal 0f{mz%x`zxdf}
The line above works when the target spans multiple lines (not only one or two; it could be many). However, the syntax must be correct, which means every { must be paired with a }.
I would use a substitution with regex for this:
%s/\v\{\\color\{\w+\}(.*)} ?$/\1
\v very magic (sane regexes)
\{\\color\{\w+\} the color thingy
(.*) capture the text you want to save
} ?$ closing nipple bracket and optional space at the end of the line
/\1 replace the whole thing with the first capture, which is stuff between color tag BS
For your edited example, you can use \_. instead of . because it includes linebreak characters.
%s/\v\{\\color\{\w+\}(\_.*)}/\1
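For comparison outside Vim, the brace matching that % performs in the commands above can be sketched in Python by counting braces. This is a rough illustration, not a LaTeX parser: the function name is hypothetical, escaped braces such as \{ are not treated specially, and a {\color{red} group nested inside another one is left as it is.

```python
def strip_color_groups(text, opener=r"{\color{red}"):
    """Remove each opener and its matching closing brace, keeping the body."""
    out = []
    i = 0
    while i < len(text):
        start = text.find(opener, i)
        if start == -1:
            out.append(text[i:])      # no more groups: keep the rest
            break
        out.append(text[i:start])     # text before the group
        body_start = start + len(opener)
        j = body_start
        depth = 1                     # the outer "{" of the opener is still open
        while j < len(text) and depth:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth:                     # unbalanced braces: leave the rest untouched
            out.append(text[start:])
            break
        out.append(text[body_start:j - 1])  # body without the closing brace
        i = j
    return "".join(out)

sample = "text 12345 {\\color{red}text 123 \\ref{label} 567\n1234} 567"
print(strip_color_groups(sample))
```

Because the scan works on the whole string, it does not matter whether the group spans one line or many.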

regular expression matching issue

I've got a string which has the following format
some_string = ",,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,"
and this is the content of a text file called f
I want to search for a specific term within the xxx (let's say that term is 'silicon')
note that the xxx can all be different and can contain any special characters (including meta characters) except for a new line
match = re.findall(r",{3}(.*?silicon.*?),{3}", f.read())
print(match)
But this doesn't seem to work because it returns results which are in the format:
["xxx,,,xxx,,,xxx,,,xxx,,,silicon", "xxx,,,xxx,,,xxx,,,xxsiliconxx"] but I only want it to return ["silicon", "xxsiliconxx"]
What am I doing wrong?
Try the following regex:
(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})
Example:
>>> import re
>>> s = ',,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx'
>>> re.findall(r'(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})', s)
['silicon', 'xxsiliconxx']
I am assuming that the content in the xxx can contain commas, just not three consecutive commas or it would end the field. If the content in the xxx sections cannot contain any commas, you can use the following instead:
(?<=,{3})[^,\r\n]*?silicon.*?(?=,{3})
The reason your current approach doesn't work is that even though .*? will try to match as few characters as possible, the match will still start as early as possible. So for example the regex a*?b would match the entire string "aaaab". The only time the regex will advance the starting position is when the regex fails to match, and since ,,, can be matched by the .*?, your match will always start at the beginning of the string or just after the previous match.
The lookbehind and lookahead are used to address the issue raised by JaredC in the comments: re.findall() won't return overlapping matches, so the leading and trailing ,,, must not be part of the match.
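The difference can be checked side by side with a short script. It reuses the sample string from the example above; the variable names are made up.

```python
import re

s = ",,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx"

# A lazy quantifier still matches from the leftmost possible start.
print(re.search(r"a*?b", "aaaab").group())  # aaaab

# Original pattern: the first match starts at the very first ,,, and the
# lazy .*? then runs across a field separator to reach "silicon".
original = re.findall(r",{3}(.*?silicon.*?),{3}", s)
print(original)  # ['xxx,,,silicon', 'xxsiliconxx']

# Lookaround pattern: the part before "silicon" may not cross a ,,, and the
# separators themselves are not consumed.
fixed = re.findall(r"(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})", s)
print(fixed)  # ['silicon', 'xxsiliconxx']
```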