How do you "quantify" a variable number of lines using a regexp? - regex

Say you know the starting and ending lines of some section of text, but the chars in some lines and the number of lines between the starting and ending lines are variable, รก la:
aaa
bbbb
cc
...
...
...
xx
yyy
Z
What quantifier do you use, something like:
aaa\nbbbb\ncc\n(.*\n)+xx\nyyy\nZ\n
to parse those sections of text as a group?

You can use the s flag to match multilines texts, you can do it like:
~\w+ ~s.
There is a similar question here:
Javascript regex multiline flag doesn't work

If I understood correctly, you know that your text begins with aaa\nbbbb\ncc and ends with xx\nyyy\nZ\n. You could use aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z so that all operators are not greedy and you don't accidentally capture two groups at once. The text inbetween these groups would be in match group 1. You also need to turn the setting that causes dot to match new line on.

Try this:
aaa( |\n)bbbb( |\n)cc( |\n)( |\n){0,1}(.|\n)*xx( |\n)yyy( |\n)Z
( |\n) matches a space or a newline (so your starting and ending phrases can be split into different lines)
RegExr

At the end of the day what worked for me using Kate was:
( )+aaa\n( )+bbbb\n( )+cc\n(.|\n)*( )+xx\n( )+yyy\n( )+Z\n
using such regexps you can clear pages of quite a bit of junk.

Related

How can I delete this part of the text with regex?

I have a problem that I really hope that somebody could help me. So, I want to delete some parts of text from a notepad++ document using Regex. If there's another software that I can use to delete this part of text, let me know please, I am really really noob with regex
So, my document its like this:
1
00:00:00,859 --> 00:00:03,070
text over here
2
00:00:03,070 --> 00:00:09,589
text over here
3
00:00:09,589 --> 00:00:10,589
some numbers here
4
00:00:10,589 --> 00:00:12,709
Text over here
5
00:00:12,709 --> 00:00:18,610
More text with numbers here
What I want to learn is how can I delete the first 2 lines of numbers in all the document? So I could get only the text parts (the "text over here" parts)
I would really appreciate any kind of help!
My solution:
^[\s\S]{1,5}\d{1,3}:\d{1,3}:\d{1,3},\d{1,5}\s-->\s*?\d{1,3}:\d{1,3}:\d{1,3},\d{1,5}\s
This solution match both types: either all data in one line, or numbers in one line and data in the second.
Demo: https://regex101.com/r/nKD0DQ/1/
Simplest solution;
\d+(\r\n|\r|\n)\d{2}:\d{2}.*(\r\n|\r|\n)
Get line with some number \d+ with its line break (\r\n|\r|\n)
Also the next line that starts with two 2-digit numbers and a colon \d{2}:\d{2} with the rest .* and its line break. No need to match all since we already are in the correct line, since subtitle file is defined well with its predictable structure.
Put this as Find what: value in Search -> Replace.. in Notepad++, with Seach Mode: Regular Expression and with replace value (Replace with:) of empty space. Will get you the correct result, lines of expected text with empty line in between each.
to see it on action on regex101
Subtitles, for accuracy you can use this:
\d+(\r\n|\n|\r)(\d\d:){2}\d\d,\d{3}\s*-->\s*(\d\d:){2}\d\d,\d{3}(\r\n|\n|\r)
Check Regular Expression, Find what with this and Replace with empty would do.
Regxe Demo
srt subtitles are basically ordered. And it's better accurate than lose texts.
\d : a single digit.
+ : one or more of occurances of the afore character or group.
\r\n: carriage and return. (newline)
* : zero or more of occurances of the afore character or group.
| : Or, match either one.
{3}: Match afore character or group three times.
I'm going for a less specific regex:
^[0-9]*\n[0-9:,]*\s-->\s[0-9:,]*
Demo # regex101

Notepad++ and regex (multiline)

I have been facing a challenge. I have a text file with the following pattern:
SOME RANDOM TITLE IN CAPS (nnnn)
text text text
more text
...
SOME OTHER RANDOM TITLE IN CAPS (nnnn)
What is for sure is that what I want to extract are lines with a bracket and a date ex: (2015) ; (20008)
After the (nnnn) there is no text, sometimes space and CR LF, sometimes just CR LF
I would like to delete everything else and keep just the TITLE LINE with the brackets
The time I spent I could have done it by hand (there are 100lines) but I like the challenge :)
I thought I could find the issue but I am stuck.
I have tried something along this line:
^.*\(\d\d\d\d\)(?s)(.*)(^.*\(\d\d\d\d\))
But I don't get what I want. I can't seem to stop the (?s)(.*) going all the way to the end of the text instead of stopping at the next occurrence.
I suggest using the Search > Mark feature. Use a pattern like \(\d{4}\) and check the "Bookmark Line" option then click "Mark All". Then use Search > Bookmark > Remove Unmarked Lines. This will remove all lines except the ones that have matched your pattern.
Note: If it's possible to have parentheses with 4 digits within your other lines you could add $ to the end of the expression to ensure that the pattern only matches the end of the line. E.g. more text (1234) and other stuff would be matched by the pattern I gave above but if you use pattern \(\d{4}\)$ it will no longer match.
If you want to be even more specific with your pattern by looking for those lines with only uppercase letters and spaces followed by parentheses with 4 digits inside where the parentheses are at the end of the line, then you could use a pattern like this: [A-Z ]+\(\d{4}\)$
Sample input:
SOME RANDOM TITLE IN CAPS (2008)
text text text
more text
...
SOME OTHER RANDOM TITLE IN CAPS (2010)
Here is how to mark the lines:
After clicking "Mark All" here is what you see:
Now use Search > Bookmark > Remove Unmarked Lines and you get this:
The following RegEx maches the 2 lines with brackets containing 4 numbers:
.*?\(\d{4}\)\s*
It starts matching anything at start zero or more times (non greedy), then it matches a start bracket followed by 4 numbers. Finally ending White Space and new line.
If you want to remove all lines but the ones that end with (4numbers) you may try with this:
^(?!.*\(\d{4}\)\h*$).*(?:\r?\n|\z)
Replace by: (nothing)
See demo

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches has what I can describe as positioners. These are shown as a question mark, followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators cant even confirm the flavour of Regex it is - they believe its Python but I'm unsure.
Regex Image
Update
Unfortunately none of the codes below have worked so far. I thought to test each code in live (Rather than via regex thinking may work different but unfortunately still didn't work)
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
At first, match the ?21 and repace it with a distinctive character, #, etc
\?21
Demo
and you may try this regex to find what you want
(T(?:\d{7}|[\#\d]{8}))\s
Demo,,, in which target string is captured to group 1 (or \1).
Finally, replace # with ?21 or something you like.
Python script may be like this
ss="""T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre= re.compile(r'\?21')
regx= re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
for m in regx.findall(rexpre.sub('#',ss)):
print(m)
print()
for m in regx.findall(rexpre.sub('#',ss)):
print(re.sub('#',r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2
Usage Example:

Match text between tags

How do I match the text between tags when the end tag is non repeating?
Example:
DATA GOES HERE
aaa
DATA GOES HERE
bbb
The goal is to capture "aaa" and "bbb". I have tried the following regex however it fails to match the second batch;
^(DATA\sGOES\sHERE).*?\k<1>
The result from the above is always the first batch;
DATA GOES HERE
aaa
DATA GOES HERE
Thanks.
Have a try with:
(?s)^(DATA GOES HERE\R)(.+?)(?=\1|\z)
The sring you want is in group 2.
Assuming that the tag is always DATA GOES HERE:
(?<=DATA GOES HERE[\r\n]).+
Here is the output from RegexBuddy showing the match:
Explanation: -
(?<=DATA GOES HERE[\r\n]) - this is a positive lookbehind. It means 'make sure this is preceded by'.
.+ One or more of any characters (not newlines).
Essentially this looks for any sets of characters that are preceeded by a line with DATA GOES HERE. A lookbehind is zero length, so it does not participate in the matched text which is why you only get aaa and bbb which I am assuming is what you wanted.
Update based on comment
It doesn't work if line-break is CRLF, also when there are multiple lines to catch
Quite correct about the CRLF, there should have been a + after [\r\n]. To match multiple lines you can use the following:
(?<=(DATA GOES HERE[\r\n]+)).[\s\S]+?(?=\1)|(?<=DATA GOES HERE[\r\n]+).[\s\S]+
The updates are:
[\s\S]+ Any characters including new lines.
| = OR. Now it will match either between DATA GOES HERE blocks or for the last text after DATA GOES HERE.
Result:
You can try
(?:DATA GOES HERE\n(.+)(?=|$))+
to capture the texts between the tags (aaa and bbb).
Debuggex Demo

Vim: remove matching braces and the first word in the braces

For example, change
text 12345 {\color{red}text 123 \ref{label} 567
1234} 567
to
text 12345 text 123 \ref{label} 567
1234 567
What kind of operation should be done in vim?
I aim to find every all patterns {\color{red}
and remove the pattern and the matching brace } for the pattern,
while keeping the text in between.
The pattern {\color{red} can be anywhere in the line (not necessarily at the beginning of the line).
The text between the {\color{red} ...} can have multiple lines as shown above.
Thanks a lot for your help.
Edit:
I just find a way to do it, but may not be efficient enough.
:g/\\color{red}/norm ndiBvaBpd%
g: global
/\\color{red}: match the pattern
/norm: normal mode command
n: forward the cursor to next matching pattern from the cursor. But if the pattern is at the beginning of the line, it may fail to find it.
diB: delete inner block from the cursor
vaB: select block around the cursor
p: put to the selected block
d%: delete \color{red}
didn't get what do you really mean. there are many ways could do it.
{\color{red}text 123 \ref{label} 567}
^
|cursor
you could do:
df}$x
if you have surround.vim installed, removing surrounding braces would be easier. (ds{)
EDIT
for the question update:
open your file, and type:
:g#{\\color{red}#normal 0df}$x
hope the command does what you want.
EDIT II based on question update
if your target text object is crossing lines, you could try this:
g/{\\color{red}/normal 0f{mz%x`zxdf}
above line works if your target pattern crossing multiple lines (not only one/two, could be many). However the syntax must be correct, which means, the { , } must be paired.
I would use a substitution with regex for this:
%s/\v\{\\color\{\w+\}(.*)} ?$/\1
\v very magic (sane regexes)
{\\color\{\w+\} the color thingy
(.*) capture the text you want to save
} ?$ closing nipple bracket and optional space at the end of the line
/\1 replace the whole thing with the first capture, which is stuff between color tag BS
For your edited example, you can use \_. instead of . because it includes linebreak characters.
%s/\v\{\\color\{\w+\}(\_.*)}/\1