Is this the way Regex works? - regex

I'm not sure exactly how to ask this, but as you can see from the picture, I have labelled each $replacement_number and drawn a line to where it ends. The top line is what I am searching for and the bottom line is what I am replacing it with, but I'm sure you all know this.
This is for an MP3 tag editor. The pattern looks for exactly one letter following a number, where the number itself follows either [anything BUT a letter] or JUST ONE letter, and capitalizes the letter after the number. So 22b becomes 22B and y2k becomes y2K, but yy2k stays yy2k and 2bb stays 2bb, and so on.
My question is, are the numbers in the image exactly how regex understands them or am I wrong somewhere?
Also, is my code efficient or not?

Yes, that is how the capture groups will be ordered, with all major (and probably minor) flavors of Regex.
If you're interested to see how exactly regex understands your pattern you can use this nifty visualization tool:

You are correct: the numbers, as defined by the parentheses (as you pictured), are exactly how regex will label them.
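To check the numbering yourself, here is a minimal sketch using Python's re module. Groups are numbered by the position of their opening parenthesis, left to right, whether nested or not. The original pattern is only visible in the picture, so CAP_AFTER_NUMBER below is a hypothetical reconstruction of the behaviour described in the question, not the asker's actual regex, and capitalize_after_number is an illustrative helper name.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
nested = re.match(r"((\d+)([a-z]))", "22b")
print(nested.groups())  # ('22b', '22', 'b')

# Hypothetical pattern (an assumption, not the one in the picture):
# 1 = start of string or a character that is neither letter nor digit,
# 2 = at most one letter, 3 = the number,
# 4 = a single letter that is not followed by another letter.
CAP_AFTER_NUMBER = re.compile(
    r"(^|[^A-Za-z0-9])([A-Za-z]?)(\d+)([A-Za-z])(?![A-Za-z])"
)

def capitalize_after_number(text):
    """Uppercase group 4 and put groups 1-3 back unchanged."""
    return CAP_AFTER_NUMBER.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3) + m.group(4).upper(),
        text,
    )

for sample in ["22b", "y2k", "yy2k", "2bb"]:
    print(sample, "->", capitalize_after_number(sample))
```

In a tag editor that uses $1-style replacements, the same idea would be written as the four groups in order, with only the last one uppercased, if the editor supports case conversion in replacements.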

Related

Strip strings from text between identical pairs of characters (using regex or otherwise)

I have long text files (.srt subtitle files, actually) - which unfortunately include a lot of irrelevant/distracting information.
All irrelevant text is enclosed within identical pairs of pilcrow (paragraph) characters: ¶
So for example, some text would look like this:
This is important, and ¶junk trash garbage rubbish¶ I would like to
keep it.
Obviously, I want to remove everything between the ¶ characters and keep the rest. It doesn't matter whether the ¶ characters themselves are stripped or retained: if they're retained, it's trivial just to remove them directly with a subsequent search/replace - so I just need whatever pattern match is easiest.
Note that the ¶ symbols come in identical pairs, so it's not as simple as, for example, stripping out everything between [asymmetrical characters].
I'm not working on any particular platform. In fact, I was hoping to use a web-based tool to do it like this one.
I just need the regex - if anyone can assist! Alternatively, if there are better ways than regex, I'd be grateful for suggestions.
Edit: It has been suggested that this question (Remove text in-between delimiters in a string (using a regex?)) answers what I'm looking for. Thanks, but unfortunately it doesn't. That relates to using it in C# (which I don't know), and the answers to that question do not explain exactly how to replicate what I want. I want it to work in the online tool to which I linked.
Update: A good answer works, but only if the unwanted text appears in-line. I also need it to remove text where the entire line is unwanted:
779 00:35:52,216 --> 00:35:54,784
I miss him already.
780 00:36:00,291 --> 00:36:03,727
¶ If you ever need someone ¶
665
00:30:21,821 --> 00:30:25,589
¶ Feels like
sometimes you want to ¶
So I want to remove everything which appears between the ¶ symbols, regardless of where they appear in the line, and regardless of the presence of line breaks.
Second Update
Subsequent to the accepted answer, it seems it's not entirely working. In the example here, the regex provided does not work in the first multi-line instance. I have no clue what's wrong. I just want line breaks (or any other characters) to be irrelevant in the consideration. The request is simply to delete everything between pairs of ¶ characters, regardless of where they appear, and regardless of what else lies between.
Final (hopefully) update
For reference, and thanks to user MDR, we have the solution: (¶[\S\s]*?¶)
Updated because of new information in question and comments below this answer.
That online tool you quoted seems to extract text (perhaps not what you want here - you want to remove the bit found). Maybe instead use a local text editor (xed, Gedit, TextEdit, TextWrangler, Visual Studio Code, Atom, Notepad++ on Windows etc.) that has find and replace with a regex option, and find...
(¶[\S\s]*?¶)
...and replace with nothing. Demo: https://regex101.com/r/4v9gXj/8
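If I may suggest regexr.com: use ¶.*?¶ as the pattern, then switch to the Replace section, as the screenshot shows. Note that . does not match line breaks unless the dotall (s) flag is enabled, so multi-line spans need that flag.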
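For anyone who would rather script this than use an online tool, here is a minimal Python sketch of the same replacement. The helper name strip_pilcrow_spans is made up for illustration, and the sample strings are taken from the question. It also shows why a plain dot misses the multi-line case.

```python
import re

# [\s\S] matches any character including line breaks, so no flag is needed.
PILCROW_SPAN = re.compile(r"¶[\s\S]*?¶")

def strip_pilcrow_spans(text):
    """Delete each ¶...¶ pair together with everything between them."""
    return PILCROW_SPAN.sub("", text)

inline = "This is important, and ¶junk trash garbage rubbish¶ I would like to\nkeep it."
multiline = "¶ Feels like\nsometimes you want to ¶"

print(repr(strip_pilcrow_spans(inline)))
print(repr(strip_pilcrow_spans(multiline)))

# With a plain dot, the multi-line span only matches when DOTALL is set.
print(repr(re.sub(r"¶.*?¶", "", multiline)))
print(repr(re.sub(r"¶.*?¶", "", multiline, flags=re.DOTALL)))
```

The lazy quantifier (*?) matters: a greedy * would run from the first ¶ in the file to the last one and delete the wanted text in between.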

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret a whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.
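Since the original regex arrives as an argument in Python, the lookahead approach can be wrapped in a small helper. This is only a sketch: prefilter is a hypothetical name, it assumes the prefix is a literal string, and it uses a non-capturing group around the original pattern so the original's own group numbers are left untouched.

```python
import re

def prefilter(original, prefix):
    """Return a pattern that matches what `original` matches, but only
    where the match starts with the literal `prefix` (hypothetical helper)."""
    # The lookahead checks the prefix without consuming it; the group keeps
    # any | in the original from splitting the lookahead off.
    return f"(?={re.escape(prefix)})(?:{original})"

words = ["cat", "car", "bat", "cats", "bats", "cattle"]

pattern = re.compile(prefilter("cat|car|bat", "ca"))
print([w for w in words if pattern.fullmatch(w)])  # ['cat', 'car']

pattern = re.compile(prefilter("[a-z]{4}", "ca"))
print([w for w in words if pattern.fullmatch(w)])  # ['cats']
```

Because the lookahead consumes nothing, the match length is still decided by the original pattern, which is why [a-z]{4} keeps matching 4-letter strings here rather than 6-letter ones.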

Regex that matches even amount of character

Disclaimer (added after this was solved): this is my uni assignment, so the answer could be simple. Hints are shown, but my own answer is hidden here. Alternative answers can be found here, but I take no responsibility for any plagiarism of direct answers posted here.
Hi I'm having troubles with the following exercise
Find regex that strictly represents the language:
b^(m+1), such that m>=0, m mod 2 = 1
The language breaks down to words:
{bb,bbbb,bbbbbb,bbbbbbbb,...}
I have tried the following:
b(bbb)?(bb)*
But this also accepts
{bb,bbb,bbbb,bbbbb,...}
Is there a way to write it such that one part of the expression depends on another? I.e., (bb)* cannot be chosen if (bbb)? has already been chosen, then repeat the decision, but allow the reverse.
Any help would be appreciated. Thanks
Update:-
You can use
^(?:bb)+$
Regex Demo
Initial heading of question was --> Regex that matches odd amount of character
You can try this
^b(?:(?:b{2})+)?$
Regex Demo
My guess is that this might be closer:
^(?:bb){1,}$
and your set might look like,
bb
bbbb
bbbbbb
Not sure, though. If your set is correct, the expression can likely be modified.
Also, b would probably not be in the set, since m=0 does not satisfy the second requirement.
If you wish to explore/simplify/modify the expression, it's explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
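As a quick sanity check, the language can be generated directly from its definition and compared against ^(?:bb)+$. This is a minimal Python sketch; the upper bound of 12 letters is an arbitrary choice for the test, not part of the exercise.

```python
import re

EVEN_BS = re.compile(r"^(?:bb)+$")

# Words of the language: b^(m+1) for odd m, i.e. 2, 4, 6, ... letters.
language = {"b" * (m + 1) for m in range(0, 12) if m % 2 == 1}

# Every string of 0..12 b's must be accepted exactly when it is in the language.
for n in range(0, 13):
    word = "b" * n
    assert bool(EVEN_BS.match(word)) == (word in language), word

print(sorted(language, key=len)[:3])  # ['bb', 'bbbb', 'bbbbbb']
```

In the notation of a formal-languages course, which has no anchors or non-capturing groups, the same expression would be written bb(bb)*.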

RegEx to match sets of literal strings along with value ranges

Utter RegEx noob here with a project involving RegEx I need to modify. Has been a blast learning all of this.
I need to search for/verify a set of values that start with one of two string combinations (NC or KH) followed by a numeric range that is unique to each string prefix: NC01-NC13 or KH01-KH11.
I have been able to pull off the first common "chunk" of this with:
^(NC|KH)0[1-9]$
to verify NC01-NC09 or KH01-KH09. The next part is completely throwing me: I need to change the leading digit of the two-digit number from 0 to 1, and restrict the second digit to 0–3 for NC and 0–1 for KH.
I have found references abound for selecting between two strings (where I got the (NC|KH) from), but nothing as detailed as how to restrict following values based on the found text.
Any and all help would be greatly appreciated, as well as any great references/books/tutorials to RegEx (currently using Regular-Expressions.info).
The best way to do this is to separate the two cases altogether.
((NC(0[1-9]|1[0-3]))|(KH(0[1-9]|1[01])))
You might want to turn some of those internal capturing groups into non-capturing groups, but that makes the regex a little harder to read.
Edit: You might also be able to do this with positive lookbehind.
Edit: Here's a regex using lookbehind. It's a lot messier, and not really necessary here, but hopefully demonstrates the utility:
(KH|NC)(0[1-9]|(?<=KH)(1[01])|(?<=NC)(1[0-3]))
Sticking with your original idea of options for NC or KH, do the same for the numbers, try this:
^(NC|KH)(0[1-9]|1[0-3])$
Hope that makes sense
EDIT:
Based upon @Patrick's comment below, and sticking with this original answer, you could use this (although I bet there's a better way). The outer group makes ^ and $ apply to both alternatives:
^(?:(NC|KH)(0[1-9]|1[01])|NC1[23])$
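Range patterns like these are easy to get subtly wrong, so it can help to test them exhaustively. The sketch below, in Python, checks one possible anchored pattern (written here with non-capturing groups, as an illustration rather than any answerer's exact regex) against every two-digit suffix for both prefixes.

```python
import re

# Each prefix carries its own numeric range: NC01-NC13, KH01-KH11.
CODE = re.compile(r"^(?:NC(?:0[1-9]|1[0-3])|KH(?:0[1-9]|1[01]))$")

expected = {f"NC{n:02d}" for n in range(1, 14)} | {f"KH{n:02d}" for n in range(1, 12)}

# Try every two-digit suffix for both prefixes and keep what the pattern accepts.
accepted = {
    f"{prefix}{n:02d}"
    for prefix in ("NC", "KH")
    for n in range(0, 100)
    if CODE.match(f"{prefix}{n:02d}")
}

print(accepted == expected)  # True
print(bool(CODE.match("KH12")), bool(CODE.match("NC00")))  # False False
```

Swapping in any of the other patterns from this thread for CODE shows quickly which ones accept values such as KH12 or NC00.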

Regular Expression matching anything after a word

I am looking to find anything that matches this pattern, the beginning word will be:
organism aogikgoi egopetkgeopt foprkgeroptk 13
So anything that starts with organism needs to be found using regex.
^organism will match anything starting with "organism".
^organism(.*) will also capture everything that follows, into the variable that contains the first match (which varies according to language -- in Perl it's $1).
Also, I just wanna add for other newbies like me, in their various circumstances: you can do it in various ways depending on your text and what you are trying to do.
For example, here I want to delete everything after ?spam, so I could use .?spm.+ or .?spm.+, or any other way, as long as you are creative about it lol.
This might come in handy, here's a Link | Link where you can find some basic necessary regex and their meanings.
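For reference, here is how the ^organism(.*) approach looks in Python, as a minimal sketch. The first sample line comes from the question; the other strings are made up for illustration.

```python
import re

ORGANISM = re.compile(r"^organism(.*)")

line = "organism aogikgoi egopetkgeopt foprkgeroptk 13"
m = ORGANISM.match(line)
if m:
    # Group 1 holds everything after "organism", leading space included.
    print(repr(m.group(1)))

# A line that merely contains the word does not match, because of ^.
print(ORGANISM.match("the organism is small"))  # None

# With MULTILINE, ^ matches at the start of every line of a longer text.
text = "organism one\nother line\norganism two"
print(re.findall(r"^organism(.*)", text, flags=re.MULTILINE))  # [' one', ' two']
```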