Regex to capture everything except the text that is coherent - regex

I have this string and other ones like it:
<a href='/webapps/alrn-atomiclearning-bb_bb60/atomic/view.jsp?courseId=#X#course.pk_string#X#&contentId=#X#content.pk_string#X#&tt=Using+the+course+calendar&st=Blackboard+Learn%E2%84%A2+9.1+Instructor+-+Additional+Features+Training&d=00:02:09&tid=84425&sid=2389'><img src='/webapps/alrn-atomiclearning-bb_bb60/images/icon_play_UnlockedTutorial.png' alt='play icon'> Using the course calendar</a><br/>Duration: (00:02:09)
I'm trying to come up with a regex to capture everything EXCEPT the coherent labels that begin after and end just before the </a><br/>
So for example, I would capture everything and then delete it and end up only having:
Using the course calendar
left over. I've tried multiple variations in Rubular but can only get up to the . Using [^a-zA-Z|^\s]*<\/a>.* to skip every word character and whitespace up to the </a> does not work.
Thanks.

Use a lookbehind and a lookahead (the two parenthesized sections below). Modify the character class in the middle to match whatever you want to select.
(?<=> )[a-zA-Z\s]+(?=<\/)
Edit:
([\s\w\d\S\W\D]+)((?<=> )[a-zA-Z\s]+(?=<\/))\K([\s\w\d\S\W\D]+)
Ultimately this creates three match groups: the part before what you want to keep, the part you want to keep, and the part after it. I'm not sure how, or indeed whether, you can select multiple matches as if they were a single match.
I'd still go with selecting what you're actually after, if possible.
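As a minimal sketch of the "select what you're actually after" approach, here is the lookaround pattern from the answer applied in Python. The sample string is abbreviated (the href and src values are shortened for readability, which is an assumption and not the original markup), and the variable names are illustrative only.

```python
import re

# Abbreviated version of the question's string; the href and src values are
# shortened here for readability (an assumption, not the original markup).
html = (
    "<a href='/webapps/atomic/view.jsp?tid=84425&sid=2389'>"
    "<img src='/webapps/images/icon_play_UnlockedTutorial.png' alt='play icon'>"
    " Using the course calendar</a><br/>Duration: (00:02:09)"
)

# Lookbehind for "> " and lookahead for "</", as in the answer above.
label_pattern = re.compile(r"(?<=> )[a-zA-Z\s]+(?=</)")

match = label_pattern.search(html)
label = match.group(0) if match else None
print(label)
```

If labels can contain digits or punctuation, the character class in the middle would need to be widened accordingly.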

Related

How to use regex to look around a complex pattern?

I have the following html element in Sublime Text:
<div class="exg"><div><strong class="syn">investigate</strong><span class="syn">, conduct investigations into, make inquiries into, inquire into, probe, examine, explore, research, study, look into, go into</span></div>
I want to use regex to select the content after and including the 5th comma in this element, stopping before
</span></div>.
So, in this case I'd want to select:
, examine, explore, research, study, look into, go into
So far, I was able to write this regex, which works:
(<div class="exg"><div><strong class="syn">(\w+)((\s)?(\w+)?)+</strong><span class="syn">((\,((\s)?(\w+)?)+)?){5})
This allows me to select the part before what I need to select. I tried to use this with a positive lookbehind, but it isn't working and I can't figure out how to fix it. Here is what I tried:
(?<=(<div class="exg"><div><strong class="syn">(\w+)((\s)?(\w+)?)+</strong><span class="syn">((\,((\s)?(\w+)?)+)?){3}))((\,?((\s)?(\w+)?)+?)+)
You make heavy use of parentheses. Also, your expression for catching words between commas could be simpler. Replacing your groups with non-capturing ones, you'll get the expected match in your first (and only) group with this regex:
(?<=<div class="exg"><div><strong class="syn">)(?:\s?\w)*<\/strong><span class="syn">(?:,(?:\s?\w)*){4}(.*?)(?=<\/span><\/div>)
BTW, if you want the capture to start at the 5th comma, I think your quantifier should be {4} (but I might have misunderstood).
Check the Demo
Update:
If you're looking to delete the matched group (i.e. replace it with an empty string), just do the opposite: build one group before it and one after it:
(<div class="exg"><div><strong class="syn">(?:\s?\w)*<\/strong><span class="syn">(?:,(?:\s?\w)*){4}).*?(<\/span><\/div>)
Demo
Then replace in your editor with \1\2 (the two groups one after the other, without the previously matched string in between).
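Both variants can be tried outside the editor. The following is a rough Python sketch, assuming the element from the question as input; the variable names are made up for illustration, and Python's re module is used here even though the question is about Sublime Text (the lookbehind is fixed-width, so it is accepted by Python as well).

```python
import re

# The element from the question, split over several literals for readability.
html = (
    '<div class="exg"><div><strong class="syn">investigate</strong>'
    '<span class="syn">, conduct investigations into, make inquiries into, '
    'inquire into, probe, examine, explore, research, study, look into, '
    'go into</span></div>'
)

# Variant 1: capture everything from the 5th comma up to </span></div>.
extract = re.compile(
    r'(?<=<div class="exg"><div><strong class="syn">)(?:\s?\w)*</strong>'
    r'<span class="syn">(?:,(?:\s?\w)*){4}(.*?)(?=</span></div>)'
)
selected = extract.search(html).group(1)

# Variant 2: keep the part before and after, drop what lies in between.
delete = re.compile(
    r'(<div class="exg"><div><strong class="syn">(?:\s?\w)*</strong>'
    r'<span class="syn">(?:,(?:\s?\w)*){4}).*?(</span></div>)'
)
trimmed = delete.sub(r"\1\2", html)

print(selected)
print(trimmed)
```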

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italicized parts are variable: they might contain spaces, or a new line might start at that point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or line break, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part: when I add the word "On" somewhere in the text above the sentence that I want to match, the regex matches much more text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Adding "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more realistic example, with HTML tags, can be found here. There, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. In HTML this is barely possible, as the user would never write the matching tag himself, but I do want to make sure my regex is correct, just in case.
These things are certain: I know what the part of text starts with, no matter where it sits in the entire text; I know what it ends with; and there are specific fixed words that might make the regex more reliable, but they can be omitted. Any text below the searched part may also be matched, but no text above it may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided by making sure I do not use any [\s\S]+? before a (\r)?(\n)? has come first.
[\s\S] matches any character, because it is the union of two complementary sets; it behaves like . with the s option (dot matches newlines). Quantifiers are greedy by default, so the longest match is returned. Making them lazy only shortens the match from the right: the engine still starts at the leftmost position where a match is possible, so an earlier start token becomes the start of the match.
In the correct link, the token right after the shortest match must be geschreven. So another way to write this without lazy quantifiers, and a more flexible one, is to put a negative lookahead in front of the repeated character set, inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is a non-capturing group; it just wraps the negative lookahead and the . (which can be replaced by [\s\S]).
(?! ) inside it is the negative lookahead, which ensures that the current position, before the next character is consumed, is not the beginning of the end token.
Following the comments, you can state explicitly what must not appear in each repeated sequence:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
The three variants illustrate what the technique (?:(?!tokens)[\s\S])+ does:
In the first, this can't appear between From and this.
In the second, neither From nor this can appear between From and this.
In the third, additionally, neither this nor point can appear between this and point.
And so on.
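To see the difference in practice, here is a small Python sketch that compares the lazy pattern with the second tempered variant above. The sample text is invented for illustration (it is not taken from the regexr links), and matching is case-sensitive here for simplicity.

```python
import re

# Invented sample: an unrelated "From" appears one line above the real start.
text = (
    "From an earlier line\n"
    "From Monday this week at some point nearly everything: body"
)

# Lazy version: still anchors at the FIRST "From" it can start from.
lazy = re.compile(r"From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:")

# Tempered version: "From" (or "this") may not occur between From and this,
# so a match cannot begin at the earlier "From".
tempered = re.compile(
    r"From(?:(?!From|this)[\s\S])+this"
    r"(?:(?!point)[\s\S])+point"
    r"(?:(?!everything)[\s\S])+everything:"
)

lazy_match = lazy.search(text)
tempered_match = tempered.search(text)

print(repr(lazy_match.group(0)))      # starts at the first "From"
print(repr(tempered_match.group(0)))  # starts at the second "From"
```

The lookahead is evaluated before every single character, so this form is usually slower than a plain lazy quantifier; the trade-off is control over what may appear between the fixed tokens.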

Regex repeat patterns and non matching groups

Okay so I am having an issue getting a repeat to work at all, let alone the way I want it to work...
I will be bringing in a string with the following information
NETWORK;PASS;1;THIS TEXT|CAN BE|RANDOM|WITH|PIPE|SEPERATORS;\r
what I have so far
(?:NETWORK;.*;(?:0|1);)([^|]*)
this currently leaves me the first block matched
THIS TEXT
What I am trying to do is set it up so I can programmatically specify which block to match. The pipe-separated text will have between 3 and 7 "blocks", and depending on the situation I may need to match any one of them, but only one at a time.
I had thought about just duplicating
([^|]*)
and adding a non-matching operator to all but the one I need, but I can't seem to get it to match anything if I duplicate that group, nor can I get repeat operators to work on the group.
I am a bit lost, so this may not make complete sense; if clarification is required, I will provide it on request. Any help is appreciated.
Why not just split THIS TEXT|CAN BE|RANDOM|WITH|PIPE|SEPERATORS on the pipe symbol? Much easier than a dynamically-generated regex.
But if you really want to generate a regex:
Start with (?:NETWORK;.*;(?:0|1);)
To get the nth element (indexed from 0), add (?:[^|]+[|]){n} (replace n with the number to skip), followed by ([^|]+)
Example:
(?:NETWORK;.*;(?:0|1);)(?:[^|]+[|]){3}([^|]+)
Debuggex Demo
Matches WITH in your example. Here's a regex101 demo.
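Both suggestions can be sketched in a few lines of Python. The function names nth_block and split_blocks are hypothetical helpers introduced for this illustration, and the split version assumes the pipe-separated text is always the fourth semicolon-separated field, as in the example line.

```python
import re
from typing import List, Optional

# Fixed prefix from the answer above.
PREFIX = r"(?:NETWORK;.*;(?:0|1);)"


def nth_block(line: str, n: int) -> Optional[str]:
    """Return the n-th pipe-separated block (0-based) via a generated regex."""
    pattern = PREFIX + r"(?:[^|]+[|]){%d}([^|]+)" % n
    m = re.search(pattern, line)
    return m.group(1) if m else None


def split_blocks(line: str) -> List[str]:
    """Alternative without a generated regex: split on ';' and then on '|'."""
    return line.split(";")[3].split("|")


line = "NETWORK;PASS;1;THIS TEXT|CAN BE|RANDOM|WITH|PIPE|SEPERATORS;\r"

print(nth_block(line, 3))
print(split_blocks(line))
```

Note that with the generated regex the last block would also pick up the trailing ; and \r, because [^|]+ only stops at a pipe; the split version does not have that problem.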

Using a Regular expression to extract required data

I am trying to modify my current regular expression to be more exclusive.
This is what I have so far:
RE.Pattern = "(L\d{1}-\w{2}-\w{4,7}-DATA-\d{3,4})"
What this does is extracts the following examples of strings from a load of junk data. FYI these strings are NOT static, the number values etc will change between cells.
L2-R2-TEST-DATA-4724
L1-SR-TESTING-DATA-472
L1-R2-WORKING-DATA-472
The above strings are what I want, however as well as this it will also extract the data below:
L1-R2-WRONGON-DATA-4725
L2-SR-RUBBISH-DATA-472
This is not what I need, and was wondering what, if anything, could be done in order to modify my regex to stop this from happening...
I was wondering if it is possible to statically define say TEST,TESTING and WORKING somehow within the original regex? So that I can grab them and not WRONGON and RUBBISH.
You could use a non-capturing group (?:...) to list the words you want to include. Also, it's not necessary to write L\d{1}; you can simply use L\d.
RE.Pattern = "(L\d-\w{2}-(?:TEST(?:ING)?|WORKING)-DATA-\d{3,4})"
See Live demo
I'm not sure I understand your question, since you say that the strings will be changing, but you want to know if you can statically match specific cases.
If you just want to match TEST, TESTING and WORKING, you can replace \w{4,7} with (?:TEST|TESTING|WORKING) and it will obviously not match WRONGON or RUBBISH. If you want to match any 4-7 character word except the latter two, that's a different matter.
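A quick way to check the alternation is to run it over a string that mixes the wanted and unwanted codes. This sketch uses Python rather than the VBScript-style RE object from the question, and the surrounding junk text is made up.

```python
import re

# Pattern from the first answer: only TEST, TESTING or WORKING are accepted.
pattern = re.compile(r"(L\d-\w{2}-(?:TEST(?:ING)?|WORKING)-DATA-\d{3,4})")

# Made-up junk data containing all five example strings from the question.
junk = (
    "xx L2-R2-TEST-DATA-4724 yy L1-R2-WRONGON-DATA-4725 zz "
    "L1-SR-TESTING-DATA-472 L2-SR-RUBBISH-DATA-472 L1-R2-WORKING-DATA-472 end"
)

wanted = pattern.findall(junk)
print(wanted)
```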

Regex for SublimeText Snippet

I've been stuck for a while on this Sublime Snippet now.
I would like to display the correct package name when creating a new class, using TM_FILEPATH and TM_FILENAME.
When printing TM_FILEPATH variable, I get something like this:
/Users/caubry/d/[...]/src/com/[...]/folder/MyClass.as
I would like to transform this output, so I could get something like:
com.[...].folder
This includes:
Removing anything before /com/[...]/folder/MyClass.as;
Removing the TM_FILENAME, with its extension; in this example MyClass.as;
And finally finding all the slashes and replacing them by dots.
So far, this is what I've got:
${1:${TM_FILEPATH/.+(?:src\/)(.+)\.\w+/\l$1/}}
and this displays:
com/[...]/folder/MyClass
I do understand how to replace slashes with dots, such as:
${1:${TM_FILEPATH/\//./g/}}
However, I'm having difficulty adding this logic to the previous one, as well as removing the TM_FILENAME at the end.
I'm really inexperienced with Regex, thanks in advance.
:]
EDIT: [...] indicates variable number of folders.
We can do this in a single replacement with some trickery. What we'll do is, we put a few different cases into our pattern and do a different replacement for each of them. The trick to accomplish this is that the replacement string must contain no literal characters, but consist entirely of "backreferences". In that case, those groups that didn't participate in the match (because they were part of a different case) will simply be written back as an empty string and not contribute to the replacement. Let's get started.
First, we want to remove everything up until the last src/ (to mimic the behaviour of your snippet - use an ungreedy quantifier if you want to remove everything until the first src/):
^.+/src/
We just want to drop this, so there's no need to capture anything - nor to write anything back.
Now we want to match subsequent folders until the last one. We'll capture the folder name and also match the trailing /, but write back only the folder name and a period. But I said no literal text in the replacement string! So the . has to come from a capture as well. Here the assumption comes into play that your file always has an extension: we can grab the period from the file name with a lookahead. We'll also use that lookahead to make sure that there's at least one more folder ahead:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))
And we'll replace this with $1$2. Now if the first alternative catches, groups $1 and $2 will be empty, and the leading bit is still removed. If the second alternative catches, $1 will be the folder name, and $2 will have captured a period. Sweet. The \G is an anchor that ensures that all matches are adjacent to one another.
Finally, we'll match the last folder and everything that follows it, and only write back the folder name:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))|\G([^/]+)/[^/]+$
And now we'll replace this with $1$2$3 for the final solution. Demo.
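The "replacement made only of backreferences" idea can be tried in Python as a rough sketch. Two things here are assumptions and differ from the Sublime setting: Python's re module has no \G anchor, so it is dropped (for a path like this one every match happens to start where the previous one ended anyway), and the path itself is a hypothetical stand-in for the elided one in the question. Python 3.5 or later is assumed, since earlier versions raise an error for unmatched groups in the replacement.

```python
import re

# Hypothetical path; the real one in the question has elided folders.
path = "/Users/caubry/d/project/src/com/example/folder/MyClass.as"

# Same three alternatives as above, without \G (not supported by Python's re).
pattern = re.compile(
    r"^.+/src/"                      # case 1: everything up to the last src/
    r"|([^/]+)/(?=[^/]+/.*([.]))"    # case 2: a folder with another folder ahead
    r"|([^/]+)/[^/]+$"               # case 3: the last folder plus the file name
)

# Groups that did not take part in a match are written back as empty strings.
package = pattern.sub(r"\1\2\3", path)
print(package)
```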
A conceptually similar variant would be:
^.+/src/|\G([^/]+)/(?:(?=[^/]+/.*([.]))|[^/]+$)
replaced with $1$2. I've really only factored out the beginning of the second and third alternative. Demo.
Finally, if Sublime is using Boost's extended format string syntax, it is actually possible to get characters into the replacement conditionally (without magically conjuring them from the file extension):
^.+/src/|\G(/)?([^/]+)|\G/[^/]+$
Now we have the first alternative for everything up to src (which is to be removed), the third alternative for the last slash and file name (which is to be removed), and the middle alternative for all folders you want to keep. This time I put the slash to be replaced optionally at the beginning. With a conditional replacement we can write a . there if and only if that slash was matched:
(?1.:)$2
Unfortunately, I can't test this right now and I don't know an online tester that uses Boost's regex engine. But this should do the trick just fine.