Regular expression to delete all words between two specific words - regex

I'm normally ok with regex but I'm struggling with this.
I have a simple file with two words that start and end a set of data. The data between the words changes, but start and status are always in the same place.
Example :
start
Everything in between
status
I'm trying to work out how to delete (replace) everything between and including start and status
I'm sure I had it working with this at one time
(?i)^start.+?status
set(#replaceAll,$replace regular expression(#textTest,"(?i)^start.+?status"," "),"Global")
but it's just not working anymore.

You could use the regular expression
\bstart\b.+?\bstatus\b
which does not require "status" to be on the same line as "start". Two flags should be set:
case indifference (/i)
single-line mode, which allows . to match a newline (/s)
Demo
The regex reads, "match 'start' with a word break fore and aft (to avoid matching 'starting' or 'jumpstart', for example), then match one or more characters lazily, then match 'status' with word breaks". The middle match must be lazy so that the regex engine will stop at the next (rather than last) instance of 'status'.
If the regex engine being used does not support single-line mode, or something comparable, one can replace .+? with [\s\S]+?.
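As a rough illustration of the two flags, here is a minimal Python sketch. Python is an assumption (the question uses a different tool) and the sample text is made up; re.IGNORECASE and re.DOTALL stand in for /i and /s.

```python
import re

# Hypothetical sample input; "header" and "footer" are illustrative only.
text = "header\nstart\nEverything in between\nstatus\nfooter"

pattern = r"\bstart\b.+?\bstatus\b"

# With DOTALL, '.' also matches newlines, so the match can span lines.
cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE | re.DOTALL)

# Without DOTALL, '.' cannot cross the newline after "start", so nothing matches.
unchanged = re.sub(pattern, "", text, flags=re.IGNORECASE)

print(repr(cleaned))
print(unchanged == text)
```

If the engine at hand has no equivalent of DOTALL, swapping .+? for [\s\S]+? in the pattern should give the same result without the flag.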

So my original expression works, and so does Cary's.
The files have changed since I last used the expression. They contain some white-space in the form of newlines that needed to be removed first.
set(#cleanup,$replace(#text2,$new line," "),"Global")
set(#text2,$replace regular expression(#cleanup,"\\bstart\\b.*?\\bstatus\\b",""),"Global")
set(#cleanup,$replace regular expression(#cleanup,"(?i)^start.+?status:",""),"Global")
Sorry about that but thanks to all who looked and helped :)

Related

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italicized parts (date, time, person name) are variable; they might contain spaces, or a new line might start at any of these points.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or break/new line, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part: when I add the word "On" somewhere in the text above the sentence that I want to match, the regex matches much more text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Using "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real-life example, with HTML tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. Although this is unlikely to occur in HTML, as the user would never write the matching tag himself, I do want to make sure my regex is correct, just in case.
These things are certain: I know what the part of text starts with, no matter where it sits in the entire text; I know what the part of text ends with; and there are specific fixed words that might make the regex more reliable, but they can be omitted. Any text below the searched part is also allowed to be matched, but no text above may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches any character because it is the union of two complementary sets; it behaves like . with the special option /s (dot matches newlines). Quantifiers are greedy by default, so without the lazy ? the largest match is returned.
Following the "Correct" link, the token just after the shortest match must be geschreven, so another way to write this without lazy quantifiers, and one that is more flexible, is to prefix the repeated character set with a negative lookahead inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is a non-capturing group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead, which ensures that the current position, before the next character is consumed, is not the beginning of the end token.
Following the comments, what must not appear in the repeated sequence can be stated explicitly:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
These variants show what the technique (?:(?!tokens)[\s\S])+ does:
in the first, this can't appear between From and this
in the second, From or this can't appear between From and this
in the third, this or point can't appear between this and point
etc.
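To make the difference concrete, here is a small Python sketch comparing the lazy pattern with the third lookahead variant. Python is an assumption (the question targets VB.NET, whose regex engine also supports lookaheads), and the sample text is invented for illustration.

```python
import re

# Hypothetical text with an extra "From" above the part we want.
text = "From the top\nFrom this\npoint on everything: done"

# Lazy quantifiers still start at the first "From" they find.
lazy = re.search(
    r"From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:",
    text, flags=re.IGNORECASE)

# Each repeated character is guarded by a negative lookahead, so the
# gap between two fixed words cannot contain the forbidden tokens.
guarded = re.search(
    r"From(?:(?!From|this)[\s\S])+this"
    r"(?:(?!this|point)[\s\S])+point"
    r"(?:(?!everything)[\s\S])+everything:",
    text, flags=re.IGNORECASE)

print(repr(lazy.group(0)))
print(repr(guarded.group(0)))
```

The lazy version is minimal only with respect to where it ends; the start is fixed by the leftmost "From", which is why forbidding From inside the first gap is what shrinks the match.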

regular expression replace all except captured expression

I'm trying to use a regular expression to find everything except for the data I don't want to replace. I'm then wanting to replace everything except for the found expression.
My match is to find all words that start with "CN=" and ends with 12 characters after. I'm currently using the logic (!?CN=)\w{12}. It finds the two occurrences in the string. However, I'm wanting to find everything but these two occurrences and find everything else and replace everything else with an empty value so that I just have the two CN= word values.
The attached image shows my testing and results. I'm wanting the reverse effect.
Thanks,
Fred
You could use .*? to match whatever comes before a match, and add $ (end-of-input) as an alternative to CN=\w{12}:
.*?($|CN=\w{12})
Replace with just the captured groups:
$1
Demo on Java regex tester
NB: I did not add the !? as you wrote ...words that start with "CN=". If you really need to keep the exclamation mark before CN=, then you should add it to the regular expression.
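A minimal sketch of this replacement in Python (an assumption; the answer's demo uses a Java tester, where the replacement is written $1 rather than \1). The input string is made up.

```python
import re

# Hypothetical input containing two CN= values surrounded by other text.
text = "foo CN=ABCDEFGHIJKL bar CN=MNOPQRSTUVWX baz"

# Lazily consume everything up to the next CN= value (or the end of input),
# then keep only what group 1 captured.
kept = re.sub(r".*?($|CN=\w{12})", r"\1", text)

print(kept)
```

Note that the kept values end up directly next to each other. Without a dot-matches-newline flag, . will not cross line breaks, so multi-line input would likely need that flag as well.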

Using Flags of Regex within Google Forms

I'm trying to use flags within Google Forms, and I've been googling hoping to find an answer in the last couple of hours, but didn't find any. Google Forms say that the regular expression is not valid. Even when I use a simple regex such as: (?i)t. I'm trying to use the regex inside a paragraph question.
How can I make it work?
Edit:
What I really need is to match [a-zA-Z" ]+( *),( *)[1-9]([0-9]??)\n repeatedly, so each line will look something like: Sam "The Man" McAdams , 9\n. Of course, the number of lines is unknown. Using the repetition modifiers * or + at the end of the regex does not satisfy my needs, because if the first line is accepted as valid, the other lines might be composed of anything at all and the input is still considered valid, when it is not.
You can use the following expression to validate an entire string that only consists of lines meeting your pattern:
^([a-zA-Z" ]+ *, *[1-9][0-9]?(\n|$))+$
See the regex demo.
The main point is to add an alternation group to match either a newline or the end of string ((\n|$)) and wrap the whole pattern into a +-quantified group ((...)+) anchored at both start (^) and end ($).
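The effect of anchoring the repeated group can be checked with a short Python sketch. This is only an approximation: Google Forms uses its own regex engine, and the sample inputs below are invented.

```python
import re

# Every line must match the pattern; the group repeats once per line.
LINES = re.compile(r'^([a-zA-Z" ]+ *, *[1-9][0-9]?(\n|$))+$')

valid = 'Sam "The Man" McAdams , 9\nJane Doe, 12'
invalid = "Sam, 9\nthis line has no number"

print(bool(LINES.match(valid)))
print(bool(LINES.match(invalid)))
```

Because the group is anchored at both ends, a bad second line can no longer hide behind a valid first line.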

Assistance with a regular expression

I am not good with regular expressions, and I could use some help with a couple of expressions I am working on. I have a line of text, such as Text here then 999-99 and I'd like to isolate that number sequence at the end. It could be either 999-99 or 999-99-9. The following seems to work:
\d{3}-\d{2}(-\d{1})?
But I notice that it really just seems to be searching anywhere within the text, as I can add text after the number sequence and it still matches. This needs to be more strict, so that the line must end with this exact sequence, and nothing after it. I tried ending with $ instead of ?, but that never seems to create a match (it always returns false).
I could also use some help with character replacement. I am working on a program which deals with OCR scanning, and occasionally the string value that comes back contains undisplayable characters, represented by the ܀ symbol. Is there a regular expression which will replace the ܀ characters with a space?
Try this regular expression.
([\d-]+)$
This should work. Just end your regex with $, which represents the end of the line:
\d{3}-\d{2}(-\d{1})?$
Use the word-boundary metacharacter, \b:
\b\d{3}-\d{2}(-\d)?\b
You can also remove the {1} from the last \d since it's redundant.
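A quick check of the $-anchored version in Python (an assumption; the question does not say which language is used). The sample strings are illustrative.

```python
import re

# The trailing $ requires the number sequence to be the last thing on the line.
ENDS_WITH_NUMBER = re.compile(r"\d{3}-\d{2}(-\d)?$")

samples = [
    "Text here then 999-99",
    "Text here then 999-99-9",
    "Text here then 999-99 and more",
]

# search() scans the whole string, so leading text is allowed.
results = [bool(ENDS_WITH_NUMBER.search(s)) for s in samples]
print(results)
```

If $ "never matches" in practice, it may be worth checking whether the function being called requires the pattern to match the whole string rather than search within it.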

Explain this Regular Expression please

Regular Expressions are a complete void for me.
I'm dealing with one right now in TextMate that does what I want it to do...but I don't know WHY it does what I want it to do.
/[[:alpha:]]+|( )/(?1::$0)/g
This is used in a TextMate snippet and what it does is takes a Label and outputs it as an id name. So if I type "First Name" in the first spot, this outputs "FirstName".
Previously it looked like this:
/[[:alpha:]]+|( )/(?1:_:/L$0)/g (it might have been \L instead)
This would turn "First Name" into "first_name".
So I get that the underscore adds an underscore for a space, and that the /L lowercases everything...but I can't figure out what the rest of it does or why.
Someone care to explain it piece by piece?
EDIT
Here is the actual snippet in question:
<column header="$1"><xmod:field name="${2:${1/[[:alpha:]]+|( )/(?1::$0)/g}}"/></column>
This regular expression (regex) format is basically:
/matchthis/replacewiththis/settings
The "g" setting at the end means do a global replace, i.e. replace every match rather than only the first one.
Breaking it down further...
[[:alpha:]]+|( )
That matches one or more alphabetic characters (the whole match is held in parameter $0), or alternatively a single space (captured in parameter $1).
(?1::$0)
As Roger says, the ? indicates this part is a conditional. If a match was found in parameter $1 then it is replaced with the stuff between the colons :: - in this case nothing. If nothing is in $1 then the match is replaced with the contents of $0, i.e. any run of alphabetic characters is output unchanged.
This explains why the spaces are removed in the first example, and the spaces get replaced with underscores in your second example.
In the second expression the \L is used to lowercase the text.
The extra question in the comment was how to run this expression outside of TextMate. Using vi as an example, I would break it into multiple steps:
:0,$s/ //g
:0,$s/\u/\L\0/g
The first part of the above commands tells vi to run a substitution starting on line 0 and ending at the end of the file (that's what $ means).
The rest of the expression uses the same sorts of rules as explained above, although some of the notation in vi is a bit custom - see this reference webpage.
I find RegexBuddy a good tool for me in dealing with regexs. I pasted your 1st regex in to Buddy and I got the explanation shown in the bottom frame:
I use it for helping to understand existing regexes, building my own, testing regexes against strings, etc. I've become better at regexes because of it. FYI I'm running under Wine on Ubuntu.
It's searching for any alpha character that appears at least once in a row [[:alpha:]]+ or space ( ).
/[[:alpha:]]+|( )/(?1::$0)/g
The (?1 is a conditional and used to strip the match if group 1 (a single space) was matched, or replace the match with $0 if group 1 wasn't matched. As $0 is the entire match, it gets replaced with itself in that case. This regex is the same as:
/ //g
I.e. remove all spaces.
/[[:alpha:]]+|( )/(?1:_:/\L$0)/g
This regex is still using the same condition, except now if group 1 was matched, it's replaced with an underscore, and otherwise the full match ($0) is used, modified by \L. \L changes the case of all text that comes after it, so \LABC would result in abc; think of it as a special control code.
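For readers outside TextMate, the conditional replacement can be imitated with a replacement function. The sketch below is in Python and is only an approximation: Python's re module has no [[:alpha:]] class or (?1:...) syntax, so [A-Za-z] and a lambda stand in for them, and the function names to_id and to_snake are made up.

```python
import re

# ASCII stand-in for TextMate's [[:alpha:]]+|( )
PATTERN = re.compile(r"[A-Za-z]+|( )")

def to_id(label):
    # Mirrors (?1::$0): drop the match if group 1 (a space) matched,
    # otherwise keep the whole match unchanged.
    return PATTERN.sub(lambda m: "" if m.group(1) else m.group(0), label)

def to_snake(label):
    # Mirrors (?1:_:\L$0): underscore for a space, otherwise lowercase the match.
    return PATTERN.sub(lambda m: "_" if m.group(1) else m.group(0).lower(), label)

print(to_id("First Name"))
print(to_snake("First Name"))
```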