find all text before using regex - regex

How can I use regex to find all text before the text "All text before this line will be included"?
I have includes some sample text below for example
This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.

Starting with an explanation... skip to end for quick answers
To match upto a specific piece of text, and confirm it's there but not include it with the match, you can use a positive lookahead, using notation (?=regex)
This confirms that 'regex' exists at that position, but matches the start position only, not the contents of it.
So, this gives us the expression:
.*?(?=All text before this line will be included)
Where . is any character, and *? is a lazy match (consumes least amount possible, compared to regular * which consumes most amount possible).
However, in almost all regex flavours . will exclude newline, so we need to explicitly use a flag to include newlines.
The flag to use is s, (which stands for "Single-line mode", although it is also referred to as "DOTALL" mode in some flavours).
And this can be implemented in various ways, including...
Globally, for /-based regexes:
/regex/s
Inline, global for the regex:
(?s)regex
Inline, applies only to bracketed part:
(?s:reg)ex
And as a function argument (depends on which language you're doing the regex with).
So, probably the regex you want is this:
(?s).*?(?=All text before this line will be included)
However, there are some caveats:
Firstly, not all regex flavours support lazy quantifiers - you might have to use just .*, (or potentially use more complex logic depending on precise requirements if "All text before..." can appear multiple times).
Secondly, not all regex flavours support lookaheads, so you will instead need to use captured groups to get the text you want to match.
Finally, you can't always specify flags, such as the s above, so may need to either match "anything or newline" (.|\n) or maybe [\s\S] (whitespace and not whitespace) to get the equivalent matching.
If you're limited by all of these (I think the XML implementation is), then you'll have to do:
([\s\S]*)All text before this line will be included
And then extract the first sub-group from the match result.

(.*?)All text before this line will be included
Depending on what particular regular expression framework you're using, you may need to include a flag to indicate that . can match newline characters as well.
The first (and only) subgroup will include the matched text. How you extract that will again depend on what language and regular expression framework you're using.
If you want to include the "All text before this line..." text, then the entire match is what you want.

This should do it:
<?php
$str = "This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.";
echo preg_filter("/(.*?)All text before this line will be included.*/s","\\1",$str);
?>
Returns:
This can include deleting, updating, or adding records to your database, which would then be reflex.

Related

Regex: Find word that does not have a different target anywhere ahead of it in the file?

How do I find a word that does not have a specific target text anywhere ahead of it in a file?
Let's say I want to find "setting3" not preceeded by a right square bracket "]" (which denotes a header). This following file would fail that test, due to [header]:
[header] #can be named anything
setting1=True
setting2=various setting values, can include any type of text
setting3=1
But a file with an orphaned setting should be a match:
setting3=1
Lookbehinds won’t work, because I may have arbitrary settings in between the header and the text I'm looking for. Because my terms span multiple lines, it makes it trickier.
For context, this is to set a rule with a tool that only offers one regex line (Ansible, which I think uses Python's regex engine). I don't believe I have access to special settings (global, etc.)
You might be able to use the following regular expression :
^[^[]*setting3=1
It matches from the start of the file up to the setting you're looking for, but only matching characters that aren't [, which guarantees that it will only match the setting you're looking for if it wasn't preceded by an header since they contain [.
Note that this could miss some settings that should be matched, in particular if comments preceding the setting contain a [ character.

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The cursive parts are variable, might contains spaces or a new line might start from this point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or break/new line as I am unable to predict what could be between the fixed words tha I am sure will always be there.
Now comes the hard part, when I would add the word "On" somewhere in the text above the sentence that I want to match, the regex now matches a much bigger text than I want. This is due to the use of [\s\S]+.
How am I able to make my regex match as less characters as possible? Using "?" before the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real life, with html tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. Although, in html, barely possible to occur as the user would never write the matching tag himself, I do want to make sure my regex is correctjust in case
These things are certain: I know with what the part of text starts, no matter where in respect to the entire text, I know with what the part of text ends, and there are specific fixed words that might make the regex more reliable, but they can be ommitted. Any text below the searched part is also allowed to be matched, but no text above may be matched at all
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches all character because of union of two complementary sets, it is like . with special option /s (dot matches newlines). and regex are greedy by default so the largest match will be returned.
Following correct link, the token just after the shortest match must be geschreven, so another way to write without using lazy expansion, which is more flexible is to prepend the repeated chracter set by a negative lookahead inside loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is for non capturing the group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead which ensures current position before next character is not the beginning of end token.
Following comments it can be explicitly mentioned what should not appear in repeating sequence :
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
to understand what the technic (?:(?!tokens)[\s\S])+ does.
in the first this can't appear between From and this
in the second From or this can't appear between From and this
in the third this or point can't appear between this and point
etc.

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It catches starting from the second line of my example until the end of the quoting environment, I guess the greedy quantifier would catch up to the last \end{quoting} in the file. Is there any possibility of doing this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\#!\_.\)\{-} instead of \_.\{-}; this means that the pattern cannot contain multiple \\}% sequences, thus achieving your aims (as far as I can determine them).
This uses a negative zero-width look-ahead pattern \#! to ensure that the next match for any character, is limited to not match the specific text we want to avoid (but other than that, anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\#!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
You’re on the right track using the non-greedy multi. The Vim help files
state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.

Regex to find two words on the page

I'm trying to find all pages which contain words "text1" and "text2".
My regex:
text1(.|\n)*text2
it doesn't work..
If your IDE supports the s (single-line) flag (so the . character can match newlines), you can search for your items with:
(text1).*(text2)|\2.*\1
Example with s flag
If the IDE does not support the s flag, you will need to use [\s\S] in place of .:
(text1)[\s\S]*(text2)|\2[\s\S]*\1
Example with [\s\S]
Some languages use $1 and $2 in place of \1 and \2, so you may need to change that.
EDIT:
Alternately, if you want to simply match that a file contains both strings (but not actually select anything), you can utilize look-aheads:
(?s)^(?=.*?text1)(?=.*?text2)
This doesn't care about the order (or number) of the arguments, and for each additional text that you want to search for, you simply append another (?=.*?text_here). This approach is nice, since you can even include regex instead of just plain strings.
text0[\s\S]*text1
Try this.This should do it for you.
What this does is match all including multiline .similar to having .*? with s flag.
\s takes care of spaces,newlines,tabs
\S takes care any non space character.
If you want the regex to match over several lines I would try:
text1[\w\W]*text2
Using . is not a good choice, because it usually doesn't match over multiple lines. Also, for matching single characters I think using square brackets is more idiomatic than using ( ... | ... )
If you want the match to be order-independent then use this:
(?:text1[\w\W]*text2)|(?:text2[\w\W]*text1)
Adding a response for IntelliJ
Building on #OnlineCop's answer, to swap the order of two expressions in IntelliJ,you would style the search as in the accepted response, but since IntelliJ doesn't allow a one-line version, you have to put the replace statement in a separate field. Also, IntelliJ uses $ to identify expressions instead of \.
For example, I tend to put my nulls at the end of my comparisons, but some people prefer it otherwise. So, to keep things consistent at work, I used this regex pattern to swap the order of my comparisons:
Notice that IntelliJ shows in a tooltip what the result of the replacement will be.
For me works text1*{0,}(text2){0,}.
With {0,} you can decide to get your keyword zero or more times OR you set {1,x} to get your keyword 1 or x-times (how often you want).

Regular expression using negative lookbehind not working in Notepad++

I have a source file with literally hundreds of occurrences of strings flecha.jpg and flecha1.jpg, but I need to find occurrences of any other .jpg image (i.e. casa.jpg, moto.jpg, whatever)
I have tried using a regular expression with negative lookbehind, like this:
(?<!flecha|flecha1).jpg
but it doesn't work! Notepad++ simply says that it is an invalid regular expression.
I have tried the regex elsewhere and it works, here is an example so I guess it is a problem with NPP's handling of regexes or with the syntax of lookbehinds/lookaheads.
So how could I achieve the same regex result in NPP?
If useful, I am using Notepad++ version 6.3 Unicode
As an extra, if you are so kind, what would be the syntax to achieve the same thing but with optional numbers (in this case only '1') as a suffix of my string? (even if it doesn't work in NPP, just to know)...
I tried (?<!flecha[1]?).jpg but it doesn't work. It should work the same as the other regex, see here (RegExr)
Notepad++ seems to not have implemented variable-length look-behinds (this happens with some tools). A workaround is to use more than one fixed-length look-behind:
(?<!flecha)(?<!flecha1)\.jpg
As you can check, the matches are the same. But this works with npp.
Notice I escaped the ., since you are trying to match extensions, what you want is the literal .. The way you had, it was a wildcard - could be any character.
About the extra question, unfortunately, as we can't have variable-length look-behinds, it is not possible to have optional suffixes (numbers) without having multiple look-behinds.
Solving the problem of the variable-length-negative-lookbehind limitation in Notepad++
Given here are several strategies for working around this limitation in Notepad++ (or any regex engine with the same limitation)
Defining the problem
Notepad++ does not support the use of variable-length negative lookbehind assertions, and it would be nice to have some workarounds. Let's consider the example in the original question, but assume we want to avoid occurrences of files named flecha with any number of digits after flecha, and with any characters before flecha. In that case, a regex utilizing a variable-length negative lookbehind would look like (?<!flecha[0-9]*)\.jpg.
Strings we don't want to match in this example
flecha.jpg
flecha1.jpg
flecha00501275696.jpg
aflecha.jpg
img_flecha9.jpg
abcflecha556677.jpg
The Strategies
Inserting Temporary Markers
Begin by performing a find-and-replace on the instances that you want to avoid working with - in our case, instances of flecha[0-9]*\.jpg. Insert a special marker to form a pattern that doesn't appear anywhere else. For this example, we will insert an extra . before .jpg, assuming that ..jpg doesn't appear elsewhere. So we do:
Find: (flecha[0-9]*)(\.jpg)
Replace with: $1.$2
Now you can search your document for all the other .jpg filenames with a simple regex like \w+\.jpg or (?<!\.)\.jpg and do what you want with them. When you're done, do a final find-and-replace operation where you replace all instances of ..jpg with .jpg, to remove the temporary marker.
Using a negative lookahead assertion
A negative lookahead assertion can be used to make sure that you're not matching the undesired file names:
(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg
Breaking it down:
(?<!\S) ensures that your match begins at the start of a file name, and not in the middle, by asserting that your match is not preceded by a non-whitespace character.
(?!\S*flecha\d*\.jpg) ensures that whatever is matched does not contain the pattern we want to avoid
\S+\.jpg is what actually gets matched -- a string of non-whitespace characters followed by .jpg.
Using multiple fixed-length negative lookbehinds
This is a quick (but not-so-elegant) solution for situations where the pattern you don't want to match has a small number of possible lengths.
For example, if we know that flecha is only followed by up to three digits, our regex could be:
(?<!flecha)(?<!flecha[0-9])(?<!flecha[0-9][0-9])(?<!flecha[0-9][0-9][0-9])\.jpg
Are you aware that you're only matching (in the sense of consuming) the extension (.jpg)? I would think you wanted to match the whole filename, no? And that's much easier to do with a lookahead:
\b(?!flecha1?\b)\w+\.jpg
The first \b anchors the match to the beginning of the name (assuming it's really a filename we're looking at). Then (?!flecha1?\b) asserts that the name is not flecha or flecha1. Once that's done, the \w+ goes ahead and consumes the name. Then \.jpg grabs the extension to finish off the match.