Notepad++ Regex Remove Character from Markdown Formatted Footnote - regex

This is a follow-up question to what was solved yesterday:
Notepad++ Regex Replace Makeshift Footnotes format With Proper Markdown format
I managed to find a Regex to remove the offending semicolons in the main text area but by only cutting out the text and pasting back the result, which can only be done one by one.
I'm not sure how this can be done, but the expert can tell me.
So I have footnote references in markdown format. Two instances of the same thing:
[^1]:
[^2]:
.
.
.
[^99]:
I might not have 99 in a document but I wanted to show I need to match two digits here again.
As I said, there are two instances of these numbered references in the text. One in the main text pointing to the footnote and the footnote at the end of the document.
What I need is deleting the semi-colons from the main text and leave the
[^3]:
[^15]:
etc.
references at the end intact.
Because the main text references come after a word or at the end of a sentence (ususally before the sentence-ending period), there is never a case a reference would start a sentence (even if they seem to appear there once or twice because of word wrap).
I provided the exact opposite of my needs here:
Click here for Regex101 website link
I put in the exact opposite of what I want because I already knew of the
^
sign to match anything that is at the front of the line.
Now I would like to negate this, if possible, so that I would delete the semi-colons in the main text, not down at the bottom.
Of course, it is likely that my approach is not good and you'll come up with a completely different approach. Especially because there doesn't seem to be a NOT operator in Regex, if I read correctly.
I repeat: the Regex101 example with the match and substitution is exactly the opposite of what I want.
I am not sure if you can play around in the substitution line to get the desired negative effect.
I could have probably asked for removing the first occurence of semi-colons but I thought the important part of tackling the problem is that those items not to be matched are always at the start of the line, not the others.
Thanks for any suggestions

In Notepad++ you might use a negative lookabehind asserting not the start of the string to the left, and use \K to clear the match buffer matching only the colon that should be replaced by an empty string.
(?<!^)\[\^\d{1,2}]\K:
Explanation
(?<!^) Negative lookbehind, assert not the start of the start directly to the left
\[\^ Match [^
\d{1,2} Match 1 or 2 digits
] Match literally
\K Forget what is matched so far
: Match a colon
Regex demo

Related

Regex NotePad++ or batch script to find and replace double bracketed text with CR LF -- would prefer NP++

I managed to do most of my conversion in VBA Macro (Word > txt) but some changes were made also that I could not forego or get around. Unfortunately, I had not been in the habit of using styles and precise formatting in my docs... (Which is why a PanDoc conversion did not "pan" out well, if you'll excuse the pun.)
In my docs, I was using bold text/lines for in-text titles (not Heading 2 alas) but as I was converting mid-sentence one or two-word bold phrases into phrases to go between double square brackets, the makeshift titles/headings were also changed to [[some title]] format in the process.
With Find and Replace (a batch script that goes through all files in a folder would also do), I would like to search for each and any number of instances of CRLF [[some title CRLF]]CRLF and replace the brackets with ** (to make the title bold), or perhaps ## to make the headings I was missing back in MS Word (I would of course need the line breaks as well).
For better understanding, please see attached picture here:
I am fairly sure that all instances are similarly syntaxed. If not, I may be able to tailor your regex code to differing instances later on.
As you can see, I was trying to do it in two steps but that's not good, because the second step (which I couldn't even get right) would propably have altered other texts I need intact (there must be sentences that start with double brackets after CRLF).
I would need the two steps in one so that only the targeted double bracketed text would be changed to bold or Heading 2.
Basically what I could not do is: find the proper regex solution for matching double CRLF-ed and square-bracketed text for any number of words than may occupy more than one line and starts with a capital letter. I would need an empty line above and below the title as indicated in the image (the VBA macro somehow made two instances of CRLF and carried the brackets to a new line, which I do not like, either).
EDIT.
In the meantime I managed to cook something up but now I couldn't insert the CRLF in front of the match string. At this point this is not enough as other instances are also changed, even lowercase in-line items, for some reason...
Regex:
\[\[([A-Z][\S\s]+?)\]\]
Substitution:
## $1\r\n
https://regex101.com/r/mH6B9N/1
Since then, I made improvements towards what I wanted (I had to test in NotePad++ and not Regex101, for different results), but now in multiple documents I have found match across spill-over lines, as described in here:
Single line regex search in Notepad++
Is it possible that I cannot do what I want? The problem is having non-title text strings having line-break, double brackets and capitalized letters.
What it looks like in other documents:
See here.
I circled around with red in image for clarification. See also:
https://regex101.com/r/8XsIGx/1
Is it possible to match a certain word like "címnél" and not execute on that match if that word is present in a line?
Thanks very much in advance,
F.
You can use
(?s)\R\K\[\[((?:(?!\[\[|]]).)*)\R*]](?=\R)
Replace with ## $1. See the regex demo.
Details:
(?s) - equivalent of the . matches newline option
\R - a line break sequence
\K - omit the text matched so far (the newlines)
\[\[ - a [[ text
((?:(?!\[\[|]]).)*) - Group 1: any char, as many as possible occurrences, that does not start a [[ or ]] char sequence
\R* - zero or more line breaks
]] - a ]] text
(?=\R) - immediately to the right, there must be a line break.

Regex: Replace double double quotes (solved), but only in lines that contain a special string (subcondition unsolved)

1. Summary of the problem
I have a csv file where I want to replace normal quotes in text with typographic ones.
It was hard (because HTML is also included), but I have meanwhile created a good regex expression that does just the right thing: in three "capturing groups" I find the left and right quotation marks and the text inside. Replacing then is a piece of cake.
2. Regex engine
I can use the regex engine of Notepad++ (boost) or PCRE2 comaptible, for developping and testing purposes I have used https://regex101.com.
3. What I'm having a hard time with and just can't get right, where I need your help is here:
I want to add a sub condition, in order to find the text in quotes only in certain lines, want to identify these lines by the language, e.g. ENGLISH or FRENCH (see also example in the screenshot).
Screenshot of a sample
The string indicating the language is always in the same line before the text to be found, BUT only the text in quotes (main condition) should be marked after matching the sub condition, so that I will be able to replace them.
It is about a few thousand records in the csv file, in the worst case I could also replace it manually. But I'm pretty sure that this should also work via regex.
4. What I have tried
Different approaches with look arounds and non-capturing groups didn't lead me to the desired result - possibly because I didn't really understand how they work.
An example can be found here: https://regex101.com/r/ketwwm/1
The example can be found here, it only contains the regex expression to match and mark the (three) groups WITHOUT the searched subcondition:
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Hopefully anyone in the community could help? (Hopefully I have not missed anything, it's my first post here )
5. Update 03/18/2022: Almost resolved with two slightly different approaches (thank you all!) What is still unsolved ..
Solution of #Thefourthbird (see answer 1)
^(?!.?"ENGLISH")[^"]".*(SKIP)(F)|("")([^<>]?)("")(?=(?:[^>]?(?:<|$)))
Nearly perfect, just missing matches in an HTML section. HTML sections in the csv file are always enclosed by double quotes and may have line feeds (LF). https://regex101.com/r/x5shnx/1
Solution of #Wiktor Stribiżew (see in comments below)
^.?"ENGLISH".?\K("")([^<>]?)("")(?=(?:[^>]?(?:<|$)))
The same with matches in HTML sections, see above. Plus: Doesn't match text in double double quotes if more than one such entry occurs within a text. https://regex101.com/r/I4NTdb/1
Screenshot (only to illustrate)
If you want to match multiple occasions, you can use SKIP matching all lines that do not start with FRENCH:
^"(?!FRENCH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The pattern matches:
^ Start of string
" Match literally
(?!FRENCH") Negative lookhead, assert not FRENCH" directly to the right
[^"]*" Match any char except " and match "
.*(*SKIP)(*F) Match the rest of the line and skip it
| Or
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$))) Your current pattern
Regex demo

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The cursive parts are variable, might contains spaces or a new line might start from this point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or break/new line as I am unable to predict what could be between the fixed words tha I am sure will always be there.
Now comes the hard part, when I would add the word "On" somewhere in the text above the sentence that I want to match, the regex now matches a much bigger text than I want. This is due to the use of [\s\S]+.
How am I able to make my regex match as less characters as possible? Using "?" before the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real life, with html tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. Although, in html, barely possible to occur as the user would never write the matching tag himself, I do want to make sure my regex is correctjust in case
These things are certain: I know with what the part of text starts, no matter where in respect to the entire text, I know with what the part of text ends, and there are specific fixed words that might make the regex more reliable, but they can be ommitted. Any text below the searched part is also allowed to be matched, but no text above may be matched at all
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches all character because of union of two complementary sets, it is like . with special option /s (dot matches newlines). and regex are greedy by default so the largest match will be returned.
Following correct link, the token just after the shortest match must be geschreven, so another way to write without using lazy expansion, which is more flexible is to prepend the repeated chracter set by a negative lookahead inside loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is for non capturing the group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead which ensures current position before next character is not the beginning of end token.
Following comments it can be explicitly mentioned what should not appear in repeating sequence :
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
to understand what the technic (?:(?!tokens)[\s\S])+ does.
in the first this can't appear between From and this
in the second From or this can't appear between From and this
in the third this or point can't appear between this and point
etc.

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It catches starting from the second line of my example until the end of the quoting environment, I guess the greedy quantifier would catch up to the last \end{quoting} in the file. Is there any possibility of doing this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\#!\_.\)\{-} instead of \_.\{-}; this means that the pattern cannot contain multiple \\}% sequences, thus achieving your aims (as far as I can determine them).
This uses a negative zero-width look-ahead pattern \#! to ensure that the next match for any character, is limited to not match the specific text we want to avoid (but other than that, anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\#!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
You’re on the right track using the non-greedy multi. The Vim help files
state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.