Regex: remove lines not starting with a digit - regex

I have been fighting this problem with the help of a regex cheat sheet, trying to figure out how to do this, but I give up... I have a lengthy file open in Notepad++ and would like to remove all lines that do not start with a digit (0..9). I would use the Find/Replace functionality of N++. I am only mentioning this because I am not sure which regex implementation N++ uses... Thank you
Example. From the following text:
1hello
foo
2world
bar
3!
I would like to extract
1hello
2world
3!
not:
1hello

2world

3!
by doing a find/replace on a regular expression.

You can clear those lines with ^[^0-9].* but it will leave blank lines.
Notepad++ uses Scintilla, including its regex engine, to match them. From the Scintilla documentation:
"\r and \n are never matched because in Scintilla, regular expression searches are made line per line (stripped of end-of-line chars)."
http://www.scintilla.org/SciTERegEx.html
To clear those blank lines, the only way is to choose Extended mode and replace \n\n with \n. If the file uses Windows line endings, replace \r\n\r\n with \r\n.

[^0-9] is a regular expression that matches pretty much any single character except a digit. If you write ^[^0-9], you "anchor" it to the start of the line in most regular expression systems. If you want to include the rest of the line, use ^[^0-9].+.

^[^\d].* matches a whole line whose first character is not a digit. Check that there really is no whitespace in front of the digits; otherwise you'd have to use a different expression.
UPDATE:
You will have to do it in two steps. First empty the lines that do not start with a digit. Then remove the empty lines in extended mode.
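Outside Notepad++, the two steps above can be collapsed into one by consuming the line break together with the line. The following is only a minimal sketch in Python, assuming the file content is already in a string; the helper name keep_digit_lines is made up for illustration, and Python's re module stands in for the Notepad++ engine.

```python
import re

def keep_digit_lines(text):
    """Drop every line whose first character is not 0-9 (hypothetical helper)."""
    # ^(?![0-9])   : at a line start, the next character is not a digit
    # .*           : the rest of the line (never crosses \n)
    # (?:\r?\n|\Z) : the line break as well, so no blank line is left behind
    return re.sub(r'^(?![0-9]).*(?:\r?\n|\Z)', '', text, flags=re.MULTILINE)

sample = "1hello\nfoo\n2world\nbar\n3!"
print(keep_digit_lines(sample))
```

Because an empty line does not start with a digit either, this sketch removes empty lines too.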

One could also use bookmarking in Notepad++. I started benefiting from this feature not very long ago (it has been present for a long time but was only recently made somewhat more visible in the UI).
Simply bring up the Find dialogue, type a regex for lines not starting with a digit, ^\D.*$, and select Mark All. This will place blue circles, like marbles, in the left gutter; these are line bookmarks. Then select Search -> Bookmark -> Remove bookmarked lines from the main menu.
Bookmarks are cool: you could also extract these lines by choosing to copy the bookmarked lines, opening a new document and pasting them there. I sometimes use this technique when reviewing log files.

I'm not sure what you are asking, but the regex for finding the lines with a digit at the beginning would be
^\d.*
You can keep all the lines that match the above, or alternatively remove all the lines that match this expression:
^[^\d].*

Related

Regex NotePad++ or batch script to find and replace double bracketed text with CR LF -- would prefer NP++

I managed to do most of my conversion in VBA Macro (Word > txt) but some changes were made also that I could not forego or get around. Unfortunately, I had not been in the habit of using styles and precise formatting in my docs... (Which is why a PanDoc conversion did not "pan" out well, if you'll excuse the pun.)
In my docs, I was using bold text/lines for in-text titles (not Heading 2 alas) but as I was converting mid-sentence one or two-word bold phrases into phrases to go between double square brackets, the makeshift titles/headings were also changed to [[some title]] format in the process.
With Find and Replace (a batch script that goes through all files in a folder would also do), I would like to search for each and any number of instances of CRLF [[some title CRLF]]CRLF and replace the brackets with ** (to make the title bold), or perhaps ## to make the headings I was missing back in MS Word (I would of course need the line breaks as well).
For better understanding, please see attached picture here:
I am fairly sure that all instances are similarly syntaxed. If not, I may be able to tailor your regex code to differing instances later on.
As you can see, I was trying to do it in two steps but that's not good, because the second step (which I couldn't even get right) would probably have altered other text I need intact (there must be sentences that start with double brackets after CRLF).
I would need the two steps in one so that only the targeted double-bracketed text is changed to bold or Heading 2.
Basically what I could not do is find the proper regex for matching text that is surrounded by CRLFs and double square brackets, starts with a capital letter, and consists of any number of words that may occupy more than one line. I would need an empty line above and below the title as indicated in the image (the VBA macro somehow made two instances of CRLF and carried the closing brackets to a new line, which I do not like, either).
EDIT.
In the meantime I managed to cook something up but now I couldn't insert the CRLF in front of the match string. At this point this is not enough as other instances are also changed, even lowercase in-line items, for some reason...
Regex:
\[\[([A-Z][\S\s]+?)\]\]
Substitution:
## $1\r\n
https://regex101.com/r/mH6B9N/1
Since then, I have made improvements towards what I wanted (I had to test in Notepad++ rather than Regex101, because the results differ), but now in multiple documents I have found matches that spill over across lines, as described here:
Single line regex search in Notepad++
Is it possible that I cannot do what I want? The problem is having non-title text strings having line-break, double brackets and capitalized letters.
What it looks like in other documents:
See here.
I circled around with red in image for clarification. See also:
https://regex101.com/r/8XsIGx/1
Is it possible to match a certain word like "címnél" and not execute on that match if that word is present in a line?
Thanks very much in advance,
F.
You can use
(?s)\R\K\[\[((?:(?!\[\[|]]).)*)\R*]](?=\R)
Replace with ## $1. See the regex demo.
Details:
(?s) - equivalent of the . matches newline option
\R - a line break sequence
\K - omit the text matched so far (the newlines)
\[\[ - a [[ text
((?:(?!\[\[|]]).)*) - Group 1: any char, as many as possible occurrences, that does not start a [[ or ]] char sequence
\R* - zero or more line breaks
]] - a ]] text
(?=\R) - immediately to the right, there must be a line break.
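For a batch run over many files, the same idea can be approximated in a script. This is only a sketch in Python, whose re module supports neither \R nor \K: a lookbehind for \n replaces \R\K, and \r?\n replaces \R. Group 1 is also made lazy here so that it does not swallow the line breaks before ]]. The names TITLE and brackets_to_heading are hypothetical.

```python
import re

# Approximation of the Notepad++ pattern for Python's re (no \R, no \K).
TITLE = re.compile(
    r'(?<=\n)\[\['               # [[ directly after a line break
    r'((?:(?!\[\[|\]\]).)*?)'    # Group 1: text that contains no [[ or ]]
    r'(?:\r?\n)*\]\]'            # optional line breaks, then ]]
    r'(?=\r?\n)',                # ]] must be followed by a line break
    re.DOTALL,                   # let . match line breaks, like (?s)
)

def brackets_to_heading(text):
    """Turn a [[title]] that sits on its own line(s) into '## title'."""
    return TITLE.sub(r'## \1', text)

doc = "intro\r\n[[Some Title\r\n]]\r\nbody [[inline]] text\r\n"
print(brackets_to_heading(doc))
```

Bracketed phrases in the middle of a line are left alone, because the match must start right after a line break and ]] must end its line.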

Regex: Replace double double quotes (solved), but only in lines that contain a special string (subcondition unsolved)

1. Summary of the problem
I have a csv file where I want to replace normal quotes in text with typographic ones.
It was hard (because HTML is also included), but I have meanwhile created a good regex expression that does just the right thing: in three "capturing groups" I find the left and right quotation marks and the text inside. Replacing then is a piece of cake.
2. Regex engine
I can use the regex engine of Notepad++ (Boost) or a PCRE2-compatible one; for developing and testing purposes I have used https://regex101.com.
3. What I'm having a hard time with and just can't get right, where I need your help is here:
I want to add a sub condition, in order to find the text in quotes only in certain lines, want to identify these lines by the language, e.g. ENGLISH or FRENCH (see also example in the screenshot).
Screenshot of a sample
The string indicating the language is always in the same line before the text to be found, BUT only the text in quotes (main condition) should be marked after matching the sub condition, so that I will be able to replace them.
It is about a few thousand records in the csv file, in the worst case I could also replace it manually. But I'm pretty sure that this should also work via regex.
4. What I have tried
Different approaches with look arounds and non-capturing groups didn't lead me to the desired result - possibly because I didn't really understand how they work.
An example can be found here: https://regex101.com/r/ketwwm/1
It only contains the regex that matches and marks the (three) groups WITHOUT the sub-condition I am looking for:
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Hopefully anyone in the community could help? (Hopefully I have not missed anything, it's my first post here )
5. Update 03/18/2022: Almost resolved with two slightly different approaches (thank you all!) What is still unsolved ..
Solution of @Thefourthbird (see answer 1)
^(?!.*?"ENGLISH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
Nearly perfect, just missing matches in an HTML section. HTML sections in the csv file are always enclosed by double quotes and may have line feeds (LF). https://regex101.com/r/x5shnx/1
Solution of @Wiktor Stribiżew (see in comments below)
^.*?"ENGLISH".*?\K("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The same problem with matches in HTML sections, see above. Plus: it doesn't match text in double double quotes if more than one such entry occurs within a text. https://regex101.com/r/I4NTdb/1
Screenshot (only to illustrate)
If you want to match multiple occurrences, you can use (*SKIP)(*F) to skip all lines that do not start with "FRENCH":
^"(?!FRENCH")[^"]*".*(*SKIP)(*F)|("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))
The pattern matches:
^ Start of string
" Match literally
(?!FRENCH") Negative lookahead, assert not FRENCH" directly to the right
[^"]*" Match any char except " and match "
.*(*SKIP)(*F) Match the rest of the line and skip it
| Or
("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$))) Your current pattern
Regex demo
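Where (*SKIP)(*F) is not available (Python's re module, for instance, has no backtracking control verbs), a similar effect can be had by filtering lines first and applying the quote pattern only to the selected ones. The sketch below is a swapped-in technique, not the pattern above. It assumes the language is the first field of each line, that records do not span several lines, and that the typographic quotes wanted are “ and ”; the name curl_quotes is hypothetical.

```python
import re

# The asker's pattern for text wrapped in doubled double quotes, unchanged.
QUOTED = re.compile(r'("")([^<>]*?)("")(?=(?:[^>]*?(?:<|$)))')

def curl_quotes(csv_text, language="FRENCH"):
    """Replace ""text"" with typographic quotes, only in lines of one language."""
    out = []
    for line in csv_text.split("\n"):
        # Touch only lines whose first field is "<language>" (an assumption).
        if line.startswith('"' + language + '"'):
            line = QUOTED.sub('\u201c\\2\u201d', line)
        out.append(line)
    return "\n".join(out)

rows = '"ENGLISH","He said ""hi"" to me"\n"FRENCH","Il a dit ""salut"" ici"'
print(curl_quotes(rows))
```

Quoted HTML fields that contain line feeds are not handled by this line-by-line approach.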

Deleting every 2nd line from a file using Notepad++

I am looking for some regex help.
I have a textfile, nothing super important but I would like to delete every second line from it - I have tried following this guide: Delete every other line in notepad++
However I just can't get it to work, is the regex I am using ok? I am noob with regex
Find:
([^\n]*\n)[^\n]*\n
Replace with:
$1
No matter what I try (mouse position at the beginning, ctrl+a and Replace All) I just can't get it to work. I appreciate any help.
I've put the regex into here: http://regexpal.com/ and if I remove the final \n it highlights the individual rows.
Make sure you select regular expression for the search mode...
Also, you may want to make that final newline optional. In the case that there are an even number of lines and you do not have a trailing newline, it won't remove the last line.
([^\n]*\n)[^\n]*\n?
Update:
Note that Windows uses \r\n for new lines instead of just \n. Try updating the expression to take this into account:
([^\r\n]*[\r\n]+)[^\r\n]*[\r\n]*
Final Update:
Thanks to @zx81, I now know that N++ uses PCRE, so \R can be used to match Unicode newline sequences. However, [^\R] won't work (it looks for anything except a literal R), so you will need to keep [^\r\n]. This can be simplified as:
([^\r\n]*\R)[^\r\n]*\R?
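The same find/replace can be tried out in a script. This is a minimal sketch in Python, where \R is not available, so a hand-written alternation covering only \r\n, \r and \n stands in for it; NEWLINE and drop_every_second_line are hypothetical names.

```python
import re

# Stand-in for \R, which Python's re lacks; covers \r\n, \r and \n only.
NEWLINE = r'(?:\r\n|\r|\n)'

def drop_every_second_line(text):
    """Keep lines 1, 3, 5, ... and delete lines 2, 4, 6, ..."""
    # Group 1 is the line to keep (with its line break); the line after it
    # and its optional line break are matched and thrown away.
    pattern = r'([^\r\n]*' + NEWLINE + r')[^\r\n]*' + NEWLINE + r'?'
    return re.sub(pattern, r'\1', text)

print(drop_every_second_line("a\nb\nc\nd\n"))
```

As in the answer, the final line break is optional so that an even-numbered last line without a trailing newline is still removed.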

Regular Expression Search Replace all non leading tabs with single space Notepad++

Regular expressions have never been my strong suit, so I need some help here. I have a text file and I want to replace any run of "embedded" tabs with a single space, however many tabs the run contains, but leave any "leading" tabs alone.
So for a line that looks like this:
\t\t\tThis is a\t\ttest to see\thow things\t will work.
would come out looking like this:
\t\t\tThis is a test to see how things will work.
So the only tabs left in the file would be at the beginning of any lines and there could be x number of tabs at the beginning of any line. Can anybody help me figure this one out?
I'm doing this with NotePad++ Search/Replace but I could use Visual Studio or some other tool if that would work better.
Find what:
(?<!\t)(?!^)\t+
The sequence of tabs \t+ must not be preceded by a tab (?<!\t), and also must not start from the beginning of a line (?!^).
Replace with:
<space>
Demo on regex101 (since Notepad++ also uses PCRE, I use t instead of tab for clarity)
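The pattern also works unchanged in Python's re module as long as multiline mode is on, so that ^ matches at every line start. A minimal sketch follows; the function name collapse_inner_tabs is made up for illustration.

```python
import re

def collapse_inner_tabs(text):
    """Replace each run of non-leading tabs with a single space."""
    # (?<!\t) : the run is not preceded by another tab (start of the run)
    # (?!^)   : the run does not begin at the start of a line
    # \t+     : the whole run of tabs
    return re.sub(r'(?<!\t)(?!^)\t+', ' ', text, flags=re.MULTILINE)

sample = "\t\tkey\t\tvalue\tend\n\tnext\tline"
print(repr(collapse_inner_tabs(sample)))
```

Only the tabs themselves are replaced, so a tab run that is followed by a literal space still ends up as two spaces.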

Remove everything before and after variable=int

I'm terrible at regex and need to remove everything from a large portion of text except for a certain variable declaration that occurs numerous times. I'd like to remove everything except for instances of mc_gross=anyint.
Generally we'd need to use "negative lookarounds" to find everything but a specified string. But these are fairly inefficient (although that's probably of little concern to you in this instance), and lookaround is not supported by all regex engines (not sure about notepad++, and even then probably depends on the version you're using).
If you're interested in learning about that approach, refer to How to negate specific word in regex?
But regardless, since you are using notepad++, I'd recommend selecting your target, then inverting the selection.
The following regex will match each instance, allowing for optional whitespace on either side of the '=' sign.
mc_gross\s*=\s*\d+
The following answer over on super user explains how to use bookmarks in notepad++ to achieve the "inverse selection":
https://superuser.com/questions/290247/how-to-delete-all-line-except-lines-containing-a-word-i-need
Substitute the regex they're using over there, with the one above.
You could do a regular expression replace of ^.*\b(mc_gross\s*=\s*\d+)\b.*$ with \1. That will remove everything other than the wanted text on each line. Note that on lines where the wanted text occurs two or more times, only one occurrence will be retained. In the search the ^.*\b matches from start-of-line to a word boundary before the wanted text; the \b.*$ matches everything from a word boundary after the wanted text until end of line; the round brackets capture the wanted text for the replacement text. If text such as abcmc_gross=13def should be matched and retained as mc_gross=13 then delete the \bs from the search.
To remove unwanted lines do a regular expression search for ^mc_gross\s*=\s*\d+$ from the Mark tab, tick Bookmark line and click Mark all. Then use Menu => Search => Bookmark => Remove unmarked lines.
Find what: [\s\S]*?(mc_gross=\d+|\Z)
Replace with: \1
Position the cursor at the start of the text then Replace All.
Add word boundaries \b around mc_gross=\d+ if you think it's necessary.
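If a script is an option, collecting the matches directly avoids the need to delete everything else. This is a minimal sketch in Python using the mc_gross\s*=\s*\d+ pattern from the first answer; the function name extract_mc_gross is hypothetical.

```python
import re

def extract_mc_gross(text):
    """Return every mc_gross=<int> occurrence, in order of appearance."""
    return re.findall(r'mc_gross\s*=\s*\d+', text)

sample = "a=1&mc_gross=12&b=2\nmc_gross = 7 trailing"
print(extract_mc_gross(sample))
```

Unlike the per-line replace above, this keeps every occurrence even when a line contains more than one.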