Notepad++ and regex (multiline)

I have been facing a challenge. I have a text file with the following pattern:
SOME RANDOM TITLE IN CAPS (nnnn)
text text text
more text
...
SOME OTHER RANDOM TITLE IN CAPS (nnnn)
What is certain is that the lines I want to extract end with a year in brackets, e.g. (2015) or (2008).
After the (nnnn) there is no text, sometimes space and CR LF, sometimes just CR LF
I would like to delete everything else and keep just the TITLE LINE with the brackets
In the time I have spent I could have done it by hand (there are 100 lines), but I like the challenge :)
I thought I could find the issue but I am stuck.
I have tried something along this line:
^.*\(\d\d\d\d\)(?s)(.*)(^.*\(\d\d\d\d\))
But I don't get what I want. I can't seem to stop the (?s)(.*) going all the way to the end of the text instead of stopping at the next occurrence.

I suggest using the Search > Mark feature. Use a pattern like \(\d{4}\) and check the "Bookmark Line" option then click "Mark All". Then use Search > Bookmark > Remove Unmarked Lines. This will remove all lines except the ones that have matched your pattern.
Note: If it's possible to have parentheses with 4 digits within your other lines you could add $ to the end of the expression to ensure that the pattern only matches the end of the line. E.g. more text (1234) and other stuff would be matched by the pattern I gave above but if you use pattern \(\d{4}\)$ it will no longer match.
If you want to be even more specific with your pattern by looking for those lines with only uppercase letters and spaces followed by parentheses with 4 digits inside where the parentheses are at the end of the line, then you could use a pattern like this: [A-Z ]+\(\d{4}\)$
Sample input:
SOME RANDOM TITLE IN CAPS (2008)
text text text
more text
...
SOME OTHER RANDOM TITLE IN CAPS (2010)
To mark the lines, open Search > Mark, enter the pattern, check "Bookmark Line" and click "Mark All"; the two title lines are then bookmarked.
Now use Search > Bookmark > Remove Unmarked Lines, and only the two title lines remain.
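For a scripted equivalent outside Notepad++, the same keep-only-matching-lines idea can be sketched in Python. This is only a sketch, not what Notepad++ does internally: the function name keep_title_lines and the sample text are made up for illustration, and the pattern is the stricter one suggested above, with optional trailing blanks allowed.

```python
import re

# Uppercase letters/spaces, then "(nnnn)", then optional trailing blanks.
TITLE_RE = re.compile(r"[A-Z ]+\(\d{4}\)[ \t]*$")

def keep_title_lines(text):
    """Return only the lines that end with a 4-digit number in parentheses."""
    # splitlines() handles both CR LF and LF line endings.
    return [line for line in text.splitlines() if TITLE_RE.search(line)]

sample = (
    "SOME RANDOM TITLE IN CAPS (2008)\r\n"
    "text text text\r\n"
    "more text (1234) and other stuff\r\n"
    "...\r\n"
    "SOME OTHER RANDOM TITLE IN CAPS (2010) \r\n"
)
titles = keep_title_lines(sample)
print("\n".join(titles))
```

The line containing (1234) in the middle is dropped because the pattern is anchored to the end of the line.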

The following regex matches the 2 lines with brackets containing 4 digits:
.*?\(\d{4}\)\s*
It lazily matches any characters from the start (zero or more times, non-greedy), then an opening bracket, 4 digits and a closing bracket, and finally any trailing whitespace, including the line break.
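As a quick check of that pattern outside the editor, here is a small Python sketch. It assumes Python's re module, where the dot does not match a newline by default, so each match stays within one line; the sample text is the one from the question.

```python
import re

text = (
    "SOME RANDOM TITLE IN CAPS (2008)\n"
    "text text text\n"
    "more text\n"
    "...\n"
    "SOME OTHER RANDOM TITLE IN CAPS (2010)\n"
)

# Each match includes the trailing whitespace/newline, so strip it off.
matches = re.findall(r".*?\(\d{4}\)\s*", text)
titles = [m.strip() for m in matches]
print(titles)
```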

If you want to remove all lines except the ones that end with a 4-digit number in brackets, you can try this:
^(?!.*\(\d{4}\)\h*$).*(?:\r?\n|\z)
Replace by: (nothing)
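The same negative-lookahead idea can be tried in Python, with two adaptations that are assumptions of this sketch rather than part of the answer: Python's re has no \h, so [ \t] is used instead, and \Z stands in for \z. Because Python's $ only matches before \n, an optional \r is added so CR LF files also work. The name drop_non_title_lines is hypothetical.

```python
import re

# A line is removed unless it ends with "(nnnn)" plus optional blanks.
NON_TITLE_LINE = re.compile(
    r"^(?!.*\(\d{4}\)[ \t]*\r?$).*(?:\r?\n|\Z)",
    re.MULTILINE,
)

def drop_non_title_lines(text):
    """Delete every line that does not end with a bracketed 4-digit number."""
    return NON_TITLE_LINE.sub("", text)

unix_text = "A TITLE (2015)\ntext\nmore text\nB TITLE (2008) \nlast line"
windows_text = "A TITLE (2015)\r\ntext\r\n"

print(repr(drop_non_title_lines(unix_text)))
print(repr(drop_non_title_lines(windows_text)))
```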

Related

Regular expression for selecting trailing whitespace except first space after last character in line

I'm editing text in Atom.
Beginning with the regex, $\s , I haven't been able to figure out how to anchor my selection from the second blank space after the line.
I want to remove the thousands of line returns in a text file ( originally formatted as an .srt video transcript ) and replace them with a single, blank space so as to not join together any words.
For example, my file looks like this:
This content is
difficult to read
because the lines
break after too
few characters.
$\s will select all trailing whitespace, something that I don't want to do, because if I delete all the space selected by that regex then I will cause lots of words to join up into nonsense.
I want to start trimming the trailing whitespace of each line from the second blank space, not the first, so that the expected output would be:
"This content is difficult to read because the lines break after too few characters."
Instead of:
"This content isdifficult to readbecause the linesbreak after toofew characters."
I have solved this problem using MS Word's Find and Replace; substituting a single space ( by literally hitting the space bar once ) for all the hard returns ( enter ^p in the Find field ).
I don't know why the Atom regex engine wasn't recognising the answer provided in the comments from regex101.com? It solves my problem in the regex101 tester.
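For comparison, the intended transformation (each line break replaced by exactly one space) can be sketched in Python. This is not an Atom-specific answer; the function name unwrap is made up, and the pattern also absorbs blanks around the line break so that no double spaces are left behind.

```python
import re

def unwrap(text):
    """Replace each line break, plus any blanks around it, with one space."""
    return re.sub(r"[ \t]*\r?\n[ \t]*", " ", text).strip()

transcript = (
    "This content is\n"
    "difficult to read\n"
    "because the lines \r\n"
    "break after too\n"
    "few characters.\n"
)
print(unwrap(transcript))
```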

How do you "quantify" a variable number of lines using a regexp?

Say you know the starting and ending lines of some section of text, but the chars in some lines and the number of lines between the starting and ending lines are variable, á la:
aaa
bbbb
cc
...
...
...
xx
yyy
Z
What quantifier do you use, something like:
aaa\nbbbb\ncc\n(.*\n)+xx\nyyy\nZ\n
to parse those sections of text as a group?
You can use the s flag to match multiline text, like this:
~\w+ ~s.
There is a similar question here:
Javascript regex multiline flag doesn't work
If I understood correctly, you know that your text begins with aaa\nbbbb\ncc and ends with xx\nyyy\nZ\n. You could use aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z so that all quantifiers are non-greedy and you don't accidentally capture two sections at once. The text in between would be in match group 1. You also need to enable the setting that makes the dot match newlines.
Try this:
aaa( |\n)bbbb( |\n)cc( |\n)( |\n){0,1}(.|\n)*xx( |\n)yyy( |\n)Z
( |\n) matches a space or a newline (so your starting and ending phrases can be split into different lines)
At the end of the day what worked for me using Kate was:
( )+aaa\n( )+bbbb\n( )+cc\n(.|\n)*( )+xx\n( )+yyy\n( )+Z\n
using such regexps you can clear pages of quite a bit of junk.
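The difference between a greedy and a non-greedy quantifier for the variable middle part can be seen in a short Python sketch. It assumes Python's re module with the DOTALL flag (the "dot matches newline" setting) and a made-up sample text containing two sections.

```python
import re

text = (
    "aaa\nbbbb\ncc\n"
    "line 1\nline 2\n"
    "xx\nyyy\nZ\n"
    "junk between sections\n"
    "aaa\nbbbb\ncc\n"
    "line 3\n"
    "xx\nyyy\nZ\n"
)

# Non-greedy: stops at the first closing marker after each opening marker.
LAZY = re.compile(r"aaa\nbbbb\ncc\n(.*?)xx\nyyy\nZ\n", re.DOTALL)
# Greedy: runs on to the last closing marker in the whole text.
GREEDY = re.compile(r"aaa\nbbbb\ncc\n(.*)xx\nyyy\nZ\n", re.DOTALL)

lazy_bodies = LAZY.findall(text)
greedy_bodies = GREEDY.findall(text)
print(lazy_bodies)
print(len(greedy_bodies))
```

With the lazy form each section is captured separately; with the greedy form both sections and the junk between them end up in a single match.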

Issue with selecting items using regex

I have text separated by white spaces and a search range of more than 1000 words.
Approximately 70% of the words follow the pattern foo-bar-...-N, where the number of "-"-separated parts after foo-bar is unknown. The words are separated from each other by a blank space.
What I need is for the script to select everything after the foo-bar up until the blank space.
I know how to select whole thing, but not how to get solution for my issue.
Here is some example for my idea:
foo-bar foo-bar-thing foo-bar-stuff-my-gosh ... foo-bar-for-educational-purposes
And regex should select them like so:
[foo-bar] [foo-bar]-thing [foo-bar]-stuff-my-gosh ... [foo-bar]-for-educational-purposes
You want the regex to match a phrase and extract a substring from it.
To do that you need a group.
So here is the pattern you want:
foo-bar([\w-]*)
The pattern ends with a literal space; don't forget it. You need to set the global flag, as you can see in the demo. Your string also has to end with a space if you want to match the last word. If the text spans multiple lines, don't forget the multiline flag too.
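A minimal Python sketch of the capture group, with two deliberate differences from the answer: the trailing space is left out of the pattern, so the last word matches even without a space after it, and re.findall returns every match, so no global flag is needed.

```python
import re

text = "foo-bar foo-bar-thing foo-bar-stuff-my-gosh foo-bar-for-educational-purposes"

# Group 1 holds whatever follows "foo-bar" up to the next blank.
suffixes = re.findall(r"foo-bar([\w-]*)", text)
print(suffixes)
```

The first entry is an empty string because the first word is exactly foo-bar.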

How to group lines of text using Notepad++

I find Notepad++ regex to be very different from regex in Microsoft Word. I was wondering how I can group several lines of text using Notepad++. I have a text file with 100+ URLs. They are written one URL address per line. I would like to group all of them by tens by removing the carriage returns from every first to 9th line, but retaining the carriage return on every 10th line and adding another carriage return thereafter. For example:
I want this:
http://website1.com
http://website2.com
http://website3.com
http://website4.com
http://website5.com
http://website6.com
http://website7.com
http://website8.com
http://website9.com
http://website10.com
http://website11.com
http://website12.com
http://website13.com
http://website14.com
http://website15.com
http://website16.com
http://website17.com
http://website18.com
http://website19.com
http://website20.com
http://website21.com
http://website22.com
http://website23.com
http://website24.com
http://website25.com
http://website26.com
http://website27.com
http://website28.com
http://website29.com
http://website30.com
to look like:
http://website1.comhttp://website2.comhttp://website3.comhttp://website4.comhttp://website5.comhttp://website6.comhttp://website7.comhttp://website8.comhttp://website9.comhttp://website10.com
http://website11.comhttp://website12.comhttp://website13.comhttp://website14.comhttp://website15.comhttp://website16.comhttp://website17.comhttp://website18.comhttp://website19.comhttp://website20.com
http://website21.comhttp://website22.comhttp://website23.comhttp://website24.comhttp://website25.comhttp://website26.comhttp://website27.comhttp://website28.comhttp://website29.comhttp://website30.com
Any help would be appreciated!
Ok, I have found a way:
It is possible, but only with 6 entries per row (a longer regex is not parsed by Notepad++).
1) Open the file and remove all newline characters, so that the text becomes one very long line.
2) Open the Replace dialog and enter the following in the "Find what" field:
(http://[^\:]*\.comhttp://[^\:]*\.comhttp://[^\:]*\.comhttp://[^\:]*\.comhttp://[^\:]*\.comhttp://[^\:]*\.com)
and the following in the "Replace with" field:
\1\r\n
Put the cursor at the first position in the text and press "Replace All".
The regex is equivalent to (http://[^\:]*\.com){6}, i.e. the URL pattern repeated 6 times. If you work on Unix and need Unix-style line endings, replace \1\r\n with \1\n.
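If a script is acceptable, grouping by tens does not need a regex at all. The following Python sketch swaps in plain list slicing instead of the Notepad++ replace above; the function name group_lines and the generated URL list are assumptions for illustration, and the URLs are joined with no separator, as in the expected output of the question.

```python
def group_lines(lines, size=10):
    """Join every `size` consecutive lines into one line, with no separator."""
    return ["".join(lines[i:i + size]) for i in range(0, len(lines), size)]

# Stand-in for the lines read from the text file.
urls = [f"http://website{n}.com" for n in range(1, 31)]
grouped = group_lines(urls)
print("\n".join(grouped))
```

A final group with fewer than ten entries is kept as a shorter line.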

Removing empty lines in Notepad++

How can I replace empty lines in Notepad++? I tried a find and replace with the empty lines in the find, and nothing in the replace, but it did not work; it probably needs regex.
There is now a built-in way to do this as of version 6.5.2
Edit -> Line Operations -> Remove Empty Lines or Remove Empty Lines (Containing Blank characters)
You need something like a regular expression.
You have to be in Extended mode
If you want all the lines to end up on a single line use \r\n. If you want to simply remove empty lines, use \n\r as @Link originally suggested.
Replace either expression with nothing.
There is a plugin that adds a menu entitled TextFX. This menu, which houses a dizzying array of quick text editing options, gives a person the ability to make quick coding changes. In this menu, you can find selections such as Drop Quotes, Delete Blank Lines as well as Unwrap and Rewrap Text
Do the following:
TextFX > TextFX Edit > Delete Blank Lines
TextFX > TextFX Edit > Delete Surplus Blank Lines
In Notepad++:
Press Ctrl+H
Select "Regular expression"
Enter ^[ \t]*$\r?\n into "Find what" and leave "Replace with" empty. This matches every line that contains only spaces or tabs (or nothing at all), together with its line ending (\r?\n covers both Windows CR LF and Unix LF).
Click the Find Next button to see for yourself how it matches only empty lines.
Press ctrl + h (Shortcut for replace).
In the Find what zone, type ^\R ( for exact empty lines) or ^\h*\R ( for empty lines with blanks, only).
Leave the Replace with zone empty.
Check the Wrap around option.
Select the Regular expression search mode.
Click on the Replace All button.
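The effect of ^\h*\R can be reproduced in Python for testing, under the assumption that [ \t] and \r?\n are acceptable stand-ins, since Python's re module supports neither \h nor \R. The name remove_blank_lines is hypothetical.

```python
import re

# A line made only of blanks (or nothing), including its line ending.
BLANK_LINE = re.compile(r"^[ \t]*\r?\n", re.MULTILINE)

def remove_blank_lines(text):
    """Delete empty lines and lines containing only spaces or tabs."""
    return BLANK_LINE.sub("", text)

sample = "one\r\n\r\n \t\r\ntwo\r\n\r\nthree"
print(repr(remove_blank_lines(sample)))
```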
You can use the following settings:
Find what: ^\r\n
Replace with: keep this empty
Search Mode: Regular expression
Wrap around: selected
NOTE: for *nix files just find by \n
This worked for me:
Press ctrl + h (Shortcut for replace)
Write one of the following regex in find what box.
[\n\r]+$ or ^[\n\r]+
Leave Replace with box blank
In Search Mode, select Regex
Click on Replace All
Done!
In Notepad++ press CTRL+H, select the "Extended (\n, \r, \t ...)" radio button under Search Mode, then type \r\n (short for CR LF) in the "Find what" box and leave the "Replace with" box empty.
Finally, hit Replace All.
Well I'm not sure about the regex or your situation..
How about CTRL+A, then select the TextFX menu -> TextFX Edit -> Delete Blank Lines, and voilà, all blank lines are gone.
A side note - if the line is blank i.e. does not contain spaces, this will work
1) Ctrl + H (or Search > Replace...) to open the Replace window.
2) Select 'Search Mode' 'Regular expression'
3) In 'Find What' type ^(\s*)(.*)(\s*)$ & in 'Replace With' type \2
^ - Matches start of line character
(\s*) - Matches empty space characters
(.*) - Matches any characters
(\s*) - Matches empty spaces characters
$ - Matches end of line character
\2 - Denotes the matched content of the 2nd bracket (group)
Refer to https://www.rexegg.com/regex-quickstart.html for more on regex.
You can search for the following regex: ^(?:[\t ]*(?:\r?\n|\r))+ and replace it with empty field
Ctrl+H.
find - \r\r
replace with - \r.
This obviously does not work if the blank lines contain tabs or blanks. Many web pages (e.g. http://www.guardian.co.uk/) contain these white lines, as a result of a faulty HTML editor.
Remove white space using regular expression as follows:
change pattern: [\t ]+$
into nothing.
where [\t ] matches either tab or space. '+' matches one or more occurrences, and '$' marks the end of line.
Then use notepad++/textFX to remove single or extra empty lines.
Be sure that these blank lines are not significant in the given context.
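The trailing-whitespace pattern can be checked with a short Python sketch. One adaptation is an assumption of this sketch: Python's $ matches only before \n, so a lookahead with an optional \r is used in place of a bare $ to make CR LF files work as well. The name strip_trailing_blanks is made up.

```python
import re

def strip_trailing_blanks(text):
    """Remove tabs and spaces at the end of every line."""
    return re.sub(r"[\t ]+(?=\r?$)", "", text, flags=re.MULTILINE)

unix_text = "a \t\n\t\nb  \n"
windows_text = "a \t\r\n  \r\nb\r\n"

print(repr(strip_trailing_blanks(unix_text)))
print(repr(strip_trailing_blanks(windows_text)))
```

After this step the whitespace-only lines are truly empty, so a plain empty-line removal can deal with them.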
1) Edit >> Blank Operations >> Trim Leading and Trailing Spaces (to remove tabs and spaces in otherwise empty lines)
2) Ctrl + H to open the Replace window and replace the pattern ^\r\n with nothing (select Regular expression)
Note: step 1 will also remove any code indentation done via tabs and spaces
Sometimes \n\r and similar patterns do not work; here is a way to figure out what your actual regular expression should be.
Advantage of this trick: if you want to replace in multiple files at once, you need this method. The approaches above will not work in that case.
CTRL+A, Select the TextFX menu -> TextFX Edit -> Delete Blank Lines as suggested above works.
But if lines contains some space, then move the cursor to that line and do a CTRL + H. The "Find what:" sec will show the blank space and in the "Replace with" section, leave it blank.
Now all the spaces are removed and now try CTRL+A, Select the TextFX menu -> TextFX Edit -> Delete Blank Lines
\n\r assumes a specific type of line break. To target any blank line you could also use:
^$
This says - any line that begins and then ends with nothing between. This is more of a catch-all. Replace with the same empty string.
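One caveat when trying this outside Notepad++: in Python's re module ^$ matches a zero-width position, so substituting it with an empty string leaves the line break in place. A line-ending-agnostic alternative, swapped in here as a sketch with the made-up name drop_blank_lines, is to split the text into lines and keep the non-blank ones. Note that it normalizes all line endings to \n.

```python
def drop_blank_lines(text):
    """Keep only lines that contain something other than whitespace."""
    # splitlines() recognizes \n, \r\n and \r line endings.
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept)

mixed = "a\r\n\r\nb\n \t\nc\r\rd"
print(repr(drop_blank_lines(mixed)))
```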
I did not see the combined one as an answer, so: search for ^\s+$ and replace with nothing.
^\s+$ means
^ start of line
\s+ Matches minimum one whitespace character (spaces, tabs, line breaks)
$ until end of line
This pattern is tested in Notepad++ v8.1.1
It replaces all spaces/tabs/blank lines before and after each row of text.
It shouldn't mess with anything in the middle of the text.
Find: ^(\s|\t)+|(\s|\t)+$
Replace: leave this blank
Before:
_____________________________________
\tWORD\r\n
\r\n
\tWORD\s\tWORD\s\t\r\n
\r\n
\r\n
WORD\s\s\tWORD\t\sWORD\s\r\n
\t\r\n
\s\s\s\r\n
WORD\s\sWORD\s\s\t\r\n
____________________________________
After:
_____________________________________
WORD\r\n
WORD\s\tWORD\r\n
WORD\s\s\tWORD\t\sWORD\r\n
WORD\s\sWORD
_____________________________________
A few of the above expressions and extended expressions did not work for me, but the regular expression "$\n$" did.
An easy alternative for removing white space from empty lines:
TextFX>TextFX Edit> Trim Trailing Spaces
This will remove all trailing spaces, including trailing spaces in blank lines.
Make sure, no trailing spaces are significant.
This works for me:
SEARCH:^\r
REPLACE: (empty)