Remove all lines that don't match regex in Notepad++ - regex

I have a range of files of a specific format. I have pasted an example here.
----------------------------------------------
Attempting to factor n = 160000000000110400000000018963... Found a factor: 400000000000129
Time (s): 18.9561
----------------------------------------------
Attempting to factor n = 164025000000137700000000028819... Found a factor: 405000000000179
Time (s): 22.3426
----------------------------------------------
Attempting to factor n = 168100000000155800000000036051... Found a factor: 410000000000197
Time (s): 101.183
I would like a regular expression that I can use to capture the times, e.g. for all the lines with format "Time (s): X.Y" I want to keep X.Y on a separate line, and throw EVERYTHING ELSE away.
I have the following expression: Time (s):\s+(\d+.\d+), which captures these. This captures the lines I need, but Notepad++ only seems to have functionality to replace with something, not save what it matches. So I can remove all those lines, which is nearly the opposite of what I want.
Any help?

Well, I don't know Notepad++, but it's likely that you can use the result of capture groups in the replacement field. Either try
\1
or
$1
1 = first capture group. So you basically replace the whole line with \2 in your case.

Use this on the command line:
for /f "usebackq tokens=3" %a in (`findstr /b "Time" 1.txt`) do @echo %a

Follow next steps (Notepad++ 6.2.3):
Clean and mark
Replace: ^(Time \(s\):)+ ([.\r]*) with: #\2
Remove unmarked lines
Replace: ^[^#]+[.\n]* with: (empty)
Remove mark
Replace: ^#(.*) with: \1

Use the following expression to match the entire line:
.*\(s\)\:\s+(\d+.\d+)
Now you can replace this with
\1
which gives you the matched group number 1 (the only group in the above expression) that matches the time

Adjust your regular expression so it either matches a "Time" line and captures the time within, or matches the whole line. Then replace with the captured text, which will be blank for ignored lines.
Find what: (Time \D+(\d+.\d+)|.*)
Replace with: \2
This leaves you with a sequence of captured times plus blank lines, which can be removed using TextFX's Remove blank lines, or Extended Replace on "\r\n\r\n".
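The same "match the time or swallow the line" idea can be sketched outside Notepad++. This is a minimal Python illustration, not the answer's own method: the names `sample`, `replaced` and `times` are made up for the example, and the dot in the pattern is escaped here so it only matches a literal decimal point.

```python
import re

sample = """----------------------------------------------
Attempting to factor n = 160000000000110400000000018963... Found a factor: 400000000000129
Time (s): 18.9561
----------------------------------------------
Attempting to factor n = 164025000000137700000000028819... Found a factor: 405000000000179
Time (s): 22.3426
"""

# Either a "Time" line (time captured in group 2) or any whole line (group 2 unset).
pattern = re.compile(r"(Time \D+(\d+\.\d+)|.*)")

# Replace every match with group 2, which is empty for the ignored lines.
replaced = pattern.sub(lambda m: m.group(2) or "", sample)

# Drop the blank lines that the ignored lines leave behind.
times = [line for line in replaced.splitlines() if line]
print(times)
```

The final list comprehension plays the role of the "remove blank lines" step.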

Similar to MaurizioRam's answer (which led me to figuring out this answer), you can take advantage of the "Mark" tab in the Find window.
As you probably know Ctrl+F opens a window with Find and Replace tabs. It also has tabs Find In Files, Find In Projects, and Mark.
Mark will let you add a special highlight (a mark) to everything your regex matches, by pressing "Mark All".
After pressing "Mark All" you can "Copy Marked Text" which will copy everything that your regex matched into your clipboard.
You can now paste this into a new file, which will give you a file with only the text your regex matched.
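Keeping only what the regex matched, as the Mark and copy steps above do, is roughly what a "find all" call does in a script. A small Python sketch, assuming the file has already been read into a string; `TIME_LINE` and `extract_times` are hypothetical names, and the parentheses around `s` are escaped so that only the time is captured.

```python
import re

# One capture group: the number that follows "Time (s):" at the start of a line.
TIME_LINE = re.compile(r"^Time \(s\):\s+(\d+\.\d+)", re.MULTILINE)

def extract_times(text):
    """Return every captured time, in order of appearance."""
    return TIME_LINE.findall(text)

log = (
    "----------------------------------------------\n"
    "Attempting to factor n = 160000000000110400000000018963... Found a factor: 400000000000129\n"
    "Time (s): 18.9561\n"
    "----------------------------------------------\n"
    "Attempting to factor n = 168100000000155800000000036051... Found a factor: 410000000000197\n"
    "Time (s): 101.183\n"
)
print("\n".join(extract_times(log)))
```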

Related

Notepad++ Regex Replace selecting all text. Works in RegExr

I'm trying to replace all spaces in a log file with commas (to convert it to CSV format). However, some log entries have spaces that I don't want replaced. These entries are bounded by quotation marks. I looked at a couple of examples and came up with the following code, which seems to work in RegExr.com and regex101.com.
[\s](?=(?:"[^"]*"|[^"])*$)
However, when I do a find/replace with that expression, it runs correctly until it hits the first quotation with a space and then selects the entire contents of the file.
Sample log file entry:
date=2020-08-24 time=07:35:15 idseq=216296511061885345 itime="2020-08-24 07:35:15" euid=3 epid=4107 dsteuid=3 dstepid=101 type="utm" subtype="webfilter" level="notice" action="passthrough" msg="URL belongs to an allowed category in policy"
Desired result:
date=2020-08-24,time=07:35:15,idseq=216296511061885345,itime="2020-08-24 07:35:15",euid=3,epid=4107,dsteuid=3,dstepid=101,type="utm",subtype="webfilter",level="notice",action="passthrough",msg="URL belongs to an allowed category in policy"
EDIT: After more testing, it appears that with a single line, the replace works. However, if you have more than one line, it replaces all lines with the replace character (in my case, the comma).
Ctrl+H
Find what: "[^"\r\n]+"(*SKIP)(*FAIL)|\h+
Replace with: ,
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
"[^"\r\n]+" # everything between quotes
(*SKIP)(*FAIL) # skip and fail the match
| # OR
\h+ # 1 or more horizontal spaces
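Python's built-in `re` module does not support `(*SKIP)(*FAIL)`, so the sketch below swaps in a different technique with the same effect: match either a quoted string or a run of spaces, and let a callback decide what to put back. The names `TOKEN` and `spaces_to_commas` are made up for the example, and `*` is used instead of `+` so that an empty `""` is also treated as a quoted string.

```python
import re

# Either a double-quoted string on one line, or a run of spaces/tabs.
TOKEN = re.compile(r'"[^"\r\n]*"|[ \t]+')

def spaces_to_commas(text):
    """Replace whitespace runs with commas, leaving quoted strings untouched."""
    def replace(match):
        token = match.group(0)
        # Quoted strings are returned unchanged; anything else is whitespace.
        return token if token.startswith('"') else ","
    return TOKEN.sub(replace, text)

sample = 'date=2020-08-24 time=07:35:15 itime="2020-08-24 07:35:15" euid=3\nlevel="notice" msg="URL belongs to an allowed category"'
print(spaces_to_commas(sample))
```

Because neither alternative matches a line break, each line is processed on its own and the multi-line problem from the question does not arise.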
While lengthy, if you have a known list of field names, you can simply use them as replacement keys.
The first field is skipped, as it shouldn't be prefixed with a comma.
The pattern includes the leading space and the trailing = around each label to be more certain of a match (though this does not guarantee it will not find matching substrings inside the msg field).
's/ (time|idseq|itime|euid|epid|dsteuid|dstepid|type|subtype|level|action|msg)=/,$1='
Example in Python
>>> import re
>>> source = '''date=2020-08-24 time=07:35:15 idseq=216296511061885345 itime="2020-08-24 07:35:15" euid=3 epid=4107 dsteuid=3 dstepid=101 type="utm" subtype="webfilter" level="notice" action="passthrough" msg="URL belongs to an allowed category in policy"'''
>>> regex = ''' (time|idseq|itime|euid|epid|dsteuid|dstepid|type|subtype|level|action|msg)='''
>>> print(re.sub(regex, r",\1=", source)) # raw string so \1 is not read as an escape sequence
date=2020-08-24,time=07:35:15,idseq=216296511061885345,itime="2020-08-24 07:35:15",euid=3,epid=4107,dsteuid=3,dstepid=101,type="utm",subtype="webfilter",level="notice",action="passthrough",msg="URL belongs to an allowed category in policy"
You may find some values contain \" or similar, which can break even quite careful regular expressions!
Also note for a CSV you may wish to replace the field names entirely

Notepad ++ regex. Finding and replacing with wildcard, but without allowing any spaces?

I have something like this in txt
[[asdfg]] [[abcd|qwerty]]
in a row, but I want it to look like that
[[asdfg]] [[qwerty]]
Searching with wildcards ( [[.*\| ) finds the whole line up to the "|". Not allowing a space inside the match should work, but I don't know how to do that.
Edit 1
It's from a Wikipedia dump, so the first part is the word in its basic form and the second is how it fits into the sentence. Something like [[I]] [[be|was]] [[at]] [[the]] [[doctor]]. And I want to change it into normal sentences:
[[I]] [[was]] [[at]] [[the]] [[doctor]]
Edit 2
I found somewhat of a solution. I just put every word in a new line, did the first regex and then deleted newlines. That did kinda mess up my spacing though...
Try this regex:
\[\[\w+\|(\w+)\]\]
Replace with:
[[$1]]
Make sure you choose Regular expression at the bottom before you click Replace All in Notepad++.
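For checking the pattern outside the editor, here is a minimal Python sketch of the same replacement. The names `PIPED_LINK` and `strip_base_forms` are invented for the example; Python writes the group reference as `\1` where Notepad++ accepts `$1`.

```python
import re

# [[base|shown]] where both parts are single words; capture the shown form.
PIPED_LINK = re.compile(r"\[\[\w+\|(\w+)\]\]")

def strip_base_forms(text):
    """Rewrite [[base|shown]] as [[shown]]; plain [[word]] links are left alone."""
    return PIPED_LINK.sub(r"[[\1]]", text)

print(strip_base_forms("[[I]] [[be|was]] [[at]] [[the]] [[doctor]]"))
```

Since `\w` cannot match a space or a bracket, a match can never run from one link into the next, which is the problem the greedy `.*` had.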
You can do it all in one run like so
\[{2}(?:(?!\]{2}).)+?\|([^\]]+)
This needs to be replaced by
[[$1
See a demo on regex101.com.
Broken down this says:
\[{2} # match [[
(?:(?!\]{2}).)+? # do not overrun ]]
\| # |
([^\]]+) # capture anything not ] into group 1
Afterwards, you'll only need to replace the open brackets and the content of group $1

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is that the User column can also contain a |, to make it even worse.
I can use a macro to fix this, but the file is a few million lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open for suggestions on how to do this with another program, command line utility, either Linux or Windows.
Match \|[^|]+\|[^|]+(\|[^|]+$)
Replace with $1
Basically, anchor to the end of the line and remove the two columns before the last one. (I assume columns can't be empty. Replace + with * if they can.)
If you need finer control than that, I'd recommend writing a Java or Python script to manually parse and rewrite the file for you.
I've captured three groups and given them names. If you use a replace utility like sed or vimregex, you can replace the remove group with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
You may have to remove the group namings and use \1 etc. instead, depending on what environment you use.
In Notepad++, press Ctrl+H, then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click replace and done.
Regex Explain:
\|\d+ : match 1st string that starts with | followed by number
\|\d+ : match 2nd string that starts with | followed by number
(\|[0-9a-z]+): match and capture the string after the 2nd number.
$ : forces the match to end at the end of the line.
Replacement:
$1 : replaces the matched text with the contents of the captured group, i.e. whatever matched between the parentheses (\|[0-9a-z]+)
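Since the question is open to other tools, the same end-anchored replacement can be sketched in Python and applied line by line, which avoids loading a multi-million-line file into an editor. `TRAILING` and `drop_counter_columns` are hypothetical names, and the sketch assumes (as the pattern above does) that the two removed columns are purely numeric and the last column is lowercase hex.

```python
import re

# Two numeric columns followed by the final hash column, anchored at end of line.
TRAILING = re.compile(r"\|\d+\|\d+(\|[0-9a-z]+)$")

def drop_counter_columns(line):
    """Remove the two numeric columns that precede the last column."""
    return TRAILING.sub(r"\1", line)

rows = [
    "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1",
    "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389",
]
for row in rows:
    print(drop_counter_columns(row))
```

Anchoring at the end of the line is what makes the extra | characters inside the user name harmless.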

Finding Ten Digit Number using regex in notepad++

I am trying to replace everything from a data dump and keep only the ten digit numbers from that dump using notepad++ regex.
Trying to do something like this (?<!\d)0\d{7}(?!\d) but no luck.
Foreword
There were problems in older versions of Notepad++, which wouldn't handle PCRE expressions. This proposed solution was tested in Notepad++ v6.8.8, but should work in any version later than v6.2.
Description
([0-9]{10})|.
Replace with: $1
This expression will do the following:
capture 10 digit numbers and place them into capture group 1, which is then just reinserted into the output string
matches everything else and removes it.
How To in Notepad ++
From Notepad++
press Ctrl+H to enter find and replace mode
Select the Regular Expression option
In the "Find what" field place the regular expression
in the "Replace with" field enter $1
Click Replace all
Example
Live Demo
https://regex101.com/r/fZ9vH7/1
Source Text
fdsafasfa1234567890zzzzzzz12345
After Replacement
1234567890
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [0-9]{10}                any character of: '0' to '9' (10 times)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  .                        any character except \n
----------------------------------------------------------------------
Extra credit
The OP wasn't clear on what to do with substrings of numbers longer than 10 characters. If strings of numbers longer than 10 digits are undesirable and need to be removed in their entirety, then use this:
([0-9]{10})(?![0-9])|[0-9]+|.
Replace with: $1
Live Demo: https://regex101.com/r/aS4sN1/1
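Both expressions can be tried outside Notepad++ with a short Python sketch. The names `KEEP_TEN`, `KEEP_EXACTLY_TEN` and `keep_ten_digit_runs` are made up for this illustration; a callback is used for the replacement so that matches where group 1 did not take part are replaced with an empty string.

```python
import re

# Capture a 10-digit run, or consume any other single character.
KEEP_TEN = re.compile(r"([0-9]{10})|.")

# Stricter variant: digit runs longer than 10 are consumed whole and dropped.
KEEP_EXACTLY_TEN = re.compile(r"([0-9]{10})(?![0-9])|[0-9]+|.")

def keep_ten_digit_runs(text, pattern=KEEP_TEN):
    """Keep only what group 1 captured; everything else is removed."""
    return pattern.sub(lambda m: m.group(1) or "", text)

print(keep_ten_digit_runs("fdsafasfa1234567890zzzzzzz12345"))
print(keep_ten_digit_runs("a123456789012b1234567890", KEEP_EXACTLY_TEN))
```

As in Notepad++ without ". matches newline", the dot does not match line breaks here, so line breaks in the input are kept.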
Try this:
Find: .*(\d{10}).*
Replace: \1
This has been tested in Notepad++.
Here is a different procedure that answers the question by example: how to get a list of the member IDs of your Facebook group so that active users are not removed. It was used to reduce a group from 10,000 to 5,000 members and to remove inactive members from a group on Facebook.
The tools involved may be outdated, but the point is to show what each Find and Replace step does.
It also serves as an example of how to extract text and code from HTML, in this case numbers that are 2 to 30 digits long.
You can use this to reduce the file to the member_id= entries and the numbers that follow them, from 2 up to 30 digits long, so that only whole strings such as "member_id=12456" or "member_id=12" are left in the file. Afterwards you can blank out the member_id= prefix, copy the whole list into a duplicate remover to get all unique IDs, and then use them in the Java code below.
"This is used to extract all Facebook user IDs of a group from a single HTML file that you saved after scrolling down the group"
Enable "Regular Expression" and ". matches newline" for the step below. Everything that is not captured in $1 is removed:
Find: (member_id=\d{2,30})|.
Replace: $1
Second, switch to Extended mode for this step:
Find: member_id=
Replace: \n
This puts each ID on its own line, which makes it easy to manually remove any extra characters (such as Fx0) that a buggy Notepad++ leaves in the lines.
Then you can easily remove all duplicates.
Join all lines into one, with a single space between the IDs.
One option is this tool, which removes all duplicates and leaves the text with one space between each ID: https://www.tracemyip.org/tools/remove-duplicate-words-in-text/
Then, using the Normal search mode in Notepad++:
Remember to add ' to the beginning and end.
Find: "ONE SPACE"
Replace: ','
Then you can copy the whole line into the Java code and remove all members who are not active, provided you used the HTML of a fully scrolled-down group page.
['21','234','124234'] <-- remember the correct characters at the beginning and end.
To be extra safe, add your IDs to the beginning.
The Facebook group removal Java code is here:
https://gist.github.com/michaelv/11145168

find a single quote at the end of a line starting with "mySqlQueryToArray"

I'm trying to use regex to find single quotes (so I can turn them all into double quotes) anywhere in a line that starts with mySqlQueryToArray (a function that makes a query to a SQL DB). I'm doing the regex in Sublime Text 3 which I'm pretty sure uses Perl Regex. I would like to have my regex match with every single quote in a line so for example I might have the line:
mySqlQueryToArray($con, "SELECT * FROM Template WHERE Name='$name'");
I want the regex to match in that line both of the quotes around $name but no other characters in that line. I've been trying to use (?<=mySqlQueryToArray.*)' but it tells me that the look behind assertion is invalid. I also tried (?<=mySqlQueryToArray)(?<=.*)' but that's also invalid. Can someone guide me to a regex that will accomplish what I need?
To find any number of single quotes in a line starting with your keyword you can use the \G anchor ("end of last match") by replacing:
(^\h*mySqlQueryToArray|(?!^)\G)([^\n\r']*)'
With \1\2<replacement>: see demo here.
Explanation
( ^\h*mySqlQueryToArray # beginning of line: check the keyword is here
| (?!^)\G ) # if not at the BOL, check we already matched something on this line
( [^\n\r']* ) ' # capture everything until the next single quote
The general idea is to match everything until the next single quote with ([^\n\r']*)' in order to replace it with \2<replacement>, but do so only if this everything is:
right after the beginning keyword (^mySqlQueryToArray), or
after the end of the last match ((?!^)\G): in that case we know we have the keyword and are on a relevant line.
\h* accounts for any started indent, as suggested by Xælias (\h being shortcut for any kind of horizontal whitespace).
https://stackoverflow.com/a/25331428/3933728 is a better answer.
I'm not good enough with RegEx nor ST to do this in one step. But I can do it in two:
1/ Search for all mySqlQueryToArray strings
Open the search panel: ⌘F or Find->Find...
Make sure you have the Regex (.* ) button selected (bottom left) and the wrap selector (all others should be off)
Search for: ^\s*mySqlQueryToArray.*$
^ beginning of line
\s* any indentation
mySqlQueryToArray your call
.* whatever is behind
$ end of line
Click on Find All
This will select every occurrence of what you want to modify.
2/ Enter the replace mode
⌥⌘F or Find->Replace...
This time, make sure that wrap, Regex AND In selection are active.
Then search for '([^']*)' and replace with "\1".
' are your single quotes
(...) is the capturing block, referenced by \1 in the replace field
[^']* is for any character that is not a single quote, repeated
Then hit Replace All
I know this is a little more complex than the other answer, but this one tackles cases where your line contains several single-quoted strings. Like this:
mySqlQueryToArray($con, "SELECT * FROM Template WHERE Name='$name' and Value='1234'");
If this is too much, I guess something like find: (?<=mySqlQueryToArray)(.*?)'([^']*)'(.*?) and replace it with \1"\2"\3 will be enough.
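The two-step idea (first select the lines that start with the call, then replace quotes only inside them) can also be sketched in Python. This is an illustration rather than a Sublime Text recipe: `CALL_LINE` and `requote_calls` are hypothetical names, and since Python's `re` has no `\G` anchor, a callback does the per-line replacement instead.

```python
import re

# A whole line that starts (after optional indentation) with the function call.
CALL_LINE = re.compile(r"^[ \t]*mySqlQueryToArray.*$", re.MULTILINE)

def requote_calls(text):
    """Turn every single quote into a double quote, but only on matching lines."""
    return CALL_LINE.sub(lambda m: m.group(0).replace("'", '"'), text)

source = (
    "mySqlQueryToArray($con, \"SELECT * FROM Template WHERE Name='$name' and Value='1234'\");\n"
    "echo 'left alone';"
)
print(requote_calls(source))
```

Lines that do not start with the call are never matched, so their single quotes stay as they are.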
You can use a regex like this:
(mySqlQueryToArray.*?)'(.*?)'(.*)
Working demo
Check the substitution section.
You can use \K, see this regex:
mySqlQueryToArray[^']*\K'(.*?)'
Here is a regex demo.