Notepad++ Regex Replace selecting all text. Works in RegExr - regex

I'm trying to replace all spaces in a log file with commas (to convert it to CSV format). However, some log entries have spaces that I don't want replaced. These entries are bounded by quotation marks. I looked at a couple of examples and came up with the following code, which seems to work in RegExr.com and regex101.com.
[\s](?=(?:"[^"]*"|[^"])*$)
However, when I do a find/replace with that expression, it runs correctly until it hits the first quotation with a space and then selects the entire contents of the file.
Sample log file entry:
date=2020-08-24 time=07:35:15 idseq=216296511061885345 itime="2020-08-24 07:35:15" euid=3 epid=4107 dsteuid=3 dstepid=101 type="utm" subtype="webfilter" level="notice" action="passthrough" msg="URL belongs to an allowed category in policy"
Desired result:
date=2020-08-24,time=07:35:15,idseq=216296511061885345,itime="2020-08-24 07:35:15",euid=3,epid=4107,dsteuid=3,dstepid=101,type="utm",subtype="webfilter",level="notice",action="passthrough",msg="URL belongs to an allowed category in policy"
EDIT: After more testing, it appears that with a single line, the replace works. However, if you have more than one line, it replaces all lines with the replace character (in my case, the comma).

Ctrl+H
Find what: "[^"\r\n]+"(*SKIP)(*FAIL)|\h+
Replace with: ,
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
"[^"\r\n]+" # everything between quotes
(*SKIP)(*FAIL) # skip and fail the match
| # OR
\h+ # 1 or more horizontal spaces
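For comparison outside Notepad++: Python's built-in re module does not support the (*SKIP)(*FAIL) verbs, but the same idea can be sketched with an alternation that captures quoted strings and puts them back unchanged. The function name spaces_to_commas is hypothetical, and the sketch assumes quotes are never escaped inside a value.

```python
import re

# Alternative 1 captures a quoted string; alternative 2 matches a run of
# spaces or tabs that lies outside any quoted string.
PATTERN = re.compile(r'("[^"\r\n]+")|[ \t]+')

def spaces_to_commas(text):
    # A quoted string is returned as-is; any other space run becomes a comma.
    return PATTERN.sub(lambda m: m.group(1) or ",", text)

if __name__ == "__main__":
    print(spaces_to_commas('type="utm" msg="URL belongs to a category" euid=3'))
```

Because neither alternative can match a line break, each line is handled on its own, which avoids the multi-line problem described in the question.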

While lengthy, if you have a known list of field names, you can simply use them as replacement keys.
The first field is skipped, as it shouldn't be prefixed with a comma.
Capture the leading space and the trailing = around each label to be more certain of a match (though this does not guarantee it will not find the same pattern inside the msg field).
's/ (time|idseq|itime|euid|epid|dsteuid|dstepid|type|subtype|level|action|msg)=/,$1='
Example in Python
>>> import re
>>> source = '''date=2020-08-24 time=07:35:15 idseq=216296511061885345 itime="2020-08-24 07:35:15" euid=3 epid=4107 dsteuid=3 dstepid=101 type="utm" subtype="webfilter" level="notice" action="passthrough" msg="URL belongs to an allowed category in policy"'''
>>> regex = ''' (time|idseq|itime|euid|epid|dsteuid|dstepid|type|subtype|level|action|msg)='''
>>> print(re.sub(regex, r",\1=", source)) # raw string so that \1 stays a backreference
date=2020-08-24,time=07:35:15,idseq=216296511061885345,itime="2020-08-24 07:35:15",euid=3,epid=4107,dsteuid=3,dstepid=101,type="utm",subtype="webfilter",level="notice",action="passthrough",msg="URL belongs to an allowed category in policy"
You may find some values contain \" or similar, which can break even quite careful regular expressions!
Also note for a CSV you may wish to replace the field names entirely
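If the field names should become a CSV header rather than being repeated in every row, one possible sketch uses the standard shlex and csv modules. The function name log_to_csv is hypothetical, and the sketch assumes every token has the form key=value and that shell-style quoting rules are close enough for the log format.

```python
import csv
import io
import shlex

def log_to_csv(text):
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex.split keeps a quoted value together and strips the quotes.
        pairs = (token.partition("=") for token in shlex.split(line))
        rows.append({key: value for key, _, value in pairs})

    # Collect field names in order of first appearance.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

if __name__ == "__main__":
    print(log_to_csv('date=2020-08-24 itime="2020-08-24 07:35:15" euid=3'))
```

The csv writer adds its own quoting wherever a value contains a comma or a quote, so values do not need to be escaped by hand.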

Related

Notepad++ Regex Replace Makeshift Footnotes format With Proper Markdown format

In Word, I had to convert my footnotes to lines appearing at the end of each file to be able to make changes in formatting. A macro I found online used braces, and I ended up also using highlighting so I could easily see where my footnotes used to be. As a result, the following strings appear twice in my documents: once in the main text and once at the end of each document, like makeshift endnotes.
=={1}==
.
.
.
=={99}==
I want to be able to match those instances in the text and convert them to proper markdown now. The problem is that the in-text format
[^1], [^2], etc.
will be different from what needs to come at the bottom, which has a colon added:
[^1]:
etc.
So I'm guessing I'll have to live with replacing my old formatting with the new one including the colon, and deleting the colons individually while I edit and clean up my text in the future. Without the colon, it won't work.
My question is how to use a regex to match the one- or two-digit strings with braces and equals signs.
This
==(\{d{1,2}\})==
did not work.
Also, as I am no pro, I would need the replacement as well. It probably will be
[^($1)]:
I reckon. Apparently, the equal mark doesn't have to be escaped.
Current format:
...some text...makeshift footnote in the format of
=={one- or two-digit number with no spaces in between}==
For example,
=={1}==
=={23}==
etc.
Desired result for all occurrences recursively:
[^1]:
.
.
.
[^99]:
The markdown format is single square brackets with a caret and a number, plus a colon for the actual footnotes. The number usually goes up to 42-45 at most, but that doesn't matter; a two-digit regex is needed. As I said, the colon will be needed in all instances.
Cheers
You just have a few errors in your regex: you forgot to escape the d for digit (it should be \d), and the capture group must not include the curly braces.
Use:
Ctrl+H
Find what: =={(\d{1,2})}==
Replace with: [^$1]:
TICK Wrap around
SELECT Regular expression
Replace all
Explanation:
=={ # literally
(\d{1,2}) # group 1, 1 or 2 digits
}== # literally
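The same replacement can be sketched in Python, which also makes it possible to avoid deleting colons by hand afterwards. The function name convert_footnotes is hypothetical, and the sketch assumes that a marker at the very start of a line is a footnote definition (which gets the colon) while any other marker is an in-text reference (no colon).

```python
import re

MARKER = re.compile(r"==\{(\d{1,2})\}==")

def convert_footnotes(text):
    def repl(match):
        # Assumption: a marker at the start of a line is a footnote definition.
        at_line_start = match.start() == 0 or text[match.start() - 1] == "\n"
        suffix = ":" if at_line_start else ""
        return f"[^{match.group(1)}]{suffix}"
    return MARKER.sub(repl, text)

if __name__ == "__main__":
    sample = "Some claim=={1}== here.\n\n=={1}== The source of the claim."
    print(convert_footnotes(sample))
```

If that assumption does not hold for a document, dropping the at_line_start check and always appending the colon reproduces the Notepad++ replacement exactly.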

Notepad ++ regex. Finding and replacing with wildcard, but without allowing any spaces?

I have something like this in txt
[[asdfg]] [[abcd|qwerty]]
in a row, but I want it to look like that
[[asdfg]] [[qwerty]]
Searching with wildcards ( [[.*\| ) finds the whole line up to the "|". Not allowing a space in the match should work, but I don't know how to do that.
Edit 1
It's from a Wikipedia dump, so the first part is the word in its basic form and the second is how it fits into the sentence. Something like [[I]] [[be|was]] [[at]] [[the]] [[doctor]]. And I want to change it into normal sentences:
[[I]] [[was]] [[at]] [[the]] [[doctor]]
Edit 2
I found somewhat of a solution. I just put every word in a new line, did the first regex and then deleted newlines. That did kinda mess up my spacing though...
Try this regex:
\[\[\w+\|(\w+)\]\]
Replace with:
[[$1]]
Make sure you choose Regular expression at the bottom before you click Replace All in Notepad++.
You can do it all in one run like so
\[{2}(?:(?!\]{2}).)+?\|([^\]]+)
This needs to be replaced by
[[$1
See a demo on regex101.com.
Broken down this says:
\[{2} # match [[
(?:(?!\]{2}).)+? # do not overrun ]]
\| # |
([^\]]+) # capture anything not ] into group 1
Afterwards, you'll only need to replace the open brackets and the content of group $1
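A small Python sketch shows why the original wildcard overruns and how a negated character class prevents it. The function name strip_link_targets is hypothetical. The greedy .* runs to the last | on the line, whereas [^\]|]* cannot cross a ] or a |, so each match stays inside one [[...]] pair.

```python
import re

# [^\]|]* matches the link target without crossing "]" or "|".
PIPED_LINK = re.compile(r"\[\[[^\]|]*\|([^\]]+)\]\]")

def strip_link_targets(text):
    # Keep only the display text of each piped link.
    return PIPED_LINK.sub(r"[[\1]]", text)

if __name__ == "__main__":
    print(strip_link_targets("[[I]] [[be|was]] [[at]] [[the]] [[doctor]]"))
```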

PCRE regex replace a text pattern within double quotes

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turns your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Naturally, the search and replace will fail if the input varies from the pattern described, as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also, I tested that regexp using Zeus, which can create a regexp with more than 9 groups.
As you can see, the regexp requires 10 groups.
I'm not familiar with Notepad++, so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
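Outside Notepad++, this "find within a match" step is straightforward in a scripting language. As a minimal sketch in Python (the function name dash_inside_quotes is hypothetical, and escaped quotes are assumed to have been removed already, as stated in the question), a replacement function receives each quoted string and edits only that piece:

```python
import re

QUOTED = re.compile(r'"[^"]*"')

def dash_inside_quotes(text, char="a", replacement="-"):
    # The outer regex finds each quote pair; the inner replace only
    # touches the text of that match.
    return QUOTED.sub(lambda m: m.group(0).replace(char, replacement), text)

if __name__ == "__main__":
    print(dash_inside_quotes('aa"bbabaavv"kdjhas"bbabaavv"x'))
```

This works for any number of quoted strings per line, because quotes are consumed pairwise from left to right.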
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working:) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOL's apart from ones that might exist with a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)
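As an alternative to the macro, the pasted spreadsheet text could be parsed with a CSV reader, which already understands quoted cells that contain line breaks. The following is only a sketch under the assumptions described above (tab-delimited cells, " around multi-line cells, "" for an embedded quote); the function name tsv_to_values is hypothetical.

```python
import csv
import io

def tsv_to_values(tsv_text):
    # newline="" leaves line breaks untouched so the csv module can tell
    # row endings from line breaks inside quoted cells.
    reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
    rows = []
    for row in reader:
        # Double any single quote, then wrap each cell in single quotes.
        cells = ", ".join("'" + cell.replace("'", "''") + "'" for cell in row)
        rows.append(f"({cells})")
    return "\n, ".join(rows)

if __name__ == "__main__":
    pasted = 'line1\tline 2\n"line3\nline3b\nline3c"\tline 4\n'
    print("SELECT *\nFROM (VALUES" + tsv_to_values(pasted) + "\n) t(a,b)")
```

No dummy column is needed with this approach, since the reader tracks whether it is inside a quoted cell.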

Remove all lines that don't match regex in Notepad++

I have a range of files of a specific format. I have pasted an example here.
----------------------------------------------
Attempting to factor n = 160000000000110400000000018963... Found a factor: 400000000000129
Time (s): 18.9561
----------------------------------------------
Attempting to factor n = 164025000000137700000000028819... Found a factor: 405000000000179
Time (s): 22.3426
----------------------------------------------
Attempting to factor n = 168100000000155800000000036051... Found a factor: 410000000000197
Time (s): 101.183
I would like a regular expression that I can use to capture the times, e.g. for all the lines with format "Time (s): X.Y" I want to keep X.Y on a separate line, and throw EVERYTHING ELSE away.
I have the following expression: Time (s):\s+(\d+.\d+), which captures these. This captures the lines I need, but Notepad++ only seems to have functionality to replace with something, not save what it matches. So I can remove all those lines, which is nearly the opposite of what I want.
Any help?
Well, I don't know Notepad++, but it's likely that you can use the result of capture groups in the replacement field. Either try
\1
or
$1
1 = first capture group. So you basically replace the whole line with \2 in your case.
Use this on the command line:
for /f "usebackq tokens=3" %a in (`findstr /b "Time" 1.txt`) do @echo %a
Follow next steps (Notepad++ 6.2.3):
Clean and mark
Replace: ^(Time \(s\):)+ ([.\r]*) with: #\2
Remove unmarked lines
Replace: ^[^#]+[.\n]* with: (empty)
Remove mark
Replace: ^#(.*) with: \1
Use the following expression to match the entire line:
.*\(s\)\:\s+(\d+.\d+)
Now you can replace this with
\1
which gives you the matched group number 1 (the only group in the above expression) that matches the time
Adjust your regular expression so it either matches a "Time" line and captures the time within, or matches the whole line. Then replace with the captured text, which will be blank for ignored lines.
Find what: (Time \D+(\d+.\d+)|.*)
Replace with: \2
This leaves you with a sequence of captured times plus blank lines, which can be removed using TextFX's Remove blank lines, or Extended Replace on "\r\n\r\n".
Similar to MaurizioRam's answer (which led me to figuring out this answer), you can take advantage of the "Mark" tab in the Find window.
As you probably know Ctrl+F opens a window with Find and Replace tabs. It also has tabs Find In Files, Find In Projects, and Mark.
Mark will let you add a special highlight (a mark) to everything your regex matches, by pressing "Mark All".
After pressing "Mark All" you can "Copy Marked Text" which will copy everything that your regex matched into your clipboard.
You can now paste this into a new file, which will give you a file with only the text your regex matched.
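If a script is an option, keeping only the matches is a one-liner in most languages. A minimal Python sketch (the function name extract_times is hypothetical) that returns just the X.Y values:

```python
import re

# Parentheses and the decimal point are escaped so they match literally.
TIME_LINE = re.compile(r"^Time \(s\):\s+(\d+\.\d+)", re.MULTILINE)

def extract_times(text):
    # findall returns only the captured group, i.e. the time itself.
    return TIME_LINE.findall(text)

if __name__ == "__main__":
    sample = (
        "----------------------------------------------\n"
        "Attempting to factor n = 160000000000110400000000018963... "
        "Found a factor: 400000000000129\n"
        "Time (s): 18.9561\n"
    )
    print("\n".join(extract_times(sample)))
```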

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
the ? means non greedy, if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
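For readers not using Perl, roughly the same two idioms can be sketched in Python. The function names iter_tokens and all_tokens are hypothetical.

```python
import re

TOKEN = re.compile(r"###(.+?)###")

def iter_tokens(text):
    # Yield (with delimiters, without delimiters) for each token in turn.
    for match in TOKEN.finditer(text):
        yield match.group(0), match.group(1)

def all_tokens(text):
    # With one capture group, findall returns just the captured text.
    return TOKEN.findall(text)

if __name__ == "__main__":
    text = "text ###token1### text text ###token2### text text"
    for whole, inner in iter_tokens(text):
        print(whole, inner)
    print(all_tokens(text))
```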
Assuming you want to match ###token2### as well...
/###.+###/
Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 in the replacement expression (\1 for the first set, \2 for the second), assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents first and second token, which you've indicated with the ()s around them.
When you are using delimiters such as these, you basically grab the first delimiter, then anything that does not match the ending delimiter, followed by the ending delimiter. One caution: in cases like the example above, [^#] alone would not work as a check that the end delimiter is not there, since a single # inside a token would cause the regex to fail (e.g. "###foo#bar###"). For the case above, the regex would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
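A quick Python check illustrates the difference between a plain [^#] class and the pattern above on a token that contains single or double # characters. The function name find_tokens is hypothetical, and the group is made non-capturing here only so that the whole match is easy to retrieve.

```python
import re

# Allows one or two consecutive "#" inside a token, but never three.
TOKEN = re.compile(r"###(?:[^#]|#[^#]|##[^#])*###")

# Naive variant: any "#" inside the token makes the match fail.
NAIVE = re.compile(r"###[^#]+###")

def find_tokens(text, pattern=TOKEN):
    return [match.group(0) for match in pattern.finditer(text)]

if __name__ == "__main__":
    sample = "x ###foo#bar### y ###a##b### z"
    print(find_tokens(sample))
    print(find_tokens(sample, NAIVE))
```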