vim: my regexp to select some words doesn't work - regex

Sentences = lines that may contain anything (including HTML tags). I have a lot of sentences like the ones below. They sit in a huge text file, and I don't want to remove all the tags from it (all other lines must remain untouched):
<h2 id="aa">sentence</h2>
<h2 id="xx">Another sentence</h2>
And sometimes only:
<h2 id="aa">A sentence without a link</h2>
The first thing I find strange: I'm trying to match any character and capture it in a group. I've tried all of these:
\(.\)\+ -> selects the whole line
\([.]\)\+ -> selects only the "." character
\([\.]\)\+ -> selects only the "." character
\([\.]\)\+ -> still selects only the "." character (what the?)
From the documentation, I thought that to match a run of arbitrary characters and capture it in a register I could use this expression, but it doesn't work: \([\.]\+\). The only "close" expression that works is \(.\)\+, but when I output the register it contains only the last character matched.
Because of this problem, I can't do what I want, which is to convert all the sentences above into this output:
---sentence
---Another sentence
---A sentence without a link
I've tried something like :%s/^<h2 id=\(\[.\]\+\)<a\([.]\)\+>\(.\)\+<\/a><\/h2>$/--->\3/ but it didn't work properly, and it didn't cover sentences that have no <a> tag inside.
How would you do this?

Simply use the regex below:
>([^<>]+)<
Demo: https://regex101.com/r/mS2oB5/2
For full text:
>([^<>\n]+)<
Demo: https://regex101.com/r/mS2oB5/3
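As a rough illustration of how this pattern behaves, here is a minimal Python sketch (Python's re module, not Vim). The <a href="#x"> markup in the first sample line and the convert helper are assumptions made for the example; only lines that start with <h2 are rewritten, so other lines stay untouched.

```python
import re

lines = [
    '<h2 id="aa"><a href="#x">sentence</a></h2>',  # link markup is hypothetical
    '<h2 id="aa">A sentence without a link</h2>',
    '<p>some other line</p>',
]

# First run of non-tag text sitting between a ">" and a "<".
pattern = re.compile(r'>([^<>\n]+)<')

def convert(line):
    """Rewrite an <h2> line as '---text'; return any other line unchanged."""
    if not line.startswith('<h2 '):
        return line
    match = pattern.search(line)
    return '---' + match.group(1) if match else line

converted = [convert(line) for line in lines]
print(converted)
```

In the first line the empty gap between ">" and "<a" cannot match, because the group needs at least one character, so the search moves on to the text inside the link.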

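In Vim's command mode, type:
:%s/<[^>]*>//g
Explanation:
1. \([\.]\)\+ still selects only the "." character because characters inside [] are treated as literals; they lose their special regex meaning. To match any character, use . outside brackets, and put the quantifier inside the group, \(.\+\), so the group captures the whole run rather than only the last character.
2. My regex <[^>]*> is a simple way to remove all HTML tags. It will have some problems (for one, it strips tags on every line, not just the <h2> lines), but I will leave those to you.
3. An alternative to <[^>]*> is the non-greedy <.*?>. Vim does not support the *? syntax, so there it is written <.\{-}>.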
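The capture-group behavior from the question and the bracket rule from point 1 can be reproduced outside Vim. This is a minimal sketch in Python, whose re syntax leaves groups and + unescaped; the variable names are made up for the example.

```python
import re

line = '<h2 id="aa">sentence</h2>'

# Quantifier outside the group: the group is re-captured on every
# repetition, so it ends up holding only the last character matched.
last_only = re.match(r'(.)+', line).group(1)

# Quantifier inside the group: the group holds the whole run.
whole_run = re.match(r'(.+)', line).group(1)

# Inside brackets a dot is literal, so this finds only runs of real dots.
literal_dots = re.findall(r'[.]+', 'a.b..c')

# Python counterpart of :%s/<[^>]*>//g applied to a single line.
no_tags = re.sub(r'<[^>]*>', '', line)

print(repr(last_only), repr(whole_run), literal_dots, repr(no_tags))
```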

Related

How to reverse regx to not match

I have a regular expression that selects URLs. I want the opposite: it should not select URLs, only plain words such as admin and hello. How can I do that?
Regex
((.*?\w+|\W):\/\/[\w\-\.]+.*?\/*.*?\w\W+.*\/.*?\w\W+.*?\/{0,})
Text
htt$ps://b24-56kck1.$bitr%ix24.kz/com#pany/pe#rsonal/us^&er/19/k/roce/
https://1.tesssst1.ru/ororo
admin
hello
##$#$$#w_svccx354V2346Vf
SendAjaxFilterToServer(quiz_questions);
Alex, it is very hard to invert a regular expression, so think instead in terms of the attributes of what you want to match. One thing that jumps out at me is that you just want lines that contain only letters. For that, you can use ^[a-zA-Z]+$
Another way to go about it is to create an inverted list of characters: ones that you don't want present. This can be harder, but for the simple example input you give, you don't want ":", "/" or "#" to be in the line. That would be ^[^:/#]+$.
These are examples of how you need to think about the problem.
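To make the two approaches concrete, here is a minimal Python sketch run against the sample text from the question. Testing each line separately with re.fullmatch (instead of using ^ and $ on the whole text) is a choice made for this example, so that the negated class cannot run across line breaks.

```python
import re

text = """htt$ps://b24-56kck1.$bitr%ix24.kz/com#pany/pe#rsonal/us^&er/19/k/roce/
https://1.tesssst1.ru/ororo
admin
hello
##$#$$#w_svccx354V2346Vf
SendAjaxFilterToServer(quiz_questions);"""

lines = text.splitlines()

# Whitelist: the whole line must consist of letters.
letters_only = [l for l in lines if re.fullmatch(r'[a-zA-Z]+', l)]

# Blacklist: the line must not contain ":", "/" or "#".
no_url_chars = [l for l in lines if re.fullmatch(r'[^:/#]+', l)]

print(letters_only)
print(no_url_chars)
```

On this input the blacklist also lets the SendAjaxFilterToServer(...) line through, which shows why the inverted list is the harder of the two to get right.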
Try this, then trim the surrounding whitespace (needed because Go does not support lookaround):
(^|[\n\s])[a-zA-Z]+([\n\s]|$)
https://regex101.com/r/MqyDWC/3
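A minimal sketch of the match-then-trim idea, written in Python rather than Go. The multiline flag is an assumption about how the pattern is meant to be run; without it, ^ only matches at the very start of the text and "hello" is missed, because the newline before it was consumed by the previous match.

```python
import re

text = """htt$ps://b24-56kck1.$bitr%ix24.kz/com#pany/pe#rsonal/us^&er/19/k/roce/
https://1.tesssst1.ru/ororo
admin
hello
##$#$$#w_svccx354V2346Vf
SendAjaxFilterToServer(quiz_questions);"""

# The pattern consumes the whitespace around each word instead of using
# lookaround, so every match has to be stripped afterwards.
pattern = re.compile(r'(^|[\n\s])[a-zA-Z]+([\n\s]|$)', re.MULTILINE)

words = [m.group(0).strip() for m in pattern.finditer(text)]
print(words)
```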

Notepad++ Regex Replace selecting all text. Works in RegExr

I'm trying to replace all spaces in a log file with commas (to convert it to CSV format). However, some log entries have spaces that I don't want replaced. These entries are bounded by quotation marks. I looked at a couple of examples and came up with the following code, which seems to work in RegExr.com and regex101.com.
[\s](?=(?:"[^"]*"|[^"])*$)
However, when I do a find/replace with that expression, it runs correctly until it hits the first quotation with a space and then selects the entire contents of the file.
Sample log file entry:
date=2020-08-24 time=07:35:15 idseq=216296511061885345 itime="2020-08-24 07:35:15" euid=3 epid=4107 dsteuid=3 dstepid=101 type="utm" subtype="webfilter" level="notice" action="passthrough" msg="URL belongs to an allowed category in policy"
Desired result:
date=2020-08-24,time=07:35:15,idseq=216296511061885345,itime="2020-08-24 07:35:15",euid=3,epid=4107,dsteuid=3,dstepid=101,type="utm",subtype="webfilter",level="notice",action="passthrough",msg="URL belongs to an allowed category in policy"
EDIT: After more testing, it appears that with a single line, the replace works. However, if you have more than one line, it replaces all lines with the replace character (in my case, the comma).
Ctrl+H
Find what: "[^"\r\n]+"(*SKIP)(*FAIL)|\h+
Replace with: ,
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
"[^"\r\n]+" # everything between quotes
(*SKIP)(*FAIL) # skip and fail the match
| # OR
\h+ # 1 or more horizontal spaces
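For readers working outside Notepad++: Python's built-in re module does not support the (*SKIP)(*FAIL) verbs, so the sketch below swaps in a different technique with the same effect. It matches either a quoted string or a run of spaces, and a replacement function returns quoted strings unchanged. The keep_quoted helper, the shortened sample entry, and the use of [ \t] as a stand-in for \h are assumptions for this example.

```python
import re

entry = 'date=2020-08-24 time=07:35:15 itime="2020-08-24 07:35:15" euid=3 type="utm"'

def keep_quoted(match):
    """Return quoted strings as they are; turn runs of spaces/tabs into a comma."""
    token = match.group(0)
    return token if token.startswith('"') else ','

# Quoted strings are tried first, so spaces inside them are never
# reached by the second alternative.
converted = re.sub(r'"[^"\r\n]+"|[ \t]+', keep_quoted, entry)
print(converted)
```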
While lengthy, if you have a known list of field names, you can simply use them as replacement keys.
The first field is skipped because it shouldn't be prefixed with a comma.
Capture the leading space and the = around each label to be more sure of the match (though this does not guarantee it won't find such substrings inside the msg field).
's/ (time|idseq|itime|euid|epid|dsteuid|dstepid|type|subtype|level|action|msg)=/,$1='
Example in Python
>>> import re
>>> source = '''date=2020-08-24 time=07:35:15 idseq=216296511061885345 itime="2020-08-24 07:35:15" euid=3 epid=4107 dsteuid=3 dstepid=101 type="utm" subtype="webfilter" level="notice" action="passthrough" msg="URL belongs to an allowed category in policy"'''
>>> regex = ''' (time|idseq|itime|euid|epid|dsteuid|dstepid|type|subtype|level|action|msg)='''
>>> print(re.sub(regex, r",\1=", source))  # raw string so \1 stays a backreference
date=2020-08-24,time=07:35:15,idseq=216296511061885345,itime="2020-08-24 07:35:15",euid=3,epid=4107,dsteuid=3,dstepid=101,type="utm",subtype="webfilter",level="notice",action="passthrough",msg="URL belongs to an allowed category in policy"
You may find some values contain \" or similar, which can break even quite careful regular expressions!
Also note for a CSV you may wish to replace the field names entirely
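If the goal is a real CSV with the field names as a header row, one possible approach is to parse each entry into key/value pairs first. The sketch below uses the standard-library shlex and csv modules instead of a regex. The parse_entry helper and the shortened sample entry are made up for the example, and it assumes every token has the form key=value.

```python
import csv
import io
import shlex

def parse_entry(line):
    """Split a key=value log entry into a dict, honoring double quotes."""
    fields = {}
    for token in shlex.split(line):  # quoted values stay in one token
        key, _, value = token.partition('=')
        fields[key] = value
    return fields

entry = ('date=2020-08-24 time=07:35:15 itime="2020-08-24 07:35:15" '
         'type="utm" msg="URL belongs to an allowed category in policy"')

fields = parse_entry(entry)

# Write a header row plus one data row into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(fields))
writer.writeheader()
writer.writerow(fields)
csv_text = buffer.getvalue()
print(csv_text)
```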

Notepad ++ regex. Finding and replacing with wildcard, but without allowing any spaces?

I have something like this in txt
[[asdfg]] [[abcd|qwerty]]
in a row, but I want it to look like this:
[[asdfg]] [[qwerty]]
Searching with a wildcard ( [[.*\| ) finds everything on the line up to the "|". Not allowing a space in the match should work, but I don't know how to do that.
Edit 1
It's from a Wikipedia dump, so the first part is the word in its basic form and the second is how it fits into the sentence. Something like [[I]] [[be|was]] [[at]] [[the]] [[doctor]], and I want to change it into normal sentences:
[[I]] [[was]] [[at]] [[the]] [[doctor]]
Edit 2
I found somewhat of a solution. I just put every word in a new line, did the first regex and then deleted newlines. That did kinda mess up my spacing though...
Try this regex:
\[\[\w+\|(\w+)\]\]
Replace with:
[[$1]]
Make sure you choose Regular expression at the bottom before you click Replace All in Notepad++.
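The same find/replace pair can be tried outside Notepad++. This is a minimal Python sketch, where the replacement group is written \1 instead of $1:

```python
import re

sentence = '[[I]] [[be|was]] [[at]] [[the]] [[doctor]]'

# \w+ cannot cross spaces or brackets, so the match stays inside one [[...]].
result = re.sub(r'\[\[\w+\|(\w+)\]\]', r'[[\1]]', sentence)
print(result)
```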
You can do it all in one run like so
\[{2}(?:(?!\]{2}).)+?\|([^\]]+)
This needs to be replaced by
[[$1
See a demo on regex101.com.
Broken down this says:
\[{2} # match [[
(?:(?!\]{2}).)+? # do not overrun ]]
\| # |
([^\]]+) # capture anything not ] into group 1
The match stops before the closing ]], so the replacement only has to put back the opening brackets and the content of group $1.
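A minimal Python sketch of this pattern, with \1 in place of $1. The second input, which has spaces inside the brackets, is a hypothetical example added to show a case a \w+ based pattern would not match.

```python
import re

# [[, then characters that never step over a closing ]], then |,
# then the part to keep (everything up to the next "]").
pattern = re.compile(r'\[{2}(?:(?!\]{2}).)+?\|([^\]]+)')

simple = pattern.sub(r'[[\1', '[[asdfg]] [[abcd|qwerty]]')
spaced = pattern.sub(r'[[\1', '[[I]] [[be able|was able]] [[at]]')

print(simple)
print(spaced)
```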

Matching all occurrences of a html element attribute in notepad++ regex

I have a file which has hundreds of links like this:
<h3>aspnet</h3>
Ex 1
Ex 2
Ex 3
So I want to remove all the attributes
icon="..."
from all the lines. I went through the official Notepad++ regex wiki and have come up with this after several trials:
icon=\"[^\.]+\"
The problem with this is, it is selecting past the second double quote and stopping at the next occurring double quote. To illustrate, this will select the following content:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt...">EX 1</a> <a href="
If I modify the above regex to,
icon=\"[^\.]+\">
Then it is almost perfect, but it is also selecting the >:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt...">
The regex I am looking for would select like this:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt..."
I also tried the following, but it doesn't match anything at all
icon=\"[^\.]+\"$
Just match anything but a quote, followed by a quote:
icon="[^"]+"
Just tested with notepad++ 6.2.2 and confirmed that this matches correctly as written.
Broken down:
icon="
This is fairly obvious, match the literal text icon=".
[^"]+
This means to match any character that is not a ". Adding the + after it means "one or more times."
Finally we match another literal ".
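The difference between the two character classes can be checked with a short Python sketch. The two-link sample line is hypothetical (the real file has long base64 data), and the leading space in the removal pattern is an addition so that no stray space is left behind.

```python
import re

html = ('<a href="http://example.com/1" icon="data:image/png;base64,AAAA">EX 1</a> '
        '<a href="http://example.com/2" icon="data:image/png;base64,BBBB">EX 2</a>')

# [^.]+ stops only at a dot, so it runs past the closing quote and
# backtracks to the last quote before the next dot.
overrun = re.search(r'icon="[^.]+"', html).group(0)

# [^"]+ cannot cross a quote, so the match ends at the attribute's own quote.
exact = re.search(r'icon="[^"]+"', html).group(0)

# Remove every icon attribute together with the space in front of it.
cleaned = re.sub(r' icon="[^"]+"', '', html)

print(overrun)
print(exact)
print(cleaned)
```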
I am not a Notepad++ user, so I don't know how Notepad++ handles regex, but can you try replacing
icon=\"[^>]* with an empty string?
Try this solution; I just checked that it works the way you wanted:
Find what: (icon.*")|.*?
Replace with: $1

Regex Match That doesn't contain some text

I am trying to create a regex that finds a Start Prefix and an End Prefix that have paragraph tags between them, but the one I have created is not working as I expected.
%%%HL_START%%%(.*?)</p><p>(.*?)%%%HL_END%%%
Correctly Matches
<p>This Should %%%HL_START%%%Work</p><p>This%%%HL_END%%% SHould Match</p>
This also matches, but I don't want it to match because the </p><p> is not between the Start and End Prefix
<p>%%%HL_START%%%One%%%HL_END%%% Some More Text %%%HL_START%%%Here%%%HL_END%%%</p><p>Some more text %%%HL_START%%%Here%%%HL_END%%%</p>
I'm not entirely comfortable that regex is the right solution here; if you are getting into nested start and stop markers, you might not have a regular language...
For this specific example, try changing the regex to use [^%] instead of . so that the lazy matching can't run past %%%HL_END%%%:
%%%HL_START%%%([^%]*?)</p><p>([^%]*?)%%%HL_END%%%
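Both patterns can be compared on the two inputs from the question with a minimal Python sketch (the variable names are made up for the example):

```python
import re

good = '<p>This Should %%%HL_START%%%Work</p><p>This%%%HL_END%%% SHould Match</p>'
bad = ('<p>%%%HL_START%%%One%%%HL_END%%% Some More Text %%%HL_START%%%Here%%%HL_END%%%</p>'
       '<p>Some more text %%%HL_START%%%Here%%%HL_END%%%</p>')

original = re.compile(r'%%%HL_START%%%(.*?)</p><p>(.*?)%%%HL_END%%%')
revised = re.compile(r'%%%HL_START%%%([^%]*?)</p><p>([^%]*?)%%%HL_END%%%')

# The original pattern matches both inputs, including the unwanted one.
original_hits = [bool(original.search(s)) for s in (good, bad)]

# [^%] cannot step over the "%" that starts %%%HL_END%%%, so only the
# first input matches.
revised_hits = [bool(revised.search(s)) for s in (good, bad)]
groups = revised.search(good).groups()

print(original_hits, revised_hits, groups)
```

One limitation of this approach is that highlighted text containing a literal % would no longer match.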