Vim Regex for html - regex

I'm working with Vim regexes to fold HTML and am trying to ignore HTML tags that start and end on the same line.
So far, I have:
if line =~# '<\(\w\+\).*<\/\1>'
return '='
endif
Which works fine for tags like <a></a>, but when dealing with custom elements, I run into issues since there is a hyphen in the tag name.
For example, this element:
<paper-input label="Input label"></paper-input>
What needs to change in the regex to also catch the hyphen?

The correct regex (updated because of this link) is:
<\([^ >]\+\)[ >].*<\/\1>
or
<\([^ >]\+\)\>.*<\/\1>
The important part is [^ >]. It matches any character up to the first space or >, so the group captures the whole tag name, i.e. it will match both a and paper-input.
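To check the behaviour outside Vim, here is a minimal sketch in Python. It assumes a direct translation of the Vim syntax (\( \) \+ \/ become ( ) + /), and the function name is made up for illustration, so treat it as a test harness rather than a fold expression.

```python
import re

# Python translation of the Vim pattern <\(\w\+\).*<\/\1> from the question.
WORD_ONLY = re.compile(r"<(\w+).*</\1>")
# Python translation of the Vim pattern <\([^ >]\+\)[ >].*<\/\1> from the answer.
ANY_NAME = re.compile(r"<([^ >]+)[ >].*</\1>")


def closes_on_same_line(line, pattern=ANY_NAME):
    """Return True if the line holds an opening tag and its matching closing tag."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    samples = (
        '<a></a>',
        '<paper-input label="Input label"></paper-input>',
        '<div class="wrapper">',
    )
    for sample in samples:
        # Columns: the line, result with \w+, result with [^ >]+
        print(sample, closes_on_same_line(sample, WORD_ONLY), closes_on_same_line(sample))
```

The \w version fails on the hyphenated name because \w+ stops at "paper" and there is no </paper> to pair it with.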

Related

Remove HTML tags with specific content using Regex

I want to remove an HTML tag, but only when specific content appears inside it, like this:
<strong>\u00a0</strong>
I want to remove the tag only if \u00a0 appears inside it. Any solution?
If what you mean is to match the exact string \u00a0, then the regex is just a matter of escaping the backslash and the forward slash:
s/<strong>\\u00a0<\/strong>//g
Or, more readable:
s|<strong>\\u00a0</strong>||g
If you mean to match the actual unicode character U+00A0, then the regex is:
Non-PCRE syntax:
s/<strong>\u00a0<\/strong>//g
PCRE syntax:
s/<strong>\x{00a0}<\/strong>//g
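The two cases can be sketched in Python as follows. This is only an illustration of the distinction (escaped text versus the real character); the function name is hypothetical, and the substitution syntax of your own tool may differ.

```python
import re

# Case 1: the text contains the six literal characters \u00a0.
LITERAL_ESCAPE = re.compile(r"<strong>\\u00a0</strong>")
# Case 2: the text contains the actual no-break space character U+00A0.
REAL_NBSP = re.compile("<strong>\u00a0</strong>")


def strip_nbsp_strong(html):
    """Remove <strong> elements that contain only the no-break space, escaped or real."""
    html = LITERAL_ESCAPE.sub("", html)
    return REAL_NBSP.sub("", html)
```

Any <strong> element with other content is left alone, because both patterns require the tag to contain nothing but the no-break space.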

Regular Expressions - Select the Second Match

I have a txt file with <i> and </i> between words that I would like to remove using Editpad.
For example, I'd like to keep when it's like this:
<i>Phrases and words.</i>
And I'd like to remove the </i> and <i> tags inside the phrase, when it's like this:
<i>Phrases</i>and<i> words.</i>
<i>Phrases</i>and <i>words.</i>
I was trying to do that using regex, but I couldn't do it.
As the tag is followed by a space or a word character, I could find the lines that have the double tag with
/ <i>|<\/i> /
but this way I can't just replace the matches with nothing; I have to edit each line I find.
Is there any way to accomplish that?
* Edited *
Another example of lines found on the subtitle text
<i>- find me on the chamber.</i>
- What? <i>Go. Go, go, go!</i>
Rule number one: you can't parse html with regex.
That being said, if you know each line follows a certain pattern, you can usually hack something together to work. ;)
If I've understood correctly, it looks like you can simply remove all <i> and </i> that aren't either at the beginning or end of the lines. In that case, one method you could try is the following regex:
(?<=.)\<\/?i\>(?=.)
This will match the tags, with a lookahead and a lookbehind to make sure that we aren't at the end/start of a line (by checking that another character exists in front/behind). (Note that characters matched inside a lookahead/lookbehind are not part of the match, so they won't be replaced when you search/replace.)
Disclaimer: this works on regex101, but notepad++ may have some differences to the pcre regex style.
update to work with Editpad
EDIT: since this question is actually about how to do this in Editpad, below is a modified alternative:
Try searching for the regex: (.)\<\/?i\>(.). This will match (and capture) exactly one character before and after the <i> tags.
When replacing, use backreferences to replace the entire match with the two captured characters - a replacement string of \1\2 should work.
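Both variants can be tried side by side in Python. This is a rough sketch: Python's re is not the Editpad engine, the \< and \> escapes are written as plain < and > here, and the function names are made up.

```python
import re

# Lookaround version: the neighbouring characters are checked but not consumed.
LOOKAROUND = re.compile(r"(?<=.)</?i>(?=.)")
# Capture-group version: the neighbours are consumed, so \1\2 puts them back.
CAPTURING = re.compile(r"(.)</?i>(.)")


def strip_inner_tags_lookaround(line):
    return LOOKAROUND.sub("", line)


def strip_inner_tags_capturing(line):
    return CAPTURING.sub(r"\1\2", line)


if __name__ == "__main__":
    lines = (
        "<i>Phrases and words.</i>",
        "<i>Phrases</i>and<i> words.</i>",
        "<i>Phrases</i>and <i>words.</i>",
        "- What? <i>Go. Go, go, go!</i>",
    )
    for line in lines:
        print(strip_inner_tags_lookaround(line), "|", strip_inner_tags_capturing(line))
```

One caveat shows up with the last sample line: its opening <i> is not at the start of the line, so both variants remove it and leave the closing </i> behind. Lines like that would need a different rule.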

Regex match between two regex expressions

This has been driving me crazy, I can't find a solution that works!
I'm trying to do a regex between a couple of tags, bad idea I've heard but necessary this time :P
What I have at the start is a <body class="foo"> where foo can vary between files - <body.*?> search works fine to locate the only copy in each file.
At the end I have a <div id="bar">, bar doesn't change between files.
eg.
<body class="foo">
sometext
some more text
<maybe even some tags>
<div id="bar">
What I need to do is select everything between the two tags but not including them - everything between the closing > on body and the opening < on div - sometext to maybe even some tags.
I've tried a bunch of things, mostly variations on (?<=<body.*>)(.*?)(?=<div id="bar">) but I'm actually getting invalid expressions at worst on notepad++, http://regexpal.com/ and no matches at best.
Any help appreciated!
You are attempting to use a variable-length lookbehind, which most regular expression flavors, including the one in notepad++, do not support. I assume you are using notepad++, so you can use the \K escape sequence instead.
<body[^>]*>\K.*?(?=<div id="bar">)
The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. Make sure you have the . matches newline checkbox checked as well.
Alternatively, you can use a capturing group and avoid using lookaround assertions.
<body[^>]*>(.*?)<div id="bar">
Note: Using a capturing group, you can refer to group index "1" to get your match result.
Use the following pattern:
/<body[^>]*>(.*?)<div id="bar">/
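For anyone scripting this instead of using an editor, here is a minimal Python sketch of the capture-group approach. Python's built-in re module does not support \K, so only the group version is shown; the re.DOTALL flag plays the role of the ". matches newline" checkbox, and the function name is hypothetical.

```python
import re

# Capture-group pattern from the answers; DOTALL lets . cross line breaks.
BETWEEN = re.compile(r'<body[^>]*>(.*?)<div id="bar">', re.DOTALL)


def text_between(html):
    """Return what lies between the <body ...> tag and <div id="bar">, or None."""
    match = BETWEEN.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        '<body class="foo">\n'
        'sometext\n'
        'some more text\n'
        '<maybe even some tags>\n'
        '<div id="bar">\n'
    )
    print(repr(text_between(sample)))
```

Group 1 includes the line breaks right after the body tag and right before the div, so strip the result if you don't want them.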

parsing url for specific param value

I'm looking to use a regular expression to parse a URL to get a specific section of the URL, and nothing if I cannot find the pattern.
A url example is
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape
I wish to get the bolded text in it.
Currently I'm using the regex "d=([^#]*)", but the problem is I'm also running across URLs of this pattern:
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape
and I'm getting the bold section of it.
I would prefer to have no match for this URL because it doesn't contain the #.
Regexes are not a magic tool that you should always use just because the problem involves a string. In this case, your language probably has a tool to break apart URLs for you. In PHP, this is parse_url(). In Perl, it's the URI::URL module.
You should almost always prefer an existing, well-tested solution to a common problem like this rather than writing your own.
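As an illustration of the parser route, here is a rough Python sketch using the standard urllib.parse module (the answer above names PHP and Perl; Python is used here only as an example). It assumes the id always lives inside the URL stored in the jklfe parameter, as in the two sample URLs; the function and parameter names are made up.

```python
from urllib.parse import parse_qs, urlsplit

WITH_HASH = (
    "/te/file/value/jifle?uil=testing-cdas-feaw:jilk:"
    "&jklfe=https://value-value.jifels/temp.html/topic"
    "?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5"
    "#c452fds-634d-f424fds-cdsa&bf_action=jildape"
)
WITHOUT_HASH = (
    "/te/file/value/jifle?uil=testing-cdas-feaw:jilk:"
    "&jklfe=https://value-value.jifels/temp.html/topic"
    "?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape"
)


def id_if_fragment(url, outer_param="jklfe"):
    """Return the nested id value, or None when the URL has no '#' part.

    outer_param is an assumption taken from the sample URLs.
    """
    outer = urlsplit(url)
    if not outer.fragment:
        return None
    nested_urls = parse_qs(outer.query).get(outer_param, [])
    if not nested_urls:
        return None
    ids = parse_qs(urlsplit(nested_urls[0]).query).get("id", [])
    return ids[0] if ids else None


if __name__ == "__main__":
    print(id_if_fragment(WITH_HASH))
    print(id_if_fragment(WITHOUT_HASH))
```

The parser decides where the query ends and the fragment begins, so the code never has to reason about # or & itself.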
So you want to match the value of the id parameter, but only if it has a trailing section containing a '#' symbol (without matching the '#' or what's after it)?
Not knowing the specifics of what style of regexes you're using, how about something like:
id=([^#&]*)#
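A quick way to check this pattern against the two URLs from the question is sketched below in Python (the function name is hypothetical, and your regex flavor may differ slightly).

```python
import re

WITH_HASH = (
    "/te/file/value/jifle?uil=testing-cdas-feaw:jilk:"
    "&jklfe=https://value-value.jifels/temp.html/topic"
    "?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5"
    "#c452fds-634d-f424fds-cdsa&bf_action=jildape"
)
WITHOUT_HASH = (
    "/te/file/value/jifle?uil=testing-cdas-feaw:jilk:"
    "&jklfe=https://value-value.jifels/temp.html/topic"
    "?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape"
)

# The value may contain neither '#' nor '&', and a '#' must follow it directly.
ID_BEFORE_HASH = re.compile(r"id=([^#&]*)#")


def id_before_hash(url):
    """Return the id value only when it is immediately followed by '#'."""
    match = ID_BEFORE_HASH.search(url)
    return match.group(1) if match else None
```

In the second URL the value is followed by & rather than #, so the match fails and the function returns None.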
regex = "id=([\\w-]+?)#"
This will grab everything in the character class [a-zA-Z_0-9-] between 'id=' and '#', assuming everything between 'id=' and '#' is in that character class (i.e. if an '&' is in there, the regex will fail).
id=
-Self explanatory, this looks for the exact match of 'id='
([\\w-]+?)
-This defines a character class, repeats it, and captures the result as a group. The \\w is an escaped \w. '\w' is a predefined character class from java that is equal to [a-zA-Z_0-9]. I added '-' to this class because of the assumed pattern from your examples.
+?
-This is a reluctant quantifier that looks for the shortest possible match of the regex. It sits inside the parentheses so that the group captures the whole value; with the quantifier outside, the group would hold only the last character matched.
#
-The end of the regex, the last character we are looking for to match the pattern.
If you are looking to grab every character between 'id=' and the first '#' following it, the following will work. It uses the same logic as above, but replaces the character class [\\w-] with ., which matches any character (except line terminators by default).
regex = "id=(.+?)#"
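The effect of where the quantifier sits, and the difference between the character class and the dot, can be seen in a small Python sketch. The answer is written for Java; Python is used here only for illustration, the sample strings are shortened and hypothetical, and a raw string means the backslash is not doubled.

```python
import re

SAMPLE = "/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5#c452fds"
SAMPLE_WITH_AMP = "/topic?id=abc&x=1#frag"

# Quantifier outside the group: the group is overwritten on every repetition.
last_char = re.search(r"id=([\w-])+?#", SAMPLE).group(1)
# Quantifier inside the group: the group holds the whole value.
whole_value = re.search(r"id=([\w-]+?)#", SAMPLE).group(1)
# The dot accepts '&' too, so it runs on until the first '#'.
up_to_hash = re.search(r"id=(.+?)#", SAMPLE_WITH_AMP).group(1)
# The character class does not accept '&', so there is no match at all.
class_on_amp = re.search(r"id=([\w-]+?)#", SAMPLE_WITH_AMP)

if __name__ == "__main__":
    print(last_char, whole_value, up_to_hash, class_on_amp)
```

Which of the two behaviours you want for an '&' before the '#' depends on what the URLs can contain.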

Regex Match That doesn't contain some text

I am trying to create a regex that finds a Start Prefix and an End Prefix that have paragraph tags between them. But the one I have created is not working as I expected.
%%%HL_START%%%(.*?)</p><p>(.*?)%%%HL_END%%%
Correctly Matches
<p>This Should %%%HL_START%%%Work</p><p>This%%%HL_END%%% SHould Match</p>
This also matches, but I don't want it to match because the </p><p> is not in between the Start and End Prefix
<p>%%%HL_START%%%One%%%HL_END%%% Some More Text %%%HL_START%%%Here%%%HL_END%%%</p><p>Some more text %%%HL_START%%%Here%%%HL_END%%%</p>
I'm not entirely comfortable that regex is the right solution here; if you are getting into nested start and stop markers, you might not have a regular language...
For this specific example, try changing the regex to use [^%] instead of . so that the .*? matching can't go past the %%%HL_END%%% (this assumes the highlighted text itself never contains a % character):
%%%HL_START%%%([^%]*?)</p><p>([^%]*?)%%%HL_END%%%
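Here is a small Python sketch that runs the original and the modified pattern over the two sample strings from the question. The constant names are made up; the patterns and strings are copied from above.

```python
import re

ORIGINAL = re.compile(r"%%%HL_START%%%(.*?)</p><p>(.*?)%%%HL_END%%%")
NO_PERCENT = re.compile(r"%%%HL_START%%%([^%]*?)</p><p>([^%]*?)%%%HL_END%%%")

WANTED = "<p>This Should %%%HL_START%%%Work</p><p>This%%%HL_END%%% SHould Match</p>"
UNWANTED = (
    "<p>%%%HL_START%%%One%%%HL_END%%% Some More Text "
    "%%%HL_START%%%Here%%%HL_END%%%</p><p>Some more text "
    "%%%HL_START%%%Here%%%HL_END%%%</p>"
)

if __name__ == "__main__":
    for name, pattern in (("original", ORIGINAL), ("no percent", NO_PERCENT)):
        # Print whether each pattern matches the wanted and the unwanted string.
        print(name, bool(pattern.search(WANTED)), bool(pattern.search(UNWANTED)))
```

The original pattern matches the second string because .*? is free to run across an end marker and a later start marker; [^%] stops at the first % and so cannot.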