I'm having a battle with a regex. (MOBI creation)
I have two files: one with XML, the other an HTML table of contents.
The important parts of the XML:
<navPoint id="_NeedsHTMLid" playOrder="40">
<navLabel><text>Needs anchor text from link.)</text></navLabel>
...
The HTML TOC, of course, looks like:
schema.org Article Mark-up
======
Hours and hours... worked with Textpad forever. Saw remarks here, now I'm using NotePad++... some of the regex results are different (NOT that I had it working anyway.) #_[\b(\w\b] was returning the ID: now? Not so much!
Does anyone know how to yank both the ID and the anchor text out of these? I'd be so grateful.
You can use this to get the id and the anchor text at the same time:
_(\w+)\b|([a-Z\s.]+[)]+)
#_[\b(\w\b] is not a valid regex. Try _([^"]+)\b.
Edited: try [^"] in place of \w.
If you want to match the ids and the text, go to Search > Find menu (shortcut CTRL+F) and do the following:
Find what:
id="([a-zA-Z0-9\-\:\_\.]+)"|<text>(.+?)<\/text>
Select radio button "Regular Expression"
Then press Find All in Current Document
You can test it with your example at regex101.
Here's a StackOverflow post about valid id names.
I didn't provided you with a Search and Replace solution, since you didn't mentioned anything about a replacement.
Related
I have some urls saved in DB like hello world
with break tags, so i need to delete them, the problem that <br/> are in other places to so i can't delete all of them,
i write RegExp <*"*<br\/?>"> but it select not only <br> and quotes too.
You really shouldn't be using regular expressions for parsing HTML or XML.
Having said that. As I understand it, you have br tags inside the href attribute of a tags.
try :
href\s*?=\s*?\"(.*?)(<br\/?\>)\"
If you try to search about the right lines in the database, then this is your regex extended to match the whole line:
<.*\".*<br\/>\">.*>
After this you can mach the '<br/>' directly in those lines. Is there a language to edit your DB?
Some of the other answers here are okay. I'll offer an alternative:
https://regex101.com/r/uG5PBA/2
This'll put the break tags in a capture group -- group 1, so that you can simply nix them.
Regex:
<a[\s\S]*?(\<br\/>)[\s\S]*?<\/a>
Test String:
hello worldhello world
I'm trying to solve a regex riddle. Let's say I have rows of hrefs looking like this:
anchor1.in
an3.php
setup.exe
What I want the regex (or any other solution) to do is to take the href title and copy it over to the actual url with a foward slash in front of it.
A successful result would become:
anchor1.in
an3.php
setup.exe
If you can solve this please explain how you did it.
You can use the following to match:
(<a\s+href=")(.*?)(">)(.*?)(<\/a>)
And replace with:
\1\2/\4\3\4\5
See DEMO and Explanation
I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
A sample of a different one I have working below. language is php:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any new lines "\n" with a double break "" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.
I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part from (? to *$) what makes it work for me. I found this as an answer in java Regex - split but ignore text inside quotes? by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.
add ^ and $ to your regex
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
please notice you might need to escape the slashes after http (meaning https?\:\/\/)
update
if you want it to be case sensitive, you shouldn't use \w but [a-z]. the \w contains all letters and numbers, so you should be careful while using it.
for example, I have some feed, with an item title like this
Some text is better than one text http://t.co/blablabla #hashtag
then I want to get only the URL using regex like this
http://t.co/blablabla
how do i do that ?
(sorry I use google translate to make this question)
thanks for answer
Here's a random one I found with a simple google search:
(http|ftp|https)://([\w-_]+(?:(?:.[\w-_]+)+))([\w-.,#?^=%&:/~+#]*[\w-\#?^=%&/~+#])?
Using find and replace, what regex would remove the tags surrounding something like this:
<option value="863">Viticulture and Enology</option>
Note: the option value changes to different numbers, but using a regular expression to remove numbers is acceptable
I am still trying to learn but I can't get it to work.
I'm not using it to parse HTML, I have data from one of our company websites that we need in excel, but our designer deleted the original data file and we need it back. I have a list of the options and need to remove the HTML tags, using Notepad++ to find and replace
This works for me Notepad++ 5.8.6 (UNICODE)
search : <option value="\d+">(.*?)</option>
replace : $1
Be sure to select "Regular expression" and ". matches newline"
I have done by using following regular expression:
Find this : <.*?>|</.*?>
and
replace with : \r\n (this for new line)
By using this regular expression (<.*?>|</.*?>) we can easily find value between your HTML tags like below:
I have input:
<otpion value="123">1</option><otpion value="1234">2</option><otpion value="1235">3</option><otpion value="1236">4</option><otpion value="1237">5</option>
I need to find values between options like 1,2,3,4,5
and got below output :
This works perfectly for me:
Select "Regular Expression" in "Find" Mode.
Enter [<].*?> in "Find What" field and leave the "Replace With" field empty.
Note that you need to have version 5.9 of Notepad++ for the ? operator to work.
as found here:
digoCOdigo - strip html tags in notepad++
Something like this would work (as long as you know the format of the HTML won't change):
<option value="(\d+)">(.+)</option>
String s = "<option value=\"863\">Viticulture and Enology</option>";
s.replaceAll ("(<option value=\"[0-9]+\">)([^<]+)</option>", "$2")
res1: java.lang.String = Viticulture and Enology
(Tested with scala, therefore the res1:)
With sed, you would use a little different syntax:
echo '<option value="863">Viticulture and Enology</option>'|sed -re 's|(<option value="[0-9]+">)([^<]+)</option>|\2|'
For notepad++, I don't know the details, but "[0-9]+" should mean 'at least one digit', "[^<]" anything but a opening less-than, multiple times. Masking and backreferences may differ.
Regexes are problematic, if they span multiple lines, or are hidden by a comment, a regex will not recognize it.
However, a lot of html is genereated in a regex-friendly way, always fitting into a line, and never commented out. Or you use it in throwaway code, and can check your input before.