RegEx for excluding a match with prefix - regex

I first wanted to match only the first instance, but soon realized that is not possible. The tool I'm using only supports RegEx, so I have no other options either.
Basically I got a text with HTML tags in it and I want to match the first paragraph's tags without the following tags.
For example out of this:
<p>erkfoijwdocndoufhwroguh</p><p>pijgoijkuohuhogiougwtg</p><p>pijgoijkuohuhogiougwtg</p><p>pijgoijkuohuhogiougwtg</p>
I want to match the first <p></p>
and nothing else.
So I figured I could exclude the tags that have a tag right next to them using negative lookahead. As in:
(?!>)(<|<\/)p>
But for some reason this still matches every <p> and </p> tag instead of leaving out those that have another tag before them. Any suggestions?
Edit to add: I only need to match the tags, not the text inside the tags. And lookbehind doesn't work with the tool I'm using. It seems that everything that works here, works also in my tool.
Second edit: I solved my problem, but I'm leaving the question open since the solution wasn't an answer, this seems like an interesting question, and I might bump into a similar problem in the future. Basically, if someone figures out how I can refer to a <p> that doesn't have a > before it but also include the first </p>, I'd like to hear it.

I'm not sure I understood what you are trying to achieve, but would this work:
^<p>.*?<\/p>
Demo here: https://regex101.com/r/ZXgMPV/1
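As a rough illustration of the pattern above, here is a minimal Python sketch. Python's re module is an assumption (the asker's tool is not named), and the capture groups are my own addition to get at the two tags without the text between them. It also shows why the lookahead attempt from the question still finds every tag.

```python
import re

# Sample text from the question.
html = (
    "<p>erkfoijwdocndoufhwroguh</p><p>pijgoijkuohuhogiougwtg</p>"
    "<p>pijgoijkuohuhogiougwtg</p><p>pijgoijkuohuhogiougwtg</p>"
)

# The lookahead (?!>) inspects the character at the current position, which is
# always "<" when a tag starts there, so it never rejects anything.
all_tags = re.findall(r"(?!>)(?:<|</)p>", html)

# Anchoring at the start of the text and using a lazy .*? stops at the first </p>.
first = re.match(r"^(<p>).*?(</p>)", html)
first_paragraph = first.group(0)
opening_tag, closing_tag = first.group(1), first.group(2)

print(len(all_tags))
print(first_paragraph)
```

The anchor ^ is what restricts the match to the first paragraph; without lookbehind support there is no direct way to say "not preceded by >".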

Extract only the text field needed

I am at the beginning of learning Regex, and I use every opportunity to understand how it's working. Currently I am trying to extract dates from a text file (which is in fact a vnt-file type from my mobile phone). It looks like following:
BEGIN:VNOTE
VERSION:1.1
BODY;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:18.07.=0A14.08.=0A15.09.=0A15.10.=
=0A13.11.=0A13.12.=0A12.01.=0A03.02. Grippe=0A06.03.=0A04.04.2015=0A0=
5.05.2015=0A03.06.2015=0A03.07.2015=0A02.08.2015=0A30.08.2015=0A28.09=
17.11.2017=0A
DCREATED:20171118T095601
X-IRMC-LUID:150
END:VNOTE
I want to extract all dates, so that the final list is like that:
18.07.
14.08.
15.09.
15.10.
and so on. If the date has also a year, it should also be displayed.
I almost found out how to detect the dates by the following regex:
.+(\d\d\.\d\d\.(2015|2016|2017)?).+
But it only detects very few of the dates. The result is this:
BEGIN:VNOTE
VERSION:1.1
15.10.
04.04.2015
30.08.2015
24.01.2016
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
Then I tried to add a question mark which makes the .+ not greedy, as far as I read in tutorials. Then the regex looks like:
.+?(\d\d\.\d\d\.(2015|2016|2017)?).+?
But the result is still not what I am looking for:
BEGIN:VNOTE
VERSION:1.1
21.03.20.04.18.05.18.06.18.07.14.08.15.09.15.10.
13.11.13.12.12.01.03.02.06.03.04.04.20150A0=
03.06.201503.07.201502.08.201530.08.20150A28.09=
28.10.201525.11.201528.12.201524.01.20160A
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
For someone who is familiar with regex I am pretty sure this is very easy to solve, but I don't get it. It's very confusing when you are new to regex. I tried to find a hint in some tutorials or stackoverflow posts, but all I found is this: Notepad++ how to extract only the text field which is needed?
But it doesn't work for me. I assume it might have something to do with the fact that my text file is not one single line.
I have my example on regex101 too.
I would be very thankful if maybe someone can give me a hint what else I can try.
Edit: I would like to detect the dates with the regex and as a result have a list with only the dates (maybe it is called substitute?)
Edit 2: Sorry for not mentioning it earlier: I just want to use the regex in e.g. Notepad++ or an online regex test website, just to get the dates and save the result in a new txt-file. I don't want to use the regex in a programming language. My apologies for not being precise before.
Edit 3: The result should be a list with the dates, and each date in a new line:
18.07.
14.08.
15.09.
15.10.
I suggest this pattern:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)
This makes use of the \G anchor, which matches at the position where the previous match ended (or at the start of the text). In this case it allows consecutive matches from the very start of the text without leaving any unmatched characters between them, so everything except what's wanted can be removed.
If you want to remove the extra matches as well, add |.* at the end:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)|.*
regex101 demo
In N++, make sure the options underlined are selected, and that the cursor is at the beginning. In the picture below, I replaced then undid the replacement, only to show that matches were identified (16 replacements).
You can try using the following pattern:
\d{2}\.\d{2}\.(?:\d{4})?
This will match day.month dates of the form 18.07., but it also allows such a date to be followed by a four-digit year, e.g. 18.07.2017. While it would be nice to make the pattern more restrictive, to avoid false positives, I do not see anything obvious that can be added to the above pattern. Follow the demo link below to see the pattern in action.
Demo
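For readers who do end up in a scripting language, the same pattern can be tried with Python's re module; this is only a sketch under that assumption, using a shortened copy of the BODY field from the question. Removing the quoted-printable soft line breaks ("=" at the end of a line) before matching is my own addition, so that a date split across two physical lines is still found.

```python
import re

# Shortened copy of the BODY field from the question; "=\n" is a
# quoted-printable soft line break and "=0A" an encoded newline.
body = (
    "BODY;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:18.07.=0A14.08.=0A15.09.=0A15.10.=\n"
    "=0A13.11.=0A13.12.=0A12.01.=0A03.02. Grippe=0A06.03.=0A04.04.2015=0A0=\n"
    "5.05.2015=0A03.06.2015\n"
)

DATE = re.compile(r"\d{2}\.\d{2}\.(?:\d{4})?")

# Assumption: joining soft line breaks first is acceptable, so that a date
# split across two physical lines (05.05.2015 here) is found as well.
joined = body.replace("=\n", "")
dates = DATE.findall(joined)

# One date per line, as requested in the question.
print("\n".join(dates))
```

Without the join step, the pattern misses 05.05.2015 in this sample because the soft line break sits in the middle of the date.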

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (eg: http://www.link.com/folder/file.html) from a document with notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will give you a list of all your links. There are two issues though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
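The same find-and-replace can be reproduced outside Notepad++. Below is a minimal sketch using Python's re module (an assumption, since the question is about Notepad++); the sample text and its URLs are made up for illustration. re.DOTALL plays the role of the ". matches newline" option.

```python
import re

# Hypothetical sample text with two links and some surrounding noise.
text = (
    "See http://www.link.com/folder/file.html and\n"
    "also http://www.example.org/a-b.html for details."
)

# Lazily skip everything up to the next URL, keep only the URL (group 1).
pattern = re.compile(r".*?(http://www\.[a-zA-Z0-9./-]+)", re.DOTALL)
result = pattern.sub(r"\1\n", text)

print(result)
```

As noted above, the text after the last URL (" for details." here) is left in place and has to be removed separately.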
The answer made previously by @psxls was a great help to me when I wanted to perform a similar process.
However, that regex was written six years ago, so I had to adjust and update it to work properly with more recent links, because:
- a lot of URLs now use HTTPS instead of HTTP
- many websites no longer use www as their main subdomain
- some links contain punctuation marks (which have to be preserved)
I finally changed the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http://www\.[a-zA-Z0-9./-]\+' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?://[a-zA-Z0-9./-]\+' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy, and grep's default syntax has no lazy quantifiers, so appending a ? to it would not make it lazy (in basic syntax, \+\? just makes the repeated part optional). The way you have your URL match formulated, greediness won't cause problems, but if you change your match to, say, [^ ] instead of [a-zA-Z0-9./-], you could see adjacent URLs on the same line getting combined together.
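If grep is not at hand, the "print only the matches" behaviour of grep -o can be approximated in a few lines of Python. This is a sketch, not a replacement for grep: grep_o is a hypothetical helper name, the sample lines are made up, and the pattern is written in Python's syntax, where + and ? need no backslashes.

```python
import re

URL = re.compile(r"https?://[a-zA-Z0-9./-]+")

def grep_o(pattern, lines):
    """Yield every match of pattern in each line, roughly like grep -o."""
    for line in lines:
        for match in pattern.finditer(line):
            yield match.group(0)

# Hypothetical input lines.
sample_lines = [
    "first http://www.link.com/folder/file.html then https://example.org/a-b",
    "no links on this line",
]
urls = list(grep_o(URL, sample_lines))

# In Python, +? really is lazy; at the very end of a pattern it stops after
# a single character, which is rarely what is wanted for URLs.
lazy = re.findall(r"https?://[a-zA-Z0-9./-]+?", "https://example.org/a-b")

print(urls)
print(lazy)
```

The greedy version is the one to use here, since the character class already stops at whitespace and other characters that cannot be part of the URL.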
I did this a different way.
Find everything up to the first/next (https or http) (then everything that comes next) up to (html or htm), then output just the '(https or http)(everything next) then (html or htm)' with a line feed/ carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: <a href="\1\2\3">\1\2\3</a>\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be RegEx related, but here is another efficient way to get the lines containing URLs, at least if there is a common pattern to all links, like https://.
As Toto mentioned in the comments, this won't remove the text around the links.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)
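The bookmark steps above amount to "keep only the lines that contain the marker". A minimal Python sketch of the same idea, assuming a made-up sample text and the https:// marker from the answer:

```python
# Hypothetical sample text; only two of its lines contain a link.
text = (
    "intro line\n"
    "see https://example.org/a.html here\n"
    "footer\n"
    "https://example.org/b.html"
)

marker = "https://"

# Keep whole lines that contain the marker; surrounding text on those lines stays.
kept = [line for line in text.splitlines() if marker in line]

print("\n".join(kept))
```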

Regex to identify HTML tags (as a regex repetition learning exercise ONLY!!)

I'm very very new to regex. I'd managed to not touch it with a 10-foot pole for so long. And I tried my best to avoid it so far. But now a personal project is pushing me to learn it.
So I started. And I'm going through the tutorial located here: http://www.regular-expressions.info/tutorial.html
Currently I'm here: http://www.regular-expressions.info/repeat.html
My question is:
The tutorial says <[A-Za-z][A-Za-z0-9]*> will match an HTML tag.
But wouldn't it also match invalid html tags like - <h11> or <h111>?
Also how would it match the closing tags?
Edit - My question is very specific. I am referring to one particular example in one particular tutorial to clarify whether or not my understanding of repetitions is correct. Again, I REPEAT, I DO NOT care about html parsing with regex.
I don't see any harm in answering your question seeing as how you are attempting to learn regex:
1) Yes, it will match invalid tags as well because it's any letter followed by any zero or more matches of another letter or a number.
2) It will not match closing tags (there would have to be a search for a / somewhere in there).
One more comment: one way people used to use to look for html tags inside a document was to look for the pattern of opening and closing brackets, like so:
<\/?[^>]*>
That's opening-bracket, an optional slash, (anything but a closing bracket)-repeated and then a closing bracket. Of course, I am not recommending anyone do this. It's merely left here as an exercise.
The tutorial says <[A-Za-z][A-Za-z0-9]*> will match an HTML tag.
But wouldn't it also match invalid html tags like - <h11> or <h111>?
Also how would it match the closing tags?
Yes, that will match <h11> as well as <X098wdfhfdshs98fhj2hsdljhkvjnvo9sudvsodfih23234osdfs>.
If you want to just match a letter followed by an optional single digit, so you'd match <h1>, then you want <[A-Za-z][0-9]?>
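The three patterns discussed in these answers can be compared side by side. The sketch below uses Python's re module (an assumption; the tutorial itself is engine-neutral) and a handful of sample strings chosen for illustration.

```python
import re

tutorial = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")  # pattern from the tutorial
strict = re.compile(r"<[A-Za-z][0-9]?>")          # one letter, optional single digit
any_tag = re.compile(r"</?[^>]*>")                # anything between < and >

samples = ["<h1>", "<h11>", "<p>", "<div>", "</p>"]

# fullmatch requires the whole sample to match the pattern.
tutorial_hits = [s for s in samples if tutorial.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]
any_hits = any_tag.findall("<p>text</p><h11>")

print(tutorial_hits)
print(strict_hits)
print(any_hits)
```

Note that the stricter pattern rejects <h11> but also rejects multi-letter tags such as <div>, since it allows exactly one letter.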

Delete all the content between <strong></strong> in Yahoo Pipes feed

I'm pulling a feed from GMA, don't ask why. I'm using yahoo pipes because I can filter out certain articles based on their title. Then I run the feed through feedenlarger.com so I can get the full text pretty easily.
The problem I'm having is that the feeds contain bold links in them that are disrupting the articles. Each one is surrounded by a <strong>....</strong>. I am trying to just delete any content that exists between the <strong></strong>, but I can't seem to get it right.
I have tried:
item.description replace <strong>*?</strong> with (and left blank)
as well as
item.description replace <strong>*?</strong> with (also left blank)
I know regex and html are not meant for one another, but if someone has a suggestion or direction, I'd appreciate it very much.
Thanks
I'm not familiar with what you are doing exactly, but I would first try just removing the <strong> tag to check whether escaping is needed. By that I mean see whether <strong> or its escaped form &lt;strong&gt; works, to make sure you are on the right track.
I believe the source of the issue is that it appears you are trying to match many > rather than the actual contents between the tags. Try using .*? (or [^<]*? if you know there will be no other tags within the tags) instead.
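To see the difference between quantifying the > and quantifying "any character", here is a small sketch in Python's re module. Yahoo Pipes' own regex module is not used here, and the sample description is made up.

```python
import re

# Hypothetical feed description with a bold link in the middle.
description = 'Intro <strong><a href="x">Link</a></strong> text.'

# Original attempt: *? applies to the ">" only, so nothing between the tags matches.
unchanged = re.sub(r"<strong>*?</strong>", "", description)

# Suggested fix: .*? lazily matches everything up to the first closing tag.
cleaned = re.sub(r"<strong>.*?</strong>", "", description, flags=re.DOTALL)

print(unchanged)
print(cleaned)
```

The re.DOTALL flag is only needed if the bold content can span several lines.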

Get value between <b> tag using regex in Yahoo Pipes

I have searched up and down trying to find an answer that will work for me but haven't been able to figure this out. I'm using Yahoo Pipes for this.
Lake Harmony Estates <b>Sleeps: 16</b>
What I need to do is extract the Sleeps: 16 out from the B tag and output just that value and nothing else. I don't suspect this is very hard to do, but given my limited regex knowledge it's giving me troubles. I've tried adapting regex code pertaining to other tags, but just can't seem to get this one to work.
Any help on this would be appreciated. Thanks.
Edit:
Here is my pipe if you wanted to take a look at the regex horrible-ness I've created. The one I'm trying to work though is the item.sleeps, last entry in the 2nd regex
http://pipes.yahoo.com/pipes/pipe.info?_id=567026d850223b0075d80fd3c9bf7e75
This should fit your needs, assuming the html isn't laden with quotes and such. Note that the + means that empty <b> tags are ignored. Also, html is not truly parsable via regex, so this will only work for basic tags. It should work even if the tag has an ID or a class property, but there are definitely ways to break this regex.
/<b[^>]*>([^<]+)<\/b>/
I posted this question to Twitter and got a response back that worked for me.
(?s)^.*<b>(.*?)</b>.*
Replace with $1 and have G flag checked.
This solution did everything I needed. I had additional data that I had already excluded in my example that became unnecessary with this regex.
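Both suggestions can be checked against the sample line from the question. The sketch below uses Python's re module as a stand-in (an assumption; Yahoo Pipes has its own regex operator) and writes $1 as \1, which is Python's replacement syntax.

```python
import re

# Sample item from the question.
item = "Lake Harmony Estates <b>Sleeps: 16</b>"

# First answer: capture the text between <b ...> and </b>.
m = re.search(r"<b[^>]*>([^<]+)</b>", item)
sleeps_a = m.group(1) if m else None

# Second answer: match the whole string and replace it with the captured group.
sleeps_b = re.sub(r"(?s)^.*<b>(.*?)</b>.*", r"\1", item)

print(sleeps_a)
print(sleeps_b)
```

The first form extracts the value directly; the second rewrites the whole field, which is closer to how a replace-only tool has to work.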