Match line breaks with a regular expression - regex

The text:
<li><a href="#">Animal and Plant Health Inspection Service Permits
Provides information on the various permits that the Animal and Plant Health Inspection Service issues as well as online access for acquiring those permits.
I want to use a regular expression to insert </a> at the end of Permits. It just so happens that all of my similar blocks of HTML/text already have a line break in them. I believe I need to find a line break \n where the line contains (or starts with) <li><a href="#">.

You could search for:
<li><a href="#">[^\n]+
And replace with:
$0</a>
Where $0 is the whole match. The exact semantics will depend on the language you are using, though.
WARNING: You should avoid parsing HTML with regex.

By default . (any character) does not match newline characters.
This means you can simply match zero or more of any character then append the end tag.
Find: <li><a href="#">.*
Replace: $0</a>
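For a concrete check, here is a minimal sketch of the find/replace above in Python. The choice of Python's re module is an assumption, since the question does not name a tool. Python writes the whole-match reference as \g<0> rather than $0.

```python
import re

# Sample text taken from the question: the link text ends at the first line break.
text = (
    '<li><a href="#">Animal and Plant Health Inspection Service Permits\n'
    'Provides information on the various permits that the Animal and Plant '
    'Health Inspection Service issues as well as online access for acquiring '
    'those permits.'
)

# "." does not match "\n" by default, so ".*" stops at the end of the line.
pattern = re.compile(r'<li><a href="#">.*')

# \g<0> is Python's spelling of the whole match ($0 in other tools).
result = pattern.sub(r'\g<0></a>', text)

print(result.splitlines()[0])
```

If the file uses Windows line endings, "." would also match the "\r", so the closing tag would land after it; normalizing line endings first avoids that.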

Related

Regular Expression to exclude a String around the required String

In between a HTML code:
...<div class="..."><a class="..." href="...">I need this String only</a></div>...
How do I write Regular Expression (for Rainmeter which uses Perl RegEx) such that:
-required string "I need this String only" is grouped to be extracted,
-the HTML link tag <a>...</a> might be
absent or present & can be present in between the required string and multiple times as well.
My attempt:
(?siU) <div class="...">.*[>]{0,1}(.*)[</a>]{0,1}</div>
where:
.*= captures every character except newline {<a class ... "}
[>]{0,1}= accepts 0 or 1 occurrences of > {up to >}
(.*)= captures my String
[</a>]{0,1}= accepts 0 or 1 times presence of </a>
This, of course, doesn't work as I want: the output includes the HTML link markup preceding my string.
So my question is:
How do I write a better (and working) RegEx?
Even though I agree with the advice to use a real parser for this problem, this regular expression should solve your problem:
<div [^<>]*>(?:[^<>]*<a [^<>]*>)*([^<>]*)(?:</a>)*</div>
Logic:
require <div ...> at the beginning and </div> at the end.
allow and ignore <a ...> before the matched text arbitrarily many times
allow and ignore </a> after the matched text arbitrarily many times
ignore any text before any <a ...> with [^<>]* in front of it. Using .* would also work, but then it would skip all text arbitrarily up to the last instance of <a ...> in your string.
I use [^<>]* instead of .* to match non-tag text in a protected way, since literal < and > are not allowed.
I use (?:...) to group without capturing. If that is not supported in your programming language, just use (...) instead, and adjust which match you use.
Caveat: this won't be fully general but should work for your problem as described.
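To see the logic above in action, here is a small sketch in Python (an assumption, since Rainmeter uses its own PCRE-based engine). The class and href values in the sample strings are hypothetical placeholders for the "..." in the question.

```python
import re

# The pattern from the answer: <div ...>, optional <a ...> tags, the text, optional </a>, </div>.
pattern = re.compile(r'<div [^<>]*>(?:[^<>]*<a [^<>]*>)*([^<>]*)(?:</a>)*</div>')

# Hypothetical inputs: one with a link wrapped around the text, one without.
with_link = '<div class="x"><a class="y" href="z">I need this String only</a></div>'
without_link = '<div class="x">I need this String only</div>'

captured = []
for html in (with_link, without_link):
    match = pattern.search(html)
    captured.append(match.group(1) if match else None)

print(captured)
```

Both inputs should yield the same captured text, because the link tags are consumed by the non-capturing groups on either side of the capture.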

sed (regex) not working properly

I have to separate out an expression from the following piece of HTML code:
<div class="summary">
<h3>Why is executing Java code in comments allowed?</h3>
<div class="tags t-java t-unicode">
java unicode
</div>
<div class="started">
modified <span title="2015-06-15 17:43:58Z" class="relativetime">yesterday</span>
zwol <span class="reputation-score" title="reputation score 52560" dir="ltr">52.6k</span>
</div>
</div>
The part which I want starts from .... 'title="the following code produces the outp ..........executing Java code in comments allowed?' all the way up to the end of 'a' and 'h3' tags.
Due to various reasons, I have to only use either sed or awk.
I have tried various regular expressions. Since the required part may sometimes even span multiple lines, I have used the following sed command (since .* matches only up to a newline character):
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(.*\)<\/a><\/h3>.*/\1/p;}' Trial.html
I am getting no results with this. However, If I remove the end part:
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(.*\)/\1/p;}' Trial.html
I am able to catch the beginning of my required string and it prints up to the end.
I have also referred to this serverfault.com question, for help:
https://serverfault.com/questions/315145/regex-for-sed-to-grab-multiple-lines-or-a-better-way
Edit:
There could be other similar blocks also. I don't have to stop at the first result. I have taken the html from this page:
https://stackoverflow.com/?tab=month
This is another question which is very similar to mine!
https://unix.stackexchange.com/questions/64645/text-between-two-tags
Your line
sed -n '1h;1!H;${;g;s/.*<h3><a href="\/questions\/.*link" title="\(\.*\)<\/a><\/h3>.*/\1/p;}' Trial.html
That line puts everything into the hold space; then, after the whole file has been read, it copies the hold space back into the pattern space so that it can be used for multi-line parsing.
A modification idea: the group \(\.*\) is not correct, by the way, because the '.' is escaped there, so it matches a literal '.' rather than any character.
Instead, you could use title="\([^<]*\) which will capture all characters up to the first '<'.
Also, if title=" is present only once in the file, then there is no need for so many characters in the first part of the pattern; ^.*title=" will be enough.
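For comparison only (the question is restricted to sed or awk), the same idea can be sketched in Python: read the whole input as one string and capture with a negated class instead of .*. The sample HTML is hypothetical; its href, class, and title values are invented for illustration.

```python
import re

# Hypothetical sample: the href, class and title values are invented for illustration.
html = (
    '<div class="summary">\n'
    '<h3><a href="/questions/1/example" class="question-hyperlink" '
    'title="First line of the excerpt\nsecond line of the excerpt">Example question?</a></h3>\n'
    '</div>\n'
)

# [^<]* also matches newlines, so the capture may span several lines,
# but it cannot run past the first "<" (here, the start of </a>).
pattern = re.compile(r'title="([^<]*)</a></h3>')

matches = pattern.findall(html)
print(matches)
```

Because the negated class cannot cross a "<", each block on the page produces its own match instead of one greedy match swallowing everything up to the last </h3>.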

Regex Match All Characters Between Tags on nth occurrence

I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match here. And some.
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but Regex is not my strong suit.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.
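As a quick illustration of the look-ahead version, here is a sketch in Python (an assumption, since the question does not name a language). It collects every segment and then picks the nth one; the variable n is introduced here just for the example.

```python
import re

text = "Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>"

# The look-ahead leaves each closing <br> unconsumed, so it can open the next match.
segments = re.findall(r'<br>\s*(.*?)(?=\s*<br>)', text)

n = 2  # 1-based index of the wanted segment (chosen for this example)
print(segments[n - 1])  # prints: here. And some
```

Picking the match by index keeps the pattern itself independent of which occurrence is wanted.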

Regex to match content before string

I am trying to extract a URL from content using Yahoo Pipes, but for that I need to match everything before the URL and everything after:
<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"
href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork"
style="background:url(http://i1.sndcdn.com/artworks-000059185212-dsb68g-badge.jpg?
3eddc42);">Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank"
href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>
</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"
target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>
The URL I want is this one: http://soundcloud.com/tom-misch/dream-ft-notorious-big
I tried to learn a bit about regex but when I think I understand, nothing I try works
Hope some of you can help me on that guys !
cheers
This will probably do. It only matches soundcloud URLs that use the http protocol and have no subdomain. The group captures the full URL so that you can use it, and a lazy quantifier matches up to the first quote:
(http://soundcloud.*?)"
Here is an alternative that does not use a lazy quantifier; instead it uses a negated class to match anything but a quote:
(http://soundcloud[^"]+)
Keep in mind that both regexes will actually match both URLs. Depending on the library and the flags that you use, it might return only the first occurrence or both; you can just use the first one or further check the results for the correct format.
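Here is a minimal sketch of that behaviour, assuming Python's re module rather than whatever engine Yahoo Pipes uses. The sample is trimmed from the question's HTML so that only the two soundcloud links remain.

```python
import re

# Trimmed from the question's HTML: only the two soundcloud links are kept.
html = (
    '<h3><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a></h3> '
    '<a rel="nofollow" target="_blank" href="http://soundcloud.com/tom-misch" '
    'class="user-name">Tom Misch</a>'
)

# Lazy quantifier: stop at the first quote after the URL starts.
lazy = re.findall(r'(http://soundcloud.*?)"', html)

# Negated class: consume everything that is not a quote.
negated = re.findall(r'(http://soundcloud[^"]+)', html)

# Both patterns find both URLs; the wanted one is the first.
first_url = lazy[0]
print(lazy)
print(first_url)
```

With findall both URLs come back in document order, so taking index 0 is enough here; re.search would return only the first one.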
If you really want to use just a regex and your regex library supports look-ahead, you can do this:
(http://soundcloud[^"]+)"(?!\s+class="user-name")
The negative look-ahead (?!...) makes the match fail if the closing quote is followed by class="user-name", which rules out the second URL.
I also could not find which regex library Yahoo Pipes uses. If you want to replace everything around the URL, you can change the regex to:
^.*?(http://soundcloud[^"]+).*$
And use $1 in the replacement string to get the URL back. Keep in mind that I mixed .*? with [^"]+ because I want to replace the whole string with the first URL and not the second one, so I need the first .*? to match up to the start of the first URL and stop; that's what the lazy quantifier is for.
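The replacement approach can be sketched like this in Python (again an assumption; Yahoo Pipes may behave differently). Python writes the group reference as \1 instead of $1, and the re.DOTALL flag is added here so that "." can cross the line breaks in the sample.

```python
import re

# Shortened from the question's HTML, keeping the line breaks inside the tags.
html = (
    '<h3><a rel="nofollow" target="_blank"\n'
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>\n'
    '</h3> <span class="subtitle"><a rel="nofollow"\n'
    'target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)

# The lazy ^.*? stops at the first URL; the greedy .*$ swallows the rest.
# re.DOTALL makes "." match newlines so the whole multi-line string is replaced.
url = re.sub(r'^.*?(http://soundcloud[^"]+).*$', r'\1', html, flags=re.DOTALL)

print(url)
```

Without a flag like re.DOTALL the pattern cannot match across the line breaks in this input, and the string would come back unchanged.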

how to match any string in Emacs regexp?

I'm referring to this page: http://ergoemacs.org/emacs/emacs_regex.html
which says that to capture a pattern in Emacs Regexp, you need to escape the paren like this: \(myPattern\).
It further says that the syntax for capturing a sequence of ASCII characters is [[:ascii:]]+
In my document, I'm trying to match all strings that occur between <p class="calibre3"> and </p>
So, following the syntax above, I do a replace-regexp for
<p class="calibre3">\([[:ascii:]]+\)</p>
but it finds no matches.
Suggestions?
Regexps are not good for general-purpose HTML parsing, but as paragraph tags cannot be validly nested, the following is going to be fine (provided the mark-up is valid & well-formed).
<p class="calibre3">\(.*?\)</p>
*? is the non-greedy zero-or-more repetitions operator, so it will match as little as possible -- in this case everything until the next </p> (as opposed to the greedy version, which would match everything until the final </p> in the text).
The [^<] approach is fine if it fits the data in question, but it won't work if there are other tags within the paragraphs.
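The difference between the greedy and non-greedy operators is easy to see outside Emacs as well. The following is a sketch in Python, whose syntax differs from Emacs regexps (groups are written with plain parentheses); the two-paragraph sample is hypothetical.

```python
import re

# Hypothetical sample with two paragraphs on one line.
html = '<p class="calibre3">First</p><p class="calibre3">Second</p>'

# Non-greedy: each match stops at the next </p>.
lazy = re.findall(r'<p class="calibre3">(.*?)</p>', html)

# Greedy: the single match runs on to the final </p>.
greedy = re.findall(r'<p class="calibre3">(.*)</p>', html)

print(lazy)
print(greedy)
```

The non-greedy form returns one capture per paragraph, while the greedy form returns a single capture that still contains the markup between the two paragraphs.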
I would use [^<] instead of [[:ascii:]]. Note that angle brackets should not be escaped in Emacs regexps, where \< and \> match the beginning and end of a word:
<p class="calibre3">\([^<]+\)</p>
Source: #TooTone