Can't seem to capture newline+spaces in Regex - regex

I know regexes aren't the best tool for parsing web pages, but I'm using them as an exercise.
I'm using Район:[^<>]*\n\s*<[^<>]*>\n\s*<a[^<>]*>([^<>]+)<\/a>
to try to match:
Район: </span>
<span class="company__contacts-item-text">
<a class="link" href="/moscow/top/marina-roscha/">Марьина роща</a>
I've been looking at it for a while but I don't know what I've been doing wrong. How can I capture something that would have newlines and different urls in the tags?

Try this regex, with the single-line (s / dotall) flag enabled so that . also matches the line breaks:
Район:.+?<a[^>]+>(.+?)</a>
DEMO
https://regex101.com/r/wA4oH0/1
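A minimal Python sketch of the suggested pattern, run against the snippet from the question. It assumes the engine is told to let . match newlines (re.DOTALL in Python); the variable names are illustrative only.

```python
import re

# HTML fragment from the question; the tags sit on separate lines.
html = (
    'Район: </span>\n'
    '<span class="company__contacts-item-text">\n'
    '<a class="link" href="/moscow/top/marina-roscha/">Марьина роща</a>'
)

pattern = r'Район:.+?<a[^>]+>(.+?)</a>'

# Without DOTALL, "." stops at the first newline, so nothing matches.
no_flag = re.search(pattern, html)

# With DOTALL, ".+?" lazily skips the tags and line breaks up to <a ...>.
with_flag = re.search(pattern, html, re.DOTALL)
district = with_flag.group(1) if with_flag else None

print(no_flag)    # None
print(district)   # Марьина роща
```

The lazy .+? matters here: a greedy .+ would run to the last <a ...> on the page rather than the first one after the label.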

Related

How to Match only url from a tag node js

I have a tag <span style="color: rgb(255,255,255);">[1]</span>
I am using this regex: <a href="(.*)">(.*)<\/a>, but it does not capture only the URL. It also captures <span style="color: rgb(255,255,255);">[1]</span>.
How can I get only the URL from <a> tags?
Easily! Capture everything between the keyword href= and the closing >, using a lazy quantifier so the match stops at the first >.
href=(.*?)>
If you don't want to capture the quotation marks, try this one:
href="(.*?)">
I don't have much experience with Node.js, but I think something like this should work; it won't be hard to adapt if you already know regex.
var pattern = new RegExp(/href="(.*?)">/);
Here is Regex101.
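To see why the greedy version over-captures, here is a small Python sketch. The URL is a hypothetical placeholder (the question's own link is not shown), and the third pattern, a negated character class, is an alternative not taken from the answer.

```python
import re

# Hypothetical input: the URL is a placeholder, the <span> is from the question.
html = ('<a href="https://example.com/page">'
        '<span style="color: rgb(255,255,255);">[1]</span></a>')

# Greedy: (.*) runs to the LAST '">' in the string, swallowing the <span> tag.
greedy = re.search(r'<a href="(.*)">(.*)</a>', html).group(1)

# Lazy: (.*?) stops at the FIRST '">'.
lazy = re.search(r'href="(.*?)">', html).group(1)

# Negated class: cannot cross a quote at all, so there is nothing to backtrack into.
strict = re.search(r'href="([^"]*)"', html).group(1)

print(greedy)
print(lazy)     # https://example.com/page
print(strict)   # https://example.com/page
```

The negated class also keeps working when other attributes follow href, which the lazy pattern ending in "> does not.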

Regular Expression in JMeter 2.12

I'm trying to use the Regular Expression Extractor in JMeter. When I try to parse the following string:
8EC4146730CC4A27afMCCam3ZeAl4uWt3qMMi9cE7Q5YtIkS5BDaba6bI1cgv41dm07wWlFjAmCcRLd97tmLyuO0ycKflQzhaoQS68CGaRo1oqsL1ZQyLGJMM
From the html snippet:
YourCourse
</dt>
Using this Regular Expression:
<a href='siw_portal.url\?([^"]+)' id="STU_COURSE" title='Your course'>Your Course</a>
</dt>
And Template is set to $1$.
The Regular Expression Extractor doesn't find the string.
Any ideas on why this isn't working, or how to debug this will be much appreciated.
Thanks
Because you made a mistake with the quotes:
<a href="siw_portal.url\?([^"]+)".......title="Your course"
// __^ __^ __^ __^
instead of
<a href='siw_portal.url\?([^"]+)'.......title='Your course'
You can test your regex using any online regex tester, which will help you catch simple syntax errors and also provides hints that can be really useful for a beginner.
I like this one: http://regex101.com/
You have used different quotation marks in your regex than in the sample you are matching, which is why you don't find a match. You are matching " when the sample uses '.
You can make it work in both cases by using ["'], or you can choose whichever of ' or " the sample actually uses.
In your sample, try:
<a href=["']siw_portal\.url([^"']+)["'] id=["']STU_COURSE["'] title=["']Your course["']>Your Course</a>
</dt>
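As a rough illustration of the ["'] idea, here is a Python sketch. The two sample strings and the token TOKEN123 are hypothetical, since the original HTML is not shown, and JMeter uses a different regex engine than Python's re, so treat this as a check of the pattern's logic rather than of JMeter's behaviour.

```python
import re

# ["'] accepts either quote style around each attribute value.
pattern = re.compile(
    r"""<a href=["']siw_portal\.url\?([^"']+)["'] id=["']STU_COURSE["'] """
    r"""title=["']Your course["']>Your Course</a>"""
)

def extract_token(html):
    """Return the query string after 'siw_portal.url?', or None if no match."""
    m = pattern.search(html)
    return m.group(1) if m else None

# Hypothetical samples: same link, once single-quoted and once double-quoted.
single = "<a href='siw_portal.url?TOKEN123' id='STU_COURSE' title='Your course'>Your Course</a>"
double = '<a href="siw_portal.url?TOKEN123" id="STU_COURSE" title="Your course">Your Course</a>'

print(extract_token(single))  # TOKEN123
print(extract_token(double))  # TOKEN123
```

Note that the link text is matched literally, so "Your Course" and "YourCourse" are not interchangeable.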
This should work
<a href="siw_portal.url(.+?)" id="STU_COURSE" title="Your course">YourCourse<\/a>
<\/dt>

Regex, I just don't get it right

I just don't get my Regex right:
I have the following template:
<!-- Defines the template for the tabs. -->
{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
<div class="tabs">
<ul role="tablist">
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
</ul>
</div>
{{Render:Tabs}}
I would like to find everything between {{ and }}, except the tags that begin with {{BOS, {{EOS, {{TMPL, or {{Render.
Here are a couple approaches:
Attempt 1:
({{).*(}})
This selects everything between {{ }} tags, which is not good.
Attempt 2:
({{)[^TMPL][^BOS][^EOS][^Render].*(}})
With this, {{TabType}} and {{TabFile}} are no longer selected, and I don't know why.
With some other regexes, {{TabType}}" id="{{tabId}} gets selected as a single match.
Does anyone have a clue how to solve this? I really need a regex guru :-)
You can use negative lookahead based regex like this:
{{(?!TMPL|[BE]OS|Render).*?}}
RegEx Demo
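A short Python sketch of the negative-lookahead approach, run on a trimmed copy of the template from the question. The braces are escaped and a capture group is added around .*? so that findall returns only the names; both are small adaptations, not part of the answer's pattern.

```python
import re

# Trimmed copy of the template from the question.
template = """{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
{{Render:Tabs}}"""

# (?!...) rejects a tag whose name starts with TMPL, BOS, EOS or Render.
# The lazy .*? stops at the first }} so two tags on one line stay separate.
placeholder = re.compile(r"\{\{(?!TMPL|[BE]OS|Render)(.*?)\}\}")

names = placeholder.findall(template)
print(names)  # ['TabType', 'tabId', 'TabFile']
```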
You have to use the following regex to get the content between braces:
\{\{(.*?)\}\}
Working Demo
If you want to exclude the tags you mentioned, you can use an alternation: match what you don't want first, without capturing it, and put what you do want in a capturing group at the end of the regex:
\{\{BOS:Sequence\}\}|\{\{EOS:Sequence\}\}|\{\{TMPL:Import.*?\}\}|\{\{Render:Tabs\}\}|\{\{(.*?)\}\}
Working demo
By the way, if you want a shorter version of the regex above, you can use:
\{\{(?:BOS|EOS):Sequence\}\}|\{\{TMPL:Import.*?\}\}|\{\{Render:Tabs\}\}|\{\{(.*?)\}\}
This is a very useful technique for pattern exclusion that I was glad to learn from Anubhava and zx81 (they rock at regex). With this technique, the content you need ends up in capturing group 1; the alternatives you want to discard still match, but they capture nothing.
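The exclusion-by-alternation technique can be sketched in Python as follows, on a trimmed copy of the question's template. The only assumption is how the results are filtered: matches where group 1 did not participate are dropped.

```python
import re

# Trimmed copy of the template from the question.
template = """{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
{{Render:Tabs}}"""

pattern = re.compile(
    r"\{\{(?:BOS|EOS):Sequence\}\}"   # unwanted: consumed, nothing captured
    r"|\{\{TMPL:Import.*?\}\}"        # unwanted
    r"|\{\{Render:Tabs\}\}"           # unwanted
    r"|\{\{(.*?)\}\}"                 # wanted: lands in group 1
)

matches = list(pattern.finditer(template))

# Keep only the matches where the last alternative (group 1) took part.
wanted = [m.group(1) for m in matches if m.group(1) is not None]
print(len(matches))  # 7
print(wanted)        # ['TabType', 'tabId', 'TabFile']
```

Compared with a lookahead, this works even in engines without lookaround support, at the cost of filtering the matches afterwards.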
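Using [^TMPL] and the like won't work because these are character classes: [^TMPL] matches any single character other than T, M, P, or L, so it also rejects {{TabType}}, whose name starts with T. You could use a negative lookahead, though (or even a lookbehind, depending on the regex library you are using).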
\{\{(?!BOS:)(?!EOS:)(?!Render:)(?!TMPL:)(.*?)\}\}
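A quick Python check of the difference between the character-class attempt and the lookahead version, on individual tags from the question. The braces in Attempt 2 are escaped here; otherwise the patterns are as posted.

```python
import re

# Attempt 2 from the question: four single-character negated classes in a row.
attempt2 = re.compile(r"(\{\{)[^TMPL][^BOS][^EOS][^Render].*(\}\})")

# Lookahead version: each (?!...) tests a whole prefix, not one character.
lookahead = re.compile(r"\{\{(?!BOS:)(?!EOS:)(?!Render:)(?!TMPL:)(.*?)\}\}")

# "T" is one of the letters excluded by [^TMPL], so {{TabType}} is rejected.
tabtype_attempt2 = attempt2.search("{{TabType}}")

# "t", "a", "b", "I" each happen to pass their class, so {{tabId}} matches.
tabid_attempt2 = attempt2.search("{{tabId}}")

tabtype_lookahead = lookahead.search("{{TabType}}")
bos_lookahead = lookahead.search("{{BOS:Sequence}}")

print(tabtype_attempt2)            # None
print(tabid_attempt2.group(0))     # {{tabId}}
print(tabtype_lookahead.group(1))  # TabType
print(bos_lookahead)               # None
```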
Still, I get the feeling that you want BOS, EOS, etc. to be plain strings in the template, with only the other {{ }} values interpolated. If you are using Handlebars or something similar, you can have strings interpolated:
{{'{{BOS:Sequence}}'}}

Match that doesn't end with a slash

I'd like to match URLs that don't end in /, to use it in Dreamweaver's find tool.
What regex could I use?
For example, I'd like the following URL to be matched:
<a href="http://www.sometext"
You can do it with this simple regex:
href=".+?[^/]"
Explanation:
It will match href="________X", where X != /.
The following will match:
<a href="http://some-url.com">
<a href="http://www.another-url-here.com/content">
These ones won't:
<a href="http://www.url.com/">
<a href="http://www.url-2.com/posts/2014/">
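The four examples above can be checked with a few lines of Python. This is only a sketch using Python's re; Dreamweaver's find tool uses its own regex engine, though this pattern relies on nothing engine-specific.

```python
import re

pattern = re.compile(r'href=".+?[^/]"')

tags = [
    '<a href="http://some-url.com">',
    '<a href="http://www.another-url-here.com/content">',
    '<a href="http://www.url.com/">',
    '<a href="http://www.url-2.com/posts/2014/">',
]

# True where the href value ends in something other than "/".
results = [bool(pattern.search(tag)) for tag in tags]
print(results)  # [True, True, False, False]
```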
Edit:
The following also allows whitespace after href=, as in <a href= "http://www.url.com">:
href=\s*".+?[^/]"
Sure. You can use [^/]" at the end of your link expression to match any non-slash followed by a close-quote.
Maybe with this
href\s*=\s*"[^"]*[^"/\s]\s*"
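One practical difference between the patterns in this thread shows up when another quoted attribute follows href. The Python sketch below uses a hypothetical tag with a class attribute: .+? is free to run past the URL's closing quote, whereas [^"]* is not.

```python
import re

simple = re.compile(r'href=".+?[^/]"')
strict = re.compile(r'href\s*=\s*"[^"]*[^"/\s]\s*"')

# Hypothetical tag: the URL ends in "/", but another quoted attribute follows.
tag = '<a href="http://www.url.com/" class="link">'

# ".+?" crosses the closing quote and stops at the '="' of class, a false positive.
simple_match = simple.search(tag)

# '[^"]*' cannot cross a quote, so the trailing "/" correctly blocks the match.
strict_match = strict.search(tag)

# The stricter pattern also tolerates spaces around "=".
spaced_match = strict.search('<a href = "http://www.url.com/page">')

print(simple_match.group(0))  # href="http://www.url.com/" class="
print(strict_match)           # None
print(spaced_match.group(0))  # href = "http://www.url.com/page"
```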

Regex for extracting links with specified attributes

I'm trying to build a regex to extract links from text that do not have rel="nofollow".
Example:
aiusdiua asudauih <a rel="nofollow" href="http://uashiuadha.asudh/adas">adsaag</a> uhwaida <br> asdgydug <a href="http://asdha.sda/uduih/dufhuis">aguuia</a>
Thanks!
The following regex will do the job:
<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"
The URL you want will be in capture group 1. E.g. in Ruby it would be:
if input =~ /<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"/
  match = $~[1]
end
Since it accepts [^>]*? before rel in the negative lookahead, href or anything else can come before rel. If href comes after rel, it'll of course also be ok.
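The same pattern can be tried in Python. The input below is modelled on the question's sample, with well-formed attributes, plus one extra hypothetical link where rel comes after href; the URLs are placeholders.

```python
import re

# The lookahead scans the rest of the tag (up to ">") for rel="nofollow".
pattern = re.compile(r'<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"')

text = (
    'aiusdiua <a rel="nofollow" href="http://uashiuadha.asudh/adas">adsaag</a> '
    'uhwaida <br> asdgydug <a href="http://asdha.sda/uduih/dufhuis">aguuia</a> '
    '<a href="http://example.com/x" rel="nofollow">after</a>'
)

# Only the link without rel="nofollow" survives, wherever rel appears in the tag.
links = pattern.findall(text)
print(links)  # ['http://asdha.sda/uduih/dufhuis']
```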
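Try this, with case-insensitive matching enabled. The lookahead has to sit right after the tag name and scan the whole tag; placed after a lazy [^<>]*?, it can be bypassed by backtracking and nofollow links still match:
<(?:A|AREA)\b(?![^<>]*?rel="nofollow")[^<>]*?href=['"]([^>"]*)[^>]*?>
If you are using .NET regex, then:
<(?:A|AREA)\b(?![^<>]*?rel="nofollow")[^<>]*?href=['"](?<URL>[^>"]*)[^>]*?>
The data is in the group named URL (or group 1).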