I have a regular expression that runs through HTML tags and grabs values.
I currently have this to grab all values within the tag.
<title\b[^>]*>(.*\s?)</title>
It works perfectly. So if I have a bunch of pages that have titles:
<title>Index</title>
<title>Artwork</title>
<title>Theory</title>
The values returned are:
Index,
Artwork,
Theory
How can I make this regular expression ignore all tags with the value Theory inside them?
Thanks in Advance
A basic lookaround would probably handle that. The negative lookahead is checked before every character, so any title containing Theory fails to match:
<title\b[^>]*>((?:(?!Theory).)*)<\/title>
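To check the lookahead idea against the three example titles, here is a minimal Python sketch. The pattern is an adaptation that hard-codes Theory as the excluded text and has no trailing catch-all group; the names TITLE_NOT_THEORY, pages and titles are made up for the illustration.

```python
import re

# Tempered pattern: (?!Theory) is tested before each character of the title,
# so a title containing "Theory" anywhere cannot match.
TITLE_NOT_THEORY = re.compile(r"<title\b[^>]*>((?:(?!Theory).)*)</title>")

pages = [
    "<title>Index</title>",
    "<title>Artwork</title>",
    "<title>Theory</title>",
]

# Collect the captured title text from every page that matches.
titles = [m.group(1) for page in pages for m in TITLE_NOT_THEORY.finditer(page)]
print(titles)  # ['Index', 'Artwork']
```

Note that the dot does not match newlines by default, so a title spanning several lines would need the DOTALL flag.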
I'm trying to scrape a piece of text from a website using Kimonolabs. The text is successfully scraped using the advanced setting:
div > div > ul > li.location > span.value
The text being scraped using this CSS selector is:
Cityname, streetname 1
However, I wish to delete everything from the comma onward so that only this remains:
Cityname
I wish to do this with regex, but I'm totally ignorant about it. What I do know is that it has to consist of 3 blocks when using Kimonolabs: https://help.kimonolabs.com/hc/en-us/articles/203043464-Manually-input-regular-expressions
Can anybody help me set up the correct regex? All I've got so far is the following, but it's not the correct format for Kimonolabs (the dashboard doesn't accept it):
^(.+?),
See the docs you referred to:
The regular expression pattern in kimono is defined in three parts. It's important that any custom regular expression you produce retains the three part notation, with the surrounding ( ) for each part. The first part refers to the pattern to the left of the desired content. The middle part refers to the pattern that the desired content must match and the third part refers to the pattern to the right of the desired content.
So, you seem to need:
/^()([^,]+)()/
Or, /(^)([^,]+)(,)/ (it should be equivalent), and the 2nd capture group (the middle part) should capture the Cityname.
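As a rough check of the three-part notation, the sketch below runs both patterns through Python's re module on the sample text. It assumes Kimono's engine treats these constructs the same way Python does, which is not verified here; the variable names are illustrative only.

```python
import re

text = "Cityname, streetname 1"

# Three groups: left context, desired content, right context.
empty_sides = re.compile(r"^()([^,]+)()")
anchored_sides = re.compile(r"(^)([^,]+)(,)")

# In both patterns the middle (second) group holds the desired content.
city_a = empty_sides.search(text).group(2)
city_b = anchored_sides.search(text).group(2)
print(city_a, city_b)  # Cityname Cityname
```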
My CMS allows PHP keyword replacements, and I'm currently building a format to return the first list item element in a data field which usually contains an HTML unordered list, but can often contain paragraphs, etc.
If possible, I'd like to use a regular expression to match only the first list item element li in a returned block, and print it.
One severe limitation is that I cannot use the ^ character, as my CMS (annoyingly) uses that character for modification functions.
So far, I've only come up with: replace:<\/li>.*:</li></ul> - but this is only replacing the first listed item's closing tag in a returned block. What I really need is something like:
replace:anything_that's_not_first_li_element:nothing
I appreciate that this question is a very long shot, so thanks in advance for all constructive responses.
You could use this regex with the s flag.
(?<=<ul>).*?<li>.*?<\/li>
Working regex example:
http://regex101.com/r/hL1zF0
PHP:
$list = '<ul>
<li>first</li>
<li>second</li>
<li>third</li>
<li>fourth</li>
</ul>';
preg_match('/(?<=<ul>).*?<li>.*?<\/li>/s', $list, $matches);
echo $matches[0];
Output:
<li>first</li>
Okay, so I need to create an advanced filter in Google Analytics that includes "breast", but DOES NOT include "before", "after" or "blog" in the URL. I also want to filter out .jpg file extensions.
Here are example URLs that I want the filter to return:
http://www.doctortaylor.com/breast-lift-surgery/
http://www.doctortaylor.com/breast-augmentation-pasadena-and-los-angeles-area/
I want to filter out any urls that are before and after photo pages, and any actual .jpg file urls.
I'm a regex beginner, but this is pretty advanced. Any help would be greatly appreciated!!
This regular expression does fairly well:
^(?!before|after|blog)*((?!before|after|blog).)*breast(?!before|after|blog|\.jpg)*((?!before|after|blog|\.jpg).)*$
UPDATED: I have updated the expression to capture all scenarios, even characters that begin or end the string. This regular expression excludes all words that you list in your description and correctly identifies the word breast.
MATCHES
http://www.doctortaylor.com/breast-lift-surgery/
http://www.doctortaylor.com/breast-augmentation-pasadena-and-los-angeles-area/
DOES NOT MATCH
http://www.doctortaylor.com/breast-lift-surgeryblog/
http://www.doctortaylor.com/breast-lift-surgery.jpg/
http://blog.doctortaylor.com/breast-lift-surgery/
http://www.doctortaylor.com/after-breast-lift-surgery/
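This regular expression uses negative lookaheads, which act as a form of inverse matching: the URL is rejected whenever one of the listed words can be found in it.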
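The same exclusion logic can be tried outside Google Analytics. The sketch below uses a simplified pattern of my own (a single lookahead at the start, not the expression above) and runs it on the listed URLs with Python's re module. Whether the Google Analytics filter field accepts lookaheads is not something this sketch checks; BREAST_FILTER and keep are hypothetical names.

```python
import re

# One lookahead at the start rejects the URL if any excluded word occurs
# anywhere in it; the rest of the pattern then requires "breast".
BREAST_FILTER = re.compile(r"^(?!.*(?:before|after|blog|\.jpg)).*breast")

def keep(url):
    """Return True if the URL passes the filter."""
    return BREAST_FILTER.search(url) is not None

urls = [
    "http://www.doctortaylor.com/breast-lift-surgery/",
    "http://www.doctortaylor.com/breast-augmentation-pasadena-and-los-angeles-area/",
    "http://www.doctortaylor.com/breast-lift-surgeryblog/",
    "http://www.doctortaylor.com/breast-lift-surgery.jpg/",
    "http://blog.doctortaylor.com/breast-lift-surgery/",
    "http://www.doctortaylor.com/after-breast-lift-surgery/",
]

kept = [u for u in urls if keep(u)]
print(kept)  # only the first two URLs remain
```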
I need some help with a VB RegEx.
I've got two RegEx that I need to do two specific things.
RegEx one - I am not exactly sure how to do this, but I need to get the value of the href attribute, i.e.
String = "<a href=""test.html"">"
I need the RegEx to return .... test.html
RegEx Two - I have partly got this working.
I've got tags like
RegEx = "<div class=""top""(.*?)</div>"
String = "<div class=""top""><a><b><div class=""bottom""></div></b></a></div>"
The problem I have is that this isn't returning anything. It should return everything within "top", but it returns nothing.
Neither use-case can be solved well with regular expressions.
Use an HTML parser instead, e.g. the HTML Agility Pack.
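To illustrate the parser approach without depending on .NET, here is a minimal sketch using Python's standard html.parser module in place of the HTML Agility Pack. It is a different library and only shows the idea; the class names HrefCollector and TopDivContents are made up for this example.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v is not None)

class TopDivContents(HTMLParser):
    """Rebuilds the markup nested inside <div class="top">."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # 0 = outside div.top, otherwise current <div> nesting level
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and ("class", "top") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:   # this closes div.top itself
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

links = HrefCollector()
links.feed('<a href="test.html">')
hrefs = links.hrefs

top = TopDivContents()
top.feed('<div class="top"><a><b><div class="bottom"></div></b></a></div>')
inner = "".join(top.parts)
print(hrefs, inner)
```

Counting nested div tags is what lets the parser find the right closing tag, which is exactly the part a non-greedy (.*?) cannot do.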
Well, if your HTML doesn't contain nested tags, you can do the first part with a regex (as long as you control the source you are searching, you can be much more certain of your results).
<a href=""([^""]+)"">
test.html will be found in the first capturing group, referred to as $1.
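For a quick test of the capture group, the sketch below runs a version of this pattern that includes the closing quote before the >, written with single quotes as Python needs them, not the doubled quotes of a VB string literal. HREF_PATTERN is an illustrative name.

```python
import re

# Group 1 captures everything between the quotes of the href attribute.
HREF_PATTERN = re.compile(r'<a href="([^"]+)">')

match = HREF_PATTERN.search('<a href="test.html">')
href = match.group(1) if match else None
print(href)  # test.html
```

This only works when href is the sole attribute and is written exactly this way; extra attributes or single quotes would need a looser pattern or a parser.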
For the second part, I suspect the nested tags are what makes it fail. Regex copes poorly with nested markup, especially markup that browsers tolerate and render as expected but that isn't well formed.
Can you post some search source for the second case so we can look?
I'm trying to write a regexp which will help to find non-translated texts in html code.
Translated texts are those that go through a special tag, <fmt:message>, or through the construction ${...}
Ex. non-translated:
<h1>Hello</h1>
Translated texts are:
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
I've written the following expression:
\<(\w+[^>])(?:.*)\>([^\s]+?)\</\1\>
It finds correct strings like:
<p>text</p>
Correctly skips
<a><fmt:message key="common.delete" /></a>
But also catches:
<li><p><fmt:message key="common.delete" /></p></li>
And I can't figure out how to add an exception for ${...} strings to this expression.
Can anybody help me?
If I understand you properly, you want to ensure the data inside the "tag" doesn't contain fmt:message or ${....}
You might be able to use a negative lookahead in conjunction with a . to assert that the characters captured by the . are not the start of one of those cases:
/<(\w+)[^>]*>(?:(?!<fmt:message|\$\{|<\/\1>).)*<\/\1>/i
If you want to avoid capturing any "tags" inside the tag, you can ignore the <fmt:message portion, and just use [^<] instead of a . - to match only non <
/<(\w+)[^>]*>(?:(?!\$\{)[^<])*<\/\1>/i
Added from comment: if you also want to exclude "empty" tags, add another negative lookahead, this time (?!\s*<), to ensure that the content of the tag is not empty and not just whitespace:
/<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*<\/\1>/i
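The last pattern can be tried on the examples from the question with a short Python sketch. It assumes Python's re module behaves like the target engine for these constructs; UNTRANSLATED, samples and flagged are names invented for the illustration.

```python
import re

# Same pattern as above, without the / delimiters; the i flag becomes re.IGNORECASE.
UNTRANSLATED = re.compile(
    r"<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*</\1>",
    re.IGNORECASE,
)

samples = [
    "<h1>Hello</h1>",
    "<p>text</p>",
    '<h1><fmt:message key="hello" /></h1>',
    "<button>${expression}</button>",
    '<a><fmt:message key="common.delete" /></a>',
    '<li><p><fmt:message key="common.delete" /></p></li>',
]

# Only elements that contain plain, untranslated text should be flagged.
flagged = [s for s in samples if UNTRANSLATED.search(s)]
print(flagged)  # ['<h1>Hello</h1>', '<p>text</p>']
```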
If the format is simple as in your examples you can try this:
<(\w+)>(?:(?!<fmt:message).)+</\1>
Rewritten into a more formal question:
Can you match
aba
but not
aca
without catching
abcba ?
Yes.
FSM:
Start->A->B->A->Terminate
Insert abcba and run it
Start is ready for input.
a -> MATCH, transition to A
b -> MATCH, transition to B
c -> FAIL, return fail.
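The trace above can be reproduced with a few lines of Python. This is only a sketch of the state machine as described, with the state stored as the index of the next expected symbol; run_fsm and its expected parameter are hypothetical names.

```python
def run_fsm(text, expected="aba"):
    """Walk Start -> A -> B -> A -> Terminate; fail on the first wrong symbol."""
    state = 0  # index of the next expected symbol; 0 is Start
    for ch in text:
        # Extra input after Terminate, or a symbol with no transition, fails.
        if state == len(expected) or ch != expected[state]:
            return False
        state += 1
    # Accept only if the machine reached Terminate.
    return state == len(expected)

results = {s: run_fsm(s) for s in ("aba", "aca", "abcba")}
print(results)  # {'aba': True, 'aca': False, 'abcba': False}
```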
I've used a simple one like this with success,
<([^>]+)[^>]*>([^<]*)</\1>
Of course, if there is any CDATA with '<' in it, this is not going to work so well, but it should do fine for simple XML.
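A small Python sketch shows both the simple case and the CDATA caveat. The sample strings are invented for the illustration, and the pattern is the one quoted above, unchanged.

```python
import re

# Group 1 is the tag name, group 2 the text content (no '<' allowed inside).
SIMPLE_ELEMENT = re.compile(r"<([^>]+)[^>]*>([^<]*)</\1>")

simple = SIMPLE_ELEMENT.findall("<title>Index</title><b>bold</b>")
print(simple)  # [('title', 'Index'), ('b', 'bold')]

# Backtracking lets group 1 shrink to the bare tag name when attributes are present.
with_attr = SIMPLE_ELEMENT.findall('<name id="1">Bob</name>')

# A '<' inside CDATA breaks the [^<]* part, so nothing is found.
cdata = SIMPLE_ELEMENT.findall("<a><![CDATA[1 < 2]]></a>")
print(with_attr, cdata)
```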
Also see
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
for a discussion of using regex to parse HTML.
Executive summary: don't.