For web scraping, I need to match the last part of a URL and replace "-" dashes with " " spaces.
Code looks like this...
<div class="tags">
<span class="tag" style="background-color: #5A214A;">
SA
</span>
</div>
I want to be left with "Service Assurance" (this part may contain multiple "-" dashes and require multiple replacements).
Currently being used:
Xpath:
//span[@class="tag"]/a/@href
Regex:
/.*/(.*)/
This produces "Service-Assurance", but does not strip out the "-".
I am told elsewhere that this replacement is not possible since I am already using Regex to find the string between the final "/" slashes.
Can I do both? Can I replace the "-" dashes at the end, too?
Regex is plain, inside an app called import.io, no particular language flavour.
Thank you very much.
Try this XPath without the regex:
//*[@class='tag-wrapper']/input[1]/@value
Alternatively, you can also try these methods:
I scrape URLs in Google Sheets all the time with XPaths and regexes, so if you want to try:
=importXML("url goes here","//span[@class='tag']/a/@href")
If you at least get the URL string back, you know it's working, and we can then modify it to this to get what you want:
=SUBSTITUTE(REGEXEXTRACT(importXML("url goes here","//span[@class='tag']/a/@href"),".*\/(.*)\/$"),"-"," ")
Let me know if you have issues; there are a couple of weird quirks with Google Sheets. If you share the URL you're pulling that XPath in with, I can at least test it myself. I use this method now more than any other; I used to use import.io, OutWit Hub, etc. a ton.
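For anyone working outside import.io or Google Sheets, the same extract-then-substitute idea can be sketched in Python. This is only a minimal sketch: the helper name `last_segment_label` and the example URL are made up for illustration (the real URL is not shown in the question), and the pattern is the one from the question, anchored with `$` as in the formula above.

```python
import re
from typing import Optional

# Same pattern as in the question, anchored at the end of the string:
# capture whatever sits between the last two slashes.
LAST_SEGMENT = re.compile(r".*/(.*)/$")


def last_segment_label(url: str) -> Optional[str]:
    """Return the final path segment with dashes turned into spaces.

    Hypothetical helper for illustration; returns None when the URL
    does not end in ".../segment/".
    """
    match = LAST_SEGMENT.search(url)
    if match is None:
        return None
    # str.replace substitutes every dash, so multi-dash segments work too.
    return match.group(1).replace("-", " ")


# The example URL is an assumption, not taken from the question.
print(last_segment_label("http://example.com/tags/Service-Assurance/"))
```

The point is that the extraction and the replacement are two separate steps, so a tool only needs to let you chain them.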
Related
I am trying to extract a URL from content using Yahoo Pipes, but for that I need to match everything before the URL and everything after it:
<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"
href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork"
style="background:url(http://i1.sndcdn.com/artworks-000059185212-dsb68g-badge.jpg?
3eddc42);">Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank"
href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>
</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"
target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>
The URL I want is this one: http://soundcloud.com/tom-misch/dream-ft-notorious-big
I tried to learn a bit about regex, but whenever I think I understand it, nothing I try works.
Hope some of you can help me with that!
cheers
This will probably do. It only matches soundcloud URLs that use the http protocol and have no subdomain. The group captures the full URL so that you can use it, and the lazy quantifier stops the match at the first quote:
(http://soundcloud.*?)"
Here is an alternative that does not use a lazy quantifier; instead it uses a negated character class to match anything but a quote:
(http://soundcloud[^"]+)
Keep in mind that both regexes will match both soundcloud URLs in your sample. Depending on the library and the flags you use, you may get back only the first occurrence or both; you can simply take the first one, or check the results further for the correct format.
If you really want to use just a regex and your regex library supports look-ahead, you can do this:
(http://soundcloud[^"]+)"(?!\s+class="user-name")
The negative look-ahead (?!...) makes the match fail when the closing quote is followed by class="user-name", which rules out the second URL.
I couldn't find out which regex library Yahoo Pipes uses either. If you want to replace everything around the URL, you can change the regex to:
^.*?(http://soundcloud[^"]+).*$
And use $1 in the replacement string to get the URL back. Note that I mixed .*? with [^"]+: I want to replace the whole string with the first URL and not the second one, so the first .* needs to match only up to the start of the first URL and stop there; that's what the lazy quantifier is for.
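Since it is unclear which regex engine Yahoo Pipes uses, here is a rough Python sketch of how the negated-class pattern and the whole-string replacement behave. The `HTML` constant is a shortened copy of the markup from the question, and `re.DOTALL` is an assumption needed here because the sample spans two lines and `.` does not match newlines by default in Python.

```python
import re

# Shortened copy of the markup from the question (two soundcloud links).
HTML = (
    '<h3><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">'
    "Dream ft. Notorious BIG</a>\n"
    '</h3> <span class="subtitle"><a rel="nofollow" target="_blank" '
    'href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)

# Negated class: everything from "http://soundcloud" up to the next quote.
URL_RE = re.compile(r'(http://soundcloud[^"]+)')

# findall returns every match, so both links show up.
all_urls = URL_RE.findall(HTML)

# search stops at the first match, which is the track URL.
first_url = URL_RE.search(HTML).group(1)

# Whole-string replacement: the lazy .*? stops at the first URL,
# the trailing .* swallows the rest, and \1 puts the URL back.
only_url = re.sub(
    r'^.*?(http://soundcloud[^"]+).*$', r"\1", HTML, flags=re.DOTALL
)

print(all_urls)
print(first_url)
print(only_url)
```

In an engine that returns only the first match, the plain capture is enough; the whole-string variant is mainly useful when the tool only offers search-and-replace.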
I'm rubbish with regex; if someone could help, I'd be very appreciative.
It's going to be a bit of a tough one, I imagine, so my hat's off to anyone who can solve it!
So say we have a file that contains 2 HTML tags in the following formats:
abc1234
Some Text <P>
Some Text
abc1234
I'm trying to remove everything in those tags except the URL (leaving the other text), so the output of the regex for this document would be:
abc1234
http://google.com <P>
http://www.google.com
abc1234
Can any guru figure this one out? I'd prefer one regex to handle both cases, but two separate ones would be fine too.
Thanks in advance.
ScottStevens, it is well known that trying to parse HTML with regex is difficult; in fact, there is quite a verbose post on this issue. However, if those are the only two formats the <a> tag ever takes, here is an approach to the problem:
Your first clue is that both tags start with <a href=", which you want to take out. A simple remove on '<a href="' will do; no regex required.
Your next clue is that the rest of the tag sometimes looks like ">...</a> and sometimes like " rel=...</a> (what goes between rel= and </a> doesn't matter from a regex point of view). Now notice that " rel="...</a> contains a ">...</a> somewhere within it. This means you can remove " rel="...</a> in two steps: first remove " rel="... up to the ">, and then remove ">...</a>. Additionally, to make sure you only remove within a single <a...>...</a> tag, add the constraint that the ... in ">...</a> cannot contain any <a.
That and a regex cheat sheet can help you get started.
That said, you should really use an html parser. Robust and Mature HTML Parser for PHP
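The steps described above could be sketched in Python roughly as follows. The anchor tags in the question were lost when it was posted, so the two tag formats in `SAMPLE` are an assumption (a plain `href` and an `href` followed by `rel="nofollow"`), and `strip_anchor_tags` is a made-up name. Treat it as an illustration of the steps, not as a general HTML solution.

```python
import re

# Assumed input: the original tags are not visible in the question.
SAMPLE = (
    "abc1234\n"
    '<a href="http://google.com">Some Text</a> <P>\n'
    '<a href="http://www.google.com" rel="nofollow">Some Text</a>\n'
    "abc1234\n"
)


def strip_anchor_tags(text: str) -> str:
    """Reduce each <a href="URL"...>...</a> to the bare URL."""
    # Step 1: plain string removal of the tag opening, no regex needed.
    text = text.replace('<a href="', "")
    # Step 2: drop the rel attribute, keeping the closing "> in place.
    text = re.sub(r'" rel="[^"]*(?=">)', "", text)
    # Step 3: remove ">...</a>, refusing to cross into another <a tag.
    text = re.sub(r'">(?:(?!<a).)*?</a>', "", text)
    return text


print(strip_anchor_tags(SAMPLE))
```

Note that step 3 would also remove any other `">...</a>` sequence in the text, which is one more reason to prefer a real parser for anything beyond these two fixed formats.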
I'm a Rubyist, so my example is going to be in Ruby. I'd recommend using two regexes, just to keep things straight:
url_reg = /<a href="(.*?)"/ # Matches first string within <a href=""> tag
tag_reg = /(<a href=.*?a>)/ # Matches entire <a href>...</a> tag
You'll want to pull the URL with the first regex out and store it temporarily, then replace the entire contents of the tag (matched with the tag_reg) with the stored URL.
You might be able to combine it, but it doesn't seem like a good idea. You're fundamentally altering (by deleting) the original tag, and replacing it with something inside itself. Less chance of things going wrong if you separate those two steps as much as possible.
Example in Ruby
def replace_tag(input)
  url_reg = /<a href="(.*?)"/                 # Match URLs within an <a href> tag
  tag_reg = /(<a href=.*?a>)/                 # Match an entire <a href></a> tag
  while (input =~ tag_reg)                    # While the input has matching <a href> tags
    url = input.scan(url_reg).flatten[0]      # Retrieve the first URL match
    input = input.sub(tag_reg, url)           # Replace first tag contents with URL
  end
  return input
end

File.open("test.html", "r") do |html_input|        # Open original HTML file
  File.open("output.html", "w") do |html_output|   # Open an output file
    while line = html_input.gets                   # Read each line
      output = replace_tag(line)                   # Perform necessary substitutions
      html_output.puts(output)                     # Write output lines to file
    end
  end
end
Even if you don't use Ruby, I hope the example makes sense. I tested this on your given input file, and it produces the expected output.
I have to create a test application in JMeter where I need to get only the anchor tag, with all of its content, that contains my URL name.
<a href='http://www.mysite.com/' title='Free Stuff' target='_NEW'>Free Stuff</a>
or any other variant is returned. The only prerequisite is that it should start with <a, have mysite.com in between, and end with </a>.
I have endlessly tried to do this, and even searched this forum, but to no avail.
Help needed desperately.
Thanks
You can take the parts you require and fill the remainder of the pattern with "give me anything".
<a[^>]*mysite\.com.+?<\/a>
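To see what that pattern picks up, here is a small Python check. JMeter's own regex engine may differ in details, so this is only a sketch; the `PAGE` string and the extra `other.com` link are invented for the test.

```python
import re

# Invented sample page: one unrelated link and the link from the question.
PAGE = (
    "<p><a href='http://other.com/'>Other</a> "
    "<a href='http://www.mysite.com/' title='Free Stuff' "
    "target='_NEW'>Free Stuff</a></p>"
)

# [^>]* keeps the domain check inside the opening tag, and the lazy
# .+? stops at the first closing </a> after it.
ANCHOR = re.compile(r"<a[^>]*mysite\.com.+?</a>")

matches = ANCHOR.findall(PAGE)
print(matches)
```

Because `[^>]*` cannot cross the end of the opening tag, the first link is skipped rather than being merged into one long match with the second.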
I am trying to write a RegEx rule to find all a href HTML links on my webpage and add a 'rel="nofollow"' to them.
However, I have a list of URLs that must be excluded: for example, ANY internal link (wildcard match, e.g. pokerdiy.com), so that any link that has my domain name in it is excluded. I want to be able to specify exact URLs in the exclude list too, for example http://www.example.com/link.aspx.
Here is what I have so far which is not working:
(<a[^>]+)(href="http://.*?(?!(pokerdiy))[^>]+>)
If you need more background/info you can see the full thread and requirements here (skip the top part to get to the meat):
http://www.snapsis.com/Support/tabid/601/aff/9/aft/13117/afv/topic/afpgj/1/Default.aspx#14737
An improvement to James' regex:
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>
This regex matches links that are NOT in the string array $follow_list. The strings don't need a leading 'www'. :)
The advantage is that this regex preserves the other attributes in the tag (like target, style, title...). If a rel attribute already exists in the tag, the regex will NOT match, so you can force follows on URLs not in $follow_list.
Replace the match with:
$1$2$3"$4 rel="nofollow">
Full example (PHP):
function dont_follow_links( $html ) {
    // follow these websites only!
    $follow_list = array(
        'google.com',
        'mypage.com',
        'otherpage.com',
    );
    return preg_replace(
        '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>%',
        '$1$2$3"$4 rel="nofollow">',
        $html);
}
If you want to overwrite rel no matter what, I would use a preg_replace_callback approach where in the callback the rel attribute is replaced separately:
$subject = preg_replace_callback('%(<a\s*[^>]*href="https?://(?:(?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"[^>]*)>%', function($m) {
return preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1(\s|$)%', ' ', $m[1]).' rel="nofollow">';
}, $subject);
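For comparison, the callback idea translates to Python along these lines. This is a loose sketch rather than a line-for-line port of the PHP: `FOLLOW_LIST`, `LINK`, `REL` and `force_nofollow` are names chosen here, the domains are escaped with `re.escape` (the PHP version does not escape them), the opening tag is matched with `<a\s` instead of `<a\s*`, and a trailing space left by the removal is trimmed.

```python
import re

# Domains that should keep normal "follow" links (same list as the PHP example).
FOLLOW_LIST = ["google.com", "mypage.com", "otherpage.com"]

# Alternation of followed domains, each with an optional leading "www.".
_followed = "|".join(r"(?:www\.)?" + re.escape(d) for d in FOLLOW_LIST)

# Opening <a ...> tag whose href host is not in the follow list.
LINK = re.compile(r'(<a\s[^>]*href="https?://(?!%s)[^"]+"[^>]*)>' % _followed)

# An existing rel attribute with either quote style.
REL = re.compile(r'''\srel\s*=\s*(["'])(?:(?!\1).)*\1(\s|$)''')


def force_nofollow(html: str) -> str:
    """Overwrite rel on every matching link with rel="nofollow"."""

    def fix(match: re.Match) -> str:
        # Drop any existing rel attribute, then append the new one.
        opening = REL.sub(" ", match.group(1)).rstrip()
        return opening + ' rel="nofollow">'

    return LINK.sub(fix, html)


print(force_nofollow('<a rel="external" href="http://example.org/">x</a>'))
print(force_nofollow('<a href="http://www.google.com/">g</a>'))
```

Splitting the work this way keeps the outer pattern simple: it only decides which tags qualify, and the callback handles the attribute rewrite.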
I've developed a slightly more robust version that can detect whether the anchor tag already has "rel=" in it, therefore not duplicating attributes.
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blog.bandit.co.nz)[^"]+)"([^>]*)>
Matches
Google
<a title="Google" href="http://google.com">Google</a>
<a target="_blank" href="http://google.com">Google</a>
Google
But doesn't match
<a rel="nofollow" href="http://google.com">Google</a>
Google
Google
Google
Google
<a target="_blank" href="http://blog.bandit.co.nz">Bandit</a>
Replace using
$1$2$3"$4 rel="nofollow">
Hope this helps someone!
James
(<a href="https?://)((?:(?!\b(pokerdiy.com|www\.example\.com/link\.aspx)\b)[^"])+)"
would match the first part of any link that starts with http:// or https:// and doesn't contain pokerdiy.com or www.example.com/link.aspx anywhere in the href attribute. Replace that by
\1\2" rel="nofollow"
If a rel="nofollow" is already present, you'll end up with two of these. And of course, relative links or other protocols like ftp:// etc. won't be matched at all.
Explanation:
(?!\b(foo|bar)\b)[^"] matches any non-" character unless it is possible to match foo or bar at the current location. The \bs are there to make sure we don't accidentally trigger on rebar or foonly.
This whole construct is repeated ((?: ... )+), and whatever is matched is preserved in backreference \2.
Since the next token to be matched is a ", the entire regex fails if the attribute contains foo or bar anywhere.
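A quick way to convince yourself of the explanation above is to run the pattern in Python, which supports the same look-ahead syntax. The pattern and replacement are copied from this answer; the function name `add_nofollow` and the test links are assumptions for illustration.

```python
import re

# Pattern from the answer: each character of the href is only consumed
# if neither excluded string starts at that position.
NOFOLLOW = re.compile(
    r'(<a href="https?://)'
    r'((?:(?!\b(pokerdiy.com|www\.example\.com/link\.aspx)\b)[^"])+)"'
)


def add_nofollow(html: str) -> str:
    """Append rel="nofollow" to links whose href avoids the excluded strings."""
    return NOFOLLOW.sub(r'\1\2" rel="nofollow"', html)


for link in (
    '<a href="http://google.com">G</a>',
    '<a href="http://www.pokerdiy.com/forum">P</a>',
    '<a href="http://www.example.com/link.aspx">L</a>',
    '<a href="http://www.example.com/other.aspx">O</a>',
):
    print(add_nofollow(link))
```

The internal and explicitly excluded links come back unchanged, while the other two gain the attribute.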
I have an apparently simple regex query for Pipes: I need to truncate each item from its <img> tag onwards. I thought a loop with a string regex of <img[.]* replaced by a blank field would have taken care of it, but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially I only want the title text and hyperlink; that's why I'm chopping the rest off.
Going one better, since all I'm really doing here is making the item string more manageable by cutting it down before further manipulation: does anyone know if it's possible to extract the href from a certain link in the page (in this case the first one) using Regex in Yahoo Pipes? I've seen the regex answer to this SO question, but I'm not sure how to use it to map a URL to an item attribute in a Pipes module.
You need to remove the line returns first: use a RegEx Pipe to replace the pattern [\r\n] with empty text on the content or description field, which turns it into a single line of text. Then you can use the .* wildcard, which will run to the end of the line.
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/
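The effect of flattening the text first can be reproduced outside Pipes. Below is a small Python sketch; the `item` string is modelled on the sample in the question, with one invented trailing link standing in for the irrelevant hyperlinks. It also shows why `<img[.]*` did not work: inside a character class the dot is a literal dot, so that pattern removes little more than the text `<img` itself.

```python
import re

# Item modelled on the question; the last link is an invented stand-in.
item = (
    "sample text title\n"
    '<a rel="nofollow" target="_blank" href="http://example.com">'
    '<img border="0" src="http://example.com/image.png" alt="Yes" '
    'width="20" height="23"/></a>\n'
    '<a href="http://example.com/other">irrelevant link</a>\n'
)

# Without flattening, .* stops at the end of the line holding <img ...>.
per_line = re.sub(r"<img.*", "", item)

# Step 1: remove the line returns so the item is a single line.
flattened = re.sub(r"[\r\n]", "", item)

# Step 2: .* now runs to the end of the whole item.
truncated = re.sub(r"<img.*", "", flattened)

# The pattern from the question only strips "<img" plus any literal dots.
literal_dot = re.sub(r"<img[.]*", "", flattened)

print(truncated)
```

Only the flattened variant drops everything after the image tag; the per-line version still keeps the later links.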