I have many links in my page.
For example: <a href="/promotions/download/schools/australia.aspx">Australia</a>
Now I want only the href with its value, i.e. href="/promotions/download/schools/australia.aspx", using a VBScript regular expression.
My regex would be something like:
href="([^"]*)"
It might need escaping in your context, but that (or something very much like it) should work.
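To see what the pattern captures, here is a minimal sketch in Python rather than VBScript (the pattern is the same; the sample markup is assumed from the question). Group 0 is the whole attribute and group 1 is only the value.

```python
import re

# Sample markup assumed from the question.
html = '<a href="/promotions/download/schools/australia.aspx">Australia</a>'

# Group 1 captures everything between the double quotes.
HREF_RE = re.compile(r'href="([^"]*)"')

match = HREF_RE.search(html)
whole_attribute = match.group(0)  # the attribute name, quotes and value
value_only = match.group(1)       # just the value between the quotes

print(whole_attribute)
print(value_only)
```

Note that this only handles double-quoted values written exactly as href="..."; single quotes or extra whitespace around the = would need a looser pattern.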
Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). Luckily, you should have access to the best parser available: the web browser. Modern browsers build a Document Object Model, a tree structure that contains all of the information about the page. One of the collections the DOM exposes is document.links. I don't really know VBScript, but this code looks like it should work:
' document.links is zero-based, so the last valid index is length - 1
For i = 0 To document.links.length - 1
    document.write(document.links(i).href & "<BR>")
Next
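If the page is being processed outside a browser, the same "use a parser, not a regex" idea can be sketched with Python's standard html.parser module. This is only an illustration under assumptions: LinkCollector, extract_hrefs and the second link (/other.aspx) are hypothetical names and data, not part of the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


sample = (
    '<p><a href="/promotions/download/schools/australia.aspx">Australia</a> '
    "<a HREF='/other.aspx'>Other</a></p>"
)
hrefs = extract_hrefs(sample)
print(hrefs)
```

Unlike the regex, this copes with single-quoted values and upper-case attribute names without any extra work.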
I am coding custom CSS for Facebook using Stylish.
Everything goes well except that I need to have some custom values under the condition of URL-suffix. The only thing that comes close is URL-prefix which is the exact opposite.
So I was wondering if I could do something like:
Detect if URL is like either:
www.facebook.com/*/posts or just */post
where * could be any value.
Is it possible to do this through RegEx?
I googled it but I couldn't make anything out of it.
I want to apply some CSS code only when viewing some individual Facebook posts, and the URLbar shows:
www.facebook.com/User/Posts/PostID.php
Therefore, I would only like to detect if Post or post/postID.php exists and apply the style.
The regex below matches URLs that contain the string /posts:
(?=.*?\/posts).*
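A quick way to check how that pattern behaves is to run it against a few URLs. The sketch below uses Python's re module purely for testing; the sample URLs are assumptions based on the question, and how Stylish itself applies the pattern is not covered here.

```python
import re

# The pattern from the answer: the lookahead requires "/posts" somewhere ahead.
POSTS_RE = re.compile(r"(?=.*?\/posts).*")
# Same pattern, but case-insensitive, so "/Posts" is accepted too.
POSTS_RE_ANY_CASE = re.compile(r"(?=.*?\/posts).*", re.IGNORECASE)

urls = [
    "www.facebook.com/User/posts/PostID.php",
    "www.facebook.com/User/photos",
    "www.facebook.com/User/Posts/PostID.php",
]

case_sensitive = [bool(POSTS_RE.match(u)) for u in urls]
case_insensitive = [bool(POSTS_RE_ANY_CASE.match(u)) for u in urls]

print(case_sensitive)
print(case_insensitive)
```

The pattern is case-sensitive as written, so a URL containing /Posts (capital P, as in the question's example) only matches when case is ignored.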
I am following the tutorial here, trying to build a robot against a website.
I am in a page that contains all the product categories. Say it is www.example.com/allproducts.
After diving into a category, you can see the product list in table format, and you can click through to the next page to loop over all the pages in that category. Actually, you can only see links to pages 1, 2, 3, 4, 5 and the last page.
The first page in a category has a URL that looks like www.example.com/level1/level2/_/N-1, the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on and so forth.
I personally don't have much Java programming experience, and I am wondering
whether I can crawl all the product list pages using Nutch and store the HTML for now,
and maybe later figure out a way to parse the HTML/index correctly.
(1) Can I just modify conf/regex-urlfilter.txt and replace
# accept anything else
+.
with something correct? (I just don't understand how
+^http://([a-z0-9]*\.)*nutch.apache.org/
restricts the crawl to URLs inside the Nutch domain. I would read that regular expression as: between the double slash and nutch, there can be any alphanumeric characters, asterisks, backslashes or dots.)
How can I build the regular expression so it only scrapes http://www.example.com/.../.../_/N-../...
(2) I can see the HTML is stored in the content folder inside the segment. However, when I open that file in vi, it looks like total nonsense to me, and I am wondering if that is the so-called Java serialization, which I would need to deserialize in Java to read.
Forgive me if those questions are too basic and thanks a lot for reading.
(1) Can I just modify conf/regex-urlfilter.txt and replace
Sure. You should replace +. with these lines:
#accept all products page
+www\.example\.com/allproducts
#accept categories pages
+www\.example\.com/level1/level2/_/N-
One important note about regexes in this file: the regular expressions are matched partially. So a rule like "+ab" means: accept all URLs that contain "ab", so it matches these URLs:
ab
abc
http://ab.com/c.html
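The "partial match" behaviour is the same as searching for the pattern anywhere in the string. As a rough illustration in Python (not Nutch itself; the fourth URL is an added counter-example), re.search finds "ab" in all three URLs above, while anchoring the pattern with ^ narrows it to URLs that start with "ab":

```python
import re

candidates = ["ab", "abc", "http://ab.com/c.html", "http://example.com/"]

# Unanchored: matches if "ab" appears anywhere in the URL.
contains_ab = [url for url in candidates if re.search(r"ab", url)]

# Anchored with ^: matches only if the URL starts with "ab".
starts_with_ab = [url for url in candidates if re.search(r"^ab", url)]

print(contains_ab)
print(starts_with_ab)
```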
By default, Nutch filters out URLs containing ? (since they are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:
-[?*!#=]
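To see why that line matters for the paginated URLs, here is a small Python simulation of the filter file. It is only a sketch under assumptions: it assumes the first matching rule decides and that a URL matching no rule is dropped (as I understand the comments in the default file), and parse_rules and accepts are hypothetical helper names, not Nutch APIs.

```python
import re


def parse_rules(text):
    """Turns filter-file lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):  # partial match, as described above
            return allow
    return False


with_query_filter = parse_rules(r"""
-[?*!#=]
+www\.example\.com/allproducts
+www\.example\.com/level1/level2/_/N-
""")

without_query_filter = parse_rules(r"""
# -[?*!#=]
+www\.example\.com/allproducts
+www\.example\.com/level1/level2/_/N-
""")

page_one = "http://www.example.com/level1/level2/_/N-1"
page_two = "http://www.example.com/level1/level2/_/N-1/?No=100"

print(accepts(with_query_filter, page_two))     # rejected because of "?"
print(accepts(without_query_filter, page_two))  # accepted once the rule is commented out
```

Under those assumptions, the first page is accepted either way, but the second page (with ?No=100) only gets through once the -[?*!#=] rule is commented out.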
(2) I can see the HTML ...
Nutch saves the files in binary format. See https://stackoverflow.com/a/10150402/1881318
I'm trying to match anything between and including style=""
eg: style="whatever:0; morestuff:1; otherstuff:3"
The pattern will be /style="([^"]*)"/, but may vary a bit depending on what language you're using.
Also, if you're trying to do this in JavaScript, jQuery makes this as easy as
$("#element-id").attr("style");
If you're trying to do this from another language, use an HTML parsing lib as HTML isn't regular. BeautifulSoup for Python is quite nice.
String under test
style="whatever:0; morestuff:1; otherstuff:3"
Regex
style\s*=\s*"([^"]*)"
Contents of group 1
whatever:0; morestuff:1; otherstuff:3
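The same test can be reproduced in code. This is a minimal Python sketch (the surrounding div tag is an assumption for illustration) that pulls out group 1 and, as a possible next step, splits it into property/value pairs:

```python
import re

STYLE_RE = re.compile(r'style\s*=\s*"([^"]*)"')

# The <div> wrapper is assumed; only the style attribute comes from the question.
tag = '<div style="whatever:0; morestuff:1; otherstuff:3">'

match = STYLE_RE.search(tag)
group_1 = match.group(1)

# Split "name:value; name:value" into a dict, ignoring empty pieces.
declarations = dict(
    part.strip().split(":", 1)
    for part in group_1.split(";")
    if part.strip()
)

print(group_1)
print(declarations)
```

The splitting step is deliberately naive: it would break on values that themselves contain a semicolon, such as some url(...) values.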
Notice!
It is very hard to write a regex-based HTML parser that is correct, secure, and maintainable. If you need to write a program that deals with HTML in a robust, reliable, and secure way, you should use a real HTML parsing library like jsoup (Java) or Html Agility Pack (C#). To find an HTML parser for your favorite language, Google: yourlanguage html parser.
If you need to remove all style attributes from HTML (strip inline styles entirely), use this regexp:
style=\"[^\"]*\"
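As a rough illustration of the removal, here is the pattern used with Python's re.sub (the sample markup is assumed). The second variant, with a leading \s*, is my own small addition so that the space before the attribute is removed as well:

```python
import re

html = '<p style="color:red" class="note">Hi</p>'

# The pattern as given: removes the attribute but leaves the space before it.
stripped = re.sub(r'style="[^"]*"', "", html)

# Variant (an addition, not from the answer): also swallow leading whitespace.
stripped_tidy = re.sub(r'\s*style="[^"]*"', "", html)

print(stripped)
print(stripped_tidy)
```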
This works for me in Sublime Text 2 and 3.
/(style="([^"]*)")/
for the whole string (untested). Do you want the key-value pairs retrieved as well?
I want to extract URLs from a webpage. These are just URLs by themselves, not hyperlinks; they are plain text. Some examples would be http://www.example.com, http://example.com, www.example.com, etc. I am extremely new to regex, so I have copied and pasted something like 20 expressions from online, and all of them failed to work. I don't know if I am doing it right or not. Any help would be really appreciated.
I wrote a post on using regex to locate links within an HTML page (the intent was to use JavaScript to open external links, or links to documents such as PDFs, in a popup window).
The final regex was:
^(?:[./]+)?(?:Assets|https?://(?!(?:www.)?integralist))
The full post is here:
http://www.integralist.co.uk/javascript/regular-expression-to-open-external-links-in-popup-window/
The solution won't be perfect but might help point you in the right direction.
Mark
You're probably not escaping your .s. You need to use \. for each one.
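A small Python sketch of the difference (the sample text is made up for illustration): an unescaped . matches any character, so the loose pattern also accepts strings that are not URLs at all.

```python
import re

text = "visit www.example.com, not wwwXexampleYcom"

# Unescaped dots match any character.
loose = re.findall(r"www.example.com", text)

# Escaped dots match only a literal ".".
strict = re.findall(r"www\.example\.com", text)

# re.escape builds the escaped form from a literal string.
escaped = re.escape("www.example.com")

print(loose)
print(strict)
print(escaped)
```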
Take a look at strfriend.com. It has a URL example, and represents it graphically.
The example it suggests is:
^((ht|f)tp(s?)://|~/|/)?(\w+:\w+@)?([a-zA-Z]{1}([\w-]+\.)+(\w{2,5}))(:\d{1,5})?((/?\w+/)+|/?)(\w+\.\w{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
I've got a problem I need solved using regular expressions; it involves taking a CSS selector and compiling a regex that matches the string representation of the nodes inside an HTML document. The point is to avoid parsing the HTML as XML and then making either XPath or DOM queries to apply style attributes.
Does anyone know of a project that already implements something like this in any language? The target platform would be .NET 3.5.
Html Agility Pack
Regular expressions seem like an amazingly bad way of matching those nodes. I'm not sure I follow your problem: why not just use something like jQuery to pick out those nodes? E.g., given a CSS selector 'div>span.red:first-child',
$('div>span.red:first-child')
would return an array of those matching nodes.
EDIT: Oh, wait, are you trying to do this 'offline', as it were, not in a user's browser? Yeah, ignore my advice. (Even so, I'd still suggest that regular expressions aren't going to help you. Why are you against generating an XML document representation of the page?)