Regex to search for links in a text file? - regex

I'm trying to clean a database that contains a lot of links that don't work.
The problem is that there are a lot of links to pictures, and every picture has a different name, of course.
Is it possible to select every link that contains "http://example.com/img/bloguploads/" with a regex?

You can find all hyperlinks with:
https?://[a-zA-Z0-9./-]+
And all example.com image links with:
http://example\.com/img/bloguploads/\S+
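To apply this when cleaning the database, a minimal Python sketch (the file name dump.txt and the use of re.findall are illustrative assumptions, not part of the question):

import re

# the example.com image-link pattern from above
pattern = re.compile(r'http://example\.com/img/bloguploads/\S+')

with open('dump.txt', encoding='utf-8') as f:  # hypothetical export of the database
    text = f.read()

# collect every matching link so each one can be checked or removed
for link in pattern.findall(text):
    print(link)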

Related

Regex to replace spam links in Wordpress

I am dealing with old hacked sites in WordPress where there are injected spam links on images.
I have access to the database and would like to remove links that look like this:
<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>
The text inside the href varies; it might be for cialis or any other product, but the rest doesn't vary. I want to remove the entire link, so the result is a single space.
I don't know regex, so I would appreciate the help. I've tried online generators, but they don't seem to be working.
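No answer is quoted here, but a minimal sketch of one possible substitution in Python (the sample HTML string is mine, and it assumes only the href path varies, as the question states):

import re

html = 'some text<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>more text'

# match the fixed markup, letting only the href path vary
spam_link = re.compile(r'<a style="text-decoration:none" href="/[^"]*">\.</a>')

# replace each injected link with a single space, as requested
print(spam_link.sub(' ', html))  # some text more text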

Regex to include main page visits from one keyword

I get a lot of visits to my site's main page from different keywords. Examples of the format might be as follows:
/?keyword= train hard
/
/?keyword=
etc., etc.
To be able to sum up all visits to my main page regardless of the keyword, I wanted to use a regex like ^/$. However, that didn't work out. What regex should I apply to get the proper result?
What regex should I apply to see other sections of my website in a similar way? E.g.
/booking?keyword= or /section?keyword=any ?
Thanks in advance!
For the main page you can try: ^\/(?:\?keyword=.*)?$
Look here for example: https://regex101.com/r/5uCFun/2
For other pages, similarly: ^\/booking(?:\?keyword=.*)?$
Example here: https://regex101.com/r/1dAaHL/2
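A quick sanity check against the sample paths from the question (Python is just one way to test this; the regex101 links above do the same):

import re

# escaping the slash is optional in Python, so \/ becomes /
main_page = re.compile(r'^/(?:\?keyword=.*)?$')

for path in ['/?keyword= train hard', '/', '/?keyword=', '/booking?keyword=']:
    print(path, bool(main_page.match(path)))
# the first three match; /booking?keyword= needs the ^\/booking variant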

Find and click links in ugly table with Python and Selenium webdriver

I'm trying to get Selenium WebDriver to click x number of links in a table, and I can't get it to work. I can print the links like this:
links = driver.find_elements_by_xpath("//table[2]/tbody/tr/td/p/strong/a")
for i in range(len(links)):
    print(links[i].text)
But when I try to do a links[i].click() instead of printing, Python throws an error.
The site uses JSP, and the hrefs of the links look like this: "javascript:loadTestResult(169)"
This is a sub/sub-page that is not reachable by direct URL, and the table containing the links is very messy and large, so instead of pasting the whole source here I saved the page at this URL:
http://wwwe.aftonbladet.se/redaktion/martin/badplats.html
(I'm hunting the 12 blue links in the left column)
Any ideas?
Thanks
Martin
Sorry, too trigger-happy.
Simple solution to my own problem:
linkList = driver.find_elements_by_css_selector("a[href*='loadTestResult']")
for i in range(len(linkList)):
    # re-find the links on every pass before clicking
    links = driver.find_elements_by_css_selector("a[href*='loadTestResult']")
    links[i].click()
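Re-locating the links on every iteration is what makes this work: each click runs loadTestResult(...), which likely re-renders the page, so element references obtained before the click go stale and Selenium raises StaleElementReferenceException when you reuse them. Re-running the selector returns fresh references.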

Writing a regular expression for nutch's regex-urlfilter.txt file

I'm having some problems with the regex-urlfilter.txt file.
I want to crawl only links that have numbers before '.html'. It should be easy, but I can't get it right...
Here's an example:
http://www.utiltrucks.com/annonce-occasion-camion-poids-lourd/marque-renault/modele-midliner/ref-71015.html
http://www.utiltrucks.com/annonce-occasion-camion-poids-lourd/dpt-.html
I want to catch the first link.
I've tried the following entry in regex-urlfilter:
# accept anything else
+http://www.utiltrucks.com/annonce-occasion.+?[0-9]+.html
I get a message:
0 records selected for fetching, exiting ...
Anybody got an idea how to pull this off?
Note that your URL filters should also match your seed URLs, or else the seeds will be filtered out and Nutch won't get any chance to parse them and extract the links you want.
For example, if your seed file contains the URL http://www.utiltrucks.com/home then you should also add an entry to your regex-urlfilter file like this:
+http://www.utiltrucks.com/home
The same should be done for all pages on the path from your seed URLs to the target pages that you want to extract links from.
You also have to anchor your URL pattern at the start, like:
+^(http|https)://www.example.com
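Putting both answers together, a possible regex-urlfilter.txt sketch (the /home seed comes from the example above; it assumes no other intermediate pages sit between the seed and the detail pages, so add + rules for any that do):

# accept the seed page so the crawl can start
+^http://www\.utiltrucks\.com/home
# accept only detail pages with digits right before .html
+^http://www\.utiltrucks\.com/annonce-occasion.+[0-9]+\.html$
# reject everything else
-.

With these rules, ref-71015.html is fetched while dpt-.html is rejected, and the unescaped dots and missing ^ anchor from the original attempt are fixed.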

Using Selenium Python on Google page to click links

I'm trying to write a very simple script in Python with Selenium. I open a Google page with a search string, but then I am not able to locate any of the HTML elements like "Images" or "Maps", or any of the links appearing as part of the search results, even though I am using Firebug. Only one thing worked, and that is the following:
links = driver.find_elements_by_tag_name("a")
for link in links:
    print("hello")
What should I do if I want to click on "Images" or "Maps"?
What should I do if I want to click on the 1st, 2nd, or a particular numbered link, or click a link by its partial text?
Any help would be appreciated.
Something like:
driver.get('http://www.google.com')
driver.find_element_by_xpath('//a[starts-with(@href,"https://maps.google")]').click()
But please note that your browser will often redirect 'http://www.google.com' to a slightly different URL, and that the web page it displays might be slightly different.
What to do if I want to click on "Images" or "Maps"?
Images:
driver.find_element_by_css_selector('img#myImage').click()
Maps:
driver.find_element_by_css_selector("map[for='myImage']").click()
1st, 2nd or nth link:
driver.find_elements_by_tag_name('a')[n].click()
or:
driver.find_element_by_css_selector("div#someParent > a:nth-child(n)").click()
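For the partial-text case, a minimal sketch using the same old-style find_element_by_* API as the rest of this thread (the search URL is illustrative, and Google's page layout changes often, so the "Images" link text is an assumption):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.google.com/search?q=selenium')

# click the "Images" tab by its visible (partial) link text
driver.find_element_by_partial_link_text('Images').click()

driver.quit()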