I'm trying to extract some text from a poorly designed web page for a project, and after a lot of research and learning Python I came close to making it happen, but the markup is so messy that I can't find the right regular expression to do it.
Here is what I've accomplished so far. From the source code of http://coj.uci.cu/24h/status.xhtml?username=Diego1149&abb=1006 I want to get the whole line of the first instance of an accepted problem. So I thought of this expression:
patFinderTitle = re.compile('<table id="submission" class="volume">.*(<tr class=.*>.*<label class="AC">.*Accepted.*</label>.*</tr>).*</table>')
but what this does is match everything up to the last <tr> of the table. Can someone help me figure this out?
I'm using Python 2.7 with BeautifulSoup and urllib.
Stick to BeautifulSoup alone; regular expressions are the wrong tool for parsing HTML:
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://coj.uci.cu/24h/status.xhtml?username=Diego1149&abb=1006'))
table = soup.find('table', id='submission')
accepted = table.tbody.find('label', class_='AC')  # first "Accepted" label, if any
if accepted:
    row = accepted.parent.parent  # the <tr> containing the accepted cell
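If you want to eyeball what was matched, a hypothetical follow-up is to print the row's text with the markup stripped:
print row.get_text(' ', strip=True)  # assumes the page structure described above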
I am trying to use the REGEXP_EXTRACT custom field to pull a portion of my URL using the Page dimension in Google Data Studio and cannot figure it out. The page URL structure is similar to this:
website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage
I'd like to extract only the middle section, "great_practiceinfo_part2". I've tried many different formulas, but nothing seems to work. Does the Page dimension work in this scenario? Any help would be much appreciated.
Thanks
It seemed to work fine in Google Sheets when I used =REGEXEXTRACT(A3,B3) with your string website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage in A3 and the regex \/([^\/]*?)\.aspx\? in B3. I'm guessing you just need to work on how you build your regex pattern string.
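In Data Studio itself the equivalent calculated field should presumably be something like the line below (I haven't verified it there; Page is assumed to be your URL dimension, and [.] is used instead of \. to sidestep escaping questions):
REGEXP_EXTRACT(Page, '/([^/]+)[.]aspx')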
OK, I asked this already, but I guess I didn't ask it in the way Stack Overflow expects. Hopefully I will have more luck this time and get an answer.
I am trying to run nutch to crawl this site: http://www.tigerdirect.com/
I want it to crawl that site and all sublinks.
The problem is that it's not working. In my regex file I tried a couple of things, but none of them worked:
+^http://([a-z0-9]*\.)*tigerdirect.com/
+^http://tigerdirect.com/([a-z0-9]*\.)*
my urls.txt is:
http://tigerdirect.com
Basically, what I am trying to accomplish is to crawl all the product pages on their website so I can create a search engine (I am using Solr) for electronic products. Eventually I want to crawl bestbuy.com, newegg.com, and other sites as well.
BTW, I followed the tutorial at http://wiki.apache.org/nutch/NutchTutorial and I am using the script mentioned in section 3.3 (after fixing a bug it had).
I have a background in Java, Android, and Bash, so this is a little new to me. I used to write regexes in Perl 5 years ago, but that is all forgotten.
Thanks!
From your comments I can see that you have crawled something before, and that is why your Nutch starts crawling Wikipedia.
When you crawl something with Nutch, it records some metadata in a table (if you use HBase, it is a table named webpage). When you finish one crawl and start a new one, that table is scanned, and if a record's metadata says "this record can be fetched again because its next fetch time has passed", Nutch fetches those URLs along with your new ones.
So if you want only http://www.tigerdirect.com/ crawled on your system, you have to clean up that table first. If you use HBase, start the shell:
./bin/hbase shell
and disable the table:
disable 'webpage'
and finally drop it:
drop 'webpage'
I could have truncated that table instead, but I dropped it.
The next step is to put this into your seed.txt:
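For reference, the HBase shell also has a truncate command that disables, drops, and recreates the table in one step:
truncate 'webpage'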
http://www.tigerdirect.com/
Then open regex-urlfilter.txt, which is located at:
nutch/runtime/local/conf
and write this line into it:
+^http://([a-z0-9]*\.)*www.tigerdirect.com/([a-z0-9]*\.)*
Put that line in place of the default +. rule.
My pattern also allows crawling subdomains of tigerdirect.com; whether you want that is up to you.
After that you can send the crawl results to Solr for indexing and search them. I have tried it and it works, though you may hit some errors on the Nutch side; that is another topic.
You've got a / at the end of both of your regexes, but your seed URL doesn't have one.
http://tigerdirect.com/ will match, http://tigerdirect.com will not.
+^http://tigerdirect.com/([a-z0-9]*\.)*
Try moving that trailing slash inside the parens:
+^http://tigerdirect.com(/[a-z0-9]*\.)*
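To see the difference outside of Nutch, here is a quick Python sketch of the two patterns (the leading + in the filter file is Nutch's include marker, not part of the regex; this only illustrates the matching, not how Nutch applies the filters):

import re

url = 'http://tigerdirect.com'  # the seed URL, no trailing slash

# slash outside the group: the bare domain does not match
print re.match(r'^http://tigerdirect.com/([a-z0-9]*\.)*', url)  # None

# slash moved inside the optional group: the bare domain matches
print re.match(r'^http://tigerdirect.com(/[a-z0-9]*\.)*', url)  # match object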
I've been trawling the internet trying to find a solution to this issue. Basically, I am using a web service provided by the company that runs our support software to retrieve customer tickets and output them (subject to filtering) through our system, so that customers can see from their dashboard which support tickets they have active. I've managed to get the desired tags from the XML returned by the web service and place their content in an HTML table, listing the active tickets row by row. However, as the ticket description tag is populated with the content of emails sent by clients, there is a lot of nasty redundant CSS and styling applied to the email that I would like to remove.
So far I have managed to use the REPLACE function to strip some of the redundant content from the email body:
l_html_build := replace(l_html_build,'&lt;','<');
l_html_build := replace(l_html_build,'&gt;','>');
l_html_build := replace(l_html_build,'&lt;','');
l_html_build := replace(l_html_build,'&gt;','');
l_html_build := replace(l_html_build,'&nbsp;',' ');
However, I now need to rewrite the p tags, which have all sorts of garbage added to them, so that they become standard p tags.
From this:
<p 0in;"="" 3.0pt="" padding:="" 1.0pt;="" solid="" border-top:="" none;="" _mce_style=""border:" 0in"="" 0in="" 1.0pt;padding:3.0pt="" #b5c4df="" style=""border:none;border-top:solid">
To this:
<p>
I've looked into using the REGEXP_REPLACE function listed here: psoug. However, this appears to require a SELECT statement each time. The data I need to manipulate is stored in a CLOB called l_html_build, so is there any way of using REGEXP_REPLACE in a similar way to the REPLACE calls above, or is there an alternative method that I am not aware of?
I apologise if this is a noob question. My expertise lies in front-end development, PHP, and MySQL, but unfortunately I'm now required to do bits of PL/SQL in my new role.
Any help would be greatly appreciated.
Knowing that:
There is no standard PL/SQL package that parses HTML.
You can't reliably parse HTML with regex. Furthermore, Oracle only supports basic regular expressions, which restricts what you can do.
You want to stay in PL/SQL.
You are left with few options (that I can think of):
1. Write a simple procedure yourself that will work in most cases (but there will be many exceptions that break your parser).
2. Use a Java parser, load the class into the database, and call Java from PL/SQL. Oracle comes with an integrated JVM, so this involves no extra setup.
I would go with option (2) if you want reliability, or option (1) if infrequent but inevitable losses are acceptable.
Since your content will be coming from email clients, we can assume that only a tiny (negligible?) fraction will contain very obscure HTML.
In that case you could start with a simple regex that may need some tweaking:
SELECT regexp_replace(
         '<p1 3.0pt="" padding:="" #b5c4df="">
text
</p>',
         '<([[:alpha:]]+)[^>]*>',
         '<\1>') remove_attr_simple
FROM dual;

REMOVE_ATTR_SIMPLE
------------------
<p>
text
</p>
This will fail to catch tricky valid HTML (such as <P attr=">">) but since your input is somewhat standard this should be fine often enough. You may need to remove HTML comments with another procedure -- I'm not sure it can be done with regex.
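For what it's worth, simple (non-nested) comments can often be handled with a non-greedy pattern, assuming a reasonably recent Oracle (10g or later); this is a sketch for the easy cases only:

SELECT regexp_replace(
         'keep <!-- strip this --> keep',
         '<!--.*?-->',
         '',
         1, 0, 'n') remove_comments  -- 'n' lets . match newlines
FROM dual;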
SQL is really not the best tool for this job. Nor will regexes be able to perform this kind of task reliably. You would be better off extracting the data and processing it in another language using an XML parser.
Presumably Oracle itself is not sending these emails. What program does the sending, and can you add some programmatic processing at that point?
Since you already know PHP, here is a discussion of parsing HTML/XML in PHP. Similar tools are available in most other languages.
I've looked through other posts and have tried to implement what they said in my code, but I'm still missing something.
What I am trying to do is get all the image links off a website, specifically reddit.com, and once I obtain the links, display the images in my browser or download them and view them through Windows Image Viewer. I am just trying to practice and broaden my Python skills.
I am stuck at obtaining the links and choosing how to display the images.
What I have right now is:
import urllib2
import re
links=urllib2.urlopen("http://www.reddit.com").read()
found=re.findall("http://imgur.com/+\w+.jpg", links)
print found #Just for testing purposes, to see what links are found
Thanks for the help.
The imgur.com links on reddit do not have any .jpg extensions, so your regular expression won't match anything. You should be looking for the i.imgur.com domain instead.
Matching re.findall("http://i.imgur.com/\w+.jpg", links) does return results:
>>> re.findall("http://i.imgur.com/\w+.jpg", links)
['http://i.imgur.com/PMNZ2.jpg', 'http://i.imgur.com/akg4l.jpg', 'http://i.imgur.com/dAHtq.jpg', 'http://i.imgur.com/dAHtq.jpg', 'http://i.imgur.com/nT73r.jpg', 'http://i.imgur.com/nT73r.jpg', 'http://i.imgur.com/z2wIl.jpg', 'http://i.imgur.com/z2wIl.jpg']
You can expand this to other file extensions:
>>> re.findall("http://i.imgur.com/\w+.(?:jpg|gif|png)", links)
['http://i.imgur.com/PMNZ2.jpg', 'http://i.imgur.com/akg4l.jpg', 'http://i.imgur.com/dAHtq.jpg', 'http://i.imgur.com/dAHtq.jpg', 'http://i.imgur.com/rsIfN.png', 'http://i.imgur.com/rsIfN.png', 'http://i.imgur.com/nT73r.jpg', 'http://i.imgur.com/nT73r.jpg', 'http://i.imgur.com/bPs5N.gif', 'http://i.imgur.com/z2wIl.jpg', 'http://i.imgur.com/z2wIl.jpg']
You may want to use a proper HTML parser instead of a regular expression; I can recommend both BeautifulSoup and lxml. With those tools it'll be much easier to find all the tags that use i.imgur.com links, including .gif and .png files, if any.
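For example, a minimal BeautifulSoup sketch (reddit's markup can change, so treat this as illustrative) that pulls the i.imgur.com links out of the anchor tags instead of regexing the raw HTML:

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.reddit.com').read())
found = [a['href'] for a in soup.find_all('a', href=True)
         if a['href'].startswith('http://i.imgur.com/')]
print found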
I've got the following code in my RSS consumer (Vandelay Industries RemoteRSS) in my Orchard CMS implementation:
@using System.Xml.Linq
@{
    var feed = Model.Feed as XElement;
}
<ul>
    @foreach (var item in feed
        .Element("channel")
        .Elements("item")
        .Take((int)Model.ItemsToDisplay))
    {
        <li>@T(item.Element("description").Value)</li>
    }
</ul>
The RSS feed I'm using is from Pinterest, and it bundles the image, link, and a short description all inside the 'description' elements of the feed:
<description><a href="/pin/215609900882251703/"><img src="http://media-cache-ec2.pinterest.com/upload/88664686384961121_UIyVRN8A_b.jpg"></a>How to install Orchard CMS on IIS Server</description>
My issue is that I don't want the text bits, and I also need to prefix the 'href=' links with 'http://www.pinterest.com'.
I've managed to edit the original code with my newbie skills into the above, which essentially displays the images as links; but the links are relative and thus point locally to my server. The images are also followed by the short description text.
So to summarise, I need a way to prefix all links with 'http://pinterest.com' and then remove the free text after the image links.
Any pointers will be greatly appreciated, Thanks.
You should probably parse the description with something like Html Agility Pack (http://htmlagilitypack.codeplex.com/) and then tweak it to add the prefix. Or you could learn regular expressions and do it without a library, though that could be a little trickier and more error-prone.
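For instance, here is a rough, untested Html Agility Pack sketch (descriptionHtml stands in for item.Element("description").Value) that prefixes the relative links and strips the trailing text:

using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(descriptionHtml);  // descriptionHtml: the feed item's description markup

// prefix relative pin links with the Pinterest host
foreach (var a in doc.DocumentNode.SelectNodes("//a[@href]")
                  ?? Enumerable.Empty<HtmlNode>())
{
    var href = a.GetAttributeValue("href", "");
    if (href.StartsWith("/"))
        a.SetAttributeValue("href", "http://www.pinterest.com" + href);
}

// drop the loose text nodes that follow the image link
foreach (var text in doc.DocumentNode.ChildNodes
                        .Where(n => n.NodeType == HtmlNodeType.Text)
                        .ToList())
{
    text.Remove();
}

var cleaned = doc.DocumentNode.OuterHtml;  // render in the view with @Html.Raw(cleaned)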