How to refine text data? - python-2.7

I built many spiders to get news articles from different websites, and I have an API to convert the text to audio clips, but I need a framework or Python tools to refine the articles' text, such as:
- removing anything related to the source
- removing any date formats
- removing URLs
- changing acronyms such as CEO to chief executive officer, for example
- removing special characters and typos
- making sure that each sentence is written correctly after all the edits
- using the previously edited articles as a reference for the new articles
I am using Python, nltk and re, but it's exhausting: each time I think I've covered all the cases, I find new ones to add, and I feel stuck in an infinite loop.
Any suggestions?

First of all, expanding acronyms to their full form is non-trivial and should probably not be considered part of scraping but rather part of a second step of processing (cf. IBM's The Art of Tokenization).
Cleaning scraped data is tedious, unfortunately: there is no magical solution, because everyone is interested in scraping something different from what you are; some might be interested only in URLs, for example. Nevertheless, have you tried BeautifulSoup? It's a Python library that offers a very nice API for many common scraping-related tasks.
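To give an idea of what that second processing step can look like, here is a minimal Python sketch of a cleaning pass; the acronym table, regex patterns and sample text are assumptions, not a complete solution:

import re

# Hypothetical lookup table; extend it as new acronyms appear.
ACRONYMS = {"CEO": "chief executive officer", "CFO": "chief financial officer"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")  # e.g. 12/31/2017

def clean_article(text):
    text = URL_RE.sub("", text)
    text = DATE_RE.sub("", text)
    # Expand acronyms on word boundaries only.
    for short, full in ACRONYMS.items():
        text = re.sub(r"\b%s\b" % re.escape(short), full, text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_article("The CEO spoke on 12/31/2017, see http://example.com"))

Each rule stays in one place, so a new case becomes one more table entry or one more pattern rather than another ad-hoc pass over the text.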

Related

A tool which checks that a local version of a site is fully translated (for continuous integration)

I'm working on a project in which we are designing a localized version of an existing site (written in English) for another country (which is not English-speaking). The business requirement is "no English text, for all possible and impossible cases".
Does anyone know of a checker tool or service that can verify whether a site is fully translated, i.e. one that checks that there is no English text anywhere on it?
I know there are services for checking broken links, HTML validity, etc.; I need something like http://validator.w3.org/checklink, but for checking that no page of the site contains English text.
The reasons I think this is needed are:
1. There is a lot of code that is common (both on the backend and the frontend) to all countries.
2. If someone commits anything to the common code, I need to be sure it will not introduce English text into the localized version.
3. From a business point of view, it is preferable that the site not support some functionality at all than that it show English text (legal matters).
4. The code changes a lot, both on the frontend and the backend.
5. There are a lot of files that affect the text on the client's screen, not just one messages file, unfortunately. Some of the messages come from the backend, but most of them are in the frontend.
6. Because of all this, someone currently fills in all the forms manually and checks everything with their own eyes, and that is before each deploy...
I think you're approaching the problem from the wrong direction. You're looking for an algorithm or web crawler that can detect whether any text is English or not? I don't know of one, and I doubt such a thing even exists.
If you have translated the website, you have full access to the codebase and/or translation texts, right? Can't you just open both the English and non-English string files (.resx or whatever you are using) in a compare tool like Notepad++ to check the differences and see if there are any missing strings? And check the source code to verify that all parts that can output user-displayable text use the meta:resourceKey property (or whatever you are using).
If you want to go the way of crawling, I'm not aware of an existing crawler that does this, but it sounds like a combination of two simple issues:
Finding existing open-source code for a web crawler should be dead simple
Identifying a language through n-gram analysis is trivial if there's a limited number of languages the text can be in.
The only difficult part would be to ensure that the analyzer always has a decent chunk of text to work with. You could extract stuff paragraph by paragraph. For forms you'd probably have to combine the text of several form labels.
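As a rough sketch of that n-gram idea (the reference texts below are tiny placeholders; real profiles would be built from large samples of each language, or you could reach for a ready-made library such as langdetect):

from collections import Counter

def trigram_profile(text, top=300):
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, reference):
    # Classic "out of place" measure: how far each trigram's rank is from its rank in the reference.
    penalty = len(reference)
    return sum(abs(i - reference.index(g)) if g in reference else penalty
               for i, g in enumerate(profile))

# Placeholder reference profiles (English vs. the target language).
english_ref = trigram_profile("the quick brown fox jumps over the lazy dog and runs away home")
target_ref = trigram_profile("de snelle bruine vos springt over de luie hond en rent naar huis")

chunk = "Some paragraph of text extracted from the page under test."
profile = trigram_profile(chunk)
verdict = "english" if out_of_place(profile, english_ref) < out_of_place(profile, target_ref) else "target language"
print("looks like: " + verdict)

Run this paragraph by paragraph; anything that scores closer to the English profile gets flagged for a human to review.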

How does Yelp create the "Review Highlights" section?

Take the following link as an example: http://www.yelp.com/biz/chef-yu-new-york.
In the section called 'Review Highlights', there are 3 phrases (spicy diced chicken, happy hour, lunch specials) that are highlighted based on reviews submitted by users. Obviously, these are the phrases that appeared most often, or longest phrases that appeared often, or some other logic.
Their official explanation is this:
In their reviews, Yelpers mentioned the linked phrases below a lot. And these aren't any old common phrases, they're also the ones that our Yelp Robots have determined are unique and good, quick ways to describe this business. Click any of the phrases to see all the reviews that mention it.
My question is, what did they use to mine the text input to get these data points? Is it some algorithm based on Lempel Ziv, or some kind of map reduce? I was not a CS major, so probably am missing something foundational here. Would love some help, theories, etc.
Thanks!
I don't have any insight on the exact algorithm Yelp is using but this is a common problem in natural language processing. Essentially you want to extract the most relevant collocations (http://en.wikipedia.org/wiki/Collocation).
A simple way to do this is to extract a list of n-grams with the highest PMI (pointwise mutual information). This SO question explains how to do this using Python and the nltk library:
How to extract common / significant phrases from a series of text entries
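For instance, a minimal nltk sketch of that PMI approach (the review text here is a short placeholder):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

reviews = ("great happy hour and great lunch specials . "
           "the happy hour deals were great . "
           "try the spicy diced chicken at happy hour , the lunch specials too")
tokens = reviews.lower().split()  # a real pipeline would use a proper tokenizer

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)           # ignore bigrams seen only once
print(finder.nbest(measures.pmi, 5))  # top 5 bigrams by pointwise mutual information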
Lempel-Ziv is a data compression algorithm, and map-reduce is a technique for data processing. The former is probably not involved, and the latter is generally useful but not relevant here.
Without knowing the details of Yelp's code, it's impossible to say for sure, but it seems likely that their "review highlights" are simply based on tabulating all phrases that appear in reviews for this business, then displaying ones which are more common in reviews for this business than for other businesses. Some amount of natural language processing is likely to be involved to ensure that it picks noun phrases.
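A toy sketch of that comparison (counts and texts are invented, purely to show the shape of the idea):

from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

this_business = bigrams("great happy hour great lunch specials amazing happy hour")
all_businesses = bigrams("great food great service nice happy hour friendly staff great lunch")

# Score each phrase by how over-represented it is in this business's reviews.
scores = {p: this_business[p] / float(1 + all_businesses[p]) for p in this_business}
for phrase, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print("%s  %.2f" % (" ".join(phrase), score))

A production system would work with noun phrases instead of raw bigrams and smooth the counts properly, but the ranking idea is the same.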

What are my options for white-listing HTML in ColdFusion?

I want to allow my users to input HTML.
Requirements
Allow a specific set of HTML tags.
Preserve special characters (do not encode ã into &atilde;, for example)
Existing options
AntiSamy. Unfortunately AntiSamy encodes special characters and breaks requirement 2.
Native ColdFusion functions (HTMLCodeFormat() etc...) don't work as they encode HTML into entities, and thus fail requirement 1.
I found this set of functions somewhere, but I have no way of telling how secure this is: http://pastie.org/2072867
So what are my options? Are there existing libraries for this?
Portcullis works well in ColdFusion for attack-specific issues. I've used a couple of other regex solutions I found on the web over time that have worked well, though they haven't been nearly as fleshed out. In 15 years (10 as a CMS developer) nothing I've built has been hacked... knock on wood.
When developing input fields of any type, it's good to look at the problem from different angles. You've got the UI side, which includes both usability and client-side validation. Yes, it can be bypassed, but JavaScript-based validation is quicker, more responsive, and rates higher on the magical UI scale than back-end interruption or simply making things "disappear" without warning. It also speeds up the back-end validation because it does the initial screening. So it's not an "instead of" but an "in addition to" type of solution that can't be ignored.
Also on the UI front, giving your users a good-quality editor can make a huge difference. My personal favorite is CKEditor, simply because it's the only one that can handle Microsoft Word markup on the client side, keeping it far away from my DB. It seems silly, but Word HTML is valid, so it won't set off any red flags... yet on a moderately sized document it will quickly blow past a DB field's maximum insert size, believe it or not. Not only will a good editor reduce the amount of silly HTML that comes in, it will also make things faster for the user... win/win.
I personally encode and decode my characters...it's always just worked well so I've never changed practice.
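For what it's worth, the whitelist approach the question asks about looks roughly like this outside the ColdFusion world; the sketch below uses Python's bleach library purely as an illustration, and the tag/attribute lists are assumptions:

import bleach  # third-party whitelist-based HTML sanitizer

ALLOWED_TAGS = ["p", "b", "i", "a", "ul", "ol", "li"]   # assumed whitelist
ALLOWED_ATTRS = {"a": ["href", "title"]}

dirty = u'<p>ol\u00e1 <script>alert(1)</script><a href="/x" onclick="evil()">link</a></p>'
clean = bleach.clean(dirty, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)
print(clean)  # disallowed tags and attributes are stripped; accented characters are left as-is

AntiSamy works from the same kind of declarative whitelist policy, so before switching tools it may be worth checking whether its output encoding of special characters is configurable.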

Web Application Cross Site Scripting

My website http://www.imayne.com seems to have this issue, verified by McAfee. Can someone show me how to fix this?
It says this:
General Solution:
When accepting user input ensure that you are HTML encoding potentially malicious characters if you ever display the data back to the client.
Ensure that parameters and user input are sanitized by doing the following:
Remove < input and replace with "&lt;"
Remove > input and replace with "&gt;"
Remove ' input and replace with "&apos;"
Remove " input and replace with "&#x22;"
Remove ) input and replace with "&#x29;"
Remove ( input and replace with "&#x28;"
I cannot seem to show the actual code. This website is showing something else.
I'm not a web dev, but I can do a little. I'm trying to be PCI compliant.
Let me both answer your question and give you some advice. Preventing XSS properly needs to be done by defining a white-list of acceptable values at the point of user input, not a black-list of disallowed values. This needs to happen first and foremost, before you even begin thinking about encoding.
Once you get to encoding, use a library from your chosen framework, don't attempt character substitution yourself. There's more information about this here in OWASP Top 10 for .NET developers part 2: Cross-Site Scripting (XSS) (don't worry about it being .NET orientated, the concepts are consistent across all frameworks).
Now for some friendly advice: get some expert support ASAP. You've got a fundamentally obvious reflective XSS flaw in an e-commerce site and based on your comments on this page, this is not something you want to tackle on your own. The obvious nature of this flaw suggests you've quite likely got more obscure problems in the site as well. By your own admission, "you're a noob here" and you're not going to gain the competence required to sufficiently secure a website such as this overnight.
The type of changes you are describing is often accomplished in several languages via an HTML-encoding function. What is the site written in? If this is an ASP.NET site, this article may help:
http://weblogs.asp.net/scottgu/archive/2010/04/06/new-lt-gt-syntax-for-html-encoding-output-in-asp-net-4-and-asp-net-mvc-2.aspx
In PHP use this function to wrap all text being output:
http://ch2.php.net/manual/en/function.htmlentities.php
Anyplace you see echo(...) or print(...) you can replace it with:
echo(htmlentities( $whateverWasHereOriginally, ENT_COMPAT)); // ENT_COMPAT encodes double quotes and leaves single quotes alone
Take a look at the examples section in the middle of the page for other guidance.
Follow those steps exactly, and you're good to go. The main thing is to ensure that you don't treat anything the user submits to you as code (HTML, SQL, Javascript, or otherwise). If you fail to properly clean up the inputs, you run the risk of script injection.
If you want to see a trivial example of this problem in action, search for
<span style="color:red">red</span>
on your site, and you'll see that the echoed search term is red.
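For comparison, the same "encode on output, using a library function" rule looks like this in Python (a minimal sketch, not part of the site in question):

try:
    from html import escape   # Python 3
except ImportError:
    from cgi import escape    # Python 2

user_input = '<span style="color:red">red</span>'
print(escape(user_input, True))  # angle brackets and quotes come out as entities, so the markup is displayed rather than executed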

"Smart" way of parsing and using website data?

How does one intelligently parse data returned by search results on a page?
For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get the raw HTML data of the page and use some regexes to make the data work for my web service, but if any of the websites change the formatting of their pages, my code breaks!
RSS is indeed a marvelous option, but many sites don't have an XML/JSON-based search.
Are there any kits out there that help extract information from pages automatically? A crazy idea would be to have a fuzzy AI module recognize patterns on a search results page and parse the results accordingly...
I've done some of this recently, and here are my experiences.
There are three basic approaches:
1. Regular expressions.
- Most flexible, easiest to use with loosely-structured info and changing formats.
- Harder to do structural/tag analysis, but easier to do text matching.
- Built-in validation of data formatting.
- Harder to maintain than the others, because you have to write a regular expression for each pattern you want to use to extract/transform the document.
- Generally slower than options 2 and 3.
- Works well for lists of similarly-formatted items.
A good regex development/testing tool and some sample pages will help. I've got good things to say about RegexBuddy here; try their demo.
I've had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code.
2. Convert HTML to XHTML and use XML extraction tools. Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
- Tools: TagSoup, HTMLTidy, etc.
- The quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
- Best solution if the data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc.).
- Most suitable for getting link structures, nested tables, images, lists, and so forth.
- Should be faster than option 1, but slower than option 3.
- Works well if the content formatting changes or is variable, but the document structure/layout does not.
- If the data isn't structured by HTML tags, you're in trouble.
- Can be used together with option 1.
3. Parser generator (ANTLR, etc.): create a grammar for parsing and analyzing the page.
- I have not tried this, because it was not suitable for my (messy) pages.
- Most suitable if the HTML is highly structured, very constant, regular, and never changes.
- Use this if there are easy-to-describe patterns in the document, but they don't involve HTML tags and do involve recursion or complex behaviors.
- Does not require XHTML input.
- FASTEST throughput, generally.
- Big learning curve, but easier to maintain.
I've tinkered with Web-Harvest for option 2, but I find its syntax kind of weird: a mix of XML and a pseudo-Java scripting language. If you like Java, and like XML-style data extraction (XPath, XQuery), that might be the ticket for you.
Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP's older regex libraries lack these, and they're indispensable for matching data between open/close tags in HTML.
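For example, a lazy quantifier inside a capturing group is what lets you grab just the text between one pair of tags (a minimal sketch; the markup is invented and real pages need more robust patterns):

import re

html = '<td class="price">19.99</td><td class="price">24.50</td>'
# (.*?) is lazy: it stops at the first closing tag instead of swallowing both cells.
prices = re.findall(r'<td class="price">(.*?)</td>', html, re.DOTALL)
print(prices)  # ['19.99', '24.50']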
Without a fixed HTML structure to parse, I would hate to maintain regular expressions for finding data. You might have more luck parsing the HTML through a proper parser that builds the tree. Then select elements ... that would be more maintainable.
Obviously the best way is some XML output from the engine with a fixed markup that you can parse and validate. I would think that an HTML parsing library with some 'in the dark' probing of the produced tree would be simpler to maintain than regular expressions.
This way, you just have to check on <a href="blah" class="cache_link">... turning into <a href="blah" class="cache_result">... or whatever.
Bottom line, grepping specific elements with regexp would be grim. A better approach is to build a DOM like model of the page and look for 'anchors' to character data in the tags.
Or send an email to the site stating a case for an XML API ... you might get hired!
You don't say what language you're using. In Java land you can use TagSoup and XPath to help minimise the pain. There's an example from this blog (of course the XPath can get a lot more complicated as your needs dictate):
// Imports needed for this snippet (JDOM, Jaxen and TagSoup on the classpath):
import java.net.URL;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from a SAX stream provided by tagsoup
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h","http://www.w3.org/1999/xhtml");
String title = ((Element)titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is "+title);
I'd recommend externalising the XPath expressions so you have some measure of protection if the site changes.
Here's an example XPath I'm definitely not using to screenscrape this site. No way, not me:
"//h:div[contains(#class,'question-summary')]/h:div[#class='summary']//h:h3"
You haven't mentioned which technology stack you're using. If you're parsing HTML, I'd use a parsing library:
Beautiful Soup (Python)
HTML Agility Pack (.NET)
There are also webservices that do exactly what you're saying - commercial and free. They scrape sites and offer webservice interfaces.
And a generic web service that offers some screen scraping is Yahoo Pipes; see this previous Stack Overflow question on that.
It isn't foolproof, but you may want to look at a parser such as Beautiful Soup. It won't magically find the same info if the layout changes, but it's a lot easier than writing complex regular expressions. Note this is a Python module.
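A minimal Beautiful Soup sketch (the markup, class names and fields are invented):

from bs4 import BeautifulSoup  # newer releases live in the bs4 package

# Stand-in for a fetched search-results page.
html = """
<div class="result"><a href="/book/1">First Book</a></div>
<div class="result"><a href="/book/2">Second Book</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", class_="result"):
    link = div.find("a")
    if link is not None:
        print(link.get_text(strip=True) + " -> " + link.get("href"))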
Unfortunately 'scraping' is the most common solution: as you said, attempting to parse HTML from websites. You could detect structural changes to the page and flag an alert for you to fix, so a change at their end doesn't result in bum data. Until the semantic web is a reality, that's pretty much the only way to guarantee a large dataset.
Alternatively you can stick to small datasets provided by APIs. Yahoo are working very hard to provide searchable data through APIs (see YDN), I think the Amazon API opens up a lot of book data, etc etc.
Hope that helps a little bit!
EDIT: And if you're using PHP I'd recommend SimpleHTMLDOM
Have you looked into using an HTML manipulation library? Ruby has some pretty nice ones, e.g. Hpricot.
With a good library you can specify the parts of the page you want using CSS selectors or XPath. These would be a good deal more robust than using regexps.
Example from hpricot wiki:
doc = Hpricot(open("qwantz.html"))
(doc/'div img[@src^="http://www.qwantz.com/comics/"]')
#=> Elements[...]
I am sure you could find a library that does similar things in .NET or Python, etc.
Try googling for screen scraping + the language you prefer.
I know several options for python, you may find the equivalent for your preferred language:
Beautiful Soup
mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages
lxml: Python bindings to libxml2 and libxslt
scrapemark: uses templates to scrape pieces of pages
pyquery: allows you to make jQuery-style queries on XML/XHTML documents
scrapy: a high-level scraping and web-crawling framework for writing spiders to crawl and parse web pages
Depending on the website to scrape you may need to use one or more of the approaches above.
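For instance, a rough lxml sketch (the markup, class name and XPath are placeholders):

import lxml.html

# Stand-in for a downloaded page; in practice you would use lxml.html.parse(url_or_file).
html = '<div class="result"><a href="/book/1">First Book</a></div>'
doc = lxml.html.fromstring(html)

# Keeping the XPath expression in one place makes it easier to update when the site changes.
RESULT_LINKS = '//div[@class="result"]//a'
for link in doc.xpath(RESULT_LINKS):
    print(link.text_content().strip() + " -> " + link.get("href"))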
If you can use something like Tag Soup, that'd be a place to start. Then you could treat the page like an XML API, kinda.
It has a Java and C++ implementation, might work!
Parsley at http://www.parselets.com looks pretty slick.
It lets you define 'parslets' using JSON, in which you describe what to look for on the page, and it then parses that data out for you.
As others have said, you can use an HTML parser that builds a DOM representation and query it with XPath/XQuery. I found a very interesting article here: Java theory and practice: Screen-scraping with XQuery - http://www.ibm.com/developerworks/xml/library/j-jtp03225.html