Scenario:
In Google Analytics, I've noticed that it's possible to replace certain URI parameters with words of your choosing by using a Search and Replace filter, as in the example below.
e.g. www.example.com/abc/product_id=3 -----> www.example.com/abc/product_name=shampoo
Problems:
Currently I've got a list of over 1,000 products in hand. Instead of creating 1,000 Search and Replace filters, what would be the most efficient and maintainable way to solve this problem?
I've done some digging and noticed that a custom dimension could be the solution; however, it would require me to modify the JS code on the FTP server, which I don't have permission to do. What other solutions do I have?
If it's not possible to show it here, is there any kind of tutorial that I could follow?
I'd really appreciate the help. Many thanks!
This is not a complete answer, but it's certainly more than a comment.
Besides the tedium of writing this out by hand, I can think of two options available to you.
Firstly, you could use the Google Analytics Management API (https://developers.google.com/analytics/devguides/config/mgmt/v3/). By scripting against it, you could quickly iterate through your list and create the required 1,000 search and replace filters.
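A rough sketch of what that could look like with the Python client library (the account ID, credentials setup, and product list are placeholders; you'd need OAuth credentials with edit scope, and you'd still have to link each filter to the relevant view):

from googleapiclient.discovery import build

# assumes OAuth2 credentials with the analytics.edit scope are already set up
service = build('analytics', 'v3', credentials=credentials)

products = {'3': 'shampoo'}  # placeholder: product_id -> product_name

for product_id, product_name in products.items():
    body = {
        'name': 'Rewrite product %s' % product_id,
        'type': 'SEARCH_AND_REPLACE',
        'searchAndReplaceDetails': {
            'field': 'PAGE_REQUEST_URI',
            'searchString': 'product_id=%s' % product_id,
            'replaceString': 'product_name=%s' % product_name,
        },
    }
    service.management().filters().insert(accountId='12345', body=body).execute()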
Secondly, if you were to use Google Tag Manager you would be able to create a Custom JavaScript Variable that takes the page path and compares it to your list. This variable could then replace the Page field before the hit data is sent to Google Analytics. This may sound more complicated, but it would allow you to pull your solution out of Google Analytics and into the flexible world of JavaScript.
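The heart of that variable is just a lookup table from id to name. A minimal sketch of the logic (shown in Python for illustration; in GTM itself you'd express exactly this as a Custom JavaScript Variable, and the mapping and paths here are made up):

# made-up mapping: product_id -> product_name
PRODUCTS = {'3': 'shampoo'}

def rewrite_page_path(path):
    # turn /abc/product_id=3 into /abc/product_name=shampoo;
    # unknown ids pass through unchanged
    prefix, sep, product_id = path.partition('product_id=')
    name = PRODUCTS.get(product_id)
    return (prefix + 'product_name=' + name) if name else path

print(rewrite_page_path('/abc/product_id=3'))  # -> /abc/product_name=shampoo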
Note that if you rewrite the product_id to a product_name once, you will have to maintain that cross-reference every day and keep it in sync with what appears on the website -- make sure you have an automated solution or it will quickly get out of sync and be more of a mess than before.
An alternative is to do the search-and-replace on the reporting side.
Products like Analytics Edge or Analytics Canvas could easily do this, or you could just download the data into Excel or Google Sheets and use a series of lookup formulae.
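If you'd rather script the reporting-side lookup than maintain spreadsheet formulae, a left join covers it. A sketch with pandas (file and column names are assumptions):

import pandas as pd

# assumed exports: a GA page report and your id -> name list
hits = pd.read_csv('ga_pages.csv')       # assumed column: page
products = pd.read_csv('products.csv')   # assumed columns: product_id, product_name

hits['product_id'] = hits['page'].str.extract(r'product_id=(\d+)', expand=False)
products['product_id'] = products['product_id'].astype(str)
report = hits.merge(products, on='product_id', how='left')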
I have a requirement in my project where I will have to build a webservice. This webservice will do the following things:
Accept XML format data
Return XML format data
The XML input will have one element containing login information and another element containing the data that needs processing.
Now I am looking for a design pattern that will keep the webservice code nice, neat and clean, because the webservice has to do plenty of things:
Parse the XML
Authenticate the request by checking the username and password
Create objects from the data and then save the data to the database
Prepare an XML response which will be returned to the client
So I have around 4 major steps which will definitely make the code look ugly if I write the whole thing in the .asmx.cs file.
Can anyone suggest a design pattern to suit this, so that the code is easy to maintain in the near future?
As this module is to be integrated into my existing project, there are some restrictions: I can't use third-party modules or DLLs.
So I was looking for something like the Single Responsibility principle, Chain of Responsibility, Command or Decorator patterns, or any other OOP concept that fits.
I have searched but haven't understood where to start.
Thanks.
M.
I wouldn't write any of that from scratch. Use ServiceStack or MS MVC 4 for the webservice host. Rely upon them to do the conversion from XML to/from your objects. Both of those frameworks include authentication features. Start by reading their tutorials. It sounds to me like you have no experience with ORMs or micro ORMs or the various database options. I'd read a lot of tutorials on those as well.
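That said, if the no-third-party-DLL restriction really does rule frameworks out, you can still get clean code by giving each of your four steps its own class (Single Responsibility) and composing them as a simple pipeline. A minimal sketch of the shape -- in Python for brevity, but it ports directly to C#; the element names and the credential check are placeholders:

import xml.etree.ElementTree as ET

class ParseXml:
    def handle(self, ctx):
        ctx['request'] = ET.fromstring(ctx['raw_xml'])

class Authenticate:
    def handle(self, ctx):
        login = ctx['request'].find('login')
        # placeholder check -- swap in your real credential store
        if login is None or login.get('user') != 'demo':
            raise PermissionError('bad credentials')

class SaveData:
    def handle(self, ctx):
        # placeholder: map <data> to your entities and persist them
        ctx['saved'] = [el.tag for el in ctx['request'].find('data')]

class BuildResponse:
    def handle(self, ctx):
        ctx['response_xml'] = '<response status="ok"/>'

def run_pipeline(raw_xml):
    ctx = {'raw_xml': raw_xml}
    for step in (ParseXml(), Authenticate(), SaveData(), BuildResponse()):
        step.handle(ctx)
    return ctx['response_xml']

print(run_pipeline('<req><login user="demo"/><data><item/></data></req>'))

Each class does one thing, each can be unit-tested alone, and your .asmx.cs handler shrinks to a single call into the pipeline.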
It is my understanding that TED likely is not looking at making a BlackBerry app. I have a few frameworks I've already created for parsing various types of APIs/feeds/services and would like to know if there is a way for a third-party developer to make a TED app. I've heard mention of an API via the Googles but cannot find it.
I know this is quite old, but I just came across it. I was searching for an API for TED Talks as well, but was unable to find one. They have announced that they'll open up the site with an API, but I don't know when.
Instead, I decided to use the XML RSS feed, which is easily parsable. It's only usable for the latest talks and you naturally can't do any server-side filtering, but if that's good enough for you, you should definitely check out the feed:
http://www.ted.com/talks/rss
There's also the higher definition version of that feed:
http://feeds.feedburner.com/TedtalksHD?fmt=xml
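For example, a minimal sketch of consuming that feed in Python with the feedparser library (the entry fields follow standard RSS):

import feedparser

feed = feedparser.parse('http://www.ted.com/talks/rss')
for entry in feed.entries:
    print(entry.title, entry.link)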
There is an unofficial API for all TED talk details (gathered from YouTube and ted.com).
It's updated weekly with the data:
http://market.mashape.com/bestapi/ted
supports:
get talks by name
get talks by speaker
get talks by description
get talks by transcript / search transcript
and it looks like they publish more endpoints from time to time
If you guys are still looking for something, I just open-sourced about 5,400 TED talks (TEDx, TED, TEDEd). Check it out and build something cool.
https://github.com/saranyan/TED-talks
I'm trying to code a small program for my friend's company. They build metal cabinets, and every cabinet is made out of several parts. So when a customer tells them they need a new cabinet, they tell them the individual part numbers. My friend now needs a little tool/database where he can look up those part numbers and, if there is an entry, download the corresponding blueprint (saved as a PDF). Of course, the program also needs a function to create a new entry and upload a PDF file with it.
The program only needs to be installed locally on one Windows machine.
Now I need to know if there is a particular way to solve this. It would be helpful if someone could give me some keywords, so I can Google them and figure out how to begin :)
I have basic skills in C++ and Java and am willing to learn new stuff :)
thanks!
I have used SQLite for database applications and found it to have a lot of functionality and speed. It lacks all the advanced database admin stuff, but for single-user/embedded databases it's ideal. I use it over MySQL because of a significant performance improvement.
It has a Java interface via java.sql.Connection.
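To give a feel for it, a minimal sketch using Python's built-in sqlite3 module (the same idea works over JDBC from Java). Storing each PDF as a BLOB keeps everything in one local file; the table and file names here are made up:

import sqlite3

conn = sqlite3.connect('cabinets.db')  # a single local file, no server needed
conn.execute('CREATE TABLE IF NOT EXISTS parts ('
             'part_no TEXT PRIMARY KEY, blueprint BLOB)')

# "upload": store a part number with its blueprint PDF
with open('cabinet_door.pdf', 'rb') as f:
    conn.execute('INSERT OR REPLACE INTO parts VALUES (?, ?)',
                 ('DOOR-42', f.read()))
conn.commit()

# "download": look up a part number and write the PDF back out
row = conn.execute('SELECT blueprint FROM parts WHERE part_no = ?',
                   ('DOOR-42',)).fetchone()
if row:
    with open('DOOR-42.pdf', 'wb') as f:
        f.write(row[0])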
I think this is a good opportunity to try something other than C++/Java. In your case I would go for Ruby on Rails and, for example, the Paperclip extension (http://thoughtbot.com/community/) (haven't used Paperclip myself, though). Ruby on Rails is kind of made for this type of problem.
How does one intelligently parse data returned by search results on a page?
For example, let's say that I would like to create a web service that searches for online books by parsing the search results of many book providers' websites. I could get the raw HTML data of the page and use some regexes to make the data work for my web service, but if any of the websites change the formatting of the pages, my code breaks!
RSS is indeed a marvelous option, but many sites don't have an XML/JSON based search.
Are there any kits out there that help extract information from pages automatically? A crazy idea would be to have a fuzzy AI module recognize patterns on a search results page and parse the results accordingly...
I've done some of this recently, and here are my experiences.
There are three basic approaches:
Regular Expressions.
Most flexible, easiest to use with loosely-structured info and changing formats.
Harder to do structural/tag analysis, but easier to do text matching.
Built-in validation of data formatting.
Harder to maintain than others, because you have to write a regular expression for each pattern you want to use to extract/transform the document.
Generally slower than 2 and 3.
Works well for lists of similarly-formatted items
A good regex development/testing tool and some sample pages will help. I've got good things to say about RegexBuddy here. Try their demo.
I've had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code. (There's a short sketch of this approach after the list.)
Convert HTML to XHTML and use XML extraction tools. Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
Tools: TagSoup, HTMLTidy, etc
Quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
Best solution if data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc)
Most suitable for getting link structures, nested tables, images, lists, and so forth
Should be faster than option 1, but slower than option 3.
Works well if content formatting changes/is variable, but document structure/layout does not.
If the data isn't structured by HTML tags, you're in trouble.
Can be used with option 1. (Also sketched after this list.)
Parser generator (ANTLR, etc) -- create a grammar for parsing & analyzing the page.
I have not tried this because it was not suitable for my (messy) pages
Most suitable if the HTML is highly structured, very constant, regular, and never changes.
Use this if there are easy-to-describe patterns in the document, but they don't involve HTML tags and involve recursion or complex behaviors
Does not require XHTML input
FASTEST throughput, generally
Big learning curve, but easier to maintain
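To make options 1 and 2 concrete, here are two minimal sketches in Python against a made-up results page where each hit looks like <div class="result"><a href="...">Title</a></div> -- the markup and selectors are assumptions, not any particular site.

Option 1, lazy quantifiers plus capturing groups (quick to write, fragile when the layout shifts):

import re

html = ('<div class="result"><a href="/b/1">Book One</a></div>'
        '<div class="result"><a href="/b/2">Book Two</a></div>')

# lazy quantifiers + capturing groups, as the edit below recommends
pattern = r'<div class="result"><a href="(.*?)">(.*?)</a></div>'
for href, title in re.findall(pattern, html, re.S):
    print(href, title)

Option 2, the same extraction through a parse tree and XPath (lxml is forgiving of messy HTML):

import lxml.html

doc = lxml.html.fromstring(html)  # the html string from the sketch above
for link in doc.xpath('//div[@class="result"]/a'):
    print(link.get('href'), link.text_content())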
I've tinkered with Web-Harvest for option 2, but I find its syntax kind of weird: a mix of XML and a pseudo-Java scripting language. If you like Java and XML-style data extraction (XPath, XQuery), that might be the ticket for you.
Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP's older regex libraries lack these, and they're indispensable for matching data between open/close tags in HTML.
Without a fixed HTML structure to parse, I would hate to maintain regular expressions for finding data. You might have more luck parsing the HTML through a proper parser that builds the tree. Then select elements ... that would be more maintainable.
Obviously the best way is some XML output from the engine with fixed markup that you can parse and validate. I would think that an HTML parsing library with some 'in the dark' probing of the produced tree would be simpler to maintain than regular expressions.
This way, you just have to check on <a href="blah" class="cache_link">... turning into <a href="blah" class="cache_result">... or whatever.
Bottom line, grepping specific elements with regexps would be grim. A better approach is to build a DOM-like model of the page and look for 'anchors' to character data in the tags.
Or send an email to the site stating a case for an XML API ... you might get hired!
You don't say what language you're using. In Java land you can use TagSoup and XPath to help minimise the pain. There's an example from this blog (of course the XPath can get a lot more complicated as your needs dictate):
import java.net.URL;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jaxen.jdom.JDOMXPath;

URL url = new URL("http://example.com");
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build a JDOM tree from a SAX stream provided by tagsoup
Document doc = builder.build(url);
JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml");
String title = ((Element) titlePath.selectSingleNode(doc)).getText();
System.out.println("Title is " + title);
I'd recommend externalising the XPath expressions so you have some measure of protection if the site changes.
Here's an example XPath I'm definitely not using to screenscrape this site. No way, not me:
"//h:div[contains(#class,'question-summary')]/h:div[#class='summary']//h:h3"
You haven't mentioned which technology stack you're using. If you're parsing HTML, I'd use a parsing library:
Beautiful Soup (Python)
HTML Agility Pack (.NET)
There are also webservices that do exactly what you're saying - commercial and free. They scrape sites and offer webservice interfaces.
A generic webservice that offers some screen scraping is Yahoo Pipes; see the previous Stack Overflow question on that.
It isn't foolproof, but you may want to look at a parser such as Beautiful Soup. It won't magically find the same info if the layout changes, but it's a lot easier than writing complex regular expressions. Note this is a Python module.
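For instance, a minimal sketch (the URL and page structure here are made up):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# made-up results page: each hit is <div class="result"><a>...</a></div>
soup = BeautifulSoup(urlopen('http://example.com/search?q=books'), 'html.parser')
for link in soup.select('div.result a'):
    print(link.get_text(strip=True), link['href'])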
Unfortunately 'scraping' is the most common solution, as you said attempting to parse HTML from websites. You could detect structural changes to the page and flag an alert for you to fix, so a change at their end doesn't result in bum data. Until the semantic web is a reality, that's pretty much the only way to guarantee a large dataset.
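One cheap way to do that detection, assuming you keep a fingerprint from the last successful scrape: hash the page's tag skeleton, ignoring the text, and raise a flag when the hash changes.

import hashlib
import re

def structure_fingerprint(html):
    # keep only tag names, dropping text and attribute values, so the
    # hash reflects layout rather than content
    skeleton = ' '.join(re.findall(r'<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)', html))
    return hashlib.sha256(skeleton.encode()).hexdigest()

old = structure_fingerprint('<div><a href="/b/1">Book</a></div>')
new = structure_fingerprint('<div><span><a href="/b/2">Book</a></span></div>')
if old != new:
    print('page structure changed -- scraper may need fixing')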
Alternatively you can stick to small datasets provided by APIs. Yahoo are working very hard to provide searchable data through APIs (see YDN), I think the Amazon API opens up a lot of book data, etc etc.
Hope that helps a little bit!
EDIT: And if you're using PHP I'd recommend SimpleHTMLDOM
Have you looked into using an HTML manipulation library? Ruby has some pretty nice ones, e.g. Hpricot.
With a good library you could specify the parts of the page you want using CSS selectors or XPath. These would be a good deal more robust than using regexps.
Example from hpricot wiki:
doc = Hpricot(open("qwantz.html"))
(doc/'div img[@src^="http://www.qwantz.com/comics/"]')
#=> Elements[...]
I am sure you could find a library that does similar things in .NET or Python, etc.
Try googling for screen scraping + the language you prefer.
I know several options for python, you may find the equivalent for your preferred language:
Beautiful Soup
mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages
lxml: Python binding to libxml2 and libxslt
scrapemark: uses templates to scrape pieces of pages
pyquery: allows you to make jQuery-like queries on XML/XHTML documents
scrapy: a high-level scraping and web crawling framework for writing spiders to crawl and parse web pages
Depending on the website to scrape you may need to use one or more of the approaches above.
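To give a taste of the last one, a minimal scrapy spider (the URL and selectors are made up); run it with scrapy runspider spider.py -o books.json:

import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://example.com/search?q=python']  # made up

    def parse(self, response):
        # made-up markup: each result is <div class="result"><a>...</a></div>
        for result in response.css('div.result'):
            yield {
                'title': result.css('a::text').get(),
                'url': result.css('a::attr(href)').get(),
            }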
If you can use something like Tag Soup, that'd be a place to start. Then you could treat the page like an XML API, kinda.
It has a Java and C++ implementation, might work!
Parsley at http://www.parselets.com looks pretty slick.
It lets you define 'parslets' using JSON, where you define what to look for on the page, and it then parses that data out for you.
As others have said, you can use an HTML parser that builds a DOM representation and query it with XPath/XQuery. I found a very interesting article here: Java theory and practice: Screen-scraping with XQuery - http://www.ibm.com/developerworks/xml/library/j-jtp03225.html
There is also an online service for parsing websites, https://loadsiteinmysql.site. It splits a site into tags and loads them into MySQL tables, which allows you to query sites using MySQL syntax.