I am writing a basic web crawler for this web page (just for study purposes, and I have their permission):
http://www.seattle.gov/council/calendar#/?i=0
What I want to do is get each event's "Time", "Description" and "Location" from that page. I tried Python regular expressions, but this information does not appear in the page's HTML source. I then switched to Selenium, but I still don't know where to find the data.
Sometimes, things are in front of you but you don't see them.
You can fetch/extract that data from their RSS Feed. It's here: http://www.trumba.com/calendars/seattle-city-council.rss
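For example, here is a minimal Python sketch of reading that feed (assuming the requests library is installed; the time and location details may sit inside the description HTML or in feed-specific namespaced elements, so inspect the raw XML to confirm which fields you need):

# Minimal sketch: fetch the council calendar RSS feed and print each item's
# title and description.
import requests
import xml.etree.ElementTree as ET

FEED_URL = "http://www.trumba.com/calendars/seattle-city-council.rss"

response = requests.get(FEED_URL, timeout=10)
response.raise_for_status()

root = ET.fromstring(response.content)
for item in root.iter("item"):
    print(item.findtext("title", default=""))
    print(item.findtext("description", default=""))
    print("-" * 40)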
Hope this helps.
I'm making a blog using Ruby on Rails, but I'd like to write the posts in HTML in my favorite text editor rather than in the browser, similar to how it's done with Jekyll.
I know that I could save the HTML as text in the database and use the 'raw' helper, but I would like to see some recommended approaches.
Thank you!
We've got a WordPress site, and I've built a page that pulls content from different sections of our site which I'd like to use as the content for a bi-weekly MailChimp newsletter. Is there any way to automate pulling a div on our site into the body of a MailChimp template?
All the tools I've found pull in the page as "an article" and just put an image and headline into the message body, rather than the full page verbatim.
I'm not averse to doing some coding, but I'm not sure how to start.
Thanks for any suggestions.
I can think of two different routes you might be able to try. The first is to generate an RSS feed for the content you're talking about and then use an RSS Campaign to send the email. Depending on how you have this data stored on your site, WordPress might already be generating an RSS feed for you for that content.
The second option involves more coding: if you create a template with an editable section, you can then pass the content of that section in via the API. This is probably harder, since the campaign content APIs are pretty convoluted in v2.0; v3.0 should make that easier, but it's still in beta.
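As a rough illustration of that second route, here is a Python sketch (not WordPress PHP) that pulls a div out of a page and pushes it into an existing campaign's content through the v3.0 API. The page URL, div id, data-center prefix and campaign id are all placeholders, so check the v3.0 docs for the exact content endpoint your account exposes:

# Rough sketch only: grab a div from the site and set it as the HTML content
# of an existing MailChimp campaign via the v3.0 API.
# NEWSLETTER_PAGE, DIV_ID, API_KEY, DC and CAMPAIGN_ID are placeholders.
import requests
from bs4 import BeautifulSoup

NEWSLETTER_PAGE = "https://example.com/newsletter-roundup"   # hypothetical page
DIV_ID = "newsletter-content"                                # hypothetical div id
API_KEY = "your-mailchimp-api-key"
DC = "us1"                                                   # data-center suffix of your API key
CAMPAIGN_ID = "abc123"                                       # an existing campaign

# 1. Pull the rendered div out of the page.
page = requests.get(NEWSLETTER_PAGE, timeout=10)
page.raise_for_status()
soup = BeautifulSoup(page.text, "html.parser")
div_html = str(soup.find("div", id=DIV_ID))

# 2. Push it into the campaign's content ("set campaign content" in v3.0).
resp = requests.put(
    f"https://{DC}.api.mailchimp.com/3.0/campaigns/{CAMPAIGN_ID}/content",
    auth=("anystring", API_KEY),   # basic auth: any username plus the API key
    json={"html": div_html},
    timeout=10,
)
resp.raise_for_status()
print("Campaign content updated")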
I am concerned about page ranking on google with the following situation:
I am looking to convert my existing site, which has 150k+ unique pages, to an Ember app. Currently the URLs are something like domain.com/model/id; with Ember and hash-change routing they will become /#/model/id. I really want history state, but the lack of IE support rules that out. My Google sitemap has lots and lots of good results using the old model/id URLs. On the Rails side I will test the browser for compatibility before rendering either the JS-rich app or the plain HTML/CSS. Does anyone have good SEO suggestions for making this scheme succeed?
Linked below is my schema and the options I am looking at:
http://static.allplaces.net/images/EmberTF.pdf
History state is awesome, but it looks like only around 60% of browsers support it.
http://caniuse.com/history
Thanks, guys, for the suggestions; the Google guide is similar to what I'm going to try. I will roll it out to one client this month and see what Webmaster Tools and analytics show.
Here is everything you need to make your hash links SEO friendly: https://developers.google.com/webmasters/ajax-crawling/
Basically, you write your whole app with hash links, but you have to add "!" to them, so you have #!/model/id. Next, you must have all the pages generated somewhere, and when Google asks for them, return the "plain HTML" version as described here: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
Use Google Webmaster Tools to check whether your site is crawlable.
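To illustrate the handshake this scheme relies on: when Google sees #!/model/id it re-requests the page as ?_escaped_fragment_=/model/id, and the server answers that request with the pre-generated HTML. Your backend is Rails, so the Python/Flask sketch below is only to show the idea, and render_static_snapshot is a hypothetical helper:

# Illustration of the AJAX-crawling handshake, not a Rails implementation:
# the crawler rewrites example.com/#!/model/id to
# example.com/?_escaped_fragment_=/model/id and expects plain HTML back.
from flask import Flask, request

app = Flask(__name__)

def render_static_snapshot(fragment):
    # Hypothetical helper: look up the pre-generated HTML for /model/id.
    return f"<html><body>Snapshot for {fragment}</body></html>"

@app.route("/")
def index():
    fragment = request.args.get("_escaped_fragment_")
    if fragment is not None:
        # Crawler request: serve the plain HTML for the hash-bang route.
        return render_static_snapshot(fragment)
    # Normal browser request: serve the JS-rich Ember app shell.
    return "<html><body><!-- Ember app bootstraps here --></body></html>"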
I'm not sure if you're aware that you can configure Ember to use the browser history for the location API and keep your URLs the way they are referenced now. All you need to do is configure the Router's location property:
App.Router.reopen({
  location: 'history' // use the browser's History API instead of hash URLs
});
See more details about specifying the location API here.
I want to be able to extract the form submissions from the Web Forms for Marketers database and output them in an Excel file that I can make available to the public on demand.
Anyone know how I can do this?
I know they have an export-to-Excel option, but it is not automatic and requires someone to log in and have access to the form.
I haven't been able to find any documentation from Sitecore on how to do this. Is this a supported operation? Do I have to reflect over the DLLs to find API calls? Do I have to delve into the SQL database and figure out how to do it manually? Is there no hope?
You might get lucky using Reflector to disassemble the Sitecore.Forms DLL. Try to find out if you can disassemble the code that gets run when clicking the Export button.
Actually:
The command comes from: Sitecore.Form.Core.Commands.Export
The executed code is in: Sitecore.Form.Core.Pipelines.Export.Excel
Good luck!
This write-up provides a very detailed account of how to do this, in case anyone else stumbles on this post: http://r-coding-sitecoreblog.blogspot.com/2011/11/extracting-data-from-sitecore-wffm.html
Hello, you can use my blog post at the URL below to export the data to CSV. I have also written posts on exporting to XML and HTML on the front end:
http://sitecoretweaks.wordpress.com/2014/07/02/sitecore-export-to-csvexcel-of-web-form-for-marketers-form-wffm-reports/
You can find all of the posts about exporting data here:
http://sitecoretweaks.wordpress.com/
I don't know if it's called data mining or something else.
Let's say I have a worldwide business listing site that lists all the shops. I saw this website, ABC, that also lists shops, but only in Australia. They are listed page by page, with no ID.
How do I start writing a program that will crawl their pages and put selected information from each page into CSV format, which I can then import into my website?
At the very least, where can I learn how to do this? Thank you.
What you are attempting to do is known as "web scraping"; here's a good starting point for information, including the legal issues:
http://en.wikipedia.org/wiki/Web_scraping
One common framework for writing crawlers like this is Scrapy: http://scrapy.org/
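A minimal Scrapy spider sketch follows; the start URL, the CSS selectors and the next-page link are placeholders you would replace with the real listing site's markup:

# Minimal Scrapy spider sketch. The start URL, selectors and "next page"
# link are placeholders; adjust them to the real listing site's markup.
# Run with:  scrapy runspider shops_spider.py -o shops.csv
import scrapy

class ShopsSpider(scrapy.Spider):
    name = "shops"
    start_urls = ["https://example.com/shops?page=1"]  # hypothetical listing page

    def parse(self, response):
        for shop in response.css("div.shop-listing"):   # hypothetical selector
            yield {
                "name": shop.css("h2::text").get(),
                "address": shop.css(".address::text").get(),
                "phone": shop.css(".phone::text").get(),
            }
        # Follow the "next page" link until there are no more pages.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Running it with the -o shops.csv option writes the scraped items straight to CSV, which matches the import format you want.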
Yes, this process is called web scraping. If you are familiar with Java, the most useful tools here are HtmlUnit and WebDriver. You should use a headless browser to go through the pages and extract the important information using selectors (mostly XPath, or regular expressions over the HTML).
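The same headless-browser idea in Python rather than Java, using Selenium with headless Chrome (the URL and XPath expressions are placeholders, and this assumes chromedriver is installed):

# Headless-browser version of the same approach, in Python with Selenium.
# The URL and XPath expressions are placeholders for the real site's markup.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")           # run without opening a window
driver = webdriver.Chrome(options=options)   # assumes chromedriver is installed

try:
    driver.get("https://example.com/shops?page=1")   # hypothetical listing page
    for shop in driver.find_elements(By.XPATH, "//div[@class='shop-listing']"):
        name = shop.find_element(By.XPATH, ".//h2").text
        address = shop.find_element(By.XPATH, ".//*[@class='address']").text
        print(name, address)
finally:
    driver.quit()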