Data mining to gather a website's details and put in CSV or SQL - data-mining

I don't know if it's called data mining or something else.
Let's say I have a world business listing site that lists all the shops. I saw a website, ABC, that also lists shops, but only in Australia. The listings are spread page by page, with no IDs.
How do I start writing a program that will crawl their pages and put selected information from each page into CSV format, which I can then import into my website?
At least, where can I learn this? Thank you.

What you are attempting to do is known as "web scraping". Here's a good starting point for information, including the legal issues:
http://en.wikipedia.org/wiki/Web_scraping
One common framework for writing crawlers like this is Scrapy: http://scrapy.org/

Yes, this process is called web scraping. If you are familiar with Java, the most useful tools here are HtmlUnit and WebDriver. You should use a headless browser to go through the pages and extract the important information using selectors (mostly XPath, or regular expressions over the HTML).
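As a rough illustration of the extract-then-write-CSV step, here is a minimal sketch using only the Python standard library. The markup it expects (`<div class="shop">` containing `<span class="name">` and `<span class="address">`) and the names `ShopParser`, `shops_to_csv` and `SAMPLE_PAGE` are assumptions made up for the example; the real site's HTML will differ, so the selectors would need adjusting.

```python
import csv
import io
from html.parser import HTMLParser


class ShopParser(HTMLParser):
    """Collects name/address pairs from <div class="shop"> blocks (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.shops = []      # one dict per shop found so far
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div" and cls == "shop":
            # A new listing starts: open an empty row for it.
            self.shops.append({"name": "", "address": ""})
        elif tag == "span" and cls in ("name", "address") and self.shops:
            self._field = cls

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        # Only text inside a recognised <span> is kept.
        if self._field:
            self.shops[-1][self._field] += data.strip()


def shops_to_csv(html):
    """Parse one page of HTML and return the shops as CSV text."""
    parser = ShopParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "address"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.shops)
    return out.getvalue()


# Stand-in for a downloaded page; a real crawler would fetch each page's HTML first.
SAMPLE_PAGE = """
<div class="shop"><span class="name">Corner Cafe</span><span class="address">1 High St, Sydney</span></div>
<div class="shop"><span class="name">Book Nook</span><span class="address">22 King Rd, Perth</span></div>
"""

print(shops_to_csv(SAMPLE_PAGE))
```

In a real crawl the HTML would come from something like `urllib.request.urlopen(page_url).read().decode()` for each listing page, and a framework such as Scrapy would take over the page-to-page crawling and the selector handling.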

Related

I am new to Django. Can anyone help me with the task below?

Go through the sample JSON file given here -
https://drive.google.com/open?id=1xZa3UoXZ3uj2j0Q7653iBp1NrT0gKj0Y
This JSON file describes a list of users & their corresponding periods of activity across
multiple months.
Now, design and implement a Django application with User and ActivityPeriod models, write
a custom management command to populate the database with some dummy data and design
an API to serve that data in the JSON format given above.
To build the application with the requirements mentioned, you need at least a basic understanding of Django and Django REST framework. There are plenty of learning materials on the web to get started. Mozilla has a very good tutorial to begin with: https://developer.mozilla.org/en-US/docs/Learn/Server-side/Django
Similarly, for the REST API, there are free videos on YouTube from channels such as Dennis Ivy's and CodingEntrepreneurs that are good to get started with. I would recommend checking those videos out!
All the best!
Happy Coding
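To make the "populate the database with some dummy data" part more concrete, here is a minimal sketch of the data-generation logic in plain Python. The linked JSON file is not reproduced here, so the field names (`id`, `real_name`, `activity_periods`, `start_time`, `end_time`) and the function name `make_dummy_users` are assumptions for illustration only; match them to the actual file.

```python
import json
import random
from datetime import datetime, timedelta


def make_dummy_users(count, periods_per_user=3, seed=0):
    """Return a list of dummy users, each with a few activity periods.

    Field names are assumed, not taken from the linked JSON file.
    """
    rng = random.Random(seed)          # seeded so repeated runs give the same data
    base = datetime(2020, 1, 1, 9, 0)  # arbitrary starting point
    users = []
    for i in range(count):
        periods = []
        for _ in range(periods_per_user):
            # Pick a random start within ~3 months and a positive duration.
            start = base + timedelta(days=rng.randrange(90), minutes=rng.randrange(480))
            end = start + timedelta(minutes=rng.randrange(15, 180))
            periods.append({"start_time": start.isoformat(), "end_time": end.isoformat()})
        users.append({
            "id": f"U{i:04d}",
            "real_name": f"User {i}",
            "activity_periods": periods,
        })
    return users


print(json.dumps(make_dummy_users(2), indent=2))
```

In a Django project this logic would typically live in the `handle()` method of a `BaseCommand` subclass under `yourapp/management/commands/`, creating `User` and `ActivityPeriod` rows instead of dictionaries; the API view would then serialize those rows back into the same shape.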

Cleaner URLs for SAS Stored Process Web Application

Is it possible to make a SAS Stored Process available via a clean, nicer looking URL, but still be hosted on the server?
The native URL is something like http://[yourMachineName]:8080/SASStoredProcess/do?_PROGRAM=/WebApps/MyWebApp/Foo.
I'd prefer a nicer looking URL like http://[yourMachineName]:8080/SASStoredProcess/WebApps/MyWebApp/Foo
The documentation for the overall process at http://documentation.sas.com/?docsetId=stpug&docsetTarget=dbgsrvlt.htm&docsetVersion=9.4&locale=en doesn't seem to address the issue.
Absolutely, yes, you can do this. One way is to use a front-end framework to provide a routing facility. Alternatively, simply host an index.html file in a folder (corresponding to the _PROGRAM path) on your mid-tier, then use the JavaScript onload event to call window.location.replace() with the full path to your STP as a parameter.
Your url could then be http://[yourMachineName]:8080/WebApps/MyWebApp/Foo.
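The redirect described above boils down to mapping the clean path onto the native `_PROGRAM` URL. A minimal sketch of that mapping follows; the helper name `stp_url` is hypothetical, and the base URL simply reuses the placeholder host from the question.

```python
from urllib.parse import quote

# Placeholder host taken from the question; replace with the real mid-tier address.
STP_BASE = "http://[yourMachineName]:8080/SASStoredProcess/do"


def stp_url(clean_path, base=STP_BASE):
    """Turn a clean path such as /WebApps/MyWebApp/Foo into the native STP URL."""
    program = "/" + clean_path.strip("/")
    # Percent-encode anything unusual (e.g. spaces) but keep the folder slashes.
    return f"{base}?_PROGRAM={quote(program, safe='/')}"


print(stp_url("/WebApps/MyWebApp/Foo"))
```

The index.html hosted at each clean path would pass the equivalent string to `window.location.replace()`.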
I wrote a guide to building web applications with SAS which is available here, and a quick blog on the subject available here.
As a general point, it is much more user-friendly to build a nice-looking UX with a modern framework such as React or Angular, and use that to call your SAS services as appropriate and display the results in a myriad of ways, than to call raw SAS programs directly (for surfacing data).
Angular routing: https://angular.io/guide/router
You can't do this in the SAS Stored Process Web Application. The SAS URL must contain the SAS folder path and the name of your stored process.
Options available within the Stored Process Web App are:
Use the folders view of the SAS Stored Process Web Application, so each user can navigate to the desired stored process from there:
http://YourServer:8080/SASStoredProcess/do?_Action=index
If you have a web page or SAS Visual Analytics available to your users, you can hyperlink the stored process URL from any text.

ember hash urls in google

I am concerned about page ranking on Google in the following situation:
I am looking to convert my existing site, with 150k+ unique page results, to an Ember app, off the route. Currently the URLs are something like domain.com/model/id; with Ember and hash changes they will be /#/model/id. I really want history state, but the lack of IE support doesn't leave that as an option. My sitemap for Google has lots and lots of great results using the old model/id. On the Rails side I will test the browser for compatibility before rendering either the JS-rich app or the plain HTML/CSS. Does anyone have good SEO suggestions for success with my current schema?
Linked below is my schema and looking at the options -
http://static.allplaces.net/images/EmberTF.pdf
History state is awesome but it looks like support is only around 60% of browsers.
http://caniuse.com/history
Thanks guys for the suggestions, the google guide is similar to what I'm going to try. I will roll it out to 1 client this month, and see what webmasters and analytics show.
Here is everything you need to make your hash links SEO-friendly: https://developers.google.com/webmasters/ajax-crawling/
Basically, you write your whole app with hash links, but you have to add "!" to them, so you have #!/model/id. Next, you must have all pages generated somewhere, and when Google asks for them, return plain HTML as described here: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
Use Google Webmaster Tools to check whether your site is crawlable.
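Under that AJAX crawling scheme, the crawler rewrites a `#!` URL into a request carrying an `_escaped_fragment_` query parameter, and the server answers that request with the pre-rendered HTML. A minimal sketch of the rewrite follows; the function name `escaped_fragment_url` is hypothetical and the percent-escaping is simplified compared with the full specification.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Return the URL a crawler would request for a #! (hash-bang) URL."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hash-bang URL, nothing to rewrite
    # Drop the leading "!" and percent-escape the rest (simplified escaping).
    token = "_escaped_fragment_=" + quote(parts.fragment[1:], safe="/")
    # Append to any existing query string instead of replacing it.
    query = parts.query + "&" + token if parts.query else token
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(escaped_fragment_url("http://domain.com/#!/model/42"))
```

On the server side, a request containing `_escaped_fragment_=/model/42` can be routed to the same plain HTML that the old `/model/42` page already renders.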
I'm not sure if you're aware that you can configure Ember to use the browser history for the location API and keep using your pages the way they are referenced now. All you need to do is configure the Router's location property:
// Use the HTML5 History API for routes instead of hash-based URLs
App.Router.reopen({
  location: 'history'
});
See more details about specifying the location api here

How to write mark-up in the dashboard?

I've been working with MicroStrategy for a while. I was hired into a company to help customize it. All the data is being pulled in by the Dataset Objects correctly. Before I was hired, the whole dashboard was created by someone else using the WYSIWYG tool. I've since created an HTML container that links to my custom JavaScript and CSS. But I've never been able to actually write my own HTML; it's only been the WYSIWYG tool.
I desperately want the ability to not have to use this terrible Design mode WYSIWYG tool and write my own mark-up. Is there a way? When I create a new dashboard is there a .html file that I can edit somewhere on the server?
Any help on customization will be greatly appreciated.
There isn't an HTML file that you can edit, because a Report Services document isn't (necessarily) just HTML output. Obviously it can be Flash, or if it's mostly static, PDF or Excel.
It actually sounds as though you want to build your own custom page entirely. This isn't a straightforward process, and you need (theoretically) to have an API licence. There is some reasonable documentation included with the product, and you can build an HTML page, using MicroStrategy's custom ASP/JSP tags to display content as you wish.
It's hard to be more specific without knowing what sort of customisation you want to do, but you really need to build a new page, rather than modifying/tinkering with an existing Report Services document.
If you want to do something more sophisticated, then you should be aware that you may need to be conversant in Java, which is the language MSTR has written its web API in.

Tracking User Actions on Landing Pages in Django

I'm developing a web application. It's months away from completion but I would like to build a landing page to show to potential customers to explain things and gauge their interest--basically collecting their email address and if they feel like it additional information like names + addresses.
Because I'm already using Django to build my site I thought I might use another Django App to serve as this landing page. The features I need are
to display a fairly static page and potentially a series of pages,
collect emails (and additional customer data)
track their actions, e.g., they got through the first two pages but didn't fill out the final page.
Is there any pre-existing Django app that provides any of these features?
If there is not a Django app, then does anyone know of another, faster/better way than building my own app? Perhaps a pre-existing web service that you can skin and make look like your own? Maybe there's the perfect system but it's PHP?--I'm open for whatever.
Option 1: Google Sites
You can set it up very, very quickly, though your monitoring wouldn't be as detailed as you're asking for. Still, easy and fast!
Option 2: bbclone
Something else that may be helpful is to set up some PHP-based site (WordPress or something) and use bbclone for tracking stuff on it. I've found bbclone to be pretty intense in reporting what everyone does, though it's been a while since I used it.
Option 3: Django Flatpages
The flatpages Django contrib app is pretty handy for making static flat pages. I'd probably just embed a Google Docs Form to collect email addresses (as that's super fast and lets you get back to real work). But this suggestion would still leave you needing to figure out how to get the level of detail you want on the stats end.
Perhaps consider Google Analytics anyway?
Regardless, I suggest you use Google Analytics with everything. That'll work with anything you do really, and for all I know, perhaps you can find a way to get the stats you're really looking for out of it.
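If you do end up rolling your own tracking, the drop-off question ("they got through the first two pages but didn't fill out the final page") reduces to recording one event per visitor per step and counting how far each visitor got. A minimal sketch in plain Python follows; the step names and the functions `furthest_step` and `funnel_counts` are made up for the example.

```python
# Ordered steps of the landing-page funnel (hypothetical names).
STEPS = ["page1", "page2", "signup_form"]


def furthest_step(events):
    """Map each visitor to the index of the furthest step they reached.

    events is an iterable of (visitor_id, step_name) pairs, in any order.
    """
    progress = {}
    for visitor, step in events:
        idx = STEPS.index(step)
        progress[visitor] = max(progress.get(visitor, -1), idx)
    return progress


def funnel_counts(events):
    """Return, for each step, how many visitors reached it or went beyond it."""
    progress = furthest_step(events)
    return [sum(1 for idx in progress.values() if idx >= i) for i in range(len(STEPS))]


events = [
    ("a", "page1"), ("a", "page2"),
    ("b", "page1"), ("b", "page2"), ("b", "signup_form"),
    ("c", "page1"),
]
print(dict(zip(STEPS, funnel_counts(events))))
```

In a Django view the events could be stored in a small model keyed by `request.session.session_key`, so that anonymous visitors can still be followed from page to page.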