How can I start with data mining for a small grocery shop?

My company got a project to build a simple website for a grocery shop: a catalogue only, with no shopping cart. A few days ago I read something about data mining, and I found that it is possible to do some predictive modelling, like this:
For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer.
I told them about this example, and they would be happy if I could do something like that.
Now I don't know how or where to start. I know MySQL and can write complex queries, but I don't know how to get the kind of data behind the beer-and-diapers example.
I have 3-4 months left. Can anyone guide me on how to start?
I also don't know what kind of customer shopping data I can get from the shop; maybe Excel files.
But I want to start.

Judging by your question, you don't seem to know much, if anything, about data mining. That said, you can get something usable running in 4 months, especially in a very restricted domain like a web shop, where all you are after, for a start, is probably buying patterns.
Please understand that you cannot expect some out-of-the-box solution that can be posted here in 10 lines of code, so I suggest you start by reading a decent book on the subject. I'd recommend:
Programming Collective Intelligence: Building Smart Web 2.0 Applications
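Since you already know MySQL, a natural first experiment is plain pair-counting over transactions, which is the simplest form of market-basket analysis. Below is a minimal, hedged sketch in Python; the `transactions` table, its columns, and the `shop.db` file are hypothetical stand-ins for whatever data the shop can actually export.

```python
import sqlite3
from itertools import combinations
from collections import Counter

# Hypothetical schema: one row per item per purchase.
# transactions(transaction_id INTEGER, item TEXT)
conn = sqlite3.connect("shop.db")

# Group items by transaction to rebuild each shopping basket.
rows = conn.execute(
    "SELECT transaction_id, item FROM transactions ORDER BY transaction_id"
)
baskets = {}
for tid, item in rows:
    baskets.setdefault(tid, set()).add(item)

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidate "beer and diapers" patterns.
for (a, b), n in pair_counts.most_common(10):
    print(f"{a} + {b}: bought together {n} times")
```

Once you have pair counts, dividing by the number of baskets containing the first item gives the confidence of a rule like "diapers => beer", which is the first step toward proper association-rule mining (Apriori and friends, covered in the book above).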

Related

How did these big music companies start? (three main questions)

This is something I've been thinking about lately: how did big music web apps or websites like Spotify, YouTube, or Anghami (if you know that one) start? I have three questions. First: how did they get these huge music libraries? Second: did each of those big companies need to buy special servers to hold the website data and music library, and if so, how much does a server cost in this case? Third: how did they settle copyrights with all of the creators, authors, and publishers, i.e. the copyright owners?
1. They are uploaded by the artists/creators. I'd imagine pre-release Spotify would have had a library already put together by working with the artists.
2. Yes. They cost a lot. There are hundreds of millions of users and terabytes upon terabytes of data, spread around the world. Server costs will be in the millions. Starting out the upfront cost to set up infrastructure would be very high too.
3. This is definitely not the place to ask this kind of question. I would Google how copyright agreements with artists usually work.

Document database like MongoDB: design of an expense tracker application

I'm getting started on the design of an application that tracks expenses, using MongoDB only in order to get familiar with document-oriented DBs.
If I start with a doc design that has one doc per day, and that doc has info like where each dollar was spent, and the amount, am I necessarily starting off in the wrong direction?
I eventually want to slice and dice all of the data like how much was spent at Target between two dates, how much was spent in restaurants for a month, stuff like that.
My question is if I start by having a design that is day oriented, will I get into any trouble right away?
I think that would be just fine. You can make the _id anything you want, but consider making it milliseconds since the Epoch. That might make range queries easier to work with. You can also embed the string version of the date in each document so you don't always have to parse the _id field.
I don't think you'll get into trouble with that design, but prepare to learn a lot when it comes to writing queries in Mongo. Try to stay within their recommendations for writing queries or things can get very slow.
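To make that suggestion concrete, here is a small, hedged sketch using PyMongo and a local MongoDB instance; the database and collection names and the expense fields are made up for illustration. It stores one document per day with `_id` as milliseconds since the Epoch, embeds the string date, and runs a range query between two dates.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod on the default port
days = client.expenses.days  # hypothetical db/collection names

def day_to_ms(year, month, day):
    """Milliseconds since the Epoch for midnight UTC of a given day."""
    dt = datetime(year, month, day, tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

# One document per day; _id is epoch milliseconds, with a readable
# string copy of the date so we don't always have to parse _id.
days.insert_one({
    "_id": day_to_ms(2011, 5, 3),
    "date": "2011-05-03",
    "expenses": [
        {"vendor": "Target", "category": "household", "amount": 42.10},
        {"vendor": "Chipotle", "category": "restaurants", "amount": 11.50},
    ],
})

# Range query: everything spent between two dates, straight off _id.
start, end = day_to_ms(2011, 5, 1), day_to_ms(2011, 6, 1)
for doc in days.find({"_id": {"$gte": start, "$lt": end}}):
    print(doc["date"], sum(e["amount"] for e in doc["expenses"]))
```

Because `_id` is always indexed, the date-range slicing you describe comes for free with this design; per-vendor or per-category totals will need queries over the embedded `expenses` array (or the aggregation framework) later on.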

Batch Geocoding with Garmin Mapsource

I lost track of this effort years ago, but I now need to geocode thousands of addresses nightly. I must use the very accurate database sitting on the machine, installed when the Nuvi map update installed Mapsource.
When I contacted Garmin years ago, they expressed an interest in providing an API for this, but then I heard nothing and did not follow up. Their database is provided by Navteq, I believe. Anyone have experience with that format?
I posted on the Garmin Developer forum a while ago, but it's a little lethargic over there :)
Has anyone done this?
Does anyone know how it might be done without an API; meaning database structure and calls?
I'll take a solution in any language.
Added:
Garmin has expressed an interest in making this available to me. They just have not done it.
I do not know the database format.
I am NOT looking for an online solution or any other "alternative". This question is very specific.
Talk to Navteq directly. They will sell or license you their database directly. The database tables are clearly documented; write your own geocoder on top of them. It took me about a week, 4 years ago, and I was only marginally proficient in SQL at the time.
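For a rough idea of what "write your own geocoder on top" can look like, here is a hedged sketch of classic address-range interpolation in Python. The `street_segments` table and `streets.db` file are hypothetical; the real Navteq schema differs and is only documented in the license materials, so treat this as the shape of the approach, not the actual format.

```python
import sqlite3

# Hypothetical, simplified table standing in for a licensed street DB:
# street_segments(street TEXT, from_house INT, to_house INT,
#                 from_lat REAL, from_lon REAL, to_lat REAL, to_lon REAL)
conn = sqlite3.connect("streets.db")

def geocode(house_number, street):
    """Linear interpolation along the matching street segment."""
    row = conn.execute(
        "SELECT from_house, to_house, from_lat, from_lon, to_lat, to_lon "
        "FROM street_segments "
        "WHERE street = ? AND ? BETWEEN from_house AND to_house",
        (street.upper(), house_number),
    ).fetchone()
    if row is None:
        return None  # no segment covers this address range
    lo, hi, lat0, lon0, lat1, lon1 = row
    # Position the house proportionally along the segment.
    t = 0.0 if hi == lo else (house_number - lo) / (hi - lo)
    return (lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0))

print(geocode(1600, "PENNSYLVANIA AVE NW"))
```

Real geocoders add address normalization, side-of-street handling, and fuzzy street-name matching on top of this core lookup; the interpolation itself is the easy part.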
You can geocode up to 10,000 addresses per day by city with NN4D after you get their free application key.
You can geocode for $18 per 1,000 addresses with CoreLogic (aka Proxix).
Yahoo looked most promising because it has the Hadoop feature, which is also currently being used at Navteq. I've contacted a guy at Navteq who uses Hadoop, and I'm awaiting his feedback. According to Ben Lorica's O'Reilly.com article on Datameer, entitled "Big Data Tool for Business Analysts", Datameer can upload from spreadsheets to Hadoop. Hadoop is a pipeline to Navteq.
Starting point - a list of the tools at the GIS Dept at USC
(I can only have one link because I'm new, but I'll add the rest when I get my points up.)
Navteq uses Oracle format.
But hold on a second: doing 1,000 lookups per night is easy; doing 10,000 lookups per night requires a good server; doing 1,000,000 lookups per night requires a cluster.
Letting the clients do the searches themselves requires less hardware (and more traffic); using XML-RPC or a similar RPC mechanism would be best for everyone.
Buy an Oracle DB and start working. You can use almost anything, but keeping the volume in mind, you should use a compiled language like C++.
gpsbabel.org has lots of material on converting between many GPS formats, plus a downloadable tool. In my limited experience, mostly with Google Maps, Street View, etc., geocoding is not very accurate.
The free IBM DB2 Express-C DBMS comes with a Spatial Extender that can be used to geocode US addresses; there is a webinar on this. I don't know if it is an exact fit, but it can't hurt to take a look.
Also take a quick look at the DB2 documentation: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.spatial.topics.doc/doc/csbp3008.html

What is data mining from a developer's perspective?

I can find the technical explanation of what data mining is in a book or on Wikipedia, but I'm wondering what sort of development it actually involves. Is it more about using tools or more about writing tools? Is it really that different from other domains when it comes to R&D?
Data Mining is the process of discovering interesting patterns in large amounts of data. It is not querying data, which is just what user Treb describes (sorry Treb).
To understand DM from a developer's perspective, you should read the book Programming Collective Intelligence by Toby Segaran.
In my experience (I'm a former data miner :-)), it's a mixture of using tools and writing tools. A lot of the time, the tools you need to analyse the particular data set don't exist, so you have to write them yourself first. It can be very interesting but you often need quite a different approach to the sort of programming I do now (embedded wireless), for example.
You really ought to change the accepted answer on this question so it doesn't mislead those who come across it.
Saying that querying a database IS data mining because "[h]ow would you discover any pattern in your data without querying first?" is like saying opening your car door is driving because "how else would you be able to drive somewhere without opening the car door first."
You can read your data out of a text file if you want. My first data mining assignment used data sets from the UCI repository and those are almost all text files.
If you want to learn about data mining, start by looking up clustering and classification. Learn about decision trees and rule-based classification. Then look at k-nearest-neighbor and k-means. After that, if you really want to see what data mining is all about, look at Chameleon, DBSCAN, and Support Vector Machines. You don't need to learn the minutiae of the last three (they're pretty complex and math-heavy), but understanding the abstract idea of what happens will tell you all you need to know in order to use the many tools and libraries that are available for each strategy.
These are only the algorithms that popped into my head just now. There are so many others that I don't recall or don't even know yet.
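To make two of those algorithms concrete, here is a small, hedged sketch using scikit-learn (an assumption on my part; the thread doesn't name a library): k-means for unsupervised clustering and k-nearest-neighbor for supervised classification, on toy customer data invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Toy data: (age, monthly_spend) for a handful of customers.
X = np.array([[22, 150], [25, 180], [47, 900],
              [52, 950], [33, 400], [36, 430]])

# k-means: unsupervised, discovers 3 groups with no labels given.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster per customer:", clusters)

# k-nearest-neighbor: supervised, learns from labeled examples.
labels = ["low", "low", "high", "high", "mid", "mid"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print("prediction for a new customer:", knn.predict([[30, 300]]))
```

The point of the exercise is the contrast: clustering finds structure you didn't know about, while classification reproduces judgments you already have labels for. Most of the algorithms listed above fall on one side or the other of that line.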
Data mining is about searching large quantities of data for hidden patterns. Web 2.0 example: News Corp uses its site myspace.com as a large data mine to determine what movies and products to promote. They write software to identify trends in the data that its users post to the site. News Corp does this to gather information useful for advertising campaigns and market predictions. It's different from other domains of R&D in that, from the data giver's perspective, it's passive. Rather than going out on the street and asking people in person what movies they are likely to see this summer, data mining tools sort these things out by analyzing data given voluntarily by users.
Wikipedia actually does have a pretty good article on it:
- http://en.wikipedia.org/wiki/Data_mining
Data mining, as I'd put it, is finding patterns or trends in given data. A developer's perspective might be applications like anti-money-laundering, where, given a pattern, you search the data for that pattern. Another use is in projection software, where you project a future result or outcome against a heuristic by recognizing the current trend in the data.
I think it's more about using off-the-shelf tools than developing your own. An academic example of that kind of tool is WEKA. Of course, you still have to know which algorithms to use, how to preprocess the data (a very important part), etc.
As for R&D, I don't have much of an idea, but it should be like almost everything else: maths, statistics, more maths...
On the development level, data mining is just another database application, but with a huge amount of data.
The mining itself is done by running specific queries on the database. The important work is in the creation of the queries, which of course depend on the data model and on the hypotheses: what sort of trends the customer expects to find.
Therefore, the fine-tuning of the queries usually can't be done during development, but only once the system is live and you have live data. Then the user can test his hypotheses and adapt the queries to show him the trends he is looking for (a sketch of such a parameterized query follows below).
So from a dev point of view, data mining is about:
- Managing large sets of data in your client (one query may return 100,000 rows of data)
- Providing the user (who may know nothing about SQL or relational databases in general) with an effective way to modify his queries and view the results.
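As an illustration of that division of labor, here is a hedged sketch in Python against a hypothetical `sales` table in a file named `warehouse.db` (both invented for the example): the SQL stays fixed, while the user-tunable hypothesis lives entirely in the parameters.

```python
import sqlite3

# Hypothetical schema, for illustration only:
# sales(region TEXT, product TEXT, sold_on TEXT, amount REAL)
conn = sqlite3.connect("warehouse.db")

def trend(region, start, end, min_total=0):
    """A parameterized 'hypothesis' query: which products sold best
    in a region over a period? The user adjusts the parameters, not
    the SQL, which keeps tuning cheap once the system is live."""
    return conn.execute(
        "SELECT product, SUM(amount) AS total "
        "FROM sales "
        "WHERE region = ? AND sold_on BETWEEN ? AND ? "
        "GROUP BY product HAVING total >= ? "
        "ORDER BY total DESC LIMIT 100",
        (region, start, end, min_total),
    ).fetchall()

# The user tests a hypothesis by changing region, dates, or threshold.
for product, total in trend("Midwest", "2009-01-01", "2009-03-31"):
    print(product, total)
```

A front end that exposes those parameters as form fields gives the non-SQL user exactly the "modify his queries and view the results" workflow described above.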

Data mining and Business Intelligence Technologies

I've noticed an increasing number of jobs that are asking for experience with data mining and business intelligence technologies. This sounds like an incredibly broad topic but where would one go if they wanted to develop at least a partial understanding of this stuff if it were to come up in an interview?
A very good book with practical examples is the
Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran.
Go read Data Mining: Practical Machine Learning Tools and Techniques (Second Edition).
Then use Weka on a pet project. Despite the name, this is a good book, and the Weka package has several levels of entry into the data-m... er machine-learning world.
Consider reading Ralph Kimball's books for an introduction to Business Intelligence.
Also, try not to stick to one technology vendor; every company has its own biased vision of BI, and you'll need a 360-degree overview.
Maybe you can also try working with a real BI product. It is almost impossible to get access to a data-filled, running SAS, Microsoft, or Oracle installation, but I work on a team that integrates the BI product BellaDati for enterprises. For try-out and personal purposes it is free, with some datastore limitations ( http://www.trgiman.eu/en/belladati/product/personal ).
BellaDati is also used as a learning tool at technical universities focused on the practical application of data mining and analysis. Examples of final manager-level dashboards in BellaDati can be seen at http://mercato.belladati.com/bi/mercato/show/worldexchanges
You can work with SQL data sources, flat files, and web services, and just play. From my own experience, showing real samples of market-analysis practice (like a case study) is good for an interview.
I wish you luck,
Peter