Medical Machine Learning Data Set - data-mining

I'm researching Medical Data set which includes variable concerning illnesses and treatment type.
For example illnesses is colon cancer, it's decision variables (x,y,z,t) and treatment type is chemothreapy, radiothreaphy etc etc.
I want to reach such a data set for my KDD and exploratory lesson. Because I want to make useful project prototype.
if you know any data set web site pls share me (so-called site may not include medical)..

There is a standard machine learning data set repository at UC Irvine. R users can access it via the mlbench package from the CRAN network.

Try the UCI Machine Learning Repository. 189 sample ML datasets, many medical in nature.
Because the site is focussed on ML, it gives a lot of guidance that will help choose and tune your ML algorithms for good generalization performance.

Not sure this is an actual 'programming' question, strictly speaking. However, given that programs work on data, I'll go with it - and observe that the term 'medical dataset' returns quite a few (1.7m) hits in Google.

Related

Desktop SCADA Application - Reading and Writing to PLCs through C++

I did my best to search all topics regarding to SCADA and developing your own C++ desktop application to communicate with PLCs, but could not find any recent, or in my opinion, relevant topics that fit what I needed. If I missed them, a link to them would be very much appreciated. If I also happened to post this in the wrong section, or you can think of a better section for me to post this in, I will take it there.
With that said, I thank you in advance for taking the time to read my questions, and appreciate any input you have to offer.
A little bit about what I'm doing
I'm currently in school for electromechanical engineering, and for my final year project I am developing a desktop application in C++ to monitor PLCs we have located within one of our labs.
Within this lab, I have a pre-existing ethernet network connecting all PLCs to single point, which I am tying into with a PC, and will be doing all my work from there.
I will be developing the application in Qt for an easy way to design the GUI, and giving me access to the QNetworkInterface as well as QTcpSocket.
With that said, I wouldn't go as far as saying I'm an experienced programmer, but I have been fooling around with a few languages (i.e.: python, c++, c, php) for quite a few years, and am still learning, considering the learning NEVER stops.
My questions
Is there any reference material I can read, that you can suggest, on the subject to more easily understand what sort of process I need to go through to receive information (i.e.: individual I/Os, status bits, tags, logs, etc...) from the PLCs directly, and not through an OPC server?
If an OPC server is required, I've never dealt with OPC links other than using Rockwell Automations RSLinx to grab tags and display their values within excel (I had created a prototype using that exact method to start, but would like to move away from excel, and if possible, the OPC server (RSLinx) as well). What would you suggest to someone who knows nothing about the subject of OPC servers, or to my knowledge, OPC in general?
Have any of you previously written your own application to do something similar, if not of the same nature to what I'm trying to accomplish?
What advice or suggestions would you give for someone who is attempting this type of project?
PS: As a start for this project, I would initially just want to get the reading of the I/Os (tags or addresses) to view what their current values are (closed or open for inputs, energized or not for outputs). But eventually I would also like to be able to write values to tags on the PLCs I'm monitoring based on the values I've received from them.
PSS: I would like to note again, that I am still a student, and am still learning about this subject in it's entirety. I would just like to ask for your patience, as I may not grasp things completely the first time!
If I've missed any information you feel is pertinent to be able to provide an answer, please let me know! I will do my best to come up with said information in a timely manner!
Thank you!
EDIT #1: Added in another question, and altered my first question slightly
EDIT #2: Fixed up question 2
IMHO a SCADA program should have as a minimum requirement to be able to connect to an OPC server. OPC is used for most commercial PLCs.
Strictly speaking there is no need to have an OPC server/client approach but it gives you flexibility and gives you an abstraction model. If you want to directly connect to PLCs using a protocol then that is of course possible as well. You then need to know more details about the protocols and their various versions.
Yes I worked for a few years in a team that developed a commercial SCADA application.
It is very easy to get lost in details in such a project so try to keep things as simple as possible. By using OPC you will save time instead of fiddling directly with the protocols. You could add the ability to add custom-drivers for other protocols - depending on your timeframe. Try to model up your project before you start coding to a birdsview of the model and avoid getting lost in the details.
I would stay well away from looking to write your own code to connect directly
to an AB PLC - there are products out there that you can use in your application:
http://www.rtaautomation.com/software/ethernetip/client/tagc/ControlWin.html
http://www.automatedsolutions.com/products/dotnet/ascomm/
You would be better to use OPC - you can write you own OPC client if you want and follow examples you find here:
http://www.opcconnect.com/source.php#freesource
According to this http://www.control.com/thread/1026173407 you should be able to get source code of Kepwares OPC Quick Client.
It would probably be easier to just use a library as in this example (RSLogix,C#):
http://www.mesta-automation.com/opc-client-with-c-an-how-to-video/
You might find this of use:
http://www.rockwellautomation.co.kr/applications/gs/ap/GSKR.nsf/files/rslinxsdk_ma_eng.pdf/$file/rslinxsdk_ma_eng.pdf
Some resources:
http://www.opcconnect.com/ ,
http://www.mesta-automation.com/
Answer to question #4 - realize that your lab technically could contain ANY manufacturer's PLCs in the future. If you ever took a Data Communications class, you realize that for N different PLC types, you would have to write N different communication drivers for your PLC client.
This is where standards are helpful. Without the use of a standard protocol, scaling your lab could become more time consuming and less manageable. This is why communications standards exist.
HOWEVER, not all PLCs necessarily support the standard(s) you may decide upon.
The best choice is OPC/UA. Many PLCs have server drivers readily available. That means that your client just needs to understand 1 protocol (OPC/UA), and then it can "easily" be connected to any PLC that has a driver for that standard.
After that, there is OPC. After that, Modbus (TCP and RTU flavors), a relatively simple industry standard that is supported by most PLCs. EtherNet/IP is also a possible choice, although not all PLCs support it in a "server" role (many do support it as a client, but that is not what you need).
have a look at pycomm in github or pylogix at github which are Python written drivers to link to clx plc.

Starting with Data Mining

I have started learning Data Mining and wish to create a small project in C++/Java that allows me to utilize a database, say from twitter and then publish a particular set of results (for eg. all the news items on a feed). I want to know how to go about it? Where should I start?
This is a really broad question, so it's hard to answer. Here are some things to consider:
Where are you going to get the data? You mention twitter, but you'll still need to collect the data in some way. There are probably libraries out there for listening to twitter streams, or you could probably buy the data if someone is selling it.
Where are you going to store the data? Depending on how much you'll have and what you plan to do with it, a traditional relational database may or may not be the best fit. You may be better off with something that supports running mapreduce jobs out-of-the box.
Based on the answers to those questions, the choice of programming languages and libraries will be easier to make.
If you're really set on Java, then I think a Hadoop cluster is probably what you want to start out with. It supports writing mapreduce jobs in Java, and works as an effective platform for other systems such as HBase, a column-oriented datastore.
If your data are going to be fairly regular (that is, not much variation in structure from one record to the next), maybe Hive would be a better fit. With Hive, you can write SQL-like queries, given only data files as input. I've never used Mahout, but I understand that its machine learning capabilities are suited for data mining tasks.
These are just some ideas that come to mind. There are lots of options out there and choosing between them has as much to do with the particular problem you're trying to solve and your own personal tastes as anything else.
If you just want to start learning about Data Mining there are two books that I particularly really enjoy:
Pattern Recognition and Machine Learning. Christopher M. Bishop. Springer.
And this one, which is for free:
http://infolab.stanford.edu/~ullman/mmds.html
Good references for you are
AI course taught by people who actually know the subject,Weka website, Machine Learning datasets, Even more datasets, Framework for supporting the mining of larger datasets.
The first link is a good introduction on AI taught by Peter Norvig and Sebastian Thrun, Google's Research Director, and Stanley's creator (the autonomous car), respectively.
The second link you get you to Weka website. Download the software - which is pretty intuitive - and get the book. Make sure you understand all the concepts: what's data mining, what's machine learning, what are the most common tasks, and what are the rationales behind them. Play a lot with the examples - the software package bundles some datasets - until you understand what generated the results.
Next, go to real datasets and play with them. When tackling massive datasets, you may face several performance issues with Weka - which is more of a learning tool as far as my experience can tell. Thus I recommend you to take a look at the fifth link, which will get you to Apache Mahout website.
It's far from being a simple topic, however, it's quite interesting.
I can tell you how I did it.
1) I got the data using twitter4j.
2) I analyzed the data using JUNG.
You have to define a class representing edges and a class representing vertices.
These classes will contain the attributes of the edges and vertices.
3) Then, there is a simple function to add an edge g.addedge(V1,V2,edgeFromV1ToV2) or to add a vertex g.addVertex(V).
The class that defines edges or vertices is easy to create. As an example :
`public class MyEdge {
int Id;
}`
The same is done for vertices.
Today I would do it with R, but if you don't want to learn a new programming language, just import jung which is a java library.
Data mining is broad fields with many different techniques; classification, clustering, association and pattern mining, outlier detection, etc.
You should first decide what you want to do and then decide wich algorithm you need.
If you are new to data mining, I would recommend to read some books like Introduction to Data Mining by Tan, Steinbach and Kumar.
I would like to suggest you to use python or R for data mining process. Doing work with java or c , it bit difficult in the sense you need to do a lot coding

Data mining/BI/Analytics/ML : Can a mathematically challenged person move into this field?

I have recently become interested in the field(s) of data mining and machine learning. The idea of going through huge datasets and trying to correlate hidden patterns and trends is fascinating. So far I have done the following
Used Weka to load simple data sets and generate decision trees
Continously read books, wiki's, blogs and SO on the same
Started playing around SQL Server DM and Python API's
Have an idea on options of freely available data sets on the web(freedb, UN etc)
What is hindering me is the minute I try to go beyond classification/associsciation and into priori/apriori algorithms I am stuck because understanding mathematical equations and logic is not(to put it modestly) one of my strong points.
So my question would be are there anybody in the Data mining field(in the role of product owner or builder) who are not naturally mathematicians? If so, how would you approach in undestanding the field since free tools like Weka and Rapid-miner both expects some mathematical/statistical background?
P.S: Excuse me if I made some mistake in the query like mixing Data mining and analytics when they are separate as I am still getting my feet wet. I hope my core question is clear.
Well, being able to do some analysis of what the data mining models are showing is absolutely vital. However, these days all of the math and statistics are taken care of by the data mining models. You don't need to understand the math behind them (although it helps).
For example, you can look through the SQL Server Analysis Services Data Mining Algorithms and see that even the technical reference is how to use these implementations, not how to recreate them.
If you can understand the business cases and you can understand what the data mining is telling you, there's really no need to delve into the math behind it.
As for some of the free tools, I've never used them, so I can't speak to them. However, I'm a big fan of SSAS and those data mining models, which don't require an extensive mathematical background.
As Eric says, and as far as you only intend to use the existing algorithms and APIs and make sense from them, I don't see problems with the required math/statistics skill set (anyway, you'll need some previous basic knowledge/level).
Now, if you intend to do research or if you want to improve or modify existing algorithms, or why not, create your own algorithms, then math and statistics is a MUST. I just started doing some research in this area, and I'm still trying to fill my skills gap =)

What is data mining from a developer's perspective?

I can find the technical explanation of what data mining is in a book or on Wikipedia, but I'm wondering what sort of development does it exactly involve? Is it more about using tools or more about writing tools? Is it really any much different from other domains when it comes to R&D?
Data Mining is the process of discovering interesting patterns in large amounts of data. It is not querying data, which is just what user Treb describes (sorry Treb).
To understand DM from a developer's perspective, you should read the book Programming Collective Intelligence by Toby Segaran.
In my experience (I'm a former data miner :-)), it's a mixture of using tools and writing tools. A lot of the time, the tools you need to analyse the particular data set don't exist, so you have to write them yourself first. It can be very interesting but you often need quite a different approach to the sort of programming I do now (embedded wireless), for example.
You really ought to change the accepted answer on this question so it doesn't mislead those who come across it.
Saying that querying a database IS data mining because "[h]ow would you discover any pattern in your data without querying first?" is like saying opening your car door is driving because "how else would you be able to drive somewhere without opening the car door first."
You can read your data out of a text file if you want. My first data mining assignment used data sets from the UCI repository and those are almost all text files.
If you want to learn about data mining start by looking up clustering and classification. Learn about decision trees and rule based classification. Then look at k-nearest-neighbor and k-means. After that if you really want to see what data mining is all about look at Chameleon, DBScan, and Support Vector Machines. Don't necessarily learn the minutiae of the last three (they're pretty complex and math heavy) but understanding the abstract idea of what happens will tell you all you need to know in order to use the many tools and libraries that are available for each strategy.
These are only the algorithms that popped into my head just now. There are so many others that I don't recall or don't even know yet.
Data mining is about searching large quantities of data for hidden patterns. Web 2.0 example: News corp uses its site myspace.com as a large data mine to determine what movies and products to promote. They write software to identify trends in the data that it's users post to the site. News corp does this to gather information useful for advertising campaigns and market predictions. It's different from other domains of R&D in that from a data givers perspective its passive. Rather than going out on the street and asking people in person what movies they are likely to see this summer and other such questions, the data mining tools sort out these things by analyzing data given by users voluntarily.
Wikipedia actually does have a pretty good article on it:
- http://en.wikipedia.org/wiki/Data_mining
Data Mining as I say is finding patterns or trends from given data. A developer perspective might be in applications like Anti Money Laundring... Where given a pattern you will search data for that given pattern. One other use is in Projection Softwares... where you project a result or outcome in future against a heuristic by studying recognizing the current trend from data.
I think it's more about using off the shelf tools rather than developing your own. An academic example of that kind of tools might be WEKA. Of course, you still have to know what algorithms use, how to preprocess data (very important this part), etc.
In R&D I don't have much idea, but it should be like almost everything: maths, statistics, more maths...
On the development level, data mining is just another database application, but with a huge amount of data.
The mining itself is done by running specific queries on the database. It's in the creation of the queries where the important work is done. They of course depend on the data model, and on the hypotheses, what sort of trends the customer expects to find.
Therefore, the fine tuning of the queries usually can't be done in development, but only once the system is live and you have live data. Then the user can test his hypotheses and adapt the queries to show him the trends he is looking for.
So from a dev point of view, data maining is about
Managing large sets of data in your client (one query may return 100.000 rows of data)
Providing the user (who may know nothing about SQL or relational databases in general) with an effective way to modify his queries and view the results.

Data mining and Business Intelligence Technologies

I've noticed an increasing number of jobs that are asking for experience with data mining and business intelligence technologies. This sounds like an incredibly broad topic but where would one go if they wanted to develop at least a partial understanding of this stuff if it were to come up in an interview?
A very good book with practical examples is the
Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran.
Go read Data Mining: Practical Machine Learning Tools and Techniques (Second Edition).
Then use Weka on a pet project. Despite the name, this is a good book, and the Weka package has several levels of entry into the data-m... er machine-learning world.
Consider reading Ralph Kimball's books for an introduction to Business Intelligence.
Also, try to not stick to one technolgy-vendor, every company has its own biased vision of BI, you'll need a 360 overview.
Maybe you can also try to work with real BI - it is almost impossible to get in contact with data-filled and running SAS, MS, Oracle etc. I work in a team which integrates BI BellaDati for enterprises. For try-out and personal purposes it is free with some datastore limitations ( http://www.trgiman.eu/en/belladati/product/personal ).
BellaDati is also used as a learning tool on technical universities focused on practical application of data mining and analysis. The final manager-level dashboards examples of BellaDati can be seen at http://mercato.belladati.com/bi/mercato/show/worldexchanges
You can work here with SQL datasources, flat files, web services and play. From my own experience - to show real samples of market analysis practise (like case study etc.) is good for an interview.
I wish you luck,
Peter