Discrimination prevention in data mining - data-mining

I am trying to implement a system to prevent discrimination in data mining using the preprocessing technique, using the Adult Dataset. So far, I have preprocessed the data and extracted the frequent classification rules. But am unable to go forward with calculating the elift, elb values for the rules, which is a major step. My work is based on the paper "Methodology for direct and indirect discrimination prevention in data mining". So if anyone could help me out with this, it would be really great.

Related

Can control and data flow analysis be applied to other solution document?

I am doing control flow and data flow analysis to my examle code. I am using instant analyzer (Analyzer with code fix - Nuget,VSIX)
My aim is analyze not only the code developer coding, also want to follow the data to the other classes.
Can i learn your thoughts about that ?
Thanks

Medical Machine Learning Data Set

I'm researching Medical Data set which includes variable concerning illnesses and treatment type.
For example illnesses is colon cancer, it's decision variables (x,y,z,t) and treatment type is chemothreapy, radiothreaphy etc etc.
I want to reach such a data set for my KDD and exploratory lesson. Because I want to make useful project prototype.
if you know any data set web site pls share me (so-called site may not include medical)..
There is a standard machine learning data set repository at UC Irvine. R users can access it via the mlbench package from the CRAN network.
Try the UCI Machine Learning Repository. 189 sample ML datasets, many medical in nature.
Because the site is focussed on ML, it gives a lot of guidance that will help choose and tune your ML algorithms for good generalization performance.
Not sure this is an actual 'programming' question, strictly speaking. However, given that programs work on data, I'll go with it - and observe that the term 'medical dataset' returns quite a few (1.7m) hits in Google.

Data mining/BI/Analytics/ML : Can a mathematically challenged person move into this field?

I have recently become interested in the field(s) of data mining and machine learning. The idea of going through huge datasets and trying to correlate hidden patterns and trends is fascinating. So far I have done the following
Used Weka to load simple data sets and generate decision trees
Continously read books, wiki's, blogs and SO on the same
Started playing around SQL Server DM and Python API's
Have an idea on options of freely available data sets on the web(freedb, UN etc)
What is hindering me is the minute I try to go beyond classification/associsciation and into priori/apriori algorithms I am stuck because understanding mathematical equations and logic is not(to put it modestly) one of my strong points.
So my question would be are there anybody in the Data mining field(in the role of product owner or builder) who are not naturally mathematicians? If so, how would you approach in undestanding the field since free tools like Weka and Rapid-miner both expects some mathematical/statistical background?
P.S: Excuse me if I made some mistake in the query like mixing Data mining and analytics when they are separate as I am still getting my feet wet. I hope my core question is clear.
Well, being able to do some analysis of what the data mining models are showing is absolutely vital. However, these days all of the math and statistics are taken care of by the data mining models. You don't need to understand the math behind them (although it helps).
For example, you can look through the SQL Server Analysis Services Data Mining Algorithms and see that even the technical reference is how to use these implementations, not how to recreate them.
If you can understand the business cases and you can understand what the data mining is telling you, there's really no need to delve into the math behind it.
As for some of the free tools, I've never used them, so I can't speak to them. However, I'm a big fan of SSAS and those data mining models, which don't require an extensive mathematical background.
As Eric says, and as far as you only intend to use the existing algorithms and APIs and make sense from them, I don't see problems with the required math/statistics skill set (anyway, you'll need some previous basic knowledge/level).
Now, if you intend to do research or if you want to improve or modify existing algorithms, or why not, create your own algorithms, then math and statistics is a MUST. I just started doing some research in this area, and I'm still trying to fill my skills gap =)

What is data mining from a developer's perspective?

I can find the technical explanation of what data mining is in a book or on Wikipedia, but I'm wondering what sort of development does it exactly involve? Is it more about using tools or more about writing tools? Is it really any much different from other domains when it comes to R&D?
Data Mining is the process of discovering interesting patterns in large amounts of data. It is not querying data, which is just what user Treb describes (sorry Treb).
To understand DM from a developer's perspective, you should read the book Programming Collective Intelligence by Toby Segaran.
In my experience (I'm a former data miner :-)), it's a mixture of using tools and writing tools. A lot of the time, the tools you need to analyse the particular data set don't exist, so you have to write them yourself first. It can be very interesting but you often need quite a different approach to the sort of programming I do now (embedded wireless), for example.
You really ought to change the accepted answer on this question so it doesn't mislead those who come across it.
Saying that querying a database IS data mining because "[h]ow would you discover any pattern in your data without querying first?" is like saying opening your car door is driving because "how else would you be able to drive somewhere without opening the car door first."
You can read your data out of a text file if you want. My first data mining assignment used data sets from the UCI repository and those are almost all text files.
If you want to learn about data mining start by looking up clustering and classification. Learn about decision trees and rule based classification. Then look at k-nearest-neighbor and k-means. After that if you really want to see what data mining is all about look at Chameleon, DBScan, and Support Vector Machines. Don't necessarily learn the minutiae of the last three (they're pretty complex and math heavy) but understanding the abstract idea of what happens will tell you all you need to know in order to use the many tools and libraries that are available for each strategy.
These are only the algorithms that popped into my head just now. There are so many others that I don't recall or don't even know yet.
Data mining is about searching large quantities of data for hidden patterns. Web 2.0 example: News corp uses its site myspace.com as a large data mine to determine what movies and products to promote. They write software to identify trends in the data that it's users post to the site. News corp does this to gather information useful for advertising campaigns and market predictions. It's different from other domains of R&D in that from a data givers perspective its passive. Rather than going out on the street and asking people in person what movies they are likely to see this summer and other such questions, the data mining tools sort out these things by analyzing data given by users voluntarily.
Wikipedia actually does have a pretty good article on it:
- http://en.wikipedia.org/wiki/Data_mining
Data Mining as I say is finding patterns or trends from given data. A developer perspective might be in applications like Anti Money Laundring... Where given a pattern you will search data for that given pattern. One other use is in Projection Softwares... where you project a result or outcome in future against a heuristic by studying recognizing the current trend from data.
I think it's more about using off the shelf tools rather than developing your own. An academic example of that kind of tools might be WEKA. Of course, you still have to know what algorithms use, how to preprocess data (very important this part), etc.
In R&D I don't have much idea, but it should be like almost everything: maths, statistics, more maths...
On the development level, data mining is just another database application, but with a huge amount of data.
The mining itself is done by running specific queries on the database. It's in the creation of the queries where the important work is done. They of course depend on the data model, and on the hypotheses, what sort of trends the customer expects to find.
Therefore, the fine tuning of the queries usually can't be done in development, but only once the system is live and you have live data. Then the user can test his hypotheses and adapt the queries to show him the trends he is looking for.
So from a dev point of view, data maining is about
Managing large sets of data in your client (one query may return 100.000 rows of data)
Providing the user (who may know nothing about SQL or relational databases in general) with an effective way to modify his queries and view the results.

What is the difference between a data flow diagram and a flow chart?

I want to know why we use Data Flow Diagrams instead of flow charts.
A flow chart details the processes to follow. A DFD details the flow of data through a system.
In a flow chart, the arrows represent transfer of control (not data) between elements and the elements are instructions or decision (or I/O, etc).
In a DFD, the arrows are actually data transfer between the elements, which are themselves parts of a system.
Wikipedia has a good article on DFDs here.
You should use whatever you like. The diagram is just a tool. Use whatever tool fits you and your problem best. I usually just use boxes and arrows and squiggles and circles and little stick figures and whatever else I think gets the point across to the viewer. In short it doesn't matter if you even use a standard diagraming standard. People are usually pretty good at understanding pictures.
Data flow diagram shows the flow of data between the different entities and datastores in a system while a flow chart shows the steps involved to carried out a task. In a sense, data flow diagram provides a very high level view of the system, while a flow chart is a lower level view (basically showing the algorithm).
Whether you use data flow diagram or flow charts depends on figuring out what is it that you are trying to show.
The difference between a data flow diagram (DFD) and a flow chart (FC) are that a data flow diagram typically describes the data flow within a system and the flow chart usually describes the detailed logic of a business process.
Data Flow and Flow Chart differ in processes, flow, and timing.
Processes
a.) On DFDs, processes can operate in parallel (at-the-same-time).
b.) On flowcharts, processes execute one at a time.
Flow
a.) DFDs show the flow of data through a system
b.) Flowcharts show the flow of control (sequence and transfer of control)
Timing
a.) Processes on a DFD can have dramatically different timing (daily, weekly, on demand)
b.)
Processes on flowcharts are part of a single program with consistent timing
A DFD shows how the data moves through a system, a flowchart is closer to the operations that system does.
In the classic make a cup of tea example, a DFD would show where the water, tea, milk, sugar were going, whereas the flowchart shows the process.
Other answers have gone over the basics of what each thing is. At the higher level, a flowchart is a design level tool, while DFDs are more analysis.
DFDs have some nice features. Since they show the flow of data, some things become more obvious when charted this way: some data is only used by a few routines, some routines use only some bits of data, some routines touch everything. Seeing that up front helps organize, restructuring, and planning.
A follow-on worth exploring is the Event-Response Diagram, which is basically a DFD only showing process and data needed to process an "event", meaning something triggered externally (customer makes payment, etc.).
A Data Flow Diagram is functional relationship which includes input values and output values
and internal data stored.
A Flow Chart is a process relationship which includes input and output values.
Flow chart describes the program (see old fortran flow charts - surely, there are some floating around on google).
Data flow diagram determines the flow of data, for example, between subroutines, or between different programs.
Although my experience with DFD diagrams is limited I can tell you that a DFD shows you how the data moves (flows) between the various modules. Furthermore a DFD can be partitioned in levels, that is in the Initial Level you see the system (say, a System to Rent a Movie) as a whole (called the Context Level). That level could be broken down into another Level that contains activities (say, rent a movie, return a movie) and how the data flows into those activities (could be a name, number of days, whatever). Now you can make a sublevel for each activity detailing the many tasks or scenarios of those activities. And so on, so forth. Remember that the data is always passing between levels.
Now as for the flowchart just remember that a flowchart describes an algorithm!
have a look to this site
http://yourdon.com/strucanalysis/wiki/index.php?title=Chapter_9#The_Flow
its really help u to understand what is DFD
Between the above answers its been explained but I will try to expand slightly...
The point about the cup of tea is a good one. A flow chart is concerned with the physical aspects of a task and as such is used to represent something as it is currently. This is useful in developing understanding about a situation/communication/training etc etc..You will likley have come across these in your work places, certainly if they have adopted the ISO9000 standards.
A data flow diagram is concerned with the logical aspects of an activity so again the cup of tea analogy is a good one. If you use a data flow diagram in conjunction with a process flow your data flow would only be concerned with the flow of data/information regarding a process, to the exclusion of the physical aspects. If you wonder why that would be useful then its because data flow diagrams allow us to move from the 'as it is' situation and see it that something as it could/will be. These two modelling approaches are common in structured analysis and design and typically used by systems/business analysts as part of business process improvement/re-engineering.
Data flow diagram: A modeling notation that represents a functional decomposition of a system.
Flow chart: Step by step flow of a programe.
Data Flow Diagrams
The formal, structured analysis approach employs the data-flow diagram (DFD) to assist in the functional decomposition process. I learned structured analysis techniques from DeMarco [7], and those techniques are representative of present conventions. To summarize, DFD's are comprised of four components: