Building an Intrusion Detection System using fuzzy logic

I want to develop an Intrusion Detection System (IDS) that might be used with one of the KDD datasets. In the present case, my dataset has 42 attributes and more than 4,000,000 rows of data.
I am trying to build my IDS using fuzzy association rules, hence my question: What is actually considered as the best tool for fuzzy logic in this context?

Fuzzy association rule algorithms are usually extensions of standard association rule algorithms such as Apriori and FP-growth that model uncertainty using fuzzy membership degrees rather than crisp set membership. I thus assume that your data consists of rather uncertain measurements, and that you therefore want to group the measurements into more general ranges such as 'low'/'medium'/'high'. From there on you can use any standard association rule algorithm to find the rules for your IDS (I'd suggest FP-growth, as it has lower complexity than Apriori for large data sets).
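To make the fuzzification step concrete, here is a minimal sketch of mapping a numeric attribute to 'low'/'medium'/'high' membership degrees with triangular membership functions, and turning a row into an item set that FP-growth could consume. The attribute name `duration`, the breakpoints, and the 0.5 cut-off are all made-up illustrations, not values taken from the KDD data:

```python
# Sketch: fuzzify a numeric attribute into 'low'/'medium'/'high'
# membership degrees before mining association rules.

def triangular(x, a, b, c):
    """Triangular membership function peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzify(value):
    """Map a raw measurement to fuzzy set memberships (illustrative breakpoints)."""
    return {
        "low":    triangular(value, -50, 0, 50),
        "medium": triangular(value, 0, 50, 100),
        "high":   triangular(value, 50, 100, 150),
    }

def to_items(value, attribute="duration", cutoff=0.5):
    """Keep only labels whose membership exceeds the cut-off,
    so each row becomes a small transaction for FP-growth."""
    return {f"{attribute}={label}"
            for label, degree in fuzzify(value).items()
            if degree > cutoff}

print(to_items(80))  # only 'high' has membership > 0.5 here
```

From transactions like these, any off-the-shelf FP-growth implementation (e.g. the one in the mlxtend library) can mine the frequent item sets and rules.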

Related

Sentiment analysis feature extraction

I am new to NLP and feature extraction. I wish to create a machine learning model that can determine the sentiment of stock-related social media posts. For feature extraction on my dataset I have opted to use Word2Vec. My question is:
Is it important to train my Word2Vec model on a corpus of stock-related social media posts? The datasets that are available for this are not very large. Should I just use a much larger pretrained word-vector set?
The only way to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches & compare the results on a repeatable quantitative evaluation.
Having training texts that are properly representative of your domain-of-interest can be quite important. You may need your representation of the word 'interest', for example, to reflect its sense in the stock/financial world, rather than the more general sense of the word.
But quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important to evaluating new posts may be missing or of very-poor quality. In some cases taking some pretrained set-of-vectors, with its larger vocabulary & sharper (but slightly-mismatched to domain) word-senses may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
Because this iterative, comparative experimental pattern repeats endlessly as your projects & knowledge grow (it's what the experts do!), it's also important to learn & practice it. There's no authority you can ask for a certain answer to many of these tradeoff questions.
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, perhaps of multiple intensities) or a regression problem (assigning texts a value on a numerical scale). There are many simpler ways to create features for such processes that do not involve word2vec vectors – a somewhat more advanced technique, which adds complexity. (In particular, word-vectors only give you features for individual words, not texts of many words, unless you add some other choices/steps.) If new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell if they're helping or not.
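To illustrate the "extra step" mentioned above, one common (though lossy) way to turn per-word vectors into a single feature vector for a whole post is to average the vectors of the in-vocabulary words. The tiny 2-dimensional toy vectors below are invented for illustration; a real model would come from something like gensim's Word2Vec:

```python
# Sketch: average word vectors to get one feature vector per post.
import numpy as np

toy_vectors = {
    "stock": np.array([0.9, 0.1]),
    "rises": np.array([0.7, 0.3]),
    "crash": np.array([0.1, 0.9]),
}

def post_vector(tokens, vectors, dim=2):
    """Mean of the vectors of all known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:                      # no in-vocabulary words at all
        return np.zeros(dim)
    return np.mean(known, axis=0)

# 'today' is out of vocabulary, so only 'stock' and 'rises' contribute.
v = post_vector(["stock", "rises", "today"], toy_vectors)
print(v)
```

Averaging discards word order entirely, which is exactly the kind of extra design choice the answer warns about; it is one reason a simple bag-of-words baseline is worth establishing first.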

How can I analyze a nonstructured text?

I use TF-IDF to assign weights that help me construct my dictionary, but my model is not really good enough because I have unstructured text.
Any suggestions for algorithms similar to TF-IDF?
When you say your model is not good enough, do you mean that your generated dictionary is not good enough? Extracting key terms and constructing the dictionary using TF-IDF weights is actually a feature selection step.
To extract or select features for your model, you can follow other approaches like principal component analysis, latent semantic analysis, etc. Many other feature selection techniques from machine learning can be useful too!
But I truly believe that for a sentiment classification task, TF-IDF is a very good approach for constructing the dictionary. I would rather suggest tuning your model's parameters during training than blaming the feature selection approach.
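For reference, here is a minimal from-scratch TF-IDF computation, just to make the weighting concrete; the toy documents are invented, and a library such as scikit-learn's TfidfVectorizer does the same thing (with smoothing options) far more efficiently:

```python
# Sketch: TF-IDF weight of a term in one document of a small corpus.
import math

docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)             # term frequency in the document
    df = sum(1 for d in corpus if term in d)    # document frequency in the corpus
    idf = math.log(len(corpus) / df)            # rarer terms get larger idf
    return tf * idf

# 'good' appears in 2 of 3 documents, so its idf is small;
# 'plot' is rarer and ends up with a higher weight in the first document.
print(tf_idf("good", docs[0], docs))
print(tf_idf("plot", docs[0], docs))
```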
There are many deep learning techniques as well that are applicable for your target task.

Choosing the best subset of features

I want to choose the best feature subset available for distinguishing two classes, to be fed into a statistical framework that I have built, where the features are not independent.
After looking at the feature selection methods in machine learning, it seems that they fall into three different categories: filter, wrapper and embedded methods, and the filter methods can be either univariate or multivariate. It makes sense to use either filter (multivariate) or wrapper methods because both, as I understood, look for the best subset; however, as I am not using a classifier, how can I use them?
Does it make sense to apply such methods (e.g. recursive feature elimination) to a decision tree or random forest classifier, where the features have rules there, and then feed the resulting best subset into my framework?
Also, as most of the algorithms provided by Scikit-learn are univariate algorithms, are there any other Python-based libraries that provide more subset feature selection algorithms?
I think the statement that "most of the algorithms provided by Scikit-learn are univariate algorithms" is false. Scikit-learn handles multi-dimensional data very nicely. The RandomForestClassifier that they provide will give you an estimate of feature importance.
Another way to estimate feature importance is to choose any classifier that you like, train it, and estimate performance on a validation set. Record the accuracy; this will be your baseline. Then take that same train/validation split, randomly permute all values along one feature dimension, and evaluate on the permuted validation data again. Record the difference between this accuracy and your baseline. Repeat this for all feature dimensions. The result will be a list of numbers, one per feature dimension, indicating its importance.
You can extend this to pairs or triples of feature dimensions, but the computational cost will grow quickly. If your features are highly correlated, you may benefit from doing this for at least the pairwise case.
Here is the source document of where I learned that trick: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
(It should work for classifiers other than Random Forests.)
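The permutation trick described above can be sketched in a few lines. To keep it dependency-free, this uses a trivial nearest-centroid classifier on synthetic data (both invented for illustration); any classifier with fit/predict would work the same way:

```python
# Sketch: permutation-based feature importance with a toy classifier.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 0 separates the classes, feature 1 is pure noise.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(0, 1, (50, 2))])
X[50:, 0] += 4.0
y = np.array([0] * 50 + [1] * 50)

def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    classes = sorted(centroids)
    d = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# (For brevity this evaluates on the training data; a real run would
# keep a separate validation split as the answer describes.)
centroids = fit_centroids(X, y)
baseline = (predict(centroids, X) == y).mean()

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    rng.shuffle(Xp[:, j])               # permute one feature column
    acc = (predict(centroids, Xp) == y).mean()
    importances.append(baseline - acc)  # drop in accuracy = importance

print(importances)  # feature 0 should matter far more than feature 1
```

Scikit-learn also ships this technique ready-made as `sklearn.inspection.permutation_importance`.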

Which Data Mining task to retrieve a unique instance

I'm working with data mining, and I'm familiar with classification, clustering and regression tasks. In classification, one can have a lot of instances (e.g. animals), their features (e.g. number of legs) and a class (e.g. mammal, reptile).
But what I need to accomplish is, given some attributes, including the class attribute, to determine which unique instance I'm referring to (e.g. giraffe). I can supply all known attributes that I have, and if the model can’t figure out the answer, it can ask for another attribute – just analogous to a 20 questions style of game.
So, my question is: does this specific task have a name? It seems similar to classification, except that the class is unique to each instance, which wouldn't fit the usual training models, except perhaps a decision tree model.
Your inputs, called features in machine learning, are tuples of species (what I think you mean by "instance") and physical attributes. Your outputs are broader taxonomic ranks. Thus, assigning one to each input is a classification problem. Since your features are incomplete, you want to perform classification with incomplete data, or impute the missing features. Searching for these terms will give you enough leads.
(And the other task is properly called clustering.)
IMHO you are simply looking for a decision tree, except that you don't train it on your categorical attribute (your "class") but on the individual instance label.
You need to choose the splitting measure carefully, though, as many measures depend on class sizes, and all your classes now have size 1. Finding a good split for the decision tree may involve planning some splits ahead to get an optimally balanced tree. A random-forest-like approach may be of use to improve the chance of finding a good tree.
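A toy version of this "20 questions" tree can make the idea concrete: each instance is its own class, and at every step we ask about the attribute whose answer splits the remaining candidates most evenly. The animals and attributes below are invented for illustration:

```python
# Sketch: identify a unique instance by asking balanced yes/no questions.
animals = {
    "giraffe": {"long_neck": True,  "spots": True,  "flies": False},
    "leopard": {"long_neck": False, "spots": True,  "flies": False},
    "eagle":   {"long_neck": False, "spots": False, "flies": True},
    "penguin": {"long_neck": False, "spots": False, "flies": False},
}

def best_question(candidates):
    """Pick the attribute whose True/False split is most balanced."""
    attrs = next(iter(candidates.values())).keys()
    def imbalance(a):
        yes = sum(1 for feats in candidates.values() if feats[a])
        return abs(2 * yes - len(candidates))   # 0 = perfect halving
    return min(attrs, key=imbalance)

def identify(candidates, answers):
    """Narrow down the candidates; `answers` plays the role of the user."""
    while len(candidates) > 1:
        q = best_question(candidates)
        candidates = {name: feats for name, feats in candidates.items()
                      if feats[q] == answers[q]}
    return next(iter(candidates))

print(identify(animals, {"long_neck": True, "spots": True, "flies": False}))
```

This is exactly the "balanced split" criterion mentioned above, in place of an impurity measure like Gini, which degenerates when every class has size 1.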

Algorithm for data quality in a data warehouse

I'm looking for a good algorithm / method to check the data quality in a data warehouse.
To that end, I want an algorithm that "knows" the possible structure of the values, checks whether each value is a member of this structure, and then decides whether it is correct or not.
I thought about defining a regexp and then checking whether each value fits it.
Is this a good way? Are there some good alternatives? (Any research papers?)
I have seen some authors suggest adding a special dimension, called a data quality dimension, to further describe each fact-table record.
Typical values in a data quality dimension could then be “Normal value,” “Out-of-bounds value,” “Unlikely value,” “Verified value,” “Unverified value,” and “Uncertain value.”
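Combining this idea with the regexp approach from the question, deriving such a label per record might look like the sketch below. The ZIP-code pattern, the population bounds, and the field names are illustrative assumptions, not a standard:

```python
# Sketch: assign a data-quality-dimension label to a record's values
# using a structural regex check plus simple range checks.
import re

ZIP_RE = re.compile(r"^\d{5}$")          # assumed structure of the value

def quality_label(zip_code, population):
    if not ZIP_RE.match(zip_code):
        return "Out-of-bounds value"     # violates the known structure
    if population < 0:
        return "Out-of-bounds value"     # impossible by definition
    if population > 10_000_000:
        return "Unlikely value"          # structurally fine, but suspicious
    return "Normal value"

print(quality_label("90210", 34000))        # Normal value
print(quality_label("9021A", 34000))        # Out-of-bounds value
print(quality_label("10001", 99_999_999))   # Unlikely value
```

The resulting label would be stored as a foreign key from the fact table into the data quality dimension, so bad records stay queryable instead of being silently dropped.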
I would recommend using a dedicated data quality tool, like DataCleaner (http://datacleaner.eobjects.org), which I have been doing quite a lot of work on.
You need a tool that not only checks strict rules like constraints, but also gives you a profile of your data and makes it easy for you to explore and identify inconsistencies on your own. Try, for example, the "Pattern finder", which will tell you the patterns of your string values - something that will often reveal outliers and erroneous values. You can also use the tool for actually cleansing the data, by transforming values, extracting information from them or enriching them using third-party services. Good luck improving your data quality!