How do you perform bootstrapping and remove outliers in Weka? - data-mining

I am just starting to play around with the Weka API and a couple of the example data sets, but I just wanted to understand a couple of bits and pieces. Does anyone know how to perform 0.632 bootstrapping in Weka?
Also, how would I go about detecting outliers (I understand there are many different methods of doing this...)?
And how would I remove, say, 10% of outliers once they have been identified?
Any help would be greatly appreciated!
Cheers,
Neil

You can perform supervised resampling with replacement, which is the basis of the bootstrap, using the Resample filter (weka.filters.supervised.instance.Resample).
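A minimal sketch, assuming the third-party python-weka-wrapper3 package and an ARFF file named iris.arff (both assumptions, not part of the answer above). As far as I know, Weka has no built-in 0.632 estimator, so you would combine the errors yourself as err = 0.368 * resubstitution_error + 0.632 * out_of_bag_error. For the outlier part, this uses the common InterquartileRange + RemoveWithValues combination; note it removes all flagged outliers rather than a fixed 10%.

```python
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.filters import Filter

jvm.start()

# Load a dataset (file name is an assumption) and set the class attribute.
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("iris.arff")
data.class_is_last()

# Bootstrap sample: supervised resampling with replacement at 100% size.
resample = Filter(classname="weka.filters.supervised.instance.Resample",
                  options=["-Z", "100", "-S", "1"])   # -Z percent, -S seed
resample.inputformat(data)
bootstrap_sample = resample.filter(data)

# Flag outliers: InterquartileRange appends nominal "Outlier" and
# "ExtremeValue" attributes (values "no"/"yes") to every instance.
iqr = Filter(classname="weka.filters.unsupervised.attribute.InterquartileRange",
             options=["-R", "first-last"])
iqr.inputformat(data)
flagged = iqr.filter(data)

# Remove instances whose "Outlier" attribute is "yes". The Outlier column
# is the second-to-last attribute; -C is 1-based, -L 2 selects label "yes".
remove = Filter(classname="weka.filters.unsupervised.instance.RemoveWithValues",
                options=["-C", str(flagged.num_attributes - 1), "-L", "2"])
remove.inputformat(flagged)
cleaned = remove.filter(flagged)
print(cleaned.num_instances, "instances left of", data.num_instances)

jvm.stop()
```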

Related

Word-Embeddings compression for deep learning models

Can anyone suggest techniques/concepts to compress word embeddings like GloVe 300d? They increase the size of the model.
A decent option is deep compositional code learning, a neural-network approach that reduces word-embedding size drastically. This GitHub link will help you and give more insight. It is quite beneficial in sentiment analysis tasks, but it still needs to be tested on more NLP tasks.
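For comparison, here is a much simpler baseline than the compositional-coding method above: plain PCA compression of the GloVe matrix. The file name glove.6B.300d.txt and the 100-dimension target are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Read GloVe's plain-text format: word followed by its vector components.
words, vecs = [], []
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        vecs.append(np.asarray(parts[1:], dtype=np.float32))

X = np.vstack(vecs)                                # (vocab_size, 300)
X_small = PCA(n_components=100).fit_transform(X)   # (vocab_size, 100)

# Roughly 3x smaller in memory/on disk; expect some downstream accuracy loss.
np.save("glove.100d.pca.npy", X_small)
```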

Specific topics on Tensorflow for CNN

I have a mini project for my new course in TensorFlow this semester, with a free choice of topic. Since I have some background in Convolutional Neural Networks, I intend to use them for my project. My computer can only run the CPU version of TensorFlow.
However, as a newbie, I realize that there are a lot of topics, such as MNIST, CIFAR-10, etc., so I don't know which one to pick. I only have two weeks left. It would be great if the topic were not too complicated but not too easy either, since I'm at an intermediate level.
In your experience, could you give me some advice about the specific topic I should do for my project?
Moreover, it would be better if the topic let me provide my own data to test my training, because my professor said that is a plus toward an A grade on the project.
Thanks in advance,
I think that to answer this question you need to properly evaluate the marking criteria for your project. However, I can give you a brief overview of what you've just mentioned.
MNIST: MNIST is an optical character recognition task for individual digits 0-9 in 28x28-pixel images. This is considered the "Hello World" of CNNs. It's pretty basic and might be too simplistic for your requirements; hard to gauge without more information. Nonetheless, it will run pretty quickly with CPU TensorFlow and the online tutorial is pretty good.
CIFAR-10: CIFAR is a much bigger dataset of objects and vehicles. The image sizes are 32px square so individual image processing isn't too bad. But the dataset is very large and your CPU might struggle with it. It takes a long time to train. You could try training on a reduced dataset but I don't know how that would go. Again, depends on your course requirements.
Flowers-Poets: There is the TensorFlow for Poets re-training example; even if re-training itself isn't suitable for your course, you could use its flowers dataset to build your own model.
Build-your-own-model: You could use tf.layers to build your own network and experiment with it (a minimal sketch follows this answer). tf.layers is pretty easy to use. Alternatively you could look at the new Estimators API, which automates a lot of the training process for you. There are a number of tutorials (of varying quality) on the TensorFlow website.
I hope that helps give you a run-down of what's out there. Other datasets to look at are PASCAL VOC and ImageNet (however, they are huge!). Models to experiment with may include VGG-16 and AlexNet.
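Here is the build-your-own-model sketch mentioned above: a small tf.layers CNN for 28x28 grayscale inputs like MNIST, in TensorFlow 1.x style. All hyperparameters (filter counts, kernel sizes, layer widths) are illustrative, not tuned.

```python
import tensorflow as tf

def cnn(images):
    """images: float tensor of shape [batch, 28, 28, 1]; returns 10 logits."""
    x = tf.layers.conv2d(images, filters=32, kernel_size=5,
                         padding="same", activation=tf.nn.relu)
    x = tf.layers.max_pooling2d(x, pool_size=2, strides=2)   # -> 14x14
    x = tf.layers.conv2d(x, filters=64, kernel_size=5,
                         padding="same", activation=tf.nn.relu)
    x = tf.layers.max_pooling2d(x, pool_size=2, strides=2)   # -> 7x7
    x = tf.reshape(x, [-1, 7 * 7 * 64])                      # flatten
    x = tf.layers.dense(x, units=1024, activation=tf.nn.relu)
    return tf.layers.dense(x, units=10)   # logits for the 10 digit classes
```

Even this small network trains on MNIST in reasonable time on a CPU, which fits the constraint in the question.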

How to extract the best features through NSGA-II algo in weka

I want to perform feature selection that gives the best classification results, measured by precision and recall, using NSGA-II in Weka.
How can I do this? Can anyone give me a blueprint for this task? Any help will be really appreciated.
Maybe this paper can help you choose the components to use in Weka.
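The general recipe, independent of the paper, is: encode a feature subset as a bit string, score each subset on two objectives (precision and recall), and let NSGA-II evolve the Pareto front. Below is a sketch using the DEAP library and scikit-learn rather than Weka, so treat it as an illustrative swap; the dataset, classifier, and all evolutionary parameters are placeholder assumptions.

```python
import random
import numpy as np
from deap import base, creator, tools
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
n_features = X.shape[1]

# Two objectives, both maximized: precision and recall.
creator.create("FitnessMulti", base.Fitness, weights=(1.0, 1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

toolbox = base.Toolbox()
toolbox.register("bit", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.bit, n_features)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def evaluate(ind):
    """Score a feature subset by cross-validated precision and recall."""
    mask = np.array(ind, dtype=bool)
    if not mask.any():                       # empty subset: worst fitness
        return 0.0, 0.0
    preds = cross_val_predict(GaussianNB(), X[:, mask], y, cv=5)
    return precision_score(y, preds), recall_score(y, preds)

toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selNSGA2)

pop = toolbox.population(n=40)               # must be a multiple of 4
for ind in pop:
    ind.fitness.values = toolbox.evaluate(ind)
pop = toolbox.select(pop, len(pop))          # assigns crowding distance

for gen in range(20):
    offspring = [toolbox.clone(i) for i in tools.selTournamentDCD(pop, len(pop))]
    for c1, c2 in zip(offspring[::2], offspring[1::2]):
        toolbox.mate(c1, c2)
        toolbox.mutate(c1)
        toolbox.mutate(c2)
        del c1.fitness.values, c2.fitness.values
    for ind in offspring:
        ind.fitness.values = toolbox.evaluate(ind)
    pop = toolbox.select(pop + offspring, 40)  # NSGA-II environmental selection

# Pareto-optimal feature subsets and their (precision, recall) scores.
for ind in tools.sortNondominated(pop, len(pop), first_front_only=True)[0]:
    print(ind.fitness.values, [i for i, b in enumerate(ind) if b])
```

To stay inside Weka instead, the same loop would wrap a Weka classifier evaluation in the fitness function; the encoding and selection scheme are unchanged.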

Classification using text mining - by values versus keywords

I have a classification problem that is highly correlated with economics by city. I have unstructured free-text data covering population, median income, employment, etc. Is it possible to use text mining to understand the values in the text and make a classification? Most text-mining articles I have read use keyword or phrase counts to classify. I would like to classify by the meaning of the text rather than the frequency of terms. Is this possible?
BTW, I currently use RapidMiner and R. Not sure if this would work with either of these?
Thanks in advance,
John
Yes, this probably is possible.
But no, I cannot give you a simple solution, you will have to collect a lot of experience and experiment yourself. There is no push-button magic solution that works for everybody.
As your question is overly broad, I don't think there will be a better answer than "Yes, this might be possible", sorry.
You could think of this as two separate problems:
1. Extracting information from unstructured data.
2. Classification.
There are several approaches to mine specific features from the text. Alternatively, you could use a bag-of-words approach for classification directly and see the results (a minimal sketch follows below). Depending on your problem, a classifier could potentially learn from the text features alone.
You could also use PCA or something similar to find the important features and then run the mining process to extract them.
All of this depends on your problem which is too broad and vague.
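As an illustration of the bag-of-words route mentioned above, here is a minimal sketch in Python with scikit-learn (you use R/RapidMiner, so this is only a transferable example; all texts and labels are made-up placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical city descriptions and classes.
texts = ["population grew 4% while median income stagnated",
         "strong employment and rising median income",
         "unemployment climbed and population declined"]
labels = ["declining", "growing", "declining"]

# TF-IDF over unigrams and bigrams, then a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(texts, labels)
print(model.predict(["median income rising, employment strong"]))
```

Note this still classifies by term frequencies; extracting the numeric values themselves (population, income) and classifying on those would be the separate information-extraction problem listed above.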

Could anyone give me help on ground-truth data

I recently came across a term in one of my email communications with my supervisor. I am doing a data-mining project on Facebook user profiles, and he said I should be collecting ground-truth data.
I am very new to this term, and when I searched online I found very few results about it in the data-mining sense.
Could anyone give me an example of what ground-truth data is in a data-mining task, please?
Thank you very much.
Ground truth is annotated data (generally annotated by humans) that is known to be 100% correct.
It is used to train and evaluate an algorithm, since it is exactly what you expect the algorithm to produce.
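A toy illustration (all values made up): the human-verified labels are the ground truth, and the model is judged by its agreement with them.

```python
# Human-verified labels for three profiles vs. a model's predictions.
ground_truth = {"user_1": "student", "user_2": "professional", "user_3": "student"}
predictions  = {"user_1": "student", "user_2": "student",      "user_3": "student"}

# Accuracy = fraction of predictions that match the ground truth.
correct = sum(predictions[u] == ground_truth[u] for u in ground_truth)
print("accuracy:", correct / len(ground_truth))   # 2/3 ~ 0.67
```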