.This is a basic question .I am trying to classify text files into 20 different classes.
Therefore I have a project structure with a folder called train,test.
In the train folder I have 20 different folders ,each folder again has many files related to that particular class.ex:weather, atheism...etc
I have now created a train.arff file for the entire train folder.When the data is visualized through I can see only two attributes .
Have provided a link below:
Screen in weka
My Doubt is how can i view the various files under these folders and remove the stopwords,punctuation,stemmin.How do I go about preprocessing.If some links to good resources are available please suggest and provide the necessary links
I found the videos below quite helpful when I first got my hands on text classification using Weka. You might want to take a look.
Weka Tutorial 31: Document Classification 1 (Application)
Weka Tutorial 32: Document classification 2 (Application)
WEKA Text Classification for First Time & Beginner Users
You might want to use StringToWordVector filter to see the effect of each word as an attribute, which is indeed described in detail in the first and last video . Within the filter settings you can give a stopwords list and choose in each run to use it or not. Same with the stemming you can change it as well. This documentation and videos will get you to understand it easily.
Related
I am new to ImageNet and would like to download full sized images of one of the subsets/synsets however I have found it incredibly difficult to actually find what subsets are available and where to find the ID code so I can download this.
All previous answers (from only 7 months ago) contain links which are now all invalid. Some seem to imply there is some sort of algorithm to making up an ID as it is linked to wordnet??
Essentially I would like a dataset of plastic or plastic waste or ideally marine debris. Any help on how to get the relevant ImageNet ID or suggestions on other datasets would be much much appreciated!!
I used this repo to achieve what you're looking for. Follow the following steps:
Create an account on Imagenet website
Once you get the permission, download the list of WordNet IDs for your task
Once you've the .txt file containing the WordNet IDs, you are all set to run main.py
As per your need, you can adjust the number of images per class
By default ImageNet images are automatically resized into 224x224. To remove that resizing, or implement other types of preprocessing, simply modify the code in line #40
Source: Refer this medium article for more details.
You can find all the 1000 classes of ImageNet here.
EDIT:
Above method doesn't work post March 2021. As per this update:
The new website is simpler; we removed tangential or outdated functions to focus on the core use case—enabling users to download the data, including the full ImageNet dataset and the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
So with this, to parse and search imagenet now you may have to use nltk.
More recently, the organizers hosted a Kaggle challenge based on the original dataset with additional labels for object detection. To download the dataset you need to register a Kaggle account and join this challenge. Please note that by doing so, you agree to abide by the competition rules.
Please be aware that this file is very large (168 GB) and the download will take anywhere from minutes to days depending on your network connection.
Install the Kaggle CLI and set up credentials as per this guideline.
pip install kaggle
Then run these:
kaggle competitions download -c imagenet-object-localization-challenge
unzip imagenet-object-localization-challenge.zip -d <YOUR_FOLDER>
Additionally to understand ImageNet hierarchy refer this.
I'm using AWS SageMaker, and i want to create something that, with a given text, it recognize the place of that description. Is it possible?
If there are no other classes besides the text that you would like your model to identify, you may not need a multiclass classifier.
You could train your own text detection model using Amazon SageMaker, and train using a dataset with labelled examples using the Object Detection Algorithm, but this becomes rather involved for a problem that has existing solutions available.
If the appearance of the text you're trying to detect is identical each time, your problem space gets reduced from trying to interpret variable text, to simply having to gather enough examples and perform object detection for the "pattern" your text forms visually. Note that if the text were to appear in different fonts or styles, that the generic object detection method would not interpret it dynamically, and an OCR-based solution would likely be necessary.
More broadly, for text identification in images on AWS, you have quite a few options:
Amazon Rekognition has a DetectText method that will enable you to easily find text within an image. If it's a small or simple phrase, with alphanumeric characters, this should work very well for your use case.
Amazon Textract will help you perform OCR (optical character recognition) while retaining the structure of the source. This is great for documents and tables, but doesn't sound like it may be applicable to your use case.
The AWS marketplace will also have hosted options available from third party vendors. One example of this for text region identification is this one from RocketML.
There are also some great open source tools I'd recommend looking into: OpenCV for ascertaining the text bounding boxes, and Tesseract for OCR and text extraction. This blog post does a good job walking through the process of using them together.
Any of these will help to solve your problem of performing OCR/text identification on AWS, but the best choice comes down to what your current and future needs are, and how quickly you're looking to implement the feature.
Your question is not clear regarding the data that you have or the problem that you want to solve.
If you have a text that includes a place name in it (for example, "I visited Seattle and enjoyed the fish market"), you can use Amazon Comprehend Name Entity Extraction (NEE) including places ("Seattle" in the above example)
{
"Entities": [
{
"Score": 0.9857407212257385,
"Type": "LOCATION",
"Text": "Seattle",
"BeginOffset": 10,
"EndOffset": 17
}
]
}
If the description is more general and you want to classify if the description is of a hotel, a restaurant, a theme park, a concert/show, or similar types of places, you can either use the Custom classification in Comprehend or the Neural Topic Model in SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html). You will need some examples of the classes and documents/sentences that are used for the model training.
I'm still very new to the world of machine learning and am looking for some guidance for how to continue a project that I've been working on. Right now I'm trying to feed in the Food-101 dataset into the Image Classification algorithm in SageMaker and later deploy this trained model onto an AWS deeplens to have food detection capabilities. Unfortunately the dataset comes with only the raw image files organized in sub folders as well as a .h5 file (not sure if I can just directly feed this file type into sageMaker?). From what I've gathered neither of these are suitable ways to feed in this dataset into SageMaker and I was wondering if anyone could help point me in the right direction of how I might be able to prepare the dataset properly for SageMaker i.e convert to a .rec or something else. Apologies if the scope of this question is very broad I am still a beginner to all of this and I'm simply stuck and do not know how to proceed so any help you guys might be able to provide would be fantastic. Thanks!
if you want to use the built-in algo for image classification, you can either use Image format or RecordIO format, re: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html#IC-inputoutput
Image format is straightforward: just build a manifest file with the list of images. This could be an easy solution for you, since you already have images organized in folders.
RecordIO requires that you build files with the 'im2rec' tool, re: https://mxnet.incubator.apache.org/versions/master/faq/recordio.html.
Once your data set is ready, you should be able to adapt the sample notebooks available at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms
Currently, i am working on a project using WEKA. Being naive and newbie in it, there are many things which i am not familair with. In my last project I used text files as a classification using WEKA. I applied the TextDirectoryLoader convertor to convert a directory containing text files as mentioned on this URL Text categorization with WEKA. Now I want to use the same stretagy for converting a directory containing source code (instead of text). For example, I have a Jedit source file containing Java source code. I am trying to convert it to ARFF file so that i can apply classifiers or other functions present in WEKA on that ARFF file for data mining purposes. I have also tried a test file given on following URL ARFF files from Text Collections. I believe i can use the same file as an example to convert source code files. However, I do not know what attributes should I define in a FastVector? and What format should the data be in (String or numeric). And what other sections should an ARFF file may have?
As in the example the authors have defined following attributes
FastVector atts = new FastVector(2);
atts.addElement(new Attribute("filename", (FastVector) null));
atts.addElement(new Attribute("contents", (FastVector) null));
I have tried to find some examples on Google but no success.
Could anyone here suggests me any solution or alternate to solve the above said problem? (Example code will be highly appreciated).
Or atleast could give me a short example which convertes a source code directory into an ARFF file. (If it is possible).
If not possible what could be the possible reason
Any alternate solution (except WEKA) where I can use the same set of functions on a source code.
It is not clear, what is your goal? Do you want to classify the source code files, or find the files which are contains any bug, or what?
As I imagine, you want to extract features from each source file, and represent it with an instance. Then you can apply any machine learning based algorithm.
Here, you can find a java example, how can you construct an arff file from java:
https://weka.wikispaces.com/Creating+an+ARFF+file
But, you have to define your task specific features and extract it from each source code files.
Good day.
An apology for my English but it's not my native language so they know apologize for any errors.
I have a text file with the data processing which want to get a .arff file, which is the file type using weka.
I do not want to generate a single file. I get 2 files, one for training the model (training) and another to test the model (test).
This is done directly weka eh applying a filter stringtowordtokenizer but the problem is that when you use the second file to test a mistake because it is not fair test a model that has words that should not be.
If someone helps me I would appreciate.
Thank you and best regards.