Do ML.NET transformations apply stop-word removal?

I'm new to ML.NET and playing around with some basic MultiClassClassification scenarios. Does it already handle stop words by default, or should I do that in my data prep?

Please check out this section of the ML.NET cookbook.
If you use mlContext.Transforms.Text.FeaturizeText in your pipeline, it will remove English stop words by default.
Of course, you are free to tweak your NLP preprocessing with the other components ML.NET provides, but from my limited experience with text classification, the catch-all FeaturizeText does a reasonable job in most cases.
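ML.NET itself is C#, but for anyone more comfortable in Python, here is a rough sketch of the same idea using scikit-learn (not ML.NET, and purely illustrative: the two-document corpus is made up, and unlike FeaturizeText, scikit-learn only drops stop words when asked explicitly):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the quick brown fox", "a lazy dog sat on the mat"]

    # Featurize text into n-gram TF-IDF features with English stop words removed.
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    features = vectorizer.fit_transform(docs)

    # "the", "a" and "on" no longer appear in the learned vocabulary.
    print(sorted(vectorizer.get_feature_names_out()))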

Related

Does Vowpal Wabbit support adding and removing features during training?

It seems to me that adding features during training should be fine, since the weight vector can simply expand, and from a few tests it looks like that is exactly what happens.
I am also aware that feature names are hashed by VW, so I was thinking it should be possible to remove features during training as well, but I cannot seem to confirm this in the code and have had trouble testing it via indices and weight values.
Is there a definitive answer on these issues?
If you train with --save_resume, you can continue the training (loading the previously trained model with -i) with different training data (with more features) and/or with different command line options for generating more features (e.g. --quadratic or --interactions). This is equivalent to adding features "while training".
There are also some reductions which automatically add more features during training. Currently, I am only aware of --stage_poly, but maybe there are more.
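To make the "continue training on new data" recipe above concrete, here is a small hedged sketch that drives the vw command line from Python; the data and model file names and the "ab" namespace pair are placeholders:

    import subprocess

    # First pass: train and save a resumable model (--save_resume keeps the
    # extra state needed to continue training later).
    subprocess.run(
        ["vw", "-d", "part1.vw", "--save_resume", "-f", "model1.vw"],
        check=True,
    )

    # Second pass: load the saved model with -i and keep training on new data,
    # here also generating quadratic features between namespaces a and b.
    subprocess.run(
        ["vw", "-d", "part2.vw", "-i", "model1.vw", "--save_resume",
         "--quadratic", "ab", "-f", "model2.vw"],
        check=True,
    )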

Amazon Machine Learning for sentiment analysis

How flexible or supportive is the Amazon Machine Learning platform for sentiment analysis and text analytics?
You can build a good machine learning model for sentiment analysis using Amazon ML.
Here is a link to a github project that is doing just that: https://github.com/awslabs/machine-learning-samples/tree/master/social-media
Since Amazon ML supports supervised learning and accepts text as an input attribute, you need to get a sample of data that has been tagged and build the model with it.
The tagging can be done with Mechanical Turk, as in the example above, or by using interns ("the summer is coming") to do the tagging for you. The benefit of having your own specific tagging is that you can put your logic into the model. For example, the difference between "The beer was cold" and "The steak was cold", where one is positive and one is negative, is something that a generic system will find hard to learn.
You can also try to play with some sample data, from the project above or from this Kaggle competition for sentiment analysis on movie reviews: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews. I used Amazon ML on that data set and got fairly good results rather easily and quickly.
Note that you can also use Amazon ML to run real-time predictions based on the model that you are building, and you can use it to respond immediately to negative (or positive) input. See more here: http://docs.aws.amazon.com/machine-learning/latest/dg/interpreting_predictions.html
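As a hedged sketch of that real-time flow using boto3 (the model id, endpoint URL and attribute name below are placeholders you would take from the Amazon ML console once the model is built and real-time predictions are enabled):

    import boto3

    client = boto3.client("machinelearning")

    # Placeholder model id and real-time endpoint from the Amazon ML console.
    response = client.predict(
        MLModelId="ml-EXAMPLEMODELID",
        Record={"text": "The beer was cold"},  # attribute name must match your schema
        PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
    )

    # For a categorical (sentiment) target, the label and per-class scores live here.
    print(response["Prediction"]["predictedLabel"])
    print(response["Prediction"]["predictedScores"])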
It is great for starting out, and I highly recommend you explore it as an option. However, be aware of the limitations:
you'll want to build a pipeline because models are immutable--you have to build a new model to incorporate new training data (or new hyperparameters, for that matter)
you are drastically limited in the tweakability of your system
it only does supervised learning
the target variable can't be other text, only a number, boolean or categorical value
you can't export the model and import it into another system if you want--the model is a black box
Benefits:
you don't have to run any infrastructure
it integrates with AWS data sources well
the UX is nice
the algorithms are chosen for you, so you can quickly test and see if it is a fit for your problem space.

Existing API for NLP in C++?

Are there any existing C++ NLP APIs out there? The closest thing I have found is CLucene, a port of Lucene. However, it seems a bit obsolete and the documentation is far from complete.
Ideally, such an API would support tokenization, stemming and PoS tagging.
Freeling is written in C++ too, although most people just use its binaries to run the tools: http://devel.cpl.upc.edu/freeling/downloads?order=time&desc=1
Try something like DyNet: it's a generic neural-net framework, but much of its development focuses on NLP because its maintainers are prominent members of the NLP community.
Or perhaps Marian-NMT: it was designed for sequence-to-sequence machine translation, but many NLP tasks can be cast as sequence-to-sequence problems.
Outdated
Maybe you can try Ellogon (http://www.ellogon.org/); it has GUI support and also a C/C++ API for NLP.
If you remove the restriction on C++, you get the excellent NLTK (Python).
The remaining effort is then interfacing between Python and C++.
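A minimal NLTK sketch of those three steps (tokenization, stemming, PoS tagging); the sample sentence is made up, and the tokenizer and tagger resources need a one-off download:

    import nltk
    from nltk.stem import PorterStemmer

    # One-off resource downloads for the tokenizer and the PoS tagger.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    text = "The parsers are parsing previously parsed sentences."

    tokens = nltk.word_tokenize(text)              # tokenization
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]      # stemming
    tags = nltk.pos_tag(tokens)                    # PoS tagging

    print(tokens)
    print(stems)
    print(tags)

From C++ this could be wrapped as a small subprocess or exposed over a socket, which is the interfacing effort mentioned above.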
Apache Lucy would get you part of the way there. It is under active development.
Maybe you can use Weka-C++. It's the very popular Weka library for machine learning and data mining (including NLP) ported from Java to C++.
Weka supports tokenization and stemming; you'll probably need to train a classifier for PoS tagging.
I have only used Weka with Java though, so I'm afraid I can't give you more details on this version.
There is TurboParser by André Martins at CMU; it also has a Python wrapper, and there is an online demo for it.
This project provides free (even for commercial use) state-of-the-art information extraction tools. The current release includes tools for performing named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors.
MITIE is built on top of dlib, a high-performance machine-learning library, and makes use of several state-of-the-art techniques, including distributional word embeddings and structural support vector machines. MITIE offers several pre-trained models providing varying levels of support for both English and Spanish, trained using a variety of linguistic resources (e.g., CoNLL 2003, ACE, Wikipedia, Freebase, and Gigaword). The core MITIE software is written in C++, but bindings for several other languages, including Python, R, Java, C, and MATLAB, allow a user to quickly integrate MITIE into his/her own applications.
https://github.com/mit-nlp/MITIE
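A hedged sketch of MITIE's Python binding (the core is the C++ library described above); the model path is a placeholder for wherever the separately downloaded MITIE-models archive was unpacked:

    from mitie import named_entity_extractor, tokenize

    # Placeholder path to the pre-trained English NER model from the
    # MITIE-models download.
    ner = named_entity_extractor("MITIE-models/english/ner_model.dat")

    tokens = tokenize("John Smith works for the MIT Lincoln Laboratory in Lexington.")
    entities = ner.extract_entities(tokens)

    # Each entity is a (token range, tag, ...) tuple; print the tag and the
    # tokens it covers.
    for entity in entities:
        token_range, tag = entity[0], entity[1]
        print(tag, [tokens[i] for i in token_range])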

Extensible lightweight markup language

Lightweight markup languages offer a fixed set of features. This feature set is growing, but every time I write a more complex article, I realize something is missing. Examples include proper image captions, a table of figures, file includes, cross-references, etc. So I end up creating a toolchain around it, with a Makefile and tricky sed commands.
I typically want to insert ad-hoc markers into my text and process them later. They can be one-liners, or more complex -- and this is where the whole regex approach fails. Here is a snippet of an imaginary markup.
I can generate an image from an external dot file [.myDot diag.dot The process],
and it will be included with a caption.
Or the dot source is right here [.myDotHere
foo->bar->Done;
]
I'm looking for a markup tool which can be easily extended to suit my ad-hoc needs. The options I found so far:
Makefile, pre- and postprocessing with sed/perl scripts
Built in regex pre-processing in txt2tags
Pandoc parses markdown into an internal AST which can be transformed with haskell scripts
So what I'm looking for is
a language designed with customization and extensibility in mind
lightweight; no TeX/LaTeX please
not something which happens to handle all my specific issues but is not extensible
My output is usually just html, so it doesn't have to support many targets
I created Glyph with extensibility in mind. You can create your own macros either using Glyph itself or Ruby.
Glyph aims to make publishing easier while giving as much control as possible to the writer; it can manage book metadata, ToC, internal links, snippets, and so on.
For documentation on all its features check out the Glyph book, which was created using Glyph itself.
Your "toolchain" approach is a good one - You won't IMO find a single project that will handle your specific needs, best to follow the *nix philosophy and use the best tool for the job that plugs into your open toolchain.
If macro inclusion is an issue, don't worry about solving that by your choice of markup syntax - find the right tool for that specific job and use it upstream.
The choice of markup should IMO be based on the availability of transformation tools to your desired output. IMO Pandoc is by far the most actively developed project in this space, and very flexible, especially with its scripting facility. Note it's also very well supported on Google Groups; John will likely respond directly and quickly to any issues you may have.
Note that Pandoc's flexibility also means your master source text isn't as "locked in", since you can easily convert, for example, from its extended Markdown syntax to reST if you wanted to take advantage of Sphinx's or DocBook's capabilities. (BTW, also check out AsciiDoc, which the latest Pandoc outputs; apparently a reader is also in the works.)
Check out Pandoc's "extras" wiki page; I've been particularly excited by the ConTeXt filter script. I'm not sure if it'll be a good fit for you, but it includes some macro-include capabilities, and IMO nothing will give you better typographical control.
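Pandoc's filters don't have to be Haskell; the JSON AST can also be transformed from Python with the pandocfilters package. Here is a hedged sketch, loosely imitating the [.myDotHere ...] idea from the question (the myDotHere class name, caption text and output format are made up):

    import hashlib
    import subprocess

    from pandocfilters import Image, Para, Str, toJSONFilter


    def render_dot(key, value, fmt, meta):
        # Turn code blocks tagged with the made-up class "myDotHere" into
        # rendered Graphviz images with a short caption.
        if key != "CodeBlock":
            return None
        (ident, classes, keyvals), code = value
        if "myDotHere" not in classes:
            return None
        outfile = hashlib.sha1(code.encode("utf-8")).hexdigest() + ".svg"
        subprocess.run(["dot", "-Tsvg", "-o", outfile],
                       input=code.encode("utf-8"), check=True)
        caption = [Str("Generated diagram")]
        return Para([Image(["", [], []], caption, [outfile, ""])])


    if __name__ == "__main__":
        toJSONFilter(render_dot)

Saved as render_dot.py and made executable, it would be wired in with something like: pandoc article.md --filter ./render_dot.py -o article.html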

Workflow to Turn Wiki content into a system manual

We're in the middle of deploying a new software system to lots of users in lots of places (200+ users over 8 countries). In the past we've written a manual for the users, then updated it every so often. This works OK, in that all the users have the same manual and it covers the main things, but it has its problems: it doesn't get updated that often, we sometimes miss updates, and some users will have old copies.
We've been talking about using a wiki during the testing and deployment phases to build a knowledge base about the system. Ideally we'd then like some way to convert that into some form of electronic document that we can 'prettify' and send out as the official manual, as well as letting users use and update the wiki.
Has anyone else done anything similar? Any suggestions for wiki systems, workflows, document formats, etc.?
Most wikis support export to PDF, e.g.:
MediaWiki PDF Export
DokuWiki PDF Export
TWiki PDF Export
You can write something that generates LaTeX from the wiki and renders a manual to PDF. With packages like hyperref you can retain cross-references as hyperlinks.
Additionally, you can integrate content from multiple sources such as a data dictionary into the LaTeX document, which can be mixed and matched with the wiki content. You could also set the architecture up so it can support cross-referencing that goes either way.
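A hedged sketch of that route using Pandoc as the wiki-to-LaTeX step (the wiki URL and page names are placeholders, and it assumes MediaWiki's raw-page export plus a local pandoc and LaTeX installation):

    import subprocess
    import urllib.request

    pages = ["Installation", "Configuration", "FAQ"]   # placeholder page names
    raw_url = "https://wiki.example.com/index.php?action=raw&title={}"

    # 1. Pull the raw MediaWiki markup for each page and concatenate it.
    source = "\n\n".join(
        urllib.request.urlopen(raw_url.format(page)).read().decode("utf-8")
        for page in pages
    )

    # 2. Let pandoc read the MediaWiki markup and render a PDF through LaTeX;
    #    pandoc's default LaTeX template loads hyperref, so links stay clickable.
    subprocess.run(
        ["pandoc", "-f", "mediawiki", "--toc", "-o", "manual.pdf"],
        input=source.encode("utf-8"),
        check=True,
    )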
FrameMaker could also support this using generated MIF files, and you could also use Lout in a similar way, or convert your wiki content to DocBook, which would let you use any of the many rendering options available for that format.
As an aside, the following Stackoverflow postings discuss various systems for maintaining documentation.
Application (Not a Markup Language) for Producing a User Manual
Can LaTeX be used for producing any documentation that accompanies software?
What tools are used to write documentation?
What tools does your team use for writing user manuals?
How best to write documentation (ideally in latex) targeting both the web (html) and paper (pdf)?
Best tool(s) for working with DocBook XML documents?
What is the recommended toolchain for formatting XML DocBook?
Is a successor for TeX/LaTeX in sight?
Madcap Flare is a help-and-manual authoring tool that uses HTML as the source for each topic. You could pretty easily do a mass import of the wiki pages. It would then require some cleaning, but after that you have a nice single-source system that can output CHM, web-browsable help, PDF, DOC/DOCX, etc.
How are you storing the help source at the moment? Is it MS Word files, MS help, LaTeX?
If you put your help source files under version control then you will get all the benefits of a wiki without having to migrate to a new system - people can make edits to the help files easily - those changes can be tracked, reverted etc. and you get the prettified manuals as before.
I followed Node's links and came across some mediawiki pages that I thought were noteworthy.
Extension:OpenDocument Export
Extension:PDF Writer
Category:Data extraction extensions
I gave a previous answer which may be useful for the "wiki to PDF" part -- look at using the open source PediaPress code or functionality. You can get ODFs from it too, although their PDFs are already quite pretty (but you might want to rebrand it and restyle it for your company I suppose).