Differences between Rcpp::is_na, R_IsNA and std::isnan [duplicate]

I'm converting R based code into Rcpp based code. The head of my function is:
NumericMatrix createMatrixOfLinkRatiosC(NumericMatrix matr, double threshold4Clean) {
    int i, j;
    NumericMatrix myMatr(matr.nrow(), matr.ncol());
    myMatr = matr;
    ....;
}
I want to handle calls to the function where threshold4Clean is missing, but I can't find out how to do it... Any help will be greatly appreciated.

R has both NaN and NA (which is really a special kind of NaN) for representing missing values. This is important to know because there are many functions that check if a value is NaN-y (NA or NaN):
Some truth tables for functions from the R/C API (note the frustrating lack of consistency)
+----------+-----+-----+
| Function | NaN | NA  |
+----------+-----+-----+
| ISNAN    |  t  |  t  |
| R_IsNaN  |  t  |  f  |
| ISNA     |  f  |  t  |
| R_IsNA   |  f  |  t  |
+----------+-----+-----+
and Rcpp:
+--------------+-----+-----+
| Function     | NaN | NA  |
+--------------+-----+-----+
| Rcpp::is_na  |  t  |  t  |
| Rcpp::is_nan |  t  |  f  |
+--------------+-----+-----+
and from the R interpreter (note: Rcpp tries to match this, rather than the R/C API):
+----------+-----+-----+
| Function | NaN | NA  |
+----------+-----+-----+
| is.na    |  t  |  t  |
| is.nan   |  t  |  f  |
+----------+-----+-----+
Unfortunately it's a confusing landscape, but this should empower you a bit.
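To see these in action, here's a minimal sketch (mine, not from the discussion above) that prints what each predicate returns for a plain NaN and for R's NA:
#include <Rcpp.h>

// [[Rcpp::export]]
void show_na_predicates() {
    double nan = R_NaN;   // a plain IEEE NaN
    double na  = NA_REAL; // R's NA: a NaN carrying a special payload
    // Columns: NaN, NA -- compare with the tables above
    Rcpp::Rcout << "ISNAN    " << ISNAN(nan)   << " " << ISNAN(na)   << "\n"
                << "R_IsNaN  " << R_IsNaN(nan) << " " << R_IsNaN(na) << "\n"
                << "R_IsNA   " << R_IsNA(nan)  << " " << R_IsNA(na)  << "\n"
                << "is_na    " << Rcpp::traits::is_na<REALSXP>(nan)
                << " "         << Rcpp::traits::is_na<REALSXP>(na)  << "\n";
}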

Both Rcpp and RcppArmadillo have predicates to test for NA (an R extension), NaN, and Inf.
Here is a short RcppArmadillo example:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat foo(int n, double threshold = NA_REAL) {
    arma::mat M = arma::zeros<arma::mat>(n, n);
    if (arma::is_finite(threshold)) M = M + threshold;
    return M;
}

/*** R
foo(2)
foo(2, 3.1415)
*/
We initialize a matrix of zeros and test the argument. If it is finite (i.e., not NA, NaN, or Inf), then we add that value. If you wanted to, you could test for the possibilities individually too; a sketch of that follows after the output below.
This produces the desired result: without a second argument the default value of NA applies, and we get a matrix of zeros.
R> Rcpp::sourceCpp("/tmp/giorgio.cpp")
R> foo(2)
     [,1] [,2]
[1,]    0    0
[2,]    0    0
R> foo(2, 3.1415)
       [,1]   [,2]
[1,] 3.1415 3.1415
[2,] 3.1415 3.1415
R>
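If you wanted to distinguish the cases individually, a sketch along the same lines (foo2 is my name for it, untested):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// Sketch: same idea as foo() above, but testing each case separately.
// [[Rcpp::export]]
arma::mat foo2(int n, double threshold = NA_REAL) {
    arma::mat M = arma::zeros<arma::mat>(n, n);
    if (R_IsNA(threshold)) {
        // argument left at its NA default: keep the zeros
    } else if (R_IsNaN(threshold)) {
        // a genuine NaN was passed: also keep the zeros
    } else if (!arma::is_finite(threshold)) {
        // +Inf or -Inf was passed
    } else {
        M = M + threshold;
    }
    return M;
}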

I've been testing this and can shed some light on the possibilities.
For a single SEXP target, the Rcpp option I've used is:
switch (TYPEOF(target)) {
    case INTSXP:
        return Rcpp::traits::is_na<INTSXP>(Rcpp::as<int>(target));
    case REALSXP:
        return Rcpp::traits::is_na<REALSXP>(Rcpp::as<double>(target));
    case LGLSXP:
        return Rcpp::traits::is_na<LGLSXP>(Rcpp::as<int>(target));
    case CPLXSXP:
        return Rcpp::traits::is_na<CPLXSXP>(Rcpp::as<Rcomplex>(target));
    case STRSXP: {
        Rcpp::StringVector vec(target);
        return Rcpp::traits::is_na<STRSXP>(vec[0]);
    }
}
If you want to check without using Rcpp, there are some caveats:
As mentioned here,
integer and logical NA (both stored as int) are equal to the minimum value of int (-2147483648).
For double, you could directly use what Rcpp uses,
namely R_isnancpp.
Equivalently, the ISNAN macro could be used.
For complex numbers, you could check both real and imaginary parts with the double method from above.
Character NA is tricky, since it's a singleton, so the address is what matters.
I personally have been testing ways to do operations with R characters without storing std::string to avoid copies,
i.e. using the char* directly.
What I've found that works is to declare this in a .cpp file:
static const char *na_string_ptr = CHAR(Rf_asChar(NA_STRING));
and, based on this answer,
do something like this for a Rcpp::StringVector or Rcpp::StringMatrix x:
Rcpp::CharacterVector one_string = Rcpp::as<Rcpp::CharacterVector>(x[i]);
char *ptr = (char *)(one_string[0]);
return ptr == na_string_ptr;
This last one still uses Rcpp,
but I can use it once for initial setup and then just use the char pointers.
I'm sure there's a way to do something similar with R's API,
but that's something I haven't tried yet.
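For what it's worth, the R-API-only equivalents of the above are short; a minimal sketch (the helper names are mine):
#include <Rinternals.h>

// Sketch of NA checks using only R's C API.
bool int_is_na(int x)           { return x == NA_INTEGER; } // NA_INTEGER == INT_MIN
bool dbl_is_na_or_nan(double x) { return ISNAN(x) != 0; }   // true for both NA and NaN
bool str_is_na(SEXP x, R_xlen_t i) {
    // NA_STRING is a singleton CHARSXP, so comparing addresses suffices
    return STRING_ELT(x, i) == NA_STRING;
}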


How to repeatedly insert arguments from a list into a function until the list is empty?

Using R, I am simulating the outcome of an experiment where participants choose between two options (A or B), defined by their outcomes (x) and the probabilities of winning those outcomes (p). I have a function "f" that collects its arguments in a matrix with the columns "x" (outcome) and "p" (probability):
f <- function(x, p) {
  t <- matrix(c(x, p), ncol = 2)
  colnames(t) <- c("x", "p")
  t
}
I want to use this function to compile a big list of all the trials in the experiment. One way to do this is:
t1 <- list(`1A`=f(x=c(10), p=c(0.8)),
           `1B`=f(x=c(5), p=c(1)))
t2 <- list(`2A`=f(x=c(11), p=c(0.8)),
           `2B`=f(x=c(7), p=c(1)))
.
.
.
tn <- list(nA=f(x=c(3), p=c(0.8)),
           nB=f(x=c(2), p=c(1)))
Big_list <- list(t1=t1, t2=t2, ... tn=tn)
rm(t1, t2, ... tn)
However, I have very many trials, which may change in future simulations, which is why repeating myself in this way is intractable. I have my trials in an Excel document with the following structure:
| Option | x  | p   |
|--------|----|-----|
| A      | 10 | 0.8 |
| B      | 7  | 1   |
| A      | 9  | 0.8 |
| B      | 5  | 1   |
| ...    | .. | ... |
I am trying to do some kind of loop which takes "x" and "p" from each "A" and "B" and inserts them into the function f, while skipping two rows ahead after each iteration (so that each option is only inserted once). This way, I want to get a set of lists t1 to tn while not having to hardcode everything. This is my best (but still not very good) attempt to explain it in pseudocode:
TRIALS <- read.excel(file_with_trials)
for n=1 to n=(nrow(TRIALS)/2) {   (*ONE ITERATION PER PAIR OF ROWS*)
t(*PRINT 'n' HERE*) <- list(
(*PRINT 'n' HERE*)A=
f(x=c(*INSERT COLUMN "x", ROW 2n-1 FROM "TRIALS"*),
p=c(*INSERT COLUMN "p", ROW 2n-1 FROM "TRIALS"*)),
(*PRINT 'n' HERE*)B=
f(x=c(*INSERT COLUMN "x", ROW 2n FROM "TRIALS"*),
p=c(*INSERT COLUMN "p", ROW 2n FROM "TRIALS"*)))
}
Big_list <- list(t1=t1, t2=t2, ... tn=tn)
That is, I want the code to create a numbered set of lists by drawing x and p from each pair of rows until every row of my Excel file has been used.
Any help (and feedback on how to improve this question) is greatly appreciated!

Select right kernel size for median blur to reduce noise

I am new to image processing. We have a requirement to get circle centers with sub-pixel accuracy from an image. I have used median blurring to reduce the noise. A portion of the image is shown below. The steps I followed for getting circle boundaries are given below:
Reduced the noise with medianBlur
Applied OTSU thresholding with threshold API
Identified circle boundaries with findContours method.
I get different results when I use different kernel sizes for medianBlur. I selected medianBlur to preserve edges. I tried kernel sizes 3, 5, and 7, and now I am not sure which kernel size is right.
How can I decide the right kernel size?
Is there any scientific approach to decide the right kernel size for medianBlur?
I will give you two suggestions here for how to find the centroids of these disks; you can pick one depending on the level of precision you need.
First of all, using contours is not the best method. Contours depend a lot on which pixels happen to fall within the object on thresholding, and noise affects this a lot.
A better method is to find the center of mass (or rather, the first order moments) of the disks. Read Wikipedia to learn more about moments in image analysis. One nice thing about moments is that we can use pixel values as weights, increasing precision.
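Concretely, the intensity-weighted centroid is the ratio of first-order to zeroth-order moments:

\bar{x} = \frac{M_{10}}{M_{00}} = \frac{\sum_{x,y} x\,I(x,y)}{\sum_{x,y} I(x,y)}, \qquad \bar{y} = \frac{M_{01}}{M_{00}}

With a binary mask, I(x,y) is 0 or 1 and this reduces to the mean of the pixel coordinates; with gray values as weights, edge pixels contribute fractionally, which is what buys the sub-pixel precision.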
You can compute the moments of a binary shape from its contours, but you cannot use image intensities in this case. OpenCV has a function cv::moments that computes the moments for the whole image, but I don't know of a function that can do this for each object separately. So instead I'll be using DIPlib for these computations (I'm an author).
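(If you do stay with OpenCV: cv::moments also accepts a single contour, so a minimal sketch of per-object binary centroids, without intensity weighting, could look like this.)
#include <opencv2/imgproc.hpp>
#include <vector>

// Sketch: centroid of each contour from its spatial moments (binary shape only).
std::vector<cv::Point2d> contourCentroids(const cv::Mat& binary) {
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);
    std::vector<cv::Point2d> centroids;
    for (const auto& c : contours) {
        cv::Moments m = cv::moments(c);
        if (m.m00 != 0.0) // skip degenerate contours
            centroids.emplace_back(m.m10 / m.m00, m.m01 / m.m00);
    }
    return centroids;
}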
Regarding the filtering:
Any well-behaved linear smoothing should not affect the center of mass of the objects, as long as the objects are far enough from the image edge. Being close to the edge will cause the blur to do something different on the side of the object closest to the edge compared to the other sides, introducing a bias.
Any non-linear smoothing filter has the ability to change the center of mass. Please avoid the median filter.
So, I recommend that you use a Gaussian filter, which is the most well-behaved linear smoothing filter.
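For intuition: with a normalized kernel G, convolution adds the kernel's centroid to the object's, \bar{x}_{I \ast G} = \bar{x}_I + \bar{x}_G, and a symmetric kernel such as the Gaussian has \bar{x}_G = 0, so away from the image borders the center of mass is untouched.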
Method 1: use binary shape's moments:
First I'm going to threshold without any form of blurring.
import diplib as dip
a = dip.ImageRead('/Users/cris/Downloads/Ef8ey.png')
a = a(1) # Use green channel only, simple way to convert to gray scale
_, t = dip.Threshold(a)
b = a<t
m = dip.Label(b)
msr = dip.MeasurementTool.Measure(m, None, ['Center'])
print(msr)
This outputs
  |                  Center |
- | ----------------------- |
  |       dim0 |       dim1 |
  |       (px) |       (px) |
- | ---------- | ---------- |
1 |      18.68 |      9.234 |
2 |      68.00 |      14.26 |
3 |      19.49 |      48.22 |
4 |      59.68 |      52.42 |
We can now apply a smoothing to the input image a and compute again:
a = dip.Gauss(a,2)
_, t = dip.Threshold(a)
b = a<t
m = dip.Label(b)
msr = dip.MeasurementTool.Measure(m, None, ['Center'])
print(msr)
  |                  Center |
- | ----------------------- |
  |       dim0 |       dim1 |
  |       (px) |       (px) |
- | ---------- | ---------- |
1 |      18.82 |      9.177 |
2 |      67.74 |      14.27 |
3 |      19.51 |      47.95 |
4 |      59.89 |      52.39 |
You can see there's some small change in the centroids.
Method 2: use gray scale moments:
Here we use the error function to apply a pseudo-threshold to the image. What this does is set object pixels to 1 and background pixels to 0, but pixels around the edges retain some intermediate value. Some people refer to this as a "fuzzy thresholding". These two images show the normal ("hard") threshold, and the error function clip ("fuzzy threshold"):
By using this fuzzy threshold, we retain more information about the exact (sub-pixel) location of the edges, which we can use when computing the first order moments.
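Schematically (my notation; DIPlib's exact parameterization of ErfClip may differ), the fuzzy threshold replaces the hard step [I < t] with a smooth ramp of width w around the threshold:

c(I) = \tfrac{1}{2}\left[1 - \operatorname{erf}\!\left(\frac{I - t}{w}\right)\right]

so pixels well below t (the dark disks) map to 1, pixels well above map to 0, and pixels straddling an edge keep intermediate values.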
import diplib as dip
a = dip.ImageRead('/Users/cris/Downloads/Ef8ey.png')
a = a(1) # Use green channel only, simple way to convert to gray scale
_, t = dip.Threshold(a)
c = dip.ContrastStretch(-dip.ErfClip(a, t, 30))
m = dip.Label(a<t)
m = dip.GrowRegions(m, None, -2, 2)
msr = dip.MeasurementTool.Measure(m, c, ['Gravity'])
print(msr)
This outputs
  |                 Gravity |
- | ----------------------- |
  |       dim0 |       dim1 |
  |       (px) |       (px) |
- | ---------- | ---------- |
1 |      18.75 |      9.138 |
2 |      67.89 |      14.22 |
3 |      19.50 |      48.02 |
4 |      59.79 |      52.38 |
We can now apply a smoothing to the input image a and compute again:
a = dip.Gauss(a,2)
_, t = dip.Threshold(a)
c = dip.ContrastStretch(-dip.ErfClip(a, t, 30))
m = dip.Label(a<t)
m = dip.GrowRegions(m, None, -2, 2)
msr = dip.MeasurementTool.Measure(m, c, ['Gravity'])
print(msr)
  |                 Gravity |
- | ----------------------- |
  |       dim0 |       dim1 |
  |       (px) |       (px) |
- | ---------- | ---------- |
1 |      18.76 |      9.094 |
2 |      67.87 |      14.19 |
3 |      19.50 |      48.00 |
4 |      59.81 |      52.39 |
You can see the differences are smaller this time, because the measurement is more precise.
In the binary case, the differences in centroids with and without smoothing are:
array([[ 0.14768417, -0.05677508],
[-0.256 , 0.01668085],
[ 0.02071882, -0.27547569],
[ 0.2137167 , -0.03472741]])
In the gray-scale case, the differences are:
array([[ 0.01277204, -0.04444567],
[-0.02842993, -0.0276569 ],
[-0.00023144, -0.01711335],
[ 0.01776011, 0.01123299]])
If the centroid measurement is given in µm rather than px, it is because your image file contains pixel size information. The measurement function will use this to give you real-world measurements (the centroid coordinate is w.r.t. the top-left pixel). If you do not desire this, you can reset the image's pixel size:
a.SetPixelSize(1)
The two methods in C++
This is a translation to C++ of the code above, including a display step to double-check that the thresholding produced the right result:
#include "diplib.h"
#include "dipviewer.h"
#include "diplib/simple_file_io.h"
#include "diplib/linear.h"       // for dip::Gauss()
#include "diplib/segmentation.h" // for dip::Threshold()
#include "diplib/regions.h"      // for dip::Label()
#include "diplib/measurement.h"
#include "diplib/mapping.h"      // for dip::ContrastStretch() and dip::ErfClip()
#include <iostream>

int main() {
   auto a = dip::ImageRead("/Users/cris/Downloads/Ef8ey.png");
   a = a[1]; // Use green channel only, simple way to convert to gray scale
   dip::Gauss(a, a, {2});

   dip::Image b;
   double t = dip::Threshold(a, b);
   b = a < t; // Or: dip::Invert(b,b);

   dip::viewer::Show(a);
   dip::viewer::Show(b); // Verify that the segmentation is correct
   dip::viewer::Spin();

   auto m = dip::Label(b);
   dip::MeasurementTool measurementTool;
   auto msr = measurementTool.Measure(m, {}, {"Center"});
   std::cout << msr << '\n';

   auto c = dip::ContrastStretch(-dip::ErfClip(a, t, 30));
   dip::GrowRegions(m, {}, m, -2, 2);
   msr = measurementTool.Measure(m, c, {"Gravity"});
   std::cout << msr << '\n';

   // Iterate through the measurement structure:
   auto it = msr["Gravity"].FirstObject();
   do {
      std::cout << "Centroid coordinates = " << it[0] << ", " << it[1] << '\n';
   } while (++it);
}

DataArray case-insensitive match that returns the index value of the match

I have a DataFrame inside of a function:
using DataFrames
myservs = DataFrame(serverName = ["elmo", "bigBird", "Oscar", "gRover", "BERT"],
ipAddress = ["12.345.6.7", "12.345.6.8", "12.345.6.9", "12.345.6.10", "12.345.6.11"])
myservs
5x2 DataFrame
| Row | serverName | ipAddress |
|-----|------------|---------------|
| 1 | "elmo" | "12.345.6.7" |
| 2 | "bigBird" | "12.345.6.8" |
| 3 | "Oscar" | "12.345.6.9" |
| 4 | "gRover" | "12.345.6.10" |
| 5 | "BERT" | "12.345.6.11" |
How can I write the function to take a single parameter called server, case-insensitive match the server parameter in the myservs[:serverName] DataArray, and return the match's corresponding ipAddress?
In R this can be done by using
myservs$ipAddress[grep(server, myservs$serverName, ignore.case = TRUE)]
I don't want it to matter if someone uses ElMo or Elmo as the server, or if the serverName is saved as elmo or ELMO.
I referenced how to accomplish the task in R and tried to do it using the DataFrames pkg, but I only did this because I'm coming from R and am just learning Julia. I asked a lot of questions from coworkers and the following is what we came up with:
This task is much cleaner if I stop thinking in terms of R vectors: Julia runs plenty fast iterating through a loop.
Even still, looping wouldn't be the best solution here. I was told to look into Dicts (check here for an example). Dict(), zip(), haskey(), and get() blew my mind; these have many applications.
My solution doesn't even need the DataFrames pkg; instead it uses Julia's Matrix and Array data representations. By using let we keep the global environment clutter-free, and the server name/IP list stays hidden from those who only run the function.
In the sample code, I'm recreating the server matrix every time, but in reality/practice I'll have a permission restricted delimited file that gets read every time. This is OK for now since the delimited files are small, but this may not be efficient or the best way to do it.
# ONLY ALLOW THE FUNCTION TO BE SEEN IN THE GLOBAL ENVIRONMENT
let global myIP
    # SERVER MATRIX
    myservers = ["elmo" "12.345.6.7"; "bigBird" "12.345.6.8";
                 "Oscar" "12.345.6.9"; "gRover" "12.345.6.10";
                 "BERT" "12.345.6.11"]
    # SERVER DICT
    servDict = Dict(zip(pmap(lowercase, myservers[:, 1]), myservers[:, 2]))
    # GET SERVER IP FUNCTION: INPUT = SERVER NAME; OUTPUT = IP ADDRESS
    function myIP(servername)
        sn = lowercase(servername)
        get(servDict, sn, "That name isn't in the server list.")
    end
end
# Test it out
myIP("SLIMEY")
#> "That name isn't in the server list."
myIP("elMo")
#> "12.345.6.7"
Here's one way:
julia> using DataFrames
julia> myservs = DataFrame(serverName = ["elmo", "bigBird", "Oscar", "gRover", "BERT"],
ipAddress = ["12.345.6.7", "12.345.6.8", "12.345.6.9", "12.345.6.10", "12.345.6.11"])
5x2 DataFrames.DataFrame
| Row | serverName | ipAddress |
|-----|------------|---------------|
| 1 | "elmo" | "12.345.6.7" |
| 2 | "bigBird" | "12.345.6.8" |
| 3 | "Oscar" | "12.345.6.9" |
| 4 | "gRover" | "12.345.6.10" |
| 5 | "BERT" | "12.345.6.11" |
julia> grep{T <: String}(pat::String, dat::DataArray{T}, opts::String = "") = Bool[isna(d) ? false : ismatch(Regex(pat, opts), d) for d in dat]
grep (generic function with 2 methods)
julia> myservs[:ipAddress][grep("bigbird", myservs[:serverName], "i")]
1-element DataArrays.DataArray{ASCIIString,1}:
"12.345.6.8"
EDIT
This grep works faster on my platform.
julia> function grep{T <: String}(pat::String, dat::DataArray{T}, opts::String = "")
myreg = Regex(pat, opts)
return convert(Array{Bool}, map(d -> isna(d) ? false : ismatch(myreg, d), dat))
end

Retrieving a single row of a truth table with a non-constant number of variables

I need to write a function that takes as arguments an integer, which represents a row in a truth table, and a boolean array, where it stores the values for that row of the truth table.
Here is an example truth table
Row | A | B | C |
 1  | T | T | T |
 2  | T | T | F |
 3  | T | F | T |
 4  | T | F | F |
 5  | F | T | T |
 6  | F | T | F |
 7  | F | F | T |
 8  | F | F | F |
Please note that a given truth table could have more or fewer rows than this table, since the number of possible variables can change.
A function prototype could look like this:
void getRow(int rowNum, bool boolArr[]);
If this function was called, for example, as
getRow(3, boolArr)
It would need to return an array with the following elements
|1|0|1| (or |T|F|T|)
The difficulty for me arises because the number of variables can change, therefore increasing or decreasing the number of rows. For instance, the list of variables could be A, B, C, D, E, and F instead of just A, B, and C.
I think the best solution would be to write a loop that counts up to the row number, essentially changing the elements of the array as if it were counting in binary, so that
1st loop iteration, array elements are 0|0|...|0|1|
2nd loop iteration, array elements are 0|0|...|1|0|
I can't for the life of me figure out how to do this, and can't find a solution elsewhere on the web. Sorry for all the confusion, and thanks for the help!
OK, now that you've rewritten your question to be much clearer: first, getRow needs to take an extra argument, the number of bits, because row 1 with 2 bits produces a different result than row 1 with 64 bits, so we need a way to differentiate them. Second, C++ is typically zero-indexed, so I am going to shift your truth table down one row so that row 0 returns all trues.
The key here is to realize that the row number in binary is already what you want. Take this row (your row 4, which becomes row 3 after the shift):
3 | T | F | F |
3 in binary is 011, which inverted is {true, false, false} - exactly what you want. We can express that using bitwise-and as the array:
{!(3 & 0x4), !(3 & 0x2), !(3 & 0x1)}
So it's just a matter of writing that as a loop:
void getRow(int rowNum, bool* arr, int nbits)
{
    int mask = 1 << (nbits - 1);
    for (int i = 0; i < nbits; ++i, mask >>= 1) {
        arr[i] = !(rowNum & mask);
    }
}
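For example, a short driver (hypothetical, just to illustrate the call):
#include <iostream>

void getRow(int rowNum, bool* arr, int nbits); // as defined above

int main() {
    bool row[3];
    getRow(3, row, 3); // zero-indexed row 3 of a 3-variable table
    for (int i = 0; i < 3; ++i)
        std::cout << (row[i] ? 'T' : 'F') << ' '; // prints: T F F
    std::cout << '\n';
}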

Simple Text Analysis library for C

I'm in the midst of creating my school project for our programming class.
I'm making a Medical Care system console app and I want to implement this kind of feature:
When a user enters what they are feeling (like feeling sick, having a sore throat, etc.), I want a C text-analysis library to help me analyze and parse the info given by the user (which has been saved into a string) and determine the medicine to be given. (I'll be the one to decide which medicine is for which; I just want the library to help me analyze the info given by the user.)
Thanks!
A good example would be this one:
http://www.codeproject.com/Articles/32175/Lucene-Net-Text-Analysis
Unfortunately it's for C#
Update:
Is there any C library that can help me with even simple tokenizing and indexing of the words? I know I could do it by brute-force coding... but a reliable and stable API would be better. Thanks!
Analyzing natural language text is one of the most difficult problems you could possibly pick.
Most likely your solution will come down to simply looking for keywords like "sick", "sore throat", etc., which can be accomplished with a simple dictionary of keywords and results.
As far as truly "understanding" what the user typed though - good luck with that.
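To make that concrete, here is a minimal sketch of the keyword-dictionary approach in plain C; the table entries are hypothetical placeholders, not medical advice:
#include <stdio.h>
#include <string.h>

/* Sketch: naive keyword lookup over the user's input string. */
struct rule { const char *keyword; const char *result; };

static const struct rule rules[] = {
    { "sore throat", "remedy A" },
    { "sick",        "remedy B" },
    { "headache",    "remedy C" },
};

int main(void) {
    const char *input = "i am feeling sick and have a sore throat";
    size_t i;
    for (i = 0; i < sizeof rules / sizeof rules[0]; ++i)
        if (strstr(input, rules[i].keyword) != NULL)
            printf("matched '%s' -> %s\n", rules[i].keyword, rules[i].result);
    return 0;
}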
EDIT:
A few technologies worth pointing out:
Regarding your question about a lexer - you can easily use flex if you feel you need something like that. Probably faster (in terms of execution speed AND development speed) than trying to code the multi-token search by hand.
On Mac there is a very cool framework called Latent Semantic Mapping. There is a WWDC 2011 video on it - and it's awesome. You basically feed it a ton of example inputs and train it on what result you want. It may be as close as you're going to get. It is C-based.
http://en.wikipedia.org/wiki/Latent_semantic_mapping
https://developer.apple.com/library/mac/#documentation/TextFonts/Reference/LatentSemanticMapping/index.html
This is what wakkerbot makes of your question. (The scores are low, because wakkerbot/Hubert is all Dutch.)
But the tokeniser seems to do fine on English:
[ 6]: | 29/ 27| 4.792 | weight |
------|--------+----------+---------+--------+
0 11| 15645 | 10/ 9 | 0.15469 | 0.692 |'to'
1 0| 19416 | 10/10 | 0.12504 | 0.646 |'i'
2 10| 10483 | 4/ 3 | 0.10030 | 0.84 |'and'
3 3| 3292 | 5/ 5 | 0.09403 | 1.4 |'be'
4 7| 27363 | 3/ 3 | 0.06511 | 1.4 |'one'
5 12| 36317 | 3/ 3 | 0.06511 | 8.52 |'this'
6 2| 35466 | 2/ 2 | 0.05746 | 10.7 |'just'
7 4| 12258 | 2/ 2 | 0.05301 | 0.56 |'info'
8 18| 81898 | 2/ 2 | 0.04532 | 20.1 |'ll'
9 20| 67009 | 3/ 3 | 0.04124 | 48.8 |'text'
10 13| 70575 | 2/ 2 | 0.03897 | 156 |'give'
11 19| 16806 | 2/ 2 | 0.03426 | 1.13 |'c'
12 14| 5992 | 2/ 2 | 0.03376 | 0.914 |'for'
13 1| 3940 | 1/ 1 | 0.02561 | 1.12 |'my'
14 5| 7804 | 1/ 1 | 0.02561 | 2.94 |'class'
15 17| 7920 | 1/ 1 | 0.02561 | 7.35 |'feeling'
16 15| 20429 | 3/ 2 | 0.01055 | 3.93 |'com'
17 16| 36544 | 2/ 1 | 0.00433 | 4.28 |'www'
To support my lex/nonlex tokeniser argument, this is the relevant part of wakkerbot's tokeniser:
for(pos=0; str[pos]; ) {
switch(*sp) {
case T_INIT: /* initial */
if (myisalpha(str[pos])) {*sp = T_WORD; pos++; continue; }
if (myisalnum(str[pos])) {*sp = T_NUM; pos++; continue; }
/* if (strspn(str+pos, "-+")) { *sp = T_NUM; pos++; continue; }*/
*sp = T_ANY; continue;
break;
case T_ANY: /* either whitespace or meuk: eat it */
pos += strspn(str+pos, " \t\n\r\f\b" );
if (pos) {*sp = T_INIT; return pos; }
*sp = T_MEUK; continue;
break;
case T_WORD: /* inside word */
while ( myisalnum(str[pos]) ) pos++;
if (str[pos] == '\0' ) { *sp = T_INIT;return pos; }
if (str[pos] == '.' ) { *sp = T_WORDDOT; pos++; continue; }
*sp = T_INIT; return pos;
...
As you can see, most of the time will be spent in the line with while ( myisalnum(str[pos]) ) pos++;,
which catches all the words. myisalnum() is a static function, which will probably be inlined. (There are similar tight loops for numbers and whitespace, of course)
UPDATE: for completeness, the definition for myisalpha():
static int myisalpha(int ch)
{
/* with <ctype.h>, this is a table lookup, too */
int ret = isalpha(ch);
if (ret) return ret;
/* don't parse, just assume valid utf8 */
if (ch == -1) return 0;
if (ch & 0x80) return 1;
return 0;
}
Yes, there's a C++ data science toolkit called MeTA - the ModErn Text Analysis toolkit. Here are its features:
text tokenization, including deep semantic features like parse trees
inverted and forward indexes with compression and various caching strategies
a collection of ranking functions for searching the indexes
topic models
classification algorithms
graph algorithms
language models
CRF implementation (POS-tagging, shallow parsing)
wrappers for liblinear and libsvm (including libsvm dataset parsers)
UTF8 support for analysis on various languages
multithreaded algorithms
It comes with tests and examples. In your case I think statistical classifiers, like naive Bayes, will do the job perfectly, but you can also do manual classification. It was the best fit for my own use case. Hope it helps.
Here's the link https://meta-toolkit.org/
Best Regards,