Simple text analysis library for C / C++

I'm in the midst of creating my school project for our programming class.
I'm making a medical care system console app, and I want to implement this kind of feature:
when a user enters what they are feeling (like feeling sick, having a sore throat, etc.), I want a C text analysis library to help me analyze and parse the info given by the user (which has been saved into a string) and determine the medicine to be given. (I'll be the one to decide which medicine goes with which symptoms; I just want the library to help me analyze the info given by the user.)
Thanks!
A good example would be this one:
http://www.codeproject.com/Articles/32175/Lucene-Net-Text-Analysis
Unfortunately it's for C#
Update:
Is there any C library that can help me, even just for simple tokenizing and indexing of words? I know I could do it with brute-force coding, but a reliable and stable API would be better. Thanks!

Analyzing natural-language text is one of the most difficult problems you could possibly pick.
Most likely your solution will come down to simply looking for keywords like "sick", "sore throat", etc., which can be accomplished with a simple dictionary of keywords and results (a sketch of that approach is below).
As far as truly "understanding" what the user typed, though - good luck with that.
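If it helps you get started, here is a minimal sketch of the keyword-dictionary idea in plain C. The symptom/medicine pairs are invented for illustration; only you can decide the real mappings:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Illustrative symptom -> medicine table; the entries are made up. */
struct rule { const char *keyword; const char *medicine; };

static const struct rule rules[] = {
    { "sore throat", "lozenges"    },
    { "fever",       "paracetamol" },
    { "headache",    "ibuprofen"   },
};

/* Portable case-insensitive substring test (strcasestr is not standard C). */
static int contains_ci(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    for (; *haystack; haystack++) {
        size_t i;
        for (i = 0; i < nlen; i++)
            if (tolower((unsigned char)haystack[i]) !=
                tolower((unsigned char)needle[i]))
                break;
        if (i == nlen)
            return 1;
    }
    return 0;
}

int main(void)
{
    const char input[] = "I have a fever and a sore throat";
    size_t i;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (contains_ci(input, rules[i].keyword))
            printf("matched \"%s\" -> suggest %s\n",
                   rules[i].keyword, rules[i].medicine);
    return 0;
}

In C++ the same table would be a std::map<std::string, std::string>; the point is only that plain substring matching already gets you surprisingly far here.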
EDIT:
A few technologies worth pointing out:
Regarding your question about a lexer - you can easily use flex if you feel you need something like that. It is probably faster (in terms of execution speed AND development speed) than trying to code the multi-token search by hand.
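For a sense of scale, a keyword-spotting flex specification can be as small as this (the token actions are placeholders; build with flex rules.l && cc lex.yy.c):

%option noyywrap
%%
"sore throat"   { printf("SYMPTOM: sore throat\n"); }
"sick"          { printf("SYMPTOM: sick\n"); }
[ \t\r\n]+      ; /* skip whitespace */
.               ; /* ignore anything else */
%%
int main(void) { yylex(); return 0; }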
On Mac there is a very cool framework called Latent Semantic Mapping. There is a WWDC 2011 video on it - and it's awesome. You basically feed it a ton of example inputs and train it on what result you want. It may be as close as you're going to get. It is C-based.
http://en.wikipedia.org/wiki/Latent_semantic_mapping
https://developer.apple.com/library/mac/#documentation/TextFonts/Reference/LatentSemanticMapping/index.html

This is what wakkerbot makes of your question. (The scores are low, because wakkerbot/Hubert is all Dutch.)
But the tokeniser seems to do fine on English:
[ 6]: | 29/ 27| 4.792 | weight |
------|--------+----------+---------+--------+
0 11| 15645 | 10/ 9 | 0.15469 | 0.692 |'to'
1 0| 19416 | 10/10 | 0.12504 | 0.646 |'i'
2 10| 10483 | 4/ 3 | 0.10030 | 0.84 |'and'
3 3| 3292 | 5/ 5 | 0.09403 | 1.4 |'be'
4 7| 27363 | 3/ 3 | 0.06511 | 1.4 |'one'
5 12| 36317 | 3/ 3 | 0.06511 | 8.52 |'this'
6 2| 35466 | 2/ 2 | 0.05746 | 10.7 |'just'
7 4| 12258 | 2/ 2 | 0.05301 | 0.56 |'info'
8 18| 81898 | 2/ 2 | 0.04532 | 20.1 |'ll'
9 20| 67009 | 3/ 3 | 0.04124 | 48.8 |'text'
10 13| 70575 | 2/ 2 | 0.03897 | 156 |'give'
11 19| 16806 | 2/ 2 | 0.03426 | 1.13 |'c'
12 14| 5992 | 2/ 2 | 0.03376 | 0.914 |'for'
13 1| 3940 | 1/ 1 | 0.02561 | 1.12 |'my'
14 5| 7804 | 1/ 1 | 0.02561 | 2.94 |'class'
15 17| 7920 | 1/ 1 | 0.02561 | 7.35 |'feeling'
16 15| 20429 | 3/ 2 | 0.01055 | 3.93 |'com'
17 16| 36544 | 2/ 1 | 0.00433 | 4.28 |'www'
To support my lex/nonlex tokeniser argument, this is the relevant part of wakkerbot's tokeniser:
for (pos = 0; str[pos]; ) {
    switch (*sp) {
    case T_INIT: /* initial */
        if (myisalpha(str[pos])) { *sp = T_WORD; pos++; continue; }
        if (myisalnum(str[pos])) { *sp = T_NUM; pos++; continue; }
        /* if (strspn(str+pos, "-+")) { *sp = T_NUM; pos++; continue; } */
        *sp = T_ANY; continue;
        break;
    case T_ANY: /* either whitespace or "meuk" (junk): eat it */
        pos += strspn(str+pos, " \t\n\r\f\b");
        if (pos) { *sp = T_INIT; return pos; }
        *sp = T_MEUK; continue;
        break;
    case T_WORD: /* inside a word */
        while (myisalnum(str[pos])) pos++;
        if (str[pos] == '\0') { *sp = T_INIT; return pos; }
        if (str[pos] == '.') { *sp = T_WORDDOT; pos++; continue; }
        *sp = T_INIT; return pos;
...
As you can see, most of the time will be spent in the line while (myisalnum(str[pos])) pos++;, which catches all the words. myisalnum() is a static function, which will probably be inlined. (There are similar tight loops for numbers and whitespace, of course.)
UPDATE: for completeness, the definition for myisalpha():
static int myisalpha(int ch)
{
    /* with <ctype.h>, this is a table lookup, too */
    int ret = isalpha(ch);
    if (ret) return ret;
    /* don't parse, just assume valid utf8 */
    if (ch == -1) return 0;
    if (ch & 0x80) return 1;
    return 0;
}

Yes, there's a C++ data science toolkit called MeTA - the ModErn Text Analysis toolkit. Here are its features:
text tokenization, including deep semantic features like parse trees
inverted and forward indexes with compression and various caching strategies
a collection of ranking functions for searching the indexes
topic models
classification algorithms
graph algorithms
language models
CRF implementation (POS-tagging, shallow parsing)
wrappers for liblinear and libsvm (including libsvm dataset parsers)
UTF8 support for analysis on various languages
multithreaded algorithms
It comes with tests and examples. In your case I think a statistical classifier, like naive Bayes, will do the job perfectly (see the sketch below), but you can also do manual classification. It was the best fit for my own use case. Hope it helps.
Here's the link https://meta-toolkit.org/
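Before wiring up the whole toolkit, it may help to see how little code the statistical idea itself needs. Below is a toy naive Bayes classifier in plain C++17 (my own sketch, not MeTA's API) with invented labels and training sentences, just to show why this kind of classifier fits the symptom -> medicine problem:

#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy multinomial naive Bayes: train on (label, text) pairs, then classify.
struct NaiveBayes {
    std::map<std::string, std::map<std::string, int>> counts; // label -> word -> count
    std::map<std::string, int> totals;                        // label -> total words
    std::map<std::string, int> docs;                          // label -> training docs
    int ndocs = 0;

    static std::vector<std::string> tokenize(const std::string& s) {
        std::istringstream in(s);
        std::vector<std::string> out;
        for (std::string w; in >> w; ) out.push_back(w);
        return out;
    }

    void train(const std::string& label, const std::string& text) {
        for (const auto& w : tokenize(text)) { ++counts[label][w]; ++totals[label]; }
        ++docs[label]; ++ndocs;
    }

    std::string classify(const std::string& text) const {
        std::string best;
        double bestScore = -1e300;
        for (const auto& [label, words] : counts) {
            double score = std::log(double(docs.at(label)) / ndocs); // class prior
            for (const auto& w : tokenize(text)) {
                auto it = words.find(w);
                int c = (it == words.end()) ? 0 : it->second;
                // add-one smoothing so unseen words don't zero out the score
                score += std::log((c + 1.0) / (totals.at(label) + words.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
};

int main() {
    NaiveBayes nb; // labels and training sentences are invented
    nb.train("cold remedy", "sore throat runny nose sneezing");
    nb.train("flu remedy",  "fever chills body ache fatigue");
    std::cout << nb.classify("i have a fever and chills") << "\n"; // flu remedy
}

MeTA's real classifiers do the same thing, with proper tokenization, feature selection, and evaluation tooling on top.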
Best Regards,

Related

rust how to collapse if let - clippy suggestion

I ran cargo clippy to get some feedback on my code, and clippy told me that I can somehow collapse an if let.
Here is the exact "warning":
warning: this `if let` can be collapsed into the outer `if let`
--> src\main.rs:107:21
|
107 | / if let Move::Normal { piece, from, to } = turn {
108 | | if i8::abs(from.1 - to.1) == 2 && piece.getColor() != *color && to.0 == x {
109 | | let offsetX = x - to.0;
110 | |
... |
116 | | }
117 | | }
| |_____________________^
I thought I could maybe just append the inner if using &&, but then I get "`let` expressions in this position are experimental" (I am using Rust version 1.57.0, not nightly).
Any idea what clippy wants me to do?
Edit:
The outer if let is itself nested inside another if let:
if let Some(turn) = board.getLastMove() {
And it seems you can indeed combine them like so:
if let Some(Move::Normal { piece, from, to }) = board.getLastMove() {
In my opinion, the clippy lint should include the line above, as it is otherwise (at least for me) somewhat confusing.
Edit 2:
Turns out I just can't read: below the warning listed above was some more information telling me exactly what to do.
= note: `#[warn(clippy::collapsible_match)]` on by default
help: the outer pattern can be modified to include the inner pattern
--> src\main.rs:126:29
|
126 | if let Some(turn) = board.getLastMove() {
| ^^^^ replace this binding
127 | if let Move::Normal { piece, from, to } = turn {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ with this pattern
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#collapsible_match
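For reference, here is the transformation as a tiny self-contained program. The Move type and the values are invented stand-ins for the question's chess code, with the piece/color checks dropped for brevity:

enum Move {
    Normal { from: (i8, i8), to: (i8, i8) },
}

fn get_last_move() -> Option<Move> {
    Some(Move::Normal { from: (4, 1), to: (4, 3) })
}

fn main() {
    // Before: `if let Some(turn) = get_last_move()` wrapping a second
    // `if let Move::Normal { .. } = turn` fires clippy::collapsible_match.
    // After: the two patterns are merged into one, as the lint's help says.
    if let Some(Move::Normal { from, to }) = get_last_move() {
        if (from.1 - to.1).abs() == 2 {
            println!("double pawn move to {:?}", to);
        }
    }
}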

antlr visitor: lookup of reserved words efficiently

I'm learning Antlr. At this point, I'm writing a little stack-based language as part of my learning process -- think PostScript or Forth. An RPN language. For instance:
10 20 mul
This would push 10 and 20 on the stack and then perform a multiply, which pops two values, multiplies them, and pushes 200. I'm using the visitor pattern. And I find myself writing some code that's kind of insane. There has to be a better way.
Here's a section of my WaveParser.g4 file:
any_operator:
value_operator |
stack_operator |
logic_operator |
math_operator |
flow_control_operator;
value_operator:
BIND | DEF
;
stack_operator:
DUP |
EXCH |
POP |
COPY |
ROLL |
INDEX |
CLEAR |
COUNT
;
BIND is just the bind keyword, etc. So my visitor has this method:
antlrcpp::Any WaveVisitor::visitAny_operator(Parser::Any_operatorContext *ctx);
And now here's where I'm getting to the very ugly code I'm writing, which leads to the question.
Value::Operator op = Value::Operator::NO_OP;
WaveParser::Value_operatorContext * valueOp = ctx->value_operator();
WaveParser::Stack_operatorContext * stackOp = ctx->stack_operator();
WaveParser::Logic_operatorContext * logicOp = ctx->logic_operator();
WaveParser::Math_operatorContext * mathOp = ctx->math_operator();
WaveParser::Flow_control_operatorContext * flowOp = ctx->flow_control_operator();
if (valueOp) {
    if (valueOp->BIND()) {
        op = Value::Operator::BIND;
    }
    else if (valueOp->DEF()) {
        op = Value::Operator::DEF;
    }
}
else if (stackOp) {
    if (stackOp->DUP()) {
        op = Value::Operator::DUP;
    }
    ...
}
...
I'm supporting approximately 50 operators, and it's insane that I'm going to have this series of if statements to figure out which operator this is. There must be a better way to do this. I couldn't find a field on the context that maps to something I could use as a hashmap key.
I don't know if I should make every one of my operators a separate rule and use the corresponding method in my visitor, or what else I'm missing.
Is there a better way?
With ANTLR, it's usually very helpful to label components of your rules, as well as the high level alternatives.
If part of a parser rule can only be one thing with a single type, usually the default accessors are just fine. But if you have several alternatives that are essentially alternatives for the "same thing", or perhaps you have the same sub-rule reference in a parser rule more than one time and want to differentiate them, it's pretty handy to give them names. (Once you start doing this and see the impact to the Context classes, it'll become pretty obvious where they provide value.)
Also, when rules have multiple top-level alternatives, it's very handy to give each of them a label. This will cause ANTLR to generate a separate Context class for each alternative, instead of dumping everything from every alternative into a single class.
(making some stuff up just to get a valid compile)
grammar WaveParser
;
any_operator
: value_operator # val_op
| stack_operator # stack_op
| logic_operator # logic_op
| math_operator # math_op
| flow_control_operator # flow_op
;
value_operator: op = ( BIND | DEF);
stack_operator
: op = (
DUP
| EXCH
| POP
| COPY
| ROLL
| INDEX
| CLEAR
| COUNT
)
;
logic_operator: op = (AND | OR);
math_operator: op = (ADD | SUB);
flow_control_operator: op = (FLOW1 | FLOW2);
AND: 'and';
OR: 'or';
ADD: '+';
SUB: '-';
FLOW1: '>>';
FLOW2: '<<';
BIND: 'bind';
DEF: 'def';
DUP: 'dup';
EXCH: 'exch';
POP: 'pop';
COPY: 'copy';
ROLL: 'roll';
INDEX: 'index';
CLEAR: 'clear';
COUNT: 'count';
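On the visitor side, the payoff of the op labels (assuming the C++ target, where a labeled token becomes a public antlr4::Token *op field on the generated context) is that one lookup table replaces the entire if/else chain. A sketch, reusing Value::Operator from the question; the table contents are illustrative:

#include <unordered_map>

antlrcpp::Any WaveVisitor::visitValue_operator(WaveParser::Value_operatorContext *ctx)
{
    static const std::unordered_map<size_t, Value::Operator> table = {
        { WaveParser::BIND, Value::Operator::BIND },
        { WaveParser::DEF,  Value::Operator::DEF  },
        // ... one entry per operator token; all ~50 fit in one table
    };
    auto it = table.find(ctx->op->getType());
    return it != table.end() ? it->second : Value::Operator::NO_OP;
}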

Pyspark filter dataframe by columns of another dataframe

Not sure why I'm having a difficult time with this; it seems so simple considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in pyspark.
I have 2 dataframes: df1 and df2. I want to filter df1, removing all rows where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure if I should use filter(), join(), or SQL. For example:
df1:
+------+----------+--------------------+
|userid| group | all_picks |
+------+----------+--------------------+
| 348| 2|[225, 2235, 2225] |
| 567| 1|[1110, 1150] |
| 595| 1|[1150, 1150, 1150] |
| 580| 2|[2240, 2225] |
| 448| 1|[1130] |
+------+----------+--------------------+
df2:
+------+----------+---------+
|userid| group | pick |
+------+----------+---------+
| 348| 2| 2270|
| 595| 1| 2125|
+------+----------+---------+
Result I want:
+------+----------+--------------------+
|userid| group | all_picks |
+------+----------+--------------------+
| 567| 1|[1110, 1150] |
| 580| 2|[2240, 2225] |
| 448| 1|[1130] |
+------+----------+--------------------+
EDIT:
I've tried many join() and filter() functions, I believe the closest I got was:
cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks) # Result has 7 rows
I tried a bunch of different join types, and I also tried different cond values:
cond = ((df1.userid == df2.userid) & (df2.group == df2.group)) # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group)) # result has 2 rows
However, it seems like the joins are adding additional rows rather than deleting them.
I'm using python 2.7 and spark 2.1.0
Left anti join is what you're looking for:
df1.join(df2, ["userid", "group"], "leftanti")
but the same thing can be done with left outer join:
(df1
.join(df2, ["userid", "group"], "leftouter")
.where(df2["pick"].isNull())
.drop(df2["pick"]))
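For completeness, here is the anti-join run end to end, with the dataframes rebuilt by hand to match the sample tables above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]), (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]), (580, 2, [2240, 2225]), (448, 1, [1130])],
    ["userid", "group", "all_picks"])

df2 = spark.createDataFrame(
    [(348, 2, 2270), (595, 1, 2125)],
    ["userid", "group", "pick"])

# keep only df1 rows whose (userid, group) pair has no match in df2
df1.join(df2, ["userid", "group"], "leftanti").show()
# leaves userids 567, 580, and 448, as in the desired result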

DataArray case-insensitive match that returns the index value of the match

I have a DataFrame inside of a function:
using DataFrames
myservs = DataFrame(serverName = ["elmo", "bigBird", "Oscar", "gRover", "BERT"],
ipAddress = ["12.345.6.7", "12.345.6.8", "12.345.6.9", "12.345.6.10", "12.345.6.11"])
myservs
5x2 DataFrame
| Row | serverName | ipAddress |
|-----|------------|---------------|
| 1 | "elmo" | "12.345.6.7" |
| 2 | "bigBird" | "12.345.6.8" |
| 3 | "Oscar" | "12.345.6.9" |
| 4 | "gRover" | "12.345.6.10" |
| 5 | "BERT" | "12.345.6.11" |
How can I write the function to take a single parameter called server, case-insensitive match the server parameter in the myservs[:serverName] DataArray, and return the match's corresponding ipAddress?
In R this can be done by using
myservs$ipAddress[grep("server", myservs$serverName, ignore.case = T)]
I don't want it to matter if someone uses ElMo or Elmo as the server, or if the serverName is saved as elmo or ELMO.
I referenced how to accomplish the task in R and tried to do it using the DataFrames pkg, but only because I'm coming from R and am just learning Julia. I asked my coworkers a lot of questions, and the following is what we came up with:
This task is much cleaner if I stop thinking in terms of vectors, as in R. Julia runs plenty fast iterating through a loop.
Even so, looping wouldn't be the best solution here. I was told to look into Dicts (check here for an example). Dict(), zip(), haskey(), and get() blew my mind. These have many applications.
My solution doesn't even need the DataFrames pkg; instead it uses Julia's Matrix and Array data representations. By using let we keep the global environment clutter-free, and the server name/IP list stays hidden from view of those who only run the function.
In the sample code, I'm recreating the server matrix every time, but in reality/practice I'll have a permission restricted delimited file that gets read every time. This is OK for now since the delimited files are small, but this may not be efficient or the best way to do it.
# ONLY ALLOW THE FUNCTION TO BE SEEN IN THE GLOBAL ENVIRONMENT
let global myIP
    # SERVER MATRIX
    myservers = ["elmo" "12.345.6.7"; "bigBird" "12.345.6.8";
                 "Oscar" "12.345.6.9"; "gRover" "12.345.6.10";
                 "BERT" "12.345.6.11"]
    # SERVER DICT
    servDict = Dict(zip(pmap(lowercase, myservers[:, 1]), myservers[:, 2]))
    # GET SERVER IP FUNCTION: INPUT = SERVER NAME; OUTPUT = IP ADDRESS
    function myIP(servername)
        sn = lowercase(servername)
        get(servDict, sn, "That name isn't in the server list.")
    end
end

# Test it out
myIP("SLIMEY")
#> "That name isn't in the server list."
myIP("elMo")
#> "12.345.6.7"
Here's one way:
julia> using DataFrames
julia> myservs = DataFrame(serverName = ["elmo", "bigBird", "Oscar", "gRover", "BERT"],
ipAddress = ["12.345.6.7", "12.345.6.8", "12.345.6.9", "12.345.6.10", "12.345.6.11"])
5x2 DataFrames.DataFrame
| Row | serverName | ipAddress |
|-----|------------|---------------|
| 1 | "elmo" | "12.345.6.7" |
| 2 | "bigBird" | "12.345.6.8" |
| 3 | "Oscar" | "12.345.6.9" |
| 4 | "gRover" | "12.345.6.10" |
| 5 | "BERT" | "12.345.6.11" |
julia> grep{T <: String}(pat::String, dat::DataArray{T}, opts::String = "") = Bool[isna(d) ? false : ismatch(Regex(pat, opts), d) for d in dat]
grep (generic function with 2 methods)
julia> myservs[:ipAddress][grep("bigbird", myservs[:serverName], "i")]
1-element DataArrays.DataArray{ASCIIString,1}:
"12.345.6.8"
EDIT
This grep works faster on my platform.
julia> function grep{T <: String}(pat::String, dat::DataArray{T}, opts::String = "")
           myreg = Regex(pat, opts)
           return convert(Array{Bool}, map(d -> isna(d) ? false : ismatch(myreg, d), dat))
       end

check if the value already exists in vector

I have made a form that collects data which is then sent to a database.
Database has 2 tables, one is main and second one is in relation 1-to-many with it.
To make things clear, I will name them: main table is Table1, and child table is ElectricEnergy.
In table ElectricEnergy is stored energy consumption through months and year, so the table has following schema:
ElectricEnergy< #ElectricEnergy_pk, $Table1_pk, January,February, ...,December, Year>
In the form, the user can enter data for a specific year. I will try to illustrate this below:
Year: 2012
January : 20.5 kWh
February: 250.32 kWh
and so on.
Filled table looks like this:
YEAR | January | February | ... | December | Table1_pk | ElectricEnergy_pk |
2012 | 20.5 | 250.32 | ... | 300.45 | 1 | 1 |
2013 | 10.5 | 50.32 | ... | 300 | 1 | 2 |
2012 | 50.5 | 150.32 | ... | 400.45 | 2 | 3 |
Since the number of years for which consumption can be stored is unknown, I have decided to use a vector to store them.
Since vectors can't contain arrays, and I need an array of 13 values (12 months + the year), I have decided to store each year's form data in a vector.
Since the data has decimals in it, the vector's element type is double.
A small clarification:
vector<double> DataForSingleYear;
vector< vector<double> > CollectionOfYears;
I can successfully push data into the vector DataForSingleYear, and I can successfully push all those years into the vector CollectionOfYears.
The problem is that the user can enter the same year into the edit box many times and add different values for the monthly consumption, which would create duplicates.
It would look something like this:
YEAR | January | February | ... | December | Table1_pk | ElectricEnergy_pk |
2012 | 20.5 | 250.32 | ... | 300.45 | 1 | 1 |
2012 | 2.5 | 50.32 | ... | 300 | 1 | 2(duplicate!) |
2013 | 10.5 | 50.32 | ... | 300 | 1 | 3 |
2012 | 50.5 | 150.32 | ... | 400.45 | 2 | 4 |
My question is:
What is the best solution to check if that value is in the vector ?
I know that question is “broad” one, but I could use at least an idea just to get me started.
NOTE:
The year is at the end of each inner vector, so its index is 12.
The order of the data inserted into the database is NOT important; there are no sorting requirements whatsoever.
Browsing through the SO archive, I found suggestions to use std::set, but its documentation says that elements can't be modified once inserted, and that is an unacceptable option for me.
On the other hand, std::find looks interesting.
(THIS PART WAS REMOVED WHEN I EDITED THE QUESTION:
", but does not handle the last element, and the year is at the end of the vector. That can change, and I am willing to make that small adjustment if std::find can help me.")
The only thing that crossed my mind was to loop through the vectors and see if the value already exists, but I don't think that is the best solution:
wchar_t temp[50];
GetDlgItemText( hwnd, IDC_EDIT1, temp, 50 ); // get the year
double year = _wtof( temp ); // convert it to double,
                             // so I can push it to the end of the vector

bool exists = false; // indicates if the year is already in the vector
for ( vector< vector<double> >::size_type i = 0;
      i < CollectionOfYears.size(); i++ )
{
    if ( CollectionOfYears[ i ][ (vector<double>::size_type)12 ] == year )
    {
        exists = true;
        break;
    }
}

if ( !exists )
    // store main vector in the database
else
    MessageBox( ... , L"Error", ... );
I work on Windows XP, in MS Visual Studio, using C++ and pure Win32.
If additional code is needed, ask, I will post it.
Thank you.
Using std::find_if and a lambda as the filter:
auto match = std::find_if( CollectionOfYears.begin(), CollectionOfYears.end(),
                           [year]( const std::vector<double> & v ) { return year == v.back(); } );
if ( match == CollectionOfYears.end() ) {
    // no such year stored previously
}
This still iterates the whole array. If you need a more efficient search, keep the array sorted and use binary search, or use std::set (see the sketch below).
Note that vector::end() returns an iterator to the position one past the last element, so the range [begin(), end()) still covers every element; std::find returning end() simply means "not found", not that the last value was skipped.
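For example, here is a sketch of the auxiliary-set idea: mirror just the years in a std::set<double>, so duplicate checks are O(log n) while the rows themselves stay editable in the vector (which sidesteps the objection that std::set elements can't be modified):

#include <iostream>
#include <set>
#include <vector>

int main()
{
    std::vector< std::vector<double> > CollectionOfYears;
    std::set<double> seenYears; // only the years, kept in step with the vector

    std::vector<double> row(13, 0.0); // 12 months + year
    row[12] = 2012.0;                 // the year lives at index 12, as in the question

    if ( seenYears.insert(row[12]).second )  // .second is false for a duplicate
        CollectionOfYears.push_back(row);    // first time this year appears
    else
        std::cout << "Year already entered\n"; // duplicate: reject / show error
}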