The easiest way to convert a string vector to a numeric vector in ML.NET

I'm trying to use vectors of variable length as features. Here is the data view with the features:
Label | Feature1 | Feature2
------+----------+---------
1     | 5#15#25  | 25#75
2     | 10       | 100#65#0
3     | 115#90   | 5#80
Using the TokenizeIntoWords pipeline estimator, I can split the selected features into vectors of words:
var pipeline = Context
    .Transforms
    .Conversion
    .MapValueToKey("Output", "Label")
    .Append(Context.Transforms.Text.TokenizeIntoWords("Tokens", "Feature1", new[] { '#' }))
After this, how can I specify that Tokens is a vector of float, not string? Also, how do I normalize this vector, considering that this is a row-based vector and standard ML transformers, like NormalizeMinMax, apply to column-based data?

Related

Filter by distance from coordinates (Django over MariaDB)?

I have a collection of records with latitude & longitude coordinates in my database. Is there a straightforward way to filter for other objects within some specified radius of a given object?
The intended use would be a method place.get_nearby(rad_km=1.3) -> "QuerySet[Place]", though a function find_nearby(place: Place, rad_km: float) -> "QuerySet[Place]" would also be fine.
If there is a heavily involved solution (lots of new libraries & refactoring necessary), I'll declare this to be out of scope. I currently have a method on my Places to calculate distances between them (both in radians and km), but no way to filter for nearby places.
The final use case would be to generate informational tables such as the following:
+-------------------------+-----------------------+-----------------------------+
| Northwest Plaza (1.1km) |                       | NE Corner (0.04km)          |
+-------------------------+-----------------------+-----------------------------+
| West Street (0.789km)   | You are here™         |                             |
+-------------------------+-----------------------+-----------------------------+
|                         | South Avenue (1.17km) | SW Sunset Building (0.43km) |
+-------------------------+-----------------------+-----------------------------+
Additionally, would the best way to determine which square to put an object in be arctan((lat2-lat1)/(lon2-lon1)) (assuming they're reasonably close)?

Insert a cell's logic into another cell's logic in Google Sheets

I have a column in Google Sheets where each cell contains pre-defined logic. For example, something like the second column in this table:
| 1 | =A1*-1 |
| 2 | =B2*-1 |
| -3 | =C2*-1 |
Let's say later I want to add the same logic to each cell in column B. For example, make it such that it looks like:
| 1 | =MAX(A1*-1,0) |
| 2 | =MAX(B2*-1,0) |
| -3 | =MAX(C2*-1,0) |
What is the fastest way to do this, besides manually typing MAX(...,0) in each cell? Normal Sheets functions act on the value of the cell, not the logic, so I'm a bit lost.
To my knowledge there isn't a function that pipes in the logic from one cell to another ...
try:
=ARRAYFORMULA(IF(A1:A="",,IF(SIGN(A1:A)<0, A1:A*-1, 0)))
=ARRAYFORMULA(IF(A1:A="",,IF(SIGN(A1:A)>0, A1:A, 0)))

Emotion Classification in Text Using R

I have an enormous data set of texts, from which I have separated the text that holds particular keyword/s. Here is the data set with the particular keywords. Now my next task is to classify this data set according to 8 emotions and 2 sentiments, so in total there will be 10 different classes. I got this idea from the NRC emotion lexicon, which holds 14182 different words with their emotion+sentiment classes. The main NRC work is at http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm. I know that Naive Bayes classification or clustering works well with binary classification (say, two classes: positive and negative sentiment), but when it comes to a 10-class problem, I have no idea how to proceed. I would really appreciate your suggestions. I am doing the assignment in R. The final result will be as below:
| SentencesWithKeywords                                                        | emotion or sentiment class                                                    |
|------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| conflict need resolved turned conversation exchange ideas richer environment | anger/anticipation/disgust/fear/joy/negative/positive/sadness/surprise/trust  |
| sentence2                                                                    | anger/anticipation/disgust/fear/joy/negative/positive/sadness/surprise/trust  |
You should check out the caret package (http://topepo.github.io/caret/index.html). What you are trying to do is really two different classifications (one multi-class and one two-class problem). Represent the documents as term-frequency vectors and run a classification algorithm of your choice. SVMs usually work well with bag-of-words approaches.

Search for specific characters within a column and then create different columns from it

I have a param_Value column that holds different kinds of values. I need to extract these values and create columns for all of them.
| PARAM_NAME | param_Value |
|------------|-------------|
| Step 4     | SP:0.09     |
| Procedure  | MAX:125     |
| Step 4     | SP:Ambient  |
| (null)     | +/-:N/A     |
| Steam      | SP:2        |
| Step 3     | MIN:0       |
| Step 4     | RDPHN427B   |
| Testing De | N/A         |
I only want columns for the values with the following prefixes, and I want to give them these names:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;
I'm more familiar with TSQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
     , CASE WHEN "param_Value" LIKE 'SP:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END SET_POINT_VALUE
     , CASE WHEN "param_Value" LIKE '+/-:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END UPPER_LOWER_LIMIT
     , CASE WHEN "param_Value" LIKE 'MAX:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END MAX_VALUE
     , CASE WHEN "param_Value" LIKE 'MIN:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END MIN_VALUE
  FROM PROCESS_STEPS
;
The basic concept here is identifying the information you want via LIKE, then using SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, there's no leading % in your case, so it's sargable and thus probably not a total efficiency sink.
Really, though, I have to ask why you're laying out your data like this: substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not store the data laid out the way the view you're creating presents it?

Cached data structure design

I've got a C++ program that needs to access this wind data, which is refreshed every 6 hours. As clients of the server need the data, the server queries the database and provides the data to the client. The client will use lat, lon, and mb as keys to find the 5 values.
+------------+-------+-----+-----+----------+----------+-------+------+------+
| id | lat | lon | mb | wind_dir | wind_spd | uv | vv | ts |
+------------+-------+-----+-----+----------+----------+-------+------+------+
| 1769584117 | -90.0 | 0.0 | 100 | 125 | 9 | -3.74 | 2.62 | 2112 |
| 1769584118 | -90.0 | 0.5 | 100 | 125 | 9 | -3.76 | 2.59 | 2112 |
| 1769584119 | -90.0 | 1.0 | 100 | 124 | 9 | -3.78 | 2.56 | 2112 |
Because the data changes so infrequently, I'd like the data to be cached by the server so if a client needs data previously queried, a second SQL query is not necessary.
I'm trying to determine the most efficient in-memory data structure, in terms of storage/speed, but more importantly, ease of access.
My initial thought was a map keyed by lat, containing a map keyed by lon, containing a map keyed by mb for which the value is a map containing the wind_dir, wind_speed, uv, vv and ts fields.
However, that gets complicated fast. Another thought, of course, is a 3-dimensional array (lat, lon, mb indices) containing a struct of the last 5 fields (a sketch of that appears after this question).
As I'm sitting here, I came up with the thought of combining lat, lon and mb into a string, which could be used as an index into a map, given that I'm 99% sure the combination of lat, lon and mb would always be unique.
What other ideas make sense?
Edit: More detail from comment below
In terms of data, there are 3,119,040 rows in the data set. That will be fairly constant, though it may slowly grow over the years as new reporting stations are added. There are generally between 700 and 1500 clients requesting the data. The clients are flight simulators. They'll be requesting the data every 5 minutes by default, though the maximum possible frequency would be every 30 seconds. There is no additional information; what you see above is the data that needs to be returned.
One final note I forgot to mention: I'm quite rusty in my C++ and especially STL stuff, so the simpler, the better.
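For what it's worth, here is a minimal sketch of the 3-dimensional-array idea, under the assumption that the data really is a regular grid: the sample rows step by 0.5 degrees, and 361 latitudes x 720 longitudes x 12 pressure levels works out to exactly 3,119,040 rows, though the 12 levels are an assumption on my part. The WindRecord struct, the constants, and the flatIndex helper are illustrative names, not something from the original post.
#include <cstddef>
#include <vector>
// Hypothetical record holding the five value fields from the table.
struct WindRecord
{
    int    wind_dir;
    int    wind_spd;
    double uv;
    double vv;
    int    ts;
};
// Assumed grid dimensions: lat -90..90 and lon 0..359.5 at 0.5 degree
// spacing, plus 12 pressure levels (3,119,040 / (361 * 720) == 12).
const std::size_t kNumLat = 361;
const std::size_t kNumLon = 720;
const std::size_t kNumMb  = 12;
// One contiguous allocation for the whole cache.
std::vector<WindRecord> grid(kNumLat * kNumLon * kNumMb);
// Map (lat, lon, mb index) to a position in the flat vector.  The mb index
// would come from a small lookup table of the pressure levels actually
// stored in the database.
std::size_t flatIndex(double lat, double lon, std::size_t mbIndex)
{
    std::size_t latIdx = static_cast<std::size_t>((lat + 90.0) * 2.0 + 0.5);
    std::size_t lonIdx = static_cast<std::size_t>(lon * 2.0 + 0.5);
    return (latIdx * kNumLon + lonIdx) * kNumMb + mbIndex;
}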
You can use std::map with a three-part key and a suitable less-than operator (this is what Crazy Eddie proposed, extended with a few lines of code):
struct key
{
    double mLat;
    double mLon;
    double mMb;
    key(double lat, double lon, double mb) :
        mLat(lat), mLon(lon), mMb(mb) {}
};
bool operator<(const key& a, const key& b)
{
    // Lexicographic comparison on (mLat, mLon, mMb); the member names
    // must match the struct above.
    return (a.mLat < b.mLat ||
            (a.mLat == b.mLat && a.mLon < b.mLon) ||
            (a.mLat == b.mLat && a.mLon == b.mLon && a.mMb < b.mMb));
}
Defining and inserting into the map would look like:
std::map<key, your_wind_struct> values;
values[key(-90.0, 0.0, 100)] = your_wind_struct(1769584117, 125, ...);
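Looking up a record for a client request is then a find() on the same composite key; the member access below assumes whatever fields your_wind_struct actually has (wind_dir, wind_spd, and so on), since those aren't spelled out above:
// Client lookup for lat = -90.0, lon = 0.5, mb = 100.
auto it = values.find(key(-90.0, 0.5, 100));
if (it != values.end())
{
    const your_wind_struct& rec = it->second;  // rec.wind_dir, rec.wind_spd, ...
}
else
{
    // Not cached yet: run the SQL query and insert the result into 'values'.
}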
A sorted vector also makes sense. You can feed it a less-than predicate that compares your three-part key, and you could do the same with a map or set. A hash table is another option. Which container you choose depends on a lot of factors.
Another option is the C++11 unordered_set, which uses a hash table instead of a red-black tree as the internal data structure and gives (I believe) an average lookup time of O(1) vs. O(log n) for a red-black tree. Which data structure you use depends on the characteristics of the data in question: how many pieces of data there are, how often a particular record is likely to be accessed, etc. I'm in agreement with several commenters that using a structure as a key is the cleanest way to go. It also lets you more simply alter what the unique key is, should that change in the future; you would just need to add a member to your key structure, rather than create a whole new level of maps.
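If you do go the hashing route, the map version of that idea is a C++11 std::unordered_map keyed by the same key struct; it needs a hash functor and an equality functor, since there is no default std::hash for user-defined types. The hash combiner below is just one common mixing scheme (borrowed from boost::hash_combine), not the only reasonable choice:
#include <cstddef>
#include <functional>
#include <unordered_map>
struct key_hash
{
    std::size_t operator()(const key& k) const
    {
        // Mix the hashes of the three members.
        std::size_t h = std::hash<double>()(k.mLat);
        h ^= std::hash<double>()(k.mLon) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<double>()(k.mMb)  + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};
struct key_equal
{
    bool operator()(const key& a, const key& b) const
    {
        return a.mLat == b.mLat && a.mLon == b.mLon && a.mMb == b.mMb;
    }
};
// Average O(1) lookup; the keys are not kept in any order.
std::unordered_map<key, your_wind_struct, key_hash, key_equal> windCache;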