Emotion Classification in Text Using R

I have an enormous data set of texts, from which I have separated the texts that contain particular keywords. Here is the data set with particular keywords. Now my next task is to classify this data set into 8 emotions and 2 sentiments, 10 different classes in total. I got this idea from the NRC emotion lexicon, which holds 14,182 different words with their emotion and sentiment classes. The main NRC work is at http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm. I know that Naive Bayes classification or clustering works well for binary classification (say, two classes: positive and negative sentiment), but I have no idea how to proceed when it becomes a 10-class problem. I would really appreciate your suggestions. I am doing the assignment in R. The final result will be as below:
| SentencesWithKeywords                                                        | emotion or sentiment class                                                    |
|------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| conflict need resolved turned conversation exchange ideas richer environment | anger/anticipation/disgust/fear/joy/negative/positive/sadness/surprise/trust |
| sentence2                                                                    | anger/anticipation/disgust/fear/joy/negative/positive/sadness/surprise/trust |

You should check out the caret package (http://topepo.github.io/caret/index.html). What you are trying to do is really two different classifications (one multi-class problem and one two-class problem). Represent each document as a term-frequency vector and run a classification algorithm of your choice. SVMs usually work well with bag-of-words approaches.
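As a rough sketch of that pipeline in R (assuming a data frame df with a text column and an emotion label column; the column names, the svmLinear choice, and the 5-fold cross-validation are illustrative, not prescribed):

library(tm)     # corpus handling and document-term matrices
library(caret)  # unified interface for training classifiers

# Represent each document as a term-frequency vector
corpus <- VCorpus(VectorSource(df$text))
dtm    <- DocumentTermMatrix(corpus)

# caret handles multi-class targets transparently; svmLinear (via the
# kernlab package) trains one-vs-one SVMs under the hood. Train once
# for the 8 emotion classes, then again for the 2 sentiment classes.
fit <- train(x = as.matrix(dtm),
             y = factor(df$emotion),
             method    = "svmLinear",
             trControl = trainControl(method = "cv", number = 5))

predict(fit, newdata = as.matrix(dtm))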

Related

Filter by distance from coordinates (Django over MariaDB)?

I have a collection of records with latitude & longitude coordinates in my database. Is there a straightforward way to filter for other objects within some specified radius of a given object?
The intended use would be a method place.get_nearby(rad_km=1.3) -> "QuerySet[Place]", though a function find_nearby(place: Place, rad_km: float) -> "QuerySet[Place]" would be fine too.
If the solution is heavily involved (lots of new libraries and refactoring necessary), I'll declare this out of scope. I currently have a method on my Place model to calculate distances between places (both in radians and in km), but no way to filter for nearby places.
The final use case would be to generate informational tables such as the following:
+-------------------------+-----------------------+-----------------------------+
| Northwest Plaza (1.1km) | | NE Corner (0.04km) |
+-------------------------+-----------------------+-----------------------------+
| West Street (0.789km) | You are here™ | |
+-------------------------+-----------------------+-----------------------------+
| | South Avenue (1.17km) | SW Sunset Building (0.43km) |
+-------------------------+-----------------------+-----------------------------+
Additionally, would the best way to determine which square to put an object in be arctan((lat2-lat1)/(lon2-lon1)) (assuming they're reasonably close)?
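For what it's worth, a minimal sketch of the usual bounding-box approach, with no GIS extensions required on MariaDB (it assumes a hypothetical Place model with lat/lon float fields and reuses the existing km distance method, here called distance_km):

import math

EARTH_RADIUS_KM = 6371.0

def find_nearby(place, rad_km):
    # Box half-width in degrees: latitude degrees are ~constant,
    # longitude degrees shrink by cos(latitude).
    dlat = math.degrees(rad_km / EARTH_RADIUS_KM)
    dlon = dlat / max(math.cos(math.radians(place.lat)), 1e-6)
    candidates = Place.objects.filter(
        lat__range=(place.lat - dlat, place.lat + dlat),
        lon__range=(place.lon - dlon, place.lon + dlon),
    ).exclude(pk=place.pk)
    # The box overshoots at its corners, so refine with the exact distance.
    ids = [p.pk for p in candidates if place.distance_km(p) <= rad_km]
    return Place.objects.filter(pk__in=ids)

The two range filters run in the database as plain indexable comparisons; only the corner overshoot is trimmed in Python.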

How do I find a change point in a time series in Power BI

I have a group of people who started receiving a specific type of social benefit called BenefitA, and I am interested in knowing what (if any) social benefits the people in the group might have received immediately before they started receiving BenefitA.
My ideal result would be a table with the number of people who were receiving BenefitB, BenefitC, or no benefit at all ("BenefitNon") immediately before they started receiving BenefitA.
My data is organized as a relational database with a fact table containing an ID for each person in my data and several dimension tables connected to the fact table. The important ones here are DimDreamYdelse (showing the type of benefit received) and DimDreamTid (showing week and year). Here is an example of the raw data.
Data Example
I'm not sure how to approach this in Power BI, as I am fairly new to the program. Any advice is most welcome.
I have tried to solve the problem in SQL, but as I need this as part of a running report, I need to do it in Power BI. This bit of code might, however, give some context for what I want to do.
USE FLISDATA_Beskaeftigelse;

SELECT dbo.FactDream.DimDreamTid, dbo.FactDream.DimDreamBenefit,
       dbo.DimDreamTid.Aar, dbo.DimDreamTid.UgeIAar, dbo.DimDreamYdelse.Benefit
FROM dbo.FactDream
INNER JOIN dbo.DimDreamTid ON dbo.FactDream.DimDreamTid = dbo.DimDreamTid.DimDreamTidID
INNER JOIN dbo.DimDreamYdelse ON dbo.FactDream.DimDreamBenefit = dbo.DimDreamYdelse.DimDreamBenefitID
WHERE (dbo.DimDreamYdelse.Ydelse LIKE 'Benefit%') AND (dbo.DimDreamTid.Aar = '2019')
ORDER BY dbo.DimDreamTid.Aar, dbo.DimDreamTid.UgeIAar
I suggest using Power Query to transform your table into a form more suitable for your analysis. Things would be much easier if each row of the table represented a "change" of benefit plan, like this:
| Person ID | Benefit From | Benefit To | Date |
|-----------|--------------|------------|------------|
| 15 | BenefitNon | BenefitA | 2019-07-01 |
| 15 | BenefitA | BenefitNon | 2019-12-01 |
| 17 | BenefitC | BenefitA | 2019-06-01 |
| 17 | BenefitA | BenefitB | 2019-08-01 |
| 17 | BenefitB | BenefitA | 2019-09-01 |
| ...
Then you can simply count the rows with COUNTROWS(BenefitChanges), filtering/slicing on both Benefit From and Benefit To.
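For example, a measure counting the people who came from BenefitB might look like this (a sketch; BenefitChanges and its column names are the hypothetical ones from the layout above):

FromBenefitB =
CALCULATE (
    COUNTROWS ( BenefitChanges ),
    BenefitChanges[Benefit From] = "BenefitB",
    BenefitChanges[Benefit To] = "BenefitA"
)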

Search for specific characters within a column and then create different columns from it

I have a param_Value column that holds different kinds of values. I need to extract these values and create columns for each of them.
| PARAM_NAME | param_Value |
|------------|-------------|
| Step 4     | SP:0.09     |
| Procedure  | MAX:125     |
| Step 4     | SP:Ambient  |
| (null)     | +/-:N/A     |
| Steam      | SP:2        |
| Step 3     | MIN:0       |
| Step 4     | RDPHN427B   |
| Testing De | N/A         |
I only want columns for the values with these prefixes, named as follows:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;
I'm more familiar with T-SQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
  ("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
     , CASE WHEN "param_Value" LIKE 'SP:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END SET_POINT_VALUE
     , CASE WHEN "param_Value" LIKE '+/-:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END UPPER_LOWER_LIMIT
     , CASE WHEN "param_Value" LIKE 'MAX:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END MAX_VALUE
     , CASE WHEN "param_Value" LIKE 'MIN:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END MIN_VALUE
FROM PROCESS_STEPS;
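Against the sample rows above, the view would return something like this (rows without a recognized prefix come back all NULL):

| PARAM_NAME | SET_POINT_VALUE | UPPER_LOWER_LIMIT | MAX_VALUE | MIN_VALUE |
|------------|-----------------|-------------------|-----------|-----------|
| Step 4     | 0.09            | (null)            | (null)    | (null)    |
| Procedure  | (null)          | (null)            | 125       | (null)    |
| Step 4     | Ambient         | (null)            | (null)    | (null)    |
| (null)     | (null)          | N/A               | (null)    | (null)    |
| Steam      | 2               | (null)            | (null)    | (null)    |
| Step 3     | (null)          | (null)            | (null)    | 0         |
| Step 4     | (null)          | (null)            | (null)    | (null)    |
| Testing De | (null)          | (null)            | (null)    | (null)    |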
The basic concept here is identifying the information you want via LIKE, then using SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, there's no leading % in your case, so it's sargable and thus probably not a total efficiency sink.
Really, though, I have to ask why you're laying out your data like this: substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not lay it out the way the view you're building does?

Cached data structure design

I've got a C++ program that needs to access this wind data, refreshed every 6 hours. As clients of the server need the data, the server queries the database and provides the data to the client. The client will use lat, lon, and mb as keys to find the five remaining values.
+------------+-------+-----+-----+----------+----------+-------+------+------+
| id | lat | lon | mb | wind_dir | wind_spd | uv | vv | ts |
+------------+-------+-----+-----+----------+----------+-------+------+------+
| 1769584117 | -90.0 | 0.0 | 100 | 125 | 9 | -3.74 | 2.62 | 2112 |
| 1769584118 | -90.0 | 0.5 | 100 | 125 | 9 | -3.76 | 2.59 | 2112 |
| 1769584119 | -90.0 | 1.0 | 100 | 124 | 9 | -3.78 | 2.56 | 2112 |
Because the data changes so infrequently, I'd like the data to be cached by the server so if a client needs data previously queried, a second SQL query is not necessary.
I'm trying to determine the most efficient in-memory data structure, in terms of storage/speed, but more importantly, ease of access.
My initial thought was a map keyed by lat, containing a map keyed by lon, containing a map keyed by mb for which the value is a map containing the wind_dir, wind_speed, uv, vv and ts fields.
However, that gets complicated fast. Another thought of course is a 3-dimensional array (lat, lon, mb indices) containing a struct of the last 5 fields.
As I'm sitting here, I came up with the thought of combining lat, lon and mb into a string, which could be used as an index into a map, given that I'm 99% sure the combination of lat, lon and mb would always be unique.
What other ideas make sense?
Edit: More detail from comment below
In terms of data, there are 3,119,040 rows in the data set. That will be fairly constant, though it may slowly grow over the years as new reporting stations are added. There are generally between 700 and 1500 clients requesting the data. The clients are flight simulators. They'll be requesting the data every 5 minutes by default, though the maximum possible frequency would be every 30 seconds. There is no additional information; what you see above is all the data to return.
One final note I forgot to mention: I'm quite rusty in my C++ and especially STL stuff, so the simpler, the better.
You can use std::map with a three-part key and a suitable less-than operator (this is what Crazy Eddie proposed, extended with a few lines of code):
struct key
{
    double mLat;
    double mLon;
    double mMb;

    key(double lat, double lon, double mb)
        : mLat(lat), mLon(lon), mMb(mb) {}
};

bool operator<(const key& a, const key& b)
{
    return  a.mLat <  b.mLat ||
           (a.mLat == b.mLat && a.mLon <  b.mLon) ||
           (a.mLat == b.mLat && a.mLon == b.mLon && a.mMb < b.mMb);
}
Defining and inserting into the map would look like:
std::map<key, your_wind_struct> values;
values[key(-90.0, 0.0, 100)] = your_wind_struct(1769584117, 125, ...);
A sorted vector also makes sense. You can feed it a less-than predicate that compares your three-part key, and you could do the same with a map or set. A hash would work too. Which container you choose depends on a lot of factors.
Another option is the C++11 unordered_map, which uses a hash table instead of a red-black tree as the internal data structure, and gives (I believe) an amortized lookup time of O(1) vs. O(log n) for the red-black tree. Which data structure you use depends on the characteristics of the data in question: how many pieces of data there are, how often a particular record will likely be accessed, etc. I'm in agreement with several commenters that using a structure as a key is the cleanest way to go. It also allows you to alter the unique key more simply, should that change in the future; you would just need to add a member to your key structure rather than create a whole new level of maps.
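A minimal sketch of that variant, reusing the key struct above (the hash combiner is illustrative, not tuned):

#include <cstddef>
#include <functional>
#include <unordered_map>

struct key_hash
{
    std::size_t operator()(const key& k) const
    {
        // Combine the three field hashes; the constants are the usual
        // boost-style mixing values, nothing magic about them here.
        std::size_t h = std::hash<double>()(k.mLat);
        h ^= std::hash<double>()(k.mLon) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<double>()(k.mMb)  + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

struct key_equal
{
    bool operator()(const key& a, const key& b) const
    {
        return a.mLat == b.mLat && a.mLon == b.mLon && a.mMb == b.mMb;
    }
};

// Average O(1) lookup instead of O(log n):
std::unordered_map<key, your_wind_struct, key_hash, key_equal> cache;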

The best way to generate path pattern for materialized path tree structures

Browsing through examples all over the web, I can see that people generate the path using something like "parent_id.node_id". Examples:
uid | name | tree_id
--------------------
1 | Ali | 1.
2 | Abu | 2.
3 | Ita | 1.3.
4 | Ira | 1.3.
5 | Yui | 1.3.4
But as explained in the question "Sorting tree with a materialized path?", zero-padding the tree_id makes it easy to sort by creation order.
uid | name | tree_id
--------------------
1 | Ali | 0001.
2 | Abu | 0002.
3 | Ita | 0001.0003.
4 | Ira | 0001.0003.
5 | Yui | 0001.0003.0004
Using a fixed-length string like this also makes it easy to calculate the level: length(tree_id)/5. What I'm worried about is that it would limit me to a maximum of 9999 users in total rather than 9999 per branch. Am I right here?
9999 | Tar | 0001.9999
10000 | Tor | 0001.??
You are correct: zero-padding each node ID would allow you to sort the entire tree quite simply. However, you have to make the padding width match the upper limit of digits of the ID field, as you pointed out in your last example. E.g., if you're using an int unsigned field for your ID, the highest value would be 4,294,967,295. This is ten digits, meaning that the record set from your last example might look like:
uid | name | tree_id
9999 | Tar | 0000000001.0000009999
10000 | Tor | 0000000001.0000010000
As long as you know you're not going to need to change your ID field to bigint unsigned in the future, this will continue to work, though it might be a bit data-hungry depending on how huge your tables get. You could shave off two bytes per node ID by storing the values in hexadecimal, which would still sort correctly in a string sort:
uid | name | tree_id
9999 | Tar | 00000001.0000270F
10000 | Tor | 00000001.00002710
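In MySQL, for example, one such fixed-width hex segment can be produced with plain built-ins (illustrative; any zero-padded hex conversion works):

SELECT LPAD(HEX(10000), 8, '0');  -- '00002710'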
I can imagine this would make things a real headache when trying to update the paths (pruning nodes, etc.), though.
You can also create extra fields for sorting, e.g.:
uid | name | tree_id | name_sort
9999 | Tar | 00000001.0000270F | Ali.Tar
10000 | Tor | 00000001.00002710 | Ali.Tor
There are limitations, however, as laid out by this guy's answer to a similar materialized path sorting question. The name field would have to be padded to a set length (fortunately, in your example, each name seems to be three characters long), and it would take up a lot of space.
In conclusion, given the above issues, I've found that the most versatile way to do sorting like this is simply to do it in your application logic: say, with a recursive function that builds a nested array, sorting the children of each node as it goes.
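A minimal sketch of that approach (assuming rows shaped like the example above, with a parent_id column instead of the path; the dict layout is illustrative):

def build_tree(rows, parent_id=None):
    # Children of this node, sorted by uid, i.e. by creation order.
    children = sorted(
        (r for r in rows if r["parent_id"] == parent_id),
        key=lambda r: r["uid"],
    )
    return [dict(r, children=build_tree(rows, r["uid"])) for r in children]

rows = [
    {"uid": 1, "name": "Ali", "parent_id": None},
    {"uid": 2, "name": "Abu", "parent_id": None},
    {"uid": 3, "name": "Ita", "parent_id": 1},
    {"uid": 4, "name": "Ira", "parent_id": 1},
    {"uid": 5, "name": "Yui", "parent_id": 3},
]
tree = build_tree(rows)  # nested list, sorted by creation order at each level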