I am trying to get a better understanding of the outputs of Google's sentiment analysis API. It takes in a sentence and returns two values, magnitude and score, and I am trying to interpret the magnitude value better. Magnitude is defined in the documentation as:
A non-negative number in the [0, +inf) range, which represents the absolute magnitude of sentiment regardless of score (positive or negative).
Initially, I thought it was a confidence score or weight, but I am not sure how the value would change, since it can be ANY number. Does anybody know how it is calculated, or what it means beyond the definition provided in the documentation?
magnitude indicates the overall strength of emotion (both positive and negative) within the given text, between 0.0 and +inf. Unlike score, magnitude is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text's magnitude (so longer text blocks may have greater magnitudes) (Ref).
The score of a document's sentiment indicates the overall emotion of a document. The magnitude of a document's sentiment indicates how much emotional content is present within the document, and this value is often proportional to the length of the document (Ref).
A document with a neutral score (around 0.0) may indicate a low-emotion document, or may indicate mixed emotions, with both high positive and negative values which cancel each other out. Generally, you can use magnitude values to disambiguate these cases, as truly neutral documents will have a low magnitude value, while mixed documents will have higher magnitude values (Ref).
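For reference, here is a minimal sketch of how the two values come back from the API, using the google-cloud-language Python client (this assumes you have credentials configured; the example sentence is mine):

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="I love this product. The delivery, however, was terrible.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
sentiment = client.analyze_sentiment(
    request={"document": document}
).document_sentiment

# A near-zero score combined with a high magnitude suggests mixed
# sentiment rather than the absence of emotion.
print(f"score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")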
I have used the below hyperparameters to train the model.
rcf.set_hyperparameters(
    num_samples_per_tree=200,
    num_trees=250,
    feature_dim=1,
    eval_metrics=["accuracy", "precision_recall_fscore"],
)
Is there a best way to choose the num_samples_per_tree and num_trees parameters?
What are the best numbers for both num_samples_per_tree and num_trees?
There are natural interpretations for these two hyper-parameters that can help you determine good starting approximations for HPO:
num_samples_per_tree -- the reciprocal of this value approximates the density of anomalies in your data set/stream. For example, if you set this to 200 then the assumption is that approximately 0.5% of the data is anomalous. Try exploring your dataset to make an educated estimate.
num_trees -- the more trees in your RCF model the less noise in scores. That is, if more trees are reporting that the input inference point is an anomaly then the point is much more likely to be an anomaly than if few trees suggest so.
The total number of points sampled from the input dataset is equal to num_samples_per_tree * num_trees. You should make sure that the input training set is at least this size.
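As a quick sanity check on the sizes above (plain arithmetic; the training-set size is a hypothetical value for illustration):

num_samples_per_tree = 200   # implies an assumed anomaly density of 1/200 = 0.5%
num_trees = 250              # more trees give less noisy anomaly scores
min_training_size = num_samples_per_tree * num_trees  # 50,000 sampled points

n_rows = 75_000  # hypothetical size of your training set
assert n_rows >= min_training_size, "training set too small for these settings"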
(Disclosure - I helped create SageMaker Random Cut Forest)
I'm presenting visualizations of the time to complete different tasks. Some of the data is heavily skewed by certain tasks which take much longer than the rest, so I thought it would be a good idea to show both the means and the medians, to demonstrate where that skew is present. I have one page of visualizations for the mean, and an identical page where the mean has been replaced by the median. However, when Power BI calculates the medians, it seems to be giving me integer values, where I would like it to display the full decimal value.
Here are screenshots of each page (I've had to black out the labels for confidentiality reasons).
And a snippet of the data so you can see it's being read in as decimal numbers.
I have a dataset that has multiple variables, each of them heavily centered around zero to form a high peak. The kurtosis of each variable is more than 100.
What I want to estimate is the probability density at any given value, assuming it belongs to the dataset. The most accessible distribution function I have found so far is the multivariate Gaussian distribution. However, since my dataset is not normally shaped, I am worried that estimating the probability density with this function would be inaccurate.
Does anyone have any good suggestions on which function to use for this purpose?
You are repeating a common incorrect interpretation of kurtosis, namely "peakedness," which contributes to the confusion about which distribution to use.
Kurtosis does not measure "peakedness" at all. You can have a distribution with a perfectly flat peak, with a V-shaped peak, with a trimodal peak, with a wavy peak, or with any shape of peak whatsoever, that has infinite kurtosis. And you can have a distribution with an infinite peak that has negative (excess) kurtosis.
Instead, kurtosis is a measure of the tails (outlier potential) of the distribution, not the peak. The only reason people think that there is a "high peak" when there is high kurtosis is that the outliers stretch the horizontal scale of the histogram, making the data appear concentrated in a narrow vertical strip. But if you zoom in on the bulk of the data in that strip, the peak can have any shape whatsoever. Further, if you compare the height of your histogram of standardized data with the height of a corresponding standard normal, either can be higher, no matter what your data show. The "height" mythology was debunked around 1945 by Kaplansky.
For your data, you do not need a "peaked" distribution. Instead, you need a distribution that allows such extreme values as you have observed. Examples include mixture distributions, lognormal distributions, t distributions with small degrees of freedom, or multivariate versions of such, if that's what you need.
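To make that concrete, here is a small sketch (using NumPy and SciPy, which is my assumption; the data and the evaluation point are synthetic) showing that a heavy-tailed fit assigns reasonable density to extreme values where a normal fit assigns essentially none:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = stats.t.rvs(df=3, size=10_000, random_state=rng)  # heavy-tailed sample

print("excess kurtosis:", stats.kurtosis(data))  # large, driven by the tails

# Fit both a Student's t and a normal, then compare densities at a far-out value.
df_, loc, scale = stats.t.fit(data)
mu, sigma = stats.norm.fit(data)

x = 8.0
print("t density:     ", stats.t.pdf(x, df_, loc, scale))
print("normal density:", stats.norm.pdf(x, mu, sigma))  # essentially zero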
References:
Westfall, P.H. (2014). Kurtosis as Peakedness, 1905–2014. R.I.P. The American Statistician, 68, 191–195.
(A simplified discussion of the above paper is given in the talk section of the Wikipedia entry on kurtosis.)
I'm wondering about the best way to create two lookup tables, for the square root and the cube root of float values in the range [0.0, 1.0).
I have already profiled the code and seen that these roots are a significant performance bottleneck (because I need to compute them for several tens of thousands of values each). Then I remembered lookup tables and thought they would help me increase performance.
Since my values are in a small range, I was thinking about splitting the range into steps of, let's say, 0.0025 (hoping that's fine enough), but I'm unsure about the most efficient way to retrieve the values.
I can easily populate the lookup table, but I need a way to efficiently get the correct value for a given float (which is not discretized on any step). Any suggestions or well-known approaches to this problem?
I'm working with a mobile platform, just to specify.
Thanks in advance
You have (1.0 - 0.0) / 0.0025 = 400 steps.
Just create a 400-entry array and index into it by multiplying the float whose root you want by 400.
For instance, if you want to look up the square root of 0.0075, multiply 0.0075 by 400 and get 3, which is your index into the array:
static double table[400]; /* precomputed roots, one entry per 0.0025 step in [0.0, 1.0) */

double table_sqrt(double v)
{
    /* v must be in [0.0, 1.0); dividing by the step size gives the index */
    return table[(unsigned int)(v / 0.0025)];
}
You could multiply the values by whatever precision you want and then use a hash table, since the keys would then be integral values.
For instance, rather than using a floating-point key value for something like 0.002, give yourself a precision of five or six decimal places, making your key value for 0.002 equal to 200 or 2000. Then you can quickly look up the resulting floating-point value for the square and cube root stored under the 2000 key.
If you also want values for the non-discrete range in between slots, you could use an array or tree rather than a hash table, so that you can generate the "in-between" values by interpolating between the roots stored at two adjacent slots (see the sketch below).
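Here is a sketch of that interpolation approach (written in Python for readability; a real mobile implementation would use a plain native array, and the step size is the question's 0.0025):

import math

STEP = 0.0025
N = 400  # (1.0 - 0.0) / STEP
sqrt_table = [math.sqrt(i * STEP) for i in range(N + 1)]  # extra entry so i+1 is valid

def table_sqrt(x):
    # Approximate sqrt(x) for x in [0.0, 1.0) by linear interpolation
    # between the two table entries surrounding x.
    pos = x / STEP
    i = int(pos)        # step at or below x
    frac = pos - i      # fractional position between step i and step i+1
    return sqrt_table[i] + frac * (sqrt_table[i + 1] - sqrt_table[i])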
If you only need to split into 10 different stripes, find the inputs which correspond to the thresholds between stripes, and use an unrolled binary search to test against those 9 values. Or is there additional computation required before the threshold test is done, so that the looked-up value isn't the final result?
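For the square root those 9 thresholds are simply (k/10)^2 for k = 1..9, since sqrt(x) < k/10 exactly when x < (k/10)^2. A sketch of the unrolled search (again in Python for readability; a native version would be the same comparisons):

def sqrt_stripe(x):
    # Return k such that k/10 <= sqrt(x) < (k+1)/10, for x in [0.0, 1.0),
    # using at most four comparisons against precomputed squared thresholds.
    if x < 0.25:                                          # sqrt(x) < 0.5
        if x < 0.09:                                      # sqrt(x) < 0.3
            return 0 if x < 0.01 else (1 if x < 0.04 else 2)
        return 3 if x < 0.16 else 4
    if x < 0.64:                                          # sqrt(x) < 0.8
        return 5 if x < 0.36 else (6 if x < 0.49 else 7)
    return 8 if x < 0.81 else 9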
I am processing some images using ImageMagick library. As part of the processing I want to minimize the number of colors if this doesn't affect image quality (too much).
For this I have tried to use the MagickQuantizeImage function. Can someone explain how I should choose the parameters?
treedepth:
Normally, this integer value is zero or one. A zero or one tells Quantize to choose an optimal tree depth of Log4(number_colors). A tree of this depth generally allows the best representation of the reference image with the least amount of memory and the fastest computational speed. In some cases, such as an image with low color dispersion (a small number of colors), a value other than Log4(number_colors) is required. To expand the color tree completely, use a value of 8.
dither:
A value other than zero distributes the difference between an original image and the corresponding color-reduced image to neighboring pixels along a Hilbert curve.
measure_error:
A value other than zero measures the difference between the original and quantized images. This difference is the total quantization error. The error is computed by summing over all pixels in an image the distance squared in RGB space between each reference pixel value and its quantized value.
PS: I have made some tests, but sometimes the image quality is severely affected, and I don't want to find a result by trial and error.
This is a really good description of the algorithm:
http://www.imagemagick.org/www/quantize.html
It references the command-line version, but the concepts are the same.
The measure_error parameter is meant to give you an indication of how good an answer you got. Set it to non-zero, then look at the Image object's mean_error_per_pixel field after you quantize to see how good a quantization you got.
If it's not good enough, increase the number of colors.
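A rough sketch of that loop, using the Wand Python binding as a stand-in (an assumption on my part: the question uses MagickQuantizeImage from the MagickWand C API, which takes the same parameters; Wand does not expose mean_error_per_pixel as far as I know, so this version measures quality with compare() instead, and the 0.01 threshold and file names are illustrative):

from wand.image import Image

with Image(filename="input.png") as original:
    number_colors = 16
    while True:
        with original.clone() as candidate:
            # treedepth=0 lets the library pick Log4(number_colors);
            # dither=True spreads quantization error to neighboring pixels.
            candidate.quantize(number_colors, colorspace_type="rgb",
                               treedepth=0, dither=True, measure_error=True)
            _, distortion = original.compare(candidate, metric="root_mean_square")
            if distortion < 0.01 or number_colors >= 256:
                candidate.save(filename="quantized.png")
                break
        number_colors *= 2  # quality not good enough: try more colors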