Extract environmental values from points predicted by MaxEnt - r-raster

I have used MaxEnt to build a species distribution model for a species of interest.
I would like to extract the points (coordinates) that MaxEnt predicts with high confidence (e.g. > 0.8), and then use these points to extract their detailed environmental factors using the raster package and the WorldClim database.
However, I don't know how to quickly extract the high-confidence points.
Does anyone know a way to do this?
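The question is tagged r-raster, but as a hedged illustration of the workflow (not a definitive implementation), here is a minimal sketch in Python using rasterio and numpy; the file names are hypothetical placeholders for the MaxEnt output and a WorldClim layer.

import numpy as np
import rasterio

# 1. Load the MaxEnt suitability raster and find cells above the threshold.
with rasterio.open("maxent_prediction.tif") as pred:      # hypothetical file name
    suitability = pred.read(1)
    rows, cols = np.where(suitability > 0.8)               # high-confidence cells
    coords = [pred.xy(r, c) for r, c in zip(rows, cols)]   # (lon, lat) of cell centres

# 2. Sample a WorldClim layer (e.g. annual mean temperature) at those points.
with rasterio.open("worldclim_bio1.tif") as bio1:          # hypothetical file name
    bio1_values = [v[0] for v in bio1.sample(coords)]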


High mAP#50 with low precision and recall. What does it mean and what metric should be more important?

I am comparing models for the detection of objects for maritime Search and Rescue (SAR) purposes. Of the models I tried, I got the best results from an improved version of YOLOv3 for small object detection and from Faster RCNN.
With YOLOv3 I got the best mAP#50, but with Faster RCNN I got better values for all the other metrics (precision, recall, F1 score). Now I am wondering how to read this and which model is really better in this case.
I would like to add that there are only two classes in the dataset: small and large objects. We chose this split because distinguishing between the classes is not as important to us as detecting any object of human origin.
However, small objects don't mean small ground-truth (GT) bounding boxes. These are objects that actually have a small area - less than 2 square meters (e.g. people, buoys). Large objects are objects with a larger area (boats, ships, canoes, etc.).
Here are the results per category:
And two sample images from the dataset (with YOLOv3 detections):
The mAP for object detection is the average of the AP calculated over all the classes. mAP#0.5 means the mAP calculated at an IoU threshold of 0.5.
The general definition of Average Precision (AP) is the area under the precision-recall curve.
The precision-recall curve is obtained by plotting the model's precision and recall as a function of the model's confidence threshold.
Precision measures how accurate your predictions are, i.e. the percentage of your predictions that are correct. Recall measures how well you find all the positives. The F1 score is the harmonic mean (HM) of precision and recall.
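To make these definitions concrete, here is a small sketch with made-up labels and confidence scores (not the question's data), using scikit-learn: precision, recall and F1 at one fixed threshold, and AP as the area under the precision-recall curve.

from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score

y_true   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                        # ground truth
y_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]   # model confidences

threshold = 0.5
y_pred = [1 if s >= threshold else 0 for s in y_scores]

print("precision:", precision_score(y_true, y_pred))              # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))                 # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))                     # harmonic mean of the two
print("AP:       ", average_precision_score(y_true, y_scores))    # area under the PR curve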
To answer your questions now.
How to read it and which model is really better in this case?
The mAP is a good measure of the sensitivity of the neural network: a good mAP indicates a model that is stable and consistent across different confidence thresholds. In your case, the Faster RCNN results indicate that its precision-recall curve is worse than that of YOLOv3, which means that Faster RCNN has either much worse recall at higher confidence thresholds or much worse precision at lower confidence thresholds than YOLOv3 (especially for small objects).
Precision, recall, and F1 score are computed at a given confidence threshold. I'm assuming you're running the models with the default confidence threshold (possibly 0.25). So the higher precision, recall, and F1 score of Faster RCNN indicate that, at that confidence threshold, it is better than YOLOv3 on all three metrics.
What metric should be more important?
In general, to analyse which model performs better, I would suggest using a validation set (the data set used to tune hyper-parameters) and a test set (the data set used to assess the performance of a fully trained model).
Note: FP = false positive, FN = false negative.
On validation set:
Use mAP to select the best-performing model (the one that is most stable and consistent) out of all the trained weights across iterations/epochs, and to decide whether the model should be trained/tuned further.
Check class-level AP values to ensure the model is stable and good across all classes.
If your use case/application is completely tolerant of FNs and highly intolerant of FPs, then train/tune the model using precision.
If your use case/application is completely tolerant of FPs and highly intolerant of FNs, then train/tune the model using recall.
On test set:
If you're neutral towards FPs and FNs, use the F1 score to pick the best-performing model.
If FPs are not acceptable to you (without caring much about FNs), pick the model with the higher precision.
If FNs are not acceptable to you (without caring much about FPs), pick the model with the higher recall.
Once you decide which metric to use, try multiple confidence thresholds (for example 0.25, 0.35 and 0.5) for a given model to understand at which confidence threshold your selected metric works in your favour, and to understand the acceptable trade-off ranges (say you want a precision of at least 80% and some decent recall); a small sketch of this sweep is given below. Once the confidence threshold is decided, use it across the different models to find the best-performing one.
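As a hedged sketch of that threshold sweep (assuming y_true/y_scores are detections from your test set that have already been matched to ground truth at the chosen IoU, which reduces the detection setting to a binary one):

from sklearn.metrics import precision_score, recall_score, f1_score

def sweep_thresholds(y_true, y_scores, thresholds=(0.25, 0.35, 0.5)):
    # Report precision/recall/F1 at each candidate confidence threshold.
    for t in thresholds:
        y_pred = [1 if s >= t else 0 for s in y_scores]
        print(f"threshold={t:.2f}  "
              f"precision={precision_score(y_true, y_pred, zero_division=0):.3f}  "
              f"recall={recall_score(y_true, y_pred, zero_division=0):.3f}  "
              f"F1={f1_score(y_true, y_pred, zero_division=0):.3f}")

# Dummy example; replace with the matched detections from your own test set.
sweep_thresholds([1, 0, 1, 1, 0], [0.9, 0.6, 0.45, 0.3, 0.2])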

Hyper parameter tuning for Random cut forest

I have used the below hyperparameters to train the model.
rcf.set_hyperparameters(
    num_samples_per_tree=200,   # points sampled per tree; its reciprocal approximates the expected anomaly rate
    num_trees=250,              # number of trees in the forest
    feature_dim=1,              # dimensionality of each input record
    eval_metrics=["accuracy", "precision_recall_fscore"],
)
Is there a best way to choose the num_samples_per_tree and num_trees parameters?
What are the best values for both num_samples_per_tree and num_trees?
There are natural interpretations for these two hyper-parameters that can help you determine good starting approximations for HPO:
num_samples_per_tree -- the reciprocal of this value approximates the density of anomalies in your data set/stream. For example, if you set this to 200 then the assumption is that approximately 0.5% of the data is anomalous. Try exploring your dataset to make an educated estimate.
num_trees -- the more trees in your RCF model, the less noise in the scores. That is, if more trees report that the input inference point is an anomaly, then the point is much more likely to be an anomaly than if only a few trees suggest so.
The total number of points sampled from the input dataset is equal to num_samples_per_tree * num_trees. You should make sure that the input training set is at least this size.
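As a back-of-the-envelope sketch of that reasoning (plain Python, no SageMaker calls; the anomaly rate and dataset size are hypothetical examples):

expected_anomaly_rate = 0.005                                   # assume ~0.5% of points are anomalous
num_samples_per_tree = int(round(1 / expected_anomaly_rate))    # -> 200
num_trees = 250                                                 # more trees -> less noisy scores

min_training_points = num_samples_per_tree * num_trees          # -> 50,000
n_training_points = 80_000                                      # size of your training set (example)
assert n_training_points >= min_training_points, "training set too small for these settings"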
(Disclosure - I helped create SageMaker Random Cut Forest)

Using class weight to balance data set lowers accuracy in RBF SVM

I have been using sklearn to learn on some data. This is a binary classification task and I am using an RBF kernel. My data set is quite unbalanced (80:20) and I'm using only 120 samples, with roughly 10 features (I've been experimenting with a few less). Since I set class_weight="auto", the accuracy I've calculated from a cross-validated (10-fold) grid search has dropped dramatically. Why?
I will include a couple of validation accuracy heatmaps to demonstrate the difference.
NOTE: the top heatmap is from before class_weight was changed to auto.
Accuracy is not the best metric to use when dealing with an unbalanced dataset. Say you have 99 positive examples and 1 negative example: if you predict all outputs to be positive, you still get 99% accuracy, even though you have misclassified the only negative example. You might have gotten high accuracy in the first case because your predictions lean towards the class with the larger number of samples.
When you set class_weight="auto", the imbalance is taken into consideration, so your predictions may have moved towards the centre; you can cross-check this by plotting histograms of the predictions.
My suggestion is: don't use accuracy as the performance metric; use something like the F1 score or AUC.
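As a sketch of that suggestion (not a definitive fix), here is a grid search over an RBF SVM that uses class weighting and scores with F1 instead of accuracy; synthetic data stands in for your 120-sample, ~10-feature, 80:20 dataset, and recent scikit-learn versions spell the old "auto" option as class_weight="balanced".

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the imbalanced dataset described in the question.
X, y = make_classification(n_samples=120, n_features=10, weights=[0.8, 0.2], random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    param_grid,
    scoring="f1",                        # or "roc_auc"; accuracy hides the imbalance
    cv=StratifiedKFold(n_splits=10),     # stratified folds keep the 80:20 ratio per fold
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)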

What does the class_weight parameter do in scikit-learn SGD?

I am a frequent user of scikit-learn and I want some insight into the class_weight parameter with SGD.
I was able to trace it down to the function call
plain_sgd(coef, intercept, est.loss_function,
penalty_type, alpha, C, est.l1_ratio,
dataset, n_iter, int(est.fit_intercept),
int(est.verbose), int(est.shuffle), est.random_state,
pos_weight, neg_weight,
learning_rate_type, est.eta0,
est.power_t, est.t_, intercept_decay)
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py
After this it goes to sgd_fast, and I am not very good with Cython. Can you give some clarity on these questions?
I have a class bias in the dev set: the positive class has around 15k samples and the negative class around 36k. Will class_weight resolve this problem, or would undersampling be a better idea? I am getting better numbers with it, but it's hard to explain.
If yes, how does it actually do it? I mean, is it applied to the feature penalization, or is it a weight in the optimization function? How can I explain this to a layman?
class_weight can indeed help increase the ROC AUC or F1 score of a classification model trained on imbalanced data.
You can try class_weight="auto" (spelled class_weight="balanced" in recent scikit-learn versions) to select weights that are inversely proportional to class frequencies. You can also pass your own weights as a Python dictionary with class labels as keys and weights as values.
Tuning the weights can be achieved via grid search with cross-validation.
Internally this is done by deriving sample_weight from the class_weight (depending on the class label of each sample). The sample weights are then used to scale the contribution of individual samples to the loss function used to train the linear classification model with Stochastic Gradient Descent.
Feature penalization is controlled independently via the penalty and alpha hyperparameters; sample_weight / class_weight have no impact on it.
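A minimal sketch of how this is typically used in practice, with the weights tuned by cross-validated grid search (synthetic data stands in for the roughly 36k/15k dev set; class_weight="balanced" is the modern spelling of "auto"):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data as a stand-in for the dev set described above.
X, y = make_classification(n_samples=5000, weights=[0.7, 0.3], random_state=0)

param_grid = {
    "class_weight": [None, "balanced", {0: 1, 1: 2}, {0: 1, 1: 3}],  # candidate weightings
}
grid = GridSearchCV(
    SGDClassifier(random_state=0),   # default hinge loss; the weights scale each sample's loss term
    param_grid,
    scoring="f1",                    # accuracy would hide the imbalance
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)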

When calculating a distance from a city, how can I factor in the approximate size (physical area) of the city?

I'm building a store locator based on in-house geocoding data. Effectively I need to query stores near City X or Zip Y within a certain radius. The data sets I'm working with are relatively comprehensive and include things such as population.
One issue is that large cities (Los Angeles, for example) span many miles, so you could be within the city but miles from the coordinate we have loaded.
Is there a rule of thumb, or a free data feed which would list an approximate radius of a city, or perhaps even outlines of the city points?
Also, assuming I have a shape defining the city what calculation would I use to say "stores within X miles of this area"?
Why don't you use the zip codes and latitude/longitude of the stores, instead of the cities? You know the addresses of the stores, so use their zip codes, look up the coordinates, and calculate the distance from the origin zip code. Then it wouldn't matter how big the city is, because big cities have many zip codes, but each store has its own zip code.
It would only be a problem for states with large zip-code areas like Texas, but then there is likely no more than one store per zip code anyway, so it's not a big deal.
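As a sketch of this zip-code approach (the coordinates below are illustrative, not real geocoding data): compute the great-circle (haversine) distance between the query zip code's centroid and each store's zip-code centroid.

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in miles.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# e.g. a downtown Los Angeles zip centroid vs. a Santa Monica zip centroid (illustrative values)
print(haversine_miles(34.0407, -118.2468, 34.0195, -118.4912))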
Ultimately we didn't implement this feature, but before it was cancelled I had a fair amount of success with the approach below:
Finding coordinates for the city itself, as well as for all zip codes of the city
"Connecting the dots" of all the above coordinates to create a (very rough) polygon of the city's shape
Checking whether the user's input coordinate was within the given range of the polygon
This approach worked relatively well and may have ultimately developed into a sound solution with some more enhancements and tuning.
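For the "stores within X miles of this area" check, a hedged sketch with shapely: "connect the dots" of the collected coordinates into a rough polygon (a convex hull here), buffer it by the radius, and test each store point. Working directly in lat/lon degrees with a flat ~69-miles-per-degree conversion is a simplification; a real implementation would project to a local metric CRS first. The coordinates are illustrative.

from shapely.geometry import MultiPoint, Point

# Rough polygon from the city/zip centroids ("connecting the dots" via a convex hull).
city_coords = [(-118.25, 34.05), (-118.49, 34.02), (-118.41, 33.94), (-118.22, 34.14)]  # (lon, lat)
city_polygon = MultiPoint(city_coords).convex_hull

radius_miles = 10
buffered = city_polygon.buffer(radius_miles / 69.0)   # crude degrees-per-mile conversion

store = Point(-118.60, 34.00)                         # a store's (lon, lat)
print(buffered.contains(store))                       # True if the store falls within the buffered shape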