How to include losses due to lifetime degradation of PV panels in pvlib

Is there a way to specify and include losses due to panel degradation in pvlib's calculation of AC/DC power output? When trying to estimate, e.g., 20-year performance, how should losses due to panel degradation be represented? Is it done simply by reducing the power output by a different factor each year, or is there a better way to do it in pvlib?

Have you considered the pvsystem.pvwatts_losses function? This will apply a uniform loss. You'd need to write your own functions to do anything more complicated than that with pvlib.
Also consider the rdtools library.
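For illustration, here is a minimal sketch combining the two ideas: pvwatts_losses for a uniform loss, plus a hand-rolled yearly degradation factor. This is not an official pvlib recipe, and the 0.5%/year rate, irradiance, cell temperature, and system size are all assumptions:

```python
import pvlib

pdc0 = 5000               # nameplate DC rating [W] (assumed)
gamma_pdc = -0.004        # power temperature coefficient [1/degC] (assumed)
degradation_rate = 0.005  # assumed linear degradation of 0.5%/year

# Year-one DC power at an assumed effective irradiance and cell temperature.
dc = pvlib.pvsystem.pvwatts_dc(g_poa_effective=800.0, temp_cell=45.0,
                               pdc0=pdc0, gamma_pdc=gamma_pdc)

# Uniform system losses (in percent) from the PVWatts loss model defaults.
loss_pct = pvlib.pvsystem.pvwatts_losses()

for year in range(20):
    # Derate by the uniform losses, then by cumulative degradation.
    derate = (1 - loss_pct / 100) * (1 - degradation_rate) ** year
    print(f"year {year + 1}: {dc * derate:.0f} W")
```

rdtools, mentioned above, goes the other way: it estimates degradation rates from measured production data rather than assuming one.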

Related

In SAS, does using the GLM option 'absorb' invalidate the standard errors of the parameter estimates?

SAS provides a handy tool for handling GLM models with a large number of groups or fixed effects: ABSORB (in PROC GLM). My understanding is that it factors the effect of the absorbed parameters out of the data before estimating the remaining parameters. It seems like this may invalidate the standard errors of the parameter estimates, since the variance-covariance matrix would not be calculated with the full set of parameters. Hence, we would be calculating the variance-covariance matrix conditional on the absorbed parameters, when we really want the unconditional variance. Am I misunderstanding what is going on here? I haven't found any good explanations in the help files.
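The mechanics of the concern can be illustrated with a hypothetical numpy sketch (not SAS; all data made up): by the Frisch-Waugh-Lovell theorem, "absorbing" the group effects via within-group demeaning reproduces the full dummy-variable model's slope, and the standard error matches the full model only if the degrees of freedom are reduced by the number of absorbed groups. Whether a given procedure applies that correction is a question for its documentation; the sketch only shows why it matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, per_group = 50, 10
n = n_groups * per_group
g = np.repeat(np.arange(n_groups), per_group)   # group labels
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n_groups)[g] + rng.normal(size=n)

def demean(v):
    """Subtract each observation's group mean (the 'absorb' step)."""
    return v - (np.bincount(g, weights=v) / per_group)[g]

xd, yd = demean(x), demean(y)
beta = (xd @ yd) / (xd @ xd)   # identical to the full model's slope
resid = yd - beta * xd         # identical to the full model's residuals

dof_full = n - n_groups - 1    # accounts for the absorbed group effects
dof_naive = n - 2              # what a naive demeaned regression would use
se_full = np.sqrt(resid @ resid / dof_full / (xd @ xd))
se_naive = np.sqrt(resid @ resid / dof_naive / (xd @ xd))
print(f"beta={beta:.3f}  SE corrected={se_full:.4f}  naive={se_naive:.4f}")
```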

Is there a way to quantify impact of independent variables with gradient boosting?

I've been asked to run a model using gradient boosting or random forest. So far so good; however, the only output that comes back in terms of variable importance is based on the number of times a variable was used as a branch rule. I've now been asked to basically get coefficients, or to somehow quantify the impact that the variables have on the target.
Is there a way to do this with a gradient boosting model? My other thought was to take only the variables that were shown to be used as branch rules and use them in a regular decision tree, a GLM, or a regular regression model.
Any help or ideas would be appreciated! Thanks so much!
Just to make certain there is no misunderstanding: the SAS implementation of decision trees/gradient boosting (at least in EM) uses split-based variable importance.
Split-based importance does NOT count the number of splits made.
It is the ratio of the reduction in sum of squares achieved by one variable (specifically, the sum over all splits on that variable) to the reduction in sum of squares achieved by all splits in the model.
If you are using surrogate rules, highly correlated variables will receive roughly the same value.
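For readers working outside SAS EM, here is a minimal scikit-learn sketch (an analogue, not the SAS implementation; the data is synthetic) showing the same split-based (impurity) importance alongside permutation importance, which quantifies each variable's impact on predictions more directly:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Split-based importance: each feature's share of the total reduction in
# squared error over all splits that use it (the values sum to 1).
print("impurity-based:", np.round(model.feature_importances_, 3))

# Permutation importance: drop in score when a feature's values are shuffled.
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("permutation:   ", np.round(perm.importances_mean, 3))
```

Partial dependence plots are another common way to see the direction and shape of each variable's effect, which is closer in spirit to the "coefficients" the asker was requested to produce.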

What is a proper method for minimizing the standard deviation of a dependent variable (e.g. clustering)?

I'm stuck on minimizing the standard deviation of a dependent variable, a time difference in days. The mean is OK, but the deviation is terrible. I tried clustering by the independent variables and noticed quite dissimilar clusters. Now I'm wondering:
1) How can I actually apply this knowledge from clustering to the dependent variable? It was not included in the initial clustering analysis, as I know it depends on the others.
2) Given that I know the time-difference variable is dependent, should I run a clustering of it together with a variable holding the cluster number from my initial clustering analysis? Would that help?
3) Is there any other technique, apart from clustering, that can help me categorize groups of observations so that for each group I would have a separate mean of the dependent variable with a low standard deviation?
Any help highly appreciated!
P.S. I was using Stata and SPSS, though I can also use SAS if you can share the code.
It sounds like you're going about this all wrong. Here are some relevant points to consider.
It's more important for the variance to be consistent across groups than for it to be low.
Clustering will (generally) group individuals based on similar patterns across the clustering variables.
Fewer observations will generally not decrease the size of your standard deviation.
Any time you take continuous variables (either IVs or DVs) and convert them into categorical variables, you are removing variance from the equation and introducing more measurement error. Sometimes there are good reasons to do this; oftentimes there are not.
Analysis should be theory-driven whenever possible, as data-driven analysis (like what you're trying to accomplish here) is more likely to produce results that can't be reproduced or generalized to other data sets, samples, or populations.
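To make the mechanics concrete, here is a minimal sketch (scikit-learn rather than the Stata/SPSS/SAS mentioned above; the data is invented) of the workflow the question describes: cluster on the independent variables only, then inspect the dependent variable's mean and standard deviation within each cluster:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy data: two independent variables and a dependent time difference in days.
df = pd.DataFrame({"iv1": rng.normal(size=300), "iv2": rng.normal(size=300)})
df["days"] = 10 + 5 * df["iv1"] + rng.normal(scale=3, size=300)

# Cluster on the independent variables only; the dependent variable is held out.
X = StandardScaler().fit_transform(df[["iv1", "iv2"]])
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Per-cluster mean and standard deviation of the dependent variable.
print(df.groupby("cluster")["days"].agg(["mean", "std", "count"]))
```

Note, per the answer above, that a plain regression of days on the independent variables would use the continuous information directly instead of discarding it through categorization.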

When is an event too rare for predictive modelling to be worthwhile?

Background
I built a complaints management system for my company. It works fine. I'm interested in using the data it contains to do predictive modelling on complaints. We have ~40,000 customers of whom ~400 have complained.
Problem
I want to use our complaints data to model the probability that any given customer will complain. My concern is that a model giving each customer a probability of 0.000 of complaining would already be 99% accurate and thus hard to improve upon. Is it even possible to build a useful predictive model of the kind I describe, trying to predict such a rare event with so little data?
That is why there are alternative measures beyond plain accuracy.
Here, recall is probably what you are interested in. And in order to balance precision and recall, F1 is a popular mixture that takes both into account.
But in general, avoid trying to boil things down to a single number.
That is a one-dimensional result, and too much of a simplification. In practice, you will want to study the errors in detail, to catch any systematic error.
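As a concrete illustration, here is a scikit-learn sketch on synthetic data with roughly the question's 1% positive rate; the logistic regression model and the class_weight="balanced" option are assumptions, not the only reasonable choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# ~40,000 "customers", ~1% of whom "complain" (class 1).
X, y = make_classification(n_samples=40_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" counteracts the 99:1 imbalance during fitting.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_tr, y_tr)

# Accuracy would look great even for a useless model; precision, recall,
# and F1 for class 1 show whether complainers are actually being found.
print(classification_report(y_te, model.predict(X_te), digits=3))
```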

Appropriate minimum support for itemsets?

Please suggest any material about appropriate minimum support and confidence for itemsets!
I use the Apriori algorithm to search for frequent itemsets, but I still don't know what appropriate support and confidence values are. I wish to know what kinds of considerations go into deciding how big the support should be.
The answer is that the appropriate values depend on the data.
For some datasets the best value may be 0.5, but for others it may be 0.05.
However, if you set minsup = 0 and minconf = 0, some algorithms will run out of memory before terminating, or you may run out of disk space because there are too many patterns.
From my experience, the best way to choose minsup and minconf is to start with high values and then lower them gradually until you find enough patterns.
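As an illustration of that start-high-then-lower strategy, here is a minimal sketch using mlxtend's Apriori implementation (a different library from the SPMF code linked below; the toy transactions and the stopping threshold are made up):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"], ["bread", "butter"],
                ["milk", "butter"], ["bread", "milk", "butter"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Start with a high minsup and lower it until "enough" patterns appear.
for min_sup in (0.9, 0.7, 0.5):
    itemsets = apriori(df, min_support=min_sup, use_colnames=True)
    print(f"min_support={min_sup}: {len(itemsets)} frequent itemsets")
    if len(itemsets) >= 5:   # arbitrary "enough patterns" threshold
        rules = association_rules(itemsets, metric="confidence",
                                  min_threshold=0.6)
        print(rules[["antecedents", "consequents", "support", "confidence"]])
        break
```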
Alternatively, if you don't want to set minsup, you can use a top-k algorithm where, instead of specifying minsup, you specify that you want the k most frequent rules, for example k = 1000 rules.
If you are interested in top-k association rule mining, you can check my Java code here:
http://www.philippe-fournier-viger.com/spmf/
The algorithm is called TopKRules and the article describing it will be published next month.
Besides that, you should know that there are many other interestingness measures besides support and confidence: lift, all-confidence, ... To learn more, you can read the articles "On selecting interestingness measures for association rules" and "A Survey of Interestingness Measures for Association Rules". Basically, all measures have problems in some cases... no measure is perfect.
Hope this helps!
In any association rule mining algorithm, including Apriori, it is up to the user to decide what support and confidence values they want to provide. Depending on your dataset and your objectives, you decide the minSup and minConf.
Obviously, if you set these values lower, then your algorithm will take longer to execute and you will get a lot of results.
The minimum support and minimum confidence parameters are a user preference. If you want a larger quantity of results (with lower statistical confidence), choose the parameters appropriately. In theory you can set them to 0. The algorithm will run, but it will take a long time, and the result will not be particularly useful, as it will contain just about anything.
So choose them such that the results suit your needs. Mathematically, any value is "correct".