I'm a cyclist and a programmer. During my rides, I'm recording data to xml files using a phone based gps tracker and a power meter. After the ride, I use the power meter software to merge the data and then upload to a web site. On the website, the resulting data is showing highly inaccurate data for WR Watts (It is a weighted average, also known as normalized power, which by definition is higher than average power and lower than my maximum recorded watts. See http://ridewithgps.com/trips/4834566 (Export TCX History to get the file I'm referring to). /<Watts>\d{4,} returns no results.
Calories: 1809
Max Watts: 676
Avg. Watts: 213 (170 with 0s)
WR Power 23487
Work 1681 kJ
Max Speed: 26.2 mph
Avg. Speed: 16.6 mph
Here are two sample readings from the tcx history file.
<Trackpoint>
<Time>2015-05-30T11:35:50Z</Time>
<Position>
<LatitudeDegrees>41.96306</LatitudeDegrees>
<LongitudeDegrees>-87.645939</LongitudeDegrees>
</Position>
<AltitudeMeters>177.7</AltitudeMeters>
<DistanceMeters>71.5</DistanceMeters>
<Cadence>67</Cadence>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>104</Watts>
</TPX>
</Extensions>
</Trackpoint>
<Trackpoint>
<Time>2015-05-30T11:35:51Z</Time>
<Position>
<LatitudeDegrees>41.963076</LatitudeDegrees>
<LongitudeDegrees>-87.646094</LongitudeDegrees>
</Position>
<AltitudeMeters>178.0</AltitudeMeters>
<DistanceMeters>75.7</DistanceMeters>
<Cadence>67</Cadence>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>156</Watts>
</TPX>
</Extensions>
</Trackpoint>
I've reviewed every entry for <Watts>\d*</Watts> where the corresponding Cadence was zero (if I'm not pedaling, watts should be zero).
:g/<Cadence>0/.,+3s/<Watts>[1-9]\d*/<Watts>0/c
But that did not resolve the issue. My next step is to find entries where the distance between tracepoints does not change and which contains a wattage greater than zero. This returns E65: Illegal back reference
:g/<DistanceMeters>\(\d*\)/+1,+15s/<DistanceMeters>\1/
CLARIFICATION:
I'm looking for locations where Watts must be zero and are not. These would be where I am not pedaling (Cadence = 0) and also when I am not moving (Consecutive distance nodes that are identical). I've already corrected the Wattage for cadence = 0, but don't know how to find consecutive <DistanceMeters>N</DistanceMeters> nodes where N is unchanged.
Given enough determination, it can be done:
:%s/\m<DistanceMeters>\([0-9.]\+\)<\/DistanceMeters>\n\(.*\n\)\{1,15}\s\+<DistanceMeters>\1<\/DistanceMeters>\n\(.*\n\)\{3}\s\+<Watts>\zs[1-9]\d*\ze<\/Watts>/0/gc
However, why on Earth would you want to do that in Vim?
Related
For a project I am working on, which uses annual financial reports data (of multiple categories) from companies which have been successful or gone bust/into liquidation, I previously created a (fairly well performing) model on AWS Sagemaker using a multiple linear regression algorithm (specifically, the AWS stock algorithm for logistic regression/classification problems - the 'Linear Learner' algorithm)
This model just produces a simple "company is in good health" or "company looks like it will go bust" binary prediction, based on one set of annual data fed in; e.g.
query input: {data:[{
"Gross Revenue": -4000,
"Balance Sheet": 10000,
"Creditors": 4000,
"Debts": 1000000
}]}
inference output: "in good health" / "in bad health"
I trained this model by just ignoring what year for each company the values were from and pilling in all of the annual financial reports data (i.e. one years financial data for one company = one input line) for the training, along with the label of "good" or "bad" - a good company was one which has existed for a while, but hasn't gone bust, a bad company is one which was found to have eventually gone bust; e.g.:
label
Gross Revenue
Balance Sheet
Creditors
Debts
good
10000
20000
0
0
bad
0
5
100
10000
bad
20000
0
4
100000000
I hence used these multiple features (gross revenue, balance sheet...) along with the label (good/bad) in my training input, to create my first model.
I would like to use the same features as before as input (gross revenue, balance sheet..) but over multiple years; e.g take the values from 2020 & 2019 and use these (along with the eventual company status of "good" or "bad") as the singular input for my new model. However I'm unsure of the following:
is this an inappropriate use of logistic regression Machine learning? i.e. is there a more suitable algorithm I should consider?
is it fine, or terribly wrong to try and just use the same technique as before, but combine the data for both years into one input line like:
label
Gross Revenue(2019)
Balance Sheet(2019)
Creditors(2019)
Debts(2019)
Gross Revenue(2020)
Balance Sheet(2020)
Creditors(2020)
Debts(2020)
good
10000
20000
0
0
30000
10000
40
500
bad
100
50
200
50000
100
5
100
10000
bad
5000
0
2000
800000
2000
0
4
100000000
I would personally expect that a company which has gotten worse over time (i.e. companies finances are worse in 2020 than in 2019) should be more likely to be found to be a "bad"/likely to go bust, so I would hope that, if I feed in data like in the above example (i.e. earlier years data comes before later years data, on an input line) my training job ends up creating a model which gives greater weighting to the earlier years data, when making predictions
Any advice or tips would be greatly appreciated - I'm pretty new to machine learning and would like to learn more
UPDATE:
Using Long-Short-Term-Memory Recurrent Neural Networks (LSTM RNN) is one potential route I think I could try taking, but this seems to commonly just be used with multivariate data over many dates; my data only has 2 or 3 dates worth of multivariate data, per company. I would want to try using the data I have for all the companies, over the few dates worth of data there are, in training
I once developed a so called Genetic Time Series in R. I used a Genetic Algorithm which sorted out the best solutions from multivariate data, which were fitted on a VAR in differences or a VECM. Your data seems more macro economic or financial than user-centric and VAR or VECM seems appropriate. (Surely it is possible to treat time-series data in the same way so that we can use LSTM or other approaches, but these are very common) However, I do not know if VAR in differences or VECM works with binary classified labels. Perhaps if you would calculate a metric outcome, which you later label encode to a categorical feature (or label it first to a categorical) than VAR or VECM may also be appropriate.
However you may add all yearly data points to one data points per firm to forecast its survival, but you would loose a lot of insight. If you are interested in time series ML which works a little bit different than for neural networks or elastic net (which could also be used with time series) let me know. And we can work something out. Or I'll paste you some sources.
Summary:
1.)
It is possible to use LSTM, elastic NEt (time points may be dummies or treated as cross sectional panel) or you use VAR in differences and VECM with a slightly different out come variable
2.)
It is possible but you will loose information over time.
All the best,
Patrick
currently I'm experimenting a bit with RRDTool. I'm aware that the accuracy gets lower the longer the time periods are selected. But I thought I could bypass this with my datasource settings.
For example temperature and humidity from my house, resoultion 1h:
And now with the resolution of 1d:
As you could see, there is a great difference for the max. value of the blue line.
I created my datasources and archives with this values:
"rrdtool create temp.rrd --step 30",
"DS:temp:GAUGE:60:U:U",
"DS:humidity:GAUGE:60:U:U",
"RRA:AVERAGE:0.5:1:1051200",
"RRA:MAX:0.5:1:1051200",
"RRA:MIN:0.5:1:1051200",
I thought that 1051200 (1 year = 31536000 / 30 s (resoulution) = 1051200) is correct for saving every value for a year and that there should be no need for interpolating.
Is it possible to get the exact values displayed even if the resolution changes (for example the max humidity (Luftfeuchtigkeit) at 99.9%)?
Here are my values for image creation:
"--start" => "-1h", (-1d etc-)
"--title" => "Haustemperatur",
"--vertical-label" => "°C / % RLF",
"--width" => 800,
"--height" => 600,
"--lower-limit" => "-5",
"DEF:temperatur=$rrdFile:temperatur:LAST",
"DEF:humidity=$rrdFile:humidity:LAST",
"LINE1:temperatur#33CC33:Temperatur",
"GPRINT:temperatur:LAST:\t\tAktuell\: %4.2lf °C",
"GPRINT:temperatur:AVERAGE:Schnitt\: %4.2lf °C",
"GPRINT:temperatur:MAX:Maximum\: %4.2lf °C\j",
"LINE1:humidity#0000FF:Relative Luftfeuchtigkeit",
"GPRINT:humidity:LAST:Aktuell\: %4.2lf %%",
"GPRINT:humidity:AVERAGE:Schnitt\: %4.2lf %%",
"GPRINT:humidity:MAX:Maximum\: %4.2lf %%\j",
Thanks for your help and any suggestions.
P.S. I'm using a library to generate the graphs and the database, please do not be surprised about possible syntax errors.
Your problem is that you are causing the values to be rolled-up on the fly at graph time, but have not correctly specified which rollup function to use. Your second graph is showing the MAXIMUM of the LAST in the interval, not the true Maximum.
There are a few issues to explain with this configuration:
Firstly, your RRD is defined using 3 RRAs with 1cdp=1pdp and different consolidation functions (AVG, MIN, MAX). This means they are functionally identical, but they do not save you any time at graphing as they have not done any pre-rollup for you! You should definitely consider having just one of these (probably AVG) and adding others at lower resolution to help speed up graphing when you have a bigger time window.
Secondly, you need to specify the on-the-fly rollup function. When graphing, RRDTool will work out the best RRA to use based on your DEF lines, and will perform any additional consolidation required on the fly. This can take a long time if the only available RRA is too high-granularity.
Your graph request uses DEF:temperatur=$rrdFile:temperatur:LAST but you do not actually have a LAST type RRA, so RRDTool will grab the last average. Your RRA data points are at 30s interval, but your second graph has (approx) 5min per pixel, meaning that RRDTool needs to grab the 10 entries from the RRA, and print the last. Looking at the data in the top graph, it seems that the last in that interval was the 66 value, though previous ones were 100.
So you have a choice. Do you want the graph to show the average for the time period, the maximum, or both? Do you want the figures at the bottom to show the maximum of the average, or the maximum of everything?
For example
"DEF:temperatur=$rrdFile:temperatur:AVERAGE",
"DEF:humidity=$rrdFile:humidity:AVERAGE",
"DEF:temperaturmax=$rrdFile:temperatur:MAX;reduce=MAX",
"DEF:humiditymax=$rrdFile:humidity:MAX;reduce=MAX",
"LINE1:temperatur#33CC33:Temperatur",
"LINE1:temperaturmax#66EE66:Maximum Temperatur",
"GPRINT:temperatur:LAST:\t\tAktuell\: %4.2lf °C",
"GPRINT:temperatur:AVERAGE:Schnitt\: %4.2lf °C",
"GPRINT:temperaturmax:MAX:Maximum\: %4.2lf °C\j",
"LINE1:humidity#0000FF:Relative Luftfeuchtigkeit",
"LINE1:humiditymax#3333FF:Maximum Luftfeuchtigkeit",
"GPRINT:humidity:LAST:Aktuell\: %4.2lf %%",
"GPRINT:humidity:AVERAGE:Schnitt\: %4.2lf %%",
"GPRINT:humiditymax:MAX:Maximum\: %4.2lf %%\j",
In this case, we define a separate DEF for the maximum data set, so that we can always obtain the highest value even after consolidation. This is also used in the GPRINT so that we get the MAX of the MAX rather than the MAX of the AVERAGE. The Maximum line is now drawn separately to the average line, so that we can see the effect of any rollup of data - the lines will be together at high-resolution but get further apart as the time window widens and resolution decreases.
TheDEF is set to force any rollup function used for the maxima to be MAX rather than AVG, so we can be sure to get the maximum rather than average of maxima.
We are also using AVERAGE rather than LAST in order to get more meaningful data after rollup. Note that we could also use a separate DEF for the LAST as well if we wanted to though it is of less usefulness.
Note that, if you ever expect to be generating graphs over more than a few days, you should definitely consider adding some lower-resolution RRAs for AVERAGE and MAX or else the graphs will generate very slowly. RRDTool is designed with the intention that data will be rolled up over time, rather than (as in a traditional database) every sample kept as-is. So, unless you really need to have 30s resolution data kept for an entire year, you may prefer to keep this high resolution data for only a week, and then have separate RRAs that roll up to 1 hour resolution and keep for longer. Many people keep the 30s for 2 days, then 30min-summary for 2 weeks, 2h-summary for 2 months, and then 1day-summary for 2 years.
For more information, see the RRDTool manual pages.
Say one has a program that performs PCA.
The program calculates the number of PCs necessary in order to cover a given share of total variation in the data, e.g. 95 %.
Say the number of PCs necessary in order to cover 95 % of the variance is 10 for the data used at time t=1.
At t=2 we re-run the program with data from t=2.
For t=2 the number of PCs necessary in order to cover 95 % of the variance is 5.
Hence the number of necessary PCs in order to cover 95 % of the variance has dropped from 10 to 5 from t=1 to t=2.
Main question:
Can we make any conclusions about changes in the data from t=1 to t=2 in this case?
Example:
Can we say something like: "Since the number of PCs decreases from t=1 to t=2, there is more correlation in the data at t=1 than at t=2. With more correlation in the data, fewer PCs are needed to cover a given share of the varaince in the data."
Yes, If the original variables are strongly correlated, a reduced number of components can explain 80% to 90% of the variance, and the percentage of variance corresponds to the percentage of information from your data, that has been kept by the PCs. Furthermore, if you'd like to have more details about PCA, you can read this great comment: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579
I want to ask a question on how to compute the p-value without a t-stat table, just by looking at the table, like on the first page of the pdf in the following link http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . Like if I don't know the value 0.062, how can I know it is 0.062 by looking at other information from the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less number of parameters), while the second, 1.92, is the absolute value of the coefficient of interest divided by its standard error, or the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored DoF with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
If you look at many statistics textbooks, you will find a table called the Z-table which will give you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate columns of numbers to find the p-value. In your example, you would see the test statistic between .06 and .07 and those fingers would tap out that it was closer to .06 and do a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546
I've been doing some work for my exams in a few days and I'm going through some past papers but unfortunately there are no corresponding answers. I've answered the question and I was wondering if someone could tell me if I am correct.
My question is
(c) A transactional dataset, T, is given below:
t1: Milk, Chicken, Beer
t2: Chicken, Cheese
t3: Cheese, Boots
t4: Cheese, Chicken, Beer,
t5: Chicken, Beer, Clothes, Cheese, Milk
t6: Clothes, Beer, Milk
t7: Beer, Milk, Clothes
Assume that minimum support is 0.5 (minsup = 0.5).
(i) Find all frequent itemsets.
Here is how I worked it out:
Item : Amount
Milk : 4
Chicken : 4
Beer : 5
Cheese : 4
Boots : 1
Clothes : 3
Now because the minsup is 0.5 you eliminate boots and clothes and make a combo of the remaining giving:
{items} : Amount
{Milk, Chicken} : 2
{Milk, Beer} : 4
{Milk, Cheese} : 1
{Chicken, Beer} : 3
{Chicken, Cheese} : 3
{Beer, Cheese} : 2
Which leaves milk and beer as the only frequent item set then as it is the only one above the minsup?
I agree you should go for the Apriori Algorithm.
The Apriori algorithm is based on the idea that for a pair o items to be frequent, each individual item should also be frequent.
If the hamburguer-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.
So for the algorithm, it is established a "threshold X" to define what is or it is not frequent. If an item appears more than X times, it is considered frequent.
The first step of the algorithm is to pass for each item in each basket, and calculate their frequency (count how many time it appears).
This can be done with a hash of size N, where the position y of the hash, refers to the frequency of Y.
If item y has a frequency greater than X, it is said to be frequent.
In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that
we compute only for items that are individually frequent. So if item y and item z are frequent on itselves,
we then compute the frequency of the pair. This condition greatly reduces the pairs to compute, and the amount of memory taken.
Once this is calculated, the frequencies greater than the threshold are said frequent itemset.
(http://girlincomputerscience.blogspot.com.br/2013/01/frequent-itemset-problem-for-mapreduce.html)
There are two ways to solve the problem:
using Apriori algorithm
Using FP counting
Assuming that you are using Apriori, the answer you got is correct.
The algorithm is simple:
First you count frequent 1-item sets and exclude the item-sets below minimum support.
Then count frequent 2-item sets by combining frequent items from previous iteration and exclude the item-sets below support threshold.
The algorithm can go on until no item-sets are greater than threshold.
In the problem given to you, you only get 1 set of 2 items greater than threshold so you can't move further.
There is a solved example of further steps on Wikipedia here.
You can refer "Data Mining Concepts and Techniques" by Han and Kamber for more examples.
OK to start, you must first understand, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an A&T massively parallel system with 483 processors running a centralized database. Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.
Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule. Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. This is, however, seldom the case in real- life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, those rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. This paper proposes a novel technique to solve this problem. The technique allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. In rule mining, different rules may need to satisfy different minimum supports depending on what items are in the rules.
Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).
I hope that once you understand the very basics of data mining that the answer to this question shall become apparent.