How can I calculate the moving average of a time series that has breaks using PROC EXPAND? I am looking for a more efficient way to do this calculation, because DATA steps and joins on large datasets take a very long time to execute due to server constraints.
Data:
data have(drop=i);
    call streaminit(1);
    do i = 1 to 10;
        period = i;
        if i > 5 then period = i + 5;
        if i > 8 then period = i + 6;
        x = round(rand('uniform')*10, 1.);
        output;
    end;
run;
│ period │ x │
├────────┼────┤
│ 1 │ 9 │
│ 2 │ 10 │
│ 3 │ 5 │
│ 4 │ 9 │
│ 5 │ 7 │
│ 11 │ 9 │
│ 12 │ 9 │
│ 13 │ 5 │
│ 15 │ 8 │
│ 16 │ 9 │
Notice that there are two breaks in the period variable: 5→11 and 13→15.
Here is the expected result (a 3-period moving average; the window is defined by the period value rather than the row, so the average at period 15 uses only periods 13 and 15: (5+8)/2 = 6.50):
proc sql;
    create table want as
    select a.period, a.x,
           mean(b.x) as x_avg format=10.2
    from have as a
    left join have as b
        on a.period - 3 < b.period <= a.period
    group by 1, 2;
quit;
│ period │ x │ x_avg │
├────────┼────┼───────┤
│ 1 │ 9 │ 9.00 │
│ 2 │ 10 │ 9.50 │
│ 3 │ 5 │ 8.00 │
│ 4 │ 9 │ 8.00 │
│ 5 │ 7 │ 7.00 │
│ 11 │ 9 │ 9.00 │
│ 12 │ 9 │ 9.00 │
│ 13 │ 5 │ 7.67 │
│ 15 │ 8 │ 6.50 │
│ 16 │ 9 │ 8.50 │
Use PROC TIMESERIES to insert missing values into each gap, then run the result through PROC EXPAND with METHOD=NONE. We can treat the interval as daily, since period increments by one at a time. Finally, filter the output down to rows where x is not missing.
proc timeseries data=have
                out=have_ts;
    id period interval=day setmissing=missing;
    var x;
run;

proc expand data=have_ts
            out=want(where=(not missing(x)));
    id period;
    convert x = x_avg / method=none transformout=(movave 3);
run;
You'll need to reset the format on period back to 8. with PROC DATASETS, since PROC TIMESERIES has to treat it as a date:
proc datasets lib=work nolist;
    modify want;
    format period 8.;
quit;
You can make the SQL faster with a small modification.
proc sql noprint;
    create table want2 as
    select a.period, a.x,
           mean(b1.x, b2.x, a.x) as x_avg format=10.2
    from have as a
    left join have as b1 on a.period - 2 = b1.period
    left join have as b2 on a.period - 1 = b2.period
    order by a.period;
quit;
And faster still with a DATA step.
data want3;
    set have;
    period_l2 = lag2(period);
    period_l1 = lag(period);
    /* use a lagged x only when the lagged period is actually adjacent;  */
    /* ifn() evaluates all its arguments, so the lag queues stay in sync */
    x_l2 = ifn(period_l2 = period-2, lag2(x), ifn(period_l1 = period-2, lag(x), .));
    x_l1 = ifn(period_l1 = period-1, lag(x), .);
    x_avg = mean(x_l2, x_l1, x);
run;
If the window length is ever not 3, use arrays and mean(of _array_[*]) to generalize.
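Outside SAS, the same insert-the-gaps idea is easy to sanity-check. Here is a minimal pandas sketch (hypothetical, hard-coding the ten rows from the table above):
import pandas as pd

# the "have" data from the question, keyed by period
df = pd.DataFrame({"period": [1, 2, 3, 4, 5, 11, 12, 13, 15, 16],
                   "x": [9, 10, 5, 9, 7, 9, 9, 5, 8, 9]}).set_index("period")

# insert the missing periods as NaN rows (the PROC TIMESERIES step)
full = df.reindex(range(df.index.min(), df.index.max() + 1))

# 3-period moving average; NaN rows within a window are simply ignored
full["x_avg"] = full["x"].rolling(3, min_periods=1).mean()

# keep only the original observations (the where=(not missing(x)) step)
want = full.dropna(subset=["x"]).round({"x_avg": 2})
This reproduces the x_avg column of the expected result, including 7.67 at period 13 and 6.50 at period 15.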
Related
I'm totally new to Weka and data science. I got an assignment to detect which of the following iris attributes (SW, SL, PW, PL) belongs to which cluster. Can you assist me? Thanks!
The iris dataset that comes with Weka has three classes (Iris-setosa, Iris-versicolor, Iris-virginica).
If you want to see how well the clusters determined by your cluster algorithm align with the class labels, you need to select Classes to clusters evaluation in the Weka Explorer, or use the -c <class_att_index> option on the command line.
The following command uses SimpleKMeans with three clusters on the iris dataset that comes with Weka (-c last uses the last attribute as class and performs clusters to classes evaluation):
java -cp weka.jar weka.clusterers.SimpleKMeans -N 3 -c last -t data/iris.arff
This results in the following output:
=== Clustering stats for training data ===
kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 6.998114004826762
Initial starting points (random):
Cluster 0: 6.1,2.9,4.7,1.4
Cluster 1: 6.2,2.9,4.3,1.3
Cluster 2: 6.9,3.1,5.1,2.3
Missing values globally replaced with mean/mode
Final cluster centroids:
                            Cluster#
Attribute      Full Data          0          1          2
                 (150.0)     (61.0)     (50.0)     (39.0)
==========================================================
sepallength       5.8433     5.8885      5.006     6.8462
sepalwidth         3.054     2.7377      3.418     3.0821
petallength       3.7587     4.3967      1.464     5.7026
petalwidth        1.1987      1.418      0.244     2.0795
Clustered Instances
0 61 ( 41%)
1 50 ( 33%)
2 39 ( 26%)
Class attribute: class
Classes to Clusters:
  0  1  2  <-- assigned to cluster
  0 50  0 | Iris-setosa
 47  0  3 | Iris-versicolor
 14  0 36 | Iris-virginica
Cluster 0 <-- Iris-versicolor
Cluster 1 <-- Iris-setosa
Cluster 2 <-- Iris-virginica
Incorrectly clustered instances : 17.0 11.3333 %
I have a CSV file which I need to parse using Python.
triggerid,timestamp,hw0,hw1,hw2,hw3
1,234,343,434,78,56
2,454,22,90,44,76
I need to read the file line by line and slice out the triggerid, timestamp and hw3 columns. But the column order may change from run to run, so I need to match the field names, work out each column's position, and then print the output file as:
triggerid,timestamp,hw3
1,234,56
2,454,76
Also, is there a way to generate a hash table (like we have in Perl) so that I can store an entire column, e.g. hw0 as the key and the column's values as the value, for other modifications?
I'm unsure what you mean by "count the column".
An easy way to read the data in would be pandas, which was designed for just this sort of manipulation. This creates a pandas DataFrame from your data, using the first row as column titles.
In [374]: import pandas as pd
In [375]: d = pd.read_csv("30735293.csv")
In [376]: d
Out[376]:
triggerid timestamp hw0 hw1 hw2 hw3
0 1 234 343 434 78 56
1 2 454 22 90 44 76
You can select one of the columns using a single column name, and multiple columns using a list of names:
In [377]: d[["triggerid", "timestamp", "hw3"]]
Out[377]:
triggerid timestamp hw3
0 1 234 56
1 2 454 76
You can also adjust the indexing so that one or more of the data columns are used as index values:
In [378]: d1 = d.set_index("hw0"); d1
Out[378]:
triggerid timestamp hw1 hw2 hw3
hw0
343 1 234 434 78 56
22 2 454 90 44 76
Using the .loc attribute you can retrieve a series for any indexed row:
In [390]: d1.loc[343]
Out[390]:
triggerid 1
timestamp 234
hw1 434
hw2 78
hw3 56
Name: 343, dtype: int64
You can use the column names to retrieve the individual column values from that one-row series:
In [393]: d1.loc[343]["triggerid"]
Out[393]: 1
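The hash-table part of the question also falls straight out of the frame; a minimal sketch that stores every column's values under its header name:
ht = {col: d[col].tolist() for col in d.columns}
ht["hw0"]   # [343, 22]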
Since you already have a solution for the slices, here's something for the hash-table part of the question:
import csv

with open('/path/to/file.csv', newline='') as fin:
    ht = {}
    cr = csv.reader(fin)
    k = next(cr)[2]            # header name of the third column, e.g. 'hw0'
    ht[k] = []
    for line in cr:
        ht[k].append(line[2])  # collect that column's values under the key
I used a different approach (using the .index function):
bpt_mode = ["bpt_mode_64", "bpt_mode_128"]
with open('StripValues.csv') as f:
    stats = next(f).strip().split(",")   # keep the header so .index() can locate columns
    for line in f:
        stat_values = line.strip().split(",")
        print(stat_values[stats.index('trigger_id')], end=',')
        for mode in bpt_mode:
            print(stat_values[stats.index('hw.gpu.s0.ss0.dg.' + mode)], end=',')
        print()
# the with block closes the file; no explicit close() is needed
@holdenweb Though I am unable to figure out how to print the output to a file; currently I am redirecting when running the script. Can you provide a solution for writing to a file? There will be multiple writes to a single file.
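A minimal sketch for that (the file names input.csv and output.csv are hypothetical): open the output file once, wrap it in a csv.writer, and call writerow as many times as you like:
import csv

wanted = ["triggerid", "timestamp", "hw3"]   # columns to keep, matched by name

with open("input.csv", newline="") as fin, open("output.csv", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)                # one writer, many writes to a single file
    header = next(reader)
    idx = [header.index(name) for name in wanted]
    writer.writerow(wanted)
    for row in reader:
        writer.writerow([row[i] for i in idx])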
I have simple CSV file that looks like this:
inches,12,3,56,80,45
tempF,60,45,32,80,52
I read in the CSV using this command:
import pandas as pd
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0)
Which results in this structure:
         1   2   3   4   5
0
inches  12   3  56  80  45
tempF   60  45  32  80  52
But I want this (unnamed index column):
         0   1   2   3   4
inches  12   3  56  80  45
tempF   60  45  32  80  52
EDIT: As @joris pointed out, additional methods can be run on the resulting DataFrame to achieve the wanted structure. My question is specifically about whether or not this structure can be achieved through read_csv arguments alone.
from the documentation of the function:
names : array-like
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
so, apparently:
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0, names=range(5))
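For completeness, the post-read route @joris alluded to would look something like this (a sketch, assuming the pd_obj from the question):
pd_obj.index.name = None    # remove the "0" label above the index
pd_obj.columns = range(5)   # renumber the remaining columns 0..4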
My data set is in this format as mentioned below:
NEWID
Age
H_PERS
Income
OCCU
FAMTYPE
REGION
Metro(Yes/No)
Exp_alcohol
population sample (the weighted population each NEWID represents), etc.
I would like to generate a summarized view like below:
average expenditure value (this should be the sum of exp_alcohol/population sample)
% of population sample across REGION, Metro and each demographic variable
Please help me with your ideas.
Since I can't see your data set and your description was not very clear, I'm going to guess that you have data that looks something like this, and that you would like to add some new variables that summarize your data...
data alcohol;
    input NEWID Age H_PERS Income OCCU $ FAMTYPE $ REGION $ Metro $
          Exp_alcohol population_sample;
    datalines;
1234 32 4 65000 abc m CA Yes 2 4
5678 23 5 35000 xyz s WA Yes 3 6
9923 34 3 49000 def d OR No 3 9
8844 26 4 54000 gdp m CA No 1 5
;
run;
data summar;
    set alcohol;
    retain TotalAvg_expend metro_count total_pop;
    Divide = exp_alcohol/population_sample;
    TotalAvg_expend + Divide;        /* running sum of the per-NEWID ratios */
    total_pop + population_sample;   /* running population total            */
    if metro = 'Yes' then metro_count + population_sample;
    percent_metro = (metro_count/total_pop)*100;
    drop NEWID Age H_PERS Income OCCU FAMTYPE REGION Divide;
run;
Output:
        Exp_      population_  TotalAvg_  metro_  total_  percent_
Metro   alcohol   sample       expend     count   pop     metro
Yes     2         4            0.50000     4       4      100.000
Yes     3         6            1.00000    10      10      100.000
No      3         9            1.33333    10      19       52.632
No      1         5            1.53333    10      24       41.667
I have a bunch of directories that I need to restructure. They're in a format as such:
./1993-02-22 - The Moon - Tallahassee, FL/**files**
./1993-02-23 - The Moon - Tallahassee, FL/**files**
./1993-02-24 - The Moon - Tallahassee, FL/**files**
./1993-02-25 - The Moon - Tallahassee, FL/**files**
./1993-03-01 - The Test - Null, FL/**files**
I want to extract the dates from the beginning of each folder name (for example, with the regex ([0-9]{4})\-([0-9]{1,2})\-([0-9]{1,2})) and reformat the directories to ./year/month/day.
So, it should output:
./1993/02/22/**files**
./1993/02/23/**files**
./1993/02/24/**files**
./1993/02/25/**files**
./1993/03/01/**files**
How can I go about doing that from a command line?
I understood the question in a different way than Kent. I thought you wanted to create a new tree of directories from each original one and move all the files it contained. You could try the following Perl script if that is what you were looking for:
perl -MFile::Path=make_path -MFile::Copy=move -e '
    for ( grep { -d } @ARGV ) {
        @date = m/\A(\d{4})-(\d{2})-(\d{2})/;
        next unless @date;
        $outdir = join q{/}, @date;
        make_path( $outdir );
        move( $_, $outdir );
    }
' *
It reads every entry of the current directory (* passed as arguments) and applies a two-step filter. The first step is the grep, which keeps only directories; the second is the next unless @date, which skips names that don't begin with a date. It then joins the date's components into a path, creates that path if it doesn't exist, and moves the old directory with all its files into the new one.
A test. Here is the result of ls -lR showing the initial state:
.:
total 24
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:56 1993-02-22 - The Moon - Tallahassee, FL
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:56 1993-02-23 - The Moon - Tallahassee, FL
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:56 1993-02-24 - The Moon - Tallahassee, FL
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:57 1993-02-25 - The Moon - Tallahassee, FL
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:57 1993-03-01 - The Test - Null, FL
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:47 dummy_dir
-rw-r--r-- 1 dcg dcg 0 sep 7 00:47 dummy_file
./1993-02-22 - The Moon - Tallahassee, FL:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:56 file1
-rw-r--r-- 1 dcg dcg 0 sep 7 00:56 file2
./1993-02-23 - The Moon - Tallahassee, FL:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:56 file3
./1993-02-24 - The Moon - Tallahassee, FL:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:56 file4
./1993-02-25 - The Moon - Tallahassee, FL:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:57 file5
-rw-r--r-- 1 dcg dcg 0 sep 7 00:57 file6
./1993-03-01 - The Test - Null, FL:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:57 file7
./dummy_dir:
total 0
After running the script, note that the base directory keeps only the dummy entries and the root of the newly created tree (1993). Running the same ls -lR yields:
.:
total 8
drwxr-xr-x 4 dcg dcg 4096 sep 7 00:59 1993
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:47 dummy_dir
-rw-r--r-- 1 dcg dcg 0 sep 7 00:47 dummy_file
./1993:
total 8
drwxr-xr-x 6 dcg dcg 4096 sep 7 00:59 02
drwxr-xr-x 3 dcg dcg 4096 sep 7 00:59 03
./1993/02:
total 16
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:56 22
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:56 23
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:56 24
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:57 25
./1993/02/22:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:56 file1
-rw-r--r-- 1 dcg dcg 0 sep 7 00:56 file2
./1993/02/23:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:56 file3
./1993/02/24:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:56 file4
./1993/02/25:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:57 file5
-rw-r--r-- 1 dcg dcg 0 sep 7 00:57 file6
./1993/03:
total 4
drwxr-xr-x 2 dcg dcg 4096 sep 7 00:57 01
./1993/03/01:
total 0
-rw-r--r-- 1 dcg dcg 0 sep 7 00:57 file7
./dummy_dir:
total 0
Suppose your files are stored in an "old" folder; then you may write a shell script like the one below (avoiding a "for" loop, which has trouble with file names containing spaces):
mkdir -p new
ls -d -1 old/*/* | while read oldfile; do
    newfile=`echo "$oldfile" | sed -r 's#^old/([0-9]{4})\-([0-9]{1,2})\-([0-9]{1,2})(.*)$#new/\1/\2/\3/\4#'`
    newdir=`echo "$newfile" | sed 's#/[^/]*$##'`
    echo "Creating \"$newdir\""
    mkdir -p "$newdir"
    echo "Moving files from \"$oldfile\" to \"$newfile\""
    cp -r "$oldfile" "$newfile"
done
Output of the script:
Creating "new/1993/02/22/ - The Moon - Tallahassee, FL"
Moving files from "old/1993-02-22 - The Moon - Tallahassee, FL/test" to "new/1993/02/22/ - The Moon - Tallahassee, FL/test"
Creating "new/1993/02/23/ - The Moon - Tallahassee, FL"
Moving files from "old/1993-02-23 - The Moon - Tallahassee, FL/test" to "new/1993/02/23/ - The Moon - Tallahassee, FL/test"
Creating "new/1993/02/24/ - The Moon - Tallahassee, FL"
Moving files from "old/1993-02-24 - The Moon - Tallahassee, FL/test" to "new/1993/02/24/ - The Moon - Tallahassee, FL/test"
Creating "new/1993/02/25/ - The Moon - Tallahassee, FL"
Moving files from "old/1993-02-25 - The Moon - Tallahassee, FL/test" to "new/1993/02/25/ - The Moon - Tallahassee, FL/test"
Creating "new/1993/03/01/ - The Tes - Null, FL"
Moving files from "old/1993-03-01 - The Tes - Null, FL/test2" to "new/1993/03/01/ - The Tes - Null, FL/test2"
and you'll find your new tree in ... the "new" folder indeed:
$ tree old new
old
├── 1993-02-22 - The Moon - Tallahassee, FL
│ └── test
├── 1993-02-23 - The Moon - Tallahassee, FL
│ └── test
├── 1993-02-24 - The Moon - Tallahassee, FL
│ └── test
├── 1993-02-25 - The Moon - Tallahassee, FL
│ └── test
└── 1993-03-01 - The Tes - Null, FL
└── test2
new
└── 1993
├── 02
│ ├── 22
│ │ └── - The Moon - Tallahassee, FL
│ │ └── test
│ ├── 23
│ │ └── - The Moon - Tallahassee, FL
│ │ └── test
│ ├── 24
│ │ └── - The Moon - Tallahassee, FL
│ │ └── test
│ └── 25
│ └── - The Moon - Tallahassee, FL
│ └── test
└── 03
└── 01
└── - The Tes - Null, FL
└── test2
like this?
kent$ awk '{gsub(/-/,"/",$1);sub(/^[^/]*\//,"/",$NF);print $1$NF}' file
./1993/02/22/**files**
./1993/02/23/**files**
./1993/02/24/**files**
./1993/02/25/**files**
./1993/03/01/**files**
Here is a sed+awk version, after so many comments:
awk 'BEGIN{FS="-"}{print $1"/"$2"/"$3}' file | awk 'BEGIN{FS="**files**"}{print $1"/"FS}' | sed -e 's/ \//\//g'
./1993/02/22/**files**
./1993/02/23/**files**
./1993/02/24/**files**
./1993/02/25/**files**
./1993/03/01/**files**