I want to see the frequency of the data for each year.
My array looks like this: List[Data, Year]
List[[259,1910],[259,1910],[259,1910],[192,1910].....
Data Year
259 1910
259 1910
259 1910
192 1910
313 1910
259 1911
259 1911
192 1912
313 1912
I want to get a result like this, including the combinations that never occur (frequency 0):
Data Year Frequency
259 1910 3
259 1911 2
259 1912 0
192 1910 1
192 1911 0
192 1912 1
...
..
.
You can use a dictionary to count frequencies. Python allows tuples as dictionary keys, so each (data, year) pair can be its own key.
data = [259, 259, 192, 313, 259, 259, 192, 313]
yrs = [1910, 1910, 1910, 1910, 1911, 1911, 1912, 1912]

# Count how often each (data, year) pair occurs.
frequencies = {}
for idx in range(len(data)):
    key = (data[idx], yrs[idx])
    if key in frequencies:
        frequencies[key] += 1
    else:
        frequencies[key] = 1

# Flatten the counts into (data, year, frequency) rows.
data_with_freq = []
for key, freq in frequencies.items():  # items() replaces Python 2's iteritems()
    print(key[0], key[1], freq)
    data_with_freq.append((key[0], key[1], freq))
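The loop above only reports pairs that actually occur, while the question also asks for rows with frequency 0. A minimal sketch of one way to get those, assuming the sample rows from the question; the names rows, values, years and table are only illustrative:

```python
from collections import Counter
from itertools import product

# Sample (data, year) rows copied from the question.
rows = [(259, 1910), (259, 1910), (259, 1910), (192, 1910), (313, 1910),
        (259, 1911), (259, 1911), (192, 1912), (313, 1912)]

counts = Counter(rows)  # a Counter returns 0 for keys it has never seen

# Distinct data values in first-seen order, distinct years in ascending order.
values = list(dict.fromkeys(d for d, _ in rows))
years = sorted({y for _, y in rows})

# Pair every value with every year so that absent combinations show up as 0.
table = [(d, y, counts[(d, y)]) for d, y in product(values, years)]
for d, y, freq in table:
    print(d, y, freq)
```

Counter plus itertools.product keeps the counting and the "fill in the zeros" step separate, which may be easier to adapt than a nested loop.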
I wanted to see if this was doable in SAS. I have a dataset of the members of congress and want to split full name into first and last. However, occasionally they seem to list their middle initial or name. It is from a .txt file.
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
Good day,
SAS is a bit clunky when it comes to strings, but it can be done. As others have mentioned, defining the logic is the really hard part.
Begin with some raw data...
data begin;
    input raw_str $ 1-100;
    cards;
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
;
run;
First, I select the leading names up to the first bracket and count the number of strings (words) in them:
data names;
    set begin;
    names_only = scan(raw_str, 1, '[');
    Nr_of_str = countw(names_only, ' ');
run;
Assumption: the first string is the last name.
If there are only 2 strings, the first and last are pretty easy with scan and substring:
data names2;
    set names;
    if Nr_of_str = 2 then do;
        /* ',' is listed as a delimiter so the trailing comma is dropped */
        last_name = scan(names_only, 1, ', ');
        _FirstBlank = find(names_only, ' ');
        first_name = strip(substr(names_only, _FirstBlank));
    end;
run;
Assumption: there are at most 3 strings.
Approach 1: the middle name (or initial) has a dot in it, so filter it out.
Approach 2: the middle name is shorter than the first name, so take the longer string:
data names3;
    set names2;
    if Nr_of_str > 2 then do;
        last_name = scan(names_only, 1, ', ');  /* this should still hold; ',' drops the comma */
        _FirstBlank = find(names_only, ' ');    /* substring approach */
        first_name = strip(substr(names_only, _FirstBlank));
        second_str = scan(names_only, 2, ' ');
        third_str = scan(names_only, 3, ' ');
        /* 1st approach: skip the string that contains a dot */
        if find(second_str, '.') = 0 then
            first_name = second_str;
        else
            first_name = third_str;
        /* 2nd approach: take the longer string (this overwrites the 1st approach) */
        if length(second_str) > length(third_str) then
            first_name = second_str;
        else
            first_name = third_str;
    end;
run;
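For readers who want to check the logic outside SAS, here is a rough Python sketch of approach 1 (drop anything that looks like an initial). It is not the SAS code above; the function name split_name is made up for illustration, and it assumes every line has the form "Last, First [Middle] [Party-ST] ...":

```python
def split_name(raw):
    """Return (first_name, last_name) from a 'Last, First [Middle] [P-ST] ...' line."""
    names_only = raw.split("[", 1)[0].strip()   # everything before the first bracket
    last, _, rest = names_only.partition(",")   # the last name precedes the comma
    parts = rest.split()
    # Approach 1: ignore tokens containing a dot, since they are assumed to be initials.
    given = [p for p in parts if "." not in p]
    first = given[0] if given else (parts[0] if parts else "")
    return first, last.strip()


print(split_name("Schakowsky, Janice D. [D-IL] 6 0 289 186 0"))
print(split_name("Norton, Eleanor Holmes [D-DC] 16 0 440 288 0"))
```

A full middle name such as "Holmes" is simply ignored here because only the first remaining token is kept.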
For more, see the SAS documentation for the SUBSTR and SCAN functions.
I have the following DataFrame:
prefix operator_name country_name mno_subscribers
0 267.0 Airtel Botswana 490
1 373.0 Orange Moldova 207
2 248.0 Airtel Seychelles 490
3 91.0 Reliance Bostwana 92
4 233.0 Vodafone Bostwana 516
I am trying to achieve this:
prefix operator_name country_name mno_subscribers operator_proba
0 267.0 Airtel Botswana 490 0.045
1 373.0 Orange Moldova 207 0.004
2 248.0 Airtel Seychelles 490 0.135
3 91.0 Reliance India 92 0.945
4 233.0 Vodafone Ghana 516 0.002
With this:
countries = df["country_name"].unique()
df["operator_proba"] = 0
for country in countries:
    country_name = df[df["country_name"] == country]
    for operator in country:
        mno_sum = country_name["mno_subscribers"].sum()
        df["operator_proba"]["country_name"] = country_name["mno_subscribers"] / mno_sum
Where am I going wrong in assigning the operator_proba to the original DataFrame?
This line
df["operator_proba"]["country_name"] = country_name["mno_subscribers"] / mno_sum
cannot work as intended: df["operator_proba"] is a single column (a Series), so indexing it with ["country_name"] looks up a row label called "country_name" rather than the country column. It is also chained indexing, so the assignment may land on a temporary copy.
That is probably why things don't work for you.
It's not entirely clear what you want to achieve, but I guess this may work:
df['operator_proba'] = df['mno_subscribers'] / df.groupby('country_name')['mno_subscribers'].transform('sum')
This saves you the double loop and is more pandas-style: transform('sum') returns each country's total aligned to the original rows, so the division is done row by row. The result is:
prefix operator_name country_name mno_subscribers operator_proba
0 267.0 Airtel Botswana 490 1.000000
1 373.0 Orange Moldova 207 1.000000
2 248.0 Airtel Seychelles 490 1.000000
3 91.0 Reliance Bostwana 92 0.151316
4 233.0 Vodafone Bostwana 516 0.848684
With this limited data set (and the Botswana/Bostwana spelling difference), most "probabilities" are 1.
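To see where the numbers in that table come from, here is the same per-country share worked out in plain Python, without pandas. This is only a sketch of the arithmetic; rows, totals and shares are illustrative names, and the misspelled "Bostwana" is kept exactly as in the sample data:

```python
from collections import defaultdict

# (country_name, mno_subscribers) pairs from the sample frame, typo included.
rows = [("Botswana", 490), ("Moldova", 207), ("Seychelles", 490),
        ("Bostwana", 92), ("Bostwana", 516)]

# Total subscribers per country.
totals = defaultdict(int)
for country, subs in rows:
    totals[country] += subs

# Each operator's share of its country's total, rounded for display.
shares = [round(subs / totals[country], 6) for country, subs in rows]
print(shares)
```

The two "Bostwana" rows share a total of 608, which gives 92/608 and 516/608; every other country has a single row and therefore a share of 1.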
I'm looking for a suggested approach to the following that is time efficient in Pandas. Let's say I have a dataframe that looks like this:
[TimeStamp] [Val]
2017-08-19 22:28:42.000 151
2017-08-19 22:28:42.001 127
2017-08-19 22:29:42.000 149
2017-08-19 22:34:10.000 127
2017-08-19 22:35:10.000 126
2017-08-19 22:36:10.000 132
2017-08-19 22:37:10.000 129
2017-08-19 22:39:10.000 124
How would I get the duration when the Val exceeds 127?
So I'd expect an answer of:
22:28:42 -> 22:28:42.001
22:29:42 -> 22:34:10.000
22:36:10 -> 22:39:10.000
I would also like to then look at these date ranges and carry out actions like:
How many data points are there within each range where the value is above 127?
First sort your data by TimeStamp
>> df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
>> df = df.sort_values('TimeStamp')
Then find the positions where Val crosses the threshold, i.e. changes between "less than or equal to 127" and "greater than 127"
>> df['changed'] = (df['Val'] > 127).astype(int).diff().fillna(1).astype(int)
>> df
TimeStamp Val changed
0 2017-08-19 22:28:42.000 151 1
1 2017-08-19 22:28:42.001 127 -1
2 2017-08-19 22:29:42.000 149 1
3 2017-08-19 22:34:10.000 127 -1
4 2017-08-19 22:35:10.000 126 0
5 2017-08-19 22:36:10.000 132 1
6 2017-08-19 22:37:10.000 129 0
7 2017-08-19 22:39:10.000 124 -1
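Above, for a particular TimeStamp:
-1 means that Val dropped to 127 or below
+1 means that Val rose above 127
0 means no change. The first row has no predecessor, so fillna(1) marks it as a start; that is only correct when the first Val is above 127, as it is here.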
Finally construct the time intervals you need
>> pd.DataFrame({
>> 't_0': df.loc[df.changed == 1, 'TimeStamp'].reset_index(drop=True),
>> 't_n': df.loc[df.changed == -1, 'TimeStamp'].reset_index(drop=True)})
t_n t_0
0 2017-08-19 22:28:42.001 2017-08-19 22:28:42
1 2017-08-19 22:34:10.000 2017-08-19 22:29:42
2 2017-08-19 22:39:10.000 2017-08-19 22:36:10
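The follow-up question (how many data points fall inside each range) can be answered with the same start/stop rule. Below is a small sketch in plain Python rather than pandas, assuming the sample data from the question; THRESHOLD, rows and intervals are illustrative names:

```python
from datetime import datetime

THRESHOLD = 127
raw = [
    ("2017-08-19 22:28:42.000", 151), ("2017-08-19 22:28:42.001", 127),
    ("2017-08-19 22:29:42.000", 149), ("2017-08-19 22:34:10.000", 127),
    ("2017-08-19 22:35:10.000", 126), ("2017-08-19 22:36:10.000", 132),
    ("2017-08-19 22:37:10.000", 129), ("2017-08-19 22:39:10.000", 124),
]
rows = sorted((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"), val) for ts, val in raw)

intervals = []            # (start, end, number of rows above the threshold)
start, n_above = None, 0
for ts, val in rows:
    if val > THRESHOLD:
        if start is None:
            start = ts    # first row of a new run above the threshold
        n_above += 1
    elif start is not None:
        intervals.append((start, ts, n_above))  # run ends at the first row at or below
        start, n_above = None, 0
# A run that is still open when the data ends has no closing timestamp and is not reported.

for begin, end, n in intervals:
    print(begin, "->", end, "rows above threshold:", n)
```

Each interval ends at the first row that is back at or below the threshold, which matches the ranges listed in the question.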
Using SAS 9.3
I have files with two variables (Time and pulse), one file for each person.
I have the information which date they started measuring for each person.
Now I want to create a date variable that rolls over to the next date at midnight. How can I do that?
Example from text files:
23:58:02 106
23:58:07 105
23:58:12 103
23:58:17 98
23:58:22 100
23:58:27 97
23:58:32 99
23:58:37 100
23:58:42 99
23:58:47 104
23:58:52 95
23:58:57 96
23:59:02 98
23:59:07 96
23:59:12 104
23:59:17 109
23:59:22 105
23:59:27 111
23:59:32 111
23:59:37 104
23:59:42 110
23:59:47 100
23:59:52 106
23:59:57 114
00:00:02 123
00:00:07 130
00:00:12 130
00:00:17 125
00:00:22 119
00:00:27 116
00:00:32 122
00:00:37 116
00:00:42 119
00:00:47 117
00:00:52 114
00:00:57 114
00:01:02 110
00:01:07 103
00:01:12 98
00:01:17 98
00:01:22 102
00:01:27 97
00:01:32 99
00:01:37 93
00:01:42 97
00:01:47 103
00:01:52 96
00:01:57 93
00:02:02 93
00:02:07 95
00:02:12 106
00:02:17 99
00:02:22 102
00:02:27 96
00:02:32 93
00:02:37 97
00:02:42 102
00:02:47 101
00:02:52 95
00:02:57 92
00:03:02 100
00:03:07 95
00:03:12 102
00:03:17 102
00:03:22 109
00:03:27 109
00:03:32 107
00:03:37 111
00:03:42 112
00:03:47 113
00:03:52 115
Regex:
\d{2}:\d{2}:\d{2} \d*
See here for an example and play around with regex:
https://regex101.com/r/xF1fQ5/1
EDIT: and have a look at the SAS regex tip sheet: http://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf
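As an illustration of what the pattern captures, here is a short Python sketch (not SAS) that adds capture groups and uses \d+ for the pulse so that an empty value does not match. LINE_RE, sample and parsed are illustrative names, and the two sample lines are taken from the question:

```python
import re

# Same shape as the regex above, with groups for hour, minute, second and pulse.
LINE_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}) (\d+)")

sample = ["23:59:57 114", "00:00:02 123"]

parsed = []
for line in sample:
    match = LINE_RE.fullmatch(line)
    if match:  # silently skip lines that do not have the expected layout
        hour, minute, second, pulse = map(int, match.groups())
        parsed.append((hour, minute, second, pulse))

print(parsed)
```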
Something like this:
Date lastDate = startDate;
List<NData> ListData = new ArrayList<NData>();
for (FileData fdat : ListFileData) {
    Date nDate = this.getDate(lastDate, fdat.gettime());
    NData nData = new NData(nDate, fdat.getMeasuring());
    ListData.add(nData);
    lastDate = nDate;
}
...
// Put the new time of day on the previous date; if that goes backwards,
// midnight has passed, so move to the next day.
Date getDate(Date ld, String time) {
    org.joda.time.LocalDateTime lastDate = org.joda.time.LocalDateTime.fromDateFields(ld);
    org.joda.time.LocalTime timeOfDay = org.joda.time.LocalTime.parse(time); // expects "HH:mm:ss"
    org.joda.time.LocalDateTime newDate = lastDate.toLocalDate().toLocalDateTime(timeOfDay);
    if (newDate.isBefore(lastDate)) {
        newDate = newDate.plusDays(1);
    }
    return newDate.toDate();
}
It's hard to provide a complete answer without sample code, but the SAS lag() function might be enough to do what you need. Your data step would include lines like the following, assuming your time variable is called time and your date variable is called date:
retain date;
if time < lag(time) then date = date + 1;
This assumes you never have any 24 hour gaps (but it appears you'd have to assume that anyway).
This answer also assumes that the time field is already in a SAS time format.
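The retain/lag idea is easy to check on a few rows. Here is a minimal Python sketch of the same rule (bump the date whenever the time of day goes backwards); attach_dates is a made-up helper name and the start date is an arbitrary placeholder, since the real one comes from each person's record:

```python
from datetime import date, datetime, time, timedelta


def attach_dates(start_date, times):
    """Combine times of day with a running date that advances at midnight."""
    current = start_date
    previous = None
    stamped = []
    for t in times:
        # A time earlier than the previous one means midnight was crossed.
        if previous is not None and t < previous:
            current += timedelta(days=1)
        stamped.append(datetime.combine(current, t))
        previous = t
    return stamped


times = [time(23, 59, 52), time(23, 59, 57), time(0, 0, 2), time(0, 0, 7)]
stamped = attach_dates(date(2020, 1, 1), times)  # placeholder start date
for s in stamped:
    print(s)
```

As with the SAS version, this only works if consecutive readings are never 24 hours or more apart.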
I wanted to modify the working module (the second one below) into the first one below, so that instead of using the whole sample of p from 1 to m, the module uses only the previous 18 and next 18 values around the time point x, i.e. p(x-18 ... x+18). But I end up with an error and can't work out where the problem is. The error message, with the full log, is at the end of the post.
start mhatx2(m,p,h,pi,e);
   t5=j(m,1); /*mhatx omit x=t*/
   upb=m-18;
   do x=19 to upb;
      lo=x-18;
      up=x+18;
      i=T(lo:up);
      temp1=x-i;
      ue=Kmod(temp1,h,pi,e)#p[i];
      le=Kmod(temp1,h,pi,e);
      t5[x]=(sum(ue)-ue[x])/(sum(le)-le[x]);
   end;
   return (t5);
finish;
start mhatx2(m,p,h,pi,e);
   t5=j(m,1); /*mhatx omit x=t*/
   do x=1 to nrow(p);
      i=T(1:m);
      temp1=x-i;
      ue=Kmod(temp1,h,pi,e)#p[i];
      le=Kmod(temp1,h,pi,e);
      t5[x]=(sum(ue)-ue[x])/(sum(le)-le[x]);
   end;
   return (t5);
finish;
Error message:
430 proc iml;
NOTE: IML Ready
431
432
433 EDIT kirjasto.basfraaka var "open";
434
435 read all var "open" into p;
436
437
438 m=nrow(p);
439 x=T(1:m);
440 pi=constant("pi");
441 e=constant("e");
442
443 h=0.75;
444
445 start Kmod(x,h,pi,e);
446 k=1/(h#(2#pi)##(1/2))#e##(-x##2/(2#h##2));
447 return (k);
448 finish;
NOTE: Module KMOD defined.
449 start mhatx2(m,p,h,pi,e);
450 t5=j(m,1);
450! /*mhatx omit x=t*/
451 upb=m-18;
452 do x=19 to upb;
453 lo=x-18;
454 up=x+18;
455 i=T(lo:up);
456 temp1=x-i;
457 ue=Kmod(temp1,h,pi,e)#p[i];
458 le=Kmod(temp1,h,pi,e);
459 t5[x]=(sum(ue)-ue[x])/(sum(le)-le[x]);
460 end;
461 return (t5);
462 finish;
NOTE: Module MHATX2 defined.
463
464 ptz=j(m,1);
465 ptz=mhatx2(m,p,h,pi,e);
ERROR: (execution) Invalid subscript or subscript out of range.
operation : [ at line 459 column 18
operands : ue, x
ue 37 rows 1 col (numeric)
x 1 row 1 col (numeric)
38
statement : ASSIGN at line 459 column 1
traceback : module MHATX2 at line 459 column 1
NOTE: Paused in module MHATX2.
466 print ptz;
ERROR: Matrix ptz has not been set to a value.
statement : PRINT at line 466 column 1
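It looks like this line:
t5[x]=(sum(ue)-ue[x])/(sum(le)-le[x]);
is indexing ue and le incorrectly. Both now hold only the 37 values of the window (positions 1 to 37), but x still runs from 19 to m-18, so ue[x] goes out of range as soon as x reaches 38, which is exactly what the log shows. If you're trying to subtract out the 'current iteration' piece, then you want
t5[x]=(sum(ue)-ue[19])/(sum(le)-le[19]);
since position 19 is the 'middle' of the window (the one that corresponds to the current x value).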
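The index mix-up is easier to see with the window written out. Below is a rough Python sketch of the windowed leave-one-out estimate, not a translation of the IML module line for line; kmod and mhat_window are illustrative names, and indices are 0-based, so the centre of the window sits at position 18 instead of 19:

```python
import math


def kmod(u, h):
    """Gaussian kernel with bandwidth h, same formula as Kmod in the question."""
    return math.exp(-u * u / (2 * h * h)) / (h * math.sqrt(2 * math.pi))


def mhat_window(p, h, half_width=18):
    m = len(p)
    out = [1.0] * m  # like t5 = j(m,1): entries outside the loop range stay at 1
    for x in range(half_width, m - half_width):
        idx = range(x - half_width, x + half_width + 1)  # 2*half_width + 1 positions
        weights = [kmod(x - i, h) for i in idx]
        ue = [w * p[i] for w, i in zip(weights, idx)]
        centre = half_width  # position of x inside the window, not x itself
        out[x] = (sum(ue) - ue[centre]) / (sum(weights) - weights[centre])
    return out


smoothed = mhat_window([5.0] * 50, h=0.75)
print(smoothed[25])
```

The key point is that ue and weights are indexed by position within the window, so the term to leave out is always at the fixed centre position, whatever x is.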