I'm trying to groupby a large data set using chunking.
What works:
chunks = pd.read_stata('data.dta', chunksize = 50000, columns = ['year', 'race', 'app'])
pieces = [chunk.groupby(['race'])['app'].agg(['sum']) for chunk in chunks]
agg = pd.concat(pieces).groupby(level = 0).sum()
What doesn't work (error: Categorical objects has no attribute flags)
chunks = pd.read_stata('data.dta', chunksize = 50000, columns = ['year', 'race', 'app'])
pieces = [chunk.groupby(['year', 'race'])['app'].agg(['sum']) for chunk in chunks]
agg = pd.concat(pieces).groupby(['year', 'race']).sum()
Thoughts on what I'm missing when adding in year?
pieces:
2013 Asian 9325
Black 2655
AmInd 118
Hisp 6371
White 16825
Other 2446
Unknown 3502
Foreign 7280
Name: app, dtype: float64, year race
2013 Asian 8884
Black 2969
AmInd 72
Hisp 3760
White 18926
Other 1843
Unknown 3262
Foreign 8183
Name: app, dtype: float64, year race
2013 Asian 6429
Black 2176
AmInd 89
Hisp 3804
White 13903
Other 1752
Unknown 2760
Foreign 6825
2014 Asian 1522
Black 738
AmInd 23
Hisp 1133
White 4243
Other 437
Unknown 316
Foreign 1997
Name: app, dtype: float64, year race
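One possible way to combine the pieces, sketched on a small made-up frame rather than the Stata file (the data and the chunk size of 2 are assumptions for illustration only). Each chunk's groupby result has a two-level (year, race) index, so the concatenated pieces can be grouped again on both index levels. If the error comes from the categorical columns that read_stata creates, passing convert_categoricals=False to read_stata may also be worth trying, though I have not verified that against this data set.

```python
import pandas as pd

# Made-up stand-in for the Stata data (assumption, not the real file).
df = pd.DataFrame({
    "year": [2013, 2013, 2014, 2013, 2014, 2014],
    "race": ["Asian", "Black", "Asian", "Asian", "Black", "Asian"],
    "app": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Simulate chunked reading with chunks of 2 rows each.
chunks = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]

# Per-chunk partial sums, indexed by (year, race).
pieces = [chunk.groupby(["year", "race"])["app"].sum() for chunk in chunks]

# Re-aggregate the partial sums on both index levels.
combined = pd.concat(pieces).groupby(level=["year", "race"]).sum()

# Reference result computed in one pass, for comparison.
expected = df.groupby(["year", "race"])["app"].sum()
```

Here combined and expected should hold the same totals, which is a quick way to check the chunked version on a sample before running it on the full file.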
I want to query a number of rows from one sheet into another sheet and, to the right of each row, add a column based on one of the queried columns. That is, if column C is "Il", I want the added column to show 0, otherwise 1 (the samples below will make it clearer).
I have tried doing this with Query and Arrayformula, without query, with Filter and importrange. An example of what I tried:
=query(Data!A1:AG,"Select D, E, J, E-J, Q, AG " & IF(AG="Il",0, 1),1)
Raw data sample:
Captured Amount Fee Country
TRUE 336 10.04 NZ
TRUE 37 1.37 GB
TRUE 150 4.65 US
TRUE 45 1.61 US
TRUE 20 0.88 IL
What I would want as a result:
Amount Fee Country Sort
336 10.04 NZ 1
37 1.37 GB 1
150 4.65 US 1
45 1.61 US 1
20 0.88 IL 0
Try it like this:
=ARRAYFORMULA(QUERY({Data!A1:Q, {"Sort"; IF(Data!AG2:AG="IL", 0, 1)}},
"select Col4,Col5,Col9,Col5-Col9,Col17,Col18 label Col5-Col9''", 1))
I am a complete newbie to SAS and only know basic SQL. I am currently taking a regression class and having trouble with my SAS code.
I am trying to input two columns of data, where the x variable is State and the y variable is the number of accidents, for a simple regression.
I keep getting this:
ERROR: No valid observations are found.
Number of Observations Read 51
Number of Observations Used 0
Number of Observations with Missing Values 51
Is it because datalines only reads numbers and not characters?
Here is the code as well as the datalines:
Data Firearm_Accidents_1999_to_2014;
ods graphics on;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
Connecticut 0
Delaware 0
District_of_Columbia 0
Florida 350
Georgia 413
Hawaii 0
Idaho 0
Illinois 287
Indiana 288
Iowa 0
Kansas 44
Kentucky 384
Louisiana 562
Maine 0
Maryland 21
Massachusetts 27
Michigan 168
Minnesota 0
Mississippi 332
Missouri 320
Montana 0
Nebraska 0
Nevada 0
New_Hampshire 0
New_Jersey 85
New_Mexico 49
New_York 218
North_Carolina 437
North_Dakota 0
Ohio 306
Oklahoma 227
Oregon 41
Pennsylvania 465
Rhode_Island 0
South_Carolina 324
South_Dakota 0
Tennessee 603
Texas 876
Utah 0
Vermont 0
Virginia 203
Washington 45
West_Virginia 136
Wisconsin 64
Wyoming 0
;
run; proc print;
proc reg data = Firearm_Accidents_1999_to_2014;
model State = Sum_OF_Deaths;
ods graphics off;
run; quit;
OK, some different levels of issues here.
ODS GRAPHICS statements go before and after procs, not inside them.
When reading a character variable you need to tell SAS using an informat.
This allows you to read in the data. However, your regression has several issues. For one, State is a character variable and you can't do regression with a character variable. I think that issue is beyond this forum. Review your regression basics and check what you're trying to do.
Data Firearm_Accidents_1999_to_2014;
informat state $32.;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
....
;
run;
I have the following DataFrame in pandas:
Date Title
58 March 2015 Data Visualization with JavaScript
63 December 2014 Eloquent JavaScript, 2nd Edition
90 October 2014 If Hemingway Wrote JavaScript
96 December 2014 JavaScript for Kids
158 February 2014 Principles of Object-Oriented JavaScript
209 November 2005 Wicked Cool Java
I have to filter the rows whose title contains the word JavaScript. I am doing the following:
category_javascript = np.where(Publisher['Title'].str.contains(r'(?:\s|^)JavaScript(?:\s|$)'))
It gives me the following output:
category_javascript
Out[106]: (array([ 58, 90, 96, 158], dtype=int64),)
It does not match row 63 (December 2014, Eloquent JavaScript, 2nd Edition), I think because the word JavaScript has a comma after it. I want to find the exact word regardless of surrounding punctuation; e.g. JavaScript-Book should also match.
Please help
IIUC you don't need a regex, only the string JavaScript:
category_javascript = np.where(Publisher['Title'].str.contains('JavaScript'))
print (Publisher['Title'].str.contains('JavaScript'))
58 True
63 True
90 True
96 True
158 True
209 False
Name: Title, dtype: bool
print (Publisher[Publisher['Title'].str.contains('JavaScript')])
Date Title
58 March 2015 Data Visualization with JavaScript
63 December 2014 Eloquent JavaScript, 2nd Edition
90 October 2014 If Hemingway Wrote JavaScript
96 December 2014 JavaScript for Kids
158 February 2014 Principles of Object-Oriented JavaScript
You can add punctuation characters to the regex, like [,;]:
print (Publisher['Title'].str.contains(r'(?:\s|^|[,;])JavaScript(?:\s|$|[,;])'))
58 True
63 True
90 True
96 True
158 True
209 False
Name: Title, dtype: bool
print (Publisher['Title'].str.contains(r'(?:\s|^|[,;])Java(?:\s|$|[,;])'))
58 False
63 False
90 False
96 False
158 False
209 True
Name: Title, dtype: bool
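Another option that may fit the "exact word regardless of punctuation" requirement is the regex word boundary \b, which matches between a word character and a non-word character (comma, hyphen, space, or start/end of string). A minimal sketch, rebuilding the Title column from the question as a standalone Series (the variable names here are mine):

```python
import pandas as pd

# Titles and index labels copied from the question's DataFrame.
titles = pd.Series(
    [
        "Data Visualization with JavaScript",
        "Eloquent JavaScript, 2nd Edition",
        "If Hemingway Wrote JavaScript",
        "JavaScript for Kids",
        "Principles of Object-Oriented JavaScript",
        "Wicked Cool Java",
    ],
    index=[58, 63, 90, 96, 158, 209],
    name="Title",
)

# \b keeps "JavaScript," and "JavaScript-Book" but rejects "JavaScripts".
mask = titles.str.contains(r"\bJavaScript\b")

# "Java" inside "JavaScript" has no boundary after it, so it is not matched.
java_mask = titles.str.contains(r"\bJava\b")

javascript_rows = titles[mask]
```

Compared with listing punctuation by hand, \b avoids having to enumerate every character that might follow the word.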
I am trying to do an ARIMA model estimation for 5 different variables. The data consists of 16 months of point-of-sale data. How do I approach this complicated ARIMA modelling?
Furthermore I would like to do:
A simple moving average of each product group
A Holt-Winters exponential smoothing model
Data is as follows with date and product groups:
Date Gloves ShoeCovers Socks Warmers HeadWear
apr-14 11015 3827 3465 1264 772
maj-14 11087 2776 4378 1099 1423
jun-14 7645 1432 4490 674 670
jul-14 10083 7975 2577 1558 8501
aug-14 13887 8577 6854 1305 15621
sep-14 9186 5213 5244 1183 6784
okt-14 7611 4279 4150 977 6191
nov-14 6410 4033 2918 507 8276
dec-14 4856 3552 3192 450 4810
jan-15 17506 7274 3137 2216 3979
feb-15 21518 5672 8848 1838 2321
mar-15 17395 5200 5712 1604 2282
apr-15 11405 4531 5185 1479 1888
maj-15 11509 5690 4370 1145 2369
jun-15 9945 2610 4884 882 1709
jul-15 8707 5658 4570 1948 6255
Any skilled forecasters out there willing to help? Much appreciated!
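For the simple moving average part, a minimal pandas sketch. It assumes a 3-month window (my choice, not something stated in the question) and uses only the first five months of the Gloves and Socks columns to stay short; the same call applies to all product groups at once.

```python
import pandas as pd

# First five rows of two product groups, copied from the table above.
sales = pd.DataFrame(
    {
        "Gloves": [11015, 11087, 7645, 10083, 13887],
        "Socks": [3465, 4378, 4490, 2577, 6854],
    },
    index=["apr-14", "maj-14", "jun-14", "jul-14", "aug-14"],
)

# Trailing 3-month simple moving average per column.
# The first two rows are NaN because the window is not yet full.
sma = sales.rolling(window=3).mean()
```

The window length trades smoothness against responsiveness, so with only 16 months of data a short window is probably the only practical choice.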
I have a CSV file which I need to parse using Python.
triggerid,timestamp,hw0,hw1,hw2,hw3
1,234,343,434,78,56
2,454,22,90,44,76
I need to read the file line by line and slice the triggerid, timestamp and hw3 columns from it. But the column sequence may change from run to run, so I need to match the field name, count the column and then write the output file as:
triggerid,timestamp,hw3
1,234,56
2,454,76
Also, is there a way to generate a hash table (like we have in Perl) so that I can store the entire column for hw0 (hw0 as key and the values in the column as values) for other modifications?
I'm unsure what you mean by "count the column".
An easy way to read the data in would use pandas, which was designed for just this sort of manipulation. This creates a pandas DataFrame from your data using the first row as titles.
In [374]: import pandas as pd
In [375]: d = pd.read_csv("30735293.csv")
In [376]: d
Out[376]:
triggerid timestamp hw0 hw1 hw2 hw3
0 1 234 343 434 78 56
1 2 454 22 90 44 76
You can select one of the columns using a single column name, and multiple columns using a list of names:
In [377]: d[["triggerid", "timestamp", "hw3"]]
Out[377]:
triggerid timestamp hw3
0 1 234 56
1 2 454 76
You can also adjust the indexing so that one or more of the data columns are used as index values:
In [378]: d1 = d.set_index("hw0"); d1
Out[378]:
triggerid timestamp hw1 hw2 hw3
hw0
343 1 234 434 78 56
22 2 454 90 44 76
Using the .loc attribute you can retrieve a series for any indexed row:
In [390]: d1.loc[343]
Out[390]:
triggerid 1
timestamp 234
hw1 434
hw2 78
hw3 56
Name: 343, dtype: int64
You can use the column names to retrieve the individual column values from that one-row series:
In [393]: d1.loc[343]["triggerid"]
Out[393]: 1
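To get from the column selection above to the requested output file, one possibility is to select the columns by name and write them back out with to_csv. This is a sketch that reads from and writes to in-memory buffers so it runs on its own; with real files you would pass paths instead (the variable names are mine). Selecting by name means it does not matter where the columns sit in the input.

```python
import io

import pandas as pd

# Sample data from the question, held in memory instead of a file.
raw = (
    "triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    "1,234,343,434,78,56\n"
    "2,454,22,90,44,76\n"
)
wanted = ["triggerid", "timestamp", "hw3"]

# Select by column name, so input column order does not matter.
selected = pd.read_csv(io.StringIO(raw))[wanted]

# index=False leaves out the 0, 1, ... row labels.
out = io.StringIO()
selected.to_csv(out, index=False)
output_lines = out.getvalue().splitlines()
```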
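Since you already have a solution for the slices, here's something for the hash table part of the question:
import csv

# Python 3: open in text mode with newline='', as the csv module recommends
with open('/path/to/file.csv', newline='') as fin:
    ht = {}
    cr = csv.reader(fin)
    k = next(cr)[2]  # header of the third column (hw0 in the sample)
    ht[k] = list()
    for line in cr:
        ht[k].append(line[2])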
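If the column position is not known in advance, a variation on the same idea is to key every column by its header name using csv.DictReader, which gives something close to a Perl hash of arrays. A minimal sketch; the helper name columns_as_dict is hypothetical, and the values stay as strings because the csv module does not convert types.

```python
import csv
import io
from collections import defaultdict


def columns_as_dict(src):
    """Return {column name: list of string values} for a CSV file object."""
    columns = defaultdict(list)
    for row in csv.DictReader(src):
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


# Sample data from the question, held in memory instead of a file.
sample = io.StringIO(
    "triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    "1,234,343,434,78,56\n"
    "2,454,22,90,44,76\n"
)
table = columns_as_dict(sample)
hw0_values = table["hw0"]
```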
I used a different approach (using the .index function):
bpt_mode = ["bpt_mode_64", "bpt_mode_128"]
with open('StripValues.csv') as file:
    # assumption: stats holds the header fields, used to look up positions by name
    stats = next(file).strip().split(",")
    for line in file:
        stat_values = line.strip().split(",")
        print(stat_values[stats.index('trigger_id')], end=',')
        for mode in bpt_mode:
            print(stat_values[stats.index('hw.gpu.s0.ss0.dg.' + mode)], end=',')
        print()  # end the output row
@holdenweb Though I am unable to figure out how to print the output to a file. Currently I am redirecting the output while running the script.
Can you provide a solution for writing to a file? There will be multiple writes to a single file.
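One way to write the selected columns to a file without shell redirection is csv.DictWriter. The sketch below works on any file-like objects, shown here with in-memory buffers so it is self-contained; the function name select_columns is hypothetical. For a real file you would open it once with open(path, "w", newline="") and keep writing rows to the same handle, or reopen it with mode "a" to append in a later run.

```python
import csv
import io


def select_columns(src, dst, wanted):
    """Copy only the wanted columns, matched by header name, from src to dst."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst,
        fieldnames=wanted,
        extrasaction="ignore",  # silently drop columns not in wanted
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Sample data from the question, held in memory instead of a file.
src = io.StringIO(
    "triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    "1,234,343,434,78,56\n"
    "2,454,22,90,44,76\n"
)
dst = io.StringIO()
select_columns(src, dst, ["triggerid", "timestamp", "hw3"])
result = dst.getvalue()
```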