MultiIndexing in pandas based on column conditions - python-2.7

I have a huge GPS dataset in a CSV file.
It looks something like this:
12,1999-09-08 12:12:12, 116.3426, 32.5678
12,1999-09-08 12:12:17, 116.34234, 32.5678
.
.
.
where the columns are, in order:
id, timestamp, longitude, latitude
I am using pandas to import the file into a dataframe. So far I have written this code:
import pandas as pd
import numpy as np
# import the columns, using the timestamp values as the row index
df = pd.read_csv('/home/abc/Downloads/..../366.txt', delimiter=',',
                 index_col='stamp',
                 names=['id', 'stamp', 'longitude', 'latitude'])
# remove repeated entries caused by GPS errors
df = df.groupby(df.index).first()
Sometimes there are two or three entries for the same timestamp, and these should be reduced to one.
I get something like this
id longitude latitude
1999-09-08 12:12:12 12 116.3426 32.5678
1999-09-08 12:12:17 12 116.34234 32.5678
# and so on with redundant entries removed
Now I want runs of rows that share the same latitude and longitude to be indexed serially.
i.e., what I have in mind is:
id longitude latitude
0 1999-09-08 12:12:12 12 116.3426 32.5678
1 1999-09-08 12:12:17 12 116.34234 32.5678
2 1999-09-08 12:12:22 12 116.342341 32.5678
1999-09-08 12:12:27 12 116.342341 32.5678
1999-09-08 12:12:32 12 116.342341 32.5678
....
1999-09-08 12:19:37 12 116.342341 32.5678
3 1999-09-08 12:19:42 12 116.34234 32.56123
and so on..
i.e., rows with the same latitude and longitude values are to share a serial index. How can I achieve that? I am a beginner with pandas, so I don't know much about it.

You can leverage DataFrame.duplicated and do some arithmetic with it:
idx = df.duplicated(['longitude', 'latitude'])
idx *= -1
idx += 1
idx.iloc[0] = 0
df = df.set_index(idx.cumsum(), append=True).swaplevel(0, 1)
How the code works
Starting with the df you get:
In [215]: df
Out[215]:
id longitude latitude
stamp
1999-09-08T12:12:12 12 116.342600 32.56780
1999-09-08T12:12:17 12 116.342340 32.56780
1999-09-08T12:12:22 12 116.342341 32.56780
1999-09-08T12:12:27 12 116.342341 32.56780
1999-09-08T12:12:32 12 116.342341 32.56780
1999-09-08T12:19:37 12 116.342341 32.56780
1999-09-08T12:19:42 12 116.342340 32.56123
First flag the rows whose (longitude, latitude) tuple duplicates an earlier one:
In [216]: idx = df.duplicated(['longitude', 'latitude'])
In [217]: idx
Out[217]:
stamp
1999-09-08T12:12:12 False
1999-09-08T12:12:17 False
1999-09-08T12:12:22 False
1999-09-08T12:12:27 True
1999-09-08T12:12:32 True
1999-09-08T12:19:37 True
1999-09-08T12:19:42 False
Then we do some arithmetic to obtain zeros on duplicated rows and ones for the others, so that a later cumsum produces an index that does not increment on duplicates:
In [218]: idx *= -1
In [219]: idx += 1
In [220]: idx
Out[220]:
stamp
1999-09-08T12:12:12 1
1999-09-08T12:12:17 1
1999-09-08T12:12:22 1
1999-09-08T12:12:27 0
1999-09-08T12:12:32 0
1999-09-08T12:19:37 0
1999-09-08T12:19:42 1
As we want a zero-based index, we set the first cell to 0, and we append that column to the index of df to create the MultiIndex:
In [221]: idx.iloc[0] = 0
In [222]: df = df.set_index(idx.cumsum(), append=True)
By default, set_index appends the new index at a lower (inner) level than the existing one. We must finish by swapping the levels between the timestamps and our additional index:
In [223]: df = df.swaplevel(0,1)
In [224]: df
Out[224]:
id longitude latitude
stamp
0 1999-09-08T12:12:12 12 116.342600 32.56780
1 1999-09-08T12:12:17 12 116.342340 32.56780
2 1999-09-08T12:12:22 12 116.342341 32.56780
1999-09-08T12:12:27 12 116.342341 32.56780
1999-09-08T12:12:32 12 116.342341 32.56780
1999-09-08T12:19:37 12 116.342341 32.56780
3 1999-09-08T12:19:42 12 116.342340 32.56123
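One caveat: DataFrame.duplicated flags a (longitude, latitude) pair that matches any earlier row, not only the immediately preceding one, so a track that revisits a location would be folded into the earlier group. If you want strictly consecutive runs, a shift()-based comparison does it; a sketch, starting again from the deduplicated df:
new_run = df[['longitude', 'latitude']].ne(
    df[['longitude', 'latitude']].shift()).any(axis=1)
run = new_run.cumsum() - 1  # zero-based run counter; the first row opens run 0
df = df.set_index(run.rename('run'), append=True).swaplevel(0, 1)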

Related

Running Total in Matrix Rows

I have incremental data elements that I want to summarize. I'm pulling the incremental data into a matrix object just fine, but I need to summarize it by cumulating across columns (within each row).
What I'm seeing:
Column: 1 2 3 4 5
Row |-----------------------------------------
1 | 10 15 5 4 1
2 | 12 12 3 1
3 | 10 9 6
4 | 9 15
5 | 11
What I want to see:
Column: 1 2 3 4 5
Row |-----------------------------------------
1 | 10 25 30 34 35
2 | 12 24 27 28
3 | 10 19 25
4 | 9 24
5 | 11
What I've tried (this just returns the incremental data, as if I had pointed it straight at [INC_AMT]):
Cum_Loss = CALCULATE(
    SUM('Table1'[INC_AMT]),
    FILTER(ALL(Table1[ColNum]), Table1[ColNum] <= MAX(Table1[ColNum])))
Give this measure a try:
PERIODIC INCREMENTAL SUM = CALCULATE(
    SUM('TestData'[INC_AMT]),
    FILTER(
        ALLSELECTED(TestData),
        AND(
            TestData[ColNum] <= MAX(TestData[ColNum]),
            TestData[RowNum] = MAX(TestData[RowNum])
        )
    )
)
I found it helpful not to think about the measure from a matrix perspective. Transform it into a table and you can see that it's just a cumulative sum where the row number stays the same. So add that requirement to your filter and... presto.
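Seen as a table operation, this is nothing exotic. For instance, in pandas (a made-up frame mirroring the matrix above), a running total within each row is just a cumulative sum along axis=1, and the default skipna leaves the ragged tail untouched:
import numpy as np
import pandas as pd

m = pd.DataFrame([[10, 15, 5, 4, 1],
                  [12, 12, 3, 1, np.nan],
                  [10, 9, 6, np.nan, np.nan],
                  [9, 15, np.nan, np.nan, np.nan],
                  [11, np.nan, np.nan, np.nan, np.nan]],
                 index=range(1, 6), columns=range(1, 6))
print(m.cumsum(axis=1))  # rows become 10 25 30 34 35 / 12 24 27 28 / ...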

Looking up data within a file versus merging

I have a file that looks at the ratings teacher X gives to teacher Y and the date on which each occurs:
clear
rating_id RatingTeacher RatedTeacher Rating Date
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
5 11 13 1 "1/7/2010"
end
I want to look in the history to see how many times the RatingTeacher had been rated at the time they made the rating, and the cumulative score they had received. The result would look like this:
rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating
1 15 12 1 "1/1/2010" 0 0
2 12 11 2 "1/2/2010" 1 1
3 14 11 3 "1/2/2010" 0 0
4 14 13 2 "1/5/2010" 0 0
5 19 11 4 "1/6/2010" 0 0
5 11 13 1 "1/7/2010" 3 9
end
I have been merging the dataset with itself to get this to work, and it is fine. I was wondering if there is a more efficient way to do this within the file.
In your input data, I guess that the last rating_id should be 6 and that the dates are MDY. Statalist members are asked to use dataex (from SSC) to set up data examples. This isn't Statalist, but there is no reason for lower standards to apply; see the Statalist FAQ.
I rarely see even programmers be precise about what they mean by "efficient": fewer lines of code, less use of memory, more speed, something else, or just an all-purpose term of praise. The code below loops over observations, which can certainly be slow for large datasets. There is more in this paper.
We can't compare with your merge solution because you don't give the code.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
sort Date
gen Wanted = .
quietly forval i = 1/`=_N' {
count if Date < Date[`i'] & RatedT == RatingT[`i']
replace Wanted = r(N) in `i'
}
list, sep(0)
+---------------------------------------------------------------------+
| rating~d Rating~r RatedT~r Rating SDate Date Wanted |
|---------------------------------------------------------------------|
1. | 1 15 12 1 1/1/2010 18263 0 |
2. | 2 12 11 2 1/2/2010 18264 1 |
3. | 3 14 11 3 1/2/2010 18264 0 |
4. | 4 14 13 2 1/5/2010 18267 0 |
5. | 5 19 11 4 1/6/2010 18268 0 |
6. | 6 11 13 1 1/7/2010 18269 3 |
+---------------------------------------------------------------------+
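For comparison with the pandas threads in this collection, here is a rough translation of the same observation-by-observation logic (column names follow the example above; the quadratic scan is kept for clarity, not speed):
import pandas as pd

df = pd.DataFrame({
    'RatingTeacher': [15, 12, 14, 14, 19, 11],
    'RatedTeacher':  [12, 11, 11, 13, 11, 13],
    'Rating':        [1, 2, 3, 2, 4, 1],
    'Date': pd.to_datetime(['1/1/2010', '1/2/2010', '1/2/2010',
                            '1/5/2010', '1/6/2010', '1/7/2010'],
                           format='%m/%d/%Y'),
})

def history(row):
    # all earlier ratings received by the teacher who is now rating
    earlier = df[(df['Date'] < row['Date']) &
                 (df['RatedTeacher'] == row['RatingTeacher'])]
    return pd.Series({'TimesRated': len(earlier),
                      'CumulativeRating': earlier['Rating'].sum()})

print(df.join(df.apply(history, axis=1)))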
The building block is that the rater and ratee form a pair. You can use egen's group() to give a unique ID to each rater-ratee pair, then count within pair:
egen pair = group(rater ratee)
bysort pair (date): gen timesRated = _n
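For what it's worth, the pandas analogue of that egen/bysort pattern is a sorted groupby with cumcount (hypothetical lower-case column names as in the sketch above; cumcount is zero-based, i.e. the analogue of _n - 1):
df['timesRated'] = (df.sort_values('date')
                      .groupby(['rater', 'ratee'])
                      .cumcount())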

Labels for explicitly plotted stacked histogram pandas from series with datetimeindex

I've got a pandas series and want to plot a stacked histogram by using a filter to create two new (smaller) series. This is some dummy data, but my actual series has a (non-unique) datetimeindex.
d =
0 2520
1 0
2 1083
3 0
4 0
5 1260
6 960
7 13
8 300
9 433
10 1860
11 1920
12 13
13 0
14 2460
15 2472
16 12
17 60
18 2832
19 12
d1 = d[0:19:2]
d2 = d[1:17:3]
d1.hist(color = 'r', label = 'foo')
d2.hist(label = 'bar')
However, the labels don't show up. I've looked at the pandas docs, which show everything working when plotting from different columns of a dataframe, but in my case I can't combine these into a single dataframe since they have different indices (and lengths). Any suggestions?
Give each series a name and plot them one by one on a single subplot:
import matplotlib.pyplot as plt
d1 = d1.rename('foo')
d2 = d2.rename('bar')
f = plt.figure()
_ax = f.add_subplot(111)
d1.plot(color='r', kind='hist', stacked=True, ax=_ax)
d2.plot(kind='hist', stacked=True, ax=_ax)
_ax.legend()
Following @iayork's comments above, the following works:
import pandas as pd
df = pd.concat([d1, d2], axis=1, keys=['foo', 'bar'])
df.plot(kind='hist')
Note: df.hist() plots individual histograms for each column within the dataframe
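Putting it together, a minimal self-contained sketch with the dummy data from the question; the keys= argument supplies the 'foo'/'bar' legend labels:
import matplotlib.pyplot as plt
import pandas as pd

d = pd.Series([2520, 0, 1083, 0, 0, 1260, 960, 13, 300, 433,
               1860, 1920, 13, 0, 2460, 2472, 12, 60, 2832, 12])
d1 = d[0:19:2]
d2 = d[1:17:3]

# concat aligns on the index; the gaps become NaN, which the histogram ignores
df = pd.concat([d1, d2], axis=1, keys=['foo', 'bar'])
df.plot(kind='hist', stacked=True)
plt.show()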

Pandas read_table using wrong column as index

I'm trying to make a dataframe for a url that is delimited by tabs. However, pandas is using the industry_code column as the index.
dff = pd.read_table('http://download.bls.gov/pub/time.series/ce/ce.industry')
will output
industry_code naics_code publishing_status industry_name display_level selectable sort_sequence
0 - B Total nonfarm 0 T 1 NaN
5000000 - A Total private 1 T 2 NaN
6000000 - A Goods-producing 1 T 3 NaN
7000000 - B Service-providing 1 T 4 NaN
8000000 - A Private service-providing 1 T 5 NaN
Easy!
table_location = 'http://download.bls.gov/pub/time.series/ce/ce.industry'
dff = pd.read_table(table_location, index_col=False)
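What is going on: the data rows in that file appear to end with a trailing delimiter, so each carries one more field than the header row, and pandas then infers the first column as an index. index_col=False exists for exactly this malformed-file case; a small sketch mirroring the example in the pandas docs:
import io
import pandas as pd

# each data row ends with a delimiter, so it has one more field than the header
data = "a,b,c\n4,apple,bat,\n8,orange,cow,"
print(pd.read_csv(io.StringIO(data)))                   # 4 and 8 become the index
print(pd.read_csv(io.StringIO(data), index_col=False))  # 4 and 8 stay in column 'a'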

How can I 'align' multiple two-way tabulations in esttab?

I am trying to prepare a table that will display two-way frequency tables of several variables. The logic is that each of the variables will be tabulated by the same binary indicator.
I would like to send the output to a tex file using the estout community-contributed family of commands. However, each cross-tabulation appears in a new set of columns.
Consider the following reproducible toy example:
sysuse auto
eststo clear
eststo: estpost tab headroom foreign, notot
eststo: estpost tab trunk foreign, notot
esttab, c(b) unstack wide collabels(N)
----------------------------------------------------------------
(1) (2)
Domestic Foreign Domestic Foreign
N N N N
----------------------------------------------------------------
1_missing_5 3 1
2 10 3
2_missing_5 4 10
3 7 6
3_missing_5 13 2
4 10 0
4_missing_5 4 0
5 1 0 0 1
6 0 1
7 3 0
8 2 3
9 3 1
10 3 2
11 4 4
12 1 2
13 4 0
14 1 3
15 2 3
16 10 2
17 8 0
18 1 0
20 6 0
21 2 0
22 1 0
23 1 0
----------------------------------------------------------------
N 74 74
----------------------------------------------------------------
Is there a way to 'align' the output so that there are only two Domestic and Foreign columns?
If you're outputting to a tex file, one solution would be to use the append option of esttab. So in your case it would be something like:
sysuse auto
eststo clear
estpost tab headroom foreign, notot
eststo tab1
estpost tab trunk foreign, notot
eststo tab2
esttab tab1 using outputfile.tex, c(b) unstack wide collabels(N) replace
esttab tab2 using outputfile.tex, c(b) unstack wide collabels(N) append
I believe there may be a more elegant solution as well, but this is generally pretty easy to implement. When appending, you'll likely have to specify a bunch of options to remove the various column headers (I believe estout's default assumes you don't want most of those headers, so it may be worth looking into estout instead of esttab).
Producing the desired output requires that you stack the results together.
First define the program append_tabs, which is a quickly modified version of appendmodels, Ben Jann's program for stacking models:
program append_tabs, eclass
    version 8
    syntax namelist
    tempname b tmp
    local i 0
    foreach name of local namelist {
        qui est restore `name'
        foreach x in Domestic Foreign {
            local ++i
            mat `tmp'`i' = e(b)
            mat li `tmp'`i'
            mat `tmp'`i' = `tmp'`i'[1,"`x':"]
            local cons = colnumb(`tmp'`i',"_cons")
            if `cons'<. & `cons'>1 {
                mat `tmp'`i' = `tmp'`i'[1,1..`cons'-1]
            }
            mat li `tmp'`i'
            mat `b'`i' = `tmp'`i''
            mat li `b'`i'
        }
    }
    mat `b'D = `b'1 \ `b'3
    mat `b'F = `b'2 \ `b'4
    mat A = `b'D , `b'F
    ereturn matrix results = A
    eret local cmd "append_tabs"
end
Next run your tabulations and stack their results using append_tabs:
sysuse auto, clear
estimates clear
estpost tabulate headroom foreign, notot
estimates store one
estpost tabulate trunk foreign, notot
estimates store two
append_tabs one two
Finally, see the results:
esttab e(results), nonumber mlabels(none) eqlabels(none) collabels("Domestic" "Foreign")
--------------------------------------
Domestic Foreign
--------------------------------------
1_missing_5 3 1
2 10 3
2_missing_5 4 10
3 7 6
3_missing_5 13 2
4 10 0
4_missing_5 4 0
5 1 0
5 0 1
6 0 1
7 3 0
8 2 3
9 3 1
10 3 2
11 4 4
12 1 2
13 4 0
14 1 3
15 2 3
16 10 2
17 8 0
18 1 0
20 6 0
21 2 0
22 1 0
23 1 0
--------------------------------------
Use the tex option in the esttab command to see the LaTeX output.