Best imputation for principal component analysis - Stata

I have a list of assets. I use these assets to create an asset index using PCA:
local assets newspaper magazine clock water borehole table bed study bicycle cart car motorcycle tractor electricity fridge airconditioner fan washing_machine vacuum
alpha `assets'              // scale reliability (Cronbach's alpha)
pca `assets'
predict asset_idx           // score on the first principal component
label var asset_idx "pupil asset ownership index: created using pca"
egen asset_idx_std = std(asset_idx)
However, about 3% of observations are missing for each asset variable, which adds up to roughly 30% of the sample lost to listwise deletion. I therefore want to impute the missing values so that students missing fewer than 10% of the asset variables are not dropped from the PCA. I do not otherwise wish to mi set the data, but I can't work out another way:
mi set wide
mi register imputed newspaper magazine clock water borehole table bed study bicycle cart car motorcycle tractor electricity fridge airconditioner fan washing_machine vacuum computer internet radio tv vcr dvd cd cassette camera digi_camera vid_cam landline phone
set seed 1234
mi impute logit newspaper magazine clock water borehole table bed study bicycle cart car motorcycle tractor electricity fridge airconditioner fan washing_machine vacuum computer internet radio tv vcr dvd cd cassette camera digi_camera vid_cam landline phone, force add(1)
Unfortunately, this is only successfully imputing a small fraction of missing observations:
------------------------------------------------------------------
                   |              Observations per m
                   |-----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |      Total
-------------------+------------------------------------+----------
         newspaper |       7007          110         4 |       7117
------------------------------------------------------------------
Any suggestions on appropriate imputation are much appreciated.
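For what it's worth, mi impute logit is a univariate method: as written, it imputes only the first variable in the list (newspaper), treating the remaining variables as predictors, and with force it simply skips any observation where a predictor is itself missing, which is why so few values are filled in. A chained-equations sketch that instead imputes every binary asset conditional on the others (same variable list as above; the add() and rseed() values are only illustrative) might look like:
mi set wide
mi register imputed newspaper magazine clock water borehole table bed study ///
    bicycle cart car motorcycle tractor electricity fridge airconditioner ///
    fan washing_machine vacuum computer internet radio tv vcr dvd cd ///
    cassette camera digi_camera vid_cam landline phone

* each asset gets its own logit model, conditional on all the others
mi impute chained (logit) newspaper magazine clock water borehole table ///
    bed study bicycle cart car motorcycle tractor electricity fridge ///
    airconditioner fan washing_machine vacuum computer internet radio tv ///
    vcr dvd cd cassette camera digi_camera vid_cam landline phone, ///
    add(1) rseed(1234)
In practice you would usually want more than one imputation (add(20) is common advice) and would then run the PCA on each completed dataset, or on a single completed one, depending on how the index will be used.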

Related

Power BI - Show hierarchical/grouping data

I'm looking to see if it's possible in Power BI to have a widget that shows "common" data in one row, then "differentiating" data below it in other rows.
For example, let's say I want to show a list of TV shows. And below each TV show, I want to show data about each season. So data might look like (I'm not filling out all of the data, just enough to show an example):
TV Show Title     Broadcaster   Genre          Season #   Season Premiere Episode Title
Mad Men           AMC           Period Drama   1          Smoke Gets in Your Eyes
Mad Men           AMC           Period Drama   2          For Those Who Think Young
Game of Thrones   HBO           Fantasy        1          Winter is Coming
Game of Thrones   HBO           Fantasy        2          The North Remembers
I want to show the data that looks kind of like this:
      TV Show Title     Broadcaster   Genre          Season #   Season Premiere Episode Title
+/-   Mad Men           AMC           Period Drama   1          Smoke Gets in Your Eyes
                                                     2          For Those Who Think Young
+/-   Game of Thrones   HBO           Fantasy        1          Winter is Coming
                                                     2          The North Remembers
At first I thought I could use a matrix, but it doesn't seem to work the way I hoped.
Suggestions?
Like this?
If so, turn stepped layout off in the options.

How to group the following table in order to display top values (strings) per category in one column?

I have the following big table with over 13 million rows.
ProductCode   ProductName   valueUSD   ExportOrImport   Dest
100100        Fish          120K       Export           China
100100        Fish          122M       Export           Russia
100150        Oil           120B       Export           China
100150        Oil           122M       Export           US
I need to display the following summary table on dashboard.
ProductCode   ProductName   valueUSD   % From total   TopDest
100150        Oil           120.122B   90%            China, US
100100        Fish          122.12M    10%            China, Russia
...           ...           ...        ...            ...
I have created a "button" that separates exports from imports. But now I do not know how to compose the TopDest column, where I need to show the top 5 countries to which a particular product is exported or imported. Also, how do I properly formulate this question for a Google search? Is it "group by category, display top N"?
Any ideas how to create this table?
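The question doesn't say which dashboard tool this is, but the computation itself is a group-and-aggregate plus a per-group top-N. Here is a minimal sketch of that logic in pandas; the column names come from the sample table above, and everything else (values, variable names) is made up:
import pandas as pd

# made-up rows mirroring the sample table above
df = pd.DataFrame({
    "ProductCode": [100100, 100100, 100150, 100150],
    "ProductName": ["Fish", "Fish", "Oil", "Oil"],
    "valueUSD": [120e3, 122e6, 120e9, 122e6],
    "ExportOrImport": ["Export"] * 4,
    "Dest": ["China", "Russia", "China", "US"],
})

exports = df[df["ExportOrImport"] == "Export"]

# total value per product, and its share of the grand total
totals = exports.groupby(["ProductCode", "ProductName"])["valueUSD"].sum()
share = totals / totals.sum()

# top-5 destinations per product, ordered by value
top_dest = (
    exports.sort_values("valueUSD", ascending=False)
           .groupby(["ProductCode", "ProductName"])["Dest"]
           .agg(lambda s: ", ".join(s.head(5)))
)

summary = pd.DataFrame({
    "valueUSD": totals,
    "% From total": (share * 100).round(0).astype(int).astype(str) + "%",
    "TopDest": top_dest,
}).sort_values("valueUSD", ascending=False)

print(summary)
The same shape of computation ("group by product, aggregate value, concatenate the top N destinations") should translate to whatever tool the dashboard is built in.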

Ascending order sorting in Dataframe-Pandas after GroupBy

from pandas import Series, DataFrame
import pandas as pd
df1=pd.read_csv('/Users/nirmal/Desktop/Python/Assignments/Data/employee_compensation.csv', sep=',', skiprows=(1,1))
dfq2=DataFrame(df1.groupby(["Organization Group", "Department"])['Total Compensation'].mean())
dfq2
I need to sort the Total Compensation column in descending order, so that Departments are reordered within each Organization Group, while the order of the Organization Group column itself does not change.
You can use sort_values with sort_index:
print (df.sort_values('Total Compensation', ascending=False)
         .sort_index(level=0, sort_remaining=False))
                                                                   Total Compensation
Organization Group                Department
Community Health                  Academy of Sciences                   107319.727692
                                  Public Health                          96190.190140
                                  Arts Commission                        94339.597388
                                  Asian Art Museum                       71401.520060
Culture & Recreation              Law Library                           188424.362222
                                  City Attorney                         166082.677561
                                  Controller                            104515.234944
                                  Assessor/Recorder                      89994.260614
                                  City Planning                          89022.876966
                                  Board of Supervisors                   78801.347641
                                  War Memorial                           76250.068022
                                  Public Library                         70446.352147
                                  Civil Service Commission               67966.756559
                                  Fine Arts Museum                       44205.439895
                                  Recreation and Park Commission         38912.859465
                                  Elections                              20493.166618
General Administration & Finance  Ethics Commission                      98631.380366
Another solution with reset_index, sort_values and set_index:
print (df.reset_index()
         .sort_values(['Organization Group', 'Total Compensation'], ascending=[True, False])
         .set_index(['Organization Group', 'Department']))
                                                                   Total Compensation
Organization Group                Department
Community Health                  Academy of Sciences                   107319.727692
                                  Public Health                          96190.190140
                                  Arts Commission                        94339.597388
                                  Asian Art Museum                       71401.520060
Culture & Recreation              Law Library                           188424.362222
                                  City Attorney                         166082.677561
                                  Controller                            104515.234944
                                  Assessor/Recorder                      89994.260614
                                  City Planning                          89022.876966
                                  Board of Supervisors                   78801.347641
                                  War Memorial                           76250.068022
                                  Public Library                         70446.352147
                                  Civil Service Commission               67966.756559
                                  Fine Arts Museum                       44205.439895
                                  Recreation and Park Commission         38912.859465
                                  Elections                              20493.166618
General Administration & Finance  Ethics Commission                      98631.380366
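Both solutions assume df is the MultiIndex frame produced by the groupby above. For readers without the CSV, here is a tiny self-contained stand-in (group, department, and value entries copied from the output above) showing the second approach end to end:
import pandas as pd

# placeholder frame with the same shape as the grouped result
df = pd.DataFrame(
    {"Total Compensation": [96190.19, 107319.73, 44205.44, 188424.36]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Community Health", "Public Health"),
            ("Community Health", "Academy of Sciences"),
            ("Culture & Recreation", "Fine Arts Museum"),
            ("Culture & Recreation", "Law Library"),
        ],
        names=["Organization Group", "Department"],
    ),
)

# keep groups in ascending order, departments in descending order of pay
result = (
    df.reset_index()
      .sort_values(["Organization Group", "Total Compensation"],
                   ascending=[True, False])
      .set_index(["Organization Group", "Department"])
)
print(result)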

SAS hierarchical raw file - no record type identifiers - multiple observations per record

I have a problem importing a hierarchical text file into SAS.
I've been searching for the past week and have had no luck.
The problem is that this file contains nothing that indicates which detail records are linked to which header.
I have tried the various methods I could find, e.g. reading ahead with input #1 and then-do logic, without success.
Extract of the file (every new record starts with HONG KONG, but each record has a variable number of lines):
HONG KONG
STEEL GROUP
Invoice Date
09.12.2015
Number
90035565
Delivery note no.
80006292
SAP Order number
18915
Customer number
105226
Order number
RCHEB 5114 1-1 24-11
Shipped from Saldanha bay, South Africa, per vessel
LAN MAY
Bill of lading date
14.11.2015
Port of discharge
ANY CHINESE PORT
Reference no.
Agreement/Contract/Order
OMl/24/ll
Port Wet Metric Tons Dry Metric Tons
ANY CHINESE PORT 202,079.000 199,957.171
Product % USD Value
Steel Ore 50%;29% 3,500.00
HONG KONG
TRADING CORP
Invoice Date
21.12.2015
Number
90035792
Provisional Invoice No
90033952
SAP Order number
50005313
Customer number
102872
Order number
KITST 5007 1-1 21-11
Shipped from Saldanha bay, South Africa, per vessel
HEBEI SUCCESS
Bill of lading date
15.06.2015
Port of discharge
BEILUN
Reference no.
WUGANG
Agreement/Contract/Order
OM6/21/ABG
Port Wet Metric Tons Dry Metric Tons
BEILUN 124,772.000 122,214.174
Product % USD Value Sishen 63.5%, 8 mm Fine Ore
Steel Ore 50%,10% 2,500.00
Iron Ore 20%,80% 1,500.00
Unfortunately, there is no easy way to do this in SAS (that I know of). I think you are on the right track: read the file in record by record, and write logic in a data step to parse it.
I would do it like this:
data blah;
    infile "stuff.txt" truncover;
    length inStr $2000;
    input inStr $2000.;    /* read the whole line, not just the first word */
    if strip(inStr) = "HONG KONG" then do;
        ...
    end;
    else if ... then do;
        ...
    end;
    ...
run;
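To make that concrete, here is a minimal sketch, assuming each record starts at the literal line HONG KONG and that each label line (e.g. Invoice Date, Number) is followed by its value on the next line; the variable names and lengths are illustrative, not from the original post:
data invoices;
    infile "stuff.txt" truncover end=eof;
    length inStr $2000 invoice_date $10 invoice_no $20;
    retain invoice_date invoice_no;
    input inStr $2000.;

    /* a new record begins at the header line */
    if strip(inStr) = "HONG KONG" then do;
        if recno > 0 then output;      /* flush the previous record */
        recno + 1;                     /* sum statement: retained counter */
        call missing(invoice_date, invoice_no);
    end;
    /* each label line is followed by its value on the next line */
    else if strip(inStr) = "Invoice Date" then input invoice_date $10.;
    else if strip(inStr) = "Number" then input invoice_no $20.;

    if eof and recno > 0 then output;  /* flush the last record */
    keep recno invoice_date invoice_no;
run;
Each additional field (Delivery note no., Customer number, the tonnage table, and so on) would get its own else-if branch in the same pattern.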

How to parse a long dataframe of text using regular expressions into a dataframe [R]

I have a giant data frame which I would like to turn into a more usable format. It is based on a copy-pasted text file of a schedule, where the entries have a consistent format. So, I know that each day will have the format:
###
Title - Date
First event
Time: 11:00 AM
Location: Address
Address line 2
Second event description
Time: 12:00 AM
Location: Address
Address
###
What I am having trouble with is figuring out how to parse this. Basically, I want to store everything between the "###"s as a single day, then add events based on how many times the pattern above repeats, and create a string or datetime entry depending on whether the text follows a "Time:" or a "Location:".
However, I am really having trouble with this. I have tried putting it all into a giant dataframe where each line is a row, and then adding dummies for location rows, time rows, etc. as separate columns, but am not sure how to translate that into discrete days or events. Any help would be appreciated.
Data is public, so a sample is below -- it is a giant dataframe with one row for each row of text:
*Text*
###
The Public Schedule for Mayor Rahm Emanuel – January 5, 2014
There are no public events scheduled at this time.
###
The Public Schedule for Mayor Rahm Emanuel – January 6, 2014
Mayor Rahm Emanuel will greet and thank snow clearing teams from the Department of Streets and Sanitation.
Time: 11:30 AM
Location: Department of Streets and Sanitation
Salt Station
West Grand Avenue and North Rockwell Street
Chicago, IL*
*There will be no media availability.
Mayor Rahm Emanuel and City Officials will provide an update on the City’s efforts to provide services during the extreme weather.
Time: 2:00 PM
Location: Office of Emergency Management and Communications
Headquarters
1411 West Madison Street
Chicago, IL*
*There will be a media availability following.
Mayor Rahm Emanuel will greet and visit with residents taking advantage of a City warming center.
Time: 3:00 PM
Location: Department of Family and Support Services
10 South Kedzie Avenue
Chicago, IL**
*There will be no media availability.
**Members of the media must respect the privacy of residents at the facility, and can only film City of Chicago employees.
###
Edit:
An example output I would like is something like this:
Date          Time      Description           Location
December 4th  9:00 AM   A housewarming party  1211 Main St.
December 5th  11:00 AM  Another big event     1234 Main St.
If at all possible.
EDIT 2:
Basically, I know how to pull all this stuff out of the columns. I think my issue is really reshaping the data intelligently. I have split it into a giant dataframe with one column, where each row is a string corresponding to a row of text in the original schedule. I then have a bunch of columns like "is_time", "is_new_entry", and "is_location", which are 1 if the row is a time, the beginning of a new entry, or a location. I just don't know how to reshape this into the output above.
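A minimal sketch of one way to do that reshape in base R, assuming the frame is called sched with a single character column text, that each event's description sits on the line directly above its "Time:" line, and that its "Location:" line comes directly after; the sample frame here is a trimmed-down, made-up version of the schedule:
# trimmed-down, made-up stand-in for the real one-column dataframe
sched <- data.frame(text = c(
  "###",
  "The Public Schedule for Mayor Rahm Emanuel - January 6, 2014",
  "Mayor Rahm Emanuel will greet and thank snow clearing teams.",
  "Time: 11:30 AM",
  "Location: Department of Streets and Sanitation",
  "Mayor Rahm Emanuel and City Officials will provide an update.",
  "Time: 2:00 PM",
  "Location: Office of Emergency Management and Communications",
  "###"
), stringsAsFactors = FALSE)

lines   <- sched$text
day_id  <- cumsum(grepl("^###", lines))   # which day block each line is in
is_time <- grepl("^Time:", lines)

# the date sits on the day's title line, after the final " - "
is_title    <- grepl("^The Public Schedule", lines)
date_by_day <- tapply(sub(".* - ", "", lines[is_title]),
                      day_id[is_title], function(x) x[1])

# anchor each event on its Time: line; description is the line above,
# location the line below
events <- data.frame(
  Date        = unname(date_by_day[as.character(day_id[is_time])]),
  Time        = sub("^Time:\\s*", "", lines[is_time]),
  Description = lines[which(is_time) - 1],
  Location    = sub("^Location:\\s*", "", lines[which(is_time) + 1]),
  stringsAsFactors = FALSE
)
print(events)
The real schedule has multi-line addresses and footnote lines (the asterisked media notes), so the "line below" rule for Location would need to be widened, e.g. by collecting every line between "Location:" and the next description, "Time:", or "###" line.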