I have 185 million samples, each of which will be about 3.8 MB. To prepare my dataset, I will need to one-hot encode many of the features, after which I end up with over 15,000 features.
But I need to prepare the dataset in batches, since the memory footprint exceeds 100 GB for the features alone when one-hot encoding only 3 million samples.
The question is how to preserve the encodings/mappings/labels between batches?
The batches are not going to have all the levels of a category necessarily. That is, batch #1 may have: Paris, Tokyo, Rome.
Batch #2 may have Paris, London.
But in the end I need Paris, Tokyo, Rome, and London all mapped to one and the same encoding.
Assuming that I cannot determine all the levels of my Cities column across the 185 million samples at once, since they won't fit in RAM, what should I do?
If I apply the same LabelEncoder instance to different batches, will the mappings remain the same?
I will also need to apply one-hot encoding, either with scikit-learn or Keras' np_utils.to_categorical, in batches after this. So, same question: how do I use those three methods in batches, or apply them at once to a file format stored on disk?
I suggest using Pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values when you call .fit(); otherwise it will throw an error when it encounters a new one during .transform().
import pandas as pd

# Create toy dataset and split into batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]

# Convert the categorical feature column to a matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')

# Row-bind (append) the encoded batches back together
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0)

# Final wrap-up: replace NaNs with 0 and convert the flags from float to int
final_encoded = final_encoded.fillna(0)
final_encoded[final_encoded.columns] = final_encoded[final_encoded.columns].astype(int)
final_encoded
Output:
City_Chicago City_London City_Paris City_Rome City_Tokyo
0 0 0 1 0 0
1 0 0 0 0 1
2 0 0 0 1 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 1 0 0
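As a follow-up (not part of the original answer): if each encoded batch has to be written to disk on its own rather than concatenated in memory, one way to keep the dummy columns consistent across batches is to maintain a running list of the columns seen so far and reindex every batch against it. This is a minimal sketch along those lines; the loop structure and variable names are illustrative assumptions:

import pandas as pd

data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batches = [data_column[:3], data_column[3:]]

known_columns = []      # dummy columns seen in any batch so far
encoded_batches = []

for batch in batches:
    encoded = pd.get_dummies(batch, prefix='City')
    # Remember any columns that are new in this batch
    known_columns += [c for c in encoded.columns if c not in known_columns]
    # Align this batch to every column seen so far; missing ones are filled with 0
    encoded_batches.append(encoded.reindex(columns=known_columns, fill_value=0))

# Batches encoded before a new category appeared still lack its column, so the
# final concat + fillna(0) from the answer above is still needed when combining.
final = pd.concat(encoded_batches, axis=0).fillna(0).astype(int)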
I have a table with the columns as below -
There are rows showing the allocation of various people under various project. The months (columns) can extend to Dec,20 and continue on from Jan,21 in the same pattern as above.
One Staff can be tagged to any number of projects in a given month.
Now I want to prepare a Power BI report on this in the format as below -
Staff ID, Project ID and End Date are the slicers to be present.
For the End Date slicer we can select options in the format (MMM,YY), e.g. Jan,23. On the basis of this slicer I want to show the preceding 6 months of data, as portrayed by the sample image above.
I have tried using parameters, but those have to be specified for each combination, so they are not usable here since the data is going to grow over time.
Is there any way to do this, or am I missing something simple?
Any help on this will be highly appreciated.
Adding in the sample data as below -
Staff ID  Project ID  Jan,20  Feb,20  Mar,20  Apr,20  May,20  Jun,20  Jul,20
1         20          0       0       0       100     80      10      0
1         30          0       0       0       0       20      90      100
2         20          100     100     100     0       0       0       0
3         50          80      100     0       0       0       0       0
3         60          15      0       0       0       20      0       0
3         70          5       0       100     100     80      0       0
I have a dataset of unique county FIPS codes, some of which need to be merged. The dataset looks like:
FIPS yr1990 yr2000 yr2010
1001 1 0 1
1002 1 1 0
1003 1 0 0
1004 0 0 0
1005 0 0 1
County boundaries have changed and I need to merge several FIPS codes together. Essentially, I need the dataset to look like:
FIPS yr1990 yr2000 yr2010
1001/1003 1 1 1
1002 1 1 0
1004/1005 0 0 1
Is there a way to select specific FIPS codes to be merged across rows?
This solution might not scale to very large datasets, as the replace statements must be written manually, but it keeps the exact format you are using in your example. A more scalable approach might be difficult if there is no system in how the FIPS codes were combined.
* Example generated by -dataex-. For more info, type help dataex
clear
input str4 FIPS byte(yr1990 yr2000 yr2010)
"1001" 1 0 1
"1002" 1 1 0
"1003" 1 0 0
"1004" 0 0 0
"1005" 0 0 1
end
*Combine the FIPS codes
replace FIPS = "1001/1003" if inlist(FIPS,"1001","1003")
replace FIPS = "1004/1005" if inlist(FIPS,"1004","1005")
*Collapse rows by FIPS value, use max value for each var on format yr????
collapse (max) yr???? , by(FIPS)
I've spent quite a lot of time on Stack Overflow looking for answers to other questions, but I'm really stuck on this one, so I'm finally asking a question!
I have a dataset of fish in SAS, with:
a unique ID for each angler
three different variables with number of fish released in each category by that angler: over legal size, under legal size, and released dead
a sequential number (fishno) based on the number of rows for each ID; 1 to the last row of that ID.
Variable to be created: Disposition. This could be either a character variable with "legal", "under", and "dead" options, or even numeric values 1-3.
It was originally set up with one row per unique ID, but I set it so that now there is one row per fish discarded (i.e. if there were 3 legal size and 2 undersize fish, I now have 5 rows).
I need to assign, by unique ID, whether each row/fish was released legal, undersize or dead. In the previous example, for a unique ID, I'd need 3 rows assigned to a Disposition of "legal" and 2 rows assigned to a Disposition of "under".
I've tried first.var statements along with if-then-do statements; played around with macros; nothing worked quite right and I'm pretty stuck here. Is there some sort of random assignment I should try? Is there a much easier way that I'm missing?
Example of the data below...
THANK YOU!!
Data in Excel format
Assuming you already have the FISHNO variable, there needs to be some method for assigning each fish as legal, dead, or undersize. The following code assigns the disposition in that order, by comparing FISHNO against the running totals LEGAL, LEGAL+DEAD, and LEGAL+DEAD+UNDERSIZE:
data have;
input ID LEGAL DEAD UNDERSIZE FISHNO;
datalines;
15 1 0 1 1
15 1 0 1 2
29 2 0 2 1
29 2 0 2 2
29 2 0 2 3
29 2 0 2 4
38 1 0 1 1
38 1 0 1 2
53 1 0 1 1
53 1 0 1 2
55 1 0 1 1
55 1 0 1 2
;
run;
data want;
set have;
if legal>0 and legal>=fishno then disposition = 'legal';
else if dead>0 and legal+dead>=fishno then disposition = 'dead';
else if undersize>0 and legal+dead+undersize>=fishno then disposition = 'under';
run;
In my Linux C++ application I want to get the names of all SCSI disks which are present on the system, e.g. /dev/sda, /dev/sdb, and so on.
Currently I am getting them from /proc/scsi/sg/devices, whose contents look like this, using the code below:
host chan SCSI id lun type opens qdepth busy online
0 0 0 0 0 1 128 0 1
1 0 0 0 0 1 128 0 1
1 0 0 1 0 1 128 0 1
1 0 0 2 0 1 128 0 1
// MAX_ENG_ALPHABETS is 26. If the SCSI device id is >= 26, the corresponding device name has two letters, e.g. /dev/sdaa, /dev/sdab, etc.
if (MAX_ENG_ALPHABETS <= scsiId)
{
    // Device name order is: aa, ab, ..., az, ba, bb, ..., bz, ..., zy, zz.
    deviceName.append(1, (char)('a' + scsiId / MAX_ENG_ALPHABETS - 1));
    deviceName.append(1, (char)('a' + scsiId % MAX_ENG_ALPHABETS));
}
// If the SCSI device id is < 26, the corresponding device name has a single letter, e.g. /dev/sda or /dev/sdb.
else
{
    deviceName.append(1, (char)('a' + scsiId));
}
But the file /proc/scsi/sg/devices also contains information about disks which were previously present on the system. For example, if I detach the disk (LUN) /dev/sdc from the system, the file /proc/scsi/sg/devices still contains an entry for /dev/sdc, which is invalid.
Is there a different way to get the SCSI disk names, for example a system call?
Thanks
You can simply read the list of all files matching /dev/sd* (in C, you would need to use opendir/readdir/closedir) and filter it to names of the form sdX (where X is one or two letters).
Also, you can get the list of all partitions by reading the single file /proc/partitions and then filtering the 4th field by sdX:
$ cat /proc/partitions
major minor #blocks name
8 0 52428799 sda
8 1 265041 sda1
8 2 1 sda2
8 5 2096451 sda5
8 6 50066541 sda6
which would give you the list of all physical disks together with their capacity (the 3rd field, #blocks).
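As an illustration of that filtering (not from the original answer), here is a minimal sketch that parses /proc/partitions and keeps only whole-disk names of the form sdX or sdXY. It is written in Python for brevity; the same parsing applies one-to-one in the C++ application (read the file line by line, split on whitespace, and match the 4th field):

import re

# Match whole disks like sda or sdaa, but not partitions like sda1 (assumption:
# only sdX / sdXY names are wanted, as in the question's code).
DISK_RE = re.compile(r'^sd[a-z]{1,2}$')

disks = []
with open('/proc/partitions') as f:
    for line in f:
        fields = line.split()
        # Data lines have 4 fields: major, minor, #blocks, name
        if len(fields) != 4 or not fields[2].isdigit():
            continue  # skip the header and the blank line after it
        name = fields[3]
        if DISK_RE.match(name):
            disks.append(('/dev/' + name, int(fields[2])))  # (device, #blocks)

for device, blocks in disks:
    print(device, blocks)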
After getting the disk name list from /proc/scsi/sg/devices, you can verify through code that each disk still exists. For example, install sg3-utils and use sg_inq to query whether the disk is active.
I have a Fortran code which reads a txt file with separate lines of characters and digits and then writes them into a 1D array with 20 elements.
This code is not compatible with the Fortran 77 compiler Force 2.0.9. My question is how I can apply the aforementioned procedure using a Fortran 77 compiler, i.e. define a 1D array and then write the txt file line by line into the elements of the array?
Thank you in advance.
The txt file follows:
Case 1:
10 0 1 2 0
1.104 1.008 0.6 5.0
25 125.0 175.0 0.7 1000.0
0.60
1 5
Advanced Case
15 53 0 10 0 1 0 0 1 0 0 0 0
0 0 0 0
0 0 1500.0 0 0 .03
0 0.001 0
0.1 0 0.125 0.08 0.46
0.1 5.0 0.04
@Jason:
I am a beginner and still learning Fortran. I guess Force 2 uses g77.
The following is the corresponding part of the original code. The Force 2 editor returns an empty txt file as a result.
DIMENSION CARD(20)
CHARACTER*64 FILENAME
DATA XHEND / 4HEND /
OPEN(UNIT=3,FILE='CON')
OPEN(UNIT=4,FILE='CON')
OPEN(UNIT=7,STATUS='SCRATCH')
WRITE(3,9000) 'PLEASE ENTER THE INPUT FILE NAME : '
9000 FORMAT (A)
READ(4,9000) FILENAME
OPEN(UNIT=5,FILE=FILENAME,STATUS='OLD')
WRITE(3,9000) 'PLEASE ENTER THE OUTPUT FILE NAME : '
READ(4,9000) FILENAME
OPEN(UNIT=6,FILE=FILENAME,STATUS='NEW')
FILENAME = '...'
IR = 7
IW = 6
IP = 15
5 REWIND IR
I = 0
2 READ (5,7204,END=10000) CARD
IF (I .EQ. 0 ) WRITE (IW,7000)
7000 FORMAT (1H1 / 10X,15HINPUT DECK ECHO / 10X,15(1H-))
I= I + 1
WRITE (IW,9204) I,CARD
IF (CARD(1) .EQ. XHEND ) GO TO 7020
WRITE (IR,7204) CARD
7204 FORMAT (20A4)
9204 FORMAT (1X,I4,2X,20A4)
GO TO 2
7020 REWIND IR
It looks like CARD is being used as an array to hold 20 four-character strings. I don't see a declaration as a character variable, only as an array, so perhaps, in extremely old FORTRAN style, a non-character variable is being used to hold characters. You are using a 20A4 format, so the values have to be positioned in the file precisely as 20 groups of 4 characters; you have to add blanks so that they are aligned into groups of 4 columns.
If you want to read numbers, it would be much easier to read them into a numeric type and use list-directed IO:
      real values(20)
      read (5, *) values
Then you wouldn't have to worry about the precise positioning of the values in the file.
This is really archaic FORTRAN ... even pre-FORTRAN-77 in style. I can't remember the last time that I saw Hollerith (H) formats! Where are you learning this from?
Edit: While I like Fortran for many programming tasks, I wouldn't use FORTRAN 66! Computers are supposed to make things easier ... there is no reason to have to count characters. Instead of
7000 FORMAT (1H1 / 10X,15HINPUT DECK ECHO / 10X,15(1H-))
You can use
7000 FORMAT ( / 10X, "INPUT DECK ECHO" / 10X, 15("-") )
I can think of only two reasons to use a Hollerith code: not bothering to change legacy source code (it is remarkable that a current Fortran compiler can process a feature that was obsolete 30 years ago! Fortran source code never dies!), or studying the history of computing languages. The name honors a great computing pioneer, whose invention accomplished the 1890 US Census in one year, when the 1880 Census took eight years: http://en.wikipedia.org/wiki/Herman_Hollerith
I very much doubt that you will see the "1" in the first column performing "carriage control" today; I had to look up that "1" was the code for a page eject. You are much more likely to see it in your output. See "Are Fortran control characters (carriage control) still implemented in compilers?"