Fast way to convert strings to numbers on large dataset - python-2.7

I have a data set with tens of millions of rows. Several columns in this data represent categorical features. Each level of these features is represented by an alpha-numeric string like "b009d929".
C1 C2 C3 C4 C5 C6 C7
68fd1e64 80e26c9b fb936136 7b4723c4 25c83c98 7e0ccccf de7995b8 ...
68fd1e64 f0cf0024 6f67f7e5 41274cd7 25c83c98 fe6b92e5 922afcc0
I'd like to be able to use Python to map each distinct level to a number to save memory, so that feature C1's levels would be replaced by numbers from 1 to C1_n, C2's levels by numbers from 1 to C2_n...
Each feature has a different number of levels, ranging from under 10 to 10k+.
I tried dictionaries with Pandas' .replace(), but it gets extremely slow.
What is a fast way to approach this problem?

I figured out that the categorical feature values were hashed onto 32 bits, so I ended up reading the file in chunks and applying this simple function:
int(categorical_feature_value, 16)
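For reference, a minimal sketch of that chunked approach (the file name, chunk size and column list here are assumptions for illustration):
import pandas as pd
cat_cols = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]  # hypothetical column names
chunks = []
for chunk in pd.read_csv("train.csv", chunksize=1000000):
    for col in cat_cols:
        # Each level is a 32-bit hex string like "68fd1e64", so parsing it
        # as base 16 turns it into a compact integer.
        chunk[col] = chunk[col].apply(lambda v: int(v, 16))
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)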

Related

Making two IF statements in a manual input cell which reference two different cells

Currently, I have Data Validation for cell E4 with the formula =IF(E4>1,"",1). This prohibits the user from entering any number other than 1 or zero.
I need the user to manually enter the number “1” in cell E4 if a particular action is accomplished (a numerical checkbox, essentially).
I also want the value of cell E4 to read “1” if cell B4 reads the number “3”.
I’ve read several examples of nested IF statements and can’t find any that reference two different cells to decide the value of another cell.
Can I make nested IF statements in Data Validation? I haven’t been successful in doing so.
The two formulas I need to affect cell E4 are:
IF(E4>1,"",1)
and
IF(B4=3,E4=1,"")
Any help is much appreciated.
You just need an OR statement.
=IF(OR(E4 > 1, B4 = 3),"",1)

Pandas: How to One Hot Encode Categorical features

I have a data-frame X which has two categorical features and 41 numerical features, so X has a total of 43 features.
Now, I would like to convert the categorical features into numerical levels so they can be used in a RandomForest classifier.
I have done the following, where 0 and 1 indicate the locations of the categorical features:
import pandas as pd
X = pd.read_csv("train.csv")
F1 = pd.get_dummies(X.iloc[:, 0])
F2 = pd.get_dummies(X.iloc[:, 1])
Then, I concatenate these two data-frames:
Xnew = pd.concat([F1, F2, X.iloc[:, 2:]], axis=1)
Now, Xnew has 63 features (F1 has 18 and F2 has 4 features, remaining 41 are from X)
Is this correct? Is there a better way of doing the same thing? Do I need to drop the first column from F1 and F2 to avoid collinearity?
Since F1 has 18 levels (not features) and F2 has 4, your result looks correct.
To avoid collinearity, you would do better to drop one of the columns (from each of F1 and F2), not necessarily the first column. Typically you drop the column corresponding to the most common level.
Why the one with the most common level? Think about feature importance. If you drop one column, it has no chance of having its importance estimated. The level you dropped becomes your "base level"; only deviations from the base level can be marked as important or not.
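A rough sketch of that suggestion, reusing the column positions from the question (the drop call and the value_counts trick are illustrative, not from the original answer):
import pandas as pd
X = pd.read_csv("train.csv")
F1 = pd.get_dummies(X.iloc[:, 0])
F2 = pd.get_dummies(X.iloc[:, 1])
# Drop the dummy column of the most common level in each categorical feature,
# so that level becomes the implicit "base level".
F1 = F1.drop(X.iloc[:, 0].value_counts().idxmax(), axis=1)
F2 = F2.drop(X.iloc[:, 1].value_counts().idxmax(), axis=1)
Xnew = pd.concat([F1, F2, X.iloc[:, 2:]], axis=1)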

Index-based access on Matrix-like structure in C++

I have an Nx2 mapping between two sets of encodings (not really relevant: Unicode and GB18030) in this format:
Warning: huge XML, don't open it on a slow connection:
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
Snapshot:
<a u="00B7" b="A1 A4"/>
<a u="00B8" b="81 30 86 30"/>
<a u="00B9" b="81 30 86 31"/>
<a u="00BA" b="81 30 86 32"/>
I would like to save the b-values (right column) in a data structure and to access them directly (no searching) with indexes based on a-values (left column).
Example:
I can store those elements in a data structure like this:
unsigned short *my_page[256] = {my_00, my_01, ....., my_ff};
where the elements are defined like:
static unsigned short my_00[256]; etc.
So basically a matrix of matrices: 256x256 = 65536 available elements.
In the case of other encodings with fewer elements and different values (e.g. Chinese Big5, Japanese Shift, Korean KSC, etc.), I can access the elements using a bijective function like this:
element = my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF]; where unicode[i] is filled with the a-values from the mapping (as mentioned above). How I generate and fill the my_page structure is analogous. For those working encodings, I have around 7000 characters to store, and each one is stored in a unique place in my_page.
The problem comes with the GB18030 encoding, where I try to store 30861 elements in my_page (65536 available elements). I am trying to use the same bijective function for filling (and then accessing, analogously) the my_page structure, but it fails since the lookup does not return unique results.
For example, for the unicode values there is more than one element accessed via
my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF], since the indexes can be the same for, say, i and i+1. Do you know another way of accessing/filling the elements in the my_page structure based only on pre-computed indexes, as I was trying to do?
I assume I have to use something like a pseudo-hash function that returns a range of values VRange, and, based on a set of rules, I can extract from VRange the integer indexes into my_page[256][256].
If you have any advice, please let me know :)
Thank you !
For GB18030, refer to this document: http://icu-project.org/docs/papers/gb18030.html
As explained in this article:
“The number of valid byte sequences -- of Unicode code points covered and of mappings defined between them -- makes it impractical to directly use a normal, purely mapping-table-based codepage converter. With about 1.1 million mappings, a simple mapping table would be several megabytes in size.”
So it is most probably not a good idea to implement the conversion based on a pure mapping table.
For large parts, there is a direct mapping between GB18030 and Unicode: most of the four-byte characters can be translated algorithmically. The author of the article suggests handling such ranges with special code, and the remaining characters with a classic mapping table. Those remaining characters are the ones given in the XML mapping table: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
Therefore, index-based access on a matrix-like structure in C++ remains an open problem for anyone who wants to research such bijective functions.
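As a very rough illustration of that hybrid idea (written in Python for brevity; only the well-known linear mapping of supplementary code points to four-byte sequences is shown, and bmp_table is assumed to be a 256x256 structure built from the XML file, just like my_page in the question):
def unicode_to_gb18030(cp, bmp_table):
    if cp > 0xFFFF:
        # Supplementary code points map linearly onto four-byte sequences
        # starting at 0x90 0x30 0x81 0x30 (trail bytes cycle 0x30-0x39 and 0x81-0xFE).
        idx = cp - 0x10000
        b4 = 0x30 + idx % 10
        idx //= 10
        b3 = 0x81 + idx % 126
        idx //= 126
        b2 = 0x30 + idx % 10
        idx //= 10
        b1 = 0x90 + idx
        return (b1, b2, b3, b4)
    # BMP code points go through the mapping table, indexed exactly like my_page.
    return bmp_table[(cp >> 8) & 0xFF][cp & 0xFF]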

Getting trends from raw data

Say we have a lot of data which looks like this:
chain of digits | time
23 67 34 23 54 | 12:34
23 54 | 12:42
78 96 23 | 12:46
56 93 23 54 | 12:48
I need to find trends in the number chains (growing, falling, stable). In my example it might be 23 54 or 23.
I also want to find different correlations between trends. The data is very big; it might be billions of rows. Can you suggest any books, articles or algorithms? Note that I only need information about trends and correlations in this type of data. I don't need basic data mining books.
Here's the grain of an algorithm. It certainly isn't fleshed out or tested, and it may not be complete. I'm just throwing it out here as a possible starting point.
It seems the most challenging issue is time required to run the algorithm over billions of rows, followed perhaps by memory limitations.
I also believe the fundamental task involved in solving this problem lies in the single operation of "comparing one set of numbers with another" to locate a shared set.
Therefore, might I suggest the following (rough) approach, in order to tackle both time, and memory:
(1) Consolidate multiple sets into a single, larger set.
i.e., take 100 consecutive sets (in your example, 23, 67, 34, 23, 54, 23, 54, 78, 96, 23, and the following 97 sets), and simply merge them together into a single set (ignoring duplicates).
(2) Give each *consolidated* set from (1) a label (or index),
and then map this set (by its label) to the original sets that compose it.
In this way, you will be able to retrieve (look up) the original individual sets 23, 67, 34, 23, 54, etc.
(3) The data is now denormalized - there are far fewer sets, and each set is much larger.
Now, the algorithm moves onto a new stage.
(4) Develop an algorithm to look for matching sequences between any two of these larger sets.
There will be many false positives; however, hopefully the nature of your data is that the false positives will not "ruin" the efficiency that is gained by this approach.
I don't provide an algorithm to perform the matching between 2 individual sets here; I assume that you can come up with one yourself (sort both the sets, etc.).
(5) For every possible matching sequence found in (4), iterate through the individual sets that compose
the two larger sets being compared, weeding out false positives.
I suspect that the above step could be optimized significantly, but this is the basic idea.
At this point, you will have all of the matching sequences between all original sets that compose the two larger sets being compared.
(6) Execute steps (4) and (5) for every pair of large sets constructed in (2).
Now, you will have ALL matching sequences - with duplicates.
(7) Remove duplicates from the set of matching sequences.
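A rough Python sketch of steps (1), (2) and (4)-(7), just to make the structure concrete; the matching rule inside shared_subchain is only a placeholder (longest common contiguous run), and the grouping size is arbitrary:
from itertools import combinations
def shared_subchain(a, b):
    # Placeholder matching rule: longest common contiguous run of numbers.
    best = ()
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > len(best):
                best = tuple(a[i:i + k])
    return best if len(best) >= 2 else None
def consolidate(rows, group_size):
    # Steps (1)-(2): merge groups of consecutive chains into one large set each,
    # and keep track of the original chains that compose it.
    groups = []
    for start in range(0, len(rows), group_size):
        members = [chain for chain, _time in rows[start:start + group_size]]
        merged = set()
        for chain in members:
            merged.update(chain)  # duplicates ignored
        groups.append((merged, members))
    return groups
def find_shared_chains(rows, group_size=100):
    groups = consolidate(rows, group_size)
    results = set()
    # Step (6): compare every pair of large sets.
    for (merged_a, members_a), (merged_b, members_b) in combinations(groups, 2):
        if not (merged_a & merged_b):          # step (4): cheap pre-filter
            continue
        for chain_a in members_a:              # step (5): weed out false positives
            for chain_b in members_b:
                match = shared_subchain(chain_a, chain_b)
                if match:
                    results.add(match)
    return results                             # step (7): the set removes duplicates
rows = [([23, 67, 34, 23, 54], "12:34"), ([23, 54], "12:42"),
        ([78, 96, 23], "12:46"), ([56, 93, 23, 54], "12:48")]
print(find_shared_chains(rows, group_size=2))  # {(23, 54)}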
Just a thought.

Multiple rows with a single INSERT in SQL Server 2008

I am testing the speed of inserting multiple rows with a single INSERT statement.
For example:
INSERT INTO [MyTable] VALUES (5, 'dog'), (6, 'cat'), (3, 'fish')
This is very fast until I pass 50 rows on a single statement, then the speed drops significantly.
Inserting 10000 rows in batches of 50 takes 0.9 seconds.
Inserting 10000 rows in batches of 51 takes 5.7 seconds.
My question has two parts:
Why is there such a hard performance drop at 50?
Can I rely on this behavior and code my application to never send batches larger than 50?
My tests were done in C++ with ADO.
Edit:
It appears the drop-off point is not 50 rows but 1000 column values in total: I get similar results with 50 rows of 20 columns or 100 rows of 10 columns.
It could also be related to the size of the row. The table you use as an example seems to have only 2 columns. What if it has 25 columns? Is the performance drop off also at 50 rows?
Did you also compare with the "union all" approach shown here? http://blog.sqlauthority.com/2007/06/08/sql-server-insert-multiple-records-using-one-insert-statement-use-of-union-all/
I suspect there's an internal cache/index that is used up to 50 rows (it's a nice round decimal number). After 50 rows it falls back on a less efficient general case insertion algorithm that can handle arbitrary amounts of inputs without using excessive memory.
The slowdown is probably the parsing of the string values: VALUES (5, 'dog'), (6, 'cat'), (3, 'fish') and not an INSERT issue.
try something like this, which will insert one row for each row returned by the query:
INSERT INTO YourTable1 (col1, col2)
SELECT Value1, Value2
FROM YourTable2
WHERE ... -- rows will be more than 50
and see what happens
If you are using SQL Server 2008, then you can use table-valued parameters and just do a single INSERT statement.
Personally, I've never seen the slowdown at 50 inserted records, even with regular batches. Regardless, we moved to table-valued parameters, which gave us a significant speed increase.
Random thoughts:
Is it completely consistent when run repeatedly?
Are you checking for duplicates in the first 10k rows for the second 10k insert?
Did you try a batch size of 51 first?
Did you empty the table between tests?
For high-volume and high-frequency inserts, consider using Bulk Inserts to load your data. Not the simplest thing in the world to implement, and it brings with it a new set of challenges, but it can be much faster than doing an INSERT.