Capture an undefined number of items with a regex

I want to capture the following data:
[TREND, JOHN, DATA1, 0.17, 33.34, 26, 33.33, 0.25, 33.33, DATA2, 0.26, 20.0, 261, 20.0, 0.234, 20.0, 0.1, 20.0, 5, 20.0, FINAL, 245]
From the following text:
Market
#TREND
Person: JOHN
DATA1
1) 0.17 (33.34%)
2) 26 (33.33%)
3) 0.25 (33.33%)
* random text here
DATA2
1) 0.26 (20.0%)
2) 261 (20.0%)
3) 0.234 (20.0%)
4) 0.1 (20.0%)
5) 5 (20.0%)
* qsdfdsf random dsfg text random here
FINAL
245
Signature
I have written the following regex, which works properly for this precise example:
#(TREND)\n+\w*:\s*(JOHN)\n+(DATA1)\n\d\S\s(\d+.?\d*)\s\((\d+.?\d*)%\)\s*\n\d\S\s(\d+.?\d*)\s\((\d+.?\d*)%\)\s*\n\d\S\s(\d+.?\d*)\s\((\d+.?\d*)%\)\s*\n.*\n*(DATA2)\n\d\S\s(\d+.?\d*)\s\((\d+.?\d*)%\)\s*\n\d\S\s(\d+.?\d*)\s\((\d+.?\d*)%\)\s*\n\d\S\s(\d+.?\d*)\s\((\d+.?\d*)%\)\s*\n\d\S\s(\d+.?\d*)\s\((\d+.?\d*)%\)\s*\n\d\S\s(\d+.?\d*)\s\((\d+.?\d*)%\)\s*\n.*\n*(FINAL)\n(\d+)
I would like to make it extensible to a variable number of items (from 1 to 10) in DATA1 and DATA2:
Market
#TREND
Person: JOHN
DATA1
1) 0.17 (33.34%)
2) 26 (33.33%)
3) 0.25 (33.33%)
4) 0.11 (40.40%)
5) 0.222 (50.50%)
* random text here
DATA2
1) 0.26 (20.0%)
2) 261 (20.0%)
3) 0.234 (20.0%)
* qsdfdsf random dsfg text random here
FINAL
245
Signature
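One possible way to make this extensible (a sketch in Python, assuming the block layout stays exactly as shown): instead of one long pattern with a fixed number of item groups, match each DATA block with a repeated group, then pull the value/percent pairs out of each block with re.findall:

```python
import re

text = """Market
#TREND
Person: JOHN
DATA1
1) 0.17 (33.34%)
2) 26 (33.33%)
3) 0.25 (33.33%)
4) 0.11 (40.40%)
5) 0.222 (50.50%)
* random text here
DATA2
1) 0.26 (20.0%)
2) 261 (20.0%)
3) 0.234 (20.0%)
* qsdfdsf random dsfg text random here
FINAL
245
Signature"""

result = []
header = re.search(r'#(TREND)\n+\w*:\s*(\w+)', text)
result += [header.group(1), header.group(2)]

# Each DATA block: a label line followed by 1..N numbered item lines
for block in re.finditer(r'(DATA\d+)\n((?:\d+\)\s\S+\s\(\S+%\)\n?)+)', text):
    result.append(block.group(1))
    for value, pct in re.findall(r'\d+\)\s(\S+)\s\((\S+)%\)', block.group(2)):
        result += [value, pct]

final = re.search(r'(FINAL)\n(\d+)', text)
result += [final.group(1), final.group(2)]
print(result)
```

The repeated non-capturing group `(?: … )+` is what removes the fixed item count: each block matches however many numbered lines follow it, and the inner findall then recovers the individual pairs (a single pattern cannot return a variable number of capture groups).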

Related

Calculating cumulative multivariate normal distribution

I have 1000 observations and 3 variables in Stata that are associated with 1000 people. Let's say the data looks something like this (the numbers are made up):
Observation   B1   B2   B3
1             -3    5    3
2              2   -3    2
3              6   -2    5
4              5    3    3
...           ...  ...  ...
1000          ..   ..   ...
This has the following correlation matrix (again, made-up numbers):
R = (1, 0.5, 0.5
0.5, 1, 0.5
0.5, 0.5, 1
0.5, 0.5, 0.5)
I want to calculate the CDF of the multivariate normal distribution of variables B1, B2 and B3 for each of the 1000 persons, using the same correlation matrix. Basically, it is similar to Example 3 in this document: https://www.stata.com/manuals/m-5mvnormal.pdf, but with 3 variables, and with multiple limits and a single correlation matrix rather than multiple limits and multiple correlation matrices. So I will have 1000 CDF values for 1000 people. I have tried mvnormal(U,R). Specifically, I wrote:
mkmat B1 B2 B3, matrix(U)
matrix define R = (1, 0.5, 0.5 \
0.5, 1, 0.5 \
0.5, 0.5, 1 \
0.5, 0.5, 0.5)
gen CDF = mvnormal(U,R)
But this doesn't work; apparently this function is no longer available in Stata. I believe Stata has binormal for calculating the CDF of the bivariate normal, but can it compute the CDF of more than 2 variables?
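For a cross-check outside Stata (a sketch, not the Stata answer itself — note that mvnormal() lives in Mata, per the m-5 manual linked above, which is one reason the gen line fails): the same per-person CDF can be computed with scipy.stats.multivariate_normal, using one 3x3 correlation matrix and one row of limits per observation:

```python
import numpy as np
from scipy.stats import multivariate_normal

# 3x3 correlation matrix (made-up numbers, as in the question)
R = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.5, 1.0]])

# rows are observations (people), columns are the limits B1, B2, B3
U = np.array([[-3.0,  5.0, 3.0],
              [ 2.0, -3.0, 2.0],
              [ 6.0, -2.0, 5.0],
              [ 5.0,  3.0, 3.0]])

mvn = multivariate_normal(mean=np.zeros(3), cov=R)
cdf = np.array([mvn.cdf(u) for u in U])  # one CDF value per person
print(cdf)
```

With 1000 rows in U this yields the desired 1000 CDF values, all against the single correlation matrix R.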

Creating a Borland C++ Builder-compatible dll in Visual C++

Friends and comrades!
I have a program that was created in Borland C++ Builder. This program uses several DLLs that were also created in Borland C++ Builder. But now we use Microsoft Visual Studio.
My task is to rewrite an existing DLL: I need to add several new lines with math formulas. My boss gave me the .cpp file, which was written in Borland C++ Builder. I have included the needed formulas, but I'm not sure my preprocessor directives are correct. Do I need to build only the .cpp file, or are other files required?
Original cpp-file
#include <windows.h>
#include <stdio.h>
#include <math.h>
#pragma argsused
void FAR PASCAL _export userdll(float *x,char *c1,char *c2,char *c3,char *c4,char *c5)
{
x[501]=x[115]*9.81; // -- Mdv in N*m
x[502]=x[501]*x[122]/7023.5; // -- Ndv in hp
x[503]=x[502]*0.7355; // -- Ndv in kW
if(x[502]>1.0&&x[123]>4.0)x[510]=x[123]/x[502]; // -- Ce in kg/(hp*h)
else x[510]=0;
if(x[502]>1.0&&x[123]>4.0)x[511]=x[123]/x[503]; // -- Ce in kg/(kW*h)
else x[511]=0;
x[520]=x[34]*7.50064; // -- Pbar in mmHg
return;
}
Rewritten cpp-file
#define STRUCT
#include <windows.h>
#include <stdio.h>
#include <math.h>
#pragma argsused
__declspec(dllexport)
void FAR PASCAL userdll(float *x, char *c1, char *c2, char *c3, char *c4, char *c5)
{
x[501] = x[115] * 9.81; // -- Mdv in N*m
x[502] = x[501] * x[122] / 7023.5; // -- Ndv in hp
x[503] = x[502] * 0.7355; // -- Ndv in kW
if (x[502] > 1.0 && x[123] > 4.0)
x[510] = x[123] / x[502]; // -- Ce in kg/(hp*h)
else x[510] = 0;
if (x[502] > 1.0 && x[123] > 4.0)
x[511] = x[123] / x[503]; // -- Ce in kg/(kW*h)
else x[511] = 0;
x[520] = x[34] * 7.50064; // -- Pbar in mmHg
int chooseStr, chooseCol;
float temp; // temp - temperature from the sensor
float DAVLENIE; // pressure from the table in GOST R 52517-2005
float massiv[61][9] =
{
// --------------------------------------------------------------------------------------
// |                       Humidity, %                      |                            |
// --------------------------------------------------------| Temperature, deg. Celsius  |
// | 100 | 90 | 80 | 70 | 60 | 50 | 40 | 30 | 20 |         |                            |
// --------------------------------------------------------------------------------------
{ 0.30, 0.27, 0.24, 0.21, 0.18, 0.15, 0.12, 0.09, 0.06 }, // -10
{ 0.33, 0.29, 0.26, 0.23, 0.20, 0.16, 0.13, 0.10, 0.07 }, // -9
{ 0.35, 0.32, 0.28, 0.25, 0.21, 0.18, 0.14, 0.11, 0.07 }, // -8
{ 0.38, 0.34, 0.30, 0.27, 0.23, 0.19, 0.15, 0.11, 0.08 }, // -7
{ 0.41, 0.36, 0.32, 0.28, 0.24, 0.20, 0.16, 0.12, 0.08 }, // -6
{ 0.43, 0.39, 0.35, 0.30, 0.26, 0.22, 0.17, 0.13, 0.09 }, // -5
{ 0.46, 0.41, 0.37, 0.32, 0.28, 0.23, 0.18, 0.14, 0.09 }, // -4
{ 0.49, 0.44, 0.39, 0.34, 0.30, 0.25, 0.20, 0.15, 0.10 }, // -3
{ 0.53, 0.47, 0.42, 0.37, 0.32, 0.26, 0.21, 0.16, 0.10 }, // -2
{ 0.56, 0.50, 0.45, 0.39, 0.34, 0.28, 0.22, 0.17, 0.11 }, // -1
{ 0.60, 0.54, 0.48, 0.42, 0.36, 0.30, 0.24, 0.18, 0.12 }, // 0
{ 0.64, 0.58, 0.51, 0.45, 0.39, 0.32, 0.26, 0.19, 0.13 }, // 1
{ 0.69, 0.62, 0.55, 0.48, 0.41, 0.34, 0.28, 0.21, 0.14 }, // 2
{ 0.74, 0.66, 0.59, 0.52, 0.44, 0.37, 0.30, 0.22, 0.15 }, // 3
{ 0.79, 0.71, 0.63, 0.55, 0.47, 0.40, 0.32, 0.24, 0.16 }, // 4
{ 0.85, 0.76, 0.68, 0.59, 0.51, 0.42, 0.34, 0.25, 0.17 }, // 5
{ 0.91, 0.82, 0.73, 0.64, 0.55, 0.46, 0.36, 0.27, 0.18 }, // 6
{ 0.98, 0.88, 0.78, 0.68, 0.59, 0.49, 0.39, 0.29, 0.20 }, // 7
{ 1.05, 0.94, 0.84, 0.73, 0.63, 0.52, 0.42, 0.31, 0.21 }, // 8
{ 1.12, 1.01, 0.90, 0.78, 0.67, 0.56, 0.45, 0.34, 0.22 }, // 9
{ 1.20, 1.08, 0.96, 0.84, 0.72, 0.60, 0.48, 0.36, 0.24 }, // 10
{ 1.28, 1.16, 1.03, 0.90, 0.77, 0.64, 0.51, 0.39, 0.26 }, // 11
{ 1.37, 1.24, 1.10, 0.96, 0.82, 0.69, 0.55, 0.41, 0.27 }, // 12
{ 1.47, 1.32, 1.17, 1.03, 0.88, 0.73, 0.59, 0.44, 0.29 }, // 13
{ 1.57, 1.41, 1.25, 1.10, 0.94, 0.78, 0.63, 0.47, 0.31 }, // 14
{ 1.67, 1.51, 1.34, 1.17, 1.00, 0.84, 0.67, 0.50, 0.33 }, // 15
{ 1.79, 1.61, 1.43, 1.25, 1.07, 0.89, 0.71, 0.54, 0.36 }, // 16
{ 1.90, 1.71, 1.52, 1.33, 1.14, 0.95, 0.76, 0.57, 0.38 }, // 17
{ 2.03, 1.83, 1.62, 1.42, 1.22, 1.01, 0.81, 0.61, 0.41 }, // 18
{ 2.16, 1.94, 1.73, 1.51, 1.30, 1.08, 0.86, 0.65, 0.43 }, // 19
{ 2.30, 2.07, 1.84, 1.61, 1.38, 1.15, 0.92, 0.69, 0.46 }, // 20
{ 2.45, 2.20, 1.96, 1.71, 1.47, 1.22, 0.98, 0.73, 0.49 }, // 21
{ 2.60, 2.34, 2.08, 1.82, 1.56, 1.30, 1.04, 0.78, 0.52 }, // 22
{ 2.77, 2.49, 2.21, 1.94, 1.66, 1.38, 1.11, 0.83, 0.55 }, // 23
{ 2.94, 2.65, 2.35, 2.06, 1.76, 1.47, 1.18, 0.88, 0.59 }, // 24
{ 3.12, 2.81, 2.50, 2.19, 1.87, 1.56, 1.25, 0.94, 0.62 }, // 25
{ 3.32, 2.98, 2.65, 2.32, 1.99, 1.66, 1.33, 0.99, 0.66 }, // 26
{ 3.52, 3.17, 2.82, 2.46, 2.11, 1.76, 1.41, 1.06, 0.70 }, // 27
{ 3.73, 3.36, 2.99, 2.61, 2.24, 1.87, 1.49, 1.12, 0.75 }, // 28
{ 3.96, 3.56, 3.17, 2.77, 2.38, 1.98, 1.58, 1.19, 0.79 }, // 29
{ 4.20, 3.78, 3.36, 2.94, 2.52, 2.10, 1.68, 1.26, 0.84 }, // 30
{ 4.45, 4.01, 3.56, 3.12, 2.67, 2.23, 1.78, 1.34, 0.89 }, // 31
{ 4.72, 4.25, 3.78, 3.30, 2.83, 2.36, 1.89, 1.42, 0.94 }, // 32
{ 5.00, 4.50, 4.00, 3.50, 3.00, 2.50, 2.00, 1.50, 1.00 }, // 33
{ 5.29, 4.76, 4.24, 3.71, 3.18, 2.65, 2.12, 1.59, 1.06 }, // 34
{ 5.60, 5.04, 4.48, 3.92, 3.36, 2.80, 2.24, 1.68, 1.12 }, // 35
{ 5.93, 5.34, 4.74, 4.15, 3.56, 2.97, 2.37, 1.78, 1.19 }, // 36
{ 6.27, 5.64, 5.02, 4.39, 3.76, 3.14, 2.51, 1.88, 1.25 }, // 37
{ 6.63, 5.97, 5.30, 4.64, 3.98, 3.32, 2.65, 1.99, 1.33 }, // 38
{ 7.01, 6.31, 5.61, 4.90, 4.20, 3.50, 2.80, 2.10, 1.40 }, // 39
{ 7.40, 6.66, 5.92, 5.18, 4.44, 3.70, 2.96, 2.22, 1.48 }, // 40
{ 7.81, 7.03, 6.25, 5.47, 4.69, 3.91, 3.12, 2.34, 1.56 }, // 41
{ 8.24, 7.42, 6.59, 5.77, 4.94, 4.12, 3.30, 2.47, 1.65 }, // 42
{ 8.69, 7.82, 6.95, 6.08, 5.21, 4.34, 3.47, 2.61, 1.74 }, // 43
{ 9.15, 8.24, 7.32, 6.41, 5.49, 4.58, 3.66, 2.75, 1.83 }, // 44
{ 9.63, 8.67, 7.71, 6.74, 5.78, 4.82, 3.85, 2.89, 1.93 }, // 45
{ 10.13, 9.12, 8.11, 7.09, 6.08, 5.07, 4.05, 3.04, 2.03 }, // 46
{ 10.65, 9.58, 8.52, 7.45, 6.39, 5.33, 4.26, 3.20, 2.13 }, // 47
{ 11.18, 10.07, 8.95, 7.83, 6.71, 5.59, 4.47, 3.36, 2.24 }, // 48
{ 11.73, 10.56, 9.39, 8.21, 7.04, 5.87, 4.69, 3.52, 2.35 }, // 49
{ 12.30, 11.07, 9.84, 8.61, 7.38, 6.15, 4.92, 3.69, 2.46 } // 50
};
temp = x[33];
chooseStr = roundf(temp) + 10; // round to get the row index into massiv
if (x[32] <= 100 && x[32] > 95) chooseCol = 0;
if (x[32] <= 95 && x[32] > 85) chooseCol = 1;
if (x[32] <= 85 && x[32] > 75) chooseCol = 2;
if (x[32] <= 75 && x[32] > 65) chooseCol = 3;
if (x[32] <= 65 && x[32] > 55) chooseCol = 4;
if (x[32] <= 55 && x[32] > 45) chooseCol = 5;
if (x[32] <= 45 && x[32] > 35) chooseCol = 6;
if (x[32] <= 35 && x[32] > 25) chooseCol = 7;
if (x[32] <= 25) chooseCol = 8;
DAVLENIE = massiv[chooseStr][chooseCol]; // pressure from the table in GOST R 52517-2005
x[530] = ((x[34] - DAVLENIE) / 100.26)*sqrt(288 / (273 + x[33])); // indicated power coefficient K (formula 3)
x[531] = 1.15*x[530] - 0.15; // reduction coefficient Ai (formula 2)
x[535] = x[502] / x[531]; // reduced power, hp (formula 1)
x[536] = x[503] / x[531]; // reduced power, kW (formula 1)
x[533] = x[501] / x[531]; // reduced torque, N*m (formula 4)
x[537] = x[123] / x[530]; // reduced fuel consumption, kg/h (formula 5)
x[539] = (x[123] * 1000 * x[531]) / (x[503] * x[530]); // reduced specific fuel consumption Ce_i, g/(kW*h) (formula 6)
return;
}
If possible, please check this code. I got several errors after copying the DLL into the program folder and running the program. The program (written in Borland C++) can't use my DLL (created in Visual Studio).
One of the errors is something like this:
"...Missing MSVCR120D.dll file..."
Thank you in advance, friends!
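Independent of the build and export questions, the arithmetic in the original userdll can be sanity-checked with a direct Python transcription (a sketch: the input values below are made up, and Python uses double precision where the C code uses float):

```python
def userdll_formulas(x):
    """Direct transcription of the computations in the original userdll()."""
    x[501] = x[115] * 9.81                  # Mdv in N*m
    x[502] = x[501] * x[122] / 7023.5       # Ndv in hp
    x[503] = x[502] * 0.7355                # Ndv in kW
    if x[502] > 1.0 and x[123] > 4.0:
        x[510] = x[123] / x[502]            # Ce in kg/(hp*h)
        x[511] = x[123] / x[503]            # Ce in kg/(kW*h)
    else:
        x[510] = x[511] = 0.0
    x[520] = x[34] * 7.50064                # Pbar in mmHg
    return x

x = [0.0] * 600
x[115], x[122], x[123], x[34] = 100.0, 2000.0, 10.0, 101.3  # made-up inputs
x = userdll_formulas(x)
print(x[501], x[502], x[503], x[510], x[511], x[520])
```

Comparing these values against what the Visual Studio build writes into the array is a quick way to separate "wrong formulas" from "wrong build settings" (the MSVCR120D.dll message itself points to the latter: a Debug build depending on the debug C runtime).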

How do I plot data in a text file depending on the value present in one of the columns

I have a text file with a header and a few columns, which represents results of experiments where some parameters were fixed to obtain some metrics. The file is in the following format:
A B C D E
0 0.5 0.2 0.25 0.75 1.25
1 0.5 0.3 0.12 0.41 1.40
2 0.5 0.4 0.85 0.15 1.55
3 1.0 0.2 0.11 0.15 1.25
4 1.0 0.3 0.10 0.11 1.40
5 1.0 0.4 0.87 0.14 1.25
6 2.0 0.2 0.23 0.45 1.55
7 2.0 0.3 0.74 0.85 1.25
8 2.0 0.4 0.55 0.55 1.40
So I want to plot x = B, y = C for each fixed value of A and E. Basically, for E = 1.25 I want a series of line plots of x = B, y = C, one for each value of A, and then such a plot for each unique value of E.
Could anyone help with this?
You could do a combination of groupby() and seaborn.lineplot():
import matplotlib.pyplot as plt
import seaborn as sns

for e, d in df.groupby('E'):
    fig, ax = plt.subplots()
    sns.lineplot(data=d, x='B', y='C', hue='A', ax=ax)
    ax.set_title(e)

Merging multiple .txt files into a csv

*New to Python.
I'm trying to merge multiple text files into one CSV; example below:
filename.csv
Alpha
0
0.1
0.15
0.2
0.25
0.3
text1.txt
Alpha,Beta
0,10
0.2,20
0.3,30
text2.txt
Alpha,Charlie
0.1,5
0.15,15
text3.txt
Alpha,Delta
0.1,10
0.15,20
0.2,50
0.3,10
Desired output in the csv file: -
filename.csv
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 10
0.15 0 15 20
0.2 20 0 50
0.25 0 0 0
0.3 30 0 10
The code I've been working with (along with other suggestions I received) gives me an answer similar to the one at the bottom of the page:
import glob
import os
import pandas

def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    outputDf = pandas.merge(leftDf, outputDf, how='inner', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
The answer I get, however, is the following instead of the desired result:
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 0
0.1 0 0 10
0.15 0 15 0
0.15 0 0 20
0.2 20 0 0
0.2 0 0 50
0.25 0 0 0
0.3 30 0 0
0.3 0 0 10
IIUC, you can create a list of all DataFrames (dfs), append mergedDf to it inside the loop, and finally concat all DataFrames into one:
import pandas
import glob
import os
def mergeData(indir="dir/path", outdir="dir/path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/path/filename.csv"
        right = filename
        output = "/path/filename.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='right', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    # add missing rows from leftDf (in sample Alpha - 0.25)
    # fill NaN values by 0
    outputDf = pandas.merge(leftDf, outputDf, how='left', on="Alpha", sort=True).fillna(0)
    # columns are converted to int
    outputDf[['Beta', 'Charlie']] = outputDf[['Beta', 'Charlie']].astype(int)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
Alpha Beta Charlie
0 0.00 10 0
1 0.10 0 5
2 0.15 0 15
3 0.20 20 0
4 0.25 0 0
5 0.30 30 0
EDIT:
The problem is that the second merge uses how='inner' where it needs how='left':
def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    # need left join, not inner
    outputDf = pandas.merge(leftDf, outputDf, how='left', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
Alpha Beta Charlie Delta
0 0.00 10.0 0.0 0.0
1 0.10 0.0 5.0 0.0
2 0.10 0.0 0.0 10.0
3 0.15 0.0 15.0 0.0
4 0.15 0.0 0.0 20.0
5 0.20 20.0 0.0 0.0
6 0.20 0.0 0.0 50.0
7 0.25 0.0 0.0 0.0
8 0.30 30.0 0.0 0.0
9 0.30 0.0 0.0 10.0
import pandas as pd
data1 = pd.read_csv('samp1.csv',sep=',')
data2 = pd.read_csv('samp2.csv',sep=',')
data3 = pd.read_csv('samp3.csv',sep=',')
df1 = pd.DataFrame({'Alpha':data1.Alpha})
df2 = pd.DataFrame({'Alpha':data2.Alpha,'Beta':data2.Beta})
df3 = pd.DataFrame({'Alpha':data3.Alpha,'Charlie':data3.Charlie})
mergedDf = pd.merge(df1, df2, how='outer', on ='Alpha',sort=False)
mergedDf1 = pd.merge(mergedDf, df3, how='outer', on ='Alpha',sort=False)
a = pd.DataFrame(mergedDf1)
print(a.drop_duplicates())
output:
Alpha Beta Charlie
0 0.00 10.0 NaN
1 0.10 NaN 5.0
2 0.15 NaN 15.0
3 0.20 20.0 NaN
4 0.25 NaN NaN
5 0.30 30.0 NaN
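For reference, a compact alternative that avoids the repeated merges entirely (a sketch; the question's input files are recreated in memory with io.StringIO): index every file by Alpha, align the value columns with pd.concat(axis=1), then reindex on the Alpha values from filename.csv and fill the gaps with 0:

```python
import io
import pandas as pd

# the sample inputs from the question, recreated as in-memory files
filename_csv = "Alpha\n0\n0.1\n0.15\n0.2\n0.25\n0.3\n"
texts = [
    "Alpha,Beta\n0,10\n0.2,20\n0.3,30\n",
    "Alpha,Charlie\n0.1,5\n0.15,15\n",
    "Alpha,Delta\n0.1,10\n0.15,20\n0.2,50\n0.3,10\n",
]

left = pd.read_csv(io.StringIO(filename_csv), index_col='Alpha')
parts = [pd.read_csv(io.StringIO(t), index_col='Alpha') for t in texts]

# align all value columns on Alpha, keep every Alpha from filename.csv,
# fill missing cells with 0, and restore Alpha as an ordinary column
out = (pd.concat(parts, axis=1)
         .reindex(left.index)
         .fillna(0)
         .astype(int)
         .reset_index())
print(out)
```

Because concat aligns on the index, each Alpha value produces exactly one row, which is precisely the deduplication the inner/left merges were struggling with.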

Python Pandas: Create Groups by Range using map

I have a large data set where I am looking to create groups based on cumulative sum percent of the total. I have gotten this to work by using the map function (see the code below). Is there a better way to do this, say, if I wanted to make my groups even more granular? For example, now I'm looking at 5% increments; what if I want to look at 1% increments? I'm wondering if there is another way where I don't have to explicitly enter every range into my codethem function.
def codethem(dl):
    if dl < .05: return '5'
    elif .05 < dl <= .1: return '10'
    elif .1 < dl <= .15: return '15'
    elif .15 < dl <= .2: return '20'
    elif .2 < dl <= .25: return '25'
    elif .25 < dl <= .3: return '30'
    elif .3 < dl <= .35: return '35'
    elif .35 < dl <= .4: return '40'
    elif .4 < dl <= .45: return '45'
    elif .45 < dl <= .5: return '50'
    elif .5 < dl <= .55: return '55'
    elif .55 < dl <= .6: return '60'
    elif .6 < dl <= .65: return '65'
    elif .65 < dl <= .7: return '70'
    elif .7 < dl <= .75: return '75'
    elif .75 < dl <= .8: return '80'
    elif .8 < dl <= .85: return '85'
    elif .85 < dl <= .9: return '90'
    elif .9 < dl <= .95: return '95'
    elif .95 < dl <= 1: return '100'
    else: return 'None'

my_df['code'] = my_df['sales_csum_aspercent'].map(codethem)
Thank you!
There is a special method for that: pd.cut().
Demo:
Create a random DF:
In [393]: df = pd.DataFrame({'a': np.random.rand(10)})
In [394]: df
Out[394]:
a
0 0.860256
1 0.399267
2 0.209185
3 0.773647
4 0.294845
5 0.883161
6 0.985758
7 0.559730
8 0.723033
9 0.126226
we should specify bins when calling pd.cut():
In [404]: np.linspace(0, 1, 11)
Out[404]: array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [395]: pd.cut(df.a, bins=np.linspace(0, 1, 11))
Out[395]:
0 (0.8, 0.9]
1 (0.3, 0.4]
2 (0.2, 0.3]
3 (0.7, 0.8]
4 (0.2, 0.3]
5 (0.8, 0.9]
6 (0.9, 1]
7 (0.5, 0.6]
8 (0.7, 0.8]
9 (0.1, 0.2]
Name: a, dtype: category
Categories (10, object): [(0, 0.1] < (0.1, 0.2] < (0.2, 0.3] < (0.3, 0.4] ... (0.6, 0.7] < (0.7, 0.8] < (0.8, 0.9] < (0.9, 1]]
If we want custom labels, we should specify them explicitly:
In [401]: bins = np.linspace(0,1, 11)
NOTE: bin labels must be one fewer than the number of bin edges
In [402]: labels = (bins[1:]*100).astype(int)
In [412]: labels
Out[412]: array([ 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
In [403]: pd.cut(df.a, bins=bins, labels=labels)
Out[403]:
0 90
1 40
2 30
3 80
4 30
5 90
6 100
7 60
8 80
9 20
Name: a, dtype: category
Categories (10, int64): [10 < 20 < 30 < 40 ... 70 < 80 < 90 < 100]
Let's do it with a 5% step:
In [419]: bins = np.linspace(0, 1, 21)
In [420]: bins
Out[420]: array([ 0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])
In [421]: labels = (bins[1:]*100).astype(int)
In [422]: labels
Out[422]: array([ 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100])
In [423]: pd.cut(df.a, bins=bins, labels=labels)
Out[423]:
0 90
1 40
2 25
3 80
4 30
5 90
6 100
7 60
8 75
9 15
Name: a, dtype: category
Categories (20, int64): [5 < 10 < 15 < 20 ... 85 < 90 < 95 < 100]