Reading text into table format in pandas - regex

I have a table in text form that I want to read into pandas.
I can use \n to separate the rows, but how can I separate the columns? Each row is in the format 2 x text fields, then 6 x numeric.
Is there a method using regex or similar?
table_text = '''Name AIC sector Price (last close) Price (bid) Price (offer) NAV Total assets (£m) Market cap (£m)
3i Infrastructure Plc Infrastructure GBX 296.00 2.96 2.96 254.50 2,268.700 2,638.645
Aberdeen Asian Income Fund Limited Asia Pacific Income GBX 227.50 2.26 2.29 252.51 479.110 399.796
Aberdeen Diversified Income & Growth Ord Flexible Investment GBX 95.20 0.95 0.96 115.34 379.030 294.985
Aberdeen Emerging Markets Investment Company Limited Global Emerging Markets GBX 704.00 6.98 7.10 829.47 391.268 323.595
Aberdeen Japan Investment Trust Plc Japan GBX 712.50 7.00 7.25 784.79 114.957 94.198
Aberdeen Latin American Income Latin America GBX 57.00 0.54 0.57 62.13 40.985 32.555
Aberdeen New Dawn Asia Pacific GBX 322.00 3.22 3.26 365.56 431.544 350.752
Aberdeen New India Investment Trust Plc India GBX 516.00 5.16 5.18 601.47 375.170 301.268
Aberdeen New Thai Investment Trust Plc Country Specialist GBX 445.00 4.40 4.50 516.30 92.585 71.180
Aberdeen Smaller Companies Income Trust UK Smaller Companies GBX 358.00 3.56 3.60 397.45 95.028 79.153
Aberdeen Standard Asia Focus 2025 CULS Asia Pacific Smaller Companies GBX 100.95 1.01 1.01 97.25 391.484 37.026
Aberdeen Standard Asia Focus PLC Asia Pacific Smaller Companies GBX 1,280.00 12.75 13.00 1,440.65 483.841 402.730
Aberdeen Standard Equity Inc Trust plc UK Equity Income GBX 353.00 3.50 3.56 379.60 203.368 170.598
Aberdeen Standard European Logistics Income PLC Property - Europe GBX 116.00 1.15 1.16 117.82 309.808 305.022
Aberforth Smaller Companies Trust Plc UK Smaller Companies GBX 1,496.00 14.94 15.00 1,613.41 1,513.467 1,327.297
Aberforth Split Level Income Trust Plc UK Smaller Companies GBX 80.10 0.80 0.81 91.46 228.143 152.390
Aberforth Split Level Income ZDP 2024 UK Smaller Companies GBX 111.50 1.10 1.13 113.83 227.713 53.032
Acorn Income Fund Ltd UK Equity & Bond Income GBX 351.00 3.46 3.56 415.97 100.206 55.517
Acorn Income Fund ZDP 2022 UK Equity & Bond Income GBX 161.00 1.61 1.61 162.09 34.413 34.182
AEW UK REIT Ord Property - UK Commercial GBX 92.40 0.92 0.92 97.85 194.107 146.384'''
import pandas as pd
df = pd.DataFrame([x.split(';') for x in table_text.split('\n')])
print(df)
Outputs:
0
0 Name AIC sector Price (last close) Price (bid)...
1 3i Infrastructure Plc Infrastructure GBX 296...
2 Aberdeen Asian Income Fund Limited Asia Paci...
3 Aberdeen Diversified Income & Growth Ord Fle...
4 Aberdeen Emerging Markets Investment Company...
5 Aberdeen Japan Investment Trust Plc Japan GB...
6 Aberdeen Latin American Income Latin America...
7 Aberdeen New Dawn Asia Pacific GBX 322.00 3....
8 Aberdeen New India Investment Trust Plc Indi...
9 Aberdeen New Thai Investment Trust Plc Count...
10 Aberdeen Smaller Companies Income Trust UK S...
11 Aberdeen Standard Asia Focus 2025 CULS Asia ...
12 Aberdeen Standard Asia Focus PLC Asia Pacifi...
13 Aberdeen Standard Equity Inc Trust plc UK Eq...
14 Aberdeen Standard European Logistics Income ...
15 Aberforth Smaller Companies Trust Plc UK Sma...
16 Aberforth Split Level Income Trust Plc UK Sm...
17 Aberforth Split Level Income ZDP 2024 UK Sma...
18 Acorn Income Fund Ltd UK Equity & Bond Incom...
19 Acorn Income Fund ZDP 2022 UK Equity & Bond ...
20 AEW UK REIT Ord Property - UK Commercial GBX...
EDIT:
This is my hacky way of doing it. Relies on there being a currency column populated with "GBX" though.
Would welcome any ideas on better ways of doing this?
Is there a regex way of finding three capital letters preceded by a space and with a space then number afterwards? That would find the currency without hardcoding "GBX".
def convert_rows(df):
    sector_name = "GBX"
    for index, row in df.iterrows():
        if sector_name in row[0]:
            name = row[0].split(sector_name)[0]
            numbers = row[0].split(sector_name)[1]
            df.at[index, ['Name']] = name
            df.at[index, ['AIC sector']] = sector_name
            df.at[index,['Price (last close)', 'Price (bid)', 'Price (offer)', 'NAV', 'Total assets (£m)', 'Market cap (£m)']] = numbers.split()
    return df
df = convert_rows(df)

You could try this:
import re
def convert_rows(df):
    for index, row in df.iterrows():
        # Search for the pattern
        sector_name = re.match(r".+\s([A-Z]{3})\s\d+.+", row[0])
        if sector_name:
            sector_name = sector_name.group(1) # GBX for instance
            name = row[0].split(sector_name)[0]
            numbers = row[0].split(sector_name)[1]
            df.at[index, ['Name']] = name
            df.at[index, ['AIC sector']] = sector_name
            df.at[index,['Price (last close)', 'Price (bid)', 'Price (offer)', 'NAV', 'Total assets (£m)', 'Market cap (£m)']] = numbers.split()
    return df
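
If the currency code is always followed by exactly six numbers at the end of the row, a single regex can pull out the currency and the numeric fields in one pass, without hardcoding "GBX". The following is a minimal sketch under that assumption. The names parse_table, ROW_RE and the "Name and sector" column are illustrative, and sample_text is a shortened copy of the question's data. It does not try to split the name from the sector, because the text has no delimiter between them; that step would probably need a list of known sector names.

```python
import re

import pandas as pd

# Shortened sample in the same layout as the question's table_text.
sample_text = """Name AIC sector Price (last close) Price (bid) Price (offer) NAV Total assets (£m) Market cap (£m)
3i Infrastructure Plc Infrastructure GBX 296.00 2.96 2.96 254.50 2,268.700 2,638.645
Aberforth Split Level Income ZDP 2024 UK Smaller Companies GBX 111.50 1.10 1.13 113.83 227.713 53.032"""

NUMERIC_COLS = [
    "Price (last close)", "Price (bid)", "Price (offer)",
    "NAV", "Total assets (£m)", "Market cap (£m)",
]

# Assumption: every data row ends with a three-letter code and exactly six numbers.
ROW_RE = re.compile(
    r"^(?P<text>.+?)\s+(?P<currency>[A-Z]{3})\s+(?P<numbers>(?:[\d,.]+\s*){6})$"
)


def parse_table(text):
    """Return a DataFrame with the text part, the currency and six float columns."""
    records = []
    for line in text.splitlines()[1:]:  # skip the header line
        match = ROW_RE.match(line.strip())
        if match is None:
            continue  # ignore rows that do not fit the assumed layout
        # Drop thousands separators before converting to float.
        values = [float(n.replace(",", "")) for n in match.group("numbers").split()]
        records.append([match.group("text"), match.group("currency"), *values])
    return pd.DataFrame(records, columns=["Name and sector", "Currency", *NUMERIC_COLS])


parsed = parse_table(sample_text)
print(parsed)
```

Anchoring the pattern at the end of the line is what stops "ZDP 2024" in the second row from being mistaken for the currency.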


Power BI: Conditional Formatting Matrix Visual with data bars

I would like to create a matrix visual like the one below and add data bars as conditional formatting to the "Sales Percentage" column, with different user-defined max and min values for each country.
I have the following dummy data
Salesperson  Country  Product        Sales Percentage  Total Sales
Gina         Canada   City Bike      0.02              232
Gina         Canada   Mountain Bike  0.56              2800
Gina         Italy    City Bike      0.32              213
Gina         Italy    Mountain Bike  0.21              1050
Gina         USA      City Bike      0.11              122
Gina         USA      Mountain Bike  0.43              2150
John         Canada   City Bike      0.32              333
John         Canada   Mountain Bike  0.34              442
John         Italy    City Bike      0.12              2132
John         Italy    Mountain Bike  0.67              1233
John         USA      City Bike      0.22              3300
John         USA      Mountain Bike  0.45              7300
Mary         Canada   City Bike      0.21              121
Mary         Canada   Mountain Bike  0.53              2650
Mary         Italy    City Bike      0.32              213
Mary         Italy    Mountain Bike  0.12              600
Mary         USA      City Bike      0.11              123
Mary         USA      Mountain Bike  0.12              600
The matrix looks like this after showing columns as rows and putting "Sales Percentage" and "Total Sales" as values, Country as columns and Product + Salesperson as rows:
I can add data bars when I right-click Sales Percentage under Values, but I can only enter one user-defined min and max value for the whole "Sales Percentage" column. Is it possible to have a different maximum value for the data bars based on the Country? For example, to create a target value of 35% for Canada, 40% for USA and 50% for Italy. In other words, the data bar would be full when the Sales Percentage for Canada reaches 35%, full when the Sales Percentage for USA reaches 40%, and so on.
This isn't possible with your current setup. The best you could do to approximate this is as follows.
Create a measure as follows:
% Canada = CALCULATE(SUM('Table'[Total Sales]), 'Table'[Country ] = "Canada")
Do the same for USA and Italy and then add them as values to your matrix.
You can now select individual targets for each country.
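
To make the arithmetic behind per-country targets concrete: a bar that is full at a country's target corresponds to the ratio of Sales Percentage to that target, capped at 1. The sketch below shows that calculation in Python with pandas rather than DAX, using the targets from the question and three of Gina's Mountain Bike rows. The targets dictionary and the "Bar fill" column are hypothetical names, and this only illustrates the scaling; it is not a Power BI feature.

```python
import pandas as pd

# Targets taken from the question: the bar should be full at these percentages.
targets = {"Canada": 0.35, "USA": 0.40, "Italy": 0.50}

# Three rows copied from the question's sample data.
sales = pd.DataFrame(
    {
        "Salesperson": ["Gina", "Gina", "Gina"],
        "Country": ["Canada", "Italy", "USA"],
        "Product": ["Mountain Bike", "Mountain Bike", "Mountain Bike"],
        "Sales Percentage": [0.56, 0.21, 0.43],
    }
)

# Fraction of the bar that would be filled: value / country target, capped at 1.
sales["Bar fill"] = (
    sales["Sales Percentage"] / sales["Country"].map(targets)
).clip(upper=1.0)

print(sales)
```

Here Canada and USA are already past their targets, so both are capped at a full bar, while Italy at 0.21 against a 0.50 target fills a little under half.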

Power BI: Conditional Formatting Data bars for Matrix Visual

I need to create a matrix in the following format, with the total sales and percentage sales below each other:
This is why I have created a table with data like this:
Salesperson  Country  Sales  Product        Format
John         USA      0.45   Mountain Bike  Percentage
John         Canada   0.34   Mountain Bike  Percentage
John         Italy    0.67   Mountain Bike  Percentage
Gina         USA      0.43   Mountain Bike  Percentage
Gina         Canada   0.56   Mountain Bike  Percentage
Gina         Italy    0.21   Mountain Bike  Percentage
Mary         USA      0.12   Mountain Bike  Percentage
Mary         Canada   0.53   Mountain Bike  Percentage
Mary         Italy    0.12   Mountain Bike  Percentage
John         USA      0.22   City Bike      Percentage
John         Canada   0.32   City Bike      Percentage
John         Italy    0.12   City Bike      Percentage
Gina         USA      0.11   City Bike      Percentage
Gina         Canada   0.02   City Bike      Percentage
Gina         Italy    0.32   City Bike      Percentage
Mary         USA      0.11   City Bike      Percentage
Mary         Canada   0.21   City Bike      Percentage
Mary         Italy    0.32   City Bike      Percentage
John         USA      2250   Mountain Bike  Total
John         USA      1700   Mountain Bike  Total
John         USA      3350   Mountain Bike  Total
Gina         USA      2150   Mountain Bike  Total
Gina         Canada   2800   Mountain Bike  Total
Gina         Italy    1050   Mountain Bike  Total
Mary         USA      600    Mountain Bike  Total
Mary         Canada   2650   Mountain Bike  Total
Mary         Italy    600    Mountain Bike  Total
John         USA      1100   City Bike      Total
John         USA      1600   City Bike      Total
John         USA      600    City Bike      Total
...          ...      ...    ...            ...
The Sales column holds both the total amounts and the percentage amounts, and the matrix filters on the Format column. Because the percentages are stored as decimals and need to be displayed as percentages, I have created a measure for Sales like this:
Sales_all =
VAR variable = SUM ( 'Table'[Sales] )
RETURN
    SWITCH (
        SELECTEDVALUE ( 'Table'[Format] ),
        "Total", FORMAT ( variable, "General Number" ),
        "Percentage", FORMAT ( variable, "Percent" )
    )
I have two questions. I would like to create data bar conditional formatting for Percentage:
1. Is it possible to use different max and min values for the data bar for each country? Currently, when I choose data bars, I can only enter values for the whole Sales column, disregarding the countries (Canada, Italy, USA). For example, I would like to enter a max value of 60% for Canada and a max value of 25% for Italy. If I use the Sales column directly, not as a measure, I can only choose one max value for the whole Sales column. The bar for the percentage should be full at 60% for Canada and full at 25% for Italy.
2. Since I used a measure to change the format of the values in the Sales column based on the Format column, I can no longer choose data bars under conditional formatting. Why is this the case and how can I change it?
Please keep each post to a single question. Please don't paste data as images and keep the sample data as copiable text.
I don't understand question 1 so you will need to elaborate (ideally in a brand new question with copiable sample data). The reason for question 2 is that FORMAT() returns text and so is no longer a number and can't produce a data bar. Either keep the measure as a number or change the display formatting using calculation groups.
EDIT
You need to reshape your data. In Power Query (PQ), pivot the Format column using Sales as the values column, as follows:
You end up with this (missing data because your sample wasn't complete)
Create a matrix as follows:
Highlight the column or measure for percentage and in the ribbon select percent for the format. This keeps the underlying value as a number but changes the display only.
On the matrix, ensure you have the following option.
You should now have the following:
You can now add data bars to percentage column.
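
For readers who want to see the shape of that pivot outside Power Query, here is a rough pandas equivalent. It is only a sketch on four rows taken from the sample data; long_df and wide_df are illustrative names, and the Power Query step itself is done through the UI as described above.

```python
import pandas as pd

# Four rows from the sample: one Percentage row and one Total row per salesperson.
long_df = pd.DataFrame(
    {
        "Salesperson": ["Gina", "Gina", "Mary", "Mary"],
        "Country": ["Canada", "Canada", "Italy", "Italy"],
        "Sales": [0.56, 2800, 0.12, 600],
        "Product": ["Mountain Bike"] * 4,
        "Format": ["Percentage", "Total", "Percentage", "Total"],
    }
)

# Turn each distinct value of Format into its own numeric column.
wide_df = (
    long_df.set_index(["Salesperson", "Country", "Product", "Format"])["Sales"]
    .unstack("Format")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "Format" axis label

print(wide_df)
```

After the reshape, Percentage and Total are separate numeric columns, so each can carry its own display format and its own data bars.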

How do I reshape data by groups? (Stata)

I need some help with reshaping some data into groups. The variables are country1 and country2, and samegroup, which indicates if the countries are in the same group (continent). The original data I have is something like this:
country1   country2   samegroup
China      Vietnam    1
France     Italy      1
Brazil     Argentina  1
Argentina  Brazil     1
Australia  US         0
US         Australia  0
Vietnam    China      1
Vietnam    Thailand   1
Thailand   Vietnam    1
Italy      France     1
And I would like the output to be this:
country    group
China      1
Vietnam    1
Thailand   1
Italy      2
France     2
Brazil     3
Argentina  3
Australia  4
US         5
My first instinct would be to sort the initial data by "samegroup", then reshape (long to wide). But that doesn't quite solve the issue and I'm not sure how to continue from there. Any help would be greatly appreciated!
Unless you have a non-standard definition of continent, it is much easier to use kountry (which you will probably have to install) than reshape or repeated merges:
clear
input str12 country1 str12 country2 byte samegroup
China Vietnam 1
France Italy 1
Brazil Argentina 1
Argentina Brazil 1
Australia US 0
US Australia 0
Vietnam China 1
Vietnam Thailand 1
Thailand Vietnam 1
Italy France 1
end
capture net install dm0038_1
kountry country1, from(other) geo(marc) marker
rename (country1 GEO) (country group)
sort group country
capture ssc install sencode
sencode group, replace // or use recode here
keep country group
duplicates drop
list, clean noobs
label list group
This will produce
. list, clean noobs
country group
China Asia
Thailand Asia
Vietnam Asia
Australia Australasia
France Europe
Italy Europe
US North America
Argentina South America
Brazil South America
. label list group
group:
1 Asia
2 Australasia
3 Europe
4 North America
5 South America
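
If the grouping has to come from the samegroup pairs themselves rather than from a continent lookup, the problem amounts to finding connected components: countries linked by samegroup == 1 share a group. The following is a minimal sketch of that idea in Python (not Stata), using a small union-find. The function name assign_groups and the numbering by first appearance are assumptions made for illustration.

```python
# Pairs copied from the question: (country1, country2, samegroup).
pairs = [
    ("China", "Vietnam", 1), ("France", "Italy", 1), ("Brazil", "Argentina", 1),
    ("Argentina", "Brazil", 1), ("Australia", "US", 0), ("US", "Australia", 0),
    ("Vietnam", "China", 1), ("Vietnam", "Thailand", 1), ("Thailand", "Vietnam", 1),
    ("Italy", "France", 1),
]


def assign_groups(pairs):
    """Map each country to a group id; countries linked by samegroup == 1 share an id."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)  # register unseen countries as their own root
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b, same in pairs:
        root_a, root_b = find(a), find(b)
        if same and root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    group_ids = {}
    result = {}
    for country in list(parent):  # dicts keep insertion order: first appearance
        root = find(country)
        result[country] = group_ids.setdefault(root, len(group_ids) + 1)
    return result


groups = assign_groups(pairs)
for country, group in sorted(groups.items(), key=lambda item: item[1]):
    print(country, group)
```

Countries that only appear in samegroup == 0 pairs, such as Australia and US, each end up in a group of their own.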

Filter with DAX in power BI

I am trying to filter with a DAX measure in Power BI. I have a list of countries, but in my DAX formula I want to return only United Kingdom and France.
Country
United Kingdom
France
Germany
Turkey
South Africa
Ghana
Nigeria
Australia
New Zealand
Fiji
Solomon Islands
Canada
United States
India
Mexico
Brazil
China
My DAX is
ListCountry = CALCULATE(MAX(Orders[Country]),FILTER(Orders,Orders[Country]="France" || Orders[Country] ="United Kingdom"))
When I test it, it returns only United Kingdom,
but what I want to display is
United Kingdom
France
It returns only United Kingdom because you are calculating the maximum value (MAX(Orders[Country])). The filter keeps both France and United Kingdom, and the latter is the maximum because text values are compared alphabetically. Used on its own as a calculated table, the filter returns what you expect:
Table = FILTER(Orders, Orders[Country] = "France" || Orders[Country] = "United Kingdom")
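
The same behaviour can be reproduced outside DAX. The short Python sketch below is only an analogy: Python compares strings by code point, while DAX uses its own collation, but both place "United Kingdom" after "France". The variable names are illustrative.

```python
# A few of the countries from the question, in their original order.
countries = ["United Kingdom", "France", "Germany", "Turkey", "South Africa"]
wanted = {"France", "United Kingdom"}

# Filtering keeps both rows, like FILTER(Orders, ...).
filtered = [country for country in countries if country in wanted]

# Taking the maximum collapses them to the alphabetically last value, like MAX(...).
largest = max(filtered)

print(filtered)
print(largest)
```

The filter step keeps two values; it is the aggregation that reduces them to one.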

Group Bar chart in powerBI

I have a dashboard in Power BI in which I want to group the countries by their continent name in a bar chart.
Currently, when I do it, I get the below:
Expected output:
Any idea how I can achieve this?
This is my data:
Continet Country TotalSales
Africa Ghana 7612491.751
Africa Nigeria 14124361.42
Africa South Africa 5112305.914
Asia China 17817372.96
Asia India 7641389.641
Australia/Oceania Australia 12740363.52
Europe France 15415410.76
Europe Germany 12750071.97
Europe Turkey 6382936.304
Europe United Kingdom 23096905.81
North America Canada 8812713.914
North America United States 11517603.12
South America Brazil 10218528.38
You can put both Continet and Country in the Axis box and drill down but for some reason, Power BI only lets you turn off Concatenate labels on a horizontal bar chart.