Right now, I have a view with a mess of common, conditional string replacement and substitutions for an open text field - in this example, regional classification.
(Please ignore the accuracy of geography, I'm just working with historical standard assignments. Also, I know I could speed things up with REPLACE or even just cleaning the RegEx statements for lookback - I'm just asking about the variable/nesting here.)
CREATE OR REPLACE FUNCTION public.region_cleanup(record_region text)
RETURNS text
LANGUAGE sql
STRICT
AS $function$
SELECT REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(record_region,'(NORTH AMERICA\s\-\sUSA\s\-\sUSA)','USA')
,'Rest\sof\sthe\sWorld\s\-\s','')
,'NORTH\sAMERICA\s\-\sCANADA','NORTH AMERICA - Canada')
,'\&amp\;','&')
,'Georgia\s\-\sGeorgia','MIDDLE EAST - Georgia')
,'EUROPE - Turkey','MIDDLE EAST - Turkey')
$function$;
A sample output using this function would look like this in my dataset, pulling out records impacted (some are already in the correct format):
| record_region_input | record_region_output |
| --- | --- |
| NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro | USA - NORTHEAST - Massachusetts - Boston Metro |
| NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia | USA - MIDATLANTIC - Virginia |
| Rest of the World - ASIA - Thailand | ASIA - Thailand |
| Rest of the World - EUROPE - Portugal | EUROPE - Portugal |
| Rest of the World - ASIA - China - Shanghai Metro | ASIA - China - Shanghai Metro |
| Georgia - Georgia | MIDDLE EAST - Georgia |
This is... fine. Regex is needed since there's tons of variability on what may come before or after these strings, and I have a proper validation list elsewhere. This is just a bulk scrub of common historical naming issues.
The problem comes when I have 100+ of these "known substitutions" for things like company naming or cross-department standards. Having dozens and dozens of nested REGEXP_REPLACE( statements makes editing, adding, or dropping anything a maddening game of counting parentheses.
I'm trying to clean data within Postgres exclusively, since my current pipeline doesn't always allow for standardization prior to upload. I know how I'd tackle this cleanly outside of pure SQL, but in a 'vanilla' PostgreSQL instance (v12+) is there a better method for transforming strings for a view?
Updated with a sample input/output table using the example function.
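For comparison, the counting problem goes away once the substitutions are kept as an ordered list and applied in a loop. Here is a minimal Python sketch of that idea, outside Postgres. The names RULES and region_cleanup are only illustrative, and the list holds a few of the patterns from the function above. In Postgres the equivalent list would presumably have to live in a lookup table, which this sketch does not attempt to show.

```python
import re

# Ordered (pattern, replacement) pairs. Order matters here,
# just as the nesting order matters in the SQL version.
RULES = [
    (r"NORTH AMERICA\s-\sUSA\s-\sUSA", "USA"),
    (r"Rest\sof\sthe\sWorld\s-\s", ""),
    (r"NORTH\sAMERICA\s-\sCANADA", "NORTH AMERICA - Canada"),
    (r"Georgia\s-\sGeorgia", "MIDDLE EAST - Georgia"),
    (r"EUROPE - Turkey", "MIDDLE EAST - Turkey"),
]


def region_cleanup(record_region):
    """Apply every rule in turn to one region string."""
    for pattern, replacement in RULES:
        record_region = re.sub(pattern, replacement, record_region)
    return record_region


print(region_cleanup("Rest of the World - ASIA - Thailand"))
# ASIA - Thailand
```

Adding or dropping a substitution is then a one-line change to the list rather than a change to the nesting.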
If you split the string into its individual regions, replacing regions may become easier. For example:
with tb as (
select 1 as id, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro' as record_region_input
union all
select 2 as id, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'
union all
select 3 as id, 'Rest of the World - ASIA - China - Shanghai Metro'
)
select * from (
select distinct tb.id, unnest(string_to_array(record_region_input, ' - ')) as region from tb
order by tb.id
) a1 where a1.region not in ('NORTH AMERICA', 'Rest of the World');
-- Result:
1 Boston Metro
1 Massachusetts
1 NORTHEAST
1 USA
2 MIDATLANTIC
2 USA
2 Virginia
3 ASIA
3 China
3 Shanghai Metro
After that, you can use distinct to drop duplicated regions, NOT IN to drop unnecessary regions, like '%ASIA%' to get all regions that contain ASIA, and so on. Once all the processing is done, you can merge the corrected parts back into one string. Example:
with tb as (
select 1 as id, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro' as record_region_input
union all
select 2 as id, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'
union all
select 3 as id, 'Rest of the World - ASIA - China - Shanghai Metro'
)
select a1.id, string_agg(a1.region, ' - ') from (
select distinct tb.id, unnest(string_to_array(record_region_input, ' - ')) as region from tb
order by tb.id
) a1 where a1.region not in ('NORTH AMERICA', 'Rest of the World')
group by a1.id
-- Return:
1 Boston Metro - Massachusetts - NORTHEAST - USA
2 MIDATLANTIC - USA - Virginia
3 ASIA - China - Shanghai Metro
This is a simple idea, but maybe it helps you replace regions.
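One thing to watch in the result above: the merged parts come back sorted (Boston Metro first, USA last) rather than in the original hierarchy order. As a rough illustration of the same split, filter, de-duplicate, and merge steps with the order kept, here is a small Python sketch. clean_region and its drop parameter are made-up names, and it assumes that a repeated part anywhere in the string is always unwanted.

```python
def clean_region(record_region, drop=("NORTH AMERICA", "Rest of the World")):
    """Split on ' - ', drop unwanted and repeated parts, and rejoin in order."""
    seen = set()
    kept = []
    for part in record_region.split(" - "):
        if part in drop or part in seen:
            continue  # unwanted prefix or a duplicate such as the second USA
        seen.add(part)
        kept.append(part)
    return " - ".join(kept)


print(clean_region("NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro"))
# USA - NORTHEAST - Massachusetts - Boston Metro
```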
In a table of solicitations inside a Power BI project, I have multiple fields that reference the same users table.
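| solicitation_id | category | region | curr_user_responsible | user_solver | solver_data |
| --- | --- | --- | --- | --- | --- |
| 1 | A | south | Thomas | | |
| 2 | C | north | Maria | | |
| 3 | A | south | | Maria | 2022-10-17 |
| 4 | A | east | Maria | | |
| 5 | B | west | Joseph | | |
| 6 | C | south | Maria | | |
| 7 | C | west | | Thomas | 2022-10-12 |
| 8 | B | south | Maria | | |
| 9 | B | east | | Joseph | 2022-10-10 |
| 10 | A | north | | Thomas | 2022-10-09 |
| 11 | C | north | Maria | | |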
I want a single user slicer. However, in some visualizations I want to slice by the "curr_user_responsible" column, and in others by the "user_solver" column.
For example, in one chart I want to show the number of solicitations in each category sliced by curr_user_responsible.
In a second chart, I want to show the number of solicitations in each region sliced by curr_user_responsible.
For these two, I need to slice by curr_user_responsible.
But in another chart I want to show a line counting the number of solutions by date, sliced by user_solver. In this case, I need the slicer to filter by "user_solver".
I don't want to put multiple slicers on the same page. I just want to choose the user once.
In this example, records with curr_user_responsible filled have user_solver empty, and vice versa. But this is not a rule: both fields may be filled at the same time.
P.S. All the users are listed in another table.
My solution was to create a virtual table from the solicitations table with only the solver-related fields, and relate this table to the users table.
solicitation (data from SQL DB)
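| solicitation_id | category | region | curr_user_responsible | user_solver | solver_data |
| --- | --- | --- | --- | --- | --- |
| 1 | A | south | Thomas | | |
| 2 | C | north | Maria | | |
| 3 | A | south | | Maria | 2022-10-17 |
| 4 | A | east | Maria | | |
| 5 | B | west | Joseph | | |
| 6 | C | south | Maria | | |
| 7 | C | west | | Thomas | 2022-10-12 |
| 8 | B | south | Maria | | |
| 9 | B | east | | Joseph | 2022-10-10 |
| 10 | A | north | | Thomas | 2022-10-09 |
| 11 | C | north | Maria | | |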
solicitations solver (data from solicitations table)
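| solicitation_id | user_solver | solver_data |
| --- | --- | --- |
| 1 | | |
| 2 | | |
| 3 | Maria | 2022-10-17 |
| 4 | | |
| 5 | | |
| 6 | | |
| 7 | Thomas | 2022-10-12 |
| 8 | | |
| 9 | Joseph | 2022-10-10 |
| 10 | Thomas | 2022-10-09 |
| 11 | | |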
User table (data from SQL DB)
user
Maria
Thomas
Joseph
Relationships
solicitation[curr_user_responsible] <-> user[user]
solicitations solver[user_solver] <-> user[user]
I read about the USERELATIONSHIP function and understood that it is possible to do this without creating the second table, but I could not get it to work. In any case, my solution was good enough.
I need some help with reshaping some data into groups. The variables are country1 and country2, and samegroup, which indicates if the countries are in the same group (continent). The original data I have is something like this:
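| country1 | country2 | samegroup |
| --- | --- | --- |
| China | Vietnam | 1 |
| France | Italy | 1 |
| Brazil | Argentina | 1 |
| Argentina | Brazil | 1 |
| Australia | US | 0 |
| US | Australia | 0 |
| Vietnam | China | 1 |
| Vietnam | Thailand | 1 |
| Thailand | Vietnam | 1 |
| Italy | France | 1 |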
And I would like the output to be this:
| country | group |
| --- | --- |
| China | 1 |
| Vietnam | 1 |
| Thailand | 1 |
| Italy | 2 |
| France | 2 |
| Brazil | 3 |
| Argentina | 3 |
| Australia | 4 |
| US | 5 |
My first instinct would be to sort the initial data by "samegroup", then reshape (long to wide). But that doesn't quite solve the issue and I'm not sure how to continue from there. Any help would be greatly appreciated!
Unless you have a non-standard definition of continent, it is much easier to use kountry (which you will probably have to install) than reshape or repeated merges:
clear
input str12 country1 str12 country2 byte samegroup
China Vietnam 1
France Italy 1
Brazil Argentina 1
Argentina Brazil 1
Australia US 0
US Australia 0
Vietnam China 1
Vietnam Thailand 1
Thailand Vietnam 1
Italy France 1
end
capture net install dm0038_1
kountry country1, from(other) geo(marc) marker
rename (country1 GEO) (country group)
sort group country
capture ssc install sencode
sencode group, replace // or use recode here
keep country group
duplicates drop
list, clean noobs
label list group
This will produce
. list, clean noobs
country group
China Asia
Thailand Asia
Vietnam Asia
Australia Australasia
France Europe
Italy Europe
US North America
Argentina South America
Brazil South America
. label list group
group:
1 Asia
2 Australasia
3 Europe
4 North America
5 South America
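If the continent definition is non-standard and the groups have to be derived from the samegroup flags themselves, one option is to treat every samegroup = 1 row as a link between two countries and number the connected sets. Below is a rough Python sketch of that approach (not Stata, and not what kountry does). assign_groups is a made-up helper name, and the sketch assumes that a country which never appears in a samegroup = 1 row gets a group of its own.

```python
def assign_groups(pairs):
    """Number countries by connected set, in order of first appearance."""
    parent = {}

    def find(x):
        # Follow parent links to the representative of x's set.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for country1, country2, samegroup in pairs:
        root1, root2 = find(country1), find(country2)
        if samegroup == 1 and root1 != root2:
            parent[root2] = root1  # merge the two sets

    group_ids = {}
    result = {}
    for country in list(parent):
        root = find(country)
        if root not in group_ids:
            group_ids[root] = len(group_ids) + 1
        result[country] = group_ids[root]
    return result


pairs = [
    ("China", "Vietnam", 1), ("France", "Italy", 1), ("Brazil", "Argentina", 1),
    ("Argentina", "Brazil", 1), ("Australia", "US", 0), ("US", "Australia", 0),
    ("Vietnam", "China", 1), ("Vietnam", "Thailand", 1), ("Thailand", "Vietnam", 1),
    ("Italy", "France", 1),
]
groups = assign_groups(pairs)
for country, group in groups.items():
    print(country, group)
```

On the sample data this gives China, Vietnam, and Thailand group 1, France and Italy group 2, Brazil and Argentina group 3, Australia group 4, and US group 5, which matches the output requested in the question.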
I have a dashboard in Power BI in which I want to group the countries by their continent name on a bar chart.
Currently, when I do it, I get the below:
Expected output:
Any idea how I can achieve this?
This is my data:
| Continet | Country | TotalSales |
| --- | --- | --- |
| Africa | Ghana | 7612491.751 |
| Africa | Nigeria | 14124361.42 |
| Africa | South Africa | 5112305.914 |
| Asia | China | 17817372.96 |
| Asia | India | 7641389.641 |
| Australia/Oceania | Australia | 12740363.52 |
| Europe | France | 15415410.76 |
| Europe | Germany | 12750071.97 |
| Europe | Turkey | 6382936.304 |
| Europe | United Kingdom | 23096905.81 |
| North America | Canada | 8812713.914 |
| North America | United States | 11517603.12 |
| South America | Brazil | 10218528.38 |
You can put both Continet and Country in the Axis box and drill down but for some reason, Power BI only lets you turn off Concatenate labels on a horizontal bar chart.
I have data like this, along with other columns, in a pandas df.
Apologies, I haven't figured out how to present the question with code for the dataframe. First post.
Location:
- Tokyo, Japan
- Sacramento, USA
- Mexico City, Mexico
- Mexico City, Mexico
- Colorado Springs, USA
- New York, USA
- Chicago, USA
Does anyone know how I could isolate the country name from the location and create a new column with just the Country Name?
Try this:
In [29]: pd.DataFrame(df.Location.str.split(', ', n=1).tolist(), columns=['City', 'Country'])
Out[29]:
City Country
0 Tokyo Japan
1 Sacramento USA
2 Mexico City Mexico
3 Mexico City Mexico
4 Colorado Springs USA
5 Seoul South Korea
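Since the question asks for a new column rather than a new DataFrame, here is a minimal sketch of that variant. It assumes every Location value has the form "City, Country", and the small df below is made up from a few of the sample rows.

```python
import pandas as pd

df = pd.DataFrame(
    {"Location": ["Tokyo, Japan", "Sacramento, USA", "Mexico City, Mexico", "Colorado Springs, USA"]}
)

# Split once from the right on ", " and keep the last piece as the country.
df["Country"] = df["Location"].str.rsplit(", ", n=1).str[-1]

print(df)
```

Splitting from the right keeps the country intact even if the city part itself contains a comma.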
You can do this without any regular expressions: use String.indexOf(", ") to find the position of the separator in the String, and then use String.substring to cut the String down to just the part you want.
However, a regular expression can also do this easily, but would likely be slower.
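The same find-the-separator-then-cut idea can be sketched in Python, where str.find and slicing stand in for String.indexOf and String.substring. country_of is a hypothetical helper name, and returning the input unchanged when no separator is found is an assumption, not something the question specifies.

```python
def country_of(location, separator=", "):
    """Return the text after the first separator, or the input if none is found."""
    idx = location.find(separator)  # -1 when the separator is absent
    if idx == -1:
        return location
    return location[idx + len(separator):]


print(country_of("Mexico City, Mexico"))
# Mexico
```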
I have a formatting issue in text that gets passed to a document. A sample companyList is below; it varies each time the script is run.
companyList = ["Apple - Seattle Washington (800) 555-5555", "Microsoft - Tampa Florida (800) 555-1234", "Samsung - Tokyo Japan (01) 555 123-1234"]
Right now the line of code to format this text is:
companyInfo = "\n\n".join(companyList)
and companyInfo outputs like this:
Apple - Seattle Washington (800) 555-5555

Microsoft - Tampa Florida (800) 555-1234

Samsung - Tokyo Japan (01) 555 123-1234
How can I rewrite this to format like this (note tabbed one over per new line):
	Apple - Seattle Washington (800) 555-5555

	Microsoft - Tampa Florida (800) 555-1234

	Samsung - Tokyo Japan (01) 555 123-1234
Many thanks in advance
You can do it like this:
companyInfo = "\n\n".join("\t%s" % x for x in companyList)
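To see what that one-liner produces, here is a small self-contained run using the sample list from the question. The f-string is just an equivalent spelling of "\t%s" % x, and the snake_case variable names are my own.

```python
company_list = [
    "Apple - Seattle Washington (800) 555-5555",
    "Microsoft - Tampa Florida (800) 555-1234",
    "Samsung - Tokyo Japan (01) 555 123-1234",
]

# Prefix each entry with a tab, then separate entries with a blank line.
company_info = "\n\n".join(f"\t{company}" for company in company_list)

print(company_info)
```

Each entry starts with a tab character and the entries are still separated by a blank line, as in the original join.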