Redshift: Split comma-separated values into different rows

I have a table in which one column holds values separated by commas:
Table_A
State     City
Colorado  Denver
Texas     Dallas, Houston, Austin
Arizona   Phoenix, Flagstaff
Expected_Result
Table_A
State     City
Colorado  Denver
Texas     Dallas
Texas     Houston
Texas     Austin
Arizona   Phoenix
Arizona   Flagstaff
There are easy ways to do this in other SQL dialects, but I can't find anything similar in Redshift. Please help.

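You can use the split_to_array function to turn the comma-separated column into an array, then unnest that array into rows with a join.
with array_data as (
-- split_to_array returns a SUPER array holding the comma-separated pieces
select State, split_to_array(City, ',') as city
from Table_A
)
-- joining against t.city unnests the array, one row per element;
-- the cast and trim drop the space that follows each comma
SELECT t.state, trim(cities::varchar) as city
FROM array_data AS t
LEFT JOIN t.city AS cities ON TRUE;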
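The split-then-unnest idea in the query above can be sketched outside the database as well. The following is a minimal Python illustration, not Redshift code: `table_a` is a hypothetical in-memory copy of Table_A and `split_to_rows` is a helper name made up for this example.

```python
# Hypothetical in-memory copy of Table_A: (state, comma-separated cities).
table_a = [
    ("Colorado", "Denver"),
    ("Texas", "Dallas, Houston, Austin"),
    ("Arizona", "Phoenix, Flagstaff"),
]


def split_to_rows(rows):
    """Return one (state, city) pair per comma-separated city."""
    result = []
    for state, cities in rows:
        for city in cities.split(","):
            # Strip the space that follows each comma in the source data.
            result.append((state, city.strip()))
    return result


for state, city in split_to_rows(table_a):
    print(state, city)
```

Each input row fans out into as many output rows as it has list elements, which is the same effect the join against the array has in the query.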

Related

How to filter a table for distinct values Powerapps

I am new to Power Apps and have noticed that the Distinct function returns a table containing only the distinct column, not the full rows. Is there a way to filter a table so that it returns a subset of the full table with distinct values in a specified column?
You can use the GroupBy function for this. Take a look at the documentation, or at the example below.
Assuming that cities is a table with the following values:
City       Country  Population
London     UK       8615000
Berlin     Germany  3562000
Madrid     Spain    3165000
Rome       Italy    2874000
Paris      France   2273000
Hamburg    Germany  1760000
Barcelona  Spain    1602000
Munich     Germany  1494000
Milan      Italy    1344000
The expression GroupBy(cities, "Country", "Cities") will return a table with a column "Country", and a column called "Cities" whose value will be a table with all cities for that country.
You can then use functions such as AddColumns and Sum to aggregate the values of the inner table, like in the example below:
AddColumns(
GroupBy(cities, "Country", "Cities"),
"Sum of City Populations",
Sum(Cities, Population))
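To make the effect of GroupBy followed by AddColumns and Sum concrete, here is a rough Python equivalent of the same aggregation over the cities table shown earlier. It is only a sketch of the logic, not Power Apps code, and `sum_population_by_country` is a name invented for this illustration.

```python
from collections import defaultdict

# The cities table from the example: (city, country, population).
cities = [
    ("London", "UK", 8615000),
    ("Berlin", "Germany", 3562000),
    ("Madrid", "Spain", 3165000),
    ("Rome", "Italy", 2874000),
    ("Paris", "France", 2273000),
    ("Hamburg", "Germany", 1760000),
    ("Barcelona", "Spain", 1602000),
    ("Munich", "Germany", 1494000),
    ("Milan", "Italy", 1344000),
]


def sum_population_by_country(rows):
    """Group rows by country and add up the population of each group."""
    totals = defaultdict(int)
    for _city, country, population in rows:
        totals[country] += population
    return dict(totals)


for country, total in sum_population_by_country(cities).items():
    print(country, total)
```

The result has one entry per country, which corresponds to one row per group in the Power Apps expression.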
In your tweets example, if you want to get one tweet from each day, you can have an expression like the one below:
AddColumns(
GroupBy(Tweets, "crf1d_date_index", "Dates"),
"SampleTweet",
First(Dates))
This adds a new column holding the first tweet from each date. If you want only a single field from the group, you can use something like this:
AddColumns(
GroupBy(Tweets, "crf1d_date_index", "Dates"),
"FirstTweetTime",
First(Dates).tweet_time)
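The "one full row per distinct value" pattern that GroupBy plus First produces can be sketched in Python as follows. The column names `crf1d_date_index` and `tweet_time` come from the expressions above; the sample rows, the `text` field, and the helper `first_row_per_key` are hypothetical and exist only for this illustration.

```python
# Hypothetical sample rows; only the column names come from the example above.
tweets = [
    {"crf1d_date_index": 1, "tweet_time": "09:15", "text": "first of day 1"},
    {"crf1d_date_index": 1, "tweet_time": "17:40", "text": "second of day 1"},
    {"crf1d_date_index": 2, "tweet_time": "08:05", "text": "first of day 2"},
]


def first_row_per_key(rows, key):
    """Keep the first full row seen for each distinct value of `key`."""
    seen = {}
    for row in rows:
        # setdefault stores the row only if this key value is new.
        seen.setdefault(row[key], row)
    return list(seen.values())


for row in first_row_per_key(tweets, "crf1d_date_index"):
    print(row["crf1d_date_index"], row["tweet_time"], row["text"])
```

Unlike Distinct, this keeps every column of the chosen row, which is what the original question asked for.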

Match-Merging Causing Concatenated Data

I am trying to match-merge two data sets by the variable "country". Both data sets contain the country variable (in one it is called "name", which I rename to country) along with other variables, and one data set (data1) also contains continent information. However, SAS just concatenates the data sets, that is, it stacks them on top of one another.
I have tried the basics: sorting both data sets by the same BY variable and making sure to use the BY statement when merging.
proc sort data=data1;
by name;
run;
proc sort data=data2;
by country;
run;
data merged_data;
length continent $ 20 country $ 200;
merge data1(rename=(name=country)) data2;
by country;
run;
The result of this code is the data sets just being stacked on top of one another. My goal is to attach the continent to the country, i.e. to identify the continent of each country.
data1:
Continent  Name
Asia       China
Australia  New Zealand
Europe     France
data2:
Country      Var  City
China        1.2  Beijing, China
New Zealand  3.5  Auckland, New Zealand
France       2.8  Paris, France
data I want:
Country      Var  City                   Continent
China        1.2  Beijing, China         Asia
New Zealand  3.5  Auckland, New Zealand  Australia
France       2.8  Paris, France          Europe
data I get:
Country      Var  City                   Continent
China        1.2  Beijing, China
New Zealand  3.5  Auckland, New Zealand
France       2.8  Paris, France
China                                    Asia
New Zealand                              Australia
France                                   Europe
With my example data, your logic works for me. Maybe your error has to do with your LENGTH statement.
data Df1;
* Country occupies columns 1-18, Temp starts at column 19;
input Country $1-18 @19 Temp;
datalines;
United States     87
Canada            68
Mexico            88
Russia            77
China             55
;
run;
data Df2;
input name $1-18 @19 season $;
datalines;
United States     Summer
Canada            Summer
Mexico            Summer
Russia            Winter
China             Winter
;
run;
proc sort data=Df1;
by Country;
run;
proc sort data=Df2;
by Name;
run;
data Merged_data;
* rename name to country so both data sets share the BY variable;
merge Df1 Df2(rename=(name=country));
by country;
run;
Make sure the values of the variables are what you think they are. Print the values using the $QUOTE. format and look at the results in a fixed-width font.
Perhaps one data set has the actual values you see and the other has a code that a format decodes to the values you see.
If it is not an issue of formatted value versus actual value, then perhaps the records in DATA2 have leading spaces.
This program produces the result you are showing. If you remove the leading spaces from COUNTRY in DATA2, then the merge works as expected.
data data1 ;
input Continent $13. Name $15.;
cards;
Asia         China
Australia    New Zealand
Europe       France
;
data data2;
input Country $15. Var City $25.;
country=' '||country;
cards;
China          1.2 Beijing, China
New Zealand    3.5 Auckland, New Zealand
France         2.8 Paris, France
;
proc sort data=data1; by name; run;
proc sort data=data2; by country; run;
data want ;
merge data2 data1(rename=(name=country)) ;
by country;
run;
Results:
Obs  Country      Var  City                   Continent
1    China        1.2  Beijing, China
2    France       2.8  Paris, France
3    New Zealand  3.5  Auckland, New Zealand
4    China         .                          Asia
5    France        .                          Europe
6    New Zealand   .                          Australia
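The effect of a leading space on the key match can be shown with a small Python sketch. This is an illustration of the matching problem only: it does a simple lookup rather than a full SAS match-merge, and `attach_continent` and its `clean` flag are names invented for this example.

```python
# data1 as a lookup from country name to continent.
continents = {"China": "Asia", "New Zealand": "Australia", "France": "Europe"}

# data2 with a leading space in the key, as in the program above.
data2 = [
    (" China", 1.2, "Beijing, China"),
    (" New Zealand", 3.5, "Auckland, New Zealand"),
    (" France", 2.8, "Paris, France"),
]


def attach_continent(rows, lookup, clean=False):
    """Look up the continent for each row, optionally stripping the key first."""
    merged = []
    for country, var, city in rows:
        key = country.strip() if clean else country
        # lookup.get returns None when the key has no exact match.
        merged.append((key, var, city, lookup.get(key)))
    return merged


print(attach_continent(data2, continents))              # no key matches
print(attach_continent(data2, continents, clean=True))  # every key matches
```

With the raw keys nothing matches, because " China" and "China" are different strings; stripping the key first lets every row find its continent.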

Splitting a column into two based on conditions in PROC SQL, SAS

I want to split the Airlines column into two groups and then add up each group's amount for every client:
Group 1 = Air India and Jet Airways
Group 2 = Others
Loc  Client_Name  Airlines            Amount
BBI  A_1ABC2      Air India           41302
BBI  A 1ABC2      Air India           41302
MAA  Th 1ABC2     Spice Jet Airlines  288713
HYD  Ma 1ABC2     Jet Airways         365667
BOM  Vi 1ABC2     Air India           552506
Something like this:
Rank  Client_name  Group1   Group2   Total
1     Ca 1ABC2     5266269  7040320  12306589
2     Ve 1ABC2     2815593  2675886  5491479
3     Ma 1ABC2     1286686  437843   1724529
4     Th 1ABC2     723268   701712   1424980
5     Ec 1ABC2     113517   627734   741251
6     A 1ABC2      152804   439381   592185
I grouped it first, but I am confused about how to split:
Data assign6.Airlines_grouping1;
Set assign6.Airlines_grouping;
if Scan(Airlines,1) IN ('Air','Jet') then Group = "Group1";
else
if Scan(Airlines,1) Not in('Air','Jet') then Group = "Group2";
Run;
You are categorizing a row based on the first word of the airline.
Proc TRANSPOSE with an ID statement is one common way to reshape data so that a categorical value becomes a column. A second way is to bypass the categorization and use a data step to produce the new shape of data directly.
Here is an example of the second way: create new columns group1 and group2 and set their values based on the airline criteria.
data airlines_group_amounts;
set airlines;
if scan (airlines,1) in ('Air', 'Jet') then
group1 = amount;
else
group2 = amount;
run;
Then summarize over client:
proc sql;
create table want as
select
client_name
, sum(group1) as group1
, sum(group2) as group2
, sum(amount) as total
from airlines_group_amounts
group by client_name
;
quit;
You can avoid the two steps and do all of the processing in a single query, or you can do the summarization with Proc MEANS.
Here is the single-query way.
proc sql;
create table want as
select
client_name
, sum(case when scan (airlines,1) in ('Air', 'Jet') then amount else 0 end) as group1
, sum(case when scan (airlines,1) in ('Air', 'Jet') then 0 else amount end) as group2
, sum(amount) as total
from airlines
group by client_name
;
quit;
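The conditional sums in the single query can be sketched in Python to show what each client row ends up containing. The rows are the sample rows from the question; `summarize` is a hypothetical helper, and splitting on whitespace only approximates SAS scan(airlines,1), which also treats other characters as delimiters.

```python
# Sample rows from the question: (client_name, airlines, amount).
airlines = [
    ("A_1ABC2", "Air India", 41302),
    ("A 1ABC2", "Air India", 41302),
    ("Th 1ABC2", "Spice Jet Airlines", 288713),
    ("Ma 1ABC2", "Jet Airways", 365667),
    ("Vi 1ABC2", "Air India", 552506),
]


def summarize(rows):
    """Return {client: (group1, group2, total)} using the first-word rule."""
    totals = {}
    for client, airline, amount in rows:
        group1, group2, total = totals.get(client, (0, 0, 0))
        # Approximation of scan(airlines, 1): first whitespace-separated word.
        if airline.split()[0] in ("Air", "Jet"):
            group1 += amount
        else:
            group2 += amount
        totals[client] = (group1, group2, total + amount)
    return totals


for client, (group1, group2, total) in summarize(airlines).items():
    print(client, group1, group2, total)
```

Note that "Spice Jet Airlines" falls into group 2 under this rule, because its first word is "Spice", not "Jet".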

SQL Dev: Updating column from another table's column with where statement

I have the following 2 table examples (large databases with many more columns)
table1
Dirty1   code
Ne yok   553
Bufflo   5767
Ne yok   -345
Tchicgo  -35
Albunny  543
Dtroit   -443
Bufflo   -4534
Matatan  -45
Ne yok   -345
table2
Dirty2     Standardized
Manhatahn  Manhattan
Ne yok     New York
Matatan    Manhattan
Brocklyn   Brooklyn
Albunny    Albany
Bufflo     Buffalo
Baffalow   Buffalo
I want to update table1 with the standardized city name from table2 where table1.dirty1 = table2.dirty2 and code < 0,
so the output should look like the following:
output table1
Dirty1     code
Ne yok     553
Bufflo     5767
New York   -345
Tchicgo    -35
Albunny    543
Dtroit     -443
Buffalo    -4534
Manhattan  -45
New York   -345
I also want to make sure that any rows without a standardized form in table2 are skipped (for example, Dtroit and Tchicgo).
Update, for Oracle:
UPDATE table1
SET table1.Dirty1 = (SELECT table2.Standardized FROM table2
WHERE table1.Dirty1 = table2.Dirty2)
WHERE table1.code < 0
AND EXISTS (SELECT table2.Standardized FROM table2
WHERE table1.Dirty1 = table2.Dirty2);
Note: I'm not using Oracle and haven't tested this, but it should work.
This should do the trick in MS SQL Server:
UPDATE t1
SET t1.Dirty1 = t2.Standardized
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.Dirty1 = t2.Dirty2
WHERE t1.code < 0;
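The correlated-subquery form of the update can be tried out without a database server. The sketch below runs it against an in-memory SQLite database from Python; SQLite is used here only as a convenient stand-in, so it says nothing about dialect-specific behavior in Oracle or SQL Server. The table contents are the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (Dirty1 TEXT, code INTEGER);
CREATE TABLE table2 (Dirty2 TEXT, Standardized TEXT);
INSERT INTO table1 VALUES
  ('Ne yok', 553), ('Bufflo', 5767), ('Ne yok', -345), ('Tchicgo', -35),
  ('Albunny', 543), ('Dtroit', -443), ('Bufflo', -4534), ('Matatan', -45),
  ('Ne yok', -345);
INSERT INTO table2 VALUES
  ('Manhatahn', 'Manhattan'), ('Ne yok', 'New York'), ('Matatan', 'Manhattan'),
  ('Brocklyn', 'Brooklyn'), ('Albunny', 'Albany'), ('Bufflo', 'Buffalo'),
  ('Baffalow', 'Buffalo');
""")

# Only rows with a negative code AND a match in table2 are touched;
# the EXISTS clause keeps unmatched rows from being set to NULL.
conn.execute("""
UPDATE table1
SET Dirty1 = (SELECT Standardized FROM table2
              WHERE table2.Dirty2 = table1.Dirty1)
WHERE code < 0
  AND EXISTS (SELECT 1 FROM table2
              WHERE table2.Dirty2 = table1.Dirty1)
""")

rows = conn.execute("SELECT Dirty1, code FROM table1 ORDER BY rowid").fetchall()
for dirty1, code in rows:
    print(dirty1, code)
```

Rows such as Dtroit and Tchicgo keep their original values because they fail the EXISTS test, and rows with a positive code are excluded by the first condition.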

How to calculate quantile data for table of frequencies in SAS?

I am interested in dividing my data into thirds, but I only have a summary table of counts by state. Specifically, I have estimated enrollment counts by state, and I would like to calculate which states make up the top third of all enrollments. So the top third should cover a cumulative share of at least .33333...
I have tried various ways of specifying cumulative percentages between .33333 and .40000, but with no success in handling the general case. PROC RANK also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
input state $20. enrollment;
cards;
CALIFORNIA           440233
TEXAS                318921
NEW YORK             224867
FLORIDA              181517
ILLINOIS             162664
PENNSYLVANIA         155958
OHIO                 141083
MICHIGAN             124051
NEW JERSEY           117131
GEORGIA              104351
NORTH CAROLINA       102466
VIRGINIA             93154
MASSACHUSETTS        80688
INDIANA              75784
WASHINGTON           73764
MISSOURI             73083
MARYLAND             73029
WISCONSIN            72443
TENNESSEE            71702
ARIZONA              69662
MINNESOTA            66470
COLORADO             58274
ALABAMA              54453
LOUISIANA            50344
KENTUCKY             49595
CONNECTICUT          47113
SOUTH CAROLINA       46155
OKLAHOMA             43428
OREGON               42039
IOWA                 38229
UTAH                 36476
KANSAS               36469
MISSISSIPPI          33085
ARKANSAS             32533
NEVADA               27545
NEBRASKA             24571
NEW MEXICO           22485
WEST VIRGINIA        21149
IDAHO                20596
NEW HAMPSHIRE        19121
MAINE                18213
HAWAII               16304
RHODE ISLAND         13802
DELAWARE             12025
MONTANA              11661
SOUTH DAKOTA         11111
VERMONT              10082
ALASKA               9770
NORTH DAKOTA         9614
WYOMING              7457
DIST OF COLUMBIA     6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...
You can easily compute percentages using PROC FREQ with a WEIGHT statement and then select the states in the first third using the LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
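The LAG-based rule above flags a row whenever the share accumulated before that row is still below one third, so the row that crosses the threshold is included. Here is a minimal Python sketch of that rule; the counts are hypothetical small numbers chosen to keep the arithmetic easy to follow, and `flag_top_third` is a name made up for this example.

```python
# Hypothetical counts, already sorted in descending order.
counts = [("A", 30), ("B", 25), ("C", 20), ("D", 15), ("E", 10)]


def flag_top_third(rows):
    """Flag each row whose preceding cumulative share is below one third."""
    total = sum(count for _name, count in rows)
    running = 0  # cumulative count BEFORE the current row
    flags = []
    for name, count in rows:
        # Integer form of running / total < 1/3, to avoid float rounding.
        flags.append((name, running * 3 < total))
        running += count
    return flags


for name, in_top_third in flag_top_third(counts):
    print(name, in_top_third)
```

With these numbers A and B are flagged: before B the cumulative share is 30 out of 100, which is still under one third, while before C it is already 55 out of 100.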
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
Then add a statement at the end of your data step:
pctile_flag = put(cum_percent,pctile.);
Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
Note: your first. syntax is incorrect. FIRST. and LAST. variables only work with BY groups. You get the right values in CUM_PERCENT from the sum statement cum_percent + percent_total; on its own.
I am not aware of a PROC that will do this for you.