I'm looking to do a lookup that is sort of a hybrid between a join and union. I have a large number of records in my main dataset, so I'm looking to do something that wouldn't be a "brute force" method of a many-to-many matrix.
Here is my main dataset, called 'All', which already contains price for each of the products listed.
product date price
apple 1/1/2011 1.05
apple 1/3/2011 1.02
apple 1/4/2011 1.07
pepper 1/2/2011 0.73
pepper 1/3/2011 0.75
pepper 1/6/2011 0.79
My other data dataset ('Prices' - not shown here, but contains the same two keys, product and date) contains prices for all products, on each possible date. The hash table look up I would like to create would essentially look up every date in the 'All' table, and output prices for ALL products for that date, resulting in a table such as this:
product date price
apple 1/1/2011 1.05
pepper 1/1/2011 0.71 *
apple 1/2/2011 1.04 *
pepper 1/2/2011 0.73
apple 1/3/2011 1.02
pepper 1/3/2011 0.75
apple 1/4/2011 1.07
pepper 1/4/2011 0.76 *
apple 1/6/2011 1.10 *
pepper 1/6/2011 0.79
That is, as long as one product has a date and price specified 'All' table, all other products should pull that in from the lookup table as welll.
The asterisks indicate that the price was looked up from the prices table, and new rows containing prices for products were essentially inserted into the new table.
If hash table are not a great way to go about this, please let me know alternative methods.
Well this is far from elegant, but curious if the below gives you the desired result? Since you have multiple records per key in ALL (which I assume you want to maintain), I basically unioned ALL with the records in PRICES that have a date in All, but I added an Except so as to excluded records that were already in ALL. No idea if this makes sense, or is doing what you want. Certainly doesn't qualify as 'elegant'.
data all;
input product $7. date mmddyy10. price;
Y=1;
format date mmddyy10.;
cards;
apple 01/01/2011 1.05
apple 01/01/2011 1.05
apple 01/03/2011 1.02
pepper 01/02/2011 0.73
pepper 01/03/2011 0.75
pepper 01/06/2011 0.79
;
run;
data prices;
input product $7. date mmddyy10. price;
format date mmddyy10.;
cards;
apple 01/01/2011 1.05
apple 01/02/2011 1.04
apple 01/03/2011 1.02
apple 01/04/2011 1.07
apple 01/05/2011 1.01
pepper 01/01/2011 0.70
pepper 01/02/2011 0.73
pepper 01/03/2011 0.75
pepper 01/04/2011 0.76
pepper 01/05/2011 0.77
pepper 01/06/2011 0.79
;
run;
proc sql;
create table want as
select * from all
union corr all
( (select product,date,price from
prices
where date IN (select distinct date from all)
)
except corr
select product,date,price from all
)
;
quit;
Related
I would like to create a matrix visual like below and add data bars as conditional formating to the "Sales Percentage" Column with different user defined max and min values based on the countries.
I have the following dummy data
Salesperson
Country
Product
Sales Percentage
Total Sales
Gina
Canada
City Bike
0.02
232
Gina
Canada
Mountain Bike
0.56
2800
Gina
Italy
City Bike
0.32
213
Gina
Italy
Mountain Bike
0.21
1050
Gina
USA
City Bike
0.11
122
Gina
USA
Mountain Bike
0.43
2150
John
Canada
City Bike
0.32
333
John
Canada
Mountain Bike
0.34
442
John
Italy
City Bike
0.12
2132
John
Italy
Mountain Bike
0.67
1233
John
USA
City Bike
0.22
3300
John
USA
Mountain Bike
0.45
7300
Mary
Canada
City Bike
0.21
121
Mary
Canada
Mountain Bike
0.53
2650
Mary
Italy
City Bike
0.32
213
Mary
Italy
Mountain Bike
0.12
600
Mary
USA
City Bike
0.11
123
Mary
USA
Mountain Bike
0.12
600
The matrix looks like this after showing columns as rows and putting "Sales Percentage" and "Total Sales" as values, Country as columns and Product + Salesperson as rows:
I can add databars when I right click the Sales Percentage under values but I can only enter one user defined min and max value for the whole "Sales Percentage" column. Is it possible to have different maximum value for data bars based on the Country? For example to create a target value of 35% for Canada, 40% for USA and 50% for Italy. So in other words the data bar would be full when the Sales Percentage for Canada reaches 35% and full when Sales Percentage for USA reaches 40% and so on.
This isn't possible with you current setup. The best you could do to approximate this is as follows.
Create a measure as follows:
% Canada = CALCULATE(SUM('Table'[Total Sales]), 'Table'[Country ] = "Canada")
Do the same for USA and Italy and then add them as values to your matrix.
You can now select individual targets for each country.
I need to create a matrix in the following format The total sales and percentage sales below each other:
This is why I have created a table with data like this:
Salesperson
Country
Sales
Product
Format
John
USA
0.45
Mountain Bike
Percentage
John
Canada
0.34
Mountain Bike
Percentage
John
Italy
0.67
Mountain Bike
Percentage
Gina
USA
0.43
Mountain Bike
Percentage
Gina
Canada
0.56
Mountain Bike
Percentage
Gina
Italy
0.21
Mountain Bike
Percentage
Mary
USA
0.12
Mountain Bike
Percentage
Mary
Canada
0.53
Mountain Bike
Percentage
Mary
Italy
0.12
Mountain Bike
Percentage
John
USA
0.22
City Bike
Percentage
John
Canada
0.32
City Bike
Percentage
John
Italy
0.12
City Bike
Percentage
Gina
USA
0.11
City Bike
Percentage
Gina
Canada
0.02
City Bike
Percentage
Gina
Italy
0.32
City Bike
Percentage
Mary
USA
0.11
City Bike
Percentage
Mary
Canada
0.21
City Bike
Percentage
Mary
Italy
0.32
City Bike
Percentage
John
USA
2250
Mountain Bike
Total
John
USA
1700
Mountain Bike
Total
John
USA
3350
Mountain Bike
Total
Gina
USA
2150
Mountain Bike
Total
Gina
Canada
2800
Mountain Bike
Total
Gina
Italy
1050
Mountain Bike
Total
Mary
USA
600
Mountain Bike
Total
Mary
Canada
2650
Mountain Bike
Total
Mary
Italy
600
Mountain Bike
Total
John
USA
1100
City Bike
Total
John
USA
1600
City Bike
Total
John
USA
600
City Bike
Total
...
...
...
...
...
Under Sales column is the total amount and percentage amount of sale and the matrix will filter after the Format column. But since I need to change the format of the percentage to percent, because it's in decimal format, I have created a measure for sales like this:
Sales_all =
VAR variable = SUM ( 'Table'[Sales])
RETURN
SWITCH (
SELECTEDVALUE ( 'Table'[Format]),
"Total", FORMAT ( variable, "General Number" ),
"Percentage", FORMAT ( variable, "Percent" ))
I have two questions. I would like to create a data bar conditional formatting for Percentage:
Is it possible to use different values for max and min of the data bar for each country. Currently when I choose data bars, I can only enter values for the whole column of Sales, disregarding the Countries (Canada, Italy, USA). For example I would like to enter a max value for Canada as 60% and max value for Italy as 25%. If I use the Sales column directly, not as measure, I can only choose one max value for the whole Sales column. The bar for the percentage should be full at 60% for Canada and full at 25% for Italy.
Since I have used a measure to change the format of the values in Sales column based on the Format column, I can't choose data bar under conditional formatting anymore? Why is this the case and how can I change it?
Please keep each post to a single question. Please don't paste data as images and keep the sample data as copiable text.
I don't understand question 1 so you will need to elaborate (ideally in a brand new question with copiable sample data). The reason for question 2 is that FORMAT() returns text and so is no longer a number and can't produce a data bar. Either keep the measure as a number or change the display formatting using calculation groups.
EDIT
You need to reshape your data. In PQ, pivot Format column with value of Sales as follows:
You end up with this (missing data because your sample wasn't complete)
Create a matrix as follows:
Highlight the column or measure for percentage and in the ribbon select percent for the format. This keeps the underlying value as a number but changes the display only.
On the matrix, ensure you have the following option.
You should now have the following:
You can now add data bars to percentage column.
I have a list of observations with a few variables. I need to put them in a bin (below) and only keep one observation in each bin which is closest to the bin's number:
Bins
0.94
0.96
0.98
1.00
1.02
1.04
1.06
Data
Variable Price Value_to_bin Closest bin
a 0.630527682 0.935 0.94
b 0.441296291 0.979 0.98
c 0.350173415 0.969
d 0.920932417 0.993
e 0.361863025 0.959 0.96
f 0.027205755 1.003 1
g 0.878286791 1.045
h 0.206434946 0.971
i 0.259272294 1.021 1.02
j 0.081774863 0.982
k 0.01146324 0.992
l 0.283027273 1.037 1.04
m 0.188747537 0.993
n 0.554786 1.064 1.06
o 0.784774 1.065
And then just keep the ones that are closest to the bin value (i.e. delete the ones that have blanks in the 'closest_bin' variable.
I tried to use proc rank but I can't get rid of the rest or match with the bin (something like 'closest' doesn't exist as far as I know).
SAS SQL with automatic remerging can perform the query quite succinctly. The consistent binning to a 0.02 level allows the ROUND function to compute the bin values to the nearest 0.02 unit.
proc sql;
create table want as
select
var,
price,
value,
round(value,0.02) as valbin_02
from have
group by valbin_02
having abs(valbin_02-value) = min(abs(valbin_02-value))
;
I have many dataset from files to be merged and arranged in one single output file. Here is the example of any two datasets to be merged accordingly.
Data 1 from File 1:
9.00 2.80 13.08 12.78 0.73
10.00 -3.44 19.30 18.99 0.14
12.00 2.60 20.28 20.12 0.39
Data 2 from File 2:
2.00 -7.73 20.04 18.49 0.62
5.00 -4.82 17.07 16.38 0.59
6.00 -2.69 12.55 12.25 0.50
8.00 -3.85 18.06 17.64 0.94
9.00 -3.59 16.13 15.73 0.64
Expected output in one file:
9.00 2.80 13.08 12.78 0.73
10.00 -3.44 19.30 18.99 0.14
12.00 2.60 20.28 20.12 0.39
2.00 -7.73 20.04 18.49 0.62
5.00 -4.82 17.07 16.38 0.59
6.00 -2.69 12.55 12.25 0.50
8.00 -3.85 18.06 17.64 0.94
9.00 -3.59 16.13 15.73 0.64
Temporarily the script i used using Python loop for is like this:
import numpy as np
import glob
path='./13-stat-plot-extreme-combine/'
files=glob.glob(path+'13-stat*.dat')
for x in range(len(files)):
file1=files[x]
data1=np.loadtxt(file1)
np.savetxt("Combine-Stats.dat",data1,fmt='%9.2f')
The problem is only one dataset is saved on that new file. Question how to use concatenate to such case at different axis dataset?
Like this:
arrays = [np.loadtxt(name) for name in files]
combined = np.concatenate(arrays)
I have 2 datasets, one large and one small. While functionality is most important here, efficiency matters as well, because the large dataset could be potentially tens of millions of records. Let's call my large dataset "Transactions" and my small one "Prices." Here's what I'm trying to do. In the transactions file, there are a bunch of values for 'Store'. For each 'Store', I would like to create a hash table of relevant 'Product', and for each unique 'Saledate', pull all product 'Price' into the output table, not just for the associated 'Product' but for ALL 'Products' found in the Transactions dataset for that 'Store'.
Here's a sample of the "Transactions" dataset:
Store Product SaleDate Price
A apple 1/1/2011 1.05
A apple 1/3/2011 1.02
A apple 1/4/2011 1.07
A pepper 1/2/2011 0.73
A pepper 1/3/2011 0.75
A pepper 1/6/2011 0.79
And here's a sample of the "Prices" dataset:
Product Saledate Price
apple 1/1/2011 1.05
apple 1/2/2011 1.06
apple 1/3/2011 1.02
apple 1/4/2011 1.07
...
pepper 1/1/2011 0.74
pepper 1/2/2011 0.73
pepper 1/3/2011 0.75
pepper 1/4/2011 0.75
pepper 1/5/2011 0.75
pepper 1/6/2011 0.79
...
mango 1/1/2011 2.40
mango 1/2/2011 2.42
...
And here's the desired output for this scenario(the asterisk denotes that the price was inserted from "Prices" into "Transactions" via lookup).
Store Product Saledate Price
A apple 1/1/2011 1.05
A pepper 1/1/2011 0.74 *
A apple 1/2/2011 1.06 *
A pepper 1/2/2011 0.73
A apple 1/3/2011 1.02
A pepper 1/3/2011 0.75
A apple 1/4/2011 1.07
A pepper 1/4/2011 0.75 *
A apple 1/6/2011 1.10 *
A pepper 1/6/2011 0.79
Basically, in pseudo-code, the idea is do loop through each store, look-up its list of distinct products and saledates, then find corresponding prices for those products and saledates in the prices data and insert them if they are not already there. In the example, since apple has a saledate of 1/1 in "Transactions", I need the saledate for pepper as well (from "Prices"). The same applies for 1/2, except vice-versa since the price is already there for pepper, but not for apple. 1/5 has a record in "Prices", but it's not needed, since it doesn't occur for either apple/pepper in "Transactions." For another store, different products exist, so pepper is not relevant at all, but mango is.
I've tried several approaches and can't get around the "double" look-up hangup that I think is necessary.
Here is a link to a previous question/answer related to this new question. The answer provides sample code to create the dummy tables. https://stackoverflow.com/a/36961795/214994
I think you can accomplish what you are looking for with a series of SQL statements.
data trans;
format Store $1. Product $6. SaleDate mmddyy10. price best.;
informat SaleDate mmddyy10.;
input Store $ Product $ SaleDate price;
datalines;
A apple 1/1/2011 1.05
A apple 1/3/2011 1.02
A apple 1/4/2011 1.07
A pepper 1/2/2011 0.73
A pepper 1/3/2011 0.75
A pepper 1/6/2011 0.79
B apple 1/1/2011 1.05
B apple 1/3/2011 1.02
B apple 1/4/2011 1.07
B mango 1/2/2011 2.42
B mango 1/3/2011 2.43
B mango 1/6/2011 2.46
;
data prices;
format Product $6. SaleDate mmddyy10. price best.;
informat SaleDate mmddyy10.;
input Product $ SaleDate price;
datalines;
apple 1/1/2011 1.05
apple 1/2/2011 1.06
apple 1/3/2011 1.02
apple 1/4/2011 1.07
apple 1/5/2011 1.10
apple 1/6/2011 1.15
pepper 1/1/2011 0.74
pepper 1/2/2011 0.73
pepper 1/3/2011 0.75
pepper 1/4/2011 0.75
pepper 1/5/2011 0.75
pepper 1/6/2011 0.79
mango 1/1/2011 2.40
mango 1/2/2011 2.42
mango 1/3/2011 2.43
mango 1/4/2011 2.44
mango 1/5/2011 2.45
mango 1/6/2011 2.46
;
proc sql noprint;
/*Store and Date Combinations*/
create table store_dates as
select distinct store, saledate
from trans;
/*Store and Product Combinations*/
create table store_products as
select distinct store, product
from trans;
/*Full Joins the combination tables and look up the price with a left join*/
create table want as
select a.store,
b.product,
a.saledate,
c.price
from store_dates as a
full join
store_products as b
on a.store=b.store
left join
prices as c
on b.product = c.product
and a.saledate = c.saledate
order by a.store, a.saledate, b.product;
quit;
This assumes that there are no differences in prices between TRANS and PRICES. If so, add another left join of TRANS and a coalesce() to handle.