How do I group or summarize multiple rows where multiple column values are the same using MSSQL2017 - grouping

I would like to summarize the data in this table to calculate the total NumberOfPallets which have the same content
ie
15 Pallets which contain the following:
Item Colour Packing QtyPerPallet
Item1 Red FOIL 35
Item2 Blue FOIL 110
2 Pallets which contain the following:
Item Colour Packing QtyPerPallet
Item1 Red PLASTIC 35
Item3 Yellow PLASTIC 50
I have no idea where to begin!
CREATE TABLE Orders
(SalesOrder INT NOT NULL,
PalletNo INT NOT NULL,
Item CHAR(20) NOT NULL,
Colour CHAR(10) NOT NULL,
Packing CHAR(10) NOT NULL,
QtyPerPallet INT NOT NULL,
NumberOfPallets INT NOT NULL)
INSERT INTO Orders
(SalesOrder, PalletNo, Item, Colour,Packing, QtyPerPallet, NumberOfPallets)
VALUES
(1, 22, 'ITEM1', 'RED', 'FOIL', 35,5),
(1, 22, 'ITEM2', 'BLUE', 'FOIL',110,5),
(112, 47, 'ITEM2', 'BLUE', 'FOIL',110,10),
(112, 47, 'ITEM1', 'RED', 'FOIL',35,10),
(217,1100, 'ITEM1', 'RED', 'PLASTIC', 35,2),
(217,1100, 'ITEM3', 'YELLOW', 'PLASTIC', 50, 2)

Based on what I understand from your question (it would help to know what you want your final output to look like), let me know if this works:
select
item
,colour
,packing
,QtyPerPallet
,count(distinct palletNo) number_of_pallets
from
orders
group by
item
,colour
,packing
,QtyPerPallet
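The query above counts distinct PalletNo values per item row, whereas the totals in the question (15 and 2) look like sums of NumberOfPallets over pallets whose whole contents match. Here is a minimal sketch of that second reading, in Python rather than T-SQL. The function name summarize_pallets is hypothetical, and it assumes NumberOfPallets is identical on every row of a given (SalesOrder, PalletNo) pair, which holds for the sample data but is not stated in the question.

```python
from collections import defaultdict

# Sample rows from the question:
# (SalesOrder, PalletNo, Item, Colour, Packing, QtyPerPallet, NumberOfPallets)
ROWS = [
    (1, 22, "ITEM1", "RED", "FOIL", 35, 5),
    (1, 22, "ITEM2", "BLUE", "FOIL", 110, 5),
    (112, 47, "ITEM2", "BLUE", "FOIL", 110, 10),
    (112, 47, "ITEM1", "RED", "FOIL", 35, 10),
    (217, 1100, "ITEM1", "RED", "PLASTIC", 35, 2),
    (217, 1100, "ITEM3", "YELLOW", "PLASTIC", 50, 2),
]


def summarize_pallets(rows):
    """Sum NumberOfPallets over pallets that have identical contents."""
    contents = defaultdict(set)  # (SalesOrder, PalletNo) -> set of item lines
    pallet_count = {}            # (SalesOrder, PalletNo) -> NumberOfPallets
    for order, pallet, item, colour, packing, qty, n_pallets in rows:
        key = (order, pallet)
        contents[key].add((item, colour, packing, qty))
        # Assumption: every row of one pallet carries the same NumberOfPallets.
        pallet_count[key] = n_pallets

    totals = defaultdict(int)
    for key, lines in contents.items():
        signature = tuple(sorted(lines))  # order-independent content signature
        totals[signature] += pallet_count[key]
    return dict(totals)


for signature, total in summarize_pallets(ROWS).items():
    print(total, "pallets containing", signature)
```

The same idea in SQL would need a per-pallet content signature (for example an ordered string aggregate of the item lines) before the final GROUP BY.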

Related

Sorting a map of objects by object variables

I have these classes: SortBy, Employee, Company and Database. Database has Database::Sorted method that will use SortBy class and sort depending on added key. Database has a map<string,Company*> m_companies of companies and that is the map I need to sort. I can have one key or multiple keys.
Sorting by using multiple keys (for example first by name and then by age) is not a big problem, however, I'm not sure how I should attack this problem with maps and objects as a beginner to OOP.
Let's say I have:
John, 27, 300, Washington
Jane, 20, 500, Fargo
Anna, 44, 150, Stanford
Kyle, 44, 150, Paris
then:
assert( EqualLists( a.Sorted("Apple", CSortBy()
.AddKey(CSortBy::by_age,true)
.AddKey(CSortBy::by_address,true) ),
list<CEmployee> {
CEmployee("Jane", 20, 500, "Fargo", "Apple"),
CEmployee("John", 27, 300, "Washington", "Apple"),
CEmployee("Kyle", 44, 150, "Paris", "Apple"),
CEmployee("Anna", 44, 150, "Stanford", "Apple")
}
) );
in this case I will sort first by age and then by address, both times ascending.
FULL CODE HERE.
What's the quickest way to do this?
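One common way to handle several sort keys is to rely on a stable sort and apply the keys from least to most significant. Below is a minimal sketch in Python rather than C++. The Employee class and the sort_by_keys helper are hypothetical stand-ins for CEmployee and CSortBy, not the asker's code. In C++ the same idea maps to std::stable_sort, or to one comparator that checks the keys in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    name: str
    age: int
    salary: int
    address: str


def sort_by_keys(employees, keys):
    """Sort by several keys; keys is a list of (attribute, ascending) pairs
    given in priority order (most significant first)."""
    result = list(employees)
    # list.sort is stable, so sorting by the least significant key first
    # and the most significant key last yields the combined ordering.
    for attribute, ascending in reversed(keys):
        result.sort(key=lambda e: getattr(e, attribute), reverse=not ascending)
    return result


staff = [
    Employee("John", 27, 300, "Washington"),
    Employee("Jane", 20, 500, "Fargo"),
    Employee("Anna", 44, 150, "Stanford"),
    Employee("Kyle", 44, 150, "Paris"),
]

ordered = sort_by_keys(staff, [("age", True), ("address", True)])
print([e.name for e in ordered])
```

With the keys above (age ascending, then address ascending), the order matches the list expected by the assert in the question.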

Boxplot by groups, plus a user-defined scatter plot (markers for a subset of values)

Working with lab data, I want to overlay a subset of data points on a boxplot grouped by treatment and sequenced by timepoint. Bringing all elements together is not straightforward in SAS, and requires a clever approach that I can't devise or find myself :)
The beauty of the desired plot is that it displays 2 distinct types of outliers:
The boxplots include statistical outliers - square markers (1.5 IQR)
Then overlay markers for "normal range" outliers - a clinical definition, specific to each lab test.
This is difficult when grouping data (e.g., by treatment) and then blocking or categorizing by another variable (e.g., a timepoint). SAS internally determines the spacing of the boxplots, so this spacing is difficult to mimic for the overlaid normal-range data markers. A generic solution in this direction would be an unreliable kludge.
I've demoed this approach, below, of manually mimicking the group separation for the overlay markers -- just to give an idea of intent. As expected, normal range outliers do not line up with the boxplot groups. Plus, data points that meet both outlier criteria (statistical and clinical) appear as separate points, rather than single points with overlaid markers. My annotations in green:
[Figure: SGPLOT-overlay-fail]
Is there an easy, robust way to instruct SAS to overlay grouped data points on a boxplot, keeping everything aligned as intended?
Here's the code to reproduce that miss:
proc sql;
create table labstruct
( mygroup char(3) label='Treatment Group'
, myvisitnum num label='Visit number'
, myvisitname char(8) label='Visit name'
, labtestname char(8) label='Name of lab test'
, labseed num label='Lab measurement seed'
, lablow num label='Low end of normal range'
, labhigh num label='High end of normal range'
)
;
insert into labstruct
values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
;
quit;
data labdata;
set labstruct;
* Put normal range outliers on 2nd axis, manually separate groups on 2nd axis *;
select (mygroup);
when ('A') scatternum = myvisitnum - 1;
when ('B') scatternum = myvisitnum + 1;
otherwise;
end;
* Make more obs from the seeds above *;
label labvalue = 'Lab measurement';
do repeat = 1 to 20;
labvalue = labseed + 6*rannor(3297);
* Scatter plot ONLY normal range outliers *;
if labvalue < lablow or labvalue > labhigh
then scattervalue = labvalue;
else scattervalue = .;
output;
end;
drop repeat labseed;
run;
proc sgplot data=labdata;
block x=myvisitnum block=myvisitname /
nofill
lineattrs=(color=lightgray);
vbox labvalue /
category=myvisitnum
group=mygroup
outlierattrs=(symbol=square);
scatter x=scatternum y=scattervalue /
group=mygroup
x2axis
jitter;
x2axis display=none;
keylegend / position=bottom type=marker;
run;
I think there is a solution here, but I'm not sure how general it is. As written, it only works for a two-group boxplot.
The issue right now is that a scatter plot's axis is linear by default, while a boxplot's is discrete. That will always be messy, though in theory you could work out the exact difference and plot it. You could also use the annotate facility, but it has the same problem.
However, if you set the scatter plot to use a discrete axis, you can use the discreteoffset option to make things line up properly, more or less. Unfortunately, there's no way to use group on the scatter plot to tell SAS to place each marker on the matching boxplot, so by default everything ends up in the center of the discrete axis. You therefore need two separate scatter plots, one for A with a negative offset and one for B with a positive offset.
The advantage of discreteoffset is that it should be a constant value for any two-group boxplot, unless you alter the box widths. No matter how big the actual plot is, the discreteoffset amount should be the same, as it's a percentage of the total width of the block assigned to that value.
Some alternatives to consider: have six elements in your boxplot instead of three (get rid of group and use six different visit values, a_1, b_1, etc.), which would guarantee that each box is centered on its discrete axis value, so the scatter plot would need a discreteoffset of 0. You could also roll your own boxplot: calculate your own IQR, for example, use high-low plots to draw the boxes, draw the whiskers via annotation, and then scatter-plot all of the outliers (not just your 'normal' ones).
Here's the code that seems to work for your specific example, and hopefully would work for most similar cases (with two boxes). Three boxes are probably easy as well (one has a 0 offset, the other two are probably around +/- 0.25). Beyond that you have to do more calculation to figure out where the boxes will be, but SAS is generally good at spacing them equally, so it'll usually be fairly straightforward.
proc sql;
create table labstruct
( mygroup char(3) label='Treatment Group'
, myvisitnum num label='Visit number'
, myvisitname char(8) label='Visit name'
, labtestname char(8) label='Name of lab test'
, labseed num label='Lab measurement seed'
, lablow num label='Low end of normal range'
, labhigh num label='High end of normal range'
)
;
insert into labstruct
values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
;
quit;
data labdata;
set labstruct;
* Give each group its own scatter x-variable (offsets are applied in PROC SGPLOT) *;
select (mygroup);
when ('A') a_scatternum = myvisitnum; /* Note the separate names now, but no added +/- 1 */
when ('B') b_scatternum = myvisitnum;
otherwise;
end;
* Make more obs from the seeds above *;
label labvalue = 'Lab measurement';
do repeat = 1 to 20;
labvalue = labseed + 6*rannor(3297);
* Scatter plot ONLY normal range outliers *;
if labvalue < lablow or labvalue > labhigh
then scattervalue = labvalue;
else scattervalue = .;
output;
end;
drop repeat labseed;
run;
proc sgplot data=labdata noautolegend; /* suppress auto-legend */
block x=myvisitnum block=myvisitname /
nofill
lineattrs=(color=lightgray);
vbox labvalue /
category=myvisitnum
group=mygroup
outlierattrs=(symbol=square) name="boxplot"; /* Name for keylegend */
scatter x=a_scatternum y=scattervalue / /* Now you have two of these - and no need for an x2axis */
group=mygroup discreteoffset=-0.175
jitter
;
scatter x=b_scatternum y=scattervalue /
group=mygroup discreteoffset=0.175
jitter
;
keylegend "boxplot" / position=bottom type=marker; /* Needed to make a custom keylegend or else you have a mess with three plots in it */
run;
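The +/- 0.175 offsets in the code above are a quarter of 0.7, the VBOX cluster width mentioned further down in this thread. That is what you would get if the group centers were spread evenly across the cluster. Here is a small Python sketch of that arithmetic for any number of groups. The even spacing is an assumption about how SAS lays out clustered boxes, not something confirmed here, and cluster_offsets is a hypothetical helper name.

```python
def cluster_offsets(n_groups, cluster_width=0.7):
    """Center offsets for n_groups boxes spread evenly over cluster_width.

    Assumption: each group gets an equal slot of width
    cluster_width / n_groups, and its box sits in the middle of that slot.
    """
    slot = cluster_width / n_groups
    return [(i - (n_groups - 1) / 2) * slot for i in range(n_groups)]


for n in (2, 3, 4):
    print(n, [round(offset, 3) for offset in cluster_offsets(n)])
```

For three groups this gives a middle offset of 0 and outer offsets a little under the "around +/- 0.25" guessed above, so treat the numbers as starting points to check against the rendered plot.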
Thanks for the insights! I was stuck on the same disconnect between the boxplot's discrete axis and the scatter plot's real axis. It turns out that with SAS 9.4, scatter plots can handle "categories" like the vbox, but SAS refers to this as the x-axis rather than a category. This SAS 9.4 example also helped crack it for me (as soon as I'd given up, naturally :).
This is pretty close, and leaves most processing to SAS (always my preference for a robust solution):
The updated code: The "category" from the VBOX is the "x" for the SCATTER. Note that the default cluster-width for VBOX and SCATTER are different, 0.7 and 0.85, respectively, so I have to explicitly set them to the same value:
proc sql;
create table labstruct
( mygroup char(3) label='Treatment Group'
, myvisitnum num label='Visit number'
, myvisitname char(8) label='Visit name'
, labtestname char(8) label='Name of lab test'
, labseed num label='Lab measurement seed'
, lablow num label='Low end of normal range'
, labhigh num label='High end of normal range'
)
;
insert into labstruct
values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
;
quit;
data labdata;
set labstruct;
* Make more obs from the seeds above *;
label labvalue = 'Lab measurement';
do repeat = 1 to 20;
labvalue = labseed + 6*rannor(3297);
* Scatter plot ONLY normal range outliers *;
if labvalue < lablow or labvalue > labhigh
then scattervalue = labvalue;
else scattervalue = .;
output;
end;
drop repeat labseed;
run;
proc sgplot data=labdata;
block x=myvisitnum block=myvisitname /
nofill
lineattrs=(color=lightgray);
vbox labvalue /
category=myvisitnum
group=mygroup
groupdisplay=cluster
clusterwidth=0.7
outlierattrs=(symbol=square);
scatter x=myvisitnum y=scattervalue /
group=mygroup
groupdisplay=cluster
clusterwidth=0.7
jitter;
keylegend /
position=bottom type=marker;
run;
Thanks, again, for getting me back on track so quickly!

What determines Z-Index in a Google Geochart?

I am trying to plot yearly data on a geochart. I would like the most recent data on top, but for whatever reason, the earliest year is always on top in the actual visualization.
I have tried re-ordering the table to have the latest years as the first entries in the data with no effect.
I thought that maybe it was happening because I used a view to filter my data, but the filter is not reordering the items with the older ones first (so that shouldn't impact how it is displayed).
I do not want to filter out data since I use transparency to display all points. Here is some sample code that displays the same problem:
function drawVisualization() {
var data = new google.visualization.DataTable();
data.addColumn('number', 'Latitude');
data.addColumn('number', 'Longitude');
data.addColumn('number', 'Color');
data.addColumn('number', 'Output (MW)');
data.addRows([
[35, 135, 2, 334],
[35, 135, 1, 100],
[35.1, 135.1, 1, 100],
[35.1, 135, 1, 100],
[35, 135.1, 1, 100],
[34.9, 134.9, 1, 100],
[34.9, 135, 1, 100],
[35, 135.1, 1, 100],
]);
var geochart = new google.visualization.GeoChart(
document.getElementById('visualization'));
geochart.draw(data, {
colorAxis: {
'minValue': 1,
'maxValue': 2,
'values': [1, 2],
'colors': ['black','red'],
},
'markerOpacity': 0.5,
'region': 'JP'
});
}
I can change the values in column 2 or 3 (0-indexed), or I can change the order of the entries in the data table, but I keep getting the same result. I have a feeling it always puts larger values at the back so you can still see the smaller ones, but I'm wondering whether there is an authoritative reference on this, or any way around it.
This is what it looks like no matter what I do:
What I want it to look like is as follows (manipulated the SVG manually to adjust the Z-order):
I played around with it for a bit, and I think you're right: it's automatically z-indexing the markers in size-order. If I read your intent correctly, you are looking to show some subset of years, and you want the markers to be z-indexed by years. I think you can accomplish that with some custom filtering: sort your data by location and year, then for every location, filter out every year with a smaller size than any of the newer years. Something like this should work:
// order by location and year (descending)
var rows = data.getSortedRows([0, 1, {column: 2, desc: true}]);
// parse the rows backwards, removing all years where a location has a newer year with a larger size value
// we don't need to parse row 0, since that will always be the latest year for some location
var size, lat, long;
for (var i = rows.length - 1; i > 0; i--) {
size = data.getValue(rows[i], 3);
lat = data.getValue(rows[i], 0);
long = data.getValue(rows[i], 1);
for (var j = i - 1; j >= 0 && lat == data.getValue(rows[j], 0) && long == data.getValue(rows[j], 1); j--) {
if (size < data.getValue(rows[j], 3)) {
rows.splice(i, 1);
break;
}
}
}
var view = new google.visualization.DataView(data);
view.setRows(rows);
Here's a working example based on your code: http://jsfiddle.net/asgallant/36AmD/
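To make the filtering rule above easier to check away from the chart, here is a minimal sketch of the same logic in Python. It assumes rows are (lat, lng, year, size) tuples, with the question's 'Color' column standing in for the year, and the function name drop_covered_older_rows is hypothetical. A row is dropped when some newer row at the same location has a larger size.

```python
def drop_covered_older_rows(rows):
    """Keep a row unless a newer row at the same location has a larger size.

    rows: iterable of (lat, lng, year, size) tuples.
    """
    rows = list(rows)
    kept = []
    for lat, lng, year, size in rows:
        covered = any(
            (other_lat, other_lng) == (lat, lng)
            and other_year > year
            and other_size > size
            for other_lat, other_lng, other_year, other_size in rows
        )
        if not covered:
            kept.append((lat, lng, year, size))
    return kept


# First three rows of the question's sample data.
SAMPLE = [
    (35, 135, 2, 334),
    (35, 135, 1, 100),
    (35.1, 135.1, 1, 100),
]

print(drop_covered_older_rows(SAMPLE))
```

This version compares every pair of rows, which is fine for small tables. The sorted single pass in the JavaScript above avoids that cost on larger ones.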
You are correct that the order of the markers is determined by the size, with the larger markers drawn first so they end up below the smaller markers, which is a convenience for most applications. If you wish to hide 'later' markers based on order, you'll have to do that another way, perhaps by hiding the rows of data.
Is there a reason it makes sense to hide data if it covers 'earlier' data? Perhaps an option could be added to disable this automatic reordering, especially if transparent colors are used to allow you to see through.
Try this, helped me in a project:
setTimeout(function () {
$('.google-visualization-table').css("z-index", "1");
}, 500);

How do you calculate expanding mean on time series using pandas?

How would you create a column (or columns) in the pandas DataFrame below where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'? Imagine this as if it were time series data where 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2.
I have tried every way I could think of but just can't seem to get it right.
left4 = pd.DataFrame({'ID': [1,2,3,4],'val': [10000, 25000, 20000, 40000],
'Mod_ID': [15, 35, 15, 42],'car': ['ford','honda', 'ford', 'lexus']})
right4 = pd.DataFrame({'ID': [3,1,2,4],'color': ['red', 'green', 'blue', 'grey'], 'wheel': ['4wheel','4wheel', '2wheel', '2wheel'],
'Mod_ID': [15, 15, 35, 42]})
df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)
Hard to test properly on your DataFrame, but you can use something like this:
>>> df1["exp_mean"] = df1.groupby("Mod_ID_x")["val"].transform(lambda s: s.expanding().mean())
>>> df1
ID Mod_ID_x car val color wheel exp_mean
0 1 15 ford 10000 green 4wheel 10000
1 2 35 honda 25000 blue 2wheel 25000
2 3 15 ford 20000 red 4wheel 15000
3 4 42 lexus 40000 grey 2wheel 40000
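To make explicit what an expanding mean per group computes, here is a minimal pure-Python sketch. The helper expanding_group_mean is hypothetical, not a pandas function. It walks the rows in order and keeps a running sum and count for each Mod_ID_x value.

```python
from collections import defaultdict


def expanding_group_mean(keys, values):
    """For each row, the mean of all values seen so far with the same key."""
    running_sum = defaultdict(float)
    running_count = defaultdict(int)
    means = []
    for key, value in zip(keys, values):
        running_sum[key] += value
        running_count[key] += 1
        means.append(running_sum[key] / running_count[key])
    return means


mod_ids = [15, 35, 15, 42]
vals = [10000, 25000, 20000, 40000]
print(expanding_group_mean(mod_ids, vals))
```

In recent pandas versions the per-group call is spelled with the expanding() method (for example Series.expanding().mean() inside a groupby transform) rather than a top-level pd.expanding_mean function.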

unable to read a tab delimited file into a numpy 2-D array

I am quite new to numpy and I am trying to read a tab-delimited (\t) text file into a numpy array using the following code:
train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')
File contents:
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
what I expect is a 2-D array matrix of shape (3, 15)
but with my above code I only get a single row array of shape (3,)
I am not sure why those fifteen fields of each row are not assigned a column each.
I also tried numpy's loadtxt(), but it could not handle type conversions on my data, i.e., even though I gave dtype=None, it tried to convert the strings to the default float type and failed.
Tried code:
train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')
Error:
ValueError: could not convert string to float: State-gov
Any pointers?
Thanks
Actually the issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports to have a shape of (3,), because technically it is a 1d array of 'records'. These 'records' hold all your columns but you can access all the data as if it were 2d.
You are loading it correctly:
In [82]: d = np.genfromtxt('tmp',dtype=None)
As you reported, it has a 1d shape:
In [83]: d.shape
Out[83]: (3,)
But all your data is there:
In [84]: d
Out[84]:
array([ (38, 'Private', 215646, 'HS-grad', 9, 'Divorced', 'Handlers-cleaners', 'Not-in-family', 'White', 'Male', 0, 0, 40, 'United-States', '<=50K'),
(53, 'Private', 234721, '11th', 7, 'Married-civ-spouse', 'Handlers-cleaners', 'Husband', 'Black', 'Male', 0, 0, 40, 'United-States', '<=50K'),
(30, 'State-gov', 141297, 'Bachelors', 13, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'Asian-Pac-Islander', 'Male', 0, 0, 40, 'India', '>50K')],
dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
The dtype of the array is structured as so:
In [85]: d.dtype
Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
And you can still access "columns" (known as fields) using the names given in the dtype:
In [86]: d['f0']
Out[86]: array([38, 53, 30])
In [87]: d['f1']
Out[87]:
array(['Private', 'Private', 'State-gov'],
dtype='|S9')
It's more convenient to give proper names to the fields:
In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"
In [105]: d = np.genfromtxt('tmp',dtype=None, names=names)
So you can now access the 'age' field, etc.:
In [106]: d['age']
Out[106]: array([38, 53, 30])
In [107]: d['income']
Out[107]:
array(['<=50K', '<=50K', '>50K'],
dtype='|S5')
Or the incomes of people under 35
In [108]: d[d['age'] < 35]['income']
Out[108]:
array(['>50K'],
dtype='|S5')
and over 35
In [109]: d[d['age'] > 35]['income']
Out[109]:
array(['<=50K', '<=50K'],
dtype='|S5')
Updated answer
Sorry, I misread your original question:
what I expect is a 2-D array matrix of shape (3, 15)
but with my above code I only get a single row array of shape (3,)
I think you misunderstand what np.genfromtxt() will return. In this case, it will try to infer the type of each 'column' in your text file and give you back a structured / "record" array. Each row will contain multiple fields (f0...f14), each of which can contain values of a different type corresponding to a 'column' in your text file. You can index a particular field by name, e.g. data['f0'].
You simply can't have a (3,15) numpy array of heterogeneous types. You can have a (3,15) homogeneous array of strings, for example:
>>> string_data = np.genfromtxt('test', dtype=str, delimiter='\t')
>>> print(string_data.shape)
(3, 15)
Then of course you could manually cast the columns to whatever type you want, as in @DrRobotNinja's answer. However you might as well let numpy create a structured array for you, then index it by field and assign the columns to new arrays.
I do not believe NumPy arrays handle different datatypes within a single array. What you can do is load the entire array as strings, then convert the relevant columns to numbers:
import numpy as np
# Load data as strings (np.str and np.int were aliases of the builtins str and int)
train_data = np.loadtxt('try.txt', dtype=str, delimiter='\t')
# Convert numeric strings into integers
first_col = train_data[:,0].astype(int)
third_col = train_data[:,2].astype(int)
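The same load-as-strings-then-cast idea can be sketched with only the standard library's csv module, which makes the two steps easy to see without NumPy. The in-memory SAMPLE text and the load_rows helper are assumptions for illustration. The real file would be opened with open('training.txt', newline='') instead.

```python
import csv
import io

# The three rows from the question, assumed to be tab-separated.
SAMPLE = (
    "38\tPrivate\t215646\tHS-grad\t9\tDivorced\tHandlers-cleaners\t"
    "Not-in-family\tWhite\tMale\t0\t0\t40\tUnited-States\t<=50K\n"
    "53\tPrivate\t234721\t11th\t7\tMarried-civ-spouse\tHandlers-cleaners\t"
    "Husband\tBlack\tMale\t0\t0\t40\tUnited-States\t<=50K\n"
    "30\tState-gov\t141297\tBachelors\t13\tMarried-civ-spouse\tProf-specialty\t"
    "Husband\tAsian-Pac-Islander\tMale\t0\t0\t40\tIndia\t>50K\n"
)


def load_rows(text):
    """Step 1: read every field as a string, one list per line."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


rows = load_rows(SAMPLE)

# Step 2: cast only the columns that are known to be numeric.
ages = [int(row[0]) for row in rows]
ids = [int(row[2]) for row in rows]

print(len(rows), len(rows[0]))
print(ages, ids)
```

Each row stays a plain list of 15 strings, so the "shape" is 3 by 15, and the numeric columns are pulled out into separate lists of ints.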