GLPK multiple dimension param - operations

How do I use a multi-dimensional param Distance in GLPK, i.e. something like param Distance {line in Line, dir in Direction, ori in Station, des in Station};?
data;
set Direction := Eastbound Westbound;
set Line := District Piccadilly;
set Station := ACTON_TOWN ALDGATE_EAST ALPERTON ARNOS_GROVE...;
param Distance := # Line Direction StationFrom StationTo Kilometers
District Eastbound ACTON_TOWN CHISWICK_PARK 1.22
District Eastbound ALDGATE_EAST WHITECHAPEL 0.82
District Eastbound BARKING UPNEY 1.38
District Eastbound BARONS_COURT WEST_KENSINGTON 0.64
District Eastbound BAYSWATER PADDINGTON 0.98
District Eastbound BECONTREE DAGENHAM_HEATHWAY 1.37
...
end;

Before the "data" part you can declare the parameter as follows:
param Distance {Line, Direction, Station, Station};
And then use it like
var x, >= 0;
minimize obj : sum{line in Line, dir in Direction , ori in Station , des in Station}(x*Distance[line,dir,ori,des]);
But there is one big problem. With a single Station set you will get connections like ACTON_TOWN to ACTON_TOWN, so you have to set those to zero in the data part. Additionally, this combined distance representation with the directions Eastbound and Westbound means there will be an ACTON_TOWN to CHISWICK_PARK entry in both directions, so you have to handle the values for the implausible connections in some way (for example with fixed indexing or a high price). The same applies to stations that are not on the line.
You should probably think about a separate representation of your stations, like
set Station_Piccadilly := ...;
set Station_District := ...;
...
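GMPL itself supports a default clause on parameters (e.g. param Distance {Line, Direction, Station, Station} default 9999;), so unlisted combinations get a penalty value automatically. As a language-neutral illustration of that sparse-with-default idea, here is a Python sketch (values from the data above, penalty value made up):

```python
# Sparse distance table: only real edges are stored; everything else
# (same station, wrong direction, station not on the line) falls back
# to a high default price so an optimizer would avoid it.
BIG = 1e9  # hypothetical penalty for implausible connections

distance = {
    ("District", "Eastbound", "ACTON_TOWN", "CHISWICK_PARK"): 1.22,
    ("District", "Eastbound", "ALDGATE_EAST", "WHITECHAPEL"): 0.82,
}

def dist(line, direction, ori, des):
    if ori == des:
        return 0.0  # no self-loops
    return distance.get((line, direction, ori, des), BIG)
```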
If you want to make some routing you should probably look at the glpk example tsp.mod containing the Traveling Salesman Problem.

Orange3: lat/lng data geocoded with discretized date -> Average result value per year per continent

I have data that has:
a result value
A lat & lng
a date
I have discretized the date to years, which works fine. I have used the Geocoding add-on to extract the continent from the lat/lng coords. Now I want to get the average result value per continent per year. I'm at the point where I have all the values available, but I can't figure out how to group the results (mean) by continent-year categories.
I have set up the following flow.
The table you see above comes from the red square
[screenshots of the Orange3 workflow and the resulting table]
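Outside the Orange GUI, the aggregation being asked for is a plain two-key group-by. A pandas sketch, with assumed column names and made-up values:

```python
import pandas as pd

# Hypothetical frame standing in for the geocoded/discretized output.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "year": [2019, 2019, 2019, 2020],
    "result": [1.0, 3.0, 5.0, 7.0],
})

# Mean result per continent-year category.
avg = df.groupby(["continent", "year"])["result"].mean().reset_index()
```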

SAS cutting off string during data recode

On a dataset that is created by:
data voa;
input Address $50.;
input City $1-15 State $16-17 Zip;
input Latitude Longitude;
datalines;
1675 C Street, Suite 201
Anchorage AK 99501
61.21 -149.89
600 Azalea Road
Mobile AL 36609
30.65 -88.15
;
run;
I'm attempting to add a new variable which is essentially a recoding of Longitude and Latitude, like so:
data voa1;
set voa;
if Longitude < -110 then Region = "West";
if Latitude > 40 and Longitude < -90 and Longitude > -110 then Region = "Mid-West";
if Latitude > 40 and Longitude > -90 then Region = "North-East";
if Latitude < 40 and Longitude < -110 then Region = "South";
run;
Unfortunately, it seems that SAS is cutting the strings short and leaving them at 4 characters (e.g. "Mid-West" just becomes "Mid-"). If I had to guess I would assume that this is because SAS assigns a certain number of bytes for each value in a column based on the first value in that column, and doesn't dynamically modify the number of bytes based on new values. How do I fix this?
Note: I think a potential fix might be putting the longest potential output (in this case "North-East") first, but this seems like an inelegant solution.
One of the nice features of SAS is that you are not forced to define your variables before using them. But if you don't define a variable, SAS must guess what you meant from the code that you write. In your case, since the first reference to the new variable Region is in the assignment statement:
Region = "West"
SAS makes the logical decision to define it as a character variable of length 4.
To fix that just add a LENGTH statement before the first IF statement.
length region $10;
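The same "width is fixed by the first value" behavior can be reproduced outside SAS, for example with NumPy's fixed-width string arrays (a sketch of the analogy, not SAS itself):

```python
import numpy as np

# dtype is inferred as '<U4' from the longest initial value "West",
# so a later assignment is silently truncated to 4 characters.
region = np.array(["West", "West"])
region[0] = "Mid-West"
assert str(region[0]) == "Mid-"  # truncated, like SAS without LENGTH

# The fix mirrors the LENGTH statement: declare enough width up front.
region = np.array(["West", "West"], dtype="<U10")
region[0] = "Mid-West"
assert str(region[0]) == "Mid-West"
```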

How to do column wise intersection with itertools

I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), forming an (m*m) similarity matrix.
I get an unexpected outcome in the matrix. I have identified the source of the problem but do not possess an optimized solution for it.
Find the sample of the dataset below:
ID AGE Occupation Gender Product_range Product_cat Product
1100 25-34 IT M 50-60 Gaming XPS 6610
1101 35-44 Research M 60-70 Business Latitude lat6
1102 35-44 Research M 60-70 Performance Inspiron 5810
1103 25-34 Lawyer F 50-60 Business Latitude lat5
1104 45-54 Business F 40-50 Performance Inspiron 5410
The matrix I get is shown in a screenshot (not reproduced here).
Problem Statement:
Look at the value in the red box, which shows the similarity of rows (1104) and (1101) of the sample dataset. The two rows are not similar if you compare their respective columns; the value 0.16 arises because the term "Business" is present in the "Occupation" column of row (1104) and in the "Product_cat" column of row (1101), which contributes 1 when the intersection of the rows is taken.
My code just takes the intersection of the two rows without looking at the columns. How do I change my code to handle this case and keep the performance equally good?
My code:
half_matrix = []
for row1, row2 in itertools.combinations(data_set, r=2):
    intersection_len = row1.intersection(row2)
    half_matrix.append(float(len(intersection_len)) / tot_len)
The simplest way out of this is to add a column-specific prefix to all entries. Example of a parsed row:
row = ["ID:1100", "AGE:25-34", "Occupation:IT", "Gender:M", "Product_range:50-60", "Product_cat:Gaming", "Product:XPS 6610"]
There are many other ways around this, including splitting each row into a set of k-mers and applying the Jaccard-based MinHash algorithm to compare these sets, but there is no need for that in your case.
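A self-contained sketch of the prefixing fix, using two rows from the sample table above (the denominator here uses the union of the two sets; substitute your tot_len as needed):

```python
import itertools

columns = ["ID", "AGE", "Occupation", "Gender",
           "Product_range", "Product_cat", "Product"]
rows = [
    ["1101", "35-44", "Research", "M", "60-70", "Business", "Latitude lat6"],
    ["1104", "45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"],
]

# Prefix every value with its column name, so "Business" in Occupation
# can no longer match "Business" in Product_cat.
data_set = [{f"{col}:{val}" for col, val in zip(columns, row)}
            for row in rows]

similarities = []
for row1, row2 in itertools.combinations(data_set, r=2):
    inter = row1.intersection(row2)
    similarities.append(len(inter) / len(row1.union(row2)))
```

With the prefixes in place, the spurious "Business" overlap between rows 1101 and 1104 disappears and their similarity drops to 0.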

Plotting log odds against mid-point of category

I have a binary outcome variable (disease) and a continuous independent variable (age). There's also a cluster variable clustvar. Logistic regression assumes that the log odds is linear with respect to the continuous variable. To visualize this, I can categorize age as (for example, 0 to <5, 5 to <15, 15 to <30, 30 to <50 and 50+) and then plot the log odds against the category number using:
logistic disease i.agecat, vce(cluster clustvar)
margins agecat, predict(xb)
marginsplot
However, since the categories are not equal width, it would be better to plot the log odds against the mid-point of the categories. Is there any way that I can manually define that the values plotted on the x-axis by marginsplot should be 2.5, 10, 22.5, 40 and (slightly arbitrarily) 60, and have the points spaced appropriately?
If anyone is interested, I achieved the required graph as follows:
Recategorised age variable slightly differently using (integer) labels that represent the mid-point of the category:
gen agecat = .
replace agecat = 3 if age<6
replace agecat = 11 if age>=6 & age<16
replace agecat = 23 if age>=16 & age<30
replace agecat = 40 if age>=30 & age<50
replace agecat = 60 if age>=50 & age<.
For labelling purposes, created a label:
label define agecat 3 "Less than 5y" 11 "10 to 15y" 23 "15 to <30y" 40 "30 to <50y" 60 "Over 50 years"
label values agecat agecat
Ran logistic regression as above:
logistic disease i.agecat, vce(cluster clustvar)
Used margins and plot using marginsplot:
margins agecat, predict(xb)
marginsplot
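For intuition, here is the same idea outside Stata, as a Python sketch with made-up data: recode each age to its category's integer mid-point, then compute the empirical log odds per mid-point (the regression-based version above is what you would actually report).

```python
import math

def age_midpoint(age):
    # Integer mid-points matching the recode above.
    if age < 6:
        return 3
    elif age < 16:
        return 11
    elif age < 30:
        return 23
    elif age < 50:
        return 40
    return 60

data = [(4, 0), (10, 1), (12, 0), (20, 1), (35, 0), (70, 1)]  # (age, disease)
groups = {}
for age, disease in data:
    groups.setdefault(age_midpoint(age), []).append(disease)

# Empirical log odds per category (defined only when both outcomes occur).
log_odds = {mid: math.log(sum(d) / (len(d) - sum(d)))
            for mid, d in groups.items() if 0 < sum(d) < len(d)}
```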

Stata: Uniquely sorting points within groups

I'm conducting a household survey with a random sample of 200 villages. Using QGIS, I picked a random point 5-10km from my original villages. I then obtained, from the national statistical office, the village codes for those 200 "neighbor" villages - as well as a buffer of 10 additional neighbor villages. So my total sample is:
200 original villages + 210 neighbor villages = 410 villages, total
We're going to begin fieldwork soon, and I want to give each survey team a map for 1 original village + the nearest neighbor village. Because I'm surveying in some dense urban areas as well, sometimes a neighbor village is actually quite close to more than one original village.
My problem is this: if I run Distance Matrix in QGIS, matching an old village to its nearest neighbor village, I get duplicates in the latter. To get around this, I've matched each old village to the nearest 5 neighbor villages. My main idea/goal is to pick the nearest neighbor that hasn't already been picked.
I end up with a .csv like so:
As you can see, picking the five nearest villages, I'm getting repeats - neighbor village 79 is showing up as nearby to original villages 1, 2, 3, and 4. This is fine, as long as I can assign neighbor village 79 to one (and only one) original village, and then have the rest uniquely match as well.
What I want to do, then, is to uniquely match each original village to one neighbor village. I've tried a bunch of stuff, none of which has worked: My sense is that I need to loop over original village groups, assign a variable (e.g. taken==1) to one of the neighbor villages, and then - somehow - have each instance of that taken==1 apply to all instances of, say, neighbor village 79.
Here's some sample code of what I was thinking. Note: this uniquely matches 163 of my neighbors.
gen taken = 0
so ea distance
by ea: replace taken=1 if _n==1
keep if taken==1
codebook FID ea
This also doesn't work; it just sets taken to 1 for all obs:
foreach i in 5 4 3 2 1 {
    by ea: replace taken=1 if _n==`i' & taken==0
}
What I need to do, I think, is loop over both _N and _n, and maybe use an if/else. But I'm not sure how to put it all together.
(Tangentially, is there a better way to loop over decreasing values in Stata? Similar to i-- in other programming languages?)
This should work but the setup is a little different than what you say you need. By comparing with only five neighbors, you have an ill-posed problem. Imagine that geography is such that you end up with six (or more) original villages that have all the same list of five neighbors. What do you assign the sixth original village?
Given this, I compare the original village with all other villages, not only five. The strategy is then to assign original village 1 its closest neighbor; to original village 2 its closest neighbor after discarding the one previously assigned, and so on. This assumes equal number of original and neighbor villages but you have ten additional, so you need to give that a thought.
clear
set more off
*----- example data -----
local numvilla = 4 // change to test
local numobs = `numvilla'^2
set obs `numobs'
egen origv = seq(), from(1) to(`numvilla') block(`numvilla')
bysort origv: gen neigh = _n
set seed 1956
gen dist = runiform()*10
*----- what you want ? -----
sort origv dist
list, sepby(origv)
quietly forvalues villa = 1/`numvilla' {
    drop if origv == `villa' & _n > `villa'
    drop if neigh == neigh[`villa'] & _n > `villa'
}
list
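The same greedy strategy, sketched in Python (with made-up distances) rather than Stata: each original village, taken in order, gets its closest neighbor that has not been taken yet.

```python
# Hypothetical (original, neighbor) -> distance pairs.
distances = {
    (1, 79): 0.5, (1, 80): 0.9,
    (2, 79): 0.6, (2, 80): 0.7,
}

assigned, taken = {}, set()
for orig in sorted({o for o, _ in distances}):
    # Closest neighbor of this original village that is still available.
    candidates = sorted((d, n) for (o, n), d in distances.items()
                        if o == orig and n not in taken)
    if candidates:
        _, best = candidates[0]
        assigned[orig] = best
        taken.add(best)
```

Note that village 2 ends up with neighbor 80 even though 79 is closer to it, because 79 was already claimed by village 1; this is the order-dependence discussed below.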
The other issue is that results will depend on which original village is set to first, second, and so on; because order of assignments will change according to that. That is, the order in which available options are discarded changes with the order in which you set up the original villages. You may want to randomize the order of the original villages before you start the assignments.
You can increase efficiency substituting out & _n > `villa' for in `=`villa'+1'/L, but you won't notice much with your sample size.
I'm not qualified to say anything about your sample design, so take this answer to address only the programming issue you pose.
By the way, to loop over decreasing values:
forvalues obs = 5(-1)1 {
    display "`obs'"
}
See help numlist.