Use case for "sets of tuple data" in Pyomo - pyomo

When we specify the data for a set we have the ability to give it tuples of data. For example, we could write in our .dat file the following:
set A : 1 2 3 :=
1   +   -   -
2   -   -   +
3   -   +   +
;
This would specify that we have 4 tuples in our set: (1,1), (2,3), (3,2), (3,3).
But I am struggling to understand exactly why we would want to do this. Furthermore, suppose we instantiated a Set object in our code as:
model.Aset = RangeSet(4, dimen=2)
Would this then specify that our tuples would have the indices 1, 2, 3, and 4?
I am thinking that specifying tuples in our set could potentially be useful when working with some data in which it's important to have a bit of a "spatial" understanding of the problem. But I would be curious to hear from the community what the potential applications of specifying set data this way might be.

The most common place this appears is when you're trying to model edges between nodes in a network. Networks aren't usually completely dense (with edges between every pair of nodes), so it's beneficial to represent just the edges that exist using a sparse set of tuples.
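For instance, a sparse edge set in Pyomo might be declared like this (a minimal sketch; the node and edge data are made up for illustration):

from pyomo.environ import ConcreteModel, Set

model = ConcreteModel()
model.Nodes = Set(initialize=[1, 2, 3, 4])
# Only the arcs that actually exist, stored as a sparse set of 2-tuples
model.Edges = Set(within=model.Nodes * model.Nodes,
                  initialize=[(1, 2), (2, 3), (3, 4), (1, 4)])

Any constraint indexed by model.Edges then iterates over just those four pairs instead of all 16 possible node pairs.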

Related

Elasticsearch scoring on multiple indexes: dfs_query_then_fetch returns the same scores as query_then_fetch

I have multiple indices in Elasticsearch (and the corresponding documents in Django created using django-elasticsearch-dsl). All of the indices have these settings:
settings = {'number_of_shards': 1,
            'number_of_replicas': 0}
Now, I am trying to perform a search across all 10 indices. In order to retrieve consistent scoring between the results from different indices, I am using dfs_query_then_fetch:
search = Search(index=['mov*'])
search = search.params(search_type='dfs_query_then_fetch')
objects = search.query("multi_match", query='Tom & Jerry', fields=['title', 'actors'])
I get bad results due to inconsistent scoring. A book called 'A story of Jerry and his friend Tom' from one index can be ranked higher than the cartoon 'Tom & Jerry' from another index. The reason is that dfs_query_then_fetch is not working. When I remove it or substitute the simple query_then_fetch, I get exactly the same results with identical scoring.
I have tested it on URI requests as well, and I always get the same scores for both search types.
What can be the reason for it?
UPDATE: The results are actually not the same, but they are only very slightly different, e.g. a score of 50.1 with dfs and 50.0 without dfs, while the same model within one index has a score of 80.0.
If the number of shards is 1, then dfs_query_then_fetch and query_then_fetch will return the same result. The DFS phase queries all shards to gather term statistics before scoring, but in this case there is only one shard, so there is nothing extra to gather.
Regarding the scoring, you might want to have a look at your actors field too. Also, let us know which analyzer and tokenizer you use, if they are custom ones.
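If the goal is to actually see DFS make a difference, the indices would need more than one primary shard, so that per-shard term statistics can diverge. A minimal sketch with the low-level Python client (the index name and shard count are assumptions, not from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Recreate an index with several primary shards; now the DFS phase has
# per-shard term statistics to merge before scoring.
es.indices.create(index='movies', body={
    'settings': {'number_of_shards': 3, 'number_of_replicas': 0}
})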

Finding all possible variations in a Bin-Packing Problem

For my task I had to write a bin-packing algorithm where there are N objects with different volumes, all of which had to be packed into boxes of volume V. Using decreasing sorting I successfully wrote the algorithm. But another task involves writing out all possible bin-packing variations for the number of boxes that I previously found to be most effective. So for example:
There are 4 objects with volumes: 4, 6, 3, 2. Volume of boxes is 10. Using the bin-packing algorithm I find that I will need 2 boxes.
All possible variations would be:
4,6 and 3,2
4,3 and 6,2
4,2 and 6,3
6 and 4,3,2
I'm having trouble coming up with an appropriate algorithm for this problem. Where should I start? Any help would be greatly appreciated.
The general algorithm for solving this problem goes like this:
Try to fit all objects into n bins by creating all possible split configurations into n groups and testing whether any such configuration fits in the bins.
If not, increase n and try again.
Now, how do you find all possible split configurations?
Consider putting a tag on each object to decide into which bin it belongs. If you have 3 objects and 2 bins, then each object can get the tag 0 or 1 (for either of the two bins). This makes 2^3 = 8 combinations:
000
001
010
...
Now it also becomes clear how to create all combinations. You can use a counter, convert it into the base of the number of bins (2 in this case), and use the digits as tags. There are other options, e.g. you could use a recursive solution. I prefer that.
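For completeness, a counter-based version might look like this in Python (a small sketch; the function name is mine):

def tags_from_counter(counter, num_objects, num_bins):
    # Interpret the counter as a base-`num_bins` number;
    # each digit is the bin tag of one object.
    tags = []
    for _ in range(num_objects):
        tags.append(counter % num_bins)
        counter //= num_bins
    return tags

# All 2^3 = 8 tag combinations for 3 objects and 2 bins:
all_tags = [tags_from_counter(c, 3, 2) for c in range(2 ** 3)]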
When you have a candidate assignment, you just need to check that, for each bin, the volume sum of the objects with that tag is not greater than the bin size.
Here is some Python code for creating a list of all the combinations recursively:
def combinations(object_counter, bin_counter):
    # No objects left: exactly one (empty) assignment
    if object_counter == 0:
        return [[]]
    result = []
    for i in range(bin_counter):
        # Tag the current object with bin i and extend with every
        # assignment of the remaining objects
        for sub_result in combinations(object_counter - 1, bin_counter):
            result.append([i] + sub_result)
    return result
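To tie it together, here is a minimal sketch that uses the function above on the example from the question (volumes 4, 6, 3, 2 and box volume 10) and keeps only the assignments that fit:

volumes = [4, 6, 3, 2]
box_volume = 10
num_bins = 2

def fits(tags):
    # Sum the volume per bin and reject as soon as any bin overflows
    loads = [0] * num_bins
    for volume, tag in zip(volumes, tags):
        loads[tag] += volume
        if loads[tag] > box_volume:
            return False
    return True

valid = [tags for tags in combinations(len(volumes), num_bins) if fits(tags)]

Note that with labeled bins every packing shows up once per permutation of the bin labels (so the four variations above appear eight times); deduplicate if the boxes are interchangeable.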

Data structure to multi-sort feature vector by attributes

I need to sort a vector of tuples
[
(a_11, ..., a_1n),
... ,
(a_m1, ..., a_mn)
]
based on a list of attributes and their comparison operators < or >.
For example: sort first by a_2 with the > operator and by a_57 with the < operator.
Question: I am looking for a data structure to do this efficiently under the assumption that sorting happens much more often than updates to the vector.
My current idea is to store the sorting order for each attribute by adding pointers similar to a linked list for each attribute:
For example, this vector:
0: (1, 7, 4)
1: (2, 5, 6)
2: (3, 4, 5)
Would get the data structure
0: (1 next:1 prev:-, 7 next:- prev:1, 4 next:2 prev:-)
1: (2 next:2 prev:0, 5 next:0 prev:2, 6 next:- prev:2)
2: (3 next:- prev:1, 4 next:1 prev:-, 5 next:1 prev:0)
Edit:
At any given time I need only one sorting order. After I get a user request for a different sorting order I need to recompute as quickly as possible.
The incremental idea is very good, but I need to estimate how much time I will need, and this is much easier if I have an idea of how it should be done.
Once I am finished I need random access to groups of 100 elements, i.e. the first 100, the second 100, or elements 5100-5199.
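For comparison, a simple baseline in Python that re-sorts from scratch on every request, exploiting the fact that Python's sort is stable (the spec format and names are my own assumption, not from the question):

def multi_sort(rows, spec):
    # spec is a list of (attribute_index, descending) pairs in priority
    # order, e.g. [(1, True), (56, False)] for "a_2 with >, then a_57 with <".
    # Stable sorts compose, so apply the least significant key first.
    for index, descending in reversed(spec):
        rows.sort(key=lambda row: row[index], reverse=descending)
    return rows

rows = [(1, 7, 4), (2, 5, 6), (3, 4, 5)]
multi_sort(rows, [(1, True)])  # sort by the second attribute, descending

Any precomputed structure has to beat this O(n log n) re-sort to be worthwhile, which is a useful yardstick for the time estimate mentioned above.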
I would use boost::MultiIndex for this. – drescherjm

Neo4j search unknown number of properties

I've got a graph like this:
(A)-[r1]-(B)-[r2]-(C)
The thing is that r1 and r2 can have different numbers of properties.
relation1:
    index1: 10
    index2: 2
relation2:
    index1: 6
    index2: 4
    index3: 5
Is it possible to search among all properties without knowing their names? Or is there a better way to keep lists in Neo4j?
Property values can be lists, as long as all the elements are the same type. So you can have
match (A) -[r1]-> (B) -[r2]-> (C) set r1.vals = [10, 2], r2.vals = [6, 4, 5]
and later search with
match (A) -[r]-> (B) where 10 in r.vals return A, B
I don't know whether this works with indexing, so presumably tstorms' answer is better if you have a lot of these relationships.
There's no way to do this in "native" Cypher, but you could use automatic relationship indexing, which uses Lucene. I think you can do the following in Cypher:
START r=rel:rel_auto_index("*:'your_search_value'")
RETURN startNode(r), endNode(r), type(r);
Make sure automatic indexing is enabled in your Neo4j properties:
relationship_auto_indexing = true

Creating train, test and cross validation datasets in sklearn (python 2.7) with a grouping constraint?

While creating train, test & cross validation samples in Python, I see the default method as:
1. Reading the dataset, after skipping headers
2. Creating the train, test and cross validation samples
import csv
with open('C:/Users/Train/Trainl.csv', 'r') as f1:
    next(f1)
    reader = csv.reader(f1, delimiter=',')
    input_set = []
    for row in reader:
        input_set.append(row)

import numpy as np
from numpy import genfromtxt
from sklearn import cross_validation
train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)
My problem though is that I have a field, say "A", in the csv file that I read into the numpy array, and all sampling should respect this field. That is, all entries with the same value of "A" should go into one sample.
Line # | A | B | C | D
     1 | 1 |   |   |
     2 | 1 |   |   |
     3 | 1 |   |   |
     4 | 1 |   |   |
     5 | 2 |   |   |
     6 | 2 |   |   |
     7 | 2 |   |   |
Required: lines 1, 2, 3, 4 should go in one sample and lines 5, 6, 7 should go in another.
The value of column A is a unique id, corresponding to one single entity (it could be seen as cross-section data points on one SINGLE user, so it MUST go into one single sample of train, test, or cv), and there are many such entities, so grouping by entity id is required.
The B, C, D columns may have any values, but grouping preservation is not required on them. (Bonus: can I group the sampling on multiple fields?)
What I tried:
A. Finding all unique values of A and denoting these as my sample, I distribute them amongst train, intermediate & cv & test, and then put the remaining rows for each value of "A" into the file where that value landed.
That is, if train got the entry "3", test got "2" and cv got "1", then all rows with A = 3 go into train, all rows with A = 2 into test, and all rows with A = 1 into cv.
Of course this approach is not scalable.
And I suspect it may have introduced bias into the datasets, since the number of 1's, 2's etc. in column A is not equal, meaning this approach will not work!
B. I also tried numpy.random.shuffle and numpy.random.permutation as per the thread here - Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation? - but they did not meet my requirement.
C. A third option, of course, is writing a custom function that does this grouping and then balances the training, test and cv datasets based on the number of data points in each group. But I am wondering whether there's already an efficient way to implement this.
Note that my dataset is huge, so ideally I would like a deterministic way to partition my datasets, without multiple eyeball scans to be sure that the partition is correct.
EDIT Part 2:
Since I did not find anything that fit my sampling criteria, I actually wrote a module to sample with grouping constraints. This is the GitHub code for it. The code was not written with very large data in mind, so it's not very efficient. Should you fork this code, please point out how I can improve the run-time.
https://github.com/ekta1007/Sampling-techniques/blob/master/sample_expedia.py
By forcing such constraints you will introduce bias into your procedure either way. So the approach based on partitioning the "users" data and then collecting their respective "measurements" does not seem bad. And it will scale just fine; this is an O(n) method, and the only reason for it not scaling up would be a bad implementation, not a bad method.
The reason there is no such functionality in existing methods (like the sklearn library) is that it looks highly artificial and runs counter to the idea behind machine learning models. If these rows are somehow one entity, then they should not be treated as separate data points. If you need this separate representation, then requiring a division where a particular entity cannot be partially in the test set and partially in the training set will surely bias the whole model.
To sum up: you should really deeply analyze whether your approach is reasonable from the machine learning point of view. If you are sure about it, I think the only possibility is to write the segmentation yourself, because even though I have used many ML libraries in the past, I've never seen such functionality.
In fact I am not sure whether the problem of segmenting a set of N numbers (the entity sizes) into K (= 3) subsets with given sum proportions, drawn with uniform distribution when treated as a random process, is not an NP problem in itself. If you cannot guarantee a uniform distribution, then your datasets cannot be used as a statistically correct way of training/testing/validating your model. Even if it has a reasonable polynomial solution, it can still scale badly (much worse than linear methods). This doubt applies if your constraints are "strict"; if they are "weak" you can always use a "generate and reject" approach, which should have amortized linear complexity.
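If you do write the segmentation yourself, a minimal deterministic sketch could look like this (the names, the greedy balancing rule, and the fractions are my own choices, not from the question or the answers above):

import random
from collections import defaultdict

def group_split(rows, group_index, fractions=(0.6, 0.2, 0.2), seed=0):
    # Bucket row indices by their group key (the value of column A)
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[group_index]].append(i)
    # Shuffle the group keys deterministically, then greedily assign each
    # whole group to the split currently furthest below its target size
    keys = list(groups)
    random.Random(seed).shuffle(keys)
    targets = [f * len(rows) for f in fractions]
    splits = [[] for _ in fractions]
    for key in keys:
        j = max(range(len(splits)), key=lambda s: targets[s] - len(splits[s]))
        splits[j].extend(groups[key])
    return splits  # row indices for train, cv and test

This keeps every entity in exactly one split and balances the split sizes approximately, which matches the "weak constraints" case described above.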
I was also facing a similar kind of issue. Though my coding is not too good, I came up with the solution given below:
Create a new data frame that contains only the unique id column of df, and drop the duplicates:
new = df[["Unique_Id"]].copy()
New_DF = new.drop_duplicates()
Create the training and test sets on the basis of New_DF:
train, test = train_test_split(New_DF, test_size=0.2)
Then merge the training and test sets back with the original df:
df_Test = pd.merge(df, test, how='inner', on='Unique_Id')
df_Train = pd.merge(df, train, how='inner', on='Unique_Id')
Similarly, we can create a sample for the validation part too.
Cheers.