Fast way to lookup entries with duplicates - c++

I am looking for a way to help me quickly look up duplicate entries in some sort of table structure in a very efficient way. Here's my problem:
I have objects of type Cargo where each Cargo has a list of other associated Cargo. Each cargo also has a type associated with it.
so for example:
class Cargo {
    int cargoType;
    std::list<Cargo*> associated;
};
Now, for each cargo object and its associated cargo, there is a certain value assigned based on their types. This evaluation is performed by classes that implement CargoEvaluator.
Now, I have a CargoEvaluatorManager which basically handles connecting everything together. CargoEvaluators are registered with CargoEvaluatorManager, then, to evaluate cargo, I call CargoEvaluatorManager.EvaluateCargo(Cargo* cargo).
Here's the current state
class CargoEvaluatorManager {
    std::vector<std::vector<CargoEvaluator*>> evaluators;

    double EvaluateCargo(Cargo* cargo)
    {
        double value = 0.0;
        for (auto& associated : cargo->associated) {
            auto evaluator = evaluators[cargo->cargoType][associated->cargoType];
            if (evaluator != nullptr)
                value += evaluator->Evaluate(cargo, associated);
        }
        return value;
    }
};
So to recap and mention a few extra points:
CargoEvaluatorManager stores CargoEvaluators in a 2-D array using cargo types as indices. The entire 2-D vector is initialized with nullptrs. When a CargoEvaluator is registered, resizing the array and the other bits and pieces I haven't shown here are handled appropriately.
I tried using a map with a std::pair as the key to look up different evaluators, but it is too slow.
This only allows one CargoEvaluator per combination of cargo types. I potentially want to have multiple cargo evaluators per combination as well.
I am calling EvaluateCargo tens of billions of times. I am aware my current implementation is not the most efficient and am looking for alternatives.
What I am looking for
As stated above, I want to do much of what I've outlined, with the exception that I want to allow multiple Evaluators for each pair of Cargo types. Naively, what I envision is a table like this:
---------------------------------------------
| int type 1 | int type 2 | CargoEvaluator* |
|-------------------------------------------|
| 5          | 7          | Eval1*          |
| 5          | 7          | Eval2*          |
| 4          | 6          | Eval3*          |
---------------------------------------------
The lookup should be symmetric in that the set (5,7) and (7,5) should resolve to the same entries. For speed I don't mind preloading duplicate entries.
There are maybe 3x or more associated Cargo in the list than there are Evaluators, if that factors into things.
Performance is crucial, as mentioned above!
For bonus points, each CargoEvaluator may have an additional penalty value associated with it that does not depend on how many associates a Cargo has. In other words: if a row in the table above is looked up, I want to call double Evaluator->AddPenalty() once and only once each time EvaluateCargo is called. I cannot store any instance variables for this, since that would cause multithreading issues.
One added constraint is I need to be able to identify the CargoEvaluators associated with a particular cargotype, meaning that hashing the two cargotypes together is not a viable option.
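To make that concrete, here is a minimal sketch of what I have in mind: a symmetric 2-D table holding a vector of evaluators per type pair, plus a per-call scratch set so each evaluator's penalty is added only once per EvaluateCargo call. Names beyond the ones already mentioned are hypothetical:
#include <list>
#include <unordered_set>
#include <vector>

struct Cargo {
    int cargoType;
    std::list<Cargo*> associated;
};

struct CargoEvaluator {
    virtual ~CargoEvaluator() = default;
    virtual double Evaluate(Cargo* cargo, Cargo* associated) = 0;
    virtual double AddPenalty() = 0;   // stateless, so it is safe across threads
};

class CargoEvaluatorManager {
    // evaluators[t1][t2] holds every evaluator registered for the pair (t1, t2).
    // Registration stores the same pointer under [t1][t2] and [t2][t1], so the
    // lookup is symmetric and evaluators stay identifiable per cargo type
    // (no hashing of the two types into one key).
    std::vector<std::vector<std::vector<CargoEvaluator*>>> evaluators;

public:
    double EvaluateCargo(Cargo* cargo)
    {
        double value = 0.0;
        // Local to this call: no instance state, so no multithreading issues.
        std::unordered_set<CargoEvaluator*> penalized;

        for (auto* associated : cargo->associated) {
            for (auto* evaluator : evaluators[cargo->cargoType][associated->cargoType]) {
                value += evaluator->Evaluate(cargo, associated);
                if (penalized.insert(evaluator).second)   // first time this evaluator fired
                    value += evaluator->AddPenalty();
            }
        }
        return value;
    }
};
Whether the per-call unordered_set is cheap enough at tens of billions of calls is something I would have to measure; with only a handful of evaluators, a small flat vector of already-seen pointers might well beat hashing.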
If any further clarification is needed, I'll gladly try to help.
Thank you all in advance!

Related

What does ML.NET Concatenate really do?

I believe I understand when Concatenate needs to be called, on what data and why. What I'm trying to understand is what physically happens to the input columns data when Concatenate is called.
Is this some kind of a hash function that hashes all the input data from the columns and generates a result?
In other words, I would like to know if that is technically possible to restore original values from the value generated by Concatenate?
Does the order of the data columns passed into Concatenate affect the resulting model, and in what way?
Why am I asking all this? I'm trying to understand which input parameters affect the quality of the produced model, and in what way. I have many input columns of data. They are all rather important, and the relationships between those values matter. If Concatenate does something simple and loses the relations between values, I would try one approach to improve the quality of the model. If it is rather complex and keeps the details of the values, I would use other approaches.
In ML.NET, Concatenate takes individual features (of the same type) and creates a feature vector.
In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image, while when representing texts the features might be the frequencies of occurrence of textual terms. Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression.
To my understanding there's no hashing involved. Conceptually you can think of it like the String.Join method, where you're taking individual elements and joining them into one. In this case, that single result is a feature vector that as a whole represents the underlying data as an array of type T, where T is the data type of the individual columns.
As a result, you can always access the individual components and order should not matter.
Here's an example using F# that takes data, creates a feature vector using the concatenate transform, and accesses the individual components:
#r "nuget:Microsoft.ML"
open Microsoft.ML
open Microsoft.ML.Data
// Raw data
let housingData =
seq {
{| NumRooms = 3f; NumBaths = 2f ; SqFt = 1200f|}
{| NumRooms = 2f; NumBaths = 1f ; SqFt = 800f|}
{| NumRooms = 6f; NumBaths = 7f ; SqFt = 5000f|}
}
// Initialize MLContext
let ctx = new MLContext()
// Load data into IDataView
let dataView = ctx.Data.LoadFromEnumerable(housingData)
// Get individual column names (NumRooms, NumBaths, SqFt)
let columnNames =
dataView.Schema
|> Seq.map(fun col -> col.Name)
|> Array.ofSeq
// Create pipeline with concatenate transform
let pipeline = ctx.Transforms.Concatenate("Features", columnNames)
// Fit data to pipeline and apply transform
let transformedData = pipeline.Fit(dataView).Transform(dataView)
// Get "Feature" column containing the result of applying Concatenate transform
let features = transformedData.GetColumn<float32 array>("Features")
// Deconstruct feature vector and print out individual features
printfn "Rooms | Baths | Sqft"
for [|rooms;baths;sqft|] in features do
printfn $"{rooms} | {baths} | {sqft}"
The result output to the console is:
Rooms | Baths | Sqft
3 | 2 | 1200
2 | 1 | 800
6 | 7 | 5000
If you're looking to understand the impact individual features have on your model, I'd suggest looking at Permutation Feature Importance (PFI) and Feature Contribution Calculation.

Sorting query by distance requires reading entire data set?

To perform geoqueries in DynamoDB, there are libraries in AWS (https://aws.amazon.com/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/). But to sort the results of a geoquery by distance, the entire dataset must be read, correct? If a geoquery produces a large number of results, there is no way to paginate that (on the backend, not to the user) if you're sorting by distance, is there?
You are correct. To sort all of the data points by distance from some arbitrary location, you must read all the data from your DynamoDB table.
In DynamoDB, you can only sort results using a pre-computed value that has been stored in the table and is being used as the sort key of the table or one of its indexes. If you only need to sort by distance from a single fixed location, you can therefore do this in DynamoDB by storing that pre-computed distance as a sort key.
Possible Workaround (with limitations)
TL;DR: it's not such a bad problem if you can get away with sorting only the items that are within X km of an arbitrary point.
This still involves sorting the data points in memory, but it makes the problem easier by producing incomplete results (limiting the maximum range of the results).
To do this, you need the Geohash of your point P (from which you are measuring the distance of all other points). Suppose it is A234311. Then you need to pick what range of results is appropriate. Let's put some numbers on this to make it concrete. (I'm totally making these numbers up because the actual numbers are irrelevant for understanding the concepts.)
A - represents a 6400km by 6400km area
2 - represents a 3200km by 3200km area within A
3 - represents a 1600km by 1600km area within A2
4 - represents a 800km by 800km area within A23
3 - represents a 400km by 400km area within A234
1 - represents a 200km by 200km area within A2343
1 - represents a 100km by 100km area within A23431
Graphically, it might look like this:
View of A                       View of A23
|----------|-----------|        |----------|-----------|
|          | A21 | A22 |        |          |           |
|    A1    |-----|-----|        |   A231   |   A232    |
|          | A23 | A24 |        |          |           |
|----------|-----------|        |----------|-----------|
|          |           |        |          |A2341|A2342|
|    A3    |    A4     |        |   A233   |-----|-----|
|          |           |        |          |A2343|A2344|
|----------|-----------|        |----------|-----------|   ... and so on.
In this case, our point P is in A234311, i.e. inside A2343. Suppose also that we want to get the sorted points within 400km. A2343 is 400km by 400km, so we need to load the results from A2343 and all of its 8-connected neighbors (A2341, A2342, A2344, A2334, A2332, A4112, A4121, A4122). Once we've loaded only those into memory, we calculate the distances, sort them, and discard any results that are more than 400km away.
(You could keep the results that are more than 400km away as long as the users/clients know that beyond 400km, the data could be incomplete.)
The hashing method that DynamoDB Geo library uses is very similar to a Z-Order Curve—you may find it helpful to familiarize yourself with that method as well as Part 1 and Part 2 of the AWS Database Blog on Z-Order Indexing for Multifaceted Queries in DynamoDB.
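To illustrate the workaround, here is a rough Python sketch. geohash_neighbors and query_cell are hypothetical helpers standing in for the Geo library's neighbor lookup and for one DynamoDB Query per cell; only the haversine math and the in-memory sort/filter are meant literally:
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def sorted_points_within(table, lat, lon, center_cell, max_km):
    # Load only the centre cell and its 8 neighbours (9 queries total)
    # instead of reading the whole table.
    cells = [center_cell] + geohash_neighbors(center_cell)   # hypothetical helper
    items = []
    for cell in cells:
        items.extend(query_cell(table, cell))                # hypothetical: one Query per cell

    # Distance, filter and sort all happen in application memory.
    for item in items:
        item["distance_km"] = haversine_km(lat, lon, item["lat"], item["lon"])
    return sorted(
        (i for i in items if i["distance_km"] <= max_km),
        key=lambda i: i["distance_km"],
    )
The precision of center_cell (how many geohash characters it has) is what bounds both the radius you can honestly report and how much data the nine queries pull back.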
Not exactly. When querying a location you can query by a fixed partition key value and by the sort key, so you can limit the data a query returns and also apply a little filtering.
I have been racking my brain while designing a DynamoDB Geo Hash proximity locator service. For this example customer_A wants to find all service providers_X in their area. All customers and providers have a 'g8' key that stores their precise geoHash location (to 8 levels).
The accepted way to accomplish this search is to generate a secondary index from the main table with a less accurate geohash 'g4', which gives a broader area for the main query key. I am applying key overloading and composite key structures for a single-table design. The goal in this design is to return all the data required in a single query; secondary indexes can duplicate data by design (storage is cheap, but CPU and bandwidth are not).
GSI1PK    GSI1SK     providerId       Projected keys and attributes
--------------------------------------------------------------------
g4_9q5c   provider   pr_providerId1   name rating
g4_9q5c   provider   pr_providerId2   name rating
g4_9q5h   provider   pr_providerId3   name rating
Scenario 1: customer_A.g8_9q5cfmtk. You issue a query where GSI1PK=g4_9q5c, and a list of two providers is returned, not the three I desire.
Using geoHash.neighbor() will return the eight surrounding neighbors, like 9q5h (see reference below). That's great, because there is a provider in 9q5h, but it means I have to run nine queries, one on the center and eight on the neighbors, or run 1-N until I have the minimum results I require.
But which direction do I query second: NW, SW, E? This would require another level of hinting toward which neighbor has more results, without knowing first, unless you run a pre-query for weighted results. But then you run the risk of only returning favorable neighbors, as there could be new providers in previously unfavored neighbors. You could apply some ML and randomized queries into neighbors to check current counts.
Before the above approach I tried this design.
GSI1PK   GSI1SK        providerId     Projected keys and attributes
--------------------------------------------------------------------
loc      g8_9q5cfmtk   pr_provider1
loc      g8_9q5cfjgq   pr_provider2
loc      g8_9q5fe954   pr_provider3
Scenario 2: customer_A.g8_9q5cfmtk. You issue a query where GSI1PK=loc and GSI1SK is between g8_9q5ca and g8_9q5fz, and a list of three providers is returned, but a ton of data was pulled and discarded.
To achieve the above query, the between-X-and-Y sort criterion is composed as follows: 9q5c.neighbors().sorted() = 9q59, 9q5c, 9q5d, 9q5e, 9q5f, 9q5g, 9qh1, 9qh4, 9qh5. So we can just use X=9q59 and Y=9qh5, but there are over 50 (I really didn't count after 50) matching quadrants in such a UTF-ordered between range.
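For illustration, Scenario 2's range query might be expressed with boto3 roughly like this (the table and index names are the hypothetical ones from the example above):
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("providers")   # hypothetical table name

# One query spans the lowest and highest neighbouring geohash prefixes; everything
# sorting between them comes back and must be filtered client-side, which is the
# "ton of data pulled and discarded" drawback described above.
resp = table.query(
    IndexName="GSI1",                 # hypothetical GSI name
    KeyConditionExpression=Key("GSI1PK").eq("loc")
                           & Key("GSI1SK").between("g8_9q59", "g8_9qh5"),
)
providers = resp["Items"]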
Regarding the hash/size table above, I would recommend this reference: https://www.movable-type.co.uk/scripts/geohash.html
Geohash length   Cell width   Cell height
1                ≤ 5,000km  ×  5,000km
2                ≤ 1,250km  ×  625km
3                ≤ 156km    ×  156km
4                ≤ 39.1km   ×  19.5km
5                ≤ 4.89km   ×  4.89km
...

The most performant way to merge N lists, track duplicates, and sort them by date

I am new to Haskell and want to know the most efficient way to merge an arbitrary number of lists of an arbitrary number of items. Here's example data:
LIST 1: steve
2014-01-20 | cookies | steve
LIST 2: chris
2014-02-05 | cookies | chris
LIST 3: mark
2014-09-30 | brownies | mark
2014-03-30 | candy | mark
2014-05-12 | pie | mark
LIST 4: anthony
2014-05-18 | cookies | anthony
2013-12-25 | fudge | anthony
LIST 5: andy
2014-10-04 | cookies | andy
LIST 7: john
2014-06-19 | pie | john
RESULTING LIST
2014-10-04 | cookies | andy chris steve anthony
2014-09-30 | brownies | mark
2014-06-19 | pie | john mark
2014-03-30 | candy | mark
2013-12-25 | fudge | anthony
Notice the lists are all oriented around people and may or may not be sorted by date. The result needs to merge the prior lists and group them so that each dessert is unique but carries the list of people who ate it, sorted by date in reverse chronological order.
What the most performant way to solve a problem is, is in most cases not answerable, neither in Haskell nor in any other programming language, I think.
A better approach is to think about how to solve the problem at all, while keeping a few principles in the back of your mind:
testability
abstraction and expressiveness
maintainability
readability
performance
Maybe I've forgotten something, but for your problem I want to give a list of hints.
If I knew all the items and names in advance, I would use algebraic datatypes to model the situation:
data Name = Mark | Chris | ...
    deriving (Ord, Eq, Show)

data Item = Pie | Cookies | ...
    deriving (Ord, Eq, Show)
If I did not already know how Haskell represents dates, I could use a plain old String to model them, or I would use Hoogle to see if a date type already exists.
> hoogle date
...
Data.Time.Calendar...
...
So the Data.Time.Calendar module seems a good choice for that, and I would look at its documentation, which can be found online or, if you install the package locally, generated yourself from the source files with Haddock.
The next step would be to model the "database". Of course there exist libraries for SQL-y stuff, or acid-state, a database that uses algebraic datatypes instead of a database backend. But to get a better grasp of Haskell I would reinvent the wheel for once and use either a list of tuples or a dictionary-like collection, which in Haskell is called Map. Working with Map one has to be careful and do a qualified import, as most of its functions would otherwise collide with names from the standard library (Prelude).
import qualified Data.Map as M
To model my database I would use the items as keys and a tuple of the date and a list of names as the values, and since I want to be aware that this is my database, I would provide a type alias for it:
type DB = M.Map Item (Date, [Name])
To work with that I would again have a glance at the Map documentation and be happy to find the functions insertWith, empty and toList. For insertWith I would think of a mixture of max and the list cons (:) function to build new entries.
To get a better feel for the whole thing I would fire up ghci and import qualified Data.Map as M and fool around with some examples using M.Map String (String,[Int]) or whatnot to model my data in a first approximation.
For the result I have to sort the toList of my Map by date, which is only a small problem. The type of toList myDb is [(Item, (Date, [Name]))], so sorting with sortBy on fst . snd should lead to the desired result.
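A minimal sketch of that Map-based approach, assuming String for names and items and Day from Data.Time.Calendar (this only illustrates insertWith/toList/sortBy, not a full solution with parsing):
import qualified Data.Map as M
import Data.Time.Calendar (Day, fromGregorian)
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))

type Name = String
type Item = String
type DB   = M.Map Item (Day, [Name])

-- Insert one "name ate item on day" record: insertWith keeps the latest
-- date (max) and appends the new names to the existing list.
addEntry :: (Day, Item, Name) -> DB -> DB
addEntry (d, item, name) =
  M.insertWith (\(newD, newNs) (oldD, oldNs) -> (max newD oldD, newNs ++ oldNs))
               item (d, [name])

-- Merge any number of per-person lists into one database.
buildDB :: [[(Day, Item, Name)]] -> DB
buildDB = foldr addEntry M.empty . concat

-- Rows sorted reverse-chronologically by date.
report :: DB -> [(Item, (Day, [Name]))]
report = sortBy (comparing (Down . fst . snd)) . M.toList

main :: IO ()
main = mapM_ print . report . buildDB $
  [ [(fromGregorian 2014 1 20, "cookies", "steve")]
  , [(fromGregorian 2014 2 5,  "cookies", "chris")]
  , [(fromGregorian 2014 9 30, "brownies", "mark"), (fromGregorian 2014 3 30, "candy", "mark")]
  ]
Keeping max of the dates matches the example output above, where each dessert carries its latest date; Down is what flips the sort to reverse chronological order.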
After I'd done this much, I'd take a break and read something about parsers - to get all my files in context with my program. A search with the search engine of your least distrust will turn up a few articles worth reading (Parser Parsec Haskell).
If all of this is too complicated, I would go back and change all my types to be Strings and hope I don't have any typos, until I have time to read about parsers again ;-).
For any problems in the intermediate steps, people here will be glad to help you, assuming you provide a concrete question/problem description.
If all of this were not performant enough, the profiling tools provided by Haskell are good enough to help me, but that is the last concern I would solve.

Creating train, test and cross-validation datasets in sklearn (Python 2.7) with grouping constraints?

While creating train, test and cross-validation samples in Python, I see the default method as:
1. Reading the dataset, after skipping headers
2. Creating the train, test and cross-validation samples
import csv

with open('C:/Users/Train/Trainl.csv', 'r') as f1:
    next(f1)
    reader = csv.reader(f1, delimiter=',')
    input_set = []
    for row in reader:
        input_set.append(row)

import numpy as np
from numpy import genfromtxt
from sklearn import cross_validation

train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)
My problem, though, is that I have a field, say "A", in the csv file that I read into the numpy array, and all sampling should respect this field. That is, all entries with the same value of "A" should go into one sample.
Line # | A | B | C | D
  1    | 1 |
  2    | 1 |
  3    | 1 |
  4    | 1 |
  5    | 2 |
  6    | 2 |
  7    | 2 |
Required: lines 1, 2, 3 and 4 should go into one sample together, and lines 5, 6 and 7 should go into one sample together.
The value of column A is a unique id corresponding to one single entity (it can be seen as cross-sectional data points on one SINGLE user, so it MUST go into one unique sample of train, test, or cv), and there are many such entities, so grouping by entity id is required.
The B, C, D columns may have any values, and grouping does not need to be preserved on them. (Bonus: can I group the sampling on multiple fields?)
What I tried:
A. Finding all unique values of A and denoting this as my sample, I distribute the sample amongst train, intermediate (cv) and test, then put the rest of the rows for each value of "A" into whichever file that value landed in.
That is, if train got the entry for "3", test got "2" and cv got "1", then all rows with A = 3 go into train, all with A = 2 go into test, and all with A = 1 go into cv.
Of course this approach is not scalable.
I also suspect it may introduce bias into the datasets, since the number of 1's in column A, the number of 2's, etc. is not equal, meaning this approach will not work!
B. I also tried numpy.random.shuffle, or numpy.random.permutation as per the thread here - Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation? , but it did not meet my requirement.
C. A third option, of course, is writing a custom function that does this grouping and then balances the training, test and cv datasets based on the number of data points in each group. But I am just wondering whether there's already an efficient way to implement this. (A rough sketch of what I mean follows below.)
Note my dataset is huge, so ideally I would like a deterministic way to partition my datasets, without multiple eye-ball scans to be sure that the partition is correct.
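To make option C concrete, something like the following deterministic, hash-based assignment is what I have in mind (assuming, for illustration, that column A is the first field of each row in input_set above):
import hashlib

def split_of(group_id, train=0.6, cv=0.2):
    # Hash the group id to a stable bucket in [0, 1), so the same id always
    # lands in the same split on every run (md5 is deterministic, unlike hash()).
    bucket = int(hashlib.md5(str(group_id).encode()).hexdigest(), 16) % 1000 / 1000.0
    if bucket < train:
        return 'train'
    if bucket < train + cv:
        return 'cv'
    return 'test'

splits = {'train': [], 'cv': [], 'test': []}
for row in input_set:                      # input_set as read above
    splits[split_of(row[0])].append(row)   # row[0] assumed to be column A
This keeps every entity in exactly one split, but the split sizes only approximate 60/20/20 in proportion to the group sizes, which may or may not be acceptable given the bias concern above.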
EDIT Part 2:
Since I did not find anything that fits my sampling criteria, I actually wrote a module to sample with grouping constraints. This is the GitHub code for it. The code was not written with very large data in mind, so it's not very efficient. Should you fork this code, please point out how I can improve the run-time.
https://github.com/ekta1007/Sampling-techniques/blob/master/sample_expedia.py
By forcing such constraints you will introduce bias to your procedure either way. So an approach based on partitioning the "users" and then collecting their respective "measurements" does not seem bad. It will scale just fine; this is an O(n) method, and the only reason for it not scaling up would be a bad implementation, not a bad method.
The reason there is no such functionality in existing methods (like the sklearn library) is that it looks highly artificial and runs counter to the idea behind machine learning models. If these rows are somehow one entity, then they should not be treated as separate data points. If you need this separate representation, then requiring a division in which a particular entity cannot be partially in the test set and partially in the training set will surely bias the whole model.
To sum up: you should really deeply analyze whether your approach is reasonable from the machine learning point of view. If you are sure about it, I think the only possibility is to write the segmentation yourself, because even though I have used many ML libraries in the past, I've never seen such functionality.
In fact I am not sure whether the problem of segmenting a set of N numbers (the sizes of the entities) into K (= 3) subsets with given sum proportions, drawn with uniform distribution when treated as a random process, is not an NP problem in itself. If you cannot guarantee a uniform distribution, then your datasets cannot be used as a statistically correct way of training/testing/validating your model. Even if it has a reasonable polynomial solution, it can still scale badly (much worse than linear methods). This doubt applies if your constraints are "strict"; if they are "weak" you can always take a "generate and reject" approach, which should have amortized linear complexity.
I was also facing a similar kind of issue. Though my coding is not too good, I came up with the solution given below:
Created a new data frame that only contains the Unique Id of the df and removed duplicates.
new = df[["Unique_Id"]].copy()
New_DF = new.drop_duplicates()
Created training and test set on the basis of New_DF
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

train, test = train_test_split(New_DF, test_size=0.2)
And then merged those training and test set with original df.
df_Test = pd.merge(df, test, how='inner', on='Unique_Id')
df_Train = pd.merge(df, train, how='inner', on='Unique_Id')
Similarly, we can create sample for the validation part too.
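For instance, a sketch of extending the same idea to a three-way split (variable and column names follow the hypothetical ones above):
from sklearn.model_selection import train_test_split
import pandas as pd

# Split the de-duplicated ids 60/20/20, then join back to the full rows,
# so every row sharing a Unique_Id lands in exactly one of the three sets.
train_ids, rest_ids = train_test_split(New_DF, test_size=0.4, random_state=42)
cv_ids, test_ids = train_test_split(rest_ids, test_size=0.5, random_state=42)

df_Train = pd.merge(df, train_ids, how='inner', on='Unique_Id')
df_CV    = pd.merge(df, cv_ids,    how='inner', on='Unique_Id')
df_Test  = pd.merge(df, test_ids,  how='inner', on='Unique_Id')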
Cheers.

What's faster? Searching for shortest path in a matrix or a list?

I have to store some cities and the distances between some of them and then search for the shortest path. The cities and the distances are read from a file.
I started with a matrix but saw that it took too much space (more than double), so I changed to a list. Each list item stores three things: point1, point2 and the distance between them.
So for example I have this file:
Athens Stockholm 34
Stockholm Prague 23
which, when read, is stored in the array like this:
           ____0____   ____1____
point1   | Athens    | Stockholm |
point2   | Stockholm | Prague    |
distance | 34        | 23        |
           ---------   ---------
Then I got some doubts. This surely saves space, but is it going to take more time to traverse? The list is an array, but the connections (edges) are placed in an arbitrary order, which is why I started thinking it may take more time than if I used a matrix.
You might want to look into the adjacency-list representation of graphs, which is a modified version of your second idea that's best suited for shortest-path problems. The idea is to have a table of nodes where, for each node, you store a list of the edges leaving that node. This lets you iterate over all the edges leaving a node in time proportional to that node's number of outgoing edges, rather than the total number of edges in the graph (as in your list version) or the total number of nodes (as in the matrix version). For this reason, most fast graph-search algorithms use adjacency lists.
Hope this helps!
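A minimal C++ sketch of that adjacency-list idea, assuming city names are first mapped to integer ids (the container choices here are just one reasonable option, not the only one):
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::unordered_map<std::string, int> id;   // city name -> dense integer id
    std::vector<std::string> names;            // id -> city name

    auto getId = [&](const std::string& city) {
        auto it = id.find(city);
        if (it != id.end()) return it->second;
        id[city] = static_cast<int>(names.size());
        names.push_back(city);
        return static_cast<int>(names.size()) - 1;
    };

    // adj[u] lists (neighbour, distance) pairs: only the edges that actually exist.
    std::vector<std::vector<std::pair<int, int>>> adj;
    auto addEdge = [&](const std::string& a, const std::string& b, int dist) {
        int u = getId(a), v = getId(b);
        adj.resize(names.size());
        adj[u].push_back({v, dist});
        adj[v].push_back({u, dist});           // undirected road
    };

    addEdge("Athens", "Stockholm", 34);
    addEdge("Stockholm", "Prague", 23);

    // Iterating the neighbours of one city touches only that city's own edges.
    for (const auto& [v, d] : adj[getId("Stockholm")])
        std::cout << "Stockholm -> " << names[v] << " : " << d << "\n";
}
BFS or Dijkstra can then run directly on adj, relaxing only the edges each visited node actually has.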
Separate the names from the distances. Create a list that just contains city names.
       ____0____   ____1____   ____2____
city | Athens    | Stockholm | Prague    |
       ---------   ---------   ---------
Then create the matrix separately
    __0__   __1__   __2__
0 |   0   |  34   |   0   |
    -----   -----   -----
1 |  34   |   0   |  23   |
    -----   -----   -----
2 |   0   |  23   |   0   |
    -----   -----   -----
If you want to search, say, a route from Prague to Athens, then you start by finding where Prague and Athens are in the list...
Prague: 2
Athens: 0
Then you search through the matrix for your path.
(2,1,23) -> (1,0,34)
Finally, you translate to cities using your list
(Prague, Stockholm, 23) ->
(Stockholm, Athens, 34)
I think adjacency lists are surely the best option here. They're also very useful later, when it comes to algorithms like DFS/BFS or Dijkstra.
If you don't know how to keep both towns and distances, just use some structure that keeps them together. If you can refer to towns by numeric indices, a simple pair structure will do (the easiest implementation would be n STL vectors of pairs).
Of course, if you don't want to use the STL, you can implement your own lists of structures with pointers.
Your approach looks just fine.
To relax your mind, remember that parsing a single list/array will always be faster and more resource-friendly than working with two (or more) lists when you practically just need to look up a single line/entry of predefined data.
I tend to disagree with some of the other answers here, since I do not see any need to complicate things. Looking up several data cells, plus the additional need to combine those cells to produce a resulting data set (as some answers proposed), takes more steps than a simple one-time run over a list to fetch a line of data. You would merely risk losing CPU cycles and memory on functions that look up and combine distributed data cells across several lists, while the data you currently use is already combined into a collection of ready results.
Put more simply: a straightforward run-down of the list/array you currently have is hard to beat when it comes to speed.