So if we want a collection of unique items we can use a 'set'.
If we already have a collection of items that we want to dedupe, we could pass them to the set function, or alternatively we could use the distinct or dedupe functions.
What are the situations for using each of these (pros/cons)?
Thanks.
The differences are:
set will create a new set collection eagerly.
distinct will create a lazy sequence with duplicates from the input collection removed. It has an advantage over set when you process big collections, since laziness can save you from eagerly evaluating the whole input collection (e.g. when combined with take).
dedupe removes only consecutive duplicates from the input collection, so its semantics differ from set and distinct. For example, it returns (1 2 3 1 2 3) when applied to (1 1 1 2 3 3 1 1 2 2 2 3 3).
Sets and lazy seqs have different APIs available (e.g. disj and get vs. nth) and different performance characteristics (e.g. O(log32 n) lookup for a set vs. O(n) for a lazy seq), so choose depending on how you want to use the result.
Additionally, distinct and dedupe return a transducer when called with no arguments.
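For concreteness, a quick REPL sketch (return values shown in comments; the printed ordering of the set is unspecified):

(set [1 1 2 3 3 1])                     ; #{1 2 3}       - eager, returns a set
(distinct [1 1 2 3 3 1])                ; (1 2 3)        - lazy seq, keeps first-occurrence order
(dedupe [1 1 1 2 3 3 1 1 2 2 2 3 3])    ; (1 2 3 1 2 3)  - only consecutive duplicates removed
(into [] (distinct) [1 1 2 3 3 1])      ; [1 2 3]        - transducer arity of distinct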
When we specify the data for a set we have the ability to give it tuples of data. For example, we could write in our .dat file the following:
set A :  1  2  3  :=
     1   +  -  -
     2   -  -  +
     3   -  +  +
This would specify that we would have 4 tuples in our set: (1,1), (2,3), (3,2), (3,3)
But I guess that I am struggling to understand exactly why we would want to do this? Furthermore, suppose we instantiated a Set object in our code as:
model.Aset = RangeSet(4, dimen=2)
Would this then specify that our tuples would have the indices 1, 2, 3, and 4?
I am thinking that specifying tuples in our set could potentially be useful when working with some data in which it's important to have a bit of a "spatial" understanding of the problem. But I would be curious to hear from the community what the potential applications of specifying set data this way might be.
The most common place this appears is when you're trying to model edges between nodes in a network. Networks aren't usually completely dense (i.e. with edges between every pair of nodes), so it's beneficial to represent just the edges that appear using a sparse set of tuples.
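For instance, here is a minimal Pyomo sketch of a sparse edge set (the node and edge values are made up for illustration):

from pyomo.environ import ConcreteModel, Set

model = ConcreteModel()
model.Nodes = Set(initialize=[1, 2, 3, 4])
# Only the arcs that actually exist, as 2-tuples, rather than the full Nodes x Nodes cross product
model.Edges = Set(within=model.Nodes * model.Nodes, dimen=2,
                  initialize=[(1, 2), (2, 3), (3, 1)])

Constraints and parameters indexed by model.Edges then only touch the arcs that are present, which is what the tuple form of the set data buys you.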
A high-level overview with a simple integer order value, to get my point across:
id (primary) | order (sort) | attributes ...
-------------|--------------|---------------
ft8df34gfx   | 1            | ...
ft8df34gfx   | 2            | ...
ft8df34gfx   | 3            | ...
ft8df34gfx   | 4            | ...
ft8df34gfx   | 5            | ...
Usually it would be easy to change the order (e.g. if the user drags and drops list items on the front-end): shift the item around, calculate new order values, and update the affected items in the db with their new order.
Constraints:
The client doesn't have all the items at once, only a subset of them (think pagination)
Update only a single item in the db if a single item is moved (1 item per shift)
My initial idea:
Use the epoch as the order and append something unique to avoid duplicate epoch times, e.g. <epoch>#<something-unique-to-item>. The initial value is the insertion time (the default order is therefore newest first).
The client/server (whoever calculates the order) knows the epoch for each item in the subset of items it has.
If an item is shifted, look at the epochs of the previous and next items (if it has them; it could also be moved to first or last), pick a value in between, and update. More than one shift? Repeat the process.
But..
If items are shifted enough times, the epoch values get closer and closer to each other until you can't find a middle ground with whole integers.
Add lots of zeroes to the epoch on insert? You still reach the limit at some point.
If an item is shifted to first or last and there are items on the previous or next page (remember, pagination), we don't know those values and can't reliably find a "value in between".
Fetch one extra hidden item from the previous and next page? Querying gets complicated.
Is this even possible? What type/value should I use as order?
DynamoDB does not allow the primary partition and sort keys to be changed for a particular item (to change them, the item would need to be deleted and recreated with the new key values), so you'll probably want to use a local or global secondary index instead.
Assuming the partition/sort keys you're mentioning are for a secondary index, I recommend storing natural numbers for the order (1, 2, 3, etc.) and then updating them as needed.
Effectively, you would have three cases to consider:
Adding a new item - You would perform a query on the secondary index's partition key with ScanIndexForward = false (to reverse the results order), with a projection on the "order" attribute, limited to 1 result. That gives you the maximum order value so far, and the new item's order is just that maximum value + 1 (see the query sketch after this list).
Removing an item - It may seem unsettling at first, but you can freely remove items without touching the orders of the other items. You may have some holes in your ordering sequence, but that's ok.
Changing the order - There's not really a way around it; your application logic will need to take the list of affected items and write all of their new orders to the table. If the items used to be (A, 1), (B, 2), (C, 3) and they get changed to A, C, B, you'll need to write to both B and C to update their orders accordingly so they end up as (A, 1), (C, 2), (B, 3).
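As a rough boto3 sketch of the "adding a new item" case (the table, index, and attribute names here are assumptions; adjust them to your schema):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Items")             # hypothetical table name

# Query the secondary index in descending order, limited to 1 result,
# to fetch the current maximum "order" value for one list.
resp = table.query(
    IndexName="ListOrderIndex",                               # hypothetical GSI: partition = list_id, sort = order
    KeyConditionExpression=Key("list_id").eq("ft8df34gfx"),
    ScanIndexForward=False,                                    # reverse the results order
    Limit=1,
    ProjectionExpression="#o",
    ExpressionAttributeNames={"#o": "order"},                  # "order" is a reserved word, so alias it
)
max_order = resp["Items"][0]["order"] if resp["Items"] else 0
new_order = max_order + 1                                      # order value for the item being added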
Consider the fictional data below, which illustrates my problem; the real data contains thousands of rows.
Figure 1
Each individual is characterized by values attached to A, B, C, D, E. In figure 1, I show 3 individuals for which some characteristics are missing. Do you have any idea how I can get the completed table shown in figure 2?
Figure 2
With an ID as in figure 1, I could have used the carryforward command to fill in the values. But since each individual has a different number of rows, I don't know how to create the ID.
Edit: All individuals share the characteristic "A".
Edit: the existing order of observations is informative.
To detect the change of id, the idea is to compare, in each row, whether the preceding value of char is >= the current value.
This works only if your data are ordered, but that appears to be guaranteed in your data.
gen id = 1 if (char[_n-1] >= char[_n]) | _n == 1   // flag the first observation of each individual
replace id = sum(id) if id == 1                    // running sum turns the flags into 1, 2, 3, ...
replace id = id[_n-1] if missing(id)               // carry each id down to the individual's other rows
fillin id char                                     // add the missing id/char combinations
drop _fillin                                       // drop the indicator variable created by fillin
If one individual has only the characteristics A and C and another has only D and E, this won't work, but that case seems impossible to detect with your data anyway.
I have an unbalanced panel whose panel id variable is member.
I would like to delete particular members from the data set (i.e. every observation in which they appear), specifically those members whose values appear in a list/vector.
If I have the list of values of member (say 1, 3, 10, 17, 173, 928)
I would like a way to drop every observation where the panel id (member) is contained in the list.
The list is ~1500 values long, so rather than manually typing
drop if member == 1
drop if member == 3
drop if member == 10
drop if member == 928
I would like to somehow automate this process.
@Brendan Cox (namesake, not a relative) has the nub of the matter. To expand a bit:
Note first that
drop if inlist(member,1,3,10,17,173,928)
would be an improvement on your code, but is both illegal and impractical for a very large number of values: here 1500 or so certainly qualifies as very large.
At some critical point it becomes a much better idea to put the identifiers in a file and merge. For more on the spirit of this, see http://www.stata.com/support/faqs/data-management/selecting-subset-of-observations/
It's not a paradox that you merge here (temporarily making a bigger dataset) even though you want to make a smaller dataset. merge identifies the intersection of the datasets, which is precisely those observations you wish to drop. merge to create unions of datasets merely happens to be the main and most obvious motive for using the command, but there are others.
You do not specify how the list is structured. Please remember to post all details relevant to your problem.
Below are two examples.
clear
set more off
*----- case 1 (list in another .dta file) -----
// a hypothetical list
input ///
idcode
1
3
end
list
tempfile mylist
save "`mylist'"
// rest of data
clear
use http://www.stata-press.com/data/r13/union.dta
list if idcode <= 4, sepby(idcode)
merge m:1 idcode using "`mylist'", keep(master)
list if idcode <= 4, sepby(idcode)
*----- case 2 (list in a macro) -----
clear
use http://www.stata-press.com/data/r13/union.dta
// a hypothetical list
local mylist 1, 3
drop if inlist(idcode, `mylist')
list if idcode <= 4, sepby(idcode)
help inlist mentions the following limit:
The number of arguments is between 2 and 255 for reals and between 2 and 10 for strings.
I have a table where the entries are something like this
Row | Column1 | Column2 | Column3 | Column4
----|---------|---------|---------|--------
1   | 0x0A    | 1       | 2       | A
2   | 0x0B    | 2       | 2       | B
3   | 0x0C    | 3       | 2       | C
Now I want to use a map such that I can use Column1 or Column2 as the key to get the row.
What kind of map should I use to achieve this?
(Note: the table is just for explanation and not the exact requirement.)
I thought of using a multimap, but that is not going to solve the problem.
Try multi-index containers from Boost (Boost.MultiIndex).
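For example, a minimal Boost.MultiIndex sketch with one container indexed by both columns (the field names are illustrative):

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>
#include <string>

struct Row {
    std::string col1;   // e.g. "0x0A"
    int         col2;   // e.g. 1
    int         col3;
    std::string col4;
};

namespace bmi = boost::multi_index;

// One container, two unique hashed indices: index 0 on col1, index 1 on col2.
using RowTable = bmi::multi_index_container<
    Row,
    bmi::indexed_by<
        bmi::hashed_unique<bmi::member<Row, std::string, &Row::col1>>,
        bmi::hashed_unique<bmi::member<Row, int, &Row::col2>>
    >
>;

int main() {
    RowTable rows;
    rows.insert(Row{"0x0A", 1, 2, "A"});
    auto it1 = rows.get<0>().find("0x0A");   // look up by column 1
    auto it2 = rows.get<1>().find(1);        // look up by column 2
    return (it1 != rows.get<0>().end() && it2 != rows.get<1>().end()) ? 0 : 1;
}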
Define a class similar to pair, with a custom comparator that indicates equality if either the first member or the second member matches (but not necessarily both), and use that class as your key type. You would probably need, for each member, a particular value that will never appear in your data, to use as the default in your constructor; this avoids keys where only the first member has been initialized occasionally matching on the second member due to leftover data.
You could use one map from Column1 to the row and another from Column2 to the row; repeat for as many columns as needed.
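A rough sketch of that approach (the Row struct and names are illustrative; keeping the maps in sync on insert and erase is up to you):

#include <map>
#include <memory>
#include <string>

struct Row {
    std::string col1;
    int         col2;
    int         col3;
    std::string col4;
};

// Two plain maps pointing at the same shared row objects.
std::map<std::string, std::shared_ptr<Row>> byCol1;
std::map<int, std::shared_ptr<Row>>         byCol2;

void addRow(Row r) {
    auto p = std::make_shared<Row>(std::move(r));
    byCol1[p->col1] = p;   // key on column 1
    byCol2[p->col2] = p;   // key on column 2
}

The cost is the duplicated bookkeeping on insert and erase, which is essentially what Boost.MultiIndex automates for you.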