Algorithm to Group Selected Numbers In a List - list

Given a list of consecutive and unique numbers where some are selected and others are not, I need to create groups that contain all selected numbers. The number of groups should be kept to a minimum, and the number of non-required values in the groups should also be kept to a minimum. The max size of the groups is also a variable.
Example list, where * indicates selected number, and group size is limited to 5:
1*,2,3,4,5*,6*,7,8*,9
The most optimized groups would be [(1) and (5,6,7,8)].
[(1,2,3,4,5) and (6,7,8)] is another possible answer, but it contains more non-selected values, thus is not desirable.
Is there a name for this type of algorithm? I don't need someone to write the code for me, just looking for pointers if this problem is already well known.
For those curious what this is for, I am trying to optimize Modbus TCP register requests. A user may define a list of registers they need, and only continuous groups of registers may be requested at a time. Due to TCP latency, we want to make as few requests as possible, and only request the minimum number of non-required registers.

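Try this. It collects runs of consecutive selected numbers and starts a new group whenever it meets an unselected number or the current group is full:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
selected = {1, 5, 6, 8}  # the starred numbers from the question
max_group_size = 4  # here you put your max size

def is_selected(n):
    return n in selected

groups, current_group = [], []
for n in numbers:
    if is_selected(n):
        if len(current_group) == max_group_size:  # group is full, start a new one
            groups.append(current_group)
            current_group = []
        current_group.append(n)
    elif current_group:  # an unselected number closes the current group
        groups.append(current_group)
        current_group = []
if current_group:  # don't lose the last group
    groups.append(current_group)
Here is_selected is a function that tells you whether a number is selected. Note that this never bridges a gap: every unselected number ends the current group.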
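If you also want groups that are allowed to span a few unselected numbers, one way to get an exact answer is a small dynamic program over the sorted selected numbers. This is only a sketch under my reading of the question (fewest groups first, then fewest non-required values, with at least one value per group); the name group_registers is made up for the example.

```python
def group_registers(selected, max_size):
    """Return (start, end) ranges that cover every selected number.

    Minimizes the number of ranges first, then the number of
    non-selected values inside them. Assumes max_size >= 1.
    """
    sel = sorted(set(selected))
    n = len(sel)
    # best[i] = (group count, wasted values, start index of the last group)
    # for the cheapest way to cover sel[:i]
    best = [(0, 0, -1)] + [None] * n
    for i in range(1, n + 1):
        for j in range(i - 1, -1, -1):
            span = sel[i - 1] - sel[j] + 1
            if span > max_size:
                break  # moving j further left only widens the span
            groups, waste, _ = best[j]
            cand = (groups + 1, waste + span - (i - j), j)
            if best[i] is None or cand[:2] < best[i][:2]:
                best[i] = cand
    # Walk back through the stored start indices to rebuild the ranges.
    ranges = []
    i = n
    while i > 0:
        j = best[i][2]
        ranges.append((sel[j], sel[i - 1]))
        i = j
    return ranges[::-1]


print(group_registers([1, 5, 6, 8], 5))
```

For the example in the question this yields the ranges 1..1 and 5..8, i.e. (1) and (5,6,7,8).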

Related

how to find a pattern which is repeated n number of times in a column of a table in informatica

I have a scenario in which a field of a particular record in my table looks like below (array format)
The set of id, email and address can be repeated n times for each record, so I need to set up a mapping in Informatica that gives me output like below:
...waiting for a solution, thanks
I tried the substr and instr functions, but with those I need to know beforehand how many times the email occurs in a particular record. Since the email can be repeated any number of times in each row, I can't find a way to dynamically tell the instr function to run n times.

Application for filtering database for the short period of time

I need to create an application that lets me get the phone numbers of users matching specific conditions as fast as possible. For example, we've got 4 columns in a SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table with specific region and income. Just running a SQL query won't help because it takes a significant amount of time. The database updates once per day, so I have some time to prepare the data as I wish.
The question is: how would you make getting phone numbers with specific conditions as fast as possible? O(1) in the best scenario. Consider storing the values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number create something like a bitset: 0 if a particular condition is false and 1 if it is true. But I'm not sure I can implement it for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get phone numbers - iterate for the 2nd vector and compare bitsets with required one.
It's not O(1) at all, and I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something good with std::unordered_map (all phone numbers are unique) or to improve my idea with vectors and masks.
P.S. The SQL table consumes 4GB of memory and I can store up to 8GB in RAM. There are 500 columns.
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
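As a rough illustration of the composite index, here is a self-contained sketch using Python's built-in sqlite3 module. The table name, column types and sample rows are invented for the example; your real database engine and schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (phone TEXT PRIMARY KEY, region TEXT, income INTEGER, age INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        ("555-0001", "north", 50000, 31),
        ("555-0002", "south", 50000, 45),
        ("555-0003", "north", 50000, 27),
        ("555-0004", "north", 70000, 52),
    ],
)
# One index over both filter columns, in the order they are queried.
conn.execute("CREATE INDEX idx_region_income ON users (region, income)")

query = "SELECT phone FROM users WHERE region = ? AND income = ? ORDER BY phone"
phones = [row[0] for row in conn.execute(query, ("north", 50000))]
print(phones)

# Ask SQLite how it plans to run the query, to check the index is picked up.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("north", 50000)):
    print(row)
```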
If you really want it to be fast I think you should consider ElasticSearch. Think of every phone in the DB as a doc with properties (your columns).
You will need to reindex the table once a day (or in realtime) but when it's time to search you just use the filter of ElasticSearch to find the results.
Another option is to have an index on every column. In this case the engine can do an Index Merge to increase performance. I would also consider using MEMORY tables. If you also write to this table, consider having a read replica just for reads.
To optimize your table, save your queries somewhere and add multi-column indexes just for the top X most popular searches, depending on your memory limitations.
You can use NVMe storage as your DB disk (if you can't load it into memory).
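If you would rather keep everything in RAM, as the question suggests, the same idea can be sketched with a hash map keyed by the filter values. This assumes exact-match lookups on (region, income) and a table that fits in memory; build_index and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_index(rows):
    """Map (region, income) to the list of phone numbers with those values."""
    index = defaultdict(list)
    for phone, region, income in rows:
        index[(region, income)].append(phone)
    return index


rows = [
    ("555-0001", "north", 50000),
    ("555-0002", "south", 50000),
    ("555-0003", "north", 50000),
]
index = build_index(rows)  # rebuilt once a day, after the database update
print(index[("north", 50000)])
```

Each lookup is then a single average-case O(1) dictionary access, at the cost of building one such map per combination of columns you want to filter on.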

Order items with single write

High level overview with simple integer order value to get my point across:
id (primary) | order (sort) | attributes ..
----------------------------------------------------------
ft8df34gfx 1 ...
ft8df34gfx 2 ...
ft8df34gfx 3 ...
ft8df34gfx 4 ...
ft8df34gfx 5 ...
Usually it would be easy to change the order (e.g if user drags and drops list items on front-end): shift item around, calculate new order values and update affected items in db with new order.
Constraints:
Doesn't have all the items at once, only a subset of them (think pagination)
Update only a single item in db if single item is moved (1 item per shift)
My initial idea:
Use epoch as order and append something unique to avoid duplicate epoch times, e.g <epoch>#<something-unique-to-item>. Initial value is insertion time (default order is therefore newest first).
Client/server (whoever calculates order) knows the epoch for each item in subset of items it has.
If an item is shifted, look at the epoch of the previous and next item (if it has a previous or next; it could be moved to first or last), pick a value between them and update. More than one shift? Repeat the process.
But..
If items are shifted enough times, epoch values get closer and closer to each other until you can't find a middle ground with whole integers.
Add lots of zeroes to epoch on insert? Still reach limit at some point..
If item is shifted to first or last and there are items in previous or next page (remember, pagination), we don't know these values and can't reliably find a "value between".
Fetch 1 extra hidden item from previous and next page? Querying gets complicated..
Is this even possible? What type/value should I use as order?
DynamoDB does not allow the primary partition and sort keys to be changed for a particular item (to change them, the item would need to be deleted and recreated with the new key values), so you'll probably want to use a local or global secondary index instead.
Assuming the partition/sort keys you're mentioning are for a secondary index, I recommend storing natural numbers for the order (1, 2, 3, etc.) and then updating them as needed.
Effectively, you would have three cases to consider:
Adding a new item - You would perform a query on the secondary partition key with ScanIndexForward = false (to reverse the results order), with a projection on the "order" attribute, limited to 1 result. That will give you the maximum order value so far. The new item's order will just be this maximum value + 1.
Removing an item - It may seem unsettling at first, but you can freely remove items without touching the orders of the other items. You may have some holes in your ordering sequence, but that's ok.
Changing the order - There's not really a way around it; your application logic will need to take the list of affected items and write all of their new orders to the table. If the items used to be (A, 1), (B, 2), (C, 3) and they get changed to A, C, B, you'll need to write to both B and C to update their orders accordingly so they end up as (A, 1), (C, 2), (B, 3).
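For the reordering case, the application logic boils down to comparing the old and new sequences and writing only the items whose position changed. A minimal sketch (changed_orders is a made-up helper, and the actual DynamoDB writes are left out):

```python
def changed_orders(old_ids, new_ids):
    """Return {item_id: new_order} for items whose 1-based position changed."""
    old_pos = {item_id: i for i, item_id in enumerate(old_ids, start=1)}
    return {
        item_id: i
        for i, item_id in enumerate(new_ids, start=1)
        if old_pos.get(item_id) != i
    }


updates = changed_orders(["A", "B", "C"], ["A", "C", "B"])
print(updates)
```

In the A, C, B example only C and B come back, so those are the two items you would write.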

Ordering by sum of difference

I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and order them according to the sum of all diffs with another list:
def diff(l1, l2):
    return sum(abs(v1 - v2) for v1, v2 in zip(l1, l2))

list2 = [0.3, 0, 1, 0.5]
# a QuerySet has no sort(), so load the rows into a list first
entries = list(Model.objects.all())
entries.sort(key=lambda t: diff(t.values, list2))
This is fast when the number of entries is very small. But I'm afraid that with a large number of entries, comparing and sorting all of them will get slow, since they have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over a list more than 4 times!
This approach looks pretty, but it's not efficient.
One thing that you can do is:
Keep a variable called last_diff and set it to 0.
Iterate through all entries.
Iterate through each entry.values.
From i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i]).
Sum these values in a variable called new_diff.
If new_diff > last_diff, break from the inner loop and push the entry into its right place (this is called insertion sort, check it out!).
This way, in the average case, the time complexity is much lower than what you are doing now!
You may need to be creative too. I'm going to share some ideas; check them yourself to make sure they hold.
Assuming that:
the elements of values are always positive floats, and
list2 is always the same for all entries,
then you may be able to say that the bigger the sum of the elements in values, the bigger the diff is going to be, no matter what the elements of list2 are.
Then you might be able to just forget about the whole diff function. (Test this!)
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to actually write a query for that in almost pure Django:
from django.contrib.postgres.fields import ArrayField
from django.db import models

class Unnest(models.Func):
    function = 'UNNEST'

class Abs(models.Func):
    function = 'ABS'

class SubquerySum(models.Subquery):
    # wrap the subquery so that its rows are summed into a single value
    template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'

x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
    pairdiff=Abs(Unnest('values') - Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
    diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The unnest function turns each element of the values into a row. In this case it happens twice, but the two resulting columns are instantly subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds. The column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as with zip, i.e. when one array is longer than the other, the remainder is ignored.
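To make the zip behaviour concrete, here is the plain-Python version of the distance with inputs of unequal length (integers chosen only to keep the arithmetic exact):

```python
def diff(l1, l2):
    # zip stops at the shorter input, so trailing elements are ignored
    return sum(abs(v1 - v2) for v1, v2 in zip(l1, l2))


print(diff([1, 2, 3, 10], [1, 1]))
```

Only the first two pairs are compared, so the 3 and the 10 contribute nothing to the sum.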
Further improvements
While it will be a lot faster already when you don't have to retrieve all rows anymore and loop over them in Python, it doesn't change yet that it results in a full table scan. All rows will have to be processed, every single time. You can do better, though. Have a look into the cube extension. Use it to calculate the L1 distance—at least, that seems what you're calculating—directly with the <#> operator. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change the type from float[] to cube. In case of the latter, you might have to implement your own CubeField too—I haven't found any package yet that provides it. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed hence blazing fast.

Finding the most common three-item sequence in a very large file

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence of all. The log files are too large to be held in main memory at once.
Sample log file:
User ID  Page ID
A          1
A          2
A          3
B          2
B          3
C          1
B          4
A          4
Corresponding results:
A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence
My idea is to use two hash tables. The first hashes on user ID and stores its sequence; the second hashes three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
However, since I have to use two hash tables, memory cannot hold everything at once, and I have to use disk. It is not efficient to access disk very often.
How can I do this better?
If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop the least recently used entries.
If you want an exact result, use an external sort procedure to sort the logs by userid, then combine every 3 consecutive entries and sort again, this time by page IDs.
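A minimal sketch of the approximate variant, in Python for brevity: a counter with a fixed capacity that forgets the least recently updated key when it overflows. The class name and its methods are made up for the illustration, and the result is only approximate because an evicted key loses its count.

```python
from collections import OrderedDict


class LRUCounter:
    """Counter that keeps at most `capacity` keys, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = OrderedDict()

    def add(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.counts.move_to_end(key)  # mark as most recently used
        if len(self.counts) > self.capacity:
            self.counts.popitem(last=False)  # drop the least recently used key

    def most_common(self):
        return max(self.counts.items(), key=lambda kv: kv[1])


counter = LRUCounter(capacity=2)
for seq in ["2-3-4", "1-2-3", "2-3-4", "3-4-5", "2-3-4"]:
    counter.add(seq)
print(counter.most_common())
```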
Update (sort by timestamp)
Some preprocessing may be needed to properly use logfiles' timestamps:
If the logfiles are already sorted by timestamp, no preprocessing needed.
If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all these files and use merge sort to read them.
If files are almost sorted by timestamp (as if several independent processes write logs to single file), use binary heap to get data in correct order.
If files are not sorted by timestamp (which is not likely in practice), use external sort by timestamp.
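For the case of several files that are each already sorted, the merge step can be sketched with Python's heapq.merge, which reads the inputs lazily. The (timestamp, user, page) record layout is an assumption for the example.

```python
import heapq


def merge_logs(*logs):
    """Merge iterables of (timestamp, user, page) records, each sorted by timestamp."""
    return heapq.merge(*logs, key=lambda rec: rec[0])


log_a = [(1, "A", 1), (4, "A", 2)]
log_b = [(2, "B", 2), (3, "B", 3)]
merged = list(merge_logs(log_a, log_b))
print(merged)
```

In practice each input would be a generator over a file rather than a list, so only one record per file is held in memory at a time.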
Update2 (Improving approximate method)
The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may follow different patterns at different times of day, or may differ on weekends. The original approach may give poor results for such data. To improve this, a hierarchical LRU queue may be used.
Partition the LRU queue into log(N) smaller queues, with sizes N/2, N/4, ... The largest one should accept any element, the next one only elements seen at least 2 times, the next one only elements seen at least 4 times, and so on. If an element is removed from some sub-queue, it is added to another one, so it lives in all sub-queues that are lower in the hierarchy before it is completely removed. Such a priority queue still has O(1) complexity, but allows a much better approximation of the most popular page.
There's probably syntax errors galore here, but this should take a limited amount of RAM for a virtually unlimited length log file.
#include <array>
#include <climits>
#include <cstddef>
#include <fstream>
#include <functional>
#include <iostream>
#include <unordered_map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence; //[0] is the newest page, [2] the oldest
typedef int sequence_count;

//std::array has no std::hash specialization, so supply one
struct sequence_hash {
    std::size_t operator()(const sequence& s) const {
        std::size_t h = 0;
        for (pageid p : s)
            h = h * 1000003u + std::hash<pageid>()(p);
        return h;
    }
};

const int num_pages = 1000; //where 1-1000 inclusive are valid pageids
const int num_passes = 4;

int main() {
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) { //have to go in four passes
        std::ifstream logfile("log.log");
        //fresh tables every pass, so only this pass's sequences are held in memory
        std::unordered_map<userid, sequence> userhistory;
        std::unordered_map<sequence, sequence_count, sequence_hash> visits;
        pageid minpage = num_pages / num_passes * pass; //where first page is in a range
        pageid maxpage = num_pages / num_passes * (pass + 1) + 1;
        if (pass == num_passes - 1) //if it's last pass, fix rounding errors
            maxpage = INT_MAX;
        while (logfile >> curuser >> curpage) { //read in line
            sequence& curhistory = userhistory[curuser]; //find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage; //push back new page for that user
            //if they visited three pages in a row, and the first one is in this pass's range
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count& count = visits[curhistory]; //get times sequence was hit
                ++count; //and increase it
                if (count > max_count) { //if that's new max
                    max_count = count; //update the max
                    max_sequence = curhistory; //std::array can be assigned directly
                }
            }
        }
    }
    std::cout << "The sequence visited the most is :\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
    return 0;
}
Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.
[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.
If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.
On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.
Clearly the sequence counters dominate the memory problem.
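The back-of-the-envelope numbers above are easy to recompute for other page and user counts. A tiny sketch, using the same 4-byte counters and 24 bytes per user as rough assumptions:

```python
def sequence_counter_bytes(num_pages, counter_bytes=4):
    # one counter for every ordered triple of pages
    return num_pages ** 3 * counter_bytes


def user_table_bytes(num_users, bytes_per_user=24):
    return num_users * bytes_per_user


print(sequence_counter_bytes(1000))  # 4 GB of counters for 1000 pages
print(user_table_bytes(1_000_000))   # 24 MB for a million users
```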
If you do only have 1000 pages then 4GB of memory is not unreasonable in a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to always have the most used cases cached in memory.
So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.
I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having got that, you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if it's likely you rarely get there. Here are some ideas:
Evict the least commonly used sequences to file, keeping the most common in memory. I'd need two passes over the table to determine what level is below average, and then to do the eviction. Somehow you'd need to know where you'd put each entry, whenever you get a hash-miss, which might prove tricky.
Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.
Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
I think you only have to store the most recently seen triple for each userid right?
So you have two hash tables. The first containing key of userid, value of most recently seen triple has size equal to number of userids.
EDIT: assumes file sorted by timestamp already.
The second hash table has a key of userid:page-triple, and a value of count of times seen.
I know you said c++ but here's some awk which does this in a single pass (should be pretty straight-forward to convert to c++):
# $1 is userid, $2 is pageid
{
    old = ids[$1]; # map with id, most-recently-seen triple
    split(old,oldarr,"-");
    oldarr[1]=oldarr[2];
    oldarr[2]=oldarr[3];
    oldarr[3] = $2;
    ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
    tripleid = $1":"ids[$1]; # build a triple-id of userid:triple
    if (oldarr[1] != "") { # don't accumulate incomplete triples
        triples[tripleid]++; } # count this triple-id
}
END {
    MAX = 0;
    for (tid in triples) {
        print tid" "triples[tid];
        if (triples[tid] > MAX) { MAX = triples[tid]; MAXTID = tid; } # remember both the count and its triple-id
    }
    print "MAX is->" MAXTID" seen "MAX" times";
}
If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:
1. sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
2. Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script. So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because Step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
3. sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
This approach might not have optimal performance, but it shouldn't run out of memory.
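The custom processing in the middle step could look roughly like this in Python. It assumes the records are already grouped by user, as the stable sort guarantees, and it holds only one user's pages in memory at a time. The final Counter is there just to check the sample; on real data you would write the triplets to a file and let sort and uniq do the counting.

```python
from collections import Counter
from itertools import groupby


def triplets(sorted_records):
    """Yield 'a-b-c' for every three consecutive pages visited by the same user."""
    for _, visits in groupby(sorted_records, key=lambda rec: rec[0]):
        pages = [page for _, page in visits]
        for i in range(len(pages) - 2):
            yield "-".join(str(p) for p in pages[i:i + 3])


# The sample log from the question after a stable sort on user ID.
records = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
           ("B", 2), ("B", 3), ("B", 4), ("C", 1)]
result = list(triplets(records))
print(result)
print(Counter(result).most_common(1))
```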