Say we have about 1e10 lines of log file every day; each line contains an ID number (an integer of at most 15 digits), a login time, and a logout time. Some IDs may log in and out several times.
Question 1:
How do we count the total number of distinct IDs that have logged in? (Each ID should be counted only once.)
I tried to use a hash table here, but the memory it would need may be too large.
Question 2:
Calculate the time at which the number of online users is largest.
I think we could split the day into 86400 seconds, then for each line of the log file add 1 to every second in the online interval. Or maybe I could sort the log file by login time?
You can do that in a *nix shell.
cut -f1 logname.log | sort | uniq | wc -l
cut -f2 logname.log | sort | uniq -c | sort -r
For question 2 to make sense, you probably have to log two things: user logs in and user logs out, i.e. two different activities along with the user ID. Sort this list by the time at which the activity (either log in or log out) was done, then just scan it with a counter called currentusers: add 1 for each log in and subtract 1 for each log out. The maximum that number (current users) reaches is the value you're interested in; you will probably also want to track the time at which it occurred.
For question 1, forget C++ and use *nix tools. Assuming the log file is space delimited, then the number of unique logins in a given log is computed by:
$ awk '{print $1}' foo.log | sort | uniq | wc -l
GNU sort will happily sort files larger than memory. Here's what each piece is doing:
awk is extracting the first space-delimited column (the ID number).
sort is sorting those ID numbers, because uniq needs sorted input.
uniq is returning only unique numbers.
wc prints the number of lines, which will be the number of unique numbers.
Use a segment tree to store intervals of consecutive IDs.
Scan the logs for all the login events.
To insert an ID, first search for a segment containing it: if one exists, the ID is a duplicate. If it doesn't exist, search for the segments immediately before or after the ID. If they exist, remove them and merge them with the new ID as needed, then insert the resulting segment. If they don't exist, insert the ID as a segment of one element.
Once all IDs have been inserted, count them by summing the cardinalities of all the segments in the tree.
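A minimal sketch of that insert-and-merge logic, using a std::map of disjoint intervals keyed by their start (an ordered tree rather than a literal segment tree, but the duplicate check, neighbour merging, and final summation are the same); the names are illustrative:

#include <cstdint>
#include <iterator>
#include <map>

// Key = first ID of an interval of consecutive IDs, value = last ID (inclusive).
using IntervalSet = std::map<int64_t, int64_t>;

// Returns false if the ID was already present (a duplicate login).
bool insert_id(IntervalSet& s, int64_t id) {
    auto next = s.upper_bound(id);                            // first interval starting after id
    auto prev = (next == s.begin()) ? s.end() : std::prev(next);

    if (prev != s.end() && id <= prev->second) return false;  // already covered: duplicate

    bool merge_left  = (prev != s.end() && prev->second + 1 == id);
    bool merge_right = (next != s.end() && next->first == id + 1);

    if (merge_left && merge_right) {        // id bridges its two neighbours
        prev->second = next->second;
        s.erase(next);
    } else if (merge_left) {                // extend the left neighbour
        prev->second = id;
    } else if (merge_right) {               // extend the right neighbour
        int64_t end = next->second;
        s.erase(next);
        s[id] = end;
    } else {                                // isolated id: a segment of one element
        s[id] = id;
    }
    return true;
}

// Total number of distinct IDs = sum of the cardinalities of all segments.
int64_t count_ids(const IntervalSet& s) {
    int64_t total = 0;
    for (const auto& [first, last] : s) total += last - first + 1;
    return total;
}

Whether this actually saves memory over a plain hash set depends on how much the IDs cluster into runs of consecutive values.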
Assuming that:
a given ID may be logged in only once at any given time,
events are stored in chronological order (that's what logs normally are),
Scan the log and keep a counter c of the number of currently logged-in users, as well as the maximum m found so far and the associated time t. For each log in, increment c; for each log out, decrement it. At each step, update m and t if m is lower than c.
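A minimal C++ sketch of that scan, assuming the log has already been flattened into (time, +1 for login / -1 for logout) events; sorting the pairs puts a logout before a login at the same second, so a user who logs out and back in within one second is not counted twice (the data is illustrative):

#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    // Each event is (time in seconds, +1 for a login, -1 for a logout).
    // Illustrative data; in practice these come from the log file.
    std::vector<std::pair<int, int>> events = {
        {100, +1}, {150, +1}, {160, -1}, {200, +1}, {210, -1}, {300, -1}};

    // Sort by time; at equal times, logouts (-1) sort before logins (+1).
    std::sort(events.begin(), events.end());

    int c = 0, m = 0, t = 0;          // current users, max so far, time of max
    for (const auto& [time, delta] : events) {
        c += delta;
        if (c > m) { m = c; t = time; }
    }
    std::cout << "peak of " << m << " users at t=" << t << '\n';
}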
For 1, you can try working with fragments of the IDs at a time that are small enough to fit into memory, i.e. partition the IDs by value range (or by a hash of the ID) so that each ID lands in exactly one fragment and the per-fragment unique counts can simply be summed.
i.e. instead of
countUnique([1, 2, ... 1000000])
try
countUnique([1, 2, ... 1000]) +
countUnique([1001, 1002, ... 2000]) +
countUnique([2001, 2002, ...]) + ... + countUnique([999000, 999001, ... 1000000])
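A sketch of that idea, assuming the IDs are partitioned by a hash (or by value range) so each ID always lands in the same bucket and the per-bucket unique counts can be summed; the file names and bucket count are illustrative:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    const int num_buckets = 256;  // chosen so one bucket's IDs fit in memory

    // Pass 1: scatter the IDs into bucket files; the same ID always goes
    // to the same bucket, so buckets can be counted independently.
    std::vector<std::ofstream> buckets(num_buckets);
    for (int i = 0; i < num_buckets; ++i)
        buckets[i].open("bucket_" + std::to_string(i) + ".txt");

    std::ifstream in("ids.txt");  // one ID per line, extracted from the log
    uint64_t id;
    while (in >> id) buckets[id % num_buckets] << id << '\n';
    for (auto& b : buckets) b.close();

    // Pass 2: count the distinct IDs per bucket and sum the counts.
    uint64_t total = 0;
    for (int i = 0; i < num_buckets; ++i) {
        std::unordered_set<uint64_t> seen;
        std::ifstream bucket("bucket_" + std::to_string(i) + ".txt");
        while (bucket >> id) seen.insert(id);
        total += seen.size();
    }
    std::cout << "distinct IDs: " << total << '\n';
}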
2 is a bit more tricky. Partitioning the work into manageable intervals (a second, as you suggested) is a good idea. For each second, find the number of people logged in during that second by using the following check (pseudocode):
def loggedIn(loginTime, logoutTime, currentTimeInterval):
    return loginTime <= currentTimeInterval and logoutTime >= currentTimeInterval
Apply loggedIn to all 86400 seconds, then take the maximum of the 86400 user counts to find the time at which the population of online users is largest.
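Checking every session against every second is O(sessions * 86400); a difference array implements the same per-second idea in a single pass over the sessions. A minimal sketch, assuming login/logout are given as seconds since midnight (the sample data is illustrative):

#include <iostream>
#include <utility>
#include <vector>

int main() {
    const int seconds_per_day = 86400;
    std::vector<int> delta(seconds_per_day + 1, 0);

    // Illustrative (login, logout) pairs in seconds since midnight;
    // in practice these come from parsing each log line.
    std::vector<std::pair<int, int>> sessions = {{100, 300}, {150, 160}, {200, 210}};

    for (const auto& [login, logout] : sessions) {
        ++delta[login];        // the user becomes online at this second
        --delta[logout + 1];   // and is no longer online after the logout second
    }

    int online = 0, best = 0, best_second = 0;
    for (int s = 0; s < seconds_per_day; ++s) {
        online += delta[s];    // running sum = users online at second s
        if (online > best) { best = online; best_second = s; }
    }
    std::cout << "peak of " << best << " users at second " << best_second << '\n';
}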
We have a completely serverless architecture and have been using DynamoDB almost since it was released, but I am struggling to see how to deal with tabulating global numbers at a large scale. Say we have users who choose to do either A or B. We want to keep track of how many users do each, and these events could happen at a high rate. According to DynamoDB best practices, you are not supposed to write continually to a single row. What is the best way to handle this without using another service like CouchDB or ElastiCache?
You could bucket your users by first letter of their usernames (or something similar) as the partition key, and either A or B as the sort key, with a regular attribute as the counts.
For example:
PARTITION KEY | SORT KEY | COUNT
--------------------------------
a | A | 5
a | B | 7
b | B | 15
c | A | 1
c | B | 3
The advantage is that you can reduce the risk of hot partitions by spreading your writes across multiple partitions.
Of course, you're trading hot writes for more expensive reads, since now you'll have to scan + filter(A) to get the total count that chose A, and another scan + filter(B) for the total count of B. But if you're writing a bunch and only reading on rare occasions, this may be ok.
I've read lots of DynamoDB docs on designing partition keys and sort keys, but I think I must be missing something fundamental.
If you have a bad partition key design, what happens when the data for a SINGLE partition key value exceeds 10GB?
The 'Understand Partition Behaviour' section states:
"A single partition can hold approximately 10 GB of data"
How can it partition a single partition key?
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
The docs also talk about limits with a local secondary index being limited to 10GB of data after which you start getting errors.
"The maximum size of any item collection is 10 GB. This limit does not apply to tables without local secondary indexes; only tables that have one or more local secondary indexes are affected."
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html#LSI.ItemCollections
That I can understand. So does it have some other magic for partitioning the data for a single partition key if it exceeds 10GB? Or does it just keep growing that partition? And what are the implications of that for your key design?
The background to the question is that I've seen lots of examples of using something like a TenantId as a partition key in a multi-tenant environment. But that seems limiting if a specific TenantId could have more than 10 GB of data.
I must be missing something?
TL;DR - items can be split even if they have the same partition key value by including the range key value into the partitioning function.
The long version:
This is a very good question, and it is addressed in the documentation here and here. As the documentation states, items in a DynamoDB table are partitioned based on their partition key value (which used to be called hash key) into one or multiple partitions, using a hashing function. The number of partitions is derived based on the maximum desired total throughput, as well as the distribution of items in the key space. In other words, if the partition key is chosen such that it distributes items uniformly across the partition key space, the partitions end up having approximately the same number of items each. This number of items in each partition is approximately equal to the total number of items in the table divided by the number of partitions.
The documentation also states that each partition is limited to about 10GB of space. And that once the sum of the sizes of all items stored in any partition grows beyond 10GB, DynamoDB will start a background process that will automatically and transparently split such partitions in half - resulting in two new partitions. Once again, if the items are distributed uniformly, this is great because each new sub-partition will end up holding roughly half the items in the original partition.
An important aspect to splitting is that the throughput of the split-partitions will each be half of the throughput that would have been available for the original partition.
So far we've covered the happy case.
On the flip side, it is possible to have one, or a few, partition key values that correspond to a very large number of items. This can usually happen if the table schema uses a sort key and many items share the same partition key. In such a case, a single partition key could be responsible for items that together take up more than 10 GB, and this will result in a split. In this case DynamoDB will still create two new partitions, but instead of using only the partition key to decide which sub-partition an item should be stored in, it will also use the sort key.
Example
Without loss of generality and to make things easier to reason about, imagine that there is a table where partition keys are letters (A-Z), and numbers are used as sort keys.
Imagine that the table has about 9 partitions, so letters A, B, C would be stored in partition 1, letters D, E, F would be in partition 2, etc.
In the diagram below, the partition boundaries are marked h(A0), h(D0), etc. to show that, for instance, the items stored in the first partition are the items whose partition key hashes to a value between h(A0) and h(D0); the 0 is intentional, and comes in handy next.
[ h(A0) ]--------[ h(D0) ]---------[ h(G0) ]-------[ h(J0) ]-------[ h(M0) ]- ..
|  A   B   C    |  D    E    F    |  G    I     |  J   K   L   |
|  1   1   1    |  1    1    1    |  1    1     |  1   1   1   |
|  2   2   2    |  2    2    2    |  2          |  2           |
|  3       3    |  3         3    |  3          |              |
       ..              ..              ..             ..
|               | 100       500   |             |              |
+---------------+-----------------+-------------+--------------+-- ..
Notice that for most partition key values there are between 1 and 3 items in the table, but two partition key values, D and F, are not looking too good: D has 100 items while F has 500 items.
If items with a partition key value of F keep getting added, eventually the partition [h(D0)-h(G0)) will split. To make it possible to split the items that have the same hash key, the range key values will have to be used, so we'll end up with the following situation:
..[ h(D0) ]------------/ [ h(F500) ] / ----------[ h(G0) ]- ..
|  D    E    F    |     F      |
|  1    1    1    |    501     |
|  2    2    2    |    502     |
|  3         3    |    503     |
       ..               ..
| 100       500   |   1000     |
.. ---+------------+------------+--- ..
The original partition [h(D0)-h(G0)) was split into [h(D0)-h(F500)) and [h(F500)-h(G0))
I hope this helps to visualize that items are generally mapped to partitions based on a hash value obtained by applying a hashing function to their partition key value, but if need be, the value being hashed can include the partition key + a sort key value as well.
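Purely to illustrate the idea (this is not DynamoDB's actual implementation), here is a small conceptual sketch in which the partition boundaries are (hash of partition key, sort key) pairs, so a boundary such as h(F500) can fall between two items that share the partition key F; all names and types are illustrative:

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// A boundary is a (partition key hash, sort key) pair; partition i covers
// the half-open range [bounds[i], bounds[i+1]).
struct Boundary {
    uint64_t key_hash;
    int64_t sort_key;   // only matters where a single key value has been split
};

bool operator<(const Boundary& a, const Boundary& b) {
    return a.key_hash != b.key_hash ? a.key_hash < b.key_hash
                                    : a.sort_key < b.sort_key;
}

// Pick the partition whose boundary range contains the item.
int pick_partition(const std::vector<Boundary>& bounds,   // sorted ascending
                   const std::string& partition_key, int64_t sort_key) {
    Boundary item{std::hash<std::string>{}(partition_key), sort_key};
    int p = 0;
    while (p + 1 < static_cast<int>(bounds.size()) && !(item < bounds[p + 1]))
        ++p;
    return p;
}

With a boundary at (h("F"), 500), items F/1..F/499 land in one partition and F/500 onwards in the next, which mirrors the split pictured above (ignoring that std::hash will not order the letters the way the diagram's h() does).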
I have tried an Expression transformation so far, along with an Aggregator transformation, to get the maximum value of the sequence number. The source is a flat file.
The way you are designing it would require reading the source twice in the mapping: once to get the total number of records (the max sequence, as you called it) and then again to read the detail records and pass them to target1 or target2.
You can simplify it by passing the number of records as a mapping parameter.
Either way, to decide when to route to a target, you can count the number of records read by keeping a running total in a variable port, incrementing it every time a row passes through the expression, and checking it against (record count)/2.
If you don't really care about the first half and second half and all you need is two output files equal in size, you can:
number the rows (with a rank transformation or a variable port),
then route even and odd rows to two different targets.
If you can, write a Unix shell script (assuming your platform is Unix) to do a head of the first file with half the file's line count (use wc -l on the file, divided by 2, as the argument to head) and direct the output to a third file. Then do a tail on the second file, again using wc -l as just described, and >> the output to that third file. These would be pre-session commands, and you'd use the third file as the source file for your session. It'd look something like this (untested, but it gets the general idea across):
halfsize=`wc -l < filename`                 # line count only (no filename in the output)
halfsize=$((halfsize/2))
head -n $halfsize filename > thirdfile      # first half of the first file
halfsize=`wc -l < filename2`
halfsize=$((halfsize/2))
tail -n $halfsize filename2 >> thirdfile    # second half of the second file
Prior to writing to the target, you keep counts in an expression, then connect this expression to a router.
The router should have 2 groups:
group1: count1 <= n/2, route it to Target1
group2: count1 > n/2, route it to Target2
Or
MOD(NEXTVAL, 2) will send alternate records to alternate targets.
I guess it won't send the first half to the 1st target and the 2nd half to the 2nd target, though.
I would like to make a macro in Excel, but I think it's too complicated to do with recording... That's why I'm coming here for assistance.
The file:
I have a list of warehouse boxes, each with a specific ID, a location (town), a location status (in or out) and a date.
Whenever boxes change location, this needs to be changed in the list, and the date should be adjusted accordingly (this should be a manual input, since the updating of the list might not happen on the same day as the movement of the box).
On top of that, I need to count the number of times the location changes from in to out (so that I know how many times the box has been used).
The way of inputting:
A good way of inputting would be to make a list of the boxes whose information you want to change, e.g.:
ID | Location (town) | Location (in/out) | Date
------------------------------------------------
123-4 | Paris | OUT | 9-1-14
124-8 | London | IN | 9-1-14
999-84| London | IN | 10-1-14
124-8 | New York | OUT | 9-1-14
Then I'd make a button that uses the macro to change the data mentioned above in the master list (where all the boxes are listed) and in some way count the number of times OUT changes to IN, etc.
Is this possible?
I'm not entirely sure what you want updated in your main list, but I don't think you need macros at all to achieve this. You can count the number of times a box location has changed by simply making a list of all your boxes in one column and the count in the next column. For the count, use the COUNTIFS formula to count all the rows where the box ID is the same and the location is in/out. Check VLOOKUP for updating your main list values.
I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence of all. The log files are too large to be held in main memory at once.
Sample log file:
User ID Page ID
A 1
A 2
A 3
B 2
B 3
C 1
B 4
A 4
Corresponding results:
A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence
My idea is to use two hash tables. The first hashes on user ID and stores its sequence; the second hashes three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
However, since I have to use two hash tables, memory cannot hold everything at once, and I have to use disk. It is not efficient to access disk very often.
How can I do this better?
If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop least recently used entries.
If you want an exact result, use an external sort procedure to sort the logs by user ID, then combine every 3 consecutive entries and sort again, this time by page IDs.
Update (sort by timestamp)
Some preprocessing may be needed to properly use logfiles' timestamps:
If the log files are already sorted by timestamp, no preprocessing is needed.
If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all of them and use a merge sort to read them (a sketch follows after this list).
If the files are almost sorted by timestamp (as when several independent processes write logs to a single file), use a binary heap to get the data into the correct order.
If the files are not sorted by timestamp (which is not likely in practice), use an external sort by timestamp.
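A sketch of that merge step, assuming each input file is already sorted and every line starts with a numeric timestamp as its first whitespace-separated field (the format is illustrative); a min-heap always yields the globally earliest unread line:

#include <fstream>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

// One pending record: the earliest unread line of one input file.
struct Record {
    long timestamp;
    std::string line;
    size_t file_index;
};

struct LaterTimestamp {
    bool operator()(const Record& a, const Record& b) const {
        return a.timestamp > b.timestamp;   // makes priority_queue a min-heap
    }
};

int main(int argc, char** argv) {
    std::vector<std::ifstream> files;               // argv[1..] are the log files
    for (int i = 1; i < argc; ++i) files.emplace_back(argv[i]);

    std::priority_queue<Record, std::vector<Record>, LaterTimestamp> heap;

    auto push_next = [&](size_t i) {                // refill the heap from file i
        std::string line;
        if (std::getline(files[i], line)) {
            long ts = std::stol(line.substr(0, line.find(' ')));
            heap.push({ts, line, i});
        }
    };

    for (size_t i = 0; i < files.size(); ++i) push_next(i);

    while (!heap.empty()) {                         // emit lines in global time order
        Record r = heap.top();
        heap.pop();
        std::cout << r.line << '\n';
        push_next(r.file_index);
    }
}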
Update2 (Improving approximate method)
The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may have different patterns at different times of day, or may be different on weekends, and the original approach may give poor results for such data. To improve this, a hierarchical LRU queue may be used.
Partition the LRU queue into log(N) smaller queues, with sizes N/2, N/4, ... The largest one may contain any element, the next one only elements seen at least 2 times, the next one elements seen at least 4 times, and so on. If an element is removed from some sub-queue, it is added to another one, so it lives in all sub-queues lower in the hierarchy before it is completely removed. Such a priority queue is still of O(1) complexity, but allows a much better approximation of the most popular pages.
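For reference, a sketch of the plain (non-hierarchical) capacity-limited counting table described in the first paragraph of this answer; the hierarchical variant would layer several of these with increasing admission thresholds. Keys such as "A:1-2-3" are illustrative userid:triple strings:

#include <iostream>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Counts keys in bounded memory by evicting the least recently used entry
// once capacity is reached. Counts are approximate: an evicted key restarts
// from zero if it is seen again.
class LruCounter {
  public:
    explicit LruCounter(size_t capacity) : capacity_(capacity) {}

    void add(const std::string& key) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second += 1;                                 // bump the count
            entries_.splice(entries_.begin(), entries_, it->second); // mark as recent
            return;
        }
        if (entries_.size() == capacity_) {                          // evict the LRU entry
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
        entries_.emplace_front(key, 1L);
        index_[key] = entries_.begin();
    }

    // Most frequent key currently held (an approximate overall maximum).
    std::pair<std::string, long> best() const {
        std::pair<std::string, long> b{"", 0};
        for (const auto& e : entries_)
            if (e.second > b.second) b = e;
        return b;
    }

  private:
    using Entry = std::pair<std::string, long>;
    size_t capacity_;
    std::list<Entry> entries_;  // most recently used first
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};

int main() {
    LruCounter counts(1000000);  // capacity sized to the available memory
    counts.add("A:1-2-3");
    counts.add("B:2-3-4");
    counts.add("B:2-3-4");
    auto [key, n] = counts.best();
    std::cout << key << " seen " << n << " times\n";
}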
There's probably syntax errors galore here, but this should take a limited amount of RAM for a virtually unlimited length log file.
#include <array>
#include <climits>
#include <fstream>
#include <iostream>
#include <map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence;  // newest page in [0], oldest in [2]
typedef int sequence_count;

const int num_pages = 1000; //where 1-1000 inclusive are valid pageids
const int num_passes = 4;

int main() {
    // std::map rather than std::unordered_map so the array key needs no custom hash
    std::map<userid, sequence> userhistory;
    std::map<sequence, sequence_count> visits;
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) { //have to go in four passes
        std::ifstream logfile("log.log");
        userhistory.clear(); //don't mix triples across passes
        visits.clear();      //each triple is counted in exactly one pass
        pageid minpage = num_pages/num_passes*pass; //where first page is in a range
        pageid maxpage = num_pages/num_passes*(pass+1)+1;
        if (pass == num_passes-1) //if it's last pass, fix rounding errors
            maxpage = INT_MAX;
        while (logfile >> curuser >> curpage) { //read in line
            sequence& curhistory = userhistory[curuser]; //find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage; //push back new page for that user
            //if they visited three pages in a row and the oldest is in this pass's range
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count& count = visits[curhistory]; //get times sequence was hit
                ++count; //and increase it
                if (count > max_count) { //if that's new max
                    max_count = count; //update the max
                    max_sequence = curhistory; //std::array, so this is a copy
                }
            }
        }
    }
    std::cout << "The sequence visited the most is:\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
}
Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.
[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.
If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.
On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.
Clearly the sequence counters dominate the memory problem.
If you do only have 1000 pages then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to keep the most used cases cached in memory.
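As a sketch of that last option on Linux, the whole 1000x1000x1000 counter array can live in a file and be updated through mmap, letting the kernel decide what stays resident; the file name is illustrative:

#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const size_t num_pages = 1000;
    const size_t num_counters = num_pages * num_pages * num_pages;  // 1e9 triples
    const size_t bytes = num_counters * sizeof(uint32_t);           // ~4 GB

    int fd = open("seq_counters.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) return 1;    // size the sparse backing file

    uint32_t* counts = static_cast<uint32_t*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (counts == MAP_FAILED) return 1;

    // Counting one observed sequence (p1, p2, p3), pages numbered 0..999:
    size_t p1 = 1, p2 = 2, p3 = 3;             // illustrative triple
    ++counts[(p1 * num_pages + p2) * num_pages + p3];

    munmap(counts, bytes);
    close(fd);
}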
So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.
I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having got that you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if it's likely you rarely get there. Here's some ideas:
Evict the least commonly used sequences to file, keeping the most common in memory. You'd need two passes over the table: one to determine what level is below average, and another to do the eviction. Somehow you'd need to know where you put each entry whenever you get a hash miss, which might prove tricky.
Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.
Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
I think you only have to store the most recently seen triple for each userid right?
So you have two hash tables. The first, keyed by userid with the most recently seen triple as its value, has a size equal to the number of userids.
EDIT: assumes file sorted by timestamp already.
The second hash table has a key of userid:page-triple, and a value of count of times seen.
I know you said C++, but here's some awk which does this in a single pass (it should be pretty straightforward to convert to C++):
# $1 is userid, $2 is pageid
{
old = ids[$1]; # map with id, most-recently-seen triple
split(old,oldarr,"-");
oldarr[1]=oldarr[2];
oldarr[2]=oldarr[3];
oldarr[3] = $2;
ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
tripleid = $1":"ids[$1]; # build a triple-id of userid:triple
if (oldarr[1] != "") { # don't accumulate incomplete triples
triples[tripleid]++; } # count this triple-id
}
END {
MAXCOUNT = 0;
for (tid in triples) {
print tid" "triples[tid];
if (triples[tid] > MAXCOUNT) { MAXCOUNT = triples[tid]; MAXTID = tid; } # track both the count and its triple-id
}
print "MAX is->" MAXTID" seen "MAXCOUNT" times";
}
If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:
sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script (a C++ sketch follows at the end of this answer). So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
This approach might not have optimal performance, but it shouldn't run out of memory.
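A minimal C++ sketch of the custom processing in step 2, assuming the sorted file has whitespace-separated user and page IDs as in the sample log (file names are illustrative):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("sorted");      // output of the stable sort in step 1
    std::ofstream out("triplets");   // one "p1-p2-p3" line per visited triplet

    std::string user, page;
    std::string prev_user, p1, p2;   // the current user's two previous pages

    while (in >> user >> page) {
        if (user != prev_user) {     // new user: reset the sliding window
            p1.clear();
            p2.clear();
            prev_user = user;
        }
        if (!p1.empty())             // two previous pages exist: emit a triplet
            out << p1 << '-' << p2 << '-' << page << '\n';
        p1 = p2;                     // slide the window forward
        p2 = page;
    }
}

On the sample log this writes 1-2-3 and 2-3-4 (for A) and 2-3-4 (for B), after which the uniq -c pipeline in step 3 picks 2-3-4 as the winner; the stable sort in step 1 is what keeps each user's visits in their original order.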