I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence overall. The log files are too large to be held in main memory at once.
Sample log file:
User ID    Page ID
A          1
A          2
A          3
B          2
B          3
C          1
B          4
A          4
Corresponding results:
A: 1-2-3, 2-3-4
B: 2-3-4
2-3-4 is the most popular three-page sequence
My idea is to use two hash tables. The first is keyed on user ID and stores that user's most recent page sequence; the second is keyed on three-page sequences and stores the number of times each one appears. This takes O(n) space and O(n) time.
However, with two hash tables, memory cannot hold everything at once, so I have to use disk, and frequent disk access is not efficient.
How can I do this better?
If you want a quick, approximate result, use hash tables as you intended, but give each hash table a limited-size queue so that the least recently used entries are dropped.
If you want an exact result, use an external sort to order the logs by user ID, then combine every 3 consecutive entries into a triple and sort again, this time by page IDs.
Update (sort by timestamp)
Some preprocessing may be needed to use the log files' timestamps properly:
If the log files are already sorted by timestamp, no preprocessing is needed.
If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all of them and merge them by timestamp as you read (see the sketch after this list).
If the files are only almost sorted by timestamp (as when several independent processes write logs to a single file), use a binary heap to get the data into the correct order.
If the files are not sorted by timestamp at all (which is unlikely in practice), use an external sort by timestamp.
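For the several-sorted-files case, a minimal sketch (my own; it assumes each log line begins with a numeric timestamp, which is an assumption about the format) of merging them with a min-heap:

#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

// Hypothetical sketch: merge several timestamp-sorted log files into one
// timestamp-ordered stream. Assumes each line starts with a numeric timestamp.
struct HeadLine {
    long long timestamp;
    std::string line;
    std::size_t source;   // which file the line came from
    bool operator>(const HeadLine& other) const { return timestamp > other.timestamp; }
};

int main(int argc, char* argv[]) {
    std::vector<std::ifstream> files;
    for (int i = 1; i < argc; ++i) files.emplace_back(argv[i]);

    // Min-heap ordered by timestamp, holding one pending line per file.
    std::priority_queue<HeadLine, std::vector<HeadLine>, std::greater<HeadLine>> heap;

    auto push_next = [&](std::size_t i) {
        std::string line;
        if (std::getline(files[i], line) && !line.empty())
            heap.push({std::stoll(line), line, i});   // timestamp is the first field
    };

    for (std::size_t i = 0; i < files.size(); ++i) push_next(i);

    while (!heap.empty()) {
        HeadLine head = heap.top();
        heap.pop();
        std::cout << head.line << '\n';   // emit in global timestamp order
        push_next(head.source);           // refill from the file we just consumed
    }
}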
Update 2 (improving the approximate method)
The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may follow different patterns at different times of day, or differ on weekends, and the original approach may give poor results for such data. To improve this, a hierarchical LRU queue may be used.
Partition the LRU queue into log(N) smaller queues of sizes N/2, N/4, ... The largest one may contain any element; the next one only elements seen at least 2 times; the next, only elements seen at least 4 times; and so on. When an element is evicted from one sub-queue, it is added to the next one down, so it lives in every sub-queue lower in the hierarchy before it is removed completely. Such a priority queue still has O(1) complexity per operation, but gives a much better approximation of the most popular sequences.
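Not the hierarchical version, but a minimal sketch (my own illustration) of its single-level building block: a capacity-bounded counter, keyed on 3-page sequences packed into 64-bit integers, that drops the least recently seen entries. The hierarchical variant would layer several of these with "seen at least 2/4/8... times" admission thresholds.

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Hypothetical sketch: a capacity-bounded counter that drops the least
// recently seen sequences. Keys are 3-page sequences packed into one
// 64-bit integer (an assumption about how the caller encodes them).
class BoundedSequenceCounter {
public:
    explicit BoundedSequenceCounter(std::size_t capacity) : capacity_(capacity) {}

    void record(std::uint64_t seq_key) {
        auto it = entries_.find(seq_key);
        if (it != entries_.end()) {
            // Seen before: bump the count and move to the front of the LRU list.
            lru_.splice(lru_.begin(), lru_, it->second.lru_pos);
            ++it->second.count;
        } else {
            if (entries_.size() >= capacity_) {
                // Evict the least recently seen sequence (its count is lost,
                // which is why this method is only approximate).
                entries_.erase(lru_.back());
                lru_.pop_back();
            }
            lru_.push_front(seq_key);
            entries_[seq_key] = {1, lru_.begin()};
        }
    }

private:
    struct Entry {
        long count;
        std::list<std::uint64_t>::iterator lru_pos;
    };
    std::size_t capacity_;
    std::list<std::uint64_t> lru_;                      // most recent at the front
    std::unordered_map<std::uint64_t, Entry> entries_;
};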
There are probably syntax errors galore here, but this should take a limited amount of RAM for a virtually unlimited log file length.
#include <array>
#include <climits>
#include <fstream>
#include <iostream>
#include <unordered_map>

typedef int pageid;
typedef int userid;
typedef std::array<pageid, 3> sequence;   // [0] = newest page, [2] = oldest
typedef int sequence_count;

// std::unordered_map needs a hash for the 3-page array key.
struct sequence_hash {
    std::size_t operator()(const sequence& s) const {
        std::size_t h = 0;
        for (pageid p : s) h = h * 1000003u + static_cast<std::size_t>(p);
        return h;
    }
};

int main() {
    const int num_pages = 1000;   // where 1-1000 inclusive are valid pageids
    const int num_passes = 4;
    std::unordered_map<userid, sequence> userhistory;
    std::unordered_map<sequence, sequence_count, sequence_hash> visits;
    sequence_count max_count = 0;
    sequence max_sequence = {};
    userid curuser;
    pageid curpage;
    for (int pass = 0; pass < num_passes; ++pass) {   // have to go in four passes
        std::ifstream logfile("log.log");
        userhistory.clear();   // start each pass with fresh histories...
        visits.clear();        // ...and fresh counts, so only one range of triples is held at a time
        pageid minpage = num_pages / num_passes * pass;   // where the first page of the triple is in a range
        pageid maxpage = num_pages / num_passes * (pass + 1) + 1;
        if (pass == num_passes - 1)   // if it's the last pass, fix rounding errors
            maxpage = INT_MAX;
        while (logfile >> curuser >> curpage) {              // read in a line
            sequence& curhistory = userhistory[curuser];     // find that user's history
            curhistory[2] = curhistory[1];
            curhistory[1] = curhistory[0];
            curhistory[0] = curpage;                         // push back the new page for that user
            // if they visited three pages in a row and the oldest page is in this pass's range
            if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                sequence_count& count = visits[curhistory];  // get times the sequence was hit
                ++count;                                     // and increase it
                if (count > max_count) {                     // if that's a new max
                    max_count = count;                       // update the max
                    max_sequence = curhistory;               // std::array, so this copies by value
                }
            }
        }
    }
    std::cout << "The sequence visited the most is:\n";
    std::cout << max_sequence[2] << '\n';
    std::cout << max_sequence[1] << '\n';
    std::cout << max_sequence[0] << '\n';
    std::cout << "with " << max_count << " visits.\n";
}
Note that if your pageid or userid are strings instead of ints, you'll pay a significant speed/size/caching penalty.
[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making it realistic to run in RAM. It just goes proportionally slower.
If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.
On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.
Clearly the sequence counters dominate the memory problem.
If you do only have 1000 pages, then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar) and rely on the OS to always have the most heavily used parts cached in memory.
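On Linux, a minimal sketch of that mmap idea (my own illustration; the file name and the 1-based page-ID packing are assumptions) could look like this:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical sketch (64-bit Linux assumed): back the 1000*1000*1000 array of
// 32-bit sequence counters with a file, and let the OS page it in and out.
int main() {
    const std::size_t num_pages = 1000;
    const std::size_t num_counters = num_pages * num_pages * num_pages;
    const std::size_t bytes = num_counters * sizeof(std::uint32_t);   // ~4 GB

    int fd = open("counters.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        perror("counters.bin");
        return 1;
    }

    void* mapped = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mapped == MAP_FAILED) { perror("mmap"); return 1; }
    auto* counters = static_cast<std::uint32_t*>(mapped);

    // Counting a visit to the triple (p0, p1, p2), with 1-based page IDs:
    auto count = [&](std::size_t p0, std::size_t p1, std::size_t p2) {
        ++counters[((p0 - 1) * num_pages + (p1 - 1)) * num_pages + (p2 - 1)];
    };
    count(1, 2, 3);   // example

    munmap(mapped, bytes);
    close(fd);
}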
So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.
I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the available memory. Having done that, you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if you rarely get there. Here are some ideas:
Evict the least commonly used sequences to a file, keeping the most common in memory. You'd need two passes over the table: one to determine what count is below average and one to do the eviction. Somehow you'd also need to know where you put each evicted entry whenever you get a hash miss, which might prove tricky.
Just dump the whole table to a file and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky (a sketch follows this list).
Use an mmapped file to extend the table. Make sure the file is used primarily for the least commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
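A minimal sketch (my own) of the second idea (dump the whole table and start over); the key packing, names, and capacity are assumptions:

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the "dump and rebuild" idea. The counter table keys
// are assumed to be triples packed into one 64-bit value: p1*K*K + p2*K + p3.
constexpr std::uint64_t K = 1000001;   // anything larger than the biggest page ID

void dump_if_full(std::unordered_map<std::uint64_t, std::uint32_t>& counts,
                  std::size_t capacity, const std::string& dump_path) {
    if (counts.size() < capacity) return;
    std::ofstream dump(dump_path, std::ios::app);   // append this generation's counts
    for (const auto& entry : counts) {
        std::uint64_t key = entry.first;
        dump << key / (K * K) << ' '    // p1
             << (key / K) % K << ' '    // p2
             << key % K << ' '          // p3
             << entry.second << '\n';
    }
    counts.clear();   // free the memory and keep reading the log
}

Recombining is then an external sort of the dump file on the three page columns, followed by summing the counts of identical triples.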
I think you only have to store the most recently seen triple for each userid, right?
So you have two hash tables. The first has the userid as its key and the most recently seen triple as its value, and its size equals the number of userids.
EDIT: this assumes the file is already sorted by timestamp.
The second hash table has a key of userid:page-triple and a value of the count of times that triple was seen.
I know you said C++, but here's some awk which does this in a single pass (it should be pretty straightforward to convert to C++):
# $1 is userid, $2 is pageid
{
    old = ids[$1];                    # map of userid -> most-recently-seen triple
    split(old, oldarr, "-");
    oldarr[1] = oldarr[2];            # shift the window left
    oldarr[2] = oldarr[3];
    oldarr[3] = $2;                   # append the new page
    ids[$1] = oldarr[1]"-"oldarr[2]"-"oldarr[3];   # save new most-recently-seen triple
    tripleid = $1":"ids[$1];          # build a triple-id of userid:triple
    if (oldarr[1] != "") {            # don't accumulate incomplete triples
        triples[tripleid]++;          # count this triple-id
    }
}
END {
    MAXCOUNT = 0;
    for (tid in triples) {
        print tid" "triples[tid];
        if (triples[tid] > MAXCOUNT) { MAXCOUNT = triples[tid]; MAXTID = tid; }
    }
    print "MAX is->" MAXTID" seen "MAXCOUNT" times";
}
If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:
sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script (a sketch of this step is given below). In the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
This approach might not have optimal performance, but it shouldn't run out of memory.
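For step 2, a minimal C++ sketch (my own; it assumes the sorted file has one "user page" pair per line, as in the sample log) of the triplet-emitting pass:

#include <iostream>
#include <string>

// Hypothetical sketch of step 2: read the userid-sorted log from stdin
// (one user's visits grouped together and still in timestamp order thanks
// to the stable sort) and print one "p1-p2-p3" line for every three
// consecutive pages of the same user.
int main() {
    std::string user, prevuser;
    long page, p1 = 0, p2 = 0;   // the two previous pages of the current user
    int seen = 0;                // how many pages of the current user so far

    while (std::cin >> user >> page) {
        if (user != prevuser) {  // new user: restart the sliding window
            prevuser = user;
            seen = 0;
        }
        ++seen;
        if (seen >= 3)
            std::cout << p1 << '-' << p2 << '-' << page << '\n';
        p1 = p2;
        p2 = page;
    }
}

You would run it as ./triplets < sorted > triplets and then apply the sort | uniq -c step above.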
Related
For a data quality check I need to collect data in a specific interval.
Some tables are huge in size.
Is there any hack to do this without affecting performance?
Like selecting 100 rows randomly.
How random do you need the sample to be? The classic way to do this is with "WHERE RANDOM() < .001". If you need a repeatable "random" set, you can add a seed. The issue is that your tables are huge, so this means reading (scanning) every row from disk just to throw most of them away, and since a table scan can take significant time this isn't what you want to do.
So you may want to take advantage of Redshift's "limited table scan" capabilities as part of your "random" sampling. (The fastest data to read from disk is the data you don't read from disk.) The issue here is that this solution will depend on your table's sort keys and ordering, which pushes the solution into even "more pseudo" random territory (less of a true random sample). In many cases this isn't a big deal, but if the statistics really matter then this may not work for you.
This is done by sampling "blocks", not rows, based on the sort key(s). This sampling of blocks can be done randomly, and each block of data will represent roughly 250K rows (based on the sort key data type, compression, etc., and it could range anywhere from under 100K rows to 2M rows). Doing this takes a little inspection of STV_BLOCKLIST. The storage quantum for Redshift is the 1MB block, and every block's metadata in the system can be referenced in STV_BLOCKLIST. This system table contains min and max values for each block. First find all the blocks for the sort key of the table in question. Next pick a random sample of these blocks (and if you are still dealing with a lot of data, make sure that this sampling picks an even number from across all the slices to avoid execution skew).
Now the trick is to translate these min and max metadata values into a WHERE clause that performs the desired sampling. These min and max values are BIGINTs and are hashed from the data in the sort key column. The hash is data-type dependent: if the data type is BIGINT then the hash is quite simple; if the data type is timestamp then it is a bit more complex. But the ordering is preserved across the hashing function for the data type involved. Reverse engineering this hash isn't hard - just perform a few experiments - but I can help if you tell me the type involved, as I've done this for just about every data type at this point.
You can even do random sampling of rows on top of this random sampling of blocks. Or, if you want, you can just pick some narrow ranges of the sort key value and then randomly sample rows, avoiding all this reverse-engineering business. The idea is to use Redshift's "reduced scan" capability to greatly reduce the amount of data read from disk. To do this you need to be metadata-aware in your choice of sampling windows, which often means a WHERE clause on the sort key. This is all about understanding how the database engine works and using its capabilities to your advantage.
I understand that this answer is based on some unstated information so please reach out in a comment if something isn't clear.
I have a file as follows:
The file consists of 2 parts: header and data.
The data part is divided into equally sized pages. Each page holds data for a specific metric. Multiple pages (which need not be consecutive) might be needed to hold the data for a single metric. Each page consists of a page header and a page body. The page header has a field called "next page", the index of the next page that holds data for the same metric. The page body holds the real data. All pages have the same fixed size (20 bytes for the header and 800 bytes for the body; if the data amounts to less than 800 bytes, the remainder is filled with zeros).
The header part consists of 20,000 elements, each holding information about a specific metric (point 1 -> point 20000). Each element has a field called "first page", which is the index of the first page holding data for that metric.
The file can be up to 10 GB.
Requirement: re-order the data in the file in the shortest time possible, so that the pages holding data for a single metric are consecutive, with metrics ordered from metric 1 to metric 20000 according to alphabetical order (the header part must be updated accordingly).
An obvious approach: for each metric, read all the data for that metric (page by page) and write it to a new file. But this takes a lot of time, especially reading the data from the file.
Are there any more efficient ways?
One possible solution is to create an index from the file, containing the page number and the page's metric, which is what you need to sort on. Build this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) to the second page, and so on.
Then sort the index by the metric.
Once sorted, you end up with a new array whose first, second, etc. entries give the new page order, and you read the input file while writing to the output file in the order of the sorted index.
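A minimal sketch of that index-sort idea (my own; load_index, the assumed header-part size, and the file layout constants stand in for the real format):

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch: build an in-memory index of (metric, page number),
// sort it by metric, then copy pages to the output file in sorted order.
struct PageRef {
    std::uint32_t metric;    // metric this page belongs to
    std::uint32_t page_no;   // position of the page in the input file's data part
};

constexpr std::streamoff kPageSize  = 20 + 800;    // page header + body, from the question
constexpr std::streamoff kDataStart = 20000 * 24;  // assumed header-part size; adjust to the real format

// Assumed helper (stub here): walk the header part and each metric's
// "next page" chain, producing one PageRef per data page.
std::vector<PageRef> load_index(const std::string& /*path*/) {
    return {};   // the real implementation depends on the exact header format
}

void reorder(const std::string& in_path, const std::string& out_path) {
    std::vector<PageRef> index = load_index(in_path);

    std::sort(index.begin(), index.end(),
              [](const PageRef& a, const PageRef& b) { return a.metric < b.metric; });

    std::ifstream in(in_path, std::ios::binary);
    std::ofstream out(out_path, std::ios::binary);
    std::vector<char> page(kPageSize);
    for (const PageRef& ref : index) {
        in.seekg(kDataStart + static_cast<std::streamoff>(ref.page_no) * kPageSize);
        in.read(page.data(), kPageSize);
        out.write(page.data(), kPageSize);   // pages now land consecutively per metric
    }
    // The header part still has to be rewritten with the new first-page indices.
}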
An obvious approach: for each metric, read all the data for that metric (page by page) and write it to a new file. But this takes a lot of time, especially reading the data from the file.
Are there any more efficient ways?
Yes. After you get a working solution, measure its efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on the results you get there (i.e., on where your bottlenecks are).
A few generic things to consider:
if you have one set of steps that reads the data for a single metric and moves it to the output, you should be able to parallelize that (have 20 sets of steps instead of one).
a 10 GB file will take a while to process regardless of what hardware you run your code on (conceivably, you could run it on a supercomputer, but I am ignoring that case). You / your client may accept a slower solution if it displays its progress / shows a progress bar.
do not use string comparisons for sorting.
Edit (addressing comment)
Consider performing the read as follows:
create a list of block offsets for the blocks you want to read
create a pool of worker threads of fixed size (for example, 10 workers)
each idle worker receives the file name and a block offset, creates its own std::ifstream instance on the file, reads the block, and returns it to a receiving object (then requests another block offset, if any are left) - see the sketch after this list
read pages should be passed to a central structure that manages/stores the pages.
Also consider managing the memory for the blocks separately (for example, allocate chunks of multiple blocks preemptively when you know the number of blocks to be read).
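A minimal sketch of that worker scheme (my own; the block size and the names are assumptions):

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: a fixed pool of workers, each with its own ifstream,
// pulling block offsets from a shared list and handing the bytes they read
// to a central store.
constexpr std::size_t kBlockSize = 820;   // one page: 20-byte header + 800-byte body (assumed)

struct BlockStore {
    std::mutex mutex;
    std::vector<std::pair<std::uint64_t, std::vector<char>>> blocks;  // (offset, bytes)
    void add(std::uint64_t offset, std::vector<char> bytes) {
        std::lock_guard<std::mutex> lock(mutex);
        blocks.emplace_back(offset, std::move(bytes));
    }
};

void read_blocks(const std::string& path,
                 const std::vector<std::uint64_t>& offsets,
                 std::size_t num_workers,
                 BlockStore& store) {
    std::atomic<std::size_t> next{0};   // index of the next offset to hand out
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            std::ifstream in(path, std::ios::binary);   // one stream per worker
            for (std::size_t i = next++; i < offsets.size(); i = next++) {
                std::vector<char> buf(kBlockSize);
                in.seekg(static_cast<std::streamoff>(offsets[i]));
                in.read(buf.data(), buf.size());
                store.add(offsets[i], std::move(buf));
            }
        });
    }
    for (auto& t : workers) t.join();
}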
I first read the header part, then sort the metrics in alphabetical order. For each metric in the sorted list I read all its data from the input file and write it to the output file. To remove the bottleneck at the reading step, I used memory mapping. The results showed that with memory mapping, the execution time for a 5 GB input file was reduced 5-6 times compared with not using it. This temporarily solves my problem. However, I will also consider the suggestions of @utnapistim.
I'm facing the following problem:
I have a huge file (let's say 30 GB) that is streamed into memory with a specific API.
This API only allows me to read going forward (not backward). But the files can be read as many times as I want.
The file contains data that is almost all sorted: 99% of the data is in order, but occasionally a record is out of place and should have been inserted much earlier if everything were sorted.
I'm trying to create a duplicate of this file, except it would need to be sorted.
Is there a graceful way to do this ?
The only way I can think of is the most generic way:
read the file
create batches of a few GB in memory, sort them, and write them to files on the HDD
use external merge to merge all these temporary files into the final output
However, this does not exploit the fact that the data is "almost" sorted. Would there be a better way to do this, for instance without using external files on the HDD?
You could do something like this (example in Python):

last = None
special = []                      # out-of-order records collected on the first pass
for r in records:
    if last is None or r > last:
        last = r
    else:
        special.append(r)
        if len(special) > max_memory:
            break

if len(special) > max_memory:
    # too many out of sequence records, use a regular sort
    ...
else:
    special.sort()
    i = 0
    last = None
    for r in records:             # second pass over the stream (the file can be re-read)
        if last is None or r > last:
            last = r
            # flush the out-of-order records that belong before this in-order record
            while i < len(special) and special[i] <= r:
                write(special[i])
                i += 1
            write(r)
        # records that were out of order are already in special, so skip them here
    while i < len(special):       # whatever is left goes at the end
        write(special[i])
        i += 1
Use a variation of bottom-up merge sort called natural merge sort. The idea here is to find runs of ordered data, then repeatedly merge those runs back and forth between two files (all sequential I/O) until only a single run is left. If the sort doesn't have to be stable (preserve the order of equal elements), then you can consider a run boundary to occur whenever a pair of sequential elements is out of order; this eliminates some housekeeping. If the sort needs to be stable, then you need to keep track of run boundaries on the initial pass that finds the runs; this could be an array of counts (the size of each run), which hopefully fits in memory. After each merge pass, the number of counts in the array is cut in half, and once only a single count remains, the sort is done.
Wiki article (no sample code given, though): natural bottom-up merge sort.
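Not the full sort, but a minimal sketch (my own, assuming a binary file of 64-bit integer records) of the initial pass that finds the runs and records their lengths for the stable variant:

#include <cstdint>
#include <fstream>
#include <vector>

// Hypothetical sketch: one sequential pass over a file of fixed-size records,
// recording the length of each already-sorted run. For the stable variant,
// a run boundary is only where the next key is strictly smaller.
std::vector<std::uint64_t> find_runs(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::uint64_t> run_lengths;
    std::int64_t prev, cur;   // assuming 64-bit integer keys/records
    if (!in.read(reinterpret_cast<char*>(&prev), sizeof prev)) return run_lengths;

    std::uint64_t length = 1;
    while (in.read(reinterpret_cast<char*>(&cur), sizeof cur)) {
        if (cur < prev) {                   // run boundary
            run_lengths.push_back(length);
            length = 0;
        }
        ++length;
        prev = cur;
    }
    run_lengths.push_back(length);          // final run
    return run_lengths;
}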
If all the out-of-order elements are somewhat isolated records, you could separate them into a third file, copying only the in-order records from the first file to the second file. Then you sort the third file with any method you want (bottom-up merge sort is probably still best if the third file is large), and finally merge the second and third files to create a sorted file.
If you have multiple hard drives, keep the files on separate drives. If doing this on an SSD, it won't matter. If using a single hard drive, reading or writing a large number of records at a time, say 10MB to 100MB per read or write, will greatly reduce the seek overhead during the sort process.
Recently I was asked this question in an interview: find a URL that occurs only once in a large list of URLs, in a single pass. I gave an answer in O(n) time but in two passes. He also asked me how to do the same if the URL list cannot fit into memory. Any help is very much appreciated.
If it all fits in memory, then the problem is simple: Create two sets (choose your favorite data structure), both initially empty. One will contain unique URLs and the other will contain URLs that occur multiple times. Scan the URL list once. For each URL, if it exists in the unique set, remove it from the unique set and put it in the multiple set; otherwise, if it does not exist in the multiple set, add it to the unique set.
If the set does not fit into memory, the problem is difficult. The requirement of O(n) isn't hard to meet, but the requirement of a "single pass" (which seems to exclude random access, among other things) is tough; I don't think it's possible without some constraints on the data. You can use the set approach with a size limit on the sets, but this would be easily defeated by unfortunate orderings of the data and would in any event only have a certain probability (<100%) of finding a unique element if one exists.
EDIT:
If you can design a set data structure that lives in mass storage (so it can be larger than would fit in memory) and can do finds, inserts, and deletes in O(1) (amortized) time, then you can just use that structure with the first approach to solve the second problem. Perhaps all the interviewer was looking for was to dump the URLs into a database with a UNIQUE index on the URL and a count column.
One could try to use a Trie structure for keeping the data. It's compressed, so it would take less memory, since memory is reused for common URL parts.
The loop would look like:
add string s to the trie;
check whether the added string ends at an existing node:
internal node -> compress the path
leaf node -> delete the path
For the "fits-in-memory" case, you could use two hash-tables as follows (pseudocode):
hash-table uniqueTable = <initialization>;
hash-table nonUniqueTable = <initialization>;

for-each url in url-list {
    if (nonUniqueTable.contains(url)) {
        continue;
    }
    else if (uniqueTable.contains(url)) {
        nonUniqueTable.add(url);
        uniqueTable.remove(url);
    }
    else {
        uniqueTable.add(url);
    }
}

if (uniqueTable.size() > 0)
    return uniqueTable.first();
Python based
You have a list - not sure where it's "coming" from, but if you already have it in memory then:
L.sort()
from itertools import groupby
for key, vals in groupby(L):
    if len(list(vals)) == 1:
        print(key)
Otherwise use storage (possibly SQLite):
import sqlite3
db = sqlite3.connect('somefile')
db.execute('create table whatever(key)')
Get your data into that, then execute "select key from whatever group by key having count(*) = 1".
This is actually a classic interview question, and the answer they were expecting was that you first sort the URLs and then do a binary search.
If it doesn't fit in memory, you can do the same thing with a file.
I have started using binary trees in C++, and I must say I really like the idea, and things are clear to me, until I think about storing data on disk in an order that lets me later read a chunk of data instantly.
So far I have stored everything (the nodes) in RAM, but this is just a simple app, not a real-life one. I am not interested in storing the whole binary tree on disk, as that would be useless: you would have to read it all back into memory again. What I am after is a method like the one MySQL uses, for example.
I haven't found any good article on this, so I would appreciate it if someone could include some URLs or books.
The main difference between a B-tree and a B+tree:
- The leaf nodes are linked, for fast sequential reads during lookups. The links can point ascending, descending, or both (as I saw in one IBM DB).
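To make the leaf linking concrete, here is a minimal sketch of a fixed-size on-disk B+tree leaf node (my own layout, not any particular engine's):

#include <cstdint>

// Hypothetical sketch of a fixed-size on-disk B+tree leaf node. Internal nodes
// hold only keys and child page numbers; leaves hold keys, record pointers,
// and links to neighbouring leaves so range scans can walk the leaf level
// sequentially in either direction.
constexpr int kOrder = 128;

#pragma pack(push, 1)
struct LeafNode {
    std::uint16_t num_keys;                // how many slots are in use
    std::int64_t  keys[kOrder];            // search keys, kept sorted
    std::uint64_t record_offsets[kOrder];  // byte offsets of the rows in the data file
    std::uint64_t next_leaf;               // page number of the next leaf (0 = none)
    std::uint64_t prev_leaf;               // optional backward link for descending scans
};
#pragma pack(pop)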
You should write it to disk; if the table or file grows, you will have memory problems.
(SEEK operations on files ARE REALLY FAST. You can create a 1 GB file on disk in less than 1 second; in C#, FileStream with the SetLength method.)
If you manage to have multiple readers/writers, you need concurrency control over the index and the table (or file). Are you going to do that in memory? If a power failure occurs, how do you roll back? You don't.
E.g.: field f1 is indexed.
WHERE 1=1 (no need to access the B+tree; give me everything, order irrelevant)
WHERE 1=1 ORDER BY f1 ASC/DESC (need to access the B+tree; give me everything in ascending/descending order)
WHERE f1>=100 (need to access the B+tree; look up the leaf node where f1 = 100 and return all leaf-node items following the right pointers. If this read is multithreaded, the results will probably come back in a strange order, but that's no problem, since there is no ORDER BY in the clause)
WHERE f1>=100 ORDER BY f1 ASC (need to access the B+tree; look up the leaf node where f1 = 100 and return all leaf-node items following the right pointers. This process shouldn't be multithreaded; following the B+tree, the order comes naturally)
Field f2 is indexed with a B+tree and is of type string.
WHERE name LIKE '%ODD' (internally, the compared value must be reversed and the wildcard moved to the end, i.e. it starts with 'DDO' and ends with anything, run against an index of the reversed strings. 'DDOT' is in that group, so 'TODD' belongs to the result. Tricky, tricky logic ;P)
With this statement,
WHERE name LIKE '%OD%' (has 'OD' in the middle), things get hot :))))
Internally, the result is the UNION of the sub-result for 'OD%' with the sub-result for the reversed pattern 'DO%'. After that, remove the strings that merely start with 'OD' or end with 'OD' without having 'OD' in the middle; otherwise it's a valid result ('ODODODODOD' is a valid result; invalid results are 'ODABCD' and 'ABCDOD').
Consider what I said, and check some more topics if you are going to go deep:
- Fast I/O on files: C# FileStream with the no-buffering flag and the write-through flag on.
- Memory-mapped files / memory views: you can manipulate a huge file in small portions, as needed.
- Indexes: bitmap index, hash index (hash function; perfect hash function; ambiguity of the hash function), sparse index, dense index, B+tree, R-tree, reversed index.
- Multithreading: locks, mutexes, semaphores.
- Transactional considerations (log file, two-phase commit, three-phase commit).
- Locks (database, table, page, record).
- Deadlocks: 3 ways to kill one (the longer conflicting process, the younger conflicting process, or the process that locks more objects). Modern RDBMSs use a mix of the 3.
- SQL parsing (AST).
- Caching recurrent queries.
- Triggers, procedures, views, etc.
- Passing parameters to procedures (you can use an object type ;P).
DON'T LOAD EVERYTHING INTO MEMORY: intelligent solutions load parts as they are needed and release them when they are no longer used. Why? Your DB engine (and your PC) becomes more responsive using less memory. Using a B+tree, looking up a leaf node needs just 2 disk I/Os. Knowing the looked-up value, you get the record's long pointer, SEEK the main file to that position, and read the content. This is very fast. Memory is faster, yes, but can you put 10 GB of a B+tree in memory? If so, how does your DB engine program start to behave? Slowly?
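For that last step (seek the main file to the pointer and read the content), a minimal sketch assuming fixed-size records:

#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical sketch: given the record offset found in the B+tree leaf,
// seek straight to it in the data file and read one fixed-size record.
std::vector<char> read_record(std::ifstream& datafile,
                              std::streamoff record_offset,
                              std::size_t record_size) {
    std::vector<char> record(record_size);
    datafile.seekg(record_offset);                 // one seek...
    datafile.read(record.data(), record.size());   // ...one read
    return record;
}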
Forget binary trees and conventional B-trees: they are academic tutorials. In real life they are replaced by hash tables or B+trees (B PLUS TREE, showing storage and ascending order: http://en.wikipedia.org/wiki/B%2B_tree).
Consider using dataspaces for the DB data, spread over multiple disks. You can parallelize disk I/O performance. Don't forget to mirror them... Each dataspace should have a fragment of the table with a fragment of the index, and a partial log file. You should develop a coordinator which presents the queries wisely to the sub-units.
E.g., with 3 dataspaces:
INSERT INTO etc. ... should only happen in one dataspace,
but
SELECT * FROM TB_XPTO should be presented to all dataspaces.
SELECT * FROM TB_XPTO ORDER BY an indexed field should be presented to all dataspaces. Each dataspace executes the instruction, so now we have 3 subsets, each in its own sub-order.
The results arrive at the coordinator, which reorders them again.
Confusing, BUT FAST!!!!!!
The coordinator should control the master transaction.
If dataspace A committed,
dataspace B committed,
and dataspace C is in an uncommitted state,
the coordinator will roll back C, B and A.
If dataspace A committed,
dataspace B committed,
and dataspace C committed,
the coordinator will commit the overall transaction.
COORDINATOR LOG:
CREATE MASTER TRANSACTION UID 121212, CHILD TRANSACTIONS(1111,2222,3333)
DATA SPACE A LOG
1111 INSERT len byte array
1111 INSERT len byte array
COMMIT 1111
DATA SPACE B LOG
2222 INSERT len byte array
2222 INSERT len byte array
COMMIT 2222
DATA SPACE C LOG
3333 INSERT len byte array
3333 ---> nothing more..... power failure here!!!!!!
On startup the coordinator checks whether the DB was properly closed; if not, it checks its log file. A master commit line like COMMIT 121212 is missing, so it asks the dataspaces about their log consistency.
A and B reply COMMITTED, but C, after checking its log file, detects a failure and replies UNCOMMITTED.
The master coordinator FORCES DATASPACES A, B, C TO ROLL BACK 1111, 2222, 3333.
After that, it rolls back its own master transaction and sets the DB state to OK.
The main points here are speed of inserts, selects, updates, and deletes.
Consider keeping the index well balanced. Many deletes on the index will unbalance it, and an unbalanced index loses performance. Add a heap at the head of the index file for controlling this; some math here would help. If deletes exceed 5% of the records, rebalance the index and reset the counter. If an update touches an indexed field, count it too.
Be smart about which fields to index. If the column is Gender, there are only 2 options (I hope, lol... oops, it can be nullable too...), so a bitmap index applies well. If the distinctness (cardinality) of a field is 100% (all values different), like a sequence applied to a field as Oracle does, or an identity field as SQL Server does, a B+tree applies well. If a field is a geometric type, as in Oracle, an R-tree is best. For strings, a reversed index applies well, or a B+tree if the values are heterogeneous.
Houston, we have a problem...
NULL-valued fields should be considered in the index too. NULL is a value too!!!!
E.g.: WHERE f1 IS NULL
Add some socket functionality: an async TCP/IP server.
- If you delete a record, don't resize the file right away; mark the record as deleted. You should keep some metrics here too: if unused space > x and open transactions = 0, take a database lock, re-allocate the pointers, and then resize the database. When free gaps appear in the DB file, you can try page locks instead of a database lock, so things keep going and no one gets hurt: find the last unlocked page of the DB and lock it for yourself; look for a deleted page that your page can fill; if none is found, release the lock; if one is found, move the page to the new position, fix the pointers, mark the old page as deleted, resize the file, and release the lock. Why so many operations? To keep the log well formed!!!! You can split the page into smaller pages, but then you get fragmentation (argh... we lost speed, commander?). Two algorithms come in here: best-fit and worst-fit... Google them. The best is... using both :P
And if you solve all of this stuff, you can shout out loud: "DAMN, I BUILT A DATABASE... I'M GOING TO NAME IT ORACLE!!!!" ;P