Implementing cdbpp library for string values - c++

I am trying to use the cdbpp library from chokkan, and I am running into problems when I try to adapt it for values of type string.
The original code and documentation can be found here:
http://www.chokkan.org/software/cdbpp/ and the git source code is here: https://github.com/chokkan/cdbpp
This is what I have so far:
In sample.cpp (from where I call the main function), I modified the build() function:
bool build()
{
    // Open a database file for writing (with binary mode).
    std::ofstream ofs(DBNAME, std::ios_base::binary);
    if (ofs.fail()) {
        std::cerr << "ERROR: Failed to open a database file." << std::endl;
        return false;
    }
    try {
        // Create an instance of CDB++ writer.
        cdbpp::builder dbw(ofs);
        // Insert key/value pairs to the CDB++ writer.
        for (int i = 1; i < N; ++i) {
            std::string key = int2str(i);
            const char* val = "foobar"; // string value here
            dbw.put(key.c_str(), key.length(), &val, sizeof(i));
        }
    } catch (const cdbpp::builder_exception& e) {
        // Abort if something went wrong...
        std::cerr << "ERROR: " << e.what() << std::endl;
        return false;
    }
    return true;
}
and in the cdbpp.h file, I modified the put() function as:
void put(const key_t *key, size_t ksize, const value_t *value, size_t vsize)
{
    // Write out the current record.
    std::string temp2 = *value;
    const char* temp = temp2.c_str();
    write_uint32((uint32_t)ksize);
    m_os.write(reinterpret_cast<const char *>(key), ksize);
    write_uint32((uint32_t)vsize);
    m_os.write(reinterpret_cast<const char *>(temp), vsize);
    // Compute the hash value and choose a hash table.
    uint32_t hv = hash_function()(static_cast<const void *>(key), ksize);
    hashtable& ht = m_ht[hv % NUM_TABLES];
    // Store the hash value and offset to the hash table.
    ht.push_back(bucket(hv, m_cur));
    // Increment the current position.
    m_cur += sizeof(uint32_t) + ksize + sizeof(uint32_t) + vsize;
}
Now I get the correct value if the string is at most 3 characters long (e.g. foo returns foo). If it is longer than 3 characters, I get the correct string up to 3 characters followed by garbage (e.g. foobar gives me foo plus a few garbage bytes).
I am a little new to C++ and I would appreciate any help you could give me.

(Moving a possible answer from a comment into a real answer.)
vsize, as passed into put(), is the size of an integer when it should be the length of the value string.
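A minimal sketch of a corrected call, assuming the builder's put() is left as in the unmodified library (i.e. it writes vsize bytes starting at the value pointer):

#include <cstring> // for std::strlen

// Pass the characters themselves and the string's real length, not &val and
// sizeof(i). Add 1 to the length if you also want the terminating '\0' stored.
const char* val = "foobar";
dbw.put(key.c_str(), key.length(), val, std::strlen(val));

With the real length passed in, the reader gets the whole string back instead of only the first sizeof(int) bytes.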

Related

rocksdb merge operator performance is very slow for a large number of keys

I'm trying to figure out why using the merge operator for a large number of keys with rocksdb is very slow.
My program uses a simple associative merge operator (based on upstream StringAppendOperator) that concatenates values using a delimiter for a given key. It takes a very long time to merge all the keys and for the program to finish running.
PS: I built rocksdb from source - latest master. I'm not sure if I'm missing something very obvious.
Here's a minimal reproducible example with about 5 million keys - the number of keys can be adjusted by changing the limit of the for loop. Thank you in advance!
#include <cassert>
#include <filesystem>
#include <iostream>
#include <utility>
#include <rocksdb/db.h>
#include "rocksdb/merge_operator.h"

// Based on: https://github.com/facebook/rocksdb/blob/main/utilities/merge_operators/string_append/stringappend.h#L13
class StringAppendOperator : public rocksdb::AssociativeMergeOperator
{
public:
    // Constructor: specify delimiter
    explicit StringAppendOperator(std::string delim) : delim_(std::move(delim)) {};

    bool Merge(const rocksdb::Slice &key, const rocksdb::Slice *existing_value,
               const rocksdb::Slice &value, std::string *new_value,
               rocksdb::Logger *logger) const override;

    static const char *kClassName() { return "StringAppendOperator"; }
    static const char *kNickName() { return "stringappend"; }
    [[nodiscard]] const char *Name() const override { return kClassName(); }
    [[nodiscard]] const char *NickName() const override { return kNickName(); }

private:
    std::string delim_; // The delimiter is inserted between elements
};

// Implementation for the merge operation (concatenates two strings)
bool StringAppendOperator::Merge(const rocksdb::Slice & /*key*/,
                                 const rocksdb::Slice *existing_value,
                                 const rocksdb::Slice &value, std::string *new_value,
                                 rocksdb::Logger * /*logger*/) const
{
    // Clear the *new_value for writing.
    assert(new_value);
    new_value->clear();
    if (!existing_value)
    {
        // No existing_value. Set *new_value = value
        new_value->assign(value.data(), value.size());
    }
    else
    {
        // Generic append (existing_value != null).
        // Reserve *new_value to correct size, and apply concatenation.
        new_value->reserve(existing_value->size() + delim_.size() + value.size());
        new_value->assign(existing_value->data(), existing_value->size());
        new_value->append(delim_);
        new_value->append(value.data(), value.size());
        std::cout << "Merging " << value.data() << "\n";
    }
    return true;
}

int main()
{
    rocksdb::Options options;
    options.create_if_missing = true;
    options.merge_operator.reset(new StringAppendOperator(","));
    // tried a variety of settings
    options.max_background_compactions = 16;
    options.max_background_flushes = 16;
    options.max_background_jobs = 16;
    options.max_subcompactions = 16;

    rocksdb::DB *db{};
    auto s = rocksdb::DB::Open(options, "/tmp/test", &db);
    assert(s.ok());

    rocksdb::WriteBatch wb;
    for (uint64_t i = 0; i < 2500000; i++)
    {
        wb.Merge("a:b", std::to_string(i));
        wb.Merge("c:d", std::to_string(i));
    }
    db->Write(rocksdb::WriteOptions(), &wb);
    db->Flush(rocksdb::FlushOptions());

    rocksdb::ReadOptions read_options;
    rocksdb::Iterator *it = db->NewIterator(read_options);
    for (it->SeekToFirst(); it->Valid(); it->Next())
    {
        std::cout << it->key().ToString() << " --> " << it->value().ToString() << "\n";
    }
    delete it;
    delete db;
    std::filesystem::remove_all("/tmp/test");
    return 0;
}
I shared your question in the Speedb Hive on Discord.
The reply is from Hilik, our co-founder and chief scientist:
'Merge operators are very useful to get a quick write response time; alas, reads require reading the original value and applying the merges in order. This operation may be very expensive, especially with strings that need to be copied and appended on each merge. The simplest way to resolve this is eventually to use read-modify-write. Doing this at the application level is possible but may be problematic (if two threads can do this operation concurrently). We are thinking of ways to resolve this during compaction and are willing to work with you on a PR...'
Hope this helps. Join the Discord server to participate in this discussion and many other interesting and related topics.
Here is the link to the discussion about your topic
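To illustrate the application-level read-modify-write idea from the reply, here is a rough sketch (my own, not from RocksDB or Speedb); it assumes all writers go through this helper and accepts slower writes in exchange for cheap reads:

#include <mutex>
#include <string>

// Hypothetical replacement for wb.Merge(key, value): read the current value,
// append in application code, and write the whole string back. The mutex
// serializes concurrent updates to the same key, which the reply warns about.
std::mutex update_mutex;

void append_value(rocksdb::DB* db, const std::string& key,
                  const std::string& value, const std::string& delim)
{
    std::lock_guard<std::mutex> lock(update_mutex);
    std::string current;
    rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key, &current);
    if (s.IsNotFound()) {
        current = value;           // first value for this key
    } else {
        current += delim + value;  // append in application code
    }
    db->Put(rocksdb::WriteOptions(), key, current);
}

Compared to Merge(), every update now pays a Get() and a Put(), but reads and iteration no longer have to replay a long chain of merge operands.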

mbedTLS pk parse error

Could anyone help me find out why I get -16000 (bad input data) when attempting to parse a public/private key from an unsigned char*?
Here's my code (edited for brevity):
DataPoint* GetPublicKey(mbedtls_pk_context* pppctx)
{
    unsigned char* PKey = new unsigned char[16000];
    if (mbedtls_pk_write_pubkey_pem(pppctx, PKey, 16000) != 0)
    {
        delete[] PKey;
        return NULL;
    }
    DataPoint* Out = new DataPoint(strlen((char*)PKey) + 1); // Initializes an internal unsigned char* and size_t with the length of the key and the null byte
    memcpy(Out->Data, PKey, Out->Length);
    delete[] PKey;
    return Out;
}

void GenRSA(mbedtls_rsa_context* rs)
{
    mbedtls_rsa_gen_key(rs, mbedtls_ctr_drbg_random, &dctx, 2048, 65537);
}

int main()
{
    mbedtls_pk_context pctx;
    mbedtls_pk_init(&pctx);
    mbedtls_pk_setup(&pctx, mbedtls_pk_info_from_type(MBEDTLS_PK_RSA));
    DataPoint* Key = GetPublicKey(&some_context_with_GenRSA_called);
    cout << mbedtls_pk_parse_public_key(&pctx, Key->Data, Key->Length) << endl; // Returns -16000
    return 0;
}
The same thing happens with the private key. What am I doing wrong?
The docs for mbedtls_pk_parse_public_key say:
On entry, ctx must be empty, either freshly initialised with mbedtls_pk_init() or reset with mbedtls_pk_free().
Your pseudo-code calls mbedtls_pk_setup on pctx. Perhaps this is the problem?
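A minimal sketch of what the docs ask for, parsing into a context that has only been initialised (the DataPoint usage follows the question; nothing else is set up on the parsing context):

mbedtls_pk_context parse_ctx;
mbedtls_pk_init(&parse_ctx); // freshly initialised and still "empty": no mbedtls_pk_setup() call
int ret = mbedtls_pk_parse_public_key(&parse_ctx, Key->Data, Key->Length);
// Key->Length must include the terminating '\0' of the PEM string, which
// GetPublicKey() above already accounts for.
cout << ret << endl;
mbedtls_pk_free(&parse_ctx);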
Can you check with other converters such as https://superdry.apphb.com/tools/online-rsa-key-converter to see if they can parse your PEM?

C++ Returning results from several threads into an array

I have a pattern-matching program which takes a string as input and returns a string closely matched by a dictionary. Since the algorithm takes several seconds to run one match query, I am attempting to use multi-threading to run batch queries.
I first read in a file containing a list of queries, and for each query I dispatch a new thread to perform the matching algorithm, collecting the results into an array using pthread_join.
However, I'm getting some inconsistent results. For example, if my query file contains the terms "red, green, blue", I may receive "red, green, green" as the result. Another run may produce the correct "red, green, blue". It appears to sometimes write over a result in the array, but why would this happen when each array slot is set according to the thread index?
Dictionary dict; // global, which performs the matching algorithm

void *match_worker(void *arg) {
    char* temp = (char *)arg;
    string strTemp(temp);
    string result = dict.match(strTemp);
    return (void *)(result.c_str());
}

void run(const string& queryFilename) {
    // read in query file
    vector<string> queries;
    ifstream inquery(queryFilename.c_str());
    string line;
    while (getline(inquery, line)) {
        queries.push_back(line);
    }
    inquery.close();

    pthread_t threads[queries.size()];
    void *results[queries.size()];
    int rc;
    size_t i;
    for (i = 0; i < queries.size(); i++) {
        rc = pthread_create(&threads[i], NULL, match_worker, (void *)(queries[i].c_str()));
        if (rc) {
            cout << "Failed pthread_create" << endl;
            exit(1);
        }
    }
    for (i = 0; i < queries.size(); i++) {
        rc = pthread_join(threads[i], &results[i]);
        if (rc) {
            cout << "Failed pthread_join" << endl;
            exit(1);
        }
    }
    for (i = 0; i < queries.size(); i++) {
        cout << (char *)results[i] << endl;
    }
}

int main(int argc, char* argv[]) {
    string queryFilename = argv[1];
    dict.init();
    run(queryFilename);
    return 0;
}
Edit: As suggested by Zac, I modified the thread to explicitly put the result on the heap:
void *match_worker(void *arg) {
    char* temp = (char *)arg;
    string strTemp(temp);
    int numResults = 1;
    cout << "perform match for " << strTemp << endl;
    string result = dict.match(strTemp, numResults);
    string* tmpResult = new string(result);
    return (void *)((*tmpResult).c_str());
}
In that case, though, where would I put the delete calls? If I try putting the following at the end of the run() function, it gives an invalid pointer error.
for (i = 0; i < queries.size(); i++) {
    delete (char*)results[i];
}
Without debugging it, my guess is that it has something to do with the following:
void *match_worker(void *arg)
{
    char* temp = (char *)arg;
    string strTemp(temp);
    string result = dict.match(strTemp); // create an automatic
    return (void *)(result.c_str());     // return the automatic ... but it gets destructed right after this!
}
So when the next thread runs, it writes over the same memory location you are pointing to (by chance), and you are inserting the same value twice (not writing over it).
You should put the result on the heap to ensure it does not get destroyed between the time your thread exits and you store it in your main thread.
With your edit, you are trying to mix things up a bit too much. I've fixed it below:
void *match_worker(void *arg)
{
    char* temp = (char *)arg;
    string strTemp(temp);
    int numResults = 1;
    cout << "perform match for " << strTemp << endl;
    string result = dict.match(strTemp, numResults);
    string* tmpResult = new string(result);
    return (void *)(tmpResult); // just return the pointer to the std::string object
}
Declare results as
// this shouldn't compile
//void* results[queries.size()];
std::string** results = new std::string*[queries.size()];
for (int i = 0; i < queries.size(); ++i)
{
    results[i] = NULL; // initialize pointers in the array
}
When you clean up the memory:
for (i = 0; i < queries.size(); i++)
{
    delete results[i];
}
delete [] results; // delete the results array
That said, you would have a much easier time if you used the C++11 threading templates instead of mixing the C pthread library and C++.
The problem is caused by the lifetime of the local variable result and of the data returned by the member function result.c_str(). You make this task unnecessarily difficult by mixing C with C++. Consider using C++11 and its threading library; it makes the task much easier:
#include <future>
#include <iostream>
#include <string>
#include <vector>

std::string match_worker(const std::string& query);

void run(const std::vector<std::string>& queries)
{
    std::vector<std::future<std::string>> results;
    results.reserve(queries.size());
    for (auto& query : queries)
        results.emplace_back(
            std::async(std::launch::async, match_worker, query));
    for (auto& result : results)
        std::cout << result.get() << '\n';
}
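For completeness, a hypothetical driver for the snippet above, assuming the global dict and the query-file handling from the original program:

#include <fstream> // in addition to the headers above

int main(int argc, char* argv[])
{
    if (argc < 2) return 1;
    dict.init(); // the global Dictionary from the original program
    std::vector<std::string> queries;
    std::ifstream inquery(argv[1]);
    for (std::string line; std::getline(inquery, line); )
        queries.push_back(line);
    run(queries);
    return 0;
}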

Insert an array of tables into one table SQLite C/C++

I made my own database format, but it sadly required too much memory, its size got horrendous, and upkeep was horrible.
So I'm looking for a way to store an array of structs that lives inside an object into a table.
I'm guessing I need to use a blob, but all other options are welcome. An easy way to implement a blob would be helpful as well.
I've attached my saving code and related structures (updated from my earlier post).
#include "stdafx.h"
#include <string>
#include <stdio.h>
#include <vector>
#include "sqlite3.h"
using namespace std;
struct PriceEntry{
float cardPrice;
string PriceDate;
int Edition;
int Rarity;
};
struct cardEntry{
string cardName;
long pesize;
long gsize;
vector<PriceEntry> cardPrices;
float vThreshold;
int fav;
};
vector<cardEntry> Cards;
void FillCards(){
int i=0;
int j=0;
char z[32]={0};
for(j=0;j<3;j++){
cardEntry tmpStruct;
sprintf(z, "Card Name: %d" , i);
tmpStruct.cardName=z;
tmpStruct.vThreshold=1.00;
tmpStruct.gsize=0;
tmpStruct.fav=1;
for(i=0;i<3;i++){
PriceEntry ss;
ss.cardPrice=i+1;
ss.Edition=i;
ss.Rarity=i-1;
sprintf(z,"This is struct %d", i);
ss.PriceDate=z;
tmpStruct.cardPrices.push_back(ss);
}
tmpStruct.pesize=tmpStruct.cardPrices.size();
Cards.push_back(tmpStruct);
}
}
int SaveCards(){
// Create an int variable for storing the return code for each call
int retval;
int CardCounter=0;
int PriceEntries=0;
char tmpQuery[256]={0};
int q_cnt = 5,q_size = 256;
sqlite3_stmt *stmt;
sqlite3 *handle;
retval = sqlite3_open("sampledb.sqlite3",&handle);
if(retval)
{
printf("Database connection failed\n");
return -1;
}
printf("Connection successful\n");
//char create_table[100] = "CREATE TABLE IF NOT EXISTS users (uname TEXT PRIMARY KEY,pass TEXT NOT NULL,activated INTEGER)";
char create_table[] = "CREATE TABLE IF NOT EXISTS Cards (CardName TEXT, PriceNum NUMERIC, Threshold NUMERIC, Fav NUMERIC);";
retval = sqlite3_exec(handle,create_table,0,0,0);
printf( "could not prepare statemnt: %s\n", sqlite3_errmsg(handle) );
for(CardCounter=0;CardCounter<Cards.size();CardCounter++){
char Query[512]={0};
for(PriceEntries=0;PriceEntries<Cards[CardCounter].cardPrices.size();PriceEntries++){
//Here is where I need to find out the process of storing the vector of PriceEntry for Cards then I can modify this loop to process the data
}
sprintf(Query,"INSERT INTO Cards VALUES('%s', %d, %f, %d)",
Cards[CardCounter].cardName.c_str(),
Cards[CardCounter].pesize,
Cards[CardCounter].vThreshold,
Cards[CardCounter].fav); //My insert command
retval = sqlite3_exec(handle,Query,0,0,0);
if(retval){
printf( "Could not prepare statement: %s\n", sqlite3_errmsg(handle) );
}
}
// Insert first row and second row
sqlite3_close(handle);
return 0;
}
I tried Googling, but the results didn't help.
You have two types here: Cards and PriceEntries. And for each Card there can be many PriceEntries.
You can store Cards in one table, one Card per row. But you're puzzled about how to store the PriceEntries, right?
What you'd normally do here is have a second table for PriceEntries, keyed off a unique column (or columns) of the Cards table. I guess the CardName is unique to each card? Let's go with that. So your PriceEntry table would have a column CardName, followed by columns of PriceEntry information. You'll have a row for each PriceEntry, even if there are duplicates in the CardName column.
The PriceEntry table might look like:
CardName | Some PE value | Some other PE value
Ace | 1 | 1
Ace | 1 | 5
2 | 2 | 3
and so on. So when you want to find the array of PriceEntries for a card, you'd do
select * from PriceEntry where CardName = 'Ace'
And from the example data above you'd get back 2 rows, which you could shove into an array (if you wanted to).
No need for BLOBs!
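A rough sketch of that second table, reusing handle, Cards and the loop counters from SaveCards() in the question (the column names are just illustrative):

char create_pe[] = "CREATE TABLE IF NOT EXISTS PriceEntry ("
                   "CardName TEXT, Price NUMERIC, PriceDate TEXT, Edition NUMERIC, Rarity NUMERIC);";
retval = sqlite3_exec(handle, create_pe, 0, 0, 0);

// One row per PriceEntry, keyed by the card's name.
for (CardCounter = 0; CardCounter < Cards.size(); CardCounter++) {
    for (PriceEntries = 0; PriceEntries < Cards[CardCounter].cardPrices.size(); PriceEntries++) {
        const PriceEntry& pe = Cards[CardCounter].cardPrices[PriceEntries];
        char Query[512] = {0};
        sprintf(Query, "INSERT INTO PriceEntry VALUES('%s', %f, '%s', %d, %d)",
                Cards[CardCounter].cardName.c_str(), pe.cardPrice,
                pe.PriceDate.c_str(), pe.Edition, pe.Rarity);
        retval = sqlite3_exec(handle, Query, 0, 0, 0);
    }
}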
This is a simple serialization and deserialization system. The class PriceEntry has been extended with serialization support (very simply). Now all you have to do is serialize a PriceEntry (or a set of them) to binary data and store it in a blob column. Later on, you get the blob data and from that deserialize a new PriceEntry with the same values. An example of how it is used is given at the bottom. Enjoy.
#include <iostream>
#include <vector>
#include <string>
#include <cstring> // for memcpy
using std::vector;
using std::string;
// deserialization archive
struct iarchive
{
explicit iarchive(vector<unsigned char> data)
: _data(data)
, _cursor(0)
{}
void read(float& v) { read_var(v); }
void read(int& v) { read_var(v); }
void read(size_t& v) { read_var(v); }
void read(string& v) { read_string(v); }
vector<unsigned char> data() { return _data; }
private:
template <typename T>
void read_var(T& v)
{
// todo: check that the cursor will not be past-the-end after the operation
// read the binary data
std::memcpy(reinterpret_cast<void*>(&v), reinterpret_cast<const void*>(&_data[_cursor]), sizeof(T));
// advance the cursor
_cursor += sizeof(T);
}
inline
void
read_string(string& v)
{
// get the array size
size_t sz;
read_var(sz);
// get alignment padding
size_t padding = sz % 4;
if (padding == 1) padding = 3;
else if (padding == 3) padding = 1;
// todo: check that the cursor will not be past-the-end after the operation
// resize the string
v.resize(sz);
// read the binary data
std::memcpy(reinterpret_cast<void*>(&v[0]), reinterpret_cast<const void*>(&_data[_cursor]), sz);
// advance the cursor
_cursor += sz + padding;
}
vector<unsigned char> _data; // archive data
size_t _cursor; // current position in the data
};
// serialization archive
struct oarchive
{
void write(float v) { write_var(v); }
void write(int v) { write_var(v); }
void write(size_t v) { write_var(v); }
void write(const string& v) { write_string(v); }
vector<unsigned char> data() { return _data; }
private:
template <typename T>
void write_var(const T& v)
{
// record the current data size
size_t s(_data.size());
// enlarge the data
_data.resize(s + sizeof(T));
// store the binary data
std::memcpy(reinterpret_cast<void*>(&_data[s]), reinterpret_cast<const void*>(&v), sizeof(T));
}
void write_string(const string& v)
{
// write the string size
write(v.size());
// get alignment padding
size_t padding = v.size() % 4;
if (padding == 1) padding = 3;
else if (padding == 3) padding = 1;
// record the data size
size_t s(_data.size());
// enlarge the data
_data.resize(s + v.size() + padding);
// store the binary data
std::memcpy(reinterpret_cast<void*>(&_data[s]), reinterpret_cast<const void*>(&v[0]), v.size());
}
vector<unsigned char> _data; /// archive data
};
struct PriceEntry
{
PriceEntry()
{}
PriceEntry(iarchive& in) // <<< deserialization support
{
in.read(cardPrice);
in.read(PriceDate);
in.read(Edition);
in.read(Rarity);
}
void save(oarchive& out) const // <<< serialization support
{
out.write(cardPrice);
out.write(PriceDate);
out.write(Edition);
out.write(Rarity);
}
float cardPrice;
string PriceDate;
int Edition;
int Rarity;
};
int main()
{
// create a PriceEntry
PriceEntry x;
x.cardPrice = 1;
x.PriceDate = "hi";
x.Edition = 3;
x.Rarity = 0;
// serialize it
oarchive out;
x.save(out);
// create a deserializer archive, from serialized data
iarchive in(out.data());
// deserialize a PriceEntry
PriceEntry y(in);
std::cout << y.cardPrice << std::endl;
std::cout << y.PriceDate << std::endl;
std::cout << y.Edition << std::endl;
std::cout << y.Rarity << std::endl;
}
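If you do go the blob route instead, here is a hypothetical way to store the serialized bytes with SQLite, assuming a Prices BLOB column and the sqlite3* handle from the question:

std::vector<unsigned char> bytes = out.data(); // serialized PriceEntry data from above
sqlite3_stmt* stmt = NULL;
sqlite3_prepare_v2(handle,
                   "INSERT INTO Cards(CardName, Prices) VALUES(?, ?)",
                   -1, &stmt, NULL);
sqlite3_bind_text(stmt, 1, "Ace", -1, SQLITE_TRANSIENT);
sqlite3_bind_blob(stmt, 2, bytes.empty() ? NULL : &bytes[0],
                  (int)bytes.size(), SQLITE_TRANSIENT);
sqlite3_step(stmt);
sqlite3_finalize(stmt);

Reading it back is the reverse: sqlite3_column_blob() and sqlite3_column_bytes() give you the pointer and size to build the vector that you pass to iarchive.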

C++ program running out of memory for large data

I am trying to solve an issue in a C++ program I wrote. I am basically running out of memory. The program is a cache simulator. There is a file which has memory addresses collected beforehand, like this:
Thread Address Type Size Instruction Pointer
0 0x7fff60000000 1 8 0x7f058c482af3
There can be 100-500 billion such entries. First, I am trying to read all those entries and store them in a vector. While reading, I also build up a set of these addresses (using a map) and store the sequence numbers of each particular address. Sequence number simply means the position of the address entry in the file (one address can be seen multiple times). For large inputs the program fails while doing this, with a bad_alloc error at around the 30 millionth entry, so I guess I am running out of memory. Please advise on how I can circumvent the problem. Is there an alternative way to handle this kind of large data? Thank you very much! Sorry for the long post; I wanted to give some context and the actual code I am writing.
Below is the relevant code. ParseTraceFile() reads each line and calls StoreTokens(), which extracts the address and size and calls AddAddress(), which actually stores the address in a vector and a map. The class declaration is also given below. The first try block in AddAddress() is what actually throws the bad_alloc exception.
void AddressList::ParseTraceFile(const char* filename) {
std::ifstream in_file;
std::cerr << "Reading Address Trace File..." << std::endl;
in_file.exceptions(std::ifstream::failbit | std::ifstream::badbit);
char *contents = NULL;
try {
in_file.open(filename, std::ifstream::in | std::ifstream::binary);
in_file.seekg(0, std::ifstream::end);
std::streampos length(in_file.tellg());
if (length < 0) {
std::cerr << "Can not read input file length" << std::endl;
throw ExitException(1);
}
contents = (new char[length]);
in_file.seekg(0, std::ifstream::beg);
in_file.read(contents, length);
in_file.close();
uint64_t linecount = 0, i = 0, lastline = 0, startline = 0;
while (i < static_cast<uint64_t>(length)) {
if ((contents[i] == '\n') or (contents[i] == EOF)) {
contents[i] = '\0';
lastline = startline;
startline = i + 1;
++linecount;
if (linecount > 1) {
StoreTokens((contents + lastline), &linecount);
}
}
++i;
}
} catch (std::bad_alloc& e) {
delete [] contents;
std::cerr << "error allocating memory while parsing" << std::endl;
throw;
} catch (std::ifstream::failure &exc1) {
if (!in_file.eof()) {
delete[] contents;
std::cerr << "error in reading address trace file" << exc1.what()
<< std::endl;
throw ExitException(1);
}
}
std::cerr << "Done" << std::endl;
}
//=========================================================
void AddressList::StoreTokens(char* line, uint64_t * const linecount) {
uint64_t address, size;
char *token = strtok(line, " \t");
uint8_t tokencount = 0;
while (NULL != token) {
++tokencount;
switch (tokencount) {
case 1:
break;
case 2:
address = strtoul(token, NULL, 16);
break;
case 3:
break;
case 4:
size = strtoul(token, NULL, 0);
break;
case 5:
break;
default:
break;
}
token = strtok(NULL, " \t");
}
AddAddress(address, size);
}
//================================================================
void AddressList::AddAddress(const uint64_t& byteaddr, const uint64_t& size) {
//allocate memory for the address vector
try {
if ((sequence_no_ % kReserveCount) == 0) address_list_.reserve(kReserveCount);
} catch (std::bad_alloc& e) {
std::cerr
<< "error allocating memory for address trace vector, address count"
<< sequence_no_ << std::endl;
throw;
}
uint64_t offset = byteaddr & (CacheParam::Instance()->LineSize() - 1);
//lineaddress = byteaddr >> CacheParam::Instance()->BitsForLine();
// this try block is for allocating memory for the address set and the queue it holds
try {
// splitter
uint64_t templinesize = 0;
do {
Address temp_addr(byteaddr + templinesize);
address_list_.push_back(temp_addr);
address_set_[temp_addr.LineAddress()].push(sequence_no_++);
templinesize = templinesize + CacheParam::Instance()->LineSize();
} while (size + offset > templinesize);
} catch (std::bad_alloc& e) {
address_list_.pop_back();
std::cerr
<< "error allocating memory for address trace set, address count"
<< sequence_no_ << std::endl;
throw;
}
}
//======================================================
typedef std::queue<uint64_t> TimeStampQueue;
typedef std::map<uint64_t, TimeStampQueue> AddressSet;
class AddressList {
public:
AddressList(const char* tracefilename);
bool Simulate(uint64_t *hit_count, uint64_t* miss_count);
~AddressList();
private:
void AddAddress(const uint64_t& byteaddr, const uint64_t& size);
void ParseTraceFile(const char* filename);
void StoreTokens(char* line, uint64_t * const linecount);
std::vector<Address> address_list_;
AddressSet address_set_;
uint64_t sequence_no_;
CacheMemory cache_;
AddressList (const AddressList&);
AddressList& operator=(const AddressList&);
};
The output is like this:
Reading Cache Configuration File...
Cache parameters read...
Reading Address Trace File...
error allocating memory for address trace set, address count 30000000
error allocating memory while parsing
As it seems your datasets will be much larger than your memory, you will have to build an on-disk index. It is probably easiest to import the whole thing into a database and let it build the indexes for you.
A map sorts its input while it is being populated, to optimize lookup times and to provide a sorted output. It sounds like you aren't using the lookup feature, so the optimal strategy is to sort the list using another method. Merge sorting is fantastic for sorting collections that don't fit into memory. Even if you are doing lookups, a binary search into a sorted file will be faster than a naive approach as long as each record is a fixed size.
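As an illustration of that last point, here is a minimal sketch (not from the question's code) of a binary search over a file of sorted, fixed-size records; the Record layout is hypothetical:

#include <cstdio>
#include <cstdint>

struct Record {              // hypothetical fixed-size record, file sorted by address
    uint64_t address;
    uint64_t sequence_no;
};

// Returns true and fills 'out' if 'target' is present in the sorted file.
bool find_record(std::FILE* f, uint64_t target, Record& out) {
    std::fseek(f, 0, SEEK_END);
    long count = std::ftell(f) / (long)sizeof(Record);
    long lo = 0, hi = count - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        std::fseek(f, mid * (long)sizeof(Record), SEEK_SET);
        if (std::fread(&out, sizeof(Record), 1, f) != 1) return false;
        if (out.address == target) return true;
        if (out.address < target) lo = mid + 1;
        else hi = mid - 1;
    }
    return false;
}

Each probe costs one seek and one read, so a lookup touches O(log n) records even when the file is far larger than RAM.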
Forgive me for stating what may be obvious, but the need to store and query large amounts of data in an efficient manner is the exact reason databases were invented. They have already solved all these issues in a better way than you or I could come up with in a reasonable amount of time. No need to reinvent the wheel.