I am working on an application written in C++ that uses the 32-bit Instant Client drivers to access an Oracle database. The application uses the Record Field Exchange (RFX) methods to update the columns in the database tables. The database schema cannot be modified.
The C++ code was originally written to handle OID values as doubles because the OID column in the database is NUMBER(14), so a regular int won't be big enough. However, this leads to the database occasionally selecting a bad execution plan where it takes the OID values sent from the application and uses the to_binary_double function on them, rather than converting them to BIGINT. If this happens, the database does not do an index search over the data and instead does a full table scan.
We tried switching the OIDs to be type __int64 in the application, but there was an issue with the ODBC driver not supporting the BigInt type (or long long in C++). Similarly, when we tried to make the OIDs into longs, the database or the driver gave an error that the values sent to the database were too big for the column.
Working with the OIDs as Strings in C++ will work, but the database will never use the optimal index search because it has to convert the String to an integer before it can do any data retrieval. Because of this, we're just better off using the doubles we already have.
Does anyone have an idea of what we can do next? It is not the end of the world if we have to keep using doubles as before, but we were hoping to eliminate the chance for the database to run slowly.
We actually went with the "Convert all the OIDs to Strings in the C++ code" option. It turns out the database was still able to run an indexed search after converting the OIDs from Strings to integers. It would have been better if we switched ODBC driver to one that could handle BigInt, but that wasn't really an option for us so this will suffice.
I would like to persist some of my objects in a database (this could be relational (postgresql or MariaDB) or MongoDB). I have found a number of libraries that seem potentially useful, but I am missing the overall picture.
I have used boost::serialization serialize c++ to xml / binary, but it is not clear to me how to get this into the database (do I use the binary or xml format?)?
How do I get this into my mongoDB or postgresql?
You'd serialize to binary, as it is smaller and much faster. Also, the XML format isn't really pretty/easy to use outside of Boost Serialization anyways.
WARNING: Use Boost Portable Archive (EPA http://epa.codeplex.com/) if you need to use the format across different machines.
You'd usually store it in a column
text or CLOB (character large object) by encoding in base64 and putting that in the Database native charset (base64 is safe even for ASCII)
BLOB (binary large object) which doesn't bring the need to encode and could be more efficient storage wise.
Note: if you need to index, store the index properties in normal database columns.
Finally, if you like, I have recently made a streambuffer that allows you to stream data directly into a Sqlite BLOB column. Perhaps you can glean some ideas from this you could use:
How to boost::serialize into a sqlite::blob?
I have a software project that I am working on at work that has been driving me crazy. Here's our problem: we have a series data contacts that need to be logged every second. It needs to include time, bearing (array of 360-1080 bytes), range, and a few other fields. Our system also needs the capability to store this data for up to 30 days. In practice, there can be up to 100 different contacts, so at a maximum, there can be anywhere from around 150,000,000 points to about 1,000,000,000 different points in 30 days.
I'm trying to think of the best method for storing all of this data and retrieving later on. My first thought was to use some RDBMS like MySQL. Being a embedded C/C++ programmer, I have very little experience working with MySQL with such large data sets. I've dabbled with it on small datasets, but nothing nearly as large. I generated the below schema for two tables that will store some of the data:
CREATE TABLE IF NOT EXISTS `HEADER_TABLE` (
`header_id` tinyint(3) unsigned NOT NULL auto_increment,
`sensor` varchar(10) NOT NULL,
`bytes` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`header_id`),
UNIQUE KEY `header_id_UNIQUE` (`header_id`),
UNIQUE KEY `sensor_UNIQUE` (`sensor`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `RAW_DATA_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`transverse` bit(1) NOT NULL default b'0',
`data` varbinary(1080) NOT NULL,
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
UNIQUE KEY `internal_id_UNIQUE` (`internal_id`),
KEY `time` (`time_sec`)
KEY `internal_id` (`internal_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `rel_RASTER_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`raster_id` int(10) unsigned NOT NULL,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`header_id` tinyint(3) unsigned NOT NULL,
`data_id` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
The header table only contains 10 rows and is static. It just tells what sensor the raw data came from, and the number of bytes output by that type of sensor. The RAW_DATA_TABLE essentially stores the raw bearing data (an array of 360-1080 bytes, it represents up to three samples per degree). The rel_RASTER_TABLE holds meta data for the RAW_DATA_TABLE, there can be multiple contacts that refer to the same raw data row. The data_id found in rel_RASTER_TABLE points to the internal_id of some row in the RAW_DATA_TABLE, I did this to decrease the amount of writes needed.
Obviously, as you can probably tell, I'm having performance issues when reading and deleting from this database. An operator to our software can see real time data as it comes across and also go into reconstruction mode and overlay a data range from the past, the past week for example. Our backend logging server grabs the history rows and sends them to a display via a CORBA interface. While all of this is happening, I have a worker thread that deletes 1000 rows at a time for data greater than 30 days. This is there in case a session runs longer than 30 days, which can happen.
The system we currently have implemented works well for smaller sets of data, but not for large sets. Our select and delete statements can take upwards of 2 minutes to return results. This completely kills the performance of our real time consumer thread. I suspect we're not designing our schemas correctly, picking the wrong keys, not optimizing our SQL queries correctly, or some subset of each. Our writes don't see to be affected unless the other operations take too long to run.
Here is an example SQL Query we use to get history data:
SELECT
rel_RASTER_TABLE.time_sec,
rel_RASTER_TABLE.time_nsec,
RAW_DATA_TABLE.transverse,
HEADER_TABLE.bytes,
RAW_DATA_TABLE.data
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
I apologize in advance for this being such a long question, but I've tapped out other resources and this is my last resort. I figure I'd try to be as descriptive as possible Do you guys see of any way I can improve upon our design at first glance? Or, anyway we can optimize our select and delete statements for such large data sets? We're currently running RHEL as the OS and unfortunately can't change our hardware configuration on the server (4 GB RAM, Quad Core). We're using C/C++ and the MySQL API. ANY speed improvements would be EXTREMELY beneficial. If you need me to clarify anything, please let me know. Thanks!
EDIT: BTW, if you can't provide specific help, maybe you can link me to some excellent tutorials you've come across for optimizing SQL queries, schema design, or MySQL tuning?
First thing you could try is de-normalizing the data. On a data set of that size, doing a join, even if you have indexes is going to require very intense computation. Turn those three tables into 1 table. Sure there will be duplicate data, but without joins it will be much easier to work with. Second thing, see if you can get a machine with enough memory to fit the whole table in memory. It doesn't cost much ($1000 or less) for a machine with 24GB of RAM. I'm not sure if that will hold your entire data set, but it will help tremendously Get an SSD as well. For anything that isn't stored in memory, an SSD should help you access it with high speed. And thirdly, look into other data storage technologies such as BigTable that are designed to deal with very large data sets.
I would say partitioning is an absolute must in a case like this:
large amount of data
new data coming in continuously
implicit: old data getting deleted continuously.
Check out this for mySQL.
Looking at your select stmt (which filters on time), I'll say partition on the time column.
Of course you might wanna add a few indexes based on the frequent queries you want to use.
--edit--
I see that many have suggested indexes. My experiences have been that having an index on a table with really large num of rows either kills the performance (eventually) or requires lot of resources (CPU, memory,...) to keep the indexes up to date.
So although I also suggest addition of indexes, please note that it's absolutely useless unless you partition the table first.
Finally, follow symcbean's advise (optimize your indexes in number and keys) when you add indexes.
--edit end--
A quickie on partitioning if you're new to it.
Usually a single table translates to a single data file. A partitioned table translates to one file per partition.
Advantages
insertions are faster as physically it's inserted into a smaller file (partition).
deletion of large number of rows would usually translate to dropping a partition (much much much much cheaper than 'delete from xxx where time > 100 and time < 200');
queries with a where clause on the key by which the table is partitioned is much much faster.
Index building is faster.
I don't have much experience with MySQL, but here are some a priori thoughts that jump to mind.
Is your select in a stored procedure?
The select's predicate is usually searched in the order its asked in. If the data on the disk is reordered to match the primary key, then doing raster id first is fine. You would be paying the cost of reordering on every insert though. If the data is stored in time order on disk, you would probably want to search on time_sec before raster_id.
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
Your indexes don't follow the search predicates.
It will create indexes based on the keys, generally.
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
It may not be using the primary index because you aren't using internal_id. You may want to set internal_id as the primary key and create a separate index based on your search parameters. At least on raster_id and time_sec.
Are the joins too loose?
This may be my inexperience with MySQL, but I expect to see conditions on the joins. Does using FROM here do a natural join? I don't see any foreign keys specified, so I don't know how it would join these tables rationally.
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
Usually when developing something like this I would work with a smaller set and remove predicates to makes sure that each step meets what I expect. If you accidentally cast a wide net up front, then narrow down later you may mask some inefficiencies.
Most query optimizers have a way to output how the optimized, make sure it meets your expectations. One of the comments mention Explain plans, I assume that is what it is called.
Without knowing what all the queries are its difficult to give specific advice, however looking at the single query you have provided, there are no indexes which are idealy suited to resolving this.
In fact the structure is a bit messy - if internal_id is an auto-increment value then it is unique - why add other stuff in the primary key? It looks as if a more sensible structure for rel_RASTER_TABLE would be:
PRIMARY KEY (`internal_id`),
KEY (`raster_id`,`time_sec`,`time_nsec`),
And as for RAW_DATA_TABLE, it should be blindingly obvious that its indexes are far from optimal. And should probably be:
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
KEY `time` (`time_sec`, `time_nsec`)
Note that removing redundant indexes will speed up inserts/updates.
Capturing slow queries should help - and learn how to use 'explain' to see what indexes are redundant / needed.
You may also get a performance boost by tuning the mysql instance - particularly increasing the sort and join buffers - try running mysqltuner
First, I would try to create a view with only the necessary info that needs to be selected between the different tables.
By the way, MySQL is not necessarily the most optimized database system for what you are trying to accomplish... Look into other solutions such Oracle, Microsoft SQL, PostgreSQL etc. Also, the performance will vary depending on the server being used.
What kind of technique does use Google Plus to generate users' unique ids?
Example
https://plus.google.com/102766325060234825733/posts
You can only assume that they are randomly generated ID's that are large enough to be generated non-sequentially with sufficient entropy.
The ID's are too big to be stored in a bigint field which is interesting, again probably due to the required entropy and non-sequential requirement (so that nothing can be inferred by comparing userid's).
A simple encryption of a serially generated number, with a secret key can be used to generate the IDs. It can be a 1 way hash, or a decryptable encryption.
The reason for not using serial numbers directly is obvious: You can easily guess userids of other users on the network, which can result in Bots scraping the content of the network.
I'm writing a database front end for a website. Next to the records I want to include a link likes this:
Record 1 - [Add][1] [Edit][2] [Delete][3]
But I want to protect these links from being used more than once. My thinking is to pass a hash value then store a list of valid HASH values in a table somewhere and only process requests with valid hash values. Is there a better way to do is?
Update: The answer to this question led me to ask this question: What is the difference between a "nonce" and a "GUID"?. Why exactly would you use a nonce instead of a GUID?
Your idea is correct, except that you should use cryptographically secure random bytes (a "nonce") instead of a hash.