Fastest OLEDB read from ORACLE - c++

What would be the fastest way of retrieving data from the Oracle DB via OLEDB?
It should be portable (it has to work on Postgres and MS SQL), and only one column is transferred (the ID from some large table).
Current performance is 100k rows/sec. Am I expecting too much if I want it to go faster?
Clarification:
datatable has 23M records
Query is: SELECT ID FROM OBJECTS
Bottleneck is transfer from oracle to the client software, which is c++/OLEDB

What the heck, I'll take a chance.
Edit: As far as connectivity goes, I HEARTILY recommend:
Oracle Objects for OLE, OO4O for short.
It's made by Oracle for Oracle, not by MS. It uses high-performance native drivers rather than ODBC, which gives a performance boost. I've personally used it on several occasions and it is fast. I was connecting to extremely large DBs and data warehouses where no table had fewer than 2 million records, and most were far larger.
Note you do not need to know OLE to use this. It wraps OLE, hence the name. Conceptually and syntactically, it wraps the "result set" into a dynaset fed by SQL commands. If you've ever used DAO, or ADO you will be productive in 5 minutes.
Here's a more in-depth article.
If you can't use OO4O, then the specialized .NET Data Provider made by Oracle is very good. NOT the one made by MS.
HTH
Use a "WHERE" clause? Example: "select id from objects where id = criteria"
WHERE
This sends only the record of interest across the network. Otherwise all 23 million records are sent across the wire.
OR, look into "between."
"select id from objects where id between thisone and thatone"
BETWEEN
That sends a reduced set of records in the range you specify.
HTH
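As a rough illustration of the BETWEEN idea, here is a minimal Python sketch that reads IDs one range at a time instead of in a single 23M-row result set. It uses an in-memory SQLite table as a stand-in for OBJECTS; the function name iter_id_ranges and the chunk size are assumptions made for the demo, not part of any Oracle or OLEDB API.

```python
import sqlite3


def iter_id_ranges(conn, low, high, chunk):
    """Yield lists of IDs, fetching one BETWEEN range per query."""
    start = low
    while start <= high:
        end = min(start + chunk - 1, high)
        rows = conn.execute(
            "SELECT id FROM objects WHERE id BETWEEN ? AND ? ORDER BY id",
            (start, end),
        ).fetchall()
        yield [r[0] for r in rows]
        start = end + 1


# Demo data: a tiny in-memory table standing in for OBJECTS (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO objects (id) VALUES (?)", [(i,) for i in range(1, 26)])

chunks = list(iter_id_ranges(conn, 1, 25, 10))
```

This only helps if the client really needs a subset at a time; if every ID is required, the total volume crossing the wire stays the same.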

Related

Cloud Spanner - read performance with large number of items in WHERE clause

I'm in the process of evaluating some different data stores for a project and I have a strange but inflexible requirement to check the existence of 1500 keys per query... Basically the only query I'll be running is of the form:
SELECT user_id, name, gender
WHERE user_id in (user1, user2, ..., user1500)
I will have around 3.5 billion rows in the table. One data store that has caught my eye is Spanner. I was wondering if querying the data in this way would be feasible or if I would run into performance issues due to the large number of items in my WHERE clause. I have only been able to test these queries on a small amount of data so far, so I'm leaning more on what the theoretical performance hit might look like instead of having the luxury to just "try and find out".
Also, are there other data stores that might work better for this read pattern? I expect to run no more than 80 queries per second. Also, the data will be bulk loaded on a weekly basis. The data is structured by nature but we don't use it in a relational way (i.e. no joins).
Anyways, sorry if this question is vague in any way. I'm happy to provide more detail if needed.
1500 keys should not be a problem if you use a bound array parameter to specify the keys:
SELECT user_id, name, gender
FROM table
WHERE user_id IN UNNEST(@users)
https://cloud.google.com/spanner/docs/sql-best-practices#write_efficient_queries_for_range_key_lookup

Django model count() with caching

I have a Django application with Prometheus monitoring and a model called Sample.
I want to monitor Sample.objects.count() metric
and cache this value for concrete time interval
to avoid costly COUNT(*) queries in database.
From this tutorial
https://github.com/prometheus/client_python#custom-collectors
I read that I need to write a custom collector.
What is the best approach to achieve this?
Is there any way in Django to
get a cached Sample.objects.count() value and update it after K seconds?
I also use Redis in my application. Should I store this value there?
Should I make a separate thread to update the cached Sample.objects.count() value?
First thing to note is that you don't really need to cache the result of a count(*) query.
Though different RDBMS handle count operations differently, they are slow across the board for large tables. But one thing they have in common is that there is an alternative to SELECT COUNT(*) provided by the RDBMS which is in fact a cached result. Well sort of.
You haven't mentioned what your RDBMS is, so let's see how it works in the popular ones used with Django.
MySQL
Provided you have a primary key on your table and you are using MyISAM, SELECT COUNT(*) is really fast on MySQL and scales well. But chances are that you are using InnoDB, and that's the right storage engine for various reasons. InnoDB is transaction-aware and can't handle COUNT(*) as well as MyISAM, so the query slows down as the table grows.
The count query on a table with 2M records took 0.2317 seconds. The following query took 0.0015 seconds:
SELECT table_rows FROM information_schema.tables
WHERE table_name='for_count';
It reported a value of 1997289 instead of 2 million (for InnoDB, table_rows is an estimate), but that's close enough!
So you don't need your own caching system.
SQLite
SQLite COUNT(*) queries aren't really slow, but they don't scale either. As the table grows, the count query slows down. Using a table similar to the one used in MySQL, SELECT COUNT(*) FROM for_count required 0.042 seconds to complete.
There isn't a shortcut. The sqlite_master table does not provide row counts. Neither does pragma table_info.
You need your own system to cache the result of SELECT COUNT(*).
PostgreSQL
Despite being the most feature-rich open source RDBMS, PostgreSQL isn't good at handling count(*); it's slow and doesn't scale very well. In other words, no different from the poor relations!
The count query took 0.194 seconds on PostgreSQL. On the other hand, the following query took 0.003 seconds:
SELECT reltuples FROM pg_class WHERE relname = 'for_count'
You don't need your own caching system.
SQL Server
The COUNT query on SQL server took 0.160 seconds on average but it fluctuated rather wildly. For all the databases discussed here the first count(*) query was rather slow but the subsequent queries were faster because the file was cached by the operating system.
I am not an expert on SQL Server, so before answering this question I didn't know how to look up the row count using schema info. I found this Q&A helpful. One of the queries I tried produced the result in 0.004 seconds:
SELECT t.name, s.row_count from sys.tables t
JOIN sys.dm_db_partition_stats s
ON t.object_id = s.object_id
AND t.type_desc = 'USER_TABLE'
AND t.name ='for_count'
AND s.index_id = 1
You don't need your own caching system.
Integrate into Django
As can be seen, all databases considered except SQLite provide a built-in 'cached query count'. There isn't a need for us to create one of our own. It's a simple matter of creating a custom manager to make use of this functionality.
from django.db import connection, models


class CustomManager(models.Manager):
    def quick_count(self):
        # Read the engine's stored row estimate instead of running COUNT(*).
        with connection.cursor() as cursor:
            cursor.execute("""SELECT table_rows FROM information_schema.tables
                              WHERE table_name='for_count'""")
            row = cursor.fetchone()
        return row[0]


class Sample(models.Model):
    ...  # model fields elided in the original
    objects = CustomManager()
The above example uses the MySQL query, but the same approach works for PostgreSQL or SQL Server by simply swapping in the corresponding query listed above.
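To make swapping the query less manual, one option is to pick the estimate query from the backend name. The sketch below is an assumption about how that could be wired up, not something from the answer: ESTIMATE_SQL and estimate_query are hypothetical names, and the keys are the strings Django exposes as connection.vendor for its built-in MySQL and PostgreSQL backends. Backends without an entry (for example SQLite) return None, so the caller can fall back to a real COUNT(*).

```python
# Row-estimate queries keyed by Django's connection.vendor string.
# Only the two queries quoted above for built-in backends are listed here.
ESTIMATE_SQL = {
    "mysql": "SELECT table_rows FROM information_schema.tables WHERE table_name = %s",
    "postgresql": "SELECT reltuples FROM pg_class WHERE relname = %s",
}


def estimate_query(vendor, table_name):
    """Return (sql, params) for a cheap row estimate, or None if unsupported."""
    sql = ESTIMATE_SQL.get(vendor)
    if sql is None:
        return None  # caller should fall back to SELECT COUNT(*)
    return sql, [table_name]


mysql_query = estimate_query("mysql", "for_count")
sqlite_query = estimate_query("sqlite", "for_count")
```

Inside a manager method, the result could be passed to cursor.execute(sql, params) with connection.vendor and self.model._meta.db_table as the inputs. Passing the table name as a parameter also avoids hard-coding 'for_count' in the SQL string.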
Prometheus
How to plug this into django prometheus? I leave that as an exercise.
A custom collector that returns the previous value if it's not too old and fetches otherwise would be the way to go. I'd keep it all in-process.
If you're using MySQL you might want to look at the collectors the mysqld_exporter offers as there's some for table size that should be cheaper.
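The "return the previous value if it's not too old, fetch otherwise" idea can be sketched without any Django or Prometheus dependency. CachedValue, fake_count and the 30-second TTL below are hypothetical names and values chosen for the demo; in a real setup the fetch callable would be something like Sample.objects.count, and the collector's collect() method would call get().

```python
import time


class CachedValue:
    """Return a cached value until it is older than ttl seconds, then refetch."""

    def __init__(self, fetch, ttl, clock=time.monotonic):
        self._fetch = fetch      # callable producing a fresh value
        self._ttl = ttl          # maximum age in seconds
        self._clock = clock      # injectable so the demo can fake time
        self._value = None
        self._stamp = None       # time of the last successful fetch

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = self._fetch()
            self._stamp = now
        return self._value


# Demo: a fake clock and a counter standing in for the expensive COUNT(*).
calls = []
fake_now = [0.0]


def fake_count():
    calls.append(1)
    return 100 + len(calls)


cached = CachedValue(fake_count, ttl=30, clock=lambda: fake_now[0])
first = cached.get()    # no value yet, so this fetches
second = cached.get()   # still fresh, served from the cache
fake_now[0] = 31.0
third = cached.get()    # older than ttl, so this fetches again
```

This version is not thread-safe; if several threads can scrape at once, wrapping the body of get() in a threading.Lock would be one way to avoid duplicate fetches.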

Identifying needed statistics - Azure SQL Data Warehouse

Is there any hint or directive that can be used with EXPLAIN of a query on Azure SQL Data Warehouse that would return recommended statistics that were not available for the optimizer? Alternatively is there a tool that can analyze a workload and make any recommendation.
Today, no. Right now the recommendation is to create statistics on every column, as these are needed to create an optimal parallel query plan (i.e. how to move data around between nodes to return a result, since it's an MPP architecture).
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices#maintain-statistics
An example of how to script this out can be found here as well (example H).
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-statistics#examples-create-statistics
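As a small illustration of scripting this out, the Python sketch below generates one single-column CREATE STATISTICS statement per column. It is not the linked example H; the helper name, the stats_<table>_<column> naming scheme, and the dbo.FactSales table with its columns are all made up for the demo.

```python
def create_statistics_statements(schema, table, columns):
    """Build one single-column CREATE STATISTICS statement per column."""
    statements = []
    for column in columns:
        stats_name = f"stats_{table}_{column}"  # hypothetical naming scheme
        statements.append(
            f"CREATE STATISTICS [{stats_name}] ON [{schema}].[{table}] ([{column}]);"
        )
    return statements


# Hypothetical table and column names, for illustration only.
stmts = create_statistics_statements("dbo", "FactSales", ["CustomerKey", "OrderDate"])
```

In practice the column list would come from the catalog views rather than being typed by hand.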
As you know, statistics should be created (according to this article):
on columns involved in JOINs, GROUP BY, HAVING and WHERE clauses.
There are no tools to do this (yet), but if you have access to the EXPLAIN plans they give you certain information. For example the shuffle_columns element lists all columns involved in a SHUFFLE_MOVE:
<shuffle_columns>col;</shuffle_columns>
as well as myriad other information. Review the annotation I did of an Azure SQL Data Warehouse plan here.
Lastly (and I haven't actually done this, I've only been thinking about doing it), you could set up a copy of your database on SQL Server 2016, bearing in mind the syntax differences (e.g. distribution, lack of unique indexes, etc.). This would give you access to certain useful resources like execution plans, including index suggestions, and certain trace flags which tell you what stats were used. That said, the database engines and indexing are really different, so I don't know how worthwhile this might be. I'll post back if I progress my thinking on this. I do find the question "Why is this query going slow?" much harder to answer on this platform than on ordinary "box product" SQL Server because the tools aren't as mature yet.

Generating efficient fast reports on amounts of data on AWS

I'm really confused about how or what AWS services to use for my case.
I have a web application which stores user interaction events. Currently these events are stored in an RDS table. Each event contains about 6 fields such as timestamp, event type, userID, pageID, etc. Currently I have millions of event records in each account schema. When I try to generate reports out of this raw data, the reports are extremely slow since I run complex aggregation queries over long time periods. A report covering 30 days might take 4 minutes to generate on RDS.
Is there any way to make these reports run MUCH faster? I was thinking about storing the events in DynamoDB, but I cannot run such complex queries on the data or do any attribute-based sorting.
Is there a good service combination to achieve this? Maybe using Redshift, EMR, Kinesis?
I think Redshift is your solution.
I'm working with a dataset that generates about 2,000,000 new rows each day, and I run really complex operations on it. You could take advantage of Redshift sort keys and order your data by date.
Also, if you use complex aggregate functions, I really recommend denormalizing all the information and inserting it into a single table. Redshift uses very efficient, automatic column compression, so you won't have problems with the size of the dataset.
My usual solution to problems like this is to have a set of routines that roll up and store the aggregated results, at various levels, in additional RDS tables. The transactional information you are storing isn't likely to change once logged. So, for example, if you find yourself running daily/weekly/monthly rollups of various slices of data, run the query and store those results, not necessarily at the final level that you will need, but at a level that significantly reduces the number of rows that go into those eventual rollups. For example, have a daily table that summarizes eventtype, userid and pageId with one row per day instead of one row per event (or one row per hour instead of per day). You'll need to figure out the most logical rollups to make, but you get the idea: the goal is to pre-summarize at the levels that reduce the amount of raw data but still give you plenty of flexibility to serve your reports.
You can always go back to the granular/transactional data as long as you keep it around, but there is not much to be gained by constantly calculating the same results every time you want to use the data.
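The rollup approach can be sketched end to end with an in-memory SQLite database standing in for RDS. The table names events and daily_rollup, the column names, and the sample rows are all made up for the demo; the point is only that the report query reads the small pre-summarized table rather than the raw events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (ts TEXT, event_type TEXT, user_id INTEGER, page_id INTEGER);
CREATE TABLE daily_rollup (
    day TEXT, event_type TEXT, user_id INTEGER, page_id INTEGER,
    event_count INTEGER,
    PRIMARY KEY (day, event_type, user_id, page_id)
);
""")

# Made-up raw events: one row per user interaction.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01 09:00:00", "click", 1, 10),
        ("2024-01-01 09:05:00", "click", 1, 10),
        ("2024-01-01 10:00:00", "view", 2, 10),
        ("2024-01-02 11:00:00", "click", 1, 10),
    ],
)

# Rollup step: one row per (day, event_type, user_id, page_id).
conn.execute("""
INSERT INTO daily_rollup
SELECT date(ts), event_type, user_id, page_id, COUNT(*)
FROM events
GROUP BY date(ts), event_type, user_id, page_id
""")

# Report step: aggregate the rollup table instead of the raw events.
report = conn.execute("""
SELECT event_type, SUM(event_count)
FROM daily_rollup
GROUP BY event_type
ORDER BY event_type
""").fetchall()
rollup_rows = conn.execute("SELECT COUNT(*) FROM daily_rollup").fetchone()[0]
```

In a real system the rollup step would run on a schedule and only process the days not yet summarized.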

Container for in-memory representation of a DB table

Let's say I have a (MySQL) DB. I want to automate the update of this database via an application, that will:
1. Import from DB
2. Calculate updated data
3. Export back updated data
The timing is important, I don't want to import while calculating, in fact I don't want any queries then; I want to import (a) table(s) as a whole, then calculate. So, my question is, if a row is represented with an instance of a class, then what container do I put these objects into?
A vector? A set? What about ordered vs. unordered? Just use what seems best for my case according to big O times? Any special traps to fall into here? Is this case no different than with data "born in memory", so the only things to consider besides size overhead are "do I want the lookup or the insertion to be faster" ?
Probably the best route is to use some ORM, but let's say I don't want to.
I've seen some apps use boost::unordered_set, and I wondered, if there is a particular reason for its use...
I use a jdbc-like interface as the connector (libmysqlcpp).
I do not think the right container can be guessed from so little information. It mainly depends on the data size, the data types, and the algorithm you will run.
But my main concern with such a design is that it will quickly choke your network or your database. If you have a big table you'll:
select all the data from the table
retrieve all the data over the network
process on your machine part (some columns?) or the entirety of the data
push the data over the network
update your rows (or erase/replace maybe)
Why don't you consider working directly on the MySQL server? You could create a user-defined function that works directly on the data, saving the network traffic and taking advantage of the fact that MySQL is built to handle gigantic amounts of data, quantities that an in-memory container is not built to handle.
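To show what keeping the work on the server can look like, here is a minimal sketch that uses a plain set-based UPDATE rather than a user-defined function, and an in-memory SQLite database rather than MySQL. The items table, its columns, and the doubling rule are assumptions made for the demo. No rows are imported into the application or exported back; the database applies the calculation itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)]
)

# One statement does the "calculate updated data" step inside the database.
cur = conn.execute("UPDATE items SET price = price * 2 WHERE price >= ?", (20.0,))
updated = cur.rowcount  # number of rows the UPDATE modified

# Read back only to verify the result in this demo.
prices = [row[0] for row in conn.execute("SELECT price FROM items ORDER BY id")]
```

This only works when the calculation can be expressed in SQL (or in a server-side function); otherwise the import, calculate, export cycle from the question is still needed.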