Making SQLite run SELECT faster - python-2.7

Situation: I have about 40 million rows, 3 columns of unorganised data in a table in my SQLite DB (~300MB). An example of my data is as follows:
| filehash | filename | filesize |
|------------|------------|------------|
| hash111 | fileA | 100 |
| hash222 | fileB | 250 |
| hash333 | fileC | 380 |
| hash111 | fileD | 250 | #Hash collision with fileA
| hash444 | fileE | 520 |
| ... | ... | ... |
Problem: A single SELECT statement can take between 3 and 5 seconds. The application I am running needs to be fast; a single query taking 3 to 5 seconds is too long.
#calculates hash
md5hash = hasher(filename)
#I need all 3 columns so that I do not need to parse through the DB a second time
cursor.execute('SELECT * FROM hashtable WHERE filehash = ?', (md5hash,))
returned = cursor.fetchall()
Question: How can I make the SELECT statement run faster (I know this sounds crazy but I am hoping for speeds of below 0.5s)?
Additional information 1: I am running a Python 2.7 program on an RPi 3B (1GB RAM, default 100MB swap). I am asking mainly because I am afraid that it will crash the RPi due to 'not enough RAM'.
For reference, when reading from the DB normally with my app running, there is at most 55MB of free RAM, plus a few hundred MB of cached data - I am unsure whether this is the SQLite cache (swap has not been touched).
Additional information 2: I am open to using other databases to store the table (I was looking at either PyTables or ZODB as a replacement - let's just say that I got a little desperate).
Additional information 3: There are NO unique keys, as the SELECT statement looks for matches in the filehash column, which holds hash values that apparently have collisions.

Currently, the database has to scan the entire table to find all matches. To speed up searches, use an index:
CREATE INDEX my_little_hash_index ON hashtable(filehash);
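Once the index exists, a quick sanity check (a minimal sketch, assuming the table and index names above) is to ask SQLite how it will run the lookup:
-- ask SQLite for its query plan
EXPLAIN QUERY PLAN
SELECT * FROM hashtable WHERE filehash = 'hash111';
The output should mention something along the lines of SEARCH TABLE hashtable USING INDEX my_little_hash_index (filehash=?) rather than SCAN TABLE hashtable. Building the index on ~40 million rows will take a while and grow the database file, but each subsequent lookup by filehash then only touches a handful of pages, which should comfortably fit the 0.5s target.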

How can I visualize timeseries data aggregated by more than one dimension on AWS insights?

I'd like to use CloudWatch Insights to visualize a multi-line graph of average latency by host over time, with one line for each host.
This stats query extracts the latency and aggregates it in 10 minute buckets by host, but it doesn't generate any visualization.
stats avg(latencyMS) by bin(10m), host
bin(10m) | host | avg(latencyMS)
0m | 1 | 120
0m | 2 | 220
10m | 1 | 130
10m | 2 | 230
The docs call this out as a common mistake but don't offer any alternative.
The following query does not generate a visualization, because it contains more than one grouping field.
stats avg(myfield1) by bin(5m), myfield4
(AWS docs)
Experimentally, CloudWatch will generate a multi-line graph if each record carries multiple value fields. A query that produces such a graph must return results like this:
bin(10m) | host-1 avg(latencyMS) | host-2 avg(latencyMS)
0m | 120 | 220
10m | 130 | 230
I don't know how to write a query that would output that.
Parse the individual messages for each host, then compute their stats.
For example, to get the average latency of responses from processes with PID=11 and PID=13:
parse #message /\[PID:11\].* duration=(?<pid_11_latency>\S+)/
| parse #message /\[PID:13\].* duration=(?<pid_13_latency>\S+)/
| display #timestamp, pid_11_latency, pid_13_latency
| stats avg(pid_11_latency), avg(pid_13_latency) by bin(10m)
| sort #timestamp desc
| limit 20
The regular expressions extract the duration for processes with IDs 11 and 13 into the fields pid_11_latency and pid_13_latency respectively, filling in null where there is no match.
You can build on this example by writing the regular expressions that extract the metrics from the messages of the hosts you care about.

Why does Spanner perform a full table scan when using an underscore in a LIKE, while using % leverages the index?

In a query, if I use LIKE '<value>%' on the primary key it performs well, using the index:
Operator | Rows returned | Executions | Latency
-- | -- | -- | --
Serialize Result | 32 | 1 | 1.80 ms
Sort | 32 | 1 | 1.78 ms
Hash Aggregate | 32 | 1 | 1.73 ms
Distributed union | 32 | 1 | 1.61 ms
Hash Aggregate | 32 | 1 | 1.56 ms
Distributed union | 128 | 1 | 1.34 ms
Compute | - | - | -
FilterScan | 128 | 1 | 1.33 ms
Table Scan: <tablename> | 128 | 1 | 1.30 ms
Nevertheless, using LIKE '<value>_' performs a full table scan:
Operator | Rows returned | Executions | Latency
-- | -- | -- | --
Serialize Result | 32 | 1 | 76.27 s
Sort | 32 | 1 | 76.27 s
Hash Aggregate | 32 | 1 | 76.27 s
Distributed union | 32 | 1 | 76.27 s
Hash Aggregate | 32 | 2 | ~72.18 s
Distributed union | 128 | 2 | ~72.18 s
Compute | - | - | -
FilterScan | 128 | 2 | ~72.18 s
Table Scan: <tablename> (full scan: true) | 13802624 | 2 | ~69.97 s
The query looks like this:
SELECT
'aggregated-quadkey AS quadkey' AS quadkey, day,
SUM(a_value_1), SUM(a_value_2), AVG(a_value_3), SUM(a_value_4), SUM(a_value_5), AVG(a_value_6), AVG(a_value_6), AVG(a_value_7), SUM(a_value_8), SUM(a_value_9), AVG(a_value_10), SUM(a_value_11), SUM(a_value_12), AVG(a_value_13), AVG(a_value_14), AVG(a_value_15), SUM(a_value_16), SUM(a_value_17), AVG(a_value_18), SUM(a_value_19), SUM(a_value_20), AVG(a_value_21), AVG(a_value_22), AVG(a_value_23)
FROM <tablename>
WHERE quadkey LIKE '03201012212212322_'
GROUP BY quadkey, day ORDER BY day
For a prefix matching LIKE pattern (column LIKE 'xxx%'), the query optimiser internally converts the condition into STARTS_WITH(column, 'xxx'), which then uses the index.
So the reason is probably that the query optimizer is not smart enough to
convert an exact-length prefix-matching LIKE pattern
column LIKE 'xxx_'
into a combined condition:
(STARTS_WITH(column, 'xxx') AND CHAR_LENGTH(column)=4)
Similarly, a pattern such as
`column LIKE 'abc%def'`
is not optimised into the combined condition:
`(STARTS_WITH(column,'abc') AND ENDS_WITH(column,'def'))`.
You can always work around this in your own SQL generation by using the combined condition above.
(This is assuming that the LIKE pattern is a string value in the query, not a parameter - LIKE using a parameter cannot be optimised because the pattern is not known at query compile time.)
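As a rough sketch of that rewrite, using the pattern from the question (a 17-character prefix plus one single-character wildcard, so 18 characters in total):
WHERE STARTS_WITH(quadkey, '03201012212212322')
  AND CHAR_LENGTH(quadkey) = 18
This keeps the condition searchable via the primary-key index, while the length check replaces the underscore wildcard.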
Thank you for reporting this! I have added this rewrite in the backlog. In the meantime, you can use STARTS_WITH and CHAR_LENGTH to work around the issue as RedPandaCurios suggested.

Compound Queries in Amazon CloudSearch

I want to get a GROUP BY-style result from AWS CloudSearch.
user | expense | status
1 | 1000 | 1
1 | 300 | 1
1 | 700 | 2
1 | 500 | 2
2 | 1000 | 1
2 | 1200 | 3
3 | 200 | 1
3 | 600 | 1
3 | 1000 | 2
Above is my table structure. I want the total expense for each user. The expected answer is:
{ user:1,expense_count:2500},{user:2,expense_count:2200 },{user:3,expense_count:1800 }
I want to GROUP BY the user column and sum up the total expenses of the respective user.
There is no (easy) way to do this in CloudSearch, which is understandable when you consider that your use case is more like a SQL query and is not really what I would consider a search. If what you want to do is look up users by userId and sum their expenses, then a search engine is the wrong tool to use.
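For comparison, the aggregation being asked for is a single SQL statement (the table name expenses is hypothetical, matching the columns above):
SELECT user, SUM(expense) AS expense_count
FROM expenses
GROUP BY user;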
CloudSearch isn't meant to be used as a datastore; it should return minimal information (ideally just IDs), which you then use to retrieve data. Here is a blurb about it from the docs:
You should only store document data in the search index by making
fields return enabled when it's difficult or costly to retrieve the
data using other means. Because it can take some time to apply
document updates across the domain, you should retrieve critical data
such as pricing information by using the returned document IDs instead
of returning it from the index.

I want to optimize a stored procedure that uses an IN clause and a REGEXP_SUBSTR function. How should I optimize it further?

The response time I am getting is around 200ms.
I want to optimize it more.
How can I achieve this?
CREATE OR REPLACE PROCEDURE GETSTORES
(
    LISTOFOFFERIDS IN  VARCHAR2,
    REF_OFFERS     OUT TYPES.OFFER_RECORD_CURSOR
)
AS
BEGIN
    OPEN REF_OFFERS FOR
        SELECT /*+ PARALLEL(STORES 5) PARALLEL(MERCHANTOFFERS 5) */
               MOFF.OFFERID,
               S.STOREID,
               S.LAT,
               S.LNG
        FROM MERCHANTOFFERS MOFF
        INNER JOIN STORES S ON MOFF.STOREID = S.STOREID
        WHERE MOFF.OFFERID IN
        (
            SELECT REGEXP_SUBSTR(LISTOFOFFERIDS, '[^,]+', 1, LEVEL)
            FROM DUAL
            CONNECT BY REGEXP_SUBSTR(LISTOFOFFERIDS, '[^,]+', 1, LEVEL) IS NOT NULL
        );
END GETSTORES;
I am using REGEXP_SUBSTR to get a list of offer IDs from the comma-separated string that comes in LISTOFOFFERIDS.
I have created an index on STOREID of the STORES table, but to no avail.
A new approach that achieves the same result is also fine if it's faster.
The types declaration for the same:
CREATE OR REPLACE PACKAGE TYPES
AS
    TYPE OFFER_RECORD IS RECORD
    (
        OFFER_ID MERCHANTOFFERS.OFFERID%TYPE,
        STORE_ID STORES.STOREID%TYPE,
        LAT      STORES.LAT%TYPE,
        LNG      STORES.LNG%TYPE
    );
    TYPE OFFER_RECORD_CURSOR IS REF CURSOR RETURN OFFER_RECORD;
END TYPES;
The plan for the SELECT reveals the following information:
Plan hash value: 1501040938
-------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 276 | 67620 | 17 (12)| 00:00:01 |
|* 1 | HASH JOIN | | 276 | 67620 | 17 (12)| 00:00:01 |
| 2 | NESTED LOOPS | | | | | |
| 3 | NESTED LOOPS | | 276 | 61272 | 3 (34)| 00:00:01 |
| 4 | VIEW | VW_NSO_1 | 1 | 202 | 3 (34)| 00:00:01 |
| 5 | HASH UNIQUE | | 1 | | 3 (34)| 00:00:01 |
|* 6 | CONNECT BY WITHOUT FILTERING (UNIQUE)| | | | | |
| 7 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 8 | INDEX RANGE SCAN | OFFERID_INDEX | 276 | | 0 (0)| 00:00:01 |
| 9 | TABLE ACCESS BY INDEX ROWID | MERCHANTOFFERS | 276 | 5520 | 0 (0)| 00:00:01 |
| 10 | TABLE ACCESS FULL | STORES | 9947 | 223K| 13 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("MERCHANTOFFERS"."STOREID"="STORES"."STOREID")
6 - filter( REGEXP_SUBSTR ('M1-Off2,M2-Off5,M2-Off9,M5-Off4,M10-Off1,M1-Off3,M2-Off4,M3-Off2,M4-Of
f6,M5-Off1,M6-Off1,M8-Off1,M7-Off3,M1-Off1,M2-Off1,M3-Off1,M3-Off4,M3-Off5,M3-Off6,M4-Off1,M4-Off7,M2
-Off2,M3-Off3,M5-Off2,M7-Off1,M7-Off2,M1-Off7,M2-Off3,M3-Off7,M5-Off5,M4-Off2,M4-Off3,M4-Off5,M8-Off2
,M6-Off2,M1-Off5,M1-Off6,M1-Off9,M1-Off8,M2-Off6,M2-Off7,M4-Off4,M9-Off1,M6-Off4,M1-Off4,M1-Off10,M2-
Off8,M3-Off8,M6-Off3,M5-Off3','[^,]+',1,LEVEL) IS NOT NULL)
8 - access("MERCHANTOFFERS"."OFFERID"="$kkqu_col_1")
If your server supports it (it seems you want it to), change the hints to /*+ PARALLEL(S 8) PARALLEL(MOFF 8) */. When you have aliases, you must use the aliases in the hints.
You should also try the compound index suggested by APC: STORES(STOREID, LAT, LNG).
Please answer these questions: for the example presented, how many distinct stores do you get (SELECT COUNT(DISTINCT storeid) FROM (your_query)), and how many stores are in the STORES table (SELECT COUNT(*) FROM STORES)?
Have you analysed the table with dbms_stats.gather_table_stats?
I believe the connect by query is NOT the problem. It runs in 0.02 seconds.
If you look at your explain plan, the timings for each step are the same: there is no obvious candidate to focus tuning on.
The sample you posted has fifty tokens for OFFERID. Is that representative? They map to 276 STORES - is that a representative ratio? Do any offers hit more than one Store?
276 rows is about 2.7% of the rows which is a small-ish sliver: however, as STORES seems to be a very compact table it's marginal as to whether indexed reads would be faster than a full table scan.
The only obvious thing you could do to squeeze more juice out of the database would be to build a compound index on STORES(STOREID, LAT, LNG); presumably it's not a table which sees much DML so the overhead of an additional index wouldn't be much.
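A minimal sketch of that index (the index name is just an example):
CREATE INDEX STORES_STOREID_LAT_LNG_IX ON STORES (STOREID, LAT, LNG);
Because the index then contains every STORES column the query selects, Oracle can satisfy that side of the join from the index alone instead of the full table scan shown in the plan.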
One last point: your query executes in 0.2s. So how much faster do you want it to go?
Consider dropping the regex from the join so the join itself can happen fast.
If there are indexes on the join columns, chances are the join may move from nested loops to a hash join of some sort.
Once you have that result set (with hopefully fewer rows), filter it with your regex.
You may find the WITH clause helpful in this scenario.
Something on the order of this (untested example):
WITH
base AS
(
SELECT /*+ PARALLEL(STORES 5) PARALLEL(MERCHANTOFFERS 5) */
moff.OFFERID,
s.STOREID,
s.LAT,
s.LNG
FROM MERCHANTOFFERS moff
INNER JOIN STORES s
ON MOFF.STOREID = S.STOREID
),
offers AS
(
SELECT REGEXP_SUBSTR(LISTOFOFFERIDS,'[^,]+', 1, LEVEL) offerid
FROM DUAL
CONNECT BY REGEXP_SUBSTR(LISTOFOFFERIDS, '[^,]+', 1, LEVEL) IS NOT NULL
)
SELECT base.*
FROM base,
offers
WHERE base.offerid = offers.offerid
Oracle may materialize the two subqueries as in-memory tables and then join them.
No guarantees. Your mileage may vary. You were looking for ideas; this is an idea.
The very best of luck to you.
If I recall the hints chapter correctly, when you alias your table names you need to use that alias in your hint: /*+ PARALLEL(s 5) PARALLEL(moff 5) */
I would be curious as to why you decided on the value 5 for your hints. I was under the impression that Oracle would choose the best value for it, depending on system load and other mysterious conditions.

How to store data with large number (constant) of properties in SQL

I am parsing the USDA's food database and storing it in SQLite for query purposes. Each food has associated with it the quantities of the same 162 nutrients. It appears that the list of nutrients (name and units) has not changed in quite a while, and since this is a hobby project I don't expect to follow any sudden changes anyway. But each food does have a unique quantity associated with each nutrient.
So, how does one go about storing this kind of information sanely? My priorities are being friendly to multiple programming languages (Python and C++ having preference), sanity for me as the coder, and ease of retrieving nutrient sets to sum or plot over time.
The two things I had thought of so far were 162 columns (which I'm not particularly fond of, but it does make the queries simpler), or a food table that links to a nutrient_list table, which in turn links to a static table with the nutrient names and units. The second seems more flexible in case my expectations are wrong, but I wouldn't even know where to begin writing the queries for sums and time series.
Thanks
You should read up a bit on database normalization. Most of the normalization material is quite intuitive, but really going through the definitions of the steps and seeing an example helps you understand the concepts and will help you greatly if you want to design a database in the future.
As for this problem, I would suggest you use 3 tables: one for the foods (let's call it foods), one for the nutrients (nutrients), and one for the specific nutrients of each food (foods_nutrients).
The foods table should have a unique index for referencing and the food's name. If the food has other data associated to it (maybe a link to a picture or a description), this data should also go here. Each separate food will get a row in this table.
The nutrients table should also have a unique index for referencing and the nutrient's name. Each of your 162 nutrients will get a row in this table.
Then you have the crossover table containing the nutrient values for each food. This table has three columns: food_id, nutrient_id and value. Each food gets 162 rows in this table, one for each nutrient.
This way, you can add or delete nutrients and foods as you like and query everything independent of programming language (well, using SQL, but you'll have to use that anyway :) ).
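A minimal schema sketch for those three tables (generic SQL; the column types are assumptions, and the unit column follows the question's mention of nutrient names and units):
CREATE TABLE foods (
    food_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE nutrients (
    nutrient_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    unit        TEXT NOT NULL   -- e.g. 'mg', 'g'
);
CREATE TABLE foods_nutrients (
    food_id     INTEGER NOT NULL REFERENCES foods(food_id),
    nutrient_id INTEGER NOT NULL REFERENCES nutrients(nutrient_id),
    value       REAL NOT NULL,
    PRIMARY KEY (food_id, nutrient_id)
);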
Let's try an example. We have 2 foods in the foods table and 3 nutrients in the nutrients table:
+------------------+
| foods |
+---------+--------+
| food_id | name |
+---------+--------+
| 1 | Banana |
| 2 | Apple |
+---------+--------+
+-------------------------+
| nutrients |
+-------------+-----------+
| nutrient_id | name |
+-------------+-----------+
| 1 | Potassium |
| 2 | Vitamin C |
| 3 | Sugar |
+-------------+-----------+
+-------------------------------+
| foods_nutrients |
+---------+-------------+-------+
| food_id | nutrient_id | value |
+---------+-------------+-------+
| 1 | 1 | 1000 |
| 1 | 2 | 12 |
| 1 | 3 | 1 |
| 2 | 1 | 3 |
| 2 | 2 | 7 |
| 2 | 3 | 98 |
+---------+-------------+-------+
Now, to get the potassium content of a banana, you'd query:
SELECT foods_nutrients.value
FROM foods_nutrients, foods, nutrients
WHERE foods_nutrients.food_id = foods.food_id
AND foods_nutrients.nutrient_id = nutrients.nutrient_id
AND foods.name = 'Banana'
AND nutrients.name = 'Potassium';
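Sums work the same way; for example, the total sugar across a set of foods (a sketch using the sample data above):
SELECT SUM(foods_nutrients.value) AS total_sugar
FROM foods_nutrients
JOIN foods ON foods_nutrients.food_id = foods.food_id
JOIN nutrients ON foods_nutrients.nutrient_id = nutrients.nutrient_id
WHERE nutrients.name = 'Sugar'
  AND foods.name IN ('Banana', 'Apple');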
Use the second (more normalized) approach.
You could even get away with fewer tables than you mentioned:
tblNutrients
-- NutrientID
-- NutrientName
-- NutrientUOM (unit of measure)
-- Otherstuff
tblFood
-- FoodId
-- FoodName
-- Otherstuff
tblFoodNutrients
-- FoodID (FK)
-- NutrientID (FK)
-- UOMCount
It will be a nightmare to maintain a table with 160+ columns.
If there is a time element involved too (can measurements change?), then you could add a date field to the nutrient and/or the food-nutrient table, depending on what could change.
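A hypothetical sketch of that variant, reusing the table names above (the MeasurementDate column and the 'Sugar' filter are illustrative only):
-- record when each measurement was taken
ALTER TABLE tblFoodNutrients ADD COLUMN MeasurementDate DATE;
-- time series: average amount of one nutrient per measurement date
SELECT fn.MeasurementDate, AVG(fn.UOMCount)
FROM tblFoodNutrients fn
JOIN tblNutrients n ON fn.NutrientID = n.NutrientID
WHERE n.NutrientName = 'Sugar'
GROUP BY fn.MeasurementDate
ORDER BY fn.MeasurementDate;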