I am testing the speed of inserting multiple rows with a single INSERT statement.
For example:
INSERT INTO [MyTable] VALUES (5, 'dog'), (6, 'cat'), (3, 'fish')
This is very fast until I pass 50 rows in a single statement, then the speed drops significantly.
Inserting 10000 rows in batches of 50 takes 0.9 seconds.
Inserting 10000 rows in batches of 51 takes 5.7 seconds.
My question has two parts:
Why is there such a hard performance drop at 50?
Can I rely on this behavior and code my application to never send batches larger than 50?
My tests were done in C++ with ADO.
Edit:
It appears the drop-off point is not 50 rows but 1000 values in total (rows × columns): I get similar results with 50 rows of 20 columns or 100 rows of 10 columns.
It could also be related to the size of the row. The table you use as an example seems to have only 2 columns. What if it has 25 columns? Is the performance drop off also at 50 rows?
Did you also compare with the "union all" approach shown here? http://blog.sqlauthority.com/2007/06/08/sql-server-insert-multiple-records-using-one-insert-statement-use-of-union-all/
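For reference, the UNION ALL form from that post looks roughly like this against the example table (the column names here are made up):

INSERT INTO MyTable (id, name)
SELECT 5, 'dog'
UNION ALL
SELECT 6, 'cat'
UNION ALL
SELECT 3, 'fish'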
I suspect there's an internal cache/index that is used for up to 50 rows (a nice round decimal number). After 50 rows it falls back on a less efficient general-case insertion algorithm that can handle arbitrary numbers of rows without using excessive memory.
The slowdown is probably the parsing of the string values: VALUES (5, 'dog'), (6, 'cat'), (3, 'fish'), and not an INSERT issue.
try something like this, which will insert one row for each row returned by the query:
INSERT INTO YourTable1 (col1, col2)
SELECT Value1, Value2
FROM YourTable2
WHERE ... -- rows will be more than 50
and see what happens
If you are using SQL 2008, then you can use table value parameters and just do a single insert statement.
Personally, I've never seen the slowdown at 50 records, even with regular batches. Regardless, we moved to table-valued parameters, which gave us a significant speed increase.
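As a minimal sketch of the table-valued parameter approach (assuming MyTable has an INT and a VARCHAR column; the type and procedure names here are made up), the client fills @rows once and the server inserts the whole batch in one statement:

-- One-time setup: a table type matching the target columns.
CREATE TYPE dbo.MyRowType AS TABLE (id INT, name VARCHAR(50));
GO
CREATE PROCEDURE dbo.InsertMyRows
    @rows dbo.MyRowType READONLY
AS
    -- The whole batch arrives as a single parameter.
    INSERT INTO dbo.MyTable (id, name)
    SELECT id, name FROM @rows;
GO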
Random thoughts:
is it completely consistent when run repeatedly?
are you checking for duplicates in the 1st 10k rows for the 2nd 10k insert?
did you try batch size of 51 first?
did you empty the table between tests?
For high-volume and high-frequency inserts, consider using bulk inserts to load your data. It's not the simplest thing in the world to implement and it brings with it a new set of challenges, but it can be much faster than doing individual INSERTs.
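For example, a sketch of the T-SQL side (the file path, delimiters, and table name are assumptions, and the file has to be readable by the SQL Server service):

-- Load a whole file in one bulk operation instead of many INSERT statements.
BULK INSERT dbo.MyTable
FROM 'C:\data\rows.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);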
I need to create an application that would allow me to get phone numbers of users matching specific conditions as fast as possible. For example, we have 4 columns in a SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table with specific region and income. Just running a SQL query won't help because it takes a significant amount of time. The database updates once per day and I have some time to prepare the data as I wish.
The question is: how would you make the process of getting phone numbers with specific conditions as fast as possible, O(1) in the best case? Consider storing values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number, create something like a bitset: 0 if the particular condition is false and 1 if it is true. But I'm not sure I can implement it for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get phone numbers, iterate over the 2nd vector and compare each bitset with the required one.
It's not O(1) at all, and I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something useful with std::unordered_map (all phone numbers are unique) or improve my idea with vectors and masks.
P.S. The SQL table consumes 4 GB of memory and I can store up to 8 GB in RAM. There are 500 columns.
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
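Something along these lines (a sketch; the table name and the example filter values are assumptions based on the question):

-- A composite index so the (region, income) filter becomes an index seek.
CREATE INDEX ix_phones_region_income ON phones (region, income);

SELECT phone_number
FROM phones
WHERE region = 'North'
  AND income BETWEEN 50000 AND 80000;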
If you really want it to be fast I think you should consider ElasticSearch. Think of every phone in the DB as a doc with properties (your columns).
You will need to reindex the table once a day (or in real time), but when it's time to search you just use an ElasticSearch filter to find the results.
Another option is to have an index on every column. In that case the engine will do an index merge to increase performance. I would also consider using MEMORY tables. If you also write to this table, consider having a read replica just for reads.
To optimize your table, log your queries somewhere and add composite (multi-column) indexes just for the top X most popular searches, depending on your memory limitations.
You can use NVMe as your DB disk (if you can't load it all into memory).
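A rough MySQL sketch of the MEMORY-table idea (names, types, and sizes are assumptions; BTREE is requested for the income index because MEMORY indexes default to HASH, which cannot serve range filters):

CREATE TABLE phones_mem (
    phone_number VARCHAR(20) PRIMARY KEY,
    region       VARCHAR(50),
    income       INT,
    age          INT,
    INDEX ix_region (region),
    INDEX ix_income (income) USING BTREE
) ENGINE = MEMORY;

-- With one index per column, the optimizer can pick the most selective index
-- or combine them with an index merge where possible.
SELECT phone_number
FROM phones_mem
WHERE region = 'North' AND income > 50000;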
SELECT a.id,
       b.url AS codingurl
FROM fact_A a
INNER JOIN dim_B b
        ON strpos(a.url, b.url) > 0
Record count in Fact_A: 2 million
Record count in Dim_B: 1500
Time taken to execute: 10 minutes
Number of nodes: 2
Could someone help me understand why the above query takes so long to execute?
We have declared the distribution key on Fact_A so that records are distributed evenly across both nodes, and a sort key is created on URL in Fact_A.
Dim_B table is created with DISTRIBUTION ALL.
Redshift does not have full-text search indexes or prefix indexes, so a query like this (with strpos used in the filter) will result in a full table scan, executing strpos 3 billion times.
Depending on which urls are in dim_B, you might be able to optimise this by extracting prefixes into separate columns. For example, if you always compare subpaths of the form http[s]://hostname/part1/part2/part3 then you can extract "part1/part2/part3" as a separate column both in fact_A and dim_B, and make it the dist and sort keys.
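A sketch of that idea (table and column names and the exact prefix rule are assumptions; SPLIT_PART is only used here to illustrate stripping the scheme and query string):

-- Rebuild fact_A with the comparable path in its own column, distributed and
-- sorted on it, so the join becomes an equality instead of a per-row strpos.
CREATE TABLE fact_a_prefixed
DISTKEY (url_path)
SORTKEY (url_path)
AS
SELECT id,
       url,
       SPLIT_PART(SPLIT_PART(url, '://', 2), '?', 1) AS url_path
FROM fact_A;

CREATE TABLE dim_b_prefixed
DISTSTYLE ALL
SORTKEY (url_path)
AS
SELECT url,
       SPLIT_PART(SPLIT_PART(url, '://', 2), '?', 1) AS url_path
FROM dim_B;

SELECT a.id, b.url AS codingurl
FROM fact_a_prefixed a
INNER JOIN dim_b_prefixed b
        ON a.url_path = b.url_path;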
You can also rely on the parallelism of Redshift. If you resize your cluster from 2 nodes to 20 nodes, you should see an immediate performance improvement of 8-10x, as this kind of query can be executed by each node in parallel (for the most part).
I have a table with columns such as ReadingTime and Frequency, and I would like to insert 3 rows in between those records where the time difference is greater than 12 hours. I could determine the time difference using the "Time Difference" node, but I could not insert rows as required. Is there any way to achieve this in KNIME?
If you are using the Time Generator node in a chunk loop (with the lagged column and the "Use second column" option on the Time Difference node), you can generate as many rows as you want (I assume you already use some switch/if nodes).
I'm currently implementing a histogram that will show very large-scale data using Qt, and I have some doubts about which data structure(s) I should use. I will be displaying the number of queries received from users of an application, in a single application that shows a different histogram depending on which "show me this data" button is clicked:
1) Display the histogram of total queries per month. There are 4 months of data here; I kept four variables and incremented them as I caught queries belonging to those months in the CSV file.
2) Display the histogram of total queries per day in a selected month. I was thinking of using 4 QVectors to represent the months for this one, incrementing an element of the vector (a day) as I come across that specific day. For example, if the vector represents August, then whenever I come across a record dated 2011-08-XY, I increment the (XY + 1)th element of that vector by 1. My second alternative is to use 4 QLinkedLists for the sake of better complexity, but I'm not sure the approaches I've come up with are efficient enough, and I'm willing to listen to any other idea.
3) Here's where things get a bit complicated: display the histogram of total queries per hour in a selected day and month. The amount of data grows considerably here, and I don't know which data structure, or combination of structures, I should use to implement this one. A list of lists, perhaps?
Any ideas on problems 2) and 3) would be helpful. Thanks in advance.
Actually, it shouldn't be too unmanageable to always keep queries per hour. Assuming the number of queries per hour never exceeds the maximum int value, that's only 24 ints per day, each 32 or 64 bits depending on your machine. Assuming 32-bit ints, you could store roughly 28 years' worth of data per MB.
There's no need to transfer the month/year - your program can work that out. Just assign hour 0 to the earliest point in your data, which you keep as a constant, then work out the date based on hours passed since then.
This avoids having to have a list of lists or anything fancy. Just use an array where the index is the number of hours since hour 0 and the value is the number of queries for that hour.
Why don't you simply use a classical database?
When you start asking these kinds of questions, I think it is a good time to consider a more robust structure. There are multiple data structures implemented inside any DB, each optimized for different access patterns. You should consider at least lookup, insertion, deletion, and range queries. No structure is better than the others on all counts, so there is always a trade-off.
Qt has some database classes you can use. I have never used the Qt SQL library, but I think you should give it a shot. Fortunately, there is a Qt SQL programming guide at the end of the linked page.
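As a sketch of how the histograms then fall out of plain SQL, assuming the raw data ends up in a table queries(ts) with one timestamped row per query (SQLite date functions shown, i.e. what the QSQLITE driver would accept; other drivers have equivalents):

-- Total queries per hour for a selected day; swap the format string for
-- '%Y-%m-%d' or '%Y-%m' to get the per-day and per-month histograms.
SELECT strftime('%Y-%m-%d %H:00', ts) AS hour_bucket,
       COUNT(*)                       AS query_count
FROM queries
WHERE ts >= '2011-08-15' AND ts < '2011-08-16'
GROUP BY hour_bucket
ORDER BY hour_bucket;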
I have been querying Geonames for parks per state. Mostly there are under 1000 parks per state, but I just queried Connecticut, and there are just under 1200 parks there.
I already got the 1-1000 with this query:
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=1000
But increasing maxRows to 1200 gives an error that I am querying for too many at once. Is there a way to query for rows 1000-1200?
I don't really see how to do it with their API.
Thanks!
You should be using the startRow parameter in the query to page results. The documentation notes that it takes an integer value (0-based indexing) and should be
Used for paging results. If you want to get results 30 to 40, use startRow=30 and maxRows=10. Default is 0.
So to get the next 1000 data points (1000-1999), you should change your query to
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=1000&startRow=1000
I'd suggest reducing the maxRows to something manageable as well - something that will put less of a load on their servers and make for quicker responses to your queries.