MySQL 5.6 to MySQL 8 upgrade throws warning: the row size is xxx which is greater than maximum allowed size (8126) - database-migration

While upgrading an existing MySQL 5.6 installation to MySQL 8, I got the following warning in the mysqld.err log several times, for each database:
[Warning] [MY-011825] [InnoDB] Cannot add field abc in table dbName.myTable because after adding it, the row size is 8500 which is greater than maximum allowed size (8126) for a record on index leaf page.
The solution I found on multiple portals is to change ROW_FORMAT from COMPACT to DYNAMIC.
Is it a foolproof solution? I need to confirm: is there any chance of data loss?

The warning you're encountering during the upgrade from MySQL 5.6 to MySQL 8 relates to InnoDB's maximum row size. With the default 16 KB page size, InnoDB requires every row to fit in roughly half a page, which works out to the 8126 bytes in the message; MySQL 8 checks this more strictly than 5.6 did and logs a warning when a table's columns could, in the worst case, exceed it. The warning by itself does not mean any data has been lost; it means a future insert or update could fail if a row actually grows past the limit.
The ROW_FORMAT change you found is indeed the usual fix. With COMPACT (the old default), InnoDB stores a 768-byte prefix of every long VARCHAR/BLOB/TEXT column in the row itself, which inflates the on-page row size; with DYNAMIC, a fully off-page column costs only a 20-byte pointer in the row. Converting is done with ALTER TABLE ... ROW_FORMAT=DYNAMIC, which rebuilds the table without discarding rows (still take a backup first, as with any schema change).
Beyond that, you have a few options:
Reduce the size of the columns in the table: this is the most direct fix; shrink the columns that contribute most to the row size.
Use the BLOB or TEXT data types: if you need to store large values in a single column, BLOB/TEXT columns can be stored off-page (up to 4 GB for LONGBLOB/LONGTEXT).
Split the table into smaller tables: move rarely used wide columns into a separate table joined by primary key.
Note that you cannot simply raise the limit itself: it is tied to the InnoDB page size (innodb_page_size), which can only be chosen when the data directory is initialized, and max_allowed_packet has no effect on row size.
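A minimal sketch of the conversion, using the dbName.myTable name from the warning message (note that INFORMATION_SCHEMA.INNODB_TABLES uses a dbName/tableName naming scheme):

```sql
-- See which row format the table currently uses
SELECT NAME, ROW_FORMAT
FROM INFORMATION_SCHEMA.INNODB_TABLES
WHERE NAME = 'dbName/myTable';

-- Rebuild the table with the DYNAMIC row format; this copies the table
-- but does not discard any rows
ALTER TABLE dbName.myTable ROW_FORMAT=DYNAMIC;
```

Run this per affected table; the warnings in mysqld.err name each one.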

Related

Redshift table sizes & flavours of

Confused by the term 'table size' in Redshift.
We have :
svv_table_info.size
"Size of table in 1MB blocks"
svv_table_info.pct_used
"Percent of available space used"
... so I assume that a lot of the 'size' is empty space due to sort keys etc
Then we have this..
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
.. which uses the term 'minimum' table size.
But nowhere can I find an explanation of what this means in the real world. Is this a theoretical minimum if optimally configured?
Ultimately I need to find out the basic size of original tangible data without any overheads.
Then yes, how much disc space is it actually costing to store it in Redshift.
So if I took 1TB out of our on-prem database and shoved it into Redshift, I'd be looking to see something like 1TB (data) & 1.2TB (data + Redshift overheads).
Hope someone can help clarify 🤞
Redshift stores data in 1MB blocks and blocks are associated with a slice and a column. So if I have 2 slices in my cluster and a table with 4 columns (plus the 3 system columns to make 7) distributed as EVEN containing at least 2 rows, then my table will minimally take up 2 X 7 X 1MB of space (14MB on disk). This is all that article is saying.
Now if I insert 2 additional rows into this table, Redshift will make new blocks for this data. So now my 4 rows of data take up 28MB of space. However, if I VACUUM the table, the wasted space will be reclaimed and the table size will come back down to 14MB. (Yes, this is a bit of an oversimplification, but I'm trying to get the concepts across.)
As a rule of thumb a single 1MB block will typically hold between 100,000 rows and 2,000,000 rows of compressed data. (yes this depends on the data not being monster varchars) So for our table above I can keep adding rows (and vacuuming) without increasing the table size on disk until I get a few hundred thousand rows (per slice) in the table. Redshift is very efficient at storing large chunks of data but very inefficient at storing small ones.
What Redshift knows about your data size is how many blocks it takes on disk (across all the nodes, slices, and columns). How big your data would be if it were stored differently (not in blocks, compressed or uncompressed) is not data that is tracked. As John noted, for big tables, Redshift stores data more efficiently than most other databases (when compression is used).
You cannot translate from an existing database size to the size of a table in Redshift. This is because:
Columns are stored separately
Minimum block size is 1MB
Data in Redshift is compressed, so it can take considerably less space depending on the type of data and the compression type chosen
Given compression, your data is likely to be smaller in Redshift than an original (uncompressed) data source. However, you can't really calculate that in advance unless you have transferred similar data in the past and apply a similar ratio.
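The minimum-size rule of thumb from the answer above can be sketched as simple arithmetic (the 3 hidden system columns and the 1 MB block size are taken from the answer; real sizes also depend on compression and vacuuming):

```python
SYSTEM_COLUMNS = 3  # hidden per-table columns mentioned in the answer
BLOCK_MB = 1        # Redshift allocates storage in 1 MB blocks

def min_table_size_mb(slices: int, user_columns: int) -> int:
    """Minimum on-disk size of an EVEN-distributed table that has at least
    one row on every slice: one block per column per slice."""
    return slices * (user_columns + SYSTEM_COLUMNS) * BLOCK_MB

# The worked example above: 2 slices, 4 user columns -> 14 MB on disk
print(min_table_size_mb(2, 4))   # 14
```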

Application for filtering database for the short period of time

I need to create an application that returns the phone numbers of users matching specific conditions as fast as possible. For example, we've got 4 columns in a SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table for a specific region and income. Just running a SQL query won't help, because it takes a significant amount of time. The database updates once per day, and I have some time to prepare the data as I wish.
The question is: how would you make the process of getting phone numbers with specific conditions as fast as possible, O(1) in the best scenario? Consider storing values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number, create something like a bitset: 0 if a particular condition is false and 1 if it is true. But I'm not sure I can implement it for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get phone numbers, iterate over the 2nd vector and compare each bitset with the required one.
It's not O(1) at all. And I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something with std::unordered_map (all phone numbers are unique), or to improve my idea with vectors and masks.
P.S. The SQL table consumes 4 GB of memory and I can store up to 8 GB in RAM. There are 500 columns.
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
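A sketch of that suggestion, assuming a hypothetical users table with the columns named in the question:

```sql
-- Composite index on the two filter columns; including the phone number
-- lets the query be answered from the index alone, without touching the table
CREATE INDEX idx_region_income ON users (region, income, phone_number);

SELECT phone_number
FROM users
WHERE region = 'north' AND income = 50000;
```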
If you really want it to be fast I think you should consider ElasticSearch. Think of every phone in the DB as a doc with properties (your columns).
You will need to reindex the table once a day (or in realtime) but when it's time to search you just use the filter of ElasticSearch to find the results.
Another option is to have an index for every column. In this case the engine will do an Index Merge to increase performance. I would also consider using MEMORY Tables. In case you write to this table - consider having a read replica just for reads.
To optimize your table, log your queries somewhere and add indexes (over multiple columns) just for the top X most popular searches, depending on your memory limitations.
You can use NVMe as your DB disk (if you can't load it into memory).
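The bitmask idea from the question can be extended to non-boolean columns by giving every distinct (column, value) pair its own bit. A minimal Python sketch of that encoding (the sample data is made up, and a real version would pack the masks into contiguous memory as the question suggests):

```python
# Encode each (column, value) pair as one bit. A row's mask ORs the bits
# for its values; a row matches a query when it has every query bit set.
rows = [
    ("555-0001", {"region": "north", "income": "high"}),
    ("555-0002", {"region": "south", "income": "high"}),
    ("555-0003", {"region": "north", "income": "low"}),
]

bit_for = {}  # (column, value) -> bit position, grown lazily

def mask(conditions):
    m = 0
    for cv in conditions.items():
        if cv not in bit_for:
            bit_for[cv] = len(bit_for)
        m |= 1 << bit_for[cv]
    return m

phone_numbers = [phone for phone, _ in rows]
row_masks = [mask(values) for _, values in rows]

def find(**conditions):
    q = mask(conditions)
    return [p for p, m in zip(phone_numbers, row_masks) if m & q == q]

print(find(region="north"))                  # ['555-0001', '555-0003']
print(find(region="north", income="high"))   # ['555-0001']
```

The scan is still O(rows), but each comparison is a handful of machine words. For true O(1) on popular filters, you could precompute a hash map from (region, income) tuples to the matching phone lists during the daily rebuild.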

why AWS file size is different between Redshift and S3?

I'm UNLOADing tables from Redshift to S3 for backup. So I am checking to make sure the files are complete if we need them again.
I just did UNLOAD on a table that has size = 1,056 according to:
select "table", size, tbl_rows
FROM svv_table_info;
According to the documentation, the size is "in 1 MB data blocks", so this table is using 1,056 MB. But after copying to S3, the file size is 154 MB (viewing in AWS console).
I copied back to Redshift and all the rows are there, so this has to do with "1 MB data blocks". This is related to how it's saved in the file system, yes?
Can someone please explain? Thank you.
So you're asking why the SVV_TABLE_INFO view claims that the table consumes 1 GB, but when you dump it to disk the result is only 154 MB?
There are two primary reasons. The first is that you're actively updating the table but not vacuuming it. When a row is updated or deleted, Redshift actually appends a new row (yes, stored as columns) and tombstones the old row. To reclaim this space, you have to regularly vacuum the table. While Redshift will do some vacuuming in the background, this may not be enough, or it may not have happened at the time you're looking.
The second reason is that there's overhead required to store table data. Each column in a table is stored as a list of 1 MB blocks, one block per slice (and multiple slices per node). Depending on the size of your cluster and the column data type, this may lead to a lot of wasted space.
For example, if you're storing 32-bit integers, one 1 MB block can hold 262,144 of them uncompressed, so 1,000,000 values (which is probably close to the number of rows in your table) need 4 blocks of data. But if you have a 4-node cluster with 2 slices per node (i.e., dc2.large), you'll actually use at least 8 blocks, because the column is partitioned across all 8 slices.
You can see the number of blocks that each column uses in STV_BLOCKLIST.
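The block arithmetic above can be sketched as follows (idealized: uncompressed values, even distribution across slices):

```python
import math

BLOCK_BYTES = 1024 * 1024  # one 1 MB Redshift block

def blocks_for_column(n_values: int, bytes_per_value: int, n_slices: int) -> int:
    """Blocks used by one column: each slice holds its share of the values
    and always occupies at least one whole block."""
    per_slice = math.ceil(n_values / n_slices)
    blocks_per_slice = max(1, math.ceil(per_slice * bytes_per_value / BLOCK_BYTES))
    return n_slices * blocks_per_slice

# 1,000,000 32-bit ints on one slice: 4 blocks of data
print(blocks_for_column(1_000_000, 4, 1))   # 4
# The same column on 8 slices (4 nodes x 2 slices): one block per slice
print(blocks_for_column(1_000_000, 4, 8))   # 8
```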

SQL Server CE Delete Performance

I am using SQL Server CE 4.0 and am getting poor DELETE query performance.
My table has 300,000 rows in it.
My query is:
DELETE from tableX
where columnName1 = '<some text>' AND columnName2 = '<some other text>'
I am using a non-clustered index on the 2 fields columnName1 and columnName2.
I noticed that when the number of rows to delete is small (say < 2000), the index can help performance by 2-3X. However, when the number of rows to delete is larger (say > 15000), the index does not help at all.
My theory for this behavior is that when the number of rows is large, the index maintenance is killing the gains achieved by using the index (index seek instead of table scan). Is this correct?
Unfortunately, I can't get rid of the index because it significantly helps non-mutating query performance.
Also, what else can I do to improve the delete performance for the > 15,000 row case?
I am using SQL Server CE 4.0 on Windows 7 (32-bit).
My application is written in C++ and uses the OLE DB interface to manipulate the database.
There is something known as "the tipping point" where the cost of locating individual rows using a seek is not worth it, and it is easier to just perform a single scan instead of thousands of seeks.
A couple of things you may consider for performance:
have a filtered index, if those are supported in CE (I honestly have no idea)
instead of deleting 15,000 rows at once, batch the deletes into chunks.
consider a "soft delete" - where you simply update an active column to 0. Then you can actually delete the rows in smaller batches in the background. I mean, is a user really sitting around and waiting actively for you to delete 15,000+ rows? Why?
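A sketch of the batched-delete suggestion. Full SQL Server supports DELETE TOP (n); since I'm not sure SQL Server CE does, this version batches by primary key range instead (tableX, id, and the filter values are placeholders, and the application repeats the statement, advancing @low, until no rows remain):

```sql
-- Delete in chunks so each statement touches a bounded number of rows
DELETE FROM tableX
WHERE columnName1 = '<some text>'
  AND columnName2 = '<some other text>'
  AND id BETWEEN @low AND @low + 4999;
```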

Sqlite max columns number configuration from QT

I want to store rows that have 65,536 columns in an SQLite database, and I am doing that using C++ and Qt.
My question is: since the default maximum number of columns seems to be no more than 2000, how do I configure this parameter from C++ and Qt?
Thank you.
The SQLite homepage has some explanation on this:
2. Maximum Number Of Columns
The SQLITE_MAX_COLUMN compile-time parameter is used to set an upper
bound (...)
and
The default setting for SQLITE_MAX_COLUMN is 2000. You can change it
at compile time to values as large as 32767. On the other hand, many
experienced database designers will argue that a well-normalized
database will never need more than 100 columns in a table.
So even if you increased it to the maximum, you would only achieve half of what you want (32,767 &lt; 65,536). Apart from that I can only refer to Styne666's comment on your post.
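For reference, the compile-time flag is applied when building the SQLite amalgamation into your project yourself (a sketch; the file names are placeholders, and 32767 is the documented upper bound, still short of the 65,536 you want):

```shell
# Rebuild the SQLite amalgamation with the largest permitted column limit
gcc -DSQLITE_MAX_COLUMN=32767 -c sqlite3.c -o sqlite3.o
```

Qt's bundled SQLite driver is compiled with the defaults, so you would have to rebuild the sqldrivers plugin (or link your own copy of SQLite) with this define. A normalized schema, e.g. one row per (row_id, column_index, value), avoids the limit entirely.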