I want to merge two or more tables into one. For example, I have table1.csv and table2.csv; they come from different MySQL servers but have the same structure, like [A, B, C, datetime].
For two records, if the values of A, B, C are not all the same, treat them as different records; if the values of A, B, and C are the same, keep only the record with the latest datetime.
If I first use a program to select the useful records locally and then insert them into MySQL in one batch, will that be faster than selecting and inserting them one by one?
You can do it easily with a composite unique key over the three fields on the table you want to insert into.
This query adds a unique key, so the same combination of a, b, c cannot be inserted twice:
ALTER TABLE `table1` ADD UNIQUE `unique_index`(`a`, `b`, `c`);
This query then appends only the records that are not already present:
INSERT IGNORE INTO table1 SELECT * FROM table2;
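Note that INSERT IGNORE keeps table1's existing row even when table2 holds a newer datetime for the same A, B, C. If the latest datetime must win, one hedged follow-up (column names assumed from the question) is an UPDATE that joins the two tables after the insert:

-- bump table1 to table2's newer timestamp for the rows the INSERT IGNORE skipped
UPDATE table1 AS t1
JOIN table2 AS t2
  ON t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c
SET t1.`datetime` = t2.`datetime`
WHERE t2.`datetime` > t1.`datetime`;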
I'm trying to determine if there's a practical way to prevent duplicate rows from being inserted into a table using Azure SQL DW when the table already holds billions of rows (say 20 billion).
The root cause of needing this is that the source of the data is a third party that sends over supposedly unique data, but sometimes sends duplicates which have no identifying key. I unfortunately have no idea if we've already received the data they're sending.
What I've tried is to create a table that contains a row hash column (pre-calculated from several other columns) and distribute the data based on that row hash. For example:
CREATE TABLE [SomeFact]
(
Row_key BIGINT NOT NULL IDENTITY,
EventDate DATETIME NOT NULL,
EmailAddress NVARCHAR(200) NOT NULL,
-- other columns
RowHash BINARY(16) NOT NULL
)
WITH
(
DISTRIBUTION = HASH(RowHash)
)
The insert SQL is approximately:
INSERT INTO [SomeFact]
(
EmailAddress,
EventDate,
-- Other columns
RowHash
)
SELECT
temp.EmailAddress,
temp.EventDate,
-- Other columns
temp.RowHash
FROM #StagingTable temp
WHERE NOT EXISTS (SELECT 1 FROM [SomeFact] f WHERE f.RowHash = temp.RowHash);
Unfortunately, this is just too slow. I added statistics and even created a secondary index on RowHash, but inserts of any real size (10 million rows, for example) fail with transaction-size errors. I've also tried batches of 50,000, and those are simply too slow.
Two things I can think of that would avoid the singleton record lookups in your current query:
Outer join your staging table with the fact table and filter on the NULL values. Assuming you're using a clustered columnstore index on your fact table, this should be a lot cheaper than the above.
Do a CTAS with a SELECT DISTINCT from the existing fact table and a SELECT DISTINCT from the staging table, combined with a UNION.
My gut says the first option will be faster, but you'll probably want to look at the query plan and test both approaches.
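Hedged sketches of both options, reusing the table and column names from the question; the other columns and the [SomeFact_New] name are placeholders, and the IDENTITY column is left out of the CTAS:

-- Option 1: outer join the staging table to the fact table, keep only the unmatched (NULL) rows
INSERT INTO [SomeFact] (EmailAddress, EventDate, RowHash)
SELECT s.EmailAddress, s.EventDate, s.RowHash
FROM #StagingTable AS s
LEFT JOIN [SomeFact] AS f
    ON f.RowHash = s.RowHash
WHERE f.RowHash IS NULL;

-- Option 2: CTAS the union of the existing fact table and the staging table;
-- UNION (not UNION ALL) removes exact duplicate rows
CREATE TABLE [SomeFact_New]
WITH (DISTRIBUTION = HASH(RowHash))
AS
SELECT EmailAddress, EventDate, RowHash FROM [SomeFact]
UNION
SELECT EmailAddress, EventDate, RowHash FROM #StagingTable;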
Can you partition the 'main' table by EventDate? Assuming the new data has recent EventDates, CTAS out only the partitions that cover the EventDates of the new data, then 'merge' the old and new data with a CTAS / UNION into a table with the same partition scheme (UNION will remove the duplicates), or use the INSERT method you developed against that smaller table. Finally, swap the partition(s) back into the 'main' table.
Note - There is a new option on the partition swap command that allows you to directly 'swap in' a partition in one step: "WITH (TRUNCATE_TARGET = ON)".
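A minimal sketch of that swap step, assuming the rebuilt table shares the 'main' table's partition scheme; the partition number and the [SomeFact_New] name are illustrative:

-- Replaces partition 5 of the main table with partition 5 of the rebuilt table in one step
ALTER TABLE [SomeFact_New] SWITCH PARTITION 5 TO [SomeFact] PARTITION 5
WITH (TRUNCATE_TARGET = ON);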
I'm trying to define, using the interleaving approach of Google Spanner, a mechanism to have rows from several tables in the same split. According to the documentation (https://cloud.google.com/spanner/docs/schema-and-data-model#database-splits), rows that share the same primary key prefix are placed in the same split. But what defines the "same primary key prefix"? Let's take an example. I have three tables with primary keys:
table A has PK (C1, CA)
table B has PK (C1, CB)
table C has PK (C1, CC)
These three tables share the first element of their primary key, column C1. I want all rows with the same value of C1 to go to the same split.
Can I define table A as the parent table for B and C?
Do I need to create a dummy table with PK (C1)?
Any other approach?
The database will have lots of reads, many updates but few inserts.
Any suggestion will be highly appreciated.
Thanks
In order to define a child table, the child table's primary key must contain the entire primary key of the parent as a prefix.
In your example, you would need to create a table with primary key C1. You could then interleave tables A, B, and/or C in this parent table.
The parent table can't quite be a dummy table. To insert a value into a child table, the corresponding row must exist in the parent table. So in your example, you would need to ensure that the parent table has a row for each value of C1 that you want to add to any of the child tables.
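A sketch of that layout in Spanner DDL, assuming INT64 key columns; the parent table's name and the column types are illustrative:

CREATE TABLE Parent (
    C1 INT64 NOT NULL
) PRIMARY KEY (C1);

CREATE TABLE A (
    C1 INT64 NOT NULL,
    CA INT64 NOT NULL
) PRIMARY KEY (C1, CA),
  INTERLEAVE IN PARENT Parent ON DELETE CASCADE;

CREATE TABLE B (
    C1 INT64 NOT NULL,
    CB INT64 NOT NULL
) PRIMARY KEY (C1, CB),
  INTERLEAVE IN PARENT Parent ON DELETE CASCADE;

-- Table C follows the same pattern with PRIMARY KEY (C1, CC).
-- Each value of C1 must exist in Parent before rows can be inserted into A, B, or C.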
Working on a way to compare 2 tables in PowerBI.
I'm joining the 2 tables using the primary key and making custom columns that compare if the old and new are equal.
This doesn't seem like the most efficient way of doing things, and I can't even color code the matrix because some values aren't integers.
Any suggestions?
I did a big project like this last year, comparing two versions of a data warehouse (SQL database).
I tackled most of it in the Query Editor (actually using Power Query for Excel, but that's the same as PBI's Query Editor).
My key technique was to first create a Query for each table, and use Unpivot Other Columns on everything apart from the Primary Key columns. This transforms it into rows of Attribute, Value. You can filter Attribute to just the columns you want to compare.
Then in a new Query you can Merge & Expand the "old" and "new" Queries, joining on the Primary Key columns plus the Attribute column. Then add Filter or Add Column steps to get to your final output.
Coming off an NLTK NER problem, I have PERSONS and ORGANIZATIONS, which I need to store in a sqlite3 db. The received wisdom is that I need to create separate TABLEs to hold these sets. How can I create a TABLE when len(PERSONS) can vary for each id? It can even be zero. The normal use of:
insert into table_name values (?), (t[0]) fails.
Thanks to CL.'s comment, I figured out that the best way is to think in terms of rows in a two-column table, where the first column is id INT and the second column holds person_names. This way there is no issue with the varying length of the PERSONS list. Of course, to link the main table with the persons table, the id field has to reference (as a foreign key) the story_id of the main table.
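A minimal sketch of that two-table schema; the table and column names here are illustrative:

CREATE TABLE stories (
    story_id INTEGER PRIMARY KEY
    -- other columns describing the story
);

CREATE TABLE persons (
    story_id INTEGER NOT NULL REFERENCES stories(story_id),
    person_name TEXT NOT NULL
);

-- One row per extracted name, so a story can have zero, one, or many persons
INSERT INTO persons (story_id, person_name) VALUES (1, 'Ada Lovelace');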
Suppose I have a table A with VAR1 and VAR2. Suppose also I have a table B with VAR1. Is there a way I can check whether VAR1 from table A is in table B without merging?
You have to merge the tables in some fashion, but certainly not necessarily with the MERGE statement. Depending on the characteristics of the two tables, you can use:
Merge
SQL Join
Format lookup (load relevant parts of dataset B into a format)
Hash lookup (load relevant parts of dataset B into hash table)
Array lookup (load relevant parts of dataset B into temporary array)
And probably several other methods that I'm forgetting.
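For instance, the SQL join option might look roughly like this in PROC SQL; the output table name and the in_b flag are assumed, not from the question:

proc sql;
    create table checked as
    select a.VAR1,
           a.VAR2,
           /* 1 when this VAR1 value also appears in table B, otherwise 0 */
           case when b.VAR1 is not null then 1 else 0 end as in_b
    from A as a
    left join (select distinct VAR1 from B) as b
        on a.VAR1 = b.VAR1;
quit;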