Let's imagine I have a bunch of reference tables, and I'd like to have one table consisting of:
Fcttbl[ID]
Ref_Product[Product]
Ref_Country[Country]
Ref_Category[Category]
Ref_Subcategory[Color]
All of these are connected to Fcttbl, but there are no other connections between them, if that counts.
Thank you in advance for your help
You can try SUMMARIZE(). If that does not help, then add dummy tables and a sample of the expected result.
SUMMARIZE(
    Fcttbl,
    Fcttbl[ID],
    Ref_Product[Product],
    Ref_Country[Country],
    Ref_Category[Category],
    Ref_Subcategory[Color]
)
Thank you for your help in advance. I tried to find the same issue online but was unable to. I am trying to use Power BI to compare sets of data and make my life easier every time I check data for submissions.
I am going to sum up the issue: if I have a master table with all the data, and then all the data produced by my workers, how can I compare them against the master table? See the example below.
You are asking the question wrong, or asking the wrong question. "How can I compare ...?" is too broad.
What insights do you want to derive from the data? You need to be more specific, like, "How many workers have Product A in yellow and size L?" or something like that. Then you can start building a data model that helps you get these insights.
The first step will be to consolidate the worker data into one table instead of a table for each worker. Add a column for Worker Name and combine them all into one.
Now you can build charts and pivots for aspects of the worker data and also do the same for the master data for comparison.
Currently I'm loading data from Google Storage into stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, there can be cases where the same order has more than one version; the field etl_timestamp tells which row is the most up to date.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
SELECT ...
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
Then production_table_orders always contains the most up-to-date version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It doesn't seem smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestions?
We are doing the same. To help improve performance though, try to partition the table by date_purchased and cluster by orderid.
Use a CTAS statement (to the table itself), as you cannot add partitioning after the fact.
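As a rough sketch of that one-time rebuild (the new table name is hypothetical, and the PARTITION BY expression assumes date_purchased is a TIMESTAMP; adjust it to your schema):

CREATE TABLE `warehouse.stage_table_orders_partitioned`
PARTITION BY DATE(date_purchased)
CLUSTER BY orderid
AS
SELECT *
FROM `warehouse.stage_table_orders`;

If BigQuery lets you replace the table in place with the new partitioning spec, you can use CREATE OR REPLACE TABLE on the original name instead; otherwise, drop the old table and use the partitioned one in its place.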
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between old and new, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted", not millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses. A one-time effort if the schema is pretty much fixed.
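One caveat to this sketch: if stage_table_orders can contain several versions of the same order (as in the question), the MERGE source has to be de-duplicated first, otherwise the statement fails because a target row matches more than one source row. One option is to reuse the ROW_NUMBER() query from the question inside the USING clause, roughly like this:

MERGE final_table_orders F
USING (
  SELECT *
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
    FROM `warehouse.stage_table_orders`
  )
  WHERE rn = 1
) S
ON F.orderid = S.orderid
   AND F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)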
There are many ways to de-duplicate and there is no one-size-fits-all. Search on SO for similar requests using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
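For example, an ARRAY_AGG variant of the de-duplication (using the table and column names from the question; untested, so treat it as a sketch) could look like:

SELECT latest.*
FROM (
  SELECT ARRAY_AGG(o ORDER BY o.etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `warehouse.stage_table_orders` AS o
  GROUP BY o.date_purchased, o.orderid
)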
I'm trying to leverage the advantages of DocumentDB / Elastic / NoSQL for retrieving big data and visualizing it. I want to use Power BI to do that, which is pretty good; however, I have no clue how to model a document which has a 1:N nested data field. E.g.
{
    name: string,
    age: int,
    children: [ { name: string }, ... ]
}
In a normal case, you would flatten the table by expanding the nested values and joining them, but how does one do that when it's 1:N / a list? Is there a way to maybe extract that into its own table?
I've been thinking about making a bridge which translates a document into data tables, but that feels like the wrong way to go, and it further raises complications with regard to how many endpoints and queries would have to be made.
I can't help but think this is a solved issue, as many places analyse and visualize large amounts of data stored in NoSQL. The alternative is a normalized relational database, but having millions and millions of entries in the tables you analyze also seems wrong when NoSQL is tuned for these scenarios.
If the data is 1:N, but not arbitrarily deep, you can use the expand option in the query tab. You will get one row for each instance of customer that has all the attributes of the container.
If you want to get more sophisticated, you could normalize the schema by expanding just the customer id column (assuming there is one in your data) into one table, expanding the customer details into another one, and then creating a relationship across them. That makes aggregations easier (like a count of parents). You'd just load the data twice and delete the columns you don't need.
Alright, I am in a rather difficult situation, or at least I think so anyway. I have been doing some research on how to fix my problem but have really come up empty-handed.
I need to be able to reindex the rowid of my table after I delete a row. That way, at any given time when I want to update or look up a row by its rowid, it is accessing the correct one.
Now, for those of you asking why: basically I am interfacing with a "homebrewed" DB that was programmed in C and is really just a bunch of memory locations all accessed as if they were a DB table. So what I'm trying to say is that they can look up a row by searching for a value in the table, or by simply saying "I want row 6". Lastly, the table could consist of really anything, with any values, which means they don't create a column to use as an index, and ultimately the only thing I can use to index their rows by row number is the rowid, to my knowledge.
So I have found that VACUUM would do what I want or need, but it appears that the system the database is in isn't giving SQLite privileges to write, so when VACUUM is run it comes back with an error (error 14, "unable to open the database file"). (I also know that my DB is open, so that isn't the issue; not having write privileges is the only reason I can come up with.) I have also read some stuff about autoincrement or something like that, but I didn't really understand it or think it was going to fix my problem.
Any suggestions or ideas from the SQLite or database geniuses out there would be appreciated.
Not sure if I have completely understood your problem, but if you can use SQL code, maybe you can write a query to update the IDs (assuming they should be in dense order).
You can use a query like this:
UPDATE t1
SET id = (SELECT rank
          FROM (SELECT id,
                       (SELECT count(*) + 1
                        FROM (SELECT DISTINCT id
                              FROM t1 AS t
                              WHERE t.id < t1.id)
                       ) AS rank
                FROM t1
               ) AS sub
          WHERE sub.id = t1.id
         );
You can check my demo in SQL Fiddle. In this demo you will see the result of the DELETE and UPDATE statements (to simulate your case) if you run all the queries together.
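If you want to reproduce it locally, a minimal self-contained version of such a demo (the table schema and values are made up for illustration) would be:

CREATE TABLE t1 (id INTEGER, val TEXT);
INSERT INTO t1 (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');

-- delete a row in the middle, leaving a gap in the ids: 1, 2, 4, 5
DELETE FROM t1 WHERE id = 3;

-- now run the UPDATE above; the remaining rows are renumbered to 1, 2, 3, 4
SELECT id, val FROM t1 ORDER BY id;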
I have to join tables in HBase.
I integrated Hive and HBase and that is working well. I can query using Hive.
But can somebody help me with how to join tables in HBase without using Hive? I think we can achieve this using MapReduce; if so, can anybody share a working example that I can refer to?
Please share your opinions.
I have an approach in mind. That is,
If I need to JOIN tables A x B x C;
I may use TableMapReduceUtil to iterate over A, then get data from B and C inside the TableMapper, and then use the TableReducer to write back to another table Y.
Will this approach be a good one?
That is certainly an approach, but if you are doing 2 random reads per scanned row then your speed will plummet. If you are filtering the rows out significantly or have a small dataset in A, that may not be an issue.
Sort-merge Join
However, the best approach, which will be available in HBase 0.96, is the MultipleTableInput method. This means that it will scan table A and write its output with a unique key that will allow table B to match up.
E.g. Table A emits (b_id, a_info) and Table B emits (b_id, b_info), merging together in the reducer.
This is an example of a sort-merge join.
Nested-Loop Join
If you are joining on the row key, or the joining attribute is sorted in line with table B, you can have an instance of a scanner in each task which sequentially reads from table B until it finds what it's looking for.
E.g. Table A row key = "companyId" and Table B row key = "companyId_employeeId". Then for each company in Table A you can get all the employees using the nested-loop algorithm.
Pseudocode:
for (company in TableA):
    for (employee in TableB):
        if (employee.company_id == company.id):
            emit(company.id, employee)
This is an example of a nested-loop join.
More detailed join algorithms are here:
http://en.wikipedia.org/wiki/Nested_loop_join
http://en.wikipedia.org/wiki/Hash_join
http://en.wikipedia.org/wiki/Sort-merge_join