Thank you for your help in advance. I tried to find the same issue online but was unable to. I am trying to use Power BI to compare sets of data and make my life easier every time I check data for submissions.
I am going to sum up the issue: if I have a master table with all the data, and then all the data produced by my workers, how can I compare them against the master table? See the example below.
You are asking the question wrong, or asking the wrong question. "How can I compare ...?" is too broad.
What insights do you want to derive from the data? You need to be more specific, like, "How many workers have Product A in yellow and size L?" or something like that. Then you can start building a data model that helps you get these insights.
The first step will be to consolidate the worker data into one table instead of a table for each worker. Add a column for Worker Name and combine them all into one.
Now you can build charts and pivots for aspects of the worker data and also do the same for the master data for comparison.
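Here is a minimal sketch of that append-and-compare idea, written in Python/pandas only to illustrate the shape of the transformation (inside Power BI you would do the same thing with Append Queries and a Merge in Power Query); the worker names and the Product/Colour/Size columns are made up for the example:

```python
import pandas as pd

# Hypothetical worker tables; in Power BI these would be separate queries.
worker_a = pd.DataFrame({"Product": ["A", "B"], "Colour": ["yellow", "red"], "Size": ["L", "M"]})
worker_b = pd.DataFrame({"Product": ["A", "C"], "Colour": ["blue", "green"], "Size": ["S", "L"]})

# Step 1: consolidate into one table with a Worker Name column (the "append" step).
workers = pd.concat(
    [worker_a.assign(**{"Worker Name": "Alice"}),
     worker_b.assign(**{"Worker Name": "Bob"})],
    ignore_index=True,
)

# Hypothetical master table to compare against.
master = pd.DataFrame({"Product": ["A", "B", "C"], "Colour": ["yellow", "red", "green"], "Size": ["L", "M", "L"]})

# Step 2: compare - flag worker rows that do not match any master row (an anti-join).
compared = workers.merge(master, how="left", on=["Product", "Colour", "Size"], indicator=True)
compared["Matches Master"] = compared["_merge"] == "both"
print(compared.drop(columns="_merge"))
```

Rows flagged False are the submissions that differ from the master table, and those are the ones worth surfacing in your charts and pivots.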
Related
Anyone have experience with Drilldown Choropleth recently? I have taken a step back to try ArcGIS, but I want to have a multi-layer map built in Power BI with shading using this add-in. I am having issues with loading the JSON files: one for states (USA), one for metro areas (MSA, USA). I am also not seeing the fields to add data points. The info I researched on the app includes a JSON file link that goes to a 404.
If anyone wants to provide tips on moving over to a self-contained ArcGIS setup instead, I would accept that.
More on the app: https://appsource.microsoft.com/en-us/product/power-bi-visuals/wa104381044?tab=overview
I basically need one layer of shading on drill-down for geography with points, then one layer for demographic stats and one layer for population stats. Help?
For TopoJSON files that work:
https://github.com/deldersveld/topojson
I used the US Counties one, so that's all I can comment on working.
Question
What is the best way to update a column in a table with tens of millions of rows?
1) I have seen creating a new table and renaming the old one when finished.
2) I have seen updating in batches using a temp table.
3) I have seen a single transaction (I don't like this one, though).
4) I have never heard a cursor solution recommended for a problem like this, and I don't think it's worth trying.
5) I read about loading data from a file (using BCP), but I have not read whether the performance is better or not. It was not clear if it is just for copying, or if it would allow joining a big table with something and then bulk copying.
I would really like some advice here.
Priority is performance.
At the moment I'm testing solution 2) and exploring solution 5).
Additional Information (UPDATE)
Thank you for the critical thinking here.
The operation can be done in downtime.
The UPDATE will not cause row forwarding.
All the tables have indexes, 5 on average, although a few tables have around 13 indexes.
The probability that the target column is present in one of the table's indexes is something like 50%.
Some tables can be rebuilt and replaced; others can't, because they are part of a software solution and we might lose support for it.
Some of those tables have triggers.
I'll need to do this for more than 600 tables, of which ~150 range from 0.8 million to 35 million rows.
The update always targets the same column across the various tables.
References
BCP for data transfer
Actually it depends:
on the number of indexes the table contains
the size of the row before and after the UPDATE operation
type of UPDATE - would it be in place? does it need to modify the row length
does the operation cause row forwarding?
how big is the table?
how big would the transaction log of the UPDATE command be?
does the table contain triggers?
can the operation be done in downtime?
will the table be modified during the operation?
are minimal logging operations allowed?
would the whole UPDATE transaction fit in the transaction log?
can the table be rebuilt & replaced with a new one?
what was the timing of the operation on the test environment?
what about free space in the database - is there enough space for a copy of the table?
what kind of UPDATE operation is to be performed? do additional SELECT commands have to be done to calculate the new value of every row? or is it a static change?
Depending on the answers and the results of the operation in the test environment, we could consider the fastest operations to be:
a minimally logged copy of the table
an in-place UPDATE operation, preferably in batches (a minimal sketch follows below)
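For the batched in-place UPDATE, here is a minimal sketch of the loop, driven from Python with pyodbc simply because a scripting layer makes it easy to repeat over many tables; the same pattern can be written in pure T-SQL with a WHILE loop and @@ROWCOUNT. The connection string, table name, column name, batch size and the static new value are all placeholders to adjust for your environment:

```python
import pyodbc

# Hypothetical connection string - point it at your own server and database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

BATCH_SIZE = 50_000  # tune per table: large enough to be fast, small enough to keep the log manageable

# Each batch is its own transaction, so the log can be backed up / truncated between
# batches and locks are held only briefly. The WHERE clause must exclude rows that are
# already updated, otherwise the loop never finishes.
batch_update = f"""
UPDATE TOP ({BATCH_SIZE}) dbo.MyTable
SET    TargetColumn = 'new value'
WHERE  TargetColumn <> 'new value' OR TargetColumn IS NULL;
"""

while True:
    cursor.execute(batch_update)
    rows = cursor.rowcount          # rows touched by this batch
    conn.commit()                   # commit per batch
    if rows < BATCH_SIZE:           # last, partial batch means we are done
        break

conn.close()
```

If the new value has to be calculated per row, the SET and WHERE clauses change accordingly, but the batching loop itself stays the same.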
I have one mapping which includes just one source table and one target table. The source table has 100 columns and around 33xxxx records. I need to use this tool to insert into the target table, and the logic is insert only. The Informatica version is 9.6.1 and the database is SQL Server 2012.
After I run the workflow, it inserts at 5x/s; the speed is too slow. I think it may be related to the number of columns.
Can anyone help me with how to increase the speed?
Thanks a lot
I think I know the reason why it happened: there are two fields in this table which are ntext fields. That's why it takes a very long time.
You can try the below options
1) Use bulk option for 'Target Load type' attribute in session if the target table doesn't have any indexes or keys on it
2) If there is any SQL override in the SOURCE QUALIFIER try to tune the query
3) Search for 'BUSY' in the session log and note down the busy percentage of each thread. Based on the thread percentages you will be able to identify the exact thread (Reader, Transformation, Writer) that is taking the most time.
4) Try to use informatica partitions through which you can achieve parallel processing.
Thanks and Regards,
Raj
Consider the following points to increase the performance:
Increase the "commit interval" size in the session level properties.
Use the "bulk load" in session level properties.
You can also use the "partitioning" in session level, to do this you need partitioning license.
If your source is a database and you are doing a SQL override in the Source Qualifier transformation, then you can also use "Hints" to increase the performance.
I'm really confused about how or what AWS services to use for my case.
I have a web application which stores user interaction events. Currently these events are stored in an RDS table. Each event contains about 6 fields like timestamp, event type, userID, pageID, etc. Currently I have millions of event records in each account schema. When I try to generate reports out of this raw data, the reports are extremely slow since I run complex aggregation queries over long time periods. A report over a 30-day period might take 4 minutes to generate on RDS.
Is there any way to make these reports run MUCH faster? I was thinking about storing the events in DynamoDB, but I cannot run such complex queries on the data there, or do any attribute-based sorting.
Is there a good service combination to achieve this? Maybe using Redshift, EMR, Kinesis?
I think Redshift is your solution.
I'm working with a dataset that generates about 2,000,000 new rows each day, and I run really complex operations on it. You could take advantage of Redshift sort keys and order your data by date.
Also, if you run complex aggregate functions, I really recommend denormalizing all the information and inserting it into a single table with all the data. Redshift uses very efficient, automatic column compression, so you won't have problems with the size of the dataset.
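As a rough sketch of that layout: the column names below are hypothetical, taken from the event fields mentioned in the question (timestamp, event type, user ID, page ID), and the cluster endpoint and credentials are placeholders:

```python
import psycopg2

# Hypothetical Redshift connection - replace endpoint and credentials with your own.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
cur = conn.cursor()

# One denormalized table; DISTKEY spreads users across slices, SORTKEY on the
# timestamp keeps date-range scans cheap. Column compression is applied
# automatically when the data is loaded.
cur.execute("""
CREATE TABLE IF NOT EXISTS events (
    event_ts    TIMESTAMP   NOT NULL,
    event_type  VARCHAR(64),
    user_id     BIGINT,
    page_id     BIGINT,
    account_id  BIGINT
)
DISTKEY (user_id)
SORTKEY (event_ts);
""")

# A 30-day aggregation now only scans the sorted blocks for that date range.
cur.execute("""
SELECT event_type, COUNT(*) AS events, COUNT(DISTINCT user_id) AS users
FROM events
WHERE event_ts >= DATEADD(day, -30, GETDATE())
GROUP BY event_type;
""")
print(cur.fetchall())

conn.commit()
conn.close()
```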
My usual solution to problems like this is to have a set of routines that roll up the aggregated results and store them, at various levels, in additional RDS tables. The transactional information you are storing isn't likely to change once logged. So, for example, if you find yourself running daily/weekly/monthly rollups of various slices of data, run the query and store those results - not necessarily at the final level you will need, but at a level that significantly reduces the number of rows that go into those eventual rollups. For example, have a daily table that summarizes event type, user ID and page ID with one row per day instead of one row per event (or one row per hour instead of per day). You'll need to figure out the most logical rollups to make, but you get the idea: the goal is to pre-summarize at levels that reduce the amount of raw data but still give you plenty of flexibility to serve your reports.
You can always go back to the granular/transactional data as long as you keep it around, but there is not much to be gained by constantly calculating the same results every time you want to use the data.
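A minimal sketch of the daily-rollup idea, assuming a PostgreSQL-flavoured RDS instance; the events and events_daily table layouts are hypothetical, based on the fields mentioned in the question, and the script would be run once per day by a scheduled job:

```python
import psycopg2  # the same idea works with any RDS engine and its driver

# Hypothetical RDS endpoint and credentials.
conn = psycopg2.connect(host="my-rds-instance.example.com", dbname="app", user="reporter", password="...")
cur = conn.cursor()

# Hypothetical summary table: one row per (day, event_type, user_id, page_id)
# instead of one row per raw event.
cur.execute("""
CREATE TABLE IF NOT EXISTS events_daily (
    event_date  DATE        NOT NULL,
    event_type  VARCHAR(64) NOT NULL,
    user_id     BIGINT      NOT NULL,
    page_id     BIGINT      NOT NULL,
    event_count BIGINT      NOT NULL,
    PRIMARY KEY (event_date, event_type, user_id, page_id)
);
""")

# Roll up yesterday's raw events once; reports then read events_daily
# instead of re-aggregating millions of raw rows every time.
cur.execute("""
INSERT INTO events_daily (event_date, event_type, user_id, page_id, event_count)
SELECT CAST(event_ts AS DATE), event_type, user_id, page_id, COUNT(*)
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '1 day'
  AND event_ts <  CURRENT_DATE
GROUP BY CAST(event_ts AS DATE), event_type, user_id, page_id;
""")

conn.commit()
conn.close()
```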
I am starting to use imdbpy and I am interested in a way to implement a method with the following specifications:
Inputs:
numberToRetrieve (number of movies to retrieve)
movieGenre (the genre of movies by example:'horror')
Output:
List of movie objects
Many thanks in advance!
Joshua
I'm way too late, but I'll answer for other users that will need it in the future.
You can't do this (not easily, at least).
My best advice is to use the imdbpy2sql.py script to populate a database of your choice (PostgreSQL, MySQL, SQLite and others) and then script your series of batch operations as SQL statements. It's possible that you will need to add some indexes here and there to speed up data retrieval.
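As a rough sketch of that approach, assuming an SQLite database populated by imdbpy2sql.py: the table and column names below (title, kind_type, movie_info, info_type) follow the schema that script generates, but verify them against your own database, since they can differ between IMDbPY versions. Note that it returns plain rows rather than imdbpy Movie objects:

```python
import sqlite3

def get_movies_by_genre(db_path, movie_genre, number_to_retrieve):
    """Return up to number_to_retrieve (title, year) rows tagged with movie_genre."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    # title: one row per movie/episode; kind_type distinguishes movies from series.
    # movie_info holds one row per (movie, info_type, value); genres are one info type.
    cur.execute(
        """
        SELECT t.title, t.production_year
        FROM title t
        JOIN kind_type k   ON k.id = t.kind_id AND k.kind = 'movie'
        JOIN movie_info mi ON mi.movie_id = t.id
        JOIN info_type it  ON it.id = mi.info_type_id AND it.info = 'genres'
        WHERE mi.info = ?
        LIMIT ?
        """,
        (movie_genre.capitalize(), number_to_retrieve),
    )
    rows = cur.fetchall()
    conn.close()
    return rows

# Example: 10 horror movies (genre values are stored capitalized, e.g. 'Horror').
print(get_movies_by_genre("imdb.db", "horror", 10))
```

An index on movie_info(info_type_id, info) is the kind of addition that tends to speed this query up considerably.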