insert data in sqlite3 when array could be of different lengths - python-2.7

Coming off an NLTK NER problem, I have PERSONS and ORGANIZATIONS, which I need to store in a sqlite3 db. The received wisdom is that I need to create separate TABLEs to hold these sets. How can I create a TABLE when len(PERSONS) could vary for each id? It can even be zero. The normal use of:
cursor.execute("insert into table_name values (?)", (t[0],)) will fail when the list length varies.

Thanks to CL.'s comment, I figured out the best way is to think in terms of rows in a two-column table, where the first column is an id INT and the second column holds a person's name. This way, there is no issue with the varying length of the PERSONS list. Of course, to link the main table with the persons table, the id field has to REFERENCE (as a foreign key) the story_id of the main table.
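A minimal Python sketch of that layout; the table and column names other than story_id are illustrative assumptions:

import sqlite3

conn = sqlite3.connect("stories.db")
cur = conn.cursor()
# One row per story; one row per (story, person) pair.
# (SQLite only enforces the FK if PRAGMA foreign_keys = ON.)
cur.execute("CREATE TABLE IF NOT EXISTS stories ("
            "story_id INTEGER PRIMARY KEY, text TEXT)")
cur.execute("CREATE TABLE IF NOT EXISTS persons ("
            "id INTEGER PRIMARY KEY, "
            "story_id INTEGER REFERENCES stories(story_id), "
            "person_name TEXT)")

story_id = 1
PERSONS = ["Alice", "Bob"]  # may also be empty; then nothing is inserted
cur.executemany("INSERT INTO persons (story_id, person_name) VALUES (?, ?)",
                [(story_id, name) for name in PERSONS])
conn.commit()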

Related

What is best practice to deal with missing data according to Kimball?

I have a database with the following tables:
Customers, Invoices, Salesman, Target.
The ones relevant to my question are Customers and Invoices.
There are customer IDs used in Invoices that don't exist in the Customers table.
If I used only the customers from the Customers table, my customer dimension would be incomplete.
My solution is to append these IDs from Invoices to Customers and fill the other columns in the Customers table with nulls.
I don't know if this is the best approach according to Kimball.
Also, if it is a good solution, how can I accomplish it with Power BI Desktop?
Customers table: "generated Data"
Invoice table:
(Just a sample; the full table is thousands of rows.)
There are two points here:
Firstly, (in import mode at least) PBI already creates the "blank row" for items present in your fact table but missing from your dimension table, for precisely this scenario. If you don't need the granularity of each individual missing customer id, then you don't need to do anything.
Secondly, if you need to retain that granularity, then your approach is the correct one. The way to do this in Power Query is as follows (see the M sketch after these steps):
Create a new query which merges your invoice fact table with your customer dimension table on customer id, using a left anti join (this keeps only the invoice rows with no matching customer).
Remove all columns apart from the customer id column.
Remove duplicates.
You now have a list of missing customer ids. Ensure the column name is the same as the column name of your customer id in the customer dimension table. Append this to the original customer dimension query and the nulls will be filled in automatically for the missing columns.
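A minimal M sketch of those steps; the query names (Invoices, Customers) and the shared key column name (CustomerID) are assumptions:

let
    // Keep only invoice rows whose CustomerID has no match in Customers.
    Missing = Table.NestedJoin(Invoices, {"CustomerID"}, Customers, {"CustomerID"}, "Cust", JoinKind.LeftAnti),
    IdsOnly = Table.SelectColumns(Missing, {"CustomerID"}),
    Deduped = Table.Distinct(IdsOnly),
    // Appending fills the columns missing from Deduped with nulls.
    Appended = Table.Combine({Customers, Deduped})
in
    Appended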
Please keep in mind that it is Kimball, not Kimble.
There are 4 steps in the DWH methodology:
1) Understand the business process (what is your process actually measuring?)
2) Decide the grain (what does every row in your fact table actually represent?)
3) Decide the dimensions (ask Who-What-Where-When-How-HowMany-HowMuch of the grain declaration formed together with the business process)
4) Define the facts (metrics)
According to this order, you define dimension tables before building your fact tables. If your dimension table, the Customer table in this case, is missing customers that are present in your fact table, my biggest advice according to DWH dimensional modeling is to set your customer table right: define every customer in your dimension table, then populate your fact table with records (a SQL sketch follows the key definitions):
[Customer ID] in Customer Table : PRIMARY KEY
[CustomerID] in Invoice Table : FOREIGN KEY
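In SQL terms, a minimal sketch of those keys; the column types and the extra columns are assumptions:

CREATE TABLE Customer (
    CustomerID   INT PRIMARY KEY,
    CustomerName VARCHAR(100)
);
CREATE TABLE Invoice (
    InvoiceID  INT PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customer(CustomerID), -- every invoice must name a known customer
    Amount     DECIMAL(18, 2)
);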
SQL and Power BI react very differently to this problem:
1) Power BI has no referential integrity concept: it adds a blank row to your dimension table in such a case.
2) SQL gives a referential integrity error, and you can't even add rows to your fact table. I personally side with SQL in this case.
Finally: use an ETL tool (SSIS, Talend, ODI or even Power Query) to make your dimension table as accurate as possible.
For example:
Do not leave any column value as null!
If a date is unknown, put in a default date value like '1900-12-31'.
If a textual property is unknown, put in keywords like 'unknown' or 'not available'.
Dimension tables are what SQL statements filter and group on, and different SQL vendors (SQL Server, Oracle, MySQL) deal with NULL values in different ways, which causes problems performance-wise. A SQL illustration follows.
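A hedged example of applying such defaults during ETL; the table and column names are illustrative:

UPDATE Customer
SET CustomerName      = COALESCE(CustomerName, 'unknown'),          -- unknown textual property
    FirstPurchaseDate = COALESCE(FirstPurchaseDate, '1900-12-31');  -- unknown date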

Check on data modelling

I would like to know whether my data modeling works for Power BI.
The dataset I am using covers training courses for students and corporate trainees. The original data has 3 tables, separated by individual program. The purpose of my visualization is to analyze the 3 programs for all students in a single dashboard.
Here is the original data after being imported to Power BI:
Here is the data pre-processing:
Remove unneeded columns:
DPT table: No, Date, Quarter
DTP table: Count, Email, Date
LLD table: Email, To calculate, Learning Hours
Rename columns & impute missing values with "Not given":
DPT table: Trainee = Name, Training Provider = Provider, Course name = Course, Focus area = Domain
DTP table: Participant Name = Name, Event/Training Name = Course, Training providers = Provider
Create new columns, impute them with "Not given", and put them in the same position (for appending the tables later):
DTP table: Level
LLD table: Company, Provider, Level
Create a new column called Program and fill it with the program name for each row.
The cleaned tables:
After appending the 3 tables and calling it Master:
Duplicate the Master table to create the Student, Provider and Program tables. In each table, remove irrelevant columns, remove duplicates and create a unique ID (a sketch of one such table in M follows).
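A hedged M sketch of deriving one of these tables from the Master query; the column names are assumptions:

let
    // Keep only the columns that belong to this dimension.
    Cols = Table.SelectColumns(Master, {"Provider"}),
    Distinct = Table.Distinct(Cols),
    // Add a surrogate key for the dimension.
    WithId = Table.AddIndexColumn(Distinct, "ProviderID", 1, 1)
in
    WithId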
Final data model:
The focus is on the Program, Provider and Student tables. The relationships to the rest of the tables will be deactivated when creating calculated columns and measures, before I make any corrections to the data model.
Is there a proper approach to building the data model?
From my data model in the last picture, does it mean that the Provider table is a fact table while the Student and Program tables are dimensions?
I agree with removing unneeded columns, renaming columns for readability, and substituting 'Not Given' in place of NULLs (a caution here: measures and dimensions handle nulls differently; for dimensional values, substituting is okay).
If modeling in Power BI is a must, then the following strategy should apply (a sketch of the resulting schema follows this list):
The dimensions can be Students, Programs, Providers.
A factless fact table (FactProgram or something like that). It will have dimension keys to Students, Programs, Providers (plus any additional measures that you create or take from Master).
Remove unnecessary columns from the dimensions, so that Remove Duplicates gives you what you want. For instance, Student and Program currently have the same columns coming from Master (Company, Course, Domain, Level, Program, Provider). Make it clear which columns belong to which dimension, and optionally create new dimensions (maybe a DimCompany).
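A minimal SQL sketch of that star schema; all names and types are illustrative assumptions:

CREATE TABLE DimStudent  (StudentID  INT PRIMARY KEY, Name VARCHAR(100), Company VARCHAR(100));
CREATE TABLE DimProgram  (ProgramID  INT PRIMARY KEY, Program VARCHAR(100), Level VARCHAR(50));
CREATE TABLE DimProvider (ProviderID INT PRIMARY KEY, Provider VARCHAR(100));
-- Factless fact: one row per (student, program, provider) enrollment, no measures required.
CREATE TABLE FactProgram (
    StudentID  INT REFERENCES DimStudent(StudentID),
    ProgramID  INT REFERENCES DimProgram(ProgramID),
    ProviderID INT REFERENCES DimProvider(ProviderID)
);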

Power BI / Power Query - M language - playing with data inside grouped tables

Hello M language masters!
I have a question about working with grouped rows, where Power Query creates a table of data per group. But maybe it is better to start from the beginning.
Important information! I will be asking, as an example, only about adding an index. I know that there are different ways to reach such a result, but for this question I need an answer about the possibility of working on the nested tables, because I want to reuse the answer for other actions (e.g. sorting a grouped table, adding columns to it).
In my sample data source, I have a list of fake transactions. I want to add an index for each Salesman, to count operations for each of them.
Sample Data
So I just added this file as a data source in Power BI. In Power Query, I grouped the rows by name. This step created for me a column containing a table for each Salesman, which stores all of his or her operations.
Grouping result
And now I want to add an index column to each of those tables. I know that this is possible by adding a new column to the main table, which will store a new copy of each table with the index added:
Custom column function
And each table now has an index. That is good. But I have an extra column now (one with the table without the index, and one with the table with the index).
Result - a little bit messy
So I want to ask if there is any possibility of adding such an index directly to the tables in the Operations column, without creating the additional column. My approach seems a bit messy and I want to find something cleaner. Does anyone know a smart solution for that?
Thank you in advance.
Artur
Sure, you may do it inside the Table.Group function:
= Table.Group(
      Source,
      {"Salesman"},
      // Build each group's table with the index already added, so no extra column is needed.
      {"Operations", each Table.AddIndexColumn(_, "i", 1, 1)})
P.S. To copy an existing index column of the outer table into each nested table, use this code:
= Table.ReplaceValue(
      PreviousStep,
      each [index],
      0,
      // Replace each nested table with the same table plus an "index" column
      // holding the outer row's [index] value (the 0 placeholder is unused).
      (tbl, oldIdx, new) => Table.AddColumn(tbl, "index", each oldIdx),
      {"Operations"})

Cannot create a 1:M relationship

In Power BI, I am attempting to join a dimension table to a fact table. The dimension table has approximately 1.1M rows in it (a lot, I know, for a dimension table). All the values are unique. When I attempt to join this to the fact table, PBI automatically creates an M:M relationship. When I attempt to change this to 1:M, I get a message saying "The cardinality you selected for this relationship isn't valid".
Here is the query that generates the dataset. As you can see, it's impossible for there to be duplicates.
SELECT DISTINCT
[TranDesc] as TransactionDescription
FROM [dbo].[dGLTranDescription];
Why would I get this message?
Try to validate that Power BI sees the values in the dimension table as unique. Depending on your data, the source system and Power BI may see it differently.
Here are suggestions from https://community.powerbi.com/t5/Desktop/The-cardinality-you-selected-isn-t-valid-for-this-relationship/td-p/73470
1.
Create two measures to verify in Power BI:
TotalRows = COUNTROWS('DimTableHere')
DistinctRows = DISTINCTCOUNT('DimTableHere'[DimTableJoinColumnHere])
After creating those two measures, place them in two card visuals; if the results are different, it means there are duplicate values in your dimension table.
2.
If you had duplicates when first creating the relationship and now you don't, deleting the relationship and recreating it may resolve the error.
If you have removed duplicates on the relationship column and it still considers the cardinality invalid, try running Text.Clean on that column prior to the duplicate removal. I once had a special character: after removing duplicates in the query, the values counted as different there, but once imported they were considered the same. For example:
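A hedged M fragment of cleaning before deduplicating; the step name (PreviousStep) and column name are assumptions:

// Inside a let expression:
Cleaned = Table.TransformColumns(PreviousStep, {{"TransactionDescription", Text.Clean, type text}}),
Deduped = Table.Distinct(Cleaned, {"TransactionDescription"})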

Comparing 2 Tables in PowerBI

Working on a way to compare 2 tables in Power BI.
I'm joining the 2 tables on the primary key and making custom columns that check whether the old and new values are equal.
This doesn't seem like the most efficient way of doing things, and I can't even color-code the matrix because some values aren't integers.
Any suggestions?
I did a big project like this last year, comparing two versions of a data warehouse (SQL database).
I tackled most of it in the Query Editor (actually using Power Query for Excel, but that's the same as PBI's Query Editor).
My key technique was to first create a Query for each table, and use Unpivot Other Columns on everything apart from the Primary Key columns. This transforms each table into rows of Attribute, Value pairs. You can filter Attribute down to just the columns you want to compare.
Then in a new Query you can Merge & Expand the "old" and "new" Queries, joining on the Primary Key columns plus the Attribute column. Then add Filter or Add Column steps to get to your final output. A sketch in M follows.
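Assuming queries named Old and New that share a primary key column named "ID" (all names here are illustrative), the shape is:

let
    // Turn every non-key column into Attribute/Value rows.
    OldRows = Table.UnpivotOtherColumns(Old, {"ID"}, "Attribute", "OldValue"),
    NewRows = Table.UnpivotOtherColumns(New, {"ID"}, "Attribute", "NewValue"),
    // Full outer join so rows missing from either side still show up.
    Merged = Table.NestedJoin(OldRows, {"ID", "Attribute"}, NewRows, {"ID", "Attribute"}, "New", JoinKind.FullOuter),
    Expanded = Table.ExpandTableColumn(Merged, "New", {"NewValue"}),
    // Keep only the rows where the two versions disagree.
    Differences = Table.SelectRows(Expanded, each [OldValue] <> [NewValue])
in
    Differences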