PowerBI and nested 1:N data

I'm trying to leverage the advantages of DocumentDB / Elastic / NoSQL for retrieving and visualizing big data. I want to use PowerBI for that, which is pretty good; however, I have no clue how to model a document which has a 1:N nested data field, e.g.
{
  name: string,
  age: int,
  children: [ { name: string }, ... ]
}
In the normal case you would flatten the table by expanding the nested values and joining them, but how does one do that when it's 1:N / a list? Is there a way to extract that into its own table?
I've been thinking about building a bridge that translates a document into data tables, but that feels like the wrong way to go, and it introduces complications around how many endpoints and queries would be needed.
I can't help but think this is a solved problem, as many places analyse and visualize large amounts of data stored in NoSQL. The alternative is a normalized relational database, but analyzing millions and millions of rows there also seems wrong when NoSQL is tuned for exactly these scenarios.

If the data is 1:N but not arbitrarily deep, you can use the expand option in the query tab. You will get one row for each instance of customer, carrying all the attributes of the container.
If you want to get more sophisticated, you could normalize the schema by expanding just the customer id column (assuming there is one in your data) into one table, and expanding the customer details into another, then creating a relationship across them. That makes aggregations (like a count of parents) easier. You'd just load the data twice, and delete the columns you don't need.

Related

Redshift - Redesign tables to use DIST and SORT keys (performance issue)

I'm having serious performance problems on Redshift and I've started to rethink my table structures.
Right now, I'm identifying the tables that matter most for my dashboard. First, I run the following query:
SELECT *
FROM admin.v_extended_table_info
WHERE table_id IN (
    SELECT DISTINCT s.tbl
    FROM stl_scan s
    JOIN pg_user u ON u.usesysid = s.userid
    WHERE s.type = 2 AND u.usename = 'looker'
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del", ':', 1)::int DESC,
         size DESC;
Based on the query result, I identified a lot of small tables (1-1000 records) that are distributed as EVEN but could be ALL; these tables are used in a lot of joins.
Besides that, I've identified that 99% of my tables use EVEN distribution without a sort key. I'm not using denormalized tables, so I need to run plenty of joins to get data; from what I've read, EVEN is not good for joins because rows may have to be redistributed over the network.
I have 3 tables related to the ticket flow: user, ticket and ticket_history. All of them have DISTSTYLE EVEN and no sort keys.
For now, I would like to redesign the user table: it is joined on the condition ticket.user_id = user.id and filtered with WHERE clauses like user.email = 'xxxx@xxxx.com' or user.email LIKE '%@something.com%', or grouped by user.email.
The first thing I'm planning to do is use DISTSTYLE KEY with id as the distribution key. Does it make sense to use a unique column as the dist key? I've read plenty of posts about dist keys and it's still confusing to me.
For the sort key, does it make sense to use email as a compound sort key? I've read to avoid columns that grow, like dates, timestamps or identities, which is why I'm not using an interleaved key. To avoid that LIKE, I'm planning to create a new column that identifies the email domain.
After that, I'll change the small tables to DISTSTYLE ALL and try my queries again.
Am I on the right track? Any other tips?
This question might sound stupid, but my tech background is only software development; I'm learning about Redshift and reading a lot of documentation.
The basic rule of thumb is:
Set the DISTKEY to the column that is most used in JOINs
Set the SORTKEY to the column(s) most used in WHEREs
You are correct that small tables can have a distribution of ALL, which would avoid sending data between nodes.
DISTKEY provides the most benefit when tables are joined via a common column that has the same DISTKEY in both tables. This means that matching rows are stored on the same node and no data needs to be sent between nodes (or, more accurately, slices). However, you can only select one DISTKEY, so choose the column that is most often used for the JOIN.
SORTKEY provides the most benefit when Redshift can skip over blocks of storage. Each block of storage contains data for one column and is marked with a MIN and MAX value. When a table is sorted on a particular column, it minimises the number of disk blocks that contain data for a given column value (since they are all located together, rather than being spread randomly throughout disk storage). Thus, use column(s) that are most frequently used in WHERE statements.
If the user.email wildcard search is slow, you can certainly create a new column with the domain. Or, for even better performance, you could consider creating a separate lookup table with just user_id and domain, having SORTKEY = domain. This will perform the fastest when searching by domain.
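As a rough DDL sketch of that advice (a sketch only: the column list and types are assumptions, and the table is called users here to avoid the reserved word user):
-- assumed columns; only the key choices matter for this example
CREATE TABLE users (
    id     BIGINT NOT NULL,
    email  VARCHAR(256),
    name   VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (id)        -- works best if ticket also uses DISTKEY (user_id)
SORTKEY (email);    -- matches the WHERE / GROUP BY clauses on email

-- optional lookup table for fast searches by email domain
CREATE TABLE user_domain (
    user_id BIGINT NOT NULL,
    domain  VARCHAR(128)
)
DISTKEY (user_id)
SORTKEY (domain);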
A tip from experience: I would advise against using an email address as a user_id because people sometimes want to change email address. It is better to use a unique number for such id columns, with email address as a changeable attribute. (I've seen software systems need major rewrites to fix such an early design decision!)

When to use CTE and temp table?

I know this is a common question, but most frequently people ask about the performance difference between the two.
What I'm asking for are use cases for CTEs and temp tables, to better understand when to use each.
With a temp table you can use CONSTRAINTs and INDEXes. You can also create a CURSOR on a temp table, whereas a CTE terminates at the end of the query (emphasizing: a single query).
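A minimal T-SQL sketch of that difference, using made-up names:
-- constraints and indexes are allowed on a temp table
CREATE TABLE #recent_orders (
    order_id    INT PRIMARY KEY,   -- constraint
    customer_id INT NOT NULL,
    amount      DECIMAL(10, 2)
);
CREATE NONCLUSTERED INDEX ix_customer ON #recent_orders (customer_id);   -- index
-- a CTE, by contrast, exists only for the single statement that follows it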
I will answer with specific use cases from an application I've worked on, in order to illustrate my point.
Common use cases in an example enterprise application are as follows:
Temp Tables
Normally, we use temp tables to transform data before INSERTing into or UPDATEing the appropriate tables when the work requires more than one query: gathering similar data from multiple tables in order to manipulate and process it.
There are different types of orders (order_type1, order_type2, order_type3), all of which live in different tables but have similar columns. We have a stored procedure that UNIONs all these tables into one #orders temp table and UPDATEs a person's suggested orders based on their existing orders.
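A stripped-down sketch of that stored procedure body (the column names and the suggested_orders table are invented for illustration):
-- gather the different order types into one temp table
SELECT order_id, person_id, amount
INTO   #orders
FROM   order_type1
UNION ALL
SELECT order_id, person_id, amount FROM order_type2
UNION ALL
SELECT order_id, person_id, amount FROM order_type3;

-- a second statement can then work against the same temp table
UPDATE s
SET    s.suggested_amount = o.total_amount
FROM   suggested_orders AS s
JOIN   (SELECT person_id, SUM(amount) AS total_amount
        FROM   #orders
        GROUP  BY person_id) AS o
  ON   o.person_id = s.person_id;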
CTEs
CTEs are awesome for readability when dealing with single queries. When creating reports that require analysis using PIVOTs, aggregates, etc. over tons of lines of code, CTEs provide readability by letting you separate a huge query into logical sections.
Sometimes there is a combination of both: when more than one query is required, it's still useful to break down some of those queries with CTEs.
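For example, a report query might be organised like this, with each logical section named (all table and column names invented):
WITH monthly_totals AS (
    SELECT person_id,
           DATEPART(month, order_date) AS order_month,
           SUM(amount)                 AS month_total
    FROM   orders
    GROUP  BY person_id, DATEPART(month, order_date)
),
active_people AS (
    SELECT person_id
    FROM   people
    WHERE  is_active = 1
)
SELECT p.person_id, m.order_month, m.month_total
FROM   active_people  AS p
JOIN   monthly_totals AS m
  ON   m.person_id = p.person_id;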
I hope this is of some usefulness, cheers!

Safely segregating customer data in Spanner

We're exploring options for reliably segregating customer data in Spanner. The most obvious solution is a customer per database, but the 100 database/instance limitation renders that impractical. Past experience leads me to be very suspicious of any plan to add a customer-id field to the primary key of each table, because it's far too easy to screw that up in SQL queries, leading to dangerous data cross-talk.
I'm considering weird solutions like using all 2k tables/instance, and taking the ~32 tables we need per customer and prefixing those. E.g., [cust-id]-Table1, [cust-id]-Table2, etc. At least then the customer segregation logic that needs to be iron-clad can be put in one place that's hard to screw up in queries. But is anyone aware of a less weird approach? E.g., "100" is a suspiciously-non-round number in a technical limitation -- is that adjustable somehow?
Unfortunately, 100 databases/instance is not an adjustable value.
Though, I don't fully understand "very suspicious of any plan to add a customer-id field to the primary key of each table, because it's far too easy to screw that up in SQL queries, leading to dangerous data cross-talk." Are you concerned about query performance, data correctness, code correctness, or the schema?
With this schema, ~32 tables per customer will only allow you to store ~6000 customers. Though I would suggest benchmarking with other schema choices Spanner exposes.
Would you be able to provide a high-level schema of these customer tables as well as your query patterns?
Also, I suggest reading into the following for more ideas that may fit your use case better:
Spanner Schema
Interleaved Tables
Secondary Indexes
SQL Best Practices

Cassandra NOT EQUAL Operator

Question to all Cassandra experts out there.
I have a column family with about a million records.
I would like to query these records in such a way that I can perform a Not-Equal-To kind of operation.
I googled this and it seems I have to use some sort of MapReduce.
Can somebody tell me what options are available in this regard?
I can suggest a few approaches.
1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as boolean columns (e.g. a column isEqualToUnitedStates with true or false).
2) Otherwise, consider emulating the unsupported query != X by combining the results of two separate queries, < X and > X, on the client side (a CQL sketch of this appears after the list).
3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.
So let's say you're interested in all purchases by a particular customer for every product type except Widget. An ideal query would look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget'. Now of course you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob' without wasting too many resources and filter out the rows where item = 'Widget' in the client application.
4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would return too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that scans all rows in the table across the cluster. Such jobs will obviously run a lot slower than native queries, and are quite complex to set up. If you want to go this way, please look into Cassandra Hadoop integration.
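To make approach 2 concrete with the hypothetical purchases table from approach 3, and assuming customer is the partition key and item is a clustering column, the two queries would be:
-- each query is a valid range condition within the 'Bob' partition
SELECT * FROM purchases WHERE customer = 'Bob' AND item < 'Widget';
SELECT * FROM purchases WHERE customer = 'Bob' AND item > 'Widget';
-- the client application then merges the two result sets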
If you want to use a not-equals operator on a specific partition key and get all other data from the table, you can use a combination of range queries and the TOKEN function in CQL to achieve this.
For example, if you want to fetch all rows except the ones whose partition key is 'abc', you execute the 2 queries below:
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) < TOKEN('abc');
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) > TOKEN('abc');
But beware that the result is going to be huge (depending on the size of the table and the fields you need), so you might want to use this in conjunction with a utility like dsbulk. Also note that there is no guarantee of ordering in the result. This is just a data dump, which will most probably be useful for one-off scenarios such as data migration.

nosql/dynamodb hash and range use case

It's my first time using a NoSQL database so I'm really confused. I'd really appreciate any help I can get.
I want to store data comprising announcements in my table. Essentially, each announcement has an ID, a date, and a text.
So for example, an announcement might have ID of 1, date of 2014/02/26, and text of "This is a sample announcement". Newer announcements always have a greater ID value than older announcements, since they are added to the table later.
There are two types of queries I want to run on this table:
I want to retrieve the text of the announcements sorted in order of date.
I want to retrieve the text and dates of the x most recent announcements (say, the 3 most recent announcements).
So I've set up the table with the following attributes:
ID (number) as primary key, and
date (string) as range
Is this appropriate for my use cases? And if so, what kind of queries/reads/requests/scans/whatever (I'm really confused about the terminology here too) should I be running to accomplish the two types of queries I want to make?
Any help will be very much appreciated. Thanks!
You are on the right track.
As far as sorting goes, DynamoDB will sort by the range key, so date will work, but I'd recommend storing it as a number (perhaps milliseconds since the Unix epoch) rather than a string. This will make it trivial to get the announcements in ascending or descending order based on their created date.
See this answer for an overview of local vs global secondary indexes and what capabilities they provide: Optional secondary indexes in DynamoDB
As far as retrieving all items goes, you would need to perform a scan. Scans are not as efficient as queries, but since all of Dynamo is on SSDs they're still relatively quick. You don't get the single-digit-millisecond performance with a scan that you get with a query, so if there's a way to associate announcements with, say, a user ID, you might get better performance than with a scan.
Note that you cannot modify the table schema (hash key, range key, and indexes) after you create the table. There are ways to manually migrate a table or import/export it, but the point is that you should think hard about current and future query requirements up front and design the table to support them. It's very easy to add or stop storing non-key attributes, though, which provides nice flexibility.
Finally, try to avoid thinking of Dynamo as relational. With Dynamo, in a lot of cases you may well be better off denormalizing or duplicating some of the data in exchange for fast query performance.