I just can't find a solution to what seems like an easy problem to solve.
I have a dynamic line graph built with Chart.js and based on data retrieved from a DB. User choices affect how the data is picked. The graph compares two datasets, and the datasets can be of different lengths.
Let's say: dataset A can have 4 values, dataset B can have 8.
What I get is that if dataset A is longer, dataset B is simply shown as a shorter line. But if the situation is the opposite, so dataset A is shorter than dataset B, the B line is truncated to the length of dataset A and the remaining values of B are not shown.
Is there a way to prevent this and adapt the size of the graph to the longer dataset rather than to the first one?
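For reference, a minimal sketch of the kind of configuration I mean (the element id, labels and values are made up); the x-axis labels are currently built from dataset A, and with a category axis Chart.js only draws one point per label, which seems to be why B gets cut off when it is the longer one:

const dataA = [3, 5, 2, 7];                          // hypothetical values from the DB
const dataB = [4, 6, 1, 8, 2, 9, 5, 3];

// labels currently come from dataset A, so only dataA.length points are drawn
const labels = dataA.map((_, i) => 'Point ' + (i + 1));

new Chart(document.getElementById('myChart'), {
    type: 'line',
    data: {
        labels: labels,
        datasets: [
            { label: 'A', data: dataA },
            { label: 'B', data: dataB }
        ]
    }
});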
I am attempting to create a line chart with two separate lines by applying separate filters. To do so I have used slicers to create two tables, CF1 and CF2, which are the original data filtered by the respective filters on the LHS and RHS of the data below (using relationships).
However, when displaying the information on a chart, the second set of data shows as the sum of its values rather than the individual values, as CF1 does.
I have tried creating a line chart with only CF1 or CF2 as the values and that works without issue.
To make this happen you need a common X-axis column to establish the evaluation context.
Right now you are using the date column from one of the tables, say Table 1, which causes the measure that scans your Table 2 to just sum all of its entries, since the evaluation context does not filter Table 2 at all.
Read up on calendar tables and implement one. There are plenty of resources on this on the Web. See this example from Radacad, for instance.
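A minimal sketch of the idea, with made-up table and column names (Table1[Date] and Table2[Date] standing in for the date columns behind CF1 and CF2): build a dedicated date table, relate both source tables to it, and put the calendar's date on the X-axis so both measures are sliced by the same dates.

DateTable =
CALENDAR (
    -- covers the full date range of both tables
    MIN ( MIN ( Table1[Date] ), MIN ( Table2[Date] ) ),
    MAX ( MAX ( Table1[Date] ), MAX ( Table2[Date] ) )
)

Then create one-to-many relationships from DateTable[Date] to Table1[Date] and to Table2[Date] in the model view, and use DateTable[Date] as the chart axis instead of a date column from either source table.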
I have to divide two measures from two different tables. I have created measure-1 in Table A and measure-2 in Table B.
When I use a matrix visual in Power BI with the date field in columns and region in rows (for Tables A and B), I can see that both tables' values are correct, as expected.
Ex: the Table A 2017-Q1 value from measure-1 is 29.2, and the Table B 2017-Q1 value from measure-2 is 2.9.
I have to divide the two measures and show the result (divide %) in Table A alongside measure-1.
Unfortunately I have tried multiple approaches, including forming a relationship between the two tables, but I am not getting the expected result: from 29.2/2.9 we should get 10%, but instead I am getting 3%.
Without knowing your data model, it's hard to give a reasonable answer.
https://learn.microsoft.com/en-us/dax/related-function-dax
Your best chance of understanding what happens is to read up on relationships, and change them when needed. The documentation is a great starting point.
Unrelated data plotted in a visual over different data will always aggregate, since there is no relationship to split your values. The value of 3% is correct; your assumption that you want 10% as an outcome is not valid for your situation.
If you link the dates of Table A and the dates of Table B to a separate calendar table, it would all work.
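As a rough sketch, assuming a shared calendar table (named Calendar here) related to both Table A and Table B on their date columns, and measures named [Measure1] and [Measure2], the division itself is just:

Divide % =
DIVIDE ( [Measure1], [Measure2] )    // swap the arguments if you want the inverse ratio

Format the measure as a percentage and put Calendar's date (not Table A's or Table B's own date column) on the matrix columns; with the common calendar filtering both tables, both measures are evaluated for the same quarter before the division.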
I'm trying to return "duplicates" from a range. In this case a duplicate is when more than one row has the same data in the first and last columns (the data in the middle columns needs to be returned, but plays no part in the matching itself).
For a small example data set and desired output see this sheet.
My current incomplete solution path is as follows:
I use
=QUERY({SourceData!A2:E,ARRAYFORMULA(IF(LEN(SourceData!A2:A),COUNTIFS(SourceData!A2:A&SourceData!E2:E,SourceData!A2:A&SourceData!E2:E,ROW(SourceData!A2:A),"<="&ROW(SourceData!A2:A)),))},"select Col1, Col2, Col3, Col4, Col5 where Col6 > 1")
where the ARRAYFORMULA appends a rolling-count column to the end of the range, and the QUERY then returns the rows of the original range where the rolling count is above 1.
However, this only gives me the subsequent rows and not the first of the duplicates. (In the example it only gives me the second row of the matching pair and not the first.)
I'm tempted to limit the QUERY output to just column 1 and then wrap that output in a JOIN to make the output conditions of another QUERY. But given the size of the actual data set and the sheer number of IMPORTRANGEs and QUERYs I've already got going I'm starting to worry about efficiency. (I've got 12 Google Sheet documents all importing from a 13th Google Sheet document then the 13th document pulls and combines data from the 12 other sheets and spits subsets of the combined data set back to each of the 12 other documents.) The whole thing won't be usable if a user has to wait multiple minutes while all the functions resolve. Plus I'm sure someone out there has a more elegant way of getting this done that would be helpfully enlightening to an amateur such as me.
Advice is appreciated! Thank you for your time.
try:
={SourceData!A1:E1;
ARRAYFORMULA(FILTER(SourceData!A2:E, REGEXMATCH(SourceData!A2:A&SourceData!E2:E,
TEXTJOIN("|", 1, FILTER(SourceData!A2:A&SourceData!E2:E,
COUNTIFS(SourceData!A2:A&SourceData!E2:E, SourceData!A2:A&SourceData!E2:E,
ROW(SourceData!A2:A), "<="&ROW(SourceData!A2:A))>=2)))))}
I'm having huge performance issues with a SAS DI job that I need to get up and running. Therefore I'm looking for clever ways to optimize the job.
One thing in particular that I thought of is that I should perhaps permute the order of some joins and an append. Currently, my job is configured as follows:
There are several similarly structured source tables to which I first apply a date filter (to reduce the number of rows) and sort on two fields, say a and b. I then left join each table to an account table on the same fields a and b (I'd like to create indexes for these if possible, but don't know how to do it for temporary work tables in SAS DI). After each of these joins is complete, I append the resulting tables into one dataset.
It occurs to me that I could append first and then do just one join, but I have no notion of which approach is faster, or, if the answer is that it depends, what it depends on (though I'd guess the size of the constituent tables).
So, is it better to do many joins then append, or to append then do one join?
EDIT
Here is an update with some relevant information (requested by user Robert Penridge).
The number of source tables here is 7, and the size of these tables ranges from 1,500 rows to 5.2 million; 10,000 is typical. The number of columns is 25. These tables are each being joined with the same table, which has about 5,000 rows and 8 columns.
I estimate that the unique key partitions the tables into subsets of roughly equal size; the size reduction here should be between 8% and 30% (the difference is because some of the source tables carry much more historical data than others, adding to the percentage of the table grouped into the same number of groups).
I have limited the number of columns to the exact minimum amount required (21).
By default SAS DI creates all temporary datasets as views, and I have not changed that.
The code for the append and the joins is auto-generated by SAS DI after constructing them with GUI elements.
The final dataset is not sorted; my reason for sorting the data which feeds the joins is that the section of this link on join performance (page 35) mentions that it should improve performance.
As I mentioned, I'm not sure if one can put indexes on temporary work tables or views in SAS DI.
I cannot say whether the widths of the fields are larger than absolutely necessary, but if so, I doubt it is egregious. I hesitate to change this since it would have to be done manually, on several tables, and when new data comes in it might need that extra column width.
Much gratitude
Performance in SAS is mainly about reducing IO (i.e., reading/writing to the disk).
Without additional details it's difficult to help but some additional things you can consider are:
limit the columns you are processing by using a keep statement (reduces IO)
if the steps performing the joins are IO intensive, consider using views rather than creating temporary tables
if the joins are still time-consuming, consider replacing them with hash table lookups (see the sketch after this list)
make sure you are using proc append to append the 2 datasets together to reduce the IO. Append the smaller dataset to the larger dataset.
consider not sorting the final dataset but placing an index on it for consumers of the data.
ensure you are using some type of dataset compression, or ensure your column widths are set appropriately for all columns (i.e., you don't have a width of 200 on a field that only uses a width of 8)
reduce the number of rows as early in the process as possible (you are already doing this, just listing it here for completeness)
Adjusting the order of left-joins and appends probably won't make as much difference as doing the above.
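A minimal sketch of the hash-lookup approach, with made-up names (src1 through src7 for the date-filtered source tables, account for the 5,000-row lookup table keyed on a and b, and acct_val1/acct_val2 for the columns pulled from it):

data combined;
    /* hypothetical lookup columns coming from the account table */
    length acct_val1 8 acct_val2 $20;
    /* implicit append of the filtered source tables; no sorting required */
    set work.src1 work.src2 work.src3 work.src4
        work.src5 work.src6 work.src7;
    if _n_ = 1 then do;
        declare hash acct(dataset: 'work.account');
        acct.defineKey('a', 'b');
        acct.defineData('acct_val1', 'acct_val2');
        acct.defineDone();
    end;
    /* left-join behaviour: keep every source row, blank the lookup columns when there is no match */
    if acct.find() ne 0 then call missing(acct_val1, acct_val2);
run;

This loads the small account table into memory once and replaces both the sorts and the seven separate joins; in SAS DI Studio it could go into a user-written code transform, or you could use the built-in Lookup transform described in the next answer.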
As per your comments it seems that
1. There are 7 input source tables
2. Join these 7 source tables to 1 table
3. Append the results
In SAS DI Studio, use a Lookup transform to perform the above much faster:
1. Connect the 7 input tables to a Lookup transform (let's call them SRC 1-7)
2. The table with 5000 records is the table on which the lookup is performed, on keys A and B (let's call it LKUP-1)
3. Take the relevant columns from LKUP-1 to propagate into the TARGET tables.
This will be much faster, and you don't have to perform joins in this case. I suspect you are doing a many-to-many join, which is degrading the performance in SAS DIS.
Question Summary
I can read all values out of the single column of a one-column table quite quickly. How can I read all values just as quickly from a single column of a table that has several other columns as well?
Details
I'm using the C++ API to read an SQLite database containing a single table with 2.2 million records.
The data has a "coordinates" column and (optionally) several other columns. The "coordinates" column is a BLOB and currently is always 8 bytes long. The other columns are a mix of TEXT and REAL, with the TEXT strings anywhere from a few characters to about 100 characters (the lengths vary record by record).
In one experiment, I created the table with the "coordinates" column, plus about 15 other columns. The total database file size was 745 MB. I did a simple
int rc = sqlite3_exec( db, "select coordinates from facilities", ReadSQLiteCallback, NULL, &errorMessage );
and it took 91 seconds to execute.
I then created the table with just the "coordinates" column and no other data columns. The total database file size was 36 MB. I ran the same select statement and it took 1.23 seconds.
I'm trying to understand what accounts for this dramatic difference in speed, and how I can improve the speed when the table has those additional data columns.
I do understand that the larger file means simply more data to read through. But I would expect the slowdown to be at worst more or less linear with the file size (i.e., that it would take maybe 20 times the 1.23 seconds, or about 25 seconds, but not 91 seconds).
Question Part I
I'm not using an index on the file because in general I tend to read most or all of the entire "coordinates" column at once as in the simple select above. So I don't really need an index for sorting or quickly accessing a subset of the records. But perhaps having an index would help the engine move from one variable-sized record to the next more quickly as it reads through all the data?
Is there any other simple idea that might help cut down on those 91 seconds?
Question Part II
Assuming there is no magic bullet for bringing the 91 seconds (when the 15 other data columns are included) down close to the 1.23 seconds (when just the coordinates column is present) in a single table, it seems like I could just use multiple tables, putting the coordinates in one table and the rest of the fields (to which I don't need such quick access) in another.
This sounds like it may be a use for foreign keys, but it seems like my case doesn't necessarily require the complexity of foreign keys, because I have a simple 1-to-1 correspondence between the coordinates table and the other data table -- each row of the coordinates table corresponds to the same row number of the other data table, so it's really just like I've "split" each record across two tables.
So the question is: I can of course manage this splitting by myself, by adding a row to both tables for each of my records, and deleting a row from both tables to delete a record. But is there a way to make SQLite manage this splitting for me (I googled "sqlite split record across tables" but didn't find much)?
Indexes are typically used for searching and sorting.
However, if all the columns actually used in a query are part of a single index, you have a covering index, and the query can be executed without accessing the actual table.
An index on the coordinates column is likely to speed up this query.
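As a sketch, reusing the table and column names from the question (the index name is made up): create the index once, and the same select can then be answered from the index b-tree alone, without touching the wide table rows.

int rc = sqlite3_exec( db, "CREATE INDEX IF NOT EXISTS idx_coordinates ON facilities(coordinates)", NULL, NULL, &errorMessage );
// same query as before, now satisfiable as a covering-index scan
rc = sqlite3_exec( db, "select coordinates from facilities", ReadSQLiteCallback, NULL, &errorMessage );

EXPLAIN QUERY PLAN on the select should then report something like "USING COVERING INDEX idx_coordinates".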
Even with a 1:1 relationship, you still need to know which rows are associated, so you still need a foreign key in one table. (This also happens to be the primary key, so in effect you just have the primary key column(s) duplicated in both tables.)
If you don't have an INTEGER PRIMARY KEY, you could use the internal ROWID instead of your primary key.
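A possible layout for that split, with made-up names (facility_details for the second table, and name/size standing in for the roughly 15 other columns): both tables use the same INTEGER PRIMARY KEY value, which in SQLite aliases the rowid, so the two halves of a record share one id.

CREATE TABLE facilities (
    id          INTEGER PRIMARY KEY,   -- aliases the rowid
    coordinates BLOB
);
CREATE TABLE facility_details (
    id   INTEGER PRIMARY KEY REFERENCES facilities(id),   -- same id as the matching facilities row
    name TEXT,
    size REAL
);

-- the hot query touches only the narrow table
SELECT coordinates FROM facilities;

-- the full record when needed, joined on the shared key
SELECT f.coordinates, d.name, d.size
FROM facilities AS f
JOIN facility_details AS d ON d.id = f.id;

SQLite won't manage the split for you: your code still inserts and deletes the pair of rows together (and the REFERENCES clause is only enforced if foreign keys are enabled with PRAGMA foreign_keys = ON), but the day-to-day reads of the coordinates column only ever touch the small table.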