Trying to aggregate data from multiple files into two distinct tables - Power BI

I just got started with Power BI and I generate two report files every month from ServiceNow:
the SLA report and the Incident report. These files end up named INC_MM_YY.xls or SLA_MM_YY.xls.
I am trying to add the previous month's files without having to add new data sources or edit the queries each time. It seems this is possible using the M language in the advanced query editor, but it looks quite complicated since I have zero experience with Power Query M.
Are there other ways?
Or, for the case above: I can retrieve the folder contents as a table and iterate over the files, but how do I do that in M?
Thank you.
EDIT: Just to make it clear, let's look at the table generated by the folder source.
We have the name of the file and its path for each row.
So in pseudocode it should be something like:
for (each row as n) {
    if (n.folderpath ends with "sla") {
        tablesla += load source n.folderpath && n.filename
    } else {
        tableincident += load source n.folderpath && n.filename
    }
}
It just doesn't seem practical in Power Query :/ I could find ways to make something similar to a for loop, but they were very confusing.

I figured it out.
You can actually create two different folder sources, one for the SLA folder and another for the Incident folder. After combining and transforming the data from the first folder, still in the Query Editor, you just click New Source and combine the other folder's data into a different table.
With that you have two distinct tables, and any time you put a new file in one of the folders and hit Refresh, the data is added to the correct table.
Thank you guys.

Try the load-from-folder option: you can place each month's files into its own folder, one for the SLAs and one for the Incidents. With load from folder, Power BI will go through each file in the folder and load it. So the next month you add in November's data, refresh the dataset(s), and it will be added automatically.
The files need to have the same structure for this to work effectively, and Power BI only loads what it sees in the folder: if you remove a file, its data will not be retained in the dataset; it only loads what it can see.
Other examples
https://powerbi.tips/2016/06/loading-data-from-folder/
https://insightsoftware.com/blog/power-bi-load-data-from-folder/
Hope that helps


Can I add a new column without rewriting an entire file?

I've been experimenting with Apache Arrow. I have used column-oriented memory-mapped files for many years. In the past, I've used a separate file for each column. Arrow seems to like to store everything in one file. Is there a way to add a new column without rewriting the entire file?
The short answer is probably no.
Arrow's in-memory format & libraries support this. You can add a chunked array to a table by just creating a new table (this should be zero-copy).
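For what it's worth, a minimal pyarrow sketch of the in-memory case (the column names and values are made up for illustration):

import pyarrow as pa

# Build a small table, then append a new column. append_column returns a new
# Table that references the existing column buffers, so the old columns are
# not copied.
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
new_col = pa.chunked_array([[1.5, 2.5, 3.5]])
wider = table.append_column("c", new_col)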
However, it appears you are talking about storing tables in files. None of the common file formats in use (Parquet, CSV, Feather) support partitioning a table in this way.
Keep in mind that if you are reading a Parquet file, you can specify which column(s) you want to read, and only the necessary data will be read. So if your goal is only to support individual column retrieval/query, then you can just build one large table with all your columns.
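As a quick illustration of that column selection (the file name and columns are hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table, then read back only one column; the reader fetches
# just that column's data from the file.
pq.write_table(pa.table({"a": [1, 2, 3], "c": [1.5, 2.5, 3.5]}), "data.parquet")
subset = pq.read_table("data.parquet", columns=["c"])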

Building app to upload CSV to Oracle 12c database via Apex

I've been asked to create an app in Oracle Apex that will allow me to drop in a CSV file. The file contains a list of all active physicians and associated info in my area. I do not know where to begin! Requirements:
- After dropping the CSV file into Apex, remove unnecessary columns
- Edit the data in each field, e.g. if a phone number is longer than 7 characters and begins with 1, remove the 1; or remove all special characters from a column
- The CSV contains physicians of every specialty; I only want to upload specific specialties to the database table
I have a small amount of SQL experience from uni, and I know some HTML and CSS, but beyond that I am lost. Please help!
I began a tutorial on Oracle Apex and created an upload wizard in a dev environment. The desired flow:
User drops a CSV file into Apex
Apex edits the columns to remove unnecessary characters
Only specific columns from the CSV file are uploaded
Data is only added when the "Specialties" column matches specific specialties
Redundant data is not added (if the physician is already in the table, do nothing)
A report is produced showing all new physicians added to the table
Huh, you're in deep trouble, as you have to do a job using a tool you don't know at all, with limited knowledge of the SQL language. Yes, it is said that Apex is simple to use, but nonetheless ... you have to know at least something. Otherwise, as you said, you're lost.
See if the following helps.
there's the CSV file
create a table in your database; its structure should match the CSV file. Include all the columns it contains, and pay attention to datatypes, column lengths and such
this table will be "temporary" - you'll use it every day to load data from the CSV files: first you delete everything it contains, then load the new rows
using the Apex "Create Page" wizard, create the "Data Loading" process. Follow the instructions (and/or read the documentation about it). Once you're done, you'll have 4 new pages in your Apex application
when you run it, you should be able to load the CSV file into that temporary table
That's the first stage - successfully load data into the database. Now, the second stage: fix what's wrong.
create another table in the database; it will be the "target" table and is supposed to contain only data you need (i.e. the subset of the temporary table). If such a table already exists, you don't have to create a new one.
create a stored procedure. It will read data from the temporary table and edit everything you've mentioned (remove special characters, remove leading "1", ...)
as you have to skip physicians that already exist in the target table, use NOT IN or NOT EXISTS
then insert "clean" data into the target table
That stored procedure will be executed after the Apex loading process is done; a simple way to do that is to create a button on the last page which will - when pressed - call the procedure.
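In Apex that second stage would normally live in a PL/SQL procedure, but just to illustrate the kind of cleanup SQL involved, here is a rough sketch using the python-oracledb driver. The connection details, the table and column names (physicians_stage, physicians, npi, phone, specialty) and the specialty list are all made up, so adjust them to your data:

import oracledb

# Placeholder connection details.
conn = oracledb.connect(user="apex_app", password="secret", dsn="localhost/XEPDB1")
cur = conn.cursor()

# Strip non-digits from the phone, drop a leading "1" on long numbers, keep
# only the wanted specialties, and skip physicians already in the target table.
cur.execute("""
    INSERT INTO physicians (npi, full_name, phone, specialty, loaded_at)
    SELECT s.npi,
           s.full_name,
           CASE
             WHEN LENGTH(REGEXP_REPLACE(s.phone, '[^0-9]', '')) > 7
                  AND REGEXP_REPLACE(s.phone, '[^0-9]', '') LIKE '1%'
             THEN SUBSTR(REGEXP_REPLACE(s.phone, '[^0-9]', ''), 2)
             ELSE REGEXP_REPLACE(s.phone, '[^0-9]', '')
           END,
           s.specialty,
           SYSDATE
      FROM physicians_stage s
     WHERE s.specialty IN ('Cardiology', 'Family Medicine')
       AND NOT EXISTS (SELECT 1 FROM physicians p WHERE p.npi = s.npi)
""")
conn.commit()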
The final stage is the report:
as you have to show new physicians, consider adding a column (into the target table) which will be a timestamp (perhaps DATE is enough, if you'll be doing it once a day) or process_id (all rows inserted in the same process will share the same value) so that you could distinguish newly added rows from the old ones
the report itself would be an Interactive Report. Why? Because it is easy to create and lets you (or end users) adjust it according to their needs (filter data, sort rows in a different manner, ...)
Good luck! You'll need it.

Google Dataprep: Save GCS file name as one of the columns

I have a Dataprep flow configured. The Dataset is a GCS folder (all files from it). Target is BigQuery table.
Since the data is coming from multiple files, I want to have the filename as one of the columns in the resulting data.
Is that possible?
UPDATE: There's now a source metadata reference called $filepath, which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use it in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface.)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. If you're manually merging datasets with a UNION, you could first process each one to add a column with its source so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help, but there is an open feature request that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476

DynamoDB Query with multiple tags

I am rather new to DynamoDB, and currently we are thinking about migrating an existing project to a serverless application using DynamoDB, where we want to adapt the following setup from an RDBMS:
Tables:
Projects (ProjectID)
Files (FileID, ProjectID, Filename)
Tags (FileID, Tag)
We want to make a query in DynamoDB to fetch all Files for a specific Project (by ProjectID) with one or more Tags (by Tag). In an RDBMS this query would be simple, something like:
SELECT * FROM Files JOIN Tags ON Tags.FileID = Files.FileID WHERE Files.ProjectID = ?PROJECT AND Tags.Tag IN (?TAG_1, ?TAG_2, ...)
At the moment, we have the following DynamoDB setup (but it can still be changed):
Projects (ProjectID [HashKey], ...)
Files (ProjectID [HashKey], FileID [RangeKey], ...)
Please also consider that the number of projects is large (between 1,000 and 30,000), as is the number of files per project (between 50 and 100,000), and the query should be really fast.
How can this be achieved with a DynamoDB query, preferably without filter expressions, since they are applied after data selection? It would be perfect if the Files table could have a StringSet column Tags, but I guess that cannot be used for an efficient DynamoDB query (i.e. without a scan), since DynamoDB index keys can only be of type String, Binary or Number, not StringSet. Is this maybe an applicable use case for a Global Secondary Index (GSI)?
A bit late, just saw this question referenced from another one.
I guess you went ahead and solved it with something like this?
DynamoDB tables
Projects (ProjectID [HashKey], ...)
Files (ProjectID [HashKey], FileID [RangeKey], ...)
Tags (Tag [HashKey], FileID [RangeKey], ProjectID [LSI Sort Key])
On the Tags table, you need the FileID to make the primary key unique, but you can add the ProjectID as the sort key of a Local Secondary Index, so you can search on Tag + ProjectID.
It's a form of data denormalization, but that's what it takes to go NoSQL :-( . E.g. if a File is moved to another Project, you'll need to update the ProjectID not only on the File, but also on all of its Tags.
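A hedged boto3 sketch of querying that layout (the table name Tags, the index name ProjectIDIndex and the attribute names are assumptions, and pagination via LastEvaluatedKey is omitted for brevity):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
tags = dynamodb.Table("Tags")

def file_ids_for_tag(project_id, tag):
    # Query the LSI: Tag is the hash key, ProjectID the index sort key.
    resp = tags.query(
        IndexName="ProjectIDIndex",
        KeyConditionExpression=Key("Tag").eq(tag) & Key("ProjectID").eq(project_id),
    )
    return {item["FileID"] for item in resp["Items"]}

# OR semantics across tags: union the per-tag sets (intersect them for AND).
matching = set().union(*(file_ids_for_tag("PROJ-1", t) for t in ["urgent", "invoice"]))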
The question is almost three years old, but it still comes up in Google results. So if anybody else lands here, maybe the following page from the DynamoDB docs can help. I just found it myself and haven't tried it yet, but it looks promising. It seems to be newer than the other replies here and shows a nice approach to solving the problem.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html

Is it possible to validate the column order when uploading data from flat files using aws copy command

I'm uploading data from zipped flat files to Redshift using the COPY command. I would like to know whether there is any way to validate that the column order of the files is correct (for example, if the fields are all varchar, the data could be loaded into the wrong columns).
The COPY command documentation shows that you can specify the column order, but not for flat files. I was wondering if there are other approaches that would allow me to check how the columns have been supplied (for example, loading only the header row into a dummy table to check, but that doesn't seem to be possible).
You can't really do this inside Redshift. COPY doesn't provide any options to only load a specific number of rows or perform any validation.
Your best option would be to do this in the tool where you schedule the loads. You can get the first line from a compressed file easily enough (zcat < file.z | head -1), but for a file on S3 you may have to download the whole thing first.
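If the files carry a header row, one way to sanity-check it from the scheduling tool without saving the whole object locally is to stream the gzipped file from S3 and decompress only the first line. The bucket, key and expected column list below are placeholders:

import gzip
import boto3

EXPECTED = ["id", "name", "created_at"]  # target table column order (assumed)

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-load-bucket", Key="loads/file.csv.gz")["Body"]

# gzip decompresses sequentially, so reading one line only pulls the start
# of the object over the network.
with gzip.GzipFile(fileobj=body) as gz:
    header = gz.readline().decode("utf-8").strip().split(",")

if header != EXPECTED:
    raise ValueError(f"unexpected column order: {header}")

If the check passes, the COPY itself would then run with IGNOREHEADER 1 so the header row isn't loaded as data.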
FWIW, the process generating the load file should be fully automated in such a way that the column order can't change. If these files are being manually prepared you're asking for all sorts of trouble.