Dataprep importing files with different numbers of columns into a dataset - google-cloud-platform

I am trying to create a parameterized dataset that imports files from GCS and stacks them on top of each other. This all works fine (Import Data > Parameterize).
To give a bit of context: each day I store a .csv file whose name refers to that date.
Last month my provider added a new column to the files, so files from before that date have 8 columns, whereas files from that date onward have 9.
However, when I parameterize, Dataprep only takes into account the columns that match (so only 8 columns). Ideally I would want empty observations for the rows coming from files that did not have this new column.
How can this be achieved?

Parameterized datasets only work with a fixed schema, as mentioned in the documentation:
Avoid creating datasets with parameters where individual files or tables have differing schemas.
This fixed schema is generated from one of the files found when the dataset with parameters is created.
If the schema has changed, you can "refresh" it by editing the dataset with parameters and clicking Save. If all the matching files contain 9 columns, you should now see 9 columns in the Transformer.

Related

Snowflake mapping column names from CSV files

I have a table in Snowflake that has 24 columns. I also have CSV files in an S3 bucket with a varying number of columns: sometimes 4 columns, sometimes 24, and so on. I also need to map the CSV file columns to the Snowflake table columns by name. Is there any way to do this?
You would need to pre-process your CSV file to bring it into a format consistent with your target table.
You can extract the column header and generate a COPY command that maps those columns onto your table.
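As a rough sketch of that second step, Snowflake's COPY INTO accepts a transformation SELECT over the staged file, which lets you map positional CSV fields onto named table columns. The table, stage, file, and column names below are made up for illustration; only the mapping pattern is the point.

-- A file that only carries 4 of the 24 columns: load those 4 by position
-- into the matching table columns; the remaining columns get their defaults (NULL).
COPY INTO my_target_table (customer_id, order_date, amount, currency)
FROM (
    SELECT t.$1, t.$2, t.$3, t.$4        -- positional references to the CSV fields
    FROM @my_s3_stage/orders_2024_01.csv t
)
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

The header you extracted from the file tells you which table columns go into the parenthesised list and in what order, so the command can be generated per file.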

Efficiently Set Up Reference Table in Power BI

I have two 300MB base files on the network and several reference tables, which only use one or two columns from the base files. The first step of a reference table is to bring in the source table (all columns), and the second step is to remove all the unneeded columns. However, this is extremely inefficient: every time I do a data refresh after altering my table queries, it takes 5-10 minutes for every reference table to load the entire dataset.
Is there a more efficient way of doing this which would lead to faster load times? I am assuming that instead of the reference tables I could have a new table which selects only the one or two columns needed.
Thanks
What you can do is create your base table by importing the source (all columns) and naming it "BaseTable".
Now create a new blank query and type = BaseTable in the formula bar. After pressing Enter, you can remove the columns you do not need. Repeat this for all the other tables. This way the source is imported only once, and each table takes just the columns it needs without re-importing the source data.

Changing the query data source to new updated sheet questions

We are using a monthly Excel report in our Power BI project that has added measure columns, and we keep the sheets the fields pull data from in one folder. When we get each month's updated Excel sheet, would we be able to delete the old one, add the new report to the folder with the exact same name as the old one, and refresh the Power BI query to use the new data? All the column headers would remain the same; the only things that would change are the number of rows and the data within them. If we keep all the names the same and only the data sheet itself changes (not the column headers, just the data), would the added measure columns remain and work? The measure columns act as column data multipliers and filters, and it would be a pain to make new ones each month.
Thanks
Yes. If the file path, filename, and sheet/table name all remain the same, Power BI won't know the difference, and you shouldn't have trouble as long as the columns and headers stay consistent.
Additionally, if you don't want to rename the file or delete/move older files from the folder, you could do a Load from Folder query and sort by date created/modified and grab the top row instead of specifying the filename.

How to create a Fact table from multiple different tables in Pentaho

I have been following a tutorial on creating a data warehouse using Pentaho Data Integration/Kettle.
The tutorial is based on a CSV file, but I am practicing with the Northwind database and PostgreSQL. I am trying to figure out how to select values from more than one table and then output them into a single table.
My ETL process goes like this: I have several stages, one for each table; values are selected from each table in the database and stored in its stage table. From there I have my dimension tables set up, but I am trying to figure out the step between the stages and the dimensions, which is where I need to select the values to update the dimension tables.
I have several stages set up for each of my tables. At this point I am not sure whether I should create a separate values table for each table or a single values table. Any help would be greatly appreciated. Thanks
When I try to select values from multiple tables I get an error that says "we detected rows with varying number of fields". It seems I would need to create separate tables with...
In Kettle, the metadata structure of the data stream cannot change. As such, if row 1 has 3 columns, one integer and two strings, for example, all rows must have the same structure.
If you're combining rows coming from different sources, you must ensure the structure is the same. That error is telling you that some of the incoming streams of data have a different number of fields.
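A sketch of how that alignment can look when the sources are PostgreSQL Table Input steps (the stage table and column names here are placeholders, not from the tutorial): give every SELECT the same column list, in the same order and with the same types, padding missing fields with typed NULLs.

-- Stream 1: customer stage
SELECT customer_id::text AS source_key,
       company_name      AS name,
       NULL::text        AS category      -- not present in this source; pad it
FROM stage_customers;

-- Stream 2: product stage, emitting the same three text fields
SELECT product_id::text  AS source_key,
       product_name      AS name,
       category_id::text AS category
FROM stage_products;

Because both streams now carry three text fields with identical names and order, a downstream step that combines them no longer reports rows with varying numbers of fields.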

How to properly import TSV to Athena

I am following this example:
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files - TSV example
Summary of the code:
CREATE EXTERNAL TABLE flight_delays_tsv (
yr INT,
quarter INT,
month INT,
...
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/tsv/';
My questions are:
My tsv does not have column names
Is it OK if I just list the columns as c1, c2, … and make all of them STRING?
I do not understand this:
PARTITIONED BY (year STRING)
In the example, the column ‘year’ is not listed among the other columns…
Column names
The column names are defined by the CREATE EXTERNAL TABLE command. I recommend you name them something useful so that it is easier to write queries. The column names do not need to match any names in the actual file. (Athena does not interpret header rows.)
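So generic names are fine. A minimal sketch, truncated to three columns and typed entirely as STRING (the table name is just an example):

CREATE EXTERNAL TABLE flight_delays_raw (
  c1 STRING,
  c2 STRING,
  c3 STRING
  -- ...one entry per column in the TSV, in file order
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/tsv/';

Declaring everything as STRING works; the trade-off is that numeric comparisons and arithmetic then need an explicit CAST in your queries.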
Partitioning
From Partitioning Data - Amazon Athena:
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
The fields used to partition the data are NOT stored in the files themselves, which is why they do not appear in the column list of the table definition. Rather, the column value is stored in the name of the directory.
This might seem strange (storing values in a directory name!) but actually makes sense because it avoids situations where an incorrect value is stored in a folder. For example, if there is a year=2018 folder, what happens if a file contains a column where the year is 2017? This is avoided by storing the year in the directory name, such that any files within that directory are assigned the value denoted in the directory name.
Queries can still use WHERE year = '2018' even though year isn't listed as an actual column.
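For example, assuming the data for 2018 sits under a year=2018/ prefix (the path below mirrors the example above but is otherwise illustrative), the partition is registered against its directory and then filtered like any other column:

-- Point the year=2018 partition at its directory
ALTER TABLE flight_delays_tsv ADD IF NOT EXISTS
  PARTITION (year = '2018')
  LOCATION 's3://athena-examples-myregion/flight/tsv/year=2018/';

-- The partition key behaves like a column in queries
SELECT count(*)
FROM flight_delays_tsv
WHERE year = '2018';

If the directories already follow the year=2018/ naming convention, MSCK REPAIR TABLE flight_delays_tsv can discover all such partitions in one go.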
See also: LanguageManual DDL - Apache Hive - Apache Software Foundation
The other neat thing is that data can be updated by simply moving a file to a different directory. In this example, it would change the year value as a result of being in a different directory.
Yes, it's strange, but the trick is to stop thinking of it like a normal database and appreciate the freedom that it offers. For example, appending new data is as simple as dropping a file into a directory. No loading required!