Copying a single CSV with multiple schemas in ADF - azure-data-factory-pipeline

Hi, I am trying to sink a CSV file to a database using ADF and a Data Flow.
The CSV file structure is as below:
r1c1,r1c2,r1c3,r1c4
r2c1,r2c2,r2c3,r2c4
r3c1,r3c2,r3c3,r3c4
r4c1,r4c2,r4c3,r4c4,r4c5,r4c6
r5c1,r5c2,r5c3,r5c4,r5c5,r5c6
So there are 2 schemas: one with 4 columns and a second with 6 columns.
My goal is to copy the first 3 rows into one table and the last 2 into a second table.
Please guide me on how to achieve this in ADF.

Your use case is a typical fit for the Conditional Split transformation.
Here is a simple demo that I created to demonstrate your use case:
Add the 2 schemas as datasets in ADF.
Create a pipeline with a Data Flow activity.
In the Data Flow, add your source (CSV data). I put null values in columns 5 and 6 for the first 3 rows, to match your sample where those rows have only 4 columns.
In the Conditional Split, I added this condition: "isNull(Column_5) && isNull(Column_6)" - see the screenshot below.
Rows that match the condition will be saved to the matching sink.
Data Flow Activities:
Conditional Split Activity:
Rows 1 to 3 output:
Rows 4 to 5 output:
Read more about it here : https://learn.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split
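If it helps to make the split concrete outside of ADF, here is a minimal Python sketch of the same idea: rows whose columns 5 and 6 are missing or empty go to one output, everything else goes to the other. This is only an illustration of what the Conditional Split condition does, not ADF code; the file names are assumptions.
import csv

# Minimal sketch (not ADF code) of what the Conditional Split does:
# rows where columns 5 and 6 are missing/empty go to the first sink,
# all other rows go to the second sink. File names are assumptions.
with open("input.csv", newline="") as f:
    rows = list(csv.reader(f))

four_column_rows, six_column_rows = [], []
for row in rows:
    # mirrors isNull(Column_5) && isNull(Column_6)
    if len(row) < 6 or (row[4] == "" and row[5] == ""):
        four_column_rows.append(row)
    else:
        six_column_rows.append(row)

with open("four_column_table.csv", "w", newline="") as f:
    csv.writer(f).writerows(four_column_rows)   # -> first table/sink
with open("six_column_table.csv", "w", newline="") as f:
    csv.writer(f).writerows(six_column_rows)    # -> second table/sink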

Related

How do I collapse an expanded column into single rows?

I have a SharePoint list containing a column with data type People or Group which can contain multiple people. When this list is imported into Power BI it appears as [Table] in the Power BI Query Editor.
When I expand this column (using the highlighted button above), it creates multiple rows (which I don't want).
My goal is to preserve the row count of my table by converting all duplicate rows created by the expansion back to single rows with a delimiter between values. Has anyone found a way to consolidate this?
Data example
Original Data
ID | ColumnHeader | OtherColumns
1 | [Table] | OtherData
After expansion
ID | ColumnHeader | OtherColumns
1 | FakeEmail@email.com | OtherData
1 | FakeEmail2@email.com | OtherData
Target output
ID | ColumnHeader | OtherColumns
1 | FakeEmail@email.com# FakeEmail2@email.com | OtherData
*The delimiter can be anything (not necessarily a #)
Assume you have a table like this.
Table (in green) contains data structured like this.
You can achieve the concatenation you're after as follows:
Add a custom column with the following code.
Text.Combine([ColumnHeader][Column Header A],"# ")
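For intuition, the underlying operation is simply "group the expanded rows by ID and join the values with a delimiter". Here is a small Python sketch of that logic (illustrative only, not Power Query code; the column names and the "# " delimiter follow the example above):
# Sketch of the collapse (plain Python, not Power Query): group the expanded
# rows by ID and join the ColumnHeader values with a "# " delimiter.
expanded = [
    {"ID": "1", "ColumnHeader": "FakeEmail@email.com", "OtherColumns": "OtherData"},
    {"ID": "1", "ColumnHeader": "FakeEmail2@email.com", "OtherColumns": "OtherData"},
]

collapsed = {}
for row in expanded:
    key = row["ID"]
    if key not in collapsed:
        collapsed[key] = dict(row)
    else:
        collapsed[key]["ColumnHeader"] += "# " + row["ColumnHeader"]

print(list(collapsed.values()))
# [{'ID': '1', 'ColumnHeader': 'FakeEmail@email.com# FakeEmail2@email.com', 'OtherColumns': 'OtherData'}]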

Can we create Dynamic Date table in mapping Data Flow?

I have a query in Power BI that takes two parameters: Start Date and End Date.
Whenever I pass these dates, it returns a date table containing a few columns created according to this range of dates, such as Date, QuarterOfYear, Year, MonthName, etc.
Can we create a mapping data flow in ADF that takes two parameters as input and returns a calculated table according to the provided dates?
Is there any function that returns a range of dates?
For your request: "I want that I pass two date Start Date and End Date in ADF Mapping Data Flow , and Data flow will Create a column such as "Date" that contain that number of Date rows. Is there any function for this? Exam. Start Date=20-01-2019, End Date=20-01-2020 Then Date Column Values should be: 20-01-2019 21-01-2019 ......... ......... 20-02-2020", according to the Data Factory documentation and my experience, the answer is no, we can't achieve it in Data Flow.
There is a solution to this, but it is a bit tricky.
TL;DR
The general data flow looks like this:
We need a dummy source with exactly one row; its content does not matter.
Then we derive a column where we use the mapLoop() expression to create an array of all the dates we want to get rows for.
Finally, we need to flatten the array column which will result in one row per array entry and thus one row per date.
Walkthrough
Source dummy
Each dataflow needs a source and we need exactly one row to make our dataflow work. To achieve this I've created a dataset called empty of type CSV in my data lake which has this content:
empty
""
This is our source definition:
And its result looks like this:
Derived column days
This is where the magic happens!
We create a new column dates which is an array of all the dates we want to have in our date table:
In this scenario we want a date table starting on 2019-01-01 and reaching one year into the future. The full expression looks like this:
mapLoop(
addDays(currentDate(), 365) - toDate('2019-01-01'),
addDays(toDate('2019-01-01'), #index)
)
This is what happens here:
the mapLoop() function builds an array of elements. You specify the number of elements you want and a lambda expression to calculate each element. For example, the closely related mapIndex([1, 2, 3, 4], #item + 2 + #index) results in [4, 6, 8, 10]
addDays(currentDate(), 365) - toDate('2019-01-01') is the number of days between our start (2019-01-01) and end date (1 year in the future from now) and thus the number of dates we want to have in our resulting array.
addDays(toDate('2019-01-01'), #index) calculates each array item by adding #index days to our start date. This is executed once per element of the array sized above, where #index is the (1-based) array position. Thus, the first element of the array will be 2019-01-01 + 1, the second 2019-01-01 + 2, and so on.
Our stream now has these columns:
Flatten
Finally, you need a flatten transformation which will expand each item in your array to its dedicated row. We can also dismiss the useless empty column in this step:
And this finally results in what we wanted to achieve:
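For intuition, the mapLoop() plus flatten combination simply generates one row per day between the start date and the end date. A rough Python equivalent of that logic (purely illustrative, not data flow code):
import datetime

# Rough Python equivalent of the derived column + flatten (not data flow code):
# build one entry per day from the start date up to one year from today.
start = datetime.date(2019, 1, 1)
end = datetime.date.today() + datetime.timedelta(days=365)

num_days = (end - start).days                                   # the length passed to mapLoop()
dates = [start + datetime.timedelta(days=i) for i in range(1, num_days + 1)]  # #index is 1-based

for d in dates:                                                  # the flatten step: one row per entry
    print(d.isoformat())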
References
Data transformation expressions in mapping data flow

Power BI: Creating a new query out of an existing one using a range of columns

Trying to create a new query from the existing "Master" query using the formula below:
let
Source = Table.SelectColumns(#"Original Source Name", {"Column Name", "Column Name2"})
in
Source
This works fine; however, I am looking to see if there is another formula that would do the same but create the new query from a range of columns, for example columns 30-67 (in this case, when the original Excel file is updated by inserting a column within that range, it would automatically be picked up in Power BI too when refreshed).
Here's one possible way. If you start with this table, named Table1:
You can reference it in a new query like this:
let
Source = Table.SelectColumns(Table1, List.Range(Table.ColumnNames(Table1), 2, 3))
in
Source
...to get this:
The formula selects a range of columns from the table, starting at the column at index position 2 and spanning 3 columns. (The index starts at 0.) For columns 30-67 (counting columns from 1), you would change the 2 to 29 and the 3 to 38. You would also change Table1 to your Original Source Name.
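If it helps, the offset/count arithmetic is the same as a list slice. A plain Python sketch of the selection (illustrative only; the generated column names are placeholders, not your real columns):
# Plain Python sketch of the same offset/count arithmetic (not M code):
# List.Range(names, offset, count) corresponds to names[offset:offset + count].
column_names = [f"Column{i}" for i in range(1, 101)]   # hypothetical column names

print(column_names[2:2 + 3])     # ['Column3', 'Column4', 'Column5'] - the demo above
print(column_names[29:29 + 38])  # Column30 through Column67 - i.e. columns 30-67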
See these links for more info on List.Range and Table.ColumnNames.

How to delete a row from a CSV file on Data Lake Store without using U-SQL?

I am writing a unit test for appending data to a CSV file on a data lake. I want to test it by finding my test data appended to the same file, and once I have found it I want to delete the row I inserted. Basically, once I find the test data my test will pass, but since the tests run in production I have to search for my test data, i.e. find the row I inserted in the file and delete it after the test has run.
I want to do this without using U-SQL, in order to avoid the cost factor involved in using U-SQL. What are the other possible ways we can do it?
You cannot delete a row (or any part) from a file. Azure Data Lake Store is an append-only file system: data, once committed, cannot be erased or updated. If you're testing in production, your application needs to be aware of the test rows and ignore them appropriately.
The other choice is to read all the rows in U-SQL and then write an output excluding the test rows.
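If you want to avoid U-SQL entirely, the same "rewrite the file without the test rows" idea works from any client that can read and write the file. A minimal local Python sketch of just the filtering step (illustrative only; the "IsTestRow" marker column and the file names are assumptions for the example):
import csv

# Local sketch of "rewrite the file without the test rows" (not ADLS-specific):
# drop any row flagged with a known test marker and write a cleaned copy.
# "IsTestRow" and the file names are assumptions for the example.
with open("data.csv", newline="") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    kept_rows = [row for row in reader if row.get("IsTestRow") != "true"]

with open("data_cleaned.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(kept_rows)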
Like other big data analytics platforms, ADLA / U-SQL does not support appending to files per se. What you can do is take an input file, append some content to it (e.g. via U-SQL), and write it out as another file. A simple example:
DECLARE @inputFilepath string = "input/input79.txt";
DECLARE @outputFilepath string = "output/output.txt";

@input =
    EXTRACT col1 int,
            col2 DateTime,
            col3 string
    FROM @inputFilepath
    USING Extractors.Csv(skipFirstNRows : 1);

@output =
    SELECT *
    FROM @input
    UNION ALL
    SELECT *
    FROM (
        VALUES
        (
            2,
            DateTime.Now,
            "some string"
        )
    ) AS x (col1, col2, col3);

OUTPUT @output
TO @outputFilepath
USING Outputters.Csv(quoting : false, outputHeader : true);
If you want further control, you can do some things via the PowerShell SDK, e.g. test whether an item exists:
Test-AdlStoreItem -Account $adls -Path "/data.csv"
or move an item with Move-AzureRmDataLakeStoreItem. More details here:
Manage Azure Data Lake Analytics using Azure PowerShell

How to parse through a column in Pig to create additional columns

New Apache Pig user here. I basically have data in one format and need to split it into 6 columns to create my desired schema, then load it into Pig for my existing script to run.
Sorry if the format below is untidy; I can't upload a picture due to my reputation score.
Existing format has 3 columns:
User-Equipment | values::key:bytearray | values:value:bytearray
user1-mobile | 20130306-AC | 9
user1-mobile | 20130306-AT | 21
user2-laptop | 20130306-BC | 0
Required format:
User | Equipment | Date | Type | "Count or Time" | Value
user1 | mobile | 20130306 | A | C | 9
user1 | mobile | 20130306 | A | T | 21
Any suggestions on how to get this done? Is there a regex I need to write?
The tricky thing here is that all the columns have a delimiter (-) between them except "Type" and the "C or T" column.
If you don't have a common delimiter I can think of two possibilities:
You could implement your own LoadFunc as explained here: http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
You could use REGEX_EXTRACT_ALL as explained here: Apache Pig: Extra query parameters from web log
Here you go for option 2:
A = LOAD 'abc.txt' AS (line:CHARARRAY);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^(.+?)\\-(.+?)\\s(.+?)\\-(.)(.)\\s(.+)$'))
    AS (User:CHARARRAY, Equipment:CHARARRAY, Date:CHARARRAY, Type:CHARARRAY, CountorTime:CHARARRAY, Value:CHARARRAY);
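To see how that regular expression carves up a sample line, here is the same pattern applied in plain Python (purely illustrative; the doubled backslashes in the Pig string literal become single backslashes here):
import re

# The same pattern as the REGEX_EXTRACT_ALL call above, applied to one sample line.
pattern = r'^(.+?)-(.+?)\s(.+?)-(.)(.)\s(.+)$'
line = "user1-mobile 20130306-AC 9"

print(re.match(pattern, line).groups())
# ('user1', 'mobile', '20130306', 'A', 'C', '9')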