PDI - Multiple file input based on date in filename - kettle

I'm working on a project using Kettle (PDI).
I have to read multiple .csv or .xls files and insert them into a DB.
The file names follow the pattern AAMMDDBBBB, where AA is the city code and BBBB is the shop code. MMDD is the date (month and day). For example, LA0326F5CA.csv.
The regexp I use in the input file steps looks like LA.*\.csv or DT.*\.xls, which returns all the files for insertion into the DB.
Can you show me how to select only yesterday's files (based on the MMDD in the file name)?

As you need some "complex" logic in your selection, you cannot filter based only on regexp. I suggest you first read all filenames, then filter the filenames based on their "age", then read the file based on the selected filenames.
In detail:
Use the Get File Names step with the same regexp you currently use (LA.*\.csv or DT.*\.xls). You may be more restrictive at that stage with a regexp like LA\d\d\d\d....\.csv, to ensure MM and DD are digits and BBBB is exactly 4 characters.
Filter based on the date. You can do this with a Java Filter, but it would be an order of magnitude easier to use a JavaScript step to compute the "age" of your file and then a Filter rows step to keep only yesterday's files.
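To compute the age of the file, extract MM and DD. You can use the following (other methods are available):
var regexp = filename.match(/..(\d\d)(\d\d).*/);
if (regexp) {
  // JavaScript months are 0-based, so subtract 1 from MM; the year is hard-coded here
  var age = new Date() - new Date(2018, regexp[1] - 1, regexp[2]);
  // convert milliseconds to days
  age = age / 1000 / 60 / 60 / 24;
}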
If you are not familiar with JavaScript regexps: match tests the filename against the regexp and keeps the values captured by the parentheses in an array. If the test succeeds (which you must check explicitly to avoid a run-time failure), use the matched values to build the corresponding date, and subtract it from today's date to get the age. This age is in milliseconds, which is then converted to days. A file from yesterday then has an age between 1 and 2 days.
Use the Text File Input and Excel Input steps with the option Accept filenames from previous step. Note that CSV Input does not have this option, but the more powerful Text File Input does.
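The selection logic described in the steps above can be sketched outside of PDI as follows. This is only an illustration in Python, not Kettle code: the helper name is_from_yesterday is hypothetical, and the pattern assumes that the city code AA is two uppercase letters. Comparing the month and day directly (instead of computing an age in days) avoids hard-coding the year.

```python
import re
from datetime import date, timedelta

# Assumed layout: AA (two uppercase letters), MMDD, BBBB (any 4 characters), extension.
FILENAME_RE = re.compile(r"^[A-Z]{2}(\d{2})(\d{2}).{4}\.(csv|xls)$")


def is_from_yesterday(filename, today=None):
    """Return True if the MMDD part of the filename is yesterday's month and day."""
    match = FILENAME_RE.match(filename)
    if match is None:
        # The name does not follow the expected pattern, so it is skipped.
        return False
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    file_month_day = (int(match.group(1)), int(match.group(2)))
    return file_month_day == (yesterday.month, yesterday.day)


names = ["LA0326F5CA.csv", "LA0325F5CA.csv", "DT0326AB12.xls", "notes.txt"]
# "today" is fixed here only to make the example reproducible.
selected = [n for n in names if is_from_yesterday(n, today=date(2018, 3, 27))]
print(selected)  # ['LA0326F5CA.csv', 'DT0326AB12.xls']
```

Because only month and day are compared, a file dated 1231 is still selected when the transformation runs on January 1st.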

Well, I replaced the Java Filter with a Modified Java Script Value step and it works fine now.
Another question: how can I improve the performance of my current transformation (I now have 2 transformations for 2 cities)? The Insert/Update step makes my transformation too slow: it needs almost 1 hour 30 minutes to process 500k rows with a lot of fields (300 MB). This is not all of my data; if it runs faster and my company decides to use it, I will be processing 10 TB of data per year, which means a lot of transformations and rows. I would appreciate any suggestions.

Related

Expressions in Data Integrator tool on Informatica Cloud

I use the Data Integrator tool on IICS. I have a csv file as a source and need to change the data type of every single column, as they all become nvarchar when read from the file. I have made an Expression transformation and use the To_Decimal function in each expression, but I find it very time-consuming and boring to create about 100 expressions. This was easier and quicker to do in PowerCenter. Is there a smarter and quicker way to do this in IICS?
Br,
Ø
This is where reusability plays a vital role.
Create a reusable Expression transformation that takes an input and converts it to decimal. Create 10 generic inputs and 10 generic outputs. One pair is shown below; just copy and paste it 10 times and make sure the columns are set properly in each formula.
in_col1 (string(150))
...
out_col1 (decimal(22,7)) = To_Decimal(ltrim(rtrim(in_col1)), 7)
Then copy it 10 times for your mapping. Please note that I used trim to remove spaces.
You can do the same for date columns, and to trim spaces from strings too.

Best way to use spreadsheet RegEx to extract text and numbers and replace with the formatting?

I'm currently working on a non profit project where I need to reformat the way the data in the rows displays.
At the moment, this is how the row data looks:
Save The Children (Donation)|10.00{0}{2}
And I need it to output like this instead:
donation_id:save_children|quantity:1|total:10.00
The first problem is sometimes there's multiple items within the row:
Save The Children (Donation)|10.00{0}{2} / Save The Forrest|15.50{0}{2}
In which case it would need to be separated by a semicolon:
donation_id:save_children|quantity:1|total:10.00;donation_id:save_forrest|quantity:1|total:15.50
The second problem is, we have 9 donation variables/causes, each needing to convert the output to a different "donation_id".
So every time it finds:
Save the Children, it needs to convert to: donation_id:save_children
Save the Forrest, to, donation_id:save_forrest
Save the Animals, to, donation_id:save_animals
And so forth.
And the third problem is that the donation amounts are variable (as people donate whatever they wish), so the "total:" dollar value that we output will often be different.
How would I go about doing this with the regex?
Thank you
You can use the regex below:
(Save) The (Children|Forrest|Animals).*?\|([0-9]+\.[0-9]+)\{0\}\{2\}([\s\/]+)?
and substitute/replace with:
donation_id:$1_$2|quantity:1|total:$3;
When I test it with
Save The Children (Donation)|10.00{0}{2} / Save The Forrest|15.50{0}{2}
the output is
donation_id:Save_Children|quantity:1|total:10.00;donation_id:Save_Forrest|quantity:1|total:15.50;
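Note that a plain substitution keeps the original capitalisation (Save_Children) and leaves a trailing semicolon, whereas the requested output is lowercase with no trailing separator. If the data can be post-processed outside the spreadsheet, a minimal sketch in Python (swapped in here purely for illustration; the function name convert_row is hypothetical) could reuse the same pattern, lowercase the captured words and join the items with semicolons:

```python
import re

# Same pattern as the answer above, minus the optional trailing separator group.
# IGNORECASE is an assumption so that "Save the Children" also matches.
ITEM_RE = re.compile(
    r"(Save) The (Children|Forrest|Animals).*?\|([0-9]+\.[0-9]+)\{0\}\{2\}",
    re.IGNORECASE,
)


def convert_row(row):
    """Convert one raw row into the semicolon-separated donation format."""
    items = []
    for verb, cause, total in ITEM_RE.findall(row):
        items.append(
            f"donation_id:{verb.lower()}_{cause.lower()}|quantity:1|total:{total}"
        )
    # Joining avoids the trailing semicolon left by a plain substitution.
    return ";".join(items)


row = "Save The Children (Donation)|10.00{0}{2} / Save The Forrest|15.50{0}{2}"
print(convert_row(row))
# donation_id:save_children|quantity:1|total:10.00;donation_id:save_forrest|quantity:1|total:15.50
```

Only the three causes named in the question are listed in the alternation; the remaining causes would need to be added to it.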

How do I replace poorly formatted ZIP codes with proper ones?

I have a data set that looks like this:
adjuster  adjuster_zip
A-20      98216
A-14      98214
A-17      98216
A-20      California
I need to format this data set so that adjuster_zip is all numeric. I have several hundred adjusters and they all show up several hundred times. However, each adjuster has only one zip code. As you can see with A-20, this adjuster has both a valid and an invalid zip code. All of the adjusters that have invalid zip codes also have valid zip codes. How can I automate this so that SAS replaces invalid zip codes with valid ones by adjuster?
Thanks for any and all help.
Also, I couldn't figure out how to format the data so that it shows up in a table. Sorry.
My suggestion would be to build a format table per adjuster. Start with your input dataset, then filter it down to only valid zip codes (you could use NOTDIGIT to check for any nondigit values, and LENGTH to check that the value is exactly five characters long). Then create a dataset with FMTNAME as a constant string holding any legal format name you wish, preceded by $ ($ADJZIPF would be a good choice), START equal to the variable that contains the adjuster name, and LABEL equal to the zip. Then use PROC FORMAT with cntlin= pointing to the dataset you just defined.
That would allow you to look up the zip for each adjuster using PUT and your custom format. You still have to worry about a few things: that table must contain no duplicates per adjuster, so you need to decide how to handle adjusters with two or more zips; and when you use PUT you need to check that it actually finds a zip code.
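The lookup idea is independent of SAS. As a rough illustration only (Python rather than SAS, with hypothetical function names, and assuming that the first valid zip seen for an adjuster is the one to keep), the logic amounts to:

```python
def is_valid_zip(value):
    """A zip is treated as valid if it is exactly five ASCII digits."""
    return len(value) == 5 and value.isascii() and value.isdigit()


def build_zip_lookup(rows):
    """Map each adjuster to the first valid zip code seen for it."""
    lookup = {}
    for adjuster, zip_code in rows:
        if is_valid_zip(zip_code):
            # setdefault keeps the first valid zip, so the table has one entry per adjuster.
            lookup.setdefault(adjuster, zip_code)
    return lookup


def fix_zips(rows):
    """Replace invalid zips with the looked-up one; leave the value as is if none is found."""
    lookup = build_zip_lookup(rows)
    return [
        (adjuster, zip_code if is_valid_zip(zip_code) else lookup.get(adjuster, zip_code))
        for adjuster, zip_code in rows
    ]


rows = [("A-20", "98216"), ("A-14", "98214"), ("A-17", "98216"), ("A-20", "California")]
print(fix_zips(rows))
# [('A-20', '98216'), ('A-14', '98214'), ('A-17', '98216'), ('A-20', '98216')]
```

The fallback in lookup.get mirrors the caveat above: a row whose adjuster has no valid zip at all is left unchanged, so it can still be detected afterwards.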

Excel international date formatting

I am having problems formatting Excel datetimes, so that it works internationally. Our program is written in C++ and uses COM to export data from our database to Excel, and this includes datetime fields.
If we don't supply a formatting mask, some installations of Excel display these dates as serial numbers (days since 1900.01.01, followed by the time as a fraction of 24 hours). This is unreadable to a human, so we have found that we MUST supply a date formatting mask to be sure that the dates are displayed readably.
The problem, as I see it, is that Excel uses locale-specific formatting masks. For example, the UK datetime format mask might be "YYYY-MM-DD HH:MM".
But if that format mask is sent to an Excel installed in Sweden, it fails, since the Swedish version of Excel uses "ÅÅÅÅ-MM-DD tt:mm".
It's highly impractical to have 150 different national datetime formatting masks in our application to support different countries.
Is there a way to write formatting masks so that they include locale, such that we would be allowed to use ONE single mask?
Unless you are using the date functionality in Excel, the easiest way to handle this is to decide on a format and then create a string yourself in that format and set the cell accordingly.
This comic: http://xkcd.com/1179/ might help you choose a standard to go with. Otherwise, clients that open your file in different countries will have differently formatted data. Just pick a standard and force your data to that standard.
Edited to add: There are libraries that can make this really easy for you as well... http://www.libxl.com/read-write-excel-date-time.html
Edited to add part 2: Basically, what I'm trying to get at is to avoid asking for the mask and just format the data yourself (if that makes sense).
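A minimal sketch of the "format it yourself" idea, in Python for brevity (the original program is C++, and the function name to_cell_text is hypothetical): the datetime is rendered into one fixed, locale-independent layout and written to the cell as text, so no Excel format mask is involved.

```python
from datetime import datetime


def to_cell_text(value):
    """Render a datetime in a fixed ISO 8601-style layout, independent of locale."""
    return value.strftime("%Y-%m-%d %H:%M")


print(to_cell_text(datetime(2013, 2, 27, 9, 5)))  # 2013-02-27 09:05
```

The trade-off is the one stated above: the cell then holds a string rather than an Excel date, so Excel's date functions will not work on it directly.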
I recommend doing the following: create an Excel file with date formatting on a specific cell and save it for your program to use.
When the program runs, it opens this Excel file and retrieves the local date formatting from the specified cell.
When you have multiple formats to store, just use different cells for them.
It is not a nice way, but it will work afaik.
Alternatively, you could consider creating an xla(m) file that uses VBA to feed back the local formatting characters through a function like:
Public Function localChar(charIn As Range) As String
    localChar = charIn.NumberFormatLocal
End Function
Also not a very clean method, but it might do the trick for you.

Parsing a date in ColdFusion

I have a date stored in the format dd-mm-yyyy. I want to store the day, month and year as individual variables, while getting rid of any leading zeros (e.g. "09-09-2010" is stored as 9, 9, 2010).
I attempted to use the code on this page to split the date by dashes, but it is throwing expression errors.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Coding Horror: Regular Expressions: Now You Have Two Problems
Investigate the ColdFusion functions month(date), day(date) and year(date).
Update: you can pass a string to these functions so long as CF can turn it into a date.
When you say that you have a date "stored in the format dd-mm-yyyy", are you sure you aren't confusing this with the way that your database UI is presenting it to you? Or are you actually storing the date in this format (for example, by writing it this way to a text file or a varchar column rather than a DateTime column)?
The reason I ask is that if a date is stored in a database as a date, then CF will represent it as a date irrespective of how it appears in, say, SQL Management Studio. If this is the case, then you can simply split the parts out using DatePart("datepart", "date").
If you have a date in a text format (such as from a form submission, or because it has been stored as plain text), then you should be able to parse it into a DateTime object using LSParseDateTime() and then use the DatePart(...) method on it to split out the parts.
See http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=functions_c-d_30.html
(sorry, can't post the URL to the other function due to lack of SO points!)
for the documentation on this.
As an aside, if you are using SQL2005 (or later) then you can create computed columns on the date field in order to split out the day, year and month at the database level. I thought I'd mention this just in case it proves useful.
Steve
If you're working with a string in that format, there's no need for regular expressions.
myDate = "13-12-2010";
theDay = listGetAt(myDate,1,"-");
theMonth = listGetAt(myDate,2,"-");
theYear = listGetAt(myDate,3,"-");
Wrapping each part in the val() function, e.g. val(theDay), will also drop leading zeroes, if any.
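For comparison, the same split-and-convert logic can be sketched in Python (an illustration only, not ColdFusion; split_date is a hypothetical name). Converting each part to an integer plays the role of val() and drops the leading zeros:

```python
def split_date(text):
    """Split a dd-mm-yyyy string into integer day, month and year."""
    day, month, year = (int(part) for part in text.split("-"))
    return day, month, year


print(split_date("09-09-2010"))  # (9, 9, 2010)
```

Like the listGetAt version, this does no validation, so a malformed string raises an error rather than returning a date.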