SSIS Derived column padding of leading zeroes - if-statement

I want to be able to add leading zeroes in the following scenarios:
I need leading zeroes before [ACCOUNT ID] if the length of [ACCOUNT ID] is less than 17 {For APPLICATION = RCC, SEC, HOGAN CIS}
I need leading zeroes before [ACCOUNT ID] if the combined length of [ACCOUNT ID] + [ACCOUNT NUMBER] is less than 17 {For APPLICATION = CLN}
I need leading zeroes before [CLIENT KEY] if the combined length of [CLIENT KEY] + [CLIENT ID] is less than 17 {For APPLICATION = ITF}
I have the following expression defined in my derived column:
LTRIM(RTRIM(APPLICATION == "RCC" || APPLICATION == "SEC" ? APPLICATION + "|" + [ACCOUNT ID] : APPLICATION == "HOGAN CIS" ? (HOGAN_Hogan_Alpha_Product_Code == "DDA" ? "DDA" : "TDA") + "|" + [ACCOUNT ID] : APPLICATION == "ITF" ? APPLICATION + "|" + [CLIENT KEY] + [CLIENT ID] : APPLICATION == "CLN" ? APPLICATION + "|" + [ACCOUNT ID] + [ACCOUNT NUMBER] : APPLICATION + "|" + [ACCOUNT NUMBER]))

Padding strings
The general strategy I use when adding leading characters is to always add them and then take the N rightmost characters; I find it greatly simplifies the logic.
RIGHT(REPLICATE("0", 17) + [MyColumn], 17)
That generates 17 zeros, prepends them to the column, and then slices off the last 17 characters. If [MyColumn] was already 17 characters long, no effective work is done. If it was empty, you now have a value of 17 zeros.
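If you want to prototype the pad-then-slice idea outside SSIS, a minimal Python sketch of the same technique (illustration only, not SSIS syntax) looks like this:
# Always prepend the filler, then keep only the rightmost N characters.
def pad_left(value, width=17, fill="0"):
    return (fill * width + value)[-width:]

print(pad_left("12345"))     # 00000000000012345
print(pad_left("A" * 17))    # unchanged: already 17 characters
print(pad_left(""))          # 17 zeros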
Choosing
In your case, I'd first add a derived column to identify which block of logic the APPLICATION falls into, much like the existing ternary expression you have.
(APPLICATION == "CLN") ? 10 :
(APPLICATION == "ITF") ? 20 :
(APPLICATION == "RCC" | APPLICATION == "SEC" ....) ? 30 : 40
Coming out of that derived column, you'll be able to verify the logic is working as expected, which makes the padding step much easier.
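To illustrate how the two steps (classify the APPLICATION, then pad) fit together, here is a rough Python sketch of the same flow; the column names mirror the ones in the question, and the sample row is made up:
def build_key(row, width=17):
    app = row["APPLICATION"]
    if app == "CLN":
        raw = row["ACCOUNT ID"] + row["ACCOUNT NUMBER"]
    elif app == "ITF":
        raw = row["CLIENT KEY"] + row["CLIENT ID"]
    elif app in ("RCC", "SEC", "HOGAN CIS"):
        raw = row["ACCOUNT ID"]
    else:
        raw = row["ACCOUNT NUMBER"]
    # pad-then-slice so the identifier part is always exactly `width` characters
    return app + "|" + ("0" * width + raw)[-width:]

print(build_key({"APPLICATION": "RCC", "ACCOUNT ID": "12345"}))   # RCC|00000000000012345
In the actual package, the bucket number from the derived column above would drive the equivalent SSIS expression.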


AWS's own example of submitting a Pig job does not work due to an issue with piggybank.jar

I have been trying to test out submitting Pig jobs on AWS EMR following Amazon's guide. I made the change to the Pig script to ensure that it can find the piggybank.jar as instructed by Amazon. When I run the script I get an ERROR 1070 indicating that one of the functions available in piggybank cannot be resolved. Any ideas on what is going wrong?
Key part of error
2018-03-15 21:47:08,258 ERROR org.apache.pig.PigServer (main): exception
during parsing: Error during parsing. Could not resolve
org.apache.pig.piggybank.evaluation.string.EXTRACT using imports: [,
java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Failed to parse: Pig script failed to parse: <file s3://cis442f-
data/pigons3/do-reports4.pig, line 26, column 6> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.pig.piggybank.evaluation.string.EXTRACT using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
The first part of the script is as follows; line 26, referred to in the error, is the line containing "EXTRACT(":
register file:/usr/lib/pig/lib/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT;
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE;
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME;
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT;
--
-- import logs and break into tuples
--
raw_logs =
-- load the weblogs into a sequence of one element tuples
LOAD '$INPUT' USING TextLoader AS (line:chararray);
logs_base =
-- for each weblog string convert the weblog string into a
-- structure with named fields
FOREACH
raw_logs
GENERATE
FLATTEN (
EXTRACT(
line,
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
)
)
AS (
remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
request: chararray, status: int, bytes_string: chararray, referrer: chararray,
browser: chararray
)
;
The correct function name is REGEX_EXTRACT. So either change your DEFINE statement to
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.REGEX_EXTRACT;
Or use REGEX_EXTRACT directly in your Pig script:
logs_base =
-- for each weblog string convert the weblog string into a
-- structure with named fields
FOREACH
raw_logs
GENERATE
FLATTEN (
REGEX_EXTRACT(
line,
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
)
)
AS (
remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
request: chararray, status: int, bytes_string: chararray, referrer: chararray,
browser: chararray
)
;
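If you want to sanity-check the regular expression locally before submitting another step to EMR, Python's re module is a quick way to do it (the sample line below is the stock Apache combined-log example, not one of your records):
import re

pattern = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08"')
print(pattern.match(sample).groups())   # nine fields: addr, logname, user, time, request, status, bytes, referrer, browser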
The original script from Amazon would not work because it relied on an older version of piggybank. Here is an updated version that does not need piggybank at all.
--
-- import logs and break into tuples
--
raw_logs =
-- load the weblogs into a sequence of one element tuples
LOAD '$INPUT' USING TextLoader AS (line:chararray);
logs_base =
-- for each weblog string convert the weblog string into a
-- structure with named fields
FOREACH
raw_logs
GENERATE
FLATTEN (
REGEX_EXTRACT_ALL(
line,
'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
)
)
AS (
remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
request: chararray, status: int, bytes_string: chararray, referrer: chararray,
browser: chararray
)
;
logs =
-- convert from string values to typed values such as date_time and integers
FOREACH
logs_base
GENERATE
*,
ToDate(time, 'dd/MMM/yyyy:HH:mm:ss Z', 'UTC') as dtime,
(int)REPLACE(bytes_string, '-', '0') as bytes
;
--
-- determine total number of requests and bytes served by UTC hour of day
-- aggregating as a typical day across the total time of the logs
--
by_hour_count =
-- group logs by their hour of day, counting the number of logs in that hour
-- and the sum of the bytes of rows for that hour
FOREACH
(GROUP logs BY GetHour(dtime))
GENERATE
$0,
COUNT($1) AS num_requests,
SUM($1.bytes) AS num_bytes
;
STORE by_hour_count INTO '$OUTPUT/total_requests_bytes_per_hour';
--
-- top 50 X.X.X.* blocks
--
by_ip_count =
-- group weblog entries by the ip address from the remote address field
-- and count the number of entries for each address block as well as
-- the sum of the bytes
FOREACH
(GROUP logs BY (chararray)REGEX_EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)', 1))
-- (GROUP logs BY block)
GENERATE $0,
COUNT($1) AS num_requests,
SUM($1.bytes) AS num_bytes
;
by_ip_count_sorted = ORDER by_ip_count BY num_requests DESC;
by_ip_count_limited =
-- order ip by the number of requests they make
LIMIT by_ip_count_sorted 50;
STORE by_ip_count_limited into '$OUTPUT/top_50_ips';
--
-- top 50 external referrers
--
by_referrer_count =
-- group by the referrer URL and count the number of requests
FOREACH
(GROUP logs BY (chararray)REGEX_EXTRACT(referrer, '(http:\\/\\/[a-z0-9\\.-]+)', 1))
GENERATE
FLATTEN($0),
COUNT($1) AS num_requests
;
by_referrer_count_filtered =
-- exclude matches for example.org
FILTER by_referrer_count BY NOT $0 matches '.*example\\.org';
by_referrer_count_sorted =
-- order the results by number of requests
ORDER by_referrer_count_filtered BY $1 DESC;
by_referrer_count_limited =
-- take the top 50 results
LIMIT by_referrer_count_sorted 50;
STORE by_referrer_count_limited INTO '$OUTPUT/top_50_external_referrers';
--
-- top search terms coming from bing or google
--
google_and_bing_urls =
-- find referrer fields that match either bing or google
FILTER
(FOREACH logs GENERATE referrer)
BY
referrer matches '.*bing.*'
OR
referrer matches '.*google.*'
;
search_terms =
-- extract from each referrer url the search phrases
FOREACH
google_and_bing_urls
GENERATE
FLATTEN(REGEX_EXTRACT_ALL(referrer, '.*[&\\?]q=([^&]+).*')) as (term:chararray)
;
search_terms_filtered =
-- reject urls that contained no search terms
FILTER search_terms BY NOT $0 IS NULL;
search_terms_count =
-- for each search phrase count the number of weblogs entries that contained it
FOREACH
(GROUP search_terms_filtered BY $0)
GENERATE
$0,
COUNT($1) AS num
;
search_terms_count_sorted =
-- order the results
ORDER search_terms_count BY num DESC;
search_terms_count_limited =
-- take the top 50 results
LIMIT search_terms_count_sorted 50;
STORE search_terms_count_limited INTO '$OUTPUT/top_50_search_terms_from_bing_google';
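As a side note, if you ever want to double-check what the by_hour_count step is computing, the aggregation is just a group-and-sum; a rough Python equivalent over already-parsed rows (field names here are assumptions chosen to mirror the Pig schema) would be:
from collections import defaultdict

rows = [                         # made-up parsed log rows
    {"hour": 13, "bytes": 2326},
    {"hour": 13, "bytes": 512},
    {"hour": 14, "bytes": 1024},
]
by_hour = defaultdict(lambda: {"num_requests": 0, "num_bytes": 0})
for row in rows:
    by_hour[row["hour"]]["num_requests"] += 1
    by_hour[row["hour"]]["num_bytes"] += row["bytes"]
print(dict(by_hour))             # per-hour request counts and byte totals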

Python sorting timestamp

I am struggling with something that should be relatively straightforward, but I am getting nowhere.
I have a bunch of data with a timestamp in the format hh:mm:ss. The data covers all 24 hours of the day, from 00:00:00 through 23:59:59.
I do not know how to go about pulling out the hh part of the data, so that I can just look at data between specific hours of the day.
I read the data in from a CSV file using:
with open(filename) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        time = row['Time']
This gives me time in hh:mm:ss format, but now I do not know how to do what I want, which is to look at the data from 6 AM until 6 PM (06:00:00 to 18:00:00).
With the times in 24-hour format, this is actually very simple:
'06:00:00' <= row['Time'] <= '18:00:00'
Assuming that you only have valid timestamps, this is true for all times between 6 AM and 6 PM inclusive.
If you want to get a list of all rows that meet this, you can put this into a list comprehension:
relevant_rows = [row for row in reader if '06:00:00' <= row['Time'] <= '18:00:00']
Update:
For handling times with no leading zero (0:00:00, 3:00:00, 15:00:00, etc), use split to get just the part before the first colon:
>>> row_time = '0:00:00'
>>> row_time.split(':')
['0', '00', '00']
>>> int(row_time.split(':')[0])
0
You can then check if the value is at least 6 and less than 18. If you want to include entries that are at 6 PM, then you have to check the minutes and seconds to make sure it is not after 6 PM.
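Putting the split-based check together, a small sketch of that filter for unpadded times (with the 6 PM boundary handled as described above) could look like this:
def in_window(row_time, start_hour=6, end_hour=18):
    hour = int(row_time.split(':')[0])
    if hour == end_hour:
        # keep 18:00:00 exactly, drop anything later in that hour
        return row_time.split(':')[1:] == ['00', '00']
    return start_hour <= hour < end_hour

print([t for t in ['0:15:00', '6:00:00', '12:30:45', '18:00:00', '18:00:01'] if in_window(t)])
# ['6:00:00', '12:30:45', '18:00:00']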
However, you don't even really need to try anything like regex or even a simple split. You have two cases to deal with - either the hour is one digit, or it is two digits. If it is one digit, it needs to be at least six. If it is two digits, it needs to be less than 18. In code:
if row_time[1] == ':':               # 1-digit hour
    if row_time > '6':               # 6 AM or later
        pass                         # This is an entry you want
else:
    if row_time < '18:00:00':        # Use <= if you want 6 PM to be included
        pass                         # This is an entry you want
or, compacted to a single line:
if (row_time[1] == ':' and row_time > '6') or row_time < '18:00:00':
    pass  # Parentheses are not actually needed, but they help make it clearer
as a list comprehension:
relevant_rows = [row for row in reader if (row['Time'][1] == ':' and row['Time'] > '6') or row['Time'] < '18:00:00']
You can use Python's slicing syntax to pull characters from the string.
For example:
>>> time = '06:05:22'
>>> timestamp_hour = time[0:2]  # the characters at index 0 and 1
>>> timestamp_hour
'06'
That gives you the first two digits: '06'. Then you can call int() to cast them to an integer:
>>> hour = int(timestamp_hour)
>>> hour
6
Now you have an integer variable that can be checked to see if it is between, say, 6 and 18.
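Tying that back to the csv.DictReader loop from the question (assuming zero-padded timestamps; the filename is just a placeholder), the whole thing becomes:
import csv

with open('data.csv') as csvfile:          # placeholder filename
    reader = csv.DictReader(csvfile)
    daytime_rows = [row for row in reader
                    if 6 <= int(row['Time'][0:2]) < 18]
print(len(daytime_rows))
Note that the < 18 comparison drops 18:00:00 itself; use the string comparison shown earlier if you need 6 PM to be inclusive.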

Oracle: How do I transform this string field into structured data using regular expressions?

I did start at this answer:
Oracle 11g get all matched occurrences by a regular expression
But it didn't get me far enough. I have a string field that looks like this:
A=&token1&token2&token3,B=&token2&token3&token5
It could have any number of tokens and any number of keys. The desired output is a set of rows looking like this:
Key | Token
A | &token1
A | &token2
A | &token3
B | &token2
B | &token3
B | &token5
This is proving rather difficult to do.
I started here:
SELECT token from
(SELECT REGEXP_SUBSTR(str, '[A-Z=&]+', 1, LEVEL) AS token
FROM (SELECT 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(str, '[A-Z=&]+', ',')))
Where token is not null
But that yields:
A=&
&
&
B=&
&
&
which is getting me nowhere. I'm thinking I need to do a nested clever select where the first one gets me
A=&token1&token2&token3
B=&token2&token3&token5
And a subsequent select might be able to do a clever extract to get the final result.
Stumped. I'm trying to do this without using procedural or function code -- I would like the set to be something I can union with other queries so if it's possible to do this with nested selects that would be great.
UPDATE:
SET DEFINE OFF
SELECT SUBSTR(token,1,1) as Key, REGEXP_SUBSTR(token, '&\w+', 1, LEVEL) AS token2
FROM
(
-- 1 row per key/value pair
SELECT token from
(SELECT REGEXP_SUBSTR(str, '[^,]+', 1, LEVEL) AS token
FROM (SELECT 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(str, '[^,]+', ',')))
Where token is not null
)
CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(token, '&\w+'))
This gets me
A | &token1
A | &token2
B | &token3
B | &token2
A | &token2
B | &token3
Which is fantastic formatting except for the small problem that it's wrong (A should have a token3, and token4 and token5 are nowhere to be seen).
Great question! Thanks for it!
select distinct k, regexp_substr(v, '[^&]+', 1, level) t
from (
    select substr(regexp_substr(val, '^[^=]+=&'), 1, length(regexp_substr(val, '^[^=]+=&')) - 2) k,
           substr(regexp_substr(val, '=&.*'), 3) v
    from (
        select regexp_substr(str, '[^,]+', 1, level) val
        from (select 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
        connect by level <= length(str) - length(replace(str, ',')) + 1
    )
) connect by level <= length(v) - length(replace(v, '&')) + 1
It is an answer, and one that seems to work... But I don't like the middle step of splitting val into k and v - there must be a better way (though if the key is always one character, that makes it easy). And having to use DISTINCT to get rid of duplicates is horrible... Maybe with further playing you can clean it up (or someone else might).
EDIT based on keeping the leading & and the key being a single character:
select distinct k, regexp_substr(v, '&[^&]+', 1, level) t
from (
    select substr(val, 1, 1) k,
           substr(regexp_substr(val, '=&.*'), 1) v
    from (
        select regexp_substr(str, '[^,]+', 1, level) val
        from (select 'A=&token1&token2&token3,B=&token2&token3&token5' str from dual)
        connect by level <= length(str) - length(replace(str, ',')) + 1
    )
) connect by level < length(v) - length(replace(v, '&')) + 1
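When juggling the CONNECT BY levels it also helps to have the target rows written down somewhere; a few lines of Python reproduce the expected key/token pairs from the sample string (purely a cross-check, nothing to do with the Oracle solution itself):
s = 'A=&token1&token2&token3,B=&token2&token3&token5'
pairs = [(kv.split('=')[0], '&' + tok)
         for kv in s.split(',')
         for tok in kv.split('=', 1)[1].split('&')[1:]]
for key, token in pairs:
    print(key, '|', token)
# prints A | &token1 ... B | &token5, matching the desired output above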

Add value between column using sed/awk based on matching value at certain column

I have a log file with many records. All rows have the same column format. I want to use sed/awk to match a value in a certain column and add a new value between two columns. As an example, a log like this:
2014.3.17 23:57:11 127.0.0.3 10.21.31.141 http://vcs2.msg.yahoo.com/capacity *DENIED* Banned site: msg.yahoo.com GET 0 0 3 403 - working_time content3 -
My command will search the log for msg.yahoo.com (column 9) and, if it matches, add a value (Social Media) between columns 12 and 13. The intended output:
2014.3.17 23:57:11 127.0.0.3 10.21.31.141 http://vcs2.msg.yahoo.com/capacity *DENIED* Banned site: msg.yahoo.com GET 0 0 Social Media 3 403 - working_time content3 -
My awk code only puts Social Media between columns 12 and 13:
awk -v column=12 -v value="Social Media" '
BEGIN {
    FS = OFS = " ";
}
{
    for ( i = NF + 1; i > column; i-- ) {
        $i = $(i-1);
    }
    $i = value;
    print $0;
}
' access3.log
but it needs to find msg.yahoo.com in column 9 before adding the value. In other words: if column 9 = msg.yahoo.com, put Social Media after column 12, i.e. between columns 12 and 13.
Workable but ugly is sed (as things so often are):
sed '/\([^ ]* \)\{8\}msg\.yahoo\.com/s/\(\([^ ]* \)\{12\}\)/\1Social Media /' filename
Here is the fix for awk
awk '$9=="msg.yahoo.com"{$(NF-6)=$(NF-6) " Social Media"}1' access3.log
Explanation
$9=="msg.yahoo.com" only target on the line which msg.yahoo.com in column 9
$(NF-6)=$(NF-6) " Social Media" column (NF-6) is the reverse column 6 from end, and replace with a new value.
1 just means true and print.
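If you would rather keep this logic in a small script than a one-liner, the same edit can be sketched in Python (reading the same access3.log; note the prose uses 1-based column numbers while the list index is 0-based):
with open('access3.log') as f:            # same input file as the awk example
    for line in f:
        fields = line.split()
        if len(fields) >= 13 and fields[8] == 'msg.yahoo.com':
            fields.insert(12, 'Social Media')   # lands between the original columns 12 and 13
        print(' '.join(fields))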

What is the best way to populate a load file for a date lookup dimension table?

Informix 11.70.TC4:
I have an SQL dimension table which is used for looking up a date (pk_date) and returning another date (plus1, plus2 or plus3_months) to the client, depending on whether the user selects a "1","2" or a "3".
The table schema is as follows:
TABLE date_lookup
(
pk_date DATE,
plus1_months DATE,
plus2_months DATE,
plus3_months DATE
);
UNIQUE INDEX on date_lookup(pk_date);
I have a load file (pipe delimited) containing dates from 01-28-2012 to 03-31-2014.
The following is an example of the load file:
01-28-2012|02-28-2012|03-28-2012|04-28-2012|
01-29-2012|02-29-2012|03-29-2012|04-29-2012|
01-30-2012|02-29-2012|03-30-2012|04-30-2012|
01-31-2012|02-29-2012|03-31-2012|04-30-2012|
...
03-31-2014|04-30-2014|05-31-2014|06-30-2014|
........................................................................................
EDIT: Sir Jonathan's SQL statement using DATE(pk_date + n UNITS MONTH) on 11.70.TC5 worked!
I generated a load file with pk_date's from 01-28-2012 to 12-31-2020, and plus1, plus2 & plus3_months NULL. Loaded this into date_lookup table, then executed the update statement below:
UPDATE date_lookup
SET plus1_months = DATE(pk_date + 1 UNITS MONTH),
plus2_months = DATE(pk_date + 2 UNITS MONTH),
plus3_months = DATE(pk_date + 3 UNITS MONTH);
Apparently, DATE() was able to convert pk_date to DATETIME, do the math with TC5's new algorithm, and return the result in DATE format!
.........................................................................................
The rules for this dimension table are:
If pk_date has 31 days in its month and plus1, plus2 or plus3_months only have 28, 29, or 30 days, then let plus1, plus2 or plus3 equal the last day of that month.
If pk_date has 30 days in its month and plus1, plus2 or plus3 has 28 or 29 days in its month, let them equal the last valid date of those months, and so on.
All other dates fall on the same day of the following month.
My question is: what is the best way to automatically generate pk_dates past 03-31-2014 following the above rules? Can I accomplish this with an SQL script, sed, or a C program?
EDIT: I mentioned sed because I already have more than two years' worth of data and could perhaps model the rest after it, or perhaps a tool like awk is better?
The best technique would be to upgrade to 11.70.TC5 (on 32-bit Windows; generally to 11.70.xC5 or later) and use an expression such as:
SELECT DATE(given_date + n UNITS MONTH)
FROM Wherever
...
The DATETIME code was modified between 11.70.xC4 and 11.70.xC5 to generate dates according to the rules you outline when the dates are as described and you use the + n UNITS MONTH or equivalent notation.
This obviates the need for a table at all. Clearly, though, all your clients would have to be on 11.70.xC5 too.
Maybe you can update your development machine to 11.70.xC5 and then use this property to generate the data for the table on your development machine, and distribute the data to your clients.
If upgrading at least someone to 11.70.xC5 is not an option, then consider the Perl script suggestion.
Can it be done with SQL? Probably, but it would be excruciating. Ditto for C, and I think 'no' is the answer for sed.
However, a couple of dozen lines of Perl seem to produce what you need:
#!/usr/bin/perl
use strict;
use warnings;
use DateTime;

my @dates;

# parse arguments
while (my $datep = shift){
    my ($m, $d, $y) = split('-', $datep);
    push(@dates, DateTime->new(year => $y, month => $m, day => $d))
        || die "Cannot parse date $!\n";
}

open(STDOUT, ">", "output.unl") || die "Unable to create output file.";

my ($date, $end) = @dates;
while( $date < $end ){
    my @row = ($date->mdy('-'));              # start with pk_date
    for my $mth ( qw[ 1 2 3 ] ){
        my $fut_d = $date->clone->add(months => $mth);
        until (
            ($fut_d->month == $date->month + $mth
                && $fut_d->year == $date->year) ||
            ($fut_d->month == $date->month + $mth - 12
                && $fut_d->year > $date->year)
        ){
            $fut_d->subtract(days => 1);      # step back until criteria met
        }
        push(@row, $fut_d->mdy('-'));
    }
    print STDOUT join("|", @row, "\n");
    $date->add(days => 1);
}
Save that as futuredates.pl, chmod +x it and execute like this:
$ futuredates.pl 04-01-2014 12-31-2020
That seems to do the trick for me.
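If Perl is awkward to deploy in your environment, the same end-of-month clamping can be sketched with nothing but Python's standard library (the date range and output format follow the load file above; treat this as an untested alternative, not a drop-in replacement for the Perl script):
import calendar
from datetime import date, timedelta

def add_months_clamped(d, months):
    # advance the month, clamping the day to the last day of the target month
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

start, end = date(2012, 1, 28), date(2020, 12, 31)
d = start
while d <= end:
    row = [d] + [add_months_clamped(d, n) for n in (1, 2, 3)]
    print('|'.join(x.strftime('%m-%d-%Y') for x in row) + '|')
    d += timedelta(days=1)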