When trying to use a script argument in the SqlActivity:
{
"id" : "ActivityId_3zboU",
"schedule" : { "ref" : "DefaultSchedule" },
"scriptUri" : "s3://location_of_script/unload.sql",
"name" : "unload",
"runsOn" : { "ref" : "Ec2Instance" },
"scriptArgument" : [ "'s3://location_of_unload/#format(minusDays(#scheduledStartTime,1),'YYYY/MM/dd/hhmm/')}'", "'aws_access_key_id=????;aws_secret_access_key=*******'" ],
"type" : "SqlActivity",
"dependsOn" : { "ref" : "ActivityId_YY69k" },
"database" : { "ref" : "RedshiftCluster" }
}
where the unload.sql script contains:
unload ('
select *
from tbl1
')
to ?
credentials ?
delimiter ',' GZIP;
or :
unload ('
select *
from tbl1
')
to ?::VARCHAR(255)
credentials ?::VARCHAR(255)
delimiter ',' GZIP;
the process fails with:
syntax error at or near "$1" Position
Any idea what I'm doing wrong?
This is the script that works fine from the psql shell:
insert into tempsdf select * from source where source.id = '123';
Here are some of my tests on SqlActivity using Data Pipeline:
Test 1 : Using ?'s
insert into mytable select * from source where source.id = ?; - works fine if used via both the 'script' and 'scriptUri' options on the SqlActivity object,
where "ScriptArgument" : "123"
Here the ? can replace the value in the condition, but not the condition itself.
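For reference, here is a minimal sketch of how Test 1 could be wired into a pipeline definition; the object id and table names are placeholders, not taken from the original pipeline:
{
  "id" : "SqlActivityTest1",
  "type" : "SqlActivity",
  "runsOn" : { "ref" : "Ec2Instance" },
  "database" : { "ref" : "RedshiftCluster" },
  "script" : "insert into mytable select * from source where source.id = ?;",
  "scriptArgument" : [ "123" ]
}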
Test 2: Using parameters - works only when the command is specified using the 'script' option
insert into #{myTable} select * from source where source.id = ?; - works fine if used via the 'script' option only
insert into #{myTable} select * from source where source.id = #{myId}; - works fine if used via the 'script' option only
where #{myTable} and #{myId} are parameters whose values can be declared in the template:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-custom-templates.html
(when you are using only parameters, make sure you delete any unused
scriptArguments - otherwise it will still throw an error)
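For illustration, parameter declarations along the lines of that guide might look roughly like this (the ids, types, and values below are assumptions for this sketch, not taken from the original pipeline):
{
  "parameters" : [
    { "id" : "myTable", "type" : "String", "description" : "target table" },
    { "id" : "myId", "type" : "String", "description" : "source id to match" }
  ],
  "values" : {
    "myTable" : "tempsdf",
    "myId" : "123"
  }
}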
FAILED TESTS and inferences:
insert into ? select * from source where source.id = ?;
insert into ? select * from source where source.id = '123';
Both of the above commands do not work because
table names cannot be used as placeholders for script arguments. ?'s can only be used to pass values for a comparison condition or column values.
insert into #{myTable} select * from source where source.id = #{myId}; - doesn't work if used via 'scriptUri'
insert into tempsdf select * from source where source.id = #{myId}; - does not work when used via 'scriptUri'
The above 2 commands do not work because
parameters cannot be evaluated if the script is stored in S3.
insert into tempsdf select * from source where source.id = $1; - doesn't work with 'scriptUri'
insert into tempsdf values ($1,$2,$3); - does not work.
Using $'s does not work in any combination.
Other tests:
"scriptArgument" : "123"
"scriptArgument" : "456"
"scriptArgument" : "789"
insert into tempsdf values (?,?,?); - works via both 'script' and 'scriptUri' and translates to insert into tempsdf values ('123','456','789');
scriptArguments follow the order in which you list them and replace the ?'s in the script.
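In the activity definition, those three arguments would be supplied as an ordered array alongside the script, for example:
"script" : "insert into tempsdf values (?,?,?);",
"scriptArgument" : [ "123", "456", "789" ]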
In a ShellCommandActivity, we specify two scriptArguments that can be accessed as $1 and $2 in the shell script (.sh):
"scriptArgument" : "'s3://location_of_unload/#format(minusDays(#scheduledStartTime,1),'YYYY/MM/dd/hhmm/')}'", # can be accessed using $1
"scriptArgument" : "'aws_access_key_id=????;aws_secret_access_key=*******'" # can be accessed using $2
I don't know whether this will work for you.
I believe you are using this SqlActivity for Redshift. Can you modify your SQL script to refer to the parameters using their positional notation?
To refer to the parameters in the SQL statement itself, use $1, $2, etc.
See http://www.postgresql.org/docs/9.1/static/sql-prepare.html
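For reference, positional notation in a PostgreSQL prepared statement looks like this (a generic sketch with made-up names, not something tested inside a Data Pipeline SqlActivity):
PREPARE myinsert (int, text) AS
    INSERT INTO foo VALUES ($1, $2);
EXECUTE myinsert(1, 'abc');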
The standard aggregator makes a comma-separated list:
$ SELECT list_string_agg([1, 2, 'sdsd'])
'1,2,sdsd'
How can I make a semicolon-separated or '/'-separated list? Like '1;2;sdsd' or '1/2/sdsd'.
I believe the string_agg function is what you want; it also supports DISTINCT.
# Python example
import duckdb as dd

CURR_QUERY = \
'''
SELECT string_agg(DISTINCT a.c, ' || ') AS str_con
FROM (SELECT 'string 1' AS c
      UNION ALL
      SELECT 'string 2' AS c
      UNION ALL
      SELECT 'string 1' AS c) AS a
'''
print(dd.query(CURR_QUERY))
The above will give you "string 1 || string 2".
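For the separators asked about in the question, the same aggregate works with any delimiter - a minimal sketch, with the values written as text literals so the example is self-contained:
import duckdb as dd

# semicolon-separated
print(dd.query("SELECT string_agg(x, ';') AS s FROM (VALUES ('1'), ('2'), ('sdsd')) AS t(x)"))  # 1;2;sdsd
# slash-separated
print(dd.query("SELECT string_agg(x, '/') AS s FROM (VALUES ('1'), ('2'), ('sdsd')) AS t(x)"))  # 1/2/sdsd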
WorkstationList.csv has 3 separate columns with different names. I'm trying to use a column "retired" that holds a list of retired LogonWorkstations names and match them against a user's current list. If there is a match, delete it from the user's LogonWorkstations list.
$defaultWorkstationslist = Import-Csv -Path '[workstationList.csv]'
$olist = Get-ADUser $user -Properties LogonWorkstations | Select LogonWorkstations
$newlist = ''
foreach ($o in $olist){
foreach ($r in $defaultWorkstationslist.retired){
if ($o -ne $r){
$newlist += $o
} else {
continue
}
}
}
Set-ADUser $user -logonWorkstations $newlist
Output:
Set-ADUser : The format of the specified computer name is invalid
[redacted]:36 char:1
+ Set-ADUser $user-logonWorkstations $newlist
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (user:ADUser) [Set-ADUser], ADException
+ FullyQualifiedErrorId : ActiveDirectoryServer:1210,Microsoft.ActiveDirectory.Management.Commands.SetADUser
According to the Set-ADUser Docs for this specific parameter:
To specify more than one computer, create a single comma-separated list. You can identify a computer by using the Security Account Manager (SAM) account name (sAMAccountName) or the DNS host name of the computer. The SAM account name is the same as the NetBIOS name of the computer.
We can assume that, if the user has this property set, it would be a string with each computer comma separated (computer1,computer2...). Following that assumption, we can first test if the user has that property set, then split the value by comma and lastly filter each value against the $defaultWorkstationslist.retired array.
Important note: this should work as long as the Retired column from your CSV has computer names using the SAM account name or the NetBIOS name of the computer, as stated in the docs.
$user = 'someuser'
$defaultWorkstationslist = Import-Csv -Path '[workstationList.csv]'
$aduser = Get-ADUser $user -Properties LogonWorkstations
# if this user has the attribute set
if($wsList = $aduser.LogonWorkstations) {
    # split the string by comma and then use a filtering technique to exclude those values that
    # exist in the `$defaultWorkstationslist.retired` array.
$allowedList = ($wsList.Split(',').Where{ $_ -notin $defaultWorkstationslist.retired }) -join ','
Set-ADUser $user -LogonWorkstations $allowedList
}
For filtering we can use the .Where intrinsic method, where the current object in the pipeline is represented by $_ ($PSItem).
For testing whether an element of $wsList is contained in $defaultWorkstationslist.retired we can use containment operators.
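As a quick, self-contained illustration of the split/filter/join technique (the computer names here are hypothetical):
$wsList  = 'PC01,PC02,PC03'
$retired = @('PC02')
# split the string, drop any name found in $retired, re-join as a comma-separated list
($wsList.Split(',').Where{ $_ -notin $retired }) -join ','   # -> 'PC01,PC03'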
Is there any way to realize a recursive search on a JSON string object in BigQuery in the absence of the operator "..", which is apparently not supported?
Motivation: access "name" only when it is located within "students" in the JSON below.
Query
SELECT JSON_EXTRACT(json_text, '$..students.name') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
Desired output
+-----------------+
| first_student |
+-----------------+
| "Jane" |
+-----------------+
Current output
Unsupported operator in JSONPath: ..
Consider the approach below:
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
  library="gs://some_bucket/jsonpath-0.8.0.js"
)
AS """
return jsonPath(JSON.parse(json), json_path);
""";
SELECT CUSTOM_JSON_EXTRACT(json_text, '$..students.name') AS first_student
FROM UNNEST([
'{"class" : {"students" : {"name" : "Jane"}}}'
]) AS json_text;
with output matching the desired output above
Note: to overcome BigQuery's current "limitation" for JSONPath, the above solution uses a UDF + an external library - jsonpath-0.8.0.js - which can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage as gs://some_bucket/jsonpath-0.8.0.js.
I want to pass the output of the first query (SophosBearerToken) to the other one, but it gives me the following error:
Formula.Firewall: Query 'Query2' (step 'PartnerID') references other
queries or steps, so it may not directly access a data source. Please
rebuild this data combination.
First query: SophosBearerToken (works)
let
SophosBearerToken = "Bearer " & (Json.Document(Web.Contents("https://id.sophos.com/api/v2/oauth2/token",
[
Headers = [#"Content-Type"="application/x-www-form-urlencoded"],
Content = Text.ToBinary("grant_type=client_credentials&client_id=" & #"SophosClientID" & "&client_secret=" & #"SophosClientSecret" & "&scope=token")
]
)) [access_token])
in
SophosBearerToken
Second query: Query2 (fails)
let
PartnerIDQuery = Json.Document(Web.Contents("https://api.central.sophos.com/whoami/v1", [Headers = [#"Authorization"=#"SophosBearerToken"]])),
PartnerID = PartnerIDQuery[id]
in
PartnerID
But when I add the output of the first query manually to the second one, it works.
What could be the mistake I'm making here?
I'm trying to extract a dataset from DynamoDB to S3 using Glue. In the process I want to select a handful of columns, then set a default value for any and all rows/columns that have missing values.
My attempt is currently to use the "Map" function, but it doesn't seem to be calling my method.
Here is what I have:
def SetDefaults(rec):
print("checking record")
for col in rec:
if not rec[col]:
rec[col] = "missing"
return rec
## Read raw(source) data from target DynamoDB
raw_data_dyf = glueContext.create_dynamic_frame_from_options("dynamodb", {"dynamodb.input.tableName" : my_dynamodb_table, "dynamodb.throughput.read.percent" : "0.50" } )
## Get the necessary columns
selected_data_dyf = ApplyMapping.apply(frame = raw_data_dyf, mappings = mappingList)
## get rid of null values
mapped_dyF = Map.apply(frame=selected_data_dyf, f=SetDefaults)
## write it all out as a csv
datasink = glueContext.write_dynamic_frame.from_options(frame=mapped_dyF , connection_type="s3", connection_options={ "path": my_train_data }, format="csv", format_options = {"writeHeader": False , "quoteChar": "-1" })
My ApplyMapping.apply call is doing the right thing, where mappingList is defined by a bunch of:
mappingList.append(('gsaid', 'bigint', 'gsaid', 'bigint'))
mappingList.append(('objectid', 'bigint', 'objectid', 'bigint'))
mappingList.append(('objecttype', 'bigint', 'objecttype', 'bigint'))
I have no errors, everything runs to completion. My data is all in S3, but there are many empty values still, rather than the "missing" entry I would like.
The "checking record" print statement never prints out. What am I missing here?
Alternative solution:
1. Convert the DynamicFrame to a Spark DataFrame
2. Use the DataFrame's fillna() method to fill the null values
3. Convert the DataFrame back to a DynamicFrame
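A minimal sketch of those three steps, assuming the same glueContext and the selected_data_dyf frame from the question (the fromDF name argument is arbitrary):
from awsglue.dynamicframe import DynamicFrame

# 1. DynamicFrame -> Spark DataFrame
selected_df = selected_data_dyf.toDF()

# 2. fill null values; note that fillna with a string value only fills string-typed columns
filled_df = selected_df.fillna("missing")

# 3. Spark DataFrame -> DynamicFrame, ready for write_dynamic_frame
mapped_dyF = DynamicFrame.fromDF(filled_df, glueContext, "mapped_dyF")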