Using AWS Data Pipeline PigActivity - amazon-web-services

I am trying to get a simple PigActivity to work in Data Pipeline.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-pigactivity.html#pigactivity
The Input and Output fields are required for this activity. I have them both set to use S3DataNode. Both of these DataNodes have a directoryPath pointing to my S3 input and output locations. I originally tried to use filePath, but got the following error:
PigActivity requires 'directoryPath' in 'Output' object.
I am using a custom pig script, also located in S3.
My question is how do I reference these input and output paths in my script?
The example given in the reference uses the stage field (which can be enabled/disabled). My understanding is that this is used to convert the data into tables. I don't want to do this, as it also requires that you specify a dataFormat field.
Determines whether staging is enabled and allows your Pig script to have access to the staged-data tables, such as ${INPUT1} and ${OUTPUT1}.
I have disabled staging and I am trying to access the data in my script as follows:
input = LOAD '$Input';
But I get the following error:
IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : Input
I have tried using:
input = LOAD '${Input}';
But I get an error for this too.
There is the optional scriptVariable field. Do I have to use some sort of mapping here?

Just using
LOAD '<your S3 URI>'
should work.
Normally this is done for you by staging (table creation), so you do not have to access the URI directly from the script; you only specify it in the S3DataNode.
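For example, with staging disabled, a minimal sketch that references the S3 locations directly in the script (the bucket, prefixes, and columns here are hypothetical) might look like:
-- load directly from the input S3 prefix (hypothetical bucket/paths)
part = LOAD 's3://my-bucket/input/' USING PigStorage(',') AS (p_partkey, p_name, p_color);
grpd = GROUP part BY p_color;
counts = FOREACH grpd GENERATE group, COUNT(part);
-- write the result back to the hypothetical output prefix
STORE counts INTO 's3://my-bucket/output/' USING PigStorage(',');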

Make sure you have set the "stage" property of the PigActivity to true.
Once I did that, the script below started working for me:
-- with staging enabled, ${input1} and ${output1} refer to the staged tables created from the Input and Output data nodes
part = LOAD ${input1} USING PigStorage(',') AS (p_partkey,p_name,p_mfgr,p_category,p_brand1,p_color,p_type,p_size,p_container);
-- count parts per colour and write the result to the staged output table
grpd = GROUP part BY p_color;
${output1} = FOREACH grpd GENERATE group, COUNT(part);

Related

Google Cloud Data Fusion - Dynamic arguments based on functions

Good morning all,
I'm looking for a way in Google Data Fusion to make the name of a source file stored on GCS dynamic. The files to be processed are named according to their value date, for example: 2020-12-10_data.csv
My need is to set the filename dynamically so that the pipeline uses the correct file every day (something like: ${ new Date().getFullYear()... }_data.csv).
I managed to use runtime arguments by specifying the date as a string (2020-12-10), but not with a function.
More generally, is there any documentation on how to enter dynamic parameters with ready-made or custom "functions"? (I couldn't find any.)
Thanks in advance for your help.
There is a ready-made workaround: you can try the "BigQuery Execute" plugin.
Steps:
Put the query below in the SQL field:
select cast(current_date as string) ||'_data.csv' as filename
--for output '2020-12-15_data.csv'
Set "Row As Arguments" to 'true'.
Now use the above argument via ${filename} wherever you want to.
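For example (the bucket name is hypothetical), the GCS source path in the pipeline could then be set to something like gs://my-bucket/${filename}.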

Unable to set variable in AWS pgsql

I am trying to create a dynamic location for a CREATE EXTERNAL TABLE statement.
I am using the statement
#set loc = 's3/root/' + Replace(Convert (Varchar, Current_Date),'-','')
in order to set loc to 's3/root/20200622', but I am unable to do it, while the equivalent SELECT gives the expected result.
Redshift doesn't support variables.
You could create a Python UDF to do this, but I'm not sure even that will serve your use case.
Alternatively, you can execute the queries from Python code and pass the locations dynamically.
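A minimal sketch of that approach (the connection details, external schema, table, and columns are hypothetical, and it assumes the psycopg2 driver is installed):
from datetime import date
import psycopg2

# Build the dynamic location, e.g. s3://root/20200622
loc = "s3://root/" + date.today().strftime("%Y%m%d")

# Hypothetical external schema, table, and columns
sql = f"""
CREATE EXTERNAL TABLE spectrum_schema.my_table (id INT, name VARCHAR(50))
STORED AS PARQUET
LOCATION '{loc}'
"""

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="admin", password="xxxx")
conn.autocommit = True  # CREATE EXTERNAL TABLE cannot run inside a transaction
with conn.cursor() as cur:
    cur.execute(sql)
conn.close()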

Data Files in PR-request_Result is showing as undefined

Problem statement: I am trying to use data variables in the pre-request section and I am unable to retrieve values from the CSV data variable. I tried the two options below, but when I try to log the value, nothing is displayed:
- Data.Variable
- pm.iterationData.get("variable")
Please verify whether the below is right.
Data File:
Account_Number,Account_Name,Customer_ID,Currency,Account_Type,Account_Sub, Category
100000002,SWEEPY GROUPS 001,1507400001508,THB,DA,SAV,I
10000019,SWEEPY GROUPS 019,1507400001508,USD,DA,SAV,E
A9871100000020,SWEEPY GROUPS 020,1507400001508,USD,DA,DDA,E
PRE-REQ:
console.log("Customer ID"+JSON.stringify(data.Account_Name))
In the console, the result is:
Customer ID undefined
Please suggest an alternative or how I can get this running.

DSX: Insert to code link is missing

After uploading some files to my project and creating a catalog, I can see the list of files in the Find and Add Data section. However, there is no link Insert to code. This is true for files of type csv, json, tar.gz as well as for a data set from a catalog. What am I doing wrong?
The Insert to code option is only available for data that you upload to the Object Storage service.
I see that you are using the Catalog for storage in DSX.
The Catalog is still in beta, and Insert to code is not currently supported for Catalog data assets.
Feel free to add an enhancement request here:
https://datascix.uservoice.com/forums/387207-general
If you create a project with Object Storage as the storage, you will see Insert to code for CSV files.
For reading from the Catalog, you will need to use projectUtil.
A Catalog data asset is considered a resource of the project, so to access it you need an access token.
So, as a first step, generate the token to access the Catalog resource:
Go to Project Settings and create an access token, then clear the next cell, click "Insert project token" from the 3-dot menu at the top of the notebook, and you will see code generated as below.
The generated code just creates the project context.
import com.ibm.analytics.projectNotebookIntegration._
val pc = ProjectUtil.newProjectContext(sc, "994b03fa-XXXXXX", "p-XXXXXXXXXX")
Let's list the available files:
val fileList = ProjectUtil.listAvailableFilesData(pc)
fileList.indices.foreach( i => println(i + ": " + fileList(i)))
So fileList contains your filenames.
You can pass either an entry from fileList or the file name directly as the second argument.
val df = ProjectUtil.loadDataFrameFromFile(pc, fileList(1))
or
val df1 = ProjectUtil.loadDataFrameFromFile(pc, "co2.csv")
You will see the following:
"Creating DataFrame, this will take a few moments...
DataFrame created."
Then df.show() will display the content.
Full notebook:
https://github.com/charles2588/bluemixsparknotebooks/blob/master/scala/Read_Write_Catalog_Scala.ipynb
The doc below also has Python and R examples.

Reference for projectUtil: https://datascience.ibm.com/docs/content/local/notebookfunctionsload.html
Thanks,
Charles.

Drupal 8: Altering Search API queries

I'm working on a project which includes the following activated modules:
Drupal core 8.2.3
Database Search 8.x-1.0-beta4
Search API 8.x-1.0-beta4
Search API Term Handlers 8.x-1.0-beta4
Views 8.2.3
I have a list of nids which need to be excluded from the search results of the site-wide search. The search uses Search API and has been set up using Views.
The table in the database is: "search_api_db_default_index"
The field I wish to target is: "nid"
I wasn't able to get HOOK_search_api_query_alter or HOOK_search_api_results_alter to fire, so I am attempting to manipulate the query through HOOK_views_query_alter.
I have attempted to use both the "addWhere" and "addCondition" methods with the following syntax:
When using the addCondition method, I attempted
$query->addCondition('search_api_db_default_index.nid', $oneBadNid, '<>');
and
$query->addCondition('search_api_db_default_index.nid', $manyBadNids, 'NOT IN');
and when using the addWhere method, I attempted
$query->addWhere('AND', 'search_api_index_default_index.nid', $oneBadNid, '<>');
and
$query->addWhere('AND', 'search_api_index_default_index.nid', $manyBadNids, 'NOT IN');
Regardless of whether or not I prefix the field with the table name, searching always triggers the following notice:
Unknown field in filter clause: 'search_api_db_default_index.nid' .
It seems that the field name is always wrapped in an HTML-encoded single quotation mark, and this occurs whether I use double or single quotation marks around the supplied table.field parameter.
I am not even sure that this is what is keeping me from altering my query, but it is the only thing close to an error which I have discovered in this process. It's also possible that I'm simply not supposed to target the table in this manner, but I did not find any documentation pointing me to the proper methodology.
I would appreciate any insight into this issue! Thanks!
Generally you can use
$fields = $query->getIndex()->getFields();
on the query to get an array of fields you can use within the search_api query.
Piggy-backing off of Nebel54's comment, and attempting this on my own: you don't need to include the table name when calling addCondition. However, I did need to use hook_search_api_query_alter rather than a Views-specific hook.
function mymodule_search_api_query_alter(\Drupal\search_api\Query\QueryInterface &$query) {
  // Ensure field_myfield is being indexed.
  $fields = $query->getIndex()->getFields();
  if (isset($fields['field_myfield'])) {
    $query->addCondition('field_myfield', 'myvalue', '<>');
  }
}
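Adapting the function above to the original use case of excluding a list of node IDs (the $manyBadNids values below are just placeholders), something like this should work, assuming nid is an indexed field:
function mymodule_search_api_query_alter(\Drupal\search_api\Query\QueryInterface &$query) {
  // Placeholder list of node IDs to exclude from the search results.
  $manyBadNids = [12, 34, 56];
  // Only add the condition if nid is actually indexed.
  $fields = $query->getIndex()->getFields();
  if (isset($fields['nid'])) {
    // No table prefix needed; use the indexed field name directly.
    $query->addCondition('nid', $manyBadNids, 'NOT IN');
  }
}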