Issue:
How do you include your mapping output variables in the workflow without having to use a Human task?
Basically I have one row of data that I want to push out and have included in, for example, the body of the Notification task or as an attachment.
FYI - we are not using PowerCenter, only the Developer tool.
I've resolved a similar situation using a flat file output from the Mapping and Command Tasks in the workflow. Here is a link to where I explained the process in the Informatica Data Quality Community forum: https://communities.informatica.com/thread/15717
Related
We have a use case to build a data pipeline solution in which we need the following:
Ability to have multiple steps (the output of one step should feed in as the input to the next)
Ability to have multiple algorithms in each step (a SQL query, or possibly a call to a REST endpoint)
The input to the first step can be anything. We have DW tables, but we can pre-process and keep the relevant information in AWS S3 or another data store.
Something like this:
Is there an existing solution that already provides functionality similar to this, or one that can be modified to support it?
Having something in AWS would be easier to integrate.
How about AWS Glue? It sounds like a fit for your goals...
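If Glue fits, one pattern is to make each "step" a Glue job that reads the previous step's output from S3 (via the Data Catalog), applies its algorithm, and writes its result where the next step will pick it up; the jobs are then chained with Glue triggers or a Glue workflow. A rough PySpark sketch of one such step - the catalog database, table and bucket names are placeholders:

# One pipeline "step": read the previous step's output, run this step's
# algorithm (Spark SQL here; a REST call could be made from the same script),
# and write the result for the next step.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Previous step's output, cataloged by a crawler (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="pipeline_db", table_name="step1_output")
dyf.toDF().createOrReplaceTempView("step1")

# This step's algorithm: a SQL query over the previous step's output.
result = spark.sql("SELECT id, SUM(amount) AS total FROM step1 GROUP BY id")

# Write where the next step (or its crawler) will read from.
result.write.mode("overwrite").parquet("s3://my-pipeline-bucket/step2/")

job.commit()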
Every time I write a query using the bq CLI tool I have to specify --use_legacy_sql=false. I'd like to make this the default. Google's documentation states it must be set in the file $HOME/.bigqueryrc, but there is no such file. In fact, I couldn't find a file named .bigqueryrc anywhere on the server. Please help me save a few seconds every time I write a query. Thanks.
Following the conversation in the comments above: the file under discussion, .bigqueryrc, can be created manually and populated with "default" flag values as described on the Setting default values for command-line flags page of the BigQuery documentation.
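For example, a .bigqueryrc created in $HOME with the following contents makes standard SQL the default for bq query (flags under a [command] header apply only to that command; global flags can be listed above the first header):

[query]
--use_legacy_sql=false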
We're trying to use AWS Glue for ETL operations in our Node.js project. The workflow will be as follows:
The user uploads a CSV file
The data is transformed from XYZ format to ABC format (mapping and changing field names)
The transformed CSV file is downloaded to the local system
Note that this flow should happen programmatically (creating crawlers and job triggers should be done in code, not through the console). I don't know why the documentation and other articles always show how to create crawlers and jobs from the Glue console.
I believe we have to create Lambda functions and triggers, but I'm not quite sure how to achieve this end-to-end flow. Can anyone please help me? Thanks.
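For what it's worth, everything the Glue console does is also exposed through the API, so the crawler, the job, and the job run can all be created from code. A rough sketch using boto3 (Python shown for brevity; the AWS SDK for JavaScript exposes the same Glue operations) - the names, role ARN and S3 paths below are placeholders:

# Drive Glue entirely from code, no console.  All names, the role ARN and
# the S3 paths are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# 1. Crawl the uploaded CSV so Glue knows its schema.
glue.create_crawler(
    Name="uploads-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="uploads_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/uploads/"}]},
)
glue.start_crawler(Name="uploads-crawler")

# 2. Register the transformation job; the ETL script itself lives in S3.
glue.create_job(
    Name="xyz-to-abc",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/scripts/xyz_to_abc.py",
             "PythonVersion": "3"},
)

# 3. Start a run (e.g. from a Lambda triggered by the S3 upload event),
#    passing the input and output locations as job arguments.
run = glue.start_job_run(
    JobName="xyz-to-abc",
    Arguments={"--input_path": "s3://my-bucket/uploads/file.csv",
               "--output_path": "s3://my-bucket/transformed/"},
)
print(run["JobRunId"])

The transformed file can then be handed back to the user for download, for example via a pre-signed S3 URL.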
I have set up logging of Pentaho jobs and transformations to a database.
This works fine provided I define every job and every transformation in its individual log settings dialogue.
I see that I can configure the kettle properties file to hold these values.
However, I can't get this to be inherited automatically by a transformation when it is called by a job. I assume that if it is defined in the properties file it should just inherit and work.
Any ideas on what I am missing?
Thanks
(MS windows env with MS Sql server- we don't have Pentaho enterprise).
You can do it by adding the entries below to the "kettle.properties" file.
# Kettle logging properties
# The *_LOG_DB values are the names of the database connection used for
# logging; *_LOG_SCHEMA and *_LOG_TABLE name the target schema and table.
KETTLE_TRANS_LOG_DB=
KETTLE_TRANS_LOG_SCHEMA=
KETTLE_TRANS_LOG_TABLE=etl_trans_log
KETTLE_JOB_LOG_DB=
KETTLE_JOB_LOG_SCHEMA=
KETTLE_JOB_LOG_TABLE=etl_job_log
OK, so I have found that provided I set the properties file on the machine, and then set each transformation (by right-clicking and setting each log to use the connection), then when I call the job it all logs correctly.
So you need the database connection in all transformations, and you need to set this as the default in the logging tab.
I think this is right anyway, unless someone else has a shortcut.
I've submitted a training job to the cloud using the RESTful API and see in the console logs that it completed successfully. In order to deploy the model and use it for predictions I have saved the final model using tf.train.Saver().save() (according to the how-to guide).
When running locally, I can find the graph files (export-* and export-*.meta) in the working directory. When running on the cloud however, I don't know where they end up. The API doesn't seem to have a parameter for specifying this, it's not in the bucket with the trainer app, and I can't find any temporary buckets on the cloud storage created by the job.
When you set up your Cloud ML environment you set up a bucket for this purpose. Have you looked in there?
https://cloud.google.com/ml/docs/how-tos/getting-set-up
Edit (for the future record): As Robert mentioned in the comments, you'll want to pass the output location to the job as an argument. A couple of things to be mindful of:
Use a unique output location per job, so one job doesn't clobber the outputs of another.
The recommendation is to specify a parent output path and use it to contain the exported model in a subpath called 'model', as well as to organize other outputs such as checkpoints and summaries within that path. That makes it easier to manage all the outputs.
While not required, I'll also suggest staging the training code in a packages subpath of the output, which helps correlate the source with the outputs it produces.
Finally(!), keep in mind that when you use hyperparameter tuning, you'll need to append the trial id to the output path so that outputs produced by individual trials don't overwrite each other; see the sketch below.
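As an illustration only (the --output_path flag name and the directory layout below are just conventions, and I'm assuming the trial id can be read from the TF_CONFIG environment variable during hyperparameter tuning), a trainer might assemble its output directories like this:

# Sketch: build per-job (and per-trial) output paths from a single
# --output_path argument passed to the training job.
import argparse
import json
import os

parser = argparse.ArgumentParser()
parser.add_argument("--output_path", required=True,
                    help="Parent GCS output path, unique per job, "
                         "e.g. gs://my-bucket/jobs/job_20170101_1234")
args = parser.parse_args()

# During hyperparameter tuning each trial gets its own id; append it so
# trials don't overwrite each other's outputs.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
trial = tf_config.get("task", {}).get("trial", "")
output_dir = os.path.join(args.output_path, trial) if trial else args.output_path

model_dir = os.path.join(output_dir, "model")              # exported model
checkpoint_dir = os.path.join(output_dir, "checkpoints")   # training checkpoints
summary_dir = os.path.join(output_dir, "summaries")        # TensorBoard logs
package_dir = os.path.join(args.output_path, "packages")   # staged training code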