Using Amazon Data Pipeline, I'm trying to use a SqlActivity to execute some SQL on a non-Redshift data store (SnowflakeDB, for the curious). It seems like it should be possible with a SqlActivity that uses a JdbcDatabase. My first warning sign was that the WYSIWYG editor in the console wouldn't even let me try to create a JdbcDatabase, but I plowed on anyway and wrote and uploaded a JSON definition by hand (here's the relevant bit):
{
"id" : "ExportToSnowflake",
"name" : "ExportToSnowflake",
"type" : "SqlActivity",
"schedule" : { "ref" : "DefaultSchedule" },
"database" : { "ref" : "SnowflakeDatabase" },
"dependsOn" : { "ref" : "ImportTickets" },
"script" : "COPY INTO ZENDESK_TICKETS_INCREMENTAL_PLAYGROUND FROM #zendesk_incremental_stage"
},
{
"id" : "SnowflakeDatabase",
"name" : "SnowflakeDatabase",
"type" : "JdbcDatabase",
"jdbcDriverClass" : "com.snowflake.client.jdbc.SnowflakeDriver",
"username" : "redacted",
"connectionString" : "jdbc:snowflake://redacted.snowflakecomputing.com:8080/?account=redacted&db=redacted&schema=PUBLIC&ssl=on",
"*password" : "redacted"
}
When I upload this into the designer, it refuses to activate, giving me this error message:
ERROR: 'database' values must be of type 'RedshiftDatabase'. Found values of type 'JdbcDatabase'
The rest of the pipeline definition works fine without any errors. I've confirmed that it activates and runs to success if I simply leave this step out.
I am unable to find a single mention on the entire Internet of someone actually using a JdbcDatabase from Data Pipeline. Does it just plain not work? Why is it even mentioned in the documentation if there's no way to actually use it? Or am I missing something? I'd love to know if this is a futile exercise before I blow more of the client's money trying to figure out what's going on.
In your JdbcDatabase you need to have the following property:
jdbcDriverJarUri: "[S3 path to the driver jar file]"
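A minimal sketch of how the JdbcDatabase object from the question would look with that property added; the S3 path is just a placeholder for wherever you've uploaded the Snowflake JDBC driver jar:
{
"id" : "SnowflakeDatabase",
"name" : "SnowflakeDatabase",
"type" : "JdbcDatabase",
"jdbcDriverClass" : "com.snowflake.client.jdbc.SnowflakeDriver",
"jdbcDriverJarUri" : "s3://your-bucket/snowflake-jdbc.jar",
"username" : "redacted",
"connectionString" : "jdbc:snowflake://redacted.snowflakecomputing.com:8080/?account=redacted&db=redacted&schema=PUBLIC&ssl=on",
"*password" : "redacted"
}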
Both my indices are on the same node. The source has about 200k documents, I'm using AWS and the instance type is "t3.small.search" so 2 vCPUs. I tried slicing already but it just gives me the same error. Any ideas on what I can do to make this process finish successfully?
{
"error" : {
"root_cause" : [
{
"type" : "es_rejected_execution_exception",
"reason" : "rejected execution of coordinating operation [shard_detail=[fulltext][0][C], shard_coordinating_and_primary_bytes=0, shard_operation_bytes=98296362, shard_max_coordinating_and_primary_bytes=105630] OR [node_coordinating_and_primary_bytes=0, node_replica_bytes=0, node_all_bytes=0, node_operation_bytes=98296362, node_max_coordinating_and_primary_bytes=105630924]"
}
],
"type" : "es_rejected_execution_exception",
"reason" : "rejected execution of coordinating operation [shard_detail=[fulltext][0][C], shard_coordinating_and_primary_bytes=0, shard_operation_bytes=98296362, shard_max_coordinating_and_primary_bytes=105630] OR [node_coordinating_and_primary_bytes=0, node_replica_bytes=0, node_all_bytes=0, node_operation_bytes=98296362, node_max_coordinating_and_primary_bytes=105630924]"
},
"status" : 429
}
I ran into a similar problem. I was trying to reindex a couple of indices that had a lot of documents in them. I raised the JVM heap size from 512 MB to 2 GB and that fixed the problem. (The node_max_coordinating_and_primary_bytes limit in the error comes from Elasticsearch's indexing pressure threshold, which defaults to 10% of the heap, so a larger heap raises the ceiling those bulk batches were hitting.)
Check current JVM HeapSize:
GET {ES_URL}/_cat/nodes?h=heap*&v
Here's how you can change the settings: https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html
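The linked page covers a self-managed cluster, where the heap is set in jvm.options; a minimal sketch of the change I made is below (the 2 GB value is simply what worked for me, and on the managed AWS service the heap is tied to the instance type rather than this file):
# config/jvm.options (or a file under config/jvm.options.d/)
-Xms2g
-Xmx2g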
Hope this helps.
I have started trying out Google Cloud Data Fusion as a prospective ETL tool that I may finally decide to use. When building a data pipeline to fetch data from a REST API source and load it into a MySQL database, I am facing this error: "Expected a string but was NULL at line 1 column 221'. Please check the system logs for more details." And yes, it's true, there is a field that is null in the JSON response I'm seeing:
"systemanswertime": null
How do I deal with null values? The String type available in the Cloud Data Fusion Studio dropdown is not working; are there other data types that I can use?
Below are two screenshots showing my current data pipeline structure:
[Screenshot: general view]
[Screenshot: mapping and the output schema]
Thank You!!
What you need to do is tell the HTTP plugin that you are expecting a null by checking the Null checkbox in front of that field in the output schema on the right side.
You might be getting this error because the JSON schema defines that value as a plain string. You should allow the systemanswertime field to be null.
You could try to parse the JSON value as follows:
"systemanswertime": {
"type": [
"string",
"null"
]
}
In case you don't have access to the JSON file, you could try to use this plugin to let the HTTP source handle nullable values by dynamically substituting the configuration, which can be served by an HTTP server. You will need to construct an accessible HTTP endpoint that can serve content similar to:
{
"name" : "output.schema", "type" : "schema", "value" :
[
{ "name" : "id", "type" : "int", "nullable" : true},
{ "name" : "first_name", "type" : "string", "nullable" : true},
{ "name" : "last_name", "type" : "string", "nullable" : true},
{ "name" : "email", "type" : "string", "nullable" : true},
]
},
In case you are facing an error such as No matching schema found for union type: ["string","null"], you could try the following workaround. The root cause of this error is that some entries in the response from the API don't have all the fields they are expected to have. For example, some entries may have callerId, channel, last_channel, last data, etc., but other entries may not have last_channel or some other field from the JSON. This leads to a mismatch with the schema provided in the HTTP source, and the pipeline fails right away.
As per this, when nodes encounter null values, logical errors, or other sources of errors, you may use an error handler plugin to catch errors. The approach is as follows:
In the HTTP source plug-in, change the following:
Update the output schema to account for the custom field.
Update the JSON/XML field mapping to account for the custom field.
Change the Non-HTTP Error Handling field to Send to Error. This way the problematic records are pushed through the error collector and the pipeline proceeds with subsequent records.
Then add an Error Collector and a sink to capture the error records.
With this method you will be able to run the pipeline and have the problematic records detected.
Kind regards,
Manuel
I'm setting up a data pipeline in AWS and plan to use the "Getting started using ShellCommandActivity" template to run a shell script. How can I pass credentials stored in an SSM parameter as a parameter to this script?
I haven't verified that, but ShellCommandActivity is similar to ShellScriptConfig from what I can tell. Based on the examples provided for these commands, I would think that you could pass the ssm param name as follows:
{
"id" : "CreateDirectory",
"type" : "ShellCommandActivity",
"command" : "your-script.sh <name-of-your-parameter>"
}
or
{
"id" : "CreateDirectory",
"type" : "ShellCommandActivity",
"scriptUri" : "s3://my-bucket/example.sh",
"scriptArgument" : ["<name-of-your-parameter>"]
}
and in the example.sh you would use $1 to refer to the value of the argument passed.
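For example, a minimal example.sh might look like the sketch below; the aws ssm get-parameter call and the variable names are my own illustration (not part of the template), and it assumes the resource running the activity has an IAM role that can read and decrypt the parameter:
#!/bin/bash
# $1 is the SSM parameter name passed in via "command" or "scriptArgument"
PARAM_NAME="$1"
# Fetch the decrypted value from SSM Parameter Store
CREDENTIAL=$(aws ssm get-parameter --name "$PARAM_NAME" --with-decryption --query 'Parameter.Value' --output text)
# ... use "$CREDENTIAL" in the rest of the script ...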
I am trying to do this in AWS Elasticsearch, whereby I create a template for the pattern application-logs-* and then apply an index policy, log-rotation-policy, to all indexes which match that pattern. I have created my policy successfully, but when I try to create a template like so:
PUT _template/application-logs
{
"index_patterns" : [
"application-logs-*"
],
"settings" : {
"index.lifecycle.name": "log-rotation-policy",
}
}
I get an error:
"type": "illegal_argument_exception",
"reason": "unknown setting [index.policy_id] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
The AWS documentation is extremely vague here.
OK, sorry, I thought I would post this answer anyway because as I was writing it I figured out the problem. The correct key to use is opendistro.index_state_management.policy_id, so it should be:
PUT _template/application-logs
{
"index_patterns" : [
"application-logs-*"
],
"settings" : {
"opendistro.index_state_management.policy_id": "log-rotation-policy",
}
}
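To sanity-check that the template was stored with that setting, you can fetch it back (this just reads back the template created above):
GET _template/application-logs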
I found the answer here.
I need to create a Hybris customer sync application with an external system.
I'm trying to pull only customers that have been modified after a specific date-time but not having any luck.
Looking at the Hybris documentation, it indicates that something like this should work, but it doesn't:
http://localhost:9001/ws410/rest/customers?customer_query=modifiedtime%20%3E%202016%2D03%2D14&customers_size=5&customer_attributes=modifiedtime
It just returns all of the customers.
I've tried all kinds of variations of date format, etc.
Anyone have an example of how to create the query using the HYBRIS REST API?
Found it.
Had the wrong parameter name (customers_query, not customer_query) and had to do a conversion on the date (or at least that works).
Here it is by date only:
http://tphybris-vm:9001/ws410/rest/customers?customers_size=50&customer_attributes=modifiedtime&customers_query=%7Bmodifiedtime%7D%20%3E%20TO_TIMESTAMP('2016-10-21'%2C%20'YYYY-MM-DD')
Returns:
{
"#uri" : "http://tphybris-vm:9001/ws410/rest/customers?customers_size=50&customer_attributes=modifiedtime&customers_query=%7Bmodifiedtime%7D%20%3E%20TO_TIMESTAMP('2016-10-21'%2C%20'YYYY-MM-DD')",
"customer" : {
"#uri" : "http://tphybris-vm:9001/ws410/rest/customers/anonymous",
"modifiedtime" : "2016-10-21T10:30:01.099-07:00",
"authorizedToUnlockPages" : "false",
"loginDisabled" : "false"
}
}
Here it is by date and time:
http://tphybris-vm:9001/ws410/rest/customers?customers_size=50&customer_attributes=modifiedtime&customers_query=%7Bmodifiedtime%7D%20%3E%20TO_TIMESTAMP('2016-10-21%2010%3A30%3A00'%2C%20'YYYY-MM-DD%20HH%3AMI%3ASS')
Returns:
{
"#uri" : "http://tphybris-vm:9001/ws410/rest/customers?customers_size=50&customer_attributes=modifiedtime&customers_query=%7Bmodifiedtime%7D%20%3E%20TO_TIMESTAMP('2016-10-21%2010%3A30%3A00'%2C%20'YYYY-MM-DD%20HH%3AMI%3ASS')",
"customer" : {
"#uri" : "http://tphybris-vm:9001/ws410/rest/customers/anonymous",
"modifiedtime" : "2016-10-21T10:30:01.099-07:00",
"authorizedToUnlockPages" : "false",
"loginDisabled" : "false"
}
}
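For reference, decoding the URL escapes (%7B, %7D, %3E, %3A, %2C) shows the customers_query values being sent in the two requests above:
{modifiedtime} > TO_TIMESTAMP('2016-10-21', 'YYYY-MM-DD')
{modifiedtime} > TO_TIMESTAMP('2016-10-21 10:30:00', 'YYYY-MM-DD HH:MI:SS')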