Pentaho ETL Transformation Using Lamda - amazon-web-services

Is it possible to run Pentaho ETL Jobs/transformation using AWS Lamda functions?
I have Pentaho ETL jobs running on schedule on the Windows server, we are planning to migrate to AWS. I am considering the Lambda function. just to understand if it is possible to schedule the Pentaho ETL Jobs using AWS Lamdba

Here is the snippet of code that I was able to successfully run in AWS Lambda Function.
handleRequest Function is called from AWS Lambda Function
public Integer handleRequest(String input, Context context) {
parseInput(input);
return executeKtr(transName);
}
parseInput: This function is used to parse out a string parameter passed by Lambda Function to extract KTR name and its parameters with value. Format of the input is "ktrfilename param1=value1 param2=value2"
public static void parseInput(String input) {
String[] tokens = input.split(" ");
transName = tokens[0].replace(".ktr", "") + ".ktr";
for (int i=1; i<tokens.length; i++) {
params.add(tokens[i]);
}
}
Executing KTR: I am using git repo to store all my KTR files and based on the name passed as a parameter KTR is executed
public static Integer executeKtr(String ktrName) {
try {
System.out.println("Present Project Directory : " + System.getProperty("user.dir"));
String transName = ktrName.replace(".ktr", "") + ".ktr";
String gitURI = awsSSM.getParaValue("kattle-trans-git-url");
String repoLocalPath = clonePDIrepo.cloneRepo(gitURI);
String path = new File(repoLocalPath + "/" + transName).getAbsolutePath();
File ktrFile = new File(path);
System.out.println("KTR Path: " + path);
try {
/**
* IMPORTANT NOTE FOR LAMBDA FUNCTION MUST CREATE .KEETLE DIRECOTRY OTHERWISE
* CODE WILL FAIL IN LAMBDA FUNCTION WITH ERROR CANT CREATE
* .kettle/kettle.properties file.
*
* ALSO SET ENVIRNOMENT VARIABLE ON LAMBDA FUNCTION TO POINT
* KETTLE_HOME=/tmp/.kettle
*/
Files.createDirectories(Paths.get("/tmp/.kettle"));
} catch (IOException e) {
e.printStackTrace();
throw new RuntimeException("Error Creating /tmp/.kettle directory");
}
if (ktrFile.exists()) {
KettleEnvironment.init();
TransMeta metaData = new TransMeta(path);
Trans trans = new Trans(metaData);
// SETTING PARAMETERS
trans = parameterSetting(trans);
trans.execute( null );
trans.waitUntilFinished();
if (trans.getErrors() > 0) {
System.out.print("Error Executing transformation");
throw new RuntimeException("There are errors in running transformations");
} else {
System.out.print("Successfully Executed Transformation");
return 1;
}
} else {
System.out.print("KTR File:" + path + " not found in repo");
throw new RuntimeException("KTR File:" + path + " not found in repo");
}
} catch (KettleException e) {
e.printStackTrace();
throw new RuntimeException(e.getMessage());
}
}
parameterSetting: If KTR is accepting parameter and it is passed while calling AWS Lambda function, it is set using parameterSetting function.
public static Trans parameterSetting(Trans trans) {
String[] transParams = trans.listParameters();
for (String param : transParams) {
for (String p: params) {
String name = p.split("=")[0];
String val = p.split("=")[1];
if (name.trim().equals(param.trim())) {
try {
System.out.println("Setting Parameter:"+ name + "=" + val);
trans.setParameterValue(name, val);
} catch (UnknownParamException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
trans.activateParameters();
return trans;
}
CloneGitRepo:
public class clonePDIrepo {
/**
* Clones the given repo to local folder
*
* #param pathWithPwd Gir repo URL with access token included in the url. e.g.
* https://token_name:token_value#github.com/ktr-git-repo.git
* #return returns Local Repository String Path
*/
public static String cloneRepo(String pathWithPwd) {
try {
/**
* CREATING TEMP DIR TO AVOID FOLDER EXISTS ERROR, THIS TEMP DIRECTORY LATER CAN
* BE USED TO GET ABSOLETE PATH FOR FILES IN DIRECTORY
*/
File pdiLocalPath = Files.createTempDirectory("repodir").toFile();
Git git = Git.cloneRepository().setURI(pathWithPwd).setDirectory(pdiLocalPath).call();
System.out.println("Git repository cloned successfully");
System.out.println("Local Repository Path:" + pdiLocalPath.getAbsolutePath());
// }
return pdiLocalPath.getAbsolutePath();
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
}
AWSSSMgetParaValue: Gets string value of the parameter passed.
public static String getParaValue(String paraName) {
try {
Region region = Region.US_EAST_1;
SsmClient ssmClient = SsmClient.builder()
.region(region)
.build();
GetParameterRequest parameterRequest = GetParameterRequest.builder()
.name(paraName)
.withDecryption(true)
.build();
GetParameterResponse parameterResponse = ssmClient.getParameter(parameterRequest);
System.out.println(paraName+ " value retreived from AWS SSM");
ssmClient.close();
return parameterResponse.parameter().value();
} catch (SsmException e) {
System.err.println(e.getMessage());
return null;
}
}
Assumptions:
Git repo is created with KTR files in the root of the repo
git repo url exists on the aws SSM with valid tokens to clone the repo
Input string contains name of the KTR file
Environment Variable is configured on Lambda Function for KETTLE_HOME=/tmp/.kettle
Lambda Function has necessary permissions for SSM and S3 VPC Network
Proper Security Group rules are setup to allow required network access for the KTR File
I am planning to upload complete code to git. I will update this post with the URL of the repository.

Related

fetch batch translate job details using SDK

I was trying to find the LanguagePair and Operation associated with a given AWS translate job using the JAVA SDK.
Using the AWS web console, i created a couple of batch jobs to translate a few english sentences to french. In CloudWatch, i could see the metric dimensions as
LanguagePair: en-fr
Operation: TranslateText
Can i retrieve the same information (LanguagePair and Operation) for a given job,
using the TranslateAsyncClient.describeTextTranslationJob(...) method ?
You can use the TranslateClient to describes a translation job given the job number as input. This uses the TranslateClient; however you can use the Async version as well.
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.translate.TranslateClient;
import software.amazon.awssdk.services.translate.model.DescribeTextTranslationJobRequest;
import software.amazon.awssdk.services.translate.model.DescribeTextTranslationJobResponse;
import software.amazon.awssdk.services.translate.model.TranslateException;
// snippet-end:[translate.java2._describe_jobs.import]
/**
* To run this Java V2 code example, ensure that you have setup your development environment, including your credentials.
*
* For information, see this documentation topic:
*
* https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
*/
public class DescribeTextTranslationJob {
public static void main(String[] args) {
final String USAGE = "\n" +
"Usage:\n" +
" DescribeTextTranslationJob <id> \n\n" +
"Where:\n" +
" id - a translation job ID value. You can obtain this value from the BatchTranslation example.\n";
if (args.length != 1) {
System.out.println(USAGE);
System.exit(1);
}
String id = args[0];
Region region = Region.US_WEST_2;
TranslateClient translateClient = TranslateClient.builder()
.region(region)
.build();
describeTextTranslationJob(translateClient, id);
translateClient.close();
}
// snippet-start:[translate.java2._describe_jobs.main]
public static void describeTextTranslationJob(TranslateClient translateClient, String id) {
try {
DescribeTextTranslationJobRequest textTranslationJobRequest = DescribeTextTranslationJobRequest.builder()
.jobId(id)
.build();
DescribeTextTranslationJobResponse jobResponse = translateClient.describeTextTranslationJob(textTranslationJobRequest);
System.out.println("The job status is "+jobResponse.textTranslationJobProperties().jobStatus());
System.out.println("The source language is "+jobResponse.textTranslationJobProperties().sourceLanguageCode());
System.out.println("The target language is "+jobResponse.textTranslationJobProperties().targetLanguageCodes());
} catch (TranslateException e) {
System.err.println(e.getMessage());
System.exit(1);
}
// snippet-end:[translate.java2._describe_jobs.main]
}
}
To see all the data you can get back using this code, see this JavaDoc - https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/translate/model/TextTranslationJobProperties.html

Accessing specified key from s3 bucket?

I have a S3 bucket xxx. I wrote one lambda function to access data from s3 bucket and writing those details to a RDS PostgreSQL instance. I can do it with my code. I added one trigger to the lambda function for invoking the same when a file falls on s3.
But from my code I can only read file having name 'sampleData.csv'. consider my code given below
public class LambdaFunctionHandler implements RequestHandler<S3Event, String> {
private AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
public LambdaFunctionHandler() {}
// Test purpose only.
LambdaFunctionHandler(AmazonS3 s3) {
this.s3 = s3;
}
#Override
public String handleRequest(S3Event event, Context context) {
context.getLogger().log("Received event: " + event);
String bucket = "xxx";
String key = "SampleData.csv";
System.out.println(key);
try {
S3Object response = s3.getObject(new GetObjectRequest(bucket, key));
String contentType = response.getObjectMetadata().getContentType();
context.getLogger().log("CONTENT TYPE: " + contentType);
// Read the source file as text
AmazonS3 s3Client = new AmazonS3Client();
String body = s3Client.getObjectAsString(bucket, key);
System.out.println("Body: " + body);
System.out.println();
System.out.println("Reading as stream.....");
System.out.println();
BufferedReader br = new BufferedReader(new InputStreamReader(response.getObjectContent()));
// just saving the excel sheet data to the DataBase
String csvOutput;
try {
Class.forName("org.postgresql.Driver");
Connection con = DriverManager.getConnection("jdbc:postgresql://ENDPOINT:5432/DBNAME","USER", "PASSWORD");
System.out.println("Connected");
// Checking EOF
while ((csvOutput = br.readLine()) != null) {
String[] str = csvOutput.split(",");
String name = str[1];
String query = "insert into schema.tablename(name) values('"+name+"')";
Statement statement = con.createStatement();
statement.executeUpdate(query);
}
System.out.println("Inserted Successfully!!!");
}catch (Exception ase) {
context.getLogger().log(String.format(
"Error getting object %s from bucket %s. Make sure they exist and"
+ " your bucket is in the same region as this function.", key, bucket));
// throw ase;
}
return contentType;
} catch (Exception e) {
e.printStackTrace();
context.getLogger().log(String.format(
"Error getting object %s from bucket %s. Make sure they exist and"
+ " your bucket is in the same region as this function.", key, bucket));
throw e;
}
}
From my code you can see that I mentioned key="SampleData.csv"; is there any way to get the key inside a bucket without specifying a specific file name?
These couple of links would be of help.
http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingJava.html
You can list objects using prefix and delimiter to find the key you are looking for without passing a specific filename.
If you need to get the event details on S3, you can actually enable the s3 event notifier to lambda function. Refer the link
You can enable this by,
Click on 'Properties' inside your bucket
Click on 'Events '
Click 'Add notification'
Give a name and select the type of event (eg. Put, delete etc.)
Give prefix and suffix if necessary or else leave blank which consider all events
Then 'Sent to' Lambda function and provide the Lambda ARN.
Now the event details will be sent lambda function as a json format. You can fetch the details from that json. The input will be like this:
{"Records":[{"eventVersion":"2.0","eventSource":"aws:s3","awsRegion":"ap-south-1","eventTime":"2017-11-23T09:25:54.845Z","eventName":"ObjectRemoved:Delete","userIdentity":{"principalId":"AWS:AIDAJASDFGZTLA6UZ7YAK"},"requestParameters":{"sourceIPAddress":"52.95.72.70"},"responseElements":{"x-amz-request-id":"A235BER45D4974E","x-amz-id-2":"glUK9ZyNDCjMQrgjFGH0t7Dz19eBrJeIbTCBNI+Pe9tQugeHk88zHOY90DEBcVgruB9BdU0vV8="},"s3":{"s3SchemaVersion":"1.0","configurationId":"sns","bucket":{"name":"example-bucket1","ownerIdentity":{"principalId":"AQFXV36adJU8"},"arn":"arn:aws:s3:::example-bucket1"},"object":{"key":"SampleData.csv","sequencer":"005A169422CA7CDF66"}}}]}
You can access the key as objectname = event['Records'][0]['s3']['object']['key'](Oops, this is for python)
and then sent this info to RDS.

AWS S3 returns 404 for a file that definitely still exists there

We have some code that downloads a bunch of S3 files to a local directory. The list of files to retrieve is from a query we run. It only lists files that actually exist in our S3 bucket.
As we loop to retrieve these files, about 10% of them return a 404 error as if the file doesn't exist. I log out the name/location of that file, so I can go to S3 and check, and sure enough every single one of the IS ON S3 in the location we went looking for it.
Why does S3 throw a 404 when the file exists?
Here is the Groovy code of the script.
class RetrieveS3FilesFromCSVLoader implements Loader {
private static String missingFilesFile = "00-MISSED_FILES.csv"
private static String csvFileName = "/csv/s3file2.csv"
private static String saveFilesToLocation = "/tmp/retrieve/"
public static final char SEPARATOR = ','
#Autowired
DocumentFileService documentFileService
private void readWithCommaSeparatorSQL() {
int counter = 0
String fileName
String fileLocation
File missedFiles = new File(saveFilesToLocation + missingFilesFile)
PrintWriter writer = new PrintWriter(missedFiles)
File fileCSV = new File(getClass().getResource(csvFileName).toURI())
fileCSV.splitEachLine(SEPARATOR as String) { nextLine ->
//if (counter < 15) {
if (nextLine != null && (nextLine[0] != 'FileLocation')) {
counter++
try {
//Remove 0, only if client number start with "0".
fileLocation = nextLine[0].trim()
byte[] fileBytes = documentFileService.getFile(fileLocation)
if (fileBytes != null) {
fileName = fileLocation.substring(fileLocation.indexOf("/") + 1, fileLocation.length())
File file = new File(saveFilesToLocation + fileName)
file.withOutputStream {
it.write fileBytes
}
println "$counter) Wrote file ${fileLocation} to ${saveFilesToLocation + fileLocation}"
} else {
println "$counter) UNABLE TO RETRIEVE FILE ELSE: $fileLocation"
writer.println(fileLocation)
}
} catch (Exception e) {
println "$counter) UNABLE TO RETRIEVE FILE: $fileLocation"
println(e.getMessage())
writer.println(fileLocation)
}
} else {
counter++;
}
//}
}
writer.close()
}
Here is the code for getFile(fileLocation) and client creation.
public byte[] getFile(String filename) throws IOException {
AmazonS3Client s3Client = connectToAmazonS3Service();
S3Object object = s3Client.getObject(S3_BUCKET_NAME, filename);
if(object == null) {
return null;
}
byte[] fileAsArray = IOUtils.toByteArray(object.getObjectContent());
object.close();
return fileAsArray;
}
/**
* Connects to Amazon S3
*
* #return instance of AmazonS3Client
*/
private AmazonS3Client connectToAmazonS3Service() {
AWSCredentials credentials;
try {
credentials = new BasicAWSCredentials(S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY);
} catch (Exception e) {
throw new AmazonClientException(
"Cannot load the credentials from the credential profiles file. " +
"Please make sure that your credentials file is at the correct " +
"location (~/.aws/credentials), and is in valid format.",
e);
}
AmazonS3Client s3 = new AmazonS3Client(credentials);
Region usWest2 = Region.getRegion(Regions.US_EAST_1);
s3.setRegion(usWest2);
return s3;
}
The code above works for 90% of the files in the list passed to the script, but we know with fact that all 100% of the files exist in S3 and with the location String we are passing.
I am just an idiot. Thought it had the production AWS credentials in the properties file. Instead it was development credentials. So I had the wrong credentials.

Service to upload images and videos to (S3, CloudFront etc)

I am looking for a service to upload videos and images from my mobile applications (frontend). I have heard about Amazon S3 and CloudFront. I am looking for a service that will store them, and will also be able to check if they meet certain criteria (for example, maximum file size of 3MB per picture), and return an error to the client if the file doesn't meet the criteria. Does Amazon S3 or CloudFront provide this? If not, is there any other recommended service for this?
You could use the AWS SDK. Here follows an example of the Java version (Amazon provides SDKs for different languages):
/**
* It stores the given file name in S3 and returns the key under which the file has been stored
* #param resource
* #param bucketName
* #return
*/
public String storeProfileImage(File resource, String bucketName, String username) {
String resourceUrl = null;
if (!resource.exists()) {
throw new IllegalArgumentException("The file " + resource.getAbsolutePath() + " doesn't exist");
}
long lengthInBytes = resource.length();
//For demo purposes. You should use a configurable property for the max size
if (lengthInBytes > (3 * 1024)) {
//Your error handling here
}
AccessControlList acl = new AccessControlList();
acl.grantPermission(GroupGrantee.AllUsers, Permission.Read);
String key = username + "/profilePicture." + FilenameUtils.getExtension(resource.getName());
try {
s3Client.putObject(new PutObjectRequest(bucketName, key, resource).withAccessControlList(acl));
resourceUrl = s3Client.getResourceUrl(bucketName, key);
} catch (AmazonClientException ace) {
LOG.error("A client exception occurred while trying to store the profile" +
" image {} on S3. The profile image won't be stored", resource.getAbsolutePath(), ace);
}
return resourceUrl;
}
You can also perform other operations, e.g. check if the bucket exists before storing the image
/**
* Returns the root URL where the bucket name is located.
* <p>Please note that the URL does not contain the bucket name</p>
* #param bucketName The bucket name
* #return the root URL where the bucket name is located.
*/
public String ensureBucketExists(String bucketName) {
String bucketUrl = null;
try {
if (!s3Client.doesBucketExist(bucketName)) {
LOG.warn("Bucket {} doesn't exists...Creating one");
s3Client.createBucket(bucketName);
LOG.info("Created bucket: {}", bucketName);
}
bucketUrl = s3Client.getResourceUrl(bucketName, null) + bucketName;
} catch (AmazonClientException ace) {
LOG.error("An error occurred while connecting to S3. Will not execute action" +
" for bucket: {}", bucketName, ace);
}
return bucketUrl;
}

Updating a file in Amazon S3 bucket

I am trying to append a string to the end of a text file stored in S3.
Currently I just read the contents of the file into a String, append my new text and resave the file back to S3.
Is there a better way to do this. I am thinkinig when the file is >>> 10MB then reading the entire file would not be a good idea so how should I do this correctly?
Current code
[code]
private void saveNoteToFile( String p_note ) throws IOException, ServletException
{
String str_infoFileName = "myfile.json";
String existingNotes = s3Helper.getfileContentFromS3( str_infoFileName );
existingNotes += p_note;
writeStringToS3( str_infoFileName , existingNotes );
}
public void writeStringToS3(String p_fileName, String p_data) throws IOException
{
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream( p_data.getBytes());
try {
streamFileToS3bucket( p_fileName, byteArrayInputStream, p_data.getBytes().length);
}
catch (AmazonServiceException e)
{
e.printStackTrace();
} catch (AmazonClientException e)
{
e.printStackTrace();
}
}
public void streamFileToS3bucket( String p_fileName, InputStream input, long size)
{
//Create sub folders if there is any in the file name.
p_fileName = p_fileName.replace("\\", "/");
if( p_fileName.charAt(0) == '/')
{
p_fileName = p_fileName.substring(1, p_fileName.length());
}
String folder = getFolderName( p_fileName );
if( folder.length() > 0)
{
if( !doesFolderExist(folder))
{
createFolder( folder );
}
}
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(size);
AccessControlList acl = new AccessControlList();
acl.grantPermission(GroupGrantee.AllUsers, Permission.Read);
s3Client.putObject(new PutObjectRequest(bucket, p_fileName , input,metadata).withAccessControlList(acl));
}
[/code]
It's not possible to append to an existing file on AWS S3. When you upload an object it creates a new version if it already exists:
If you upload an object with a key name that already exists in the
bucket, Amazon S3 creates another version of the object instead of
replacing the existing object
Source: http://docs.aws.amazon.com/AmazonS3/latest/UG/ObjectOperations.html
The objects are immutable.
It's also mentioned in these AWS Forum threads:
https://forums.aws.amazon.com/message.jspa?messageID=179375
https://forums.aws.amazon.com/message.jspa?messageID=540395
It's not possible to append to an existing file on AWS S3.
You can delete existing file and upload new file with same name.
Configuration
private string bucketName = "my-bucket-name-123";
private static string awsAccessKey = "AKI............";
private static string awsSecretKey = "+8Bo..................................";
IAmazonS3 client = new AmazonS3Client(awsAccessKey, awsSecretKey,
RegionEndpoint.APSoutheast2);
string awsFile = "my-folder/sub-folder/textFile.txt";
string localFilePath = "my-folder/sub-folder/textFile.txt";
To Delete
public void DeleteRefreshTokenFile()
{
try
{
var deleteFileRequest = new DeleteObjectRequest
{
BucketName = bucketName,
Key = awsFile
};
DeleteObjectResponse fileDeleteResponse = client.DeleteObject(deleteFileRequest);
}
catch (Exception ex)
{
throw new Exception(ex.Message);
}
}
To Upload
public void UploadRefreshTokenFile()
{
FileInfo file = new FileInfo(localFilePath);
try
{
PutObjectRequest request = new PutObjectRequest()
{
InputStream = file.OpenRead(),
BucketName = bucketName,
Key = awsFile
};
PutObjectResponse response = client.PutObject(request);
}
catch (Exception ex)
{
throw new Exception(ex.Message);
}
}
One option is to write the new lines/information to a new version of the file. This would create a LARGE number of versions. But, essentially, whatever program you are using the file for could read ALL the versions and append them back together when reading it (this seems like a really bad idea as I write it out).
Another option would be to write a new object each time with a time stamp appended to the object name. my-log-file-date-time . Then whatever program is reading from it could append them all together after downloading my-log-file-*.
You would want to delete objects older than a certain time just like log rotation.
Depending on how busy your events are this might work. If you have thousands per second, I don't think this would work. But if you just have a few events per minute it may be reasonable.
You can do it with s3api put-object.
First download the version you want and use below commend. it will upload as the latest version.
ᐅ aws s3api put-object --bucket $BUCKET --key $FOLDER/$FILE --body $YOUR_LOCAL_DOWNLOADED_VERSION_FILE