Read n number of lines from an S3 object using AWS Lambda

In my Lambda I am trying to parse the content of a document from an S3 bucket. The document I am processing is a txt file larger than 100 MB, and I need to parse only the first line of the file.
What is the best cost-effective way to read the file?
Currently I am fetching the content using the getObjectContent() method and taking the first line from it like this:
private AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();

GetObjectRequest getObjectRequest = new GetObjectRequest(bucket, key);
S3Object s3Object = s3.getObject(getObjectRequest);
BufferedReader reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
String firstLine;
try {
    while ((firstLine = reader.readLine()) != null) {
        logger.log("META PROCESSOR | FIRST LINE OF FILE : " + firstLine);
        break;
    }
} catch (IOException e) {
    logger.log("META PROCESSOR | FAILED TO LOAD FIRST LINE ");
    return null;
}
Is it a good idea to read the entire content just to get the first line? Is there any method available to read the first n lines, or the first n bytes, of a file?
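One cost-effective option worth noting: S3 supports ranged GET requests, so you can download just the first few kilobytes instead of the whole 100 MB object. A minimal sketch against the v1 Java SDK, reusing the bucket, key, s3 and logger from the code above (the 8 KB window is an assumption; pick one comfortably longer than your longest expected first line):
GetObjectRequest rangedRequest = new GetObjectRequest(bucket, key)
        .withRange(0, 8191); // inclusive byte range: only bytes 0..8191 are transferred
S3Object s3Object = s3.getObject(rangedRequest);
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(s3Object.getObjectContent()))) {
    String firstLine = reader.readLine(); // first line within the fetched window
    logger.log("META PROCESSOR | FIRST LINE OF FILE : " + firstLine);
}
If the first line could be longer than the window, repeat with the next byte range until a newline shows up.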

Related

Error 414 when sending an invoice to Amazon MWS with _UPLOAD_VAT_INVOICE_

I'm trying to send invoices to Amazon MWS through _UPLOAD_VAT_INVOICE_, following the Java example in this guide:
Link
The PDF file is a simple invoice of 85 KB.
The error is status code 414, which is "URI Too Long".
Debugging the original Amazon class MarketplaceWebServiceClient, I see this:
if (request instanceof SubmitFeedRequest) {
    // For SubmitFeed, HTTP body is reserved for the Feed Content and the function parameters
    // are contained within the HTTP header
    SubmitFeedRequest sfr = (SubmitFeedRequest) request;
    method = new HttpPost(config.getServiceURL() + "?" + getSubmitFeedUrlParameters(parameters));
The getSubmitFeedUrlParameters method takes every parameter and adds it to the query string. One of these parameters is contentMD5, from:
String contentMD5 = Base64.encodeBase64String(pdfDocument);
So there is a very large string representing the PDF file passed as a parameter, and this causes the 414 error.
But that class is the original one taken from MaWSJavaClientLibrary-1.1.jar.
Can anybody help me please?
Thanks
I was working on the same problem for the last two days. I changed it like this and it works now:
InputStream contentStream = new ByteArrayInputStream(pdfDocument);
String contentMD5 = computeContentMD5Header(new ByteArrayInputStream(pdfDocument));
public static String computeContentMD5Header(InputStream inputStream) {
    // Consume the stream to compute the MD5 as a side effect.
    DigestInputStream s;
    try {
        s = new DigestInputStream(inputStream, MessageDigest.getInstance("MD5"));
        // Drain the stream, as the digest is computed as a side effect.
        byte[] buffer = new byte[8192];
        while (s.read(buffer) > 0) {
            // keep reading until EOF
        }
        return new String(
                org.apache.commons.codec.binary.Base64.encodeBase64(s.getMessageDigest().digest()),
                "UTF-8");
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
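Side note on the fix above: since pdfDocument is already a byte[] in memory, the stream-draining helper can arguably be collapsed into a direct hash. A sketch, assuming the same commons-codec dependency (getInstance("MD5") still declares NoSuchAlgorithmException, so wrap it as the original does):
// Hash the in-memory bytes directly instead of draining a stream.
String contentMD5 = org.apache.commons.codec.binary.Base64.encodeBase64String(
        java.security.MessageDigest.getInstance("MD5").digest(pdfDocument));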

Accessing a specified key from an S3 bucket?

I have an S3 bucket xxx. I wrote a Lambda function to read data from the S3 bucket and write those details to an RDS PostgreSQL instance, which my code can do. I added a trigger to the Lambda function so it is invoked whenever a file lands in S3.
But from my code I can only read the file named 'SampleData.csv'. Consider my code given below:
public class LambdaFunctionHandler implements RequestHandler<S3Event, String> {

    private AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();

    public LambdaFunctionHandler() {}

    // Test purpose only.
    LambdaFunctionHandler(AmazonS3 s3) {
        this.s3 = s3;
    }

    @Override
    public String handleRequest(S3Event event, Context context) {
        context.getLogger().log("Received event: " + event);
        String bucket = "xxx";
        String key = "SampleData.csv";
        System.out.println(key);
        try {
            S3Object response = s3.getObject(new GetObjectRequest(bucket, key));
            String contentType = response.getObjectMetadata().getContentType();
            context.getLogger().log("CONTENT TYPE: " + contentType);
            // Read the source file as text
            AmazonS3 s3Client = new AmazonS3Client();
            String body = s3Client.getObjectAsString(bucket, key);
            System.out.println("Body: " + body);
            System.out.println();
            System.out.println("Reading as stream.....");
            System.out.println();
            BufferedReader br = new BufferedReader(new InputStreamReader(response.getObjectContent()));
            // just saving the excel sheet data to the DataBase
            String csvOutput;
            try {
                Class.forName("org.postgresql.Driver");
                Connection con = DriverManager.getConnection("jdbc:postgresql://ENDPOINT:5432/DBNAME", "USER", "PASSWORD");
                System.out.println("Connected");
                // Checking EOF
                while ((csvOutput = br.readLine()) != null) {
                    String[] str = csvOutput.split(",");
                    String name = str[1];
                    String query = "insert into schema.tablename(name) values('" + name + "')";
                    Statement statement = con.createStatement();
                    statement.executeUpdate(query);
                }
                System.out.println("Inserted Successfully!!!");
            } catch (Exception ase) {
                context.getLogger().log(String.format(
                        "Error getting object %s from bucket %s. Make sure they exist and"
                        + " your bucket is in the same region as this function.", key, bucket));
                // throw ase;
            }
            return contentType;
        } catch (Exception e) {
            e.printStackTrace();
            context.getLogger().log(String.format(
                    "Error getting object %s from bucket %s. Make sure they exist and"
                    + " your bucket is in the same region as this function.", key, bucket));
            throw e;
        }
    }
}
From my code you can see that I hard-coded key = "SampleData.csv"; is there any way to get the key of a file inside the bucket without specifying a specific file name?
These couple of links would be of help.
http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingObjectKeysUsingJava.html
You can list objects using prefix and delimiter to find the key you are looking for without passing a specific filename.
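For example, a minimal sketch with the v1 Java SDK (the "incoming/" prefix is hypothetical):
// List keys under a prefix instead of hard-coding one file name,
// paging through results 1000 keys at a time.
ListObjectsV2Request listRequest = new ListObjectsV2Request()
        .withBucketName("xxx")
        .withPrefix("incoming/"); // hypothetical prefix
ListObjectsV2Result listing;
do {
    listing = s3.listObjectsV2(listRequest);
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        System.out.println(summary.getKey()); // each key under the prefix
    }
    listRequest.setContinuationToken(listing.getNextContinuationToken());
} while (listing.isTruncated());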
If you need the S3 event details, you can enable S3 event notifications to the Lambda function. Refer to the link.
You can enable this by:
Click 'Properties' inside your bucket.
Click 'Events'.
Click 'Add notification'.
Give a name and select the type of event (e.g. Put, Delete, etc.).
Give a prefix and suffix if necessary, or leave them blank to cover all events.
Then choose 'Send to' Lambda function and provide the Lambda ARN.
Now the event details will be sent to the Lambda function in JSON format. You can fetch the details from that JSON. The input will look like this:
{"Records":[{"eventVersion":"2.0","eventSource":"aws:s3","awsRegion":"ap-south-1","eventTime":"2017-11-23T09:25:54.845Z","eventName":"ObjectRemoved:Delete","userIdentity":{"principalId":"AWS:AIDAJASDFGZTLA6UZ7YAK"},"requestParameters":{"sourceIPAddress":"52.95.72.70"},"responseElements":{"x-amz-request-id":"A235BER45D4974E","x-amz-id-2":"glUK9ZyNDCjMQrgjFGH0t7Dz19eBrJeIbTCBNI+Pe9tQugeHk88zHOY90DEBcVgruB9BdU0vV8="},"s3":{"s3SchemaVersion":"1.0","configurationId":"sns","bucket":{"name":"example-bucket1","ownerIdentity":{"principalId":"AQFXV36adJU8"},"arn":"arn:aws:s3:::example-bucket1"},"object":{"key":"SampleData.csv","sequencer":"005A169422CA7CDF66"}}}]}
You can access the key as objectname = event['Records'][0]['s3']['object']['key'] (oops, that one is Python)
and then send this info to RDS.
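Since the handler in the question is Java, the equivalent of that Python one-liner against the S3Event the handler already receives would look roughly like this (note that keys arrive URL-encoded in event payloads, so decode before calling getObject):
// Sketch: read bucket and key from the S3 event instead of hard-coding them.
String bucket = event.getRecords().get(0).getS3().getBucket().getName();
String key = event.getRecords().get(0).getS3().getObject().getKey();
try {
    // S3 URL-encodes keys in event payloads ("my file.csv" -> "my+file.csv").
    key = URLDecoder.decode(key.replace("+", " "), "UTF-8");
} catch (UnsupportedEncodingException e) {
    throw new RuntimeException(e); // UTF-8 is always available
}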

AWS S3 returns 404 for a file that definitely still exists there

We have some code that downloads a bunch of S3 files to a local directory. The list of files to retrieve is from a query we run. It only lists files that actually exist in our S3 bucket.
As we loop to retrieve these files, about 10% of them return a 404 error as if the file doesn't exist. I log the name/location of each such file so I can check in S3, and sure enough every single one of them IS ON S3 in the location we went looking for it.
Why does S3 throw a 404 when the file exists?
Here is the Groovy code of the script.
class RetrieveS3FilesFromCSVLoader implements Loader {

    private static String missingFilesFile = "00-MISSED_FILES.csv"
    private static String csvFileName = "/csv/s3file2.csv"
    private static String saveFilesToLocation = "/tmp/retrieve/"
    public static final char SEPARATOR = ','

    @Autowired
    DocumentFileService documentFileService

    private void readWithCommaSeparatorSQL() {
        int counter = 0
        String fileName
        String fileLocation
        File missedFiles = new File(saveFilesToLocation + missingFilesFile)
        PrintWriter writer = new PrintWriter(missedFiles)
        File fileCSV = new File(getClass().getResource(csvFileName).toURI())
        fileCSV.splitEachLine(SEPARATOR as String) { nextLine ->
            //if (counter < 15) {
            if (nextLine != null && (nextLine[0] != 'FileLocation')) {
                counter++
                try {
                    //Remove 0, only if client number start with "0".
                    fileLocation = nextLine[0].trim()
                    byte[] fileBytes = documentFileService.getFile(fileLocation)
                    if (fileBytes != null) {
                        fileName = fileLocation.substring(fileLocation.indexOf("/") + 1, fileLocation.length())
                        File file = new File(saveFilesToLocation + fileName)
                        file.withOutputStream {
                            it.write fileBytes
                        }
                        println "$counter) Wrote file ${fileLocation} to ${saveFilesToLocation + fileLocation}"
                    } else {
                        println "$counter) UNABLE TO RETRIEVE FILE ELSE: $fileLocation"
                        writer.println(fileLocation)
                    }
                } catch (Exception e) {
                    println "$counter) UNABLE TO RETRIEVE FILE: $fileLocation"
                    println(e.getMessage())
                    writer.println(fileLocation)
                }
            } else {
                counter++
            }
            //}
        }
        writer.close()
    }
}
Here is the code for getFile(fileLocation) and client creation.
public byte[] getFile(String filename) throws IOException {
    AmazonS3Client s3Client = connectToAmazonS3Service();
    S3Object object = s3Client.getObject(S3_BUCKET_NAME, filename);
    if (object == null) {
        return null;
    }
    byte[] fileAsArray = IOUtils.toByteArray(object.getObjectContent());
    object.close();
    return fileAsArray;
}

/**
 * Connects to Amazon S3
 *
 * @return instance of AmazonS3Client
 */
private AmazonS3Client connectToAmazonS3Service() {
    AWSCredentials credentials;
    try {
        credentials = new BasicAWSCredentials(S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY);
    } catch (Exception e) {
        throw new AmazonClientException(
                "Cannot load the credentials from the credential profiles file. " +
                "Please make sure that your credentials file is at the correct " +
                "location (~/.aws/credentials), and is in valid format.",
                e);
    }
    AmazonS3Client s3 = new AmazonS3Client(credentials);
    Region usWest2 = Region.getRegion(Regions.US_EAST_1);
    s3.setRegion(usWest2);
    return s3;
}
The code above works for 90% of the files in the list passed to the script, but we know for a fact that 100% of the files exist in S3 at the location string we are passing.
I am just an idiot. I thought it had the production AWS credentials in the properties file; instead it was development credentials. So I had the wrong credentials.

What's the efficient way to call an HTTP request and read the InputStream in a Spark map task

Please see the code sample below:
JavaRDD<String> mapRDD = filteredRecords
        .map(new Function<String, String>() {
            @Override
            public String call(String url) throws Exception {
                BufferedReader in = null;
                URL formatURL = new URL((url.replaceAll("\"", "")).trim());
                try {
                    HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
                    in = new BufferedReader(new InputStreamReader(con.getInputStream()));
                    return in.readLine();
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        });
Here url is an HTTP GET request. Example:
http://ip:port/cyb/test?event=movie&id=604568837&name=SID&timestamp_secs=1460494800&timestamp_millis=1461729600000&back_up_id=676700166
This piece of code is very slow. IP and port are random and the load is distributed, so an IP can have 20 different values with the port, so I don't see a bottleneck.
When I comment out
in = new BufferedReader(new InputStreamReader(con.getInputStream()));
return in.readLine();
the code is very fast.
NOTE: the input data to process is 10 GB, read from S3 using Spark.
Is there anything wrong with how I am using BufferedReader or InputStreamReader, and is there any alternative? I can't use foreach in Spark as I have to get the response back from the server and need to save the JavaRDD as a text file on HDFS.
If we use mapPartitions, with code something like below:
JavaRDD<String> mapRDD = filteredRecords.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> tuple) throws Exception {
        final List<String> rddList = new ArrayList<String>();
        Iterable<String> iterable = new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return rddList.iterator();
            }
        };
        while (tuple.hasNext()) {
            URL formatURL = new URL((tuple.next().replaceAll("\"", "")).trim());
            HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
            try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
                rddList.add(br.readLine());
            } catch (IOException ex) {
                return rddList;
            }
        }
        return iterable;
    }
});
Here also we are doing the same thing for each record, aren't we?
Currently you are using the map function, which creates a URL request for each row in the partition.
You can use mapPartitions instead, which will make the code run faster: it lets you set up the connection to the server once and reuse it for every row, that is, one connection setup per partition rather than one per row, as in the sketch below.
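To actually get that reuse, the expensive resource has to be created once per partition and shared across rows; the mapPartitions attempt in the question still opens a fresh HttpURLConnection per record. A sketch of the pattern, assuming Spark 1.x's Iterable-returning FlatMapFunction (as in the question) and Apache HttpClient 4.x on the classpath:
JavaRDD<String> responses = filteredRecords.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> urls) throws Exception {
        List<String> results = new ArrayList<String>();
        // One client per partition: its internal connection pool keeps
        // connections alive between requests, so TCP setup is paid once
        // per host rather than once per row.
        CloseableHttpClient client = HttpClients.createDefault();
        try {
            while (urls.hasNext()) {
                String url = urls.next().replaceAll("\"", "").trim();
                CloseableHttpResponse resp = client.execute(new HttpGet(url));
                try {
                    // Consuming the body fully also releases the connection
                    // back to the pool for reuse.
                    results.add(EntityUtils.toString(resp.getEntity()));
                } finally {
                    resp.close();
                }
            }
        } finally {
            client.close();
        }
        return results;
    }
});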
A big cost here is setting up TCP/HTTPS connections. This is exacerbated by the fact that, even if you only read the first (short) line of a large file, modern HTTP clients try to read() to the end of the file in an attempt to better re-use HTTP/1.1 connections, so avoiding aborting the connection. This is a good strategy for small files, but not for multi-MB ones.
There is a solution: set the range on the read so that only a smaller block is read in, reducing the cost of the close(); the connection recycling then reduces HTTPS setup costs. This is what the latest Hadoop/Spark S3A client does if you set fadvise=random on the connection: it requests blocks rather than the entire multi-GB file. Be aware though: that design is actually really bad if you are going byte-by-byte through a file...
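For the S3A route specifically, that behaviour is a configuration switch rather than code. A sketch, assuming Hadoop 2.8+ where the experimental random-IO policy exists:
// Tell S3A to issue ranged GETs (random IO) instead of reading
// each object to the end on close().
Configuration conf = new Configuration(); // org.apache.hadoop.conf.Configuration
conf.set("fs.s3a.experimental.input.fadvise", "random");
// Or when submitting a Spark job:
//   --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random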

Updating a file in an Amazon S3 bucket

I am trying to append a string to the end of a text file stored in S3.
Currently I just read the contents of the file into a String, append my new text and resave the file back to S3.
Is there a better way to do this? I am thinking that when the file is >>> 10 MB, reading the entire file would not be a good idea, so how should I do this correctly?
Current code
private void saveNoteToFile(String p_note) throws IOException, ServletException
{
    String str_infoFileName = "myfile.json";
    String existingNotes = s3Helper.getfileContentFromS3(str_infoFileName);
    existingNotes += p_note;
    writeStringToS3(str_infoFileName, existingNotes);
}

public void writeStringToS3(String p_fileName, String p_data) throws IOException
{
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(p_data.getBytes());
    try {
        streamFileToS3bucket(p_fileName, byteArrayInputStream, p_data.getBytes().length);
    } catch (AmazonServiceException e) {
        e.printStackTrace();
    } catch (AmazonClientException e) {
        e.printStackTrace();
    }
}

public void streamFileToS3bucket(String p_fileName, InputStream input, long size)
{
    // Create sub-folders if there are any in the file name.
    p_fileName = p_fileName.replace("\\", "/");
    if (p_fileName.charAt(0) == '/') {
        p_fileName = p_fileName.substring(1, p_fileName.length());
    }
    String folder = getFolderName(p_fileName);
    if (folder.length() > 0) {
        if (!doesFolderExist(folder)) {
            createFolder(folder);
        }
    }
    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentLength(size);
    AccessControlList acl = new AccessControlList();
    acl.grantPermission(GroupGrantee.AllUsers, Permission.Read);
    s3Client.putObject(new PutObjectRequest(bucket, p_fileName, input, metadata).withAccessControlList(acl));
}
It's not possible to append to an existing file on AWS S3. When you upload an object it creates a new version if it already exists:
If you upload an object with a key name that already exists in the bucket, Amazon S3 creates another version of the object instead of replacing the existing object
Source: http://docs.aws.amazon.com/AmazonS3/latest/UG/ObjectOperations.html
The objects are immutable.
It's also mentioned in these AWS Forum threads:
https://forums.aws.amazon.com/message.jspa?messageID=179375
https://forums.aws.amazon.com/message.jspa?messageID=540395
It's not possible to append to an existing file on AWS S3, but you can delete the existing file and upload a new file with the same name.
Configuration
private string bucketName = "my-bucket-name-123";
private static string awsAccessKey = "AKI............";
private static string awsSecretKey = "+8Bo..................................";

IAmazonS3 client = new AmazonS3Client(awsAccessKey, awsSecretKey, RegionEndpoint.APSoutheast2);

string awsFile = "my-folder/sub-folder/textFile.txt";
string localFilePath = "my-folder/sub-folder/textFile.txt";
To Delete
public void DeleteRefreshTokenFile()
{
    try
    {
        var deleteFileRequest = new DeleteObjectRequest
        {
            BucketName = bucketName,
            Key = awsFile
        };
        DeleteObjectResponse fileDeleteResponse = client.DeleteObject(deleteFileRequest);
    }
    catch (Exception ex)
    {
        throw new Exception(ex.Message);
    }
}
To Upload
public void UploadRefreshTokenFile()
{
    FileInfo file = new FileInfo(localFilePath);
    try
    {
        PutObjectRequest request = new PutObjectRequest()
        {
            InputStream = file.OpenRead(),
            BucketName = bucketName,
            Key = awsFile
        };
        PutObjectResponse response = client.PutObject(request);
    }
    catch (Exception ex)
    {
        throw new Exception(ex.Message);
    }
}
One option is to write the new lines/information to a new version of the file. This would create a LARGE number of versions. But, essentially, whatever program you are using the file for could read ALL the versions and append them back together when reading it (this seems like a really bad idea as I write it out).
Another option would be to write a new object each time with a timestamp appended to the object name: my-log-file-date-time. Then whatever program is reading from it could download my-log-file-* and append them all together.
You would want to delete objects older than a certain age, just like log rotation.
Depending on how busy your events are, this might work. If you have thousands per second I don't think it would, but if you just have a few events per minute it may be reasonable. A sketch of this idea follows.
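A sketch of the timestamped-object idea with the v1 Java SDK (bucket and key names are hypothetical, and p_note is the string from the question's code):
// Write each note as its own object; a reader later lists and merges my-log-file-*.
String objectKey = "my-log-file-" + new java.text.SimpleDateFormat("yyyyMMdd-HHmmssSSS")
        .format(new java.util.Date());
s3Client.putObject(bucket, objectKey, p_note); // String-content overload of putObject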
You can do it with s3api put-object.
First download the version you want, then use the command below; it will be uploaded as the latest version.
$ aws s3api put-object --bucket $BUCKET --key $FOLDER/$FILE --body $YOUR_LOCAL_DOWNLOADED_VERSION_FILE