Reading HAR file from DistributedCache in mapreduce - mapreduce

I've written an Oozie workflow which creates a HAR archive and then runs an MR job which needs to read data from this archive.
1. The archive is created.
2. When the job runs, the mapper does see the archive in the distributed cache.
3. ??? How can I read this archive? What is the API for reading data from it line by line (my HAR is a batch of multiple newline-separated text files)?
NB: It works perfectly when I work with plain files (not a HAR archive) stored in the DistributedCache. I only hit a problem when trying to read data from the HAR.
Here is a code snippet:
    InputStream inputStream;
    String cachedDatafileName = System.getProperty(DIST_CACHE_FILE_NAME);
    LOG.info(String.format("Looking for[%s]=[%s] in DistributedCache", DIST_CACHE_FILE_NAME, cachedDatafileName));

    URI[] uris = DistributedCache.getCacheArchives(getContext().getConfiguration());
    URI uriToCachedDatafile = null;
    for (URI uri : uris) {
        if (uri.toString().endsWith(cachedDatafileName)) {
            uriToCachedDatafile = uri;
            break;
        }
    }
    if (uriToCachedDatafile == null) {
        throw new RuntimeConfigurationException(String.format("Looking for[%s]=[%s] in DistributedCache failed. There is no such file",
                DIST_CACHE_FILE_NAME, cachedDatafileName));
    }

    Path pathToFile = new Path(uriToCachedDatafile);
    LOG.info(String.format("[%s] has been found. Uri is: [%s]. The path is:[%s]", cachedDatafileName, uriToCachedDatafile, pathToFile));

    FileSystem fileSystem = pathToFile.getFileSystem(getContext().getConfiguration());
    HarFileSystem harFileSystem = new HarFileSystem(fileSystem);
    inputStream = harFileSystem.open(pathToFile); // NULL POINTER EXCEPTION IS HERE!
    return inputStream;

    protected InputStream getInputStreamToDistCacheFile() throws IOException {
        InputStream inputStream;
        String cachedDatafileName = System.getProperty(DIST_CACHE_FILE_NAME);
        LOG.info(String.format("Looking for[%s]=[%s] in DistributedCache", DIST_CACHE_FILE_NAME, cachedDatafileName));

        URI[] uris = DistributedCache.getCacheArchives(getContext().getConfiguration());
        URI uriToCachedDatafile = null;
        for (URI uri : uris) {
            if (uri.toString().endsWith(cachedDatafileName)) {
                uriToCachedDatafile = uri;
                break;
            }
        }
        if (uriToCachedDatafile == null) {
            throw new RuntimeConfigurationException(String.format("Looking for[%s]=[%s] in DistributedCache failed. There is no such file",
                    DIST_CACHE_FILE_NAME, cachedDatafileName));
        }

        //Path pathToFile = new Path(uriToCachedDatafile + "/stf/db_bts_stf.txt");
        Path pathToFile = new Path("har:///" + "home/ssa/devel/megalabs/kyc-solution/kyc-mrjob/target/test-classes/GSMCellSubscriberHomeIntersectionJobDescriptionClusterMRTest/in/gsm_cell_location_stf.har" + "/stf/db_bts_stf.txt");
        //Path pathToFile = new Path("har://home/ssa/devel/megalabs/kyc-solution/kyc-mrjob/target/test-classes/GSMCellSubscriberHomeIntersectionJobDescriptionClusterMRTest/in/gsm_cell_location_stf.har");
        LOG.info(String.format("[%s] has been found. Uri is: [%s]. The path is:[%s]", cachedDatafileName, uriToCachedDatafile, pathToFile));

        FileSystem harFileSystem = pathToFile.getFileSystem(context.getConfiguration());
        FSDataInputStream fin = harFileSystem.open(pathToFile);
        LOG.info("fin: " + fin);

        // FileSystem fileSystem = pathToFile.getFileSystem(getContext().getConfiguration());
        // HarFileSystem harFileSystem = new HarFileSystem(fileSystem);
        // harFileSystem.exists(new Path("har://home/ssa/devel/mycompany/my-solution/my-mrjob/target/test-classes/HomeJobDescriptionClusterMRTest/in/locations.har"));
        // LOG.info("harFileSystem.exists(pathToFile):" + harFileSystem.exists(pathToFile));
        // harFileSystem.initialize(uriToCachedDatafile, context.getConfiguration());

        FileStatus[] statuses = harFileSystem.listStatus(new Path("har:///" + "har://home/ssa/devel/mycompany/my-solution/my-mrjob/target/test-classes/HomeJobDescriptionClusterMRTest/in/locations.har"));
        for (FileStatus fileStatus : statuses) {
            LOG.info("fileStatus isDir" + fileStatus.isDirectory() + " len:" + fileStatus.getLen());
        }

        // String tmpPathToFile = "har:///" + pathToFile.toString(); //+"/stf/db_bts_stf.txt";
        // Path tmpPath = new Path(tmpPathToFile);
        // LOG.info("KILL ME PATH TO FILE IN ARCHIVE: " + tmpPath);
        // inputStream = harFileSystem.open(tmpPath);
        // return inputStream;

        return fin;
    }
As you can see, it's terrible. You have to manually read the index file stored inside the archive and reconstruct paths using the index file's metadata. If you know the exact name of a file stored in the archive (as in my example), you can construct the paths by hand.
It's not convenient. I expected something like Zip -> ZipEntry, where you can iterate over the entries of an archive without knowing its structure.
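For what it's worth, the har:// scheme can be addressed like any other Hadoop FileSystem, which does give ZipEntry-style iteration via listStatus() and open(). Below is a minimal sketch, not a verified answer, assuming the URI returned by DistributedCache.getCacheArchives() is fully qualified (e.g. hdfs://...) and points at the .har directory; the entry name "stf/db_bts_stf.txt" is just the one from the snippet above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class HarReaderSketch {

        // Builds a har:// path on top of the archive's own scheme and authority,
        // e.g. hdfs://namenode:8020/in/data.har -> har://hdfs-namenode:8020/in/data.har
        // (for the default filesystem, "har:///in/data.har" also works).
        static Path toHarRoot(URI archiveUri) {
            String authority = archiveUri.getAuthority() == null ? "" : archiveUri.getAuthority();
            return new Path("har://" + archiveUri.getScheme() + "-" + authority + archiveUri.getPath());
        }

        static void readArchive(URI archiveUri, Configuration conf) throws IOException {
            Path harRoot = toHarRoot(archiveUri);
            // getFileSystem() resolves the "har" scheme to HarFileSystem and calls
            // initialize() on it -- the step the manually constructed
            // `new HarFileSystem(fileSystem)` above never gets, and the likely
            // cause of the NullPointerException.
            FileSystem harFs = harRoot.getFileSystem(conf);

            // Iterate over the entries of a directory inside the archive,
            // roughly the ZipFile/ZipEntry-style iteration asked about.
            for (FileStatus status : harFs.listStatus(new Path(harRoot, "stf"))) {
                System.out.println(status.getPath() + " len=" + status.getLen());
            }

            // Read one text entry line by line.
            try (FSDataInputStream in = harFs.open(new Path(harRoot, "stf/db_bts_stf.txt"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process the line
                }
            }
        }
    }

The point of the sketch is that the path, not a manually wrapped HarFileSystem, carries the har:// scheme; the FileSystem machinery then does the index reading for you.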

Related

How to modify the filename of the S3 object uploaded using the Kafka Connect S3 Connector?

I've been using the S3 connector for a couple of weeks now, and I want to change the way the connector names each file. I am using the HourlyBasedPartition, so the path to each file is already enough for me to find each file, and I want the filenames to be something generic for all the files, like just 'Data.json.gzip' (with the respective path from the partitioner).
For example, I want to go from this:
<prefix>/<topic>/<HourlyBasedPartition>/<topic>+<kafkaPartition>+<startOffset>.<format>
To this:
<prefix>/<topic>/<HourlyBasedPartition>/Data.<format>
The objective of this is to only make one call to S3 to download the files later, instead of having to look for the filename first and then download it.
Searching through the files in the folder called 'kafka-connect-s3', I found this file:
https://github.com/confluentinc/kafka-connect-storage-cloud/blob/master/kafka-connect-s3/src/main/java/io/confluent/connect/s3/TopicPartitionWriter.java which towards the end has the following functions:
    private RecordWriter getWriter(SinkRecord record, String encodedPartition)
            throws ConnectException {
        if (writers.containsKey(encodedPartition)) {
            return writers.get(encodedPartition);
        }
        String commitFilename = getCommitFilename(encodedPartition);
        log.debug(
            "Creating new writer encodedPartition='{}' filename='{}'",
            encodedPartition,
            commitFilename
        );
        RecordWriter writer = writerProvider.getRecordWriter(connectorConfig, commitFilename);
        writers.put(encodedPartition, writer);
        return writer;
    }

    private String getCommitFilename(String encodedPartition) {
        String commitFile;
        if (commitFiles.containsKey(encodedPartition)) {
            commitFile = commitFiles.get(encodedPartition);
        } else {
            long startOffset = startOffsets.get(encodedPartition);
            String prefix = getDirectoryPrefix(encodedPartition);
            commitFile = fileKeyToCommit(prefix, startOffset);
            commitFiles.put(encodedPartition, commitFile);
        }
        return commitFile;
    }

    private String fileKey(String topicsPrefix, String keyPrefix, String name) {
        String suffix = keyPrefix + dirDelim + name;
        return StringUtils.isNotBlank(topicsPrefix)
            ? topicsPrefix + dirDelim + suffix
            : suffix;
    }

    private String fileKeyToCommit(String dirPrefix, long startOffset) {
        String name = tp.topic()
            + fileDelim
            + tp.partition()
            + fileDelim
            + String.format(zeroPadOffsetFormat, startOffset)
            + extension;
        return fileKey(topicsDir, dirPrefix, name);
    }
I don't know if this can be customised to do what I want, but it seems to be close to my intentions. Hope it helps.
(Submitted an issue to Github as well: https://github.com/confluentinc/kafka-connect-storage-cloud/issues/369)
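For completeness: if forking the connector is acceptable, the name could in principle be fixed inside the fileKeyToCommit() method shown above. This is only a hypothetical sketch of that change, not a supported configuration option, and it reuses the fields from the class above:

    // Hypothetical edit in a fork of TopicPartitionWriter: replace the
    // <topic>+<partition>+<offset> name with a fixed one.
    private String fileKeyToCommit(String dirPrefix, long startOffset) {
        // "Data" is an arbitrary fixed base name; the extension (e.g. ".json.gz")
        // still comes from the configured format.
        String name = "Data" + extension;
        return fileKey(topicsDir, dirPrefix, name);
    }

Note that with a fixed name every flush into the same partition directory targets the same key, so later commits would overwrite earlier ones unless each hourly partition only ever receives one file.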

How can I get the bytes of a PDF before download? - Grails

I can download the PDF file from AWS and the download is working fine, but I need to get the bytes before downloading.
    String awsBucket = params.awsBucket
    String awsKey = params.awsKey
    String fileName = params.fileName
    String pdfLink = params.pdfLink

    File file = grailsApplication.mainContext.getResource('WSCredential.properties').file
    FileInputStream fileInputStream = new FileInputStream(file)
    AmazonS3 s3Client = new AmazonS3Client(new PropertiesCredentials(fileInputStream))

    GetObjectRequest getObjectRequest = new GetObjectRequest(awsBucket, awsKey)
    userReceipt = s3Client.getObject(getObjectRequest)
    if (userReceipt) {
        response.setContentType("application/image/png")
        response.setHeader("Content-disposition", "attachment;filename=\"${fileName}\"")
        response.outputStream << userReceipt.getObjectContent()
    }
If you mean the size of the PDF, then you can use HeadObject.
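In the AWS SDK for Java v1 used above, HeadObject is exposed as getObjectMetadata(). A minimal sketch, assuming the same s3Client, awsBucket and awsKey as in the code above:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    // Returns the object's size in bytes without downloading its content
    // (a HEAD request, so no object body is transferred).
    static long objectSize(AmazonS3 s3Client, String awsBucket, String awsKey) {
        ObjectMetadata metadata = s3Client.getObjectMetadata(awsBucket, awsKey);
        return metadata.getContentLength();
    }

You can then set a Content-Length header on the response before streaming, or decide whether to fetch the object at all.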

Read n number of lines from s3 object using AWS lambda

In my Lambda I am trying to parse the content of a document from an S3 bucket. The document I am processing is a txt file of more than 100 MB. I need to parse only the first line of the file.
What is the most cost-effective way to read the file?
Currently, I am fetching the content using the getObjectContent() method and taking the first line from it like this.
    private AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();

    GetObjectRequest getObjectRequest = new GetObjectRequest(bucket, key);
    S3Object s3Object = s3.getObject(getObjectRequest);
    BufferedReader reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
    String firstLine;
    try {
        while ((firstLine = reader.readLine()) != null) {
            logger.log("META PROCESSOR | FIRST LINE OF FILE : " + firstLine);
            break;
        }
    } catch (IOException e) {
        logger.log("META PROCESSOR | FAILED TO LOAD FIRST LINE ");
        return null;
    }
Is reading the entire content just to get the first line a good approach? Is there any method available to read the first n lines, or the first n bytes, of a file?
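One common alternative to streaming the whole object is an S3 ranged GET: only the first few kilobytes are transferred and the first line is read out of that. A minimal sketch using the same v1 SDK as above (the 8 KB range is an arbitrary assumption; it must be larger than your longest first line):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;

    public class FirstLineSketch {

        private static final AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();

        // Requests only bytes 0..8191 of the object and returns the first line.
        static String firstLine(String bucket, String key) throws IOException {
            GetObjectRequest request = new GetObjectRequest(bucket, key).withRange(0, 8191);
            try (S3Object object = s3.getObject(request);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(object.getObjectContent()))) {
                return reader.readLine(); // null if the object is empty
            }
        }
    }

If the first line could be longer than the range, retry with a larger range; either way only the requested byte range is transferred, not the full 100 MB.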

AWS S3 returns 404 for a file that definitely still exists there

We have some code that downloads a bunch of S3 files to a local directory. The list of files to retrieve is from a query we run. It only lists files that actually exist in our S3 bucket.
As we loop to retrieve these files, about 10% of them return a 404 error as if the file doesn't exist. I log the name/location of each such file so I can go to S3 and check, and sure enough every single one of them IS on S3, in the exact location we went looking for it.
Why does S3 throw a 404 when the file exists?
Here is the Groovy code of the script.
    class RetrieveS3FilesFromCSVLoader implements Loader {

        private static String missingFilesFile = "00-MISSED_FILES.csv"
        private static String csvFileName = "/csv/s3file2.csv"
        private static String saveFilesToLocation = "/tmp/retrieve/"
        public static final char SEPARATOR = ','

        @Autowired
        DocumentFileService documentFileService

        private void readWithCommaSeparatorSQL() {
            int counter = 0
            String fileName
            String fileLocation
            File missedFiles = new File(saveFilesToLocation + missingFilesFile)
            PrintWriter writer = new PrintWriter(missedFiles)
            File fileCSV = new File(getClass().getResource(csvFileName).toURI())

            fileCSV.splitEachLine(SEPARATOR as String) { nextLine ->
                //if (counter < 15) {
                if (nextLine != null && (nextLine[0] != 'FileLocation')) {
                    counter++
                    try {
                        //Remove 0, only if client number start with "0".
                        fileLocation = nextLine[0].trim()
                        byte[] fileBytes = documentFileService.getFile(fileLocation)
                        if (fileBytes != null) {
                            fileName = fileLocation.substring(fileLocation.indexOf("/") + 1, fileLocation.length())
                            File file = new File(saveFilesToLocation + fileName)
                            file.withOutputStream {
                                it.write fileBytes
                            }
                            println "$counter) Wrote file ${fileLocation} to ${saveFilesToLocation + fileLocation}"
                        } else {
                            println "$counter) UNABLE TO RETRIEVE FILE ELSE: $fileLocation"
                            writer.println(fileLocation)
                        }
                    } catch (Exception e) {
                        println "$counter) UNABLE TO RETRIEVE FILE: $fileLocation"
                        println(e.getMessage())
                        writer.println(fileLocation)
                    }
                } else {
                    counter++
                }
                //}
            }
            writer.close()
        }
Here is the code for getFile(fileLocation) and client creation.
    public byte[] getFile(String filename) throws IOException {
        AmazonS3Client s3Client = connectToAmazonS3Service();
        S3Object object = s3Client.getObject(S3_BUCKET_NAME, filename);
        if (object == null) {
            return null;
        }
        byte[] fileAsArray = IOUtils.toByteArray(object.getObjectContent());
        object.close();
        return fileAsArray;
    }

    /**
     * Connects to Amazon S3
     *
     * @return instance of AmazonS3Client
     */
    private AmazonS3Client connectToAmazonS3Service() {
        AWSCredentials credentials;
        try {
            credentials = new BasicAWSCredentials(S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY);
        } catch (Exception e) {
            throw new AmazonClientException(
                    "Cannot load the credentials from the credential profiles file. " +
                    "Please make sure that your credentials file is at the correct " +
                    "location (~/.aws/credentials), and is in valid format.",
                    e);
        }
        AmazonS3Client s3 = new AmazonS3Client(credentials);
        Region usWest2 = Region.getRegion(Regions.US_EAST_1);
        s3.setRegion(usWest2);
        return s3;
    }
The code above works for 90% of the files in the list passed to the script, but we know for a fact that 100% of the files exist in S3 at the location strings we are passing.
I am just an idiot. I thought it had the production AWS credentials in the properties file; instead it was the development credentials, so I had the wrong credentials.
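For anyone debugging the same symptom: one quick way to confirm which account a given access key actually belongs to is an STS GetCallerIdentity call. A minimal sketch with the v1 SDK used above, passing in the same BasicAWSCredentials built in connectToAmazonS3Service():

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
    import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
    import com.amazonaws.services.securitytoken.model.GetCallerIdentityRequest;
    import com.amazonaws.services.securitytoken.model.GetCallerIdentityResult;

    // Prints the account and ARN behind the supplied access key, which makes a
    // production-vs-development credential mix-up obvious.
    static void printCallerIdentity(BasicAWSCredentials credentials) {
        AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .withRegion(Regions.US_EAST_1) // same region the S3 client above uses
                .build();
        GetCallerIdentityResult identity = sts.getCallerIdentity(new GetCallerIdentityRequest());
        System.out.println("Account: " + identity.getAccount() + "  ARN: " + identity.getArn());
    }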

How to create PPTX file using power-point-template using Apache POI

I want to create a PowerPoint presentation using a PowerPoint template (which may already exist, or may be generated by POI). To create the template file, which has a background image in its slides, I wrote the following code; it creates a template file that opens in OpenOffice but gives an error when opened in Microsoft PowerPoint.
The code is:
    private static void generatePOTX() throws IOException, FileNotFoundException {
        String imgPathStr = System.getProperty("user.dir") + "/src/resources/images/TestSameChnl_001_t.jpeg";
        File imgFile = new File(imgPathStr);
        File potxFile = new File(System.getProperty("user.dir") + "/src/resources/Examples/layout.potx");
        FileOutputStream out = new FileOutputStream(potxFile);

        HSLFSlideShow ppt = new HSLFSlideShow();
        HSLFSlide slide = ppt.createSlide();
        slide.setFollowMasterBackground(false);

        HSLFFill fill = slide.getBackground().getFill();
        HSLFPictureData pd = ppt.addPicture(imgFile, PictureData.PictureType.JPEG);
        fill.setFillType(HSLFFill.FILL_PICTURE);
        fill.setPictureData(pd);

        ppt.write(out);
        out.close();
    }
After that I tried to create a PPT file using the generated POTX file, but it's giving an error. I am using the code below for this:
    private static void GeneratePPTXUsingPOTX() throws FileNotFoundException, IOException {
        File imgFile = new File(System.getProperty("user.dir") + "/src/resources/images/TestSameChnl_001_t.jpeg");
        File potx_File = new File(System.getProperty("user.dir") + "/src/resources/Examples/layout.potx");
        File pptx_File = new File(System.getProperty("user.dir") + "/src/resources/Examples/PPTWithTemplate.pptx");
        File movieFile = new File(System.getProperty("user.dir") + "/src/resources/movie/Dummy.mp4");
        FileInputStream ins = new FileInputStream(potx_File);
        FileOutputStream out = new FileOutputStream(pptx_File);

        HSLFSlideShow ppt = new HSLFSlideShow(ins);
        List<HSLFSlide> slideList = ppt.getSlides();

        int movieIdx = ppt.addMovie(movieFile.getAbsolutePath(), MovieShape.MOVIE_MPEG);
        HSLFPictureData pictureData = ppt.addPicture(imgFile, PictureData.PictureType.JPEG);

        MovieShape shape = new MovieShape(movieIdx, pictureData);
        shape.setAnchor(new java.awt.Rectangle(300, 225, 420, 280));
        slideList.get(0).addShape(shape);
        shape.setAutoPlay(true);

        ppt.write(out);
        out.close();
    }
The exception that comes up is as follows:
    java.lang.NullPointerException
        at org.apache.poi.hslf.usermodel.HSLFPictureShape.afterInsert(HSLFPictureShape.java:185)
        at org.apache.poi.hslf.usermodel.HSLFSheet.addShape(HSLFSheet.java:189)