Writing Jena Models to Tar.Gz archives - apache-jena

I am working with RDF models at the moment: I query data from a database, generate models using Apache Jena, and work with them. However, I don't want to query the database every time I use the models, so I thought about storing them locally. The models are quite big, so I'd like to compress them using Apache Commons Compress. This works so far (try-catch blocks omitted):
public static void write(Map<String, Model> models, String file) {
    logger.info("Writing models to file " + file);
    TarArchiveOutputStream tarOutput = null;
    TarArchiveEntry entry = null;
    tarOutput = new TarArchiveOutputStream(new GzipCompressorOutputStream(new FileOutputStream(new File(file))));
    for (Map.Entry<String, Model> e : models.entrySet()) {
        logger.info("Packing model " + e.getKey());
        // Convert model
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        RDFDataMgr.write(baos, e.getValue(), RDFFormat.RDFXML_PRETTY);
        // Prepare entry
        entry = new TarArchiveEntry(e.getKey());
        entry.setSize(baos.size());
        tarOutput.putArchiveEntry(entry);
        // Write into file and close
        tarOutput.write(baos.toByteArray());
        tarOutput.closeArchiveEntry();
    }
    tarOutput.close();
}
But when I try the other direction, I get weird NullPointerExceptions. Is this a bug in the GZip implementation, or is my understanding of streams wrong?
public static Map<String, Model> read(String file) {
    logger.info("Reading models from file " + file);
    Map<String, Model> models = new HashMap<>();
    TarArchiveInputStream tarInput = new TarArchiveInputStream(new GzipCompressorInputStream(new FileInputStream(file)));
    for (TarArchiveEntry currentEntry = tarInput.getNextTarEntry(); currentEntry != null; currentEntry = tarInput.getNextTarEntry()) {
        logger.info("Processing model " + currentEntry.getName());
        // Read the current model
        Model m = ModelFactory.createDefaultModel();
        m.read(tarInput, null);
        // And add it to the output
        models.put(currentEntry.getName(), m);
        tarInput.close();
    }
    return models;
}
This is the stack trace:
Exception in thread "main" java.lang.NullPointerException
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:271)
at java.io.InputStream.skip(InputStream.java:224)
at org.apache.commons.compress.utils.IOUtils.skip(IOUtils.java:106)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:345)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:272)
at de.mem89.masterthesis.rdfHydra.StorageHelper.read(StorageHelper.java:88)
at de.mem89.masterthesis.rdfHydra.StorageHelper.main(StorageHelper.java:124)
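A likely explanation, judging from the code rather than from any bug in the GZip implementation: tarInput.close() is called inside the for loop, so the next call to getNextTarEntry() operates on a stream that has already been closed, which would fit the NullPointerException thrown inside GzipCompressorInputStream.read. Handing the shared tar stream directly to Model.read is also fragile, because the parser may read past the current entry. A minimal sketch of a read method that avoids both issues (IOUtils here is org.apache.commons.compress.utils.IOUtils; this is a suggestion, not a verified fix):

public static Map<String, Model> read(String file) throws IOException {
    Map<String, Model> models = new HashMap<>();
    try (TarArchiveInputStream tarInput = new TarArchiveInputStream(
            new GzipCompressorInputStream(new FileInputStream(file)))) {
        for (TarArchiveEntry entry = tarInput.getNextTarEntry();
             entry != null;
             entry = tarInput.getNextTarEntry()) {
            // Copy exactly this entry's bytes so the parser never touches the shared tar stream
            byte[] content = new byte[(int) entry.getSize()];
            IOUtils.readFully(tarInput, content);
            Model m = ModelFactory.createDefaultModel();
            m.read(new ByteArrayInputStream(content), null);
            models.put(entry.getName(), m);
        }
    } // the archive stream is closed exactly once, here, after all entries are read
    return models;
}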

Related

Running BeamSql WithoutCoder or Making Coder Dynamic

I am reading data from a file and converting it to BeamRecord, but when I run a query on it, it shows this error:
Exception in thread "main" java.lang.ClassCastException: org.apache.beam.sdk.coders.SerializableCoder cannot be cast to org.apache.beam.sdk.coders.BeamRecordCoder
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.registerTables(BeamSql.java:173)
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:153)
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:116)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:533)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:465)
at org.apache.beam.sdk.values.PCollectionTuple.apply(PCollectionTuple.java:160)
at TestingClass.main(TestingClass.java:75)
But when I provide a coder, it runs perfectly.
I am a little confused: because I am using templates, the schema of the file data changes on every run, so is there any way I can run the query with a default coder or without a coder?
For reference, the code is below:
PCollection<String> ReadFile1 = PBegin.in(p).apply(TextIO.read().from("gs://Bucket_Name/FileName.csv"));
PCollection<BeamRecord> File1_BeamRecord = ReadFile1.apply(new StringToBeamRecord()).setCoder(new Temp().test().getRecordCoder());
PCollection<String> ReadFile2= p.apply(TextIO.read().from("gs://Bucket_Name/FileName.csv"));
PCollection<BeamRecord> File2_Beam_Record = ReadFile2.apply(new StringToBeamRecord()).setCoder(new Temp().test1().getRecordCoder());
new Temp().test1().getRecordCoder() returns a hard-coded BeamRecordCoder, whose values I actually need to determine at runtime.
The conversion from PCollection<String> to PCollection<BeamRecord> is below:
public class StringToBeamRecord extends PTransform<PCollection<String>, PCollection<BeamRecord>> {
    private static final Logger LOG = LoggerFactory.getLogger(StringToBeamRecord.class);

    @Override
    public PCollection<BeamRecord> expand(PCollection<String> arg0) {
        return arg0.apply("Conversion", ParDo.of(new ConversionOfData()));
    }

    static class ConversionOfData extends DoFn<String, BeamRecord> implements Serializable {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String data = c.element().replaceAll(",,", ",blank,");
            String[] array = data.split(",");
            List<String> fieldNames = new ArrayList<>();
            List<Integer> fieldTypes = new ArrayList<>();
            List<Object> rowValues = new ArrayList<>();
            for (int i = 0; i < array.length; i++) {
                fieldNames.add("R" + i);
                fieldTypes.add(Types.VARCHAR); // using a schema I could set this properly
                rowValues.add(array[i]);
            }
            LOG.info("The size is: " + rowValues.size());
            BeamRecordSqlType type = BeamRecordSqlType.create(fieldNames, fieldTypes);
            c.output(new BeamRecord(type, rowValues));
        }
    }
}
The query is:
PCollectionTuple test = PCollectionTuple.of(
new TupleTag<BeamRecord>("File1_BeamRecord"),File1_BeamRecord)
.and(new TupleTag<BeamRecord>("File2_BeamRecord"), File2_BeamRecord);
PCollection<BeamRecord> output = test.apply(BeamSql.queryMulti(
"Select * From File1_BeamRecord JOIN File2_BeamRecord "));
Is there any way I can make the coder dynamic, or run the query with a default coder?
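One direction that may help (a sketch only, built from the calls already used above): Beam needs the coder at pipeline construction time, so the schema has to be derived before the pipeline is built, for example by reading the CSV header line first. The resulting BeamRecordSqlType can then feed both setCoder(...) and the DoFn. This assumes getRecordCoder() is available on the type the Temp helper returns, as the snippet above suggests, and numberOfColumns is a hypothetical value read from the header:

// Hypothetical: derive the field list from the file's header line before building the pipeline.
List<String> fieldNames = new ArrayList<>();
List<Integer> fieldTypes = new ArrayList<>();
for (int i = 0; i < numberOfColumns; i++) {   // numberOfColumns taken from the header line
    fieldNames.add("R" + i);
    fieldTypes.add(Types.VARCHAR);
}
BeamRecordSqlType rowType = BeamRecordSqlType.create(fieldNames, fieldTypes);

PCollection<BeamRecord> File1_BeamRecord = ReadFile1
    .apply(new StringToBeamRecord())
    .setCoder(rowType.getRecordCoder());       // same call as new Temp().test().getRecordCoder()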

How to create PPTX file using power-point-template using Apache POI

I want to create a PowerPoint presentation from a PowerPoint template (which may already exist or may be generated by POI). To create a template file that has a background image in the slides, I wrote the following code. It creates a template file that opens in OpenOffice, but gives an error when opened in Microsoft PowerPoint.
The code is:
private static void generatePOTX() throws IOException, FileNotFoundException {
    String imgPathStr = System.getProperty("user.dir") + "/src/resources/images/TestSameChnl_001_t.jpeg";
    File imgFile = new File(imgPathStr);
    File potxFile = new File(System.getProperty("user.dir") + "/src/resources/Examples/layout.potx");
    FileOutputStream out = new FileOutputStream(potxFile);
    HSLFSlideShow ppt = new HSLFSlideShow();
    HSLFSlide slide = ppt.createSlide();
    slide.setFollowMasterBackground(false);
    HSLFFill fill = slide.getBackground().getFill();
    HSLFPictureData pd = ppt.addPicture(imgFile, PictureData.PictureType.JPEG);
    fill.setFillType(HSLFFill.FILL_PICTURE);
    fill.setPictureData(pd);
    ppt.write(out);
    out.close();
}
After that I tried to create a PPTX file using the generated POTX file, but it gives an error. I am using the code below for this:
private static void GeneratePPTXUsingPOTX() throws FileNotFoundException, IOException {
    File imgFile = new File(System.getProperty("user.dir") + "/src/resources/images/TestSameChnl_001_t.jpeg");
    File potx_File = new File(System.getProperty("user.dir") + "/src/resources/Examples/layout.potx");
    File pptx_File = new File(System.getProperty("user.dir") + "/src/resources/Examples/PPTWithTemplate.pptx");
    File movieFile = new File(System.getProperty("user.dir") + "/src/resources/movie/Dummy.mp4");
    FileInputStream ins = new FileInputStream(potx_File);
    FileOutputStream out = new FileOutputStream(pptx_File);
    HSLFSlideShow ppt = new HSLFSlideShow(ins);
    List<HSLFSlide> slideList = ppt.getSlides();
    int movieIdx = ppt.addMovie(movieFile.getAbsolutePath(), MovieShape.MOVIE_MPEG);
    HSLFPictureData pictureData = ppt.addPicture(imgFile, PictureData.PictureType.JPEG);
    MovieShape shape = new MovieShape(movieIdx, pictureData);
    shape.setAnchor(new java.awt.Rectangle(300, 225, 420, 280));
    slideList.get(0).addShape(shape);
    shape.setAutoPlay(true);
    ppt.write(out);
    out.close();
}
The exception is as follows:
java.lang.NullPointerException
at org.apache.poi.hslf.usermodel.HSLFPictureShape.afterInsert(HSLFPictureShape.java:185)
at org.apache.poi.hslf.usermodel.HSLFSheet.addShape(HSLFSheet.java:189)
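One thing that may be worth checking, offered as an observation rather than a confirmed fix: the HSLF classes write the old binary .ppt format, while .potx and .pptx are OOXML files that POI handles with the XSLF classes, so PowerPoint may reject a binary file saved under a .potx/.pptx extension. A sketch of the template-plus-picture flow using XMLSlideShow instead (method names are from recent POI versions, the file paths are the ones used above, and the movie shape has no direct XSLF equivalent, so it is omitted):

private static void generatePptxFromPotx() throws IOException {
    File imgFile  = new File(System.getProperty("user.dir") + "/src/resources/images/TestSameChnl_001_t.jpeg");
    File potxFile = new File(System.getProperty("user.dir") + "/src/resources/Examples/layout.potx");
    File pptxFile = new File(System.getProperty("user.dir") + "/src/resources/Examples/PPTWithTemplate.pptx");

    try (FileInputStream in = new FileInputStream(potxFile);
         XMLSlideShow ppt = new XMLSlideShow(in);
         FileOutputStream out = new FileOutputStream(pptxFile)) {
        XSLFPictureData picture = ppt.addPicture(imgFile, PictureData.PictureType.JPEG);
        XSLFSlide slide = ppt.createSlide();
        XSLFPictureShape pic = slide.createPicture(picture);
        // stretch the picture over a default 720x540-point slide so it acts as a background
        pic.setAnchor(new java.awt.Rectangle(0, 0, 720, 540));
        ppt.write(out);
    }
}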

Serializing List<MemoryStream> to a file using a standard .NET class

Writing a WP8 Silverlight app. Is there a standard .NET technique available in this environment I can use to serialize an object like this
private static List<MemoryStream> MemoryStreamList = new List<MemoryStream>();
to save it to a file and restore it later?
I tried to use DataContractJsonSerializer for this, which is good for serializing a List of simple custom objects, but it fails for List<MemoryStream> (I get System.Reflection.TargetInvocationException).
I would suggest converting your list to a list of byte arrays before persisting and then you should be able to serialize. Of course this comes with some overhead at deserialization as well.
Serialization part:
byte[] bytes = null;
var newList = MemoryStreamList.Select(x => x.ToArray()).ToList();
XmlSerializer ser = new XmlSerializer(newList.GetType());
using (var ms = new MemoryStream())
{
    ser.Serialize(ms, newList);
    // if you want your result as a string, then uncomment the lines below
    //ms.Seek(0, SeekOrigin.Begin);
    //using (var sr = new StreamReader(ms))
    //{
    //    string serializedStuff = sr.ReadToEnd();
    //}
    // else you can call ms.ToArray() here and persist the byte[]
    bytes = ms.ToArray();
}
Deserialization part:
using (var ms = new MemoryStream(bytes))
{
    var result = ser.Deserialize(ms) as List<byte[]>;
}

GATE Embedded runtime

I want to use "GATE" through web. Then I decide to create a SOAP web service in java with help of GATE Embedded.
But for the same document and saved Pipeline, I have a different run-time duration, when GATE Embedded runs as a java web service.
The same code has a constant run-time when it runs as a Java Application project.
In the web service, the run-time will be increasing after each execution until I get a Timeout error.
Does any one have this kind of experience?
This is my Code:
@WebService(serviceName = "GateWS")
public class GateWS {

    @WebMethod(operationName = "gateengineapi")
    public String gateengineapi(@WebParam(name = "PipelineNumber") String PipelineNumber,
                                @WebParam(name = "Documents") String Docs) throws Exception {
        try {
            System.setProperty("gate.home", "C:\\GATE\\");
            System.setProperty("shell.path", "C:\\cygwin2\\bin\\sh.exe");
            Gate.init();
            File GateHome = Gate.getGateHome();
            File FrenchGapp = new File(GateHome, PipelineNumber);
            CorpusController FrenchController;
            FrenchController = (CorpusController) PersistenceManager.loadObjectFromFile(FrenchGapp);
            Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
            FrenchController.setCorpus(corpus);
            File docFile = new File(GateHome, Docs);
            Document doc = Factory.newDocument(docFile.toURL(), "utf-8");
            corpus.add(doc);
            FrenchController.execute();
            String docXMLString = doc.toXml();
            String outputFileName = doc.getName() + ".out.xml";
            File outputFile = new File(docFile.getParentFile(), outputFileName);
            FileOutputStream fos = new FileOutputStream(outputFile);
            BufferedOutputStream bos = new BufferedOutputStream(fos);
            OutputStreamWriter out = new OutputStreamWriter(bos, "utf-8");
            out.write(docXMLString);
            out.close();
            gate.Factory.deleteResource(doc);
            return outputFileName;
        } catch (Exception ex) {
            return "ERROR: -> " + ex.getMessage();
        }
    }
}
I really appreciate any help you can provide.
The problem is that you're loading a new instance of the pipeline for every request, but then not freeing it again at the end of the request. GATE maintains a list internally of every PR/LR/controller that is loaded, so anything you load with Factory.createResource or PersistenceManager.loadObjectFrom... must be freed using Factory.deleteResource once it is no longer needed, typically using a try-finally:
FrenchController = (CorpusController) PersistenceManager.loadObjectFromFile(FrenchGapp);
try {
    // ...
} finally {
    Factory.deleteResource(FrenchController);
}
But...
Rather than loading a new instance of the pipeline every time, I would strongly recommend you explore a more efficient approach to load a smaller number of instances of the pipeline but keep them in memory to serve multiple requests. There is a fully worked-through example of this technique in the training materials on the GATE wiki, in particular module number 8 (track 2 Thursday).
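For reference, the pooling idea can be sketched roughly as below. The pool size, the use of Factory.duplicate, and the method shape are assumptions for illustration, not code taken from the training module:

// Rough sketch of a pipeline pool: load the .gapp once, duplicate it a few times,
// and have each request borrow a controller instead of loading a new one.
private static final BlockingQueue<CorpusController> pool = new LinkedBlockingQueue<>();

public static void initPool(File gappFile, int size) throws Exception {
    Gate.init();                                   // once, at startup
    CorpusController template =
        (CorpusController) PersistenceManager.loadObjectFromFile(gappFile);
    pool.add(template);
    for (int i = 1; i < size; i++) {
        pool.add((CorpusController) Factory.duplicate(template));
    }
}

public String process(Document doc) throws Exception {
    CorpusController controller = pool.take();     // borrow a loaded pipeline
    Corpus corpus = Factory.newCorpus("request corpus");
    try {
        corpus.add(doc);
        controller.setCorpus(corpus);
        controller.execute();
        return doc.toXml();
    } finally {
        Factory.deleteResource(corpus);
        pool.put(controller);                      // return it for the next request
    }
}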

Same Instances header ( arff ) for all my database queries

I am using InstanceQuery and SQL queries to construct my Instances. But my query results do not always come back in the same order, which is normal for SQL.
Because of this, Instances constructed from different SQL queries have different headers. A simple example can be seen below. I suspect my results change because of this behavior.
Header 1
@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}
Header 2
@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}
My question is: how can I give the correct header information to the Instances construction? Is a workflow like the one below possible?
1. Get pre-prepared header information from an ARFF file or another place.
2. Give this header information to the Instances construction.
3. Call the SQL function and get Instances (header + data).
I am using the following function to get Instances from the database.
public static Instances getInstanceDataFromDatabase(String pSql,
        String pInstanceRelationName) {
    try {
        DatabaseUtils utils = new DatabaseUtils();
        InstanceQuery query = new InstanceQuery();
        query.setUsername(username);
        query.setPassword(password);
        query.setQuery(pSql);
        Instances data = query.retrieveInstances();
        data.setRelationName(pInstanceRelationName);
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1);
        }
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
I tried various approaches to my problem, but it seems that the Weka internal API does not allow a solution to this problem right now. I modified the weka.core.Instances append command-line code for my purposes (this code is also given in this answer).
Accordingly, here is my solution. I created a SampleWithKnownHeader.arff file, which contains the correct header values, and I read this file with the following code.
public static Instances getSampleInstances() {
    Instances data = null;
    try {
        BufferedReader reader = new BufferedReader(new FileReader(
                "datas\\SampleWithKnownHeader.arff"));
        data = new Instances(reader);
        reader.close();
        // setting class attribute
        data.setClassIndex(data.numAttributes() - 1);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    return data;
}
After that, I use the following code to create the combined instances. I had to use a StringBuilder and the string values of each instance, then save the resulting string to a file.
public static void main(String[] args) {
    Instances SampleInstance = MyUtilsForWeka.getSampleInstances();
    DataSource source1 = new DataSource(SampleInstance);
    Instances data2 = InstancesFromDatabase
            .getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);
    MyUtilsForWeka.saveInstancesToFile(data2, "fromDatabase.arff");
    DataSource source2 = new DataSource(data2);
    Instances structure1;
    Instances structure2;
    StringBuilder sb = new StringBuilder();
    try {
        structure1 = source1.getStructure();
        sb.append(structure1);
        structure2 = source2.getStructure();
        while (source2.hasMoreElements(structure2)) {
            String elementAsString = source2.nextElement(structure2).toString();
            sb.append(elementAsString);
            sb.append("\n");
        }
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
    MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}
My code for saving instances to a file is below.
public static void saveInstancesToFile(String contents, String filename) {
    FileWriter fstream;
    try {
        fstream = new FileWriter(filename);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(contents);
        out.close();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
This solves my problem, but I wonder if a more elegant solution exists.
I solved a similar problem with the Add filter, which allows adding attributes to Instances. You need to add a correct Attribute with the proper list of values to both datasets (in my case, to the test dataset only):
Load train and test data:
/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();
Set a new attribute with Add filter:
Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);
As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning).
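The getValues helper referenced in the comment above is not shown; a straightforward version, assuming it simply joins the attribute's nominal values with commas for Add.setNominalLabels, could look like this:

/* Build the comma-separated list of nominal values expected by Add.setNominalLabels */
private static String getValues(Attribute attribute) {
    StringBuilder values = new StringBuilder();
    for (int i = 0; i < attribute.numValues(); i++) {
        if (i > 0) {
            values.append(",");
        }
        values.append(attribute.value(i));
    }
    return values.toString();
}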