Currently I am working on a project using WEKA. Being new to it, there are many things I am not familiar with. In my last project I classified text files with WEKA: I applied the TextDirectoryLoader converter to convert a directory of text files, as described at this URL: Text categorization with WEKA. Now I want to use the same strategy for a directory containing source code instead of plain text. For example, I have a jEdit source file containing Java source code. I am trying to convert it to an ARFF file so that I can apply classifiers and the other functions WEKA offers for data mining. I have also tried the test file given at the following URL: ARFF files from Text Collections. I believe I can use that file as an example for converting source code files. However, I do not know which attributes I should define in the FastVector, what format the data should be in (string or numeric), and what other sections an ARFF file may have.
In the example, the authors define the following attributes:
FastVector atts = new FastVector(2);
atts.addElement(new Attribute("filename", (FastVector) null));
atts.addElement(new Attribute("contents", (FastVector) null));
I have tried to find examples on Google, but with no success.
Could anyone suggest a solution or an alternative for the above problem? (Example code would be highly appreciated.)
Or at least give me a short example that converts a source code directory into an ARFF file, if that is possible.
If it is not possible, what could be the reason?
Is there an alternative to WEKA where I can use the same set of functions on source code?
It is not clear what your goal is. Do you want to classify the source code files, find the files that contain bugs, or something else?
As I understand it, you want to extract features from each source file and represent each file as an instance. Then you can apply any machine-learning-based algorithm.
Here you can find a Java example of how to construct an ARFF file from code:
https://weka.wikispaces.com/Creating+an+ARFF+file
But you have to define your task-specific features and extract them from each source code file.
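To make the mechanics concrete, here is a minimal sketch along the lines of the linked wiki example: each source file becomes one instance with two string attributes, filename and contents (the directory name "src/" is illustrative). For real mining you would extract task-specific features instead of, or in addition to, the raw contents.

```java
import java.io.File;
import java.nio.file.Files;

import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

public class SourceToArff {
    public static void main(String[] args) throws Exception {
        // Two string attributes, exactly as in the snippet above
        FastVector atts = new FastVector(2);
        atts.addElement(new Attribute("filename", (FastVector) null));
        atts.addElement(new Attribute("contents", (FastVector) null));
        Instances data = new Instances("source_files", atts, 0);

        // One instance per file in the (illustrative) directory "src/"
        for (File f : new File("src/").listFiles()) {
            double[] vals = new double[data.numAttributes()];
            vals[0] = data.attribute(0).addStringValue(f.getName());
            vals[1] = data.attribute(1).addStringValue(
                    new String(Files.readAllBytes(f.toPath())));
            data.add(new Instance(1.0, vals));
        }
        System.out.println(data); // Instances.toString() is ARFF format
    }
}
```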
Related
I have a binary trace file. I want to convert it to a text file and convert the data inside it to decimal form, but I am not sure how to do this. The .trc file contains data in the form of telegrams, and I want to extract a particular kind of telegram and save it in a human-readable text file. I have to do all of this in C++.
Would you suggest another language for this, or does anyone have an idea how to do it in C++?
Binary trace files are usually encoded in proprietary formats. And there are applications or profilers specifically built to parse them.
Unless you know the file format, the only way to decode it is through reverse engineering. And in most cases it's not worth the effort.
Try to find documentation about it, or an application or utility that loads the file and exports the data in an easier-to-read form.
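Until you know the telegram layout, about all you can do generically is dump the raw bytes as decimal values and look for patterns. A minimal C++ sketch (file names are illustrative; nothing here assumes anything about the .trc format):

```cpp
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    // Read the whole binary file into memory
    std::ifstream in("capture.trc", std::ios::binary);
    std::vector<unsigned char> buf((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());

    // Write each byte as a decimal value, 16 per line, to a text file
    std::ofstream out("capture.txt");
    for (std::size_t i = 0; i < buf.size(); ++i)
        out << static_cast<int>(buf[i]) << (i % 16 == 15 ? '\n' : ' ');
    out << '\n';
}
```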
In case you are speaking about .trc binary files from Teledyne LeCroy oscilloscopes, I would suggest using any of the following libraries:
https://pypi.org/project/lecroyparser/
https://github.com/jneer/lecroy-reader
https://github.com/yetifrisstlama/readTrc
https://igit.ific.uv.es/ferhue/lecroyparser
The C++ examples of MXNet contain model training examples for MNISTIter and the MNIST data set (.idx3-ubyte or .idx1-ubyte). However, the same code actually recommends using the im2rec tool to produce the data, and that tool produces a different format, .rec. It looks like the .rec format contains images and labels in the same file, because im2rec takes a prepared .lst file with both (a number, a label, and an image file name per line).
I have written code like
auto val_iter = MXDataIter("ImageRecordIter");
setDataIter(&val_iter, "Train",
            vector<string>{"output_train.rec", "output_validate.rec"}, batch_size);
with all files present, but it fails because four files are still required in the vector (segmentation fault). But why? Should the labels not be inside the file now?
Digging more into the code, I found that setDataIter actually just sets the parameters. The parameters for ImageRecordIter can be found here. I tried setting parameters like path_imgrec and path.imgrec and then calling .CreateDataIter(), but none of this helped: segmentation fault on the first attempt to use the iterator.
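For reference, the direct-configuration pattern from the cpp-package examples that I was trying to follow looks roughly like this (shapes, file names, and parameter values here are illustrative):

```cpp
// Configuring ImageRecordIter directly, as in the mxnet-cpp examples.
#include "mxnet-cpp/MxNetCpp.h"
using namespace mxnet::cpp;

auto train_iter = MXDataIter("ImageRecordIter")
    .SetParam("path_imgrec", "output_train.rec")
    .SetParam("data_shape", Shape(1, 28, 28))  // channels, height, width
    .SetParam("batch_size", 32)
    .SetParam("label_width", 1)
    .CreateDataIter();
```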
I was not able to find a single example anywhere on the Internet of training an MXNet neural network in C++ using the .rec file format for the training and validation sets. Is it possible? The only workaround I found was to use the original MNIST tools, which produce files covered by the MNIST examples.
Eventually I used Mnisten to produce a matching data set, so my input format is now the same as the MXNet examples use. Mnisten is a good tool to work with; just do not forget that it normalizes grayscale pixels into the 0..1 range (no longer 0..255).
It is a command line tool, but with all the C++ code available (and there is not really a lot of it), the converter can also be integrated into the project's existing code to handle various specifics. I am not affiliated with this project.
I have a system consisting of parameters in Access, which are read by an R script, which then starts an Rmarkdown report. In Rmarkdown, a Stata script is built, which reads a data file and creates a graph specified by the Access parameters. To get the Stata graph into the report, I have to store it as a PNG file and link to this file in the Rmarkdown code. Finally, the report is rendered as a Word file (using knitr and Pandoc).
In the present setup, I have several places in the report where a graph can be called for. I can create a single PNG file for each of these places, I know the filenames (controlled by the Access parameters), and I link to each file using the standard markdown syntax ![](path/to/filename.png). This works properly.
The next development step is that in each place, I need to create an unknown and varying number of PNG files (up to ca. 20 files). I will do this in Stata. The problem is to link to a varying number of files in the Rmd code. I haven't found a way to do this, and need advice on how.
I have some ideas for a solution, but I cannot find the commands or syntax to implement them. I have read the Introduction to Rmarkdown from Rstudio.com, and the Rmarkdown Reference Guide (5 pages) from the same source. I am rather new to both R and Rmarkdown, so I might have overlooked or not understood that there is a solution.
Is it possible to set up a loop or branch (e.g. "if", "for" or "while") in Rmarkdown? Then I could loop over the current number of files, or branch around unused file links.
Can I fetch all files in a certain directory, e.g. by making a link containing wildcards in the filename? Or is there another way of achieving this?
Is there a way of having links to files that do not exist in the present run, without crashing the program? Then I could set up enough links to cover all foreseeable cases.
Or, does anyone have other suggestions?
Sure, you could use a loop like
```{r, results="asis"}
# emit one markdown image link per PNG found in the directory
files <- list.files(path = '/path/to/your/pngdirectory/',
                    pattern = '\\.png$', full.names = TRUE)
for (f in files) cat(paste0('![](', f, ')\n'))
```
If you want to filter for certain PNG files, you can extend the pattern argument to a more sophisticated regular expression. For example, if I only want PNG files containing '2017-07-11' in their name, I would use
list.files(path = '/Users/martin/Dropbox/Screenshots',
           pattern = '.*2017-07-11.*\\.png$', full.names = TRUE)
where .* matches any sequence of characters.
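Alternatively, knitr::include_graphics() accepts a character vector of paths, so you can skip building the markdown by hand:

```{r}
# one image is emitted per element of the vector
files <- list.files(path = '/path/to/your/pngdirectory/',
                    pattern = '\\.png$', full.names = TRUE)
knitr::include_graphics(files)
```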
Good day.
Apologies for my English; it is not my native language, so please excuse any errors.
I have a text file with data, and after processing it I want to get an .arff file, the file type that WEKA uses.
I do not want to generate a single file; I want to get two files, one for training the model (training) and another for testing it (test).
I have done this directly in WEKA by applying the StringToWordVector filter, but the problem is that when I use the second file for testing I get an error, because it is not valid to test the model with a file whose words produce attributes that the model does not know about.
I would appreciate any help.
Thank you and best regards.
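For what it's worth, the usual WEKA answer to this mismatch is batch filtering: build the StringToWordVector dictionary from the training file only, then push the test file through the same filter object, so both come out with identical attributes. A sketch, assuming files named train.arff and test.arff (weka.classifiers.meta.FilteredClassifier achieves the same thing implicitly):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilter {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");
        Instances test  = DataSource.read("test.arff");

        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);                 // dictionary built from training data only
        Instances newTrain = Filter.useFilter(train, filter);
        Instances newTest  = Filter.useFilter(test, filter); // same dictionary reused

        System.out.println(newTrain.numAttributes() == newTest.numAttributes()); // true
    }
}
```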
I have a Qt project which uses XML files. Those XML files contain human-readable text and this text should be translated by using the Qt tools (lupdate, lrelease, QtLinguist).
The question is whether it is possible to generate entries in the .ts file via lupdate without duplicating the strings from the XML files in a source code file using the QT_TR_NOOP() macro and friends. Or, in general: how do you translate strings in non-source files in Qt projects?
We had the same problem: XML files containing human-readable strings.
Our solution was to make sure the human-readable strings in the XML files were easy to extract (we put them in a LABEL attribute), and we developed a small tool that parses the XML files, extracts the strings, generates a context (from data in the XML file), and then generates a C++ header file containing a list of QT_TR_NOOP() entries.
This header file was then added to the project file (.pro) that lupdate processes.
This solution worked well for us, but we had to be very careful about two things:
run this tool each time the content of an XML file changes.
make sure the XML files are UTF-8 encoded.
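To illustrate, the generated header might look something like the sketch below. The contexts and strings are invented for the example; QT_TRANSLATE_NOOP is the variant of QT_TR_NOOP that takes the context explicitly, which fits here since the tool derives a context from each XML file.

```cpp
// xml_strings.h -- generated from the XML files, do not edit by hand.
// Contexts come from the XML file, strings from its LABEL attributes.
#include <QtGlobal>

static const char *xml_labels[] = {
    QT_TRANSLATE_NOOP("MachineConfig", "Measurement interval"),
    QT_TRANSLATE_NOOP("MachineConfig", "Output directory"),
    QT_TRANSLATE_NOOP("MachineConfig", "Start measurement"),
};
```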
You can translate anything you want at runtime by using tr(), as long as the .qm file has a matching translation/context. It shouldn't make any difference whether lupdate extracted it or not.
I don't know how to make lupdate extract strings from arbitrary XML, but that doesn't mean you can't use Linguist.
.ts files are also XML; it should be easy to write an XSLT that transforms your XML into a .ts file. If you want to target something standard instead of just Qt, lupdate (and Linguist) can also process XLIFF files.
You can have multiple .ts files; just install a separate QTranslator (via QTranslator::load) for each resulting .qm when setting things up.
If you really want to have it all in one file for the translator, have your XSLT copy the lupdate-generated file into its output.
As long as you use a context name that doesn't duplicate one used in the source code, this shouldn't be any different (from Qt's point of view) from the way many apps load a .qm file for each DLL that has a GUI.
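A minimal sketch of loading the two .qm files (file and context names are made up for the example; each file needs its own QTranslator instance, since a second load on the same object replaces the first):

```cpp
#include <QApplication>
#include <QTranslator>

int main(int argc, char *argv[]) {
    QApplication app(argc, argv);

    // Translations extracted by lupdate from the source code
    QTranslator codeTranslator;
    if (codeTranslator.load("myapp_de.qm"))
        app.installTranslator(&codeTranslator);

    // Translations generated from the XML strings (via the XSLT-built .ts)
    QTranslator xmlTranslator;
    if (xmlTranslator.load("xmlstrings_de.qm"))
        app.installTranslator(&xmlTranslator);

    // At runtime, strings read from the XML can be translated with:
    //   QCoreApplication::translate("MachineConfig", rawStringFromXml);
    return app.exec();
}
```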