Uploading XES files to WEKA

I want to use the tool WEKA for process prediction. Since many event logs are in the XES format, I need to convert XES to a WEKA-readable format (e.g. ARFF, CSV, JSON).
Is there a tool or some code I can use?

You can use the pm4py library in Python to convert the XES log into a CSV file.
from pm4py.objects.log.importer.xes import importer as xes_importer
from pm4py.objects.conversion.log import converter as log_converter

# import the XES event log and convert it into a pandas dataframe
log = xes_importer.apply('<path_to_xes_file.xes>')
dataframe = log_converter.apply(log, variant=log_converter.Variants.TO_DATA_FRAME)
dataframe.to_csv('<path_to_csv_file.csv>')
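If you specifically want ARFF instead of CSV, you can convert the same dataframe with a small extra step. This is only a sketch: it assumes the third-party liac-arff package (imported as arff) and that treating every column as a string attribute is acceptable.
import arff  # assumption: the liac-arff package, installed with "pip install liac-arff"

# build an ARFF document from the dataframe, treating every column as a string attribute
arff_obj = {
    'description': '',
    'relation': 'event_log',
    'attributes': [(col, 'STRING') for col in dataframe.columns],
    'data': dataframe.astype(str).values.tolist(),
}
with open('<path_to_arff_file.arff>', 'w') as f:
    arff.dump(arff_obj, f)
WEKA can also open the CSV file directly, so the ARFF conversion is optional.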

Related

What format is GCP's "Longrunningrecognize" output data in?

I'm using Google Cloud's sample_long_running_recognize() to get audio transcripts as well as speaker diarization, but the output that it gives me is in LongRunningRecognizeResponse format which looks very similar to JSON, but not quite. How can I export the output of the LongRunningRecognizeResponse so I can put it in a pandas df?
I've tried to export it with
out = open(audio_in_file_path + "outputs/" + audio_in_file_name + "_out.json" , "w+")
out.write(response) # response is the output fyi
out.close()
but the data is not actually in JSON format, so it breaks everything. I'm able to inspect the data in the console by calling the objects inside response with something like response.results[1].alternatives[0], but I would much rather have it in a df.
Thanks in advance!
Indeed the data is similar in structure to JSON, since the Cloud Speech-to-Text API response is in JSON format.
However, the Python client library creates a Python object [1] from the JSON response, which is harder to serialize back to JSON automatically. You can build a JSON-serializable object that is easy to save by iterating through the structure of LongRunningRecognizeResponse, for example like this:
import json

rows = []
for res in response.results:
    for alt in res.alternatives:
        rows.append({"transcript": alt.transcript, "confidence": alt.confidence})

with open("results.json", "w+") as file:
    json.dump(rows, file)
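Since the original goal was a pandas DataFrame, the same rows list can also be loaded into one directly; a minimal sketch:
import pandas as pd

# each dict in rows becomes one row of the DataFrame
df = pd.DataFrame(rows)
print(df.head())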
[1] https://googleapis.dev/python/speech/latest/gapic/v1/types.html#google.cloud.speech_v1.types.LongRunningRecognizeResponse

PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)

I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686 which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a (string or BINARY) column.
And referencing this is an email https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ#mail.gmail.com%3E
which uses Java rather than PySpark in its example:
Configuration conf = new Configuration();
conf.set("parquet.strings.signed-min-max.enabled", "true");
Path inputPath = new Path(input);
FileStatus inputFileStatus = inputPath.getFileSystem(conf).getFileStatus(inputPath);
List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
I've been unable to set this value in pyspark (perhaps I'm setting it in the wrong place?)
example dataframe
import random
import string
from pyspark.sql.types import StringType
r = []
for x in range(2000):
    r.append(u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)))
df = spark.createDataFrame(r, StringType())
I've tried a few different ways of setting this option:
df.write.format("parquet").option("parquet.strings.signed-min-max.enabled", "true").save("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", "true").parquet("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", True).parquet("s3a://test.bucket/option")
But all of the saved parquet files are missing the ST/STATS for the BINARY column. Here is an example output of the metadata from one of the parquet files:
creator: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"value","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
value: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:33 TS:515
---------------------------------------------------------------------------------------------------
Also, based on this email chain https://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C9DEF4C39-DFC2-411B-8987-5B9C33842974#videoamp.com%3E and question: Specify Parquet properties pyspark
I tried sneaking the config in through the pyspark private API:
spark.sparkContext._jsc.hadoopConfiguration().setBoolean("parquet.strings.signed-min-max.enabled", True)
So I am still unable to set the parquet.strings.signed-min-max.enabled option in parquet-mr (or it is set, but something else has gone wrong).
Is it possible to configure parquet-mr from pyspark?
Does pyspark 2.3.x support BINARY column stats?
How do I take advantage of the PARQUET-686 feature to add min/max metadata for string columns in a parquet file?
Historically, Parquet writers wrote incorrect min/max values for UTF-8 strings (the bytes were compared as signed), so newer Parquet implementations skip those stats during reading unless parquet.strings.signed-min-max.enabled is set. So this setting is a read option that tells the Parquet library to trust the min/max values in spite of their known deficiency. The only case when this setting can be safely enabled is if the strings only contain ASCII characters, because the corresponding bytes will never be negative.
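Because it is a read option, it belongs on the reader's Hadoop configuration rather than on the DataFrameWriter. Whether Spark's own Parquet reader honors it depends on the Spark/parquet-mr version, so treat the following only as a sketch of where the property would go:
# set the property on the Hadoop configuration used for reading (not writing)
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.strings.signed-min-max.enabled", "true")
df = spark.read.parquet("s3a://test.bucket/option")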
Since you use parquet-tools for dumping the statistics and parquet-tools itself uses the Parquet library, it will ignore string min/max statistics by default. Although it seems that there are no min/max values in the file, in reality they are there, but get ignored.
The proper solution for this problem is PARQUET-1025, which introduces new statistics fields min-value and max-value. These handle UTF-8 strings correctly.

Parallel excel sheet read from dask

Hello. All the examples I have come across for using dask so far involve multiple CSV files in a folder being read with dask's read_csv call.
If I am given an xlsx file with multiple tabs, can I use anything in dask to read them in parallel?
P.S. I am using pandas 0.19.2 with Python 2.7
For those using Python 3.6:
# reading the file using dask
import pandas as pd
import dask
import dask.dataframe as dd

excel_file = 'my_file.xlsx'  # path to the xlsx workbook
# wrap pandas.read_excel in a delayed task (sheet_name=0 reads only the first sheet)
parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0, usecols=[1, 2, 7])
df = dd.from_delayed(parts)
print(df.head())
I'm seeing a 50% speed increase on load on an i7, 16GB 5th Gen machine.
A simple example
# assumes the imports from the snippet above (pandas as pd, dask, dask.dataframe as dd)
fn = 'my_file.xlsx'
# one delayed pandas.read_excel task per sheet
parts = [dask.delayed(pd.read_excel)(fn, i, **other_options)
         for i in range(number_of_sheets)]
# the first sheet is computed once to infer the column structure (meta)
df = dd.from_delayed(parts, meta=parts[0].compute())
This assumes you provide the "other options" needed to extract the data (which is uniform across the sheets) and that you want to make a single master dataframe out of the set.
Note that I don't know the internals of the excel reader, so how parallel the reading/parsing part would be is uncertain, but subsequent computations once the data are in memory would definitely run in parallel.
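For example (a sketch with a hypothetical column name), any follow-up computation on the combined dask dataframe is scheduled across the per-sheet partitions:
# 'amount' is a hypothetical column; each sheet becomes one partition of df
total = df['amount'].sum().compute()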

Nest pandas dataframes in a Python object

I am new to Python; I've been using Matlab for a long time. Most of the features that Python offers outperform those of Matlab, but I still miss some of the features of Matlab structures!
Is there a similar way of grouping independent pandas dataframes into a single object? This would be convenient since I sometimes have to read data from different locations, and I would like to obtain as many independent dataframes as there are locations, ideally grouped into a single object.
Thanks!
I am not sure that I fully understand your question, but this is where I think you are going.
You can use many of the different python data structures to organize pandas dataframes into a single group (List, Dictionary, Tuple). The list is most common, but a dictionary would also work well if you need to call them by name later on rather than position.
Note: this example uses CSV files, but they could be any IO source that pandas supports (CSV, Excel, txt, or even a call to a database).
import pandas as pd

files = ['File1.csv', 'File2.csv', 'File3.csv']
# one independent dataframe per file
frames = [pd.read_csv(file) for file in files]
# optionally combine them into a single dataframe
single_df = pd.concat(frames)
You can use each frame independently by calling it from the list. The following would return the File1.csv dataframe
frames[0]
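If you would rather look the dataframes up by location name, the dictionary approach mentioned above works well; a minimal sketch with hypothetical location names:
import pandas as pd

# hypothetical mapping of location name -> file
location_files = {'north': 'File1.csv', 'south': 'File2.csv', 'east': 'File3.csv'}
frames_by_location = {name: pd.read_csv(path) for name, path in location_files.items()}

# access an individual dataframe by its location name
frames_by_location['north'].head()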

Monitor training/validation process in Caffe

I'm training Caffe Reference Model for classifying images.
My work requires me to monitor the training process by drawing a graph of the model's accuracy every 1000 iterations on the entire training set and validation set, which have 100K and 50K images respectively.
Right now, I'm taking the naive approach: make a snapshot every 1000 iterations, then run the C++ classification code, which reads raw JPEG images, forwards them through the net, and outputs the predicted labels. However, this takes too much time on my machine (with a GeForce GTX 560 Ti).
Is there any faster way that I can do to have the graph of accuracy of the snapshot models on both training and validation sets?
I was thinking about using LMDB format instead of raw images. However, I cannot find documentation/code about doing classification in C++ using LMDB format.
1) You can use the NVIDIA-DIGITS app to monitor your networks. They provide a GUI including dataset preparation, model selection, and learning curve visualization. Moreover, they use a Caffe distribution that allows multi-GPU training.
2) Or, you can simply use the log parser that ships with Caffe.
/pathtocaffe/build/tools/caffe train --solver=solver.prototxt 2>&1 | tee lenet_train.log
This saves the training log to "lenet_train.log". Then, by running:
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log .
you parse the training log into two CSV files, containing the train and test loss. You can then plot them using the following Python script:
import pandas as pd
import matplotlib.pyplot as plt

# csv files produced by parse_log.py
train_log = pd.read_csv("./lenet_train.log.train")
test_log = pd.read_csv("./lenet_train.log.test")

_, ax1 = plt.subplots(figsize=(15, 10))
ax2 = ax1.twinx()
ax1.plot(train_log["NumIters"], train_log["loss"], alpha=0.4)
ax1.plot(test_log["NumIters"], test_log["loss"], 'g')
ax2.plot(test_log["NumIters"], test_log["acc"], 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
plt.savefig("./train_test_image.png")  # save the figure as a png
Caffe creates a log each time you train something, and it's located in the tmp folder (on both Linux and Windows).
I also wrote a plotting script in Python which you can easily use to visualize your loss/accuracy.
Just place your training logs with the .log extension next to the script and double-click on it.
You can use the command prompt as well, but for ease of use, when executed it loads all the logs (*.log) it can find in the current directory.
It also shows the top 4 accuracies and at which iterations they were achieved.
You can find it here: https://gist.github.com/Coderx7/03f46cb24dcf4127d6fa66d08126fa3b
The command
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log
produces the following error:
usage: parse_log.py [-h] [--verbose] [--delimiter DELIMITER]
                    logfile_path output_dir
parse_log.py: error: too few arguments
Solution:
For successful execution of the parse_log.py command, you should pass two arguments:
the log file
the path of the output directory
So the correct command is as follows:
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log output_dir