How to handle an input file with non-standard delimiters in a DSX ML pipeline? - data-science-experience

I'm trying to work with a data set that has no header and has :: for field delimiters:
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .
! head ratings.dat
The output:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
I have loaded the file into my DSX pipeline, but I am unclear how to get DSX to split this file using the :: delimiters.
How do I do this?
If it is not possible to get DSX to reshape this file using the DSX ML pipeline functionality, does DSX have any prerequisites in terms of input file format?
Update:
The ML pipeline functionality I'm trying to use can be seen in the screenshot below.
I have added a data set, but can't figure out how to get DSX to recognise the field delimiters:

As of Feb-2017...
When you create a new pipeline and select a dataset, I believe DSX loads the file you select using a Spark DataFrameReader. The DataFrameReader defaults to a single comma (,) as the delimiter, and DSX does not provide a way to change the default delimiter in the UI.
I think preprocessing the data is your best option. You can do this in a notebook. Be aware that the Spark DataFrameReader only supports a single-character delimiter, so you can't use it with this particular dataset. You can use pandas, however:
import pandas as pd
# The two-character '::' separator requires the python parsing engine.
pdf = pd.read_csv('ml-1m/ratings.dat', sep='::',
                  header=None,
                  names=['UserID','MovieID','Rating','Timestamp'],
                  engine='python')
# Write a plain comma-separated file that DSX can read with its defaults.
pdf.to_csv('ratings.csv', index=False)
!head ratings.csv
UserID,MovieID,Rating,Timestamp
1,1193,5,978300760
1,661,3,978302109
1,914,3,978301968
1,3408,4,978300275
1,2355,5,978824291
1,1197,3,978302268
1,1287,5,978302039
1,2804,5,978300719
1,594,4,978302268
Now the data will be in a format that DSX will be able to parse properly.
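As a quick sanity check (not part of the original answer), you could read the converted file back with Spark in the same notebook. This assumes a Spark 2.x session is already available as spark:
# Hypothetical check: the converted CSV should now parse with the default comma delimiter.
df = spark.read.csv('ratings.csv', header=True, inferSchema=True)
df.show(5)
df.printSchema()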

Related

Blankspace and colon not found in firstline

I have a Jupyter notebook in SageMaker in which I want to run the XGBoost algorithm. The data has to meet 3 criteria:
-No header row
-Outcome variable in the first column, features in the rest of the columns
-All columns need to be numeric
The error I get is the following:
Error for Training job xgboost-2019-03-13-16-21-25-000:
Failed Reason: ClientError: Blankspace and colon not found in firstline
'0.0,0.0,99.0,314.07,1.0,0.0,0.0,0.0,0.48027846,0.0...' of file 'train.csv'
In the error itself it can be seen that there is no header, the outcome variable is in the first column (it only takes 1.0 and 0.0 values), and all features are numeric. The data is stored in its own bucket.
I have seen a related question on GitHub, but there is no solution there. Also, the example notebook that Amazon provides does not change the default separator or anything else when saving a dataframe to CSV for later use.
The error message indicates that XGBoost expected the input data set in libsvm format instead of CSV. By default, SageMaker XGBoost assumes the input data set is in libsvm format. To use a CSV input data set, explicitly specify the content type as text/csv.
For more information: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost
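For reference, here is a minimal sketch of passing the content type with the SageMaker Python SDK v2; the S3 paths and the estimator variable are placeholders, not from the original post (SDK v1 used sagemaker.s3_input instead):
import sagemaker
from sagemaker.inputs import TrainingInput
# Placeholder S3 locations for the CSV files (no header row, label in the first column).
train_input = TrainingInput(s3_data="s3://my-bucket/train/train.csv",
                            content_type="text/csv")  # tell XGBoost the data is CSV, not libsvm
validation_input = TrainingInput(s3_data="s3://my-bucket/validation/validation.csv",
                                 content_type="text/csv")
# `estimator` is assumed to be a sagemaker.estimator.Estimator built for the XGBoost image.
# estimator.fit({"train": train_input, "validation": validation_input})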

Consolidate file write and read together

I am writing a Python script to write data to a Vertica DB. I use the official library vertica_db_client. For some reason, if I use the built-in cur.executemany method, it takes a long time to complete (40+ seconds per 1k entries). The recommendation I got was to first save the data to a file, then use the COPY method. Here is the save-to-a-csv-file part:
import csv
with open('/data/dscp.csv', 'w') as out:
    csv_out = csv.writer(out)
    # header row
    csv_out.writerow(("time_stamp", "subscriber", "ip_address", "remote_address", "signature_service_name", "dscp_out", "bytes_in", "bytes_out"))
    for row in data:
        csv_out.writerow(row)
My data is a list of tuples, for example:
[
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452),
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.21', 'SIP', 26, 4270, 5212),
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.129', '18.215.140.51', 'HTTP2 over TLS', 0, 14378, 5291)
]
Then, in order to use the COPY method, I have to (at least based on their instructions: https://www.vertica.com/docs/9.1.x/HTML/python_client/loadingdata_copystdin.html) read the file first and then do a COPY FROM STDIN. Here is my code:
f = open("/data/dscp.csv", "r")
cur.stdin = f
cur.execute("""COPY pason.dscp FROM STDIN DELIMITER ','""")
Here is the code for connecting to the DB, in case it is relevant to the problem:
import vertica_db_client
user = 'dbadmin'
pwd = 'xxx'
database = 'xxx'
host = 'xxx'
db = vertica_db_client.connect(database=database, user=user, password=pwd, host=host)
cur = db.cursor()
So clearly it is a waste of effort to first save and then read... What is the best way to consolidate the two parts?
If anyone can tell me why my executemany was slow, that would be equally helpful!
Thanks!
First of all, yes, writing the data to a file first is both the recommended way and the most efficient way. It may seem inefficient at first, but writing the data to a file on disk takes next to no time at all, and Vertica is not optimized for many individual INSERT statements. Bulk loading is the fastest way to get large amounts of data into Vertica. Not only that, but when you do many individual INSERT statements, you could potentially run into ROS pushback issues, and even if you don't, there will be extra load on the database when the ROS containers are merged after the load.
You could convert your array of tuples to one large string variable and then print the string to the console.
The string would look something like:
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.21', 'SIP', 26, 4270, 5212
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.129', '18.215.140.51', 'HTTP2 over TLS', 0, 14378, 5291
But instead of actually printing it to the console, you could just pipe it into a VSQL command.
$ python my_script.py | vsql -U dbadmin -d xxx -h xxx -c "COPY pason.dscp FROM STDIN DELIMITER ','"
This may not be efficient though. I don't have much experience with exceedingly long string variables in python.
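For what it's worth, here is a minimal sketch of the pipe-into-vsql idea that streams rows to stdout instead of building one large string, which sidesteps that concern. The script name and the data variable mirror the question; nothing here comes from the original answer:
# my_script.py (hypothetical): write the rows as CSV to stdout so the output
# can be piped into vsql's COPY ... FROM STDIN, as in the command above.
import csv
import sys
data = [
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452),
]
writer = csv.writer(sys.stdout)
for row in data:
    writer.writerow(row)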
Secondly, the vertica_db_client is no longer being actively developed by Vertica. While it will still be supported at least until the Python 2 end of life, you should be using vertica_python.
You can install vertica_python with pip.
$ pip install vertica_python
or
$ pip3 install vertica_python
depending on which version of Python you want to use it with.
You can also build it from source; the code can be found on Vertica's GitHub page: https://github.com/vertica/vertica-python/
As for using the COPY command with vertica_python, see the answer in this question here: Import Data to SQL using Python
I have used several python libraries to connect to Vertica and vertica_python is by far my favorite, and ever since Vertica took over the development from Uber it has continued to improve on a very regular basis.
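For reference, a minimal sketch of a COPY with vertica_python might look like the following. The connection details are placeholders, pason.dscp and the CSV path come from the question, and SKIP 1 assumes you want to skip the header row written earlier; this is an illustration, not code from the linked answer:
import vertica_python
conn_info = {'host': 'xxx', 'user': 'dbadmin', 'password': 'xxx', 'database': 'xxx'}
with vertica_python.connect(**conn_info) as connection:
    cur = connection.cursor()
    with open('/data/dscp.csv', 'r') as fs:
        # cursor.copy() streams the file straight into the target table.
        cur.copy("COPY pason.dscp FROM STDIN DELIMITER ',' SKIP 1", fs)
    connection.commit()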

How to merge multiple parquet files into a single parquet file using Linux or HDFS commands?

I have multiple small Parquet files generated as the output of a Hive QL job, and I would like to merge the output files into a single Parquet file.
What is the best way to do it using some HDFS or Linux commands?
We used to merge text files using the cat command, but will this work for Parquet as well?
Can we do it using HiveQL itself when writing the output files, like we do with the repartition or coalesce method in Spark?
According to https://issues.apache.org/jira/browse/PARQUET-460, you can download the source code and compile parquet-tools, which has a built-in merge command:
java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_idr/file_name
Or using a tool like https://github.com/stripe/herringbone
You can also do it using HiveQL itself, if your execution engine is MapReduce.
You can set a flag for your query, which causes Hive to merge small files at the end of your job:
SET hive.merge.mapredfiles=true;
or
SET hive.merge.mapfiles=true;
if your job is a map-only job.
This will cause the Hive job to automatically merge many small Parquet files into fewer big files. You can control the number of output files by adjusting the hive.merge.size.per.task setting. If you want to have just one file, make sure you set it to a value that is always larger than the size of your output. Also, make sure to adjust hive.merge.smallfiles.avgsize accordingly; set it to a very low value if you want to make sure that Hive always merges files. You can read more about these settings in the Hive documentation.
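Purely as an illustration (the byte values below are made up, not recommendations), the settings could be combined like this:
SET hive.merge.mapredfiles=true;
-- target size (in bytes) of each merged file; set it above your total output size for a single file
SET hive.merge.size.per.task=256000000;
-- merging kicks in when the average output file size is below this threshold
SET hive.merge.smallfiles.avgsize=128000000;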
Using DuckDB:
import duckdb
duckdb.execute("""
COPY (SELECT * FROM '*.parquet') TO 'merge.parquet' (FORMAT 'parquet');
""")

Monitor training/validation process in Caffe

I'm training the Caffe Reference Model for classifying images.
My work requires me to monitor the training process by drawing a graph of the model's accuracy after every 1000 iterations on the entire training set and validation set, which have 100K and 50K images respectively.
Right now, I'm taking the naive approach: make a snapshot after every 1000 iterations, then run the C++ classification code, which reads a raw JPEG image, forwards it through the net, and outputs the predicted labels. However, this takes too much time on my machine (with a GeForce GTX 560 Ti).
Is there any faster way to get the graph of accuracy of the snapshot models on both the training and validation sets?
I was thinking about using LMDB format instead of raw images. However, I cannot find documentation/code about doing classification in C++ using LMDB format.
1) You can use the NVIDIA DIGITS app to monitor your networks. It provides a GUI including dataset preparation, model selection, and learning curve visualization. Moreover, it uses a Caffe distribution that allows multi-GPU training.
2) Or, you can simply use the log parser inside Caffe.
/pathtocaffe/build/tools/caffe train --solver=solver.prototxt 2>&1 | tee lenet_train.log
This saves the training log into "lenet_train.log". Then, by using:
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log .
you parse your training log into two CSV files containing the train and test loss. You can then plot them using the following Python script:
import pandas as pd
import matplotlib.pyplot as plt
# Load the two CSV files produced by parse_log.py
train_log = pd.read_csv("./lenet_train.log.train")
test_log = pd.read_csv("./lenet_train.log.test")
_, ax1 = plt.subplots(figsize=(15, 10))
ax2 = ax1.twinx()  # second y-axis for accuracy
ax1.plot(train_log["NumIters"], train_log["loss"], alpha=0.4)
ax1.plot(test_log["NumIters"], test_log["loss"], 'g')
ax2.plot(test_log["NumIters"], test_log["acc"], 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
plt.savefig("./train_test_image.png")  # save image as png
Caffe creates logs each time you try to train something, and they are located in the tmp folder (on both Linux and Windows).
I also wrote a plotting script in Python which you can easily use to visualize your loss/accuracy.
Just place your training logs with the .log extension next to the script and double-click on it.
You can use the command prompt as well, but for ease of use, when executed it loads all the logs (*.log) it can find in the current directory.
It also shows the top 4 accuracies and the iterations at which they were achieved.
You can find it here: https://gist.github.com/Coderx7/03f46cb24dcf4127d6fa66d08126fa3b
The command
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log
produces the following error:
usage: parse_log.py [-h] [--verbose] [--delimiter DELIMITER]
                    logfile_path output_dir
parse_log.py: error: too few arguments
Solution:
For successful execution of parse_log.py, we need to pass two arguments:
-the log file
-the path of the output directory
So the correct command is as follows:
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log output_dir

Weka: Src and Dest differ in # of attributes after I do feature selection on the training set

I am trying to use Weka to classify text. What I do is this:
I create one big ARFF file with all of the data: all_of_it.arff.
I split that data into training and test: train.arff and test.arff.
I do feature selection on the training set and output a new training file: train_fs.arff.
I build a classifier with only those selected features.
And the problem is.....
I don't quite know how to standardize the test set to use only the features I selected from the training set. Something like: create a new test file from test.arff according to train_fs.arff.
I tried using
java -cp weka.jar weka.filters.unsupervised.attribute.Standardize -b -i train_fs.arff -o train2.arff -r test.arff -s test2.arff
but I got the infamous "Src and Dest differ in # of attributes" error.
Is there any way to normalize/standardize the sets according to an ARFF file (namely my new training data with fewer features)? I don't see how to do this with the Standardize or StringToWordVector filter.
Batch filtering is one solution to your problem.
Pros:
It will apply the same filter to your test dataset as you apply to your training dataset, so when you perform feature selection, the two datasets will remain compatible.
Cons:
It is only available from the command line interface or Weka's Java API
The two datasets must be filtered at the same time
You can read more about Batch filtering here.
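As an illustration only (not from the original answer), a batch-filtered attribute selection could look roughly like the command below, assuming CfsSubsetEval with a BestFirst search was used for the feature selection; swap in whichever evaluator and search method you actually used:
java -cp weka.jar weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.CfsSubsetEval" -S "weka.attributeSelection.BestFirst" -b -i train.arff -o train_fs.arff -r test.arff -s test_fs.arff
The -b flag runs the filter in batch mode, so the attribute mapping derived from train.arff is applied unchanged to test.arff, keeping the two files compatible.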
You may also want to look into InputMappedClassifier. It is a wrapper classifier that addresses incompatible training and testing data.