Running pig in local mode - hdfs

I am a beginner to Apache Pig and have a slight confusion about the following.
I am trying to run Pig in local mode using
pig -x local
Now I am trying this simple code:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grouped = group dividends by symbol;
avg = foreach grouped generate group, AVG(dividends.dividend);
store avg into 'average_dividend';
A folder named average_dividend is created on my machine.
Now, as per the book, if I want to run it in local mode I have to use the following syntax:
pig_path/bin/pig -x local average_dividend.pig
But where is the file average_dividend.pig (i.e., where does it get created)?

I assume you are trying to run one of the examples from Programming Pig.
First locate average_dividend.pig in the directory where you extracted the book's code.
Since you are working in local mode, you have to set the full path to NYSE_dividends, e.g.:
load '/home/user/programmingpig-master/data/NYSE_dividends'
Also set the output directory (which shouldn't already exist) to where you want to save the result, e.g.:
store avg into '/home/user/output';
Then issue:
pig_path/bin/pig -x local -f average_dividend.pig

Related

Call the added file in Hive to use for UDF

I have a file which contains holidays, and it is required by a UDF that uses this file to calculate the business days between two dates. The issue I have is that when I add the file, it goes to a working directory, but this directory differs every session.
This is unlike the example below from Hive Resources:
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> select from networks a
MAP a.networkid
USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
This is what I am getting, and the alphanumeric part keeps changing:
/mnt/tmp/a17b43d5-df53-4eea-8e2c-565471b49d25_resources/holiday2021.csv
I need this file to be located in a more permanent folder, so that this Hive SQL can be executed on any of the 18 nodes.

Consolidate file write and read together

I am writing a Python script to write data to a Vertica DB. I use the official library vertica_db_client. For some reason, the built-in cur.executemany method takes a long time to complete (40+ seconds per 1k entries). The recommendation I got was to first save the data to a file, then use the COPY method. Here is the save-to-a-CSV-file part:
import csv

with open('/data/dscp.csv', 'w') as out:
    csv_out = csv.writer(out)
    # header row
    csv_out.writerow(("time_stamp", "subscriber", "ip_address", "remote_address",
                      "signature_service_name", "dscp_out", "bytes_in", "bytes_out"))
    for row in data:
        csv_out.writerow(row)
My data is a list of tuples; for example:
[
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452),
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.21', 'SIP', 26, 4270, 5212),
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.129', '18.215.140.51', 'HTTP2 over TLS', 0, 14378, 5291)
]
Then, in order to use the COPY method, I have to (at least based on their instructions: https://www.vertica.com/docs/9.1.x/HTML/python_client/loadingdata_copystdin.html) read the file first and then do COPY FROM STDIN. Here is my code:
f = open("/data/dscp.csv", "r")
cur.stdin = f
cur.execute("""COPY pason.dscp FROM STDIN DELIMITER ','""")
Here is the code for connecting to the DB, in case it is relevant to the problem:
import vertica_db_client
user = 'dbadmin'
pwd = 'xxx'
database = 'xxx'
host = 'xxx'
db = vertica_db_client.connect(database=database, user=user, password=pwd, host=host)
cur = db.cursor()
So it clearly seems a waste of effort to first save and then read the data... What is the best way to consolidate the writing and reading parts?
If anyone can tell me why my executemany was slow, that would be equally helpful!
Thanks!
First of all, yes, writing the data to a file first is both the recommended way and the most efficient way. It may seem wasteful at first, but writing the data to a file on disk takes next to no time, and Vertica is not optimized for many individual INSERT statements. Bulk loading is the fastest way to get large amounts of data into Vertica. Not only that, but when you do many individual INSERT statements, you could potentially run into ROS pushback issues, and even if you don't, there will be extra load on the database when the ROS containers are merged after the load.
You could convert your array of tuples to a large string variable and then print the string to the console.
The string would look something like:
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.21', 'SIP', 26, 4270, 5212
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.129', '18.215.140.51', 'HTTP2 over TLS', 0, 14378, 5291
But instead of actually printing it to the console, you could just pipe it into a VSQL command.
$ python my_script.py | vsql -U dbadmin -d xxx -h xxx -c "COPY pason.dscp FROM STDIN DELIMITER ','"
This may not be the most efficient approach, though; I don't have much experience with exceedingly long string variables in Python.
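As a rough sketch of what such a script could look like (the file name my_script.py comes from the command above, the data literal is just illustrative, and instead of building one huge string in memory it streams plain, unquoted CSV to stdout, which is what COPY with DELIMITER ',' expects):

# my_script.py -- stream the rows as CSV to stdout so the output can be piped
# into vsql's COPY ... FROM STDIN (no temporary file needed)
import csv
import sys

data = [
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452),
    # ... the rest of the tuples ...
]

writer = csv.writer(sys.stdout)
for row in data:
    writer.writerow(row)    # one CSV line per tuple, no header row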
Secondly, the vertica_db_client library is no longer being actively developed by Vertica. While it will still be supported at least until the Python 2 end of life, you should be using vertica_python.
You can install vertica_python with pip.
$ pip install vertica_python
or
$ pip3 install vertica_python
depending on which version of Python you want to use it with.
You can also build it from source; the code can be found on Vertica's GitHub page: https://github.com/vertica/vertica-python/
As for using the COPY command with vertica_python, see the answer in this question here: Import Data to SQL using Python
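For reference, here is a minimal sketch of what that usually looks like with vertica_python's cursor.copy() (the connection values are placeholders, the table and file names are taken from the question, and you should check the vertica_python README for the exact options):

import vertica_python

conn_info = {
    'host': 'xxx',        # placeholders -- use your real connection details
    'user': 'dbadmin',
    'password': 'xxx',
    'database': 'xxx',
}

with vertica_python.connect(**conn_info) as connection:
    cur = connection.cursor()
    # cursor.copy() streams a file-like object to COPY ... FROM STDIN
    with open('/data/dscp.csv', 'rb') as fs:
        cur.copy("COPY pason.dscp FROM STDIN DELIMITER ','", fs)
    connection.commit()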
I have used several Python libraries to connect to Vertica, and vertica_python is by far my favorite; ever since Vertica took over its development from Uber, it has continued to improve on a very regular basis.

How to post Stata program via Dropbox or private website?

Here is a sample program .do file, sampleprog.do:
program sampleprog
egen newVar = group (`1' `2')
end
How can I post it on my website (or dropbox), so that other people could install it to their Stata like this?
net from http://www.mywebsite.com/sampleprog.do
*** or maybe like this:
ssc install ...
I read the documentation about stata.toc... but I did not quite get it. What files should I upload, and should they all go in one folder?
(PS: I definitely can simply email the .do file but this is not an option in my case.)
Here is a full explanation of how to share program or data files with others using your own website. I tried using Dropbox, but Stata 12 appears to have issues with https, which is the protocol for all Dropbox public links. If you want to use Dropbox, I recommend creating a shared folder that will sync on your collaborators' machines. The rest of this answer assumes you have a website serving pages over http or are using Stata 13, which supports https.
If this is a one-time thing, you can skip the rest of this answer by putting the file on your website and telling your collaborator to type:
. copy http://your-site.com/ado/program.ado program.ado
That will copy the ado file at the specified URL into the user's current directory. If you want to provide information about your files, plan on sharing with multiple people, and need to maintain/document a set of files, read on!
Step 1 Create a folder on your website to hold the programs. I will call mine ado/
Step 2 Add the program files, help files, and data files you want to share. For this example, I have created a simple ado file called unique.ado with the following contents:
********************************************** unique.ado
capture program drop unique
program define unique
*! Count and number observations within group defined by varlist
* Example: unique person_id, obs(prow) tobs(pcount) sortby(time)
* to count and number rows by a variable called person_id
syntax varlist, obs(name) tobs(name) [sortby(varlist)]
bys `varlist' (`sortby') : gen long `obs' = _n
bys `varlist' (`sortby') : gen long `tobs' = _N
la var `obs' "Number of this row within `varlist' group."
la var `tobs' "Total number of rows with identical `varlist' values."
end
Step 3 Create a file called stata.toc to describe the files you wish to share. Here is mine:
********************************************** stata.toc
v 3
d Program to count observations by group
p unique [The unique.ado program for counting observations by group]
These files can be complicated. There are many features I won't cover here, but you can read this documentation to learn more.
Step 4 Create a package file for each of the packages defined by the lines in stata.toc that start with the letter p. Here is my package file for the unique package defined above:
********************************************** unique.pkg
v 3
d unique
d Program to count observations by group
d Distribution-Date: 28 June 2012
f unique.ado
Your directory now looks like this:
ado/
stata.toc
unique.ado
unique.pkg
Step 5 Use the site! Here are the commands to enter.
. net from http://example.com/ado/
. net describe unique
. net install unique
Here is what you'll see after entering the first command:
-----------------------------------------------------------------------------------
http://www.example.com/ado/
Program to count observations by group
-----------------------------------------------------------------------------------
PACKAGES you could -net describe-:
unique [The unique.ado program for counting observations by group]
-----------------------------------------------------------------------------------
The second command, net describe unique, will tell you more about the package:
---------------------------------------------------------------------------------------
package unique from http://www.example.com/ado
---------------------------------------------------------------------------------------
TITLE
unique
DESCRIPTION/AUTHOR(S)
Program to count observations by group
Distribution-Date: 28 June 2012
INSTALLATION FILES (type net install unique)
unique.ado
---------------------------------------------------------------------------------------
The third command, net install unique, will install the package:
checking unique consistency and verifying not already installed...
installing into /Users/cpoliquin/Library/Application Support/Stata/ado/plus/...
installation complete.
EDIT
See Nick's comments in the answer below. I intended this example to be simple and I don't expect other people to use this program. If you plan on submitting things to Stata Journal or SSC then his comments certainly apply! I hope this answer can serve as a decent tutorial for those confused by the official documentation.
This will be too long for a comment, so it is going to be an extra answer.
Your example uses the program name unique. If you run search unique, all (or in Stata 13, search unique), you will find that a user-written program with the same name has been installed on SSC since 1998. This will create a clash of names for your users if (and only if) they attempt to use your program and also that earlier program. The more general advice is to search to see whether a program name is already in use, to try to avoid these problems.
Specifically, although you may just be using your unique as an arbitrary example, note that it contains bugs. An int doesn't contain enough bits to hold observation numbers exactly for large datasets. Also, as a matter of style, unique can change the sort order of your data, which is widely considered poor data management.
Your example concerns dissemination of a program file without an accompanying help file. Suffice it to say that the SSC site would never accept such a program and the Stata Journal would not even review a paper based on such a submission before a help file was written to accompany it. Including explanatory comments with the code may be sufficient for your personal practices, but it falls below general Stata standards.
Stata 13 now supports https. See http://www.stata.com/manuals13/u.pdf, Section 3.6.
In short, I appreciate that you are trying to explain how to do something, but it is already well documented, and explicitly and implicitly some of your recommendations are below community standards.

SAS: Set current folder to the folder containing the running program

I've just started learning SAS because I'm required to use it for a statistics course. For this course, the university provides SAS 9.2 through their virtual-machine setup: I make a reservation in their system, they generate a VM on one of their servers, and I connect to the VM using Microsoft's Remote Desktop client. The virtual machines are generated and erased per session; settings are reset every time, and files must be stored on my client computer (which is accessible in the VM by a UNC path).
Within this setup, when I open a program file stored on my laptop, I've only been able to access the accompanying data files (each stored in the same folder as the program) either by hardcoding the full path or by updating the "current folder" setting at the beginning of each session. The first is problematic because it means the program won't run anywhere else - in particular, when I email it to the professor. The second is inconvenient, because browsing to this particular UNC path is time consuming, and I already have to browse to the same path to open the program file.
I want to make this easier by programmatically setting the current folder to the folder containing the program. Then I could just open the file and get to work. I've found some examples of getting the filename of the program file, of getting the path to a fileref, and of (link limit exceeded) setting the current folder, but I haven't been able to combine them in the right way. Please connect the dots for me.
To programmatically change the Windows current directory from SAS, you can use the X command, which is what really happens when you use the "Change current folder" dialog box:
x 'cd "\\computername\share name\folder"';
You can also do this using the SYSTEM data step function, a method I prefer because you get a return code (but more typing of course):
data _null_;
   rc = system( 'cd "\\computername\share name\folder"' );
   if rc = 0
      then putlog 'Command successful';
      else putlog 'Command failed';
run;
Note the UNC path is surrounded with double-quotes, which is necessary if the path contains blanks.
Of course, this still requires you to manually type in the command, but it might be something you could add to the program source code. If your VM environment allowed you to maintain some permanent presence on the server, you could save this command into a start-up file.
I would ask your professor for advice; if you are working with data given to you as part of your class, you may only need to send the source code. On the other hand, if you are creating output data as part of your assignment, your professor might want you to deliver source code and SAS data sets. Surely he or she will have some procedure.
Complete Answer:
SAS's obtuse notation requires some strange delimiter fiddling to combine my partial solution (finding the path) with @Bob Duell's partial solution (setting the current folder). There seem to be two key rules involved:
&var is expanded in double-quoted strings ("&var"), but not single-quoted strings ('&var')
Quotes in &var are not treated as delimiters after expansion
So the solution is to compute a string of the quoted path (where the quotes are part of the string), and expand that within a double-quoted parameter to X or SYSTEM:
%let qsrc=%str(%")&src%str(%");
X "cd &qsrc"
It's not required to store the string, both &src and &qsrc can be expanded in-place, which yields a single statement solution:
X "cd %str(%")%substr(%sysget(SAS_EXECFILEPATH),1,%eval(%length(%sysget(SAS_EXECFILEPATH))-%length(%sysget(SAS_EXECFILENAME))))%str(%")";
This executes correctly, but breaks the syntax coloring in the GUI. Within a string, %str(%") and "" both expand to ", so replacing %str(%") with "" both executes correctly and is colored correctly in the GUI:
X "cd ""%substr(%sysget(SAS_EXECFILEPATH),1,%eval(%length(%sysget(SAS_EXECFILEPATH))-%length(%sysget(SAS_EXECFILENAME))))""";
This inherits the limitation that it only works when SAS_EXECFILEPATH and SAS_EXECFILENAME are defined, which is the case when running from within the Windows GUI editor. It's also subject to any limitations of the "cd" command, which SAS intercepts rather than passing to the Windows command line. I expect it will fail on paths containing quotes.
A partial answer: One way to get the containing folder from the filename of the program file
Spread out & logging steps:
/* Find PathName of folder containing program */
%let FullName=%sysget(SAS_EXECFILEPATH);
%put FullName: &FullName.;
%let FullLen=%length(&FullName);
%put FullLen: &FullLen.;
%let BaseName=%sysget(SAS_EXECFILENAME);
%put BaseName: &BaseName.;
%let BaseLen=%length(&BaseName);
%put BaseLen: &BaseLen.;
%let PathLen=%eval(&FullLen.-&BaseLen.);
%put PathLen: &PathLen.;
%let PathName=%substr(&FullName,1,&PathLen);
%put PathName: &PathName.;
Consolidated & silent:
/* Find src folder */
%let src=%substr(%sysget(SAS_EXECFILEPATH),1,%eval(%length(%sysget(SAS_EXECFILEPATH))-%length(%sysget(SAS_EXECFILENAME))));
This only works when SAS_EXECFILEPATH and SAS_EXECFILENAME are defined, and it's not clear when that is. It does work when using the Windows GUI editor.

Building compatible datasets for Weka for large, evolving data

I have a largish dataset that I am using Weka to explore. It goes like this: today I will analyze as much data as I can, and create a trained classifier. I'll save this model as a file. Then tomorrow I will acquire a new batch of data, and want to use the saved model to predict the class for the new data. This repeats every day. Eventually I will update the saved model, but for now assume that it is static.
Due to the size and frequency of this task, I want to run this automatically, which means the command line or similar. However, my problem exists in the Explorer, as well.
My question has to do with the fact that, as my dataset grows, the list of possible labels for attributes also grows. Weka says such attribute lists cannot change, or the training set and test set are said to be incompatible (see: http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F). But in my world there is no way that I could possibly know today all the attribute labels that I will stumble across next week.
To rectify the situation, it is suggested that I run batch filtering (http://weka.wikispaces.com/How+do+I+generate+compatible+train+and+test+sets+that+get+processed+with+a+filter%3F). Okay, that appears to mean that I need to re-build my model with the refiltered training data each day.
At this point the whole thing seems difficult enough that I fear I am making a horrible, simple newbie mistake, and so I ask for help.
DETAILS:
The model was created by
java -Xmx1280M weka.classifiers.meta.FilteredClassifier ^
-t .\training.arff -d .\my.model -c 15 ^
-F "weka.filters.supervised.attribute.Discretize -R first-last" ^
-W weka.classifiers.trees.J48 -- -C 0.25 -M 2
Naively, to predict I would try:
java -Xmx1280M weka.core.converters.DatabaseLoader ^
-url jdbc:odbc:(database) ^
-user (user) ^
-password (password) ^
-Q "exec (my_stored_procedure) '1/1/2012', '1/2/2012' " ^
> .\NextDay.arff
And then:
java -Xmx1280M weka.classifiers.trees.J48 ^
-T .\NextDay.arff ^
-l .\my.model ^
-c 15 ^
-p 0 ^
> .\MyPredictions.txt
This yields:
java.lang.Exception: training and test set are not compatible
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1035)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
A related question is asked at kdkeys.net/training-and-test-set-are-not-compatible-weka/
An associated problem is that the command-line version of the database extraction requires generation of a temporary .arff file, and it appears JDBC-generated arff files do not handle "date" data correctly. My database generates dates of the ISO-8601 format "yyyy-MM-dd'T'HH:mm:ss" but both Explorer and generated .arff files from JDBC data represent these as type NOMINAL. And so the list of labels for date attributes in the header is very, very long and never the same from dataset to dataset.
I'm not a java or python programmer, but if that's what it takes, I'll go buy some books! Thanks in advance.
I think you can use incremental classifiers, but only a few classifiers support this option. Classifiers like SMO and J48 won't support it, so you will have to use some other classifier.
To learn more, visit:
http://weka.wikispaces.com/Classifying+large+datasets
http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka
There is a bigger problem with your plan, too, it seems. If you have data from day 1 and use it to build a model, and then use that model on data from day n that has new, never-before-seen class labels, it will be impossible to predict the new labels because there is no training data for them. Similarly, if you have new attributes, it will be impossible to use those for classification because none of your training data has them to associate with the class labels.
Thus, if you want to use a model trained on data with only a subset of the new data's attributes/classes, then you might as well filter the new data to remove the new classes/attributes since they wouldn't be used even if you could execute weka without errors on two dissimilar datasets.
If it's not in your training set, exclude it from your test set. Then everything should work. If you need to be able to test/predict on it, then you need to retrain a new model that has examples of the new classes/attributes.
Doing this in your environment might require manually querying data out of the database into arff files, so as to query out only the attributes/classes that were in the training set. Look into sql and any major scripting language (e.g. perl, python) to do this without much fuss.
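As a rough illustration of that kind of filtering in Python (every file, column, and label name below is hypothetical, and in practice the rows would come from your database query rather than an existing CSV):

import csv

# Columns and class labels that were present in the training set (example values)
TRAIN_COLUMNS = ["time_stamp", "feature_a", "feature_b", "label"]
TRAIN_LABELS = {"classA", "classB", "classC"}

with open("new_data.csv", newline="") as src, open("new_data_filtered.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=TRAIN_COLUMNS)
    writer.writeheader()
    for row in reader:
        if row["label"] not in TRAIN_LABELS:
            continue                        # drop rows with never-before-seen classes
        writer.writerow({col: row[col] for col in TRAIN_COLUMNS})  # keep only training-set columns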
The university that maintains Weka also created MOA (Massive Online Analysis) to analyse and solve this kind of problem. All of its classifiers are updatable, and you can compare classifier performance over time for your data flow. It also allows you to detect changes of models (concept drift/shift) and optimize (i.e., limit) your data window over time (a forget-old-data mechanism...).
Once you're done with testing and tuning in MOA, you can then use MOA classifiers from Weka (there is an extension to enable this) and batch your whole process.