Python program to extend short urls that integrates with Stata - python-2.7

I have a dataset containing thousands of tweets. Some of those contain urls but most of them are in the classical shortened forms used in Twitter. I need something that gets the full urls so that I can check the presence of some particular websites. I have solved the problem in Python like this:
import urllib2
url_filename='C:\Users\Monica\Documents\Pythonfiles\urlstrial.txt'
url_filename2='C:\Users\Monica\Documents\Pythonfiles\output_file.txt'
url_file= open(url_filename, 'r')
out = open(url_filename2, 'w')
for line in url_file:
    tco_url = line.strip('\n')
    req = urllib2.urlopen(tco_url)
    print >>out, req.url
url_file.close()
out.close()
Which works but requires that I export my urls from Stata to a .txt file and then reimport the full urls. Is there some version of my Python script that would allow me to integrate the task in Stata using the shell command? I have quite a lot of different .dta files and I would ideally like to avoid appending them all just to execute this task.
Thanks in advance for any answer!

Sure, this is possible without leaving Stata. I am using a Mac running OS X. The details might differ on your operating system, which I am guessing is Windows.
Python and Stata Method
Say we have the following trivial Python program, called hello.py:
#!/usr/bin/env python
import csv
data = [['name', 'message'], ['Monica', 'Hello World!']]
with open('data.csv', 'w') as wsock:
    wtr = csv.writer(wsock)
    for i in data:
        wtr.writerow(i)
# no explicit close needed: the with block closes the file on exit
This "program" just writes some fake data to a file called data.csv in the script's working directory. Now make sure the script is executable: chmod 755 hello.py.
From within Stata, you can do the following:
! ./hello.py
* The above line called the Python program, which created a data.csv file.
insheet using data.csv, comma clear names case
list
     +-----------------------+
     |   name        message |
     |-----------------------|
  1. | Monica   Hello World! |
     +-----------------------+
This is a simple example. The full process for your case will be:
1. Write a file to disk with the URLs, using outsheet or some other command
2. Use ! to call the Python script
3. Read the output into Stata using insheet or infile or some other command
4. Clean up by deleting the files with capture erase my_file_on_disk.csv
Let me know if that is not clear. It works fine on *nix; as I said, Windows might be a little different. If I had a Windows box I would test it.
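As a rough sketch of what the Python side of those steps might look like: the script below takes the input and output file names as command-line arguments, so the same script could be reused for each .dta file. The script name expand_urls.py, the function names, and the column names short_url and long_url are made up for illustration, and the network call has not been tested against t.co.

```python
import csv
import sys

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def resolve(short_url, timeout=10):
    """Follow redirects and return the final URL, or '' if the request fails."""
    try:
        response = urlopen(short_url, timeout=timeout)
        try:
            return response.geturl()
        finally:
            response.close()
    except Exception:
        return ''


def expand_file(in_path, out_path, resolver=resolve):
    """Read one short URL per line and write short_url,long_url rows as CSV."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        # lineterminator='\n' avoids doubled carriage returns in text mode
        writer = csv.writer(dst, lineterminator='\n')
        writer.writerow(['short_url', 'long_url'])
        for line in src:
            short_url = line.strip()
            if short_url:  # skip blank lines
                writer.writerow([short_url, resolver(short_url)])


if __name__ == '__main__' and len(sys.argv) == 3:
    expand_file(sys.argv[1], sys.argv[2])
```

From Stata, assuming python is on the path, this could be called with something like ! python expand_urls.py urls.txt urls_long.csv and the result read back with insheet. A URL that cannot be resolved gets an empty long_url rather than stopping the whole run.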
Pure Stata Solution (kind of a hack)
Also, I think what you want to accomplish can be done completely in Stata, but it's a hack. Here are two programs. The first simply opens a log file and makes a request for the url (which is the first argument). The second reads that log file and uses regular expressions to find the url that Stata was redirected to.
capture program drop geturl
program define geturl
    * pass short url as first argument (e.g. http://bit.ly/162VWRZ)
    capture erase temp_log.txt
    log using temp_log.txt
    copy `1' temp_web_file
end
The above program will not finish because the copy command will fail (intentionally). It also doesn't clean up after itself (intentionally). So I created the next program to read what happened (and get the URL redirect).
capture program drop longurl
program define longurl, rclass
    * find the url in the log file created by geturl
    capture log close
    loc long_url = ""
    file open urlfile using temp_log.txt , read
    file read urlfile line
    while r(eof) == 0 {
        if regexm("`line'", "server says file permanently redirected to (.+)") == 1 {
            loc long_url = regexs(1)
        }
        file read urlfile line
    }
    file close urlfile
    return local url "`long_url'"
end
You can use it like this:
geturl http://bit.ly/162VWRZ
longurl
di "The long url is: `r(url)'"
* The long url is: http://www.ciwati.it/2013/06/10/wdays/?utm_source=twitterfeed&
* > utm_medium=twitter
You should run them one after the other. Things might get ugly using this solution, but it does find the URL you are looking for. May I suggest that another approach is to contact the shortening service and ask nicely for some data?
If someone at Stata is reading this, it would be nice to have copy return HTTP response header information. Doing this entirely in Stata is a little out there. Personally I would use entirely Python for this sort of thing and use Stata for the analysis of data once I had everything I needed.
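If the whole task is done in Python, the final step from the question (checking whether an expanded URL belongs to a particular website) might look roughly like the sketch below. The function name is_from_site is made up for illustration, and matching on the host name is only one possible definition of "belongs to".

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_from_site(url, domain):
    """Return True if the URL's host is the domain or one of its subdomains."""
    host = (urlparse(url).hostname or '').lower()
    domain = domain.lower()
    return host == domain or host.endswith('.' + domain)


long_url = ('http://www.ciwati.it/2013/06/10/wdays/'
            '?utm_source=twitterfeed&utm_medium=twitter')
print(is_from_site(long_url, 'ciwati.it'))
```

Comparing host names rather than searching the whole string avoids false matches when the domain only appears in the path or query string.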


Read-in df from csv before launching main app | Dash

I am trying to get my first dashboard with python dash running.
The whole thing is very similar to this https://github.com/dkrizman/dash-manufacture-spc-dashboard.
At the beginning a Dataframe is read in from a csv. My problem seems to be quite easy to solve but somehow I am not succeeding:
I want to create an initial window that lets the user select (e.g. from a dropdown) the csv file, or its path, to read in. All the .csv files have the same layout and differ only in their values.
When I use the modal components I run into problems installing Bootstrap, so I wondered whether there is an easier way.
Thanks for your help!
Best,
Nik

Is there a way to solve Stata's r(601) while looping the append for imported excel files?

I am trying to append multiple Excel files into a large database by executing the following code:
cls
set more off
clear all
global route = "C:\Users\NICOLE\Desktop\CAR"
cd "$route"
tempfile buildDB
save `buildDB', emptyok
local filenames : dir "$route" files "*.xlsx"
display `filenames'
foreach f of local filenames {
    import excel using `"`f'"' ,firstrow allstring clear
    gen source = `"`f'"'
    append using `buildDB'
    save `"`buildDB'"', replace
}
save "C:\Users\NICOLE\Desktop\CAR\DB_EG-RAC.dta" ,replace
Stata manages to append all of the files, but it also displays the following error message:
file C:\Users\NICOLE.xlsx not found r(601);
I do not know how to solve it, and it stops my code from running as it should. Thanks!
We have deadlock here. On the face of it the filename in question is not one you write in your code, but could only be part of the result of
local filenames : dir "$route" files "*.xlsx"
But the file named in the error message isn't even in the directory named in your code. Moreover, you are adamant that the file doesn't exist, and Stata, according to your error report, can't find it.
The question still remains: how does Stata get asked to open a file that supposedly doesn't exist?
My only guesses are feeble:
Code you are not showing is responsible.
You are running slightly different versions of this script in different places and getting confused. Can you replicate this error that you did get once all over again? Have you searched everywhere remotely possible on the C: drive for this file nicole.xlsx?
It is crucial to realise that we can test nothing here. The problem has not been presented reproducibly.

Rename file after putHDFS

I have an Apache NiFi job where I get a file from the system using GetFile and then use PutHDFS. How can I rename the file in HDFS after putting it in Hadoop?
I tried to use the ExecuteScript processor but can't get it to work:
flowFile = session.get()
if flowFile != None:
    tempFileName= flowFile.getAttribute("filename")
    fileName=tempFileName.replace('._COPYING_','')
    flowFile = session.putAttribute(flowFile, 'filename', fileName)
    session.transfer(flowFile, REL_SUCCESS)
The answer above by Shu is correct for how to manipulate the filename attribute in NiFi, but if you have already written a file to HDFS and then use UpdateAttribute, it is not going to change the name of the file in HDFS, it will only change the value of the filename attribute in NiFi.
You could use the UpdateAttribute approach to create a new attribute called "final.filename" and then use MoveHDFS to move the original file to the final file.
Also of note, the PutHDFS processor already writes a temp file and moves it to the final file, so I'm not sure it is necessary for you to use the "._COPYING_" name yourself. For example, if you send a flow file to PutHDFS with a filename of "foo", it will first write ".foo" to the directory and, when done, move it to "foo".
The only case where you need to use MoveHDFS is if some other process is monitoring the directory and can't ignore the dot files, then you write it somewhere else and use MoveHDFS once it is complete.
Instead of using the ExecuteScript processor (extra overhead), use the UpdateAttribute processor and feed it the success relationship from PutHDFS.
Add a new property in the UpdateAttribute processor:
filename
${filename:replaceAll('<regex_expression>','<replacement_value>')}
This uses the replaceAll function from the NiFi Expression Language.
(or)
Using the replace function:
filename
${filename:replace('<search_string>','<replacement_value>')}
The NiFi Expression Language offers other functions for manipulating strings; refer to the Expression Language documentation for details.
I tried the exact script from the question in an ExecuteScript processor with Python as the Script Engine, and everything works as expected.
Because you are using the .replace function and replacing with '', the filename fn._COPYING_ is changed to fn.
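For reference, the string manipulation in the script can be checked outside NiFi with plain Python. This is only a sketch of the replace step: the function name strip_copying_marker is made up for illustration, and session and REL_SUCCESS are not involved because they exist only inside the ExecuteScript processor.

```python
def strip_copying_marker(filename, marker='._COPYING_'):
    """Return the filename with every occurrence of the marker removed."""
    return filename.replace(marker, '')


print(strip_copying_marker('fn._COPYING_'))
```

As noted above, this changes only the name held in the attribute; it does not rename a file that has already been written to HDFS.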

Python Error "[errno 2] system cannot find the path specified"

I am trying to write Python code to download some data. The code works for one page structure but not for another; it gives me this error, which I don't understand. I wrote the code in Sublime Text 3 and run it from DOS. The Python version is 2.7.11.
from bs4 import BeautifulSoup
import urllib
import re
url= raw_input("http://physics.iitd.ac.in/content/list-faculty-members")
html=urllib.urlopen(url).read()
soup=BeautifulSoup(html)
table = soup.find("table", attrs={"border":"0","width":"100%","cellpadding":"10"})
head=soup.find("h2",attrs={"class":"title style3"})
ready= table.find_all("tr")
header=head.find("big").find("strong")
datasets=[]
quest=[]
s=[]
test=header.get_text()
quest.append(test)
for b in ready:
    x=[td.get_text() for td in b.find_all("td")]
    dataset =[strong.get_text() for strong in b.find("td").find("a").find_all("strong")]
    datasets.append(dataset)
    quest.append(x)
print quest
The fact that it states cannot find the file specified: '' means that you're trying to open a file specified by an empty string!
It's a little hard to help much further since we don't have the code. The code you have included cannot be the code that generated that screenshot since the screenshot would have included a prompt as shown (the argument to the raw_input() call).
Clarifying that point, if that string you appear to have entered was actually entered, there would be no problem.
Calling urlopen() will in turn call FancyURLopener.open(). FancyURLopener is a descendant of URLopener and inherits open() from it, so that's the function that receives control.
That function will intelligently select the function to use based on the scheme given.
The fact that it's choosing the file scheme rather than the HTTP one, and the fact that the exception complains about the file being an empty string, means that you are not passing in what you think you are.
The error message you're seeing, and the stack trace, can only occur if the following line fails (see open_local_file() here):
stats = os.stat(localname)
The stat call is only for local files, not URLs.
So you should be concentrating your effort: why is the string empty?
The most likely theory is that the code that produced the screenshot had a different URL in the raw_input prompt, and that's what we're seeing as the prompt in the screenshot.
That would mean you simply pressed ENTER, perhaps thinking it had helpfully provided that URL as a default. That ENTER would then be taken as an empty string, which would explain both the scheme selection and the empty string being used as a file name.
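One way to catch this kind of mistake early is to check what was typed before passing it to urlopen(). The sketch below is only an illustration: require_http_url is a made-up helper name, and rejecting everything except http and https is an assumption about what the script needs.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def require_http_url(text):
    """Return the stripped URL, or raise ValueError if it is not http(s)."""
    url = text.strip()
    parts = urlparse(url)
    if parts.scheme not in ('http', 'https') or not parts.netloc:
        raise ValueError('expected an http(s) URL, got %r' % url)
    return url


try:
    require_http_url('')  # what pressing ENTER at the prompt would produce
except ValueError as err:
    print(err)
```

With a check like this, an empty input fails with a message about the URL instead of falling through to the local-file handler.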
I copied the URL to test:
url = "http://www.che.iitb.ac.in/online/people/viewstudents?filter0=**ALL**&op1=&filter1=2010"
urllib.urlopen(url)
In fact, this URL is correctly parsed as "http", but your error message tells us that your URL is being parsed as "file", so you need to show us the real URL or code you are using.
My python version is 2.7.5.

Basics of writing plugins for Jedit

Can anyone direct me to a tutorial on writing plugins for Jedit? I have a pipedream of using Jedit as an editor for SAS. Currently, it does syntax highlighting, but I feel it could be made better by fleshing out a few ideas.
A couple questions:
Can you enable tab completion in Jedit?
Can you specify "environments" that begin and end with certain syntax? (For instance, the word "keep" makes sense between the lines data xxx; and run; but not between proc sort data=xxx; and run;, so highlighting it there would mislead inexperienced coders.)
Can you store variables in a work place and reference them from a drop down menu (such as variable names in a dataset)
Can you execute code from the shell/terminal and pipe .log files back into the Jedit message window?
Are you talking about something like Microsoft's Intellisense or autocomplete? If so, a poor-man's approximation to auto-complete is to use the keyboard shortcut ctrl+b after typing in part of the word. It will complete the word based on all the words from all buffers that are open. See this question for more on autocomplete.
In your syntax highlighting, you can create delegate syntax for different chunks of code so that it will be highlighted according to different rules. grep in your jedit's mode directory for "delegate".
Not exactly sure what you want, but jedit does keep track of a bunch of your latest copies from the text. Emacs calls this a "kill ring". For my jedit setup, I have Paste Previous... bound to ctrl+e ctrl+v. I believe that is the default shortcut binding. This will show you your last ~20 copies of text chunks and you can select which copy text chunk you want to use.
Yes, you can execute tasks in the shell and pipe them back into jedit. See this question. The following is how I do bk edit and reload a buffer. It doesn't get output from the shell, but it does execute a shell command:
import javax.swing.JOptionPane;
import java.io.File;
File f = new File(buffer.getPath());
String SCCS_path = f.getParent()+"/SCCS";
String bk_path = "/usr/local/bin/bk";
if ( !new File(SCCS_path).exists()) {
    bk_path = "/usr/bin/bk";
}
Runtime.getRuntime().exec(
    bk_path+ " edit "+
    buffer.getPath());
Thread.currentThread().sleep(2000);
buffer.reload(view);
Btw, macros are very powerful in jedit. You can record what you are doing in jedit with Macros->Record Macro...and it will generate the equivalent script.