Using NLTK in AWS Glue - amazon-web-services

I'm struggling to get a script working and wondering if anyone else has successfully done this.
I'm using Glue to execute a Spark script and am trying to use the NLTK module to analyze some text. I've been able to import the NLTK module by uploading it to S3 and referencing that location in the Glue additional Python modules config. However, I'm using the word_tokenize method, which requires the punkt library to be downloaded into the nltk_data directory.
I've followed this (Download a folder from S3 using Boto3) to copy the punkt files to the tmp directory in Glue. However, if I look into the tmp folder in an interactive Glue session I don't see the files. When I run the word_tokenize method I get an error saying that the package can't be found in the default locations (variations of /usr/nltk_data).
I'm going to move the required files into the nltk package in S3 and try to re-write the nltk tokenizer to load the files directly instead of from the nltk_data location. But I wanted to check here first whether anyone was able to get this working, as this seems like a fairly common need.

I have limited experience with NLTK, but I think the nltk.download() will put punkt in the right spot.
import nltk
print('nltk.__version__', nltk.__version__)
nltk.download('punkt')
from nltk import word_tokenize
print(word_tokenize('Glue is good, but it has some rough edges'))
From the logs
nltk.__version__ 3.6.3
[nltk_data] Downloading package punkt to /home/spark/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
['Glue', 'is', 'good', ',', 'but', 'it', 'has', 'some', 'rough', 'edges']

I wanted to follow up here in case anyone else encounters these issues and can't find a working solution.
After leaving this project alone for a while I finally came back and was able to get a working solution. Initially I was adding my tmp location to the nltk_data path and downloading the required packages there. However, this wasn't working.
nltk.data.path.append("/tmp/nltk_data")
nltk.download("punkt", download_dir="/tmp/nltk_data")
nltk.download("averaged_perceptron_tagger", download_dir="/tmp/nltk_data")
Ultimately, I believe the issue was that the file I needed from punkt was not available on the worker nodes. Using the addFile method I was finally able to use nltk data.
sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
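For reference, a minimal sketch of one way a worker can then reach that shipped file (this is my guess at the usage, not something the post spells out; note that english.pickle holds punkt's sentence tokenizer, so it splits sentences rather than words):
import pickle
from pyspark import SparkFiles

def split_sentences(text):
    # SparkFiles.get resolves the local copy that sc.addFile shipped to this worker
    with open(SparkFiles.get('english.pickle'), 'rb') as f:
        punkt = pickle.load(f)  # a pre-trained PunktSentenceTokenizer
    return punkt.tokenize(text)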
The next issue I had was that I was trying to call a UDF from a .withColumn() method to get the nouns for each row. The issue here is that withColumn requires that a column be passed, but nltk will only work with string values.
Not working:
df2 = df.select(['col1','col2','col3']).filter(df['col2'].isin(date_list)).withColumn('col4', find_nouns(col('col1')))
In order to get nltk to work I passed in my full dataframe and looped over every row, using collect to get the text value of each row, then building a new dataframe and returning that with all the original columns plus the new nltk column. To me this seems incredibly inefficient, but I wasn't able to get a working solution without it.
df2 = find_nouns(df)
def find_nouns(df):
    data = []
    schema = StructType([...])
    is_noun = lambda pos: pos[:2] == 'NN'
    rows = df.collect()  # pull the rows to the driver once rather than on every iteration
    for row in rows:
        tokenized = nltk.word_tokenize(row[0])
        nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
        data.append((row[0], row[1], row[2], nouns))
    df2 = spark.createDataFrame(data=data, schema=schema)
    return df2
I'm sure there's a better solution out there, but I hope this can help someone get their project to an initial working solution.
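For comparison, the usual way around the column-versus-string mismatch is to wrap the Python function in a Spark UDF so it is applied per value; a rough sketch (the column names and date_list are carried over from the snippets above, and it still assumes the NLTK data is reachable on the worker nodes):
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType
import nltk

def extract_nouns(text):
    # Runs on the workers for each value of col1
    tokens = nltk.word_tokenize(text)
    return [word for word, pos in nltk.pos_tag(tokens) if pos[:2] == 'NN']

find_nouns_udf = udf(extract_nouns, ArrayType(StringType()))

df2 = (df.select(['col1', 'col2', 'col3'])
         .filter(df['col2'].isin(date_list))
         .withColumn('col4', find_nouns_udf(col('col1'))))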

Related

Read-in df from csv before launching main app | Dash

I am trying to get my first dashboard with python dash running.
The whole thing is very similar to this https://github.com/dkrizman/dash-manufacture-spc-dashboard.
At the beginning a Dataframe is read in from a csv. My problem seems to be quite easy to solve but somehow I am not succeeding:
I want to create an initial window that allows the user to select (e.g. from a dropdown) the csv file (or, accordingly, the path) that is read in. All the .csv files look the same but just have different values.
When using the modal components I run into problems installing bootstrap, and I thought there must be an easier way?
Thanks for your help!
Best,
Nik
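A rough sketch of the dropdown idea (the folder name, component IDs, and layout are made-up minimal-example choices in Dash 2.x style, not code from the linked dashboard):
import glob
import pandas as pd
import dash
from dash import dcc, html
from dash.dependencies import Input, Output

app = dash.Dash(__name__)
csv_files = glob.glob('data/*.csv')  # assumed folder holding the candidate csv files

app.layout = html.Div([
    dcc.Dropdown(id='csv-picker',
                 options=[{'label': f, 'value': f} for f in csv_files]),
    html.Div(id='summary'),
])

@app.callback(Output('summary', 'children'), [Input('csv-picker', 'value')])
def load_csv(path):
    # Read the selected file only after the user picks it
    if not path:
        return 'Pick a csv file'
    df = pd.read_csv(path)
    return 'Loaded {} rows from {}'.format(len(df), path)

if __name__ == '__main__':
    app.run_server(debug=True)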

Python in Knime: Downloading files and dynamically pressing them into workflow

I'm using Knime 3.1.2 on OSX and Linux for OPENMS analysis (Mass Spectrometry).
Currently, it uses static filename.mzML files manually put in a directory. It usually has more than one file pressed in at a time ('Input FileS' module not 'Input File' module) using a ZipLoopStart.
I want these files to be downloaded dynamically and then pressed into the workflow...but I'm not sure the best way to do that.
Currently, I have a Python script that downloads .gz files (from AWS S3) and then unzips them. I already have variations that can unzip the files into memory using StringIO (and maybe pass them into the workflow from there as data??).
It can also download them to a directory...which maybe can then be used as the source? But I don't know how to tell the ZipLoop to wait and check the directory after the Python script is run.
I also could have the Python script run as a separate entity (outside of KNIME) and then, once the directory is populated, call KNIME...HOWEVER, there will always be a different number of files (maybe one, maybe three)...and I don't know how to make the 'Input Files' KNIME node handle an unknown number of input files.
I hope this makes sense.
Thanks!
Thanks to Gábor for getting me on the right track, although I ended up taking a slightly different route after much experimentation.
===
Being new to Knime, I don't know if this is an efficient use of Knime, or a complete Kluge...but it does work.
So, part of the problem is some of the Knime specific objects - One of which is called URIDataValue.
A Python Pandas dataframe is, apparently, interchangeable with the Knime tables. However, I don't know if there's a way to import one of these URIDataValue objects into Python. So here's what I did...
1. I wrote a Python script that creates a Pandas Dataframe, and populates it with one Column. Everything is a string, including the column header:
from pandas import DataFrame
# Create empty table
T = DataFrame(
    [
        ['file:///Users/.../copy/lfq_spikein_dilution_1.mzML'],
        ['file:///Users/.../copy/lfq_spikein_dilution_2.mzML'],
    ],
)
T.columns = ['URIDataValue']
#print T
output_table = T
That creates this dataframe:
Note: The column name and values are just strings. But it is (apparently) important that the column header be 'URIDataValue'...even though HERE it's just text. If the column name is not 'URIDataValue' the next node doesn't know what to do.
NEXT, the 'output_table' from the 'Python Source' node is patched to a 'String to URI' node, which (apparently and magically) knows to change the entire column's string values to URIDataValues (presumably based on the name of the first column...I don't know that for sure).
Finally, the NEW table, with the correct data objects goes to a 'URI to PORT' node...since apparently 'Port' objects and a 'URI' object are different.
This, then, matches the needed input to the ZipLoop...which is normally the output from a static (hard-coded) 'Input Files' node.
Now, to actually solve the question above, I just have to add the code to my 'Python Source' to download and unzip the S3 files, then annotate the dataframe with their locations, and go.
I have no idea what I'm doing, but it worked.
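For what it's worth, a rough sketch of what that download-and-annotate step could look like inside the 'Python Source' node (the bucket, prefix, and local folder are placeholders, and this assumes boto3 and Python 3; the post does not show the actual download code):
import gzip
import os
import boto3
from pandas import DataFrame

# Hypothetical bucket/prefix and local working folder
bucket, prefix, local_dir = 'my-bucket', 'runs/', '/tmp/mzml'
os.makedirs(local_dir, exist_ok=True)

s3 = boto3.client('s3')
uris = []
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
    gz_path = os.path.join(local_dir, os.path.basename(obj['Key']))
    s3.download_file(bucket, obj['Key'], gz_path)
    mzml_path = gz_path[:-3]  # strip the .gz suffix
    with gzip.open(gz_path, 'rb') as src, open(mzml_path, 'wb') as dst:
        dst.write(src.read())
    uris.append(['file://' + mzml_path])

# Same shape as the table above: one 'URIDataValue' column of file URIs
output_table = DataFrame(uris, columns=['URIDataValue'])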
There are multiple options to let things work:
Convert the files in-memory to Binary Object cells using Python; later you can use those in KNIME. (This one I am not sure is supported, but as I remember it was demoed in one of the last KNIME gatherings.)
Save the files to a temporary folder (Create Temp Dir) using Python and connect the Python node using a flow variable connection to a file reader node in KNIME (which should work in a loop: List Files, check the Iterate List of Files metanode).
Maybe there is already S3 Remote File Handling support in KNIME, so you can do the downloading, unzipping within KNIME. (Not that I know of, but it would be nice.)
I would go with option 2, but I am not so familiar with Python, so for you, probably option 1 is the best. (In case option 3 is supported, that is the best in my opinion.)

How do you get the names of all gif images in a folder in python 2?

I have a folder like C:\Temp\My Pictures and it has a bunch of gif pictures in it. I need to be able to get the name of the gif images in a string and I have no idea how. I looked everywhere and couldn't find an answer, please help!
Try using:
import os
for file in os.listdir("C:\Temp\My Pictures"):
    if file.endswith(".gif"):
        print file
You can read more about os.listdir in the official docs here.
PotatoIng_'s answer is correct, but you may want to end up with a list of strings.
Try using this (works with Python 2 and 3):
import os
root, dirs, files = next(os.walk('C:\Temp\My Pictures'))
gifs = list(filter(lambda filename: filename.endswith('.gif'), files))
os.walk walks down the directory tree, but you only need the first result (that's the next). Now filter the files list, keeping the entries for which filename.endswith('.gif') is True. filter returns an iterator in Python 3, so use list to turn it into a list.
Result will be e.g.:
>>> gifs
['a.gif', 'b.gif', 'c.gif']
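For completeness, the same list can also be built with the glob module (the folder is just the example path from the question):
import glob
import os

folder = r'C:\Temp\My Pictures'
# glob returns full paths, so strip the directory part to match the output above
gifs = [os.path.basename(p) for p in glob.glob(os.path.join(folder, '*.gif'))]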

Can't save figure as .eps [gswin32c is not recognized]

I'm using Enthought Canopy with PyLab (64-bit). For my report I need to use LaTeX (XeLaTeX) and the plots are done with matplotlib.
To get a first idea I just copied the second example from http://matplotlib.org/users/usetex.html and compiled it. It looks fine and I can save it as a normal png without problems. However, if I try to save it as .eps or .ps it does not work and an error appears:
invalid literal for int() with base 10: "
Additionally, in the PyLab shell it shows:
'gswin32c' is not recognized as an internal or external command, operable program or batch file'.
If I save it as .pdf I have no problems except the text is all black instead of being red and blue. This is a problem because in my plots I have two axes and I need them colorized for better readability.
If I then try to delete some lines from the example given (all text) I still cannot save it as .eps or .ps. I can't figure out the problem, and all the other topics related to this have not given me any insight. So I really need your help because I can't use .png for my report.
Thank you in advance!!!
I finally managed to solve this problem. It might look weird but maybe other people can benefit from it.
The solution might depend upon the software you use. I use Enthought Canopy (Python) and MiKTeX 2.9 under Windows 8 64-bit.
If you want to output .ps and .eps files with matplotlib using the 'text.usetex': True option then you will encounter the problem posted above.
Solution:
1. Download and install Ghostscript (32-bit) from http://www.ghostscript.com/download/gsdnld.html.
2. Download ps2eps-1.68.zip from http://www.tm.uka.de/~bless/ps2eps. The procedure is given in the manual; however, I'd like to point out the part with the environment variables. For this you need to go to Control Panel --> System --> Advanced system settings, click on the 'Advanced' tab, and at the bottom of the window click 'Environment Variables'. Then use the 'New' button under User Variables for USERNAME. Type 'ps2eps' as the variable name, and as the variable value type the actual path where you have saved the ps2eps.pl file; in my case this is 'C:\Program Files (x86)\ps2eps\bin\'. You can check it by typing 'ps2eps' in the command window.
3. Download xpdfbin-win-3.03.zip from http://www.foolabs.com/xpdf/download.html. You only need the file 'pdftops.exe'. However, I could not assign a path like in step 2. I solved this by putting 'pdftops.exe' in the MiKTeX 2.9 folder; the exact location for me was 'C:\Program Files\MiKTeX 2.9\miktex\bin\x64'.
I was then able to save figures as .ps with no more error messages. Remember to use the settings proposed on http://matplotlib.org/users/usetex.html under 'postscript options'.
I myself used the following settings:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import matplotlib as mpl

mpl.rc('font', **{'family': 'serif', 'serif': ['Computer Modern Roman'],
                  'monospace': ['Computer Modern Typewriter']})
params = {'backend': 'ps',
          'text.latex.preamble': [r"\usepackage{upgreek}",
                                  r"\usepackage{siunitx}",
                                  r"\usepackage{amsmath}",
                                  r"\usepackage{amstext}"],
          'axes.labelsize': 18,
          #'axes.linewidth': 1,
          #'text.fontsize': 17,
          'legend.fontsize': 10,
          'xtick.labelsize': 13,
          #'xtick.major.width': 0.75,
          'ytick.labelsize': 13,
          'figure.figsize': [8.8, 6.8],
          #'figure.dpi': 120,
          'text.usetex': True,
          'axes.unicode_minus': True,
          'ps.usedistiller': 'xpdf'}
mpl.rcParams.update(params)
mpl.rcParams.update({'figure.autolayout': True})
(whereas many of the params are just for my own purpose later in the plots)
As a beginner I am not well informed about the dependence on the 'backend' used when running a script from the Python console. I used this without any --pylab settings beforehand, and I do not know whether one needs to switch the backend manually when already working in a console with a specific matplotlib backend.
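As a quick usage check, here is a hypothetical two-line plot (not from the original post); with 'ps.usedistiller': 'xpdf' set as above, saving to .eps goes through Ghostscript and pdftops, so both need to be on the PATH:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], label=r'$x^2$')
ax.set_xlabel(r'$x$')
ax.legend()
fig.savefig('figure.eps')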
I had the same problem, and in my case it was caused by a font adjustment in the Python code, namely:
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica']})
rc('text', usetex=True)
When I remove this it works fine, and now I can save .eps.
So first check whether a minimal working example works for you, then check the font and other style edits in your code. This may help.

how can I migrate issues from redmine to tuleap

Originally we used Redmine as our issue management system; now we are planning to migrate to Tuleap.
Both systems have features to import/export issues as .csv files.
I want to know whether there is a standard / simple way to migrate issues.
The main items inside issues are status, title and description.
What are "remaining_effort" and "cross_references" kind of data in remind ?
Since both systems can export a csv file containing the headers they need, but some headers differ, you need a script to map from one system to the other; a code snippet is shown below.
This can also work for other ALM systems if they don't support migration in the application itself.
#!/usr/bin/env python
import csv
import sys
# read sample tuleap csv header to avoid some field changes
tuleapcsvfile = open('tuleap.csv', 'rb')
reader = csv.DictReader(tuleapcsvfile)
to_del = ["remaining_effort","cross_references"]
# remove unneeded items
issueheader = [i for i in reader.fieldnames if i not in to_del]
# open stdout for output
w = csv.DictWriter(sys.stdout, fieldnames=issueheader,lineterminator="\n")
w.writeheader()
# read redmine csv files for converting
redminecsvfile = open('redmine.csv', 'rb')
redminereader = csv.DictReader(redminecsvfile)
for row in redminereader:
    newrow = {}
    if row['Status'] == 'New':
        newrow['status'] = "Not Started"
    # some simple one to one mapping
    newrow['i_want_to'] = row['Subject']
    newrow['so_that'] = row['Description']
    w.writerow(newrow)
Some items in the exported csv can't be imported back into Tuleap, like remaining_effort and cross_references.
These two items appear in the .csv file exported from Tuleap issues.
Had the same issue and the csv solution looked too limited to me:
the field matching between tracker and csv content must fit exactly
you can't import attachments
you can't link artifacts
...
Issues can be extracted from Redmine using REST API or by directly reading the SQL database. Artifacts can be created in Tuleap using the REST API. You "just" need a script in the middle to extract issues from Redmine and then import them into Tuleap.
I created such a script in Python:
It has a plugin approach so that it could import issues/bugs from any bug tracker and later save them to any other bug tracker.
For now it only supports extracting issues from the Redmine SQL database and exporting to Tuleap using the REST API.
One can extend it (new plugin) to extract issues from other trackers (bugzilla/mantis/gitlab).
One can extend it (new plugin) to generate a Tuleap xml file rather than importing the artifacts using Tuleap REST API (XML being more powerful here).
I ported hundreds of issues from Redmine to Tuleap using this and it was good enough for my needs.
Have a look at https://github.com/jpo38/TrackerIO.
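For illustration only, here is a rough sketch of the "script in the middle" idea over the two REST APIs (the instance URLs, credentials, tracker and field IDs, and the Tuleap payload layout are assumptions to adapt to your installations, not taken from TrackerIO):
import requests

REDMINE = 'https://redmine.example.com'       # assumed instance URLs
TULEAP = 'https://tuleap.example.com'
REDMINE_KEY = 'redmine-api-key'               # placeholder credentials
TULEAP_TOKEN = 'tuleap-access-token'
TRACKER_ID = 42                               # target Tuleap tracker (assumed)
FIELD_IDS = {'title': 100, 'description': 101}  # assumed Tuleap field IDs

# Pull issues from Redmine (paginated; only the first page shown here)
issues = requests.get(f'{REDMINE}/issues.json',
                      params={'limit': 100},
                      headers={'X-Redmine-API-Key': REDMINE_KEY}).json()['issues']

for issue in issues:
    # Map Redmine fields onto the Tuleap tracker fields
    payload = {
        'tracker': {'id': TRACKER_ID},
        'values': [
            {'field_id': FIELD_IDS['title'], 'value': issue['subject']},
            {'field_id': FIELD_IDS['description'], 'value': issue.get('description', '')},
        ],
    }
    r = requests.post(f'{TULEAP}/api/artifacts',
                      json=payload,
                      headers={'X-Auth-AccessKey': TULEAP_TOKEN})
    r.raise_for_status()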