Python 3.6 - Regular Expression

The following code returns a dictionary of part numbers from a spreadsheet and works as intended.
import openpyxl, os, pprint, re

wb = openpyxl.load_workbook('RiverbedInventory.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
max_row = sheet.max_row
inventory = {}
for row in range(1, max_row + 1):
    prodName = sheet['G' + str(row)].value
    inventory.setdefault(prodName, {'count': 0})
    inventory[prodName]['count'] += 1
pprint.pprint(inventory)
I'm trying to filter the results using a regular expression to only return part #s matching specific criteria (part #s that start with VCX in this case). I keep getting "TypeError: expected string or bytes-like object" failure messages. I've googled this quite a bit but can't find an answer. Here's the regular expression code I'm using:
import openpyxl, os, pprint, re

wb = openpyxl.load_workbook('RiverbedInventory.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
max_row = sheet.max_row
steelhead = re.compile(r'VCX-\d+-\w+')
inventory = {}
for row in range(1, max_row + 1):
    prodCode = sheet['G' + str(row)].value
    inventory.setdefault(prodCode, {'count': 0})
    inventory[prodCode]['count'] += 1
pprint.pprint(steelhead.findall(inventory))
[screenshot: working vs non-working output]

In steelhead.findall(inventory), you pass a dictionary where a string is expected: findall operates on a string (or bytes-like object), hence the TypeError.
You may use dictionary comprehension here:
print( {k: inventory[k] for k in inventory if steelhead.search(k)} )
See the Python 3 demo:
import re
inventory = {'UMTS-UNV-E': {'count':59}, 'VCX-020-E': {'count':2}, 'VCX-030-E': {'count':3}}
steelhead = re.compile(r'VCX-\d+-\w+')
print( {k: inventory[k] for k in inventory if steelhead.search(k)} )
Output: {'VCX-030-E': {'count': 3}, 'VCX-020-E': {'count': 2}}
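As a side note, cells in the sheet can be empty, in which case .value is None, and passing None to a regex method raises the same TypeError. A minimal sketch (assuming the same file and sheet layout as above) that filters while reading and guards against empty cells:
import openpyxl, pprint, re

wb = openpyxl.load_workbook('RiverbedInventory.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
steelhead = re.compile(r'VCX-\d+-\w+')

inventory = {}
for row in range(1, sheet.max_row + 1):
    prodCode = sheet['G' + str(row)].value
    # skip empty cells (value is None) and non-matching part numbers
    if isinstance(prodCode, str) and steelhead.search(prodCode):
        inventory.setdefault(prodCode, {'count': 0})
        inventory[prodCode]['count'] += 1
pprint.pprint(inventory)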


PySpark Dynamic When Statement

I have a list of strings I am using to create column names. This list is dynamic and may change over time. Depending on the value of the string the column name changes. An example of the code I currently have is below:
df = df.withColumn("newCol",
                   F.when(df.pet == "dog", df.dog_Column)
                    .otherwise(F.when(df.pet == "cat", df.cat_Column)
                                .otherwise(None)))
I want to return the column that is a derivation of the name in the list. I would like to do something like this instead:
dfvalues = ["dog", "cat", "parrot", "goldfish"]
df = df.withColumn("newCol", F.when(df.pet == dfvalues[0],
                                    F.col(dfvalues[0] + "_Column")))
The issue is that I cannot figure out how to make a looping condition in Pyspark.
One way could be to use a list comprehension in conjunction with coalesce, very similar to the answer here.
mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df = df.select("*", F.coalesce(*mycols).alias("newCol"))
This works because when() yields null for unmatched rows when there is no otherwise(), and coalesce() picks the first non-null column.
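A minimal runnable sketch of this approach (the SparkSession setup and sample data are illustrative, mirroring the columns used in the question):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("cat1", "dog1", "dog"), ("cat2", "dog2", "cat")],
    ["cat_Column", "dog_Column", "pet"],
)

dfvalues = ["dog", "cat"]
# one when() per candidate value; unmatched rows yield null
mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df.select("*", F.coalesce(*mycols).alias("newCol")).show()
# rows with pet == "dog" take dog_Column, rows with pet == "cat" take cat_Column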
I faced the same problem and found this approach: you can use Python's reduce to do the looping for a clean solution.
from functools import reduce

def update_col(df1, val):
    return df1.withColumn('newCol',
                          F.when(F.col('pet') == val, F.col(val + '_column'))
                           .otherwise(F.col('newCol')))

# add empty column
df1 = df.withColumn('newCol', F.lit(0))
reduce(update_col, dfvalues, df1).show()
For reference, here is the sample setup (with the single-condition version from the question, for comparison):
from pyspark.sql import functions as F

dfvalues = ["dog", "cat"]
df = df.withColumn("newCol", F.when(df.pet == dfvalues[0], F.col(dfvalues[0] + "_Column")))
Applying the reduce-based solution to a frame with these columns yields:
+----------+----------+---+------+
|cat_column|dog_column|pet|newCol|
+----------+----------+---+------+
| cat1| dog1|dog| dog1|
| cat2| dog2|cat| cat2|
+----------+----------+---+------+

Counting matrix pairs using a threshold

I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
Will really appreciate your help as I feel a bit lost not knowing which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
# def read_text(path):
#     documents = []
#     for filename in glob.iglob(path + '*.txt'):
#         _file = open(filename, 'r')
#         text = _file.read()
#         documents.append(text)
#     return documents
import nltk, string, numpy
nltk.download('punkt')  # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()

def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

nltk.download('wordnet')  # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()

from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()

from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')

def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()

cos_similarity(documents)
Out:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
As I understood your question, you want to create a function that reads the output numpy array and a certain value (the threshold) in order to return two things:
how many pairs of docs have a similarity greater than or equal to the given threshold
the names of the docs in these pairs.
So, here I've made the following function, which takes three arguments:
the output numpy array from the cos_similarity() function.
a list of document names.
a certain number (threshold).
And here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row + 1 + idx for idx, num in
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append((docs_names[row], docs_names[item]))
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
First, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose index is bigger than the row's index, so we walk only the upper triangle of the matrix.
That's because each pair of documents appears twice in the full array; the two values arr[0][1] and arr[1][0] are the same. You should also notice that the diagonal items aren't included, because we know for sure they are 1, as every document is perfectly similar to itself :).
Finally, we take the items whose values are greater than or equal to the given threshold and return their indices. These indices are used later to look up the document names.
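If the similarity matrices get large, the same result can be computed without explicit Python loops over rows. A sketch of an equivalent vectorized version using numpy's upper-triangle helper (same arguments and return value as get_docs above):
import numpy as np

def get_docs_vectorized(arr, docs_names, threshold):
    # indices strictly above the diagonal, i.e. each pair counted once
    rows, cols = np.triu_indices(len(arr), k=1)
    mask = arr[rows, cols] >= threshold
    pairs = [(docs_names[i], docs_names[j])
             for i, j in zip(rows[mask], cols[mask])]
    return len(pairs), pairs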

From dictionary to CSV, but the file fails to be created

I have a list of dictionaries with bacterial names as keys and, as values, a set of numbers identifying a DNA sequence. Unluckily, some dictionaries have a missing value, and the script fails to produce the CSV. Can anyone give me an idea of how I can get around it? This is my script:
import glob, subprocess, sys, os, csv
from Bio import SeqIO, SearchIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def allele():
    folders = sorted(glob.glob('path_to_files'))
    dict_list = []
    for folder in folders:
        fasta_file = glob.glob(folder + '/file.fa')[0]
        input_handle = open(fasta_file, 'r')
        records = list(SeqIO.parse(input_handle, 'fasta'))
        namelist = []
        record_dict = {}
        sampleID = os.path.basename(folder)
        record_dict['sampleid'] = sampleID
        for record in records:
            name = record.description.split('\t')
            gene = record.id.split('_')
            geneID = gene[0] + '_' + gene[1]
            allele = gene[2]
            record_dict[geneID] = allele
        dict_list.append(record_dict)
    header = dict_list[0].keys()
    with open('path_to_files/mycsv.csv', 'w') as csv_output:
        writer = csv.DictWriter(csv_output, header, delimiter='\t')
        writer.writeheader()
        for samp in dict_list:
            writer.writerow(samp)

print 'start'
allele()
Also, can I get any suggestions on how to identify those dictionaries whose value sequences are the same?
Thanks
Concerning your first question, I'd just take the shorter dicts and fill the missing entries with something your DictWriter works with (didn't use it, ever); I guess NaN may work.
The simple thing would look like:
testDict1 = { "sampleid" : 0,
              "bac_s1" : [1,2,4],
              "bac_s2" : [1,2,12],
              "bac_s3" : [1,3,12],
              "bac_s4" : [1,6,12],
              "bac_s5" : [1,9,14]
            }

testDict2 = { "sampleid" : 1,
              "bac_s1" : [1,2,4],
              "bac_s2" : [1,3,12],
              "bac_s3" : [1,3,12],
              "bac_s5" : [2,9,14],
            }

testDict3 = { "sampleid" : 2,
              "bac_s1" : [3,2,4],
              "bac_s2" : [4,2,12],
              "bac_s3" : [5,3,12],
              "bac_s4" : [1,6,12],
              "bac_s5" : [1,9,14]
            }
dictList = [ testDict1, testDict2, testDict3 ]
### modified from https://stackoverflow.com/a/16974075/803359
### note, does not tell you elements that are present in shortdict but missing in longdict
### i.e. for your purpose you have to assume that keys in short dict are really present in longdict
def missing_elements( longdict, shortdict ):
    list1 = longdict.keys()
    list2 = shortdict.keys()
    assert len( list1 ) >= len( list2 )
    return sorted( set( list1 ).difference( list2 ) )

### make the first dict the longest
dictList = sorted( dictList, key=len, reverse=True )

for myDict in dictList[1:]:
    ### compare all others to the first
    missing = missing_elements( dictList[0], myDict )
    print missing
    ### then you fill with something empty or NaN that works with your save-function
    for m in missing:
        myDict[m] = [ float( 'nan' ) ]

print " "
for myDict in dictList:
    print myDict
print " "
which you can incorporate in your code easily.
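Alternatively, a minimal sketch of a variant that sidesteps the filling step entirely: build the header from the union of all keys and let DictWriter's restval argument supply a placeholder for missing fields (the function name, path argument, and 'NaN' placeholder are illustrative):
import csv

def write_dicts(dict_list, path):
    # union of all keys, so no dict has fields outside the header
    header = sorted(set().union(*(d.keys() for d in dict_list)))
    with open(path, 'w') as csv_output:
        writer = csv.DictWriter(csv_output, header, delimiter='\t', restval='NaN')
        writer.writeheader()
        for samp in dict_list:
            writer.writerow(samp)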

Extracting elements from list using Python

How can I extract '1', '11', and '111' from this list?
T0 = ['4\t1\t\n', '0.25\t11\t\n', '0.2\t111\t\n']
To extract '4', '0.25', and '0.2' I used this:
def extract(T0):
    T1 = []
    for i in range(0, len(T0)):
        pos = T0[i].index('\t')
        T1.append(T0[i][0: pos])
    return T1
then I got:
T1 = ['4','0.25','0.2']
but for the rest I don't know how to extract it.
Can you help me please?
Using your code as a base, it can be done as below. It will return a string if the field is alphabetic, otherwise a decimal integer.
def extract(T0):
    T1 = []
    for i in range(len(T0)):
        tmp = T0[i].split('\t')[1]
        if tmp.isalpha():
            T1.append(tmp)
        else:
            T1.append(int(tmp))
    return T1
Alternatively, try the below for more compact code using a list comprehension:
def extract(T0):
    # return as string if it's alphabetic, else return as a decimal integer
    # change the int function to float if you want floats instead
    tmp = [i.split('\t')[1] for i in T0]
    return [i if i.isalpha() else int(i) for i in tmp]
Example
T0= ['X\tY\tf(x.y)\n', '0\t0\t\n', '0.1\t10\t\n', '0.2\t20\t\n', '0.3\t30\t\n']
extract(T0) # return ['Y', 0, 10, 20, 30]
You can accomplish this with the re module and a list comprehension.
import re
# create a regular expression object
regex = re.compile(r'[0-9]{1,}\.{0,1}[0-9]{0,}')
# assign the input list
T0 = ['4\t1\t\n', '0.25\t11\t\n', '0.2\t111\t\n']
# get a list of extractions using the regex
extractions = [re.findall(regex, e) for e in T0]
print extractions
# => [['4', '1'], ['0.25', '11'], ['0.2', '111']]
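If you only need the second column from those matches, you can index into each pair, e.g. [m[1] for m in extractions] gives ['1', '11', '111'].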

Printing Results from Loops

I currently have a piece of code that works in two segments. The first segment opens an existing text file from a specific path on my local drive and then arranges it, based on certain indices, into a list of sub-lists. In the second segment I take the sub-lists I have created and group them on a similar index to simplify them (starting at def merge_subs). I am getting no error, but I am not receiving a result when I try to print the variable answer. Am I not correctly looping over the original list of sub-lists? Ultimately I would like to have a variable that contains the final product of these loops so that I may write its contents to a new text file. Here is the code I am working with:
from itertools import groupby, chain
from operator import itemgetter

with open("somepathname") as g:
    # reads text from lines and turns them into a list of sub-lists
    lines = g.readlines()
    for line in lines:
        matrix = line.split()
        JD = matrix[2]
        minTime = matrix[5]
        maxTime = matrix[7]
        newLists = [JD, minTime, maxTime]
        L = newLists

def merge_subs(L):
    dates = {}
    for sub in L:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        answer.append([date] + dates[date])
New code:
def openfile(self):
    filename = askopenfilename(parent=root)
    self.lines = open(filename)

def simplify(self):
    g = self.lines.readlines()
    for line in g:
        matrix = line.split()
        JD = matrix[2]
        minTime = matrix[5]
        maxTime = matrix[7]
        self.newLists = [JD, minTime, maxTime]
        print(self.newLists)
    dates = {}
    for sub in self.newLists:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        print(answer.append([date] + dates[date]))
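For what it's worth, a minimal sketch of how the two pieces might be wired together so that answer is actually populated and printed (an illustrative fix under the question's own names, not the asker's final code: the sub-lists are accumulated instead of overwritten each iteration, and merge_subs returns its result):
def merge_subs(L):
    dates = {}
    for sub in L:
        date = sub[0]
        dates.setdefault(date, []).extend(sub[1:])
    answer = []
    for date in sorted(dates):
        answer.append([date] + dates[date])
    return answer  # without this return, callers only ever see None

with open("somepathname") as g:
    rows = []
    for line in g:
        matrix = line.split()
        rows.append([matrix[2], matrix[5], matrix[7]])  # JD, minTime, maxTime

answer = merge_subs(rows)
print(answer)  # note: print(answer.append(...)) always prints None, since append returns None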