Identify letter/number combinations using regex and store them in a dictionary

import pandas as pd

df = pd.DataFrame({'Date': ['This 1-A16-19 person is BL-17-1111 and other',
                            'dont Z-1-12 do here but NOT 12-24-1981',
                            'numbers: 1A-256-29Q88 ok'],
                   'IDs': ['A11', 'B22', 'C33'],
                   })
Using the dataframe above I want to do the following: 1) use regex to identify every digit/letter combination, e.g. 1-A16-19, and 2) store the matches in a dictionary.
Ideally I would like the following output (note that 12-24-1981 is intentionally not picked up by the regex, since it does not contain a letter the way e.g. 1A-24-1981 does):
{1: '1-A16-19', 2: 'BL-17-1111', 3: 'Z-1-12', 4: '1A-256-29Q88'}
Can anybody help me do this?

This regex might do the trick.
(?=.*[a-zA-Z])(\S+-\S+-\S+)
It matches any whitespace-delimited token that contains two hyphens, and the lookahead ensures there is no match if no letter is present.
regex101 example
As you can see, for the input you provided only 1-A16-19, BL-17-1111, Z-1-12 and 1A-256-29Q88 are returned.
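For reference, a quick way to check that pattern from plain Python (a sketch using re.findall on the example strings; the commented output is what this pattern returns for the sample data):
import re

pattern = re.compile(r'(?=.*[a-zA-Z])(\S+-\S+-\S+)')
rows = ['This 1-A16-19 person is BL-17-1111 and other',
        'dont Z-1-12 do here but NOT 12-24-1981',
        'numbers: 1A-256-29Q88 ok']

for row in rows:
    # findall returns the captured group of every non-overlapping match
    print(pattern.findall(row))
# ['1-A16-19', 'BL-17-1111']
# ['Z-1-12']
# ['1A-256-29Q88']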

You could try:
# extract your strings based on the condition above and put them in a list
vals = df['Date'].str.extractall(r'(\S+-\S+-\S+)')[0].tolist()

# make a list with the index range of your matches
nums = []
for x, y in enumerate(vals):
    nums.append(x)

Pass both lists into a dictionary:
my_dict = dict(zip(nums, vals))
print(my_dict)
{0: '1-A16-19',
 1: 'BL-17-1111',
 2: 'Z-1-12',
 3: '12-24-1981',
 4: '1A-256-29Q88'}
Note that this simple pattern also picks up 12-24-1981, because it does not require a letter; see the sketch below for a variant that filters it out.
If you want the index to start at one, you can specify that in the enumerate call:
for x, y in enumerate(vals, 1):
    nums.append(x)
print(nums)
[1, 2, 3, 4, 5]
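If you also want to skip 12-24-1981 and get the 1-based keys from the question in one go, the two answers can be combined. This is a sketch; the tightened pattern, which requires a letter somewhere inside the hyphenated token itself, is my assumption about what should count as an ID:
# require at least one letter inside the hyphenated token
vals = df['Date'].str.extractall(r'(?=\S*[A-Za-z])(\S+-\S+-\S+)')[0].tolist()
my_dict = {i: v for i, v in enumerate(vals, 1)}
print(my_dict)
# {1: '1-A16-19', 2: 'BL-17-1111', 3: 'Z-1-12', 4: '1A-256-29Q88'}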


Python 3.5.2 Regex in List comprehension returns all entries - inconsistent with other example

I am searching a list for a particular entry. The entry is digits followed by an oblique (slash), one or more times.
If I put an example into a string and use re.match() I get a result.
If I put the string into a list and loop through, I get a result from re.match() for each item.
If I try to get the index using a list comprehension, I get all the list indexes returned.
Using a different list I get the correct result.
Why is the list comprehension for my regex not just returning [2], as the control comprehension does?
Example code:
import re
import sys
from datetime import datetime
rxco = re.compile
rx = {}
#String
s = r'140/154/011/002'
#String in a list
l = ['abc', 'XX123 SHDJ FFFF', s, 'unknown', 'TTL/4/5/6', 'ORD/123']
#Regex to get what I am interested in
rx['ls_pax_split'] = rxco(r'\s?((\d+\/?)*)')
#For loop returns matches and misses
for i in l:
    m = re.match(rx['ls_pax_split'], i)
    print(m)
#List Comprehension returns ALL entries - NOT EXPECTED
idx = [i for i, item in enumerate(l) if re.match(rx['ls_pax_split'], item)]
print(idx)
#Control Comprehension returns - AS EXPECTED
fruit_list = ['raspberry', 'apple', 'strawberry']
berry_idx = [i for i, item in enumerate(fruit_list) if re.match('rasp', item)]
print(berry_idx)
re.match(rx['ls_pax_split'], item) is returning a match object each time it runs, whereas re.match('rasp', item) is not. Therefore, the result of re.match(rx['ls_pax_split'], item) is always truthy.
Try adding .group(0) to the end of the re.match call in the list comprehension to get the string that matched the regular expression; with this pattern it will be an empty string (i.e. a falsey value) whenever the match is zero-length.
Like this:
idx = [i for i, item in enumerate(l) if re.match(rx['ls_pax_split'], item).group(0)]
EDIT
While the above will solve this problem, there may be a better way that avoids the hassle of dealing with .group. The regular expression (\d+\/?)* matches (\d+\/?) zero or more times, so it produces false positives: it finds exactly zero repetitions and still returns a (zero-length) match. Changing it to (\d+\/?)+ solves this for the example by requiring one or more occurrences of (\d+\/?).
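A minimal sketch of that suggestion applied to the list from the question (expected output shown in the comment):
import re

rx = re.compile(r'\s?((\d+\/?)+)')   # + instead of *: at least one digit group required
l = ['abc', 'XX123 SHDJ FFFF', '140/154/011/002', 'unknown', 'TTL/4/5/6', 'ORD/123']

idx = [i for i, item in enumerate(l) if rx.match(item)]
print(idx)
# [2]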

Sum values of a Dictionary with "similar" keys Python

I have the following dictionary:
CostofA = {'Cost1,(1, 2)': 850.93,
           'Cost1,(1, 2, 3)': 851.08,
           'Cost1,(1, 3)': 851.00,
           'Cost1,(1,)': 850.86,
           'Cost2,(1, 2)': 812.56,
           'Cost2,(1, 2, 3)': 812.65,
           'Cost2,(2, 3)': 812.12,
           'Cost2,(2,)': 812.04,
           'Cost3,(1, 2, 3)': 717.93,
           'Cost3,(1, 3)': 717.88,
           'Cost3,(2, 3)': 717.32,
           'Cost3,(3,)': 717.27}
From this dictionary, I want to create the following dictionary by adding up the elements that have similar keys. For example, I want to sum the values of 'Cost1,(1, 2, 3)', 'Cost2,(1, 2, 3)', and 'Cost3,(1, 2, 3)' as they have the same numbers inside the parentheses (1, 2, 3) and create 'Cost(1, 2, 3)': 2381.66. Similarly, 'Cost1,(1, 3)' and 'Cost3,(1, 3)' have the same numbers inside the parentheses, so, I want to sum 851.00 and 717.88 and write it to my new dictionary as: 'Cost(1, 3)': 1568.88. For 'Cost1,(1,)', 'Cost2,(2,)', and 'Cost3,(3,)', I do not want to do anything but to add them to the new dictionary. If I can get rid of the comma right after the 1 in the parentheses, it would be perfect. So, what I mean is: 'Cost1,(1,)': 850.86 becomes 'Cost(1)': 850.86.
CostofA = {'Cost(1)': 850.86,
           'Cost(2)': 812.04,
           'Cost(3)': 717.27,
           'Cost(1, 2)': 1663.49,
           'Cost(1, 3)': 1568.88,
           'Cost(2, 3)': 1529.44,
           'Cost(1, 2, 3)': 2381.66}
I know I can access the keys of the dictionary with
CostofA.keys()
and I know I could build the logic with a for loop and an if condition, but I cannot think of a way to get at the numbers inside the parentheses within that if statement. Any suggestions?
Generate the items from the dictionary.
Build a list of tuples with the Cost prefix removed,
e.g. Cost1,(1, 2) becomes (1, 2), and Cost2,(1, 2) also becomes (1, 2).
Sort the list so that equal keys end up adjacent.
Group with itertools.groupby, sum each group, and store the result in a dict.
from itertools import groupby

new_dict = {}
data = sorted([(i[0].split(",", 1)[1].replace(",)", ")"), i[1]) for i in CostofA.items()])
for key, group in groupby(data, lambda x: x[0]):
    new_dict["Cost" + key] = sum([thing[1] for thing in group])
This is one solution:
import re

str_pat = re.compile(r'\((.*)\)')
Cost = {}
for key, value in CostofA.items():
    match = str_pat.findall(key)[0]
    if match.endswith(','):
        match = match[:-1]
    temp_key = 'Cost(' + match + ')'
    if temp_key in Cost:
        Cost[temp_key] += value
    else:
        Cost[temp_key] = value
CostofA = Cost
This creates a new dictionary Cost with keys built from the numbers enclosed in parentheses in the original dictionary CostofA. It uses a precompiled regex to extract those numbers, then checks with endswith(',') whether the matched pattern ends with a , as in (1,) - if it does, it removes it.
It then concatenates the pattern with parentheses and the desired prefix, creating the new key. If the new key already exists, the program increases its value by the value from the old dictionary; if it does not, it creates a new entry with that value. At the end, the program overwrites the old dictionary.
re.compile returns a compiled regex object, as the documentation says:
Compile a regular expression pattern into a regular expression object,
which can be used for matching using its match() and search() methods,
described below.
It stores a fixed regex pattern for searching and is considered more efficient than building a new regex each time, especially when the program does more matching with that same pattern:
but using re.compile() and saving the resulting regular expression
object for reuse is more efficient when the expression will be used
several times in a single program.
Here it is used more for clarity, since it defines the pattern once up front rather than on each pass through the loop, but if your original dictionary were larger it could actually provide some performance improvement.
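If you want to skip the explicit membership check, the same accumulate-by-normalised-key idea can be written with collections.defaultdict. This is an alternative sketch, not the code from either answer above:
import re
from collections import defaultdict

str_pat = re.compile(r'\((.*)\)')
Cost = defaultdict(float)                        # missing keys start at 0.0
for key, value in CostofA.items():
    match = str_pat.findall(key)[0].rstrip(',')  # '(1,)' is captured as '1,' -> '1'
    Cost['Cost(' + match + ')'] += value
Cost = dict(Cost)                                # back to a plain dict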

Adding dictionary values with the missing values in the keys

I have the following three dictionaries:
Mydict = {'(1)': 850.86,
          '(1, 2)': 1663.5,
          '(1, 2, 3)': 2381.67,
          '(1, 3)': 1568.89,
          '(2)': 812.04,
          '(2, 3)': 1529.45,
          '(3)': 717.28}
A = {1: 4480.0, 2: 3696.0, 3: 4192.5}
B = {1: 1904.62, 2: 1709.27, 3: 1410.73}
Based on the keys in Mydict, I want to add the value of min(A, B) for each missing key. For example, for the first key '(1)' in Mydict, keys 2 and 3 are missing, so I want to add min(A[2], B[2]) + min(A[3], B[3]) to its value and update it in the dictionary. Similarly, for the key '(1, 2)', I want to add min(A[3], B[3]), as only 3 is missing there. For '(1, 2, 3)' I don't need to add anything, since it already involves all three numbers 1, 2 and 3. Thus, my new Mydict will be as follows:
Mynewdict = {'(1)': 3970.86,
             '(1, 2)': 3074.23,
             '(1, 2, 3)': 2381.67,
             '(1, 3)': 3278.16,
             '(2)': 4127.39,
             '(2, 3)': 3434.07,
             '(3)': 4331.17}
In this example all the values of B are less than the corresponding values of A, but that will not always be the case, which is why I want to add the minimum of the two. Thanks for any answers.
The list of numbers sounds like a good idea, but it seems to me it would still require roughly the same amount of (similar) operations as the following solution:
import re

str_pat = re.compile(r'\((.*)\)')
Mynewdict = Mydict.copy()
for key in Mynewdict.keys():
    match = (str_pat.findall(key)[0]).split(',')
    to_add = list(set(A.keys()).difference(set(map(int, match))))
    if to_add:
        for kAB in to_add:
            Mynewdict[key] += min(A[kAB], B[kAB])
For each key in Mynewdict this program finds the pattern between the parentheses and splits it on , into a list match. This list is then compared to the list of keys in A.
The comparison goes through sets - the program constructs sets from both lists and puts the set difference (turned back into a list) into to_add. to_add is therefore a list of the keys in A whose numbers are not present in the compound key of Mynewdict. This assumes that all the keys in B are also present in A. map is used to convert the strings in match to integers (for comparison with the integer keys of A); this works in Python 2 and 3 alike, since set() also accepts the iterator that map() returns in Python 3.
The last part of the program adds the minimum of A and B for each missing key to the existing value in Mynewdict. Since Mynewdict starts as a copy of Mydict, all the final keys and initial values already exist in it, so the program does not need to check for key presence or explicitly add the initial values.
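As a quick trace of the set-difference step for one key (values taken from the dictionaries above):
match = str_pat.findall('(1, 2)')[0].split(',')                # ['1', ' 2']
to_add = list(set(A.keys()).difference(set(map(int, match))))  # [3]
# min(A[3], B[3]) == 1410.73, so Mynewdict['(1, 2)'] becomes 1663.5 + 1410.73 == 3074.23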
To look for keys in Mynewdict that correspond to specific values it seems you actually have to loop through the dictionary:
find_val = round(min(Mynewdict.values()), 2)
for key, value in Mynewdict.items():
    if find_val == round(value, 2):
        print key, value
What's important here is the rounding. It is necessary because the values in Mynewdict carry floating-point noise and often have more decimal places than the value you are searching for. In other words, without round(value, 2) the comparison can evaluate to False in cases where it should be True.

Compare each item in a list with all previous items, print only unique items

I am using the following regexp to match all occurrences of a special kind of number:
^([0-57-9]|E)[12][0-9]{3}[A-Z]?[A-Z]([0-9]{3}|[0-9]{4})
Let's assume that this regex matches the following five numbers:
31971R0974
11957E075
31971R0974-A01P2
31971R0974-A05
51992PC0405
These matches are then printed using the following code. This prints each item in the list and if the item contains a dash, everything after the dash is discarded.
def number_function():
    for x in range(0, 10):
        print("Number", number_variable[x].split('-', 1)[0])
However, this would print five lines where lines 1, 3 and 4 would be the same.
I need your help to write a script which compares each item with all previous items and only prints the item if it does not already exist.
So, the desired output would be the following three lines:
31971R0974
11957E075
51992PC0405
EDIT 2:
I solved it! I just needed to do some moving around. Here's the finished product:
def instrument_function():
    desired = set()
    for x in range(0, 50):
        try:
            instruments_celex[x]
        except IndexError:
            pass
        else:
            before_dash = instruments_celex[x].split('-', 1)[0]
            desired.add(before_dash)
    for x in desired:
        print("Cited instrument", x)
I've done practically no python up until now, but this might do what you're after
def number_function():
    desired = set()
    for x in range(0, 10):
        before_hyphen = number_variable[x].split('-', 1)[0]
        desired.add(before_hyphen)
    for x in desired:
        print("Number", x)
Here is a version of your "finished" function that is more reasonable.
# Don't use instruments_celex as a global variable, that's terrible.
# Pass it in to the function instead:
def instrument_function(instruments_celex):
    unique = set()
    # In Python you don't need an integer loop variable. This is not Java.
    # Just loop over the list:
    for entry in instruments_celex:
        unique.add(entry.split('-', 1)[0])
    for entry in unique:
        print("Cited instrument", entry)
You can also make use of generator expressions to make this shorter:
def instrument_function(instruments_celex):
    unique = set(entry.split('-', 1)[0] for entry in instruments_celex)
    for entry in unique:
        print("Cited instrument", entry)
That's it. It's so simple, in fact, that I wouldn't make a separate function of it unless I did it at least twice in the program.
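One caveat: a set does not preserve the order in which the items first appeared. If the output should be in first-seen order, here is a sketch that uses dict.fromkeys (which keeps insertion order on Python 3.7+) instead of a set:
def instrument_function(instruments_celex):
    # dict.fromkeys deduplicates while keeping first-seen order
    unique = dict.fromkeys(entry.split('-', 1)[0] for entry in instruments_celex)
    for entry in unique:
        print("Cited instrument", entry)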

Create a vector of occurrences the same size as an input string

I'm new to python and needed some help.
I have a string such as ACAACGG.
I would now like to create 3 vectors where the elements are the running counts of a particular letter.
For example, for "A", this would produce (1123333)
For "C", this would produce (0111222)
etc.
I'm not sure how to put the results of the counting into a string or into a vector.
I believe this is similar to counting the occurrences of a character in a string, but I'm not sure how to have it run through the string and place the count value at each point.
For reference, I'm trying to implement the Burrows-Wheeler transform and use it for a string search. But, I'm not sure how to create the occurrence vector for the characters.
def bwt(s):
    s = s + '$'
    return ''.join([x[-1] for x in
                    sorted([s[i:] + s[:i] for i in range(len(s))])])
This gives me the transform and I'm trying to create the occurrence vector for it. Ultimately, I want to use this to search for repeats in a DNA string.
Any help would be greatly appreciated.
I'm not sure what type you want the vectors to be in, but here's a function that returns a list of ints.
In [1]: def countervector(s, char):
   ....:     c = 0
   ....:     v = []
   ....:     for x in s:
   ....:         if x == char:
   ....:             c += 1
   ....:         v.append(c)
   ....:     return v
   ....:
In [2]: countervector('ACAACGG', 'A')
Out[2]: [1, 1, 2, 3, 3, 3, 3]
In [3]: countervector('ACAACGG', 'C')
Out[3]: [0, 1, 1, 1, 2, 2, 2]
Also, here's a much shorter way to do it, but it will probably be inefficient on long strings:
def countervector(s, char):
    return [s[:i+1].count(char) for i, _ in enumerate(s)]
I hope it helps.
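If the strings get long, another option that stays linear on Python 3 is itertools.accumulate; a sketch, not part of the original answer:
from itertools import accumulate

def countervector(s, char):
    # running total of occurrences of char, one entry per position in s
    return list(accumulate(1 if x == char else 0 for x in s))

countervector('ACAACGG', 'A')   # [1, 1, 2, 3, 3, 3, 3]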
As promised, here is the finished script I wrote. For reference, I'm trying to use the Burrows-Wheeler transform to do repeat matching in strings of DNA. Basically the idea is to take a strand of DNA of some length M and find all repeats within that string. So, as an example, if I had the string acaacg and searched for all duplicated substrings of size 2, I would get a count of 1 and the starting locations 0, 3. You could then type in string[0:2] and string[3:5] to verify that they do actually match and that their result is "ac".
If anyone is interested in learning about the Burrows-Wheeler transform, a Wikipedia search on it produces very helpful results. Here is another source from Stanford that also explains it well: http://www.stanford.edu/class/cs262/notes/lecture5.pdf
Now, there are a few issues that I did not address in this. First, I'm using n^2 space to create the BW transform. Also, I'm creating a suffix array, sorting it, and then replacing it with numbers so creating that may take up a bit of space. However, at the end I'm only really storing the occ matrix, the end column, and the word itself.
Despite the RAM problems for strings larger than 4^7 (I got this to work with a string size of 40,000 but no larger...), I would call this a success, seeing as before Monday the only thing I knew how to do in Python was to have it print my name and hello world.
# generate random string of DNA
def get_string(length):
    string = ""
    for i in range(length):
        string += random.choice("ATGC")
    return string

# Make the BW transform from the generated string
def make_bwt(word):
    word = word + '$'
    return ''.join([x[-1] for x in
                    sorted([word[i:] + word[:i] for i in range(len(word))])])

# Make the occurrence matrix from the transform
def make_occ(bwt):
    letters = set(bwt)
    occ = {}
    for letter in letters:
        c = 0
        occ[letter] = []
        for i in range(len(bwt)):
            if bwt[i] == letter:
                c += 1
            occ[letter].append(c)
    return occ

# Get the initial starting locations for the Pos(x) values
def get_starts(word):
    list = {}
    word = word + "$"
    for letter in set(word):
        list[letter] = len([i for i in word if i < letter])
    return list

# Single range finder for the BWT. This produces a first and last position for one read.
def get_range(read, occ, pos):
    read = read[::-1]
    firstletter = read[0]
    newread = read[1:len(read)]
    readL = len(read)
    F0 = pos[firstletter]
    L0 = pos[firstletter] + occ[firstletter][-1] - 1
    F1 = F0
    L1 = L0
    for letter in newread:
        F1 = pos[letter] + occ[letter][F1-1]
        L1 = pos[letter] + occ[letter][L1] - 1
    return F1, L1

# Iterate the single read finder over the entire string to search for duplicates
def get_range_large(readlength, occ, pos, bwt):
    output = []
    for i in range(0, len(bwt)-readlength):
        output.append(get_range(word[i:(i+readlength)], occ, pos))
    return output

# Create suffix array to use later
def get_suf_array(word):
    suffix_names = [word[i:] for i in range(len(word))]
    suffix_position = range(0, len(word))
    output = zip(suffix_names, suffix_position)
    output.sort()
    output2 = []
    for i in range(len(output)):
        output2.append(output[i][1])
    return output2

# Remove single hits that were a result of using the substrings to scan the large string
def keep_dupes(bwtrange):
    mylist = []
    for i in range(0, len(bwtrange)):
        if bwtrange[i][1] != bwtrange[i][0]:
            mylist.append(tuple(bwtrange[i]))
    newset = set(mylist)
    newlist = list(newset)
    newlist.sort()
    return newlist

# Count the duplicate entries
def count_dupes(hits):
    c = 0
    for i in range(0, len(hits)):
        sum = hits[i][1] - hits[i][0]
        if sum > 0:
            c = c + sum
        else:
            c
    return c

# Get the coordinates from BWT and use the suffix array to map them back to their original indices
def get_coord(hits):
    mylist = []
    for element in hits:
        mylist.append(sa[element[0]-1:element[1]])
    return mylist

# Use the coordinates to get the actual strings that are duplicated
def get_dupstrings(coord, readlength):
    output = []
    for element in coord:
        temp = []
        for i in range(0, len(element)):
            string = word[element[i]:(element[i]+readlength)]
            temp.append(string)
        output.append(temp)
    return output

# Merge the strings and the coordinates together for one big list.
def together(dupstrings, coord):
    output = []
    for i in range(0, len(coord)):
        merge = dupstrings[i] + coord[i]
        output.append(merge)
    return output
Now run the commands as follows
import random # This is needed to generate a random string
readlength=12 # pick read length
word=get_string(4**7) # make random word
bwt=make_bwt(word) # make bwt transform from word
occ=make_occ(bwt) # make occurrence matrix
pos=get_starts(word) # gets start positions of sorted first row
bwtrange=get_range_large(readlength,occ,pos,bwt) # Runs the get_range function over all substrings in a string.
sa=get_suf_array(word) # This function builds a suffix array and numbers it.
hits=keep_dupes(bwtrange) # Pulls out the number of entries in the bwt results that have more than one hit.
dupes=count_dupes(hits) # counts hits
coord=get_coord(hits) # This part attempts to pull out the coordinates of the hits.
dupstrings=get_dupstrings(coord,readlength) # pulls out all the duplicated strings
strings_coord=together(dupstrings,coord) # puts coordinates and strings in one file for ease of viewing.
print dupes
print strings_coord
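As a quick sanity check (my own addition, not part of the original run), the helper functions can be tried on the toy string from the write-up above; the expected values are shown in the comments:
toy = "acaacg"
toy_bwt = make_bwt(toy)      # 'gc$aaac'
toy_occ = make_occ(toy_bwt)
# toy_occ['a'] == [0, 0, 0, 1, 2, 3, 3]
# toy_occ['c'] == [0, 1, 1, 1, 1, 1, 2]
# toy_occ['g'] == [1, 1, 1, 1, 1, 1, 1]
# toy_occ['$'] == [0, 0, 1, 1, 1, 1, 1]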