Better way to find sub strings in Datastore? - python-2.7

I have an application where a user enters a name and the application returns the address and city for that name.
The names are stored in Datastore:
class Person(ndb.Model):
    name = ndb.StringProperty(repeated=True)
    address = ndb.StringProperty(indexed=False)
    city = ndb.StringProperty()
There are more than 5 million Person entities. A name can have from 2 to 8 words (yes, some people have 8 words in their names).
Users can enter the words of the name in any order and the application returns the first match ("John Doe Smith" is equivalent to "Smith Doe John").
This is a sample of my entities, as they were stored with ndb.put_multi:
id="L12802795",nombre=["Smith","Loyola","Peter","","","","",""], city="Cali",address="Conchuela 471"
id="M19181478",nombre=["Hoffa","Manzano","Linda","Rosse","Claudia","Cindy","Patricia",""], comuna="Lima",address=""
id="L18793849",nombre=["Parker","Martinez","Claudio","George","Paul","","",""], comuna="Santiago",address="Calamar 323 Villa Los Pescadores"
This is how I get the name from the user:
name = self.request.get('content').strip()  # the input is the name (a string with several words)
name = name.split()  # now the name is a list of single words
In my design, to search for the words of the name in each entity, I did this:
q = Person.query()
if len(name)==1:
    names_query =q.filter(Person.name==name[0])
elif len(name)==2:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1])
elif len(name)==3:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2])
elif len(name)==4:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3])
elif len(name)==5:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3]).filter(Person.name==name[4])
elif len(name)==6:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3]).filter(Person.name==name[4]).filter(Person.name==name[5])
elif len(name)==7:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3]).filter(Person.name==name[4]).filter(Person.name==name[5]).filter(Person.name==name[6])
else :
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3]).filter(Person.name==name[4]).filter(Person.name==name[5]).filter(Person.name==name[6]).filter(Person.name==name[7])
Person = names_query.fetch(1)
person_id=Person.key.id()
Question 1
Do you think there is a better way to search for substrings in strings (ndb.StringProperty) in my design? (I know it works, but I feel it can be improved.)
Question 2
My solution has a problem with entities that have repeated words in the name.
If I search for the words "Smith Smith", it returns "Paul Smith Wshite" instead of "Paul Smith Smith". I do not know how to modify my query to match 2 (or more) repeated words in Person.name.

You could generate a list of all possible tokens for each name and use prefix filters to query them:
import itertools

from google.appengine.ext import ndb


class Person(ndb.Model):
    name = ndb.StringProperty(required=True)
    address = ndb.StringProperty(indexed=False)
    city = ndb.StringProperty()

    def _tokens(self):
        """Returns every ordering (permutation) of the lowercased name tokens.
        For example, for input 'john doe smith' we will get:
        ['john doe smith', 'john smith doe', 'doe john smith', 'doe smith john',
        'smith john doe', 'smith doe john']
        """
        tokens = [t.lower() for t in self.name.split(' ') if t]
        return [' '.join(t) for t in itertools.permutations(tokens)] or None
    tokens = ndb.ComputedProperty(_tokens, repeated=True)

    @classmethod
    def suggest(cls, s):
        s = s.lower()
        # Prefix filter: matches entities with a token that starts with s.
        return cls.query(ndb.AND(cls.tokens >= s, cls.tokens <= s + u'\ufffd'))


ndb.put_multi([Person(name='John Doe Smith'), Person(name='Jane Doe Smith'),
               Person(name='Paul Smith Wshite'), Person(name='Paul Smith'),
               Person(name='Test'), Person(name='Paul Smith Smith')])
assert Person.suggest('j').count() == 2
assert Person.suggest('ja').count() == 1
assert Person.suggest('jo').count() == 1
assert Person.suggest('doe').count() == 2
assert Person.suggest('t').count() == 1
assert Person.suggest('Smith Smith').get().name == 'Paul Smith Smith'
assert Person.suggest('Paul Smith').count() == 3
And make sure to use keys_only queries (for example Person.suggest(s).get(keys_only=True)) if you only need keys/IDs. This makes the query significantly faster and almost free in terms of datastore operations.
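To see why the token approach handles repeated words, here is a minimal sketch in plain Python (no Datastore involved) that imitates the range filter on an in-memory list. The helpers `name_tokens` and `prefix_matches` are hypothetical names introduced only for this illustration, and the sketch assumes that a range filter on a repeated property matches when a single token satisfies both bounds.

```python
import itertools


def name_tokens(name):
    """Return every ordering of the lowercased words in name, joined by spaces."""
    words = [w.lower() for w in name.split() if w]
    return [" ".join(p) for p in itertools.permutations(words)]


def prefix_matches(names, query):
    """Return the names that have at least one token starting with query.

    Imitates the filter tokens >= s AND tokens <= s + U+FFFD on a plain list.
    """
    s = query.lower()
    upper = s + u"\ufffd"
    return [n for n in names if any(s <= t <= upper for t in name_tokens(n))]


names = ["John Doe Smith", "Jane Doe Smith", "Paul Smith Wshite",
         "Paul Smith", "Test", "Paul Smith Smith"]
print(prefix_matches(names, "Smith Smith"))  # only 'Paul Smith Smith' qualifies
```

Note that itertools.permutations yields n! orderings, so an 8-word name produces 40320 tokens; it may be worth checking that figure against your index and entity size limits before relying on this for the longest names.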

Related

Need to extract data which is in tabular format from a python list

          Team A             Team B
name      xyz                abc
addres    345,JH colony      43,JK colony
Phone     76576              87866
name      pqr                ijk
addres    345,ab colony      43,JKkk colony
Phone     7666666            873336
Above, I have 2 teams with the name, address and phone number of each player in a list. There are no actual tables, but the data I read is laid out in tabular form: Team A and Team B are the 2nd and 3rd columns, and the 1st column holds the tags name, addres and Phone.
My objective is to fetch only the names of the players, grouped by team name. In this example there are 2 players in each team; there can be 1 or 2. Can someone share a solution using regular expressions? I tried a bit, but I get wrong results, such as Team B players in Team A.

This should work for you. In future, give more detail about your input string; I have assumed the columns are separated by spaces. If the input uses tabs, try replacing each tab with four spaces. I have added an extra row that covers a more difficult case.
Warning: if Team B has more players than Team A, it will probably put the extra players in Team A, depending on the exact formatting.
import re

pdf_string = '''          Team A             Team B
name      xyz                abc
addres    345,JH colony      43,JK colony
Phone     76576              87866
name      pqr                ijk
addres    345,ab colony      43,JKkk colony
Phone     7666666            873336
name      forename surname
addres    345,ab colony
Phone     7666666            '''
lines_untrimmed = pdf_string.split('\n')
lines = [line.strip() for line in lines_untrimmed]
space_string = ' ' * 3  # 3 spaces to allow spaces between names and teams
# This can be performed as a one liner below, but I wrote it out for an explanation
lines_csv = []
for line in lines:
    # Collapse every run of 3 or more spaces into a single comma.
    line_comma_spaced = re.sub(space_string + '+', ',', line)
    line_item_list = line_comma_spaced.split(',')
    lines_csv.append(line_item_list)
# lines_csv = [re.sub(space_string + '+', ',', line).split(',') for line in lines]
teams = lines_csv[0]
team_dict = {team: [] for team in teams}
for line in lines_csv:
    if 'name' in line:
        line_abbv = line[1:]  # [1:] to remove name
        for i, team in enumerate(teams):
            if i < len(line_abbv):  # this will prevent an error if there are fewer names than teams
                team_dict[team].append(line_abbv[i])
print(team_dict)
This will give the output:
{'Team A': ['xyz', 'pqr', 'forename surname'], 'Team B': ['abc', 'ijk']}
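For the case the warning mentions (Team B has a player where Team A has none), one possible alternative is to slice each row at the column offsets taken from the header instead of splitting on runs of spaces. This is only a sketch under two assumptions that the question does not confirm: the lines are not stripped, and every cell starts at the same offset as its team label in the first line. `names_by_team` and the fixed-width `row` helper are hypothetical names used for the demonstration.

```python
def names_by_team(text, teams=("Team A", "Team B")):
    """Group the cells of every 'name' row by the column they sit under."""
    lines = text.splitlines()
    header = lines[0]
    # Column start offsets come from where each team label sits in the header.
    starts = [header.index(team) for team in teams]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    result = {team: [] for team in teams}
    for line in lines[1:]:
        if not line.startswith("name"):
            continue
        for team, lo, hi in zip(teams, starts, ends):
            cell = line[lo:hi].strip()
            if cell:  # skip empty cells, e.g. a team with fewer players
                result[team].append(cell)
    return result


row = "{:<10}{:<19}{}".format  # hypothetical fixed-width layout for the demo
sample = "\n".join([
    row("", "Team A", "Team B"),
    row("name", "xyz", "abc"),
    row("addres", "345,JH colony", "43,JK colony"),
    row("name", "", "ijk"),  # only Team B has a second player
])
print(names_by_team(sample))
```

If a name is wider than its column, it would spill into the next slice, so this only helps when the layout really is fixed-width.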

Using a function to retrieve specific data from text file. Python 3

There's an external text file that has information in columns.
For example, the text file looks something like this:
123 1 645 Kallum Chris Gardner
143 2 763 Josh Brown Sinclar
etc
The numbers "1" and "2" are years. I need to write a program that takes a year as input and prints out the rest of the information about each matching individual.
So if I enter "2" into the program, '143 2 763 Josh Brown Sinclar' gets printed out.
So far I have code like this. How do I proceed?
def order_name(regno, year, course, first_name, middle_name, last_name=None):
    if not last_name:
        last_name = middle_name
    else:
        first_name = first_name, middle_name
    return (last_name, first_name, regno, course, year)

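You could do something like this:
year = input('Year: ').strip()
with open('your_file.txt') as f:
    for line in f:
        fields = line.split()
        # The year is the second column; compare it exactly, not as a substring.
        if len(fields) > 1 and fields[1] == year:
            print(line.rstrip('\n'))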

Does any one know how I can add people into groups in django?

In django-admin I'm given a table with a list of people in it. The fields are for example:
firstname   lastname   occupation   group
The first three columns are filled out already, but the fourth (group) has to be filled in by me.
I would like to write an action that puts people into groups of, say, 3,
so the result would be
firstname   lastname   occupation   group
mike        jones      doctor       1
tracy       jackson    lawyer       1
Mack        Bean       Actor        1
Steward     Griffin    Baby         2
Candice     Green      Cashier      2
Does anyone know how I can do this? I didn't add code because I don't know where to start.

Try this:
from django.db.models import Max

max_id = People.objects.all().aggregate(Max('id'))['id__max']
# Round up: ids 1-3 map to group 1, ids 4-6 to group 2, and so on.
new_group_id = (max_id + 2) // 3
Now use new_group_id as the group when you insert the record.
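The grouping rule itself can be tried out without Django. The sketch below numbers items by their position in a list rather than by database id, which is an assumption on my part (ids can have gaps after deletions); `assign_groups` is a hypothetical helper, not part of Django.

```python
def assign_groups(people, size=3):
    """Pair each person with a 1-based group number, `size` people per group."""
    return [(person, index // size + 1) for index, person in enumerate(people)]


people = ["mike", "tracy", "Mack", "Steward", "Candice"]
for person, group in assign_groups(people):
    print(person, group)
```

In an admin action, the same rule could presumably be applied by iterating over the selected queryset in a stable order and saving the computed number to each row's group field.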

Search a list with words in string as parameter in python

I could use some advice on how to search a list for genres, using the words in a string as the parameter.
Say I have created a list called genre, which contains a string like:
['crime, drama,action']
I want to use this list to search for movies containing all of the genres, or maybe just one of them.
I have created a big list which contains all the information about the movies. Here is an example entry from the list:
('Saving Private Ryan (1998)', '8.5', "Tom Hanks, Matt Damon, Tom Sizemore',\n", 'action, drama, war,\n'),
So if I want to search for Saving Private Ryan, which is a drama + action movie but not crime, how can I use my genre list to search for it?
Is there a way to search by something in the string?
UPDATE:
This is what I have done so far. I have tried to process my movie tuple and use the def function.
Navn_rating = dict(zip(names1, ratings))
Actor_genre = dict(zip(actorlist, genre_list))
var = raw_input("Enter movie: ")
print "you entered ", var
for row in name_rating_actor_genre:
    if var in row:
        movie.append(row)
print "Movie found",movie
def process_movie(movie):
    return {'title': names1, 'rating': ratings, 'actors': actorlist, 'genre': genre_list}

You can "search by something in the string" using in:
>>> genres = 'action, drama, war,\n'
>>> 'action' in genres
True
>>> 'drama' in genres
True
>>> 'romantic comedy' in genres
False
But note that this might not always give the result you want:
>>> 'war' in 'award-winning'
True
I think you should change your data structure. Consider making each movie a dictionary e.g.
{'title': 'Saving Private Ryan', 'year': 1998, 'rating': 8.5, 'actors': ['Tom Hanks', ...], 'genres': ['action', ...]}
then your query becomes
if 'drama' in movie['genres'] and 'action' in movie['genres']:
You can use indexing, split and slicing to process your tuple of strings to make the values of the dictionary, e.g.:
>>> movie = ('Saving Private Ryan (1998)', '8.5', "Tom Hanks, Matt Damon, Tom Sizemore',\n", 'action, drama, war,\n')
>>> int(movie[0][-5:-1])
1998
>>> float(movie[1])
8.5
>>> movie[0][:-7]
'Saving Private Ryan'
>>> movie[2].split(",")
['Tom Hanks', ' Matt Damon', " Tom Sizemore'", '\n']
As you can see, some tidying up may be needed. You could write a function that takes the tuple as an argument and returns the corresponding dictionary:
def process_movie(movie_tuple):
    # ... process the tuple here
    return {'title': title, 'rating': rating, ...}
and apply this to your list of movies using map:
movies = list(map(process_movie, name_rating_actor_genre))
Edit:
You will know your function works when the following line doesn't raise any errors:
assert process_movie(('Saving Private Ryan (1998)', '8.5', "Tom Hanks, Matt Damon, Tom Sizemore',\n", 'action, drama, war,\n')) == {"title": "Saving Private Ryan", "year": 1998, "rating": 8.5, "actors": ["Tom Hanks", "Matt Damon", "Tom Sizemore"], "genres": ["action", "drama", "war"]}
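One possible way to fill in process_movie so that the assertion above passes is sketched below. It assumes every tuple has exactly the four-field layout of the example and that the title always ends in " (YYYY)"; `clean_list` and `has_genres` are hypothetical helpers added for illustration.

```python
def clean_list(raw):
    """Split on commas, trim whitespace and stray quotes, drop empty items."""
    items = (part.strip().strip("'") for part in raw.split(","))
    return [item for item in items if item]


def process_movie(movie_tuple):
    raw_title, raw_rating, raw_actors, raw_genres = movie_tuple
    return {
        "title": raw_title[:-7],        # drop the trailing " (1998)"
        "year": int(raw_title[-5:-1]),  # the four digits inside the parentheses
        "rating": float(raw_rating),
        "actors": clean_list(raw_actors),
        "genres": clean_list(raw_genres),
    }


def has_genres(movie, wanted):
    """True if the movie dictionary lists every genre in wanted."""
    return all(genre in movie["genres"] for genre in wanted)


ryan = process_movie(('Saving Private Ryan (1998)', '8.5',
                      "Tom Hanks, Matt Damon, Tom Sizemore',\n",
                      'action, drama, war,\n'))
print(has_genres(ryan, ["drama", "action"]))
print(has_genres(ryan, ["crime"]))
```

Because the genres are stored as a list of whole words, the 'war' in 'award-winning' problem from the substring test does not arise here.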

Regular Expressions task

Below is an example of a text file I need to parse.
Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
Affiliated Offices: None
E-mail: blah#blah.com
I need to parse all the key/value pairs and import it into a database. For example, I will insert 'John Doe' into the [Lead Attorney] column. I started a regex but I'm running into problems when parsing line 2:
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
I started with the following regex:
(\w*.?\w+):\s*(.)(?!(\w.?\w+:.*))
But that does not parse out 'Staff Attorneys: John Doe Jr.' and 'Paralegal: John Doe III'. How can I ensure that my regex returns two groups for every key/value pair even if the key/value pairs are on the same line? Thanks!

Does any kind of key appear as a second key? The text above can be fixed by doing a data.replace('Paralegal:', '\nParalegal:') first. Then there is only one key/value pair per line, and it gets trivial:
>>> data = """Lead Attorney: John Doe
... Staff Attorneys: John Doe Jr. Paralegal: John Doe III
... Geographic Area: Wisconsin
... Affiliated Offices: None
... E-mail: blah#blah.com"""
>>>
>>> result = {}
>>> data = data.replace('Paralegal:', '\nParalegal:')
>>> for line in data.splitlines():
...     key, val = line.split(':', 1)
...     result[key.strip()] = val.strip()
...
>>> print(result)
{'Staff Attorneys': 'John Doe Jr.', 'Lead Attorney': 'John Doe', 'Paralegal': 'John Doe III', 'Affiliated Offices': 'None', 'Geographic Area': 'Wisconsin', 'E-mail': 'blah#blah.com'}
If "Paralegal:" also appears first you can make a regexp to do the replacement only when it's not first, or make a .find and check that the character before is not a newline. If there are several keywords that can appear like this, you can make a list of keywords, etc.
If the keywords can be anything, but only one word, you can look for ':' and parse backwards for space, which can be done with regexps.
If the keywords can be anything and include spaces, it's impossible to do automatically.
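When the full set of keys is known in advance, the "list of keywords" idea can be sketched with a single regular expression that splits the text at every known key, wherever it sits on the line. The `KEYS` list and `parse_pairs` are assumptions made for this example, taken from the sample text only.

```python
import re

# Hypothetical list of every key that can occur, taken from the sample text.
KEYS = ["Lead Attorney", "Staff Attorneys", "Paralegal",
        "Geographic Area", "Affiliated Offices", "E-mail"]


def parse_pairs(text, keys=KEYS):
    """Split text at each known 'Key:' and return a {key: value} dictionary."""
    # Longest keys first, so a key that is a prefix of another cannot win.
    ordered = sorted(keys, key=len, reverse=True)
    pattern = re.compile(r"(%s):\s*" % "|".join(re.escape(k) for k in ordered))
    # With a capturing group, re.split returns [before, key, value, key, value, ...].
    parts = pattern.split(text)
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}


data = """Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
Affiliated Offices: None
E-mail: blah#blah.com"""
print(parse_pairs(data))
```

This avoids the separate replace step, but it still cannot discover keys it was not told about.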
