Replace a sub-string - regex

I'm trying to use regular expressions to replace some things in a text.
My dataframe:
A        B          C
French   house      Phone. <phone_numbers>
English  house      email - <adresse_mail>
French   apartment  code : bla!123
French   house      Hello George!
English  apartment  Ethan, my phone is <phone_numbers>
Desired output:
A        B          C
French   house      Phone. <phone_numbers>
English  house      email - <adresse_mail>
French   apartment  code : <code>
French   house      Hello George!
English  apartment  Ethan, my phone is <phone_numbers>
First, I tried this:
df['C'] = df['C'].str.replace(r'((ask code)|(code))\s?:?\s?\w+','<code>')
It works, but not completely. For the value
code : bla!123
the output is
<code>!123
(the \w+ stops at the first non-word character, so !123 is left behind).
So, I tried this:
df['C'] = df['C'].str.replace(r'(ask code)|(code)\s?:?\s?), (\s?\w+)', r'\2,<code>')
But nothing happened...

I'd do:
df['C'] = df['C'].str.replace(r'(ask code|code)(\s?:?\s?).+', r'\1\2<code>', regex=True)
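A quick check of that one-liner on a throwaway frame (regex=True is required on newer pandas, where str.replace defaults to literal matching; the sample values are taken from the question):
import pandas as pd

df = pd.DataFrame({'C': ['Phone. <phone_numbers>', 'code : bla!123', 'Hello George!']})
df['C'] = df['C'].str.replace(r'(ask code|code)(\s?:?\s?).+', r'\1\2<code>', regex=True)
print(df['C'].tolist())  # ['Phone. <phone_numbers>', 'code : <code>', 'Hello George!']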

Alternatively, with plain re on a single string:
import re
string = 'code : bla!123'
string.replace(re.match(r'code\s?:?\s?(.*)', string)[1], '<code>')
Output:
'code : <code>'

Related

How to format text written to a variable?

I get the text from the database. It comes in the following form:
city = City.objects.first()
#London
country = Contry.objects.first()
#UK
metatag = MetaTag.objects.first()
#text {city.name} else text {country.name}
How can I format it?
Need to get:
text London else text UK
metatag.format(city=city, country=country)
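A minimal, self-contained sketch of the idea (City/Country are stand-ins for the real models, and it assumes the value stored in the database is exactly the template shown above):
class Obj:
    # stand-in for a model instance with a .name attribute
    def __init__(self, name):
        self.name = name

city = Obj('London')
country = Obj('UK')
metatag = 'text {city.name} else text {country.name}'  # template as stored

print(metatag.format(city=city, country=country))  # text London else text UK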

Need to extract data which is in tabular format from a python list

        Team A          Team B
name    xyz             abc
addres  345,JH colony   43,JK colony
Phone   76576           87866
name    pqr             ijk
addres  345,ab colony   43,JKkk colony
Phone   7666666         873336
Above, I have 2 teams with the names, address, and phone number of each player in a list. However, there are no actual tables; the data, when I tried to read it, is in a tabular format where Team A and Team B are the 2nd and 3rd columns and the 1st column holds the tags name, addres, and Phone.
My objective is to fetch only the names of the players, grouped by team name. In this example there are 2 players per team, but it can be between 1 and 2. Is there a way someone can help to share a solution using regular expressions? I tried a bit, but that gives me random results, such as Team B players appearing in Team A. Can someone help?
This should work for you. In future, please give more detail about your input string; I have assumed the columns are separated by spaces. If they use tabs instead, try replacing them with four spaces first. I have added an extra row that covers a more difficult case.
Warning: If Team B has more players than Team A, it will probably put the extra players in Team A. But it will depend on the exact formatting.
import re

pdf_string = '''        Team A            Team B
name       xyz               abc
addres     345,JH colony     43,JK colony
Phone      76576             87866
name       pqr               ijk
addres     345,ab colony     43,JKkk colony
Phone      7666666           873336
name       forename surname
addres     345,ab colony
Phone      7666666 '''

lines_untrimmed = pdf_string.split('\n')
lines = [line.strip() for line in lines_untrimmed]

space_string = ' ' * 3  # 3 spaces, to allow single spaces inside names and team labels

# This can be performed as the one-liner below, but I wrote it out for an explanation
lines_csv = []
for line in lines:
    line_comma_spaced = re.sub(space_string + '+', ',', line)
    line_item_list = line_comma_spaced.split(',')
    lines_csv.append(line_item_list)
# lines_csv = [re.sub(space_string + '+', ',', line).split(',') for line in lines]

teams = lines_csv[0]
team_dict = {team: [] for team in teams}

for line in lines_csv:
    if 'name' in line:
        line_abbv = line[1:]  # [1:] to remove the 'name' tag
        for i, team in enumerate(teams):
            if i < len(line_abbv):  # prevents an error if there are fewer names than teams
                team_dict[team].append(line_abbv[i])

print(team_dict)
This will give the output:
{'Team A': ['xyz', 'pqr', 'forename surname'], 'Team B': ['abc', 'ijk']}

Regex capture lines A, B, or C in any order only when not preceded by D

I have a file with content something like this:
SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: []
IRS NUMBER: 123456789
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
Then later in the file, it has something like this:
<REPORTING-OWNER>
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP
CENTRAL INDEX KEY: 0101010101
STANDARD INDUSTRIAL CLASSIFICATION: []
What I need to do is capture the company conformed name, central index key, IRS number, fiscal year end, or whatever I am looking to extract, but only in the subject company section--not the reporting owner section. These lines may be in any order, or not present, but I want to capture their values if they are present.
The regex I was trying to build looks like this:
(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,#`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
The desired results would be as follows:
conformed_name = "MISCELLANEOUS SUBJECT CORP"
CIK = "000000000"
IRS_number = "123456789"
fiscal_year_end = "1231"
Any flavor of regex is acceptable for this, as I'll adapt to whatever works best for the scenario. Thank you for reading about my quandary and for any guidance you can offer.
I ended up figuring it out on my own. Here is the pattern I landed on:
/SUBJECT COMPANY:\s+COMPANY DATA:(?:\s+(?:(?:COMPANY CONFORMED NAME:\s+(?'conformed_name'[^\n]+))|(?:CENTRAL INDEX KEY:\s+(?'CIK'\d{10}))|(?:STANDARD INDUSTRIAL CLASSIFICATION:\s+(?'assigned_SIC'[^\n]+))|(?:IRS NUMBER:\s+?(?'IRS_number'\w{2}-?\w{7,8}))|(?:STATE OF INCORPORATION:\s+(?'state_of_incorporation'\w{2}))|(?:FISCAL YEAR END:\s+(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))\n))+/s
To match only the company section, and only when preceded by "SUBJECT COMPANY", use a lookbehind:
(?<=SUBJECT COMPANY:\t\n \n )(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,#`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))

Better way to find sub strings in Datastore?

I have an application where a user inputs a name and the application gives back the address and city for that name.
The names are in the Datastore:
class Person(ndb.Model):
    name = ndb.StringProperty(repeated=True)
    address = ndb.StringProperty(indexed=False)
    city = ndb.StringProperty()
There are more than 5 million Person entities. Names can be formed from 2 to 8 words (yes, there are people with 8 words in their names).
Users can enter the words of a name in any order and the application will return the first match ("John Doe Smith" is equivalent to "Smith Doe John").
This is a sample of my entities (the way they were stored with ndb.put_multi):
id="L12802795",nombre=["Smith","Loyola","Peter","","","","",""], city="Cali",address="Conchuela 471"
id="M19181478",nombre=["Hoffa","Manzano","Linda","Rosse","Claudia","Cindy","Patricia",""], comuna="Lima",address=""
id="L18793849",nombre=["Parker","Martinez","Claudio","George","Paul","","",""], comuna="Santiago",address="Calamar 323 Villa Los Pescadores"
This is the way I get the name from the user:
name = self.request.get('content').strip()  # the input is the name (a string with several words)
name = " ".join(name.split()).split()       # now the name is a list of single words
In my design, in order to search for each word of the name within each entity, I did this:
q = Person.query()
if len(name) == 1:
    names_query = q.filter(Person.name == name[0])
elif len(name) == 2:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1])
elif len(name) == 3:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2])
elif len(name) == 4:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3])
elif len(name) == 5:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3]).filter(Person.name == name[4])
elif len(name) == 6:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3]).filter(Person.name == name[4]).filter(Person.name == name[5])
elif len(name) == 7:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3]).filter(Person.name == name[4]).filter(Person.name == name[5]).filter(Person.name == name[6])
else:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3]).filter(Person.name == name[4]).filter(Person.name == name[5]).filter(Person.name == name[6]).filter(Person.name == name[7])

persons = names_query.fetch(1)   # fetch() returns a list
person_id = persons[0].key.id()
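(As an aside, the whole if/elif ladder above is equivalent to applying one equality filter per word in a loop; a minimal sketch with the same semantics:)
q = Person.query()
for word in name:
    q = q.filter(Person.name == word)
persons = q.fetch(1)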
Question 1
Do you think there is a better way to search for substrings in strings (ndb.StringProperty) than my design? (I know it works, but I feel it can be improved.)
Question 2
My solution has a problem for entities with repeated words in the name.
If I search for the words "Smith Smith", it returns "Paul Smith Wshite" instead of "Paul Smith Smith". I do not know how to modify my query so it matches 2 (or more) repeated words in Person.name.
You could generate a list of all possible tokens for each name and use prefix filters to query them:
import itertools
from google.appengine.ext import ndb

class Person(ndb.Model):
    name = ndb.StringProperty(required=True)
    address = ndb.StringProperty(indexed=False)
    city = ndb.StringProperty()

    def _tokens(self):
        """Returns all possible combinations of name tokens combined.

        For example, for input 'john doe smith' we will get:
        ['john doe smith', 'john smith doe', 'doe john smith', 'doe smith john',
         'smith john doe', 'smith doe john']
        """
        tokens = [t.lower() for t in self.name.split(' ') if t]
        return [' '.join(t) for t in itertools.permutations(tokens)] or None

    tokens = ndb.ComputedProperty(_tokens, repeated=True)

    @classmethod
    def suggest(cls, s):
        s = s.lower()
        return cls.query(ndb.AND(cls.tokens >= s, cls.tokens <= s + u'\ufffd'))
ndb.put_multi([Person(name='John Doe Smith'), Person(name='Jane Doe Smith'),
               Person(name='Paul Smith Wshite'), Person(name='Paul Smith'),
               Person(name='Test'), Person(name='Paul Smith Smith')])

assert Person.suggest('j').count() == 2
assert Person.suggest('ja').count() == 1
assert Person.suggest('jo').count() == 1
assert Person.suggest('doe').count() == 2
assert Person.suggest('t').count() == 1
assert Person.suggest('Smith Smith').get().name == 'Paul Smith Smith'
assert Person.suggest('Paul Smith').count() == 3
And make sure to use keys_only queries if you only want keys/ids. This will make things significantly faster and almost free in terms of datastore OPs.
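For example, a minimal sketch reusing the suggest() model above:
keys = Person.suggest('doe').fetch(10, keys_only=True)
ids = [key.id() for key in keys]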

Regular Expressions task

Below is an example of a text file I need to parse.
Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
Affiliated Offices: None
E-mail: blah@blah.com
I need to parse all the key/value pairs and import them into a database. For example, I will insert 'John Doe' into the [Lead Attorney] column. I started a regex, but I'm running into problems when parsing line 2:
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
I started with the following regex:
(\w*.?\w+):\s*(.)(?!(\w.?\w+:.*))
But that does not parse out 'Staff Attorneys: John Doe Jr.' and 'Paralegal: John Doe III'. How can I ensure that my regex returns two groups for every key/value pair even if the key/value pairs are on the same line? Thanks!
Does any kind of key appear as a second key? The text above can be fixed by doing a data.replace('Paralegal:', '\nParalegal:') first. Then there is only one key/value pair per line, and it gets trivial:
>>> data = """Lead Attorney: John Doe
... Staff Attorneys: John Doe Jr. Paralegal: John Doe III
... Geographic Area: Wisconsin
... Affiliated Offices: None
... E-mail: blah@blah.com"""
>>>
>>> result = {}
>>> data = data.replace('Paralegal:', '\nParalegal:')
>>> for line in data.splitlines():
... key, val = line.split(':', 1)
... result[key.strip()] = val.strip()
...
>>> print(result)
{'Staff Attorneys': 'John Doe Jr.', 'Lead Attorney': 'John Doe', 'Paralegal': 'John Doe III', 'Affiliated Offices': 'None', 'Geographic Area': 'Wisconsin', 'E-mail': 'blah@blah.com'}
If "Paralegal:" also appears first you can make a regexp to do the replacement only when it's not first, or make a .find and check that the character before is not a newline. If there are several keywords that can appear like this, you can make a list of keywords, etc.
If the keywords can be anything, but only one word, you can look for ':' and parse backwards for space, which can be done with regexps.
If the keywords can be anything and include spaces, it's impossible to do automatically.
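A minimal sketch of that single-word-key approach (the pattern and the trimmed sample data are illustrative, not the only way to write it):
import re

data = """Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
E-mail: blah@blah.com"""

# Treat the last word before each ':' as the key and take everything up to
# the next 'word:' (or the end of the line) as the value.
pairs = re.findall(r'(\S+):\s*(.*?)(?=\s+\S+:|\s*$)', data, flags=re.MULTILINE)
print(dict(pairs))
# {'Attorney': 'John Doe', 'Attorneys': 'John Doe Jr.', 'Paralegal': 'John Doe III',
#  'Area': 'Wisconsin', 'E-mail': 'blah@blah.com'}
# Multi-word keys are truncated to their last word, which is exactly the limitation
# described above.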