Using a function to retrieve specific data from text file. Python 3 - list

There's an external text file that has information in columns. For example, the text file looks something like this:
123 1 645 Kallum Chris Gardner
143 2 763 Josh Brown Sinclar
etc
Now the numbers "1" and "2" are years. I need to write a program that takes a year as input and prints out the rest of the information about the individual. So if I enter "2" into the program, '143 2 763 Josh Brown Sinclar' will get printed out.
So far I have code like this. How do I move on further?
def order_name(regno, year, course, first_name, middle_name, last_name=None):
    if not last_name:
        last_name = middle_name
    else:
        first_name = first_name, middle_name
    return (last_name, first_name, regno, course, year)

You could do something like this (comparing the second column for equality, so searching for year 1 doesn't also match, say, 12):
year = 2  # the year you are searching for
with open('your_file.txt') as f:
    lines = f.readlines()
res = [x for x in lines if x.split()[1] == str(year)]
print(res)
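The same idea can be packaged as a function and tested without a file; a minimal sketch, assuming the column layout from the sample above (id, year, number, names):

```python
def rows_for_year(lines, year):
    # the second whitespace-separated column holds the year
    return [ln.rstrip() for ln in lines if ln.split()[1] == str(year)]

sample = [
    "123 1 645 Kallum Chris Gardner\n",
    "143 2 763 Josh Brown Sinclar\n",
]
matches = rows_for_year(sample, 2)
print(matches)  # -> ['143 2 763 Josh Brown Sinclar']
```

To run it against the real file, pass `open('your_file.txt').readlines()` instead of the sample list.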

Related

Extracting only dates from a text file and ignoring large numbers

I have a text file and I want to extract all dates from it, but my code also extracts other values such as Procedure #: 10075453.
Below is a small sample of that file:
Patient Name: Mills, John Procedure #: 10075453
October 7, 2017
Med Rec #: 747901 Visit ID: 110408731
Patient Location: OUTPATIENT Patient Type: OUTPATIENT
DOB:07/09/1943 Gender: F Age: 73Y Phone: (321)8344-0456
Can I get an idea how I could approach this problem?
import pandas as pd

doc = []
with open('Clean.txt', encoding="utf8") as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)

def date_extract():
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})')
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))')
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})')
    dates = pd.to_datetime(one.fillna(two).fillna(three)
                              .replace('Decemeber', 'December', regex=True)
                              .replace('Janaury', 'January', regex=True))
    return pd.Series(dates.sort_values())
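One likely cause of the stray matches is that the bare `\d{4}` in the third pattern can match inside a longer number like the procedure ID. A minimal sketch of the fix, using `\b` word boundaries (the sample text and the simplified patterns here are illustrative, not the asker's full regexes):

```python
import re

sample = ("Patient Name: Mills, John Procedure #: 10075453\n"
          "October 7, 2017\n"
          "DOB:07/09/1943")

# \b anchors keep date fragments from matching inside longer numbers
pattern = (r'\b(?:\d{1,2}/\d{1,2}/\d{2,4}'
           r'|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}, \d{4})\b')
found = re.findall(pattern, sample)
print(found)  # -> ['October 7, 2017', '07/09/1943']
```

The procedure number 10075453 no longer matches, because neither alternative can start inside an unbroken digit run.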

Better way to find sub strings in Datastore?

I have an application where a user inputs a name and the application gives back the address and city for that name.
The names are in Datastore:
class Person(ndb.Model):
    name = ndb.StringProperty(repeated=True)
    address = ndb.StringProperty(indexed=False)
    city = ndb.StringProperty()
There are more than 5 million Person entities. Names can be formed from 2 to 8 words (yes, there are people with 8 words in their names).
Users can enter the words of a name in any order and the application returns the first match ("John Doe Smith" is equivalent to "Smith Doe John").
This is a sample of my entities (the way they were stored with ndb.put_multi):
id="L12802795",nombre=["Smith","Loyola","Peter","","","","",""], city="Cali",address="Conchuela 471"
id="M19181478",nombre=["Hoffa","Manzano","Linda","Rosse","Claudia","Cindy","Patricia",""], comuna="Lima",address=""
id="L18793849",nombre=["Parker","Martinez","Claudio","George","Paul","","",""], comuna="Santiago",address="Calamar 323 Villa Los Pescadores"
This is how I get the name from the user:
name = self.request.get('content').strip()  # the input is the name (a string with several words)
name = " ".join(name.split()).split()  # now the name is a list of single words
In my design, in order to find and match words in the name for each entity, I did this:
q = Person.query()
if len(name) == 1:
    names_query = q.filter(Person.name == name[0])
elif len(name) == 2:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1])
elif len(name) == 3:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2])
elif len(name) == 4:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3])
elif len(name) == 5:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3]).filter(Person.name == name[4])
elif len(name) == 6:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3]).filter(Person.name == name[4]).filter(Person.name == name[5])
elif len(name) == 7:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3]).filter(Person.name == name[4]).filter(Person.name == name[5]).filter(Person.name == name[6])
else:
    names_query = q.filter(Person.name == name[0]).filter(Person.name == name[1]).filter(Person.name == name[2]).filter(Person.name == name[3]).filter(Person.name == name[4]).filter(Person.name == name[5]).filter(Person.name == name[6]).filter(Person.name == name[7])
person = names_query.fetch(1)
person_id = person[0].key.id()
Question 1
Do you think there is a better way to search for substrings (in ndb.StringProperty values) than my design? (I know it works, but I feel it can be improved.)
Question 2
My solution has a problem for entities with repeated words in the name.
If I search for "Smith Smith" it brings me "Paul Smith Wshite" instead of "Paul Smith Smith"; I do not know how to modify my query to require 2 (or more) repeated words in Person.name.
You could generate a list of all possible token permutations for each name and use prefix filters to query them:
import itertools

class Person(ndb.Model):
    name = ndb.StringProperty(required=True)
    address = ndb.StringProperty(indexed=False)
    city = ndb.StringProperty()

    def _tokens(self):
        """Returns all permutations of the name's tokens, joined back together.

        For example, for input 'john doe smith' we will get:
        ['john doe smith', 'john smith doe', 'doe john smith', 'doe smith john',
         'smith john doe', 'smith doe john']
        """
        tokens = [t.lower() for t in self.name.split(' ') if t]
        return [' '.join(t) for t in itertools.permutations(tokens)] or None

    tokens = ndb.ComputedProperty(_tokens, repeated=True)

    @classmethod
    def suggest(cls, s):
        s = s.lower()
        return cls.query(ndb.AND(cls.tokens >= s, cls.tokens <= s + u'\ufffd'))

ndb.put_multi([Person(name='John Doe Smith'), Person(name='Jane Doe Smith'),
               Person(name='Paul Smith Wshite'), Person(name='Paul Smith'),
               Person(name='Test'), Person(name='Paul Smith Smith')])

assert Person.suggest('j').count() == 2
assert Person.suggest('ja').count() == 1
assert Person.suggest('jo').count() == 1
assert Person.suggest('doe').count() == 2
assert Person.suggest('t').count() == 1
assert Person.suggest('Smith Smith').get().name == 'Paul Smith Smith'
assert Person.suggest('Paul Smith').count() == 3
And make sure to use keys_only queries if you only want keys/IDs; this will make things significantly faster and almost free in terms of Datastore ops.
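The permutation-token idea can be tried standalone, without Datastore; a minimal sketch mirroring the _tokens method above (the function name here is illustrative):

```python
import itertools

def name_tokens(name):
    # lowercase the words, then join every permutation back into a string
    parts = [t.lower() for t in name.split(' ') if t]
    return [' '.join(p) for p in itertools.permutations(parts)]

tokens = name_tokens('John Doe Smith')
print(len(tokens))  # -> 6, i.e. 3! orderings
```

Note the combinatorial cost: an 8-word name produces 8! = 40,320 tokens per entity, which is worth weighing against the 5-million-entity dataset before committing to this design.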

Print line if any of these words are matched

I have a text file with 1000+ lines, each one representing a news article about a topic that I'm researching. Several hundred lines/articles in this dataset are not about the topic, however, and I need to remove these.
I've used grep to remove many of them (grep -vwE "(wordA|wordB)" test8.txt > test9.txt), but I now need to go through the rest manually.
I have a working code that finds all lines that do not contain a certain word, prints this line to me, and asks if it should be removed or not. It works well, but I'd like to include several other words. E.g. let's say my research topic is meat eating trends. I hope to write a script that prints lines that do not contain 'chicken' or 'pork' or 'beef', so I can manually verify if the lines/articles are about the relevant topic.
I know I can do this with elif, but I wonder if there is a better and simpler way. E.g. I tried if "chicken" or "beef" not in line: but it did not work.
Here's the code I have:
orgfile = 'test9.txt'
newfile = 'test10.txt'
open(newfile, 'wb').close()  # start with an empty output file
with open(orgfile) as f:
    for num, line in enumerate(f, 1):
        if "chicken" not in line:
            print "{} {}".format(line.split(',')[0], num)
            testVar = raw_input("1 = delete, enter = skip.")
            testVar = int(testVar.replace('', '0'))  # '1' -> '010' -> 10; '' -> '0' -> 0
            if testVar == 10:
                print ''
            else:
                out = open(newfile, 'ab')
                out.write(line)
                out.close()
        else:
            out = open(newfile, 'ab')
            out.write(line)
            out.close()
Edit: I tried Pieter's answer to this question but it does not work here, presumably because I am not working with integers.
You can use any or all with a generator expression. For example:
>>> key_words = {"chicken", "beef"}
>>> test_texts = ["the price of beef is too high", "the chicken farm now open",
...               "tomorrow there is a lunar eclipse", "bla"]
>>> for title in test_texts:
...     if any(key in title for key in key_words):
...         print title
...
the price of beef is too high
the chicken farm now open
>>> for title in test_texts:
...     if not any(key in title for key in key_words):
...         print title
...
tomorrow there is a lunar eclipse
bla
>>>
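Applied to the asker's task, the second form (not any(...)) selects exactly the lines that mention none of the topic words and so need manual review; a minimal sketch with made-up lines (Python 3 syntax):

```python
keywords = {"chicken", "pork", "beef"}

lines = [
    "the price of beef is too high",
    "tomorrow there is a lunar eclipse",
    "the chicken farm now open",
]

# lines mentioning none of the keywords are the ones to verify by hand
to_review = [ln for ln in lines if not any(k in ln for k in keywords)]
print(to_review)  # -> ['tomorrow there is a lunar eclipse']
```

In the original script this means replacing `if "chicken" not in line:` with `if not any(k in line for k in keywords):`.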

Python pandas data frame warning, suggest to use .loc instead?

Hi, I would like to manipulate the data by removing cases with missing information and making all letters lowercase. But for the lowercase conversion, I get this warning:
E:\Program Files Extra\Python27\lib\site-packages\pandas\core\frame.py:1808: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
C:\Users\KubiK\Desktop\FamSeach_NameHandling.py:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
frame3["name"] = frame3["name"].str.lower()
C:\Users\KubiK\Desktop\FamSeach_NameHandling.py:19: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
frame3["ethnicity"] = frame3["ethnicity"].str.lower()
import pandas as pd
from pandas import DataFrame
# Get csv file into data frame (raw string so the backslashes are not treated as escapes)
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame[index_missEthnic != True]
frame3 = frame2[index_missName != True]
# Make all letters into lowercase
frame3["name"] = frame3["name"].str.lower()
frame3["ethnicity"] = frame3["ethnicity"].str.lower()
# Test outputs
print frame3
This warning doesn't seem to be fatal (at least for my small sample data), but how should I deal with it?
Sample data
Name Ethnicity
Thos C. Martin Russian
Charlotte Wing English
Frederick A T Byrne Canadian
J George Christe French
Mary R O'brien English
Marie A Savoie-dit Dugas English
J-b'te Letourneau Scotish
Jane Mc-earthar French
Amabil?? Bonneau English
Emma Lef??c French
C., Akeefe African
D, James Matheson English
Marie An: Thomas English
Susan Rrumb;u English
English
Kaio Chan
Not sure why you need so many booleans...
Also note that .isnull() does not catch empty strings.
And filtering out empty strings before applying .lower() doesn't seem necessary either.
But if there is a need... this works for me:
frame = pd.DataFrame({'name':['Abc Def', 'EFG GH', ''], 'ethnicity':['Ethnicity1','', 'Ethnicity2']})
print frame
ethnicity name
0 Ethnicity1 Abc Def
1 EFG GH
2 Ethnicity2
name_null = frame.name.str.len() == 0
frame.loc[~name_null, 'name'] = frame.loc[~name_null, 'name'].str.lower()
print frame
ethnicity name
0 Ethnicity1 abc def
1 efg gh
2 Ethnicity2
When you set frame2/3, try using .loc as follows:
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
I think this would fix the error you're seeing:
frame3.loc[:, "name"] = frame3.loc[:, "name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3.loc[:, "ethnicity"].str.lower()
You can also try the following, although it doesn't answer your question:
frame3.loc[:, "name"] = [t.lower() if isinstance(t, str) else t for t in frame3.name]
frame3.loc[:, "ethnicity"] = [t.lower() if isinstance(t, str) else t for t in frame3.ethnicity]
This converts any string in the column into lowercase, otherwise it leaves the value untouched.
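Another common way to silence the warning (a sketch with made-up data, not the asker's CSV) is to take an explicit .copy() when slicing, so later assignments target a real DataFrame rather than a view of the original:

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Thos C. Martin", None, "Charlotte Wing"],
                      "ethnicity": ["Russian", "English", None]})

# dropna replaces the two boolean-mask steps; .copy() makes frame3
# independent, so the column assignments no longer warn about a view
frame3 = frame.dropna(subset=["name", "ethnicity"]).copy()
frame3["name"] = frame3["name"].str.lower()
frame3["ethnicity"] = frame3["ethnicity"].str.lower()
print(frame3)
```

dropna(subset=...) keeps only rows where both columns are non-null, matching the intent of the two isnull() filters in the question.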

Regular Expression to deal with text issue

I have the following text sample:
Sample Supplier 123
AP Invoices 123456 -229.00
AP Invoices 235435 337.00
AP Invoices 444323 228.00
AP Invoices 576432 248.00
It's from a text file with 21,000 lines, which lists invoices against a supplier.
The pattern is always the same on each block of invoices against each supplier, where:
The supplier name starts at the beginning of a line
The invoices begin two rows down from the supplier name, indented by one space.
I wondered if I can use a Regular Expression (I'm using TextPad as a Text Editor on a Windows PC) to:
Append each invoice line with a tab (\t)
Append the supplier name in front of the tab so each invoice line now starts with the supplier name, and a tab, where the supplier name is taken from 2 rows above the start of each block of invoices
Delete the supplier name line from above the invoice block.
Expected output:
Sample Supplier 123 AP Invoices 123456 -229.00
Sample Supplier 123 AP Invoices 235435 337.00
Sample Supplier 123 AP Invoices 444323 228.00
Sample Supplier 123 AP Invoices 576432 248.00
I realise I am probably asking for "the moon on a stick" here, but the alternative is to go through a 21,000 line text file and copy and paste the data into Excel, which might not be a very good use of my time.
Maybe I can't do it using a simple regular expression, or maybe it's simply not possible at all.
Any advice would be much appreciated.
Thanks
I would use a simple Python script to solve this issue:
currentheader = ""
with open("yourfile.txt") as f:
    with open("newfile.txt", "w") as fw:
        for line in f:
            if len(line.strip()) == 0:
                continue
            elif line[0] != " ":  # new header
                currentheader = line[:-1]
            else:
                fw.write(currentheader + "\t" + line[1:])
For this to work on Windows you will have to install Python; Python 2 or 3 should both work with this script. After installing Python, open a command line (Win+R, cmd, Enter), navigate to the folder your file is in using cd foldername if necessary, and then type python dealWithTextIssue.py (after having saved the script there as "dealWithTextIssue.py").
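The script's logic can be checked on a small in-memory sample before running it on the real 21,000-line file (the sample lines below are taken from the question):

```python
sample_lines = [
    "Sample Supplier 123\n",
    "\n",
    " AP Invoices 123456 -229.00\n",
    " AP Invoices 235435 337.00\n",
]

header = ""
merged = []
for line in sample_lines:
    if not line.strip():
        continue                      # skip blank separator lines
    if line[0] != " ":                # no leading space: a new supplier header
        header = line.rstrip("\n")
    else:                             # invoice line: prefix the remembered supplier
        merged.append(header + "\t" + line[1:].rstrip("\n"))
print(merged)
```

Each output row is supplier name, a tab, then the invoice line, matching the expected output in the question.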
I think this isn't solvable with regex alone; you'll have to do some programming. I made a little script in PHP:
$string = <<<EOL
Sample Supplier 123
 AP Invoices 123456 -229.00
 AP Invoices 235435 337.00
 AP Invoices 444323 228.00
 AP Invoices 576432 248.00
Second Supplier
 A B C D
 B F
EOL;
$array = preg_split("~[\n\r]+~", $string);
foreach ($array as $value) {
    if (strpos($value, " ") === 0) {  // invoice lines start with a space
        if (strlen(trim($value)) > 0) {
            echo $header . "\t" . trim($value) . PHP_EOL;
        }
    } else {
        $header = $value;
    }
}