What is the most efficient method to parse this line of text? - regex

The following is a row that I have extracted from the web:
AIG $30 AIG is an international renowned insurance company listed on the NYSE. A period is required. Manual Auto Active 3 0.0510, 0.0500, 0.0300 [EXTRACT]
I would like to create 5 separate variables by parsing the text and retrieving the relevant data. However, I seriously don't understand the regex documentation! Can anyone guide me on how I can do it correctly with this example?
Name = AIG
CurrentPrice = $30
Status = Active
World_Ranking = 3
History = 0.0510, 0.0500, 0.0300

I'm not sure what you want to achieve here. There's no need to use regexps; you could just use str.split:
>>> str = "AIG $30 AIG is an international renowned insurance company listed on the NYSE. A period is required. Manual Auto Active 3 0.0510, 0.0500, 0.0300 [EXTRACT]"
>>> list = str.split()
>>> dict = { "Name": list[0], "CurrentPrice": list[1], "Status": list[19], "WorldRanking": list[20], "History": ' '.join((list[21], list[22], list[23])) }
#output
>>> dict
{'Status': 'Active', 'CurrentPrice': '$30', 'Name': 'AIG', 'WorldRanking': '3', 'History': '0.0510, 0.0500, 0.0300'}
Instead of using list[19] and so on, you may want to change it to list[-n] so that it does not depend on the length of the company's description. Like this:
>>> history = ' '.join(list[-4:-1])
>>> history
'0.0510, 0.0500, 0.0300'
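A minimal sketch of the negative-index idea wrapped in a function. It assumes every row ends with a status, a ranking, exactly three history values, and one trailing marker such as [EXTRACT]; the name parse_row is made up for illustration and is not part of the original answer.

```python
def parse_row(row):
    """Split a row on whitespace and pick the fields by position.

    Assumption (not guaranteed by the source data): each row ends with
    status, ranking, three history values, and one trailing marker token.
    """
    tokens = row.split()
    return {
        "Name": tokens[0],
        "CurrentPrice": tokens[1],
        "Status": tokens[-6],
        "WorldRanking": tokens[-5],
        # Skip the trailing marker and keep the three values before it.
        "History": " ".join(tokens[-4:-1]),
    }


row = ("AIG $30 AIG is an international renowned insurance company listed "
       "on the NYSE. A period is required. Manual Auto Active 3 "
       "0.0510, 0.0500, 0.0300 [EXTRACT]")
parsed = parse_row(row)
print(parsed)
```

Because every field after the description is counted from the end, a longer or shorter description does not shift the result.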
If the position of the history values varies, it could be easier to use re:
>>> import re
>>> history = re.findall(r"\d\.\d{4}", str)
>>> history
['0.0510', '0.0500', '0.0300']
To identify the ranking and the status, you could get the indexes of the history values and step back from the first one:
>>> indexes = [ i for i, substr in enumerate(list) if re.match(r"\d\.\d{4}", substr) ]
>>> indexes
[21, 22, 23]
>>> list[21:24]
['0.0510,', '0.0500,', '0.0300']
>>> ranking = list[indexes[0] - 1]
>>> ranking
'3'
>>> status = list[indexes[0] - 2]
>>> status
'Active'
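Since the question asked about regex specifically, here is a hedged sketch of a single pattern with named groups. It assumes the row always has the shape name, price, free-text description, status word, integer ranking, comma-separated decimals, and a literal [EXTRACT] marker; the group names and ROW_PATTERN are illustrative choices, not something from the original post.

```python
import re

ROW_PATTERN = re.compile(
    r"^(?P<name>\S+)\s+"                          # first token: the name
    r"(?P<price>\$\d+(?:\.\d+)?)\s+"              # second token: the price
    r".*\s"                                       # free-text description
    r"(?P<status>\S+)\s+"                         # status word
    r"(?P<rank>\d+)\s+"                           # integer ranking
    r"(?P<history>\d+\.\d+(?:,\s*\d+\.\d+)*)\s+"  # comma-separated decimals
    r"\[EXTRACT\]$"                               # trailing marker
)

row = ("AIG $30 AIG is an international renowned insurance company listed "
       "on the NYSE. A period is required. Manual Auto Active 3 "
       "0.0510, 0.0500, 0.0300 [EXTRACT]")

match = ROW_PATTERN.match(row)
if match is None:
    raise ValueError("row does not have the expected layout")

fields = match.groupdict()
# Turn the history string into a list of individual values.
fields["history"] = [value.strip() for value in fields["history"].split(",")]
print(fields)
```

Unlike the fixed-index approach, this version accepts any number of history values, at the cost of a pattern that is harder to read.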

Related

python Find the most reported month

I am trying to find the most reported month, here October (mentioned 2 times), and I had the idea of using a dictionary to solve this. However, I struggled to figure out how to separate the months, and my solution does not work for the first str value (commented out below), where there are spaces around the commas. Can someone please suggest how I can modify the split section to cover -, the comma, and white space?
import re
#str="May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990"
str="May-29-1990,Oct-18-1980,Sept-1-1980,Oct-2-1990"
val=re.split(',',str)
monthList=[]
myDictionary={}
#put the months in a list
def sep_month():
    for item in val:
        if not item.isdigit():
            month,day,year=item.split("-")
            monthList.append(month)
#process the month list from above
def count_month():
    for item in monthList:
        if item not in myDictionary.keys():
            myDictionary[item]=1
        else:
            myDictionary[item]=myDictionary.get(item)+1
    for k,v in myDictionary.items():
        if v==2:
            print(k)
sep_month()
count_month()
from datetime import datetime
import calendar
from collections import Counter
datesString = "May-29-1990,Oct-18-1980,Sep-1-1980,Oct-2-1990"
datesListString = datesString.split(",")
datesList = []
for dateStr in datesListString:
    datesList.append(datetime.strptime(dateStr, '%b-%d-%Y'))
monthsOccurrencies = Counter((calendar.month_name[date.month] for date in datesList))
print(monthsOccurrencies)
# Counter({'October': 2, 'May': 1, 'September': 1})
One thing to be aware of in my solution: %b expects the locale's abbreviated month name, so I changed Sept to Sep to make it work. You can use either full month names (%B) or abbreviated names (%b). If you cannot guarantee that the input string uses correct month names, replace the wrong ones (for example "Sept" with "Sep") and then always work with date objects.
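The replacement step described above could be sketched as follows. It assumes "Sept" is the only non-standard abbreviation in the data and that the default English (C) locale is in effect; parse_date is a hypothetical helper name.

```python
import calendar
import re
from collections import Counter
from datetime import datetime

dates_string = "May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990"


def parse_date(text):
    # Map the non-standard "Sept" to "Sep" so that %b can parse it,
    # and strip stray spaces left over from splitting on commas.
    cleaned = re.sub(r"\bSept\b", "Sep", text.strip())
    return datetime.strptime(cleaned, "%b-%d-%Y")


dates = [parse_date(part) for part in dates_string.split(",")]
month_counts = Counter(calendar.month_name[d.month] for d in dates)
print(month_counts.most_common(1))
```

Working with datetime objects also validates the day and year, which plain string splitting does not.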
I'm not sure that regex is the best tool for this job. I would just use strip() along with split() to handle your whitespace issues and get a list of just the month abbreviations. Then you could create a dict with counts by month using the list method count(). For example:
dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'
months = [d.split('-')[0].strip() for d in dates.split(',')]
month_counts = {m: months.count(m) for m in set(months)}
print(month_counts)
# {'May': 1, 'Oct': 2, 'Sept': 1}
Or even better with collections.Counter:
from collections import Counter
dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'
months = [d.split('-')[0].strip() for d in dates.split(',')]
month_counts = Counter(months)
print(month_counts)
# Counter({'Oct': 2, 'May': 1, 'Sept': 1})
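If you do want to keep re.split as in the question, one possible sketch is to let the pattern swallow the optional whitespace around each comma. The variable names here are illustrative.

```python
import re
from collections import Counter

dates = "May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990"

# \s*,\s* matches a comma together with any spaces on either side of it.
items = re.split(r"\s*,\s*", dates.strip())
months = [item.split("-")[0] for item in items]

# most_common(1) returns a list with the single (month, count) pair.
most_reported, count = Counter(months).most_common(1)[0]
print(most_reported, count)
```

This picks the month with the highest count rather than testing for a count of exactly 2, so it keeps working when a month appears three or more times.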

Why does the Django 'PostgreSQL Full Text Search' not return my objects and how to fix it?

I can't get the PostgreSQL Full Text Search feature to work as I need it.
I have a model Verein with a field straße. There are two Verein objects where straße has the value "Uhlandstraße".
What I want to achieve is that a search for "Uhl", "uhl", "nds", "straße" or "andstr" (you get the idea) returns those objects. Instead it does this:
>>> # With only 1 char missing at the end of the word, the search works.
>>> Verein.objects.filter(straße__search='Uhlandstraß')
<QuerySet [<Verein: 1>, <Verein: 2>]>
>>> # With more than 1 char missing at the end of the word, the search does not work.
>>> Verein.objects.filter(straße__search='Uhl')
<QuerySet []>
>>> Verein.objects.filter(straße__search='Uhlandstra')
<QuerySet []>
>>> # With only 1 char missing, but at the start of the word instead of the end, the search does not work.
>>> Verein.objects.filter(straße__search='hlandstraße')
<QuerySet []>
Any ideas what I need to change to get it working like explained?

Python: How to print specific columns with cut some strings on one of the column reading csv

I am new to Python, so apologies for the basic question.
I have a csv file in the format shown below.
##cat temp.csv
Id,Info,TimeStamp,Version,Numitems,speed,Path
18699504331,NA/NA/NA,2017:01:01:13:40:31,3.16,6,781.2kHz,/home/user1
31287345804,NA/NA/NA,2017:01:03:14:35:04,3.16,2,111.5MHz,/home/user2
16360534162,NA/NA/NA,2017:01:02:21:39:51,3.16,3,230MHz,/home/user3
I want to read the csv, print only the columns of interest, and cut part of the string in one of the columns so the output is readable and I can use it.
Here is the Python code:
cat temp.py
import csv
with open('temp.csv') as cvsfile:
    readcsv = csv.reader(cvsfile, delimiter=',')
    Id =[]
    Info =[]
    Timestamp =[]
    Version =[]
    Numitems =[]
    Speed =[]
    Path =[]
    for row in readcsv:
        lsfid = row[0]
        modelinfo = row[1]
        timestamp = row[2]
        compilever = row[3]
        numofavb = row[4]
        frequency = row[5]
        designpath = row[6]
        Id.append(lsfid)
        Info.append(modelinfo)
        Timestamp.append(timestamp)
        Version.append(compilever)
        Numitems.append(numofavb)
        Speed.append(frequency)
        Path.append(designpath)
    print(Id)
    print(Info)
    print(Timestamp)
    print(Version)
    print(Numitems)
    print(Speed)
    print(Path)
Output:
python temp.py
['Id', '18699504331', '31287345804', '16360534162', '18772620814', '18699504331', '31287345804', '16360534162']
['Info', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA']
['TimeStamp', '2017:01:01:13:40:31', '2017:01:03:14:35:04', '2017:01:02:21:39:51', '2017:01:03:14:40:47', '2017:01:01:13:40:31', '2017:01:03:14:35:04', '2017:01:02:21:39:51']
['Version', '3.16', '3.16', '3.16', '3.16', '3.16', '3.16', '3.16']
['Numitems', '6', '2', '3', '2', '6', '2', '3']
['speed', '781.2kHz', '111.5MHz', '230MHz', '100MHz', '781.2kHz', '111.5MHz', '230MHz']
['Path', '/home/user1', '/home/user2', '/home/user3', '/home/user4', '/home/user5', '/home/user6', '/home/user7']
But what I want is a well-organized layout with only my choice of columns printed, something like below:
Id           Info      TimeStamp            Version  Numitems  speed     Path
18699504331  NA/NA/NA  2017:01:01:13:40:31  3.16     6         781.2kHz  user1
31287345804  NA/NA/NA  2017:01:02:21:39:51  3.16     2         111.5MHz  user2
31287345804  NA/NA/NA  2017:01:02:21:39:51  3.16     2         111.5MHz  user3
Any help could be greatly appreciated!
Thanks in Advance
Velu.V
Check out numpy's genfromtxt function. You can use the usecols keyword argument to specify that you only want to read certain columns; see also https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html . For example, let's say we have the following csv sheet:
col1 , col2 , col3
0.5, test, 0.3
0.7, test2, 0.1
Then,
import numpy as np
table=np.genfromtxt(f,delimiter=',',skip_header=0,dtype='S',usecols=[0,1])  # f: path or file object for the csv sheet above
will load the first two columns. You can then use the tabulate package ( https://pypi.python.org/pypi/tabulate) to nicely print out your table.
from tabulate import tabulate
print(tabulate(table,headers='firstrow'))
The output will look like:
col1 col2
------- --------
0.5 test
0.7 test2
Hope that answers your question
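As an alternative that needs no third-party packages, here is a rough sketch using only the standard library. The sample rows are copied from the question, io.StringIO stands in for the real file, and format_table is a hypothetical helper; it assumes the Path values are POSIX-style paths whose last component is what should be shown.

```python
import csv
import io
import posixpath

# Sample data copied from the question; io.StringIO stands in for temp.csv.
SAMPLE = """\
Id,Info,TimeStamp,Version,Numitems,speed,Path
18699504331,NA/NA/NA,2017:01:01:13:40:31,3.16,6,781.2kHz,/home/user1
31287345804,NA/NA/NA,2017:01:03:14:35:04,3.16,2,111.5MHz,/home/user2
16360534162,NA/NA/NA,2017:01:02:21:39:51,3.16,3,230MHz,/home/user3
"""


def format_table(csvfile, columns):
    """Return the chosen columns as aligned text lines, header first."""
    rows = [columns]
    for record in csv.DictReader(csvfile):
        # Keep only the last path component, e.g. /home/user1 -> user1.
        record["Path"] = posixpath.basename(record["Path"])
        rows.append([record[name] for name in columns])
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(columns))]
    return ["  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
            for row in rows]


lines = format_table(io.StringIO(SAMPLE), ["Id", "TimeStamp", "speed", "Path"])
print("\n".join(lines))
```

To read the real file, pass open('temp.csv', newline='') instead of the StringIO object and change the column list to the columns you want.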

Python Dictionary Fuzzy Match on keys

I have the following dictionary:
classes = {'MATH6371': 'Statistics 1', 'COMP7330': 'Database Management',
'MATH6471': 'Statistics 2','COMP7340': 'Creative Computation' }
And I am trying to make a raw_input fuzzy match on the dictionary keys. For example, if I type in 'math', the output would be Statistics 1 and Statistics 2.
I have the following code, but it only matches keys exactly:
def print_courses (raw_input):
    search = raw_input("Type a course ID here:")
    if search in classes:
        print classes.get(search)
    else:
        print "Sorry, that course doesn't exist, try again"
print_courses(raw_input)
Thanks
Here you go:
>>> search = 'math'
>>> result = [classes[key] for key in classes if search in key.lower()]
>>> result
['Statistics 2', 'Statistics 1']
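The same substring match can be wrapped in a function. This is a sketch written for Python 3 (where input() replaces raw_input); find_courses is an illustrative name, and the search term is hard-coded so the example runs without prompting.

```python
classes = {
    "MATH6371": "Statistics 1",
    "COMP7330": "Database Management",
    "MATH6471": "Statistics 2",
    "COMP7340": "Creative Computation",
}


def find_courses(search, courses):
    """Return the titles whose course ID contains `search`, ignoring case."""
    needle = search.strip().lower()
    # Sort the titles so the result does not depend on dictionary order.
    return sorted(title for key, title in courses.items()
                  if needle in key.lower())


matches = find_courses("math", classes)
print(matches or "Sorry, that course doesn't exist, try again")
```

In an interactive script the search term would come from input("Type a course ID here:") instead of the literal "math".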

Would DateTimeField() work if I have time in this format 1/7/11 9:15 ? If not what would?

I am importing data from a JSON file and it has the date in the following format 1/7/11 9:15
What would be the best variable type/format to define in order to accept this date as it is? If not what would be the most efficient way to accomplish this task?
Thanks.
"What would be the best variable type/format to define in order to accept this date as it is?"
The DateTimeField.
"If not what would be the most efficient way to accomplish this task?"
You should use the datetime.strptime method from Python's builtin datetime library:
>>> from datetime import datetime
>>> import json
>>> json_datetime = '"1/7/11 9:15"' # still encoded as JSON (a quoted JSON string)
>>> py_datetime = json.loads(json_datetime) # now decoded to a Python string
>>> datetime.strptime(py_datetime, "%m/%d/%y %H:%M") # coerced into a datetime object
datetime.datetime(2011, 1, 7, 9, 15)
# Now you can save this object to a DateTimeField in a Django model.
If you take a look at https://docs.djangoproject.com/en/dev/ref/models/fields/#datetimefield, it says that Django uses the Python datetime library, which is documented at http://docs.python.org/2/library/datetime.html.
Here is a working example (with many debug prints and step-by-step instructions):
from datetime import datetime
json_datetime = "1/7/11 9:15"
json_date, json_time = json_datetime.split(" ")
print(json_date)
print(json_time)
day, month, year = map(int, json_date.split("/")) # maps each string in the list resulting from split to an int
year = 2000 + year # be careful here! 2 digits for a year may cause trouble (it could be 1911 as well)
hours, minutes = map(int, json_time.split(":"))
print(day)
print(month)
print(year)
my_datetime = datetime(year, month, day, hours, minutes)
print(my_datetime)
# Generate a json date:
new_json_style = "{0}/{1}/{2} {3}:{4}".format(my_datetime.day, my_datetime.month, my_datetime.year, my_datetime.hour, my_datetime.minute)
print(new_json_style)
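The manual splitting can also be left to strptime, which is sketched below together with a function that writes the same style back. Two things are assumptions here, since the question does not say: that the month comes first (the step-by-step example above reads the day first) and that the time is on a 24-hour clock. The names parse_legacy and format_legacy are made up for illustration.

```python
from datetime import datetime


def parse_legacy(text):
    """Parse 'M/D/YY H:MM' (month first, 24-hour clock; both are assumptions)."""
    return datetime.strptime(text, "%m/%d/%y %H:%M")


def format_legacy(dt):
    """Write the same style back without zero-padding month, day, or hour."""
    # {2:%y} and {2:%M} apply strftime codes to the datetime itself.
    return "{0}/{1}/{2:%y} {3}:{2:%M}".format(dt.month, dt.day, dt, dt.hour)


parsed = parse_legacy("1/7/11 9:15")
print(parsed)
print(format_legacy(parsed))
```

On the two-digit year: Python's %y maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, so unlike the fixed 2000 + year above, a value such as 99 is read as 1999.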