rdflib SPARQL queries (involving subclasses) not behaving as expected - python-2.7

I have just started to self-learn RDF and the Python rdflib library. But I have hit an issue with the following code. I expected the queries to return mark and nat as Persons and only natalie as a Professor. I can't see where I've gone wrong. (BTW, I know Professor should be a title rather than a kind of Person, but I'm just tinkering ATM.) Any help appreciated. Thanks.
rdflib 4, Python 2.7
>>> from rdflib import Graph, BNode, URIRef, Literal
INFO:rdflib:RDFLib Version: 4.2.
>>> from rdflib.namespace import RDF, RDFS, FOAF
>>> G = Graph()
>>> mark = BNode()
>>> nat = BNode()
>>> G.add((mark, RDF.type, FOAF.Person))
>>> G.add((mark, FOAF.firstName, Literal('mark')))
>>> G.add((URIRef('Professor'), RDF.type, RDFS.Class))
>>> G.add((URIRef('Professor'), RDFS.subClassOf, FOAF.Person))
>>> G.add((nat, RDF.type, URIRef('Professor')))
>>> G.add((nat, FOAF.firstName, Literal('natalie')))
>>> qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type foaf:Person .
?a foaf:firstName ?aname .
}""", initNs = {"rdf": RDF,"foaf": FOAF})
>>> for row in qres:
...     print "%s is a person" % row
...
mark is a person
>>> qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type ?prof .
?a foaf:firstName ?aname .
}""", initNs = {"rdf": RDF,"foaf": FOAF, "prof": URIRef('Professor')})
>>> for row in qres:
...     print "%s is a Prof" % row
...
natalie is a Prof
mark is a Prof
>>>

Your query is bringing back all types because the ?prof variable isn't bound to a value. Passing "prof" in initNs only registers namespace prefixes, so ?prof stays a free variable that matches any type. What you want is the initBindings kwarg, which binds the 'Professor' URI to the variable before the query runs. Changing your query as below retrieves just natalie.
qres = G.query(
"""SELECT DISTINCT ?aname
WHERE {
?a rdf:type ?prof .
?a foaf:firstName ?aname .
}""",
initNs ={"rdf": RDF,"foaf": FOAF},
initBindings={"prof": URIRef('Professor')})


Replacing strings using regex using Pandas

In Pandas, why does the following not replace any strings containing an exclamation mark with whatever follows it?
In [1]: import pandas as pd
In [2]: ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
In [3]: ser
Out[3]:
0 Aland Islands !Åland Islands
1 Reunion !Réunion
2 Zimbabwe
dtype: object
In [4]: patt = r'.*!(.*)'
In [5]: repl = lambda m: m.group(1)
In [6]: ser.replace(patt, repl)
Out[6]:
0 Aland Islands !Åland Islands
1 Reunion !Réunion
2 Zimbabwe
dtype: object
Whereas the direct reference to the matched substring does work:
In [7]: ser.replace({patt: r'\1'}, regex=True)
Out[7]:
0 Åland Islands
1 Réunion
2 Zimbabwe
dtype: object
What am I doing wrong in the first case?
It appears that Series.replace does not support a callable as the replacement argument. One workaround is to import the re library explicitly and use apply:
>>> import re
>>> #... your code ...
>>> ser.apply(lambda row: re.sub(patt, repl, row))
0 Åland Islands
1 Réunion
2 Zimbabwe
dtype: object
There are two replace methods in Pandas.
The one that acts directly on a Series can take a regex pattern string or a compiled regex and can act in-place, but doesn't allow the replacement argument to be a callable. You must set regex=True and use raw strings.
With:
import re
import pandas as pd
ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])
Yes:
ser.replace(r'.*!(.*)', r'\1', regex=True, inplace=True)
ser.replace(r'.*!', '', regex=True, inplace=True)
regex = re.compile(r'.*!(.*)')
ser.replace(regex, r'\1', regex=True, inplace=True)
No:
repl = lambda m: m.group(1)
ser.replace(regex, repl, regex=True, inplace=True)
There's another, used as Series.str.replace. This one accepts a callable replacement but won't substitute in-place and doesn't take a regex argument (though regular expression pattern strings can be used):
Yes:
ser.str.replace(r'.*!', '')
ser.str.replace(r'.*!(.*)', r'\1')
ser.str.replace(regex, repl)
No:
ser.str.replace(regex, r'\1')
ser.str.replace(r'.*!', '', inplace=True)
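To make the callable form concrete, here is a minimal runnable sketch with the question's data (regex=True is passed explicitly, which recent pandas versions require for pattern replacement):

```python
import pandas as pd

ser = pd.Series(['Aland Islands !Åland Islands', 'Reunion !Réunion', 'Zimbabwe'])

# str.replace hands each match object to the callable, just like re.sub
out = ser.str.replace(r'.*!(.*)', lambda m: m.group(1), regex=True)
print(out.tolist())  # ['Åland Islands', 'Réunion', 'Zimbabwe']
```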
I hope this is helpful to someone out there.
Try this snippet:
pattern = r'(.*)!'
ser.replace(pattern, '', regex=True)
In your case it didn't work because you didn't set regex=True, and it defaults to False.

pymongo float precision

I have a number stored in Mongo as 15000.245263, with 6 digits after the decimal point, but when I use pymongo to get this number I get 15000.24. Does pymongo reduce the precision of the float?
I can't reproduce this. In Python 2.7.13 on my Mac:
>>> from pymongo import MongoClient
>>> c = MongoClient().my_db.my_collection
>>> c.delete_many({}) # Delete all documents
>>> c.insert_one({'x': 15000.245263})
>>> c.find_one()
{u'x': 15000.245263, u'_id': ObjectId('59525d32a08bff0800cc72bd')}
The retrieved value of "x" is printed the same as it was when I entered it.
This can happen if you're printing a long float value; I think it is not related to MongoDB.
>>> print 1111.1111
1111.1111
>>> print 1111111111.111
1111111111.11
>>> print 1111111.11111111111
1111111.11111
# for a timestamp
>>> import time
>>> now = time.time()
>>> print now
1527160240.06
In Python 2.7, print (which goes through str()) shows only 12 significant digits. If you want to display the whole value, use a format string instead, like this:
>>> print '%.6f' % 111111111.111111
111111111.111111
And this is just a display problem, the value of the variable will not be affected.
>>> test = 111111111.111111 * 2
>>> test
222222222.222222
>>> print test
222222222.222
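A quick way to convince yourself the stored value is intact is to compare a fixed-precision format against repr(), which prints enough digits to round-trip the float exactly (a sketch using the number from the question):

```python
x = 15000.245263

# format to 6 decimal places: all the digits the asker thought were lost
print('{:.6f}'.format(x))

# repr() round-trips: parsing its output back yields the identical float
print(float(repr(x)) == x)  # True
```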

Is it possible to serialize only specific classes/functions in pickle / dill python?

I have an app that wants to serialize only classes/functions which are:
not a Python primitive data type
not a NumPy data type
not a pandas data type.
So, is it possible to filter the objects to be saved with dill
(filtering by looping on the type)?
Thanks,
While this is not a complete solution (i.e. you'd probably want to include more of the modules with pandas data types, numpy data types… and also you might want to be more selective for the built-in types by filtering by type instead of module)… I think it sort of gets you what you want.
>>> import dill
>>> import numpy
>>> import pandas
>>>
>>> target = numpy.array([1,2,3])
>>> dill.dumps(target) if not dill.source.getmodule(type(target)) in [numpy, pandas.core.series, dill.source.getmodule(int)] else None
>>>
>>> target = [1,2,3]
>>> dill.dumps(target) if not dill.source.getmodule(type(target)) in [numpy, pandas.core.series, dill.source.getmodule(int)] else None
>>>
>>> target = lambda x:x
>>> dill.dumps(target) if not dill.source.getmodule(type(target)) in [numpy, pandas.core.series, dill.source.getmodule(int)] else None
>>>
>>> class Foo(object):
... pass
...
>>> target = Foo()
>>> dill.dumps(target) if not dill.source.getmodule(type(target)) in [numpy, pandas.core.series, dill.source.getmodule(int)] else None
'\x80\x02cdill.dill\n_create_type\nq\x00(cdill.dill\n_load_type\nq\x01U\x08TypeTypeq\x02\x85q\x03Rq\x04U\x03Fooq\x05h\x01U\nObjectTypeq\x06\x85q\x07Rq\x08\x85q\t}q\n(U\r__slotnames__q\x0b]q\x0cU\n__module__q\rU\x08__main__q\x0eU\x07__doc__q\x0fNutq\x10Rq\x11)\x81q\x12}q\x13b.'
>>>
However, if you are asking if dill has such a filtering method, then the answer is no.
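If a simple yes/no filter is enough, the same module-based test can be wrapped in a helper. A sketch using the stdlib pickle so it runs anywhere (dill.dumps would drop in identically; the module list is an assumption you would extend):

```python
import pickle

# Skip objects whose type lives in one of these top-level modules
# (hypothetical list: built-ins plus numpy and pandas)
SKIP_MODULES = {'builtins', '__builtin__', 'numpy', 'pandas'}

def dumps_filtered(obj):
    top_module = type(obj).__module__.split('.')[0]
    if top_module in SKIP_MODULES:
        return None  # primitive / numpy / pandas: not serialized
    return pickle.dumps(obj)

class Foo(object):
    pass

print(dumps_filtered([1, 2, 3]) is None)   # True: list is a built-in type
print(dumps_filtered(Foo()) is not None)   # True: user-defined class
```

Note that unlike dill, plain pickle cannot serialize lambdas, so this portable version only covers picklable objects.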

Using python groupby or defaultdict effectively?

I have a csv with name, role, and years of experience. I want to create a list of tuples that aggregates (name, role1, total_exp_inthisRole) for all the employees.
So far I am able to use defaultdict to do the below:
import csv, urllib2
from collections import defaultdict
response = urllib2.urlopen(url)
cr = csv.reader(response)
parsed = ((row[0],row[1],int(row[2])) for row in cr)
employees =[]
for item in parsed:
    employees.append(tuple(item))
employeeExp = defaultdict(int)
for x, y, z in employees:  # tuple unpacking
    employeeExp[x] += z
employeeExp.items()
output: [('Ken', 15), ('Buckky', 5), ('Tina', 10)]
But how do I use the second column as well to achieve the result I want? Should I try to solve it by grouping on multiple keys, or is there a simpler way? Thanks all in advance.
You can simply pass a tuple of name and role to your defaultdict, instead of only one item:
for x, y, z in employees:
    employeeExp[(x, y)] += z
For your second expected output ([('Ken', ('engineer', 5),('sr. engineer', 6)), ...])
You need to aggregate the result of aforementioned snippet one more time, but this time you need to use a defaultdict with a list:
d = defaultdict(list)
for (name, role), total_exp_inthisRole in employeeExp.items():
    d[name].append((role, total_exp_inthisRole))
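Putting the two passes together as a self-contained sketch (inline sample rows stand in for the csv feed; the names and numbers are made up):

```python
from collections import defaultdict

rows = [('Ken', 'engineer', 5), ('Ken', 'sr. engineer', 6),
        ('Ken', 'engineer', 10), ('Tina', 'manager', 10)]

# first pass: total experience per (name, role) pair
per_role = defaultdict(int)
for name, role, years in rows:
    per_role[(name, role)] += years

# second pass: group the (role, total) pairs under each name
by_name = defaultdict(list)
for (name, role), total in sorted(per_role.items()):
    by_name[name].append((role, total))

print(dict(by_name))
# {'Ken': [('engineer', 15), ('sr. engineer', 6)], 'Tina': [('manager', 10)]}
```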

What is the most efficient method to parse this line of text?

The following is a row that I have extracted from the web:
AIG $30 AIG is an international renowned insurance company listed on the NYSE. A period is required. Manual Auto Active 3 0.0510, 0.0500, 0.0300 [EXTRACT]
I would like to create 5 separate variables by parsing the text and retrieving the relevant data. However, I seriously don't understand the regex documentation! Can anyone guide me on how to do it correctly with this example?
Name = AIG
CurrentPrice = $30
Status = Active
World_Ranking = 3
History = 0.0510, 0.0500, 0.0300
Not sure what you want to achieve here. There's no need to use regexes; you could just use str.split:
>>> str = "AIG $30 AIG is an international renowned insurance company listed on the NYSE. A period is required. Manual Auto Active 3 0.0510, 0.0500, 0.0300 [EXTRACT]"
>>> list = str.split()
>>> dict = { "Name": list[0], "CurrentPrice": list[1], "Status": list[19], "WorldRanking": list[20], "History": ' '.join((list[21], list[22], list[23])) }
#output
>>> dict
{'Status': 'Active', 'CurrentPrice': '$30', 'Name': 'AIG', 'WorldRanking': '3', 'History': '0.0510, 0.0500, 0.0300'}
Instead of using list[19] and so on, you may want to change it to list[-n] so as not to depend on the length of the company's description. Like this:
>>> history = ' '.join(list[-4:-1])
>>> history
'0.0510, 0.0500, 0.0300'
If the position of the history values varies, it could be easier to use re:
>>> import re
>>> history = re.findall(r"\d\.\d{4}", str)
>>> history
['0.0510', '0.0500', '0.0300']
To locate the status and ranking without counting words by hand, you could get the indexes of the history values and work backwards: the world ranking sits one position before the first history value, and the status one before that:
>>> [ i for i, substr in enumerate(list) if re.match(r"\d\.\d{4}", substr) ]
[21, 22, 23]
>>> list[21:24]
['0.0510,', '0.0500,', '0.0300']
>>> ranking = list[20]
>>> ranking
'3'
>>> status = list[19]
>>> status
'Active'
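If you'd rather avoid index arithmetic altogether, a single pattern anchored on the stable tokens can pull out all five fields at once. A sketch (the group names and the Active|Inactive alternation are assumptions about what the rows can contain):

```python
import re

line = ("AIG $30 AIG is an international renowned insurance company listed "
        "on the NYSE. A period is required. Manual Auto Active 3 "
        "0.0510, 0.0500, 0.0300 [EXTRACT]")

# lazy .*? skips the variable-length description up to the status word
pattern = (r'(?P<Name>\S+)\s+(?P<CurrentPrice>\$\S+)\s+.*?'
           r'\b(?P<Status>Active|Inactive)\b\s+(?P<World_Ranking>\d+)\s+'
           r'(?P<History>\d[\d., ]*\d)')
m = re.match(pattern, line)
print(m.groupdict())
```

This relies on the price always starting with '$' and the status word not appearing earlier in the description, so treat it as a starting point rather than a robust parser.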