I have a .txt file from which I have to fetch names and ages.
The .txt file has data in a format like this:
Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs Height: 5 feet 1 inch; weight is 56 kgs.
This medical record is 10 years old.
Output 1: John, Sam, Kenner
Output 2: 47, 29, 36
I am using regular expressions to extract the data. For example, for age, I am using the regular expressions below:
re.compile(r'age:\s*\d{1,3}',re.I)
re.compile(r'(age:|is|age|a|) \s*\d{1,3}(\s|y)',re.I)
re.compile(r'.* Age\s*:*\s*[0-9]+.*',re.I)
re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*',re.I)
I then apply another regular expression to the output of these expressions to extract the numbers. The problem is that with these regular expressions I am also getting data that I do not want. For example:
This medical record is 10 years old.
I am getting '10' from the above sentence, which I do not want.
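For example, the last pattern above matches the unwanted sentence just as readily as a wanted one:
import re

age_re = re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*', re.I)

# The pattern only looks for a number followed by a "years" keyword,
# so it cannot tell a person's age from the record's age:
print(age_re.search('John is 47 years old.'))                 # matches (wanted)
print(age_re.search('This medical record is 10 years old.'))  # matches (unwanted)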
I only want to extract people's names and their ages. What should the approach be? I would appreciate any kind of help.
Please take a look at the Cloud Data Loss Prevention API. Here is a GitHub repo with examples. This is what you'll likely want.
def inspect_string(project, content_string, info_types,
                   min_likelihood=None, max_findings=None, include_quote=True):
    """Uses the Data Loss Prevention API to analyze strings for protected data.

    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect.
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
        min_likelihood: A string representing the minimum likelihood threshold
            that constitutes a match. One of: 'LIKELIHOOD_UNSPECIFIED',
            'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY', 'VERY_LIKELY'.
        max_findings: The maximum number of findings to report; 0 = no maximum.
        include_quote: Boolean for whether to display a quote of the detected
            information in the results.

    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library.
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp.DlpServiceClient()

    # Prepare info_types by converting the list of strings into a list of
    # dictionaries (protos are also accepted).
    info_types = [{'name': info_type} for info_type in info_types]

    # Construct the configuration dictionary. Keys which are None may
    # optionally be omitted entirely.
    inspect_config = {
        'info_types': info_types,
        'min_likelihood': min_likelihood,
        'include_quote': include_quote,
        'limits': {'max_findings_per_request': max_findings},
    }

    # Construct the `item`.
    item = {'value': content_string}

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # Call the API.
    response = dlp.inspect_content(parent, inspect_config, item)

    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            try:
                if finding.quote:
                    print('Quote: {}'.format(finding.quote))
            except AttributeError:
                pass
            print('Info type: {}'.format(finding.info_type.name))
            print('Likelihood: {}'.format(finding.likelihood))
    else:
        print('No findings.')
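For the text in the question, a call might look roughly like this (PERSON_NAME and AGE are built-in DLP info types; 'my-project' is a placeholder for your project id):
# 'my-project' is a placeholder; PERSON_NAME and AGE are DLP info types.
inspect_string('my-project',
               'Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).',
               ['PERSON_NAME', 'AGE'])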
This post shows how to get dependencies of a block of text in CoNLL format with spaCy's taggers. This is the solution posted:
import spacy

nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,          # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,      # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_       # Relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence splitter. I would like to use it, and then give spaCy one sentence at a time to get POS, NER, and dependencies.
How can I get POS, NER, and dependencies of one sentence in CoNLL format with spaCy without having to use spaCy's sentence splitter?
A Doc in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Doc for each of your split sentences, then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # The whole doc is a single sentence here, so offsets are
            # relative to doc[0] rather than sent[0].
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,          # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,      # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_       # Relation
        ))
Of course, to follow the CoNLL format you would have to print a blank line after each sentence.
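For example, if your own splitter is a function that yields sentence strings (my_sentence_splitter below is just a placeholder for it), the usage could be:
# my_sentence_splitter stands in for your own sentence splitter.
for sentence in my_sentence_splitter(text):
    printConll(sentence)
    print('')  # blank line between sentences, as the CoNLL format expects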
This post is about a user facing unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (as seen in that post) is to add the flexibility to define one's own sentence boundary detection rules. In spaCy, this problem is solved in conjunction with dependency parsing, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.
@ashu's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. There is, however, a simple sentencizer:
https://spacy.io/api/sentencizer
It seems the sentencizer just uses punctuation (not the perfect way), but since such a sentencizer exists, you can create a custom one using your own rules, and it will affect the sentence boundaries for sure.
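As a rough sketch (assuming spaCy 2.x, where a pipeline component is a plain function, and using a made-up semicolon rule as the custom boundary), it could look like this:
import spacy

nlp = spacy.load('en')

def custom_boundaries(doc):
    # Example rule: start a new sentence after every semicolon.
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

# Run the rule before the parser so it constrains the parser's
# sentence boundaries instead of being overwritten by them.
nlp.add_pipe(custom_boundaries, before='parser')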
I'm trying to read the discharge data of 346 US rivers stored online in text files. The files are more or less in this format:
Measurement_number Date Gage_height Discharge_value
1 2017-01-01 10 1000
2 2017-01-20 15 2000
# etc.
I only want to read the gage height and discharge value columns.
The problem is that in most files additional metadata columns are added in front of the 'Gage_height' column, so I cannot simply read the 3rd and 4th columns because their index varies.
I'm trying to find a way to say 'read the columns named Gage_height and Discharge_value', but I haven't succeeded yet.
I hope someone can help. I'm currently loading the text files with numpy.genfromtxt, so it would be great to find a solution with that package, but other solutions are also more than welcome.
This is my code so far:
data_url = urllib2.urlopen(...)  # the URL of this specific site
data = np.genfromtxt(data_url, skip_header=1, comments='#', usecols=[2, 3])
You can use the names=True option to genfromtxt, and then use the column names to select which columns you want to read with usecols.
For example, to read 'Gage_height' and 'Discharge_value' from your data file:
data = np.genfromtxt(filename, names=True, usecols=['Gage_height', 'Discharge_value'])
Note that you don't need to set skip_header=1 if you use names=True.
You can then access the columns using their names:
gage_height = data['Gage_height'] # == array([ 10., 15.])
discharge_value = data['Discharge_value'] # == array([ 1000., 2000.])
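If you read straight from the URL as in the question, the same options should carry over (a sketch; the URL itself stays a placeholder):
import urllib2
import numpy as np

data_url = urllib2.urlopen(...)  # the URL of this specific site
data = np.genfromtxt(data_url, names=True, comments='#',
                     usecols=['Gage_height', 'Discharge_value'])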
See the docs here for more information.
I'm using the R (3.2.3) tm package (0.6-2) and would like to subset my corpus according to partial string matches contained in the metadatum "id".
For example, I would like to filter all documents that contain the string "US" within the "id" column. The string "US" would be preceded and followed by various characters and numbers.
I have found a similar example here. It recommends the quanteda package, but I think this should also be possible with the tm package.
Another, more relevant answer to a similar problem is found here. I have tried to adapt that sample code to my context, but I haven't managed to incorporate the partial string matching.
I imagine there might be multiple things wrong with my code so far.
What I have so far looks like this:
US <- tm_filter(corpus, FUN = function(corpus, filter) any(meta(corpus)["id"] == filter), grep(".*US.*", corpus))
And I receive the following error message:
Error in structure(as.character(x), names = names(x)) :
'names' attribute [3811] must be the same length as the vector [3]
I'm also not sure how to come up with a reproducible example simulating my problem for this post.
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).
I have a large vector of character strings and would like to extract certain pieces of information from it by pattern matching:
str(input)
chr [1:109094] "{'asin': '0981850006', 'description': 'Steven Raichlen\'s Best of Barbecue Primal Grill DVD. The first three volumes of the si"| truncated ...
I get the following content for input[1] (the product metadata description):
[1] ("{'asin': '144072007X', 'related': {'also_viewed': ['B008WC0X0A', 'B000CPMOVG', 'B0046641AE', 'B00J150GAO', 'B00005AMCG', 'B005WGX97I'],
'bought_together': ['B000H85WSA']},
'title': 'Sand Shark Margare Maron Audio CD',
'price': 577.15,
'salesRank': {'Patio, Lawn & Garden': 188289},
'imUrl': 'http://ecx.images-amazon.com/images/I/31B9X0S6dqL._SX300_.jpg',
'brand': 'Tesoro',
'categories': [['Patio, Lawn & Garden', 'Lawn Mowers & Outdoor Power Tools', 'Metal Detectors']],
'description': \"The Tesoro Sand Shark metal combines time-proven PI circuits with the latest digital technology creating the first.\"}")
Now I would like to iterate over each element of the vector and extract asin, title, price, salesRank, brand, and categories, which should be saved in a data.frame for easier handling.
The data originally comes from a JSON file, as you might notice. I tried to import it using the stream_in command, but it didn't help, so I just imported it using readLines. Please help! I'm getting a bit desperate... Any hint is appreciated!
The jsonlite package shows the following problem:
lexical error: invalid char in json text.
{'asin': '0981850006', 'descript
(right here) ------^
closing fileconnectionoldClass input connection.
Any new ideas on that?
Given the many unanswered questions on this issue, it must be very relevant for newbies ;)
I have written a Python 2.7 script to retrieve all my historical data from Xively.
Originally I wrote it in C#, and it works perfectly.
I am limiting each request to a 6-hour block in order to retrieve all the stored data.
My version in Python is as follows:
requestString = 'http://api.xively.com/v2/feeds/41189/datastreams/0001.csv?key=YcfzZVxtXxxxxxxxxxxORnVu_dMQ&start=' + requestDate + '&duration=6hours&interval=0&per_page=1000'
response = urllib2.urlopen(requestString).read()
The request date is in the correct format; I compared the full C# requestString and the Python one.
Using the above request, I only get 101 lines of data, which equates to a few minutes of results.
My suspicion is that it is the .read() function: it returns about 34k characters, which is far less than the C# version. I tried passing 100000 as an argument to the read function, but the result did not change.
Here is another solution, also written in Python 2.7.
In my case, I retrieve the data in 30-minute blocks, because many sensors send values every minute and the Xively API limits a request to half an hour of data at that sending frequency.
This is the general loop:
for day in datespan(start_datetime, end_datetime, deltatime):  # step from start_datetime to end_datetime in increments of deltatime
    while True:  # make sure the data is actually retrieved
        try:
            response = urllib2.urlopen('https://api.xively.com/v2/feeds/' + str(feed) + '.csv?key=' + apikey_xively + '&start=' + day.strftime("%Y-%m-%dT%H:%M:%SZ") + '&interval=' + str(interval) + '&duration=' + duration)  # get data
            break
        except:
            time.sleep(0.3)  # wait briefly, then try again
    cr = csv.reader(response)  # return data in columns
    print '.'
    for row in cr:
        if row[0] in id:  # choose desired data
            f.write(row[0] + "," + row[1] + "," + row[2] + "\n")  # write "id,timestamp,value"
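The datespan helper used above is defined in the full script linked below; a minimal sketch of such a generator could look like this:
from datetime import timedelta

def datespan(start, end, delta=timedelta(minutes=30)):
    # Yield successive window start times from start (inclusive) up to end.
    current = start
    while current < end:
        yield current
        current += delta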
You can find the full script here: https://github.com/CarlosRufo/scripts/blob/master/python/retrievalDataXively.py
I hope it helps, and I'm delighted to answer any questions :)