I'm trying to do a fuzzy lookup on two datasets in SAS. I searched Google and found the link below, which explains how to do a fuzzy lookup in SAS.
Link: http://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/
To explain the problem in detail: the two datasets contain hospital names and other additional information. I have to match the two datasets on hospital name. The main challenge is that in some cases the hospital name appears as follows:
Dataset1(hospital Name): St.Hospital
Dataset2(hospital Name): Saint.Hospital
Likewise INC and Incorporated.
I would like to know whether there is a best way to do a fuzzy lookup in SAS.
Thanks,
VJ
There can't be any single best way to do a fuzzy lookup, as the article you linked to explains. You have to decide on the best approach for your particular problem domain and your particular tolerances for false positives and false negatives, etc.
For your data, I would probably just define a set of 'best guess' transformations on the hospital name in both input data sets, and then do a standard merge on the transformed names. The transformations would be something like:
Convert to uppercase
Convert 'ST.' or 'ST ' to 'SAINT' (or should that be 'STREET'??)
Convert 'INC' or 'INC.' to 'INCORPORATED'
Convert any other known common strings as above
Remove any remaining punctuation
Use COMPBL to reduce multiple spaces to a single space
Do the merge
You will then have to examine the result and decide if it's good enough for your purposes. There is no general way for a computer to match up two strings that might be arbitrarily badly-spelled, particularly if there are multiple possible 'correct' matches - this is the same problem that spell-checkers have been trying to solve for decades - there's no way of knowing (in isolation) whether a misspelled word like 'falt' was meant to be 'fault', 'fall', 'fast', 'fat' etc.
If your results have to be perfect, you will need a human to review anything that isn't an exact match, and even then some of the exact matches might be misspellings that happen to match another hospital's name (eg, 'Saint Mary's Hospital' vs 'Saint May's Hospital'). That's why the preferred approach would usually be to identify the hospital by an ID number and the name, rather than just the name.
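The transformation steps listed above can be sketched as follows. This is a minimal illustration in Python rather than SAS, assuming a hypothetical `ABBREVIATIONS` table and a made-up "Acme Health" record; in SAS the same steps would map onto functions such as UPCASE, TRANWRD, COMPRESS and COMPBL. Punctuation is stripped before the abbreviations are expanded so that 'ST.HOSPITAL' splits into two words first.

```python
import re

# Hypothetical abbreviation table; extend it with whatever appears in your data.
ABBREVIATIONS = {
    "ST": "SAINT",
    "INC": "INCORPORATED",
}


def normalize_name(name):
    """Return a canonical form of a hospital name to use as a merge key."""
    text = name.upper()
    # Replace punctuation with spaces so 'ST.HOSPITAL' becomes two words.
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    # Expand known abbreviations word by word.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    # Joining on single spaces also collapses repeated blanks (like COMPBL).
    return " ".join(words)


# Made-up sample rows standing in for the two datasets.
dataset1 = ["St.Hospital", "Acme Health Inc."]
dataset2 = ["Saint.Hospital", "ACME HEALTH INCORPORATED"]

# Standard merge on the transformed names: index one side, look up the other.
lookup = {normalize_name(name): name for name in dataset2}
matches = {name: lookup.get(normalize_name(name)) for name in dataset1}
```

Names that come back as `None` in `matches` are the ones left over for manual review.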
I have been looking for a name for a new project. I want the name to have available domains and social media handles. For months, all those I can think of are taken.
So I generated a list of names with at least a consonant and a vowel and checked if the domains are available (which is very fast). I have about a million possible names.
I would like to sort them by some rank of simplicity. "Aaazq" would be close to the bottom, "Cawel" would be close to the top. I thought of the CVC structure (Consonant-Vowel-Consonant) and wonder if some more sophisticated algorithm exists. I searched for "sonority" but it has a different meaning in linguistics.
How can I automatically rank the simplicity of a random name?
I assume you would judge simplicity as compared to a target language, say English. Something that is 'simple' in English might not be 'simple' in German or Korean, as these languages have very different phonological structures.
I would recommend the following:
collect some data of the language you are using. Just get some novels from Project Gutenberg, for example, or newspaper articles. Whatever you can easily get hold of.
now generate n-grams from this: all sequences of two (bigrams) or three (trigrams) letters. Turn this into a frequency list, so that common n-grams are at the top of the list with a high frequency.
turn your suggested name into n-grams. Count how many times the respective n-gram occurs in your frequency list, and take the average or median of the result
Your examples would do as follows:
aa aa az zq: "aa" is rare ("aardvark") "az" a bit more common ("glaze", "raze"), and "zq" would not exist. So, not a very high score.
ca aw we el: all of these are fairly common in English words, so a reasonably high score.
You could also add a dummy # at the beginning and the end, so in your first example you'd get #a, which is fine, as many English words start with "a", but the final q# bombs out, as there are only a few words, such as "Iraq", that end in "q".
You can obviously do the same for other languages.
Also, you can reverse the process in a way, and pick random n-grams from your frequency list to generate names: by picking higher-frequency n-grams you will make sure the name is a good match with the phonological structure of your target language.
Note for pedants: I use phonological structure, but it's really its representation in the spelling system that we're dealing with here.
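The scoring procedure described above might look roughly like this. It is a sketch under assumptions: the one-sentence `corpus` is a toy stand-in for real text (novels, newspaper articles), the helper names are made up for illustration, and the mean of the bigram counts is used as the score.

```python
from collections import Counter


def bigrams(word):
    """Split a word into letter bigrams, with '#' marking both word boundaries."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def build_frequencies(corpus_words):
    """Count how often each bigram occurs across all words of the corpus."""
    counts = Counter()
    for word in corpus_words:
        counts.update(bigrams(word))
    return counts


def score(name, counts):
    """Average corpus frequency of the name's bigrams (higher = more familiar)."""
    grams = bigrams(name)
    return sum(counts[gram] for gram in grams) / len(grams)


# Toy corpus; in practice use a large body of text in the target language.
corpus = "the cat saw a well made camel and a small towel near the wall".split()
counts = build_frequencies(corpus)

# Sort candidate names from most to least familiar-looking.
ranked = sorted(["Aaazq", "Cawel"], key=lambda name: score(name, counts), reverse=True)
```

Swapping the mean for the median, or bigrams for trigrams, only changes `score` and `bigrams`; the rest stays the same.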
I'm trying to figure out what ways there are (and which of them is best) to extract values for predefined keys from unstructured text.
Input:
The doctor prescribed me a drug called favipiravir.
His name is Yury.
Ilya has already told me about that.
The weather is cold today.
I am taking a medicine called nazivin.
Key list: ['drug', 'name', 'weather']
Output:
['drug=favipiravir', 'drug=nazivin', 'name=Yury', 'weather=cold']
So, as you can see, the third sentence contains no explicit key 'name', and therefore no value is extracted (I think this is the difference from NER). At the same time, 'drug' and 'medicine' are synonyms, so we should treat 'medicine' as the 'drug' key and extract its value too.
And the next question: what if the key set is mutable?
Should I start from a regexp approach because the keys are predefined, or is there a way to implement this with supervised learning/NN? (But in that case, how do I deal with mutable keys?)
You can use a parser to tag words. Your problem is similar to Named Entity Recognition (NER). A lot of libraries, like NLTK in Python, have POS taggers and named-entity taggers available. You can try those. The entity taggers are generally trained to identify names, locations, etc. Depending on the type of words you need, you may need to train the tagger yourself, so you'll also need some labeled data. Check out this link:
https://nlp.stanford.edu/software/CRF-NER.html
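Before training a tagger, a rule-based baseline along the lines of the regexp idea in the question can serve as a reference point. The sketch below is only an illustration: the `synonyms` table and the two surface patterns ("<key> called <value>" and "<key> is <value>") are assumptions chosen to fit the sample sentences, not a general solution. Because the pattern is rebuilt from the table on each call, the key set can change freely.

```python
import re


def extract(sentences, synonyms):
    """Extract 'key=value' pairs for known keys using two simple patterns."""
    # Alternation over every surface form (keys and their synonyms).
    forms = "|".join(re.escape(form) for form in synonyms)
    pattern = re.compile(
        r"\b(?P<key>" + forms + r")\s+(?:called|is)\s+(?P<value>\w+)",
        re.IGNORECASE,
    )
    pairs = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            # Map the surface form back to its canonical key.
            key = synonyms[match.group("key").lower()]
            pairs.append(key + "=" + match.group("value"))
    return pairs


# Surface form -> canonical key; 'medicine' is treated as a synonym of 'drug'.
synonyms = {"drug": "drug", "medicine": "drug", "name": "name", "weather": "weather"}

sentences = [
    "The doctor prescribed me a drug called favipiravir.",
    "His name is Yury.",
    "Ilya has already told me about that.",
    "The weather is cold today.",
    "I am taking a medicine called nazivin.",
]
result = sorted(extract(sentences, synonyms))
```

Sentences phrased differently from these two patterns will be missed, which is where a trained tagger earns its keep.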
I have a large corpus of words extracted from the documents. In the corpus are words which might mean the same.
For example, "command" and "order" mean the same, while "apple" and "apply" do not.
I would like to merge similar words, say "command" and "order" into "command".
I have tried word2vec, but it doesn't seem to check for semantic similarity of words (it outputs a high similarity for apple and apply, since four characters in the words are the same). And when I try wup similarity, it gives a good similarity score whenever the words have matching synonyms, and the results are not that impressive.
What would be the best approach to reduce semantically similar words, so as to get rid of redundant data and merge similar data?
I believe one of the options here is using WordNet. It gives you a list of synonyms for the word, so you may merge them together given you know its part of speech.
However, I'd like to point out that "order" and "command" are not the same, e.g. you do not command food in restaurants, and such homonymy applies to a great many words.
Also, I'd like to point out that Word2vec ignores spelling entirely; the algorithm considers only the contexts in which words co-occur. I suppose you might be confusing it with FastText.
So there must be some problem with your model, because in a standard set of embeddings the distance between these words should be large. The MUSE FastText similarity between "apple" and "apply" is only 0.15, which is quite low.
I use Gensim's function
model.similarity("apply", "apple")
So you might need to fix learning parameters or just use a pretrained model.
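For reference, the similarity Gensim reports is the cosine similarity between the two word vectors. Here is a minimal sketch of how such scores could drive the merging, assuming toy three-dimensional vectors in place of real embeddings, a hypothetical `merge_similar` helper, and an arbitrary threshold of 0.8:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_similar(vectors, threshold):
    """Map each word to the first earlier word whose vector is close enough."""
    canonical = {}
    representatives = []
    for word, vector in vectors.items():
        for rep in representatives:
            if cosine(vector, vectors[rep]) >= threshold:
                canonical[word] = rep
                break
        else:
            # No existing representative is similar: the word stands for itself.
            representatives.append(word)
            canonical[word] = word
    return canonical


# Toy vectors for illustration only; real ones would come from a trained model.
toy_vectors = {
    "command": [0.9, 0.1, 0.0],
    "order": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
    "apply": [0.1, 0.9, 0.1],
}
mapping = merge_similar(toy_vectors, threshold=0.8)
```

With these toy vectors, "order" collapses into "command" while "apple" and "apply" stay separate. The right threshold depends on the embeddings and has to be tuned by inspecting the merged groups.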
I have a task to complete.
There are more than 4000 CSV files of two types, and the two types are related to each other.
The two types are:
1. Country2.csv
2. Security_Name.csv
Contents of Country2.csv:
Company Name;Security Name;;;;Final NOS;Final FFR
Contents of Security_Name.csv:
Date;Close Price;Volume
There are multiple countries, and for each country there are multiple security files.
Now I need to READ them, do some CALCULATION, and then WRITE the output to other files.
READ
Read both the Country 2.csv and Security.csv files and extract all the data from them.
For example :
Read France 2.csv, extract Security_Name, Final NOS, Final FFR
Then Read Security.csv(which matches the Security_Name) and extract Date, Close Price, Volume
Calculation
Calculations are basically finding Median of the values extracted which is quite simple.
For Example:
Monthly Median Traded Values
Daily Traded Value of a Security ... and so on
Write
Based on the month, I need to sort the output into two different files with the following formats:
If Month % 3 = 0
Save It as MONTH_NAME.csv in following format:
Security name; 12-month indicator; 3-month indicator; FOT
Else
Save It as MONTH_NAME.csv in following format:
Security Name; Monthly Median Traded Value Ratio; Number of days Volume > 0
My question is how do I design my application in such a way that it is maintainable and the flow of data throughout the execution is seamless?
So first thing. Based on the kind of data you are looking to generate, I would probably be looking at moving this data to a SQL db if at all possible. This is "one SQL query" kind of stuff. And far more maintainable than C++ that generates CSV files from CSV files.
Barring that, I would probably look at using datamash and/or perl. On a Windows platform, you could do this through Cygwin or WSL. Probably less maintainable, but so much easier it's not too much of an issue.
That said, if you're looking for something moderately maintainable, C++ could work. The first thing I would do is design my input classes. Data-centric, but it can work. It sounds like you could have a Country class, a Security class, and a SecurityClose class...or something along those lines. You can think about whether a Security class should contain a collection of SecurityClose objects (data), or whether the data should just be "loose" and reference the Security it belongs to. Same with the Country->Security relationship.
Once you've decided how all that's going to look, you want something (likely a function) that can tokenize a CSV line. So "1,2,3" gets turned into a vector<string> with the contents "1" "2" "3". Then, each of your input classes should have a constructor or initializer that takes a vector<string> and populates itself. You might need to pass higher-level data along too, like the filename if you want the security data to know which security it belongs to.
That's basically most of the battle there. Once you've pulled your data into sensibly organized classes, the rest should come more easily. And if you run into bumps, hopefully you can ask specific design or implementation questions from there.
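The shape of that design can be sketched briefly. The example below is in Python rather than C++ purely for brevity, and it rests on assumptions: the `SecurityClose` fields follow the Date;Close Price;Volume layout from the question, `from_fields` plays the role of the constructor that takes the tokenized line, the sample rows are made up, and "traded value" is taken to mean close price times volume.

```python
import statistics
from dataclasses import dataclass


def tokenize(line, sep=";"):
    """Split one CSV line into trimmed string fields."""
    return [field.strip() for field in line.rstrip("\n").split(sep)]


@dataclass
class SecurityClose:
    security: str       # passed in from the file name, not stored in the row
    date: str
    close_price: float
    volume: float

    @classmethod
    def from_fields(cls, security, fields):
        """Populate a record from the tokenized fields of one data row."""
        date, close_price, volume = fields[:3]
        return cls(security, date, float(close_price), float(volume))

    @property
    def traded_value(self):
        # Assumption: daily traded value = close price * volume.
        return self.close_price * self.volume


def median_traded_value(rows):
    """Median of the daily traded values over a collection of rows."""
    return statistics.median(row.traded_value for row in rows)


# Made-up sample rows in the Date;Close Price;Volume layout.
lines = ["2015-01-02;10.0;100", "2015-01-05;11.0;200", "2015-01-06;12.0;0"]
rows = [SecurityClose.from_fields("Security_Name", tokenize(line)) for line in lines]
median = median_traded_value(rows)
```

Keeping parsing (`tokenize`, `from_fields`) separate from calculation (`median_traded_value`) means the output writers only ever see populated objects, never raw CSV text.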
What I am trying to do:
I am trying to take a list of terms and distinguish which domain they are coming from. For example "intestine" would be from the anatomical domain while the term "cancer" would be from the disease domain. I am getting these terms from different ontologies such as DOID and FMA (they can be found at bioportal.bioontology.org)
The problem:
I am having a hard time working out the best way to implement this. Currently I naively take the terms from the DOID and FMA ontologies and remove from the DOID list any term that is also in the FMA list, which we know is anatomical. (The DOID list contains terms that may be partly anatomical, such as colon carcinoma, colon being anatomical and carcinoma being a disease.)
Thoughts:
I was thinking that I could collect root words, prefixes, and suffixes for the different term domains and try to match them against the terms in the list. Another idea is to pull more information from each ontology, such as metadata, and use that to distinguish between the terms.
Any ideas are welcome.
As a first run, you'll probably have the best luck with bigrams. As an initial hypothesis, diseases are usually noun phrases, and usually have a very English-specific structure where NP -> N N, like "liver cancer", which means roughly the same thing as "cancer of the liver." Doctors tend not to use the latter, while the former should be caught with bigrams quite well.
Use the two ontologies you have there as starting points to train some kind of bigram model. Like Rcynic suggested, you can count them up and derive probabilities. A Naive Bayes classifier would work nicely here. The features are the bigrams; classes are anatomy or disease. sklearn has Naive Bayes built in. The "naive" part means, in this case, that all your bigrams are independent of each other. This assumption is fundamentally false, but it works well in a lot of circumstances, so we pretend it's true.
This won't work perfectly. As it's your first pass, you should be prepared to probe the output to understand how it derived its answers and to find the cases it failed on. When you find trends in the errors, tweak your model and try again.
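To make the classifier idea concrete, here is a small from-scratch sketch of multinomial Naive Bayes with add-one smoothing. It is written with the standard library only so that it runs anywhere; sklearn's built-in Naive Bayes would be the practical choice. The assumptions are loud ones: the six training terms are a toy stand-in for the DOID and FMA term lists, and single words are used as features alongside word bigrams so that unseen terms still share some features with the training data.

```python
import math
from collections import Counter, defaultdict


def features(term):
    """Words plus adjacent word pairs (bigrams) of a term."""
    words = term.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs


def train(labelled_terms):
    """Count classes and per-class feature occurrences."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for term, label in labelled_terms:
        class_counts[label] += 1
        for feature in features(term):
            feature_counts[label][feature] += 1
            vocabulary.add(feature)
    return class_counts, feature_counts, vocabulary


def classify(term, model):
    """Return the class with the highest smoothed log-probability."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        for feature in features(term):
            # Add-one smoothing keeps unseen features from zeroing the product.
            score += math.log((feature_counts[label][feature] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data; real data would come from the two ontologies.
training = [
    ("liver cancer", "disease"), ("colon carcinoma", "disease"),
    ("lung cancer", "disease"), ("liver", "anatomy"),
    ("colon", "anatomy"), ("left lung", "anatomy"),
]
model = train(training)
```

Printing the per-feature log terms inside `classify` is one easy way to probe why a given term landed in a class.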
I wouldn't recommend WordNet here. It wasn't written by doctors, and since what you're doing relies on precise medical terminology, it's probably going to add bizarre meanings. Consider, from nltk.corpus.wordnet:
>>> livers = reader.synsets("liver")
>>> pprint([l.definition() for l in livers])
[u'large and complicated reddish-brown glandular organ located in the upper right portion of the abdominal cavity; secretes bile and functions in metabolism of protein and carbohydrate and fat; synthesizes substances involved in the clotting of the blood; synthesizes vitamin A; detoxifies poisonous substances and breaks down worn-out erythrocytes',
u'liver of an animal used as meat',
u'a person who has a special life style',
u'someone who lives in a place',
u'having a reddish-brown color']
Only one of these is really of interest to you. As a null hypothesis, there's an 80% chance WordNet will add noise, not knowledge.
The naive approach - what precision and recall is it getting you? If you set up a test case now, then you can track your progress as you apply more sophisticated methods.
I don't know what initial set you are dealing with - but one thing to try is to get your hands on annotated documents (maybe use Mechanical Turk). The documents need to be tagged with the domains you're looking for - anatomical or disease.
Then counting and dividing will tell you how likely a word you encounter is to belong to a domain. With that, the next step can be to tweak some weights.
Another approach (going in a whole other direction) is using WordNet. I don't know if it will be useful for exactly your purposes, but it's a massive ontology - so it might help.
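A test case of that kind needs little more than a list of predictions and a list of hand-checked labels. A minimal sketch, assuming a hypothetical `precision_recall` helper and four made-up labelled terms:

```python
def precision_recall(predicted, actual, positive):
    """Precision and recall for one class, given parallel label lists."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == positive and a == positive)
    false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
    false_neg = sum(1 for p, a in pairs if p != positive and a == positive)
    # Guard against empty denominators when a class is never predicted or seen.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up example: what the method said vs. what a human reviewer said.
predicted = ["disease", "disease", "anatomy", "anatomy"]
actual = ["disease", "anatomy", "disease", "anatomy"]
precision, recall = precision_recall(predicted, actual, positive="disease")
```

Re-running the same check after each change to the method shows whether the change actually helped.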
Python has bindings to use Wordnet via nltk.
from nltk.corpus import wordnet as wn
wn.synsets('cancer')
gives output = [Synset('cancer.n.01'), Synset('cancer.n.02'), Synset('cancer.n.03'), Synset('cancer.n.04'), Synset('cancer.n.05')]
http://wordnetweb.princeton.edu/perl/webwn
Let us know how it works out.