Mongoose: query full name with regex

I'm using the following code to search a user either by first or last name:
var re = new RegExp(req.params.search, 'i');
User.find()
.or([{ firstName: re }, { lastName: re }])
.exec(function(err, users) {
res.json(JSON.stringify(users));
});
It works well if search equals 'John' or 'Smith', but not for 'John Sm'.
Any clue how to accomplish this kind of query?
Thanks!
Disclaimer: This question originally appeared on the comments of this previous question 3 years ago and remains unanswered. I'm starting a new thread because 1) It wasn't the main question and 2) I consider this is interesting enough to have its own thread
EDIT:
Suppose the database contains two records: John Smith and John Kennedy.
Querying John should return both John Smith and John Kennedy
Querying John Sm should return only John Smith

Split the search term into words, and join them with the alternation operator ('|').
var terms = req.params.search.split(' ');
var regexString = "";
for (var i = 0; i < terms.length; i++)
{
regexString += terms[i];
if (i < terms.length - 1) regexString += '|';
}
var re = new RegExp(regexString, 'ig');
For the input 'John Smith', this creates a regex that looks like /John|Smith/ig. It matches either word on its own, so it also matches when the input is just 'John Sm'.
You can play around with this regex to get one more suited to your needs.
EDIT:
The problem here is that your name fields are separate. In this case, applying the same regex to both fields will not give the results that you want. The regex needs to be applied to a single field that holds the complete name.
A possible solution is using aggregation:
User.aggregate()
.project({fullName: {$concat: ['$firstName', ' ', '$lastName']}})
.match({fullName: re})
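To see why matching against the concatenated name behaves differently from matching each field on its own, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not Mongoose code: search_users and the sample list are hypothetical names, and the input is escaped on the assumption that users type plain text rather than regex syntax.

```python
import re

def search_users(users, search):
    """Return users whose 'first last' full name contains the search text."""
    # Escape the input so characters like '.' or '(' are matched literally.
    pattern = re.compile(re.escape(search.strip()), re.IGNORECASE)
    return [u for u in users
            if pattern.search(f"{u['firstName']} {u['lastName']}")]

# Hypothetical sample records mirroring the question's example.
users = [
    {"firstName": "John", "lastName": "Smith"},
    {"firstName": "John", "lastName": "Kennedy"},
]
```

With these records, searching for John returns both users, while John Sm returns only John Smith, because the space and the partial last name are matched against one combined string.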


Regex capture lines A, B, or C in any order only when not preceded by D

I have a file with content something like this:
SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: []
IRS NUMBER: 123456789
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
Then later in the file, it has something like this:
<REPORTING-OWNER>
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP
CENTRAL INDEX KEY: 0101010101
STANDARD INDUSTRIAL CLASSIFICATION: []
What I need to do is capture the company conformed name, central index key, IRS number, fiscal year end, or whatever I am looking to extract, but only in the subject company section--not the reporting owner section. These lines may be in any order, or not present, but I want to capture their values if they are present.
The regex I was trying to build looks like this:
(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,#`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
The desired results would be as follows:
conformed_name = "MISCELLANEOUS SUBJECT CORP"
CIK = "0000000000"
IRS_number = "123456789"
fiscal_year_end = "1231"
Any flavor of regex is acceptable for this, as I'll adapt to whatever works best for the scenario. Thank you for reading about my quandary and for any guidance you can offer.
I ended up figuring it out on my own:
/SUBJECT COMPANY:\s+COMPANY DATA:(?:\s+(?:(?:COMPANY CONFORMED NAME:\s+(?'conformed_name'[^\n]+))|(?:CENTRAL INDEX KEY:\s+(?'CIK'\d{10}))|(?:STANDARD INDUSTRIAL CLASSIFICATION:\s+(?'assigned_SIC'[^\n]+))|(?:IRS NUMBER:\s+?(?'IRS_number'\w{2}-?\w{7,8}))|(?:STATE OF INCORPORATION:\s+(?'state_of_incorporation'\w{2}))|(?:FISCAL YEAR END:\s+(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))\n))+/s
To match only the company section, and only when preceded by “SUBJECT COMPANY”, use a look behind:
(?<=SUBJECT COMPANY:\t\n \n )(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,#`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
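An alternative to one large pattern is to cut out the subject company section first and then look for each field inside it. The sketch below does that in Python. It assumes the section ends at the next <REPORTING-OWNER> marker, as in the sample; the function name and the simplified field patterns are illustrative, not a definitive grammar for these files.

```python
import re

SECTION_START = re.compile(r"SUBJECT COMPANY:")
# Assumed end marker, taken from the sample file in the question.
SECTION_END = re.compile(r"<REPORTING-OWNER>")

# One small pattern per field; order in the file does not matter.
FIELDS = {
    "conformed_name": r"COMPANY CONFORMED NAME:\s*(.+)",
    "cik": r"CENTRAL INDEX KEY:\s*(\d{10})",
    "irs_number": r"IRS NUMBER:\s*(\w{2}-?\w{7,8})",
    "fiscal_year_end": r"FISCAL YEAR END:\s*(\d{4})",
}

def subject_company_fields(text):
    """Extract the fields that are present in the SUBJECT COMPANY section."""
    start = SECTION_START.search(text)
    if start is None:
        return {}
    end = SECTION_END.search(text, start.end())
    section = text[start.end(): end.start() if end else len(text)]
    found = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, section)
        if match:  # missing fields are simply skipped
            found[name] = match.group(1).strip()
    return found
```

Because the reporting owner lines are never part of the searched text, no lookbehind is needed, and fields that are absent are left out of the result.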

Better way to find sub strings in Datastore?

I have an application where a user inputs a name and the application gives back the address and city for that name.
The names are in Datastore:
class Person(ndb.Model):
    name = ndb.StringProperty(repeated=True)
    address = ndb.StringProperty(indexed=False)
    city = ndb.StringProperty()
There are more than 5 million Person entities. Names can have from 2 to 8 words (yes, there are people with 8 words in their names).
Users can enter the words of the name in any order, and the application returns the first match ("John Doe Smith" is equivalent to "Smith Doe John").
This is a sample of my entities (as they were stored with ndb.put_multi):
id="L12802795",nombre=["Smith","Loyola","Peter","","","","",""], city="Cali",address="Conchuela 471"
id="M19181478",nombre=["Hoffa","Manzano","Linda","Rosse","Claudia","Cindy","Patricia",""], comuna="Lima",address=""
id="L18793849",nombre=["Parker","Martinez","Claudio","George","Paul","","",""], comuna="Santiago",address="Calamar 323 Villa Los Pescadores"
This is the way I get the name from the user:
name = self.request.get('content').strip() #The input is the name (an string with several words)
name=" ".join(name.split()).split() #now the name is a list of single words
In my design, in order to find a way to find and search words in the name for each entity, I did this.
q = Person.query()
if len(name)==1:
    names_query =q.filter(Person.name==name[0])
elif len(name)==2:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1])
elif len(name)==3:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2])
elif len(name)==4:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3])
elif len(name)==5:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3]).filter(Person.name==name[4])
elif len(name)==6:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3]).filter(Person.name==name[4]).filter(Person.name==name[5])
elif len(name)==7:
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3]).filter(Person.name==name[4]).filter(Person.name==name[5]).filter(Person.name==name[6])
else :
    names_query =q.filter(Person.name==name[0]).filter(Person.name==name[1]).filter(Person.name==name[2]).filter(Person.name==name[3]).filter(Person.name==name[4]).filter(Person.name==name[5]).filter(Person.name==name[6]).filter(Person.name==name[7])
# get() returns the first matching entity (or None); fetch(1) would return a list
person = names_query.get()
person_id = person.key.id() if person else None
Question 1
Is there a better way to search for substrings in strings (ndb.StringProperty) in my design? (I know it works, but I feel it can be improved.)
Question 2
My solution has a problem for entities with repeated words in the name.
If I want to find an entity with the words "Smith Smith", it returns "Paul Smith Wshite" instead of "Paul Smith Smith". I do not know how to modify my query in order to find 2 (or more) repeated words in Person.name.
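On Question 2: as far as I can tell, repeating the same equality filter on a repeated property adds no new constraint, because a single stored "Smith" satisfies both filters. One possible workaround is to fetch a few candidates and check word counts in memory. The sketch below shows only that in-memory check; contains_all_words is a hypothetical helper, and lower-casing the words is an assumption about how names should be compared.

```python
from collections import Counter

def contains_all_words(entity_words, query_words):
    """True if every query word occurs at least as often in the entity's words."""
    # Empty strings (the padding in the stored lists) are ignored.
    have = Counter(w.lower() for w in entity_words if w)
    need = Counter(w.lower() for w in query_words if w)
    return all(have[word] >= count for word, count in need.items())
```

For example, ["Paul", "Smith", "Smith"] passes the check for the query ["Smith", "Smith"], while ["Paul", "Smith", "Wshite"] does not.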
You could generate a list of all possible tokens for each name and use prefix filters to query them:
import itertools
from google.appengine.ext import ndb

class Person(ndb.Model):
    name = ndb.StringProperty(required=True)
    address = ndb.StringProperty(indexed=False)
    city = ndb.StringProperty()

    def _tokens(self):
        """Returns all possible combinations of name tokens combined.
        For example, for input 'john doe smith' we will get:
        ['john doe smith', 'john smith doe', 'doe john smith', 'doe smith john',
        'smith john doe', 'smith doe john']
        """
        tokens = [t.lower() for t in self.name.split(' ') if t]
        return [' '.join(t) for t in itertools.permutations(tokens)] or None
    # Stored and indexed alongside the entity so it can be queried by prefix
    tokens = ndb.ComputedProperty(_tokens, repeated=True)

    @classmethod
    def suggest(cls, s):
        s = s.lower()
        # Range filter that selects every token starting with s
        return cls.query(ndb.AND(cls.tokens >= s, cls.tokens <= s + u'\ufffd'))
ndb.put_multi([Person(name='John Doe Smith'), Person(name='Jane Doe Smith'),
Person(name='Paul Smith Wshite'), Person(name='Paul Smith'),
Person(name='Test'), Person(name='Paul Smith Smith')])
assert Person.suggest('j').count() == 2
assert Person.suggest('ja').count() == 1
assert Person.suggest('jo').count() == 1
assert Person.suggest('doe').count() == 2
assert Person.suggest('t').count() == 1
assert Person.suggest('Smith Smith').get().name == 'Paul Smith Smith'
assert Person.suggest('Paul Smith').count() == 3
And make sure to use keys_only queries if you only want keys/ids. This will make things significantly faster and almost free in terms of datastore OPs.
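The prefix filter above can be tried without Datastore. The following sketch reproduces the same idea on plain Python strings: build every ordering of the name's words, then keep names that have a token in the range from s to s + '\ufffd'. The functions name_tokens and suggest are stand-ins written for this illustration, not part of ndb.

```python
import itertools

def name_tokens(name):
    """All orderings of the lower-cased words in a name, without duplicates."""
    words = [w.lower() for w in name.split() if w]
    return sorted({" ".join(p) for p in itertools.permutations(words)})

def suggest(names, s):
    """Names having at least one token that starts with s (range comparison)."""
    s = s.lower()
    upper = s + "\ufffd"  # sorts after any ordinary continuation of s
    return [n for n in names
            if any(s <= token <= upper for token in name_tokens(n))]
```

Keep in mind that the number of tokens grows factorially: a name with 8 distinct words yields 40320 orderings, so this approach may need a cap on the number of words for the longest names.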

Using Perl to extract text from a text file

I have a question related to using regex to pull out data from a text file. I have a text file in the following format:
REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
FILING VALUES:
FORM TYPE: 4
SEC ACT: 1934 Act
SEC FILE NUMBER: 811-00248
FILM NUMBER: 11530052
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET
STREET 2: STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
IRS NUMBER: 134912740
STATE OF INCORPORATION: MD
FISCAL YEAR END: 1231
BUSINESS ADDRESS:
STREET 1: SEVEN ST PAUL ST STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
BUSINESS PHONE: 4107525900
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET SUITE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
I want to save the owner's name (John Doe) and identifier (99999999999) and the company's name (ACME Inc) and identifier (0000002230) as separate variables. However, as you can see, the variable names (CENTRAL INDEX KEY and COMPANY CONFORMED NAME) are exactly the same for both pieces of information.
I've used the following code to extract the owner's information, but I can't figure out how to extract the data for the company. (Note: I read the entire text file into $data).
if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
Any idea as to how I can extract the information for both the owner and the company?
Thanks!
There is a big difference between doing it quick and dirty with regexes (maintenance nightmare), or doing it right.
As it happens, the file you gave looks very much like YAML.
use feature 'say';
use YAML;
my $data = Load(...);
say $data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"};
say $data->{"ISSUER"}->{"COMPANY DATA"}->{"COMPANY CONFORMED NAME"};
Prints:
DOE JOHN
ACME INC
Isn't that cool? All in a few lines of safe and maintainable code ☺
my ($ownname, $ownkey, $comname, $comkey) = $data =~ /\bOWNER DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+).*\bCOMPANY DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+)/ms;
If you're reading this file on a UNIX operating system but it was generated on Windows, then line endings will be indicated by the character pair \r\n instead of just \n, and in this case you should do
$data =~ tr/\r//d;
first to get rid of these \r characters and prevent them from finding their way into $ownname and $comname.
Select both bits of information at the same time so that you know that you're getting the CENTRAL INDEX KEY associated with either the owner or the company.
($name, $cik) = $data =~ /COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(.*)$/m;
Instead of trying to match elements in the string, split it into lines, and parse properly into data structure that will let such searches be made easily, like:
$data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"}
That should be relatively easy to do.
Search for the OWNER DATA: header, read one more line, split on : and take the last field. Do the same for the COMPANY DATA: header (sort of), and so on.
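The line-by-line approach described above can be sketched as follows, in Python rather than Perl. It assumes that a line with nothing after the colon is a header, and that the top-level headers are known in advance (here REPORTING-OWNER and ISSUER, taken from the sample). The function name parse_sections is made up for this illustration.

```python
def parse_sections(text, top_level=("REPORTING-OWNER", "ISSUER")):
    """Parse 'KEY: value' lines into data[top][sub][key] = value."""
    data = {}
    top = sub = None
    for raw in text.splitlines():
        line = raw.strip()  # also drops a trailing \r from Windows files
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            # A key/value line belongs to the most recent sub-section.
            if top is not None and sub is not None:
                data[top][sub][key] = value
        elif key in top_level:
            top, sub = key, None
            data[top] = {}
        elif top is not None:
            sub = key
            data[top][sub] = {}
    return data
```

With the sample file, data["REPORTING-OWNER"]["OWNER DATA"]["COMPANY CONFORMED NAME"] and data["ISSUER"]["COMPANY DATA"]["COMPANY CONFORMED NAME"] then hold the two names separately, and the repeated MAIL ADDRESS sections no longer collide.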

How to get Place ID of city from Latitude/Longitude using Facebook API

I need to find the Facebook place for the city for many lat/long points. The actual points refer to personal addresses, so there are no exact place ID's to look for as in the case of a business.
For testing, I was looking for the town of Red Feather Lakes, CO.
The graph search function will return a lot of places, but does not return cities.
Raw FQL does not let you search by lat/long, and has no concept of "nearby" anyway.
An FQL query by ID reveals that there is at least a "Display Subtext" field which indicates that the object is a city.
Thanks for any help. I have over 80 years of dated and geotagged photos of my dad that he would love to see on his timeline!
EDIT
Cities are not in the place table, they are only in the page table.
There is an undocumented distance() FQL function, but it only works in the place table. (Via this SO answer.)
This works:
SELECT name,description,geometry,latitude,longitude, display_subtext
FROM place
WHERE distance(latitude, longitude, "40.801985", "-105.593719") < 50000
But this gives an error "distance is not valid in table page":
SELECT page_id,name,description,type,location
FROM page
WHERE distance(
location.latitude,location.longitude,
"40.801985", "-105.593719") < 50000
It's a glorious hack, but this code works. The trick is to make two queries. First we look for places near our point. This returns a lot of business places. We then take the city of one of these places, and use this to look in the page table for that city's page. There seems to be a standard naming convention for cities, but it differs for US and non-US cities.
Some small cities have various spellings in the place table, so the code loops through the returned places until it finds a match in the page table.
$fb_token = 'YOUR_TOKEN';
// Red Feather Lakes, Colorado
$lat = '40.8078';
$long = '-105.579';
// Karlsruhe, Germany
$lat = '49.037868';
$long = '8.350124';
$states_arr = array('AL'=>"Alabama",'AK'=>"Alaska",'AZ'=>"Arizona",'AR'=>"Arkansas",'CA'=>"California",'CO'=>"Colorado",'CT'=>"Connecticut",'DE'=>"Delaware",'FL'=>"Florida",'GA'=>"Georgia",'HI'=>"Hawaii",'ID'=>"Idaho",'IL'=>"Illinois", 'IN'=>"Indiana", 'IA'=>"Iowa", 'KS'=>"Kansas",'KY'=>"Kentucky",'LA'=>"Louisiana",'ME'=>"Maine",'MD'=>"Maryland", 'MA'=>"Massachusetts",'MI'=>"Michigan",'MN'=>"Minnesota",'MS'=>"Mississippi",'MO'=>"Missouri",'MT'=>"Montana",'NE'=>"Nebraska",'NV'=>"Nevada",'NH'=>"New Hampshire",'NJ'=>"New Jersey",'NM'=>"New Mexico",'NY'=>"New York",'NC'=>"North Carolina",'ND'=>"North Dakota",'OH'=>"Ohio",'OK'=>"Oklahoma", 'OR'=>"Oregon",'PA'=>"Pennsylvania",'RI'=>"Rhode Island",'SC'=>"South Carolina",'SD'=>"South Dakota",'TN'=>"Tennessee",'TX'=>"Texas",'UT'=>"Utah",'VT'=>"Vermont",'VA'=>"Virginia",'WA'=>"Washington",'DC'=>"Washington D.C.",'WV'=>"West Virginia",'WI'=>"Wisconsin",'WY'=>"Wyoming");
$place_search = json_decode(file_get_contents('https://graph.facebook.com/search?type=place&center=' . $lat . ',' . $long . '&distance=10000&access_token=' . $fb_token));
foreach($place_search->data as $result) {
if ($result->location->city) {
$city = $result->location->city;
$state = $result->location->state;
$country = $result->location->country;
if ($country=='United States') {
$city_name = $city . ', ' . $states_arr[$state]; // e.g. 'Chicago, Illinois'
}
else {
$city_name = $city . ', ' . $country; // e.g. 'Rome, Italy'
}
$fql = 'SELECT name,page_id,name,description,type,location FROM page WHERE type="CITY" and name="' .$city_name. '"';
$result = json_decode(file_get_contents('https://graph.facebook.com/fql?q=' . rawurlencode($fql) . '&access_token=' . $fb_token));
if (count($result->data)>0) {
// We found it!
print_r($result);
break;
}
else {
// No luck, try the next place
print ("Couldn't find " . $city_name . "\n");
}
}
}
I found this solution worked for me when looking for a page for the closest city to the specified latitude/longitude. For some reason LIMIT 1 didn't return the closest city so I bumped up the limit and then took the first result.
SELECT page_id
FROM place
WHERE is_city and distance(latitude, longitude, "<latitude>", "<longitude>") < 100000
ORDER BY distance(latitude, longitude, "<latitude>", "<longitude>")
LIMIT 20
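If the ordering of the returned rows cannot be relied on, one option is to pick the closest candidate on the client side. The sketch below does this with the haversine great-circle formula in Python. It assumes each returned row has been decoded into a dict with numeric latitude and longitude keys (the column names used in the place query earlier); the helper names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def closest(places, lat, lon):
    """Return the place dict nearest to (lat, lon)."""
    return min(places,
               key=lambda p: haversine_m(lat, lon, p["latitude"], p["longitude"]))
```

Sorting the 20 returned rows this way costs very little and does not depend on how the server orders them.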

Regular Expressions task

Below is an example of a text file I need to parse.
Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
Affiliated Offices: None
E-mail: blah@blah.com
I need to parse all the key/value pairs and import it into a database. For example, I will insert 'John Doe' into the [Lead Attorney] column. I started a regex but I'm running into problems when parsing line 2:
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
I started with the following regex:
(\w*.?\w+):\s*(.)(?!(\w.?\w+:.*))
But that does not parse out 'Staff Attorneys: John Doe Jr.' and 'Paralegal: John Doe III'. How can I ensure that my regex returns two groups for every key/value pair even if the key/value pairs are on the same line? Thanks!
Does any kind of key appear as a second key? The text above can be fixed by doing a data.replace('Paralegal:', '\nParalegal:') first. Then there is only one key/value pair per line, and it gets trivial:
>>> data = """Lead Attorney: John Doe
... Staff Attorneys: John Doe Jr. Paralegal: John Doe III
... Geographic Area: Wisconsin
... Affiliated Offices: None
... E-mail: blah@blah.com"""
>>>
>>> result = {}
>>> data = data.replace('Paralegal:', '\nParalegal:')
>>> for line in data.splitlines():
... key, val = line.split(':', 1)
... result[key.strip()] = val.strip()
...
>>> print(result)
{'Staff Attorneys': 'John Doe Jr.', 'Lead Attorney': 'John Doe', 'Paralegal': 'John Doe III', 'Affiliated Offices': 'None', 'Geographic Area': 'Wisconsin', 'E-mail': 'blah@blah.com'}
If "Paralegal:" also appears first you can make a regexp to do the replacement only when it's not first, or make a .find and check that the character before is not a newline. If there are several keywords that can appear like this, you can make a list of keywords, etc.
If the keywords can be anything, but only one word, you can look for ':' and parse backwards for space, which can be done with regexps.
If the keywords can be anything and include spaces, it's impossible to do automatically.
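For the case where the list of possible keys is known, a single regex built from that list can split several pairs on one line. This is a minimal sketch under that assumption: the KEYS list is copied from the sample text, and parse_pairs is a name chosen for this example.

```python
import re

# Assumed to be the complete list of keys that can appear in the file.
KEYS = ["Lead Attorney", "Staff Attorneys", "Paralegal",
        "Geographic Area", "Affiliated Offices", "E-mail"]

def parse_pairs(text, keys=KEYS):
    """Return {key: value}, even when several pairs share one line."""
    alternation = "|".join(re.escape(k) for k in keys)
    # A value runs lazily until the next known key or the end of the line.
    pattern = re.compile(
        rf"({alternation}):[ \t]*(.*?)[ \t]*(?=(?:{alternation}):|$)",
        re.MULTILINE,
    )
    return dict(pattern.findall(text))
```

The lookahead is what stops the value of Staff Attorneys before Paralegal: begins, so no pre-processing of the text is needed.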