Rails: efficient way of finding/replacing words in a large collection of records using a hash filter - ruby-2.0

Just wondering what the best way to approach this would be,
I have a large american to british english filter with about 2000 items
filter = {"authorized"=>"authorised"........}
and a large collection of about 4000 records
posts = Post.all
what would be the most efficient way to do a search and replace across a couple of properties (say post.title and post.description) for each record while maintaining original casing(ie. replace all characters after the first one only)?
edit: updated hash count

I would think about using gsub with a Regexp.union and the hash syntax:
string.gsub(Regexp.union(filter.keys), filter)
To iterate over all posts use find_each to improve memory usage:
FILTER = { "authorized" => "authorised", ... }
FILTER_REGEXP = Regexp.new(Regexp.union(FILTER.keys), Regexp::IGNORECASE)
def translate(string)
string.gsub(FILTER_REGEXP, FILTER)
end
Post.find_each do |post|
post.update(
title: translate(post.title),
description: translate(post.description)
)
end
To support the original casing, I would add both versions to the hash (upcase and lowercase), that makes the whole regexp bigger, but makes the code easier to read and you avoid the extra logic to handle different cases. To generate a hash with both versions from you current hash, just use:
filter = Hash[*filter.map { |k, v| [[k,v], [k.capitalize,v.capitalize]] }.flatten]

Related

Pandas: Grouping rows by list in CSV file?

In an effort to make our budgeting life a bit easier and help myself learn; I am creating a small program in python that takes data from our exported bank csv.
I will give you an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the description column but I want to see it all tabulated as one "Fast Food " expense.
For instance the Csv is setup like this:
Date Description Debit Credit
1/20/20 POS PIN BLAH BLAH ### 1.75 NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
I ultimately would like to have it read off of a list? I would like to group all my expenses into categories and check those category variable names so that it would only output from that list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also I have looked through quite a few posts here on stack and have yet to find the answer (although I am sure I overlooked it)
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
df = pd.DataFrame({"description":['Macdonald something', 'Whataburger something', 'pizza hut something',
'Whataburger something','Macdonald something','Macdonald otherthing',],
"debit":[1.75,2.0,3.5,4.5,1.5,2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})',flags=re.I)
print (df.groupby("found").sum())
#
debit
found
Macdonald 5.25
Whataburger 6.50
pizza hut 3.50
Use dynamic pattern building:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags = re.I, regex = True)]
The \b word boundaries find whole words, not partial words.
The re.escape will protect special characters and they will be parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex

How to get the most accurate term in regex?

I have an angular app using the mongodb sdk for js.
I would like to suggest some words on a input field for the user from my words collection, so I did:
getSuggestions(term: string) {
var regex = new stitch.BSON.BSONRegExp('^' +term , 'i');
return from(this.words.find({ 'Noun': { $regex: regex } }).execute());
}
The problem is that if the user type for example Bie, the query returns a lot of documents but the most accurated are the last ones, for example Bier, first it returns the bigger words, like Bieberbach'sche Vermutung. How can I deal to return the closests documents first?
A regular-expression is probably not enough to do what you are intending to do here. They can only do what they're meant to do – match a string. They might be used to give you a candidate entry to present to the user, but can't judge or weigh them. You're going to have to devise that logic yourself.

Mixed word and geo search

How can I make a search query that
Finds everything around a location (using aroundLatLng and aroundRadius)
It's biased by a word search
Doesn't require results to include words (e.g. they're used for relevance, but if no words match a record it is not excluded a priori)
?
So for example if I query for "great view" near {lat: 48.8, lng: 2.3}, it should rank by word-relevance first, then by distance, and records which doesn't match any words but are near Paris should come up anyway in the results.
To make sure all words are not mandatory, you can use the optionalWords feature to consider all query words optional (beware, at least 1 will always need to match):
var query = "great view";
index.search(query, {
optionalWords: query,
aroundLatLng: '48.8,2.3',
aroundRadius: 10000
}).then(....);
Then you can eventually move the geo criteria of the default ranking formula down, just before the custom one.

Best way to compare phone numbers using Regex

I have two databases that store phone numbers. The first one stores them with a country code in the format 15555555555 (a US number), and the other can store them in many different formats (ex. (555) 555-5555, 5555555555, 555-555-5555, 555-5555, etc.). When a phone number unsubscribes in one database, I need to unsubscribe all references to it in the other database.
What is the best way to find all instances of phone numbers in the second database that match the number in the first database? I'm using the entity framework. My code right now looks like this:
using (FusionEntities db = new FusionEntities())
{
var communications = db.Communications.Where(x => x.ValueType == 105);
foreach (var com in communications)
{
string sRegexCompare = Regex.Replace(com.Value, "[^0-9]", "");
if (sMobileNumber.Contains(sRegexCompare) && sRegexCompare.Length > 6)
{
var contact = db.Contacts.Where(x => x.ContactID == com.ContactID).FirstOrDefault();
contact.SMSOptOutDate = DateTime.Now;
}
}
}
Right now, my comparison checks to see if the first database contains at least 7 digits from the second database after all non-numeric characters are removed.
Ideally, I want to be able to apply the regex formatting to the point in the code where I get the data from the database. Initially I tried this, but I can't use replace in a LINQ query:
var communications = db.Communications.Where(x => x.ValueType == 105 && sMobileNumber.Contains(Regex.Replace(x.Value, "[^0-9]", "")));
Comparing phone numbers is a bit beyond the capability of regex by design. As you've discovered there are many ways to represent a phone number with and without things like area codes and formatting. Regex is for pattern matching so as you've found using the regex to strip out all formatting and then comparing strings is doable but putting logic into regex which is not what it's for.
I would suggest the first and biggest thing to do is sort out the representation of phone numbers. Since you have database access you might want to look at creating a new field or table to represent a phone number object. Then put your comparison logic in the model.
Yes it's more work but it keeps the code more understandable going forward and helps cleanup crap data.

How to change a node's property based on one of its other properties in Neo4j

I just started using Neo4j server 2.0.1. I am having trouble with the writing a cypher script to change one of the nodes property to something based one of its already defined properties.
So if I created these node's:
CREATE (:Post {uname:'user1', content:'Bought a new pair of pants today', kw:''}),
(:Post {uname:'user2', content:'Catching up on Futurama', kw:''}),
(:Post {uname:'user3', content:'The last episode of Game of Thrones was awesome', kw:''})
I want the script to look at the content property and pick out the word "Bought" and set the kw property to that using a regular expression to pick out word(s) larger then five characters. So, user2's post kw would be "Catching, Futurama" and user3's post kw would be "episode, Thrones, awesome".
Any help would be greatly appreciated.
You could do something like this:
MATCH (p:Post { uname:'user1' })
WHERE p.content =~ "Bought .+"
SET p.kw=filter(w in split(p.content," ") WHERE length(w) > 5)
if you want to do that for all posts, which might not be the fastest operation:
MATCH (p:Post)
WHERE p.content =~ "Bought .+"
SET p.kw=filter(w in split(p.content," ") WHERE length(w) > 5)
split splits a string into a collection of parts, in this case words separated by space
filter filters a collection by a condition behind WHERE, only the elements that fulfill the condition are kept
Probably you'd rather want to create nodes for those keywords and link the post to the keyword nodes.