MATLAB: most efficient partial string subtract/regex

I have a project with a huge list of email addresses (e.g. johnhahifas#example.com) and a list of the 5000 most common first names (e.g. john, jim, etc.).
For each email address, I am trying to remove any first name that appears in it, so that, for example:
johnhahifas#example.com becomes hahifas#example.com
benTTTben#something.com becomes TTT#something.com
So far, I have come up with a regex solution and a strfind solution; the regexprep one is much faster. I even optimized it a little by removing the #example.com part first so that the operation might be faster, but it still takes forever to run.
You can download the common first names from this address (CSV file):
http://www.quietaffiliate.com/Files/CSV_Database_of_First_Names.csv
The regex code I have:
fileID = fopen('bademails');
emails = textscan(fileID,'%s');
fclose(fileID);
str = emails{1,1}; %// Loaded in the emails
fileID = fopen('CSV_Database_of_First_Names.csv');
names = textscan(fileID,'%s');
fclose(fileID);
Names = lower(names{1,1}); %// Loaded in the names
K = regexprep(str,Names,''); %// Remove each name, in turn, from every email
Any faster solutions would be much appreciated, thank you in advance!
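For comparison, here is a minimal Python sketch of one common way to speed this kind of job up: compile all names into a single alternation and make one pass over the local part of each address. It is not MATLAB and has not been benchmarked against the regexprep call above. The helper names (build_name_pattern, strip_names) are hypothetical, and the '#' separator is assumed from the examples in the question. Unlike applying the names one after another, a single pass will not remove a name that only appears after another name has been cut out.

```python
import re


def build_name_pattern(names):
    """Compile all names into one alternation, longest names first."""
    # Longest first so that a longer name wins over a shorter one at the same position.
    ordered = sorted({n.lower() for n in names if n}, key=len, reverse=True)
    return re.compile("|".join(re.escape(n) for n in ordered))


def strip_names(email, pattern, sep="#"):
    """Remove every matching name from the part before the separator."""
    local, found_sep, domain = email.partition(sep)
    return pattern.sub("", local) + found_sep + domain


# Tiny stand-in for the 5000-name list (assumed sample data).
pattern = build_name_pattern(["john", "jim", "ben"])
emails = ["johnhahifas#example.com", "benTTTben#something.com"]
cleaned = [strip_names(e, pattern) for e in emails]
print(cleaned)
```

The pattern is compiled once and reused for every address, so the cost of building the alternation is not paid per email.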

Related

Pandas: Grouping rows by list in CSV file?

In an effort to make our budgeting life a bit easier and to help myself learn, I am creating a small program in Python that takes data from our exported bank CSV.
Here is an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the description column, but I want to see it all tabulated as one "Fast Food" expense.
For instance, the CSV is set up like this:
Date     Description            Debit  Credit
1/20/20  POS PIN BLAH BLAH ###  1.75   NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
Ultimately, I would like it to read the keywords from a list. I would like to group all my expenses into categories, each with its own list variable, so that the output only includes matches from that list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also I have looked through quite a few posts here on stack and have yet to find the answer (although I am sure I overlooked it)
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
import re
import pandas as pd
df = pd.DataFrame({"description":['Macdonald something', 'Whataburger something', 'pizza hut something',
                                  'Whataburger something','Macdonald something','Macdonald otherthing',],
                   "debit":[1.75,2.0,3.5,4.5,1.5,2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
# Put whichever keyword matches (case-insensitively) into a new column
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})',flags=re.I)
# Restrict the sum to the numeric column so the text column is not aggregated
print (df.groupby("found")[["debit"]].sum())
#              debit
# found
# Macdonald     5.25
# Whataburger   6.50
# pizza hut     3.50
Use dynamic pattern building:
import re
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags = re.I, regex = True)]
The \b word boundaries find whole words, not partial words.
The re.escape will protect special characters and they will be parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex
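To extend the same idea from one list to several categories, a rough sketch could look like the following. The category labels, the extra keywords and the function names (compile_categories, categorize) are assumptions for illustration, not part of either answer. It uses only the standard library, so the matching logic can be checked without a DataFrame.

```python
import re

# Hypothetical category -> keyword lists; edit to suit your own statement.
CATEGORIES = {
    "Fast Food": ["Macdonald", "Whataburger", "pizza hut", "food court"],
    "Groceries": ["kroger", "whole foods"],
}


def compile_categories(categories):
    """Build one whole-word, case-insensitive pattern per category."""
    return {
        label: re.compile(
            r"\b(?:{})\b".format("|".join(map(re.escape, keywords))),
            re.IGNORECASE,
        )
        for label, keywords in categories.items()
    }


def categorize(description, compiled, default="Other"):
    """Return the first category whose pattern matches the description."""
    for label, pattern in compiled.items():
        if pattern.search(description):
            return label
    return default


COMPILED = compile_categories(CATEGORIES)
print(categorize("POS PIN WHATABURGER ### 123", COMPILED))
print(categorize("ATM WITHDRAWAL", COMPILED))
```

With pandas, the function could then be applied with something like df["Category"] = df["Description"].map(lambda d: categorize(d, COMPILED)), followed by df.groupby("Category")["Debit"].sum().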

Counting first and last names appearing more than once

I have the following names:
clear
input str25 names
"Trenton Mercer"
"Carissa Moyer"
"Timothy Delgado"
"Kaylynn Payne"
"Harry Patton"
"Charlie Dudley"
"Harry Schmitt"
"Wyatt Hammond"
"Kasen Delgado"
"Katherine Noble"
"Julius Jarvis"
"Harry Carney"
"Wyatt Holden"
"Megan Wilson"
"Priscilla Shaffer"
"Savanah Marshall"
"Harry Delgado"
"Harper Ballard"
"Harry Mcmahon"
"Alejandro Jarvis"
end
How can I identify which first and last names (separately) come up more than once?
I would also like to count how many times these appear.
Pearly's solution (with split as definitely the best choice for this problem) appears reasonable, but it still has some unnecessary steps. For example, generating the tag, b1 and b2 variables does not seem to be needed.
More importantly, the final output is not entirely consistent: the counts are listed in a seemingly random order, which also differs from the original order without a clear explanation.
So here is my attempt at a solution (which surely has defects of its own) that avoids those issues while still providing the output you are looking for.
split names
foreach v in `r(varlist)' {
egen TotalAppear_`v' = total(`v' != ""), by(`v')
egen LastAppear_`v' = max(_n), by(`v')
replace LastAppear_`v' = LastAppear_`v'==_n
list `v' TotalAppear_`v' if LastAppear_`v' == 1 & TotalAppear_`v' >1
}
Note that your description leads to two assumptions, made in my code as well as in Pearly's solution:
Every name has only two parts, i.e. a first name and a last name, with no middle name(s).
You only want to compare within each group (first names among first names, last names among last names), not across groups.
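As a cross-check of the logic outside Stata, the same count can be sketched in Python under the two assumptions above (exactly two name parts, and first and last names counted separately). The function name repeated_parts is hypothetical; the data is the list from the question.

```python
from collections import Counter

NAMES = [
    "Trenton Mercer", "Carissa Moyer", "Timothy Delgado", "Kaylynn Payne",
    "Harry Patton", "Charlie Dudley", "Harry Schmitt", "Wyatt Hammond",
    "Kasen Delgado", "Katherine Noble", "Julius Jarvis", "Harry Carney",
    "Wyatt Holden", "Megan Wilson", "Priscilla Shaffer", "Savanah Marshall",
    "Harry Delgado", "Harper Ballard", "Harry Mcmahon", "Alejandro Jarvis",
]


def repeated_parts(names):
    """Count first and last names separately; keep those seen more than once."""
    firsts = Counter(name.split()[0] for name in names)
    lasts = Counter(name.split()[-1] for name in names)

    def repeated(counter):
        return {part: n for part, n in counter.items() if n > 1}

    return repeated(firsts), repeated(lasts)


first_repeats, last_repeats = repeated_parts(NAMES)
print(first_repeats)
print(last_repeats)
```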

Renaming files with no fixed char length in Python

I am currently learning Python 2.7 and am really impressed by how much it can do.
Right now, I'm working my way through basics such as functions and loops. I'd reckon a more 'real-world' problem would spur me on even further.
I use a satellite recording device to capture TV shows etc to hard drive.
The naming convention is set by the device itself. This makes it harder to find the shows you want to watch afterwards, as the show name is preceded by lots of redundant info...
The recordings (in .mts format) are dumped into a folder called "HBPVR" at the root of the drive. I'd be running the script on my Mac when the drive is connected to it.
Example.
"Channel_4_+1-15062015-2100-Exams__Cheating_the_....mts"
or
"BBC_Two_HD-19052015-2320-Newsnight.mts"
I included the double-quotes.
I'd like a Python script that (ideally) would remove the broadcaster name, reformat the date info, strip the time info and then put the show's name to the front of the file name.
E.g "BBC_Two_HD-19052015-2320-Newsnight.mts" ->> "Newsnight 19 May 2015.mts"
What may complicate matters is that the broadcaster names are not all of equal length.
The main pattern is that broadcaster name runs up until the first hyphen.
I'd like to be able to re-run this script at later points for newer recordings and not have already renamed recordings renamed further.
Thanks.
Try this:
import calendar
filename = "BBC_Two_HD-19052015-2320-Newsnight.mts"
# Remove the broadcaster name (everything up to the first hyphen)
rest = filename.split("-", 1)[1]
# Get the show name: everything after the date and time fields, minus the extension
show = ' '.join(rest.split("-")[2:]).rsplit(".mts", 1)[0]
# Get the date string (DDMMYYYY)
datestr = rest.split("-")[0]
day = int(datestr[0:2]) # The day is the first two digits
month = calendar.month_name[int(datestr[2:4])] # The month is the third and fourth digits
year = datestr[4:8] # The year is the last four digits
# And the new string:
new = show + " " + str(day) + " " + month + " " + year + ".mts"
print(new) # "Newsnight 19 May 2015.mts"
I wasn't quite sure what the '2320' was, so I chose to ignore it.
Thanks Coder256.
That has given me a bit more insight into how Python can actually help solve real world (first world!) problems like mine.
I tried it out with some different combos of broadcaster and show names and it worked.
I would like, though, to use the script to rename a batch of recordings/files inside the folder from time to time.
The script did throw an error when processing an already renamed recording, which is to be expected, I guess. Should the renamed file have a special character at the start of its name to help avoid this happening?
e.g "_Newsnight 19 May 2015.mts"
Or is there a more aesthetically pleasing way of doing this, with special chars being added on etc.
Thanks.
One way to approach this, since you have a defined pattern, is to use regular expressions:
>>> import datetime
>>> import re
>>> s = "BBC_Two_HD-19052015-2320-Newsnight.mts"
>>> ts, name = re.findall(r'.*?-(\d{8}-\d{4})-(.*?)\.mts', s)[0]
>>> '{} {}.mts'.format(name, datetime.datetime.strptime(ts, '%d%m%Y-%H%M').strftime('%d %b %Y'))
'Newsnight 19 May 2015.mts'
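Building on the regex above, a batch version that can be re-run safely might look like this sketch. It relies on the fact that an already renamed file no longer matches the device's pattern, so no marker character is needed. The function names (new_name, rename_all) are hypothetical, and the pattern assumes the broadcaster name contains no hyphen, as described in the question.

```python
import os
import re
from datetime import datetime

# broadcaster - DDMMYYYY - HHMM - show name .mts
RECORDING = re.compile(r"^[^-]+-(\d{8})-\d{4}-(.+)\.mts$")


def new_name(filename):
    """Return the tidy name, or None if the file is not in the device's format."""
    match = RECORDING.match(filename)
    if match is None:
        return None  # already renamed, or not a recording: leave it alone
    date_part, show = match.groups()
    date = datetime.strptime(date_part, "%d%m%Y")
    return "{} {}.mts".format(show, date.strftime("%d %b %Y"))


def rename_all(folder):
    """Rename every matching recording in the folder; skip everything else."""
    for filename in os.listdir(folder):
        target = new_name(filename)
        if target is not None:
            os.rename(os.path.join(folder, filename),
                      os.path.join(folder, target))


print(new_name("BBC_Two_HD-19052015-2320-Newsnight.mts"))
print(new_name("Newsnight 19 May 2015.mts"))
```

Calling rename_all with the path to the HBPVR folder would then process only files that are still in the original format.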

How to find a substring anywhere in a string

This should be easy, but I'm finding it difficult.
I just want to find whether a substring exists anywhere in a string. In my case, whether the name of a website exists in the title of a product.
My code is like this:
#FindNoCase("Amazon.com", "Google Chromecast available at Amazon")#
The above returns a 0 which is correct because the entire substring "Amazon.com" doesn't exist in the main string. But some of it does, namely the "Amazon" part.
How could I achieve what I'm trying to do, which is just to see if ANY part of the substring (at least 3 characters long) exists in the main string?
So I need something like FindOneOf() but actually "find at least three of". It should then look at the word "Amazon" in the product title and check if at least 3 characters in the sequence of "Amazon.com" exists. When it sees that "Ama" exists, then it just needs to return a true value. Can it be done using the existing built-in functions somehow?
Update: Very simple solution. I used Left("amazon", 3).
There's a lot of danger in false positives, like if someone was buying the Alabama state flag.
Because of store names that contain spaces, this is a little tricky (Wal Mart is often written with a space).
If your string always contains at [store], you can extract the store name by finding the last at in the sentence and creating a string by chopping off everything else.
Because it looks for occurrences of at only as a whole word, there's no danger with store names such as Beats Audio or Sam's Meat Shop. I can't think of any stores with the word at in the name. While such a name would technically trip it up, the risk is much lower, and you can do a pre-replace on such store names.
<cfset mystring = "Google Chromecast available at Amazon">
<cfset SellerName = REReplaceNoCase(mystring,".*\b(?:at)\b(?!.*\b(?:at)\b)\s*","")>
<cfoutput>Seller: #Sellername#</cfoutput>
You can then do your comparisons much more safely.
Per your comment: if you know all possible patterns, you can still obtain the data if you want to (false positives can be either embarrassing or catastrophic, depending on the action). If you know the stores you're working with, you can use a regex to pull out the string like this:
<cfset mystring = "Google Chromecast available at Amazon.co.uk">
<cfset SellerName = REReplaceNoCase(mystring,".*\b((Google|Amazon|Wal[\W]*Mart|E[\W]*bay)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #Sellername#</cfoutput>
The only part you need to update is the pipe-delimited list. You might add K-Mart as K[\W]*Mart; the [\W]* permits any run of special characters or spaces, so it covers kMart, K-Mart and k*Mart, but not Kwik-E-Mart.
Update #2, per more comments
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset SellerNameRE = REReplace(rsProduct.sellername,"[\W]+","[\W]*","ALL")>
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#sellernameRE#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
This replaces any run of symbols with the pattern [\W]*, so the symbols become optional: if something says Wal*Mart or WalMart, it will still match Wal-Mart.
You could also load a separate column with "Regex Names" so that you're not doing this each time.
So your table would look something like
SellerID  SellerName  RegexName
1         Wal-Mart    Wal[\W]*Mart
2         Toys-R-US   Toys[\W]*R[\W]*US
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#rsProduct.RegexName#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#rsProduct.RegexName#)</cfoutput>
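The "regex name" idea is not specific to ColdFusion. As a rough illustration, here is the same transformation sketched in Python; the function names (seller_regex, find_seller) and the seller list are assumptions made for the example.

```python
import re


def seller_regex(name):
    """Turn 'Wal-Mart' into a pattern where symbols and spaces are optional."""
    parts = [re.escape(p) for p in re.split(r"\W+", name) if p]
    return r"\W*".join(parts)


def find_seller(title, seller_names):
    """Return the first known seller found in the title as a whole word."""
    for name in seller_names:
        # Allow an optional domain suffix such as .com or .co.uk.
        pattern = r"\b" + seller_regex(name) + r"(?:\.[a-z]+)*\b"
        if re.search(pattern, title, flags=re.IGNORECASE):
            return name
    return None


SELLERS = ["Toys-R-US", "Wal-Mart", "Amazon"]
print(seller_regex("Wal-Mart"))
print(find_seller("Google Chromecast available at Wal*Mart", SELLERS))
print(find_seller("Google Chromecast available at Amazon.co.uk", SELLERS))
```

As with the RegexName column, the patterns could be computed once and stored rather than rebuilt for every title.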
Solved it by doing this
#FindNoCase(left("Amazon.com", 3), "Google Chromecast available at Amazon")#
Yes, there is potential that it won't do what I need in cases where the seller name is less than 3 characters long. But I think it's rare enough to be OK.
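For reference, the more general check described in the question (any run of at least three consecutive characters of the site name, not only the first three) can be sketched in Python as below. The function name shares_substring is hypothetical. The second call shows the false-positive risk raised in the other answer.

```python
def shares_substring(needle, haystack, min_len=3):
    """True if any min_len consecutive characters of needle occur in haystack."""
    needle = needle.lower()
    haystack = haystack.lower()
    # Any longer shared run contains a shared run of exactly min_len characters,
    # so checking windows of that one length is enough.
    for start in range(len(needle) - min_len + 1):
        if needle[start:start + min_len] in haystack:
            return True
    return False


print(shares_substring("Amazon.com", "Google Chromecast available at Amazon"))
print(shares_substring("Amazon.com", "Alabama state flag"))
print(shares_substring("Amazon.com", "Google Chromecast available at eBay"))
```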

Stata: Efficient way to replace numerical values with string values

I have code that currently looks like this:
replace fname = "JACK" if id==103
replace lname = "MARTIN" if id==103
replace fname = "MICHAEL" if id==104
replace lname = "JOHNSON" if id==104
And it goes on for multiple pages like this, replacing an ID name with a first and last name string. I was wondering if there is a more efficient way to do this en masse, perhaps by using the recode command?
I will echo the other answers that suggest a merge is the best way to do this.
But if you absolutely must code the lines item-wise (again, messy) you can generate a long list ("pages") of replace commands by using MS Excel to "help" you write the code. Here is a picture of your Excel sheet with one example, showing the MS Excel formula:
columns:  A       B      C    D
row 1:    last    first  id   code
row 2:    MARTIN  JACK   103  ="replace fname=^"&B2&"^ if id=="&C2
You type that in, make sure it looks like Stata code when the formula calculates (aside from the carets), and copy the formula in column D down to the end of your list. Then copy the whole block of Stata code in column D generated by the formulas into your do-file, and do a find and replace (be careful here if you are using the caret elsewhere for mathematical uses!!) for all ^ to be replaced with ", which will end up generating proper Stata syntax.
(This is truly a brute force way of doing this, and is less dynamic in the case that there are subsequent changes to your generation list. All--apologies in advance for answering a question here advocating use of Excel :) )
You don't explain where the strings you want to add come from, but what is generally the best technique is explained at
http://www.stata.com/support/faqs/data-management/group-characteristics-for-subsets/index.html
Create an associative array of ids vs Fname,Lname
103 => JACK,MARTIN
104 => MICHAEL,JOHNSON
...
Replace
id => hash{id} ( fname & lname )
The efficiency of doing this will be taken care of by the programming language used.
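The associative-array idea above, written out as a small Python sketch. The record layout and the function name fill_names are assumptions for illustration; in Stata itself the equivalent is the merge approach recommended in the other answers.

```python
# id -> (first name, last name), taken from the question's examples
NAMES_BY_ID = {
    103: ("JACK", "MARTIN"),
    104: ("MICHAEL", "JOHNSON"),
}


def fill_names(records, names_by_id):
    """Set fname/lname on each record whose id is in the lookup table."""
    for record in records:
        names = names_by_id.get(record["id"])
        if names is not None:
            record["fname"], record["lname"] = names
    return records


records = [
    {"id": 103, "fname": "", "lname": ""},
    {"id": 104, "fname": "", "lname": ""},
    {"id": 105, "fname": "", "lname": ""},  # no entry: left unchanged
]
filled = fill_names(records, NAMES_BY_ID)
print(filled)
```

Each lookup is a single dictionary access, so the cost grows with the number of records rather than with the number of replace lines.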