How to get count of item repetition in a list? - python-2.7

Hi, I have a list in the following format:
tweets = ['RT Find out how AZ is targeting escape pathways to further personalise breastcancer treatment SABCS14',
          'Did you know Ontario has a special screening program for women considered high risk for BreastCancer',
          'Article Foods That Prevent BreastCancer',
          'PRETTY Infinity Faith Hope Breast Cancer RIBBON SIGN Leather Braided Bracelet breastcancer BreastCancerAwareness']
I have just given a sample of the list, but it has a total of 8183 elements. So if I take the 1st item in the list, I have to compare it with all the other elements in the list, and if the 1st item appears anywhere else I need to count how many times it is repeated. I tried many possible ways but couldn't achieve the desired result. Please help; thanks in advance.
My code:
for x, left in enumerate(tweets):
    print x, left
    for y, right in enumerate(tweets):
        print y, right
        # intersects the sets of characters in two tweets,
        # which is not a count of repeated tweets
        common = len(set(left) & set(right))

As already pointed out in the comments, you can use collections.Counter to do this. The code translates into something like the below:
from collections import Counter
tweets = ['RT Find out how AZ is targeting escape pathways to further personalise breastcancer treatment SABCS14',
'Did you know Ontario has a special screening program for women considered high risk for BreastCancer',
'Article Foods That Prevent BreastCancer',
'PRETTY Infinity Faith Hope Breast Cancer RIBBON SIGN Leather Braided Bracelet breastcancer BreastCancerAwareness']
count = Counter(tweets)
for key in count:
    print key, count[key]
Note that a Counter is essentially a dict, so the order of the elements is not guaranteed.
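If you also want the tweets ordered by how often they occur, Counter.most_common() returns (element, count) pairs sorted from most to least frequent. A minimal sketch, reusing the tweets list from above:
from collections import Counter

count = Counter(tweets)
# most_common() yields (tweet, count) pairs, most frequent first;
# pass a number, e.g. most_common(10), to keep only the top entries
for tweet, n in count.most_common():
    print n, tweet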


Add custom column based on string in another column

Source data:
Market  Platform  Web sales $  Mobile sales $  Insured
FR      iPhone           1323            8709  Y
IT      iPad            12434            7657  N
FR      android           234         2352355  N
IT      android         12323           23434  Y
Is there a way to evaluate the sales of devices that are insured?
if List.Contains({"iPhone","iPad","iPod"},[Platform]) and ([Insured]="Y") then [Mobile sales] else "error"
Something to that effect; I'm just not sure how to approach it.
A direct answer to your question is
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    SumUpSales = Table.AddColumn(Source, "Sales of insured devices",
        each if List.Contains({"iPhone","iPad","iPod"}, _[Platform]) and Text.Upper(_[Insured]) = "Y"
        then _[#"Mobile sales $"] else null, type number)
in
    SumUpSales
However, I would like to stress a few things.
First, it's better to convert the values in the [Insured] column to booleans first. That way you can catch errors before they corrupt your data without you noticing. My example doesn't do that; all it does is normalize letter case in [Insured], since Power Query M is a case-sensitive language. (A sketch of that conversion follows these notes.)
Second, you'd better use null rather than the text value "error". Then you can set the column type and do some math with its values, such as summing them up. With mixed text and number values you would get an error in this and many other cases.
And last, it is probably better to use a pivot table for visualizing data like this. You just need to add a column which groups all Apple (and/or other) devices together based on the same logic, but excluding [Insured]. Pivot tables are more flexible, and I personally like them very much.
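A hedged sketch of the first point above; the step name InsuredAsBool is illustrative, and this simple version maps "Y"/"y" to true and everything else to false rather than raising errors on unexpected values:
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // turn the Y/N text into a true/false column up front
    InsuredAsBool = Table.TransformColumns(Source,
        {{"Insured", each Text.Upper(_) = "Y", type logical}}),
    SumUpSales = Table.AddColumn(InsuredAsBool, "Sales of insured devices",
        each if List.Contains({"iPhone","iPad","iPod"}, [Platform]) and [Insured]
        then [#"Mobile sales $"] else null, type number)
in
    SumUpSales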

Pandas: Grouping rows by list in CSV file?

In an effort to make our budgeting life a bit easier and to help myself learn, I am creating a small program in Python that takes data from our exported bank CSV.
I will give you an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the description column, but I want to see them all tabulated as one "Fast Food" expense.
For instance, the CSV is set up like this:
Date     Description            Debit  Credit
1/20/20  POS PIN BLAH BLAH ###   1.75     NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
I would ultimately like to have it read off of a list. I would like to group all my expenses into categories and check those category variable names so that it only outputs matches from each list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also, I have looked through quite a few posts here on Stack Overflow and have yet to find the answer (although I am sure I overlooked it).
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
import re
import pandas as pd

df = pd.DataFrame({"description": ['Macdonald something', 'Whataburger something', 'pizza hut something',
                                   'Whataburger something', 'Macdonald something', 'Macdonald otherthing'],
                   "debit": [1.75, 2.0, 3.5, 4.5, 1.5, 2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})', flags=re.I)
print(df.groupby("found").sum())
#               debit
# found
# Macdonald      5.25
# Whataburger    6.50
# pizza hut      3.50
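Rows whose description matches none of the names get NaN in the "found" column, and groupby drops NaN keys by default, so only the recognized fast-food rows end up in the totals.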
Use dynamic pattern building:
import re

fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags=re.I, regex=True)]
The \b word boundaries match whole words, not partial words.
re.escape protects special characters so they are parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex
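To extend this from one list to several categories, one option is to loop over a dict of category lists and tag each row. A sketch under assumed column names ("Description", "Debit") and made-up category lists:
import re
import pandas as pd

df = pd.DataFrame({
    "Description": ["POS PIN WHATABURGER 123", "PIZZA HUT 4567", "KROGER 89"],
    "Debit": [1.75, 12.50, 54.20],
})

# hypothetical category lists; extend with your own names
categories = {
    "Fast Food": ["Macdonald", "Whataburger", "pizza hut"],
    "Groceries": ["kroger", "heb"],
}

df["Category"] = "Other"
for label, names in categories.items():
    # one whole-word, case-insensitive pattern per category
    pattern = r"\b(?:{})\b".format("|".join(map(re.escape, names)))
    hits = df["Description"].str.contains(pattern, flags=re.I, regex=True)
    df.loc[hits, "Category"] = label

print(df.groupby("Category")["Debit"].sum())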

Stata xtline overlayed plot for multiple groups

I am attempting to produce an overlayed -xtline- plot that distinguishes between males and females (or any number of multiple groups) by displaying different plot styles for each group. I chose to recast the xtline plot as "connected" and show males using circle markers and females as triangle markers. Taking cues from this question on Statalist, I produced code similar to what is below. When I try this solution Stata produces the "too many options" error, which is perhaps predictable given the large number of unique persons. I am aware of this solution which employs combined graphs but that is also not practical given the large number of unique individuals in my data.
Does a simpler solution to this problem exist? Does Stata have the capacity to overlay multiple -xtline- plots the way it can -twoway- plots?
The code below, using publicly available data from UCLA's excellent Stata guide, shows my basic approach and reproduces the error:
use http://www.ats.ucla.edu/stat/stata/examples/alda/data/alcohol1_pp, clear
xtset id age
gsort -male id
qui levelsof id if !male, loc(fidlevs)
qui levelsof id if male, loc(midlevs)
qui levelsof id, loc(alllevs)
tokenize `alllevs'
loc len_f : word count `fidlevs'
loc len_m : word count `midlevs'
loc len_all : word count `alllevs'
loc start_f = `len_all' - `len_f'
forval i = 1/`len_all' {
    if `i' <= `start_f' {
        loc m_plot_opt "`m_plot_opt' plot`i'opts(recast(connected) mcolor(black) msize(medsmall) msymbol(circle) lcolor(black) lwidth(medthin) lpattern(solid))"
    }
    else {
        loc f_plot_opt "`f_plot_opt' plot`i'opts(recast(connected) mcolor(black) msize(medsmall) msymbol(triangle) lcolor(black) lwidth(medthin) lpattern(solid))"
    }
}
di "xtline alcuse, legend(off) scheme(s1mono) overlay `m_plot_opt' `f_plot_opt'"
xtline alcuse, legend(off) scheme(s1mono) overlay `m_plot_opt' `f_plot_opt'
It is difficult (for me) to separate the programming issue here from statistical or graphical views on what kind of graph works well, or at all. Even with this modest dataset there are 82 distinct identifiers, so any attempt to show them distinctly fails to be useful, if only because the resulting legend takes up most of the real estate.
There is considerable ingenuity in the question code in working through all the identifiers, but a broad-brush approach seems to work as well. Try this:
use http://www.ats.ucla.edu/stat/stata/examples/alda/data/alcohol1_pp, clear
xtset id age
separate alcuse, by(male) veryshortlabel
label var alcuse1 "male"
label var alcuse0 "female"
line alcuse? age, legend(off) sort connect(L)
Key points:
There is nothing very special about xtline. It's just a convenience wrapper. When frustrated by its wired-in choices, people often just reach for line.
To get distinct colours, distinct variables suffice, which is where separate has a role. See also this Tip.
Although the example dataset is well behaved, the extra options sort connect(L) will help in some cases to remove spurious connections between individuals or panels. (In extreme cases, reach for linkplot (SSC).)
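For concreteness: separate creates alcuse0 and alcuse1, each equal to alcuse within one group and missing in the other, which is why line alcuse? age draws the two groups as visually distinct profiles.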
This could be fine too:
line alcuse age if male || line alcuse age if !male, legend(order(1 "male" 2 "female")) sort connect(L)

Generating rolling z-scores of panel data in Stata

I have an unbalanced panel data set (countries and years). For simplicity, let's say I have one variable, x, that I am measuring. The panel data are sorted first by country (a 3-digit numeric country code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The values should be standardized by subtracting the mean from the preceding (exclusive) m time periods, and then dividing by the standard deviation from those same time periods. If this is not possible, a missing value should be returned.
Currently, the code I am using to accomplish this is the following (edited now for clarity):
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 weocountrycode year using stats_x
// The next two lines delete values that -rolling- calculates where I actually want missing values.
replace sd_x = . if x[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
replace mean_x = . if x[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_x = (x - mean_x[_n-1])/sd_x[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1, 2, 3, 4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general: if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, a length 2 rolling average on entry 6, ... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem, and the output is "weird" in the sense that:
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc., which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely, but I'll try to answer based on what I think your problem is, and based on a comment by @NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
... if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of observations. If there are missing data (for example, because of weekends), the actual number of observations used by command may be less than window(#).
It's not actually doing a "length 1 rolling average", but I get to that later.
Below are some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set window(4) and stepsize(4), so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.
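For the original goal (mean and sd over the preceding, exclusive m periods within each country), another route worth knowing is rangestat from SSC. A sketch, assuming the variable names from the question; this is an alternative to -rolling-, not a fix of the code above:
ssc install rangestat
local m = 5
* stats over the window [year-`m', year-1], i.e. preceding and exclusive
rangestat (mean) mean_x=x (sd) sd_x=x (count) n_x=x, ///
    interval(year -`m' -1) by(weocountrycode)
* z-score only where a full window of `m' observations was available
gen z_x = (x - mean_x) / sd_x if n_x == `m'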

Extracting dollar amounts from existing sql data?

I have a field that contains a mix of descriptions and dollar amounts. With TSQL, I would like to extract those dollar amounts, then insert them into a new field for the record.
-- UPDATE --
Some data samples could be:
Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.
In the set above I was thinking of just grabbing the first occurrence of the dollar figure... that is the simplest.
I'm not trying to remove the amounts from the original text, just get their value, and add them to a new field.
The amounts may or may not contain decimals and commas.
I'm sure PATINDEX alone won't cut it, and I don't need an extremely complex RegEx function to accomplish this.
However, the OLE Regex Find (Execute) function described here appears to be the most robust; when trying to use it, I get the following error message in SSMS:
SQL Server blocked access to procedure 'sys.sp_OACreate' of component
'Ole Automation Procedures' because this component is turned off as
part of the security configuration for this server. A system
administrator can enable the use of 'Ole Automation Procedures' by
using sp_configure. For more information about enabling 'Ole
Automation Procedures', see "Surface Area Configuration" in SQL Server
Books Online.
I don't want to go changing my server settings just for this function. I have another regex function that works just fine without changes.
I can't imagine this being that complicated to just extract dollar amounts. Any simpler ways?
Thanks.
CREATE FUNCTION dbo.fnGetAmounts(@str nvarchar(max))
RETURNS TABLE
AS
RETURN
(
    -- generate all possible starting positions (1 to len(@str))
    WITH StartingPositions AS
    (
        SELECT 1 AS Position
        UNION ALL
        SELECT Position+1
        FROM StartingPositions
        WHERE Position <= LEN(@str)
    )
    -- generate possible lengths
    , Lengths AS
    (
        SELECT 1 AS [Length]
        UNION ALL
        SELECT [Length]+1
        FROM Lengths
        WHERE [Length] <= 15
    )
    -- a Cartesian product between StartingPositions and Lengths;
    -- if the substring is numeric then get it
    , PossibleCombinations AS
    (
        SELECT CASE
                   WHEN ISNUMERIC(SUBSTRING(@str, sp.Position, l.Length)) = 1
                   THEN SUBSTRING(@str, sp.Position, l.Length)
                   ELSE NULL
               END AS Number
             , sp.Position
             , l.Length
        FROM StartingPositions sp, Lengths l
        WHERE sp.Position <= LEN(@str)
    )
    -- get only the numbers that start with a dollar sign,
    -- group by starting position and take the maximum value
    -- (ie, from $, $2, $20, $200 etc)
    SELECT MAX(CONVERT(money, Number)) AS Amount
    FROM PossibleCombinations
    WHERE Number LIKE '$%'
    GROUP BY Position
)
GO

DECLARE @str nvarchar(max) = 'Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.'

SELECT *
FROM dbo.fnGetAmounts(@str)
OPTION (MAXRECURSION 32767) -- the MAXRECURSION option is required in any SELECT that uses this function
This link should help.
http://blogs.lessthandot.com/index.php/DataMgmt/DataDesign/extracting-numbers-with-sql-server
This assumes you are OK with extracting the numerics regardless of whether or not there is a $ sign. If that is a strict requirement, some modifications will be needed.
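If grabbing only the first dollar figure per record is acceptable (as the question suggests), a simpler hedged sketch using just PATINDEX and no OLE automation follows; the variable names are illustrative, and edge cases such as punctuation glued to the number may need extra handling:
DECLARE @s nvarchar(max) = 'Used knife set for sale $200.00 or best offer.';
DECLARE @start int = PATINDEX('%$[0-9]%', @s);  -- first $ followed by a digit
IF @start > 0
BEGIN
    DECLARE @rest nvarchar(max) = SUBSTRING(@s, @start + 1, LEN(@s));
    -- length of the run of digits, commas and dots after the $
    DECLARE @len int = PATINDEX('%[^0-9.,]%', @rest + ' ') - 1;
    -- the money type accepts the $ sign and commas directly
    SELECT CONVERT(money, '$' + SUBSTRING(@rest, 1, @len)) AS FirstAmount;
END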