Pandas extracting substring - regex

I have a column containing dates, which could look like 2017-10-12. I want to create a new column containing the day, which in my case would be the number between the two -'s. I've tried various .str.extract() queries, but I can't seem to get it right.
df['days'] = df['dates'].str.extract('(-*)')
Any hints?

Use split and select the second element of each resulting list with str[1]:
df['days'] = df['dates'].str.split('-').str[1]
Or use to_datetime with the format parameter and dt.day (the format below assumes the middle number is the day, as described in the question):
df['days'] = pd.to_datetime(df['dates'], format='%Y-%d-%m').dt.day
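Since the question asked about .str.extract() specifically, here is a minimal sketch of a pattern that should do the same job. The original '(-*)' matches zero or more hyphens, so it succeeds immediately at the start of the string with an empty match. The sample frame and the days_int column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data in the question's layout.
df = pd.DataFrame({"dates": ["2017-10-12", "2018-01-05"]})

# Capture the digits between the two hyphens; expand=False returns a Series.
df["days"] = df["dates"].str.extract(r"-(\d+)-", expand=False)

# extract (like split) returns text; convert if a number is needed.
df["days_int"] = df["days"].astype(int)
```

Note that split and extract both give strings, while the to_datetime route gives integers.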


Conditionally Filtering Out Rows Based on 2 Parameters with Power Query

I have a table similar to the one attached below:
What I would like to do, using Power Query, is conditionally filter out (remove) rows where CenterNum = 1101 and DepCode = 257. I figured Table.SelectRows() would work, but it doesn't: the query just returns "this table is empty". The #"Expanded AccountLookup" in my formula below references the applied step before the one I am trying to create. I'm hoping to get some input on how to remove rows based on these two parameters.
= Table.SelectRows(#"Expanded AccountLookup", each [CostCenterNumber] = "1111001" and [NoteTypeCode] = "257")
Thank you!
You didn't post a screenshot, so it is hard to tell whether the column type is text or numeric, but try removing the quotes around the numbers:
= Table.SelectRows(#"Expanded AccountLookup", each [CostCenterNumber] = 1111001 and [NoteTypeCode] = 257)
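If that doesn't work, check the actual column names against what you are using, especially for case (upper/lower) and leading/trailing spaces. The best way to do that is to temporarily rename the column and look at the "from" name in the generated code. Also note that Table.SelectRows keeps the rows for which the condition is true, so to remove those rows, wrap the condition in not ( ... ).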
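Two things are at play in a filter like this: a text value never equals a number, and a select-style filter keeps the rows that match rather than removing them. The same logic can be sketched in Python, purely as an illustration. The rows are made up and is_target is a hypothetical helper; only the column names come from the question.

```python
# Made-up rows; column names are taken from the question.
rows = [
    {"CostCenterNumber": 1111001, "NoteTypeCode": 257},
    {"CostCenterNumber": 1111001, "NoteTypeCode": 300},
    {"CostCenterNumber": 2222002, "NoteTypeCode": 257},
]

def is_target(row):
    # Numeric comparison against numeric values.
    return row["CostCenterNumber"] == 1111001 and row["NoteTypeCode"] == 257

# Comparing numbers with text matches nothing, hence an empty result.
text_matches = [r for r in rows if r["CostCenterNumber"] == "1111001"]

# A select-style filter keeps matches; negate the test to remove them instead.
kept = [r for r in rows if not is_target(r)]
```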

How to create a new column from the distinct values of two columns with Power Query Editor in Power BI

I need to insert in a new column the distinct values of two columns using the Power Query Editor in Power BI.
Any ideas, guys?
Here you go:
let
Source = SomeTable,
firstCol = Source[FirstColumn],
secondCol = Source[SecondColumn],
thirdCol = List.Sort(List.RemoveNulls(List.Distinct(List.Combine( {firstCol, secondCol})))),
#"TableResult" = Table.FromColumns({firstCol, secondCol, thirdCol}, {"First","Second","Combined"})
in
#"TableResult"
This basically converts your first and second columns into lists, and combines them into one new list. Next, it transforms the new list a bit to match your requirements -- first by getting just the distinct values, then dropping any null values, and lastly sorting it in ascending order.
Once that's done, we can take advantage of Table.FromColumns and create a table from our three lists.
That should get you where you're going.
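For readers more at home outside M, the combine / distinct / drop-nulls / sort pipeline described above can be sketched in Python. This is only an illustration of the logic, with made-up values standing in for the two columns.

```python
# Hypothetical values standing in for FirstColumn and SecondColumn.
first_col = ["b", "a", None, "c"]
second_col = ["c", "d", "a"]

# Combine both lists, drop nulls, de-duplicate via a set, then sort ascending.
combined = sorted({v for v in first_col + second_col if v is not None})
```

The order of the intermediate steps differs slightly from the M version (nulls are dropped before de-duplicating), but the resulting list is the same.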
Thanks Ryan.
My requirement was to create a distinct list of IP addresses from one of the columns in a Table.
Here's the code that worked for me.
= let
Source = #"TABLENAME",
firstCol = Source[Client Ip],
IP = List.Distinct(firstCol),
#"DistinctIP" = Table.FromColumns({IP}, {"IP"})
in
#"DistinctIP"

Sum values in one column of all rows which match one or more of multiple criteria

I have some table data in which I'd like to sum all the values in a specific column of all rows where column A contains string A and/or column B contains string B. How can I achieve this?
This works for one criterion:
=SUM(FILTER(G:G,REGEXMATCH(F:F,"stringA")))
I tried this, but it didn't work:
=SUM(FILTER(G:G,OR(ISTEXT(REGEXMATCH(F:F,"stringA")),ISTEXT(REGEXMATCH(C:C,"stringB")))))
Please try:
=SUM(FILTER(G:G,REGEXMATCH(F:F,"stringA")+REGEXMATCH(C:C,"stringB")))
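+ works as OR logic here: TRUE and FALSE are treated as 1 and 0, so the sum is nonzero whenever at least one test matches. ISTEXT is not needed because REGEXMATCH already returns TRUE or FALSE.
OR does not work because it collapses its whole range argument into a single value instead of giving one result per row; use + in array formulas instead.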
=SUM(FILTER(G:G,REGEXMATCH(F:F&C:C,"stringA|stringB")))
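Here OR is denoted by | inside the regular expression. F:F&C:C concatenates the two columns so that a single REGEXMATCH can test both; note that this matches either string in either column.
EDIT: Added &C:C to cover the different columns.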
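The two formulas are not quite equivalent, which a small Python sketch can make concrete. The rows below are made up; re.search is used because, like REGEXMATCH, it matches anywhere in the text.

```python
import re

# Made-up rows: (column C, column F, column G).
rows = [
    ("has stringB", "nothing", 1),
    ("nothing", "has stringA", 2),
    ("has stringB", "has stringA", 4),
    ("nothing", "nothing", 8),
    ("has stringA", "nothing", 16),  # stringA sits in the "wrong" column
]

# Per-column OR (the + version): stringA must be in F, or stringB in C.
total_per_column = sum(
    g for c, f, g in rows
    if re.search("stringA", f) or re.search("stringB", c)
)

# Concatenated version: either string anywhere in F&C counts.
total_concat = sum(
    g for c, f, g in rows
    if re.search("stringA|stringB", f + c)
)
```

The last row is only counted by the concatenated version, so the choice depends on whether each string is tied to a specific column.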

Deleting pandas dataframe rows if value in given column not contained in a list

I have a pandas dataframe called df that contains several columns, including a df['MY STATE'] column. My goal is to remove all the rows from the dataframe which do not contain US states. I want to do this by comparing the value in the cell to a pandas series I have containing all the state abbreviations. I have seen people use something like the following to clean a dataframe:
df = df[df['COST'] <= 0]
But something like what I need (below) doesn't work
df = df[df['MY STATE'] not in states['Abbreviation'].values]
Is there a way to do this simply?
I have read that df.query() can be used to do something like this, but I have not yet found an example, and have also read that df.query() cannot be used when there is a space in the name of the column.
Thank you,
Michael
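IIUC you can use isin, which returns a boolean mask. The inverse operator ~ flips the mask, which is the vectorized equivalent of your not in attempt and keeps the rows whose state is not in the list:
df = df[~df['MY STATE'].isin(states['Abbreviation'].values)]
To delete the rows that are not US states, as described in the question, drop the ~ so that only the matching rows are kept:
df = df[df['MY STATE'].isin(states['Abbreviation'].values)]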
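A small sketch of how the isin mask behaves in both directions, which makes it easy to check which rows survive. The data is made up; only the column names follow the question.

```python
import pandas as pd

# Hypothetical data; column names follow the question.
df = pd.DataFrame({"MY STATE": ["NY", "ZZ", "CA", "Ontario"],
                   "COST": [1, 2, 3, 4]})
states = pd.DataFrame({"Abbreviation": ["NY", "CA", "TX"]})

# True where the row's state appears in the abbreviation list.
mask = df["MY STATE"].isin(states["Abbreviation"])

us_only = df[mask]    # rows that are US states
non_us = df[~mask]    # rows that are not
```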

Searching for unmatched names when comparing spreadsheets

In one spreadsheet I have 3 columns with a first and last name of a person combined. In the 2nd spreadsheet, I have column a = first name and column b = last name.
I want to know which names in spreadsheet one cannot be found in spreadsheet two. I also need to verify the data to make sure that the formula was accurate on finding the correct lookup.
Do I have to combine my columns in spreadsheet 2 to make the first and last name in the same column to make this work?
Which formula would you use for either scenario?
Use this:
=ISNA(MATCH($A1&" "&$B1,Sheet2!$A:$A,FALSE))
Where (in order):
A1 is the first name column in Sheet1
B1 is the last name column in Sheet1
Sheet2 is the sheet being searched
$A:$A is the column in Sheet2 that has the two names together
FALSE is because it's an exact match
This will return TRUE if the element does not exist (MATCH gives #N/A), and FALSE if it does
You can also use:
=VLOOKUP($A1&" "&$B1,Sheet2!$A:$D,3,FALSE)
If you want to retrieve data for a match.
Finally, if you need to do your lookups the other way, take a look at this thread for some ideas on how to split the string into two pieces.
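The underlying check, joining first and last name and testing membership in the other list, can be sketched in Python as follows. The names are made up, and a single space is assumed as the separator, as in the formula above.

```python
# Made-up data: sheet 1 holds combined names, sheet 2 holds first/last separately.
sheet1_full = ["Ann Lee", "Bob Ray", "Cy Dunn"]
sheet2 = [("Ann", "Lee"), ("Cy", "Dunn"), ("Di", "Park")]

# Build the combined form of every sheet 2 row (assumes a single-space separator).
sheet2_full = {f"{first} {last}" for first, last in sheet2}

# Names from sheet 1 that have no exact match in sheet 2.
unmatched = [name for name in sheet1_full if name not in sheet2_full]
```

As with the exact-match lookup, differences in spacing or capitalization count as a mismatch, so it may be worth normalizing both sides first.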