I never got around to learning regex until now, and I'm trying to figure out how to use it in pandas with Series.str.match(expression) in order to split one column into two new columns. (I know I can do this without regex.)
examples of the column data are:
True Grit {'Rooster Cogburn'}
The King's Speech {'King George VI'}
Biutiful {'Uxbal'}
Each of the two groups can contain one or more words. How can I extract the two groups so that the result is True Grit, Rooster Cogburn?
Given this dataframe
col
0 True Grit {Rooster Cogburn}
1 The King's Speech {King George VI}
2 Biutiful {Uxbal}
df = df.col.str.extract(r'(.*?)\s*\{(.*)\}', expand=True)
will return
0 1
0 True Grit Rooster Cogburn
1 The King's Speech King George VI
2 Biutiful Uxbal
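To see what the two capture groups pick up outside of pandas, here is a minimal sketch using only Python's re module. The helper name split_title_role and the named groups are made up for illustration, and the optional quote handling is an assumption added to cover the {'Rooster Cogburn'} form from the question.

```python
import re

# Hypothetical pattern for illustration: a lazy title, optional whitespace,
# then the braces with optional single quotes around the role.
PATTERN = re.compile(r"(?P<title>.*?)\s*\{'?(?P<role>.*?)'?\}")


def split_title_role(text):
    """Return (title, role), or None when the text has no {...} part."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return None
    return m.group("title"), m.group("role")


for line in ["True Grit {'Rooster Cogburn'}",
             "The King's Speech {'King George VI'}",
             "Biutiful {Uxbal}"]:
    print(split_title_role(line))
```

The lazy (.*?) keeps the trailing space out of the title, which a greedy (.*) in front of \s* would not do.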
I'm trying to determine the first row that has a cell containing only digits, "," and "$" in a data frame:
Assessment Area Offices Offices Deposits as of 6/30/16 Deposits as of 6/30/16 Assessment Area Reviews Assessment Area Reviews Assessment Area Reviews
2 Assessment Area # % $ (000s) % Full Scope Limited Scope TOTAL
3 Ohio County 1 50.0% $24,451 52.7% 1 0 1
4 Hart County 1 50.0% $21,931 47.3% 1 0 1
5 OVERALL 2 100% $46,382 100.0% 2 0 2
This code does find the row:
grepl("[0-9]",table_1)
But the code returns:
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
I only want to know the row.
Your data could use some cleaning up, but it's not entirely necessary in order to solve your problem. You want to find the first row that contains a dollar sign and an appropriate value. My solution does the following:
Iterates over rows
In each row, asks if there's at least one cell that starts with a dollar sign followed by a specific combination of digits and commas (to be explained in greater detail below)
Stops when we reach that row
Prints the ID of the row
The solution involves a for loop, an if statement, and a regular expression.
First of all, here's my attempt to reproduce your data frame. Again, the details don't matter too much. I just wanted to make the "money row" the second row, which is roughly how it appears in your example.
df <- data.frame(
  Assessment_Area = c(2, 3, 4, 5),
  Offices = c("#", 1, 1, 2),
  Dep_Percent_63016 = c("#", "50.0%", "50.0%", "100.0%"),
  Dep_Total_63016 = c("$ (000s)", "$24,451", "$21,931", "$46,382"),
  Assessment_Area_Rev = rep("Blah", 4)
)
df
Assessment_Area Offices Dep_Percent_63016 Dep_Total_63016
1 2 # # $ (000s)
2 3 1 50.0% $24,451
3 4 1 50.0% $21,931
4 5 2 100.0% $46,382
Assessment_Area_Rev
1 Blah
2 Blah
3 Blah
4 Blah
Here's the for loop:
library(stringr)
for (i in seq_len(nrow(df))) {
  if (any(str_detect(df[i, ], "^\\$\\d{1,3}(,\\d{3})*"))) {
    print(i)
    break
  }
}
The key is the line with the if statement. any returns TRUE if any element of a logical vector is true. In this case the vector is created by applying stringr::str_detect to a row of the df which is indexed as df[i,]. str_detect returns a logical vector - you supply a character vector and an expression to match in the elements of that vector. It returns TRUE or FALSE for each element in the vector which in this case is each cell in a row. So the crux of this is the regular expression:
"^\\$\\d{1,3}(,\\d{3})*"
This is the pattern we're searching for (the money cell) in each row. ^\\$ indicates we want the string to start with the dollar sign. The two backslashes escape the $ character because it's a metacharacter in regular expressions (end anchor). We then want 1-3 digits. This will match any dollar value below $1,000. Then we specify that the expression can contain any number (including 0) of , followed by three more digits. This will cover any dollar value.
Finally, if we encounter a row which contains one of these expressions, the for loop will print the number of the row and end the loop so it will return the lowest row number containing one desired cell. In this example the output is 2. If no appropriate rows are encountered, nothing will happen.
There may be more you want to do once you have that information, but if all you need is the lowest row number containing your money expression then this is sufficient.
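For anyone following along in Python rather than R, the same loop can be sketched with the standard re module. This is only an illustration under assumptions: the rows are plain lists standing in for the data frame, and first_money_row is a made-up helper name.

```python
import re

# Same pattern as above: "$", 1-3 digits, then any number of ",ddd" groups.
MONEY = re.compile(r"^\$\d{1,3}(,\d{3})*")


def first_money_row(rows):
    """Return the 1-based index of the first row with a money cell, else None."""
    for i, row in enumerate(rows, start=1):
        if any(MONEY.search(str(cell)) for cell in row):
            return i
    return None


rows = [
    [2, "#", "#", "$ (000s)", "Blah"],
    [3, "1", "50.0%", "$24,451", "Blah"],
    [4, "1", "50.0%", "$21,931", "Blah"],
    [5, "2", "100.0%", "$46,382", "Blah"],
]
print(first_money_row(rows))  # 2
```

The header cell "$ (000s)" is skipped because the dollar sign is not followed directly by a digit.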
A less elegant regular expression which only looks for dollar signs, commas, and digits would be:
"[0-9$,]+"
which is what you asked for, although I don't think it's what you really want, because it will also match something like ,56$,,$$78
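A quick way to compare the two patterns is to run both against a few cells. The sketch below uses Python's re module; the names LOOSE, STRICT and classify are invented for the example, and fullmatch is used so that the whole cell has to fit the pattern.

```python
import re

LOOSE = re.compile(r"[0-9$,]+")              # any mix of digits, "$" and ","
STRICT = re.compile(r"\$\d{1,3}(,\d{3})*")   # "$" followed by grouped digits


def classify(cell):
    """Return (loose_ok, strict_ok) when the whole cell must match."""
    return bool(LOOSE.fullmatch(cell)), bool(STRICT.fullmatch(cell))


for cell in ["$24,451", ",56$,,$$78", "50.0%"]:
    print(cell, classify(cell))
```

Note that without anchors (or fullmatch) the loose pattern also finds a match inside "50.0%", since it only needs one digit somewhere in the cell.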
So I have converted a pdf to a dataframe and am almost in the final stages of what I wish the format to be. However I am stuck in the following step. I have a column which is like -
Column A
1234[321]
321[3]
123
456[456]
and want to separate it into two different columns B and C such that -
Column B Column C
1234 321
321 3
123 0
456 456
How can this be achieved? I did try something along the lines of
df["Column A"].str.strip(r"\[\d+\]")
but I have not been able to get through after trying different variations. Any help will be greatly appreciated as this is the final part of this task. Much thanks in advance!
An alternative could be:
# Create the new two columns
df[["Column B", "Column C"]]=df["Column A"].str.split('[', expand=True)
# Get rid of the extra bracket
df["Column C"] = df["Column C"].str.replace("]", "", regex=False)
# Get rid of the NaN and the useless column
df = df.fillna(0).drop("Column A", axis=1)
# Convert all columns to numeric
df = df.apply(pd.to_numeric)
You may use
import pandas as pd
df = pd.DataFrame({'Column A': ['1234[321]', '321[3]', '123', '456[456]']})
df[['Column B', 'Column C']] = df['Column A'].str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# If you need to drop Column A here, use
# df[['Column B', 'Column C']] = df.pop('Column A').str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
df['Column C'] = df['Column C'].fillna(0)
df
# Column A Column B Column C
# 0 1234[321] 1234 321
# 1 321[3] 321 3
# 2 123 123 0
# 3 456[456] 456 456
The regex matches:
^ - start of string
(\d+) - Group 1: one or more digits
(?:\[(\d+)])? - an optional non-capturing group matching [, then capturing into Group 2 one or more digits, and then a ]
$ - end of string.
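To check how the optional group behaves on its own, here is a small sketch with plain re (no pandas). The helper name split_value is hypothetical, and filling a missing bracket with 0 mirrors the desired output in the question.

```python
import re

# Same pattern as in the answer: digits, then an optional "[digits]".
PATTERN = re.compile(r"^(\d+)(?:\[(\d+)])?$")


def split_value(text):
    """Return (B, C) as ints; C falls back to 0 when there is no [..] part."""
    m = PATTERN.match(text)
    if m is None:
        return None
    first, second = m.groups(default="0")  # an unmatched Group 2 becomes "0"
    return int(first), int(second)


for value in ["1234[321]", "321[3]", "123", "456[456]"]:
    print(value, split_value(value))
```

Because Group 2 sits inside the optional non-capturing group, it is simply unset for values like 123, which is where the NaN in the pandas result comes from.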
I'm sure this is a very simple issue with regexps, but: Trying to use str.match in pandas to match a non-ASCII character (the times sign). I expect the first match call will match the first row of the DataFrame; the second match call will match the last row; and the third match will match the first and last rows. However, the first call does match but the second and third calls do not. Where am I going wrong?
Dataframe looks like (with x replacing the times sign, it actually prints as a ?):
Column
0 2x 32
1 42
2 64 x2
Pandas 0.20.3, python 2.7.13, OS X.
#!/usr/bin/env python
import pandas as pd
import re
html = '<table><thead><tr><th>Column</th></tr></thead><tbody><tr><td>2× 32</td></tr><tr><td>42</td></tr><tr><td>64 ×2</td></tr></tbody></table>'
df = pd.read_html(html)[0]
print df
print df[df['Column'].str.match(ur'^[2-9]\u00d7', re.UNICODE, na=False)]
print df[df['Column'].str.match(ur'\u00d7[2-9]$', re.UNICODE, na=False)]
print df[df['Column'].str.match(ur'\u00d7', re.UNICODE, na=False)]
Output I see (again with ? replaced with x):
Column
0 2x 32
Empty DataFrame
Columns: [Column]
Index: []
Empty DataFrame
Columns: [Column]
Index: []
Use contains():
df.Column.str.contains(r'^[2-9]\u00d7')
0 True
1 False
2 False
Name: Column, dtype: bool
df.Column.str.contains(r'\u00d7[2-9]$')
0 False
1 False
2 True
Name: Column, dtype: bool
df.Column.str.contains(r'\u00d7')
0 True
1 False
2 True
Name: Column, dtype: bool
Explanation: contains() uses re.search(), and match() uses re.match(). Since re.match() only matches at the beginning of a string, only your first pattern, which is anchored at the start with ^, can succeed. In that case you don't actually need both match and ^:
df.Column.str.match(r'[2-9]\u00d7')
0 True
1 False
2 False
Name: Column, dtype: bool
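The difference is easy to reproduce without pandas. Here is a minimal sketch in Python 3 with the re module, where the list values stands in for the column:

```python
import re

values = ["2\u00d7 32", "42", "64 \u00d72"]   # \u00d7 is the times sign
pattern = re.compile("\u00d7[2-9]$")

# re.match only tries the pattern at position 0 of each string.
matched = [bool(pattern.match(v)) for v in values]
# re.search scans the string for the first position where it fits.
searched = [bool(pattern.search(v)) for v in values]

print(matched)   # [False, False, False]
print(searched)  # [False, False, True]
```

So the second and third patterns from the question can never succeed with match(), because the times sign is not the first character of any row.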
I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator = c("someindicator2001", "someindicator2011",
                                "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))
Desired results
The desired results should look like this:
indicator period values
1 someindicator 2001 0.2655087
2 someindicator 2011 0.3721239
3 some text 20022008 0.5728534
4 another indicator 2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (everything from the first digit onward, including that digit) go in the second column
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
separate(col = indicator, into = c("indicator", "period"),
sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
" ?" (a space followed by ?) matches the space character literally, between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
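The same lookbehind/lookahead pattern can be tried in Python, shown here as a rough sketch rather than a translation of separate(). It assumes Python 3.7 or later, where re.split also splits on empty matches, and split_indicator is a made-up helper name.

```python
import re

# Zero-width boundary between a lowercase letter and a digit,
# optionally consuming one space in between.
BOUNDARY = re.compile(r"(?<=[a-z]) ?(?=[0-9])")


def split_indicator(text):
    """Split into (indicator, period) at the first letter-to-digit boundary."""
    parts = BOUNDARY.split(text, maxsplit=1)
    if len(parts) != 2:
        return text, ""  # no boundary found: keep the text, empty period
    return parts[0], parts[1]


for s in ["someindicator2001", "someindicator2011",
          "some text 20022008", "another indicator 2003"]:
    print(split_indicator(s))
```

Since both assertions are zero-width, only the optional space is consumed by the split, so the first digit stays in the period part.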
You could also use unglue::unglue_unnest():
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)
I'm looking to subset a data frame based on matches from a regular expression that scans a single column, and returns the data in all the rows where column 2 has a match from the regular expression.
I'm using R 3.01 and I'm a relatively inexperienced R programmer.
My data frame looks like this:
data:
........Column 1 .. Column2 Column 3
Row 1 ..data..........string....data
Row 2 ..data..........string....data
Row 3 ..data..........string....data
Row 4 ..data..........string....data
I'm using the following to scan column 2:
grep("word1", data$Column2, perl=TRUE)
So far, I get all the strings returned from column2 that contain word1, but I'm looking to subset the entire row(s) where those matches are found.
new.data.frame <- data[grep("word1", data$Column2, perl=TRUE), ]