How to extract rows where a string matches? - regex

My data is as below, and I want to extract only those rows where the data column contains a string like "7_", whose position is not fixed. For each matching row, the data column should keep only the "7_" value.
Row No Name data
1 ABC 4_6035;9_47;7_113838;0_14
2 xyz 0_6035;7_145
3 MNO 4_6035;5_47;8_113838;7_14
4 PPP 0_6035;5_145
Output I am looking for is
Row No Name data
1 ABC 7_113838
2 xyz 7_145
3 MNO 7_14
Please help.

^(?=.*\\b7_).*$
You can try this. See the demo.
https://regex101.com/r/oL9kE8/10

Try this
within(df[grep("7_", df$data, fixed = TRUE), ],
data <- sub(".*?(7_[^;]*).*", "\\1", data))
# RowNo Name data
# 1 1 ABC 7_113838
# 2 2 xyz 7_145
# 3 3 MNO 7_14
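
For comparison, here is a minimal pandas sketch of the same idea: keep the rows that contain a "7_" token and reduce the data column to that token. The column names mirror the question. The `(?:^|;)` prefix is an added assumption that a token must start at the beginning of the string or right after a semicolon, so a value such as "17_5" would not be picked up.

```python
import pandas as pd

df = pd.DataFrame({
    "RowNo": [1, 2, 3, 4],
    "Name": ["ABC", "xyz", "MNO", "PPP"],
    "data": [
        "4_6035;9_47;7_113838;0_14",
        "0_6035;7_145",
        "4_6035;5_47;8_113838;7_14",
        "0_6035;5_145",
    ],
})

# Capture the first ';'-separated token that starts with "7_".
# Rows without such a token get NaN.
token = df["data"].str.extract(r"(?:^|;)(7_[^;]*)", expand=False)

# Replace the data column with the token and drop the non-matching rows.
result = df.assign(data=token).dropna(subset=["data"]).reset_index(drop=True)
print(result)
```

If a row could contain several "7_" tokens, this keeps only the first one, as the `sub` call above does.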

Using awk, how do I match pattern and variants?

I've been struggling with this for a while in regex testers, but the pattern that the testers accepted as correct failed in practice. I've got a large tab-delimited file with numerous types of data. I want to print a specific column, with the characters XYZ, and its subsequent values.
In the specific column I'm interested in I have values like:
XYZ
ABCDE
XYZ/WORDS
XYZ/ABCDE
ABFE
XYZ
regex tester that was successful was something like:
XYZ(.....)*
It obviously fails when implemented as:
awk '{if ($1=="XYZ(......)*") print$0}'
What regex character do I use to denote that I want everything after the forward slash (/), including the original pattern (XYZ)?
I want to capture all instances of XYZ and print the other columns that go along with them (hence the print$0). Specifically, I want to capture these values:
XYZ
XYZ/WORDS
XYZ/ABCDE
Thank you
Setup: (assuming actual data file does not include blank lines)
$ cat x
XYZ 1 2 3 4
ABCDE 1 2 3 4
XYZ/WORDS 1 2 3 4
XYZ/ABCDE 1 2 3 4
ABFE 1 2 3 4
XYZ 1 2 3 4
If you merely want to print all rows where the first field starts with XYZ:
$ awk '$1 ~ /^XYZ/' x
XYZ 1 2 3 4
XYZ/WORDS 1 2 3 4
XYZ/ABCDE 1 2 3 4
XYZ 1 2 3 4
If this doesn't provide the expected results then please update the question with more details (to include a more representative set of input data and the expected output).
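
The original attempt fails because `==` in awk compares the field with the literal string, while `~` tests it against a regular expression. As a rough cross-check, the same prefix filter can be sketched in Python. The in-memory sample and the helper name `rows_with_prefix` are illustrative assumptions, and the fields are split on tabs because the question describes the file as tab delimited.

```python
import io

# Hypothetical stand-in for the tab-delimited file from the question.
sample = io.StringIO(
    "XYZ\t1\t2\t3\t4\n"
    "ABCDE\t1\t2\t3\t4\n"
    "XYZ/WORDS\t1\t2\t3\t4\n"
    "XYZ/ABCDE\t1\t2\t3\t4\n"
    "ABFE\t1\t2\t3\t4\n"
    "XYZ\t1\t2\t3\t4\n"
)


def rows_with_prefix(lines, prefix="XYZ"):
    """Yield each line whose first tab-separated field starts with prefix."""
    for line in lines:
        line = line.rstrip("\n")
        first_field = line.split("\t", 1)[0]
        if first_field.startswith(prefix):
            yield line


matches = list(rows_with_prefix(sample))
for row in matches:
    print(row)
```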

Separating columns based on Regex | Pandas

So I have converted a pdf to a dataframe and am almost in the final stages of what I wish the format to be. However I am stuck in the following step. I have a column which is like -
Column A
1234[321]
321[3]
123
456[456]
and want to separate it into two different columns B and C such that -
Column B Column C
1234 321
321 3
123 0
456 456
How can this be achieved? I did try something along the lines of
df.Column A.str.strip(r"\[\d+\]")
but I have not been able to get through after trying different variations. Any help will be greatly appreciated as this is the final part of this task. Much thanks in advance!
An alternative could be:
import pandas as pd
# Create the two new columns by splitting on the opening bracket
df[["Column B", "Column C"]] = df["Column A"].str.split('[', expand=True)
# Get rid of the trailing bracket (literal replacement, not a regex)
df["Column C"] = df["Column C"].str.replace("]", "", regex=False)
# Get rid of the NaN and the useless column
df = df.fillna(0).drop("Column A", axis=1)
# Convert all columns to numeric
df = df.apply(pd.to_numeric)
You may use
import pandas as pd
df = pd.DataFrame({'Column A': ['1234[321]', '321[3]', '123', '456[456]']})
df[['Column B', 'Column C']] = df['Column A'].str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# If you need to drop Column A here, use
# df[['Column B', 'Column C']] = df.pop('Column A').str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# Fill missing Group 2 values with 0 (avoids chained assignment)
df['Column C'] = df['Column C'].fillna(0)
df
# Column A Column B Column C
# 0 1234[321] 1234 321
# 1 321[3] 321 3
# 2 123 123 0
# 3 456[456] 456 456
See the regex demo. The pattern matches:
^ - start of string
(\d+) - Group 1: one or more digits
(?:\[(\d+)])? - an optional non-capturing group matching [, then capturing into Group 2 one or more digits, and then a ]
$ - end of string.
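
To see how the optional group behaves, the pattern can be run with Python's standard `re` module on the four sample values. This is only a sketch: the list `parsed` and the conversion to integers are illustrative additions, not part of the answer above.

```python
import re

# Same pattern as in the pandas answer: digits, then an optional "[digits]".
pattern = re.compile(r"^(\d+)(?:\[(\d+)])?$")

values = ["1234[321]", "321[3]", "123", "456[456]"]

parsed = []
for value in values:
    match = pattern.match(value)
    b, c = match.group(1), match.group(2)
    # Group 2 is None when the bracketed part is absent, so fall back to 0.
    parsed.append((int(b), int(c) if c is not None else 0))

print(parsed)
# [(1234, 321), (321, 3), (123, 0), (456, 456)]
```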

Sumif and IF, to add up one column and compare it to another column

I have two lists that I want to compare to see if they match, but in one list the numbers are broken down into individual lots so I need to sum them first to make sure they match the other list which only shows the total amount.
Here's an example:
List 1
5 ABC
6 ABC
7 ABC
1 CDE
5 CDE
2 CDE
List 2
18 ABC
8 CDE
So I want to make sure that the sum of the ABC and CDE in List 1 matches the amount of ABC and CDE in List 2. I can do this using multiple columns, but I am trying for a more...elegant way (one nested formula).
If you are looking for a confirmation that the numbers match you can use the following:
=SUMIF($B$3:$B$11,E1,$A$3:$A$11)=SUMIF($B$15:$B$17,E1,$A$15:$A$17)
This checks whether the sum for the item named in E1 (for example, ABC) in List 1 equals its sum in List 2, and returns TRUE if they are equal and FALSE if they are not.
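
The same check can be sketched outside Excel with pandas, using the example lists from the question. The column names `qty` and `item` are assumptions made for this illustration.

```python
import pandas as pd

# List 1: individual lots. List 2: totals. Column names are illustrative.
list1 = pd.DataFrame({
    "qty": [5, 6, 7, 1, 5, 2],
    "item": ["ABC", "ABC", "ABC", "CDE", "CDE", "CDE"],
})
list2 = pd.DataFrame({"qty": [18, 8], "item": ["ABC", "CDE"]})

# Sum each list per item, then compare the totals item by item.
totals1 = list1.groupby("item")["qty"].sum()
totals2 = list2.groupby("item")["qty"].sum()
matches = totals1.eq(totals2)

print(matches)
```

Here `matches` holds one boolean per item, which plays the role of filling the SUMIF comparison down one row per item.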

PowerBI - Get the Index of the First Occurrence of a value in the column

I am trying to return the Index of a first occurrence of a value in a column.
I would want to use the Calculated Column functionality in PowerBI.
For Example,
Input Output
ASD 1
ASD 1
ASD 1
GEF 4
GEF 4
HIJ 6
GEF 4
This can be done in Excel using a simple formula like,
MATCH(A2,A:A,0)-1
So that PowerBI has an index to work with, I have created a column called Index in the Query Editor, which makes the data look like,
Index Input Output
1 ASD ?
2 ASD ?
3 ASD ?
4 GEF ?
5 GEF ?
6 HIJ ?
7 GEF ?
How to do this in PowerBI?
The way I did this was to find the minimal index that corresponds to the Input value in the table:
Output = MINX(
FILTER(TableName,
TableName[Input] = EARLIER(TableName[Input])),
TableName[Index])
This takes the minimal index over the table, where Input matches the value of Input in the original (earlier) row context.
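
The logic of "minimum index among rows with the same Input" can be illustrated with a small pandas sketch. This is not DAX and is only meant to show the computation; the frame name `table` is an assumption, and the data is taken from the question.

```python
import pandas as pd

table = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5, 6, 7],
    "Input": ["ASD", "ASD", "ASD", "GEF", "GEF", "HIJ", "GEF"],
})

# For every row, take the smallest Index among rows sharing the same Input.
table["Output"] = table.groupby("Input")["Index"].transform("min")

print(table)
```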

Replace String B with String C if it contains (but not exactly matches) String A

I have a data frame match_df which shows "matching rules": the column old should be replaced with the column new in the data frames it is applied to.
old <- c("10000","20000","300ZZ","40000")
new <- c("Name1","Name2","Name3","Name4")
match_df <- data.frame(old,new)
old new
1 10000 Name1
2 20000 Name2
3 300ZZ Name3 # watch the letters
4 40000 Name4
I want to apply the matching rules above on a data frame working_df
id <- c(1,2,3,4)
value <- c("xyz-10000","20000","300ZZ-230002112","40")
working_df <- data.frame(id,value)
id value
1 1 xyz-10000
2 2 20000
3 3 300ZZ-230002112
4 4 40
My desired result is
# result
id value
1 1 Name1
2 2 Name2
3 3 Name3
4 4 40
This means that I am not looking for an exact match. I'd rather like to replace the whole string working_df$value as soon as it includes any part of the string in match_df$old.
I like the solution posted in R: replace characters using gsub, how to create a function?, but it works only for exact matches. I experimented with gsub, str_replace_all from stringr but I couldn't find a solution that works for me. There are many solutions for exact matches on SOF, but I couldn't find a comprehensible one for this problem.
Any help is highly appreciated.
I'm not sure this is the most elegant/efficient way of doing it but you could try something like this:
working_df$value <- sapply(working_df$value,function(y){
idx<-which(sapply(match_df$old,function(x){grepl(x,y)}))[1]
if(is.na(idx)) idx<-0
ifelse(idx>0,as.character(match_df$new[idx]),as.character(y))
})
It uses grepl to find, for each value of working_df, if there is a row of match_df that is partially matching and get the index of that row. If there is more than one, it takes the first one.
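
The "first partial match wins" rule described above can be sketched in Python as follows. The helper name `replace_if_contains` is hypothetical, and plain substring tests are used instead of regular expressions, which is an assumption that the old values contain no regex syntax that should be interpreted.

```python
# Matching rules from match_df: old substring -> replacement.
rules = {
    "10000": "Name1",
    "20000": "Name2",
    "300ZZ": "Name3",
    "40000": "Name4",
}
values = ["xyz-10000", "20000", "300ZZ-230002112", "40"]


def replace_if_contains(value, rules):
    """Return the replacement for the first rule whose key occurs in value."""
    for old, new in rules.items():
        if old in value:
            return new
    # No rule matched: keep the original value.
    return value


result = [replace_if_contains(v, rules) for v in values]
print(result)
# ['Name1', 'Name2', 'Name3', '40']
```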
You need the grep function. This will return the indices of a vector that match a pattern (any pattern, not necessarily a full string match). For instance, this will tell you which of your working_df values match the "10000" pattern:
grep(match_df[1,1], working_df$value)
Once you have that information, you can look up the corresponding "new" value for that pattern, and replace it on the matching rows.
Here are 2 approaches using Map + <<- and a for loop:
working_df[["value2"]] <- as.character(working_df[["value"]])
Map(function(x, y){working_df[["value2"]][grepl(x, working_df[["value2"]])] <<- y}, old, new)
working_df
## id value value2
## 1 1 xyz-10000 Name1
## 2 2 20000 Name2
## 3 3 300ZZ-230002112 Name3
## 4 4 40 40
## or...
working_df[["value2"]] <- as.character(working_df[["value"]])
for (i in seq_along(old)) {
working_df[["value2"]][grepl(old[i], working_df[["value2"]])] <- new[i]
}