text: [aa-b c d...]
result: [b-123 c d...]
text: [aa-word1 word2 word3 ...]
result: [word1-123 word2 word...]
[aa-bananas oranges apples]
[bananas-123 oranges apples]
I want to remove aa-, and -123 should be placed only after the next word.
The next word has to be captured as a parameter rather than matched as fixed text like aa-, because there are many different cases to replace.
I'll change "aa-" to many other variants: "bb-", "cc-", ...
But word1 is always variable text.
Ctrl+H
Find what: \[aa-(\w+)
Replace with: [$1-123
check Match case
check Wrap around
check Regular expression
Replace all
Explanation:
\[ # opening square bracket
aa # literally 2 a
- # hyphen
(\w+) # group 1, 1 or more word character
Result for given example:
[b-123 c d...]
[word1-123 word2 word3 ...]
[bananas-123 oranges apples]
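Since the question mentions many prefix variants ("bb-", "cc-", ...), the same replacement can also be scripted. The following is a minimal Python sketch, not part of the original answer; the function name move_prefix and its parameters are hypothetical, and it assumes the text is available as a string.

```python
import re

def move_prefix(text, prefix, suffix="-123"):
    """Drop `prefix` after '[' and append `suffix` to the word that follows it."""
    # re.escape keeps characters such as '-' or '.' in the prefix literal.
    pattern = re.compile(r"\[" + re.escape(prefix) + r"(\w+)")
    # A callback avoids any backslash interpretation in the suffix.
    return pattern.sub(lambda m: "[" + m.group(1) + suffix, text)

print(move_prefix("[aa-bananas oranges apples]", "aa-"))
print(move_prefix("[bb-word1 word2 word3]", "bb-"))
```

Calling the function once per prefix covers the "bb-" and "cc-" cases without editing the pattern by hand.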
I have rows of data that look like:
12 1234 6
33 154 10
1734 2345 7
I am trying to create a regex in VS Code to use as a search and replace where I can use $1 $2 $3 in the replace to represent the different numbers in the line
So that I can replace it with something like
(12) [1234] {6}
(33) [154] {10}
I am not sure how to match it so it captures all 3 numbers in one regex split out into the separate numbers
(\d+) is matching each number individually, but how do I get it to match all 3 numbers in $1 $2 $3 ?
In Visual Studio Code, you can use
Find what: ^(\d+)\s+(\d+)\s+(\d+)$
Replace with: ($1) [$2] {$3}
Or, if you need to keep the same amount of whitespace between the numbers:
Find what: ^(\d+)(\s+)(\d+)(\s+)(\d+)$
Replace with: ($1)$2[$3]$4{$5}
NOTE: the \s shorthand character class usually matches line breaks, but in Visual Studio Code, when the pattern has no \r nor \n, \s does not match line breaks, so it is safe to use it, as it won't match across lines.
If you strictly need to only match lines with digit sequences separated with space/tabs, then replace \s+ with [ \t]+.
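Outside the editor, the same capture-group idea can be sketched in Python. This is an illustration only; the name bracket_rows is hypothetical. Python's \s does match line breaks, so the sketch uses the stricter [ \t]+ form mentioned above together with re.MULTILINE so that ^ and $ anchor to each line.

```python
import re

# Three digit runs separated by spaces/tabs, anchored to a single line.
ROW = re.compile(r"^(\d+)[ \t]+(\d+)[ \t]+(\d+)$", re.MULTILINE)

def bracket_rows(text):
    """Wrap the three numbers of each matching line as (a) [b] {c}."""
    return ROW.sub(r"(\1) [\2] {\3}", text)

data = "12 1234 6\n33 154 10\n1734 2345 7"
print(bracket_rows(data))
```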
I need to remove lines having just 2-3 words starting with say
hi/Hi/Hello/hello
Example
hi Matt
I have tried using the following code
dropcols = ['Hi','hi','Hello']
dataextract = dataextract[~dataextract['text'].str.contains('|'.join(dropcols))]
But this would remove relevant lines like
for example - 'hi Matt, did you get my email'
And I only need to remove the line if it has
'hi Matt'
This expression,
^(?=.*\b(?:hi|hello)\b).*$\r?\n
with re.sub might be an option.
import re
regex = r"^(?=.*\b(?:hi|hello)\b).*$\r?\n"
test_str = """
hi alice
some other words
Hi bob
some other words
Hello alice
some other words
hello bob
some other words
hi Matt
some other words
"""
subst = ""
print(re.sub(regex, subst, test_str, 0, re.MULTILINE | re.IGNORECASE))
Output
some other words
some other words
some other words
some other words
some other words
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
To match the first 2 or 3 words which start with hi/Hi/Hello/hello, you might use this pattern which you can remove from the string:
^[hH](?:i|ello)(?: \w+){1,2}
Explanation
^ Start of string
[hH] Match h or H
(?:i|ello) Match i or ello
(?: \w+){1,2} Repeat 1 to 2 times: a space followed by 1+ word characters
If you want to match all non whitespace characters, you could use \S+ instead of \w+
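To drop only the lines that consist entirely of such a greeting, one option is to require the pattern to match the whole line. The sketch below is an assumption about how that could look in plain Python; drop_greeting_lines is a hypothetical helper, and it works on a list of strings rather than on the DataFrame from the question.

```python
import re

# Greeting followed by one or two words, with nothing else on the line.
GREETING_ONLY = re.compile(r"[hH](?:i|ello)(?: \w+){1,2}")

def drop_greeting_lines(lines):
    """Keep every line that is not just a short greeting."""
    return [line for line in lines
            if not GREETING_ONLY.fullmatch(line.strip())]

lines = [
    "hi Matt",
    "hi Matt, did you get my email",
    "Hello there Bob",
    "some other words",
]
print(drop_greeting_lines(lines))
```

Because fullmatch must consume the entire line, 'hi Matt, did you get my email' is kept while 'hi Matt' is removed.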
I am trying to clean up a large .csv file that contains many comma-separated words, parts of which I need to consolidate. So I have a subsection where I want to change all the commas to slashes. Let's say my file contains this text:
Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool
I want to select all commas between the unique words bar and blah. The idea is to then replace the commas with slashes (using find and replace), such that I get this result:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
As per @EganWolf's input:
How do I include words in the search but exclude them from the selection (for the unique words) and how do I then match only the commas between the words?
Thus far I have only managed to select all the text between the unique words including them:
bar,.*,blah, bar:*, *,blah, (bar:.+?,blah)*,*\2
I experimented with negative lookahead but can't get any search results from my statements.
Using Notepad++, you can do:
Ctrl+H
Find what: (?:\bbar,|\G(?!^))\K([^,]*),(?=.+\bblah\b)
Replace with: $1/
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # start non capture group
\bbar, # word boundary then bar then a comma
| # OR
\G # restart from last match position
(?!^) # negative lookahead, make sure we are not at the beginning of a line
) # end group
\K # forget all we've seen until this position
([^,]*) # group 1, 0 or more non comma
, # a comma
(?= # positive lookahead
.+ # 1 or more any character but newline
\bblah\b # word boundary, blah, word boundary
) # end lookahead
Result for given example:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
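Python's standard re module supports neither \G nor \K, so the pattern above cannot be used there as-is. A rough equivalent, sketched under that assumption, captures the text between the two markers and rewrites its commas in a callback. The name slash_commas_between is hypothetical.

```python
import re

def slash_commas_between(line, start="bar", end="blah"):
    """Replace commas with slashes strictly between `start,` and `,end`."""
    pattern = re.compile(
        r"(\b" + re.escape(start) + r",)(.*?)(," + re.escape(end) + r"\b)"
    )
    # Groups 1 and 3 are kept unchanged; only group 2 is rewritten.
    return pattern.sub(
        lambda m: m.group(1) + m.group(2).replace(",", "/") + m.group(3),
        line,
    )

line = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
print(slash_commas_between(line))
```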
The following regex will capture the minimally required text to access the commas you want:
(?<=bar,)(.*?(,))*(?=.*?,blah)
If you want to replace the commas, you will need to replace everything in capture group 2. Capture group 0 has your entire match.
An alternative approach would be to split your string by comma to create an array of words. Then join words between bar and blah using / and append the other words joined by ,.
Here is a PowerShell example of split and join:
$a = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
$split = $a -split ","
$slashBegin = $split.indexof("bar")+1
$commaEnd = $split.indexof("blah")-1
$str1 = $split[0..($slashbegin-1)] -join ","
$str2 = $split[($slashbegin)..$commaend] -join "/"
$str3 = $split[($commaend+1)..$split.count] -join ","
@($str1,$str2,$str3) -join ","
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
This could easily be made into a function with your entire line and keywords as inputs.
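As one possible shape for such a function, here is a small Python sketch of the same split-and-join idea. It is not a translation of the PowerShell above; join_between and its parameters are hypothetical, and it assumes both keywords occur in the line, with the end keyword after the start keyword.

```python
def join_between(line, start, end, sep=",", new_sep="/"):
    """Join the words strictly between `start` and `end` with `new_sep`."""
    words = line.split(sep)
    i = words.index(start)         # raises ValueError if start is missing
    j = words.index(end, i + 1)    # first `end` after `start`
    if j == i + 1:                 # nothing between the keywords
        return line
    middle = new_sep.join(words[i + 1:j])
    return sep.join(words[:i + 1] + [middle] + words[j:])

line = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
print(join_between(line, "bar", "blah"))
```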
I have some naturally occurring text:
text="word1 word2 word3. word4, word5 word6 word7"
And some elements that I want to detect in that text:
elements=c("word2","word6 word7",".",",")
However,
elements[sapply(paste0("\\<",elements,"\\>"),grepl,text)]
only returns the unigram "word2" and the bigram "word6 word7". The period and comma, which are in the text, are not detected.
How can I achieve that?
You don't need to include the square brackets, since square brackets are special metacharacters in regex that denote a character class.
> text="word1 word2 word3. word4, word5 word6 word7"
> elements=c("word2","word6 word7",".",",")
> elements[sapply(paste0(elements),grepl,text, fixed=T)]
[1] "word2" "word6 word7" "." ","
elements[sapply(paste0("[",elements,"]"),grepl,text)] does the job.
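For comparison, the same detection can be sketched in Python. This is an illustrative assumption rather than a port of the R code: the hypothetical helper detect_elements escapes each element so punctuation is matched literally, and adds a word boundary only on sides where the element starts or ends with a word character.

```python
import re

def detect_elements(elements, text):
    """Return the elements found in text; punctuation is matched literally."""
    found = []
    for element in elements:
        pattern = re.escape(element)
        if re.match(r"\w", element):      # starts with a word character
            pattern = r"\b" + pattern
        if re.search(r"\w$", element):    # ends with a word character
            pattern = pattern + r"\b"
        if re.search(pattern, text):
            found.append(element)
    return found

text = "word1 word2 word3. word4, word5 word6 word7"
elements = ["word2", "word6 word7", ".", ","]
print(detect_elements(elements, text))
```

Unlike a purely fixed-string search, this keeps whole-word matching for the word elements, so "word" alone would not be reported as present in "word1".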
I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
then all the words after the first one... but how do I match those? Should I use preg_match again with offset = 1 in the arguments? But that offset is in characters or bytes, right?
Anyway, after I have matched the words following the first, if they exist, should I do something like this for each of them:
$second_word = substr( $following_word, 1 ) . '. ';
Or my approach is completely wrong?
Thanks
ps - it would be a boon if the regex could keep the first two words whole when the string contains three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice way around the restriction that you cannot use a variable-length lookbehind in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches first word in a name and skips it (does nothing)
(\w)\w+ matches all other words and replaces it with first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) captures every uppercase letter that is preceded by a horizontal space character.
\w+ matches one or more word characters.
Replacing the matched characters with the contents of group 1 (\1) plus a dot gives the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
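The lookahead-plus-word-boundary pattern also works unchanged in Python's re module, which makes it easy to check against the names from the question. The sketch below is only an illustration; abbreviate_name is a hypothetical name.

```python
import re

def abbreviate_name(name):
    """Shorten every word except the first to its initial plus a dot."""
    # (?!^) rejects a match at the start of the string, \b requires a word
    # boundary, and (\w)\w+ needs at least two word characters, so an
    # existing initial such as 'X.' is left as it is.
    return re.sub(r"(?!^)\b(\w)\w+", r"\1.", name)

for name in ["John Smith", "David X. Cohen", "Kim Jong Un", "Bob"]:
    print(abbreviate_name(name))
```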