I just started with PySpark; here is the task:
I have an input of:
I need to use a regex to remove punctuation and all leading or trailing spaces and underscores. The output should be all lowercase.
What I came up with is not complete:
sentence = regexp_replace(trim(lower(column)), '\\*\s\W\s*\\*_', '')
and the result is:
How do I fix the regex here? I need to use regexp_replace here.
Thank you very much.
You may use
^\W+|\W+$|[^\w\s]+|_
The ^ and $ anchors must match line start/end.
If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and the ^\W+ pattern with ^[^\w\n]+:
^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_
See the regex demo.
Explanation:
^ - start of line (if the multiline option is on by default; otherwise, try adding (?m) at the pattern start)
[^\w\n]+ - 1 or more non-word chars (non-[a-zA-Z0-9_]) except a newline
| - or
[^\w\n]+$ - 1 or more non-word chars at the end of the line ($)
| - or
[^\w\s]+ - 1 or more non-word chars except any whitespace
| - or
_ - an underscore.
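A minimal sketch of the pattern explained above, using Python's built-in re module rather than Spark, purely to show what each alternative removes. The sample string and the clean helper are made up for illustration. Spark's regexp_replace uses Java regex syntax, where the (?m) prefix plays the role of re.MULTILINE here, and Python 3's \w is Unicode-aware by default, so results may differ for non-ASCII input.

```python
import re

# Same alternation as in the answer; re.MULTILINE makes ^ and $ match at
# every line start/end, like a leading (?m) would in Java.
PATTERN = re.compile(r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_", re.MULTILINE)


def clean(text: str) -> str:
    """Lowercase the text, then delete everything the pattern matches."""
    return PATTERN.sub("", text.lower())


# Hypothetical sample input, not taken from the question.
sample = "  *Hello, World!_ \n__Second  line?? "
print(repr(clean(sample)))
```

Inner whitespace between words is kept, because [^\w\s]+ deliberately skips whitespace and the other two alternatives only fire at line edges.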
If you do not really care about Unicode (I used \w and \s, which can be made Unicode-aware), you may just use a shorter, simpler pattern:
^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+
See this regex demo.
TL;DR: sentence = column.strip(' \t\n*+_')
If you want to remove characters only from the ends and don't care about unicode, then the basic string strip() function will let you pick characters to strip. It defaults to whitespace, but you can put in whatever you want.
If you want to remove within a string you are stuck with a regular expression or, if using byte strings or Python 2, maketrans.
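A small sketch of both options on a plain Python string (not a Spark column, which has no strip() method). The sample text is invented for illustration. It assumes Python 3, where str.maketrans is also available for ordinary strings.

```python
import string

# Hypothetical sample input.
raw = "  *+_Hello, World!_* \n"

# Strip only from the ends: the argument is a set of characters, not a prefix.
ends_only = raw.strip(" \t\n*+_").lower()

# Remove punctuation everywhere: the third argument of str.maketrans lists
# characters to delete; string.punctuation covers ASCII punctuation only.
table = str.maketrans("", "", string.punctuation)
everywhere = raw.translate(table).strip().lower()

print(repr(ends_only))
print(repr(everywhere))
```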
You may like to look at this question as well.
I would like to add some custom text to the end of all lines in my document opened in Notepad++ that start with 10 and contain a specific word (for example "frog").
So far, I managed to solve the first part.
Search: ^(10)$
Replace: \1;Batteries (to add ;Batteries to the end of the line)
What I need now is to edit this regex pattern to recognize only those lines that also contain a specific word.
For example:
Before: 1050;There is this frog in the lake
After: 1050;There is this frog in the lake;Batteries
You can use the regex to match your wanted lines:
(^(10).*?(frog).*)
The .*? is a lazy quantifier that matches as little as possible up to frog.
Replace with:
$1;Battery
Hope it helps,
You should allow any characters between the number and the end of line:
^10.*frog.*
And replacement will be $0;Batteries. You do not even need a $ anchor as .* matches till the end of a line since . matches any character but a line break char.
NOTE: There is no need to wrap the whole pattern with capturing parentheses, the $0 placeholder refers to the whole match value.
More details:
^ - start of a line
10 - a literal 10 text
.* - zero or more chars other than line break chars as many as possible
frog - a literal string
.* - zero or more chars other than line break chars as many as possible
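The same idea can be tried outside Notepad++. This is a rough sketch with Python's re module, where \g<0> stands in for Notepad++'s $0 and re.MULTILINE makes ^ match at each line start. The three sample lines are made up, apart from the first, which comes from the question.

```python
import re

# First line is from the question; the other two are hypothetical.
text = (
    "1050;There is this frog in the lake\n"
    "1051;No amphibians here\n"
    "2050;A frog, but the wrong prefix"
)

# \g<0> re-inserts the whole match, so no capturing group is needed.
result = re.sub(r"^10.*frog.*", r"\g<0>;Batteries", text, flags=re.MULTILINE)
print(result)
```

Only lines that both start with 10 and contain frog get the suffix; the others are left untouched.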
try this
find with: (^(10).*(frog).*)
replace with: $1;Battery
Use ^(10.*frog.*)$ as regex. Replace it with something like $1;Batteries
How to get (or remove) all comment lines from a matlab file?
Lines may start with no or an arbitrary number of whitespaces followed by one or more %, followed by the comment.
Using
only_comments = regexp(raw_string, '(?m)^[ ]*[%].*?$', 'match');
fails. Also, how can I make sure tabs are caught as well?
As I understand it, the pattern means:
(?m) line mode
^ beginning of line
[ ]* none or any number of white spaces
[%].*?$ followed by a % and then any character until the line end is reached.
What's wrong?
Seems like you want something like this,
only_comments = regexp(raw_string, '(?m)^[ ]*[%]+.*?$', 'match');
OR
only_comments = regexp(raw_string, '(?m)^ *%+.*$', 'match');
Explanation:
^ Asserts that we are at the start.
<space>* Matches zero or more spaces.
%+ Matches one or more %
.* Matches zero or more characters other than line breaks.
$ Asserts that we are at the end.
(?m)^[ ]*%+.*$
I think you need this. Your regex (?m)^[ ]*[%].*?$ does not quantify %, so [%] matches only one %. Use %+ to match one or more of them.
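To cover the tab part of the question, here is a rough sketch in Python's re module rather than MATLAB's regexp, so treat it as an illustration of the pattern and not of MATLAB's API. It swaps the space-only prefix for [ \t]*, and the sample source text is invented.

```python
import re

# Hypothetical MATLAB source held in a Python string.
source = (
    "x = 1;\n"
    "% top-level comment\n"
    "   %% indented section\n"
    "\ty = 2; % trailing comment\n"
    "\t% tab-indented comment\n"
)

# [ \t]* allows spaces and tabs before the first %.
COMMENT_LINE = re.compile(r"^[ \t]*%+.*$", re.MULTILINE)
comments = COMMENT_LINE.findall(source)

# To delete comment lines, also consume the line break that follows them.
stripped = re.sub(r"^[ \t]*%+.*\n?", "", source, flags=re.MULTILINE)

print(comments)
print(repr(stripped))
```

A trailing comment after code is not matched, because the pattern only allows whitespace between the line start and the first %.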
I am using Notepad++ to remove some unwanted strings from the end of a pattern and this for the life of me has got me.
I have the following sets of strings:
myApp.ComboPlaceHolderLabel,
myApp.GridTitleLabel);
myApp.SummaryLabel + '</b></div>');
myApp.NoneLabel + ')') + '</label></div>';
I would like to leave just myApp.[variable] and get rid of, e.g. ,, );, + '...', etc.
Using Notepad++, I can match the strings themselves using ^myApp.[a-zA-Z0-9].*?\b (it's a bit messy, but it works for what I need).
But in reality, I need negate that regex, to match everything at the end, so I can replace it with a blank.
You don't need negation. Just put your regex within a capturing group and add an extra .*$ at the end. $ matches the end of a line. The whole matched line is then replaced by the contents of the first captured group. Note that . matches any character, so you need to escape the dot to match a literal dot.
^(myApp\.[a-zA-Z0-9].*?\b).*$
Replacement string:
\1
DEMO
OR
Match only the following characters and then replace it with an empty string.
\b[,); +]+.*$
DEMO
I think this works equally as well:
^(myApp.\w+).*$
Replacement string:
\1
From difference between \w and \b regular expression meta characters:
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
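For a quick check of the capture-and-replace approach, here is a sketch in Python's re module using the four lines from the question. It assumes the escaped-dot variant ^(myApp\.\w+).*$, and \1 behaves as it does in Notepad++.

```python
import re

# The four sample lines from the question.
lines = (
    "myApp.ComboPlaceHolderLabel,\n"
    "myApp.GridTitleLabel);\n"
    "myApp.SummaryLabel + '</b></div>');\n"
    "myApp.NoneLabel + ')') + '</label></div>';"
)

# Keep group 1 (myApp.<identifier>) and drop the rest of each line.
cleaned = re.sub(r"^(myApp\.\w+).*$", r"\1", lines, flags=re.MULTILINE)
print(cleaned)
```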
(^.*?\.[a-zA-Z]+)(.*)$
Use this. Replace with
$1
See demo.
http://regex101.com/r/lU7jH1/5
How do I replace a word, a newline, and another word with an empty string using a regex in PowerShell?
Below is some sample content. I need to delete every use database ... go pair. I'm using PowerShell, with PowerShell ISE as the editor:
use database_instance
go
if condition
You need to match Newline and also space after the newline:
/use database_\w+\n\s*\w+/g
$sql = @"
use database_instance
go
if condition
"@
$sql -ireplace 'use\s+\w+_\w+\s*(?:\r?\n)+\s*go' , ''
How this Works:
Using -ireplace for case insensitive regex.
Find the word use followed by one or more whitespace \s+ followed by one or more word characters \w+, then an underscore _.
One or more word characters \w+, followed by 0 or more whitespace (just in case)
A non-capturing group (?:) since we don't need the result, this is just to encapsulate a newline that accounts for windows and unix line endings. It consists of an optional CR followed by a LF, and this is matched 1 or more times.
Followed by 0 or more whitespace \s* then the word go.
Replace it with nothing!
This does leave some empty space, but that shouldn't be too big of an issue since the SQL parser won't care.
Note
In your comments you said you tried:
$out -replace "/use database_\w+\n\w+/g"
Be aware that PowerShell does not use /regexhere/ syntax. The forward slashes are treated as literals, and so are the flags you specified. The replacement is global by default, so you don't need g anyway.
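The pattern itself is not PowerShell-specific. As a rough cross-check, this sketch runs the same expression through Python's re module, with re.IGNORECASE standing in for -ireplace. The second sample string, with Windows line endings, is made up for illustration.

```python
import re

# Same expression as in the -ireplace call above.
USE_GO = re.compile(r"use\s+\w+_\w+\s*(?:\r?\n)+\s*go", re.IGNORECASE)

unix_sql = "use database_instance\ngo\nif condition\n"
windows_sql = "USE database_other\r\nGO\r\nselect 1"  # hypothetical CRLF input

print(repr(USE_GO.sub("", unix_sql)))
print(repr(USE_GO.sub("", windows_sql)))
```

The line break after go is left behind in both cases, which is the leftover empty space mentioned above.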
I am currently learning how to use regex and have reached a point that I can't figure out.
the test strings that are the sources (They actually come from OCR'd PDFs):
string1 = 'Beleg-Nr.:12123-23131'; // no spaces after the colon
string2 = 'Beleg-Nr.: 12121-214331'; // a tab after the colon
string3 = 'Beleg-Nr.: 12-982831'; // a tab and spaces after the colon
I want to get the numbers explicitly. For that I use this pattern:
pattern = '/(?<=Beleg-Nr\.:[ \t]*)(.*)
This will get me the pure numbers for string1 and string2 but isn't working on string3 (it gives me additional whitespace before the number).
What am I missing here?
Edit: Thanks for all the helpful advice. The software that OCRs on the fly is able to suppress whitespace on its own in regexes. This did the trick. The resulting pattern is:
(?<=Beleg-Nr\.:[\s]*)(.*)
You can use the \s shorthand to match both spaces and tabs (so you will not need to combine them in a character class via []).
This works for me:
/(Beleg-Nr.:\s*)(.*)/
http://regexr.com?35rj6
The problem is that [ ]* will match only spaces. You need to use \s which will match any whitespace character (more specifically \s is [\f\n\r\t\v\u00A0\u2028\u2029]) :
/(?<=Beleg-Nr.:\s*)(.*)/
Side note:
* is greedy by default, so it will match as many whitespace characters as possible, and you do not need a negated [^\s] in your last () group.
Just replace the (.*) with a more restrictive pattern ([^ ]+$ for example). Also note, that the . after Beleg-Nr matches other chars as well.
The $ in my example matches the end of the line and thus ensures, that all characters are being matched.
I'd suggest matching tabs as well:
pattern = '/(?<=Beleg-Nr\.:[ \t]*)([^ \t]+)$
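A sketch of the same extraction in Python, under one stated substitution: Python's re module only accepts fixed-width lookbehinds, so the variable-length (?<=...\s*) is replaced by a plain prefix plus a capturing group. The extract_number helper is hypothetical, and the three inputs mirror the test strings from the question.

```python
import re
from typing import Optional

# Prefix, any whitespace (spaces or tabs), then the number up to the next
# whitespace. Group 1 holds the number without surrounding blanks.
BELEG = re.compile(r"Beleg-Nr\.:\s*(\S+)")


def extract_number(line: str) -> Optional[str]:
    """Return the receipt number, or None if the prefix is missing."""
    match = BELEG.search(line)
    return match.group(1) if match else None


samples = [
    "Beleg-Nr.:12123-23131",       # no whitespace after the colon
    "Beleg-Nr.:\t12121-214331",    # a tab after the colon
    "Beleg-Nr.:\t   12-982831",    # a tab and spaces after the colon
]
numbers = [extract_number(s) for s in samples]
print(numbers)
```

Using \S+ instead of .* keeps any whitespace after the number out of the result as well.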