Get string after string with trailing whitespaces - regex

I currently need to figure out how to use regex and have come to a point that I can't seem to figure out.
The test strings that are the sources (they actually come from OCR'd PDFs):
string1 = 'Beleg-Nr.:12123-23131'; // no spaces after the colon
string2 = 'Beleg-Nr.: 12121-214331'; // a tab after the colon
string3 = 'Beleg-Nr.: 12-982831'; // a tab and spaces after the colon
I want to get the numbers explicitly. For that I use this pattern:
pattern = '/(?<=Beleg-Nr\.:[ \t]*)(.*)
This will get me the pure numbers for string1 and string2 but isn't working on string3 (it gives me additional whitespace before the number).
What am I missing here?
Edit: Thanks for all the helpful advice. The software that OCRs on the fly is able to suppress whitespace on its own in regexes. This did the trick. The resulting pattern is:
(?<=Beleg-Nr\.:[\s]*)(.*)

You can use the special "\s" symbol to match both spaces and tabs (so you will not need to combine them into a character class via []).

This works for me:
/(Beleg-Nr.:\s*)(.*)/
http://regexr.com?35rj6
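
For engines that reject variable-length look-behind (Python's built-in re module is one of them), the capturing-group form above is the portable route. A minimal sketch in Python, assuming the separator after the colon is any mix of tabs and spaces as described in the question; the helper name extract_number is made up for illustration:

```python
import re

# Escaped dot so "Beleg-Nr.:" is matched literally; \s* absorbs tabs and spaces,
# and the group captures the run of non-whitespace that follows.
PATTERN = re.compile(r"Beleg-Nr\.:\s*(\S+)")

def extract_number(line):
    """Return the number after 'Beleg-Nr.:' or None if the label is absent."""
    match = PATTERN.search(line)
    return match.group(1) if match else None

samples = [
    "Beleg-Nr.:12123-23131",       # no whitespace after the colon
    "Beleg-Nr.:\t12121-214331",    # a tab after the colon
    "Beleg-Nr.:\t   12-982831",    # a tab and spaces after the colon
]
results = [extract_number(s) for s in samples]
```

The number is read from group 1, so the label itself never has to be excluded with a look-behind.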

The problem is that [ ]* will match only spaces. You need to use \s, which matches any whitespace character (more specifically, in JavaScript \s covers [ \f\n\r\t\v\u00A0\u2028\u2029] and a few other Unicode space characters):
/(?<=Beleg-Nr.:\s*)(.*)/
Side note:
* is greedy by default, so \s* will consume as many whitespace characters as possible; you therefore do not need a negated [^\s] in your last () group.
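
To see the difference the whitespace class makes, here is a small Python sketch (this is not the OCR tool's engine, so treat it as an illustration only). The sample line is built with a tab followed by spaces, as string3 is described in the question, and a capturing group stands in for the look-behind:

```python
import re

line = "Beleg-Nr.:\t   12-982831"  # assumed layout: tab, then three spaces

# [ ]* stops at the tab, so the greedy (.*) picks up the tab and the spaces.
only_spaces = re.search(r"Beleg-Nr\.:[ ]*(.*)", line).group(1)

# \s* greedily consumes the tab and the spaces before (.*) starts.
any_whitespace = re.search(r"Beleg-Nr\.:\s*(.*)", line).group(1)
```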

Just replace the (.*) with a more restrictive pattern ([^ ]+$ for example). Also note, that the . after Beleg-Nr matches other chars as well.
The $ in my example matches the end of the line and thus ensures, that all characters are being matched.
I'd suggest excluding tabs as well:
pattern = '/(?<=Beleg-Nr\.:[ \t]*)([^ \t]+)$


how to remove star * from string using regex in pyspark

I just started PySpark, here is the task:
I have an input of:
I need to use a regex to remove punctuation and all leading or trailing spaces and underscores. The output should be all lowercase.
What I came up with is not complete:
sentence = regexp_replace(trim(lower(column)), '\\*\s\W\s*\\*_', '')
and the result is:
How do I fix the regex here? I need to use regexp_replace here.
Thank you very much.
You may use
^\W+|\W+$|[^\w\s]+|_
The ^ and $ anchors must match line start/end.
If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and the ^\W+ pattern with ^[^\w\n]+:
^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_
See the regex demo.
Explanation:
^ - start of line (if the multiline option is on by default; otherwise, try adding (?m) at the pattern start)
[^\w\n]+ - 1 or more non-word chars (non-[a-zA-Z0-9_]) except a newline
| - or
[^\w\n]+$ - 1 or more non-word chars at the end of the line ($)
| - or
[^\w\s]+ - 1 or more non-word chars except any whitespace
| - or
_ - an underscore.
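
The behaviour of the pattern explained above can be tried outside Spark. The sketch below uses Python's re module as a stand-in, which is an assumption: PySpark's regexp_replace runs on the Java regex engine, so details such as the (?m) flag may need adjusting there. The function name clean_sentence and the sample input are made up for illustration:

```python
import re

# Same alternation as above; re.MULTILINE makes ^ and $ match at line boundaries.
CLEANUP = re.compile(r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_", re.MULTILINE)

def clean_sentence(text):
    """Lowercase the text, then drop edge non-word chars, punctuation and underscores."""
    return CLEANUP.sub("", text.lower())

cleaned = clean_sentence("  **Hello, World!**__ ")
```

Note that the last alternative removes every underscore, including ones between words, so "big_world" would come out as "bigworld".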
If you do not really care about Unicode (I used \w, \s that can be made Unicode aware), you may just use a shorter, more simple pattern:
^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+
See this regex demo.
TL;DR: sentence = column.strip(' \t\n*+_')
If you want to remove characters only from the ends and don't care about unicode, then the basic string strip() function will let you pick characters to strip. It defaults to whitespace, but you can put in whatever you want.
If you want to remove within a string you are stuck with a regular expression or, if using byte strings or Python 2, maketrans.
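
A short sketch of both options on a plain Python string (not a Spark column, which is an assumption about where the cleanup runs). In Python 3, str.maketrans with three arguments builds a deletion table, so translate can also be used on ordinary strings:

```python
raw = "  *+_hello_world_*  "
unwanted = " \t\n*+_"

# strip() only removes the listed characters from both ends.
ends_only = raw.strip(unwanted)

# translate() with a deletion table removes them everywhere in the string.
everywhere = raw.translate(str.maketrans("", "", unwanted))
```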
You may like to look at this question as well.

Perl regex extract two consecutive words

I am trying to extract strings containing two words separated by one or more whitespace from a list.
Example:
@a=("aaa12:.", "lala lulu", "erwer", ",", "lala loqw asqwd", "asdas sadsad", "asasd| asq");
@b=grep {/\w+\s+\w+/} @a;
this gives me
'lala lulu',
'lala loqw asqwd',
'asdas sadsad'
but I don't want to grep the one with three words...
I tried @b=grep {/^\w\s+\w$/} but then I don't get any matches. Should be simple, but I just don't get it. Which regex do I need here?
\w only matches one character. You want the following:
/^\w+\s+\w+\z/
^ matches the start of string.
\w+ matches one or more "word" characters.
\s+ matches one or more whitespace characters.
\w+ matches one or more "word" characters.
\z matches the end of the string.
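
The same anchored pattern can be checked quickly in Python (used here only as a neutral test bed; the question itself is about Perl). Python's re spells the strict end-of-string anchor \Z, and as in Perl a plain $ also matches just before a trailing newline:

```python
import re

items = ["aaa12:.", "lala lulu", "erwer", ",", "lala loqw asqwd",
         "asdas sadsad", "asasd| asq"]

# Exactly two runs of word characters separated by whitespace, nothing else.
two_words = [s for s in items if re.search(r"^\w+\s+\w+\Z", s)]

# $ tolerates a trailing newline, \Z does not.
dollar_accepts_newline = re.search(r"^\w+\s+\w+$", "lala lulu\n") is not None
strict_accepts_newline = re.search(r"^\w+\s+\w+\Z", "lala lulu\n") is not None
```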
I tried @b=grep {/^\w\s+\w$/} but then I don't get any matches
The only reason it doesn't work is because you left off quantifier(s) at
the beginning/end:
/^\w\s+\w$/
    ^    ^
where it would work fine if it were /^\w+\s+\w+$/
The better way to do it though is add some flexibility with whitespace: /^\s*\w+\s+\w+\s*$/

Remove all characters after a certain match

I am using Notepad++ to remove some unwanted strings from the end of a pattern and this for the life of me has got me.
I have the following sets of strings:
myApp.ComboPlaceHolderLabel,
myApp.GridTitleLabel);
myApp.SummaryLabel + '</b></div>');
myApp.NoneLabel + ')') + '</label></div>';
I would like to leave just myApp.[variable] and get rid of, e.g. ,, );, + '...', etc.
Using Notepad++, I can match the strings themselves using ^myApp.[a-zA-Z0-9].*?\b (it's a bit messy, but it works for what I need).
But in reality, I need to negate that regex to match everything at the end, so I can replace it with a blank.
You don't need to go for negation. Just put your regex within a capturing group and add an extra .*$ at the end. $ matches the end of a line. All the matched characters (the whole line) are replaced by the characters inside the first captured group.
Also, . matches any character, so you need to escape the dot to match a literal dot.
^(myApp\.[a-zA-Z0-9].*?\b).*$
Replacement string:
\1
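
For readers who want to try the capture-and-replace idea outside Notepad++, here is a minimal Python sketch under the assumption that the four sample lines from the question are the whole input. Python's re.MULTILINE plays the role of Notepad++'s line-by-line matching:

```python
import re

lines = "\n".join([
    "myApp.ComboPlaceHolderLabel,",
    "myApp.GridTitleLabel);",
    "myApp.SummaryLabel + '</b></div>');",
    "myApp.NoneLabel + ')') + '</label></div>';",
])

# Group 1 keeps "myApp.<name>"; the trailing .*$ swallows the rest of the line,
# and the replacement puts back only group 1.
cleaned = re.sub(r"^(myApp\.[a-zA-Z0-9].*?\b).*$", r"\1", lines, flags=re.MULTILINE)
```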
OR
Match only the following characters and then replace it with an empty string.
\b[,); +]+.*$
I think this works equally as well:
^(myApp.\w+).*$
Replacement string:
\1
From difference between \w and \b regular expression meta characters:
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
(^.*?\.[a-zA-Z]+)(.*)$
Use this. Replace with
$1
See demo.
http://regex101.com/r/lU7jH1/5

NOTEPAD++ REGEX - I can't get what's in between two strings, I don't get it

I'm so close to understanding regex. I'm a bit stumped; I thought I understood lazy and greedy.
Here is my current regex: <g_n><!\[CDATA\[([^]]+)(?=]]><\/g_n>)
My current regex makes:
<g_n><![CDATA[xxxxxxxxxx]]></g_n>
match to:
<g_n><![CDATA[xxxxxxxxxx
But I want to make it match like this:
xxxxxxxxxx
You want
<g_n><!\[CDATA\[(.*?)]]></g_n>
then if you want to replace it use
\1
in the replacement box
You're matching the whole string; the parentheses around the .*? capture that part and put it in the \1 variable.
So the match will be the whole string, with \1 referring to the part you want.
To change the xxxxx
Regex :
(<g_n><!\[CDATA\[)(?:.*?)(\]\]></g_n>)
Replacement
\1WHAT YOU WANT TO CHANGE TO\2
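
Both ideas (reading the inner text from a capture group, and swapping it while keeping the wrapper) can be sketched in Python. This assumes the element sits on one line, as in the question; "NEW VALUE" is placeholder text:

```python
import re

xml = "<g_n><![CDATA[xxxxxxxxxx]]></g_n>"

# The whole element is matched, but group 1 holds only the CDATA payload.
inner = re.search(r"<g_n><!\[CDATA\[(.*?)\]\]></g_n>", xml).group(1)

# Keep the opening and closing wrappers (groups 1 and 2), replace what is between.
replaced = re.sub(r"(<g_n><!\[CDATA\[).*?(\]\]></g_n>)",
                  r"\g<1>NEW VALUE\g<2>", xml)
```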
It looks like you need to add escape slashes to the two closing square brackets, as they are literals from the string you're parsing.
<g_n><!\[CDATA\[.*+?\]\]><\/g_n>
                    ^ ^
Any square brackets not being escaped by backslashes will be treated as regex operational brackets, which in this case won't catch the input string.
EDIT, I think the +? is redundant.
\[.*\]\]> ...
should suffice, since .* means any character, any amount of times.
Tested with notepad++ 6.3.2:
find: (<g_n><!\[CDATA\[)([^]]+)(?=]]></g_n>)
replace: $1WhatYouWant
You can replace + by * in the pattern to match empty CDATA as well:
<g_n><![CDATA[]]></g_n>

How do I remove trailing whitespace using a regular expression?

I want to remove trailing white spaces and tabs from my code without
removing empty lines.
I tried:
\s+$
and:
([^\n]*)\s+\r\n
But they all removed empty lines too. I guess \s matches end-of-line characters too.
UPDATE (2016):
Nowadays I automate such code cleaning by using Sublime's TrailingSpaces package, with custom/user setting:
"trailing_spaces_trim_on_save": true
It highlights trailing white spaces and automatically trims them on save.
Try just removing trailing spaces and tabs:
[ \t]+$
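
A small Python sketch of why restricting the class to spaces and tabs keeps empty lines, while \s+$ does not. The sample text is made up, and the behaviour shown is that of Python's re with the MULTILINE flag; editors may differ in detail:

```python
import re

text = "a  \n\n\tb\t\n"  # a line with trailing spaces, an empty line, a tabbed line

# Only spaces and tabs before each line end are removed; newlines are untouched.
spaces_and_tabs = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)

# \s also matches "\n", so the match can swallow the line break itself.
any_whitespace = re.sub(r"\s+$", "", text, flags=re.MULTILINE)
```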
To remove trailing whitespace while also preserving whitespace-only lines, you want the regex to only remove trailing whitespace after non-whitespace characters. So you need to first check for a non-whitespace character. This means that the non-whitespace character will be included in the match, so you need to include it in the replacement.
Regex: ([^ \t\r\n])[ \t]+$
Replacement: \1 or $1, depending on the IDE
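
The capture-and-put-back approach can be tried in Python as follows. The input is a made-up example whose middle line contains only spaces:

```python
import re

text = "x = 1  \n   \ny\t\n"

# The non-whitespace character is captured and restored via \1, so only the
# spaces and tabs that follow it are dropped; whitespace-only lines never match.
cleaned = re.sub(r"([^ \t\r\n])[ \t]+$", r"\1", text, flags=re.MULTILINE)
```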
The platform is not specified, but in C# (.NET) it would be:
Regular expression (presumes the multiline option - the example below uses it):
[ \t]+(\r?$)
Replacement:
$1
For an explanation of "\r?$", see Regular Expression Options, Multiline Mode (MSDN).
Code example
This will remove all trailing spaces and all trailing TABs in all lines:
string inputText = " Hello, World! \r\n" +
                   " Some other line\r\n" +
                   " The last line ";
string cleanedUpText = Regex.Replace(inputText,
                                     @"[ \t]+(\r?$)", @"$1",
                                     RegexOptions.Multiline);
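
The same pattern also behaves this way in Python's re module, which, like .NET, lets $ in multiline mode match only before "\n" and therefore needs the optional \r for Windows line endings. A sketch using the same sample text:

```python
import re

input_text = " Hello, World! \r\n Some other line\r\n The last line "

# Trailing spaces/tabs are dropped; an optional "\r" before the line end is
# captured and written back so CRLF line endings survive.
cleaned_up = re.sub(r"[ \t]+(\r?$)", r"\1", input_text, flags=re.MULTILINE)
```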
Regex to find trailing and leading whitespaces:
^[ \t]+|[ \t]+$
If using Visual Studio 2012 and later (which uses .NET regular expressions), you can remove trailing whitespace without removing blank lines by using the following regex
Replace (?([^\r\n])\s)+(\r?\n)
With $1
Some explanation
The reason you need the rather complicated expression is that the character class \s matches spaces, tabs and newline characters, so \s+ will match a group of lines containing only whitespace. It doesn't help adding a $ termination to this regex, because this will still match a group of lines containing only whitespace and newline characters.
You may also want to know (as I did) exactly what the (?([^\r\n])\s) expression means. This is an Alternation Construct, which effectively means match to the whitespace character class if it is not a carriage return or linefeed.
Alternation constructs normally have a true and false part,
(?( expression ) yes | no )
but in this case the false part is not specified.
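[ \t]+$ with an empty replace works. (Inside a character class, | is a literal pipe rather than alternation, so the [ |\t] form would also strip trailing | characters.)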
\s+($) with a $1 replace also works, at least in Visual Studio Code...
To remove trailing white space while ignoring empty lines I use positive look-behind:
(?<=\S)\s+$
The look-behind is the way to go to exclude the non-whitespace (\S) from the match.
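
One caveat worth checking in your own tool: in Python's re with MULTILINE, \s+ after the look-behind can run across a newline and take a following whitespace-only line with it. A sketch comparing it with a variant limited to spaces and tabs (the sample text is made up):

```python
import re

text = "x = 1  \n   \ny\t\n"

# \s may cross the line break, so the whitespace-only second line can be lost.
with_any_whitespace = re.sub(r"(?<=\S)\s+$", "", text, flags=re.MULTILINE)

# Limiting the match to spaces and tabs keeps every line break in place.
with_spaces_and_tabs = re.sub(r"(?<=\S)[ \t]+$", "", text, flags=re.MULTILINE)
```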
To remove any blank trailing spaces use this:
\n|^\s+\n
I tested in the Atom and Xcode editors.
In Java:
String str = " hello world ";
// prints "hello world"
System.out.println(str.replaceAll("^(\\s+)|(\\s+)$", ""));
You can simply use it like this:
var regex = /( )/g;