Perl regex extract two consecutive words


I am trying to extract strings containing two words separated by one or more whitespace from a list.
Example:
@a=("aaa12:.", "lala lulu", "erwer", ",", "lala loqw asqwd", "asdas sadsad", "asasd| asq");
@b=grep {/\w+\s+\w+/} @a;
this gives me
'lala lulu',
'lala loqw asqwd',
'asdas sadsad'
but I don't want to grep the one with three words...
I tried @b=grep {/^\w\s+\w$/} but then I don't get any matches. Should be simple, but I just don't get it. Which regex do I need here?

\w only matches one character. You want the following:
/^\w+\s+\w+\z/
^ matches the start of the string.
\w+ matches one or more "word" characters.
\s+ matches one or more whitespace characters.
\w+ matches one or more "word" characters.
\z matches the end of the string.
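For readers who want to check the pattern outside Perl, here is a minimal Python sketch of the same filter. It assumes Python's re module, where fullmatch() plays the role of the ^ and \z anchors; the variable names are only illustrative.

```python
import re

# Sample list from the question.
items = ["aaa12:.", "lala lulu", "erwer", ",",
         "lala loqw asqwd", "asdas sadsad", "asasd| asq"]

# Two runs of word characters separated by whitespace.
# fullmatch() requires the whole string to match, like ^...\z in Perl.
two_words = re.compile(r"\w+\s+\w+")

matches = [s for s in items if two_words.fullmatch(s)]
print(matches)
```

The three-word string is rejected because the pattern must consume the entire string.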

I tried @b=grep {/^\w\s+\w$/} but then I don't get any matches
It fails only because each \w is missing a quantifier, so the pattern accepts exactly one word character on either side of the whitespace:
/^\w\s+\w$/
   ^    ^
It works once both are quantified: /^\w+\s+\w+$/
A more tolerant variant also allows leading and trailing whitespace: /^\s*\w+\s+\w+\s*$/
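The two answers end their patterns differently ($ versus \z), and the difference shows up when a string ends in a newline. The sketch below is in Python, assuming its re module, where \Z is the counterpart of Perl's \z; as in Perl, $ also matches just before a trailing newline.

```python
import re

line = "lala lulu\n"  # e.g. a line read from a file without chomp/strip

# $ matches at the very end OR just before a final newline.
with_dollar = re.search(r"^\w+\s+\w+$", line)

# \Z (Perl: \z) matches only at the very end of the string.
with_strict_end = re.search(r"^\w+\s+\w+\Z", line)

# The whitespace-tolerant variant accepts padding around the two words...
padded = re.search(r"^\s*\w+\s+\w+\s*$", "  lala lulu  ")

# ...but still rejects three words.
three_words = re.search(r"^\s*\w+\s+\w+\s*$", "lala loqw asqwd")

print(bool(with_dollar), bool(with_strict_end), bool(padded), bool(three_words))
```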

Related

Regex replace one value between comma separated values

I have a bunch of comma-separated (CSV) files.
I would like to replace exactly one value, the one between the third and fourth comma. I would love to do this with the Notepad++ 'Find in Files' and Replace functionality, which supports regular expressions.
Each line in the files look like this:
03/11/2016,07:44:09,327575757,1,5434543,...
The value I would like to replace in each line is always the number 1, which should become another number.
A simple regex such as ,1, won't do, because that could match somewhere else in the line; it must be the value after the third and before the fourth comma...
Could anyone help me with the RegEx?
Thanks in advance!
Two more rows as example:
01/25/2016,15:22:55,276575950,1,103116561,10.111.0.111,ngd.itemversions,0.401,0.058,W10,0.052,143783065,,...
01/25/2016,15:23:07,276581704,1,126731239,10.111.0.111,ll.browse,7.133,1.589,W272,3.191,113273232,,...

You can use
^(?:[^,\n]*,){2}[^,\n]*\K,1,
Because the match itself is ,1, (both commas included), replace it with the new value wrapped in commas, for example ,2,.
The pattern explanation:
^ - start of a line
(?:[^,\n]*,){2} - 2 sequences of
[^,\n]* - zero or more characters other than , and \n (matched with the negated character class [^,\n]) followed with
, - a literal comma
[^,\n]* - zero or more characters other than , and \n
\K - an operator that forces the regex engine to discard the whole text matched so far with the regex pattern
,1, - what we get in the match.
Note that \n inside the negated character classes will prevent overflowing to the next lines in the document.
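The same idea can be tried outside Notepad++. Below is a minimal Python sketch; Python's re module has no \K, so this version captures the first three fields in a group and puts them back instead. The function name replace_fourth_field is hypothetical, and the sample rows are shortened from the question.

```python
import re

# ^ with MULTILINE anchors at every line start; the group holds the
# first three fields including their commas, then comes the literal "1,".
FOURTH_FIELD_IS_ONE = re.compile(r"^((?:[^,\n]*,){3})1,", re.MULTILINE)

def replace_fourth_field(text: str, new_value: str) -> str:
    """Replace a fourth field equal to 1 with new_value on every line."""
    return FOURTH_FIELD_IS_ONE.sub(lambda m: m.group(1) + new_value + ",", text)

rows = (
    "01/25/2016,15:22:55,276575950,1,103116561,10.111.0.111\n"
    "01/25/2016,15:23:07,276581704,1,126731239,10.111.0.111\n"
    "01/25/2016,15:23:09,1,11,1,10.111.0.111\n"  # fourth field is 11: untouched
)
print(replace_fourth_field(rows, "7"))
```

A replacement function is used rather than a template string so that a numeric new value cannot be misread as part of a group reference.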

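You can replace the value between the third and fourth comma using the following regex.
Regex: ^([^,\n]+,[^,\n]+,[^,\n]+),([^,\n]+)
Replacement: \1,value (for example \1,XX). The ^ anchor keeps Replace All from matching a second group of four fields later in the same line, and \n in the character classes keeps the match on one line.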

Regular expression matching space but at the end of line

I'm trying to replace multiple spaces with a single one, but not at the start or end of a line.
Example:
___abc___def__
___ghi___jkl__
should turn to
___abc_def__
___ghi_jkl__
Note that I've replaced space with underscore
A simple search using the following pattern:
([^\s])\s+
matches the space at the end of the first line up to the space at the beginning of the next one.
So, if I replace with \1_, I get the following:
___abc_def_ghi_jkl
That is absolutely not what I expect, and other regex engines, e.g. PowerGREP or the one in Visual Studio, don't behave that way.

If you want to match only horizontal spaces, use \h:
Find what: (?<=\S)\h+(?=\S)
Replace with: (a space)
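Here is a rough Python check of this approach. Python's re module does not support \h, so the sketch substitutes [ \t] for it, which is an assumption about what counts as horizontal whitespace in the input.

```python
import re

# Two lines with leading, interior and trailing spaces, as in the question.
text = "   abc   def  \n   ghi   jkl  "

# Collapse runs of spaces/tabs only when a non-whitespace character sits
# on both sides, so line edges and the newline itself are left alone.
collapsed = re.sub(r"(?<=\S)[ \t]+(?=\S)", " ", text)

print(repr(collapsed))
```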

There are several possible interpretations of the question. For each of them the replacement will be a single space character.
If "spaces" is plural and means space characters but not tabs, then use
a find string of (^ {2,})|( {2,}$).
If "spaces" is plural and should include tabs, then use a find string
of (^[ \t]{2,})|([ \t]{2,}$).
If any leading or trailing spaces and tabs (one or more) are to be
replaced with a space, then use a find string of (^[ \t]+)|([ \t]+$).
The general form of each of these is (^...)|(...$). The | is an alternation, so either the preceding or the following parenthesized expression can match. Hence the find string can match either at the beginning or at the end of a line. The ... varies depending on exactly what needs to be matched. Specifying [ \t] means only the two characters space and tab, whereas \s also includes the line-end characters.
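To see the (^...)|(...$) form in action, here is a small Python sketch of the second variant (two or more spaces or tabs at either edge of a line). It assumes re.MULTILINE so that ^ and $ apply per line; the unnecessary capturing parentheses are dropped.

```python
import re

text = "   abc   def  \n   ghi   jkl  "

# Either alternative may match: a run at the start of a line,
# or a run at the end of a line. Interior runs match neither.
edges = re.compile(r"^[ \t]{2,}|[ \t]{2,}$", re.MULTILINE)

result = edges.sub(" ", text)
print(repr(result))
```

Because [ \t] excludes the newline, a match can never run from the end of one line into the start of the next.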

Ok, so the intention was to replace this:
Hey diddle diddle, \n<br/>
The Cat and the fiddle,\n
with this:
Hey diddle diddle,\n<br/>
The Cat and the fiddle,\n
A slightly modified version of Toto's answer did the trick:
(?<=\S)\h+(?=\S)|\s+$
This finds any run of spaces between non-whitespace characters, plus trailing whitespace at the end of the line.

Remove all characters after a certain match

I am using Notepad++ to remove some unwanted strings from the end of a pattern, and for the life of me I can't work it out.
I have the following sets of strings:
myApp.ComboPlaceHolderLabel,
myApp.GridTitleLabel);
myApp.SummaryLabel + '</b></div>');
myApp.NoneLabel + ')') + '</label></div>';
I would like to leave just myApp.[variable] and get rid of, e.g. ,, );, + '...', etc.
Using Notepad++, I can match the strings themselves using ^myApp.[a-zA-Z0-9].*?\b (it's a bit messy, but it works for what I need).
But in reality I need to negate that regex, to match everything after it, so I can replace that with nothing.

You don't need negation. Put your regex inside a capturing group and append .*$ to it; $ matches the end of a line. The whole line is matched and then replaced with the contents of the first captured group. An unescaped . matches any character, so escape the dot to match a literal one.
^(myApp\.[a-zA-Z0-9].*?\b).*$
Replacement string:
\1
OR
Match only the following characters and then replace it with an empty string.
\b[,); +]+.*$
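Both alternatives can be tried on the four sample lines from the question. This is a minimal Python sketch, assuming Python's re module behaves like the Notepad++ engine for these constructs; the variable names are illustrative only.

```python
import re

lines = [
    "myApp.ComboPlaceHolderLabel,",
    "myApp.GridTitleLabel);",
    "myApp.SummaryLabel + '</b></div>');",
    "myApp.NoneLabel + ')') + '</label></div>';",
]

# Option 1: capture the part to keep, match the rest, put the capture back.
keep_prefix = re.compile(r"^(myApp\.[a-zA-Z0-9].*?\b).*$")
kept = [keep_prefix.sub(r"\1", s) for s in lines]

# Option 2: match only the unwanted tail and replace it with nothing.
drop_suffix = re.compile(r"\b[,); +]+.*$")
dropped = [drop_suffix.sub("", s) for s in lines]

print(kept)
print(dropped)
```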

I think this works equally well (with the dot escaped so that it matches only a literal dot):
^(myApp\.\w+).*$
Replacement string:
\1
From difference between \w and \b regular expression meta characters:
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.

(^.*?\.[a-zA-Z]+)(.*)$
Use this. Replace with
$1
See demo.
http://regex101.com/r/lU7jH1/5

Regex to find an expression followed by whitespace, #, # or end of input

I want to find all instances of a word (say myword), with the added condition that the word is followed by whitespace, "#", "#", or the end of input.
Input string:
"myword# myword mywordrick myword# myword"
I want the regex to match everything besides mywordrick -
myword#
myword
myword#
myword
I am able to match against the first 3 with myword[##\s]
I thought myword[##\s\z] would match against all 4, but I only get 3
I try myword[\z] and get no matches
I try myword\z and get 1 match.
I figure \z inside a [] doesn't work, because [] is character based logic, rather than position based logic.
Is there a way to use a single regex to match the expressions I am interested in? I do not want to use both myword[##\s] and myword\z unless I really have to.

Your regex would be:
myword(?:[##\s]|$)
It matches myword only when it is followed by one of the characters in the class [##\s] or by $, the end of the line (the end of the input, unless multiline mode is on). When a character follows, it is part of the match.
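A quick way to confirm that a single pattern finds all four occurrences is the Python sketch below. It uses the input string exactly as printed in the question, writes the character class as [#\s], and uses \Z (Python's end-of-input anchor) in place of $; those choices are assumptions made for the illustration.

```python
import re

text = "myword# myword mywordrick myword# myword"

# Alternation outside the class: either one class character, or end of input.
pattern = re.compile(r"myword(?:[#\s]|\Z)")
found = pattern.findall(text)

# Lookahead variant: same condition, but the trailing character
# is checked without becoming part of the match.
bare = re.findall(r"myword(?=[#\s]|\Z)", text)

print(found)
print(bare)
```

The position anchor has to sit outside the brackets, since a character class can only list characters.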

Get string after string with trailing whitespaces

I currently need to figure out how to use regex and have reached a point I can't get past.
The test strings that serve as sources (they actually come from OCR'd PDFs):
string1 = 'Beleg-Nr.:12123-23131'; // no spaces after the colon
string2 = 'Beleg-Nr.: 12121-214331'; // a tab after the colon
string3 = 'Beleg-Nr.: 12-982831'; // a tab and spaces after the colon
I want to get the numbers explicitly. For that I use this pattern:
pattern = '/(?<=Beleg-Nr\.:[ \t]*)(.*)
This will get me the pure numbers for string1 and string2 but isn't working on string3 (it gives me additional whitespace before the number).
What am I missing here?
Edit: Thanks for all the helpful advice. The software that OCRs on the fly is able to suppress whitespace on its own in regexes. This did the trick. The resulting pattern is:
(?<=Beleg-Nr\.:[\s]*)(.*)

You can use the \s shorthand, which covers both spaces and tabs, so you don't need to combine them in a [] character class.

This works for me:
/(Beleg-Nr.:\s*)(.*)/
http://regexr.com?35rj6

The problem is that [ ]* will match only spaces. You need to use \s, which matches any whitespace character (roughly [ \f\n\r\t\v] plus Unicode spaces such as \u00A0, \u2028 and \u2029; the exact set depends on the regex flavor):
/(?<=Beleg-Nr.:\s*)(.*)/
Side note:
* is greedy by default, so it will match as many whitespace characters as possible; you do not need a negated [^\s] in your last () group.

Just replace the (.*) with a more restrictive pattern ([^ ]+$, for example). Also note that an unescaped . after Beleg-Nr matches any character, not only a literal dot.
The $ in my example matches the end of the line and thus ensures that everything up to the end is matched.
I'd suggest excluding tabs as well:
pattern = '/(?<=Beleg-Nr\.:[ \t]*)([^ \t]+)$
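For comparison, here is a minimal Python sketch of the extraction. Python's re module rejects variable-width lookbehind such as (?<=Beleg-Nr\.:\s*), so this version swaps in a capturing group and reads group 1 instead. The sample strings are reconstructed from the question's comments (tab, then tab plus spaces) and are an assumption about the real OCR output.

```python
import re

samples = [
    "Beleg-Nr.:12123-23131",        # nothing after the colon
    "Beleg-Nr.:\t12121-214331",     # a tab after the colon
    "Beleg-Nr.:\t   12-982831",     # a tab and spaces after the colon
]

# Literal prefix (dot escaped), any amount of whitespace, then capture
# the run of non-whitespace characters that follows.
pattern = re.compile(r"Beleg-Nr\.:\s*(\S+)")

numbers = [pattern.search(s).group(1) for s in samples]
print(numbers)
```

Capturing \S+ rather than .* also keeps trailing whitespace out of the result.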
