How to deal with the new line character in the Silverlight TextBox - regex

When using a multi-line TextBox (AcceptsReturn="True") in Silverlight, line feeds are recorded as \r rather than \r\n. This is causing problems when the data is persisted and later exported to another format to be read by a Windows application.
I was thinking of using a regular expression to replace any single \r characters with a \r\n, but I suck at regex's and couldn't get it to work.
Because there may be a mixture of line endings, just blindly replacing all \r with \r\n doesn't cut it.
So two questions really...
If regex is the way to go what's the correct pattern?
Is there a way to get Silverlight to respect its own Environment.NewLine character in TextBoxes and have it insert \r\n rather than just a single \r?

I don't know Silverlight, but I imagine (I hope!) there's a way to get it to respect Environment.NewLine; that would be a better approach. If there isn't, however, you can use a regex. I'll assume you have text which contains all of \r, \n, and \r\n, and never uses those as anything but line endings; you just want consistency. (If they show up as non-line-ending data, the regex solution becomes much harder, and possibly impossible.) You thus want to replace all occurrences of \r(?!\n)|(?<!\r)\n with \r\n. The first half of the regex matches any \r not followed by a \n; the second half matches a lone \n which wasn't preceded by a \r.
The fancy operators in this regex are termed lookaround: (?=...) is a positive lookahead, (?<=...) is a positive lookbehind, (?!...) is a negative lookahead, and (?<!...) is a negative lookbehind. Each of them is a zero-width assertion like ^ or $; they match successfully without consuming input if the given regex succeeds/fails (for positive/negative, respectively) to match after/before (for lookahead/lookbehind) the current location in the string.
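As a rough illustration of the replacement described above, here is a minimal JavaScript sketch. The question is about Silverlight, so treat this only as a demonstration of the pattern; toCrLf and the sample string are hypothetical names made up for the example, and lookbehind needs an ES2018+ engine.

```javascript
// Matches a CR not followed by LF, or an LF not preceded by CR.
const LONE_CR_OR_LF = /\r(?!\n)|(?<!\r)\n/g;

// Hypothetical helper: rewrite every lone CR or lone LF as CRLF.
// Existing CRLF pairs are left alone because neither alternative matches them.
function toCrLf(text) {
  return text.replace(LONE_CR_OR_LF, "\r\n");
}

const mixed = "one\rtwo\nthree\r\nfour";
console.log(JSON.stringify(toCrLf(mixed))); // "one\r\ntwo\r\nthree\r\nfour"
```

Because an existing \r\n never matches either alternative, running the replacement a second time leaves the text unchanged.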

I don't know Silverlight at all (and I find the behavior you're describing very strange), but perhaps you could try searching for \r(?!\n) and replacing that with \r\n.
\r(?!\n) means "match a \r if and only if it's not followed by \n".
If you also happen to have \n without preceding \rs and want to "normalize" those too, then search for \r(?!\n)|(?<!\r)\n and replace with \r\n.
(?<!\r)\n means "match a \n if and only if it's not preceded by \r".

What is the difference between `(\S.*\S)` and `^\s*(.*)\s*$` in regex?

I'm doing the RegexOne regex tutorial and it has a question about writing a regular expression to remove unnecessary whitespace.
The solution provided in the tutorial is
We can just skip all the starting and ending whitespace by not capturing it in a line. For example, the expression ^\s*(.*)\s*$ will catch only the content.
The setup for the question does indicate the use of the hat at the beginning and the dollar sign at the end, so it makes sense that this is the expression that they want:
We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.
That said, using \S instead, I was able to come up with what seems like a simpler solution - (\S.*\S).
I've found this Stack Overflow solution that matches the one in the tutorial - Regex Email - Ignore leading and trailing spaces? - and I've seen other guides that recommend the same format, but I'm struggling to find an explanation for why the \S is bad.
Additionally, this validates as correct in their tool... so, are there cases where this would not work as well as the provided solution? Or is the recommended version just a standard format?
The tutorial's solution of ^\s*(.*)\s*$ is wrong. The capture group .* is greedy, so it will expand as much as it can, all the way to the end of the line - it will capture trailing spaces too. The .* will never backtrack, so the \s* that follows will never consume any characters.
https://regex101.com/r/584uVG/1
Your solution is much better at actually matching only the non-whitespace content in the line, but there are a couple of odd cases in which it won't match the non-space characters in the middle of the line. (\S.*\S) needs at least two non-whitespace characters to match, so a line whose content is a single character is not matched at all, whereas the tutorial's (.*) can capture a single character, or nothing at all if the input is composed entirely of whitespace.
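To make the difference concrete, here is a small sketch that runs both patterns against a few sample lines. The capture helper and the sample strings are assumptions made up for the illustration.

```javascript
const tutorial = /^\s*(.*)\s*$/; // pattern from the tutorial
const simpler = /(\S.*\S)/;      // pattern from the question

// Hypothetical helper: return capture group 1, or null when there is no match.
function capture(pattern, line) {
  const m = line.match(pattern);
  return m ? m[1] : null;
}

console.log(JSON.stringify(capture(tutorial, "  foo bar  "))); // "foo bar  " (trailing spaces kept)
console.log(JSON.stringify(capture(simpler, "  foo bar  ")));  // "foo bar"
console.log(JSON.stringify(capture(tutorial, "  x  ")));       // "x  "
console.log(JSON.stringify(capture(simpler, "  x  ")));        // null (needs two non-space characters)
console.log(JSON.stringify(capture(tutorial, "   ")));         // "" (matches, captures nothing)
```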
But, given the problem description at your link:
Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search a replace and a regular expression to extract the content of the lines without the extra whitespace.
From this, matching only the non-whitespace content (like you're doing) probably wouldn't remove the undesirable leading and trailing spaces. The tutorial is probably thinking to guide you towards a technique that can be used to match a whole line with a particular pattern, and then replace that line with only the captured group, like:
Match ^\s*(.*\S)\s*$, replace with $1: https://regex101.com/r/584uVG/2/
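A minimal sketch of that match-and-replace approach in JavaScript, assuming the text is processed one line at a time (trimLines and the sample input are made up for the example). Splitting first sidesteps the fact that \s also matches line breaks, which could otherwise let a multi-line match swallow a newline.

```javascript
// Whole-line pattern from above: group 1 ends at the last non-whitespace character.
const TRIM_LINE = /^\s*(.*\S)\s*$/;

// Hypothetical helper: replace each line with its captured content.
// Lines that contain only whitespace do not match and are left unchanged.
function trimLines(text) {
  return text
    .split("\n")
    .map((line) => line.replace(TRIM_LINE, "$1"))
    .join("\n");
}

const log = "   foo bar  \n baz\nqux   ";
console.log(JSON.stringify(trimLines(log))); // "foo bar\nbaz\nqux"
```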
Your technique would work given the problem if you had a way to make a new text file containing only the captured groups (or all the full matches), e.g.:
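// Sample input with stray leading and trailing whitespace.
const input = ` foo
bar
baz
qux `;
// With the m flag, $ matches at the end of each line. \S(?:$|.*\S) also matches
// a line holding a single non-whitespace character, which \S.*\S would miss.
const newText = (input.match(/\S(?:$|.*\S)/gm) || [])
  .join('\n');
console.log(newText);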
Using \S instead of . is not bad. If one knows a particular location must be matched by a non-space character, rather than by a space, using \S is more precise, can make the intent of the pattern clearer, can make a bad match fail faster, and can also avoid problems with catastrophic backtracking in some cases. These patterns don't have backtracking issues, but it's still a good habit to get into.

Finding ALL matches after a certain character(s)

I'd like to think I'm ok at writing RegEx's, but there's one thing I can't seem to crack:
I want to start looking for multiple, identical matches after a certain set of characters and capture all of them. Here's an example string:
Dialogue: 0,0:05:47.99,0:05:50.74,JoJo-main,Koichi,0000,0000,0000,,What are you doing, Giorno Giovanna?!
For this example, I want to start looking for matches after ,,. I want to find all instances of Gio i.e.
Dialogue: 0,0:05:47.99,0:05:50.74,JoJo-main,Koichi,0000,0000,0000,,What are you doing, {Gio}rno {Gio}vanna?!
I've tried first using non-capturing groups like /(?:,,.*?)(Gio)/g then lookbehinds like /(?<=,,.*?)(Gio)/g, /(?<=,,)(?:.*?)(Gio)/g and /(?<=,,)((?:.*?)(Gio))+/g to avoid consuming the ,,
None of these give me the behaviour I want, as I want individual matches as if I just used Gio, but without the chance of accidentally capturing stuff before the ,,
I could, of course, run one RegEx to find the ,, then feed that position to another RegEx to look for Gios after that point.
However I have thousands of lines like these to parse and thousands of words to look for on each line (I separate them with |), so I'd ideally like to do it with one RegEx and without a loop.
You may consider the following option for the .NET or modern ECMAScript 2018+ compliant JS environments:
/(?<=,,.*?)Gio/g
See the regex demo.
The (?<=,,.*?)Gio pattern matches Gio when it is preceded by ,, and any zero or more characters other than line break characters, as few as possible.
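Here is a short sketch of the lookbehind variant in JavaScript (ES2018+). The sample line is the one from the question with one assumed change: the name field is switched to Giorno so that there is a Gio before the ,, which should not be matched.

```javascript
// Assumption for the demo: name field changed from Koichi to Giorno.
const line =
  "Dialogue: 0,0:05:47.99,0:05:50.74,JoJo-main,Giorno,0000,0000,0000,,What are you doing, Giorno Giovanna?!";

// Gio only counts when ",," appears somewhere earlier on the same line.
const AFTER_MARKER = /(?<=,,.*?)Gio/g;

const marked = line.replace(AFTER_MARKER, "{Gio}");
const positions = [...line.matchAll(AFTER_MARKER)].map((m) => m.index);

console.log(marked);           // only the two occurrences after ",," are wrapped
console.log(positions.length); // 2
```

Several search words can share the same lookbehind by putting the alternation in a group after it, along the lines of (?<=,,.*?)(?:Gio|Koichi).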
The following variant will work with PCRE/Onigmo regex engines:
/(?:\G(?!^)|,,).*?\KGio/
See another regex demo. Here, (?:\G(?!^)|,,) matches either the end of the previous successful match or ,, and then .*? matches and consumes any 0+ chars other than line break chars, as few as possible, then \K will reset the match buffer and Gio will land right there.

grep regex lookahead or start of string (or lookbehind or end of string)

I want to match a string which may contain a type of character before the match, or the match may begin at the beginning of the string (same for end of string).
For a minimal example, consider the text n.b., which I'd like to match either at the beginning of a line and end of a line or between two non-word characters, or some combination. The easiest way to do this would be to use word boundaries (\bn\.b\.\b), but that doesn't match; similar cases happen for other desired matches with non-word characters in them.
I'm currently using (^|[^\w])n\.b\.([^\w]|$), which works satisfactorily, but will also match the non-word characters (such as dashes) which appear immediately before and after the word, if available. I'm doing this in grep, so while I could easily pipe the output into sed, I'm using grep's --color option, which is disabled when piping into another command (for obvious reasons).
EDIT: The \K option, i.e. (\K^|[^\w])n\.b\.(\K[^\w]|$), seems to work, but it also discards the color on the match within the output. While I could, again, invoke auxiliary tools, I'd love it if there was a quick and simple solution.
EDIT: I have misunderstood the \K operator; it simply removes all the text from the match preceding its use. No wonder it was failing to color the output.
If you're using grep, you must be using the -P option, or lookarounds and \K would throw errors. That means you also have negative lookarounds at your disposal. Here's a simpler version of your regex:
(?<!\w)n\.b\.(?!\w)
Also, be aware that (?<=...) and (?<!...) are lookbehinds, and (?=...) and (?!...) are lookaheads. The wording of your title suggests you may have gotten those mixed up, a common beginner's mistake.
Apparently matching beginning of string is possible inside lookahead/lookbehinds; the obvious solution is then (?<=^|[^\w])n\.b\.(?=[^\w]|$).
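To see why \b falls short here and the negative lookarounds do not, here is a small comparison. It is written in JavaScript purely for illustration (the question uses grep -P, which accepts the same lookaround syntax); the sample strings are made up.

```javascript
const WITH_WORD_BOUNDARY = /\bn\.b\.\b/;
const WITH_LOOKAROUND = /(?<!\w)n\.b\.(?!\w)/;

// Hypothetical inputs: start of line, between dashes, glued to word
// characters on either side, and the bare abbreviation.
const samples = ["n.b. this matters", "see -n.b.- here", "an.b.", "n.b.c", "n.b."];

// Each entry is [sample, matched by \b version, matched by lookaround version].
const results = samples.map((s) => [
  s,
  WITH_WORD_BOUNDARY.test(s),
  WITH_LOOKAROUND.test(s),
]);

for (const [s, boundary, lookaround] of results) {
  console.log(JSON.stringify(s), boundary, lookaround);
}
// "n.b. this matters" false true
// "see -n.b.- here" false true
// "an.b." false false
// "n.b.c" true false
// "n.b." false true
```

The trailing \b sits right after a literal dot, so it can only succeed when a word character follows, which is the opposite of what is wanted.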

match first space on a line using sublime text and regular expressions

So regular expressions have always been tough for me. I'm getting frustrated trying to find a regular expression that will select the first whitespace character on a line, so that I can use Sublime Text to replace it with a /.
If you could give a quick explanation, that would help too.
In the spirit of #edi's answer, but with some explanation of what's happening. Match the beginning of the line with ^, then look for a sequence of characters that are not whitespace with [^\s]* or \S* (the former may work in more editors, libraries, etc than the latter), then find the first whitespace character with \s. Putting these together, you have
^[^\s]*\s
You may want to group the non-whitespace and whitespace parts, so you can do the replacement you're talking about:
^([^\s]*)(\s)
Then the replacement pattern is just \1/
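For anyone who wants to try the same replacement outside the editor, here is a minimal JavaScript sketch. slashFirstSpace and the sample lines are hypothetical; note that JavaScript writes the backreference as $1 where the editor replacement above uses \1. It works on one line at a time, since \s would otherwise also match a line break.

```javascript
// Group 1: leading run of non-whitespace; group 2: the first whitespace character.
const FIRST_WHITESPACE = /^([^\s]*)(\s)/;

// Hypothetical helper: swap the first whitespace character on a line for "/".
function slashFirstSpace(line) {
  return line.replace(FIRST_WHITESPACE, "$1/");
}

console.log(slashFirstSpace("foo bar baz")); // foo/bar baz
console.log(slashFirstSpace(" leading"));    // /leading
console.log(slashFirstSpace("nospace"));     // nospace (no whitespace, so unchanged)
```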
You can use this regex.
^([^\s]*)\s

Find whitespace in end of string using wildcards or regex

I have a Resoure.resx file that I need to search to find strings ending with whitespace. I have noticed that in Visual Web Developer I can search using both regex and wildcards, but I cannot figure out how to find only strings with whitespace at the end. I tried this regex, but it didn't work:
\s$
Can you give me an example? Thanks!
I'd expect that to work, although since \s includes \n and \r, perhaps it's getting confused. Or I suppose it's possible (but really unlikely) that the flavor of regular expressions that Visual Web Developer uses (I don't have a copy) doesn't have the \s character class. Try this:
[ \f\t\v]$
...which searches for a space, formfeed, tab, or vertical tab at the end of a line.
If you're doing a search and replace and want to get rid of all of the whitespace at the end of the line, then as RageZ points out, you'll want to include a greedy quantifier (+ meaning "one or more") so that you grab as much as you can:
[ \f\t\v]+$
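A quick way to sanity-check this pattern outside the IDE is a few lines of JavaScript. This is only a sketch: the strings array is hypothetical sample data standing in for resource values, and the regex dialect of Visual Web Developer may differ from JavaScript's.

```javascript
// One or more spaces, form feeds, tabs, or vertical tabs at the end of the string.
const TRAILING = /[ \f\t\v]+$/;

// Hypothetical sample values.
const strings = ["clean", "trailing space ", "trailing tab\t", ""];

// Find the offending values, then strip the trailing whitespace from all of them.
const offenders = strings.filter((s) => TRAILING.test(s));
const stripped = strings.map((s) => s.replace(TRAILING, ""));

console.log(JSON.stringify(offenders)); // ["trailing space ","trailing tab\t"]
console.log(JSON.stringify(stripped));  // ["clean","trailing space","trailing tab",""]
```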
You were almost there. Adding the + sign means one or more characters.
This should do it:
\s+$
Perhaps this would work:
^.+\s$
Using this you'll be able to find nonempty lines that end with a whitespace character.