I am trying to work with regular expressions. I have a mainframe file which has several fields. I have a flat file parser which distinguishes several types of records based on the first three letters of every line. How do I write a regular expression that matches lines whose first three letters are 'CTR'?
Beginning of line or beginning of string?
Start and end of string
/^CTR.*$/
/ = delimiter
^ = start of string
CTR = literal CTR
$ = end of string
.* = zero or more of any character except newline
Start and end of line
/^CTR.*$/m
/ = delimiter
^ = start of line
CTR = literal CTR
$ = end of line
.* = zero or more of any character except newline
m = enables multi-line mode, this sets regex to treat every line as a string, so ^ and $ will match start and end of line
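To see the difference between the two modes in practice, here is a minimal sketch using Python's re module (the answer above uses /.../m delimiter syntax; in Python the m flag is spelled re.MULTILINE). The sample records are made up for illustration.

```python
import re

# Hypothetical sample: three records in one string, separated by newlines.
text = "HDR 001\nCTR 002\nCTR 003"

# Without the multi-line flag, ^ only matches at the very start of the string.
# The string starts with "HDR", so nothing is found.
whole_string = re.findall(r"^CTR.*$", text)

# With re.MULTILINE (the "m" flag), ^ and $ match at every line boundary.
per_line = re.findall(r"^CTR.*$", text, flags=re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['CTR 002', 'CTR 003']
```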
While in multi-line mode you can still match the start and end of the string with \A\Z permanent anchors
/\ACTR.*\Z/m
\A = means start of string
CTR = literal CTR
.* = zero or more of any character except newline
\Z = end of string
m = enables multi-line mode
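A small Python sketch of the same idea, with made-up sample data: \A stays pinned to the start of the string even when the multi-line flag is on. It only uses \A, because Python's \Z is stricter than in some other flavors (it matches only at the very end of the string).

```python
import re

# Hypothetical sample: two CTR records in one string.
text = "CTR 001\nCTR 002"

# \A ignores the multi-line flag, so only the first line can match.
first_only = re.findall(r"\ACTR.*", text, flags=re.MULTILINE)

# ^ honours the flag, so every line that starts with CTR matches.
every_line = re.findall(r"^CTR.*", text, flags=re.MULTILINE)

print(first_only)  # ['CTR 001']
print(every_line)  # ['CTR 001', 'CTR 002']
```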
As such, another way to match the start of the line would be like this:
/(\A|\r|\n|\r\n)CTR.*/
or
/(^|\r|\n|\r\n)CTR.*/
\r = carriage return / old Mac OS newline
\n = line-feed / Unix/Mac OS X newline
\r\n = windows newline
Note: if you put the backslash \ in a program string that supports escaping, such as a PHP double-quoted string "", you need to escape the backslashes first,
so to run \r\nCTR.* you would write it as "\\r\\nCTR.*"
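The same escaping concern shows up in other languages. As a rough illustration in Python (not PHP, which the note above mentions), a normal string literal needs doubled backslashes, while a raw string lets you write the pattern as-is. The sample input is hypothetical.

```python
import re

# Backslashes doubled by hand in a normal string literal.
escaped = "\\r\\nCTR.*"

# Raw string literal: backslashes are passed through to the regex engine untouched.
raw = r"\r\nCTR.*"

# Both spellings produce the same pattern text.
same = escaped == raw

# Hypothetical input using Windows line endings.
match = re.search(raw, "HDR 001\r\nCTR 002")

print(same)
print(repr(match.group(0)))
```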
^CTR
or
^CTR.*
edit:
To be clearer: ^CTR matches the start of a line followed by those characters. If all you want is to test a line you already have in hand, that is all you really need. In that case, though, you may be better off using a built-in substr()-type function; I don't know what language you are using. If you are trying to match and grab the whole line, you will need something like .* or .*$, depending on the language and regex function you are using.
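As a sketch of the non-regex route in Python (the language is an assumption here, since the question does not name one), a plain prefix test or slice is enough when each line is already available. The HDR and TRL record types and the helper name record_type are invented for the example.

```python
def record_type(line: str) -> str:
    """Return the three-letter record type at the start of a line."""
    return line[:3]


# Hypothetical records; only the CTR prefix comes from the question.
lines = ["HDR 20240101", "CTR 0001 42", "CTR 0002 17", "TRL 2"]

# str.startswith plays the role of a substr()-type check.
ctr_lines = [line for line in lines if line.startswith("CTR")]

print(ctr_lines)
print(record_type("CTR 0001 42"))
```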
Regex symbol to match at beginning of a line:
^
Add the string you're searching for (CTR) to the regex like this:
^CTR
Example: regex
That should be enough!
However, if you need to get the text from the whole line in your language of choice, add a "match anything" pattern .*:
^CTR.*
Example: more regex
If you want to get crazy, use the end of line matcher
$
Add that to the growing regex pattern:
^CTR.*$
Example: lets get crazy
Note: Depending on how and where you're using regex, you might have to use a multi-line modifier to get it to match multiple lines. There could be a whole discussion on the best strategy for picking lines out of a file to process them, and some of the strategies would require this:
Multi-line flag m (this is specified in various ways in various languages/contexts)
/^CTR.*/gm
Example: we had to use m on regex101
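One strategy that avoids the multi-line flag entirely is to read the file one line at a time and test each line on its own. A minimal Python sketch follows; the sample data is made up, io.StringIO stands in for a real file, and ctr_records is a hypothetical helper name.

```python
import io
import re

CTR_LINE = re.compile(r"^CTR.*")


def ctr_records(handle):
    """Yield lines from a file-like object that start with CTR."""
    for line in handle:
        line = line.rstrip("\r\n")  # drop the line terminator before matching
        if CTR_LINE.match(line):
            yield line


# io.StringIO stands in for an open file in this sketch.
sample = io.StringIO("HDR 001\nCTR 002\nXCTR 003\nCTR 004\n")
found = list(ctr_records(sample))

print(found)  # ['CTR 002', 'CTR 004']
```

Because each line is matched separately, ^ always refers to the start of that line and no m flag is needed.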
Try ^CTR.*, which literally means start of line, CTR, then anything.
This is case-sensitive. How you turn on case-insensitive matching depends on your programming language; alternatively, use ^[Cc][Tt][Rr].* if case-insensitivity has to work across environments.
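Both approaches can be compared side by side. This is a sketch in Python, where the case-insensitive switch happens to be re.IGNORECASE; other languages spell it differently. The sample strings are hypothetical.

```python
import re

# Option 1: rely on the language's case-insensitive flag.
with_flag = re.compile(r"^CTR.*", re.IGNORECASE)

# Option 2: spell out both cases in character classes (no flag needed).
with_classes = re.compile(r"^[Cc][Tt][Rr].*")

samples = ["CTR 001", "ctr 002", "Ctr 003", "XCTR 004"]

flag_hits = [s for s in samples if with_flag.search(s)]
class_hits = [s for s in samples if with_classes.search(s)]

print(flag_hits)
print(class_hits)
```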
^CTR.*$
matches a line starting with CTR.
Not sure how to apply that to your file on your server, but typically, the regex to match the beginning of a string would be :
^CTR
The ^ means beginning of string / line
There are ambiguities in the question.
What is your input string? Is it the entire file? Or is it 1 line at a time? Some of the answers are assuming the latter. I want to answer the former.
What would you like to return from your regular expression? Just a true / false on whether a match was made? Or do you want to extract the entire line that begins with CTR? I'll assume you only want a true / false match.
To do this, we just need to determine if the CTR occurs at either the start of a file, or immediately following a new line.
/(?:^|\n)CTR/
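A quick way to try this pattern against whole-file contents, sketched in Python with invented sample strings (has_ctr_record is a hypothetical helper name):

```python
import re

# CTR at the very start of the input, or right after a newline.
CTR_AT_LINE_START = re.compile(r"(?:^|\n)CTR")


def has_ctr_record(contents: str) -> bool:
    """Return True if any line in the contents starts with CTR."""
    return CTR_AT_LINE_START.search(contents) is not None


print(has_ctr_record("CTR 001\nHDR 002"))   # True: CTR at start of file
print(has_ctr_record("HDR 001\nCTR 002"))   # True: CTR right after a newline
print(has_ctr_record("HDR CTR 001\nXCTR"))  # False: CTR never starts a line
```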
(?i)^[ \r\n]*CTR
(?i) -- case insensitive -- Remove if case sensitive.
[ \r\n] -- ignore space and new lines
* -- 0 or more times the same
CTR - your starts with string.
I am trying to use regular expression to extract data from a selector. But I find caret and dollar sign doesn't work as I expected.
I was using .* to test the ^ and $ sign as below. I thought two lines below should return the same thing.
But the first one just returns an empty list. And the second one returns the entire block as I expected.
response.xpath('//script[contains(.,"reports")]/text()').re('^.*$')
response.xpath('//script[contains(.,"reports")]/text()').re('.*')
By default, . does not match newlines, so .* stops at the first line break.
^ matches the beginning of the string, or the beginning of a line if the multiline flag is set.
The same goes for $: it matches the end of the string, or the end of a line if the multiline flag is set.
For better testing try ^[\s\S]*$ expression. This will include any symbols in between of string start and string end.
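The behaviour can be reproduced with plain Python re, without Scrapy. In this sketch the script text is a made-up stand-in for the selector's output, and re.DOTALL is shown as an additional option that the answer does not mention.

```python
import re

# Hypothetical multi-line script body.
text = "var reports = [\n  1,\n  2\n];"

# . cannot cross the first newline, so $ is never reached: no match.
anchored_dot = re.search(r"^.*$", text)

# [\s\S] matches any character, including newlines.
anchored_any = re.search(r"^[\s\S]*$", text)

# Alternative (an assumption on top of the answer): let . match newlines too.
anchored_dotall = re.search(r"^.*$", text, flags=re.DOTALL)

print(anchored_dot)                     # None
print(anchored_any.group(0) == text)    # True
print(anchored_dotall.group(0) == text) # True
```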
I just started PySpark, here is the task:
I have an input of:
I need to use a regex to remove punctuation and all leading or trailing space and underscore. output is all lowercase.
What I came up is not complete:
sentence = regexp_replace(trim(lower(column)), '\\*\s\W\s*\\*_', '')
and the result is:
How do I fix the regex here? I need to use regexp_replace here.
Thank you very much.
You may use
^\W+|\W+$|[^\w\s]+|_
The ^ and $ anchors must match line start/end.
If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and the ^\W+ pattern with ^[^\w\n]+:
^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_
See the regex demo.
Explanation:
^ - start of line (if the multiline option is on by default; otherwise, try adding (?m) at the start of the pattern)
[^\w\n]+ - 1 or more non-word chars (non-[a-zA-Z0-9_]) except a newline
| - or
[^\w\n]+$ - 1 or more non-word chars at the end of the line ($)
| - or
[^\w\s]+ - 1 or more non-word chars except any whitespace
| - or
_ - an underscore.
If you do not really care about Unicode (I used \w, \s that can be made Unicode aware), you may just use a shorter, more simple pattern:
^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+
See this regex demo.
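To check how the first pattern behaves, here is a sketch using Python's re module rather than Spark. Spark's regexp_replace uses Java regular expressions, so treat this only as an approximation of what the column expression would do. The input strings and the helper name clean are made up.

```python
import re

# First pattern from the answer: leading/trailing non-word chars,
# punctuation anywhere, and underscores.
PUNCT = re.compile(r"^\W+|\W+$|[^\w\s]+|_")


def clean(sentence: str) -> str:
    """Lowercase the sentence and strip punctuation, underscores and edge whitespace."""
    return PUNCT.sub("", sentence.lower())


print(clean("  Hello, World!__ "))  # hello world
print(clean("What's up?"))          # whats up
```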
TL;DR: sentence = column.strip(' \t\n*+_')
If you want to remove characters only from the ends and don't care about unicode, then the basic string strip() function will let you pick characters to strip. It defaults to whitespace, but you can put in whatever you want.
If you want to remove within a string you are stuck with a regular expression or, if using byte strings or Python 2, maketrans.
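A short illustration of the strip() approach, assuming the value is an ordinary Python str rather than a Spark column; the sample text is invented.

```python
# Hypothetical raw value with junk on both ends.
raw = " \t**+_hello_world_+** \n"

# strip() removes any of the listed characters from both ends only;
# the underscore inside the text is left alone.
stripped = raw.strip(" \t\n*+_")

print(stripped)  # hello_world
```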
You may like to look at this question as well.
I'm trying to create a regex that will select an entire line where it contains a matching string.
I can't seem to get it to work. Here is the expression:
^.*?(\bEventname 2\b).*$
You can see the test case and what I've tried here:
https://www.regex101.com/r/mT5rZ3/1
here's what I use and it works perfectly for me
^.*substring.*$
This answer solves the question with 463 steps instead of 952 steps. Just ensure a new line at the end of the file.
.*Eventname 2.*\n
https://www.regex101.com/r/mT5rZ3/5
EDIT 4-9-2022
With .*Eventname 2.*\n? it also solves with 463 steps, but there is no need to ensure a new line at the end of the file.
In PHP regex, . doesn't match newlines by default. So
.*(\bEventname 2\b).*
would be enough. If . did match newlines, you would need .*? to make the dots non-greedy (so it matches just one line instead of everything). You also need to be in multi-line mode to use ^ and $ per line, but that shouldn't be necessary here (since you only want to match one line anyway).
Try this:
(.*(?:Eventname 2).*)
Explanation:
( ... ) : groups and captures the line
(?:...) : groups without capturing the string that the line needs to contain
.* : any characters
You are using a string containing several lines. By default, the ^ and $ operators will match the beginning and end of the whole string. The m modifier will cause them to match the beginning and end of a line.
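The effect of the m modifier on the original pattern can be checked with a small Python sketch (re.MULTILINE is Python's spelling of m; the event lines are invented for illustration):

```python
import re

# Hypothetical multi-line input; "Eventname 20" must not match because of \b.
text = "Eventname 1 start\nEventname 2 start\nEventname 20 start\n"

pattern = r"^.*\bEventname 2\b.*$"

# Without the flag, ^ and $ are tied to the whole string and nothing matches.
without_m = re.findall(pattern, text)

# With the flag, each line is tested separately.
with_m = re.findall(pattern, text, flags=re.MULTILINE)

print(without_m)  # []
print(with_m)     # ['Eventname 2 start']
```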
I am trying to make simple regex that will check if a line is blank or not.
Case;
" some" // not blank
" " //blank
"" // blank
The pattern you want is something like this in multiline mode:
^\s*$
Explanation:
^ is the beginning of string anchor.
$ is the end of string anchor.
\s is the whitespace character class.
* is zero-or-more repetition of.
In multiline mode, ^ and $ also match the beginning and end of the line.
References:
regular-expressions.info/Anchors, Character Classes, and Repetition.
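The three cases from the question can be checked with a short Python sketch. re.fullmatch requires the whole string to match, so no explicit anchors are needed; is_blank is a hypothetical helper name.

```python
import re


def is_blank(line: str) -> bool:
    """Return True if the line is empty or contains only whitespace."""
    return re.fullmatch(r"\s*", line) is not None


print(is_blank(" some"))  # False
print(is_blank(" "))      # True
print(is_blank(""))       # True
```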
A non-regex alternative:
You can also check if a given string line is "blank" (i.e. containing only whitespaces) by trim()-ing it, then checking if the resulting string isEmpty().
In Java, this would be something like this:
if (line.trim().isEmpty()) {
// line is "blank"
}
The regex solution can also be simplified without anchors (because of how matches is defined in Java) as follows:
if (line.matches("\\s*")) {
// line is "blank"
}
API references
String String.trim()
Returns a copy of the string, with leading and trailing whitespace omitted.
boolean String.isEmpty()
Returns true if, and only if, length() is 0.
boolean String.matches(String regex)
Tells whether or not this (entire) string matches the given regular expression.
Actually in multiline mode a more correct answer is this:
/((\r\n|\n|\r)$)|(^(\r\n|\n|\r))|^\s*$/gm
The accepted answer: ^\s*$ does not match a scenario when the last line is blank (in multiline mode).
Try this:
^\s*$
Full credit to bchr02 for this answer. However, I had to modify it a bit to catch the scenario for lines that have */ (end of comment) followed by an empty line. The regex was matching the non empty line with */.
New: /(^(\r\n|\n|\r)$)|(^(\r\n|\n|\r))|^\s*$/gm
All I did was add ^ right after the first opening parenthesis to signify the start of a line.
The most portable regex would be ^[ \t\n]*$ to match an empty string (note that you would need to replace \t and \n with tab and newline accordingly) and [^ \n\t] to match a non-whitespace string.
It depends on what you mean by "blank":
a line that contains only whitespace, or a line that contains nothing at all.
If you want to match a line which contains nothing, then use '/^$/'.
Somehow none of the answers from here worked for me when I had strings which were filled just with spaces and occasionally strings having no content (just the line terminator), so I used this instead:
if (str.trim().isEmpty()) {
doSomethingWhenWhiteSpace();
}
Well... I tinkered around (using Notepad++) and this is the solution I found
\n\s
\n for end of line (where you start matching) -- the caret would not be of help in my case as the beginning of the row is a string
\s takes any space till the next string
hope it helps
This regex will delete all empty spaces (blank) and empty lines and empty tabs from file
\n\s*
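A sketch of how this pattern behaves in a substitution, using Python. The answer does not say what to replace the match with; replacing it with a single newline is an assumption made here so that the non-blank lines stay separate. The sample text is made up.

```python
import re

# Hypothetical text with an empty line, a spaces-only line and a tab-indented line.
text = "first\n\n   \n\tsecond\nthird\n"

# \n\s* swallows a newline plus any whitespace after it, including
# further newlines and the indentation of the next non-blank line.
collapsed = re.sub(r"\n\s*", "\n", text)

print(repr(collapsed))  # 'first\nsecond\nthird\n'
```

Note that the leading tab before "second" is removed as well, so this is only suitable when indentation does not matter.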
I am doing this in groovy.
Input:
hip_abc_batch hip_ndnh_4_abc_copy_from_stgig abc_copy_from_stgig
hiv_daiv_batch hip_a_de_copy_from_staging abc_a_de_copy_from_staging
I want to get the last column. basically anything that starts with abc_.
I tried the following regex (it works for the second line but not the first).
\abc_.*\
but that gives me everything after abc_batch
I am looking for a regex that will fetch me anything that starts with abc_
but I can not use \^abc_.*\ since the whole string does not start with abc_
It sounds like you're looking for "words" (i.e., sequences that don't include spaces) that begin with abc_. You might try:
/\babc_.*\b/
The \b means (in some regular expression flavors) "word boundary."
Try this:
/\s(abc_.*)$/m
Here is a commented version so you can understand how it works:
\s # match one whitespace character
(abc_.*) # capture a string that starts with "abc_" and is followed
# by any character zero or more times
$ # match the end of the string
Since the regular expression has the "m" switch it will be a multi-line expression. This allows the $ to match the end of each line rather than the end of the entire string itself.
You don't need to trim the whitespace, as the capture group contains just the text. After a cursory scan of this tutorial I believe this is the way to grab the value of a capture group using Groovy:
matcher = (yourString =~ /\s(abc_.*)$/m)
// this is how you would extract the value from
// the matcher object
matcher[0][1]
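For comparison, the same pattern can be tried in Python against the two input lines from the question. The second pattern, \babc_\S*, is a variant added here as an assumption: it stops at the next whitespace, so it would also pick up abc_ words that are not in the last column.

```python
import re

# The two input lines from the question, joined into one string.
text = (
    "hip_abc_batch hip_ndnh_4_abc_copy_from_stgig abc_copy_from_stgig\n"
    "hiv_daiv_batch hip_a_de_copy_from_staging abc_a_de_copy_from_staging"
)

# Pattern from the answer: whitespace, then abc_... up to the end of each line.
last_columns = re.findall(r"\s(abc_.*)$", text, flags=re.MULTILINE)

# Variant (not from the answers): a whole "word" that starts with abc_.
words = re.findall(r"\babc_\S*", text)

print(last_columns)
print(words)
```

With findall and a single capture group, Python returns just the captured text, so no trimming is needed here either.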
I think you are looking for this: \s(abc_[a-zA-Z_]*)$
If you are using Perl and you read all lines into one string, don't forget to set the m option on your regex (that stands for "Treat string as multiple lines").
Oh, and Regex Coach is your free friend.