caret and dollar sign doesn't work in scrapy s - regex

I am trying to use regular expression to extract data from a selector. But I find caret and dollar sign doesn't work as I expected.
I was using .* to test the ^ and $ sign as below. I thought two lines below should return the same thing.
But the first one just returns an empty list. And the second one returns the entire block as I expected.
response.xpath('//script[contains(.,"reports")]/text()').re('^.*$')
response.xpath('//script[contains(.,"reports")]/text()').re('.*')

.* is not including new lines and line breaks.
^ Matches the beginning of the string, or the beginning of a line if the multiline flag is set.
Same for $ - end of string or end of line in multiline flag is set.
For better testing try ^[\s\S]*$ expression. This will include any symbols in between of string start and string end.

Related

Regular expressions: inserting a word and NOT replacing the found key

I have a list of items, such as:
this_thing.ety
other-stuff.ety
34-pairings.ety
I want to do this:
"At the beginning of every line, insert "images/"
so the result of search/replace with reg exp would yield:
images/this_thing.ety
images/other-stuff.ety
images/34-pairings.ety
I am using:
^.
as my anchor to find the beginning of each line but everything I've tried to add "images/" has resulted in actually replacing that first character. I am using Notepad ++, but can use anything.
I thought using ${foo} was on the right track but I'm missing something here.
In a regex ^.is matching begin of line and a character. If you replace this by 'image', first character, which matched, will be replaced. Empty line wont have 'image' but stay identical (they don't match ^.)
Just use ^ as regexp for begin of line
. is the any character symbol, but can only account for one character. You will want to use ^..*$ or ^.+$ if your version of regex allows so that every line that contains at least one character will be fully replaced. With replace, it would look like this
s/^(.+)$/images\/\1/
where the \1 re-inserts the part in parenthesis in the regex. In older versions of regex, try
s/^^\(..*\)$/\1/

Regular expression to match last line break in file

In my quest to learn flex I'm having a scanner echo input adding line numbers.
After every line I display a counter and increment it.
Trouble is there is always a lone line number at the end of the display.
I need a regex that will ignore all line breaks except for the last one.
I tried [\n/<<EOF>>] to no avail.
Any thoughts?
I don't know what regex engine uses Flex but you can use this regex:
\z
Working demo
\z assert position at the very end of the string.
Matches the end of a string only. Unlike $, this is not affected by
multiline mode, and, in contrast to \Z, will not match before a
trailing newline at the end of a string.
If above regex doesn't work then you can use this one:
(?<=[\S\s])$
Working demo
Edit: since flex seems to work slightly different than other regex engines you could use this regex:
[\s\S]$
To get the latest character of each line. Then you can iterated over all lines until get the last one. Here you have an online flex regex engine tool:
http://ryanswanson.com/regexp/#start
Try below regex, It will search for a new line character at the end of the line.
\n$
Have you tried simply doing:
\n$
Debuggex Demo
The \n matches the newline, the $ matches end of string.

regex match only specific file lines

I have a file containing lines like:
13
13-55
some text 11
I want to create a regex to match only first to type of lines, but not the last one.
Reges created by me is [0-9\-]+
You have to specify that you are testing from the beggining to the end of the string:
Try with following regex:
^[0-9-]+$
Try using anchors (^ and $ to denote beginning and end of string respectively) and use the multiline option (this one depends on the language/engine/environment of the regex).
^[0-9-]+$
Note, you can drop the backslash for the - if it's at the beginning or end of a character class.
If you want to match lines which start with number.
^[0-9-]+$

Regular expression to match characters at beginning of line only

I am trying to work on regular expressions. I have a mainframe file which has several fields. I have a flat file parser which distinguishes several types of records based on the first three letters of every line. How do I write a regular expression where the first three letters are 'CTR'.
Beginning of line or beginning of string?
Start and end of string
/^CTR.*$/
/ = delimiter
^ = start of string
CTR = literal CTR
$ = end of string
.* = zero or more of any character except newline
Start and end of line
/^CTR.*$/m
/ = delimiter
^ = start of line
CTR = literal CTR
$ = end of line
.* = zero or more of any character except newline
m = enables multi-line mode, this sets regex to treat every line as a string, so ^ and $ will match start and end of line
While in multi-line mode you can still match the start and end of the string with \A\Z permanent anchors
/\ACTR.*\Z/m
\A = means start of string
CTR = literal CTR
.* = zero or more of any character except newline
\Z = end of string
m = enables multi-line mode
As such, another way to match the start of the line would be like this:
/(\A|\r|\n|\r\n)CTR.*/
or
/(^|\r|\n|\r\n)CTR.*/
\r = carriage return / old Mac OS newline
\n = line-feed / Unix/Mac OS X newline
\r\n = windows newline
Note, if you are going to use the backslash \ in some program string that supports escaping, like the php double quotation marks "" then you need to escape them first
so to run \r\nCTR.* you would use it as "\\r\\nCTR.*"
^CTR
or
^CTR.*
edit:
To be more clear: ^CTR will match start of line and those chars. If all you want to do is match for a line itself (and already have the line to use), then that is all you really need. But if this is the case, you may be better off using a prefab substr() type function. I don't know, what language are you are using. But if you are trying to match and grab the line, you will need something like .* or .*$ or whatever, depending on what language/regex function you are using.
Regex symbol to match at beginning of a line:
^
Add the string you're searching for (CTR) to the regex like this:
^CTR
Example: regex
That should be enough!
However, if you need to get the text from the whole line in your language of choice, add a "match anything" pattern .*:
^CTR.*
Example: more regex
If you want to get crazy, use the end of line matcher
$
Add that to the growing regex pattern:
^CTR.*$
Example: lets get crazy
Note: Depending on how and where you're using regex, you might have to use a multi-line modifier to get it to match multiple lines. There could be a whole discussion on the best strategy for picking lines out of a file to process them, and some of the strategies would require this:
Multi-line flag m (this is specified in various ways in various languages/contexts)
/^CTR.*/gm
Example: we had to use m on regex101
Try ^CTR.\*, which literally means start of line, CTR, anything.
This will be case-sensitive, and setting non-case-sensitivity will depend on your programming language, or use ^[Cc][Tt][Rr].\* if cross-environment case-insensitivity matters.
^CTR.*$
matches a line starting with CTR.
Not sure how to apply that to your file on your server, but typically, the regex to match the beginning of a string would be :
^CTR
The ^ means beginning of string / line
There's are ambiguities in the question.
What is your input string? Is it the entire file? Or is it 1 line at a time? Some of the answers are assuming the latter. I want to answer the former.
What would you like to return from your regular expression? The fact that you want a true / false on whether a match was made? Or do you want to extract the entire line whose start begins with CTR? I'll answer you only want a true / false match.
To do this, we just need to determine if the CTR occurs at either the start of a file, or immediately following a new line.
/(?:^|\n)CTR/
(?i)^[ \r\n]*CTR
(?i) -- case insensitive -- Remove if case sensitive.
[ \r\n] -- ignore space and new lines
* -- 0 or more times the same
CTR - your starts with string.

How to check if a line is blank using regex

I am trying to make simple regex that will check if a line is blank or not.
Case;
" some" // not blank
" " //blank
"" // blank
The pattern you want is something like this in multiline mode:
^\s*$
Explanation:
^ is the beginning of string anchor.
$ is the end of string anchor.
\s is the whitespace character class.
* is zero-or-more repetition of.
In multiline mode, ^ and $ also match the beginning and end of the line.
References:
regular-expressions.info/Anchors, Character Classes, and Repetition.
A non-regex alternative:
You can also check if a given string line is "blank" (i.e. containing only whitespaces) by trim()-ing it, then checking if the resulting string isEmpty().
In Java, this would be something like this:
if (line.trim().isEmpty()) {
// line is "blank"
}
The regex solution can also be simplified without anchors (because of how matches is defined in Java) as follows:
if (line.matches("\\s*")) {
// line is "blank"
}
API references
String String.trim()
Returns a copy of the string, with leading and trailing whitespace omitted.
boolean String.isEmpty()
Returns true if, and only if, length() is 0.
boolean String.matches(String regex)
Tells whether or not this (entire) string matches the given regular expression.
Actually in multiline mode a more correct answer is this:
/((\r\n|\n|\r)$)|(^(\r\n|\n|\r))|^\s*$/gm
The accepted answer: ^\s*$ does not match a scenario when the last line is blank (in multiline mode).
Try this:
^\s*$
Full credit to bchr02 for this answer. However, I had to modify it a bit to catch the scenario for lines that have */ (end of comment) followed by an empty line. The regex was matching the non empty line with */.
New: (^(\r\n|\n|\r)$)|(^(\r\n|\n|\r))|^\s*$/gm
All I did is add ^ as second character to signify the start of line.
The most portable regex would be ^[ \t\n]*$ to match an empty string (note that you would need to replace \t and \n with tab and newline accordingly) and [^ \n\t] to match a non-whitespace string.
Here Blank mean what you are meaning.
A line contains full of whitespaces or a line contains nothing.
If you want to match a line which contains nothing then use '/^$/'.
Somehow none of the answers from here worked for me when I had strings which were filled just with spaces and occasionally strings having no content (just the line terminator), so I used this instead:
if (str.trim().isEmpty()) {
doSomethingWhenWhiteSpace();
}
Well...I tinkered around (using notepadd++) and this is the solution I found
\n\s
\n for end of line (where you start matching) -- the caret would not be of help in my case as the beginning of the row is a string
\s takes any space till the next string
hope it helps
This regex will delete all empty spaces (blank) and empty lines and empty tabs from file
\n\s*