Change all numbers' tag using RegEx - regex

Here is my simple text file:
1. Text About Question 1
2. Text About Question 2
.
.
20. Text About Question 20
I have 250 text files, each containing exactly 20 questions. I want to convert these files to XML by adding a "question" tag at the beginning of every numbered line, so they will look like:
<question>1. Text About Question 1
<question>2. Text About Question 2
.
.
<question>20. Text About Question 20<question>
I have tried this regex: find (\d{1}.) and replace with \1, which only affects the numbers 1 to 9. From 10 onward it splits the number, like:
1<question>0. Text About Question 10
As a second attempt, the regex (\d{2}.) only affects the numbers 10 to 20, so the result looks like:
1. Text About Question 1
2. Text About Question 2
.
.
<question>20. Text About Question 20</question>
I couldn't then continue with (\d{1}.), because it adds the tag a second time to the numbers between 10 and 20, so the result looks like:
<question>1. Text About Question 1 </question>
<question>2. Text About Question 2</question>
.
.
<question><question>20. Text About Question 20</question>
Is there a proper way to tag each question from 1 to 20 using regex?

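You want to match the numbers 1 through 20 at the start of a line. Here is a regex for that:
^([1-9]|1[0-9]|20)\.
Breakdown
^ - Start of line
[1-9] - Any digit between 1 and 9. Note that 0 is not included
| - Or
1[0-9] - A 1 followed by any digit, i.e. 10 to 19
20 - The literal number 20
\. - An escaped period. Unescaped, a period matches any character
Do not end the pattern with $ (end of line): the question text follows the number, so a pattern anchored at both ends would only match lines that contain nothing but the number. Replace each match with <question> followed by the matched text.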
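For what it's worth, the same idea can be sketched in Python. This assumes the number always sits at the very start of a line and that only the opening <question> tag is needed; tag_questions is a made-up helper name, not part of any library.

```python
import re

# Matches 1-9, 10-19 or 20 followed by a literal period at the start of a line.
# Assumption: no trailing $ anchor, because the question text follows the number.
QUESTION_START = re.compile(r"^(?:[1-9]|1[0-9]|20)\.", re.MULTILINE)


def tag_questions(text):
    """Prefix every numbered line (1. to 20.) with an opening <question> tag."""
    return QUESTION_START.sub(lambda m: "<question>" + m.group(0), text)


sample = (
    "1. Text About Question 1\n"
    "10. Text About Question 10\n"
    "20. Text About Question 20"
)
print(tag_questions(sample))
```

Because the alternation covers the whole number in one match, 10 gets a single tag instead of being split into 1 and 0.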

Related

RegEx for matching group in multiline texts

I have this multi-line text, I want to extract the numerical value before the 'Next' text (in this case 13). The numerical values will change, but the location will stay the same, it indicates total # of pages on website. I am having trouble writing the correct regex to return this value:
Previous
1
2
3
...
13
Next
Showing 1 - 100 of 1227 Results[EXTRACT]
pattern =re.compile(r'(\d{1,2})\r\nNext', re.M)
result = pattern.match(text)
The expected return value is 13.
import re
t = """Previous
1
2
3
...
13
Next
Showing 1 - 100 of 1227 Results[EXTRACT]"""
re.search(r"\d+(?=\s+Next)", t).group(0)
Returns: '13'
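The regular expression matches one or more digits and uses a lookahead assertion to check that they are followed by one or more whitespace characters and then the word Next. Note that re.search is used rather than match: match only tries to match at the very start of the string, which here begins with Previous.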
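To make the difference between match and search concrete, here is a small sketch using a capture group instead of a lookahead. The helper name last_page and the optional \r in the pattern are my own assumptions (the page might use either \n or \r\n line endings).

```python
import re

# Digits on the line directly before "Next"; \r? tolerates Windows line endings.
PAGE_BEFORE_NEXT = re.compile(r"(\d+)\r?\nNext")


def last_page(text):
    """Return the number preceding 'Next' as an int, or None if not found."""
    found = PAGE_BEFORE_NEXT.search(text)  # search scans the whole string
    return int(found.group(1)) if found else None


text = "Previous\n1\n2\n3\n...\n13\nNext\nShowing 1 - 100 of 1227 Results"

# match only tries position 0, where the text starts with "Previous"
print(PAGE_BEFORE_NEXT.match(text))
print(last_page(text))
```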

Regular Expression for parsing a sports score

I'm trying to validate that a form field contains a valid score for a volleyball match. Here's what I have, and I think it works, but I'm not an expert on regular expressions, by any means:
r'^ *([0-9]{1,2} *- *[0-9]{1,2})((( *[,;] *)|([,;] *)|( *[,;])|[,;]| +)[0-9]{1,2} *- *[0-9]{1,2})* *$'
I'm using python/django, not that it really matters for the regex match. I'm also trying to learn regular expressions, so a more optimal regex would be useful/helpful.
Here are rules for the score:
1. There can be one or more valid set (set=game) results included
2. Each result must be of the form dd-dd, where 0 <= dd <= 99
3. Each additional result must be separated by any of [ ,;]
4. Allow any number of sets >=1 to be included
5. Spaces should be allowed anywhere except in the middle of a number
So, the following are all valid:
25-10 or 25 -0 or 25- 9 or 23 - 25 (could be one or more spaces)
25-10,25-15 or 25-10 ; 25-15 or 25-10 25-15 (again, spaces allowed)
25-1 2 -25, 25- 3 ;4 - 25 15-10
Also, I need each result as a separate unit for parsing. So in the last example above, I need to be able to separately work on:
25-1
2 -25
25- 3
4 - 25
15-10
It'd be great if I could strip the spaces from within each result. I can't just strip all spaces, because a space is a valid separator between result sets.
I think this is a solution to your problem. In Python, use re.sub, where score is the string holding the field value:
re.sub(r"(\d{1,2})\s*-\s*(\d{1,2})", r"\1-\2", score)
How it works:
(\d{1,2}) captures a group of 1 or 2 digits.
\s* matches 0 or more whitespace characters.
- matches a literal hyphen.
\1 in the replacement inserts the content of capture group 1.
\2 in the replacement inserts the content of capture group 2.
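Since the question also asks for validation and for each result as a separate unit, here is a rough Python sketch that does both. The names parse_score, VALID_SCORE and SET_RESULT are hypothetical, and the validation pattern is my reading of the five rules in the question, not a tested replacement for the original regex.

```python
import re

# One set result: 1-2 digits, optional spaces, a hyphen, optional spaces, 1-2 digits.
SET_RESULT = re.compile(r"(\d{1,2})\s*-\s*(\d{1,2})")

# Whole field: results separated by a comma or semicolon (with optional spaces)
# or by whitespace alone.
VALID_SCORE = re.compile(
    r"\s*\d{1,2}\s*-\s*\d{1,2}"
    r"(?:(?:\s*[,;]\s*|\s+)\d{1,2}\s*-\s*\d{1,2})*\s*"
)


def parse_score(field):
    """Return normalized results like ['25-1', '2-25'], or None if invalid."""
    if not VALID_SCORE.fullmatch(field):
        return None
    return [f"{left}-{right}" for left, right in SET_RESULT.findall(field)]


print(parse_score("25-1 2 -25, 25- 3 ;4 - 25 15-10"))
```

Validating first and extracting second keeps each regex simple; the spaces inside each result disappear because only the two captured numbers are kept.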

Notepad++ `Find in Files` displays all results except a certain keyword

I am using notepad++ to search for certain keywords (using regular expression). something like (word1|word2|this statement|another statement). It works but can I search and show all results except a certain keyword? something like exclude the following words (exclude this|exclude this)? For example, below.
samedir\File1.log
This is line 1
This is line 2
This is line 3
exclude this
This is line 4
This is line 5
This is line 6
not excluded
excluding this
samedir\File2.log
This is line 1 1
This is line 2 1
This is line 3 1
exclude this
This is line 4 1
This is line 5 1 1
This is line 6 1
not excluded
excluding this
For example, I want to run a find across both files (in the same directory) but exclude the lines containing excluding this and exclude this.
The results should look something like this:
File1.log
This is line 1
This is line 2
This is line 3
This is line 4
This is line 5
This is line 6
not excluded
File2.log
This is line 1 1
This is line 2 1
This is line 3 1
This is line 4 1
This is line 5 1 1
This is line 6 1
not excluded
You can do this with a lookahead assertion:
^(?!excluding this|exclude this)[^\r\n]*$
This will match entire lines as long as they don't start with excluding this or exclude this. To reject those phrases anywhere in the line, put .* in front of them inside the lookahead.
The key is the (?!) part. See http://www.regular-expressions.info/lookaround.html for more info.
You could try a regex like the one below to match all lines that don't contain exclude this or excluding this.
^(?!.*\bexclud(?:ing|e) this\b).+$
The negative lookahead (?!.*\bexclud(?:ing|e) this\b) at the start asserts that neither exclude this nor excluding this appears anywhere on the line being matched. That is, the regex above matches every line except those that contain exclude this or excluding this.
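The two lookahead patterns above behave differently when the excluded phrase is not at the start of a line. Here is a small Python sketch of that difference; Python's re module is only a stand-in for the Notepad++ engine, and the sample lines (including "please exclude this") are made up for illustration.

```python
import re

# Rejects lines that START with one of the phrases.
STARTS_WITH = re.compile(r"^(?!excluding this|exclude this)[^\r\n]*$", re.MULTILINE)

# Rejects lines that contain one of the phrases ANYWHERE.
ANYWHERE = re.compile(r"^(?!.*\bexclud(?:ing|e) this\b).+$", re.MULTILINE)

log = "\n".join([
    "This is line 1",
    "exclude this",
    "not excluded",
    "please exclude this",   # made-up line: phrase is not at the start
    "excluding this",
])

kept_by_prefix = STARTS_WITH.findall(log)
kept_anywhere = ANYWHERE.findall(log)
print(kept_by_prefix)
print(kept_anywhere)
```

Only the second pattern drops "please exclude this", because its lookahead scans the rest of the line rather than just the first characters.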
I wanted to exclude one string and search for two strings in one line, so I used this regex:
^(?!.*excludethisstring).*includethisstring1.*includethisstring2.*$
This requires the matched line to contain both strings, in that order. If you want to match lines containing either one of the strings:
^(?!.*excludethisstring).*(includethisstring1|includethisstring2).*$
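A quick way to check both variants is to run them over a few lines in Python. This is only a sketch: the sample lines and the keep helper are invented, and Python's re module stands in for the Notepad++ regex engine.

```python
import re

BOTH = re.compile(
    r"^(?!.*excludethisstring).*includethisstring1.*includethisstring2.*$"
)
EITHER = re.compile(
    r"^(?!.*excludethisstring).*(includethisstring1|includethisstring2).*$"
)

# Made-up sample lines covering each case.
lines = [
    "includethisstring1 and includethisstring2",
    "includethisstring1 only",
    "includethisstring1 includethisstring2 excludethisstring",
    "nothing relevant",
]


def keep(pattern, candidates):
    """Return the candidate lines that the pattern matches in full."""
    return [line for line in candidates if pattern.match(line)]


print(keep(BOTH, lines))
print(keep(EITHER, lines))
```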

bash: how to extract text from beginning of a string to the first number? [duplicate]

I've a bunch of files named like this:
text 01 (blabla) other text
text 02 (whatever) other text
.
.
text 025 (etc) other tex
some text 1 (20031020) other text
some text 2 (20031022) other text
.
.
some text 10 (20031025) other text
some new text 01 other text
.
.
.
some new text 200 other text
and I want to extract from the filename only the words before the first number, so from the example above I want to obtain:
text
some text
some new text
I want to do this so I can move each file into the folder it belongs to, based on its file name (creating the folder if it does not exist).
I want to do this with bash. I know it can be done using regex, but I don't know how: I've only seen examples where the fields are delimited by known characters, while in this case the delimiter is a space followed by any number.
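Use ${variable%%pattern}, which removes the longest suffix matching pattern. Including the space in the pattern also drops the trailing space before the number, and quoting the expansion protects names that contain spaces:
$ filename='text 01 (blabla) other text'
$ echo "${filename%% [0-9]*}"
text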
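If a script outside bash is acceptable, the same prefix extraction plus the folder move can be sketched in Python. Both function names are hypothetical, and the sketch assumes the prefix ends at the first whitespace that is followed by a digit; files without such a number are left where they are.

```python
import re
import shutil
from pathlib import Path


def folder_name(filename):
    """Return the part of the name before the first whitespace-plus-digit."""
    return re.split(r"\s+\d", filename, maxsplit=1)[0]


def move_into_folders(directory):
    """Move each file in `directory` into a subfolder named after its prefix."""
    base = Path(directory)
    for path in [p for p in base.iterdir() if p.is_file()]:
        prefix = folder_name(path.name)
        if prefix == path.name:
            continue  # no number in the name, so there is no prefix to use
        target = base / prefix
        target.mkdir(exist_ok=True)  # create the folder if it does not exist
        shutil.move(str(path), str(target / path.name))


print(folder_name("some text 10 (20031025) other text"))
```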

conditionally remove portion of a line in delimited file

I have a ~ delimited text file with about 20 nullable columns.
I am trying to use SED (from cygwin) to "blank out" the value in column 11 if the following conditions are met...
Column 3 is a zero (0)
Column 11 is in date format mm/dd/yy (I'm not really concerned if it's a valid date)
Here's what I'm trying...
s/\([^~]*~[^~]*~0~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~\)\(\d{2}\/\d{2}\/\d{2}~\)\(.*$\)/\1~\3/
Here's a sample from the file:
Test A~7~1~~~~72742050~~~Z370~10/25/11~~~0~8.58563698~6.40910452~4.59198764~3.18239469~1.72955975~.23345372~-1.30891113~-2.89971394~1~0
Test B~7~0~~~~72742060~~~Z351~05/15/12~05/14/12~~0~18.88910518~12.69425528~9.96182381~6.76077612~6.76077612~3.86279298~.22449489~-.91021010~0~0
Test C~7~0~~~~72742060~~~Z352~06/12/12~ABC~~0~20.60845679~17.54889351~15.52912556~12.43279217~12.43279217~10.32033576~9.35296144~8.09245899~0~0
...and here's what I expect to get back
Test A~7~1~~~~72742050~~~Z370~10/25/11~~~0~8.58563698~6.40910452~4.59198764~3.18239469~1.72955975~.23345372~-1.30891113~-2.89971394~1~0
Test B~7~0~~~~72742060~~~Z351~05/15/12~~~0~18.88910518~12.69425528~9.96182381~6.76077612~6.76077612~3.86279298~.22449489~-.91021010~0~0
Test C~7~0~~~~72742060~~~Z352~06/12/12~ABC~~0~20.60845679~17.54889351~15.52912556~12.43279217~12.43279217~10.32033576~9.35296144~8.09245899~0~0
but the file comes through with line 2 completely unchanged.
You are trying to replace column 12 instead of 11:
\([^~]*~[^~]*~0~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~\)\(\d{2}\/\d{2}\/\d{2}~\)\(.*$\)
  1     2     3 4     5     6     7     8     9     10    11        12
If just removing one of the [^~]*~ from the end of the first group doesn't fix it, it could be because your version of sed doesn't support either \d or repetition with {2} (although escaping the curly brackets would probably fix that).
Here is a version that should work everywhere which replaces each \d{2} with [0-9][0-9] (and fixes the incorrect column issue mentioned above):
s/\([^~]*~[^~]*~0~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~\)\([0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]~\)\(.*$\)/\1~\3/
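As a cross-check, splitting on the delimiter avoids counting [^~]*~ groups by hand. The Python sketch below is an alternative to sed, not a fix for it; blank_date_column and its parameters are names I made up. One thing it makes easy to test: the question text says column 11, but the expected output shown in the question appears to blank the 12th field of Test B, so the column number is left as a parameter.

```python
import re

DATE = re.compile(r"\d{2}/\d{2}/\d{2}")  # mm/dd/yy shape only, not a validity check


def blank_date_column(line, flag_col=3, date_col=11, sep="~"):
    """Blank `date_col` (1-based) when `flag_col` is '0' and `date_col` looks like a date."""
    fields = line.split(sep)
    if (
        len(fields) >= max(flag_col, date_col)
        and fields[flag_col - 1] == "0"
        and DATE.fullmatch(fields[date_col - 1])
    ):
        fields[date_col - 1] = ""
    return sep.join(fields)


row = "Test B~7~0~~~~72742060~~~Z351~05/15/12~05/14/12~~0~18.88910518"
print(blank_date_column(row, date_col=11))
print(blank_date_column(row, date_col=12))
```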