I have numerous text files containing book data, and I am trying to extract the International Standard Book Number (ISBN) from each of them. Example snippets:
{" , "classifications": {}, "title": "La casa", "identifiers": {}, "isbn_13": ["978-84-940533-7-5"], "covers": [7281722], "created": {"type": "/type/datetime",
and
"2014-07-28T06:07:52.898549"}, "number_of_pages": 408, "isbn_13": ["9789602354292"],
How would I go about finding and extracting that ISBN information? Some of the ISBNs have dashes and some do not. Is there a way to replace everything in the text file with a blank except for the snippets that match? I've researched several similar questions, but I'm having a hard time comprehending it all, as I am very new to Notepad++.
Let's say your text file has one ISBN per line, surrounded by other text. You would then go through the following steps:
Make a copy of your text file first!
Open your text file in Notepad++.
Ctrl+H
Search mode: Regular expression
Find what: ^.*?(((1[03])*[ ]*(: ){0,1})*(([0-9Xx][- ]*){13}|([0-9Xx][- ]*){10})).*
Replace with: \1
Click on Replace All
For RegEx please search Google or StackOverflow first. For further information have a look at RegExLib.com, the Internet's first Regular Expression Library.
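If the files can also be processed outside Notepad++, the same extraction might be sketched in Python. This is only a sketch under assumptions: it relies on the "isbn_13" (or "isbn_10") key seen in the two snippets above, captures only the first ISBN listed after that key, and the name extract_isbns is made up for illustration.

```python
import re

# Assumption: ISBNs appear as  "isbn_13": ["..."]  or  "isbn_10": ["..."],
# as in the example snippets. Only the first value in the list is captured.
ISBN_RE = re.compile(r'"isbn_1[03]":\s*\["([0-9Xx -]+)"')


def extract_isbns(text):
    """Return the ISBNs found in text, with dashes and spaces removed."""
    return [re.sub(r"[ -]", "", raw) for raw in ISBN_RE.findall(text)]
```

Stripping the dashes puts both variants into the same form, which makes the results easier to compare or deduplicate.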
I have several thousand text files containing form information (one text file for each form), including the unique id of each form.
I have been trying to extract just the form id using regex (which I am not too familiar with) to match the string of characters found before and after the form id and extract only the form ID number in between them. Usually the text looks like this: "... 12 ID 12345678 INDEPENDENT BOARD..."
The 8-digit number (12345678 in this example) is the form ID that I need to extract.
The code I used can be seen below:
$id= ([regex]::Match($text_file, "12 ID (.+) INDEPENDENT").Groups[1].Value)
This works pretty well, but I soon noticed that there were some files for which this script did not work. After investigation, I found that there was another variation to the text containing the form ID used by some of the text files. This variation looks like this: "... 12 ID 12345678 (a.12(3)(b),45)..."
So my first challenge is to figure out how to change the script so that it will match the first or the second pattern. My second challenge is to escape all the special characters in "(a.12(3)(b),45)".
I know that the pipe | is used as an "or" in regex and two backslashes are used to escape special characters, however the code below gives me errors:
$id= ([regex]::Match($text_one_line, "34 PR (.+) INDEPENDENT"|"34 PR (.+) //(a//.12//(3//)//(b//)//,45//)").Groups[1].Value)
Where have I gone wrong here, and how can I fix my code?
Thank you!
When you approach a regex pattern, always look for the fixed vs. variable parts.
In your case the literal text ID seems to be fixed, so it is useful as a reference point.
The following pattern applies this suggestion: (?:ID\s+)(\d{8})
$str = "... 12 ID 12345678 INDEPENDENT BOARD..."
$ret = [Regex]::Matches($str, "(?:ID\s+)(\d{8})")
for($i = 0; $i -lt $ret.Count; $i++) {
    # Index with $i so that every match is printed, not just the first one
    $ret[$i].Groups[1].Value
}
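For comparison, here is a rough Python sketch of both ideas: anchoring on the fixed ID token, and the alternation the question was attempting. The suffix strings come from the question; the names FORM_ID_RE, SUFFIXES and ALTERNATION_RE are illustrative only, and re.escape stands in for escaping the parentheses by hand.

```python
import re

# Approach from the answer: anchor on the fixed "ID" token, capture 8 digits.
FORM_ID_RE = re.compile(r"ID\s+(\d{8})")

# Approach closer to the question: accept either literal suffix after the ID.
# re.escape() escapes the special characters in "(a.12(3)(b),45)".
SUFFIXES = ["INDEPENDENT", "(a.12(3)(b),45)"]
ALTERNATION_RE = re.compile(
    r"12 ID (.+?) (?:" + "|".join(re.escape(s) for s in SUFFIXES) + ")"
)


def extract_form_id(text, pattern=FORM_ID_RE):
    """Return the first captured form ID in text, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None
```

Note that the alternation lives inside a single pattern string; joining two separate strings with | outside the quotes, as in the question, is not regex alternation.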
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. It contains a treasure trove of useful information.
I have a file with some paragraphs. What I want to do is highlight certain text patterns/words occurring in the file with a yellow background and black text.
pattern = ["enough", "too much"];
Text file = "text.txt";
and show it on a webpage, with the words "enough" and "too much" from the text file highlighted.
I want to use Perl to do this task.
Please tell me how I can do this in an optimized way.
Put all the words you want to highlight in an array.
Read the contents of the input file into the $file variable.
Run foreach on that array and use a regular-expression substitution to wrap each occurrence of the word in an HTML tag.
i.e.
foreach my $word (@words)
{
    # \Q...\E treats the word as literal text rather than as a regex
    $file =~ s/(\Q$word\E)/<span style="color: black; background-color: yellow">$1<\/span>/g;
}
Save $file to a new file with an .html or .htm extension.
This was more of a logic question than a technical one, I guess.
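The question asks for Perl, but the same logic can be sketched in Python to make the steps concrete. This is a minimal sketch, not a tuned implementation: it combines all phrases into one pattern so the text is scanned once, and it HTML-escapes the surrounding text, which the steps above do not mention. The names highlight, SPAN_OPEN and SPAN_CLOSE are made up for illustration.

```python
import html
import re

# Inline style taken from the question: black text on a yellow background.
SPAN_OPEN = '<span style="color: black; background-color: yellow">'
SPAN_CLOSE = "</span>"


def highlight(text, phrases):
    """Return text as HTML with every occurrence of a phrase wrapped in a span."""
    if not phrases:
        return html.escape(text)
    # Longest phrases first, so a longer phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered))

    parts = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then wrap the match itself.
        parts.append(html.escape(text[pos:match.start()]))
        parts.append(SPAN_OPEN + html.escape(match.group(0)) + SPAN_CLOSE)
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)
```

Writing the result into a file with an .html extension would then correspond to the last step above.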
I'm trying to create a custom syntax language file to highlight and help with creating new documents in Sublime Text 2. I have come pretty far, but I'm stuck at a specific problem regarding Regex searches in the tmLanguage file. I simply want to be able to match a regex over multiple lines within a YAML document that I then convert to PList to use in Sublime Text as a package. It won't work.
This is my regex:
/(foo[^.#]*bar)/
And this is how it looks inside the tmLanguage YAML document:
patterns:
- include: '#test'
repository:
  test:
    comment: Tester pattern
    name: constant.numeric.xdoc
    match: (foo[^.#]*bar)
If I build this YAML to a tmLanguage file and use it as a package in Sublime Text, I create a document that uses this custom syntax, try it out and the following happens:
This WILL match:
foo 12345 bar
This WILL NOT match:
foo
12345
bar
In a Regex tester, they should and will both match, but in my tmLanguage file it does not work.
I also already tried to add modifiers to my regex in the tmLanguage file, but the following either don't work or break the document entirely:
match: (/foo[^.#]*bar/gm)
match: /(/foo[^.#]*bar/)/gm
match: /foo[^.#]*bar/gm
match: foo[^.#]*bar
Note: My Regex rule works in the tester, this problem occurs in the tmLanguage file in Sublime Text 2 only.
Any help is greatly appreciated.
EDIT: The reason I use a match instead of begin/end clauses is because I want to use capture groups to give them different names. If someone has a solution with begin and end clauses where you can still name 'foo', '12345' and 'bar' differently, that's fine by me too.
I found that this is impossible to do. The following is taken directly from the manual for TextMate, the text editor whose grammar format Sublime Text uses.
12.2 Language Rules
<...>
Note that the regular expressions are matched against only a single
line of the document at a time. That means it is not possible to use a
pattern that matches multiple lines. The reason for this is technical:
being able to restart the parser at an arbitrary line and having to
re-parse only the minimal number of lines affected by an edit. In most
situations it is possible to use the begin/end model to overcome this
limitation.
My situation is one of the few in which a begin/end model cannot overcome the limitation. Unfortunate.
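The difference between a regex tester and the grammar engine can be illustrated with a small Python sketch. It only mimics the "one line at a time" behaviour described in the manual and is not how Sublime Text is actually implemented; the function names are made up for illustration.

```python
import re

# The pattern from the question. [^.#]* also matches newline characters,
# which is why a regex tester working on the whole text finds a match.
PATTERN = re.compile(r"foo[^.#]*bar")


def matches_whole_text(text):
    """Match against the full text at once, as a regex tester does."""
    return PATTERN.search(text) is not None


def matches_line_by_line(text):
    """Match against each line separately, as the manual describes."""
    return any(PATTERN.search(line) for line in text.splitlines())
```

With the single-line input "foo 12345 bar" both functions find a match; with foo, 12345 and bar on separate lines only the whole-text version does.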
Long time since asked, but are you sure you can't use begin/end? I had similar problems with begin/end until I got a better grasp of the syntax/logic. Here's a rough example from a json tmLanguage file I'm doing (don't know the proper YAML syntax).
"repository": {
    "foobar": {
        "begin": "foo(?=[^.#]*)", // not sure about what's needed for your circumstance. the lookahead probably only covers the foo line
        "end": "bar",
        "beginCaptures": {
            "0": {
                "name": "foo"
            }
        },
        "endCaptures": {
            "0": {
                "name": "bar"
            }
        },
        "patterns": [
            {"include": "#test-after-foobarmet"}
        ]
    },
    "test-after-foobarmet": {
        "comment": "this can apply to many lines before next bar so you may need more testing",
        "comment2": "you could continue to have captures here that go to another deeper level...",
        "name": "constant.numeric.xdoc",
        "match": "anyOtherRegexNeeded?"
    }
}
I didn't follow your "I need to number the different sections between the '#' and '.' characters", but you should be able to put a pattern in test-after-foobarmet with more captures if you need to name different groups between foo and bar.
There's a good explanation of TextMate grammars here. It may still contain some errors, but it explains the topic in a way that helped me when I knew nothing about it.
I'm having a battle with a regex. (MOBI creation)
I have two files: one with XML, the other an HTML table of contents.
The important parts of the XML:
<navPoint id="_NeedsHTMLid" playOrder="40">
<navLabel><text>Needs anchor text from link.)</text></navLabel>
...
The HTML TOC, of course, looks like:
schema.org Article Mark-up
======
Hours and hours... worked with Textpad forever. Saw remarks here, now I'm using NotePad++... some of the regex results are different (NOT that I had it working anyway.) #_[\b(\w\b] was returning the ID: now? Not so much!
Does anyone know how to yank both the ID and the anchor text out of these? I'd be so grateful.
You can use this to get the id and the anchor text at the same time:
_(\w+)\b|([a-zA-Z\s.]+[)]+)
#_[\b(\w\b] is not a valid regex. Try _([^"]+)\b.
Edited: try [^"] in place of \w.
If you want to match the ids and the text, go to Search > Find menu (shortcut CTRL+F) and do the following:
Find what:
id="([a-zA-Z0-9\-\:\_\.]+)"|<text>(.+?)<\/text>
Select radio button "Regular Expression"
Then press Find All in Current Document
You can test it with your example at regex101.
Here's a StackOverflow post about valid id names.
I didn't provide a search-and-replace solution, since you didn't mention anything about a replacement.
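If a script is an option, the same pattern can be applied in Python to collect the ids and the anchor texts separately. This is only a sketch: the function name extract_ids_and_texts is made up, and it assumes each text element sits on a single line, as in the XML excerpt from the question.

```python
import re

# Same alternation as in the answer: group 1 captures an id attribute value,
# group 2 captures the content of a <text> element.
NAV_RE = re.compile(r'id="([A-Za-z0-9\-:_.]+)"|<text>(.+?)</text>')


def extract_ids_and_texts(xml):
    """Return (ids, texts) in the order they appear in xml."""
    ids, texts = [], []
    for match in NAV_RE.finditer(xml):
        if match.group(1) is not None:
            ids.append(match.group(1))
        else:
            texts.append(match.group(2))
    return ids, texts
```

For anything beyond quick extraction, a real XML parser would be more robust than a regex.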
I want to find a section of a text file using a regular expression. I have a file like the one below:
This is general text section that I don't want.
HEADER:ABC1
This is section text under header
More text to follow
Additional text to follow
HEADER:XYZ2
This is section text under header
More text to follow
Additional text to follow
HEADER:KHJ3
This is section text under header
A match text will look like this A:86::ABC
Now, I want to retrieve all section text up to HEADER if the section text contains the match A:86::ABC. The result text will be
(HEADER:KHJ3
This is section text under header
A match text will look like this A:86::ABC).
I appreciate any help. I am using Python, and there can be more than one matching section in a file. Also, this is a multi-line file.
>>> import re
>>> regex = re.compile(".*(HEADER.*$.*A:86::ABC)", re.MULTILINE|re.DOTALL)
>>> regex.findall(string)
[u'HEADER:KHJ3\nThis is section text under header \nA match text will look like this A:86::ABC']
Hopefully this helps.
For 2 captures use ".*(HEADER.*$)(.*A:86::ABC)"
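Since the question mentions that more than one section can match, here is an alternative sketch that first splits the text into HEADER sections and then keeps those containing the match text. It is a different technique from the single greedy regex above, and it returns each matching section in full rather than stopping at the match text. The names SECTION_RE and sections_containing are made up for illustration.

```python
import re

# A section starts at a line beginning with "HEADER:" and runs until the
# next such line or the end of the text.
SECTION_RE = re.compile(r"^HEADER:.*?(?=^HEADER:|\Z)", re.MULTILINE | re.DOTALL)


def sections_containing(text, needle):
    """Return every HEADER section of text that contains needle."""
    return [
        section.rstrip()
        for section in SECTION_RE.findall(text)
        if needle in section
    ]
```

Text before the first HEADER line is ignored, which matches the "general text section that I don't want" in the example file.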