I have many text files and need to locate certain words. The words may appear anywhere in a file, but I only want the occurrences that are inside quotation marks.
Example: in the text below, find the word "search" only when it is inside quotes (the search word may vary).
1. text text text text text text search text
2. text "search text text text text" text
3. text "SEARCH text text text text" text
For this precise example, I would expect only lines 2 and 3 to match.
Thanks to anyone who can help me.
If you can guarantee that there'll be only one set of quotes, then
/".*search.*"/i
should do. But if there can be more than one pair of quotes, then you have to ensure that an even number of quotes have been passed, lest you mistake a closing quote for an opening quote:
/^[^"]*("[^"]*"[^"]*)*"[^"]*search[^"]*"/i
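If you would rather script this than use an editor, the even-number-of-quotes pattern above can be sketched in Python. This is only a minimal sketch: the function name has_quoted_word and the sample lines are illustrative assumptions, and every " is treated as a plain delimiter (no escaped quotes).

```python
import re


def has_quoted_word(line: str, word: str) -> bool:
    """Return True if `word` occurs inside a pair of double quotes on `line`."""
    pattern = (
        r'^[^"]*(?:"[^"]*"[^"]*)*'  # skip a prefix containing an even number of quotes
        r'"[^"]*' + re.escape(word) + r'[^"]*"'  # then a quoted span containing the word
    )
    return re.search(pattern, line, re.IGNORECASE) is not None


# Sample lines modelled on the question (hypothetical data).
lines = [
    'text text text text text text search text',
    'text "search text text text text" text',
    'text "SEARCH text text text text" text',
]
for line in lines:
    print(has_quoted_word(line, "search"), line)
```

Only the second and third sample lines are reported as matches. Passing the word through re.escape lets the search word vary without its characters being read as regex syntax.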
If you want all words between double quotes, I would simply use grep:
grep -E -o '"[^"]*"' inputfile
If you want only the first word:
sed -E 's/.+"([[:alpha:]]+) .*/\1/' inputfile
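The same two extractions can be sketched in Python, which may be easier to adapt when a line holds several quoted spans. The function names quoted_spans and first_quoted_words are made up for illustration, and the sketch assumes quotes are always paired on a single line.

```python
import re

# A quoted span: an opening quote, anything except a quote, a closing quote.
QUOTED = re.compile(r'"([^"]*)"')


def quoted_spans(line: str) -> list:
    """Return the text of every double-quoted span on the line."""
    return QUOTED.findall(line)


def first_quoted_words(line: str) -> list:
    """Return the first word of every double-quoted span on the line."""
    words = []
    for span in quoted_spans(line):
        parts = span.split()
        if parts:  # skip empty quotes such as ""
            words.append(parts[0])
    return words


print(first_quoted_words('2. text "search text text text text" text'))
```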
I have text like this:
something text
(some text here image and more text)
some more text
(test)
text
I want to do a regex search for everything between the two parentheses and look for the word image. If that word exists between the parentheses, I want to add a new line AFTER that line. So my regex should produce this output:
something text
(some text here image and more text)

some more text
(test)
text
How can I best achieve this? I've tried (?<=\()(?=image)(?=\)) but that didn't work.
Using a word boundary:
\(.*\bimage\b.*\)
To capture that pattern when matching, place it within parentheses:
(\(.*\bimage\b.*\))
Then try referencing the group using $1 (or \1 depending on the language in which the regex is being used).
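As a rough illustration of the capture-and-reference idea, here is how it might look in Python, where the group is referenced as \1 in the replacement. The function name and the sample text are assumptions for the sketch, and it relies on . not matching newlines, so each match stays on one line.

```python
import re

# A parenthesised span on one line that contains the standalone word "image".
PAREN_WITH_IMAGE = re.compile(r'(\(.*\bimage\b.*\))')


def add_blank_line_after_image(text: str) -> str:
    """Append a newline after every parenthesised span containing 'image'."""
    return PAREN_WITH_IMAGE.sub(r'\1\n', text)


sample = (
    "something text\n"
    "(some text here image and more text)\n"
    "some more text\n"
    "(test)\n"
    "text"
)
print(add_blank_line_after_image(sample))
```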
You didn't mention a tool, but with sed:
sed 's/(.*image.*)/&\n/' file
To restrict the match to the standalone word "image", use the \b word boundary (a GNU sed extension, I think):
sed 's/(.*\bimage\b.*)/&\n/' file
You can use the following regex to search for the word image between parentheses. Replacing the match with the captured group followed by a newline gives the expected result:
(\(.*?image.*?\))
input >> something text
(some text here image and more text)
some more text
(test)
text
regex search >> (\(.*?image.*?\))
replace with >> `$1\n`
output >> something text
(some text here image and more text)

some more text
(test)
text
My file contains more than 2000 abstracts with more than 18000 sentences in total, each starting with an opening tag and ending with a closing tag. I want to extract information from it using Notepad++. A sample of my file is below:
<abstract>
<sentence>Activationofthe<conslex="CD28_surface_receptor"sem="G#protein_family_or_group"><conslex="CD28"sem="G#protein_molecule">CD28</cons>surfacereceptor</cons>providesamajorcostimulatorysignalfor<conslex="T_cell_activation"sem="G#other_name">Tcellactivation</cons>resultinginenhancedproductionof<conslex="interleukin-2"sem="G#protein_molecule">interleukin-2</cons>(<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>)and<conslex="cell_proliferation"sem="G#other_name">cellproliferation</cons>.</sentence>
<sentence>In<conslex="primary_T_lymphocyte"sem="G#cell_type">primaryTlymphocytes</cons>weshowthat<conslex="CD28"sem="G#protein_molecule">CD28</cons>ligationleadstotherapidintracellularformationof<conslex="reactive_oxygen_intermediate"sem="G#inorganic">reactiveoxygenintermediates</cons>(<conslex="ROI"sem="G#inorganic">ROIs</cons>)whicharerequiredfor<conslex="CD28-mediated_activation"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-mediatedactivation</cons>ofthe<conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>/<conslex="CD28-responsive_complex"sem="G#protein_complex"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-responsivecomplex</cons>and<conslex="IL-2_expression"sem="G#other_name"><conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expression</cons>.</sentence>
<sentence>Delineationofthe<conslex="CD28_signaling_cascade"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>signalingcascade</cons>wasfoundtoinvolve<conslex="protein_tyrosine_kinase_activity"sem="G#other_name"><conslex="protein_tyrosine_kinase"sem="G#protein_family_or_group">proteintyrosinekinase</cons>activity</cons>,followedbytheactivationof<conslex="phospholipase_A2"sem="G#protein_molecule">phospholipaseA2</cons>and<conslex="5-lipoxygenase"sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Ourdatasuggestthat<conslex="lipoxygenase_metabolite"sem="G#protein_family_or_group"><conslex="lipoxygenase"sem="G#protein_molecule">lipoxygenase</cons>metabolites</cons>activate<conslex="ROI_formation"sem="G#other_name"><conslex="ROI"sem="G#inorganic">ROI</cons>formation</cons>whichtheninduce<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expressionvia<conslex="NF-kappa_B_activation"sem="G#other_name"><conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>activation</cons>.</sentence>
<sentence>Thesefindingsshouldbeusefulfor<conslex="therapeutic_strategies"sem="G#other_name">therapeuticstrategies</cons>andthedevelopmentof<conslex="immunosuppressants"sem="G#other_name">immunosuppressants</cons>targetingthe<conslex="CD28_costimulatory_pathway"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>costimulatorypathway</cons>.</sentence>
</abstract>
I want to extract the text between quotation marks; in other words, I want to remove everything except what is inside double quotes throughout the text. For example, my desired output looks like this:
CD28_surface_receptor G#protein_family_or_group CD28 G#protein_molecule
primary_T_lymphocyte G#cell_type
I used .*"(.*)".* in Find What and \1 in Replace With, then clicked Replace All. That only extracted the last quoted string on each line, but I want every quoted string on every line, since each line contains many of them.
You can use [^"]*"([^"]+)"[^"]* in Find What, and replace with \1\r\n to put each value on its own line.
Or, to have them tab-separated, replace with \1\t.
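Outside Notepad++, the same extraction could be sketched in Python. The function name quoted_values is hypothetical, and the sample is a trimmed excerpt of the first sentence from the question, not the full line.

```python
import re


def quoted_values(line: str) -> list:
    """Return every non-empty double-quoted value on the line, in order."""
    return re.findall(r'"([^"]+)"', line)


# Trimmed excerpt of the first <sentence> line from the question.
sample = (
    '<sentence>Activationofthe<conslex="CD28_surface_receptor"'
    'sem="G#protein_family_or_group"><conslex="CD28"'
    'sem="G#protein_molecule">CD28</cons>surfacereceptor</cons></sentence>'
)
print(" ".join(quoted_values(sample)))
```

Joining with " " gives one space-separated row per input line, like the desired output in the question; joining with "\t" or "\n" would mirror the two replacement variants above.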
I'm editing a book in LaTeX and its quotation marks syntax is different from the simple " characters. So I want to convert "quoted text here" to ``quoted text here''.
I have 50 text files with lots of quotations inside. I tried to write a regular expression to substitute the first " with `` and the second " with '', but I failed. I searched on the internet and asked some friends, but had no success at all. The closest I got to replacing the first quotation mark is
s/"[a-z]/``/g
but this is clearly wrong, since
"quoted text here"
will become
``uoted text here"
How can I solve my problem?
I'm a little confused by your approach. Shouldn't it be the other way round with s/``/"[a-z]/g? But then, I think it'll be better with:
s/``(.*?)''/"\1"/g
(.*?) captures what's between `` and ''.
\1 contains this capture.
If it's the opposite that you're looking for (i.e. I wrongly interpreted your question), then I would suggest this:
s/"(.*?)"/``\1''/g
Which works on the same principles as the previous regex.
Use the following to tackle multiple quotations, replacing all " in one step.
echo '"Quote" she said, "again."' | sed "s/\"\([^\"]*\)\"/\`\`\1''/g"
The [^\"]* avoids the need for non-greedy matching, which sed does not support.
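For comparison, the same one-step replacement might look like this in Python. It is a sketch only: latex_quotes is an invented name, and it assumes every opening " has a matching closing " with no quote characters in between.

```python
import re


def latex_quotes(text: str) -> str:
    """Convert "quoted text" to ``quoted text'' for LaTeX."""
    # [^"]* cannot cross a quote, so no non-greedy matching is needed.
    return re.sub(r'"([^"]*)"', r"``\1''", text)


print(latex_quotes('"Quote" she said, "again."'))
```

Because the character class excludes newlines only if you add them, a quotation that spans several lines is still converted when the whole file is passed in as one string.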
If you are using the TeXmaker software, you could use a regular expression with the Replace command (CTRL+R), and put the following into the Find field:
"([^"]*)"
and into the Replace field:
``$1''
And then just press the Replace All button. But after that, you still have to check that everything is fine, and maybe you need to do some corrections. This has worked pretty well for me.
Try grouping the word:
sed 's/"\([a-z]\)/``\1/'
On my PC:
abhishekm71#PC:~$ echo \"hello\" | sed 's/"\([a-z]\)/``\1/'
``hello"
It depends a little on your input file (are quotes always paired, or can there be omissions?). I suggest the following robust approach:
sed 's/"\([0-9a-zA-Z]\)/``\1/g'
sed "s/\([0-9a-zA-Z]\)\"/\1\'\'/g"
Assumption: an opening quotation mark is always immediately followed by a letter or digit, and a closing quotation mark is always immediately preceded by one. Quotations can span several words and even several input lines (some of the other solutions don't work when this happens).
Note that I also replace the closing quotation mark: depending on the fonts you use, the double quotation mark may otherwise be typeset as a neutral straight quotation mark.
You are looking for text enclosed in straight quotation marks that itself contains no quotation mark, so the best regex is "([^"]*?)". Replace it with ``\1''. In Perl this can be written as s/"([^"]*?)"/``\1''/g. Be very careful with this approach: it only works if every opening quotation mark has a matching closing one, as in "one" two "three" four. It will fail on "one" t"wo "three" four, producing ``one'' t``wo ''three" four.
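One way to guard against the unmatched-quote failure described above is to check for lines with an odd number of " characters before running any replacement. The Python below is only a sketch of that idea: the function name is invented, and it assumes quotations do not span lines (a multi-line quotation would be flagged even though it is balanced overall).

```python
def lines_with_unbalanced_quotes(text: str) -> list:
    """Return 1-based numbers of lines whose count of " characters is odd."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.count('"') % 2 == 1
    ]


sample = '"one" two "three" four\n"one" t"wo "three" four'
print(lines_with_unbalanced_quotes(sample))
```

Any line reported here is worth fixing by hand before the bulk substitution is applied.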
I am using a program that pastes what is in the clipboard in a modified format according to what I specify.
I would like for it to paste paths (i.e. "C:\folder\My File") without the pair of double quotes.
This, which doesn't use RegEx, works: Find " (I simply enter that on one line) and replace with nothing. I leave the second field blank.
Now, though that works, it will also remove double quotes in this scenario: Bob said "What are you doing?"
I would like the program to remove the quotes only if the words enclosed in the double quotes include a backslash.
So, once again, just to make sure I am clear, I need the following:
1) A RegEx expression to find strings that have both double quotes and a backslash within that set of quotes.
2) A RegEx Expression that says: replace the backslashes with backslashes (i.e. leave them there).
Thank you for the fast response. This program has two fields. One for what to find and the other for what to replace. So, what would go in the 2nd field?
The program came with the Remove HTML entry, which has
<[^>]*> in the match pattern
and nothing (it's blank) in the Replacement field.
You didn't say which language you use; here's an example in JavaScript:
> s = 'say "hello" and replace "C:\\folder\\My File" thanks'
"say "hello" and replace "C:\folder\My File" thanks"
> s.replace(/"([^"\\]*\\[^"]*)"/g, "$1")
"say "hello" and replace C:\folder\My File thanks"
This should work in .NET:
^".*?\\.*?"$
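For a program with a Find field and a Replace field, the JavaScript answer above amounts to putting the pattern in the first field and a reference to the first captured group in the second. A Python sketch of that same pattern follows; unquote_paths is a made-up name, and whether your program writes the group reference as \1 or $1 is something I can't tell from the question.

```python
import re


def unquote_paths(text: str) -> str:
    """Strip the double quotes around spans that contain a backslash."""
    # Group 1 holds the quoted text; it must contain at least one backslash.
    return re.sub(r'"([^"\\]*\\[^"]*)"', r'\1', text)


s = 'say "hello" and replace "C:\\folder\\My File" thanks'
print(unquote_paths(s))
```

The backslashes are left in place because they are part of the captured group that is written back.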
I need to edit lines in a text file.
The text files contains more than 100 lines of data in the below format.
Cosmos Rh Us (Paperback) $10.99 Shipped:
The Good Earth (Paperback) $6.66 Shipped:
BEST OF D.H. LAWRENCE (Paperback) $7.89 Shipped:
...
These are excerpts from the online book shop I use to buy books.
I have this data in a text editor. How do I edit it [Find/Replace] so that the data becomes like this:
$10.99
$6.66
$7.89
or better, without the dollar sign, since that will make it easy to total.
I use Notepad++ as my text editor.
Search for (don't forget to enable regular expressions in the replace box!)
^.*\$(\d+\.\d+).*$
and replace all with
\1
You could simply match full lines and capture all numbers after the $ sign:
Find what: ^[^$]*\$(\d+\.\d+).*$
Replace with: $1
Make sure that you don't check the ". matches newline" option. And note that this will behave unexpectedly if you have multiple $ signs in a line.
You might need to update to Notepad++ 6. Before that some regex features were not working properly.
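Since the goal is to total the prices, a short script may be more convenient than an editor. The sketch below uses the same \$(\d+\.\d+) capture idea in Python; extract_prices is a hypothetical name, it takes the first price on each line, and Decimal is my choice to avoid floating-point rounding in the sum.

```python
import re
from decimal import Decimal

# A dollar sign followed by digits, a dot, and more digits; group 1 is the number.
PRICE = re.compile(r'\$(\d+\.\d+)')


def extract_prices(text: str) -> list:
    """Return the first price found on each line, as Decimal values."""
    prices = []
    for line in text.splitlines():
        match = PRICE.search(line)
        if match:
            prices.append(Decimal(match.group(1)))
    return prices


sample = """Cosmos Rh Us (Paperback) $10.99 Shipped:
The Good Earth (Paperback) $6.66 Shipped:
BEST OF D.H. LAWRENCE (Paperback) $7.89 Shipped:"""

prices = extract_prices(sample)
print(prices, sum(prices))
```

Lines without a price are skipped rather than copied through, which differs from the editor replacements above.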
Find:
((?<=\$)[\d\.]+)
Replace With:
\1 or $1 (whichever Notepad++ uses)
First regex, to be replaced with nothing:
[a-zA-Z0-9].*\)
Second regex, to be replaced with nothing:
[a-zA-Z]+\: