Extracting the text between each pair of quotation marks from a file in Notepad++ - regex

My file contains more than 2000 abstracts with more than 18000 sentences, each starting with an opening tag and ending with a closing tag. I want to extract information from it using Notepad++. A sample of my file is below:
<abstract>
<sentence>Activationofthe<conslex="CD28_surface_receptor"sem="G#protein_family_or_group"><conslex="CD28"sem="G#protein_molecule">CD28</cons>surfacereceptor</cons>providesamajorcostimulatorysignalfor<conslex="T_cell_activation"sem="G#other_name">Tcellactivation</cons>resultinginenhancedproductionof<conslex="interleukin-2"sem="G#protein_molecule">interleukin-2</cons>(<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>)and<conslex="cell_proliferation"sem="G#other_name">cellproliferation</cons>.</sentence>
<sentence>In<conslex="primary_T_lymphocyte"sem="G#cell_type">primaryTlymphocytes</cons>weshowthat<conslex="CD28"sem="G#protein_molecule">CD28</cons>ligationleadstotherapidintracellularformationof<conslex="reactive_oxygen_intermediate"sem="G#inorganic">reactiveoxygenintermediates</cons>(<conslex="ROI"sem="G#inorganic">ROIs</cons>)whicharerequiredfor<conslex="CD28-mediated_activation"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-mediatedactivation</cons>ofthe<conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>/<conslex="CD28-responsive_complex"sem="G#protein_complex"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-responsivecomplex</cons>and<conslex="IL-2_expression"sem="G#other_name"><conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expression</cons>.</sentence>
<sentence>Delineationofthe<conslex="CD28_signaling_cascade"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>signalingcascade</cons>wasfoundtoinvolve<conslex="protein_tyrosine_kinase_activity"sem="G#other_name"><conslex="protein_tyrosine_kinase"sem="G#protein_family_or_group">proteintyrosinekinase</cons>activity</cons>,followedbytheactivationof<conslex="phospholipase_A2"sem="G#protein_molecule">phospholipaseA2</cons>and<conslex="5-lipoxygenase"sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Ourdatasuggestthat<conslex="lipoxygenase_metabolite"sem="G#protein_family_or_group"><conslex="lipoxygenase"sem="G#protein_molecule">lipoxygenase</cons>metabolites</cons>activate<conslex="ROI_formation"sem="G#other_name"><conslex="ROI"sem="G#inorganic">ROI</cons>formation</cons>whichtheninduce<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expressionvia<conslex="NF-kappa_B_activation"sem="G#other_name"><conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>activation</cons>.</sentence>
<sentence>Thesefindingsshouldbeusefulfor<conslex="therapeutic_strategies"sem="G#other_name">therapeuticstrategies</cons>andthedevelopmentof<conslex="immunosuppressants"sem="G#other_name">immunosuppressants</cons>targetingthe<conslex="CD28_costimulatory_pathway"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>costimulatorypathway</cons>.</sentence>
</abstract>
I want to extract the text between quotation marks, in other words remove everything except what is inside double quotes throughout the text. For example, my desired output looks like this:
CD28_surface_receptor G#protein_family_or_group CD28 G#protein_molecule
primary_T_lymphocyte G#cell_type
I used .*"(.*)".* in Find What and \1 in Replace With, then clicked Replace All. It only extracted the last quoted string on each line, but I want to extract every quoted string from every line of the document, since each line contains several strings in double quotes.

You can use [^"]*"([^"]+)"[^"]* in Find What, and replace with \1\r\n to put each extracted value on its own line.
Or, to have them tab-separated, replace with \1\t.
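
As a rough illustration outside Notepad++, the same two patterns can be tried with Python's re module. This is only a sketch: the sample line is shortened from the question's data, and Python's regex engine is assumed to behave like Notepad++'s for these simple patterns.

```python
import re

# Shortened sample line taken from the question's data.
line = ('<sentence>In<conslex="primary_T_lymphocyte"sem="G#cell_type">'
        'primaryTlymphocytes</cons>weshowthat<conslex="CD28"'
        'sem="G#protein_molecule">CD28</cons>ligation</sentence>')

# The question's attempt: the leading greedy .* swallows as much as it can,
# so only the LAST quoted value on the line ends up in group 1.
greedy = re.sub(r'.*"(.*)".*', r'\1', line)

# The answer's pattern: [^"] cannot cross a quote, so each match covers
# exactly one quoted value plus the unquoted text around it.
per_value = re.sub(r'[^"]*"([^"]+)"[^"]*', r'\1\n', line)

# Tab-separated variant, built from the individual quoted values.
tabbed = '\t'.join(re.findall(r'"([^"]+)"', line))

print(greedy)
print(per_value, end='')
print(tabbed)
```

The first print shows why the original attempt kept only one value per line, while the other two list all four quoted values.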

Related

Extracting text between quotation marks in notepad++

My file contains more than 2000 abstracts with more than 18000 sentences, each starting with an opening tag and ending with a closing tag. I want to extract information from it using Notepad++.
A sample of my file is below:
<abstract>
<sentence>Activationofthe<conslex="CD28_surface_receptor"sem="G#protein_family_or_group"><conslex="CD28"sem="G#protein_molecule">CD28</cons>surfacereceptor</cons>providesamajorcostimulatorysignalfor<conslex="T_cell_activation"sem="G#other_name">Tcellactivation</cons>resultinginenhancedproductionof<conslex="interleukin-2"sem="G#protein_molecule">interleukin-2</cons>(<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>)and<conslex="cell_proliferation"sem="G#other_name">cellproliferation</cons>.</sentence>
<sentence>In<conslex="primary_T_lymphocyte"sem="G#cell_type">primaryTlymphocytes</cons>weshowthat<conslex="CD28"sem="G#protein_molecule">CD28</cons>ligationleadstotherapidintracellularformationof<conslex="reactive_oxygen_intermediate"sem="G#inorganic">reactiveoxygenintermediates</cons>(<conslex="ROI"sem="G#inorganic">ROIs</cons>)whicharerequiredfor<conslex="CD28-mediated_activation"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-mediatedactivation</cons>ofthe<conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>/<conslex="CD28-responsive_complex"sem="G#protein_complex"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-responsivecomplex</cons>and<conslex="IL-2_expression"sem="G#other_name"><conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expression</cons>.</sentence>
<sentence>Delineationofthe<conslex="CD28_signaling_cascade"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>signalingcascade</cons>wasfoundtoinvolve<conslex="protein_tyrosine_kinase_activity"sem="G#other_name"><conslex="protein_tyrosine_kinase"sem="G#protein_family_or_group">proteintyrosinekinase</cons>activity</cons>,followedbytheactivationof<conslex="phospholipase_A2"sem="G#protein_molecule">phospholipaseA2</cons>and<conslex="5-lipoxygenase"sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Ourdatasuggestthat<conslex="lipoxygenase_metabolite"sem="G#protein_family_or_group"><conslex="lipoxygenase"sem="G#protein_molecule">lipoxygenase</cons>metabolites</cons>activate<conslex="ROI_formation"sem="G#other_name"><conslex="ROI"sem="G#inorganic">ROI</cons>formation</cons>whichtheninduce<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expressionvia<conslex="NF-kappa_B_activation"sem="G#other_name"><conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>activation</cons>.</sentence>
<sentence>Thesefindingsshouldbeusefulfor<conslex="therapeutic_strategies"sem="G#other_name">therapeuticstrategies</cons>andthedevelopmentof<conslex="immunosuppressants"sem="G#other_name">immunosuppressants</cons>targetingthe<conslex="CD28_costimulatory_pathway"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>costimulatorypathway</cons>.</sentence>
</abstract>
I want to extract the text between quotation marks. For example, my desired output looks like this:
"CD28_surface_receptor" "G#protein_family_or_group" "CD28" "G#protein_molecule"
"primary_T_lymphocyte" "G#cell_type"
I hope there is a simple way of doing this in Notepad++ using regex. The task might become easy if there is a way to extract text on the basis of its color in Notepad++.
Try the following:
"\w+"|"G#\w+"
Note that the or operator | works only in the latest version of Notepad++.
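
One caveat worth checking before relying on this pattern: \w only covers letters, digits, and underscores, so quoted values containing a hyphen (such as "interleukin-2" in the sample) are skipped. The sketch below uses Python's re module rather than Notepad++ and compares it with the broader "[^"]+" pattern, which is an alternative suggested here and not part of the answer.

```python
import re

# Fragment of the first sentence in the question's sample data.
sample = ('<conslex="interleukin-2"sem="G#protein_molecule">interleukin-2</cons>'
          '(<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>)')

# Pattern from the answer: misses values that contain a hyphen.
narrow = re.findall(r'"\w+"|"G#\w+"', sample)

# Alternative (assumption, not from the answer): anything between two quotes.
broad = re.findall(r'"[^"]+"', sample)

print(narrow)
print(broad)
```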

Remove more than one comma in between quotes in CSV file using Regex

Hi all! I have CSV files coming in with text inside double quotes that contains one or more commas, and I am wondering if there is a regex for Notepad++ that would remove any number of commas inside the quoted text.
For example, I need to go from this:
text,text1,"interesting, text,"
To this:
text,text1,"interesting text"
There can be one, two, or more commas inside the quotes.
Does anyone know of a way to make this happen using a regex in Notepad++?
Thanks in advance!
Use this pattern:
,(?!(([^"]*"){2})*[^"]*$)
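and replace with nothing.
It looks for a comma that is not followed by an even number (possibly zero) of double quotes up to the end of the line. Such a comma has an odd number of quotes after it, which means it lies inside a quoted field.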
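
To see the lookahead at work, here is a small sketch using Python's re module on the example row from the question. The helper name strip_quoted_commas is made up for illustration, and each line is processed separately so that $ means the end of that line.

```python
import re

# Pattern from the answer: a comma that is NOT followed by an even number
# of double quotes up to the end of the line, i.e. a comma inside quotes.
INSIDE_QUOTES = re.compile(r',(?!(([^"]*"){2})*[^"]*$)')

def strip_quoted_commas(line: str) -> str:
    """Remove commas inside quoted fields, keeping delimiter commas."""
    return INSIDE_QUOTES.sub('', line)

row = 'text,text1,"interesting, text,"'
cleaned = strip_quoted_commas(row)
print(cleaned)
```

The two delimiter commas each have two quotes after them and are kept; the two commas inside the quoted field each have one quote after them and are removed.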

Notepad++ Search and Replace with Tab Delimited File

I have a file that is tab delimited. When exporting from Excel, if the cell has a comma in it, it will wrap the cell with double quotes.
To find the first double quote, I can look for a tab then double quote ex: \t"
The next double quote to remove is at the end of the line, so I would like to find double quote then newline ex: \n" but this is not working.
Example of the file format:
textTABtextTAB"moretextwithquotes"CRLF
First, if I understand your problem correctly, you're searching for \n" instead of "\n.
Second, the lines end in CRLF, so you need to search for \r\n instead of \n; your final search string should be "\r\n.
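
The two searches can be checked quickly with Python's re module. This is a sketch on the example line from the question, with TAB and CRLF written out as real characters.

```python
import re

# The question's example line: text TAB text TAB "moretextwithquotes" CRLF
raw_row = 'text\ttext\t"moretextwithquotes"\r\n'

# Remove the opening quote that follows a tab.
no_open = re.sub(r'\t"', '\t', raw_row)

# Remove the closing quote that precedes the CRLF line ending.
no_quotes = re.sub(r'"\r\n', '\r\n', no_open)

print(repr(no_quotes))
```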
If all your data is consistent, with double quotes that are matched and enclose whole fields, I would just do a global find and replace on the quoted text, replacing each match with just the field data. This strips the quotes and leaves everything else untouched.
Find: "([^"\\]*(?:\\.[^"\\]*)*)"
Replace: $1
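
A sketch of the same find and replace in Python's re module, where the replacement is written \1 instead of $1. The second sample string, with backslash-escaped quotes, is invented here to show what the \\. part of the pattern is for.

```python
import re

# Pattern from the answer: a quoted field whose body may contain
# backslash-escaped characters (for example \").
QUOTED = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')

row = 'text\ttext\t"more, text"\r\n'
stripped = QUOTED.sub(r'\1', row)

# Hypothetical input with escaped quotes inside the field.
escaped = r'a "say \"hi\"" b'
stripped_escaped = QUOTED.sub(r'\1', escaped)

print(repr(stripped))
print(stripped_escaped)
```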

Replace a comma in text values in CSV using regex in Notepad++

I searched a lot but couldn't find an exact solution.
I have a CSV file in which some values contain a comma.
Following is a sample row
"BEIAAGJIPAMBPJIF",2757,08042010,"13:53.59",09042010,"01:55.39","SIHAM","BEIAIGHEIPLGPJIF",20,"A",20,"S",0.00,0.00,0.00,"OLY
SPECIAL ORDER","IN STOCK , DESIGNER",0.00000,0,"","N","N",
Now if you look at the value "IN STOCK , DESIGNER", it contains a comma. Because of this, when the CSV is read in my .NET application and in the MS Dynamics CRM import file wizard, the value is split into two separate values instead of being kept as one.
I need a regex that can match such strings and replace the comma with a hyphen "-" that I can use in Notepad ++.
Kindly help.
Thanks.
This solution worked for me, although it is a bit indirect:
1. By searching, find a character that is unused in the file, e.g. #.
2. Use the following regex replace to replace all delimiters: find: (".*?"|.*?), replace: \1# (note the character from step 1).
3. Now the only commas left are those inside the quotes. Mass-replace them with -.
4. Replace all #'s back with commas.
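
The steps above can be sketched in Python as follows. The function name hyphenate_quoted_commas and the placeholder parameter are made up for illustration, and the sample row is the tail of the row from the question.

```python
import re

def hyphenate_quoted_commas(line: str, placeholder: str = '#') -> str:
    """Replace commas inside quoted fields with '-' via a placeholder."""
    # Step 1: the placeholder must not already occur in the data.
    if placeholder in line:
        raise ValueError('placeholder already occurs in the data')
    # Step 2: turn every delimiter comma into the placeholder.
    marked = re.sub(r'(".*?"|.*?),',
                    lambda m: m.group(1) + placeholder, line)
    # Step 3: any comma still present sits inside quotes.
    marked = marked.replace(',', '-')
    # Step 4: restore the delimiter commas.
    return marked.replace(placeholder, ',')

row = '"SPECIAL ORDER","IN STOCK , DESIGNER",0.00000,0,"","N","N",'
fixed = hyphenate_quoted_commas(row)
print(fixed)
```

Note that this relies on every quoted field being followed by a comma, as in the sample row. A quoted field at the very end of a line with no trailing comma would not be handled by step 2.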

Find a specific word in text files between double quotation marks

I have many text files and need to locate certain words that may appear anywhere in a file, but I only want the occurrences that are inside quotation marks.
Example: Find in the text below the word "search" only if in quotes (the word "search" may vary).
1. text text text text text text search text
2. text "search text text text text" text
3. text "SEARCH text text text text" text
For this precise example, I would expect only lines 2 and 3 to match.
Thanks to anyone who can help me.
If you can guarantee that there'll be only one set of quotes, then
/".*search.*"/i
should do. But if there can be more than one pair of quotes, then you have to ensure that an even number of quotes have been passed, lest you mistake a closing quote for an opening quote:
/^[^"]*("[^"]*"[^"]*)*"[^"]*search[^"]*"/i
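
As a sketch, the even-number-of-quotes pattern can be tried with Python's re module on the three lines from the question. Line 4 and the helper name quoted_word_pattern are additions made here for illustration; line 4 has the word outside two separate pairs of quotes.

```python
import re

def quoted_word_pattern(word: str) -> re.Pattern:
    """Build the answer's second pattern for an arbitrary search word."""
    return re.compile(
        r'^[^"]*("[^"]*"[^"]*)*"[^"]*' + re.escape(word) + r'[^"]*"',
        re.IGNORECASE,
    )

lines = [
    '1. text text text text text text search text',
    '2. text "search text text text text" text',
    '3. text "SEARCH text text text text" text',
    '4. text "other" search "words" text',  # hypothetical extra line
]

pattern = quoted_word_pattern('search')
hits = [ln for ln in lines if pattern.search(ln)]
for ln in hits:
    print(ln)
```

Only lines 2 and 3 are reported. Line 4 is rejected because the pattern has to skip whole quoted pairs before it reaches an opening quote, so the closing quote of "other" cannot be mistaken for an opening one.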
If you want all words between double quotes, I would simply use grep:
grep -E -o '".*"' inputfile
If you want only the first word:
sed -E 's/.+"([[:alpha:]]+) .*/\1/' inputfile