Extracting text between quotation marks in notepad++ - regex

My file contains more than 2,000 abstracts with more than 18,000 sentences in total, each delimited by an opening and a closing tag. I want to extract information from it using Notepad++.
A sample of the file is shown below:
<abstract>
<sentence>Activationofthe<conslex="CD28_surface_receptor"sem="G#protein_family_or_group"><conslex="CD28"sem="G#protein_molecule">CD28</cons>surfacereceptor</cons>providesamajorcostimulatorysignalfor<conslex="T_cell_activation"sem="G#other_name">Tcellactivation</cons>resultinginenhancedproductionof<conslex="interleukin-2"sem="G#protein_molecule">interleukin-2</cons>(<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>)and<conslex="cell_proliferation"sem="G#other_name">cellproliferation</cons>.</sentence>
<sentence>In<conslex="primary_T_lymphocyte"sem="G#cell_type">primaryTlymphocytes</cons>weshowthat<conslex="CD28"sem="G#protein_molecule">CD28</cons>ligationleadstotherapidintracellularformationof<conslex="reactive_oxygen_intermediate"sem="G#inorganic">reactiveoxygenintermediates</cons>(<conslex="ROI"sem="G#inorganic">ROIs</cons>)whicharerequiredfor<conslex="CD28-mediated_activation"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-mediatedactivation</cons>ofthe<conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>/<conslex="CD28-responsive_complex"sem="G#protein_complex"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-responsivecomplex</cons>and<conslex="IL-2_expression"sem="G#other_name"><conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expression</cons>.</sentence>
<sentence>Delineationofthe<conslex="CD28_signaling_cascade"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>signalingcascade</cons>wasfoundtoinvolve<conslex="protein_tyrosine_kinase_activity"sem="G#other_name"><conslex="protein_tyrosine_kinase"sem="G#protein_family_or_group">proteintyrosinekinase</cons>activity</cons>,followedbytheactivationof<conslex="phospholipase_A2"sem="G#protein_molecule">phospholipaseA2</cons>and<conslex="5-lipoxygenase"sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Ourdatasuggestthat<conslex="lipoxygenase_metabolite"sem="G#protein_family_or_group"><conslex="lipoxygenase"sem="G#protein_molecule">lipoxygenase</cons>metabolites</cons>activate<conslex="ROI_formation"sem="G#other_name"><conslex="ROI"sem="G#inorganic">ROI</cons>formation</cons>whichtheninduce<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expressionvia<conslex="NF-kappa_B_activation"sem="G#other_name"><conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>activation</cons>.</sentence>
<sentence>Thesefindingsshouldbeusefulfor<conslex="therapeutic_strategies"sem="G#other_name">therapeuticstrategies</cons>andthedevelopmentof<conslex="immunosuppressants"sem="G#other_name">immunosuppressants</cons>targetingthe<conslex="CD28_costimulatory_pathway"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>costimulatorypathway</cons>.</sentence>
</abstract>
I want to extract the text between quotation marks. For example, my desired output looks like this:
"CD28_surface_receptor" "G#protein_family_or_group" "CD28" "G#protein_molecule"
"primary_T_lymphocyte" "G#cell_type"
I hope there is a simple way of doing this in Notepad++ with a regex. The task might also become easier if there is a way to extract text on the basis of its syntax-highlighting color in Notepad++.

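Try the following regex:
"\w+"|"G#\w+"
The alternation operator | only works in recent versions of Notepad++. Note that \w does not match a hyphen, so a value such as "interleukin-2" is skipped; "[^"]*" matches every quoted value.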
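To see what the regex above does and does not match, here is a minimal Python sketch. It is not Notepad++ itself, only the same patterns run through Python's re module; the sample line is shortened from the question's data and the function name quoted_values is made up for illustration.

```python
import re

# Shortened sample taken from the question's data (assumption: quotes are balanced).
line = ('<sentence>Activationofthe<conslex="CD28_surface_receptor"'
        'sem="G#protein_family_or_group"><conslex="interleukin-2"'
        'sem="G#protein_molecule">interleukin-2</cons></sentence>')

def quoted_values(text):
    """Return every double-quoted substring, quotes included."""
    return re.findall(r'"[^"]*"', text)

# The pattern from the answer: \w does not cover '-', so "interleukin-2" is missed.
narrow = re.findall(r'"\w+"|"G#\w+"', line)

# The broader pattern picks up all four quoted values.
broad = quoted_values(line)

print(" ".join(narrow))
print(" ".join(broad))
```

The broader pattern relies on the quotes being balanced, which holds for the sample shown in the question.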

Related

Append text before the closing bracket

I have the code below and am using IntelliJ's find and replace. I want to append ,driver to each call.
Current code:
click(page.username)
click(page.password)
Expected code:
click(page.username,driver)
click(page.password,driver)
Note: the text after click( is different in each call.
Open the Find and Replace panel in IntelliJ.
Find text: )
Replace text: ,driver)
If the file contains other ) characters, go through the matches one by one: click Exclude to skip a match and Replace to apply the replacement.
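As an alternative to stepping through every ) by hand, a regex that only targets click(...) calls can do the replacement in one pass. The following is a minimal Python sketch of that idea, assuming the argument of click never contains nested parentheses; add_driver is a hypothetical helper name. A similar pattern should work in IntelliJ's regex mode, where the group is most likely referenced as $1 rather than \1.

```python
import re

source = "click(page.username)\nclick(page.password)\nother(call)\n"

def add_driver(code):
    """Append ',driver' to the argument list of every click(...) call."""
    # [^()]* keeps the match inside a single pair of parentheses
    # (assumption: no nested calls inside click).
    return re.sub(r'\bclick\(([^()]*)\)', r'click(\1,driver)', code)

result = add_driver(source)
print(result)
```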

Extracting text between each quotation marks from file in notepad++

My file contains more than 2,000 abstracts with more than 18,000 sentences in total, each delimited by an opening and a closing tag. I want to extract information from it using Notepad++. A sample of the file is shown below:
<abstract>
<sentence>Activationofthe<conslex="CD28_surface_receptor"sem="G#protein_family_or_group"><conslex="CD28"sem="G#protein_molecule">CD28</cons>surfacereceptor</cons>providesamajorcostimulatorysignalfor<conslex="T_cell_activation"sem="G#other_name">Tcellactivation</cons>resultinginenhancedproductionof<conslex="interleukin-2"sem="G#protein_molecule">interleukin-2</cons>(<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>)and<conslex="cell_proliferation"sem="G#other_name">cellproliferation</cons>.</sentence>
<sentence>In<conslex="primary_T_lymphocyte"sem="G#cell_type">primaryTlymphocytes</cons>weshowthat<conslex="CD28"sem="G#protein_molecule">CD28</cons>ligationleadstotherapidintracellularformationof<conslex="reactive_oxygen_intermediate"sem="G#inorganic">reactiveoxygenintermediates</cons>(<conslex="ROI"sem="G#inorganic">ROIs</cons>)whicharerequiredfor<conslex="CD28-mediated_activation"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-mediatedactivation</cons>ofthe<conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>/<conslex="CD28-responsive_complex"sem="G#protein_complex"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-responsivecomplex</cons>and<conslex="IL-2_expression"sem="G#other_name"><conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expression</cons>.</sentence>
<sentence>Delineationofthe<conslex="CD28_signaling_cascade"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>signalingcascade</cons>wasfoundtoinvolve<conslex="protein_tyrosine_kinase_activity"sem="G#other_name"><conslex="protein_tyrosine_kinase"sem="G#protein_family_or_group">proteintyrosinekinase</cons>activity</cons>,followedbytheactivationof<conslex="phospholipase_A2"sem="G#protein_molecule">phospholipaseA2</cons>and<conslex="5-lipoxygenase"sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Ourdatasuggestthat<conslex="lipoxygenase_metabolite"sem="G#protein_family_or_group"><conslex="lipoxygenase"sem="G#protein_molecule">lipoxygenase</cons>metabolites</cons>activate<conslex="ROI_formation"sem="G#other_name"><conslex="ROI"sem="G#inorganic">ROI</cons>formation</cons>whichtheninduce<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expressionvia<conslex="NF-kappa_B_activation"sem="G#other_name"><conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>activation</cons>.</sentence>
<sentence>Thesefindingsshouldbeusefulfor<conslex="therapeutic_strategies"sem="G#other_name">therapeuticstrategies</cons>andthedevelopmentof<conslex="immunosuppressants"sem="G#other_name">immunosuppressants</cons>targetingthe<conslex="CD28_costimulatory_pathway"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>costimulatorypathway</cons>.</sentence>
</abstract>
I want to extract the text between quotation marks; in other words, I want to remove everything except what is inside double quotes, throughout the file. For example, my desired output looks like this:
CD28_surface_receptor G#protein_family_or_group CD28 G#protein_molecule
primary_T_lymphocyte G#cell_type
I used .*"(.*)".* in Find What, \1 in Replace With, and clicked Replace All. This only extracted the last quoted string on each line, but I want to extract every quoted string on every line, since each line contains several of them.
You can use [^"]*"([^"]+)"[^"]* in Find What and replace with \1\r\n to put each value on its own line.
Or, to have the values tab-separated, replace with \1\t.
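The original attempt kept only the last value because .* is greedy and swallows everything up to the final pair of quotes, whereas [^"]* can never run past a quote. A minimal Python sketch of the same substitution, applied to a fragment of the sample data (extract_quoted is a made-up name, and Python's re stands in for Notepad++ here):

```python
import re

def extract_quoted(text, sep="\t"):
    """Keep only the text inside double quotes, each value followed by sep."""
    # [^"]* consumes the unquoted text before and after each quoted value,
    # so everything outside the quotes is dropped by the replacement.
    return re.sub(r'[^"]*"([^"]+)"[^"]*', lambda m: m.group(1) + sep, text)

fragment = '<conslex="CD28"sem="G#protein_molecule">CD28</cons>'

print(repr(extract_quoted(fragment)))            # tab-separated
print(repr(extract_quoted(fragment, "\r\n")))    # one value per line
```

Because [^"]* also matches line breaks, the tab-separated variant joins the values from all lines into one long line; processing the file line by line avoids that.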

Replace a comma in text values in CSV using regex in Notepad++

I searched a lot but couldn't find an exact solution.
I have a CSV file in which some values contain a comma.
Following is a sample row
"BEIAAGJIPAMBPJIF",2757,08042010,"13:53.59",09042010,"01:55.39","SIHAM","BEIAIGHEIPLGPJIF",20,"A",20,"S",0.00,0.00,0.00,"OLY SPECIAL ORDER","IN STOCK , DESIGNER",0.00000,0,"","N","N",
If you look at the value "IN STOCK , DESIGNER", it contains a comma. Because of this, both my .NET application and the MS Dynamics CRM import file wizard split it into two separate values instead of reading it as a single value.
I need a regex, usable in Notepad++, that matches such strings and replaces the comma with a hyphen "-".
Kindly help.
Thanks.
This solution worked for me, although it is a bit indirect:
1. Search the file to find a character that does not occur anywhere in it, e.g. #.
2. Run a regex replace that turns every delimiter comma into that character:
   Find what: (".*?"|.*?),
   Replace with: \1#
3. The only commas left are now the ones inside quotes. Replace all of them with -.
4. Replace every # with a comma again.
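The four steps above can be sketched in Python as follows. This is only an illustration of the placeholder technique, not a general CSV parser: it assumes every row ends with a trailing comma, as the sample row does, and that quoted values contain no escaped quotes. The function name and the shortened sample row are made up for the example.

```python
import re

def hyphenate_inner_commas(row, placeholder="#"):
    """Replace commas inside quoted CSV values with '-' via a placeholder."""
    # Step 1: the placeholder must not occur in the data.
    if placeholder in row:
        raise ValueError("placeholder already occurs in the row")
    # Step 2: turn every delimiter comma into the placeholder.
    # A function is used as replacement so the placeholder is taken literally.
    marked = re.sub(r'(".*?"|.*?),', lambda m: m.group(1) + placeholder, row)
    # Step 3: the commas that are left sit inside quotes.
    marked = marked.replace(",", "-")
    # Step 4: restore the delimiters.
    return marked.replace(placeholder, ",")

row = '"SIHAM",20,"IN STOCK , DESIGNER",0.00000,0,"","N","N",'
fixed = hyphenate_inner_commas(row)
print(fixed)
```

If a row does not end with a comma, the last field is not handled correctly by this regex, so it is worth checking the end of each line first.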

notepad++ regex to replace define sort of data

I want to replace text in a huge text file using Notepad++. I don't know how to replace a piece of text only if its length is within a given range, for example 50-100 characters. As far as I know, the regex should look like [a-zA-Z0-9 -+]{50,100}, but it doesn't work in Notepad++. I'm not a regex specialist.
Example input:
<a>short text</a>
<a>veeeeryyyyy lloooooonnngggg teeeexxxtttt</a>
Expected output:
<a>short text</a>
<a>shrt txt</a>
A better option is [^<>]{15,100}, which matches any run of 15 to 100 characters that contains no angle brackets, i.e. the text between tags.
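A minimal Python sketch of this pattern on the question's example input, using the question's expected output "shrt txt" as the replacement text (the function name is hypothetical). As a side note, in the character class from the question, " -+" is read as a range from space to plus rather than as three separate characters.

```python
import re

def replace_long_text(markup, replacement, low=15, high=100):
    """Replace tag-free runs of low..high characters with replacement."""
    # [^<>] excludes both angle brackets, so a match never crosses a tag.
    pattern = r'[^<>]{%d,%d}' % (low, high)
    return re.sub(pattern, replacement, markup)

sample = ("<a>short text</a>\n"
          "<a>veeeeryyyyy lloooooonnngggg teeeexxxtttt</a>")

result = replace_long_text(sample, "shrt txt")
print(result)
```

"short text" has only 10 characters, so it falls below the lower bound and is left alone.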

Regex to disallow too many sequential, non-whitespace characters - while allowing links

I'm looking to use a regex to replace sequential runs of non-whitespace characters (say more than 35) with only the first 35 characters. I would like to allow strings with "http" in them to remain as they are (so as not to break links).
The strings will be from user input, and if somebody types 50 'x' characters in a row it may go outside of my <DIV> container and disrupt the layout. The runs might come at the beginning of a line or in the middle of one.
For example, I would like to disallow these types of input:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
12345
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
but not these:
http://somesite.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
12345
http://somesite.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I got the idea of using a negative lookaround from this question
I'm getting mixed results w/ this regex:
$comment=preg_replace('/^(((?!http).){25})(((?!http).)*)$/imUs', '$1',$comment);
That regex is preserving links, but it is also trimming acceptable text down to 25 characters.
text text text text text text text text text text text text text text text text text text text text text text text text
is becoming
text text text text tex
From reading the regexes in other questions, I have a feeling that this can be done with a more elegant regex than the one I show above. Thanks for any suggestions.
I came up with this, and some quick testing seems to show it working for me, but let me know if it works correctly for you.
$comment = preg_replace('/(^|\s)((?!http)[^\s"]{25})[^\s"]+/i', '$1$2', $comment);
Obviously replace the 25 with whatever your max length should be.
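For readers who want to try the pattern outside PHP, here is a rough Python equivalent of the preg_replace call above. It is a sketch with the maximum length exposed as a parameter; truncate_runs and limit are names invented for this example.

```python
import re

def truncate_runs(comment, limit=25):
    """Cut non-whitespace runs longer than limit, leaving http... runs alone."""
    # (^|\s)      the run starts at the beginning of the string or after whitespace
    # (?!http)    skip runs that start with "http" (case-insensitive)
    # [^\s"]{n}   the part of the run that is kept
    # [^\s"]+     the overflow that is dropped
    pattern = rf'(^|\s)((?!http)[^\s"]{{{limit}}})[^\s"]+'
    return re.sub(pattern, r'\1\2', comment, flags=re.IGNORECASE)

long_run = "x" * 50
url = "http://somesite.com/" + "x" * 50

print(truncate_runs(long_run))
print(truncate_runs("12345 " + long_run))
print(truncate_runs(url))
```

Ordinary sentences made of short words are left untouched, because the character class cannot match across whitespace.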