Regex to disallow too many sequential, non-whitespace characters - while allowing links

I'm looking to use a regex to replace sequential runs of non-whitespace characters (say more than 35) with only the first 35 characters. I would like to allow strings with "http" in them to remain as they are (so as not to break links).
The strings will be from user input, and if somebody types 50 'x' characters in a row it may go outside of my <DIV> container and disrupt the layout. The runs might come at the beginning of a line or in the middle of one.
For example, I would like to disallow these types of input:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
12345
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
but not these:
http://somesite.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
12345
http://somesite.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I got the idea of using a negative lookaround from this question
I'm getting mixed results with this regex:
$comment=preg_replace('/^(((?!http).){25})(((?!http).)*)$/imUs', '$1',$comment);
That regex is preserving links, but it is also trimming acceptable text down to 25 characters.
text text text text text text text text text text text text text text text text text text text text text text text text
is becoming
text text text text tex
From reading regexes in other questions, I have a feeling that this can be done with a more elegant regex than the one I show above. Thanks for any suggestions.

I came up with this, and some quick testing seems to show it working for me, but let me know if it works correctly for you.
$comment = preg_replace('/(^|\s)((?!http)[^\s"]{25})[^\s"]+/i', '$1$2', $comment);
Obviously replace the 25 with whatever your max length should be.
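For readers outside PHP, the same idea can be sketched in Python. This is an illustrative port rather than the answerer's code: the function name truncate_long_runs and the max_len parameter are assumptions introduced here, and the pattern is the one from the answer with the length made configurable.

```python
import re

def truncate_long_runs(comment: str, max_len: int = 25) -> str:
    """Cut runs of non-whitespace characters longer than max_len,
    unless the run starts with "http" (so links survive)."""
    pattern = re.compile(
        # (^|\s)       start of string or one whitespace character (kept)
        # (?!http)     the run must not start with "http"
        # [^\s"]{n}    the first n characters of the run (kept)
        # [^\s"]+      the overflow, which is dropped
        rf'(^|\s)((?!http)[^\s"]{{{max_len}}})[^\s"]+',
        re.IGNORECASE,
    )
    return pattern.sub(r'\1\2', comment)

print(truncate_long_runs("x" * 50))
print(truncate_long_runs("http://somesite.com/" + "x" * 50))
```

Because the run has to begin at the start of the string or right after whitespace, ordinary sentences made of short words are left alone, which is what the original attempt got wrong.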

Related

Text replace challenge (regex)

I can't solve a problem. Perhaps it is impossible to achieve what I want.
GOAL: use only the replace function to remove all text except the email address.
I have a text containing an email address: Start text some other text 2828 text my.address#mail.com some additional text.
Regular expression to select email: [a-zA-Z0-9\-\._]+#[\w\d\-\._]+\.\w{2,12}
The regular expression works perfectly for finding an email address, but it doesn't work for removing all the text around the email.
The screenshot below shows the result I get when I apply the replace function in a text editor:
I searched for the regexp .*([a-zA-Z0-9\-\._]+#[\w\d\-\._]+\.\w{2,12}).* and replaced it with $1. Sadly, this gives me a broken email address.
I used an email address as an example; I get the same result for any other data type, such as URLs, IPs, phone numbers, names, cities, zip codes, etc.
Can anyone suggest a solution to this problem?
Thanks a lot.
PS: I am not interested in using a match() function, because most text editors don't provide one.
I think you should make the first part non-greedy (.*?). Otherwise the greedy .* consumes as much as it can, right up to the #, and gives back only the single character needed to satisfy the character class [a-zA-Z0-9\-\._]+
With the non-greedy version it captures my.address#mail.com instead of s#mail.com
.*?([a-zA-Z0-9\-\._]+#[\w\d\-\._]+\.\w{2,12}).*
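The difference between the greedy and non-greedy prefix can be checked quickly outside a text editor. The following is a small sketch in Python using the question's own text and pattern; the variable names are made up for illustration.

```python
import re

text = "Start text some other text 2828 text my.address#mail.com some additional text."

# The "email" sub-pattern from the question, kept as written (with # as the separator).
email = r"([a-zA-Z0-9\-\._]+#[\w\d\-\._]+\.\w{2,12})"

# Greedy prefix: .* grabs everything and backtracks only as far as it must.
greedy = re.sub(r".*" + email + r".*", r"\1", text)

# Lazy prefix: .*? stops at the first position where the email pattern can match.
lazy = re.sub(r".*?" + email + r".*", r"\1", text)

print(greedy)  # s#mail.com
print(lazy)    # my.address#mail.com
```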
I would do it like this:
Find: (.*?)[a-zA-Z0-9\-\._]+#[\w\d\-\._]+\.\w{2,12}\s?(.*?)
Replace: $1$2
Input: Start text some other text 2828 text my.address#mail.com some additional text
Output: Start text some other text 2828 text some additional text

Extracting text between quotation marks in notepad++

My file contains more than 2000 abstracts with more than 18000 sentences, each starting with an opening tag and ending with a closing tag. I want to extract information from it using Notepad++.
A sample of my file is shown below:
<abstract>
<sentence>Activationofthe<conslex="CD28_surface_receptor"sem="G#protein_family_or_group"><conslex="CD28"sem="G#protein_molecule">CD28</cons>surfacereceptor</cons>providesamajorcostimulatorysignalfor<conslex="T_cell_activation"sem="G#other_name">Tcellactivation</cons>resultinginenhancedproductionof<conslex="interleukin-2"sem="G#protein_molecule">interleukin-2</cons>(<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>)and<conslex="cell_proliferation"sem="G#other_name">cellproliferation</cons>.</sentence>
<sentence>In<conslex="primary_T_lymphocyte"sem="G#cell_type">primaryTlymphocytes</cons>weshowthat<conslex="CD28"sem="G#protein_molecule">CD28</cons>ligationleadstotherapidintracellularformationof<conslex="reactive_oxygen_intermediate"sem="G#inorganic">reactiveoxygenintermediates</cons>(<conslex="ROI"sem="G#inorganic">ROIs</cons>)whicharerequiredfor<conslex="CD28-mediated_activation"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-mediatedactivation</cons>ofthe<conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>/<conslex="CD28-responsive_complex"sem="G#protein_complex"><conslex="CD28"sem="G#protein_molecule">CD28</cons>-responsivecomplex</cons>and<conslex="IL-2_expression"sem="G#other_name"><conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expression</cons>.</sentence>
<sentence>Delineationofthe<conslex="CD28_signaling_cascade"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>signalingcascade</cons>wasfoundtoinvolve<conslex="protein_tyrosine_kinase_activity"sem="G#other_name"><conslex="protein_tyrosine_kinase"sem="G#protein_family_or_group">proteintyrosinekinase</cons>activity</cons>,followedbytheactivationof<conslex="phospholipase_A2"sem="G#protein_molecule">phospholipaseA2</cons>and<conslex="5-lipoxygenase"sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Ourdatasuggestthat<conslex="lipoxygenase_metabolite"sem="G#protein_family_or_group"><conslex="lipoxygenase"sem="G#protein_molecule">lipoxygenase</cons>metabolites</cons>activate<conslex="ROI_formation"sem="G#other_name"><conslex="ROI"sem="G#inorganic">ROI</cons>formation</cons>whichtheninduce<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>expressionvia<conslex="NF-kappa_B_activation"sem="G#other_name"><conslex="NF-kappa_B"sem="G#protein_molecule">NF-kappaB</cons>activation</cons>.</sentence>
<sentence>Thesefindingsshouldbeusefulfor<conslex="therapeutic_strategies"sem="G#other_name">therapeuticstrategies</cons>andthedevelopmentof<conslex="immunosuppressants"sem="G#other_name">immunosuppressants</cons>targetingthe<conslex="CD28_costimulatory_pathway"sem="G#other_name"><conslex="CD28"sem="G#protein_molecule">CD28</cons>costimulatorypathway</cons>.</sentence>
</abstract>
I want to extract the text between quotation marks. For example, my desired output looks like this:
"CD28_surface_receptor" "G#protein_family_or_group" "CD28" "G#protein_molecule"
"primary_T_lymphocyte" "G#cell_type"
I hope there is a simple way of doing this in Notepad++ using a regex. The task might become easy if there is a way to extract text on the basis of its color in Notepad++.
Try the following:
"\w+"|"G#\w+"
The alternation operator | only works in the latest versions of Notepad++.
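One thing worth checking before relying on that pattern: \w does not match a hyphen, so quoted values such as "IL-2" would be skipped. The sketch below, in Python rather than Notepad++, compares the answer's pattern with a broader alternative, "[^"]*", which is an assumption introduced here and not part of the original answer. The sample string is stitched together from fragments of the question's data.

```python
import re

sample = (
    '<conslex="CD28_surface_receptor"sem="G#protein_family_or_group">'
    '<conslex="CD28"sem="G#protein_molecule">CD28</cons>surfacereceptor</cons>'
    '(<conslex="IL-2"sem="G#protein_molecule">IL-2</cons>)'
)

# Pattern from the answer: word characters only, optionally prefixed by G#.
answer_pattern = re.compile(r'"\w+"|"G#\w+"')

# Broader alternative (assumption): anything between a pair of double quotes.
broader_pattern = re.compile(r'"[^"]*"')

answer_hits = answer_pattern.findall(sample)
broader_hits = broader_pattern.findall(sample)

print(answer_hits)
print(broader_hits)
```

In Notepad++ the broader pattern should be usable in the same Find dialog with the regular expression search mode selected.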

Find a specific word in text files between double quotation marks

I have many text files and need to locate certain words that may appear anywhere in a file's content, but I only want the occurrences that are inside quotation marks.
Example: Find in the text below the word "search" only if in quotes (the word "search" may vary).
1. text text text text text text search text
2. text "search text text text text" text
3. text "SEARCH text text text text" text
For this precise example, I would expect only the words of lines 2 and 3.
Thanks to anyone who can help me.
If you can guarantee that there'll be only one set of quotes, then
/".*search.*"/i
should do. But if there can be more than one pair of quotes, then you have to ensure that an even number of quotes have been passed, lest you mistake a closing quote for an opening quote:
/^[^"]*("[^"]*"[^"]*)*"[^"]*search[^"]*"/i
Here's a demo. (Note that the demo contains \ns purely for presentation purposes.) If you see two #s in the demo regex, please replace them with parentheses ( )—it is a limitation of the way RegexPal encodes its data in the URL.
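The even-number-of-quotes pattern can be tried against the question's three lines with a short Python sketch. The helper name quoted_word_pattern and the fourth test line are assumptions added for illustration; the regex itself is the one given above, with the search word escaped and spliced in.

```python
import re

def quoted_word_pattern(word):
    """Build the 'even number of quotes before the opening quote' regex for a word."""
    return re.compile(
        r'^[^"]*("[^"]*"[^"]*)*"[^"]*' + re.escape(word) + r'[^"]*"',
        re.IGNORECASE,
    )

lines = [
    '1. text text text text text text search text',
    '2. text "search text text text text" text',
    '3. text "SEARCH text text text text" text',
    # Extra case: the word sits between two quoted sections, not inside one.
    '4. text "other" search "other" text',
]

pattern = quoted_word_pattern("search")
matching = [line for line in lines if pattern.search(line)]

for line in matching:
    print(line)
```

Only lines 2 and 3 are reported. Line 4 is rejected because the quote that precedes the word is a closing quote, which is exactly the case the simpler /".*search.*"/i pattern would get wrong.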
If you want all words between double quotes, I would simply use grep:
grep -E -o '".*"' inputfile
If you want only the first word:
sed -E 's/.+"([[:alpha:]]+) .*/\1/' inputfile

Regex: How to leave out webding font characters?

I've a free text field on my form where the users can type in anything. Some users are pasting text into this field from Word documents with some weird characters that I don't want to go in my DB. (e.g. webding font characters) I'm trying to get a regular expression that would give me only the alphanum and the punctuation characters.
But when I try the following, the output is still all the characters. How can I leave them out?
<html><body><script type="text/javascript">var str="";document.write(str.replace(/[^a-zA-Z 0-9 [:punct]]+/g, " "));</script></body></html>
If you want only ASCII, use /[^ -~]+/g as the regex. The problem is your [:punct:] expression: JavaScript regular expressions do not support POSIX character classes such as [:punct:], so those characters are taken literally.
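The range trick in that regex relies on space and ~ being the first and last printable ASCII characters, so [^ -~] means "anything that is not printable ASCII". Here is a minimal sketch of the same replacement in Python; the function name keep_printable_ascii is made up for this example. Note that tabs and newlines also fall outside the range and get replaced.

```python
import re

def keep_printable_ascii(text: str) -> str:
    """Replace every run of characters outside the range ' '..'~' with one space."""
    return re.sub(r"[^ -~]+", " ", text)

# \uf04a is a private-use code point, used here as a stand-in for a pasted symbol-font character.
print(keep_printable_ascii("abc\uf04adef"))
print(keep_printable_ascii("Hello, world! (ok)"))
```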

Regex with Tab delimited text containing \x09

I've got a tough one.
I've got tab-delimited text to match with a regex.
My regex looks like:
^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$
and an example source text is (tabs converted to \t for clarity):
JJ\t345\t0\tTest\tSome test text\tmore text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"\tNone
However, the problem is that in my source text, the 6th field contains a regex string. Therefore, it can contain \x09, which naturally blows up the regex since it's seen as a tab as well.
Is there any way to tell the regex engine, "Match on \t but not on the text \x09"? My guess is no, since they're the same thing.
If not, is there any character that could be safely used for delimiting text that contains a regex string?
I would recommend encoding all of the characters in the pcre string prior to running the regular expression against it.
Seems like a problem with the test case. A regex might have tabs in it, but your sample above doesn't. Your string in Java would look like:
String testString = "JJ\t345\t0\tTest\tSome test text\tmore text: pcre:\"/\\x20\\x62\\x3b\\x0a\\x09\\x61\\x2e\\x53\\x74\\x61\\x72/\"\tNone";
If you look at this string in the debugger you'll have \x09 as 4 characters instead of as 1 (the tab).
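The same point can be illustrated with a short Python sketch: when the sixth field holds the four characters backslash, x, 0, 9, the question's regex matches, and it only fails if that text is turned into a real tab. The variable names below are assumptions for the example; the regex and field values come from the question.

```python
import re

# The regex from the question, unchanged.
ROW = re.compile(r'^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$')

# Raw string: \x09 stays as four literal characters, not a tab.
pcre_text = r'more text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"'
fields = ["JJ", "345", "0", "Test", "Some test text", pcre_text, "None"]

good_line = "\t".join(fields)                  # real tabs only between fields
bad_line = good_line.replace(r"\x09", "\t")    # the escape decoded into a real tab

good_match = ROW.match(good_line)
bad_match = ROW.match(bad_line)

print(good_match is not None)  # True
print(bad_match is None)       # True
```

So the regex only breaks if something upstream decodes the escape sequences before matching, which is worth checking in the code that reads the source text.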