replacing specific characters in between specific elements - regex

I'd like to use a regular expression to replace a space in a string. The space in question is the only space between two elements in the string. The string itself, however, contains many more elements and spaces. So far I've tried
(<-)[\s]*?(->)
But that doesn't work. It is supposed to take
<-word anotherword->
and allow me to replace the space in it.
As \s selects all spaces, and
(<-)[\s\S]*?(->)
selects all characters between the <- and ->, I tried to reuse the expression, but for the spaces only.
I'm not so good at these expressions, and I can't for the life of me find an answer anywhere.
If anyone could just point me to the answer, that would be great. Thanks.

It's difficult to be sure what you want; please post some before-and-after examples and specify what language you are using.
But it looks like (<-\S+)\s*(\S+->) should probably do it (it deletes the spaces).
If the <- and -> are NOT to be preserved, move them out of the parentheses, like so:
<-(\S+)\s*(\S+)->
Here's what it would look like in JavaScript:
var before = "Ten years ago a crack <-flegan esque-> unit was sent to prison by a military "
+ "court for a crime they didn't commit.\n"
+ "These men promptly escaped from a maximum security stockade to the "
+ "<-flargon 666-> underground.\n"
+ "Today, still wanted by the government, they survive as soldiers of fortune.\n"
+ "If you have a problem and no one else can help, and if you can find them, "
+ "maybe you can hire the <-flugen 9->.\n"
;
// Drop the whitespace between the two captured halves of each <-...-> element.
var after = before.replace(/(<-\S+)\s*(\S+->)/g, "$1$2");
alert(after);
Which yields:
Ten years ago a crack <-fleganesque-> unit was sent to prison by a military court for a crime they didn't commit.
These men promptly escaped from a maximum security stockade to the <-flargon666-> underground.
Today, still wanted by the government, they survive as soldiers of fortune.
If you have a problem and no one else can help, and if you can find them, maybe you can hire the <-flugen9->.
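If the space should be replaced with something rather than deleted, or if an element may contain more than one space, a replacer callback is one option. This is only a sketch: replaceInnerSpaces is a made-up helper name, and the outer pattern is the asker's <-[\s\S]*?-> idea used to find each element before the whitespace inside it is rewritten.

```javascript
// Hypothetical helper (not from the original answer): rewrite every run of
// whitespace inside each <-...-> element, leaving the rest of the text alone.
function replaceInnerSpaces(text, replacement) {
  // Outer pattern: lazily match one <-...-> element at a time.
  return text.replace(/<-[\s\S]*?->/g, (element) =>
    // Inner replace only sees the matched element; a function is used so that
    // "$" sequences in the replacement are not interpreted specially.
    element.replace(/\s+/g, () => replacement)
  );
}

console.log(replaceInnerSpaces("keep this <-word anotherword-> and this", "_"));
// prints "keep this <-word_anotherword-> and this"
```

Spaces outside the markers are untouched because the inner replace never sees them.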

Related

Regex select words longer than 4 characters but only one instance if duplicates

I am trying to format text in InDesign using GREP Style.
The goal is to select words longer than 4 letters in a paragraph, but if a word is duplicated within the paragraph, no more than the first instance of that word should be selected.
This is sample text:
"The Lord's right hand is lifted high; the Lord's right hand has done mighty things!"
The solution should give
Lord right hand lifted high done mighty things
I have done the first part:
[[:word:]]{4,}
but I don't have a clue how to deal with the duplicates.
Is order a requirement? If not, how about words longer than 4 characters not followed by that same word later in the text? See:
([[:word:]]{4,})(?!.*\1)
https://regex101.com/r/Ug4dLZ/1
Result: lifted high Lord right hand done mighty things
To be more comprehensive, include word breaks (i.e. count "Person" and "Personhood" as 2 separate words):
([[:word:]]{4,})(?!.*\b\1\b)
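To compare the two behaviours outside InDesign, here is a rough JavaScript sketch. Assumptions: JavaScript has no [[:word:]] class, so \w stands in for it, and firstOccurrences is a hypothetical helper that keeps the first instance of each word (what the question asked for) instead of the last (what the lookahead gives).

```javascript
const text =
  "The Lord's right hand is lifted high; the Lord's right hand has done mighty things!";

// Lookahead version: a word matches only if the same whole word does NOT
// appear again later, so the LAST occurrence of each word is kept.
const lastOccurrences = text.match(/\b(\w{4,})\b(?!.*\b\1\b)/g);

// Hypothetical helper: keep the FIRST occurrence by remembering seen words.
function firstOccurrences(s, minLength = 4) {
  const seen = new Set();
  const result = [];
  for (const word of s.match(/\w+/g) || []) {
    if (word.length >= minLength && !seen.has(word)) {
      seen.add(word);
      result.push(word);
    }
  }
  return result;
}

console.log(lastOccurrences.join(" "));
// prints "lifted high Lord right hand done mighty things"
console.log(firstOccurrences(text).join(" "));
// prints "Lord right hand lifted high done mighty things"
```

The Set version gives the order from the question, but it needs a scripting step; a single lookahead-only pattern can only express "not repeated later".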

RegEx to clean VISA merchant names (remove random strings)

I am trying to develop a regex (.NET flavor) that I can use to clean VISA merchant names.
Examples:
Norton *AP1223506209 --> Norton *AP
Norton *AP1223511428
EUROWINGS VYJD6J_123001 --> EUROWINGS
EUROWINGS W6PDFI_125626
AER LINGUCB22QKM2 --> AER LINGUCB
AER LINGUCB248L2W
AIR FRANCE JWNCSC --> AIR FRANCE
AIR FRANCE K8L7TT
PAYPAL *AIRBNB HMQXBW --> PAYPAL *AIRBNB
PAYPAL *AIRBNB HMQXNZ
SAS 1174565172360 --> SAS
SAS 1174565172368
I would like to keep the first "name" part, but remove the second "gibberish" part.
The following regex works for Norton and AER LINGU, as well as for EUROWINGS and AIR FRANCE if the gibberish part contains numbers. It fails completely for PAYPAL *AIRBNB and other strings whose gibberish part contains no numbers, and also for SAS, probably because the name is too short or there are too many spaces:
Search:
([A-z *-]{2,50}[A-z]{2,50})(.{0,3}([0-9-]{0,3}[A-z *+.#-/]{0,3}){1,10})
Replace:
$1
Is there any way to make this work for gibberish parts that don't contain numbers? I have something like this in mind, but haven't managed to write a matching regex:
Group 1 (to keep)
Must contain consonants and vowels
Can contain few numbers, spaces or punctuation signs (e.g.: "7x7: Taxi Service")
Group 2 (to be removed)
Consists of sequences of numbers, letters and optional punctuation signs
OR: consists of consonants, only
OR: consists of numbers, only
Thanks for any help and best regards
Pesche
Edit:
If I add more examples, Linden's solution still works quite well, but it does not recognize all of the examples, and in some cases it matches too much of the string. I tried to adjust it, but with my limited skills didn't quite succeed:
https://regex101.com/r/7y9zGl/4
The following problems remain:
With a length of 6 for the last \w, longer patterns are not matched in full (e.g. after easyjet and after EMP Merchan). Increasing it, however, causes other strings to be truncated (e.g. AER LINGU, potentially also HOTELS.COM if > 12 was used).
The merchant names after PAYPAL * and GOOGLE * should not be deleted, as they are true merchant names. I tried to exclude strings containing GOOGLE * with a negative lookbehind, but it does not seem to work like that.
Whereas the merchant name after PAYPAL * should generally remain, in some cases it is followed by gibberish, e.g. PAYPAL *AIRBNB HMQXBW. If the negative lookbehind worked, those cases would no longer be cleaned.
If the merchant name is not followed by gibberish, part of the name itself may be deleted (e.g. EMP Merchan).
As the full list of merchant names is long and versatile, the approach to detect "gibberish" should be as generic as possible (i.e. not rely on a certain length of the gibberish part). Hence my original, now slightly modified "pattern":
Consists of sequences of numbers, letters and optional punctuation signs
OR: contains no or very few vowels (EASYJET 000ESJ5TWN -> the gibberish contains only one vowel, EASYJET 3 of them; PAYPAL *NITSCHKE -> NITSCHKE should not be matched, it contains 2 vowels)
OR: consists of numbers, only
Is such a thing even possible? The goal is to use SQL to clean the merchant names. If necessary, this can be done in several run throughs (for different kind of patterns).
Thx again!
Updated regex based on extended sample and desired results:
[\s*<]+\d+$|[\s*<]+(?![A-Z]{6}.*)\w*\d[\w>]*$|\d{6,}$|[\s*<]+[A-Z]{6}$|(?![A-Z]+$)(?<=[A-Z])\w{6}$
Demo
I cannot validate as I'm only on my phone, but can you try something like this?
^([0-9A-Za-z\*][ ]{0,2})
Take all the digits, the letters (upper and lower case), the star, and at most 2 spaces from the beginning of the line.
Please check the () but I guess the idea is here.
Sorry, it seems wrong when there is no double space.
According to your examples, you want to take all the characters up to 2 spaces or 2 digits.
.* {2}|.*[0-9]{2}
Is it better?
Regards,
Thomas
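The heuristic from the question (a trailing token is gibberish if it contains digits, or has no or very few vowels) may be easier to express in code than in one regex. The following is only a sketch of that idea in JavaScript, not SQL and not any of the regexes above; looksLikeGibberish, cleanMerchantName and the thresholds (at least 4 letters, at most 1 vowel) are assumptions chosen to fit the listed examples.

```javascript
// Hypothetical heuristic: does this whitespace-separated token look random?
function looksLikeGibberish(token) {
  if (/\d/.test(token)) return true; // any digit marks the token as a code
  const letters = token.replace(/[^A-Za-z]/g, "");
  const vowels = (letters.match(/[AEIOUaeiou]/g) || []).length;
  // Assumed thresholds: 4+ letters with at most one vowel.
  return letters.length >= 4 && vowels <= 1;
}

// Strip gibberish tokens from the end, but never remove the first token.
function cleanMerchantName(name) {
  const tokens = name.trim().split(/\s+/);
  while (tokens.length > 1 && looksLikeGibberish(tokens[tokens.length - 1])) {
    tokens.pop();
  }
  return tokens.join(" ");
}

console.log(cleanMerchantName("PAYPAL *AIRBNB HMQXBW")); // prints "PAYPAL *AIRBNB"
console.log(cleanMerchantName("PAYPAL *NITSCHKE"));      // prints "PAYPAL *NITSCHKE"
```

Because it works on whole tokens, this sketch does not handle codes fused onto the name: Norton *AP1223506209 comes out as Norton and AER LINGUCB22QKM2 as AER, so those cases would still need a separate rule.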

Regex lookbehind - excluding words from searches

I need to search my corpus for words such as game or shame, but I would like the search to exclude three kinds of strings: a game/a shame, A game/A shame, and a/an/A/An WORD game or a/an/A/An WORD shame, where WORD is a modifier, e.g., a great game or a great shame.
If someone could help me out, that would be great, thanks!
In my corpus, the optional WORD between the indefinite article a/an and game or a/an and shame is most commonly great and real. So even excluding these two, would already help me a lot.
The lookbehind below works perfectly to exclude a/A
(?<!a\s|A\s)\bshame\b
To exclude the modifying WORD, I was trying to use ?\w in the lookbehind grep, but it just wouldn't work - the grep below without ? runs and it still excludes examples such as a shame, but it still returns the undesired examples such as a great shame or a crying shame - see concordance lines (3) and (4) in the sample text below:
(?<!a\s|A\s|a\b\w\b|A\b\w\b)\bshame\b
The tool I'm using to implement regex is AntConc, which supports Perl regular expressions.
Sample text with two irrelevant examples (3 & 4) after using the search string below
(?<!a\s|A\s)\bshame\b
1 (match shame)
, people ogling from the sidelines. If you want a closer look, you have to ring for entry and wait to be admitted. I guess me and Saul just have no shame (or just know the benefits of our bank accounts being in hard currencies), because we wandered into plenty. Lots and lots of little boutiques and edgily designed fashion stores with music blaring.& abbutterflie.txt 47 1
2 (match shame)
last twenty years and I've experienced all sorts of biggotry but I seriously thought that anti black nazism in football wass a thing of the past. You should all hang your heads in shame, bunch of [badword]s. adamdphillips.txt 57 1
3 (don't match shame)
me monetarily as I wasn't that close to her, but she was really good friends with the other girl and it's messed that up for them a bit, which is a great shame. Anyway, Holly and I have since found somewhere to move in just the two of us. It's going to cost an absolute fortune and I'm going to be eating basics beans on aderyn.txt 60 1
4 (don't match shame)
are loads of amazingly good bands out there, gigging up and down the country who will never get signed because no-one can figure out how to market them, and this is a crying shame. There are artists out there like Thea Gilmore and <a href="http://blog.amandapalmer.net/" rel="nofollow"> Amanda Palmer& aderyn.txt 60 2
5 (match shame)
/><br />"There is no better time to show these terrorists that we have no fear of them. Instead we are forced, through the cowardly acts of our superiors, to hide in shame."<br /><br />But Herb Wiseman, high school consultant for Lee County, Florida, pointed to the July 7 London bombings.<br /><br />"What happens if kids get on aggy91.txt 64 1
Because variable length negative lookbehinds are not allowed, the approach in your previous question's answer won't transfer to this one.
I've gone with a (*SKIP)(*FAIL) pattern. This will match and discard the disqualified matches, and only retain qualifying matches:
/[Aa]n?( \w+)? shame(*SKIP)(*FAIL)|shame/ 3844 steps (Demo)
Or if you wish to include word boundary metacharacters:
/\b[Aa]n?( \w+)? shame\b(*SKIP)(*FAIL)|\bshame\b/ 4762 steps (Demo)
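For engines without (*SKIP)(*FAIL), the same "match the unwanted form first, then discard it" idea can be approximated with an alternation and a capture group. Below is a rough JavaScript sketch of that substitute technique, not something AntConc itself runs; findStandalone is a hypothetical helper, and it assumes the search word contains no regex metacharacters.

```javascript
// Hypothetical helper: return the start index of every occurrence of `word`
// that is NOT preceded by a/an/A/An plus an optional single modifier.
function findStandalone(text, word) {
  // Left branch consumes the disqualified phrases; only the right branch
  // fills capture group 1, so group 1 marks the matches worth keeping.
  const pattern = new RegExp(
    `\\b[Aa]n?(?: \\w+)? ${word}\\b|\\b(${word})\\b`,
    "g"
  );
  const hits = [];
  for (const m of text.matchAll(pattern)) {
    if (m[1] !== undefined) hits.push(m.index);
  }
  return hits;
}

const sample = "which is a great shame. I guess we just have no shame.";
console.log(findStandalone(sample, "shame").length); // prints 1
```

As with the patterns above, only one modifier word is allowed for, so something like "a real great shame" would still be reported.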

Regex for removing periods at end of sentences not abbreviations

Looking for some ideas on how to remove the period character in sentences but not remove the periods in abbreviations. For instance
"The N.J. turnpike is long. Today is a beautiful day."
Would be changed to:
"The N.J. turnpike is long Today is a beautiful day"
This is a hard problem. Lingua::EN::Sentence makes a three-quarters assed attempt to solve it. It knows about common abbreviations in American English and has hooks for you to add other abbreviations you know about.
As others have said, this is a very difficult task in the general case. If you want to learn more, you should start out by reading more about "sentence segmentation" or "sentence boundary disambiguation," which is the task of dividing a text into sentences. Here are a few links to get you started:
http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
http://en.wikipedia.org/wiki/Text_segmentation#Sentence_segmentation
http://www.robincamille.com/2012-02-18-nltk-sentence-tokenizer/
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html#sec-further-examples-of-supervised-classification
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html#punkt-tokenizer
Why would you want to remove periods at the end of a sentence in light of abbreviations? Go BIG: remove all dots, or go NONE!
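For simple input like the sample sentence, a naive heuristic may be enough. The JavaScript sketch below is an assumption-laden shortcut, not real sentence segmentation: stripSentencePeriods is a made-up name, and it treats a period as part of an abbreviation only when it directly follows a single capital letter (as in N.J.).

```javascript
// Hypothetical helper: remove periods that end a sentence, keeping periods
// that follow a single capital letter (initial-style abbreviations).
function stripSentencePeriods(text) {
  // (?<!\b[A-Z])  the period is not right after a lone capital letter
  // \.            the period itself
  // (?=\s|$)      followed by whitespace or the end of the text
  return text.replace(/(?<!\b[A-Z])\.(?=\s|$)/g, "");
}

console.log(stripSentencePeriods("The N.J. turnpike is long. Today is a beautiful day."));
// prints "The N.J. turnpike is long Today is a beautiful day"
```

Abbreviations such as Dr. or etc. would lose their period, and a sentence ending in a single capital letter would keep its period, which is where the abbreviation lists and tokenizers mentioned above come in.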

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with regular expressions. But now that it is being migrated to Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 and 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
dates (month/day/year, with leap-year check)
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE @s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE @s + ' ' LIKE '[3-7]' + REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(@s) = 15 AND @s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(@s) = 16 AND @s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
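The format checks above say nothing about whether the number itself is valid. If the validation is moved outside the database, the usual checksum for card numbers is the Luhn algorithm. Here is a minimal JavaScript sketch of it; passesLuhn is a hypothetical helper name, and it assumes dashes and spaces have already been stripped.

```javascript
// Hypothetical helper: Luhn checksum over a string of digits.
// Assumes separators (dashes, spaces) were removed beforehand.
function passesLuhn(digits) {
  if (!/^[0-9]+$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false; // the rightmost digit is not doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9; // same as adding the two digits of the product
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

console.log(passesLuhn("4111111111111111")); // prints true
console.log(passesLuhn("4111111111111112")); // prints false
```

A passing checksum only means the number is well formed, not that the card exists.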
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 0 or 1; if there are other rules, I am not aware of them.
WHERE @s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - it seems overkill to do invasive pattern matching for a set of 7 possible values. Similarly, with a list of 1000 numbers or years, it will be much easier (and probably more efficient) to check whether the numeric value is in a table than to convert it to a string and see if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
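As a sketch of what "validate before it gets into the database" could look like in application code, the LIKE patterns above translate fairly directly. The function names below are invented for illustration, JavaScript is just one possible host language, and the rules are the same loose format checks as in the T-SQL, nothing stricter.

```javascript
// Remove the separators the T-SQL examples assume are already stripped.
const stripSeparators = (s) => s.replace(/[\s()-]/g, "");

// One numeric digit, like: @s LIKE '[0-9]'
function isSingleDigit(s) {
  return /^[0-9]$/.test(s);
}

// 15 digits starting with 3, or 16 digits starting with 4-7.
function isCardFormat(s) {
  const d = stripSeparators(s);
  return /^3[0-9]{14}$/.test(d) || /^[4-7][0-9]{15}$/.test(d);
}

// 10 digits, first digit 2-9, like: @s LIKE '[2-9]' + REPLICATE('[0-9]', 9)
function isUsPhoneFormat(s) {
  return /^[2-9][0-9]{9}$/.test(stripSeparators(s));
}

console.log(isCardFormat("4111-1111-1111-1111")); // prints true
console.log(isUsPhoneFormat("(123) 555-0123"));   // prints false
```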
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or make complex use of built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.