Regular expression in Groovy. Catch fragment of a String - regex

Given the following input:
BGM+220+105961-44+9'
DTM+137:20140121:102'
NAD+BY+0048003479::91'
NAD+SE+0000805406::91'
NAD+DP+0048003479::91'
CUX+2:USD+9'
PIA+1+M1PL05883LOT+":BP::92'
PIA+1+927700077001:VP::91'
PRI+AAA:9:::1:PCE'
SCC+1'
QTY+21:10000:PCE'
DTM+2:11022014:102'
PIA+1+M1PL05883LOT+":BP::92'
PIA+1+927700077001:VP::91'
PRI+AAA:9:::1:PCE'
SCC+1'
QTY+21:20000:PCE'
DTM+2:04022014:102'
UNS+S'
UNT++1'
UNZ+1+10596144'
The goal is to capture from the first line:
BGM+220+105961-44+9'
the digits that follow the "-", up to the last digit. In the example above, that would be "44".
Thanks in advance

You could do:
text.tokenize( '\n' ) // split it based on newlines
.head() // grab the first one
.find( /-\d+/ ) // find '-44'
.substring( 1 ) // remove the '-'
Actually, you don't need to split it, so just:
text.find( /-\d+/ )?.substring( 1 )
does the same thing (as it's the first line you're interested in)
Edit after comment:
To get both the numbers surrounding the -, you could do:
def (pre,post) = text.find( /\d+-\d+/ )?.tokenize( '-' )
assert pre == '105961'
assert post == '44'
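For comparison, here is a minimal sketch of the same idea in Python, using capture groups instead of tokenize. The shortened sample text and the variable names are only for illustration.

```python
import re

# Shortened copy of the sample message; only the first line matters here.
text = "BGM+220+105961-44+9'\nDTM+137:20140121:102'\nNAD+BY+0048003479::91'"

first_line = text.splitlines()[0]

# Capture the digits on both sides of the "-".
match = re.search(r"(\d+)-(\d+)", first_line)
pre, post = match.groups() if match else (None, None)

print(pre, post)
```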

Related

IntelliJ: Regular expression to join multiple lines into single CSV line?

Occasionally I need to join multiple lines of data into a single line, and in this case, specifically as comma-separated values on a single line:
input: (lines pasted into some Android Studio editor tab)
Rush
IQ
Saga
Yes
desired output:
'Rush','IQ','Saga','Yes'
Using Edit > Find > Replace, I got close with this regex pattern, which matches the newline character (\n) in order to eliminate it:
search: ^(.*)$\n
replace: '$1',
[x] Regex
but produces this undesired output:
'Rush',IQ
'Saga',Yes
because after a newline is eliminated, the following line is already joined to the current one, so it is skipped, and we get this "every other line" behavior.
The fastest and easiest way I could think of is to replace \n by ',' and then manually wrap the whole line in quotes:
The result of the first replacement would be:
Rush','IQ','Saga','Yes
And then just manually add first and last quote.
Step 1: Concatenate the lines, use
(.+)(?:\R|\z)
Replace with '$1',.
The (.+)(?:\R|\z) pattern matches any 1+ chars other than line break chars as many as possible (.+) and captures this into Group 1 and (?:\R|\z) matches either a line break sequence (\R) or (|) the very end of the string (\z).
Step 2: Post-process by replacing ,$ with an empty string. This pattern matches , at the end of the line.
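The two steps above can be tried outside the IDE. Below is a rough sketch in Python, with two adjustments that are assumptions of this sketch: Python's re module has no \R, so \r?\n stands in for it, and \Z is used as the end-of-string anchor. The capture uses [^\r\n]+ rather than .+ so that a stray carriage return is not captured.

```python
import re

text = "Rush\nIQ\nSaga\nYes"

# Step 1: wrap every line in single quotes and append a comma.
step1 = re.sub(r"([^\r\n]+)(?:\r?\n|\Z)", r"'\1',", text)

# Step 2: remove the trailing comma left over from step 1.
result = re.sub(r",$", "", step1)

print(result)
```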
Occasionally I need to join multiple lines of data into a single line, and in this case, specifically as comma-separated values on a single line:
Regex may not be the best solution for this.
CSV library
There are several comma-separated values (CSV) libraries available to make quick work of this.
The libraries will handle a particular problem you may overlook in writing your own code: some of your input lines may contain the single-quote mark within their content. Such cases need to be escaped. Quoting RFC 4180 section 2.7:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
Here is an example of using Apache Commons CSV library.
We use lambda syntax with a Scanner to get an Iterable of the lines of text from your input.
We specify using a single-quote, as you desire, rather than default of double-quote in standard CSV.
We use try-with-resources syntax to automatically close the CSVPrinter object, whether our code runs successfully or throws an exception.
String input = "Rush\n" +
"IQ\n" +
"Saga\n" +
"Yes";
Iterable < String > iterable = ( ) -> new Scanner( input ).useDelimiter( "\n" ); // Lambda syntax to get an `Iterable` of lines from a `String`.
CSVFormat format =
CSVFormat
.RFC4180
.withQuoteMode( QuoteMode.ALL )
.withQuote( '\'' );
StringBuilder stringBuilder = new StringBuilder();
try (
CSVPrinter printer = new CSVPrinter( stringBuilder , format ) ;
)
{
printer.printRecord( iterable );
}
catch ( IOException e )
{
e.printStackTrace();
}
String output = stringBuilder.toString();
System.out.println( "output: " + output );
When run:
output: 'Rush','IQ','Saga','Yes'
We can shorten that code.
try (
CSVPrinter printer = new CSVPrinter( new StringBuilder() , CSVFormat.RFC4180.withQuoteMode( QuoteMode.ALL ).withQuote( '\'' ) ) ;
)
{
printer.printRecord( ( Iterable < String > ) ( ) -> new Scanner( input ).useDelimiter( "\n" ) );
System.out.println( printer.getOut().toString() ); // Or: `return printer.getOut()` returning an `Appendable` object.
}
catch ( IOException e )
{
e.printStackTrace();
}
Not that the shortened version is particularly better. Personally, I would use the longer version wrapped in a method in a utility class, like this:
public String enquoteLines( String input ) {
String output = "";
Iterable < String > iterable = ( ) -> new Scanner( input ).useDelimiter( "\n" ); // Lambda syntax to get an `Iterable` of lines from a `String`.
CSVFormat format =
CSVFormat
.RFC4180
.withQuoteMode( QuoteMode.ALL )
.withQuote( '\'' );
StringBuilder stringBuilder = new StringBuilder();
try (
CSVPrinter printer = new CSVPrinter( stringBuilder , format ) ;
)
{
printer.printRecord( iterable );
output = printer.getOut().toString();
}
catch ( IOException e )
{
e.printStackTrace();
}
return output;
}
Calling it:
String input = "Rush\n" +
"IQ\n" +
"Saga\n" +
"Oui";
String output = this.enquoteLines( input );
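The same library-based approach can be sketched with Python's standard csv module, shown here only as an illustration of the quoting and escaping behavior described above. The second input, with an embedded single quote, is a made-up example.

```python
import csv
import io

def enquote_lines(text):
    """Join the lines of `text` into one CSV record quoted with single quotes."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        quotechar="'",          # single quote instead of the default double quote
        quoting=csv.QUOTE_ALL,  # quote every field
        lineterminator="",      # no trailing newline after the record
    )
    writer.writerow(text.splitlines())
    return buffer.getvalue()

plain = enquote_lines("Rush\nIQ\nSaga\nYes")
escaped = enquote_lines("Rush\nO'Neil")  # embedded quote gets doubled

print(plain)
print(escaped)
```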

python replace line text with weird characters

How do I replace the following using python
GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298
STEM*333*3001*0030303238
BHAT*3319*33*33377*23330706*031829*RTRCP
NUM4*41*2*My Break Room Place*****6*1133337
I want to replace every character after the first occurrence of '*'. All characters except '*' must be replaced.
Example input:
NUM4*41*2*My Break Room Place*****6*1133337
example output:
NUM4*11*1*11 11111 1111 11111*****1*1111111
Fairly simple: use a callback that returns group 1 unaltered if it matched, and otherwise returns the replacement character 1.
Note - this also would work in multi-line strings.
If you need that, just add (?m) to the beginning of the regex. (?m)(?:(^[^*]*\*)|[^*\s])
You'd probably want to test the string for the * character first.
( ^ [^*]* \* ) # (1), BOS/BOL up to first *
| # or,
[^*\s] # Not a * nor whitespace
Python
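import re

def repl(m):
    # Group 1 is everything up to and including the first *; keep it unchanged.
    if m.group(1):
        return m.group(1)
    # Any other match is a single non-*, non-whitespace character.
    return "1"

line = 'NUM4*41*2*My Break Room Place*****6*1133337'
# str.find() returns -1 when nothing is found, so test membership instead.
if '*' in line:
    newstr = re.sub(r'(^[^*]*\*)|[^*\s]', repl, line)
    print(newstr)
else:
    print('* not found in string')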
Output
NUM4*11*1*11 11111 1111 11111*****1*1111111
If you want to use a regex, you can use this one with re.sub: (?<=\*)[^\*]+. Note that it replaces each run of characters between asterisks with a single 1, rather than one 1 per character.
import re
inputs = ['GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298',
          'STEM*333*3001*0030303238',
          'BHAT*3319*33*33377*23330706*031829*RTRCP',
          'NUM4*41*2*My Break Room Place*****6*1133337']
outputs = [re.sub(r'(?<=\*)[^\*]+', '1', inputline) for inputline in inputs]
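As a cross-check, here is a small sketch that does the per-character masking without a regex and puts it next to the lookbehind substitution. The helper name mask_after_first_star and its fill parameter are hypothetical, introduced only for this illustration.

```python
import re

def mask_after_first_star(line, fill="1"):
    """Replace every character after the first '*' except '*' and whitespace."""
    head, star, tail = line.partition("*")
    masked = "".join(ch if ch == "*" or ch.isspace() else fill for ch in tail)
    return head + star + masked

line = "NUM4*41*2*My Break Room Place*****6*1133337"

per_char = mask_after_first_star(line)           # one fill per character
per_field = re.sub(r"(?<=\*)[^*]+", "1", line)   # one fill per field

print(per_char)
print(per_field)
```

Only the first result matches the example output in the question; the lookbehind version collapses each field.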

validator.addMethod for checking before and end whitespaces

I want to validate that a field has no white space either before or after the text string. Spaces in the middle of the string are allowed.
Here is my code
$.validator.addMethod("trimLookup", function(value, element) {
regex = "^[^\s]+(\s+[^\s]+)*$";
regex = new RegExp( regex );
return this.optional( element ) || regex.test( value );
}, $.validator.format("Cannot contains any spaces at beginning or end"));
I tested the regex at https://regex101.com/ and it works fine. I also tested this code with other regexes and it works. But if I enter " " or " abc " it doesn't work.
Any suggestions?
Thank you for your time!
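One thing worth checking, offered as a likely cause rather than something confirmed in this thread: in a JavaScript string literal, "\s" evaluates to a plain "s", so a pattern built with new RegExp from such a string no longer contains \s unless the backslashes are doubled. The pattern itself behaves as intended, which the following Python sketch shows with a raw string. The function name is_trimmed is hypothetical.

```python
import re

# Same pattern as in the question, written as a raw string so \s survives.
TRIMMED = re.compile(r"^[^\s]+(\s+[^\s]+)*$")

def is_trimmed(value):
    """True when value has no leading or trailing whitespace and is not blank."""
    return TRIMMED.search(value) is not None

results = {value: is_trimmed(value) for value in [" ", " abc ", "abc", "a b  c"]}
print(results)
```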

QRegExp not finding expected string pattern

I am working in Qt 5.2, and I have a piece of code that takes in a string and enters one of several if statements based on its format. One of the formats searched for is the letters "RCV", followed by a variable number of digits, a decimal point, and then one more digit. There can be more than one of these values in the line, separated by "|"; for example, it could be a single value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9". Right now I am using the QRegExp class to find this, like this:
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^[RCV\d+\.\d\|?]+$").exactMatch(value))
std::cout << ":D" << std::endl;
else
std::cout << ":(" << std::endl;
I want it to use the if statement, but it keeps going into the else statement. Is there something I'm doing wrong with the regular expression?
Your expression should be written as @vahancho mentioned in a comment:
if(QRegExp("^[RCV\\d+\\.\\d\\|?]+$").exactMatch(value))
If you use C++11, then you can use its raw strings feature:
if(QRegExp(R"(^[RCV\d+\.\d\|?]+$)").exactMatch(value))
Aside from escaping the backslashes, which others have mentioned in answers and comments,
There can be more than one of these values in the line, separated by "|", for example it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9".
[RCV\d+\.\d\|?] may not be doing what you expect. Perhaps you want () instead of []:
/^
[RCV\d+\.\d\|?]+ # One or more characters from the list:
# R, C, V, a digit, a +, a dot, a digit, a |, a ?
$/x
/^
(
RCV\d+\.\d # RCV, some digits, a dot, followed by a digit
\|? # Optional: a |
)+ # Quantifier of one or more
$/x
Also, maybe you could revise the regex such that the optional | requires the group to be matched *again*:
/^
(RCV\d+\.\d) # RCV, some digits, a dot, followed by a digit
(
\|(?1) # A |, then match subpattern 1 (Above)
)* # Quantifier of zero or more, so a single value still matches
$/x
To check that the line contains only valid occurrences, while also requiring a | before the second and each later occurrence (your implementation would not require the |, even with the backslashes doubled):
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^RCV\\d+\\.\\d(\\|RCV\\d+\\.\\d)*$").exactMatch(value))
std::cout << ":D" << std::endl;
else
std::cout << ":(" << std::endl;
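The difference between the character-class pattern and the grouped pattern is easy to see on a few inputs. The sketch below uses Python's re module rather than QRegExp, so it only illustrates the patterns themselves. The sample strings other than the first two are made up.

```python
import re

# Character class: any mix of R, C, V, digits, +, ., | and ? is accepted.
CHAR_CLASS = re.compile(r"^[RCV\d+\.\d\|?]+$")

# Grouped form: one value, then zero or more "|value" repetitions.
GROUPED = re.compile(r"^RCV\d+\.\d(\|RCV\d+\.\d)*$")

samples = [
    "RCV000030249.2|RCV000035360.2",  # two values, valid
    "RCV0123456.1",                   # one value, valid
    "VVV...|||",                      # nonsense built from the allowed characters
    "RCV12345.1RCV678.9",             # missing separator
]

for sample in samples:
    loose = CHAR_CLASS.match(sample) is not None
    strict = GROUPED.match(sample) is not None
    print(sample, loose, strict)
```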

I want to replace ',' on the 150th location in a String with a <br>

My String is : PI Last Name equal to one of
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN','ALVAREZ','AMARYAN','AMBESI-IMPIOMBATO','AMEGBETO','AMOWITZ', 'ANAGNOSTARAS','ANAND','ANDERSEN','ANDERSON', 'ANDRADE','ANDREEFF','ANDROPHY','ANGER','ANHOLT','ANTHONY','ANTLE','ANTONELLI','ANTONY', 'ANZULOVICH', 'APODACA','APOSHIAN','APPEL','APPLEBY','APRIL','ARAUJO','ARBIB','ARBOLEDA', 'ARCHAKOV','ARCHER', 'ARECHAVALETA-VELASCO','ARENS','ARGON','ARGYROKASTRITIS', 'ARIAS','ARIZAGA','ARMSTRONG','ARNON', 'ARSHAVSKY','ARVIN','ASATRYAN','ASCOLI','ASKENASE','ASSI','ATALAY','ATANASOVA','ATKINSON','ATTYGALLE','ATWEH','AU','AVETISYAN','AWE','AYOUB','AZAD','BACSO','BAGASRA','BAKER','BALAS', 'BALCAZAR','BALK','BALKAY','BALLOU','BALRAJ','BALSTER','BANERJEE','BANKOLE','BANTA','BARAL','BARANOWSKA','BARBAS', 'BARBER','BARILLAS-MURY','BARKHOLT','BARNES','BARNETT','BARRETT','BARRIA','BARROW','BARROWS','BARTKE','BARTLETT','BASSINGTHWAIGHTE','BASSIOUNY','BASU','BATES','BATTAGLIA','BATTERMAN','BAUER','BAUERLE','BAUM','BAUME', 'BAUMLER','BAVISTER','BAWA','BAYNE','BEASLEY','BEATTY','BEATY','BEBENEK','BECK','BECKER','BECKMAN','BECKMAN-SUURKULA' ,'BEDFORD','BEDOLLA','BEEBE','BEEMON','BEHETS','BEHRMAN','BEIER','BEKKER','BELL','BELLIDO','BELMAIN', 'BENATAR','BENBENISHTY','BENBROOK','BENDER','BENEDETTI','BENNETT','BENNISH','BENZ','BERG','BERGER','BERGEY','BERGGREN','BERK','BERKOWITZ','BERLIN','BERLINER','BERMAN','BERTINO','BERTOZZI','BERTRAND','BERWICK','BETHONY','BEYERS','BEYRER' 
,'BEZPROZVANNY','BHAGWAT','BHANDARI','BHARGAVA','BHARUCHA','BHUJWALLA','BIANCO','BIDLACK','BIELERT','BIER','BIESSMANN','BIGELOW' ,'BILLER','BILLINGS','BINDER','BINDMAN','BINUTU','BIRBECK','BIRGE','BIRNBAUM','BIRO','BIRT','BISHAI','BISHOP','BISSELL','BJORKEGREN','BJORNSTAD','BLACK','BLANCHARD','BLASS','BLATTNER','BLIGNAUT','BLOCH','BLOCK','BLOOM','BLOOM,','BLUM','BLUMBERG' ,'BLUMENTHAL','BLYUKHER','BODDULURI','BOFFETTA','BOGOLIUBOVA', 'BOLLINGER','BOLLS','BOMSZTYK','BONANNO','BONNER','BOOM','BOOTHROYD','BOPPANA','BORAWSKI','BORG','BORIS-LAWRIE','BORISY','BORLONGAN','BORNSTEIN','BORODOVSKY','BORST','BOS','BOTO','BOWDEN','BOWEN','BOYCE-JACINO','BRADEN','BRADY' ,'BRAITHWAITE','BRANN','BRASH','BRAUNSTEIN', 'BREMAN','BRENNAN','BRENNER','BRETSCHER','BREW','BREYSSE','BRIGGS','BRITES','BRITT','BRITTENHAM','BRODIE','BRODY','BROOK','BROOTEN','BROSCO','BROSNAN','BROWN','BROWNE','BRUCKNER','BRUNENGRABER','BRYL','BRYSON','BU','BUCHAN','BUDD','BUDNIK', 'BUEKENS','BUKRINSKY','BULLMORE','BULUN','BURBANO','BURGENER','BURGESS','BURKS','BURMEISTER','BURNETT','BURNHAM','BURNS','BURRIDGE','BURTON','BUSCIGLIO','BUSHEK','BUSIJA','BUZSAKI','BZYMEK','CABA')
I need a regex that greedily looks for up to 150 characters, with the last character being a ',', and then replaces that last ',' of the 150 with a <br />
Any suggestions, please?
I used ','(?=[^()]*\)) but this one replaces all the occurrences. I want only the 150th ones to be replaced.
Thanks everyone for your suggestions. I managed to do it with Java code instead of regex.
StringBuilder sb = new StringBuilder(html);
int i = 0;
while ((i = sb.indexOf("','", i + 150)) != -1) {
int j = sb.lastIndexOf("','", i + 150);
sb.insert(i+1, "<BR>");
}
return sb.toString();
However, this breaks at the first occurrence of ',' in the 150 chars.
Can anyone help me modify my code so that it breaks at the last occurrence of ',' within the 150 chars?
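The logic being asked for, breaking at the last separator inside each 150-character window, can be sketched as follows in Python. The function name, its parameters, the generated test data, and the choice to put the break right after the comma are assumptions of this sketch, not part of the original code.

```python
def break_at_last_separator(text, width=150, sep="','", br="<br>"):
    """Insert `br` after the last `sep` comma found within each `width`-char window."""
    pieces = []
    pos = 0
    while len(text) - pos > width:
        # Last separator that fits entirely inside the current window.
        cut = text.rfind(sep, pos, pos + width)
        if cut == -1:
            # No separator in the window: fall back to the next one, if any.
            cut = text.find(sep, pos)
            if cut == -1:
                break
        end = cut + 2  # keep the closing quote and the comma on this piece
        pieces.append(text[pos:end])
        pos = end
    pieces.append(text[pos:])
    return br.join(pieces)

# Generated stand-in for the long list of names.
text = "(" + ",".join("'NAME%03d'" % i for i in range(100)) + ")"
result = break_at_last_separator(text)
print(result)
```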
You'll want something like this:
Look for every occurrence of \([^)]*,[^)]*\) (that is, find a parenthesis-wrapped string with a comma in it) and then run the following regular expression on each of the matched elements:
(.{135,150}[^,]*?),
The first number is the minimum number of characters you want to match before you add a break tag -- the second is the maximum number of characters you would like to match before inserting a break tag. If there is no , between the characters in question then the regular expression will continue to consume characters until it finds a comma.
You could probably do it like this:
regex ~ /(^.{1,14}),/
replacement ~ '\1<replacement' or "$1<insert your text>"
In Perl:
$target = ','x 22;
$target =~ s/(^ .{1,14}) , /$1<15th comma>/x;
print $target;
Output
,,,,,,,,,,,,,,<15th comma>,,,,,,,
Edit: As an alternative, if you want to break the string up into successive chunks of 150 characters or fewer,
you could do it this way:
regex ~ /(.{1,150},)/sg
replacement ~ '\1<br/>' or "$1<br\/>"
// That is a regex of type global (/g) and include newlines (/s)
In Perl:
$target = "
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN', ... )
";
if ($target =~ s/( .{1,150} , )/$1<br\/>/sxg) {
print $target;
}
Output:
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI',<br/>'ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA',<br/>'AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE',<br/>'ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN',<br/> ... )
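The same global substitution can be sketched in Python, where re.DOTALL plays the role of the /s flag and re.sub replaces every match by default. The input below is generated stand-in data rather than the real list of names.

```python
import re

# Generated stand-in for the long list of names.
names = "(" + ",".join("'NAME%03d'" % i for i in range(100)) + ")"

# Up to 150 characters followed by a comma, greedily; append <br/> after each chunk.
broken = re.sub(r"(.{1,150},)", r"\1<br/>", names, flags=re.DOTALL)

chunks = broken.split("<br/>")
print(len(chunks), max(len(chunk) for chunk in chunks))
```

Each chunk here is at most 151 characters long, because the comma that ends the chunk is matched in addition to the 150 characters before it.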