Counting repeated characters around nth character of string - regex

For a part of my University project, i am trying to count base repeats around the 11th character of 21 bp sequences of DNA. I want to look at the 11th character, then if there are repeated identical characters around it, to print them.
For example:
GCTAAAGTAAAAGAAGATGCA
Would give results of:
11th base is A, YES repeated 4 times
I really don't know how to go about this, to get the 11th character i'm sure i can use a regex but after that i'm not sure.
To start with I have playing around using a hash and looking for the number of occurrences of different nucleotide groups in each sequence, as follows:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/hash.txt";
open FILE1, "/Users/edwardtickle/Documents/randomoutput.txt";
open( OUTPUTFILE, ">$outputfile" );
while (<FILE1>) {
if (/^(\S+)/) {
my $dna = $1;
my #repeats = ( $dna =~ /[A]{3}/g );
my #changes = ( $dna =~ /[T]{2}/g );
my %hash = ();
my %hash1 = ();
for my $repeats (#repeats) {
$hash{$repeats}++;
}
for my $changes (#changes) {
$hash1{$changes}++;
}
for my $key ( keys %hash ) {
print OUTPUTFILE $key . " => " . $hash{$key} . "\n";
}
for my $key1 ( keys %hash1 ) {
print OUTPUTFILE $key1 . " => " . $hash1{$key1} . "\n";
}
}
}
FILE1 data:
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Gives results of:
TT => 2
AAA => 1
TT => 4
AAA => 1
TT => 2
TT => 4
AAA => 1
TT => 2
TT => 5
AAA => 1
TT => 1
AAA => 2
TT => 1
TT => 2
When for this sample data set i would like a cumulative tally of every sequence, rather than number of individual occurrences in each matching string, like this:
AAA => 6
TT => 23
How do i go about changing the output? And how do i prevent a string of TTTTT bases showing up as TT => 2? Then if anyone has any recommendations of how to go about the original problem/if it is even possible, that would be greatly appreciated.
Thanks in advance!

Using a regular expression:
use strict;
use warnings;
my $char = 11; # Looking for the 11th character, or position 10.
while (<DATA>) {
chomp;
if (m{
( (.) \2*+ ) # Look for a repeated character sequence
(?<= .{$char} ) # Must include pos $char - 1
}x) {
printf "%s => %d\n", $2, length($1);
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Output:
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3

This code should do what you need. There really isn't a regular expression that will find the longest sequence of a given character at and around a given character position. This code works by splitting the string $seq into an array of characters #seq and then searching forwards and backwards from the centre.
It's practical to do things this way because the sequence is relatively short, and as long as there's an odd numbers of characters in the string it will calculate the centre point for you.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my #seq = unpack '(A1)*', $seq;
my $len_seq = #seq;
my $c_offset = ($len_seq - 1) / 2;
my $c_char = $seq[$c_offset];
my ($start, $end) = ($c_offset, $c_offset + 1);
--$start while $start > 0 and $seq[$start-1] eq $c_char;
++$end while $end < $len_seq and $seq[$end] eq $c_char;
return $c_char, $end-$start;
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
output
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
Update
Here's a better way. It's shorter and faster, and works by all the subsequences of the same character until it finds a sequence that spans the middle character.
The output is identical to that of the above.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my $mid_seq = length($seq) / 2;
while ( $seq =~ /(.)\1*/g ) {
if ($-[0] < $mid_seq and $+[0] > $mid_seq) {
return $1, $+[0]-$-[0];
}
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC

Related

Perl remove characters from string matrix according to pattern

I have multiple strings with the same length stored in a hash data structure. Example:
$VAR1 = {
'first' => 'abcXY',
'second' => 'XYXYa',
'third' => '*abXZ'
};
From this 'matrix' of characters, I would like to remove the 'columns' which exclusively contain the characters X or Y. In the above example, this would be the fourth character of each string (4th 'column').
The desired result would be:
$VAR1 = {
'first' => 'abcY',
'second' => 'XYXa',
'third' => '*abZ'
};
Following code does this by creating a transpose of the values of my hash structure and then determines which indices to keep:
# data structure
my %h = ('first'=>'abcXY', 'second'=>'XYXYa', 'third'=>'*abXZ' );
# get length of all values in hash
my $nchar = length $h{(keys(%h))[0]};
# transpose values of hash
my #transposed = map { my $idx=$_; [map {substr ($_, $idx, 1) } values(%h)] } 0..$nchar-1;
# determine indices which I want to keep
my #indices;
for my $i (0..$#transposed){
my #a = #{$transposed[$i]};
# do not keep index if column consists of X and Y
if ( scalar(grep {/X|Y/} #a) < scalar(#a) ) {
push #indices, $i;
}
}
# only keep letters with indices
for my $k (keys %h){
my $str = $h{$k};
my $reduced = join "", map{ substr ($str, $_, 1) } #indices;
$h{$k} = $reduced;
}
This is a terrible amount of code for such a simple operation. How could I do this more elegantly (preferably not with some matrix library, but with standard perl)?
Edit
Here another example: From the following strings, the first and last characters should be removed, because in both strings, the first and last position is either X or Y:
$VAR1 = {
'1' => 'Xsome_strX',
'2' => 'YsomeXstrY'
};
Desired result:
$VAR1 = {
'1' => 'some_str',
'2' => 'someXstr'
};
my $total = values %hash;
my %ind;
for my $v (values %hash) {
$ind{ pos($v) -1 }++ while $v =~ /[XY]/g;
}
my #to_remove = sort {$b <=> $a} grep { $ind{$_} == $total } keys %ind;
for my $v (values %hash) {
substr($v, $_, 1, "") for #to_remove;
}

Perl, array of matches for group of precompiled regexps

I have group of some regexps and want to match current line for each of them, if match succeeded call some function with matched groups as parameters.
my %regexps = (
"a" => qr/^(a)\s*(b)/o,
"b" => qr/^(c)\s*(d)/o,
"c" => qr/^(e)\s*(f)/o,
);
sub call_on_match {
my $actions = shift;
# ... some setup actions for $_
while (my ($regexp, $func) = each(%$actions) ) {
if (my #matches = /$regexp/){
$func->(#matches);
}
}
}
call_on_match({
$regexps{"a"} => \&some_funca,
$regexps{"b"} => \&some_funcb,
$regexps{"c"} => \&some_funcc,
})
The problem is in my #matches = /$regexp/ expression, it executes about 110k times and takes about 1 second total for compilation (Typical profiler output for this line: # spent 901ms making 107954 calls to main::CORE:regcomp, avg 8µs/call.
First guess was to remove additional regexp slashes, in case it makes perl thinks that it is new regexp and must be compiled. I used my #matches = ($_ =~ $regexp), but no success.
Is there another ways to make perl not to recompile qr'ed regexps in this context?
UPD: I replaced hash with array (like [$regexps{"a"}, \&some_funca]):
foreach my $group (#$actions){
my ($regexp, $func) = #$group;
if (my #matches = ($_ =~ $regexp)){
$func->(#matches);
}
}
Now it compiles faster but compilation doesn't disappear: # spent 51.7ms making 107954 calls to main::CORE:regcomp, avg 479ns/call
I suggest that you use the IDs as keys in both hashes, like this
use strict;
use warnings;
my %regexps = (
a => qr/^(a)\s*(b)/,
b => qr/^(c)\s*(d)/,
c => qr/^(e)\s*(f)/,
);
sub call_on_match {
my ($actions) = #_;
# ... some setup actions for $_
while (my ($regexp_id, $func) = each %$actions) {
if (my #matches = $_ =~ $regexps{$regexp_id}) {
$func->(#matches);
}
}
}
call_on_match(
{
a => \&some_funca,
b => \&some_funcb,
c => \&some_funcc,
}
);

Match count in Perl

I have a hash with various keywords. Now, I want to find the count of these keywords in the string.
I just wrote some part of code with a foreach loop.
use strict;
use warnings;
my $string = "The invitro experiments are conducted on human liver microsom. "
. " These liver microsom can be cultured in rats.";
my %hash = (
"human" => 1,
"liver" => 1,
"microsom" => 1,
);
for my $nme (keys %hash){
# Some code which I am not sure
}
Expected output: human:1; liver:2; microsom:3
Can someone help me in this?
Thanks
The following snippet should suffice.
#!/usr/bin/perl -w
use strict;
my $string="The invitro experiments are conducted on human liver microsom. These liver microsom can be cultured in rats.";
my %hash = (
'human' => 1,
'liver' => 1,
'microsom' => 1,
);
my #words = split /\b/, $string;
my %seen;
for (#words) {
if ($_ eq 'human' or $_ eq 'liver' or $_ eq 'microsom') {
$seen{$_}++;
}
}
for (keys %hash) {
print "$_: $seen{$_}\n";
}
Is that homework? :) Well, depends on number of words in hash, and number of words in string ( strings ), it will be better iterate over hash, or iterate over words in string(s), incrementing appropriate values when found. As you need to check all the words, you will end with a list with some marked "0" occurences, and some marked more than zero.
probably not the best way of going about this, but it should work.
my $string = "The invitro experiments are conducted on human liver microsom. These liver microsom can be cultured in rats.";
my %hash = (
'human' => 1,
'liver' => 1,
'microsom' => 1,
);
foreach my $nme (keys %hash){
$hash{$nme} = scalar #{[$string =~ /$nme/g]};
print "$hash{$nme}\n";
}

How to automagically create pattern based on real data?

I have many vendors in database, they all differ in some aspect of their data. I'd like to make data validation rule which is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if user inputs string 'XZ-217' for vendor B, algorithm should compare previous data and say: this string is not similar to vendor B previous data.
Is there some good way/tools to achieve such comparison? Answer could be some generic algoritm or Perl module.
Edit:
The "similarity" is hard to define, i agree. But i'd like to catch to algorithm, which could analyze previous ca 100 samples and then compare the outcome of analyze with new data. Similarity may based on length, on use of characters/numbers, string creation patterns, similar beginning/end/middle, having some separators in.
I feel it is not easy task, but on other hand, i think it has very wide use. So i hoped, there is already some hints.
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
shift;
#_ = #{ shift() } if #_ == 1;
my $str
= join( '|'
, map { join( ''
, grep { defined; }
map {
$_ eq 'P' ? quotemeta;
: $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
: $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
: undef
;
} #$_
)
} #_ == 1 ? #{ shift } : #_
);
return qr/^(?:$str)$/;
}
sub merge_profiles {
shift;
my ( $profile_list, $new_profile ) = #_;
my $found = 0;
PROFILE:
for my $profile ( #$profile_list ) {
my $profile_length = #$profile;
# it's not the same profile.
next PROFILE unless $profile_length == #$new_profile;
my #merged;
for ( my $i = 0; $i < $profile_length; $i++ ) {
my $old = $profile->[$i];
my $new = $new_profile->[$i];
next PROFILE unless $old->[0] eq $new->[0];
push( #merged
, [ $old->[0]
, min( $old->[1], $new->[1] )
, max( $old->[2], $new->[2] )
]);
}
#$profile = #merged;
$found = 1;
last PROFILE;
}
push #$profile_list, $new_profile unless $found;
return;
}
sub compute_info_profile {
shift;
my #profile_chunks
= map {
/\W/ ? [ P => $_ ]
: /\D/ ? [ W => length, length ]
: [ D => length, length ]
}
grep { length; } split /(\W+)/, shift
;
}
# Psuedo-Perl
sub process_input_task {
my ( $application, $input ) = #_;
my $patterns = $application->get_patterns_for_current_customer;
my $regex = $application->compile_search_expr( $patterns );
if ( $input =~ /$regex/ ) {}
elsif ( $application->approve_divergeance( $input )) {
$application->merge_profiles( $patterns, compute_info_profile( $input ));
}
else {
$application->escalate(
Incident->new( issue => INVALID_FORMAT
, input => $input
, customer => $customer
));
}
return $application->process_approved_input( $input );
}
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
print "$example: ", gen_regex(#{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
my #cases = #_;
my %exploded;
# ex. $case may be XZ-217
foreach my $case (#cases) {
my #parts =
grep { defined and length }
split( /(\d+|\w+)/, $case );
# #parts are ( XZ, -, 217 )
foreach (#parts) {
if (/\d/) {
# 217 becomes ['\d' => 3]
push #{ $exploded{$case} }, ['\d' => length];
} elsif (/\w/) {
#XZ becomes ['\w' => 2]
push #{ $exploded{$case} }, ['\w' => length];
} else {
# - becomes ['lit' => '-']
push #{ $exploded{$case} }, ['lit' => $_ ];
}
}
}
my $pattern = '';
# iterate over nth element (part) of each case
my $ea = each_arrayref(values %exploded);
while (my #parts = $ea->()) {
# remove undefined (i.e. optional) parts
my #def_parts = grep { defined } #parts;
# check that all (defined) parts are the same type
my #part_types = uniq map {$_->[0]} #def_parts;
if (#part_types > 1) {
warn "Parts not aligned\n";
return;
}
my $type = $part_types[0]; #same so make scalar
# were there optional parts?
my $required = (#parts == #def_parts);
# keep the values of each part
# these are either a repitition or lit strings
my #values = sort uniq map { $_->[1] } #def_parts;
# these are for non-literal quantifiers
my $min = $required ? $values[0] : 0;
my $max = $values[-1];
# write the specific pattern for each type
if ($type eq '\d') {
$pattern .= '\d' . "{$min,$max}";
} elsif ($type eq '\w') {
$pattern .= '\w' . "{$min,$max}";
} elsif ($type eq 'lit') {
# quote special characters, - becomes \-
my #uniq = map { quotemeta } uniq #values;
# join with alternations, surround by non-capture grouup, add quantifier
$pattern .= '(?:' . join('|', #uniq) . ')' . ($required ? '{1}' : '?');
}
}
# build the qr regex from pattern
my $regex = qr/$pattern/;
# test that all original patterns match (#fail should be empty)
my #fail = grep { $_ !~ $regex } #cases;
if (#fail) {
warn "Some cases fail for generated pattern $regex: (#fail)\n";
return '';
} else {
return $regex;
}
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.

Removing spaces between single letters

I have a string that may contain an arbitrary number of single-letters separated by spaces. I am looking for a regex (in Perl) that will remove spaces between all (unknown number) of single letters.
For example:
ab c d should become ab cd
a bcd e f gh should become a bcd ef gh
a b c should become abc
and
abc d should be unchanged (because there are no single letters followed by or preceded by a single space).
Thanks for any ideas.
Your description doesn't really match your examples. It looks to me like you want to remove any space that is (1) preceded by a letter which is not itself preceded by a letter, and (2) followed by a letter which is not itself followed by a letter. Those conditions can be expressed precisely as nested lookarounds:
/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))/
tested:
use strict;
use warnings;
use Test::Simple tests => 4;
sub clean {
(my $x = shift) =~ s/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))//g;
$x;
}
ok(clean('ab c d') eq 'ab cd');
ok(clean('a bcd e f gh') eq 'a bcd ef gh');
ok(clean('a b c') eq 'abc');
ok(clean('ab c d') eq 'ab cd');
output:
1..4
ok 1
ok 2
ok 3
ok 4
I'm assuming you really meant one space character (U+0020); if you want to match any whitespace, you might want to replace the space with \s+.
You can do this with lookahead and lookbehind assertions, as described in perldoc perlre:
use strict;
use warnings;
use Test::More;
is(tran('ab c d'), 'ab cd');
is(tran('a bcd e f gh'), 'a bcd ef gh');
is(tran('a b c'), 'abc');
is(tran('abc d'), 'abc d');
sub tran
{
my $input = shift;
(my $output = $input) =~ s/(?<![[:lower:]])([[:lower:]]) (?=[[:lower:]])/$1/g;
return $output;
}
done_testing;
Note the current code fails on the second test case, as the output is:
ok 1
not ok 2
# Failed test at test.pl line 7.
# got: 'abcd efgh'
# expected: 'a bcd ef gh'
ok 3
ok 4
1..4
# Looks like you failed 1 test of 4.
I left it like this as your second and third examples seem to contradict each other as to how leading single characters should be handled. However, this framework should be enough to allow you to experiment with different lookaheads and lookbehinds to get the exact results you are looking for.
This piece of code
#!/usr/bin/perl
use strict;
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
foreach my $string (#strings) {
print "$string --> ";
$string =~ s/\b(\w)\s+(?=\w\b)/$1/g; # the only line that actually matters
print "$string\n";
}
prints this:
a b c --> abc
ab c d --> ab cd
a bcd e f gh --> a bcd ef gh
abc d --> abc d
I think/hope this is what you're looking for.
This should do the trick:
my $str = ...;
$str =~ s/ \b(\w) \s+ (\w)\b /$1$2/gx;
That removes the space between all single nonspace characters. Feel free to replace \S with a more restrictive character class if needed. There also may be some edge cases related to punctuation characters that you need to deal with, but I can't guess that from the info you have provided.
As Ether helpfully points out, this fails on one case. Here is a version that should work (though not quite as clean as the first):
s/ \b(\w) ( (?:\s+ \w\b)+ ) /$1 . join '', split m|\s+|, $2/gex;
I liked Ether's test based approach (imitation is the sincerest form of flattery and all):
use warnings;
use strict;
use Test::Magic tests => 4;
sub clean {
(my $x = shift) =~ s{\b(\w) ((?: \s+ (\w)\b)+)}
{$1 . join '', split m|\s+|, $2}gex;
$x
}
test 'space removal',
is clean('ab c d') eq 'ab cd',
is clean('a bcd e f gh') eq 'a bcd ef gh',
is clean('a b c') eq 'abc',
is clean('abc d') eq 'abc d';
returns:
1..4
ok 1 - space removal 1
ok 2 - space removal 2
ok 3 - space removal 3
ok 4 - space removal 4
It's not a regex but since I am lazy by nature I would it do this way.
#!/usr/bin/env perl
use warnings;
use 5.012;
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
for my $string ( #strings ) {
my #s; my $t = '';
for my $el ( split /\s+/, $string ) {
if ( length $el > 1 ) {
push #s, $t if $t;
$t = '';
push #s, $el;
} else { $t .= $el; }
}
push #s, $t if $t;
say "#s";
}
OK, my way is the slowest:
no_regex 130619/s -- -60% -61% -63%
Alan_Moore 323328/s 148% -- -4% -8%
Eric_Storm 336748/s 158% 4% -- -5%
canavanin 352654/s 170% 9% 5% --
I didn't include Ether's code because ( as he has tested ) it returns different results.
Now I have the slowest and the fastest.
#!/usr/bin/perl
use 5.012;
use warnings;
use Benchmark qw(cmpthese);
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
cmpthese( 0, {
Eric_Storm => sub{ for my $string (#strings) { $string =~ s{\b(\w) ((?: \s+ (\w)\b)+)}{$1 . join '', split m|\s+|, $2}gex; } },
canavanin => sub{ for my $string (#strings) { $string =~ s/\b(\w)\s+(?=\w\b)/$1/g; } },
Alan_Moore => sub{ for my $string (#strings) { $string =~ s/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))//g; } },
keep_uni => sub{ for my $string (#strings) { $string =~ s/\PL\pL\K (?=\pL(?!\pL))//g; } },
keep_asc => sub{ for my $string (#strings) { $string =~ s/[^a-zA-Z][a-zA-Z]\K (?=[a-zA-Z](?![a-zA-Z]))//g; } },
no_regex => sub{ for my $string (#strings) { my #s; my $t = '';
for my $el (split /\s+/, $string) {if (length $el > 1) { push #s, $t if $t; $t = ''; push #s, $el; } else { $t .= $el; } }
push #s, $t if $t;
#say "#s";
} },
});
.
Rate no_regex Alan_Moore Eric_Storm canavanin keep_uni keep_asc
no_regex 98682/s -- -64% -65% -66% -81% -87%
Alan_Moore 274019/s 178% -- -3% -6% -48% -63%
Eric_Storm 282855/s 187% 3% -- -3% -46% -62%
canavanin 291585/s 195% 6% 3% -- -45% -60%
keep_uni 528014/s 435% 93% 87% 81% -- -28%
keep_asc 735254/s 645% 168% 160% 152% 39% --
This will do the job.
(?<=\b\w)\s(?=\w\b)
Hi I have written simple javascript to do this it's simple and you can convert into any language.
function compressSingleSpace(source){
let words = source.split(" ");
let finalWords = [];
let tempWord = "";
for(let i=0;i<words.length;i++){
if(tempWord!='' && words[i].length>1){
finalWords.push(tempWord);
tempWord = '';
}
if(words[i].length>1){
finalWords.push(words[i]);
}else{
tempWord += words[i];
}
}
if(tempWord!=''){
finalWords.push(tempWord);
}
source = finalWords.join(" ");
return source;
}
function convertInput(){
let str = document.getElementById("inputWords").value;
document.getElementById("firstInput").innerHTML = str;
let compressed = compressSingleSpace(str);
document.getElementById("finalOutput").innerHTML = compressed;
}
label{
font-size:20px;
margin:10px;
}
input{
margin:10px;
font-size:15px;
padding:10px;
}
input[type="button"]{
cursor:pointer;
background: #ccc;
}
#firstInput{
color:red;
font-size:20px;
margin:10px;
}
#finalOutput{
color:green;
font-size:20px;
margin:10px;
}
<label for="inputWords">Enter your input and press Convert</label><br>
<input id="inputWords" value="check this site p e t z l o v e r . c o m thanks">
<input type="button" onclick="convertInput(this.value)" value="Convert" >
<div id="firstInput">check this site p e t z l o v e r . c o m thanks</div>
<div id="finalOutput">check this site petzlover.com thanks</div>