String split by character - regex

I have 50 strings of this form:
28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11
I want to split each string right after the state name, that is, after the last letter of the name. The problem is that there is another letter, 'F', near the end of the string, so I first cut the string in half using this:
substring(x,1,nchar(x)/2)
Now I am left with this:
28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1
Now I can try to split the string after its last letter. How do I do that? I understand that cutting the string in half is bad coding practice. Is there a smarter way of doing this?
I have a list of all the states. Could I use that as a dictionary to split the strings?

We can use str_split with the n option. The lookaround regex splits on one or more whitespace characters that follow a lowercase letter and precede a digit. Because we set n to 2, the string is split only at the first place this pattern occurs, giving two pieces.
library(stringr)
str_split(str1, "(?<=[a-z])\\s+(?=[0-9])", n = 2)[[1]]
#[1] "28 North Dakota"
#[2] "0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
Or, instead of using a package, we can insert a delimiter with sub and then split on it with strsplit (this assumes the string contains no other commas)
strsplit(sub("(.*[a-z])\\s(.*)", "\\1,\\2", str1), ",")[[1]]
[1] "28 North Dakota"
[2] "0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
If we need only the first part, we match one or more whitespace characters (\\s+) followed by a digit (\\d) and everything up to the end of the string (.*), and replace the match with ''.
sub("\\s+\\d.*", "", str1)
#[1] "28 North Dakota"
If we need the state alone
library(stringr)
str_extract(str1, "[A-Za-z]+\\s*[A-Za-z]+")
#[1] "North Dakota"
NOTE: The OP asked about splitting after the state name.
data
str1 <- "28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
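For readers working in Python, the same lookaround split can be sketched with the standard re module. This is only a rough equivalent, assuming the state name always ends in a letter and is followed by a digit. The helper name split_after_state is made up for illustration, and the character class is widened to [A-Za-z] rather than the [a-z] used above.

```python
import re

# Whitespace that follows a letter and precedes a digit,
# mirroring the lookaround pattern passed to str_split above.
BOUNDARY = re.compile(r"(?<=[A-Za-z])\s+(?=[0-9])")

def split_after_state(line):
    """Split once, at the first letter-to-digit boundary (hypothetical helper)."""
    return BOUNDARY.split(line, maxsplit=1)

str1 = "28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
head, tail = split_after_state(str1)
print(head)  # 28 North Dakota
print(tail)
```

Because maxsplit is 1, the later "F 9.5610957" boundary is ignored, which is the same role the n = 2 argument plays in the R version.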

Here is a method using gsub:
gsub("^\\d+ ([A-Za-z ]+) \\d+.*", "\\1", temp)
"North Dakota"
The regular expression starts by matching a digit as the first character ("^\d"), possibly more than one ("+"), followed by a space (" "). It then captures, with "()", the following run of letters and spaces ("[A-Za-z ]+"). After that it matches a space followed by at least one digit (" \d+") and anything that follows (".*"). The replacement "\1" returns the captured subexpression.
To return the final part of the substring, you could move the capturing parentheses to the corresponding part of the regular expression.
gsub("^\\d+ [A-Za-z ]+ (\\d+.*)", "\\1", temp)
[1] "0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
or to capture the state name and the number that precedes it,
gsub("^(\\d+ [A-Za-z ]+) \\d+.*", "\\1", temp)
[1] "28 North Dakota"
the example string:
temp <- c("28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11")
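The three gsub calls above differ only in where the capturing parentheses sit. As a sketch of the same idea in Python, one pattern with three groups can return all the pieces at once. The function name parse_line is hypothetical, and the pattern assumes every line has the shape "number, state name, then data starting with a digit".

```python
import re

# Group 1: leading number; group 2: letters and spaces (the state name);
# group 3: everything from the next digit to the end of the line.
LINE = re.compile(r"^(\d+) ([A-Za-z ]+) (\d.*)$")

def parse_line(line):
    """Return (number, state, rest), or None if the line has another shape."""
    m = LINE.match(line)
    if m is None:
        return None
    return m.groups()

temp = "28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
number, state, rest = parse_line(temp)
print(state)  # North Dakota
```

The greedy "[A-Za-z ]+" first swallows the trailing space after the name and then backtracks by one character so that the literal space and digit that follow can match.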

Related

Detecting Special Characters with Regular Expression in python?

df
Name
0 ##
1 R##
2 ghj##
3 Ray
4 *#+
5 Jack
6 Sara123#
7 ( 1234. )
8 Benjamin k 123
9 _
10 _!##_
11 _#_&#+-
12 56##!
Output:
Bad_Name
0 ##
1 *#+
2 _
3 _!##_
4 _#_&#+-
I need to detect, with a regular expression, names that consist only of special characters. If a string contains any letter or digit it is valid; otherwise it should be considered a bad string.
I was using the regex '^\W*$'. Everything worked fine, except that a string containing '_' (underscore) is not treated as a bad string.
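Use pandas.Series.str.contains to find the names that contain a letter or digit (case=False ignores case), then negate the mask with ~:
df[~df['Name'].str.contains('[a-z0-9]', case=False)]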
Output:
Name
0 ##
4 *#+
9 _
10 _!##_
11 _#_&#+-
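The same test can be sketched without pandas, which also shows why the '^\W*$' attempt misses the underscore: \w covers letters, digits and "_", so \W excludes "_". The function name is_bad_name is made up for this illustration, and only ASCII letters and digits are treated as valid here.

```python
import re

names = ["##", "R##", "ghj##", "Ray", "*#+", "Jack", "Sara123#",
         "( 1234. )", "Benjamin k 123", "_", "_!##_", "_#_&#+-", "56##!"]

def is_bad_name(name):
    """True when the name contains no ASCII letter or digit at all."""
    return re.search(r"[A-Za-z0-9]", name) is None

bad = [n for n in names if is_bad_name(n)]
print(bad)  # ['##', '*#+', '_', '_!##_', '_#_&#+-']

# The original attempt: "_" is a word character, so \W* cannot consume it.
underscore_missed = re.fullmatch(r"\W*", "_") is None
print(underscore_missed)  # True
```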

A regular expression for a binary string with one pair of consecutive 0s and one pair of consecutive 1s

1*(011*)*00(11*0)* 1* intersect 0*(100*)*11(00*1)* 0*
The first half of the regular expression should match all binary strings with one pair of consecutive 0s and the second half should match all binary strings with one pair of consecutive 1s. As the first contains strings with one pair of consecutive 1s, and the second contains strings with one pair of consecutive 0s, I claim that the entire regular expression would only match binary strings with at most one consecutive pair of 0s and one consecutive pair of 1s. Is this correct?
Yes, but more precisely your expression matches binary strings that contain exactly one pair of 0s and exactly one pair of 1s (rather than "at most").
I can prove it via this method:
Here is another regular expression to encode those semantics, using a union rather than an intersection, which I feel is more straightforward.
(1)?(01)*00(10)*11(01)*(0)?|(0)?(10)*11(01)*00(10)*(1)?
The first half matches binary strings in which the pair of zeros precedes the pair of ones, and the second half matches binary strings in which the pair of ones precedes the pair of zeros. Before, after, and between those pairs alternating values may occur.
A string is accepted if it matches either of those patterns (rather than both as in your expression).
Now, it is possible to construct the state transitions based on either of these regular expressions. I have done so below, first with mine then with yours. Each numbered state contains a list of regular expressions that describe the remaining portion of the string, and the state transitions that occur when either a 0, 1, or end-of-line is encountered. A string matches if it matches any regular expression in the list.
As you can see, the state transitions between your version and mine are completely homologous. Therefore, they represent exactly the same set of strings.
start (1)?(01)*00(10)*11(01)*(0)?
(0)?(10)*11(01)*00(10)*(1)?
0 1
1 2
EOL NO_MATCH
1 1(01)*00(10)*11(01)*(0)?
0(10)*11(01)*(0)?
(10)*11(01)*00(10)*(1)?
0 3
1 2
EOL NO_MATCH
2 (01)*00(10)*11(01)*(0)?
0(10)*11(01)*00(10)*(1)?
1(01)*00(10)*(1)?
0 1
1 4
EOL NO_MATCH
3 (10)*11(01)*(0)?
0 NO_MATCH
1 5
EOL NO_MATCH
4 (01)*00(10)*(1)?
0 6
1 NO_MATCH
EOL NO_MATCH
5 0(10)*11(01)*(0)?
1(01)*(0)?
0 3
1 7
EOL NO_MATCH
6 1(01)*00(10)*(1)?
0(10)*(1)?
0 8
1 4
EOL NO_MATCH
7 (01)*(0)?
0 9
1 NO_MATCH
EOL MATCH
8 (10)*(1)?
0 NO_MATCH
1 10
EOL MATCH
9 1(01)*(0)?
END
0 NO_MATCH
1 7
EOL MATCH
10 0(10)*(1)?
END
0 8
1 NO_MATCH
EOL MATCH
start 1*(011*)*00(11*0)*1* + 0*(100*)*11(00*1)*0*
0 1
1 2
EOL NO_MATCH
1 11*(011*)*00(11*0)*1* + 0*(100*)*11(00*1)*0*
0(11*0)*1* + 0*(100*)*11(00*1)*0*
0 3
1 2
EOL NO_MATCH
2 1*(011*)*00(11*0)*1* + 00*(100*)*11(00*1)*0*
1*(011*)*00(11*0)*1* + 1(00*1)*0*
0 1
1 4
EOL NO_MATCH
3 (11*0)*1* + 0*(100*)*11(00*1)*0*
0 NO_MATCH
1 5
EOL NO_MATCH
4 1*(011*)*00(11*0)*1* + (00*1)*0*
0 6
1 NO_MATCH
EOL NO_MATCH
5 1*0(11*0)*1* + 00*(100*)*11(00*1)*0*
(11*0)*1* + 00*(100*)*11(00*1)*0*
1*0(11*0)*1* + 1(00*1)*0*
(11*0)*1* + 1(00*1)*0*
0 3
1 7
EOL NO_MATCH
6 11*(011*)*00(11*0)*1* + 0*1(00*1)*0*
0(11*0)*1* + 0*1(00*1)*0*
11*(011*)*00(11*0)*1* + 0*
0(11*0)*1* + 0*
0 8
1 4
EOL NO_MATCH
7 1*0(11*0)*1* + (00*1)*0*
1* + (00*1)*0*
0 9
1 NO_MATCH
EOL MATCH
8 (11*0)*1* + 0*1(00*1)*0*
(11*0)*1* + 0*
0 NO_MATCH
1 10
EOL MATCH
9 (11*0)*1* + 0*1(00*1)*0*
(11*0)*1* + 0*
0 NO_MATCH
1 7
EOL MATCH
10 1*0(11*0)*1* + (00*1)*0*
1* + (00*1)*0*
(11*0)*1* + 0*
0 8
1 NO_MATCH
EOL MATCH
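The equivalence argued through the state transitions can also be spot-checked by brute force. The sketch below is not a proof: it only enumerates every binary string up to a chosen length and compares three definitions, namely the intersection form (modelled as "both halves must match"), the union form, and a direct count of consecutive pairs. The names check and count_pairs are introduced here for illustration.

```python
import re
from itertools import product

# Asker's expression: the two halves, to be intersected.
ZEROS = re.compile(r"1*(011*)*00(11*0)*1*")
ONES = re.compile(r"0*(100*)*11(00*1)*0*")
# Answer's expression, written as a union of the two possible orders.
UNION = re.compile(r"1?(01)*00(10)*11(01)*0?|0?(10)*11(01)*00(10)*1?")

def count_pairs(s, pair):
    """Count (possibly overlapping) occurrences of a two-character pair."""
    return sum(1 for i in range(len(s) - 1) if s[i:i + 2] == pair)

def check(max_len):
    """Return all strings up to max_len on which the three definitions disagree."""
    mismatches = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            by_intersection = bool(ZEROS.fullmatch(s)) and bool(ONES.fullmatch(s))
            by_union = bool(UNION.fullmatch(s))
            by_count = count_pairs(s, "00") == 1 and count_pairs(s, "11") == 1
            if not (by_intersection == by_union == by_count):
                mismatches.append(s)
    return mismatches

print(check(10))  # [] means no disagreement was found up to length 10
```

Note that "000" counts as two overlapping pairs of 0s here, so it is rejected by all three definitions.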

merge two files based on partial match between strings

I have two files where the strings in file1 partially match the strings in the last column of file2. I would like to merge the two files based on the match between the strings. How do I solve this when the match is only partial, meaning that a string in file1 is often a substring of one in file2? PS: Case should be ignored.
file1:
AGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTGGTTATATCCTTCCCGTACTA 0 1 2 3
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT 2 11 14 0
AAAGTGGCCTACGCCACCGCCATGGACTGGTTCATAGCCGTGTGCTATGCCTTC 1 2 3 4
AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC 50 1 1 21
TACCCTGTAGAACCGAANTTGT 0 0 1 4
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG 1 0 4 3
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG 0 1 3 0
file2:
chrX Rfam ncRNA 55609165 55609267 53.97 + 0 ID=RF00019.20;Name=RF00019;Alias=Y_RNA;Note=AL627224.14/36063-36164 chrX:55609165-55609267 ggctggtttgagtgcagtgatgcttacaactaattgatcacatccaattacagatttctttgctctttctgtactcccagtgcttcacttgactagccttta
chrX Rfam regulatory_region 57233087 57233370 53.02 - 0 ID=RF01417.3;Name=RF01417;Alias=RSV_RNA;Note=Z83745.1/45303-45021 chrX:57233087-57233370 gtaaatgcaaaccattcacagtcttgctcagctaaggggatagtaaagaaacagtcttttaaatcaatgactattaaaggccaatttcttggaatcatagcaggagaaggcagtcctggctgcaatgtccccataggttgtataactgaattaatggctcttaagtcagttaacattctccatttacctgattttttcttaattacaaaaactggagaatttcaaggggaaaatattggaactatgtgtcctttttctaattgttcagtaactaagtcctcta
chrX Rfam regulatory_region 61975961 61976233 45.45 - 0 ID=RF01417.4;Name=RF01417;Alias=RSV_RNA;Note=BX322784.3/89124-88853 chrX:61975961-61976233 AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC
chrX Rfam ncRNA 62059095 62059167 29.9 + 0 ID=RF00005.18;Name=RF00005;Alias=tRNA;Note=BX119964.4/4840-4911 chrX:62059095-62059167 GTTAATGTAGCTTAATTCATCAAAGCAAGGCACTGAAAAATGCCTAGATGAATACACATGATTCCATTAACA
chrX Rfam regulatory_region 62582448 62582735 62.81 - 0 ID=RF01417.5;Name=RF01417;Alias=RSV_RNA;Note=AL158203.12/36753-36467 chrX:62582448-62582735 gtaaacacaaatttttctctgtccttctctgctagatgaatggtataaaaacaatctttaagtcaacaacgattataggccaatcttcaggaattgccacaggggaggggaggacctgttgaagagaccccataggttgcaaattagcattaatagcagttaagtagtgcaaaagtctccatttaccagactttttgggaatgacgaaaatgggcgaattccaaaggctgtttgatggttctatatggccagctttcaattgctcctcaactaattcatgggctctc
chrX Rfam ncRNA 63430570 63430868 141.38 + 0 ID=RF00017.15;Name=RF00017;Alias=Metazoa_SRP;Note=AL355852.23/124872-125169 chrX:63430570-63430868 cctggggcagtggcacatgcctgtagtcccagctacttgggaggctgaagcaggaggatagcttaagttcaggagttctgggatgtaatgcactatgctgatagggtgtctgcactaagttcagcatcaacatggtgacctcccaggagcaggggaccaccaggctgcctaaggaggtatgaactggccgagatcagaaacggagcacataaaaacttgcatcttgatcagtagtgggattgcgcctacaaatagccactgcactgcagactgggcaacatagtgagaccttgtctct
If your files aren't huge, and awk is able to hold all of file2 in memory, you can do this (ARGIND is a GNU awk feature):
awk '
ARGIND==1 { save[tolower($NF)] = $0 }
ARGIND==2 { col1 = tolower($1)
            for (pat in save) {
              if (pat ~ col1) print $0 " ----- " save[pat]
            }
          }
' file2 file1
This reads file2 first and saves each line ($0) in associative array save, indexed by the last field ($NF) converted to lowercase.
It then reads file1 (so ARGIND is 2, 2nd file), and converts column 1 to lowercase. Then it tries to match (~) this string (or pattern really) against each index in the array. If it matches it prints the current line from file1 and the saved line from file2.
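The same two-pass idea can be sketched in Python. This is an illustration rather than a drop-in replacement: the function merge_partial is hypothetical, it takes lists of lines instead of file names, it uses a literal substring test rather than a regex match, and it keeps every file2 line in a list, so file2 lines that share the same last field are all kept. The sample rows are toy data, not taken from the files in the question.

```python
def merge_partial(file1_lines, file2_lines):
    """Pair each file1 line with every file2 line whose last field
    contains the file1 sequence (first field), ignoring case."""
    saved = []
    for line in file2_lines:
        fields = line.split()
        if fields:
            saved.append((fields[-1].lower(), line))

    merged = []
    for line in file1_lines:
        fields = line.split()
        if not fields:
            continue
        needle = fields[0].lower()
        for last_field, original in saved:
            if needle in last_field:  # literal substring test, not a regex
                merged.append(line + " ----- " + original)
    return merged

# Toy rows for illustration only.
file1 = ["ACGT 1 2", "TTTT 3 4"]
file2 = ["chrX a b ggacgtcc", "chrX c d CCCCCC"]
result = merge_partial(file1, file2)
print(result)  # ['ACGT 1 2 ----- chrX a b ggacgtcc']
```

With real files, the two lists could come from reading each file's lines, stripping the trailing newline first.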

Regular Expression: Find repeated patterns

Having this string s=";123;;123;;456;;124;;123;;567;" in R, which shows some Ids separated by ";", I want to find the repeated IDs, so in this case ";123;" is repeated. I used the following command in R:
gregexpr("(;[1-9]+;).*\1", s)
but it doesn't find the repeated patterns. Any idea what is wrong?
One example of a long string:
1760381;;1774536;;1774614;;1774617;;1774705;;1774723;;1775013;;1902321;;1928678;;2105486;;2105514;;2105544;;2105575;;2105585;;2279115;;2379236;;290927;;542280;;555749;;641540;;683822;;694934;;713228;;713248;;713249;;726949;;727204;;731434;;754522;;7693856;;100095;;1003838;;1045582;;1079057;;1108697;;1231229;;124087;;1249672;;1328126;;1412065;;1419930;;1441743;;1470580;;1476585;;1502106;;1556149;;1637775;;1643922;;1655644;;1755547;;1759001;;1760295;;1760296;;1760320;;1760326;;1760338;;1760348;;1760349;;1760350;;1760353;;1760375;;1760376;;1760377;;1760378;;1760388;;1760401;;1760402;;1760403;;1760410;;1760421;;1760425;;1760426;;1760642;;1760654;;1770463;;1774365;;1774366;;1774394;;1774449;;1774453;;1774454;;1774455;;1774456;;1774457;;1774458;;1774461;;1774462;;1774463;;1774464;;1774466;;1774469;;1774504;;1774505;;1774506;;1774519;;1774520;;1774525;;1774527;;1774529;;1774532;;1774533;;1774539;;1774542;;1774593;;1774595;;1774604;;1774610;;1774616;;1774617;;1774641;;1774660;;1774671;;1774674;;1774684;;1774687;;1774694;;1774704;;1774706;;1774713;;1774717;;1774722;;1774723;;1774726;;1774733;;1774745;;1774750;;1774753;;1774754;;1774766;;1774784;;1774786;;1774795;;1774799;;1774800;;1774803;;1774809;;1774813;;1774835;;1774849;;1774852;;1774853;;1774854;;1774857;;1774858;;1774861;;1774862;;1774867;;1774868;;1774869;;1774870;;1774877;;1774878;;1774880;;1774884;;1774885;;1774886;;1774902;;1774905;;1774934;;1774935;;1774937;;1774939;;1774946;;1774949;;1774950;;1774958;;1774959;;1774960;;1774961;;1774962;;1774964;;1774965;;1774966;;1774967;;1774969;;1774971;;1774972;;1774973;;1774975;;1774977;;1774978;;1774999;;1775000;;1775003;;1775005;;1775006;;1775009;;1775013;;1775014;;1775017;;1775024;;1775026;;1775033;;1775038;;1775040;;1775041;;1775044;;1775087;;1785544;;1811645;;1837210;;1864356;;1928674;;1928678;;1932882;;1954203;;2066856;;2076876;;2105349;;2105351;;2105458;;2105464;;2105476;;2105480;;2105482;;2105484;;2105489;;2105496;;2105500;;2105510;;2105514;;2105518;;2105532;;2105545;
;2105550;;2172257;;2172762;;218438;;2228198;;2229827;;2247909;;2262250;;2263135;;2287260;;2335872;;2335873;;2335874;;2335877;;2338682;;2352560;;2420902;;263946;;265370;;303060;;330571;;338764;;387492;;387750;;388362;;431807;;436056;;436442;;444058;;458026;;491696;;504783;;513098;;529228;;539799;;549649;;559957;;562574;;563116;;576418;;582851;;592273;;599952;;614463;;626416;;645122;;652363;;665854;;668048;;682877;;683822;;688317;;709795;;710684;;723114;;724447;;724526;;725177;;731389;;731434;;876958;;879962;;947924;;987322;;987446;;61326;;1025952;;1095970;;1338018;;1349990;;1373122;;1419930;;1760310;;1760320;;1774705;;1774706;;1774708;;1774712;;1774952;;1774954;;1774963;;1774972;;1774977;;1775077;;1901075;;2022080;;2117779;;2143723;;441554;;450517;;549649;;1010402;;113311;;1148258;;1374348;;1419930;;1606449;;1606515;;1606608;;1606610;;1760320;;1760338;;1760618;;1760642;;1774504;;1774520;;1774595;;1774705;;1774909;;1774977;;1775011;;1775043;;179542;;1928678;;2105598;;2105721;;2188303;;2335873;;340762;;387759;;436442;;504783;;588336;;646185;;682877;;715644;;725080;;741661;;760924
m<-gregexpr("[0-9]+",s)
n<-regmatches(s,m)
[[1]]
[1] "123" "123" "456" "124" "123" "567"
data.frame(table(unlist(n)))
Var1 Freq
1 123 3
2 124 1
3 456 1
4 567 1
The code works for your long form string too: Here is the head and tail of the output:
head(data.frame(table(unlist(n))),10)
Var1 Freq
1 100095 1
2 1003838 1
3 1010402 1
4 1025952 1
5 1045582 1
6 1079057 1
7 1095970 1
8 1108697 1
9 113311 1
10 1148258 1
tail(data.frame(table(unlist(n))),10)
Var1 Freq
316 731434 2
317 741661 1
318 754522 1
319 760924 1
320 7693856 1
321 876958 1
322 879962 1
323 947924 1
324 987322 1
325 987446 1
1) In the examples the ids are all the same length so we assume that is a general feature. Try this pattern where (?=...) is a zero width lookahead expression (see ?regex)
pat <- ";([0-9]+);(?=.*\\1)"  # [0-9] rather than [1-9], so ids containing a 0 are matched too
gregexpr(pat, s, perl = TRUE)
or this:
library(gsubfn)
strapply(s, pat, perl = TRUE)[[1]]
## [1] "123" "123"
This lists each id one time fewer than it occurs in s (zero times for ids that are not duplicated), so to list each duplicated id once, try unique(st), where st is the result of the last line of code above.
Note: In the second example in the question, i.e. the long string, there is no ; at the end of the string so the last id can never be matched by the expression unless we first paste a ; onto the end.
2) Instead of matching the contents we could match the delimiters instead:
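strsplit(s, ";+")[[1]][-1]
If st is the result of this line of code then st is just a vector of all the ids (splitting on ";+" avoids empty strings between ids, and [-1] drops the empty string before the leading ";"), so unique(st[duplicated(st)]) lists each duplicated id once and needs no regular expression beyond the trivial split pattern.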
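The counting approach from the first answer (extract every run of digits, then tabulate) carries over to other languages. A minimal Python sketch, with repeated_ids as a made-up helper name, assuming ids consist of digits only:

```python
import re
from collections import Counter

def repeated_ids(s):
    """Map each id that occurs more than once to its number of occurrences."""
    counts = Counter(re.findall(r"[0-9]+", s))
    return {id_: n for id_, n in counts.items() if n > 1}

s = ";123;;123;;456;;124;;123;;567;"
print(repeated_ids(s))  # {'123': 3}
```

Because whole digit runs are extracted, ids of different lengths are handled and a missing trailing ";" does not matter.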

Need Help Regarding Regular Expression in TCL

Can anyone help me understand the "execution flow" of the following regular expression in Tcl?
% regexp {^([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])$} 9
1 (success)
%
%
% regexp {^([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])$} 64
1 (success)
% regexp {^([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])$} 255
1 (success)
% regexp {^([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])$} 256
0 (Fail)
% regexp {^([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])$} 1000
0 (Fail)
Can anyone please explain how these are evaluated? I am struggling to understand.
The regexp first has the anchors ^ and $ around the main capturing group, indicated by the parentheses in ([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5]), which means that the pattern has to match the whole string.
Second, inside the capture group, we have 3 parts:
[01]?[0-9][0-9]?
2[0-4][0-9]
25[0-5]
They are separated with | (or) operators, which means if the string satisfies any of the 3 parts, the match succeeds.
Now, to the individual parts:
[01]?[0-9][0-9]? This matches an optional [01] (either 0 or 1), then any digit, then one more digit if there is one. Together, this accepts strings like 9, 64, 000 or 199, but nothing above 199.
2[0-4][0-9] this follows the same logic as above, except that it validates strings with numbers from 200 to 249.
25[0-5] Finally, this one validates strings with numbers from 250 to 255.
Since there's nothing more, only numbers ranging from 000 to 255 will succeed in the validation.
This is why 9, 64 and 255 passed, but not 256 or 1000.
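To see the three alternatives at work outside Tcl, here is a small Python sketch that runs the same pattern over the values from the question. The helper name is_octet is an assumption of this example, and fullmatch stands in for the ^ and $ anchors.

```python
import re

# Same three alternatives as in the Tcl regexp; fullmatch replaces ^ and $.
OCTET = re.compile(r"([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])")

def is_octet(text):
    """True when the whole string is a number from 0 to 255 (leading zeros allowed)."""
    return OCTET.fullmatch(text) is not None

for value in ["9", "64", "255", "256", "1000"]:
    print(value, is_octet(value))
# 9 True
# 64 True
# 255 True
# 256 False
# 1000 False
```

"256" fails because none of the three alternatives can consume all three characters: the first cannot start a three-digit number with 2, the second needs a middle digit of 0 to 4, and the third needs a last digit of 0 to 5.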
Not an answer to the question, just exploring other ways to do this validation:
proc from_0_to_255 {n} {
expr {[string is integer -strict $n] && 0 <= $n && $n <= 255}
}
from_0_to_255 256 ; # => 0
proc int_in_range {n {from 0} {to 255}} {
expr {[string is integer -strict $n] && $from <= $n && $n <= $to}
}
int_in_range 256 ; # => 0
int_in_range 256 0 1024 ; # => 1
proc int_in_range {n args} {
array set range [list -from 0 -to 255 {*}$args]
expr {
[string is integer -strict $n] &&
$range(-from) <= $n && $n <= $range(-to)
}
}
int_in_range 256 ; # => 0
int_in_range 256 -to 1024 ; # => 1
Everything is detailed in http://perldoc.perl.org/perlre.html#Regular-Expressions.
^ Match the beginning of the line
$ Match the end of the line (or before newline at the end)
? Match 1 or 0 times
| Alternation
() Grouping
[] Bracketed Character class
It matches the following numbers
[01]?[0-9][0-9]? -> 0 - 9, 00 - 99, 000 - 199
2[0-4][0-9] -> 200 - 249
25[0-5] -> 250 - 255