merge two files based on partial match between strings - regex

I have two files where the strings in file1 partially match the strings in the last column of file2. I would like to merge the two files based on the match between the strings. How do I solve this when the match is only partial, meaning that a string in file1 is often a substring of the one in file2? PS: Case should be ignored.
file1:
AGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTGGTTATATCCTTCCCGTACTA 0 1 2 3
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT 2 11 14 0
AAAGTGGCCTACGCCACCGCCATGGACTGGTTCATAGCCGTGTGCTATGCCTTC 1 2 3 4
AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC 50 1 1 21
TACCCTGTAGAACCGAANTTGT 0 0 1 4
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG 1 0 4 3
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG 0 1 3 0
file2:
chrX Rfam ncRNA 55609165 55609267 53.97 + 0 ID=RF00019.20;Name=RF00019;Alias=Y_RNA;Note=AL627224.14/36063-36164 chrX:55609165-55609267 ggctggtttgagtgcagtgatgcttacaactaattgatcacatccaattacagatttctttgctctttctgtactcccagtgcttcacttgactagccttta
chrX Rfam regulatory_region 57233087 57233370 53.02 - 0 ID=RF01417.3;Name=RF01417;Alias=RSV_RNA;Note=Z83745.1/45303-45021 chrX:57233087-57233370 gtaaatgcaaaccattcacagtcttgctcagctaaggggatagtaaagaaacagtcttttaaatcaatgactattaaaggccaatttcttggaatcatagcaggagaaggcagtcctggctgcaatgtccccataggttgtataactgaattaatggctcttaagtcagttaacattctccatttacctgattttttcttaattacaaaaactggagaatttcaaggggaaaatattggaactatgtgtcctttttctaattgttcagtaactaagtcctcta
chrX Rfam regulatory_region 61975961 61976233 45.45 - 0 ID=RF01417.4;Name=RF01417;Alias=RSV_RNA;Note=BX322784.3/89124-88853 chrX:61975961-61976233 AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC
chrX Rfam ncRNA 62059095 62059167 29.9 + 0 ID=RF00005.18;Name=RF00005;Alias=tRNA;Note=BX119964.4/4840-4911 chrX:62059095-62059167 GTTAATGTAGCTTAATTCATCAAAGCAAGGCACTGAAAAATGCCTAGATGAATACACATGATTCCATTAACA
chrX Rfam regulatory_region 62582448 62582735 62.81 - 0 ID=RF01417.5;Name=RF01417;Alias=RSV_RNA;Note=AL158203.12/36753-36467 chrX:62582448-62582735 gtaaacacaaatttttctctgtccttctctgctagatgaatggtataaaaacaatctttaagtcaacaacgattataggccaatcttcaggaattgccacaggggaggggaggacctgttgaagagaccccataggttgcaaattagcattaatagcagttaagtagtgcaaaagtctccatttaccagactttttgggaatgacgaaaatgggcgaattccaaaggctgtttgatggttctatatggccagctttcaattgctcctcaactaattcatgggctctc
chrX Rfam ncRNA 63430570 63430868 141.38 + 0 ID=RF00017.15;Name=RF00017;Alias=Metazoa_SRP;Note=AL355852.23/124872-125169 chrX:63430570-63430868 cctggggcagtggcacatgcctgtagtcccagctacttgggaggctgaagcaggaggatagcttaagttcaggagttctgggatgtaatgcactatgctgatagggtgtctgcactaagttcagcatcaacatggtgacctcccaggagcaggggaccaccaggctgcctaaggaggtatgaactggccgagatcagaaacggagcacataaaaacttgcatcttgatcagtagtgggattgcgcctacaaatagccactgcactgcagactgggcaacatagtgagaccttgtctct

If your files aren't huge and awk is able to hold all of file2 in memory, you can do this (ARGIND is a gawk extension, so this needs GNU awk):
awk '
ARGIND==1 { save[tolower($NF)] = $0 }
ARGIND==2 { col1 = tolower($1)
for(pat in save){
if(pat ~ col1)print $0 " ----- " save[pat]
}
}
' file2 file1
This reads file2 first and saves each line ($0) in associative array save, indexed by the last field ($NF) converted to lowercase.
It then reads file1 (so ARGIND is 2, 2nd file), and converts column 1 to lowercase. Then it tries to match (~) this string (or pattern really) against each index in the array. If it matches it prints the current line from file1 and the saved line from file2.
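For comparison, the same nested-loop idea can be sketched in Python. This is only an illustration under stated assumptions: fields are whitespace-separated, matching is a plain case-insensitive substring test rather than a regex match, and the function name `merge_partial` and the shortened sample rows are hypothetical.

```python
def merge_partial(file1_lines, file2_lines, sep=" ----- "):
    """Pair each file1 line with every file2 line whose last field
    contains file1's first field (case-insensitive substring test)."""
    # Index file2 by its lower-cased last field, like save[tolower($NF)].
    saved = {}
    for line in file2_lines:
        fields = line.split()
        if fields:
            saved[fields[-1].lower()] = line

    merged = []
    for line in file1_lines:
        fields = line.split()
        if not fields:
            continue
        key = fields[0].lower()
        for target, saved_line in saved.items():
            if key in target:  # plain substring test, not a regex
                merged.append(line + sep + saved_line)
    return merged


# Hypothetical, shortened rows in the spirit of file1 and file2.
file1 = ["AAAGTGTCATATGCC 50 1 1 21", "TTTTTTTT 0 0 1 4"]
file2 = [
    "chrX Rfam regulatory_region 61975961 aaagtgtcatatgccactgcc",
    "chrX Rfam ncRNA 62059095 gttaatgtagc",
]
for row in merge_partial(file1, file2):
    print(row)
```

Like the awk version, this compares every file1 line against every saved file2 key, so the cost grows with the product of the two file sizes.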

Related

Detecting Special Characters with Regular Expression in python?

df
Name
0 ##
1 R##
2 ghj##
3 Ray
4 *#+
5 Jack
6 Sara123#
7 ( 1234. )
8 Benjamin k 123
9 _
10 _!##_
11 _#_&#+-
12 56##!
Output:
Bad_Name
0 ##
1 *#+
2 _
3 _!##_
4 _#_&#+-
I need to detect special characters with a regular expression. If a string contains any letter or number, it is valid; otherwise it should be considered a bad string.
I was using the regex '^\W*$'. Everything worked fine, except that a string containing '_' (underscore) is not treated as a bad string.
Use pandas.Series.str.contains:
df[~df['Name'].str.contains('[a-z0-9]', case=False)]
Output:
Name
0 ##
4 *#+
9 _
10 _!##_
11 _#_&#+-
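The original pattern misses underscores because `\w` covers letters, digits and `_`, so `\W` never matches an underscore. A minimal sketch of that point using only the standard `re` module follows; the helper name `is_bad_name` and the sample list are hypothetical.

```python
import re

# \W excludes "_" because \w covers letters, digits AND the underscore,
# so the underscore has to be added to the character class explicitly.
BAD_NAME = re.compile(r"[\W_]*")


def is_bad_name(name):
    """Return True when the string contains no letter or digit at all."""
    return BAD_NAME.fullmatch(name) is not None


names = ["##", "R##", "Ray", "*#+", "( 1234. )", "_", "_!##_", "56##!"]
bad = [n for n in names if is_bad_name(n)]
print(bad)
```

Note that an empty string also counts as bad under this definition, which may or may not be what is wanted.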

Removing special characters while retaining alpha numeric words

I'm in the middle of cleaning a data set that has this:
[IN]
my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
my_Series.str.replace("[^a-zA-Z]+", " ")
[OUT]
0
1 ASD
2 AUG M G
3 Air G G
4 Karsh
[IDEAL OUT]
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
My goal is to remove special characters and numbers, but if there's a word that contains both letters and digits, it should stay. Can anyone help?
Try with apply to achieve your ideal output.
>>> my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
>>> my_Series.apply(lambda x: " ".join(['' if word.isdigit() else word for word in x.replace('-', ' ').split()]))
Output:
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
I replace - with a space and split the string on whitespace, then check whether each word consists only of digits.
If it does, it is replaced with an empty string; otherwise the word is kept as is.
Finally, the list is joined back into a single string.
Edit 1:
regex solution :-
>>> my_Series.str.replace(r"((\d+)(?=.*\d))|([^a-zA-Z0-9 ])", " ", regex=True)
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
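Using a lookahead.
((\d+)(?=.*\d))|([^a-zA-Z0-9 ])
(A run of digits is replaced if another digit appears anywhere later in the string) OR (any character that is not alphanumeric or a space is replaced)
Note that this relies on the unwanted number being followed by a later digit, as in the sample data.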
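Since the lookahead version depends on where the digits sit in the string, a token-wise alternative may be easier to reason about. The sketch below uses only the standard `re` module and is not the answerer's method: it splits on non-alphanumeric characters and keeps tokens that contain at least one letter. The helper name `keep_words` is hypothetical.

```python
import re


def keep_words(text):
    """Drop special characters and pure numbers, keep words like 'M4G'."""
    # Split on anything that is not an ASCII letter or digit.
    tokens = re.split(r"[^A-Za-z0-9]+", text)
    # Keep tokens containing at least one letter; this drops pure numbers
    # and the empty strings produced by leading or trailing separators.
    return " ".join(t for t in tokens if re.search(r"[A-Za-z]", t))


samples = ["-", "ASD", "711-AUG-M4G", "Air G2G", "Karsh"]
cleaned = [keep_words(s) for s in samples]
print(cleaned)
```

With pandas, the same function could be passed to `my_Series.apply(keep_words)`.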

Bash - word/term frequency per line (i.e. document)

I have a file rev.txt like this:
header1,header2
1, some text here
2, some more text here
3, text and more text here
I also have a vocabulary document with all unique words from rev.txt, like so (but sorted):
a
word
list
text
here
some
more
and
I want to generate a term frequency table for each line in rev.txt that lists the number of occurrences of each vocabulary word in that line, like so:
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 2 1 0 1 1
They could be comma separated as well.
This is similar to a question here. However, instead of search through the entire document, I want to do this line by line, using the complete vocabulary I already have.
Re: Jean-François Fabre
Actually, I am performing these in MATLAB. However, bash (I believe) would be faster for this preprocessing as I have direct disk access to the files.
Normally, I would use Python, but limiting myself to bash, this hacky one-liner works for the given test case.
perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt | sed '1d' | awk -F' ' 'FILENAME=="wordlist.txt" {wc[$1]=0; wl[wllen++]=$1; next}; {for(i=1; i<=NF; i++){wc[$i]++}; for(i=0; i<wllen; i++){print wc[wl[i]]" "; wc[wl[i]]=0; if(i+1==wllen){print "\n"} }}' ORS="" wordlist.txt -
Explanation/My thinking...
In the first part, perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt, was used to pull out everything after the first comma (+removing the leading whitespace) from "rev.txt".
In the next part, sed '1d', was used to remove the first i.e. header line.
In the next part, we specified awk -F' ' ... ORS="" wordlist.txt - to use whitespace as the field delimiter, to set the output record separator to the empty string (note: we print the separators ourselves as we go), and to read input from wordlist.txt (i.e. the "vocabulary document with all unique words from rev.txt") and then from stdin.
In the awk command, if the FILENAME is equal to "wordlist.txt", then (1) initialize array wc where the keys are the vocab words and the count is 0, and (2) initialize a list wl where the word order is the same as in wordlist.txt.
FILENAME=="wordlist.txt" {
wc[$1]=0;
wl[wllen++]=$1;
next
};
After initialization, for each word in a line of stdin (i.e. the tidy rev.txt), increment the count of the word in wc.
{ for (i=1; i<=NF; i++) {
wc[$i]++
};
After the word counts have been added for a line, for each word in the list of words wl, print the count of that word followed by a space and reset the count in wc back to 0. If the word is the last in the list, then add a newline to the output.
for (i=0; i<wllen; i++) {
print wc[wl[i]]" ";
wc[wl[i]]=0;
if(i+1==wllen){
print "\n"
}
}
}
Overall, this should produce the specified output.
Here's one in awk. It reads in the vocabulary file voc.txt (it's a piece of cake to produce it automatically in awk), copies the word list for each row of text and counts the word frequencies:
$ cat program.awk
BEGIN {
PROCINFO["sorted_in"]="@ind_str_asc" # gawk only: traverse arrays in ascending string order of the index
}
NR==FNR { # store the voc.txt to w
w[$1]=0
next
}
FNR>1 { # process text files to matrix
for(i in w) # copy voc array
a[i]=0
for(i=2; i<=NF; i++) # count freqs
a[$i]++
for(i in a) # output matrix row
printf "%s%s", a[i], OFS
print ""
}
Run it:
$ awk -f program.awk voc.txt rev.txt
0 0 1 0 0 1 1 0
0 0 1 0 1 1 1 0
0 1 1 0 1 0 2 0
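For readers who are not restricted to the shell, the per-line counting can be sketched in Python as well. The assumptions here are that the first line is a header, that everything before the first comma is an id, and that columns follow the vocabulary order as given in the question (not sorted); the function name `term_frequency_rows` is hypothetical.

```python
from collections import Counter


def term_frequency_rows(review_lines, vocabulary):
    """Return one list of counts per review line, in vocabulary order."""
    rows = []
    for line in review_lines[1:]:         # skip the header line
        _, _, text = line.partition(",")  # drop the id before the first comma
        counts = Counter(text.split())    # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return rows


reviews = [
    "header1,header2",
    "1, some text here",
    "2, some more text here",
    "3, text and more text here",
]
vocab = ["a", "word", "list", "text", "here", "some", "more", "and"]
for row in term_frequency_rows(reviews, vocab):
    print(" ".join(map(str, row)))
```

Because the columns follow the unsorted vocabulary from the question, the rows match the table requested there rather than the sorted-column output of the awk program.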

String split by character

I have 50 strings of this form:
28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11
I want to separate the string after the state name (that is, split at the last letter of the state name). But there is a character 'F' near the end of the string, so I split the string in half using this:
substring(x,1,nchar(x)/2)
Now I am left with this:
28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1
Now I can try and separate the string after the last letter in the string. How do I do that? I understand that what I am doing is bad coding practice (Choosing to split the string in half). Is there a smarter way of doing this?
I have a list of all the states. Could I use that as a dictionary to split the strings?
We can use str_split with the n option. The lookaround regex means we split on one or more spaces that follow a lowercase letter and precede a digit. As we set n to 2, it splits only at the first occurrence of this pattern, giving two pieces.
library(stringr)
str_split(str1, "(?<=[a-z])\\s+(?=[0-9])", n = 2)[[1]]
#[1] "28 North Dakota"
#[2] "0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
Or, instead of using a package, we can use base R strsplit after inserting a delimiter with sub:
strsplit(sub("(.*[a-z])\\s(.*)", "\\1,\\2", str1), ",")[[1]]
#[1] "28 North Dakota"
#[2] "0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
If we need the first part alone, we match one or more spaces (\\s+) followed by a digit (\\d) followed by characters to the end of the string (.*) and replace the match with ''.
sub("\\s+\\d.*", "", str1)
#[1] "28 North Dakota"
If we need the state alone
library(stringr)
str_extract(str1, "[A-Za-z]+\\s*[A-Za-z]+")
#[1] "North Dakota"
NOTE: The OP mentioned about splitting after the state name.
data
str1 <- "28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
Here is a method using gsub:
gsub("^\\d+ ([A-Za-z ]+) \\d+.*", "\\1", temp)
[1] "North Dakota"
The regular expression matches one or more digits at the start of the string ("^\\d+") followed by a space (" "). It then captures, with "()", the following run of letters and spaces ("[A-Za-z ]+"). Next it matches a space followed by at least one digit (" \\d+") and anything that follows (".*"). The replacement "\\1" returns the captured subexpression.
To return the final part of the substring, you could move the capturing parentheses to the corresponding part of the regular expression.
gsub("^\\d+ [A-Za-z ]+ (\\d+.*)", "\\1", temp)
[1] "0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11"
or to capture the state name and the number that precedes it,
gsub("^(\\d+ [A-Za-z ]+) \\d+.*", "\\1", temp)
[1] "28 North Dakota"
the example string:
temp <- c("28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11")
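The capture-group idea above carries over to other regex engines. As a rough sketch in Python, assuming every record has the shape "number, state name made of letters and spaces, then a field starting with a digit", one pattern can return all three parts at once; the function name `split_record` is hypothetical.

```python
import re

# id, state name (letters and spaces, matched lazily), then the rest,
# which must start with a digit.
RECORD = re.compile(r"(\d+)\s+([A-Za-z ]+?)\s+(\d.*)")


def split_record(line):
    """Split a record into (id, state, remaining fields)."""
    match = RECORD.fullmatch(line)
    if match is None:
        raise ValueError("unexpected record layout: " + line)
    return match.group(1), match.group(2), match.group(3)


record = ("28 North Dakota 0 2 1 0 0 1 1 0 0 _1 _2 _1 "
          "0 0 0 0 1 0 0 0 0 2 16 F 9.5610957 11")
number, state, rest = split_record(record)
print(state)
print(rest)
```

The lone "F" near the end does not interfere, because the state group stops at the first space that is followed by a digit.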

Regular Expression: Find repeated patterns

Having this string s=";123;;123;;456;;124;;123;;567;" in R, which shows some Ids separated by ";", I want to find the repeated IDs, so in this case ";123;" is repeated. I used the following command in R:
gregexpr("(;[1-9]+;).*\1", s)
but it doesn't find the repeated patterns. Any idea what is wrong?
One example of a long string:
1760381;;1774536;;1774614;;1774617;;1774705;;1774723;;1775013;;1902321;;1928678;;2105486;;2105514;;2105544;;2105575;;2105585;;2279115;;2379236;;290927;;542280;;555749;;641540;;683822;;694934;;713228;;713248;;713249;;726949;;727204;;731434;;754522;;7693856;;100095;;1003838;;1045582;;1079057;;1108697;;1231229;;124087;;1249672;;1328126;;1412065;;1419930;;1441743;;1470580;;1476585;;1502106;;1556149;;1637775;;1643922;;1655644;;1755547;;1759001;;1760295;;1760296;;1760320;;1760326;;1760338;;1760348;;1760349;;1760350;;1760353;;1760375;;1760376;;1760377;;1760378;;1760388;;1760401;;1760402;;1760403;;1760410;;1760421;;1760425;;1760426;;1760642;;1760654;;1770463;;1774365;;1774366;;1774394;;1774449;;1774453;;1774454;;1774455;;1774456;;1774457;;1774458;;1774461;;1774462;;1774463;;1774464;;1774466;;1774469;;1774504;;1774505;;1774506;;1774519;;1774520;;1774525;;1774527;;1774529;;1774532;;1774533;;1774539;;1774542;;1774593;;1774595;;1774604;;1774610;;1774616;;1774617;;1774641;;1774660;;1774671;;1774674;;1774684;;1774687;;1774694;;1774704;;1774706;;1774713;;1774717;;1774722;;1774723;;1774726;;1774733;;1774745;;1774750;;1774753;;1774754;;1774766;;1774784;;1774786;;1774795;;1774799;;1774800;;1774803;;1774809;;1774813;;1774835;;1774849;;1774852;;1774853;;1774854;;1774857;;1774858;;1774861;;1774862;;1774867;;1774868;;1774869;;1774870;;1774877;;1774878;;1774880;;1774884;;1774885;;1774886;;1774902;;1774905;;1774934;;1774935;;1774937;;1774939;;1774946;;1774949;;1774950;;1774958;;1774959;;1774960;;1774961;;1774962;;1774964;;1774965;;1774966;;1774967;;1774969;;1774971;;1774972;;1774973;;1774975;;1774977;;1774978;;1774999;;1775000;;1775003;;1775005;;1775006;;1775009;;1775013;;1775014;;1775017;;1775024;;1775026;;1775033;;1775038;;1775040;;1775041;;1775044;;1775087;;1785544;;1811645;;1837210;;1864356;;1928674;;1928678;;1932882;;1954203;;2066856;;2076876;;2105349;;2105351;;2105458;;2105464;;2105476;;2105480;;2105482;;2105484;;2105489;;2105496;;2105500;;2105510;;2105514;;2105518;;2105532;;2105545;
;2105550;;2172257;;2172762;;218438;;2228198;;2229827;;2247909;;2262250;;2263135;;2287260;;2335872;;2335873;;2335874;;2335877;;2338682;;2352560;;2420902;;263946;;265370;;303060;;330571;;338764;;387492;;387750;;388362;;431807;;436056;;436442;;444058;;458026;;491696;;504783;;513098;;529228;;539799;;549649;;559957;;562574;;563116;;576418;;582851;;592273;;599952;;614463;;626416;;645122;;652363;;665854;;668048;;682877;;683822;;688317;;709795;;710684;;723114;;724447;;724526;;725177;;731389;;731434;;876958;;879962;;947924;;987322;;987446;;61326;;1025952;;1095970;;1338018;;1349990;;1373122;;1419930;;1760310;;1760320;;1774705;;1774706;;1774708;;1774712;;1774952;;1774954;;1774963;;1774972;;1774977;;1775077;;1901075;;2022080;;2117779;;2143723;;441554;;450517;;549649;;1010402;;113311;;1148258;;1374348;;1419930;;1606449;;1606515;;1606608;;1606610;;1760320;;1760338;;1760618;;1760642;;1774504;;1774520;;1774595;;1774705;;1774909;;1774977;;1775011;;1775043;;179542;;1928678;;2105598;;2105721;;2188303;;2335873;;340762;;387759;;436442;;504783;;588336;;646185;;682877;;715644;;725080;;741661;;760924
m<-gregexpr("[0-9]+",s)
n<-regmatches(s,m)
[[1]]
[1] "123" "123" "456" "124" "123" "567"
data.frame(table(unlist(n)))
Var1 Freq
1 123 3
2 124 1
3 456 1
4 567 1
The code works for your long form string too: Here is the head and tail of the output:
head(data.frame(table(unlist(n))),10)
Var1 Freq
1 100095 1
2 1003838 1
3 1010402 1
4 1025952 1
5 1045582 1
6 1079057 1
7 1095970 1
8 1108697 1
9 113311 1
10 1148258 1
tail(data.frame(table(unlist(n))),10)
Var1 Freq
316 731434 2
317 741661 1
318 754522 1
319 760924 1
320 7693856 1
321 876958 1
322 879962 1
323 947924 1
324 987322 1
325 987446 1
1) In the examples the ids are all the same length so we assume that is a general feature. Try this pattern where (?=...) is a zero width lookahead expression (see ?regex)
pat <- ";([1-9]+);(?=.*\\1)"
gregexpr(pat, s, perl = TRUE)
or this:
library(gsubfn)
strapply(s, pat, perl = TRUE)[[1]]
## [1] "123" "123"
This lists each id one fewer times than its occurrence (zero times for ids not duplicated) in s so to list each duplicated id uniquely try unique(st) where st is the result of this last line of code above.
Note: In the second example in the question, i.e. the long string, there is no ; at the end of the string so the last id can never be matched by the expression unless we first paste a ; onto the end.
2) Instead of matching the contents we could match the delimiters instead:
strsplit(s, ";+")[[1]][-1]
If st is the result of this line of code then st is just a vector of all the ids, so unique(st[duplicated(st)]) uniquely lists each duplicated id and involves no backreferences.
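The counting approach is easy to reproduce outside R. A minimal Python sketch follows, assuming ids are runs of digits separated by semicolons; the function name `duplicated_ids` is hypothetical.

```python
import re
from collections import Counter


def duplicated_ids(s):
    """Map each id that occurs more than once to its number of occurrences."""
    counts = Counter(re.findall(r"[0-9]+", s))
    return {id_: n for id_, n in counts.items() if n > 1}


s = ";123;;123;;456;;124;;123;;567;"
print(duplicated_ids(s))
```

Matching whole runs of digits avoids the partial-match risk of a backreference, where a short id could match inside a longer one.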