I am trying to remove a specific pattern of numbers from a string using the regexr() function in Stata. I want to remove any run of digits that is not attached to a letter or to another non-whitespace character. For example, if the string contained t370 or 6-test, I would want those to remain. Only numbers that stand alone, separated from their neighbours by whitespace, should be removed.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find numbers wrapped in word boundaries: regexr(string, "\b[0-9]+\b", ""). I've also tried adding the white space manually with " [0-9]+", which only replaces the pattern when it occurs in the middle of a string, not at the start. If it's easier to do this without regular expressions, that's fine; I was just trying to become more familiar with them.
Following up on the loop suggested in the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
* add in parts that contain at least one letter
replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
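For readers who want to check the rule outside Stata, here is a minimal sketch of the same "keep every word that contains at least one letter" filter in C++. The function name keepWordsWithLetters is hypothetical, and "letter" is assumed to mean an ASCII letter, as in the [a-zA-Z] class above.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: keeps only the whitespace-separated words that
// contain at least one letter, joined by single spaces.
std::string keepWordsWithLetters(const std::string& text)
{
    std::istringstream words(text);
    std::string word;
    std::string result;
    while (words >> word)
    {
        bool hasLetter = false;
        for (unsigned char c : word)
        {
            if (std::isalpha(c)) { hasLetter = true; break; }
        }
        if (!hasLetter) continue;              // drop "9884", "58", "-", ...
        if (!result.empty()) result += ' ';    // separator between kept words
        result += word;
    }
    return result;
}
```

Calling keepWordsWithLetters("9884 7-test 58 - 489") gives "7-test", which matches the first row of the Stata output.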
I have a returned string formatted as below:
PR ER
89
>
from which the number can be extracted by using \n(\d+), but sometimes it returns:
23 PR P 10000>
Or, it could be something like:
23
PR P
10000
>
In these scenarios, how can I extract the number 10000 between PR and >?
This might work for you:
\d+(?=\s*>)
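It looks for any sequence of digits followed by any amount of whitespace (including newlines) and then a '>'. The lookahead (?=...) checks for the whitespace and the '>' without including them in the match.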
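As a quick way to try the pattern against the sample inputs, here is a minimal sketch in C++ using std::regex, whose default ECMAScript grammar supports lookahead. The function name numberBeforeAngle is hypothetical and not part of the question.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first run of digits that is followed by
// optional whitespace and then '>', or an empty string if there is none.
std::string numberBeforeAngle(const std::string& text)
{
    // \d+ matches the digits; (?=\s*>) only asserts what follows them.
    static const std::regex pattern(R"(\d+(?=\s*>))");
    std::smatch match;
    if (std::regex_search(text, match, pattern))
    {
        return match.str();
    }
    return "";
}
```

With the inputs from the question, numberBeforeAngle("23 PR P 10000>") and numberBeforeAngle("23\nPR P\n10000\n>") both give "10000", and the leading 23 is skipped because it is not followed by '>'.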
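For Java, if you need it:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "23 PR P 10000>";
// Note: this pattern lists every run of digits together with its position
Pattern reg = Pattern.compile("(\\d+)");
Matcher m = reg.matcher(str);
while (m.find()) {
System.out.println("group : " + m.group() + " - start :" + m.start() + " - end :" + m.end());
}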
I might just answer this myself:
\d+\n>
worked!
Thanks, all.
Here is the question I have to solve and the code I've written so far.
Write a function named printDuplicates that accepts an input stream and an output stream as parameters.
The input stream represents a file containing a series of lines. Your function should examine each line looking for consecutive occurrences of the same token on the same line and print each duplicated token along with how many times it appears consecutively.
Non-repeated tokens are not printed. Repetition across multiple lines (such as if a line ends with a given token and the next line starts with the same token) is not considered in this problem.
For example, if the input file contains the following text:
hello how how are you you you you
I I I am Jack's Jack's smirking smirking smirking smirking smirking revenge
bow wow wow yippee yippee yo yippee yippee yay yay yay
one fish two fish red fish blue fish
It's the Muppet Show, wakka wakka wakka
My expected result should be:
how*2 you*4
I*3 Jack's*2 smirking*5
wow*2 yippee*2 yippee*2 yay*3
\n
wakka*3
Here is my function:
1 void printDuplicates(istream& in, ostream& out)
2 {
3 string line; // Variable to store lines in
4 while(getline(in, line)) // While there are lines to get do the following
5 {
6 istringstream iss(line); // String stream initialized with line
7 string word; // Current word
8 string prevWord; // Previous word
9 int numWord = 1; // Starting index for # of a specific word
10 while(iss >> word) // Storing strings in word variable
11 {
12 if (word == prevWord) ++numWord; // If a word and the word
13 // before it are equal add to word counter
14 else if (word != prevWord) // Else if the word and the word before it are not equal
15 {
16 if (numWord > 1) // And there are at least two copies of that word
17 {
18 out << prevWord << "*" << numWord << " "; // Print out "word*occurrences"
19 }
20 numWord = 1; // Reset the num counter variable for next word
21 }
22 prevWord = word; // Set current word to previous word, loop begins again
23 }
24 out << endl; // Prints new line between each iteration of line loop
25 }
26 }
My result thus far is:
how*2
I*3 Jack's*2 smirking*5
wow*2 yippee*2 yippee*2
I have tried adding (|| iss.eof()), (|| iss.peek == EOF), etc inside the nested else if statement on Line 14, but I am unable to figure this guy out. I need some way of knowing I'm at the end of the line so my else if statement will be true and try to print the last word on the line.
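One possible way around this, sketched below, is to skip end-of-line detection inside the inner loop and instead print the pending run once more after that loop finishes, since a run that reaches the end of the line is never followed by a different word. This is a minimal sketch rather than the only fix; the helper flushRun is hypothetical and not part of the assignment.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: prints "word*count " only for runs longer than one.
static void flushRun(std::ostream& out, const std::string& word, int count)
{
    if (count > 1) out << word << "*" << count << " ";
}

void printDuplicates(std::istream& in, std::ostream& out)
{
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream iss(line);
        std::string word;
        std::string prevWord;
        int numWord = 1;
        while (iss >> word)
        {
            if (word == prevWord)
            {
                ++numWord;                        // run continues
            }
            else
            {
                flushRun(out, prevWord, numWord); // run ended mid-line
                numWord = 1;
            }
            prevWord = word;
        }
        flushRun(out, prevWord, numWord);         // run that reaches end of line
        out << std::endl;
    }
}
```

For the line "hello how how are you you you you" this prints "how*2 you*4 " (with the same trailing space as the original code), and a line without repeats still produces an empty line.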
I have an AWK script that writes values matching specific patterns to a .csv file.
The code is as follows:
BEGIN{print "Query Start,Query End, Target Start, Target End,Score, E,P,GC"}
/^\>g/ { Query=$0 }
/Query =/{
split($0,a," ")
query_start=a[3]
query_end=a[5]
query_end=gsub(/,/,"",query_end)
target_start=a[8]
target_end=a[10]
}
/Score =/{
split($0,a," ")
score=a[3]
score=gsub(/,/,"",score)
e=a[6]
e=gsub(/,/,"",e)
p=a[9]
p=gsub(/,/,"",p)
gc=a[12]
printf("%s,%s,%s,%s,%s,%s,%s,%s\n",query_start, query_end,target_start,target_end,score,e,p,gc)
}
The input file is as follows:
>gi|ABCDEF|
Plus strand results:
Query = 100 - 231, Target = 100 - 172
Score = 20.92, E = 0.01984, P = 4.309e-08, GC = 51
But I received the output in a .csv file as provided below:
100 0 100 172 0 0 0 51
The program failed to copy the values of:
Query end
Score
E
P
(Note: all the failed values appear immediately before a comma (,) in the input.)
Any help to obtain the right output will be great.
Best regards,
Amit
As @Jidder mentioned, you don't need to call split(), and as @jaypal mentioned, you're using gsub() incorrectly: it returns the number of substitutions made, not the modified string. You also don't need to call gsub() at all if you just include , in your FS.
Try this:
BEGIN {
FS = "[[:space:],]+"
OFS = ","
print "Query Start","Query End","Target Start","Target End","Score","E","P","GC"
}
/^\>g/ { Query=$0 }
/Query =/ {
query_start=$4
query_end=$6
target_start=$9
target_end=$11
}
/Score =/ {
score=$4
e=$7
p=$10
gc=$13
print query_start,query_end,target_start,target_end,score,e,p,gc
}
Does that work? Note that the field numbers are bumped up by 1: when you don't use the default FS, awk no longer skips leading white space, so any leading white space in your input produces an empty first field.
Also, you are not using your Query variable, so the line that populates it is redundant.
I have a program:
Question: Read a series of integers of at most two digits and print every input value, but stop the loop when 42 is reached:
example
input
1
2
87
42
99
output
1
2
87
my code
a = []
5.times do |i|
a[i] = Integer(gets.chomp)
end
a.each do |e|
break if e == '42'
puts e
end
A few things to change. First of all, gets will give you a string with \n at the end, so you need to change it to gets.chomp to remove it.
Now your loop should look like this:
a.each do |e|
break if e == '42'
puts e
end
However, Ruby's Array has a much better method that is perfect for what you want:
puts a.take_while {|e| e != '42'}
Additional notes:
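Note that this operates on strings rather than numbers. If you convert the input to integers (for example with Integer(...)), compare against 42 rather than '42', because an integer is never equal to a string.
5.times do |i| - the |i| bit is unnecessary if you append to the array with a << ... instead of assigning to a[i].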