Just a basic question related to python regex.
My code is this
void main void()
{
abc
{
tune
}
{
wao
}
}
i want to replace the abc ,tune and wao with different values like abc=u, tune=p and wao=m.
I am unable to select these two sub brackets of code in my regex.
The Regex which i am using is \{([^}]+)\}.
I want to write a code which can substitute these values in such a way that first all sub brackets were substituted i.e (tune=p and wao=m) then main brackets are substituted (abc=u).
If you do want to make use of regex, the following will work for the given example:
import re
code = """void main void()
{
abc
{
tune
}
{
wao
}
}
"""
code = re.sub(r"({\s*)[^{\s]+(\s*{)", r"\1u\2", code)
code = re.sub(r"({\s*)[^{\s]+(\s*})(?=\s*{)", r"\1p\2", code)
code = re.sub(r"({\s*)[^{\s]+(\s*})(?=\s*})", r"\1m\2", code)
print(code)
Output:
void main void()
{
u
{
p
}
{
m
}
}
But in general cases it is recommended to do a stateful parsing as #wim suggests.
I am copying code of file 1 to file 2 , but i want the code in file 2 to look adjusted with indentation like this: at the beginning indentation=0, every curly bracket opened increases the depth of indentation, every curly bracket closed reduces the indentation 4 spaces for example. I need help in fixing this to work
char preCh;
int depth=0;
int tab = 3;
int d = 0;
int pos = 0;
file1.get(ch);
while(!file1.eof())
{
if(ch=='{')
{
d++;
}
if(ch=='}'){
d--;
}
depth = tab * d;
if(preCh == '{' && ch=='\n'){
file2.put(ch);
for (int i = 0; i <= depth; i++)
{
file2.put(' ');
}
}
else
file2.put(ch);
preCh = ch;
ch = file1.get();
}
}
result must be indented like in code editors:
int main(){
if(a>0)
{
something();
}
}
Maybe, unexpectedly for you, there is no easy answer to your question.
And because of that, your code will never work.
First and most important, you need to understand and define indentation styles. Please see here in Wikipedia. Even in your given mini example, you are mixing Allman and K&R. So, first you must be clear, what to use.
Then, you must be aware that brackets may appear in quotes, double quotes, C-Comments, C++ comments and even worse, multi line comments (and #if or #idefs). This will make life really hard.
And, for the closing brackets, and for example Allman style, you will know the needed indentation only after you printed already the "indentation spaces". So you need to work line oriented or use a line buffers, before you print a complete line.
Example:
}
In this one simple line, you will read the '}' character, after you have already printed the spaces. This will always lead to wrong (too far right) indentation.
The logic for only this case would be complicated. Ant then assume statements like
if (x ==5) { y = 3; } } } }
So, unfortunately I cannot give you an easy solution.
A parser would be needed, or I simply recommend any kinde of beautifier or pretty printer
First, a bit of background about my question.
I work as a bioinformatician, which means that I do informatics treatment to try to answer a biological question. In my problem, I have to manipulate a file called a FASTA file which looks like this :
>Header 1
ATGACTGATCGNTGACTGACTGTAGCTAGC
>Header 2
ATGCATGCTAGCTGACTGATCGTAGCTAGC
ATCGATCGTAGCT
So a FASTA file is basically just a header, preceded by a '>' character, then a sequence on one or multiple lines that is composed of nucleotides. Nucleotides are characters that can take 5 possible values : A, T, C, G or N.
The thing I would like to do is count the number of times each nucleotide type appears so if we consider this dummy FASTA file :
>Header 1
ATTCGN
I should have, as a result :
A:1 T:2 C:1 G:1 N:1
Here is what I got so far :
ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;
while(getline(sequence_file, line)) {
if(line[0] != '>') {
sequence += line;
}
else {
nucleotide_counts['A'] = boost::count(sequence, 'A');
nucleotide_counts['T'] = boost::count(sequence, 'T');
nucleotide_counts['C'] = boost::count(sequence, 'C');
nucleotide_counts['G'] = boost::count(sequence, 'G');
nucleotide_counts['N'] = boost::count(sequence, 'N');
sequence = "";
}
}
So it reads the file line by line, if it encounters a '>' as the first character of the line, it knows that the sequence is complete and starts to count. Now the problem I'm facing is that I have millions of sequences with several billions of nucleotides in total. I can see that my method is not optimized because I call boost::count five times on the same sequence.
Other things I have tried :
Parsing the sequence to increment a counter for each nucleotide types. I tried using a map<char, double> to map each nucleotide to a value but this was slower than the boost solution.
Using the std::count of the algorithm library but this was too slow too.
I searched the internet for solutions but every solution I found was good if the number of sequences was low, which is not my case. Would you have any idea that could help me speed things up ?
EDIT 1 :
I also tried this version but it was 2 times slower than the boost one :
ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;
while(getline(sequence_file, line)) {
if(line[0] != '>') {
sequence += line;
}
else {
for(int i = 0; i < sequence.size(); i++) {
nucleotide_counts[sequence[i]]++;
}
sequence = "";
}
}
EDIT 2 : Thanks to everyone in this thread, I was able to obtain a speed up of about 30 times compared to the boost original solution. Here is the code :
#include <map> // std::array
#include <fstream> // std::ifstream
#include <string> // std::string
void count_nucleotides(std::array<double, 26> &nucleotide_counts, std::string sequence) {
for(unsigned int i = 0; i < sequence.size(); i++) {
++nucleotide_counts[sequence[i] - 'A'];
}
}
std::ifstream sequence_file(input_file.c_str());
std::string line;
std::string sequence = "";
std::array<double, 26> nucleotide_counts;
while(getline(sequence_file, line)) {
if(line[0] != '>') {
sequence += line;
}
else {
count_nucleotides(nucleotide_counts, sequence);
sequence = "";
}
}
In order of importance:
Good code for this task will 100% be I/O-bound. Your processor can count characters much faster than your disk can pump them to the CPU. Thus, the first question to me is: What is the throughput of your storage medium? What are your ideal RAM and cache throughputs? Those are the upper limits. If you've hit them, there's not much point in looking at your code further. It's possible that your boost solution is there already.
std::map lookups are relatively expensive. Yes, it's O(log(N)), but your N=5 is small and constant, so this tells you nothing. For 5 values, the map will have to chase about three pointers for every lookup (not to mention how impossible this is for the branch predictor). Your count solution has 5 map lookups and 5 traversals of each string, whereas your manual solution has a map lookup for every nucleotide (but only one traversal of the string).
Serious suggestion: Use a local variable for each counter. Those will almost surely get placed in CPU registers and are therefore essentially free. You won't ever pollute your cache with the counters that way, unlike map, unordered_map, vector etc.
Replacing abstraction by repetition like this is usually not a good idea, but in this case, it's pretty inconceivable that you'll ever need significantly more counters, so scalability is not an issue.
Consider std::string_view (which would require a different method of reading the file) to avoid creating copies of the data. You load the entire data into memory from disk and then, for each sequence, you copy it. That's not really necessary and (depending on how smart your compiler is) can bog you down. Especially since you keep appending to the string until the next header (which is more unnecessary copying - you could just count after every line).
If, for some reason, you are not hitting the theoretical throughputs, consider multithreading and/or vectorization. But I can't imagine this would be necessary.
By the way, boost::count is a thin wrapper around std::count at least in this version.
I think you did the right thing here though: Writing good and readable code, then identifying it as performance bottleneck and checking if you can make it run faster (potentially by making it slightly more ugly).
If this is the main task you have to perform, you might have an interest in an awk solution. Various problems with FASTA files are very easily tackled with awk:
awk '/^>/ && c { for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" ; delete a }
/^>/ {print; c++; next}
{ for(i=1;i<=length($0);++i) a[substr($0,i,1)]++ }
END{ for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" }' fastafile
This outputs on your example:
>Header 1
N:1 A:7 C:6 G:8 T:8
>Header 2
A:10 C:10 G:11 T:12
note: I am aware that this is not C++, but it is often useful to show other means to achieve the same goal.
Benchmarks with awk:
testfile: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
unziped size: 2.3G
total records: 5502947
total lines:
Script 0: (runtime: too long) The first mentioned script is utterly slow. Use only on small files
Script 1: (runtime: 484.31 sec) This is an optimised version where we do a targetted count:
/^>/ && f { for(i in c) printf i":"c[i]" "; print "" ; delete c }
/^>/ {print; f++; next}
{ s=$0
c["A"]+=gsub(/[aA]/,"",s)
c["C"]+=gsub(/[cC]/,"",s)
c["G"]+=gsub(/[gG]/,"",s)
c["T"]+=gsub(/[tT]/,"",s)
c["N"]+=gsub(/[nN]/,"",s)
}
END { for(i in c) printf i":"c[i]" "; print "" ; delete c }
Update 2: (runtime: 416.43 sec) Combine all the subsequences into a single sequence and count only ones:
function count() {
c["A"]+=gsub(/[aA]/,"",s)
c["C"]+=gsub(/[cC]/,"",s)
c["G"]+=gsub(/[gG]/,"",s)
c["T"]+=gsub(/[tT]/,"",s)
c["N"]+=gsub(/[nN]/,"",s)
}
/^>/ && f { count(); for(i in c) printf i":"c[i]" "; print "" ; delete c; string=""}
/^>/ {print; f++; next}
{ string=string $0 }
END { count(); for(i in c) printf i":"c[i]" "; print "" }
Update 3: (runtime: 396.12 sec) Refine how awk finds its records and fields, and abuse this in a single go.
function count() {
c["A"]+=gsub(/[aA]/,"",string)
c["C"]+=gsub(/[cC]/,"",string)
c["G"]+=gsub(/[gG]/,"",string)
c["T"]+=gsub(/[tT]/,"",string)
c["N"]+=gsub(/[nN]/,"",string)
}
BEGIN{RS="\n>"; FS="\n"}
{
print $1
string=substr($0,length($1)); count()
for(i in c) printf i":"c[i]" "; print ""
delete c; string=""
}
Update 4: (runtime: 259.69 sec) Update the regex search in gsub. This creates a worthy speedup:
function count() {
n=length(string);
gsub(/[aA]+/,"",string); m=length(string); c["A"]+=n-m; n=m
gsub(/[cC]+/,"",string); m=length(string); c["C"]+=n-m; n=m
gsub(/[gG]+/,"",string); m=length(string); c["G"]+=n-m; n=m
gsub(/[tT]+/,"",string); m=length(string); c["T"]+=n-m; n=m
gsub(/[nN]+/,"",string); m=length(string); c["N"]+=n-m; n=m
}
BEGIN{RS="\n>"; FS="\n"}
{
print ">"$1
string=substr($0,length($1)); count()
for(i in c) printf i":"c[i]" "; print ""
delete c; string=""
}
Don't use a map if you want speed and can use an array. Also, std::getline can use a custom delimiter (instead of \n).
ifstream sequence_file(input_file.c_str());
string sequence = "";
std::array<int, 26> nucleotide_counts;
// For one sequence
getline(sequence_file, sequence, '>');
for(auto&& c : sequence) {
++nucleotide_counts[c-'A'];
}
// nucleotide_counts['X'-'A'] contains the count of nucleotide X in the sequence
Demo
The reason why it's so slow is that you have indirect accesses all the time or 5 scans of the same string.
You don't need a map, use 5 integers, and increment them separately. Then it should be faster than the boost::count version because you don't traverse the string 5 times, and it will be faster than the map or the unordered_map increments because you won't have n indirect accesses.
so use something like:
switch(char)
{
case 'A':
++a;
break;
case 'G':
++g;
break;
}
...
Like people in comments suggested, try sth like that
enum eNucleotide {
NucleotideA = 0,
NucleotideT,
NucleotideC,
NucleotideG,
NucleotideN,
Size,
};
void countSequence(std::string line)
{
long nucleotide_counts[eNucleotide::Size] = { 0 };
if(line[0] != '>') {
for(int i = 0; i < line.size(); ++i)
{
switch (line[i])
{
case 'A':
++nucleotide_counts[NucleotideA];
break;
case 'T':
++nucleotide_counts[NucleotideT];
break;
case 'C':
++nucleotide_counts[NucleotideC];
break;
case 'G':
++nucleotide_counts[NucleotideC];
break;
case 'N':
++nucleotide_counts[NucleotideN];
break;
default :
/// error condition
break;
}
}
/// print results
std::cout << "A: " << nucleotide_counts[NucleotideA];
std::cout << "T: " << nucleotide_counts[NucleotideT];
std::cout << "C: " << nucleotide_counts[NucleotideC];
std::cout << "G: " << nucleotide_counts[NucleotideG];
std::cout << "N: " << nucleotide_counts[NucleotideN] << std::endl;
}
}
and call this function for every line content.(Didn't tested code.)
I need a regular expression to extract the following text block functions only.
Example:
// Comment 1. function example1() { return 1; } // Comment 2 function example2() { if (a < b) { a++ } } // Comment 3 function example3() { while (1) { i++; } } /* Comment 4 */ function example4() { i = 4; for (i = 1; i < 10; i++) { i++; } return i; }
Take into account that no line breaks. It is a single block of code.
I have tried using the following regular expression:
Expression:
function\s[a-z|A-Z|0-9_]+()\s?{(?:.+)\s}
But there is a problem, place the .+ , take me all characters to the end of the text block.
Thanks in advance guys for the help you can give me.
In PCRE (PHP, R, Delphi), you can achieve this with recursion:
function\s[a-zA-Z0-9_]+\(\)(\s?{(?>[^{}]|(?1))*})
See demo.
In Ryby, just use \g<1> instead of (?1):
function\s[a-zA-Z0-9_]+\(\)(\s?{(?>[^{}]|(\g<1>))*})
In .NET, you can match them using balanced groups:
function\s[a-zA-Z0-9_]+\(\)\s*{((?<sq>{)|(?<-sq>})|[^{}]*)+}
See demo 2
In other languages, there is no recursion, and you need to use a workaround by adding nested levels "manually". In your examples, you have 2 levels.
Thus, in Python, it will look like:
function\s[a-zA-Z0-9_]+\(\)(?:\s?{(?:[^{}]*(?:\s*{[^{}]*}[^{}]*)*})*)
See Python demo (also works in JavaScript
In Java, you will need to escape {:
function\s[a-zA-Z0-9_]+\(\)(?:\s?\{(?:[^\{}]*(?:\s*\{[^{}]*}[^{}]*)*})*)
I'm currently trying how to use Regex in C++ for more advanced stuff like creating my own coding language (just for the fun of it) and I'm having a bit of a problem with reading out the scope of the code.
I'm currently using this text to test my Regex against,
private char test;
public static int foo = 0;
private int var = 0;
private void run(char data, int add)
{
var += 50 + add;
print (var + test, asdm, asf.getString());
if (var == 70) {
print("yes");
}
}
And the regex I'm using to read out the scope of run and the if-statement within it is,
\{([\S\s]*?)\}
This is what it matches atm
{
var += 50 + add;
print (var + test, asdm, asf.getString());
if (var == 70) {
print("yes");
}
The problem is that it find a } before the actual } it needs, I'd like to know if there is a way to go around this and capture each individual scope as it's own part.
I've tried messing around with loopbacks but I keep getting the error 'The expression contained mismatched ( and )' even though it should be valid regex. Edit, I found out that the c++ version of regex (default one) does not support lookbehind which is why it didn't work.