How can I parallelize this Matrix times Vector operation using OpenMP? - c++

I am trying to parallelise the for loop in the code below, but I am not sure which OpenMP directives I should add before the for loop, and whether I need to declare those variables as private or shared first.
#include <stdio.h>
#include <time.h>
#include <omp.h>
void mxv_row(int m, int n, double *A, double *B, double *C)
{
int i, j;
# pragma omp parallel private(?)shared (?)
for (i=0; i<m; i++)
# pragma omp for
{
A[i] = 0.0;
for (j=0; j<n; j++)
A[i] += B[i*n+j]*C[j];
}
}

Since you declared j outside the loop, it must be made private. If you instead declare it inside the second for loop, it is private automatically and you're fine.
The rest can be shared (except i, which is your parallel loop index), but it doesn't have to be, as those are only sizes and pointers.
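For reference, a minimal sketch of how the parallelized loop could look with that data-sharing split (the combined parallel for and the local sum variable are just one way to write it, not taken from the original question):

#include <omp.h>

void mxv_row(int m, int n, double *A, double *B, double *C)
{
    /* m, n and the pointers can stay shared; i, j and sum are private */
    #pragma omp parallel for shared(A, B, C, m, n)
    for (int i = 0; i < m; i++) {       /* i: private as the parallel loop index */
        double sum = 0.0;
        for (int j = 0; j < n; j++)     /* j: declared inside the loop, so private */
            sum += B[i*n + j] * C[j];
        A[i] = sum;
    }
}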

Related

is a race condition for parallel OpenMP threads reading the same shared data possible?

There is a piece of code
#include <iostream>
#include <array>
#include <random>
#include <omp.h>
class DBase
{
public:
DBase()
{
delta=(xmax-xmin)/n;
for(int i=0; i<n+1; ++i) x.at(i)=xmin+i*delta;
y={1.0, 3.0, 9.0, 15.0, 20.0, 17.0, 13.0, 9.0, 5.0, 4.0, 1.0};
}
double GetXmax(){return xmax;}
double interpolate(double xx)
{
int bin=xx/delta;
if(bin<0 || bin>n-1) return 0.0;
double res=y.at(bin)+(y.at(bin+1)-y.at(bin)) * (xx-x.at(bin))
/ (x.at(bin+1) - x.at(bin));
return res;
}
private:
static constexpr int n=10;
double xmin=0.0;
double xmax=10.0;
double delta;
std::array<double, n+1> x;
std::array<double, n+1> y;
};
int main(int argc, char *argv[])
{
DBase dbase;
const int N=10000;
std::array<double, N> rnd{0.0};
std::array<double, N> src{0.0};
std::array<double, N> res{0.0};
unsigned seed = 1;
std::default_random_engine generator (seed);
for(int i=0; i<N; ++i) rnd.at(i)=
std::generate_canonical<double,std::numeric_limits<double>::digits>(generator);
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
src.at(i)=rnd.at(i) * dbase.GetXmax();
res.at(i)=dbase.interpolate(rnd.at(i) * dbase.GetXmax());
}
for(int i=0; i<N; ++i) std::cout<<"("<<src.at(i)<<" , "<<res.at(i)
<<") "<<std::endl;
return 0;
}
It seems to work properly both with and without #pragma omp parallel for (I checked the output). But I can't understand the following things:
1) Different parallel threads access the same arrays x and y of the object dbase of class DBase (I understand that OpenMP implicitly made the dbase object shared, i.e. #pragma omp parallel for shared(dbase)). Different threads do not write to these arrays, they only read from them. But when they read, can there be a race condition on their reads from x and y or not? If not, how is it arranged that at any moment only one thread reads from x and y in interpolate() and the threads do not interfere with each other? Or is there perhaps a local copy of the dbase object and its x and y arrays in each OpenMP thread (which would be equivalent to #pragma omp parallel for private(dbase))?
2) Should I write #pragma omp parallel for shared(dbase) in such code, or is #pragma omp parallel for enough?
3) I think that if I placed a single random number generator inside the for loop, then to make it work properly (i.e. not to let its internal state get into a race condition) I should write:
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
src.at(i)=rnd.at(i) * dbase.GetXmax();
#pragma omp atomic
std::generate_canonical<double,std::numeric_limits<double>::digits>
(generator)
res.at(i)=dbase.interpolate(rnd.at(i) * dbase.GetXmax());
}
The #pragma omp atomic would destroy the performance gain from #pragma omp parallel for (it would make threads wait). So the only correct way to use random numbers inside a parallel region is to have a separate generator (or seed) for each thread, or to prepare all the needed random numbers before the for loop. Is that correct?
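For what it's worth, a minimal sketch of the per-thread-generator idea from point 3 (the seeding scheme and names here are only illustrative):

#include <limits>
#include <omp.h>
#include <random>
#include <vector>

int main() {
    const int N = 10000;
    std::vector<double> rnd(N);

    #pragma omp parallel
    {
        // One engine per thread with a distinct seed, so there is no shared state to race on.
        std::default_random_engine generator(1 + omp_get_thread_num());
        #pragma omp for
        for (int i = 0; i < N; ++i)
            rnd[i] = std::generate_canonical<double,
                         std::numeric_limits<double>::digits>(generator);
    }
    return 0;
}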

From serial to omp: no speedup

I'm new to OpenMP. I'm working on the All-Pairs Shortest Path algorithm, and here is the serial C++ routine I need to parallelize (complete code at the end of the post):
void mini(vector<vector<double>> &M, size_t n, vector<double> &rowk, vector<double> &colk)
{
size_t i, j;
for ( i=0; i<n; i++)
for ( j=0; j<n; j++)
M[i][j]=min(rowk[j]+colk[i], M[i][j]);
}
At execution I get this:
$ time ./floyd
real 0m0,349s
user 0m0,349s
sys 0m0,000s
Now, I try to insert some directives:
void mini(vector<vector<double>> &M, size_t n, vector<double> &rowk, vector<double> &colk)
{
#pragma omp parallel
{
size_t i, j;
#pragma omp parallel for
for ( i=0; i<n; i++)
for ( j=0; j<n; j++)
M[i][j]=min(rowk[j]+colk[i], M[i][j]);
}
}
Unfortunately, there is no speedup:
$ grep -c ^processor /proc/cpuinfo
4
$ time ./floyd
real 0m0,547s
user 0m2,073s
sys 0m0,004s
What am I doing wrong?
EDIT
Processor: Intel(R) Core(TM) i5-4590 CPU (4 hardware cores)
Complete code:
#include <cstdio>
#include <vector>
#include <limits>
#include <ctime>
#include <random>
#include <set>
#include <omp.h>
using namespace std;
typedef struct Wedge
{
int a, b;
double w;
} Wedge;
typedef pair<int, int> edge;
int randrange(int end, int start=0)
{
random_device rd;
mt19937 gen(rd());
uniform_int_distribution<> dis(start, end-1);
return dis(gen);
}
void relax_omp(vector<vector<double>> &M, size_t n, vector<double> &rowk, vector<double> &colk)
{
#pragma omp parallel
{
size_t i, j;
#pragma omp parallel for
for (i=0; i<n; i++)
for ( j=0; j<n; j++)
M[i][j]=min(rowk[j]+colk[i], M[i][j]);
}
}
void relax_serial(vector<vector<double>> &M, size_t n, vector<double> &rowk, vector<double> &colk)
{
size_t i, j;
for (i=0; i<n; i++)
for ( j=0; j<n; j++)
M[i][j]=min(rowk[j]+colk[i], M[i][j]);
}
void floyd(vector<vector<double>> &dist, bool serial)
{
size_t i, k;
size_t n {dist.size()};
for (k=0; k<n; k++)
{
vector<double> rowk =dist[k];
vector<double> colk(n);
for (i=0; i<n; i++)
colk[i]=dist[i][k];
if (serial)
relax_serial(dist, n, rowk, colk);
else
relax_omp(dist, n, rowk, colk);
}
for (i=0; i<n; i++)
dist[i][i]=0;
}
vector<Wedge> random_edges(int n, double density, double max_weight)
{
int M{n*(n-1)/2};
double m{density*M};
set<edge> edges;
vector<Wedge> wedges;
while (edges.size()<m)
{
pair<int,int> L;
L.first=randrange(n);
L.second=randrange(n);
if (L.first!=L.second && edges.find(L) == edges.end())
{
double w=randrange(max_weight);
Wedge wedge{L.first, L.second, w};
wedges.push_back(wedge);
edges.insert(L);
}
}
return wedges;
}
vector<vector<double>> fill_distances(vector<Wedge> wedges, int n)
{
double INF = std::numeric_limits<double>::infinity();
size_t i, m=wedges.size();
vector<vector<double>> dist(n, vector<double>(n, INF));
int a, b;
double w;
for (i=0; i<m; i++)
{ a=wedges[i].a;
b=wedges[i].b;
w=wedges[i].w;
dist[a][b]=w;
}
return dist;
}
int main (void)
{
double density{0.33};
double max_weight{200};
int n{800};
bool serial;
int ntest=10;
double avge_serial=0, avge_omp=0;
for (int i=0; i<ntest; i++)
{
vector<Wedge> wedges=random_edges(n, density, max_weight);
vector<vector<double>> dist=fill_distances(wedges, n);
double dtime;
dtime = omp_get_wtime();
serial=true;
floyd(dist, serial);
dtime = omp_get_wtime() - dtime;
avge_serial+=dtime;
dtime = omp_get_wtime();
serial=false;
floyd(dist, serial);
dtime = omp_get_wtime() - dtime;
avge_omp+=dtime;
}
printf("%d tests, n=%d\n", ntest, n);
printf("Average serial : %.2lf\n", avge_serial/ntest);
printf("Average openMP : %.2lf\n", avge_omp/ntest);
return 0;
}
output:
20 tests, n=800
Average serial : 0.31
Average openMP : 0.61
command line:
g++ -std=c++11 -Wall -O2 -Wno-unused-result -Wno-unused-variable -Wno-unused-but-set-variable -Wno-unused-parameter floyd.cpp -o floyd -lm -fopenmp
Your main issue is that you accidentally use nested parallelism:
#pragma omp parallel
{
size_t i, j;
#pragma omp parallel for
Since you are already in a parallel region, your second directive should be
#pragma omp for
Otherwise, since an omp parallel for is equivalent to an omp parallel plus an omp for, you get two nested parallel regions, which is typically bad. Fixing this minor thing gives an ~2x speedup on a similar CPU.
There are several reasons why you are unlikely to get a full 4x speedup, including but not limited to:
Memory bandwidth as a bottleneck
Relative overhead due to the small amount of work done within the parallel loop
Lower clock frequencies with multiple threads in turbo mode
Edit:
By the way, the much more idiomatic way to write your code is the following:
void relax_omp(...) {
#pragma omp parallel for
for (size_t i=0; i<n; i++) {
for (size_t j=0; j<n; j++) {
M[i][j]=min(rowk[j]+colk[i], M[i][j]);
}
}
}
If you declare variables as locally as possible, OpenMP will almost always do the right thing, which in this case means that i and j are private. In general, it is much easier to reason about code written this way.
There could be many reasons for this, the most obvious being that the workload is too small to notice a speedup. The initial workload is 300 ms. I would suggest enclosing this in a serial outer loop that repeats the work at least 20 times; then you start with a serial time of (300 ms * 20) 6 seconds to test with.
The other factor is the availability of parallel cores on the machine you are running on. If your CPU has one core, multithreading will cause a slowdown due to the cost of thread switching. Two logical cores should show some speedup; two physical cores may show close to linear speedup.
Using pragma directives alone also does not guarantee that OpenMP is used: you have to compile with the -fopenmp command-line flag so that the OpenMP runtime is enabled and linked into your program.
Edit
Looking at your code now, the factor that controls the amount of work seems to be n rather than the outer loop. The idea of the outer loop was to artificially increase the amount of work within the same timing period, but that can't be done here since you are solving a specific problem. You can try parallelizing the nested loop as well, but I think n = 800 is too low for parallelization to make a difference.
#pragma omp parallel for private(j) collapse(2)
j needs to be private to each thread running the outer loop, hence private(j); otherwise j is shared across all threads, leading to an incorrect result.
Your loop body is executed 640,000 times, which is not much for modern CPUs clocked at 3 GHz+; try something around n = 5000, which is 25M iterations.
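A sketch of the collapse(2) variant mentioned above (the function name is illustrative; with both indices declared in the for statements they are implicitly private, so no private(j) clause is needed):

#include <algorithm>
#include <cstddef>
#include <vector>

void relax_collapsed(std::vector<std::vector<double>> &M, std::size_t n,
                     std::vector<double> &rowk, std::vector<double> &colk)
{
    // collapse(2) fuses the two perfectly nested loops into one iteration space
    #pragma omp parallel for collapse(2)
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = 0; j < n; j++)
            M[i][j] = std::min(rowk[j] + colk[i], M[i][j]);
}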

Reduction in OpenMP with GMP/ARB matrices

I want to parallelize a program that I have written which calculates a series involving matrix and vector products with the result being a vector. Since arguments become very small and large, I use ARB (based on GMP, MPFR and flint) to prevent Loss of significance.
Also, since the series elements are independent, the matrix dimensions are not big, and the series needs to be evaluated up to 50k elements, it makes sense to have a number of threads each compute some of the elements; e.g. 5 threads could each compute 10k elements in parallel and then add up the resulting vectors.
The issue is now that the ARB function to add up vectors and matrices is not a standard operation that can be used within an OpenMP reduction easily.
When I naively try to write a custom reduction, g++ complains about the void type, since operations in ARB do not have a return value:
void arb_add(arb_t z, const arb_t x, const arb_t y, slong prec)
will calculate z = x + y with a precision of prec bits and store the result in z, but arb_add itself is a void function.
As an example: a random for-loop for a similar problem looks like this (my actual program is different of course)
[...]
arb_mat_t RMa,RMb,RMI,RMP,RMV,RRes;
arb_mat_init(RMa,Nmax,Nmax); //3 matrices
arb_mat_init(RMb,Nmax,Nmax);
arb_mat_init(RMI,Nmax,Nmax);
arb_mat_init(RMV,Nmax,1); // 3 vectors
arb_mat_init(RMP,Nmax,1);
arb_mat_init(RRes,Nmax,1);
[...]
//Result= V + ABV +(AB)^2 V + (AB)^3 V + ...
//in my actual program A and B would be j- and k-dependent and would
//include more matrices and vectors
#pragma omp parallel for collapse(1) private(j)
for(j=0; j<jmax; j++){
arb_mat_one(RMI); //sets the matrix RMI to 1
for(k=0; k<j; k++){
Qmmd(RMI,RMI,RMa,Nmax,prec); //RMI=RMI*RMa
Qmmd(RMI,RMI,RMb,Nmax,prec); //RMI=RMI*RMb
cout << "j=" << j << ", k=" << k << "\r" << flush;
}
Qmvd(RMP,RMI,RMV,Nmax,prec); //RMP=RMI*RMV
arb_mat_add(RRes,RRes,RMP,prec); //Result=Result + RMP
}
[...]
which of course breaks down when using more than 1 thread because I did not specify a reduction on RRes. Here Qmmd() and Qmvd() are self-written matrix-matrix and matrix-vector product functions, and RMa, RMb, and RMV are random matrices and vectors, resp.
The idea is now to reduce RRes, such that each thread can compute a private version of RRes including a fraction of the final result, before adding them all up using arb_mat_add. I could write a function matrixadd(A,B) to compute A=A+B
void matrixadd(arb_mat_t A, arb_mat_t B) {
arb_mat_add(A,A,B,2000);
//A=A+B, the last value is the precision in bits used for that operation
}
and then eventually
#pragma omp declare reduction \
(myadd : void : matrixadd(omp_out, omp_in))
#pragma omp parallel for private(j) reduction(myadd:RRes)
for(j=0; j<jmax; j++){
arb_mat_one(RMI);
for(k=0; k<j; k++){
Qmmd(RMI,RMI,RMa,Nmax,prec);
Qmmd(RMI,RMI,RMb,Nmax,prec);
cout << "j=" << j << ", k=" << k << "\r" << flush;
}
Qmvd(RMP,RMI,RMV,Nmax,prec);
matrixadd(RRes,RMP);
}
Gcc is not happy with this:
main.cpp: In function ‘int main()’:
main.cpp:503:46: error: invalid use of void expression
(myadd : void : matrixadd(omp_out, omp_in))
^
main.cpp:504:114: error: ‘RRes’ has invalid type for ‘reduction’
Can OpenMP understand my void reduction, and can it be made to work with ARB and GMP? If so, how? Thanks!
(Also, my program currently includes a convergence check with a break condition in the j loop. If you also know how to implement such a thing easily, I would be very grateful; for my current OpenMP tests I just removed the break and set a constant jmax.)
My question is very similar to this one.
Edit: Sorry, here is my attempt at a minimal, complete and verifiable example. Additional required packages are arb, flint, gmp, mpfr (available through package managers) and gmpfrxx.
#include <iostream>
#include <omp.h>
#include <cmath>
#include <ctime>
#include <cmath>
#include <gmp.h>
#include "gmpfrxx/gmpfrxx.h"
#include "arb.h"
#include "acb.h"
#include "arb_mat.h"
using namespace std;
void generate_matrixARBdeterministic(arb_mat_t Mat, int N, double w2) //generates some matrix
{
int i,j;
double what;
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
what=(i*j+30/w2)/((1+0.1*w2)*j+20-w2);
arb_set_d(arb_mat_entry(Mat,i,j),what);
}
}
}
void generate_vecARBdeterministic(arb_mat_t Mat, int N) //generates some vector
{
int i;
double what;
for(i=0;i<N;i++)
{
what=(4*i*i+40)/200;
arb_set_d(arb_mat_entry(Mat,i,0),what);
}
}
void Qmmd(arb_mat_t res, arb_mat_t MA, arb_mat_t MB, int NM, slong prec)
{ ///res=M*M=Matrix * Matrix
arb_t Qh1;
arb_mat_t QMh;
arb_init(Qh1);
arb_mat_init(QMh,NM,NM);
for (int i=0; i<NM; i++){
for(int j=0; j<NM; j++){
for (int k=0; k<NM; k++ ) {
arb_mul(Qh1,arb_mat_entry(MA, i, k),arb_mat_entry(MB, k, j),prec);
arb_add(arb_mat_entry(QMh, i, j),arb_mat_entry(QMh, i, j),Qh1,prec);
}
}
}
arb_mat_set(res,QMh);
arb_mat_clear(QMh);
arb_clear(Qh1);
}
void Qmvd(arb_mat_t res, arb_mat_t M, arb_mat_t V, int NM, slong prec) //res=M*V=Matrix * Vector
{ ///res=M*V
arb_t Qh,Qs;
arb_mat_t QMh;
arb_init(Qh);
arb_init(Qs);
arb_mat_init(QMh,NM,1);
arb_set_ui(Qh,0.0);
arb_set_ui(Qs,0.0);
arb_mat_zero(QMh);
for (int i=0; i<NM; i++){
arb_set_ui(Qs,0.0);
for(int j=0; j<NM; j++){
arb_mul(Qh,arb_mat_entry(M, i, j),arb_mat_entry(V, j, 0),prec);
arb_add(Qs,Qs,Qh,prec);
}
arb_set(arb_mat_entry(QMh, i, 0),Qs);
}
arb_mat_set(res,QMh);
arb_mat_clear(QMh);
arb_clear(Qh);
arb_clear(Qs);
}
void QPrintV(arb_mat_t A, int N){ //Prints Vector
for(int i=0;i<N;i++){
cout << arb_get_str(arb_mat_entry(A, i, 0),5,0) << endl; //ARB_STR_NO_RADIUS
}
}
void matrixadd(arb_mat_t A, arb_mat_t B) {
arb_mat_add(A,A,B,2000);
}
int main() {
int Nmax=10,jmax=300; //matrix dimension and max of j-loop
ulong prec=2000; //precision for arb
//initializations
arb_mat_t RMa,RMb,RMI,RMP,RMV,RRes;
arb_mat_init(RMa,Nmax,Nmax);
arb_mat_init(RMb,Nmax,Nmax);
arb_mat_init(RMI,Nmax,Nmax);
arb_mat_init(RMV,Nmax,1);
arb_mat_init(RMP,Nmax,1);
arb_mat_init(RRes,Nmax,1);
omp_set_num_threads(1);
cout << "Maximal threads is " << omp_get_max_threads() << endl;
generate_matrixARBdeterministic(RMa,Nmax,1.0); //generates some Matrix for RMa
arb_mat_set(RMb,RMa); // sets RMb=RMa
generate_vecARBdeterministic(RMV,Nmax); //generates some vector
double st=omp_get_wtime();
Qmmd(RMI,RMa,RMb,Nmax,prec);
int j,k=0;
#pragma omp declare reduction \
(myadd : void : matrixadd(omp_out, omp_in))
#pragma omp parallel for private(j) reduction(myadd:RRes)
for(j=0; j<jmax; j++){
arb_mat_one(RMI);
for(k=0; k<j; k++){
Qmmd(RMI,RMI,RMa,Nmax,prec);
Qmmd(RMI,RMI,RMb,Nmax,prec);
cout << "j=" << j << ", k=" << k << "\r" << flush;
}
Qmvd(RMP,RMI,RMV,Nmax,prec);
matrixadd(RRes,RMP);
}
QPrintV(RRes,Nmax);
double en=omp_get_wtime();
printf("\n Time it took was %lfs\n",en-st);
arb_mat_clear(RMa);
arb_mat_clear(RMb);
arb_mat_clear(RMV);
arb_mat_clear(RMP);
arb_mat_clear(RMI);
arb_mat_clear(RRes);
return 0;
}
and
g++ test.cpp -g -fexceptions -O3 -ltbb -fopenmp -lmpfr -lflint -lgmp -lgmpxx -larb -I../../PersonalLib -std=c++14 -lm -o b.out
You can do the reduction by hand like this
#pragma omp parallel
{
    arb_mat_t RMI, RMP;
    arb_mat_init(RMI,Nmax,Nmax); //allocate thread-private memory
    arb_mat_init(RMP,Nmax,1);    //allocate thread-private memory
    #pragma omp for
    for(int j=0; j<jmax; j++){
        arb_mat_one(RMI);
        for(int k=0; k<j; k++){
            Qmmd(RMI,RMI,RMa,Nmax,prec);
            Qmmd(RMI,RMI,RMb,Nmax,prec);
        }
        Qmvd(RMP,RMI,RMV,Nmax,prec);
        #pragma omp critical
        arb_mat_add(RRes,RRes,RMP,prec); //merge this iteration's result into the shared sum
    }
    arb_mat_clear(RMI); //deallocate memory
    arb_mat_clear(RMP); //deallocate memory
}
If you want to use declare reduction, you need to make a C++ wrapper for arb_mat_t. declare reduction lets OpenMP decide how to do the reduction, but I highly doubt you will find a case where this gives better performance than the manual approach.
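One possible refinement of the manual reduction above, sketched with the same variables as the code above and under the assumption that arb_mat_init zero-initializes the matrix: accumulate into a thread-private partial sum and enter the critical section only once per thread instead of once per loop iteration.

#pragma omp parallel
{
    arb_mat_t RMI, RMP, RSum;            // all thread-private
    arb_mat_init(RMI,Nmax,Nmax);
    arb_mat_init(RMP,Nmax,1);
    arb_mat_init(RSum,Nmax,1);           // assumed to start as the zero vector
    #pragma omp for nowait
    for(int j=0; j<jmax; j++){
        arb_mat_one(RMI);
        for(int k=0; k<j; k++){
            Qmmd(RMI,RMI,RMa,Nmax,prec);
            Qmmd(RMI,RMI,RMb,Nmax,prec);
        }
        Qmvd(RMP,RMI,RMV,Nmax,prec);
        arb_mat_add(RSum,RSum,RMP,prec); // local accumulation, no locking
    }
    #pragma omp critical
    arb_mat_add(RRes,RRes,RSum,prec);    // one merge per thread
    arb_mat_clear(RMI);
    arb_mat_clear(RMP);
    arb_mat_clear(RSum);
}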
You could either create one accumulator matrix per thread to sum into: simply replace matrixadd(RRes, RMP) with matrixadd(RRes[omp_get_thread_num()], RMP) and sum up all the per-thread matrices at the end, where RRes would now be an array of matrices with one entry per thread.
Or you could try to define an addition operator for a wrapper class; of course, you should be careful to avoid copying the entire matrix. This feels like more of a hassle, as you have to be a bit careful with the memory management (since you're using a library, you don't know exactly what it does unless you take the time to go through it all).

Avoid calling omp_get_thread_num() in parallel for loop with simd

What is the performance cost of calling omp_get_thread_num(), compared to looking up the value of a variable?
How can I avoid calling omp_get_thread_num() many times in an OpenMP simd loop?
I can use #pragma omp parallel, but will that make a simd loop?
#include <vector>
#include <omp.h>
int main() {
std::vector<int> a(100);
auto a_size = a.size();
#pragma omp for simd
for (int i = 0; i < a_size; ++i) {
a[i] = omp_get_thread_num();
}
}
I wouldn't be too worried about the cost of the call, but for code clarity you can do:
#include <vector>
#include <omp.h>
int main() {
std::vector<int> a(100);
auto a_size = a.size();
#pragma omp parallel
{
const auto threadId = omp_get_thread_num();
#pragma omp for
for (int i = 0; i < a_size; ++i) {
a[i] = threadId;
}
}
}
As long as you use #pragma omp for (and don't put an extra parallel in there! otherwise each of your n threads will spawn n more threads, which is bad), it will ensure that inside your parallel region the for loop is split up amongst the n threads. Make sure the OpenMP compiler flag is turned on.
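The question also asks about keeping the simd part. As a sketch (not part of the answer above), the worksharing loop inside the parallel region can carry the simd clause as well, and the thread id is still read only once per thread:

#include <cstddef>
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    const std::size_t a_size = a.size();
    #pragma omp parallel
    {
        const int threadId = omp_get_thread_num(); // one call per thread
        #pragma omp for simd
        for (std::size_t i = 0; i < a_size; ++i) {
            a[i] = threadId;
        }
    }
}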

How does pragma and omp make a difference in these two codes producing same output?

Initially the value of ab is 10; then, after some delay created by a for loop, ab is set to 55 and then printed, as in this code:
#include <iostream>
using namespace std;
int main()
{
long j, i;
int ab=10 ;
for(i=0; i<1000000000; i++) ;
ab=55;
cout << "\n----------------\n";
for(j=0; j<100; j++)
cout << endl << ab;
return 0;
}
The purpose of this code is the same, but what I expected was that ab would become 55 only after some delay, so before that the second pragma block should print 10 and then 55 afterwards (multithreading). However, the second pragma block prints only after the delay created by the first for loop, and then it prints only 55.
#include <iostream>
#include <omp.h>
using namespace std;
int main()
{
long j, i;
int ab=10;
omp_set_num_threads(2);
#pragma omp parallel
{
#pragma omp single
{
for(i=0; i<1000000000; i++) ;
ab=55;
}
#pragma omp barrier
cout << "\n----------------\n";
#pragma omp single
{
for(j=0; j<100; j++)
cout << endl << ab;
}
}
return 0;
}
So you want to "observe race conditions" by changing the value of a variable in a first region and printing the value from the second region.
There are a couple of things that prevent you achieving this.
The first (and explicitly stated) is the #pragma omp barrier. This OpenMP statement requests the runtime that threads running the #pragma omp parallel must wait until all threads in the team arrive. This first barrier forces the two threads to be at the barrier, thus at that point ab will have value 55.
The #pragma omp single (and here stated implicitly) contains an implicit `` waitclause, so the team of threads running theparallel region` will wait until this region has finished. Again, this means that ab will have value 55 after the first region has finished.
To try to observe the interleaving you expected (and note the "try": whether you actually see it will vary from run to run, depending on several factors such as OS thread scheduling, OpenMP thread scheduling and the hardware resources available), you can give this alternative version of your code a try:
#include <iostream>
#include <omp.h>
using namespace std;
int main()
{
long j, i;
int ab=10;
omp_set_num_threads(2);
#pragma omp parallel
{
#pragma omp single nowait
{
for(i=0; i<1000000000; i++) ;
ab=55;
}
cout << "\n----------------\n";
#pragma omp single
{
for(j=0; j<100; j++)
cout << endl << ab;
}
}
return 0;
}
BTW, rather than iterating for a long trip-count in your loops, you could use calls such as sleep/usleep.
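For example, a portable C++11 way to create such a delay (just a sketch of the suggestion above) would be:

#include <chrono>
#include <thread>

int main() {
    // Replaces the busy-wait loop: for(i=0; i<1000000000; i++) ;
    std::this_thread::sleep_for(std::chrono::seconds(2));
    return 0;
}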