| parameter | name | typical value |
|---|---|---|
| number of stations | N_a | 64 |
| number of beams | N_b | 400 |
| samples per second | bw | 875e6 |
| number of channels | N_c | 4096 |
| decimation | N_d | 10 |
New delays (geometric and atmospheric) have to be computed at most a few times per second (in reality probably only once every few seconds). If this is done once per second, the number of operations per second is of the order of N_a*N_b*N_c, which is typically 10^8. Compared to the other demands this is very low and therefore not of concern at the moment.
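As a quick sanity check, here is a small Python sketch of this count (parameter names mirror the table above; the once-per-second update rate is the assumption stated here):

```python
# Delay/phase updates per second, assuming one recomputation per second
# for every (station, beam, channel) combination.
N_a = 64     # stations
N_b = 400    # beams
N_c = 4096   # channels

updates_per_sec = N_a * N_b * N_c
print(f"{updates_per_sec:.1e} delay updates per second")  # ~1.0e8
```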
This is the dominant part of the computation. Each (complex) sample has to be multiplied by its complex factor, which corresponds to 4 real multiplications and 2 real additions (one of which is actually a subtraction). Each beam has two polarisations, so that we have 12*N_a*N_b*bw floating-point operations per second in total. This is typically 269 TFLOPS.
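The same estimate as a sketch (n_pol is just an illustrative name for the two polarisations):

```python
# Phase correction: one complex multiplication (4 MUL + 2 ADD = 6 real ops)
# per station, beam, polarisation and sample.
N_a, N_b, bw = 64, 400, 875e6
n_pol = 2
flops = 6 * n_pol * N_a * N_b * bw
print(f"phase correction: {flops / 1e12:.0f} TFLOPS")  # ~269 TFLOPS
```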
Summing up the stations now "only" takes 4*N_a*N_b*bw floating-point operations per second (the factor 4 accounts for two polarisations and complex beams). This is typically 90 TFLOPS, but these operations can be combined with the phase corrections using fused multiply/accumulate operations.
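Correspondingly, a sketch under the same conventions:

```python
# Coherent station sum: one complex addition (2 real ops) per station, beam,
# polarisation and sample -- this is the part a fused multiply-add can absorb.
N_a, N_b, bw = 64, 400, 875e6
n_pol = 2
flops = 2 * n_pol * N_a * N_b * bw
print(f"coherent sum: {flops / 1e12:.0f} TFLOPS")  # ~90 TFLOPS
```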
Incoherent sums will also be needed, e.g. to subtract the autocorrelations from the coherently added beams. Each complex voltage has to be "detected" to form a power, which needs (2 MUL + 1 ADD)*2*N_a*bw = 6*N_a*bw operations, or typically 0.34 TFLOPS (if both polarisations are kept separate). This is entirely insignificant.
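In numbers (again only a sketch):

```python
# Incoherent ("detected") beams: |x|^2 per station, polarisation and sample
# costs 2 MUL + 1 ADD = 3 real ops.
N_a, bw = 64, 875e6
n_pol = 2
flops = 3 * n_pol * N_a * bw
print(f"incoherent detection: {flops / 1e12:.2f} TFLOPS")  # ~0.34 TFLOPS
```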
The beams can be averaged in time and/or frequency, depending on the setup. In principle this does not require many additional operations, because the additions can be combined with the previous ones. Otherwise we basically have to add all the data again, which means 4*N_a*N_b*bw = 90 TFLOPS of additional operations (factor 4 for two polarisations and complex values).
For the phase correction and the coherent and incoherent sums, and assuming that fused operations can be used, we will need a total of about 270 TFLOPS. With additional averaging (if it cannot be combined with previous operations in an efficient way) we need at most 360 TFLOPS.
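A small script that collects the terms above (illustrative only; the variable names are mine, the breakdown follows this note):

```python
# Total beam-former budget, same conventions as the sketches above.
N_a, N_b, bw = 64, 400, 875e6
n_pol = 2

phase_corr = 6 * n_pol * N_a * N_b * bw   # complex multiply per station/beam/sample
coherent   = 2 * n_pol * N_a * N_b * bw   # absorbed into FMAs in the best case
detection  = 3 * n_pol * N_a * bw         # incoherent beams
averaging  = 2 * n_pol * N_a * N_b * bw   # only if it cannot be folded into earlier sums

with_fma   = phase_corr + detection       # coherent sum fused away
worst_case = with_fma + averaging
print(f"with FMAs: {with_fma / 1e12:.0f} TFLOPS, "
      f"with separate averaging: {worst_case / 1e12:.0f} TFLOPS")  # ~270 and ~360
```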
There are good reasons for having our own correlations, so let us see if this is feasible. The number of baselines (including auto-correlations) is N_a*(N_a+1)/2, which we approximate as N_a^2/2 = 2048 (the exact number is 2080).
Computing all these complex visibilities requires just as many complex multiplications per polarisation product (of which there are four), each of which corresponds to 4 real multiplications and 2 real additions. This means 3*N_a^2*bw operations per second, or 11 TFLOPS per polarisation product, or 12*N_a^2*bw = 43 TFLOPS for all four. Accumulating them can be done with fused operations; otherwise we would need an additional 4*N_a^2*bw = 14 TFLOPS for full polarisations.
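As a sketch of the correlator budget (n_polprod is just an illustrative name for the four polarisation products):

```python
# Correlator load: one complex multiply-accumulate per baseline, polarisation
# product and sample.
N_a, bw = 64, 875e6
n_baselines = N_a * (N_a + 1) // 2         # 2080, approximated as N_a**2 / 2 in the text
n_polprod = 4

mult  = 6 * n_polprod * (N_a**2 / 2) * bw  # complex multiplications, ~43 TFLOPS
accum = 2 * n_polprod * (N_a**2 / 2) * bw  # accumulation if it cannot be fused, ~14 TFLOPS
print(f"visibilities: {mult / 1e12:.0f} TFLOPS "
      f"(+{accum / 1e12:.0f} TFLOPS without FMAs), {n_baselines} baselines")
```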
Accumulation in time and/or frequency needs additional operations, but these are difficult to estimate at this moment.
The demands of the correlations are so far below those of the direct beam-forming that we should try to account for them in our hardware.
With L = 1 (8) km baselines and D = 13.5 m dishes, we need of the order of (L/D)^2 = 5500 (350 000) beams to cover the primary beam. The N_b = 400 beams will thus not cover the entire area.
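A quick check of these numbers:

```python
# Number of beams needed to tile the primary beam, of order (L/D)**2.
D = 13.5               # dish diameter in metres
for L in (1e3, 8e3):   # maximum baseline in metres
    print(f"L = {L / 1e3:.0f} km: ~{(L / D)**2:.0f} beams")
# L = 1 km: ~5487 beams; L = 8 km: ~351166 beams
```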
We can form additional beams from visibilities. For this we assume that we have only one polarisation product (typically Stokes I) and a decimation factor in time/frequency of N_d. Per (decimated) sample, each beam requires the addition of N_a^2/2 visibilities after applying phase factors (4 MUL and 2 ADD). This means 3*N_a^2*N_b*bw/N_d operations. With a decimation of N_d=10 and N_b=400 beams, this means 430 TFLOPS, not much more than the direct beam-forming. But now we can trade resolution in time/frequency against the number of beams. Going from 50 microsec time resolution to 500 microsec boosts the number of beams to 4000. At higher frequencies (e.g. for GC searches) we can average more in frequency without losing any science. This sounds realistic, provided we can search all these beams. But they do not have to be searched in real time. Still, they may have to go into the switch again, which may limit us.
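A sketch of this cost and of the trade-off between decimation and number of beams (the helper function tflops is mine, purely for illustration):

```python
# Beam-forming from visibilities: per beam and decimated sample, N_a**2 / 2
# phase rotations (6 real ops each), with the summation folded into FMAs.
N_a, N_b, bw, N_d = 64, 400, 875e6, 10

def tflops(n_beams, decimation):
    return 6 * (N_a**2 / 2) * n_beams * bw / decimation / 1e12

print(f"{N_b} beams at N_d = {N_d}: {tflops(N_b, N_d):.0f} TFLOPS")                      # ~430
print(f"{10 * N_b} beams at N_d = {10 * N_d}: {tflops(10 * N_b, 10 * N_d):.0f} TFLOPS")  # still ~430
```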
Also we must not forget that decimation in time and/or frequency reduces our field of view because of bandwidth smearing and time-averaging smearing!
An FFT can be used for this "imaging" step, but this makes everything much more complicated, because visibilities would have to be gridded, which is probably not very efficient on GPUs.