WLAN Horrible Roaming Performance (N800, OS2008), Software or Hardware Problem ?

Fri Feb 22 16:50:37 EET 2008

On 22 February 2008, Kalle Valo wrote:
> "ext Frantisek Dufka" <dufkaf at seznam.cz> writes:
> > According to my previous tests
> > http://bugs.maemo.org/show_bug.cgi?id=2006#c47
> > I got maximum of 305 udelay(5) loops for reading and 3819 udelay(5)
> > loops for writing so that's 19 miliseconds of busylooping in worst
> > case for writing (if udelay counts in microseconds). So this might be
> > worth the effort.
>
> That's a long busyloop. Of course workqueues create overhead because
> of context switches, but I would guess that to be peanuts compared to
> the long busyloops.

Given that the N800 wlan driver is still rather slow and CPU hungry, adding
lots of context switches may also hurt quite a bit.

By the way, the cx3110x driver reads and writes data in rather small chunks
(I don't remember the exact size right now). The number of context switches
could probably be reduced if it were possible to accumulate many frames in the
wlan chip firmware and read them all in one DMA transfer.

> I would be very interested to see test results comparing all three
> methods: tasklet, workqueue and asynchronous. I would assume that the
> asynchronous method is the best (ie. less disruptive for other
> processes), but I would like to see real numbers to back this up.

I don't know how this relates to the asynchronous method, but we could try to
pipeline the DMA transfers with the processing of the received/transmitted
frames by umac.

When I benchmarked performance, I measured two cases: ~2MB/s bandwidth
transmitting data over McBSP (resulting in ~800KB/s overall performance), and
~2.6MB/s bandwidth transmitting data over McBSP (resulting in ~1MB/s overall
performance, but with the device becoming unstable). So busylooping while
waiting for DMA completion takes only a fraction of the total time.
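
As a rough check of these figures, the share of time the link is actually
moving data can be estimated by dividing the overall rate by the McBSP rate.
The Python sketch below does just that. The function name is made up for
illustration, and it assumes 1MB = 1000KB and that the two rates are
directly comparable.

```python
def link_busy_fraction(link_rate_kb_s, overall_rate_kb_s):
    """Estimated fraction of wall-clock time the link spends moving data.

    Hypothetical helper: assumes the overall rate is simply the link rate
    scaled by how often the link is kept busy.
    """
    return overall_rate_kb_s / link_rate_kb_s


# Figures from the benchmark above (approximate, as measured).
print(round(link_busy_fraction(2000, 800), 2))   # ~2MB/s link, ~800KB/s overall
print(round(link_busy_fraction(2600, 1000), 2))  # ~2.6MB/s link, ~1MB/s overall
```

In both cases the link is busy for roughly 40% of the time, so the rest has
to be spent somewhere else.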

If we assume that the amount of time spent in umac is more or less equal to
the time waiting for DMA transfer, changing the code to do DMA transfer and
process the next frame at the same time may be the best option.

I'll try to show it with an example (Nokia 770). Let's suppose that we need
to transmit 3 frames of data. Right now the cx3110x driver does the following
(in the 'sm_drv_spi_tx' function):

1. Query frame 1 data from umac
2. Start DMA transfer of frame 1 data
3. Wait for DMA completion (busyloop)
4. Query frame 2 data from umac
5. Start DMA transfer of frame 2 data
6. Wait for DMA completion (busyloop)
7. Query frame 3 data from umac
8. Start DMA transfer of frame 3 data
9. Wait for DMA completion (busyloop)

If we change the code to do the following, we may improve performance:

1. Query frame 1 data from umac
2. Start DMA transfer of frame 1 data
3. Query frame 2 data from umac
4. Wait for DMA completion (busyloop)
5. Start DMA transfer of frame 2 data
6. Query frame 3 data from umac
7. Wait for DMA completion (busyloop)
8. Start DMA transfer of frame 3 data
9. Wait for DMA completion (busyloop)
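
To make the difference in loop structure explicit, here is a minimal Python
sketch of both orderings. It is not driver code: FakeDma, query_frame,
tx_sequential and tx_pipelined are hypothetical stand-ins that only record
the order of operations, so the resulting logs can be compared with the two
lists above.

```python
class FakeDma:
    """Stand-in for the DMA engine; it only records what was asked of it."""

    def __init__(self, log):
        self.log = log

    def start(self, frame):
        self.log.append(f"start DMA {frame}")

    def wait(self, frame):
        self.log.append(f"wait DMA {frame}")


def query_frame(log, n):
    """Stand-in for asking umac for the data of frame n."""
    log.append(f"query {n}")
    return n


def tx_sequential(frames):
    """Current ordering: query, start DMA, wait, then move to the next frame."""
    log = []
    dma = FakeDma(log)
    for n in frames:
        frame = query_frame(log, n)
        dma.start(frame)
        dma.wait(frame)
    return log


def tx_pipelined(frames):
    """Proposed ordering: query the next frame while the previous DMA runs."""
    log = []
    dma = FakeDma(log)
    in_flight = None
    for n in frames:
        frame = query_frame(log, n)  # overlaps with the transfer in flight
        if in_flight is not None:
            dma.wait(in_flight)      # previous transfer must finish first
        dma.start(frame)
        in_flight = frame
    if in_flight is not None:
        dma.wait(in_flight)          # drain the last transfer
    return log


for step in tx_pipelined([1, 2, 3]):
    print(step)
```

The only structural change is that the wait for a transfer is moved to just
before the next transfer is started, plus one extra wait after the loop.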

And if stage 3 (Query frame 2 data from umac) takes more time than the DMA
transfer, we will not have to wait for anything at stage 4 at all
(eliminating the need for busylooping or context switching).
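
A simple timing model shows what could be gained. This is a sketch under
strong assumptions: every umac query takes the same time, every DMA transfer
takes the same time, starting a transfer costs nothing, and the CPU and DMA
do not slow each other down. The function names and the microsecond values
are invented for illustration and are not measurements.

```python
def sequential_time(n_frames, t_query, t_dma):
    """Total time when each frame is queried, sent and waited for in turn."""
    return n_frames * (t_query + t_dma)


def pipelined_time(n_frames, t_query, t_dma):
    """Total time when the next query overlaps with the running transfer."""
    if n_frames == 0:
        return 0
    # First query runs alone, the last transfer runs alone, and in between
    # each query/transfer pair costs as much as the slower of the two.
    return t_query + (n_frames - 1) * max(t_query, t_dma) + t_dma


def wait_before_next_start(t_query, t_dma):
    """Time left to busyloop once the overlapping query has finished."""
    return max(0, t_dma - t_query)


# Made-up numbers in microseconds: the query is slower than the transfer.
print(sequential_time(3, 300, 200))      # 1500
print(pipelined_time(3, 300, 200))       # 1100
print(wait_before_next_start(300, 200))  # 0
```

With equal query and transfer times (say 250 each), the model gives 1500 for
the current ordering and 1000 for the pipelined one. The more frames are
queued back to back, the closer the gain gets to a factor of two.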

Something similar can also be tried for receiving data.

Of course, this is just an idea and it needs to be tested. But if it works
(and if all the estimates are correct), it may greatly improve wlan
performance on heavy data transfers. I'll try to do some experiments this
coming weekend.