Puzzle about an M/G/1 queue

I am deeply puzzled by a question about the behaviour of an M/G/1 queue – i.e., a queue with a Markovian distribution of arrival times, a General distribution of service times and 1 server. I have asked about this on the Math Stackexchange (and there’s now a bounty on the question if you’d like to answer it there – but as I am getting nowhere with it, I thought I’d ask it here too.

(This is related to getting a more rigorous presentation on thrashing into my PhD thesis.)

Considering an M/G/1 queue with Poisson arrivals of rate λ – this comes from Cox and Miller’s (1965) “The Theory of Stochastic Processes” (pp 240 – 241) and also Cox and Isham’s 1986 paper “The Virtual Waiting-Time and Related Processes“.

My question is what is the difference between (using the authors’ notation) p_0 and p(0,t)? The context is explained below…

In the 1965 book (the 1986 paper presents the differentials of the same equations), X(t) is the “virtual waiting time” of a process and the book writes of “a discrete probability p_0(t) that X(t)=0, i.e., that the system is empty, and a density p(x,t) for X(t)>0“.

The system consumes virtual waiting time in unit time, i.e., if X(t)\leq\Delta t and there are no arrivals in time \Delta t then X(t + \Delta t) = 0.

The distribution function of X(t) is then given by:

They then state:
p(x, t+ \Delta t) = p(x + \Delta t, t)(1 - \lambda \Delta t) +p_0(t)b(x)\lambda\Delta t + \int_{0}^{x}p(x - y, t)b(y)dy\lambda\Delta t + o(\Delta t)

I get all this – the first term on the RHS is a run-down of X(t)>0 with no arrivals, the second is adding b(x) of service time when the system is empty at t and the third, convolution-like, term is adding b(y) of service time from an arrival when it’s not empty at t. (The fourth accounts for their being more than one arrival in \Delta t but it tends to zero much faster than \Delta t so drops out as \Delta t approaches the limit.)

And … and this is where I have the problems …

p_0(t+\Delta t)=p_0(t)(1-\lambda\Delta t) +p(0,t)\Delta t(1 - \lambda\Delta t) + o(\Delta t)

The first term of the RHS seems clear – the probability that the system is empty at t multiplied by the probability there will be no arrivals in \Delta t, but the second is not clear to me at all.

I assume this term accounts for the probability of the system “emptying” during \Delta t but I don’t see how that works, is anyone able to explain?

In other words, how does p(0,t)\Delta t(1 - \lambda\Delta t) represent this draining? Presumably (1 - \lambda\Delta t) again represents the possibility of zero arrivals in \Delta t, so how does p(0, t)\Delta t represent the X(t) \leq \Delta t situation?

If we take the equilibrium situation where p_0(t) = p_0 and p(x, t) = p(x) then, if we differentiate and as p^{\prime}_0 = 0, we get p_0 = \lambda p(0) – so, again, what does p(0) represent?

A very first draft…

Our proposal here is to meet this need to use a memory space much greater than the locally available low-latency storage head on. The problems faced by programmers of today’s many- and multi-core devices are, in a sense, far from new: in the 1960s an earlier generation of programmers, experimenting with virtual memory and time-sharing computers, and using programs generated by the first generation of high-level languages were struck down by the phenomenon of thrashing, where computing time was lost to excessive memory management [23]. The solutions we propose here draw on the lessons learned at that time, but significantly adapt the methods to account for the realities of using multicore devices with very limited local storage.

We propose to enable virtual memory [24] for such systems, using a novel form of paging which, instead of transferring whole pages across the memory hierarchy, concentrates on a minimal transfer of partial pages [50]. Here hard faults still impose a significant penalty of lost cycles but the cost of paging in and out of memory is amortised across multiple memory reads and writes. This avoids unnecessary transfer and limits congestion in the memory interconnect, which we find to be a significant determinant of program performance.

One of the lessons of the 1960s was that maintaining a program’s working set in low latency memory was the most effective way of limiting paging and thrashing [22]. Much research at that time and into the 1970s and later concentrated on effective ways to maintain the working set of pages in fast memory and innovation continued into the 1990s (and beyond), for instance with the 2Q algorithm in Linux [38]. Here, though, we establish that better performance in embedded and real-time manycore systems is unlikely to come from a better page replacement algorithm but from a reorientation towards better management of the internals of the pages.

General computing page replacement algorithms are generally robust [49] and with the decreasing cost of memory, much recent research has concentrated on using large page sizes, not least to minimise TLB misses [9], we return to earlier research findings which emphasise the efficiency of small pages on small systems and we show that combining this with a FIFO page replacement policy may, perhaps counter-intuitively, deliver more time predictable or even faster performance than attempts to use some version of a least-recently used policy, though we also show that choice of page replacement algorithm is likely to be strongly related to the variation in working set size over a program’s lifecycle.

Concerns about power have always been important for embedded devices, but changing characteristics of hardware mean new issues such as “dark silicon” [28] will be important as we move deeper into the coming manycore era. We demonstrate here that constraints in power may be, to some extent, mitigated by a loosening of bottlenecks elsewhere in a realtime system, particularly in the memory interconnect: in a system where power is no constraint then congestion here dominates performance, where power concerns limit processor activities they may also lessen congestion.

Further we demonstrate results that some frequency scaling in manycore systems may not limit performance if it contributes to lessening congestion in the memory interconnect: results that suggest that the dark silicon issue may not be as big a barrier to performance in real-world applications as might be expected.

A plot that helps explain why the desktop computer industry is in trouble

Desktop computer sales are falling.

Desktop (and laptop) machines are just not improving in speed at the same rate as they have historically, and so sales are on the slide – there is no need to buy a faster machine when the one you have already is not much slower than an expensive replacement.

(I am writing this on a six year old laptop – it is, relatively speaking, quite slow, but still not so slow that I have been bothered to buy a faster machine.)

This lack of increased speed up is because, although “Moore’s Law” of doubling the number of transistors on a piece of silicon still applies, the related effect of “Dennard Scaling” – which kept the power needed by those transistors under control – has broken down.

The solution for this ought to be to use more but less energy-hungry, chips (CPUs). If we can get our software to efficiently use an array of not-so-fast chips rather than a single fast chip then we can still see faster computers. This is why computers (an increasingly mobile phones, because they too will soon be in trouble) are now sold as “multicore” devices.

But this has its limits. Unless we can find better ways for these chips to co-operate in software then current hardware models limit the ability to speed machines up.

And that, sort of, is where my chart comes in.

slowWhat the chart shows is that, using current software designs, adding more processors to tackling a problem can actually decrease the speed at which the problem is processed, and certainly shows no sign of speeding it up.

The chart – not to be taken as gospel as the data is very rough and ready and not properly normalized – shows how many instructions (in a simulator) an array of processors will execute per million or so processing cycles – essentially 10 means100,000), 20 means 200,000 and so on.

But the piece of information the chart does not show is that as we move from left to right the number of processors being applied to the problem is being increased and yet the number of instructions being executed is either more or less constant or even – as the first few processors are added – falling. After about 29,000,000,000 ticks the number of processors being applied begins to fall (from a peak of 16 to 11 at the far right) but again the number of instructions being processed is close to constant on average (the red line).

This isn’t because the code is hopelessly inefficient (for instance the processors are added or removed as new threads of execution come on-stream or finish) but because they compete with each other to access the available memory. This is an example of what computer scientists call “thrashing” – as the cores force each others’ preferred pieces of memory out of the system in an effort execute their own code (and this takes time), only to have the same thing done to them. Processors spend far more time waiting for their code and data to be fetched into memory than they do actually executing it.

Unless and until this fundamental problem is fixed or bypassed then the desktop computing industry will be in recession.

Thrash reduction no longer a priority for Linux kernel devs?

Version 3.5 of the Linux kernel has been released.

freshly installed ipod linux, booting. during ...
freshly installed ipod linux, booting. during Wikipedia:Workshop Köln (Photo credit: Wikipedia)

One of the changes it includes is the removal of the “swap token” code – one of the very few ‘local’ memory management policies grafted on to the ‘global’ page replacement mechanisms in the kernel.

There are various technical reasons offered for the removal of the code – on which I am not qualified to comment – but the borrow line is that it was broken in any case, so the removal seems to make sense.

What does slightly disturb me, though, is the comment that Rik Van Riel, the key figure in kernel memory management code, makes:

The days of sub-1G memory systems with heavy use of swap are over.
If we ever need thrashing reducing code in the future, we will have to
implement something that does scale.

I think the days of sub-1G systems are far from over. In fact I suspect there are more of them, and more of them running Linux, than ever before and that trend is not going to stop.

He’s right of course about the need to find that code that works – my own efforts (in my MSc report) didn’t crack this problem, but I do think there is more that can be done.

Compile time shortens exponentially with increased memory

Well, it did not take me too long to work out how to do something useful with R and, indeed, get a really interesting result (see graph below):

Linux compile timesThere are a fair few missing data points here – experiments take time – but already you can see that there appears to be a fairly clear relationship between the time it takes to compile a Linux kernel and the available memory as shown by what looks like an exponential shortening of compile times as memory increases.