More on P and NP


English: Euler diagram for P, NP, NP-Complete, and NP-Hard set of problems. (Photo credit: Wikipedia)

From Frank Vega:

 

I wanted to answer one of your comments on your post “Even if P=NP we might see no benefit”, but I found I could no longer do so on that page, maybe due to a problem with my internet connection. I was the person who claimed a possible resolution of the “P versus NP” problem in my paper “P versus UP”, which is published by the IEEE:

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6272487

I have been working on that problem as a hobby since I graduated. I sent a preprint version to arXiv with the intention of finding some kind of collaboration, and I also tried to find help on other blogs. But finally I decided to send an IEEE journal a new version, which differs from the arXiv preprints that I withdrew because they had some errors. Then I waited, and after review the paper was published in an IEEE journal on 17th August 2012.

However, I wrote this paper in Spanish and had the same version in English, so I decided to send it to arXiv again, but they denied me that possibility; therefore, I used a pseudonym. I also uploaded other papers under that name which are not so serious but reflect the work that I’m doing right now, also as a hobby.

I love Computer Science and Math. I’m working right now on a project as important as P versus NP, but I do all this as a way of doing the things I like most, even though my circumstances hardly allow it. I have also tried to work with other scientists who have invited me to collaborate since I published my paper with the IEEE. Indeed, I don’t want to be annoying with my comments; I am just seeking exchange with other people who have the capacity to understand my work, that’s all.

Good Luck

Frank’s website is here: http://the-point-of-view-of-frank.blogspot.com/

Have we reached “peak silicon” and what can we do about it?


Moore’s Law states that the number of transistors that can be squeezed into a given slice of silicon doubles every two years (or 18 months) – something I wrote about recently and where I declared “More transistors means greater speed, more and cheaper memory and so on … ”

Except, maybe not. As the graph below, shamelessly grabbed from Herb Sutter’s “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”, shows, while Moore’s Law (the green line) holds true, the associated improvements we have come to expect to parallel it, such as a similar increase in efficiency per watt (royal blue) and clock speed (navy blue), have not kept up. In short, we can build cheaper chips but they are not necessarily much faster.

Herb Sutter's graph of CPU performance

And, as this article recounts, we are now talking about “dark silicon” – bits of chips that have to remain unpowered while other parts are in use, so as to ensure the whole chip does not fry or fail because of excessive power consumption.

So, if we have reached the point of “peak silicon” what can we do about it?

The chip manufacturers have responded by packing more and more cores into their devices, and that works up to a point – we do not even need much parallel code outside the operating system to take some advantage of that, even on a typical multitasking desktop machine. But none of us are doubling the number of video renderers, MP3 decoders, database queries and spreadsheet calculations we run in parallel every 18 months, so the “Moore paradigm” of computing power doubling in that time will be lost.

A more fundamental alternative is to rewrite our software so that it is inherently designed to take advantage of multicore machines. Writing effective parallel software is not easy, but it can be done for lots of tasks. Even then there are limits: “Amdahl’s law” reminds us that parallelisation will only speed up the parts of a program that can actually be run in parallel. Say we have a piece of code that must run serially and takes 5 seconds, plus code that currently takes 55 seconds but could be made perfectly parallel. With 2 processors the run takes the 5 seconds of serial time plus 27.5 seconds for the parallel code – doubling the processors does not quite halve the run time, giving a 46% saving on the original 60 seconds. Doubling the number of processors again (to 4) cuts total computing time to 18.75 seconds, but the proportional saving over the two-processor run has dropped to 42%. In other words, the “Moore paradigm” also disappears.
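A quick sketch of that arithmetic, using the made-up timings from the example above (these are illustrative figures, not measurements):

#include <stdio.h>

/* Amdahl's law: with a serial portion s and a perfectly parallelisable
   portion p, the run time on n processors is s + p / n. */
static double amdahl_time(double serial, double parallel, int procs)
{
    return serial + parallel / procs;
}

int main(void)
{
    const double serial = 5.0;    /* seconds that cannot be parallelised */
    const double parallel = 55.0; /* seconds that parallelise perfectly */
    double previous = amdahl_time(serial, parallel, 1); /* 60 seconds */

    for (int procs = 2; procs <= 8; procs *= 2) {
        double now = amdahl_time(serial, parallel, procs);
        printf("%d processors: %.2f seconds (%.0f%% saving on %.2f)\n",
               procs, now, 100.0 * (previous - now) / previous, previous);
        previous = now;
    }
    return 0;
}

Each doubling of processors buys a smaller proportional gain – 46%, then 42%, then 37% – which is exactly the point.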

The third thing we can do is look for better algorithms: the recent announcement of a vastly improved fast Fourier transform (FFT) algorithm shows what can be done here – algorithmic improvement can vastly outstrip hardware speedup. But for many problems (the hard problems in NP) no efficient algorithm is currently known, and computing power ends up being dedicated simply to churning through the possible solutions looking for one that works – we do not know how to find a solution to such a problem quickly, but once a candidate solution is produced we can verify it ‘easily’. Assuming, as most mathematicians are said to do, that P does not equal NP (ie that there is no yet-to-be-discovered efficient algorithm that cracks these problems), this at least means that “peak silicon” will keep internet commerce safe for the foreseeable future, but it is bad news in lots of other ways.
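To make the “easy to verify, hard to find” distinction concrete, here is a small illustrative sketch of my own using subset sum, a classic NP-complete problem: checking a proposed answer is one linear pass over the items, while the obvious search tries every one of the 2^n subsets.

#include <stdio.h>
#include <stdint.h>

/* Verify a certificate: does the chosen subset (given as a bitmask)
   really add up to the target? Linear in the number of items. */
static int verify(const int *items, int n, uint32_t subset, int target)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (subset & (1u << i))
            sum += items[i];
    return sum == target;
}

/* Find a certificate the obvious way: try all 2^n subsets (assumes n < 32). */
static int search(const int *items, int n, int target, uint32_t *found)
{
    for (uint32_t subset = 0; subset < (1u << n); subset++) {
        if (verify(items, n, subset, target)) {
            *found = subset;
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    int items[] = { 3, 34, 4, 12, 5, 2 };
    int n = 6, target = 9;
    uint32_t certificate;

    if (search(items, n, target, &certificate))
        printf("subset mask 0x%x sums to %d\n", certificate, target);
    return 0;
}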

There is a fourth option, of course, which is to get better physics – either for silicon fabrication, quantum computing or some other physics-based innovation. Right now, though, these are probably still the least likely options, but as the links below show, lots of people are working on them.

In continued praise of “Programming Pearls”


Algorithm City

I have read two books in the last year that have fundamentally changed the way I think about computers and programming – The Annotated Turing and Programming Pearls.

Programming Pearls in particular continues to inspire me in the way it makes you think about building better algorithms and making better use of data structures – and here’s a short tale about what I have just done (and why I did it) with my valext program.

Valext runs a program under step tracing to allow the detailed examination of patterns of memory use – interrogating the /proc/pid/pagemap at every step.

The way I was doing this was to query /proc/pid/maps, find the address ranges that were being mapped, and then seek through /proc/pid/pagemap, reading off the status of each page.
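For anyone who has not poked at the pagemap interface, here is a minimal stand-alone sketch of the underlying mechanism (an illustration of what valext relies on, not valext’s own code): every virtual page has an 8-byte entry in /proc/pid/pagemap, indexed by virtual page number, and bit 63 of that entry says whether the page is currently present in RAM.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Look up the pagemap entry for one virtual address in process 'pid'.
   Each virtual page has an 8-byte entry at offset (vaddr / pagesize) * 8;
   bit 63 is set if the page is present in physical memory.
   Returns 1 if present, 0 if not, -1 on error. */
static int page_present(pid_t pid, uint64_t vaddr)
{
    char path[64];
    uint64_t entry;
    long pagesize = sysconf(_SC_PAGESIZE);
    int fd;

    snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (pread(fd, &entry, sizeof(entry),
              (off_t)(vaddr / (uint64_t)pagesize) * sizeof(entry))
            != (ssize_t)sizeof(entry)) {
        close(fd);
        return -1;
    }
    close(fd);
    return (int)((entry >> 63) & 1);
}

int main(void)
{
    int probe = 0; /* a variable on our own stack, so certainly mapped */
    printf("stack page present: %d\n",
           page_present(getpid(), (uint64_t)(uintptr_t)&probe));
    return 0;
}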

For every page that was mapped I would build a data structure and stick it into a linked list

/* the old approach: one heap allocation per mapped page, each node
   appended to the end of a singly linked list */
struct blocklist {
    uint64_t address;
    struct blocklist *nextblock;
};
...
/* walk every page in the mapped region and append a node for it */
for (i = startaddr; i < endaddr; i++)
{
    block = malloc(sizeof (struct blocklist));
    block->address = i;
    block->nextblock = NULL;
    if (*lastblock == NULL){
        *lastblock = block;
        *header = block;
    } else {
        (*lastblock)->nextblock = block;
        *lastblock = block;
    }
}

But the problem with this code – an \mathcal{O}(n) algorithm – is that performance stinks. I was running it on a very underpowered laptop and got to about four days of wallclock time, in which time about 2 million steps had been taken.

So today I rewrote it with an essentially \mathcal{O}(1) approach – I allocate an array of fixed size (512 elements in the current code) and fill that up, with a 0 value acting as the guard (ie marking the end). If more than 512 elements are needed then another block of 512 is allocated and chained on, and so on…


/* excerpted from valext: the headers and constants below are what this
   fragment needs to stand alone (PAGESHIFT of 12 assumes 4096-byte pages) */
#include <regex.h>
#include <stdint.h>
#include <stdlib.h>

#define MEMBLOCK  512   /* page numbers stored in each block of the chain */
#define PAGESHIFT 12    /* log2 of the page size */

/* a chain of fixed-size arrays holding page numbers */
struct blockchain {
    int size;
    uint64_t *head;
    struct blockchain *tail;
};

uint64_t* blockalloc(int size)
{
    uint64_t *buf = calloc(size, sizeof(uint64_t));
    return buf;
}

struct blockchain *newchain(int size)
{
    struct blockchain *chain = NULL;
    chain = calloc(1, sizeof(struct blockchain));
    if (!chain)
        return NULL;
    chain->head = blockalloc(size);
    if (chain->head == NULL) {
        free(chain);
        return NULL;
    }
    chain->size = size;
    return chain;
}

/* recursively free the list */
void cleanchain(struct blockchain *chain)
{
    if (!chain)
        return;
    cleanchain(chain->tail);
    free(chain->head);
    free(chain);
    return;
}

/* parse one line of /proc/pid/maps and append the page numbers of the
   mapped range to the chain, growing the chain as needed */
struct blockchain* getnextblock(struct blockchain** chain,
                                struct blockchain** header, char* buf)
{
    int match, t = 0;
    uint64_t startaddr;
    uint64_t endaddr;
    uint64_t i;
    const char* pattern;
    regex_t reg;
    regmatch_t addresses[3];

    pattern = "^([0-9a-f]+)-([0-9a-f]+)";
    if (regcomp(&reg, pattern, REG_EXTENDED) != 0)
        goto ret;
    match = regexec(&reg, buf, (size_t)3, addresses, 0);
    if (match == REG_NOMATCH || match == REG_ESPACE)
         goto cleanup;
    startaddr = strtoul(&buf[addresses[1].rm_so], NULL, 16) >> PAGESHIFT;
    endaddr = strtoul(&buf[addresses[2].rm_so], NULL, 16) >> PAGESHIFT;
    if (*chain == NULL) {
        *chain = newchain(MEMBLOCK);
        if (*chain == NULL)
            goto cleanup;
        *header = *chain;
    }
    /* resume filling the current block at its first unused (zero) slot so
       that pages recorded by earlier calls are not overwritten */
    while (t < MEMBLOCK && (*chain)->head[t] != 0)
        t++;
    for (i = startaddr; i < endaddr; i++)
    {
        if (t >= MEMBLOCK) {
            /* this block is full - allocate another and chain it on */
            struct blockchain *nxtchain = newchain(MEMBLOCK);
            if (!nxtchain)
                goto cleanup;
            (*chain)->tail = nxtchain;
            *chain = nxtchain;
            t = 0;
        }
        (*chain)->head[t] = i;
        t++;
    }
cleanup:
    regfree(&reg);
ret:
    return *chain;
}
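For completeness, here is a rough sketch of how the chain might be driven and read back. This is not the actual valext driver code – walkpages and its printf output are my own hypothetical additions built on the definitions above – but it shows the intent: feed each line of /proc/pid/maps to getnextblock(), then walk the chain, treating a 0 entry as the guard marking the end of a partly filled block.

#include <stdio.h>   /* in addition to the headers above */

/* hypothetical driver: build the chain from /proc/<pid>/maps, then walk it */
void walkpages(int pid)
{
    char path[64], line[256];
    struct blockchain *chain = NULL, *header = NULL;
    struct blockchain *c;
    int t;
    FILE *maps;

    snprintf(path, sizeof(path), "/proc/%d/maps", pid);
    maps = fopen(path, "r");
    if (!maps)
        return;
    while (fgets(line, sizeof(line), maps))
        getnextblock(&chain, &header, line);
    fclose(maps);

    /* each block holds up to MEMBLOCK page numbers; a 0 entry is the
       guard marking the end of the last, partly filled block */
    for (c = header; c != NULL; c = c->tail)
        for (t = 0; t < MEMBLOCK && c->head[t] != 0; t++)
            printf("page %llx\n", (unsigned long long)c->head[t]);

    cleanchain(header);
}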

As I am only testing this code now I don’t know whether it will be faster, though everything tells me it should be (I am not entirely clear whether memory allocation really is the bottleneck in the program, or whether a Pentium III is simply a Pentium III and there is not much I can do about that).

What I do know is that if it was not for the inspiration of Programming Pearls I probably would not even have bothered trying.