Reviving this blog … with a question about binary trees


Work changes and a determination to actually finish my PhD mean I really should make a bit of an effort here and so I will.

Here is a puzzle about binary trees that has been bothering me – it comes from my PhD research…

In that research I am investigating how to implement virtual memory on a many-core Network-on-Chip (NoC) system. Essentially I have been building and running emulators.

The core of the emulation is the “memory tree” – the connection between the cores of the NoC and the global memory controller. The “Bluetree” memory tree is a binary tree arrangement, with the root node (level 0) connecting to the memory controller and the leaves at the NoC tiles (with each tile having a single processor).

At every non-leaf level of the tree there are multiplexors (muxes), each with a buffer that can hold a single memory packet – so if two requests for global memory arrive at once and the buffer is free, there has to be an arbitration process to decide which packet gets to occupy the buffer and which has to wait.

We have 128 leaf nodes – as in this tree…

Binary tree with 128 leaf nodes

With this many leaf nodes, the question of the arbitration policy of the muxes is important. A naïve approach would be to say: if two packets contest for the buffer, let the packet on the left win the first time and pass the packet on the right the next time – after all, this guarantees no starvation.

But it is actually a hopeless policy … the leftmost leaf node (we shall number from zero) faces no opposition, while the rightmost (number 127) loses out to every other node.

But the number of a leaf node alone is not a sufficient way of describing the relative amount of blocking it faces – for instance, leaf node 16 appears to be in a better place than leaf node 15. Going up the tree, leaf node 15 arrives on the right-hand input – and so can be blocked – at level 6 (the root node is at level 0), at level 5, level 4 and level 3, while leaf node 16 is only blocked at level 2: in binary, 15 is 0001111 and 16 is 0010000, and each 1 bit marks a level at which the node enters a mux from the right.
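
To make that concrete, here is a small sketch (not the emulator code) that lists, for each leaf, the mux levels at which it arrives on the right-hand input and so loses the first round of arbitration under the naïve "left wins first" policy – the 1 bits of the leaf number mark exactly those levels:

    // Small sketch (not the emulator): for a 128-leaf tree with the root mux
    // at level 0 and the leaves at level 7, list the mux levels at which each
    // leaf arrives on the right-hand input and so loses the first round of
    // arbitration under the naive "left wins first" policy.
    #include <cstdio>

    int main()
    {
        const int muxLevels = 7;                  // muxes at levels 0..6
        for (int leaf = 0; leaf < 128; ++leaf) {
            std::printf("leaf %3d blocked at levels:", leaf);
            for (int level = 0; level < muxLevels; ++level) {
                // Bit (6 - level) of the leaf number selects the input used at
                // that level; a 1 means the packet enters from the right.
                if (leaf & (1 << (muxLevels - 1 - level)))
                    std::printf(" %d", level);
            }
            std::printf("\n");
        }
        return 0;
    }

On that reading, one candidate ordering is by how many levels a leaf can be blocked at, and then by how close to the root those levels sit – but whether that fully explains the variation in the test results that follow is exactly the open question.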

Practical testing of this naïvely fair approach in the emulator gives results like the following… as can be seen, the right-hand part of the tree performs appallingly, but even there lots of variation can be seen:

some test results

My question is this: what is the correct (i.e., correct mathematically) way of describing the order of the leaf nodes?


Reaching a decision


Distributed Memory (Photo credit: Wikipedia)

A week further on and not much C++ has been written – and now I think I need to make a new start.

Up to this point I have been trying to write a software model of the hardware, with the thought that I could then put a software-modelling layer on top of that. But that simply is not going to work – it is just too complex.

Instead I am going to have to make some policy decisions in the software – essentially over how I model the local memory on the chip: each tile will process memory reads and writes and needs to know where that memory is – it could be in the global off-chip memory store or it could be on-chip.

The difference matters because, at least in theory, the on-chip memory is speedily accessible, while the off-chip memory is anywhere from 50 to 500 times “further away”. Because memory accesses exhibit locality it makes sense to ship blocks of addressed memory from the global to the local store – but doing so takes time, and if there are a lot of memory movements then we get thrashing.

What I now have to do is think of what policy I will use to decide what memory gets stored locally (or, more likely, what policy I use to map the addresses). I’ll start by once again reviewing papers that propose some schemes for existing Networks-on-Chip.

In other news: I have had a paper (of which I am co-author and first named author) accepted by OSPERTS 15 – so I will be off to Sweden to get mauled by the audience there in early July. It will be an experience, and I am looking forward to it, but I also think it might be not so much a baptism, but a destruction by fire.

Further thoughts on the simulation task


A Motorola 68451 MMU (for the 68010) (Photo credit: Wikipedia)

Lying in bed this morning and puzzling over what to do …

At first I thought what I should do is copy one of the existing operating system models for NoCs, but that simply would not be flexible enough.

What I have to do is model the hardware (including the modifications to the MMU I want to see) as, essentially, some form of black box, and build other layers – including the memory tree – above that. That means I need to separate the CPU/tiles from global memory: sounds simple in theory but implementing this is going to be very far from easy.

How much memory do you need?


What’s the best way to speed up your computing experience? As many know the most cost-effective way is often not to buy a new computer with a faster processor but to add more memory to an existing system.

The plot below, based on some results from my PhD research, shows just how this works…

In my case I am measuring how long it takes a processor of a fixed speed to execute a given program while I vary the amount of memory available.

My research centres on “network-on-chip” systems, so the issue here is how much memory is available locally (i.e., on the chip), because if instructions and data are not locally available they have to be fetched in from some more “distant” store (whether that is system global memory or even a disk system). And if space on the chip is limited then you generally have to evict some other piece of memory to make way for the piece needed immediately. Yet if you need the evicted piece again in the future you have to take the time needed to fetch it back in, and so on.

In this case we are using 4096 byte (4K) blocks of memory – i.e., 4K pages. So when it says 10 pages on the x-axis that means there is 40K available locally and so on.

I am testing all this on the OVPsim emulator and I have no independent means of accurately timing how long the process takes – but I can count the number of instructions it takes to complete the task.

Factors that affect the number of instructions taken are – in order of likely importance:

  • the time taken to fetch whole pages in and out of memory – a “hard fault” occurs when a page has to be evicted and a new page brought in instead (the earliest hard faults are not as slow as the later ones, because until the full complement of pages is in use no eviction is required);
  • the time it takes to handle pages that are in memory but not immediately available for execution, because the number of translation mappings that tell the central processor where they are is limited – to 8 in this test case – and so for each fault we first have to test whether the page is present but unmapped. If an unmapped page is present we have a “soft fault”, which is much quicker to process than a hard fault as no eviction etc. is required;
  • picking the next page that will be evicted if we have a hard fault – in this case we aim to pick the page that was least recently used, although even then we have to use an approximation. This process is not the same as evicting the page – it merely indicates which one is the best candidate for eviction should that become necessary (a sketch of this per-access decision follows the list).
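
For what it is worth, the shape of the per-access decision looks something like the sketch below – the containers and the way a mapping slot is recycled are placeholders of mine for illustration, not the test harness’s actual structures:

    // Sketch of the per-access decision described above: hit, soft fault or
    // hard fault. The containers and the way a mapping slot is recycled are
    // placeholders for illustration, not the test harness's real structures.
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    enum class Outcome { Hit, SoftFault, HardFault };

    struct TileMemory {
        static constexpr uint64_t kPageSize = 4096;
        static constexpr std::size_t kMappings = 8;      // translation slots in this test

        std::unordered_map<uint64_t, uint64_t> mapped;   // page number -> local frame
        std::unordered_set<uint64_t> resident;           // pages currently held locally

        Outcome access(uint64_t address)
        {
            const uint64_t page = address / kPageSize;
            if (mapped.count(page))
                return Outcome::Hit;                     // mapped and resident
            if (resident.count(page)) {
                // Soft fault: the page is local but unmapped, so recycle a slot.
                if (mapped.size() >= kMappings)
                    mapped.erase(mapped.begin());        // placeholder victim choice
                mapped[page] = 0;                        // frame lookup elided
                return Outcome::SoftFault;
            }
            // Hard fault: the page must be fetched, evicting an (approximately)
            // least-recently-used victim if the local store is full - elided here.
            return Outcome::HardFault;
        }
    };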

The number of hard faults is reduced by increasing the amount of memory available (until, of course, you have so much memory you never need evict a page). But as you increase the amount of memory you also make checking for soft faults and picking the next candidate for eviction slower – because there are more pages to check.

And this is what we see in the plot. When memory is in short supply, adding even a single extra page can make a big difference. There is then a long range where the decrease is not so great but is constant. This reflects the fact that memory addresses are not needed at random and that programs show ‘locality of reference’ (i.e., if you need one address at one moment you are quite likely to need a nearby address at the next). This locality means that adding extra pages can have a limited impact once you have enough to cover the range of likely picks in a reasonable time frame. Then adding extra memory means that the changes between the different phases of locality become smoother, so we still see a reduction in time but nowhere near as pronounced as before.

Then we get to the point – about 30 pages in this case – where we are close to covering the full range of addresses used, even when we switch between phases. In this case we see a sudden fall again until – at about 33 – 35 pages – we seem to have covered every single address our program will ever use.

After that having more memory makes the system slightly slower (NB in most real-world desk and laptop systems adding more memory does not slow down the system – so don’t worry about that!).

The lesson: if your computer is really slow and you don’t want to replace it, add more memory. But if your computer already has so much memory that it doesn’t know what to do with it and it is still slow, you have a bigger problem!

Going atomic … or concurrency is hard


SVG Graph Illustrating Amdahl’s Law (Photo credit: Wikipedia)

In my PhD world a year’s worth of software experimentation has proved what we all knew already … that systems using traditional memory models struggle in the Network-on-Chip environment and so I am now trying something slightly different.

My “model” (it’s all in software) is of a 16-core system, with each core having a small amount of on-chip memory (32k); these local memories are combined to form a flat memory space. Memory in this space can be accessed quickly; memory outside it, in the next level up in the hierarchy, is roughly 100 times further away.

Using any form of traditional paging model (including Belady’s optimal page replacement algorithm) this system starts to thrash on even moderate loads – the cost of moving pages in and out of the local memory determines performance and so there is no benefit from adding additional processors (in fact it just slows the individual processors down).

Such an outcome makes any promise of improved performance from parallelism void – it does not really matter how efficiently you have parallelised the code (some corner cases excepted, e.g., if all cores were accessing the same memory at the same time), you are trapped by a memory I/O bound.

So now I want to look at alternatives beyond the usual 4k (or 2k) paging – but I have been struggling all week to get the locking semantics of my code right. Concurrency is hard.

The one thing that debugging parallel code and locks teaches you again and again is never to assume that some event will be so rare you don’t need to bother about it: because when you are executing millions of instructions a second, even rare events tend to happen.

It has also taught me to check return values – code that will “always” work in a single threaded environment may actually turn out to be quite a tricky customer when running in parallel with other instances of itself or when it is accessing shared memory.

But, finally, the main lesson this week has been about going atomic.

I have a tendency to think: if I release that lock for a few lines of code it might improve overall performance, and I can just take it again a little later. Beware of that thought.

If you need to make a series of actions atomic you need to hold the same lock across them all – releasing it for even a few lines breaks atomicity and will quite likely break your code.
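
A toy illustration in C++ (the names and the frame-counting example are mine, not the simulator’s): dropping the lock between the check and the update looks harmless, but it lets another thread interleave, while holding a single lock across the whole sequence keeps it atomic.

    // Toy example: why releasing a lock mid-sequence breaks atomicity.
    #include <mutex>

    std::mutex pageLock;
    int freeFrames = 1;
    int framesInUse = 0;

    // Broken: the lock is dropped between the check and the update, so two
    // threads can both see freeFrames == 1 and both claim the last frame.
    bool claimFrameBroken()
    {
        {
            std::lock_guard<std::mutex> guard(pageLock);
            if (freeFrames == 0)
                return false;
        }                            // lock released "for performance"
        std::lock_guard<std::mutex> guard(pageLock);
        --freeFrames;                // can now drive freeFrames below zero
        ++framesInUse;
        return true;
    }

    // Correct: the whole check-then-update sequence runs under one hold.
    bool claimFrame()
    {
        std::lock_guard<std::mutex> guard(pageLock);
        if (freeFrames == 0)
            return false;
        --freeFrames;
        ++framesInUse;
        return true;
    }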

Curses on ncurses


gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

Every programmer will be familiar with something like this…

A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC device. I use it to model the execution times of different page replacement algorithms.

The input is XML generated via a step-by-step trace of a working program. The actual instructions being traced do not matter – what I care about are the memory access patterns.

To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.

But when I used the artificially generated access patterns, my program crashed with a seg fault. Even more confusingly, when I ran the code under GDB, the GNU Debugger, it worked if I stepped through it, but if I simply ran it inside the debugger it crashed just as it did without the debugger.

After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.

Every thread spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after the ncurses environment has been set up (by a separate thread), but with the artificial patterns the thread updates are the first thing posted to the screen, even before ncurses has been set up – hence the crash.

If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.

The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.
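
Roughly what that looks like (my own reconstruction in C++; the real program may well use the pthreads equivalents): the set-up thread flips a flag under the mutex and signals, and every thread that wants to touch the screen waits on the condition variable first.

    // Sketch: make every screen-updating thread wait until ncurses is ready.
    #include <condition_variable>
    #include <mutex>

    std::mutex uiMutex;
    std::condition_variable uiReady;
    bool cursesInitialised = false;        // protected by uiMutex

    // Called once by the thread that sets up the ncurses environment.
    void signalCursesReady()
    {
        {
            std::lock_guard<std::mutex> guard(uiMutex);
            cursesInitialised = true;
        }
        uiReady.notify_all();
    }

    // Called by any other thread before it first posts an update to the screen.
    void waitForCurses()
    {
        std::unique_lock<std::mutex> lock(uiMutex);
        uiReady.wait(lock, [] { return cursesInitialised; });
    }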

Not a big deal – but perhaps, at some point in the future, this will help someone struggling to understand why their code – which previously worked so well – has stopped processing what seems to be well-formed input. Hope this helps!

Traffic generation options


Read Only Memory (Photo credit: Wikipedia)

This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.

So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.

Nearly nine months ago I hacked at Valgrind‘s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primitives – code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primitive type, the address and the size of the piece of memory being used (whether for code or read-write operations).

I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.

That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.

These are then plugged in to some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly, I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be any sort of simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.

What does alter is the page replacement algorithm used. I first tried FIFO: the code ran for many weeks and completed in about 60 billion simulated ticks – if a memory reference is to a page already in the 512kB then it is deemed to take 1 tick to complete; if the reference is to a page that is not (a 4k page size has been used thus far), it takes 100 ticks per 16-byte line to load (25,600 ticks for a whole 4k page) – and plainly we have to decide which page gets evicted if our 512k store is already full.
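
Expressed as code, that timing rule is tiny – something like the following sketch (the names are mine) reproduces the figures above:

    // Timing rule as described: 1 tick for a reference to a resident page;
    // a missing 4k page is loaded at 100 ticks per 16-byte line,
    // i.e. (4096 / 16) * 100 = 25,600 ticks.
    #include <cstdint>

    constexpr uint64_t kPageSize      = 4096;
    constexpr uint64_t kLineSize      = 16;
    constexpr uint64_t kTicksPerLine  = 100;
    constexpr uint64_t kPageLoadTicks = (kPageSize / kLineSize) * kTicksPerLine;

    // Hypothetical helper: cost in ticks of a single memory reference.
    // (Whether the access itself adds a further tick after the load is a
    // detail not specified here; it is ignored in this sketch.)
    uint64_t referenceCost(bool pageResident)
    {
        return pageResident ? 1 : kPageLoadTicks;
    }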

Messing about with various LRU models showed that a two queue LRU did give a little better performance than a simple single LRU queue, and that completed in around 50 billion ticks (and two weeks or so of running).

I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).

All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.

I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.

So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have a problem – I cannot wait three weeks for each experiment to run. There simply is not any time for that.

The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.

The traffic generator idea has a lot to be said for it – my supervisor certainly is in favour – but it is not without weaknesses: the degree of locality between the different threads executing is really quite important – a lot of locality and the fault count falls and code gets executed fast – poor locality and the fault count rockets.

But how do I model that? I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard and for that reason rational people avoid it!

But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.

After paging?


Diagram of relationship between the virtual and physical address spaces. (Photo credit: Wikipedia)

Paging and virtual memory are at the heart of just about any computing device more complex than a DVD player that we use every day.

Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096-byte) “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.

So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.
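
A minimal illustration of that mapping (my own sketch, with 4,096-byte pages and invented names – not code from any real operating system): split the virtual address into a page number and an offset, look the page number up, and rebuild the physical address.

    // Minimal illustration of paged address translation with 4,096-byte pages.
    #include <cstdint>
    #include <optional>
    #include <unordered_map>

    constexpr uint64_t kPageSize = 4096;

    // Hypothetical page table: virtual page number -> physical frame number.
    std::unordered_map<uint64_t, uint64_t> pageTable;

    std::optional<uint64_t> translate(uint64_t virtualAddress)
    {
        const uint64_t pageNumber = virtualAddress / kPageSize;
        const uint64_t offset     = virtualAddress % kPageSize;
        const auto entry = pageTable.find(pageNumber);
        if (entry == pageTable.end())
            return std::nullopt;           // page fault: not currently resident
        return entry->second * kPageSize + offset;
    }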

It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about is one approach, while “overlays” – using part of the memory space as sort of scratchpad working area – is another. More recently, with bigger “traditional” computers very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.

But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.

But my PhD research is pointing me towards some of the weaknesses of the paging approach.

At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.

Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.

But this principle may be in opposition to efficient use of memory when competition for space is fierce, such as for the limited local memory resources we have on a Network-on-Chip computer.

Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution, competition for pages is intense and the evidence suggests that pages are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that, many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops), but pages of read-write memory may not.

I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.


None of the easy things work


This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.

It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.

I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.

The problem we have is that while we can still get more transistors on a chip (i.e., that “Moore’s Law” still applies), we cannot keep operating them at faster speed (i.e., “Dennard Scaling” has broken down) as they get too hot.

In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).

Network-on-chip (NoC) processors are a major step along the road of having parallel processors – we put more processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.

That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?

I have tried the obvious and simple methods: essentially adaptations of methods that have been used for most of the last 50 years in conventional serial computer devices and the answer is ‘no’ if that is all that is on offer.

Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.

Many computer science students will be familiar with “Amdahl’s Law”, which stipulates that, for parallel code, the elements that have to be run in serial (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem. Here we have a different, if related, problem: we can apply more and more parallel processors, but performance remains constant because, even though we are running parallel code, we are limited by memory performance.
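
For reference, the textbook statement of Amdahl’s Law, where p is the fraction of the work that can be parallelised and n the number of processors (nothing here is specific to my model):

    S(n) = \frac{1}{(1 - p) + \frac{p}{n}},
    \qquad
    \lim_{n \to \infty} S(n) = \frac{1}{1 - p}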

This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.

The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster but after the first few processors are brought to bear on the problem (the number of processors being used climbs for the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.

OPT and LRU compared – both fundamentally stymied by memory shortage

 

Getting NoCs to run faster and so releasing the benefits from the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?

Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16 byte “lines” – hardware reads from the backing memory store in 16 byte blocks in most cases – and so I tested to see if there was a pattern in which 16 byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that will not work.

Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.

So, the easy things do not work. Now I need to look at the hard things.

I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are accessed. 4k pages are simple to handle – load a block in, push it out: but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.


Looks like one hope/theory knocked on the head


Samsung DDR4 (Photo credit: samsungtomorrow)

Now I have mapped the speed of OPT and LRU using a traditional two level memory hierarchy, my task is to find something better that might make NoC a more viable computing platform.

One thought that occurred to me was that the preference of C and similar languages for page aligned memory in certain situations might make it feasible or reasonable to cache the first 128 bits (16 bytes) in faster access memory.

So I wrote some code to test which 16 byte “segments” of a page were accessed by code.
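
The counting itself is simple – a sketch of the idea (the variable names are mine): take each read/write address modulo the 4k page size and divide by 16 to find which of the 256 segments it falls in.

    // Sketch: bucket each read/write access into one of the 256 sixteen-byte
    // segments of a 4k page - the kind of counting behind the list below.
    #include <array>
    #include <cstdint>

    constexpr uint64_t kPageSize    = 4096;
    constexpr uint64_t kSegmentSize = 16;

    std::array<uint64_t, kPageSize / kSegmentSize> segmentCounts{};   // 256 buckets

    void recordAccess(uint64_t address)
    {
        const uint64_t segment = (address % kPageSize) / kSegmentSize;
        ++segmentCounts[segment];
    }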

The list that follows is for read/write accesses (i.e. not instructions). I will get around to plotting a graph, but it looks like that idea – which was always a bit of a straw clutching exercise – is not going to fly.

 

Segment: 0 Count: 3145700

Segment: 1 Count: 4892407

Segment: 2 Count: 3207293

Segment: 3 Count: 3790303

Segment: 4 Count: 5315932

Segment: 5 Count: 8143085

Segment: 6 Count: 6060159

Segment: 7 Count: 3900910

Segment: 8 Count: 10018940

Segment: 9 Count: 3738990

Segment: 10 Count: 3654813

Segment: 11 Count: 3887982

Segment: 12 Count: 4046926

Segment: 13 Count: 7366857

Segment: 14 Count: 5029997

Segment: 15 Count: 5555853

Segment: 16 Count: 5101695

Segment: 17 Count: 4919006

Segment: 18 Count: 5536030

Segment: 19 Count: 3679621

Segment: 20 Count: 3496416

Segment: 21 Count: 4175369

Segment: 22 Count: 4126270

Segment: 23 Count: 3264013

Segment: 24 Count: 4100255

Segment: 25 Count: 4322633

Segment: 26 Count: 9749020

Segment: 27 Count: 19622721

Segment: 28 Count: 11919679

Segment: 29 Count: 3679721

Segment: 30 Count: 8063852

Segment: 31 Count: 11752619

Segment: 32 Count: 23894391

Segment: 33 Count: 14935880

Segment: 34 Count: 17010081

Segment: 35 Count: 12806447

Segment: 36 Count: 11758764

Segment: 37 Count: 13757944

Segment: 38 Count: 5967876

Segment: 39 Count: 8410322

Segment: 40 Count: 9751409

Segment: 41 Count: 10568546

Segment: 42 Count: 8047474

Segment: 43 Count: 9064589

Segment: 44 Count: 9219878

Segment: 45 Count: 7913851

Segment: 46 Count: 4868922

Segment: 47 Count: 5057902

Segment: 48 Count: 6649602

Segment: 49 Count: 8562603

Segment: 50 Count: 3609239

Segment: 51 Count: 3965129

Segment: 52 Count: 4869654

Segment: 53 Count: 5149858

Segment: 54 Count: 2055722

Segment: 55 Count: 32499129

Segment: 56 Count: 2733892

Segment: 57 Count: 8887833

Segment: 58 Count: 2329809

Segment: 59 Count: 2946391

Segment: 60 Count: 2946919

Segment: 61 Count: 8444848

Segment: 62 Count: 2890405

Segment: 63 Count: 3106806

Segment: 64 Count: 3349164

Segment: 65 Count: 4892046

Segment: 66 Count: 4453014

Segment: 67 Count: 4943979

Segment: 68 Count: 4595481

Segment: 69 Count: 6552969

Segment: 70 Count: 15521259

Segment: 71 Count: 4037689

Segment: 72 Count: 3737825

Segment: 73 Count: 5438653

Segment: 74 Count: 5012634

Segment: 75 Count: 4229649

Segment: 76 Count: 11143742

Segment: 77 Count: 6171722

Segment: 78 Count: 11938365

Segment: 79 Count: 11041493

Segment: 80 Count: 6030829

Segment: 81 Count: 9557385

Segment: 82 Count: 23490594

Segment: 83 Count: 9314300

Segment: 84 Count: 11119997

Segment: 85 Count: 8870504

Segment: 86 Count: 9059886

Segment: 87 Count: 4581160

Segment: 88 Count: 13002081

Segment: 89 Count: 6050274

Segment: 90 Count: 4850064

Segment: 91 Count: 4495531

Segment: 92 Count: 6551494

Segment: 93 Count: 6294404

Segment: 94 Count: 5852555

Segment: 95 Count: 9759357

Segment: 96 Count: 5688209

Segment: 97 Count: 9991460

Segment: 98 Count: 5155181

Segment: 99 Count: 3923768

Segment: 100 Count: 3319723

Segment: 101 Count: 8659639

Segment: 102 Count: 4416737

Segment: 103 Count: 2534491

Segment: 104 Count: 2470280

Segment: 105 Count: 2709695

Segment: 106 Count: 2535015

Segment: 107 Count: 3050061

Segment: 108 Count: 2772341

Segment: 109 Count: 3621580

Segment: 110 Count: 3332878

Segment: 111 Count: 4358518

Segment: 112 Count: 4282092

Segment: 113 Count: 3523498

Segment: 114 Count: 3111625

Segment: 115 Count: 3474108

Segment: 116 Count: 2809375

Segment: 117 Count: 3592879

Segment: 118 Count: 2768848

Segment: 119 Count: 10525789

Segment: 120 Count: 4122297

Segment: 121 Count: 3660249

Segment: 122 Count: 3432971

Segment: 123 Count: 3671499

Segment: 124 Count: 3066933

Segment: 125 Count: 3271576

Segment: 126 Count: 7069611

Segment: 127 Count: 3774229

Segment: 128 Count: 3258290

Segment: 129 Count: 2416892

Segment: 130 Count: 3413264

Segment: 131 Count: 2789339

Segment: 132 Count: 7841394

Segment: 133 Count: 2992968

Segment: 134 Count: 3624867

Segment: 135 Count: 3304507

Segment: 136 Count: 3573405

Segment: 137 Count: 4226925

Segment: 138 Count: 4803447

Segment: 139 Count: 5630354

Segment: 140 Count: 6420305

Segment: 141 Count: 4721786

Segment: 142 Count: 4860223

Segment: 143 Count: 4183175

Segment: 144 Count: 3790705

Segment: 145 Count: 7974241

Segment: 146 Count: 4146253

Segment: 147 Count: 3063269

Segment: 148 Count: 3485084

Segment: 149 Count: 2923729

Segment: 150 Count: 2947715

Segment: 151 Count: 3818073

Segment: 152 Count: 2769076

Segment: 153 Count: 2645308

Segment: 154 Count: 4452525

Segment: 155 Count: 4793146

Segment: 156 Count: 2281903

Segment: 157 Count: 3256076

Segment: 158 Count: 2414992

Segment: 159 Count: 2951958

Segment: 160 Count: 1747487

Segment: 161 Count: 2385269

Segment: 162 Count: 4128250

Segment: 163 Count: 5101661

Segment: 164 Count: 3579155

Segment: 165 Count: 3746030

Segment: 166 Count: 3117725

Segment: 167 Count: 3516756

Segment: 168 Count: 6842484

Segment: 169 Count: 2145906

Segment: 170 Count: 2281955

Segment: 171 Count: 3043248

Segment: 172 Count: 2946803

Segment: 173 Count: 2638829

Segment: 174 Count: 3543883

Segment: 175 Count: 3509146

Segment: 176 Count: 7295889

Segment: 177 Count: 3061840

Segment: 178 Count: 4201176

Segment: 179 Count: 22324025

Segment: 180 Count: 7157776

Segment: 181 Count: 14223711

Segment: 182 Count: 5079461

Segment: 183 Count: 6497476

Segment: 184 Count: 3395786

Segment: 185 Count: 3594557

Segment: 186 Count: 4511617

Segment: 187 Count: 3377908

Segment: 188 Count: 4015556

Segment: 189 Count: 3884031

Segment: 190 Count: 4418048

Segment: 191 Count: 4425675

Segment: 192 Count: 4151888

Segment: 193 Count: 4073928

Segment: 194 Count: 7515643

Segment: 195 Count: 4260252

Segment: 196 Count: 15629731

Segment: 197 Count: 8020217

Segment: 198 Count: 8051874

Segment: 199 Count: 8100937

Segment: 200 Count: 4875180

Segment: 201 Count: 6493053

Segment: 202 Count: 5447678

Segment: 203 Count: 5587693

Segment: 204 Count: 4748458

Segment: 205 Count: 5004972

Segment: 206 Count: 6418567

Segment: 207 Count: 10145254

Segment: 208 Count: 5678528

Segment: 209 Count: 5601157

Segment: 210 Count: 5702977

Segment: 211 Count: 5590463

Segment: 212 Count: 6693545

Segment: 213 Count: 4030951

Segment: 214 Count: 5199543

Segment: 215 Count: 7942092

Segment: 216 Count: 6376629

Segment: 217 Count: 7480443

Segment: 218 Count: 5150624

Segment: 219 Count: 4318579

Segment: 220 Count: 4535156

Segment: 221 Count: 3908294

Segment: 222 Count: 4547756

Segment: 223 Count: 4509968

Segment: 224 Count: 2944729

Segment: 225 Count: 3301802

Segment: 226 Count: 3638158

Segment: 227 Count: 4838325

Segment: 228 Count: 9253359

Segment: 229 Count: 2775284

Segment: 230 Count: 3601569

Segment: 231 Count: 3482137

Segment: 232 Count: 3594477

Segment: 233 Count: 2952485

Segment: 234 Count: 3315834

Segment: 235 Count: 2438082

Segment: 236 Count: 4846832

Segment: 237 Count: 2711387

Segment: 238 Count: 5907452

Segment: 239 Count: 3551899

Segment: 240 Count: 3621086

Segment: 241 Count: 3032466

Segment: 242 Count: 3572000

Segment: 243 Count: 3020379

Segment: 244 Count: 3275738

Segment: 245 Count: 3398733

Segment: 246 Count: 4023338

Segment: 247 Count: 3839542

Segment: 248 Count: 6434160

Segment: 249 Count: 3999611

Segment: 250 Count: 4388716

Segment: 251 Count: 8178958

Segment: 252 Count: 3622946

Segment: 253 Count: 3564538

Segment: 254 Count: 3182242

Segment: 255 Count: 3133097
