## The Sum Of The First Billion Primes

### September 11, 2012

The first ten prime numbers are 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29. Their sum is 129.

Your task is to write a program that calculates the sum of the first billion primes. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

The above C++ code sums primes using my primesieve library (https://primesieve.googlecode.com). To compile it on a Linux/Unix system use:

This takes about 10 seconds on my Intel Core-i5 CPU.

A (inefficient) Python solution:

On my machine (Intel Core i3) this takes around 87 seconds to complete.

Java noob’s super inefficient, but extensively documented, Java

Done in SwiftForth with a bitset library I wrote a long time ago. Took 40 minutes or 410K primes per second. (While also watching web video, etc. on the computer.) It uses the compressed 33K file from previous exercise to store primes < 1 million, which is used as the sieving primes, and the segmented sieve from another previous exercise to sieve blocks after 1 million, in increments of 100,000.

requires bitset

\ Table lookup for "small" primes, (33K table for primes < 1e6)

1,000,003 drop Constant FILE_PRIME_MAX

: load-primes ( str — addr )

r/o bin open-file throw locals| fd |

fd file-size throw drop ( — size )

s" CREATE prime-table" evaluate

here over allot

swap fd read-file throw drop

fd close-file throw ;

s" ../prime1e6.dat" load-primes

hex

create masks[]

FF c, \ 0

FE c, FE c, FE c, FE c, FE c, FE c, \ 1 – 6

FC c, FC c, FC c, FC c, \ 7 – 10

F8 c, F8 c, \ 11 – 12

F0 c, F0 c, F0 c, F0 c, \ 13 – 16

E0 c, E0 c, \ 17 – 18

C0 c, C0 c, C0 c, C0 c, \ 19 – 22

80 c, 80 c, 80 c, 80 c, 80 c, 80 c, \ 23 – 28

00 c, \ 29

decimal

create offset_mod_11

-1 , 1 , 7 , -1 , 11 , 17 , -1 , 29 , 13 , 23 , 19 ,

: lowbit>offset

dup negate and 11 mod cells offset_mod_11 + @ ;

: next-small-prime ( n — n )

dup 2 < IF drop 2 EXIT THEN

dup 3 < IF drop 3 EXIT THEN

dup 5 < IF drop 5 EXIT THEN

dup FILE_PRIME_MAX >= abort" : precomputed prime table exceeded"

30 /mod prime-table +

swap masks[] + c@ \ addr mask

over c@ and lowbit>offset

BEGIN dup 0< WHILE

drop

1+ dup c@ lowbit>offset

REPEAT

swap prime-table – 30 * + ;

\ segmented sieve

100,000 drop Constant SIEVE_BLOCK_SIZE

SIEVE_BLOCK_SIZE 2/ bset sieve

2variable sieve_start \ used by the generator

: sieve_end ( — d )

sieve_start 2@ [ SIEVE_BLOCK_SIZE 1- ] Literal M+ ;

: d/mod ( d n — d%n d/n )

locals| n |

\ FORTH does not have remainder function for double division

\ so we multiply the quotient by the divisor and subtract…

\

2dup 1 n M*/ 2dup 2>r \ get quotient & save a copy

n 1 M*/ d- drop \ compute remainder

2r> ; \ restore quotient

: dmod ( d m — n )

locals| m |

m d/mod 2drop \ note: FM/MOD is faster but overflows…

dup 0< IF m + THEN ;

: offset ( n — n ) \ first offset into bit array of prime #

sieve_start 2@ third 1+ M+ D2/ dnegate rot dmod ;

: seg-sieve ( d — )

sieve_start 2!

sieve bits erase

sieve bset-invert! \ set everything to TRUE

3

BEGIN

dup offset

BEGIN dup sieve nip ( length ) < WHILE

dup sieve -bit

over +

REPEAT drop

next-small-prime

sieve_end third dup M* D< UNTIL

drop ;

\ gen-primes — generate primes and call callback for each one.

\ callback returns 0 when no more primes desired

\ After ~1e12 primes generated will throw exception…

variable 'with_prime

variable complete?

: sieve_cb ( n — n )

complete? @ IF

drop

ELSE

sieve_start 2@ rot 2* 1+ M+ 'with_prime @execute not complete? !

THEN ;

: gen-primes ( xt — )

'with_prime ! false complete? !

2

BEGIN

dup s>d 'with_prime @execute not IF drop EXIT THEN

next-small-prime

dup 1000000 > UNTIL drop

1,000,000

BEGIN

2dup seg-sieve

sieve ['] sieve_cb bset-foreach

complete? @ IF 2drop EXIT THEN

SIEVE_BLOCK_SIZE M+

AGAIN ;

\ sum first N primes

variable #primes

2variable sum

: sum_cb ( d — ? )

sum 2@ D+ sum 2!

-1 #primes +!

#primes @ 0> ;

: sum_primes ( n — d )

0, sum 2! #primes !

['] sum_cb gen-primes

sum 2@ ;

counter 1,000,000,000 drop sum_primes du. timer 11138479445180240497 2401035 ok

Trying this again, hope it formats better…

Done in SwiftForth with a bitset library I wrote a long time ago. Took 40 minutes or 410K primes per second. (While also watching web video, etc. on the computer.) It uses the compressed 33K file from previous exercise to store primes < 1 million, which is used as the sieving primes, and the segmented sieve from another previous exercise to sieve blocks after 1 million, in increments of 100,000.

requires bitset

\ Table lookup for "small" primes, (33K table for primes < 1e6)

1,000,003 drop Constant FILE_PRIME_MAX

: load-primes ( str — addr )

r/o bin open-file throw locals| fd |

fd file-size throw drop ( — size )

s" CREATE prime-table" evaluate

here over allot

swap fd read-file throw drop

fd close-file throw ;

s" ../prime1e6.dat" load-primes

hex

create masks[]

FF c, \ 0

FE c, FE c, FE c, FE c, FE c, FE c, \ 1 – 6

FC c, FC c, FC c, FC c, \ 7 – 10

F8 c, F8 c, \ 11 – 12

F0 c, F0 c, F0 c, F0 c, \ 13 – 16

E0 c, E0 c, \ 17 – 18

C0 c, C0 c, C0 c, C0 c, \ 19 – 22

80 c, 80 c, 80 c, 80 c, 80 c, 80 c, \ 23 – 28

00 c, \ 29

decimal

create offset_mod_11

-1 , 1 , 7 , -1 , 11 , 17 , -1 , 29 , 13 , 23 , 19 ,

: lowbit>offset

dup negate and 11 mod cells offset_mod_11 + @ ;

: next-small-prime ( n — n )

dup 2 < IF drop 2 EXIT THEN

dup 3 < IF drop 3 EXIT THEN

dup 5 < IF drop 5 EXIT THEN

dup FILE_PRIME_MAX >= abort" : precomputed prime table exceeded"

30 /mod prime-table +

swap masks[] + c@ \ addr mask

over c@ and lowbit>offset

BEGIN dup 0< WHILE

drop

1+ dup c@ lowbit>offset

REPEAT

swap prime-table – 30 * + ;

\ segmented sieve

100,000 drop Constant SIEVE_BLOCK_SIZE

SIEVE_BLOCK_SIZE 2/ bset sieve

2variable sieve_start \ used by the generator

: sieve_end ( — d )

sieve_start 2@ [ SIEVE_BLOCK_SIZE 1- ] Literal M+ ;

: d/mod ( d n — d%n d/n )

locals| n |

\ FORTH does not have remainder function for double division

\ so we multiply the quotient by the divisor and subtract…

\

2dup 1 n M*/ 2dup 2>r \ get quotient & save a copy

n 1 M*/ d- drop \ compute remainder

2r> ; \ restore quotient

: dmod ( d m — n )

locals| m |

m d/mod 2drop \ note: FM/MOD is faster but overflows…

dup 0< IF m + THEN ;

: offset ( n — n ) \ first offset into bit array of prime #

sieve_start 2@ third 1+ M+ D2/ dnegate rot dmod ;

: seg-sieve ( d — )

sieve_start 2!

sieve bits erase

sieve bset-invert! \ set everything to TRUE

3

BEGIN

dup offset

BEGIN dup sieve nip ( length ) < WHILE

dup sieve -bit

over +

REPEAT drop

next-small-prime

sieve_end third dup M* D< UNTIL

drop ;

\ gen-primes — generate primes and call callback for each one.

\ callback returns 0 when no more primes desired

\ After ~1e12 primes generated will throw exception…

variable 'with_prime

variable complete?

: sieve_cb ( n — n )

complete? @ IF

drop

ELSE

sieve_start 2@ rot 2* 1+ M+ 'with_prime @execute not complete? !

THEN ;

: gen-primes ( xt — )

'with_prime ! false complete? !

2

BEGIN

dup s>d 'with_prime @execute not IF drop EXIT THEN

next-small-prime

dup 1000000 > UNTIL drop

1,000,000

BEGIN

2dup seg-sieve

sieve ['] sieve_cb bset-foreach

complete? @ IF 2drop EXIT THEN

SIEVE_BLOCK_SIZE M+

AGAIN ;

\ sum first N primes

variable #primes

2variable sum

: sum_cb ( d — ? )

sum 2@ D+ sum 2!

-1 #primes +!

#primes @ 0> ;

: sum_primes ( n — d )

0, sum 2! #primes !

['] sum_cb gen-primes

sum 2@ ;

counter 1,000,000,000 drop sum_primes du. timer 11138479445180240497 2401035 ok

[…] This problem from Programming Praxis came about in the comments to my last post and intrigued me. So today, we are trying to sum the first one billion primes. Summing the first hundred, thousand, even million primes isn’t actually that bad. But it takes a bit more effort when you scale it up to a billion. And why’s that? […]

I know it’s an older, but after your comments on my solution to the Pythagorean Triples hinting at this one, I had to try it. I ended up working on it way more than I probably should have with quals coming up, but what I ended up with is 5 different ways (including all three sieves mentioned on Wikipedia) of summing the first n primes with different data structure variants on the two fastest ones.

It still ends up taking an hour and a half on the quickest run, but it was really interesting to work out all of the ways to do it. Perhaps I’ll see if I can optimize even more at some point. Although 10 seconds? Likely not. :)

Linky: The Sum of the First Billion Primes

Five methods in Perl.

The first is calling Math::Prime::Util::next_prime in a dumb loop. Not very fast since it’s doing trial division and/or MR tests. 400k-500k / second.

We can speed that up by having MPU precalculate all the primes, obviously at the expense of memory (about 760MB for sieving 22.8B numbers). Using nth_prime_upper() to get an upper bound (0.01% larger than the real value, but obtained in microseconds) we can do this pretty fast. A reasonably fast solution at over 3M primes/s.

Next is using Math::Prime::Util to do it more efficiently — get primes in segments until we’ve reached our limit. Not bad — 66s on my machine to sum the first 1 billion. Over 15M/s.

Now we move to pure Perl. First let’s use an unsegmented sieve. This one is a modification of one from c2.com’s wiki, which is a very simple SoE, and much faster than most Perl sieves found on the web (I have a collection of 17 of them — some shockingly bad, some quite decent). It runs at about 260k/s. But the biggest problem is that it isn’t segmented, nor is it storing the numbers efficiently. My 32GB machine started swapping when calculating 10^8.

Lastly is a segmented sieve in pure Perl using strings. A bit wacky and too verbose, but it can be faster than the obvious solutions depending on memory speed. As a basic unsegmented sieve, it’s about 2x faster than any of the other Perl solutions on my workstation. On my old Atom netbook it’s a bit slower than the above sieve. I wrote it earlier this year so Math::Prime::Util could have something reasonable as a fallback if the XS failed to compile on a platform. Since it’s segmented it will work for our task without going crazy with memory. Over 500k/s.

None of them compare to primesieve of course (the serial version of Kim’s program takes all of 7 seconds on my machine, the parallel version runs in 1.1s). The callback is spiffy also. I thought about adding something like it to Math::Prime::Util (e.g. prime_foreach 2 100 { …sub… }) and still might, but calling Perl subroutines is far slower than C — it’s over a minute just to make 1B empty subroutine calls so it seems of limited use for this example.

Timings:

Code:

Dang. After some comments over on my original post, I decided to go back and add a the segmented version of the Sieve of Eratosthenes into the mix. I’m glad I did.

While my previous times were 2 hr 42 min for Eratosthenes and 1 hr 28 min for Atkin, the new segmented sieve blows both out of the water at only 25 min 12 sec. It’s still nowhere near as quick as some people’s, but it’s at least a good example of what being conscious of what memory/caching behavior can do to runtimes.

Linky: One Billion Primes – Segmented Sieve

Alright, here’s my approach (quite literally).

On my Intel Core i7 2.3 ghz machine, it completes in 495 milliseconds and uses about 800,000 bytes of RAM (but if I used a larger wheel that would go up rapidly).

One interesting side note: the actual algorithm I’m using doesn’t use any memory during run time aside from the pre-allocated memory of the wheels and the stack space of the function calls (which will never be deeper than log n/log 2), so it would be relatively trivially to make this algorithm multi-threaded or even distributed. My 495 ms time is single-threaded. With that said, this algorithm runs in something worse than O(n^4/5) time, so there are faster solutions.

My C# implementation follows. Just call Program.Main(). This code handles overflow very poorly – if you try values of n larger than 22801763489, or increasing the wheel size, there’s a good chance you’ll start seeing garbage values.

Nathan, that is very impressive. Thanks for sharing as well as the writeup you did for stackoverflow and your web page. I saw another approach here: http://mathematica.stackexchange.com/questions/80291/efficient-way-to-sum-all-the-primes-below-n-million-in-mathematica that looks more analogous to Meissel’s formula (with the Legendre phi replaced by the S(v,p) function).

Pari/GP. 2m20s

Perl 47s

Perl 12s

That’s not *too* far off primesieve’s 7 seconds. Both work by generating primes with a segmented sieve and doing something with the results. Someday I may add one of the fast sums, but memory and code size (standard C) are important.

Using The Lagarias-Miller-Odlyzko method for counting primes improved by Deleglise-Rivat it easy to write a ithprime(n) function which returns the n-th prime in time (n^(2/3)/log(n)^2).

This algorithm return ithprime(10^9) = 22801763489 in 0.03s.

The number of primes up to x may be written pi(x) = sum(1, for p in primes(x)) = sum(f(p) for p in primes(x)) where f(p) = 1.

The Lagarias-Miller-Odlyzko method may be used with very small changes to compute Sf(x) = sum(f(p) for p in primes(x)), still with O(x^(1/3)/log(x)^2) operations

for every function f such that f(ab) = f(a) f(b). This is explained in the paper

Maximal product of primes whose sum is bounded, by M. Deléglise and J.-L. Nicolas, arXiv:1207.0603v1,p 25-35.

If we take for f the identity function f(x) = x we get Sf(x) = sum(p for p in primes(x).

This algorithm computes

Sf(22801763489) = 11138479445180240497 en 0.9s.

If we replace one billion, 10^9 by 10^12 wet get respectiviely

ithprime(10^12) = 29996224275833 in 1.5s

and then Sf( 29996224275833) = 14738971133550183905879827 in 5.6s.

@Marc Deleglise, very interesting, can you show us the code used or post a link to it (since this is a programming website not a math one)?

This is a quick and dirty way of explicitly computing the sum using CUDA:

This code uses the thrust library for a parallel reduction as well as my CUDASieve library (https://github.com/curtisseizert/CUDASieve) for parallel prime generation. The fastest block size of 2^32 runs in about 630 ms on my GTX 1080 but takes about 1.6 GB of device memory during this time. One could significantly improve the speed of this by changing the source code of CUDASieve in order to obviate generating lists of primes. This would theoretically take only a few kb over the memory required to store the list of sieving primes for the Eratosthenes implementation (~60 kb) and would have a likely run time essentially equal to that of simply counting the primes.