Path: news.mathworks.com!not-for-mail
From: "Andy Robb" <ajrobb@hotmail.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: count number of
Date: Sun, 14 Sep 2008 19:51:02 +0000 (UTC)
Organization: The MathWorks, Inc.
Lines: 46
Message-ID: <gajpv6$eef$1@fred.mathworks.com>
References: <g6q9kq$log$1@fred.mathworks.com> <3412110.1217443328021.JavaMail.jakarta@nitrogen.mathforum.org> <g7450c$q2p$1@fred.mathworks.com>
Reply-To: "Andy Robb" <ajrobb@hotmail.com>
NNTP-Posting-Host: webapp-05-blr.mathworks.com
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Trace: fred.mathworks.com 1221421862 14799 172.30.248.35 (14 Sep 2008 19:51:02 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Sun, 14 Sep 2008 19:51:02 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 1374918
Xref: news.mathworks.com comp.soft-sys.matlab:490195



The following times are those to count the bits in all 4 billion 32-bit integers on 64-bit ubuntu linux.
 
With a fast multiply, the following can be significantly faster than my previous post (25s v 32s):

int bitcount(unsigned n) {
  n -= (n >> 1) & 0x55555555;
  n = ((n >> 2) & 0x33333333) + (n & 0x33333333);
  return (((n + (n >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24;
}

see page 180 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

However, if you are counting bits of lots of integers at once, the look-up method can be even faster (once the look-up table is loaded into cache - 19s)

int count(unsigned v) {
  static const int BitsSetTable256[] = 
  { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
  , 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5
  , 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5
  , 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6
  , 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5
  , 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6
  , 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6
  , 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7
  , 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5
  , 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6
  , 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6
  , 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7
  , 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6
  , 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7
  , 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7
  , 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
  };

  return BitsSetTable256[v & 0xff] + 
    BitsSetTable256[(v >> 8) & 0xff] + 
    BitsSetTable256[(v >> 16) & 0xff] + 
    BitsSetTable256[v >> 24]; 
}

This is a modified version of a routine in http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

I changed the table from (unsigned char) to (int) as it is faster to add directly from the table than to convert unsigned char to int before adding.

If the routine is inlined, the time drops to 15s. With a bit of assembler hacking, the time can drop below 12s.