Science is great for innovation and improving our lives, but let’s face it: there are some things we’ve pretty much got down pat. You wouldn’t think, for example, that we could improve on something like… like counting.

So it may come as a surprise that a group of computer scientists have done just that: found a new way to solve a decades-old problem that asks what, on the face of it, looks to be a very simple question – how many distinct things are there in front of me?

It’s a harder problem – and a smarter solution – than you might think.

The Distinct Elements Problem

Computers can be very smart, but they can also be very, very… not-smart. Just look at the recent explosion of AI chatbots for evidence of that: they’re great at sounding intelligent, but put ‘em to the test and you might just find yourself in an ouroboros of bullshit.

And sometimes, it’s the things that seem almost ludicrously simple to a human that cause the most problems. Take counting, for example – specifically, counting distinct objects. For us, it’s easy: we look at the collection of objects, and our brain just kind of automatically sorts them into groups for us. We barely have to think about it at all.

For computers, on the other hand, it’s a fundamental and decades-old problem. And it’s one that really needs to be answered, since its applications in the modern world span everything from web traffic analysis – think Facebook or Twitter monitoring how many people are logged in at any given time – to fraud detection, to bioinformatics, to text analysis, and much more.

Now, obviously, we’ve been able to do those things for a while now, and that’s because this counting question – properly known as the Distinct Elements Problem – does have answers. They’re just not very good ones.

“Earlier known algorithms all were ‘hashing based,’ and the quality of that algorithm depended on the quality of hash functions that algorithm chooses,” explained Vinodchandran Variyam, a professor in the University of Nebraska–Lincoln’s School of Computing, in a statement last year.

But, together with colleagues Sourav Chakraborty of the Indian Statistical Institute and Kuldeep Meel of the University of Toronto, he discovered a way to massively simplify the problem: “The new algorithm only uses a sampling strategy, and quality analysis can be done using simple techniques.”

How does it work?

The new method, since named the CVM algorithm in honor of its inventors, drastically reduces memory requirements – an important advantage in this modern age of big data – and it does so using a neat trick of probability theory. To illustrate the concept, take the example studied by Variyam and his colleagues, as well as a recent article in Quanta Magazine: imagine you’re looking for the number of unique words in Shakespeare’s Hamlet, but you have only enough memory to store 100 words at a time.

First, you do the obvious: you record the first 100 unique words you come across. You’re now out of space – so you take a coin and flip it for each word. Heads, it stays; tails, you erase it.

At the end of this process, you’ll have around 50 unique words in your list. You restart the process from before – but this time, if you come to a word already on the list, you flip the coin again to see whether or not to delete it. Once you reach 100 words, you run through the list again, flipping a coin for each word and erasing or keeping it as prompted.

In round two, things are a tiny bit more complex: instead of one head to keep a word in the list, you’ll need two in a row – anything else, and it gets deleted. Likewise, in round three, you’ll need to get three heads in a row for it to stay; round four will need four in a row, and so on until you reach the end of Hamlet.

There’s method in the madness – and it’s a smart one, too. By working through the text like this, you’ve ensured that every word in your list has the same probability of being there: 1/2^k, where k is the number of times you had to run through the list. So, let’s say it took you six rounds to get to the end of Hamlet, and you’re left with a list of 61 distinct words: you can then multiply 61 by 2^6 to get an estimate of the number of unique words.

We’ll save you opening up your calculator app: the answer is 3,904 – and according to Variyam and co, the real answer is 3,967 (yes, they counted). If you have a memory that can store more than 100 words, the accuracy goes up further: with the ability to store 1,000 words, the algorithm estimates the answer as 3,964 – barely a rounding error already – and “of course,” Variyam told Quanta, “if the [memory] is so big that it fits all the words, then we can get 100 percent accuracy.”
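To make that concrete, here’s a minimal Python sketch of the coin-flip sampling idea described above. It’s an illustrative reimplementation rather than the authors’ own code, and the function name, the memory_size parameter, and the hamlet.txt filename are all just placeholders: each time a word turns up it must survive the current round’s coin flips to stay on the list, and the final count is scaled back up by 2^k.

```python
import random

def estimate_distinct(stream, memory_size=100):
    """Estimate the number of distinct items in `stream` while never
    storing more than `memory_size` items at once (CVM-style sketch)."""
    kept = set()   # the bounded "list" of words
    rounds = 0     # k: purges so far; the keep probability is 1/2**k

    for item in stream:
        # Seeing a word again: drop it, then re-admit it with the current
        # keep probability -- the equivalent of needing k heads in a row.
        kept.discard(item)
        if random.random() < 0.5 ** rounds:
            kept.add(item)

        # Memory full: flip a coin for every stored word, keep only the
        # heads, and halve the keep probability for the next round.
        while len(kept) >= memory_size:
            kept = {w for w in kept if random.random() < 0.5}
            rounds += 1

    # Each surviving word is on the list with probability 1/2**rounds,
    # so scale the count back up to estimate the number of distinct words.
    return len(kept) * (2 ** rounds)

# Illustrative usage -- assumes a local hamlet.txt; any word list will do:
# words = open("hamlet.txt").read().lower().split()
# print(estimate_distinct(words, memory_size=100))
```

Run with 100 slots of memory, a sketch like this should land in the same ballpark as the 3,904 figure above; give it a bigger buffer and, just as Variyam describes, the estimate tightens.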

A simple approach

So, it’s effective – but what makes the algorithm even more intriguing is its simplicity. “The new algorithm is amazingly simple and easy to implement,” Andrew McGregor, a Professor in the College of Information and Computer Sciences at the University of Massachusetts, Amherst, told Quanta. “I wouldn’t be surprised if this became the default way the [distinct elements] problem is approached in practice.”

Indeed, since its posting in January 2023 – and barring a few minor quibbles and bugs in the meantime – the algorithm has attracted attention and admiration from many other computer scientists. That means that, while the paper detailing the algorithm has not been peer-reviewed in the official sense, it definitely has been reviewed by peers. Indeed, Donald Knuth, author of The Art of Computer Programming and so-called “father of the analysis of algorithms,” wrote a paper in praise of the algorithm back in May 2023: “ever since I saw it […] I’ve been unable to resist trying to explain the ideas to just about everybody I meet,” he commented.

Meanwhile, various teams – Chakraborty, Variyam, and Meel included – have spent the last year investigating and fine-tuning the algorithm. Some, Variyam says, are already teaching it in their computer science courses.

“We think that this will be a mainstream algorithm that is taught in the first computer science course on algorithms in general and probabilistic algorithms in particular,” he said. Knuth agrees: “It’s wonderfully suited to teaching students who are learning the basics of computer science,” he wrote in his May paper. “I’m pretty sure that something like this will eventually become a standard textbook topic.”

So, how did such a breakthrough algorithm escape notice for so long? According to Variyam, it’s not as unbelievable as it sounds.

“It is surprising that this simple algorithm had not been discovered earlier,” he said. “It is not uncommon in science that simplicity is missed for several years.”

The paper is posted on the ArXiv and appears in Proceedings of the 30th Annual European Symposium on Algorithms (ESA 2022).