Several months ago I wanted to estimate the mean of a value for users of an online app as quickly as possible with as few samples as possible. Every data point required scraping a webpage, and this proved timely AND costly in terms of system resources. Additionally, if I triggered too much traffic with my queries, the host would temporarily block the IP of my server.
Having theorized that the distribution was normal, I considered several more formal approaches (e.g., estimate power and then choose the appropriate sample size; sequential sampling; etc.) However, I was curious if I could develop an iterative approach that would both satisfy my precision requirements and help me get the data I needed as efficiently as possible given the cost of each web-scraping request.
Big M: A Generally Precise Estimate Of The Mean
While I didn’t need to publish the means I was estimating in scholastic journals, I wanted to ensure that the estimates were provably reliable in the general sense. That is to say, I wanted confidence intervals “in the nineties” or “in the eighties.” I wasn’t trying to disprove any formal null hypotheses, but I did want good data. I was training machine learning models for a class, and I wanted the most predictive models possible given my resource limitations.
I decided to play around with an iterative approach to generating a sample of large enough size to achieve the general degree of precision that I desired (even the thought of this would probably make my grad school stats professors throw up in their mouths.) This type of approach is a “no no” in statistics books, as you can generally grow a sample until you temporarily get what you want. You usually want to make informed a priori decisions about your research and then follow them to find your robust results.
However, I’d played around with Monte Carlo simulations a couple decades ago (I’m so old), and I always found it interesting how well various methods generally held up even in the face of violations of assumptions. Additionally, the ability of machine learning models to consistently converge on valid findings even in the face of crude hyperparamters has taught me to put things to the test before discounting them.
I set out to make a (relatively) simple algorithm for estimating the mean of the population to a general (e.g., “around ninety out of one hundred samples will contain the mean of the population with the given error constraint.”) I called this estimate Big M because, well, you know, D. Knuth is awesome, and Big O conveys a form of general precision that I wanted to embrace. If I’m working with big data, I don’t need a scientifically chosen set of samples that should guarantee a CI of 95%, I just need to know that I’m generally in the 90s.
The Big M Algorithm Overview
After trying out various forms of the algorithm, I developed the following approach, and a quick example Jupyter notebook is linked to containing code and several random results. Essentially, the code integrates the following algorithm.
- Select initial sample.
- Compute the confidence interval (CI) and mean.
- Check if the acceptable error (e.g., off by no more than 2 inches) is outside the confidence interval.
- If the acceptable error is outside the CI, increment the count of valid findings.
- If the acceptable error is inside the CI, start the count over at zero.
- Check if the continuous count has reached its threshold.
- If the continuous count has reached its threshold, exit the function with the mean estimate.
- If the continuous count has not reached its threshold, add one more observation to the sample and return to step 2.
There are some other details of note. The continuous count formulation is based on the ratio of the population size to the initial sample size and the confidence percentage chosen. Additionally, there is a maximum sample size that is formulated automatically by the size of the population and the initial sample size, though this parameter can be set manually.
The Big M Jupyter Notebook Example Code (Embedded as HTML)
I’ve linked to a demonstration of the code to implement the algorithm. Much of this is arbitrary, and I’m sure you could refactor this so it performs better. I’ve since developed a server-side implementation that’s quite different from this in a language other than Python, but this should (barring bugs, which obviously will be present) capture the general thrust of the algorithm and show general results (you can just rerun the Notebook to see new results.)