What prompted this?

When I’m bored, I sometimes visit r/random. I usually click through a few subreddits before I remember that I have actual things to do. During my clicking, I’ve noticed that I get sort of a lot of repeats. More repeats than I would expect. I mean, there are 3 million subreddits—I can’t keep getting the same few results every time I ask for a random one.

How was the data collected?

My first step towards figuring out a black box like r/random is to get some data that I can analyze. In this case, I’d like to start by simply recording a bunch of the “random” results and then see where I can go from there.

Let’s see what we’re working with.

$ curl "https://www.reddit.com/r/random"
 <body>
    <h1>whoa there, pardner!</h1>



<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>

<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>

<p>please wait 5 second(s) and try again.</p>

    <p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
  </body>

[I removed the <head>. The indentation, spacing, and capitalization are sic.]

Sort of a weird, half-hearted anti-bot measure there. One User-Agent later, we’re in business.

$ curl -A "mozilla/5.0" "https://www.reddit.com/r/random"
<html>
 <head>
  <title>302 Found</title>
 </head>
 <body>
  <h1>302 Found</h1>
  The resource was found at <a href="https://www.reddit.com/r/AppleMusic/?utm_campaign=redirect&amp;utm_medium=desktop&amp;utm_source=reddit&amp;utm_name=random_subreddit">https://www.reddit.com/r/AppleMusic/?utm_campaign=redirect&amp;utm_medium=desktop&amp;utm_source=reddit&amp;utm_name=random_subreddit</a>;
you should be redirected automatically.


 </body>
</html>

It turns out that the User-Agent doesn’t even have to be reasonable. I misspelled mozilla as mozzila and then tried gibberish when I noticed the spelling error. Both worked fine. As long as it isn’t something like curl/7.58.0 or python-requests/2.18.4, it seems to work.

The 302 redirect has the name of the subreddit, so I can do some string parsing to get all the information I care about without having to follow the redirect. Now that I know what I’m looking for, I can script this up in Python and get myself some data.

Scraping Reddit with Python for fun and not really any profit

I’ve uploaded a few Jupyter notebooks to GitHub if you’re interested in the actual code. I’ll use this section to describe some of the quirks I came across.

To collect the raw list of results, I made a simple scraper using requests and some string parsing.
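
For the curious, the core of it looks something like this (a minimal sketch; the function and variable names are mine, and the real version lives in the notebooks):

import time
import requests

# Any User-Agent that doesn't look like a bot's default will do.
HEADERS = {"User-Agent": "mozilla/5.0"}

def get_random_subreddit():
    # Don't follow the 302; the subreddit name is in the Location header.
    resp = requests.get("https://www.reddit.com/r/random",
                        headers=HEADERS, allow_redirects=False)
    # e.g. https://www.reddit.com/r/AppleMusic/?utm_campaign=redirect&...
    return resp.headers["Location"].split("/r/")[1].split("/")[0]

samples = []
for _ in range(12_055):
    samples.append(get_random_subreddit())
    time.sleep(2)  # Reddit asks for at most one request every two seconds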

Later, when I wanted to collect some metadata about the subreddits, I took one look at Reddit’s API and immediately turned to BeautifulSoup4 instead. Most of the metadata I wanted is available on the old.reddit.com/r/<subreddit> version of the page, which has English names for the id and class fields of the HTML tags.
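
As a taste, grabbing a subscriber count might look like this (a sketch; the class names are what I'd expect from inspecting old.reddit's sidebar, and they may have changed since):

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "mozilla/5.0"}

def get_subscriber_count(subreddit):
    resp = requests.get(f"https://old.reddit.com/r/{subreddit}/",
                        headers=HEADERS)
    soup = BeautifulSoup(resp.text, "html.parser")
    # old.reddit uses readable class names like "subscribers" and "number".
    number = soup.find("span", class_="subscribers").find("span", class_="number")
    return int(number.text.replace(",", ""))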

I couldn’t find the subreddit’s creation date there, so I had to turn to www.reddit . . . for that piece of data. The creation dates are usually in

<div class="_2QZ7T4uAFMs_N83BZcN-Em" id="IdCard--CakeDay--undefined--t5_2qixk">

but some of them aren’t. I scraped 3,679 subreddits and I had to manually fill in the creation date for about 20 of them.
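
Matching on the stable "CakeDay" chunk of the id, rather than the hashed class name, is the sturdiest handle here. A sketch, with the usual caveat that Reddit's markup can shift under you:

import re
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "mozilla/5.0"}

def get_creation_date(subreddit):
    resp = requests.get(f"https://www.reddit.com/r/{subreddit}/",
                        headers=HEADERS)
    soup = BeautifulSoup(resp.text, "html.parser")
    # The t5_... suffix differs per subreddit, so match the id by pattern.
    div = soup.find("div", id=re.compile(r"CakeDay"))
    return div.text if div else None  # the stragglers get filled in by hand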

Do the subreddits appear in a random order?

Yes.

Wald–Wolfowitz runs test

I’m going to call it the “runs test” here, but I’ll let those two have some credit up there in the section title. I’m pretty sure the subreddits appear in a random order, so I don’t want anything too fancy. I want a quick, simple test to confirm my suspicions so that I can move on and investigate the size of the pool.

The runs test has two things going for it.

  1. I have a vague idea in my mind that runs are an important metric for measuring randomness.
  2. It has a straightforward Wikipedia article, which is rare for these types of things.

First, a quick rundown of what I’m looking to get out of this test: I want to count the number of runs somehow, and then I want to see whether that’s the same number of runs that a random sequence would have. Easy peasy. Let’s count some runs.

>>> len(df['subreddit'].unique())
3679

Oh, right. I have over 3,000 different values, but the runs test (like most randomness metrics) deals with sequences of 2 values (i.e., bits).

How do I turn my sequence of subreddit names into a sequence of bits?

The Wikipedia page provides some guiding light here (there's a code sketch of the transformation right after this list).

  • Enumerate all of the subreddit names
  • Take the median
  • Replace any value above the median with a 1
  • Replace any value below the median with a 0
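
In pandas, that whole transformation is a few lines (df is the DataFrame from earlier; factorize() handles the enumeration):

import pandas as pd

# Map each subreddit name to an integer code, then split at the median.
codes, _ = pd.factorize(df['subreddit'])
median = pd.Series(codes).median()
bits = (codes > median).astype(int)  # above the median -> 1, else -> 0
# (Values exactly at the median land in the 0 bucket; the textbook test
# drops them, but it barely matters at this sample size.)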

Back to the runs test

Once that’s cleared up, I can take another swing at the test. The code is available in the GitHub repo, so I’ll just summarize the results here.

My data has 6043 runs. I need to check whether 6043 is a likely value to get by comparing it to a distribution I bootstrapped from actual random sequences. Some simple metrics show that 6043 runs is 0.28 standard deviations away from the theoretical mean of 6027 runs. With such a tiny z-score, I’m comfortable failing to reject the null hypothesis that 6043 runs is a sample from the population of random sequences. (That wording is actually still a little bit of a shortcut, if you can believe it, but I’m not defending a thesis here.) In practical terms, the sequence of subreddits I got from r/random has about the same number of runs as a random sequence would.
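
If you'd rather see the formulas than dig through the notebook, the theoretical mean and standard deviation come straight from the Wikipedia article (bits is the 0/1 sequence from the last sketch; the numbers here should land near my 6043 and 0.28):

import numpy as np

# A new run starts at every position where the bit flips.
runs = 1 + np.count_nonzero(np.diff(bits))
n1 = np.count_nonzero(bits)   # number of 1s
n2 = len(bits) - n1           # number of 0s
n = n1 + n2
mu = 2 * n1 * n2 / n + 1                          # expected runs
sigma = np.sqrt((mu - 1) * (mu - 2) / (n - 1))    # std dev of runs
z = (runs - mu) / sigma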

That’s enough evidence for me to believe that the subreddits appear in a random order. It doesn’t convince me that these subreddits are randomly drawn from the full pool of 3 million subreddits, though. That’s going to require some more digging.

Are the subreddits drawn from the full pool of over 3 million subreddits?

No.

Elaborate.

If I were to draw randomly from a pool of 3 million objects, I would expect to get a new object I’ve never seen before pretty much every time. Even if I were to draw, say, 12,055 times, I’d still expect to get mostly unique objects. Random sampling with replacement, as in the scenario I described, is one of the basic building blocks of statistics. Here it is in Chapter 1 of the first stats textbook I found online. I’m pretty confident that I’ll be able to find someone who has figured out a way to determine the size of the population based on how many duplicates there are in a random sample (with replacement) of that population.
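
For concreteness, the expected number of distinct values after n draws with replacement from a pool of N is N(1 - (1 - 1/N)^n):

>>> n, N = 12_055, 3_000_000
>>> round(N * (1 - (1 - 1/N) ** n))  # expected distinct subreddits
12031

That's only about 24 repeats across all 12,055 draws.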

I didn’t find exactly that, but I did find something I could use to figure it out myself.

What I found was a post on the statistics StackExchange that describes how to go the opposite direction. Given the size of the pool to draw from and the number of samples, the post explains how to calculate how many objects should show up two times, three times, four times, etc.

So I can’t calculate the size of the pool the subreddits are drawn from, but I can figure out what the distribution of duplicates should look like and compare that to the distribution of duplicates I saw empirically. Let’s do that.
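
Here's the calculation as I'd write it (a sketch using the standard binomial expectation; I believe this is equivalent to the Stack Exchange post's formula, though approximation details may make the table's exact values differ slightly):

from math import comb

def expected_duplicate_counts(pool_size, n_samples, max_k=10):
    # Expected number of pool members drawn exactly k times:
    #   pool * C(n, k) * p^k * (1-p)^(n-k),  with p = 1/pool
    p = 1 / pool_size
    return {k: pool_size * comb(n_samples, k) * p**k * (1 - p)**(n_samples - k)
            for k in range(1, max_k + 1)}

Calling expected_duplicate_counts(3_000_000, 12_055) gives (approximately) the Expected column below.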

# Comparison of expected values to empirical values
# Draw 12_055 samples from a pool of 3_000_000
 times drawn    Expected    Actual
      1         12008.34       533
      2            23.28       827
      3             0.03       857
      4             0.00       657
      5             0.00       406
      6             0.00       237
      7             0.00       101
      8             0.00        38
      9             0.00        14
     10             0.00         8

The Expected column lines up with my intuition that almost all of them should show up only once. Clearly, my 12,055 samples were not taken from a pool of 3 million possible subreddits. I could do some kind of statistical test to prove that these two distributions of duplicates are different, but I think it’s obvious enough that I’m not going to spend time hunting down exactly the right test for this situation.

What I’m going to do, instead, is figure out the size of the pool r/random draws from.

How many subreddits are in the r/random pool?

About 4,000.

Elaborate.

Since I have a function that maps a pool size to a distribution of duplicates, I can guess and check pool sizes until I get a distribution of duplicates that matches the distribution I saw. As a bit of a cybersecurity tie-in, this general process is similar to brute-forcing hashes in that there’s a one-way function and I want to know what input will produce the output I have. Fortunately for me, it is very easy to predict the output of this one-way function. I can quickly search the result space for values close to my output and then search the promising area more closely.
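
The search itself can be that simple (a sketch; the notebooks do the real version, and the squared-error score here is just one reasonable choice). observed is the Actual column from the table above, and expected_duplicate_counts is the function from the earlier sketch:

observed = {1: 533, 2: 827, 3: 857, 4: 657, 5: 406, 6: 237,
            7: 101, 8: 38, 9: 14, 10: 8}

def distance(pool_size, n_samples=12_055):
    # Squared error between expected and observed duplicate counts.
    expected = expected_duplicate_counts(pool_size, n_samples)
    return sum((expected[k] - observed[k]) ** 2 for k in observed)

# The pool can't be smaller than the 3,679 distinct subreddits I saw.
coarse = min(range(3_679, 20_000, 250), key=distance)        # wide, cheap pass
best = min(range(coarse - 250, coarse + 250), key=distance)  # zoom in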

There’s a bunch of code in the notebooks in the GitHub repo that you can peruse if you want to see where my numbers came from. My analysis shows that r/random pulls from a pool of about 4000 subreddits. To be exact, the size of the pool is between 3461 and 4271 with 99% confidence.

Why did Reddit set up r/random that way?

I can only speculate. I thought they might choose the top 4,000 subreddits based on some metric, but I haven’t found an obvious candidate. There are some subreddits with only 10 subscribers, and I didn’t see anything interesting in the creation dates either. It’s possible that Reddit has created an extremely convoluted metric to select the top 4,000 subreddits, but I think Occam would agree that the subreddits are probably just a random sample. I haven’t run the scraper again, but I expect that they redraw the random pool every now and then. Maybe every day or so.

My best guess is that the list of active, non-banned, non-private, non-NSFW subreddits changes really frequently. As supporting evidence: when I went back to collect the subscriber counts and creation dates a few days after taking my sample, three or four of the subreddits had already been banned or taken private. Instead of trying to keep up with that shifting list in real time, Reddit probably takes a random sample from a (daily?) snapshot and serves r/random from that sample, which produces results that seem random enough to users.

Random enough to users who aren’t me, that is.

Links to messy code and pretty clean data

If I go down the rabbit hole of cleaning up my code to the point where it feels presentable, I will never publish it. So, I’m choosing to leave my code unedited in the interest of moving on to new projects. Maybe this will inspire me to write presentable code in the first place next time.

Appendix A: Can’t you just do algebra?

I wanted to do a data science and statistics project, so I didn’t rearrange the equation that I got from Stack Exchange to solve it for the size of the pool. I plan to do that in the future and compare it to the results I got the first time around.
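
If you want a head start, here's roughly what that inversion looks like with the simpler distinct-count equation, u = N(1 - (1 - 1/N)^n), solved for N by bisection (a sketch, not the analysis above; u = 3,679 distinct subreddits from n = 12,055 draws):

def expected_distinct(N, n=12_055):
    return N * (1 - (1 - 1 / N) ** n)

lo, hi = 3_679, 1_000_000       # expected_distinct is increasing in N
while hi - lo > 1:
    mid = (lo + hi) // 2
    if expected_distinct(mid) < 3_679:
        lo = mid
    else:
        hi = mid
print(hi)  # lands in the same ballpark as the duplicate-distribution fit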