Scraping Reddit to find the most popular domains


Overview

Reddit is one of the ten most popular sites on the internet, which makes it a great place to find independent content creators (whether for advertising or simply for content discovery).

Below, we'll show you how to scrape Reddit using PRAW (the Python Reddit API Wrapper). For this example, our goal is to scrape the top submissions of the month in a few subreddits, storing the following for each: submission URL, domain (website URL), and submission score. We use a similar dataset to help build the Find-me database of content creators open to advertising opportunities.


Import packages, set up PRAW, select subreddits

#packages
import pandas as pd
import praw
#set up PRAW - setup guide here: http://praw.readthedocs.io/en/latest/getting_started/quick_start.html
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent')
#create list of subreddits to include
s_list = [
    'funny',
    'todayilearned',
    'science',
    'worldnews',
    'gaming',
]
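
With credentials in place, it's worth confirming the client connects before looping over subreddits. A quick check (for a script-type app with no username/password, PRAW runs read-only, which is all we need here):

#sanity check: prints True for a read-only script app
print(reddit.read_only)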

Grab the score, domain (URL), and subreddit for each top monthly submission

In this section we loop through our list of subreddits from above and store the key information for each submission (score, domain, and subreddit), keyed by submission ID. The output of this section is three separate dataframes, which we can then merge together using the Reddit submission ID.

domains_sub = {}
domains = {}
domains_score = {}

for i in s_list:
    #--Aggregate score--#
    #pull in the top submissions for the month for each subreddit in the list above
    subreddit = reddit.subreddit(i)
    submissions = subreddit.top(time_filter='month', limit=50)
    #sum score across submissions (IDs are unique, so this records one score per post)
    for s in submissions:
        if s.id in domains_score:
            domains_score[s.id] += s.score
        else:
            domains_score[s.id] = s.score
    df_score = pd.DataFrame.from_dict(domains_score, orient='index').reset_index()
    df_score.columns = ['id', 'score']

    #--Grab the domain for each submission ID--#
    submissions = reddit.subreddit(i).top(time_filter='month', limit=50)
    for s in submissions:
        domains[s.id] = s.domain
    df_domain = pd.DataFrame.from_dict(domains, orient='index').reset_index()
    df_domain.columns = ['id', 'domain']

    #--Grab the subreddit for each submission ID--#
    submissions = reddit.subreddit(i).top(time_filter='month', limit=50)
    for s in submissions:
        domains_sub[s.id] = s.subreddit.display_name
    df_subreddit = pd.DataFrame.from_dict(domains_sub, orient='index').reset_index()
    df_subreddit.columns = ['id', 'subreddit']
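
Note that the loop above fetches the same 50 submissions three times per subreddit, so each run costs three API requests per subreddit where one would do. A minimal single-pass sketch under the same setup (the rows list and the df_onepass name are ours, not from the original run):

#one request per subreddit: collect all three fields in a single pass
rows = []
for i in s_list:
    for s in reddit.subreddit(i).top(time_filter='month', limit=50):
        rows.append({'id': s.id,
                     'subreddit': s.subreddit.display_name,
                     'score': s.score,
                     'domain': s.domain})
#same rows as the merged df_final built below (before the url column is added)
df_onepass = pd.DataFrame(rows)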

Preview our 3 dataframes

#Subreddit, post ID
df_subreddit.head()
id subreddit
0 78tulq todayilearned
1 76bn5s science
2 7871xy science
3 77pnk6 science
4 75eydj gaming

#Score, post ID
df_score.head()
id score
0 78tulq 42720
1 76bn5s 25021
2 7871xy 30648
3 77pnk6 13178
4 75eydj 64504

#Domain (URL), post ID
df_domain.head()
id domain
0 78tulq atlasobscura.com
1 76bn5s ns.umich.edu
2 7871xy acsh.org
3 77pnk6 jech.bmj.com
4 75eydj i.redd.it
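
Before merging, a quick sanity check (not part of the original run) confirms the three frames line up with one row per submission ID:

#all three should report the same number of rows
print(len(df_subreddit), len(df_score), len(df_domain))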

Merge dataframes

Here we merge the three tables together, using the submission ID as the primary key.

#merge the three tables together, using submission ID as primary key
df_sub_score = df_subreddit.merge(df_score, how='left', on="id")
df_final = df_sub_score.merge(df_domain, how='left', on='id')
#add in the submission URL using the 'id'
df_final['url'] = 'www.reddit.com/' + df_final['id'].astype(str)
df_final.head()
id subreddit score domain url
0 78tulq todayilearned 42720 atlasobscura.com www.reddit.com/78tulq
1 76bn5s science 25021 ns.umich.edu www.reddit.com/76bn5s
2 7871xy science 30648 acsh.org www.reddit.com/7871xy
3 77pnk6 science 13178 jech.bmj.com www.reddit.com/77pnk6
4 75eydj gaming 64504 i.redd.it www.reddit.com/75eydj

Done! Explore the output

We now have a nice clean dataframe of the top monthly posts for our chosen subreddits, letting us see which domains racked up the highest scores and the most top posts. We threw together a small tool with the output if you'd like to dig through the data yourself.
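
To surface those domains directly from df_final, one more aggregation does the trick. A short sketch using pandas named aggregation (pandas 0.25+; the posts and total_score column names are our own):

#rank domains by total score and number of top posts
domain_rank = (df_final.groupby('domain')
               .agg(posts=('id', 'count'), total_score=('score', 'sum'))
               .sort_values('total_score', ascending=False))
domain_rank.head()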