## Context
Scraping Reddit data has become increasingly challenging. In June 2023, Reddit made significant changes to its API policy, effectively ending the free access that previously powered popular libraries such as PRAW (Python Reddit API Wrapper) and other tools developers relied on to collect Reddit data easily.
![[reddit-block-image.png]]
This move has forced many developers and researchers to seek alternative methods of data collection. We're going to look at an innovative approach that uses a real browser environment to overcome these limitations, as demonstrated in [[RedditPulse]].
## The Challenge
Reddit employs various techniques to prevent scraping:
1. Request blocking and rate limiting
2. Browser fingerprinting
3. Other anti-bot technologies, including login requirements for certain content
Mimicking browser requests by reverse-engineering them is becoming increasingly difficult and unreliable.
## A Solution
Let's look at a method that uses a real browser environment to scrape Reddit data. It is an implementation of the [[Chrome Extension-Based Dynamic Content Scraping System]]. Here's an overview of the approach (a sketch of the assumed project layout follows the list):
1. Log into a real Reddit account using Chrome on X11 on a remote Ubuntu server.
2. Use Xvfb (X Virtual Framebuffer) to open Chrome with a custom scraping extension.
3. The extension performs the scraping and outputs data to a specified directory.
4. #Python scripts process this data, transform it, and load it into MongoDB.
5. The processed data is then displayed on the frontend of the application.
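The paths referenced in the scripts and cron entry below suggest a project layout roughly like the following sketch; any detail not visible in those paths is an assumption:
```
/home/projects/redditpulse/       # project root, per the cron entry
├── run_hot.sh                    # launches Xvfb + Chrome with the extension
├── scrape_engagement.py          # invokes run_hot.sh, then loads results into MongoDB
├── reddit2/
│   └── extension/                # the scraping Chrome extension (content.js, manifest)
└── reddit_posts.csv              # renamed copy of the extension's reddit_output.csv
```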
## Implementation Details
### 1. Browser Automation Script (run_hot.sh)
This Bash script sets up the virtual display and launches Chrome with the custom extension:
```bash
#!/bin/bash
# run_hot.sh
SUBREDDIT=$1
TIMEFRAME=$2
SCREENSHOT_PATH="/screenshots/screenshot.png"

# Find an available display number
for DISPLAY_NUMBER in {99..199}; do
    if ! [ -e "/tmp/.X${DISPLAY_NUMBER}-lock" ]; then
        break
    fi
done
export DISPLAY=":${DISPLAY_NUMBER}"

# Start Xvfb
Xvfb "$DISPLAY" -screen 0 1280x720x24 -ac -noreset & XVFB_PID=$!
echo "Xvfb running on $DISPLAY with PID $XVFB_PID"

# Launch Chrome with the scraping extension
google-chrome \
    --load-extension=reddit2/extension \
    "https://www.reddit.com/r/${SUBREDDIT}/top/?t=${TIMEFRAME}" & CHROME_PID=$!

sleep 37 # Wait for the page to load and the extension to finish scraping

# Capture a screenshot of the virtual display
scrot "$SCREENSHOT_PATH"
echo "Screenshot saved to $SCREENSHOT_PATH"

# Clean up
kill $CHROME_PID
kill $XVFB_PID
```
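Before wiring the script into the batch job, it can be run by hand to verify the setup; the subreddit and timeframe below are just example arguments:
```bash
chmod +x run_hot.sh
./run_hot.sh programming month   # scrape r/programming's top posts for the month
```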
### 2. Scraping Extension (content.js)
The Chrome extension injects JavaScript into Reddit pages to perform the actual scraping:
```javascript
// content.js
const delay = ms => new Promise(res => setTimeout(res, ms));
const scrollTimes = 11;
const scrollInterval = 1500; // ms between scrolls

async function scrapeData() {
  // Subreddit-level stats (subscribers, users currently online) included in each CSV row
  const stats = getSubredditStats();
  const allPosts = new Set();

  // Scroll and collect posts as Reddit lazy-loads them
  for (let i = 0; i < scrollTimes; i++) {
    window.scrollTo(0, document.body.scrollHeight);
    await delay(scrollInterval);
    document.querySelectorAll('shreddit-post').forEach(post => allPosts.add(post));
  }

  // Process and save data
  const csvContent = [/* ... header row ... */];
  allPosts.forEach(post => {
    // Extract post data from the custom element's attributes
    const title = post.getAttribute('post-title') || 'No Title';
    const timestamp = post.getAttribute('created-timestamp');
    const date = timestamp ? new Date(timestamp).toLocaleString() : 'Unknown Date';
    // ...
    csvContent.push([/* ... data fields ... */].map(field => `"${field}"`).join(','));
  });

  // Save as CSV via a temporary download link
  const csvBlob = new Blob([csvContent.join('\n')], { type: 'text/csv;charset=utf-8;' });
  const downloadLink = document.createElement('a');
  downloadLink.href = URL.createObjectURL(csvBlob);
  downloadLink.setAttribute('download', 'reddit_output.csv');
  document.body.appendChild(downloadLink);
  downloadLink.click();
  document.body.removeChild(downloadLink);
}

if (document.readyState === 'complete') {
  scrapeData();
} else {
  window.addEventListener('load', scrapeData);
}
```
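For Chrome to inject content.js into Reddit pages, the extension directory also needs a manifest that registers it as a content script. The project's actual manifest isn't shown here, so the following is only a minimal Manifest V3 sketch (the extension name, version, and match pattern are assumptions):
```json
{
  "manifest_version": 3,
  "name": "Reddit Scraper",
  "version": "1.0",
  "content_scripts": [
    {
      "matches": ["https://www.reddit.com/*"],
      "js": ["content.js"],
      "run_at": "document_idle"
    }
  ]
}
```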
### 3. Data Processing (scrape_engagement.py)
This Python script invokes run_hot.sh to start the scraping process and then processes the scraped data:
```python
# scrape_engagement.py
import logging
import os
import subprocess

import pandas as pd
from pymongo import MongoClient, UpdateOne

logging.basicConfig(level=logging.INFO)

# MongoDB connection (URI and database name assumed; adjust to your deployment)
client = MongoClient('mongodb://localhost:27017')
db = client['redditpulse']


def run_reddit_scraper(subreddit, timeframe):
    script_path = './run_hot.sh'
    subprocess.run([script_path, subreddit, timeframe], check=True)
    logging.info(f'Scraping complete for {subreddit}!')
    # Rename the output from our extension for this batch job
    os.rename('../reddit_output.csv', 'reddit_posts.csv')


def process_scraped_data(subreddit):
    df = pd.read_csv('reddit_posts.csv', on_bad_lines='skip')
    # Shift timestamps back by 7 hours (UTC to local time)
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce') - pd.Timedelta(hours=7)
    df = df.dropna(subset=['Date'])
    subreddit_id = update_subreddit_info(subreddit, df)
    update_engagement_stats(subreddit_id, df)
    update_time_analysis(subreddit_id, df)
    # ... other update functions ...


def update_engagement_stats(subreddit_id, df):
    scrape_time = pd.to_datetime(df['Scrape Time'].iloc[0])
    currently_online = df['Currently Online'].iloc[0]
    currently_subscribed = df['Currently Subscribed'].iloc[0]
    engagement_data = {
        'subreddit_id': subreddit_id,
        'timestamp': scrape_time,
        'users_online': int(currently_online),
        'subscribers_count': int(currently_subscribed),
        'posts_count': len(df),
        'comments_count': int(df['Comments'].median()),
        'upvotes_count': int(df['Upvotes'].median())
    }
    db.engagement_stats.insert_one(engagement_data)

# ... other processing functions ...


def main():
    subreddits = load_subreddits()
    for subreddit in subreddits:
        try:
            run_reddit_scraper(subreddit, 'day')
            process_scraped_data(subreddit)
        except Exception as e:
            logging.error(f'Error processing {subreddit}: {e}')
        else:
            logging.info(f'Processing completed for: {subreddit}')


if __name__ == "__main__":
    main()
```
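The helper functions called from `process_scraped_data` are elided above. As an illustration of the pattern, here is a hedged sketch of what `update_subreddit_info` might look like, assuming a `subreddits` collection keyed by subreddit name (the collection and field names are assumptions, not the project's actual schema):
```python
def update_subreddit_info(subreddit, df):
    # Uses the module-level `db` and pandas import from scrape_engagement.py.
    # Upsert one document per subreddit and return its _id so that the
    # engagement and time-analysis records can reference it.
    result = db.subreddits.update_one(
        {'name': subreddit},
        {'$set': {
            'name': subreddit,
            'subscribers': int(df['Currently Subscribed'].iloc[0]),
            'last_scraped': pd.to_datetime(df['Scrape Time'].iloc[0]),
        }},
        upsert=True,
    )
    if result.upserted_id is not None:
        return result.upserted_id
    return db.subreddits.find_one({'name': subreddit})['_id']
```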
### 4. Scheduling using crontab
We can then use crontab to schedule the batch processing jobs.
```bash
# runs at the start of each hour.
0 * * * * cd /home/projects/redditpulse && /usr/bin/python3 scrape_engagement.py >> output.log 2>&1
```
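To install the job, edit the current user's crontab and confirm the entry is in place:
```bash
crontab -e   # opens your crontab in an editor; paste the schedule line above
crontab -l   # lists installed jobs so you can confirm the entry
```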
## Conclusion
This browser-based approach to scraping Reddit data offers several advantages:
1. **Bypasses Anti-Bot Measures**: By using a real browser environment, we can interact with Reddit as a normal user would, avoiding many anti-scraping techniques.
2. **Scalability**: The use of Xvfb allows us to run this process on remote servers, enabling large-scale data collection.
3. **Flexibility**: The custom Chrome extension can easily be modified to adapt to changes in Reddit's page structure or to collect different types of data.
4. **Data Processing Pipeline**: The combination of browser-based scraping and Python-based processing allows for a robust and flexible data pipeline.
While this method requires more computational resources than traditional scraping approaches, it provides a stealthy way to access Reddit data at scale.