## Context

Scraping Reddit data has become increasingly challenging, especially in light of recent developments. In June 2023, Reddit implemented significant changes to its API policies, effectively shutting down the free access that previously powered popular libraries like PRAW (Python Reddit API Wrapper) and other tools that allowed developers to easily access Reddit data.

![[reddit-block-image.png]]

This move has forced many developers and researchers to seek alternative methods for data collection. We're going to look at an innovative approach that uses a real browser environment to overcome these limitations, as demonstrated in [[RedditPulse]].

## The Challenge

Reddit employs various techniques to prevent scraping:

1. Request blocking
2. Complex browser fingerprinting
3. Anti-bot technologies

Mimicking browser requests by reverse-engineering them is becoming increasingly difficult and unreliable.

## A Solution

Let's look at a method that uses a real browser environment to scrape Reddit data. Here's an overview of the approach, which is an implementation of the [[Chrome Extension-Based Dynamic Content Scraping System]]:

1. Log into a real Reddit account using Chrome on X11 on a remote Ubuntu server.
2. Use Xvfb (X Virtual Framebuffer) to open Chrome with a custom scraping extension.
3. The extension performs the scraping and outputs data to a specified directory.
4. #Python scripts process this data, transform it, and load it into MongoDB.
5. The processed data is then displayed on the frontend of the application.

## Implementation Details

### 1. Browser Automation Script (run_hot.sh)

This Bash script sets up the virtual display and launches Chrome with the custom extension:

```bash
#!/bin/bash
# run_hot.sh

SUBREDDIT=$1
TIMEFRAME=$2
SCREENSHOT_PATH="/screenshots/screenshot.png"

# Find an available display number
for DISPLAY_NUMBER in {99..199}
do
  if ! [ -e "/tmp/.X${DISPLAY_NUMBER}-lock" ]; then
    break
  fi
done

export DISPLAY=":${DISPLAY_NUMBER}"

# Start Xvfb
Xvfb $DISPLAY -screen 0 1280x720x24 -ac -noreset &
XVFB_PID=$!
echo "Xvfb running on $DISPLAY with PID $XVFB_PID"

# Launch Chrome with the scraping extension, pointing it at the requested subreddit and timeframe
google-chrome \
  --load-extension=reddit2/extension \
  "https://www.reddit.com/r/${SUBREDDIT}/top/?t=${TIMEFRAME}" &

sleep 37 # Wait for the page to load and the extension to finish

# Capture screenshot
scrot "$SCREENSHOT_PATH"
echo "Screenshot saved to $SCREENSHOT_PATH"

# Clean up
kill $XVFB_PID
```

### 2. Scraping Extension (content.js)

The Chrome extension injects JavaScript into Reddit pages to perform the actual scraping:

```javascript
// content.js
const delay = ms => new Promise(res => setTimeout(res, ms));
const scrollTimes = 11;
const scrollInterval = 1500;

async function scrapeData() {
  // Subreddit-level stats (helper defined elsewhere in the extension)
  const stats = getSubredditStats();
  let allPosts = new Set();

  // Scroll and collect posts
  for (let i = 0; i < scrollTimes; i++) {
    window.scrollTo(0, document.body.scrollHeight);
    await delay(scrollInterval);
    document.querySelectorAll('shreddit-post').forEach(post => allPosts.add(post));
  }

  // Process and save data
  const csvContent = [/* ... header row ... */];
  allPosts.forEach(post => {
    // Extract post data
    const title = post.getAttribute('post-title') || 'No Title';
    const date = new Date(post.getAttribute('created-timestamp')).toLocaleString() || 'Unknown Date';
    // ...
    csvContent.push([/* ... data fields ... */].map(field => `"${field}"`).join(','));
  });

  // Save as CSV
  const csvBlob = new Blob([csvContent.join('\n')], { type: 'text/csv;charset=utf-8;' });
  const downloadLink = document.createElement('a');
  downloadLink.href = URL.createObjectURL(csvBlob);
  downloadLink.setAttribute('download', 'reddit_output.csv');
  document.body.appendChild(downloadLink);
  downloadLink.click();
  document.body.removeChild(downloadLink);
}

if (document.readyState === 'complete') {
  scrapeData();
} else {
  window.addEventListener('load', scrapeData);
}
```

### 3. Data Processing (scrape_engagement.py)

This Python script invokes run_hot.sh to start the scraping, then processes the scraped data:

```python
# scrape_engagement.py
import os
import subprocess
import logging

import pandas as pd
from pymongo import MongoClient, UpdateOne

# Connection details are not shown here; adjust to your deployment
db = MongoClient('mongodb://localhost:27017')['redditpulse']

def run_reddit_scraper(subreddit, timeframe):
    script_path = './run_hot.sh'
    subprocess.run([script_path, subreddit, timeframe], check=True)
    logging.info(f'Scraping complete for {subreddit}!')
    # Rename the output from our extension for this batch job
    os.rename('../reddit_output.csv', 'reddit_posts.csv')

def process_scraped_data(subreddit):
    df = pd.read_csv('reddit_posts.csv', on_bad_lines='skip')
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce') - pd.Timedelta(hours=7)
    df = df.dropna(subset=['Date'])

    subreddit_id = update_subreddit_info(subreddit, df)
    update_engagement_stats(subreddit_id, df)
    update_time_analysis(subreddit_id, df)
    # ... other update functions ...

def update_engagement_stats(subreddit_id, df):
    scrape_time = pd.to_datetime(df['Scrape Time'].iloc[0])
    currently_online = df['Currently Online'].iloc[0]
    currently_subscribed = df['Currently Subscribed'].iloc[0]

    engagement_data = {
        'subreddit_id': subreddit_id,
        'timestamp': scrape_time,
        'users_online': int(currently_online),
        'subscribers_count': int(currently_subscribed),
        'posts_count': len(df),
        'comments_count': int(df['Comments'].median()),
        'upvotes_count': int(df['Upvotes'].median())
    }
    db.engagement_stats.insert_one(engagement_data)

# ... other processing functions ...

def main():
    subreddits = load_subreddits()
    for subreddit in subreddits:
        try:
            run_reddit_scraper(subreddit, 'day')
            process_scraped_data(subreddit)
        except Exception as e:
            logging.error(f'Error processing {subreddit}: {e}')
        logging.info(f'Processing completed for: {subreddit}')

if __name__ == "__main__":
    main()
```
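The script above calls two helpers that aren't shown here, `load_subreddits` and `update_subreddit_info`. As a rough idea of what they might look like, here is a minimal sketch; the `subreddits.txt` filename, the `subreddits` collection, and its fields are illustrative assumptions rather than part of the original project:

```python
# Hypothetical helper sketch -- file layout and schema are assumed, not taken from RedditPulse
import pandas as pd
from pymongo import MongoClient, ReturnDocument

db = MongoClient('mongodb://localhost:27017')['redditpulse']

def load_subreddits(path='subreddits.txt'):
    """Read the list of subreddits to scrape, one name per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def update_subreddit_info(subreddit, df):
    """Upsert basic subreddit metadata and return its MongoDB _id."""
    doc = db.subreddits.find_one_and_update(
        {'name': subreddit},
        {'$set': {
            'name': subreddit,
            'subscribers': int(df['Currently Subscribed'].iloc[0]),
            'last_scraped': pd.to_datetime(df['Scrape Time'].iloc[0]),
        }},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc['_id']
```

The upsert keeps a single document per subreddit, so repeated runs update metadata in place, while `engagement_stats` grows as a time series keyed by `subreddit_id`.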
### 4. Scheduling using crontab

We can then use crontab to schedule the batch processing jobs:

```bash
# Runs at the start of each hour.
0 * * * * cd /home/projects/redditpulse && /usr/bin/python3 scrape_engagement.py >> output.log 2>&1
```

## Conclusion

This browser-based approach to scraping Reddit data offers several advantages:

1. **Bypasses Anti-Bot Measures**: By using a real browser environment, we can interact with Reddit as a normal user would, avoiding many anti-scraping techniques.
2. **Scalability**: The use of Xvfb allows us to run this process on remote servers, enabling large-scale data collection.
3. **Flexibility**: The custom Chrome extension can be easily modified to adapt to changes in Reddit's structure or to collect different types of data.
4. **Data Processing Pipeline**: The combination of browser-based scraping and Python-based processing allows for a robust and flexible data pipeline.

While this method requires more computational resources than traditional scraping approaches, it provides a stealthy way to access Reddit data at scale.