## Overview This system is an advanced web scraping solution designed to extract dynamically loaded content from web platforms, particularly suited for social media and similar sites with infinite scrolling or lazy loading. It utilizes a Chrome extension for in-browser scraping and automation, combined with a virtual display setup for headful operation. This can be run on a remote VPS as an evasive scraping strategy. ![[chrome-extension-architecture-mermaid.png]] ## What is this useful for? - Social media data collection (Reddit, Twitter, Facebook, etc.) - Social media automation - E-commerce product monitoring - News aggregation from dynamic news sites - Monitoring of dynamically updated content on various platforms ## Key Components 1. Chrome Extension: Custom-developed for scraping and automation 2. Xvfb (X Virtual Framebuffer): Virtual screen that hosts Chrome 3. Bash Scripting: To orchestrate the scraping process 4. Google Chrome: Launched via command line for better trust score 5. Data Processing: Python scripts for post-scrape data handling ## Workflow 1. Bash script initiates Xvfb to create a virtual display 2. Google Chrome is launched with the custom scraping extension 3. Chrome navigates to the target URL (e.g., Reddit subreddit) 4. The extension performs automated actions (scrolling, clicking, etc.) 5. Data is collected by the extension as the page dynamically loads 6. The extension outputs collected data to a CSV file 7. Python scripts process the CSV for further data manipulation ## Key Features - Uses native Google Chrome for improved trust score against bot detection - Chrome extension allows for precise in-page interactions and data extraction - Headful operation via Xvfb enables server-side deployment - Scalable to handle various social media platforms and dynamic websites - Outputs data in CSV format for easy integration with data processing pipelines ## Technical Details - Chrome extension developed using content scripts for in-page operations - Bash script manages Xvfb setup and Chrome launch - Xvfb enables us to run headful instances of Chrome, which generally have a higher trust score. - Utilizes scrot for capturing screenshots of the virtual display - Allows for customization of scraping parameters (e.g., subreddit, timeframe) - Can be easily adapted for different websites by modifying the extension and script ## Advantages - More trusted by websites compared to Puppeteer-driven browsers - Highly customizable through Chrome extension development - Can handle complex, dynamically loaded content effectively - Operates in a true browser environment, closely mimicking user behavior - Easier to maintain and update compared to full browser automation frameworks This system represents an innovative approach to web scraping, leveraging browser extensions and virtual displays to create a robust, scalable, and stealthy data collection solution. # Code Examples - The following examples are pulled from [[RedditPulse]]. ## Bash Script to start Browser ```bash #!/bin/bash #run_hot.sh # Assign command-line arguments to variables SUBREDDIT=$1 TIMEFRAME=$2 # Define the path where the screenshot will be saved SCREENSHOT_PATH="screenshots/screenshot.png" # Find an available display number for DISPLAY_NUMBER in {99..199} do if ! [ -e "/tmp/.X${DISPLAY_NUMBER}-lock" ]; then break fi done export DISPLAY=":${DISPLAY_NUMBER}" # Start Xvfb manually on the found display Xvfb $DISPLAY -screen 0 1280x720x24 -ac -noreset & XVFB_PID=$! echo "Xvfb running on $DISPLAY with PID $XVFB_PID" # Launch Google Chrome with the scraping extension google-chrome \ --load-extension=/path/to/extension \ --url "https://www.reddit.com/r/${SUBREDDIT}/top/?t=month" & echo "sleeping" # Wait for the page to load sleep 37 #arb value echo "awake!" # Use scrot to capture a screenshot of the virtual display scrot "$SCREENSHOT_PATH" # Notify the user or log that the screenshot has been taken echo "Screenshot saved to $SCREENSHOT_PATH" # Clean up Xvfb kill $XVFB_PID ``` ## Content Script (Chrome extension) ```javascript const delay = ms => new Promise(res => setTimeout(res, ms)); const scrollTimes = 11; const scrollInterval = 1500; async function scrollToBottom() { for (let i = 0; i < scrollTimes; i++) { window.scrollTo(0, document.body.scrollHeight); await delay(scrollInterval); } } function getSubredditStats() { const headerElement = document.querySelector('shreddit-subreddit-header'); if (headerElement) { const subscribers = headerElement.getAttribute('subscribers') || '0'; const active = headerElement.getAttribute('active') || '0'; console.log('Subscribers:', subscribers); console.log('Currently Active:', active); return { subscribers, active }; } else { console.log('Could not find shreddit-subreddit-header element'); return { subscribers: '0', active: '0' }; } } async function scrapeData() { const stats = getSubredditStats(); let allPosts = new Set(); for (let i = 0; i < scrollTimes; i++) { window.scrollTo(0, document.body.scrollHeight); await delay(scrollInterval); document.querySelectorAll('shreddit-post').forEach(post => allPosts.add(post)); } const now = new Date().toISOString(); const azTime = new Date(now); azTime.setHours(azTime.getHours() - 7); const azTimeString = azTime.toISOString(); const urlParams = new URLSearchParams(window.location.search); const pathSegments = window.location.pathname.split('/'); const sort = pathSegments[pathSegments.length - 2] || 'hot'; const timeframe = urlParams.get('t') || 'day'; const csvContent = [['Title', 'Date', 'Author', 'Post Text', 'Upvotes', 'Comments', 'Subreddit', 'Sort', 'Timeframe', 'Currently Online', 'Currently Subscribed', 'Scrape Time'].join(',')]; allPosts.forEach(post => { const title = post.getAttribute('post-title') || 'No Title'; const date = new Date(post.getAttribute('created-timestamp')).toLocaleString() || 'Unknown Date'; const subreddit = post.getAttribute('subreddit-prefixed-name') || 'Unknown Subreddit'; const postId = post.getAttribute('id') || 'Unknown Post ID'; const url = post.getAttribute('content-href') || `https://www.reddit.com/r/${subreddit}/comments/${postId}`; const author = post.getAttribute('author') || 'Unknown Author'; const postText = post.querySelector('#t3_1c4pvie-post-rtjson-content')?.textContent.trim().replace(/[\r\n]+/g, ' ') || ''; const upvotes = post.getAttribute('score') || '0'; const comments = post.getAttribute('comment-count') || '0'; csvContent.push([title, date, author, postText, upvotes, comments, subreddit, sort, timeframe, stats.active, stats.subscribers, azTimeString].map(field => `"${field}"`).join(',')); }); const csvBlob = new Blob([csvContent.join('\n')], { type: 'text/csv;charset=utf-8;' }); const downloadLink = document.createElement('a'); downloadLink.href = URL.createObjectURL(csvBlob); downloadLink.setAttribute('download', 'reddit_output.csv'); document.body.appendChild(downloadLink); downloadLink.click(); document.body.removeChild(downloadLink); } if (document.readyState === 'complete') { scrapeData(); } else { window.addEventListener('load', scrapeData); } ``` ## Python Processing ```python def run_reddit_scraper(subreddit, timeframe): script_path = './run_hot.sh' subprocess.run([script_path, subreddit, timeframe], check=True) logging.info(f'Scraping complete for {subreddit}!') os.rename('../reddit_output.csv', 'reddit_posts.csv') def process_scraped_data(subreddit): df = pd.read_csv('reddit_posts.csv', on_bad_lines='skip') df['Date'] = pd.to_datetime(df['Date'], errors='coerce') - pd.Timedelta(hours=7) df = df.dropna(subset=['Date']) time_of_scrape = pd.to_datetime(df['Scrape Time'].iloc[0]) ```