## Overview
This system is an advanced web scraping solution designed to extract dynamically loaded content from web platforms, particularly suited for social media and similar sites with infinite scrolling or lazy loading. It utilizes a Chrome extension for in-browser scraping and automation, combined with a virtual display setup for headful operation. This can be run on a remote VPS as an evasive scraping strategy.
![[chrome-extension-architecture-mermaid.png]]
## What is this useful for?
- Social media data collection (Reddit, Twitter, Facebook, etc.)
- Social media automation
- E-commerce product monitoring
- News aggregation from dynamic news sites
- Monitoring of dynamically updated content on various platforms
## Key Components
1. Chrome Extension: Custom-developed for scraping and automation
2. Xvfb (X Virtual Framebuffer): Virtual screen that hosts Chrome
3. Bash Scripting: To orchestrate the scraping process
4. Google Chrome: Launched via the command line for a better trust score
5. Data Processing: Python scripts for post-scrape data handling (a dependency setup sketch follows this list)
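The components above map onto a handful of system packages plus the custom extension. As a rough setup sketch (assuming a Debian/Ubuntu VPS; none of this comes from the original repo), the environment might be prepared like this:

```bash
# Hypothetical setup for a Debian/Ubuntu VPS (not taken from the original repo)
sudo apt-get update
sudo apt-get install -y xvfb scrot python3 python3-pip   # virtual display, screenshots, processing
# Install Google Chrome (stable channel) from Google's official .deb package
wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt-get install -y ./google-chrome-stable_current_amd64.deb
pip3 install pandas                                       # used by the post-processing scripts
```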
## Workflow
1. Bash script initiates Xvfb to create a virtual display
2. Google Chrome is launched with the custom scraping extension
3. Chrome navigates to the target URL (e.g., Reddit subreddit)
4. The extension performs automated actions (scrolling, clicking, etc.)
5. Data is collected by the extension as the page dynamically loads
6. The extension outputs collected data to a CSV file
7. Python scripts process the CSV for further data manipulation (a sketch of a full run follows this list)
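As a rough sketch of what a full run might look like, assuming `run_hot.sh` is the bash script shown under Code Examples and `process.py` is a hypothetical wrapper around the Python processing functions shown there:

```bash
# Hypothetical end-to-end run; 'wallstreetbets' and 'month' are example arguments only
./run_hot.sh wallstreetbets month   # Xvfb + Chrome + extension collect posts and a screenshot
python3 process.py wallstreetbets   # assumed wrapper: moves reddit_output.csv and cleans it
```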
## Key Features
- Uses native Google Chrome for improved trust score against bot detection
- Chrome extension allows for precise in-page interactions and data extraction
- Headful operation via Xvfb enables server-side deployment
- Scalable to handle various social media platforms and dynamic websites
- Outputs data in CSV format for easy integration with data processing pipelines
## Technical Details
- Chrome extension developed using content scripts for in-page operations
- Bash script manages Xvfb setup and Chrome launch
- Xvfb allows headful Chrome instances to run on a display-less server; headful Chrome generally earns a higher trust score than headless Chrome
- Utilizes scrot for capturing screenshots of the virtual display
- Allows for customization of scraping parameters (e.g., subreddit, timeframe)
- Can be easily adapted for different websites by modifying the extension and script
## Advantages
- More trusted by websites than Puppeteer-driven browsers
- Highly customizable through Chrome extension development
- Can handle complex, dynamically loaded content effectively
- Operates in a true browser environment, closely mimicking user behavior
- Easier to maintain and update compared to full browser automation frameworks
This system represents an innovative approach to web scraping, leveraging browser extensions and virtual displays to create a robust, scalable, and stealthy data collection solution.
# Code Examples
- The following examples are pulled from [[RedditPulse]].
## Bash Script to Start the Browser
```bash
#!/bin/bash
# run_hot.sh
# Assign command-line arguments to variables
SUBREDDIT=$1
TIMEFRAME=$2
# Define the path where the screenshot will be saved
SCREENSHOT_PATH="screenshots/screenshot.png"
# Find an available display number
for DISPLAY_NUMBER in {99..199}; do
    if ! [ -e "/tmp/.X${DISPLAY_NUMBER}-lock" ]; then
        break
    fi
done
export DISPLAY=":${DISPLAY_NUMBER}"
# Start Xvfb manually on the found display
Xvfb $DISPLAY -screen 0 1280x720x24 -ac -noreset & XVFB_PID=$!
echo "Xvfb running on $DISPLAY with PID $XVFB_PID"
# Launch Google Chrome on the virtual display with the scraping extension loaded
google-chrome \
    --load-extension=/path/to/extension \
    "https://www.reddit.com/r/${SUBREDDIT}/top/?t=${TIMEFRAME}" &
CHROME_PID=$!
echo "sleeping"
# Wait for the page to load
sleep 37 #arb value
echo "awake!"
# Use scrot to capture a screenshot of the virtual display
scrot "$SCREENSHOT_PATH"
# Notify the user or log that the screenshot has been taken
echo "Screenshot saved to $SCREENSHOT_PATH"
# Clean up: close Chrome, then shut down the virtual display
kill $CHROME_PID
kill $XVFB_PID
```
## Content Script (Chrome extension)
```javascript
const delay = ms => new Promise(res => setTimeout(res, ms));
const scrollTimes = 11;
const scrollInterval = 1500;
async function scrollToBottom() {
  for (let i = 0; i < scrollTimes; i++) {
    window.scrollTo(0, document.body.scrollHeight);
    await delay(scrollInterval);
  }
}

function getSubredditStats() {
  const headerElement = document.querySelector('shreddit-subreddit-header');
  if (headerElement) {
    const subscribers = headerElement.getAttribute('subscribers') || '0';
    const active = headerElement.getAttribute('active') || '0';
    console.log('Subscribers:', subscribers);
    console.log('Currently Active:', active);
    return { subscribers, active };
  } else {
    console.log('Could not find shreddit-subreddit-header element');
    return { subscribers: '0', active: '0' };
  }
}
async function scrapeData() {
  const stats = getSubredditStats();
  let allPosts = new Set();
  // Scroll repeatedly so Reddit's lazy loading renders more posts, collecting them as we go
  for (let i = 0; i < scrollTimes; i++) {
    window.scrollTo(0, document.body.scrollHeight);
    await delay(scrollInterval);
    document.querySelectorAll('shreddit-post').forEach(post => allPosts.add(post));
  }
  // Record the scrape time shifted to Arizona time (UTC-7)
  const now = new Date().toISOString();
  const azTime = new Date(now);
  azTime.setHours(azTime.getHours() - 7);
  const azTimeString = azTime.toISOString();
  const urlParams = new URLSearchParams(window.location.search);
  const pathSegments = window.location.pathname.split('/');
  const sort = pathSegments[pathSegments.length - 2] || 'hot';
  const timeframe = urlParams.get('t') || 'day';
  const csvContent = [['Title', 'Date', 'Author', 'Post Text', 'Upvotes', 'Comments', 'Subreddit', 'Sort', 'Timeframe', 'Currently Online', 'Currently Subscribed', 'Scrape Time'].join(',')];
  allPosts.forEach(post => {
    const title = post.getAttribute('post-title') || 'No Title';
    const date = new Date(post.getAttribute('created-timestamp')).toLocaleString() || 'Unknown Date';
    const subreddit = post.getAttribute('subreddit-prefixed-name') || 'Unknown Subreddit';
    const postId = post.getAttribute('id') || 'Unknown Post ID';
    const url = post.getAttribute('content-href') || `https://www.reddit.com/r/${subreddit}/comments/${postId}`;
    const author = post.getAttribute('author') || 'Unknown Author';
    // Match the post body by the stable suffix of its id rather than one hard-coded post id
    const postText = post.querySelector('[id$="-post-rtjson-content"]')?.textContent.trim().replace(/[\r\n]+/g, ' ') || '';
    const upvotes = post.getAttribute('score') || '0';
    const comments = post.getAttribute('comment-count') || '0';
    // Quote every field and escape embedded quotes so commas/quotes in titles don't break the CSV
    csvContent.push([title, date, author, postText, upvotes, comments, subreddit, sort, timeframe, stats.active, stats.subscribers, azTimeString]
      .map(field => `"${String(field).replace(/"/g, '""')}"`).join(','));
  });
  const csvBlob = new Blob([csvContent.join('\n')], { type: 'text/csv;charset=utf-8;' });
  // Trigger a download of the CSV through a temporary anchor element
  const downloadLink = document.createElement('a');
  downloadLink.href = URL.createObjectURL(csvBlob);
  downloadLink.setAttribute('download', 'reddit_output.csv');
  document.body.appendChild(downloadLink);
  downloadLink.click();
  document.body.removeChild(downloadLink);
}
if (document.readyState === 'complete') {
  scrapeData();
} else {
  window.addEventListener('load', scrapeData);
}
```
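For Chrome to load the content script above via `--load-extension`, it has to be registered in a manifest. The manifest isn't shown in the [[RedditPulse]] excerpts here, so the layout below is only an assumed minimal Manifest V3 sketch:

```bash
# Assumed minimal extension layout; the real manifest in RedditPulse may differ
mkdir -p /path/to/extension
cat > /path/to/extension/manifest.json <<'EOF'
{
  "manifest_version": 3,
  "name": "Reddit Scraper",
  "version": "1.0",
  "content_scripts": [{
    "matches": ["https://www.reddit.com/*"],
    "js": ["content.js"],
    "run_at": "document_idle"
  }]
}
EOF
# content.js holds the content script shown above, saved in the same directory
```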
## Python Processing
```python
import logging
import os
import subprocess

import pandas as pd

def run_reddit_scraper(subreddit, timeframe):
    # Launch the headful scraping run (Xvfb + Chrome + extension)
    script_path = './run_hot.sh'
    subprocess.run([script_path, subreddit, timeframe], check=True)
    logging.info(f'Scraping complete for {subreddit}!')
    # The extension downloads reddit_output.csv; move it next to the processing code
    os.rename('../reddit_output.csv', 'reddit_posts.csv')

def process_scraped_data(subreddit):
    df = pd.read_csv('reddit_posts.csv', on_bad_lines='skip')
    # Shift post dates by 7 hours (Arizona time, matching the extension's Scrape Time)
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce') - pd.Timedelta(hours=7)
    df = df.dropna(subset=['Date'])
    time_of_scrape = pd.to_datetime(df['Scrape Time'].iloc[0])
```