Scraping for your Lego Set with AgentQL
Disclaimer: Your partner may not like this blog post.
In this blog post I will show you how to convince your partner that the Lego set you are craving is worth buying (which it isn’t, btw). We will explore AgentQL, an AI-powered query language for scraping websites and automating workflows that uses natural language queries to pinpoint data and elements on any web page. We will then analyze the scraped data with a BERT model fine-tuned on tweets to get the sentiment. If the majority of the comments are positive, we buy the set.
Scraping the data
If you have done some scraping, you know very well that it can be pretty frustrating and annoying: every website is different, and there is no single way to scrape them all. But what if you could leverage natural language and LLMs to make scraping easier? One tool I found for this exercise is AgentQL. Its usage is pretty straightforward, and it has a decent free tier for you to try it out. The only thing you need to start playing is an API key.
You can create the API key from here.
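Once you have the key, here is a minimal setup sketch (package names are taken from the imports below; AGENTQL_API_KEY is the environment variable the AgentQL SDK reads, but double-check the current docs):
# Install the dependencies used in this post:
#   pip install agentql playwright transformers torch numpy
#   playwright install chromium
# Then expose your AgentQL API key, e.g. via an environment variable:
import os
os.environ["AGENTQL_API_KEY"] = "your-api-key-here"  # placeholder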
Libraries used
Below is the list of libraries we will use, along with the dataclasses we will interact with. The most important piece is the GraphQLQueries class, which lists the AgentQL queries (they use a GraphQL-like syntax) used by the scraper. For example, COOKIES_FORM lets the agent find and click the reject button on the cookie-consent form that appears when opening YouTube with Playwright.
from typing import List, Dict, Optional, Any, Union
import logging
import agentql
from dataclasses import dataclass
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from playwright.sync_api import Page, Browser, sync_playwright
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)
@dataclass
class YouTubeComment:
"""Represents a YouTube comment with channel name and text."""
channel_name: str
comment_text: str
@dataclass
class YouTubeVideo:
"""Represents a YouTube video with basic metadata."""
video_link: Any # Playwright element
video_title: Any # Playwright element
channel_name: str
@dataclass
class SentimentResult:
"""
Container for sentiment analysis results.
Attributes:
text: The analyzed text
sentiment: The predicted sentiment (POSITIVE, NEGATIVE, NEUTRAL)
score: Confidence score for the prediction
compound_score: Normalized score between -1 and 1
"""
text: str
sentiment: str
score: float
compound_score: float
class GraphQLQueries:
"""Container class for all GraphQL queries used in the scraper."""
COOKIES_FORM = """
{
cookies_form {
reject_btn
}
}
"""
SEARCH = """
{
search_input
search_btn
}
"""
VIDEO = """
{
videos[] {
video_link
video_title
channel_name
}
}
"""
VIDEO_CONTROLS = """
{
play_or_pause_btn
expand_description_btn
}
"""
DESCRIPTION = """
{
description_text
}
"""
COMMENTS = """
{
comments[] {
channel_name
comment_text
}
}
"""
Create the Scraper
The objective is to obtain feedback on a specific Lego set, Barad-dûr: we want to understand what other people think about it, and what better place to do this than YouTube. The official AgentQL page already has example code; the version below simply organizes it into a class.
class YouTubeScraper:
"""
A class to scrape YouTube videos and perform sentiment analysis on comments.
Dependencies:
- agentql
- playwright
- transformers
- torch
Attributes:
url (str): The base YouTube URL
page (Page): The Playwright page object
sentiment_analyzer: The sentiment analysis model
scroll_count (int): Number of times to scroll down to load comments
type_delay (int): Delay between keystrokes when typing in search
"""
def __init__(
self,
url: str = "https://www.youtube.com",
scroll_count: int = 15,
type_delay: int = 75
):
"""
Initialize the YouTube scraper.
Args:
url (str): The base YouTube URL to start from
scroll_count (int): Number of times to scroll down to load comments
type_delay (int): Delay between keystrokes when typing in search
"""
self.url = url
self.scroll_count = scroll_count
self.type_delay = type_delay
self.page: Optional[Page] = None
self.sentiment_analyzer = SentimentAnalyzer()
self._playwright = None
self._browser = None
def setup_browser(self) -> None:
"""Set up the browser and navigate to the initial page."""
try:
self._playwright = sync_playwright().start()
self._browser = self._playwright.chromium.launch(headless=False)
self.page = agentql.wrap(self._browser.new_page())
self.page.goto(self.url)
except Exception as e:
log.error(f"Failed to setup browser: {e}")
self.cleanup()
raise RuntimeError(f"Browser setup failed: {e}")
def cleanup(self) -> None:
"""Clean up browser resources."""
if self._browser:
try:
self._browser.close()
except Exception as e:
log.error(f"Error closing browser: {e}")
if self._playwright:
try:
self._playwright.stop()
except Exception as e:
log.error(f"Error stopping playwright: {e}")
def handle_cookies(self) -> None:
"""Handle the cookies consent form if present."""
response = self.page.query_elements(GraphQLQueries.COOKIES_FORM)
if response.cookies_form.reject_btn is not None:
response.cookies_form.reject_btn.click()
def search_video(self, search_term: str) -> None:
"""Search for a video using the given search term."""
response = self.page.query_elements(GraphQLQueries.SEARCH)
response.search_input.type(search_term, delay=self.type_delay)
response.search_btn.click()
def click_first_video(self) -> str:
"""Click the first video in search results."""
response = self.page.query_elements(GraphQLQueries.VIDEO)
video_title = response.videos[0].video_title.text_content()
log.debug(f"Clicking Youtube Video: {video_title}")
response.videos[0].video_link.click()
return video_title
def expand_description(self) -> str:
"""Expand and get the video description."""
response = self.page.query_elements(GraphQLQueries.VIDEO_CONTROLS)
response.expand_description_btn.click()
response_data = self.page.query_data(GraphQLQueries.DESCRIPTION)
description = response_data['description_text']
log.debug(f"Captured the following description: \n{description}")
return description
def scroll_to_comments(self) -> None:
"""Scroll down to load comments."""
log.debug(f"Scrolling {self.scroll_count} times to load comments...")
for i in range(self.scroll_count):
self.page.keyboard.press("PageDown")
self.page.wait_for_page_ready_state()
log.debug(f"Completed scroll {i + 1}/{self.scroll_count}")
def get_comments(self) -> List[Dict[str, str]]:
"""Get all loaded comments."""
response = self.page.query_data(GraphQLQueries.COMMENTS)
comments = response.get("comments", [])
log.debug(f"Captured {len(comments)} comments!")
return comments
def analyze_comments(self, comments: List[Dict[str, str]]) -> Dict:
"""
Analyze sentiment of comments using the enhanced analyzer.
Args:
comments: List of comment dictionaries with 'comment_text' field
Returns:
Dict containing detailed sentiment analysis results
"""
texts = [comment["comment_text"] for comment in comments]
results = self.sentiment_analyzer.analyze_texts(texts)
return self.sentiment_analyzer.get_sentiment_summary(results)
def run(self, search_term: str) -> Dict:
"""
Run the complete scraping and analysis process.
Args:
search_term (str): The search term to use for finding a video
Returns:
Dict: Detailed sentiment analysis results
Raises:
Exception: If any step in the process fails
"""
try:
self.setup_browser()
self.handle_cookies()
self.search_video(search_term)
video_title = self.click_first_video()
description = self.expand_description()
self.scroll_to_comments()
comments = self.get_comments()
sentiment_results = self.analyze_comments(comments)
return {
"video_title": video_title,
"total_comments": len(comments),
"sentiment_analysis": sentiment_results
}
except Exception as e:
log.error(f"Found Error: {e}")
raise e
finally:
self.cleanup()
A couple of notes here:
scroll_count: int = 15 -> the number of times the agent scrolls through the comment section to load more comments. The scraper clicks the first video returned by the YouTube search.
In __init__ there is self.sentiment_analyzer = SentimentAnalyzer(), which instantiates the sentiment analysis class we will create below to analyze the scraped comments.
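If a video has a long comment thread, these knobs can be tuned when constructing the scraper; a small usage sketch with the parameters defined above:
# Scroll more to load extra comments, and type the search query a bit faster.
scraper = YouTubeScraper(scroll_count=30, type_delay=50)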
Sentiment Analysis Class
The sentiment analysis will be done with a RoBERTa-based model from Hugging Face, fine-tuned on English tweets.
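Before wrapping it in a class, you can sanity-check the model with the plain transformers pipeline; note that the model card’s raw labels are NEG/NEU/POS, which the class below remaps to NEGATIVE/NEUTRAL/POSITIVE:
from transformers import pipeline

# Downloads the model weights on first use.
sentiment = pipeline(
    "sentiment-analysis",
    model="finiteautomata/bertweet-base-sentiment-analysis",
)
print(sentiment("This set looks amazing!"))
# e.g. [{'label': 'POS', 'score': 0.99}]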
class SentimentAnalyzer:
"""
Sentiment analysis using RoBERTa model fine-tuned for sentiment analysis.
"""
def __init__(
self,
model_name: str = "finiteautomata/bertweet-base-sentiment-analysis",
batch_size: int = 8,
device: Optional[str] = None
):
"""
Initialize the sentiment analyzer.
Args:
model_name: The pretrained model to use
batch_size: Number of texts to process simultaneously
device: Device to run model on ('cuda' or 'cpu'). If None, automatically detected.
"""
self.batch_size = batch_size
self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
log.info(f"Using device: {self.device}")
try:
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
self.model.to(self.device)
log.info(f"Loaded model: {model_name}")
except Exception as e:
log.error(f"Failed to load model: {e}")
raise
self.id2label = {
0: "NEGATIVE",
1: "NEUTRAL",
2: "POSITIVE"
}
def _batch_texts(self, texts: List[str]) -> List[List[str]]:
"""Split texts into batches for efficient processing."""
return [texts[i:i + self.batch_size] for i in range(0, len(texts), self.batch_size)]
def _preprocess_text(self, text: str) -> str:
"""Preprocess text before analysis."""
return text.strip()
def _compute_compound_score(self, scores: np.ndarray) -> float:
"""Compute a compound sentiment score between -1 and 1."""
probs = torch.nn.functional.softmax(torch.tensor(scores), dim=0).numpy()
weights = np.array([-1.0, 0.0, 1.0])
return float(np.sum(probs * weights))
@torch.no_grad()
def analyze_batch(self, texts: List[str]) -> List[SentimentResult]:
"""Analyze sentiment for a batch of texts."""
processed_texts = [self._preprocess_text(text) for text in texts]
inputs = self.tokenizer(
processed_texts,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt"
).to(self.device)
outputs = self.model(**inputs)
predictions = outputs.logits.cpu().numpy()
results = []
for text, prediction in zip(texts, predictions):
label_id = np.argmax(prediction)
sentiment = self.id2label[label_id]
score = float(np.max(torch.nn.functional.softmax(torch.tensor(prediction), dim=0).numpy()))
compound = self._compute_compound_score(prediction)
results.append(SentimentResult(
text=text,
sentiment=sentiment,
score=score,
compound_score=compound
))
return results
def analyze_texts(self, texts: Union[str, List[str]]) -> Union[SentimentResult, List[SentimentResult]]:
"""Analyze sentiment for one or more texts."""
single_text = isinstance(texts, str)
texts_to_process = [texts] if single_text else texts
all_results = []
for batch in self._batch_texts(texts_to_process):
results = self.analyze_batch(batch)
all_results.extend(results)
return all_results[0] if single_text else all_results
def get_sentiment_summary(self, results: List[SentimentResult]) -> Dict:
"""Generate summary statistics for a set of sentiment results."""
total = len(results)
if total == 0:
return {
"total_analyzed": 0,
"sentiment_distribution": {},
"dominant_sentiment": "N/A",
"average_compound_score": 0.0,
"confidence": {"high_confidence": 0, "medium_confidence": 0, "low_confidence": 0}
}
sentiment_counts = {
"POSITIVE": sum(1 for r in results if r.sentiment == "POSITIVE"),
"NEUTRAL": sum(1 for r in results if r.sentiment == "NEUTRAL"),
"NEGATIVE": sum(1 for r in results if r.sentiment == "NEGATIVE")
}
avg_compound = sum(r.compound_score for r in results) / total
return {
"total_analyzed": total,
"sentiment_distribution": {
k: f"{(v/total)*100:.1f}%" for k, v in sentiment_counts.items()
},
"dominant_sentiment": max(sentiment_counts.items(), key=lambda x: x[1])[0],
"average_compound_score": avg_compound,
"confidence": {
"high_confidence": sum(1 for r in results if r.score > 0.8),
"medium_confidence": sum(1 for r in results if 0.5 < r.score <= 0.8),
"low_confidence": sum(1 for r in results if r.score <= 0.5)
}
}
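A quick smoke test of the analyzer on a single string (analyze_texts returns a single SentimentResult for a str input, and a list otherwise):
analyzer = SentimentAnalyzer()
result = analyzer.analyze_texts("This set looks absolutely stunning!")
print(result.sentiment, f"score={result.score:.2f}", f"compound={result.compound_score:+.2f}")
# The compound score maps the softmax probabilities onto [-1, 1]
# using weights (-1, 0, +1) for NEGATIVE / NEUTRAL / POSITIVE.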
Now that we have the two classes created, let’s put them into action and see what happens.
LET’S SCRAPE
if __name__ == "__main__":
scraper = YouTubeScraper()
video_analysis_results = scraper.run("Lego Barad Dur Review")
print(f"\nVideo: {video_analysis_results['video_title']}")
print(f"Total comments analyzed: {video_analysis_results['sentiment_analysis']['total_analyzed']}")
print("\nSentiment Distribution:")
for sentiment, percentage in video_analysis_results['sentiment_analysis']['sentiment_distribution'].items():
print(f"{sentiment}: {percentage}")
print(f"\nDominant Sentiment: {video_analysis_results['sentiment_analysis']['dominant_sentiment']}")
print(f"Average Sentiment Score: {video_analysis_results['sentiment_analysis']['average_compound_score']:.2f}")
confidence = video_analysis_results['sentiment_analysis']['confidence']
print("\nConfidence Levels:")
print(f"High confidence predictions: {confidence['high_confidence']}")
print(f"Medium confidence predictions: {confidence['medium_confidence']}")
print(f"Low confidence predictions: {confidence['low_confidence']}")
if video_analysis_results['sentiment_analysis']['dominant_sentiment'] == "POSITIVE":
print("YES, BUY THE SET MATE!")
else:
print("NO, JUST DO NOT BUT IT...")
Let’s see it in action. The results:
Video: LEGO Lord of the Rings Barad-dur (Review)
Total comments analyzed: 40
Sentiment Distribution:
POSITIVE: 50.0%
NEUTRAL: 42.5%
NEGATIVE: 7.5%
Dominant Sentiment: POSITIVE
Average Sentiment Score: 0.40
Confidence Levels:
High confidence predictions: 30
Medium confidence predictions: 9
Low confidence predictions: 1
YES, BUY THE SET MATE!
Well, I guess we need to order this set then. :)