Crawler Integration Overview

Description

The Crawler Integration component manages communication between the gb_console database and the PLG API (the Crawler system). It sends summary data configurations to the Crawler to create and update crawl jobs, and it receives information about failed crawl operations. The component uses the SendToCrawler command to process summary data and prepare it for the Crawler through either a create or an update operation.
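
The command itself is not reproduced in this document. As a rough, non-authoritative sketch of the flow it implements (the function and parameter names below are assumptions, not the actual SendToCrawler code), the create and update paths can be pictured as follows:

```python
# Illustrative sketch only: names and structure below are assumptions, not the real SendToCrawler code.
from typing import Iterable


def send_to_crawler(data_type: str, mode: str) -> None:
    """Hypothetical outline of the create/update flow for one data type."""
    if mode == "create":
        # Records never sent to the Crawler (SendingStatus=NotSent)
        records = fetch_records_with_status(data_type, sending_status="NotSent")
    elif mode == "update":
        # Records whose existing crawler configuration needs refreshing
        records = fetch_records_needing_update(data_type)
    else:
        raise ValueError(f"unknown mode: {mode}")

    payloads = [format_for_crawler(data_type, record) for record in records]  # format data for the API
    post_to_plg_api(payloads, mode=mode)                                      # send to the Crawler


# Placeholders for the persistence and HTTP layers, which this document does not describe.
def fetch_records_with_status(data_type: str, sending_status: str) -> Iterable[dict]: ...
def fetch_records_needing_update(data_type: str) -> Iterable[dict]: ...
def format_for_crawler(data_type: str, record: dict) -> dict: ...
def post_to_plg_api(payloads: list[dict], mode: str) -> None: ...
```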

System Overview Diagram

---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart TD
    %% Step 1: Database Tables at the top
    subgraph DatabaseTables["<div style='width:300px'>Database Tables (gb_console)</div>"]
        direction LR
        WishlistProducts[(summary_wishlist_products)]
        WishlistProductReviews[(summary_wishlist_product_reviews)]
        WishlistCategories[(summary_wishlist_categories)]
        WishlistSearchQueries[(summary_wishlist_search_queries)]
    end
    
    %% Step 2: Identify records with NotSent status for creation
    DatabaseTables --> CreateConfigsStep1[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
            <p style='margin-top: 8px'>Identify NotSent Records</p>
        </div>
    ]
    
    %% Step 3: Identify records needing updates (parallel to step 2)
    DatabaseTables --> UpdateConfigsStep2[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
            <p style='margin-top: 8px'>Identify Update Needed</p>
        </div>
    ]
    
    %% Step 4: Create and Update Configurations
    CreateConfigsStep1 --> CreateConfigs[Create Configurations]
    UpdateConfigsStep2 --> UpdateConfigs[Update Configurations]
    
    %% Step 5: Process data for Crawler
    CreateConfigs --> SummaryServiceStep3[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
            <p style='margin-top: 8px'>Format Data for API</p>
        </div>
    ]
    UpdateConfigs --> SummaryServiceStep3
    
    %% Step 6: Summary Service Processing
    SummaryServiceStep3 --> SummaryService[Summary Processing]
    
    %% Step 7: Send to Crawler API
    SummaryService --> CrawlerStep4[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4</span>
            <p style='margin-top: 8px'>Send to Crawler</p>
        </div>
    ]
    
    %% Step 8: Crawler API
    CrawlerStep4 --> Crawler((PLG API))
    
    %% Step 9: Fetch failed jobs (separate flow)
    Crawler --> SyncFailedStep5[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>5</span>
            <p style='margin-top: 8px'>Get Failed Jobs</p>
        </div>
    ]
    
    %% Step 10: Sync Failed Jobs
    SyncFailedStep5 --> SyncFailed[Sync Failed Jobs]
    
    %% Step 11: Update status in database
    SyncFailed --> TableUpdateStep6[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>6</span>
            <p style='margin-top: 8px'>Update Error Status</p>
        </div>
    ]
    
    %% Step 12: Back to Database Tables
    TableUpdateStep6 --> DatabaseTables
    
    style Crawler fill:#fcd9d9,stroke-width:2px
    style SummaryService fill:#fcf3d2
    style CreateConfigs fill:#d9f2d9
    style UpdateConfigs fill:#d9f2d9
    style SyncFailed fill:#d9f2d9
    style DatabaseTables fill:#ffe6cc,stroke:#ff9900,stroke-width:2px
    style WishlistProducts fill:#d9f2d9,stroke:#339933,stroke-width:1px
    style WishlistProductReviews fill:#d9f2d9,stroke:#339933,stroke-width:1px
    style WishlistCategories fill:#d9f2d9,stroke:#339933,stroke-width:1px
    style WishlistSearchQueries fill:#d9f2d9,stroke:#339933,stroke-width:1px
    style CreateConfigsStep1 fill:transparent,stroke:transparent,stroke-width:1px
    style UpdateConfigsStep2 fill:transparent,stroke:transparent,stroke-width:1px
    style SummaryServiceStep3 fill:transparent,stroke:transparent,stroke-width:1px
    style CrawlerStep4 fill:transparent,stroke:transparent,stroke-width:1px
    style SyncFailedStep5 fill:transparent,stroke:transparent,stroke-width:1px
    style TableUpdateStep6 fill:transparent,stroke:transparent,stroke-width:1px

    %% Style for arrows
    linkStyle default stroke-width:2px,stroke:#333
    
    %% Subgraph classes
    classDef databaseClass fill:#fff9db
    class DatabaseTables databaseClass

Detailed Data Flow Dependencies

The Crawler Integration component follows this data flow:

  1. The summary wishlist tables in the gb_console database (summary_wishlist_products, summary_wishlist_product_reviews, summary_wishlist_categories, summary_wishlist_search_queries) store the data that needs to be crawled
  2. The plg-api:sending-configs-to-crawler command with --mode=create identifies records with SendingStatus=NotSent that need new crawler configurations
  3. The same command with --mode=update identifies records whose existing crawler configurations need to be refreshed
  4. The Summary Processing service formats each record according to its data type and sends it to the PLG API (Crawler system)
  5. The plg-api:sync-crawl-failed-from-crawler command periodically fetches information about failed crawl operations from the PLG API
  6. It then sets the crawl_status of the affected records to CrawlStatus=Error so the failures are reflected in the database (a sketch of this sync step follows the list)
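
The internals of the sync command are not spelled out in this document; the following is a minimal sketch of steps 5 and 6 under that caveat (the function names and the config_id field are assumptions):

```python
# Hedged sketch of the failed-crawl sync; function and field names are assumptions.
def sync_crawl_failed(data_type: str) -> None:
    """Hypothetical outline of plg-api:sync-crawl-failed-from-crawler for one data type."""
    failed_jobs = fetch_failed_jobs_from_plg_api(data_type)   # step 5: ask the Crawler for failures
    for job in failed_jobs:
        # step 6: flag the matching record (CrawlStatus=Error), matched via its crawl_config_id
        mark_crawl_status_error(data_type, crawl_config_id=job["config_id"])


# Placeholders; the real API client and repository are not described in this document.
def fetch_failed_jobs_from_plg_api(data_type: str) -> list[dict]: ...
def mark_crawl_status_error(data_type: str, crawl_config_id: int) -> None: ...
```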

Data Types

The system handles four distinct data types for crawler integration:

  • SummaryProduct ('product'): Product data with properties like product ID, mall ID, and input type
  • SummaryProductReview ('reviews'): Product review data that has a relationship with summary products
  • SummaryCategory ('category_ranking_group'): Category data with category ID and mall ID
  • SummarySearchQuery ('sq_ranking_group'): Search query data with keyword and mall ID
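
These identifiers map each summary model to the type string the Crawler expects. A minimal lookup, restating only the values listed above (the constant name itself is illustrative):

```python
# Maps each summary data type to the identifier sent to the Crawler (values from the list above).
CRAWLER_TYPE_BY_MODEL = {
    "SummaryProduct": "product",
    "SummaryProductReview": "reviews",
    "SummaryCategory": "category_ranking_group",
    "SummarySearchQuery": "sq_ranking_group",
}
```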

Frequency Overview

Timeline

timeline
    title Crawler Integration Schedule
    section Create/Update Configurations
        Every 5 minutes<br>(Ex. 08.00) : plg-api sending configs to crawler --mode=create
                                       : plg-api sending configs to crawler --mode=update
    section Sync Failed Crawls
        Every 30 minutes<br>(Ex. 08.30) : plg-api sync crawl failed from crawler

Note: All commands are executed for each data type: SummaryProduct, SummaryProductReview, SummaryCategory, SummarySearchQuery
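
The sketch below only illustrates the cadence and the per-data-type fan-out described above; the project presumably registers these commands with its own scheduler, and how the data type is passed to each command is an assumption:

```python
# Illustrative cadence only; not the actual scheduler configuration of the project.
DATA_TYPES = ["SummaryProduct", "SummaryProductReview", "SummaryCategory", "SummarySearchQuery"]


def every_5_minutes() -> None:
    for data_type in DATA_TYPES:
        run_command("plg-api:sending-configs-to-crawler", data_type, "--mode=create")
        run_command("plg-api:sending-configs-to-crawler", data_type, "--mode=update")


def every_30_minutes() -> None:
    for data_type in DATA_TYPES:
        run_command("plg-api:sync-crawl-failed-from-crawler", data_type)


def run_command(name: str, *args: str) -> None:
    """Placeholder for invoking the real console command; passing the data type this way is an assumption."""
    ...
```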

Expected Outcomes

When these commands execute successfully, the system delivers:

  • Automated Crawler Configuration Management: New summary data records are automatically identified and sent to the PLG API for crawler configuration creation
  • Timely Configuration Updates: Existing crawler configurations are updated within minutes of summary data changes, so crawlers operate with current parameters
  • Failed Crawl Detection and Recovery: Failed crawl operations are automatically detected and their status is synchronized back to the database for monitoring and retry mechanisms
  • Multi-data Type Processing: All four data types (products, reviews, categories, search queries) are handled with type-specific formatting and API integration
  • Status Tracking and Monitoring: Comprehensive tracking of sending status and crawl status across all summary data types for operational visibility
  • Data Integrity Maintenance: Proper foreign key relationships and status updates ensure data consistency between the console database and crawler system
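
Both status columns are stored as integers (see the schema below), and only two values are named in this document: SendingStatus=NotSent and CrawlStatus=Error. The enums below are a hedged placeholder; the numeric values and any members other than those two names are assumptions:

```python
from enum import IntEnum


class SendingStatus(IntEnum):
    """sending_status values; only NotSent is named in this document."""
    NOT_SENT = 0   # named above; the numeric value is an assumption
    SENT = 1       # assumed additional state, not documented here


class CrawlStatus(IntEnum):
    """crawl_status values; only Error is named in this document."""
    ERROR = 2      # named above; the numeric value is an assumption
```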

Database Schema

erDiagram
    summary_wishlist_products {
        bigint id PK
        string input "The input of the product"
        string input_type "The type of the input: jan, asin, rakuten_id"
        bigint mall_id FK "Foreign key to malls table"
        integer schedule_id "The id of the schedule"
        integer schedule_priority "The priority of the schedule"
        integer sending_status "The status of the sending to crawler"
        integer crawl_status "The status of the crawling"
        bigint crawl_config_id "The id of the configs table from Crawler (nullable)"
        integer status "The status of the product"
    }
    
    summary_wishlist_product_reviews {
        bigint id PK
        bigint summary_wishlist_product_id FK "Foreign key to summary_wishlist_products"
        integer schedule_id "The id of the schedule"
        integer schedule_priority "The priority of the schedule"
        integer sending_status "The status of the sending to crawler"
        integer crawl_status "The status of the crawling"
        bigint crawl_config_id "The id of the configs table from Crawler (nullable)"
        integer status "The status of the review"
    }
    
    summary_wishlist_categories {
        bigint id PK
        string category_id "The id of the category in the mall"
        bigint mall_id FK "Foreign key to malls table"
        integer schedule_id "The id of the schedule"
        integer schedule_priority "The priority of the schedule"
        integer sending_status "The status of the sending to crawler"
        integer crawl_status "The status of the crawling"
        bigint crawl_config_id "The id of the configs table from Crawler (nullable)"
        integer status "The status of the category"
    }
    
    summary_wishlist_search_queries {
        bigint id PK
        bigint mall_id FK "Foreign key to malls table"
        string keyword "The keyword to search"
        integer schedule_id "The id of the schedule"
        integer schedule_priority "The priority of the schedule"
        integer sending_status "The status of the sending to crawler"
        integer crawl_status "The status of the crawling"
        bigint crawl_config_id "The id of the configs table from Crawler (nullable)"
        integer status "The status of the search query"
    }
    
    summary_wishlist_products ||--o| summary_wishlist_product_reviews : "has one"

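For reference, the summary_wishlist_products columns translate directly into a record structure. The sketch below mirrors the diagram above (field names and nullability come from the schema; the dataclass itself is illustrative):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummaryWishlistProduct:
    """Mirror of the summary_wishlist_products columns shown above; illustrative only."""
    id: int
    input: str                      # the input of the product
    input_type: str                 # jan, asin, or rakuten_id
    mall_id: int                    # foreign key to the malls table
    schedule_id: int
    schedule_priority: int
    sending_status: int             # see the SendingStatus values discussed above
    crawl_status: int               # see the CrawlStatus values discussed above
    crawl_config_id: Optional[int]  # id of the configs table on the Crawler side (nullable)
    status: int                     # the status of the product
```
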
Batch List

  • Create Configurations: Commands that run every 5 minutes to create new configurations for the Crawler
  • Update Configurations: Commands that run every 5 minutes to update existing configurations for the Crawler
  • Sync Failed Crawls: Commands that run every 30 minutes to sync information about failed crawl operations