Crawler Integration Overview
Description
The Crawler Integration component manages the communication between the gb_console database and the PLG API (Crawler system). It handles sending summary data configurations to the Crawler for creating and updating crawl jobs, as well as receiving information about failed crawl operations. The component uses the SendToCrawler command to process summary data and prepare it for Crawler consumption through either create or update operations.
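As a rough, hypothetical sketch of that dispatch (the class and function names below are illustrative assumptions, not the actual implementation), the command can be viewed as preparing summary records and tagging them with a create or update operation before they are handed to the PLG API:

```python
from enum import Enum
from typing import Any


class SendMode(Enum):
    """The two operations of the SendToCrawler command (names are assumptions)."""
    CREATE = "create"  # register new crawl configurations with the PLG API
    UPDATE = "update"  # refresh crawl configurations that already exist


def send_to_crawler(mode: SendMode, records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Prepare summary records and tag them with the requested operation.

    In the real component the prepared payload is posted to the PLG API;
    here it is only returned, to keep the sketch self-contained.
    """
    return [{"operation": mode.value, "config": record} for record in records]


if __name__ == "__main__":
    pending = [{"id": 1, "input": "4901234567890", "input_type": "jan", "mall_id": 3}]
    print(send_to_crawler(SendMode.CREATE, pending))
```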
Overview System Diagram
---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart TD
%% Step 1: Database Tables at the top
subgraph DatabaseTables["<div style='width:300px'>Database Tables (gb_console)</div>"]
direction LR
WishlistProducts[(summary_wishlist_products)]
WishlistProductReviews[(summary_wishlist_product_reviews)]
WishlistCategories[(summary_wishlist_categories)]
WishlistSearchQueries[(summary_wishlist_search_queries)]
end
%% Step 2: Identify records with NotSent status for creation
DatabaseTables --> CreateConfigsStep1[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
<p style='margin-top: 8px'>Identify NotSent Records</p>
</div>
]
%% Step 3: Identify records needing updates (parallel to step 2)
DatabaseTables --> UpdateConfigsStep2[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
<p style='margin-top: 8px'>Identify Update Needed</p>
</div>
]
%% Step 4: Create and Update Configurations
CreateConfigsStep1 --> CreateConfigs[Create Configurations]
UpdateConfigsStep2 --> UpdateConfigs[Update Configurations]
%% Step 5: Process data for Crawler
CreateConfigs --> SummaryServiceStep3[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
<p style='margin-top: 8px'>Format Data for API</p>
</div>
]
UpdateConfigs --> SummaryServiceStep3
%% Step 6: Summary Service Processing
SummaryServiceStep3 --> SummaryService[Summary Processing]
%% Step 7: Send to Crawler API
SummaryService --> CrawlerStep4[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4</span>
<p style='margin-top: 8px'>Send to Crawler</p>
</div>
]
%% Step 8: Crawler API
CrawlerStep4 --> Crawler((PLG API))
%% Step 9: Fetch failed jobs (separate flow)
Crawler --> SyncFailedStep5[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>5</span>
<p style='margin-top: 8px'>Get Failed Jobs</p>
</div>
]
%% Step 10: Sync Failed Jobs
SyncFailedStep5 --> SyncFailed[Sync Failed Jobs]
%% Step 11: Update status in database
SyncFailed --> TableUpdateStep6[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>6</span>
<p style='margin-top: 8px'>Update Error Status</p>
</div>
]
%% Step 12: Back to Database Tables
TableUpdateStep6 --> DatabaseTables
style Crawler fill:#fcd9d9,stroke-width:2px
style SummaryService fill:#fcf3d2
style CreateConfigs fill:#d9f2d9
style UpdateConfigs fill:#d9f2d9
style SyncFailed fill:#d9f2d9
style DatabaseTables fill:#ffe6cc,stroke:#ff9900,stroke-width:2px
style WishlistProducts fill:#d9f2d9,stroke:#339933,stroke-width:1px
style WishlistProductReviews fill:#d9f2d9,stroke:#339933,stroke-width:1px
style WishlistCategories fill:#d9f2d9,stroke:#339933,stroke-width:1px
style WishlistSearchQueries fill:#d9f2d9,stroke:#339933,stroke-width:1px
style CreateConfigsStep1 fill:transparent,stroke:transparent,stroke-width:1px
style UpdateConfigsStep2 fill:transparent,stroke:transparent,stroke-width:1px
style SummaryServiceStep3 fill:transparent,stroke:transparent,stroke-width:1px
style CrawlerStep4 fill:transparent,stroke:transparent,stroke-width:1px
style SyncFailedStep5 fill:transparent,stroke:transparent,stroke-width:1px
style TableUpdateStep6 fill:transparent,stroke:transparent,stroke-width:1px
%% Style for arrows
linkStyle default stroke-width:2px,stroke:#333
%% Subgraph classes
classDef databaseClass fill:#fff9db
class DatabaseTables databaseClass
Detailed Dataflow Dependencies
The Crawler Integration component follows a data flow where:
- The summary wishlist tables in the gb_console database (`summary_wishlist_products`, `summary_wishlist_product_reviews`, `summary_wishlist_categories`, `summary_wishlist_search_queries`) store the data that needs to be crawled
- The command `plg-api:sending-configs-to-crawler` with mode=create identifies records with SendingStatus=NotSent that need new crawler configurations (a minimal sketch of this path follows the list)
- The Summary Processing service formats the data according to its data type and sends it to the PLG API (Crawler system)
- The same command with mode=update identifies records that need updated crawler configurations
- The command `plg-api:sync-crawl-failed-from-crawler` periodically fetches information about failed crawl operations from the PLG API
- It updates the crawl_status of affected records in the database to reflect failures (CrawlStatus=Error)
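The create path can be illustrated with a minimal, self-contained sketch. It uses an in-memory SQLite table with only a subset of the real columns, and the integer status codes are assumptions made for the example; the real command also covers the other data types and actually posts the payload to the PLG API:

```python
import sqlite3

# Integer codes for sending_status are assumptions for illustration;
# the real values are defined in the gb_console application.
SENDING_STATUS_NOT_SENT = 0
SENDING_STATUS_SENT = 1

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE summary_wishlist_products (
           id INTEGER PRIMARY KEY,
           input TEXT, input_type TEXT, mall_id INTEGER,
           sending_status INTEGER, crawl_status INTEGER, crawl_config_id INTEGER
       )"""
)
conn.execute(
    "INSERT INTO summary_wishlist_products VALUES (1, '4901234567890', 'jan', 3, ?, NULL, NULL)",
    (SENDING_STATUS_NOT_SENT,),
)

# Step 1 of the create flow: identify records that have never been sent to the Crawler.
not_sent = conn.execute(
    "SELECT id, input, input_type, mall_id FROM summary_wishlist_products "
    "WHERE sending_status = ?",
    (SENDING_STATUS_NOT_SENT,),
).fetchall()

# Step 2: format each record for the PLG API and mark it as sent
# (the HTTP call itself is omitted to keep the sketch runnable offline).
for row_id, input_value, input_type, mall_id in not_sent:
    payload = {"type": "product", "input": input_value,
               "input_type": input_type, "mall_id": mall_id}
    print("would POST to PLG API:", payload)
    conn.execute(
        "UPDATE summary_wishlist_products SET sending_status = ? WHERE id = ?",
        (SENDING_STATUS_SENT, row_id),
    )
conn.commit()
```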
Data Types
The system handles four distinct data types for crawler integration:
- SummaryProduct ('product'): Product data with properties such as the product input (JAN, ASIN, or Rakuten ID), mall ID, and input type
- SummaryProductReview ('reviews'): Product review data that has a relationship with summary products
- SummaryCategory ('category_ranking_group'): Category data with category ID and mall ID
- SummarySearchQuery ('sq_ranking_group'): Search query data with keyword and mall ID
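A minimal sketch of how these types map onto the gb_console tables and the API type strings listed above (the lookup table itself is an assumption made for illustration; the type strings come directly from the list):

```python
from enum import Enum


class CrawlerDataType(Enum):
    """API type strings for each summary data type, as listed above."""
    SUMMARY_PRODUCT = "product"
    SUMMARY_PRODUCT_REVIEW = "reviews"
    SUMMARY_CATEGORY = "category_ranking_group"
    SUMMARY_SEARCH_QUERY = "sq_ranking_group"


# Hypothetical lookup from gb_console table name to the type string sent to the PLG API.
TABLE_TO_TYPE = {
    "summary_wishlist_products": CrawlerDataType.SUMMARY_PRODUCT,
    "summary_wishlist_product_reviews": CrawlerDataType.SUMMARY_PRODUCT_REVIEW,
    "summary_wishlist_categories": CrawlerDataType.SUMMARY_CATEGORY,
    "summary_wishlist_search_queries": CrawlerDataType.SUMMARY_SEARCH_QUERY,
}

print(TABLE_TO_TYPE["summary_wishlist_categories"].value)  # -> category_ranking_group
```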
Frequency Overview
Timeline
timeline
title Crawler Integration Schedule
section Create/Update Configurations
Every 5 minutes<br>(Ex. 08.00) : plg-api sending configs to crawler --mode=create
: plg-api sending configs to crawler --mode=update
section Sync Failed Crawls
Every 30 minutes<br>(Ex. 08.30) : plg-api sync crawl failed from crawler
Note: All commands are executed for each data type: SummaryProduct, SummaryProductReview, SummaryCategory, SummarySearchQuery
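The production scheduler belongs to the application itself; purely to illustrate the cadence above, a sketch using the third-party `schedule` package might look like the following, where `run_command` is a stand-in for invoking the real console commands for every data type:

```python
import time

import schedule  # third-party package: pip install schedule


def run_command(name: str) -> None:
    """Placeholder for invoking the real console command for each data type."""
    for data_type in ("SummaryProduct", "SummaryProductReview",
                      "SummaryCategory", "SummarySearchQuery"):
        print(f"running {name} for {data_type}")


# Every 5 minutes: create new configurations, then update existing ones.
schedule.every(5).minutes.do(run_command, "plg-api:sending-configs-to-crawler --mode=create")
schedule.every(5).minutes.do(run_command, "plg-api:sending-configs-to-crawler --mode=update")
# Every 30 minutes: pull failed crawl information back from the PLG API.
schedule.every(30).minutes.do(run_command, "plg-api:sync-crawl-failed-from-crawler")

while True:
    schedule.run_pending()
    time.sleep(1)
```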
Expected Outcomes
When these commands execute successfully, the system delivers:
- Automated Crawler Configuration Management: New summary data records are automatically identified and sent to the PLG API for crawler configuration creation
- Near-Real-Time Configuration Updates: Existing crawler configurations are refreshed on the 5-minute batch cycle when summary data changes, ensuring crawlers operate with current parameters
- Failed Crawl Detection and Recovery: Failed crawl operations are automatically detected and their status is synchronized back to the database for monitoring and retry mechanisms
- Multi-data Type Processing: Simultaneous handling of four distinct data types (products, reviews, categories, search queries) with type-specific formatting and API integration
- Status Tracking and Monitoring: Comprehensive tracking of sending status and crawl status across all summary data types for operational visibility
- Data Integrity Maintenance: Proper foreign key relationships and status updates ensure data consistency between the console database and crawler system
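A minimal sketch of the failure-sync step described above, assuming the PLG API returns the crawl config ids of failed jobs; the integer code for CrawlStatus=Error and the in-memory records are illustrative only:

```python
# Assumed integer code for CrawlStatus=Error; the real value is defined in gb_console.
CRAWL_STATUS_ERROR = 9


def sync_failed_crawls(records: list[dict], failed_config_ids: set[int]) -> int:
    """Flag records whose crawl_config_id appears in the PLG API failure list.

    `records` stands in for rows of any summary_wishlist_* table; the real
    command persists the change to the database instead of mutating dicts.
    """
    updated = 0
    for record in records:
        if record.get("crawl_config_id") in failed_config_ids:
            record["crawl_status"] = CRAWL_STATUS_ERROR
            updated += 1
    return updated


rows = [
    {"id": 1, "crawl_config_id": 101, "crawl_status": 1},
    {"id": 2, "crawl_config_id": 102, "crawl_status": 1},
]
print(sync_failed_crawls(rows, failed_config_ids={102}))  # -> 1 record flagged as Error
```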
Database Schema
erDiagram
summary_wishlist_products {
bigint id PK
string input "The input of the product"
string input_type "The type of the input: jan, asin, rakuten_id"
bigint mall_id FK "Foreign key to malls table"
integer schedule_id "The id of the schedule"
integer schedule_priority "The priority of the schedule"
integer sending_status "The status of the sending to crawler"
integer crawl_status "The status of the crawling"
bigint crawl_config_id "The id of the configs table from Crawler (nullable)"
integer status "The status of the product"
}
summary_wishlist_product_reviews {
bigint id PK
bigint summary_wishlist_product_id FK "Foreign key to summary_wishlist_products"
integer schedule_id "The id of the schedule"
integer schedule_priority "The priority of the schedule"
integer sending_status "The status of the sending to crawler"
integer crawl_status "The status of the crawling"
bigint crawl_config_id "The id of the configs table from Crawler (nullable)"
integer status "The status of the review"
}
summary_wishlist_categories {
bigint id PK
string category_id "The id of the category in the mall"
bigint mall_id FK "Foreign key to malls table"
integer schedule_id "The id of the schedule"
integer schedule_priority "The priority of the schedule"
integer sending_status "The status of the sending to crawler"
integer crawl_status "The status of the crawling"
bigint crawl_config_id "The id of the configs table from Crawler (nullable)"
integer status "The status of the category"
}
summary_wishlist_search_queries {
bigint id PK
bigint mall_id FK "Foreign key to malls table"
string keyword "The keyword to search"
integer schedule_id "The id of the schedule"
integer schedule_priority "The priority of the schedule"
integer sending_status "The status of the sending to crawler"
integer crawl_status "The status of the crawling"
bigint crawl_config_id "The id of the configs table from Crawler (nullable)"
integer status "The status of the search query"
}
summary_wishlist_products ||--o| summary_wishlist_product_reviews : "has one"
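For reference, the columns shared by all four tables can be modelled roughly as follows (a simplified sketch; the type-specific columns such as input, keyword, or category_id are omitted):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummaryWishlistRow:
    """Columns common to all summary_wishlist_* tables, per the schema above."""
    id: int
    schedule_id: int
    schedule_priority: int
    sending_status: int              # status of sending the record to the Crawler
    crawl_status: int                # status reported back for the crawl itself
    crawl_config_id: Optional[int]   # id of the Crawler's configs table entry, nullable
    status: int


row = SummaryWishlistRow(id=1, schedule_id=10, schedule_priority=1,
                         sending_status=0, crawl_status=0,
                         crawl_config_id=None, status=1)
print(row)
```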
Batch List
| Name | Description |
|---|---|
| Create Configurations | Commands that run every 5 minutes to create new configurations for the Crawler |
| Update Configurations | Commands that run every 5 minutes to update existing configurations for the Crawler |
| Sync Failed Crawls | Commands that run every 30 minutes to sync information about failed crawl operations |