Analyzer Integration Overview
Description
The Analyzer Integration component manages the communication between the Console database (gb_console) and the Analyzer database (gb_analyzer). It verifies whether records sent to the crawler have been successfully processed by checking their existence in the Analyzer database, and then updates their status in the Console database. This bidirectional verification flow ensures accurate status tracking across the Trend Viewer backend ecosystem.
Overview System Diagram
```mermaid
---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart TD
subgraph ConsoleTables["<div style='width: 300px'>Console Database (gb_console)</div>"]
direction TB
subgraph SummaryTables["Summary Wishlist Tables"]
direction LR
SummaryProducts[(summary_wishlist_products)]
SummaryProductReviews[(summary_wishlist_product_reviews)]
SummaryCategories[(summary_wishlist_categories)]
SummarySearchQueries[(summary_wishlist_search_queries)]
end
end
subgraph AnalyzerTables["<div style='width: 300px'>Analyzer Database (gb_analyzer)</div>"]
direction TB
subgraph ProductTables["Product Data Tables"]
direction LR
Products[(products)]
Reviews[(reviews)]
end
subgraph RankingTables["Ranking Tables"]
direction LR
TempCategoryRankings[(t_category_rankings)]
TempSearchQueryRankings[(t_sq_rankings)]
end
end
SyncSuccessCmd[analyzerdb:sync-crawl-success-from-analyzer]
SyncSuccessCmd --- SyncSuccessCmdStep1[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
<p style='margin-top: 8px'>Retrieves Records</p>
</div>
]
SyncSuccessCmdStep1 --> ConsoleTables
SyncSuccessCmd --- SyncSuccessCmdStep2[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
<p style='margin-top: 8px'>Queries For Data</p>
</div>
]
SyncSuccessCmdStep2 --> AnalyzerTables
SyncSuccessCmd --- SyncSuccessCmdStep3[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
<p style='margin-top: 8px'>Updates Success Status</p>
</div>
]
SyncSuccessCmdStep3 --> ConsoleTables
style SyncSuccessCmd fill:#d9f2d9
style ConsoleTables fill:#fcf3d2,stroke:#339933,stroke-width:1px
style AnalyzerTables fill:#d2e3fc,stroke:#3333cc,stroke-width:1px
style SummaryTables fill:#fff9db
style ProductTables fill:#e6f0ff
style RankingTables fill:#e6f0ff
style SyncSuccessCmdStep1 fill:transparent,stroke:transparent,stroke-width:1px
style SyncSuccessCmdStep2 fill:transparent,stroke:transparent,stroke-width:1px
style SyncSuccessCmdStep3 fill:transparent,stroke:transparent,stroke-width:1px
```
Detailed Dataflow Dependency
The Analyzer Integration component follows a three-step verification flow that aligns with the numbered steps in the diagram:
1. Record Retrieval
The first step retrieves the records from the Console database that need to be verified (a hedged command sketch follows this list):
- The `analyzerdb:sync-crawl-success-from-analyzer` command is triggered every 5 minutes
- It accepts a `--data-type` parameter to specify which type of data to process:
  - `SummaryProduct`: Retrieves records from `summary_wishlist_products`
  - `SummaryProductReview`: Retrieves records from `summary_wishlist_product_reviews`
  - `SummaryCategory`: Retrieves records from `summary_wishlist_categories`
  - `SummarySearchQuery`: Retrieves records from `summary_wishlist_search_queries`
- Records are filtered by their `crawl_status` (typically records in "Crawling" status)
- The command uses repository interfaces to access the database tables:
  - `SummaryWishlistProductRepositoryInterface`
  - `SummaryWishlistProductReviewRepositoryInterface`
  - `SummaryWishlistCategoryRepositoryInterface`
  - `SummaryWishlistSearchQueryRepositoryInterface`
- Records are retrieved in configurable batches (default: 100)
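The command implementation itself is not included in this document. As a rough illustration, a Laravel-style artisan command with this behavior might look like the sketch below; the class name, the namespaces, the data-type map, and the `getCrawlingRecords()` repository method are assumptions, while the command signature, repository interface names, and job names come from the description above.

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;

// Illustrative sketch only: class name, namespaces, and the repository method
// used below (getCrawlingRecords) are assumptions, not the actual implementation.
class SyncCrawlSuccessFromAnalyzer extends Command
{
    protected $signature = 'analyzerdb:sync-crawl-success-from-analyzer {--data-type=}';

    protected $description = 'Mark Console records as Success when they exist in the Analyzer database';

    // Hypothetical mapping from --data-type value to repository interface and job class.
    private const DATA_TYPES = [
        'SummaryProduct'       => [\App\Repositories\SummaryWishlistProductRepositoryInterface::class, \App\Jobs\SummaryProductJob::class],
        'SummaryProductReview' => [\App\Repositories\SummaryWishlistProductReviewRepositoryInterface::class, \App\Jobs\SummaryProductReviewJob::class],
        'SummaryCategory'      => [\App\Repositories\SummaryWishlistCategoryRepositoryInterface::class, \App\Jobs\SummaryCategoryJob::class],
        'SummarySearchQuery'   => [\App\Repositories\SummaryWishlistSearchQueryRepositoryInterface::class, \App\Jobs\SummarySearchQueryJob::class],
    ];

    public function handle(): int
    {
        [$repositoryClass, $jobClass] = self::DATA_TYPES[$this->option('data-type')];
        $repository = app($repositoryClass);

        // Fetch records still in "Crawling" status and dispatch one verification
        // job per batch of 100 records (the default batch size noted above).
        $repository->getCrawlingRecords()        // hypothetical repository method
            ->chunk(100)
            ->each(fn ($batch) => dispatch(new $jobClass($batch)));

        return self::SUCCESS;
    }
}
```

Dispatching one job per batch keeps the command itself short-lived and pushes the verification work onto the queue, which matches the asynchronous batch processing described in the next step.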
2. Data Verification
Once the records are retrieved, the system verifies their existence in the Analyzer database (a hedged query sketch follows this list):
- For each batch of records, a specialized job is dispatched:
  - `SummaryProductJob`: Checks the `products` table in the Analyzer database
  - `SummaryProductReviewJob`: Checks the `reviews` table in the Analyzer database
  - `SummaryCategoryJob`: Checks the `t_category_rankings` table in the Analyzer database
  - `SummarySearchQueryJob`: Checks the `t_sq_rankings` table in the Analyzer database
- Each job builds specialized lookup keys to match records:
  - For products: `{mall_id}_{input_type}_{input}`, where input_type may be Jan, Asin, or RakutenId
  - For reviews: Keys based on the related product's identifiers
  - For categories: `{mall_id}_{category_id}`
  - For search queries: `{mall_id}_{keyword}`
- Time constraints are applied to ensure data freshness:
  - Only Analyzer records with timestamps after the Console record's `updated_at` are considered valid
  - This ensures that only data processed after the wishlist record was last updated counts as a match
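To make the product case concrete, the sketch below shows one plausible way `SummaryProductJob` could combine the lookup key with the freshness constraint. The `gb_analyzer` connection name, the helper function, and the assumption that the Analyzer `unique_key` column holds the lookup key are illustrative; only the key format and the crawl_created_at-after-updated_at rule come from the description above.

```php
<?php

use Illuminate\Support\Facades\DB;

// Illustrative sketch: returns the IDs of Console wishlist records that have a
// matching, sufficiently fresh row in the Analyzer `products` table.
function findVerifiedProductIds(iterable $wishlistRecords): array
{
    $verifiedIds = [];

    foreach ($wishlistRecords as $record) {
        // Product lookup key as described above: {mall_id}_{input_type}_{input}
        $lookupKey = sprintf('%s_%s_%s', $record->mall_id, $record->input_type, $record->input);

        $exists = DB::connection('gb_analyzer')    // assumed connection name
            ->table('products')
            ->where('mall_id', $record->mall_id)
            ->where('input_type', $record->input_type)
            // Assumption: unique_key corresponds to the lookup key built above; the real
            // matching column(s) may differ per input_type (jan_code, mall_product_id, ...).
            ->where('unique_key', $lookupKey)
            // Freshness constraint: only rows crawled after the Console record was last updated count.
            ->where('crawl_created_at', '>', $record->updated_at)
            ->exists();

        if ($exists) {
            $verifiedIds[] = $record->id;
        }
    }

    return $verifiedIds;
}
```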
3. Status Update
After verification, the system updates the status of verified records (a transactional update sketch follows this list):
- For records found in the Analyzer database, the job updates:
  - The `crawl_status` field to `2 (Success)` in the relevant summary wishlist table
  - This is done using repository methods such as `updateByConditions` or `updateBySummaryWishlistProductIds`
- Multiple records can be updated in batch operations for efficiency
- Status updates are applied within database transactions to ensure consistency
- Deadlock detection and retry mechanisms are implemented for reliability
- Success and error notifications are sent via Slack
- Logs are generated with detailed information about the number of records processed and updated
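A minimal sketch of the update step for the product case follows, assuming Laravel's `DB::transaction()` with its retry-on-deadlock attempts argument; the repository method name appears in the list above, but its exact signature, the connection name, and the retry count are assumptions.

```php
<?php

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// Illustrative sketch: marks the verified records as Success in a single batched
// update, wrapped in a transaction that Laravel retries on deadlock.
function markProductsAsSuccess(object $repository, array $verifiedIds): void
{
    if ($verifiedIds === []) {
        return;
    }

    DB::connection('gb_console')->transaction(function () use ($repository, $verifiedIds) {
        // Assumed signature: update crawl_status to 2 (Success) for all IDs at once.
        $repository->updateBySummaryWishlistProductIds($verifiedIds, ['crawl_status' => 2]);
    }, 3); // second argument: number of attempts when a deadlock is detected

    Log::info('analyzerdb:sync-crawl-success-from-analyzer updated records', [
        'updated' => count($verifiedIds),
    ]);
}
```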
This three-step process ensures accurate status tracking between the two databases, allowing the system to know which records have been successfully processed by the crawler and analyzer components.
Data Types and Tables
Console Database Tables (gb_console)
- `summary_wishlist_products`: Tracks product data sent to the crawler
  - Key Fields: id, mall_id, input_type, input, sending_status, crawl_status
  - Updated By: `SummaryProductJob`
  - Lookup Keys: Based on mall_id, input_type (Jan, Asin, RakutenId), and input value
- `summary_wishlist_product_reviews`: Tracks product review data
  - Key Fields: id, summary_wishlist_product_id, sending_status, crawl_status
  - Updated By: `SummaryProductReviewJob`
  - Relationships: Foreign key to summary_wishlist_products
- `summary_wishlist_categories`: Tracks category data
  - Key Fields: id, mall_id, category_id, sending_status, crawl_status
  - Updated By: `SummaryCategoryJob`
  - Lookup Keys: Based on mall_id and category_id
- `summary_wishlist_search_queries`: Tracks search query data
  - Key Fields: id, mall_id, keyword, sending_status, crawl_status
  - Updated By: `SummarySearchQueryJob`
  - Lookup Keys: Based on mall_id and keyword
Analyzer Database Tables (gb_analyzer)
- `products`: Contains processed product data
  - Key Fields: mall_id, mall_product_id, jan_code, input_type, unique_key
  - Queried By: `SummaryProductJob`
  - Time Constraint: crawl_created_at must be after summary_wishlist_product.updated_at
- `reviews`: Contains processed review data
  - Key Fields: mall_id, mall_product_id/jan_code
  - Queried By: `SummaryProductReviewJob`
  - Time Constraint: crawl_created_at must be after summary_wishlist_product_review.updated_at
- `t_category_rankings`: Temporary category rankings from the crawler
  - Key Fields: mall_id, category_id
  - Queried By: `SummaryCategoryJob`
  - Time Constraint: crawl_created_at must be after summary_wishlist_category.updated_at
- `t_sq_rankings`: Temporary search query rankings from the crawler
  - Key Fields: mall_id, keyword
  - Queried By: `SummarySearchQueryJob`
  - Time Constraint: crawl_created_at must be after summary_wishlist_search_query.updated_at
Status Management
The system uses the `CrawlStatus` enum to track processing states (a sketch of the enum follows this list):
- NotCrawled: Initial state for new records
- Crawling: Record is currently being processed by crawler
- Success: Processing completed successfully (updated by this component)
- Failed: Processing failed with errors (handled by Crawler Integration component)
- Error: System error occurred during processing
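The enum itself is not reproduced in this document; the sketch below assumes a PHP backed enum in which only `Success = 2` is confirmed by the status update description above, while the other backing values are illustrative.

```php
<?php

namespace App\Enums;

// Illustrative sketch of the CrawlStatus enum; only Success = 2 is confirmed
// elsewhere in this document, the remaining backing values are assumptions.
enum CrawlStatus: int
{
    case NotCrawled = 0; // initial state for new records
    case Crawling   = 1; // record is currently being processed by the crawler
    case Success    = 2; // processing completed successfully (set by this component)
    case Failed     = 3; // processing failed with errors (handled by Crawler Integration)
    case Error      = 4; // system error occurred during processing
}
```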
These status values enable:
- Accurate dashboard reporting on data processing status
- Identification of problematic records for troubleshooting
- Data integrity verification across the pipeline
- Proper sequencing of downstream operations
Database Schema
Console Database Tables (gb_console)
```mermaid
erDiagram
summary_wishlist_products {
bigint id PK
string input "The input of the product"
string input_type "The type of the input: jan, asin, rakuten_id"
bigint mall_id FK "Foreign key to malls table"
integer crawl_status "The status of the crawling (default: New)"
integer status "The status of the product (default: New)"
timestamp updated_at
}
summary_wishlist_product_reviews {
bigint id PK
bigint summary_wishlist_product_id FK "Foreign key to summary_wishlist_products"
integer crawl_status "The status of the crawling (default: New)"
integer status "The status of the product (default: New)"
timestamp updated_at
}
summary_wishlist_categories {
bigint id PK
string category_id "The id of the category in the mall"
bigint mall_id FK "Foreign key to malls table"
integer crawl_status "The status of the crawling (default: New)"
integer status "The status of the product (default: New)"
timestamp updated_at
}
summary_wishlist_search_queries {
bigint id PK
bigint mall_id FK "The id of the mall"
string keyword "The keyword to search"
integer crawl_status "The status of the crawling (default: New)"
integer status "The status of the product (default: New)"
timestamp updated_at
}
summary_wishlist_products ||--o| summary_wishlist_product_reviews : "has one"
```
Analyzer Database Tables (gb_analyzer)
```mermaid
erDiagram
products {
bigint id PK
integer mall_id "Mall identifier"
string mall_product_id "Product ID from the mall"
string jan_code "JAN code (nullable)"
string input_type "Type of input: asin, jan, rakuten_id (default: asin)"
timestamp crawl_created_at "When the product was crawled (nullable)"
}
reviews {
bigint id PK
integer mall_id "Mall identifier (nullable)"
string mall_product_id "Product ID from mall (nullable)"
string jan_code "JAN code (nullable, max 50 chars)"
timestamp crawl_created_at "When the review was crawled (nullable)"
}
t_category_rankings {
bigint id PK
integer mall_id "Mall identifier (nullable)"
string category_id "Category identifier (nullable)"
timestamp crawl_created_at "When the ranking was crawled (nullable)"
}
t_sq_rankings {
bigint id PK
integer mall_id "Mall identifier (nullable)"
string keyword "Search keyword (nullable, max 100 chars)"
timestamp crawl_created_at "When the ranking was crawled (nullable)"
}
```
Frequency Overview
Timeline
```mermaid
timeline
title Analyzer Integration Schedule
section Success Sync Operations
Every 5 minutes<br>(e.g. 08.00, 08.05, etc.) : analyzerdb sync crawl success from analyzer
```
Note: This command is executed for each data type: SummaryProduct, SummaryProductReview, SummaryCategory, SummarySearchQuery
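As a hedged illustration, this per-data-type, 5-minute schedule could be registered in a Laravel console kernel along the following lines; the registration location and the `withoutOverlapping()` call are assumptions.

```php
<?php

use Illuminate\Console\Scheduling\Schedule;

// Illustrative scheduling sketch (e.g. inside App\Console\Kernel::schedule):
// one run of the sync command every five minutes for each supported data type.
function scheduleAnalyzerSync(Schedule $schedule): void
{
    $dataTypes = ['SummaryProduct', 'SummaryProductReview', 'SummaryCategory', 'SummarySearchQuery'];

    foreach ($dataTypes as $dataType) {
        $schedule->command('analyzerdb:sync-crawl-success-from-analyzer', ['--data-type' => $dataType])
            ->everyFiveMinutes()
            ->withoutOverlapping(); // assumption: a long run should not overlap the next one
    }
}
```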
Expected Outcomes
When these commands execute successfully, the system delivers:
- Automated Status Verification: Near-real-time (5-minute cadence) verification of crawl success status across the Console and Analyzer databases, ensuring accurate processing-state tracking
- Cross-Database Data Integrity: Bidirectional verification flow maintains consistency between gb_console and gb_analyzer databases with time-constraint validation
- Efficient Batch Processing: Asynchronous job queue processing with configurable batch sizes and specialized job classes for each data type
- Robust Error Handling: Automatic retry mechanisms with exponential backoff, deadlock detection, and comprehensive logging for operational reliability
- Status Synchronization: Accurate crawl_status updates from Crawling to Success state enabling proper downstream processing and dataset creation
- Multi-Data Type Support: Simultaneous handling of products, reviews, categories, and search queries with type-specific lookup key generation
- Performance Optimization: Repository pattern implementation with batch update operations and database transaction management for efficient resource utilization
Batch List
| Name | Description |
|---|---|
| Sync Success | Commands that run every 5 minutes to verify whether records have been successfully processed by the Analyzer system and update their status |