Analyzer Integration Overview

Description

The Analyzer Integration component manages the communication between the Console database (gb_console) and the Analyzer database (gb_analyzer). It verifies whether records sent to the crawler have been successfully processed by checking for corresponding rows in the Analyzer database, and then updates their status in the Console database. This cross-database verification flow ensures accurate status tracking across the Trend Viewer backend ecosystem.
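
For orientation, the same verification run can also be triggered manually. A minimal sketch, assuming this is a Laravel application (the framework is inferred from the artisan-style command name and is not stated here); only the command and option names come from this document:

<?php
use Illuminate\Support\Facades\Artisan;

// Hypothetical one-off run for a single data type; in normal operation the
// scheduler triggers this every 5 minutes (see Frequency Overview below).
Artisan::call('analyzerdb:sync-crawl-success-from-analyzer', [
    '--data-type' => 'SummaryProduct',
]);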

System Overview Diagram

---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart TD
    subgraph ConsoleTables["<div style='width: 300px'>Console Database (gb_console)</div>"]
        direction TB
        
        subgraph SummaryTables["Summary Wishlist Tables"]
            direction LR
            SummaryProducts[(summary_wishlist_products)]
            SummaryProductReviews[(summary_wishlist_product_reviews)]
            SummaryCategories[(summary_wishlist_categories)]
            SummarySearchQueries[(summary_wishlist_search_queries)]
        end
    end
    
    subgraph AnalyzerTables["<div style='width: 300px'>Analyzer Database (gb_analyzer)</div>"]
        direction TB
        
        subgraph ProductTables["Product Data Tables"]
            direction LR
            Products[(products)]
            Reviews[(reviews)]
        end
        
        subgraph RankingTables["Ranking Tables"]
            direction LR
            TempCategoryRankings[(t_category_rankings)]
            TempSearchQueryRankings[(t_sq_rankings)]
        end
    end
    
    SyncSuccessCmd[analyzerdb:sync-crawl-success-from-analyzer]
    
    SyncSuccessCmd --- SyncSuccessCmdStep1[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
            <p style='margin-top: 8px'>Retrieves Records</p>
        </div>
    ]
    SyncSuccessCmdStep1 --> ConsoleTables
    
    SyncSuccessCmd --- SyncSuccessCmdStep2[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
            <p style='margin-top: 8px'>Queries For Data</p>
        </div>
    ]
    SyncSuccessCmdStep2 --> AnalyzerTables
    
    SyncSuccessCmd --- SyncSuccessCmdStep3[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
            <p style='margin-top: 8px'>Updates Success Status</p>
        </div>
    ]
    SyncSuccessCmdStep3 --> ConsoleTables
    
    style SyncSuccessCmd fill:#d9f2d9
    style ConsoleTables fill:#fcf3d2,stroke:#339933,stroke-width:1px
    style AnalyzerTables fill:#d2e3fc,stroke:#3333cc,stroke-width:1px
    style SummaryTables fill:#fff9db
    style ProductTables fill:#e6f0ff
    style RankingTables fill:#e6f0ff
    style SyncSuccessCmdStep1 fill:transparent,stroke:transparent,stroke-width:1px
    style SyncSuccessCmdStep2 fill:transparent,stroke:transparent,stroke-width:1px
    style SyncSuccessCmdStep3 fill:transparent,stroke:transparent,stroke-width:1px

Detailed Dataflow Dependencies

The Analyzer Integration component follows a three-step verification flow that aligns with the numbered steps in the diagram:

1. Record Retrieval

The first step retrieves the records from the Console database that need to be verified (a scheduling sketch follows the list):

  • The analyzerdb:sync-crawl-success-from-analyzer command is triggered every 5 minutes
  • It accepts a --data-type parameter to specify which type of data to process:
    • SummaryProduct: Retrieves records from summary_wishlist_products
    • SummaryProductReview: Retrieves records from summary_wishlist_product_reviews
    • SummaryCategory: Retrieves records from summary_wishlist_categories
    • SummarySearchQuery: Retrieves records from summary_wishlist_search_queries
  • Records are filtered by their crawl_status (typically records that are in "Crawling" status)
  • The command uses repository interfaces to access the database tables:
    • SummaryWishlistProductRepositoryInterface
    • SummaryWishlistProductReviewRepositoryInterface
    • SummaryWishlistCategoryRepositoryInterface
    • SummaryWishlistSearchQueryRepositoryInterface
  • Records are retrieved in configurable batches (default: 100)
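
A minimal scheduling sketch, assuming a Laravel console kernel; the command name, the --data-type values, and the 5-minute cadence come from the list above, while the kernel wiring and the withoutOverlapping guard are assumptions:

<?php
use Illuminate\Console\Scheduling\Schedule;

// app/Console/Kernel.php (hypothetical placement)
protected function schedule(Schedule $schedule): void
{
    $dataTypes = ['SummaryProduct', 'SummaryProductReview', 'SummaryCategory', 'SummarySearchQuery'];

    foreach ($dataTypes as $dataType) {
        $schedule->command('analyzerdb:sync-crawl-success-from-analyzer', ['--data-type' => $dataType])
            ->everyFiveMinutes()
            ->withoutOverlapping(); // assumed safeguard, not stated in this document
    }
}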

2. Data Verification

Once the records are retrieved, the system verifies their existence in the Analyzer database (a dispatch and lookup-key sketch follows the list):

  • For each batch of records, a specialized job is dispatched:
    • SummaryProductJob: Checks products table in the Analyzer database
    • SummaryProductReviewJob: Checks reviews table in the Analyzer database
    • SummaryCategoryJob: Checks t_category_rankings table in the Analyzer database
    • SummarySearchQueryJob: Checks t_sq_rankings table in the Analyzer database
  • Each job creates specialized lookup keys to match records:
    • For products: {mall_id}_{input_type}_{input} where input_type may be Jan, Asin, or RakutenId
    • For reviews: Keys based on the related product's identifiers
    • For categories: {mall_id}_{category_id}
    • For search queries: {mall_id}_{keyword}
  • Time constraints are applied to ensure data freshness:
    • Only Analyzer rows whose crawl_created_at is later than the Console record's updated_at are considered valid matches
    • This ensures that only data processed after the wishlist record was last updated counts as a match
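
A sketch of the dispatch and lookup-key construction described above; SummaryProductJob and the key formats are from this document, while the job signature and the helper functions are assumptions:

<?php
// Hypothetical: hand one batch of Console records to its specialized job.
SummaryProductJob::dispatch($batchOfRecords);

// Inside the job, each record could be reduced to the documented key format
// before matching against Analyzer rows.
function productLookupKey(object $record): string
{
    // e.g. "12_Jan_4901234567894" for mall_id 12 and input_type Jan
    return "{$record->mall_id}_{$record->input_type}_{$record->input}";
}

function categoryLookupKey(object $record): string
{
    return "{$record->mall_id}_{$record->category_id}";
}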

3. Status Update

After verification, the system updates the status of the verified records (a transaction sketch follows the list):

  • For records found in the Analyzer database, the job updates:
    • The crawl_status field to 2 (Success) in the relevant summary wishlist table
    • This is done using repository methods like updateByConditions or updateBySummaryWishlistProductIds
  • Multiple records can be updated in batch operations for efficiency
  • Status updates are applied within database transactions to ensure consistency
  • Deadlock detection and retry mechanisms are implemented for reliability
  • Success and error notifications are sent via Slack
  • Logs are generated with detailed information about the number of records processed and updated
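
A minimal sketch of the transactional batch update with deadlock retry, assuming Laravel's DB facade; updateBySummaryWishlistProductIds and the Success value 2 are from this document, while the connection name and retry count are assumptions:

<?php
use Illuminate\Support\Facades\DB;

// Laravel re-runs the closure on deadlock, up to the given number of attempts.
DB::connection('gb_console')->transaction(function () use ($repository, $verifiedIds) {
    // 2 = Success, per the CrawlStatus mapping in this document
    $repository->updateBySummaryWishlistProductIds($verifiedIds, ['crawl_status' => 2]);
}, 3);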

This three-step process ensures accurate status tracking between the two databases, allowing the system to know which records have been successfully processed by the crawler and analyzer components.

Data Types and Tables

Console Database Tables (gb_console)

  • summary_wishlist_products: Tracks product data sent to the crawler

    • Key Fields: id, mall_id, input_type, input, sending_status, crawl_status
    • Updated By: SummaryProductJob
    • Lookup Keys: Based on mall_id, input_type (Jan, Asin, RakutenId), and input value
  • summary_wishlist_product_reviews: Tracks product review data

    • Key Fields: id, summary_wishlist_product_id, sending_status, crawl_status
    • Updated By: SummaryProductReviewJob
    • Relationships: Foreign key to summary_wishlist_products
  • summary_wishlist_categories: Tracks category data

    • Key Fields: id, mall_id, category_id, sending_status, crawl_status
    • Updated By: SummaryCategoryJob
    • Lookup Keys: Based on mall_id and category_id
  • summary_wishlist_search_queries: Tracks search query data

    • Key Fields: id, mall_id, keyword, sending_status, crawl_status
    • Updated By: SummarySearchQueryJob
    • Lookup Keys: Based on mall_id and keyword
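
A plausible outline for one of the repository interfaces named earlier; only the interface name and the updateByConditions method name appear in this document, and every signature here is an assumption:

<?php
interface SummaryWishlistProductRepositoryInterface
{
    // Fetch one batch of records filtered by crawl_status (default batch size 100).
    public function getByCrawlStatus(int $crawlStatus, int $limit = 100): iterable; // hypothetical

    // Bulk-update rows matching the given conditions; used for the Success transition.
    public function updateByConditions(array $conditions, array $attributes): int; // signature assumed
}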

Analyzer Database Tables (gb_analyzer)

  • products: Contains processed product data

    • Key Fields: mall_id, mall_product_id, jan_code, input_type, unique_key
    • Queried By: SummaryProductJob
    • Time Constraint: crawl_created_at must be after summary_wishlist_product.updated_at
  • reviews: Contains processed review data

    • Key Fields: mall_id, mall_product_id/jan_code
    • Queried By: SummaryProductReviewJob
    • Time Constraint: crawl_created_at must be after summary_wishlist_product_review.updated_at
  • t_category_rankings: Temporary category rankings from crawler

    • Key Fields: mall_id, category_id
    • Queried By: SummaryCategoryJob
    • Time Constraint: crawl_created_at must be after summary_wishlist_category.updated_at
  • t_sq_rankings: Temporary search query rankings from crawler

    • Key Fields: mall_id, keyword
    • Queried By: SummarySearchQueryJob
    • Time Constraint: crawl_created_at must be after summary_wishlist_search_query.updated_at
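
The time constraints above might translate into a query like this sketch, assuming Laravel's query builder; the connection, table, and column names follow this document, while the builder usage is an assumption:

<?php
use Illuminate\Support\Facades\DB;

// Hypothetical freshness check for one summary_wishlist_products record:
// an Analyzer row only counts as a match if it was crawled after the
// Console record was last updated.
$isProcessed = DB::connection('gb_analyzer')->table('products')
    ->where('mall_id', $record->mall_id)
    ->where('jan_code', $record->input) // when input_type is Jan; other input types match other columns
    ->where('crawl_created_at', '>', $record->updated_at)
    ->exists();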

Status Management

The system uses the CrawlStatus enum to track processing states:

  • NotCrawled: Initial state for new records
  • Crawling: Record is currently being processed by crawler
  • Success: Processing completed successfully (updated by this component)
  • Failed: Processing failed with errors (handled by Crawler Integration component)
  • Error: System error occurred during processing
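
A minimal sketch of CrawlStatus as a PHP backed enum; the case names are from the list above, and since only Success = 2 is stated elsewhere in this document, the other backing values are assumptions:

<?php
enum CrawlStatus: int
{
    case NotCrawled = 0; // assumed value
    case Crawling   = 1; // assumed value
    case Success    = 2; // per the status-update step
    case Failed     = 3; // assumed value
    case Error      = 4; // assumed value
}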

These status values enable:

  • Accurate dashboard reporting on data processing status
  • Identification of problematic records for troubleshooting
  • Data integrity verification across the pipeline
  • Proper sequencing of downstream operations

Database Schema

Console Database Tables (gb_console)

erDiagram
    summary_wishlist_products {
        bigint id PK
        string input "The input of the product"
        string input_type "The type of the input: jan, asin, rakuten_id"
        bigint mall_id FK "Foreign key to malls table"
        integer crawl_status "The status of the crawling (default: New)"
        integer status "The status of the product (default: New)"
        timestamp updated_at
    }
    
    summary_wishlist_product_reviews {
        bigint id PK
        bigint summary_wishlist_product_id FK "Foreign key to summary_wishlist_products"
        integer crawl_status "The status of the crawling (default: New)"
        integer status "The status of the review (default: New)"
        timestamp updated_at
    }
    
    summary_wishlist_categories {
        bigint id PK
        string category_id "The id of the category in the mall"
        bigint mall_id FK "Foreign key to malls table"
        integer crawl_status "The status of the crawling (default: New)"
        integer status "The status of the category (default: New)"
        timestamp updated_at
    }
    
    summary_wishlist_search_queries {
        bigint id PK
        bigint mall_id FK "The id of the mall"
        string keyword "The keyword to search"
        integer crawl_status "The status of the crawling (default: New)"
        integer status "The status of the search query (default: New)"
        timestamp updated_at
    }
    
    summary_wishlist_products ||--o| summary_wishlist_product_reviews : "has one"

Analyzer Database Tables (gb_analyzer)

erDiagram
    products {
        bigint id PK
        integer mall_id "Mall identifier"
        string mall_product_id "Product ID from the mall"
        string jan_code "JAN code (nullable)"
        string input_type "Type of input: asin, jan, rakuten_id (default: asin)"
        timestamp crawl_created_at "When the product was crawled (nullable)"
    }
    
    reviews {
        bigint id PK
        integer mall_id "Mall identifier (nullable)"
        string mall_product_id "Product ID from mall (nullable)"
        string jan_code "JAN code (nullable, max 50 chars)"
        timestamp crawl_created_at "When the review was crawled (nullable)"
    }
    
    t_category_rankings {
        bigint id PK
        integer mall_id "Mall identifier (nullable)"
        string category_id "Category identifier (nullable)"
        timestamp crawl_created_at "When the ranking was crawled (nullable)"
    }
    
    t_sq_rankings {
        bigint id PK
        integer mall_id "Mall identifier (nullable)"
        string keyword "Search keyword (nullable, max 100 chars)"
        timestamp crawl_created_at "When the ranking was crawled (nullable)"
    }

Frequency Overview

Timeline

timeline
    title Analyzer Integration Schedule
    section Success Sync Operations
        Every 5 minutes<br>(e.g. 08.00, 08.05, etc.) : analyzerdb sync crawl success from analyzer

Note: This command is executed for each data type: SummaryProduct, SummaryProductReview, SummaryCategory, SummarySearchQuery

Expected Outcomes

When these commands execute successfully, the system delivers:

  • Automated Status Verification: Near-real-time verification (5-minute cadence) of crawl success status across the Console and Analyzer databases, ensuring accurate processing-state tracking
  • Cross-Database Data Integrity: The verification flow maintains consistency between the gb_console and gb_analyzer databases with time-constraint validation
  • Efficient Batch Processing: Asynchronous job queue processing with configurable batch sizes and specialized job classes for each data type
  • Robust Error Handling: Automatic retry mechanisms with exponential backoff, deadlock detection, and comprehensive logging for operational reliability
  • Status Synchronization: Accurate crawl_status updates from Crawling to Success state enabling proper downstream processing and dataset creation
  • Multi-Data Type Support: Simultaneous handling of products, reviews, categories, and search queries with type-specific lookup key generation
  • Performance Optimization: Repository pattern implementation with batch update operations and database transaction management for efficient resource utilization

Batch List

  • Sync Success: Commands that run every 5 minutes to verify whether records have been successfully processed by the Analyzer system and update their status