Dataset Overview

Description

The Dataset component creates and monitors the datasets used for analysis and reporting through the TV Python API (Analyzer API) service. It reads the product, category, and search query entries of each wishlist group from the Console database (gb_console), collects the matching ranking data from the Analyzer database (gb_analyzer), sends structured requests to the external API service, and monitors the generation status. The resulting datasets enable trend analysis, pattern recognition, and specific viewpoint generation for wishlist groups with active subscriptions.

System Overview Diagram

---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart TB
    %% Main database tables as separate entities
    WishlistDB[(gb_console.wishlist_to_groups)]
    DatasetHistoryDB[(gb_console.wishlist_dataset_histories)]
    DatasetLogsDB[(gb_console.wishlist_dataset_creation_logs)]
    PythonAPI((TV Python API))
    
    subgraph Commands
        CreateDatasetCmd[dataset:create]
        GetStatusCmd[dataset:get-status]
    end
    
    subgraph ConsoleTables["Console Database Tables"]
        direction TB
        
        subgraph SummaryTables
            direction LR
            WishlistSearchProducts[summary_wishlist_products]
            WishlistCategories[summary_wishlist_categories]
            WishlistSearchQueries[summary_wishlist_search_queries]
        end
        
        subgraph ViewpointTables
            direction LR
            WishlistSpecificVP[wl_spec_vps]
            WishlistCategoryVP[wl_cat_vps]
        end
    end
    
    subgraph AnalyzerTables["Analyzer Database Tables"]
        direction TB
        
        subgraph ProductTables
            direction LR
            Products[products]
            ReviewSentences[review_sentences]
        end
        
        subgraph CategoryTables
            direction LR
            CategoryRankings[category_rankings]
            ProductCategoryRankings[product_category_rankings]
        end
        
        subgraph SearchTables
            direction LR
            SearchQueryRankings[search_query_rankings]
            ProductSearchQueryRankings[product_search_query_rankings]
        end
    end
    
    CreateDatasetCmd --- CreateDatasetCmdStep1A[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1A</span>
            <p style='margin-top: 8px'>Retrieve Eligible Wishlist</p>
        </div>
    ]
    CreateDatasetCmdStep1A --> WishlistDB

    CreateDatasetCmd --- CreateDatasetCmdStep2A[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2A</span>
            <p style='margin-top: 8px'>Fetch Rankings Data</p>
        </div>
    ]
    CreateDatasetCmdStep2A --> AnalyzerTables

    CreateDatasetCmd --- CreateDatasetCmdStep3A[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3A</span>
            <p style='margin-top: 8px'>Send Dataset Request</p>
        </div>
    ]
    CreateDatasetCmdStep3A --> PythonAPI

    CreateDatasetCmd --- CreateDatasetCmdStep4A[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4A</span>
            <p style='margin-top: 8px'>Store Dataset ID</p>
        </div>
    ]
    CreateDatasetCmdStep4A --> DatasetHistoryDB

    GetStatusCmd --- GetStatusCmdStep1B[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1B</span>
            <p style='margin-top: 8px'>Get Pending Datasets</p>
        </div>
    ]
    GetStatusCmdStep1B --> DatasetHistoryDB

    GetStatusCmd --- GetStatusCmdStep2B[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2B</span>
            <p style='margin-top: 8px'>Check Status</p>
        </div>
    ]
    GetStatusCmdStep2B --> PythonAPI

    GetStatusCmd --- GetStatusCmdStep3B[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3B</span>
            <p style='margin-top: 8px'>Update Status</p>
        </div>
    ]
    GetStatusCmdStep3B --> DatasetHistoryDB

    GetStatusCmd --- GetStatusCmdStep4B[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4B</span>
            <p style='margin-top: 8px'>Log Events</p>
        </div>
    ]
    GetStatusCmdStep4B --> DatasetLogsDB
    
    %% Main relationships
    WishlistDB -.-> DatasetHistoryDB
    WishlistDB -.-> ConsoleTables
    DatasetHistoryDB -.-> DatasetLogsDB
    
    %% Relationship to analyzer data
    ConsoleTables -.-> AnalyzerTables
    
    %% Database prefixes in label text
    classDef consoleLabel text-anchor:start,fill:#fcf3d2
    classDef analyzerLabel text-anchor:start,fill:#d2e3fc
    
    class ConsoleTables consoleLabel
    class AnalyzerTables analyzerLabel
    
    style WishlistDB fill:#ffe6cc,stroke:#ff9900,stroke-width:2px
    style DatasetHistoryDB fill:#d9f2d9,stroke:#339933,stroke-width:1px
    style DatasetLogsDB fill:#d9f2d9,stroke:#339933,stroke-width:1px
    style PythonAPI fill:#fcd9d9,stroke-width:2px
    style CreateDatasetCmd fill:#d9f2d9
    style GetStatusCmd fill:#d9f2d9
    style ConsoleTables fill:#fcf3d2
    style AnalyzerTables fill:#d2e3fc
    style Commands fill:#f9f9f9
    style SummaryTables fill:#fff9db
    style ViewpointTables fill:#fff9db
    style ProductTables fill:#e6f0ff
    style CategoryTables fill:#e6f0ff
    style SearchTables fill:#e6f0ff
    style CreateDatasetCmdStep1A fill:transparent,stroke:transparent,stroke-width:1px
    style CreateDatasetCmdStep2A fill:transparent,stroke:transparent,stroke-width:1px
    style CreateDatasetCmdStep3A fill:transparent,stroke:transparent,stroke-width:1px
    style CreateDatasetCmdStep4A fill:transparent,stroke:transparent,stroke-width:1px
    style GetStatusCmdStep1B fill:transparent,stroke:transparent,stroke-width:1px
    style GetStatusCmdStep2B fill:transparent,stroke:transparent,stroke-width:1px
    style GetStatusCmdStep3B fill:transparent,stroke:transparent,stroke-width:1px
    style GetStatusCmdStep4B fill:transparent,stroke:transparent,stroke-width:1px
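
The create flow in the diagram (steps 1A to 4A) can be sketched roughly as follows. This is a minimal in-memory illustration, not the component's actual code: `run_create`, `collect_payload`, `send_request`, the string status "pending", and the dict-shaped rows are all assumptions made for the example.

```python
def run_create(groups, collect_payload, send_request, histories):
    """One dataset:create pass (steps 1A to 4A) over in-memory data."""
    for group in groups:  # 1A: groups are assumed to be pre-filtered as eligible
        payload = collect_payload(group)  # 2A: gather ranking data
        if payload is None:
            continue  # nothing usable to send for this group
        response = send_request(payload)  # 3A: call the API
        histories.append(  # 4A: keep the dataset ID for later status checks
            {
                "wishlist_to_group_id": group["id"],
                "dataset_id": response["dataset_id"],
                "status": "pending",  # hypothetical label; the real column is an integer enum
                "config": {"product_ids": payload["product_ids"]},
            }
        )
    return histories


groups = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]


def collect_payload(group):
    # Pretend only group 1 has usable ranking data.
    if group["id"] != 1:
        return None
    return {"name": group["name"], "product_ids": ["p1"]}


def send_request(payload):
    # Stand-in for the real API call.
    return {"dataset_id": f"ds-{payload['name']}"}


histories = run_create(groups, collect_payload, send_request, [])
print(histories)
```

Passing the collector and sender in as callables keeps the sketch free of database and network dependencies.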

Database Schema

Console Database Tables (gb_console)

erDiagram
    wishlist_to_groups {
        bigint id PK
        bigint subscription_id FK "Foreign key to subscriptions table (used for eligibility check)"
        string name "Name of the wishlist group (used for dataset naming and logging)"
        string slug "URL-friendly identifier (used for specific wishlist processing)"
        tinyint training_schedule "Training schedule enum value (used for manual vs auto processing)"
        timestamp manual_request_dataset_at "Manual dataset request timestamp (used for manual processing timing)"
        integer status "Active or inactive status (filtered in buildWishlistQuery)"
        tinyint admin_status "Admin approval status (filtered in buildWishlistQuery)"
        text error_message "Error message for failed operations (updated on validation failures)"
        timestamp created_at
        timestamp updated_at
        timestamp deleted_at "Soft delete timestamp (filtered in buildWishlistQuery)"
    }
    
    wishlist_dataset_histories {
        bigint id PK
        bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for dataset tracking)"
        string dataset_id "Dataset ID from TV Python API (used for status checking)"
        integer status "Dataset status enum value (updated by get-status command)"
        text error_message "Error message for failed datasets (updated on failure)"
        json config "Configuration used for dataset creation (stores product_ids for batch jobs)"
        boolean batch_job_created "Whether batch job was created (updated after completion)"
        string batch_job_id "Batch job identifier (updated after batch job creation)"
        timestamp created_at "Used for schedule interval calculations"
        timestamp updated_at
    }
    
    wishlist_dataset_creation_logs {
        bigint id PK
        string wishlist_to_group_name "Name of the wishlist group (copied for logging)"
        bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for event tracking)"
        bigint wishlist_dataset_history_id FK "Foreign key to wishlist_dataset_histories (used for event linking)"
        string event_type "Event type enum value (request, success, failure)"
        json data "Event data (stores request/response payloads)"
        text message "Event message (stores error details)"
        timestamp created_at
        timestamp updated_at
    }
    
    summary_wishlist_products {
        bigint id PK
        bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for data collection)"
        string input "The input of the product (used for product ID extraction)"
        string input_type "The type of the input: jan, asin, rakuten_id (used for product matching)"
        integer mall_id "Foreign key to malls table (used for product filtering)"
        integer crawl_status "The status of the crawling (used for success validation)"
        timestamp updated_at "Used for time constraint validation"
    }
    
    summary_wishlist_categories {
        bigint id PK
        bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for data collection)"
        string category_id "The category identifier (used for ranking data collection)"
        integer mall_id "Foreign key to malls table (used for category filtering)"
        integer crawl_status "The status of the crawling (used for success validation)"
        timestamp updated_at "Used for time constraint validation"
    }
    
    summary_wishlist_search_queries {
        bigint id PK
        bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for data collection)"
        integer mall_id "Foreign key to malls table (used for search filtering)"
        string keyword "The search keyword (used for ranking data collection)"
        integer crawl_status "The status of the crawling (used for success validation)"
        timestamp updated_at "Used for time constraint validation"
    }
    
    wl_spec_vps {
        bigint id PK
        bigint wishlist_id "Foreign key to wishlist (used for viewpoint association)"
        string name "Viewpoint name (used for batch job creation)"
        text description "Viewpoint description (used for configuration)"
        timestamp created_at
        timestamp updated_at
    }
    
    wl_cat_vps {
        bigint id PK
        bigint wishlist_category_id "Foreign key to wishlist categories (used for viewpoint linking)"
        bigint wishlist_id "Foreign key to wishlist (used for viewpoint association)"
        timestamp created_at
        timestamp updated_at
    }
    
    wishlist_to_groups ||--o{ wishlist_dataset_histories : "has many"
    wishlist_to_groups ||--o{ wishlist_dataset_creation_logs : "has many"
    wishlist_dataset_histories ||--o{ wishlist_dataset_creation_logs : "has many"
    wishlist_to_groups ||--o{ summary_wishlist_products : "has many"
    wishlist_to_groups ||--o{ summary_wishlist_categories : "has many"
    wishlist_to_groups ||--o{ summary_wishlist_search_queries : "has many"
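
As a rough illustration of how a row for wishlist_dataset_creation_logs might be assembled, the sketch below builds an in-memory record from the columns listed above. The three event types come from the event_type column comment; the `build_log_entry` helper and the dict-based row are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone
from enum import Enum


class EventType(str, Enum):
    # Values taken from the event_type column comment.
    REQUEST = "request"
    SUCCESS = "success"
    FAILURE = "failure"


def build_log_entry(group_id, group_name, history_id, event_type, data=None, message=None):
    """Return a dict shaped like one wishlist_dataset_creation_logs row (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    return {
        "wishlist_to_group_id": group_id,
        "wishlist_to_group_name": group_name,  # copied so the log stays readable on its own
        "wishlist_dataset_history_id": history_id,
        "event_type": EventType(event_type).value,  # raises ValueError for unknown types
        "data": json.dumps(data or {}),  # request/response payload
        "message": message,
        "created_at": now,
        "updated_at": now,
    }


entry = build_log_entry(7, "Kitchen goods", 42, "failure", {"status": 500}, "API timeout")
print(entry["event_type"], entry["data"])
```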

Analyzer Database Tables (gb_analyzer)

erDiagram
    products {
        bigint id PK
        integer mall_id "Mall identifier (used for product filtering in dataset collection)"
        string mall_product_id "Product ID from the mall (used for ranking data collection)"
        string jan_code "JAN code (used for product ID enrichment)"
        string input_type "Type of input: asin, jan, rakuten_id (used for product matching)"
        string unique_key "Unique identifier (used for product deduplication)"
        timestamp created_at
        timestamp updated_at
        timestamp crawl_created_at "When the product was crawled (used for time constraints)"
    }
    
    category_rankings {
        bigint id PK
        string mall_category_id "Category ID from the mall (used for category ranking collection)"
        integer mall_id "Mall identifier (used for category filtering)"
        string unique_key "Unique identifier (used for category deduplication)"
        timestamp crawl_created_at "When the category was crawled (used for latest data selection)"
        timestamp created_at
        timestamp updated_at
    }
    
    product_category_rankings {
        bigint id PK
        bigint category_ranking_id FK "Foreign key to category_rankings (used for ranking data collection)"
        string mall_product_id "Product ID from the mall (used for product ranking collection)"
        tinyint mall_id "Mall identifier (used for ranking filtering)"
        timestamp crawl_created_at "When the ranking was crawled (used for latest data selection)"
        timestamp created_at
        timestamp updated_at
    }
    
    search_query_rankings {
        bigint id PK
        string keyword "Search keyword (used for search ranking collection)"
        timestamp crawl_created_at "When the search query was crawled (used for latest data selection)"
        timestamp created_at
        timestamp updated_at
    }
    
    product_search_query_rankings {
        bigint id PK
        bigint search_query_ranking_id FK "Foreign key to search_query_rankings (used for ranking data collection)"
        string mall_product_id "Product ID from the mall (used for product ranking collection)"
        tinyint mall_id "Mall identifier (used for ranking filtering)"
        timestamp crawl_created_at "When the ranking was crawled (used for latest data selection)"
        timestamp created_at
        timestamp updated_at
    }
    
    review_sentences {
        bigint id PK
        string mall_product_id "Product ID from the mall (used for review sentence validation)"
        text content "Sentence content extracted from review (used for dataset validation)"
        timestamp created_at
        timestamp updated_at
    }
    
    category_rankings ||--o{ product_category_rankings : "has many"
    search_query_rankings ||--o{ product_search_query_rankings : "has many"
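
The column comments above say that crawl_created_at is used for latest data selection. One plausible reading is sketched below: keep the newest category_rankings row per mall and category, then collect the products linked to it. The function names and the in-memory rows are hypothetical, and the real queries may differ.

```python
from datetime import datetime


def latest_category_rankings(category_rankings):
    """Keep only the most recently crawled row per (mall_id, mall_category_id)."""
    latest = {}
    for row in category_rankings:
        key = (row["mall_id"], row["mall_category_id"])
        if key not in latest or row["crawl_created_at"] > latest[key]["crawl_created_at"]:
            latest[key] = row
    return list(latest.values())


def ranked_product_ids(category_rankings, product_category_rankings):
    """Collect mall_product_id values that belong to the latest ranking snapshots."""
    latest_ids = {row["id"] for row in latest_category_rankings(category_rankings)}
    return sorted(
        {
            row["mall_product_id"]
            for row in product_category_rankings
            if row["category_ranking_id"] in latest_ids
        }
    )


rankings = [
    {"id": 1, "mall_id": 1, "mall_category_id": "c-10", "crawl_created_at": datetime(2024, 1, 1)},
    {"id": 2, "mall_id": 1, "mall_category_id": "c-10", "crawl_created_at": datetime(2024, 1, 2)},
]
product_rankings = [
    {"category_ranking_id": 1, "mall_product_id": "old-item"},
    {"category_ranking_id": 2, "mall_product_id": "new-item"},
]
print(ranked_product_ids(rankings, product_rankings))  # ['new-item']
```

The same pattern would apply to search_query_rankings and product_search_query_rankings, keyed by keyword instead of category.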

Dataset Command Logic Fields

Dataset Create Command Fields Used:

Wishlist Eligibility Filtering:

  • wishlist_to_groups.id - Primary key for dataset association
  • wishlist_to_groups.subscription_id - Used to join with active subscriptions
  • wishlist_to_groups.status - Filtered for Active status
  • wishlist_to_groups.admin_status - Filtered for Active admin status
  • wishlist_to_groups.training_schedule - Used for manual vs automatic processing logic
  • wishlist_to_groups.manual_request_dataset_at - Used for manual processing timing
  • wishlist_to_groups.name - Used for dataset naming and logging
  • wishlist_to_groups.slug - Used for specific wishlist processing
  • wishlist_to_groups.deleted_at - Filtered for non-deleted records
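
The eligibility filters listed above could be expressed along these lines. This is a sketch under stated assumptions: the numeric enum values, the `WishlistGroup` dataclass, and the `eligible_groups` helper are hypothetical, and the set of active subscription IDs stands in for the join with the subscriptions table.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical enum values; the real ones live in the application code.
STATUS_ACTIVE = 1
ADMIN_STATUS_ACTIVE = 1


@dataclass
class WishlistGroup:
    id: int
    subscription_id: int
    status: int
    admin_status: int
    deleted_at: Optional[datetime] = None


def eligible_groups(groups, active_subscription_ids):
    """Keep groups that are active, admin-approved, not soft-deleted, and subscribed."""
    return [
        g
        for g in groups
        if g.status == STATUS_ACTIVE
        and g.admin_status == ADMIN_STATUS_ACTIVE
        and g.deleted_at is None
        and g.subscription_id in active_subscription_ids
    ]


groups = [
    WishlistGroup(1, 100, STATUS_ACTIVE, ADMIN_STATUS_ACTIVE),
    WishlistGroup(2, 101, STATUS_ACTIVE, ADMIN_STATUS_ACTIVE, deleted_at=datetime(2024, 1, 1)),
    WishlistGroup(3, 999, STATUS_ACTIVE, ADMIN_STATUS_ACTIVE),  # subscription not active
]
print([g.id for g in eligible_groups(groups, {100, 101})])  # [1]
```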

Data Collection and Validation:

  • summary_wishlist_products.input, input_type, mall_id - Used for product ID extraction
  • summary_wishlist_categories.category_id, mall_id - Used for category ranking collection
  • summary_wishlist_search_queries.keyword, mall_id - Used for search ranking collection
  • summary_*.crawl_status - Used for success count validation
  • summary_*.updated_at - Used for time constraint validation
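
The success count and time constraint checks might look roughly like the sketch below. The crawl status value, the minimum count, and the maximum age are placeholders chosen for the example; the source does not state the real thresholds.

```python
from datetime import datetime, timedelta

# Hypothetical values for illustration only.
CRAWL_SUCCESS = 2
MIN_SUCCESS_COUNT = 1
MAX_AGE = timedelta(days=7)


def count_usable(rows, now):
    """Count rows that crawled successfully and were updated within MAX_AGE."""
    return sum(
        1
        for row in rows
        if row["crawl_status"] == CRAWL_SUCCESS and now - row["updated_at"] <= MAX_AGE
    )


def has_enough_data(rows, now):
    """True when enough fresh, successfully crawled rows exist to build a dataset."""
    return count_usable(rows, now) >= MIN_SUCCESS_COUNT


now = datetime(2024, 1, 10)
rows = [
    {"crawl_status": CRAWL_SUCCESS, "updated_at": datetime(2024, 1, 9)},   # fresh and successful
    {"crawl_status": CRAWL_SUCCESS, "updated_at": datetime(2023, 12, 1)},  # too old
    {"crawl_status": 0, "updated_at": datetime(2024, 1, 9)},               # crawl not successful
]
print(count_usable(rows, now), has_enough_data(rows, now))  # 1 True
```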

Dataset Creation:

  • wishlist_dataset_histories.wishlist_to_group_id - Links dataset to wishlist
  • wishlist_dataset_histories.dataset_id - Stores API response dataset ID
  • wishlist_dataset_histories.status - Stores initial dataset status
  • wishlist_dataset_histories.config - Stores dataset configuration
  • wishlist_dataset_histories.created_at - Used for schedule interval calculations
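
The source says that created_at feeds the schedule interval calculation and that manual_request_dataset_at drives manual processing timing, but it does not give the rules. The sketch below shows one plausible interpretation; both helpers and the interval value are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due_automatic(last_created_at: Optional[datetime], interval: timedelta, now: datetime) -> bool:
    """Due when no dataset exists yet or the last one is at least one interval old."""
    return last_created_at is None or now - last_created_at >= interval


def is_due_manual(manual_request_at: Optional[datetime], last_created_at: Optional[datetime]) -> bool:
    """Due when a manual request exists and is newer than the last dataset."""
    if manual_request_at is None:
        return False
    return last_created_at is None or manual_request_at > last_created_at


now = datetime(2024, 1, 10, 8, 0)
weekly = timedelta(days=7)  # hypothetical interval
print(is_due_automatic(datetime(2024, 1, 1), weekly, now))   # True
print(is_due_automatic(datetime(2024, 1, 8), weekly, now))   # False
print(is_due_manual(datetime(2024, 1, 9), datetime(2024, 1, 8)))  # True
```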

Dataset Get-Status Command Fields Used:

Status Monitoring:

  • wishlist_dataset_histories.id - Primary key for updates
  • wishlist_dataset_histories.dataset_id - Used for API status queries
  • wishlist_dataset_histories.status - Updated with new status from API
  • wishlist_dataset_histories.wishlist_to_group_id - Used for logging and notifications
  • wishlist_dataset_histories.batch_job_created - Updated when batch jobs are created
  • wishlist_dataset_histories.batch_job_id - Stores batch job identifier
  • wishlist_dataset_histories.config - Used to extract product_ids for batch jobs
  • wishlist_dataset_histories.error_message - Updated on failure

Batch Job Creation:

  • wl_spec_vps - Used for specific viewpoint batch job creation
  • wl_cat_vps - Used for category viewpoint associations
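
A single get-status pass over the fields above could be sketched as follows. The string statuses, the `fetch_status` and `create_batch_job` callables, and the response shape are all hypothetical stand-ins; the real status column is an integer enum and the real API response format is not given in this document.

```python
def poll_pending(histories, fetch_status, create_batch_job, logs):
    """One get-status pass over in-memory history rows (steps 1B to 4B)."""
    for history in histories:
        if history["status"] not in ("pending", "analyzing"):
            continue  # 1B: only unfinished datasets are checked
        result = fetch_status(history["dataset_id"])  # 2B: ask the API
        history["status"] = result["status"]  # 3B: store the new status
        if result["status"] == "failed":
            history["error_message"] = result.get("error", "unknown error")
        elif result["status"] == "complete" and not history["batch_job_created"]:
            product_ids = history["config"].get("product_ids", [])
            history["batch_job_id"] = create_batch_job(product_ids)
            history["batch_job_created"] = True
        logs.append({"history_id": history["id"], "status": result["status"]})  # 4B


def make_history(history_id, dataset_id):
    return {
        "id": history_id,
        "dataset_id": dataset_id,
        "status": "pending",
        "config": {"product_ids": ["p1"]},
        "batch_job_created": False,
        "batch_job_id": None,
        "error_message": None,
    }


histories = [make_history(1, "ds-1"), make_history(2, "ds-2")]
fake_api = {"ds-1": {"status": "complete"}, "ds-2": {"status": "failed", "error": "bad input"}}
logs = []
poll_pending(histories, fake_api.__getitem__, lambda ids: f"job-{len(ids)}", logs)
print(histories[0]["batch_job_id"], histories[1]["error_message"], len(logs))
```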

API Integration

The system integrates with the TV Python API (Analyzer API) service:

  • AnalyzerApiService: Service class for API communication
  • Dataset Creation Endpoint: Accepts structured data and returns dataset ID
  • Status Check Endpoint: Provides current status of dataset processing
  • Error Handling: Manages API failures and retry logic
  • Configuration: Uses config/analyzer_api.php for API settings
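
The list above mentions retry logic without describing it. A common approach is exponential backoff, sketched here with an injected `send` callable so the example runs without network access. The function name, the exception type, the attempt count, and the delays are assumptions, not the behaviour of AnalyzerApiService.

```python
import time


class AnalyzerApiError(Exception):
    """Raised when the call still fails after all retries (hypothetical type)."""


def call_with_retry(send, payload, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(payload); retry on ConnectionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return send(payload)
        except ConnectionError as exc:
            if attempt == attempts:
                raise AnalyzerApiError(f"gave up after {attempts} attempts") from exc
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1.0s, ...


calls = {"n": 0}


def flaky_send(payload):
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"dataset_id": "ds-123"}


delays = []
response = call_with_retry(flaky_send, {"name": "demo"}, sleep=delays.append)
print(response["dataset_id"], delays)  # ds-123 [0.5, 1.0]
```

Injecting `sleep` lets the demo record the delays instead of actually waiting.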

Frequency Overview

Timeline

timeline
    title Dataset Operations Schedule
    section Creation Process
        Every 30 minutes<br>(Ex. 08.00, 08.30, etc.) : dataset create
    section Status Monitoring
        Every 5 minutes<br>(Ex. 08.00, 08.05, etc.) : dataset get-status
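
To make the cadence concrete, the sketch below computes the next aligned run time for an N-minute schedule. The `next_run` helper is hypothetical and assumes the interval divides 60 evenly, as 30 and 5 do; the actual scheduler configuration is not shown in this document.

```python
from datetime import datetime, timedelta


def next_run(now: datetime, every_minutes: int) -> datetime:
    """Return the next wall-clock slot strictly after `now` for an N-minute cadence."""
    start_of_hour = now.replace(minute=0, second=0, microsecond=0)
    slots_passed = now.minute // every_minutes + 1
    return start_of_hour + timedelta(minutes=slots_passed * every_minutes)


now = datetime(2024, 1, 1, 8, 7)
print(next_run(now, 30))  # 2024-01-01 08:30:00 (dataset create)
print(next_run(now, 5))   # 2024-01-01 08:10:00 (dataset get-status)
```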

Expected Outcomes

When these commands execute successfully, the system delivers:

  • Automated Dataset Generation: Regular creation of datasets for active wishlist groups with subscriptions, ensuring continuous data availability for analysis
  • Comprehensive Data Integration: Aggregation of product rankings, category data, search query results, and viewpoint information from the Console and Analyzer databases
  • External API Orchestration: Communication with the TV Python API service, including request formatting, response handling, and error recovery
  • Status Tracking and Monitoring: Monitoring of dataset creation progress every 5 minutes, with status transitions from Pending to Analyzing to Complete or Failed
  • Audit Trail Management: Logging of dataset operations, including creation events, status changes, and error conditions, for troubleshooting and compliance
  • Notification System Integration: Automated alerts via Slack for administrators and real-time updates via Pusher for end users on dataset completion or failure
  • Subscription-Based Processing: Filtering so that only eligible wishlists with active subscriptions and proper configuration settings are processed
  • Data Quality Assurance: Validation of minimum thresholds and data availability before dataset creation, so that analysis results are meaningful
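
The status progression named above (Pending, Analyzing, then Complete or Failed) can be modelled as a small state machine. This is a sketch: the enum values are placeholders for the integer enum stored in wishlist_dataset_histories.status, and allowing Pending to go straight to Failed is an assumption.

```python
from enum import Enum


class DatasetStatus(Enum):
    PENDING = "pending"
    ANALYZING = "analyzing"
    COMPLETE = "complete"
    FAILED = "failed"


# Allowed next states for each state; terminal states allow none.
ALLOWED = {
    DatasetStatus.PENDING: {DatasetStatus.ANALYZING, DatasetStatus.FAILED},
    DatasetStatus.ANALYZING: {DatasetStatus.COMPLETE, DatasetStatus.FAILED},
    DatasetStatus.COMPLETE: set(),
    DatasetStatus.FAILED: set(),
}


def transition(current: DatasetStatus, new: DatasetStatus) -> DatasetStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new


status = transition(DatasetStatus.PENDING, DatasetStatus.ANALYZING)
status = transition(status, DatasetStatus.COMPLETE)
print(status.name)  # COMPLETE
```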

Batch List

| Name | Description |
| --- | --- |
| Dataset Create | Command that runs every 30 minutes to identify eligible wishlist groups, collect analyzed data, and send dataset creation requests to the TV Python API |
| Dataset Get Status | Command that runs every 5 minutes to monitor pending dataset generation processes, update status records, and trigger completion notifications |