Dataset Overview
Description
The Dataset component creates and monitors the datasets used for analysis and reporting. It works through the TV Python API (Analyzer API) service: it processes product, category, and search query data from the Console database (gb_console), sends structured requests to the external API service, and monitors the generation status. The resulting datasets support trend analysis, pattern recognition, and specific viewpoint generation for wishlist groups with active subscriptions.
Overview System Diagram
---
config:
theme: base
layout: dagre
flowchart:
curve: linear
htmlLabels: true
themeVariables:
edgeLabelBackground: "transparent"
---
flowchart TB
%% Main database tables as separate entities
WishlistDB[(gb_console.wishlist_to_groups)]
DatasetHistoryDB[(gb_console.wishlist_dataset_histories)]
DatasetLogsDB[(gb_console.wishlist_dataset_creation_logs)]
PythonAPI((TV Python API))
subgraph Commands
CreateDatasetCmd[dataset:create]
GetStatusCmd[dataset:get-status]
end
subgraph ConsoleTables["Console Database Tables"]
direction TB
subgraph SummaryTables
direction LR
WishlistSearchProducts[summary_wishlist_products]
WishlistCategories[summary_wishlist_categories]
WishlistSearchQueries[summary_wishlist_search_queries]
end
subgraph ViewpointTables
direction LR
WishlistSpecificVP[wl_spec_vps]
WishlistCategoryVP[wl_cat_vps]
end
end
subgraph AnalyzerTables["Analyzer Database Tables"]
direction TB
subgraph ProductTables
direction LR
Products[products]
ReviewSentences[review_sentences]
end
subgraph CategoryTables
direction LR
CategoryRankings[category_rankings]
ProductCategoryRankings[product_category_rankings]
end
subgraph SearchTables
direction LR
SearchQueryRankings[search_query_rankings]
ProductSearchQueryRankings[product_search_query_rankings]
end
end
CreateDatasetCmd --- CreateDatasetCmdStep1A[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1A</span>
<p style='margin-top: 8px'>Retrieve Eligible Wishlist</p>
</div>
]
CreateDatasetCmdStep1A --> WishlistDB
CreateDatasetCmd --- CreateDatasetCmdStep2A[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2A</span>
<p style='margin-top: 8px'>Fetch Rankings Data</p>
</div>
]
CreateDatasetCmdStep2A --> AnalyzerTables
CreateDatasetCmd --- CreateDatasetCmdStep3A[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3A</span>
<p style='margin-top: 8px'>Send Dataset Request</p>
</div>
]
CreateDatasetCmdStep3A --> PythonAPI
CreateDatasetCmd --- CreateDatasetCmdStep4A[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4A</span>
<p style='margin-top: 8px'>Store Dataset ID</p>
</div>
]
CreateDatasetCmdStep4A --> DatasetHistoryDB
GetStatusCmd --- GetStatusCmdStep1B[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1B</span>
<p style='margin-top: 8px'>Get Pending Datasets</p>
</div>
]
GetStatusCmdStep1B --> DatasetHistoryDB
GetStatusCmd --- GetStatusCmdStep2B[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2B</span>
<p style='margin-top: 8px'>Check Status</p>
</div>
]
GetStatusCmdStep2B --> PythonAPI
GetStatusCmd --- GetStatusCmdStep3B[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3B</span>
<p style='margin-top: 8px'>Update Status</p>
</div>
]
GetStatusCmdStep3B --> DatasetHistoryDB
GetStatusCmd --- GetStatusCmdStep4B[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4B</span>
<p style='margin-top: 8px'>Log Events</p>
</div>
]
GetStatusCmdStep4B --> DatasetLogsDB
%% Main relationships
WishlistDB -.-> DatasetHistoryDB
WishlistDB -.-> ConsoleTables
DatasetHistoryDB -.-> DatasetLogsDB
%% Relationship to analyzer data
ConsoleTables -.-> AnalyzerTables
%% Database prefixes in label text
classDef consoleLabel text-anchor:start,fill:#fcf3d2
classDef analyzerLabel text-anchor:start,fill:#d2e3fc
class ConsoleTables consoleLabel
class AnalyzerTables analyzerLabel
style WishlistDB fill:#ffe6cc,stroke:#ff9900,stroke-width:2px
style DatasetHistoryDB fill:#d9f2d9,stroke:#339933,stroke-width:1px
style DatasetLogsDB fill:#d9f2d9,stroke:#339933,stroke-width:1px
style PythonAPI fill:#fcd9d9,stroke-width:2px
style CreateDatasetCmd fill:#d9f2d9
style GetStatusCmd fill:#d9f2d9
style ConsoleTables fill:#fcf3d2
style AnalyzerTables fill:#d2e3fc
style Commands fill:#f9f9f9
style SummaryTables fill:#fff9db
style ViewpointTables fill:#fff9db
style ProductTables fill:#e6f0ff
style CategoryTables fill:#e6f0ff
style SearchTables fill:#e6f0ff
style CreateDatasetCmdStep1A fill:transparent,stroke:transparent,stroke-width:1px
style CreateDatasetCmdStep2A fill:transparent,stroke:transparent,stroke-width:1px
style CreateDatasetCmdStep3A fill:transparent,stroke:transparent,stroke-width:1px
style CreateDatasetCmdStep4A fill:transparent,stroke:transparent,stroke-width:1px
style GetStatusCmdStep1B fill:transparent,stroke:transparent,stroke-width:1px
style GetStatusCmdStep2B fill:transparent,stroke:transparent,stroke-width:1px
style GetStatusCmdStep3B fill:transparent,stroke:transparent,stroke-width:1px
style GetStatusCmdStep4B fill:transparent,stroke:transparent,stroke-width:1px
Database Schema
Console Database Tables (gb_console)
erDiagram
wishlist_to_groups {
bigint id PK
bigint subscription_id FK "Foreign key to subscriptions table (used for eligibility check)"
string name "Name of the wishlist group (used for dataset naming and logging)"
string slug "URL-friendly identifier (used for specific wishlist processing)"
tinyint training_schedule "Training schedule enum value (used for manual vs auto processing)"
timestamp manual_request_dataset_at "Manual dataset request timestamp (used for manual processing timing)"
integer status "Active or inactive status (filtered in buildWishlistQuery)"
tinyint admin_status "Admin approval status (filtered in buildWishlistQuery)"
text error_message "Error message for failed operations (updated on validation failures)"
timestamp created_at
timestamp updated_at
timestamp deleted_at "Soft delete timestamp (filtered in buildWishlistQuery)"
}
wishlist_dataset_histories {
bigint id PK
bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for dataset tracking)"
string dataset_id "Dataset ID from TV Python API (used for status checking)"
integer status "Dataset status enum value (updated by get-status command)"
text error_message "Error message for failed datasets (updated on failure)"
json config "Configuration used for dataset creation (stores product_ids for batch jobs)"
boolean batch_job_created "Whether batch job was created (updated after completion)"
string batch_job_id "Batch job identifier (updated after batch job creation)"
timestamp created_at "Used for schedule interval calculations"
timestamp updated_at
}
wishlist_dataset_creation_logs {
bigint id PK
string wishlist_to_group_name "Name of the wishlist group (copied for logging)"
bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for event tracking)"
bigint wishlist_dataset_history_id FK "Foreign key to wishlist_dataset_histories (used for event linking)"
string event_type "Event type enum value (request, success, failure)"
json data "Event data (stores request/response payloads)"
text message "Event message (stores error details)"
timestamp created_at
timestamp updated_at
}
summary_wishlist_products {
bigint id PK
bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for data collection)"
string input "The input of the product (used for product ID extraction)"
string input_type "The type of the input: jan, asin, rakuten_id (used for product matching)"
integer mall_id "Foreign key to malls table (used for product filtering)"
integer crawl_status "The status of the crawling (used for success validation)"
timestamp updated_at "Used for time constraint validation"
}
summary_wishlist_categories {
bigint id PK
bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for data collection)"
string category_id "The category identifier (used for ranking data collection)"
integer mall_id "Foreign key to malls table (used for category filtering)"
integer crawl_status "The status of the crawling (used for success validation)"
timestamp updated_at "Used for time constraint validation"
}
summary_wishlist_search_queries {
bigint id PK
bigint wishlist_to_group_id FK "Foreign key to wishlist_to_groups (used for data collection)"
integer mall_id "Foreign key to malls table (used for search filtering)"
string keyword "The search keyword (used for ranking data collection)"
integer crawl_status "The status of the crawling (used for success validation)"
timestamp updated_at "Used for time constraint validation"
}
wl_spec_vps {
bigint id PK
bigint wishlist_id "Foreign key to wishlist (used for viewpoint association)"
string name "Viewpoint name (used for batch job creation)"
text description "Viewpoint description (used for configuration)"
timestamp created_at
timestamp updated_at
}
wl_cat_vps {
bigint id PK
bigint wishlist_category_id "Foreign key to wishlist categories (used for viewpoint linking)"
bigint wishlist_id "Foreign key to wishlist (used for viewpoint association)"
timestamp created_at
timestamp updated_at
}
wishlist_to_groups ||--o{ wishlist_dataset_histories : "has many"
wishlist_to_groups ||--o{ wishlist_dataset_creation_logs : "has many"
wishlist_dataset_histories ||--o{ wishlist_dataset_creation_logs : "has many"
wishlist_to_groups ||--o{ summary_wishlist_products : "has many"
wishlist_to_groups ||--o{ summary_wishlist_categories : "has many"
wishlist_to_groups ||--o{ summary_wishlist_search_queries : "has many"
Analyzer Database Tables (gb_analyzer)
erDiagram
products {
bigint id PK
integer mall_id "Mall identifier (used for product filtering in dataset collection)"
string mall_product_id "Product ID from the mall (used for ranking data collection)"
string jan_code "JAN code (used for product ID enrichment)"
string input_type "Type of input: asin, jan, rakuten_id (used for product matching)"
string unique_key "Unique identifier (used for product deduplication)"
timestamp created_at
timestamp updated_at
timestamp crawl_created_at "When the product was crawled (used for time constraints)"
}
category_rankings {
bigint id PK
string mall_category_id "Category ID from the mall (used for category ranking collection)"
integer mall_id "Mall identifier (used for category filtering)"
string unique_key "Unique identifier (used for category deduplication)"
timestamp crawl_created_at "When the category was crawled (used for latest data selection)"
timestamp created_at
timestamp updated_at
}
product_category_rankings {
bigint id PK
bigint category_ranking_id FK "Foreign key to category_rankings (used for ranking data collection)"
string mall_product_id "Product ID from the mall (used for product ranking collection)"
tinyint mall_id "Mall identifier (used for ranking filtering)"
timestamp crawl_created_at "When the ranking was crawled (used for latest data selection)"
timestamp created_at
timestamp updated_at
}
search_query_rankings {
bigint id PK
string keyword "Search keyword (used for search ranking collection)"
timestamp crawl_created_at "When the search query was crawled (used for latest data selection)"
timestamp created_at
timestamp updated_at
}
product_search_query_rankings {
bigint id PK
bigint search_query_ranking_id FK "Foreign key to search_query_rankings (used for ranking data collection)"
string mall_product_id "Product ID from the mall (used for product ranking collection)"
tinyint mall_id "Mall identifier (used for ranking filtering)"
timestamp crawl_created_at "When the ranking was crawled (used for latest data selection)"
timestamp created_at
timestamp updated_at
}
review_sentences {
bigint id PK
string mall_product_id "Product ID from the mall (used for review sentence validation)"
text content "Sentence content extracted from review (used for dataset validation)"
timestamp created_at
timestamp updated_at
}
category_rankings ||--o{ product_category_rankings : "has many"
search_query_rankings ||--o{ product_search_query_rankings : "has many"
Dataset Command Logic Fields
Dataset Create Command Fields Used:
Wishlist Eligibility Filtering:
- wishlist_to_groups.id: Primary key for dataset association
- wishlist_to_groups.subscription_id: Used to join with active subscriptions
- wishlist_to_groups.status: Filtered for Active status
- wishlist_to_groups.admin_status: Filtered for Active admin status
- wishlist_to_groups.training_schedule: Used for manual vs automatic processing logic
- wishlist_to_groups.manual_request_dataset_at: Used for manual processing timing
- wishlist_to_groups.name: Used for dataset naming and logging
- wishlist_to_groups.slug: Used for specific wishlist processing
- wishlist_to_groups.deleted_at: Filtered for non-deleted records
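The eligibility filters listed above can be sketched as a plain predicate. This is an illustrative Python sketch, not the component's code: the `ACTIVE` value, the `WishlistGroup` dataclass, and the `active_subscription_ids` set are assumptions made for the example. The actual filtering is done in `buildWishlistQuery`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

ACTIVE = 1  # assumed enum value; the real status constants are not given in this document


@dataclass
class WishlistGroup:
    """Hypothetical in-memory stand-in for a wishlist_to_groups row."""
    id: int
    subscription_id: Optional[int]
    status: int
    admin_status: int
    deleted_at: Optional[datetime] = None


def is_eligible(group: WishlistGroup, active_subscription_ids: set) -> bool:
    """Apply the documented filters: active status, active admin status,
    not soft-deleted, and linked to an active subscription."""
    return (
        group.status == ACTIVE
        and group.admin_status == ACTIVE
        and group.deleted_at is None
        and group.subscription_id in active_subscription_ids
    )


groups = [
    WishlistGroup(id=1, subscription_id=10, status=ACTIVE, admin_status=ACTIVE),
    WishlistGroup(id=2, subscription_id=11, status=0, admin_status=ACTIVE),
    WishlistGroup(id=3, subscription_id=10, status=ACTIVE, admin_status=ACTIVE,
                  deleted_at=datetime(2024, 1, 1)),
]
# Only subscription 10 is treated as active in this example.
eligible_ids = [g.id for g in groups if is_eligible(g, {10})]
```

Here only group 1 passes: group 2 is inactive and group 3 is soft-deleted.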
Data Collection and Validation:
- summary_wishlist_products.input, input_type, mall_id: Used for product ID extraction
- summary_wishlist_categories.category_id, mall_id: Used for category ranking collection
- summary_wishlist_search_queries.keyword, mall_id: Used for search ranking collection
- summary_*.crawl_status: Used for success count validation
- summary_*.updated_at: Used for time constraint validation
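As a rough illustration of the success count and time constraint checks, the Python sketch below counts summary rows that crawled successfully and were updated recently enough. The constants `CRAWL_SUCCESS`, `MIN_SUCCESS_COUNT`, and `MAX_AGE` are placeholder values chosen for the example; this document does not state the real thresholds.

```python
from datetime import datetime, timedelta

CRAWL_SUCCESS = 1               # assumed enum value for a successful crawl
MIN_SUCCESS_COUNT = 2           # assumed minimum number of usable rows
MAX_AGE = timedelta(days=30)    # assumed freshness window


def count_usable(rows, now):
    """Count rows that crawled successfully and were updated within MAX_AGE."""
    return sum(
        1
        for row in rows
        if row["crawl_status"] == CRAWL_SUCCESS and now - row["updated_at"] <= MAX_AGE
    )


def has_enough_data(rows, now):
    """Return True when enough usable rows exist to justify a dataset request."""
    return count_usable(rows, now) >= MIN_SUCCESS_COUNT


now = datetime(2024, 6, 1)
rows = [
    {"crawl_status": CRAWL_SUCCESS, "updated_at": datetime(2024, 5, 20)},
    {"crawl_status": CRAWL_SUCCESS, "updated_at": datetime(2024, 1, 1)},  # too old
    {"crawl_status": 0, "updated_at": datetime(2024, 5, 30)},             # crawl not successful
]
usable = count_usable(rows, now)
```

With these sample rows only one row is usable, so `has_enough_data(rows, now)` is false and no request would be sent.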
Dataset Creation:
- wishlist_dataset_histories.wishlist_to_group_id: Links dataset to wishlist
- wishlist_dataset_histories.dataset_id: Stores API response dataset ID
- wishlist_dataset_histories.status: Stores initial dataset status
- wishlist_dataset_histories.config: Stores dataset configuration
- wishlist_dataset_histories.created_at: Used for schedule interval calculations
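The schedule interval calculation based on `created_at` might look like the following Python sketch. The function name `is_dataset_due` and the 7-day interval are hypothetical; the mapping from `training_schedule` values to intervals is not described in this document.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_dataset_due(last_created_at: Optional[datetime],
                   interval: timedelta,
                   now: datetime) -> bool:
    """Return True when no dataset exists yet or the last one is older than the interval."""
    if last_created_at is None:
        return True
    return now - last_created_at >= interval


interval = timedelta(days=7)  # assumed interval, for illustration only
now = datetime(2024, 6, 15, 8, 0)

due_first_time = is_dataset_due(None, interval, now)
due_recent = is_dataset_due(datetime(2024, 6, 12, 8, 0), interval, now)
due_old = is_dataset_due(datetime(2024, 6, 1, 8, 0), interval, now)
```

In this example a wishlist with no history, or with a dataset created two weeks ago, is due; one created three days ago is not.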
Dataset Get-Status Command Fields Used:
Status Monitoring:
- wishlist_dataset_histories.id: Primary key for updates
- wishlist_dataset_histories.dataset_id: Used for API status queries
- wishlist_dataset_histories.status: Updated with new status from API
- wishlist_dataset_histories.wishlist_to_group_id: Used for logging and notifications
- wishlist_dataset_histories.batch_job_created: Updated when batch jobs are created
- wishlist_dataset_histories.batch_job_id: Stores batch job identifier
- wishlist_dataset_histories.config: Used to extract product_ids for batch jobs
- wishlist_dataset_histories.error_message: Updated on failure
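The status update step can be sketched as a small state machine over the Pending, Analyzing, Complete, and Failed states. This Python sketch is illustrative only: the enum's string values are placeholders (the schema stores `status` as an integer), the `history` dict stands in for a `wishlist_dataset_histories` row, and allowing Pending to jump directly to Complete or Failed is an assumption made because the status is only polled periodically.

```python
from enum import Enum


class DatasetStatus(Enum):
    PENDING = "pending"
    ANALYZING = "analyzing"
    COMPLETE = "complete"
    FAILED = "failed"


# Allowed transitions; Complete and Failed are treated as terminal states.
ALLOWED = {
    DatasetStatus.PENDING: {DatasetStatus.ANALYZING, DatasetStatus.COMPLETE, DatasetStatus.FAILED},
    DatasetStatus.ANALYZING: {DatasetStatus.COMPLETE, DatasetStatus.FAILED},
    DatasetStatus.COMPLETE: set(),
    DatasetStatus.FAILED: set(),
}


def apply_status(history: dict, new_status: DatasetStatus, error_message=None) -> bool:
    """Update the history row in place; return True only if the status changed."""
    current = history["status"]
    if new_status == current or new_status not in ALLOWED[current]:
        return False
    history["status"] = new_status
    if new_status is DatasetStatus.FAILED:
        history["error_message"] = error_message
    return True


history = {"status": DatasetStatus.PENDING, "error_message": None}
changed_to_analyzing = apply_status(history, DatasetStatus.ANALYZING)
changed_to_failed = apply_status(history, DatasetStatus.FAILED, "analysis failed")
changed_after_terminal = apply_status(history, DatasetStatus.COMPLETE)
```

Returning a boolean lets the caller log an event or send a notification only when the status actually changed.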
Batch Job Creation:
- wl_spec_vps: Used for specific viewpoint batch job creation
- wl_cat_vps: Used for category viewpoint associations
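As a hedged illustration of how `config`, `status`, and `batch_job_created` could work together, the Python sketch below extracts `product_ids` from the stored JSON config once a dataset has completed. The helper names, the `"complete"` status string, and the sample product IDs are made up for the example; only the `product_ids` key comes from this document.

```python
import json


def product_ids_from_config(config_json):
    """Return the product_ids stored in a history row's config, or an empty list."""
    config = json.loads(config_json) if config_json else {}
    return list(config.get("product_ids", []))


def needs_batch_job(history):
    """Assume a batch job is created once, after the dataset has completed."""
    return history["status"] == "complete" and not history["batch_job_created"]


history = {
    "status": "complete",            # placeholder value; the real column is an integer enum
    "batch_job_created": False,
    "config": '{"product_ids": ["B000TEST01", "4901234567894"]}',  # made-up IDs
}
ids = product_ids_from_config(history["config"]) if needs_batch_job(history) else []
```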
API Integration
The system integrates with the TV Python API (Analyzer API) service:
- AnalyzerApiService: Service class for API communication
- Dataset Creation Endpoint: Accepts structured data and returns dataset ID
- Status Check Endpoint: Provides current status of dataset processing
- Error Handling: Manages API failures and retry logic
- Configuration: Uses config/analyzer_api.php for API settings
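The retry behaviour mentioned under Error Handling could follow a pattern like the one below. This is a generic Python sketch of retry with exponential backoff, not the `AnalyzerApiService` implementation: `ApiError`, `call_with_retry`, the attempt count, and the fake transport are all hypothetical, and no real endpoint is called.

```python
import time
from typing import Callable


class ApiError(Exception):
    """Raised by the (hypothetical) transport when a request fails."""


def call_with_retry(send: Callable[[], dict], attempts: int = 3, base_delay: float = 0.5) -> dict:
    """Call send() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return send()
        except ApiError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    raise last_error


calls = {"n": 0}


def flaky_create_dataset() -> dict:
    """Fake transport: fails twice, then returns a made-up dataset ID."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ApiError("temporary failure")
    return {"dataset_id": "ds-123"}


# base_delay=0.0 keeps the example instantaneous.
result = call_with_retry(flaky_create_dataset, attempts=3, base_delay=0.0)
```

Passing the request as a callable keeps the retry logic independent of the HTTP client, so the same wrapper could serve both the creation and status check calls.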
Frequency Overview
Timeline
timeline
title Dataset Operations Schedule
section Creation Process
Every 30 minutes<br>(Ex. 08.00, 08.30, etc.) : dataset create
section Status Monitoring
Every 5 minutes<br>(Ex. 08.00, 08.05, etc.) : dataset get-status
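The two intervals above would correspond to the standard cron expressions `*/30 * * * *` and `*/5 * * * *`. The short Python sketch below shows which commands fall due at a given minute. The `SCHEDULE` table and `due_commands` helper are illustrative assumptions; the command names are taken from the system diagram.

```python
from datetime import datetime

# Command name -> interval in minutes, as described in the timeline.
SCHEDULE = {"dataset:create": 30, "dataset:get-status": 5}


def due_commands(now: datetime) -> list:
    """Return the commands whose interval divides the current minute of the hour."""
    return [name for name, every in SCHEDULE.items() if now.minute % every == 0]


at_0800 = due_commands(datetime(2024, 6, 1, 8, 0))
at_0805 = due_commands(datetime(2024, 6, 1, 8, 5))
at_0807 = due_commands(datetime(2024, 6, 1, 8, 7))
```

At 08:00 both commands run, at 08:05 only the status check runs, and at 08:07 neither does.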
Expected Outcomes
When these commands execute successfully, the system delivers:
- Automated Dataset Generation: Regular creation of datasets for active wishlist groups with subscriptions, ensuring continuous data availability for analysis
- Comprehensive Data Integration: Aggregation of product rankings, category data, search query results, and viewpoint information from the Console and Analyzer databases
- External API Orchestration: Communication with the TV Python API service, including request formatting, response handling, and error recovery
- Status Tracking and Monitoring: Monitoring of dataset creation progress every 5 minutes, with status transitions from Pending to Analyzing to Complete or Failed
- Audit Trail Management: Logging of dataset operations, including creation events, status changes, and error conditions, for troubleshooting and compliance
- Notification System Integration: Automated alerts via Slack for administrators and real-time updates via Pusher for end users on dataset completion or failure
- Subscription-Based Processing: Filtering to process only eligible wishlists with active subscriptions and proper configuration settings
- Data Quality Assurance: Validation of minimum thresholds and data availability before dataset creation, to ensure meaningful analysis results
Batch List
| Name | Description |
|---|---|
| Dataset Create | Command that runs every 30 minutes to identify eligible wishlist groups, collect analyzed data, and send dataset creation requests to the TV Python API |
| Dataset Get Status | Command that runs every 5 minutes to monitor pending dataset generation processes, update status records, and trigger completion notifications |