Dataset Workflow - Overview
Introduction
The Dataset Workflow is the end-to-end automation process that turns wishlist_to_group data into completed analysis results. This document summarizes the main workflow for dataset creation and processing.
Related Workflows
This main workflow builds upon several prerequisite processes that are documented separately:
- Temp Wishlist to Group → Wishlist to Group conversion workflow: Wishlist System Overview
- Detailed Wishlist Management: wishlist_products, wishlist_categories, wishlist_search_queries, wishlist_product_reviews → summary_wishlist_* tables
- Backend Console Commands: Detailed documentation available in trend-viewer-backend docs
- Processing Services (documentation in development):
  - analyzer_batch: Python ML/AI processing service
  - spvp_batch: Python Qwen-based specific viewpoint processing
Note: This overview focuses on the core dataset creation flow from wishlist_to_group to completed analysis results.
PLG API - Interface with Crawler
PLG API (Playground API) is an API gateway provided by the Crawler team for Trend Viewer (TV) interaction:
- Crawl Management: TV sends crawl requests through PLG API
- Data Validation: TV checks embedding/prediction status via PLG API
- Error Tracking: PLG API manages logs and failures in the crawler_v2 DB
- Success Storage: PLG API stores successful crawled data in the analyzer_v2 DB
PLG API is managed by Crawler and operates directly on crawler_v2 and analyzer_v2 databases.
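To make the interaction concrete, here is a minimal Python sketch of how trend-viewer-backend might call PLG API. Only the responsibilities (sending crawl configs, validating embedding/prediction status) come from this overview; the base URL, endpoint paths, and payload fields are hypothetical.

```python
import requests

# Hypothetical base URL; the real PLG API host is managed by the Crawler team.
PLG_BASE_URL = "https://plg-api.example.internal"

def send_crawl_config(wishlist_group_id: int, products: list[str]) -> dict:
    """Ask the Crawler to crawl the products behind a wishlist group."""
    resp = requests.post(
        f"{PLG_BASE_URL}/crawl-requests",  # hypothetical path
        json={"wishlist_group_id": wishlist_group_id, "products": products},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # request metadata ends up in crawler_v2.configs

def check_embedding_status(product_ids: list[str]) -> dict:
    """Validate embedding/prediction completeness before dataset creation."""
    resp = requests.get(
        f"{PLG_BASE_URL}/validation/embeddings",  # hypothetical path
        params={"product_ids": ",".join(product_ids)},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```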
Overview Diagram
---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart LR
%% User Input
User[User Input]
%% Core Systems
TrendApp[trend-viewer-app]
TrendAPI[trend-viewer-api]
TrendBackend[trend-viewer-backend]
AnalyzerBatch[analyzer_batch]
SPVPBatch[spvp_batch]
PLGApi[PLG API]
%% Databases
GBConsole[(gb_console DB)]
DSAnalyzer[(ds_analyzer DB)]
AnalyzerV2[(analyzer_v2 DB)]
CrawlerV2[(crawler_v2 DB)]
%% Flow with numbered steps - Steps 1 and 2 vertical
User --- Step1[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
<p style='margin-top: 8px'>Create Wishlist</p>
</div>
]
Step1 --> TrendApp
TrendApp --> TrendAPI
TrendAPI --> GBConsole
GBConsole --- Step2[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
<p style='margin-top: 8px'>Crawl Data</p>
</div>
]
Step2 --> TrendBackend
TrendBackend --> PLGApi
PLGApi --> AnalyzerV2
PLGApi --> CrawlerV2
%% Steps 3, 4, 5 horizontal flow (left to right)
TrendBackend --- Step3[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #cc6699 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
<p style='margin-top: 8px'>Create Dataset</p>
</div>
]
Step3 --> DSAnalyzer
TrendBackend --> Step4[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #ff9900 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4</span>
<p style='margin-top: 8px'>Analyze</p>
</div>
]
Step4 --> AnalyzerBatch
AnalyzerV2 --> AnalyzerBatch
AnalyzerBatch --> DSAnalyzer
DSAnalyzer --> SPVPBatch
AnalyzerBatch --> Step5[
<div style='text-align: center'>
<span style='display: inline-block; background-color: #cc3366 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>5</span>
<p style='margin-top: 8px'>SPVP Process</p>
</div>
]
Step5 --> SPVPBatch
SPVPBatch --> DSAnalyzer
%% Style step boxes as transparent
style Step1 fill:transparent,stroke:transparent,stroke-width:1px
style Step2 fill:transparent,stroke:transparent,stroke-width:1px
style Step3 fill:transparent,stroke:transparent,stroke-width:1px
style Step4 fill:transparent,stroke:transparent,stroke-width:1px
style Step5 fill:transparent,stroke:transparent,stroke-width:1px
Sequence Diagram
sequenceDiagram
participant User
participant TrendApp as trend-viewer-app
participant TrendAPI as trend-viewer-api
participant TrendBackend as trend-viewer-backend
participant PLG as PLG API
participant Analyzer as analyzer_batch
participant SPVP as spvp_batch
participant TVDB as TV DB
participant CrawlerDB as Crawler DB
participant Logger
participant Slack
Note over User,Slack: Dataset Workflow Complete Flow
rect rgb(200, 255, 200)
Note right of User: Happy Case - Phase 1: Wishlist Creation
User->>TrendApp: 1. Create wishlist_to_group
TrendApp->>TrendAPI: POST /api/v1/general/wishlist-to-group
TrendAPI->>TrendAPI: Validate and store data
TrendAPI->>TVDB: Store wishlist data
Note over TVDB: gb_console.wishlist_to_groups (status: 1 | Active)<br/>gb_console.summary_wishlist_* (crawl_status: 0 | New)
rect rgb(230, 200, 255)
Note right of TrendAPI: Success Monitoring
TrendAPI->>Logger: Log wishlist creation
TrendAPI->>Slack: Send creation notification
end
end
rect rgb(200, 230, 255)
Note right of TrendBackend: Happy Case - Phase 2: Data Crawling
TrendBackend->>TVDB: Query pending items
TVDB-->>TrendBackend: Return crawl queue
TrendBackend->>PLG: Send crawl config
PLG->>CrawlerDB: Store crawl requests
Note over CrawlerDB: crawler_v2.configs
rect rgb(255, 255, 200)
Note right of PLG: Optional - Multiple Products
PLG->>PLG: Execute crawling process
PLG->>CrawlerDB: Store successful crawled data
Note over CrawlerDB: analyzer_v2.products<br/>analyzer_v2.reviews<br/>analyzer_v2.review_sentences
PLG->>CrawlerDB: Update crawl status & errors
Note over CrawlerDB: crawler_v2.configs
end
PLG-->>TrendBackend: Return crawl status
TrendBackend->>TVDB: Update crawl_status: 0→1→2 (Success)
Note over TVDB: gb_console.summary_wishlist_*
end
rect rgb(255, 230, 200)
Note right of TrendBackend: Critical - Phase 3: Dataset Creation
TrendBackend->>TVDB: Check wishlist readiness
TVDB-->>TrendBackend: Return ready wishlists
TrendBackend->>PLG: Validate embedding & prediction
PLG->>CrawlerDB: Check data completeness
Note over CrawlerDB: analyzer_v2.products (embedding status)
CrawlerDB-->>PLG: Return validation status
PLG-->>TrendBackend: Validation completed
TrendBackend->>TVDB: Create dataset metadata
Note over TVDB: ds_analyzer.datasets (status: 1 | Pending, progress: 0)
TrendBackend->>TVDB: Create history record
Note over TVDB: gb_console.wishlist_dataset_histories (status: 1, spvp_status: 1)
TrendBackend->>Analyzer: Trigger analyzer_batch
Note over Analyzer: Google Cloud Run job started
end
rect rgb(200, 255, 200)
Note right of Analyzer: Happy Case - Phase 4: ML Analysis
Analyzer->>TVDB: Update status to Processing
Note over TVDB: ds_analyzer.datasets (status: 1→2 | Processing, progress: 0→25)
Analyzer->>CrawlerDB: Load source data directly
Note over CrawlerDB: analyzer_v2.products<br/>analyzer_v2.reviews<br/>analyzer_v2.review_sentences
CrawlerDB-->>Analyzer: Return products, reviews, sentences
Analyzer->>Analyzer: ML Processing
Note over Analyzer: - K-means Clustering (progress: 25→50)<br/>- OpenAI GPT-4 Labeling (progress: 50→75)<br/>- Product Similarity Calc (progress: 75→85)
Analyzer->>TVDB: Write analysis results (7 tables)
Note over TVDB: ds_analyzer.products<br/>ds_analyzer.product_details<br/>ds_analyzer.product_similarities<br/>ds_analyzer.ai_viewpoints<br/>ds_analyzer.review_sentence_aivp<br/>ds_analyzer.reviews<br/>ds_analyzer.review_sentences (progress: 85→100)
Analyzer->>TVDB: Complete analysis
Note over TVDB: ds_analyzer.datasets (status: 2→3 | Completed, progress: 100)
Analyzer-->>TrendBackend: Analysis completed notification
TrendBackend->>TVDB: Sync dataset status
Note over TVDB: gb_console.wishlist_dataset_histories<br/>(status: 3, spvp_status: 2 | Analyzing)
end
rect rgb(200, 230, 255)
Note right of TrendBackend: Happy Case - Phase 5: SPVP Processing
TrendBackend->>SPVP: Trigger spvp_batch
SPVP->>TVDB: Load review sentences
Note over TVDB: ds_analyzer.review_sentences
SPVP->>TVDB: Load specific viewpoints & categories
Note over TVDB: ds_analyzer.specific_viewpoints<br/>ds_analyzer.viewpoint_categories
TVDB-->>SPVP: Return sentences, viewpoints & categories
SPVP->>SPVP: Qwen mapping process
Note over SPVP: Qwen model maps<br/>specific_viewpoints ↔ review_sentences
SPVP->>TVDB: Store mapping results
Note over TVDB: ds_analyzer.review_sentence_spvp<br/>(sentence-viewpoint mappings)
SPVP->>TVDB: Update viewpoint progress
Note over TVDB: ds_analyzer.specific_viewpoints<br/>(last_object_id updated)
SPVP-->>TrendBackend: SPVP completed
TrendBackend->>TVDB: Final status update
Note over TVDB: gb_console.wishlist_dataset_histories<br/>(spvp_status: 2→3 | Completed)
rect rgb(230, 200, 255)
Note right of TrendBackend: Success Monitoring
TrendBackend->>Logger: Log workflow completion
TrendBackend->>Slack: Send success notification
TrendBackend-->>TrendAPI: Dataset ready notification
TrendAPI-->>TrendApp: WebSocket status update
TrendApp-->>User: Dataset available
end
end
rect rgb(255, 200, 200)
Note right of TrendBackend: Error Handling
rect rgb(255, 230, 230)
alt Crawl Failure
PLG->>CrawlerDB: Log crawl errors & failures
Note over CrawlerDB: crawler_v2.error_logs
PLG->>TVDB: Set crawl_status = 3 (Error)
Note over TVDB: gb_console.summary_wishlist_*
PLG->>Logger: Log crawl error
PLG->>Slack: Send crawl failure notification
else Analysis Failure
Analyzer->>TVDB: Set status = 9 (Failed), error_code
Note over TVDB: ds_analyzer.datasets
Analyzer->>TVDB: Update history status = 9
Note over TVDB: gb_console.wishlist_dataset_histories
Analyzer->>Logger: Log analysis error
Analyzer->>Slack: Send analysis failure notification
else SPVP Failure
SPVP->>TVDB: Set spvp mapping error
Note over TVDB: ds_analyzer.review_sentence_spvp<br/>ds_analyzer.specific_viewpoints
SPVP->>TVDB: Set spvp_status = 9 (Failed)
Note over TVDB: gb_console.wishlist_dataset_histories
SPVP->>Logger: Log SPVP error
SPVP->>Slack: Send SPVP failure notification
end
end
end
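For reference, Phase 1's wishlist creation call might look like the sketch below. The endpoint path POST /api/v1/general/wishlist-to-group is taken from the sequence diagram; the host, authentication scheme, and payload fields are illustrative assumptions.

```python
import requests

API_BASE = "https://trend-viewer-api.example.internal"  # hypothetical host

# Field names are assumptions; the real schema lives in trend-viewer-api.
payload = {
    "name": "Summer campaign wishlist",
    "wishlist_ids": [101, 102, 103],
}

resp = requests.post(
    f"{API_BASE}/api/v1/general/wishlist-to-group",
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # hypothetical auth scheme
    timeout=30,
)
resp.raise_for_status()
# On success, gb_console.wishlist_to_groups is stored with status 1 (Active)
# and the related summary_wishlist_* rows start at crawl_status 0 (New).
```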
System Components
Frontend
- trend-viewer-app: Vue.js frontend for user interface
API Layer
- trend-viewer-api: Laravel REST API layer
Backend Services
- trend-viewer-backend: Laravel backend with scheduled commands
- PLG API: Playground API provided by Crawler for TV interaction
Processing Services
- analyzer_batch: Python ML/AI processing service
- spvp_batch: Python Qwen-based specific viewpoint processing
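The spvp_batch mapping step can be pictured as follows: load review sentences and specific viewpoints, ask the Qwen model which viewpoints each sentence expresses, and store the resulting pairs. This is a conceptual sketch only; qwen_map stands in for the real model call, and the toy heuristic inside it is not the actual inference logic.

```python
def qwen_map(sentence: str, viewpoints: list[str]) -> list[str]:
    """Stand-in for the real Qwen inference call (hypothetical interface)."""
    # A real implementation would prompt the model with the sentence and the
    # candidate viewpoints, then parse the matched labels from its reply.
    return [vp for vp in viewpoints if vp.lower() in sentence.lower()]  # toy heuristic

def map_sentences(sentences: dict[int, str], viewpoints: list[str]) -> list[tuple[int, str]]:
    """Produce (sentence_id, viewpoint) pairs, as stored in ds_analyzer.review_sentence_spvp."""
    pairs = []
    for sentence_id, text in sentences.items():
        for vp in qwen_map(text, viewpoints):
            pairs.append((sentence_id, vp))
    return pairs

# Example:
# map_sentences({1: "Battery life is great"}, ["battery life", "price"])
# -> [(1, "battery life")]
```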
Databases
- gb_console DB (MySQL): TV Backend - User data, wishlist lifecycle, status tracking
- crawler_v2 DB (MySQL): Crawler managed - Crawl requests, logs, and error tracking
- analyzer_v2 DB (MySQL): Crawler managed - Successful crawled data from marketplace
- ds_analyzer DB (MySQL): TV Analyzer - Processed analysis results from analyzer_batch
Data Flow Summary
- User Input → gb_console DB (wishlist_to_groups, summary_wishlist_*)
- Crawl Requests → TV calls PLG API → Crawler operates crawler_v2 DB
- Successful Crawling → PLG API stores to analyzer_v2 DB (products, reviews, review_sentences)
- Dataset Creation → ds_analyzer DB (datasets metadata)
- Analysis → analyzer_batch reads directly from analyzer_v2 DB → stores to ds_analyzer DB (7 tables)
- SPVP Processing → ds_analyzer DB (specific_viewpoints updates)
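As a compact illustration of the Analysis step, the skeleton below mirrors the progress checkpoints from the sequence diagram (25, 50, 75, 85, 100). The helper is a hypothetical stub; only the status values, checkpoints, and table names in the comments come from this document.

```python
def set_dataset(dataset_id: int, *, status: int | None = None, progress: int | None = None) -> None:
    """Stub: the real service would UPDATE ds_analyzer.datasets here."""
    print(f"dataset {dataset_id}: status={status} progress={progress}")

def run_analysis(dataset_id: int) -> None:
    set_dataset(dataset_id, status=2, progress=25)   # 1 -> 2 | Processing
    # Load analyzer_v2.products / reviews / review_sentences (elided).
    # K-means clustering of review sentences:
    set_dataset(dataset_id, progress=50)
    # OpenAI GPT-4 labeling of the clusters:
    set_dataset(dataset_id, progress=75)
    # Product similarity calculation:
    set_dataset(dataset_id, progress=85)
    # Write the 7 ds_analyzer result tables (elided):
    set_dataset(dataset_id, status=3, progress=100)  # 2 -> 3 | Completed
```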
Module List
| Name | Link | Description |
|---|---|---|
| Database Schema Changes | Database Schema | Detailed database schema changes for dataset workflow |
| Detailed Sequence Diagrams | Sequence Diagrams | Detailed sequence diagrams for each workflow phase |
Status Definitions
Dataset Status (ds_analyzer.datasets.status)
- 1 | Pending: Dataset created, waiting for analyzer_batch
- 2 | Processing: analyzer_batch is processing
- 3 | Completed: analyzer_batch completed
- 9 | Failed: analyzer_batch failed
SPVP Status (gb_console.wishlist_dataset_histories.spvp_status)
- 1 | Pending: Waiting for spvp_batch
- 2 | Analyzing: spvp_batch is processing
- 3 | Completed: spvp_batch completed
- 9 | Failed: spvp_batch failed
Crawl Status (gb_console.summary_wishlist_*.crawl_status)
- 0 | New: Not crawled yet
- 1 | InProgress: Crawler Service is crawling
- 2 | Success: Crawl successful
- 3 | Error: Crawl failed
- 4 | Canceled: Crawl canceled
Successful Dataset: status = 3 AND spvp_status = 3
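For reference, the status codes above can be expressed as enums; this sketch simply mirrors the values and labels defined in this section:

```python
from enum import IntEnum

class DatasetStatus(IntEnum):  # ds_analyzer.datasets.status
    PENDING = 1
    PROCESSING = 2
    COMPLETED = 3
    FAILED = 9

class SpvpStatus(IntEnum):  # gb_console.wishlist_dataset_histories.spvp_status
    PENDING = 1
    ANALYZING = 2
    COMPLETED = 3
    FAILED = 9

class CrawlStatus(IntEnum):  # gb_console.summary_wishlist_*.crawl_status
    NEW = 0
    IN_PROGRESS = 1
    SUCCESS = 2
    ERROR = 3
    CANCELED = 4

def is_successful_dataset(status: int, spvp_status: int) -> bool:
    """A dataset is fully successful when both pipelines have completed."""
    return status == DatasetStatus.COMPLETED and spvp_status == SpvpStatus.COMPLETED
```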