# Dataset Workflow - Overview

## Introduction

The Dataset Workflow is the fully automated process for creating and analyzing data from `wishlist_to_group`. This document summarizes the main flow for dataset creation and processing.
## Related Flows

This main flow builds on several prerequisite processes that are documented separately:

- **Temp Wishlist to Group → Wishlist to Group conversion**: see the Wishlist System Overview
- **Detailed Wishlist Management**: `wishlist_products`, `wishlist_categories`, `wishlist_search_queries`, `wishlist_product_reviews` → `summary_wishlist_*` tables
- **Backend Console Commands**: detailed documentation is available in the trend-viewer-backend docs
- **Processing Services** (docs in development):
  - `analyzer_batch`: Python ML/AI processing service
  - `spvp_batch`: Python Qwen-based specific viewpoint processing

Note: this overview focuses on the core dataset creation flow, from `wishlist_to_group` through to the completed analysis results.
## PLG API - Integration Point with the Crawler

The PLG API (Playground API) is the API gateway provided by the Crawler team for TV to interact with:

- **Crawl Management**: TV sends crawl requests via the PLG API
- **Data Validation**: TV checks embedding/prediction status via the PLG API
- **Error Tracking**: the PLG API records logs and failures in the `crawler_v2` DB
- **Success Storage**: the PLG API stores successfully crawled data in the `analyzer_v2` DB

The PLG API is owned by the Crawler team and operates directly on the `crawler_v2` and `analyzer_v2` databases.
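To make the two TV-side interactions concrete, here is a minimal sketch of the request payloads trend-viewer-backend might send to the PLG API. The endpoint paths and field names are hypothetical assumptions for illustration only; the document specifies just the two interaction types (crawl requests and embedding/prediction validation).

```python
# Hypothetical sketch only: endpoint paths and payload field names are
# assumptions, not the actual PLG API contract.

def build_crawl_request(wishlist_group_id: int, product_ids: list[int]) -> dict:
    """Payload asking the Crawler to crawl a wishlist group's products."""
    return {
        "endpoint": "/crawl-configs",  # hypothetical path
        "method": "POST",
        "body": {
            "wishlist_group_id": wishlist_group_id,
            "product_ids": product_ids,
        },
    }

def build_validation_request(wishlist_group_id: int) -> dict:
    """Payload checking embedding/prediction completeness before dataset creation."""
    return {
        "endpoint": f"/validations/{wishlist_group_id}",  # hypothetical path
        "method": "GET",
        "body": None,
    }
```

Either payload would then be sent over HTTP by a thin client wrapper; the sketch stays transport-agnostic on purpose.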
## Overview Diagram

```mermaid
---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart LR
    %% User Input
    User[User Input]

    %% Core Systems
    TrendApp[trend-viewer-app]
    TrendAPI[trend-viewer-api]
    TrendBackend[trend-viewer-backend]
    AnalyzerBatch[analyzer_batch]
    SPVPBatch[spvp_batch]
    PLGApi[PLG API]

    %% Databases
    GBConsole[(gb_console DB)]
    DSAnalyzer[(ds_analyzer DB)]
    AnalyzerV2[(analyzer_v2 DB)]
    CrawlerV2[(crawler_v2 DB)]

    %% Flow with numbered steps - Steps 1 and 2 vertical
    User --- Step1[
        <div style='text-align: center'>
        <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
        <p style='margin-top: 8px'>Create Wishlist</p>
        </div>
    ]
    Step1 --> TrendApp
    TrendApp --> TrendAPI
    TrendAPI --> GBConsole
    GBConsole --- Step2[
        <div style='text-align: center'>
        <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
        <p style='margin-top: 8px'>Crawl Data</p>
        </div>
    ]
    Step2 --> TrendBackend
    TrendBackend --> PLGApi
    PLGApi --> AnalyzerV2
    PLGApi --> CrawlerV2

    %% Steps 3, 4, 5 horizontal flow (left to right)
    TrendBackend --- Step3[
        <div style='text-align: center'>
        <span style='display: inline-block; background-color: #cc6699 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
        <p style='margin-top: 8px'>Create Dataset</p>
        </div>
    ]
    Step3 --> DSAnalyzer
    TrendBackend --> Step4[
        <div style='text-align: center'>
        <span style='display: inline-block; background-color: #ff9900 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4</span>
        <p style='margin-top: 8px'>Analyze</p>
        </div>
    ]
    Step4 --> AnalyzerBatch
    AnalyzerV2 --> AnalyzerBatch
    AnalyzerBatch --> DSAnalyzer
    DSAnalyzer --> SPVPBatch
    AnalyzerBatch --> Step5[
        <div style='text-align: center'>
        <span style='display: inline-block; background-color: #cc3366 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>5</span>
        <p style='margin-top: 8px'>SPVP Process</p>
        </div>
    ]
    Step5 --> SPVPBatch
    SPVPBatch --> DSAnalyzer

    %% Style step boxes as transparent
    style Step1 fill:transparent,stroke:transparent,stroke-width:1px
    style Step2 fill:transparent,stroke:transparent,stroke-width:1px
    style Step3 fill:transparent,stroke:transparent,stroke-width:1px
    style Step4 fill:transparent,stroke:transparent,stroke-width:1px
    style Step5 fill:transparent,stroke:transparent,stroke-width:1px
```
## Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant TrendApp as trend-viewer-app
    participant TrendAPI as trend-viewer-api
    participant TrendBackend as trend-viewer-backend
    participant PLG as PLG API
    participant Analyzer as analyzer_batch
    participant SPVP as spvp_batch
    participant TVDB as TV DB
    participant CrawlerDB as Crawler DB
    participant Logger
    participant Slack

    Note over User,Slack: Dataset Workflow Complete Flow

    rect rgb(200, 255, 200)
        Note right of User: Happy Case - Phase 1: Wishlist Creation
        User->>TrendApp: 1. Create wishlist_to_group
        TrendApp->>TrendAPI: POST /api/v1/general/wishlist-to-group
        TrendAPI->>TrendAPI: Validate and store data
        TrendAPI->>TVDB: Store wishlist data
        Note over TVDB: gb_console.wishlist_to_groups (status: 1 | Active)<br/>gb_console.summary_wishlist_* (crawl_status: 0 | New)
        rect rgb(230, 200, 255)
            Note right of TrendAPI: Success Monitoring
            TrendAPI->>Logger: Log wishlist creation
            TrendAPI->>Slack: Send creation notification
        end
    end

    rect rgb(200, 230, 255)
        Note right of TrendBackend: Happy Case - Phase 2: Data Crawling
        TrendBackend->>TVDB: Query pending items
        TVDB-->>TrendBackend: Return crawl queue
        TrendBackend->>PLG: Send crawl config
        PLG->>CrawlerDB: Store crawl requests
        Note over CrawlerDB: crawler_v2.configs
        rect rgb(255, 255, 200)
            Note right of PLG: Optional - Multiple Products
            PLG->>PLG: Execute crawling process
            PLG->>CrawlerDB: Store successful crawled data
            Note over CrawlerDB: analyzer_v2.products<br/>analyzer_v2.reviews<br/>analyzer_v2.review_sentences
            PLG->>CrawlerDB: Update crawl status & errors
            Note over CrawlerDB: crawler_v2.configs
        end
        PLG-->>TrendBackend: Return crawl status
        TrendBackend->>TVDB: Update crawl_status: 0→1→2 (Success)
        Note over TVDB: gb_console.summary_wishlist_*
    end

    rect rgb(255, 230, 200)
        Note right of TrendBackend: Critical - Phase 3: Dataset Creation
        TrendBackend->>TVDB: Check wishlist readiness
        TVDB-->>TrendBackend: Return ready wishlists
        TrendBackend->>PLG: Validate embedding & prediction
        PLG->>CrawlerDB: Check data completeness
        Note over CrawlerDB: analyzer_v2.products (embedding status)
        CrawlerDB-->>PLG: Return validation status
        PLG-->>TrendBackend: Validation completed
        TrendBackend->>TVDB: Create dataset metadata
        Note over TVDB: ds_analyzer.datasets (status: 1 | Pending, progress: 0)
        TrendBackend->>TVDB: Create history record
        Note over TVDB: gb_console.wishlist_dataset_histories (status: 1, spvp_status: 1)
        TrendBackend->>Analyzer: Trigger analyzer_batch
        Note over Analyzer: Google Cloud Run job started
    end

    rect rgb(200, 255, 200)
        Note right of Analyzer: Happy Case - Phase 4: ML Analysis
        Analyzer->>TVDB: Update status to Processing
        Note over TVDB: ds_analyzer.datasets (status: 1→2 | Processing, progress: 0→25)
        Analyzer->>CrawlerDB: Load source data directly
        Note over CrawlerDB: analyzer_v2.products<br/>analyzer_v2.reviews<br/>analyzer_v2.review_sentences
        CrawlerDB-->>Analyzer: Return products, reviews, sentences
        Analyzer->>Analyzer: ML Processing
        Note over Analyzer: - K-means Clustering (progress: 25→50)<br/>- OpenAI GPT-4 Labeling (progress: 50→75)<br/>- Product Similarity Calc (progress: 75→85)
        Analyzer->>TVDB: Write analysis results (7 tables)
        Note over TVDB: ds_analyzer.products<br/>ds_analyzer.product_details<br/>ds_analyzer.product_similarities<br/>ds_analyzer.ai_viewpoints<br/>ds_analyzer.review_sentence_aivp<br/>ds_analyzer.reviews<br/>ds_analyzer.review_sentences (progress: 85→100)
        Analyzer->>TVDB: Complete analysis
        Note over TVDB: ds_analyzer.datasets (status: 2→3 | Completed, progress: 100)
        Analyzer-->>TrendBackend: Analysis completed notification
        TrendBackend->>TVDB: Sync dataset status
        Note over TVDB: gb_console.wishlist_dataset_histories<br/>(status: 3, spvp_status: 2 | Analyzing)
    end

    rect rgb(200, 230, 255)
        Note right of TrendBackend: Happy Case - Phase 5: SPVP Processing
        TrendBackend->>SPVP: Trigger spvp_batch
        SPVP->>TVDB: Load review sentences
        Note over TVDB: ds_analyzer.review_sentences
        SPVP->>TVDB: Load specific viewpoints & categories
        Note over TVDB: ds_analyzer.specific_viewpoints<br/>ds_analyzer.viewpoint_categories
        TVDB-->>SPVP: Return sentences, viewpoints & categories
        SPVP->>SPVP: Qwen mapping process
        Note over SPVP: Qwen model maps<br/>specific_viewpoints ↔ review_sentences
        SPVP->>TVDB: Store mapping results
        Note over TVDB: ds_analyzer.review_sentence_spvp<br/>(sentence-viewpoint mappings)
        SPVP->>TVDB: Update viewpoint progress
        Note over TVDB: ds_analyzer.specific_viewpoints<br/>(last_object_id updated)
        SPVP-->>TrendBackend: SPVP completed
        TrendBackend->>TVDB: Final status update
        Note over TVDB: gb_console.wishlist_dataset_histories<br/>(spvp_status: 2→3 | Completed)
        rect rgb(230, 200, 255)
            Note right of TrendBackend: Success Monitoring
            TrendBackend->>Logger: Log workflow completion
            TrendBackend->>Slack: Send success notification
            TrendBackend-->>TrendAPI: Dataset ready notification
            TrendAPI-->>TrendApp: WebSocket status update
            TrendApp-->>User: Dataset available
        end
    end

    rect rgb(255, 200, 200)
        Note right of TrendBackend: Error Handling
        rect rgb(255, 230, 230)
            alt Crawl Failure
                PLG->>CrawlerDB: Log crawl errors & failures
                Note over CrawlerDB: crawler_v2.error_logs
                PLG->>TVDB: Set crawl_status = 3 (Error)
                Note over TVDB: gb_console.summary_wishlist_*
                PLG->>Logger: Log crawl error
                PLG->>Slack: Send crawl failure notification
            else Analysis Failure
                Analyzer->>TVDB: Set status = 9 (Failed), error_code
                Note over TVDB: ds_analyzer.datasets
                Analyzer->>TVDB: Update history status = 9
                Note over TVDB: gb_console.wishlist_dataset_histories
                Analyzer->>Logger: Log analysis error
                Analyzer->>Slack: Send analysis failure notification
            else SPVP Failure
                SPVP->>TVDB: Set spvp mapping error
                Note over TVDB: ds_analyzer.review_sentence_spvp<br/>ds_analyzer.specific_viewpoints
                SPVP->>TVDB: Set spvp_status = 9 (Failed)
                Note over TVDB: gb_console.wishlist_dataset_histories
                SPVP->>Logger: Log SPVP error
                SPVP->>Slack: Send SPVP failure notification
            end
        end
    end
```
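The Phase 4 progress checkpoints above can be restated in one place. The checkpoint values (25, 50, 75, 85, 100) come from the sequence diagram; the function and stage names below are illustrative, not the actual analyzer_batch identifiers.

```python
# Illustrative summary of the analyzer_batch progress checkpoints from Phase 4.
# Stage names are assumptions; the progress values come from this document.
PROGRESS_CHECKPOINTS = [
    ("load_source_data", 25),   # status 1 -> 2, progress 0 -> 25
    ("kmeans_clustering", 50),  # K-means clustering, progress 25 -> 50
    ("gpt4_labeling", 75),      # OpenAI GPT-4 labeling, progress 50 -> 75
    ("product_similarity", 85), # similarity calculation, progress 75 -> 85
    ("write_results", 100),     # write 7 tables, status 2 -> 3, progress 85 -> 100
]

def progress_after(stage: str) -> int:
    """Return the ds_analyzer.datasets progress value once a stage finishes."""
    for name, progress in PROGRESS_CHECKPOINTS:
        if name == stage:
            return progress
    raise ValueError(f"unknown stage: {stage}")
```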
## System Components

### Frontend

- **trend-viewer-app**: Vue.js frontend for the user interface

### API Layer

- **trend-viewer-api**: Laravel REST API layer

### Backend Services

- **trend-viewer-backend**: Laravel backend with scheduled commands
- **PLG API**: Playground API provided by the Crawler team for TV to interact with

### Processing Services

- **analyzer_batch**: Python ML/AI processing service
- **spvp_batch**: Python Qwen-based specific viewpoint processing

### Databases

- **gb_console DB** (MySQL): TV Backend - user data, wishlist lifecycle, status tracking
- **crawler_v2 DB** (MySQL): managed by the Crawler - crawl requests, logs, and error tracking
- **analyzer_v2 DB** (MySQL): managed by the Crawler - successfully crawled data from marketplaces
- **ds_analyzer DB** (MySQL): TV Analyzer - processed analysis results from analyzer_batch
## Data Flow Summary

1. **User Input** → gb_console DB (`wishlist_to_groups`, `summary_wishlist_*`)
2. **Crawl Requests** → TV calls the PLG API → the Crawler operates on the crawler_v2 DB
3. **Successful Crawling** → the PLG API stores data in the analyzer_v2 DB (`products`, `reviews`, `review_sentences`)
4. **Dataset Creation** → ds_analyzer DB (`datasets` metadata)
5. **Analysis** → analyzer_batch reads the analyzer_v2 DB directly → writes to the ds_analyzer DB (7 tables)
6. **SPVP Processing** → ds_analyzer DB (`specific_viewpoints` updates)
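The steps above can be captured as a small table of which component writes into which database. The component and database names come from this document; the tuple structure and helper function are illustrative.

```python
# Illustrative restatement of the six-step data flow summary.
# (actor, target_db) pairs come from this document; the helper is a sketch.
DATA_FLOW = [
    ("trend-viewer-api", "gb_console"),       # 1. user input -> wishlist tables
    ("PLG API", "crawler_v2"),                # 2. crawl requests
    ("PLG API", "analyzer_v2"),               # 3. successfully crawled data
    ("trend-viewer-backend", "ds_analyzer"),  # 4. dataset metadata
    ("analyzer_batch", "ds_analyzer"),        # 5. analysis results (7 tables)
    ("spvp_batch", "ds_analyzer"),            # 6. SPVP mapping results
]

def writers_to(db: str) -> list[str]:
    """Which components write into a given database in this flow."""
    return [actor for actor, target in DATA_FLOW if target == db]
```

For example, `writers_to("ds_analyzer")` highlights that three separate components write analysis output into the same TV Analyzer database.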
## Module List

| Name | Link | Description |
|---|---|---|
| Database Schema Changes | Database Schema | Details of database structure changes for the dataset workflow |
| Detailed Sequence Diagrams | Sequence Diagrams | Detailed sequence diagrams for each workflow phase |
## Dataset Statuses

### Dataset Status (`ds_analyzer.datasets.status`)

- 1 | Pending: dataset created, waiting for analyzer_batch
- 2 | Processing: analyzer_batch is running
- 3 | Completed: analyzer_batch finished
- 9 | Failed: analyzer_batch failed

### SPVP Status (`gb_console.wishlist_dataset_histories.spvp_status`)

- 1 | Pending: waiting for spvp_batch
- 2 | Analyzing: spvp_batch is running
- 3 | Completed: spvp_batch finished
- 9 | Failed: spvp_batch failed

### Crawl Status (`gb_console.summary_wishlist_*.crawl_status`)

- 0 | New: not yet crawled
- 1 | InProgress: the Crawler Service is crawling
- 2 | Success: crawl succeeded
- 3 | Error: crawl failed
- 4 | Canceled: crawl was canceled

A dataset is successful when: `status = 3 AND spvp_status = 3`
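The status codes above, plus the success condition, can be captured as enums. The numeric values and labels come directly from this document; the enum and function names are illustrative.

```python
# Status codes as listed in this document; names of the enums and the
# helper function are illustrative, not actual codebase identifiers.
from enum import IntEnum

class DatasetStatus(IntEnum):       # ds_analyzer.datasets.status
    PENDING = 1
    PROCESSING = 2
    COMPLETED = 3
    FAILED = 9

class SpvpStatus(IntEnum):          # gb_console.wishlist_dataset_histories.spvp_status
    PENDING = 1
    ANALYZING = 2
    COMPLETED = 3
    FAILED = 9

class CrawlStatus(IntEnum):         # gb_console.summary_wishlist_*.crawl_status
    NEW = 0
    IN_PROGRESS = 1
    SUCCESS = 2
    ERROR = 3
    CANCELED = 4

def is_dataset_successful(status: int, spvp_status: int) -> bool:
    """Success condition from this document: status = 3 AND spvp_status = 3."""
    return (status == DatasetStatus.COMPLETED
            and spvp_status == SpvpStatus.COMPLETED)
```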