Crawler Integration - Sync Failed Crawls
Command Signatures
php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummaryProduct [--limit=100]
php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummaryProductReview [--limit=100]
php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummaryCategory [--limit=100]
php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummarySearchQuery [--limit=100]
Purpose
These commands retrieve information about failed crawl operations from the Crawler system via the Playground API Service and update the crawl status of the affected records in the summary wishlist tables. This ensures that the system can track which crawl operations failed, allowing for appropriate remediation actions or retries.
Sequence Diagram
sequenceDiagram
participant System
participant Command as plg-api:sync-crawl-failed-from-crawler
participant APIService as PlaygroundApiService
participant Crawler as Crawler System
participant Job as Summary*Job
participant Repository as SummaryWishlist*Repository
participant Logger
participant Slack
Note over System,Slack: Crawler Failed Operations Sync Flow (Every 30 Minutes)
rect rgb(200, 255, 200)
Note right of System: Happy Case - Normal Processing
System->>Command: Execute with data type
Command->>Logger: Log command start
Command->>Command: Validate data type parameter
Command->>Command: Calculate date range (start/end of day)
rect rgb(200, 230, 255)
loop Until no more results or no next page
Note right of Command: Paginated API Processing
Command->>APIService: failedList(data_type, date_range, limit, offset)
APIService->>Crawler: HTTP GET /failed-list
Crawler-->>APIService: Response with failed crawl data
APIService-->>Command: Return paginated response
rect rgb(230, 200, 255)
alt Has Results
Note right of Command: Job Processing
Command->>Job: dispatch(failed_crawls_batch)
Job->>Job: Extract identifiers from failed crawls
Job->>Repository: getExisting(identifiers)
Repository-->>Job: Return matching records
Job->>Repository: Update crawl_status to Error
Job->>Logger: Log batch processing success
Job->>Slack: Send batch success notification
else No Results
Note right of Command: No Data Scenario
Command->>Logger: Log "No failed list found"
end
end
Command->>Command: Check for next page URL
Command->>Command: Update offset for next iteration
end
end
Command->>Logger: Log command completion
Command->>Slack: Send final summary notification
end
rect rgb(255, 200, 200)
Note right of System: Error Handling
rect rgb(255, 230, 230)
alt API Error Occurs
APIService->>Logger: Log API error details
APIService->>Slack: Send API error notification
else Job Processing Error
Job->>Logger: Log job error details
Job->>Slack: Send job error notification
else Unexpected Error Occurs
Command->>Logger: Log error details
Command->>Slack: Send error notification with context
end
end
end
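The paginated processing at the heart of the diagram can be summarized in code. The following is a minimal sketch, not the actual implementation: the $apiService variable, the response keys ('results', 'next'), and the SummaryProductJob class are assumptions based on the diagram above.

use Illuminate\Support\Facades\Log;

// Sketch of the command's paginated processing loop (hypothetical names).
function syncFailedCrawls($apiService, string $dataType, array $dateRange, int $limit = 100): void
{
    $offset = 0;

    do {
        // Fetch one page of failed crawls from the Playground API.
        $response = $apiService->failedList($dataType, $dateRange, $limit, $offset);
        $results  = $response['results'] ?? [];

        if (count($results) > 0) {
            // Queue a job that marks the matching records as Error.
            SummaryProductJob::dispatch($results);
        } else {
            Log::info('No failed list found', ['data_type' => $dataType]);
        }

        $offset += $limit; // advance to the next page
    } while ($results !== [] && !empty($response['next']));
}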
Detail
Parameters
--data-type: Required parameter specifying the type of data for which to sync failed crawl information
  - SummaryProduct: Product summary data
  - SummaryProductReview: Product review data
  - SummaryCategory: Category summary data
  - SummarySearchQuery: Search query summary data
--limit=N: Optional parameter to control the number of failed crawls retrieved in each API call (default: 100)
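For example, to sync failed category crawls in batches of 50 per API call instead of the default 100:
php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummaryCategory --limit=50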
Frequency
Every 30 minutes for each data type
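One way this schedule might be registered in app/Console/Kernel.php; this is a sketch assuming the standard Laravel scheduler, and the actual registration in this codebase may differ.

use Illuminate\Console\Scheduling\Schedule;

// Register the 30-minute schedule for each data type.
protected function schedule(Schedule $schedule): void
{
    $dataTypes = ['SummaryProduct', 'SummaryProductReview', 'SummaryCategory', 'SummarySearchQuery'];

    foreach ($dataTypes as $type) {
        $schedule->command("plg-api:sync-crawl-failed-from-crawler --data-type={$type}")
            ->everyThirtyMinutes()
            ->withoutOverlapping(); // avoid concurrent runs of the same type
    }
}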
Dependencies
- Playground API service must be accessible
- Valid API authentication tokens
- Crawler system must provide failed crawl data via API
- Summary wishlist tables must contain records with matching identifiers
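The API credentials might be wired up along these lines; the config keys and environment variable names below are purely hypothetical and are shown only to illustrate the dependency.

// config/services.php — hypothetical shape for the Playground API credentials.
'playground' => [
    'base_url' => env('PLAYGROUND_API_URL'),
    'token'    => env('PLAYGROUND_API_TOKEN'),
],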
Output
Tables
- summary_wishlist_products: Updates crawl_status field
  - crawl_status: Changes to Error for failed crawl operations
- summary_wishlist_product_reviews: Updates crawl_status field
  - crawl_status: Changes to Error for failed crawl operations
- summary_wishlist_categories: Updates crawl_status field
  - crawl_status: Changes to Error for failed crawl operations
- summary_wishlist_search_queries: Updates crawl_status field
  - crawl_status: Changes to Error for failed crawl operations
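The status update applied to each of these tables might look like the following sketch; the SummaryWishlistProduct model and the CrawlStatus backed enum are assumptions, not the actual repository code.

use App\Enums\CrawlStatus; // hypothetical backed enum

$matchedIds = [/* ids of records matched to the failed crawls */];

// Mark the matched records as Error in a single update.
SummaryWishlistProduct::query()
    ->whereIn('id', $matchedIds)
    ->update(['crawl_status' => CrawlStatus::Error->value]);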
Services
- Playground API: Provides paginated list of failed crawl operations
- Crawler System: Source of failed crawl operation data
Database Schema
erDiagram
summary_wishlist_products {
bigint id PK
string input "The input of the product"
string input_type "The type of the input: jan, asin, rakuten_id"
bigint mall_id FK "Foreign key to malls table"
integer crawl_status "The status of the crawling"
}
summary_wishlist_product_reviews {
bigint id PK
bigint summary_wishlist_product_id FK "Foreign key to summary_wishlist_products (unique)"
integer crawl_status "The status of the crawling"
}
summary_wishlist_categories {
bigint id PK
string category_id "The id of the category in the mall"
bigint mall_id FK "Foreign key to malls table"
integer crawl_status "The status of the crawling"
}
summary_wishlist_search_queries {
bigint id PK
bigint mall_id FK "The id of the mall"
string keyword "The keyword to search"
integer crawl_status "The status of the crawling"
}
%% Relationships
summary_wishlist_products ||--o| summary_wishlist_product_reviews : "has review"
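Mapped to Eloquent, the product-to-review link might look like this. The model names are hypothetical, derived from the table names above; the foreign key on summary_wishlist_product_reviews is unique, so each product has at most one review-summary row.

use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\HasOne;

class SummaryWishlistProductReview extends Model {}

class SummaryWishlistProduct extends Model
{
    // One review-summary row per product (the FK is unique).
    public function review(): HasOne
    {
        return $this->hasOne(SummaryWishlistProductReview::class);
    }
}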
Error Handling
Log
- Command execution start/end with data type and parameters
- API response status and result counts for each page
- Pagination information and offset tracking for debugging
- Batch processing success/failure with record counts
- Explicit logging when no failed crawls are found
- Detailed error messages with file and line information for debugging
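In practice, the error-detail logging described above might look like the following sketch; the message text and context keys are assumptions.

use Illuminate\Support\Facades\Log;

try {
    // ... process one batch of failed crawls ...
} catch (\Throwable $e) {
    Log::error('sync-crawl-failed-from-crawler: batch failed', [
        'message' => $e->getMessage(),
        'file'    => $e->getFile(),
        'line'    => $e->getLine(),
    ]);

    throw $e; // let the command-level handler send the Slack notification
}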
Slack
- Success notifications with data type, API details, and processing statistics
- Batch processing notifications with records updated counts
- Final summary notifications with total processed records
- Error notifications with detailed message and source information
- Full error context including API response details and affected record counts
Troubleshooting
Check Data
- Verify summary_wishlist_* tables contain records with crawl_status = Error after command execution
- Check that records have valid crawl_config_id values from previous create operations
- Ensure the identifiers in failed crawls match database records (input, input_type, mall_id, etc.)
- Validate that updated_at timestamps indicate recent status updates
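One way to spot-check the first and last items, for example in php artisan tinker; the table and column names come from the schema above, but the integer backing the Error status is a placeholder assumption.

use Illuminate\Support\Facades\DB;

// How many product rows were marked as Error in the last hour?
DB::table('summary_wishlist_products')
    ->where('crawl_status', 2) // hypothetical integer value for Error
    ->where('updated_at', '>=', now()->subHour())
    ->count();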
Check Logs
- Monitor command execution logs for successful starts and completions
- Check API response logs for HTTP status codes and pagination information
- Review Slack notifications for success/failure patterns and processing statistics
- Examine job queue logs for processing delays or failures
- Verify database update logs show proper crawl_status transitions to Error
- Compare the count of failed crawls in API responses to updated database records
- Check for any skipped or unmatched failed crawls in processing logs