Crawler Integration - Sync Failed Crawls

Command Signatures

php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummaryProduct [--limit=100]
php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummaryProductReview [--limit=100]
php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummaryCategory [--limit=100]
php artisan plg-api:sync-crawl-failed-from-crawler --data-type=SummarySearchQuery [--limit=100]

Purpose

These commands retrieve failed crawl operations from the Crawler system via the Playground API Service and update the crawl status of the affected records in the summary wishlist tables. This lets the system track which crawl operations failed so that they can be remediated or retried.
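
The flow can be pictured as a small pagination loop inside the command's handle() method. The sketch below is illustrative only, assuming hypothetical class names (PlaygroundApiService, SummaryProductJob) and response keys (results, next) that mirror the behaviour described in this document; it is not the actual implementation.

<?php

// Minimal sketch only. Class names, method signatures, and response keys
// (results, next) are assumptions used for illustration.

namespace App\Console\Commands;

use App\Jobs\SummaryProductJob;          // hypothetical job class
use App\Services\PlaygroundApiService;   // hypothetical API wrapper
use Illuminate\Console\Command;
use Illuminate\Support\Facades\Log;

class SyncCrawlFailedFromCrawler extends Command
{
    protected $signature = 'plg-api:sync-crawl-failed-from-crawler
                            {--data-type= : SummaryProduct, SummaryProductReview, SummaryCategory or SummarySearchQuery}
                            {--limit=100 : Number of failed crawls per API call}';

    public function handle(PlaygroundApiService $api): int
    {
        $dataType = $this->option('data-type');
        $limit    = (int) $this->option('limit');
        $offset   = 0;

        Log::info('sync-crawl-failed-from-crawler: start', ['data_type' => $dataType]);

        do {
            // Fetch one page of failed crawls for today's date range.
            $page = $api->failedList($dataType, now()->startOfDay(), now()->endOfDay(), $limit, $offset);

            if (!empty($page['results'])) {
                // Hand the batch to the matching Summary*Job, which updates crawl_status.
                SummaryProductJob::dispatch($page['results']);
            } else {
                Log::info('No failed list found', ['data_type' => $dataType]);
            }

            $offset += $limit;
        } while (!empty($page['next'])); // Stop when the API reports no next page.

        Log::info('sync-crawl-failed-from-crawler: done', ['data_type' => $dataType]);

        return self::SUCCESS;
    }
}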

Sequence Diagram

sequenceDiagram
    participant System
    participant Command as plg-api:sync-crawl-failed-from-crawler
    participant APIService as PlaygroundApiService
    participant Crawler as Crawler System
    participant Job as Summary*Job
    participant Repository as SummaryWishlist*Repository
    participant Logger
    participant Slack
    
    Note over System,Slack: Crawler Failed Operations Sync Flow (Every 30 Minutes)
    
    rect rgb(200, 255, 200)
    Note right of System: Happy Case - Normal Processing
    
    System->>Command: Execute with data type
    Command->>Logger: Log command start
    Command->>Command: Validate data type parameter
    Command->>Command: Calculate date range (start/end of day)
    
    rect rgb(200, 230, 255)
    loop Until no more results or no next page
        Note right of Command: Paginated API Processing
        Command->>APIService: failedList(data_type, date_range, limit, offset)
        APIService->>Crawler: HTTP GET /failed-list
        Crawler-->>APIService: Response with failed crawl data
        APIService-->>Command: Return paginated response
        
        rect rgb(230, 200, 255)
        alt Has Results
            Note right of Command: Job Processing
            Command->>Job: dispatch(failed_crawls_batch)
            
            Job->>Job: Extract identifiers from failed crawls
            Job->>Repository: getExisting(identifiers)
            Repository-->>Job: Return matching records
            
            Job->>Repository: Update crawl_status to Error
            Job->>Logger: Log batch processing success
            Job->>Slack: Send batch success notification
        else No Results
            Note right of Command: No Data Scenario
            Command->>Logger: Log "No failed list found"
        end
        end
        
        Command->>Command: Check for next page URL
        Command->>Command: Update offset for next iteration
    end
    end
    
    Command->>Logger: Log command completion
    Command->>Slack: Send final summary notification
    end
    
    rect rgb(255, 200, 200)
    Note right of System: Error Handling
    rect rgb(255, 230, 230)
    alt API Error Occurs
        APIService->>Logger: Log API error details
        APIService->>Slack: Send API error notification
    else Job Processing Error
        Job->>Logger: Log job error details
        Job->>Slack: Send job error notification
    else Unexpected Error Occurs
        Command->>Logger: Log error details
        Command->>Slack: Send error notification with context
    end
    end
    end
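
The job-side processing in the diagram (extract identifiers, match existing records, mark them as Error) could look roughly like the sketch below. The repository methods (getExisting, updateCrawlStatus) and the CrawlStatus enum are assumptions that mirror the diagram, not code taken from the application.

namespace App\Jobs;

use App\Enums\CrawlStatus;                               // hypothetical status enum
use App\Repositories\SummaryWishlistProductRepository;   // hypothetical repository
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Support\Facades\Log;

class SummaryProductJob implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(private array $failedCrawls)
    {
    }

    public function handle(SummaryWishlistProductRepository $repository): void
    {
        // Extract the identifiers carried by each failed crawl (input, input_type, mall_id).
        $identifiers = array_map(fn (array $crawl) => [
            'input'      => $crawl['input'],
            'input_type' => $crawl['input_type'],
            'mall_id'    => $crawl['mall_id'],
        ], $this->failedCrawls);

        // Only records already present in summary_wishlist_products are touched.
        $existing = $repository->getExisting($identifiers);

        // Mark the matched records as failed.
        $repository->updateCrawlStatus($existing->pluck('id')->all(), CrawlStatus::Error);

        Log::info('Failed crawl batch processed', ['updated' => $existing->count()]);
    }
}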

Details

Parameters

  • --data-type: Required parameter specifying the type of data for which to sync failed crawl information (see the validation sketch after this list)
    • SummaryProduct: Product summary data
    • SummaryProductReview: Product review data
    • SummaryCategory: Category summary data
    • SummarySearchQuery: Search query summary data
  • --limit=N: Optional parameter to control the number of failed crawls retrieved in each API call (default: 100)
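
Inside handle(), the --data-type value might be validated and mapped to its job class along these lines; the map and the job class names are assumptions.

// Illustrative validation of --data-type inside the command's handle() method.
$jobByDataType = [
    'SummaryProduct'       => \App\Jobs\SummaryProductJob::class,
    'SummaryProductReview' => \App\Jobs\SummaryProductReviewJob::class,
    'SummaryCategory'      => \App\Jobs\SummaryCategoryJob::class,
    'SummarySearchQuery'   => \App\Jobs\SummarySearchQueryJob::class,
];

if (!isset($jobByDataType[$dataType])) {
    $this->error("Unsupported --data-type: {$dataType}");

    return self::FAILURE;
}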

Frequency

Every 30 minutes for each data type
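
A plausible registration of this schedule in app/Console/Kernel.php is sketched below; the foreach loop and withoutOverlapping() are assumptions, while the 30-minute interval matches the frequency stated above.

use Illuminate\Console\Scheduling\Schedule;

// Sketch of a possible schedule registration in app/Console/Kernel.php.
protected function schedule(Schedule $schedule): void
{
    foreach (['SummaryProduct', 'SummaryProductReview', 'SummaryCategory', 'SummarySearchQuery'] as $dataType) {
        $schedule->command("plg-api:sync-crawl-failed-from-crawler --data-type={$dataType}")
            ->everyThirtyMinutes()
            ->withoutOverlapping();
    }
}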

Dependencies

  • Playground API service must be accessible (see the request sketch after this list)
  • Valid API authentication tokens
  • Crawler system must provide failed crawl data via API
  • Summary wishlist tables must contain records with matching identifiers
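
The first two dependencies boil down to an authenticated HTTP request against the Playground API. The sketch below shows one way PlaygroundApiService::failedList could issue it with Laravel's HTTP client; the endpoint path, query parameter names, and config keys are assumptions.

use Illuminate\Support\Facades\Http;

class PlaygroundApiService
{
    // Assumed endpoint, parameter names, and config keys; shown for illustration only.
    public function failedList(string $dataType, $from, $to, int $limit, int $offset): array
    {
        return Http::baseUrl(config('services.playground.url'))
            ->withToken(config('services.playground.token'))
            ->get('/failed-list', [
                'data_type' => $dataType,
                'from'      => $from->toDateTimeString(),
                'to'        => $to->toDateTimeString(),
                'limit'     => $limit,
                'offset'    => $offset,
            ])
            ->throw()   // Let API errors bubble up so they are logged and sent to Slack.
            ->json();
    }
}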

Output

Tables

  • summary_wishlist_products: Updates crawl_status field
    • crawl_status: Changes to Error for failed crawl operations
  • summary_wishlist_product_reviews: Updates crawl_status field
    • crawl_status: Changes to Error for failed crawl operations
  • summary_wishlist_categories: Updates crawl_status field
    • crawl_status: Changes to Error for failed crawl operations
  • summary_wishlist_search_queries: Updates crawl_status field
    • crawl_status: Changes to Error for failed crawl operations
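
Because crawl_status is an integer column (see the schema below), the Error state maps to a numeric value. A hypothetical backed enum makes that mapping explicit; the concrete integers are placeholders, not the application's real values, and all four tables receive the same kind of update.

// Hypothetical status enum; the integer values are placeholders.
enum CrawlStatus: int
{
    case Pending  = 0;
    case Crawling = 1;
    case Done     = 2;
    case Error    = 3;
}

// The update applied to each matched record, e.g. for summary_wishlist_products
// (model name assumed).
SummaryWishlistProduct::whereIn('id', $matchedIds)
    ->update(['crawl_status' => CrawlStatus::Error->value]);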

Services

  • Playground API: Provides paginated list of failed crawl operations
  • Crawler System: Source of failed crawl operation data

Database Schema

erDiagram
    summary_wishlist_products {
        bigint id PK
        string input "The input of the product"
        string input_type "The type of the input: jan, asin, rakuten_id"
        bigint mall_id FK "Foreign key to malls table"
        integer crawl_status "The status of the crawling"
    }
    
    summary_wishlist_product_reviews {
        bigint id PK
        bigint summary_wishlist_product_id FK "Foreign key to summary_wishlist_products (unique)"
        integer crawl_status "The status of the crawling"
    }
    
    summary_wishlist_categories {
        bigint id PK
        string category_id "The id of the category in the mall"
        bigint mall_id FK "Foreign key to malls table"
        integer crawl_status "The status of the crawling"
    }
    
    summary_wishlist_search_queries {
        bigint id PK
        bigint mall_id FK "The id of the mall"
        string keyword "The keyword to search"
        integer crawl_status "The status of the crawling"
    }
    
    %% Relationships
    summary_wishlist_products ||--o{ summary_wishlist_product_reviews : "has reviews"
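
Given the unique foreign key noted above, each product maps to at most one review record. A minimal Eloquent sketch of that relation, with model names assumed:

use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\HasOne;

// Model names and the relation are assumptions inferred from the unique FK above.
class SummaryWishlistProduct extends Model
{
    public function review(): HasOne
    {
        return $this->hasOne(SummaryWishlistProductReview::class);
    }
}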

Error Handling

Log

  • Command execution start/end with data type and parameters
  • API response status and result counts for each page
  • Pagination information and offset tracking for debugging
  • Batch processing success/failure with record counts
  • Explicit logging when no failed crawls are found
  • Detailed error messages with file and line information for debugging

Slack

  • Success notifications with data type, API details, and processing statistics
  • Batch processing notifications with records updated counts
  • Final summary notifications with total processed records
  • Error notifications with detailed message and source information
  • Full error context including API response details and affected record counts
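
The logging and Slack behaviour described above could be wired up as in the sketch below, using a dedicated slack log channel; the channel name, message wording, and context keys are assumptions.

use Illuminate\Support\Facades\Log;

// Inside the command's pagination loop.
try {
    SummaryProductJob::dispatch($page['results']);
} catch (\Throwable $e) {
    // File/line context mirrors the "detailed error messages" listed above.
    $context = [
        'data_type' => $dataType,
        'message'   => $e->getMessage(),
        'file'      => $e->getFile(),
        'line'      => $e->getLine(),
    ];

    Log::error('plg-api:sync-crawl-failed-from-crawler failed', $context);

    // A dedicated "slack" log channel forwards the same context to Slack.
    Log::channel('slack')->error('plg-api:sync-crawl-failed-from-crawler failed', $context);
}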

Troubleshooting

Check Data

  1. Verify summary_wishlist_* tables contain records with crawl_status = Error after command execution (a query sketch follows this list)
  2. Check that records have valid crawl_config_id values from previous create operations
  3. Ensure the identifiers in failed crawls match database records (input, input_type, mall_id, etc.)
  4. Validate that updated_at timestamps indicate recent status updates
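
Check 1 can be run quickly from php artisan tinker; the table and column names follow the schema above, and CrawlStatus::Error reuses the hypothetical enum from earlier (substitute the real integer value used for Error).

use Illuminate\Support\Facades\DB;

// List records recently flipped to Error.
DB::table('summary_wishlist_products')
    ->where('crawl_status', CrawlStatus::Error->value)
    ->where('updated_at', '>=', now()->subDay())
    ->get(['id', 'input', 'input_type', 'mall_id', 'crawl_status', 'updated_at']);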

Check Logs

  1. Monitor command execution logs for successful starts and completions
  2. Check API response logs for HTTP status codes and pagination information
  3. Review Slack notifications for success/failure patterns and processing statistics
  4. Examine job queue logs for processing delays or failures
  5. Verify database update logs show proper crawl_status transitions to Error
  6. Compare the count of failed crawls in API responses to updated database records
  7. Check for any skipped or unmatched failed crawls in processing logs