# Weekly Checks

Weekly maintenance checklist for application performance and reliability.

This guide outlines critical maintenance tasks that should be performed weekly to ensure optimal application performance and reliability.
## Overview

- **Frequency:** Weekly
- **Priority:** High - these checks catch most issues before they become critical
## Pre-Check Requirements
- Admin access to the application
- Access to Kubernetes cluster (kubectl configured)
- Access to LLM provider dashboard (OpenAI/Gemini/Anthropic)
- Access to Sentry dashboard
- (Optional) Access to Langfuse dashboard
## 1. Error Monitoring (Sentry)

**Priority:** Critical
### Steps

1. **Access Sentry Dashboard**
   - Get the Sentry URL from the `SENTRY_DSN` environment variable (see the sketch after this list), or access your Sentry project dashboard directly
2. **Review Error Trends**
   - Check for new error types in the last 7 days
   - Review error frequency trends (increasing/decreasing)
   - Identify errors affecting multiple users
3. **Priority Assessment**
   - **Critical:** Errors preventing core functionality (chat, login, document access)
   - **High:** Errors affecting >10 users
   - **Medium:** Sporadic errors with workarounds
   - **Low:** Single-occurrence errors
4. **Action Items**
   - Note critical/high priority errors for investigation
   - Apply immediate fixes if possible
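If the dashboard URL is not at hand, the DSN configured on the running pod points to the right Sentry project. A minimal sketch, assuming `SENTRY_DSN` is set directly in the container environment (pod and namespace names are placeholders):

```bash
# Print the Sentry DSN from the running application container
kubectl exec <app-pod-name> -n <namespace> -- printenv SENTRY_DSN

# If the value is injected from a Kubernetes secret instead, list the secrets
kubectl get secrets -n <namespace>
```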
### What to Look For
- 500 Internal Server Errors
- Database connection errors
- LLM API failures
- Authentication failures
- Background job errors
### Expected State
- Error rate stable or decreasing
- No critical errors affecting core features
- Known issues properly tracked
## 2. System Resources Monitoring

**Priority:** Critical

### Kubernetes Native Monitoring
#### Check Disk Space

```bash
# Check disk usage from within the application pod
kubectl exec <app-pod-name> -n <namespace> -- df -h
```

**Thresholds:**
- ⚠️ Warning: >70% usage
- 🚨 Critical: >80% usage
- 🔥 Emergency: >90% usage
**Actions if high:**
- Consider volume expansion
- Consider cleaning up application documents (see the sketch below for locating large directories)
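Before choosing between expansion and cleanup, it helps to see where the space is actually going. A minimal sketch; the `/rails` application root and storage path are assumptions, so adjust them to the image's actual layout:

```bash
# Summarize the largest top-level directories inside the app container
# (assumes GNU du in the image; /rails is a placeholder for the app root)
kubectl exec <app-pod-name> -n <namespace> -- du -h --max-depth=1 /rails | sort -h

# Check uploaded documents / Active Storage specifically (path is an assumption)
kubectl exec <app-pod-name> -n <namespace> -- du -sh /rails/storage
```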
#### Check Memory (RAM)

```bash
# Check pod memory usage
kubectl top pods -A

# Detailed pod resource usage
kubectl describe pod -n <namespace> <pod-name>
```

**Thresholds:**
- ⚠️ Warning: >70% of requested memory
- 🚨 Critical: >85% of requested memory
- Check for memory leaks if usage steadily increases
**Actions if high:**
- Check for memory leaks in logs
- Review large data processing jobs
- Consider scaling horizontally (more pods)
- Adjust resource limits if needed
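To see which pods are closest to their limits, the metrics output can be sorted and compared against the configured requests. A small sketch (the namespace is a placeholder, and `kubectl top` requires metrics-server):

```bash
# Pods sorted by current memory usage
kubectl top pods -n <namespace> --sort-by=memory

# Compare usage against configured requests and limits
kubectl get pods -n <namespace> -o custom-columns=\
NAME:.metadata.name,\
REQ:.spec.containers[*].resources.requests.memory,\
LIM:.spec.containers[*].resources.limits.memory
```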
#### Check CPU Usage

```bash
# Check pod CPU usage
kubectl top pods -A

# Check node CPU usage
kubectl top nodes
```

**Thresholds:**
- ⚠️ Warning: Sustained >70% usage
- 🚨 Critical: Sustained >85% usage
- Brief spikes to 100% are normal during processing
**Actions if high:**
- Identify resource-intensive processes
- Review background job queues
- Consider horizontal pod autoscaling
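If CPU pressure is sustained, a Horizontal Pod Autoscaler is one way to act on the autoscaling recommendation above. A hedged sketch; the deployment name, namespace, and replica bounds are placeholders to adapt to your capacity:

```bash
# Create a Horizontal Pod Autoscaler targeting ~70% CPU utilization
kubectl autoscale deployment <app-deployment-name> -n <namespace> \
  --cpu-percent=70 --min=2 --max=6

# Verify autoscaler status and current replica count
kubectl get hpa -n <namespace>
```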
## 3. Document Health

**Priority:** High

### Admin Dashboard Review
1. **Access Main Dashboard**
   - Navigate to: `[APP_HOST]/admin/dashboards`
   - Login with admin credentials

2. **Review Core Metrics**

   Check the following metrics and trends:

   **Documents Status:**
   - Total Documents count
   - Total Chunks count
   - Unvectorized Documents percentage
   - Recent Documents (last 7 days)

   **System Health Indicators:**
   - Unvectorized Documents rate
   - Crawler Error Rate (last 7 days)
   - Empty Sources status
   - Uncrawled sources count

   **User Engagement:**
   - Daily Active Users (last 24h)
   - Monthly Active Users (last 30 days)
   - Average Messages per Chat

3. **Document Vectorization Check**

   Navigate to: `[APP_HOST]/admin/documents`
   - Review the documents list
   - Check the "chunks / vectorized" status for each document
   - Identify documents with incomplete vectorization (see the sketch after this list)

   **Expected State:**
   - <10% of documents unvectorized
   - Vectorization completing within 24 hours of upload

   **Actions if issues found:**
   - Identify stuck vectorization jobs
   - Check background job queue health
   - Review LLM API errors
   - Consider manual rechunking via the admin panel

4. **Crawler Health Check**
   - Review crawler error rates
   - Check recent crawler runs status
   - Identify failing document sources

   Navigate to: `[APP_HOST]/admin/document_sources`
   - Review each source's last crawl status
   - Check for sources with repeated failures
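To put a number on the vectorization backlog flagged in step 3, a quick console check can help. This is a hypothetical sketch: the `Document`/`Chunk` model names are assumptions about the schema, so adapt the query to the application's actual models:

```bash
# Documents without any chunks, as a rough proxy for unvectorized documents
# (model names are hypothetical)
kubectl exec <app-pod-name> -n <namespace> -- \
  bin/rails runner 'puts Document.where.missing(:chunks).count'

# Total documents, for computing the unvectorized percentage
kubectl exec <app-pod-name> -n <namespace> -- \
  bin/rails runner 'puts Document.count'
```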
### Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Unvectorized Docs | >10% | >25% |
| Crawler Error Rate | >5% | >15% |
| Empty Sources | >0 | >3 |
| Failed Chunks | >50 | >200 |
## 4. LLM Provider Monitoring

**Priority:** Critical

### Provider Dashboard Review
#### Steps to Review API Usage

1. **Access Billing Dashboard**
   - Navigate to your LLM provider's billing/account section
2. **Review Key Metrics**
   - API usage (last 7 days)
   - Cost trends
   - Rate limit status
   - Credit balance remaining
   - Any deprecation notices
3. **Alert Thresholds**
   - ⚠️ Warning: Low credit balance (<1 week of runway based on current usage)
   - 🚨 Critical: Very low credit balance (<3 days of runway)
   - 🚨 Critical: Rate limit errors detected
4. **Actions if Issues Found**
   - Add credits if the balance is low
   - Review unusual cost spikes
   - Check for inefficient prompts
   - Upgrade the API tier if rate limited
**Example with OpenAI:**
- Navigate to: https://platform.openai.com/usage
- Review API usage, cost trends, and credit balance
- Check for rate limit errors or deprecation notices
- Set up billing alerts if needed
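A quick API call can also confirm that the key the application uses is still valid and not being throttled. A minimal sketch against OpenAI's models endpoint (the `OPENAI_API_KEY` variable is a placeholder; other providers expose similar list endpoints):

```bash
# Expect HTTP 200; 401 suggests an invalid key, 429 suggests rate limiting
# or an exhausted quota
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  https://api.openai.com/v1/models
```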
### Secondary: Langfuse Monitoring

Access the Langfuse dashboard (default: https://cloud.langfuse.com) and review application-level metrics:
- Generation counts by model
- Average latency per model
- Token usage trends
- Error rates by operation
- Cost per generation
#### Actions if Issues Found

**High Costs:**
- Review unusual spike patterns
- Check for inefficient prompts
**Rate Limits:**
- Identify peak usage times
- Upgrade API tier if needed
## 5. Background Jobs Monitoring

**Priority:** Critical

### Access Mission Control
1. **Navigate to:** `[APP_HOST]/jobs`
   - Note: Requires the `super_admin` role
   - Login with super admin credentials if needed

2. **Review Job Queue Health** (see the sketch after this list)
   - Check queue depths (should be low)
   - Review failed jobs count
   - Check average processing times

3. **Failed Jobs Analysis**

   For each failed job:
   - Review the error message
   - Check the failure time
   - Identify the job type (document processing, crawler, etc.)
   - Determine if a retry is safe

4. **Key Job Types to Monitor**

   | Job Type | Expected Frequency | Max Duration |
   |---|---|---|
   | Document Processing | Per upload | 5-60 minutes |
   | Crawler Runs | Scheduled | 10-120 minutes |
   | Embedding Generation | Per document | 2-10 minutes |
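The queue and failure counts in Mission Control can be cross-checked from a console. The sketch below assumes Solid Queue as the Active Job backend; Mission Control supports other adapters, so the class names are assumptions:

```bash
# Count failed executions (assumes the Solid Queue backend)
kubectl exec <app-pod-name> -n <namespace> -- \
  bin/rails runner 'puts SolidQueue::FailedExecution.count'

# Rough queue depth: jobs enqueued but not yet finished
kubectl exec <app-pod-name> -n <namespace> -- \
  bin/rails runner 'puts SolidQueue::Job.where(finished_at: nil).count'
```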
### Alert Thresholds
- ⚠️ Warning: >10 failed jobs
- 🚨 Critical: >50 failed jobs or jobs stuck >2 hours
- 🚨 Critical: Queue depth >1000 jobs
### Actions if Issues Found

**High Failed Job Count:**
- Review error patterns
- Check LLM API connectivity
- Verify database connectivity
- Check for code deployment issues
**Stuck Jobs:**
- Identify the operation being performed
- Check associated logs
- Consider manual job termination
- May require application restart
**High Queue Depth:**
- Check worker availability
- Review job priorities
- Consider scaling workers
- Identify slow operations
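When the queue depth stays high and workers are the bottleneck, scaling the worker deployment is usually the quickest lever. A hedged sketch; the deployment name and replica count are placeholders, and it assumes job workers run in a deployment separate from the web pods:

```bash
# Add worker capacity (deployment name and replica count are placeholders)
kubectl scale deployment <worker-deployment-name> -n <namespace> --replicas=4

# Confirm the new pods come up and start draining the queue
kubectl get pods -n <namespace> -w
```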
## 6. Health Endpoint Check

**Priority:** Medium

### Manual Check

```bash
# Simple check
curl [APP_HOST]/up

# Expected response: 200 OK
```

### What it Checks
The `/up` endpoint verifies:
- Application is running
- Database connectivity
- Basic Rails stack health
### Expected Response

`200 OK`

### Actions if Fails

- 🚨 Critical: Immediate investigation required
- Check application logs: `kubectl logs -n <namespace> <pod-name>`
- Verify database connectivity
- Check pod status: `kubectl get pods -A`
- Review recent deployments
## Automation Recommendations
While this checklist is designed for manual review, consider automating alerts for:
- Disk space >80%
- Failed jobs >10
- Health endpoint failures
- Sentry critical error spikes
- LLM credit balance low
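As a starting point, a small script run from cron or a Kubernetes CronJob can cover the health endpoint and disk checks above. This is a minimal sketch rather than a full monitoring setup; the host, namespace, pod selection, and echo-based "alerts" are placeholders to adapt:

```bash
#!/usr/bin/env bash
# Minimal weekly-checks watchdog: health endpoint + disk usage.
set -euo pipefail

APP_HOST="https://example-app.internal"   # placeholder
NAMESPACE="<namespace>"                   # placeholder
# First pod in the namespace; refine the selector for real use
POD="$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[0].metadata.name}')"

# 1. Health endpoint must return 200 ("000" means the request itself failed)
status="$(curl -s -o /dev/null -w '%{http_code}' "$APP_HOST/up" || true)"
if [ "$status" != "200" ]; then
  echo "ALERT: $APP_HOST/up returned $status"   # swap for your alerting command
fi

# 2. Flag filesystems above 80% usage inside the app pod (df -P is POSIX output)
kubectl exec "$POD" -n "$NAMESPACE" -- df -P \
  | awk 'NR > 1 && $5 + 0 >= 80 { print "ALERT: " $6 " at " $5 }'
```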