# Workspace Management Guide

**Part of Task P2-1.3: Supervised Workspace Cleanup (Phase 2)**

Comprehensive guide to the supervised workspace management system with automatic cleanup, crash recovery, and TTL-based retention.

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Workspace Lifecycle](#workspace-lifecycle)
3. [Cleanup Policies](#cleanup-policies)
4. [Orphan Detection and Recovery](#orphan-detection-and-recovery)
5. [Size Limit Enforcement](#size-limit-enforcement)
6. [Audit Trail and Monitoring](#audit-trail-and-monitoring)
7. [Manual Cleanup Procedures](#manual-cleanup-procedures)
8. [Troubleshooting](#troubleshooting)
9. [Performance Characteristics](#performance-characteristics)
10. [Best Practices](#best-practices)
11. [Related Documentation](#related-documentation)

## Architecture Overview

The workspace management system provides:

- **Isolated Workspaces**: Each agent task gets its own isolated directory
- **Automatic Cleanup**: Workspaces are cleaned up when the agent completes or crashes
- **TTL-Based Retention**: Workspaces expire after a configurable TTL (default: 24h)
- **Orphan Detection**: Crashed/killed agents trigger orphan detection and cleanup
- **Size Limits**: Per-workspace size limits (configurable, default: 1GB) prevent disk exhaustion
- **Audit Trail**: Complete history of all cleanup operations with metadata
- **Manual Cleanup**: Operator utilities for manual workspace management

### Component Structure

```
WorkspaceSupervisor (src/services/workspace-supervisor.ts)
├── Create/cleanup workspaces
├── Track lifecycle and metadata
├── Enforce TTL policies
├── Monitor size limits
└── Record audit trail

OrphanDetector (src/lib/orphan-detector.ts)
├── Detect orphaned workspaces (no active process)
├── Grace period management (10 minutes)
├── Background scanning
└── Automatic cleanup

Database Schema (src/db/migrations/007-workspace-tracking-schema.sql)
├── Workspaces table (metadata)
├── Cleanup history (audit trail)
├── Workspace metrics (size tracking)
└── Orphan tracking (crash detection)

Cleanup Utilities (scripts/cleanup-workspaces.sh)
├── List workspaces
├── Manual cleanup
├── Force cleanup (skip TTL)
└── Reports and statistics
```

## Workspace Lifecycle

### 1. Creation (Agent Starts)

When an agent task begins, a supervised workspace is created:

```typescript
const supervisor = new WorkspaceSupervisor({
  workspaceRoot: '/tmp/cfn-workspaces',
  maxWorkspaceSizeBytes: 1024 * 1024 * 1024, // 1GB
  defaultTtlHours: 24
});
await supervisor.initialize();

// Create isolated workspace
const workspace = await supervisor.createWorkspace({
  agentId: 'backend-dev-001',
  taskId: 'task-123',
  maxSizeBytes: 1024 * 1024 * 1024,
  ttlHours: 24
});

// Returns:
// {
//   id: 'uuid...',
//   agentId: 'backend-dev-001',
//   taskId: 'task-123',
//   path: '/tmp/cfn-workspaces/backend-dev-001-task-123-uuid/',
//   createdAt: Date,
//   ttlHours: 24,
//   ...
// }
```

**What Happens:**
- Isolated directory created: `/tmp/cfn-workspaces/{agentId}-{taskId}-{uuid}/`
- Workspace metadata registered in SQLite database
- TTL timer initialized (24h default)
- Size monitoring activated
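
The directory naming scheme above can be sketched as a small helper. This is only an illustration: `buildWorkspacePath` is a hypothetical name, not part of `WorkspaceSupervisor`, and the character sanitizing is an assumption added so that IDs cannot contain path separators.

```typescript
import { randomUUID } from 'node:crypto';
import * as path from 'node:path';

// Hypothetical helper: builds `{workspaceRoot}/{agentId}-{taskId}-{uuid}`.
// Replacing unsafe characters is an assumption made for this sketch; it keeps
// an ID such as "a/b" from creating nested directories.
function buildWorkspacePath(
  workspaceRoot: string,
  agentId: string,
  taskId: string,
  uuid: string = randomUUID(),
): string {
  const safe = (part: string): string => part.replace(/[^A-Za-z0-9._-]/g, '_');
  return path.join(workspaceRoot, `${safe(agentId)}-${safe(taskId)}-${uuid}`);
}
```

Passing the UUID explicitly makes the helper deterministic in tests; omitting it generates a fresh one per workspace.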

### 2. Monitoring (During Execution)

During agent execution:

```typescript
// Agent writes output files
await fs.writeFile(path.join(workspace.path, 'output.txt'), 'result');

// Supervisor tracks:
// - File count and total size
// - Size limit violations
// - Last access time (for orphan detection)
```

**Tracked Metrics:**
- Current size in bytes
- File count
- Last accessed timestamp
- Process ID (for crash detection)
- Size limit exceeded flag
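
As a rough sketch of how the size and file count metrics could be gathered, the helper below walks a workspace directory. `measureWorkspace` and `WorkspaceUsage` are hypothetical names, and the supervisor's actual measurement strategy is not documented here. Synchronous `fs` calls are used only to keep the example short.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

interface WorkspaceUsage {
  sizeBytes: number;
  fileCount: number;
}

// Recursively sums file sizes and counts regular files under `dir`.
// Symbolic links are neither followed nor counted, so the walk cannot
// leave the workspace directory.
function measureWorkspace(dir: string): WorkspaceUsage {
  const usage: WorkspaceUsage = { sizeBytes: 0, fileCount: 0 };
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      const nested = measureWorkspace(fullPath);
      usage.sizeBytes += nested.sizeBytes;
      usage.fileCount += nested.fileCount;
    } else if (entry.isFile()) {
      usage.sizeBytes += fs.statSync(fullPath).size;
      usage.fileCount += 1;
    }
  }
  return usage;
}
```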

### 3. Cleanup (Completion or Crash)

Cleanup is triggered when the agent completes, or by orphan detection after a crash:

```typescript
// Normal completion
await supervisor.cleanupWorkspace(workspace.id, {
  reason: 'agent_completed',
  preserveArtifacts: ['report.md', 'output.json'],
  artifactDestination: '/path/to/artifacts'
});

// On crash (orphan detection)
const orphanDetector = new OrphanDetector({
  workspaceRoot: '/tmp/cfn-workspaces',
  gracePeriodMinutes: 10
});
const cleanupStats = await orphanDetector.cleanupOrphans();
```

**Cleanup Options:**
- `reason`: Why cleanup occurred (agent_completed, agent_crashed, ttl_expired, manual)
- `preserveArtifacts`: Glob patterns for files to preserve
- `artifactDestination`: Where to move preserved artifacts
- `metadata`: Additional context (exit code, duration, etc.)
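
The options above could be typed roughly as follows. The shapes are inferred from this list and are assumptions, not the supervisor's published types. `validateCleanupOptions` is a hypothetical pre-flight check, and its rule that preserved artifacts need a destination is also an assumption.

```typescript
// Assumed shapes, inferred from the option list above; the real
// WorkspaceSupervisor types may differ.
type CleanupReason = 'agent_completed' | 'agent_crashed' | 'ttl_expired' | 'manual';

interface CleanupOptions {
  reason: CleanupReason;
  preserveArtifacts?: string[];        // glob patterns
  artifactDestination?: string;        // directory that receives preserved files
  metadata?: Record<string, unknown>;  // e.g. { exitCode: 0, duration: 5000 }
}

const CLEANUP_REASONS: readonly string[] = [
  'agent_completed',
  'agent_crashed',
  'ttl_expired',
  'manual',
];

// Type guard for reasons that arrive as plain strings (CLI flags, JSON).
function isCleanupReason(value: string): value is CleanupReason {
  return CLEANUP_REASONS.includes(value);
}

// Hypothetical pre-flight check: returns a list of problems, empty if none.
function validateCleanupOptions(options: CleanupOptions): string[] {
  const problems: string[] = [];
  if (!isCleanupReason(options.reason)) {
    problems.push(`unknown reason: ${String(options.reason)}`);
  }
  const patterns = options.preserveArtifacts ?? [];
  if (patterns.length > 0 && !options.artifactDestination) {
    problems.push('preserveArtifacts requires artifactDestination');
  }
  return problems;
}
```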

**What Happens:**
1. Files matching preserve patterns are moved to artifact destination
2. Remaining workspace directory is removed
3. Cleanup recorded in audit trail
4. Workspace unregistered from database
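
Step 1 depends on matching file paths against the preserve patterns. The sketch below shows one way to split a file list into "preserve" and "remove" sets. The glob semantics are an assumption (`*` stays within one path segment, `**` crosses segments); this guide does not specify which glob library the supervisor uses. Both function names are hypothetical.

```typescript
// Converts a simple glob into an anchored regular expression.
// Supported: `*` (within one segment), `**` (any depth), `?` (one character).
function globToRegExp(glob: string): RegExp {
  let source = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*') {
      if (glob[i + 1] === '*') {
        source += '.*';
        i++; // consume the second '*'
      } else {
        source += '[^/]*';
      }
    } else if (ch === '?') {
      source += '[^/]';
    } else {
      // Escape characters that have a special meaning in regular expressions.
      source += ch.replace(/[.+^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp(`^${source}$`);
}

// Splits workspace-relative paths into files to preserve and files to remove.
function partitionByPreservePatterns(
  relativePaths: string[],
  patterns: string[],
): { preserve: string[]; remove: string[] } {
  const matchers = patterns.map(globToRegExp);
  const preserve: string[] = [];
  const remove: string[] = [];
  for (const file of relativePaths) {
    (matchers.some((re) => re.test(file)) ? preserve : remove).push(file);
  }
  return { preserve, remove };
}
```

Under these semantics `*.md` matches `report.md` but not `docs/notes.md`; a pattern such as `**/*.md` would be needed for nested files.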

### 4. TTL Cleanup (Background Task)

Background scheduler runs hourly:

```
Every 60 minutes (configurable):
  1. Find workspaces older than TTL
  2. For each stale workspace:
     - Preserve artifacts if configured
     - Remove workspace directory
     - Record cleanup in audit trail
  3. Log statistics (workspaces cleaned, space freed)
```
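
Step 1 of the loop above, finding workspaces older than their TTL, can be sketched as a pure function. `WorkspaceRecord`, `isExpired`, and `findExpired` are hypothetical names; measuring the TTL from `createdAt` is an assumption based on the fields returned by `createWorkspace`.

```typescript
interface WorkspaceRecord {
  id: string;
  createdAt: Date;
  ttlHours: number;
}

const MS_PER_HOUR = 60 * 60 * 1000;

// A workspace is stale once its age reaches its TTL.
function isExpired(workspace: WorkspaceRecord, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - workspace.createdAt.getTime();
  return ageMs >= workspace.ttlHours * MS_PER_HOUR;
}

// Returns the workspaces that the TTL sweep should clean up.
function findExpired(workspaces: WorkspaceRecord[], now: Date = new Date()): WorkspaceRecord[] {
  return workspaces.filter((workspace) => isExpired(workspace, now));
}
```

Passing `now` as a parameter keeps the check deterministic, which makes the retention rule easy to unit test.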

## Cleanup Policies

### Normal Completion Cleanup

Triggered when the agent finishes, whether it succeeds or fails:

```typescript
const stats = await supervisor.cleanupWorkspace(workspace.id, {
  reason: 'agent_completed',
  preserveArtifacts: ['*.md', 'artifacts/**'],
  artifactDestination: '/artifacts/task-123/'
});

// Returns:
// {
//   cleanedCount: 1,
//   totalSizeFreed: 52428800, // 50MB
//   filesRemoved: 127
// }
```

**Preserved Artifacts Example** (with `preserveArtifacts: ['report.md', 'output.json']`):
```
Workspace:
  output.json          → /artifacts/task-123/output.json ✓
  report.md            → /artifacts/task-123/report.md ✓
  temp_build/          → (removed) ✗
  cache/               → (removed) ✗
```

### TTL-Based Cleanup

Automatic cleanup for old workspaces:

```typescript
// Runs every 60 minutes (configurable)
const stats = await supervisor.enforceRetentionPolicy({
  preservePatterns: ['*.md', 'output.*']
});

// Returns statistics of cleaned workspaces
```

**TTL Policy:**
- Default: 24 hours
- Configurable per workspace
- Preserved artifacts moved before cleanup
- Audit trail recorded

### Crash Recovery Cleanup

Orphaned workspace detection and cleanup:

```typescript
const orphans = await orphanDetector.detectOrphans();

// Returns:
// [
//   {
//     id: 'workspace-uuid',
//     agentId: 'agent-001',
//     path: '/tmp/cfn-workspaces/...',
//     processId: 12345,
//     lastAccessedAt: Date,
//     sizeBytes: 52428800,
//     fileCount: 127
//   }
// ]

// Cleanup orphans after grace period
const stats = await orphanDetector.cleanupOrphans();
```

**Orphan Detection Logic:**
1. Scan workspace directories
2. Check if associated process is still active (using process ID)
3. Measure time since last access
4. If process dead AND past grace period → mark as orphan
5. Cleanup orphan workspace

**Grace Period:** 10 minutes (configurable)
- Prevents premature cleanup during agent restarts
- After 10 minutes, orphaned workspace is safe to delete
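
The decision rule in step 4 (process dead AND past the grace period) can be written as a small pure function. This is a sketch under stated assumptions: `TrackedWorkspace` and `isOrphan` are hypothetical names, the grace period is measured from `lastAccessedAt`, and the liveness check is injected so the rule can be tested without real processes.

```typescript
interface TrackedWorkspace {
  id: string;
  processId: number;
  lastAccessedAt: Date;
}

const MS_PER_MINUTE = 60 * 1000;

// A workspace is an orphan only if its process is dead AND it has been
// idle for at least the grace period.
function isOrphan(
  workspace: TrackedWorkspace,
  isProcessAlive: (pid: number) => boolean,
  gracePeriodMinutes: number = 10,
  now: Date = new Date(),
): boolean {
  if (isProcessAlive(workspace.processId)) {
    return false; // a live process always protects its workspace
  }
  const idleMs = now.getTime() - workspace.lastAccessedAt.getTime();
  return idleMs >= gracePeriodMinutes * MS_PER_MINUTE;
}
```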

## Orphan Detection and Recovery

### How Orphan Detection Works

1. **Process Monitoring**: Each workspace tracks the process ID (PID) of the agent
2. **Activity Tracking**: Last access time recorded for each workspace
3. **Grace Period**: 10-minute grace period after process death
4. **Automatic Cleanup**: After grace period, workspace is cleaned up
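
One common way to implement the process check in Node.js is to send signal `0`, which performs the existence and permission checks without delivering a signal. The helper below is a sketch of that technique; `isProcessAlive` is a hypothetical name, and whether `OrphanDetector` uses this approach is not confirmed by this guide.

```typescript
// Returns true if a process with the given PID currently exists.
function isProcessAlive(pid: number): boolean {
  // PIDs of 0 or below address process groups, so reject them up front.
  if (!Number.isInteger(pid) || pid <= 0) {
    return false;
  }
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is sent
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === 'EPERM';
  }
}
```

A PID can be reused by an unrelated process after the agent dies, so a PID check alone can report a dead agent as alive. Combining it with the last access time, as described above, reduces that risk.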

### Detecting Orphans

```bash
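# Manual orphan detection
# (assumes the TypeScript sources have been compiled to JavaScript, or that a TypeScript loader is registered)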
node -e "
const { OrphanDetector } = require('./src/lib/orphan-detector');
const detector = new OrphanDetector({
  workspaceRoot: '/tmp/cfn-workspaces',
  gracePeriodMinutes: 10
});
detector.detectOrphans().then(orphans => {
  console.log('Orphaned workspaces:', orphans);
});
"
```

### Grace Period Protection

**Scenario: Agent Restart During Grace Period**

```
Time: 0:00   Agent crashes, workspace marked orphaned
Time: 0:05   Process restarts, updates last_accessed_at
             Grace period timer resets
Time: 0:15   Grace period expires, but process is active
             → Workspace NOT cleaned up
```

**Scenario: Agent Crash, No Restart**

```
Time: 0:00   Agent crashes, workspace marked orphaned
Time: 0:10   Grace period expires, process still dead
             → Workspace cleaned up automatically
```

### Background Scanning

Orphan detector runs background scans (default: every 30 minutes):

```typescript
const detector = new OrphanDetector({
  workspaceRoot: '/tmp/cfn-workspaces',
  gracePeriodMinutes: 10,
  scanIntervalMinutes: 30 // Background scan interval
});

detector.start(); // Start background scanning
// ... later ...
detector.stop(); // Stop background scanning
```
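
For readers curious how `start()` and `stop()` might behave internally, here is a minimal timer-based sketch. `PeriodicScanner` is a hypothetical class, not the actual `OrphanDetector` implementation. It also skips a tick when the previous scan is still running, which is an added design assumption.

```typescript
class PeriodicScanner {
  private readonly scan: () => Promise<void>;
  private readonly intervalMs: number;
  private timer: ReturnType<typeof setInterval> | undefined;
  private scanInFlight = false;

  constructor(scan: () => Promise<void>, intervalMinutes: number) {
    this.scan = scan;
    this.intervalMs = intervalMinutes * 60 * 1000;
  }

  get isRunning(): boolean {
    return this.timer !== undefined;
  }

  start(): void {
    if (this.timer !== undefined) return; // already started
    this.timer = setInterval(() => void this.tick(), this.intervalMs);
  }

  stop(): void {
    if (this.timer === undefined) return;
    clearInterval(this.timer);
    this.timer = undefined;
  }

  // Runs one scan; skipped if the previous scan has not finished yet.
  private async tick(): Promise<void> {
    if (this.scanInFlight) return;
    this.scanInFlight = true;
    try {
      await this.scan();
    } catch (err) {
      console.error('Orphan scan failed:', err);
    } finally {
      this.scanInFlight = false;
    }
  }
}
```

Calling `start()` twice does not create a second timer, and `stop()` is safe to call when the scanner is not running.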

## Size Limit Enforcement

### Per-Workspace Size Limits

Each workspace has a configurable size limit (default: 1GB):

```typescript
const workspace = await supervisor.createWorkspace({
  agentId: 'agent-001',
  taskId: 'task-001',
  maxSizeBytes: 1024 * 1024 * 1024, // 1GB limit
  ttlHours: 24
});
```

### Monitoring Size Usage

```typescript
// Get workspace info with size
const info = await supervisor.getWorkspaceInfo(workspace.id);

console.log(info);
// Example output:
// {
//   sizeBytes: 524288000,     // 500MB
//   fileCount: 1240,
//   maxSizeBytes: 1073741824, // 1GB
//   exceedsLimit: false
// }
```

### What Happens When Limit Exceeded

1. **Violation Flagged**: `exceedsLimit` flag set to true
2. **Metric Recorded**: Size violation recorded in `workspace_metrics`
3. **Alert Generated**: Warning logged with workspace details
4. **Operator Notified**: Violation surfaced to operators via monitoring
5. **Emergency Cleanup**: An operator can trigger cleanup manually if needed
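
The flagging step can be illustrated with a small check. `checkSizeLimit` is a hypothetical helper; treating a workspace as over the limit only when its size is strictly greater than `maxSizeBytes` is an assumption.

```typescript
interface SizeCheck {
  exceedsLimit: boolean;
  usageRatio: number; // 0.5 means half of the limit is used
}

// Compares the measured size against the per-workspace limit.
function checkSizeLimit(sizeBytes: number, maxSizeBytes: number): SizeCheck {
  if (maxSizeBytes <= 0) {
    throw new RangeError('maxSizeBytes must be positive');
  }
  return {
    exceedsLimit: sizeBytes > maxSizeBytes,
    usageRatio: sizeBytes / maxSizeBytes,
  };
}
```

With the figures from the monitoring example (500MB used against a 1GB limit), the ratio is roughly 0.49 and the flag stays false.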

### Disk Usage Monitoring

Get overall workspace statistics:

```typescript
const stats = await supervisor.getStatistics();

console.log(stats);
// Example output:
// {
//   totalWorkspaces: 45,
//   activeWorkspaces: 42,
//   totalDiskUsage: 536870912000, // 500GB
//   staleWorkspaces: 3
// }
```

## Audit Trail and Monitoring

### Cleanup History

Complete audit trail of all cleanup operations:

```typescript
const history = await supervisor.getCleanupHistory(workspace.id);

// Returns:
// [
//   {
//     cleanedAt: Date,
//     reason: 'agent_completed',
//     sizeFreed: 52428800,
//     filesRemoved: 127,
//     metadata: { exitCode: 0, duration: 5000 }
//   },
//   {
//     cleanedAt: Date,
//     reason: 'ttl_expired',
//     sizeFreed: 0, // Artifacts preserved
//     filesRemoved: 0,
//     metadata: { preserved_artifacts: 3 }
//   }
// ]
```

### Metrics Tracking

Workspace size metrics tracked over time:

```sql
-- Query workspace metrics
SELECT
  recorded_at,
  size_bytes,
  file_count,
  exceeds_limit
FROM workspace_metrics
WHERE workspace_id = 'workspace-id'
ORDER BY recorded_at DESC
LIMIT 100;
```

### Cleanup Reporting

Generate cleanup reports:

```bash
# List all cleanups (last 10)
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT workspace_id, reason, size_freed, cleaned_at
   FROM cleanup_history
   ORDER BY cleaned_at DESC
   LIMIT 10;"

# Cleanup by reason
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT reason, COUNT(*) as count, SUM(size_freed) as total_freed
   FROM cleanup_history
   GROUP BY reason;"

# Disk freed over time
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT DATE(cleaned_at) as date, COUNT(*) as cleanups, SUM(size_freed) as freed
   FROM cleanup_history
   GROUP BY DATE(cleaned_at)
   ORDER BY date DESC;"
```
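
The "cleanup by reason" aggregation can also be done in application code over the records returned by `getCleanupHistory`. The sketch below assumes each record has at least `reason` and `sizeFreed`, as in the example output earlier in this section; `summarizeByReason` is a hypothetical helper.

```typescript
interface CleanupRecord {
  reason: string;
  sizeFreed: number;
}

interface ReasonSummary {
  count: number;
  totalFreed: number;
}

// Groups cleanup records by reason, mirroring the SQL GROUP BY query.
function summarizeByReason(history: CleanupRecord[]): Map<string, ReasonSummary> {
  const summary = new Map<string, ReasonSummary>();
  for (const record of history) {
    const entry = summary.get(record.reason) ?? { count: 0, totalFreed: 0 };
    entry.count += 1;
    entry.totalFreed += record.sizeFreed;
    summary.set(record.reason, entry);
  }
  return summary;
}
```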

## Manual Cleanup Procedures

### List All Workspaces

```bash
./scripts/cleanup-workspaces.sh --list
```

Output:
```
=== Workspace Listing ===

Active Workspaces:
ID          Agent              Task           Size      Created
─────────────────────────────────────────────────────────────
abc123      backend-dev-001    task-123       50MB      2025-11-15T10:00:00Z
def456      frontend-dev-002   task-124       100MB     2025-11-15T11:30:00Z
ghi789      tester-003         task-125       25MB      2025-11-15T14:00:00Z
```

### Show Orphaned Workspaces

```bash
./scripts/cleanup-workspaces.sh --orphans
```

Shows workspaces in grace period (orphan detected, waiting for grace period expiry).

### Manual Cleanup

```bash
./scripts/cleanup-workspaces.sh --cleanup abc123def456
```

Interactive cleanup (requires confirmation):
```
About to delete workspace:
  ID: abc123def456
  Path: /tmp/cfn-workspaces/backend-dev-001-task-123-abc123/
  Size: 50MB

Continue? (yes/no): yes
✓ Workspace cleaned up: abc123def456 (50MB freed)
```

### Force Cleanup (Skip TTL)

```bash
./scripts/cleanup-workspaces.sh --force-cleanup abc123def456
```

Skips the TTL and grace period checks and cleans up the workspace immediately.

### Cleanup All Orphans

```bash
./scripts/cleanup-workspaces.sh --cleanup-orphans
```

Scans for orphaned workspaces past grace period and cleans them up.

### Generate Report

```bash
./scripts/cleanup-workspaces.sh --report
```

Comprehensive workspace report:
```
=== Workspace Report ===

Overall Statistics:
Total Workspaces: 42
Total Size (bytes): 536870912000
Total Files: 12450
Most Recent: 2025-11-16T10:00:00Z

Recent Cleanups:
ID (short)  Reason            Size (bytes)    When
─────────────────────────────────────────────────
abc123      agent_completed   52428800        2025-11-16T09:30:00Z
def456      ttl_expired       26214400        2025-11-16T08:00:00Z
ghi789      manual            10485760        2025-11-15T20:00:00Z

Largest Workspaces:
ID (short)  Agent             Task           Size (bytes)  Files
──────────────────────────────────────────────────────────────
xyz999      heavy-job-001     task-500       268435456     2840
```

## Troubleshooting

### Orphaned Files Not Being Cleaned

**Symptoms:** Workspace directories still exist after 24h

**Diagnosis:**
```bash
# Check if process is still active (replace <process-id> with the recorded PID)
ps -p <process-id>

# Check workspace metadata
cat /tmp/cfn-workspaces/agent-task-uuid/.metadata.json

# Check orphan tracking table
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT * FROM orphan_tracking WHERE cleaned_at IS NULL;"
```

**Solutions:**
1. **Check process**: If the process is still running, the workspace is not orphaned and is correctly left in place
2. **Check grace period**: Wait 10+ minutes after process death
3. **Manual cleanup**: Use `--force-cleanup` to skip grace period
4. **Check permissions**: Verify write permissions to workspace root

### Workspace Size Limit Exceeded

**Symptoms:** `exceedsLimit: true` in workspace info

**Diagnosis:**
```bash
# Check workspace size
du -sh /tmp/cfn-workspaces/agent-task-uuid/

# List largest files
du -sh /tmp/cfn-workspaces/agent-task-uuid/* | sort -h | tail -10
```

**Solutions:**
1. **Clean temporary files**: Remove cache, temp directories
2. **Move artifacts**: Preserve important files and cleanup
3. **Increase limit**: Adjust `maxSizeBytes` in workspace config
4. **Manual cleanup**: Force cleanup and restart task

### Database Corruption

**Symptoms:** Database errors in logs, cleanup operations fail

**Recovery:**
```bash
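# Back up the database before touching it
cp ./cfn-workspaces/metadata.db ./cfn-workspaces/metadata.db.backup

# Verify database integrity (prints "ok" when the database is healthy)
sqlite3 ./cfn-workspaces/metadata.db "PRAGMA integrity_check;"

# If corrupted: move the damaged file aside, then recreate an empty schema
# (existing metadata and cleanup history are not restored by this step)
mv ./cfn-workspaces/metadata.db ./cfn-workspaces/metadata.db.corrupt
sqlite3 ./cfn-workspaces/metadata.db < ./src/db/migrations/007-workspace-tracking-schema.sql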
```

### High Disk Usage

**Diagnosis:**
```bash
# Count recorded cleanups by reason
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT reason, COUNT(*) FROM cleanup_history GROUP BY reason;"

# Find workspaces not being cleaned
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT id, agent_id, task_id, current_size_bytes
   FROM workspaces
   WHERE id NOT IN (SELECT DISTINCT workspace_id FROM cleanup_history)
   ORDER BY current_size_bytes DESC;"
```

**Solutions:**
1. Run TTL cleanup: `supervisor.enforceRetentionPolicy()`
2. Clean orphans: `orphanDetector.cleanupOrphans()`
3. Manual cleanup for large workspaces
4. Review TTL policy (may be too long)

## Performance Characteristics

### Workspace Creation

- **Time**: < 100ms
- **Operations**: Directory creation + database insert
- **Scalability**: 1000+ concurrent workspaces

### Cleanup Operations

- **< 100MB**: < 1 second
- **< 1GB**: < 5 seconds
- **> 1GB**: ~5ms per MB
- **Database updates**: < 100ms
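
The per-megabyte figure can be turned into a quick estimate for large workspaces. This is back-of-the-envelope arithmetic based only on the approximate rate quoted above, not a measured model; `estimateLargeCleanupMs` is a hypothetical helper.

```typescript
const BYTES_PER_MB = 1024 * 1024;
const MS_PER_MB = 5; // approximate rate quoted above for workspaces over 1GB

// Rough cleanup time in milliseconds for a workspace larger than 1GB.
function estimateLargeCleanupMs(sizeBytes: number): number {
  return (sizeBytes / BYTES_PER_MB) * MS_PER_MB;
}
```

For example, a 4GB workspace is 4096MB, so the estimate is 4096 × 5ms, or about 20.5 seconds.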

### Orphan Detection

- **Scan time**: ~30 seconds for 1000 workspaces
- **Background interval**: 30 minutes (configurable)
- **Grace period check**: O(n) where n = workspaces in grace period

### TTL Cleanup

- **100 workspaces**: < 5 minutes
- **1000 workspaces**: < 50 minutes
- **Background interval**: 60 minutes (configurable)

### Database

- **Schema**: ~7 tables with indexes
- **Cleanup history size**: Scales with workspaces × average cleanups per workspace (typically 1-2)
- **Estimated size**: ~1MB per 1000 workspaces × 10 cleanup events

### Disk Usage Limits

- **Maximum workspaces**: Unlimited (filesystem dependent)
- **Recommended monitoring threshold**: 80% of available disk
- **Emergency cleanup trigger**: 90% disk usage
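
The two thresholds can be expressed as a simple classification. `classifyDiskUsage` and `DiskStatus` are hypothetical names, and treating each threshold as inclusive is an assumption. How the used and total byte counts are obtained is left to the caller.

```typescript
type DiskStatus = 'ok' | 'warning' | 'emergency';

// Maps disk usage onto the thresholds above:
// 80% triggers monitoring alerts, 90% triggers emergency cleanup.
function classifyDiskUsage(
  usedBytes: number,
  totalBytes: number,
  warningRatio: number = 0.8,
  emergencyRatio: number = 0.9,
): DiskStatus {
  if (totalBytes <= 0) {
    throw new RangeError('totalBytes must be positive');
  }
  const ratio = usedBytes / totalBytes;
  if (ratio >= emergencyRatio) return 'emergency';
  if (ratio >= warningRatio) return 'warning';
  return 'ok';
}
```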

## Best Practices

1. **Set Appropriate TTL**: Default 24h is good for most tasks
2. **Preserve Artifacts**: Always preserve important output files
3. **Monitor Orphans**: Check orphan detector logs regularly
4. **Review Size Limits**: Adjust per-workspace limits based on task requirements
5. **Regular Audits**: Use `--report` to monitor cleanup effectiveness
6. **Backup Metadata**: Periodically backup database for disaster recovery

## Related Documentation

- **Task P2-1.3**: Supervised Workspace Cleanup
- **Database Service**: `docs/DATABASE_SERVICE_GUIDE.md`
- **Backup Manager**: `docs/BACKUP_MANAGER_GUIDE.md`
- **File Operations**: `src/lib/file-operations.ts`

---

**Last Updated**: 2025-11-16
**Version**: 1.0.0
**Status**: Production Ready
