Files
bun.sh/CONTAINER_FIXES_ASSESSMENT.md
Claude 9feb527824 WIP: Migrate container spawn from unshare to clone3 (partial fix)
This commit replaces the broken unshare() approach with clone3() for
creating container namespaces. This fixes the immediate crash but the
implementation is NOT production ready.

What works:
- Basic namespace creation via clone3()
- PID namespace isolation (process sees itself as PID 1)
- PR_SET_PDEATHSIG properly set for cleanup
- No more errno conversion crashes

Critical issues remaining (see CONTAINER_FIXES_ASSESSMENT.md):
- User namespace UID/GID mapping broken (needs parent to write)
- No parent-child synchronization for setup stages
- Cgroup setup won't work (needs parent process to configure)
- Network namespace configuration incomplete
- Mount operations timing issues
- Silent fallback when clone3 fails

This is a step forward but needs significant additional work for
production use. The architecture needs parent-child coordination via
pipes/eventfd to properly sequence namespace configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-27 00:13:43 +02:00

3.9 KiB

Container Implementation - Clone3 Migration Assessment

What Was Done

Migrated from unshare() after vfork() to using clone3() to create namespaces atomically, avoiding TOCTOU issues.

Changes Made:

  1. bun-spawn.cpp: Added clone3() support for namespace creation
  2. spawn.zig: Added namespace_flags to spawn request
  3. process.zig: Calculate namespace flags from container options
  4. linux_container.zig: Removed unshare() calls

What Works

Basic PID namespace creation (with user namespace) PR_SET_PDEATHSIG is properly set Process sees itself as PID 1 in PID namespace Clean compile with no errors

Critical Issues - NOT Production Ready

1. User Namespace UID/GID Mapping Broken

  • Problem: Mappings are written from child process (won't work)
  • Required: Parent must write /proc/<pid>/uid_map after clone3()
  • Impact: User namespaces don't work properly

2. No Parent-Child Synchronization

  • Problem: No coordination between parent setup and child execution
  • Required: Pipe or eventfd for synchronization
  • Impact: Race conditions, child may exec before parent setup completes

3. Cgroup Setup Won't Work

  • Problem: Trying to set up cgroups from child process
  • Required: Parent must create cgroup and add child PID
  • Impact: Resource limits don't work

4. Network Namespace Config Broken

  • Problem: No proper veth pair creation or network setup
  • Required: Parent creates veth, child configures interface
  • Impact: Network isolation doesn't work beyond basic namespace

5. Mount Operations Timing Wrong

  • Problem: Mount operations happen at wrong time
  • Required: Child must mount after namespace entry but before exec
  • Impact: Filesystem isolation doesn't work

6. Silent Fallback on Error

  • Problem: Falls back to vfork without error when clone3 fails
  • Required: Should propagate error to user
  • Impact: User thinks container is working when it's not

Proper Architecture Needed

Parent Process                    Child Process
--------------                    -------------
clone3() ──────────────────────> (created in namespaces)
    │                                    │
    ├─ Write UID/GID mappings           │ 
    ├─ Create cgroups                   │
    ├─ Add child to cgroup              │
    ├─ Create veth pairs                │
    │                                    ├─ Wait for parent signal
    ├─ Signal child ────────────────────>│
    │                                    ├─ Setup mounts
    │                                    ├─ Configure network
    │                                    ├─ Apply limits
    │                                    └─ execve()
    └─ Return PID

Required for Production

  1. Implement parent-child synchronization (pipe or eventfd)
  2. Split setup into parent/child operations
  3. Fix UID/GID mapping (parent writes after clone3)
  4. Fix cgroup setup (parent creates and assigns)
  5. Implement proper network setup (veth pairs)
  6. Add error propagation from child to parent
  7. Add comprehensive tests for error cases
  8. Add fallback detection and proper error reporting
  9. Test on various kernel versions (clone3 availability)
  10. Add cleanup on failure paths

Recommendation

DO NOT MERGE in current state. This needs significant rework to be production-ready. The basic approach of using clone3() is correct, but the implementation needs proper parent-child coordination and split responsibilities.

Time Estimate for Proper Implementation

  • 2-3 days for proper architecture implementation
  • 1-2 days for comprehensive testing
  • 1 day for documentation and review prep

Total: ~1 week of focused development