# Erasure Coding Integration Tests This directory contains integration tests for the EC (Erasure Coding) encoding volume location timing bug fix. ## The Bug The bug caused **double storage usage** during EC encoding because: 1. **Silent failure**: Functions returned `nil` instead of proper error messages 2. **Timing race condition**: Volume locations were collected **AFTER** EC encoding when master metadata was already updated 3. **Missing cleanup**: Original volumes weren't being deleted after EC encoding This resulted in both original `.dat` files AND EC `.ec00-.ec13` files coexisting, effectively **doubling storage usage**. ## The Fix The fix addresses all three issues: 1. **Fixed silent failures**: Updated `doDeleteVolumes()` and `doEcEncode()` to return proper errors 2. **Fixed timing race condition**: Created `doDeleteVolumesWithLocations()` that uses pre-collected volume locations 3. **Enhanced cleanup**: Volume locations are now collected **BEFORE** EC encoding, preventing the race condition ## Integration Tests ### TestECEncodingVolumeLocationTimingBug The main integration test that: - **Simulates master timing race condition**: Tests what happens when volume locations are read from master AFTER EC encoding has updated the metadata - **Verifies fix effectiveness**: Checks for the "Collecting volume locations...before EC encoding" message that proves the fix is working - **Tests multi-server distribution**: Runs EC encoding with 6 volume servers to test shard distribution - **Validates cleanup**: Ensures original volumes are properly cleaned up after EC encoding ### TestECEncodingMasterTimingRaceCondition A focused test that specifically targets the **master metadata timing race condition**: - **Simulates the exact race condition**: Tests volume location collection timing relative to master metadata updates - **Detects timing fix**: Verifies that volume locations are collected BEFORE EC encoding starts - **Demonstrates bug impact**: Shows what happens when volume locations are unavailable after master metadata update ### TestECEncodingRegressionPrevention Regression tests that ensure: - **Function signatures**: Fixed functions still exist and return proper errors - **Timing patterns**: Volume location collection happens in the correct order ## Test Architecture The tests use: - **Real SeaweedFS cluster**: 1 master server + 6 volume servers - **Multi-server setup**: Tests realistic EC shard distribution across multiple servers - **Timing simulation**: Goroutines and delays to simulate race conditions - **Output validation**: Checks for specific log messages that prove the fix is working ## Why Integration Tests Were Necessary Unit tests could not catch this bug because: 1. **Race condition**: The bug only occurred in real-world timing scenarios 2. **Master-volume server interaction**: Required actual master metadata updates 3. **File system operations**: Needed real volume creation and EC shard generation 4. **Cleanup timing**: Required testing the sequence of operations in correct order The integration tests successfully catch the timing bug by: - **Testing real command execution**: Uses actual `ec.encode` shell command - **Simulating race conditions**: Creates timing scenarios that expose the bug - **Validating output messages**: Checks for the key "Collecting volume locations...before EC encoding" message - **Monitoring cleanup behavior**: Ensures original volumes are properly deleted ## Running the Tests ```bash # Run all integration tests go test -v # Run only the main timing test go test -v -run TestECEncodingVolumeLocationTimingBug # Run only the race condition test go test -v -run TestECEncodingMasterTimingRaceCondition # Skip integration tests (short mode) go test -v -short ``` ## Test Results **With the fix**: Shows "Collecting volume locations for N volumes before EC encoding..." message **Without the fix**: No collection message, potential timing race condition The tests demonstrate that the fix prevents the volume location timing bug that caused double storage usage in EC encoding operations.