Critical diagnostic: Our FSDataOutputStream.getPos() override is NOT being called!
Adding WARN logs to SeaweedFileSystemStore.createFile() to determine:
1. Is createFile() being called at all?
2. If it is, but the FSDataOutputStream override is not, then streams are
   being returned WITHOUT going through SeaweedFileSystem.create/append
3. That would explain why our position tracking fix has no effect
Hypothesis: SeaweedFileSystemStore.createFile() returns SeaweedHadoopOutputStream
directly, and it gets wrapped by something else (not our custom FSDataOutputStream).
INFO logs from the seaweed.hdfs package may be filtered out.
Changed all diagnostic logs to WARN level to match the
'PARQUET FILE WRITTEN' log, which DOES appear in test output.
This will definitively show:
1. Whether our code path is being used
2. Whether the getPos() override is being called
3. What position values are being returned
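The level-filtering effect can be sketched with java.util.logging from the
JDK. This is only a stand-in: the logging framework and configuration used by
the real test run are not shown here, and the logger name and class name below
are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class LogLevelSketch {
    // Returns the records that survive a WARNING-level threshold.
    static List<String> capture() {
        Logger log = Logger.getLogger("seaweed.hdfs.sketch"); // hypothetical logger name
        log.setUseParentHandlers(false);
        log.setLevel(Level.WARNING); // assumption: simulates a config that drops INFO

        List<String> seen = new ArrayList<>();
        Handler handler = new Handler() {
            @Override public void publish(LogRecord r) {
                seen.add(r.getLevel() + ": " + r.getMessage());
            }
            @Override public void flush() {}
            @Override public void close() {}
        };
        log.addHandler(handler);

        log.info("getPos() override called");    // below the threshold, dropped
        log.warning("getPos() override called"); // at the threshold, recorded

        log.removeHandler(handler);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(capture()); // prints [WARNING: getPos() override called]
    }
}
```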
Java compilation error:
- 'local variables referenced from an inner class must be final or effectively final'
- The 'path' variable was reassigned (path = qualify(path))
- That reassignment means it is no longer effectively final
Solution:
- Create 'final Path finalPath = path' after qualification
- Use finalPath in the anonymous FSDataOutputStream subclass
- Applied to both create() and append() methods
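A minimal sketch of the same pattern without Hadoop on the classpath. The
names are hypothetical: a String stands in for Path, qualify() for path
qualification, and Supplier for the anonymous FSDataOutputStream subclass.

```java
import java.util.function.Supplier;

class EffectivelyFinalSketch {
    // Hypothetical stand-in for path qualification.
    static String qualify(String path) {
        return path.startsWith("/") ? path : "/" + path;
    }

    static Supplier<String> open(String path) {
        path = qualify(path);          // reassignment: 'path' is no longer effectively final
        final String finalPath = path; // capture the qualified value in a final local

        // Referencing 'path' inside this anonymous class would not compile;
        // 'finalPath' is allowed.
        return new Supplier<String>() {
            @Override
            public String get() {
                return "stream for " + finalPath;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(open("data/part-0.parquet").get());
        // prints "stream for /data/part-0.parquet"
    }
}
```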
This will help determine:
1. If the anonymous FSDataOutputStream subclass is being created
2. If the getPos() override is actually being called by Parquet
3. What position value is being returned
If we see 'Creating FSDataOutputStream' but NOT 'getPos() override called',
it means FSDataOutputStream is using a different mechanism for position tracking.
If we don't see either log, it means the code path isn't being used at all.
CRITICAL FIX for Parquet 78-byte EOF error!
Root Cause Analysis:
- Hadoop's FSDataOutputStream tracks position with an internal counter
- It does NOT call SeaweedOutputStream.getPos() by default
- When Parquet writes data and calls getPos() to record column chunk offsets,
  it gets FSDataOutputStream's counter, not SeaweedOutputStream's actual position
- This creates a 78-byte mismatch between recorded offsets and actual file size
- Result: EOFException when reading (tries to read beyond file end)
The Fix:
- Override getPos() in the anonymous FSDataOutputStream subclass
- Delegate to SeaweedOutputStream.getPos(), which returns 'position + buffer.position()'
- This ensures Parquet gets the correct position when recording metadata
- Column chunk offsets in footer will now match actual data positions
This should fix the consistent 78-byte discrepancy we've been seeing across
all Parquet file writes (regardless of file size: 684, 693, 1275 bytes, etc.)
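The delegation idea can be sketched with plain JDK classes. Sink and
CountingWrapper are hypothetical stand-ins for SeaweedOutputStream and a
wrapper that keeps its own counter; the sketch does not reproduce Hadoop's
FSDataOutputStream internals, and the starting offset of 100 is arbitrary.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class GetPosSketch {
    /** Hypothetical stand-in for SeaweedOutputStream. */
    static class Sink extends OutputStream {
        private final long position; // bytes already flushed before this write
        private int buffered;        // bytes sitting in the write buffer

        Sink(long startPosition) { this.position = startPosition; }

        @Override public void write(int b) { buffered++; }

        long getPos() { return position + buffered; }
    }

    /** Hypothetical wrapper that tracks position with its own counter. */
    static class CountingWrapper extends OutputStream {
        protected final Sink sink;
        private long counter;

        CountingWrapper(Sink sink) { this.sink = sink; }

        @Override public void write(int b) throws IOException {
            sink.write(b);
            counter++;
        }

        long getPos() { return counter; }
    }

    static long[] demo() {
        CountingWrapper plain = new CountingWrapper(new Sink(100));
        // Anonymous subclass that delegates getPos() to the underlying stream.
        CountingWrapper delegating = new CountingWrapper(new Sink(100)) {
            @Override long getPos() { return sink.getPos(); }
        };
        try {
            byte[] data = new byte[10];
            plain.write(data);
            delegating.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] { plain.getPos(), delegating.getPos() };
    }

    public static void main(String[] args) {
        long[] pos = demo();
        System.out.println("wrapper counter: " + pos[0]); // prints "wrapper counter: 10"
        System.out.println("delegated:       " + pos[1]); // prints "delegated:       110"
    }
}
```

The two values differ whenever the underlying stream's position is not equal
to the number of bytes the wrapper itself has counted.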
Added comprehensive logging to identify why Parquet files fail with
'EOFException: Still have: 78 bytes left'.
Key additions:
1. SeaweedHadoopOutputStream constructor logging with 🔧 marker
   - Shows when output streams are created
   - Logs path, position, bufferSize, replication
2. totalBytesWritten counter in SeaweedOutputStream
   - Tracks cumulative bytes written via write() calls
   - Helps identify if Parquet wrote 762 bytes but only 684 reached chunks
3. Enhanced close() logging with 🔒 and ✅ markers
   - Shows totalBytesWritten vs position vs buffer.position()
   - If totalBytesWritten=762 but position=684, write submission failed
   - If buffer.position()=78 at close, buffer wasn't flushed
Expected scenarios in next run:
A) Stream never created → No 🔧 log for .parquet files
B) Write failed → totalBytesWritten=762 but position=684
C) Buffer not flushed → buffer.position()=78 at close
D) All correct → totalBytesWritten=position=684, but Parquet expects 762
This will pinpoint whether the issue is in:
- Stream creation/lifecycle
- Write submission
- Buffer flushing
- Or Parquet's internal state
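A minimal sketch of the accounting described above, assuming a simplified
stream with a single ByteBuffer. TrackedStream, flush() and closeReport() are
hypothetical names, not the SeaweedOutputStream API, and the 684-byte buffer
size is chosen only so that the numbers line up with scenario C.

```java
import java.nio.ByteBuffer;

class WriteAccountingSketch {
    static class TrackedStream {
        final ByteBuffer buffer;
        long position;          // bytes handed off to chunk storage
        long totalBytesWritten; // bytes received via write()

        TrackedStream(int bufferSize) {
            buffer = ByteBuffer.allocate(bufferSize);
        }

        void write(byte[] data) {
            totalBytesWritten += data.length;
            int off = 0;
            while (off < data.length) {
                int n = Math.min(buffer.remaining(), data.length - off);
                buffer.put(data, off, n);
                off += n;
                if (!buffer.hasRemaining()) {
                    flush(); // buffer full: hand it off
                }
            }
        }

        void flush() {
            position += buffer.position(); // pretend the buffered bytes were uploaded
            buffer.clear();
        }

        // Mirrors the close() log line: the three counters side by side.
        String closeReport(boolean flushFirst) {
            if (flushFirst) {
                flush();
            }
            return "totalBytesWritten=" + totalBytesWritten
                    + " position=" + position
                    + " buffer.position()=" + buffer.position();
        }
    }

    public static void main(String[] args) {
        TrackedStream s = new TrackedStream(684);
        s.write(new byte[762]);
        System.out.println(s.closeReport(false));
        // prints "totalBytesWritten=762 position=684 buffer.position()=78"
        System.out.println(s.closeReport(true));
        // prints "totalBytesWritten=762 position=762 buffer.position()=0"
    }
}
```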
The persistent EOFException shows that Parquet expects 78 more bytes than exist.
This suggests a mismatch between what was written and what is in chunks.
Added logging to track:
1. Buffer state at close (position before flush)
2. Stream position when flushing metadata
3. Chunk count vs file size in attributes
4. Explicit fileSize setting from stream position
Key hypothesis:
- Parquet writes N bytes total (e.g., 762)
- Stream.position tracks all writes
- But only (N-78) bytes end up in chunks
- This causes Parquet read to fail with 'Still have: 78 bytes left'
If buffer.position() = 78 at close, the buffer wasn't flushed.
If position != chunk total, write submission failed.
If attr.fileSize != position, metadata is inconsistent.
Next run will show which scenario is happening.
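The three checks above can be written as one small helper that is applied to
the values from the close() log. The method and parameter names are
hypothetical, and the numbers in main() are illustrative values taken from
the figures quoted in this message, not output from a real run.

```java
class CloseStateCheck {
    // Applies the checks in the order listed above and names the first mismatch.
    static String diagnose(long position, int bufferPosition,
                           long chunkTotal, long attrFileSize) {
        if (bufferPosition > 0) {
            return "buffer not flushed";
        }
        if (position != chunkTotal) {
            return "write submission failed";
        }
        if (attrFileSize != position) {
            return "metadata inconsistent";
        }
        return "consistent";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(684, 78, 684, 684));  // prints "buffer not flushed"
        System.out.println(diagnose(762, 0, 684, 762));   // prints "write submission failed"
        System.out.println(diagnose(762, 0, 762, 684));   // prints "metadata inconsistent"
        System.out.println(diagnose(762, 0, 762, 762));   // prints "consistent"
    }
}
```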
ROOT CAUSE: Maven was downloading seaweedfs-client:3.80 from Maven Central
instead of using the locally built version in CI!
Changes:
- Changed all versions from 3.80 to 3.80.1-SNAPSHOT
  - other/java/client/pom.xml: 3.80 → 3.80.1-SNAPSHOT
  - other/java/hdfs2/pom.xml: property 3.80 → 3.80.1-SNAPSHOT
  - other/java/hdfs3/pom.xml: property 3.80 → 3.80.1-SNAPSHOT
  - test/java/spark/pom.xml: property 3.80 → 3.80.1-SNAPSHOT
Maven behavior:
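- Release versions (3.80): Downloaded from remote repos such as Maven Central
  when that version is published there
- SNAPSHOT versions (3.80.1-SNAPSHOT): Prefer the locally built artifact, and
  can be refreshed on later builds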
This ensures the CI uses the locally built JARs with our debug logging!
Also added unique [DEBUG-2024] markers so the logs show whether the locally
built JARs are in use.