## Bug Description

When using undici's `Pool.request()` and iterating over `response.body` with `setEncoding('utf8')`, multi-byte UTF-8 characters (specifically 3-byte CJK characters) that span chunk boundaries are replaced with U+FFFD (the replacement character).
This does NOT occur with:

- Node.js built-in `https` module's `setEncoding('utf8')` on the same endpoint
- Collecting raw Buffers and calling `Buffer.concat().toString('utf8')` on the same undici response
## Reproducible By
Verified on a production server (Node v24.14.1, undici 7.15.0) against an Elasticsearch endpoint returning ~40KB JSON containing Chinese text.
All three tests run in the same Node.js process, against the same endpoint, returning the same data:
```js
// Run as an ES module (.mjs) so top-level await is available.
import { Pool } from 'undici';
import https from 'node:https';

const pool = new Pool('https://your-elasticsearch-host');
const requestOpts = {
  path: '/your-index/_search',
  method: 'POST',
  headers: {
    'Authorization': 'ApiKey YOUR_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    query: { match_all: {} },
    size: 10
  })
};

// ❌ BROKEN: undici + setEncoding('utf8')
const r1 = await pool.request(requestOpts);
let str = '';
r1.body.setEncoding('utf8');
for await (const chunk of r1.body) { str += chunk; }
console.log('undici setEncoding FFFD:', (str.match(/\ufffd/g) || []).length);
// Output: 10

// ✅ OK: undici + Buffer.concat
const r2 = await pool.request(requestOpts);
const bufs = [];
for await (const chunk of r2.body) {
  bufs.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
}
const txt = Buffer.concat(bufs).toString('utf8');
console.log('undici Buffer.concat FFFD:', (txt.match(/\ufffd/g) || []).length);
// Output: 0

// ✅ OK: Node.js https + setEncoding('utf8')
const httpsResult = await new Promise((resolve) => {
  const url = new URL('https://your-elasticsearch-host/your-index/_search');
  const req = https.request(url, {
    method: 'POST',
    headers: requestOpts.headers
  }, (res) => {
    let s = '';
    res.setEncoding('utf8');
    res.on('data', (c) => { s += c; });
    res.on('end', () => resolve(s));
  });
  req.write(requestOpts.body);
  req.end();
});
console.log('https setEncoding FFFD:', (httpsResult.match(/\ufffd/g) || []).length);
// Output: 0
```
The corrupted characters are consistently 3-byte UTF-8 CJK characters (e.g., U+50B3 傳 = bytes `e5 82 b3`) that fall on chunk boundaries. The corruption is deterministic: the same request always produces U+FFFD at the same positions.
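For reference, the failure mode is reproducible offline with a minimal sketch (this illustrates the symptom, not necessarily undici's internal code path): decoding each chunk independently corrupts any multi-byte character split across the boundary, while decoding the reassembled bytes does not.

```js
// 傳 (U+50B3) is 3 bytes in UTF-8: e5 82 b3.
const full = Buffer.from('傳', 'utf8');
const chunk1 = full.subarray(0, 2); // e5 82 — incomplete sequence
const chunk2 = full.subarray(2);    // b3    — lone continuation byte

// Per-chunk decoding: each incomplete sequence becomes U+FFFD.
const naive = chunk1.toString('utf8') + chunk2.toString('utf8');
console.log(naive.includes('\ufffd')); // true — corrupted

// Decoding the reassembled bytes is lossless.
const joined = Buffer.concat([chunk1, chunk2]).toString('utf8');
console.log(joined); // 傳
```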
## Expected Behavior

`setEncoding('utf8')` on an undici response body should produce output identical to `Buffer.concat().toString('utf8')`. The internal `StringDecoder` should buffer incomplete multi-byte sequences across chunks, as Node.js's built-in `https` module does.
## Logs & Screenshots

```text
# Same request, same data, same Node.js process:
undici Pool + setEncoding('utf8'):   10 FFFD  ← broken
undici Pool + Buffer.concat:          0 FFFD  ← correct
Node.js https + setEncoding('utf8'):  0 FFFD  ← correct
```
## Environment

- OS: Alpine Linux (Docker `node:24-alpine`)
- Node.js: v24.14.1
- undici: 7.15.0
- Upstream server: Elasticsearch 9.2.0 (chunked transfer-encoding, `application/json; charset=utf-8`)
## Additional context

This issue was discovered through the `@elastic/elasticsearch` Node.js client (v9.1.1), which uses `@elastic/transport` (v9.1.2). The transport layer calls `response.body.setEncoding('utf8')` in `UndiciConnection.js`, so all JSON responses containing CJK characters are silently corrupted (roughly 1 character per ~4KB of response body).

Downstream impact: any application that uses undici with `setEncoding('utf8')` for non-ASCII text is affected.
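Until this is fixed, one possible workaround (a sketch, equivalent to the passing `Buffer.concat` case above) is to skip `setEncoding()` entirely and decode the reassembled bytes once, so a chunk boundary can never split a multi-byte sequence:

```js
// Collect raw Buffers from any async-iterable body and decode once.
async function collectUtf8(body) {
  const chunks = [];
  for await (const chunk of body) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  // A single decode over the full payload cannot split a character.
  return Buffer.concat(chunks).toString('utf8');
}

// Offline demonstration with a fake body that splits 送 mid-sequence:
async function demo() {
  const full = Buffer.from('傳送', 'utf8');
  async function* fakeBody() {
    yield full.subarray(0, 4);
    yield full.subarray(4);
  }
  console.log(await collectUtf8(fakeBody())); // 傳送
}
demo();
```

With undici this would be used as `const text = await collectUtf8((await pool.request(requestOpts)).body);` in place of the `setEncoding('utf8')` loop.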