
setEncoding('utf8') on response body corrupts multi-byte UTF-8 characters at chunk boundaries #5002

@joecwu

Description

Bug Description

When using undici's Pool.request() and iterating over response.body with setEncoding('utf8'), multi-byte UTF-8 characters (specifically 3-byte CJK characters) that span chunk boundaries are replaced with U+FFFD (replacement character).

This does NOT occur with:

  • Node.js built-in https module's setEncoding('utf8') on the same endpoint
  • Collecting raw Buffers and calling Buffer.concat().toString('utf8') on the same undici response

Reproducible By

Verified on a production server (Node v24.14.1, undici 7.15.0) against an Elasticsearch endpoint returning ~40KB JSON containing Chinese text.

All three tests run in the same Node.js process, against the same endpoint, returning the same data:

// Run as an ES module so top-level await is valid; in CommonJS, wrap in an async function.
import { Pool } from 'undici';
import https from 'node:https';

const pool = new Pool('https://your-elasticsearch-host');
const requestOpts = {
  path: '/your-index/_search',
  method: 'POST',
  headers: {
    'Authorization': 'ApiKey YOUR_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    query: { match_all: {} },
    size: 10
  })
};

// ❌ BROKEN: undici + setEncoding('utf8')
const r1 = await pool.request(requestOpts);
let str = '';
r1.body.setEncoding('utf8');
for await (const chunk of r1.body) { str += chunk; }
console.log('undici setEncoding FFFD:', (str.match(/\ufffd/g) || []).length);
// Output: 10

// ✅ OK: undici + Buffer.concat
const r2 = await pool.request(requestOpts);
const bufs = [];
for await (const chunk of r2.body) {
  bufs.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
}
const txt = Buffer.concat(bufs).toString('utf8');
console.log('undici Buffer.concat FFFD:', (txt.match(/\ufffd/g) || []).length);
// Output: 0

// ✅ OK: Node.js https + setEncoding('utf8')
const httpsResult = await new Promise((resolve) => {
  const url = new URL('https://your-elasticsearch-host/your-index/_search');
  const req = https.request(url, {
    method: 'POST',
    headers: requestOpts.headers
  }, (res) => {
    let s = '';
    res.setEncoding('utf8');
    res.on('data', (c) => { s += c; });
    res.on('end', () => resolve(s));
  });
  req.write(requestOpts.body);
  req.end();
});
console.log('https setEncoding FFFD:', (httpsResult.match(/\ufffd/g) || []).length);
// Output: 0

The corrupted characters are consistently 3-byte UTF-8 CJK characters (e.g., U+50B3 = bytes e5 82 b3) that fall on chunk boundaries. The corruption is deterministic — same request always produces FFFD at the same positions.

Expected Behavior

setEncoding('utf8') on an undici response body should produce output identical to Buffer.concat().toString('utf8'). The internal StringDecoder should buffer incomplete multi-byte sequences across chunk boundaries, as Node.js's built-in https module does.

Logs & Screenshots

# Same request, same data, same Node.js process:
undici Pool + setEncoding('utf8'):     10 FFFD  ← broken
undici Pool + Buffer.concat:            0 FFFD  ← correct
Node.js https + setEncoding('utf8'):    0 FFFD  ← correct

Environment

  • OS: Alpine Linux (Docker node:24-alpine)
  • Node.js: v24.14.1
  • undici: 7.15.0
  • Upstream server: Elasticsearch 9.2.0 (chunked transfer-encoding, application/json; charset=utf-8)

Additional context

This issue was discovered through the @elastic/elasticsearch Node.js client (v9.1.1), which uses @elastic/transport (v9.1.2). The transport layer calls response.body.setEncoding('utf8') in UndiciConnection.js, causing all JSON responses containing CJK characters to be silently corrupted (~1 character per ~4KB of response body).

Downstream impact: any application using undici with setEncoding('utf8') for non-ASCII text is affected.
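Until this is fixed, affected callers can avoid setEncoding() entirely. Besides the Buffer.concat approach above, a streaming workaround is to decode incrementally with the WHATWG TextDecoder, whose { stream: true } mode carries incomplete byte sequences across chunks. A sketch (readBodyUtf8 is a hypothetical helper name, not part of undici):

```javascript
// Hypothetical helper (not part of undici): decode a response body stream
// to UTF-8 without setEncoding(). TextDecoder with { stream: true }
// retains a trailing partial multi-byte sequence between chunks.
async function readBodyUtf8(body) {
  const decoder = new TextDecoder('utf-8');
  let out = '';
  for await (const chunk of body) {
    out += decoder.decode(chunk, { stream: true });
  }
  out += decoder.decode(); // flush any buffered partial sequence
  return out;
}

// Usage sketch: const text = await readBodyUtf8(response.body);
```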
