perf: replace Uint8Array lookup tables with regex in buildUrl #345

Open
usualoma wants to merge 1 commit into v2 from perf-v2-simplify-url

@usualoma
Member

I made things quite complicated in #310, but in this case, regular expressions seem to be (slightly) faster.

Since this also reduces the amount of code, I’d like to include this refactoring at the end of v2.

benchmark

  • Current implementation
  • Several regular expressions (this PR)
  • A version using Uint8Array in part

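I compared them, and this PR’s approach was the fastest end to end (3.52 µs/iter versus 4.19 µs/iter for the current implementation and 3.73 µs/iter for the hybrid), even though the regex is slower in the host-only group (236.52 ns/iter versus 150.01 ns/iter). The code is simpler, too.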
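As a rough illustration of how the three regexes split inputs between the fast path and the `new URL()` fallback, here is a minimal sketch. The regexes are copied from the benchmark file in this PR; the `classify` helper and the sample inputs are hypothetical and only mirror the branch order of `buildUrlNew`.

```javascript
// Regexes copied verbatim from the benchmark file in this PR.
const reValidRequestUrl = /^\/[!#$&-;=?-\[\]_a-z~]*$/
const reDotSegment = /\/\.\.?(?:[/?#]|$)/
const reValidHost = /^[a-z0-9._-]+(?::(?:[1-5]\d{3,4}|[6-9]\d{3}))?$/

// Hypothetical helper (not part of the PR): names the branch that
// buildUrlNew would take for a given host and request URL.
function classify(host, incomingUrl) {
  if (!reValidHost.test(host)) return 'fallback: host'
  if (incomingUrl.length === 0) return 'fast: append slash'
  if (incomingUrl.charCodeAt(0) !== 0x2f) return 'invalid'
  if (!reValidRequestUrl.test(incomingUrl)) return 'fallback: url chars'
  if (reDotSegment.test(incomingUrl)) return 'fallback: dot segment'
  return 'fast'
}

// Illustrative inputs only; they are not taken from the benchmark data.
const samples = [
  ['example.com:8080', '/path?key=value'], // plain host, port, and path
  ['Example.com', '/'], // uppercase letter in the host
  ['example.com:80', '/'], // port with fewer than 4 digits
  ['example.com:70000', '/'], // 5-digit port starting with 6-9
  ['example.com', '/a/../b'], // ".." path segment
  ['example.com', '/a/.hidden'], // leading dot, but not a dot segment
  ['example.com', '/caf\u00e9'], // non-ASCII character
  ['example.com', '/a%20b'], // "%" is outside the allowed set
  ['example.com', 'path'], // missing leading slash
]

for (const [host, url] of samples) {
  console.log(JSON.stringify(host), JSON.stringify(url), '->', classify(host, url))
}
```

Only the first and sixth samples stay on the fast path. Every other sample either goes to `new URL()` or, for the last one, is reported as `'invalid'`.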

benchmark                                    avg (min … max) p75 / p99    (min … top 1%)
------------------------------------------------------------ -------------------------------
• host validation (fast path)
------------------------------------------------------------ -------------------------------
Old: Uint8Array + charCodeAt loop             150.01 ns/iter 152.59 ns       █
                                     (132.95 ns … 200.40 ns) 181.89 ns       █▄
                                     (  0.10  b … 197.16  b)   0.46  b ▁▂▆▃▅███▆▅▄▅▃▂▁▂▁▂▂▂▁

New: regex (reValidHost)                      236.52 ns/iter 244.36 ns  █
                                     (222.97 ns … 429.22 ns) 293.46 ns ▆█▆
                                     (  0.10  b … 197.61  b)   0.56  b ███▇▅▅▇▇▇▃▂▂▁▂▂▁▁▁▁▁▁

Hybrid: Uint8Array host + regex URL           268.48 ns/iter 273.12 ns   █
                                     (233.17 ns … 537.30 ns) 409.20 ns  ▂██▄
                                     (  0.10  b … 377.06  b)   1.43  b ▃████▆▅▃▄▂▂▁▂▂▁▂▁▁▁▁▁

• URL validation (fast path)
------------------------------------------------------------ -------------------------------
Old: Uint8Array + charCodeAt loop             416.14 ns/iter 424.17 ns  █
                                     (397.24 ns … 518.66 ns) 471.43 ns ███▄▅  ▃
                                     (248.43  b … 621.62  b) 257.87  b █████████▆▅▂▄▃▃▂▂▃▂▂▂

New: regex (reValidRequestUrl + reDotSegment) 392.50 ns/iter 421.70 ns   ▇█
                                     (362.21 ns … 485.88 ns) 439.09 ns   ██▄           ▅▄
                                     ( 70.85  b … 505.59  b) 255.75  b ▃████▄▅▄▃▄▃▄▂▃▁▆██▅▃▁

Hybrid: Uint8Array host + regex URL           379.04 ns/iter 386.05 ns   █
                                     (351.70 ns … 525.60 ns) 455.40 ns  ▆█▇
                                     (248.32  b … 545.37  b) 257.46  b ▄████▇▅▄▂▂▂▁▅▆▅▁▂▂▂▁▁

• buildUrl end-to-end (host × URL)
------------------------------------------------------------ -------------------------------
Old: Uint8Array + charCodeAt loops              4.19 µs/iter   4.29 µs      ▂    ▂█▂██
                                         (3.80 µs … 4.94 µs)   4.52 µs    ▅ █▅  ▅█████   ▅
                                     (  0.09  b …   0.11  b)   0.10  b ▇▇▇█▁██▇▁██████▇▇▁█▁▇

New: regex                                      3.52 µs/iter   3.62 µs  ██        ▄
                                         (3.30 µs … 3.90 µs)   3.88 µs  ████      █ ▅     █
                                     (  0.10  b …   0.11  b)   0.10  b ██████▅▅▅▅▅█▁█▁▁▅▁▅█▅

Hybrid: Uint8Array host + regex URL             3.73 µs/iter   3.88 µs                  ▄█
                                         (3.47 µs … 3.95 µs)   3.93 µs   ▅▅▅  ▅         ██ ▅
                                     (  0.10  b …   0.11  b)   0.10  b █▁████▅█▁▅▁██▅▅▅▅████
url-buildurl.mjs
// Benchmark: buildUrl — Uint8Array lookup tables vs regex
//
// Compares the two approaches from commit 1441383, plus a hybrid of them:
//   - Old: Uint8Array + charCodeAt loops for host & URL validation
//   - New: Precompiled regex (reValidHost, reValidRequestUrl, reDotSegment)
//   - Hybrid: Uint8Array + charCodeAt loop for the host, regex for the URL
//
// Usage: node benchmarks/url-buildurl.mjs

import { bench, group, run } from 'mitata'

// ============================================================
// Test data
// ============================================================
const hosts = [
  'localhost',
  'localhost:3000',
  'example.com',
  'example.com:8080',
  'my-app.example.com:4567',
  'sub.domain.example.co.jp:12345',
  'a',
  'my_host.local:1234',
]

const incomingUrls = [
  '/',
  '/path/to/resource',
  '/path?key=value&foo=bar',
  '/api/v2/users/123/posts/456/comments?page=1&limit=20&sort=created_at#section',
  '/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z',
  '/~user/path_name/file-name.html',
  '/assets/js/app.min.js?v=1234567890',
  '/search?q=hello+world&lang=en&page=1',
]

// ============================================================
// Old: Uint8Array + charCodeAt loops
// ============================================================
const allowedRequestUrlChar = new Uint8Array(128)
for (let c = 0x30; c <= 0x39; c++) allowedRequestUrlChar[c] = 1
for (let c = 0x41; c <= 0x5a; c++) allowedRequestUrlChar[c] = 1
for (let c = 0x61; c <= 0x7a; c++) allowedRequestUrlChar[c] = 1
{
  const chars = "-./:?#[]@!$&'()*+,;=~_"
  for (let i = 0; i < chars.length; i++) allowedRequestUrlChar[chars.charCodeAt(i)] = 1
}

const safeHostChar = new Uint8Array(128)
for (let c = 0x30; c <= 0x39; c++) safeHostChar[c] = 1
for (let c = 0x61; c <= 0x7a; c++) safeHostChar[c] = 1
{
  const chars = '.-_:'
  for (let i = 0; i < chars.length; i++) safeHostChar[chars.charCodeAt(i)] = 1
}

const isPathDelimiter = (c) => c === 0x2f || c === 0x3f || c === 0x23

function hasDotSegment(url, dotIndex) {
  const prev = dotIndex === 0 ? 0x2f : url.charCodeAt(dotIndex - 1)
  if (prev !== 0x2f) return false
  const nextIndex = dotIndex + 1
  if (nextIndex === url.length) return true
  const next = url.charCodeAt(nextIndex)
  if (isPathDelimiter(next)) return true
  if (next !== 0x2e) return false
  const nextNextIndex = dotIndex + 2
  if (nextNextIndex === url.length) return true
  return isPathDelimiter(url.charCodeAt(nextNextIndex))
}

function buildUrlOld(scheme, host, incomingUrl) {
  const url = `${scheme}://${host}${incomingUrl}`

  let needsHostValidationByURL = false
  for (let i = 0, len = host.length; i < len; i++) {
    const c = host.charCodeAt(i)
    if (c > 0x7f || safeHostChar[c] === 0) {
      needsHostValidationByURL = true
      break
    }
    if (c === 0x3a) {
      i++
      const firstDigit = host.charCodeAt(i)
      if (
        firstDigit < 0x31 ||
        firstDigit > 0x39 ||
        i + 4 > len ||
        i + (firstDigit < 0x36 ? 5 : 4) < len
      ) {
        needsHostValidationByURL = true
        break
      }
      for (; i < len; i++) {
        const c = host.charCodeAt(i)
        if (c < 0x30 || c > 0x39) {
          needsHostValidationByURL = true
          break
        }
      }
    }
  }

  if (needsHostValidationByURL) {
    return new URL(url).href
  } else if (incomingUrl.length === 0) {
    return url + '/'
  } else {
    if (incomingUrl.charCodeAt(0) !== 0x2f) {
      return 'invalid'
    }
    for (let i = 1, len = incomingUrl.length; i < len; i++) {
      const c = incomingUrl.charCodeAt(i)
      if (
        c > 0x7f ||
        allowedRequestUrlChar[c] === 0 ||
        (c === 0x2e && hasDotSegment(incomingUrl, i))
      ) {
        return new URL(url).href
      }
    }
    return url
  }
}

// ============================================================
// New: Precompiled regex
// ============================================================
const reValidRequestUrl = /^\/[!#$&-;=?-\[\]_a-z~]*$/
const reDotSegment = /\/\.\.?(?:[/?#]|$)/
const reValidHost = /^[a-z0-9._-]+(?::(?:[1-5]\d{3,4}|[6-9]\d{3}))?$/

function buildUrlNew(scheme, host, incomingUrl) {
  const url = `${scheme}://${host}${incomingUrl}`

  if (!reValidHost.test(host)) {
    return new URL(url).href
  } else if (incomingUrl.length === 0) {
    return url + '/'
  } else {
    if (incomingUrl.charCodeAt(0) !== 0x2f) {
      return 'invalid'
    }
    if (!reValidRequestUrl.test(incomingUrl) || reDotSegment.test(incomingUrl)) {
      return new URL(url).href
    }
    return url
  }
}

// ============================================================
// Hybrid: Uint8Array host + regex URL
// ============================================================
function buildUrlHybrid(scheme, host, incomingUrl) {
  const url = `${scheme}://${host}${incomingUrl}`

  let needsHostValidationByURL = false
  for (let i = 0, len = host.length; i < len; i++) {
    const c = host.charCodeAt(i)
    if (c > 0x7f || safeHostChar[c] === 0) {
      needsHostValidationByURL = true
      break
    }
    if (c === 0x3a) {
      i++
      const firstDigit = host.charCodeAt(i)
      if (
        firstDigit < 0x31 ||
        firstDigit > 0x39 ||
        i + 4 > len ||
        i + (firstDigit < 0x36 ? 5 : 4) < len
      ) {
        needsHostValidationByURL = true
        break
      }
      for (; i < len; i++) {
        const c = host.charCodeAt(i)
        if (c < 0x30 || c > 0x39) {
          needsHostValidationByURL = true
          break
        }
      }
    }
  }

  if (needsHostValidationByURL) {
    return new URL(url).href
  } else if (incomingUrl.length === 0) {
    return url + '/'
  } else {
    if (incomingUrl.charCodeAt(0) !== 0x2f) {
      return 'invalid'
    }
    if (!reValidRequestUrl.test(incomingUrl) || reDotSegment.test(incomingUrl)) {
      return new URL(url).href
    }
    return url
  }
}

// ============================================================
// Correctness check
// ============================================================
const scheme = 'https'
for (const host of hosts) {
  for (const incoming of incomingUrls) {
    const old = buildUrlOld(scheme, host, incoming)
    const nw = buildUrlNew(scheme, host, incoming)
    const hyb = buildUrlHybrid(scheme, host, incoming)
    if (old !== nw || old !== hyb) {
      console.error(`MISMATCH host="${host}" url="${incoming}": old=${old} new=${nw} hybrid=${hyb}`)
      process.exit(1)
    }
  }
}
console.log('Correctness check passed.\n')

// ============================================================
// Benchmark
// ============================================================
group('host validation (fast path)', () => {
  bench('Old: Uint8Array + charCodeAt loop', () => {
    for (const host of hosts) buildUrlOld(scheme, host, '/')
  })
  bench('New: regex (reValidHost)', () => {
    for (const host of hosts) buildUrlNew(scheme, host, '/')
  })
  bench('Hybrid: Uint8Array host + regex URL', () => {
    for (const host of hosts) buildUrlHybrid(scheme, host, '/')
  })
})

group('URL validation (fast path)', () => {
  bench('Old: Uint8Array + charCodeAt loop', () => {
    for (const url of incomingUrls) buildUrlOld(scheme, 'localhost:3000', url)
  })
  bench('New: regex (reValidRequestUrl + reDotSegment)', () => {
    for (const url of incomingUrls) buildUrlNew(scheme, 'localhost:3000', url)
  })
  bench('Hybrid: Uint8Array host + regex URL', () => {
    for (const url of incomingUrls) buildUrlHybrid(scheme, 'localhost:3000', url)
  })
})

group('buildUrl end-to-end (host × URL)', () => {
  bench('Old: Uint8Array + charCodeAt loops', () => {
    for (const host of hosts) {
      for (const url of incomingUrls) buildUrlOld(scheme, host, url)
    }
  })
  bench('New: regex', () => {
    for (const host of hosts) {
      for (const url of incomingUrls) buildUrlNew(scheme, host, url)
    }
  })
  bench('Hybrid: Uint8Array host + regex URL', () => {
    for (const host of hosts) {
      for (const url of incomingUrls) buildUrlHybrid(scheme, host, url)
    }
  })
})

await run()
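
The correctness check in the benchmark only exercises inputs that stay on the fast path. Below is a minimal sketch of a broader equivalence check, assuming the tables and regexes are exactly those in the benchmark file. It compares the URL character table with the `reValidRequestUrl` character class for every ASCII code, and the host loop with `reValidHost` for every port from 0 to 99999. The helper names `hostPassesLoop`, `findCharMismatches`, and `findPortMismatches` are hypothetical, and `hostPassesLoop` is the host loop from `buildUrlOld` rewritten with early returns.

```javascript
// Lookup tables rebuilt the same way as in the benchmark's "Old" section.
const allowedRequestUrlChar = new Uint8Array(128)
const safeHostChar = new Uint8Array(128)
for (let c = 0x30; c <= 0x39; c++) allowedRequestUrlChar[c] = safeHostChar[c] = 1
for (let c = 0x41; c <= 0x5a; c++) allowedRequestUrlChar[c] = 1
for (let c = 0x61; c <= 0x7a; c++) allowedRequestUrlChar[c] = safeHostChar[c] = 1
for (const ch of "-./:?#[]@!$&'()*+,;=~_") allowedRequestUrlChar[ch.charCodeAt(0)] = 1
for (const ch of '.-_:') safeHostChar[ch.charCodeAt(0)] = 1

// Regexes copied verbatim from the benchmark's "New" section.
const reValidRequestUrl = /^\/[!#$&-;=?-\[\]_a-z~]*$/
const reValidHost = /^[a-z0-9._-]+(?::(?:[1-5]\d{3,4}|[6-9]\d{3}))?$/

// Hypothetical helper: the host loop from buildUrlOld with early returns.
// Returns true when the host would stay on the fast path.
function hostPassesLoop(host) {
  for (let i = 0, len = host.length; i < len; i++) {
    const c = host.charCodeAt(i)
    if (c > 0x7f || safeHostChar[c] === 0) return false
    if (c === 0x3a) {
      i++
      const firstDigit = host.charCodeAt(i)
      if (
        firstDigit < 0x31 ||
        firstDigit > 0x39 ||
        i + 4 > len ||
        i + (firstDigit < 0x36 ? 5 : 4) < len
      ) {
        return false
      }
      for (; i < len; i++) {
        const d = host.charCodeAt(i)
        if (d < 0x30 || d > 0x39) return false
      }
    }
  }
  return true
}

// Every ASCII code where the table and the regex class disagree.
function findCharMismatches() {
  const mismatches = []
  for (let c = 0; c < 128; c++) {
    const byTable = allowedRequestUrlChar[c] === 1
    const byRegex = reValidRequestUrl.test('/' + String.fromCharCode(c))
    if (byTable !== byRegex) mismatches.push(c)
  }
  return mismatches
}

// Every "localhost:<port>" where the loop and the regex disagree.
function findPortMismatches() {
  const mismatches = []
  for (let port = 0; port <= 99999; port++) {
    const host = 'localhost:' + port
    if (hostPassesLoop(host) !== reValidHost.test(host)) mismatches.push(host)
  }
  return mismatches
}

console.log('char mismatches:', findCharMismatches())
console.log('port mismatches:', findPortMismatches())
```

Under these assumptions both lists should come back empty. The sketch covers only single characters and decimal ports without leading zeros, so it complements the benchmark's check but does not prove the two implementations agree on every host.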

@usualoma
Member Author

Hi @yusukebe
Would you mind reviewing this?
