-w / \b treat non-ASCII bytes as word characters, unlike GNU in the C locale #64

Description

@sylvestre

grep - non-ASCII bytes are treated as word characters for -w/\b, unlike GNU in the C locale

Found by the differential fuzzer with args grep ["-e", "", "-F", "-w"] on input
lines containing non-ASCII bytes (e.g. 0xC3).

GNU grep run under LC_ALL=C classifies only ASCII [A-Za-z0-9_] as word-constituent
characters; every byte ≥ 0x80 is a non-word character. uu_grep decides word membership
with oniguruma's UTF-8 ONIGENC_CTYPE_WORD classifier, which counts Unicode letters/digits
(and any high byte it decodes) as word characters. This changes where word boundaries fall,
so -w (and the \b regex anchor) select different lines whenever the text around a match
contains a non-ASCII byte.
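
The C-locale rule described above can be sketched in a few lines of Rust. This is only an illustration of the expected classification, not code taken from uu_grep or GNU grep; the names `is_c_locale_word_byte` and `is_word_boundary` are hypothetical.

```rust
/// Word-constituent test matching the C-locale behaviour described above:
/// only ASCII letters, digits and underscore qualify, so every byte >= 0x80
/// is a non-word byte. (Hypothetical helper, not part of uu_grep.)
fn is_c_locale_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if position `i` (between bytes `i - 1` and `i`) is a word boundary,
/// i.e. exactly one of the two neighbouring bytes is a word byte.
/// The start and end of the line count as non-word neighbours.
fn is_word_boundary(line: &[u8], i: usize) -> bool {
    let before = i > 0 && is_c_locale_word_byte(line[i - 1]);
    let after = i < line.len() && is_c_locale_word_byte(line[i]);
    before != after
}
```

Under this rule the position right after `caf` in the bytes `caf\xc3\xa9` is a boundary, because 0xC3 is not a word byte; a classifier that accepts 0xC3 as a word character sees no boundary there.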

Reproduction

The literal r is immediately followed by the byte 0xC3. GNU treats 0xC3 as a non-word
character, so r stands as a whole word and the line matches. uu_grep treats 0xC3 as a word
character, so the right-hand boundary check fails and nothing matches.

Rust (incorrect)

$ printf 'r\xc3z\n' | ./target/release/grep -w -F 'r'
# Output: (none)
# Exit code: 1

GNU (correct)

$ printf 'r\xc3z\n' | LC_ALL=C /usr/bin/grep -w -F 'r'
# Output: r<0xC3>z
# Exit code: 0

The same divergence happens with a valid multi-byte UTF-8 letter such as é (0xC3 0xA9),
and with the \b regex anchor rather than -w:

$ printf 'caf\xc3\xa9\n' | ./target/release/grep -w -F 'caf'     # Rust: no match (exit 1)
$ printf 'caf\xc3\xa9\n' | LC_ALL=C /usr/bin/grep -w -F 'caf'    # GNU:  matches "café" (exit 0)

$ printf 'caf\xc3\xa9\n' | ./target/release/grep -a -o 'caf\b'   # Rust: no match (exit 1)
$ printf 'caf\xc3\xa9\n' | LC_ALL=C /usr/bin/grep -a -o 'caf\b'  # GNU:  matches "caf"  (exit 0)

This is independent of the existing empty-pattern word/line report: here the pattern is a
non-empty literal and the difference is purely in how the adjacent character is classified.
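
To show that the classifier alone accounts for the divergence, here is a minimal sketch of a whole-word literal search with a pluggable classifier. It is a simplification under stated assumptions: `high_bytes_are_word` is a crude stand-in for the uu_grep behaviour reported here (it is not oniguruma's actual logic), `matches_whole_word` is a hypothetical helper, and the empty pattern is deliberately out of scope.

```rust
/// ASCII-only classifier, as expected under LC_ALL=C.
fn ascii_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Assumption for illustration: every byte >= 0x80 also counts as a word
/// byte. This only approximates the behaviour reported for uu_grep.
fn high_bytes_are_word(b: u8) -> bool {
    ascii_word(b) || b >= 0x80
}

/// True if `needle` occurs in `line` with a non-word byte (or the line
/// edge) on both sides, according to `is_word`.
fn matches_whole_word(line: &[u8], needle: &[u8], is_word: fn(u8) -> bool) -> bool {
    if needle.is_empty() {
        return false; // empty patterns are not handled by this sketch
    }
    line.windows(needle.len()).enumerate().any(|(start, window)| {
        let end = start + needle.len();
        window == needle
            && (start == 0 || !is_word(line[start - 1]))
            && (end == line.len() || !is_word(line[end]))
    })
}
```

With `ascii_word`, both `r` in `r\xc3z` and `caf` in `caf\xc3\xa9` match, which agrees with the GNU results above; with `high_bytes_are_word`, neither matches, which agrees with the Rust results.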
