grep - non-ASCII bytes are treated as word characters for -w/\b, unlike GNU in the C locale
Found by the differential fuzzer with args grep ["-e", "", "-F", "-w"] on input
lines containing non-ASCII bytes (e.g. 0xC3).
GNU grep run under LC_ALL=C classifies only ASCII [A-Za-z0-9_] as word-constituent
characters; every byte ≥ 0x80 is a non-word character. uu_grep decides word membership
with oniguruma's UTF-8 ONIGENC_CTYPE_WORD classifier, which counts Unicode letters/digits
(and any high byte it decodes) as word characters. This changes where word boundaries fall,
so -w and the \b regex anchor select different lines whenever the text adjacent to a match
contains a non-ASCII byte.
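The difference can be illustrated with a minimal, self-contained Rust sketch. It is not the uu_grep or oniguruma code: all three function names are hypothetical, and is_word_byte_high_as_word is only a stand-in that treats every byte ≥ 0x80 as a word character, whereas the real classifier decodes UTF-8 first. Empty patterns are deliberately out of scope.

```rust
/// C-locale rule described above: only ASCII [A-Za-z0-9_] are word bytes.
fn is_word_byte_c_locale(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical stand-in for a classifier that also accepts high bytes.
/// Assumption: every byte >= 0x80 counts as a word character.
fn is_word_byte_high_as_word(b: u8) -> bool {
    is_word_byte_c_locale(b) || b >= 0x80
}

/// Returns true if `needle` occurs in `line` with no word byte directly
/// before or after the occurrence. Empty needles are rejected (out of scope).
fn matches_whole_word(line: &[u8], needle: &[u8], is_word: fn(u8) -> bool) -> bool {
    if needle.is_empty() || needle.len() > line.len() {
        return false;
    }
    (0..=line.len() - needle.len()).any(|start| {
        let end = start + needle.len();
        &line[start..end] == needle
            && (start == 0 || !is_word(line[start - 1]))
            && (end == line.len() || !is_word(line[end]))
    })
}

fn main() {
    // Same input as the reproduction: 'r', the byte 0xC3, then 'z'.
    let line = b"r\xc3z";
    println!(
        "C-locale rule: {}",
        matches_whole_word(line, b"r", is_word_byte_c_locale)
    );
    println!(
        "high-byte-as-word rule: {}",
        matches_whole_word(line, b"r", is_word_byte_high_as_word)
    );
}
```

Passing the classifier as a parameter keeps the boundary logic identical for both runs, so any difference in the result comes only from how the byte next to the match is classified.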
Reproduction
Search for the word r in a line where r is immediately followed by the byte 0xC3. GNU
treats 0xC3 as a non-word character, so r is a whole word and the line matches; uu_grep
treats it as a word character, so the right-hand boundary check fails and nothing matches.
Rust (incorrect)
$ printf 'r\xc3z\n' | ./target/release/grep -w -F 'r'
# Output: (none)
# Exit code: 1
GNU (correct)
$ printf 'r\xc3z\n' | LC_ALL=C /usr/bin/grep -w -F 'r'
# Output: r<0xC3>z
# Exit code: 0
The same divergence happens with a valid multi-byte UTF-8 letter such as é (0xC3 0xA9),
and with the \b regex anchor rather than -w:
$ printf 'caf\xc3\xa9\n' | ./target/release/grep -w -F 'caf' # Rust: no match (exit 1)
$ printf 'caf\xc3\xa9\n' | LC_ALL=C /usr/bin/grep -w -F 'caf' # GNU: matches "café" (exit 0)
$ printf 'caf\xc3\xa9\n' | ./target/release/grep -a -o 'caf\b' # Rust: no match (exit 1)
$ printf 'caf\xc3\xa9\n' | LC_ALL=C /usr/bin/grep -a -o 'caf\b' # GNU: matches "caf" (exit 0)
This is independent of the existing empty-pattern word/line report: here the pattern is a
non-empty literal and the difference is purely in how the adjacent character is classified.