`-w` / `\b` treat non-ASCII bytes as word characters, unlike GNU in the C locale

# grep - non-ASCII bytes are treated as word characters for `-w`/`\b`, unlike GNU in the C locale

Found by the differential fuzzer with args `grep ["-e", "", "-F", "-w"]` on input
lines containing non-ASCII bytes (e.g. `0xC3`).

GNU grep run under `LC_ALL=C` classifies only ASCII `[A-Za-z0-9_]` as word-constituent
characters; every byte ≥ `0x80` is a non-word character. `uu_grep` decides word membership
with oniguruma's UTF-8 `ONIGENC_CTYPE_WORD` classifier, which counts Unicode letters/digits
(and any high byte it decodes) as word characters. This changes where word boundaries fall,
so `-w` (and the `\b` regex anchor) select different lines whenever the text around a match
contains a non-ASCII byte.
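The C-locale rule described above can be sketched in Rust as follows. This is only an illustration of the expected GNU behaviour for a fixed-string `-w` match, not code from `uu_grep`; the names `is_c_locale_word_byte` and `has_whole_word_match` are hypothetical, and the empty pattern is deliberately left out because it is covered by a separate report.

```rust
/// Word-constituent test as described for GNU grep under `LC_ALL=C`:
/// only ASCII letters, digits, and underscore count. Every byte >= 0x80
/// is a non-word byte. (Hypothetical helper, not taken from uu_grep.)
fn is_c_locale_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Returns true if `needle` occurs somewhere in `line` with no word byte
/// directly before or after the occurrence, which is the condition `-w -F`
/// is expected to apply. Assumption: an empty needle is out of scope here
/// and simply yields false.
fn has_whole_word_match(line: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() || needle.len() > line.len() {
        return false;
    }
    (0..=line.len() - needle.len()).any(|start| {
        let end = start + needle.len();
        &line[start..end] == needle
            && (start == 0 || !is_c_locale_word_byte(line[start - 1]))
            && (end == line.len() || !is_c_locale_word_byte(line[end]))
    })
}
```

Under this rule, `has_whole_word_match(b"r\xc3z", b"r")` is true because `0xC3` is not a word byte, whereas a classifier that treats high bytes as word characters would reject the right-hand boundary.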

## Reproduction

The pattern `r` matches at the start of the line and is immediately followed by the byte
`0xC3`. GNU treats `0xC3` as a non-word character, so `r` is a whole word and the line
matches; `uu_grep` treats it as a word character, so the right-hand boundary check fails and
nothing matches.

**Rust (incorrect)**
```bash
$ printf 'r\xc3z\n' | ./target/release/grep -w -F 'r'
# Output: (none)
# Exit code: 1
```

**GNU (correct)**
```bash
$ printf 'r\xc3z\n' | LC_ALL=C /usr/bin/grep -w -F 'r'
# Output: r<0xC3>z
# Exit code: 0
```

The same divergence happens with a *valid* multi-byte UTF-8 letter such as `é` (`0xC3 0xA9`),
and with the `\b` regex anchor rather than `-w`:

```bash
$ printf 'caf\xc3\xa9\n' | ./target/release/grep -w -F 'caf'     # Rust: no match (exit 1)
$ printf 'caf\xc3\xa9\n' | LC_ALL=C /usr/bin/grep -w -F 'caf'    # GNU:  matches "café" (exit 0)

$ printf 'caf\xc3\xa9\n' | ./target/release/grep -a -o 'caf\b'   # Rust: no match (exit 1)
$ printf 'caf\xc3\xa9\n' | LC_ALL=C /usr/bin/grep -a -o 'caf\b'  # GNU:  matches "caf"  (exit 0)
```
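For the `\b` case, the boundary test that GNU is expected to apply in the C locale can be sketched as below. This is a minimal illustration under the ASCII-only assumption stated earlier in this report; `ascii_word` and `is_word_boundary` are hypothetical names and do not correspond to any function in `uu_grep` or oniguruma.

```rust
/// ASCII-only word test, matching the C-locale set `[A-Za-z0-9_]`.
/// (Hypothetical helper for illustration.)
fn ascii_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// `\b` holds at index `i` (the gap between `line[i - 1]` and `line[i]`)
/// when exactly one of the two neighbouring bytes is a word byte.
/// The start and end of the line count as non-word neighbours.
fn is_word_boundary(line: &[u8], i: usize) -> bool {
    let before = i > 0 && ascii_word(line[i - 1]);
    let after = i < line.len() && ascii_word(line[i]);
    before != after
}
```

With the line `caf\xc3\xa9`, index 3 sits between `f` and `0xC3`, so `is_word_boundary(b"caf\xc3\xa9", 3)` is true and `caf\b` can match there. If `0xC3` were classified as a word byte, both neighbours would be word bytes and the boundary would disappear, which is the behaviour reported for the Rust build.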

This is independent of the existing empty-pattern word/line report: here the pattern is a
non-empty literal and the difference is purely in how the adjacent character is classified.

