feat: Add new input_file_name UDF for file-backed scans #22978

Open
AdamGS wants to merge 6 commits into apache:main from AdamGS:adamg/input-file-name-udf

Conversation

@AdamGS AdamGS commented Jun 16, 2026

Contributor

Which issue does this PR close?

Rationale for this change

Adds useful metadata functions to DataFusion. This PR builds on @ethan-tyler's #20071.

What changes are included in this PR?

  1. A new input_file_name() UDF whose return_type reports Utf8. Like file_row_index(), it returns an error when evaluated directly, outside a file scan.
  2. Rewrites in the FileOpener implementations (both ParquetOpener and ProjectionOpener) that replace the call with a literal containing the file's path.
  3. A new public rewrite helper, rewrite_input_file_name_in_projection, in datafusion-physical-expr-adapter.
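To make the rewrite in item 2 concrete, here is a minimal, std-only sketch of the idea: walk an expression tree and replace each input_file_name() call with a literal once the file path is known. The `Expr` enum and `rewrite_input_file_name` function are hypothetical stand-ins for illustration; they are not DataFusion's physical expression types, nor the signature of rewrite_input_file_name_in_projection.

```rust
// Hypothetical, simplified expression tree. DataFusion's real physical
// expressions are trait objects, not this enum.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Call { name: String, args: Vec<Expr> },
}

/// Replace every zero-argument `input_file_name()` call with a literal
/// holding `path`, recursing into the arguments of other calls.
fn rewrite_input_file_name(expr: Expr, path: &str) -> Expr {
    match expr {
        Expr::Call { name, args } if name == "input_file_name" && args.is_empty() => {
            Expr::Literal(path.to_string())
        }
        Expr::Call { name, args } => Expr::Call {
            name,
            args: args
                .into_iter()
                .map(|arg| rewrite_input_file_name(arg, path))
                .collect(),
        },
        // Columns and literals do not depend on the file path.
        other => other,
    }
}

fn main() {
    // concat(input_file_name(), a), as it might appear in a projection.
    let projection = Expr::Call {
        name: "concat".to_string(),
        args: vec![
            Expr::Call {
                name: "input_file_name".to_string(),
                args: vec![],
            },
            Expr::Column("a".to_string()),
        ],
    };
    // The path is an illustrative value, known only once a file is opened.
    let rewritten = rewrite_input_file_name(projection, "data/part-0.parquet");
    println!("{rewritten:?}");
}
```

Because the rewrite happens per file, the same projection produces a different literal for each file the scan opens, and the function itself never needs to be evaluated.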

Are these changes tested?

  1. Unit tests at all rewrite sites and for the core rewrite logic
  2. New SLT tests running against CSV
  3. New SLT tests covering metadata functions specifically on Parquet, which has its own opener. These include file_row_index().

Are there any user-facing changes?

  • The new UDF
  • New public function in datafusion-physical-expr-adapter - rewrite_input_file_name_in_projection.

AI was used in this PR, mostly to help come up with test cases.

AdamGS added 4 commits June 16, 2026 11:06
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
@github-actions github-actions Bot added the documentation, sqllogictest, functions, and datasource labels Jun 16, 2026
@AdamGS AdamGS changed the title from "Adamg/input file name udf" to "feat: Add new input_file_name UDF for file-backed scans" Jun 16, 2026
AdamGS commented Jun 16, 2026

Contributor Author

@alamb you mentioned in a previous review your dislike of all the helpers that keep accumulating in schema_rewriter.rs. I think that after this I'll try and figure out a better place to put them - even a dedicated file would be much nicer.

AdamGS and others added 2 commits June 16, 2026 12:27
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>

@alamb alamb left a comment

Contributor

Thank you @AdamGS -- this makes sense to me. I had a few small comments, but otherwise it looks 👌


Returns the path of the input file that produced the current row.

Note: file paths/URIs may be sensitive metadata depending on your environment.

Contributor

that is an interesting point -- can we add a note to the upgrading guide (and the release notes) and explain how to disable the function for people for whom it will be a security problem?

Maybe the explanation is just "register a function with the same name" 🤔 Or should we add a config flag as a follow-on PR to avoid registering these 🤔
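A rough, std-only sketch of the "register a function with the same name" idea, assuming a registry in which registering a name a second time replaces the earlier entry. `FunctionRegistry`, `ScalarFn`, and both function bodies are hypothetical and only model the shadowing behaviour; they are not DataFusion's API.

```rust
use std::collections::HashMap;

// Hypothetical zero-argument scalar function signature, for illustration only.
type ScalarFn = fn() -> Result<String, String>;

/// Toy registry in which a later registration replaces an earlier one.
struct FunctionRegistry {
    functions: HashMap<String, ScalarFn>,
}

impl FunctionRegistry {
    fn new() -> Self {
        Self {
            functions: HashMap::new(),
        }
    }

    /// Register `f` under `name`, returning the function it replaced, if any.
    fn register(&mut self, name: &str, f: ScalarFn) -> Option<ScalarFn> {
        self.functions.insert(name.to_string(), f)
    }

    /// Look up `name` and call it, or report an unknown function.
    fn call(&self, name: &str) -> Result<String, String> {
        match self.functions.get(name) {
            Some(f) => f(),
            None => Err(format!("unknown function: {name}")),
        }
    }
}

// Stand-in for the built-in, which would expose a (possibly sensitive) path.
fn builtin_input_file_name() -> Result<String, String> {
    Ok("s3://bucket/private/data.parquet".to_string())
}

// Replacement that a deployment could register to hide file paths.
fn disabled_input_file_name() -> Result<String, String> {
    Err("input_file_name() is disabled in this deployment".to_string())
}

fn main() {
    let mut registry = FunctionRegistry::new();
    registry.register("input_file_name", builtin_input_file_name);
    // Shadow the built-in by registering a replacement under the same name.
    registry.register("input_file_name", disabled_input_file_name);
    println!("{:?}", registry.call("input_file_name"));
}
```

Whether shadowing alone is enough depends on where the scan-time rewrite matches the function (by name or by identity), which is one argument for the config-flag alternative.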

Contributor

I suggest consolidating the Parquet tests into datafusion/sqllogictest/test_files/input_file_name.slt so that the tests for the same function live together in one file


Note: file paths/URIs may be sensitive metadata depending on your environment.

This function is intended to be rewritten at file-scan time (when the file is

Contributor

It occurs to me that we might want to add some documentation (as a follow on PR) to the TableProvider about functions that might need special handling (e.g. input_file_name, input_row_number, get_field)

Otherwise people implementing table providers might not know it would be helpful

@ethan-tyler ethan-tyler left a comment

Contributor

Looks good to me


This function is intended to be rewritten at file-scan time (when the file is
known). If the input file is not known (for example, if this function is
evaluated outside a file scan, or was not pushed down into one), direct evaluation returns an error.

Contributor

I wonder if we should be returning an error, or try to be more permissive and return null/empty string for this case 🤔

For reference, Spark seems to return an empty string:

>>> spark.sql("select input_file_name()").show()
+-----------------+
|input_file_name()|
+-----------------+
|                 |
+-----------------+
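
To compare the two behaviours side by side, here is a small std-only sketch. `MissingFilePolicy` and `eval_input_file_name` are hypothetical names used only for illustration; the PR as described implements only the error case.

```rust
/// Hypothetical policy for evaluating `input_file_name()` when no file is known.
#[derive(Debug, Clone, Copy)]
enum MissingFilePolicy {
    /// Return an error, as the PR description states.
    Error,
    /// Return an empty string, matching the Spark output shown above.
    EmptyString,
}

/// Evaluate the function given the current file (if any) and a policy.
fn eval_input_file_name(
    file: Option<&str>,
    policy: MissingFilePolicy,
) -> Result<String, String> {
    match (file, policy) {
        // Inside a file scan the path is known, so the policy is irrelevant.
        (Some(path), _) => Ok(path.to_string()),
        (None, MissingFilePolicy::Error) => {
            Err("input_file_name() evaluated outside a file scan".to_string())
        }
        (None, MissingFilePolicy::EmptyString) => Ok(String::new()),
    }
}

fn main() {
    for policy in [MissingFilePolicy::Error, MissingFilePolicy::EmptyString] {
        println!("{policy:?}: {:?}", eval_input_file_name(None, policy));
    }
}
```

The strict variant surfaces a missed pushdown immediately, while the permissive variant keeps queries running at the cost of silently returning a value that looks like data.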

Labels

datasource, documentation, functions, sqllogictest

Development

Successfully merging this pull request may close these issues.

Add input_file_name built-in function

4 participants