feat: Add new input_file_name UDF for file-backed scans#22978
Conversation
input_file_name UDF for file-backed scans
|
@alamb you mentioned in a previous review your dislike of all the helpers that keep accumulating in |
|
|
||
| Returns the path of the input file that produced the current row. | ||
|
|
||
| Note: file paths/URIs may be sensitive metadata depending on your environment. |
There was a problem hiding this comment.
that is an interesting point -- can we add a note to the upgrading guide (and the release notes) and explain how to disable the function for people for whom it will be a security problem?
Maybe the explanation is just "register a function with the same filename" 🤔 Or should we add a config flag as a follow on PR to avoid registering these 🤔
There was a problem hiding this comment.
I suggest consolidating the the parquet tests in datafusion/sqllogictest/test_files/input_file_name.slt so the tests for the same function live together in the same file
|
|
||
| Note: file paths/URIs may be sensitive metadata depending on your environment. | ||
|
|
||
| This function is intended to be rewritten at file-scan time (when the file is |
There was a problem hiding this comment.
It occurs to me that we might want to add some documentation (as a follow on PR) to the TableProvider about functions that might need special handling (e.g. input_file_name, input_row_number, get_field)
Otherwise people implementing table providers might not know it would be helpful
|
|
||
| This function is intended to be rewritten at file-scan time (when the file is | ||
| known). If the input file is not known (for example, if this function is | ||
| evaluated outside a file scan, or was not pushed down into one), direct evaluation returns an error. |
There was a problem hiding this comment.
I wonder if we should be returning an error or try be more permissive and return null/empty string for this case 🤔
For reference, Spark seems to return an empty string:
>>> spark.sql("select input_file_name()").show()
+-----------------+
|input_file_name()|
+-----------------+
| |
+-----------------+
Which issue does this PR close?
Rationale for this change
Adds useful metadata functions to DataFusion. This PR builds on @ethan-tyler's #20071.
What changes are included in this PR?
input_file_name()UDF, which reports a UTF8 return_type. Likefile_row_index(), it errors when evaluated out of context.FileOpener(Both inParquetOpenerandProjectionOpener), making it a literal with the file's path.rewrite_input_file_name_in_projectionindatafusion-physical-expr-adapter.Are these changes tested?
file_row_index().Are there any user-facing changes?
datafusion-physical-expr-adapter-rewrite_input_file_name_in_projection.AI was used in this PR, mostly when helping to come up with test cases.