Xyseries command implementation#5343
asifabashar wants to merge 11 commits into opensearch-project:main
Signed-off-by: Asif Bashar <asif.bashar@gmail.com>
Description
Problem Statement
The `xyseries` command is missing from PPL and is listed as a to-do item on the roadmap.
This RFC proposes adding a new PPL transforming command, `xyseries`, with syntax and behavior aligned with SPL. `xyseries` converts row-oriented grouped results into a wide table where:
- one field (`x-field`) stays as a row key;
- one field (`y-name-field`) supplies part of each output column name: its values, restricted to the pivot values provided as parameters, are combined with the `y-data-field` name;
- one or more fields (`y-data-field`) supply the cell values, which are aggregated value fields pivoted into those columns.
OpenSearch PPL already supports `stats`, `chart`, and `timechart`, but there is no direct equivalent of SPL `xyseries`. Adding `xyseries` fills that gap.
Current State
There is no `xyseries`-compatible command.
Long-Term Goals
Provide xyseries functionality.
Proposal
Approach
User Syntax
xyseries [sep=<string>] [format=<string>] <x-field> <y-name-field> in (<value1>, <value2>, ...) <y-data-field>[,<y-data-field2>]...
Arguments
- `x-field` (required): row key in the output.
- `y-name-field` (required): field whose values generate the output series column names. Only the values listed in the `in` parameter, such as `<value1>` and `<value2>`, are used.
- `y-data-field...` (required, at least one): value field(s) used to fill cells.
Options
- `sep` (default ":"): separator placed between the `y-data-field` name and the `in` parameter value.
- `format` (default `$AGG$<sep>$VAL$`): pattern for output column names. For each data row, `$VAL$` is the pivot value of `y-name-field` and `$AGG$` is the `y-data-field` name. By default, the format inserts the `y-data-field` name, then the built-in separator ":", then the pivot value.
If `format` is omitted:
- single `y-data-field`: the output column is based on the row number.
- multiple `y-data-field` values: the output column is `$AGG$<sep>$VAL$`.
Examples
If `format` is not specified, the default column names look something like `url:count(host)` and `url:count(method)`.
If you provide your own separator through the `format` option, it overrides anything defined with `sep`. For example, if `sep` is set to "-" but `format` specifies "+", the "+" is used because `format` has higher priority.
`$VAL$` and `$AGG$` are placeholders that represent the `y-name-field` value and the `y-data-field` name, respectively. In the output, the name field (`url`) and the data field `count(host)` may appear in the position of `$VAL$` and `$AGG$`, depending on how the format string arranges them.
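The precedence between `sep` and `format` can be sketched in a few lines of Python. This is only an illustration of the naming rule, not the code in this PR: the function name `series_column_name` and its parameters are hypothetical, and the default follows the `$AGG$<sep>$VAL$` pattern given under Options, so the actual ordering in the implementation may differ.

```python
def series_column_name(agg, val, sep=":", fmt=None):
    """Build one output column name (hypothetical helper, for illustration).

    agg: the y-data-field name, substituted for $AGG$
    val: one pivot value of y-name-field, substituted for $VAL$
    """
    if fmt is None:
        # No explicit format: fall back to the default pattern built with sep.
        fmt = "$AGG$" + sep + "$VAL$"
    # An explicit format wins over sep, so sep is not consulted here.
    return fmt.replace("$AGG$", agg).replace("$VAL$", str(val))
```

With `sep="-"` and `fmt="$VAL$+$AGG$"`, the helper uses "+" and ignores "-", which mirrors the priority rule described above.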
Semantics
Input shape and type rules
- `xyseries` is supported only when pivot values are explicitly provided.
- `x-field` and `y-name-field` must exist in the input schema.
- `y-name-field` values are converted to strings for output column naming.
- `y-data-field`: one or more fields that contain the data to chart. If multiple fields are specified, separate the field names with commas.
Null handling
- A `null` in a given `y-data-field` does not contribute a value to that generated series column.
Implementation Plan
- Validate required fields/options and defaults.
- Project to `x`, `y_name` (restricted to the pivot values provided in the `in` operator), and the selected `y_data` fields.
- Build the generated series column name using `format`, or default naming with `sep`.
- Pivot to a wide schema based on the pivot values passed as parameters.
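The plan can be sketched over in-memory rows as follows. This is a minimal Python illustration under stated assumptions, not the Calcite-based implementation in this PR: the function `xyseries` and its signature are hypothetical, rows are plain dictionaries, the input is assumed to be already grouped (at most one row per `x`/`y_name` pair), and the `$AGG$<sep>$VAL$` pattern is applied even when there is a single data field.

```python
def xyseries(rows, x_field, y_name_field, pivot_values, y_data_fields,
             sep=":", fmt=None):
    # Step 1: validate required arguments and apply defaults.
    if not pivot_values:
        raise ValueError("pivot values must be provided explicitly")
    if not y_data_fields:
        raise ValueError("at least one y-data-field is required")
    pattern = fmt if fmt is not None else "$AGG$" + sep + "$VAL$"

    # Step 3 helper: build a series column name from the pattern.
    def column(agg, val):
        return pattern.replace("$AGG$", agg).replace("$VAL$", str(val))

    # The output schema is fixed up front from the pivot values alone.
    columns = [column(d, v) for v in pivot_values for d in y_data_fields]
    wanted = {str(v) for v in pivot_values}

    wide_rows = {}  # x value -> wide output row, in first-seen order
    for row in rows:
        # Step 2: keep only rows whose y-name value is a requested pivot value.
        name = str(row[y_name_field])
        if name not in wanted:
            continue
        x = row[x_field]
        wide = wide_rows.setdefault(x, {x_field: x, **dict.fromkeys(columns)})
        # Step 4: pivot; a null data value leaves the cell empty.
        for d in y_data_fields:
            value = row.get(d)
            if value is not None:
                wide[column(d, name)] = value
    return list(wide_rows.values())
```

Because the column list is derived only from `pivot_values`, the output schema does not depend on the data, which is the property the explicit `in (...)` list is meant to provide.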
Today our direction is to keep PPL commands translatable to SQL/Calcite plans as much as possible. We also plan to run PPL on Spark; a SQL-native `xyseries` (with explicit `in (...)` values) is portable to Spark SQL, while post-processing would require a separate, custom Spark-side implementation.
Composability: if `xyseries` is implemented after query execution, it effectively must be terminal and cannot be reliably chained with downstream commands. (Reference: comments from penghuo.)
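To illustrate why an explicit value list makes the command expressible as plain SQL, the sketch below renders a conditional-aggregation query and runs it on SQLite. It is an assumption-laden illustration, not the plan Calcite produces for this PR: the helper `xyseries_to_sql` is hypothetical, it handles one data field, it uses `MAX` only because the input is assumed to hold at most one row per key pair, and the column alias follows the `$AGG$<sep>$VAL$` default.

```python
import sqlite3


def xyseries_to_sql(table, x_field, y_name_field, pivot_values, y_data_field):
    """Render a pivot as conditional aggregation (hypothetical helper)."""

    def ident(name):
        # Quote an identifier, doubling embedded double quotes.
        return '"' + name.replace('"', '""') + '"'

    def literal(value):
        # Quote a string literal, doubling embedded single quotes.
        return "'" + str(value).replace("'", "''") + "'"

    select = [ident(x_field)]
    for v in pivot_values:
        # One output column per pivot value, known before execution.
        select.append(
            f"MAX(CASE WHEN {ident(y_name_field)} = {literal(v)} "
            f"THEN {ident(y_data_field)} END) AS {ident(f'{y_data_field}:{v}')}"
        )
    in_list = ", ".join(literal(v) for v in pivot_values)
    return (
        f"SELECT {', '.join(select)} FROM {ident(table)} "
        f"WHERE {ident(y_name_field)} IN ({in_list}) "
        f"GROUP BY {ident(x_field)} ORDER BY {ident(x_field)}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (host TEXT, url TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("h1", "/a", 3), ("h1", "/b", 5), ("h2", "/a", None), ("h2", "/c", 9)],
)
cursor = conn.execute(xyseries_to_sql("t", "host", "url", ["/a", "/b"], "cnt"))
column_names = [d[0] for d in cursor.description]
result_rows = cursor.fetchall()
```

Since the result is an ordinary query with a fixed column list, further operators can be planned on top of it, which is the composability point made above.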
Out of scope
The grouping option, which does not apply to multifile input in OpenSearch.
Alternative
`chart` and `timechart` can be used for similar use cases, but they are not exact equivalents.
Limitations
Because rows are transposed into columns, a limit is required: the number of rows is not known during the Calcite planning phase.
Related Issues
Resolves #5142
Check List
- Commits are signed per the DCO using `--signoff` or `-s`.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.