Conversation
Signed-off-by: Nicholas Gates <nick@nickgates.com>
```rust
let agg = &options.aggregate_fn;

// Try encoding-specific fast path first.
if let Some(states) = list.elements().aggregate_list(&list, agg)? {
```
It's also the wrong type.
```rust
fn accumulate_list(&mut self, list: &ListViewArray) -> VortexResult<()> {
    for i in 0..list.len() {
        self.accumulate(&list.list_elements_at(i)?)?;
        self.flush()?;
    }
    Ok(())
}
```
I think we might want to use an array + offset + len approach to avoid list construction at each step?
What do you mean by "each step"?
I was thinking that as you do pushdown or reduce, you will need to unwrap the elements, unwrap the encodings, and wrap that up with offset + len.
Isn't that == canonicalize to ListView?
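To make the offset + len idea above concrete, here is a minimal sketch (with invented names, not the Vortex API) of aggregating each list by slicing a shared elements buffer with offsets, rather than materializing a per-element list array on every iteration:

```rust
// Hypothetical flattened list-of-i64: `elements` is shared storage and
// `offsets` has len + 1 entries delimiting each list's slice.
struct FlatList {
    elements: Vec<i64>,
    offsets: Vec<usize>,
}

impl FlatList {
    /// Sum each list by viewing (offset, len) into the shared elements
    /// buffer, never constructing a per-list array.
    fn sum_per_list(&self) -> Vec<i64> {
        self.offsets
            .windows(2)
            .map(|w| self.elements[w[0]..w[1]].iter().sum::<i64>())
            .collect()
    }
}

fn main() {
    let list = FlatList {
        elements: vec![1, 2, 3, 4, 5, 6],
        offsets: vec![0, 2, 2, 6], // lists: [1, 2], [], [3, 4, 5, 6]
    };
    assert_eq!(list.sum_per_list(), vec![3, 0, 18]);
}
```

Canonicalizing to a ListView-style layout gives you exactly this shape once, after which each group is just an (offset, len) pair over the same elements array.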
```rust
/// Merge a partial state scalar into the current group state.
fn merge(
    &self, options: &Self::Options, state: &mut Self::GroupState, partial: &Scalar,
) -> VortexResult<()>;
```
Why do you define merge in this way? It could be (GroupState, GroupState) -> GroupState
Because then you need an extra function for Scalar -> GroupState, and also merging on multiple groups takes an ArrayRef, not a Vec.
Can you expand on this, or did you define this elsewhere?
Can you explain what the scalar partial is here, and whether it is similar to a GroupState?
It's the "vortex" version of GroupState. We could just use a Scalar to model GroupState if we wanted. Maybe it's nicer to have a native type for performance. Or maybe it's ok to just use and merge scalars.
No, we definitely want a native type for GroupState, e.g. string_concat can hold a mutable string buffer and accumulate data into it. Then we only convert to scalar when we flush.
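A minimal sketch of the point above (names invented, not the Vortex trait): a native GroupState for string_concat holds a mutable buffer, accumulates batches and merges partial scalars into it, and only converts to a scalar value at flush:

```rust
// Hypothetical native GroupState for string_concat. The "scalar" form is
// modeled as a plain String here.
#[derive(Default)]
struct StringConcatState {
    buf: String,
}

impl StringConcatState {
    /// Accumulate a batch of values into the mutable buffer.
    fn accumulate(&mut self, batch: &[&str], sep: &str) {
        for v in batch {
            if !self.buf.is_empty() {
                self.buf.push_str(sep);
            }
            self.buf.push_str(v);
        }
    }

    /// Merge a partial result (the scalar form of another state).
    fn merge(&mut self, partial: &str, sep: &str) {
        if !partial.is_empty() {
            self.accumulate(&[partial], sep);
        }
    }

    /// Only at flush is the native state converted to a scalar.
    fn flush(self) -> String {
        self.buf
    }
}

fn main() {
    let mut state = StringConcatState::default();
    state.accumulate(&["a", "b"], ",");
    state.merge("c,d", ",");
    assert_eq!(state.flush(), "a,b,c,d");
}
```

This shows why merge takes a Scalar rather than another GroupState: the partial arriving from elsewhere is already in scalar form, while the local state stays native until flush.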
```rust
/// Accumulate a canonical batch into the current group state.
fn accumulate(
    &self, options: &Self::Options, state: &mut Self::GroupState, batch: &Canonical,
```
This is the fallback, and we have encoding-specific kernels?
```rust
-> VortexResult<Self::GroupState>;
```

```rust
/// Accumulate a canonical batch into the current group state.
fn accumulate(
```
Trying to pull out of stats happens here?
First PR implementing the Aggregate Functions proposal in vortex-data/rfcs#21
Apologies for the delay. I think one thing that would be helpful to add to this RFC is a small section on what kinds of grouping are supported. It seems like it's mostly pre-defined groups (i.e. list offsets and sorted groups?). I mentioned this in our sync, but I think supporting partial aggregations on unordered groups might be interesting too. We have a query of the type
I'm not sure I understand how the scan should group by an unordered column? I would want to avoid anything in the Vortex scan having to hold state indefinitely (i.e. accumulating state until the end of the scan).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The idea is that it would be a partial aggregation within batches (i.e. emit on every batch). I don't have concrete proof of an improvement, but something we do is group_by(x), first_value(y), where y is a complex, expensive column to read/process and x is unordered but can have very few distinct values within most batches. I think even just pre-aggregating within each batch can produce savings (if not in segment reads after segment slicing, at least in downstream execution engine processing). The meat of the aggregation is still handed over to the execution engine. I think this is not fundamental nor a blocker, but could be interesting to keep in mind.
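To sketch the per-batch idea (names illustrative, not Vortex APIs): within a single batch, group_by(x) + first_value(y) keeps only the first y per distinct x, emitting and dropping the state at the batch boundary so the scan never holds state indefinitely:

```rust
use std::collections::HashSet;

/// Pre-aggregate one batch: keep the first y seen for each distinct x.
/// Returns (x, first_y) pairs in order of first appearance. All state is
/// local to the batch and discarded when the function returns.
fn first_value_per_batch(xs: &[i64], ys: &[i64]) -> Vec<(i64, i64)> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for (&x, &y) in xs.iter().zip(ys) {
        // HashSet::insert returns true only on first occurrence of x.
        if seen.insert(x) {
            out.push((x, y));
        }
    }
    out
}

fn main() {
    // A batch with few distinct x values shrinks from 6 rows to 2, so the
    // execution engine finishes the aggregation over much less data.
    let xs = [1, 1, 2, 1, 2, 2];
    let ys = [10, 11, 20, 12, 21, 22];
    assert_eq!(first_value_per_batch(&xs, &ys), vec![(1, 10), (2, 20)]);
}
```

The downstream engine still merges the per-batch partials, so correctness does not depend on x being ordered across batches.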