Brush things up

Some refactoring is needed and we are more or less ready for this changes. I put them all here together, but they can be split later into separate issues.

- [ ] Refactor `init`
- [ ] Decide on algorithms type (or "distance type can be different from `X`")
- [ ] Document algorithms separately
- [ ] Drop extra fields from result.

Below is a short description of these problems.

1. Refactor `init`.
Currently `init` is redundant, since we have `init` and `init_k` and some weird logic, which one to choose. Correct solution is to use multiple dispatch, add additional function `create_seed` which should accept argument `init` (and all other necessary arguments). If `init` is String or `Symbol`  it should fall to `smart_init`, if `Nothing` then default `kmeans++`, otherwise return deepcopy of the init.

All of this should happen in `kmeans` (before `kmeans!`), so duplicated copy is avoided.

2. Decide on algorithms type (distance may have different type)
Currently we infer `distance` type from the type of the design matrix. This can be wrong, for example, if `X` `eltype` is `RGB` or `Complex`, then distance can have different type, usually `Float64` or `Float32`.

This can be solved by turning all algorithms to parametric, for example
`Lloyd{Float64, Float64}` and we can define something like this

```julia
struct Foo{T1, T2} end
Foo{T1}() where {T1} = Foo{T1, T1}()
Foo() = Foo{Float64, Float64}()

Foo{Float64}() # Foo{Float64, Float64}
Foo() # Foo{Float64, Float64}
```

It make it somewhat more verbose, and constraint to the design matrix type, but on the other hand it's more Julia like.

On the other hand, currently we infer everything from the matrices itself and distance type can be `kmeans` argument. I think it can work, but it looks weird.

3. Better documentation

I think it would be better for users to come to the documentation and see separate page, where all algorithms and their usage is described, especially taking into account the fact that we soon will add stochastic algorithms (coresets and minibatch). It can be organized as follows;

- Full scan algorithms
-- Lloyd
-- Elkan
-- Hamerly
-- Yinyang
- Stochastic algortihms
-- Coresets
-- MiniBatch

4. Drop extra fields from results.

Currently we have lots of redundant fields in result, which are not used, and I think they shouldn't be added, since they can be always calculated from all current data result. This extra information shouldn't be calculated inside `kmeans`,  there should be separate set of utility functions, which can be invoked if need arise. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Brush things up #71

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Brush things up #71

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions