scikit-learn requires the X parameter of the fit() function to be two-dimensional and the y parameter to be one-dimensional.
X must be two-dimensional, even if there is only one feature (column) present in your data. This can be a bit confusing because, to humans, there’s little difference between a table with one column and a simple list of values. Computers, however, are very explicit about this difference, so we need to make sure we’re doing the right thing.
First, let’s grab the data we were working with:
from pandas import read_csv
data = read_csv("https://bristol-training.github.io/applied-data-analysis-in-python/data/linear.csv")
2D DataFrames
If we look at it, we see it’s a pandas DataFrame, which is always inherently two-dimensional:
data.head()
| | x | y |
|---|---|---|
| 0 | 3.745401 | 3.229269 |
| 1 | 9.507143 | 14.185654 |
| 2 | 7.319939 | 9.524231 |
| 3 | 5.986585 | 6.672066 |
| 4 | 1.560186 | -3.358149 |
To get a more specific idea of the shape of the data structure, we can use the shape attribute:
data.shape
(50, 2)
This tells us that it’s a \((50 \times 2)\) structure, so it is two-dimensional.
To be explicit, we can also query its dimensionality directly with ndim:
data.ndim
2
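As a small sketch of how shape and ndim relate, the snippet below builds a five-row stand-in for the data (the values are copied from the data.head() output above; the name sample is just an illustrative choice) and checks that ndim is simply the number of entries in shape.

```python
import pandas as pd

# Five-row stand-in for the tutorial's data, using the values shown by data.head().
sample = pd.DataFrame({
    "x": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186],
    "y": [3.229269, 14.185654, 9.524231, 6.672066, -3.358149],
})

# shape is an ordinary tuple of (rows, columns), so it can be unpacked.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)  # 5 2

# ndim is the number of entries in shape.
print(len(sample.shape) == sample.ndim)  # True
```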
1D Series
If we ask a DataFrame for one of its columns, it returns it to us as a pandas Series. These objects are always one-dimensional (ignoring the potential for multi-indexes):
data["x"].head()
0 3.745401
1 9.507143
2 7.319939
3 5.986585
4 1.560186
Name: x, dtype: float64
type(data["x"])
pandas.core.series.Series
data["x"].shape
(50,)
Note that the shape is (50,). This might look like it could have multiple values, but this is just how Python represents a tuple with one value. To check the dimensionality explicitly, we can peek at ndim again:
data["x"].ndim
1
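The trailing comma in (50,) is easy to misread, so here is a minimal plain-Python sketch of what it means. The variable names are purely illustrative.

```python
one_value = (50,)   # a tuple holding the single value 50
not_a_tuple = (50)  # parentheses alone do nothing: this is just the integer 50

print(type(one_value).__name__, len(one_value))  # tuple 1
print(type(not_a_tuple).__name__)                # int
```

A shape with one entry therefore describes a one-dimensional object, which is what ndim reports.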
2D subsets of DataFrames
If we want to ask a DataFrame for a subset of its columns, it will return the answer to us as another DataFrame, as this is the only way to represent data with multiple columns.
We can ask for multiple columns by passing a list of column names to the DataFrame indexing operator.
Pay attention here: the outer pair of square brackets denotes the indexing operator being called, while the inner pair denotes the list being created.
data[["x", "y"]].head()
| | x | y |
|---|---|---|
| 0 | 3.745401 | 3.229269 |
| 1 | 9.507143 | 14.185654 |
| 2 | 7.319939 | 9.524231 |
| 3 | 5.986585 | 6.672066 |
| 4 | 1.560186 | -3.358149 |
data[["x", "y"]].shape
(50, 2)
We can see here that when we asked the DataFrame for multiple columns by passing a list of column names, it returned a two-dimensional object.
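One way to make the two pairs of brackets easier to see is to build the list of column names first and then pass it to the indexing operator. This is only a sketch on a five-row stand-in for the data (values taken from the head() output; the names sample and columns are illustrative).

```python
import pandas as pd

# Five-row stand-in for the tutorial's data, using the values shown by head().
sample = pd.DataFrame({
    "x": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186],
    "y": [3.229269, 14.185654, 9.524231, 6.672066, -3.358149],
})

columns = ["x", "y"]      # the inner brackets: an ordinary Python list
subset = sample[columns]  # the outer brackets: the DataFrame indexing operator

# Writing both pairs of brackets on one line gives the same result.
print(subset.equals(sample[["x", "y"]]))    # True
print(type(subset).__name__, subset.ndim)   # DataFrame 2
```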
If we want to extract just one column but still maintain the dimensionality, we can pass a list with only one column name:
data[["x"]].head()
| | x |
|---|---|
| 0 | 3.745401 |
| 1 | 9.507143 |
| 2 | 7.319939 |
| 3 | 5.986585 |
| 4 | 1.560186 |
If we check the shape and dimensionality of this, we see that it is a \((50 \times 1)\) structure with two dimensions:
data[["x"]].shape
(50, 1)
data[["x"]].ndim
2
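If we already hold a Series and need the two-dimensional form, pandas also offers Series.to_frame(). The sketch below, again on a five-row stand-in with illustrative names, suggests that it gives the same result as indexing with a one-element list.

```python
import pandas as pd

# Five-row stand-in for the tutorial's data, using the values shown by head().
sample = pd.DataFrame({
    "x": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186],
    "y": [3.229269, 14.185654, 9.524231, 6.672066, -3.358149],
})

column_1d = sample["x"]           # Series: one-dimensional
column_2d = column_1d.to_frame()  # DataFrame with a single column named "x"

print(column_1d.shape, column_2d.shape)      # (5,) (5, 1)
print(column_2d.equals(sample[["x"]]))       # True
```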
Final comparison
Finally, to reiterate, the difference between
data["x"].head()
0 3.745401
1 9.507143
2 7.319939
3 5.986585
4 1.560186
Name: x, dtype: float64
and
data[["x"]].head()
| | x |
|---|---|
| 0 | 3.745401 |
| 1 | 9.507143 |
| 2 | 7.319939 |
| 3 | 5.986585 |
| 4 | 1.560186 |
is not really in the data itself, but in the mathematical structure. One is a vector and the other is a matrix. One is one-dimensional and the other is two-dimensional.
data["x"].ndim
1
data[["x"]].ndim
2
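To tie this back to fit(), here is a minimal sketch of why the distinction matters. It assumes scikit-learn is installed and uses LinearRegression purely as an example estimator; the five-row sample stands in for the full data, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Five-row stand-in for the tutorial's data, using the values shown by head().
sample = pd.DataFrame({
    "x": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186],
    "y": [3.229269, 14.185654, 9.524231, 6.672066, -3.358149],
})

X = sample[["x"]]  # two-dimensional: a DataFrame with one column
y = sample["y"]    # one-dimensional: a Series

model = LinearRegression()
model.fit(X, y)           # accepted: X is 2D and y is 1D
print(model.coef_.shape)  # (1,) because there is one feature

# Passing the one-dimensional Series as X is rejected with a ValueError.
rejected = False
try:
    LinearRegression().fit(sample["x"], y)
except ValueError:
    rejected = True
print(rejected)  # True
```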