aside_one_two_dimensional – Applied Data Analysis in Python

scikit-learn requires the X parameter of the fit() function to be two-dimensional and the y parameter to be one-dimensional.

X must be two-dimensional even if your data contains only one feature (column). This can be confusing because, to a human, there is little difference between a table with one column and a simple list of values. Computers, however, are very explicit about this difference, so we need to make sure we pass the right structure.

First, let’s grab the data we were working with:

from pandas import read_csv

data = read_csv("https://bristol-training.github.io/applied-data-analysis-in-python/data/linear.csv")

2D `DataFrame`s

If we look at it, we see it’s a pandas DataFrame which is always inherently two-dimensional:

data.head()

	x	y
0	3.745401	3.229269
1	9.507143	14.185654
2	7.319939	9.524231
3	5.986585	6.672066
4	1.560186	-3.358149

To get a more specific idea of the shape of the data structure, we can use the shape attribute:

data.shape

(50, 2)

This tells us that it is a \((50 \times 2)\) structure, so it is two-dimensional.

To be explicit, we can also query its dimensionality directly with ndim:

data.ndim

2

1D `Series`

If we ask a DataFrame for one of its columns, it returns it to us as a pandas Series. These objects are always one-dimensional (ignoring the potential for multi-indexes):

data["x"].head()

0    3.745401
1    9.507143
2    7.319939
3    5.986585
4    1.560186
Name: x, dtype: float64

type(data["x"])

pandas.core.series.Series

data["x"].shape

(50,)

Note that the shape is (50,). The trailing comma might suggest that a second value is missing, but this is just how Python writes a tuple with one element. To check the dimensionality explicitly, we can peek at ndim again:

data["x"].ndim

1
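The one-element tuple notation can be checked in plain Python, independently of pandas. This is a minimal sketch; the variable names are purely illustrative and do not come from the course material.

```python
# A one-element tuple needs a trailing comma; parentheses alone only group.
shape = (50,)
not_a_tuple = (50)

print(type(shape))        # <class 'tuple'>
print(type(not_a_tuple))  # <class 'int'>

# The number of entries in a shape tuple is the number of dimensions.
print(len(shape))    # 1
print(len((50, 2)))  # 2
```

In other words, the length of the shape tuple matches the value that ndim reports.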

2D subsets of `DataFrame`s

If we ask a DataFrame for a subset of its columns, it returns the answer as another DataFrame, as this is the only way to represent data with multiple columns.

We can ask for multiple columns by passing a list of column names to the DataFrame indexing operator.

Pay attention here: the outer pair of square brackets is the indexing operator being called, while the inner pair creates the list.

data[["x", "y"]].head()

	x	y
0	3.745401	3.229269
1	9.507143	14.185654
2	7.319939	9.524231
3	5.986585	6.672066
4	1.560186	-3.358149

data[["x", "y"]].shape

(50, 2)

We can see here that when we ask the DataFrame for multiple columns by passing a list of column names, it returns a two-dimensional object.
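One way to make the two pairs of brackets easier to tell apart is to build the list of column names first and then pass it to the indexing operator. The sketch below uses a small stand-in DataFrame (three rows copied from the output above) rather than the full CSV, and the names columns and subset are illustrative.

```python
import pandas as pd

# Small stand-in table; the values are copied from the first rows shown above.
data = pd.DataFrame({
    "x": [3.745401, 9.507143, 7.319939],
    "y": [3.229269, 14.185654, 9.524231],
})

columns = ["x", "y"]    # the inner brackets: an ordinary Python list
subset = data[columns]  # the outer brackets: the indexing operator

print(type(subset).__name__)            # DataFrame
print(subset.equals(data[["x", "y"]]))  # True
print(subset.shape)                     # (3, 2)
```

Written this way, data[["x", "y"]] is simply the same two steps combined on one line.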

If we want to extract just one column but still maintain the dimensionality, we can pass a list with only one column name:

data[["x"]].head()

	x
0	3.745401
1	9.507143
2	7.319939
3	5.986585
4	1.560186

If we check the shape and dimensionality of this, we see that it is a \((50 \times 1)\) structure with two dimensions:

data[["x"]].shape

(50, 1)

data[["x"]].ndim

2

Final comparison

Finally, to reiterate, the difference between

data["x"].head()

0    3.745401
1    9.507143
2    7.319939
3    5.986585
4    1.560186
Name: x, dtype: float64

and

data[["x"]].head()

	x
0	3.745401
1	9.507143
2	7.319939
3	5.986585
4	1.560186

is not really in the data itself, but in the mathematical structure. One is a vector and the other is a matrix. One is one-dimensional and the other is two-dimensional.

data["x"].ndim

1

data[["x"]].ndim

2
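To tie this back to fit(), the following sketch shows how the two forms behave when passed as X. It assumes scikit-learn's LinearRegression as the model, which is a choice made here for illustration, and it uses a five-row stand-in DataFrame built from the values shown above instead of the full CSV. The rejected flag is a hypothetical helper used only to record what happened.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small stand-in table; the values are the first five rows shown above.
data = pd.DataFrame({
    "x": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186],
    "y": [3.229269, 14.185654, 9.524231, 6.672066, -3.358149],
})

model = LinearRegression()

# Two-dimensional X and one-dimensional y: accepted.
model.fit(data[["x"]], data["y"])
print(model.coef_.shape)  # (1,), one coefficient for the single feature

# One-dimensional X: scikit-learn raises a ValueError.
rejected = False
try:
    model.fit(data["x"], data["y"])
except ValueError as error:
    rejected = True
    print("fit() rejected the 1D input:", type(error).__name__)
```

The only difference between the two calls is the extra pair of square brackets around "x".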
