Numpy/Matplotlib Hints

Below are some quick hints for numpy/matplotlib that should help you with the k-means assignment, should you choose to take advantage of them.

Selection

First, let's create a point cloud. As is the usual default (except for NMF, which uses the transpose), we'll express our point cloud as a data matrix, where each point is a row and each dimension is a column. This means that for $N$ points in 2 dimensions, we'll have an $N \times 2$ matrix. Let's generate such a matrix, where each coordinate is drawn independently from a standard Gaussian distribution (mean 0, variance 1).
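Here's a minimal sketch of generating such a matrix. The number of points, the seed, and the use of np.random.default_rng are my own choices for illustration; np.random.randn(N, 2) would work just as well.

```python
import numpy as np

# Assumed for illustration: 1000 points and a fixed seed so results are repeatable
N = 1000
rng = np.random.default_rng(0)

# Each row is one point; column 0 holds x coordinates, column 1 holds y coordinates
X = rng.standard_normal((N, 2))
print(X.shape)  # (1000, 2)
```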

We can plot it by using the first column as the x coordinate and the second column as the y coordinate. We can pull out a particular column with slice notation: X[:, 0] says to take all of the rows (the :) but only column 0.
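As a sketch of the plotting step, assuming the point cloud is stored in an array called X (the marker size and axis settings are just cosmetic choices of mine):

```python
import numpy as np
import matplotlib.pyplot as plt

# Example point cloud (size and seed are arbitrary choices)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

x = X[:, 0]  # all rows, column 0
y = X[:, 1]  # all rows, column 1

plt.scatter(x, y, s=5)
plt.axis("equal")  # same scale on both axes so the cloud looks round
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```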

Next, let's consider how to pull certain points out of this point cloud. Let's say we want to select all points that are at a distance of at most 1 from the origin. We could write a good old Python loop that filters the points meeting our criterion into a list Y, and then plot it.
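The loop version might look something like the sketch below. I'm comparing squared distances against 1 to skip the square root, and the plot styling is only one way to show the result.

```python
import numpy as np
import matplotlib.pyplot as plt

# Example point cloud (size and seed are arbitrary choices)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

# Collect every point whose distance from the origin is at most 1
Y = []
for i in range(X.shape[0]):
    px, py = X[i, 0], X[i, 1]
    if px**2 + py**2 <= 1:  # distance <= 1 exactly when squared distance <= 1
        Y.append([px, py])
Y = np.array(Y)

# Draw the selected points on top of the full cloud
plt.scatter(X[:, 0], X[:, 1], s=5, label="all points")
plt.scatter(Y[:, 0], Y[:, 1], s=5, label="within distance 1")
plt.axis("equal")
plt.legend()
plt.show()
```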

But numpy also has a very nice feature known as "boolean selection," which allows us to do this without a Python loop. First, we create a parallel array with $N$ elements, each of which holds the corresponding point's distance from the origin.
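A sketch of building that parallel array, assuming the points live in X and calling the result dists (a name I picked for illustration):

```python
import numpy as np

# Example point cloud (size and seed are arbitrary choices)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

# One distance per row of X: sqrt(x^2 + y^2)
dists = np.sqrt(X[:, 0]**2 + X[:, 1]**2)
print(dists.shape)  # (1000,)
```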

Notice how I'm actually doing "element-wise operations" here: when I say X[:, 0]**2, I'm squaring every element of the first column of X and creating a new array with the result. I can then add this, element by element, to the array of squares of the second column, X[:, 1]**2.
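To see element-wise operations on something small enough to check by hand, here's a toy example with two made-up arrays a and b standing in for the two columns:

```python
import numpy as np

# Tiny made-up arrays standing in for the two columns of X
a = np.array([3.0, 0.0, 1.0])
b = np.array([4.0, 2.0, 1.0])

a2 = a**2              # squares each entry: [9., 0., 1.]
total = a**2 + b**2    # adds matching entries: [25., 4., 2.]
roots = np.sqrt(total) # square root of each entry; the first one is 5.0

print(a2)
print(total)
print(roots)
```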

Anyway, since this array is parallel to the rows of X, we can use a boolean expression on it, such as a comparison against 1, in place of a slice to pull out the matching rows.
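Here's a sketch of boolean selection in action. The names dists, mask, and Y are just ones I chose; the key idea is that a boolean array with one entry per row can be used as an index.

```python
import numpy as np

# Example point cloud (size and seed are arbitrary choices)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

dists = np.sqrt(X[:, 0]**2 + X[:, 1]**2)

mask = dists <= 1  # boolean array: True where the point is within distance 1
Y = X[mask]        # keeps only the rows of X where mask is True

print(mask.dtype, Y.shape)
```

The intermediate mask is optional; X[dists <= 1] does the same thing in one step.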

Just one more quick note: a more compact way to compute the distances is to use np.sum with an axis argument, and this generalizes to higher dimensions.
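A sketch of the np.sum version follows. Summing along axis=1 adds up the squared coordinates within each row, so the same line works no matter how many columns there are. The 5-dimensional cloud X5 is a made-up example to show that.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))   # 2D point cloud
X5 = rng.standard_normal((1000, 5))  # made-up 5D point cloud

# Square every entry, sum across each row, then take the square root
dists = np.sqrt(np.sum(X**2, axis=1))
dists5 = np.sqrt(np.sum(X5**2, axis=1))

print(dists.shape, dists5.shape)  # (1000,) (1000,)
```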

Taking Means

The mean of a point cloud is obtained by taking the mean of each coordinate individually. Let's first compute the mean of X the tedious way, using loops.
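The tedious version could be sketched like this, with an accumulator I'm calling mean that has one slot per coordinate:

```python
import numpy as np

# Example point cloud (size and seed are arbitrary choices)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

N, d = X.shape
mean = np.zeros(d)  # one running sum per coordinate
for i in range(N):        # loop over points (rows)
    for j in range(d):    # loop over coordinates (columns)
        mean[j] += X[i, j]
mean /= N  # turn the sums into averages

print(mean)
```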

But actually, there's a really nice function in numpy called np.mean. If we pass it an "axis" parameter, that tells numpy which axis to loop over when taking the mean. Since each point is in a different row, we want to vary the rows (axis 0) while taking the mean, so we can do this simply as np.mean(X, axis=0).
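As a quick sketch, here is np.mean along axis 0, with axis 1 shown for contrast (the variable names are mine):

```python
import numpy as np

# Example point cloud (size and seed are arbitrary choices)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

centroid = np.mean(X, axis=0)   # averages over rows: one value per coordinate
row_means = np.mean(X, axis=1)  # averages over columns: one value per point

print(centroid.shape)   # (2,)
print(row_means.shape)  # (1000,)
```

For k-means, the axis=0 version is the one you'd want when computing a cluster center from the points assigned to it.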