Let us discuss the performance of a benchmark algorithm. The Random Forest algorithm is known for its attractive property of detecting variable interactions and for its excellent performance as a learning algorithm. For these reasons, we select the Random Forest algorithm as our benchmark. Initially, we randomly partitioned the full data set into two separate parts, where the split was 50-50 and the partitioning was done evenly within each cost bin. The first part, the training set, was used to develop the method. The second part, the test set, was used to evaluate the model's performance. The table in this slide reports the accuracy of the Random Forest algorithm on each of the three buckets.

Let us now introduce the idea of clustering. Patients in each bucket may have different characteristics. For this reason, we create clusters within each cost bucket and make predictions for each cluster using the Random Forest algorithm.

Clustering is mostly used in the absence of a target variable, to search for relationships among input variables or to organize data into meaningful groups. In this study, although the target variable is well-defined as heart attack or no heart attack, there are many different trajectories associated with the target. There is not one set pattern of health or combination of diagnoses that leads a person to a heart attack. Instead, we will show that there are many different dynamic health patterns and time-series diagnostic relations preceding a heart attack.

The clustering methods used were spectral clustering and k-means clustering. In this lecture, we focus on k-means clustering. The broad description of the algorithm is as follows:

1. Specify the number of clusters k.
2. Randomly assign each data point to a cluster.
3. Compute the cluster centroids.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute the cluster centroids.

We repeat steps 4 and 5 until no improvement is made.
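To make these steps concrete, here is a minimal from-scratch sketch of the loop in R, the language used in the recitations. The function name, the data matrix x, and the iteration cap are illustrative assumptions, not code from the course:

# Minimal k-means sketch; x is assumed to be a numeric matrix with one
# row per data point. Illustrative only.
kmeans_sketch <- function(x, k, max_iter = 100) {
  # Step 2: randomly assign each data point to one of the k clusters
  assignments <- sample(1:k, nrow(x), replace = TRUE)
  for (iter in 1:max_iter) {
    # Steps 3 and 5: the centroid of a cluster is the mean of its points
    # (this sketch assumes no cluster ever becomes empty)
    centroids <- t(sapply(1:k, function(j)
      colMeans(x[assignments == j, , drop = FALSE])))
    # Step 4: re-assign each point to the closest centroid
    dists <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignments <- apply(dists, 1, which.min)
    # Stop when no assignment changes, i.e., no improvement is made
    if (all(new_assignments == assignments)) break
    assignments <- new_assignments
  }
  list(cluster = assignments, centers = centroids)
}

Calling kmeans_sketch(x, k = 2) mirrors the two-cluster example that follows.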
Let us illustrate the k-means algorithm in action. We specify the desired number of clusters k; in this case, we use k = 2. We then randomly assign each data point to a cluster. In this case, we have the three points in red and the two points in black. We then compute the cluster centroids, indicated by the red x and the grey x. We re-assign each point to the closest cluster centroid, and now you observe that this point changes from red to grey. We re-compute the cluster centroids, and we repeat the previous steps, 4 and 5, until no improvement is made. We observe that, in this case, the k-means clustering is done, and this is our final clustering.

Let us discuss some practical considerations. The number of clusters k can be selected from previous knowledge or by simply experimenting. We can strategically select the initial partition of points into clusters if we have some knowledge of the data. We can also run the algorithm several times with different random starting points. In the recitations, we'll learn how to run the k-means algorithm in R.

So how do we measure performance? After we construct the clusters in the training set, we assign new observations to clusters by proximity to the centroid of each cluster. We then measure performance by recording the average prediction rate in each cluster.

Let us now discuss the performance of the clustering methods. We perform clustering on each bucket using k = 10 clusters. In the table, we record the average prediction rate for each cost bucket. We observe a very visible improvement when we use clustering: from 49% to 64%, from 56% to 73%, and from 58% to 78% across the three buckets.
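As a companion to the evaluation step described above, here is a brief sketch in R of fitting k = 10 clusters with the base kmeans function and assigning new observations to the nearest training centroid. The train and test objects are hypothetical feature matrices, and nstart = 20 is an illustrative choice for the random restarts mentioned earlier:

# train and test are assumed numeric feature matrices (hypothetical)
set.seed(1)                                     # reproducible random starts
km <- kmeans(train, centers = 10, nstart = 20)  # k = 10, best of 20 restarts

# Assign each test observation to the cluster with the closest centroid
closest_cluster <- function(obs, centers) {
  which.min(colSums((t(centers) - obs)^2))  # squared distance to each centroid
}
test_clusters <- apply(test, 1, closest_cluster, centers = km$centers)

The average prediction rate reported in the table can then be recorded within each of these clusters, as described above.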