# ALDA  HW 1 solution

1. (13 points) [Song Ju] Classify the following attributes as binary, discrete, or continuous.
Also classify them as nominal, ordinal, interval, or ratio. Some cases may have more
than one interpretation, so briefly justify your answer if you think there may be some
ambiguity.
(a) (1 point) Hair color (Black, Blonde, Red)
(b) (1 point) Level of agreement (yes, maybe, no)
(c) (1 point) Income earned in a week
(d) (1 point) Celsius temperature
(e) (1 point) Genotype (Bb, bb, BB, bB)
(f) (1 point) ISBN numbers for books.
(g) (1 point) Time in terms of AM or PM
(h) (1 point) Waiting number for restaurant
(i) (1 point) Years of work experience
(j) (1 point) Categorization of clothing (hat, shirt, pants, shoes)
(k) (1 point) Angles as measured in degrees between 0 and 360
(l) (1 point) Ratings of movies (G, PG, R)
(m) (1 point) Coat check number. (When you attend an event, you can often give your
coat to someone who, in turn, gives you a numb that you can use to claim your
coat when you leave.)
2. (10 points) [Ruth Okoilu] Data Transformation.
In natural language processing, we often use term frequency and inverse document
frequency transformation (tf0
ij ), defined by the following equation:
tf0
ij = tfij ∗ log m
dfi
(1)
where tfij is the term frequency of the i
th word (term) in the j
th document, m is the
number of documents, and dfi
is the number of documents in which the i
th term appears.
Alternatively, we can define (tf00
ij ) as:
tf00
ij = tfij ∗ log
Pm
k=1 dk
Pdfi
k=1 dk
(2)
where dk is the length of a document k.
Assume the max term frequency tfij is p and answer the following questions.
(a) (6 points) What are the maximum and minimum values of tf0
ij and tf00
ij respectively? Please specify what cases the max and min value achieves.
(b) (4 points) Briefly explain the purpose for using tf0
ij and tf00
ij respectively in the
context of natural language processing and also explain what is the main difference
between tf0
ij and tf00
ij .
3. (8 points) [Xi Yang] Answer the following questions:
(a) (4 points) A healthcare dataset contains 523,000 patients. Among these patients,
26,150 patients have albinism and the remaining 496,850 patients have normal skin.
Suppose we will sample 1,000 patients from the dataset to conduct albinotic analysis, which sampling method should be selected to apply in this situation: simple
random sampling or stratified sampling, and why? With the selected sampling
method, how many albinotic and normal skin patients will be sampled, respectively?
(b) (4 points) Consider the following scenario, a patient’s systolic blood pressure (SBP)
is recorded to be 250. When SBP is higher than 180, a patient is considered to have
hypertensive crisis and need to seek the emergency care. For this given scenario, is
the recorded data noise or outlier? And why? (no point will be given if you do not
give a justification).
4. (15 points) [Song Ju] Write your code in Matlab, R or Python to perform the following
your code file (end with .m, .r or .py) in the .zip file.
(a) (1 point) Generate a 5*5 identity matrix A.
(b) (1 point) Change all elements in the 2nd column of A to 3.
(c) (1 point) Sum of all elements in the matrix (use a ”for/while loop”).
(d) (1 point) Transpose the matrix A (A = AT
)
(e) (2 points) Calculate sum of the 3rd row, and the diagonal in the matrix A.
(f) (1 point) Generate a 5*5 matrix B following Gaussian Distribution with mean 5
and variance 3.
(g) (2 points) From B, using matrix operations to get a new matrix C such that, the
first row of C is equal to the first row of B times the second row of B, the second
row of C is equal to the sum of the 3rd and 4th row of B minus the 5th row of B.
(h) (2 points) From C, using one matrix operation to get a new matrix D such that,the
first column of D is equal to the first column of C times 2, the second column of D
is equal to the second column of C times 3 and so on.
(i) (2 points) X = [2, 4, 6, 8]T
, Y = [6, 5, 4, 3]T
, Z = [1, 3, 5, 7]T
. Compute the covariance matrix of X, Y and Z.
(j) (2 points) Verify the equation: ¯x
2 = (¯x
2 + σ
2
(x)),
using x = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]T
. σ(x) is the standard deviation.
5. (33 points) [Ruth Okoilu] For this exercise, use the provided ‘seeds.csv’ file, which contains a list of 210 data instances. The examined group comprised kernels belonging to
three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly
selected for the experiment. (Source: https://archive.ics.uci.edu/ml/datasets/seeds)
There are 8 columns representing: 1) area A, 2)perimeter P, 3) compactness, 4) length
of kernel, 5) width of kernel, 6) asymmetry coefficient, 7) length of kernel, and 8) groove
Class (Type of wheat). For the purpose of this exercise, you consider two features,
‘area A’ and ‘kernel width’ (columns 1 & 5) of the provided ‘seeds.csv’ dataset. Write
outputs and key codes in the document file, and also include your code file (end with .m,
.r or .py) in the .zip file.
(a) (3 points) Load the file and read ‘area A’ and ‘kernel width’ columns and save them
as the original raw dataset. Apply normalization (transformed data z ∈ [0, 1]) to
the raw dataset to get the normalized dataset and apply the standardization to the
raw dataset to get the standardized dataset. Show the range of the two features in
each dataset.
(b) (30 points) Perform the following operations on the raw, normalized and standardized datasets respectively.
i. (3 points) Make a 2D plot of the values and label the axes (area A should be
x-axis and kernel width should be y-axis). Compare the three plots.
ii. (3 points) Compute the mean of area A and kernel width values. Consider this
point as P.
iii. (9 points) Compute the distance between P and the 210 data points using the
following distance measures: 1) Euclidean distance, 2) Mahalanobis distance,
3) City block metric, 4) Minkowski metric (for r=3), 5) Chebyshev distance, 6)
Cosine distance and 7) Canberra distance.
iv. (3 points) For each distance measure, identify the 10 points from the dataset
that are the closest to the point P from (ii). (You are allowed to use any package
functions to calculate the distances.)
v. (6 points) Create plots, one for each distance measure. Place an ‘X’ for P and
mark the 10 closest points. To mark them, you could place a circle or draw the
line between these closest neighbors and the points ‘X’. Make sure the points
can be uniquely identified.
vi. (3 points) Verify if the set of points is the same across all the distance measures.
If there is any big difference, briefly explain why it is.
vii. (3 points) Reason about your results and state the importance of data transformation in the dataset.
6. (21 points) [Xi Yang] In this question, please summarize and explore data in the provided file “hw1q6 data.csv”, which comes from the Pima Indians Diabetes Database
(https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/). In
this data file, each row indicates the data for a patient. The first 6 columns are features
for patients, and the last column ”Class” indicates if a patient has diabetes: 1 (diabetic)
or 0 (nondiabetic). The specific meaning for each feature is as follows:
1. Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
2. BloodPressure: Diastolic blood pressure (mm Hg).
3. SkinThickness: Triceps skinfold thickness (mm).
4. BMI: Body mass index (weight in kg/(height in m)2
).
5. DiabetesPedigreeFunction: Diabetes pedigree function.
6. Age: (years).
Write code in Matlab, R or Python to perform the following tasks. Please report your
outputs and key codes in the document file, and also include your code file (end with .m,
.r or .py) in the .zip file.
(a) (1 point) How many diabetic and nondiabetic patients are in the dataset?
(b) (2 points) There are missing values in the features which are marked as 0. What
is the missing rate (%) for each feature?
(c) (4 points) Specify two methods for missing data handling and discuss their respective advantages and disadvantages.
Remove the patients (rows) in dataset with missing values, then answer the following
questions based on the remaining data:
(d) (1 point) How many diabetic and nondiabetic patients are in the remaining data?
(e) (3 points) Compute the mean, median, standard deviation, range, 25th percentiles,
50th percentiles, 75th percentiles for each feature.
(f) (4 points) Create histogram plot using 10 bins for the two features BloodPressure
and DiabetesPedigreeFunction, respectively.
(g) (6 points) Quantile-quantile plot can be used for comparing the distribution of data
against the normal distribution. Create quantile-quantile plot for the two features
BloodPressure and DiabetesPedigreeFunction, respectively. Give a brief analysis for
the two plots.