Data Preprocessing

We apply a rolling mean with a window size of 100 points (approximately 3 seconds), a value we set empirically. We then normalize the data to unit norm, which can improve accuracy when quantifying the similarity of signals across samples.
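A minimal sketch of this step in Python, assuming each sample arrives as a pandas DataFrame of raw inertial axes (the function name, the DataFrame layout, and the choice to normalize the whole segment rather than each time step are our own assumptions for illustration):

import numpy as np
import pandas as pd

def preprocess_segment(segment: pd.DataFrame, window: int = 100) -> np.ndarray:
    # Rolling mean over 100 points (~3 s at the recording rate used here).
    smoothed = segment.rolling(window=window, min_periods=1).mean()
    # Scale the whole segment to unit L2 norm so that similarity measures
    # across samples are not dominated by amplitude differences.
    values = smoothed.to_numpy()
    norm = np.linalg.norm(values)
    return values / norm if norm > 0 else values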

We define a feeding gesture through two sub-movements: food-to-mouth and back-to-rest. Prior to processing, we set values for three variables: 1) window size (window_size); 2) overlap threshold (overlap); and 3) sliding window shift (shift).

We compute 11 statistical features on fixed time subdivisions of the data that are known to be useful for detecting activity and eating: mean, median, max, min, standard deviation, kurtosis, interquartile range, quartile 1, quartile 3, skewness, and root mean square (RMS). Computing each of these features on each of the 12 axes of the inertial sensors generates 11 × 12 = 132 features, so each sample is a point in the R^132 feature space.
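The feature extraction can be sketched as follows; the helper names and the use of numpy/scipy are illustrative choices, not necessarily those of the original implementation:

import numpy as np
from scipy import stats

def axis_features(x: np.ndarray) -> list:
    """Return the 11 statistical features for one inertial axis of a window."""
    q1, q3 = np.percentile(x, [25, 75])
    return [
        np.mean(x), np.median(x), np.max(x), np.min(x), np.std(x),
        stats.kurtosis(x),          # kurtosis
        q3 - q1,                    # interquartile range
        q1, q3,                     # quartile 1 and quartile 3
        stats.skew(x),              # skewness
        np.sqrt(np.mean(x ** 2)),   # root mean square (RMS)
    ]

def window_features(window: np.ndarray) -> np.ndarray:
    # window has shape (n_samples, 12): one column per inertial axis (assumed layout).
    # Concatenating all axes yields an 11 x 12 = 132-dimensional feature vector.
    return np.concatenate([axis_features(window[:, i]) for i in range(window.shape[1])])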

It is important to test whether a slow-moving, fine-grained segmentation (small window size, small shift, and high overlap) or a fast-moving, coarse segmentation (large window size, large shift, and low overlap) of the data improves the analysis of feeding gestures. Several prior efforts in detecting feeding gestures overlook this step, yet it can drastically impact classification results and provide insight into how feeding gestures are detected. We tested window sizes ranging from 1 to 2.5 seconds, sliding window shifts from 30% to 70%, and signal overlap thresholds from 50% to 80%.
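A rough sketch of the sliding-window segmentation; the sampling rate (~33 Hz, inferred from 100 points ≈ 3 seconds) and the way the overlap threshold is used to label a window are assumptions for illustration:

import numpy as np

def sliding_windows(signal: np.ndarray, window_size_s: float, shift_frac: float, fs: int = 33):
    """Yield (start, end) index pairs for overlapping windows.

    window_size_s: window length in seconds (tested from 1.0 to 2.5 s).
    shift_frac:    shift as a fraction of the window (tested from 0.3 to 0.7).
    fs:            sampling rate in Hz (~33 Hz, inferred from 100 points ≈ 3 s).
    """
    win = int(window_size_s * fs)
    step = max(1, int(win * shift_frac))
    for start in range(0, len(signal) - win + 1, step):
        yield start, start + win

def label_window(start: int, end: int, gesture_mask: np.ndarray, overlap: float = 0.5) -> int:
    # A window counts as a feeding gesture if at least `overlap` of its samples fall
    # inside an annotated gesture (overlap thresholds tested: 0.5 to 0.8).
    return int(gesture_mask[start:end].mean() >= overlap)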

To test the performance of each signal processing parameter, we used data from 13 participants (right-handed only) and balanced each participant's data into equal numbers of feeding and non-feeding gestures (520 samples in total). We used a Random Forest (n=100 trees) to build models from the training set prior to testing.
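One way the per-participant balancing could be done, assuming a DataFrame with placeholder participant and label columns (the exact balancing strategy used in our pipeline may differ):

import pandas as pd

def balance_per_participant(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Downsample each participant's majority class to equal feeding / non-feeding counts."""
    balanced = []
    for _, group in df.groupby("participant"):
        n = group["label"].value_counts().min()   # smaller class size for this participant
        balanced.append(group.groupby("label", group_keys=False)
                             .apply(lambda g: g.sample(n=n, random_state=seed)))
    return pd.concat(balanced, ignore_index=True)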

We evaluated each parameter setting using 10-fold Cross Validation (CV) and Leave-One-Subject-Out Cross Validation (LOSOCV).
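Both evaluation schemes map onto standard scikit-learn utilities; in this sketch, X, y, and groups are placeholders for the 132-dimensional features, the gesture labels, and the participant IDs:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

def evaluate(X, y, groups):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # 10-fold CV, repeated with 10 different shuffles and averaged.
    kfold_acc = np.mean([
        cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=i)).mean()
        for i in range(10)
    ])
    # Leave-one-subject-out CV: each participant is held out in turn.
    loso_acc = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut()).mean()
    return kfold_acc, loso_acc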

We tested Logistic Regression (LogisticReg), AdaBoost (AdaBoostClassifier), C4.5 Decision Trees (DecisionTree), Gaussian Naive Bayes (GaussianNB), a Linear Support Vector Classifier (LinearSVC), and Random Forest (RF) with n=100 trees, evaluating each with both LOSOCV and 10-fold CV (averaged over 10 runs).
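The classifier set corresponds closely to standard scikit-learn estimators; the sketch below instantiates them with mostly default settings (note that scikit-learn's DecisionTreeClassifier implements CART, so it only approximates the C4.5 entry):

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "LogisticReg":  LogisticRegression(max_iter=1000),
    "AdaBoost":     AdaBoostClassifier(),
    "DecisionTree": DecisionTreeClassifier(),   # CART; stands in for C4.5 here
    "GaussianNB":   GaussianNB(),
    "LinearSVC":    LinearSVC(),
    "RF":           RandomForestClassifier(n_estimators=100),
}
# Each classifier is evaluated with both LOSOCV and 10-fold CV (averaged over 10 runs),
# e.g. by passing it into the evaluate() sketch above.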

We applied density-based spatial clustering of applications with noise (DBSCAN) to group together samples that are close to one another (high density), while marking as outliers the points that lie in low-density regions. We tested a range of values for the two parameters used by the DBSCAN algorithm: 1) ε, which defines the neighborhood of points used to decide whether a cluster should be formed (we tested a range from 2 to 4), and 2) minPts, the minimum number of points required to form a dense region (we tested a range from 1 to 3).
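A sketch of the DBSCAN parameter sweep with scikit-learn, where eps corresponds to ε and min_samples to minPts (the feature matrix X is a placeholder):

from sklearn.cluster import DBSCAN

def dbscan_sweep(X):
    """Run DBSCAN over the tested parameter grid; a label of -1 marks an outlier."""
    results = {}
    for eps in (2, 3, 4):               # tested ε range: 2-4
        for min_pts in (1, 2, 3):       # tested minPts range: 1-3
            labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
            results[(eps, min_pts)] = labels
    return results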

Fig.: A framework representing our approach to building a model that can detect feeding gestures.
