Here are some terms used in this blog post that you might not be familiar with:

Uncertainty threshold: The boundary (i.e., the upper and lower limits) that defines the range of prediction probabilities within which a sample is classified as uncertain.

For example: In our case, 0 means the model has detected a call and 1 means it has detected no call, so samples with predictions close to 0.5 are the ones the model has a hard time labeling. We need to choose a boundary such that all samples whose predictions fall within that range are passed to an expert for labeling.
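As a minimal sketch (the probabilities and variable names below are made up for illustration, not taken from the project code), selecting uncertain samples with a given pair of thresholds could look like this:

```python
import numpy as np

# Hypothetical example: "probs" holds the model's predictions for a batch of
# samples, where values near 0 mean "call" and values near 1 mean "no call".
probs = np.array([0.02, 0.35, 0.51, 0.88, 0.97])

lower, upper = 0.1, 0.9                        # thresholds used in this post
uncertain = (probs > lower) & (probs < upper)  # True for hard-to-classify samples

# Indices of the samples that should be sent to an expert for labeling.
print(np.where(uncertain)[0])                  # -> [1 2 3]
```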
Accuracy: For binary classification, accuracy can be calculated in terms of positives and negatives as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
In this final phase, I worked on Python scripts, a Dockerfile, documentation, and tests. I also researched how changing the uncertainty threshold or the number of samples used in training affects the accuracy of the model (as measured on a test dataset).
This blog post is a continuation of the previous posts, where I showed how the accuracy of the model improved with the help of active learning. In this phase, we will dive deeper and see how changing the uncertainty threshold affects the accuracy of the model's predictions on the test dataset. The test dataset in this case consists of 201 preprocessed Mel spectrogram samples generated after applying PCEN and wavelet denoising. In the last blog post, the upper limit was 0.6 and the lower limit was 0.4. In this post, we will see how changing the upper and lower limits to 0.9 and 0.1 affects the model.
Here is the flowchart for the active learning loop where the uncertainty threshold is between 0.1 and 0.9 (the purple diamond).
Flowchart
My mentor Jesse suggested the idea of choosing the uncertainty range between 0.1 and 0.9, and this method yielded an increase in accuracy compared to the approach I used in the previous blog post, where the range was between 0.4 and 0.6. The first three steps (preprocessing spectrograms, building the CNN model, and training it) are the same as before and are explained in the previous blog posts.
The new step in this phase is labeling only the uncertain samples, i.e., those with a prediction probability between 0.1 and 0.9. These uncertain samples are labeled by experts like Scott and Val and then used for training along with the old training samples.
The steps taken in the above flowchart are as follows (a short code sketch of the loop follows the list):
1) Preprocess spectrograms: Generate Mel spectrograms with the help of the librosa library and then apply PCEN and wavelet denoising. These spectrograms are generated from the audio files containing calls and no calls.
2) Train our CNN model on training data.
Note: A small subset of the training data has been withheld for active learning; the model is not trained on it.
3) Test the accuracy of the model on the test data. The test dataset consists of 201 Mel spectrogram samples generated after applying PCEN and wavelet denoising, of which 101 are call samples and 100 are no-call samples.
4) Use this model to estimate the probability for each sample in the withheld subset of the training data and check whether the prediction is relatively uncertain (with a value between 0.1 and 0.9, in this case).
5) If yes, ask experts like Scott and Val to label them, and move them to the training directory according to the labels annotated by the expert.
6) If no, then ask for the next batch of samples to be labeled.
7) After a certain number of samples within these batches are labeled, retrain the model with this new data combined with the old data.
8) Measure the accuracy of the model on test data and compare it with previous accuracy results.
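Here is a compressed sketch of steps 4 to 8, assuming a Keras-style binary classifier with a sigmoid output; the helper ask_expert_to_label and all variable names are placeholders, not the actual project scripts:

```python
import numpy as np

def active_learning_round(model, X_train, y_train, X_pool, X_test, y_test,
                          lower=0.1, upper=0.9):
    """One pass through the loop above (hypothetical helper and variable names)."""
    # Step 4: estimate probabilities on the withheld pool of samples.
    probs = model.predict(X_pool).ravel()
    uncertain = (probs > lower) & (probs < upper)

    # Step 5: experts label the uncertain samples (placeholder function).
    y_new = ask_expert_to_label(X_pool[uncertain])

    # Step 7: retrain on the old training data plus the newly labeled samples.
    X_retrain = np.concatenate([X_train, X_pool[uncertain]])
    y_retrain = np.concatenate([y_train, y_new])
    model.fit(X_retrain, y_retrain, epochs=10, batch_size=32)

    # Step 8: measure accuracy on the fixed test set.
    _, accuracy = model.evaluate(X_test, y_test, verbose=0)
    return model, accuracy
```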
The distribution of the training, active learning, retraining, and test datasets is the same as in the previous blog post. Here is the distribution table:
|                      | Calls | No calls | Total |
|----------------------|-------|----------|-------|
| Training data        | 697   | 697      | 1394  |
| Active learning data | 88    | 88       | 176   |
| Retraining data      | 780   | 777      | 1557  |
| Test data            | 101   | 100      | 201   |
Of the 176 samples processed through the active learning loop, the model made 163 predictions that I define as uncertain (with values between 0.1 and 0.9, in this case). There were also 12 predictions of confident calls and 1 sample for which the model was confident there was no call.
These 163 uncertain samples were validated and then combined with the previous training dataset to retrain the model. The new accuracy of the model was found to be 84%.
|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| calls        | 0.90      | 0.77   | 0.83     | 101     |
| nocalls      | 0.80      | 0.91   | 0.85     | 100     |
| accuracy     |           |        | 0.84     | 201     |
| macro avg    | 0.85      | 0.84   | 0.84     | 201     |
| weighted avg | 0.85      | 0.84   | 0.84     | 201     |
Confusion matrix (rows: actual calls / no calls, columns: predicted calls / no calls):

[[78 23]
 [ 9 91]]

acc: 0.8408
sensitivity: 0.7723
specificity: 0.9100
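For reference, numbers like these can be reproduced with scikit-learn. The snippet below uses a tiny dummy pair of label arrays (y_true and y_pred are placeholders, not the real 201-sample test set); the comments show the arithmetic for the real test set:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Dummy stand-ins for the test labels and model predictions (0 = call, 1 = no call).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

print(classification_report(y_true, y_pred, target_names=["calls", "nocalls"]))

cm = confusion_matrix(y_true, y_pred)
tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]  # "calls" (label 0) is the positive class

accuracy = (tp + tn) / cm.sum()      # real test set: (78 + 91) / 201 = 0.8408
sensitivity = tp / (tp + fn)         # real test set: 78 / 101 = 0.7723
specificity = tn / (tn + fp)         # real test set: 91 / 100 = 0.9100
```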
Thus, since the accuracy of the model without active learning was 82.5% (as we saw in the previous blog post), active learning with an uncertainty range of 0.1 to 0.9 increased the accuracy by 1.5 percentage points, to 84%.
Another task I worked on was developing Python scripts for preprocessing, training, and active learning. Previously, much of this code was embedded in Python notebooks. The links to those scripts can be found here.
Preprocessing script: This script converts the raw audio dataset into spectrograms with the help of a .tsv file specifying the start time, the duration of the call, and the call label. The different types of spectrograms that the script supports are listed below, followed by a short sketch of the Mel spectrogram + PCEN step:
1) Power spectral density spectrograms
2) Grayscale power spectral density spectrograms
3) Mel spectrograms
4) Mel spectrograms with PCEN
5) Mel spectrograms with wavelet denoising
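As a rough illustration of the Mel spectrogram + PCEN step (the file name, sample rate, and parameters below are assumptions, not the script's actual defaults):

```python
import librosa

# Load a short audio clip (path and sample rate are placeholders).
y, sr = librosa.load("call_clip.wav", sr=20000)

# Mel spectrogram followed by per-channel energy normalization (PCEN).
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, power=1)
S_pcen = librosa.pcen(S * (2 ** 31), sr=sr)

# Wavelet denoising would then be applied on top of this PCEN spectrogram;
# one common option is skimage.restoration.denoise_wavelet, though the
# project script may do it differently.
print(S_pcen.shape)
```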
Please take a look at this page for more information.
Training scripts: These are the scripts for building and training the model, predicting, and generating statistics such as a ROC curve.
1) Model building and training: This script is used to build and train the CNN model (a rough sketch of such a model is shown below).
2) Statistics: This script is used to generate a ROC curve.
3) Report: This script generates a report of how the model performed on the test dataset.
4) Model predict: This script predicts whether a spectrogram contains a call.
Please take a look at this page for more information.
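For illustration, a small Keras CNN for binary call / no-call classification could be built roughly like this; the layer sizes and the input shape are assumptions, not the exact architecture used in the project:

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 128, 1)):
    """Tiny CNN sketch for classifying spectrograms as call (0) or no call (1)."""
    model = models.Sequential([
        layers.Conv2D(16, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # probability of "no call"
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```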
Active learning script: This script identifies the uncertain samples that should be labeled by experts like Scott and Val.
After this, I wrote test cases for these scripts and ran them on an Ubuntu EC2 instance to verify that the scripts work correctly. Here is the link to the tests.
The last part I focused on was creating a Dockerfile for the preprocessing script, with which a user can run the preprocessing script and generate the spectrograms. Moreover, I have also created a Dockerfile for generating a report. These containers help the user replicate the steps and reproduce the same results using the models and the preprocessing stage that I used.
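As a rough idea of what such a Dockerfile might contain (the base image, file names, and entrypoint below are illustrative, not the actual repository files):

```dockerfile
# Illustrative sketch only; the real Dockerfile in the repository may differ.
FROM python:3.8-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Hypothetical entrypoint: run the preprocessing script against a mounted data directory.
ENTRYPOINT ["python", "preprocess.py"]
```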
These three months have been an amazing and enjoyable journey, and sometimes a bit frustrating as well. The most enjoyable part was researching the various stages: which preprocessing approach would work best, whether to use a pre-trained model or create a new one from scratch, and how to tune the hyperparameters of the CNN built from scratch to get the best results. I would like to thank my mentors Scott, Jesse, Valentina, Val, and Abhishek for their thorough guidance and support. They have constantly helped us, and I think it was because of their advice and support that we were able to complete this project. They have taught me so many things, such as how to write clean code, plot different machine learning graphs, build a Dockerfile, write tests, use GitHub, and search for best practices. Moreover, they also spent hours correcting the blog posts we wrote, which were full of grammatical errors, so I think they have helped us in every way they could.
There is still further research that could be done: different active learning strategies could be tested to find which works best, different models could be built to beat the accuracy I am currently getting, and pre-trained models could be tried with different datasets.
I hope you found this blog useful, and thanks for reading it.