Semantic Segmentation: Mask R-CNN
Mask R-CNN is an image segmentation model. A backbone convolutional network first produces a feature map for the input image, and a region proposal network uses that feature map to propose candidate object regions. A region-of-interest pooling/alignment layer then extracts a fixed-size feature for each proposal, which is used to classify the object and predict its mask. For our project, we used a network pretrained on the MS COCO dataset and fine-tuned on images from the Berkeley DeepDrive (BDD100K) dataset. The machine learning library used was PyTorch, and the pretrained model came from Facebook's Detectron2.
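As a rough illustration, a COCO-pretrained Mask R-CNN can be loaded through Detectron2's model zoo along the following lines (the specific config name, score threshold, and input filename are assumptions for this sketch; in our setup the weights would instead point to the checkpoint fine-tuned on BDD100K):

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# COCO-pretrained Mask R-CNN config and weights from the Detectron2 model zoo.
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # assumed confidence threshold

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("frame.png"))            # BGR frame from the robot camera
masks = outputs["instances"].pred_masks.cpu().numpy()   # one boolean mask per detection
labels = outputs["instances"].pred_classes.cpu().numpy()
```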


Depth Detection
As mentioned previously, we also have a depth camera on our robot. Because this is a simulation, we can obtain accurate depth information up to 10 meters. For obstacle detection, we use the depth information directly: the pixels corresponding to the labeled object give us an estimate of its depth. We can then alert the user that an object has been detected using an audible cue (which can be customized to convey directional information).
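A minimal sketch of this step, assuming a per-pixel depth image in meters and a boolean mask from the segmentation model (the function name and the use of the median are illustrative, not our exact implementation):

```python
import numpy as np

def object_depth(depth_image, mask, max_range=10.0):
    """Estimate an object's distance from the depth pixels under its segmentation mask."""
    vals = depth_image[mask]                       # depth readings on the object
    vals = vals[(vals > 0) & (vals <= max_range)]  # drop invalid / out-of-range readings
    if vals.size == 0:
        return None                                # no reliable depth for this object
    return float(np.median(vals))                  # median is robust to mask bleed at the edges
```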
For the car ramp example, we use this depth information to determine a waypoint at the base of the truck. Using the camera intrinsics matrix, we back-project the detected pixel and its depth, which gives us a 2D coordinate for the waypoint. To be clear, some hacks were employed here: since we know the length of the ramp, we extrapolate the waypoint to the base of the ramp, and we then use a manual procedure to drive up the ramp with torques supplied to the dynamics model. We hope to remove these workarounds in future implementations.
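The back-projection itself follows the standard pinhole model; a sketch, assuming an intrinsics matrix K and the pixel and depth of the detected ramp base (the names here are ours for illustration):

```python
import numpy as np

def backproject(u, v, depth, K):
    """Back-project pixel (u, v) with its depth into a 3D point in the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# A 2D waypoint can then be read off from the ground-plane components of this point;
# the extrapolation to the base of the ramp uses the known ramp length.
```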
Other Techniques
Iteration is useful
One thing we explored was Canny edge detection for detecting stationary obstacles or regions of interest. Objects in motion are first removed by taking two consecutive shots while the wheelchair is stationary; the two images are then subtracted and thresholded, so that the edge detector only responds to stationary objects. The presence of these objects can be confirmed with the depth camera data described above and relayed back to the user. Empirically, however, lighting was a major problem for our edge-detector implementation, and the algorithm could be thrown off by shadows and reflections. We therefore opted for the semantic segmentation model above due to its higher accuracy (at the cost of efficiency). A sketch of the frame-differencing plus Canny idea is shown below.
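The sketch below uses OpenCV; the difference and Canny thresholds are assumptions and would need tuning for the simulator's lighting:

```python
import cv2

def stationary_edges(frame_a, frame_b, diff_thresh=25):
    """Edge-detect only regions that did not change between two consecutive frames."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Pixels that changed between the two shots belong to moving objects.
    moving = cv2.threshold(cv2.absdiff(gray_a, gray_b), diff_thresh, 255, cv2.THRESH_BINARY)[1]
    edges = cv2.Canny(gray_b, 50, 150)
    # Suppress edges where motion was detected, leaving only stationary structure.
    return cv2.bitwise_and(edges, cv2.bitwise_not(moving))
```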

Future Extension
Using the camera intrinsics, the RGB image, and the depth image, we can create a point cloud in 3D space. This point cloud serves as a map of the environment and was logged to file. The 3D reconstruction could be used for future autonomous navigation in unknown environments (for example, in a house, where a user commonly is). We hope to try this with the house task, where the wheelchair logs a point cloud every few seconds and uses it, together with the path-planning algorithm, to help the user leave this difficult environment.
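A vectorized sketch of the reconstruction, assuming the same pinhole intrinsics as above (how the result is logged to file is left open, e.g. one NumPy array per frame):

```python
import numpy as np

def rgbd_to_point_cloud(rgb, depth, K, max_range=10.0):
    """Back-project every valid depth pixel into a colored 3D point in the camera frame."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = (depth > 0) & (depth <= max_range)  # simulator depth is reliable up to 10 m
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)        # N x 3 positions
    colors = rgb[valid]                         # N x 3 colors from the RGB image
    return points, colors
```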
