AR and ML using Google ARCore, Arm NN in Unity running on mobile devices

Introduction

Augmented reality (AR) and machine learning (ML) are two leading-edge technologies. AR brings virtual objects into the real world by overlaying digital content on top of physical objects. ML, on the other hand, helps a program recognize physical objects in the real world. By combining these two technologies, we can create some innovative projects.

Recently, Arm showcased a demo at Unite Beijing 2018 that combined AR and ML, using a Mali GPU to accelerate the computing tasks. This blog shares our experience of developing our AR and ML demo using Google ARCore and Arm NN in Unity. Below is a video of the end result, and throughout this guide we'll point out the most important steps involved in creating the demo.

You will need a Google ARCore supported device to run our demo. We have tested it on a Samsung Galaxy S8 and S9 and a Huawei P20, all of which have a Mali GPU.

We worked with Arm NN because it can bring a >4x performance boost on Arm Cortex-A CPUs and Mali GPUs. See the "Arm NN for Android" section of the Arm NN SDK webpage for more details.

We use Arm NN with the YOLO v1 tiny model to do object detection:

  • Detect physical objects: Use a deep neural network to detect physical objects from the raw camera input.
  • Classification: The ML model is pre-trained on the COCO dataset and can recognize 80 object categories.
  • Localization: Put a bounding box around each detected object, localizing it in 2D space.

Then we use Google ARCore to handle the AR parts of the demo:

  • Plane detection: Detect horizontal planes, and localize the physical object in 3D by ray casting from 2D onto a detected plane.
  • Motion tracking: Track the camera movement.
  • Anchors: Anchor a specific pose in the real world and keep the virtual content at the same location.

Enable Google ARCore in Unity

First of all, follow these instructions to prepare your hardware and software environments. Try to build and run the HelloAR sample; we will start our project from there. For example, you could name your Unity project "AR Detector".

Change the orientation in "PlayerSettings > Resolution and Presentation > Default Orientation" from Auto Rotation to Landscape Left for a better experience, and set "PlayerSettings > Configuration > Scripting Runtime Version" to Experimental (.NET 4.6 Equivalent).

Copy the HelloAR scene:

  1. Create a Scenes folder in Assets.
  2. Duplicate the HelloAR scene in the "Assets > GoogleARCore > Examples > HelloAR > Scenes" folder.
  3. Move the copy of the HelloAR scene to "Assets > Scenes" and rename it to Main.
  4. Double-click to open the "Assets > Scenes > Main" scene.

You should get these in your Unity project:

Unity Main Scene

The example scene visualizes the detected planes and adds an Andy 3D model when you touch a detected plane on the screen. We need the plane visualizer but not the screen-touch function.

  1. Click on the Example Controller in the Hierarchy tab.
  2. Double-click the HelloARController in the "Inspector > Hello AR Controller (Script) > Script" component to open the C Sharp script.
  3. Inside the Update method, comment out everything after the comment "If the player has not touched the screen, we are done with this update."

Build and run to check whether the touch function has been disabled.

Build Arm NN shared libraries

You'll need to build the Arm NN shared libraries yourself and manually integrate them into Unity as native plugins. To do that, you need to create a standalone NDK toolchain. We use the armeabi-v7a ABI rather than arm64 because Google ARCore and Unity only support armeabi-v7a right now.

Read the instructions about "Building Open Source Projects Using Standalone Toolchains" and use this command to create the standalone toolchain:

$NDK/build/tools/make_standalone_toolchain.py \
  --arch arm \
  --api 26 \
  --stl=libc++ \
  --install-dir=my-toolchain


Then configure and build Arm NN with Caffe parser support using the newly created standalone toolchain. Read the instructions. Please enable the OpenCL option in order to benefit from GPU acceleration. You may also need to build the protobuf shared library for both the host and armeabi-v7a, and set the -DPROTOBUF_ROOT=/path/to/protobuf/armeabi-v7a_install option when compiling Arm NN.
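As a rough guide, the configure step could look something like the sketch below. The option names are taken from the Arm NN build guides of that era and the paths are placeholders you must adapt to your own checkout locations, so treat this as an illustration rather than a definitive recipe:

```shell
# Sketch: cross-compile Arm NN with the standalone toolchain (all paths are placeholders).
export PATH=$PWD/my-toolchain/bin:$PATH
export CXX=arm-linux-androideabi-clang++
export CC=arm-linux-androideabi-clang

mkdir -p armnn/build && cd armnn/build
cmake .. \
  -DCMAKE_SYSTEM_NAME=Android \
  -DARMCOMPUTE_ROOT=/path/to/ComputeLibrary \
  -DARMCOMPUTENEON=1 \
  -DARMCOMPUTECL=1 \
  -DBUILD_CAFFE_PARSER=1 \
  -DCAFFE_GENERATED_SOURCES=/path/to/caffe/build/src \
  -DPROTOBUF_ROOT=/path/to/protobuf/armeabi-v7a_install
make -j"$(nproc)"
```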

After doing that, you should have "libarmnn.so", "libarmnnCaffeParser.so", and "UnitTests" in the build directory, and you should be able to push and run the UnitTests on your Android phone.

A CPP Object Detector as a Native Plugin for Unity

Our demo uses Arm NN to perform object detection. We chose the YOLO v1 tiny model pre-trained on the COCO dataset, which can be downloaded from GitHub. Please download the YOLO CoCo tiny model from the "Legacy models" section and push it to your Android device:

adb shell mkdir -p /mnt/sdcard/Android/data/com.yourCompany.ARDetector/files/
adb push coco_tiny.caffemodel /mnt/sdcard/Android/data/com.yourCompany.ARDetector/files/

Since we need to call the C++ API which Arm NN provided, we could use the Native Plugins feature in Unity. See the Unity documentation for more detail. We implemented a shared library named "libyoloDetector.so" and exported two C APIs for Unity to use.

The initDetector C API is used to load the machine learning model and initialize the Arm NN network. It should be called when the app starts.


// Optimize the network for a specific runtime compute device, e.g. CpuAcc, GpuAcc
static armnn::IRuntimePtr s_Runtime = armnn::IRuntime::Create(armnn::Compute::GpuAcc);
static armnn::NetworkId s_NetworkIdentifier;

static std::pair<armnn::LayerBindingId, armnn::TensorInfo> s_InputBindingInfo;
static std::pair<armnn::LayerBindingId, armnn::TensorInfo> s_OutputBindingInfo;

static float *s_OutputBuffer;
static char k_ModelFileName[] = "/mnt/sdcard/Android/data/com.yourCompany.ARDetector/files/coco_tiny.caffemodel";
static char k_InputTensorName[] = "data";
static char k_OutputTensorName[] = "result";
const unsigned int k_YoloImageWidth = 448;
const unsigned int k_YoloImageHeight = 448;
const unsigned int k_YoloChannelNums = 3;
const unsigned int k_YoloImageBatchSize = 1;
const unsigned int k_YoloOutputSize = 7 * 7 * (5 * 3 + 80);

extern "C" __attribute__ ((visibility ("default")))
void initDetector()
{
 auto parser = armnnCaffeParser::ICaffeParser::Create();

 auto network = parser->CreateNetworkFromBinaryFile(
 k_ModelFileName,
 { {k_InputTensorName, {k_YoloImageBatchSize, k_YoloChannelNums, k_YoloImageHeight, k_YoloImageWidth}} },
 { k_OutputTensorName });

 // Find the binding points for the input and output nodes
 s_InputBindingInfo = parser->GetNetworkInputBindingInfo(k_InputTensorName);
 s_OutputBindingInfo = parser->GetNetworkOutputBindingInfo(k_OutputTensorName);

 armnn::IOptimizedNetworkPtr optNet =
 armnn::Optimize(*network, s_Runtime->GetDeviceSpec());

 // Load the optimized network onto the runtime device
 armnn::Status ret = s_Runtime->LoadNetwork(s_NetworkIdentifier, std::move(optNet));
 if (ret == armnn::Status::Failure)
 {
 throw armnn::Exception("IRuntime::LoadNetwork failed");
 }

 s_OutputBuffer = (float*)malloc(sizeof(float) * k_YoloOutputSize);
}


And the detectObjects C API is used to continuously detect objects from the camera raw data.


// Helper function to make input tensors
armnn::InputTensors MakeInputTensors(const std::pair<armnn::LayerBindingId,
 armnn::TensorInfo>& input,
 const void* inputTensorData)
{
 return { { input.first, armnn::ConstTensor(input.second, inputTensorData) } };
}
 
// Helper function to make output tensors
armnn::OutputTensors MakeOutputTensors(const std::pair<armnn::LayerBindingId,
 armnn::TensorInfo>& output,
 void* outputTensorData)
{
 return { { output.first, armnn::Tensor(output.second, outputTensorData) } };
}
 
extern "C" __attribute__ ((visibility ("default")))
int detectObjects(float *inputPtr, float *result)
{
 float *outputPtr = s_OutputBuffer;
 armnn::Status ret = s_Runtime->EnqueueWorkload(s_NetworkIdentifier,
 MakeInputTensors(s_InputBindingInfo, inputPtr),
 MakeOutputTensors(s_OutputBindingInfo, outputPtr));
 if (ret == armnn::Status::Failure)
 {
  throw armnn::Exception("IRuntime::EnqueueWorkload failed");
 }

 return ParseOutputTensorsYoloV1(outputPtr, result);
}
You may need to implement the ParseOutputTensorsYoloV1 function yourself. GitHub has some useful code snippets that may help you implement the YOLO v1 parser.
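If you want a starting point, here is a minimal sketch of what such a parser could look like. The output layout assumed here (class probabilities first, then box confidences, then box coordinates, with box centres relative to the grid cell and width/height predicted as square roots) follows the original YOLO v1 paper and is an assumption; verify it against the actual Caffe model you downloaded. The 6-floats-per-detection result format matches the RESULT_SIZE = 7 * 7 * 6 buffer used on the C# side.

```cpp
// A minimal sketch of a YOLO v1 output parser (layout is an assumption;
// check it against your exported model).
static const unsigned int S = 7;   // grid size
static const unsigned int C = 80;  // COCO classes
static const unsigned int B = 3;   // boxes per cell
static const float k_ScoreThreshold = 0.25f;

// Writes up to S*S detections of 6 floats each (x, y, w, h, score, classId)
// into result; returns the number of detections found.
int ParseOutputTensorsYoloV1(const float* output, float* result)
{
    const float* classProbs  = output;                   // S*S*C values
    const float* confidences = output + S * S * C;       // S*S*B values
    const float* boxes       = output + S * S * (C + B); // S*S*B*4 values

    int numDetections = 0;
    for (unsigned int cell = 0; cell < S * S; ++cell)
    {
        // Pick the most likely class for this cell.
        unsigned int bestClass = 0;
        for (unsigned int c = 1; c < C; ++c)
        {
            if (classProbs[cell * C + c] > classProbs[cell * C + bestClass])
                bestClass = c;
        }
        // Pick the most confident box for this cell.
        unsigned int bestBox = 0;
        for (unsigned int b = 1; b < B; ++b)
        {
            if (confidences[cell * B + b] > confidences[cell * B + bestBox])
                bestBox = b;
        }
        float score = confidences[cell * B + bestBox] * classProbs[cell * C + bestClass];
        if (score < k_ScoreThreshold)
            continue;

        const float* box = boxes + (cell * B + bestBox) * 4;
        float* out = result + numDetections * 6;
        out[0] = ((cell % S) + box[0]) / S;  // x centre as a fraction of image width
        out[1] = ((cell / S) + box[1]) / S;  // y centre as a fraction of image height
        out[2] = box[2] * box[2];            // YOLO v1 predicts sqrt(width)
        out[3] = box[3] * box[3];            // YOLO v1 predicts sqrt(height)
        out[4] = score;
        out[5] = (float)bestClass;
        ++numDetections;
    }
    return numDetections;
}
```

A production parser would also apply non-maximum suppression across cells; this sketch simply keeps the best box per cell above the threshold.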

Use the NDK standalone toolchain to compile the above code and generate the "libyoloDetector.so" shared library. To call it from Unity, create a folder named "Assets > Plugins > Android" in your Unity project and copy the armeabi-v7a shared libraries into it. Here are the libraries we copied:

  • libarmnn.so
  • libarmnnCaffeParser.so
  • libprotobuf.so
  • libc++_shared.so
  • libyoloDetector.so

Integrate Arm NN into Unity

Let's switch back to the Unity project we created. We need the camera raw data to feed into the object detection model, but Google ARCore has already taken control of the camera. Fortunately, the ARCore team anticipated that other programs may want to access the camera raw data as well, and provided an example for this. See the "Assets > GoogleARCore > Examples > ComputerVision" example for more detail. We can use the TextureReader script to do the same thing in our example.

  1. Select the Main scene in the Hierarchy tab.
  2. Select "GameObject > Create Empty" to create an empty game object.
  3. Rename the empty game object to "ArmNNCaffeParserController".
  4. Click the "Add Component" button in the Inspector.
  5. Search for "Texture Reader" and add it to the "ArmNNCaffeParserController".
  6. Change the Image Width and Height to 448.
  7. Change "Image Sample Mode" to "Keep Aspect Ratio".
  8. Change "Image Format" to "Image Format Color".

You should get this in the end.

Texture Reader scene

Create two C Sharp scripts for this demo.

  1. Create an "Assets > Scripts" directory.
  2. Create two C Sharp scripts named "ArmNNCaffeParserController.cs" and "ArmNNCaffeDetector.cs".
  3. Select "ArmNNCaffeParserController" in the "Main" scene.
  4. Use "Add Component", search for "Arm NN Caffe Parser Controller" and add it to the controller.

In the "ArmNNCaffeDetector.cs" script, call the initDetector native function in the constructor.

private static int INPUT_SIZE = 448;
private static int RESULT_SIZE = 7 * 7 * 6;
private float[] fetchResults = new float[RESULT_SIZE];
 
[DllImport ("yoloDetector")]
private static extern void initDetector();
 
public ArmNNCaffeDetector()
{
 inPtr = Marshal.AllocHGlobal(3 * INPUT_SIZE * INPUT_SIZE * sizeof(float));
 outPtr = Marshal.AllocHGlobal(RESULT_SIZE * sizeof(float));
 initDetector();
}

And call the detectObjects native function in the newly created DetectAsync method.

[DllImport ("yoloDetector")]
private static extern int detectObjects(IntPtr input, IntPtr output);
 
public Task<List<KeyValuePair<DetectResults, float>>> DetectAsync(byte[] camImage)
{
 return Task.Run(() =>
 {
 // Prepare input here
 ...
 Marshal.Copy(inputBuffer, 0, inPtr, inputBuffer.Length);
 int detectObjectNums = detectObjects(inPtr, outPtr);
 Marshal.Copy(outPtr, fetchResults, 0, RESULT_SIZE);
 ...
 // Parse and return the results here
 });
}


Before the detectObjects call, you may need to convert the camera data to the channel ordering that Arm NN expects (C, H, W). Here is the code snippet to do that.

float[] inputBuffer = new float[INPUT_SIZE * INPUT_SIZE * 3];
int h = INPUT_SIZE;
int w = INPUT_SIZE;
int c = 4;
for (int j = 0; j < h; ++j)
{
 for (int i = 0; i < w; ++i)
 {
 int r, g, b;
 r = camImage[j * w * c + i * c + 0];
 g = camImage[j * w * c + i * c + 1];
 b = camImage[j * w * c + i * c + 2];
 
 // ArmNN order: C, H, W
 int rDstIndex = 0 * h * w + j * w + i;
 int gDstIndex = 1 * h * w + j * w + i;
 int bDstIndex = 2 * h * w + j * w + i;
 
 inputBuffer[rDstIndex] = (float)r/255.0f;
 inputBuffer[gDstIndex] = (float)g/255.0f;
 inputBuffer[bDstIndex] = (float)b/255.0f;
 }
}

In the "ArmNNCaffeParserController.cs" script, instantiate the ArmNNCaffeDetector class and set up a callback function for the TextureReader.

TextureReader TextureReaderComponent;
private ArmNNCaffeDetector detector;
 
private int m_ImageWidth = 0;
private int m_ImageHeight = 0;
private byte[] m_CamImage = null;
 
private bool m_IsDetecting = false;
 
void Start () {
 this.detector = new ArmNNCaffeDetector();
 
 TextureReaderComponent = GetComponent<TextureReader> ();
 
 // Registers the TextureReader callback.
 TextureReaderComponent.OnImageAvailableCallback += OnImageAvailable;
 Screen.sleepTimeout = SleepTimeout.NeverSleep;
}


Implement OnImageAvailable to receive the camera data and call the ArmNNDetect method.

public void OnImageAvailable(TextureReaderApi.ImageFormatType format, int width, int height, IntPtr pixelBuffer, int bufferSize)
{
 if (format != TextureReaderApi.ImageFormatType.ImageFormatColor)
 {
 Debug.Log("No object detected due to incorrect image format.");
 return;
 }
 
 if (m_IsDetecting) {
 return;
 }
 
 if (m_CamImage == null || m_ImageWidth != width || m_ImageHeight != height)
 {
 m_CamImage = new byte[width * height * 4];
 m_ImageWidth = width;
 m_ImageHeight = height;
 }
 System.Runtime.InteropServices.Marshal.Copy(pixelBuffer, m_CamImage, 0, bufferSize);
 
 m_IsDetecting = true;
 Invoke(nameof(ArmNNDetect), 0f);
}

Call the DetectAsync method in the ArmNNDetect method.

private async void ArmNNDetect()
{
 var probabilities_and_bouding_boxes = await this.detector.DetectAsync (m_CamImage);
 ...
 // Visualize the bounding boxes and probabilities to the screen
 // Use "Frame.Raycast" which ARCore provided to find the 3D pose of the detected objects.
 // And render a related virtual object at the pose.
}

The DetectAsync method returns the probabilities and bounding boxes for the detected objects. After that, you can do whatever you want with them, e.g. visualize the bounding boxes or place some virtual content near the physical objects.

How do you use the "Frame.Raycast" function to get the 3D pose of the detected objects? Remember the code you commented out in the Update method of the "HelloARController.cs" script? You can refer to that code, using the bounding box coordinates instead of the touch point coordinates.

// Raycast against the location of the detected object to search for planes.
TrackableHit hit;
TrackableHitFlags raycastFilter = TrackableHitFlags.PlaneWithinPolygon |
 TrackableHitFlags.FeaturePointWithSurfaceNormal;
 
if (Frame.Raycast(boundingbox.position.x, boundingbox.position.y, raycastFilter, out hit))
{
 var andyObject = Instantiate(AndyAndroidPrefab, hit.Pose.position, hit.Pose.rotation);
 
 // Create an anchor to allow ARCore to track the hitpoint as understanding of the physical
 // world evolves.
 var anchor = hit.Trackable.CreateAnchor(hit.Pose);
 
 // Andy should look at the camera but still be flush with the plane.
 if ((hit.Flags & TrackableHitFlags.PlaneWithinPolygon) != TrackableHitFlags.None)
 {
 // Get the camera position and match the y-component with the hit position.
 Vector3 cameraPositionSameY = FirstPersonCamera.transform.position;
 cameraPositionSameY.y = hit.Pose.position.y;
 
 // Have Andy look toward the camera respecting his "up" perspective, which may be from ceiling.
 andyObject.transform.LookAt(cameraPositionSameY, andyObject.transform.up);
 }
 
 // Make Andy model a child of the anchor.
 andyObject.transform.parent = anchor.transform;
}

There we have it. You should have your own AR/ML demo up and running! Did you do something differently? Why not share it with us in the comments?

Arm NN SDK is a free of charge set of open-source Linux software and tools that enables machine learning workloads on power-efficient devices. It provides a bridge between existing neural network frameworks and power-efficient Arm Cortex CPUs, Arm Mali GPUs or the Arm Machine Learning processor.

Learn more about Arm NN SDK

Anonymous
Graphics & Multimedia blog