The fast Walsh-Hadamard transform (WHT) provides a path to fast random projections, which in turn lead to locality-sensitive hashing, unbiased dimensionality reduction or expansion, compressive sensing, associative memory, extreme learning machines, and reservoir computing.
I hope you have code for it in your HPC/AI libraries!
The simplest way to get a random projection is to apply a fixed, predetermined pattern of random sign flips to an array of data, followed by the WHT. Simply by binarizing the output you have a locality-sensitive hash. If the bit outputs are +1/-1, then by weighting each one and summing you get a recalled value. To train: recall, calculate the error, and divide it by the number of bits. Then add or subtract that amount from each weight, as appropriate, to drive the error to zero. That gives an associative memory whose capacity is just short of the number of bits.
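As a concrete illustration, here is a minimal NumPy sketch of that scheme. The names (`fwht`, `project`, `lsh`, `recall`, `train`), the unnormalized in-place transform, and the single-output weight vector are my own assumptions for the example, not anything from a particular library:

```python
import numpy as np

def fwht(x):
    # In-place, unnormalized fast Walsh-Hadamard transform; len(x) must be a power of 2.
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
n = 16
signs = rng.choice([-1.0, 1.0], size=n)     # fixed random sign-flip pattern

def project(v):
    # Random projection: sign flips followed by the WHT.
    return fwht(signs * v)

def lsh(v):
    # Binarize the projection to get +1/-1 locality-sensitive hash bits.
    return np.where(project(v) >= 0, 1, -1)

weights = np.zeros(n)

def recall(v):
    # Weight each hash bit and sum to get the recalled value.
    return float(np.dot(lsh(v), weights))

def train(v, target):
    # Recall, compute the error, divide by the number of bits,
    # and adjust each weight so the error for this pattern goes to zero.
    bits = lsh(v)
    err = recall(v) - target
    weights[:] = weights - bits * (err / n)
```

Training on a pattern once drives its recall error to exactly zero, because the +1/-1 bit vector has squared length n, which cancels the division by n.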
Instead of random sign flipping you can use weighting to get more sophisticated projections.
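One way to read "weighting" (my assumption, since the original does not spell it out) is to replace the +1/-1 diagonal with a random Gaussian diagonal d before the WHT, i.e. y = H (d * v). A hedged sketch:

```python
import numpy as np

def fwht(x):
    # In-place, unnormalized fast Walsh-Hadamard transform; len(x) a power of 2.
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(1)
n = 8
d = rng.standard_normal(n)          # random Gaussian diagonal instead of +/-1 signs

def weighted_projection(v):
    # Diagonal weighting followed by the transform: y = H @ (d * v).
    return fwht(np.asarray(d * v, dtype=float))
```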
At present this is not a function that is provided in Arm Performance Libraries. It's not a function I was previously aware of, and I can see the similarities to the FFT functions that are provided. Arm Performance Libraries, alongside its BLAS and LAPACK functionality, does provide complete coverage of FFTW's DFT interfaces. It does not, however, at present provide the real-to-real functionality, such as the cosine transforms, that FFTW also offers.
To solve these cases on Arm, I think open-source FFTW is currently the best option. From the results they show, the FALCONN-LIB you link to performs significantly faster on x86 than the FFTW implementation, so getting them to port from AVX to NEON would be a good plan.
The out-of-place algorithm for the WHT is very simple. Step through the input data elements pairwise, writing each sum sequentially into the lower half of a new array and each difference sequentially into the upper half. Then repeat that process a total of log2(n) times.
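The steps above can be sketched in a few lines of NumPy (assuming an unnormalized transform and n a power of two; the function name is mine):

```python
import numpy as np

def fwht_out_of_place(x):
    # Each pass: pairwise sums go sequentially into the lower half of a
    # fresh array, pairwise differences into the upper half.
    # Repeat log2(n) times to complete the transform.
    x = np.asarray(x, dtype=float)
    n = len(x)
    for _ in range(n.bit_length() - 1):     # log2(n) passes
        out = np.empty_like(x)
        out[: n // 2] = x[0::2] + x[1::2]   # sums -> lower half
        out[n // 2 :] = x[0::2] - x[1::2]   # differences -> upper half
        x = out
    return x
```

With this pairing scheme the output comes out in natural (Hadamard) order, matching the Sylvester construction of the Hadamard matrix.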
Usually it is fastest to use the out-of-place algorithm 'in register' up to the limit the register file allows, and then switch to the in-place algorithm. It also helps to be cache aware and do the in-place algorithm in chunks that fit in the L1 and L2 caches.
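For reference, a sketch of the in-place counterpart is the usual butterfly loop. The stage stride h makes the cache point visible: late stages touch elements far apart, which is why processing in cache-sized chunks helps on real hardware.

```python
import numpy as np

def fwht_in_place(x):
    # In-place, unnormalized WHT butterflies. At each stage, elements
    # h apart are combined; h doubles every stage, so the final stages
    # stride across the whole array (the cache-unfriendly part).
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x
```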
I don't know if Arm has an equivalent of the x86 horizontal add and subtract SIMD instructions (haddps, hsubps); those are useful here.
Anyway, I would say having GPU code is a better idea. I see Nvidia has a Jetson Nano board for $99 that combines an Arm CPU with their GPU. I guess that is a ready-made solution, though I'd like to inspect their WHT code to see if it could be improved.