Hello,
I am developing an instruction-level analysis system for unknown ARM firmware binaries. For analysis, I am using Ghidra to disassemble firmware.
The problem/question
To import a binary into Ghidra, I need to specify information about the file, in the following format: "architecture:endianness:32_or_64_bit:variant". For example, "ARM:LE:32:Cortex".
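To make the format concrete, here is a minimal sketch of splitting and rebuilding such an identifier. The `LanguageId` class and its field names are hypothetical helpers for illustration and are not part of Ghidra; only the four-field colon-separated layout comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageId:
    """Hypothetical container for the four colon-separated fields."""
    architecture: str
    endianness: str
    bits: int
    variant: str

    @classmethod
    def parse(cls, text: str) -> "LanguageId":
        # Expect exactly "architecture:endianness:bits:variant".
        parts = text.split(":")
        if len(parts) != 4:
            raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}: {text!r}")
        architecture, endianness, bits, variant = parts
        return cls(architecture, endianness, int(bits), variant)

    def __str__(self) -> str:
        # Rebuild the identifier in the same order it was parsed.
        return f"{self.architecture}:{self.endianness}:{self.bits}:{self.variant}"


lang = LanguageId.parse("ARM:LE:32:Cortex")
print(lang.architecture, lang.endianness, lang.bits, lang.variant)
```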
As the binaries are unknown, I do not know this information for certain, so my goal is to choose the most all-encompassing variant for each architecture type, i.e. the one that supports the largest set of features/instructions. My question is, which variant would that be?
The tool I use for architecture detection, cpu_rec, reports one of the following for ARM binaries: ARM64, ARMeb, ARMel, or ARMhf. This means that I can handle ARMv7 and ARMv8 separately.
Importing ARMeb, ARMel, ARMhf (ARMv7)
These labels seem to follow Debian's naming conventions. Based on that, I understand ARMhf to be ARMv7 with VFPv3-D16 and Thumb-2 support, little-endian; ARMel to be ARMv7 without extensions, little-endian; and ARMeb to be ARMv7 without extensions, big-endian. Since ARMv7 is 32-bit, the only thing left to specify is the variant.
For little-endian, Ghidra gives me the options of using a generic ARM/Thumb v7 variant, or a Cortex variant. As far as I understand (please correct me if I'm wrong), "Cortex" refers only to the processor group for ARMv7. So I would think it's best to use the Cortex variant here, because it should be a superset of ARMv7, and I want to recognize any processor-specific instructions if they're present. But I couldn't find a clear document explaining the differences between Cortex and other ARMv7 variants.
For big-endian, I can also choose between Cortex and generic, so I'd also go with the Cortex variant, for the same reason.
One other thing I'm not sure about is whether the Cortex variant implies a different ARMv7 profile than the generic variant, and how much of an effect that would have on analysis, as the different profiles seem quite disjoint in terms of supported functionality.
There is also a third option, with little-endian instructions and big-endian data. It seems to me that for my instruction-level analysis, that doesn't make much difference, but perhaps it could help with disassembly, to differentiate between code and data when the data is indeed big-endian.
Importing ARM64 (ARMv8, AArch64)
For ARMv8, which gets separately recognized as "ARM64", I don't get information about the endianness. However, Ghidra's options for ARMv8 are all little-endian for instructions. The difference is in the endianness of data; I don't have enough experience to judge how much this matters for analysis and disassembly. What do you think?
Furthermore, I can also choose between a generic application-profile variant and an ILP32 variant. From what I read, ILP32 uses a different ABI, one which is 32-bit instead of 64-bit. My intuition here is that this could have a larger effect on disassembly, and ILP32 might actually be used for the devices targeted by firmware, so it might be better to try both.
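The selection logic described in the sections above could be sketched roughly as follows: each cpu_rec label maps to an ordered list of candidate language IDs to try. The language ID strings in the table are my assumption of what Ghidra ships and should be checked against the .ldefs files of the installed version; the function names, the script path, and the file names are hypothetical. The -import and -processor options are those of Ghidra's headless analyzer.

```python
# Candidate Ghidra language IDs per cpu_rec label, most preferred first.
# ASSUMPTION: these ID strings must be verified against the installed
# Ghidra's ARM and AARCH64 .ldefs files; they are not taken from the post.
CANDIDATES = {
    "ARMel": ["ARM:LE:32:Cortex", "ARM:LE:32:v7", "ARM:LEBE:32:v7LEInstruction"],
    "ARMhf": ["ARM:LE:32:Cortex", "ARM:LE:32:v7", "ARM:LEBE:32:v7LEInstruction"],
    "ARMeb": ["ARM:BE:32:Cortex", "ARM:BE:32:v7"],
    # cpu_rec gives no endianness for ARM64, so both data endiannesses
    # and the ILP32 variants are listed as things to try.
    "ARM64": [
        "AARCH64:LE:64:v8A",
        "AARCH64:BE:64:v8A",
        "AARCH64:LE:32:ilp32",
        "AARCH64:BE:32:ilp32",
    ],
}


def candidate_languages(cpu_rec_label):
    """Return the language IDs to try for a cpu_rec label (empty if unknown)."""
    return list(CANDIDATES.get(cpu_rec_label, []))


def headless_command(analyze_headless, project_dir, project_name, binary, language_id):
    """Build (but do not run) an analyzeHeadless argument list for one import."""
    return [
        analyze_headless, project_dir, project_name,
        "-import", binary,
        "-processor", language_id,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for lang in candidate_languages("ARM64"):
        print(" ".join(headless_command(
            "support/analyzeHeadless", "/tmp/proj", "fw", "firmware.bin", lang)))
```

Trying each candidate in turn and comparing the results (for example, how many bytes decode as valid instructions) would be one way to choose between variants when the label alone is not enough.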
Summary
Does this seem reasonable? Would you have any suggestions?
Thank you.
Arm IP is the product of Arm, and we should protect it. This technical forum exists to help our partners use our IP.
It looks like your reverse-engineering goal is not suitable for this technical forum.