Reverse engineering ARM firmware: selecting an architecture with a superset of features when importing ARMv7/ARMv8 firmware into Ghidra

Hello,

I am developing an instruction-level analysis system for unknown ARM firmware binaries. For analysis, I am using Ghidra to disassemble firmware.

The problem/question

To import a binary into Ghidra, I need to specify information about the file, in the following format: "architecture:endianness:32_or_64_bit:variant". For example, "ARM:LE:32:Cortex".
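To make the four fields concrete, here is a minimal Python sketch that splits such a string into its parts. The `LanguageId` class and its field names are hypothetical helpers for illustration only and are not part of Ghidra's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageId:
    """Hypothetical container for the four colon-separated fields."""
    architecture: str
    endianness: str
    size: int
    variant: str

    @classmethod
    def parse(cls, text: str) -> "LanguageId":
        parts = text.split(":")
        if len(parts) != 4:
            raise ValueError(
                f"expected 4 colon-separated fields, got {len(parts)}: {text!r}"
            )
        architecture, endianness, size, variant = parts
        return cls(architecture, endianness, int(size), variant)

    def __str__(self) -> str:
        # Rebuild the original "architecture:endianness:size:variant" form.
        return f"{self.architecture}:{self.endianness}:{self.size}:{self.variant}"


lang = LanguageId.parse("ARM:LE:32:Cortex")
print(lang.architecture, lang.endianness, lang.size, lang.variant)
```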

As the binaries are unknown, I do not know this information for certain, so my goal is to choose the most all-encompassing version for each architecture type, i.e. the one with the largest superset of supported features/instructions. My question is, what would this architecture type be?

The tool I use for architecture detection, cpu_rec, will for ARM binaries report: ARM64, ARMeb, ARMel, or ARMhf. This means that I can handle ARMv7 and ARMv8 separately.

Importing ARMeb, ARMel, ARMhf (ARMv7)

cpu_rec seems to use Debian's naming conventions. Based on that, I understand ARMhf to be ARMv7 with VFPv3-D16 and Thumb-2 support, little-endian; ARMel to be ARMv7 without extensions, little-endian; and ARMeb to be ARMv7 without extensions, big-endian. Since ARMv7 is 32-bit, the only thing left to specify is the variant.

For little-endian, Ghidra gives me the options of using a generic ARM/Thumb v7 variant, or a Cortex variant. As far as I understand (please correct me if I'm wrong), "Cortex" refers only to the processor group for ARMv7. So I would think it's best to use the Cortex variant here, because it should be a superset of ARMv7, and I want to recognize any processor-specific instructions if they're present. But I couldn't find a clear document explaining the differences between Cortex and other ARMv7 variants.

For big-endian, I can also choose between Cortex and generic, so I'd also go with the Cortex variant, for the same reason.

One other thing I'm not sure about is whether the Cortex variant implies a different ARMv7 profile than the generic variant, and how much of an effect that would have on analysis, as the different profiles seem quite disjoint in terms of supported functionality.

There is also a third option, with little-endian instructions and big-endian data. It seems to me that for my instruction-level analysis, that doesn't make much difference, but perhaps it could help with disassembly, to differentiate between code and data when the data is indeed big-endian.
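The ARMv7 choices discussed in this section could be sketched as a small lookup, assuming my reading of the cpu_rec labels above is right. The table and the `armv7_language_id` function are hypothetical names, and the resulting strings should be checked against the language list of the installed Ghidra version.

```python
# cpu_rec label -> (endianness field, word size), following the
# interpretation of the labels given in this post (an assumption).
CPU_REC_TO_FIELDS = {
    "ARMhf": ("LE", 32),
    "ARMel": ("LE", 32),
    "ARMeb": ("BE", 32),
}


def armv7_language_id(label: str, variant: str = "Cortex") -> str:
    """Build an "ARM:<endianness>:<size>:<variant>" string for a cpu_rec label."""
    endianness, size = CPU_REC_TO_FIELDS[label]
    return f"ARM:{endianness}:{size}:{variant}"


for label in CPU_REC_TO_FIELDS:
    print(label, "->", armv7_language_id(label))
```

The mixed-endian option (little-endian instructions, big-endian data) is left out here because it does not correspond to any of the three cpu_rec labels.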

Importing ARM64 (ARMv8, AArch64)

For ARMv8, which cpu_rec recognizes separately as "ARM64", I don't get any information about the endianness. However, all of Ghidra's options for ARMv8 use little-endian instructions; they differ only in the endianness of data. I don't have enough experience to judge how much this matters for analysis and disassembly. What do you think?

Furthermore, I can choose between a generic application-profile variant and an ILP32 variant. From what I read, ILP32 uses a different ABI, one in which pointers and longs are 32-bit instead of 64-bit. My intuition is that this could have a larger effect on disassembly, and ILP32 might actually be used for the devices targeted by firmware, so it might be better to try both.

Summary

  • For ARMv7 (ARMeb, ARMel, ARMhf) binaries, I would load them as ARM Cortex 32-bit, to support any potential Cortex-specific instructions.
  • For ARMv8 (ARM64) binaries, I would load them as AARCH64 64-bit little-endian, as there will probably be few (if any) binaries built for the ILP32 ABI. However, it might be worthwhile to also try the ILP32 variant.
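
Trying more than one variant for the same binary could be scripted roughly as follows. This is only a sketch: it assumes Ghidra's headless analyzer is invoked as `analyzeHeadless` with its `-import`, `-loader` and `-processor` options, that raw firmware is loaded with `BinaryLoader`, and that the two AArch64 language IDs shown match those in the installed Ghidra's language list. The function name, file name and project directory are hypothetical.

```python
from pathlib import Path


def headless_commands(firmware, language_ids, project_dir,
                      ghidra_headless="analyzeHeadless"):
    """Return one analyzeHeadless argument list per candidate language ID."""
    commands = []
    for language_id in language_ids:
        # Use a separate project per candidate so results do not collide.
        project_name = f"{firmware.stem}_{language_id.replace(':', '_')}"
        commands.append([
            ghidra_headless, str(project_dir), project_name,
            "-import", str(firmware),
            "-loader", "BinaryLoader",     # raw binary, no file format header
            "-processor", language_id,
        ])
    return commands


# Assumed language IDs for the generic and ILP32 AArch64 variants.
aarch64_candidates = ["AARCH64:LE:64:v8A", "AARCH64:LE:32:ilp32"]

for cmd in headless_commands(Path("firmware.bin"), aarch64_candidates,
                             Path("ghidra_projects")):
    print(" ".join(cmd))
```

Each argument list could then be passed to `subprocess.run`, and the resulting disassemblies compared to see which variant decodes the most code cleanly.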

Does this seem reasonable? Would you have any suggestions?

Thank you.