AArch64/GICv3:ICC_SGI1R_EL1: AFF1

I wonder: is AFF1 in ICC_SGI1R_EL1 also a bit mask, or does it address the cluster directly?

That is, does AFF1 == 3 address cluster 3, or clusters 0 and 1?

  • I think that, from the perspective of the system software, it doesn't matter. It needs to paste into the SGI register's aff{1,2,3} fields the values copied from the target's mpidr_el1. Only the mpidr_el1.aff0 field is exposed/expanded as a bitmask inside the SGI register. The register interface forces the software to treat aff0 differently from the other affinity fields.

    One can see, for instance, how OpenBSD treats the affinity fields in its function agintc_send_ipi.

    But, since the aff1 field is 8 bits wide, I think that an implementation with <=8 clusters can choose the cluster IDs as some subset of {1,2,4,8,16,32,64,128}, thus in effect turning aff1 into a bitmask. The benefit? Probably that the GIC implementation can do without the decoder circuit otherwise needed to select a particular cluster. I don't know how significant such a benefit is in practice. It is likely that reducing the number of bits implemented for aff1 is better than keeping it at its full 8-bit width, as is done, for example, in the Cortex-A77 described below.

    Another point: aff1 isn't always a cluster ID. For instance, on the Cortex-A77, mpidr_el1.aff1 holds the core ID and mpidr_el1.aff0 holds the thread ID within a core. The A77 implements only 3 bits of mpidr_el1.aff1 (any value is a direct address) and fixes mpidr_el1.aff0 to 0x00. An A77 multi-core processor can thus contain at most 8 cores, each of which is single-threaded.

    It is also interesting to see the 4-bit Range Selector (RS) field in the SGI register: it selects which block of 16 aff0 values the 16-bit TargetList refers to, so that the TargetList can reach any of 256 aff0 values, 16 at a time.

    Edit: I just realized that you may actually have been asking whether multiple clusters can be selected in a single register write. The above assumes that we cannot select more than one cluster in a single write.
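
    To make the "paste aff{1,2,3}, expand aff0" step concrete, here is a minimal sketch in C. The helper names mpidr_aff and sgi1r_for_pe are made up for illustration; the field positions (Aff3 at [55:48], RS at [47:44], Aff2 at [39:32], INTID at [27:24], Aff1 at [23:16], TargetList at [15:0]) are the ones I understand the GICv3 specification to define. It targets a single PE and assumes that a non-zero RS is only used on implementations that report range-selector support.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* MPIDR_EL1 affinity fields: Aff0 [7:0], Aff1 [15:8], Aff2 [23:16], Aff3 [39:32]. */
    static inline uint64_t mpidr_aff(uint64_t mpidr, unsigned level)
    {
        static const unsigned shift[4] = { 0, 8, 16, 32 };
        assert(level < 4);
        return (mpidr >> shift[level]) & 0xff;
    }

    /* Hypothetical helper: ICC_SGI1R_EL1 value that targets exactly one PE. */
    static inline uint64_t sgi1r_for_pe(uint64_t mpidr, unsigned intid)
    {
        uint64_t aff0 = mpidr_aff(mpidr, 0);

        assert(intid < 16);                        /* SGIs are INTIDs 0..15 */
        return (mpidr_aff(mpidr, 3) << 48)         /* Aff3: copied as-is */
             | ((aff0 >> 4) << 44)                 /* RS: which block of 16 Aff0 values */
             | (mpidr_aff(mpidr, 2) << 32)         /* Aff2: copied as-is */
             | ((uint64_t)intid << 24)             /* INTID */
             | (mpidr_aff(mpidr, 1) << 16)         /* Aff1: copied as-is, a direct address */
             | (1ull << (aff0 & 0xf));             /* TargetList: Aff0 expanded to one bit */
    }
    ```

    With this layout, a target whose mpidr_el1 has aff1 == 3 and aff0 == 1 gets Aff1 = 3 in the register and bit 1 set in TargetList, i.e. aff1 is an address and only aff0 becomes a mask. On real hardware the value would then be written with an msr to ICC_SGI1R_EL1, which this host-side sketch leaves out.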

  • I think that, from the perspective of the system software, it doesn't matter. It needs to paste into the SGI register's aff{1,2,3} fields the values copied from the target's mpidr_el1. Only the mpidr_el1.aff0 field is exposed/expanded as a bitmask inside the SGI register. The register interface forces the software to treat aff0 differently from the other affinity fields.

    The linked PDF is clearer than the other descriptions I found.

    So it really seems that one can send an IPI either to several cores within one cluster or, with IRM == 1, to all cores but the sender, but not to selected cores in different clusters.

    Thanks for the hint about MT, but it seems the only core so far is the CA65, which has two threads (and I have never read that silicon with this core is already available).
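
    For the IRM == 1 case mentioned above, a minimal sketch of the register value in C. The macro and function names are hypothetical; the IRM position (bit 40) is my reading of the GICv3 register description, and the affinity and TargetList fields are left at zero on the assumption that they are not used for routing when IRM is set.

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define SGI1R_IRM (1ull << 40)   /* Interrupt Routing Mode: all PEs except the sender */

    /* Hypothetical helper: ICC_SGI1R_EL1 value for "everyone but me". */
    static inline uint64_t sgi1r_all_but_self(unsigned intid)
    {
        assert(intid < 16);          /* SGIs are INTIDs 0..15 */
        return SGI1R_IRM | ((uint64_t)intid << 24);
    }
    ```

    So the two modes are a per-cluster aff0 mask (IRM == 0) or a broadcast that excludes the sender (IRM == 1); nothing in between is expressible in one write.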

  • So it really seems that one can send an IPI either to several cores within one cluster or, with IRM == 1, to all cores but the sender, but not to selected cores in different clusters.

    Scheduling IPIs targeting some cores belonging to different clusters is a job that software can do relatively easily. If such IPIs are a regular part of the interrupt traffic, redesigning the cluster-setup may be of benefit so that the related cores can be grouped together into a single cluster.

    the only core so far is the CA65

    I read that it is the basis for the A65AE and for the Neoverse E1. An Internet search also turned up the (non-Cortex) Vulcan/ThunderX2 from Broadcom/Cavium/Marvell, one model of which has 32 cores with 4 threads per core.

  • If such IPIs are a regular part of the interrupt traffic, redesigning the cluster-setup may be of benefit so that the related cores can be grouped together into a single cluster.

    How would you "re-organize" the core/cluster setup? I'd say this is fixed in the HW.

    Actually, I wonder why NXP made the LX2160A with 8 clusters of 2 cores each instead of 4 clusters of 4. But maybe it is a yield thing, as it might be easier to disable a single core in a cluster to get the LX2080A ;-)

    Scheduling IPIs targeting some cores belonging to different clusters is a job that software can do relatively easily.

    Sure, but if you want to wake up, for example, cores 2, 6, and 7 on the LX2160A, you need to send 2 IPIs, as core 2 is in cluster 1 and cores 6 and 7 are in cluster 3.
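
    The grouping that software has to do here can be sketched as follows, in C. Both function names are hypothetical. The mapping in lx_core_to_mpidr (core N has Aff1 = N / 2 and Aff0 = N % 2) is an assumption matching the "2 cores per cluster" layout discussed in this thread, and the register field positions are the ones I understand GICv3 to define.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed LX2160A-style mapping: core N -> Aff1 = N / 2 (cluster), Aff0 = N % 2. */
    static uint64_t lx_core_to_mpidr(unsigned core)
    {
        return ((uint64_t)(core / 2) << 8) | (core % 2);
    }

    /*
     * Hypothetical helper: fold a list of target MPIDR values into one
     * ICC_SGI1R_EL1 value per distinct (Aff3, Aff2, Aff1, RS) group.
     * Returns the number of register writes needed.
     */
    static size_t sgi1r_batch(const uint64_t *mpidr, size_t n, unsigned intid,
                              uint64_t *out, size_t out_cap)
    {
        const uint64_t key_mask = (0xffull << 48) | (0xfull << 44)
                                | (0xffull << 32) | (0xffull << 16);
        size_t used = 0;

        assert(intid < 16);
        for (size_t i = 0; i < n; i++) {
            uint64_t aff0 = mpidr[i] & 0xff;
            uint64_t key = (((mpidr[i] >> 32) & 0xff) << 48)   /* Aff3 */
                         | ((aff0 >> 4) << 44)                 /* RS */
                         | (((mpidr[i] >> 16) & 0xff) << 32)   /* Aff2 */
                         | (((mpidr[i] >> 8) & 0xff) << 16);   /* Aff1 */
            size_t j = 0;

            /* Look for an existing write that addresses the same cluster. */
            while (j < used && (out[j] & key_mask) != key)
                j++;
            if (j == used) {
                assert(used < out_cap);
                out[used++] = key | ((uint64_t)intid << 24);
            }
            out[j] |= 1ull << (aff0 & 0xf);                    /* TargetList bit */
        }
        return used;
    }
    ```

    Feeding it cores 2, 6, and 7 under the assumed mapping yields two values: one with Aff1 = 1 and TargetList = 0b01, and one with Aff1 = 3 and TargetList = 0b11, which matches the "2 IPIs" count above.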

  • How would you "re-organize" the core/cluster setup? I'd say this is fixed in the HW.

    True. I did indeed mean a hardware redesign/reorganization. If the hardware design settled on a particular cluster organization, that design must have taken into account the typical load the system is expected to handle. Running a generic load on it, or a load that represents a worst-case (rather than an average-case) scenario, leads to worse performance.

    Actually, I wonder why NXP made the LX2160A with 8 clusters of 2 cores each instead of 4 clusters of 4. But maybe it is a yield thing, as it might be easier to disable a single core in a cluster to get the LX2080A ;-)

    :-)

    Nevertheless, looking at the dts files included in Linux for the two devices, it can be seen that each cluster (in either of the two LX devices/chips) is made up of 2 cores. A certain amount of dedicated L2 cache is assigned to each cluster. I think this cluster setup has to do with supporting parallel processing without too much sharing/contention (of/on the L2 cache, for instance). That is, each cluster is effectively a 'single' 2-threaded core. If an Arm core comparable to the A72 and with 2 threads per core were available, I guess NXP would choose that core and would build a 2160' with 8 such cores.

    For LX2160A, wouldn't a 4x4 cluster, with 2MB L2 per cluster, introduce higher contention/traffic on the L2 controller than their current design?

    Sure, but if you want to wake up, for example, cores 2, 6, and 7 on the LX2160A, you need to send 2 IPIs, as core 2 is in cluster 1 and cores 6 and 7 are in cluster 3.

    True. However, the LX devices are meant to run network-processing loads. Their OS/driver needs to honour the hardware design and schedule the workload such that cross-cluster IPIs are kept to the minimum required.

    I do not know how the load is distributed on such devices. But, assuming that each cluster is given a set of connections to process, and that these sets are kept disjoint, the processing of each set won't need IPIs outside its home cluster, also taking into account factors such as the per-CPU data areas maintained by the OS/driver.

    The load distribution is one factor that determines the ratio of (total) cross-cluster IPIs to (total) home-cluster IPIs. If the OS reserves a single cluster for its own use and dedicates others to processing the connections, would that ratio get close to 1?

    I guess that for Arm to change sgi.aff1 into a bitmask, it would need a justification at least as strong as the one that was required when aff0 was made a bitmask.

