I wonder: is AFF1 in ICC_SGI1R_EL1 also a bit-mask, or does it address the cluster directly?
That is, does AFF1 == 3 address cluster 3, or clusters 0 and 1?
a.surati said: I think that, from the perspective of system software, it doesn't matter. The software copies the values from the target's mpidr_el1 into the SGI register's aff{1,2,3} fields. Only the mpidr_el1.aff0 field is exposed/expanded as a bitmask inside the SGI register, so the register interface forces the software to treat aff0 differently from the other affinity fields.
The linked PDF is clearer than the other descriptions I found.
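As a sketch of that asymmetry (field offsets taken from the GICv3 ICC_SGI1R_EL1 layout; the helper name and its exact shape are my own, not from the linked PDF), composing the register value might look like this in C:

```c
#include <stdint.h>

/* Compose an ICC_SGI1R_EL1 value (GICv3, with IRM == 0 and RS == 0).
 * aff3/aff2/aff1 are copied verbatim from the target's MPIDR_EL1 and
 * select a single cluster; targetlist is a bitmask in which bit n
 * selects the PE with MPIDR_EL1.Aff0 == n inside that cluster. */
static uint64_t sgi1r_value(uint8_t aff3, uint8_t aff2, uint8_t aff1,
                            uint16_t targetlist, uint8_t intid)
{
    return ((uint64_t)aff3 << 48) |          /* Aff3,       bits [55:48] */
           ((uint64_t)aff2 << 32) |          /* Aff2,       bits [39:32] */
           ((uint64_t)(intid & 0xf) << 24) | /* INTID,      bits [27:24] */
           ((uint64_t)aff1 << 16) |          /* Aff1,       bits [23:16] */
           (uint64_t)targetlist;             /* TargetList, bits [15:0]  */
}
```

Note how one such value can only ever name one cluster; the bitmask semantics apply to aff0 alone.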
So it really seems one can send a single IPI to multiple cores within a cluster, but not to cores in different clusters; or, with IRM == 1, to all cores except the sender.
Thanks for the hint about MT, but it seems the only such core so far is the Cortex-A65 (CA65), which has two threads (though I have never read that there is already silicon out there with this core).
42Bastian Schick said: So it really seems one can send a single IPI to multiple cores within a cluster, but not to cores in different clusters; or, with IRM == 1, to all cores except the sender.
Scheduling IPIs that target cores in different clusters is a job that software can do relatively easily. If such IPIs are a regular part of the interrupt traffic, redesigning the cluster setup may be of benefit, so that the related cores can be grouped together into a single cluster.
42Bastian Schick said: the only such core so far is the Cortex-A65 (CA65)
I read that it is the basis for the Cortex-A65AE and for the Neoverse E1. An internet search turned up the (non-Cortex) brand names Vulcan/ThunderX2 from Broadcom/Cavium/Marvell, one model of which has 32 cores with 4 threads per core.
a.surati said:If such IPIs are a regular part of the interrupt traffic, redesigning the cluster-setup may be of benefit so that the related cores can be grouped together into a single cluster.
How would you "re-organize" the core/cluster setup? I'd say this is fixed in the HW.
Actually, I wonder why NXP makes the LX2160A with 8 clusters of 2 cores each instead of 4 clusters of 4. But maybe it is a yield thing, as it might be easier to disable a single core in a cluster to get the LX2080A ;-)
a.surati said:Scheduling IPIs targeting some cores belonging to different clusters is a job that software can do relatively easily.
Sure, but if you want to wake up, for example, cores 2, 6, and 7 on the LX2160A, you need to send 2 IPIs, as core 2 is in cluster 1 while cores 6 and 7 are in cluster 3.
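For illustration, here is a hypothetical helper (assuming the LX2160A's fixed layout of 2 cores per cluster and a linear core numbering, with core n sitting in cluster n/2) that groups a target set into one aff0 bitmask per cluster and counts the SGI register writes needed:

```c
#include <stdint.h>

#define CORES_PER_CLUSTER 2   /* LX2160A: 8 clusters x 2 cores */
#define MAX_CLUSTERS      8

/* Fill masks[] with one aff0 bitmask per cluster for the given core
 * indices and return the number of SGI register writes required
 * (one per cluster that contains at least one target). */
static int sgi_writes_needed(const int *cores, int n,
                             uint16_t masks[MAX_CLUSTERS])
{
    int writes = 0;

    for (int c = 0; c < MAX_CLUSTERS; c++)
        masks[c] = 0;

    for (int i = 0; i < n; i++) {
        int cluster = cores[i] / CORES_PER_CLUSTER;
        int aff0    = cores[i] % CORES_PER_CLUSTER;

        if (masks[cluster] == 0)
            writes++;  /* first target seen in this cluster */
        masks[cluster] |= (uint16_t)(1u << aff0);
    }
    return writes;
}
```

For cores {2, 6, 7} this yields 2 writes, with masks[1] == 0x1 and masks[3] == 0x3.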
42Bastian Schick said: How would you "re-organize" the core/cluster setup? I'd say this is fixed in the HW.
True. I did indeed mean a hardware redesign/reorganization. If the hardware design settled on a particular cluster organization, that design must have taken into account the typical load the system is expected to handle. Running a generic load on it, or a load that constitutes a worst-case rather than an average-case scenario, does lead to worse performance.
42Bastian Schick said: Actually, I wonder why NXP makes the LX2160A with 8 clusters of 2 cores each instead of 4 clusters of 4. But maybe it is a yield thing, as it might be easier to disable a single core in a cluster to get the LX2080A ;-)
:-)
Nevertheless, looking at the dts files included in Linux for the two devices, it can be seen that each cluster (in either of the two LX chips) is made up of 2 cores, and a certain amount of dedicated L2 cache is assigned to each cluster. I think this cluster setup has to do with supporting parallel processing without too much sharing/contention (of/on the L2 cache, for instance). That is, each cluster is effectively a 'single' 2-threaded core. If an Arm core comparable to the A72 but with 2 threads per core were available, I guess NXP would have chosen it and built a 2160-like chip with 8 such cores.
For the LX2160A, wouldn't a 4x4 setup, with 2MB of L2 per cluster, introduce higher contention/traffic on the L2 controller than the current design does?
42Bastian Schick said: Sure, but if you want to wake up, for example, cores 2, 6, and 7 on the LX2160A, you need to send 2 IPIs, as core 2 is in cluster 1 while cores 6 and 7 are in cluster 3.
True. However, the LX devices are meant to run network-processing loads. The OS/driver needs to honour the hardware design and schedule the workload such that cross-cluster IPIs are kept to the minimum required.
I do not know how the load is distributed on such devices. But assuming that each cluster is given a set of connections to process, and these sets are kept disjoint, the processing of each set won't need IPIs outside its home cluster, also taking into account factors such as the per-cpu data areas maintained by the OS/driver.
The load distribution is one factor that determines the ratio of (total) cross-cluster IPIs to (total) home-cluster IPIs. If the OS reserves a single cluster for its own use and dedicates the others to processing the connections, would that ratio get close to 1?
I guess that, for Arm to change sgi.aff1 into a bitmask, a justification at least as strong as the one required when aff0 was made a bitmask would be necessary.