Dear Forum members,
I have encountered some unusual behaviour in how VTune displays the time spent in subroutines inlined by IPO. I managed to reproduce the problem in a simple example using ifort version 16.0.1 and VTune Update 1 (build 434111):
test.f90
PROGRAM test
  USE m1
  IMPLICIT NONE
  INTEGER, PARAMETER :: no_repeats = 100
  INTEGER, PARAMETER :: n = 100000000
  INTEGER :: repeat, i
  REAL :: gamma, delta, epsilon
  REAL, DIMENSION(:), ALLOCATABLE :: a, b, c
  ALLOCATE( a(1:n) )
  ALLOCATE( b(1:n) )
  ALLOCATE( c(1:n) )
  a = 1.0
  b = 1.0
  c = 2.0
  DO repeat = 1, no_repeats
    DO i = 1, n
      epsilon = b(i)
      CALL sub1( i+1, epsilon, gamma )
      epsilon = c(i)
      CALL sub1( i+2, epsilon, delta )
      a(i) = a(i) + gamma * delta
    ENDDO ! n
  ENDDO ! repeat
END PROGRAM test
sub1.f90
MODULE m1
  IMPLICIT NONE
CONTAINS
  SUBROUTINE sub1( k, alpha, beta )
    IMPLICIT NONE
    INTEGER, INTENT(IN) :: k
    REAL, INTENT(IN) :: alpha
    REAL, INTENT(OUT) :: beta
    IF( MOD( k, 10 ) == 0 ) THEN
      beta = 4.0 * alpha
    ELSE
      beta = 2.0 * alpha
    ENDIF
  END SUBROUTINE sub1
END MODULE m1
compile.sh
#!/bin/sh
ifort -c -no-vec -O3 -ipo -debug full sub1.f90
ifort -c -no-vec -O3 -ipo -debug full test.f90
ifort -no-vec -O3 -ipo -debug full sub1.o test.o -o test
After running this program under VTune, the Top-down window shows plausible results: both instances of the inlined subroutine sub1 are assigned similar runtimes:
![]()
However, if I open the source code of "test" to check the time consumption of the individual inlined instances of sub1, all of the time (0.188 s) is assigned to the second inlined instance:
![]()
This is confirmed by the stack information on the right-hand side of the screen, where both "contributions" are assigned to line 26 of test.f90:
![]()
Moreover, when I check the assembly code and the times assigned to the machine instructions, all instructions belonging to both instances of sub1 seem to have reasonable timings, but the lea at 0x40284d is attributed the entire time of both subroutines (0.188 s). I do understand that assembly-level information is not necessarily precise, given the stochastic nature of this kind of performance testing and the fact that no hardware-counter-based method was used, but I still think it is a sign that something is going wrong:
![]()
![]()
Overall, my problem is that I cannot check the time consumption of the individual instances of the inlined subroutines. The phenomenon cannot be blamed on the fact that VTune sometimes attributes the time of an instruction to another instruction nearby, often one or a few instructions later, since inserting some complicated code after line 24 does not change the timings of sub1 and sub2. I have the impression that VTune is booking the time of the inlined code to the wrong place. A possible workaround would be to manually create a second copy of sub1 called sub2 and inline them separately. However, this does not work, since the time of both sub1 and sub2 is then attributed to sub2. Furthermore, in a real program (a heavily inlined spaghetti of some 100000 Fortran lines with OpenMP involved) this isn't feasible, since the inlined subroutines are not small and their full call tree, i.e. all called subroutines, would have to be duplicated (triplicated, quadruplicated ...).
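To make the duplication workaround described above concrete, here is a minimal sketch. It assumes sub2 is a verbatim copy of sub1 placed in the same module; the program name test_dup and the reduced array size are hypothetical and chosen only so the example runs quickly.

```fortran
MODULE m1
  IMPLICIT NONE
CONTAINS
  SUBROUTINE sub1( k, alpha, beta )
    INTEGER, INTENT(IN) :: k
    REAL, INTENT(IN) :: alpha
    REAL, INTENT(OUT) :: beta
    IF( MOD( k, 10 ) == 0 ) THEN
      beta = 4.0 * alpha
    ELSE
      beta = 2.0 * alpha
    ENDIF
  END SUBROUTINE sub1

  ! sub2: verbatim copy of sub1 under a different name, so that the
  ! second call site inlines a differently named routine
  SUBROUTINE sub2( k, alpha, beta )
    INTEGER, INTENT(IN) :: k
    REAL, INTENT(IN) :: alpha
    REAL, INTENT(OUT) :: beta
    IF( MOD( k, 10 ) == 0 ) THEN
      beta = 4.0 * alpha
    ELSE
      beta = 2.0 * alpha
    ENDIF
  END SUBROUTINE sub2
END MODULE m1

PROGRAM test_dup   ! hypothetical name, for illustration only
  USE m1
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000   ! reduced from the original size (assumption)
  INTEGER :: i
  REAL :: gamma, delta
  REAL, DIMENSION(:), ALLOCATABLE :: a, b, c
  ALLOCATE( a(1:n), b(1:n), c(1:n) )
  a = 1.0
  b = 1.0
  c = 2.0
  DO i = 1, n
    CALL sub1( i+1, b(i), gamma )   ! first call site keeps sub1
    CALL sub2( i+2, c(i), delta )   ! second call site uses the copy
    a(i) = a(i) + gamma * delta
  ENDDO
  PRINT *, a(1), a(n)
END PROGRAM test_dup
```

Every routine called from sub1 would need the same treatment, which is why this approach scales poorly to a deep call tree.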
Is there some way to check these timings?
Thank you for your help,
Jozsef