Channel: Intel® Software - Intel® VTune™ Profiler (Intel® VTune™ Amplifier)

Unexpected behaviour while vtuning inlined and IPOed code

Dear Forum members,

I have encountered some unusual behaviour in how VTune displays the time spent in subroutines inlined by IPO. I managed to reproduce the problem in a simple example using ifort version 16.0.1 and VTune Update 1 (build 434111):

test.f90

PROGRAM test
  USE m1
  IMPLICIT NONE

  INTEGER, PARAMETER :: no_repeats = 100
  INTEGER, PARAMETER :: n = 100000000
  INTEGER :: repeat, i
  REAL :: gamma, delta, epsilon
  REAL, DIMENSION(:), ALLOCATABLE :: a, b, c


  ALLOCATE( a(1:n) )
  ALLOCATE( b(1:n) )
  ALLOCATE( c(1:n) )

  a = 1.0
  b = 1.0
  c = 2.0

  DO repeat = 1, no_repeats
    DO i = 1, n

      epsilon = b(i)
      CALL sub1( i+1, epsilon, gamma )
      epsilon = c(i)
      CALL sub1( i+2, epsilon, delta )
      a(i) = a(i) + gamma * delta
    ENDDO  ! n
  ENDDO    ! repeat

END PROGRAM test

sub1.f90

MODULE m1
  IMPLICIT NONE
  CONTAINS

  SUBROUTINE sub1( k, alpha, beta )
    IMPLICIT NONE

    INTEGER, INTENT(IN) :: k
    REAL, INTENT(IN) :: alpha
    REAL, INTENT(OUT) :: beta


    IF( MOD( k, 10 ) == 0 ) THEN
      beta = 4.0 * alpha
    ELSE
      beta = 2.0 * alpha
    ENDIF

  END SUBROUTINE sub1

END MODULE m1

compile.sh

#!/bin/sh

ifort -c -no-vec -O3 -ipo -debug full sub1.f90
ifort -c -no-vec -O3 -ipo -debug full test.f90
ifort -no-vec -O3 -ipo -debug full sub1.o test.o -o test

After running this program under VTune, the Top-down window shows plausible results: both instances of the inlined subroutine sub1 are assigned similar runtimes.

However, if I open the source code of "test" to check the time consumption of the individual inlined instances of sub1, all of the time (0.188s) is assigned to the second inlined instance.

This is confirmed by the stack information on the right-hand side of the screen, where both "contributions" are assigned to line 26 of test.f90.

Moreover, when I check the assembly code and the times assigned to the machine instructions, the instructions belonging to both instances of sub1 seem to have reasonable timings, but the lea at 0x40284d is attributed the entire time of both subroutines (0.188s). I do understand that assembly-level information is not necessarily precise, given the statistical nature of this kind of performance testing and the fact that I am not using hardware-counter-based methods, but I still think it is a sign that something is going wrong.

Overall, my problem is that I cannot check the time consumption of the individual instances of the inlined subroutines. The whole phenomenon cannot be blamed on the fact that VTune sometimes attributes the time of an instruction to another instruction nearby, often one or a few instructions later, since inserting some complicated code after line 24 does not change the timings of the two inlined instances of sub1. I have the impression that VTune is booking the time of the inlined code to the wrong place.

A possible workaround would be to manually create a second copy of sub1 called sub2 and inline them separately. However, this does not work, since the time of both sub1 and sub2 is then attributed to sub2. Furthermore, in a real program (a heavily inlined spaghetti of some 100000 Fortran lines with OpenMP involved) this isn't feasible, since the inlined subroutines are not small and their full call trees, i.e. all called subroutines, would have to be duplicated (triplicated, quadruplicated, ...).
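For reference, the duplication workaround I tried looks roughly like the sketch below. It assumes sub2 is a verbatim copy of sub1; the program name test_dup and the reduced loop length n = 1000 are illustrative choices and are not part of the original test case. It only shows the shape of the workaround, which did not change where VTune books the time.

```fortran
! Sketch of the duplication workaround: sub2 is a verbatim copy of sub1,
! so each call site in the loop inlines a differently named routine.
MODULE m1
  IMPLICIT NONE
  CONTAINS

  SUBROUTINE sub1( k, alpha, beta )
    INTEGER, INTENT(IN) :: k
    REAL, INTENT(IN) :: alpha
    REAL, INTENT(OUT) :: beta

    IF( MOD( k, 10 ) == 0 ) THEN
      beta = 4.0 * alpha
    ELSE
      beta = 2.0 * alpha
    ENDIF
  END SUBROUTINE sub1

  ! Identical body to sub1; only the name differs.
  SUBROUTINE sub2( k, alpha, beta )
    INTEGER, INTENT(IN) :: k
    REAL, INTENT(IN) :: alpha
    REAL, INTENT(OUT) :: beta

    IF( MOD( k, 10 ) == 0 ) THEN
      beta = 4.0 * alpha
    ELSE
      beta = 2.0 * alpha
    ENDIF
  END SUBROUTINE sub2

END MODULE m1

PROGRAM test_dup
  USE m1
  IMPLICIT NONE

  ! Assumption: n is reduced from the original 100000000 for a quick check.
  INTEGER, PARAMETER :: n = 1000
  INTEGER :: i
  REAL :: gamma, delta
  REAL, DIMENSION(:), ALLOCATABLE :: a, b, c

  ALLOCATE( a(1:n), b(1:n), c(1:n) )
  a = 1.0
  b = 1.0
  c = 2.0

  DO i = 1, n
    CALL sub1( i+1, b(i), gamma )   ! first call site, inlines sub1
    CALL sub2( i+2, c(i), delta )   ! second call site, inlines the copy
    a(i) = a(i) + gamma * delta
  ENDDO

  ! Print a few elements so the loop is not optimized away.
  PRINT *, a(1), a(8), a(9)
END PROGRAM test_dup
```

In the real code every routine called from sub1 would need the same treatment, which is why this approach does not scale.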

Is there some way to check these timings?

Thank you for your help,

Jozsef

 

