Quantcast
Channel: Intel® Software - Intel® VTune™ Profiler (Intel® VTune™ Amplifier)
Viewing all articles
Browse latest Browse all 1574

how can Vtune record the asm code's cpu time

$
0
0

Hello, everyone

  recently, I am using Vtune to test my BSDE code in hotspot mode. I have found some insteresting things.

     int a,a1,a2,a3;

     float trans[4];

    _mm_store_ps(trans,a_sse);

   below are four lines of code

  1.   a= (int)*(trans);
  2.   a1= (int)*(trans+1);
  3.   a2= (int)(trans[2]);
  4.   a3 = (int )trans[3]; 

  compile using gcc with -O0 Optimize optimization, the time each line costs increase as below

  1. a= (int)*(trans); 9.319s
  2. a1= (int)*(trans+1); 1.970s
  3. a2= (int)(trans[2]); 1.020s
  4. a3 = (int )trans[3]; 2.130s

inorder to find the hotspot, I open the asm code, take line1's asm and line2's asm as an example

0x4055d2  1      movq -0xb8(%rbp), %rax 1.361s
0x4055d9  1      movssl (%rax), %xmm0
0x4055dd  1      cvttss2si %xmm0, %eax 4.238s
0x4055e1  1      movl %eax, -0xcc(%rbp) 3.720s

0x4055e7  2    movq -0xb8(%rbp), %rax 1.000s
0x4055ee  2    add $0x4, %rax
0x4055f2   2    movssl (%rax), %xmm0
0x4055f6   2    cvttss2si %xmm0, %eax 0.100s
0x4055fa   2    movl %eax, -0xd0(%rbp) 0.870s

if the cost time of first asm line "0x4055d2 1      movq -0xb8(%rbp), %rax 1.361s" is much larger than "0x4055e7 2    movq -0xb8(%rbp), %rax 1.000s", maybe I can understand it, the first line cause the cache miss which can benefit the second one. But, as you can see, the mainly different is "cvttss2si" and "movl". I don't know what caused the big difference ?

 If I change the order of "a=*" and "a3=*" in the C code, "a3=*" cost more time than "a=*". 

To make things more interesting, I compile the code with -O3 optimization, here is the time each line cost

  1. a= (int)*(trans); 1.080s
  2. a1= (int)*(trans+1); 1.191s
  3. a2= (int)(trans[2]); 0.520s
  4. a3 = (int )trans[3]; 5.900s

The last line cost the most time now.  I compared asm code of the last two line below.

Address Line Assembly CPU Time
0x4036b0  3  movssl 0x8(%rcx), %xmm4 0.200s
0x4036b5  4  movssl 0xc(%rcx), %xmm5 2.900s
0x4036ba  3  cvttss2si %xmm4, %edi 0.320s
0x4036be  4  cvttss2si %xmm5, %r10d 3.000s

With -O3 optimization, gcc put the asm code of line3,4 in the front of line1,2. Why did gcc believe this can save time? I know we put two "movssl" together to speed up the pipeline, But why does the the second "movssl" and "cvttss2si" cost more time than the first? How does Vtune record the asm code's cpu time? is it correct?

Thankyou  for your help!! 


Viewing all articles
Browse latest Browse all 1574

Trending Articles