How does VTune record the CPU time of assembly code?

Hello everyone,

Recently I have been using VTune in hotspot mode to profile my BSDE code, and I have found some interesting things.

int a,a1,a2,a3;

float trans[4];

_mm_store_ps(trans,a_sse);

Below are the four lines of code in question:

a= (int)*(trans);
a1= (int)*(trans+1);
a2= (int)(trans[2]);
a3 = (int )trans[3];
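
For anyone who wants to reproduce this, a minimal self-contained sketch of the pattern above might look like the following. It is only an illustration, not the actual BSDE code: the function names convert_lanes and run_hotspot, the input values, and the explicit 16-byte alignment of trans (which _mm_store_ps requires) are all assumptions.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_set_ps, _mm_store_ps */

/* Hypothetical helper: stores one SSE vector to memory and converts
   each lane to int, mirroring the four lines shown above. */
void convert_lanes(__m128 a_sse, int out[4])
{
    /* Assumption: force 16-byte alignment, since _mm_store_ps
       requires an aligned destination. */
    _Alignas(16) float trans[4];

    _mm_store_ps(trans, a_sse);

    out[0] = (int)*(trans);      /* line 1 */
    out[1] = (int)*(trans + 1);  /* line 2 */
    out[2] = (int)(trans[2]);    /* line 3 */
    out[3] = (int)trans[3];      /* line 4 */
}

/* Hypothetical driver: repeats the conversion many times so that a
   sampling profiler collects enough samples on these lines. The
   checksum keeps the compiler from discarding the work at -O3. */
long run_hotspot(long iterations)
{
    long checksum = 0;
    int out[4];

    for (long i = 0; i < iterations; ++i) {
        /* _mm_set_ps lists lanes from highest to lowest, so the last
           argument ends up in trans[0]. */
        __m128 a_sse = _mm_set_ps(4.75f, 3.5f, 2.25f, (float)(i & 7));
        convert_lanes(a_sse, out);
        checksum += out[0] + out[1] + out[2] + out[3];
    }
    return checksum;
}
```

To profile it, call run_hotspot from main with a large iteration count and print the result, then build with debug information (for example gcc -O0 -g, and again with -O3 -g) so the profiler can map samples back to source lines.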

Compiled with gcc at -O0 (no optimization), the CPU time attributed to each line is as follows:

a= (int)*(trans); 9.319s
a1= (int)*(trans+1); 1.970s
a2= (int)(trans[2]); 1.020s
a3 = (int )trans[3]; 2.130s

In order to find the hotspot, I opened the assembly view. Here is the assembly for line 1 and line 2 as an example:

0x4055d2 1 movq -0xb8(%rbp), %rax 1.361s
0x4055d9 1 movssl (%rax), %xmm0
0x4055dd 1 cvttss2si %xmm0, %eax 4.238s
0x4055e1 1 movl %eax, -0xcc(%rbp) 3.720s

0x4055e7 2 movq -0xb8(%rbp), %rax 1.000s
0x4055ee 2 add $0x4, %rax
0x4055f2 2 movssl (%rax), %xmm0
0x4055f6 2 cvttss2si %xmm0, %eax 0.100s
0x4055fa 2 movl %eax, -0xd0(%rbp) 0.870s

If the time for the first instruction of line 1, "0x4055d2 1 movq -0xb8(%rbp), %rax 1.361s", were much larger than for the same instruction of line 2, "0x4055e7 2 movq -0xb8(%rbp), %rax 1.000s", I could understand it: the first load takes the cache miss, and the second one benefits from it. But as you can see, the main difference is in "cvttss2si" and "movl". What causes such a big difference?

If I swap the order of "a=*" and "a3=*" in the C code, "a3=*" costs more time than "a=*".

To make things more interesting, I compiled the code with -O3 optimization. Here is the time each line costs:

a= (int)*(trans); 1.080s
a1= (int)*(trans+1); 1.191s
a2= (int)(trans[2]); 0.520s
a3 = (int )trans[3]; 5.900s

The last line costs the most time now. I compared the assembly of the last two lines below.

Address Line Assembly CPU Time
0x4036b0 3 movssl 0x8(%rcx), %xmm4 0.200s
0x4036b5 4 movssl 0xc(%rcx), %xmm5 2.900s
0x4036ba 3 cvttss2si %xmm4, %edi 0.320s
0x4036be 4 cvttss2si %xmm5, %r10d 3.000s

With -O3 optimization, gcc placed the assembly for lines 3 and 4 ahead of that for lines 1 and 2. Why does gcc believe this saves time? I understand that putting the two "movssl" instructions together helps the pipeline, but why do the second "movssl" and "cvttss2si" cost more time than the first ones? How does VTune record the CPU time of individual assembly instructions, and is it accurate?

Thank you for your help!
