Channel: Intel® Software - Intel® VTune™ Profiler (Intel® VTune™ Amplifier)

VTune 2016 Update 1 Bluescreen with Advanced Hotspots


Hi there,

I've been using VTune Amplifier 2016 on a Dell Precision T7600 workstation running Windows 8.1 for a couple of months with great success. A new hardware rollout has seen me upgraded to a T7910 with Windows 10, but Advanced Hotspots analysis now results in a bluescreen in vtss.sys every time I run it. I was initially seeing a DRIVER_OVERRAN_STACK_BUFFER error, but since removing all Hyper-V options from my BIOS, as suggested in another thread, I am now seeing KMODE_EXCEPTION_NOT_HANDLED.

Meanwhile, Basic Hotspots analysis works fine.

 

I think we technically have premium support, but it is not associated with my developer account. Any assistance would be appreciated...

 

- Richard Semmens


VTune XE 2016 is crashing on any attempt to execute ITT code


Our VTune (XE 2016 Update 1, build 434111) crashes whenever it encounters any ITT functions (Intel Inspector of the same version works fine with the ITT functions).

Problem signature:
Problem Event Name: APPCRASH
Application Name: myapp.exe
Application Version: 0.0.0.0
Application Timestamp: 56d597b4
Fault Module Name: ntdll.dll
Fault Module Version: 6.1.7601.23136
Fault Module Timestamp: 55a6a1ad
Exception Code: c0000005
Exception Offset: 0000000000049054
OS Version: 6.1.7601.2.1.0.256.4

Does anyone know the cause?
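
For reference, by "ITT code" I mean the usual ittnotify task annotations, along these lines (a minimal sketch assuming the ittnotify.h header from the VTune SDK; the domain and task names are placeholders):

#include <ittnotify.h>

// Created once and reused; VTune groups samples under this domain/task pair.
static __itt_domain* domain = __itt_domain_create("MyApp");
static __itt_string_handle* task = __itt_string_handle_create("compute");

void compute_step()
{
    __itt_task_begin(domain, __itt_null, __itt_null, task);
    // ... work being profiled ...
    __itt_task_end(domain);
}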

VTune 2016 Update 2 GUI doesn't expand across multiple monitors


I have a desktop with 4 monitors. Before updating to Parallel Studio XE 2016 Update 2, the VTune GUI was able to span across the 4 monitors. However, after updating to Update 2 this is no longer possible. The VTune GUI is restricted in size to 1 monitor, which makes it very difficult to use (it is not restricted to 1 monitor if VTune is executed from within Visual Studio). Intel Advisor is also afflicted with the same issue. However, Intel Inspector is not; its GUI can span across multiple monitors.

I should add that the OS is 64-bit Windows 7.

 

 

Unexpected behaviour while vtuning inlined and IPOed code


Dear Forum members,

I have encountered some unusual behaviour in how VTune displays the time spent in subroutines inlined by IPO. I managed to reproduce the problem in a simple example using ifort version 16.0.1 and VTune Update 1 (build 434111):

test.f90

PROGRAM test
  USE m1
  IMPLICIT NONE

  INTEGER, PARAMETER :: no_repeats = 100
  INTEGER, PARAMETER :: n = 100000000
  INTEGER :: repeat, i
  REAL :: gamma, delta, epsilon
  REAL, DIMENSION(:), ALLOCATABLE :: a, b, c


  ALLOCATE( a(1:n) )
  ALLOCATE( b(1:n) )
  ALLOCATE( c(1:n) )

  a = 1.0
  b = 1.0
  c = 2.0

  DO repeat = 1, no_repeats
    DO i = 1, n

      epsilon = b(i)
      CALL sub1( i+1, epsilon, gamma )
      epsilon = c(i)
      CALL sub1( i+2, epsilon, delta )
      a(i) = a(i) + gamma * delta
    ENDDO  ! n
  ENDDO    ! repeat

END PROGRAM test

sub1.f90

MODULE m1
  IMPLICIT NONE
  CONTAINS

  SUBROUTINE sub1( k, alpha, beta )
    IMPLICIT NONE

    INTEGER, INTENT(IN) :: k
    REAL, INTENT(IN) :: alpha
    REAL, INTENT(OUT) :: beta


    IF( MOD( k, 10 ) == 0 ) THEN
      beta = 4.0 * alpha
    ELSE
      beta = 2.0 * alpha
    ENDIF

  END SUBROUTINE sub1

END MODULE m1
compile.sh
#!/bin/sh

ifort -c -no-vec -O3 -ipo -debug full sub1.f90
ifort -c -no-vec -O3 -ipo -debug full test.f90
ifort -no-vec -O3 -ipo -debug full sub1.o test.o -o test

After running this program under VTune, the Top-down window shows plausible results: both instances of the inlined subroutine sub1 are assigned similar runtimes:

However, if I open the source code of "test" to check the time consumption of the various inlined instances of sub1, all time (0.188s) is assigned to the second inlined instance:

This is confirmed by checking the stack information on the right-hand side of the screen, where both "contributions" are assigned to line 26 of test.f90:


Moreover, when I check the assembly code and the times assigned to the machine instructions, the instructions belonging to both instantiations of sub1 seem to have reasonable timings, yet the lea at 0x40284d is attributed all of the time of both subroutines (0.188s). I do understand that the assembly-level information is not necessarily precise, given the stochastic nature of this kind of performance testing without hardware-counter-based methods, but I still think it is a sign of something going wrong:

Overall, my problem is that I cannot check the time consumption of the individual instances of the inlined subroutines. The whole phenomenon cannot be blamed on the fact that VTune sometimes attributes the time of an instruction to another instruction nearby, often one or a few instructions later, since inserting some complicated code after line 24 does not change the timings of sub1 and sub2. I have the impression that VTune is booking the time of the inlined code to the wrong place. A possible workaround would be to manually create a second copy of sub1, called sub2, and inline them separately. However, this does not work either, since the time of both sub1 and sub2 will be attributed to sub2. Furthermore, in a real program (a heavily inlined spaghetti of some 100000 Fortran lines with OpenMP involved) this isn't feasible, since the inlined subroutines are not small and their full call tree, i.e. all called subroutines, would have to be duplicated (triplicated, quadruplicated, ...).

Is there some way to check these timings?

Thank you for your help,

Jozsef

 

CPI increasing in parallel execution


Dear Everyone

I am trying to understand why we are experiencing scalability problems in our application. The application is a C++ based simulation software running on Windows only. Concurrency is implemented using C++11 threading facilities and a “self made” thread pool. The tasks are embarrassingly parallel and have a runtime of 20 seconds or more depending on the simulation case. The variation in runtime is relatively low. The “self made” thread pool is of course not an optimal solution, but since the task count is low and the runtime high, it does not seem to be a hotspot.

Simulation tasks make heavy use of object allocation and deallocation. Therefore, we have switched to TBB malloc_proxy (tbb44_20151115oss). This improved the scalability significantly. 

However, scalability is nonetheless poor. When comparing sequential (1 worker thread) with parallel execution (8 worker threads), I can observe the following (footnote 1):
 - The runtime of an individual task increases by several seconds when running in parallel.
 - CPI is increasing from 0.616 to 0.688.
 - Front-end bound is decreasing from 23.7% to 17.2%, bad speculation is decreasing from 6.5% to 5.7%.
 - Backend-bound is increasing from 47.7% to 56.9%.
 - Retiring is decreasing from 22.1% to 20.2%.

I initially suspected that in parallel execution the L3 cache is used less efficiently, since all 8 workers compete for space in the L3 cache. However, none of the functions with a high difference in runtime is DRAM bound according to VTune.

When digging into the backend-bound category, I see that it is mostly "Core Bound" or "DTLB Overhead" which is going up. I don't understand why those metrics are higher when I increase the number of workers, since (at least in my understanding) execution ports and DTLB buffers are not shared among cores. Therefore, I would be really happy to learn what could cause such behaviour and how I could try to mitigate it.

I uploaded an excerpt from VTune here (sorry for strange page format):
http://www.csc.kth.se/~kaeslin/vtune.pdf

I tried Windows’s native SetThreadAffinityMask to bind threads to individual cores, but I could not notice any difference.
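
The pinning I tried looks roughly like this (a minimal sketch; worker_index is an illustrative parameter, not from our actual thread pool):

#include <windows.h>

void pin_current_thread_to_core(unsigned worker_index)
{
    // One bit per logical processor; SetThreadAffinityMask returns the
    // previous mask, or 0 on failure.
    DWORD_PTR mask = DWORD_PTR(1) << worker_index;
    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0) {
        // Pinning failed; the thread keeps its previous affinity.
    }
}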

My testing system is equipped with an 8-core Xeon E5-2630 v3 processor, hence it cannot be a NUMA problem. Turbo Boost is turned off for this analysis.

Any hints and suggestions are welcome!
Thank you in advance.

(1) All values obtained when filtering on actual simulation function, hence ignoring everything happening outside of parallel region.

VTune not showing thread names when attaching to a process


Hi,

I have used the VTune API to name my threads so that when I look at any analysis data, I can see my named threads. This works perfectly well when VTune launches my application, either through the GUI or from the command line, and the result is viewed later in the GUI.

However, when I use the feature in VTune to attach to a running process of my software (I need to use this method as the executable that I want to analyze is getting piped data from another system process), the thread names don't show up. Is it expected to work? If so, should I be doing something special to see the thread names?
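
For context, the naming is done with the ITT API at the start of each thread, roughly like this (a minimal sketch assuming ittnotify.h; the thread name is a placeholder):

#include <ittnotify.h>

void worker_entry()
{
    // Give the thread a readable name so it shows up in the VTune timeline.
    __itt_thread_set_name("my_worker_thread");
    // ... thread work ...
}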

Details of my installation

Platform: Linux CentOS 6.6, 64-bit

VTune version: 2016 build 424694.

 

Thanks in advance for your prompt help!

Pradeep.

Time consumed by all hotspots using the command line


Hi,

   Is it possible to collect the time consumed by all basic blocks of a program using the command line (i.e. using amplxe-cl) after a hotspot analysis? I want all the information in a file, preferably sorted with the most time-consuming basic block at the top. Using the GUI I am able to get information about only a small portion of the code; I want the information about all the basic blocks. Is there a command for that in amplxe-cl?

Thanks in advance

Regards - renjub (renjub007@gmail.com)

 

Perhaps a silly question: VTune can read hardware counters, but no other tool can


 

When I try to use perf on the new NERSC supercomputer, which is running a Linux kernel, I get the perfectly understandable:

 

perf record -o test.perf -e cpu-cycles,instructions -a sleep 5
Permission error - are you root?
Consider tweaking /proc/sys/kernel/perf_event_paranoid:
 -1 - Not paranoid at all
  0 - Disallow raw tracepoint access for unpriv
  1 - Disallow cpu events for unpriv
  2 - Disallow kernel profiling for unpriv

Sure enough, when I look at perf_event_paranoid I see "1", so no hardware counters for me with perf.

 

But when I run VTune, hardware counters are available.

 

To run VTune we log into compute nodes, and I can see three kernel modules get added when I ask for a VTune-enabled allocation:

sep3_15               562471  0
vtsspp                364894  0
pax                     4510  0

PaX is a security upgrade that isn't from Intel, but vtsspp and sep3_15 are from Intel.  

 

I can see that sep3_15 is doing something when that kernel module is inserted:

 

Creating /dev/sep3_15 base devices with major number 246 ... done.
Creating /dev/sep3_15 percpu devices with major number 245 ... done.
Setting group ownership of devices to group "vtune" ... done.

So, some new devices with permissions for group "vtune" got made, but I don't know what else. And I can see that the Performance Monitoring Unit (PMU) got hooked in.

 

I would like to use hardware counters with my own source-code instrumented system in my library.  I used to do this with PAPI, but I get nothing now.  PCM would seem like the right option, but I'm not super-user on this system and it can take a long time to get new software installed.  Can PCM or perf be changed to be in group "vtune" and get access to hardware counters?


register_kretprobe('do_fork') failed: -2


Hello

The VTune installation finished without error:
$ lsmod |egrep 'pax|sep3_15|vtsspp'
vtsspp                360448  0
sep3_15               557056  0
pax                    16384  0

but when amplxe-cl -collect is launched I get the following error:
$ dmesg
[ 3609.758638] vtss_pebs_init[cpu10]: PEBS: size=0xC0, mask=0xF, family = 6, model = 3f
[ 3609.771611] probe_sched_process_fork[cpu10]: register_kretprobe('do_fork') failed: -2
[ 3613.868145] vtss_collection_fini[cpu10]: vtss++ collection stopped 0
[ 3613.868686] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[ 3613.879365] vtss_transport_fini[cpu5]: Transport stopped.

I have seen that parallel_studio_xe_2016_update2 is supposed to work with Fedora 23; could you please help me fix this problem?

Fedora 23
Kernel 4.4.4-301.fc23.x86_64
Intel i7-5820K

Best Regards

Profiling time-consuming loops of a program


How can VTune help profile a program that has many loops/look-ups that take up the CPU? It may or may not catch them as hotspots given the 1 ms sample collection interval. I recently got a significant performance improvement by choosing appropriate functions that have relatively fewer iterations or loops in them. However, none of these functions are listed in the Basic or Advanced Hotspots analysis. I looked at General Exploration too, but there was no sign of the bottleneck piece of code that was taken out to gain the huge performance improvement.

Questions:

1. Are there any suggestions for catching short loops of a C program that are called many times and reduce the overall performance?

2. How can VTune help profile software if it cannot catch bottlenecks within the 1 ms sample collection interval? For example, suppose VTune misses a function that executes in less than 1 ms but runs many times between samples taken every 1 ms. General Exploration also may not flag any issue with the bottleneck function in its front-end bound, back-end bound, bad speculation, or retiring categories.

Any help is greatly appreciated.

Thanks in advance

Memory/graphics event support in emon


I want to use the emon included in "Intel® VTune™ Amplifier for Systems with Intel® Energy Profiler NDA 2016 update 2" to evaluate memory/graphics performance on the Broxton platform, but the list of supported events shows that it only supports CPU events; memory/graphics events are not included. Does anyone know the plan and schedule for memory/graphics event support on Broxton?

OMPT support


Hello;

Does VTune support OMPT APIs for OpenMP performance analysis?

The Intel OpenMP runtime library supports the OMPT APIs.

 

Thanks,

 

 

Remote profiling with Windows Embedded running on the target


Hi 

 

I am trying to profile an application which runs in the following fashion:

 

(GUI) Local GUI running on Windows 7 (C# code) -> (WinserviceForm)

(Server) Remote Windows Server (C# code) -> (Server.exe)

(Multiple processes) Remote blades running Windows Embedded (C++ code) -> (helper.exe(s))

 

I wish to profile this application to get profiling data for all the processes (including the multiple processes on the remote blades). I am confused about what I should specify as the remote platform in the VTune interface in this case.

Please also note that I can run my entire application (GUI, server, blades) locally as well, and in that case I get the profiling information on the individual processes of my application (GUI, server and the blades). However, it is extremely slow, and I wish to use it on the actual configuration that I mentioned above. Kindly suggest how I might be able to do so using VTune.

 

Rohit

 

 

Cannot see function names when profiling C# managed code


Hello,
I am unable to see any function names when doing a Hotspots analysis of C# code. I am trying the VTune sample project serial_nqueens_csharp with VTune 2016 and Visual Studio 2015. I see the sample project only has solutions/projects for VS2008 and VS2010 - not sure if that is relevant. By comparison, everything in VTune is working fine for C++ profiling.

Here is what it looks like on the Basic Hotspots Summary tab for serial_nqueens_csharp:

    Function                     Module            CPU Time
    --------------------------------------------------------
    [Outside any known module]   [Unknown]         3.515s
    func@0x6427830d8bc           mscorlib.ni.dll   0.249s
    --------------------------------------------------------

This is from the Bottom-Up tab:

    CPU Time
    1 of 1: 100.0% (3.515s of 3.515s)

    null ! [Outside any known module] - [unknown source file]
    null ! [Unknown stack frame(s)] - [unknown source file]
    MSCOREE.DLL ! CorExeMain_Exported + 0x68 - [unknown source file]
    KERNEL32.dll ! BaseThreadInitThunk + 0xc - [unknown source file]
    ntdll.dll ! RtlUserThreadStart + 0x20 - [unknown source file]

Visual Studio is generating the pdb file (serial_nqueens_csharp.pdb) alongside the exe but VTune just doesn't seem to be processing it to get the symbols.  Did support for C# get dropped for versions past vs2010?

I have had the same issue in "VTune Amplifier XE 2015 update2" before upgrading to "VTune Amplifier XE 2016 update2" which still has the problem.  Please advise!

Thanks,
-Mike

Intel® Parallel Studio XE 2017 Beta invitation – please register and provide feedback!


Intel® Parallel Studio XE 2017 Beta is packed with lots of new features and even a couple of new products!

Some of the new VTune Amplifier XE features include:

  • New analysis types: Disk I/O, HPC Characterization, and GPU Hotspots
  • Mixed-mode Python* profiling
  • OpenCL* 2.0 Shared Virtual Memory usage type detection
  • The ability to configure a collection for any generation of supported processor without being on the actual hardware

For details, please see the "What's New" document on the 2017 Beta page.

You can register today for the Beta and provide valuable feedback by following this link: https://softwareproductsurvey.intel.com/f/150587/1103/


Beta for Linux* 2017 - runsa.options: second, empty, --event-config line


When I try to run a memory analysis, I get "the result you are opening does not contain any data".

Taking it to the command line:

/opt/intel/vtune_amplifier_xe_2017.0.0.457868/bin64/amplxe-cl -collect memory-access -app-working-dir /home/kook/ClionProjects/t2 -- /home/kook/ClionProjects/t2/cppout silent

 

I get:

amplxe: Error: Option 'event-config' received value from the next argument: '--sample-after-multiplier=1.0', but it seems to be an option. Use '-option=value' declaration form or make sure that value is not missed

The generated runsa.options file looks like this:

-r
/home/kook/intel/amplxe/projects/cccc/r001macc
--target=host
--itt-config=frame
--bandwidth-limits
--data-limit-mb=500
--mrte-type=java
--event-config=CPU_CLK_UNHALTED.CORE:sa=2300000,CPU_CLK_UNHALTED.REF:sa=2300000,INST_RETIRED.ANY:sa=2300000,BUS_DRDY_CLOCKS.THIS_AGENT:sa=100003,BUS_TRANS_BURST.SELF:sa=200003
--event-config
--sample-after-multiplier=1.0
--event-mux
--mrte-mode=auto
--
/home/kook/ClionProjects/t2/cppout
silent

 

Notice the spurious --event-config line. I wasn't able to get rid of it with any switches in the GUI.

Please help me delete this topic

JIT API and profiling a running process


Hi,

We use the JIT Profiling API to be able to see dynamically generated code in the profile. This works well when the process is started by VTune. However, it does not work for an already running process, e.g., with the --target-pid argument.

It appears that iJIT_IsProfilingActive() returns iJIT_NOTHING_RUNNING in that case. I am sure that this function was called after the profiler has attached to the process. 

Is using the JIT Profiling API with attached processes a supported scenario? Thanks.
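
For reference, our registration of JIT-compiled code follows the usual pattern, roughly like this (a minimal sketch; the method name and size handling are placeholders):

#include <string.h>
#include <jitprofiling.h>

// Report one dynamically generated function to VTune; `code` points to the
// emitted machine code and `size` is its length in bytes.
static void report_jitted_method(void* code, unsigned int size)
{
    // This is the check that yields iJIT_NOTHING_RUNNING after attaching.
    if (iJIT_IsProfilingActive() != iJIT_SAMPLING_ON)
        return;

    iJIT_Method_Load m;
    memset(&m, 0, sizeof(m));
    m.method_id = iJIT_GetNewMethodID();
    m.method_name = (char*)"jitted_kernel";  // placeholder name
    m.method_load_address = code;
    m.method_size = size;
    iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED, (void*)&m);
}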

Cannot load raw collector data


I want to use VTune to measure the TLB misses of a program. The program takes a lot of time to run. I get the following information:

wj@mcc21:~/graph/sssp-mpi$ mpirun -host mcc21 -n 1 amplxe-cl -target-system=mic-host-launch -c advanced-hotspots   -r /tmp/c ./main.cpu : -host mic0 -n 1 /home/wj/graph/main.mic
amplxe: Using target: mic-host-launch
amplxe: Analyzing data in the node-wide mode. The hostname (mcc21) will be added to the result path/name.
amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /tmp/c.mcc21 -command stop.
amplxe: Warning: To enable hardware event-base[d] sampling, VTune Amplifier has disabled the NMI watchdog timer. The watchdog timer will be re-enabled after collection completes

......

......

mcc21-mic0 Finish.
^C[mpiexec@mcc21] Sending Ctrl-C to processes as requested
[mpiexec@mcc21] Press Ctrl-C again to force abort
amplxe: CTRL-C signal is received.
amplxe: Error: The given command is not valid now. Please check the current state of the launcher using `status' command.
amplxe: Collection stopped.
amplxe: Error: target - ERROR: ld.so: object 'libittnotify_collector.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
amplxe: Error: target - ERROR: ld.so: object 'libittnotify_collector.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
amplxe: Using result path `/tmp/c.mcc21'
amplxe: Executing actions  8 % Loading raw data to the database                

amplxe: Error: Cannot load data file `/tmp/c.mcc21/data.0/sep2b0a4d55c700.20160411T161114.414300.tb6' (tbrw call "TBRW_dobind(tbrwFile->getHandle(), streamIndex)" failed: invalid string (97)).
amplxe: Executing actions 50 % done                                            
amplxe: Error: 0x4000001e (Cannot load raw collector data)

Then I use amplxe-cl to check the data file:

wj@mcc21:/tmp/c.mcc21$ amplxe-cl -finalize -r /tmp/c.mcc21
amplxe: Using result path `/tmp/c.mcc21'
amplxe: Executing actions 16 % Loading raw data to the database                
amplxe: Error: Cannot load data file `/tmp/c.mcc21/data.0/sep2b0a4d55c700.20160411T161114.414300.tb6' (tbrw call "TBRW_reading_section(ptr, m_sectionId)" failed: Section does not exist (78)).
amplxe: Executing actions 100 % done                                           
amplxe: Error: 0x4000001e (Cannot load raw collector data)

But when I add the -duration option, it works well. I want to get global information, not just information for a single point in time. What should I do?

Cannot resize application window width beyond screen size


I just installed Vtune 2016 Update 2 (build 444464) on Windows 7.

I have 4 vertical screens side by side.

If I move the VTune window, I have no problem positioning it so that it spans over two screens, but I cannot enlarge the width of the VTune window beyond the width of a single screen.

I suspect this is a bug.

Any suggestion from Intel support? Thanks

Regards,

Fabio

 
