Channel: Intel® Software - Intel® VTune™ Profiler (Intel® VTune™ Amplifier)
Viewing all 1574 articles

The total size of the result directory is too big.

I have a question about the meaning of the -data-limit option.

I'm using the command line interface to profile my application with the --data-limit=0 option.

The problem is that the total size of the result directory is too big. I have to measure metrics in different environments many times, but I cannot because of limited storage. The documentation says the default data limit is enough in normal cases, but I'm not sure about that, because the application I'm profiling is a PyTorch Python script that trains complex CNN models.

It has data preprocessing, data loading, training (forward and backward), and validation (forward only) code. With the default data limit, the measurement ends too early, and I don't think the result represents the overall characteristics of my Python script. Could you give me any tips for this situation?
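
A sketch of one possible middle ground, under stated assumptions: keep a finite but larger data limit and sample less often, so that a long training run fits in a bounded result directory. The code below only assembles the command line; it does not run VTune. The option spellings `-data-limit` (in MB) and `-knob sampling-interval` (in ms) are assumptions to confirm with `vtune -help collect` for the installed version, and `train.py`, `r_train` and the numbers are hypothetical.

```python
import shlex


def build_collect_command(result_dir, target_argv, data_limit_mb=None,
                          sampling_interval_ms=None, vtune="vtune"):
    """Assemble (but do not run) a VTune hotspots collection command.

    Option spellings are assumptions; check `vtune -help collect`.
    """
    cmd = [vtune, "-collect", "hotspots", "-r", result_dir]
    if data_limit_mb is not None:
        # Cap the raw data written to the result directory (in MB).
        cmd.append(f"-data-limit={data_limit_mb}")
    if sampling_interval_ms is not None:
        # A longer interval means fewer samples, so less data per second of run time.
        cmd += ["-knob", f"sampling-interval={sampling_interval_ms}"]
    cmd.append("--")  # everything after this is the profiled application
    cmd += list(target_argv)
    return cmd


if __name__ == "__main__":
    cmd = build_collect_command("r_train", ["python", "train.py"],
                                data_limit_mb=4000, sampling_interval_ms=10)
    print(shlex.join(cmd))
```

Whether a coarser interval still captures the phases of interest (data loading versus forward/backward) would need to be checked against a short run with the default settings.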

+) One more question about hardware event-based sampling in the threading collection mode: the elapsed time in this mode is about 20% longer than in the other modes (hotspots, memory access). Why does this happen? And when I write a report about this analysis, which elapsed time do you think will be meaningful to readers?

TCE Open Date: Thursday, November 14, 2019 - 20:54

amplxe: Error: [Instrumentation Engine]: Pin is out of memory

amplxe: Error: [Instrumentation Engine]: Pin is out of memory
amplxe: Collection failed.
amplxe: Internal Error

Environment:

  • Ubuntu 18.04, both in VirtualBox and bare metal
  • vtune_amplifier_2019.6.0.602217
  • 128GB system RAM, not remotely close to exhausted

Source Code:

https://github.com/EOSIO/eos

Run

scripts/eosio_build.sh -o RelWithDebInfo

Run

scripts/eosio_install.sh

Run

H=$(pwd)
DATE=$(which date)
TIMESTAMP=$("$DATE" +"%Y-%m-%d-%H-%M-%S")
NODEOS=~/eosio/2.0/bin/nodeos
NVER=$("$NODEOS" --version)
/opt/intel/vtune_amplifier/bin64/amplxe-cl -c hotspots -app-working-dir="$H" -data-limit=2048 -finalization-mode=full \
  -inline-mode=on -mrte-mode=native -target-duration-type=veryshort \
  -r "$H/hotspots-$NVER-$TIMESTAMP" -- \
  ~/eosio/2.0/bin/nodeos --data-dir "$H/data" --config-dir "$H/config" \
  --wasm-runtime eos-vm --eos-vm-oc-enable --force-all-checks --disable-replay-opts \
  --terminate-at-block 1000 \
  > "$H/stdout.txt" 2> "$H/stderr.txt"

TCE Open Date: Thursday, December 19, 2019 - 12:44

Intel HD graphics not detected

Hi 

I'm trying to profile my OpenCL application, written with PyOpenCL, using Intel VTune Profiler. My machine has Intel HD Graphics 630, which, as I understand it, can serve as a GPU to run my OpenCL computations.

I start the profiler with the Launch Application target, pointing it to the Python script I am executing. However, in the generated summary report, I noticed that the profiler picks up the Nvidia GeForce as the GPU to profile. I have an Nvidia GeForce 1060 graphics card mounted in the system, but its driver has not been installed in the OS yet.

I would like to know whether it is possible to select the GPU used for profiling. I'm attaching a screenshot of the summary page for debugging.
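
On the application side, PyOpenCL lets a script choose its OpenCL device explicitly instead of relying on a default, which at least ensures the kernels run on the HD Graphics 630. The sketch below assumes PyOpenCL's `cl.get_platforms()`, `Platform.get_devices()`, `Device.type`/`.vendor`/`.name`, `cl.device_type.GPU` and `cl.Context(devices=...)`; the helper names and the `"Intel"` vendor match are illustrative. Whether VTune then reports that GPU is a separate question this does not answer.

```python
def pick_device(devices, vendor_substring="Intel"):
    """Return the first device whose vendor string contains vendor_substring.

    `devices` can be any iterable of objects with `.vendor` and `.name`
    attributes, for example pyopencl.Device instances.
    """
    wanted = vendor_substring.lower()
    for dev in devices:
        if wanted in dev.vendor.lower():
            return dev
    raise LookupError(f"no OpenCL device with vendor matching {vendor_substring!r}")


def make_gpu_context(vendor_substring="Intel"):
    """Create a PyOpenCL context on a GPU whose vendor matches vendor_substring."""
    import pyopencl as cl  # local import: pick_device itself needs no PyOpenCL

    gpus = [
        dev
        for platform in cl.get_platforms()
        for dev in platform.get_devices()  # all device types on this platform
        if dev.type & cl.device_type.GPU
    ]
    dev = pick_device(gpus, vendor_substring)
    print(f"Using OpenCL device: {dev.name} ({dev.vendor})")
    return cl.Context(devices=[dev])
```

A context created this way is then passed to `cl.CommandQueue(ctx)` as usual. Filtering on `dev.type` rather than asking each platform for GPU devices only avoids an error on platforms that have no GPU.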

Thanks

TCE Open Date: Monday, December 30, 2019 - 20:01

Assertion failed: pytrace_collector710

After capturing perf data using the command:

     /home/tools/intel/vtune_amplifier_2018.1.0.535340/bin64/amplxe-cl -collect hotspots -run-pass-thru=--no-altstack <my binary>

I'm getting:

amplxe: Error: Assertion failed: pytrace_collector710: thread_manager_result == tpss_thread_manager_op_err_ok || thread_manager_result == tpss_thread_manager_op_err_acquired : Cannot acquire tsd for current thread (tid = 42407), result = 8. Please contact the technical support. 
amplxe: Error: Assertion failed: pytrace_collector710: thread_manager_result == tpss_thread_manager_op_err_ok || thread_manager_result == tpss_thread_manager_op_err_acquired : Cannot acquire tsd for current thread (tid = 42412), result = 8. Please contact the technical support. 
amplxe: Error: Assertion failed: pytrace_collector710: thread_manager_result == tpss_thread_manager_op_err_ok || thread_manager_result == tpss_thread_manager_op_err_acquired : Cannot acquire tsd for current thread (tid = 42417), result = 8. Please contact the technical support. 
amplxe: Error: Assertion failed: pytrace_collector710: thread_manager_result == tpss_thread_manager_op_err_ok || thread_manager_result == tpss_thread_manager_op_err_acquired : Cannot acquire tsd for current thread (tid = 42422), result = 8. Please contact the technical support. 
amplxe: Collection stopped.
amplxe: Using result path `/home/scratch.mwoodpatrick_inf/fsa/fsf_trees/Cuda_Test/450489367/output_12_31_19__0434/r000hs'

What do these errors mean, and what am I doing wrong? The report is generated and viewable. I'm still learning my way around the viewer, so I'm not sure the results make sense yet, but at first glance they don't look unreasonable. Is there any documentation that will help me understand what this error means?

Operating system and version: CentOS release 6.8 (Final)

Tool version: vtune_amplifier_2018.1.0.535340

Compiler version: gcc-5.4.0

TCE Open Date: Tuesday, December 31, 2019 - 05:46

oneAPI beta03: VTune vtsspp driver fails to load

Hello,

During installation of oneAPI beta03 on Ubuntu 18.04, the vtsspp driver failed to load.

I tried recompiling the VTune drivers. The modules recompiled successfully, but the problem remains. The output of running insmod-sep is below:

$ sudo ./insmod-sep 
Checking for PMU arbitration service (PAX) ... not detected.
Attempting to start PAX service ...
Executing: insmod ./pax/pax-x32_64-5.0.0-37-genericsmp.ko
Setting group ownership of devices to group "vtune" ... done.
Setting file permissions on devices to "660" ... done.
The pax driver has been successfully loaded.
PAX service has been started.
Checking for socperf driver ... not detected.
Executing: insmod ./socperf/src/socperf3-x32_64-5.0.0-37-genericsmp.ko
Setting group ownership of devices to group "vtune" ... done.
Setting file permissions on devices to "660" ... done.
The socperf3 driver has been successfully loaded.
Executing: insmod ./sep5-x32_64-5.0.0-37-genericsmp.ko
Setting group ownership of devices to group "vtune" ... done.
Setting file permissions on devices to "660" ... done.
The sep5 driver has been successfully loaded.
Checking for vtsspp driver ... not detected.
Executing: insmod ./vtsspp/vtsspp-x32_64-5.0.0-37-genericsmp.ko gid=1001 mode=0660
insmod: ERROR: could not insert module ./vtsspp/vtsspp-x32_64-5.0.0-37-genericsmp.ko: Argument list too long

Error:  vtsspp driver failed to load!

You may need to build vtsspp driver for your kernel.
Please see the vtsspp driver README for instructions.

Checking for socwatch driver ... not detected.
Executing: insmod ./socwatch/drivers/socwatch2_10.ko
The socwatch driver has been successfully loaded.

I have already tried increasing the system stack limit to 65536, but that did not solve the issue.
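
The `Argument list too long` text is the standard description of the `E2BIG` error code. `insmod` most likely prints it for the error returned by the kernel's module-loading call, so it is probably unrelated to the shell's argument-length limit (which on Linux is tied to the stack limit); that would explain why raising the stack limit changed nothing. A minimal sketch, using only the Python standard library, that maps such a message back to its errno name:

```python
import errno
import os


def errno_names_for_message(message):
    """Return the symbolic errno name(s) whose strerror() text equals message."""
    return sorted(
        name for code, name in errno.errorcode.items()
        if os.strerror(code) == message
    )


if __name__ == "__main__":
    # The text after the last colon in the insmod error line above.
    print(errno_names_for_message("Argument list too long"))
```

The kernel log (`dmesg`) right after the failed `insmod` is the usual place to look for the module's own reason; it is not included in this post.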

Can anyone advise a solution?

Thank you and Regards,

Amon

TCE Open Date: Friday, January 3, 2020 - 03:11

Getting Front-End Bound in a modulus operation

Hi,

I am using VTune Profiler to tune my code and getting the following:

Retiring = 45.4%, Front-End Bound = 30%

The line in question performs a modulus operation:

 currentSegmentIndex = marketDataTries % TOTAL_SEG;

The assembly code for this is:

movsxdl 0x238(%rbx), %rcx
xor %edx, %edx
movq 0xe8(%rbx), %rax
div %rcx
movl %edx, 0x100(%rbx)

Does anyone have suggestions for optimizations I can try here?
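
Two common ways to avoid the division are sketched below in Python for brevity. The divisor is loaded from memory (`movsxdl 0x238(%rbx), %rcx`), which suggests `TOTAL_SEG` is not a compile-time constant in this build, so the compiler has to emit a real 64-bit `div`; on many Intel cores before Ice Lake that instruction is delivered as a long microcode sequence, which can show up as Front-End Bound. Both rewrites rest on assumptions the post does not confirm: the mask needs a power-of-two segment count, and the wrap-around cursor needs `marketDataTries` to advance by exactly one between uses. The function and class names are hypothetical.

```python
def segment_by_mod(tries, total_seg):
    """Original form: one integer division per call."""
    return tries % total_seg


def segment_by_mask(tries, total_seg):
    """Bit-mask form; only equivalent when total_seg is a power of two."""
    if total_seg <= 0 or total_seg & (total_seg - 1):
        raise ValueError("total_seg must be a positive power of two")
    return tries & (total_seg - 1)


class SegmentCursor:
    """Wrap-around counter: a compare and a reset instead of a division.

    Only equivalent to tries % total_seg if advance() is called exactly
    once each time the tries counter is incremented by one.
    """

    def __init__(self, total_seg, start=0):
        self.total_seg = total_seg
        self.index = start % total_seg  # one division at setup only

    def advance(self):
        self.index += 1
        if self.index == self.total_seg:
            self.index = 0
        return self.index
```

In the C/C++ code, the cursor corresponds to keeping `currentSegmentIndex` as state and incrementing it with a compare-and-reset. Whether either change moves the Front-End Bound figure would have to be measured.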

Thanks and Regards 

Dipanker Singh

TCE Open Date: Thursday, January 9, 2020 - 00:15

Issues generating reports from sgx hotspots analysis data

Hi,

Software version: VTune Amplifier 2019 Update 8

As stated in the documentation, VTune SGX Hotspots analysis is deprecated in the GUI and only available via the CLI.

Question 1: Does this mean visualizing the result in the GUI is also deprecated? That is, can I collect SGX hotspots data via the CLI and open the result in amplxe-gui to accurately visualize SGX hotspots? I tried this and could only find the `Hotspots` and `Hotspots by CPU utilization` panes in the GUI, which made me question whether it was the right way to visualize the data.

Question 2: If visualization of SGX hotspots is not supported in the GUI, is there a way to do it from the command line? From what I have observed, the CLI also does not offer an `sgx-hotspots` report type.

Thanks,

Anubhav

TCE Open Date: Thursday, January 9, 2020 - 15:12

Missing analysis and counters on new laptop

Hello,

VTune seems to be missing almost all hardware counters on my new Surface Laptop 3 and provides no information beyond what other sampling profilers like Visual Studio or Very Sleepy provide. I am running the latest VTune 2020 (downloaded today) through the standalone GUI. The new laptop has an i5-1035G7 processor running Windows 10 version 1903 (OS Build 18362.592). Is the new CPU not supported yet? Can I expect it to work with a future VTune update?

Note that I also have an older laptop that correctly samples HW counters with VTune and supplies information about the bandwidth, bottlenecks, branch mispredictions, etc. Because I can get things working on another machine I'm guessing it isn't a mistake that I am making. The older working laptop is an i7-6700HQ.

Thanks for your time!

TCE Open Date: Friday, January 17, 2020 - 00:52

VTune hijacks the top spot of Visual Studio's project context menu

Hello,

I realize this message is going to sound absolutely silly...

The Visual Studio integration adds VTune at the very top of the per-project context menu. I have years of muscle memory of right-clicking a project and clicking Build (normally the first item), and this is throwing me off so badly that I'm uninstalling the extension, because right now is not the right time for me to readjust such a commonly used action.

Could the menu item be added anywhere else, or is there some option (even an undocumented one) to move it?

Thank you.

TCE Open Date: Friday, January 17, 2020 - 05:46

VTune - no Python source code

I am trying to use VTune 2020 to profile a Python script. However, I am running into two problems.

1. I cannot find the Call Stack pane on the right side of the Bottom-up view.

The example project that comes with VTune has a view like this.

But I get something like this:

The Call Stack pane on the right side, which is mentioned in every tutorial, is nowhere to be seen.

The second problem is that I cannot see any Python source file anywhere. I set the Managed code profiling mode to Auto.

I am using Python 3.7.1 from Anaconda on Windows 10. 

Any help is appreciated. 

TCE Open Date: Sunday, January 19, 2020 - 12:55

ERROR: CPU_CLK_UNHALTED.REF_P is not a valid event multiplexing trigger

Hello,

I am trying to use VTune Profiler on a FreeBSD target and got the following errors:

# /opt/intel/bin64/vtune -collect hotspots -knob sampling-mode=hw /bin/ls
vtune: Warning: On some systems based on the Intel microarchitecture code name Nehalem / Westmere with C-states enabled, this analysis type may cause system hanging due to a known hardware issue (see errata AAJ134 in http://download.intel.com/design/processor/specupdt/320836.pdf). To avoid this situation, disable all "Cn(ACPI Cn) report to OS" BIOS options before sampling with VTune Profiler on such systems.
***ERROR: could not retrieve time stamp!
ERROR: CPU_CLK_UNHALTED.REF_P is not a valid event multiplexing trigger
ERROR: Success
ERROR: Success
ERROR: Success
ERROR: There are no valid events specified - Sampling aborted
Run parameters are not valid - Aborting sampling run ...
Options error
vtune: Error: ***ERROR: could not retrieve time stamp!
ERROR: CPU_CLK_UNHALTED.REF_P is not a valid event multiplexing trigger
ERROR: Success
ERROR: Success
ERROR: Success
ERROR: There are no valid events specified - Sampling aborted
Run parameters are not valid - Aborting sampling run ...
Options error

vtune: Collection failed.
vtune: Internal Error

What is the problem, and how can I resolve it?

My CPU is Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz (3066.83-MHz K8-class CPU)

# /opt/intel/bin64/sep -platform-info
***ERROR: could not retrieve time stamp!
Sampling Enabling Product version: 5.14  built on Nov 26 2019 11:01:27
SEP User Mode Version: 5.14
SEP Driver Version: 5.14.3
PAX Driver Version: 1.0.2

Copyright(C) 2007-2019 Intel Corporation. All rights reserved.

total_number_of_processors  ...... 12
cpu_family ................ Intel(R) Xeon(R) Processor 980X series code named Westmere
cpu_model ................. 44 (0x2c)
cpu_stepping .............. 2 (0x2)
L1 Data Cache ............. 32KB, 8-way, 64-byte line size
                            2 HW threads share this cache, No SW Init Required
L1 Code Cache ............. 32KB, 4-way, 64-byte line size
                            2 HW threads share this cache, No SW Init Required
L2 Unified Cache .......... 256KB, 8-way, 64-byte line size
                            2 HW threads share this cache, No SW Init Required
L3 Unified Cache .......... 12MB, 16-way, 64-byte line size
                            No SW Init Required
Data TLB0 ................. 4-way, 2M/4M Pages, 32 entries
Data TLB .................. 4-way, 4K Pages, 64 entries
Instruction TLB ........... fully, 2M/4M Pages, 7 entries
Instruction TLB ........... 4-way, 4K Pages, 64 entries
64-byte Prefetching
Shared 2nd Level TLB ...... 4-way, 4K Pages, 512 entries

Device Type ............... Intel(R) Xeon(R) Processor 980X series code named Westmere
EMON Database ............. corei7wdp
number_of_selectors ....... 4
number_of_var_counters .... 4
number_of_fixed_ctrs....... 3
Fixed Counter Events:
counter 0 ................. INST_RETIRED.ANY
counter 1 ................. CPU_CLK_UNHALTED.THREAD
counter 2 ................. CPU_CLK_UNHALTED.REF
number of devices ......... 1
number_of_events .......... 788

Processor Features:
    (Thermal Throttling) (Enabled)
    (Hyper-Threading) (Enabled)
    (Number of Packages:    1)
    (Cores Per Package:    6)
    (Threads Per Package:  12)
    (Threads Per Core:      2)

TSC Freq .................. TBD MHz

CPU Freq (detected) ....... 3067.00 MHz

TCE Open Date: Monday, January 20, 2020 - 08:38

remote target ssh access

Good day.

I'm a new VTune user. My main question is: why doesn't VTune have its own SSH client? Do I understand correctly that I should use PuTTY as the SSH client and not another application? What should I do if my SSH client is SecureCRT? Where can I read about how VTune works together with PuTTY? Right now I have this problem: I have an SSH connection in PuTTY using an SSH key with a passphrase. When I enter username@host_name, VTune responds: "Cannot communicate with target, Please check password-less authentication request". But when I enter wrong_username@host_name, VTune responds: "Server refused our key". Is the problem the passphrase on the SSH key?
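
The error text asks for password-less authentication, so one quick check, run from the same Windows account, is whether the SSH client can log in without any prompt at all. Both common clients have a non-interactive switch for this: OpenSSH's `-o BatchMode=yes` and PuTTY `plink`'s `-batch`. The sketch assumes the chosen client is on `PATH`; the function names are hypothetical. With a passphrase-protected key, the probe is expected to fail unless an agent such as Pageant (for PuTTY) or `ssh-agent` already holds the unlocked key.

```python
import subprocess


def noninteractive_ssh_argv(target, client="ssh"):
    """Build a command that runs `exit` on target and fails instead of prompting."""
    if client == "plink":
        # PuTTY's command-line client; -batch disables all interactive prompts.
        return ["plink", "-batch", target, "exit"]
    if client == "ssh":
        # OpenSSH; BatchMode=yes disables password and passphrase prompts.
        return ["ssh", "-o", "BatchMode=yes", target, "exit"]
    raise ValueError(f"unsupported client: {client!r}")


def can_login_without_prompt(target, client="ssh", timeout=30):
    """Return True if the client logs in and runs `exit` with status 0."""
    try:
        result = subprocess.run(noninteractive_ssh_argv(target, client),
                                capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Client missing, not executable, or the connection hung.
        return False
    return result.returncode == 0
```

For example, `can_login_without_prompt("username@host_name", client="plink")` returning `False` while an interactive PuTTY session works would point at a prompt (such as the key passphrase) rather than at the network.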

Intel® VTune™ Profiler 2020 is now available

Intel® VTune™ Profiler 2020 (formerly Intel® VTune™ Amplifier) is now available. https://software.intel.com/en-us/vtune

Features Overview

  • Command line interface amplxe-cl and GUI interface amplxe-gui were renamed to vtune and vtune-gui, respectively
  • Intel® VTune™ Profiler has been updated to include more recent versions of 3rd party components, which include functional and security updates. Users should update to the latest version.
  • GPU accelerators support:
    • New GPU Offload analysis added to explore and correlate code execution across CPUs and GPUs. You can identify a kernel of interest for GPU-bound applications and explore further with GPU Compute/Media Hotspots analysis.
    • GPU Compute/Media Hotspots analysis updated with GPU in-kernel analysis for OpenCL™ code and an option to filter by a kernel of interest.
    • Command line hotspots report now supports GPU analysis types. You can apply the computing-task and computing-instance groupings to your collected data to focus on time-intensive computing tasks.
    • Dynamic instruction count collection (available as part of the GPU Compute/Media Hotspots Analysis) improved to provide better accuracy for basic block Assembly analysis.
    • Support for Intel® Processor Graphics Gen11.
  • Platform analysis support:
    • System Overview analysis updated to serve as an entry point to platform analysis. Use this analysis to assess system (IO, accelerators and CPU) performance and review guidance for next steps.
    • New Hardware Tracing mode in the System Overview analysis enables application analysis on the micro-second level and identification of causes for latency issues.
  • HPC analysis improvements:
    • Max and Bound Bandwidth metrics added to Application Performance Snapshot to better estimate the efficiency of the DRAM, MCDRAM, Persistent Memory and Intel® Omni-Path usage.
  • Platform Profiler new features and improvements:
    • Overview and Memory views extended with new metrics to analyze Non-Uniform Memory Access (NUMA) behavior.
    • User authentication and authorization implemented to enable access control to user data.
    • Added a new option for users to choose or modify the location of Platform Profiler data files.
  • Energy analysis improvements:
    • New Throttling analysis added to identify causes for system throttling, including violation of safe thermal or power limits.
    • Options for Energy analysis, based on the Intel SoC Watch data collector, extended to monitor processor package energy consumption over time and identify how it correlates with CPU throttling.
  • Cloud and containerization support:
    • Containerization support extended with an option to install and run VTune™ Profiler in a Docker* container and profile targets inside and outside the same container.
    • Added support to profile applications running in Amazon Web Services* (AWS) EC2 Instances based on Intel microarchitecture code name Cascade Lake X.
  • New Fabric Profiler performance tool added to VTune™ Profiler in Preview mode. Use Fabric Profiler to identify detailed characteristics of the runtime behavior for an OpenSHMEM application.
  • Quality and usability improvements:
    • Symbol resolution for effective source-level analysis enabled for crossgen (Ahead-of-JIT compilation) functions on Linux* systems.
    • Interactive Help Tour (available on the Welcome page) guides you through the product interface using a sample project.
  • New hardware/operating systems/IDEs support:
    • 10th Gen Intel® Core™ processors
    • Ubuntu* 19.10
    • Microsoft* Windows* 10, November 2019 Update
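
Because `amplxe-cl` and `amplxe-gui` were renamed to `vtune` and `vtune-gui` in this release, scripts that call the old names may need to accept either one during a transition. A minimal sketch, assuming the tool is on `PATH`; the preference order (new name first) and the helper name are illustrative choices, not something the release notes prescribe.

```python
import shutil

# New 2020 name first, then the pre-2020 name.
CLI_CANDIDATES = ("vtune", "amplxe-cl")
GUI_CANDIDATES = ("vtune-gui", "amplxe-gui")


def find_tool(candidates, which=shutil.which):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print("CLI:", find_tool(CLI_CANDIDATES))
    print("GUI:", find_tool(GUI_CANDIDATES))
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it is `shutil.which`.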

APS data validity without perf_event_paranoid and VTune Profiler drivers

Hi,
I recently installed Intel Parallel Studio XE 2018u4.
I have gathered some data for application profiling on an Intel 8280 CPU using the Intel APS tool (shipped with 2018u4), but I did not enable system-wide monitoring when running the APS binary. Specifically, I did not:

1) set /proc/sys/kernel/perf_event_paranoid to 0, or
2) install the VTune Profiler sampling drivers.
I skipped these steps because they were listed as optional, and I am not too worried about collection overhead.

I gathered data using the following commands:
a) mpiexec.hydra -genvall -np $SLURM_NPROCS -ppn $SLURM_NTASKS_PER_NODE aps -c mpi ./wrf.exe
b) mpiexec.hydra -genvall -np $SLURM_NPROCS -ppn $SLURM_NTASKS_PER_NODE aps ./wrf.exe

Is the data gathered with these commands valid for analysis? Some time ago I noticed a significant difference in the performance metrics reported by command b) (Linux Perf vs. PMU analysis).

I hope the MPI data gathered with "aps -c mpi" is fine without the drivers. According to this document, the drivers seem to help only with collecting metrics for analyses such as HPC Performance Characterization, Memory Access, and Microarchitecture Exploration.
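
Whether the first optional step was in effect on a given compute node can be checked by reading `/proc/sys/kernel/perf_event_paranoid` on that node (under `mpiexec`, each node may differ). The sketch below only reads and describes the value; the level summaries paraphrase the Linux kernel's sysctl documentation for levels -1 to 2, and some distributions add stricter levels, so treat the wording as approximate. It does not answer whether the APS numbers are valid.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")

# Paraphrase of the kernel documentation; restrictions are cumulative.
LEVELS = {
    -1: "almost all events allowed for all users",
    0: "system-wide CPU events allowed; raw tracepoints need CAP_SYS_ADMIN",
    1: "system-wide CPU events need CAP_SYS_ADMIN; per-process events allowed",
    2: "as level 1, and kernel profiling also needs CAP_SYS_ADMIN",
}


def read_paranoid(path=PARANOID_PATH):
    """Return the integer setting, or None if the file is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(level):
    """Map a perf_event_paranoid value to a short, approximate description."""
    if level is None:
        return "perf_event_paranoid not available"
    if level in LEVELS:
        return LEVELS[level]
    if level > 2:
        return "stricter than level 2 (distribution-specific)"
    return "unrecognized value"


if __name__ == "__main__":
    print(describe(read_paranoid()))
```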

Amplifier cannot detect remote machine configuration.

I am trying to use VTune to get memory consumption for my binaries. I have installed VTune on Windows and am trying to connect to a remote Linux machine over SSH to get the results. However, it doesn't move past the first step and gives the error "Amplifier cannot detect remote machine configuration."

From another thread of yours, I picked this command:

INSTALL_VTUNE_DIR/bin64/amplxe-python INSTALL_VTUNE_DIR/bin64/amplxe-runss.py --no-modules --log-folder=WRITABLE_FOLDER --target-system=ssh:USER@TARGET_IP --target-install-dir=/opt/intel/vtune_amplifier_2018.3.0.566015 --ui-output-format xml --context-value-list

<?xml version="1.0" encoding="UTF-8"?>
<feedback><message severity="info">Cannot find product on the device. Enabling automatic installation...</message>
<nop/><message severity="info">Installing the package to <USER>@<MACHINE-NAME></message>
<nop/>#########################################################################
  Programs and data  held on this machine are  private property  and ## may be accessed  only  by authorised users  for purposes  which have ## been authorised.
  ####   Unauthorised  access  to  this system  by  employees  contravenes
  ## company rules and is  a disciplinary offence.  Intentional unauthorised access by  any person  is a  criminal offence  and  may lead to ## criminal penalties and civil damages.
  ##
  ##       IF YOU ARE NOT AN AUTHORISED USER DISCONNECT IMMEDIATELY.      ##

These messages are repeated a few times (maybe retries), but the command doesn't exit with a specific error. Do you think this is a permissions issue?


(Platform Profiler) What is Read/Write Hit Ratio in Persistent Memory Traffic?

Hi,

I ran the latest version of Platform Profiler with Optane DC Persistent Memory in Memory Mode and uploaded the result file to the Platform Profiler web page (localhost:6543).

In Memory view, I can see the 'Read Hit Ratio / Write Hit Ratio' graphs in 'Persistent Memory Traffic (per DIMM)'.

But what is Read Hit Ratio / Write Hit Ratio?

Memory mode metrics (read miss rate of DRAM Cache) are already shown separately in Memory view.

Also, what is '% of Non-Inclusive Writes to Near Memory' in Memory mode? I'm guessing this is writes which are evicted from CPU LLC (Last-level cache) and entering DRAM Cache. Is this correct?

Q1. What is Read Hit Ratio / Write Hit Ratio in Persistent Memory Traffic (per DIMM)?

Q2. What is % of Non-Inclusive Writes to Near Memory? 

Best regards,

Minjae Kim

vtune ssh support destroys remote directories

Just tried vtune for the first time, connecting to a remote linux machine, and it obliterated the file system on the remote machine.

This is easily reproducible. When typing the ssh information, the GUI will *dynamically*, at any typing pause, attempt to log in to the remote machine, and erase whatever path or partial path is currently in the input form. E.g. if the installation directory is in /home/ubuntu/intel, and you start typing "/ h o m e" and then pause, it starts deleting /home. If you get as far as "/ h o m e / u b u n t u" and then pause, it erases /home/ubuntu.

This is pretty remarkable behavior. Be sure never to pause while typing, if you like your files.

Not able to Start Platform Profiler

I get the following error message when I launch Platform Profiler:

Error : server terminated with code 2. See logs/server.log for 
more information.['nexus', 'launch',
'C:\\PROGRA~3\Intel\\VTune Amplifier Platform Profiler\\data\\datasets',
'-p', '6543', '--local-ini-path',
'C:\\PROGRA~2\\INTELS~1\\VTUNEA~1\\vpp\\server\\ppe\\local.ini']

Using Intel VTune Profiler 2019 Update 8

Product Build : 604197

vtune profiler 2020: memory access collector, LLC count zero

Operating system and version

NAME=Fedora
VERSION="27 (Twenty Seven)"
ID=fedora
VERSION_ID=27
PRETTY_NAME="Fedora 27 (Twenty Seven)"
ANSI_COLOR="0;34"
CPE_NAME="cpe:/o:fedoraproject:fedora:27"
HOME_URL="https://fedoraproject.org/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=27
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=27
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
Fedora release 27 (Twenty Seven)
Fedora release 27 (Twenty Seven)
cpe:/o:fedoraproject:fedora:27

Kernel version: 4.18.8-100.fc27.x86_64

Tool version

Intel(R) VTune(TM) Profiler 2020 (build 605129) Command Line Tool
Copyright (C) 2009-2019 Intel Corporation. All rights reserved.

Compiler version
gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)

Steps to reproduce the error

VTune with the memory-access collector always shows LLC miss counts as zero. Persistent Memory Bound is also 0%, which is suspicious considering the high number of parallel I/O threads. I have tried changing the kernel (5.0 and 5.3 on another system with similar CPUs), using driverless mode instead of event-based profiling, rebuilding VTune with different options (per-user vs. system-wide), and allowing multiple runs to avoid multiplexing PMC counters. Nothing has worked so far, and the LLC counter is always zero.

The following is the VTune log of a sample run demonstrating the problem. I would highly appreciate it if someone could shed some light on this issue. Thanks in advance.

    CPU Time: 75.415s
    Memory Bound: 48.7% of Pipeline Slots
     | The metric value is high. This may indicate that a significant fraction
     | of execution pipeline slots could be stalled due to demand memory load
     | and stores. Explore the metric breakdown by memory hierarchy, memory
     | bandwidth information, and correlation by memory objects.
     |
        L1 Bound: 17.8% of Clockticks
         | This metric shows how often machine was stalled without missing the
         | L1 data cache. The L1 cache typically has the shortest latency.
         | However, in certain cases like loads blocked on older stores, a load
         | might suffer a high latency even though it is being satisfied by the
         | L1.
         |
        L2 Bound: 0.2% of Clockticks
        L3 Bound: 0.3% of Clockticks
        DRAM Bound: 0.0% of Clockticks
            DRAM Bandwidth Bound: 0.0% of Elapsed Time
        Store Bound: 12.6% of Clockticks
        NUMA: % of Remote Accesses: 0.0%
        UPI Utilization Bound: 0.0% of Elapsed Time
        Persistent Memory Bound: 0.0% of Clockticks
            Persistent Memory Bandwidth Bound: 0.0% of Elapsed Time
    Loads: 87,025,610,690
    Stores: 36,653,099,560
    LLC Miss Count: 0
        Local DRAM Access Count: 0
        Remote DRAM Access Count: 0
        Local Persistent Memory Access Count: 0
        Remote Persistent Memory Access Count: 0
        Remote Cache Access Count: 0
    Average Latency (cycles): 48
    Total Thread Count: 30
    Paused Time: 0s

Bandwidth Utilization
Bandwidth Domain                          Platform Maximum  Observed Maximum  Average  % of Elapsed Time with High BW Utilization(%)
----------------------------------------  ----------------  ----------------  -------  ---------------------------------------------
DRAM, GB/sec                              220                         18.300    6.460                                           0.0%
DRAM Single-Package, GB/sec               110                         18.200    6.706                                           0.0%
UPI Utilization Single-link, (%)          100                         10.800    6.538                                           0.0%
Persistent Memory, GB/sec                 60                          10.600    5.603                                           0.0%
Persistent Memory Single-Package, GB/sec  30                          10.600    5.207                                           0.0%
Collection and Platform Info
    Application Command Line: mpirun "--cpu-set""24-43""-np""20""--wdir""./writer""--bind-to""core""--mca""btl""tcp,self""../workflowwriters""1""67108864""16"
    User Name: ranjan
    Operating System: 4.18.8-100.fc27.x86_64 NAME=Fedora VERSION="27 (Twenty Seven)" ID=fedora VERSION_ID=27 PRETTY_NAME="Fedora 27 (Twenty Seven)" ANSI_COLOR="0;34" CPE_NAME="cpe:/o:fedoraproject:fedora:27" HOME_URL="https://fedoraproject.org/" SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Fedora" REDHAT_BUGZILLA_PRODUCT_VERSION=27 REDHAT_SUPPORT_PRODUCT="Fedora" REDHAT_SUPPORT_PRODUCT_VERSION=27 PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
    Computer Name: aep03
    Result Size: 55 MB
    Collection start time: 23:38:34 31/01/2020 UTC
    Collection stop time: 23:38:43 31/01/2020 UTC
    Collector Type: Event-based sampling driver
    CPU
        Name: Intel(R) Xeon(R) Processor code named Cascadelake
        Frequency: 2.394 GHz
        Logical CPU Count: 96
        Max DRAM Single-Package Bandwidth: 110.000 GB/s

Vector instruction set in HPC Perf. Characterization

Hi,

I am currently profiling a program by using VTune.

The attached image shows more than two vector instruction sets.

I am wondering: when more than two vector instruction sets are listed in "HPC Performance Characterization", does this mean that a given function uses all of these instruction sets, or are they possible candidate instruction sets for that function?

It would be appreciated if anyone could let me know.

Thanks.

TCE Open Date: Monday, February 10, 2020 - 19:55