digitalmars.D.bugs - [Issue 3742] New: Please add support for 'Lightweight Profiling' which adds a set of user-controlled counters to the AMD64 architecture

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3742

           Summary: Please add support for 'Lightweight Profiling' which
                    adds a set of user-controlled counters to the AMD64
                    architecture
           Product: D
           Version: future
          Platform: x86_64
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: DMD
        AssignedTo: nobody puremagic.com
        ReportedBy: nick.barbalich gmail.com


--- Comment #0 from nick barbalich <nick.barbalich gmail.com> 2010-01-25 16:44:15 PST ---
Late in 2007, AMD announced Lightweight Profiling as a proposed extension to
the AMD64 architecture that would allow an application to gather performance
statistics about itself with low overhead. We [AMD] posted the preliminary
specification and asked for feedback from the developer community. Much to our
delight, many of you responded with comments, criticisms, and suggestions on
the proposal. We've read all of your feedback, and last week we posted the
current version of the LWP specification. The announcement and the link to the
spec are here. Thanks to all of you who helped us out.

What came before...

It's important to be able to measure the details of a program's performance in
order to find ways to speed it up. Until now, there have been just two ways to
do this. The first is via instrumentation, i.e., adding code to the program to
watch the clock, or the cycle counter, or just to count the number of times an
instruction or loop is executed. Instrumentation can be added by the programmer
or by a compiler. Unfortunately, it seriously perturbs the application, and the
instrumented code usually doesn't have the same characteristics as the original
code, especially when dealing with the data and instruction caches. Also,
instrumentation can't observe the hardware caches, so it can't gather data
about cache behavior.

The second traditional method of monitoring performance is to use the hardware
performance counters. These count hardware events and generate an interrupt
after a programmed number of events have happened. The counters can report on
events that are too hard to instrument (like counting each x86 instruction) or
are not visible to software (like cache misses). These counters are used by the
AMD CodeAnalyst Performance Analyzer and provide deep insight into application
and system performance. However, each time a data sample is gathered, the
processor must take an interrupt to a kernel-mode driver, and that takes
hundreds or thousands of cycles. The driver, by simply executing, changes the
contents of the data cache and the instruction cache and may perturb the
application's performance. The counters can only be configured, started, and
stopped from kernel mode, so an application must call a driver or the operating
system to control them. Finally, some systems do not context-switch the
performance counters when changing threads or processes, and on those systems,
performance monitoring can only be done globally by a single user at a time.

Introducing LWP

After reading about current technology, you might think that an ideal
performance monitor should:

    * Operate entirely in user mode
    * Cause little or no perturbation of the application
    * Be controlled separately for each thread
    * Have low overhead to allow for higher sampling rates

And that describes LWP!

Lightweight Profiling adds a set of user-controlled counters to the AMD64
architecture. They can monitor multiple events simultaneously. An application
thread starts profiling by providing the address of an LWP control block
(LWPCB) as the operand to the new LLWPCB instruction. The contents of the LWPCB
specify which events to count and how often to count them. The LWPCB also
points to a ring buffer in the application's memory into which the hardware
will store event records. That's it.

Once started, LWP counts the specified events. When an event counter
underflows, LWP stores an event record at the head of the ring buffer and resets
the counter. (If requested, LWP randomizes the bottom bits of the new counter
value to prevent "beating" against constant length loops.) LWP stores the
record without interrupting the flow of the program, so the only perturbation
to the program's performance is writing the record (usually affecting only a
single data cache line) and a few cycles to perform the write. The record
contains the event type, the address of the instruction that caused the
underflow, and other information about the event. All event types share one
ring buffer and can be sorted out by the event type field in the record.
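
As a rough illustration only, the counting behavior described above can be
modeled in software. This is a Python sketch, not the hardware interface or the
record layout from the AMD specification; EventCounter, EventRecord, by_type
and the event names are all hypothetical, and the sketch assumes that a counter
loaded with N underflows on event N+1.

```python
import random
from collections import namedtuple

# Hypothetical record layout; the real LWP record format is defined by AMD.
EventRecord = namedtuple("EventRecord", "event_type address")


class EventCounter:
    """Software model of one LWP-style event counter (illustrative only)."""

    def __init__(self, event_type, interval, random_bits=0, rng=None):
        self.event_type = event_type
        self.interval = interval        # value reloaded after each underflow
        self.random_bits = random_bits  # how many low bits to randomize
        self.rng = rng or random.Random(0)
        self.value = self._reset_value()

    def _reset_value(self):
        value = self.interval
        if self.random_bits:
            # Randomize the bottom bits so sampling does not "beat" against
            # loops of constant length.
            mask = (1 << self.random_bits) - 1
            value = (value & ~mask) | self.rng.getrandbits(self.random_bits)
        return value

    def count(self, address, ring):
        """Count one event; on underflow, store a record and reset."""
        self.value -= 1
        if self.value < 0:
            ring.append(EventRecord(self.event_type, address))
            self.value = self._reset_value()


def by_type(ring):
    """Sort the shared buffer out by the event type field."""
    groups = {}
    for record in ring:
        groups.setdefault(record.event_type, []).append(record)
    return groups


# Two event types feeding one shared buffer (a plain list in this model).
ring = []
instructions = EventCounter("instructions_retired", interval=3)
misses = EventCounter("dcache_miss", interval=1)
for address in range(0x1000, 0x1008):
    instructions.count(address, ring)
    if address % 4 == 0:  # pretend every fourth instruction misses the cache
        misses.count(address, ring)
```

Here a plain list stands in for the ring buffer; the head and tail handling is
left out of this sketch to keep the counter logic visible.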

Of course, eventually the buffer will fill up. What then? Well, a program has
two options for emptying the ring buffer. First, it can simply poll the buffer
and remove event records from the tail of the ring. When software rewrites the
tail pointer, the LWP hardware knows it can reuse the newly emptied region of
the ring buffer. Since the buffer is in user memory, the program can even share
the memory with another process, and that second process can be responsible for
draining the buffer. Second, the application can specify that it wants LWP to
generate an interrupt when the ring buffer is filled past a certain threshold.
For instance, it can configure a buffer to hold 10,000 event records and tell
LWP to interrupt whenever there are more than 9,000 records in the buffer. The
interrupt does indeed perturb the program, but it does so 1/9000th as often as
the traditional performance counters would. Better still, since the buffer is
in user memory, the application can catch the interrupt and do whatever it
wants with the data. It can store it to disk for later analysis, or it can
process it immediately and even try to fix performance problems as they are
happening.
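
The two draining strategies can be sketched the same way. This is a minimal
Python model, assuming a hypothetical RingBuffer class whose on_threshold
callback stands in for the interrupt; dropping records when the buffer is full
is an assumption of the model, not a statement about the hardware.

```python
class RingBuffer:
    """Software model of an LWP-style ring buffer (illustrative only)."""

    def __init__(self, capacity, threshold, on_threshold=None):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.threshold = threshold        # "interrupt" when count exceeds this
        self.on_threshold = on_threshold  # stands in for the interrupt handler
        self.head = 0     # next slot the "hardware" writes
        self.tail = 0     # next slot software reads
        self.count = 0    # records currently in the buffer
        self.dropped = 0  # assumption: records are dropped when full

    def store(self, record):
        """Hardware side: write a record at the head of the ring."""
        if self.count == self.capacity:
            self.dropped += 1
            return False
        self.slots[self.head] = record
        self.head = (self.head + 1) % self.capacity
        self.count += 1
        if self.on_threshold and self.count > self.threshold:
            self.on_threshold(self)
        return True

    def drain(self):
        """Software side: remove records from the tail and advance it."""
        records = []
        while self.count:
            records.append(self.slots[self.tail])
            self.tail = (self.tail + 1) % self.capacity
            self.count -= 1
        return records


# The numbers from the text: room for 10,000 records, "interrupt" once there
# are more than 9,000 of them.
batches = []
buffer = RingBuffer(10_000, 9_000,
                    on_threshold=lambda ring: batches.append(ring.drain()))
for sample in range(20_000):
    buffer.store(sample)

# Polling instead: with no callback, software calls drain() whenever it likes.
leftover = buffer.drain()
```

With these settings the callback runs once per 9,001 stored records rather than
once per sample, which is the point of the "1/9000th as often" remark.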

In addition, LWP is a per-thread feature. Each thread on the system can be
monitoring different events at different rates without interference. If a
thread is not using LWP, there is no impact on its performance even if other
threads have LWP active.

Some LWP Details

The LWP events are a small subset of the events available in the traditional
performance counters. They include Instructions Retired, Branches Retired, and
DCache Misses. The Branches Retired event can be filtered by whether the branch
is direct or indirect, conditional or unconditional, or other criteria. It
captures the target address of the branch, a useful value when looking at
indirect branches. The DCache miss event can be filtered by cache level to
capture only "expensive" cache misses.

One exciting feature of LWP is the ability to insert events into the ring
buffer under program control. There are two new instructions to do this:

    * LWPINS inserts a record into the ring buffer containing data taken from
the arguments to the instruction. A program can use LWPINS to insert a marker
to indicate an important event, such as loading or unloading a shared library,
that influences the way addresses should be interpreted in subsequent event
records.
    * LWPVAL uses an event counter and decrements the counter each time it is
executed, much the way the hardware event counters work. When the counter
underflows, it inserts a record into the ring buffer containing data from its
arguments. A program uses LWPVAL to implement a technique called value
profiling. For instance, it can profile the divisor of a commonly executed DIV
instruction and if the data show that the divisor is frequently the same
number, it can rewrite the instruction to test for that value and execute an
optimized code sequence. Similarly, it can profile the target of a hot indirect
branch and generate better code if one way of the branch is dominant.
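
A hedged sketch of the value-profiling idea, again as a Python model rather
than real LWPVAL usage: ValueProfiler, dominant_value and make_divider are
hypothetical names, and the power-of-two fast path is only one example of an
"optimized code sequence" a program might choose.

```python
from collections import Counter


class ValueProfiler:
    """Models LWPVAL: sample a value each time the counter underflows."""

    def __init__(self, interval):
        self.interval = interval
        self.remaining = interval
        self.samples = []

    def lwpval(self, value):
        self.remaining -= 1
        if self.remaining < 0:
            self.samples.append(value)  # the "record" holds the profiled value
            self.remaining = self.interval


def dominant_value(samples, min_share=0.8):
    """Return the most common sampled value if it is frequent enough."""
    if not samples:
        return None
    value, hits = Counter(samples).most_common(1)[0]
    return value if hits / len(samples) >= min_share else None


def make_divider(hot_divisor):
    """Build a divide routine that tests for the hot divisor first."""
    is_power_of_two = (
        hot_divisor is not None
        and hot_divisor > 0
        and (hot_divisor & (hot_divisor - 1)) == 0
    )
    if not is_power_of_two:
        return lambda n, d: n // d
    shift = hot_divisor.bit_length() - 1

    def divide(n, d):
        if d == hot_divisor:
            return n >> shift  # same result as n // hot_divisor for ints
        return n // d

    return divide


# Profile the divisor of a frequently executed division.
profiler = ValueProfiler(interval=9)  # sample every 10th execution
for i in range(1000):
    divisor = 3 if i % 7 == 0 else 8  # the divisor is usually 8
    profiler.lwpval(divisor)
    _ = i // divisor

hot = dominant_value(profiler.samples)
divide = make_divider(hot)
```

Only one execution in ten is recorded, so the profile is cheap to collect, yet
it is still enough to show that one divisor dominates.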

Who will use LWP?

LWP can be used in many different application environments. These include:

    * Managed Runtime Environment: Managed Runtimes (MRTEs) are programming
environments such as Java and the Microsoft® .NET Framework. These environments
have the ability to generate AMD x86 or x64 code for routines coded in a high
level managed language (such as Java or C#), and they can do that on the fly as
a program is running. The MRTE can enable LWP and periodically look for
performance problems. If (when!) it finds them, it can generate better code for
the hot spots and improve the program's overall performance. LWP is lightweight
enough that it can run continuously.
    * Dynamic Optimizer: A Dynamic Optimizer is a program that monitors an
application and attempts to improve its performance by modifying it as it runs.
In this case, the target application is compiled to native code from a
traditional language like C or C++. The Dynamic Optimizer can gather
performance data without affecting the flow of control in the application.
    * Compiler Feedback: Most modern compilers have an option to build an
instrumented program which the developer runs to gather information on the
program's performance. Unfortunately, the added instrumentation (and the fact
that optimization levels are often cranked down in a feedback compilation)
perturbs the program so much that what's being measured is substantially
different from the "real" program. With LWP, the compiler can gather statistics
on the program execution without changes, and it can insert LWPVAL instructions
to profile interesting areas without adding a large block of instrumentation
code and without clobbering any registers. If the application runs without
turning on LWP, the LWPVAL instructions act as NOPs and only take a few cycles.


Note the above has been taken from:
http://forums.amd.com/devblog/blogpost.cfm?catid=208&threadid=116487&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+AmdDeveloperBlogs+%28AMD+Developer+Blogs%29

The latest revision of the Lightweight Profiling specification (v3.03) contains
updates that are a direct result of AMD community feedback, and can be found
here:
http://support.amd.com/us/Processor_TechDocs/43724.pdf

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jan 25 2010
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3742


nick barbalich <nick.barbalich gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |performance
                 CC|                            |nick.barbalich gmail.com


--- Comment #1 from nick barbalich <nick.barbalich gmail.com> 2010-01-25 16:50:07 PST ---
Correction.  The current version is 3.04 dated August 2009.

Jan 25 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3742


Brad Roberts <braddr puremagic.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |braddr puremagic.com


--- Comment #2 from Brad Roberts <braddr puremagic.com> 2010-01-25 17:38:52 PST ---
I think you're going to need to be a lot more specific about what you think
needs to be done with D.  Chances are high it's not a compiler change but
rather some library changes in either the core runtime or in phobos.  However,
whatever it is, it'd need to be done in a cpu agnostic manner with the
potential to be optimized for AMD.

Jan 25 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3742


Don <clugdbug yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |clugdbug yahoo.com.au


--- Comment #3 from Don <clugdbug yahoo.com.au> 2010-02-04 06:44:53 PST ---
I'm not sure that this is terribly relevant to D.
For in-depth analysis, this feature is nowhere near as comprehensive as the
hardware performance counters which have existed since the Pentium MMX days
(and which seem very under-utilized -- they are really fantastic). I think AMD
is being deliberately misleading. I take offence to this section:
"These counters are used by the AMD CodeAnalyst Performance Analyzer and
provide deep insight into application
and system performance. However, each time a data sample is gathered, the
processor must take an interrupt to a kernel-mode driver, and that takes
hundreds or thousands of cycles."

It *sounds* as though this is talking about the hardware performance counters.
But it ISN'T!! It's talking about their "performance analyzer". It is possible
to get the same data they're talking about, without calling into kernel mode
all the time. You do NOT need to use their driver.

So, it's maybe a 5-10% improvement over what already exists. May be useful for
JIT compilers, but it isn't really anything to get terribly excited about.

Feb 04 2010
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3742


Brad Roberts <braddr puremagic.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Platform|x86_64                      |x86


--- Comment #4 from Brad Roberts <braddr puremagic.com> 2011-02-06 15:40:00 PST ---
Mass migration of bugs marked as x86-64 to just x86.  The platform it runs on
isn't what's relevant; what matters is whether the app is a 32- or 64-bit app.

Feb 06 2011
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3742


Don <clugdbug yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID


--- Comment #5 from Don <clugdbug yahoo.com.au> 2012-02-01 22:30:23 PST ---
Invalid - processor specific, belongs in a library, too vague to be actionable.

Feb 01 2012