Timing Verification of Real-Time Systems
- A Window is Closing -
Reinhard Wilhelm
Universität des Saarlandes
DS Adaptive Isolation for
Predictability and Security
My Message
-2-
• Hard real-time embedded systems are here to stay, and
increasingly so, e.g. in autonomous driving
• Timing verification for high-performance platforms has been
possible and has been practiced from roughly 2001 until
today
• It is made impossible by new architectural developments
• The problem remains, the potential to solve it disappears
• What is the alternative?
Structure of the Talk
• The problem: determining bounds on
execution times
• Increasing complexity by architectural
developments
• Predictability research – where is our
impact?
• The PROMPT vision
-3-
Deriving Run-Time Guarantees for
Hard Real-Time Systems
-4-
The simplest problem statement: Given
1. an uninterrupted, terminating software to produce a
reaction,
2. a (single-core) hardware platform, on which to execute
the software,
3. a required reaction time.
Derive: a guarantee for timeliness
Complexity increased by preemptive scheduling, more
complex architectures, e.g. multi-core platforms
Goal: Efficiently and precisely predictable good worst-case
performance
-5-
Timing Analysis
• Sound methods determine upper bounds for all
execution times.
• Timing analysis can be seen as the search for a longest path
– through different types of graphs,
– through a huge space of paths.
1. I will show how this huge state space originates.
2. I will show how, and how far, we can cope with this huge
state space.
Decidability is not the problem! - It’s Complexity!
-6-
Timing Analysis – the Search Space (Historic)
• All control-flow paths (through the binary
executable) – depending on the possible inputs.
• Feasible as search for a longest path if
– Iteration and recursion are bounded,
– Execution times of instructions are (positive)
constants.
• Timing schema (Shaw’91) for induction over the
structure of the program
[Figure: search space spanned by Input, Software, and Architecture (constant execution times)]
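A minimal sketch of a timing schema in the spirit of the bullet above. The mini program representation (`Basic`, `Seq`, `If`, `Loop`) and all cycle counts are hypothetical and only serve as illustration. The bound is computed by induction over the program structure, which works only because every basic block has a constant execution time and every loop carries an iteration bound.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical mini-AST; these names are illustrative and not from the talk.
@dataclass
class Basic:
    cycles: int            # constant execution time of a basic block

@dataclass
class Seq:
    first: "Node"
    second: "Node"

@dataclass
class If:
    cond: "Node"
    then: "Node"
    orelse: "Node"

@dataclass
class Loop:
    bound: int             # maximal number of iterations, must be known
    cond: "Node"
    body: "Node"

Node = Union[Basic, Seq, If, Loop]

def wcet(n: Node) -> int:
    """Upper bound on execution time by induction over the structure."""
    if isinstance(n, Basic):
        return n.cycles
    if isinstance(n, Seq):
        return wcet(n.first) + wcet(n.second)
    if isinstance(n, If):
        # condition plus the more expensive branch
        return wcet(n.cond) + max(wcet(n.then), wcet(n.orelse))
    if isinstance(n, Loop):
        # condition is evaluated bound + 1 times, body at most bound times
        return (n.bound + 1) * wcet(n.cond) + n.bound * wcet(n.body)
    raise TypeError(f"unknown node: {n!r}")

# Made-up example program: a 2-cycle prologue followed by a loop of at most 10 iterations.
program = Seq(Basic(2), Loop(10, Basic(1), If(Basic(1), Basic(5), Basic(3))))
bound = wcet(program)      # 2 + 11 * 1 + 10 * (1 + 5) = 73
```

Once execution times depend on the execution state, a single `cycles` constant per block no longer exists, and this simple induction breaks down.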
-7-
High-Performance Microprocessors
• increase (average-case) performance by using:
Caches, Pipelines, Branch Prediction, Speculation
• These features make timing analysis difficult:
Execution times of instructions vary widely
– Best case - everything goes smoothly: no cache miss,
operands ready, resources free, branch correctly
predicted
– Worst case - everything goes wrong: all loads miss the
cache, resources are occupied, operands not ready
– Span may be several hundred cycles
-8-
Variability of Execution Times
x = a + b;   compiled for the PPC 755:
    LOAD r2, _a
    LOAD r1, _b
    ADD  r3, r2, r1
[Figure: bar chart of execution time in clock cycles (axis from 0 to 350), best case vs. worst case]
In most cases, execution will be fast.
So, assuming the worst case is safe, but very pessimistic!
State-dependent Execution Times
• Execution time of an instruction
is a function of the execution
state ⇒ timing schemata are no
longer applicable.
• Execution state results from the
execution history.
-9-
[Figure: state = semantics state (values of variables) + execution state (occupancy of resources)]
Timing Analysis – the Search Space
with State-dependent Execution Times
• all control-flow paths – depending on
the possible inputs
• all paths through the architecture for
potential initial states
[Figure: execution states for paths reaching the program point of instruction mul rD, rA, rB; case distinctions: instruction in I-cache / not in I-cache, bus occupied / not occupied, small / large operands; cycle labels 1, 1, ≥ 40, 4; search space spanned by Input, Software, initial state, and Architecture]
- 10 -
Timing Analysis – the Search Space
with out-of-order execution
• all control-flow paths – depending on
the possible inputs
• all paths through the architecture for
potential initial states
• including different schedules for
instruction sequences
- 11 -
[Figure: search space spanned by Input, Software, initial state, and Architecture]
Timing Analysis – the Search Space
with multi-threading
• all control-flow paths – depending on
the possible inputs
• all paths through the architecture for
potential initial states
• including different schedules for
instruction sequences
• including different interleavings of
accesses to shared resources
- 12 -
[Figure: search space spanned by Input, Software, initial state, and Architecture]
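A back-of-the-envelope sketch of why interleavings of accesses to shared resources blow up the search space. It counts the orderings in which the accesses of several threads can reach a shared resource while preserving each thread's own order (a multinomial coefficient). The function name and the access counts are illustrative assumptions.

```python
from math import comb

def interleavings(lengths):
    """Number of ways to interleave access sequences of the given lengths
    while keeping the order within each sequence."""
    total, count = 0, 1
    for n in lengths:
        total += n
        count *= comb(total, n)   # choose positions for this thread's accesses
    return count

two_threads = interleavings([10, 10])             # 184756
four_threads = interleavings([10, 10, 10, 10])    # grows far beyond that
```

Even two threads with ten accesses each already produce 184756 interleavings, before any architectural state is taken into account.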
Timing Accidents and Penalties
Timing Accident – cause for an increase
of the execution time of an instruction
Timing Penalty – the associated increase
• Types of timing accidents
– Cache misses
– Pipeline stalls
– Branch mispredictions
– Bus collisions
– Memory refresh of DRAM
– TLB miss
- 13 -
- 14 -
Our Approach
• Static Analysis of Programs for
their behavior on the execution
platform
• computes invariants about the
set of all potential execution
states at all program points,
• the execution states result from
the execution history,
• static analysis explores all
execution histories
[Figure: state = semantics state (values of variables) + execution state (occupancy of resources)]
Deriving Run-Time Guarantees
- 15 -
• Our method and tool derive Safety
Properties from these invariants:
Certain timing accidents will never happen.
Example: At program point p, instruction
fetch will never cause a cache miss.
• The more accidents excluded, the lower
the upper bound.
[Figure: variance of execution times between Fastest and Slowest, labelled "Murphy’s invariant"]
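A minimal sketch of how excluded timing accidents lower the upper bound. The penalty values, the instruction list, and the function name are hypothetical. The sketch also assumes that penalties simply add up per instruction, which ignores overlap and timing anomalies on real pipelines.

```python
# Hypothetical penalties in cycles; real values are platform-specific.
PENALTY = {"icache_miss": 40, "dcache_miss": 40, "branch_mispredict": 5}

def upper_bound(instructions, excluded):
    """instructions: list of (base_cycles, set of accidents that may occur).
    excluded: dict mapping instruction index to the set of accidents that
    the static analysis has proven impossible at that program point."""
    total = 0
    for i, (base, possible) in enumerate(instructions):
        remaining = possible - excluded.get(i, set())
        total += base + sum(PENALTY[a] for a in remaining)
    return total

instrs = [
    (1, {"icache_miss", "dcache_miss"}),
    (1, {"icache_miss"}),
    (1, {"icache_miss", "branch_mispredict"}),
]
pessimistic = upper_bound(instrs, {})                      # 168 cycles
refined = upper_bound(instrs, {0: {"icache_miss"},
                               1: {"icache_miss"}})        # 88 cycles
```

With no invariants, every accident must be assumed at every instruction. Each proven exclusion removes one penalty from the bound.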
Architectural Complexity implies
Analysis Complexity
- 16 -
Every hardware component whose state has an
influence on the timing behavior
• must be conservatively modeled,
• contributes to the size of the search space, most
of the time exponentially in some architectural
parameters
• Exception: Caches
– some have good abstractions providing for highly
precise analyses (LRU), cf. Diss. of J. Reineke
– some have abstractions with compact
representations, but not so precise analyses
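A minimal sketch of an LRU must-cache abstraction for a single cache set, following the classic abstract-interpretation approach to cache analysis. The dictionary representation and the function names are assumptions made for illustration. The abstract state maps each memory block to an upper bound on its LRU age, so a block present in the state is guaranteed to be cached.

```python
def update(state, block, assoc):
    """Abstract effect of accessing `block` in a set with `assoc` ways."""
    old = state.get(block, assoc)          # assoc means "possibly not cached"
    new = {}
    for b, age in state.items():
        if b == block:
            continue
        aged = age + 1 if age < old else age   # only younger blocks age
        if aged < assoc:                       # otherwise possibly evicted
            new[b] = aged
    new[block] = 0                             # accessed block is youngest
    return new

def join(s1, s2):
    """Merge at a control-flow join: keep blocks guaranteed on both paths,
    with the larger (more pessimistic) age bound."""
    return {b: max(s1[b], s2[b]) for b in s1.keys() & s2.keys()}

def always_hit(state, block):
    return block in state

ASSOC = 4
path1 = update(update({}, "a", ASSOC), "b", ASSOC)                       # a, b
path2 = update(update(update({}, "b", ASSOC), "a", ASSOC), "c", ASSOC)   # b, a, c
merged = join(path1, path2)    # {"a": 1, "b": 2}
```

After the join, accesses to `a` and `b` are classified as guaranteed hits, while `c` is not, because it was loaded on only one of the two paths.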
Recipes for Success
• Abstraction: identify abstract domains
that are
– precise and
– efficient
• Decomposition: separate different
aspects of the semantics and use
precomputation
- 17 -
Abstraction and Decomposition
- 18 -
Components with domains of states $C_1, C_2, \dots, C_k$
Analysis has to track the domain $C_1 \times C_2 \times \dots \times C_k$
Start with the powerset domain $2^{C_1 \times C_2 \times \dots \times C_k}$
• Find an abstract domain $C_1^\#$ and
transform into $C_1^\# \times 2^{C_2 \times \dots \times C_k}$.
This has worked for caches and
cache-like devices.
• Find abstractions $C_{11}^\#$ and $C_{12}^\#$,
factor out $C_{11}^\#$ and transform the
rest into $2^{C_{12}^\# \times C_2 \times \dots \times C_k}$.
This has worked for the arithmetic
of the pipeline.
[Figure: program → value analysis ($C_{11}^\#$) → program with annotations → microarchitectural analysis ($2^{C_{12}^\# \times \dots \times C_k}$)]
Analyzability
- 19 -
• M. Lv, N. Guan, J. Reineke, R.Wilhelm, W. Yi:
A Survey on Static Cache Analysis for Real-Time Systems. LITES
3(1): 05:1-05:48 (2016)
explains several different abstract domains for cache analysis
• S.Hahn, J.Reineke, R.Wilhelm: Toward Compact Abstractions for
Processor Pipelines. Correct System Design 2015: 205-220
shows how to obtain a compact domain for pipeline analysis and how
to get rid of timing anomalies
- 20 -
State Space Explosion in Timing Analysis
[Figure: staircase of state-space growth. Steps from bottom to top: constant execution times; state-dependent execution times; out-of-order execution; preemptive scheduling; concurrency + shared resources. Years axis: ~1995, ~2000, 2010+. Methods axis: timing schemata, static analysis, ???]
ARM Cortex R5F
- an architecture for real-time? -
- 21 -
The ARM Cortex R5F processor “provides a
high-performance solution for real-time
applications” and provides “simplified
certification effort with the optional Safety
Documentation Package for standards such as
ISO 26262 and IEC 61508, and enable higher
levels of certification to be obtained”,
according to ARM.
- 22 -
Cortex-R5F Predictability Issues
features of the design that limit the predictability
of the R5F-based TMS570 in real-time systems
- 23 -
Cortex-R5F Predictability Issues
• Random replacement caches + L2 memories
– average performance better than cache-less TCMs on
earlier Cortex-R4F-based TMS570 variants
– 0-cycle hit vs. high cache miss latency leads to runtime
variability
– static predictability is reduced from that of a 4-way to that of a
1-way associative cache, because the replaced way is random
– locking one or multiple ways is not supported so that
critical code or data regions cannot be guaranteed to
hit the cache
- 24 -
Cortex-R5F Predictability Issues
• Branch prediction
– Complex branch outcome and loop prediction
– Can be switched to static prediction
• Decoupled writes
– 2-entry Store Queue (SQ), 4-entry Store Buffer (STB)
– STB can delay a single write by 64 up to 128 cycles
– STB can merge multiple writes to a single access
– L2 memory interface can merge multiple accesses
from STB to a burst
– SQ buffers writes if STB is full
- 25 -
• L2 AXI Master Port
– up to 7 outstanding reads,
up to 4 outstanding writes
– out-of-order access handling
– runs at a slower bus clock speed, which is synchronous to
the CPU core clock
What is the alternative to sound
static timing analysis?
• Measurement-based methods
– have soundness problems
– don’t get the necessary trace data off the
platform
- 26 -
Taking Constructive Influence
- the PROMPT Approach -
- 27 -
• Multi-core implementations of many embedded
systems require mapping applications to cores – one
point of attack.
Traditional System Design Process
[Flowchart: selection of the execution platform; one application as a set of tasks → software development → timing analysis → schedulability analysis (outcomes: Yes / No)]
- 28 -
System Design Process with Integration of
Applications
[Flowchart: design of distributed execution platform; several applications as sets of sets of tasks → per application: software development → timing analysis → integration: mapping and schedulability (outcomes: Yes / No)]
- 29 -
Application Domains I
- 30 -
• Architectures for safety- and time-critical avionics
and automotive systems
• system characteristics:
– combination of control loops and finite-state control
– each control loop fully contained in one application
– little shared code
– global (finite) state partly shared between applications;
– state transitions influence control parameters,
– control loops trigger state transitions
– reading from and writing to shared state happens only at
the beginning and at the end of task activations
– some applications require high performance, but share
little with the control applications
- 31 -
Application Domains II
• Similar integration trends, IMA and AUTOSAR,
integrating applications on powerful platforms instead of
1-application-per-platform/ECU
• More complex development process – Mapping a set of
applications to nodes of a platform.
• Goal is Composability:
timing behavior of one task is independent of that of the
other tasks integrated on the same platform.
– IMA: incremental qualification, i.e. modification of one application
integrated with a set of other applications only requires recertification of the modified component.
“Total” Task Isolation
- 32 -
• IMA attempts to realize total task isolation by
– Spatial partitioning – one task does not access a
memory area or device assigned to another task
– Temporal partitioning – the execution of one task
must not have an effect on the timing behavior of
another task
• These brick walls
– are too thick, i.e. entail too much performance loss,
– have holes, i.e. cannot be realized on complex
processor architectures
Dealing with Shared Resources
- 33 -
Alternatives:
• Avoiding them,
• Bounding their effects on timing variability
The PROMPT Principle:
Architecture Follows Application
- 34 -
Starting with a generic multi-node architecture,
the PROMPT architecture,
• parametric in the ISAs, the hierarchy of “nodes”,
the memory hierarchies, the interconnect, etc.
• nodes may be
– atomic processing units with their private resources or,
– if performance requires it, units with shared resources,
• nodes on each hierarchy level should be
predictable
• we start with predictable cores, i.e., fully
compositional architectures
- 35 -
The PROMPT Design Process
The generic PROMPT architecture is
instantiated for a given set of
applications with their resource
requirements
The design process works in multiple phases
1. hierarchical privatization
2. sharing of lonely resources
3. controlled socialization
Principles for the PROMPT
Architecture and Design Process
- 36 -
• No shared resources where not needed
for performance,
• Harmonious integration of applications:
not introducing interferences on shared
resources not existing in the applications.
The PROMPT System Design Process
[Figure: PROMPT design flow. Inputs: generic PROMPT architecture; sets of applications as sets of sets of tasks. INSTANTIATION phase with the boxes: Analysis of Applications, Core Design, Multi-core Design. Further boxes: Software development, Implement, Timing Analysis (twice), Derivation of Timing Guarantees]
- 37 -
Steps of the Design Process
- 38 -
1. Hierarchical privatization
– decomposition of the set of applications according to the
sharing relation on the global state
– allocation of private resources for non-shared code and state
– allocation of the shared global state to non-cached memory,
e.g. scratchpad,
– sound (and precise) determination of delays for accesses to
the shared global state
2. Sharing of lonely resources – seldom accessed
resources, e.g. I/O devices
3. Controlled socialization
• introduction of sharing to reduce costs
• controlling loss of predictability
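A minimal sketch of the first privatization step, decomposing applications according to the sharing relation on the global state. It groups applications that are connected, directly or transitively, through shared state items. Treating the decomposition as connected components is an assumption of this sketch, and the application and state names are hypothetical.

```python
def sharing_groups(uses):
    """uses: dict mapping application -> set of global state items it
    reads or writes. Returns groups of applications linked by sharing."""
    parent = {app: app for app in uses}

    def find(x):
        # union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner = {}                        # first application seen per state item
    for app, items in uses.items():
        for item in items:
            if item in owner:
                parent[find(app)] = find(owner[item])   # merge the two groups
            else:
                owner[item] = app

    groups = {}
    for app in uses:
        groups.setdefault(find(app), set()).add(app)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical applications and state items.
uses = {
    "brake": {"speed"},
    "cruise": {"speed", "mode"},
    "hmi": {"mode"},
    "video": set(),
}
groups = sharing_groups(uses)   # [["brake", "cruise", "hmi"], ["video"]]
```

An application that ends up alone in its group, such as `video` here, shares no state and is a candidate for fully private resources.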
- 39 -
Sharing of Lonely Resources
• Costly lonely resources will be shared.
• Access rate is low compared to CPU and
memory bandwidth.
• The access delay contributes little to the
overall execution time because accesses
happen infrequently.
PROMPT Design Principles
for Predictable Systems
- 40 -
• reduce interference on shared resources in
architecture design
• avoid introduction of interferences in mapping
application to target architecture
Applied to Predictable Multi-Core Systems
• Private resources for non-shared components of
applications
• Deterministic regime for the access to shared
resources
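A minimal sketch of one possible deterministic access regime, a TDMA schedule for a shared resource. The talk does not name a specific regime, so TDMA is an assumption here, as are the function names and the simplification that one access occupies exactly one slot and may only start at the beginning of the core's own slot.

```python
def tdma_wait(t, core, n_cores, slot):
    """Cycles a request issued at integer time t by `core` waits until
    the start of that core's next slot."""
    period = n_cores * slot
    return (core * slot - t) % period

def worst_case_latency(n_cores, slot):
    """Longest wait (period - 1 cycles) plus one full slot for the access."""
    return n_cores * slot - 1 + slot

# Hypothetical configuration: 4 cores, slots of 10 cycles.
wait = tdma_wait(25, core=3, n_cores=4, slot=10)   # slot of core 3 starts at 30
bound = worst_case_latency(4, 10)                  # 39 + 10 = 49 cycles
```

The bound depends only on the schedule parameters and not on what the other cores do, which is the kind of composability the design principles aim for.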
Some Relevant Publications from my Group
- 41 -
• C. Ferdinand et al.: Cache Behavior Prediction by Abstract Interpretation. Science of
Computer Programming 35(2): 163-189 (1999)
• C. Ferdinand et al.: Reliable and Precise WCET Determination of a Real-Life Processor,
EMSOFT 2001
• R. Heckmann et al.: The Influence of Processor Architecture on the Design and the
Results of WCET Tools, IEEE Proc. on Real-Time Systems, July 2003
• St. Thesing et al.: An Abstract Interpretation-based Timing Validation of Hard Real-Time
Avionics Software, IPDS 2003
• L. Thiele, R. Wilhelm: Design for Timing Predictability, Real-Time Systems, Dec. 2004
• R. Wilhelm: Determination of Execution Time Bounds, Embedded Systems Handbook, CRC
Press, 2005
• St. Thesing: Modeling a System Controller for Timing Analysis, EMSOFT 2006
• J. Reineke et al.: Predictability of Cache Replacement Policies, Real-Time Systems, 2007
• R. Wilhelm et al.: The Determination of Worst-Case Execution Times - Overview of the
Methods and Survey of Tools. ACM Transactions on Embedded Computing Systems (TECS)
7(3), 2008.
• R. Wilhelm et al.: Memory Hierarchies, Pipelines, and Buses for Future Architectures in
Time-critical Embedded Systems, IEEE TCAD, July 2009
• M. Lv, N. Guan, J. Reineke, R. Wilhelm, W. Yi:
A Survey on Static Cache Analysis for Real-Time Systems. LITES 3(1): 05:1-05:48 (2016)
• S. Hahn, J. Reineke, R. Wilhelm: Toward Compact Abstractions for Processor Pipelines.
Correct System Design 2015: 205-220