Architecture/Source Code Inventory

From Apache OpenOffice Wiki

Owner: Kay Ramme, Stefan Zimmermann Type: analysis State: draft


Recent surveys and current experience with the project have raised concern about the existing "barrier of entrance" for potential contributors, which may keep developers, for example, from becoming active members of the community. This "barrier of entrance" surely has many dimensions. Some of them may be the complexity of the source code, the build environment, the lack of modularity, or simply the sheer mass of items involved in the product.

Therefore Thorsten Behrens, Kay Ramme and Stefan Zimmermann stepped up to determine the sub-dimensions of complexity, to find and develop measures that quantify the code base of the project, and to provide data describing these sub-dimensions to potential improvement teams. This is a call for help: everybody who wants to contribute experiences and ideas is more than welcome.


The overarching motto we agree on is "Less [code] is better!", where the word "code" is actually optional.

If we say "less", we need in turn to know how much we have now. That means we need to quantify our (code) base. As a first step, though, we think we should focus on these specific areas:

  • dead code
  • redundancy
  • cyclomatic complexity (McCabe)
  • (unused/useless features)

After these focus areas are addressed, we may focus more on finding indicators for some of the properties described in the next section, "The Zen of Programming" ;)

The Zen of Programming

  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.
  • Special cases aren't special enough to break the rules.
  • Although practicality beats purity.
  • Errors should never pass silently.
  • Unless explicitly silenced.
  • In the face of ambiguity, refuse the temptation to guess.
  • There should be one-- and preferably only one --obvious way to do it.
  • Now is better than never.
  • Although never is often better than right now.
  • If the implementation is hard to explain, it's a bad idea.
  • If the implementation is easy to explain, it may be a good idea.
  • Namespaces are one honking great idea -- let's do more of those!

(cited from the Zen of Python by Tim Peters)

Possible Data Collection Plan

Data to be collected:

At first this is quantitative data, ranging from the number of files, lines of code (in its variants LINES and SLOC, according to the DSI concept), number of classes, methods, and lines of code per function to file dependencies, file scattering, and file location, which will also come into focus.

Purpose of Data Collection:

Ultimately, the goal is to provide ideas on how to simplify the project in order to lower the "barrier of entrance" for contributors, and to determine whether maintenance capability or maintainability can be expressed through such measures.

What Insight The Data Will Provide:

The data, when counted and compared, will provide us with information about dependencies and redundancies in the code, as well as the purpose or duty of specific code sections.

How It Will Help Potential Improvement Teams:

The teams will be able to decide whether to eliminate, consolidate, refactor, or modularize code, or simply to drop the possible effects of the multiple dimensions of complexity from consideration.

What Will Be Done With The Data After Collection:

The teams will use the data to arrive at code complexity measures that may be able to distinguish code that is "easy to maintain" from code that is "not so easy to maintain" :). The data will certainly also be used to continuously draw a picture of what the OpenOffice code base consists of and how it develops over time.

What Data We Think Should Be Collected and Why (Detailed)

Source Code Size Metrics

On the way to developing an "Operational Definition", the sub-site "Size Metrics" details how we measure the potential data points mentioned here, for example what exactly counts as a "Source Line of Code" (SLOC).
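To illustrate why an operational definition is needed, here is a minimal sketch of a line classifier for C/C++-style sources. The counting rules are assumptions made for this example, not the definitions from the "Size Metrics" sub-site: a line is "blank" if it holds only whitespace, "comment" if it holds nothing but comment text, and "sloc" otherwise (preprocessor lines included). Comment markers inside string literals are not handled. The function name `count_lines` and the sample are hypothetical.

```python
def count_lines(source):
    """Classify every physical line as blank, comment, or source (sloc)."""
    counts = {"lines": 0, "blank": 0, "comment": 0, "sloc": 0}
    in_block = False  # True while we are inside a /* ... */ comment
    for raw in source.splitlines():
        line = raw.strip()
        counts["lines"] += 1
        if in_block:
            end = line.find("*/")
            if end == -1:
                counts["comment"] += 1  # whole line is inside the comment
                continue
            in_block = False
            line = line[end + 2:].strip()
            if not line:
                counts["comment"] += 1  # comment closes, nothing follows
                continue
            # Otherwise text follows the closing */ and is classified below.
        if not line:
            counts["blank"] += 1
        elif line.startswith("//"):
            counts["comment"] += 1
        elif line.startswith("/*"):
            if "*/" in line[2:]:
                # Comment opens and closes on this line; check for trailing code.
                rest = line[line.index("*/", 2) + 2:].strip()
                counts["sloc" if rest else "comment"] += 1
            else:
                in_block = True
                counts["comment"] += 1
        else:
            counts["sloc"] += 1
            # A block comment may open after code and continue on later lines.
            start = line.find("/*")
            if start != -1 and "*/" not in line[start:]:
                in_block = True
    return counts


SAMPLE = """\
// header comment

#include <stdio.h>
/* block
   comment */
int main() {
    return 0; /* trailing */
}
"""

print(count_lines(SAMPLE))
# {'lines': 8, 'blank': 1, 'comment': 3, 'sloc': 4}
```

Changing any one rule (for example, counting a code line with a trailing comment in both categories) changes the ratios, which is exactly what the operational definition has to pin down.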

Code Metrics

    • size estimates
    • best practice comparison
    • language to language comparison
  • LOC
    • size estimates
    • comment line / source line / blank line ratio
    • best practice comparison
    • language to language comparison
  • SLOC (source lines of code)
    • compiler relevant lines of code
    • relate to DSI
  • DSI (delivered source instructions)
    • use in COCOMO II (Constructive Cost Model II)
    • PM (person month) estimates
    • TDEV (development time) estimates
  • Preprocessor directives
    • creating file inclusion hierarchy
    • comparing definition count (constants and macros) with "best practice"
  • Keywords
    • calculate cyclomatic complexity (McCabe)
    • compare with "best practice"
  • Statements
    • compare with DSI
    • estimate statement density per method, file
  • Classes
    • class hierarchy (inheritance depth)
    • dependencies (circular)
    • "is a" - "has a" relationships (ratio)
  • Methods per Class
    • size of class
    • maintainability estimations
    • interfaces (external)
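The "Keywords" item above suggests deriving cyclomatic complexity from keyword counts. A minimal sketch of that idea follows, assuming the common approximation "number of decision points plus one" for a single function body. The set of tokens treated as decision points (`if`, `for`, `while`, `case`, `catch`, `&&`, `||`, `?`) is an assumption of this example, and the function name `cyclomatic_complexity` is hypothetical. Comments and string literals are removed first so that keywords inside them are not counted.

```python
import re

# Comments and string/character literals, removed before counting.
_NOISE = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)

# Tokens treated as decision points (an assumption of this sketch).
_DECISION = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")


def cyclomatic_complexity(function_body):
    """Approximate McCabe complexity: decision points in the body, plus one."""
    cleaned = _NOISE.sub(" ", function_body)
    return 1 + len(_DECISION.findall(cleaned))


BODY = '''
int classify(int x) {
    // if this comment were counted, the result would be wrong
    if (x < 0 && x != -1) {
        return -1;
    }
    for (int i = 0; i < x; ++i) {
        if (i == 7) return 7;
    }
    return x > 100 ? 100 : x;
}
'''

print(cyclomatic_complexity(BODY))  # 6: if, &&, for, if, ? plus one
```

A token count like this does not parse the code, so macros that hide control flow or a per-function split of a whole file would need a real parser.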

File Metrics

  • Files to handle
    • evaluation of possible consolidation efforts (scattering together with location)
    • comparison with "best practice" data from industry
    • ratio of product source to product build environment
  • File inclusion hierarchy
    • inclusion depth
    • file dependencies
    • consolidation opportunities
  • Commented Code Sections
    • potential sub instance of dead/unneeded code
    • consolidation opportunities
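The file inclusion hierarchy and its depth could be derived roughly as sketched below. This is an illustration under stated assumptions: files are given as an in-memory mapping from file name to text, include names are matched literally against that mapping (no include-path lookup), and headers not present in the mapping are ignored. The names `build_include_graph` and `inclusion_depth` are hypothetical.

```python
import re

# Matches #include "name" and #include <name> at the start of a line.
_INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def build_include_graph(files):
    """Map each file name to the list of known files it includes directly."""
    return {
        name: [inc for inc in _INCLUDE.findall(text) if inc in files]
        for name, text in files.items()
    }


def inclusion_depth(graph, name, _stack=()):
    """Length of the longest include chain starting at `name` (0 = no includes).

    Raises ValueError when a circular inclusion is found.
    """
    if name in _stack:
        raise ValueError("circular inclusion: " + " -> ".join(_stack + (name,)))
    children = graph.get(name, [])
    if not children:
        return 0
    return 1 + max(inclusion_depth(graph, c, _stack + (name,)) for c in children)


FILES = {
    "a.hxx": '#include "b.hxx"\n#include <vector>\n',
    "b.hxx": '#include "c.hxx"\n',
    "c.hxx": "// leaf header\n",
}

graph = build_include_graph(FILES)
print(graph)                            # {'a.hxx': ['b.hxx'], 'b.hxx': ['c.hxx'], 'c.hxx': []}
print(inclusion_depth(graph, "a.hxx"))  # 2
```

Include guards make circular inclusion legal for the compiler, so the exception here only flags a dependency cycle worth looking at. For a large tree the recursion would need memoization to stay fast.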

Call for Help

Any ideas and experiences about what to collect, how, and why are welcome.

Things to Think About

  • How is the software development cost related to the code size of the base product? Exponentially? Linearly? Logarithmically?
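One existing model that takes a position on this question is COCOMO II, which the metrics list already names. It relates effort to size by a power law, PM = A * Size^E with E = B + 0.01 * (sum of scale factors), and schedule by TDEV = C * PM^F with F = D + 0.2 * (E - B). The sketch below uses the constants of the COCOMO II.2000 calibration (A = 2.94, B = 0.91, C = 3.67, D = 0.28) and a scale factor sum of 18.97 for all-nominal ratings. These values should be checked against the model definition before any real use, and none of them has been calibrated for this project. Function names are hypothetical.

```python
# Constants as published for the COCOMO II.2000 calibration (verify before use).
A, B, C, D = 2.94, 0.91, 3.67, 0.28
NOMINAL_SCALE_FACTOR_SUM = 18.97  # assumption: all five scale factors rated nominal


def effort_person_months(ksloc, scale_factor_sum=NOMINAL_SCALE_FACTOR_SUM,
                         effort_multiplier=1.0):
    """PM = A * Size^E * EM, with Size in thousands of source lines."""
    exponent = B + 0.01 * scale_factor_sum
    return A * ksloc ** exponent * effort_multiplier


def schedule_months(ksloc, scale_factor_sum=NOMINAL_SCALE_FACTOR_SUM,
                    effort_multiplier=1.0):
    """TDEV = C * PM^F (simplified: no separate treatment of the schedule driver)."""
    exponent = B + 0.01 * scale_factor_sum
    pm = effort_person_months(ksloc, scale_factor_sum, effort_multiplier)
    return C * pm ** (D + 0.2 * (exponent - B))


# Doubling the size multiplies the estimated effort by 2^E.
ratio = effort_person_months(200) / effort_person_months(100)
print(round(ratio, 3))
```

With nominal ratings the exponent is slightly above 1, so under this model effort grows somewhat faster than linearly with size, but far slower than exponentially. Whether that shape holds for this code base is exactly the open question.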


To be continued ...
