Uno/Binary/Analysis/String Performance


author: Kay Ramme
date: 09/13/2005
type: analysis

version: 1.2

Some String Performance Thoughts

Motivation

While doing some performance investigations into OOo document load times, different profilers showed that a significant amount of time was spent in memory de-/allocation and interlock counter in-/decrements. A lot of these invocations were done on behalf of the OOo string implementations. These string implementations are purely heap based and allocate all strings with a length > 0 on the heap.

This document should give a first hint about how to calculate the memory and time consumption of a particular string implementation. Being able to calculate these e.g. allows one to judge the quality of the current filter implementations in respect to strings, and may give a hint about excessive string usage.

Summary

The shared-heap-string-with-a-buffer performance is (much) better than the shared-heap-string performance, but it needs slightly more memory, depending on the usage case.

The first thoughts / investigations below are probably not yet enough to really judge the potential advantages of the shared-heap-string-with-a-buffer over the shared-heap-string, but they may already give some hints for further investigations.

Unfortunately, changing / extending OOo's string implementations would be an ABI incompatible change, so doing this should be well motivated.

Preface

Although some simple optimizations, such as not acquiring / releasing the statically allocated empty string, can be done immediately, further thought seems to be necessary. Obviously, different aspects of strings can be optimized, partly with competing results, depending on the usage scenarios and the optimization focus (space vs. time). Some candidates are:

  • Acquire heap blocks for unique strings only.
  • Reduce heap de- / allocation calls.
  • Reduce acquire / release calls.
  • Improve string manipulation operations, such as +=.

This document aims to give ideas about time optimizations, as these seem to be the most pressing, e.g. regarding the current document load times.

While loading a big spreadsheet with integer values only, a lot of time is spent in heap de-/allocation and in in-/decrementing interlock counts. These functions are called by string construction and destruction operations; therefore this document only investigates the construction and destruction of strings.

Examined Operations

Examined operations are

  • string construction and
  • string destruction,

as these are the most costly string operations in a typical usage scenario, e.g. loading spreadsheet documents. Construction and destruction of empty strings have been ignored for simplicity; such operations can be implemented to execute in nearly zero time compared to the non-empty string operations.

Examined Strings

This document only investigates the original shared-heap-string, as currently implemented and used in OOo, and a variant of this string, the shared-heap-string-with-a-buffer, which seems to be promising in respect to the examined operations (construction and destruction).

Parameters

The time relevant parameters for construction and destruction are

  • heap allocation times,
  • heap de-allocation times,
  • acquire times,
  • release times and the
  • character copy times.

The space relevant parameters are the

  • shared sequence size,
  • character sequence length,
  • string size and the
  • character size.

The Shared-Heap-String

The shared-heap-string, as implemented by OOo's rtl::OUString, allocates every character sequence on the heap, while sharing these sequences in case of copy operations. The lifetime of the shared sequences is controlled by reference counting. In a multi-threaded environment, this reference counting must be implemented in a thread safe manner. The thread safe OOo reference counter used is the oslInterlockedCount.

{{{Uno/Note}}} As the shared-heap-string always does heap operations for creating / deleting shared sequences, it can become a major bottleneck for multi-threaded applications, as heap de-/allocations typically acquire a global mutex before performing their underlying function.
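
To make the following discussion concrete, here is a minimal sketch of the data layout; the type and field names are illustrative (cf. rtl_uString and rtl::OUString), not the exact OOo declarations:

/*
** illustrative sketch of the shared-heap-string layout; the names are
** not the exact OOo declarations (cf. rtl_uString / rtl::OUString)
*/
typedef unsigned short sal_Unicode;         /* one character, 2 bytes        */
typedef signed long    oslInterlockedCount; /* thread safe reference counter */
 
typedef struct
{
	oslInterlockedCount refCount;  /* in-/decremented via interlocked operations */
	long                length;    /* number of characters                       */
	sal_Unicode         buffer[1]; /* the characters follow in the same block    */
} SharedSequence;                  /* cf. rtl_uString                            */
 
typedef struct
{
	SharedSequence * pData;        /* the only member, cf. rtl::OUString         */
} SharedHeapString;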

Construction

Constructing a shared-heap-string means to

  • acquire a shared sequence, in case the source is a shared-heap-string, or to
  • allocate from the heap and to copy a number of characters, in case the source is a literal.
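
A minimal sketch of these two construction paths, assuming the illustrative SharedSequence / SharedHeapString layout sketched above and the osl_incrementInterlockedCount function shown in the appendix (zero termination and error handling omitted):

#include <stdlib.h>  /* malloc */
#include <string.h>  /* memcpy */
#include <stddef.h>  /* offsetof */
 
/* source is a shared-heap-string: share its sequence (acquire) */
static void construct_from_string(SharedHeapString * pThis, SharedHeapString const * pSource)
{
	pThis->pData = pSource->pData;
	osl_incrementInterlockedCount(&pThis->pData->refCount);
}
 
/* source is a literal: allocate a sequence on the heap and copy the characters */
static void construct_from_literal(SharedHeapString * pThis, sal_Unicode const * pChars, long nLength)
{
	SharedSequence * pSeq = malloc(offsetof(SharedSequence, buffer) + nLength * sizeof(sal_Unicode));
	pSeq->refCount = 1;
	pSeq->length   = nLength;
	memcpy(pSeq->buffer, pChars, nLength * sizeof(sal_Unicode));
	pThis->pData = pSeq;
}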

Destruction

Destructing a shared-heap-string means to

  • release the shared sequence, in case this was not the last reference, or to
  • deallocate the shared sequence, in case this was the last reference.
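
The corresponding destruction sketch; the interlocked decrement performs the release and at the same time tells whether this was the last reference:

/* release the shared sequence; deallocate it when this was the last reference */
static void destruct(SharedHeapString * pThis)
{
	if (osl_decrementInterlockedCount(&pThis->pData->refCount) == 0)
		free(pThis->pData);
}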

Parameters

Usage dependent parameters are the

  • share ratio, the fraction of string assignments that share an existing string instead of copying a literal, and the
  • string length distribution.

Costs

Time:

  • destruction_time := heap_de-allocation_time * (1 - share_ratio) + release_time * share_ratio
  • construction_time := acquire_time * share_ratio + (1 - share_ratio) * (heap_allocation_time + character_copy_time * string_length)

Space:

  • memory_size := string_size + (1 - share_ratio) * (string_length * character_size + shared_sequence_size)
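
Transcribed into code, these formulas read as follows; this is a plain restatement of the formulas above (times in insts, sizes in bytes), not existing OOo code:

/* plain transcription of the shared-heap-string cost formulas */
static double shs_construction_time(double share_ratio, double string_length,
                                    double acquire_time, double heap_allocation_time,
                                    double character_copy_time)
{
	return acquire_time * share_ratio
	     + (1 - share_ratio) * (heap_allocation_time + character_copy_time * string_length);
}
 
static double shs_destruction_time(double share_ratio,
                                   double heap_deallocation_time, double release_time)
{
	return heap_deallocation_time * (1 - share_ratio) + release_time * share_ratio;
}
 
static double shs_memory_size(double share_ratio, double string_length,
                              double string_size, double character_size,
                              double shared_sequence_size)
{
	return string_size
	     + (1 - share_ratio) * (string_length * character_size + shared_sequence_size);
}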

The Shared-Heap-String-with-a-Buffer

The shared-heap-string-with-a-buffer reserves some of the string's structural size for buffering characters, instead of always allocating memory from the heap. With a buffer of size 0, its performance degenerates to the shared-heap-string performance.

Using the shared-heap-string-with-a-buffer is expected to give better performance in respect to time and concurrency, especially in scenarios where a lot of small temporary strings are used, for example when parsing / loading XML documents.
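
One possible layout for such a string, again only a sketch: strings shorter than the buffer live in place, longer ones fall back to a heap allocated SharedSequence as sketched for the plain shared-heap-string. The buffer length of 16 characters is an assumption, matching the scenario used later:

/* illustrative layout of a shared-heap-string-with-a-buffer; no such type exists in OOo */
#define BUFFER_LENGTH 16	/* in characters, a tunable parameter */
 
typedef struct
{
	SharedSequence * pData;            /* used when length >= BUFFER_LENGTH, else NULL          */
	sal_Unicode buffer[BUFFER_LENGTH]; /* in place characters, used when length < BUFFER_LENGTH */
} BufferedSharedHeapString;

Whether the in place buffer or the shared sequence is in use can be told from pData being NULL; this is a design choice of the sketch, not a requirement.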

Construction

Instead of being constant in time, the construction operation depends on the source string's length in case it fits into the buffer; in case the source string is longer than the buffer, it is the same as the shared-heap-string construction operation.

{{{Uno/Note}}} The shared-heap-string-with-a-buffer construction operation is guaranteed to be cheaper than the pure heap-string construction operation, if the buffer size is selected in a way that the character_copy_time * string_length operation is cheaper than the acquire operation.

Constructing a shared-heap-string-with-a-buffer means to

  • acquire a shared sequence, in case the source is a string whose length >= the buffer length, or to
  • copy a number of characters, in case the source length < the buffer length, or to
  • allocate from the heap and to copy a number of characters, in case the source is a literal whose length >= the buffer length.
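
A sketch of these construction paths, based on the illustrative BufferedSharedHeapString layout above and the includes / helpers from the earlier shared-heap-string sketch:

/* source is another buffered string */
static void buffered_construct_from_string(BufferedSharedHeapString * pThis,
                                            BufferedSharedHeapString const * pSource)
{
	if (pSource->pData)	/* source length >= buffer: share the heap sequence (acquire) */
	{
		pThis->pData = pSource->pData;
		osl_incrementInterlockedCount(&pThis->pData->refCount);
	}
	else	/* source length < buffer: plain character copy, no heap, no interlock */
	{
		memcpy(pThis->buffer, pSource->buffer, sizeof(pThis->buffer));
		pThis->pData = NULL;
	}
}
 
/* source is a literal */
static void buffered_construct_from_literal(BufferedSharedHeapString * pThis,
                                             sal_Unicode const * pChars, long nLength)
{
	if (nLength < BUFFER_LENGTH)	/* fits: copy into the in place buffer */
	{
		memcpy(pThis->buffer, pChars, nLength * sizeof(sal_Unicode));
		pThis->buffer[nLength] = 0;	/* zero terminate, fits because nLength < BUFFER_LENGTH */
		pThis->pData = NULL;
	}
	else	/* does not fit: same as the plain shared-heap-string construction */
	{
		SharedSequence * pSeq = malloc(offsetof(SharedSequence, buffer) + nLength * sizeof(sal_Unicode));
		pSeq->refCount = 1;
		pSeq->length   = nLength;
		memcpy(pSeq->buffer, pChars, nLength * sizeof(sal_Unicode));
		pThis->pData = pSeq;
	}
}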

Destruction

Destructing a shared-heap-string-with-a-buffer means to

  • do nothing, in case the string's length < the buffer length, or,
  • in case the string's length >= the buffer length, to
    • release the shared sequence, in case this was not the last reference, or to
    • deallocate the shared sequence, in case this was the last reference.
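
And the corresponding destruction sketch; only strings that spilled to the heap cause any work:

static void buffered_destruct(BufferedSharedHeapString * pThis)
{
	if (pThis->pData == NULL)
		return;	/* length < buffer: nothing to do */
	if (osl_decrementInterlockedCount(&pThis->pData->refCount) == 0)	/* release */
		free(pThis->pData);	/* last reference: deallocate */
}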

Parameters

Usage dependent parameters are the

  • share ratio, the fraction of string assignments that share an existing string instead of copying a literal, the
  • string length distribution, and the
  • buffer size.

Costs

Time:

  • destruction_time := probability(sequence >= buffer) * heap-string-destruction_time
  • construction_time := probability(sequence >= buffer) * heap-string-construction_time + probability(sequence < buffer) * character_copy_time * sequence_length

Space:

  • memory_size := string_size + buffer_size + probability(sequence >= buffer) * (1 - share_ratio) * (sequence_length * character_size + shared_sequence_size)
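
Transcribed into code in the same way as for the plain shared-heap-string; p_ge stands for probability(sequence >= buffer), and the formulas are restated here, not taken from existing code:

/* plain transcription of the shared-heap-string-with-a-buffer cost formulas */
static double bshs_destruction_time(double p_ge, double share_ratio,
                                    double heap_deallocation_time, double release_time)
{
	double heap_string_destruction_time =
		heap_deallocation_time * (1 - share_ratio) + release_time * share_ratio;
	return p_ge * heap_string_destruction_time;
}
 
static double bshs_construction_time(double p_ge, double share_ratio, double sequence_length,
                                     double acquire_time, double heap_allocation_time,
                                     double character_copy_time)
{
	double heap_string_construction_time = acquire_time * share_ratio
		+ (1 - share_ratio) * (heap_allocation_time + character_copy_time * sequence_length);
	return p_ge * heap_string_construction_time
	     + (1 - p_ge) * character_copy_time * sequence_length;
}
 
static double bshs_memory_size(double p_ge, double share_ratio, double sequence_length,
                               double string_size, double buffer_size,
                               double character_size, double shared_sequence_size)
{
	return string_size + buffer_size
	     + p_ge * (1 - share_ratio) * (sequence_length * character_size + shared_sequence_size);
}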

Comparison

Independent Parameters

To find out the relationships of the different operations, I wrote a short test program (see appendix for details). This program gave me some numbers to start with:

localhost:~> ./perf_test.bin
        fun calls  256 -  average spendings: abs 0.000009ms proc 0.000000ms 326insts
           malloc  256 -  average spendings: abs 0.000008ms proc 0.000001ms 519insts
             free  256 -  average spendings: abs 0.000089ms proc 0.000001ms 465insts
          acquire  256 -  average spendings: abs 0.000008ms proc 0.000001ms 338insts
          release  256 -  average spendings: abs 0.000008ms proc 0.000001ms 338insts
copied characters 1024 -  average spendings: abs 0.000001ms proc 0.000000ms  35insts

Any suggestions / improvements for measuring the execution are certainly welcome. For precision, I used the determined number of instructions (insts) for the further calculations.

Defining the costs of the operations gives

  • heap_allocation_time = 519 insts,
  • heap_de-allocation_time = 465 insts,
  • acquire_time = 338 insts,
  • release_time = 338 insts and
  • character_copy_time = 35 insts.

String Parameters

The space relevant parameters for both strings are the

  • shared_sequence_size == sizeof(rtl_uString) == 10 bytes,
  • string_size == sizeof(rtl::OUString) == 4 bytes and the
  • character_size == sizeof(sal_Unicode) == 2 bytes.

Shared-Heap-String-with-a-Buffer

Additionally, for the shared-heap-string-with-a-buffer, there is the buffer size:

  • buffer_size = character_size * buffer_length.

Scenario: Loading a Spreadsheet

A Calc document with 400 000 numeric cells creates many strings during loading. A cell stored in the document typically looks like this:

<table:table-cell office:value-type="float" office:value="10">
      <text:p>10</text:p>
</table:table-cell>

The loading procedure in general does the following,

  • the XML file gets parsed, about 9 strings (for every tag and value) get created for every cell,
  • the strings are passed via callbacks to the Calc engine,
  • the Calc engine converts the content string into a number and
  • the strings get destructed.

For the above cell description, no sharing of strings is possible, as the only duplicated string (“10”) is part of different tags.

The performance parameters for this scenario are:

  • share_ratio = 0
  • average string length for the above cell description
    • length(“table:table-cell”) = 16c
    • length(“office:value-type”) = 17c
    • length(“office:value”) = 12c
    • length(“text:p”) = 6c
    • = (16+17+12+6)c / 4 = 12.75c
  • concurrency (max. number of concurrent instances) = 3 (one tag + two attributes)

Obviously, a spreadsheet with only numeric data should not do any heap allocations while loading, except for constructing the spreadsheet model.

Time and Space performance depend on the number of cells to be loaded, and on the concurrency.

  • Time Costs := concurrency * 400000 * (construction_time + destruction_time)
  • Space Costs := concurrency * (string_size + average_sequence_length * character_size + shared_sequence_size)
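
The calculations below can be cross checked with a few lines of plain C arithmetic, plugging in the measured instruction counts and the sizes given above (a cross check only, not a measurement):

#include <stdio.h>
 
int main(void)
{
	/* measured / assumed parameters from the sections above */
	double heap_alloc = 519, heap_dealloc = 465, acquire = 338, release = 338, char_copy = 35; /* insts */
	double string_size = 4, shared_sequence_size = 10, character_size = 2;                     /* bytes */
	double share_ratio = 0, length = 12.75, concurrency = 3, cells = 400000;
	double p_ge = 0.5, buffer_length = 16;	/* shared-heap-string-with-a-buffer only */
 
	/* shared-heap-string */
	double shs_destr  = heap_dealloc * (1 - share_ratio) + release * share_ratio;
	double shs_constr = acquire * share_ratio + (1 - share_ratio) * (heap_alloc + char_copy * length);
	double shs_time   = concurrency * cells * (shs_constr + shs_destr);
	double shs_space  = concurrency * (string_size
	                  + (1 - share_ratio) * (length * character_size + shared_sequence_size));
 
	/* shared-heap-string-with-a-buffer */
	double bshs_destr  = p_ge * shs_destr;
	double bshs_constr = p_ge * shs_constr + (1 - p_ge) * char_copy * length;
	double bshs_time   = concurrency * cells * (bshs_constr + bshs_destr);
	double bshs_space  = concurrency * (string_size + buffer_length * character_size
	                   + p_ge * (1 - share_ratio) * (length * character_size + shared_sequence_size));
 
	printf("shared-heap-string:               %.0f insts, %.2f bytes\n", shs_time, shs_space);
	printf("shared-heap-string-with-a-buffer: %.0f insts, %.2f bytes\n", bshs_time, bshs_space);
	return 0;
}

Running it prints 1716300000 insts and 118.50 bytes for the shared-heap-string, and 1125900000 insts and 161.25 bytes for the shared-heap-string-with-a-buffer, matching the derivations below.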

Calculation for the Shared-Heap-String

Time

  • concurrency * 400000 * (construction_time + destruction_time)
  • destruction_time: 3 * 400000 * (heap_de-allocation_time * (1 - share_ratio) + release_time * share_ratio) == 3 * 400000 * 465insts * 1 == 558000000 insts
  • construction_time: 3 * 400000 * (acquire_time * share_ratio + (1 - share_ratio) * (heap_allocation_time + character_copy_time * sequence_length)) == 3 * 400000 * (519insts + 35insts * 12.75) == 1158300000 insts
  • sum: 558000000 insts + 1158300000 insts == 1716300000 insts

Space:

  • concurrency * shared-heap-string_size == concurrency * (string_size + (1 - share_ratio) * (sequence_length * character_size + shared_sequence_size)) == 3 * (4 bytes + 12.75 * 2 bytes + 10 bytes) == 118.5 bytes

Calculation for Shared-Heap-String-with-a-Buffer

Additional parameters are the

  • buffer_length == 16c (i.e. buffer_size == 32 bytes),
  • probability(sequence >= buffer) == 50% and
  • probability(sequence < buffer) == 50%.

Time

  • concurrency * 400000 * (construction_time + destruction_time)
  • destruction_time: 3 * 400000 * probability(sequence >= buffer) * heap-string-destruction_time == 3 * 400000 * 0.5 * (heap_de-allocation_time * (1 - share_ratio) + release_time * share_ratio) == 3 * 400000 * 0.5 * 465insts == 279000000 insts
  • construction_time: 3 * 400000 * (probability(sequence >= buffer) * heap-string-construction_time + probability(sequence < buffer) * character_copy_time * sequence_length) == 3 * 400000 * (0.5 * (heap_allocation_time + character_copy_time * sequence_length) + 0.5 * character_copy_time * sequence_length) == 3 * 400000 * (0.5 * (519insts + 35insts * 12.75) + 0.5 * 35insts * 12.75) == 846900000 insts
  • sum: 279000000 insts + 846900000 insts == 1125900000 insts

Average Space:

  • concurrency * shared-heap-string-with-a-buffer_size == concurrency * (string_size + buffer_size + probability(sequence >= buffer) * (1 - share_ratio) * (sequence_length * character_size + shared_sequence_size)) == 3 * (4 bytes + 16 * 2 bytes + 0.5 * 1 * (12.75 * 2 bytes + 10 bytes)) == 161.25 bytes

Conclusion

Comparing the shared-heap-string-with-a-buffer with the shared-heap-string in the above scenario shows a time advantage of about 35% and a space disadvantage of about 27%. In absolute numbers the space disadvantage is just about 43 bytes, whereas the time advantage is about 590 million insts.

Appendix

To compile the following program, put it into a file called “perf_test.c” and invoke

gcc -I/opt/papi/include/ perf_test.c -L/opt/papi/lib/ -lpapi -lperfctr -o perf_test.bin

Obviously, this runs on Linux x86 only.

/*
** this short program uses the Intel Pentium performance counters,
** which can be accessed through the PAPI project 
** (http://icl.cs.utk.edu/papi/software/index.html)
*/
 
 
#include <stdlib.h>
#include <stdio.h>
#include <papi.h>
#include <string.h>
 
 
static void report(char const * title, int calls, float rtime, float ptime, long_long ins) {
	/* per call averages; the instruction count is narrowed to long to match %li */
	printf("%20.20s %4i -  average spendings: abs %fms proc %fms %5liinsts\n", 
		   title,
		   calls,
		   rtime / calls, 
		   ptime / calls, 
		   (long)(ins / calls));
}
 
 
void dummyFunction(void)
{
}
 
#define FUN_LOOP 256
 
static void test_fun_call(void)
{
	float     g_rtime = 0;
	float     g_ptime = 0;
	long_long g_ins   = 0;
	float     g_ipc   = 0;
 
	int i;
	for (i = 0; i < FUN_LOOP; ++ i)
	{
		float     rtime = 0;
		float     ptime = 0;
		long_long ins   = 0;
		float     ipc   = 0;
 
		/* Start counting events */
		if (PAPI_ipc(&rtime, &ptime, &ins, &ipc))
			abort();
 
		dummyFunction();
 
		/* getting counted events */
		if (PAPI_ipc(&rtime, &ptime, &ins, &ipc))
			abort();
 
		/* Stop counting events */
		if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
			abort();
 
		g_rtime += rtime;
		g_ptime += ptime;
		g_ins   += ins;
		g_ipc   += ipc;
	}
 
	report("fun calls", FUN_LOOP, g_rtime, g_ptime, g_ins);
}
 
 
#define MAFR_LOOP 256
 
static void test_malloc_free(void)
{
	float     g_m_rtime = 0;
	float     g_m_ptime = 0;
	long_long g_m_ins   = 0;
	float     g_m_ipc   = 0;
 
	float     g_f_rtime = 0;
	float     g_f_ptime = 0;
	long_long g_f_ins   = 0;
	float     g_f_ipc   = 0;
 
 
	int i;
	for (i = 0; i < MAFR_LOOP; ++ i)
	{
		float     m_rtime = 0;
		float     m_ptime = 0;
		long_long m_ins   = 0;
		float     m_ipc   = 0;
 
		/* Start counting events */
		if (PAPI_ipc(&m_rtime, &m_ptime, &m_ins, &m_ipc))
			abort();
 
		void * block = malloc(128);
 
		if (PAPI_ipc(&m_rtime, &m_ptime, &m_ins, &m_ipc))
			abort();
 
		/* Stop counting events */
		if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
			abort();
 
		g_m_rtime += m_rtime;
		g_m_ptime += m_ptime;
		g_m_ins   += m_ins;
		g_m_ipc   += m_ipc;
 
 
 
		float     f_rtime = 0;
		float     f_ptime = 0;
		long_long f_ins   = 0;
		float     f_ipc   = 0;
 
		/* Start counting events */
		if (PAPI_ipc(&f_rtime, &f_ptime, &f_ins, &f_ipc))
			abort();
 
		free(block);
 
		/* getting counted events */
		if (PAPI_ipc(&f_rtime, &f_ptime, &f_ins, &f_ipc))
			abort();
 
		/* Stop counting events */
		if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
			abort();
 
		g_f_rtime += f_rtime;
		g_f_ptime += f_ptime;
		g_f_ins   += f_ins;
		g_f_ipc   += f_ipc;
 
	}
 
	report("malloc", MAFR_LOOP, g_m_rtime, g_m_ptime, g_m_ins);
	report("free", MAFR_LOOP, g_f_rtime, g_f_ptime, g_f_ins);
}
 
 
typedef signed long oslInterlockedCount;
 
 
oslInterlockedCount  osl_incrementInterlockedCount(oslInterlockedCount* pCount)
{
	oslInterlockedCount nCount;
 
	__asm__ __volatile__ (
		"movl $1, %0\n\t"
		"lock\n\t" 
		"xaddl %0, %2\n\t"
		"incl %0"
	:	"=&r" (nCount), "=m" (*pCount)
	:	"m" (*pCount)
	:	"memory");
 
    return nCount;
}
 
oslInterlockedCount  osl_decrementInterlockedCount(oslInterlockedCount* pCount)
{
	oslInterlockedCount nCount;
 
	__asm__ __volatile__ (
		"movl $-1, %0\n\t"
		"lock\n\t"
		"xaddl %0, %2\n\t"
		"decl %0"
	:	"=&r" (nCount), "=m" (*pCount)
	:	"m" (*pCount)
	:	"memory");
 
    return nCount;
}
 
 
#define ACRE_LOOP 256
 
static void test_acquire_release(void)
{
	float     g_a_rtime = 0;
	float     g_a_ptime = 0;
	long_long g_a_ins   = 0;
	float     g_a_ipc   = 0;
 
	float     g_r_rtime = 0;
	float     g_r_ptime = 0;
	long_long g_r_ins   = 0;
	float     g_r_ipc   = 0;
 
 
	oslInterlockedCount bla = 0;
 
	int i;
	for (i = 0; i < ACRE_LOOP; ++ i)
	{
		float     a_rtime = 0;
		float     a_ptime = 0;
		long_long a_ins   = 0;
		float     a_ipc   = 0;
 
		/* Start counting events */
		if (PAPI_ipc(&a_rtime, &a_ptime, &a_ins, &a_ipc))
			abort();
 
		osl_incrementInterlockedCount(&bla);
 
		/* getting counted events */
		if (PAPI_ipc(&a_rtime, &a_ptime, &a_ins, &a_ipc))
			abort();
 
		/* Stop counting events */
		if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
			abort();
 
		g_a_rtime += a_rtime;
		g_a_ptime += a_ptime;
		g_a_ins   += a_ins;
		g_a_ipc   += a_ipc;
 
 
 
		float     r_rtime = 0;
		float     r_ptime = 0;
		long_long r_ins   = 0;
		float     r_ipc   = 0;
 
		/* Start counting events */
		if (PAPI_ipc(&r_rtime, &r_ptime, &r_ins, &r_ipc))
			abort();
 
		osl_decrementInterlockedCount(&bla);
 
		/* getting counted events */
		if (PAPI_ipc(&r_rtime, &r_ptime, &r_ins, &r_ipc))
			abort();
 
		/* Stop counting events */
		if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
			abort();
 
		g_r_rtime += r_rtime;
		g_r_ptime += r_ptime;
		g_r_ins   += r_ins;
		g_r_ipc   += r_ipc;
 
	}
 
	report("acquire", ACRE_LOOP, g_a_rtime, g_a_ptime, g_a_ins);
	report("release", ACRE_LOOP, g_r_rtime, g_r_ptime, g_r_ins);
}
 
 
#define CHARS_TO_COPY 1024
 
static void test_character_copy(void)
{
	float     g_rtime = 0;
	float     g_ptime = 0;
	long_long g_ins   = 0;
	float     g_ipc   = 0;
 
	int i;
	for (i = 0; i < 100; ++ i)
	{
		float     rtime = 0;
		float     ptime = 0;
		long_long ins   = 0;
		float     ipc   = 0;
 
		short int buff1[CHARS_TO_COPY];
		short int buff2[CHARS_TO_COPY];
 
		/* Start counting events */
		if (PAPI_ipc(&rtime, &ptime, &ins, &ipc))
			abort();
 
		memcpy(buff2, buff1, CHARS_TO_COPY);
 
		/* getting counted events */
		if (PAPI_ipc(&rtime, &ptime, &ins, &ipc))
			abort();
 
		/* Stop counting events */
		if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
			abort();
 
		g_rtime += rtime;
		g_ptime += ptime;
		g_ins   += ins;
		g_ipc   += ipc;
	}
 
	report("copied characters", CHARS_TO_COPY, g_rtime, g_ptime, g_ins);
}
 
 
int main()
{
	test_fun_call();
	test_malloc_free();
	test_acquire_release();
	test_character_copy();
 
	return 0;
}