Content Extraction

2025.1.01+

Content extraction is the second step in the knowledge base processing pipeline. It converts uploaded or referenced documents into plain text or Markdown, which can then be split and embedded for AI-powered search.

You can choose between several extraction methods depending on your use case.

Markdown

This option converts the document content into Markdown, preserving semantic structure such as headings, bullet points, and tables.

Supported MIME Types

  • Email files (message/rfc822, application/vnd.ms-outlook)
  • Plain text files (text/plain)
  • Markdown files (text/markdown)
  • Microsoft Word documents (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
  • Microsoft Excel spreadsheets (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
  • Microsoft PowerPoint presentations (application/vnd.openxmlformats-officedocument.presentationml.presentation)
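If you want to check a file's MIME type before selecting Markdown extraction, the list above can be expressed as a simple lookup. This is only a sketch: the class and method names are hypothetical and not part of any Flowable API, and the set contains exactly the MIME types listed above.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Flowable class): mirrors the MIME type list above.
class MarkdownExtractionMimeTypes {

    static final Set<String> SUPPORTED = Set.of(
            "message/rfc822",
            "application/vnd.ms-outlook",
            "text/plain",
            "text/markdown",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "application/vnd.openxmlformats-officedocument.presentationml.presentation");

    // Returns true if the MIME type appears in the list above.
    static boolean isSupported(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and normalize the case.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(base);
    }
}
```

For example, `MarkdownExtractionMimeTypes.isSupported("text/plain; charset=utf-8")` returns `true`, while a type that is not in the list returns `false`.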

Header Mapping for Microsoft Word

When extracting Microsoft Word documents, you can define a header mapping that converts specific Word styles into Markdown headings.

This is useful when documents use custom-named heading styles (e.g., "MyTitleStyle") instead of the default "Heading 1", "Heading 2", etc.

How to identify header styles

  1. Open the Styles Pane in Word and select the styled text.

    [Screenshot: finding the Styles Pane and opening the Modify Style dialog]

  2. Click Modify Style to view the exact name of the style.

    [Screenshot: sample Modify Style dialog showing the exact style name]

You can map multiple Word styles to the same Markdown header level.
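To make the idea of a header mapping concrete, the following sketch shows how style names could resolve to Markdown heading levels. It is purely illustrative: the class, the map, and the style names other than "MyTitleStyle" are assumptions, and it does not reflect how Flowable stores or applies the mapping internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not Flowable's configuration format or API.
class HeaderMappingSketch {

    // Word style name -> Markdown heading level (1 = "#", 2 = "##", ...).
    static final Map<String, Integer> STYLE_TO_LEVEL = new LinkedHashMap<>();

    static {
        STYLE_TO_LEVEL.put("MyTitleStyle", 1);
        STYLE_TO_LEVEL.put("MyChapterStyle", 1); // a second style mapped to the same level
        STYLE_TO_LEVEL.put("MySectionStyle", 2);
    }

    // Renders a paragraph as a Markdown heading if its style is mapped.
    static String toMarkdown(String styleName, String text) {
        Integer level = STYLE_TO_LEVEL.get(styleName);
        if (level == null) {
            return text; // unmapped styles stay ordinary paragraph text
        }
        return "#".repeat(level) + " " + text;
    }
}
```

Here both "MyTitleStyle" and "MyChapterStyle" produce a level 1 heading, while text with an unmapped style is passed through unchanged.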

Original Document

This option is only available when using the OpenAI Vector Store.

In this mode, the original document content is sent directly to OpenAI for parsing. Flowable does not perform any intermediate extraction or formatting.

Use this if you want OpenAI’s model to handle raw document parsing.

Custom

Warning: The custom extraction APIs are currently experimental and may change in future releases.

For advanced use cases, you can implement your own extraction logic by providing a custom Java class.

Implement the following interface: com.flowable.agent.engine.impl.knowledgebase.pipeline.steps.extractor.KnowledgeBaseTextBasedFormatExtractor

This gives you full control over how content is parsed before further processing.
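The methods declared by KnowledgeBaseTextBasedFormatExtractor are not shown on this page, so the sketch below uses a hypothetical stand-in interface (`TextExtractorSketch`) with an assumed `extract` method. It only illustrates the kind of logic a custom extractor might contain, here a naive CSV-to-Markdown-table conversion that does not handle quoted fields. Check the actual interface in your Flowable version before adapting it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in: the real KnowledgeBaseTextBasedFormatExtractor
// interface may declare different methods and parameters.
interface TextExtractorSketch {
    String extract(byte[] documentBytes);
}

// Naive example: turns simple comma-separated content into a Markdown table.
class CsvToMarkdownExtractorSketch implements TextExtractorSketch {

    @Override
    public String extract(byte[] documentBytes) {
        String content = new String(documentBytes, StandardCharsets.UTF_8);
        StringBuilder markdown = new StringBuilder();
        boolean headerWritten = false;

        for (String line : content.split("\\R")) {
            if (line.isBlank()) {
                continue; // skip empty lines
            }
            StringBuilder row = new StringBuilder("|");
            StringBuilder separator = new StringBuilder("|");
            for (String cell : line.split(",", -1)) {
                row.append(' ').append(cell.trim()).append(" |");
                separator.append(" --- |");
            }
            markdown.append(row).append('\n');
            if (!headerWritten) {
                // The first non-blank row is treated as the table header.
                markdown.append(separator).append('\n');
                headerWritten = true;
            }
        }
        return markdown.toString();
    }
}
```

Producing Markdown rather than unstructured text keeps the table structure available to the later splitting and embedding steps, in the same spirit as the built-in Markdown option.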