# Data Pipeline Nodes
This document describes the node types available in a data pipeline. Each node type performs a specific task, and nodes can be combined to build complex data workflows. The fields required to create each node type are listed below.
## ETL Job Node
The ETL Job Node runs an ETL job, optionally passing arguments that the job can use. For example, this node can run an ETL job that identifies the highest-paying job and its salary, and then trigger subsequent jobs based on that job's output.
| Attribute | Description |
|---|---|
| Resource | Select an ETL job from the dropdown list of available jobs. |
| Node Name | A unique identifier for the node. |
| Input Configurations | Arguments passed to the job at run time. |
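The attributes above map naturally onto a small configuration object. The following sketch is hypothetical (this document does not specify the platform's actual API or field names) and shows how an ETL Job Node definition with input arguments might be assembled and minimally validated:

```python
def make_etl_job_node(resource, node_name, input_configurations=None):
    """Build a hypothetical ETL Job Node definition.

    `resource` is the ETL job selected from the dropdown, `node_name`
    must be a unique identifier for the node, and `input_configurations`
    holds the arguments made available to the job at run time.
    """
    if not resource:
        raise ValueError("Resource is required")
    if not node_name:
        raise ValueError("Node Name is required and must be unique")
    return {
        "type": "etl_job",
        "resource": resource,
        "node_name": node_name,
        "input_configurations": input_configurations or {},
    }

# Example: a job that finds the highest-paying job title and its salary.
node = make_etl_job_node(
    resource="find_highest_paying_job",   # hypothetical job name
    node_name="highest-salary-etl",
    input_configurations={"top_n": 1},
)
```

The downstream nodes described above would then consume this job's output.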
## ML Model Inference Node
The ML Model Inference Node runs a machine learning model on input data to make predictions or decisions. For example, this node can process customer data to predict churn probabilities and forward the results to the next node in the pipeline.
| Attribute | Description |
|---|---|
| Resource | Choose a machine learning model from the list of accessible models. |
| Node Name | A unique identifier for the node. |
| Input Dataset | The dataset containing files for ML model inference. |
| Select Latest File | Automatically selects the latest file for inference if set to 'Yes'. |
| File Name Execution Property Key | Required when 'Select Latest File' is set to 'No'. The execution property key whose value specifies the file name to read from the input dataset. |
| Target Dataset | The dataset where inference results are saved. |
Users can run inference on the entire input dataset (subject to a soft limit of 10,000 files) or on up to 100 individually selected files.
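The file-selection rules above can be captured in a small validator. This is a hypothetical sketch (the field names are illustrative, not the platform's actual schema); it enforces the constraint that a File Name Execution Property Key must be supplied whenever Select Latest File is 'No':

```python
def validate_ml_inference_node(node):
    """Validate a hypothetical ML Model Inference Node definition."""
    required = ["resource", "node_name", "input_dataset", "target_dataset"]
    for field in required:
        if not node.get(field):
            raise ValueError(f"Missing required field: {field}")
    if node.get("select_latest_file") == "No":
        # When the latest file is not auto-selected, the file name must
        # come from an execution property key.
        if not node.get("file_name_execution_property_key"):
            raise ValueError(
                "File Name Execution Property Key is required when "
                "Select Latest File is 'No'"
            )

# Example: churn prediction, reading the file name from an execution property.
node = {
    "resource": "churn_model_v2",                       # hypothetical model
    "node_name": "predict-churn",
    "input_dataset": "customer_features",               # hypothetical dataset
    "target_dataset": "churn_predictions",
    "select_latest_file": "No",
    "file_name_execution_property_key": "inference_file_name",
}
validate_ml_inference_node(node)  # passes without raising
```

Omitting `file_name_execution_property_key` from this definition would raise a `ValueError`, mirroring the requirement in the table above.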