Version: v3.0

Data Pipeline Nodes

This document provides a detailed overview of the nodes that can be used within a data pipeline. Each node type performs a specific task, enabling the integration and execution of complex data workflows. Below are the fields required to create each type of node.

ETL Job Node

The ETL Job Node runs ETL jobs and accepts arguments for use within the job. For instance, it can run an ETL job that identifies the highest-paying job and its corresponding salary, then trigger subsequent jobs based on the output of the initial job.

  • Resource: Select an ETL job from the dropdown list of available jobs.
  • Node Name: A unique identifier for the node.
  • Input Configurations: Arguments that can be used within the job.
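
If the ETL job runs on AWS Glue (an assumption here, not something this page states), passing the node's Input Configurations as job arguments could look roughly like the hypothetical boto3 sketch below; all names are made up:

```python
import boto3

# Hypothetical sketch: start a Glue job run, forwarding the node's
# Input Configurations as job arguments.
glue = boto3.client("glue")

response = glue.start_job_run(
    JobName="find-highest-paying-job",           # hypothetical job name
    Arguments={
        "--input_table": "salaries",             # hypothetical argument
        "--output_path": "s3://my-bucket/out/",  # hypothetical argument
    },
)
print("Started job run:", response["JobRunId"])
```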

ML Model Inference Node

The ML Model Inference Node is designed to run machine learning models using input arguments to make predictions or decisions. For example, this node can process customer data to predict churn probabilities, with the results forwarded to the next node in the pipeline.

  • Resource: Choose a machine learning model from the list of accessible models.
  • Node Name: A unique identifier for the node.
  • Input Dataset: The dataset containing files for ML model inference.
  • Select Latest File: Automatically selects the latest file for inference if set to 'Yes'.
  • File Name Execution Property Key: Required if 'Select Latest File' is set to 'No'; specifies the file name from the dataset (must be an execution property key).
  • Target Dataset: The dataset where inference results are saved.
Note

Users can perform inference on the input dataset (with a soft limit of 10,000 files) or on up to 100 selected files.
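
As a rough sketch, assuming the model is served behind a SageMaker endpoint (an assumption; the platform's actual inference mechanism may differ), scoring one file from the input dataset might look like this:

```python
import boto3

# Hypothetical sketch: invoke a SageMaker endpoint with one CSV file from
# the input dataset and read back the predictions.
runtime = boto3.client("sagemaker-runtime")

with open("customers.csv", "rb") as f:               # a file from the input dataset
    response = runtime.invoke_endpoint(
        EndpointName="churn-model",                  # hypothetical endpoint name
        ContentType="text/csv",
        Body=f.read(),
    )

predictions = response["Body"].read().decode("utf-8")
print(predictions)                                   # e.g. churn probability per row
```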

Email Node

The Email Node is used to send emails when provided with the recipient, subject, and body arguments.

  • Node Name: A unique identifier for the node.
  • Email Recipient: A list of email addresses to notify (must be an execution property key). E.g. ['john.doe@amorphicdata.com', 'jane.doe@amorphicdata.com']
  • Email Subject: The subject of the email (must be an execution property key).
  • Email Body: The body of the email (must be an execution property key).
Note

Emails are not delivered to external domains from SES. To remove this restriction, please raise a support ticket.
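
Since delivery goes through Amazon SES (per the note above), a minimal sketch of the equivalent boto3 call, assuming the three execution property keys have already been resolved to concrete values:

```python
import boto3

# Minimal sketch of the SES call behind the Email node; the sender address
# is hypothetical and would need to be verified in SES.
ses = boto3.client("ses")

ses.send_email(
    Source="noreply@amorphicdata.com",  # hypothetical verified sender
    Destination={
        "ToAddresses": ["john.doe@amorphicdata.com", "jane.doe@amorphicdata.com"]
    },
    Message={
        "Subject": {"Data": "Pipeline finished"},
        "Body": {"Text": {"Data": "The ETL job completed successfully."}},
    },
)
```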

Textract Node

The Textract Node extracts text from documents, images, and other file types.

  • Node Name: A unique identifier for the node.
  • Input Dataset: The dataset containing files to extract text from (PDF, JPG, PNG supported).
  • File Processing Mode: All, Change Data Capture, or Time Based.
    • All: processes all documents in the source dataset
    • Change Data Capture: processes documents that have landed after the previous pipeline execution
    • Time Based: processes documents based on the chosen custom time period
  • Features: Choose features to extract: Text, Forms, Tables.
    • Text: extracts all text from the document
    • Forms: extracts forms as key-value pairs
    • Tables: extracts tables in CSV format
  • Target Dataset: The dataset where extracted text is saved.
Note

No two Textract nodes can have the same input dataset within a pipeline.
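
The Text, Forms, and Tables features map onto Amazon Textract's text detection and document analysis APIs. A minimal sketch with a hypothetical bucket and file (the synchronous APIs shown here accept single images; the node itself also handles PDFs):

```python
import boto3

# Minimal sketch of the Textract calls behind the Text, Forms and Tables
# features. Bucket and object names are hypothetical.
textract = boto3.client("textract")
doc = {"S3Object": {"Bucket": "my-dataset-bucket", "Name": "invoice.png"}}

# Text feature: plain text detection.
text = textract.detect_document_text(Document=doc)

# Forms and Tables features: structured key-value and table analysis.
analysis = textract.analyze_document(Document=doc, FeatureTypes=["FORMS", "TABLES"])

for block in analysis["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```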

Rekognition Node

The Rekognition Node analyzes images and videos to detect and identify objects, people, and text.

  • Node Name: A unique identifier for the node.
  • Input Dataset: The dataset containing files for analysis (MP4, JPG, PNG supported).
  • File Processing Mode: All, Change Data Capture, or Time Based.
    • All: processes all documents in the source dataset
    • Change Data Capture: processes documents that have landed after the previous pipeline execution
    • Time Based: processes documents based on the chosen custom time period
  • Features: Choose features to extract: Text, Faces, Content Moderation, Celebrities, Labels.
    • Text: extracts all text from the image or video
    • Faces: detects faces in an image or video
    • Content Moderation: extracts analysis results for inappropriate, unwanted, or offensive content
    • Celebrities: extracts the name and additional information about a celebrity
    • Labels: extracts each detected label's name and the percentage confidence in its accuracy
  • Target Dataset: The dataset where extracted data is saved.
Note

No two Rekognition nodes can have the same input dataset within a pipeline.
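
The features correspond to Amazon Rekognition's detection APIs. A minimal sketch of the Labels and Text features for a single image, with hypothetical names:

```python
import boto3

# Minimal sketch of the Rekognition calls behind the Labels and Text features.
rekognition = boto3.client("rekognition")
image = {"S3Object": {"Bucket": "my-dataset-bucket", "Name": "photo.jpg"}}

# Labels feature: label name plus confidence percentage.
labels = rekognition.detect_labels(Image=image, MaxLabels=10, MinConfidence=80)
for label in labels["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")

# Text feature: text detected in the image.
text = rekognition.detect_text(Image=image)
for detection in text["TextDetections"]:
    print(detection["DetectedText"])
```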

Translate Node

The Translate Node translates text from one language to another.

  • Node Name: A unique identifier for the node.
  • Source Dataset: The dataset containing files for translation (TXT supported).
  • File Processing Mode: All, Change Data Capture, or Time Based.
    • All: processes all documents in the source dataset
    • Change Data Capture: processes documents that have landed after the previous pipeline execution
    • Time Based: processes documents based on the chosen custom time period
  • Source Language: The language of the source text.
  • Target Language: The language to translate the text into.
  • Target Dataset: The dataset where translated text is saved (TXT supported).
Note
  • No two Translate nodes can have the same input dataset within a pipeline.
  • The node only translates the first 5,000 characters of the text.
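
A minimal sketch of the underlying Amazon Translate call, mirroring the node's behaviour of translating only the first 5,000 characters (file name and language pair are hypothetical):

```python
import boto3

# Minimal sketch of the Translate call behind the node, including the
# 5,000-character truncation noted above.
translate = boto3.client("translate")

with open("article.txt", encoding="utf-8") as f:
    text = f.read()[:5000]  # the node translates only the first 5,000 characters

result = translate.translate_text(
    Text=text,
    SourceLanguageCode="en",  # hypothetical source language
    TargetLanguageCode="es",  # hypothetical target language
)
print(result["TranslatedText"])
```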

Comprehend Node

The Comprehend Node extracts insights and relationships from text.

  • Node Name: A unique identifier for the node.
  • Input Dataset: The dataset containing files for analysis (TXT supported).
  • File Processing Mode: All, Change Data Capture, or Time Based.
  • Features: Choose features to extract: Entities, KeyPhrases, Sentiment, PiiEntities, Topics.
    • Entities: named entities (such as people, places, and organizations) in a document
    • KeyPhrases: key phrases or talking points in a document
    • Sentiment: overall sentiment of a text (positive, negative, neutral, or mixed)
    • PiiEntities: personally identifiable information (PII) entities in a document
    • Topics: most common topics in a document
  • Target Dataset: The dataset where extracted insights are saved.
Note

No two Comprehend nodes can have the same input dataset within a pipeline.
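
The features map onto Amazon Comprehend's detection APIs. A minimal sketch of the Entities, KeyPhrases, and Sentiment features on a hypothetical snippet of text:

```python
import boto3

# Minimal sketch of the Comprehend calls behind the Entities, KeyPhrases
# and Sentiment features.
comprehend = boto3.client("comprehend")
text = "The support team in Seattle resolved my issue quickly. Great service!"

entities = comprehend.detect_entities(Text=text, LanguageCode="en")
phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")

print([e["Text"] for e in entities["Entities"]])
print([p["Text"] for p in phrases["KeyPhrases"]])
print(sentiment["Sentiment"])  # POSITIVE, NEGATIVE, NEUTRAL or MIXED
```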

Medical Comprehend Node

The Medical Comprehend Node processes medical text to extract insights and relationships.

  • Node Name: A unique identifier for the node.
  • Input Dataset: The dataset containing files for analysis (TXT supported).
  • File Processing Mode: All, Change Data Capture, or Time Based.
    • All: processes all documents in the source dataset
    • Change Data Capture: processes documents that have landed after the previous pipeline execution
    • Time Based: processes documents based on the chosen custom time period
  • Features: Choose features to extract: Medications, Medical Conditions, Personal Health Information, Medical Entities.
    • Medications: detects medication and dosage information for the patient
    • Medical Conditions: detects the signs, symptoms, and diagnosis of medical conditions
    • Personal Health Information: detects the patient's personal information
    • Medical Entities: all the medical and personal information in the document
  • Target Dataset: The dataset where extracted medical information is saved.
Note

No two Medical Comprehend nodes can have the same input dataset within a pipeline.
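
The features map onto Amazon Comprehend Medical. A minimal sketch of the Medical Entities and Personal Health Information features on a hypothetical sentence:

```python
import boto3

# Minimal sketch of the Comprehend Medical calls behind the Medical Entities
# and Personal Health Information features.
medical = boto3.client("comprehendmedical")
text = "Patient John was prescribed 40 mg of ibuprofen for chronic knee pain."

# Medical Entities: medications, conditions, dosages, etc.
entities = medical.detect_entities_v2(Text=text)
for entity in entities["Entities"]:
    print(entity["Category"], "->", entity["Text"])

# Personal Health Information: the patient's personal details.
phi = medical.detect_phi(Text=text)
for entity in phi["Entities"]:
    print(entity["Type"], "->", entity["Text"])
```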

Transcribe Node

The Transcribe Node converts audio files into text.

  • Node Name: A unique identifier for the node.
  • Source Dataset: The dataset containing audio files for transcription (MP3, WAV supported).
  • Source Language: The language of the audio files.
  • File Processing Mode: All, Change Data Capture, or Time Based.
    • All: processes all documents in the source dataset
    • Change Data Capture: processes documents that have landed after the previous pipeline execution
    • Time Based: processes documents based on the chosen custom time period
  • Features: Choose features to extract: Text, ConversationBySpeaker, RedactedText, RedactedConversationBySpeaker.
    • Text: raw text extracted from the audio file
    • ConversationBySpeaker: raw conversation showing each speaker and the sentence they spoke
    • RedactedText: extracted text with some content obscured for legal and security purposes
    • RedactedConversationBySpeaker: conversation showing each speaker and the sentence they spoke, with some content obscured for legal and security purposes
  • Target Dataset: The dataset where transcribed text is saved.
Note

No two Transcribe nodes can have the same input dataset within a pipeline.
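
The four features correspond to Amazon Transcribe's speaker labels and content redaction options. A minimal sketch of starting such a job, with hypothetical names (results are written asynchronously and fetched once the job completes):

```python
import boto3

# Minimal sketch: start a transcription job with speaker labels
# (ConversationBySpeaker) and PII redaction (RedactedText) enabled.
transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="call-2024-01-15",                    # hypothetical name
    Media={"MediaFileUri": "s3://my-dataset-bucket/call.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
    ContentRedaction={"RedactionType": "PII", "RedactionOutput": "redacted"},
)
```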

Medical Transcribe Node

The Medical Transcribe Node converts medical audio files into text.

  • Node Name: A unique identifier for the node.
  • Source Dataset: The dataset containing audio files for transcription (MP3, WAV supported).
  • Source Language: Currently supports only English-US (en-US).
  • File Processing Mode: All, Change Data Capture, or Time Based.
    • All: processes all documents in the source dataset
    • Change Data Capture: processes documents that have landed after the previous pipeline execution
    • Time Based: processes documents based on the chosen custom time period
  • Features: Choose features to extract: Text, ConversationBySpeaker.
    • Text: raw text extracted from the audio file
    • ConversationBySpeaker: raw conversation showing each speaker and the sentence they spoke
  • Target Dataset: The dataset where transcribed text is saved.
Note

No two Medical Transcribe nodes can have the same input dataset within a pipeline.

Data Pipeline Node

The Data Pipeline Node allows for the integration of existing data pipelines, enabling them to run either in parallel or sequentially. For example, it can execute a data pipeline that consists of an ETL Job node followed by an Email node concurrently or consecutively with a data pipeline consisting of a Translate node followed by an Email node.

  • Resource: Select a data pipeline from the list of accessible data pipelines.
  • Node Name: A unique identifier for the node.

When a parent pipeline is stopped, it automatically stops all child pipelines. Execution properties set at the parent pipeline level take precedence over those defined at the child pipeline level.

File Load Validation Node

The File Load Validation Node is used to validate data before it is loaded into the system.

  • Node Name: A unique identifier for the node.
  • Timeout: Optional timeout value (in minutes) for node execution. Default is 60 minutes.

It is advised to set the number of concurrent runs to 1 on the ETL job when a File Load Validation node is used. For advanced use cases involving concurrent data pipeline executions, refer to: Advanced usage of file load validation node

Note

A File Load Validation node can only follow an ETL Job node; it cannot exist by itself or follow other node types in a pipeline.

Sync To S3 Node

The Sync To S3 Node synchronizes data to S3 storage.

  • Node Name: A unique identifier for the node.
  • Concurrency Factor: The number of datasets to sync in parallel (1-10).
  • Domain: The domain name for dataset synchronization.
  • Sync All Datasets: Indicates whether to sync all datasets in the domain.
  • Select Datasets: List of datasets to sync if 'Sync All Datasets' is set to 'No'.
  • Timeout: Optional timeout value (in minutes) for node execution. Default is 60 minutes.

Users can click 'Download' in the execution properties of the Sync To S3 node to retrieve the manifest file. The graphic below shows the manifest file generated after the node is executed:

Output manifest file

Note

The "Sync to S3" operation can only be executed one at a time for a specific dataset within the same domain.

Datasource Node

The Datasource Node integrates and runs datasources created within the Amorphic platform as part of the data pipeline.

  • Node Name: A unique identifier for the node.
  • Ingestion Type: Select the ingestion type: Normal Data Load or Full Load Bulk Data.
  • Dataset: Select the dataset for data ingestion (available for normal data load datasources).
  • Dataflow: Select the bulk load dataflow to run (available for bulk data load datasources).

Note: The Dataflow must be in a 'ready' state for the data pipeline execution to start. If it is not, bring it to a 'ready' state from the Dataflows page before triggering the execution.

Summarization Node

The Summarization Node generates a summary of the provided input text using an LLM.

  • Node Name: A unique identifier for the node.
  • Input Dataset: The dataset containing files for summarization (TXT, PDF supported).
  • File Processing Mode: All, Selected Files, or Change Data Capture.
    • All: processes all documents in the source dataset
    • Selected Files: processes a subset of documents in the source dataset (max 100 files)
    • Change Data Capture: processes documents that have landed after the previous pipeline execution
  • Model: Select the LLM model to use for summarization.
  • Target Dataset: The dataset where summarized text is saved (TXT supported).
Note

The Model must be enabled from the AWS Bedrock console for the Summarization node to work.
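
As an illustration of the kind of Bedrock call involved, the sketch below summarizes one input file with the Converse API; the model ID is a hypothetical choice and, as noted above, must be enabled in the Bedrock console:

```python
import boto3

# Minimal sketch: summarize one text file via the Bedrock Converse API.
bedrock = boto3.client("bedrock-runtime")

with open("report.txt", encoding="utf-8") as f:  # a file from the input dataset
    document = f.read()

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # hypothetical model choice
    messages=[{
        "role": "user",
        "content": [{"text": f"Summarize the following text:\n\n{document}"}],
    }],
)
print(response["output"]["message"]["content"][0]["text"])
```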

LLM Node (Beta)

The LLM Node supports custom data processing with LLMs: users provide a prompt, and the node applies it to the input data to generate the desired output.

  • Node Name: A unique identifier for the node.
  • Input Dataset: The dataset containing files for LLM processing (all tabular data types supported).
  • File Processing Mode: All, Selected Files, or Change Data Capture.
    • All: processes all documents in the source dataset
    • Selected Files: processes a subset of documents in the source dataset (max 100 files)
    • Change Data Capture: processes documents that have landed after the previous pipeline execution
  • Model: Select the LLM model to use for LLM processing.
  • Prompt: The prompt to use for LLM processing. Ensure that the prompt is clear and concise.
  • Target Dataset: The dataset where LLM-processed text is saved.

An example of a prompt is as follows:

For each 'region' in the dataset, calculate the average 'customer_satisfaction_score' for the last quarter (use rows where 'date_of_feedback' is within the last 3 months).
Then, identify regions where the average score is below 3.5 and list the top 2 complaints (based on frequency) from customers in those regions.
Output the result as a JSON object with fields: 'region', 'average_score', and 'top_complaints'.

With this prompt, the LLM Node performs the following advanced operations:

  • Time-based filtering: Select only rows where date_of_feedback is within the last 3 months.
  • Grouping: Group feedback by region.
  • Aggregation: Compute the average customer_satisfaction_score per region.
  • Conditional filtering: Select only regions where the average score is below 3.5.
  • Frequency analysis: Identify the top 2 most frequent complaints in those filtered regions.
  • Structured output: Format the result as a JSON object suitable for downstream analytics.
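
For sanity-checking the node's output, the same transformation can be written directly in pandas. The sketch below assumes hypothetical column names taken from the prompt, plus a made-up 'complaint' column:

```python
import json

import pandas as pd

# Hypothetical pandas equivalent of the prompt above, for validating the
# LLM node's JSON output. Column names are assumptions.
df = pd.read_csv("feedback.csv", parse_dates=["date_of_feedback"])

# Time-based filtering: rows from the last 3 months.
cutoff = df["date_of_feedback"].max() - pd.DateOffset(months=3)
recent = df[df["date_of_feedback"] >= cutoff]

# Grouping + aggregation: average satisfaction score per region.
avg = recent.groupby("region")["customer_satisfaction_score"].mean()

# Conditional filtering: regions scoring below 3.5.
low = avg[avg < 3.5]

# Frequency analysis + structured output: top 2 complaints per region.
result = [
    {
        "region": region,
        "average_score": round(score, 2),
        "top_complaints": recent.loc[recent["region"] == region, "complaint"]
        .value_counts()
        .head(2)
        .index.tolist(),
    }
    for region, score in low.items()
]
print(json.dumps(result, indent=2))
```
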
Note
  • This node is currently in a beta stage. It is advised to use this node with caution and to test it thoroughly before using it in production.
  • The Model must be enabled from the AWS Bedrock console for the LLM node to work.

This document aims to provide a comprehensive understanding of the various nodes available within the data pipeline framework, enabling users to effectively design and execute complex data workflows.