Data Pipeline Nodes
This document provides an overview of the nodes that can be used within a data pipeline. Each node type performs a specific task, enabling the integration and execution of complex data workflows. Below you will find the fields required to create each type of node.
ETL Job Node
The ETL Job Node runs an ETL job, with optional arguments that the job can use at runtime. For instance, this node can run an ETL job that identifies the highest-paying job and its corresponding salary, and then trigger subsequent jobs based on the output of the first job.
Attribute | Description |
---|---|
Resource | Select an ETL job from the dropdown list of available jobs. |
Node Name | A unique identifier for the node. |
Input Configurations | Arguments that can be used within the job. |
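As a sketch of how Input Configurations reach the job, an AWS Glue (PySpark) ETL job can resolve its arguments at runtime. This assumes the node passes arguments in Glue's `--key value` form; the argument name `target_table` is hypothetical:

```python
# Sketch of an ETL job reading an argument supplied via Input Configurations.
# Assumes a Glue job invoked with: --target_table salaries
import sys

from awsglue.utils import getResolvedOptions  # available inside the Glue runtime

# Resolve the hypothetical 'target_table' argument passed to the job.
args = getResolvedOptions(sys.argv, ["target_table"])
print(f"Running ETL against table: {args['target_table']}")
```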
ML Model Inference Node
The ML Model Inference Node is designed to run machine learning models using input arguments to make predictions or decisions. For example, this node can process customer data to predict churn probabilities, with the results forwarded to the next node in the pipeline.
Attribute | Description |
---|---|
Resource | Choose a machine learning model from the list of accessible models. |
Node Name | A unique identifier for the node. |
Input Dataset | The dataset containing files for ML model inference. |
Select Latest File | Automatically selects the latest file for inference if set to 'Yes'. |
File Name Execution Property Key | Required if 'Select Latest File' is set to 'No'; specifies the file name from the dataset (must be an execution property key). |
Target Dataset | The dataset where inference results are saved. |
Users can perform inference on the input dataset (with a soft limit of 10,000 files) or on up to 100 selected files.
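If the selected model is served behind a real-time endpoint (for example, Amazon SageMaker), the per-file call the node performs is conceptually similar to the sketch below; the endpoint name and CSV payload are assumptions, not the node's actual implementation:

```python
import boto3

# Hypothetical endpoint serving the churn model from the example above.
runtime = boto3.client("sagemaker-runtime")

payload = "42,3,0.87,1200.50"  # illustrative customer feature row (CSV)
response = runtime.invoke_endpoint(
    EndpointName="churn-model-endpoint",  # hypothetical endpoint name
    ContentType="text/csv",
    Body=payload,
)
print(response["Body"].read().decode("utf-8"))  # e.g. a churn probability
```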
Email Node
The Email Node is used to send emails when provided with the recipient, subject, and body arguments.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Email Recipient | A list of email addresses to notify (must be an execution property key). E.g.: ['john.doe@amorphicdata.com','jane.doe@amorphicdata.com'] |
Email Subject | The subject of the email (must be an execution property key). |
Email Body | The body of the email (must be an execution property key). |
Emails are not delivered to external domains from SES. To remove this restriction, please raise a support ticket.
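Since the recipient, subject, and body are all read from execution property keys, the supplied values might look like the following minimal sketch (the key names shown are hypothetical):

```python
# Hypothetical execution properties consumed by an Email node.
execution_properties = {
    "email_recipients": "['john.doe@amorphicdata.com','jane.doe@amorphicdata.com']",
    "email_subject": "Nightly pipeline completed",
    "email_body": "The ETL job finished successfully; see the target dataset for results.",
}
```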
Textract Node
The Textract Node extracts text from documents and images.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Input Dataset | The dataset containing files to extract text from (PDF, JPG, PNG supported). |
File Processing Mode | Modes: All, Change Data Capture, Time Based. All: processes all documents in the source dataset; Change Data Capture: processes only the documents that have landed since the previous pipeline execution; Time Based: processes documents within the custom time period chosen. |
Features | Choose features to extract: Text, Forms, Tables. Text: extracts all text from the document; Forms: extracts forms as key-value pairs; Tables: extracts tables in CSV format. |
Target Dataset | The dataset where extracted text is saved. |
No two Textract nodes can have the same input dataset within a pipeline.
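This node is backed by Amazon Textract; a minimal sketch of an equivalent direct call for a single image is shown below (bucket and file names are hypothetical; synchronous `analyze_document` accepts JPG/PNG, while PDFs go through the asynchronous `start_document_analysis` API):

```python
import boto3

textract = boto3.client("textract")

# Analyze one image from a hypothetical dataset bucket, requesting the
# Forms and Tables features in addition to plain text.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-dataset-bucket", "Name": "invoices/inv-001.png"}},
    FeatureTypes=["FORMS", "TABLES"],
)

# Print every detected line of text.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```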
Rekognition Node
The Rekognition Node analyzes images and videos to detect and identify objects, people, and text.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Input Dataset | The dataset containing files for analysis (MP4, JPG, PNG supported). |
File Processing Mode | Modes: All, Change Data Capture, Time Based. All: processes all documents in the source dataset; Change Data Capture: processes only the documents that have landed since the previous pipeline execution; Time Based: processes documents within the custom time period chosen. |
Features | Choose features to extract: Text, Faces, Content Moderation, Celebrities, Labels. Text: extracts all text from the image or video; Faces: detects faces in the image or video; Content Moderation: flags inappropriate, unwanted, or offensive content; Celebrities: returns the name and additional information about each detected celebrity; Labels: returns each label name along with the percentage confidence in the accuracy of the detected label. |
Target Dataset | The dataset where extracted data is saved. |
No two Rekognition nodes can have the same input dataset within a pipeline.
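This node is backed by Amazon Rekognition; as a sketch, the Labels feature for a single image corresponds to a call like the following (bucket and file names are hypothetical):

```python
import boto3

rekognition = boto3.client("rekognition")

# Detect labels in one image from a hypothetical dataset bucket.
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-dataset-bucket", "Name": "photos/street.jpg"}},
    MaxLabels=10,
    MinConfidence=80.0,
)

# Each label carries a name and a confidence percentage.
for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
```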
Translate Node
The Translate Node translates text from one language to another.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Source Dataset | The dataset containing files for translation (TXT supported). |
File Processing Mode | Modes: All, Change Data Capture, Time Based. All: processes all documents in the source dataset; Change Data Capture: processes only the documents that have landed since the previous pipeline execution; Time Based: processes documents within the custom time period chosen. |
Source Language | The language of the source text. |
Target Language | The language to translate the text into. |
Target Dataset | The dataset where translated text is saved (TXT supported). |
- No two Translate nodes can have the same input dataset within a pipeline.
- The node translates only the first 5,000 characters of the text.
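The node is backed by Amazon Translate; a minimal sketch of an equivalent call, truncating the input to mirror the 5,000-character limit noted above, looks like this (the input file and language codes are illustrative):

```python
import boto3

translate = boto3.client("translate")

source_text = open("article.txt", encoding="utf-8").read()  # illustrative input file

# Translate English to Spanish, keeping only the first 5,000 characters
# to mirror the node's documented limit.
result = translate.translate_text(
    Text=source_text[:5000],
    SourceLanguageCode="en",
    TargetLanguageCode="es",
)
print(result["TranslatedText"])
```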
Comprehend Node
The Comprehend Node extracts insights and relationships from text.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Input Dataset | The dataset containing files for analysis (TXT supported). |
File Processing Mode | Modes: All, Change Data Capture, Time Based. All: processes all documents in the source dataset; Change Data Capture: processes only the documents that have landed since the previous pipeline execution; Time Based: processes documents within the custom time period chosen. |
Features | Choose features to extract: Entities, KeyPhrases, Sentiment, PiiEntities, Topics. Entities: named entities (people, places, locations, etc.) in a document; KeyPhrases: key phrases or talking points in a document; Sentiment: the overall sentiment of the text (positive, negative, neutral, or mixed); PiiEntities: personally identifiable information (PII) entities in a document; Topics: the most common topics in a document. |
Target Dataset | The dataset where extracted insights are saved. |
No two Comprehend nodes can have the same input dataset within a pipeline.
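The node is backed by Amazon Comprehend; as a sketch, the Sentiment and Entities features for a single text correspond to calls like these (the input text is illustrative):

```python
import boto3

comprehend = boto3.client("comprehend")

text = "Amorphic made our quarterly reporting effortless."  # illustrative input

# Overall sentiment: POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"])

# Named entities with their types (PERSON, LOCATION, ORGANIZATION, ...).
entities = comprehend.detect_entities(Text=text, LanguageCode="en")
for entity in entities["Entities"]:
    print(f"{entity['Type']}: {entity['Text']}")
```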
Medical Comprehend Node
The Medical Comprehend Node processes medical text to extract insights and relationships.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Input Dataset | The dataset containing files for analysis (TXT supported). |
File Processing Mode | Modes: All, Change Data Capture, Time Based. All: processes all documents in the source dataset; Change Data Capture: processes only the documents that have landed since the previous pipeline execution; Time Based: processes documents within the custom time period chosen. |
Features | Choose features to extract: Medications, Medical Conditions, Personal Health Information, Medical Entities. Medications: detects medication and dosage information for the patient; Medical Conditions: detects the signs, symptoms, and diagnosis of medical conditions; Personal Health Information: detects the patient's personal information; Medical Entities: extracts all the medical and personal information in the document. |
Target Dataset | The dataset where extracted medical information is saved. |
No two Medical Comprehend nodes can have the same input dataset within a pipeline.
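This node is backed by Amazon Comprehend Medical; a sketch of the Medical Entities and Personal Health Information features for one document looks like this (the clinical note is illustrative):

```python
import boto3

comprehend_medical = boto3.client("comprehendmedical")

clinical_note = "Patient was prescribed 500 mg amoxicillin twice daily."  # illustrative

# Medications, conditions, and other medical entities.
entities = comprehend_medical.detect_entities_v2(Text=clinical_note)
for entity in entities["Entities"]:
    print(f"{entity['Category']}: {entity['Text']}")

# Protected health information (PHI) entities only.
phi = comprehend_medical.detect_phi(Text=clinical_note)
for entity in phi["Entities"]:
    print(f"PHI {entity['Type']}: {entity['Text']}")
```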
Transcribe Node
The Transcribe Node converts audio files into text.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Source Dataset | The dataset containing audio files for transcription (MP3, WAV supported). |
Source Language | The language of the audio files. |
File Processing Mode | Modes: All, Change Data Capture, Time Based. All: processes all documents in the source dataset; Change Data Capture: processes only the documents that have landed since the previous pipeline execution; Time Based: processes documents within the custom time period chosen. |
Features | Choose features to extract: Text, ConversationBySpeaker, RedactedText, RedactedConversationBySpeaker. Text: raw text extracted from the audio file; ConversationBySpeaker: the conversation broken out by speaker, showing who spoke each sentence; RedactedText: text extracted from the audio file with some content obscured for legal and security purposes; RedactedConversationBySpeaker: the conversation by speaker with some content obscured for legal and security purposes. |
Target Dataset | The dataset where transcribed text is saved. |
No two Transcribe nodes can have the same input dataset within a pipeline.
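The node is backed by Amazon Transcribe; a sketch of a job that would support both the ConversationBySpeaker and Redacted features is shown below (job, bucket, and file names are hypothetical):

```python
import boto3

transcribe = boto3.client("transcribe")

# Start an asynchronous transcription job with speaker labels and PII
# redaction enabled; results are written to S3 when the job finishes.
transcribe.start_transcription_job(
    TranscriptionJobName="pipeline-call-001",  # hypothetical job name
    Media={"MediaFileUri": "s3://my-dataset-bucket/calls/call-001.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
    ContentRedaction={
        "RedactionType": "PII",
        "RedactionOutput": "redacted_and_unredacted",
    },
)
```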
Medical Transcribe Node
The Medical Transcribe Node converts medical audio files into text.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Source Dataset | The dataset containing audio files for transcription (MP3, WAV supported). |
Source Language | Currently supports only English-US (en-US). |
File Processing Mode | Modes: All, Change Data Capture, Time Based. All: processes all documents in the source dataset; Change Data Capture: processes only the documents that have landed since the previous pipeline execution; Time Based: processes documents within the custom time period chosen. |
Features | Choose features to extract: Text, ConversationBySpeaker. Text: raw text extracted from the audio file; ConversationBySpeaker: the conversation broken out by speaker, showing who spoke each sentence. |
Target Dataset | The dataset where transcribed text is saved. |
No two Medical Transcribe nodes can have the same input dataset within a pipeline.
Data Pipeline Node
The Data Pipeline Node allows existing data pipelines to be integrated and run either in parallel or sequentially. For example, it can execute a data pipeline consisting of an ETL Job node followed by an Email node, concurrently or consecutively with a data pipeline consisting of a Translate node followed by an Email node.
Attribute | Description |
---|---|
Resource | Select a data pipeline from the list of accessible data pipelines. |
Node Name | A unique identifier for the node. |
When a parent pipeline is stopped, it automatically stops all child pipelines. Execution properties set at the parent pipeline level take precedence over those defined at the child pipeline level.
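A minimal sketch of that precedence rule, assuming execution properties behave as flat key-value pairs (the keys shown are hypothetical):

```python
# Hypothetical execution properties defined at each level.
child_properties = {"run_date": "2024-01-01", "region": "us-east-1"}
parent_properties = {"run_date": "2024-02-01"}

# Parent-level values win wherever a key is defined at both levels.
effective = {**child_properties, **parent_properties}
print(effective)  # {'run_date': '2024-02-01', 'region': 'us-east-1'}
```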
File Load Validation Node
The File Load Validation Node is used to validate data before it is loaded into the system.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Timeout | Optional timeout value (in minutes) for node execution. Default is 60 minutes. |
It is advised to set the number of concurrent runs to 1 on the ETL job when a File Load Validation node is used. For advanced use cases involving concurrent data pipeline executions, refer to: Advanced usage of file load validation node
A File Load Validation node can only follow an ETL Job node; it cannot exist by itself or follow other types of nodes in a pipeline.
Sync To S3 Node
The Sync To S3 Node synchronizes datasets to S3 storage.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Concurrency Factor | The number of datasets to sync in parallel (1-10). |
Domain | The domain name for dataset synchronization. |
Sync All Datasets | Indicates whether to sync all datasets in the domain. |
Select Datasets | List of datasets to sync if 'Sync All Datasets' is set to 'No'. |
Timeout | Optional timeout value (in minutes) for node execution. Default is 60 minutes. |
Users can click 'Download' in the execution properties of the Sync To S3 node to retrieve the manifest file that is generated after the node is executed.
The "Sync to S3" operation can only be executed one at a time for a specific dataset within the same domain.
Datasource Node
The Datasource Node integrates and runs datasources created within the Amorphic platform as part of the data pipeline.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Ingestion Type | Select the ingestion type: Normal Data Load, Full Load Bulk Data. |
Dataset | Select the dataset for data ingestion (available for normal data load datasources). |
Dataflow | Select the bulk load dataflow to run (available for bulk data load datasources). Note: the dataflow must be in a 'ready' state for the data pipeline execution to start; if it is not, bring it to a 'ready' state from the Dataflows page before triggering the execution. |
Summarization Node
The Summarization Node generates a summary of the provided input text using an LLM.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Input Dataset | The dataset containing files for summarization (TXT, PDF supported). |
File Processing Mode | Modes: All, Selected Files, Change Data Capture. All: processes all documents in the source dataset; Selected Files: processes a subset of documents in the source dataset (max 100 files); Change Data Capture: processes only the documents that have landed since the previous pipeline execution. |
Model | Select the LLM model to use for summarization. |
Target Dataset | The dataset where summarized text is saved (TXT supported). |
The selected model must be enabled in the AWS Bedrock console for the Summarization node to work.
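Since the node runs on a Bedrock-hosted model, the underlying call is conceptually similar to the Bedrock Converse API, sketched below; the model ID and prompt wording are illustrative, not the node's actual implementation:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

document_text = open("report.txt", encoding="utf-8").read()  # illustrative input

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # must be enabled in Bedrock
    messages=[
        {
            "role": "user",
            "content": [{"text": f"Summarize the following document:\n\n{document_text}"}],
        }
    ],
)
print(response["output"]["message"]["content"][0]["text"])
```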
LLM Node (Beta)
The LLM Node can be used for customized data processing using LLMs. Users can provide custom prompts to the LLM Node to generate customized data.
Attribute | Description |
---|---|
Node Name | A unique identifier for the node. |
Input Dataset | The dataset containing files for LLM processing (All Tabular Data Types supported). |
File Processing Mode | Modes: All, Selected Files, Change Data Capture. All: processes all documents in the source dataset; Selected Files: processes a subset of documents in the source dataset (max 100 files); Change Data Capture: processes only the documents that have landed since the previous pipeline execution. |
Model | Select the LLM model to use for LLM processing. |
Prompt | The prompt to use for LLM processing. Ensure that the prompt is clear and concise. |
Target Dataset | The dataset where LLM processed text is saved. |
An example of a prompt is as follows:
For each 'region' in the dataset, calculate the average 'customer_satisfaction_score' for the last quarter (use rows where 'date_of_feedback' is within the last 3 months).
Then, identify regions where the average score is below 3.5 and list the top 2 complaints (based on frequency) from customers in those regions.
Output the result as a JSON object with fields: 'region', 'average_score', and 'top_complaints'.
With this prompt, the LLM Node performs the following advanced operations:
- Time-based filtering: Select only rows where date_of_feedback is within the last 3 months.
- Grouping: Group feedback by region.
- Aggregation: Compute the average customer_satisfaction_score per region.
- Conditional filtering: Select only regions where the average score is below 3.5.
- Frequency analysis: Identify the top 2 most frequent complaints in those filtered regions.
- Structured output: Format the result as a JSON object suitable for downstream analytics.
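For the prompt above, the output the node is asked to produce would be shaped roughly as follows (all values are made up for illustration):

```python
# Illustrative shape of the requested output; values are invented, not real results.
result = [
    {
        "region": "us-west",
        "average_score": 3.2,
        "top_complaints": ["late delivery", "billing errors"],
    },
]
```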
- This node is currently in a beta stage. It is advised to use this node with caution and to test it thoroughly before using it in production.
- The selected model must be enabled in the AWS Bedrock console for the LLM node to work.
This document aims to provide a comprehensive understanding of the various nodes available within the data pipeline framework, enabling users to effectively design and execute complex data workflows.