Version: v3.0

Dataload Limits

Data load limits are governed primarily by the Data Load Throttling setting.

  • If enabled: dataset files are processed one by one, with completion time depending on the number of files.
  • If disabled: dataset files are processed and completed immediately.

Glue File Processing (API Only)

The system supports an additional API parameter, GlueFileProcessing, which can be used in conjunction with DataLoadThrottling to enable faster processing using Glue jobs. This is an API-only feature, configured with the following request:

Path: /datasets/dataload-throttling
Method: PUT

{
  "GlueFileProcessing": "enable" or "disable",
  "DataLoadThrottling": "enable" or "disable"
}
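
For illustration, here is a minimal sketch of calling this endpoint with Python's requests library. The base URL, token, and header values are placeholders for your deployment, not part of the documented API:

import requests

# Placeholder values: substitute your Amorphic API endpoint and credentials.
BASE_URL = "https://<your-amorphic-api>"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

payload = {
    "GlueFileProcessing": "enable",  # faster processing via Glue jobs
    "DataLoadThrottling": "enable",  # used in conjunction with Glue file processing
}

response = requests.put(f"{BASE_URL}/datasets/dataload-throttling",
                        json=payload, headers=HEADERS, timeout=30)
response.raise_for_status()
print(response.json())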

Note:

  • Files uploaded directly from the Amorphic UI dataset details page are processed immediately, even if throttling is enabled.
  • Files uploaded through jobs, ingestion, etc., still follow normal throttling procedures.
  • This is ONLY applicable to append and update type datasets.
  • For use cases with large numbers of files, the recommended data load limit for Redshift is 90-100.
  • When GlueFileProcessing is enabled, data validation steps will be skipped for datasets targeting LakeFormation and S3 Athena.


Important Information: No action is required to enable/disable data load throttling. The system automatically adjusts based on the number of files being processed. Users can manually enable/disable if needed.

If throttling is enabled manually by any user, it will not turn off automatically.

Throttling Automatic Process

  • If the number of files being processed is within the throttle limit, the system disables throttling.
  • If it exceeds the limit, the system enables throttling and processes files through a queue.
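
A minimal sketch of this decision logic, assuming hypothetical names for the in-flight file count and configured limit (these are illustrative, not Amorphic internals):

def auto_throttle(files_in_flight: int, throttle_limit: int, manually_enabled: bool) -> bool:
    # Decide whether throttling should be on for the next processing cycle.
    if manually_enabled:
        # A throttle enabled manually by a user never turns off automatically.
        return True
    # Otherwise the system toggles throttling based on the current load.
    return files_in_flight > throttle_limit

print(auto_throttle(files_in_flight=450, throttle_limit=300, manually_enabled=False))  # True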

Setting Dataload Limits

Dataload limits let users set batch limits for processing uploaded files. Different limits apply to each target location (S3, S3Athena, Lakeformation, Dynamodb, AuroraMySQL).

Example: If users upload 1000 files to an S3 dataset with a 300 file limit, up to 300 files will process in parallel, with remaining files queued. The system polls the queue every 3 minutes to trigger processing according to the specified limits.
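
As a rough sketch of that queue-draining behavior (function and variable names are illustrative, not Amorphic internals):

import time

FILE_LIMIT = 300     # per-target-location dataload limit
POLL_INTERVAL = 180  # the queue is polled every 3 minutes

def trigger_processing(batch):
    print(f"processing {len(batch)} files in parallel")

def drain_queue(queued_files):
    # Process at most FILE_LIMIT files per poll cycle until the queue is empty.
    while queued_files:
        batch, queued_files = queued_files[:FILE_LIMIT], queued_files[FILE_LIMIT:]
        trigger_processing(batch)
        if queued_files:
            time.sleep(POLL_INTERVAL)  # shorten this when experimenting

# 1000 uploaded files with a 300-file limit: four cycles of 300 + 300 + 300 + 100.
drain_queue([f"file-{i}" for i in range(1000)])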

The ranges for each target location are calculated based on AWS limits and performance tests.

Users can view counts of recent dataload executions and messages waiting in SQS queues on the Infra Management page.

Updating Limits

Users can update data load limits for all applicable target locations using "Set Limits":

  1. Enter new values in the respective target location fields
  2. Click "Update Limits" to apply changes

Note:

  • Minimum and maximum values are displayed in helper tooltips.
  • If updated limits don't appear immediately after successful update, refresh after a few seconds. This may be due to delays in the AWS SSM (Parameter Store) service.

Service Limits

The Service Limits feature in Amorphic lets users view and monitor AWS service usage within the AWS account where Amorphic is deployed. Users can set alarm thresholds that notify subscribed users when service usage breaches the configured threshold percentages.

  • Service usage datapoints are collected on-the-fly when users enter the Service Limits page
  • Statistics are dynamically generated
  • Each service displays an icon indicating its status:
    • Green icon: service is well within the set threshold
    • Yellow flag icon: service is approaching threshold
    • Red alarm icon: service has breached threshold
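
A sketch of how such a status could be derived from a usage percentage and the alarm threshold; the width of the yellow "approaching" band is an assumption, since the documentation does not specify its cutoff:

def service_status(usage_pct: float, threshold_pct: float, warn_margin: float = 10.0) -> str:
    # warn_margin is an assumed width for the "approaching threshold" band.
    if usage_pct >= threshold_pct:
        return "red"     # threshold breached
    if usage_pct >= threshold_pct - warn_margin:
        return "yellow"  # approaching threshold
    return "green"       # well within threshold

print(service_status(usage_pct=75.0, threshold_pct=80.0))  # yellow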

Supported Services

Currently, the following services are supported:

  • DMS Subnet Groups
  • Dynamodb Tables
  • Event Rules
  • IAM Roles
  • Lambda Concurrency
  • Notebook instances
  • S3 Buckets
  • SES Sending quota
  • SSM Parameters
  • Tenant Limits

Note: Some services may take longer to generate usage statistics (e.g., SSM Parameters).


Alarm Threshold Management

Users can update the alarm threshold percentage. This threshold:

  • Is applied at the application level
  • Applies to all services
  • Triggers email alerts to subscribed users when exceeded

Email alerts include details of:

  • Services that exceeded limits
  • Actions required to address the issue

Important Information: The Service Limits feature works on top of AWS Service Quotas. If account-level quota increases are required, users should request them via AWS Service Quotas and NOT by raising a support case with AWS.

System Datasets

The Amorphic application provides users with seven system datasets. These datasets are available in a domain with the postfix "system"; the prefix depends on the shortname and environment selected for the application. Users can run SQL queries to analyze dataset contents and maintain a record of query executions.
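
For example, one of these datasets can be queried with Athena through boto3, as in the minimal sketch below. The region, database name, and results bucket are assumptions; the actual database carries your deployment's shortname/environment prefix:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

query = athena.start_query_execution(
    QueryString="SELECT * FROM sys_api_gateway_logs LIMIT 10",
    QueryExecutionContext={"Database": "myappdevsystem"},               # assumed name
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)

# Poll until the query finishes, then print the result rows.
qid = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])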


Available System Datasets

sys_dynamodb_logs This dataset contains information about calls made to DynamoDB tables, such as the table name accessed, the function name used, the Lambda triggered, the time, and the API (if any) used to make the call.

Log delivery is facilitated by SQS; depending on the size of the input and output payloads to DynamoDB, some calls may not be logged due to SQS message size limitations.

sys_api_gateway_logs This dataset contains information about API calls made through the amorphic application. It includes details such as the API resource path used, as well as user information like usernames and email addresses.

sys_cloudfront_logs This dataset contains information about API calls made through CloudFront associated with the Amorphic application.

sys_cloudtrail_logs This dataset contains information about the CloudTrail logs associated with the Amorphic application, including event-related information.

sys_s3_buckets_logs This dataset contains information about the S3 calls made through the Amorphic application, including the S3 endpoint, requester, and bucket owner information.

sys_cost_analysis_table This system dataset stores cost allocation information. It is used to track costs allocated for each tag key and value pair for each service and resource in the Amorphic application.

The dataset contains the following columns:

  • bill_billing_period_end_date
  • bill_billing_period_start_date
  • line_item_blended_cost
  • line_item_blended_rate
  • line_item_currency_code
  • line_item_net_unblended_cost
  • line_item_net_unblended_rate
  • line_item_product_code
  • line_item_resource_id
  • line_item_unblended_cost
  • line_item_unblended_rate
  • line_item_usage_amount
  • line_item_usage_end_date
  • line_item_usage_start_date
  • pricing_currency
  • pricing_public_on_demand_cost
  • pricing_public_on_demand_rate
  • product
  • product_name
  • product_servicename
  • product_region_code
  • resource_tags

For more information about these columns, refer to the AWS Cost and Usage Reports Documentation.

Note

The sys_cost_analysis_table dataset is only available when the cost feature is enabled for your Amorphic application.

sys_observability_logs This dataset contains information for all actions performed on the Amorphic application. It includes details about when actions occurred, resource details, who triggered the operation, and other operation-specific information.

Following are the resources for which observability logs are tracked:

  • Assets
  • Connection appflows
  • Connections
  • Connections apps
  • Cost management
  • Dashboards
  • Data classification
  • Data quality checks
  • Data sources
  • Datasets
  • Datasource sync job
  • Deep search index
  • Deep search index sync jobs
  • Domains
  • Glossary
  • Hcls health imaging
  • Hcls store jobs
  • Hcls workflows
  • Insights
  • Jobs
  • Models
  • NL2SQL training document
  • Notebook lcc
  • Notebooks
  • Parameters
  • Playground
  • Roles
  • Schedules
  • Stream consumers
  • Stream data transformations
  • Streams
  • Studios
  • System config
  • Tenants
  • Views
  • Workflows

Keeping Data Up to Date

Amorphic provides users with the option to keep the underlying data up to date using the Update Partitions option, found under the System Datasets tab on the Infra Management page.


Note

The Update Partitions option is enabled only for the sys_api_gateway_logs, sys_dynamodb_logs, and sys_observability_logs datasets.
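
The UI option handles the refresh for you. For reference, one common way to refresh partition metadata for a Hive-partitioned Athena table manually looks roughly like the sketch below; whether Update Partitions works exactly this way internally is an assumption, and the database and output names are placeholders:

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

# Re-scan the table's S3 location and register any new partitions.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE sys_observability_logs",
    QueryExecutionContext={"Database": "myappdevsystem"},               # assumed name
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)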

Backend Jobs

With Backend Jobs, users can monitor the progress of all ongoing backend tasks within the Amorphic system, including their execution rate and other details.


Job Management Features

Enabling/Disabling Backend Jobs

A selected set of backend jobs can be toggled on/off. These include:

  1. Data Profiling Job
  2. Resource Sync
  3. Update CloudWatch Logs Retention Policy

Running Backend Jobs on Demand

With release version 2.5, users can run certain backend jobs on demand:

  1. Alert High Costing Resource
  2. Auto Terminate Resources
  3. Backup Cloudwatch Logs to S3
  4. Backup Observability Logs to S3
  5. Resource Sync
  6. Data Profiling Job
  7. Workflows time based Event

Special Configuration Requirements

  • For Backup Cloudwatch Logs to S3 and Backup Observability Logs to S3: users need to specify a date range; logs from that period are then backfilled to S3.

  • For Data Profiling Job: users need to select the dataset IDs to profile. A maximum of 10 datasets can be profiled in a single request; see the sketch below for splitting larger batches.
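
A minimal sketch of splitting a larger list of dataset IDs into requests of at most 10, using a hypothetical helper that stands in for the actual profiling request:

MAX_DATASETS_PER_REQUEST = 10

def run_data_profiling_job(dataset_ids):
    # Hypothetical stand-in for the actual backend job request.
    print(f"profiling {len(dataset_ids)} datasets")

def submit_profiling(dataset_ids):
    # Submit one profiling request per chunk of at most 10 dataset IDs.
    for i in range(0, len(dataset_ids), MAX_DATASETS_PER_REQUEST):
        run_data_profiling_job(dataset_ids[i:i + MAX_DATASETS_PER_REQUEST])

submit_profiling([f"ds-{n}" for n in range(23)])  # three requests: 10 + 10 + 3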

Data Profiling Cost Guardrails

Users can further provide the following parameters to tune the profiling job, either per execution or at the job level:

  • Number Of Workers (DPU capacity) - Currently configurable between 2 and 10
  • Timeout - Currently configurable up to a maximum of 8 hours

Note: API Gateway logs can be backfilled up to a maximum of 30 days back, while the backfill period for Observability logs is 10 days.
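
A sketch of validating a requested backfill window against these limits, with hypothetical names:

from datetime import date, timedelta

BACKFILL_LIMIT_DAYS = {"api_gateway": 30, "observability": 10}

def validate_backfill(log_type, start, end):
    # Reject windows that reach further back than the allowed period.
    limit = BACKFILL_LIMIT_DAYS[log_type]
    if start < date.today() - timedelta(days=limit):
        raise ValueError(f"{log_type} logs can only be backfilled {limit} days back")
    if end < start:
        raise ValueError("end date must not precede start date")

validate_backfill("observability", date.today() - timedelta(days=7), date.today())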