Version: v3.0

Dataload Limits

Data load limits are governed primarily by the Data Load Throttling setting.

  • If enabled: dataset files are processed one by one, with completion time depending on the number of files.
  • If disabled: dataset files are processed and completed immediately.

Glue File Processing (API Only)

The system supports an additional API parameter, GlueFileProcessing, which can be used in conjunction with DataLoadThrottling to enable faster processing using Glue jobs. This is an API-only feature, configured with the following request:

Path: /datasets/dataload-throttling
Method: PUT

{
  "GlueFileProcessing": "enable" or "disable",
  "DataLoadThrottling": "enable" or "disable"
}
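
For illustration, here is a minimal sketch of calling this endpoint with Python's requests library. The base URL, token, and header values are placeholders for your deployment, not part of the documented API:

import requests

# Placeholder values: substitute your Amorphic API endpoint and credentials.
BASE_URL = "https://<your-amorphic-api>"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

payload = {
    "GlueFileProcessing": "enable",  # faster processing via Glue jobs
    "DataLoadThrottling": "enable",  # used in conjunction with Glue file processing
}

response = requests.put(f"{BASE_URL}/datasets/dataload-throttling",
                        json=payload, headers=HEADERS, timeout=30)
response.raise_for_status()
print(response.json())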

Note:

  • Files uploaded directly from the Amorphic UI dataset details page are processed immediately, even if throttling is enabled.
  • Files uploaded through jobs, ingestion, etc., still follow normal throttling procedures.
  • This is ONLY applicable to append and update type datasets.
  • For use cases with large numbers of files, the recommended data load limit for Redshift is 90-100.
  • When GlueFileProcessing is enabled, data validation steps will be skipped for datasets targeting LakeFormation and S3 Athena.


Important Information: No action is required to enable/disable data load throttling. The system automatically adjusts based on the number of files being processed. Users can manually enable/disable if needed.

If throttling is enabled manually by any user, it will not turn off automatically.

Throttling Automatic Process

  • If the number of files being processed is within the throttle limit, the system disables throttling.
  • If it exceeds the limit, the system enables throttling and processes files through a queue.
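
A minimal sketch of this decision logic, assuming hypothetical names for the in-flight file count and configured limit (these are illustrative, not Amorphic internals):

def auto_throttle(files_in_flight: int, throttle_limit: int, manually_enabled: bool) -> bool:
    # Decide whether throttling should be on for the next processing cycle.
    if manually_enabled:
        # A throttle enabled manually by a user never turns off automatically.
        return True
    # Otherwise the system toggles throttling based on the current load.
    return files_in_flight > throttle_limit

print(auto_throttle(files_in_flight=450, throttle_limit=300, manually_enabled=False))  # True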

Setting Dataload Limits

Dataload limits let users set batch limits for processing uploaded files. Different limits apply to each target location (S3, S3Athena, Lakeformation, Dynamodb, AuroraMySQL).

Example: If users upload 1000 files to an S3 dataset with a 300 file limit, up to 300 files will process in parallel, with remaining files queued. The system polls the queue every 3 minutes to trigger processing according to the specified limits.
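
As a rough sketch of that queue-draining behavior (function and variable names are illustrative, not Amorphic internals):

import time

FILE_LIMIT = 300     # per-target-location dataload limit
POLL_INTERVAL = 180  # the queue is polled every 3 minutes

def trigger_processing(batch):
    print(f"processing {len(batch)} files in parallel")

def drain_queue(queued_files):
    # Process at most FILE_LIMIT files per poll cycle until the queue is empty.
    while queued_files:
        batch, queued_files = queued_files[:FILE_LIMIT], queued_files[FILE_LIMIT:]
        trigger_processing(batch)
        if queued_files:
            time.sleep(POLL_INTERVAL)  # shorten this when experimenting

# 1000 uploaded files with a 300-file limit: four cycles of 300 + 300 + 300 + 100.
drain_queue([f"file-{i}" for i in range(1000)])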

The ranges for each target location are calculated based on AWS limits and performance tests.

Users can view counts of recent dataload executions and messages waiting in SQS queues on the Infra Management page.

Updating Limits

Users can update data load limits for all applicable target locations using "Set Limits":

  1. Enter new values in the respective target location fields
  2. Click "Update Limits" to apply changes

Note:

  • Minimum and maximum values are displayed in helper tooltips.
  • If updated limits don't appear immediately after successful update, refresh after a few seconds. This may be due to delays in the AWS SSM (Parameter Store) service.

Service Limits

The Service Limits feature in Amorphic lets users view and monitor AWS service usage within the AWS account where Amorphic is deployed. Users can set alarm thresholds that notify subscribed users when service usage breaches the configured threshold percentages.

  • Service usage datapoints are collected on-the-fly when users enter the Service Limits page
  • Statistics are dynamically generated
  • Each service displays an icon indicating its status:
    • Green icon: service is well within the set threshold
    • Yellow flag icon: service is approaching threshold
    • Red alarm icon: service has breached threshold
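
A sketch of how such a status could be derived from a usage percentage and the alarm threshold; the width of the yellow "approaching" band is an assumption, since the documentation does not specify its cutoff:

def service_status(usage_pct: float, threshold_pct: float, warn_margin: float = 10.0) -> str:
    # warn_margin is an assumed width for the "approaching threshold" band.
    if usage_pct >= threshold_pct:
        return "red"     # threshold breached
    if usage_pct >= threshold_pct - warn_margin:
        return "yellow"  # approaching threshold
    return "green"       # well within threshold

print(service_status(usage_pct=75.0, threshold_pct=80.0))  # yellow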

Supported Services

Currently, the following services are supported:

  • DMS Subnet Groups
  • Dynamodb Tables
  • Event Rules
  • IAM Roles
  • Lambda Concurrency
  • Notebook instances
  • S3 Buckets
  • SES Sending quota
  • SSM Parameters
  • Tenant Limits

Note: Some services may take longer to generate usage statistics (e.g., SSM Parameters).


Alarm Threshold Management

Users can update the alarm threshold percentage. This threshold:

  • Is applied at the application level
  • Applies to all services
  • Triggers email alerts to subscribed users when exceeded

Email alerts include details of:

  • Services that exceeded limits
  • Actions required to address the issue

Important Information: The Service Limits feature works on top of AWS Service Quotas. If account-level quota increases are required, users should request them via AWS Service Quotas and NOT by raising a support case with AWS.

System Datasets

The Amorphic application provides users with seven system datasets. These datasets are available in a domain with the postfix "system"; the prefix depends on the shortname and environment selected for the application. Users can run SQL queries to analyze dataset contents and maintain a record of query executions.
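
For example, one of these datasets can be queried with Athena through boto3, as in the minimal sketch below. The region, database name, and results bucket are assumptions; the actual database carries your deployment's shortname/environment prefix:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

query = athena.start_query_execution(
    QueryString="SELECT * FROM sys_api_gateway_logs LIMIT 10",
    QueryExecutionContext={"Database": "myappdevsystem"},               # assumed name
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)

# Poll until the query finishes, then print the result rows.
qid = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])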


Available System Datasets

sys_dynamodb_logs This dataset contains information about calls made to DynamoDB tables, such as the table name accessed, the function name used, the Lambda triggered, the time, and the API (if any) used to make the call.

Log delivery is facilitated by SQS; depending on the size of the input and output payloads to DynamoDB, some calls may not be logged due to SQS message size limitations.

sys_api_gateway_logs This dataset contains information about API calls made through the amorphic application. It includes details such as the API resource path used, as well as user information like usernames and email addresses.

sys_cloudfront_logs This dataset contains information about API calls made through CloudFront associated with the Amorphic application.

sys_cloudtrail_logs This dataset contains information about the CloudTrail logs associated with the Amorphic application, including event-related information.

sys_s3_buckets_logs This dataset contains information about the S3 calls made through the Amorphic application, including the S3 endpoint, requester, and bucket owner information.

sys_cost_analysis_table This system dataset stores cost allocation information. It is used to track costs allocated for each tag key and value pair for each service and resource in the Amorphic application.

The dataset contains the following columns:

  • bill_billing_period_end_date
  • bill_billing_period_start_date
  • line_item_blended_cost
  • line_item_blended_rate
  • line_item_currency_code
  • line_item_net_unblended_cost
  • line_item_net_unblended_rate
  • line_item_product_code
  • line_item_resource_id
  • line_item_unblended_cost
  • line_item_unblended_rate
  • line_item_usage_amount
  • line_item_usage_end_date
  • line_item_usage_start_date
  • pricing_currency
  • pricing_public_on_demand_cost
  • pricing_public_on_demand_rate
  • product
  • product_name
  • product_servicename
  • product_region_code
  • resource_tags

For more information about these columns, refer to the AWS Cost and Usage Reports Documentation.

Note

The sys_cost_analysis_table dataset is only available when the cost feature is enabled for your Amorphic application.

sys_observability_logs This dataset contains information for all actions performed on the Amorphic application. It includes details about when actions occurred, resource details, who triggered the operation, and other operation-specific information.

Following are the resources for which observability logs are tracked:

  • Assets
  • Connection appflows
  • Connections
  • Connections apps
  • Cost management
  • Dashboards
  • Data classification
  • Data quality checks
  • Data sources
  • Datasets
  • Datasource sync job
  • Deep search index
  • Deep search index sync jobs
  • Domains
  • Glossary
  • Hcls health imaging
  • Hcls store jobs
  • Hcls workflows
  • Insights
  • Jobs
  • Models
  • NL2SQL training document
  • Notebook lcc
  • Notebooks
  • Parameters
  • Playground
  • Roles
  • Schedules
  • Stream consumers
  • Stream data transformations
  • Streams
  • Studios
  • System config
  • Tenants
  • Views
  • Workflows

Keeping Data Up to Date

Amorphic provides users with the option to keep the underlying data up to date using the Update Partitions option, found under the System Datasets tab on the Infra Management page.


Note

The Update Partitions option is enabled only for the sys_api_gateway_logs, sys_dynamodb_logs, and sys_observability_logs datasets.
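
The UI option handles the refresh for you. For reference, one common way to refresh partition metadata for a Hive-partitioned Athena table manually looks roughly like the sketch below; whether Update Partitions works exactly this way internally is an assumption, and the database and output names are placeholders:

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

# Re-scan the table's S3 location and register any new partitions.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE sys_observability_logs",
    QueryExecutionContext={"Database": "myappdevsystem"},               # assumed name
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)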

Backend Jobs

With Backend Jobs, users can monitor the progress of all ongoing backend tasks within the Amorphic system, including their execution rate and other details.


Job Management Features

Enabling/Disabling Backend Jobs

A selected set of backend jobs can be toggled on/off. These include:

  1. Data Profiling Job
  2. Resource Sync
  3. Update CloudWatch Logs Retention Policy

Running Backend Jobs on Demand

With release version 2.5, users can run certain backend jobs on demand:

  1. Alert High Costing Resource
  2. Auto Terminate Resources
  3. Backup Cloudwatch Logs to S3
  4. Backup Observability Logs to S3
  5. Resource Sync
  6. Data Profiling Job
  7. Workflows time based Event

Special Configuration Requirements

  • For Backup Cloudwatch Logs to S3 and Backup Observability Logs to S3: users need to specify a date range; logs from that period are then backfilled to S3.

  • For Data Profiling Job: users need to select the dataset IDs to profile. A maximum of 10 datasets can be profiled in a single request; see the sketch below for splitting larger batches.
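
A minimal sketch of splitting a larger list of dataset IDs into requests of at most 10, using a hypothetical helper that stands in for the actual profiling request:

MAX_DATASETS_PER_REQUEST = 10

def run_data_profiling_job(dataset_ids):
    # Hypothetical stand-in for the actual backend job request.
    print(f"profiling {len(dataset_ids)} datasets")

def submit_profiling(dataset_ids):
    # Submit one profiling request per chunk of at most 10 dataset IDs.
    for i in range(0, len(dataset_ids), MAX_DATASETS_PER_REQUEST):
        run_data_profiling_job(dataset_ids[i:i + MAX_DATASETS_PER_REQUEST])

submit_profiling([f"ds-{n}" for n in range(23)])  # three requests: 10 + 10 + 3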

Data Profiling Cost Guardrails

Users can further provide the following parameters to tune the profiling job, either per execution or at the job level:

  • Number Of Workers (DPU capacity) - Currently configurable between 2 and 10
  • Timeout - Currently configurable up to a maximum of 8 hours

Note: API Gateway logs can be backfilled up to a maximum of 30 days back, while the backfill period for Observability logs is 10 days.
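
A sketch of validating a requested backfill window against these limits, with hypothetical names:

from datetime import date, timedelta

BACKFILL_LIMIT_DAYS = {"api_gateway": 30, "observability": 10}

def validate_backfill(log_type, start, end):
    # Reject windows that reach further back than the allowed period.
    limit = BACKFILL_LIMIT_DAYS[log_type]
    if start < date.today() - timedelta(days=limit):
        raise ValueError(f"{log_type} logs can only be backfilled {limit} days back")
    if end < start:
        raise ValueError("end date must not precede start date")

validate_backfill("observability", date.today() - timedelta(days=7), date.today())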