Dataload Limits
How data load limits behave is primarily determined by the Data load throttling setting:
- If enabled: dataset files are processed one by one, with completion time depending on the number of files.
- If disabled: dataset files are processed and completed immediately.
Glue File Processing (API Only)
The system supports an additional API parameter GlueFileProcessing that can be used in conjunction with DataLoadThrottling to enable faster processing using Glue jobs. This is an API-only feature that can be configured with the following parameters:
Path: /datasets/dataload-throttling
Method: PUT
{
    "GlueFileProcessing": "disable" or "enable",
    "DataLoadThrottling": "disable" or "enable"
}
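A minimal sketch of this call is shown below, assuming a requests-based Python client; the API host and authorization header are placeholders for the values of your Amorphic deployment.

import requests

AMORPHIC_API = "https://<your-amorphic-api-host>"   # placeholder host
HEADERS = {
    "Authorization": "<id-token>",                   # placeholder auth token
    "Content-Type": "application/json",
}

# Enable throttling together with Glue file processing.
payload = {
    "GlueFileProcessing": "enable",
    "DataLoadThrottling": "enable",
}

response = requests.put(f"{AMORPHIC_API}/datasets/dataload-throttling",
                        json=payload, headers=HEADERS, timeout=30)
response.raise_for_status()
print(response.json())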
Note:
- Files uploaded directly from the Amorphic UI dataset details page are processed immediately, even if throttling is enabled.
- Files uploaded through jobs, ingestion, etc., still follow normal throttling procedures.
- This is ONLY applicable for append and update type datasets.
- For use cases with a large number of files, the recommended data load limit for Redshift is 90-100.
- When GlueFileProcessing is enabled, data validation steps will be skipped for datasets targeting LakeFormation and S3 Athena.
Important Information: No action is required to enable/disable data load throttling. The system automatically adjusts based on the number of files being processed. Users can manually enable/disable if needed.
If throttling is enabled manually by any user, it will not turn off automatically.
Throttling Automatic Process
- If the number of files being processed is within the throttle limit, the system disables throttling.
- If it exceeds the limit, the system enables throttling and processes the files through a queue (sketched below).
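A minimal sketch of this decision, assuming the throttle limit is the data load limit configured for the dataset's target location; the function and variable names are illustrative only, not part of the Amorphic API.

def throttling_enabled(processing_files: int, throttle_limit: int) -> bool:
    """Illustrative only: mirrors the automatic throttling decision."""
    if processing_files <= throttle_limit:
        return False   # within the limit: throttling stays disabled
    return True        # over the limit: throttle and queue the remaining files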
Setting Dataload Limits
Dataload limits let users set batch limits for processing uploaded files. Different limits apply to each target location (S3, S3Athena, Lakeformation, Dynamodb, AuroraMySQL).
Example: If users upload 1000 files to an S3 dataset with a 300 file limit, up to 300 files will process in parallel, with remaining files queued. The system polls the queue every 3 minutes to trigger processing according to the specified limits.
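The sketch below mirrors that example (1000 uploaded files, an S3 target with a 300-file limit, a 3-minute polling cycle); it is illustrative only and not Amorphic code.

from collections import deque

files = deque(f"file_{i}.csv" for i in range(1000))  # uploaded files
LIMIT = 300                                          # data load limit for the S3 target
POLL_INTERVAL_MINUTES = 3                            # queue polling interval

minute = 0
while files:
    batch = [files.popleft() for _ in range(min(LIMIT, len(files)))]
    print(f"t={minute:>2} min: processing {len(batch)} files, {len(files)} queued")
    minute += POLL_INTERVAL_MINUTES
# Four polls in total: 300, 300, 300, then the remaining 100 files.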
The ranges for each target location are calculated based on AWS limits and performance tests.
Users can view counts of recent dataload executions and messages waiting in SQS queues on the Infra Management page.
Updating Limits
Users can update data load limits for all applicable target locations using "Set Limits":
- Enter new values in the respective target location fields
- Click "Update Limits" to apply changes
Note:
- Minimum and maximum values are displayed in helper tooltips.
- If updated limits don't appear immediately after successful update, refresh after a few seconds. This may be due to delays in the AWS SSM (Parameter Store) service.
Service Limits
The Service Limits feature in Amorphic lets users view and monitor AWS service usage within the AWS account where Amorphic is deployed. Users can set alarm thresholds so that subscribed users are notified when service usage breaches the configured threshold percentage.
- Service usage datapoints are collected on-the-fly when users enter the Service Limits page
- Statistics are dynamically generated
- Each service displays an icon indicating its status (see the sketch after this list):
- Green icon: service is well within the set threshold
- Yellow flag icon: service is approaching threshold
- Red alarm icon: service has breached threshold
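A sketch of how usage could map to these icons is shown below; the alarm threshold comes from the application-level setting described later, while the "approaching" margin used here is an assumption for illustration.

def service_status(usage: int, quota: int, alarm_threshold_pct: float) -> str:
    """Illustrative mapping from service usage to the status icon."""
    usage_pct = 100.0 * usage / quota
    if usage_pct >= alarm_threshold_pct:
        return "red"     # threshold breached: alarm icon, alert email sent
    if usage_pct >= 0.8 * alarm_threshold_pct:   # assumed "approaching" margin
        return "yellow"  # approaching the threshold: flag icon
    return "green"       # well within the threshold

print(service_status(usage=450, quota=1000, alarm_threshold_pct=80))  # -> green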
Supported Services
Currently, the following services are supported:
- DMS Subnet Groups
- Dynamodb Tables
- Event Rules
- IAM Roles
- Lambda Concurrency
- Notebook instances
- S3 Buckets
- SES Sending quota
- SSM Parameters
- Tenant Limits
Note: Some services may take longer to generate usage statistics (e.g., SSM Parameters).
Alarm Threshold Management
Users can update the alarm threshold percentage. This threshold:
- Is applied at the application level
- Applies to all services
- Triggers email alerts to subscribed users when exceeded
Email alerts include details of:
- Services that exceeded limits
- Actions required to address the issue
Important Information: Service limits feature works on top of AWS Service Quotas. In the event that account-level quota increases are required, users should request quota increases via AWS Service Quotas and NOT by raising a support case to AWS.
System Datasets
The Amorphic application provides users with seven system datasets. These datasets are available in a domain with the postfix system; the domain prefix depends on the shortname and environment selected for the application. Users can run SQL queries to analyze dataset contents and maintain a record of query executions.
Available System Datasets
sys_dynamodb_logs This dataset contains information about calls made to DynamoDB tables, such as the table name accessed, the function name used, the Lambda triggered, the time, and the API (if any) used to make the call.
Log delivery is facilitated by SQS; depending on the size of the input and output payloads to DynamoDB, some calls may not be logged due to SQS message size limitations.
sys_api_gateway_logs This dataset contains information about API calls made through the amorphic application. It includes details such as the API resource path used, as well as user information like usernames and email addresses.
sys_cloudfront_logs This dataset contains information about API calls made through CloudFront associated with the Amorphic application.
sys_cloudtrail_logs This dataset contains information about the CloudTrail logs associated with the Amorphic application, including event-related information.
sys_s3_buckets_logs This dataset contains information about the S3 calls made through the Amorphic application, including the S3 endpoint, requester, and bucket owner information.
sys_cost_analysis_table This system dataset stores cost allocation information. It is used to track costs allocated for each tag key and value pair for each service and resource in the Amorphic application.
The dataset contains the following columns:
bill_billing_period_end_date
bill_billing_period_start_date
line_item_blended_cost
line_item_blended_rate
line_item_currency_code
line_item_net_unblended_cost
line_item_net_unblended_rate
line_item_product_code
line_item_resource_id
line_item_unblended_cost
line_item_unblended_rate
line_item_usage_amount
line_item_usage_end_date
line_item_usage_start_date
pricing_currency
pricing_public_on_demand_cost
pricing_public_on_demand_rate
product
product_name
product_servicename
product_region_code
resource_tags
For more information about these columns, refer to the AWS Cost and Usage Reports Documentation.
The sys_cost_analysis_table dataset is only available when the cost feature is enabled for your Amorphic application.
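As an example of the kind of analysis this dataset supports, the sketch below aggregates unblended cost by product. The database name and query-results location are placeholders, and submitting the query through boto3's Athena client is an assumption; the same SQL can also be run from the application's query interface.

import boto3

QUERY = """
SELECT product_name,
       SUM(CAST(line_item_unblended_cost AS double)) AS total_cost
FROM sys_cost_analysis_table
GROUP BY product_name
ORDER BY total_cost DESC
LIMIT 10
"""

athena = boto3.client("athena")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "<system-domain-database>"},          # placeholder
    ResultConfiguration={"OutputLocation": "s3://<query-results-bucket>/"},  # placeholder
)
print(execution["QueryExecutionId"])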
sys_observability_logs This dataset contains information for all actions performed on the Amorphic application. It includes details about when actions occurred, resource details, who triggered the operation, and other operation-specific information.
Following are the resources for which observability logs are tracked:
- Assets
- Connection appflows
- Connections
- Connections apps
- Cost management
- Dashboards
- Data classification
- Data quality checks
- Data sources
- Datasets
- Datasource sync job
- Deep search index
- Deep search index sync jobs
- Domains
- Glossary
- Hcls health imaging
- Hcls store jobs
- Hcls workflows
- Insights
- Jobs
- Models
- NL2SQL training document
- Notebook lcc
- Notebooks
- Parameters
- Playground
- Roles
- Schedules
- Stream consumers
- Stream data transformations
- Streams
- Studios
- System config
- Tenants
- Views
- Workflows
Keeping Data Up to Date
Amorphic provides users with the option to keep the underlying data up to date using the Update Partitions option, which can be found under the System Datasets tab on the Infra Management page.
The Update Partitions option is enabled only for the sys_api_gateway_logs, sys_dynamodb_logs, and sys_observability_logs datasets.
Backend Jobs
With Backend jobs, users can monitor the progress of all ongoing backend tasks within the Amorphic system. This includes their execution rate and any additional information.
Job Management Features
Enabling/Disabling Backend Jobs
A selected set of backend jobs can be toggled on/off. These include:
- Data Profiling Job
- Resource Sync
- Update CloudWatch Logs Retention Policy
Running Backend Jobs on Demand
With release version 2.5, users can now run certain backend jobs on request:
- Alert High Costing Resource
- Auto Terminate Resources
- Backup Cloudwatch Logs to S3
- Backup Observability Logs to S3
- Resource Sync
- Data Profiling Job
- Workflows time based Event
Special Configuration Requirements
- For Backup Cloudwatch Logs to S3 and Backup Observability Logs to S3: Users need to specify a date range; the logs from that period will be backfilled to S3.
- For Data Profiling Job: Users need to select the dataset IDs to run the profiling job on. Data profiling jobs for a maximum of 10 datasets can be triggered in a single request.
Users can further provide the following parameters to tune the profiling job, either per execution or at the job level (see the illustrative request below):
- Number Of Workers (DPU capacity) - currently configurable between 2 and 10
- Timeout - currently configurable up to a maximum of 8 hours
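For illustration, a minimal sketch of what an on-demand profiling request could look like is shown below, assuming a requests-based client. The endpoint path and payload keys (DatasetIds, NumberOfWorkers, Timeout) are hypothetical and only meant to show the dataset-ID cap and the two tunable parameters; consult the API reference of your Amorphic version for the actual schema.

import requests

AMORPHIC_API = "https://<your-amorphic-api-host>"   # placeholder host
HEADERS = {"Authorization": "<id-token>", "Content-Type": "application/json"}  # placeholder auth

# Hypothetical payload: key names are illustrative only.
payload = {
    "DatasetIds": ["dataset-id-1", "dataset-id-2"],  # at most 10 dataset IDs per request
    "NumberOfWorkers": 4,                            # DPU capacity, between 2 and 10
    "Timeout": 4,                                    # hours; documented maximum is 8 hours
}

# Hypothetical endpoint path for running the Data Profiling Job on demand.
response = requests.put(f"{AMORPHIC_API}/jobs/data-profiling/run",
                        json=payload, headers=HEADERS, timeout=30)
print(response.status_code)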
Note: Users can backfill API Gateway logs up to a maximum of 30 days old; for Observability logs, the backfill period is 10 days.