Amazon S3 transfers
The BigQuery Data Transfer Service for Amazon S3 allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.
Before you begin
Before you create an Amazon S3 transfer:
- Verify that you have completed all actions required to enable the BigQuery Data Transfer Service.
- Create a BigQuery dataset to store your data.
- Create the destination table for your transfer and specify the schema definition. The destination table must follow the table naming rules. Destination table names also support parameters (see the example after this list).
- Retrieve your Amazon S3 URI, your access key ID, and your secret access key. For information on managing your access keys, see the AWS documentation.
- If you intend to set up transfer run notifications for Pub/Sub, you must have pubsub.topics.setIamPolicy permissions. Pub/Sub permissions are not required if you only set up email notifications. For more information, see BigQuery Data Transfer Service run notifications.
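For example, destination table names can include runtime parameters so that each transfer run writes to its own table. A minimal sketch, assuming the runtime parameter syntax described in the destination table parameters documentation (the table name here is a placeholder):

MyTable_{run_date}

For a transfer run on April 12, 2022, this name would resolve to MyTable_20220412.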
Limitations
Amazon S3 transfers are subject to the following limitations:
- Currently, the bucket portion of the Amazon S3 URI cannot be parameterized.
- Transfers from Amazon S3 are always triggered with the WRITE_APPEND preference, which appends data to the destination table. See configuration.load.writeDisposition in the load job configuration for additional details.
- Depending on the format of your Amazon S3 source data, there may be additional limitations. For more information, see:
- CSV limitations
- JSON limitations
- Limitations on nested and repeated data
- The minimum interval time between recurring transfers is 24 hours. The default interval for a recurring transfer is 24 hours.
Required permissions
Before creating an Amazon S3 transfer:
- Ensure that the person creating the transfer has the following required permissions in BigQuery:
  - bigquery.transfers.update permissions to create the transfer
  - Both bigquery.datasets.get and bigquery.datasets.update permissions on the target dataset

  The bigquery.admin predefined IAM role includes bigquery.transfers.update, bigquery.datasets.update, and bigquery.datasets.get permissions. For more information on IAM roles in BigQuery Data Transfer Service, see Access control reference.

- Consult the documentation for Amazon S3 to ensure you have configured any permissions necessary to enable the transfer. At a minimum, the Amazon S3 source data must have the AWS managed policy AmazonS3ReadOnlyAccess applied to it.
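The following shell sketch shows one way to set up these prerequisites from the command line; the project ID, user email, and AWS IAM user name are placeholders, and it assumes you have the gcloud and AWS CLIs installed and authenticated:

# Grant the BigQuery Admin role, which includes the required
# bigquery.transfers.update and bigquery.datasets.* permissions.
gcloud projects add-iam-policy-binding my-project \
    --member="user:transfer-admin@example.com" \
    --role="roles/bigquery.admin"

# Attach the AWS managed AmazonS3ReadOnlyAccess policy to the IAM user
# whose access key the transfer configuration will use.
aws iam attach-user-policy \
    --user-name bq-transfer-user \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess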
Setting up an Amazon S3 data transfer
To create an Amazon S3 data transfer:
Console
-
Go to the BigQuery page in the Cloud Console.
Go to the BigQuery page
-
Click Transfers.
-
Click Create a Transfer.
-
On the Create Transfer page:
-
In the Source type section, for Source, choose Amazon S3.
-
In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
-
In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.
-
For Repeats, choose an option for how often to run the transfer. Options include:
- Daily (default)
- Weekly
- Monthly
- Custom
- On-demand
If you choose an option other than Daily, additional options are available. For example, if you choose Weekly, an option appears for you to select the day of the week.
-
For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.
-
-
In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.
-
In the Data source details section:
- For Destination table, enter the name of the table you created to store the data in BigQuery. Destination table names support parameters.
- For Amazon S3 URI, enter the URI in the following format: s3://mybucket/myfolder/... URIs also support parameters.
- For Access key ID, enter your access key ID.
- For Secret access key, enter your secret access key.
-
For File format, choose your data format: newline delimited JSON, CSV, Avro, Parquet, or ORC.
-
In the Transfer options - all formats section:
- For Number of errors allowed, enter an integer value for the maximum number of bad records that can be ignored.
- (Optional) For Decimal target types, enter a comma-separated list of possible SQL data types that the source decimal values could be converted to. Which SQL data type is selected for conversion depends on the following conditions:
- The data type selected for conversion will be the first data type in the following list that supports the precision and scale of the source data, in this order: NUMERIC, BIGNUMERIC, and STRING. For example, if you specify "NUMERIC,BIGNUMERIC" and a source column has precision 40, which NUMERIC cannot represent, the column is converted to BIGNUMERIC.
- If none of the listed data types will support the precision and the scale, the data type supporting the widest range in the specified list is selected. If a value exceeds the supported range when reading the source data, an error will be thrown.
- The data type STRING supports all precision and scale values.
- If this field is left empty, the data type will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
- This field cannot contain duplicate data types.
- The order of the data types that you list in this field is ignored.
-
If you chose CSV or JSON as your file format, in the JSON, CSV section, check Ignore unknown values to accept rows that contain values that do not match the schema. Unknown values are ignored. For CSV files, this option ignores extra values at the end of a line.
-
If you chose CSV as your file format, in the CSV section enter any additional CSV options for loading data.
-
(Optional) In the Notification options section:
- Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
- For Select a Pub/Sub topic, choose your topic name or click Create a topic to create one (a command-line alternative is sketched after these steps). This option configures Pub/Sub run notifications for your transfer.
-
-
Click Save.
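If you prefer to create the notification topic from the command line instead of in the console, a minimal sketch with the gcloud CLI (the topic name is a placeholder):

# Create the Pub/Sub topic that will receive transfer run notifications.
gcloud pubsub topics create transfer-run-notifications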
bq
Enter the bq mk command and supply the transfer creation flag, --transfer_config.
bq mk \
--transfer_config \
--project_id=project_id \
--data_source=data_source \
--display_name=name \
--target_dataset=dataset \
--params='parameters'
Where:
- project_id: Optional. Your Google Cloud project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
- data_source: Required. The data source: amazon_s3.
- display_name: Required. The display name for the transfer configuration. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
- dataset: Required. The target dataset for the transfer configuration.
- parameters: Required. The parameters for the created transfer configuration in JSON format. For example: --params='{"param":"param_value"}'. The following are the parameters for an Amazon S3 transfer:
  - destination_table_name_template: Required. The name of your destination table.
  - data_path: Required. The Amazon S3 URI, in the following format: s3://mybucket/myfolder/... URIs also support parameters.
  - access_key_id: Required. Your access key ID.
  - secret_access_key: Required. Your secret access key.
  - file_format: Optional. Indicates the type of files you wish to transfer: CSV, JSON, AVRO, PARQUET, or ORC. The default value is CSV.
  - max_bad_records: Optional. The number of allowed bad records. The default is 0.
  - decimal_target_types: Optional. A comma-separated list of possible SQL data types that the source decimal values could be converted to. If this field is not provided, the data type will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
  - ignore_unknown_values: Optional, and ignored if file_format is not JSON or CSV. Whether to ignore unknown values in your data.
  - field_delimiter: Optional, and applies only when file_format is CSV. The character that separates fields. The default value is a comma.
  - skip_leading_rows: Optional, and applies only when file_format is CSV. Indicates the number of header rows you don't want to import. The default value is 0.
  - allow_quoted_newlines: Optional, and applies only when file_format is CSV. Indicates whether to allow newlines within quoted fields.
  - allow_jagged_rows: Optional, and applies only when file_format is CSV. Indicates whether to accept rows that are missing trailing optional columns. The missing values will be filled in with NULLs.
For example, the following command creates an Amazon S3 transfer named My Transfer using a data_path_template value of s3://mybucket/myfile/*.csv, target dataset mydataset, and file_format CSV. This example includes non-default values for the optional params associated with the CSV file_format. The transfer is created in the default project:
bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"s3://mybucket/myfile/*.csv", "destination_table_name_template":"MyTable", "file_format":"CSV", "max_bad_records":"1", "ignore_unknown_values":"true", "field_delimiter":"|", "skip_leading_rows":"1", "allow_quoted_newlines":"true", "allow_jagged_rows":"false", "delete_source_files":"true"}' \
--data_source=amazon_s3
After running the command, you receive a message like the following:
[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.
Follow the instructions and paste the authentication code on the command line.
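To confirm that the configuration was created, you can list your transfer configurations. A minimal sketch, assuming your transfers run in the us location (adjust --transfer_location to match your dataset's location):

# List all transfer configurations in the us location for the default project.
bq ls --transfer_config --transfer_location=us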
API
Use the projects.locations.transferConfigs.create method and supply an instance of the TransferConfig resource.
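For illustration, a minimal sketch of calling this method with curl. The project ID, location, dataset, bucket, and credential values are placeholders, the request assumes your gcloud credentials are sufficient to authorize the call, and the JSON field names follow the TransferConfig resource reference:

# Create an Amazon S3 transfer configuration through the REST API.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://bigquerydatatransfer.googleapis.com/v1/projects/my-project/locations/us/transferConfigs" \
  -d '{
    "displayName": "My Transfer",
    "dataSourceId": "amazon_s3",
    "destinationDatasetId": "mydataset",
    "params": {
      "destination_table_name_template": "MyTable",
      "data_path": "s3://mybucket/myfolder/*.csv",
      "access_key_id": "AWS_ACCESS_KEY_ID",
      "secret_access_key": "AWS_SECRET_ACCESS_KEY",
      "file_format": "CSV"
    }
  }'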
Querying your data
When your data is transferred to BigQuery, the data is written to ingestion-time partitioned tables. For more information, see Introduction to partitioned tables.
If you query your tables directly instead of using the auto-generated views, you must use the _PARTITIONTIME pseudo-column in your query. For more information, see Querying partitioned tables.
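For example, a minimal sketch of querying a single day's partition with the bq command-line tool; the dataset and table names are the placeholders used earlier on this page:

# Query only the partition ingested on a specific date.
bq query --use_legacy_sql=false \
'SELECT COUNT(*) AS row_count
 FROM mydataset.MyTable
 WHERE _PARTITIONTIME = TIMESTAMP("2022-04-12")'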
Impact of prefix matching versus wildcard matching
The Amazon S3 API supports prefix matching, but not wildcard matching. All Amazon S3 files that match a prefix will be transferred into Google Cloud. However, only those that match the Amazon S3 URI in the transfer configuration will actually get loaded into BigQuery. This could result in excess Amazon S3 egress costs for files that are transferred but not loaded into BigQuery.
As an example, consider this data path:
s3://bucket/folder/*/subfolder/*.csv
Along with these files in the source location:
s3://bucket/folder/any/subfolder/file1.csv
s3://bucket/folder/file2.csv
This will result in all Amazon S3 files with the prefix s3://bucket/folder/ being transferred to Google Cloud. In this example, both file1.csv and file2.csv will be transferred.
However, only files matching s3://bucket/folder/*/subfolder/*.csv will actually load into BigQuery. In this case, only file1.csv will be loaded into BigQuery.
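To estimate how many files share the prefix and would therefore be transferred, you can list the prefix with the AWS CLI before creating the transfer. A minimal sketch, using the bucket and prefix from the example above as placeholders:

# List everything under the prefix that the transfer service would fetch.
aws s3 ls s3://bucket/folder/ --recursive --summarize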
Troubleshooting
The following provides information about common errors and the recommended resolution.
Amazon S3 PERMISSION_DENIED errors
Error | Recommended action |
---|---|
The AWS Access Key Id you provided does not exist in our records. | Ensure the access key exists and the ID is correct. |
The request signature we calculated does not match the signature you provided. Check your key and signing method. | Ensure that the transfer configuration has the correct corresponding Secret Access Key. |
Failed to obtain the location of the source S3 bucket. Additional details: Access Denied Failed to obtain the location of the source S3 bucket. Additional details: HTTP/1.1 403 Forbidden S3 error message: Access Denied | Ensure the AWS IAM user has permission to list the source Amazon S3 bucket, get its location, and read the objects in it. |
Server unable to initialize object upload.; InvalidObjectState: The operation is not valid for the object's storage class Failed to obtain the location of the source S3 bucket. Additional details: All access to this object has been disabled | Restore any objects that are archived to Amazon Glacier. Objects in Amazon S3 that are archived to Amazon Glacier are not accessible until they are restored. |
All access to this object has been disabled | Confirm that the Amazon S3 URI in the transfer configuration is correct. |
Amazon S3 transfer limit errors
Error | Recommended action |
---|---|
Number of files in transfer exceeds limit of 10000. | Evaluate if the number of wildcards in the Amazon S3 URI can be reduced to just one. If this is possible, retry with a new transfer configuration, as the maximum number of files per transfer run will be higher. Evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data. |
Size of files in transfer exceeds limit of 16492674416640 bytes. | Evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data. |
General issues
Error | Recommended action |
---|---|
Files are transferred from Amazon S3 but not loaded into BigQuery. The transfer logs may look similar to this: Moving data from Amazon S3 to Google Cloud complete: Moved <NNN> object(s). | Confirm that the Amazon S3 URI in the transfer configuration is correct. If the transfer configuration was meant to load all files with a common prefix, ensure that the Amazon S3 URI ends with a wildcard. |
Other problems | See Troubleshooting transfer configurations. |
What's next
- For an introduction to Amazon S3 transfers, see Overview of Amazon S3 transfers
- For an overview of BigQuery Data Transfer Service, see Introduction to BigQuery Data Transfer Service.
- For information on using transfers including getting information about a transfer configuration, listing transfer configurations, and viewing a transfer's run history, see Working with transfers.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-04-12 UTC.
Source: https://cloud.google.com/bigquery-transfer/docs/s3-transfer